Most existing techniques for combining multiple alignment tables can combine only two alignment tables at a time, and are based on heuristics (Och and Ney, 2003), (Koehn et al., 2003).
In this paper, we propose a novel mathematical formulation for combining an arbitrary number of alignment tables using their power mean.
The method frames the combination task as an optimization problem, and finds the optimal alignment lying between the intersection and union of multiple alignment tables by optimizing the parameter $p$: the affinely extended real number defining the order of the power mean function.
The combination approach produces better alignment tables in terms of both F-measure and BLEU scores.

Machine Translation (MT) systems are trained on bi-text parallel corpora.
One of the first steps involved in training an MT system is obtaining alignments between words of the source and target languages.
This is typically done using some form of the Expectation Maximization (EM) algorithm (Brown et al., 1993), (Och and Ney, 2003), (Vogel et al., 1996).
These unsupervised algorithms provide alignment links between the English words $e_i$ and the foreign words $f_j$ for a given e-f sentence pair.
The alignment pairs are then used to extract phrase tables (Koehn et al., 2003), hierarchical rules (Chiang, 2005), or tree-to-string mappings (Yamada and Knight, 2001).
Thus, the accuracy of these alignment links has a significant impact on overall MT accuracy.

One of the commonly used techniques to improve alignment accuracy is combining the alignment tables obtained for the source-to-target (e2f) and target-to-source (f2e) directions (Och and Ney, 2003).
This combining technique involves obtaining two sets of alignment tables $A_1$ and $A_2$ for the same sentence pair e-f, and producing a new set based on the union $A_\cup = A_1 \cup A_2$, the intersection $A_\cap = A_1 \cap A_2$, or some optimal combination $A_o$ that is a subset of $A_1 \cup A_2$ but a superset of $A_1 \cap A_2$.
How to find this optimal $A_o$ is a key question.
$A_\cap$ has high precision but low recall, producing fewer alignments, whereas $A_\cup$ has high recall but low precision.

Most existing methods for alignment combination (symmetrization) rely on heuristics to identify reliable links (Och and Ney, 2003), (Koehn et al., 2003).
The method proposed in (Och and Ney, 2003), for example, interpolates the intersection and union of two asymmetric alignment tables by adding links that are adjacent to intersection links and connect at least one previously unaligned word.
Another example is the method in (Koehn et al., 2003), which adds links to the intersection of two alignment tables that are the diagonal neighbors of existing links, optionally requiring that any added links connect two previously unaligned words.

Other methods try to combine the tables during alignment training.
In (Liang et al., 2006), asymmetric models are jointly trained to maximize the similarity of their alignments, by optimizing an EM-like objective function based on agreement heuristics.
In (Ayan et al., 2004), the authors present a technique for combining alignments based on various linguistic resources such as parts of speech, dependency parses, or bilingual dictionaries, and use machine learning techniques to do alignment combination.
One of the main disadvantages of the method of (Ayan et al., 2004), however, is that it is a supervised learning method, and so requires human-annotated data.
Recently, (Xiang et al., 2010) proposed a method that can handle multiple alignments with soft links, which are defined by confidence scores of alignment links.
(Matusov et al., 2004), on the other hand, frame symmetrization as finding a set with minimal cost using a graph-based algorithm where costs are associated with local alignment probabilities.

In summary, most existing alignment combination methods try to find an optimal alignment set $A_o$ that lies between $A_\cap$ and $A_\cup$ using heuristics.
The main problems with methods based on heuristics are:

1. they may not generalize well across language pairs
2. they typically do not have any parameters to optimize
3. most methods can combine only 2 alignments at a time
4. most approaches are ad hoc and are not mathematically well defined

In this paper we address these issues by proposing a novel mathematical formulation for combining an arbitrary number of alignment tables.
The method frames the combination task as an optimization problem, and finds the optimal alignment lying between the intersection and union of multiple alignment tables by optimizing the parameter $p$ of the power mean function.

Given an English-foreign sentence pair $(e_1^I, f_1^J)$, the alignment problem is to determine the presence or absence of alignment links $a_{ij}$ between the words $e_i$ and $f_j$, where $i \in \{1, \ldots, I\}$ and $j \in \{1, \ldots, J\}$.
In this paper we will use the convention that $a_{ij} = 1$ when words $e_i$ and $f_j$ are linked, and $a_{ij} = 0$ otherwise.
Let us define the alignment tables we obtain for the two translation directions as $A_1$ and $A_2$, respectively.
The union of these two alignment tables, $A_\cup$, contains all of the links in $A_1$ and $A_2$, and the intersection, $A_\cap$, contains only the common links.
Definitions 1 and 2 below define $A_\cup$ and $A_\cap$ more formally.
Our goal is to find an alignment set $A_o$ with $|A_\cap| \le |A_o| \le |A_\cup|$ that maximizes some objective function.
We now describe the power mean (PM) and show how the PM can represent both the union and intersection of alignment tables using the same formula.

The power mean: The power mean is defined by equation 1 below, where $p$ is a real number in $(-\infty, \infty)$ and each $a_k$ is a positive real number.

$$S_p(a_1, a_2, \ldots, a_n) = \left( \frac{1}{n} \sum_{k=1}^{n} a_k^p \right)^{\frac{1}{p}} \qquad (1)$$

The power mean, also known as the generalized mean, has several interesting properties that are relevant to our alignment combination problem.
In particular, the power mean is equivalent to the geometric mean $G$ when $p \to 0$, as shown in equation 2 below:

$$G(a_1, a_2, \ldots, a_n) = \left( \prod_{i=1}^{n} a_i \right)^{\frac{1}{n}} = \lim_{p \to 0} \left( \frac{1}{n} \sum_{k=1}^{n} a_k^p \right)^{\frac{1}{p}} \qquad (2)$$

The power mean, furthermore, is equivalent to the maximum function $M$ when $p \to \infty$:

$$M(a_1, a_2, \ldots, a_n) = \max(a_1, a_2, \ldots, a_n) = \lim_{p \to \infty} \left( \frac{1}{n} \sum_{k=1}^{n} a_k^p \right)^{\frac{1}{p}} \qquad (3)$$

Importantly, the PM $S_p$ is a non-decreasing function of $p$.
This means that $S_p$ is lower-bounded by $G$ and upper-bounded by $M$ for $p \in [0, \infty]$:

$$G \le S_p \le M, \qquad p \in (0, \infty). \qquad (4)$$

The inequalities are strict unless all of the $a_k$ are equal.

Figure 1: The power mean is a principled way to interpolate between the extremes of union and intersection when combining multiple alignment tables.
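
The behaviour described by equations 1 to 4 can be checked numerically. The following is a minimal sketch, not code from the paper: the function name `power_mean` and the example values are illustrative assumptions, and the cases $p = 0$ and $p = \pm\infty$ are implemented as the corresponding limits.

```python
import math

def power_mean(values, p):
    """Power mean S_p of non-negative numbers (equation 1).

    The orders p = 0 and p = +/-inf are handled as limits:
    geometric mean, maximum and minimum respectively.
    """
    n = len(values)
    if p == math.inf:
        return max(values)
    if p == -math.inf:
        return min(values)
    if p == 0:
        # Limit p -> 0: geometric mean (zero as soon as one value is zero).
        if any(v == 0 for v in values):
            return 0.0
        return math.exp(sum(math.log(v) for v in values) / n)
    if p < 0 and any(v == 0 for v in values):
        # For negative orders a zero entry drives the mean to zero.
        return 0.0
    return (sum(v ** p for v in values) / n) ** (1.0 / p)

# Illustrative values only: S_p grows from the geometric mean to the maximum.
example = [1.0, 4.0]
for order in [0, 0.5, 1, 2, 10, math.inf]:
    print(order, power_mean(example, order))
```

For the two example values, the printed sequence is non-decreasing in $p$, starting at the geometric mean and ending at the maximum, which is the property used in the rest of the formulation.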

The key insight underpinning our mathematical formulation of the alignment combination problem is that the geometric mean of multiple alignment tables is equivalent to their intersection, while the maximum of multiple alignment tables is equivalent to their union.
Let $A_q$ be an alignment with elements $a^q_{ij}$ such that $a^q_{ij} = 1$ if words $e_i$ and $f_j$ are linked, and $a^q_{ij} = 0$ otherwise.
The union and intersection of a set of $n$ alignment tables can then be formally defined as follows:

Definition 1: The union of alignments $A_1, A_2, \ldots, A_n$ is a set $A_\cup$ with $a^\cup_{ij} = 1$ if $a^q_{ij} = 1$ for any $q \in \{1, 2, \ldots, n\}$.

Definition 2: The intersection of alignments $A_1, A_2, \ldots, A_n$ is a set $A_\cap$ with $a^\cap_{ij} = 1$ if $a^q_{ij} = 1$ for all $q \in \{1, 2, \ldots, n\}$.

Figure 1 depicts a simple example of the alignment combination problem for the common case of alignment symmetrization.
Two alignment tables, $A_{e \to f}$ and $A_{f \to e}$ (one-to-many alignments), need to be combined.
The result of taking the union $A_\cup$ and intersection $A_\cap$ of the tables is shown.
$A_\cup$ can be computed by taking the element-wise maximum of $A_{e \to f}$ and $A_{f \to e}$, which in turn is equal to the power mean $A_p$ of the elements of these tables in the limit as $p \to \infty$.
The intersection of the two tables, $A_\cap$, can similarly be computed by taking the geometric mean of the elements of $A_{e \to f}$ and $A_{f \to e}$, which is equal to the power mean $A_p$ of the elements of these tables in the limit as $p \to 0$.
For $p \in (0, \infty)$, equation 4 implies that $A_p$ has elements with values between those of $A_\cap$ and $A_\cup$.
We now provide formal proofs for these results when combining an arbitrary number of alignment tables.

3.1 The intersection of alignment tables $A_1, \ldots, A_n$ is equivalent to their element-wise geometric mean $G(A_1, A_2, \ldots, A_n)$, as defined in (2).

Proof: Let $A_\cap$ be the intersection of all $A_q$ where $q \in \{1, 2, \ldots, n\}$.
As per our definition of the intersection $\cap$ between alignment tables, $A_\cap$ contains links where $a^q_{ij} = 1\ \forall q$.
Let $A_g$ be the set that contains the elements of $G(A_1, A_2, \ldots, A_n)$.
Then $a^g_{ij}$ is the geometric mean of the elements $a^q_{ij}$ where $q \in \{1, 2, \ldots, n\}$, as defined in equation 2, that is, $a^g_{ij} = \left( \prod_{q=1}^{n} a^q_{ij} \right)^{\frac{1}{n}}$.
This product is equal to 1 iff $a^q_{ij} = 1\ \forall q$ and zero otherwise, since $a^q_{ij} \in \{0, 1\}\ \forall q$.
Hence $A_g = A_\cap$.
Q.E.D.

3.2 The union of alignment tables $A_1, \ldots, A_n$ is equivalent to their element-wise maximum $M(A_1, A_2, \ldots, A_n)$, as defined in (3).

Proof: Let $A_\cup$ be the union of all $A_q$ for $q \in \{1, 2, \ldots, n\}$.
As per our definition of the union between alignments, $A_\cup$ has links where $a^q_{ij} = 1$ for some $q$.
Let $A_m$ be the set that contains the elements of $M(A_1, A_2, \ldots, A_n)$.
Let $a^m_{ij}$ be the maximum of the elements $a^q_{ij}$ where $q \in \{1, 2, \ldots, n\}$, as defined in equation (3).
The max function is equal to 1 iff $a^q_{ij} = 1$ for some $q$ and zero otherwise, since $a^q_{ij} \in \{0, 1\}\ \forall q$.
Hence $A_m = A_\cup$.
Q.E.D.

3.3 The element-wise power mean $S_p(A_1, A_2, \ldots, A_n)$ of alignment tables $A_1, \ldots, A_n$ has entries that are lower-bounded by the intersection of these tables, and upper-bounded by their union, for $p \in [0, \infty]$.

Proof: We have already shown that the union and intersection of a set of alignment tables are equivalent to the maximum and geometric mean of these tables, respectively.
Therefore, given that the result in equation 4 is true (we will not prove it here), the relation holds.
In this sense, the power mean can be used to interpolate between the intersection and union of multiple alignment tables.
Q.E.D.
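
The three results above can be illustrated on a toy symmetrization example. The sketch below is an assumption-laden illustration rather than the authors' implementation: the helper names `power_mean` and `combine` and the two 3x3 tables are made up for the example, and only orders $p \ge 0$ are handled.

```python
import math

def power_mean(values, p):
    """Power mean for p >= 0; p = 0 and p = inf are taken as limits."""
    if p == math.inf:
        return max(values)
    if p == 0:
        if min(values) == 0:
            return 0.0
        return math.exp(sum(math.log(v) for v in values) / len(values))
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def combine(tables, p):
    """Element-wise power mean of equally sized 0/1 alignment tables."""
    rows, cols = len(tables[0]), len(tables[0][0])
    return [[power_mean([t[i][j] for t in tables], p) for j in range(cols)]
            for i in range(rows)]

# Hypothetical alignment tables for the two translation directions.
a_e2f = [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
a_f2e = [[1, 0, 0], [0, 1, 1], [0, 0, 1]]

intersection = combine([a_e2f, a_f2e], 0)      # geometric mean (3.1)
union = combine([a_e2f, a_f2e], math.inf)      # maximum (3.2)
soft = combine([a_e2f, a_f2e], 2)              # values in between (3.3)
```

Entries on which both tables agree are unchanged for every order, while entries present in only one table take a value strictly between 0 and 1 for finite $p > 0$; these are the links whose fate the choice of $p$ decides.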

We evaluate the proposed method using an English-Pashto translation task, as defined by the DARPA TransTac program.
The training data for this task consists of slightly more than 100K parallel sentences.
The TransTac task was designed to evaluate speech-to-speech translation systems, so all training sentences are conversational in nature.
The sentence length of these utterances varies greatly, ranging from a single word to more than 50 words.
2026 sentences were randomly sampled from this training data to prepare a held-out development set.
The held-out TransTac test set consists of 1019 parallel sentences.

| Method | F-measure |
|--------|-----------|
| I      | 0.5979    |
| H      | 0.6891    |
| GDF    | 0.6712    |
| PM     | 0.6984    |
| PMn    | 0.7276    |
| U      | 0.6589    |

Table 1: F-measure Based on Various Alignment Combination Methods

We have shown in the previous sections that the union and intersection of alignments can be mathematically formulated using the power mean.
Since both combination operations can be represented with the same mathematical expression, we can search the combination space "between" the intersection and union of alignment tables by optimizing $p$ w.r.t. any chosen objective function.
In these experiments, we define the optimal alignment as the one that maximizes the objective function $f(\{a_{ijt}\}, \{\hat{a}_{ijt}\}, p)$, where $f$ is the standard F-measure, $\{\hat{a}_{ijt}\}$ is the set of all estimated alignment entries on some dataset, $\{a_{ijt}\}$ is the set of all corresponding human-annotated alignment entries, and $p$ is the order of the power mean function.
Instead of attempting to optimize the F-measure using heuristics, we can now optimize it by finding the appropriate power order $p$ using any suitable numerical optimization algorithm.
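
To make the optimization view concrete, the sketch below scores candidate orders $p$ by F-measure against gold links. It rests on several assumptions that are not taken from the paper: links are binarized with a fixed threshold of 0.5, a coarse grid of orders is scanned in place of a numerical optimizer, and the three link sets and the gold set are invented toy data. All function names are hypothetical.

```python
import math

def power_mean(values, p):
    """Power mean for p >= 0; p = 0 and p = inf are taken as limits."""
    if p == math.inf:
        return max(values)
    if p == 0:
        if min(values) == 0:
            return 0.0
        return math.exp(sum(math.log(v) for v in values) / len(values))
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def f_measure(predicted, gold):
    """Balanced F-measure between two sets of (i, j) links."""
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

def combine_links(link_sets, p, threshold=0.5):
    """Keep a link when the power mean of its 0/1 indicators reaches the threshold."""
    candidates = set().union(*link_sets)
    return {
        link for link in candidates
        if power_mean([1.0 if link in s else 0.0 for s in link_sets], p) >= threshold
    }

def best_order(link_sets, gold, orders):
    """Return (best F-measure, order) over a fixed grid of candidate orders."""
    return max((f_measure(combine_links(link_sets, p), gold), p) for p in orders)

# Toy data: three hypothetical alignments of one sentence pair and its gold links.
a1 = {(0, 0), (1, 1), (2, 2), (3, 2)}
a2 = {(0, 0), (1, 1), (2, 2), (1, 2)}
a3 = {(0, 0), (1, 1), (3, 3)}
gold = {(0, 0), (1, 1), (2, 2), (3, 3)}

best_f, best_p = best_order([a1, a2, a3], gold, [0, 0.5, 1, 2, 4, math.inf])
```

On this toy input the small orders reproduce the intersection, the large orders reproduce the union, and an intermediate order keeps the links proposed by at least two of the three tables, which scores highest. With real data the objective is evaluated over a whole hand-aligned set rather than a single sentence pair.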
In our experiments we used the general simplex algorithm of amoeba search (Nelder and Mead, 1965), which attempts to find the optimal set of parameters by evolving a simplex of evaluated points in the direction in which the F-measure is increasing.

In order to test our alignment combination formulation empirically, we performed experiments on the English-Pashto language pair with the data described in Section 4.
We first trained two sets of alignments, the e2f and f2e directions, based on the GIZA++ (Och and Ney, 2003) algorithm.
We then combined these alignments by performing intersection (I) and union (U).
We obtained an F-measure of 0.5979 for intersection (I) and 0.6589 for union (U).
For intersection the F-measure is lower, presumably because many alignments are not shared by the input alignment tables, so the number of links is underestimated.
We then also reproduced the two commonly used combination heuristic methods that are based on growing the alignment diagonally (GDF) (Koehn et al., 2003), and adding links based on refined heuristics (H) (Och and Ney, 2003), respectively.
We obtained an F-measure of 0.6891 for H, and 0.6712 for GDF, as shown in Table 1.

We then used our power mean formulation for combination to maximize the F-measure function with the aforementioned simplex algorithm for tuning the power parameter $p$, where the F-measure is computed with respect to the hand-aligned development data, which contains 150 sentences.
This hand-aligned development set is different from the development set used for training MT models.
While doing so we also optimized table weights $W_q \in (0, 1)$, $\sum_q W_q = 1$, which were applied to the alignment tables before combining them using the PM.
The $W_q$ allow the algorithm to weight the two directions differently.
We found that the F-measure function had many local optima, so the simplex algorithm was initialized at several values of $p$ and $\{W_q\}$ to find the globally optimal F-measure.
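
The text does not spell out exactly how the weights $W_q$ enter the combination. One possible reading, sketched below purely as an assumption, is the standard weighted power mean $\left( \sum_q W_q\, a_q^p \right)^{1/p}$, which reduces to equation 1 when all weights equal $1/n$. The function name `weighted_power_mean` and the example weights are hypothetical and may differ from the formulation actually used in the experiments.

```python
import math

def weighted_power_mean(values, weights, p):
    """Weighted power mean (sum_q w_q * a_q**p) ** (1/p) for p >= 0.

    Assumption: weights are positive and sum to 1. The limits p = 0 and
    p = inf are the weighted geometric mean and the maximum.
    """
    if p < 0:
        raise ValueError("this sketch only covers p >= 0")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    pairs = [(v, w) for v, w in zip(values, weights) if w > 0]
    if p == math.inf:
        return max(v for v, _ in pairs)
    if p == 0:
        if any(v == 0 for v, _ in pairs):
            return 0.0
        return math.exp(sum(w * math.log(v) for v, w in pairs))
    return sum(w * v ** p for v, w in pairs) ** (1.0 / p)

# A link proposed only by the first table, with that table trusted more.
favoured = weighted_power_mean([1.0, 0.0], [0.7, 0.3], 1)
# The same situation with the link proposed only by the second table.
disfavoured = weighted_power_mean([0.0, 1.0], [0.7, 0.3], 1)
```

Under this reading a link found only in the more heavily weighted direction receives a higher combined value than a link found only in the other direction, so the two directions are no longer treated symmetrically.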
After obtaining power mean outputs for the alignment entries, they need to be converted into binary-valued alignment links, that is, $S_p(a^1_{ij}, a^2_{ij}, \ldots, a^n_{ij})$ needs to be converted into a binary table.
There are many ways to do this conversion, such as simple thresholding or keeping the best N% of the links.
In our experiments we used the following simple selection method, which appears to perform better than thresholding.
First we sorted links by PM value and then added the links from the top of the sorted list such that $e_i$ and $f_j$ are linked if $e_{i-1}$ and $e_{i+1}$ are connected to $f_j$, or $f_{j-1}$ and $f_{j+1}$ are linked to $e_i$, or both $e_i$ and $f_j$ are not connected.
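
The selection rule can be sketched as a single greedy pass over the sorted links. The wording of the rule is ambiguous, so the code below makes an explicit assumption: a link is accepted if *either* neighbouring word already has the required link. Under the literal both-neighbours reading, a single pass that starts from an empty set could never accept anything except links between two unaligned words, because no two accepted links would ever share a word. The function name `select_links` and the scores are hypothetical.

```python
def select_links(scores):
    """Greedy binarization of power-mean scores.

    scores maps (i, j) -> power-mean value. Links are visited from the
    highest score down and accepted when
      * e_i and f_j are both still unaligned, or
      * e_(i-1) or e_(i+1) is already linked to f_j, or
      * f_(j-1) or f_(j+1) is already linked to e_i.
    The either-neighbour reading is an assumption of this sketch.
    """
    chosen = set()
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    for (i, j), value in ranked:
        if value <= 0:
            continue  # no table proposed this link
        e_free = all(a != i for a, _ in chosen)
        f_free = all(b != j for _, b in chosen)
        e_neighbour = (i - 1, j) in chosen or (i + 1, j) in chosen
        f_neighbour = (i, j - 1) in chosen or (i, j + 1) in chosen
        if (e_free and f_free) or e_neighbour or f_neighbour:
            chosen.add((i, j))
    return chosen

# Hypothetical power-mean values for one sentence pair.
toy_scores = {(0, 0): 1.0, (1, 1): 1.0, (2, 1): 0.7, (2, 2): 0.6, (0, 2): 0.3}
links = select_links(toy_scores)
```

In the toy example the two high-confidence links are accepted first, the next two are accepted because they extend an existing link, and the last one is rejected because it is isolated and its English word is already aligned.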
After tuning the power mean parameter and the alignment weights, the best parameter gave an F-measure of 0.6984, which is higher than the commonly used GDF by 2.72% and H by 0.93% absolute, respectively.
We observe in Figure 2 that even though PM has a higher F-measure than GDF, it has significantly fewer alignment links, suggesting that PM has improved precision in finding the alignment links.
The presented PM-based alignment combination can be tuned to optimize any chosen objective, so it is not surprising that we can improve upon previous results based on heuristics.

One of the main advantages of combining alignment tables using the PM is that our statements are valid for any number of input tables, whereas most heuristic approaches can only process two alignment tables at a time.
The presented power mean algorithm, in contrast, can be used to combine any number of alignments in a single step, which, importantly, makes it possible to jointly optimize all of the parameters of the combination process.

In the second set of experiments the PM approach, which we call PMn, is applied simultaneously to more than two alignments.
We obtained four more sets of alignments from the Berkeley aligner (BA) (Liang et al., 2006), the HMM aligner (HA) (Vogel et al., 1996), the alignment based on partial words (PA), and the alignment based on dependency-based reordering (DA) (Xu et al., 2009).
Alignment I was obtained by using the Berkeley aligner as an off-the-shelf alignment tool.
We built the HMM aligner based on (Vogel et al., 1996) and used it to produce Alignment II.
Producing different sets of alignments using different algorithms could be useful because some alignments that are pruned by one algorithm may be kept by another, giving us a bigger pool of possible links to choose from.
We produced Alignment III based on partial words.
Pashto is a morphologically rich language with many prefixes and suffixes.
In the absence of a morphological segmenter, it has been suggested that keeping only the first 'n' characters of a word can effectively reduce the vocabulary size and may produce better alignments.
(Chiang et al., 2009) used partial words for alignment training in English and Urdu.
We trained such alignments using GIZA++ on parallel data with partial words for the Pashto sentences.
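
The partial-word preprocessing amounts to truncating every token before alignment training. The sketch below is illustrative only: the function name `truncate_words`, the prefix length of 5, and the English placeholder tokens are assumptions, since the text does not state the value of 'n' that was used.

```python
def truncate_words(sentence, n):
    """Keep only the first n characters of every whitespace-separated token."""
    return " ".join(word[:n] for word in sentence.split())

# English placeholder tokens stand in for inflected forms of one stem.
original = "translation translated translator"
partial = truncate_words(original, 5)

vocab_before = set(original.split())
vocab_after = set(partial.split())
```

The three distinct surface forms collapse into a single partial word, which is the vocabulary reduction the approach relies on; the truncated corpus is then passed to the aligner in place of the original text.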
The fourth type of alignment we produced, Alignment IV, was motivated by (Xu et al., 2009).
(Xu et al., 2009) showed that translation between subject-verb-object (English) and subject-object-verb (Pashto) languages can be improved by reordering the source side of the parallel data.
They obtained a dependency tree of the source side and used high-level human-generated rules to reorder the source side using precedence-based movement of dependency subtrees.
The rules were particularly useful in reordering verbs that moved to the end of the sentence.
Making the ordering of the source and target sides more similar may produce better alignments for language pairs which differ in verb ordering, as many alignment algorithms penalize or fail to consider alignments that link words that differ greatly in sentence position.
A Pashto language expert was hired to produce similar precedence-based rules for the English-Pashto language pair.
Using the rules and algorithm described in (Xu et al., 2009), we reordered all of the source side and used GIZA++ to align the sentences.

Figure 2: Number of Alignment Links for Different Combination Types

The four additional alignment sets just described, together with our baseline alignment, Alignment V, were combined using the presented PMn combination algorithm, where n signifies the number of tables being combined.
As seen in Table 1, we obtained an F-measure of 0.7276, which is 12.97% absolute better than intersection and 6.87% better than union.
Furthermore, PMn, which in these experiments utilizes 5 alignments, is better than PM by 2.92% absolute.
This is an encouraging result because it shows not only that we are finding better alignments than intersection and union, but also that combining more than two alignments is useful.
We note that PMn performed 3.85% absolute better than H (Och and Ney, 2003), and 5.64% better than the GDF heuristics.
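
The absolute gains quoted for PMn follow directly from the F-measures in Table 1. The short sketch below recomputes them; the dictionary and function names are illustrative, and the values are copied from Table 1.

```python
# F-measures as reported in Table 1.
table1_f = {
    "I": 0.5979,
    "H": 0.6891,
    "GDF": 0.6712,
    "PM": 0.6984,
    "PMn": 0.7276,
    "U": 0.6589,
}

def absolute_gain(system, baseline):
    """Absolute F-measure difference in percentage points, rounded to 2 places."""
    return round(100 * (table1_f[system] - table1_f[baseline]), 2)

# Gain of PMn over every other combination method.
gains = {name: absolute_gain("PMn", name) for name in table1_f if name != "PMn"}
for name, gain in gains.items():
    print(f"PMn vs {name}: {gain:.2f} points")
```

These are differences in percentage points of F-measure (absolute gains), not relative improvements.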

In the above experiments the parameters of the power mean combination method were tuned on development data to optimize alignment F-measure, and the performance of several alignment combination techniques was compared in terms of F-measure.
However, it is not clear how correlated alignment F-measures are with BLEU scores, as explained in (Fraser and Marcu, 2007).
While there is no mathematical problem with optimizing the parameters of the presented PM-based combination algorithm w.r.t. BLEU scores, computationally it is not practical to do so because each iteration would require a complete training phase.
To further evaluate the quality of the alignment methods being compared in this paper, we built several MT models based on them and compared the resulting BLEU scores.

| E2F | Dev    | Test   |
|-----|--------|--------|
| I   | 0.1064 | 0.0941 |
| H   | 0.1028 | 0.0894 |
| GDF | 0.1256 | 0.1091 |
| PM  | 0.1214 | 0.1094 |
| PMn | 0.1378 | 0.1209 |
| U   | 0.1062 | 0.0897 |

Table 2: E2F BLEU: PM Alignment Combination Based MT Model Comparison

We built a standard phrase-based translation system (Koehn et al., 2003) that utilizes a stack-based decoder based on an A* search.
Based on the combined alignments, we extracted phrase tables with a maximum phrase length of 6 for English and 8 for Pashto, respectively.
We then trained the lexicalized reordering model that produced distortion costs based on the number of words that are skipped on the target side, in a manner similar to (Al-Onaizan and Papineni, 2006).
Our training sentences are a compilation of sentences from various domains collected by DARPA, and hence we were able to build an interpolated language model which weights the domains differently.
We built an interpolated LM for both English and Pashto, but for English we had significantly more monolingual sentences (1.4 million in total) compared to slightly more than 100K sentences for Pashto.
We tuned our MT model using minimum error rate training (Och, 2003).

| F2E | Dev    | Test   |
|-----|--------|--------|
| I   | 0.1145 | 0.1101 |
| H   | 0.1262 | 0.1193 |
| GDF | 0.1115 | 0.1204 |
| PM  | 0.1201 | 0.1155 |
| PMn | 0.1198 | 0.1196 |
| U   | 0.1111 | 0.1155 |

Table 3: F2E BLEU: PM Alignment Combination Based MT Model Comparison

We built five different MT models based on the Intersection (I), Union (U), (Koehn et al., 2003) Grow Diagonal Final (GDF), (Och and Ney, 2003) refined heuristics (H), and Power Mean (PMn) alignment sets, where n = 5.
We obtained BLEU (Papineni et al., 2002) scores for the E2F direction as shown in Table 2.
As expected, the MT model based on the I alignment has a low BLEU score of 0.1064 on the dev set and 0.0941 on the test set in the E2F direction.
Intersection has higher precision but throws away many alignments, so the overall number of alignments is too small to produce a good phrase translation table.
Similarly, the U alignment also has low scores (0.1062 and 0.0897) on the dev and test sets, respectively.
The best scores for the E2F direction on both the dev and test sets are obtained using the model based on the PMn algorithm.
We obtained BLEU scores of 0.1378 on the dev set and 0.1209 on the test set, which is better than all heuristic-based methods.
It is better by 1.22 absolute BLEU points on the dev set and 1.18 on the test set compared to the commonly used GDF (Koehn et al., 2003) heuristics.
The above BLEU scores were all computed based on 1 reference.

Note that for the E2F direction PM, which combines only 2 alignments, is not worse than any of the heuristic-based methods on the test set (0.1094, versus 0.1091 for GDF and 0.0894 for H), although GDF scores higher on the dev set.
Also note that the difference in the BLEU scores of PM and PMn is quite large, which indicates that combining more than two alignments using the power mean leads to substantial gains in performance.

| Type | PT Size (100K) |
|------|----------------|
| I    | 182.17         |
| H    | 30.73          |
| GDF  | 27.65          |
| PM   | 60.87          |
| PMn  | 25.67          |
| U    | 24.54          |

Table 4: E2F Phrase Table Size

Although we saw significant gains in the E2F direction, we unfortunately did not see similar gains in the F2E direction.
As expected, Intersection (I) produced the lowest test-set BLEU score (0.1101, with 0.1145 on the dev set), as shown in Table 3.
Our PMn algorithm obtained a BLEU score of 0.1198 on the dev set and 0.1196 on the test set, which is better by 0.83 absolute on the dev set over GDF.
On the test set, though, the performance of PMn and GDF is only slightly different, with 0.1196 for PMn and 0.1204 for GDF.
The standard deviation of the test set BLEU scores for the F2E direction is only 0.0042, which is one third of the standard deviation in the E2F direction at 0.013, signifying that the alignment seems to make less difference in the F2E direction for our models.
One possible explanation for such results is that the Pashto LM for the E2F direction is trained on a small set of sentences available from the training corpus, while the English LM for the F2E direction was trained on 1.4 million sentences.
Therefore the English LM, which is trained on significantly more data, is probably more robust to translation model errors.

| Type | PT Size (100K) |
|------|----------------|
| I    | 139.98         |
| H    | 56.76          |
| GDF  | 22.96          |
| PM   | 47.50          |
| PMn  | 21.24          |
| U    | 20.33          |

Table 5: F2E Phrase Table Size

Note that different alignments lead to different phrase table (PT) sizes (Figure 2).
The intersection (I) method has the smallest number of alignment links, and tends to produce the largest phrase tables, because there are fewer restrictions on the phrases to be extracted.
The Union (U) method, on the other hand, tends to produce the smallest number of phrases, because the phrase extraction algorithm has more constraints to satisfy.
We observe that the PT produced by intersection is significantly larger than the others, as seen in Tables 4 and 5.
The PT size produced by PMn, as shown in Table 4, is between I and U and is significantly smaller than that of the other heuristic-based methods.
It is 7.1% smaller than the GDF heuristic-based phrase table.
Similarly, in the F2E direction (Table 5) we see the same trend, where the PMn PT size is smaller than GDF by 4.2%.
The decrease in phrase table size and the increase in BLEU scores for most of the dev and test sets show that our PM-based combined alignments are helping to produce better MT models.

We have presented a mathematical formulation for combining alignment tables based on their power mean.
The presented framework allows us to find the optimal alignment between intersection and union by finding the best power mean parameter between 0 and $\infty$, which correspond to the intersection and union operations, respectively.
We evaluated the proposed method empirically by computing BLEU scores in an English-Pashto translation task and also by computing an F-measure with respect to human alignments.
We showed that the approach is more effective than intersection, union, the heuristics of (Och and Ney, 2003), and the grow diagonal final (GDF) algorithm of (Koehn et al., 2003).
We also showed that our algorithm is not limited to two tables, which makes it possible to jointly optimize the combination of multiple alignment tables to further increase performance.
In future work we would like to address two particular issues.
First, in this work we converted power mean outputs to binary alignment links by a simple selection process.
We are currently investigating ways to integrate the binary constraint into the PM-based optimization algorithm.
Second, we do not have to limit ourselves to alignment tables that are binary.
The PM-based algorithm can combine alignments that are not binary, which makes it easier to integrate other sources of information, such as the posterior probability of word translation, into the alignment combination framework.

This work is partially supported by the DARPA TRANSTAC program under contract number NBCH2030007.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.