Paper: Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

ACL ID D11-1081
Title Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2011
Authors

Although discriminative training guarantees to improve statistical machine translation by incorporating a large amount of overlapping features, it is hard to scale up to large data due to decoding complexity. We propose a new algorithm to generate translation forest of training data in linear time with the help of word alignment. Our algorithm also alleviates the oracle selection problem by ensuring that a forest always contains derivations that exactly yield the reference translation. With millions of features trained on 519K sentences in 0.03 second per sentence, our system achieves significant improvement by 0.84 BLEU over the baseline system on the NIST Chinese-English test sets.
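
To put the reported throughput in perspective, the following back-of-the-envelope sketch multiplies the two figures quoted in the abstract (519K sentences, 0.03 second per sentence). The function name `total_hours` and the `workers` parameter are hypothetical, introduced only for illustration. The abstract does not say exactly which processing steps the 0.03 second covers or how many machines were used, so the result is a rough estimate under the assumption of a constant per-sentence cost, not a figure from the paper.

```python
# Figures quoted in the abstract.
NUM_SENTENCES = 519_000        # "519K sentences"
SECONDS_PER_SENTENCE = 0.03    # "0.03 second per sentence"


def total_hours(num_sentences: int, seconds_per_sentence: float, workers: int = 1) -> float:
    """Estimate total hours for one pass over the corpus.

    Assumes a constant per-sentence cost and perfectly even splitting
    across `workers` (a hypothetical parameter, not from the paper).
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return num_sentences * seconds_per_sentence / workers / 3600.0


if __name__ == "__main__":
    # Single worker: 519,000 * 0.03 s = 15,570 s, a little over four hours.
    print(f"{total_hours(NUM_SENTENCES, SECONDS_PER_SENTENCE):.1f} hours")
```

Because the cost per sentence is constant in this estimate, doubling the corpus doubles the total time, which is consistent with the linear-time behaviour claimed for the algorithm.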