Factored bilingual n-gram language models for statistical machine translation

Machine Translation, Volume 24, pages 159-175, 2010
Josep M. Crego (1), François Yvon (2)
(1) LIMSI-CNRS, Orsay Cedex, France
(2) LIMSI/CNRS and Université Paris-Sud, Orsay Cedex, France

Abstract

We present an extension of n-gram-based translation models built on factored language models (FLMs). In the n-gram-based approach to statistical machine translation (SMT), translation units are mappings between sequences of raw words, and translation model probabilities are estimated by applying standard language modeling to these bilingual units. As in other translation modeling approaches (phrase-based or hierarchical), the sparseness of the modeled units leads to unreliable probability estimates, even when large bilingual corpora are available. To tackle this problem, we extend the n-gram-based approach to SMT by tightly integrating more general word representations, such as lemmas and morphological classes, and we use the flexible FLM framework to apply a number of different back-off techniques. We show that FLMs can also be successfully applied to translation modeling, yielding more robust probability estimates that integrate larger bilingual contexts during the translation process.
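
To make the back-off idea concrete, here is a minimal Python sketch. It is not the authors' implementation: it assumes a toy bigram model over bilingual units, a single generalized factor (lemmas), and a simple "stupid back-off" style score with a fixed discount `alpha`, which is not a normalized probability. The data, the `src|||tgt` unit encoding, and the names `train` and `backoff_score` are all hypothetical.

```python
from collections import Counter

# Each bilingual unit is a pair (surface_form, lemma_form); both strings
# encode a source/target mapping as "src|||tgt". Toy, hypothetical data.
TRAINING = [
    [("la|||the", "le|||the"), ("maison|||house", "maison|||house")],
    [("les|||the", "le|||the"), ("maisons|||houses", "maison|||house")],
]

def train(sequences):
    """Collect bigram and history counts at the surface and lemma levels."""
    counts = {"surface": (Counter(), Counter()), "lemma": (Counter(), Counter())}
    for seq in sequences:
        padded = [("<s>", "<s>")] + seq
        for prev, cur in zip(padded, padded[1:]):
            for level, idx in (("surface", 0), ("lemma", 1)):
                bigrams, histories = counts[level]
                bigrams[(prev[idx], cur[idx])] += 1
                histories[prev[idx]] += 1
    return counts

def backoff_score(counts, prev, cur, alpha=0.4):
    """Score `cur` given `prev`: use the surface-level bigram estimate when
    the bigram was observed, otherwise a discounted lemma-level estimate.
    This is an unnormalized score (an assumption made for brevity)."""
    bigrams, histories = counts["surface"]
    if bigrams[(prev[0], cur[0])] > 0:
        return bigrams[(prev[0], cur[0])] / histories[prev[0]]
    bigrams, histories = counts["lemma"]
    if bigrams[(prev[1], cur[1])] > 0:
        return alpha * bigrams[(prev[1], cur[1])] / histories[prev[1]]
    return 0.0

counts = train(TRAINING)
seen = backoff_score(counts, TRAINING[0][0], TRAINING[0][1])
# "la|||the" followed by "maisons|||houses" never occurs at the surface
# level, but the same pair of lemma-level units occurs twice.
unseen = backoff_score(counts, TRAINING[0][0], TRAINING[1][1])
print(seen, unseen)
```

The second query shows the intended effect: a unit sequence never observed on raw words still receives a non-zero score because its lemma-level counterpart was observed. A real FLM would combine several factors and back-off paths with proper smoothing.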
