Symbolic-to-statistical hybridization: extending generation-heavy machine translation

Machine Translation - Volume 23 - Pages 23-63 - 2009
Nizar Habash1, Bonnie Dorr2, Christof Monz3
1Center for Computational Learning Systems, Columbia University, New York, USA
2Institute for Advanced Computer Studies, University of Maryland, College Park, USA
3Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands

Abstract

The last few years have witnessed an increasing interest in hybridizing surface-based statistical approaches and rule-based symbolic approaches to machine translation (MT). Much of that work is focused on extending statistical MT systems with symbolic knowledge and components. In the brand of hybridization discussed here, we go in the opposite direction: adding statistical bilingual components to a symbolic system. Our base system is Generation-heavy machine translation (GHMT), a primarily symbolic asymmetrical approach that addresses the issue of Interlingual MT resource poverty in source-poor/target-rich language pairs by exploiting symbolic and statistical target-language resources. GHMT’s statistical components are limited to target-language models, which arguably makes it a simple form of a hybrid system. We extend the hybrid nature of GHMT by adding statistical bilingual components. We also describe the details of retargeting it to Arabic–English MT. The morphological richness of Arabic brings several challenges to the hybridization task. We conduct an extensive evaluation of multiple system variants. Our evaluation shows that this new variant of GHMT—a primarily symbolic system extended with monolingual and bilingual statistical components—has a higher degree of grammaticality than a phrase-based statistical MT system, where grammaticality is measured in terms of correct verb-argument realization and long-distance dependency translation.
