Simultaneous translation of lectures and speeches

Machine Translation - Volume 21 - Pages 209-252 - 2008
Christian Fügen1,2, Alex Waibel1,3,2, Muntsin Kolss1
1International Center for Advanced Communication Technologies (InterACT), Fakultät für Informatik, Universität Karlsruhe (TH), Karlsruhe, Germany
2Mobile Technologies LLC, Pittsburgh, USA
3International Center for Advanced Communication Technologies (InterACT), School of Computer Science, Carnegie Mellon University, Pittsburgh, USA

Abstract

With increasing globalization, communication across language and cultural boundaries is becoming an essential requirement for doing business, delivering education, and providing public services. Because human translation services are costly, only a small fraction of text documents, and an even smaller share of spoken encounters such as international meetings and conferences, are translated; most communication either relies on a common language (e.g. English) or does not take place at all. Technology could offer a revolutionary way forward if simultaneous, domain-independent speech translation can be realized. In this paper, we present a simultaneous speech translation system based on statistical recognition and translation technology. We discuss the technology and various system improvements, and propose mechanisms for user-friendly delivery of the results. Through extensive evaluation of the individual components and of the end-to-end system, and comparison with human translation performance, we conclude that machines can already deliver comprehensible simultaneous translation output. Moreover, while machine performance is affected by recognition errors (and can therefore be improved), human performance is limited by the cognitive challenge of performing the task in real time.

Keywords

#simultaneous translation #translation technology #machine translation #international conferences
