Statistical multimodal integration for audio-visual speech processing

IEEE Transactions on Neural Networks - Volume 13, Issue 4, Pages 854-866 - 2002
S. Nakamura1
1ATR Spoken Language Translation Research Laboratories, Kyoto, Japan

Abstract

Sensory information is indispensable to living things, which must integrate multiple senses to understand their surroundings. In human communication, people further integrate the two modalities of audition and vision to understand intention. In this paper, we focus on speech-related modalities, since speech is the most important medium for transmitting human intention. Many studies to date have addressed speech communication technologies, but performance levels still leave room for improvement. For instance, although speech recognition has made remarkable progress, its performance still degrades seriously in acoustically adverse environments. On the other hand, perceptual research has demonstrated the complementary integration of audio speech and visual face movements in human perception mechanisms. Such research has stimulated attempts to apply visual face information to speech recognition and synthesis. This paper introduces work on audio-visual speech recognition, speech-to-lip-movement mapping for audio-visual speech synthesis, and audio-visual speech translation.
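The complementary integration of the audio and visual streams mentioned above is commonly realized in audio-visual HMM recognizers as a weighted combination of per-stream log-likelihoods, with exponent weights that shift reliance toward vision as acoustic conditions degrade. The following is a minimal illustrative sketch of that fusion rule; the function names, scores, and weight values are assumptions for demonstration, not taken from the paper.

```python
# Weighted-stream fusion sketch for audio-visual speech recognition.
# Assumes the audio and visual streams are conditionally independent,
# so their log-likelihoods combine linearly with weights summing to one.

def fused_score(audio_logp: float, visual_logp: float, lam: float = 0.7) -> float:
    """Return lam*audio + (1-lam)*visual; lam is the audio stream weight."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("stream weight must lie in [0, 1]")
    return lam * audio_logp + (1.0 - lam) * visual_logp

def recognize(hypotheses: dict, lam: float = 0.7) -> str:
    """Pick the word hypothesis whose fused audio-visual score is highest."""
    return max(hypotheses, key=lambda w: fused_score(*hypotheses[w], lam))

# Hypothetical (audio_logp, visual_logp) scores for two word hypotheses:
# with a high audio weight the acoustics decide; with a low audio weight
# (as in noisy conditions) the visual stream can flip the decision.
scores = {"ba": (-10.0, -14.0), "ga": (-11.0, -9.0)}
print(recognize(scores, lam=0.9))  # audio dominates: "ba"
print(recognize(scores, lam=0.2))  # vision dominates: "ga"
```

In practice, the stream weight is either fixed from held-out data or estimated from a stream confidence measure such as acoustic SNR; several of the cited works address exactly this weight-estimation problem.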

Keywords

#Speech processing #Speech synthesis #Speech recognition #Humans #Hidden Markov models #Keyboards #Mice #Man machine systems #Communications technology #Oral communication
