Statistical multimodal integration for audio-visual speech processing
Abstract
Sensory information is indispensable to living things, which must integrate multiple senses to understand their surroundings. In human communication, audition and vision are further integrated to understand intention. In this paper, we focus on speech-related modalities, since speech is the most important medium for transmitting human intention. Although there have been many studies on speech communication technologies, their performance still leaves room for improvement; for instance, despite remarkable progress, speech recognition performance still degrades seriously in acoustically adverse environments. On the other hand, perceptual research has demonstrated the complementary integration of audio speech and visual face movements in human perception mechanisms. Such findings have stimulated attempts to apply visual face information to speech recognition and synthesis. This paper introduces work on audio-visual speech recognition, speech-to-lip-movement mapping for audio-visual speech synthesis, and audio-visual speech translation.
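In stream-weighted audio-visual recognizers of the kind the abstract alludes to, the complementary integration is commonly realized as a weighted combination of per-stream log-likelihoods, with the weight shifted toward vision as acoustic conditions worsen. A minimal sketch, with hypothetical function names and illustrative scores and weights (not taken from the paper):

```python
def fuse_stream_scores(audio_logps, visual_logps, audio_weight=0.7):
    """Combine per-class log-likelihoods from the audio and visual
    streams with a stream weight in [0, 1]; this is the log-domain
    form of the exponent-weighted product used in multi-stream HMMs."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return {
        cls: audio_weight * audio_logps[cls]
        + (1.0 - audio_weight) * visual_logps[cls]
        for cls in audio_logps
    }

def classify(audio_logps, visual_logps, audio_weight=0.7):
    """Pick the class whose fused score is highest."""
    fused = fuse_stream_scores(audio_logps, visual_logps, audio_weight)
    return max(fused, key=fused.get)

# Illustrative scores: audio slightly favors "ba", vision strongly
# favors "pa". A high audio weight follows the acoustic evidence;
# a low one (e.g. in noise) lets the visual stream dominate.
audio = {"ba": -1.0, "pa": -2.0}
visual = {"ba": -5.0, "pa": -0.5}
print(classify(audio, visual, audio_weight=0.9))  # "ba"
print(classify(audio, visual, audio_weight=0.3))  # "pa"
```

Estimating the stream weight itself (e.g. from acoustic SNR or stream confidence) is the crux of the fusion problem and is treated separately in the literature this paper surveys.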
Keywords
#Speech processing #Speech synthesis #Speech recognition #Humans #Hidden Markov models #Keyboards #Mice #Man-machine systems #Communications technology #Oral communication