Audio-visual speech recognition by speechreading

Xiaozheng Zhang, R.M. Mersereau, M.A. Clements
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA

Abstract

Speechreading increases intelligibility in human speech perception, which suggests that conventional acoustic-based speech processing can benefit from the addition of visual information. This paper exploits speechreading for joint audio-visual speech recognition. We first present a color-based feature extraction algorithm that reliably extracts salient visual speech features from a frontal view of the talker in a video sequence. We then propose a new fusion strategy that uses a coupled hidden Markov model (CHMM) to incorporate the visual modality into the acoustic subsystem. By maintaining temporal coupling between the two modalities at the feature level while allowing their hidden states to be asynchronous, a CHMM can better capture temporal correlations between the two streams of information. Experimental results show that the combined audio-visual system outperforms the acoustic-only recognizer over a wide range of noise levels.
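The coupling idea can be illustrated with a minimal sketch of a scaled forward pass for a generic two-chain coupled HMM. This is not the authors' implementation: the abstract does not specify the model topology, emission densities, or training procedure, so the function name `chmm_log_likelihood`, the array layouts, and the assumption that each chain's next state depends on the previous states of both chains are illustrative choices. Per-frame emission likelihoods are assumed to be precomputed for each stream.

```python
import numpy as np


def chmm_log_likelihood(pi_a, pi_v, Ta, Tv, Ba, Bv):
    """Log-likelihood of paired audio/visual observations under a two-chain CHMM.

    Hypothetical layout (an assumption, not taken from the paper):
      pi_a: (Na,)        initial audio-state probabilities
      pi_v: (Nv,)        initial visual-state probabilities
      Ta:   (Na, Nv, Na) P(a_t | a_{t-1}, v_{t-1})
      Tv:   (Na, Nv, Nv) P(v_t | a_{t-1}, v_{t-1})
      Ba:   (T, Na)      audio emission likelihood per frame and state
      Bv:   (T, Nv)      visual emission likelihood per frame and state
    """
    n_frames = Ba.shape[0]
    # alpha[a, v] is the joint forward variable over both chains.
    alpha = np.outer(pi_a * Ba[0], pi_v * Bv[0])
    log_lik = 0.0
    for t in range(n_frames):
        if t > 0:
            # Each chain transitions conditioned on BOTH previous states;
            # this cross-dependence is what couples the two streams.
            pred = np.einsum("ij,ija,ijv->av", alpha, Ta, Tv)
            alpha = pred * np.outer(Ba[t], Bv[t])
        # Rescale every frame to avoid numerical underflow.
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha = alpha / scale
    return log_lik
```

Because the audio and visual chains keep separate state indices, the two streams can occupy different states at the same frame (asynchrony), while the shared conditioning in `Ta` and `Tv` keeps them temporally coupled. The joint state space grows as Na × Nv, which is why approximate inference is often used for larger models.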

Keywords

#Speech recognition #Hidden Markov models #Humans #Speech processing #Feature extraction #Data mining #Video sequences #Maintenance #Streaming media #Audio-visual systems
