An HMM-based speech-to-video synthesizer

IEEE Transactions on Neural Networks, Volume 13, Issue 4, Pages 900-915, 2002
J.J. Williams1, A.K. Katsaggelos1
1Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL. USA

Abstract

Emerging broadband communication systems promise a future of multimedia telephony, e.g., the addition of visual information to telephone conversations. It is useful to consider the problem of generating the critical information useful for speechreading based on existing narrowband communications systems used for speech. This paper focuses on the problem of synthesizing visual articulatory movements given the acoustic speech signal. In this application, the acoustic speech signal is analyzed and the corresponding articulatory movements are synthesized for speechreading. This paper describes a hidden Markov model (HMM)-based visual speech synthesizer. The key elements in the application of HMMs to this problem are the decomposition of the overall modeling task into key stages and the judicious determination of the observation vector's components for each stage. The main contribution of this paper is a novel correlation HMM that is able to integrate independently trained acoustic and visual HMMs for speech-to-visual synthesis. This model allows increased flexibility in choosing model topologies for the acoustic and visual HMMs. Moreover, the proposed model reduces the amount of training data required compared to early-integration modeling techniques. Results from objective experiments show that the proposed approach can reduce time-alignment errors by 37.4% compared to the conventional temporal scaling method. Furthermore, subjective results indicate that the proposed model can increase speech understanding.
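To make the speech-to-visual pipeline concrete, the following is a minimal illustrative sketch, not the paper's actual correlation HMM: a Viterbi decode of an acoustic HMM produces a state alignment, and each decoded state emits its mean visual parameters to form a visual trajectory. All model values (state count, transition matrix, per-frame log-likelihoods, visual means) are toy numbers invented for the example.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most-likely state path; log_B is a (T, N) array of per-frame
    state log-likelihoods (here a stand-in for acoustic scores)."""
    T, N = log_B.shape
    delta = log_pi + log_B[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A        # scores[i, j]: come from i, land in j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(delta.argmax())
    for t in range(T - 2, -1, -1):             # backtrace
        path[t] = psi[t + 1, path[t + 1]]
    return path

rng = np.random.default_rng(0)
N, T, D_vis = 3, 20, 2                         # states, frames, visual dims (toy sizes)
log_pi = np.log(np.full(N, 1.0 / N))
A = np.full((N, N), 0.1) + np.eye(N) * 0.7     # "sticky" chain favoring self-loops
log_A = np.log(A / A.sum(axis=1, keepdims=True))
log_B = rng.normal(size=(T, N))                # stand-in acoustic log-likelihoods
visual_means = np.array([[0.0, 1.0],           # hypothetical per-state mean
                         [0.5, 0.2],           # visual articulatory parameters
                         [1.0, 0.8]])

states = viterbi(log_pi, log_A, log_B)         # acoustic-driven state alignment
visual_track = visual_means[states]            # synthesized visual trajectory
print(states.shape, visual_track.shape)        # (20,) (20, 2)
```

The paper's correlation HMM goes further by integrating independently trained acoustic and visual HMMs rather than emitting fixed per-state means; this sketch only shows the alignment-then-emit structure common to HMM-based synthesis.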

Keywords

Synthesizers; Hidden Markov models; Speech synthesis; Telephony; Signal synthesis; Speech analysis; Broadband communication; Multimedia systems; Narrowband; Acoustic applications
