An HMM-based speech-to-video synthesizer
Abstract
Emerging broadband communication systems promise a future of multimedia telephony, e.g. the addition of visual information to telephone conversations. It is useful to consider the problem of generating the critical information useful for speechreading, based on existing narrowband communications systems used for speech. This paper focuses on the problem of synthesizing visual articulatory movements given the acoustic speech signal. In this application, the acoustic speech signal is analyzed and the corresponding articulatory movements are synthesized for speechreading. This paper describes a hidden Markov model (HMM)-based visual speech synthesizer. The key elements in the application of HMMs to this problem are the decomposition of the overall modeling task into key stages and the judicious determination of the observation vector's components for each stage. The main contribution of this paper is a novel correlation HMM model that is able to integrate independently trained acoustic and visual HMMs for speech-to-visual synthesis. This model allows increased flexibility in choosing model topologies for the acoustic and visual HMMs. Moreover, the proposed model reduces the amount of training data required compared to early integration modeling techniques. Results from objective experiments show that the proposed approach can reduce time-alignment errors by 37.4% compared to a conventional temporal scaling method. Furthermore, subjective results indicate that the proposed model can increase speech understanding.
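To make the general pipeline concrete, the following toy sketch (not the paper's actual correlation HMM, and with all probabilities and the state-to-viseme table invented for illustration) decodes a discrete acoustic observation sequence with a small HMM via Viterbi decoding, then maps each decoded acoustic state to a visual articulatory parameter, mirroring the idea of driving visual synthesis from an independently trained acoustic model:

```python
import math

def viterbi(obs, pi, A, B):
    """Most likely state path for a discrete-observation HMM.
    pi: initial log-probs, A: transition log-probs, B: emission log-probs."""
    n_states = len(pi)
    # delta[s] = best log-probability of any path ending in state s
    delta = [pi[s] + B[s][obs[0]] for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at time t+1
    for t in range(1, len(obs)):
        new_delta, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: delta[p] + A[p][s])
            ptr.append(best_prev)
            new_delta.append(delta[best_prev] + A[best_prev][s] + B[s][obs[t]])
        delta = new_delta
        back.append(ptr)
    # Backtrack from the best final state
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

log = math.log
# Hypothetical 2-state acoustic HMM over 2 discrete acoustic symbols
pi = [log(0.6), log(0.4)]
A = [[log(0.7), log(0.3)], [log(0.4), log(0.6)]]
B = [[log(0.9), log(0.1)], [log(0.2), log(0.8)]]

# Hypothetical mapping from acoustic state to a lip-opening parameter
state_to_lip_opening = {0: 0.1, 1: 0.9}

obs = [0, 0, 1, 1]
states = viterbi(obs, pi, A, B)
lip_track = [state_to_lip_opening[s] for s in states]
```

The correlation HMM described in the paper goes further by learning the correspondence between separately trained acoustic and visual models rather than using a fixed lookup table, but the decode-then-map structure above conveys the basic flow from acoustic observations to visual parameters.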
Keywords
Synthesizers, Hidden Markov models, Speech synthesis, Telephony, Signal synthesis, Speech analysis, Broadband communication, Multimedia systems, Narrowband, Acoustic applications