Transforming an embodied conversational agent into an efficient talking head: from keyframe-based animation to multimodal concatenation synthesis
Abstract
Keywords
References
Badin, P., & Serrurier, A. (2006). Three-dimensional linear modeling of tongue: Articulatory data and models. Paper presented at the 7th International Seminar on Speech Production, Belo Horizonte, Brazil
Badin, P., Bailly, G., Revéret, L., Baciu, M., Segebarth, C., & Savariaux, C. (2002). Three-dimensional linear articulatory modeling of tongue, lips and face, based on MRI and video images. Journal of Phonetics, 30(3), 533–553.
Badin, P., Elisei, F., Bailly, G., & Tarabalka, Y. (2008). An audiovisual talking head for augmented speech generation: Models and animations based on a real speaker's articulatory data. In Articulated Motion and Deformable Objects (Lecture Notes in Computer Science, Vol. 5098, pp. 132–143). Springer.
Bailly, G., Gibert, G., & Odisio, M. (2002). Evaluation of movement generation systems using the point-light technique. In Proceedings of the 2002 IEEE Workshop on Speech Synthesis (pp. 27–30).
Bailly, G., Berar, M., Elisei, F., & Odisio, M. (2003). Audiovisual Speech Synthesis. International Journal of Speech Technology, 6, 331–346.
Bailly, G., Govokhina, O., Elisei, F., & Breton, G. (2009). Lip-synching using speaker-specific articulation, shape and appearance models. EURASIP Journal on Audio, Speech, and Music Processing. Special issue on “Animating Virtual Speakers or Singers from Audio: Lip-Synching Facial Animation”, doi:10.1155/2009/769494.
Berry, J. J. (2011). Accuracy of the NDI Wave Speech Research System. Journal of Speech, Language, and Hearing Research, 54(5), 1295–1301. doi:10.1044/1092-4388(2011/10-0226).
Black, A. W., & Lenzo, K. (2007). Festvox: Building synthetic voices. (2.1 ed.)
Boersma, P., & Weenink, D. (2010). Praat: doing phonetics by computer. (5.1.31 ed.)
Burnham, D., Dale, R., Stevens, K., Powers, D., Davis, C., Buchholz, J., et al. (2006–2011). From Talking Heads to Thinking Heads: A Research Platform for Human Communication Science. ARC/NH&MRC Special Initiatives, TS0669874
Cohen, M. M., & Massaro, D. (1993). Modeling Coarticulation in Synthetic Visual Speech. In N. M. Thalmann & D. Thalmann (Eds.), Models and Techniques in Computer Animation. Tokyo, Japan: Springer.
Cosatto, E., & Graf, H.-P. (2000). Photo-realistic talking heads from image samples. IEEE Transactions on Multimedia, 2, 152–163.
Engwall, O. (2000). A 3D tongue model based on MRI data. In International Conference on Spoken Language Processing, Beijing, China (Vol. 3, pp. 901–904)
Engwall, O. (2003). Combining MRI, EMA and EPG measurements in a three-dimensional tongue model. Speech Communication, 41(2–3), 303–329. doi:10.1016/s0167-6393(03)00132-2.
Engwall, O. (2005). Articulatory synthesis using corpus-based estimation of line spectrum pairs. Paper presented at the INTERSPEECH, Lisbon, Portugal.
Engwall, O. (2008). Can audio-visual instructions help learners improve their articulation? An ultrasound study of short term changes. In Interspeech 2008, Brisbane, Australia (pp. 2631–2634).
Ezzat, T., & Poggio, T. (2000). Visual speech synthesis by morphing visemes. International Journal of Computer Vision, 38(1), 45–57.
Ezzat, T., Geiger, G., & Poggio, T. (2002). Trainable videorealistic speech animation. Paper presented at the ACM SIGGRAPH, San Antonio, TX
Fabre, D., Hueber, T., & Badin, P. (2014). Automatic animation of an articulatory tongue model from ultrasound images using Gaussian mixture regression. Paper presented at the INTERSPEECH, Singapore
Fisher, C. G. (1968). Confusions Among Visually Perceived Consonants. Journal of Speech, Language, and Hearing Research, 11(4), 796–804.
Geiger, G., Ezzat, T., & Poggio, T. (2003). Perceptual Evaluation of Video-realistic Speech (AI Memo #2003-003, CBCL Paper #224). Cambridge, MA: Massachusetts Institute of Technology.
Gibert, G., & Stevens, C. J. (2012). Realistic eye model for Embodied Conversational Agents. Paper presented at the ACM 3rd International Symposium on Facial Analysis and Animation, Vienna, Austria, 21st September 2012
Gibert, G., Bailly, G., Beautemps, D., Elisei, F., & Brun, R. (2005). Analysis and synthesis of the three-dimensional movements of the head, face, and hand of a speaker using cued speech. Journal of Acoustical Society of America, 118(2), 1144–1153. doi:10.1121/1.1944587.
Gibert, G., Attina, V., Tiede, M., Bundgaard-Nielsen, R., Kroos, C., Kasisopa, B., et al. (2012). Multimodal Speech Animation from Electromagnetic Articulography Data. Paper presented at the 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania
Gibert, G., Leung, Y., & Stevens, C. J. (2013). Control of speech-related facial movements of an avatar from video. Speech Communication, 55(1), 135–146. http://dx.doi.org/10.1016/j.specom.2012.07.001.
Granstrom, B., & House, D. (2005). Audiovisual representation of prosody in expressive speech communication. Speech Communication, 46(3–4), 473–484.
Gris, I., Novick, D., Camacho, A., Rivera, D., Gutierrez, M., & Rayon, A. (2014). Recorded Speech, Virtual Environments, and the Effectiveness of Embodied Conversational Agents. In T. Bickmore, S. Marsella, & C. Sidner (Eds.), Intelligent Virtual Agents (Lecture Notes in Computer Science, Vol. 8637, pp. 182–185). New York: Springer International Publishing.
Jiang, J., Alwan, A., Bernstein, L. E., Keating, P., & Auer, E. (2002). On the correlation between facial movements, tongue movements and speech acoustics. Paper presented at the International Conference on Spoken Language Processing (ICSLP), Beijing, China.
Kim, J., Lammert, A. C., Kumar Ghosh, P., & Narayanan, S. S. (2014). Co-registration of speech production datasets from electromagnetic articulography and real-time magnetic resonance imaging. Journal of Acoustical Society of America, 135(2), EL115–EL121. http://dx.doi.org/10.1121/1.4862880.
Kim, J., Toutios, A., Lee, S., & Narayanan, S. S. (2015). A kinematic study of critical and non-critical articulators in emotional speech production. Journal of Acoustical Society of America, 137(3), 1411–1429. http://dx.doi.org/10.1121/1.4908284.
Kuratate, T. (2008). Text-to-AV synthesis system for Thinking Head Project. Paper presented at the Auditory-Visual Speech Processing, Brisbane, Australia
Musti, U., Toutios, A., Colotte, V., & Ouni, S. (2011). Introducing Visual Target Cost within an Acoustic-Visual Unit-Selection Speech Synthesizer. Paper presented at the AVSP, Volterra, Italy
Narayanan, S., Toutios, A., Ramanarayanan, V., Lammert, A., Kim, J., Lee, S., et al. (2014). Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (TC). Journal of Acoustical Society of America, 136(3), 1307–1311. http://dx.doi.org/10.1121/1.4890284.
Pammi, S. C., Charfuelan, M., & Schröder, M. (2010). Multilingual Voice Creation Toolkit for the MARY TTS Platform. Paper presented at the LREC, Valletta, Malta.
Pelachaud, C. (2009). Studies on gesture expressivity for a virtual agent. Speech Communication, 51(7), 630–639. doi:10.1016/j.specom.2008.04.009.
Ramanarayanan, V., Goldstein, L., & Narayanan, S. S. (2013). Spatio-temporal articulatory movement primitives during speech production: Extraction, interpretation, and validation. Journal of Acoustical Society of America, 134(2), 1378–1394. doi:10.1121/1.4812765.
Revéret, L., Bailly, G., & Badin, P. (2000). MOTHER: a new generation of talking heads providing a flexible articulatory control for video-realistic speech animation. In International Conference on Speech and Language Processing, Beijing, China, (pp. 755–758)
Rosenblum, L. D., Johnson, J. A., & Saldana, H. M. (1996). Point-light facial displays enhance comprehension of speech in noise. Journal of Speech and Hearing Research, 39(6), 1159–1170.
Schröder, M., Charfuelan, M., Pammi, S., & Steiner, I. (2011). Open source voice creation toolkit for the MARY TTS Platform. In 12th Annual Conference of the International Speech Communication Association (Interspeech 2011), Florence, Italy (pp. 3253–3256). ISCA. https://hal.inria.fr/hal-00661061/document, https://hal.inria.fr/hal-00661061/file/Interspeech2011.pdf
Sheng, L., Lan, W., & En, Q. (2011). The Phoneme-Level Articulator Dynamics for Pronunciation Animation. In International Conference on Asian Language Processing (IALP), 15–17 Nov. 2011 (pp. 283–286). doi:10.1109/ialp.2011.13.
Steiner, I., Richmond, K., & Ouni, S. (2013). Speech animation using electromagnetic articulography as motion capture data. Paper presented at the Auditory-Visual Speech Processing (AVSP), Annecy, France, August 29 - September 1, 2013
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of Acoustical Society of America, 26, 212–215.
Theobald, B. J. (2003). Visual speech synthesis using shape and appearance models. Norwich, UK: University of East Anglia.
Theobald, B. J., Fagel, S., Bailly, G., & Elisei, F. (2008). LIPS 2008: Visual Speech Synthesis Challenge. Paper presented at the INTERSPEECH 2008, Brisbane, Australia
Toutios, A., & Narayanan, S. S. (2013). Articulatory Synthesis of French Connected Speech from EMA Data. Paper presented at the INTERSPEECH, Lyon, France.