Emotional speech recognition: Resources, features, and methods
References
Abelin, A., Allwood, J., 2000. Cross linguistic interpretation of emotional prosody. In: Proc. ISCA Workshop on Speech and Emotion, Vol. 1, pp. 110–113.
Akaike, 1974, A new look at the statistical model identification, IEEE Trans. Automat. Contr., 19, 716, 10.1109/TAC.1974.1100705
Alpert, 2001, Reflections of depression in acoustic measures of the patient's speech, J. Affect. Disord., 66, 59, 10.1016/S0165-0327(00)00335-9
Alter, K., Rank, E., Kotz, S.A., 2000. Accentuation and emotions – two different systems? In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 138–142.
Ambrus, D.C., 2000. Collecting and recording of an emotional speech database. Tech. rep., Faculty of Electrical Engineering, Institute of Electronics, Univ. of Maribor.
Amir, N., Ron, S., Laor, N., 2000. Analysis of an emotional speech corpus in Hebrew based on objective criteria. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 29–33.
Ang, J., Dhillon, R., Krupski, A., Shriberg, E., Stolcke, A., 2002. Prosody-based automatic detection of annoyance and frustration in human–computer dialog. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Vol. 3, pp. 2037–2040.
Atal, B., Schroeder, M., 1967. Predictive coding of speech signals. In: Proc. Conf. on Communications and Processing, pp. 360–361.
Banse, 1996, Acoustic profiles in vocal emotion expression, J. Pers. Soc. Psychol., 70, 614, 10.1037/0022-3514.70.3.614
Bänziger, 2005, The role of intonation in emotional expressions, Speech Comm., 46, 252, 10.1016/j.specom.2005.02.016
Batliner, A., Hacker, C., Steidl, S., Nöth, E., D'Arcy, S., Russell, M., Wong, M., 2004. "You stupid tin box" – children interacting with the AIBO robot: a cross-linguistic emotional speech corpus. In: Proc. Language Resources and Evaluation (LREC ’04), Lisbon.
Bou-Ghazale, 1998, HMM-based stressed speech modelling with application to improved synthesis and recognition of isolated speech under stress, IEEE Trans. Speech Audio Processing, 6, 201, 10.1109/89.668815
Buck, 1999, The biological affects, a typology, Psychol. Rev., 106, 301, 10.1037/0033-295X.106.2.301
Bulut, M., Narayanan, S.S., Sydral, A.K., 2002. Expressive speech synthesis using a concatenative synthesizer. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Vol. 2, pp. 1265–1268.
Burkhardt, F., Sendlmeier, W.F., 2000. Verification of acoustical correlates of emotional speech using formant-synthesis. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 151–156.
Cairns, 1994, Nonlinear analysis and detection of speech under stressed conditions, J. Acoust. Soc. Am., 96, 3392, 10.1121/1.410601
Caldognetto, 2004, Modifications of phonetic labial targets in emotive speech: effects of the co-production of speech and emotions, Speech Comm., 44, 173, 10.1016/j.specom.2004.10.012
Choukri, K., 2003. European Language Resources Association, (ELRA). Available from: <www.elra.info>.
Chuang, Z.J., Wu, C.H., 2002. Emotion recognition from textual input using an emotional semantic network. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Vol. 3, pp. 2033–2036.
Clavel, C., Vasilescu, I., Devillers, L., Ehrette, T., 2004. Fiction database for emotion detection in abnormal situations. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’04), Korea, pp. 2277–2280.
Cole, R., 2005. The CU kids’ speech corpus. The Center for Spoken Language Research (CSLR). Available from: <http://cslr.colorado.edu/>.
Cowie, 2003, Describing the emotional states that are expressed in speech, Speech Comm., 40, 5, 10.1016/S0167-6393(02)00071-7
Cowie, R., Douglas-Cowie, E., 1996. Automatic statistical analysis of the signal and prosodic signs of emotion in speech. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’96), Vol. 3, pp. 1989–1992.
Cowie, 2001, Emotion recognition in human–computer interaction, IEEE Signal Processing Mag., 18, 32, 10.1109/79.911197
Davis, 1980, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. Speech Signal Processing, 28, 357, 10.1109/TASSP.1980.1163420
Dellaert, F., Polzin, T., Waibel, A., 1996. Recognizing emotion in speech. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’96), Vol. 3, pp. 1970–1973.
Deller, 2000
Dempster, 1977, Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. Ser. B, 39, 1
Douglas-Cowie, 2003, Emotional speech: towards a new generation of databases, Speech Comm., 40, 33, 10.1016/S0167-6393(02)00070-5
Ekman, 1992, An argument for basic emotions, Cognition Emotion, 6, 169, 10.1080/02699939208411068
Edgington, M., 1997. Investigating the limitations of concatenative synthesis. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech ’97), Vol. 1, pp. 593–596.
Efron, 1993
Engberg, I.S., Hansen, A.V., 1996. Documentation of the Danish Emotional Speech database (DES). Internal AAU report, Center for Person Kommunikation, Aalborg Univ., Denmark.
Fernandez, 2003, Modeling drivers’ speech under stress, Speech Comm., 40, 145, 10.1016/S0167-6393(02)00080-8
Fischer, K., 1999. Annotating emotional language data. Tech. Rep. 236, Univ. of Hamburg.
Flanagan, 1972, Speech Analysis, Synthesis and Perception
France, 2000, Acoustical properties of speech as indicators of depression and suicidal risk, IEEE Trans. Biomed. Eng., 47, 829, 10.1109/10.846676
Fukunaga, 1990, Introduction to Statistical Pattern Recognition
Gonzalez, G.M., 1999. Bilingual computer-assisted psychological assessment: an innovative approach for screening depression in Chicanos/Latinos. Tech. Rep. 39, Univ. Michigan.
Hansen, J.H.L., 1996. NATO IST-03 (formerly RSG. 10) speech under stress web page. Available from: <http://cslr.colorado.edu/rspl/stress.html>.
Hansen, 1995, ICARUS: Source generator based real-time recognition of speech in noisy stressful and Lombard effect environments, Speech Comm., 16, 391, 10.1016/0167-6393(95)00007-B
Hanson, 1994, A system for finding speech formants and modulations via energy separation, IEEE Trans. Speech Audio Processing, 2, 436, 10.1109/89.294358
Haykin, 1998, Neural Networks: A Comprehensive Foundation
Hess, 1992, Pitch and voicing determination
Heuft, B., Portele, T., Rauth, M., 1996. Emotions in time domain synthesis. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’96), Vol. 3, pp. 1974–1977.
Iida, A., Campbell, N., Iga, S., Higuchi, F., Yasumura, M., 2000. A speech synthesis system with emotion for assisting communication. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 167–172.
Iida, 2003, A corpus-based speech synthesis system with emotion, Speech Comm., 40, 161, 10.1016/S0167-6393(02)00081-X
Iriondo, I., Guaus, R., Rodriguez, A., 2000. Validation of an acoustical modeling of emotional expression in Spanish using speech synthesis techniques. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 161–166.
Jiang, D.N., Cai, L.H., 2004. Speech emotion classification with the combination of statistic features and temporal features. In: Proc. Internat. Conf. on Multimedia and Expo (ICME ’04), Taipei.
Kadambe, 1992, Application of the wavelet transform for pitch detection of signals, IEEE Trans. Inform. Theory, 38, 917, 10.1109/18.119752
Kawanami, H., Iwami, Y., Toda, T., Shikano, K., 2003. GMM-based voice conversion applied to emotional speech synthesis. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech ’03), Vol. 4, pp. 2401–2404.
Kwon, O.W., Chan, K.L., Hao, J., Lee, T.W., 2003. Emotion recognition by speech signals. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech ’03), Vol. 1, pp. 125–128.
Lee, 2005, Toward detecting emotions in spoken dialogs, IEEE Trans. Speech Audio Process., 13, 293, 10.1109/TSA.2004.838534
Leinonen, 1997, Expression of emotional motivational connotations with a one-word utterance, J. Acoust. Soc. Am., 102, 1853, 10.1121/1.420109
Liberman, M., 2005. Linguistic Data Consortium (LDC). Available from: <http://www.ldc.upenn.edu/>.
Linnankoski, 2005, Conveyance of emotional connotations by a single word in English, Speech Comm., 45, 27, 10.1016/j.specom.2004.09.007
Lloyd, 1999, Comprehension of prosody in Parkinson’s disease, Cortex, 35, 389, 10.1016/S0010-9452(08)70807-4
Makarova, V., Petrushin, V.A., 2002. RUSLANA: A database of Russian emotional utterances. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Vol. 1, pp. 2041–2044.
Mallat, S.G., Zhong, S., 1989. Complete signal representation with multiscale edges. Tech. Rep. RT-483-RR-219, Courant Inst. of Math. Sci.
Markel, 1976
Martins, C., Mascarenhas, I., Meinedo, H., Oliveira, L., Neto, J., Ribeiro, C., Trancoso, I., Viana, C., 1998. Spoken language corpora for speech recognition and synthesis in European Portuguese. In: Proc. Tenth Portuguese Conf. on Pattern Recognition (RECPAD ’98), Lisboa.
McGilloway, S., Cowie, R., Douglas-Cowie, E., Gielen, C.C.A.M., Westerdijk, M.J.D., Stroeve, S. H., 2000. Approaching automatic recognition of emotion from voice: a rough benchmark. In: Proc. ISCA Workshop on Speech and Emotion, Vol. 1, pp. 207–212.
McMahon, E., Cowie, R., Kasderidis, S., Taylor, J., Kollias, S., 2003. What chance that a DC could recognise hazardous mental states from sensor outputs? In: Tales of the Disappearing Computer, Santorini, Greece.
Mermelstein, 1975, Automatic segmentation of speech into syllabic units, J. Acoust. Soc. Am., 58, 880, 10.1121/1.380738
Montanari, S., Yildirim, S., Andersen, E., Narayanan, S., 2004. Reference marking in children’s computer-directed speech: an integrated analysis of discourse and gestures. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’04), Korea, Vol. 1, pp. 1841–1844.
Montero, J.M., Gutierrez-Arriola, J., Colas, J., Enriquez, E., Pardo, J.M., 1999. Analysis and modelling of emotional speech in Spanish. In: Proc. Internat. Conf. on Phonetic Sciences (ICPhS ’99), San Francisco, Vol. 2, pp. 957–960.
Morgan, 1995, Continuous speech recognition, IEEE Signal Processing Mag., 12, 24, 10.1109/79.382443
Mozziconacci, S.J.L., Hermes, D.J., 1997. A study of intonation patterns in speech expressing emotion or attitude: production and perception. Tech. Rep. 32, Eindhoven, IPO Annual Progress Report.
Mozziconacci, S.J.L., Hermes, D.J., 2000. Expression of emotion and attitude through temporal speech variations. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’00), Beijing, Vol. 2, pp. 373–378.
Mrayati, 1988, Distinctive regions and models: a new theory of speech production, Speech Comm., 7, 257, 10.1016/0167-6393(88)90073-8
Murray, I., Arnott, J.L., 1996. Synthesizing emotions in speech: is it time to get excited? In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’96), Vol. 3, pp. 1816–1819.
Nakatsu, R., Solomides, A., Tosa, N., 1999. Emotion recognition and its application to computer agents with spontaneous interactive capabilities. In: Proc. Internat. Conf. on Multimedia Computing and Systems (ICMCS ’99), Florence, Vol. 2, pp. 804–808.
Niimi, Y., Kasamatu, M., Nishimoto, T., Araki, M., 2001. Synthesis of emotional speech using prosodically balanced VCV segments. In: Proc. ISCA Tutorial and Research Workshop on Speech Synthesis (SSW 4), Scotland.
Nogueiras, A., Marino, J.B., Moreno, A., Bonafonte, A., 2001. Speech emotion recognition using hidden Markov models. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech ’01), Denmark.
Nordstrand, 2004, Measurements of articulatory variation in expressive speech for a set of Swedish vowels, Speech Comm., 44, 187, 10.1016/j.specom.2004.09.003
Nwe, 2003, Speech emotion recognition using hidden Markov models, Speech Comm., 41, 603, 10.1016/S0167-6393(03)00099-2
Pantic, 2003, Toward an affect-sensitive multimodal human–computer interaction, Proc. IEEE, 91, 1370, 10.1109/JPROC.2003.817122
Pellom, B.L., Hansen, J.H.L., 1996. Text-directed speech enhancement using phoneme classification and feature map constrained vector quantization. In: Proc. Internat. Conf. on Acoustics, Speech, and Signal Processing (ICASSP ’96), Vol. 2, pp. 645–648.
Pereira, C., 2000. Dimensions of emotional meaning in speech. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 25–28.
Petrushin, V.A., 1999. Emotion in speech recognition and application to call centers. In: Proc. Artificial Neural Networks in Engineering (ANNIE ’99), Vol. 1, pp. 7–10.
Picard, 2001, Toward machine emotional intelligence: analysis of affective physiological state, IEEE Trans. Pattern Anal. Machine Intell., 23, 1175, 10.1109/34.954607
Pollerman, 2002
Polzin, T., Waibel, A., 2000. Emotion-sensitive human–computer interfaces. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 201–206.
Polzin, T.S., Waibel, A.H., 1998. Detecting emotions in speech. In: Proc. Cooperative Multimodal Communication (CMC ’98).
Quatieri, 2002
Rabiner, 1993
Rahurkar, M., Hansen, J.H.L., 2002. Frequency band analysis for stress detection using a Teager energy operator based feature. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Vol. 3, pp. 2021–2024.
Scherer, K.R., 2000a. A cross-cultural investigation of emotion inferences from voice and speech: implications for speech technology. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’00), Vol. 1, pp. 379–382.
Scherer, K.R., 2000b. Emotion effects on voice and speech: paradigms and approaches to evaluation. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, invited paper.
Scherer, 2003, Vocal communication of emotion: a review of research paradigms, Speech Comm., 40, 227, 10.1016/S0167-6393(02)00084-5
Scherer, K.R., Banse, R., Wallbott, H.G., Goldbeck, T., 1991. Vocal cues in emotion encoding and decoding. Motiv. Emotion, Vol. 15, pp. 123–148.
Scherer, K.R., Grandjean, D., Johnstone, T., Klasmeyer, G., Bänziger, T., 2002. Acoustic correlates of task load and stress. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Colorado, Vol. 3, pp. 2017–2020.
Schiel, F., Steininger, S., Turk, U., 2002. The Smartkom multimodal corpus at BAS. In: Proc. Language Resources and Evaluation (LREC ’02).
Schröder, M., 2000. Experimental study of affect bursts. In: Proc. ISCA Workshop on Speech and Emotion, Vol. 1, pp. 132–137.
Schröder, M., 2005. Humaine consortium: research on emotions and human–machine interaction. Available from: <http://emotion-research.net/>.
Schröder, M., Grice, M., 2003. Expressing vocal effort in concatenative synthesis. In: Proc. Internat. Conf. on Phonetic Sciences (ICPhS ’03), Barcelona.
Schuller, B., Rigoll, G., Lang, M., 2004. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP ’04), Vol. 1, pp. 557–560.
Shawe-Taylor, 2004
Shi, R.P., Adelhardt, J., Zeissler, V., Batliner, A., Frank, C., Nöth, E., Niemann, H., 2003. Using speech and gesture to explore user states in multimodal dialogue systems. In: Proc. ISCA Tutorial and Research Workshop on Audio Visual Speech Processing (AVSP ’03), Vol. 1, pp. 151–156.
Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., Hirschberg, J., 1992. ToBI: A standard for labeling English prosody. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’92), Vol. 2, pp. 867–870.
Slaney, 2003, BabyEars: A recognition system for affective vocalizations, Speech Comm., 39, 367, 10.1016/S0167-6393(02)00049-3
Sondhi, 1968, New methods of pitch extraction, IEEE Trans. Audio Electroacoust., 16, 262, 10.1109/TAU.1968.1161986
Steeneken, H.J.M., Hansen, J.H.L., 1999. Speech under stress conditions: overview of the effect on speech production and on system performance. In: Proc. Internat. Conf. on Acoustics, Speech, and Signal Processing (ICASSP ’99), Phoenix, Vol. 4, pp. 2079–2082.
Stibbard, R., 2000. Automated extraction of ToBI annotation data from the Reading/Leeds emotional speech corpus. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 60–65.
Tato, R., 2002. Emotional space improves emotion recognition. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Colorado, Vol. 3, pp. 2029–2032.
Teager, 1990, Evidence for nonlinear sound production mechanisms in the vocal tract, Vol. 15
Tolkmitt, 1986, Effect of experimentally induced stress on vocal parameters, J. Exp. Psychol. [Hum. Percept.], 12, 302, 10.1037/0096-1523.12.3.302
Van Bezooijen, 1984
van der Heijden, 2004
Ververidis, D., Kotropoulos, C., 2004. Automatic speech classification to five emotional states based on gender information. In: Proc. European Signal Processing Conf. (EUSIPCO ’04), Vol. 1, pp. 341–344.
Ververidis, D., Kotropoulos, C., 2005. Emotional speech classification using Gaussian mixture models and the sequential floating forward selection algorithm. In: Proc. Internat. Conf. on Multimedia and Expo (ICME ’05).
Ververidis, D., Kotropoulos, C., Pitas, I., 2004. Automatic emotional speech classification. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP ’04), Montreal, Vol. 1, pp. 593–596.
Wagner, J., Kim, J., André, E., 2005. From physiological signals to emotions: implementing and comparing selected methods for feature extraction and classification. In: Proc. Internat. Conf. on Multimedia and Expo (ICME ’05), Amsterdam.
Wendt, B., Scheich, H., 2002. The Magdeburger prosodie-korpus. In: Proc. Speech Prosody Conf., pp. 699–701.
Womack, 1996, Classification of speech under stress using target driven features, Speech Comm., 20, 131, 10.1016/S0167-6393(96)00049-0
Womack, 1999, N-channel hidden Markov models for combined stressed speech classification and recognition, IEEE Trans. Speech Audio Processing, 7, 668, 10.1109/89.799692
Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Busso, C., Deng, Z., Lee, S., Narayanan, S., 2004. An acoustic study of emotions expressed in speech. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’04), Korea, Vol. 1, pp. 2193–2196.
Yu, F., Chang, E., Xu, Y.Q., Shum, H.Y., 2001. Emotion detection from speech to enrich multimedia content. In: Proc. IEEE Pacific-Rim Conf. on Multimedia 2001, Beijing, Vol. 1, pp. 550–557.
Yuan, J., 2002. The acoustic realization of anger, fear, joy and sadness in Chinese. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Vol. 3, pp. 2025–2028.
Zhou, 2001, Nonlinear feature based classification of speech under stress, IEEE Trans. Speech Audio Processing, 9, 201, 10.1109/89.905995