Emotional speech recognition: Resources, features, and methods
References
Abelin, A., Allwood, J., 2000. Cross linguistic interpretation of emotional prosody. In: Proc. ISCA Workshop on Speech and Emotion, Vol. 1, pp. 110–113.
Akaike, 1974, A new look at the statistical model identification, IEEE Trans. Automat. Contr., 19, 716, 10.1109/TAC.1974.1100705
Alpert, 2001, Reflections of depression in acoustic measures of the patient's speech, J. Affect. Disord., 66, 59, 10.1016/S0165-0327(00)00335-9
Alter, K., Rank, E., Kotz, S.A., 2000. Accentuation and emotions – two different systems? In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 138–142.
Ambrus, D.C., 2000. Collecting and recording of an emotional speech database. Tech. rep., Faculty of Electrical Engineering, Institute of Electronics, Univ. of Maribor.
Amir, N., Ron, S., Laor, N., 2000. Analysis of an emotional speech corpus in Hebrew based on objective criteria. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 29–33.
Ang, J., Dhillon, R., Krupski, A., Shriberg, E., Stolcke, A., 2002. Prosody-based automatic detection of annoyance and frustration in human–computer dialog. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Vol. 3, pp. 2037–2040.
Atal, B., Schroeder, M., 1967. Predictive coding of speech signals. In: Proc. Conf. on Communications and Processing, pp. 360–361.
Banse, 1996, Acoustic profiles in vocal emotion expression, J. Pers. Soc. Psychol., 70, 614, 10.1037/0022-3514.70.3.614
Bänziger, 2005, The role of intonation in emotional expressions, Speech Comm., 46, 252, 10.1016/j.specom.2005.02.016
Batliner, A., Hacker, C., Steidl, S., Nöth, E., D'Arcy, S., Russell, M., Wong, M., 2004. "You stupid tin box" – children interacting with the AIBO robot: a cross-linguistic emotional speech corpus. In: Proc. Language Resources and Evaluation (LREC ’04), Lisbon.
Bou-Ghazale, 1998, HMM-based stressed speech modelling with application to improved synthesis and recognition of isolated speech under stress, IEEE Trans. Speech Audio Processing, 6, 201, 10.1109/89.668815
Buck, 1999, The biological affects, a typology, Psychol. Rev., 106, 301, 10.1037/0033-295X.106.2.301
Bulut, M., Narayanan, S.S., Sydral, A.K., 2002. Expressive speech synthesis using a concatenative synthesizer. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Vol. 2, pp. 1265–1268.
Burkhardt, F., Sendlmeier, W.F., 2000. Verification of acoustical correlates of emotional speech using formant-synthesis. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 151–156.
Cairns, 1994, Nonlinear analysis and detection of speech under stressed conditions, J. Acoust. Soc. Am., 96, 3392, 10.1121/1.410601
Caldognetto, 2004, Modifications of phonetic labial targets in emotive speech: effects of the co-production of speech and emotions, Speech Comm., 44, 173, 10.1016/j.specom.2004.10.012
Choukri, K., 2003. European Language Resources Association, (ELRA). Available from: <www.elra.info>.
Chuang, Z.J., Wu, C.H., 2002. Emotion recognition from textual input using an emotional semantic network. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Vol. 3, pp. 2033–2036.
Clavel, C., Vasilescu, I., Devillers, L., Ehrette, T., 2004. Fiction database for emotion detection in abnormal situations. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’04), Korea, pp. 2277–2280.
Cole, R., 2005. The CU kids’ speech corpus. The Center for Spoken Language Research (CSLR). Available from: <http://cslr.colorado.edu/>.
Cowie, 2003, Describing the emotional states that are expressed in speech, Speech Comm., 40, 5, 10.1016/S0167-6393(02)00071-7
Cowie, R., Douglas-Cowie, E., 1996. Automatic statistical analysis of the signal and prosodic signs of emotion in speech. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’96), Vol. 3, pp. 1989–1992.
Cowie, 2001, Emotion recognition in human–computer interaction, IEEE Signal Processing Mag., 18, 32, 10.1109/79.911197
Davis, 1980, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. Speech Signal Processing, 28, 357, 10.1109/TASSP.1980.1163420
Dellaert, F., Polzin, T., Waibel, A., 1996. Recognizing emotion in speech. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’96), Vol. 3, pp. 1970–1973.
Deller, 2000
Dempster, 1977, Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. Ser. B, 39, 1
Douglas-Cowie, 2003, Emotional speech: towards a new generation of databases, Speech Comm., 40, 33, 10.1016/S0167-6393(02)00070-5
Ekman, 1992, An argument for basic emotions, Cognition Emotion, 6, 169, 10.1080/02699939208411068
Edgington, M., 1997. Investigating the limitations of concatenative synthesis. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech ’97), Vol. 1, pp. 593–596.
Efron, 1993
Engberg, I.S., Hansen, A.V., 1996. Documentation of the Danish Emotional Speech database (DES). Internal AAU report, Center for Person Kommunikation, Aalborg Univ., Denmark.
Fernandez, 2003, Modeling drivers’ speech under stress, Speech Comm., 40, 145, 10.1016/S0167-6393(02)00080-8
Fischer, K., 1999. Annotating emotional language data. Tech. Rep. 236, Univ. of Hamburg.
Flanagan, 1972, Speech Analysis, Synthesis and Perception
France, 2000, Acoustical properties of speech as indicators of depression and suicidal risk, IEEE Trans. Biomed. Eng., 47, 829, 10.1109/10.846676
Fukunaga, 1990, Introduction to Statistical Pattern Recognition
Gonzalez, G.M., 1999. Bilingual computer-assisted psychological assessment: an innovative approach for screening depression in Chicanos/Latinos. Tech. Rep. 39, Univ. Michigan.
Hansen, J.H.L., 1996. NATO IST-03 (formerly RSG. 10) speech under stress web page. Available from: <http://cslr.colorado.edu/rspl/stress.html>.
Hansen, 1995, ICARUS: Source generator based real-time recognition of speech in noisy stressful and Lombard effect environments, Speech Comm., 16, 391, 10.1016/0167-6393(95)00007-B
Hanson, 1994, A system for finding speech formants and modulations via energy separation, IEEE Trans. Speech Audio Processing, 2, 436, 10.1109/89.294358
Haykin, 1998, Neural Networks: A Comprehensive Foundation
Hess, 1992, Pitch and voicing determination
Heuft, B., Portele, T., Rauth, M., 1996. Emotions in time domain synthesis. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’96), Vol. 3, pp. 1974–1977.
Iida, A., Campbell, N., Iga, S., Higuchi, F., Yasumura, M., 2000. A speech synthesis system with emotion for assisting communication. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 167–172.
Iida, 2003, A corpus-based speech synthesis system with emotion, Speech Comm., 40, 161, 10.1016/S0167-6393(02)00081-X
Iriondo, I., Guaus, R., Rodriguez, A., 2000. Validation of an acoustical modeling of emotional expression in Spanish using speech synthesis techniques. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 161–166.
Jiang, D.N., Cai, L.H., 2004. Speech emotion classification with the combination of statistic features and temporal features. In: Proc. Internat. Conf. on Multimedia and Expo (ICME ’04), Taipei.
Kadambe, 1992, Application of the wavelet transform for pitch detection of signals, IEEE Trans. Inform. Theory, 38, 917, 10.1109/18.119752
Kawanami, H., Iwami, Y., Toda, T., Shikano, K., 2003. GMM-based voice conversion applied to emotional speech synthesis. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech ’03), Vol. 4, pp. 2401–2404.
Kwon, O.W., Chan, K.L., Hao, J., Lee, T.W., 2003. Emotion recognition by speech signals. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech ’03), Vol. 1, pp. 125–128.
Lee, 2005, Toward detecting emotions in spoken dialogs, IEEE Trans. Speech Audio Process., 13, 293, 10.1109/TSA.2004.838534
Leinonen, 1997, Expression of emotional motivational connotations with a one-word utterance, J. Acoust. Soc. Am., 102, 1853, 10.1121/1.420109
Liberman, M., 2005. Linguistic Data Consortium (LDC). Available from: <http://www.ldc.upenn.edu/>.
Linnankoski, 2005, Conveyance of emotional connotations by a single word in English, Speech Comm., 45, 27, 10.1016/j.specom.2004.09.007
Lloyd, 1999, Comprehension of prosody in Parkinson’s disease, Cortex, 35, 389, 10.1016/S0010-9452(08)70807-4
Makarova, V., Petrushin, V.A., 2002. RUSLANA: A database of Russian emotional utterances. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Vol. 1, pp. 2041–2044.
Mallat, S.G., Zhong, S., 1989. Complete signal representation with multiscale edges. Tech. Rep. RT-483-RR-219, Courant Inst. of Math. Sci.
Markel, 1976
Martins, C., Mascarenhas, I., Meinedo, H., Oliveira, L., Neto, J., Ribeiro, C., Trancoso, I., Viana, C., 1998. Spoken language corpora for speech recognition and synthesis in European Portuguese. In: Proc. Tenth Portuguese Conf. on Pattern Recognition (RECPAD ’98), Lisboa.
McGilloway, S., Cowie, R., Douglas-Cowie, E., Gielen, C.C.A.M., Westerdijk, M.J.D., Stroeve, S. H., 2000. Approaching automatic recognition of emotion from voice: a rough benchmark. In: Proc. ISCA Workshop on Speech and Emotion, Vol. 1, pp. 207–212.
McMahon, E., Cowie, R., Kasderidis, S., Taylor, J., Kollias, S., 2003. What chance that a DC could recognise hazardous mental states from sensor outputs? In: Tales of the Disappearing Computer, Santorini, Greece.
Mermelstein, 1975, Automatic segmentation of speech into syllabic units, J. Acoust. Soc. Am., 58, 880, 10.1121/1.380738
Montanari, S., Yildirim, S., Andersen, E., Narayanan, S., 2004. Reference marking in children’s computer-directed speech: an integrated analysis of discourse and gestures. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’04), Korea, Vol. 1, pp. 1841–1844.
Montero, J.M., Gutierrez-Arriola, J., Colas, J., Enriquez, E., Pardo, J.M., 1999. Analysis and modelling of emotional speech in Spanish. In: Proc. Internat. Conf. on Phonetic Sciences (ICPhS ’99), San Francisco, Vol. 2, pp. 957–960.
Morgan, 1995, Continuous speech recognition, IEEE Signal Processing Mag., 12, 24, 10.1109/79.382443
Mozziconacci, S.J.L., Hermes, D.J., 1997. A study of intonation patterns in speech expressing emotion or attitude: production and perception. Tech. Rep. 32, Eindhoven, IPO Annual Progress Report.
Mozziconacci, S.J.L., Hermes, D.J., 2000. Expression of emotion and attitude through temporal speech variations. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’00), Beijing, Vol. 2, pp. 373–378.
Mrayati, 1988, Distinctive regions and models: a new theory of speech production, Speech Comm., 7, 257, 10.1016/0167-6393(88)90073-8
Murray, I., Arnott, J.L., 1996. Synthesizing emotions in speech: is it time to get excited? In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’96), Vol. 3, pp. 1816–1819.
Nakatsu, R., Solomides, A., Tosa, N., 1999. Emotion recognition and its application to computer agents with spontaneous interactive capabilities. In: Proc. Internat. Conf. on Multimedia Computing and Systems (ICMCS ’99), Florence, Vol. 2, pp. 804–808.
Niimi, Y., Kasamatu, M., Nishimoto, T., Araki, M., 2001. Synthesis of emotional speech using prosodically balanced VCV segments. In: Proc. ISCA Tutorial and Research Workshop on Speech Synthesis (SSW 4), Scotland.
Nogueiras, A., Marino, J.B., Moreno, A., Bonafonte, A., 2001. Speech emotion recognition using hidden Markov models. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech ’01), Denmark.
Nordstrand, 2004, Measurements of articulatory variation in expressive speech for a set of Swedish vowels, Speech Comm., 44, 187, 10.1016/j.specom.2004.09.003
Nwe, 2003, Speech emotion recognition using hidden Markov models, Speech Comm., 41, 603, 10.1016/S0167-6393(03)00099-2
Pantic, 2003, Toward an affect-sensitive multimodal human–computer interaction, Proc. IEEE, 91, 1370, 10.1109/JPROC.2003.817122
Pellom, B.L., Hansen, J.H.L., 1996. Text-directed speech enhancement using phoneme classification and feature map constrained vector quantization. In: Proc. Internat. Conf. on Acoustics, Speech, and Signal Processing (ICASSP ’96), Vol. 2, pp. 645–648.
Pereira, C., 2000. Dimensions of emotional meaning in speech. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 25–28.
Petrushin, V.A., 1999. Emotion in speech recognition and application to call centers. In: Proc. Artificial Neural Networks in Engineering (ANNIE ’99), Vol. 1, pp. 7–10.
Picard, 2001, Toward machine emotional intelligence: analysis of affective physiological state, IEEE Trans. Pattern Anal. Machine Intell., 23, 1175, 10.1109/34.954607
Pollerman, 2002
Polzin, T., Waibel, A., 2000. Emotion-sensitive human–computer interfaces. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 201–206.
Polzin, T.S., Waibel, A.H., 1998. Detecting emotions in speech. In: Proc. Cooperative Multimodal Communication (CMC ’98).
Quatieri, 2002
Rabiner, 1993
Rahurkar, M., Hansen, J.H.L., 2002. Frequency band analysis for stress detection using a Teager energy operator based feature. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Vol. 3, pp. 2021–2024.
Scherer, K.R., 2000a. A cross-cultural investigation of emotion inferences from voice and speech: implications for speech technology. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’00), Vol. 1, pp. 379–382.
Scherer, K.R., 2000b. Emotion effects on voice and speech: paradigms and approaches to evaluation. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, invited paper.
Scherer, 2003, Vocal communication of emotion: a review of research paradigms, Speech Comm., 40, 227, 10.1016/S0167-6393(02)00084-5
Scherer, K.R., Banse, R., Wallbott, H.G., Goldbeck, T., 1991. Vocal cues in emotion encoding and decoding. Motiv. Emotion, Vol. 15, pp. 123–148.
Scherer, K.R., Grandjean, D., Johnstone, T., Klasmeyer, G., Bänziger, T., 2002. Acoustic correlates of task load and stress. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Colorado, Vol. 3, pp. 2017–2020.
Schiel, F., Steininger, S., Turk, U., 2002. The Smartkom multimodal corpus at BAS. In: Proc. Language Resources and Evaluation (LREC ’02).
Schröder, M., 2000. Experimental study of affect bursts. In: Proc. ISCA Workshop on Speech and Emotion, Vol. 1, pp. 132–137.
Schröder, M., 2005. Humaine consortium: research on emotions and human–machine interaction. Available from: <http://emotion-research.net/>.
Schröder, M., Grice, M., 2003. Expressing vocal effort in concatenative synthesis. In: Proc. Internat. Conf. on Phonetic Sciences (ICPhS ’03), Barcelona.
Schuller, B., Rigoll, G., Lang, M., 2004. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP ’04), Vol. 1, pp. 557–560.
Shawe-Taylor, 2004
Shi, R.P., Adelhardt, J., Zeissler, V., Batliner, A., Frank, C., Nöth, E., Niemann, H., 2003. Using speech and gesture to explore user states in multimodal dialogue systems. In: Proc. ISCA Tutorial and Research Workshop on Audio Visual Speech Processing (AVSP ’03), Vol. 1, pp. 151–156.
Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., Hirschberg, J., 1992. ToBI: A standard for labeling English prosody. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’92), Vol. 2, pp. 867–870.
Slaney, 2003, BabyEars: A recognition system for affective vocalizations, Speech Comm., 39, 367, 10.1016/S0167-6393(02)00049-3
Sondhi, 1968, New methods of pitch extraction, IEEE Trans. Audio Electroacoust., 16, 262, 10.1109/TAU.1968.1161986
Steeneken, H.J.M., Hansen, J.H.L., 1999. Speech under stress conditions: overview of the effect on speech production and on system performance. In: Proc. Internat. Conf. on Acoustics, Speech, and Signal Processing (ICASSP ’99), Phoenix, Vol. 4, pp. 2079–2082.
Stibbard, R., 2000. Automated extraction of ToBI annotation data from the Reading/Leeds emotional speech corpus. In: Proc. ISCA Workshop on Speech and Emotion, Belfast, Vol. 1, pp. 60–65.
Tato, R., 2002. Emotional space improves emotion recognition. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Colorado, Vol. 3, pp. 2029–2032.
Teager, 1990, Evidence for nonlinear sound production mechanisms in the vocal tract, Vol. 15
Tolkmitt, 1986, Effect of experimentally induced stress on vocal parameters, J. Exp. Psychol. [Hum. Percept.], 12, 302, 10.1037/0096-1523.12.3.302
Van Bezooijen, 1984
van der Heijden, 2004
Ververidis, D., Kotropoulos, C., 2004. Automatic speech classification to five emotional states based on gender information. In: Proc. European Signal Processing Conf. (EUSIPCO ’04), Vol. 1, pp. 341–344.
Ververidis, D., Kotropoulos, C., 2005. Emotional speech classification using Gaussian mixture models and the sequential floating forward selection algorithm. In: Proc. Internat. Conf. on Multimedia and Expo (ICME ’05).
Ververidis, D., Kotropoulos, C., Pitas, I., 2004. Automatic emotional speech classification. In: Proc. Internat. Conf. on Acoustics, Speech and Signal Processing (ICASSP ’04), Montreal, Vol. 1, pp. 593–596.
Wagner, J., Kim, J., André, E., 2005. From physiological signals to emotions: implementing and comparing selected methods for feature extraction and classification. In: Proc. Internat. Conf. on Multimedia and Expo (ICME ’05), Amsterdam.
Wendt, B., Scheich, H., 2002. The Magdeburger prosodie-korpus. In: Proc. Speech Prosody Conf., pp. 699–701.
Womack, 1996, Classification of speech under stress using target driven features, Speech Comm., 20, 131, 10.1016/S0167-6393(96)00049-0
Womack, 1999, N-channel hidden Markov models for combined stressed speech classification and recognition, IEEE Trans. Speech Audio Processing, 7, 668, 10.1109/89.799692
Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Busso, C., Deng, Z., Lee, S., Narayanan, S., 2004. An acoustic study of emotions expressed in speech. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’04), Korea, Vol. 1, pp. 2193–2196.
Yu, F., Chang, E., Xu, Y.Q., Shum, H.Y., 2001. Emotion detection from speech to enrich multimedia content. In: Proc. IEEE Pacific-Rim Conf. on Multimedia 2001, Beijing, Vol. 1, pp. 550–557.
Yuan, J., 2002. The acoustic realization of anger, fear, joy and sadness in Chinese. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP ’02), Vol. 3, pp. 2025–2028.
Zhou, 2001, Nonlinear feature based classification of speech under stress, IEEE Trans. Speech Audio Processing, 9, 201, 10.1109/89.905995