Survey on speech emotion recognition: Features, classification schemes, and databases
Abstract
Keywords
References
Akaike, 1974, A new look at the statistical model identification, IEEE Trans. Autom. Control, 19, 716, 10.1109/TAC.1974.1100705
N. Amir, S. Ron, N. Laor, Analysis of an emotional speech corpus in Hebrew based on objective criteria, in: SpeechEmotion-2000, 2000, pp. 29–33.
J. Ang, R. Dhillon, A. Krupski, E. Shriberg, A. Stolcke, Prosody-based automatic detection of annoyance and frustration in human–computer dialog, in: Proceedings of the ICSLP 2002, 2002, pp. 2037–2040.
Atal, 1974, Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification, J. Acoust. Soc. Am., 55, 1304, 10.1121/1.1914702
Athanaselis, 2005, ASR for emotional speech: clarifying the issues and enhancing the performance, Neural Networks, 18, 437, 10.1016/j.neunet.2005.03.008
M.M.H. El Ayadi, M.S. Kamel, F. Karray, Speech emotion recognition using Gaussian mixture vector autoregressive models, in: ICASSP 2007, vol. 4, 2007, pp. 957–960.
Banse, 1996, Acoustic profiles in vocal emotion expression, J. Pers. Soc. Psychol., 70, 614, 10.1037/0022-3514.70.3.614
A. Batliner, K. Fischer, R. Huber, J. Spilker, E. Nöth, Desperately seeking emotions: actors, wizards and human beings, in: Proceedings of the ISCA Workshop Speech Emotion, 2000, pp. 195–200.
Beeke, 2009, Prosody as a compensatory strategy in the conversations of people with agrammatism, Clin. Linguist. Phonetics, 23, 133, 10.1080/02699200802602985
Bishop, 1995
M. Borchert, A. Dusterhoft, Emotions in speech—experiments with prosody and quality features in speech for use in categorical and dimensional emotion recognition environments, in: Proceedings of 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering, IEEE NLP-KE’05 2005, 2005, pp. 147–151.
Bosch, 2003, Emotions, speech and the ASR framework, Speech Commun., 40, 213, 10.1016/S0167-6393(02)00083-3
Bou-Ghazale, 2000, A comparative study of traditional and newly proposed features for recognition of speech under stress, IEEE Trans. Speech Audio Process., 8, 429, 10.1109/89.848224
Le Bouquin, 1996, Enhancement of noisy speech signals: application to mobile radio communications, Speech Commun., 18, 3, 10.1016/0167-6393(95)00021-6
Breazeal, 2002, Recognition of affective communicative intent in robot-directed speech, Autonomous Robots, 12, 83, 10.1023/A:1013215010749
Burges, 1998, A tutorial on support vector machines for pattern recognition, Data Mining Knowl. Discovery, 2, 121, 10.1023/A:1009715923555
F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, B. Weiss, A database of German emotional speech, in: Proceedings of the Interspeech 2005, Lissabon, Portugal, 2005, pp. 1517–1520.
Busso, 2009, Analysis of emotionally salient aspects of fundamental frequency for emotion detection, IEEE Trans. Audio Speech Language Process., 17, 582, 10.1109/TASL.2008.2009578
Cahn, 1990, The generation of affect in synthesized speech, J. Am. Voice Input/Output Soc., 8, 1
Cairns, 1994, Nonlinear analysis and detection of speech under stressed conditions, J. Acoust. Soc. Am., 96, 3392, 10.1121/1.410601
W. Campbell, Databases of emotional speech, in: Proceedings of the ISCA (International Speech Communication and Association) ITRW on Speech and Emotion, 2000, pp. 34–38.
C. Chen, M. You, M. Song, J. Bu, J. Liu, An enhanced speech emotion recognition system based on discourse information, in: Lecture Notes in Computer Science—I (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3991, 2006, pp. 449–456.
L. Chen, T. Huang, T. Miyasato, R. Nakatsu, Multimodal human emotion/expression recognition, in: Proceedings of the IEEE Automatic Face and Gesture Recognition, 1998, pp. 366–371.
Z. Chuang, C. Wu, Emotion recognition using acoustic features and textual content, Multimedia and Expo, 2004. IEEE International Conference on ICME ’04, vol. 1, 2004, pp. 53–56.
R. Cohen, A computational theory of the function of clue words in argument understanding, in: ACL-22: Proceedings of the 10th International Conference on Computational Linguistics and 22nd Annual Meeting on Association for Computational Linguistics, 1984, pp. 251–258.
Cowie, 2003, Describing the emotional states that are expressed in speech, Speech Commun., 40, 5, 10.1016/S0167-6393(02)00071-7
R. Cowie, E. Douglas-Cowie, Automatic statistical analysis of the signal and prosodic signs of emotion in speech, in: Proceedings, Fourth International Conference on Spoken Language, 1996. ICSLP 96. vol. 3, 1996, pp. 1989–1992.
Cowie, 2001, Emotion recognition in human–computer interaction, IEEE Signal Process. Mag., 18, 32, 10.1109/79.911197
Cristianini, 2000
Davitz, 1964
Dempster, 1977, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc., 39, 1
L. Devillers, L. Lamel, Emotion detection in task-oriented dialogs, in: Proceedings of the International Conference on Multimedia and Expo 2003, 2003, pp. 549–552.
Duda, 2001
Ekman, 1982
Abu El-Yazeed, 2004, On the determination of optimal model order for GMM-based text-independent speaker identification, EURASIP J. Appl. Signal Process., 8, 1078
I. Engberg, A. Hansen, Documentation of the Danish emotional speech database DES 〈http://cpk.auc.dk/tb/speech/Emotions/〉, 1996.
R. Fernandez, A computational model for the automatic recognition of affect in speech, Ph.D. Thesis, Massachusetts Institute of Technology, February 2004.
France, 2000, Acoustical properties of speech as indicators of depression and suicidal risk, IEEE Trans. Biomedical Eng., 47, 829, 10.1109/10.846676
Freund, 1997, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci., 55, 119, 10.1006/jcss.1997.1504
L. Fu, X. Mao, L. Chen, Speaker independent emotion recognition based on SVM/HMMs fusion system, in: International Conference on Audio, Language and Image Processing, 2008. ICALIP 2008, pp. 61–65.
Gelfer, 1995, Comparisons of jitter, shimmer, and signal-to-noise ratio from directly digitized versus taped voice samples, J. Voice, 9, 378, 10.1016/S0892-1997(05)80199-7
H. Go, K. Kwak, D. Lee, M. Chun, Emotion recognition from the facial image and speech signal, in: Proceedings of the IEEE SICE 2003, vol. 3, 2003, pp. 2890–2895.
Gobl, 2003, The role of voice quality in communicating emotion, mood and attitude, Speech Commun., 40, 189, 10.1016/S0167-6393(02)00082-1
Grosz, 1986, Attention, intentions, and the structure of discourse, Comput. Linguist., 12, 175
Hansen, 1995, ICARUS: source generator based real-time recognition of speech in noisy stressful and Lombard effect environments, Speech Commun., 16, 391, 10.1016/0167-6393(95)00007-B
Hernando, 1997, Linear prediction of the one-sided autocorrelation sequence for noisy speech recognition, IEEE Trans. Speech Audio Process., 5, 80, 10.1109/89.554273
K. Hirose, H. Fujisaki, M. Yamaguchi, Synthesis by rule of voice fundamental frequency contours of spoken Japanese from linguistic information, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP ’84, vol. 9, 1984, pp. 597–600.
Ho, 1994, Decision combination in multiple classifier systems, IEEE Trans. Pattern Anal. Mach. Intell., 16, 66, 10.1109/34.273716
Hozjan, 2003, Context-independent multilingual emotion recognition from speech signal, Int. J. Speech Technol., 6, 311, 10.1023/A:1023426522496
V. Hozjan, Z. Moreno, A. Bonafonte, A. Nogueiras, Interface databases: design and collection of a multilingual emotional speech database, in: Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC’02) Las Palmas de Gran Canaria, Spain, 2002, pp. 2019–2023.
H. Hu, M. Xu, W. Wu, Dimensions of emotional meaning in speech, in: Proceedings of the ISCA ITRW on Speech and Emotion, 2000, pp. 25–28.
H. Hu, M. Xu, W. Wu, GMM supervector based SVM with spectral features for speech emotion recognition, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2007. ICASSP 2007, vol. 4, 2007, pp. IV 413–IV 416.
H. Hu, M.-X. Xu, W. Wu, Fusion of global statistical and segmental spectral features for speech emotion recognition, in: International Speech Communication Association—8th Annual Conference of the International Speech Communication Association, Interspeech 2007, vol. 2, 2007, pp. 1013–1016.
Jain, 2000, Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell., 22, 4, 10.1109/34.824819
Johnstone, 2005, Affective speech elicited with a computer game, Emotion, 5, 513, 10.1037/1528-3542.5.4.513
Johnstone, 2000
Deller, 1993
Kleinginna, 1981, A categorized list of emotion definitions, with suggestions for a consensual definition, Motivation Emotion, 5, 345, 10.1007/BF00992553
J. Kaiser, On a simple algorithm to calculate the ‘energy’ of the signal, in: ICASSP-90, 1990, pp. 381–384.
E. Kim, K. Hyun, S. Kim, Y. Kwak, Speech emotion recognition using eigen-fft in clean and noisy environments, in: The 16th IEEE International Symposium on Robot and Human Interactive Communication, 2007, RO-MAN 2007, 2007, pp. 689–694.
Kuncheva, 2002, A theoretical study on six classifier fusion strategies, IEEE Trans. Pattern Anal. Mach. Intell., 24, 281, 10.1109/34.982906
Kuncheva, 2004
O. Kwon, K. Chan, J. Hao, T. Lee, Emotion recognition by speech signal, in: EUROSPEECH Geneva, 2003, pp. 125–128.
Lee, 2005, Toward detecting emotions in spoken dialogs, IEEE Trans. Speech Audio Process., 13, 293, 10.1109/TSA.2004.838534
C. Lee, S. Narayanan, R. Pieraccini, Classifying emotions in human–machine spoken dialogs, in: Proceedings of the ICME’02, vol. 1, 2002, pp. 737–740.
C. Lee, R. Pieraccini, Combining acoustic and language information for emotion recognition, in: Proceedings of the ICSLP 2002, 2002, pp. 873–876.
C. Lee, S. Yildrim, M. Bulut, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, S. Narayanan, Emotion recognition based on phoneme classes, in: Proceedings of ICSLP, 2004, pp. 2193–2196.
Leinonen, 1997, Expression of emotional-motivational connotations with a one-word utterance, J. Acoust. Soc. Am., 102, 1853, 10.1121/1.420109
X. Li, J. Tao, M.T. Johnson, J. Soltis, A. Savage, K.M. Leong, J.D. Newman, Stress and emotion classification using jitter and shimmer features, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2007. ICASSP 2007, vol. 4, April 2007, pp. IV-1081–IV-1084.
Lien, 2002, Detection, tracking and classification of action units in facial expression, J. Robotics Autonomous Syst., 31, 131, 10.1016/S0921-8890(99)00103-7
University of Pennsylvania Linguistic Data Consortium, Emotional prosody speech and transcripts 〈http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002S28〉, July 2002.
J. Liscombe, Prosody and speaker state: paralinguistics, pragmatics, and proficiency, Ph.D. Thesis, Columbia University, 2007.
D.G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the IEEE International Conference on Computer Vision, vol. 2, 1999, pp. 1150–1157.
M. Lugger, B. Yang, The relevance of voice quality features in speaker independent emotion recognition, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2007, ICASSP 2007, vol. 4, April 2007, pp. IV-17–IV-20.
M. Lugger, B. Yang, Psychological motivated multi-stage emotion classification exploiting voice quality features, in: F. Mihelic, J. Zibert (Eds.), Speech Recognition, In-Tech, 2008.
M. Lugger, B. Yang, Combining classifiers with diverse feature sets for robust speaker independent emotion recognition, in: Proceedings of EUSIPCO, 2009.
M. Lugger, B. Yang, W. Wokurek, Robust estimation of voice quality parameters under real-world disturbances, in: 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, 2006, ICASSP 2006 Proceedings, vol. 1, May 2006, pp. I–I.
J. Ma, H. Jin, L. Yang, J. Tsai (Eds.), Ubiquitous Intelligence and Computing: Third International Conference, UIC 2006, Wuhan, China, September 3–6, 2006, Proceedings (Lecture Notes in Computer Science), Springer-Verlag, New York, Inc., Secaucus, NJ, USA, 2006.
Markel, 1976
Mashao, 2006, Combining classifier decisions for robust speaker identification, Pattern Recognition, 39, 147, 10.1016/j.patcog.2005.08.004
Mesot, 2007, Switching linear dynamical systems for noise robust speech recognition, IEEE Trans. Audio Speech Language Process., 15, 1850, 10.1109/TASL.2007.901312
Mitra, 2002, Unsupervised feature selection using feature similarity, IEEE Trans. Pattern Anal. Mach. Intell., 24, 301, 10.1109/34.990133
Morrison, 2007, Ensemble methods for spoken emotion recognition in call-centres, Speech Commun., 49, 98, 10.1016/j.specom.2006.11.004
Murray, 1993, Toward a simulation of emotions in synthetic speech: A review of the literature on human vocal emotion, J. Acoust. Soc. Am., 93, 1097, 10.1121/1.405558
Nicholson, 2000, Emotion recognition in speech using neural networks, Neural Comput. Appl., 9, 290, 10.1007/s005210070006
Nwe, 2003, Speech emotion recognition using hidden Markov models, Speech Commun., 41, 603, 10.1016/S0167-6393(03)00099-2
O’Connor, 1973
A. Oster, A. Risberg, The identification of the mood of a speaker by hearing impaired listeners, Speech Transmission Lab. Quarterly Progress Status Report 4, Stockholm, 1986, pp. 79–90.
T. Otsuka, J. Ohya, Recognizing multiple persons’ facial expressions using HMM based on automatic extraction of significant frames from image sequences, in: Proceedings of the International Conference on Image Processing (ICIP-97), 1997, pp. 546–549.
T.L. Pao, Y.-T. Chen, J.-H. Yeh, W.-Y. Liao, Combining acoustic features for improved emotion recognition in Mandarin speech, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3784, 2005, pp. 279–285.
V. Petrushin, Emotion recognition in speech signal: experimental study, development and application, in: Proceedings of the ICSLP 2000, 2000, pp. 222–225.
Picard, 2001, Toward machine emotional intelligence: analysis of affective physiological state, IEEE Trans. Pattern Anal. Mach. Intell., 23, 1175, 10.1109/34.954607
Pierre-Yves, 2003, The production and recognition of emotions in speech: features and algorithms, Int. J. Human–Computer Stud., 59, 157, 10.1016/S1071-5819(02)00141-6
Rabiner, 1986, An introduction to hidden Markov models, IEEE ASSP Mag., 3, 4, 10.1109/MASSP.1986.1165342
Rabiner, 1993
Rabiner, 1978
A. Razak, R. Komiya, M. Abidin, Comparison between fuzzy and NN methods for speech emotion recognition, in: 3rd International Conference on Information Technology and Applications ICITA 2005, vol. 1, 2005, pp. 297–302.
Reynolds, 2000, Speaker verification using adapted Gaussian mixture models, Digital Signal Process., 10, 19, 10.1006/dspr.1999.0361
Reynolds, 1995, Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Trans. Speech Audio Process., 3, 72, 10.1109/89.365379
Rissanen, 1978, Modeling by shortest data description, Automatica, 14, 465, 10.1016/0005-1098(78)90005-5
Scherer, 1986, Vocal affect expression. A review and a model for future research, Psychological Bull., 99, 143, 10.1037/0033-2909.99.2.143
M. Schubiger, English intonation: its form and function, Niemeyer, Tubingen, Germany, 1958.
B. Schuller, Towards intuitive speech interaction by the integration of emotional aspects, in: 2002 IEEE International Conference on Systems, Man and Cybernetics, vol. 6, 2002, p. 6.
B. Schuller, M. Lang, G. Rigoll, Robust acoustic speech emotion recognition by ensembles of classifiers, in: Proceedings of the DAGA’05, 31, Deutsche Jahrestagung für Akustik, DEGA, 2005, pp. 329–330.
B. Schuller, S. Reiter, R. Muller, M. Al-Hames, M. Lang, G. Rigoll, Speaker independent speech emotion recognition by ensemble classification, in: IEEE International Conference on Multimedia and Expo, 2005. ICME 2005, 2005, pp. 864–867.
B. Schuller, G. Rigoll, M. Lang, Hidden Markov model-based speech emotion recognition, in: International Conference on Multimedia and Expo (ICME), vol. 1, 2003, pp. 401–404.
B. Schuller, G. Rigoll, M. Lang, Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture, in: Proceedings of the ICASSP 2004, vol. 1, 2004, pp. 577–580.
M.T. Shami, M.S. Kamel, Segment-based approach to the recognition of emotions in speech, in: IEEE International Conference on Multimedia and Expo, 2005. ICME 2005, 2005, 4 pp.
L.C. De Silva, T. Miyasato, R. Nakatsu, Facial emotion recognition using multi-modal information, in: Proceedings of 1997 International Conference on Information, Communications and Signal Processing (ICICS’97), vol. 1, September 1997, pp. 397–401.
Slaney, 2003, Babyears: a recognition system for affective vocalizations, Speech Commun., 39, 367, 10.1016/S0167-6393(02)00049-3
Stevens, 1994, Classification of glottal vibration from acoustic measurements, Vocal Fold Physiol., 147
R. Sun, E. Moore, J.F. Torres, Investigating glottal parameters for differentiating emotional categories with similar prosodics, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2009. ICASSP 2009, April 2009, pp. 4509–4512.
Tao, 2006, Prosody conversion from neutral speech to emotional speech, IEEE Trans. Audio Speech Language Process., 14, 1145, 10.1109/TASL.2006.876113
Teager, 1980, Some observations on oral air flow during phonation, IEEE Trans. Acoust. Speech Signal Process., 28, 599, 10.1109/TASSP.1980.1163453
H. Teager, S. Teager, Evidence for nonlinear production mechanisms in the vocal tract, in: Speech Production and Speech Modelling, Nato Advanced Institute, vol. 55, 1990, pp. 241–261.
Tsymbal, 2005, Diversity in search strategies for ensemble feature selection, Inf. Fusion, 6, 146
Tsymbal, 2003, Ensemble feature selection with the simple Bayesian classification, Inf. Fusion, 4, 146
D. Ververidis, C. Kotropoulos, Emotional speech classification using Gaussian mixture models and the sequential floating forward selection algorithm, in: IEEE International Conference on Multimedia and Expo, 2005. ICME 2005, July 2005, pp. 1500–1503.
Ververidis, 2006, Emotional speech recognition: resources, features and methods, Speech Commun., 48, 1162, 10.1016/j.specom.2006.04.003
D. Ververidis, C. Kotropoulos, I. Pitas, Automatic emotional speech classification, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004, Proceedings, (ICASSP ’04), vol. 1, 2004, pp. I-593–I-596.
Viterbi, 1967, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Trans. Inf. Theory, 13, 260, 10.1109/TIT.1967.1054010
Vlassis, 1999, A kurtosis-based dynamic approach to Gaussian mixture modeling, IEEE Trans. Syst. Man Cybern., 29, 393, 10.1109/3468.769758
Vlassis, 2002, A greedy EM algorithm for Gaussian mixture learning, Neural Process. Lett., 15, 77, 10.1023/A:1013844811137
Wang, 2006, A dynamic conditional random field model for foreground and shadow segmentation, IEEE Trans. Pattern Anal. Mach. Intell., 28, 279, 10.1109/TPAMI.2006.25
Williams, 1972, Emotions and speech: some acoustical correlates, J. Acoust. Soc. Am., 52, 1238, 10.1121/1.1913238
C. Williams, K. Stevens, Vocal correlates of emotional states, Speech Evaluation in Psychiatry, Grune and Stratton, 1981, pp. 189–220.
Witten, 2000
Womack, 1999, N-channel hidden Markov models for combined stressed speech classification and recognition, IEEE Trans. Speech Audio Process., 7, 668, 10.1109/89.799692
J. Wu, M.D. Mullin, J.M. Rehg, Linear asymmetric classifier for cascade detectors, in: 22nd International Conference on Machine Learning, 2005.
J.H.L. Hansen, S. Bou-Ghazale, Getting started with SUSAS: a speech under simulated and actual stress database, in: EUROSPEECH-97, vol. 4, 1997, pp. 1743–1746.
M. You, C. Chen, J. Bu, J. Liu, J. Tao, Emotion recognition from noisy speech, in: IEEE International Conference on Multimedia and Expo, 2006, 2006, pp. 1653–1656.
M. You, C. Chen, J. Bu, J. Liu, J. Tao, Emotional speech analysis on nonlinear manifold, in: 18th International Conference on Pattern Recognition, 2006. ICPR 2006, vol. 3, 2006, pp. 91–94.
M. You, C. Chen, J. Bu, J. Liu, J. Tao, A hierarchical framework for speech emotion recognition, in: IEEE International Symposium on Industrial Electronics, 2006, vol. 1, 2006, pp. 515–519.
Young, 1996, Large vocabulary continuous speech recognition, IEEE Signal Process. Mag., 13, 45, 10.1109/79.536824
Zhou, 2001, Nonlinear feature based classification of speech under stress, IEEE Trans. Speech Audio Process., 9, 201, 10.1109/89.905995