Audiovisual emotion recognition using ANOVA feature selection method and multi-classifier neural networks

Neural Computing and Applications - Volume 24 - Pages 399-412 - 2012
Mahdi Bejani1, Davood Gharavian2, Nasrollah Moghaddam Charkari3
1Islamic Azad University, Tehran, Iran
2EE Department, Shahid Abbaspour University, Tehran, Iran
3Distributed Processing Lab, Tarbiat Modares University, Tehran, Iran

Abstract

To make human–computer interaction more natural and friendly, computers must be able to understand human affective states the same way humans do. People express their feelings through many modalities, such as the face, body gestures, and speech. In this study, we simulate human perception of emotion by combining emotion-related information from facial expressions and speech. The speech emotion recognition subsystem is based on prosody features and mel-frequency cepstral coefficients (a representation of the short-term power spectrum of a sound); the facial expression recognition subsystem is based on the integrated time motion image and the quantized image matrix, which can be seen as extensions of temporal templates. Experimental results showed that using hybrid features and decision-level fusion improves on the unimodal systems: the recognition rate rises by about 15 % relative to the speech-only system and by about 30 % relative to the facial expression system. With the proposed multi-classifier system, an improved hybrid system, the recognition rate increases by up to 7.5 % over hybrid features and decision-level fusion with RBF, up to 22.7 % over the speech-based system, and up to 38 % over the facial expression-based system.
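The two core ingredients named in the abstract, ANOVA-based feature selection and decision-level fusion of per-modality classifiers, can be sketched in a few lines. This is a minimal illustration under assumed interfaces (the function names and the equal-weight fusion rule are hypothetical), not the authors' implementation:

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F-statistic per feature: ratio of between-class
    variance to within-class variance. Features with higher F separate
    the emotion classes better and are kept for classification."""
    classes = np.unique(y)
    n, d = X.shape
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(d)
    ss_within = np.zeros(d)
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        ss_between += len(Xc) * (class_mean - grand_mean) ** 2
        ss_within += ((Xc - class_mean) ** 2).sum(axis=0)
    df_between = len(classes) - 1
    df_within = n - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

def decision_level_fusion(p_speech, p_face, w=0.5):
    """Decision-level fusion: combine class posteriors from the speech
    and face classifiers with a weighted sum; argmax gives the fused label."""
    return w * p_speech + (1 - w) * p_face
```

For example, `anova_f_scores` applied to a matrix of prosody/MFCC features ranks them so only the top-scoring ones are fed to the classifiers, and `decision_level_fusion` then merges the two unimodal outputs into one decision.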

References

Devillers L, Vidrascu L (2006) Real-life emotions detection with lexical and paralinguistic cues on human call center dialogs. In: Proceedings of the Interspeech, pp 801–804
Lee C-C, Mower E, Busso C, Lee S, Narayanan S (2009) Emotion recognition using a hierarchical binary decision tree approach. In: Proceedings of the Interspeech, pp 320–323
Polzehl T, Sundaram S, Ketabdar H, Wagner M, Metze F (2009) Emotion classification in children's speech using fusion of acoustic and linguistic features. In: Proceedings of the Interspeech, pp 340–343
Klein J, Moon Y, Picard RW (2002) This computer responds to user frustration: theory, design and results. Interact Comput 14:119–140
Oudeyer P-Y (2003) The production and recognition of emotions in speech: features and algorithms. Int J Hum Comput Stud 59:157–183
Mansoorizadeh M, Moghaddam Charkari N (2009) Hybrid feature and decision level fusion of face and speech information for bimodal emotion recognition. In: Proceedings of the 14th international CSI computer conference
Ambady N, Rosenthal R (1992) Thin slices of expressive behavior as predictors of interpersonal consequences: a meta-analysis. Psychol Bull 111(2):256–274
Ekman P, Rosenberg EL (2005) What the face reveals: basic and applied studies of spontaneous expression using the facial action coding system (FACS), 2nd edn. Oxford University Press, Oxford
Mehrabian A (1968) Communication without words. Psychol Today 2:53–56
Greenwald M, Cook E, Lang P (1989) Affective judgment and psychophysiological response: dimensional covariation in the evaluation of pictorial stimuli. J Psychophysiol 3:51–64
Zeng Z, Pantic M, Roisman GI, Huang TS (2009) A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans Patt Anal Mach Intell 31:39–58
Pantic M, Rothkrantz LJM (2000) Automatic analysis of facial expressions: the state of the art. IEEE Trans Patt Anal Mach Intell 22:1424–1445
De Silva LC, Pei Chi N (2000) Bimodal emotion recognition. In: Proceedings of the fourth IEEE international conference on automatic face and gesture recognition, vol 1, pp 332–335
Song M, You M, Li N, Chen C (2008) A robust multimodal approach for emotion recognition. Neurocomputing 71:1913–1920
Hoch S, Althoff F, McGlaun G, Rigoll G (2005) Bimodal fusion of emotional data in an automotive environment. In: Proceedings of the international conference on acoustics, speech, and signal processing, vol 2, pp 1085–1088
Wang Y, Guan L (2005) Recognizing human emotion from audiovisual information. In: Proceedings of the international conference on acoustics, speech, and signal processing, pp 1125–1128
Paleari M, Benmokhtar R, Huet B (2008) Evidence theory-based multimodal emotion recognition. In: MMM '09, pp 435–446
Sheikhan M, Bejani M, Gharavian D (2012) Modular neural-SVM scheme for speech emotion recognition using ANOVA feature selection method. Neural Comput Appl. doi:10.1007/s00521-012-0814-8
Lee CM, Narayanan SS (2005) Toward detecting emotions in spoken dialogs. IEEE Trans Speech Audio Process 13:293–303
Gharavian D, Ahadi SM (2005) The effect of emotion on Farsi speech parameters: a statistical evaluation. In: Proceedings of the international conference on speech and computer, pp 463–466
Ververidis D, Kotropoulos C (2006) Emotional speech recognition: resources, features, and methods. Speech Commun 48:1162–1181
Shami M, Verhelst W (2007) An evaluation of the robustness of existing supervised machine learning approaches to the classification of emotions in speech. Speech Commun 49:201–212
Altun H, Polat G (2009) Boosting selection of speech related features to improve performance of multiclass SVMs in emotion detection. Expert Syst Appl 36:8197–8203
Gharavian D, Sheikhan M, Nazerieh AR, Garoucy S (2011) Speech emotion recognition using FCBF feature selection method and GA-optimized fuzzy ARTMAP neural network. Neural Comput Appl. doi:10.1007/s00521-011-0643-1
Sheikhan M, Safdarkhani MK, Gharavian D (2011) Emotion recognition of speech using small-size selected feature set and ANN-based classifiers: a comparative study. World Appl Sci J 14:616–625
Fersini E, Messina E, Archetti F (2012) Emotional states in judicial courtrooms: an experimental investigation. Speech Commun 54:11–22
Albornoz EM, Milone DH, Rufiner HL (2011) Spoken emotion recognition using hierarchical classifiers. Comput Speech Lang 25:556–570
López-Cózar R, Silovsky J, Kroul M (2011) Enhancement of emotion detection in spoken dialogue systems by combining several information sources. Speech Commun 53:1210–1228
Boersma P, Weenink D (2007) Praat: doing phonetics by computer (version 4.6.12) [computer program]
Bobick AF, Davis JW (2001) The recognition of human movement using temporal templates. IEEE Trans Patt Anal Mach Intell 23(3):257–267
Valstar MF, Patras I, Pantic M (2004) Facial action unit recognition using temporal templates. In: IEEE international workshop on human robot interactive communication
Osadchy M, Keren D (2004) A rejection-based method for event detection in video. IEEE Trans Circuits Syst Video Technol 14(4):534–541
Li N, Dettmer S, Shah M (1997) Visually recognizing speech using eigensequences. In: Motion-based recognition. Kluwer, Boston, pp 345–371
Babu RV, Ramakrishnan KR (2004) Recognition of human actions using motion history information extracted from the compressed video. Image Vis Comput 22:597–607
Sadoghi Yazdi H, Amintoosi M, Fathy M (2007) Facial expression recognition with QIM and ITMI spatio-temporal database. In: 4th Iranian conference on machine vision and image processing, Mashhad, Iran, pp 14–15 (in Persian)
Intel, OpenCV: open source computer vision library. http://www.intel.com/research/mrl/research/opencv/
Ebrahimpour R (2007) View-independent face recognition with mixture of experts. Dissertation, The Institute for Research in Fundamental Sciences (IPM)
Ghaderi R (2000) Arranging simple neural networks to solve complex classification problems. Dissertation, Surrey University
Wolpert DH (1992) Stacked generalization. Complex Syst 5:241–259
Martin O, Kotsia I, Macq B, Pitas I (2006) The eNTERFACE '05 audio-visual emotion database. In: Proceedings of the 22nd international conference on data engineering workshops (ICDEW '06)
Paleari M, Huet B (2008) Toward emotion indexing of multimedia excerpts. In: CBMI
Burkhardt F, Paeschke A, Rolfes M, Sendlmeier W, Weiss B (2005) A database of German emotional speech. In: Interspeech, Lisbon, Portugal
Mansoorizadeh M, Moghaddam Charkari N (2009) Multimodal information fusion application to human emotion recognition from face and speech. Multimed Tools Appl
Kanade T, Cohn J, Tian Y (2000) Comprehensive database for facial expression analysis. In: IEEE international conference on face and gesture recognition (AFGR '00), pp 46–53
SPSS (2007) Clementine® 12.0 algorithms guide. Integral Solutions Limited, Chicago
Zeng Z, Hu Y, Roisman GI, Wen Z, Fu Y, Huang TS (2007) Audio-visual spontaneous emotion recognition. Artif Intell Hum Comput 4451:72–90
Busso C et al (2004) Analysis of emotion recognition using facial expressions, speech and multimodal information. In: Proceedings of the sixth ACM international conference on multimodal interfaces (ICMI '04), pp 205–211
Cheng-Yao C, Yue-Kai H, Cook P (2005) Visual/acoustic emotion recognition, pp 1468–1471
Schuller B, Arsic D, Rigoll G, Wimmer M, Radig B (2007) Audiovisual behavior modeling by combined feature spaces. In: ICASSP, pp 733–736