Audio-visual emotion recognition using multi-directional regression and Ridgelet transform

Journal on Multimodal User Interfaces - Tập 10 - Trang 325-333 - 2015
M. Shamim Hossain1, Ghulam Muhammad2
1Software Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
2Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia

Tóm tắt

In this paper, we propose an audio-visual emotion recognition system using multi-directional regression (MDR) audio features and ridgelet transform based face image features. MDR features capture directional derivative information in a spectro-temporal domain of speech, and, thereby, suitable to encode different levels of increasing or decreasing pitch and formant frequencies. For video inputs, interest points in a time frame are detected using spectro-temporal filters, and ridgelet transform is applied to cuboids around the interest points. Two separate extreme learning machine classifiers, one for speech modality and the other for face modality, are used. The scores of these two classifiers are fused using a Bayesian sum rule to make the final decision. Experimental results on eNTERFACE database show that the proposed method achieves accuracy of 85.06 % using bimodal inputs, 64.04 % using speech only, and 58.38 % using face only; these accuracies outnumber the accuracies obtained by some other state-of-the-art systems using the same database.

Tài liệu tham khảo

Bettadapura V (2012) Face expression recognition and analysis: the state of the art. College of Computing, Georgia Institute of Technology. arXiv:1203.6722v1

Martin O, Kotsia I, Macq B, Pitas I (2006) The eNTERFACE’05 audiovisual emotion database. In: Proceedings of ICDEW’2006, p 8, Atlanta, April 3–8

Hossain MS, Muhammad G, Song B, Hassan M, Alelaiwi A, Alamri A (2015) Audio-visual emotion-aware cloud gaming framework. IEEE Trans Circuits Syst Video Technol. doi:10.1109/TCSVT.2015.2444731