A novel dual-modal emotion recognition algorithm fusing hybrid features of audio signals and speech context
Abstract
Accurate emotion recognition remains a challenging problem in human–machine interaction. This paper explores the possibility of performing both feature abstraction and fusion with homogeneous network components, and proposes a dual-modal emotion recognition framework composed of a parallel convolution (Pconv) module and an attention-based bidirectional long short-term memory (BLSTM) module. The Pconv module extracts multidimensional social features along parallel paths, providing more effective representation capacity, while the attention-based BLSTM module strengthens the extraction of key information and preserves the relevance between pieces of information. Experiments on the CH-SIMS dataset show that recognition accuracy reaches 74.70% on audio data and 77.13% on text, while the dual-modal fusion model reaches 90.02%. These experiments confirm the feasibility of processing heterogeneous information within homogeneous network components and demonstrate that the attention-based BLSTM module coordinates best with the feature fusion realized by the Pconv module. This offers great flexibility for modality expansion and architecture design.
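To make the described architecture concrete, the following PyTorch sketch shows one plausible arrangement of a parallel-convolution (Pconv) audio branch and an attention-based BLSTM text branch. All layer counts, kernel widths, hidden sizes, and the concatenation-based fusion head are illustrative assumptions, not the authors' published configuration; in particular, the abstract states that fusion is realized by the Pconv module, whereas this sketch fuses the two branches by simple concatenation for clarity.

# Minimal sketch of a dual-modal Pconv + attention-BLSTM classifier (assumed configuration).
import torch
import torch.nn as nn

class Pconv(nn.Module):
    """Parallel 1-D convolution branches over an input feature sequence."""
    def __init__(self, in_channels, branch_channels=64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_channels, branch_channels, k, padding=k // 2),
                nn.BatchNorm1d(branch_channels),
                nn.ReLU(),
            )
            for k in kernel_sizes
        ])

    def forward(self, x):                        # x: (batch, channels, time)
        # Every branch sees the same input; outputs are concatenated channel-wise.
        return torch.cat([b(x) for b in self.branches], dim=1)

class AttnBLSTM(nn.Module):
    """Bidirectional LSTM followed by additive attention pooling over time."""
    def __init__(self, input_size, hidden_size=128):
        super().__init__()
        self.blstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden_size, 1)

    def forward(self, x):                        # x: (batch, time, features)
        h, _ = self.blstm(x)                     # (batch, time, 2*hidden)
        w = torch.softmax(self.score(h), dim=1)  # attention weights over time steps
        return (w * h).sum(dim=1)                # weighted sum -> (batch, 2*hidden)

class DualModalClassifier(nn.Module):
    """Audio branch (Pconv) and text branch (attention-BLSTM) fused for classification."""
    def __init__(self, audio_dim, text_dim, num_classes=3):
        super().__init__()
        self.audio_branch = Pconv(audio_dim)
        self.audio_pool = nn.AdaptiveAvgPool1d(1)
        self.text_branch = AttnBLSTM(text_dim)
        self.fusion = nn.Sequential(
            nn.Linear(3 * 64 + 2 * 128, 128), nn.ReLU(), nn.Linear(128, num_classes)
        )

    def forward(self, audio, text):
        # audio: (batch, time, audio_dim), e.g. MFCC frames; text: (batch, tokens, text_dim)
        a = self.audio_branch(audio.transpose(1, 2))   # (batch, 3*64, time)
        a = self.audio_pool(a).squeeze(-1)             # (batch, 3*64)
        t = self.text_branch(text)                     # (batch, 2*128)
        return self.fusion(torch.cat([a, t], dim=1))   # (batch, num_classes)

For example, a batch of 16 utterances with 300 frames of 40-dimensional acoustic features and 50 tokens of 300-dimensional word embeddings would be scored with DualModalClassifier(40, 300)(torch.randn(16, 300, 40), torch.randn(16, 50, 300)), yielding a (16, 3) logit tensor.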