TWACapsNet: a capsule network with a two-way attention mechanism for speech emotion recognition

Soft Computing - Pages 1-13 - 2023
Xin-Cheng Wen1, Kun-Hong Liu2, Yan Luo3, Jiaxin Ye4, Liyan Chen2
1Department of Computer Science, Harbin Institute of Technology (Shenzhen), Shenzhen, China
2School of Film, Xiamen University, Xiamen, China
3School of Software and Microelectronics, Peking University, Beijing, China
4Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China

Abstract

Speech emotion recognition (SER) is a challenging task, and conventional convolutional neural networks (CNNs) cannot handle raw audio data well, because CNNs tend to capture local information while overlooking global characteristics. This paper proposes a Capsule Network with a Two-Way Attention mechanism (TWACapsNet) to address the SER problem. TWACapsNet accepts spatial and spectral features as inputs, and convolutional and capsule layers are deployed to process these two feature types in two separate ways. Two attention mechanisms are then designed to enhance the information obtained from the spatial and spectral features. Finally, the outputs of the two ways are combined to form the final decision. The advantages of TWACapsNet are verified through experiments on multiple SER datasets, and the experimental results show that the proposed method outperforms widely used neural network models on three typical SER datasets. Moreover, the combination of the two ways contributes to the higher and more stable performance of TWACapsNet.
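The two-way design described above — per-branch attention over spatial and spectral features, followed by fusion into a single decision — can be illustrated with a minimal NumPy sketch. All shapes, weights, and the simple softmax attention below are illustrative assumptions, not the paper's actual layers (which use convolutional and capsule blocks):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(feats, w):
    # feats: (T, d) frame-level features; w: (d,) hypothetical attention params.
    scores = softmax(feats @ w)   # one attention weight per frame
    return scores @ feats         # attention-weighted summary -> (d,)

# Hypothetical stand-ins for the two branches' feature maps
# (in the paper these would come from conv/capsule layers).
spatial  = rng.standard_normal((50, 16))   # spatial-feature branch
spectral = rng.standard_normal((50, 16))   # spectral-feature branch

w_spa = rng.standard_normal(16)            # attention params, branch 1
w_spe = rng.standard_normal(16)            # attention params, branch 2
W_out = rng.standard_normal((32, 4))       # 4 emotion classes (illustrative)

# Each branch is attended separately, then the two summaries are fused
# and mapped to class probabilities for the final decision.
fused = np.concatenate([attention_pool(spatial, w_spa),
                        attention_pool(spectral, w_spe)])
probs = softmax(fused @ W_out)
print(probs.shape)  # (4,)
```

The point of the sketch is the structure: two independent attention mechanisms, one per feature type, whose outputs are only combined at the decision stage.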

Keywords

#Speech emotion recognition #capsule network #two-way attention mechanism #convolutional neural network #spatial features #spectral features
