A computationally efficient speech emotion recognition system employing machine learning classifiers and ensemble learning
Abstract
Speech Emotion Recognition (SER) is the process of recognizing and classifying emotions expressed through speech. SER greatly facilitates personalized and empathetic interactions, enhances user experiences, enables sentiment analysis, and finds applications in psychology, healthcare, entertainment, and gaming. However, accurately detecting and classifying emotions is a highly challenging task for machines because of the complexity and multifaceted nature of emotions. This work presents a comparative analysis of two approaches to emotion recognition on original and augmented speech signals. The first approach extracts 39 Mel Frequency Cepstral Coefficient (MFCC) features; the second converts speech into MFCC spectrograms and extracts features with pre-trained deep learning models such as MobileNet V2, VGG16, Inception V3, VGG19, and ResNet50. These features are then evaluated with machine learning classifiers, namely SVM, Linear SVM, Naive Bayes, k-Nearest Neighbours, Logistic Regression, and Random Forest. The experiments show that the SVM classifier performs best with every feature extraction technique. To further improve the results, ensemble techniques combining SVM with CatBoost and a Voting classifier were applied, yielding test accuracies of 97.04% on the RAVDESS dataset, 93.24% on the SAVEE dataset, and 99.83% on the TESS dataset. Notably, both approaches are computationally efficient, since the feature extractors require no training: features are either hand-crafted MFCCs or embeddings from fixed, pre-trained networks.
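As a concrete illustration of the first pipeline together with the ensemble step, the minimal sketch below extracts 39 time-averaged MFCCs per utterance and soft-votes an RBF SVM with CatBoost. This is not the authors' code: the file paths, labels, sampling choices, and hyperparameters are placeholder assumptions.

```python
# A minimal sketch (not the authors' exact code) of the MFCC + SVM approach plus
# the ensemble step: 39 MFCCs averaged over time, an RBF SVM, and CatBoost
# combined through soft voting. Paths, labels, and hyperparameters below are
# illustrative assumptions.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

def mfcc_vector(path, n_mfcc=39):
    """Load one clip and summarize its MFCCs by their mean over time."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # one fixed-length 39-dim vector per utterance

# Placeholder dataset listing; in practice these would be RAVDESS/SAVEE/TESS clips.
files = ["clips/happy_01.wav", "clips/sad_01.wav"]   # hypothetical paths
labels = ["happy", "sad"]                            # hypothetical emotion labels

X = np.stack([mfcc_vector(f) for f in files])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)

svm = SVC(kernel="rbf", probability=True)  # probability=True enables soft voting
cat = CatBoostClassifier(verbose=0)
voter = VotingClassifier([("svm", svm), ("cat", cat)], voting="soft")
voter.fit(X_tr, y_tr)
print("test accuracy:", voter.score(X_te, y_te))
```

For the second approach, the MFCC spectrogram images would instead pass through a frozen, pre-trained CNN (e.g., MobileNet V2) and the resulting embeddings would feed the same classifiers.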