A computationally efficient speech emotion recognition system employing machine learning classifiers and ensemble learning
Abstract
Speech Emotion Recognition (SER) is the process of recognizing and classifying emotions expressed through speech. SER greatly facilitates personalized and empathetic interactions, enhances user experiences, enables sentiment analysis, and finds applications in psychology, healthcare, entertainment, and gaming. However, accurately detecting and classifying emotions is a highly challenging task for machines because of the complexity and multifaceted nature of emotions. This work presents a comparative analysis of two approaches to emotion recognition on original and augmented speech signals. The first approach extracts 39 Mel Frequency Cepstral Coefficient (MFCC) features; the second converts speech into MFCC spectrograms and extracts features with pre-trained deep learning models such as MobileNet V2, VGG16, Inception V3, VGG19, and ResNet50. These features are then evaluated with machine learning classifiers, namely SVM, Linear SVM, Naive Bayes, k-Nearest Neighbours, Logistic Regression, and Random Forest. The experiments show that the SVM classifier performs best with every feature extraction technique. To further improve the results, ensemble techniques combining SVM with CatBoost and a Voting classifier were applied, yielding test accuracies of 97.04% on the RAVDESS dataset, 93.24% on the SAVEE dataset, and 99.83% on the TESS dataset. Notably, both approaches are computationally efficient, since the feature extractors require no training: features are either hand-crafted MFCCs or embeddings from fixed, pre-trained networks.
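As a concrete illustration of the first pipeline together with the ensemble step, the minimal sketch below extracts 39 time-averaged MFCCs per utterance and soft-votes an RBF SVM with CatBoost. This is not the authors' code: the file paths, labels, sampling choices, and hyperparameters are placeholder assumptions.

```python
# A minimal sketch (not the authors' exact code) of the MFCC + SVM approach plus
# the ensemble step: 39 MFCCs averaged over time, an RBF SVM, and CatBoost
# combined through soft voting. Paths, labels, and hyperparameters below are
# illustrative assumptions.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

def mfcc_vector(path, n_mfcc=39):
    """Load one clip and summarize its MFCCs by their mean over time."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # one fixed-length 39-dim vector per utterance

# Placeholder dataset listing; in practice these would be RAVDESS/SAVEE/TESS clips.
files = ["clips/happy_01.wav", "clips/sad_01.wav"]   # hypothetical paths
labels = ["happy", "sad"]                            # hypothetical emotion labels

X = np.stack([mfcc_vector(f) for f in files])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)

svm = SVC(kernel="rbf", probability=True)  # probability=True enables soft voting
cat = CatBoostClassifier(verbose=0)
voter = VotingClassifier([("svm", svm), ("cat", cat)], voting="soft")
voter.fit(X_tr, y_tr)
print("test accuracy:", voter.score(X_te, y_te))
```

For the second approach, the MFCC spectrogram images would instead pass through a frozen, pre-trained CNN (e.g., MobileNet V2) and the resulting embeddings would feed the same classifiers.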