Music emotion recognition using recurrent neural networks and pretrained models

Journal of Intelligent Information Systems - Volume 57 - Pages 531-546 - 2021
Jacek Grekow1
1Faculty of Computer Science, Bialystok University of Technology, Bialystok, Poland

Abstract

The article presents experiments using recurrent neural networks for emotion detection in musical segments. Trained regression models were used to predict continuous emotion values along the axes of Russell's circumplex model. A process of audio feature extraction and of creating sequential data for training networks with long short-term memory (LSTM) units is presented. The models were implemented using the WekaDeeplearning4j package, and a number of experiments were carried out on data with different feature sets and varying segmentation. The experiments demonstrate both the usefulness of dividing the data into sequences and the value of recurrent networks for recognizing emotion in music, with results that even exceeded those of the SVM regression algorithm. The author analyzed the effect of the network structure and the feature set used on the performance of the regressors predicting values on the two axes of the emotion model: arousal and valence. Finally, the use of a pretrained model for processing audio features and training a recurrent network on the resulting feature sequences is presented.
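As an illustration of the pipeline summarized above, the following is a minimal sketch of an LSTM regressor that maps a sequence of segment-level audio feature vectors to continuous arousal and valence values. The paper's models were built with the WekaDeeplearning4j package; this sketch uses PyTorch as an illustrative stand-in, and all names, dimensions, and hyperparameters (number of features, sequence length, hidden units) are assumptions for illustration, not the author's configuration.

```python
import torch
import torch.nn as nn

class EmotionLSTMRegressor(nn.Module):
    """Illustrative LSTM regressor mapping a sequence of audio feature
    vectors to continuous (arousal, valence) values."""

    def __init__(self, n_features=84, hidden_size=32):
        super().__init__()
        # The LSTM reads the feature sequence extracted from consecutive
        # segments of a recording; n_features and hidden_size are
        # placeholder values, not the paper's configuration.
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        # A linear head maps the last hidden state to the two axes of
        # Russell's circumplex model: arousal and valence.
        self.head = nn.Linear(hidden_size, 2)

    def forward(self, x):
        # x: (batch, seq_len, n_features)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])  # (batch, 2)

# Hypothetical usage: 8 excerpts, each split into 6 segments with
# 84 features per segment; targets are annotated arousal/valence.
model = EmotionLSTMRegressor()
features = torch.randn(8, 6, 84)
targets = torch.randn(8, 2)
loss = nn.MSELoss()(model(features), targets)
loss.backward()  # one regression step; add an optimizer in practice
```

Feeding the network sequences of per-segment features, rather than a single vector averaged over the whole excerpt, is what lets a recurrent model exploit the temporal structure of the music, which the article credits for the improvement over SVM regression.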

References

Aljanaki, A., Yang, Y.H., & Soleymani, M. (2017). Developing a benchmark for emotional analysis of music. PLoS ONE, 12(3).
Bachorik, J., Bangert, M., Loui, P., Larke, K., Berger, J., Rowe, R., & Schlaug, G. (2009). Emotion in motion: Investigating the time-course of emotional judgments of musical stimuli. Music Perception, 26, 355–364.
Bogdanov, D., Wack, N., Gómez, E., Gulati, S., Herrera, P., Mayor, O., Roma, G., Salamon, J., Zapata, J., & Serra, X. (2013). ESSENTIA: An audio analysis library for music information retrieval. In Proceedings of the 14th international society for music information retrieval conference (pp. 493–498).
Choi, K., Fazekas, G., Sandler, M.B., & Cho, K. (2017). Transfer learning for music classification and regression tasks. In S.J. Cunningham, Z. Duan, X. Hu, & D. Turnbull (Eds.) Proceedings of the 18th international society for music information retrieval conference, ISMIR 2017, Suzhou, China, October 23-27, 2017 (pp. 141–149).
Chowdhury, S., Portabella, A.V., Haunschmid, V., & Widmer, G. (2019). Towards explainable music emotion recognition: The route via mid-level features. In Proceedings of the 20th international society for music information retrieval conference, ISMIR 2019, Delft, The Netherlands (pp. 237–243).
Coutinho, E., Trigeorgis, G., Zafeiriou, S., & Schuller, B. (2015). Automatically estimating emotion in music with deep long-short term memory recurrent neural networks. In Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany.
Delbouys, R., Hennequin, R., Piccoli, F., Royo-Letelier, J., & Moussallam, M. (2018). Music mood detection based on audio and lyrics with deep neural net. In Proceedings of the 19th international society for music information retrieval conference, ISMIR 2018, Paris, France (pp. 370–375).
Gers, F.A., Schmidhuber, J., & Cummins, F.A. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12, 2451–2471.
Grekow, J. (2015). Audio features dedicated to the detection of four basic emotions. In Computer information systems and industrial management: 14th IFIP TC 8 international conference, CISIM 2015, Warsaw, Poland, September 24-26, 2015, Proceedings (pp. 583–591). Springer International Publishing.
Grekow, J. (2016). Music emotion maps in arousal-valence space. In Computer information systems and industrial management: 15th IFIP TC8 international conference, CISIM 2016, Vilnius, Lithuania, Proceedings (pp. 697–706). Springer International Publishing.
Grekow, J. (2017). Audio features dedicated to the detection of arousal and valence in music recordings. In 2017 IEEE international conference on innovations in intelligent systems and applications (INISTA) (pp. 40–44). IEEE. https://doi.org/10.1109/INISTA.2017.8001129.
Grekow, J. (2018a). Human annotation. In From content-based music emotion recognition to emotion maps of musical pieces (pp. 13–24). Cham: Springer International Publishing.
Grekow, J. (2018b). Musical performance analysis in terms of emotions it evokes. Journal of Intelligent Information Systems, 51(2), 415–437. https://doi.org/10.1007/s10844-018-0510-y.
Grekow, J. (2020). Static music emotion recognition using recurrent neural networks. In D. Helic, G. Leitner, M. Stettinger, A. Felfernig, & Z.W. Ras (Eds.) Foundations of intelligent systems - 25th international symposium, ISMIS 2020, Graz, Austria, September 23-25, 2020, Proceedings, Lecture Notes in Computer Science, vol. 12117 (pp. 150–160). Springer.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I.H. (2009). The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11(1), 10–18.
Hamel, P., Davies, M.E.P., Yoshii, K., & Goto, M. (2013). Transfer learning in MIR: Sharing learned latent representations for music audio classification and similarity. In A. Jr de Souza Britto, F. Gouyon, & S. Dixon (Eds.) Proceedings of the 14th international society for music information retrieval conference, ISMIR 2013, Curitiba, Brazil, November 4-8, 2013 (pp. 9–14).
Humphrey, E.J., Bello, J.P., & LeCun, Y. (2012). Moving beyond feature design: Deep architectures and automatic feature learning in music informatics. In F. Gouyon, P. Herrera, L.G. Martins, & M. Müller (Eds.) Proceedings of the 13th international society for music information retrieval conference, ISMIR 2012, Mosteiro S. Bento Da Vitória, Porto, Portugal, October 8-12, 2012 (pp. 403–408).
Lang, S., Bravo-Marquez, F., Beckham, C., Hall, M., & Frank, E. (2019). WekaDeeplearning4j: A deep learning package for Weka based on Deeplearning4j. Knowledge-Based Systems, 178, 48–50.
Lu, L., Liu, D., & Zhang, H.J. (2006). Automatic mood detection and tracking of music audio signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1), 5–18.
Oord, A., Dieleman, S., & Schrauwen, B. (2014). Transfer learning by supervised pre-training for audio-based music classification. In H. Wang, Y. Yang, & J.H. Lee (Eds.) Proceedings of the 15th international society for music information retrieval conference, ISMIR 2014, Taipei, Taiwan, October 27-31, 2014 (pp. 29–34).
Panda, R., Malheiro, R.M., & Paiva, R.P. (2020). Audio features for music emotion recognition: A survey. IEEE Transactions on Affective Computing. https://doi.org/10.1109/TAFFC.2020.3032373.
Patra, B., Das, D., & Bandyopadhyay, S. (2017). Labeling data and developing supervised framework for Hindi music mood analysis. Journal of Intelligent Information Systems, 48, 633–651. https://doi.org/10.1007/s10844-016-0436-1.
Patra, B., Das, D., & Bandyopadhyay, S. (2018). Multimodal mood classification of Hindi and western songs. Journal of Intelligent Information Systems, 51, 579–596. https://doi.org/10.1007/s10844-018-0497-4.
Russell, J.A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.
Tzanetakis, G., & Cook, P. (2000). MARSYAS: A framework for audio analysis. Organised Sound, 4(3), 169–175.
Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5), 293–302. https://doi.org/10.1109/TSA.2002.800560.
Weninger, F., Eyben, F., & Schuller, B. (2014). On-line continuous-time music mood regression with deep recurrent neural networks. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5412–5416).
Witten, I.H., Frank, E., Hall, M.A., & Pal, C.J. (2016). Data mining: Practical machine learning tools and techniques, 4th edn. USA: Morgan Kaufmann Publishers Inc.