DEMoS: an Italian emotional speech corpus

Springer Science and Business Media LLC, Volume 54, pages 341–383, 2019
Emilia Parada-Cabaleiro1, Giovanni Costantini2, Anton Batliner1, Maximilian Schmitt1, Björn W. Schuller1
1ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany
2Department of Electronic Engineering, University of Rome “Tor Vergata”, Rome, Italy

Abstract

We present DEMoS (Database of Elicited Mood in Speech), a new, large database of Italian emotional speech: 68 speakers, some 9 k speech samples. As Italian is under-represented in speech emotion research, we model the ‘big 6’ emotions and guilt to enable comparison with the state of the art. Besides making this database available for research, our contribution is three-fold: First, we employ a variety of mood induction procedures, whose combinations are especially tailored for specific emotions. Second, we use combinations of selection procedures such as an alexithymia test and self- and external assessment, obtaining 1.5 k (proto-)typical samples; these were used in a perception test (86 native Italian subjects, categorical identification and dimensional rating). Third, machine learning techniques—based on standardised brute-forced openSMILE ComParE features and support vector machine classifiers—were applied to assess how emotional typicality and sample size might affect machine learning performance. Our results are three-fold as well: First, we show that appropriate induction techniques ensure the collection of valid samples, whereas the type of self-assessment employed turned out not to be a meaningful measurement. Second, emotional typicality—which shows up in an acoustic analysis of the main prosodic features—in contrast to sample size, is not an essential feature for successfully training machine learning models. Third, the perceptual findings demonstrate that the confusion patterns mostly relate to cultural rules and to ambiguous emotions.
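The classification setup described above—standardised acoustic feature vectors fed to a linear support vector machine—can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: the feature matrix here is synthetic, standing in for the openSMILE ComParE vectors (6373 dimensions in the real feature set), and the class count, complexity parameter, and data sizes are illustrative assumptions.

```python
# Sketch: standardised features + linear SVM for emotion classification.
# Random data stands in for openSMILE ComParE feature vectors; all
# hyperparameters below are illustrative assumptions, not the paper's values.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 200, 50, 7  # 'big 6' emotions + guilt
X = rng.normal(size=(n_samples, n_features))   # placeholder feature matrix
y = rng.integers(0, n_classes, size=n_samples) # placeholder emotion labels

# Standardisation (zero mean, unit variance) followed by a linear SVM,
# as is common for high-dimensional brute-forced acoustic feature sets.
clf = make_pipeline(StandardScaler(), LinearSVC(C=1e-3, max_iter=10000))
clf.fit(X, y)
pred = clf.predict(X)  # one predicted emotion label per sample
```

In practice, the complexity parameter C would be tuned on a development partition, and the feature extraction step would be handled by openSMILE before the data reaches this classifier.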
