Negative emotions in the target speaker’s voice enhance speech recognition under “cocktail-party” environments

Attention, Perception, & Psychophysics - Volume 83 - Pages 247-259 - 2020
Lingxi Lu (1,2), Yu Ding (1,2), Chuanwei Xue (3), Liang Li (1,2,3)
(1) School of Psychological and Cognitive Sciences, Peking University, Beijing, China
(2) Speech and Hearing Research Center, Key Laboratory on Machine Perception (Ministry of Education), Peking University, Beijing, China
(3) Beijing Institute for Brain Disorders, Capital Medical University, Beijing, China

Abstract

In a “cocktail-party” environment with multiple simultaneous talkers, recognition of target speech is improved by a number of perceptual unmasking cues. It remains unclear whether emotions embedded in the target speaker’s voice can improve speech perception on their own or interact with other cues to facilitate speech perception against a masker background. This study used two target-speaker voices with different emotional valences to examine whether recognition of target speech is modulated by emotional valence when the target speech and the maskers were perceptually co-located or separated. The results showed that both speech recognition against the masker background and the separation-induced unmasking effect were higher for the target speaker with a negative emotional voice than for the target speaker with a positive emotional voice. Moreover, when the negative voice was fear conditioned, target-speech recognition was further improved against speech informational masking. These results suggest that the emotional vocal unmasking cue interacts significantly with the perceived spatial-separation unmasking cue, facilitating unmasking against a masking background. Thus, emotional features embedded in the target speaker’s vocal timbre are also useful for unmasking the target speech in “cocktail-party” environments.
