Feature compensation based on the normalization of vocal tract length for the improvement of emotion-affected speech recognition
Abstract
The performance of speech recognition systems trained on neutral utterances degrades significantly when these systems are tested with emotional speech. Since speakers naturally produce emotional speech in real-world environments, automatic speech recognition (ASR) systems must account for the emotional state of the speaker. Little work has been done on emotion-affected speech recognition, and most research to date has focused on classifying speech emotions. In this paper, vocal tract length normalization (VTLN) is employed to enhance the robustness of an emotion-affected speech recognition system. For this purpose, two ASR architectures are used: hybrids of the hidden Markov model (HMM) with either a Gaussian mixture model (GMM) or a deep neural network (DNN). Frequency warping is applied in the filterbank and/or discrete cosine transform domain(s) during feature extraction, and the warping is performed so that the emotional feature components are normalized toward their corresponding neutral feature components. The performance of the proposed system is evaluated under neutrally trained/emotionally tested conditions for several acoustic features and emotional states (i.e., Anger, Disgust, Fear, Happy, and Sad). The emotion-affected speech recognition system is built on the Kaldi ASR toolkit, with the Persian emotional speech database (Persian ESD) and the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D) as input corpora. The experiments show that, in general, the warped emotional features yield better recognition performance than their unwarped counterparts. They also show that the DNN-HMM system outperforms the GMM-HMM hybrid.
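To make the warping step concrete, the following is a minimal Python sketch, not the paper's exact procedure, of a piecewise-linear VTLN warp in the style of Lee and Rose applied to the mel filterbank edge frequencies before MFCC extraction. The function names, the breakpoint ratio f0_ratio=0.7, and the filterbank sizes are illustrative assumptions; the paper additionally considers warping in the DCT domain, which is not shown here.

```python
import numpy as np
from scipy.fft import dct

def piecewise_linear_warp(freqs, alpha, f_max, f0_ratio=0.7):
    """Piecewise-linear frequency warp (Lee-Rose style).
    Frequencies below the breakpoint f0 are scaled by alpha; above it,
    the mapping is linear and pinned so that f_max maps to f_max.
    The breakpoint ratio 0.7 is an illustrative assumption."""
    f0 = f0_ratio * f_max
    return np.where(
        freqs <= f0,
        alpha * freqs,
        alpha * f0 + (f_max - alpha * f0) * (freqs - f0) / (f_max - f0),
    )

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def warped_mel_filterbank(n_filters, n_fft, sr, alpha):
    """Triangular mel filterbank whose edge/center frequencies are
    VTLN-warped by factor alpha before the filters are built."""
    f_max = sr / 2.0
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(f_max), n_filters + 2)
    hz = piecewise_linear_warp(mel_to_hz(mels), alpha, f_max)  # apply warp
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for j in range(l, c):                       # rising slope
            fbank[i, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):                       # falling slope
            fbank[i, j] = (r - j) / max(r - c, 1)
    return fbank

def warped_mfcc(power_spec, sr, alpha=1.0, n_filters=23, n_ceps=13):
    """MFCCs computed from a power spectrum (frames x (n_fft//2 + 1))
    using the warped filterbank; alpha = 1.0 gives standard MFCCs."""
    n_fft = 2 * (power_spec.shape[-1] - 1)
    fbank = warped_mel_filterbank(n_filters, n_fft, sr, alpha)
    log_energies = np.log(power_spec @ fbank.T + 1e-10)
    return dct(log_energies, type=2, norm="ortho")[..., :n_ceps]
```

In the setting described in the abstract, the warp factor alpha would be chosen per emotional state so that the warped emotional features move closer to their neutral counterparts; alpha = 1 recovers the unwarped baseline features used for comparison.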