Gammatonegram representation for end-to-end dysarthric speech processing tasks: speech recognition, speaker identification, and intelligibility assessment

Aref Farhadipour1, Hadi Veisi2
1Department of Computational Linguistics, University of Zurich, Zurich, Switzerland
2Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran

Abstract

Dysarthria is a motor speech disorder that disrupts the human speech production system and reduces the quality and intelligibility of a person's speech. As a result, conventional speech processing systems do not work reliably on such impaired speech. Because dysarthria is often accompanied by physical disabilities, a system that can carry out smart-home tasks in response to voice commands would be a significant achievement. In this work, we introduce the Gammatonegram as an effective representation of audio files that preserves discriminative detail and can serve as input to convolutional neural networks. In other words, we convert each speech file into an image and propose an image recognition system to classify speech in different scenarios. The proposed convolutional neural networks are built by transfer learning on the pre-trained AlexNet. We evaluate the proposed system on speech recognition, speaker identification, and intelligibility assessment tasks. On the UA-Speech dataset, the proposed speech recognition system achieves a 91.29% word recognition rate in speaker-dependent mode, the speaker identification system achieves an 87.74% recognition rate in text-dependent mode, and the intelligibility assessment system achieves a 96.47% recognition rate in two-class mode. Finally, we propose a fully automatic multi-network speech recognition system: the two-class intelligibility assessment system is placed first in a cascade, and its output activates one of the speech recognition networks. This architecture achieves a word recognition rate of 92.3%.
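For concreteness, the sketch below shows one common way to compute a Gammatonegram: the waveform is passed through an ERB-spaced gammatone filterbank, per-channel energy is summed over short frames, and the log of the result forms a time-frequency image. This is a minimal illustration of the general technique under stated assumptions, not the authors' exact front end; the `gammatonegram` helper and its parameter choices (64 channels, 4th-order filters, 25 ms frames with 10 ms hop) are our own.

```python
import numpy as np
from scipy.signal import fftconvolve


def erb(f):
    """Equivalent rectangular bandwidth (Hz) at centre frequency f (Glasberg & Moore)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)


def gammatonegram(wave, fs, n_channels=64, f_min=50.0,
                  frame_len=0.025, hop_len=0.010, order=4):
    """Log-energy time-frequency image from an ERB-spaced gammatone filterbank.

    Returns an (n_channels, n_frames) array; channel 0 is the lowest band.
    Parameter defaults are illustrative, not taken from the paper.
    """
    # Centre frequencies spaced uniformly on the ERB-rate scale.
    erb_rate = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    inv_erb_rate = lambda r: (10 ** (r / 21.4) - 1.0) * 1000.0 / 4.37
    rates = np.linspace(erb_rate(f_min), erb_rate(0.9 * fs / 2), n_channels)
    cfs = inv_erb_rate(rates)

    # FIR approximation of each gammatone impulse response (50 ms long).
    t = np.arange(int(0.05 * fs)) / fs
    frame, hop = int(frame_len * fs), int(hop_len * fs)
    n_frames = 1 + (len(wave) - frame) // hop
    gram = np.empty((n_channels, n_frames))
    for ch, fc in enumerate(cfs):
        b = 1.019 * erb(fc)
        ir = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
        ir /= np.sqrt(np.sum(ir ** 2))                 # unit-energy normalisation
        y = fftconvolve(wave, ir, mode="same")         # band-limited signal
        for i in range(n_frames):                      # frame-wise log energy
            seg = y[i * hop: i * hop + frame]
            gram[ch, i] = np.log10(np.sum(seg ** 2) + 1e-10)
    return gram
```

Each resulting matrix can then be rescaled to a fixed-size image at AlexNet's expected input resolution before classification.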
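The abstract also describes transfer learning on a pre-trained AlexNet and a cascade in which the two-class intelligibility classifier routes each utterance to one of the speech recognition networks. A minimal PyTorch sketch of both ideas follows; the torchvision AlexNet and its final-layer replacement are standard, but the class counts, the routing logic, and treating the Gammatonegram as a 3-channel image are our assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models


def make_alexnet_classifier(num_classes):
    """AlexNet with ImageNet weights; only the final layer is replaced for the new task."""
    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    net.classifier[6] = nn.Linear(4096, num_classes)  # new task-specific head
    return net


# One network per task (class counts are illustrative, not from the paper).
intelligibility_net = make_alexnet_classifier(num_classes=2)   # high vs. low intelligibility
asr_net_high = make_alexnet_classifier(num_classes=100)        # e.g., a 100-word vocabulary
asr_net_low = make_alexnet_classifier(num_classes=100)

for net in (intelligibility_net, asr_net_high, asr_net_low):
    net.eval()  # disable dropout for inference


def recognize(gram_image: torch.Tensor) -> torch.Tensor:
    """Cascade: the intelligibility decision selects which ASR network to run.

    `gram_image` is a Gammatonegram resized to AlexNet's input, shape (1, 3, 224, 224).
    The class-0 = "high intelligibility" mapping is an assumption for illustration.
    """
    with torch.no_grad():
        level = intelligibility_net(gram_image).argmax(dim=1).item()
        asr = asr_net_high if level == 0 else asr_net_low
        return asr(gram_image).argmax(dim=1)           # predicted word index
```

In a typical transfer-learning recipe, one would first train only the replaced head on Gammatonegram images and optionally unfreeze later convolutional layers afterwards with a small learning rate.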

Keywords

