A hybrid feature-extracted deep CNN with reduced parameters substitutes an End-to-End CNN for the recognition of spoken Bengali digits

Multimedia Tools and Applications - Volume 83 - Pages 1669-1692 - 2023
Bachchu Paul¹, Santanu Phadikar²
¹ Department of Computer Science, Vidyasagar University, Midnapore, India
² Department of Computer Science & Engineering, Maulana Abul Kalam Azad University of Technology, West Bengal, Kolkata, India

Abstract

Speech recognition (SR) for native languages is an emerging field. Recognizing isolated words in a local language lets people use smartphones and electronic gadgets without technical or educational knowledge. This paper proposes a novel deep Convolutional Neural Network (CNN) architecture to classify ten spoken Bengali numerals. The proposed model achieves prediction accuracy nearly equal to that of an end-to-end CNN while training nine times fewer parameters. The raw audio samples are first pre-processed; then a hybrid feature consisting of Mel Frequency Cepstral Coefficients (MFCC), Spectral Sub-band Energy (SSE), and Log Spectral Sub-band Energy (LSSE) is extracted frame-wise and combined into a vector. Finally, these vectors are fed to the proposed one-dimensional CNN architecture, which achieves a highest test accuracy of 98.52%. The model has been trained on our own speech corpus of 14,000 spoken Bengali digits and on 30,000 spoken English digits from the audio-MNIST dataset. Because it trains several times fewer parameters, the proposed model keeps computational cost low while retaining high prediction accuracy. The proposed model is compared with several pre-trained deep learning models, and the results show its superiority. Source code: https://github.com/BachchuPaul/Bengali-Isolated-Spoken-Digit
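The frame-wise sub-band energy part of the hybrid feature can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the 400-sample frame with a 160-sample hop (25 ms and 10 ms at an assumed 16 kHz sampling rate), the Hamming window, the eight equal-width linear sub-bands, and the small `eps` added before the logarithm are all assumptions made for illustration. MFCC extraction is omitted; in a full pipeline the MFCCs would be concatenated with the SSE and LSSE values of each frame.

```python
import numpy as np


def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames, zero-padding the tail.

    frame_len and hop are assumed values, not taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    n_frames = 1 + max(0, int(np.ceil((len(signal) - frame_len) / hop)))
    padded_len = (n_frames - 1) * hop + frame_len
    padded = np.pad(signal, (0, padded_len - len(signal)), mode="constant")
    # Row i of idx holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return padded[idx]


def subband_energies(frames, n_bands=8, eps=1e-10):
    """Return (SSE, LSSE) for each frame.

    SSE is the summed power spectrum inside each of n_bands equal-width
    sub-bands (an assumed band layout); LSSE is its natural logarithm.
    """
    window = np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    bands = np.array_split(power, n_bands, axis=1)
    sse = np.stack([band.sum(axis=1) for band in bands], axis=1)
    lsse = np.log(sse + eps)  # eps avoids log(0) on silent frames
    return sse, lsse


def hybrid_features(signal, n_bands=8):
    """Concatenate SSE and LSSE per frame: shape (n_frames, 2 * n_bands)."""
    frames = frame_signal(signal)
    sse, lsse = subband_energies(frames, n_bands)
    return np.concatenate([sse, lsse], axis=1)


if __name__ == "__main__":
    sample_rate = 16000  # assumed sampling rate
    t = np.arange(sample_rate) / sample_rate
    tone = np.sin(2 * np.pi * 400.0 * t)  # one second of a 400 Hz test tone
    features = hybrid_features(tone)
    print(features.shape)
```

Each row of the returned matrix is one frame's feature vector, the kind of per-frame input a one-dimensional CNN could consume once the MFCCs are appended.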
