Building a neural speech recognizer for quranic recitations

Suhad Al-Issa1, Mahmoud Al-Ayyoub2, Osama Al-Khaleel1, Nouh Elmitwally3,4
1Department of Computer Engineering, Jordan University of Science and Technology, Irbid, Jordan
2Department of Computer Science, Jordan University of Science and Technology, Irbid, Jordan
3School of Computing and Digital Technology, Birmingham City University, Birmingham, UK
4Department of Computer Science, Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt

Abstract

This work is an effort towards building a neural speech recognizer (NSR) system for Quranic recitations that can be used effectively by anyone, regardless of gender and age. Although many recitations are available online, most are recorded by professional adult male reciters, which means that an automatic speech recognition (ASR) system trained on such datasets would not work well for female or child reciters. We address this gap by adopting a benchmark dataset of audio recordings of Quranic recitations by reciters of both genders and different ages. Using this dataset, we build several speaker-independent NSR systems based on the DeepSpeech model and evaluate them using the word error rate (WER). The goal is to show how an NSR system trained and tuned on a dataset of one gender performs on a test set from the other gender. Unfortunately, the number of female recitations in our dataset is rather small, while the number of male recitations is much larger. In the first set of experiments, we avoid the imbalance between the two genders by down-sampling the male part to match the female part. For this small subset of our dataset, the results are interesting: the system trained on male recitations achieves a WER of 0.968 when tested on female recitations and 0.406 when tested on male recitations. Conversely, the system trained on female recitations achieves a WER of 0.966 when tested on male recitations and 0.608 when tested on female recitations.
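The WER metric used above is the standard word-level edit distance (substitutions, insertions, and deletions) normalized by the length of the reference transcript. As an illustrative sketch (not the authors' evaluation code), it can be computed with a simple dynamic-programming routine:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)
```

Under this definition, a WER near 1.0 (as in the cross-gender experiments) means nearly every reference word was substituted, deleted, or required an insertion, i.e. the recognizer's output is almost entirely wrong for that test set.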
