End-to-end acoustic modelling for phone recognition of young readers

Speech Communication - Volume 134 - Pages 71-84 - 2021
Lucile Gelin1,2, Morgane Daniel2, Julien Pinquier1, Thomas Pellegrini1
1IRIT, Paul Sabatier University, CNRS, Toulouse, France
2Lalilo, France
