End-to-End Spoken Language Understanding: Performance analyses of a voice command task in a low resource setting

Computer Speech & Language - Volume 75 - Page 101369 - 2022
Thierry Desot1, François Portet1, Michel Vacher1
1Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France
