Direct enhancement of pre-trained speech embeddings for speech processing in noisy conditions

Computer Speech & Language - Volume 81 - Page 101501 - 2023
Mohamed Nabih Ali1,2, Alessio Brutti2, Daniele Falavigna2
1IECS Doctoral School, University of Trento, Trento, Italy
2Digital Society Center, Fondazione Bruno Kessler, Trento, Italy
