Direct enhancement of pre-trained speech embeddings for speech processing in noisy conditions