Cross-corpora spoken language identification with domain diversification and generalization

Computer Speech & Language - Volume 81 - Page 101489 - 2023
Spandan Dey¹, Md Sahidullah², Goutam Saha¹
¹ Department of Electronics & Electrical Communication Engineering, Indian Institute of Technology Kharagpur, Kharagpur, 721302, India
² Université de Lorraine, CNRS, Inria, LORIA, F-54000, Nancy, France
