Cross-corpora spoken language identification with domain diversification and generalization
References
Adi, 2019, To reverse the gradient or not: an empirical comparison of adversarial and multi-task learning in speech recognition, 3742
Alumäe, 2022, Pretraining approaches for spoken language recognition: TalTech submission to the OLR 2021 challenge, 240
Benyassine, 1997, A silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications (recommendation G.729 annex B), IEEE Commun. Mag., 35, 64, 10.1109/35.620527
Berouti, 1979, Enhancement of speech corrupted by acoustic noise, 208
Beyan, 2021, RealVAD: A real-world dataset and a method for voice activity detection by body motion analysis, IEEE Trans. Multimed., 23, 2071, 10.1109/TMM.2020.3007350
Blanchard, 2011, Generalizing from several related classification tasks to a new unlabeled sample, Adv. Neural Inf. Process. Syst., 24, 2178
Brookes, 1997
Brümmer, 2006, Application-independent evaluation of speaker detection, Comput. Speech Lang., 20, 230, 10.1016/j.csl.2005.08.001
Caruana, 1997, Multitask learning, Mach. Learn., 28, 41, 10.1023/A:1007379606734
Cha, 2021, SWAD: Domain generalization by seeking flat minima, 22405
Chakraborty, 2021, DenseRecognition of spoken languages, 9674
Chen, 2019, A graph embedding framework for maximum mean discrepancy-based domain adaptation algorithms, IEEE Trans. Image Process., 29, 199, 10.1109/TIP.2019.2928630
Chettri, 2021, Data quality as predictor of voice anti-spoofing generalization, 1659
Clark, 2019, The state of speech in HCI: Trends, themes and challenges, Interact. Comput., 31, 349, 10.1093/iwc/iwz016
Deng, W., Zheng, L., Ye, Q., Kang, G., Yang, Y., Jiao, J., 2018. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In: CVPR. pp. 994–1003.
Desplanques, 2020, ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification, 1
Dey, 2021, Cross-corpora language recognition: A preliminary investigation with Indian languages, 546
Dey, 2022, An overview on Indian spoken language recognition from machine learning perspective, ACM Trans. Asian Low Resour. Lang. Inf. Process., 21, 1, 10.1145/3523179
Ding, 2017, Deep domain generalization with structured low-rank constraint, IEEE Trans. Image Process., 27, 304, 10.1109/TIP.2017.2758199
Doire, 2016, Single-channel online enhancement of speech corrupted by reverberation and noise, IEEE/ACM Trans. Audio Speech Lang. Process., 25, 572, 10.1109/TASLP.2016.2641904
Du, 2021, Data augmentation for end-to-end code-switching speech recognition, 194
Duroselle, R., Jouvet, D., Illina, I., 2020. Metric Learning Loss Functions to Reduce Domain Mismatch in the x-Vector Space for Language Recognition. In: INTERSPEECH. pp. 447–451.
Ferrer, 2022, A discriminative hierarchical PLDA-based model for spoken language recognition, IEEE/ACM Trans. Audio Speech Lang. Process., 30, 2396, 10.1109/TASLP.2022.3190736
Ganin, 2016, Domain-adversarial training of neural networks, J. Mach. Learn. Res., 17, 2030
Garcia-Romero, 2020, MagNetO: X-vector magnitude estimation network plus offset for improved speaker recognition, 1
Gerczuk, 2021, EmoNet: A transfer learning framework for multi-corpus speech emotion recognition, IEEE Trans. Affect. Comput., 1, 10.1109/TAFFC.2021.3135152
Gerkmann, 2011, Unbiased MMSE-based noise power estimation with low complexity and low tracking delay, IEEE Trans. Audio Speech Lang. Process., 20, 1383, 10.1109/TASL.2011.2180896
Gideon, 2021, Improving cross-corpus speech emotion recognition with adversarial discriminative domain generalization (ADDoG), IEEE Trans. Affect. Comput., 12, 1055, 10.1109/TAFFC.2019.2916092
Gillespie, 2017, Cross-database models for the classification of dysarthria presence, 3127
Gonzalez-Dominguez, 2015, Frame-by-frame language identification in short utterances using deep neural networks, Neural Netw., 64, 49, 10.1016/j.neunet.2014.08.006
Greenberg, 2012, The 2011 NIST language recognition evaluation, 34
Gretton, 2006, A kernel method for the two-sample-problem, Adv. Neural Inf. Process. Syst., 19
Grollmisch, 2021, Analyzing the potential of pre-trained embeddings for audio classification tasks, 790
Gulrajani, I., Lopez-Paz, D., 2021. In Search of Lost Domain Generalization. In: International Conference on Learning Representations.
Hu, 2017, Cross-dataset and cross-cultural music mood prediction: A case on Western and Chinese pop songs, IEEE Trans. Affect. Comput., 8, 228, 10.1109/TAFFC.2016.2523503
Iqbal, 2021, Enhancing audio augmentation methods with consistency learning, 646
Kang, W., Alam, M.J., Fathan, A., 2022. Deep learning-based end-to-end spoken language identification system for domain-mismatched scenario. In: Language Resources and Evaluation Conference. pp. 7339–7343.
Karen, 2017
Khosla, 2012, Undoing the damage of dataset bias, 158
Korshunov, 2019, A cross-database study of voice presentation attack detection, 363
Kumawat, 2021, Applying TDNN architectures for analyzing duration dependencies on speech emotion recognition, 3410
Li, 2021, Deep joint learning for language recognition, Neural Netw., 141, 72, 10.1016/j.neunet.2021.03.026
Li, 2013, Spoken language recognition: from fundamentals to practice, Proc. IEEE, 101, 1136, 10.1109/JPROC.2012.2237151
Li, 2020, AP20-OLR challenge: Three tasks and their baselines, 550
Liu, 2022, PHO-LID: A unified model incorporating acoustic-phonetic and phonotactic information for language identification, 2233
Liu, 2022, Efficient self-supervised learning representations for spoken language identification, IEEE J. Sel. Top. Signal Process., 16, 1296, 10.1109/JSTSP.2022.3201445
Liu, 2022, Enhancing language identification using dual-mode model with knowledge distillation, 248
Long, 2015, Learning transferable features with deep adaptation networks, 97
Lopez-Moreno, 2016, On the use of deep feedforward neural networks for automatic language identification, Comput. Speech Lang., 40, 46, 10.1016/j.csl.2016.03.001
Loshchilov, I., Hutter, F., 2018. Decoupled Weight Decay Regularization. In: ICLR.
Maity, 2012, IITKGP-MLILSC speech database for language identification, 1
Mandava, 2019, An investigation of LSTM-CTC based joint acoustic model for Indian language identification, 389
Mandava, 2019, Attention based residual-time delay neural network for Indian language identification, 1
Martinez, 2011, Language recognition in iVectors space
Mauch, M., Ewert, S., 2013. The Audio Degradation Toolbox and its Application to Robustness Evaluation. In: International Society for Music Information Retrieval Conference. ISMIR, Curitiba, Brazil.
Mohamed, 2022, Self-supervised speech representation learning: A review, IEEE J. Sel. Top. Signal Process., 16, 1179, 10.1109/JSTSP.2022.3207050
Monteiro, 2019, Residual convolutional neural network with attentive feature pooling for end-to-end language identification from short-duration speech, Comput. Speech Lang., 58, 364, 10.1016/j.csl.2019.05.006
Moreno-Torres, 2012, A unifying view on dataset shift in classification, Pattern Recognit., 45, 521, 10.1016/j.patcog.2011.06.019
Mozilla, 2020
Mushtaq, 2020, Environmental sound classification using a regularized deep convolutional neural network with data augmentation, Appl. Acoust., 167, 10.1016/j.apacoust.2020.107389
Nadimpalli, 2022, On improving cross-dataset generalization of deepfake detectors, 91
Padi, 2020, Towards relevance and sequence modeling in language recognition, IEEE/ACM Trans. Audio Speech Lang. Process., 28, 1223, 10.1109/TASLP.2020.2983580
Pan, 2010, Domain adaptation via transfer component analysis, IEEE Trans. Neural Netw., 22, 199, 10.1109/TNN.2010.2091281
Pandey, 2022, Self-attending RNN for speech enhancement to improve cross-corpus generalization, IEEE/ACM Trans. Audio Speech Lang. Process., 30, 1374, 10.1109/TASLP.2022.3161143
Park, 2019, SpecAugment: A simple data augmentation method for automatic speech recognition, 2613
Paszke, A., Gross, S., Massa, F., Lerer, A., 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In: NeurIPS. pp. 8024–8035.
Paul, 2017, Generalization of spoofing countermeasures: A case study with ASVspoof 2015 and BTAS 2016 corpora, 2047
Povey, 2011, The Kaldi speech recognition toolkit
Radford, 2022
Reddy, 2013, Identification of Indian languages using multi-level spectral and prosodic features, Int. J. Speech Technol., 16, 489, 10.1007/s10772-013-9198-0
Ribas, D., Vincent, E., Calvo, J.R., 2016. A study of speech distortion conditions in real scenarios for speech processing applications. In: Spoken Language Technology Workshop. SLT, pp. 13–20.
Rossenbach, 2020, Generating synthetic audio data for attention-based speech recognition systems, 7069
Ruder, 2017
Sadjadi, 2018, The 2017 NIST language recognition evaluation
Salamon, 2017, Deep convolutional neural networks and data augmentation for environmental sound classification, IEEE Signal Process. Lett., 24, 279, 10.1109/LSP.2017.2657381
Sarfjoo, S., Madikeri, S., Motlicek, P., Marcel, S., 2020. Supervised Domain Adaptation for Text-Independent Speaker Verification Using Limited Data. In: INTERSPEECH. pp. 3815–3819.
Schuller, 2010, Cross-corpus acoustic emotion recognition: Variances and strategies, IEEE Trans. Affect. Comput., 1, 119, 10.1109/T-AFFC.2010.8
Shen, 2017, Conditional generative adversarial nets classifier for spoken language identification, 2814
Singh, 2021, Non-linear frequency warping using constant-Q transformation for speech emotion recognition, 1
Snyder, 2015
Snyder, D., et al., 2018a. Spoken language recognition using x-vectors. In: Odyssey: The Speaker and Language Recognition Workshop. pp. 105–111.
Snyder, 2018, X-vectors: Robust DNN embeddings for speaker recognition, 5329
Sturm, 2014, A simple method to determine if a music information retrieval system is a “horse”, IEEE Trans. Multimed., 16, 1636, 10.1109/TMM.2014.2330697
Tang, 2019, AP19-OLR challenge: Three tasks and their baselines, 1917
Thienpondt, 2022, Tackling the score shift in cross-lingual speaker verification by exploiting language information, 7187
Toledo-Ronen, 2013, Voice-based sadness and anger recognition with cross-corpora evaluation, 7517
Tong, 2021, ASV-subtools: Open source toolkit for automatic speaker verification, 6184
Tsakalidis, 2005, Acoustic training from heterogeneous data sources: Experiments in Mandarin conversational telephone speech transcription, 461
Valk, 2021, VoxLingua107: a dataset for spoken language recognition, 652
Vlasenko, 2013, Parameter optimization issues for cross-corpora emotion classification, 454
Vlasenko, 2014, Modeling phonetic pattern variability in favor of the creation of robust emotion classifiers for real-life applications, Comput. Speech Lang., 28, 483, 10.1016/j.csl.2012.11.003
Vuddagiri, 2018, IIITH-ILSC speech database for Indian language identification, 56
Wang, 2018, Additive margin softmax for face verification, IEEE Signal Process. Lett., 25, 926, 10.1109/LSP.2018.2822810
Wang, 2018, Deep visual domain adaptation: A survey, Neurocomputing, 312, 135, 10.1016/j.neucom.2018.05.083
Wang, 2018, Transferable joint attribute-identity deep learning for unsupervised person re-identification, 2275
Wei, 2020, A comparison on data augmentation methods based on deep learning for audio classification
Xia, 2021, Self-supervised text-independent speaker verification using prototypical momentum contrastive learning, 6723
Yan, 2017, Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation, 2272
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D., 2018. mixup: Beyond empirical risk minimization. In: ICLR.
Zhang, 2020, Unsupervised multi-class domain adaptation: Theory, algorithms, and practice, IEEE Trans. Pattern Anal. Mach. Intell., 10.1109/TPAMI.2020.3036956
Zhang, 2011, Unsupervised learning in cross-corpus acoustic emotion recognition, 523
Zhang, 2021, A survey on multi-task learning, IEEE Trans. Knowl. Data Eng., 1
Zhou, 2022, Domain generalization: A survey, IEEE Trans. Pattern Anal. Mach. Intell.
Zhu, 2015, A transfer learning approach to cross-database facial expression recognition, 293
Zhu, 2020, Deep subdomain adaptation network for image classification, IEEE Trans. Neural Netw. Learn. Syst., 32, 1713, 10.1109/TNNLS.2020.2988928
Zhuang, 2020, A comprehensive survey on transfer learning, Proc. IEEE, 109, 43, 10.1109/JPROC.2020.3004555