Speaker recognition with global information modelling of raw waveforms
Abstract
Keywords
References
Kabir, M. M., Mridha, M. F., Shin, J., Jahan, I., & Ohi, A. Q. (2021). A survey of speaker recognition: Fundamental theories, recognition methods and opportunities. IEEE Access, 9, 79236–79263.
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D. & Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5329–5333.
Desplanques, B., Thienpondt, J. & Demuynck, K. (2020). ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv:2005.07143.
Chung, J.S., Nagrani, A. & Zisserman, A. (2018). VoxCeleb2: Deep speaker recognition. arXiv:1806.05622.
Cai, W., Chen, J., Zhang, J., & Li, M. (2020). On-the-fly data loader and utterance-level aggregation for speaker and language recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 1038–1051.
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P. (2010). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech and Language Processing, 19(4), 788–798.
Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449–12460.
Sainath, T., Weiss, R.J., Wilson, K., Senior, A.W. & Vinyals, O. (2015). Learning the speech front-end with raw waveform CLDNNs. Interspeech.
Ravanelli, M. & Bengio, Y. (2018). Speaker recognition from raw waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT), 1021–1028.
Oglic, D., Cvetkovic, Z., Bell, P. & Renals, S. (2020). A deep 2D convolutional network for waveform-based speech recognition. Interspeech, 1654–1658.
Pariente, M., Cornell, S., Deleforge, A. & Vincent, E. (2020). Filterbank design for end-to-end speech separation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6364–6368.
Jung, J.W., Heo, H.S., Yang, I.H., Shim, H.J. & Yu, H.J. (2018). A complete end-to-end speaker verification system using deep neural networks: From raw signals to verification result. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5349–5353.
Muckenhirn, H., Doss, M.M. & Marcel, S. (2018). Towards directly modeling raw speech signal for speaker verification using CNNs. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4884–4888.
Zhu, G., Jiang, F. & Duan, Z. (2020). Y-vector: Multiscale waveform encoder for speaker embedding. arXiv:2010.12951.
Jung, J.W., Kim, Y.J., Heo, H.S., Lee, B.J., Kwon, Y. & Chung, J.S. (2022). Pushing the limits of raw waveform speaker recognition. arXiv:2203.08488.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N. & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., & Zhang, W. (2021). Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, 35(12), 11106–11115.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z. & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022.
Wang, R., Ao, J., Zhou, L., Liu, S., Wei, Z., Ko, T. & Zhang, Y. (2022). Multi-view self-attention based transformer for speaker recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6732–6736.
Noé, P.G., Parcollet, T. & Morchid, M. (2020). CGCNN: Complex Gabor convolutional neural network on raw speech. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7724–7728.
Andén, J., & Mallat, S. (2014). Deep scattering spectrum. IEEE Transactions on Signal Processing, 62(16), 4114–4128.
Balestriero, R., Cosentino, R., Glotin, H. & Baraniuk, R. (2018). Spline filters for end-to-end deep learning. In International Conference on Machine Learning, PMLR, 364–373.
Jung, J.W., Heo, H.S., Kim, J.H., Shim, H.J. & Yu, H.J. (2019). RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification. arXiv:1904.08104.
Ba, J.L., Kiros, J.R. & Hinton, G.E. (2016). Layer normalization. arXiv:1607.06450.
Kingma, D.P. & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.