Deep Learning-Based End-to-End Speaker Identification Using Time–Frequency Representation of Speech Signal

Circuits, Systems, and Signal Processing, Vol. 43, No. 3, pp. 1839–1861, 2024
Banala Saritha1, Mohammad Azharuddin Laskar1, Anish Monsley Kirupakaran1, Rabul Hussain Laskar1, Madhuchhanda Choudhury1, Nirupam Shome2
1Department of Electronics and Communication Engineering, National Institute of Technology Silchar, Silchar, India
2Department of Electronics and Communication Engineering, Assam University, Silchar, India


References

P.K. Ajmera, D.V. Jadhav, R.S. Holambe, Text-independent speaker identification using Radon and discrete cosine transforms based features from speech spectrogram. Pattern Recognit. 44(10–11), 2749–2759 (2011). https://doi.org/10.1016/j.patcog.2011.04.009

N.N. An, N.Q. Thanh, Y. Liu, Deep CNNs with self-attention for speaker identification. IEEE Access 7, 85327–85337 (2019). https://doi.org/10.1109/ACCESS.2019.2917470

T. Arias-Vergara, P. Klumpp, J.C. Vasquez-Correa, E. Nöth, J.R. Orozco-Arroyave, M. Schuster, Multi-channel spectrograms for speech processing applications using deep learning methods. Pattern Anal. Appl. 24(2), 423–431 (2021). https://doi.org/10.1007/s10044-020-00921-5

A. Ashar, M.S. Bhatti, U. Mushtaq, Speaker identification using a hybrid CNN-MFCC approach, in 2020 International Conference on Emerging Trends in Smart Technologies (ICETST), 2020. https://doi.org/10.1109/ICETST49965.2020.9080730

H. Beigi, Speaker recognition: advancements and challenges, in New Trends and Developments in Biometrics (InTech, 2012), pp. 3–30. https://doi.org/10.5772/52023

S. Bunrit, T. Inkian, N. Kerdprasop, K. Kerdprasop, Text-independent speaker identification using deep learning model of convolution neural network. Int. J. Mach. Learn. Comput. 9(2), 143–148 (2019). https://doi.org/10.18178/ijmlc.2019.9.2.778

W. Cai, J. Chen, M. Li, Exploring the encoding layer and loss function in end-to-end speaker and language recognition system, in The Speaker and Language Recognition Workshop (Odyssey 2018), pp. 74–81, 2018. https://doi.org/10.21437/Odyssey.2018-11

G. Chen, C. Parada, G. Heigold, Small-footprint keyword spotting using deep neural networks, in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, No. i, pp. 4087–4091, 2014. https://doi.org/10.1109/ICASSP.2014.6854370

S. Ding, T. Chen, X. Gong, W. Zha, Z. Wang, AutoSpeech: Neural architecture search for speaker recognition, in Proc. Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH, vol. 2020-October, pp. 916–920, 2020. https://doi.org/10.21437/Interspeech.2020-1258

S.A. El-Moneim, M.A. Nassar, M.I. Dessouky, N.A. Ismail, A.S. El-Fishawy, F.E. Abd El-Samie, Text-independent speaker recognition using LSTM-RNN and speech enhancement. Multimed. Tools Appl. 79(33–34), 24013–24028 (2020). https://doi.org/10.1007/s11042-019-08293-7

S. Farsiani, H. Izadkhah, S. Lotfi, An optimum end-to-end text-independent speaker identification system using convolutional neural network. Comput. Electr. Eng. 100, 107882 (2022). https://doi.org/10.1016/j.compeleceng.2022.107882

M. Hajibabaei, D. Dai, Unified hypersphere embedding for speaker recognition. arXiv preprint arXiv:1807.08312 (2018). https://doi.org/10.48550/arXiv.1807.08312

M.R. Hasan, M.M. Hasan, M.Z. Hossain, How many Mel-frequency cepstral coefficients to be utilized in speech recognition? A study with the Bengali language. J. Eng. 2021(12), 817–827 (2021). https://doi.org/10.1049/tje2.12082

K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. https://doi.org/10.1109/CVPR.2016.90

R. Jahangir, Y.W. Teh, H.F. Nweke, G. Mujtaba, M.A. Al-Garadi, I. Ali, Speaker identification through artificial intelligence techniques: A comprehensive review and research challenges. Expert Syst. Appl. 171, 114591 (2021). https://doi.org/10.1016/j.eswa.2021.114591

J.W. Jung, H.S. Heo, J.H. Kim, H.J. Shim, H.J. Yu, RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification, in Proc. Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH, vol. 2019-September, pp. 1268–1272, 2019. https://doi.org/10.21437/Interspeech.2019-1982

J.W. Jung, H.S. Heo, I.H. Yang, H.J. Shim, H.J. Yu, A complete end-to-end speaker verification system using deep neural networks: From raw signals to verification result, in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2018-April, pp. 5349–5353, 2018. https://doi.org/10.1109/ICASSP.2018.8462575

M. Karu, T. Alumäe, Weakly supervised training of speaker identification models, in The Speaker and Language Recognition Workshop (Odyssey 2018), pp. 24–30, 2018. https://doi.org/10.21437/Odyssey.2018-4

A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017). https://doi.org/10.1145/3065386

A. Larcher, K.A. Lee, B. Ma, H. Li, Text-dependent speaker verification: classifiers, databases and RSR2015. Speech Commun. 60, 56–77 (2014). https://doi.org/10.1016/j.specom.2014.03.001

M.A. Laskar, R.H. Laskar, Integrating DNN–HMM technique with hierarchical multi-layer acoustic model for text-dependent speaker verification. Circuits Syst. Signal Process. 38(8), 3548–3572 (2019). https://doi.org/10.1007/s00034-019-01103-3

M.A. Laskar, R.H. Laskar, HiLAM-aligned kernel discriminant analysis for text-dependent speaker verification. Expert Syst. Appl. 182, 115281 (2021). https://doi.org/10.1016/j.eswa.2021.115281

M. Tan, Q.V. Le, EfficientNet: rethinking model scaling for convolutional neural networks, in Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 6105–6114, 2019. https://doi.org/10.48550/arXiv.1905.11946

J. Lee, H.G. Kang, Two-stage refinement of magnitude and complex spectra for real-time speech enhancement. IEEE Signal Process. Lett. 29, 2188–2192 (2022). https://doi.org/10.1109/LSP.2022.3215100

T. Matsui, S. Furui, Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMM’s. IEEE Trans. Speech Audio Process. 2(3), 456–459 (1994). https://doi.org/10.1109/89.294363

H. Meng, T. Yan, F. Yuan, H. Wei, Speech emotion recognition from 3D log-mel spectrograms with deep learning network. IEEE Access 7, 125868–125881 (2019). https://doi.org/10.1109/ACCESS.2019.2938007

A. Nagrani, J.S. Chung, A. Zisserman, VoxCeleb: A large-scale speaker identification dataset, in Proc. Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH, vol. 2017-August, pp. 2616–2620, 2017. https://doi.org/10.21437/Interspeech.2017-950

P.K. Nayana, D. Mathew, A. Thomas, Comparison of text independent speaker identification systems using GMM and i-vector methods. Procedia Comput. Sci. 115, 47–54 (2017). https://doi.org/10.1016/j.procs.2017.09.075

I. Orović, S. Stanković, N. Žarić, Robust speech watermarking procedure in the time-frequency domain. EURASIP J. Adv. Signal Process. 2008, 1–9 (2008). https://doi.org/10.1155/2008/519206

R.D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, M. Allerhand, Complex sounds and auditory images, in Auditory Physiology and Perception, vol 83 (Elsevier, 1992), pp. 429–446. https://doi.org/10.1016/B978-0-08-041847-6.50054-X

S. Pruzansky, Pattern-matching procedure for automatic talker recognition. J. Acoust. Soc. Am. 35(3), 354–358 (1963). https://doi.org/10.1121/1.1918467

M. Ravanelli, Y. Bengio, Speaker recognition from raw waveform with SincNet, in 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 1021–1028, 2018. https://doi.org/10.1109/SLT.2018.8639585

D.A. Reynolds, Speaker identification and verification using Gaussian mixture speaker models. Speech Commun. 17(1–2), 91–108 (1995). https://doi.org/10.1016/0167-6393(95)00009-D

T.N. Sainath, B. Kingsbury, G. Saon, H. Soltau, G. Dahl, B. Ramabhadran, Deep convolutional neural networks for large-scale speech tasks. Neural Netw. 64, 39–48 (2015). https://doi.org/10.1016/j.neunet.2014.08.005

D. Salvati, C. Drioli, G.L. Foresti, End-to-end speaker identification in noisy and reverberant environments using raw waveform convolutional neural networks, in Proc. Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH, vol. 2019-September, pp. 4335–4339, 2019. https://doi.org/10.21437/Interspeech.2019-2403

M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, MobileNetV2: inverted residuals and linear bottlenecks, in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018. https://doi.org/10.1109/CVPR.2018.00474

B. Saritha, N. Shome, R.H. Laskar, M. Choudhury, Enhancement in speaker recognition using SincNet through optimal window and frame shift, in 2022 2nd International Conference on Intelligent Technologies (CONIT), pp. 1–6, 2022. https://doi.org/10.1109/CONIT55038.2022.9848231

R.V. Sharan, T.J. Moir, Subband time-frequency image texture features for robust audio surveillance. IEEE Trans. Inf. Forensics Secur. 10(12), 2605–2615 (2015). https://doi.org/10.1109/TIFS.2015.2469254

K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in 3rd Int. Conf. Learn. Represent. ICLR 2015 - Conf. Track Proc., pp. 1–14, 2015. https://doi.org/10.48550/arXiv.1409.1556

H. Sinha, V. Awasthi, P.K. Ajmera, Audio classification using braided convolutional neural networks. IET Signal Process. 14(7), 448–454 (2020). https://doi.org/10.1049/iet-spr.2019.0381

L. Stankovic, A method for time-frequency analysis. IEEE Trans. Signal Process. 42(1), 225–229 (1994). https://doi.org/10.1109/78.258146

S. Stanković, Time-frequency analysis and its application in digital watermarking. EURASIP J. Adv. Signal Process. 2010, 1–20 (2010). https://doi.org/10.1155/2010/579295

M. Strake, B. Defraene, K. Fluyt, W. Tirry, T. Fingscheidt, Speech enhancement by LSTM-based noise suppression followed by CNN-based speech restoration. EURASIP J. Adv. Signal Process. 2020(1), 1–26 (2020). https://doi.org/10.1186/s13634-020-00707-1

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, Going deeper with convolutions, in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, 2015. https://doi.org/10.1109/CVPR.2015.7298594

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826, 2016. https://doi.org/10.1109/CVPR.2016.308

S.S. Tirumala, S.R. Shahamiri, A deep autoencoder approach for speaker identification, in ACM Int. Conf. Proceeding Ser., pp. 175–179, 2017. https://doi.org/10.1145/3163080.3163097

S.S. Tirumala, S.R. Shahamiri, A review on deep learning approaches in speaker identification, in ACM Int. Conf. Proceeding Ser., pp. 142–147, 2016. https://doi.org/10.1145/3015166.3015210

T. Birnbaum, s_method_pub, GitHub repository, 2023. https://github.com/PhaseSpaceContinuum/S_method

Z. Wu, Z. Cao, Improved MFCC-based feature for robust speaker identification. Tsinghua Sci. Technol. 10(2), 158–161 (2005). https://doi.org/10.1016/S1007-0214(05)70048-1

S. Yadav, A. Rai, Learning discriminative features for speaker identification and verification, in Proc. Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH, vol. 2018-September, no. April, pp. 2237–2241, 2018. https://doi.org/10.21437/Interspeech.2018-1015

Z. Zhang, J. Geiger, J. Pohjalainen, A.E.D. Mousa, W. Jin, B. Schuller, Deep learning for environmentally robust speech recognition: An overview of recent developments. ACM Trans. Intell. Syst. Technol. 9(5), 1–28 (2018). https://doi.org/10.1145/3178115

X. Zhao, D. Wang, Analyzing noise robustness of MFCC and GFCC features in speaker identification, in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7204–7208, 2013. https://doi.org/10.1109/ICASSP.2013.6639061

B. Zoph, V. Vasudevan, J. Shlens, Q.V. Le, Learning transferable architectures for scalable image recognition, in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8697–8710, 2018. https://doi.org/10.1109/CVPR.2018.00907