AUC optimization for deep learning-based voice activity detection

Xiao-Lei Zhang1,2, Menglong Xu1,2
1Research & Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen, China
2School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China

Abstract

Voice activity detection (VAD) based on deep neural networks (DNNs) has demonstrated good performance in adverse acoustic environments. Current DNN-based VAD optimizes a surrogate function, e.g., the minimum cross-entropy or minimum squared error, at a given decision threshold. However, VAD usually works on-the-fly with a dynamic decision threshold, and the receiver operating characteristic (ROC) curve is a global evaluation metric for VAD over all possible decision thresholds. In this paper, we propose to maximize the area under the ROC curve (MaxAUC) with a DNN, which maximizes the performance of VAD in terms of the entire ROC curve. However, the AUC maximization objective is nondifferentiable. To overcome this difficulty, we relax the nondifferentiable loss function into two differentiable approximations: a sigmoid loss and a hinge loss. To study the effectiveness of the proposed MaxAUC-DNN VAD, we take either a standard feedforward neural network or a bidirectional long short-term memory network as the DNN model, with either the state-of-the-art multi-resolution cochleagram or the short-term Fourier transform as the acoustic feature. We applied noise-independent training to all comparison methods. Experimental results show that taking AUC as the optimization objective yields higher performance than the common objectives of the minimum squared error and minimum cross-entropy. This conclusion is consistent across different DNN structures, acoustic features, noise scenarios, training sets, and languages.
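The pairwise relaxation described above can be illustrated with a short sketch. The following minimal NumPy example is our own illustrative assumption rather than the paper's exact formulation: the function name pairwise_auc_losses, the unit margin of the hinge surrogate, and the uniform averaging over all speech/non-speech frame pairs are assumptions introduced here for clarity.

import numpy as np

def pairwise_auc_losses(scores_pos, scores_neg, margin=1.0):
    """Differentiable pairwise surrogates for 1 - AUC (illustrative sketch).

    scores_pos: DNN output scores for speech frames, shape (P,).
    scores_neg: DNN output scores for non-speech frames, shape (N,).
    """
    # Pairwise score differences d_ij = s_i(speech) - s_j(non-speech), shape (P, N).
    diffs = scores_pos[:, None] - scores_neg[None, :]
    # Sigmoid relaxation of the 0/1 indicator: penalizes pairs in which a
    # speech frame is not scored above a non-speech frame.
    sigmoid_loss = np.mean(1.0 / (1.0 + np.exp(diffs)))
    # Hinge relaxation: requires the speech score to exceed the non-speech
    # score by at least `margin`.
    hinge_loss = np.mean(np.maximum(0.0, margin - diffs))
    return sigmoid_loss, hinge_loss

# Toy example with frame-level scores for speech (pos) and noise-only (neg) frames.
pos = np.array([0.9, 0.7, 0.6])
neg = np.array([0.2, 0.4])
print(pairwise_auc_losses(pos, neg))

Minimizing either surrogate over batches that contain both speech and non-speech frames pushes the frame-level AUC toward one across all decision thresholds simultaneously, which is the intuition behind MaxAUC training.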
