A unified DNN approach to speaker-dependent simultaneous speech enhancement and speech separation in low SNR environments

Speech Communication - Volume 95 - Pages 28-39 - 2017
Tian Gao (1), Jun Du (1), Li-Rong Dai (1), Chin-Hui Lee (2)
(1) National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China
(2) Georgia Institute of Technology, Atlanta, Georgia, United States
