Exploiting correlogram structure for robust speech recognition with multiple speech sources

Speech Communication - Volume 49 - Pages 874-891 - 2007
Ning Ma, Phil Green, Jon Barker, André Coy
Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK

References

Assmann, 1990, Modeling the perception of concurrent vowels: Vowels with different fundamental frequencies, J. Acoust. Soc. Amer., 88, 680, 10.1121/1.399772
Barker, J., Josifovski, L., Cooke, M., Green, P., 2000. Soft decisions in missing data techniques for robust automatic speech recognition. In: Proc. ICSLP 2000, Beijing, China, pp. 373–376.
Barker, 2005, Decoding speech in the presence of other sources, Speech Comm., 45, 5, 10.1016/j.specom.2004.05.002
Barker, J., Coy, A., Ma, N., Cooke, M., 2006. Recent advances in speech fragment decoding techniques. In: Proc. Interspeech 2006, Pittsburgh, pp. 85–88.
Bregman, 1990
Brown, 1994, Computational auditory scene analysis, Comput. Speech Lang., 8, 297, 10.1006/csla.1994.1016
Brown, 2005, Separation of speech by computational auditory scene analysis, 371
Cooke, M., 1991. Modelling auditory processing and organisation. Ph.D. thesis, Department of Computer Science, University of Sheffield.
Cooke, 2001, The auditory organization of speech and other sources in listeners and computational models, Speech Comm., 35, 141, 10.1016/S0167-6393(00)00078-9
Cooke, M., Morris, A., Green, P., 1997. Missing data techniques for robust speech recognition. In: Proc. ICASSP 1997, Vol. 1, Munich, pp. 25–28.
Cooke, 2001, Robust automatic speech recognition with missing and uncertain acoustic data, Speech Comm., 34, 267, 10.1016/S0167-6393(00)00034-0
Cooke, 2006, An audio–visual corpus for speech perception and automatic speech recognition, J. Acoust. Soc. Amer., 2421, 10.1121/1.2229005
Cooke, M., Garcia Lecumberri, M., Barker, J. The foreign language cocktail party problem: energetic and informational masking effects in non-native speech perception. J. Acoust. Soc. Amer., submitted for publication.
Coy, A., Barker, J., 2005. Soft harmonic masks for recognising speech in the presence of a competing speaker. In: Proc. Interspeech 2005, Lisbon, pp. 2641–2644.
Coy, A., Barker, J., 2006. A multipitch tracker for monaural speech segmentation. In: Proc. Interspeech 2006, Pittsburgh, pp. 1678–1681.
Coy, 2007, An automatic speech recognition system based on the scene analysis account of auditory perception, Speech Comm., 49, 384, 10.1016/j.specom.2006.11.002
de Cheveigné, 1993, Separation of concurrent harmonic sounds: fundamental frequency estimation and a time-domain cancellation model of auditory processing, J. Acoust. Soc. Amer., 93, 3271, 10.1121/1.405712
Ellis, 1999, Using knowledge to organize sound: the prediction-driven approach to computational auditory scene analysis and its application to speech/nonspeech mixtures, Speech Comm., 27, 281, 10.1016/S0167-6393(98)00083-1
Glasberg, 1990, Derivation of auditory filter shapes from notched-noise data, Hear. Res., 47, 103, 10.1016/0378-5955(90)90170-T
Gonzales, 2004
Hirsch, H., Pearce, D., 2000. The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: Proc. ICSLP 2000, Vol. 4, pp. 29–32.
Hu, G., 2006. Monaural speech organization and segregation. Ph.D. thesis, The Ohio State University, Biophysics program.
Licklider, 1951, A duplex theory of pitch perception, Experientia, 7, 128, 10.1007/BF02156143
Lim, 1979, Enhancement and bandwidth compression of noisy speech, Proc. IEEE, 67, 1586, 10.1109/PROC.1979.11540
Ma, N., Green, P., Coy, A., 2006. Exploiting dendritic autocorrelogram structure to identify spectro-temporal regions dominated by a single sound source. In: Proc. Interspeech 2006, Pittsburgh, PA, pp. 669–672.
McAulay, 1986, Speech analysis/synthesis based on a sinusoidal representation, IEEE Trans. Acoust. Speech Signal Process., 34, 744, 10.1109/TASSP.1986.1164910
Meddis, 1991, Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification, J. Acoust. Soc. Amer., 89, 2866, 10.1121/1.400725
Meddis, 1992, Modeling the identification of concurrent vowels with different fundamental frequencies, J. Acoust. Soc. Amer., 91, 233, 10.1121/1.402767
Parra, L., Spence, C., 2000. Convolutive blind source separation of non-stationary sources. IEEE Trans. Speech Audio Process., pp. 320–327.
Shamma, 1985, Speech processing in the auditory system. I: The representation of speech sounds in the responses of the auditory nerve, J. Acoust. Soc. Amer., 78, 1613, 10.1121/1.392799
Slaney, M., Lyon, R., 1990. A perceptual pitch detector. In: Proc. ICASSP 1990, Albuquerque, pp. 357–360.
Summerfield, Q., Lea, A., Marshall, D., 1990. Modelling auditory scene analysis: strategies for source segregation using autocorrelograms. In: Proc. Institute of Acoustics, Vol. 12, pp. 507–514.
Wang, 1999, Separation of speech from interfering sounds based on oscillatory correlation, IEEE Trans. Neural Networks, 10, 684, 10.1109/72.761727