A noise robust speech features extraction approach in multidimensional cortical representation using multilinear principal component analysis

International Journal of Speech Technology - Volume 18 - Pages 351-365 - 2015
Mehdi Fartash1, Saeed Setayeshi2, Farbod Razzazi3
1Department of Electrical and Computer Engineering, Arak Branch, Islamic Azad University, Arak, Iran
2Department of Medical Radiation, Amirkabir University of Technology, Tehran, Iran
3Department of Electrical and Computer Engineering - Science and Research Branch, Islamic Azad University, Tehran, Iran

Abstract

In this paper, we propose a new type of noise robust feature extraction method based on the multidimensional perceptual representation of speech in the primary auditory cortex (AI). Coding different features along different dimensions increases the discrimination power of the system. On the other hand, this representation greatly increases the volume of information and gives rise to the curse of dimensionality. In this study, we propose a second-level feature extraction stage that makes the features compact and noise robust for classifier training. In this second level, we target two main concerns, dimensionality reduction and noise robustness, using a singular value decomposition (SVD) approach. A multilinear principal component analysis framework based on higher-order SVD is proposed to extract the final features from the high-dimensional AI output space. Phoneme classification results on different phoneme subsets of the additive-noise-contaminated TIMIT database confirm that the proposed method not only increases the classification rate considerably, but also enhances robustness significantly compared to conventional Mel-frequency cepstral coefficient (MFCC) and cepstral mean normalization (CMN) features used to train the same classifier.
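To illustrate the second-level stage, the sketch below shows a minimal, HOSVD-style multilinear PCA applied to tensor-valued features. It is not the authors' implementation: the tensor shape (scale, rate, frequency), the chosen ranks, and the single-pass per-mode SVD (the full MPCA algorithm iterates this step to convergence) are illustrative assumptions.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the remaining axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mpca_fit(samples, ranks):
    """Estimate per-mode projection matrices from a list of (mean-centered) tensor
    samples via truncated SVD of the concatenated mode-n unfoldings.
    This is one HOSVD-style pass; MPCA proper iterates it until convergence."""
    n_modes = samples[0].ndim
    projections = []
    for mode in range(n_modes):
        # Stack the mode-n unfoldings of all samples side by side.
        unfolded = np.hstack([unfold(x, mode) for x in samples])
        # Leading left singular vectors span the dominant subspace of this mode;
        # truncating them discards low-energy (noise-dominated) directions.
        u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        projections.append(u[:, :ranks[mode]])
    return projections

def mpca_project(tensor, projections):
    """Project a tensor onto the reduced per-mode subspaces (mode-n products)."""
    core = tensor
    for mode, u in enumerate(projections):
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core

# Hypothetical example: 200 AI feature tensors of shape (scale, rate, frequency).
samples = [np.random.randn(8, 12, 128) for _ in range(200)]
mean = sum(samples) / len(samples)
centered = [x - mean for x in samples]
projections = mpca_fit(centered, ranks=(4, 6, 32))
# Reduced, vectorized features suitable for a conventional classifier.
features = [mpca_project(x, projections).ravel() for x in centered]
```

In this sketch the 8 x 12 x 128 tensor (12,288 values) is compressed to a 4 x 6 x 32 core (768 values), which conveys the intended effect: per-mode rank truncation both reduces dimensionality and suppresses noise-dominated directions before classification.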
