Front end analysis of speech recognition: a review

International Journal of Speech Technology - Volume 14 - Pages 99-145 - 2011

M. A. Anusuya¹, S. K. Katti¹

¹Department of Computer Science, SJCE, Mysore, India

Abstract

Automatic speech recognition (ASR) has made great strides with the development of digital signal processing hardware and software. Despite these advances, machines cannot match the performance of their human counterparts in accuracy and speed, especially in speaker-independent speech recognition. A significant portion of speech recognition research today therefore focuses on the speaker-independent recognition problem. Before recognition, the speech signal must be processed to obtain its feature vectors, so front-end analysis plays an important role. The reasons are its wide range of applications and the limitations of the available speech recognition techniques. In this report we briefly discuss the different aspects of front-end analysis for speech recognition, including sound characteristics, feature extraction techniques, and spectral representations of the speech signal. We also discuss the advantages and disadvantages of each feature extraction technique, along with the suitability of each method for particular applications.

