A survey of hybrid ANN/HMM models for automatic speech recognition

Neurocomputing - Volume 37 - Pages 91-126 - 2001
Edmondo Trentin1, Marco Gori2
1ITC-irst (Centro per la Ricerca Scientifica e Tecnologica), V. Sommarive, 18-Povo, Trento, Italy and Università di Firenze, V. S. Marta, 3 - Firenze, Italy
2Dipartimento di Ingegneria dell'Informazione, Università di Siena, V. Roma, 56 - Siena, Italy

References

S. Austin, G. Zavaliagkos, J. Makhoul, R. Schwartz, Speech recognition using segmental neural nets, International Conference on Acoustics, Speech and Signal Processing, San Francisco, March 1992, pp. I-625–628.
Bell, 1995, An information-maximization approach to blind separation and blind deconvolution, Neural Comput., 7, 1129, 10.1162/neco.1995.7.6.1129
Y. Bengio, Radial basis functions for speech recognition, in Speech Recognition and Understanding: Recent Advances, Trends and Applications, NATO Advanced Study Institute Series F: Computer and Systems Sciences, 1990, pp. 293–298.
Bengio, 1993, A connectionist approach to speech recognition, Int. J. Pattern Recognition Artif. Intell., 7, 647, 10.1142/S0218001493000327
Bengio, 1996
S. Bengio, Y. Bengio, An EM algorithm for asynchronous input/output hidden Markov models, in: L. Xu (Ed.), International Conference on Neural Information Processing, Hong-Kong, 1996.
Y. Bengio, R. De Mori, G. Flammia, R. Kompe, Phonetically motivated acoustic parameters for continuous speech recognition using artificial neural networks, Proceedings of EuroSpeech’91, 1991.
Bengio, 1992, Global optimization of a neural network-hidden Markov model hybrid, IEEE Trans. Neural Networks, 3, 252, 10.1109/72.125866
Bengio, 1996, Input/Output HMMs for sequence processing, IEEE Trans. Neural Networks, 7, 1231, 10.1109/72.536317
Y. Bengio, M. Gori, R. De Mori, Learning the dynamic nature of speech with back-propagation for sequences, Pattern Recognition Lett. 13 (5) (1992) 375–386. (Special issue on Artificial Neural Networks).
Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Networks 5 (2) (1994) 157–166. (Special Issue on Recurrent Neural Networks, March 94).
Bourlard, 1993, Continuous speech recognition by connectionist statistical methods, IEEE Trans. Neural Networks, 4, 893, 10.1109/72.286885
Bourlard, 1994
H. Bourlard, N. Morgan, Connectionist Speech Recognition. A Hybrid Approach, Kluwer Academic Publishers, Boston, 1994, p. 117.
Bourlard, 1990, Links between hidden Markov models and multilayer perceptrons, IEEE Trans. Pattern Anal. Mach. Intell., 12, 1167, 10.1109/34.62605
Bridle, 1990, Alphanets: a recurrent ‘neural’ network architecture with a hidden Markov model interpretation, Speech Commun., 9, 83, 10.1016/0167-6393(90)90049-F
Le Cerf, 1994, Multilayer perceptrons as labelers for hidden Markov models, IEEE Trans. Speech Audio Process., 2, 185, 10.1109/89.260361
Chang, 1993, Discriminative training of dynamic programming based speech recognizers, IEEE Trans. Speech Audio Process., 1, 135, 10.1109/89.222873
C. Che, Q. Lin, J. Pearson, B. de Vries, J.L. Flanagan, Microphone arrays and neural networks for robust speech recognition, Proceedings of ARPA Human Language Technology (HLT), 1994, pp. 342–348.
Chen, 1996, A speech recognition method based on the sequential multi-layer perceptrons, Neural Networks, 9, 655, 10.1016/0893-6080(95)00140-9
Chung, 1996, An MLP/HMM hybrid model using nonlinear predictors, Speech Commun., 19, 307, 10.1016/S0167-6393(96)00053-2
Chung, 1996, Multilayer perceptrons for state-dependent weighting of HMM likelihoods, Speech Commun., 18, 79, 10.1016/0167-6393(95)00038-0
Cosi, 1990, Phonetically-based multi-layered networks for acoustic property extraction and automatic speech recognition, Speech Commun., 9, 15, 10.1016/0167-6393(90)90041-7
P. Cosi, P. Frasconi, M. Gori, N. Griggio, Phonetic recognition experiments with recurrent neural networks, Proceedings of the International Conference on Spoken Language, Banff, Canada, October 1992, pp. 1335–1338.
Cybenko, 1989, Approximations by superpositions of a sigmoidal function, Math. Control Signals Systems, 2, 303, 10.1007/BF02551274
S. Das, A. Nádas, D. Nahamoo, M. Picheny, Adaptation techniques for ambience and microphone compensation in the IBM Tangora speech recognition system, International Conference on Acoustics, Speech and Signal Processing, Adelaide, April 1994, pp. I-21–24.
Davis, 1980, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. Speech Signal Process., 28, 357, 10.1109/TASSP.1980.1163420
De Mori, 1998
Deller, 1993
Duda, 1973
Dugast, 1994, Combining TDNN and HMM in a hybrid system for improved continuous-speech recognition, IEEE Trans. Speech Audio Process., 2, 217, 10.1109/89.260364
S. Dupont, C. Ris, O. Deroo, V. Fontaine, J.M. Boite, L. Zanoni, Context-independent and context-dependent hybrid HMM/ANN systems for vocabulary independent tasks, Proceedings of EUROSPEECH, Vol. 4, Rhodes, 1997, pp. 1947–1950.
Elman, 1990, Finding structure in time, Cognitive Sci., 14, 179, 10.1207/s15516709cog1402_1
G. Flammia, Speaker independent consonant recognition in continuous speech with distinctive phonetic features, Master's Thesis, McGill University, School of Computer Science, 1991.
Flanagan, 1991, Autodirective microphone systems, Acustica, 75, 58
Franco, 1994, Context-dependent connectionist probability estimation in a hybrid hidden Markov model-neural net speech recognition system, Comput. Speech Language, 8, 211, 10.1006/csla.1994.1010
H. Franco, V. Digalakis, Temporal correlation modeling in a hybrid neural network/hidden Markov model speech recognizer, Proceedings of EUROSPEECH, Madrid, 1995, pp. 1681–1684.
M.A. Franzini, K.F. Lee, A. Waibel, Connectionist Viterbi training: a new hybrid method for continuous speech recognition, International Conference on Acoustics, Speech and Signal Processing, Albuquerque, NM, 1990, pp. 425–428.
Frasconi, 1990
Fukunaga, 1990
R. Gemello, D. Albesano, F. Mana, Continuous speech recognition with neural networks and stationary-transitional acoustic units, ICNN, Houston, TX, USA, 1997, pp. 2107–2111.
M. Gori, Y. Bengio, R. De Mori, BPS: a learning algorithm for capturing the dynamical nature of speech, Proceedings of the International Joint Conference on Neural Networks, Washington, DC, IEEE, New York, 1989, pp. 643–644.
F.S. Gurgen, J.M. Song, R.W. King, A continuous HMM based preprocessor for modular speech recognition neural networks, Proceedings of ICSLP, Yokohama, 1994, pp. 1507–1510.
P. Haffner, M. Franzini, A. Waibel, Integrating time alignment and neural networks for high performance continuous speech recognition, International Conference on Acoustics, Speech and Signal Processing, Toronto, 1991, pp. 105–108.
P. Haffner, A. Waibel, K. Shikano, Fast back-propagation learning methods for large phonemic neural networks, Proceedings of Eurospeech’89, 1989.
Hampshire, 1990, A novel objective function for improved phoneme recognition using time-delay neural networks, IEEE Trans. Neural Networks, 1, 216, 10.1109/72.80233
J. Hennebert, C. Ris, H. Bourlard, S. Renals, N. Morgan, Estimation of global posteriors and forward-backward training of hybrid HMM/ANN systems, Proceedings of EUROSPEECH, Vol. 4, Rhodes, 1997, pp. 1951–1954.
Hertz, 1991
M.M. Hochberg, S.J. Renals, A.J. Robinson, G.D. Cook, Recent improvements to the ABBOT large vocabulary CSR system, International Conference on Acoustics, Speech and Signal Processing, Detroit, 1995, pp. 69–72.
M.M. Hochberg, S.J. Renals, A.J. Robinson, D.J. Kershaw, Large vocabulary continuous speech recognition using a hybrid connectionist-HMM system, Proceedings of ICSLP, Yokohama, 1994, pp. 1499–1502.
X.D. Huang, Speaker normalization for speech recognition, International Conference on Acoustics, Speech and Signal Processing, San Francisco, March 1992, pp. I-465–468.
Huang, 1990
H.-P. Hutter, Comparison of a new hybrid connectionist-SCHMM approach with other hybrid approaches for speech recognition, International Conference on Acoustics, Speech and Signal Processing, Detroit, 1995, pp. 3311–3314.
K. Iso, T. Watanabe, Speaker-independent word recognition using a neural prediction model, International Conference on Acoustics, Speech and Signal Processing, Albuquerque, NM, 1990, pp. 441–444.
H. Iwamida, S. Katagiri, E. McDermott, Speaker-independent large vocabulary word recognition using an LVQ/HMM hybrid algorithm, International Conference on Acoustics, Speech and Signal Processing, Toronto, 1991, pp. 553–556.
Jang, 1996, A new parameter smoothing method in the hybrid TDNN/HMM architecture for speech recognition, Speech Commun., 19, 317, 10.1016/S0167-6393(96)00052-0
F.T. Johansen, Global optimisation of HMM input transformations, Proceedings of ICSLP, Vol. 1, Yokohama, 1994, pp. 239–242.
F.T. Johansen, A comparison of hybrid HMM-architectures using global discriminative training, Proceedings of ICSLP, Philadelphia, 1996, pp. 498–501.
F.T. Johansen, M.H. Johnsen, Non-linear input transformations for discriminative HMMs, International Conference on Acoustics, Speech and Signal Processing, Vol. 1, Adelaide, 1994, pp. 225–228.
Jordan, 1989, Serial order: a parallel, distributed processing approach
Juang, 1992, Discriminative learning for minimum error classification, IEEE Trans. Signal Process., 40, 3043, 10.1109/78.175747
Junqua, 1996
S. Katagiri, C.-H. Lee, B.H. Juang, New discriminative training algorithms based on the generalized probabilistic descent method, Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, 1991, pp. 299–308.
D. Kimber, M.A. Bush, G.N. Tajchman, Speaker-independent vowel classification using hidden Markov models and LVQ2, International Conference on Acoustics, Speech and Signal Processing, Albuquerque, NM, 1990, pp. 497–500.
T. Kohonen, Learning vector quantization for pattern recognition, Report TKK-F-A601, Helsinki University of Technology, Espoo, Finland, 1986.
K.J. Lang, G.E. Hinton, The development of the time-delay neural network architecture for speech recognition, Technical Report CMU-CS-88-152, Carnegie-Mellon University, 1988.
LeCun, 1986, Learning processes in an asymmetric threshold network, 233
Lee, 1991, A study on speaker adaptation of the parameters of continuous density hidden Markov models, IEEE Trans. Signal Process., 39, 806, 10.1109/78.80902
Lee, 1996
E. Levin, Word recognition using hidden control neural architecture, International Conference on Acoustics, Speech and Signal Processing, Albuquerque, NM, 1990, pp. 433–436.
E. Levin, R. Pieraccini, E. Bocchieri, Time-warping network: a hybrid framework for speech recognition, in: J.E. Moody, S.J. Hanson, R.P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4, Denver, CO, 1992, pp. 151–158.
K.P. Li, J.A. Naylor, A whole word recurrent neural network for keyword spotting, International Conference on Acoustics, Speech and Signal Processing, San Francisco, March 1992, pp. II-81–84.
A. Linden, J. Kindermann, Inversion of multilayer nets, International Joint Conference on Neural Networks, Washington, DC, June 1989, pp. 425–430.
R. Lippmann, E. Singer, Hybrid neural-network/HMM approaches to wordspotting, International Conference on Acoustics, Speech and Signal Processing, 1993, pp. I-565–568.
Lippmann, 1989, Review of neural networks for speech recognition, Neural Comput., 1, 1, 10.1162/neco.1989.1.1.1
R.P. Lippmann, B. Gold, Neural classifiers useful for speech recognition, IEEE Proceedings of First International Conference on Neural Networks, Vol. IV, San Diego, CA, 1987, pp. 417–422.
W. Ma, D. Van Compernolle, TDNN labeling for a HMM recognizer, International Conference on Acoustics, Speech and Signal Processing, 1990.
J.-F. Mari, D. Fohr, Y. Anglade, J.-C. Junqua, Hidden Markov models and selectively trained neural networks for connected confusable word recognition, Proceedings of ICSLP, Yokohama, 1994, pp. 1519–1522.
McDermott, 1991, LVQ-based shift-tolerant phoneme recognition, IEEE Trans. Signal Process., 39, 1398, 10.1109/78.136545
X. Menendez-Pidal, R. de Cordoba, J. Ferreiros, J.M. Pardo, Incorporating fuzzy modelling in a hybrid HMM-ANNs system for CSR tasks, Proceedings of EUROSPEECH, Madrid, 1995, pp. 1689–1692.
X. Menendez-Pidal, J. Ferreiros, R. de Cordoba, J.M. Pardo, Recent work in hybrid neural networks and HMM systems in CSR tasks, Proceedings of ICSLP, Yokohama, 1994, pp. 1515–1518.
Merhav, 1993, A minimax classification approach with application to robust speech recognition, IEEE Trans. Speech Audio Process., 1, 90, 10.1109/89.221371
Minsky, 1969
Moon, 1997, Robust speech recognition based on joint model and feature space optimization of hidden Markov models, IEEE Trans. Neural Networks, 8, 194, 10.1109/72.557656
Morgan, 1991
D.P. Morgan, C.L. Scofield, J.E. Adcock, Multiple neural network topologies applied to keyword spotting, International Conference on Acoustics, Speech and Signal Processing, Toronto, 1991, pp. 313–316.
N. Morgan, H. Bourlard, Continuous speech recognition using multilayer perceptrons with hidden Markov models, International Conference on Acoustics, Speech and Signal Processing, Albuquerque, NM, 1990, pp. 413–416.
N. Morgan, Y. Konig, S.L. Wu, H. Bourlard, Transition-based statistical training for ASR, Proceedings of IEEE Automatic Speech Recognition Workshop (Snowbird), 1995, pp. 133–134.
T. Moudenc, R. Sokol, G. Mercier, Segmental phonetic features recognition by means of neural-fuzzy networks and integration in an N-best solutions post-processing, Proceedings of ICSLP, Philadelphia, 1996, pp. 338–341.
Mozer, 1993, Neural net architectures for temporal sequence processing, 243
J. Neto, L. Almeida, M. Hochberg, C. Martins, L. Nunes, S. Renals, T. Robinson, Speaker-adaptation for hybrid HMM-ANN continuous speech recognition system, Proceedings of EUROSPEECH, Madrid, September 1995, pp. 2171–2174.
C. Neukirchen, G. Rigoll, Advanced training methods and new network topologies for hybrid MMI-connectionist/HMM speech recognition systems, International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, 1997, pp. 3257–3260.
L.T. Niles, H.F. Silverman, Combining hidden Markov models and neural network classifiers, International Conference on Acoustics, Speech and Signal Processing, Albuquerque, NM, 1990, pp. 417–420.
Pao, 1989
Pearlmutter, 1989, Learning state space trajectories in recurrent neural networks, Neural Comput., 1, 263, 10.1162/neco.1989.1.2.263
N. Pican, D. Fohr, J.-F. Mari, HMMs and OWE neural network for continuous speech recognition, Proceedings of ICSLP, Philadelphia, 1996, pp. 1309–1312.
N. Pican, J.-F. Mari, D. Fohr, Continuous speech recognition using a context sensitive ANN and HMMs, Proceedings of EUROSPEECH, Vol. 1, Rhodes, 1997, pp. 95–98.
Pineda, 1989, Recurrent back-propagation and the dynamical approach to adaptive neural computation, Neural Comput., 1, 161, 10.1162/neco.1989.1.2.161
Powell, 1987, Radial basis functions for multivariable interpolation: a review
Rabiner, 1993
Rabiner, 1989, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, 77, 257, 10.1109/5.18626
Rabiner, 1986, An introduction to hidden Markov models, IEEE ASSP Mag., 77, 257
Rabiner, 1996, An overview of automatic speech recognition
W. Reichl, G. Ruske, A hybrid RBF-HMM system for continuous speech recognition, International Conference on Acoustics, Speech and Signal Processing, Detroit, 1995, pp. 3335–3338.
Renals, 1994, Connectionist probability estimators in HMM speech recognition, IEEE Trans. Speech Audio Process., 2, 161, 10.1109/89.260359
Rigoll, 1994, Maximum mutual information neural networks for hybrid connectionist-HMM speech recognition systems, IEEE Trans. Speech Audio Process., 2, 175, 10.1109/89.260360
G. Rigoll, Ch. Neukirchen, J. Rottland, Large vocabulary speaker-independent continuous speech recognition with a new hybrid system based on MMI-neural networks, Proceedings of EUROSPEECH, Madrid, 1995, pp. 1659–1662.
S.K. Riis, A. Krogh, Hidden neural networks: a framework for HMM/NN hybrids, International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, 1997, pp. 3233–3236.
A.J. Robinson, F. Fallside, Static and dynamic error propagation networks with application to speech coding, in: D.Z. Anderson (Ed.), Neural Information Processing Systems, American Institute of Physics, New York, Denver, CO, 1988, pp. 632–641.
T. Robinson, A real-time recurrent error propagation network word recognition system, International Conference on Acoustics, Speech and Signal Processing, Vol. I, 1992, pp. 617–620.
Robinson, 1994, An application of recurrent nets to phone probability estimation, IEEE Trans. Neural Networks, 5, 298, 10.1109/72.279192
Robinson, 1991, A recurrent error propagation network speech recognition system, Comput. Speech Language, 5, 259, 10.1016/0885-2308(91)90010-N
J. Rottland, Ch. Neukirchen, D. Willett, G. Rigoll, Large vocabulary speech recognition with context dependent MMI-connectionist/HMM systems using the WSJ database 1, Proceedings of EUROSPEECH, Vol. 1, Rhodes, 1997, pp. 79–82.
D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, in: D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distributed Processing, Vol. 1, MIT Press, Cambridge, 1986, pp. 318–362 (Chapter 8).
Sato, 1990, A real time learning algorithm for recurrent analog neural networks, Biol. Cybernet., 62, 237, 10.1007/BF00198098
E. Singer, R.P. Lippmann, A speech recognizer using radial basis function neural networks in an HMM framework, International Conference on Acoustics, Speech and Signal Processing, Vol. 1, San Francisco, March 1992, pp. 629–632.
J. Takahashi, S. Sagayama, Telephone line characteristic adaptation using vector field smoothing technique, Proceedings of ICSLP, Yokohama, September 1994, pp. 991–994.
J. Tebelskis, A. Waibel, B. Petek, O. Schmidbauer, Continuous speech recognition using linked predictive networks, in: R.P. Lippmann, J.E. Moody, D.S. Touretzky (Eds.), Advances in Neural Information Processing Systems 3, Morgan Kaufmann, San Mateo, Denver, CO, 1991, pp. 199–205.
E. Trentin, D. Giuliani, Speaker normalization with a mixture of recurrent networks, Proceedings of ESANN97, European Symposium on Artificial Neural Networks, Bruges, Belgium, April 1997.
Waibel, 1989, Modular construction of time-delay neural networks for speech recognition, Neural Comput., 1, 39, 10.1162/neco.1989.1.1.39
Waibel, 1989, Phoneme recognition using time-delay neural networks, IEEE Trans. Acoust. Speech Signal Process., 37, 328, 10.1109/29.21701
Waibel, 1989, Modularity and scaling in large phonemic neural networks, IEEE Trans. Acoust. Speech Signal Process., 37, 1888, 10.1109/29.45535
P. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences, Ph.D. Thesis, Harvard University, 1974.
Werbos, 1988, Generalization of backpropagation with application to a recurrent gas market model, Neural Networks, 1, 339, 10.1016/0893-6080(88)90007-X
Wilinski, 1998, Toward the border between neural and markovian paradigms, IEEE Trans. Systems Man Cybernet., 28, 146, 10.1109/3477.662756
Williams, 1989, Experimental analysis of the real-time recurrent learning algorithm, Connection Sci., 1, 87, 10.1080/09540098908915631
Williams, 1989, A learning algorithm for continually running fully recurrent neural networks, Neural Comput., 1, 270, 10.1162/neco.1989.1.2.270
Y. Yan, M. Fanty, R. Cole, Speech recognition using neural networks with forward–backward probability generated targets, International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, 1997, pp. 3241–3244.
D. Yu, T. Huang, D.W. Chen, A multi-stage NN/HMM hybrid method for high performance speech recognition, Proceedings of ICSLP, Yokohama, 1994, pp. 1503–1506.
G. Zavaliagkos, S. Austin, J. Makhoul, R. Schwartz, A hybrid continuous speech recognition system using segmental neural nets with hidden Markov models, Int. J. Pattern Recognition Artif. Intell. (1993) 305–319. (Special Issue on Applications of Neural Networks to Pattern Recognition (I. Guyon Ed.)).
G. Zavaliagkos, R. Schwartz, J. Makhoul, Batch, incremental and instantaneous adaptation techniques for speech recognition, International Conference on Acoustics, Speech and Signal Processing, Detroit, May 1995, pp. I-676–679.
Zavaliagkos, 1994, A hybrid segmental neural net/hidden Markov model system for continuous speech recognition, IEEE Trans. Speech Audio Process., 2, 151, 10.1109/89.260358