Dynamic Bayesian network based speech recognition with pitch and energy as auxiliary variables
Abstract
Pitch and energy are two fundamental features of speech and are important in human speech recognition. However, when they are incorporated as features in automatic speech recognition (ASR), they usually cause a significant degradation in recognition performance because of the noise inherent in estimating or modeling them. We show experimentally how this can be corrected either by conditioning the emission distributions on these features or by marginalizing these features out during recognition. Since neither is straightforward with standard hidden Markov models (HMMs), this work was carried out in the framework of dynamic Bayesian networks (DBNs), which offer more flexibility in defining the topology of the emission distributions and in specifying whether variables should be marginalized out.
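To make the two options concrete, here is a minimal Python sketch, assuming a discrete (quantized) auxiliary variable A such as a pitch or energy level, 1-D Gaussian emissions whose mean depends on both the HMM state q and A, and made-up parameter values. The names (`MEANS`, `P_AUX`, `emission_conditioned`, `emission_marginalized`) are hypothetical and are not the authors' implementation. Conditioning scores p(x | q, a) with a observed; marginalization treats A as hidden and scores p(x | q) = Σ_a p(a | q) p(x | q, a). Both variants plug into the same forward recursion.

```python
import math

# Hypothetical toy model (all numbers invented for illustration):
# 2 HMM states, auxiliary variable A (e.g. quantized pitch) with 3 levels.
MEANS = [[-1.0, 0.0, 1.0],   # MEANS[q][a]: emission mean for state q given A = a
         [2.0, 3.0, 4.0]]
VAR = 1.0                    # shared emission variance
P_AUX = [[0.2, 0.5, 0.3],    # P_AUX[q][a] = p(A = a | Q = q), used only when A is hidden
         [0.3, 0.4, 0.3]]


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian N(x; mean, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def emission_conditioned(q, x, a):
    """log p(x | q, a): A is observed and only selects the emission parameters."""
    return log_gauss(x, MEANS[q][a], VAR)


def emission_marginalized(q, x):
    """log p(x | q) = log sum_a p(a | q) p(x | q, a): A is hidden and summed out."""
    return logsumexp([math.log(P_AUX[q][a]) + log_gauss(x, MEANS[q][a], VAR)
                      for a in range(len(P_AUX[q]))])


def forward(log_emit, log_init, log_trans):
    """Sequence log-likelihood via the forward algorithm.

    log_emit[t][q] is the log emission score of frame t in state q.
    """
    n = len(log_init)
    alpha = [log_init[q] + log_emit[0][q] for q in range(n)]
    for t in range(1, len(log_emit)):
        # The right-hand side is built from the previous alpha before reassignment.
        alpha = [logsumexp([alpha[p] + log_trans[p][q] for p in range(n)]) + log_emit[t][q]
                 for q in range(n)]
    return logsumexp(alpha)


# Toy usage: the same observations scored under both treatments of A.
xs = [-0.5, 0.2, 2.8, 3.9]   # acoustic observations (invented)
aux = [0, 1, 1, 2]           # quantized pitch levels, used only in the conditioned case
log_init = [math.log(0.9), math.log(0.1)]
log_trans = [[math.log(0.7), math.log(0.3)],
             [math.log(0.05), math.log(0.95)]]

cond = [[emission_conditioned(q, x, a) for q in range(2)] for x, a in zip(xs, aux)]
marg = [[emission_marginalized(q, x) for q in range(2)] for x in xs]
print("log p(x_1:T | a_1:T) =", forward(cond, log_init, log_trans))
print("log p(x_1:T)         =", forward(marg, log_init, log_trans))
```

The two printed scores answer different questions and are not directly comparable. In the conditioned case no distribution over A is scored, so estimation errors in pitch or energy affect only which emission parameters are selected; in the marginalized case A never has to be estimated at recognition time. In a DBN both amount to declaring the same node observed or hidden, which is the kind of flexibility the abstract attributes to the DBN framework.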
Keywords
#Bayesian methods #Speech recognition #Automatic speech recognition #Degradation #Hidden Markov models #Acoustic emission #Artificial intelligence #Humans #Speech enhancement #Network topology