Maximum-likelihood training of the PLCG-based language model
Abstract
In Van Uytsel et al. (2001) a parsing language model based on a probabilistic left-corner grammar (PLCG) was proposed, and encouraging performance on a speech recognition task using the PLCG-based language model was reported. In this paper we show how the PLCG-based language model can be further optimized by iterative parameter reestimation on unannotated training data. The precalculation of forward, inner and outer probabilities of states in the PLCG network provides an elegant shortcut to the computation of transition frequency expectations, which are needed in each iteration of the proposed reestimation procedure. The training algorithm enables model training on very large corpora. In our experiments, test set perplexity is close to saturation after three iterations, 5 to 16% lower than initially. However, we observed no significant improvement of recognition accuracy after reestimation.
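The reestimation loop the abstract describes (accumulate expected transition frequencies from precomputed forward/inner and outer probabilities, then renormalize) can be illustrated on a much simpler model. The sketch below runs one EM iteration over the transition matrix of a toy two-state probabilistic network using standard forward and backward quantities; it is an illustrative analogue only, not the paper's PLCG network, and all names and the toy data are invented for the example:

```python
import math

def forward(obs, init, trans, emit):
    """Forward probabilities alpha[t][s] for one observation sequence."""
    n = len(init)
    alpha = [[0.0] * n for _ in obs]
    for s in range(n):
        alpha[0][s] = init[s] * emit[s][obs[0]]
    for t in range(1, len(obs)):
        for s in range(n):
            alpha[t][s] = emit[s][obs[t]] * sum(
                alpha[t - 1][r] * trans[r][s] for r in range(n))
    return alpha

def backward(obs, trans, emit):
    """Backward probabilities beta[t][s], the 'outer' counterpart here."""
    n = len(trans)
    beta = [[0.0] * n for _ in obs]
    for s in range(n):
        beta[-1][s] = 1.0
    for t in range(len(obs) - 2, -1, -1):
        for s in range(n):
            beta[t][s] = sum(
                trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r]
                for r in range(n))
    return beta

def reestimate_transitions(corpus, init, trans, emit):
    """One EM iteration: expected transition counts -> renormalized rows."""
    n = len(init)
    counts = [[0.0] * n for _ in range(n)]
    for obs in corpus:
        alpha = forward(obs, init, trans, emit)
        beta = backward(obs, trans, emit)
        z = sum(alpha[-1][s] for s in range(n))  # sequence likelihood
        for t in range(len(obs) - 1):
            for r in range(n):
                for s in range(n):
                    counts[r][s] += (alpha[t][r] * trans[r][s]
                                     * emit[s][obs[t + 1]] * beta[t + 1][s]) / z
    new_trans = []
    for r in range(n):
        total = sum(counts[r])
        new_trans.append([c / total for c in counts[r]] if total > 0
                         else list(trans[r]))
    return new_trans

def corpus_log_likelihood(corpus, init, trans, emit):
    """Total log-likelihood of the corpus under the current parameters."""
    return sum(
        math.log(sum(forward(obs, init, trans, emit)[-1][s]
                     for s in range(len(init))))
        for obs in corpus)

# Toy data (illustrative): two states, two observable symbols.
init = [0.6, 0.4]
trans0 = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
corpus = [[0, 0, 1, 1], [0, 1, 1, 0]]

ll_before = corpus_log_likelihood(corpus, init, trans0, emit)
trans1 = reestimate_transitions(corpus, init, trans0, emit)
ll_after = corpus_log_likelihood(corpus, init, trans1, emit)
```

As with any EM step, each iteration is guaranteed not to decrease the training-data likelihood, which matches the abstract's observation that perplexity drops and then saturates after a few iterations.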
Keywords
#Natural languages #Speech recognition #Maximum likelihood estimation #Training data #Testing #Computer networks #Iterative algorithms #Large-scale systems #Stochastic processes #Predictive models

References
Van Aelten, 2000, Inside-outside reestimation of Chelba-Jelinek models, Technical Report L&H-SR-00-027
Chelba, 2000, Exploiting syntactic structure for natural language modeling
Charniak, 2000, A maximum-entropy-inspired parser, Proc. NAACL-2000, 132
Dempster, 1977, Maximum likelihood from incomplete data via the EM algorithm, J. Royal Statistical Society Series B, 39, 1
10.1109/TASSP.1987.1165125
Bod, 2000, Combining semantic and syntactic structure for language modeling, Proc. ICSLP-2000, III, 106
10.1109/SWAT.1970.5
Manning, 1997, Probabilistic parsing using left corner language models, Proc. IWPT-1997, 147
Van Uytsel, 2000, Earley-inspired parsing language model: Background and preliminaries, Internal Report PSI-SPCH-00-1, KU Leuven ESAT
Marcus, 1993, Building a large annotated corpus of English: the Penn Treebank, Computational Linguistics, 19, 313
10.3115/1073336.1073365