Maximum-likelihood training of the PLCG-based language model

D.H. Van Uytsel¹, D. Van Compernolle¹, P. Wambacq¹
¹ESAT/PSI, Katholieke Universiteit Leuven, Belgium

Abstract

Van Uytsel et al. (2001) proposed a parsing language model based on a probabilistic left-corner grammar (PLCG) and reported encouraging performance with it on a speech recognition task. In this paper we show how the PLCG-based language model can be further optimized by iterative parameter reestimation on unannotated training data. Precalculating the forward, inner and outer probabilities of states in the PLCG network provides an elegant shortcut to computing the transition frequency expectations that are needed in each iteration of the proposed reestimation procedure. The training algorithm enables model training on very large corpora. In our experiments, test set perplexity is close to saturation after three iterations, at 5 to 16% below its initial value. However, we observed no significant improvement in recognition accuracy after reestimation.
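As a rough illustration of the reestimation step mentioned in the abstract, the sketch below normalizes expected transition counts into updated transition probabilities and computes perplexity from per-word log probabilities. It is a generic EM-style update under assumed data structures, not the authors' implementation: the function names `reestimate` and `perplexity`, the state labels, the move labels and the count values are all hypothetical, and the computation of the expectations themselves from forward, inner and outer probabilities is not shown.

```python
import math
from collections import defaultdict


def reestimate(expected_counts):
    """Normalize expected transition counts per source state (generic M-step).

    expected_counts maps (state, move) pairs to expected frequencies.
    The pair layout is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    for (state, _move), count in expected_counts.items():
        totals[state] += count
    return {
        (state, move): count / totals[state]
        for (state, move), count in expected_counts.items()
        if totals[state] > 0.0
    }


def perplexity(word_log_probs):
    """Perplexity from natural-log conditional word probabilities."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))


# Hypothetical expected counts, as if collected over one pass of the training data.
counts = {
    ("s0", "shift"): 3.0,
    ("s0", "project"): 1.0,
    ("s1", "attach"): 2.0,
}
probs = reestimate(counts)
```

In an iterative procedure of this kind, the updated probabilities would feed the next pass over the training data, and the loop would stop once held-out perplexity no longer decreases noticeably.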

Keywords

#Natural languages #Speech recognition #Maximum likelihood estimation #Training data #Testing #Computer networks #Iterative algorithms #Large-scale systems #Stochastic processes #Predictive models
