Internet evolution and progress in full automatic French language modelling

D. Vaufreydaz (1), M. Gery (1)
(1) Laboratoire CLIPS-IMAG, équipe GEOD et MRIM, Grenoble, France

Abstract

The World Wide Web is the greatest information space ever seen: it is distributed all over the world, written in many languages, and covers a wide variety of topics. We first describe how a French subset of this space evolved over the last three years. During this time, the amount of text automatically extracted for language modelling grew by a factor of 6.5, and coverage of French grew from 140,000 to 200,000 lexical forms. This shows that we can obtain increasingly reliable data to train our trigram models. Recognition experiments on a French "state of the art" evaluation set show that word accuracy increased from 51% to 62.30% between two models computed automatically from Web corpora, the first gathered at the beginning of 1999 and the second at the end of 2000.
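As an illustration of what training a trigram model on extracted text involves, the following minimal sketch counts trigrams over a toy tokenized corpus and computes unsmoothed maximum-likelihood estimates. It is not the authors' pipeline: the function names, the `<s>` and `</s>` padding tokens, and the two-sentence corpus are assumptions made for this example, and a practical recognizer would add smoothing or back-off for unseen word sequences.

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word histories over tokenized sentences."""
    trigrams, histories = Counter(), Counter()
    for words in sentences:
        # Hypothetical padding tokens marking sentence start and end.
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            trigrams[history + (padded[i],)] += 1
            histories[history] += 1
    return trigrams, histories


def trigram_prob(trigrams, histories, w1, w2, w3):
    """Unsmoothed maximum-likelihood estimate of P(w3 | w1, w2)."""
    seen = histories[(w1, w2)]
    return trigrams[(w1, w2, w3)] / seen if seen else 0.0


# Toy corpus standing in for text extracted from Web pages.
corpus = [["le", "chat", "dort"], ["le", "chat", "mange"]]
trigrams, histories = train_trigram(corpus)
p = trigram_prob(trigrams, histories, "le", "chat", "dort")
```

With only raw counts, any trigram absent from the corpus receives probability zero, which is one reason a larger corpus with wider lexical coverage helps such models.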

Keywords

#Internet #Natural languages #Speech recognition #Web server #Robots #HTML #Web sites #Data mining #Crawlers #Stochastic processes
