A language model adaptation using multiple varied corpora

H. Yamamoto¹, Y. Sagisaka¹
¹ATR Spoken Language Translation Research Laboratories, Soraku-gun, Kyoto, Japan

Abstract

A new language model adaptation scheme is proposed to cope with multiple varied speech recognition tasks. The proposed adaptation reflects both differences in topic and differences in sentence style that result from the speaker's role. Adaptation is carried out using two different language corpora, each of which matches the target in only one respect: the topic or the speaker's style. New word clustering techniques are introduced to extract the topic dependency and the style dependency separately. In this clustering, the word neighboring characteristics observed in the two adaptation source corpora are treated as different features. All words are classified into commonly used word classes and topic- or style-dependent classes. Furthermore, words that depend on the target topic and sentence style, together with their neighboring characteristics, are emphasized according to their frequency in the adaptation target data. In the evaluation experiment, the proposed method shows a 13% lower perplexity and a 9% lower word error rate in continuous speech recognition compared with the conventional adaptation method.
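
The idea of treating neighboring characteristics from the two source corpora as different features can be illustrated with a minimal sketch. This is not the authors' clustering algorithm, which the abstract does not specify; it only shows one possible way to keep left- and right-neighbor counts from a topic-matched corpus and a style-matched corpus in separate feature dimensions. The function names, the `"topic"`/`"style"` tags, and the toy sentences are all hypothetical.

```python
from collections import Counter, defaultdict


def neighbor_counts(sentences):
    """Count the left and right neighbors of every word in a corpus."""
    feats = defaultdict(Counter)
    for sent in sentences:
        # Sentence boundary markers are an assumption of this sketch.
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
            feats[word][("L", prev)] += 1
            feats[word][("R", nxt)] += 1
    return feats


def combined_features(topic_corpus, style_corpus):
    """Keep neighbor statistics from each corpus as separate dimensions."""
    topic = neighbor_counts(topic_corpus)
    style = neighbor_counts(style_corpus)
    combined = {}
    for word in set(topic) | set(style):
        vec = Counter()
        for key, n in topic.get(word, {}).items():
            vec[("topic",) + key] = n
        for key, n in style.get(word, {}).items():
            vec[("style",) + key] = n
        combined[word] = vec
    return combined


# Toy data, invented for illustration only.
topic_corpus = ["i want a room", "a room with a view"]
style_corpus = ["i want to ask", "may i ask you"]
features = combined_features(topic_corpus, style_corpus)
```

A word that appears in both corpora, such as `i` here, ends up with two independent groups of neighbor counts, so a clustering step applied to these vectors could in principle tell apart words whose contexts differ only in one of the two corpora.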

Keywords

#Adaptation model #Natural languages #Speech recognition #Data mining #Frequency #Error analysis #Vocabulary
