Robust language modeling for a small corpus of target tasks using class-combined word statistics and selective use of a general corpus

Yosuke Wada*, Norihiko Kobayashi, Tetsunori Kobayashi

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

To improve the accuracy of language models in speech recognition tasks for which collecting a large text corpus for language model training is difficult, we propose a class-combined bigram and selective use of general text. In the class-combined bigram, the word bigram and the class bigram are combined using weights that are expressed as functions of the preceding word frequency and the succeeding word-type count. An experiment has shown that the accuracy of the proposed class-combined bigram is equivalent to that of a word bigram trained with a text corpus approximately three times larger. In the selective use of general text, the language model was corrected by automatically selecting, from a large volume of text collected without specifying the task, sentences that were expected to produce better accuracy, and by adding these sentences to a small corpus of target tasks. An experiment has shown that the recognition error rate was reduced by up to 12% compared to a case in which text was not selected. Lastly, when we created a model that uses both the class-combined bigram and text addition, further improvements were obtained: approximately 34% in adjusted perplexity and approximately 31% in the recognition error rate compared to the word bigram created from the target task text only.
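As an illustration of the class-combined bigram described above, the following is a minimal Python sketch, not the authors' implementation. The abstract does not give the weighting function, so the sketch assumes a Witten-Bell-style weight, λ(w1) = c(w1) / (c(w1) + r(w1)), where c(w1) is the frequency of the preceding word and r(w1) is the number of distinct word types observed after it. The class name `ClassCombinedBigram`, the toy sentences, and the word-to-class mapping are all hypothetical.

```python
from collections import Counter, defaultdict


class ClassCombinedBigram:
    """Interpolates a word bigram with a class bigram (illustrative sketch)."""

    def __init__(self, sentences, word_to_class):
        self.word_to_class = word_to_class
        self.word_count = Counter()           # c(w)
        self.class_count = Counter()          # c(g)
        self.history_count = Counter()        # c(w1) as a bigram history
        self.bigram_count = Counter()         # c(w1, w2)
        self.followers = defaultdict(set)     # word types seen after w1
        self.class_history_count = Counter()  # c(g1) as a bigram history
        self.class_bigram_count = Counter()   # c(g1, g2)
        for sentence in sentences:
            for w in sentence:
                self.word_count[w] += 1
                self.class_count[word_to_class[w]] += 1
            for w1, w2 in zip(sentence, sentence[1:]):
                g1, g2 = word_to_class[w1], word_to_class[w2]
                self.history_count[w1] += 1
                self.bigram_count[(w1, w2)] += 1
                self.followers[w1].add(w2)
                self.class_history_count[g1] += 1
                self.class_bigram_count[(g1, g2)] += 1

    def weight(self, w1):
        """Assumed weight: c(w1) / (c(w1) + r(w1)); 0 for an unseen history."""
        c = self.history_count[w1]
        r = len(self.followers[w1]) if c else 0
        return c / (c + r) if c else 0.0

    def prob(self, w1, w2):
        """P(w2 | w1) as a weighted sum of word-bigram and class-bigram terms."""
        g1, g2 = self.word_to_class[w1], self.word_to_class[w2]
        c = self.history_count[w1]
        p_word = self.bigram_count[(w1, w2)] / c if c else 0.0
        cg = self.class_history_count[g1]
        p_class = self.class_bigram_count[(g1, g2)] / cg if cg else 0.0
        # P(w2 | g2): share of class g2's occurrences that are w2.
        p_member = self.word_count[w2] / self.class_count[g2]
        lam = self.weight(w1)
        return lam * p_word + (1.0 - lam) * p_class * p_member


# Toy target-task corpus and a hand-made class mapping (both hypothetical).
sentences = [
    ["i", "want", "tea"],
    ["i", "want", "coffee"],
    ["you", "like", "tea"],
]
word_to_class = {
    "i": "PRON", "you": "PRON",
    "want": "VERB", "like": "VERB",
    "tea": "NOUN", "coffee": "NOUN",
}
model = ClassCombinedBigram(sentences, word_to_class)
print(model.prob("like", "coffee"))
```

In this toy corpus the pair "like coffee" never occurs, so the word bigram alone would give it zero probability; the class bigram term (VERB followed by NOUN) supplies a nonzero estimate. Under the assumed weight, frequent histories with few distinct successors rely mostly on the word bigram, while rare or highly varied histories lean on the class bigram.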

Original language: English
Pages (from-to): 92-102
Number of pages: 11
Journal: Systems and Computers in Japan
Volume: 34
Issue number: 12
Publication status: Published - 2003 Nov 15

Keywords

  • Class N-gram
  • Language model
  • Large-vocabulary continuous speech recognition
  • Task adaptation

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Information Systems
  • Hardware and Architecture
  • Computational Theory and Mathematics
