Abstract
A new language model is proposed to cope with the scarcity of training data. The proposed multi-class N-gram achieves accurate word prediction and high reliability with a small number of model parameters by clustering words multi-dimensionally into classes, in which the left and right contexts are treated independently. Each of the multiple classes is assigned by a grouping process based on the word's left and right neighboring characteristics. Furthermore, by introducing frequent word successions to partially include higher-order statistics, multi-class N-grams are extended to more efficient multi-class composite N-grams. Compared with conventional word tri-grams, the multi-class composite N-grams achieved 9.5% lower perplexity and a 16% lower word error rate in a speech recognition experiment, with a 40% smaller parameter size.
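As a rough illustration of the class factorization described in the abstract, the sketch below approximates a word bigram probability through two independent word-to-class maps: one characterizing a word by its right-context behaviour (as a predecessor) and one by its left-context behaviour (as a successor). The class labels, toy corpus, and helper functions here are hypothetical and for illustration only; they are not the paper's trained clusters or its actual estimation procedure.

```python
from collections import defaultdict

# Hypothetical class assignments, for illustration only.
# left_class[w]:  class of w with respect to its left context (behaviour as a successor)
# right_class[w]: class of w with respect to its right context (behaviour as a predecessor)
left_class = {"the": "DET_L", "cat": "NOUN_L", "dog": "NOUN_L", "sat": "VERB_L"}
right_class = {"the": "DET_R", "cat": "NOUN_R", "dog": "NOUN_R", "sat": "VERB_R"}

# Toy count tables; in practice these would be estimated from a training corpus.
class_bigram_counts = defaultdict(int)   # (right class of w_{i-1}, left class of w_i)
class_unigram_counts = defaultdict(int)  # right class of w_{i-1}
word_in_class_counts = defaultdict(int)  # (left class of w_i, w_i)
left_class_counts = defaultdict(int)     # left class of w_i


def train(corpus):
    """Accumulate class-level and word-level counts from word sequences."""
    for sentence in corpus:
        for prev, cur in zip(sentence, sentence[1:]):
            cr, cl = right_class[prev], left_class[cur]
            class_bigram_counts[(cr, cl)] += 1
            class_unigram_counts[cr] += 1
            word_in_class_counts[(cl, cur)] += 1
            left_class_counts[cl] += 1


def prob(prev, cur):
    """P(cur | prev) ~= P(C_left(cur) | C_right(prev)) * P(cur | C_left(cur))."""
    cr, cl = right_class[prev], left_class[cur]
    p_class_transition = class_bigram_counts[(cr, cl)] / class_unigram_counts[cr]
    p_word_given_class = word_in_class_counts[(cl, cur)] / left_class_counts[cl]
    return p_class_transition * p_word_given_class


train([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(prob("the", "cat"))  # probability under the toy multi-class bigram
```

Because the predecessor and successor roles use separate class maps, the class inventory for each direction can be chosen independently, which is what keeps the parameter count small relative to a word-level N-gram.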
Original language | English |
---|---|
Pages (from-to) | 369-379 |
Number of pages | 11 |
Journal | Speech Communication |
Volume | 41 |
Issue number | 2-3 |
DOIs | |
Publication status | Published - 2003 Oct |
Externally published | Yes |
Keywords
- Class N-gram
- N-gram language model
- Variable length N-gram
- Word clustering
ASJC Scopus subject areas
- Software
- Modelling and Simulation
- Communication
- Language and Linguistics
- Linguistics and Language
- Computer Vision and Pattern Recognition
- Computer Science Applications