TY - GEN
T1 - Multi-level language modeling and decoding for open vocabulary end-to-end speech recognition
AU - Hori, Takaaki
AU - Watanabe, Shinji
AU - Hershey, John R.
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2018/1/24
Y1 - 2018/1/24
N2 - We propose a combination of character-based and word-based language models in an end-to-end automatic speech recognition (ASR) architecture. In our prior work, we combined a character-based LSTM RNN-LM with a hybrid attention/connectionist temporal classification (CTC) architecture. The character LMs improved recognition accuracy to rival state-of-the-art DNN/HMM systems in Japanese and Mandarin Chinese tasks. Although a character-based architecture can provide for open vocabulary recognition, character-based LMs generally underperform relative to word LMs for languages such as English with a small alphabet, because of the difficulty of modeling linguistic constraints across long sequences of characters. This paper presents a novel method for end-to-end ASR decoding with LMs at both the character and word level. Hypotheses are first scored with the character-based LM until a word boundary is encountered. Known words are then re-scored using the word-based LM, while the character-based LM provides for out-of-vocabulary scores. In a standard Wall Street Journal (WSJ) task, we achieved 5.6% WER for the Eval'92 test set using only the SI284 training set and WSJ text data, which is the best score reported for end-to-end ASR systems on this benchmark.
KW - End-to-end speech recognition
KW - attention decoder
KW - connectionist temporal classification
KW - decoding
KW - language modeling
UR - http://www.scopus.com/inward/record.url?scp=85050529645&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85050529645&partnerID=8YFLogxK
U2 - 10.1109/ASRU.2017.8268948
DO - 10.1109/ASRU.2017.8268948
M3 - Conference contribution
AN - SCOPUS:85050529645
T3 - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
SP - 287
EP - 293
BT - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017
Y2 - 16 December 2017 through 20 December 2017
ER -