TY - GEN
T1 - An End-to-End Language-Tracking Speech Recognizer for Mixed-Language Speech
AU - Seki, Hiroshi
AU - Watanabe, Shinji
AU - Hori, Takaaki
AU - Roux, Jonathan Le
AU - Hershey, John R.
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/9/10
Y1 - 2018/9/10
N2 - End-to-end automatic speech recognition (ASR) can significantly reduce the burden of developing ASR systems for new languages, by eliminating the need for linguistic information such as pronunciation dictionaries. This also creates an opportunity to build a monolithic multilingual ASR system with a language-independent neural network architecture. In our previous work, we proposed a monolithic neural network architecture that can recognize multiple languages, and showed its effectiveness compared with conventional language-dependent models. However, the model is not guaranteed to properly handle switches in language within an utterance, thus lacking the flexibility to recognize mixed-language speech such as code-switching. In this paper, we extend our model to enable dynamic tracking of the language within an utterance, and propose a training procedure that takes advantage of a newly created mixed-language speech corpus. Experimental results show that the extended model outperforms both language-dependent models and our previous model without suffering from performance degradation that could be associated with language switching.
AB - End-to-end automatic speech recognition (ASR) can significantly reduce the burden of developing ASR systems for new languages, by eliminating the need for linguistic information such as pronunciation dictionaries. This also creates an opportunity to build a monolithic multilingual ASR system with a language-independent neural network architecture. In our previous work, we proposed a monolithic neural network architecture that can recognize multiple languages, and showed its effectiveness compared with conventional language-dependent models. However, the model is not guaranteed to properly handle switches in language within an utterance, thus lacking the flexibility to recognize mixed-language speech such as code-switching. In this paper, we extend our model to enable dynamic tracking of the language within an utterance, and propose a training procedure that takes advantage of a newly created mixed-language speech corpus. Experimental results show that the extended model outperforms both language-dependent models and our previous model without suffering from performance degradation that could be associated with language switching.
KW - End-to-end ASR
KW - Hybrid attention/CTC
KW - Language identification
KW - Language-independent architecture
KW - Multilingual ASR
UR - http://www.scopus.com/inward/record.url?scp=85054263334&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85054263334&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2018.8462180
DO - 10.1109/ICASSP.2018.8462180
M3 - Conference contribution
AN - SCOPUS:85054263334
SN - 9781538646588
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 4919
EP - 4923
BT - 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018
Y2 - 15 April 2018 through 20 April 2018
ER -