TY - JOUR
T1 - Pretraining by backtranslation for end-to-end ASR in low-resource settings
AU - Wiesner, Matthew
AU - Renduchintala, Adithya
AU - Watanabe, Shinji
AU - Liu, Chunxi
AU - Dehak, Najim
AU - Khudanpur, Sanjeev
N1 - Funding Information:
This work was supported by DARPA LORELEI Grant No. HR0011-15-2-0024 and partially carried out during the 2018 Jelinek Memorial Summer Workshop on Speech and Language Technologies, supported by gifts from Microsoft, Amazon, Google, Facebook, and MERL/Mitsubishi Electric.
Publisher Copyright:
Copyright © 2019 ISCA
PY - 2019
Y1 - 2019
AB - We explore training attention-based encoder-decoder ASR in low-resource settings. These models perform poorly when trained on small amounts of transcribed speech, in part because they depend on having sufficient target-side text to train the attention and decoder networks. In this paper we address this shortcoming by pretraining our network parameters using only text-based data and transcribed speech from other languages. We analyze the relative contributions of both sources of data. Across 3 test languages, our text-based approach resulted in a 20% average relative improvement over a text-based augmentation technique without pretraining. Using transcribed speech from nearby languages gives a further 20-30% relative reduction in character error rate.
KW - Encoder-decoder
KW - Low-resource
KW - Multi-modal data augmentation
KW - Multilingual ASR
KW - Pretraining
UR - http://www.scopus.com/inward/record.url?scp=85074688539&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074688539&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2019-3254
DO - 10.21437/Interspeech.2019-3254
M3 - Conference article
AN - SCOPUS:85074688539
SN - 2308-457X
VL - 2019-September
SP - 4375
EP - 4379
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019
Y2 - 15 September 2019 through 19 September 2019
ER -