TY - GEN
T1 - Semi-supervised End-to-end Speech Recognition Using Text-to-speech and Autoencoders
AU - Karita, Shigeki
AU - Watanabe, Shinji
AU - Iwata, Tomoharu
AU - Delcroix, Marc
AU - Ogawa, Atsunori
AU - Nakatani, Tomohiro
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/5
Y1 - 2019/5
N2 - We introduce speech and text autoencoders that share encoders and decoders with an automatic speech recognition (ASR) model to improve ASR performance with large speech-only and text-only training datasets. To build the speech and text autoencoders, we leverage state-of-the-art ASR and text-to-speech (TTS) encoder-decoder architectures. These autoencoders learn features from speech-only and text-only datasets by switching the encoders and decoders used in the ASR and TTS models. Simultaneously, a multi-task loss encourages the encoded features to remain compatible with the ASR and TTS models. Additionally, we anticipate that joint TTS training can also improve ASR performance because both the ASR and TTS models learn transformations between speech and text. In experiments with our semi-supervised end-to-end ASR/TTS training, retraining a model initially trained on a small paired subset of the LibriSpeech corpus with a large unpaired subset of the corpus reduced the character error rate from 10.4% to 8.4% and the word error rate from 20.6% to 18.0%.
AB - We introduce speech and text autoencoders that share encoders and decoders with an automatic speech recognition (ASR) model to improve ASR performance with large speech-only and text-only training datasets. To build the speech and text autoencoders, we leverage state-of-the-art ASR and text-to-speech (TTS) encoder-decoder architectures. These autoencoders learn features from speech-only and text-only datasets by switching the encoders and decoders used in the ASR and TTS models. Simultaneously, a multi-task loss encourages the encoded features to remain compatible with the ASR and TTS models. Additionally, we anticipate that joint TTS training can also improve ASR performance because both the ASR and TTS models learn transformations between speech and text. In experiments with our semi-supervised end-to-end ASR/TTS training, retraining a model initially trained on a small paired subset of the LibriSpeech corpus with a large unpaired subset of the corpus reduced the character error rate from 10.4% to 8.4% and the word error rate from 20.6% to 18.0%.
KW - autoencoder
KW - encoder-decoder
KW - semi-supervised learning
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85068998276&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85068998276&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2019.8682890
DO - 10.1109/ICASSP.2019.8682890
M3 - Conference contribution
AN - SCOPUS:85068998276
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 6166
EP - 6170
BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Y2 - 12 May 2019 through 17 May 2019
ER -