TY - JOUR
T1 - Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
AU - Higuchi, Yosuke
AU - Watanabe, Shinji
AU - Chen, Nanxin
AU - Ogawa, Tetsuji
AU - Kobayashi, Tetsunori
N1 - Publisher Copyright:
© 2020 ISCA
PY - 2020
Y1 - 2020
N2 - We present Mask CTC, a novel non-autoregressive end-to-end automatic speech recognition (ASR) framework, which generates a sequence by refining the outputs of connectionist temporal classification (CTC). Neural sequence-to-sequence models are usually autoregressive: each output token is generated by conditioning on previously generated tokens, at the cost of requiring as many iterations as the output length. On the other hand, non-autoregressive models can generate tokens simultaneously within a constant number of iterations, which results in a significant reduction in inference time and better suits end-to-end ASR models to real-world scenarios. In this work, the Mask CTC model is trained using a Transformer encoder-decoder with joint training of mask prediction and CTC. During inference, the target sequence is initialized with the greedy CTC outputs, and low-confidence tokens are masked based on the CTC probabilities. Exploiting the conditional dependence between output tokens, these masked low-confidence tokens are then predicted conditioned on the high-confidence tokens. Experimental results on different speech recognition tasks show that Mask CTC outperforms the standard CTC model (e.g., 17.9% → 12.1% WER on WSJ) and approaches the autoregressive model while requiring much less inference time on CPUs (0.07 RTF in a Python implementation). All of our code is publicly available at https://github.com/espnet/espnet.
AB - We present Mask CTC, a novel non-autoregressive end-to-end automatic speech recognition (ASR) framework, which generates a sequence by refining the outputs of connectionist temporal classification (CTC). Neural sequence-to-sequence models are usually autoregressive: each output token is generated by conditioning on previously generated tokens, at the cost of requiring as many iterations as the output length. On the other hand, non-autoregressive models can generate tokens simultaneously within a constant number of iterations, which results in a significant reduction in inference time and better suits end-to-end ASR models to real-world scenarios. In this work, the Mask CTC model is trained using a Transformer encoder-decoder with joint training of mask prediction and CTC. During inference, the target sequence is initialized with the greedy CTC outputs, and low-confidence tokens are masked based on the CTC probabilities. Exploiting the conditional dependence between output tokens, these masked low-confidence tokens are then predicted conditioned on the high-confidence tokens. Experimental results on different speech recognition tasks show that Mask CTC outperforms the standard CTC model (e.g., 17.9% → 12.1% WER on WSJ) and approaches the autoregressive model while requiring much less inference time on CPUs (0.07 RTF in a Python implementation). All of our code is publicly available at https://github.com/espnet/espnet.
KW - Connectionist temporal classification
KW - End-to-end speech recognition
KW - Non-autoregressive
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85098143456&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098143456&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-2404
DO - 10.21437/Interspeech.2020-2404
M3 - Conference article
AN - SCOPUS:85098143456
SN - 2308-457X
VL - 2020-October
SP - 3655
EP - 3659
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Y2 - 25 October 2020 through 29 October 2020
ER -