TY - CPAPER
T1 - Fast-MD: Fast Multi-decoder End-to-end Speech Translation with Non-autoregressive Hidden Intermediates
T2 - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021
AU - Inaguma, Hirofumi
AU - Dalmia, Siddharth
AU - Yan, Brian
AU - Watanabe, Shinji
N1 - Funding Information:
This work was partly supported by ASAPP and JHU HLTCOE. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) [69], which is supported by NSF grant number ACI-1548562. Specifically, it used the Bridges system [70], supported by NSF grant number ACI-1445606, at the Pittsburgh Supercomputing Center. We also thank Jumon Nozaki for helpful discussions.
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - The multi-decoder (MD) end-to-end speech translation model has demonstrated high translation quality by searching for better intermediate automatic speech recognition (ASR) decoder states as hidden intermediates (HI). It is a two-pass decoding model that decomposes the overall task into ASR and machine translation sub-tasks. However, its decoding speed is not fast enough for real-world applications because it conducts beam search for both sub-tasks during inference. We propose Fast-MD, a fast MD model that generates HI by non-autoregressive (NAR) decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder. We investigated two types of NAR HI: (1) parallel HI, generated with an autoregressive Transformer ASR decoder, and (2) masked HI, generated with Mask-CTC, which combines CTC and the conditional masked language model. To reduce the mismatch in the ASR decoder between teacher forcing during training and conditioning on CTC outputs during testing, we also propose sampling CTC outputs during training. Experimental evaluations on three corpora show that Fast-MD achieved about 2× and 4× faster decoding than the naïve MD model on GPU and CPU, respectively, with comparable translation quality. Adopting the Conformer encoder and an intermediate CTC loss further boosts translation quality without sacrificing decoding speed.
AB - The multi-decoder (MD) end-to-end speech translation model has demonstrated high translation quality by searching for better intermediate automatic speech recognition (ASR) decoder states as hidden intermediates (HI). It is a two-pass decoding model that decomposes the overall task into ASR and machine translation sub-tasks. However, its decoding speed is not fast enough for real-world applications because it conducts beam search for both sub-tasks during inference. We propose Fast-MD, a fast MD model that generates HI by non-autoregressive (NAR) decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder. We investigated two types of NAR HI: (1) parallel HI, generated with an autoregressive Transformer ASR decoder, and (2) masked HI, generated with Mask-CTC, which combines CTC and the conditional masked language model. To reduce the mismatch in the ASR decoder between teacher forcing during training and conditioning on CTC outputs during testing, we also propose sampling CTC outputs during training. Experimental evaluations on three corpora show that Fast-MD achieved about 2× and 4× faster decoding than the naïve MD model on GPU and CPU, respectively, with comparable translation quality. Adopting the Conformer encoder and an intermediate CTC loss further boosts translation quality without sacrificing decoding speed.
KW - CTC
KW - End-to-end speech translation
KW - Mask-CTC
KW - multi-decoder
KW - non-autoregressive decoding
UR - http://www.scopus.com/inward/record.url?scp=85126768966&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85126768966&partnerID=8YFLogxK
U2 - 10.1109/ASRU51503.2021.9687894
DO - 10.1109/ASRU51503.2021.9687894
M3 - Conference contribution
AN - SCOPUS:85126768966
T3 - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings
SP - 922
EP - 929
BT - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 13 December 2021 through 17 December 2021
ER -