TY - GEN
T1 - A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation
AU - Higuchi, Yosuke
AU - Chen, Nanxin
AU - Fujita, Yuya
AU - Inaguma, Hirofumi
AU - Komatsu, Tatsuya
AU - Lee, Jaesong
AU - Nozaki, Jumon
AU - Wang, Tianzi
AU - Watanabe, Shinji
N1 - Funding Information:
This work was partly supported by ASAPP. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) [65], which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system [66], which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC). The authors would like to thank Linhao Dong and Florian Boyer for helpful discussions.
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly reduces inference time at the cost of an accuracy drop compared to autoregressive (AR) baselines. Showing great potential for real-time applications, an increasing number of NAR models have been explored in different fields to mitigate the performance gap against AR models. In this work, we conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR). Experiments are performed in the state-of-the-art setting using ESPnet. The results on various tasks provide interesting findings for developing an understanding of NAR ASR, such as the accuracy-speed trade-off and robustness against long-form utterances. We also show that the techniques can be combined for further improvement and applied to NAR end-to-end speech translation. All the implementations are publicly available to encourage further research in NAR speech processing.
AB - Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly reduces inference time at the cost of an accuracy drop compared to autoregressive (AR) baselines. Showing great potential for real-time applications, an increasing number of NAR models have been explored in different fields to mitigate the performance gap against AR models. In this work, we conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR). Experiments are performed in the state-of-the-art setting using ESPnet. The results on various tasks provide interesting findings for developing an understanding of NAR ASR, such as the accuracy-speed trade-off and robustness against long-form utterances. We also show that the techniques can be combined for further improvement and applied to NAR end-to-end speech translation. All the implementations are publicly available to encourage further research in NAR speech processing.
KW - Non-autoregressive sequence generation
KW - end-to-end speech recognition
KW - end-to-end speech translation
UR - http://www.scopus.com/inward/record.url?scp=85125788407&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85125788407&partnerID=8YFLogxK
U2 - 10.1109/ASRU51503.2021.9688157
DO - 10.1109/ASRU51503.2021.9688157
M3 - Conference contribution
AN - SCOPUS:85125788407
T3 - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings
SP - 47
EP - 54
BT - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021
Y2 - 13 December 2021 through 17 December 2021
ER -