Improved mask-ctc for non-autoregressive end-to-end asr

Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, Tetsunori Kobayashi

研究成果: Conference article査読

36 被引用数 (Scopus)

抄録

For real-world deployment of automatic speech recognition (ASR), the system is desired to be capable of fast inference while relieving the requirement of computational resources. The recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC), Mask-CTC, fulfills this demand by generating tokens in a non-autoregressive fashion. While Mask- CTC achieves remarkably fast inference speed, its recognition performance falls behind that of conventional autoregressive (AR) systems. To boost the performance of Mask-CTC, we first propose to enhance the encoder network architecture by employing a recently proposed architecture called Conformer. Next, we propose new training and decoding methods by introducing auxiliary objective to predict the length of a partial target sequence, which allows the model to delete or insert tokens during inference. Experimental results on different ASR tasks show that the proposed approaches improve Mask-CTC significantly, outperforming a standard CTC model (15.5%!9.1% WER on WSJ). Moreover, Mask- CTC now achieves competitive results to AR models with no degradation of inference speed (< 0.1 RTF using CPU). We also show a potential application of Mask-CTC to end-to-end speech translation.

本文言語English
ページ(範囲)8363-8367
ページ数5
ジャーナルICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
2021-June
DOI
出版ステータスPublished - 2021
イベント2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021 - Virtual, Toronto, Canada
継続期間: 2021 6月 62021 6月 11

ASJC Scopus subject areas

  • ソフトウェア
  • 信号処理
  • 電子工学および電気工学

フィンガープリント

「Improved mask-ctc for non-autoregressive end-to-end asr」の研究トピックを掘り下げます。これらがまとまってユニークなフィンガープリントを構成します。

引用スタイル