Improving End-to-End Single-Channel Multi-Talker Speech Recognition

Wangyou Zhang, Xuankai Chang, Yanmin Qian*, Shinji Watanabe

*Corresponding author for this work

Research output: Article (peer-reviewed)

13 Citations (Scopus)

Abstract

Although significant progress has been made in single-talker automatic speech recognition (ASR), a large performance gap remains between multi-talker and single-talker speech recognition systems. In this article, we propose an enhanced end-to-end monaural multi-talker ASR architecture and training strategy for recognizing overlapped speech. The single-talker end-to-end model is extended to a multi-talker architecture with permutation invariant training (PIT). Several methods are designed to enhance system performance: speaker parallel attention, scheduled sampling, curriculum learning, and knowledge distillation. Specifically, speaker parallel attention replaces the single shared attention module with a separate attention module for each speaker, which improves the model's ability to trace and separate speakers. Scheduled sampling and curriculum learning are then applied so that the model is better optimized. Finally, knowledge distillation transfers knowledge from an original single-speaker model to the multi-speaker model within the proposed end-to-end multi-talker ASR structure. The proposed architectures are evaluated and compared on artificially mixed speech datasets generated from the WSJ0 reading corpus. The experiments demonstrate that the proposed architectures significantly improve multi-talker mixed speech recognition. The final system obtains relative gains of more than 15% in both character error rate (CER) and word error rate (WER) compared to the basic end-to-end multi-talker ASR system.
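
The abstract names permutation invariant training (PIT) without spelling out the mechanism. As a rough illustration only, and not the authors' implementation, the sketch below assumes a precomputed table of per-pair losses (the hypothetical `pair_loss[i][j]`, the loss of output stream i scored against reference transcript j) and picks the output-to-reference assignment with the smallest total loss. The function name `pit_loss` and the toy numbers are made up for this example.

```python
from itertools import permutations


def pit_loss(pair_loss):
    """Return (minimum total loss, best permutation).

    pair_loss[i][j] is assumed to hold the loss of output stream i
    when it is scored against reference j. perm[i] is the reference
    assigned to output stream i.
    """
    n = len(pair_loss)
    best_total, best_perm = float("inf"), None
    # Exhaustive search over all assignments; fine for 2-3 speakers.
    for perm in permutations(range(n)):
        total = sum(pair_loss[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm


# Toy two-speaker case: the swapped assignment is cheaper.
toy = [[5.0, 1.0],
       [0.5, 4.0]]
loss, perm = pit_loss(toy)
```

In training, only the loss under the selected assignment would be backpropagated, so the model is not penalized for emitting the speakers in a different order than the references.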

Original language: English
Article number: 9072433
Pages (from-to): 1385-1394
Number of pages: 10
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Volume: 28
DOI
Publication status: Published - 2020
Externally published: Yes

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering
