TY - JOUR
T1 - Improving End-to-End Single-Channel Multi-Talker Speech Recognition
AU - Zhang, Wangyou
AU - Chang, Xuankai
AU - Qian, Yanmin
AU - Watanabe, Shinji
N1 - Funding Information:
Manuscript received December 2, 2019; revised March 7, 2020; accepted April 13, 2020. Date of publication April 20, 2020; date of current version May 14, 2020. This work was supported by the China NSFC Project No. U1736202. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Jinyu Li. (Corresponding author: Yanmin Qian.) Wangyou Zhang and Yanmin Qian are with the SpeechLab, Department of Computer Science and Engineering & MoE Key Laboratory of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: wyz-97@sjtu.edu.cn; yanminqian@sjtu.edu.cn).
Publisher Copyright:
© 2014 IEEE.
PY - 2020
Y1 - 2020
N2 - Although significant progress has been made in single-talker automatic speech recognition (ASR), there is still a large performance gap between multi-talker and single-talker speech recognition systems. In this article, we propose an enhanced end-to-end monaural multi-talker ASR architecture and training strategy to recognize overlapped speech. The single-talker end-to-end model is extended to a multi-talker architecture with permutation invariant training (PIT). Several methods are designed to enhance system performance, including speaker parallel attention, scheduled sampling, curriculum learning, and knowledge distillation. More specifically, the speaker parallel attention extends the basic single shared attention module into multiple attention modules, one per speaker, which enhances the tracing and separation ability. Then, scheduled sampling and curriculum learning are applied to better optimize the model. Finally, knowledge distillation transfers knowledge from an original single-speaker model to the current multi-speaker model in the proposed end-to-end multi-talker ASR structure. Our proposed architectures are evaluated and compared on artificially mixed speech datasets generated from the WSJ0 reading corpus. The experiments demonstrate that our proposed architectures significantly improve multi-talker mixed speech recognition. The final system obtains more than 15% relative performance gains in both character error rate (CER) and word error rate (WER) compared to the basic end-to-end multi-talker ASR system.
AB - Although significant progress has been made in single-talker automatic speech recognition (ASR), there is still a large performance gap between multi-talker and single-talker speech recognition systems. In this article, we propose an enhanced end-to-end monaural multi-talker ASR architecture and training strategy to recognize overlapped speech. The single-talker end-to-end model is extended to a multi-talker architecture with permutation invariant training (PIT). Several methods are designed to enhance system performance, including speaker parallel attention, scheduled sampling, curriculum learning, and knowledge distillation. More specifically, the speaker parallel attention extends the basic single shared attention module into multiple attention modules, one per speaker, which enhances the tracing and separation ability. Then, scheduled sampling and curriculum learning are applied to better optimize the model. Finally, knowledge distillation transfers knowledge from an original single-speaker model to the current multi-speaker model in the proposed end-to-end multi-talker ASR structure. Our proposed architectures are evaluated and compared on artificially mixed speech datasets generated from the WSJ0 reading corpus. The experiments demonstrate that our proposed architectures significantly improve multi-talker mixed speech recognition. The final system obtains more than 15% relative performance gains in both character error rate (CER) and word error rate (WER) compared to the basic end-to-end multi-talker ASR system.
KW - multi-talker mixed speech recognition
KW - curriculum learning
KW - end-to-end model
KW - knowledge distillation
KW - permutation invariant training
UR - http://www.scopus.com/inward/record.url?scp=85085593456&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85085593456&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2020.2988423
DO - 10.1109/TASLP.2020.2988423
M3 - Article
AN - SCOPUS:85085593456
SN - 2329-9290
VL - 28
SP - 1385
EP - 1394
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
M1 - 9072433
ER -