TY - GEN
T1 - End-to-end Monaural Multi-speaker ASR System without Pretraining
AU - Chang, Xuankai
AU - Qian, Yanmin
AU - Yu, Kai
AU - Watanabe, Shinji
N1 - Funding Information:
Xuankai Chang, Yanmin Qian and Kai Yu were supported by the China NSFC project (No. 61603252), the China NSFC project (No. U1736202) and the Shanghai Sailing Program (No. 16YF1405300).
Publisher Copyright:
© 2019 IEEE.
PY - 2019/5
Y1 - 2019/5
AB - Recently, end-to-end models have become a popular alternative to traditional hybrid models in automatic speech recognition (ASR). Multi-speaker speech separation and recognition is a central task in the cocktail party problem. In this paper, we present a state-of-the-art monaural multi-speaker end-to-end ASR model. In contrast to previous studies on monaural multi-speaker speech recognition, this end-to-end framework is trained to recognize multiple label sequences completely from scratch. The system requires only the speech mixture and the corresponding label sequences, without needing any indeterminate supervision obtained from non-mixture speech or corresponding labels/alignments. Moreover, we explored using an individual attention module for each separated speaker, together with scheduled sampling, to further improve performance. Finally, we evaluate the proposed model on 2-speaker mixed speech generated from the WSJ corpus and on the wsj0-2mix dataset, a speech separation and recognition benchmark. The experiments demonstrate that the proposed methods improve the performance of the end-to-end model in separating overlapping speech and recognizing the separated streams. The proposed model yields ∼10.0% relative performance gains in terms of both CER and WER.
KW - CTC
KW - Cocktail party problem
KW - attention mechanism
KW - end-to-end speech recognition
KW - multi-speaker speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85068976427&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85068976427&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2019.8682822
DO - 10.1109/ICASSP.2019.8682822
M3 - Conference contribution
AN - SCOPUS:85068976427
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 6256
EP - 6260
BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Y2 - 12 May 2019 through 17 May 2019
ER -