TY - GEN
T1 - MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition
T2 - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019
AU - Chang, Xuankai
AU - Zhang, Wangyou
AU - Qian, Yanmin
AU - Le Roux, Jonathan
AU - Watanabe, Shinji
N1 - Funding Information:
Wangyou Zhang and Yanmin Qian were supported by the China NSFC projects (No. 61603252 and No. U1736202).
Publisher Copyright:
© 2019 IEEE.
PY - 2019/12
Y1 - 2019/12
N2 - Recently, the end-to-end approach has proven its efficacy in monaural multi-speaker speech recognition. However, high word error rates (WERs) still prevent these systems from being used in practical applications. On the other hand, the spatial information in multi-channel signals has proven helpful in far-field speech recognition tasks. In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq framework to handle multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition. MIMO-Speech is a fully neural end-to-end framework optimized only via an ASR criterion. It comprises: 1) a monaural masking network, 2) a multi-source neural beamformer, and 3) a multi-output speech recognition model. With this processing, the input overlapped speech is directly mapped to text sequences. We further adopt a curriculum learning strategy to make the best use of the training set and improve performance. Experiments on the spatialized wsj1-2mix corpus show that our model achieves more than 60% WER reduction compared to the single-channel system, with high-quality enhanced signals (SI-SDR = 23.1 dB) obtained by the separation function.
KW - Overlapped speech recognition
KW - curriculum learning
KW - end-to-end
KW - neural beamforming
KW - speech separation
UR - http://www.scopus.com/inward/record.url?scp=85081575244&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85081575244&partnerID=8YFLogxK
U2 - 10.1109/ASRU46091.2019.9003986
DO - 10.1109/ASRU46091.2019.9003986
M3 - Conference contribution
AN - SCOPUS:85081575244
T3 - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
SP - 237
EP - 244
BT - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 15 December 2019 through 18 December 2019
ER -