TY - JOUR
T1 - End-to-end dereverberation, beamforming, and speech recognition with improved numerical stability and advanced frontend
AU - Zhang, Wangyou
AU - Boeddeker, Christoph
AU - Watanabe, Shinji
AU - Nakatani, Tomohiro
AU - Delcroix, Marc
AU - Kinoshita, Keisuke
AU - Ochiai, Tsubasa
AU - Kamo, Naoyuki
AU - Haeb-Umbach, Reinhold
AU - Qian, Yanmin
N1 - Funding Information:
Wangyou Zhang and Yanmin Qian were supported by the China NSFC projects (No. 62071288 and U1736202). The work reported here was started at JSALT 2020 at JHU, with support from Microsoft, Amazon and Google. Experiments were carried out on the PI supercomputers at Shanghai Jiao Tong University.
Publisher Copyright:
© 2021 IEEE
PY - 2021
Y1 - 2021
N2 - Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions. However, severe performance degradation is still observed in reverberant and noisy scenarios, and there is still a large performance gap between anechoic and reverberant conditions. In this work, we focus on the multichannel multi-speaker reverberant condition, and propose to extend our previous framework for end-to-end dereverberation, beamforming, and speech recognition with improved numerical stability and advanced frontend subnetworks, including voice activity detection (VAD)-like masks. These techniques significantly stabilize the end-to-end training process. Experiments on the spatialized wsj1-2mix corpus show that the proposed system achieves about 35% relative WER reduction compared to our conventional multichannel E2E ASR system, and also obtains decent speech dereverberation and separation performance (SDR = 12.5 dB) in the reverberant multi-speaker condition while trained only with the ASR criterion.
AB - Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions. However, severe performance degradation is still observed in reverberant and noisy scenarios, and there is still a large performance gap between anechoic and reverberant conditions. In this work, we focus on the multichannel multi-speaker reverberant condition, and propose to extend our previous framework for end-to-end dereverberation, beamforming, and speech recognition with improved numerical stability and advanced frontend subnetworks, including voice activity detection (VAD)-like masks. These techniques significantly stabilize the end-to-end training process. Experiments on the spatialized wsj1-2mix corpus show that the proposed system achieves about 35% relative WER reduction compared to our conventional multichannel E2E ASR system, and also obtains decent speech dereverberation and separation performance (SDR = 12.5 dB) in the reverberant multi-speaker condition while trained only with the ASR criterion.
KW - Cocktail party problem
KW - Dereverberation
KW - Neural beamformer
KW - Overlapped speech recognition
KW - Speech separation
UR - http://www.scopus.com/inward/record.url?scp=85113834749&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85113834749&partnerID=8YFLogxK
U2 - 10.1109/ICASSP39728.2021.9414464
DO - 10.1109/ICASSP39728.2021.9414464
M3 - Conference article
AN - SCOPUS:85113834749
SN - 0736-7791
VL - 2021-June
SP - 6898
EP - 6902
JO - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
JF - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
T2 - 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021
Y2 - 6 June 2021 through 11 June 2021
ER -