TY - GEN
T1 - Closing the Gap between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions
AU - Zhang, Wangyou
AU - Shi, Jing
AU - Li, Chenda
AU - Watanabe, Shinji
AU - Qian, Yanmin
N1 - Funding Information:
The authors would like to thank Dr. Tsubasa Ochiai for his helpful comments about the gap between Beam-TasNet enhancement results on real and simulation conditions. This work was supported by the China NSFC projects (No. 62071288 and No. U1736202) and Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102). Experiments have been carried out on the PI super-computer at Shanghai Jiao Tong University.
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Deep learning-based time-domain models, e.g. Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement. However, many experiments on time-domain speech enhancement models are conducted under simulated conditions, and it is not well studied whether their good performance generalizes to real-world scenarios. In this paper, we aim to provide an insightful investigation of applying multi-channel Conv-TasNet based speech enhancement to both simulated and real data. Our preliminary experiments show a large gap between the two conditions in terms of ASR performance. Several approaches are applied to close this gap, including the integration of multi-channel Conv-TasNet into the beamforming model with various strategies, and the joint training of speech enhancement and speech recognition models. Our experiments on the CHiME-4 corpus show that the proposed approaches can greatly reduce the speech recognition performance discrepancy between simulated and real data, while preserving the strong speech enhancement capability of the frontend.
AB - Deep learning-based time-domain models, e.g. Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement. However, many experiments on time-domain speech enhancement models are conducted under simulated conditions, and it is not well studied whether their good performance generalizes to real-world scenarios. In this paper, we aim to provide an insightful investigation of applying multi-channel Conv-TasNet based speech enhancement to both simulated and real data. Our preliminary experiments show a large gap between the two conditions in terms of ASR performance. Several approaches are applied to close this gap, including the integration of multi-channel Conv-TasNet into the beamforming model with various strategies, and the joint training of speech enhancement and speech recognition models. Our experiments on the CHiME-4 corpus show that the proposed approaches can greatly reduce the speech recognition performance discrepancy between simulated and real data, while preserving the strong speech enhancement capability of the frontend.
KW - automatic speech recognition
KW - beamforming
KW - multi-channel speech enhancement
KW - time domain
UR - http://www.scopus.com/inward/record.url?scp=85123427507&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123427507&partnerID=8YFLogxK
U2 - 10.1109/WASPAA52581.2021.9632720
DO - 10.1109/WASPAA52581.2021.9632720
M3 - Conference contribution
AN - SCOPUS:85123427507
T3 - IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
SP - 146
EP - 150
BT - 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2021
Y2 - 17 October 2021 through 20 October 2021
ER -