TY - JOUR
T1 - Dual-path modeling for long recording speech separation in meetings
AU - Li, Chenda
AU - Chen, Zhuo
AU - Luo, Yi
AU - Han, Cong
AU - Zhou, Tianyan
AU - Kinoshita, Keisuke
AU - Delcroix, Marc
AU - Watanabe, Shinji
AU - Qian, Yanmin
N1 - Funding Information:
Chenda Li and Yanmin Qian were supported by the China NSFC projects (No. 62071288 and U1736202). The work reported here was started at JSALT 2020 at JHU, with support from Microsoft, Amazon, and Google.
Publisher Copyright:
©2021 IEEE
PY - 2021
Y1 - 2021
N2 - Continuous speech separation (CSS) is the task of separating the speech sources from a long, partially overlapped recording that involves a varying number of speakers. A straightforward extension of conventional utterance-level speech separation to the CSS task is to segment the long recording with a fixed-size window and process each window separately. Though effective, this extension fails to model long-range dependencies in speech and thus leads to suboptimal performance. The recently proposed dual-path modeling could remedy this problem, thanks to its capability to jointly model the cross-window dependency and the local-window processing. In this work, we further extend the dual-path modeling framework to the CSS task. A transformer-based dual-path system is proposed, which integrates transformer layers for global modeling. The proposed models are applied to LibriCSS, a real-recorded multi-talker dataset, and a consistent WER reduction is observed in the ASR evaluation of the separated speech. A dual-path transformer equipped with convolutional layers is also proposed. It reduces the amount of computation by 30% while achieving a better WER. Furthermore, online-processing dual-path models are investigated, which show a 10% relative WER reduction compared to the baseline.
AB - Continuous speech separation (CSS) is the task of separating the speech sources from a long, partially overlapped recording that involves a varying number of speakers. A straightforward extension of conventional utterance-level speech separation to the CSS task is to segment the long recording with a fixed-size window and process each window separately. Though effective, this extension fails to model long-range dependencies in speech and thus leads to suboptimal performance. The recently proposed dual-path modeling could remedy this problem, thanks to its capability to jointly model the cross-window dependency and the local-window processing. In this work, we further extend the dual-path modeling framework to the CSS task. A transformer-based dual-path system is proposed, which integrates transformer layers for global modeling. The proposed models are applied to LibriCSS, a real-recorded multi-talker dataset, and a consistent WER reduction is observed in the ASR evaluation of the separated speech. A dual-path transformer equipped with convolutional layers is also proposed. It reduces the amount of computation by 30% while achieving a better WER. Furthermore, online-processing dual-path models are investigated, which show a 10% relative WER reduction compared to the baseline.
KW - Continuous speech separation
KW - Dual-path modeling
KW - Long recording speech separation
KW - Online processing
UR - http://www.scopus.com/inward/record.url?scp=85115164492&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115164492&partnerID=8YFLogxK
U2 - 10.1109/ICASSP39728.2021.9414127
DO - 10.1109/ICASSP39728.2021.9414127
M3 - Conference article
AN - SCOPUS:85115164492
SN - 0736-7791
VL - 2021-June
SP - 5739
EP - 5743
JO - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
JF - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
T2 - 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021
Y2 - 6 June 2021 through 11 June 2021
ER -