TY - GEN
T1 - Continuous speech separation using speaker inventory for long recording
AU - Han, Cong
AU - Luo, Yi
AU - Li, Chenda
AU - Zhou, Tianyan
AU - Kinoshita, Keisuke
AU - Watanabe, Shinji
AU - Delcroix, Marc
AU - Erdogan, Hakan
AU - Hershey, John R.
AU - Mesgarani, Nima
AU - Chen, Zhuo
N1 - Funding Information:
The work reported here was started at JSALT 2020 at JHU, with support from Microsoft, Amazon, and Google. C.H., Y.L., and N.M. were also supported by a grant from the National Institutes of Health, NIDCD, DC014279; and a grant from Marie-Josée and Henry R. Kravis.
Publisher Copyright:
Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
AB - Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years. Recent research includes extracting target speech using the target speaker’s voice snippet, and jointly separating all participating speakers using a pool of additional speaker signals, known as speech separation using speaker inventory (SSUSI). However, all these systems ideally assume that pre-enrolled speaker signals are available and are evaluated only on simple data configurations. In realistic multi-talker conversations, the speech signal contains a large proportion of non-overlapped regions, from which robust speaker embeddings of individual talkers can be derived. In this work, we apply the SSUSI model to long recordings and propose a self-informed, clustering-based inventory-forming scheme in which the speaker inventory is built entirely from the input signal, without the need for external speaker signals. Experimental results on simulated noisy reverberant long-recording datasets show that the proposed method significantly improves separation performance across various conditions.
KW - Continuous speech separation
KW - Embedding clustering
KW - Speaker inventory
KW - Speech separation
UR - http://www.scopus.com/inward/record.url?scp=85119188986&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119188986&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-338
DO - 10.21437/Interspeech.2021-338
M3 - Conference contribution
AN - SCOPUS:85119188986
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 2273
EP - 2277
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -