TY - GEN
T1 - Target-speaker voice activity detection with improved I-vector estimation for unknown number of speaker
AU - He, Maokui
AU - Raj, Desh
AU - Huang, Zili
AU - Du, Jun
AU - Chen, Zhuo
AU - Watanabe, Shinji
N1 - Funding Information:
The work reported here was started at JSALT 2020, with support from Microsoft, Amazon, and Google. We thank Tianyan Zhou, Xiaofei Wang, and Zhong Meng for their contributions for collecting LibriCSS 2spk and 5spk data. This work was supported by the Strategic Priority Research Program of Chinese Academy of Sciences under Grant No. XDC08050200.
Publisher Copyright:
Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
N2 - Target-speaker voice activity detection (TS-VAD) has recently shown promising results for speaker diarization on highly overlapped speech. However, the original model requires a fixed (and known) number of speakers, which limits its application to real conversations. In this paper, we extend TS-VAD to speaker diarization with unknown numbers of speakers. This is achieved by two steps: first, an initial diarization system is applied for speaker number estimation, followed by TS-VAD network output masking according to this estimate. We further investigate different diarization methods, including clustering-based and region proposal networks, for estimating the initial i-vectors. Since these systems have complementary strengths, we propose a fusion-based method to combine frame-level decisions from the systems for an improved initialization. We demonstrate through experiments on variants of the LibriCSS meeting corpus that our proposed approach can improve the DER by up to 50% relative across varying numbers of speakers. This improvement also results in better downstream ASR performance approaching that using oracle segments.
KW - Multi-speaker
KW - Overlap
KW - Speaker diarization
KW - TS-VAD
UR - http://www.scopus.com/inward/record.url?scp=85119208683&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119208683&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-750
DO - 10.21437/Interspeech.2021-750
M3 - Conference contribution
AN - SCOPUS:85119208683
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 2523
EP - 2527
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -