TY - GEN
T1 - DOVER-Lap
T2 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021
AU - Raj, Desh
AU - Garcia-Perera, Leibny Paola
AU - Huang, Zili
AU - Watanabe, Shinji
AU - Povey, Daniel
AU - Stolcke, Andreas
AU - Khudanpur, Sanjeev
N1 - Funding Information:
This work was partially supported by grants from the JHU Applied Physics Laboratory, Nanyang Technological University, Hitachi Ltd., Japan, and the Government of Israel.
Publisher Copyright:
© 2021 IEEE.
PY - 2021/1/19
Y1 - 2021/1/19
AB - Several advances have been made recently towards handling overlapping speech for speaker diarization. Since speech and natural language tasks often benefit from ensemble techniques, we propose an algorithm for combining outputs from such diarization systems through majority voting. Our method, DOVER-Lap, is inspired by the recently proposed DOVER algorithm, but is designed to handle overlapping segments in diarization outputs. We also modify the pair-wise incremental label mapping strategy used in DOVER, and propose an approximation algorithm based on weighted k-partite graph matching, which performs this mapping using a global cost tensor. We demonstrate the strength of our method by combining outputs from diverse systems - clustering-based, region proposal networks, and target-speaker voice activity detection - on the AMI and LibriCSS datasets, where it consistently outperforms the single best system. Additionally, we show that DOVER-Lap can be used for late fusion in multichannel diarization, where it compares favorably with early fusion methods like beamforming.
KW - multichannel diarization
KW - overlapped speaker diarization
KW - voting-based methods
UR - http://www.scopus.com/inward/record.url?scp=85101812972&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85101812972&partnerID=8YFLogxK
U2 - 10.1109/SLT48900.2021.9383490
DO - 10.1109/SLT48900.2021.9383490
M3 - Conference contribution
AN - SCOPUS:85101812972
T3 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
SP - 881
EP - 888
BT - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 19 January 2021 through 22 January 2021
ER -