TY - GEN
T1 - Online End-To-End Neural Diarization with Speaker-Tracing Buffer
AU - Xue, Yawen
AU - Horiguchi, Shota
AU - Fujita, Yusuke
AU - Watanabe, Shinji
AU - Garcia, Paola
AU - Nagamatsu, Kenji
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/1/19
Y1 - 2021/1/19
N2 - This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND). Online diarization inherently suffers from a speaker-permutation problem, because speaker regions may be assigned inconsistently across the recording. To circumvent this inconsistency, we propose a speaker-tracing buffer mechanism that selects several input frames carrying the speaker-permutation information from previous chunks and stores them in a buffer. These buffered frames are stacked with the input frames of the current chunk and fed into a self-attention network. Our method ensures consistent diarization outputs across the buffer and the current chunk by checking the correlation between their corresponding outputs. Additionally, we train SA-EEND with variable chunk sizes to mitigate the mismatch between training and inference introduced by the speaker-tracing buffer mechanism. The proposed online SA-EEND with variable chunk-size training achieved DERs of 12.54% on CALLHOME and 20.77% on CSJ with 1.4 s actual latency.
AB - This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND). Online diarization inherently suffers from a speaker-permutation problem, because speaker regions may be assigned inconsistently across the recording. To circumvent this inconsistency, we propose a speaker-tracing buffer mechanism that selects several input frames carrying the speaker-permutation information from previous chunks and stores them in a buffer. These buffered frames are stacked with the input frames of the current chunk and fed into a self-attention network. Our method ensures consistent diarization outputs across the buffer and the current chunk by checking the correlation between their corresponding outputs. Additionally, we train SA-EEND with variable chunk sizes to mitigate the mismatch between training and inference introduced by the speaker-tracing buffer mechanism. The proposed online SA-EEND with variable chunk-size training achieved DERs of 12.54% on CALLHOME and 20.77% on CSJ with 1.4 s actual latency.
KW - Online speaker diarization
KW - end-to-end
KW - self-attention
KW - speaker-tracing buffer
UR - http://www.scopus.com/inward/record.url?scp=85098128484&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098128484&partnerID=8YFLogxK
U2 - 10.1109/SLT48900.2021.9383523
DO - 10.1109/SLT48900.2021.9383523
M3 - Conference contribution
AN - SCOPUS:85098128484
T3 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
SP - 841
EP - 848
BT - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021
Y2 - 19 January 2021 through 22 January 2021
ER -