TY - GEN
T1 - End-to-End Neural Speaker Diarization with Self-Attention
AU - Fujita, Yusuke
AU - Kanda, Naoyuki
AU - Horiguchi, Shota
AU - Xue, Yawen
AU - Nagamatsu, Kenji
AU - Watanabe, Shinji
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/12
Y1 - 2019/12
AB - Speaker diarization has mainly been developed based on the clustering of speaker embeddings. However, the clustering-based approach has two major problems: (i) it is not optimized to minimize diarization errors directly, and (ii) it cannot handle speaker overlaps correctly. To solve these problems, End-to-End Neural Diarization (EEND), in which a bidirectional long short-term memory (BLSTM) network directly outputs speaker diarization results given a multi-talker recording, was recently proposed. In this study, we enhance EEND by introducing self-attention blocks instead of BLSTM blocks. In contrast to a BLSTM, which is conditioned only on its previous and next hidden states, self-attention is directly conditioned on all the other frames, making it much more suitable for the speaker diarization problem. We evaluated our proposed method on simulated mixtures, real telephone calls, and real dialogue recordings. The experimental results revealed that self-attention was the key to achieving good performance and that our proposed method performed significantly better than the conventional BLSTM-based method. Our method even outperformed the state-of-the-art x-vector clustering-based method. Finally, by visualizing the latent representation, we show that self-attention captures global speaker characteristics in addition to local speech activity dynamics. Our source code is available online at https://github.com/hitachi-speech/EEND.
KW - end-to-end
KW - neural network
KW - self-attention
KW - speaker diarization
UR - http://www.scopus.com/inward/record.url?scp=85081552065&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85081552065&partnerID=8YFLogxK
U2 - 10.1109/ASRU46091.2019.9003959
DO - 10.1109/ASRU46091.2019.9003959
M3 - Conference contribution
AN - SCOPUS:85081552065
T3 - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
SP - 296
EP - 303
BT - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019
Y2 - 15 December 2019 through 18 December 2019
ER -
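
The abstract above describes an encoder in which every frame attends to all other frames and the network directly emits per-frame speaker-activity decisions, with independent per-speaker outputs so that overlapped speech can be handled. Below is a minimal PyTorch sketch of that idea. It is not the authors' implementation (see https://github.com/hitachi-speech/EEND for that); the layer sizes, the feature dimension, and the fixed two-speaker output are illustrative assumptions, not values taken from this record.

# Minimal sketch of a self-attention-based end-to-end diarization model.
# Assumptions (hypothetical, for illustration only): input features of
# dimension 345, a 2-layer Transformer encoder, and exactly 2 speakers.
import torch
import torch.nn as nn

class SelfAttentionDiarization(nn.Module):
    def __init__(self, feat_dim=345, d_model=256, n_heads=4,
                 n_layers=2, n_speakers=2):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=1024,
            batch_first=True)
        # Unlike a BLSTM, each self-attention layer is conditioned on
        # every frame in the recording, not only on the previous and
        # next hidden states.
        self.encoder = nn.TransformerEncoder(encoder_layer,
                                             num_layers=n_layers)
        self.output = nn.Linear(d_model, n_speakers)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) acoustic features
        h = self.input_proj(feats)
        h = self.encoder(h)
        # Independent sigmoids per speaker: more than one speaker can
        # be active in the same frame, which covers speaker overlap.
        return torch.sigmoid(self.output(h))

model = SelfAttentionDiarization()
x = torch.randn(1, 500, 345)   # one recording, 500 frames
probs = model(x)               # (1, 500, 2) speaker-activity probabilities
active = probs > 0.5           # frame-level diarization decision
print(active.shape)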