End-to-End Neural Speaker Diarization with Self-Attention

Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

101 Citations (Scopus)

Abstract

Speaker diarization has been mainly developed based on the clustering of speaker embeddings. However, the clustering-based approach has two major problems: (i) it is not optimized to minimize diarization errors directly, and (ii) it cannot handle speaker overlaps correctly. To solve these problems, End-to-End Neural Diarization (EEND), in which a bidirectional long short-term memory (BLSTM) network directly outputs speaker diarization results given a multi-talker recording, was recently proposed. In this study, we enhance EEND by introducing self-attention blocks instead of BLSTM blocks. In contrast to BLSTM, which is conditioned only on its previous and next hidden states, self-attention is directly conditioned on all the other frames, making it much more suitable for dealing with the speaker diarization problem. We evaluated our proposed method on simulated mixtures, real telephone calls, and real dialogue recordings. The experimental results revealed that self-attention was the key to achieving good performance and that our proposed method performed significantly better than the conventional BLSTM-based method. Our method even outperformed the state-of-the-art x-vector clustering-based method. Finally, by visualizing the latent representation, we show that self-attention can capture global speaker characteristics in addition to local speech activity dynamics. Our source code is available online at https://github.com/hitachi-speech/EEND.
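
To illustrate the contrast drawn in the abstract, the following is a minimal sketch of single-head scaled dot-product self-attention over a short sequence of frames, written in plain Python. It is not the authors' implementation (see the repository linked above for that). The toy sizes, the random weight matrices, and all function names are hypothetical, and the per-speaker sigmoid output layer is an assumption about how overlapping speech could be represented, not a detail stated in the abstract.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over T frames.

    Returns (outputs, weights), where weights[t][u] is how strongly
    frame t attends to frame u. Every frame sees every other frame.
    """
    q = matmul(frames, w_q)
    k = matmul(frames, w_k)
    v = matmul(frames, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)               # T x T frame-to-frame scores
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def speaker_activity(outputs, w_out):
    """Assumption: one independent sigmoid per speaker per frame,
    so two speakers can be active in the same frame (overlap)."""
    return [[sigmoid(z) for z in row] for row in matmul(outputs, w_out)]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for trained parameters or acoustic features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, D, S = 6, 4, 2  # toy values: frames, feature dimension, speakers
frames = random_matrix(T, D, rng)
w_q, w_k, w_v = (random_matrix(D, D, rng) for _ in range(3))
w_out = random_matrix(D, S, rng)

outputs, weights = self_attention(frames, w_q, w_k, w_v)
probs = speaker_activity(outputs, w_out)
```

Each row of `weights` is a distribution over all `T` frames, which is the sense in which a frame is "directly conditioned on all the other frames"; a recurrent layer would instead pass information only through neighbouring hidden states. With untrained random weights the values in `probs` carry no meaning and only show the shape of a frame-by-speaker output.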

Original language: English
Title of host publication: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 296-303
Number of pages: 8
ISBN (Electronic): 9781728103068
Publication status: Published - 2019 Dec
Externally published: Yes
Event: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Singapore, Singapore
Duration: 2019 Dec 15 to 2019 Dec 18

Publication series

Name: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings

Conference

Conference: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019
Country/Territory: Singapore
City: Singapore
Period: 19/12/15 to 19/12/18

Keywords

  • end-to-end
  • neural network
  • self-attention
  • speaker diarization

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Signal Processing
  • Linguistics and Language
  • Communication
