TY - JOUR
T1 - A review of speaker diarization
T2 - Recent advances with deep learning
AU - Park, Tae Jin
AU - Kanda, Naoyuki
AU - Dimitriadis, Dimitrios
AU - Han, Kyu J.
AU - Watanabe, Shinji
AU - Narayanan, Shrikanth
N1 - Publisher Copyright: © 2021 Elsevier Ltd
PY - 2022/3
Y1 - 2022/3
AB - Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity, or in short, a task to identify “who spoke when”. In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker adaptive processing. These algorithms also gained their own value as a standalone application over time to provide speaker-specific metainformation for downstream tasks such as audio retrieval. More recently, with the emergence of deep learning technology, which has driven revolutionary changes in research and practices across speech application domains, rapid advancements have been made for speaker diarization. In this paper, we review not only the historical development of speaker diarization technology but also the recent advancements in neural speaker diarization approaches. Furthermore, we discuss how speaker diarization systems have been integrated with speech recognition applications and how the recent surge of deep learning is leading the way of jointly modeling these two components to be complementary to each other. By considering such exciting technical trends, we believe that this paper is a valuable contribution to the community to provide a survey work by consolidating the recent developments with neural methods and thus facilitating further progress toward a more efficient speaker diarization.
KW - Automatic speech recognition
KW - Deep learning
KW - Speaker diarization
UR - http://www.scopus.com/inward/record.url?scp=85119422781&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119422781&partnerID=8YFLogxK
U2 - 10.1016/j.csl.2021.101317
DO - 10.1016/j.csl.2021.101317
M3 - Article
AN - SCOPUS:85119422781
SN - 0885-2308
VL - 72
JO - Computer Speech and Language
JF - Computer Speech and Language
M1 - 101317
ER -