TY - GEN
T1 - Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings
T2 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021
AU - Raj, Desh
AU - Denisov, Pavel
AU - Chen, Zhuo
AU - Erdogan, Hakan
AU - Huang, Zili
AU - He, Maokui
AU - Watanabe, Shinji
AU - Du, Jun
AU - Yoshioka, Takuya
AU - Luo, Yi
AU - Kanda, Naoyuki
AU - Li, Jinyu
AU - Wisdom, Scott
AU - Hershey, John R.
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/1/19
Y1 - 2021/1/19
N2 - Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR.
KW - Speech separation
KW - diarization
KW - multi-speaker
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85101448230&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85101448230&partnerID=8YFLogxK
U2 - 10.1109/SLT48900.2021.9383556
DO - 10.1109/SLT48900.2021.9383556
M3 - Conference contribution
AN - SCOPUS:85101448230
T3 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
SP - 897
EP - 904
BT - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 19 January 2021 through 22 January 2021
ER -