Target-speaker voice activity detection with improved I-vector estimation for unknown number of speaker

Maokui He, Desh Raj, Zili Huang, Jun Du*, Zhuo Chen, Shinji Watanabe

*この研究の対応する著者

研究成果: Conference contribution

1 被引用数 (Scopus)

抄録

Target-speaker voice activity detection (TS-VAD) has recently shown promising results for speaker diarization on highly overlapped speech. However, the original model requires a fixed (and known) number of speakers, which limits its application to real conversations. In this paper, we extend TS-VAD to speaker diarization with unknown numbers of speakers. This is achieved by two steps: first, an initial diarization system is applied for speaker number estimation, followed by TS-VAD network output masking according to this estimate. We further investigate different diarization methods, including clustering-based and region proposal networks, for estimating the initial i-vectors. Since these systems have complementary strengths, we propose a fusion-based method to combine frame-level decisions from the systems for an improved initialization. We demonstrate through experiments on variants of the LibriCSS meeting corpus that our proposed approach can improve the DER by up to 50% relative across varying numbers of speakers. This improvement also results in better downstream ASR performance approaching that using oracle segments.

本文言語English
ホスト出版物のタイトル22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
出版社International Speech Communication Association
ページ2523-2527
ページ数5
ISBN(電子版)9781713836902
DOI
出版ステータスPublished - 2021
外部発表はい
イベント22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 - Brno, Czech Republic
継続期間: 2021 8月 302021 9月 3

出版物シリーズ

名前Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
4
ISSN(印刷版)2308-457X
ISSN(電子版)1990-9772

Conference

Conference22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
国/地域Czech Republic
CityBrno
Period21/8/3021/9/3

ASJC Scopus subject areas

  • 言語および言語学
  • 人間とコンピュータの相互作用
  • 信号処理
  • ソフトウェア
  • モデリングとシミュレーション

フィンガープリント

「Target-speaker voice activity detection with improved I-vector estimation for unknown number of speaker」の研究トピックを掘り下げます。これらがまとまってユニークなフィンガープリントを構成します。

引用スタイル