An investigation of neural uncertainty estimation for target speaker extraction equipped RNN transducer

Jiatong Shi, Chunlei Zhang*, Chao Weng, Shinji Watanabe, Meng Yu, Dong Yu


Research output: Article › peer-review


Target-speaker speech recognition aims to recognize the speech of an enrolled speaker in an environment with background noise and interfering speakers. This study presents a joint framework that combines time-domain target speaker extraction and a recurrent neural network transducer (RNN-T) for speech recognition. To alleviate the adverse effects of residual noise and artifacts that the target speaker extraction module introduces into the speech recognition back-end, we explore training the target speaker extraction module and the RNN-T jointly. We find that a multi-stage training strategy, which pre-trains and fine-tunes each module before joint training, is crucial for stabilizing the training process. In addition, we propose a novel neural uncertainty estimation that leverages useful information from the target speaker extraction module (i.e., speaker identity uncertainty and speech enhancement uncertainty) to further improve the back-end speech recognizer. Compared to a recognizer with a target speech extraction front-end, our experiments show that joint training and the neural uncertainty module reduce the character error rate (CER) by 7% and 17% relative, respectively, on multi-talker simulation data. The multi-condition experiments indicate that our method reduces the relative CER by 9% in the noisy condition without losing performance in the clean condition. We also observe consistent improvements in a further evaluation on real-world vehicular speech data.
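The pipeline described above can be sketched in a few lines: an extraction front-end produces an enhanced signal together with uncertainty cues, which are concatenated with the acoustic features fed to the recognizer back-end. This is a minimal illustrative sketch, not the paper's implementation; the `extract_target` and `recognizer_features` helpers, the 80-dimensional frames, and the variance-based uncertainty proxy are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_target(mixture):
    """Hypothetical front-end: returns an enhanced signal plus a scalar
    uncertainty cue summarizing how much the signal was altered
    (a crude stand-in for the paper's neural uncertainty estimation)."""
    enhanced = mixture * 0.9  # placeholder for the extraction network
    uncertainty = np.array([np.var(mixture - enhanced)])
    return enhanced, uncertainty

def recognizer_features(enhanced, speaker_embedding, uncertainty):
    """Concatenate per-frame acoustic features with speaker-identity and
    enhancement-uncertainty cues so the back-end recognizer can weight
    unreliable frames differently."""
    frames = enhanced.reshape(-1, 80)  # toy 80-dim acoustic frames
    cues = np.concatenate([speaker_embedding, uncertainty])
    cues = np.tile(cues, (frames.shape[0], 1))  # broadcast cues per frame
    return np.concatenate([frames, cues], axis=1)

mixture = rng.standard_normal(80 * 10)  # 10 toy frames of mixed speech
spk = rng.standard_normal(4)            # toy enrolled-speaker embedding
enhanced, uncertainty = extract_target(mixture)
feats = recognizer_features(enhanced, spk, uncertainty)
print(feats.shape)  # 10 frames, 80 + 4 + 1 = 85 dims each
```

In the actual system both modules are neural networks trained jointly after per-module pre-training; the sketch only shows how uncertainty information can be routed from the front-end into the recognizer's input features.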

Journal: Computer Speech and Language
Publication status: Published - May 2022

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Human-Computer Interaction
