Improving RNN transducer with target speaker extraction and neural uncertainty estimation

Jiatong Shi*, Chunlei Zhang, Chao Weng, Shinji Watanabe, Meng Yu, Dong Yu

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

5 Citations (Scopus)

Abstract

Target-speaker speech recognition aims to recognize the speech of a target speaker in noisy environments with background noise and interfering speakers. This work presents a joint framework that combines time-domain target-speaker speech extraction and a Recurrent Neural Network Transducer (RNN-T). To stabilize joint training, we propose a multi-stage training strategy that pre-trains and fine-tunes each module in the system before the modules are trained jointly. Meanwhile, speaker identity and speech enhancement uncertainty measures are proposed to compensate for residual noise and artifacts from the target speech extraction module. Compared to a recognizer fine-tuned with a target speech extraction model, our experiments show that adding the neural uncertainty module yields a significant 17% relative reduction in Character Error Rate (CER) on multi-speaker signals with background noise. The multi-condition experiments indicate that our method achieves a 9% relative performance gain in the noisy condition while maintaining performance in the clean condition.
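The abstract describes a pipeline of a target-speaker extraction front-end, a neural uncertainty estimator, and an RNN-T recognizer, where the uncertainty estimate compensates for residual noise and artifacts from the extraction module. The following is a minimal PyTorch sketch of how such a pipeline could be wired together; the module sizes, the frame-level (rather than time-domain) extractor, and the confidence-weighted feature fusion are illustrative assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch: extraction -> uncertainty-weighted fusion -> RNN-T.
# All dimensions and module designs are assumptions, not the paper's exact model.
import torch
import torch.nn as nn


class TargetSpeakerExtractor(nn.Module):
    """Toy frame-level extractor conditioned on a target-speaker embedding."""

    def __init__(self, feat_dim=80, spk_dim=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim + spk_dim, feat_dim)

    def forward(self, noisy_feats, spk_emb):
        # noisy_feats: (B, T, F); spk_emb: (B, spk_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, noisy_feats.size(1), -1)
        return torch.tanh(self.proj(torch.cat([noisy_feats, spk], dim=-1)))


class UncertaintyEstimator(nn.Module):
    """Predicts a per-frame confidence in [0, 1] for the enhanced features."""

    def __init__(self, feat_dim=80):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, noisy_feats, enhanced_feats):
        return self.score(torch.cat([noisy_feats, enhanced_feats], dim=-1))


class RNNTransducer(nn.Module):
    """Skeleton RNN-T: encoder + prediction network + joint network."""

    def __init__(self, feat_dim=80, vocab=100, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.predictor = nn.LSTM(vocab, hidden, batch_first=True)
        self.joint = nn.Linear(2 * hidden, vocab)

    def forward(self, feats, label_onehot):
        enc, _ = self.encoder(feats)             # (B, T, H)
        pred, _ = self.predictor(label_onehot)   # (B, U, H)
        joint_in = torch.cat(
            [enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1),
             pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)], dim=-1)
        return self.joint(joint_in)              # (B, T, U, vocab) logits


class JointModel(nn.Module):
    """Extraction + uncertainty-weighted feature fusion + RNN-T."""

    def __init__(self):
        super().__init__()
        self.extractor = TargetSpeakerExtractor()
        self.uncertainty = UncertaintyEstimator()
        self.asr = RNNTransducer()

    def forward(self, noisy_feats, spk_emb, label_onehot):
        enhanced = self.extractor(noisy_feats, spk_emb)
        conf = self.uncertainty(noisy_feats, enhanced)   # (B, T, 1)
        # Down-weight enhanced frames the estimator considers unreliable,
        # falling back toward the original noisy features (an assumed fusion).
        fused = conf * enhanced + (1.0 - conf) * noisy_feats
        return self.asr(fused, label_onehot)


if __name__ == "__main__":
    B, T, U, F, V = 2, 50, 10, 80, 100
    model = JointModel()
    logits = model(torch.randn(B, T, F), torch.randn(B, 128),
                   torch.randn(B, U, V))
    print(logits.shape)  # torch.Size([2, 50, 10, 100])
```

In a multi-stage regime such as the one the abstract mentions, the extractor and the RNN-T would typically be pre-trained separately, then the full model (including the uncertainty estimator) fine-tuned jointly on the RNN-T loss.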

Original language: English
Pages (from-to): 6908-6912
Number of pages: 5
Journal: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2021-June
DOIs
Publication status: Published - 2021
Externally published: Yes
Event: 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021 - Virtual, Toronto, Canada
Duration: 2021 Jun 6 - 2021 Jun 11

Keywords

  • Target-speaker speech extraction
  • Target-speaker speech recognition
  • Uncertainty estimation

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering
