Target-speaker speech recognition aims to recognize the speech of an enrolled speaker in an environment with background noise and interfering speakers. This study presents a joint framework that combines time-domain target-speaker extraction and a recurrent neural network transducer (RNN-T) for speech recognition. To alleviate the adverse effects of residual noise and artifacts that the target-speaker extraction module introduces to the speech recognition back-end, we explore training the target-speaker extraction module and the RNN-T jointly. We find that a multi-stage training strategy, which pre-trains and fine-tunes each module before joint training, is crucial to stabilizing the training process. In addition, we propose a novel neural uncertainty estimation that leverages useful information from the target-speaker extraction module (i.e., speaker-identity uncertainty and speech-enhancement uncertainty) to further improve the back-end speech recognizer. Compared to a recognizer with a target-speech extraction front-end, our experiments show that joint training and the neural uncertainty module reduce the character error rate (CER) by 7% and 17% relative, respectively, on multi-talker simulation data. Multi-condition experiments indicate that our method reduces the CER by 9% relative in the noisy condition without losing performance in the clean condition. We also observe consistent improvements in a further evaluation on real-world vehicular speech data.