TY - JOUR
T1 - Prior-based binary masking and discriminative methods for reverberant and noisy speech recognition using distant stereo microphones
AU - Tachioka, Yuuki
AU - Watanabe, Shinji
AU - Le Roux, Jonathan
AU - Hershey, John R.
N1 - Publisher Copyright: © 2017 Information Processing Society of Japan.
PY - 2017
Y1 - 2017
AB - Reverberant and noisy automatic speech recognition (ASR) using distant stereo microphones is a very challenging but desirable scenario for home-environment speech applications. This scenario can often provide prior knowledge, such as physical information about the sound sources and the environment, in advance, which may then be used to reduce the influence of the interference. We propose a method to enhance the binary masking algorithm by using prior distributions of the time difference of arrival. This paper also validates state-of-the-art ASR techniques that include various discriminative training and feature transformation methods. Furthermore, we develop an efficient method to combine discriminative language modeling and minimum Bayes risk decoding in the ASR post-processing stage. We also investigate the effectiveness of this method when used for reverberant and noisy ASR with deep neural networks (DNNs), as well as when used in systems that combine multiple DNNs using different features. Experiments on the medium vocabulary sub-task of the second CHiME challenge show that the system submitted to the challenge achieved a 26.86% word error rate (WER); moreover, the DNN system with discriminative training, speaker adaptation, and system combination achieves a 20.40% WER.
KW - CHiME challenge
KW - Deep neural networks
KW - Discriminative methods
KW - Feature transformation
KW - Noise-robust ASR
KW - Prior-based binary masking
KW - System combination
UR - http://www.scopus.com/inward/record.url?scp=85020911444&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85020911444&partnerID=8YFLogxK
U2 - 10.2197/ipsjjip.25.407
DO - 10.2197/ipsjjip.25.407
M3 - Article
AN - SCOPUS:85020911444
SN - 0387-5806
VL - 25
SP - 407
EP - 416
JO - Journal of Information Processing
JF - Journal of Information Processing
ER -