TY - GEN
T1 - Does speech enhancement work with end-to-end ASR objectives?
T2 - 2017 IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2017
AU - Ochiai, Tsubasa
AU - Watanabe, Shinji
AU - Katagiri, Shigeru
N1 - Funding Information:
Tsubasa Ochiai and Shigeru Katagiri were supported in part by JSPS Grants-in-Aid for Scientific Research No. 26280063, the MEXT-Supported Program Driver-in-the-Loop, and a Grant-in-Aid for JSPS Fellows. Shinji Watanabe was supported by MERL.
Publisher Copyright:
© 2017 IEEE.
PY - 2017/12/5
Y1 - 2017/12/5
N2 - Recently, we proposed a novel multichannel end-to-end speech recognition architecture that integrates the components of multichannel speech enhancement and speech recognition into a single neural-network-based architecture, and demonstrated its fundamental utility for automatic speech recognition (ASR). However, the behavior of the proposed integrated system remains insufficiently clarified. An open question is whether the speech enhancement component actually acquires speech enhancement (noise suppression) ability, because it is optimized based on end-to-end ASR objectives instead of speech enhancement objectives. In this paper, we answer this question by conducting systematic evaluation experiments on the CHiME-4 corpus. We first show that the integrated end-to-end architecture successfully obtains adequate speech enhancement ability, superior to that of a conventional alternative (a delay-and-sum beamformer), as measured by two signal-level metrics: the signal-to-distortion ratio and the perceptual evaluation of speech quality. Our findings suggest that to further increase the performance of the integrated system, we must boost the power of the latter-stage speech recognition component. However, only an insufficient amount of multichannel noisy speech data is available. Given this situation, we next investigate the effect of using a large amount of single-channel clean speech data, e.g., the WSJ corpus, for additional training of the speech recognition component. We also show that this approach with clean speech significantly improves the total performance of the multichannel end-to-end architecture on multichannel noisy ASR tasks.
AB - Recently, we proposed a novel multichannel end-to-end speech recognition architecture that integrates the components of multichannel speech enhancement and speech recognition into a single neural-network-based architecture, and demonstrated its fundamental utility for automatic speech recognition (ASR). However, the behavior of the proposed integrated system remains insufficiently clarified. An open question is whether the speech enhancement component actually acquires speech enhancement (noise suppression) ability, because it is optimized based on end-to-end ASR objectives instead of speech enhancement objectives. In this paper, we answer this question by conducting systematic evaluation experiments on the CHiME-4 corpus. We first show that the integrated end-to-end architecture successfully obtains adequate speech enhancement ability, superior to that of a conventional alternative (a delay-and-sum beamformer), as measured by two signal-level metrics: the signal-to-distortion ratio and the perceptual evaluation of speech quality. Our findings suggest that to further increase the performance of the integrated system, we must boost the power of the latter-stage speech recognition component. However, only an insufficient amount of multichannel noisy speech data is available. Given this situation, we next investigate the effect of using a large amount of single-channel clean speech data, e.g., the WSJ corpus, for additional training of the speech recognition component. We also show that this approach with clean speech significantly improves the total performance of the multichannel end-to-end architecture on multichannel noisy ASR tasks.
KW - Encoder-decoder network
KW - Multichannel end-to-end automatic speech recognition
KW - Neural beamformer
UR - http://www.scopus.com/inward/record.url?scp=85042305174&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85042305174&partnerID=8YFLogxK
U2 - 10.1109/MLSP.2017.8168188
DO - 10.1109/MLSP.2017.8168188
M3 - Conference contribution
AN - SCOPUS:85042305174
T3 - IEEE International Workshop on Machine Learning for Signal Processing, MLSP
SP - 1
EP - 5
BT - 2017 IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2017 - Proceedings
A2 - Ueda, Naonori
A2 - Chien, Jen-Tzung
A2 - Matsui, Tomoko
A2 - Larsen, Jan
A2 - Watanabe, Shinji
PB - IEEE Computer Society
Y2 - 25 September 2017 through 28 September 2017
ER -