TY - GEN
T1 - Noise-robust attention learning for end-to-end speech recognition
AU - Higuchi, Yosuke
AU - Tawara, Naohiro
AU - Ogawa, Atsunori
AU - Iwata, Tomoharu
AU - Kobayashi, Tetsunori
AU - Ogawa, Tetsuji
N1 - Publisher Copyright: © 2021 European Signal Processing Conference, EUSIPCO. All rights reserved.
PY - 2021/1/24
Y1 - 2021/1/24
AB - We propose a method for improving the noise robustness of an end-to-end automatic speech recognition (ASR) model using attention weights. Several studies have adopted a combination of recurrent neural networks and attention mechanisms to achieve direct speech-to-text translation. In real-world environments, however, noisy conditions make it difficult for the attention mechanism to estimate an accurate alignment between the input speech frames and the output characters, which degrades the recognition performance of the end-to-end model. In this work, we propose noise-robust attention learning (NRAL), which explicitly tells the attention mechanism where to “listen at” in a sequence of noisy speech features. Specifically, we train the attention weights estimated from noisy speech to approximate the weights estimated from clean speech. Experimental results on the CHiME-4 task indicate that the proposed NRAL approach effectively improves the noise robustness of the end-to-end ASR model.
KW - Attention mechanism
KW - Deep neural networks
KW - Noise robustness
KW - Speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85099303161&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85099303161&partnerID=8YFLogxK
U2 - 10.23919/Eusipco47968.2020.9287488
DO - 10.23919/Eusipco47968.2020.9287488
M3 - Conference contribution
AN - SCOPUS:85099303161
T3 - European Signal Processing Conference
SP - 311
EP - 315
BT - 28th European Signal Processing Conference, EUSIPCO 2020 - Proceedings
PB - European Signal Processing Conference, EUSIPCO
T2 - 28th European Signal Processing Conference, EUSIPCO 2020
Y2 - 24 August 2020 through 28 August 2020
ER -