TY - GEN
T1 - CNN-based multichannel end-to-end speech recognition for everyday home environments
AU - Yalta, Nelson
AU - Watanabe, Shinji
AU - Hori, Takaaki
AU - Nakadai, Kazuhiro
AU - Ogata, Tetsuya
N1 - Funding Information:
The work has been supported by MEXT Grant-in-Aid for Scientific Research (A), No. 15H01710, except for the contribution of Mitsubishi Electric Research Laboratories (MERL).
Publisher Copyright:
© 2019 IEEE
PY - 2019/9
Y1 - 2019/9
N2 - Casual conversations involving multiple speakers and noises from surrounding devices are common in everyday environments, which degrades the performance of automatic speech recognition systems. These challenging characteristics of environments are the target of the CHiME-5 challenge. By employing a convolutional neural network (CNN)-based multichannel end-to-end speech recognition system, this study attempts to overcome the difficulties present in everyday environments. The system comprises an attention-based encoder-decoder neural network that directly generates text as output from a sound input. The multichannel CNN encoder, which uses residual connections and batch renormalization, is trained with augmented data, including white noise injection. The experimental results show that the word error rate is reduced by 8.5% and 0.6% absolute relative to a single-channel end-to-end system and the best baseline (LF-MMI TDNN), respectively, on the CHiME-5 corpus.
AB - Casual conversations involving multiple speakers and noises from surrounding devices are common in everyday environments, which degrades the performance of automatic speech recognition systems. These challenging characteristics of environments are the target of the CHiME-5 challenge. By employing a convolutional neural network (CNN)-based multichannel end-to-end speech recognition system, this study attempts to overcome the difficulties present in everyday environments. The system comprises an attention-based encoder-decoder neural network that directly generates text as output from a sound input. The multichannel CNN encoder, which uses residual connections and batch renormalization, is trained with augmented data, including white noise injection. The experimental results show that the word error rate is reduced by 8.5% and 0.6% absolute relative to a single-channel end-to-end system and the best baseline (LF-MMI TDNN), respectively, on the CHiME-5 corpus.
KW - End-to-end speech recognition
KW - Multichannel
KW - Residual networks
UR - http://www.scopus.com/inward/record.url?scp=85075609674&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85075609674&partnerID=8YFLogxK
U2 - 10.23919/EUSIPCO.2019.8902524
DO - 10.23919/EUSIPCO.2019.8902524
M3 - Conference contribution
AN - SCOPUS:85075609674
T3 - European Signal Processing Conference
BT - EUSIPCO 2019 - 27th European Signal Processing Conference
PB - European Signal Processing Conference, EUSIPCO
T2 - 27th European Signal Processing Conference, EUSIPCO 2019
Y2 - 2 September 2019 through 6 September 2019
ER -