TY - GEN
T1 - Speech recognition using deep canonical correlation analysis in noisy environments
AU - Isobe, Shinnosuke
AU - Tamura, Satoshi
AU - Hayamizu, Satoru
N1 - Publisher Copyright:
© 2021 by SCITEPRESS - Science and Technology Publications, Lda. All rights reserved.
PY - 2021
Y1 - 2021
AB - In this paper, we propose a method to improve the accuracy of speech recognition in noisy environments by utilizing Deep Canonical Correlation Analysis (DCCA). DCCA generates projections from two modalities into one common space so that the correlation of the projected vectors is maximized. Our idea is to employ DCCA with audio and visual modalities to enhance the robustness of Automatic Speech Recognition (ASR): (A) noisy audio features can be recovered from clean visual features, and (B) an ASR model can be trained using both audio and visual features as a form of data augmentation. We evaluated our method using the audio-visual corpus CENSREC-1-AV and the noise database DEMAND. Compared with conventional ASR and feature-fusion-based audio-visual speech recognition, our DCCA-based recognizers achieved better performance. In addition, experimental results show that utilizing DCCA yields better results in various noisy environments, thanks to the visual modality. Furthermore, we found that DCCA can serve as a data augmentation scheme when only a small amount of training data is available, by incorporating visual DCCA features, in addition to audio DCCA features, when building an audio-only ASR model.
KW - Audio-visual processing
KW - Canonical correlation analysis
KW - Data augmentation
KW - Deep learning
KW - Noise robustness
KW - Speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85103849958&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85103849958&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85103849958
T3 - ICPRAM 2021 - Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods
SP - 63
EP - 70
BT - ICPRAM 2021 - Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods
A2 - De Marsico, Maria
A2 - di Baja, Gabriella Sanniti
A2 - Fred, Ana
PB - SciTePress
T2 - 10th International Conference on Pattern Recognition Applications and Methods, ICPRAM 2021
Y2 - 4 February 2021 through 6 February 2021
ER -
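
For readers unfamiliar with the objective the abstract refers to, the following is a minimal PyTorch sketch of a DCCA loss in the style of Andrew et al.'s deep CCA: two projection networks are trained to maximize the total canonical correlation between their outputs. This is not the authors' implementation; all names, network shapes, and feature dimensions (ProjectionNet, dcca_loss, the 39- and 64-dimensional inputs) are hypothetical.

    import torch
    import torch.nn as nn

    class ProjectionNet(nn.Module):
        """Simple MLP projecting one modality into the shared space (hypothetical)."""
        def __init__(self, in_dim, out_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, out_dim),
            )
        def forward(self, x):
            return self.net(x)

    def dcca_loss(h1, h2, eps=1e-4):
        """Negative total canonical correlation between two views h1, h2 (N x d)."""
        n = h1.size(0)
        # Center each view over the batch.
        h1 = h1 - h1.mean(dim=0, keepdim=True)
        h2 = h2 - h2.mean(dim=0, keepdim=True)
        # Covariance estimates, with a small ridge term for numerical stability.
        s12 = h1.t() @ h2 / (n - 1)
        s11 = h1.t() @ h1 / (n - 1) + eps * torch.eye(h1.size(1))
        s22 = h2.t() @ h2 / (n - 1) + eps * torch.eye(h2.size(1))
        # Inverse square root of a symmetric positive-definite matrix.
        def inv_sqrt(s):
            vals, vecs = torch.linalg.eigh(s)
            return vecs @ torch.diag(vals.clamp_min(eps).rsqrt()) @ vecs.t()
        t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
        # Total correlation = sum of singular values of T; minimize its negative.
        return -torch.linalg.svdvals(t).sum()

A usage sketch, with made-up feature dimensions standing in for the audio and visual features described in the abstract:

    audio_net = ProjectionNet(in_dim=39, out_dim=30)   # e.g. MFCC-like audio features
    visual_net = ProjectionNet(in_dim=64, out_dim=30)  # e.g. lip-region visual features
    a = torch.randn(128, 39)  # one synthetic minibatch per view
    v = torch.randn(128, 64)
    loss = dcca_loss(audio_net(a), visual_net(v))
    loss.backward()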