TY - GEN
T1 - Audio-visual processing toward robust speech recognition in cars
AU - Tamura, Satoshi
AU - Ninomiya, Hiroshi
AU - Kitaoka, Norihide
AU - Osuga, Shin
AU - Iribe, Yurie
AU - Takeda, Kazuya
AU - Hayamizu, Satoru
PY - 2015
Y1 - 2015
N2 - This paper reports our recent efforts to develop robust speech recognition in cars. Speech recognition is expected to be used to operate many in-car devices. However, many kinds of acoustic noise, e.g. engine noise and car stereo, are present in in-car environments, degrading speech recognition performance. To overcome this degradation, we develop a high-performance audio-visual speech recognition method. Lip images are extracted from captured face images using our face detection scheme. Basic visual features are computed and then converted into visual features for speech recognition using a deep neural network. Audio features are extracted as well, and the audio and visual features are subsequently concatenated into audio-visual features. As a recognition model, we employ a multi-stream hidden Markov model, which can adjust the contributions of the audio and visual modalities. We evaluated the proposed method using the audio-visual corpus CENSREC-1-AV, preparing driving and music noises to simulate driving conditions. Experimental results show that our method significantly improves recognition performance in the in-car condition.
AB - This paper reports our recent efforts to develop robust speech recognition in cars. Speech recognition is expected to be used to operate many in-car devices. However, many kinds of acoustic noise, e.g. engine noise and car stereo, are present in in-car environments, degrading speech recognition performance. To overcome this degradation, we develop a high-performance audio-visual speech recognition method. Lip images are extracted from captured face images using our face detection scheme. Basic visual features are computed and then converted into visual features for speech recognition using a deep neural network. Audio features are extracted as well, and the audio and visual features are subsequently concatenated into audio-visual features. As a recognition model, we employ a multi-stream hidden Markov model, which can adjust the contributions of the audio and visual modalities. We evaluated the proposed method using the audio-visual corpus CENSREC-1-AV, preparing driving and music noises to simulate driving conditions. Experimental results show that our method significantly improves recognition performance in the in-car condition.
KW - Audio-visual speech recognition
KW - Deep neural network
KW - In-car speech technology
KW - Multi-stream hidden Markov model
UR - http://www.scopus.com/inward/record.url?scp=85017030735&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85017030735&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85017030735
T3 - 7th Biennial Workshop on Digital Signal Processing for In-Vehicle Systems and Safety 2015
SP - 31
EP - 34
BT - 7th Biennial Workshop on Digital Signal Processing for In-Vehicle Systems and Safety 2015
PB - University of Texas at Dallas
T2 - 7th Biennial Workshop on Digital Signal Processing for In-Vehicle Systems and Safety 2015
Y2 - 14 October 2015 through 16 October 2015
ER -