TY - JOUR
T1 - Lipreading using convolutional neural network
AU - Noda, Kuniaki
AU - Yamaguchi, Yuki
AU - Nakadai, Kazuhiro
AU - Okuno, Hiroshi G.
AU - Ogata, Tetsuya
PY - 2014/1/1
Y1 - 2014/1/1
N2 - In recent automatic speech recognition studies, deep learning architecture applications for acoustic modeling have eclipsed conventional sound features such as Mel-frequency cepstral coefficients. However, for visual speech recognition (VSR) studies, handcrafted visual feature extraction mechanisms are still widely utilized. In this paper, we propose to apply a convolutional neural network (CNN) as a visual feature extraction mechanism for VSR. By training a CNN with images of a speaker's mouth area in combination with phoneme labels, the CNN acquires multiple convolutional filters, used to extract visual features essential for recognizing phonemes. Further, by modeling the temporal dependencies of the generated phoneme label sequences, a hidden Markov model in our proposed system recognizes multiple isolated words. Our proposed system is evaluated on an audio-visual speech dataset comprising 300 Japanese words with six different speakers. The evaluation results of our isolated word recognition experiment demonstrate that the visual features acquired by the CNN significantly outperform those acquired by conventional dimensionality compression approaches, including principal component analysis.
AB - In recent automatic speech recognition studies, deep learning architecture applications for acoustic modeling have eclipsed conventional sound features such as Mel-frequency cepstral coefficients. However, for visual speech recognition (VSR) studies, handcrafted visual feature extraction mechanisms are still widely utilized. In this paper, we propose to apply a convolutional neural network (CNN) as a visual feature extraction mechanism for VSR. By training a CNN with images of a speaker's mouth area in combination with phoneme labels, the CNN acquires multiple convolutional filters, used to extract visual features essential for recognizing phonemes. Further, by modeling the temporal dependencies of the generated phoneme label sequences, a hidden Markov model in our proposed system recognizes multiple isolated words. Our proposed system is evaluated on an audio-visual speech dataset comprising 300 Japanese words with six different speakers. The evaluation results of our isolated word recognition experiment demonstrate that the visual features acquired by the CNN significantly outperform those acquired by conventional dimensionality compression approaches, including principal component analysis.
KW - Convolutional neural network
KW - Lipreading
KW - Visual feature extraction
UR - http://www.scopus.com/inward/record.url?scp=84910090408&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84910090408&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:84910090408
SN - 2308-457X
SP - 1149
EP - 1153
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 15th Annual Conference of the International Speech Communication Association: Celebrating the Diversity of Spoken Languages, INTERSPEECH 2014
Y2 - 14 September 2014 through 18 September 2014
ER -