TY - CONF
T1 - Improvement of Lipreading Performance Using Discriminative Feature and Speaker Adaptation
AU - Seko, Takumi
AU - Ukai, Naoya
AU - Tamura, Satoshi
AU - Hayamizu, Satoru
N1 - Funding Information:
Part of this work was supported by JSPS KAKENHI Grant (Grant-in-Aid for Young Scientists (B)) No. 25730109.
Publisher Copyright:
© Auditory-Visual Speech Processing 2013, AVSP 2013. All rights reserved.
PY - 2013
Y1 - 2013
N2 - In this paper, we apply a general and discriminative feature, "GIF" (Genetic Algorithm-based Informative Feature), to lipreading (visual speech recognition), and improve lipreading performance using speaker adaptation. The feature extraction method consists of two transforms that convert an input vector into GIF for recognition. For speaker adaptation, MAP (Maximum A Posteriori) adaptation is used to adapt a recognition model to a target speaker. Recognition experiments on continuous digit utterances were conducted using the audio-visual corpus CENSREC-1-AV [1], which includes more than 268,000 lip images. First, we compared the GIF-based method with a baseline method employing conventional eigenlip features, using two kinds of images: pictures in the database around the speakers' mouths, and extracted images containing only the lips. Second, we evaluated the effectiveness of speaker adaptation for lipreading. The comparison shows that the GIF-based approach performed slightly better than the baseline method, and that using mouth-around images is more suitable than lip-only images. Furthermore, speaker adaptation significantly improved recognition accuracy in the GIF-based method: after adaptation, the recognition rate increased drastically from approximately 30% to 70%.
AB - In this paper, we apply a general and discriminative feature, "GIF" (Genetic Algorithm-based Informative Feature), to lipreading (visual speech recognition), and improve lipreading performance using speaker adaptation. The feature extraction method consists of two transforms that convert an input vector into GIF for recognition. For speaker adaptation, MAP (Maximum A Posteriori) adaptation is used to adapt a recognition model to a target speaker. Recognition experiments on continuous digit utterances were conducted using the audio-visual corpus CENSREC-1-AV [1], which includes more than 268,000 lip images. First, we compared the GIF-based method with a baseline method employing conventional eigenlip features, using two kinds of images: pictures in the database around the speakers' mouths, and extracted images containing only the lips. Second, we evaluated the effectiveness of speaker adaptation for lipreading. The comparison shows that the GIF-based approach performed slightly better than the baseline method, and that using mouth-around images is more suitable than lip-only images. Furthermore, speaker adaptation significantly improved recognition accuracy in the GIF-based method: after adaptation, the recognition rate increased drastically from approximately 30% to 70%.
KW - CENSREC
KW - discriminative feature
KW - lip extraction
KW - lipreading
KW - speaker adaptation
UR - http://www.scopus.com/inward/record.url?scp=84899067176&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84899067176&partnerID=8YFLogxK
M3 - Paper
AN - SCOPUS:84899067176
SP - 221
EP - 226
T2 - 2013 International Conference on Auditory-Visual Speech Processing, AVSP 2013
Y2 - 29 August 2013 through 1 September 2013
ER -