TY - CONF
T1 - CENSREC-1-AV
T2 - 2010 International Conference on Auditory-Visual Speech Processing, AVSP 2010
AU - Tamura, Satoshi
AU - Miyajima, Chiyomi
AU - Kitaoka, Norihide
AU - Yamada, Takeshi
AU - Tsuge, Satoru
AU - Takiguchi, Tetsuya
AU - Yamamoto, Kazumasa
AU - Nishiura, Takanobu
AU - Nakayama, Masato
AU - Denda, Yuki
AU - Fujimoto, Masakiyo
AU - Matsuda, Shigeki
AU - Ogawa, Tetsuji
AU - Kuroiwa, Shingo
AU - Takeda, Kazuya
AU - Nakamura, Satoshi
N1 - Funding Information:
The authors would like to thank the Suenaga Laboratory at Nagoya University for their cooperation, and the Speech Resource Consortium at the National Institute of Informatics for their support.
Publisher Copyright:
© 2010 Auditory-Visual Speech Processing 2010, AVSP 2010. All rights reserved.
PY - 2010
Y1 - 2010
AB - In this paper, CENSREC-1-AV, an audio-visual speech corpus for noisy speech recognition, is introduced. CENSREC-1-AV consists of an audio-visual database and a baseline bimodal speech recognition system that uses both audio and visual information. In the database, the training set contains 3,234 utterances by 42 speakers and the test set contains 1,963 utterances by 51 speakers. Each utterance consists of a speech signal together with color and infrared images of the region around the speaker's mouth. The baseline system is provided so that users can evaluate their own bimodal speech recognizers. In the baseline system, multi-stream HMMs are trained on the training data. A preliminary experiment was conducted to evaluate the baseline using acoustically noisy test data. The results show that a roughly 35% relative error reduction was achieved under low-SNR conditions compared with an audio-only ASR method.
KW - audio-visual database
KW - bimodal speech recognition
KW - eigenface
KW - noise robustness
KW - optical flow
UR - http://www.scopus.com/inward/record.url?scp=85133395284&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85133395284&partnerID=8YFLogxK
M3 - Paper
AN - SCOPUS:85133395284
Y2 - 30 September 2010 through 3 October 2010
ER -