In this paper, an audio-visual speech corpus, CENSREC-1-AV, for noisy speech recognition is introduced. CENSREC-1-AV consists of an audio-visual database and a baseline bimodal speech recognition system that uses both audio and visual information. The database contains 3,234 training utterances from 42 speakers and 1,963 test utterances from 51 speakers. Each utterance consists of a speech signal together with color and infrared images of the region around the speaker's mouth. The baseline system is provided so that users can evaluate their own bimodal speech recognizers against it. In the baseline system, multi-stream HMMs are trained on the training data. A preliminary experiment was conducted to evaluate the baseline on acoustically noisy test data. The results show that roughly a 35% relative error reduction was achieved in low-SNR conditions compared with an audio-only ASR method.
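As a minimal sketch of how a multi-stream HMM combines the two modalities, the standard approach (not specific to CENSREC-1-AV's baseline, whose exact configuration is not given here) weights the per-stream state log-likelihoods with stream exponents that sum to one. All function and variable names below are illustrative assumptions:

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log-density of a univariate Gaussian (one feature dimension)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def multistream_log_likelihood(audio_obs, visual_obs,
                               audio_params, visual_params, lam):
    """Combine audio and visual state log-likelihoods with stream
    exponents lam and (1 - lam), assuming diagonal-covariance Gaussian
    emissions; *_params are per-dimension (mean, var) pairs."""
    log_a = sum(gaussian_log_pdf(x, m, v)
                for x, (m, v) in zip(audio_obs, audio_params))
    log_v = sum(gaussian_log_pdf(x, m, v)
                for x, (m, v) in zip(visual_obs, visual_params))
    return lam * log_a + (1.0 - lam) * log_v
```

In noisy conditions the audio exponent `lam` is typically lowered so that the (noise-robust) visual stream carries more weight, which is the mechanism behind the reported gains at low SNR.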
|Publication status||Published - 2010|
|Event||2010 International Conference on Auditory-Visual Speech Processing, AVSP 2010 - Hakone, Japan|
Duration: 30 Sep 2010 → 3 Oct 2010