TY - GEN
T1 - Development of audio-visual speech corpus toward speaker-independent Japanese LVCSR
AU - Ukai, Kazuto
AU - Tamura, Satoshi
AU - Hayamizu, Satoru
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2017/5/3
Y1 - 2017/5/3
N2 - In the speech recognition literature, building corpora for Large Vocabulary Continuous Speech Recognition (LVCSR) is quite important. In addition, in order to overcome the performance decrease caused by noise, using visual information such as lip images is effective. In this paper, therefore, we focus on collecting speech and lip-image data for audio-visual LVCSR. Audio-visual speech data were obtained from 12 speakers, each of whom uttered ATR503 phonetically-balanced sentences. These data were recorded in acoustically and visually clean environments. Using the data, we conducted recognition experiments. Mel Frequency Cepstral Coefficients (MFCCs) and eigenlip features were obtained, and multi-stream Hidden Markov Models (HMMs) were built. We compared the performance in the clean condition with that in noisy environments. It is found that visual information is able to compensate for the performance degradation. In addition, it turns out that visual speech recognition should be improved for high-performance audio-visual LVCSR.
AB - In the speech recognition literature, building corpora for Large Vocabulary Continuous Speech Recognition (LVCSR) is quite important. In addition, in order to overcome the performance decrease caused by noise, using visual information such as lip images is effective. In this paper, therefore, we focus on collecting speech and lip-image data for audio-visual LVCSR. Audio-visual speech data were obtained from 12 speakers, each of whom uttered ATR503 phonetically-balanced sentences. These data were recorded in acoustically and visually clean environments. Using the data, we conducted recognition experiments. Mel Frequency Cepstral Coefficients (MFCCs) and eigenlip features were obtained, and multi-stream Hidden Markov Models (HMMs) were built. We compared the performance in the clean condition with that in noisy environments. It is found that visual information is able to compensate for the performance degradation. In addition, it turns out that visual speech recognition should be improved for high-performance audio-visual LVCSR.
KW - LVCSR
KW - audio-visual speech recognition
KW - lipreading
KW - multi-stream HMM
UR - http://www.scopus.com/inward/record.url?scp=85020176208&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85020176208&partnerID=8YFLogxK
U2 - 10.1109/ICSDA.2016.7918976
DO - 10.1109/ICSDA.2016.7918976
M3 - Conference contribution
AN - SCOPUS:85020176208
T3 - 2016 Conference of the Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2016
SP - 12
EP - 15
BT - 2016 Conference of the Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 19th Annual Conference of the Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2016
Y2 - 26 October 2016 through 28 October 2016
ER -