Abstract
In this paper, we propose and develop a real-time audio-visual automatic continuous speech recognition system. The system uses live speech signals and facial images captured by a microphone and a camera. Optical-flow-based features serve as the visual features, and voice activity detection (VAD) and lip tracking are employed to improve recognition accuracy. Several experiments are conducted using Japanese connected-digit speech contaminated with white noise, music, television news, and car engine noise. Experimental results show that the recognition accuracy of the proposed system is insufficient when the user is listening to the news or riding in a moving car with the windows open, whereas accuracy is high in a place with light music or in a moving car with the windows closed.
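The paper itself does not include code. As a minimal illustrative sketch (not the authors' implementation), the snippet below shows one common way optical-flow-based visual features can be computed from a tracked lip region using OpenCV. The function name `lip_flow_features`, the ROI input (assumed to come from a separate lip-tracking step), the choice of Farneback dense optical flow, and the mean/variance summarization are all assumptions made for illustration.

```python
import cv2
import numpy as np

def lip_flow_features(prev_frame, curr_frame, roi):
    """Summarize dense optical flow over a lip ROI into a feature vector.

    prev_frame, curr_frame: consecutive BGR frames from the camera.
    roi: (x, y, w, h) bounding box of the tracked lip region
         (assumed to be provided by a separate lip-tracking step).
    """
    x, y, w, h = roi
    prev_gray = cv2.cvtColor(prev_frame[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)

    # Dense optical flow between consecutive lip-region frames.
    # The paper does not specify the exact flow algorithm or parameters;
    # Farneback with these settings is purely illustrative.
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

    fx, fy = flow[..., 0], flow[..., 1]
    # A simple 4-dimensional summary: mean and variance of the
    # horizontal and vertical flow components over the ROI.
    return np.array([fx.mean(), fy.mean(), fx.var(), fy.var()])
```

In a real-time pipeline of this kind, such per-frame visual vectors would typically be synchronized with the acoustic feature frames and fused for recognition, with VAD gating which segments are actually decoded; the fusion scheme here is left open, as the abstract does not detail it.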
Original language | English
---|---
Publication status | Published - 2010
Externally published | Yes
Event | 2010 International Conference on Auditory-Visual Speech Processing, AVSP 2010 - Hakone, Japan
Duration | 2010 Sept 30 → 2010 Oct 3
Conference
Conference | 2010 International Conference on Auditory-Visual Speech Processing, AVSP 2010
---|---
Country/Territory | Japan
City | Hakone
Period | 2010 Sept 30 → 2010 Oct 3
Keywords
- multi-modal
- optical-flow
- real-time
- speech recognition system
ASJC Scopus subject areas
- Language and Linguistics
- Speech and Hearing
- Otorhinolaryngology