Abstract
In this paper, we propose a multi-modal voice activity detection (VAD) system that uses audio and visual information. Audio-only VAD systems are typically not robust to acoustic noise. Incorporating visual information, for example features extracted from mouth images, can improve robustness, since the visual information is not affected by acoustic noise. In multi-modal speech signal processing, there are two methods for fusing the audio and visual information: concatenating the audio and visual features (feature fusion), and employing audio-only and visual-only classifiers and then fusing their unimodal decisions (decision fusion). We investigate the effectiveness of these methods and also compare model-based and model-free methods for VAD. Experimental results show that feature fusion methods are generally more effective, and that decision fusion methods generally perform better with model-free methods.
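The two fusion strategies contrasted in the abstract can be sketched in a few lines. The following is a minimal illustration only, not the system evaluated in the paper: the feature dimensions, the synthetic data, the classifier choice (scikit-learn logistic regression), and the fusion weight `w` are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic per-frame data; dimensions are illustrative, not the paper's.
n_frames = 200
labels = rng.integers(0, 2, n_frames)                            # 1 = speech, 0 = non-speech
audio = rng.standard_normal((n_frames, 13)) + labels[:, None]    # e.g. acoustic features
visual = rng.standard_normal((n_frames, 20)) + labels[:, None]   # e.g. mouth-image features

# Feature fusion: concatenate the modalities, train one classifier.
fused = np.concatenate([audio, visual], axis=1)
feat_clf = LogisticRegression(max_iter=1000).fit(fused, labels)
feat_vad = feat_clf.predict(fused)

# Decision fusion: train unimodal classifiers, then combine their
# speech posteriors; here a weighted average with a hand-picked weight.
a_clf = LogisticRegression(max_iter=1000).fit(audio, labels)
v_clf = LogisticRegression(max_iter=1000).fit(visual, labels)
w = 0.6  # audio weight; how to set it is outside this sketch
p_speech = (w * a_clf.predict_proba(audio)[:, 1]
            + (1 - w) * v_clf.predict_proba(visual)[:, 1])
dec_vad = (p_speech > 0.5).astype(int)
```

In the feature-fusion path a single model sees both modalities jointly, while in the decision-fusion path each modality is scored independently and only the posteriors are combined, which is why the latter pairs naturally with heterogeneous (e.g. model-free) unimodal detectors.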
Original language | English
---|---
Pages | 151-154
Number of pages | 4
Publication status | Published - 2009
Externally published | Yes
Event | 2009 International Conference on Auditory-Visual Speech Processing, AVSP 2009 - Norwich, United Kingdom
Duration | 10 Sep 2009 → 13 Sep 2009
Conference
Conference | 2009 International Conference on Auditory-Visual Speech Processing, AVSP 2009
---|---
Country/Territory | United Kingdom
City | Norwich
Period | 10 Sep 2009 → 13 Sep 2009
ASJC Scopus subject areas
- Language and Linguistics
- Speech and Hearing
- Otorhinolaryngology