TY - JOUR
T1 - Audio-visual speech translation with automatic lip synchronization and face tracking based on 3-D head model
AU - Morishima, Shigeo
AU - Ogata, Shin
AU - Murai, Kazumasa
AU - Nakamura, Satoshi
PY - 2002
Y1 - 2002
AB - Speech-to-speech translation has been studied to realize natural human communication beyond language barriers. Toward further multi-modal natural communication, visual information such as face and lip movements will be necessary. In this paper, we introduce a multi-modal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion while synchronizing it to the translated speech. To retain the speaker's facial expression, we replace only the image of the speech organs with a synthesized one, generated by a three-dimensional wire-frame model that is adaptable to any speaker. Our approach enables image synthesis and translation with an extremely small database. We conduct a subjective evaluation by connected-digit discrimination using data with and without audio-visual lip synchronization. The results confirm that the proposed audio-visual translation system achieves sufficient quality.
UR - http://www.scopus.com/inward/record.url?scp=0036295865&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0036295865&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:0036295865
SN - 1520-6149
VL - 2
SP - II/2117
EP - II/2120
JO - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
JF - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
T2 - 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing
Y2 - 13 May 2002 through 17 May 2002
ER -