TY - JOUR
T1 - Multimodal translation system using texture-mapped lip-sync images for video mail and automatic dubbing applications
AU - Morishima, Shigeo
AU - Nakamura, Satoshi
PY - 2004/9/1
Y1 - 2004/9/1
N2 - We introduce a multimodal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion by synchronizing it to the translated speech. The system introduces both a face synthesis technique that can generate any viseme lip shape and a face tracking technique that can estimate the original position and rotation of a speaker's face in an image sequence. To retain the speaker's facial expression, we substitute only the image of the speech organs with a synthesized one, generated from a 3D wire-frame model that is adaptable to any speaker. Our approach provides translated image synthesis with an extremely small database. Tracking of the face's motion in a video image is performed by template matching. In this system, the translation and rotation of the face are detected using a 3D personal face model whose texture is captured from a video frame. We also propose a method to customize the personal face model with our GUI tool. By combining these techniques with a translated voice synthesis technique, automatic multimodal translation suitable for video mail or for automatic dubbing into other languages can be achieved.
AB - We introduce a multimodal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion by synchronizing it to the translated speech. The system introduces both a face synthesis technique that can generate any viseme lip shape and a face tracking technique that can estimate the original position and rotation of a speaker's face in an image sequence. To retain the speaker's facial expression, we substitute only the image of the speech organs with a synthesized one, generated from a 3D wire-frame model that is adaptable to any speaker. Our approach provides translated image synthesis with an extremely small database. Tracking of the face's motion in a video image is performed by template matching. In this system, the translation and rotation of the face are detected using a 3D personal face model whose texture is captured from a video frame. We also propose a method to customize the personal face model with our GUI tool. By combining these techniques with a translated voice synthesis technique, automatic multimodal translation suitable for video mail or for automatic dubbing into other languages can be achieved.
KW - Audio-visual speech translation
KW - Face tracking with 3D template
KW - Lip-sync talking head
KW - Personal face model
KW - Texture-mapped facial animation
KW - Video mail and automatic dubbing
UR - http://www.scopus.com/inward/record.url?scp=10244240639&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=10244240639&partnerID=8YFLogxK
U2 - 10.1155/S1110865704404259
DO - 10.1155/S1110865704404259
M3 - Article
AN - SCOPUS:10244240639
SN - 1687-6172
VL - 2004
SP - 1637
EP - 1647
JO - EURASIP Journal on Advances in Signal Processing
JF - EURASIP Journal on Advances in Signal Processing
IS - 11
ER -