TY - GEN
T1 - Song2Face
T2 - SIGGRAPH Asia 2020 Technical Communications - International Conference on Computer Graphics and Interactive Techniques, SA 2020
AU - Iwase, Shohei
AU - Kato, Takuya
AU - Yamaguchi, Shugo
AU - Tsuchiya, Yukitaka
AU - Morishima, Shigeo
N1 - Funding Information:
This research was supported by JST ACCEL (JPMJAC1602), the JST-Mirai Program (JPMJMI19B2), and JSPS KAKENHI (JP17H06101 and JP19H01129).
Publisher Copyright:
© 2020 ACM.
PY - 2020/12/1
Y1 - 2020/12/1
N2 - We present Song2Face, a deep neural network capable of producing singing facial animation from an input of singing voice and singer label. The network architecture is built upon our insight that, although facial expression when singing varies between individuals, singing voices store valuable information, such as pitch, breath, and vibrato, to which expressions may be attributed. Therefore, our network consists of an encoder that extracts relevant vocal features from audio, and a regression network conditioned on a singer label that predicts control parameters for facial animation. In contrast to prior audio-driven speech animation methods, which initially map audio to text-level features, we show that vocal features can be learned directly from singing voice without any explicit constraints. Our network is capable of producing movements for all parts of the face as well as rotational movement of the head itself. Furthermore, stylistic differences in expression between singers are captured via the singer label, and thus the resulting animation's singing style can be manipulated at test time.
AB - We present Song2Face, a deep neural network capable of producing singing facial animation from an input of singing voice and singer label. The network architecture is built upon our insight that, although facial expression when singing varies between individuals, singing voices store valuable information, such as pitch, breath, and vibrato, to which expressions may be attributed. Therefore, our network consists of an encoder that extracts relevant vocal features from audio, and a regression network conditioned on a singer label that predicts control parameters for facial animation. In contrast to prior audio-driven speech animation methods, which initially map audio to text-level features, we show that vocal features can be learned directly from singing voice without any explicit constraints. Our network is capable of producing movements for all parts of the face as well as rotational movement of the head itself. Furthermore, stylistic differences in expression between singers are captured via the singer label, and thus the resulting animation's singing style can be manipulated at test time.
KW - Facial Animation
KW - Machine Learning
KW - Singing Audio
UR - http://www.scopus.com/inward/record.url?scp=85097433431&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85097433431&partnerID=8YFLogxK
U2 - 10.1145/3410700.3425435
DO - 10.1145/3410700.3425435
M3 - Conference contribution
AN - SCOPUS:85097433431
T3 - SIGGRAPH Asia 2020 Technical Communications, SA 2020
BT - SIGGRAPH Asia 2020 Technical Communications, SA 2020
PB - Association for Computing Machinery, Inc
Y2 - 4 December 2020 through 13 December 2020
ER -
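
Note: the abstract above describes a two-stage architecture: a vocal encoder that extracts features from singing audio, and a regression network conditioned on a singer label that outputs facial control parameters plus head rotation. The following is a minimal sketch of that idea, assuming PyTorch, mel-spectrogram input, and hypothetical layer sizes, singer counts, and control-parameter dimensions; it is an illustration of the described pipeline, not the authors' implementation.

# Hypothetical sketch of an encoder + singer-conditioned regressor, as summarized
# in the abstract. All dimensions and feature choices are assumptions.
import torch
import torch.nn as nn


class VocalEncoder(nn.Module):
    """Extracts a vocal feature vector from a window of audio frames (mel bins assumed)."""

    def __init__(self, n_mels=80, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis to one feature vector
        )

    def forward(self, mel):               # mel: (batch, n_mels, frames)
        return self.net(mel).squeeze(-1)  # -> (batch, feat_dim)


class SingerConditionedRegressor(nn.Module):
    """Maps vocal features and a singer label to facial control parameters
    (e.g. blendshape weights) and a head-rotation vector."""

    def __init__(self, feat_dim=128, n_singers=8, embed_dim=16,
                 n_controls=52, n_rotation=3):
        super().__init__()
        self.singer_embed = nn.Embedding(n_singers, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_controls + n_rotation),
        )
        self.n_controls = n_controls

    def forward(self, vocal_feat, singer_id):
        # The singer embedding conditions the regression, capturing per-singer style.
        x = torch.cat([vocal_feat, self.singer_embed(singer_id)], dim=-1)
        out = self.mlp(x)
        return out[:, :self.n_controls], out[:, self.n_controls:]


if __name__ == "__main__":
    encoder = VocalEncoder()
    regressor = SingerConditionedRegressor()
    mel = torch.randn(4, 80, 64)          # batch of 4 audio windows
    singer = torch.tensor([0, 1, 2, 3])   # changing the label changes the output style
    controls, head_rot = regressor(encoder(mel), singer)
    print(controls.shape, head_rot.shape)  # torch.Size([4, 52]) torch.Size([4, 3])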