TY - GEN
T1 - Audio-guided video interpolation via human pose features
AU - Nakatsuka, Takayuki
AU - Hamanaka, Masatoshi
AU - Morishima, Shigeo
N1 - Funding Information:
This work was supported by the Program for Leading Graduate Schools, “Graduate Program for Embodiment Informatics (No. A13722300)” of the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan, JST ACCEL Grant Number JPMJAC1602, and JSPS KAKENHI Grant Numbers JP16H01744, JP17H01847, JP17H06101, JP19H01129 and JP19H04137.
Publisher Copyright:
Copyright © 2020 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.
PY - 2020
Y1 - 2020
N2 - This paper describes a method that generates in-between frames for two videos of a musical instrument being played. While image generation has achieved successful results in recent years, there remains ample scope for improvement in video generation. The keys to improving the quality of video generation are high resolution and temporal coherence. We addressed these requirements by using not only visual information but also aural information. The critical point of our method is the use of two-dimensional pose features to generate high-resolution in-between frames from the input audio. We constructed a deep neural network with a recurrent structure that infers pose features from the input audio and an encoder-decoder network that pads and generates video frames from those pose features. Our method, moreover, adopts a fusion approach that generates, pads, and retrieves video frames to improve the output video. Pose features play an essential role both in enabling end-to-end training, since they are differentiable, and in combining the generating, padding, and retrieving approaches. We conducted a user study and confirmed that the proposed method is effective at generating interpolated videos.
AB - This paper describes a method that generates in-between frames for two videos of a musical instrument being played. While image generation has achieved successful results in recent years, there remains ample scope for improvement in video generation. The keys to improving the quality of video generation are high resolution and temporal coherence. We addressed these requirements by using not only visual information but also aural information. The critical point of our method is the use of two-dimensional pose features to generate high-resolution in-between frames from the input audio. We constructed a deep neural network with a recurrent structure that infers pose features from the input audio and an encoder-decoder network that pads and generates video frames from those pose features. Our method, moreover, adopts a fusion approach that generates, pads, and retrieves video frames to improve the output video. Pose features play an essential role both in enabling end-to-end training, since they are differentiable, and in combining the generating, padding, and retrieving approaches. We conducted a user study and confirmed that the proposed method is effective at generating interpolated videos.
KW - Gated Recurrent Unit
KW - Generative Adversarial Network
KW - Pose Estimation
KW - Signal Processing
KW - Video Interpolation
UR - http://www.scopus.com/inward/record.url?scp=85083487074&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85083487074&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85083487074
T3 - VISIGRAPP 2020 - Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
SP - 27
EP - 35
BT - VISAPP
A2 - Farinella, Giovanni Maria
A2 - Radeva, Petia
A2 - Braz, Jose
PB - SciTePress
T2 - 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2020
Y2 - 27 February 2020 through 29 February 2020
ER -