Audio-Oriented Video Interpolation Using Key Pose

Takayuki Nakatsuka*, Yukitaka Tsuchiya, Masatoshi Hamanaka, Shigeo Morishima

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)


This paper describes a deep learning-based method for long-term video interpolation that generates intermediate frames between two music performance videos of a person playing a specific instrument. Recent advances in deep learning have made it possible to generate realistic, high-fidelity, high-resolution images in short-term video interpolation. However, long-term video interpolation still leaves room for improvement, because the generated video lacks resolution and temporal consistency. In music performance videos in particular, the performer's motion must also be synchronized with the music. We address these problems by using the human poses and music features that are essential to music performance. By closely matching human poses with the music and the videos, it is possible to generate intermediate frames that are synchronized with the music. Specifically, we obtain the human poses in the last frame of the first video and the first frame of the second video as key poses. Our encoder-decoder network then estimates the human poses in the intermediate frames from these key poses, conditioned on the music features. To make the network end-to-end, we use a differentiable network that transforms the estimated human poses from vector form into image form, such as human stick figures. Finally, a video-to-video synthesis network uses the stick figures to generate the intermediate frames between the two music performance videos. Quantitative experiments showed that the generated performance videos were of higher quality than those produced by the baseline method.
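The data flow in the abstract (key poses, then intermediate poses, then stick-figure images) can be illustrated with a small sketch. It is not the authors' method: the learned, music-conditioned encoder-decoder is replaced here by plain linear interpolation, and the differentiable pose-to-image network is replaced by an analytic Gaussian line renderer. The function names, the three-joint skeleton, and the `sigma` parameter are all hypothetical and chosen only for illustration.

```python
import math

def interpolate_key_poses(start_pose, end_pose, num_frames):
    """Blend two key poses into num_frames intermediate poses.

    Assumption: linear interpolation stands in for the learned,
    music-conditioned encoder-decoder described in the abstract.
    Each pose is a list of (x, y) joint coordinates.
    """
    poses = []
    for k in range(1, num_frames + 1):
        t = k / (num_frames + 1)  # strictly between 0 and 1
        poses.append([((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
                      for (x0, y0), (x1, y1) in zip(start_pose, end_pose)])
    return poses

def point_segment_distance(px, py, ax, ay, bx, by):
    """Euclidean distance from point (px, py) to segment (a, b)."""
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project the point onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def render_stick_figure(pose, bones, height, width, sigma=1.0):
    """Render a pose as a soft stick-figure image (values in [0, 1]).

    Pixel intensity falls off as a Gaussian of the distance to the
    nearest bone, so it varies smoothly with the joint coordinates
    almost everywhere. Assumption: this analytic renderer is a
    stand-in for the differentiable network in the abstract.
    """
    image = [[0.0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            value = 0.0
            for i, j in bones:
                d = point_segment_distance(col, row, *pose[i], *pose[j])
                value = max(value, math.exp(-d * d / (2 * sigma * sigma)))
            image[row][col] = value
    return image

# Hypothetical three-joint skeleton: head, hip, hand.
BONES = [(0, 1), (0, 2)]
last_pose_of_first_video = [(8.0, 2.0), (8.0, 8.0), (12.0, 5.0)]
first_pose_of_second_video = [(8.0, 2.0), (8.0, 8.0), (4.0, 5.0)]

intermediate_poses = interpolate_key_poses(
    last_pose_of_first_video, first_pose_of_second_video, num_frames=3)
stick_figures = [render_stick_figure(p, BONES, height=12, width=16)
                 for p in intermediate_poses]
```

In the paper's pipeline, images like `stick_figures` would be passed to a video-to-video synthesis network to produce the final frames; that stage is omitted here. Written in an automatic differentiation framework, a soft renderer of this kind would let a loss on the output image propagate gradients back to the pose estimates, which is the role the abstract assigns to the differentiable pose-to-image network.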

Original language: English
Article number: 2160016
Journal: International Journal of Pattern Recognition and Artificial Intelligence
Issue number: 16
Publication status: Published - 2021 Dec 30


Keywords

  • Video interpolation
  • generative adversarial network
  • musical performance video
  • signal processing

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
  • Artificial Intelligence

