Prompt: The man is riding a horse.
Source: Westworld/HBO
Prompt: The Joker is talking and smiling.
Source: Joker/Batman:Dark Knight
Prompt: A big wave.
Source: The Great Wave off Kanagawa
Prompt: A big sea wave.
Source: Brett Allen/Shutterstock.com
Prompt: Two person are walking.
Source: Malte Mueller/Getty Images
Prompt: The horse is walking.
Source: Ernie Cowan
We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video) generation. We approach the problem differently, formulating the task as an image translation problem steered by text and motion magnitude prompts. To ensure that the model adheres to the motion guidance, we propose a new motion-guided warping module that spatially transforms the features of the source image conditioned on the two types of prompts. Furthermore, we introduce a perceptual loss that keeps the transformed feature map within the same space as the target image, which preserves content consistency and coherence. To prepare for model training, we meticulously curated data by extracting coherent image frames from the TGIF video-caption dataset, which provides rich information about the temporal changes of subjects. After pretraining, we apply our model in a zero-shot manner to a number of video datasets. Extensive qualitative and quantitative experiments demonstrate the effectiveness of our model: it captures not only the semantic prompt from the text but also the spatial prompt from the motion guidance. We train all our models on a single node of 16xV100 GPUs.
Our model is built on Stable Diffusion, with a newly introduced motion-guided warping module. We formulate GIF generation as a temporally instructed image editing problem.
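To make the idea of spatially warping a feature map concrete, here is a minimal sketch in plain Python. It is not the Pix2Gif module itself: the text above does not specify how the warp is computed, so this assumes a per-pixel displacement field (the hypothetical `flow` argument) is already available and applies backward warping with bilinear sampling on a single-channel map. In the actual model the displacement would presumably be derived from the text and motion magnitude prompts; the function names here are illustrative only.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample a 2D map (list of rows) at real-valued (x, y), clamping to the border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp(feat, flow):
    """Backward warp: output[y][x] is feat sampled at (x + dx, y + dy).

    `flow[y][x]` is an assumed (dx, dy) displacement for each output location.
    """
    h, w = len(feat), len(feat[0])
    return [
        [bilinear_sample(feat, x + flow[y][x][0], y + flow[y][x][1]) for x in range(w)]
        for y in range(h)
    ]
```

With a zero displacement field the map is returned unchanged, and scaling the displacements up or down is one simple way to picture how a larger or smaller motion magnitude would move content further or less far between frames.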
Prompt: A cat is playing with wool.
Input Image
Ours (Pix2Gif)
Action 1: cat is playing with wool.
Action 2: cat is dancing.
Action 1+2: cat is dancing while playing with wool.
Prompt: A cat is playing with wool.
Prompt: The two person are running.
Prompt: A big wave.
Prompt: The wind is blowing the flower.
We train our model on the TGIF dataset, which contains 100K animated GIFs with captions. We extract frames from the GIFs and use the captions as text prompts. We further curate the dataset by removing GIFs with fewer than 5 frames and GIFs with duplicate captions. The final dataset contains 100K GIFs with 5-20 frames each. We split the dataset into 80K GIFs for training and 20K for testing.
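The curation steps described above can be sketched roughly as follows. This is an illustration under assumptions, not the authors' pipeline: it assumes each GIF is available as a `(caption, frames)` pair, treats captions as duplicates after trimming whitespace and lowercasing, keeps the first GIF seen for each caption, and uses a seeded random 80/20 split. The names `curate` and `split` are hypothetical.

```python
import random


def curate(gifs, min_frames=5):
    """Drop GIFs with too few frames and GIFs whose caption was already seen.

    `gifs` is an iterable of (caption, frames) pairs. Keeping the first
    occurrence of a repeated caption is an assumption of this sketch.
    """
    seen = set()
    kept = []
    for caption, frames in gifs:
        if len(frames) < min_frames:
            continue
        key = caption.strip().lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append((caption, frames))
    return kept


def split(items, train_fraction=0.8, seed=0):
    """Shuffle deterministically and split into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

A call such as `train, test = split(curate(gifs))` would then give the two subsets; the fixed seed only serves to make the split reproducible.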
@misc{kandala2024pix2gif,
title={Pix2Gif: Motion-Guided Diffusion for GIF Generation},
author={Hitesh Kandala and Jianfeng Gao and Jianwei Yang},
year={2024},
eprint={2403.04634},
archivePrefix={arXiv},
primaryClass={cs.CV}
}