Pix2Gif: Motion-Guided Diffusion for GIF Generation

Prompt: The man is riding a horse.

Source: Westworld/HBO

Prompt: The Joker is talking and smiling.

Source: Joker/Batman:Dark Knight

Prompt: A big wave.

Source: The Great Wave off Kanagawa

Prompt: A big sea wave.

Source: Brett Allen/Shutterstock.com

Prompt: Two person are walking.

Source: Malte Mueller/Getty Images

Prompt: The horse is walking.

Source: Ernie Cowan

Abstract

We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video) generation. We take a different approach to this problem, formulating the task as an image translation problem steered by text and motion magnitude prompts. To ensure that the model adheres to motion guidance, we propose a new motion-guided warping module that spatially transforms the features of the source image conditioned on the two types of prompts. Furthermore, we introduce a perceptual loss that keeps the transformed feature map within the same space as the target image, preserving content consistency and coherence. To prepare for model training, we curated data by extracting coherent image frames from the TGIF video-caption dataset, which provides rich information about the temporal changes of subjects. After pretraining, we apply our model in a zero-shot manner to a number of video datasets. Extensive qualitative and quantitative experiments demonstrate the effectiveness of our model: it captures not only the semantic prompt from text but also the spatial prompt from motion guidance. We train all our models using a single node of 16xV100 GPUs.

Pix2Gif

Our model is built on Stable Diffusion, with a newly introduced motion-guided warping module. We formulate GIF generation as a temporally instructed image editing problem.
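
To give a rough intuition for what "spatially transforming features by a motion magnitude" can mean, here is a minimal sketch of a backward warp on a feature map. It is not the Pix2Gif module: the function name `warp_features`, the `direction` field, and the nearest-neighbour sampling are all assumptions made for illustration, and in the actual model the transformation is learned and conditioned on both the text and the motion prompt.

```python
import numpy as np

def warp_features(feat, direction, magnitude):
    """Backward-warp a feature map by a scaled displacement field.

    feat:      array of shape (C, H, W), the source-image features.
    direction: array of shape (2, H, W); direction[0] is the x offset and
               direction[1] the y offset per pixel (hypothetical input).
    magnitude: scalar motion magnitude that scales the displacement.
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Each output pixel (y, x) samples the input at (y + dy, x + dx),
    # rounded to the nearest pixel and clamped to the image border.
    src_x = np.clip(np.rint(xs + magnitude * direction[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + magnitude * direction[1]).astype(int), 0, h - 1)
    return feat[:, src_y, src_x]
```

With `magnitude=0` the features pass through unchanged; larger values displace them further, which is the sense in which a single scalar can steer how much motion appears between the input image and the generated frame.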

Our Examples

Prompt: A cat is playing with wool.

Input Image

Ours (Pix2Gif)

Compositionality of actions

Action 1: cat is playing with wool.

Action 2: cat is dancing.

Action 1+2: cat is dancing while playing with wool.

Comparison to state-of-the-art Image-to-Video Methods

Prompt: A cat is playing with wool.

Input Image

DynamiCrafter

Pika Labs

Ours

Prompt: The two person are running.

Input Image

DynamiCrafter

Pika Labs

Ours

Prompt: A big wave.

Input Image

DynamiCrafter

Pika Labs

Ours

Prompt: The wind is blowing the flower.

Input Image

DynamiCrafter

Pika Labs

Ours

Dataset

We use the TGIF dataset for model training. The dataset contains 100K animated GIFs with captions. We extract frames from the GIFs and use the captions as text prompts. We further curate the dataset by removing GIFs with fewer than 5 frames and GIFs with duplicate captions. The final dataset contains 100K GIFs with 5-20 frames. We split the dataset into 80K GIFs for training and 20K for testing.
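
The curation steps above can be sketched in a few lines. This is an illustrative sketch, not our released pipeline: the record layout (`caption`, `frames`), the function name `curate`, the choice to keep the first GIF for each repeated caption, and the seeded random split are all assumptions.

```python
import random

def curate(gifs, min_frames=5, train_fraction=0.8, seed=0):
    """Filter GIF records and split them into train and test sets.

    gifs: list of dicts with a 'caption' string and a 'frames' list
          (hypothetical record layout).
    """
    seen_captions = set()
    kept = []
    for gif in gifs:
        # Drop GIFs that are too short to show meaningful motion.
        if len(gif["frames"]) < min_frames:
            continue
        # Drop GIFs whose caption was already seen (assumption: keep the first).
        key = gif["caption"].strip().lower()
        if key in seen_captions:
            continue
        seen_captions.add(key)
        kept.append(gif)

    # Shuffle deterministically, then split 80/20 by default.
    rng = random.Random(seed)
    rng.shuffle(kept)
    cut = int(len(kept) * train_fraction)
    return kept[:cut], kept[cut:]
```

Calling `train, test = curate(records)` on a list of such records returns the two splits; the caption of each kept record then serves as the text prompt for its frames.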
