Pix2Gif: Motion-Guided Diffusion for GIF Generation

1*AMD, 2Microsoft Research
*work done while at Microsoft Research


Prompt: The man is riding a horse.

Pix2Gif

Source: Westworld/HBO

Prompt: The Joker is talking and smiling.

Pix2Gif

Source: Joker/Batman:Dark Knight

Prompt: A big wave.

Pix2Gif

Source: The Great Wave off Kanagawa

Prompt: A big sea wave.

Pix2Gif

Source: Brett Allen/Shutterstock.com

Prompt: Two person are walking.

Pix2Gif

Source: Malte Mueller/Getty Images

Prompt: The horse is walking.

Pix2Gif

Source: Ernie Cowan



Abstract



We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video) generation. We approach the problem differently, formulating the task as an image translation problem steered by text and motion magnitude prompts. To make the model adhere to the motion guidance, we propose a new motion-guided warping module that spatially transforms the features of the source image conditioned on the two types of prompts. Furthermore, we introduce a perceptual loss that keeps the transformed feature map within the same space as the target image, which preserves content consistency and coherence. To prepare for model training, we curate the data by extracting coherent image frames from the TGIF video-caption dataset, which provides rich information about the temporal changes of subjects. After pretraining, we apply our model in a zero-shot manner to a number of video datasets. Extensive qualitative and quantitative experiments demonstrate the effectiveness of our model: it captures not only the semantic prompt from the text but also the spatial prompt from the motion guidance. We train all our models using a single node of 16xV100 GPUs.





Pix2Gif



Our model is built on Stable Diffusion, extended with a newly introduced motion-guided warping module. We formulate GIF generation as a temporally instructed image editing problem.
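To make the idea of spatially transforming source features more concrete, here is a minimal sketch of backward warping with bilinear sampling on a single-channel feature map. It is an illustration under stated assumptions, not the Pix2Gif implementation: the names `bilinear_sample` and `warp` are hypothetical, and the displacement field (`flow_x`, `flow_y`) is simply passed in, whereas in a motion-guided model it would presumably be predicted from the text and motion magnitude prompts.

```python
from typing import List

Grid = List[List[float]]  # a single-channel feature map, indexed as grid[y][x]


def bilinear_sample(feat: Grid, x: float, y: float) -> float:
    """Sample feat at a continuous (x, y), clamping coordinates to the border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp(feat: Grid, flow_x: Grid, flow_y: Grid) -> Grid:
    """Backward warp: out[y][x] is feat sampled at (x + flow_x, y + flow_y).

    The flow is an input here (assumption); this sketch does not predict it.
    """
    h, w = len(feat), len(feat[0])
    return [
        [bilinear_sample(feat, x + flow_x[y][x], y + flow_y[y][x]) for x in range(w)]
        for y in range(h)
    ]
```

With a zero displacement field the warp returns the input unchanged, so a larger motion magnitude can be read as a larger displacement applied to the source features. In a PyTorch model the same operation would normally be done on batched tensors rather than nested lists.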



Method Overview



Our Examples



Prompt: A cat is playing with wool.

Input Image


Pix2Gif Image

Ours (Pix2Gif)



Compositionality of actions



Input Image

Action 1: cat is playing with wool.

Pix2Gif Image

Action 2: cat is dancing.

Pix2Gif Image

Action 1+2: cat is dancing while playing with wool.



Comparison to state-of-the-art Image-to-Video Methods



Prompt: A cat is playing with wool.

Input Image


Pix2Gif Image

Ours


Prompt: The two person are running.

Input Image


Pix2Gif Image

Ours


Prompt: A big wave.

Input Image


Pix2Gif Image

Ours


Prompt: The wind is blowing the flower.

Input Image


Pix2Gif Image

Ours




Dataset



We train our model on the TGIF dataset, which contains 100K animated GIFs with captions. We extract frames from the GIFs and use the captions as text prompts. We further curate the dataset by removing GIFs with fewer than 5 frames and GIFs with duplicate captions. The final dataset contains 100K GIFs with 5-20 frames each, which we split into 80K for training and 20K for testing.
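The curation steps described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' pipeline: `GifRecord`, `curate`, and `split` are hypothetical names. It also assumes that when several GIFs share a caption the first one is kept, that captions are compared after trimming and lowercasing, and that the split is a seeded random shuffle; the text does not specify any of these details.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GifRecord:
    gif_id: str       # hypothetical identifier for a GIF
    caption: str      # caption used as the text prompt
    num_frames: int   # number of frames extracted from the GIF


def curate(records: List[GifRecord], min_frames: int = 5) -> List[GifRecord]:
    """Drop GIFs with fewer than min_frames frames and GIFs with repeated captions."""
    seen_captions = set()
    kept = []
    for rec in records:
        if rec.num_frames < min_frames:
            continue
        key = rec.caption.strip().lower()  # assumption: case-insensitive match
        if key in seen_captions:
            continue  # assumption: keep only the first GIF per caption
        seen_captions.add(key)
        kept.append(rec)
    return kept


def split(
    records: List[GifRecord], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[GifRecord], List[GifRecord]]:
    """Shuffle with a fixed seed and split into train and test subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

A train fraction of 0.8 mirrors the 80K/20K split stated above.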



Frames

Filter
Final


BibTeX

@misc{kandala2024pix2gif,
      title={Pix2Gif: Motion-Guided Diffusion for GIF Generation}, 
      author={Hitesh Kandala and Jianfeng Gao and Jianwei Yang},
      year={2024},
      eprint={2403.04634},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}