DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance

Anonymous Institute
All images used for video generation are from Midjourney.

Abstract

Image-to-video generation, which aims to generate a video starting from a given reference image, has drawn great attention. Existing methods try to extend pre-trained text-guided image diffusion models to image-guided video generation models. However, these methods often suffer from low fidelity or temporal flickering, because their image guidance is shallow and their temporal consistency is poor. To tackle these problems, we propose DreamVideo, a high-fidelity image-to-video generation method that adds a frame-retention branch to a pre-trained video diffusion model. Instead of integrating the reference image into the diffusion process only at a semantic level, DreamVideo perceives the reference image through convolution layers and concatenates the resulting features with the noisy latents as model input, so the details of the reference image are preserved to the greatest extent. In addition, by incorporating double-condition classifier-free guidance, a single image can be directed to videos of different actions by providing different text prompts, which has significant implications for controllable video generation and broad application prospects. We conduct comprehensive experiments on public datasets, and both quantitative and qualitative results indicate that our method outperforms state-of-the-art methods. In particular, our model has strong image-retention ability and, to the best of our knowledge, delivers the best results on UCF-101 among image-to-video models. Precise control can also be achieved by giving different text prompts. Further details and comprehensive results of our model are presented below.

Model Architecture

The architecture of DreamVideo.

The architecture of our DreamVideo model consists of two main components: the primary U-Net block and the Image Retention block; modules marked with flame symbols are trainable. A reference image is processed by a convolution block and concatenated with the representation of the noisy latents. The Image Retention module, a side branch copied from the downsample blocks of the U-Net, maintains the visual details of the input image while also accepting text prompts for motion control.
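To make the design concrete, below is a minimal PyTorch-style sketch of how such an image-retention branch could be wired up: a convolution block embeds the reference-image latents, the result is concatenated with the noisy latents, and a trainable copy of the U-Net downsample blocks produces side-branch features that also attend to the text prompt. All module names, channel sizes, and block interfaces here are illustrative assumptions rather than the exact DreamVideo implementation.

```python
import torch
import torch.nn as nn

class ImageRetentionBranch(nn.Module):
    """Illustrative sketch of an image-retention side branch (assumed design).

    The branch perceives the reference image with a small convolution block,
    concatenates the result with the noisy latents, and runs a trainable copy
    of the U-Net downsample blocks whose features are added back to the frozen
    main U-Net. Names, sizes, and block interfaces are hypothetical.
    """

    def __init__(self, copied_down_blocks, latent_channels=4, hidden_channels=320):
        super().__init__()
        # Convolution block that perceives the reference image at the latent level.
        self.image_conv = nn.Sequential(
            nn.Conv2d(latent_channels, hidden_channels, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden_channels, latent_channels, kernel_size=3, padding=1),
        )
        # Trainable copy of the pre-trained U-Net downsample blocks.
        self.down_blocks = copied_down_blocks
        # Fuse the concatenated (noisy + image) latents back to the expected width.
        self.fuse = nn.Conv2d(2 * latent_channels, latent_channels, kernel_size=1)

    def forward(self, noisy_latents, image_latents, text_embeddings):
        # noisy_latents: (B*T, C, H, W) noisy latents, one slice per video frame.
        # image_latents: (B, C, H, W) latents of the single reference image.
        frames_per_clip = noisy_latents.shape[0] // image_latents.shape[0]
        image_features = self.image_conv(image_latents)
        # Broadcast the reference-image features to every frame of the clip.
        image_features = image_features.repeat_interleave(frames_per_clip, dim=0)
        x = self.fuse(torch.cat([noisy_latents, image_features], dim=1))
        # The copied blocks also receive the text prompt, so the branch keeps
        # the reference appearance while still following motion instructions.
        side_features = []
        for block in self.down_blocks:
            x = block(x, text_embeddings)  # assumed block signature
            side_features.append(x)
        return side_features  # added to the corresponding main U-Net features
```

Concatenating at the latent level, rather than injecting the image only through semantic embeddings, is what lets low-level details such as color and layout survive into every generated frame.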

Classifier-free Guidance for Two Conditionings

These are the first frames of videos generated using different image guidance scales (GS) for classifier-free guidance.

It can be observed that increasing GS substantially intensifies the brightness and contrast of the generated videos, indicating that the results progressively shift towards the image domain.
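For reference, the snippet below sketches one common way to combine two guidance scales when both an image and a text condition are available: the model is evaluated with the conditions dropped in turn, and the two differences are weighted by an image guidance scale (GS above) and a text guidance scale. The function and argument names are illustrative assumptions, not the exact DreamVideo formulation.

```python
def dual_condition_cfg(unet, latents, t, text_emb, image_cond,
                       null_text_emb, null_image_cond,
                       image_scale=3.0, text_scale=7.5):
    """Classifier-free guidance over two conditions (illustrative sketch).

    Three forward passes: fully unconditional, image-only, and image + text.
    Raising image_scale (GS) pushes the output towards the image domain.
    """
    eps_uncond = unet(latents, t, null_text_emb, null_image_cond)
    eps_image = unet(latents, t, null_text_emb, image_cond)
    eps_full = unet(latents, t, text_emb, image_cond)

    return (eps_uncond
            + image_scale * (eps_image - eps_uncond)
            + text_scale * (eps_full - eps_image))
```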

Quantitative Evaluation

Quantitative evaluation on UCF-101 and MSR-VTT.

Our model consistently achieves the lowest Fréchet Video Distance (FVD) scores on both datasets, which shows that it generates videos with better temporal continuity. Regarding the Inception Score (IS), our model scores higher than the other methods, meaning the videos generated by DreamVideo have the best quality. The FFFCLIP metric is obtained by using CLIP to compute the similarity between the first frames of the generated and original videos. However, we believe this metric may not fully reflect the quality of the generated first frame: the visual features encoded by CLIP are coarse and focus more on the presence of objects in the image than on its fidelity. FFFSSIM demonstrates our model's ability to generate videos with high image retention.
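To illustrate what a first-frame fidelity metric of this kind measures, the sketch below computes a CLIP cosine similarity and an SSIM score between the reference image and the first generated frame, using the Hugging Face transformers CLIP model and scikit-image. The exact protocols behind the FFFCLIP and FFFSSIM numbers in the table may differ, so treat this only as a rough approximation for intuition.

```python
import numpy as np
import torch
from PIL import Image
from skimage.metrics import structural_similarity
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def first_frame_scores(reference: Image.Image, first_frame: Image.Image):
    """Approximate first-frame fidelity: CLIP cosine similarity and SSIM."""
    # CLIP image-image similarity: global, object-level features.
    inputs = clip_processor(images=[reference, first_frame], return_tensors="pt")
    with torch.no_grad():
        feats = clip_model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    clip_sim = float((feats[0] @ feats[1]).item())

    # SSIM: low-level structural similarity, more sensitive to pixel detail.
    a = np.asarray(reference.convert("RGB").resize(first_frame.size))
    b = np.asarray(first_frame.convert("RGB"))
    ssim = structural_similarity(a, b, channel_axis=-1)
    return clip_sim, ssim
```

Because SSIM compares pixel-level structure while CLIP compares global semantic features, the SSIM-based variant is the more sensitive probe of image retention, which is consistent with the argument above.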

Qualitative Evaluation

The comparison of videos generated by I2VGen-XL, VideoCrafter1, and DreamVideo. The videos produced by our method closely resemble the source images, while those produced by I2VGen-XL and VideoCrafter1 are significantly weaker in color, position, and object retention.

Multiple Combinations of Conditions for Video Generation

Conditions for generating videos with DreamVideo: on the far left is text-only input, in the middle is both text and image input, and on the far right is image-only input.

Inference with Varied Textual Inputs

The videos generated under varied textual inputs.

Two-Stage Inference

The videos generated in two-stage inference.

More Videos at High Resolution

The videos generated from text and an initial image at 512 × 512 resolution. Upon further exploration, we find that, without additional training, directly changing the original model's resolution setting to 512 enables the production of videos at 512 × 512 resolution.

Inference with Varied Textual Inputs at High Resolution

The videos generated under varied textual inputs at 512 × 512 resolution.