
Abstract

Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. After that, we propose a novel expert translation method that employs the latent-based VDMs to further upsample the low-resolution video to high resolution, which can also remove potential artifacts and corruptions from low-resolution videos. Compared to latent VDMs, Show-1 can produce high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15 GB vs. 72 GB). Furthermore, our Show-1 model can be readily adapted for motion customization and video stylization applications through simple temporal attention layer finetuning. Our model achieves state-of-the-art performance on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
