ModelScope Text-to-Video Technical Report

Jiuniu Wang Hangjie Yuan Dayou Chen Yingya Zhang Xiang Wang Shiwei Zhang

Abstract

This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model can adapt to varying frame numbers during training and inference, making it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), comprising 1.7 billion parameters in total, of which 0.5 billion are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.
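As a rough illustration of the spatio-temporal blocks mentioned in the abstract, the sketch below shows how a 2D image-diffusion layer can be paired with a temporal convolution so that information is mixed across frames. The layer names, sizes, and residual layout here are illustrative assumptions, not the model's exact implementation.

```python
# Minimal sketch (not the official implementation) of a spatio-temporal block:
# a per-frame spatial convolution followed by a cross-frame temporal convolution.
import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Spatial convolution: acts on each frame independently,
        # as in the inherited Stable Diffusion image layers.
        self.spatial_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Temporal convolution: mixes features across frames so that
        # consecutive frames stay consistent.
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width); the frame count may vary,
        # which lets the same block train on images (frames=1) and videos.
        b, f, c, h, w = x.shape
        # Spatial pass: fold frames into the batch dimension.
        y = self.spatial_conv(x.reshape(b * f, c, h, w)).reshape(b, f, c, h, w)
        # Temporal pass: fold spatial positions into the batch dimension.
        y = y.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
        y = self.temporal_conv(y)
        y = y.reshape(b, h, w, c, f).permute(0, 4, 3, 1, 2)
        return x + y  # residual connection keeps the image prior intact


# Example: 2 videos of 8 frames, 64 channels, 32x32 latents.
out = SpatioTemporalBlock(64)(torch.randn(2, 8, 64, 32, 32))
print(out.shape)  # torch.Size([2, 8, 64, 32, 32])
```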

Code Repositories

exponentialml/text-to-video-finetuning (Official, PyTorch)
yhZhai/mcm (PyTorch)
ali-vilab/VGen (PyTorch)
ali-vilab/i2vgen-xl (PyTorch)
picsart-ai-research/streamingt2v (PyTorch)
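
The model can also be run directly from the ModelScope hub referenced in the abstract. A minimal usage sketch follows, assuming the `modelscope` pip package and the `damo/text-to-video-synthesis` model card; exact pipeline arguments and output keys may differ between library versions.

```python
# Hedged usage sketch: generate a short clip from a text prompt via modelscope.
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# Build the text-to-video pipeline from the model card linked in the abstract.
t2v = pipeline('text-to-video-synthesis', 'damo/text-to-video-synthesis')

# The pipeline expects a dict with a 'text' prompt and returns a video path.
result = t2v({'text': 'A panda eating bamboo on a rock.'})
print(result[OutputKeys.OUTPUT_VIDEO])  # path to the generated video file
```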

Benchmarks

Benchmark: text-to-video-generation-on-msr-vtt
Method: ModelScopeT2V
CLIPSIM: 0.2930
FID: 11.09
FVD: 550
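
For reference, CLIPSIM in the table above is commonly computed as the average CLIP text-frame cosine similarity over a generated video. The sketch below follows that common definition; the CLIP checkpoint and preprocessing are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Hedged sketch of CLIPSIM: mean cosine similarity between the prompt embedding
# and each frame's CLIP image embedding.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clipsim(prompt: str, frames: list) -> float:
    """Average cosine similarity between the prompt and each video frame (PIL images)."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Normalize embeddings, then average the per-frame cosine similarities.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()


# Example with dummy gray frames (replace with decoded video frames).
print(clipsim("a panda eating bamboo", [Image.new("RGB", (256, 256), "gray")] * 4))
```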
