Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Abstract
Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.