
Faster Video Diffusion with Trainable Sparse Attention

Peiyuan Zhang, Haofeng Huang, Yongqi Chen, Will Lin, Zhengzhong Liu, Ion Stoica, Eric P. Xing, Hao Zhang
Release Date: 5/21/2025
Abstract

Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models.
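
The two-stage coarse-to-fine mechanism described above can be illustrated with a minimal PyTorch sketch. The tile size, top-k budget, mean-pooling, and the function name below are illustrative assumptions, not the paper's method: VSA itself is a single fused, differentiable kernel, whereas this sketch loops over query tiles purely to show the idea of scoring pooled tiles coarsely and then running exact attention only inside the selected tiles.

```python
# Minimal sketch of coarse-to-fine block-sparse attention (illustrative only;
# tile_size, topk_tiles, and mean-pooling are assumptions, not VSA's kernel).
import torch
import torch.nn.functional as F


def coarse_to_fine_sparse_attention(q, k, v, tile_size=64, topk_tiles=8):
    """q, k, v: (batch, heads, seq_len, head_dim); seq_len divisible by tile_size."""
    b, h, n, d = q.shape
    t = n // tile_size  # number of tiles along the sequence

    # Coarse stage: mean-pool queries and keys into tile-level summaries,
    # then score every key tile against every query tile.
    q_tiles = q.view(b, h, t, tile_size, d).mean(dim=3)            # (b, h, t, d)
    k_tiles = k.view(b, h, t, tile_size, d).mean(dim=3)            # (b, h, t, d)
    tile_scores = q_tiles @ k_tiles.transpose(-1, -2) / d ** 0.5   # (b, h, t, t)

    # Keep only the top-k highest-scoring key tiles for each query tile.
    topk = min(topk_tiles, t)
    sel = tile_scores.topk(topk, dim=-1).indices                   # (b, h, t, topk)

    # Fine stage: exact token-level attention, restricted to the selected tiles.
    k_t = k.view(b, h, t, tile_size, d)
    v_t = v.view(b, h, t, tile_size, d)
    out = torch.empty_like(q)
    for qt in range(t):
        q_blk = q[:, :, qt * tile_size:(qt + 1) * tile_size]       # (b, h, tile, d)
        gidx = sel[:, :, qt][:, :, :, None, None].expand(b, h, topk, tile_size, d)
        k_sel = k_t.gather(2, gidx).reshape(b, h, topk * tile_size, d)
        v_sel = v_t.gather(2, gidx).reshape(b, h, topk * tile_size, d)
        out[:, :, qt * tile_size:(qt + 1) * tile_size] = \
            F.scaled_dot_product_attention(q_blk, k_sel, v_sel)
    return out


# Example: 4,096 tokens, 8 heads, 64-dim heads; each query tile attends to
# only 8 of the 64 key tiles, so the fine stage touches ~12.5% of the positions.
q = torch.randn(1, 8, 4096, 64)
k = torch.randn(1, 8, 4096, 64)
v = torch.randn(1, 8, 4096, 64)
print(coarse_to_fine_sparse_attention(q, k, v).shape)  # torch.Size([1, 8, 4096, 64])
```

Because both stages are plain tensor operations, gradients flow through the coarse scoring and the fine attention alike, which is what allows this style of sparse attention to be trained end-to-end rather than profiled and imposed after the fact.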