Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
Wang Yongqi ; Guo Wenxiang ; Huang Rongjie ; Huang Jiawei ; Wang Zehan ; You Fuming ; Li Ruiqi ; Zhao Zhou

Abstract
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, and it remains challenging to build V2A models with high generation quality, efficiency, and visual-audio temporal synchrony. We propose Frieren, a V2A model based on rectified flow matching. Frieren regresses the conditional transport vector field from noise to spectrogram latent with straight paths and conducts sampling by solving ODE, outperforming autoregressive and score-based models in terms of audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer and channel-level cross-modal feature fusion with strong temporal alignment, our model generates audio that is highly synchronized with the input video. Furthermore, through reflow and one-step distillation with guided vector field, our model can generate decent audio in a few, or even only one sampling step. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22%, and 6.2% improvement in inception score over the strong diffusion-based baseline. Audio samples are available at http://frieren-v2a.github.io.
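To make the core idea in the abstract concrete, here is a minimal sketch of rectified flow matching on plain Python lists: a straight-line interpolation between noise and data, the constant velocity used as the regression target, and fixed-step Euler sampling of the resulting ODE. It is an illustration under stated assumptions, not Frieren's implementation. The names `interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`, `toy_field`, and `TARGET` are hypothetical. The toy field is the closed-form solution for a data distribution consisting of a single point, standing in for a trained estimator. Spectrogram latents, the transformer estimator, visual conditioning, guidance, reflow, and distillation are all omitted.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target on a straight path: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict, x0, num_steps):
    """Integrate dx/dt = predict(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # always strictly below 1, so toy_field never divides by 0
        v = predict(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Assumption for illustration: the "data distribution" is one fixed point,
# so the exact vector field is known in closed form and no training is needed.
TARGET = [0.5, -1.0, 2.0]


def toy_field(x, t):
    """Exact velocity field when every straight path ends at TARGET (t < 1)."""
    return [(b - a) / (1.0 - t) for a, b in zip(x, TARGET)]


if __name__ == "__main__":
    random.seed(0)
    noise = [random.gauss(0.0, 1.0) for _ in TARGET]
    # Both runs should land on TARGET up to floating-point error.
    print(euler_sample(toy_field, noise, num_steps=1))
    print(euler_sample(toy_field, noise, num_steps=25))
```

Because the paths in this toy case are exactly straight, a single Euler step already reaches the target. With a learned estimator and a real data distribution the paths are only approximately straight, which is the gap that the reflow and one-step distillation mentioned in the abstract aim to narrow.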
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| video-to-sound-generation-on-vgg-sound | Frieren | FAD: 1.32; FD: 12.26 |