Ilpo Viertola, Vladimir Iashin, Esa Rahtu

Abstract
We introduce V-AURA, the first autoregressive model to achieve high temporal alignment and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature extractor and a cross-modal audio-visual feature fusion strategy to capture fine-grained visual motion events and ensure precise temporal alignment. Additionally, we propose VisualSound, a benchmark dataset with high audio-visual relevance. VisualSound is based on VGGSound, a video dataset consisting of in-the-wild samples extracted from YouTube. During the curation, we remove samples where auditory events are not aligned with the visual ones. V-AURA outperforms current state-of-the-art models in temporal alignment and semantic relevance while maintaining comparable audio quality. Code, samples, VisualSound and models are available at https://v-aura.notion.site
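The abstract does not detail the fusion mechanism, so as a loose illustration only, here is a minimal sketch of one common cross-modal fusion pattern: audio token embeddings cross-attending to dense, high-framerate visual features. All class names, dimensions, and rates below are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative cross-attention block: autoregressive audio token
    embeddings attend to a sequence of per-frame visual features.
    Dimensions and naming are assumptions for this sketch only."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_tokens: torch.Tensor,
                visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (batch, T_audio, d_model) -- autoregressive audio stream
        # visual_feats: (batch, T_video, d_model) -- dense per-frame features
        fused, _ = self.attn(query=audio_tokens,
                             key=visual_feats, value=visual_feats)
        return self.norm(audio_tokens + fused)  # residual connection

# Toy usage: hypothetical 25 fps visual features conditioning one
# second of audio tokens at a hypothetical 86 tokens/s.
fusion = CrossModalFusion()
audio = torch.randn(1, 86, 512)
video = torch.randn(1, 25, 512)
print(fusion(audio, video).shape)  # torch.Size([1, 86, 512])
```

The point of such a design is that the query stream (audio tokens) can look up visually relevant context at each generation step, which is one way fine-grained visual motion can inform temporally aligned audio.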
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Video-to-Sound Generation on VGG-Sound | V-AURA | FAD: 1.92 |
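For context, FAD (Fréchet Audio Distance, lower is better) compares Gaussian statistics of embeddings extracted from reference and generated audio. A minimal sketch of the distance computation follows; the embedding extraction step (typically with a pretrained audio model such as VGGish) is assumed to happen beforehand, and the function name here is illustrative.

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between two Gaussians fitted to embedding sets:
    ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}),
    where r = reference audio, g = generated audio."""
    diff = mu_r - mu_g
    # Matrix square root of the covariance product; drop tiny imaginary
    # components introduced by numerical error.
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Toy usage with random stand-ins for embedding sets of dimension 8.
rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 8))
gen = rng.normal(size=(100, 8))
fad = frechet_audio_distance(ref.mean(0), np.cov(ref, rowvar=False),
                             gen.mean(0), np.cov(gen, rowvar=False))
print(f"FAD: {fad:.3f}")
```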