Jeong Yujin, Kim Yunji, Chun Sanghyuk, Lee Jiyoung

Abstract
Despite the impressive progress of multimodal generative models, video-to-audio generation still suffers from limited performance and lacks the flexibility to prioritize sound synthesis for specific objects within the scene. Conversely, text-to-audio generation methods produce high-quality audio but pose challenges in ensuring comprehensive scene depiction and time-varying control. To tackle these challenges, we propose a novel video-and-text-to-audio generation method, called ReWaS, where video serves as a conditional control for a text-to-audio generation model. In particular, our method estimates the structural information of sound (namely, energy) from the video while receiving key content cues from a user prompt. We employ a well-performing text-to-audio model to consolidate the video control, which is much more efficient than training multimodal diffusion models with massive triplet-paired (audio-video-text) data. In addition, by separating the generative components of audio, our method becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences. Experimental results demonstrate that our method shows superiority in terms of quality, controllability, and training efficiency. Code and demo are available at https://naver-ai.github.io/rewas.
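The abstract describes energy as a temporal structure signal estimated from video and injected as a control into a text-to-audio backbone. Below is a minimal, hypothetical sketch of that interface, not the authors' released code: `EnergyPredictor`, `generate_audio`, and the `t2a_model.generate(...)` call are assumed names used only to illustrate how a frame-level energy envelope could condition a text-to-audio model.

```python
import torch
import torch.nn as nn


class EnergyPredictor(nn.Module):
    """Hypothetical module: maps per-frame video features to a 1-D energy envelope."""

    def __init__(self, video_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(video_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, frames, video_dim) -> energy: (batch, frames)
        return self.net(video_feats).squeeze(-1)


def generate_audio(t2a_model, prompt: str, video_feats: torch.Tensor,
                   energy_predictor: EnergyPredictor, energy_scale: float = 1.0):
    """Sketch of energy-conditioned generation.

    `t2a_model` stands in for any text-to-audio backbone that accepts an extra
    temporal control signal; its `generate` interface is an assumption for this
    sketch, not taken from the paper's code.
    """
    energy = energy_predictor(video_feats) * energy_scale  # user-adjustable structure control
    return t2a_model.generate(prompt=prompt, control=energy)


if __name__ == "__main__":
    predictor = EnergyPredictor()
    dummy_video = torch.randn(1, 32, 512)  # 32 frames of pooled video features
    print(predictor(dummy_video).shape)    # torch.Size([1, 32])
```

Keeping the content cue (the text prompt) separate from the structure cue (the energy envelope) is what allows a user to rescale or swap either signal independently, which is the controllability claim made in the abstract.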
Code Repositories
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| Video-to-Sound Generation on VGG-Sound | ReWaS | FAD: 2.16, FD: 15.24 |