Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

Abstract
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework, MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned, high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves a new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23 s to generate an 8 s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
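To make the flow matching objective concrete, below is a minimal sketch of a conditional flow matching (rectified-flow form) training loss. It assumes a hypothetical velocity-prediction network `model(xt, t, video_cond, text_cond)` operating on audio latents; the exact interpolation path, conditioning interface, and latent shapes used in MMAudio may differ.

```python
import torch
import torch.nn as nn


def flow_matching_loss(model: nn.Module,
                       x1: torch.Tensor,
                       video_cond: torch.Tensor,
                       text_cond: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching loss (rectified-flow form), a sketch.

    x1:          clean audio latents, shape (B, T, D)  -- assumed shape
    video_cond:  frame-level video features (hypothetical)
    text_cond:   text features (hypothetical)
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                      # Gaussian noise endpoint
    t = torch.rand(b, device=x1.device)            # uniform timesteps in [0, 1]
    t_ = t.view(b, 1, 1)
    xt = (1 - t_) * x0 + t_ * x1                   # linear interpolation path
    v_target = x1 - x0                             # constant target velocity along the path
    v_pred = model(xt, t, video_cond, text_cond)   # predicted velocity field
    return torch.mean((v_pred - v_target) ** 2)    # regress velocity, not noise
```

At inference, samples are drawn by integrating the learned velocity field from noise (t = 0) to data (t = 1), e.g. with a few Euler steps, which is consistent with the low inference time the abstract reports.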
Benchmarks
| Benchmark | Model | FAD | FD |
|---|---|---|---|
| Video-to-Sound Generation on VGGSound | MMAudio-S-16kHz | 0.79 | 5.22 |
| Video-to-Sound Generation on VGGSound | MMAudio-L-44.1kHz | 0.97 | 4.72 |