Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

Abstract
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework, MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned, high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves a new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23 s to generate an 8 s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
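To make the flow matching objective concrete, below is a minimal sketch of a conditional flow matching (rectified-flow form) training loss. It assumes a hypothetical velocity-prediction network `model(xt, t, video_cond, text_cond)` operating on audio latents; the exact interpolation path, conditioning interface, and latent shapes used in MMAudio may differ.

```python
import torch
import torch.nn as nn


def flow_matching_loss(model: nn.Module,
                       x1: torch.Tensor,
                       video_cond: torch.Tensor,
                       text_cond: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching loss (rectified-flow form), a sketch.

    x1:          clean audio latents, shape (B, T, D)  -- assumed shape
    video_cond:  frame-level video features (hypothetical)
    text_cond:   text features (hypothetical)
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                      # Gaussian noise endpoint
    t = torch.rand(b, device=x1.device)            # uniform timesteps in [0, 1]
    t_ = t.view(b, 1, 1)
    xt = (1 - t_) * x0 + t_ * x1                   # linear interpolation path
    v_target = x1 - x0                             # constant target velocity along the path
    v_pred = model(xt, t, video_cond, text_cond)   # predicted velocity field
    return torch.mean((v_pred - v_target) ** 2)    # regress velocity, not noise
```

At inference, samples are drawn by integrating the learned velocity field from noise (t = 0) to data (t = 1), e.g. with a few Euler steps, which is consistent with the low inference time the abstract reports.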
Benchmarks
| Benchmark | Model | FAD | FD |
|---|---|---|---|
| Video-to-Sound Generation on VGGSound | MMAudio-S-16kHz | 0.79 | 5.22 |
| Video-to-Sound Generation on VGGSound | MMAudio-L-44.1kHz | 0.97 | 4.72 |