ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou

Abstract
In this work, we explore a scalable way for building a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows for the easy extension of new modalities by adding adapters and FFNs, while also enabling multi-modal fusion through self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic space of different modalities and capture fine-grained details within modalities concurrently. With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.
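The abstract names cross-modal aligning contrast as one of the two pretraining tasks but does not give its formula. As a rough illustration of what aligning two modalities in a shared semantic space can look like, the sketch below assumes a symmetric InfoNCE-style contrastive loss over a batch of paired features (for example, image and text embeddings). The function names, the temperature value, and the toy vectors are all hypothetical and are not taken from the ONE-PEACE code base.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def cross_modal_contrastive_loss(feats_a, feats_b, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed stand-in for the paper's task).

    feats_a[i] and feats_b[i] are features of the same sample in two
    modalities; every other pairing in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in feats_a]
    b = [l2_normalize(v) for v in feats_b]
    n = len(a)

    # logits[i][j] = cosine similarity between a[i] and b[j], scaled.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    logits_t = [list(col) for col in zip(*logits)]

    # Cross-entropy with the matching index as the target, in both directions.
    loss_a2b = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_b2a = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_a2b + loss_b2a)


# Toy batch: two samples, each with a feature in modality A and modality B.
aligned_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9]]
shuffled_b = [aligned_b[1], aligned_b[0]]

loss_aligned = cross_modal_contrastive_loss(aligned_a, aligned_b)
loss_shuffled = cross_modal_contrastive_loss(aligned_a, shuffled_b)
```

In this toy batch the correctly paired features yield a much smaller loss than the shuffled pairing, which is the behavior a contrastive alignment objective is meant to reward. The intra-modal denoising contrast task is not sketched here, since the abstract gives too little detail to do so faithfully.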
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | ONE-PEACE | Acc@1: 88.1; Acc@5: 97.8 |
| audio-classification-on-fsd50k | ONE-PEACE | mAP: 69.7 |
| audio-classification-on-vggsound | ONE-PEACE (Audio-Visual) | Top-1 Accuracy: 68.2 |
| audio-classification-on-vggsound | ONE-PEACE (Audio-Only) | Top-1 Accuracy: 59.6 |
| image-classification-on-imagenet | ONE-PEACE | Number of params: 1520M |
| image-to-text-retrieval-on-coco | ONE-PEACE (ViT-G, w/o ranking) | Recall@1: 84.1; Recall@5: 96.3; Recall@10: 98.3 |
| image-to-text-retrieval-on-flickr30k | ONE-PEACE (finetuned, w/o ranking) | Recall@1: 97.6; Recall@5: 100; Recall@10: 100 |
| semantic-segmentation-on-ade20k | ONE-PEACE | Params: 1500M; Validation mIoU: 63.0 |
| visual-question-answering-on-vqa-v2-test-dev | ONE-PEACE | Accuracy: 82.6 |
| visual-question-answering-on-vqa-v2-test-std | ONE-PEACE | Overall: 82.52; Yes/No: 94.85; Number: 72.24; Other: 74.15 |