ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou

Abstract

In this work, we explore a scalable way for building a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows for the easy extension of new modalities by adding adapters and FFNs, while also enabling multi-modal fusion through self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic space of different modalities and capture fine-grained details within modalities concurrently. With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.
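The abstract names cross-modal aligning contrast as one of the two pretraining tasks but does not give its formula. As a rough illustration of the general idea of aligning paired embeddings from two modalities, the sketch below implements a generic symmetric InfoNCE-style contrastive loss in plain Python. It is an assumption-laden stand-in, not the authors' implementation: the function names, the temperature value of 0.07, and the use of cosine similarity are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum_exp for x in row]


def cross_modal_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    emb_a[i] and emb_b[i] are assumed to describe the same sample in two
    modalities (for example an image and its caption). Every other pairing
    in the batch is treated as a negative. The temperature is an assumed
    value, not one taken from the paper.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    n = len(a)

    # Pairwise similarity matrix: logits[i][j] = cos(a_i, b_j) / temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Direction a -> b: each row should put its mass on the diagonal entry.
    loss_a2b = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Direction b -> a: the same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_b2a = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n

    return 0.5 * (loss_a2b + loss_b2a)


if __name__ == "__main__":
    image_emb = [[1.0, 0.0], [0.0, 1.0]]
    text_emb = [[0.9, 0.1], [0.1, 0.9]]
    print(cross_modal_contrastive_loss(image_emb, text_emb))
```

Correctly paired embeddings yield a lower loss than the same embeddings with the pairing shuffled, which is the property a contrastive alignment objective relies on.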

Code Repositories

OFA-Sys/ONE-PEACE (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | ONE-PEACE | Acc@1: 88.1; Acc@5: 97.8 |
| audio-classification-on-fsd50k | ONE-PEACE | mAP: 69.7 |
| audio-classification-on-vggsound | ONE-PEACE (Audio-Visual) | Top 1 Accuracy: 68.2 |
| audio-classification-on-vggsound | ONE-PEACE (Audio-Only) | Top 1 Accuracy: 59.6 |
| image-classification-on-imagenet | ONE-PEACE | Number of params: 1520M |
| image-to-text-retrieval-on-coco | ONE-PEACE (ViT-G, w/o ranking) | Recall@1: 84.1; Recall@5: 96.3; Recall@10: 98.3 |
| image-to-text-retrieval-on-flickr30k | ONE-PEACE (finetuned, w/o ranking) | Recall@1: 97.6; Recall@5: 100; Recall@10: 100 |
| semantic-segmentation-on-ade20k | ONE-PEACE | Params (M): 1500; Validation mIoU: 63.0 |
| visual-question-answering-on-vqa-v2-test-dev | ONE-PEACE | Accuracy: 82.6 |
| visual-question-answering-on-vqa-v2-test-std | ONE-PEACE | overall: 82.52; yes/no: 94.85; number: 72.24; other: 74.15 |
