Retrieval-Augmented Multimodal Language Modeling

Michihiro Yasunaga Armen Aghajanyan Weijia Shi Rich James Jure Leskovec Percy Liang Mike Lewis Luke Zettlemoyer Wen-tau Yih

Abstract

Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant text and images fetched by a retriever from external memory (e.g., documents on the web). Specifically, for the retriever, we use a pretrained CLIP, and for the generator, we train a CM3 Transformer on the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate both text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel capabilities, such as faithful image generation and multimodal in-context learning (e.g., image generation from demonstrations).
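The retrieve-then-generate flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-dimensional vectors stand in for embeddings that a real retriever such as CLIP would produce, and the names `cosine`, `retrieve`, `build_generator_input`, and `MEMORY` are hypothetical, not taken from the paper. The sketch only shows the general idea of ranking memory documents by similarity to the query and placing the top results in front of the prompt before it reaches the generator.

```python
import math
from typing import List, Sequence, Tuple

Document = Tuple[str, Sequence[float]]  # (content placeholder, toy embedding)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: Sequence[float], memory: List[Document], k: int = 2) -> List[str]:
    """Return the k memory documents most similar to the query embedding."""
    ranked = sorted(memory, key=lambda doc: cosine(query_vec, doc[1]), reverse=True)
    return [content for content, _ in ranked[:k]]


def build_generator_input(retrieved: List[str], prompt: str) -> List[str]:
    """Prepend retrieved documents to the prompt (assumed in-context layout)."""
    return retrieved + [prompt]


# Toy external memory; a real system would store image/text documents
# together with embeddings from a pretrained encoder.
MEMORY: List[Document] = [
    ("doc: Eiffel Tower photo + caption", [0.9, 0.1, 0.0]),
    ("doc: Golden Gate Bridge photo + caption", [0.1, 0.9, 0.0]),
    ("doc: Paris skyline photo + caption", [0.8, 0.2, 0.1]),
]

query_embedding = [1.0, 0.0, 0.0]  # stand-in for the embedded prompt
generator_input = build_generator_input(
    retrieve(query_embedding, MEMORY, k=2),
    "prompt: a photo of the Eiffel Tower",
)
print(generator_input)
```

In this toy setup the generator would see the two most similar memory documents followed by the prompt, so knowledge such as the appearance of a landmark can come from memory rather than from model parameters.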

Benchmarks

Benchmark                          Model                     Metric
image-captioning-on-coco           DALL-E                    CIDEr: 20.2
image-captioning-on-coco           Flamingo (80B; 4-shot)    CIDEr: 103
image-captioning-on-coco           ruDALL-E-XL               CIDEr: 38.7
image-captioning-on-coco           minDALL-E                 CIDEr: 48
image-captioning-on-coco           Parti                     CIDEr: 83.9
image-captioning-on-coco           X-LXMERT                  CIDEr: 55.8
image-captioning-on-coco           Vanilla CM3               CIDEr: 71.9
image-captioning-on-coco           RA-CM3 (2.7B)             CIDEr: 89.1
image-captioning-on-coco           Flamingo (3B; 4-shot)     CIDEr: 85
text-to-image-generation-on-coco   Vanilla CM3               FID: 29.5
text-to-image-generation-on-coco   RA-CM3 (2.7B)             FID: 15.7
text-to-image-generation-on-coco   DALL-E (12B)              FID: 28
text-to-image-generation-on-coco   Stable Diffusion          FID: 12.63
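
As a quick arithmetic check, the benchmark figures listed above can be compared against the "12 FID and 17 CIDEr improvements" quoted in the abstract. The sketch below only subtracts the listed scores; the helper names `cider_gain` and `fid_gain` are hypothetical, and the page does not state which baseline each headline number refers to, so the pairing suggested afterwards is an inference from the arithmetic rather than a statement from the paper. Higher CIDEr is better and lower FID is better, so both gains are written as positive improvements.

```python
# Scores copied from the benchmark listing above.
CIDER = {"DALL-E": 20.2, "Vanilla CM3": 71.9, "RA-CM3 (2.7B)": 89.1}
FID = {"DALL-E (12B)": 28.0, "Vanilla CM3": 29.5, "RA-CM3 (2.7B)": 15.7}


def cider_gain(baseline: str) -> float:
    """CIDEr improvement of RA-CM3 over a baseline (higher CIDEr is better)."""
    return round(CIDER["RA-CM3 (2.7B)"] - CIDER[baseline], 1)


def fid_gain(baseline: str) -> float:
    """FID improvement of RA-CM3 over a baseline (lower FID is better)."""
    return round(FID[baseline] - FID["RA-CM3 (2.7B)"], 1)


print("CIDEr gain vs Vanilla CM3:", cider_gain("Vanilla CM3"))  # 17.2
print("FID gain vs DALL-E (12B):", fid_gain("DALL-E (12B)"))    # 12.3
print("FID gain vs Vanilla CM3:", fid_gain("Vanilla CM3"))      # 13.8
```

On these numbers, the roughly 17-point CIDEr gain matches the comparison with Vanilla CM3, and the roughly 12-point FID gain matches the comparison with DALL-E (12B).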
