
VLIS: Unimodal Language Models Guide Multimodal Language Generation

Jiwan Chung; Youngjae Yu

Abstract

Multimodal language generation, which leverages the synergy of language and vision, is a rapidly expanding field. However, existing vision-language models face challenges in tasks that require complex linguistic understanding. To address this issue, we introduce Visual-Language models as Importance Sampling weights (VLIS), a novel framework that combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training. It extracts pointwise mutual information of each image and text from a visual-language model and uses the value as an importance sampling weight to adjust the token likelihood from a text-only model. VLIS improves vision-language models on diverse tasks, including commonsense understanding (WHOOPS, OK-VQA, and ScienceQA) and complex text generation (Concadia, Image Paragraph Captioning, and ROCStories). Our results suggest that VLIS represents a promising new direction for multimodal language generation.
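The scoring rule the abstract describes is compact enough to sketch in code. The snippet below is a minimal PyTorch illustration, not the authors' implementation (see the jiwanchung/vlis repository for that): it assumes the marginal likelihood p(token | context) is approximated by querying the vision-language model without the image, and the function name and the `alpha` weight are hypothetical.

```python
import torch

def vlis_next_token_scores(
    text_lm_logits: torch.Tensor,       # [vocab_size], text-only LM logits for the next token
    vlm_logits_image: torch.Tensor,     # [vocab_size], VLM logits conditioned on the image
    vlm_logits_no_image: torch.Tensor,  # [vocab_size], VLM logits without the image (marginal approximation)
    alpha: float = 1.0,                 # hypothetical knob for the importance-weight strength
) -> torch.Tensor:
    """Score next-token candidates by weighting the text-only LM with the VLM's PMI."""
    log_p_text = torch.log_softmax(text_lm_logits, dim=-1)
    log_p_cond = torch.log_softmax(vlm_logits_image, dim=-1)
    log_p_marg = torch.log_softmax(vlm_logits_no_image, dim=-1)
    # Pointwise mutual information between the image and each candidate token:
    # PMI(image, token | ctx) = log p_vlm(token | image, ctx) - log p_vlm(token | ctx)
    pmi = log_p_cond - log_p_marg
    # exp(PMI) acts as an importance sampling weight on the text-only likelihood;
    # in log space this is an additive adjustment to the text-only log-probability.
    return log_p_text + alpha * pmi
```

At each decoding step the returned scores stand in for the text-only model's logits, and the next token is chosen greedily or sampled from their softmax, so visual grounding comes from the VLM while fluency and linguistic knowledge come from the text-only model.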

Code Repositories

jiwanchung/vlis (official, PyTorch)

Benchmarks

Benchmark                          Methodology    Metrics
explanation-generation-on-whoops   VLIS (LLaVA)   Accuracy: 73
explanation-generation-on-whoops   VLIS (Lynx)    Accuracy: 80
