Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Lu Jiasen ; Clark Christopher ; Zellers Rowan ; Mottaghi Roozbeh ; Kembhavi Aniruddha

Abstract

We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression, to natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture, jointly on over 90 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task-specific fine-tuning. Code and demos for Unified-IO are available at: https://unified-io.allenai.org.
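
To make the idea of homogenizing outputs into discrete vocabulary tokens concrete, here is a minimal sketch of how a bounding box could be turned into a short token sequence and back. It is an illustration under stated assumptions, not the authors' implementation: the bin count `NUM_BINS`, the `<loc_i>` token spelling, and the function names `box_to_tokens` and `tokens_to_box` are all hypothetical.

```python
from typing import List, Tuple

# Assumption: number of quantization bins per coordinate (hypothetical value).
NUM_BINS = 1000

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def box_to_tokens(box: Box, image_size: Tuple[int, int],
                  num_bins: int = NUM_BINS) -> List[str]:
    """Quantize each box coordinate into one of num_bins location tokens."""
    width, height = image_size
    extents = (width, height, width, height)
    tokens = []
    for value, extent in zip(box, extents):
        # Normalize to [0, 1] and clamp so out-of-image values stay valid.
        frac = min(max(value / extent, 0.0), 1.0)
        # The right edge (frac == 1.0) falls into the last bin.
        index = min(int(frac * num_bins), num_bins - 1)
        tokens.append(f"<loc_{index}>")
    return tokens


def tokens_to_box(tokens: List[str], image_size: Tuple[int, int],
                  num_bins: int = NUM_BINS) -> Box:
    """Map location tokens back to pixel coordinates at the bin centers."""
    width, height = image_size
    extents = (width, height, width, height)
    coords = []
    for token, extent in zip(tokens, extents):
        index = int(token[len("<loc_"):-1])
        coords.append((index + 0.5) / num_bins * extent)
    return (coords[0], coords[1], coords[2], coords[3])


if __name__ == "__main__":
    tokens = box_to_tokens((32, 48, 160, 240), image_size=(320, 480))
    print(tokens)
    print(tokens_to_box(tokens, image_size=(320, 480)))
```

Once boxes are expressed this way, they can share one output vocabulary with text tokens, which is the property the abstract relies on. The round trip is lossy by at most one bin width per coordinate.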

Benchmarks

Benchmark                            Methodology      Metrics
object-categorization-on-grit        Unified-IO XL    Categorization (ablation): 61.7; Categorization (test): 60.8
object-localization-on-grit          Unified-IO XL    Localization (ablation): 67.0; Localization (test): 67.1
visual-question-answering-on-grit    Unified-IO XL    VQA (ablation): 74.5; VQA (test): 74.5
