Command Palette
Search for a command to run...
Supervised Multimodal Bitransformers for Classifying Images and Text
Douwe Kiela Suvrat Bhooshan Hamed Firooz Ethan Perez Davide Testuggine

Abstract
Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks. The modern digital world is increasingly multimodal, however, and textual information is often accompanied by other modalities such as images. We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks, outperforming strong baselines, including on hard test sets specifically designed to measure multimodal performance.
Code Repositories
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| natural-language-inference-on-v-snli | MMBT | Accuracy: 90.5 |
Build AI with AI
From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.