GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang

Abstract
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoders/decoders) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost performance. Without bells and whistles, GIT establishes new state-of-the-art results on 12 challenging benchmarks by a large margin. For instance, our model surpasses human performance on TextCaps for the first time (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. Code is released at \url{https://github.com/microsoft/GenerativeImage2Text}.
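The design described in the abstract is simple enough to sketch directly. Below is a minimal, illustrative PyTorch sketch (not the paper's exact configuration): a single image encoder, a single text decoder operating on the concatenated image and text tokens with a seq2seq-style attention mask (text attends to all image tokens and to preceding text only), and one next-token language-modeling loss. The toy patch embedder, all dimensions, and the random token IDs are assumptions for illustration; in the paper the image encoder is a pre-trained backbone, and generation-based classification and scene text recognition reuse the same loss by treating the label string as the caption.

```python
# Minimal sketch of the GIT design: ONE image encoder + ONE text decoder,
# trained with a single language-modeling (next-token) loss.
import torch
import torch.nn as nn


class GITSketch(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, n_heads=4, n_layers=2,
                 image_size=224, patch_size=16, max_text_len=40):
        super().__init__()
        self.n_patches = (image_size // patch_size) ** 2
        # Image encoder: a toy ViT-style patch embedder standing in for the
        # paper's pre-trained image backbone (an assumption for brevity).
        self.patch_embed = nn.Conv2d(3, d_model, patch_size, stride=patch_size)
        self.img_pos = nn.Parameter(torch.zeros(1, self.n_patches, d_model))
        # Text decoder: one transformer over [image tokens; text tokens].
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.txt_pos = nn.Parameter(torch.zeros(1, max_text_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, input_ids):
        B, T = input_ids.shape
        img = self.patch_embed(images).flatten(2).transpose(1, 2) + self.img_pos
        txt = self.tok_embed(input_ids) + self.txt_pos[:, :T]
        x = torch.cat([img, txt], dim=1)
        # seq2seq attention mask (True = blocked): image tokens attend to each
        # other; text tokens attend to all image tokens and earlier text only.
        L = self.n_patches + T
        mask = torch.zeros(L, L, dtype=torch.bool)
        mask[self.n_patches:, self.n_patches:] = torch.triu(
            torch.ones(T, T, dtype=torch.bool), diagonal=1)
        mask[:self.n_patches, self.n_patches:] = True  # image cannot see text
        x = self.decoder(x, mask=mask)
        return self.lm_head(x[:, self.n_patches:])  # logits at text positions


# Training objective: plain next-token cross-entropy on the caption tokens.
model = GITSketch()
images = torch.randn(2, 3, 224, 224)
input_ids = torch.randint(0, 30522, (2, 12))  # stand-in tokenized captions
logits = model(images, input_ids)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),
    input_ids[:, 1:].reshape(-1))
loss.backward()
```

Because fine-tuning, pre-training, and inference all reduce to this one captioning-style objective, no task-specific heads or external OCR/detector modules are needed.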
Code Repositories
microsoft/GenerativeImage2Text (official): https://github.com/microsoft/GenerativeImage2Text
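For quick experimentation, GIT checkpoints are also available through the Hugging Face transformers port; the sketch below assumes the public "microsoft/git-base-coco" checkpoint (the checkpoint name and the Pillow dependency are assumptions, not part of the paper).

```python
# Usage sketch: image captioning with the Hugging Face port of GIT.
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

image = Image.open("example.jpg")  # any RGB image on disk
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```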
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| image-captioning-on-coco-captions | GIT | BLEU-4: 44.1 CIDEr: 151.1 METEOR: 32.2 SPICE: 26.3 |
| image-captioning-on-nocaps-entire | GIT, Single Model | B1: 88.1 B2: 74.81 B3: 57.68 B4: 37.35 CIDEr: 123.39 METEOR: 32.5 ROUGE-L: 63.12 SPICE: 15.94 |
| image-captioning-on-nocaps-in-domain | GIT2, Single Model | B1: 88.86 B2: 75.86 B3: 59.94 B4: 41.1 CIDEr: 124.18 METEOR: 33.83 ROUGE-L: 63.82 SPICE: 16.36 |
| image-captioning-on-nocaps-in-domain | GIT, Single Model | B1: 88.55 B2: 76.1 B3: 60.53 B4: 41.65 CIDEr: 122.4 METEOR: 33.41 ROUGE-L: 64.02 SPICE: 16.18 |
| image-captioning-on-nocaps-near-domain | GIT2, Single Model | B1: 88.9 B2: 75.86 B3: 58.9 B4: 38.95 CIDEr: 125.51 METEOR: 32.95 ROUGE-L: 63.66 SPICE: 16.11 |
| image-captioning-on-nocaps-near-domain | GIT, Single Model | B1: 88.56 B2: 75.48 B3: 58.46 B4: 38.44 CIDEr: 123.92 METEOR: 32.86 ROUGE-L: 63.5 SPICE: 15.96 |
| image-captioning-on-nocaps-out-of-domain | GIT2, Single Model | B1: 86.28 B2: 71.15 B3: 52.36 B4: 30.15 CIDEr: 122.27 METEOR: 30.15 ROUGE-L: 60.91 SPICE: 15.62 |
| image-captioning-on-nocaps-out-of-domain | GIT, Single Model | B1: 85.99 B2: 71.28 B3: 52.66 B4: 30.04 CIDEr: 122.04 METEOR: 30.45 ROUGE-L: 60.96 SPICE: 15.7 |
| image-captioning-on-nocaps-xd-entire | GIT | B1: 88.1 B2: 74.81 B3: 57.68 B4: 37.35 CIDEr: 123.39 METEOR: 32.5 ROUGE-L: 63.12 SPICE: 15.94 |
| image-captioning-on-nocaps-xd-entire | GIT2 | B1: 88.43 B2: 75.02 B3: 57.87 B4: 37.65 CIDEr: 124.77 METEOR: 32.56 ROUGE-L: 63.19 SPICE: 16.06 |
| image-captioning-on-nocaps-xd-in-domain | GIT2 | B1: 88.86 B2: 75.86 B3: 59.94 B4: 41.1 CIDEr: 124.18 METEOR: 33.83 ROUGE-L: 63.82 SPICE: 16.36 |
| image-captioning-on-nocaps-xd-in-domain | GIT | B1: 88.55 B2: 76.1 B3: 60.53 B4: 41.65 CIDEr: 122.4 METEOR: 33.41 ROUGE-L: 64.02 SPICE: 16.18 |
| image-captioning-on-nocaps-xd-near-domain | GIT2 | B1: 88.9 B2: 75.86 B3: 58.9 B4: 38.95 CIDEr: 125.51 METEOR: 32.95 ROUGE-L: 63.66 SPICE: 16.11 |
| image-captioning-on-nocaps-xd-near-domain | GIT | B1: 88.56 B2: 75.48 B3: 58.46 B4: 38.44 CIDEr: 123.92 METEOR: 32.86 ROUGE-L: 63.5 SPICE: 15.96 |
| image-captioning-on-nocaps-xd-out-of-domain | GIT2 | B1: 86.28 B2: 71.15 B3: 52.36 B4: 30.15 CIDEr: 122.27 METEOR: 30.15 ROUGE-L: 60.91 SPICE: 15.62 |
| image-captioning-on-nocaps-xd-out-of-domain | GIT | B1: 85.99 B2: 71.28 B3: 52.66 B4: 30.04 CIDEr: 122.04 METEOR: 30.45 ROUGE-L: 60.96 SPICE: 15.7 |
| video-captioning-on-msr-vtt-1 | GIT2 | BLEU-4: 54.8 CIDEr: 75.9 GS: 201.6 METEOR: 33.1 ROUGE-L: 68.2 |
| visual-question-answering-on-msvd-qa-1 | GIT | Accuracy: 0.568 |