iBOT: Image BERT Pre-Training with Online Tokenizer

Jinghao Zhou; Chen Wei; Huiyu Wang; Wei Shen; Cihang Xie; Alan Yuille; Tao Kong

Abstract

The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework, iBOT, that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pre-trained beforehand. We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which help the models obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.
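The two self-distillation terms described in the abstract can be illustrated with a minimal NumPy sketch. It is not the official bytedance/ibot implementation: the function names, the temperature values, the momentum value, and the exponential-moving-average teacher update are assumptions made for illustration, and details such as multi-view augmentation and teacher-output centering are left out. The teacher's softened output over a shared vocabulary of size K plays the role of the online token distribution, and the student is trained to match it on the class token and on the masked patch positions only.

```python
import numpy as np

def softmax(x, temp):
    # Temperature-scaled softmax over the last axis (numerically stable).
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(x, temp):
    # Temperature-scaled log-softmax over the last axis.
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distill_loss(student_logits, teacher_logits,
                 student_temp=0.1, teacher_temp=0.04):
    # Cross-entropy between the teacher distribution (the "token")
    # and the student prediction. Temperatures are assumed values.
    p_teacher = softmax(teacher_logits, teacher_temp)
    log_p_student = log_softmax(student_logits, student_temp)
    return -(p_teacher * log_p_student).sum(axis=-1)

def ibot_style_loss(s_cls, t_cls, s_patch, t_patch, mask):
    # s_cls, t_cls: (K,) class-token logits from student and teacher.
    # s_patch, t_patch: (N, K) patch-token logits; the student is assumed
    # to have seen the masked image, the teacher the unmasked one.
    # mask: (N,) boolean, True where the student's patch was masked.
    cls_term = distill_loss(s_cls, t_cls)
    per_patch = distill_loss(s_patch, t_patch)
    mim_term = (per_patch * mask).sum() / max(mask.sum(), 1)
    return cls_term + mim_term

def ema_update(teacher_w, student_w, momentum=0.996):
    # Assumed teacher update: exponential moving average of student weights.
    return momentum * teacher_w + (1.0 - momentum) * student_w

rng = np.random.default_rng(0)
K, N = 8, 6  # toy vocabulary size and number of patches
s_cls, t_cls = rng.normal(size=K), rng.normal(size=K)
s_patch, t_patch = rng.normal(size=(N, K)), rng.normal(size=(N, K))
mask = np.array([True, False, True, False, False, True])
total = ibot_style_loss(s_cls, t_cls, s_patch, t_patch, mask)
```

Because the teacher produces the targets on the fly, no separately pre-trained tokenizer appears anywhere in the sketch, which is the point the abstract makes about avoiding a multi-stage pipeline.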

Code Repositories

bytedance/ibot

Official

pytorch

Mentioned in GitHub

https://gitlab.com/birder/birder

pytorch

Benchmarks

Benchmark	Methodology	Metrics
instance-segmentation-on-coco	iBOT (ViT-B/16)	mask AP: 44.2
instance-segmentation-on-coco	iBOT (ViT-S/16)	mask AP: 42.6
object-detection-on-coco	iBOT (ViT-B/16)	box mAP: 51.2
object-detection-on-coco	iBOT (ViT-S/16)	box mAP: 49.4
self-supervised-image-classification-on	iBOT (ViT-L/16) (IN22k)	Number of Params: 307M; Top 1 Accuracy: 82.3%
self-supervised-image-classification-on	iBOT (ViT-L/16)	Number of Params: 307M; Top 1 Accuracy: 81.3%
self-supervised-image-classification-on-1	iBOT (ViT-L/16)	Number of Params: 307M; Top 1 Accuracy: 84.8%
self-supervised-image-classification-on-1	iBOT (ViT-L/16)	Number of Params: 307M; Top 1 Accuracy: 86.6%
self-supervised-image-classification-on-1	iBOT (ViT-B/16)	Number of Params: 85M; Top 1 Accuracy: 84.0%
self-supervised-image-classification-on-1	iBOT (ViT-L/16, 512)	Number of Params: 307M; Top 1 Accuracy: 87.8%
self-supervised-image-classification-on-1	iBOT (ViT-B/16)	Number of Params: 85M; Top 1 Accuracy: 84.4%
semantic-segmentation-on-ade20k	iBOT (ViT-S/16)	Validation mIoU: 45.4
semantic-segmentation-on-ade20k	iBOT (ViT-B/16) (linear head)	Validation mIoU: 38.3
semantic-segmentation-on-ade20k	iBOT (ViT-B/16)	Validation mIoU: 50.0
semi-supervised-image-classification-on-1	iBOT (ViT-S/16)	Top 1 Accuracy: 61.9%
unsupervised-image-classification-on-imagenet	iBOT (ViT-S/16)	ARI: 32.8; Accuracy (%): 43.4
