InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Abstract
Compared with the rapid progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still at an early stage. This work presents a new large-scale CNN-based foundation model, termed InternImage, which, like ViTs, can benefit from increasing parameters and training data. Unlike recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also performs adaptive spatial aggregation conditioned on the input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns from massive data with large-scale parameters, as ViTs do. The effectiveness of our model is demonstrated on challenging benchmarks including ImageNet, COCO, and ADE20K. Notably, InternImage-H achieves a new record of 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs. The code will be released at https://github.com/OpenGVLab/InternImage.
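To make the idea of "adaptive spatial aggregation conditioned on the input" concrete, the following is a minimal sketch of a deformable-convolution block whose sampling offsets and per-point modulation weights are predicted from the feature map itself. It is not the paper's DCNv3 operator; it uses torchvision's DCNv2-style `deform_conv2d` as a stand-in, and the module name `DeformableConvBlock` and its hyperparameters are illustrative assumptions.

```python
# Illustrative sketch only: input-conditioned (adaptive) spatial aggregation
# via a deformable convolution, approximated with torchvision's deform_conv2d.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d


class DeformableConvBlock(nn.Module):
    """3x3 deformable convolution; sampling offsets and modulation scalars
    are predicted from the input, so aggregation adapts per location."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        k = kernel_size
        self.k = k
        self.weight = nn.Parameter(torch.empty(channels, channels, k, k))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        # 2 offset values (x, y) per sampling point, 1 modulation scalar per point.
        self.offset_pred = nn.Conv2d(channels, 2 * k * k, 3, padding=1)
        self.mask_pred = nn.Conv2d(channels, k * k, 3, padding=1)
        # Start from the regular (undeformed) sampling grid.
        nn.init.zeros_(self.offset_pred.weight)
        nn.init.zeros_(self.offset_pred.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset_pred(x)             # where each point samples
        mask = torch.sigmoid(self.mask_pred(x))  # how much each sample contributes
        return deform_conv2d(
            x, offset, self.weight,
            padding=self.k // 2, mask=mask,
        )


if __name__ == "__main__":
    block = DeformableConvBlock(channels=64)
    y = block(torch.randn(1, 64, 56, 56))
    print(y.shape)  # torch.Size([1, 64, 56, 56])
```

Because the offsets are learned rather than fixed to a regular grid, the receptive field can grow and reshape per pixel, which is the property the abstract contrasts with CNNs built on large dense kernels. The actual DCNv3 operator in InternImage further adds grouped aggregation and softmax-normalized modulation; see the official repository for the released implementation.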
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| 2d-object-detection-on-bdd100k-val | InternImage-H | mAP: 38.8 |
| image-classification-on-imagenet | InternImage-S | GFLOPs: 8, Params (M): 50, Top-1 Accuracy: 84.2% |
| image-classification-on-imagenet | InternImage-B | GFLOPs: 16, Params (M): 97, Top-1 Accuracy: 84.9% |
| image-classification-on-imagenet | InternImage-DCNv3-G (M3I Pre-training) | Params (M): 3000, Top-1 Accuracy: 90.1% |
| image-classification-on-imagenet | InternImage-T | GFLOPs: 5, Params (M): 30, Top-1 Accuracy: 83.5% |
| image-classification-on-imagenet | InternImage-L | GFLOPs: 108, Params (M): 223, Top-1 Accuracy: 87.7% |
| image-classification-on-imagenet | InternImage-H | GFLOPs: 1478, Params (M): 1080, Top-1 Accuracy: 89.6% |
| image-classification-on-imagenet | InternImage-XL | GFLOPs: 163, Params (M): 335, Top-1 Accuracy: 88.0% |
| image-classification-on-inaturalist-2018 | InternImage-H | Top-1 Accuracy: 92.6% |
| image-classification-on-places205 | InternImage-H | Top-1 Accuracy: 71.7% |
| image-classification-on-places365 | InternImage-H (CNN) | Top-1 Accuracy: 61.2% |
| instance-segmentation-on-coco | InternImage-H | AP50: 80.8, AP75: 62.2, APL: 70.3, APM: 58.9, APS: 41.0 |
| instance-segmentation-on-coco-minival | InternImage-S | GFLOPs: 340, Params (M): 69, box AP: 49.7, mask AP: 44.5 |
| instance-segmentation-on-coco-minival | InternImage-T | GFLOPs: 270, Params (M): 49, box AP: 49.1, mask AP: 43.7 |
| instance-segmentation-on-coco-minival | InternImage-XL | GFLOPs: 1782, Params (M): 387, mask AP: 48.8 |
| instance-segmentation-on-coco-minival | InternImage-H | AP50: 80.1, AP75: 61.5, APL: 74.4, APM: 58.4, APS: 37.9, mask AP: 55.4 |
| instance-segmentation-on-coco-minival | InternImage-B | GFLOPs: 501, Params (M): 115 |
| instance-segmentation-on-coco-minival | InternImage-L | GFLOPs: 1399, Params (M): 277, box AP: 56.1, mask AP: 48.5 |
| object-detection-on-coco | InternImage-XL | Params (M): 602, box mAP: 64.3 |
| object-detection-on-coco | InternImage-H (M3I Pre-training) | Params (M): 2180 |
| object-detection-on-coco-minival | InternImage-H | box AP: 65.0 |
| object-detection-on-coco-minival | InternImage-XL | box AP: 64.2 |
| object-detection-on-coco-o | InternImage-L (Cascade Mask R-CNN) | Average mAP: 37.0, Effective Robustness: 11.72 |
| object-detection-on-crowdhuman-full-body | InternImage-H | AP: 97.2 |
| object-detection-on-lvis-v1-0-minival | InternImage-H | box AP: 65.8 |
| object-detection-on-lvis-v1-0-val | InternImage-H | box AP: 63.2 |
| object-detection-on-openimages-v6 | InternImage-H | box AP: 74.1 |
| object-detection-on-pascal-voc-2012 | InternImage-H | mAP: 97.2 |
| semantic-segmentation-on-ade20k | InternImage-L | GFLOPs: 2526, Params (M): 256, Validation mIoU: 54.1 |
| semantic-segmentation-on-ade20k | InternImage-H | GFLOPs: 4635, Params (M): 1310, Validation mIoU: 62.9 |
| semantic-segmentation-on-ade20k | InternImage-XL | GFLOPs: 3142, Params (M): 368, Validation mIoU: 55.3 |
| semantic-segmentation-on-ade20k | InternImage-S | GFLOPs: 1017, Params (M): 80, Validation mIoU: 50.9 |
| semantic-segmentation-on-ade20k | InternImage-H (M3I Pre-training) | Params (M): 1310 |
| semantic-segmentation-on-ade20k | InternImage-B | GFLOPs: 1185, Params (M): 128, Validation mIoU: 51.3 |
| semantic-segmentation-on-ade20k | InternImage-T | GFLOPs: 944, Params (M): 59, Validation mIoU: 48.1 |
| semantic-segmentation-on-cityscapes | InternImage-H | Mean IoU (class): 86.1% |
| semantic-segmentation-on-cityscapes-val | InternImage-H | mIoU: 87.0 |
| semantic-segmentation-on-cityscapes-val | InternImage-XL | mIoU: 86.4 |
| semantic-segmentation-on-pascal-context | InternImage-H | mIoU: 70.3 |
| semantic-segmentation-on-replica | InternImage | mIoU: 38.4 |