InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Abstract

Compared with the rapid progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still at an early stage. This work presents a new large-scale CNN-based foundation model, termed InternImage, which, like ViTs, can gain from increasing parameters and training data. Different from recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also performs adaptive spatial aggregation conditioned on input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns from massive data with large-scale parameters, as ViTs do. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, and ADE20K. Notably, InternImage-H achieved a new record of 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs. The code will be released at https://github.com/OpenGVLab/InternImage.

Code Repositories

opengvlab/internimage (official, PyTorch, mentioned in GitHub)
OpenGVLab/M3I-Pretraining (mentioned in GitHub)
chenller/mmseg-extension (PyTorch, mentioned in GitHub)

Benchmarks

2D Object Detection on BDD100K (val)
- InternImage-H: mAP 38.8

Image Classification on ImageNet
- InternImage-T: 5 GFLOPs, 30M params, Top-1 accuracy 83.5%
- InternImage-S: 8 GFLOPs, 50M params, Top-1 accuracy 84.2%
- InternImage-B: 16 GFLOPs, 97M params, Top-1 accuracy 84.9%
- InternImage-L: 108 GFLOPs, 223M params, Top-1 accuracy 87.7%
- InternImage-XL: 163 GFLOPs, 335M params, Top-1 accuracy 88.0%
- InternImage-H: 1478 GFLOPs, 1080M params, Top-1 accuracy 89.6%
- InternImage-DCNv3-G (M3I Pre-training): 3000M params, Top-1 accuracy 90.1%

Image Classification on iNaturalist 2018
- InternImage-H: Top-1 accuracy 92.6%

Image Classification on Places205
- InternImage-H: Top-1 accuracy 71.7%

Image Classification on Places365
- InternImage-H (CNN): Top-1 accuracy 61.2%

Instance Segmentation on COCO
- InternImage-H: AP50 80.8, AP75 62.2, APS 41.0, APM 58.9, APL 70.3

Instance Segmentation on COCO minival
- InternImage-T: 270 GFLOPs, 49M params, box AP 49.1, mask AP 43.7
- InternImage-S: 340 GFLOPs, 69M params, box AP 49.7, mask AP 44.5
- InternImage-B: 501 GFLOPs, 115M params
- InternImage-L: 1399 GFLOPs, 277M params, box AP 56.1, mask AP 48.5
- InternImage-XL: 1782 GFLOPs, 387M params, mask AP 48.8
- InternImage-H: mask AP 55.4, AP50 80.1, AP75 61.5, APS 37.9, APM 58.4, APL 74.4

Object Detection on COCO
- InternImage-XL: 602M params, box mAP 64.3
- InternImage-H (M3I Pre-training): 2180M params

Object Detection on COCO minival
- InternImage-H: box AP 65.0
- InternImage-XL: box AP 64.2

Object Detection on COCO-O
- InternImage-L (Cascade Mask R-CNN): average mAP 37.0, effective robustness 11.72

Object Detection on CrowdHuman (full body)
- InternImage-H: AP 97.2

Object Detection on LVIS v1.0 minival
- InternImage-H: box AP 65.8

Object Detection on LVIS v1.0 val
- InternImage-H: box AP 63.2

Object Detection on OpenImages V6
- InternImage-H: box AP 74.1

Object Detection on PASCAL VOC 2012
- InternImage-H: mAP 97.2

Semantic Segmentation on ADE20K
- InternImage-T: 944 GFLOPs, 59M params, validation mIoU 48.1
- InternImage-S: 1017 GFLOPs, 80M params, validation mIoU 50.9
- InternImage-B: 1185 GFLOPs, 128M params, validation mIoU 51.3
- InternImage-L: 2526 GFLOPs, 256M params, validation mIoU 54.1
- InternImage-XL: 3142 GFLOPs, 368M params, validation mIoU 55.3
- InternImage-H: 4635 GFLOPs, 1310M params, validation mIoU 62.9
- InternImage-H (M3I Pre-training): 1310M params

Semantic Segmentation on Cityscapes
- InternImage-H: mean IoU (class) 86.1%

Semantic Segmentation on Cityscapes (val)
- InternImage-H: mIoU 87.0
- InternImage-XL: mIoU 86.4

Semantic Segmentation on PASCAL Context
- InternImage-H: mIoU 70.3

Semantic Segmentation on Replica
- InternImage: mIoU 38.4
