DINOv2: Learning Robust Visual Features without Supervision

Abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
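
The distilled model family described above is released through the official facebookresearch/dinov2 repository (listed below). As a minimal sketch, assuming the torch.hub entry points documented in that repository, a frozen backbone can be loaded and used to extract all-purpose features without any finetuning; the input tensor here is a random stand-in for a normalized RGB image:

```python
# Minimal sketch: load a distilled DINOv2 backbone via torch.hub and extract
# a frozen image embedding. Entry-point names (dinov2_vits14, dinov2_vitb14,
# dinov2_vitl14, dinov2_vitg14) follow the facebookresearch/dinov2 README.
import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

# Stand-in for a preprocessed image; side lengths must be multiples of the
# patch size (14), e.g. 224 = 16 x 14.
image = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    features = backbone(image)  # CLS embedding, shape (1, 384) for ViT-S/14

print(features.shape)  # torch.Size([1, 384])
```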

Code Repositories

roboflow/rf-detr (PyTorch, mentioned in GitHub)
beneroth13/dinov2 (PyTorch, mentioned in GitHub)
mohammedsb/dinov2formedical (PyTorch, mentioned in GitHub)
bespontaneous/proteus-pytorch (PyTorch, mentioned in GitHub)
buyeah1109/finc (PyTorch, mentioned in GitHub)
gorkaydemir/DINOSAUR (PyTorch, mentioned in GitHub)
marrlab/dinobloom (PyTorch, mentioned in GitHub)
fabio-sim/Depth-Anything-ONNX (PyTorch, mentioned in GitHub)
buyeah1109/KEN (PyTorch, mentioned in GitHub)
zhu-xlab/softcon (PyTorch, mentioned in GitHub)
huggingface/transformers (PyTorch, mentioned in GitHub; see the loading sketch after this list)
facebookresearch/dinov2 (official, PyTorch, mentioned in GitHub)
JHKim-snu/PGA (PyTorch, mentioned in GitHub)
open-edge-platform/geti (PyTorch, mentioned in GitHub)
seatizendoi/dinovdeau (PyTorch, mentioned in GitHub)
facebookresearch/highrescanopyheight (PyTorch, mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
depth-estimation-on-nyu-depth-v2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | RMS: 0.279
domain-generalization-on-imagenet-c | DINOv2 (ViT-S/14, frozen model, linear eval) | Number of params: 21M; mean Corruption Error (mCE): 54.4
domain-generalization-on-imagenet-c | DINOv2 (ViT-g/14, frozen model, linear eval) | Number of params: 1100M; mean Corruption Error (mCE): 28.2
domain-generalization-on-imagenet-c | DINOv2 (ViT-B/14, frozen model, linear eval) | Number of params: 85M; mean Corruption Error (mCE): 42.7
domain-generalization-on-imagenet-c | DINOv2 (ViT-L/14, frozen model, linear eval) | Number of params: 307M; mean Corruption Error (mCE): 31.5
fine-grained-image-classification-on-oxford-1 | DINOv2 (ViT-g/14, frozen model, linear eval) | Accuracy: 96.7
image-classification-on-cifar-10 | DINOv2 (ViT-g/14, frozen model, linear eval) | Percentage correct: 99.5
image-retrieval-on-amstertime | DINOv2 distilled (ViT-S/14 frozen) | mAP: 43.5
image-retrieval-on-amstertime | DINOv2 (ViT-g/14 frozen) | mAP: 46.7
image-retrieval-on-amstertime | DINOv2 distilled (ViT-B/14 frozen) | mAP: 45.6
image-retrieval-on-amstertime | DINOv2 distilled (ViT-L/14 frozen) | mAP: 50.0
monocular-depth-estimation-on-kitti-eigen | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Delta < 1.25: 0.968; Delta < 1.25^2: 0.997; Delta < 1.25^3: 0.9993; RMSE: 2.1128; RMSE log: 0.0882; Sq Rel: 0.1797; absolute relative error: 0.0652
monocular-depth-estimation-on-nyu-depth-v2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Delta < 1.25: 0.9497; Delta < 1.25^2: 0.996; Delta < 1.25^3: 0.9994; RMSE: 0.279; absolute relative error: 0.0907; log10: 0.0371
self-supervised-image-classification-on | DINOv2 distilled (ViT-S/14) | Number of Params: 21M; Top 1 Accuracy: 81.1%
self-supervised-image-classification-on | DINOv2 distilled (ViT-B/14) | Number of Params: 85M; Top 1 Accuracy: 84.5%
self-supervised-image-classification-on | DINOv2 (ViT-g/14 @448) | Number of Params: 1100M; Top 1 Accuracy: 86.7%
self-supervised-image-classification-on | DINOv2 distilled (ViT-L/14) | Number of Params: 307M; Top 1 Accuracy: 86.3%
self-supervised-image-classification-on | DINOv2 (ViT-g/14) | Number of Params: 1100M; Top 1 Accuracy: 86.5%
self-supervised-image-classification-on-1 | DINOv2 (ViT-g/14, 448) | Number of Params: 1100M; Top 1 Accuracy: 88.9%
self-supervised-image-classification-on-1 | DINOv2 (ViT-g/14) | Number of Params: 1100M; Top 1 Accuracy: 88.5%
semantic-segmentation-on-ade20k | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) | Params (M): 1080; Validation mIoU: 60.2
visual-place-recognition-on-17-places | DINOv2 | Recall@1: 61.82
visual-place-recognition-on-baidu-mall | DINOv2 | Recall@1: 49.21
visual-place-recognition-on-gardens-point | DINOv2 | Recall@1: 71.50
visual-place-recognition-on-hawkins | DINOv2 | Recall@1: 27.97
visual-place-recognition-on-laurel-caverns | DINOv2 | Recall@1: 40.18
visual-place-recognition-on-mid-atlantic | DINOv2 | Recall@1: 24.75
visual-place-recognition-on-nardo-air | DINOv2 | Recall@1: 73.24
visual-place-recognition-on-nardo-air-r | DINOv2 | Recall@1: 71.83
visual-place-recognition-on-oxford-robotcar-4 | DINOv2 | Recall@1: 39.79
visual-place-recognition-on-pittsburgh-30k | DINOv2 | Recall@1: 78.32
visual-place-recognition-on-st-lucia | DINOv2 | Recall@1: 78.62
visual-place-recognition-on-vp-air | DINOv2 | Recall@1: 45.23
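
Many rows above follow the "frozen model, linear eval" protocol: the pretrained backbone is kept fixed and only a linear classifier is trained on its output embeddings. Below is a minimal sketch of that protocol, with a hypothetical dataset replaced by random stand-in tensors and the 384-dimensional ViT-S/14 embedding assumed:

```python
# Illustrative sketch of linear evaluation on frozen DINOv2 features:
# backbone weights stay fixed, only the linear head receives gradients.
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
for p in backbone.parameters():
    p.requires_grad = False  # keep the pretrained features frozen
backbone.eval()

num_classes = 1000  # e.g., ImageNet-1k
linear_head = nn.Linear(384, num_classes)  # 384 = ViT-S/14 embedding dim
optimizer = torch.optim.SGD(linear_head.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimization step on the linear head only."""
    with torch.no_grad():
        feats = backbone(images)  # frozen CLS embeddings
    logits = linear_head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example step with random stand-in data:
loss = train_step(torch.randn(8, 3, 224, 224), torch.randint(0, num_classes, (8,)))
print(f"loss: {loss:.3f}")
```

Because the backbone never changes, the per-image embeddings can also be precomputed once and reused across epochs, which keeps this protocol cheap even for the 1100M-parameter ViT-g/14.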
