
Abstract
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.
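To make the recipe concrete, below is a minimal sketch of zero-shot text-conditioned inference with OWL-ViT, using the Hugging Face `transformers` port of the model. The checkpoint name `google/owlvit-base-patch32`, the input file `example.jpg`, and the post-processing helper are specifics of that port and of this example, not of the paper's original JAX release.

```python
# Zero-shot text-conditioned detection with OWL-ViT: a minimal sketch,
# assuming the Hugging Face `transformers` port of the model.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
texts = [["a photo of a cat", "a photo of a dog"]]  # free-form class queries

# Encode the image and the text queries together; the text embeddings act
# as per-class classification weights for the predicted boxes.
inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes to thresholded detections in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.1
)

for score, label, box in zip(
    results[0]["scores"], results[0]["labels"], results[0]["boxes"]
):
    print(f"{texts[0][label]}: {score:.2f} at {box.tolist()}")
```

The same checkpoints also support the one-shot image-conditioned mode benchmarked below, in which the embedding of a query image crop replaces the text embedding as the classification target; in the same port this is exposed via the model's `image_guided_detection` method.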
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| described-object-detection-on-description | OWL-ViT-base | Intra-scenario ABS mAP: 8.8; Intra-scenario FULL mAP: 8.6; Intra-scenario PRES mAP: 8.5 |
| one-shot-object-detection-on-coco | OWL-ViT (R50+H/32) | AP 0.5: 41.8 |
| open-vocabulary-object-detection-on-lvis-v1-0 | OWL-ViT (CLIP-L/14) | AP novel (LVIS base training): 25.6; AP novel (unrestricted open-vocabulary training): 31.2 |