Simple Open-Vocabulary Object Detection with Vision Transformers

Abstract

Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.
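The core of the text-conditioned detection described above is simple: each predicted box carries an image embedding, and class logits are obtained by comparing it against text-query embeddings in the shared contrastive embedding space. The sketch below illustrates that idea in NumPy; the function and variable names are illustrative assumptions, not taken from the released code.

```python
import numpy as np

def normalize(x, axis=-1):
    """L2-normalize embeddings along the feature axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def open_vocab_logits(box_embeds, query_embeds, logit_scale=1.0):
    """Score each predicted box against each free-text query.

    box_embeds:   (num_boxes, dim) per-box image embeddings
    query_embeds: (num_queries, dim) text embeddings of the queries
    Returns (num_boxes, num_queries) cosine-similarity logits, so the
    set of 'classes' is just whatever text queries are supplied.
    """
    b = normalize(box_embeds)
    q = normalize(query_embeds)
    return logit_scale * (b @ q.T)

# Toy example: 4 boxes scored against 3 text queries.
rng = np.random.default_rng(0)
box_embeds = rng.normal(size=(4, 8))
query_embeds = rng.normal(size=(3, 8))
logits = open_vocab_logits(box_embeds, query_embeds)
print(logits.shape)  # (4, 3)
```

Because the queries are embedded text rather than a fixed label set, swapping in new category names at inference time requires no retraining, which is what makes the detector open-vocabulary.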

Benchmarks

described-object-detection-on-description (OWL-ViT-base)
  Intra-scenario ABS mAP: 8.8
  Intra-scenario FULL mAP: 8.6
  Intra-scenario PRES mAP: 8.5

one-shot-object-detection-on-coco (OWL-ViT, R50+H/32)
  AP 0.5: 41.8

open-vocabulary-object-detection-on-lvis-v1-0 (OWL-ViT, CLIP-L/14)
  AP novel (LVIS base training): 25.6
  AP novel (unrestricted open-vocabulary training): 31.2
