
Abstract
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.
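To make the recipe concrete, below is a minimal sketch of zero-shot text-conditioned inference with OWL-ViT, using the Hugging Face `transformers` port of the model. The checkpoint name `google/owlvit-base-patch32`, the input file `example.jpg`, and the post-processing helper are specifics of that port and of this example, not of the paper's original JAX release.

```python
# Zero-shot text-conditioned detection with OWL-ViT: a minimal sketch,
# assuming the Hugging Face `transformers` port of the model.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
texts = [["a photo of a cat", "a photo of a dog"]]  # free-form class queries

# Encode the image and the text queries together; the text embeddings act
# as per-class classification weights for the predicted boxes.
inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes to thresholded detections in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.1
)

for score, label, box in zip(
    results[0]["scores"], results[0]["labels"], results[0]["boxes"]
):
    print(f"{texts[0][label]}: {score:.2f} at {box.tolist()}")
```

The same checkpoints also support the one-shot image-conditioned mode benchmarked below, in which the embedding of a query image crop replaces the text embedding as the classification target; in the same port this is exposed via the model's `image_guided_detection` method.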
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| described-object-detection-on-description | OWL-ViT-base | Intra-scenario ABS mAP: 8.8; Intra-scenario FULL mAP: 8.6; Intra-scenario PRES mAP: 8.5 |
| one-shot-object-detection-on-coco | OWL-ViT (R50+H/32) | AP 0.5: 41.8 |
| open-vocabulary-object-detection-on-lvis-v1-0 | OWL-ViT (CLIP-L/14) | AP novel (LVIS base training): 25.6; AP novel (unrestricted open-vocabulary training): 31.2 |