DiffusionVID: Denoising Object Boxes with Spatio-temporal Conditioning for Video Object Detection

Ki-Seok Chung, Si-Dong Roh

Abstract

Many still-image object detectors degrade on video because of frame-level deterioration such as motion blur, camera defocus, and partial occlusion. We present DiffusionVID, a diffusion model-based video object detector that exploits spatio-temporal conditioning. Inspired by the diffusion model, DiffusionVID refines random noise boxes to recover the original object boxes in a video sequence. To refine boxes effectively from degraded video frames, we use three novel approaches: cascade refinement, dynamic core-set conditioning, and local batch refinement. The cascade refinement architecture collects information from object regions, whereas dynamic core-set conditioning further improves denoising quality through adaptive conditional guidance based on the spatio-temporal core-set. Local batch refinement significantly improves refinement speed by exploiting GPU parallelism. On the widely used ImageNet-VID benchmark, DiffusionVID with the ResNet-101 and Swin-Base backbones achieves 86.9 mAP at 46.6 FPS and 92.4 mAP at 27.0 FPS, respectively, which is state-of-the-art performance. To the best of the authors’ knowledge, this is the first video object detector based on a diffusion model. The code and models are available at https://github.com/sdroh1027/DiffusionVID.
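The idea of refining random noise boxes into object boxes can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a standard DDPM-style forward process, a cosine noise schedule, and a deterministic DDIM update, none of which are specified in the abstract. The names `cosine_alpha_bar`, `add_noise`, `ddim_step`, `refine_box`, and the `predict_box` callback (a stand-in for the detector head conditioned on video features) are hypothetical.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level at step t (cosine schedule, an assumption here)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def add_noise(box, t, num_steps, rng):
    """Forward process: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps, per coordinate."""
    a = cosine_alpha_bar(t, num_steps)
    return [math.sqrt(a) * v + math.sqrt(1 - a) * rng.gauss(0.0, 1.0) for v in box]


def ddim_step(noisy_box, pred_box, t, t_next, num_steps):
    """Deterministic DDIM update from step t to t_next, given a predicted clean box."""
    a_t = cosine_alpha_bar(t, num_steps)
    a_next = cosine_alpha_bar(t_next, num_steps)
    out = []
    for x_t, x0 in zip(noisy_box, pred_box):
        # Noise implied by the current sample and the predicted clean value.
        eps = (x_t - math.sqrt(a_t) * x0) / math.sqrt(1 - a_t)
        out.append(math.sqrt(a_next) * x0 + math.sqrt(1 - a_next) * eps)
    return out


def refine_box(predict_box, num_steps, schedule, rng):
    """Start from pure Gaussian noise and refine along a decreasing step schedule.

    predict_box(box, t) is a hypothetical stand-in for the network that predicts
    the clean box from the noisy one; schedule must end at step 0.
    """
    box = [rng.gauss(0.0, 1.0) for _ in range(4)]  # (cx, cy, w, h), assumed layout
    for t, t_next in zip(schedule, schedule[1:]):
        pred = predict_box(box, t)
        box = ddim_step(box, pred, t, t_next, num_steps)
    return box
```

With an oracle `predict_box` that always returns the ground-truth box, `refine_box` recovers that box after the final step, because the signal level at step 0 is 1 and the noise term vanishes. A real detector would replace the oracle with a learned predictor, which is where the conditioning described in the abstract would enter.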

Benchmarks

Benchmark | Methodology | Metrics
video-object-detection-on-imagenet-vid | DiffusionVID (ResNet-101) | mAP: 87.1
video-object-detection-on-imagenet-vid | DiffusionVID (Swin-B) | mAP: 92.5
