
Abstract

Many existing still-image object detectors suffer from image deterioration in videos, such as motion blur, camera defocus, and partial occlusion. We present DiffusionVID, a diffusion model-based video object detector that exploits spatio-temporal conditioning. Inspired by diffusion models, DiffusionVID refines random noise boxes to recover the original object boxes in a video sequence. To refine boxes effectively from the degraded images in videos, we introduce three novel approaches: cascade refinement, dynamic core-set conditioning, and local batch refinement. The cascade refinement architecture effectively collects information from object regions, whereas dynamic core-set conditioning further improves the denoising quality using adaptive conditional guidance based on the spatio-temporal core-set. Local batch refinement significantly improves the refinement speed by exploiting GPU parallelism. On the standard and widely used ImageNet-VID benchmark, DiffusionVID achieves 86.9 mAP at 46.6 FPS with a ResNet-101 backbone and 92.4 mAP at 27.0 FPS with a Swin-Base backbone, both state-of-the-art results. To the best of our knowledge, this is the first video object detector based on a diffusion model. The code and models are available at https://github.com/sdroh1027/DiffusionVID.
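
The idea of refining random noise boxes toward object boxes can be illustrated with a toy sketch. This is not the DiffusionVID network or its API: `toy_refine_step` is a hypothetical stand-in for a learned refinement head, and it reads the target boxes directly only so that the example is self-contained. A real detector has no access to targets at inference and predicts each update from image features. The box format (cx, cy, w, h), the step count, and the `strength` parameter are likewise assumptions made for illustration.

```python
import random


def random_noise_boxes(num_boxes, seed=0):
    """Sample (cx, cy, w, h) boxes from Gaussian noise, clamped to [0, 1]."""
    rng = random.Random(seed)
    return [
        [min(1.0, max(0.0, rng.gauss(0.5, 0.25))) for _ in range(4)]
        for _ in range(num_boxes)
    ]


def l1(a, b):
    """L1 distance between two boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def toy_refine_step(boxes, targets, strength=0.5):
    """Hypothetical stand-in for a learned refinement head.

    Moves each box a fraction `strength` of the way toward its nearest
    target box. The targets act as an oracle here, for illustration only.
    """
    refined = []
    for box in boxes:
        nearest = min(targets, key=lambda t: l1(box, t))
        refined.append([b + strength * (t - b) for b, t in zip(box, nearest)])
    return refined


def refine(boxes, targets, num_steps=4):
    """Apply the refinement step repeatedly, feeding each output to the next."""
    for _ in range(num_steps):
        boxes = toy_refine_step(boxes, targets)
    return boxes


def mean_error(boxes, targets):
    """Mean L1 distance from each box to its nearest target."""
    return sum(min(l1(box, t) for t in targets) for box in boxes) / len(boxes)


# Two made-up target boxes and eight random starting boxes.
targets = [[0.3, 0.4, 0.2, 0.3], [0.7, 0.6, 0.25, 0.2]]
noise = random_noise_boxes(8)
refined = refine(noise, targets)

print(f"error before: {mean_error(noise, targets):.4f}")
print(f"error after:  {mean_error(refined, targets):.4f}")
```

Each step halves the distance from a box to its nearest target, so four steps reduce the mean error to at most one sixteenth of its starting value. The sketch only conveys the staged, coarse-to-fine character of the refinement; it says nothing about how the actual model conditions on spatio-temporal features or batches its refinement on the GPU.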
