
Abstract

We study the use of deep features extracted from a pretrained Vision Transformer (ViT) as dense visual descriptors. We observe and empirically demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties, including: (i) the features encode powerful, well-localized semantic information, at high spatial granularity, such as object parts; (ii) the encoded semantic information is shared across related, yet different object categories, and (iii) positional bias changes gradually throughout the layers. These properties allow us to design simple methods for a variety of applications, including co-segmentation, part co-segmentation and semantic correspondences. To distill the power of ViT features from convoluted design choices, we restrict ourselves to lightweight zero-shot methodologies (e.g., binning and clustering) applied directly to the features. Since our methods require no additional training nor data, they are readily applicable across a variety of domains. We show by extensive qualitative and quantitative evaluation that our simple methodologies achieve competitive results with recent state-of-the-art supervised methods, and outperform previous unsupervised methods by a large margin. Code is available at dino-vit-features.github.io.
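
To make the idea of clustering applied directly to the features concrete, the following is a minimal sketch and not the authors' implementation. It assumes per-patch descriptors have already been extracted for each image; random synthetic descriptors stand in here for DINO-ViT features. All function names (`kmeans`, `cocluster_patches`, `fake_image`) are hypothetical. Descriptors from all images are pooled, L2-normalized, and clustered jointly, so a cluster label carries the same meaning in every image.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means on the rows of x; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid: (n, k)
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(0)
    # Final assignment against the converged centroids
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return dist.argmin(1), centroids


def cocluster_patches(descs, grid_shapes, k=2, seed=0):
    """Jointly cluster patch descriptors from several images.

    descs: list of (h*w, d) arrays, one per image (assumed precomputed).
    grid_shapes: list of (h, w) patch-grid shapes, one per image.
    Returns one (h, w) label map per image, with labels shared across images.
    """
    stacked = np.concatenate(descs, axis=0)
    stacked = stacked / np.linalg.norm(stacked, axis=1, keepdims=True)
    labels, _ = kmeans(stacked, k, seed=seed)
    maps, start = [], 0
    for (h, w), d in zip(grid_shapes, descs):
        maps.append(labels[start:start + len(d)].reshape(h, w))
        start += len(d)
    return maps


# Synthetic stand-in for ViT descriptors: a central "object" region whose
# patches share one prototype vector, on a background with another.
rng = np.random.default_rng(0)
dim = 16
proto_fg, proto_bg = np.eye(dim)[0], np.eye(dim)[1]


def fake_image(h, w):
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 4: 3 * h // 4, w // 4: 3 * w // 4] = True
    feats = np.where(mask.reshape(-1, 1), proto_fg, proto_bg)
    feats = feats + 0.02 * rng.standard_normal((h * w, dim))
    return feats, mask


feats_a, mask_a = fake_image(8, 8)
feats_b, mask_b = fake_image(6, 10)
maps = cocluster_patches([feats_a, feats_b], [(8, 8), (6, 10)], k=2)
```

With real descriptors, `feats_a` and `feats_b` would be replaced by the per-patch features of two images, and the choice of `k` and of which clusters count as foreground would need a separate criterion that this sketch does not cover.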
