Nikolas Adaloglou, Felix Michels, Hamza Kalisch, Markus Kollmann

Abstract
We present a general methodology that learns to classify images without labels by leveraging pretrained feature extractors. Our approach involves self-distillation training of clustering heads based on the fact that nearest neighbours in the pretrained feature space are likely to share the same label. We propose a novel objective that learns associations between image features by introducing a variant of pointwise mutual information together with instance weighting. We demonstrate that the proposed objective is able to attenuate the effect of false positive pairs while efficiently exploiting the structure in the pretrained feature space. As a result, we improve the clustering accuracy over k-means on 17 different pretrained models by 6.1% and 12.2% on ImageNet and CIFAR100, respectively. Finally, using self-supervised vision transformers, we achieve a clustering accuracy of 61.6% on ImageNet. The code is available at https://github.com/HHU-MMBS/TEMI-official-BMVC2023.
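The abstract's starting point, that nearest neighbours in a pretrained feature space tend to share a label, can be illustrated with a small sketch. This is not the authors' implementation: the function name `mine_nearest_neighbours`, the use of cosine similarity, the value of `k`, and the toy two-dimensional "features" are all assumptions made for illustration.

```python
import numpy as np


def mine_nearest_neighbours(features: np.ndarray, k: int) -> np.ndarray:
    """Return an (n, k) array holding the indices of each row's k nearest neighbours.

    Hypothetical helper: similarity is cosine similarity, which is an assumption.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    # Exclude each sample from its own neighbour list.
    np.fill_diagonal(sim, -np.inf)
    # Sort by descending similarity and keep the top-k indices.
    return np.argsort(-sim, axis=1)[:, :k]


# Toy stand-in for pretrained features: two well-separated groups of three points.
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(loc=(5.0, 0.0), scale=0.1, size=(3, 2)),
    rng.normal(loc=(0.0, 5.0), scale=0.1, size=(3, 2)),
])
neighbours = mine_nearest_neighbours(toy, k=2)
```

In this toy setup every mined neighbour comes from the same group as its query point. With real features some mined pairs would cross class boundaries; these are the false positive pairs that the abstract says the proposed objective attenuates.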
Benchmarks
| Benchmark | Method | Backbone | Accuracy | NMI | ARI | Train split |
|---|---|---|---|---|---|---|
| image-clustering-on-cifar-10 | TEMI DINO ViT-B | ViT-B | 0.945 | 0.886 | 0.885 | Train |
| image-clustering-on-cifar-10 | TEMI CLIP ViT-L (openai) | ViT-L | 0.969 | 0.926 | 0.932 | Train |
| image-clustering-on-cifar-100 | TEMI DINO ViT-B | ViT-B | 0.671 | 0.769 | 0.533 | Train |
| image-clustering-on-cifar-100 | TEMI CLIP ViT-L (openai) | ViT-L | 0.737 | 0.799 | 0.612 | Train |
| image-clustering-on-imagenet | TEMI DINO (ViT-B) | ViT-B | 58.0% | 81.4% | 45.9% | - |
| image-clustering-on-imagenet | TEMI MSN (ViT-L) | ViT-L | 61.6% | 82.5% | 48.4% | - |
| image-clustering-on-imagenet-100 | TEMI CLIP ViT-L (openai) | ViT-L | 0.8343 | 0.9006 | 0.7581 | - |
| image-clustering-on-imagenet-100 | TEMI MSN ViT-L | ViT-L | 0.8286 | 0.8853 | 0.7408 | - |
| image-clustering-on-imagenet-100 | TEMI DINO ViT-B | ViT-B | 0.7505 | 0.8565 | 0.6545 | - |
| image-clustering-on-imagenet-200 | TEMI CLIP ViT-L (openai) | ViT-L | - | - | - | - |
| image-clustering-on-imagenet-200 | TEMI DINO ViT-B | ViT-B | - | - | - | - |
| image-clustering-on-imagenet-200 | TEMI MSN ViT-L | ViT-L | - | - | - | - |
| image-clustering-on-imagenet-50-1 | TEMI DINO ViT-B | ViT-B | 0.801 | 0.8610 | 0.7093 | - |
| image-clustering-on-imagenet-50-1 | TEMI CLIP ViT-L (openai) | ViT-L | 0.8827 | 0.9232 | 0.8272 | - |
| image-clustering-on-imagenet-50-1 | TEMI MSN ViT-L | ViT-L | 0.8487 | 0.8814 | 0.7646 | - |
| image-clustering-on-stl-10 | TEMI DINO ViT-B | ViT-B | 0.985 | 0.965 | 0.968 | Train |
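Cluster indices carry no inherent class meaning, so clustering accuracy is commonly computed after finding the one-to-one mapping between clusters and ground-truth classes that maximises the number of agreeing samples. The sketch below follows that common convention using the Hungarian algorithm from SciPy. It is not necessarily the evaluation code behind the numbers above, and the function name `clustering_accuracy` and the toy labels are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def clustering_accuracy(y_true, y_pred) -> float:
    """Accuracy under the best one-to-one mapping of cluster ids to class ids.

    Hypothetical helper; assumes labels are non-negative integers.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = int(max(y_true.max(), y_pred.max())) + 1
    # counts[c, t] = number of samples assigned to cluster c with true class t.
    counts = np.zeros((n, n), dtype=np.int64)
    np.add.at(counts, (y_pred, y_true), 1)
    # Hungarian matching that maximises the total number of agreeing samples.
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return float(counts[rows, cols].sum() / y_true.size)


# Toy example: cluster ids are permuted relative to class ids, with one mistake.
y_true = [0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [2, 2, 2, 0, 0, 1, 1, 1]
acc = clustering_accuracy(y_true, y_pred)  # 7 of 8 samples agree: 0.875
```

The other two metrics in the table are also invariant to how clusters are numbered, so they need no matching step. scikit-learn provides them as `normalized_mutual_info_score` and `adjusted_rand_score` in `sklearn.metrics`.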