4 months ago

COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

Sanghwan Kim; Rui Xiao; Mariana-Iuliana Georgescu; Stephan Alaniz; Zeynep Akata

Abstract

Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of the contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks. Code is available at https://github.com/ExplainableML/cosmos.
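The two ingredients named in the abstract, global/local text views and a self-distillation objective, can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the function names (`text_views`, `distillation_loss`), the sentence-level cropping rule, and the temperature values are hypothetical and are not taken from the COSMOS code; the official implementation is in the repository linked above. The loss shown is the generic teacher-to-student cross-entropy used in self-distillation methods, where the logits could be, for example, similarities between one view's embedding and a batch of embeddings from the other modality.

```python
import math
import random


def text_views(caption, rng, n_local=2):
    """Hypothetical text cropping: the global view keeps every sentence,
    each local view is a single randomly sampled sentence."""
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    global_view = ". ".join(sentences) + "."
    local_views = [rng.choice(sentences) + "." for _ in range(n_local)]
    return global_view, local_views


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def softmax(logits, temperature=1.0):
    """Softmax computed from the stable log-softmax."""
    return [math.exp(v) for v in log_softmax(logits, temperature)]


def distillation_loss(teacher_logits, student_logits,
                      t_teacher=0.04, t_student=0.1):
    """Cross-entropy H(p_teacher, p_student). The teacher distribution is
    a fixed target; temperatures are assumed values, not the paper's."""
    p_teacher = softmax(teacher_logits, t_teacher)
    log_p_student = log_softmax(student_logits, t_student)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))


if __name__ == "__main__":
    rng = random.Random(0)
    caption = "A dog runs. The sky is blue. Grass is green."
    global_view, local_views = text_views(caption, rng)
    print(global_view, local_views)

    # A student that agrees with the teacher gets a lower loss
    # than one that ranks the candidates in the opposite order.
    teacher = [2.0, 0.5, -1.0]
    print(distillation_loss(teacher, [2.0, 0.5, -1.0]))
    print(distillation_loss(teacher, [-1.0, 0.5, 2.0]))
```

In a cross-modal setup of this kind, the teacher would typically score a global view of one modality while the student scores a local view of the other, so that minimizing the loss pulls local text (or image) representations toward the global image (or text) representation.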

Code Repositories

ExplainableML/cosmos (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| unsupervised-semantic-segmentation-with-10 | COSMOS ViT-B/16 | mIoU: 31.3 |
| unsupervised-semantic-segmentation-with-3 | COSMOS ViT-B/16 | mIoU: 34.7 |
| unsupervised-semantic-segmentation-with-4 | COSMOS ViT-B/16 | Mean IoU (val): 17.7 |
| unsupervised-semantic-segmentation-with-7 | COSMOS ViT-B/16 | mIoU: 77.7 |
| unsupervised-semantic-segmentation-with-8 | COSMOS ViT-B/16 | mIoU: 33.7 |
| unsupervised-semantic-segmentation-with-9 | COSMOS ViT-B/16 | mIoU: 23.2 |
| zero-shot-cross-modal-retrieval-on-coco-2014 | COSMOS ViT-B/32 | Image-to-text R@1: 64.3, R@5: 86.5, R@10: 92.0; Text-to-image R@1: 48.4, R@5: 74.2, R@10: 82.6 |
| zero-shot-cross-modal-retrieval-on-coco-2014 | COSMOS ViT-B/16 | Image-to-text R@1: 68.0, R@5: 87.8, R@10: 92.5; Text-to-image R@1: 52.5, R@5: 77.2, R@10: 84.9 |
| zero-shot-cross-modal-retrieval-on-flickr30k | COSMOS ViT-B/32 | Image-to-text R@1: 89.9, R@5: 98.8, R@10: 99.3; Text-to-image R@1: 76.1, R@5: 92.8, R@10: 96.2 |
| zero-shot-cross-modal-retrieval-on-flickr30k | COSMOS ViT-B/16 | Image-to-text R@1: 92.9, R@5: 99.4, R@10: 99.9; Text-to-image R@1: 80.3, R@5: 95.3, R@10: 97.6 |
| zero-shot-segmentation-on-ade20k-training | COSMOS ViT-B/16 | mIoU: 17.7 |
