TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification

Qinying Liu; Wei Wu; Kecheng Zheng; Zhan Tong; Jiawei Liu; Yu Liu; Wei Chen; Zilei Wang; Yujun Shen

Abstract

The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing attempts usually suffer from coarse alignment; for example, the vision encoder struggles to localize an attribute-specified object. In this work, we propose an embarrassingly simple approach that better aligns image and text features without requiring any data format beyond image-text pairs. Concretely, given an image and its paired text, we parse from the description the objects (e.g., cat) and attributes (e.g., black) that are highly likely to exist in the image. Notably, the parsing pipeline is fully automatic and therefore scales well. With these parsed semantics as supervision signals, we complement the commonly used image-text contrastive loss with a multi-tag classification loss. Extensive experiments on a broad suite of semantic segmentation datasets show an average 5.2% improvement of our framework over existing alternatives. Furthermore, visualization results indicate that attribute supervision enables vision-language models to accurately localize attribute-specified objects. The project page can be found at https://qinying-liu.github.io/Tag-Align.
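
The training objective described in the abstract (contrastive loss plus multi-tag classification loss) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the tag vocabulary `TAG_VOCAB`, the keyword-matching `parse_tags` (a stand-in for the paper's automatic parsing pipeline), the temperature value, and the weighting parameter `tag_weight` are all assumptions made for illustration.

```python
import math
import re

# Hypothetical tag vocabulary mixing objects and attributes (an assumption).
TAG_VOCAB = ["cat", "dog", "sofa", "black", "white", "red"]


def parse_tags(caption, vocab=TAG_VOCAB):
    """Multi-hot target: 1.0 where a vocabulary tag occurs in the caption."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return [1.0 if tag in words else 0.0 for tag in vocab]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def multi_tag_loss(logits, targets):
    """Mean binary cross-entropy: one independent yes/no decision per tag."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return total / len(logits)


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE on a square image-text similarity matrix.

    Matching image-text pairs are assumed to lie on the diagonal.
    """
    n = len(sim)

    def cross_entropy_on_diagonal(rows):
        loss = 0.0
        for i, row in enumerate(rows):
            scaled = [s / temperature for s in row]
            m = max(scaled)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
            loss += log_z - scaled[i]
        return loss / n

    columns = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_on_diagonal(sim) + cross_entropy_on_diagonal(columns))


def total_loss(sim, tag_logits, captions, tag_weight=1.0):
    """Contrastive loss complemented by the multi-tag classification loss."""
    targets = [parse_tags(c) for c in captions]
    cls = sum(multi_tag_loss(l, t) for l, t in zip(tag_logits, targets)) / len(captions)
    return contrastive_loss(sim) + tag_weight * cls
```

In this sketch, `sim` would come from image and text embeddings and `tag_logits` from scoring image features against each tag; both are supplied directly here so the example stays self-contained.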

Code Repositories

Qinying-Liu/TagAlign (official implementation, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
| open-vocabulary-semantic-segmentation-on-1 | TagAlign (trained with image-text pairs) | mIoU: 37.6 |
| open-vocabulary-semantic-segmentation-on-5 | TagAlign (trained with image-text pairs) | mIoU: 87.9 |
| unsupervised-semantic-segmentation-with-10 | TagAlign | mIoU: 33.3 |
| unsupervised-semantic-segmentation-with-11 | TagAlign | mIoU: 53.9 |
| unsupervised-semantic-segmentation-with-3 | TagAlign | mIoU: 27.5 |
| unsupervised-semantic-segmentation-with-4 | TagAlign | Mean IoU (val): 17.3 |
| unsupervised-semantic-segmentation-with-7 | TagAlign | mIoU: 87.9 |
| unsupervised-semantic-segmentation-with-8 | TagAlign | mIoU: 37.6 |
| unsupervised-semantic-segmentation-with-9 | TagAlign | mIoU: 25.3 |
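
The mIoU (mean intersection over union) metric reported for these benchmarks can be sketched as follows. This is a simplified illustration on flat label lists; the function name `mean_iou` and the toy labels are assumptions, and actual benchmark protocols (for example, how ignored pixels or absent classes are handled, and whether counts are accumulated over the whole dataset) may differ.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes present in the prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither map
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU: {100 * score:.1f}")  # per-class IoUs are 1/3, 2/3, 1/2
```

Scores are conventionally reported as percentages, as in the benchmark list.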
