M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
Qingpei Guo; Furong Xu; Hanxiao Zhang; Wang Ren; Ziping Ma; Lin Ju; Jian Wang; Jingdong Chen; Ming Yang

Abstract
Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, VLMs supporting multiple languages, e.g., both Chinese and English, have lagged due to the relative scarcity of large-scale pretraining datasets. To this end, we introduce a comprehensive bilingual (Chinese-English) dataset, BM-6B, with over 6 billion image-text pairs, aimed at enhancing multimodal foundation models so that they understand images well in both languages. To handle a dataset of this scale, we propose a novel grouped aggregation approach for image-text contrastive loss computation, which significantly reduces communication overhead and GPU memory demands, yielding a 60% increase in training speed. We pretrain a series of bilingual image-text foundation models with enhanced fine-grained understanding on BM-6B; the resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest $M^2$-Encoder-10B model achieves top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification setting, surpassing previously reported SoTA methods by 2.2% and 21.1%, respectively. The $M^2$-Encoder series represents one of the most comprehensive bilingual image-text foundation models to date, so we are making it available to the research community for further exploration and development.
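The abstract does not spell out the grouped aggregation mechanics. As a toy sketch of the general idea behind memory-efficient contrastive loss computation, the snippet below evaluates an InfoNCE-style loss while accumulating the softmax denominator group by group, so the full row of logits never has to be materialized at once; the function names, the group size, and the plain-Python formulation are illustrative assumptions, not the paper's actual implementation.

```python
import math


def contrastive_loss_full(sim):
    """Standard InfoNCE over a full N x N image-text similarity matrix:
    row i should score its matching pair sim[i][i] highest."""
    n = len(sim)
    loss = 0.0
    for i in range(n):
        denom = sum(math.exp(s) for s in sim[i])
        loss += -math.log(math.exp(sim[i][i]) / denom)
    return loss / n


def contrastive_loss_grouped(sim, group_size):
    """Same loss, but the log-sum-exp denominator is accumulated over
    column groups (illustrative stand-in for per-device logit shards)."""
    n = len(sim)
    loss = 0.0
    for i in range(n):
        m = max(sim[i])  # stabilizer; in practice a running max per group
        denom = 0.0
        for start in range(0, n, group_size):
            group = sim[i][start:start + group_size]
            denom += sum(math.exp(s - m) for s in group)
        # -log softmax of the positive pair, reassembled from group sums
        loss += -(sim[i][i] - m - math.log(denom))
    return loss / n
```

Because only partial sums cross group boundaries, a distributed version would exchange one scalar accumulator per row rather than the full logit matrix, which is the kind of communication saving the abstract alludes to.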
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| zero-shot-cross-modal-retrieval-on-coco-2014 | $M^2$-Encoder | Image-to-text R@1: 72.8, R@5: 92.3, R@10: 96.3; Text-to-image R@1: 56.5, R@5: 81.6, R@10: 88.8 |
| zero-shot-cross-modal-retrieval-on-flickr30k | $M^2$-Encoder | Image-to-text R@1: 91.2, R@5: 99.2, R@10: 99.6; Text-to-image R@1: 92.2, R@5: 99.5, R@10: 99.7 |
| zero-shot-learning-on-imagenet-cn | $M^2$-Encoder | Accuracy: 80.7 |
| zero-shot-transfer-image-classification-on-1 | $M^2$-Encoder | Accuracy (Private): 88.5; Params: 10B |