mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

Abstract
Large-scale pretrained foundation models have become an emerging paradigm for building artificial intelligence (AI) systems, as they can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from low computational efficiency and information asymmetry caused by the long visual sequence in cross-modal alignment. To address these problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections, which create inter-layer shortcuts that skip a certain number of the time-consuming full self-attention layers on the vision side. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability when directly transferred to multiple video-language tasks.
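A minimal PyTorch sketch of the cross-modal skip-connection idea described above: several asymmetric co-attention layers update only the short text sequence (the long visual sequence is skipped, avoiding full self-attention over it), after which a connected attention layer fuses both modalities jointly. Layer structure, dimensions, and the number of skipped layers here are illustrative assumptions, not the paper's exact specification.

```python
import torch
import torch.nn as nn

class AsymmetricCoAttnLayer(nn.Module):
    """Text-side layer: text tokens self-attend and cross-attend to visual tokens.
    The visual sequence is NOT updated here, skipping the expensive full
    self-attention over the long visual sequence (assumed design)."""
    def __init__(self, dim, heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text, vision):
        t = self.n1(text)
        text = text + self.self_attn(t, t, t)[0]
        text = text + self.cross_attn(self.n2(text), vision, vision)[0]
        text = text + self.ffn(self.n3(text))
        return text

class ConnectedAttnLayer(nn.Module):
    """Fusion layer: concatenate visual and text tokens and run joint
    self-attention, so visual features are refreshed through the skip path."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text, vision):
        x = torch.cat([vision, text], dim=1)
        h = self.n1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.ffn(self.n2(x))
        n_v = vision.size(1)
        return x[:, n_v:], x[:, :n_v]  # updated text, updated vision

class SkipConnectedFusion(nn.Module):
    """Each block: `skip_layers` asymmetric co-attention layers (vision skipped),
    then one connected attention layer over both modalities. Block count and
    skip ratio are hypothetical placeholders."""
    def __init__(self, dim=768, heads=12, num_blocks=2, skip_layers=3):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleList([
                nn.ModuleList([AsymmetricCoAttnLayer(dim, heads) for _ in range(skip_layers)]),
                ConnectedAttnLayer(dim, heads),
            ]) for _ in range(num_blocks)
        ])

    def forward(self, text, vision):
        for co_layers, connected in self.blocks:
            for layer in co_layers:
                text = layer(text, vision)
            text, vision = connected(text, vision)
        return text, vision

# Usage: 30 text tokens vs. a much longer visual patch sequence.
fusion = SkipConnectedFusion()
text = torch.randn(2, 30, 768)
vision = torch.randn(2, 577, 768)
text_out, vision_out = fusion(text, vision)
```

The point of the sketch is the asymmetry: per-layer cost scales with the short text length except in the occasional connected layer, which is where the long visual sequence gets refreshed.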
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| image-captioning-on-coco-captions | mPLUG | BLEU-4: 46.5, CIDEr: 155.1, METEOR: 32.0, SPICE: 26.0 |
| visual-question-answering-on-vqa-v2-test-dev | mPLUG (Huge) | Accuracy: 82.43 |
| visual-question-answering-on-vqa-v2-test-std | mPLUG-Huge | overall: 83.62, yes/no: 94.83, number: 69.82, other: 77.02 |