Image Captioning on COCO Captions
Evaluation Metrics: BLEU-1, BLEU-4, CIDEr, METEOR, ROUGE-L, SPICE
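For reference, captions on this benchmark are usually scored with the official coco-caption toolkit. The sketch below uses the pycocoevalcap Python port of that toolkit and is only illustrative: the image ID and captions are invented placeholders, and METEOR and SPICE are omitted because they additionally require a Java runtime.

```python
# Minimal COCO-style caption scoring sketch, assuming the pycocoevalcap
# package is installed (pip install pycocoevalcap). The image ID and
# captions are placeholders, not data from this leaderboard.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Both dicts map image_id -> list of pre-tokenized, lower-cased captions.
# (The official pipeline runs a PTB tokenizer first; METEOR and SPICE
# need Java and are left out of this sketch.)
gts = {1: ["a man rides a horse on the beach",
           "a person riding a horse along the shore"]}   # human references
res = {1: ["a man riding a horse on a beach"]}            # one model caption per image

bleu, _ = Bleu(4).compute_score(gts, res)    # returns [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
cider, _ = Cider().compute_score(gts, res)   # CIDEr; normally computed over the whole test set
rouge, _ = Rouge().compute_score(gts, res)   # ROUGE-L

print(f"BLEU-1={bleu[0]:.3f}  BLEU-4={bleu[3]:.3f}  "
      f"CIDEr={cider:.3f}  ROUGE-L={rouge:.3f}")
```

Note that CIDEr is a corpus-level statistic: scoring a single image as above is fine for a smoke test, but reported leaderboard numbers are computed over the full COCO test split.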
Evaluation Results
Performance of the various models on this benchmark:
| Model | BLEU-1 | BLEU-4 | CIDEr | METEOR | ROUGE-L | SPICE | Paper Title |
|---|---|---|---|---|---|---|---|
| mPLUG | - | 46.5 | 155.1 | 32.0 | - | 26.0 | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections |
| OFA | - | 44.9 | 154.9 | 32.5 | - | 26.6 | OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework |
| VALOR | - | - | 152.5 | - | - | 25.7 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| GIT | - | 44.1 | 151.1 | 32.2 | - | 26.3 | GIT: A Generative Image-to-text Transformer for Vision and Language |
| VAST | - | - | 149.0 | - | - | 27.0 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | - | 43.7 | 145.8 | - | - | - | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| LEMON | - | 42.6 | 145.5 | 31.4 | - | 25.5 | Scaling Up Vision-Language Pre-training for Image Captioning |
| BLIP-2 ViT-G OPT 6.7B (zero-shot) | - | 43.5 | 145.2 | - | - | - | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| BLIP-2 ViT-G FlanT5 XL (zero-shot) | - | 42.4 | 144.5 | - | - | - | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| GRIT (No VL pretraining - base) | 84.2 | 42.4 | 144.2 | 30.6 | 60.7 | 24.3 | GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features |
| ExpansionNet v2 (No VL pretraining) | 83.5 | 42.7 | 143.7 | 30.6 | 61.1 | 24.7 | Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning |
| CoCa | - | 40.9 | 143.6 | 33.9 | - | 24.7 | CoCa: Contrastive Captioners are Image-Text Foundation Models |
| SimVLM | - | 40.6 | 143.3 | 33.4 | - | 25.4 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision |
| Xmodal-Ctx + OSCAR | - | 41.3 | 142.2 | - | - | 24.9 | Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning |
| Prompt Tuning | - | 41.81 | 141.4 | 31.51 | - | 24.42 | Prompt Tuning for Generative Multimodal Pretrained Models |
| VinVL | - | 41.0 | 140.9 | 31.1 | - | 25.2 | VinVL: Revisiting Visual Representations in Vision-Language Models |
| X-VLM (base) | - | 41.3 | 140.8 | - | - | - | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts |
| Oscar | - | 41.7 | 140.0 | 30.6 | - | 24.5 | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks |
| Xmodal-Ctx | 83.4 | 41.4 | 139.9 | 30.4 | 60.4 | 24.0 | Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning |
| Prismer | - | 40.4 | 136.5 | 31.4 | - | 24.4 | Prismer: A Vision-Language Model with Multi-Task Experts |
The table above shows the first 20 of 40 entries on this leaderboard.
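Several of the pretrained models listed above can be run directly for zero-shot captioning. The sketch below uses the BLIP-2 ViT-G OPT 2.7B model through the Hugging Face transformers implementation; the checkpoint name "Salesforce/blip2-opt-2.7b" and the example image URL are assumptions for illustration, not details taken from this leaderboard.

```python
# Minimal zero-shot captioning sketch with BLIP-2 via Hugging Face transformers.
# Assumes: transformers >= 4.27, the "Salesforce/blip2-opt-2.7b" checkpoint,
# and an arbitrary example image URL (both are illustrative assumptions).
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Any RGB image works; this URL is only a placeholder example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Preprocess the image, generate a caption, and decode it to text.
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```

Captions produced this way can then be fed into the scoring sketch shown earlier to reproduce leaderboard-style BLEU/CIDEr numbers on the COCO test split.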