
Video Question Answering on ActivityNet-QA

Evaluation Metrics

Accuracy
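
Accuracy here is the percentage of test questions for which a model's predicted answer matches the ground-truth answer. Below is a minimal sketch of the metric, assuming predictions and references are plain answer strings; the `normalize` and `accuracy` helpers are illustrative, and the official evaluation (as well as the GPT-assisted matching used by several LLM-based entries) may normalize or score answers differently.

```python
# Minimal sketch of top-1 accuracy for open-ended VideoQA benchmarks
# such as ActivityNet-QA. Assumes `predictions` and `ground_truths`
# are parallel lists of answer strings; the official protocol may
# apply different answer normalization.

def normalize(answer: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing periods."""
    return answer.strip().strip(".").lower()

def accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    """Percentage of questions whose predicted answer exactly matches the ground truth."""
    assert len(predictions) == len(ground_truths)
    correct = sum(
        normalize(p) == normalize(g)
        for p, g in zip(predictions, ground_truths)
    )
    return 100.0 * correct / len(ground_truths)

# Example: 2 of 3 answers match -> 66.7
print(round(accuracy(["a dog", "Blue", "two"], ["dog", "blue", "two"]), 1))
```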

Evaluation Results

Results of each model on this benchmark

| Model | Accuracy (%) | Paper Title |
| --- | --- | --- |
| LocVLM-Vid-B+ | 38.2 | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs |
| E-MN | 27.1 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering |
| VindLU | 44.7 | VindLU: A Recipe for Effective Video-and-Language Pretraining |
| Video-LLaVA | 45.3 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection |
| E-VQA | 25.1 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering |
| VALOR | 48.6 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| E-SA | 31.8 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering |
| BT-Adapter (zero-shot) | 46.1 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning |
| Mirasol3B | 51.13 | Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities |
| Chat-UniVi-13B | 46.4 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding |
| MA-LMM | 49.8 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding |
| FrozenBiLM+ | 44.8 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models |
| MovieChat | 45.7 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding |
| Video-ChatGPT | 35.2 | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models |
| LLaMA-VID-7B (2 Token) | 47.4 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
| VAST | 50.4 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| TESTA (ViT-B/16) | 45 | TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding |
| GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | 61.2 | Composing Ensembles of Pre-trained Models via Iterative Consensus |
| LocVLM-Vid-B | 37.4 | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs |
| VideoCoCa | 56.1 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners |