HyperAI超神经

Video Question Answering On Situated

Evaluation Metrics

Average Accuracy

Evaluation Results

Performance of each model on this benchmark

| Model | Average Accuracy | Paper Title |
| --- | --- | --- |
| MIST | 51.13 | MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering |
| TraveLER (0-shot) | 44.9 | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering |
| SHG-VQA (trained from scratch) | 39.47 | Learning Situation Hyper-Graphs for Video Question Answering |
| Flamingo-9B (4-shot) | 42.8 | Flamingo: a Visual Language Model for Few-Shot Learning |
| SeViLA | 64.9 | Self-Chained Image-Language Model for Video Localization and Question Answering |
| All-in-one | 47.5 | All in One: Exploring Unified Video-Language Pre-training |
| GF(sup) | 53.94 | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering |
| VLAP (4 frames) | 67.1 | ViLA: Efficient Video-Language Alignment for Video Question Answering |
| SeViLA (0-shot) | 44.6 | Self-Chained Image-Language Model for Video Localization and Question Answering |
| Flamingo-80B (0-shot) | 39.7 | Flamingo: a Visual Language Model for Few-Shot Learning |
| LLaMA-VQA | 65.4 | Large Language Models are Temporal and Causal Reasoners for Video Question Answering |
| InternVideo | 58.7 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| Flamingo-9B (0-shot) | 41.8 | Flamingo: a Visual Language Model for Few-Shot Learning |
| Temp[ATP] | 48.37 | Revisiting the "Video" in Video-Language Understanding |
| AnyMAL-70B (0-shot) | 48.2 | AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model |
| Flamingo-80B (4-shot) | 42.4 | Flamingo: a Visual Language Model for Few-Shot Learning |
| GF(uns) | 53.86 | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering |