Video Question Answering On Situated

Metrics

Average Accuracy

Results

Performance results of various models on this benchmark

Model	Average Accuracy	Paper Title
VLAP (4 frames)	67.1	ViLA: Efficient Video-Language Alignment for Video Question Answering
LLaMA-VQA	65.4	Large Language Models are Temporal and Causal Reasoners for Video Question Answering
SeViLA	64.9	Self-Chained Image-Language Model for Video Localization and Question Answering
InternVideo	58.7	InternVideo: General Video Foundation Models via Generative and Discriminative Learning
GF(sup)	53.94	Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
GF(uns)	53.86	Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
MIST	51.13	MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering
Temp[ATP]	48.37	Revisiting the "Video" in Video-Language Understanding
AnyMAL-70B (0-shot)	48.2	AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
All-in-one	47.5	All in One: Exploring Unified Video-Language Pre-training
TraveLER (0-shot)	44.9	TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering
SeViLA (0-shot)	44.6	Self-Chained Image-Language Model for Video Localization and Question Answering
Flamingo-9B (4-shot)	42.8	Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo-80B (4-shot)	42.4	Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo-9B (0-shot)	41.8	Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo-80B (0-shot)	39.7	Flamingo: a Visual Language Model for Few-Shot Learning
SHG-VQA (trained from scratch)	39.47	Learning Situation Hyper-Graphs for Video Question Answering
