Video Question Answering on AGQA 2.0 Balanced
Evaluation Metrics
Average Accuracy
Evaluation Results
Performance of each model on this benchmark
Model Name | Average Accuracy | Paper Title | Repository |
---|---|---|---|
MIST - AIO | 50.96 | MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | |
GF (uns) - S3D | 53.33 | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | |
MIST - CLIP | 54.39 | MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | |
SHG-VQA (trained from scratch) | 49.2 | Learning Situation Hyper-Graphs for Video Question Answering | |
AIO - ViT | 48.59 | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | |
MMTF | 44.36 | MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering | |
SViTT | 52.7 | SViTT: Temporal Learning of Sparse Video-Text Transformers | |
GF (sup) - Faster RCNN | 55.08 | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | |