Abstract
In this paper, we present ENTER, an interpretable Video Question Answering (VideoQA) system based on event graphs. Event graphs convert videos into graphical representations, where video events form the nodes and event-event relationships (temporal/causal/hierarchical) form the edges. This structured representation offers several benefits: 1) interpretable VideoQA via generated code that parses the event graph; 2) incorporation of contextual visual information into the reasoning process (code generation) via event graphs; 3) robust VideoQA via Hierarchical Iterative Update of the event graphs. Existing interpretable VideoQA systems are often top-down, disregarding low-level visual information in reasoning-plan generation, and are brittle. Bottom-up approaches, while producing responses directly from visual data, lack interpretability. Experimental results on NExT-QA, IntentQA, and EgoSchema demonstrate that our method not only outperforms existing top-down approaches while obtaining competitive performance against bottom-up approaches, but, more importantly, offers superior interpretability and explainability in the reasoning process.
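To make the event-graph idea concrete, below is a minimal, hypothetical sketch of what such a structure and a piece of "generated code" querying it might look like. The class, event descriptions, relation labels, and the query function are illustrative assumptions, not the paper's actual implementation or API.

```python
# Illustrative sketch of an event graph: events as nodes, labeled relations as edges.
# All names and data here are assumptions for demonstration purposes only.
from dataclasses import dataclass, field


@dataclass
class EventGraph:
    # node id -> natural-language event description
    events: dict[str, str] = field(default_factory=dict)
    # (source id, target id, relation); relation in {"temporal", "causal", "hierarchical"}
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_event(self, node_id: str, description: str) -> None:
        self.events[node_id] = description

    def add_relation(self, src: str, dst: str, relation: str) -> None:
        self.edges.append((src, dst, relation))


# Hypothetical example of "generated code" that answers a causal question
# by parsing the graph: the causes of an event are the events that point
# to it via a "causal" edge.
def why_did_event_happen(graph: EventGraph, node_id: str) -> list[str]:
    causes = [src for src, dst, rel in graph.edges if dst == node_id and rel == "causal"]
    return [graph.events[c] for c in causes]


if __name__ == "__main__":
    g = EventGraph()
    g.add_event("e1", "the child drops the toy")
    g.add_event("e2", "the child starts crying")
    g.add_relation("e1", "e2", "causal")    # e1 causes e2
    g.add_relation("e1", "e2", "temporal")  # e1 precedes e2
    print(why_did_event_happen(g, "e2"))    # -> ['the child drops the toy']
```

Because the answer is produced by explicit code traversing an explicit graph, each step of the reasoning can be inspected, which is the interpretability benefit the abstract highlights.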
Benchmarks
| Benchmark | Method | Accuracy (%) |
|---|---|---|
| Zero-Shot Video Question Answering on IntentQA | ENTER | 71.5 |
| Zero-Shot Video Question Answering on NExT-QA | ENTER | 75.1 |