5 months ago

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval

Wang Yuxuan ; Gao Difei ; Yu Licheng ; Lei Stan Weixian ; Feiszli Matt ; Shou Mike Zheng

Abstract

Cognitive science has shown that humans perceive videos in terms of events separated by the state changes of dominant subjects. State changes trigger new events and are one of the most useful among the large amount of redundant information perceived. However, previous research focuses on the overall understanding of segments without evaluating the fine-grained status changes inside. In this paper, we introduce a new dataset called Kinetic-GEB+. The dataset consists of over 170k boundaries associated with captions describing status changes in the generic events in 12K videos. Upon this new dataset, we propose three tasks supporting the development of a more fine-grained, robust, and human-like understanding of videos through status changes. We evaluate many representative baselines in our dataset, where we also design a new TPD (Temporal-based Pairwise Difference) Modeling method for visual difference and achieve significant performance improvements. Besides, the results show there are still formidable challenges for current methods in the utilization of different granularities, representation of visual difference, and the accurate localization of status changes. Further analysis shows that our dataset can drive developing more powerful methods to understand status changes and thus improve video level comprehension. The dataset including both videos and boundaries is available at https://yuxuan-w.github.io/GEB-plus/

Code Repositories

yuxuan-w/geb-plus (Official, PyTorch, mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
boundary-captioning-on-kinetic-geb | ActBERT-revised | CIDEr: 74.71, ROUGE-L: 28.15, SPICE: 19.52
boundary-grounding-on-kinetic-geb | FROZEN-revised | F1@0.1s: 4.28, F1@0.2s: 8.54, F1@0.5s: 18.33, F1@1.0s: 31.04, F1@1.5s: 40.48, F1@2.0s: 47.86, F1@2.5s: 54.81, F1@3.0s: 61.45, F1@Avg: 33.35
text-to-video-retrieval-on-kinetic-geb | FROZEN-revised | mAP: 23.39
text-to-video-retrieval-on-kinetic-geb | FROZEN-revised (two-stream) | text-to-video R@1: 12.8, text-to-video R@5: 34.81, text-to-video R@10: 45.66, text-to-video R@50: 68.1
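
The listing does not say how F1@Avg is derived. A minimal sketch, assuming it is the unweighted mean of the eight per-threshold F1 scores reported for boundary grounding, suggests the listed numbers are consistent with that reading. The names `f1_by_threshold` and `mean_f1` are illustrative only and do not come from the official repository.

```python
# F1 scores for boundary grounding (FROZEN-revised), copied from the listing.
# Keys are the time thresholds in seconds.
f1_by_threshold = {
    0.1: 4.28, 0.2: 8.54, 0.5: 18.33, 1.0: 31.04,
    1.5: 40.48, 2.0: 47.86, 2.5: 54.81, 3.0: 61.45,
}


def mean_f1(scores):
    """Unweighted mean over thresholds (an assumed definition of F1@Avg)."""
    return sum(scores.values()) / len(scores)


f1_avg = mean_f1(f1_by_threshold)
print(f"{f1_avg:.2f}")  # compare with the reported F1@Avg of 33.35
```

Under this assumption the mean rounds to the reported 33.35. A weighted average, or one taken over a different set of thresholds, would generally give a different value.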
