VTimeLLM: Empower LLM to Grasp Video Moments

Bin Huang; Xin Wang; Hong Chen; Zihan Song; Wenwu Zhu

Abstract

Large language models (LLMs) have shown remarkable text understanding capabilities, and Video LLMs extend them to video data so that they can comprehend visual details. However, existing Video LLMs can only provide a coarse description of the entire video and fail to capture the precise start and end time boundaries of specific events. In this paper, we address this issue by proposing VTimeLLM, a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries. Specifically, VTimeLLM adopts a boundary-aware three-stage training strategy: it uses image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding and align with human intent. Extensive experiments demonstrate that VTimeLLM significantly outperforms existing Video LLMs on fine-grained, time-related video comprehension tasks such as Temporal Video Grounding and Dense Video Captioning. Moreover, this fine-grained temporal understanding enables VTimeLLM to outperform existing Video LLMs on a video dialogue benchmark, demonstrating its superior cross-modal understanding and reasoning abilities.
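To make the notion of a "precise start and end time boundary" concrete, the sketch below scores a predicted event segment against a reference segment using temporal intersection over union (IoU), a measure commonly used for temporal grounding. This page does not state which metric the paper reports, so treat the metric choice as an assumption; the names `temporal_iou` and `recall_at_iou` are hypothetical helpers and are not taken from the VTimeLLM repository.

```python
from typing import Sequence, Tuple

# A segment is (start, end) in seconds. This representation is an assumption
# made for illustration, not the format used by the VTimeLLM code base.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    # Overlap is zero when the intervals do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: Sequence[Segment], gts: Sequence[Segment],
                  threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the paired reference meets the threshold."""
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    # Predicted 10s-20s against a reference of 15s-25s: 5s overlap over a 15s union.
    print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
```

A coarse, whole-video description corresponds to predicting the full video span for every event, which yields a low IoU whenever the event itself is short.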

Code Repositories

huangb23/vtimellm (official, PyTorch)

Benchmarks

Benchmark                                     Methodology   Metrics
dense-video-captioning-on-activitynet         VTimeLLM      CIDEr: 27.6; SODA: 5.8
temporal-relation-extraction-on-vinoground    VTimeLLM      Group Score: 5.2; Text Score: 19.4; Video Score: 27
vcgbench-diverse-on-videoinstruct             VTimeLLM      Consistency: 2.35; Contextual Understanding: 2.48; Correctness of Information: 2.16; Dense Captioning: 1.13; Detail Orientation: 2.41; Reasoning: 3.45; Spatial Understanding: 2.29; Temporal Understanding: 1.46; mean: 2.17
video-based-generative-performance            VTimeLLM      Consistency: 2.47; Contextual Understanding: 3.40; Correctness of Information: 2.78; Detail Orientation: 3.10; Temporal Understanding: 2.49; mean: 2.85
video-based-generative-performance-1          VTimeLLM      gpt-score: 2.78
video-based-generative-performance-2          VTimeLLM      gpt-score: 2.47
video-based-generative-performance-3          VTimeLLM      gpt-score: 3.40
video-based-generative-performance-4          VTimeLLM      gpt-score: 3.10
video-based-generative-performance-5          VTimeLLM      gpt-score: 2.49
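The reported means can be checked directly from the per-category scores listed above. The sketch below is a plain arithmetic check and not part of any benchmark tooling; the dictionary names, the `mean_score` helper, and the `CORE_CATEGORIES` grouping are hypothetical. The page does not say how the vcgbench-diverse mean is computed. The only observation made here is that the listed 2.17 matches the average of the five categories shared with video-based-generative-performance, while the average of all eight listed categories rounds to 2.22.

```python
from statistics import mean
from typing import Dict, Iterable, Optional

# Scores copied from the benchmark listing on this page.
video_based_generative_performance: Dict[str, float] = {
    "Consistency": 2.47,
    "Contextual Understanding": 3.40,
    "Correctness of Information": 2.78,
    "Detail Orientation": 3.10,
    "Temporal Understanding": 2.49,
}

vcgbench_diverse: Dict[str, float] = {
    "Consistency": 2.35,
    "Contextual Understanding": 2.48,
    "Correctness of Information": 2.16,
    "Dense Captioning": 1.13,
    "Detail Orientation": 2.41,
    "Reasoning": 3.45,
    "Spatial Understanding": 2.29,
    "Temporal Understanding": 1.46,
}

# Assumption: the five categories shared by both benchmarks. The page does not
# state that the reported mean is restricted to these.
CORE_CATEGORIES = tuple(video_based_generative_performance)


def mean_score(scores: Dict[str, float],
               keys: Optional[Iterable[str]] = None) -> float:
    """Average the selected categories (all categories when keys is None)."""
    selected = list(scores) if keys is None else list(keys)
    return mean(scores[k] for k in selected)


if __name__ == "__main__":
    print(round(mean_score(video_based_generative_performance), 2))
    print(round(mean_score(vcgbench_diverse, CORE_CATEGORIES), 2))
    print(round(mean_score(vcgbench_diverse), 2))
```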
