RTQ: Rethinking Video-language Understanding Based on Image-text Model

Xiao Wang, Yaoyu Li, Tian Gan, Zheng Zhang, Jingjing Lv, Liqiang Nie

Abstract

Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos. However, video-language understanding presents unique challenges due to the inclusion of highly complex semantic details, which result in information redundancy, temporal dependency, and scene complexity. Current techniques have only partially tackled these issues, and our quantitative analysis indicates that some of these methods are complementary. In light of this, we propose a novel framework called RTQ (Refine, Temporal model, and Query), which addresses these challenges simultaneously. The approach involves refining redundant information within frames, modeling temporal relations among frames, and querying task-specific information from the videos. Remarkably, our model demonstrates outstanding performance even in the absence of video-language pre-training, and the results are comparable with or superior to those achieved by state-of-the-art pre-training methods. Code is available at https://github.com/SCZwangxiao/RTQ-MM2023.
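The three stages named in the abstract (refine, temporal model, query) can be pictured as a data flow over per-frame token features. The sketch below is a toy illustration only and is not the authors' implementation: the function names, the norm-based token selection, the neighbor averaging, and the single-head dot-product attention are all assumptions chosen to keep the example small. The official code is in the repository linked above.

```python
import math

# Toy sketch of a refine -> temporal model -> query data flow.
# All three rules below are hypothetical stand-ins, not the RTQ method.

def refine(frames, keep):
    """Keep the `keep` tokens with the largest L2 norm in each frame
    (assumed saliency rule for dropping redundant tokens)."""
    refined = []
    for tokens in frames:
        ranked = sorted(tokens,
                        key=lambda t: math.sqrt(sum(x * x for x in t)),
                        reverse=True)
        refined.append(ranked[:keep])
    return refined

def temporal_model(frames):
    """Average each frame's tokens with the previous frame's tokens
    (assumed stand-in for modeling relations among frames)."""
    mixed = [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        mixed.append([[0.5 * (a + b) for a, b in zip(p, c)]
                      for p, c in zip(prev, cur)])
    return mixed

def query(frames, queries):
    """Each query vector attends over all video tokens with softmax
    dot-product attention and returns a weighted average of them."""
    tokens = [t for frame in frames for t in frame]
    dim = len(tokens[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) for t in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * t[d] for w, t in zip(weights, tokens)) / total
                        for d in range(dim)])
    return outputs

# 3 frames, 4 tokens per frame, 2 feature dimensions.
video = [[[1.0, 0.0], [0.0, 0.1], [3.0, 4.0], [0.0, 0.0]] for _ in range(3)]
task_queries = [[1.0, 0.0]]  # one hypothetical task-specific query
summary = query(temporal_model(refine(video, keep=2)), task_queries)
```

Here `summary` holds one feature vector per query, which a downstream head (retrieval, captioning, or question answering) would consume.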

Code Repositories

sczwangxiao/tsgvs-mm2023 (PyTorch)
SCZwangxiao/RTQ-MM2023 (Official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
video-captioning-on-msr-vtt-1 | RTQ | BLEU-4: 49.6; CIDEr: 69.3; ROUGE-L: 66.1
video-captioning-on-msvd-1 | RTQ | BLEU-4: 66.9; CIDEr: 123.4; ROUGE-L: 82.2
video-question-answering-on-next-qa | RTQ | Accuracy: 63.2
video-retrieval-on-activitynet | RTQ | text-to-video R@1: 53.5; R@5: 81.4; R@10: 91.9
video-retrieval-on-didemo | RTQ | text-to-video R@1: 57.6; R@5: 84.1; R@10: 89.9
video-retrieval-on-msr-vtt-1ka | RTQ | text-to-video R@1: 53.4; R@5: 76.1; R@10: 84.4
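
The text-to-video retrieval figures above are Recall@K values: the percentage of text queries whose matching video appears among the top K ranked results. A minimal sketch of that computation follows, assuming one ground-truth video per query arranged so that query i matches video i; the function name and the toy similarity matrix are hypothetical and not taken from the RTQ code.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose ground-truth video ranks in the top k.

    similarity[i][j] is the score of video j for text query i.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Video indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)

# Toy scores for 3 text queries against 3 videos.
scores = [
    [0.9, 0.1, 0.0],  # query 0: correct video ranked first
    [0.8, 0.2, 0.1],  # query 1: correct video ranked second
    [0.1, 0.2, 0.3],  # query 2: correct video ranked first
]
r1 = recall_at_k(scores, 1)
r2 = recall_at_k(scores, 2)
```

In this toy case two of the three queries are hits at K=1 and all three are hits at K=2, which shows why R@K can only stay the same or grow as K increases, as in the R@1, R@5, and R@10 columns above.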
