Eduardo Blanco-Fernández, Carlos Gutiérrez-Álvarez, Nadia Nasri, Saturnino Maldonado-Bascón, Roberto J. López-Sastre

Abstract
Dense video captioning involves detecting and describing events within video sequences. Traditional methods operate in an offline setting, assuming the entire video is available for analysis. In contrast, in this work we introduce a groundbreaking paradigm: Live Video Captioning (LVC), where captions must be generated for video streams in an online manner. This shift brings unique challenges, including processing partial observations of the events and the need for temporal anticipation of the actions. We formally define the novel problem of LVC and propose evaluation metrics specifically designed for this online scenario, demonstrating their advantages over traditional metrics. To address the novel complexities of LVC, we present a new model that combines deformable transformers with temporal filtering, enabling effective captioning over video streams. Extensive experiments on the ActivityNet Captions dataset validate the proposed approach, showcasing its superior performance in the LVC setting compared to state-of-the-art offline methods. To foster further research, we provide the results of our model and an evaluation toolkit with the new metrics integrated at: https://github.com/gramuah/lvc.
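To make the online setting concrete, here is a minimal, hypothetical sketch of an LVC-style streaming loop: captions are emitted at fixed strides from a bounded buffer of recent frame features, so each caption is produced from a partial observation of the ongoing event. The class, parameter names, and the stub captioning head are all illustrative assumptions; the actual model in the paper uses deformable transformers with temporal filtering and is available in the linked repository.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class LiveCaption:
    t_start: float  # estimated event start (seconds), from the buffered window
    t_emit: float   # stream time at which the caption was emitted
    text: str


class SlidingWindowCaptioner:
    """Toy online captioner: keeps a bounded buffer of recent frame
    features and emits a caption every `stride` frames, mimicking the
    partial-observation constraint of Live Video Captioning."""

    def __init__(self, window: int = 64, stride: int = 16, fps: float = 25.0):
        self.buffer = deque(maxlen=window)
        self.stride = stride
        self.fps = fps
        self.n_seen = 0

    def _describe(self, features) -> str:
        # Placeholder for the captioning head (the paper's model uses
        # deformable transformers with temporal filtering); this stub
        # only reports how much of the stream has been observed.
        return f"caption from {len(features)} buffered frames"

    def push(self, frame_feature):
        """Consume one frame feature; return a LiveCaption when one is due."""
        self.buffer.append(frame_feature)
        self.n_seen += 1
        if self.n_seen % self.stride == 0:
            t_emit = self.n_seen / self.fps
            t_start = (self.n_seen - len(self.buffer)) / self.fps
            return LiveCaption(t_start, t_emit, self._describe(list(self.buffer)))
        return None


# Simulate a stream: integers stand in for per-frame feature embeddings.
captioner = SlidingWindowCaptioner(window=64, stride=16)
for i in range(100):
    cap = captioner.push(frame_feature=i)
    if cap is not None:
        print(f"[t={cap.t_emit:.2f}s] {cap.text}")
```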
Code Repositories
https://github.com/gramuah/lvc
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| live-video-captioning-on-activitynet-captions | LVC | Live Score: 20.81 |
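The exact definition of the Live Score metric is given in the paper and implemented in the evaluation toolkit linked above; the sketch below only illustrates the general idea of online evaluation, i.e., scoring the caption available at each evaluation instant against the ground-truth events active at that instant and averaging over time. The function names are hypothetical, and a toy token-overlap F1 stands in for a real captioning metric such as METEOR.

```python
def token_f1(pred: str, ref: str) -> float:
    """Toy stand-in for a captioning metric such as METEOR."""
    p, r = set(pred.lower().split()), set(ref.lower().split())
    if not p or not r:
        return 0.0
    overlap = len(p & r)
    prec, rec = overlap / len(p), overlap / len(r)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0


def live_score(pred_stream, gt_events, eval_times) -> float:
    """Average, over evaluation instants, the best match between the
    latest caption emitted by time t and any ground-truth event active at t.
    pred_stream: list of (t_emit, caption); gt_events: list of (t0, t1, caption)."""
    total = 0.0
    for t in eval_times:
        emitted = [c for t_emit, c in pred_stream if t_emit <= t]
        active = [c for t0, t1, c in gt_events if t0 <= t <= t1]
        if emitted and active:
            total += max(token_f1(emitted[-1], g) for g in active)
    return total / len(eval_times) if eval_times else 0.0


# Usage with made-up captions and one ground-truth event.
preds = [(4.0, "a man starts climbing a wall"), (9.0, "the man reaches the top")]
gt = [(0.0, 10.0, "a man climbs a climbing wall to the top")]
print(f"toy live score: {live_score(preds, gt, eval_times=[2.0, 5.0, 8.0, 10.0]):.3f}")
```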