Han Donghoon, Seo Seunghyeon, Park Eunhwan, Nam Seong-Uk, Kwak Nojun

Abstract
Multimodal and large language models (LLMs) have revolutionized the utilization of open-world knowledge, unlocking novel potential across various tasks and applications. Among these domains, the video domain has notably benefited from their capabilities. In this paper, we present Highlight-CLIP (HL-CLIP), a method designed to excel at video highlight detection by leveraging the pre-trained knowledge embedded in multimodal models. By simply fine-tuning the multimodal encoder in combination with our saliency-pooling technique, we achieve, to the best of our knowledge, state-of-the-art performance on the QVHighlights benchmark for highlight detection.
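The abstract describes scoring frames with a fine-tuned multimodal encoder and then applying a saliency-pooling step. The paper's exact formulation is not given here, so the following is only a minimal sketch under assumptions: frames are scored by cosine similarity between per-frame embeddings and the query embedding, and "saliency pooling" is approximated as a softmax-weighted average over a local temporal window (the function name, window size, and pooling rule are all hypothetical).

```python
import numpy as np

def saliency_pooling(frame_feats, text_feat, window=5):
    """Hypothetical sketch of saliency pooling over frame-query similarities.

    frame_feats: (num_frames, dim) frame embeddings from a multimodal encoder
    text_feat:   (dim,) query embedding
    Returns a per-frame saliency score of shape (num_frames,).
    """
    # Cosine similarity between each frame embedding and the query embedding.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sims = f @ t  # shape (num_frames,)

    # Assumed pooling rule: softmax-weighted average within a sliding window,
    # which emphasizes locally peaked (salient) similarity scores.
    half = window // 2
    pooled = np.empty_like(sims)
    for i in range(len(sims)):
        w = sims[max(0, i - half): i + half + 1]
        weights = np.exp(w - w.max())
        weights /= weights.sum()
        pooled[i] = (weights * w).sum()
    return pooled
```

Because each pooled score is a convex combination of cosine similarities, the output stays in [-1, 1] and peaks align with highlight-like frames; the actual HL-CLIP pooling may differ.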
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| highlight-detection-on-qvhighlights | HL-CLIP | Hit@1: 70.60; mAP: 41.94 |