MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
Dohwan Ko; Joonmyung Choi; Hyeong Kyu Choi; Kyoung-Woon On; Byungseok Roh; Hyunwoo J. Kim

Abstract
Foundation models have shown outstanding performance and generalization capabilities across domains. Since most studies on foundation models mainly focus on the pretraining phase, a naive strategy to minimize a single task-specific loss is adopted for fine-tuning. However, such fine-tuning methods do not fully leverage other losses that are potentially beneficial for the target task. Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning the target task via auxiliary learning. We formulate the auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID). For evaluation, we apply our framework to various video foundation models (UniVL, VIOLET and All-in-one), and show significant performance gain on all four downstream tasks: text-to-video retrieval, video question answering, video captioning, and multi-modal sentiment analysis. Our qualitative analyses demonstrate that MELTR adequately 'transforms' individual loss functions and 'melts' them into an effective unified loss. Code is available at https://github.com/mlvlab/MELTR.
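To make the bi-level idea in the abstract concrete, here is a minimal sketch of an AID-style hypergradient on a scalar toy problem. It is not the MELTR implementation: the two quadratic losses, the single loss weight `phi` (standing in for the parameters of the loss-combining module), the constants, and the use of a truncated Neumann series to approximate the inverse Hessian are all assumptions made for illustration. The inner problem fits `w` to a weighted sum of two losses, the outer problem scores `w` on a separate target loss, and the implicit function theorem gives the gradient of the outer loss with respect to `phi` without unrolling the inner optimization.

```python
# Toy Approximate Implicit Differentiation (AID) on a scalar bi-level problem.
# All losses, names, and constants are hypothetical, chosen so the result
# can be checked by hand; this is not the authors' code.

def inner_solution(phi, a=0.0, b=2.0):
    """Closed-form minimiser of L_train(w, phi) = 0.5*(w-a)**2 + 0.5*phi*(w-b)**2."""
    return (a + phi * b) / (1.0 + phi)

def neumann_inverse(h, alpha=0.1, terms=200):
    """Approximate 1/h by alpha * sum_k (1 - alpha*h)**k (needs 0 < alpha*h < 2)."""
    total, power = 0.0, 1.0
    for _ in range(terms):
        total += power
        power *= 1.0 - alpha * h
    return alpha * total

def aid_hypergradient(phi, target=1.5, a=0.0, b=2.0):
    """d L_val / d phi via the implicit function theorem."""
    w = inner_solution(phi, a, b)
    dval_dw = w - target   # gradient of L_val(w) = 0.5*(w - target)**2
    hessian = 1.0 + phi    # second derivative of L_train with respect to w
    cross = w - b          # mixed derivative of L_train with respect to w and phi
    return -dval_dw * neumann_inverse(hessian) * cross

def finite_difference(phi, target=1.5, eps=1e-6):
    """Central-difference check of the same hypergradient."""
    val = lambda p: 0.5 * (inner_solution(p) - target) ** 2
    return (val(phi + eps) - val(phi - eps)) / (2.0 * eps)
```

In a real model the Hessian is never formed explicitly; the same series would be evaluated with Hessian-vector products, which is what makes this family of methods practical at scale.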
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| multimodal-sentiment-analysis-on-cmu-mosi | UniVL + MELTR | Acc-2: 85.3; Corr: 0.789; F1: 85.4; MAE: 0.759 |
| video-captioning-on-msr-vtt-1 | UniVL + MELTR | BLEU-4: 44.17; CIDEr: 52.77; METEOR: 29.26; ROUGE-L: 62.35 |
| video-captioning-on-youcook2 | UniVL + MELTR | BLEU-3: 24.12; BLEU-4: 17.92; CIDEr: 1.90; METEOR: 22.56; ROUGE-L: 47.04 |
| video-retrieval-on-msr-vtt | All-in-one + MELTR | text-to-video R@1: 38.6; text-to-video R@5: 74.4; text-to-video R@10: 84.7 |
| video-retrieval-on-msr-vtt | VIOLET + MELTR | text-to-video Median Rank: 3; text-to-video R@1: 33.6; text-to-video R@5: 63.7; text-to-video R@10: 77.8 |
| video-retrieval-on-msr-vtt | UniVL + MELTR | text-to-video Median Rank: 4; text-to-video R@1: 28.5; text-to-video R@5: 55.5; text-to-video R@10: 67.6 |
| video-retrieval-on-msr-vtt-1ka | UniVL + MELTR | text-to-video Median Rank: 4; text-to-video R@1: 31.1; text-to-video R@5: 55.7; text-to-video R@10: 68.3 |
| video-retrieval-on-msr-vtt-1ka | All-in-one + MELTR | text-to-video R@1: 41.3; text-to-video R@5: 73.5; text-to-video R@10: 82.5 |
| video-retrieval-on-msr-vtt-1ka | VIOLET + MELTR | text-to-video Median Rank: 3; text-to-video R@1: 35.5; text-to-video R@5: 67.2; text-to-video R@10: 78.4 |
| video-retrieval-on-youcook2 | UniVL + MELTR | text-to-video Median Rank: 3; text-to-video R@1: 33.7; text-to-video R@5: 63.1; text-to-video R@10: 74.8 |
| visual-question-answering-on-msvd-qa-1 | VIOLET + MELTR | Accuracy: 0.517 |