MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models

Dohwan Ko; Joonmyung Choi; Hyeong Kyu Choi; Kyoung-Woon On; Byungseok Roh; Hyunwoo J. Kim


Abstract

Foundation models have shown outstanding performance and generalization capabilities across domains. Since most studies on foundation models mainly focus on the pretraining phase, a naive strategy to minimize a single task-specific loss is adopted for fine-tuning. However, such fine-tuning methods do not fully leverage other losses that are potentially beneficial for the target task. Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning the target task via auxiliary learning. We formulate the auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID). For evaluation, we apply our framework to various video foundation models (UniVL, Violet and All-in-one), and show significant performance gain on all four downstream tasks: text-to-video retrieval, video question answering, video captioning, and multi-modal sentiment analysis. Our qualitative analyses demonstrate that MELTR adequately 'transforms' individual loss functions and 'melts' them into an effective unified loss. Code is available at https://github.com/mlvlab/MELTR.
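To make the idea concrete, below is a minimal PyTorch sketch of a module that non-linearly combines several auxiliary loss values through a small Transformer encoder, in the spirit described in the abstract. It is not the authors' implementation (see the official repository linked above); the class name `MetaLossTransformer` and parameters such as `num_losses` and `dim` are hypothetical, and the bi-level/AID optimization is only summarized in comments.

```python
# Illustrative sketch only (assumptions noted above; official code at
# https://github.com/mlvlab/MELTR). Each scalar loss is embedded as a token,
# the tokens are mixed by a Transformer encoder, and a head produces one
# combined scalar loss for training the foundation model.
import torch
import torch.nn as nn


class MetaLossTransformer(nn.Module):
    """Non-linearly combines K loss values into a single scalar loss."""

    def __init__(self, num_losses: int, dim: int = 64, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(1, dim)  # project each scalar loss to a token
        # Learnable embedding identifying which loss each token comes from.
        self.loss_id = nn.Parameter(torch.randn(num_losses, dim) * 0.02)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=2 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(dim, 1)  # pooled tokens -> single scalar

    def forward(self, losses: list[torch.Tensor]) -> torch.Tensor:
        # losses: K scalar tensors (target-task loss plus auxiliary losses).
        x = torch.stack(losses).view(1, -1, 1)               # (1, K, 1)
        tokens = self.embed(x) + self.loss_id.unsqueeze(0)   # (1, K, dim)
        mixed = self.encoder(tokens)                          # (1, K, dim)
        return self.head(mixed.mean(dim=1)).squeeze()         # combined loss


# Inner step: the foundation model is trained on the combined loss. In the
# paper's bi-level formulation, MELTR's own parameters are updated in an outer
# step on the target-task loss (e.g., via Approximate Implicit Differentiation);
# only the inner backward pass is shown here.
meltr = MetaLossTransformer(num_losses=3)
aux_losses = [torch.rand((), requires_grad=True) for _ in range(3)]
combined = meltr(aux_losses)
combined.backward()
```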

Code Repositories

mlvlab/MELTR (official, PyTorch; mentioned on GitHub)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| multimodal-sentiment-analysis-on-cmu-mosi | UniVL + MELTR | Acc-2: 85.3, Corr: 0.789, F1: 85.4, MAE: 0.759 |
| video-captioning-on-msr-vtt-1 | UniVL + MELTR | BLEU-4: 44.17, CIDEr: 52.77, METEOR: 29.26, ROUGE-L: 62.35 |
| video-captioning-on-youcook2 | UniVL + MELTR | BLEU-3: 24.12, BLEU-4: 17.92, CIDEr: 1.90, METEOR: 22.56, ROUGE-L: 47.04 |
| video-retrieval-on-msr-vtt | All-in-one + MELTR | text-to-video R@1: 38.6, R@5: 74.4, R@10: 84.7 |
| video-retrieval-on-msr-vtt | VIOLET + MELTR | text-to-video R@1: 33.6, R@5: 63.7, R@10: 77.8, Median Rank: 3 |
| video-retrieval-on-msr-vtt | UniVL + MELTR | text-to-video R@1: 28.5, R@5: 55.5, R@10: 67.6, Median Rank: 4 |
| video-retrieval-on-msr-vtt-1ka | UniVL + MELTR | text-to-video R@1: 31.1, R@5: 55.7, R@10: 68.3, Median Rank: 4 |
| video-retrieval-on-msr-vtt-1ka | All-in-one + MELTR | text-to-video R@1: 41.3, R@5: 73.5, R@10: 82.5 |
| video-retrieval-on-msr-vtt-1ka | VIOLET + MELTR | text-to-video R@1: 35.5, R@5: 67.2, R@10: 78.4, Median Rank: 3 |
| video-retrieval-on-youcook2 | UniVL + MELTR | text-to-video R@1: 33.7, R@5: 63.1, R@10: 74.8, Median Rank: 3 |
| visual-question-answering-on-msvd-qa-1 | VIOLET + MELTR | Accuracy: 0.517 |
