Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Video understanding represents one of the most challenging frontiers in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, post-training, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training