Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Video understanding represents one of the most challenging frontiers in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, post-training, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training