Ovis2.5 Technical Report

We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an optional "thinking mode" at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the "small model, big performance" philosophy of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a substantial improvement over its predecessor, Ovis2-8B, and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and video tasks, and achieves open-source SOTA at its scale for complex chart analysis.
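The abstract describes "thinking mode" as a runtime switch rather than a separate checkpoint. As a rough illustration of how such a latency-for-accuracy toggle might look to a user, the sketch below assumes a Hugging Face-style interface: the model loading call is standard `transformers` usage with the released checkpoint names (AIDC-AI/Ovis2.5-9B, AIDC-AI/Ovis2.5-2B), but the `chat` method and `enable_thinking` argument are hypothetical placeholders, not the confirmed API; the official model card should be treated as authoritative.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the released checkpoint; trust_remote_code pulls in the model's
# custom multimodal code. Swap in "AIDC-AI/Ovis2.5-2B" for on-device use.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis2.5-9B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()

# Hypothetical chat call (method name and flag are assumptions):
# enable_thinking=True lets the model reflect -- self-check and revise --
# before answering, trading extra decode latency for accuracy on hard
# inputs such as dense charts; leave it off for low-latency responses.
response = model.chat(
    prompt="What trend does this chart show?",
    images=["chart.png"],
    enable_thinking=True,
    max_new_tokens=1024,
)
print(response)
```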