
Zhipu AI Open-Sources GLM-4.5V, Achieving State-of-the-Art Results on 41 Multimodal Benchmarks


Zhipu AI has officially released and open-sourced its next-generation vision-language model, GLM-4.5V, which achieves state-of-the-art (SOTA) performance across 41 public multimodal benchmarks, marking a significant leap in open-source visual reasoning. The model is available on GitHub, Hugging Face, and ModelScope under the MIT license, which permits unrestricted commercial use. The project repository is at https://github.com/zai-org/GLM-V/.

GLM-4.5V is a large vision-language model (VLM) with 106 billion total parameters and 12 billion activated parameters. Built on Zhipu's flagship text foundation model, GLM-4.5-Air, it continues the technical lineage of GLM-4.1V-Thinking. The architecture consists of three core components: a vision encoder, an MLP adapter, and a language decoder. A key innovation is the integration of 3D Rotary Positional Encoding (3D-RoPE), which significantly enhances the model's ability to perceive and reason about 3D spatial relationships in multimodal inputs. The model also supports multimodal long-context input of up to 64K tokens and uses 3D convolutions to improve video-processing efficiency. These advances allow GLM-4.5V to handle not only static images but also dynamic video content, with improved robustness on high-resolution images and extreme aspect ratios.

To strengthen multimodal reasoning, Zhipu optimized the model across three training phases. In pre-training, the model was trained on large volumes of interleaved image-text data, including long-context sequences, to build a strong foundation for understanding complex visual and textual content. During supervised fine-tuning (SFT), the team introduced explicit chain-of-thought training samples, improving causal reasoning and the depth of multimodal comprehension. In the reinforcement learning (RL) phase, a multi-domain reward system combined reinforcement learning with verifiable rewards (RLVR) and reinforcement learning from human feedback (RLHF), yielding significant gains in STEM problem solving, multimodal grounding, and agent-based tasks.

In official demonstrations, GLM-4.5V showcased broad visual reasoning capabilities. On image understanding tasks, it can accurately identify objects and output precise bounding boxes in response to natural language queries. It can also infer geographic locations and approximate coordinates from subtle visual cues such as vegetation, climate indicators, and architectural styles, without relying on external search tools. In a head-to-head competition against human players in a global image-based location-guessing game, the model surpassed 99% of participants within 16 hours and rose to rank 66 worldwide after seven days. Initial testing confirmed high accuracy, though the model struggled with very common or ambiguous scenes, such as a photo of a Beijing park whose environmental features resemble those of many other locations.

For complex document analysis, GLM-4.5V excels at processing lengthy, multi-page documents rich in charts and tables. It reads each page visually, much as a human would, interpreting text and visual elements simultaneously. This enables accurate summarization, translation, and chart information extraction, and reduces the errors commonly introduced by traditional pipelines that separate OCR from text analysis.
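The sketch below shows roughly how the bounding-box grounding use case described above could be exercised from Python. It is a minimal, unofficial example: it assumes the zai-org/GLM-4.5V checkpoint works with the generic Hugging Face image-text-to-text classes and chat template, and the prompt wording is illustrative rather than an official grounding format.

```python
# Minimal sketch: natural-language grounding with GLM-4.5V via Hugging Face transformers.
# Assumptions (not confirmed by the article): the checkpoint loads through the generic
# AutoProcessor / AutoModelForImageTextToText classes and accepts the standard chat template;
# the prompt below is illustrative, not an official grounding syntax.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "zai-org/GLM-4.5V"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # a 106B-parameter MoE model will need multiple GPUs
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/street_scene.jpg"},  # placeholder image
            {"type": "text", "text": "Locate every traffic light and return bounding boxes."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (the model's answer, including box coordinates).
answer = processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```

For a model of this scale, a dedicated serving stack is likely more practical than a single-process script; the sketch is only meant to show the shape of a grounding request.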
The model also introduces a "front-end replication" feature: it can analyze webpage screenshots or interaction videos and generate the corresponding HTML, CSS, and JavaScript to reconstruct layouts, styles, and even dynamic behaviors. Initial tests successfully recreated the Google Scholar homepage with high visual fidelity, though the model missed the interactive sidebar behavior shown in the demo video.

In addition, GLM-4.5V's GUI agent capabilities allow it to interpret on-screen content, carry out dialog-based interactions, and locate icons, laying the foundation for intelligent agents that assist with desktop operations. Zhipu has also open-sourced a desktop assistant application that captures screenshots and screen recordings in real time and uses GLM-4.5V for a wide range of visual reasoning tasks, including code assistance, video content analysis, game solving, and document interpretation.
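As a rough illustration of the perception step in such a GUI-agent workflow, the following sketch captures the screen and sends the frame to the model together with an instruction. It is a hypothetical client, not Zhipu's released desktop assistant: screen capture uses the third-party mss package, and the endpoint, model name, and message format assume an OpenAI-compatible API server hosting GLM-4.5V.

```python
# Hypothetical GUI-agent client sketch; not Zhipu's open-sourced desktop assistant.
# Assumptions: `mss` and `pillow` are installed for screen capture, and GLM-4.5V is served
# behind an OpenAI-compatible chat-completions endpoint (URL and model name are placeholders).
import base64
import io

import mss
from openai import OpenAI
from PIL import Image

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder server


def capture_screen_as_data_url() -> str:
    """Grab the primary monitor and return the frame as a base64 PNG data URL."""
    with mss.mss() as sct:
        raw = sct.grab(sct.monitors[1])  # monitor 1 = primary display
        img = Image.frombytes("RGB", raw.size, raw.rgb)
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()


def ask_about_screen(instruction: str) -> str:
    """Send the current screen plus a natural-language instruction to the model."""
    response = client.chat.completions.create(
        model="glm-4.5v",  # placeholder model name on the local server
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": capture_screen_as_data_url()}},
                    {"type": "text", "text": instruction},
                ],
            }
        ],
        max_tokens=512,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(ask_about_screen("Describe the window in focus and locate the settings icon."))
```

A full agent would loop through capture, reasoning, and action (clicking or typing) before re-capturing the screen; the sketch stops at the perception step.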

References:

1. https://x.com/Zai_org/status/1954898011181789431
2. https://huggingface.co/zai-org/GLM-4.5V
3. https://github.com/zai-org/GLM-V/
4. https://mp.weixin.qq.com/s/8cKtGwUtEvAaPriVzBI1Dg

Editorial & Layout: He Chenlong