Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

Jialong Wu Xiaoying Zhang Hongyi Yuan Xiangcheng Zhang Tianhao Huang Changjing He Chaoyi Deng Renrui Zhang Youbin Wu Mingsheng Long

Abstract

Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. By relying predominantly on verbal reasoning, current systems achieve expert-level performance in formal and abstract domains such as mathematics and programming. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks--particularly those grounded in the physical world--visual generation more naturally serves as a world model, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual-verbal CoT reasoning, constructing a new evaluation suite, VisWorld-Eval. Controlled experiments on a state-of-the-art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Together, this work clarifies the potential of multimodal world modeling for more powerful, human-like multimodal AI.

One-sentence Summary

Tsinghua University and ByteDance Seed researchers propose the visual superiority hypothesis, showing that interleaved visual-verbal chain-of-thought reasoning in unified multimodal models outperforms verbal-only approaches on physical-world tasks, validated via their new VisWorld-Eval benchmark.

Key Contributions

  • The paper introduces the visual superiority hypothesis, arguing that for physical-world tasks, visual generation serves as a more natural world model than verbal reasoning alone, which faces representational and knowledge bottlenecks.
  • It formalizes world modeling in CoT reasoning and introduces VisWorld-Eval, a new benchmark suite designed to evaluate interleaved visual-verbal reasoning on tasks requiring explicit visual simulation.
  • Empirical results on state-of-the-art UMMs show interleaved CoT significantly outperforms verbal-only CoT on visual-favoring tasks, while offering no advantage on tasks not requiring visual modeling, clarifying the scope of multimodal reasoning benefits.

Introduction

The authors leverage unified multimodal models (UMMs) to explore how visual generation enhances reasoning by acting as explicit visual world models—complementing the verbal world models embedded in large language models. While current AI systems excel in abstract domains using verbal chain-of-thought reasoning, they struggle with physical and spatial tasks due to representational bottlenecks and insufficient visual prior knowledge. The authors introduce the “visual superiority hypothesis,” arguing that for tasks grounded in the physical world, visual generation provides richer, more grounded representations. Their main contribution is a principled theoretical framework linking world modeling to reasoning, plus the VisWorld-Eval benchmark—a controlled suite of tasks designed to isolate when visual generation helps. Empirical results show interleaved verbal-visual CoT significantly improves performance on tasks requiring spatial or physical simulation, but offers no gain on simpler symbolic tasks like Sokoban, validating their hypothesis.

Dataset

The authors use VisWorld-Eval, a curated suite of seven reasoning tasks designed to evaluate visual world modeling, organized into two categories: world simulation and world reconstruction.

  • Paper Folding (adapted from SpatialViz-Bench): Simulates unfolding paper after folds and hole punches. Grid sizes 3–8, 1–4 folds; holes in 5 shapes. Test set uses max difficulty (8x8 grid, 4 folds). CoTs generated via rule-based templates, refined by Gemini 2.5 Pro. Visual world modeling interleaves images of unfolding steps; verbal modeling uses grid matrices; implicit modeling skips state tracking.

  • Multi-hop Manipulation (based on CLEVR): Objects (cubes, spheres, cylinders) in Blender-rendered scenes undergo spatial operations (add/remove/change). Queries target final layout. Test prompts vary initial objects (3–6) and operations (1–5). CoTs simulate step-by-step execution, refined by Gemini 2.5 Pro.

  • Ball Tracking (adapted from RBench-V): Predicts ball trajectory after elastic wall reflections. The ball's initial direction is indicated by a green arrow; 4–8 holes are placed at random. Test cases require ≥1 wall bounce. CoTs generated by Seed 1.6, explaining frame-to-frame dynamics.

  • Sokoban: Grid puzzles (6–10x10) with one box and one target. Optimal paths are found with a search algorithm (a minimal search sketch follows this list); only key steps are rendered (approach, push, direction change), augmented with randomized detours for reflection. CoT text is written by Seed 1.6. Visual modeling interleaves rendered steps; verbal removes images; implicit masks coordinates with [masked].

  • Maze: Fixed 5x5 grid puzzles. CoTs from rule-based templates, rewritten for naturalness. Visual modeling interleaves rendered path steps; verbal and implicit follow Sokoban protocol, masking positions as [masked].

  • Cube 3-view Projection (from SpatialViz-Bench): Infers unseen orthographic view from isometric + two orthographic views. Grids 3–5, two cube colors. Test set uses random grid sizes. CoTs construct queried view, mark occlusions with auxiliary color, count cubes—rewritten by Gemini 2.5 Pro. Visual modeling adds image of queried view; verbal uses character matrices for color encoding.

  • Real-world Spatial Reasoning (subset of MMSI-Bench): Uses real-world camera-object/region positional questions. Training CoTs generated via visual-CoT pipeline using SFT-trained BAGEL for novel view synthesis, then filtered and rewritten by Gemini 2.5 Pro.
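
For the grid-world tasks above (Maze, and the movement portion of Sokoban), the optimal paths that seed the CoTs can be found with a standard breadth-first search. The sketch below is a minimal, assumed illustration of such a generator for maze-style puzzles; it is not the authors' exact pipeline, and the grid encoding and function names are hypothetical.

```python
from collections import deque

# Hypothetical grid encoding: '#' wall, '.' floor, 'S' start, 'T' target.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def shortest_path(grid):
    """Breadth-first search for the shortest move sequence from 'S' to 'T'."""
    rows, cols = len(grid), len(grid[0])
    start = next((r, c) for r in range(rows) for c in range(cols) if grid[r][c] == "S")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if grid[r][c] == "T":
            return path  # e.g. ["right", "right", "down", ...]
        for name, (dr, dc) in MOVES.items():
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [name]))
    return None  # unsolvable puzzle

# Each returned move (or direction change) could then be verbalized as a CoT
# step and, for visual world modeling, rendered as an intermediate image.
print(shortest_path(["S..#", ".#.#", "...T"]))
```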

All tasks use question-answering format with verifiable answers. Training data includes chain-of-thoughts (CoTs) generated via templates or models, refined for clarity. Visual world modeling interleaves intermediate state images; verbal modeling uses symbolic representations; implicit modeling removes or masks explicit state tracking. Training and test sample counts per task are summarized in Table 2.
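
To make this format concrete, the sketch below shows how a single training example with its three CoT variants might be laid out; the schema and field names are illustrative assumptions rather than the released data format.

```python
# Illustrative (assumed) layout of one VisWorld-Eval-style training example;
# the actual released schema may differ.
example = {
    "task": "maze",
    "question": "Starting from S, which move sequence reaches the target T?",
    "input_images": ["maze_initial.png"],
    "answer": "right, right, down, down, right",  # verifiable final answer
    "cot_variants": {
        # Visual world modeling: text interleaved with rendered intermediate states.
        "visual": [
            {"type": "text", "value": "Move right onto the open cell."},
            {"type": "image", "value": "maze_step_1.png"},
            {"type": "text", "value": "Continue right, then head down toward T."},
            {"type": "image", "value": "maze_step_2.png"},
        ],
        # Verbal world modeling: the same intermediate states tracked symbolically.
        "verbal": [
            {"type": "text", "value": "Position (0,0) -> (0,1): the cell is free."},
            {"type": "text", "value": "Position (0,1) -> (0,2), then move down."},
        ],
        # Implicit world modeling: explicit state tracking removed or masked.
        "implicit": [
            {"type": "text", "value": "Move right twice ([masked]), then down and right to T."},
        ],
    },
}
```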

Method

The authors formalize a world-model perspective on multimodal reasoning, conceptualizing the underlying environment as a multi-observable Markov decision process (MOMDP). This framework posits that a multimodal model operates by maintaining an internal representation of the world state, inferred from multiple, potentially incomplete, observations across different modalities. The core of this approach lies in two atomic capabilities: world reconstruction and world simulation. World reconstruction enables the model to infer the complete structure of the world from partial observations, allowing it to generate novel views of the same state. This is formalized as encoding a set of observations into an internal representation, $\hat{s} = \mathrm{enc}(o_{\phi_1}, \ldots, o_{\phi_n})$, which can then be decoded to synthesize an unseen observation, $\hat{o}_{\phi_{n+1}} = \mathrm{dec}(\hat{s}, \phi_{n+1})$. In modern generative models, this capability is realized through end-to-end novel view generation, $p_{\theta}(o_{\phi_{n+1}} \mid o_{\phi_1}, \ldots, o_{\phi_n})$. The second capability, world simulation, allows the model to predict the future evolution of the world state. This is formalized as predicting the transition from the current state under an action, $\hat{s}' \sim \mathrm{pred}(\hat{s}, a)$, or, more commonly in generative models, as predicting future observations, $p_{\theta}(o_{t+1} \mid o_{\leq t}, a_{\leq t})$. The authors argue that deliberate reasoning, such as chain-of-thought, is a process of manipulating these internal world observations. This is formalized as a sequence of reasoning steps and observations, $R = \big((r_1, o_1), (r_2, o_2), \ldots, (r_H, o_H)\big)$, where each step $r_i$ is a logical operation and $o_i$ is an observation generated by invoking one of the atomic world-modeling capabilities. This formulation is modality-agnostic, allowing for verbal, visual, or implicit world modeling.
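
As a minimal illustration of this formalization, the sketch below casts the two atomic capabilities and the reasoning loop as abstract interfaces; the class and function names (`enc`, `dec`, `pred`, `chain_of_thought`) mirror the notation above but are expository assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Any, Protocol, Sequence

Observation = Any  # an image or a verbal description
Viewpoint = Any    # e.g. a camera pose phi
Action = Any       # e.g. "push the box one cell to the left"

@dataclass
class State:
    latent: Any  # internal representation \hat{s}

class WorldModel(Protocol):
    # World reconstruction: infer the latent state from partial observations ...
    def enc(self, observations: Sequence[Observation]) -> State: ...
    # ... and decode it into a novel, previously unseen view.
    def dec(self, state: State, viewpoint: Viewpoint) -> Observation: ...
    # World simulation: predict how the state evolves under an action.
    def pred(self, state: State, action: Action) -> State: ...

def chain_of_thought(model: WorldModel, observations, steps):
    """Reasoning as alternating logical steps and world-model invocations.

    `steps` is a list of (reasoning_text, ("simulate", action)) or
    (reasoning_text, ("reconstruct", viewpoint)) pairs.
    """
    state = model.enc(observations)            # \hat{s} = enc(o_1, ..., o_n)
    trace = []
    for reasoning, (kind, arg) in steps:
        if kind == "simulate":
            state = model.pred(state, arg)     # \hat{s}' ~ pred(\hat{s}, a)
            obs = model.dec(state, None)       # render the predicted state
        else:
            obs = model.dec(state, arg)        # novel view of the same state
        trace.append((reasoning, obs))         # (r_i, o_i)
    return trace
```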

The model architecture is built upon a unified multimodal foundation, specifically the BAGEL model, which is designed to handle both verbal and visual generation. The training process optimizes the model for generating interleaved verbal and visual reasoning steps. It begins with supervised fine-tuning (SFT) on task-specific datasets, where both the verbal reasoning steps and the visual intermediate outputs are optimized. The loss function for visual world modeling combines cross-entropy for verbal generation with a flow-matching loss for visual generation, which is designed to learn the internal representations required for novel view synthesis. The training objective is to minimize:

$\mathcal{L}_{\theta}(Q, I, R, A) = - \sum_{i=1}^{H+1} \sum_{j=1}^{|r_i|} \log p_{\theta}(r_{i,j} \mid r_{i,<j}, R_i) + \sum_{i=1}^{H} \mathbb{E}_{t,\epsilon} \big\| v_{\theta}(o_i^t, t \mid \tilde{R}_i) - (\epsilon - o_i) \big\|_2^2$

Following SFT, reinforcement learning from verifiable rewards (RLVR) is applied to further refine the model. During RL, the verbal generation component is optimized using GRPO, while the visual generation component is regularized via the KL divergence with respect to the SFT-trained reference model to prevent degradation. The overall training framework enables the model to perform complex multimodal reasoning by explicitly generating and manipulating visual and verbal observations.
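
A compact PyTorch-style sketch of this combined SFT objective is given below. It assumes a hypothetical `model` exposing `verbal_logits` and `velocity` heads and uses a linear-interpolation flow-matching path; it illustrates the structure of the objective and is not the BAGEL training code.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, verbal_steps, visual_steps, context):
    """Cross-entropy over verbal reasoning tokens plus a flow-matching term
    for each intermediate image (hypothetical `model` interface)."""
    # Verbal term: -sum_i sum_j log p_theta(r_{i,j} | r_{i,<j}, R_i)
    loss_verbal = 0.0
    for step in verbal_steps:
        logits = model.verbal_logits(step["input_ids"], context)   # (T, vocab)
        loss_verbal = loss_verbal + F.cross_entropy(logits, step["target_ids"])

    # Visual term: regress the predicted velocity toward (eps - o_i)
    loss_visual = 0.0
    for image in visual_steps:                                     # clean image o_i
        t = torch.rand(())                                         # random flow time
        eps = torch.randn_like(image)
        noisy = (1 - t) * image + t * eps                          # point on the linear path
        pred_velocity = model.velocity(noisy, t, context)          # v_theta(o_i^t, t | R~_i)
        loss_visual = loss_visual + F.mse_loss(pred_velocity, eps - image)

    return loss_verbal + loss_visual
```

In the subsequent RLVR stage, only the verbal policy receives GRPO gradients, while an additional KL term anchors the visual generation head to the SFT reference.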

Experiment

  • Evaluated visual world modeling via VisWorld-Eval, a 7-task suite spanning synthetic and real-world domains; advanced VLMs (Gemini 3 Flash/Pro) outperform others but remain suboptimal on complex tasks like paper folding and spatial reasoning.
  • Visual world modeling significantly boosts performance on world simulation tasks (paper folding, ball tracking, multi-hop manipulation) via richer spatial grounding; achieves 4x sample efficiency gain over verbal modeling in paper folding.
  • Enhances world reconstruction tasks (cube 3-view, real-world spatial reasoning) by leveraging visual priors; maintains ~10% accuracy advantage even on out-of-distribution cube stacks (size 6) and achieves >50% fidelity in view synthesis vs. near-zero for verbal modeling.
  • Offers no benefit on grid-world tasks (maze, Sokoban); implicit world modeling suffices, with internal model representations emerging post-pretraining and further refined via SFT without explicit coordinate supervision.
  • UMMs outperform VLM baselines (Qwen2.5-VL) even when the VLMs are fine-tuned on identical verbal CoT data, confirming that the advantage stems from visual world modeling itself rather than from a compromised verbal baseline.
  • RLVR improves all CoT formulations but does not close the performance gap; visual world modeling gains persist, suggesting its superiority is structural, not training-limited.
  • Qualitative analysis reveals hallucinations in verbal reasoning (especially symmetry tasks) and generation artifacts in visual modeling, with spatial understanding still limited across viewpoints—expected to improve with stronger base models and curated data.

The authors use a bar chart to compare the performance of different models and reasoning methods on three tasks: Paper Folding, Multi-Hop Manipulation, and Cube 3-View Projection. Results show that visual world modeling consistently outperforms verbal and implicit world modeling across all tasks, with the largest gains observed in Multi-Hop Manipulation and Cube 3-View Projection. In Paper Folding, visual world modeling achieves the highest accuracy, while implicit and verbal world modeling show similar, lower performance.

The authors compare implicit and visual world modeling on MMSI-Bench positional relationship tasks, showing that visual world modeling achieves higher accuracy on Cam.-Obj. and Cam.-Reg. subtasks, while implicit world modeling performs better on Obj.-Obj. and Obj.-Reg. The overall performance of visual world modeling is higher, indicating its effectiveness in spatial reasoning.

The authors train with a learning rate of 1×10⁻⁵, a batch size of 128, and a GRPO mini-batch size of 32 during RL. The KL loss coefficient is set to 0.1 for visual generation and 0.0 for verbal generation, so during RL the visual branch is regularized toward the SFT reference while the verbal branch is optimized freely by GRPO.
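
Collected in one place, the reported hyperparameters might look like the following configuration sketch (the dictionary layout and key names are illustrative, not the authors' config format):

```python
# Reported hyperparameters; the grouping and key names are assumptions.
TRAIN_CONFIG = {
    "learning_rate": 1e-5,
    "batch_size": 128,
    "grpo_mini_batch_size": 32,   # RLVR stage
    "kl_coeff_visual": 0.1,       # anchor visual generation to the SFT reference
    "kl_coeff_verbal": 0.0,       # verbal policy optimized freely by GRPO
}
```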

The authors use the VisWorld-Eval suite to evaluate advanced VLMs, reporting zero-shot performance across seven tasks. Results show that Gemini 3 Flash and Gemini 3 Pro significantly outperform other models, particularly on complex tasks like paper folding and ball tracking, though overall performance remains suboptimal.

The authors use supervised fine-tuning to evaluate the impact of different chain-of-thought formulations on multimodal reasoning tasks. Results show that visual world modeling significantly outperforms verbal and implicit world modeling on paper folding and multi-hop manipulation, where precise geometric understanding is required. For cube 3-view projection, a reconstruction task, visual world modeling likewise achieves higher accuracy and better world-model fidelity than verbal approaches, though performance plateaus and hallucinations remain an issue.

