
DeepMind's "Chain-of-Frames" Demonstrates Veo 3's Visual Reasoning Capabilities


Google DeepMind has published a paper introducing a novel theoretical framework to explain the emerging capabilities of its generative video model, Veo 3. The study, titled “Video models are zero-shot learners and reasoners,” presents evidence that Veo 3 exhibits sophisticated zero-shot learning and reasoning abilities, capabilities previously associated primarily with large language models (LLMs). Central to this claim is a new concept: the “chain-of-frames” (CoF), a visual counterpart to the well-known “chain-of-thought” (CoT) mechanism in language models. The research team analyzed over 18,000 generated videos to systematically evaluate Veo 3’s ability to solve a wide range of tasks, from basic perception to complex visual reasoning, without any fine-tuning on specific problems. The findings suggest that, much as LLMs transformed natural language processing by letting a single model handle diverse tasks through prompting, generative video models are now emerging as universal foundation models for computer vision.

In recent years, NLP has moved from task-specific models to unified, prompt-driven systems. Computer vision, by contrast, still largely relies on specialized models, such as YOLO for object detection or Segment Anything for image segmentation, and lacks a general-purpose system capable of open-ended visual problem-solving. DeepMind argues that the same principle driving LLM success, training on massive and diverse datasets, can unlock similar breakthroughs in video generation.

The key innovation lies in the chain-of-frames concept. While LLMs break reasoning down into sequential text-based steps, video models naturally generate content over time and space: each frame in a sequence represents one step in a dynamic process, giving the model an inherent structure for step-by-step visual reasoning. This temporal progression is what enables chain-of-frames reasoning: the model applies changes across frames to simulate planning, physical understanding, and problem-solving.

To evaluate Veo 3’s capabilities, the team developed a four-tiered framework: Perception, Modeling, Manipulation, and Reasoning. At the perception level, Veo 3 delivered zero-shot performance on classic computer vision tasks, including image segmentation, edge detection, keypoint localization, super-resolution, blind deblurring, and denoising, without explicit training on any of them. These emergent abilities suggest that future video models could replace many specialized vision tools.

Building on perception, the model shows strong physical modeling capabilities. It understands rigid- and soft-body dynamics, surface interactions, and fundamental physics such as buoyancy, air resistance, refraction, and reflection. In a “visual Jenga” task, Veo 3 removed blocks in physically plausible ways. It also grasps object functionality, determining, for example, which items can fit into a backpack, and maintains continuity across time and camera movement, enabling longer-term scene understanding.

The manipulation layer reveals Veo 3’s ability to perform diverse zero-shot image editing: background removal, style transfer, colorization, image restoration, and even edits directed by hand-drawn sketches. It can combine disparate objects into a coherent scene or transform a casual selfie into a professional headshot. These skills suggest the model can simulate complex interactions, such as demonstrating how to roll a burrito or enabling a robotic arm to pick up a hammer with human-like dexterity.
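To make the prompting paradigm concrete, here is a minimal sketch of how such tasks might be phrased as zero-shot video-generation requests. The `VideoTask` record, the helper names, and the frame counts are illustrative assumptions for this article, not DeepMind’s actual interface; the point is only that the task specification is a natural-language prompt plus an input image.

```python
from dataclasses import dataclass

@dataclass
class VideoTask:
    """Illustrative request record; not a real Veo API object."""
    prompt: str       # natural-language statement of the vision task
    image_path: str   # input image that conditions the first frame
    num_frames: int   # length of the generated chain of frames

def edge_detection(image_path: str) -> VideoTask:
    # Perception tier: classic edge detection phrased as a prompt.
    return VideoTask(
        prompt=("Gradually fade all color and texture to black while "
                "tracing every object outline in white, ending on a "
                "clean edge map."),
        image_path=image_path,
        num_frames=16,
    )

def background_removal(image_path: str) -> VideoTask:
    # Manipulation tier: zero-shot editing phrased the same way.
    return VideoTask(
        prompt=("Keep the foreground subject fixed and dissolve the "
                "background to solid white over the course of the clip."),
        image_path=image_path,
        num_frames=16,
    )
```

In this reading, the “answer” to a task is simply read off the final frame (or the frame sequence) of the generated video; no task-specific training or weight updates are involved.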
Finally, at the reasoning level, the chain-of-frames mechanism shines. In a maze-solving task, Veo 3 generates a sequence of frames showing a red square moving along a white path to reach a green goal. On 5x5 grid mazes it achieves a 78% success rate at pass@10, a dramatic improvement over Veo 2’s 14% (a short sketch of the pass@k convention appears at the end of this article). The model also succeeds at visual sequence completion, color matching, simple Sudoku solving, and symmetry completion.

Compared with static image models such as Nano Banana and language models such as Gemini 2.5 Pro, Veo 3 comes out ahead on visual reasoning tasks. Image models struggle with process-based problems, and language models falter when interpreting visual inputs directly; video models instead leverage frame-by-frame evolution to reason through visual challenges.

Despite these advances, Veo 3 still lags behind specialized models in many areas, mirroring the early stage of LLM development. The computational cost of video generation also remains high, but DeepMind points to a historical trend: LLM inference costs have dropped by a factor of 9 to 900 per year. The same trajectory is expected in video AI, where once-prohibitive general-purpose models eventually surpass task-specific ones thanks to their flexibility and falling cost.

The paper concludes that Veo 3 is not just a video generator but a nascent visual intelligence system, capable of perception, memory, planning, and reasoning, powered by the emerging chain-of-frames paradigm. This marks a pivotal step toward a new era of general-purpose visual AI.
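A brief note on the pass@10 figure quoted above: the article does not spell out the estimator, but the standard convention (popularized by Chen et al.’s 2021 Codex evaluation) scores a maze as solved if at least one of k sampled videos reaches the goal, using an unbiased estimate computed from n total samples. A minimal sketch under that assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total videos sampled for one maze
    c: how many of them reach the green goal
    k: budget being scored (here, 10)
    Returns the probability that at least one of k draws
    (without replacement) from the n samples is a success.
    """
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = k = 10 this reduces to "solved at least once in ten tries":
assert pass_at_k(n=10, c=0, k=10) == 0.0
assert pass_at_k(n=10, c=1, k=10) == 1.0
# The reported 78% would then be this value averaged over all 5x5 mazes.
```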
