Google Unveils Gemini 2.5 Computer Use Model with Browser-Control Capabilities
Google has unveiled Gemini 2.5 Computer Use, a new AI model designed to interact with web and mobile interfaces through a browser, marking a significant step toward autonomous AI agents that can perform real-world digital tasks. Unlike models that rely on APIs or structured data, it uses visual understanding and reasoning to navigate user interfaces the way a human would: clicking, typing, scrolling, and filling out forms. It’s part of Google’s broader push to build general-purpose AI agents capable of completing complex, multi-step tasks without direct API access.

The model is built on Gemini 2.5 Pro’s visual and reasoning capabilities and is optimized for web-based interaction. It can perform 13 specific actions, including opening a browser, typing text, dragging and dropping elements, selecting from dropdowns, and handling login screens. Google emphasizes that the model currently controls only the browser environment and cannot yet operate at the desktop OS level: it is not designed for system-level access, such as modifying files or installing software, which helps limit potential misuse. (The first sketch below illustrates how such a constrained, browser-only action vocabulary bounds what an agent can do.)

Gemini 2.5 Computer Use is available to developers through Google AI Studio and Vertex AI, and a public demo is accessible via Browserbase. There, users can watch the AI complete tasks like playing the game 2048, browsing Hacker News for trending discussions, or filling out online forms. Google notes that the demo videos are sped up threefold for brevity; in actual use, the model operates in real time.

The model works within a loop: it receives a user request, a screenshot of the current interface, and a history of prior actions; it then analyzes the screen, decides on the next step, and executes it. This closed-loop system allows for dynamic, context-aware navigation (a minimal version of the loop is sketched below). Google claims the model outperforms leading alternatives on multiple web and mobile control benchmarks, and at lower latency, making it suitable for real-time applications.

Google also presents safety as a design priority. Safeguards are built directly into the model to prevent harmful or unintended behavior, including defenses against prompt injection attacks, malicious websites, and actions that could compromise system integrity, such as bypassing CAPTCHAs or accessing sensitive systems. Developers additionally get safety controls that let them block high-risk actions, like auto-completing financial transactions or modifying system settings, before they are executed; the last sketch below shows one simple way such a pre-execution gate can work.

The release comes amid growing competition in the AI agent space. Just a day earlier, OpenAI announced new features for ChatGPT, including enhanced agent capabilities for task automation, and Anthropic had already launched a similar “computer use” feature for its Claude AI last year. Google’s approach differs by focusing exclusively on browser-level interaction and avoiding broader system access, which reduces risk while still enabling powerful automation.

Gemini 2.5 Computer Use is aimed at use cases like UI testing, automating repetitive web tasks, assisting users with complex online processes, and enabling AI agents to act on behalf of users in environments without APIs. It represents a move toward practical, real-world AI assistants that can operate in the messy, unstructured world of graphical interfaces. While still in its early stages, the model signals Google’s commitment to building responsible, capable AI agents that can navigate the web with human-like precision.
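To make the browser-only scoping concrete, here is a minimal Python sketch of an enumerated action vocabulary. The action names are hypothetical stand-ins inspired by the actions the announcement describes; Google's actual schema is defined by its API, and nothing here is that API.

```python
# Hypothetical browser-only action vocabulary. The names below are
# illustrative, modeled on the actions described in the announcement,
# and are NOT Google's actual action schema.
ALLOWED_ACTIONS = {
    "open_web_browser",
    "navigate",
    "click_at",
    "type_text_at",
    "scroll_document",
    "select_dropdown_option",
    "drag_and_drop",
}

def validate(action_name: str) -> None:
    """Reject anything outside the browser-level vocabulary, so the agent
    can never be asked to touch files, installers, or OS settings."""
    if action_name not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action_name!r}")
```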
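The closed loop the article describes is a perceive-decide-act cycle. The sketch below shows that shape under stated assumptions: `capture_screenshot`, `request_next_action`, and `execute_action` are hypothetical callables standing in for the screenshot source, the model call (through Google AI Studio or Vertex AI), and the browser driver; none of them is a real SDK function.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Action:
    """One model-proposed step, e.g. name='click_at', args={'x': 120, 'y': 340}."""
    name: str
    args: dict = field(default_factory=dict)

def run_agent(
    goal: str,                                      # the user's request
    capture_screenshot: Callable[[], bytes],        # current UI as pixels
    request_next_action: Callable[..., Action],     # one model call per step
    execute_action: Callable[[Action], None],       # browser driver
    max_steps: int = 50,
) -> list[Action]:
    """Perceive-decide-act loop: goal + screenshot + history in, one action out."""
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screenshot()
        action = request_next_action(goal=goal, screenshot=screenshot, history=history)
        if action.name == "done":      # the model signals task completion
            return history
        execute_action(action)         # click, type, scroll, drag, ...
        history.append(action)         # fed back to the model on the next turn
    raise TimeoutError("goal not reached within the step budget")
```

Feeding the full action history back on every turn is what makes the navigation context-aware: the model can recover from a misclick because it sees both what it tried and what the screen now shows.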
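One plausible shape for the developer-facing safety controls is a policy check wedged between the model's proposed action and the browser driver. This is again a sketch, not Google's actual control surface: the `HIGH_RISK` set and the human `confirm` hand-off are assumptions.

```python
from typing import Callable

# Assumed high-risk categories; Google's real controls define their own taxonomy.
HIGH_RISK = {"submit_payment", "change_system_setting", "delete_account"}

class ActionBlocked(Exception):
    """Raised when policy refuses a proposed action."""

def guarded_execute(
    action_name: str,
    execute: Callable[[], None],       # closure that would perform the action
    confirm: Callable[[str], bool],    # e.g. surface the step to a human reviewer
) -> None:
    """Run the model's proposed action only if policy allows it."""
    if action_name in HIGH_RISK and not confirm(action_name):
        raise ActionBlocked(f"blocked high-risk action: {action_name!r}")
    execute()
```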
As AI continues to evolve, the ability to interact with interfaces will be key to unlocking broader automation, making this a pivotal development in the journey toward truly autonomous AI systems.