NVIDIA's AI Factories Revolutionize Efficiency: Blackwell Architecture Achieves 50x Performance Boost Using the Same Energy
The More You Buy, the More You Make

When we use generative AI to answer questions or create images, large language models produce a series of intelligent tokens that combine to form the final output. This process, known as AI inference, involves generating a response to a single prompt. Agentic AI, however, goes beyond this one-shot approach. AI agents use reasoning to complete tasks, breaking them down into multiple steps, each employing different inference techniques. A single prompt, for instance, might trigger many sets of tokens to be generated before a complex job is done.

The backbone of AI inference is what we call AI factories — massive infrastructures designed to serve AI to millions of users simultaneously. These factories produce vast quantities of AI tokens, essentially turning intelligence into a commodity. In the modern AI era, this intelligence directly contributes to growing revenue and profits, and the sustainability of that growth hinges on how efficiently the AI factory scales.

AI factories must balance two crucial demands to deliver optimal inference: speed per user and overall system throughput. To achieve this, they rely on scaling to more floating-point operations per second (FLOPS) and higher bandwidth, which lets them batch and process AI workloads efficiently. Their ultimate limitation, however, is the power they can access. For example, in a 1-megawatt AI factory, NVIDIA's Hopper architecture can generate up to 180,000 tokens per second (TPS) at maximum throughput, or 225 TPS for a single user at peak speed.

The true measure of efficiency lies in how well the AI factory handles the varied performance demands of different workloads. Each point on the performance curve represents a batch of tasks the factory processes, each with unique requirements. NVIDIA GPUs excel at managing this full spectrum of workloads thanks to their programmability through NVIDIA CUDA software.
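The tension between per-user speed and total throughput can be sketched with a toy model. The saturation curve below is purely illustrative — only the 225 TPS and 180,000 TPS endpoints come from the article; the batch sizes and the shape of the curve are hypothetical stand-ins, not published NVIDIA benchmarks.

```python
# Illustrative sketch of the AI-factory tradeoff between per-user speed and
# total throughput under a fixed power budget. The 225 / 180,000 TPS endpoints
# are from the article; the saturation model itself is a hypothetical toy.

def factory_throughput(batch_size: int,
                       peak_tps_single_user: float = 225.0,
                       max_total_tps: float = 180_000.0) -> tuple[float, float]:
    """Return (tokens/sec per user, total tokens/sec) for a given batch size.

    Assumes a simple saturating model: larger batches push aggregate
    throughput toward the factory's power-limited ceiling, but dilute
    the speed any single user experiences.
    """
    # Toy saturation curve: total TPS approaches the ceiling as batches grow.
    total_tps = max_total_tps * batch_size / (batch_size + 500)
    per_user_tps = min(peak_tps_single_user, total_tps / batch_size)
    return per_user_tps, total_tps

# Sweep batch sizes to trace the speed-vs-throughput curve.
for b in (1, 32, 1024, 16_384):
    per_user, total = factory_throughput(b)
    print(f"batch={b:>6}  per-user={per_user:7.1f} TPS  total={total:9.0f} TPS")
```

Each batch size is one point on the performance curve the article describes: small batches serve each user fastest, huge batches maximize the factory's aggregate token output, and the power budget caps the whole frontier.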
The next evolution in AI factory efficiency comes from the NVIDIA Blackwell architecture. With optimizations in both software and hardware, Blackwell can do much more with the same amount of energy, pushing the performance curve further out. By leveraging NVIDIA Dynamo, a new operating system for AI factories, developers can optimize workloads autonomously. Dynamo breaks inference tasks into smaller components, dynamically routing them to the most suitable compute resources available at any given time.

The results are impressive. A single generational leap from the Hopper to the Blackwell architecture can yield a 50-fold increase in AI reasoning performance using the same energy. This integration of hardware and software delivers significant speed and efficiency gains even between chip architecture updates.

NVIDIA continues to push these boundaries, enhancing performance from hardware to software and from compute to networking. Each advancement brings substantial improvements, enabling AI to contribute trillions of dollars in productivity to NVIDIA's partners and customers worldwide. These advancements also move us closer to solving major global challenges such as curing diseases, reversing climate change, and uncovering the mysteries of the universe. In essence, compute is transforming into capital and driving unprecedented progress.
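The routing idea behind Dynamo can be illustrated conceptually. The sketch below is not Dynamo's actual API — all class and worker names are invented for illustration. It shows the general pattern the article describes: an inference request is split into stages (here, a compute-heavy prefill and a latency-sensitive decode), and each stage is routed to the least-loaded worker suited to it.

```python
# Conceptual sketch of disaggregated inference scheduling, in the spirit of
# what the article attributes to NVIDIA Dynamo. All names (Worker, Scheduler,
# "prefill"/"decode" roles) are illustrative assumptions, not Dynamo's API.

from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    role: str                      # "prefill" or "decode"
    load: int = 0                  # stages currently assigned to this worker

    def assign(self, stage: str) -> str:
        self.load += 1
        return f"{stage} -> {self.name}"

@dataclass
class Scheduler:
    workers: list[Worker] = field(default_factory=list)

    def route(self, stage: str) -> str:
        # Pick the least-loaded worker whose role matches the stage.
        candidates = [w for w in self.workers if w.role == stage]
        return min(candidates, key=lambda w: w.load).assign(stage)

    def serve(self, prompt: str) -> list[str]:
        # Each request decomposes into prefill (ingest the prompt) and
        # decode (generate tokens); the two stages are routed independently.
        return [self.route("prefill"), self.route("decode")]

sched = Scheduler([
    Worker("gpu-0", "prefill"), Worker("gpu-1", "prefill"),
    Worker("gpu-2", "decode"),  Worker("gpu-3", "decode"),
])
for prompt in ("summarize report", "draft email", "plan itinerary"):
    print(sched.serve(prompt))
```

Separating the stages lets each pool of hardware be sized and batched for its own workload profile, which is the efficiency lever the article credits for pushing the performance curve outward.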