I met my wife learning to dance Argentine tango. In tango you cannot fake your way through the steps. You have to feel the rhythm, listen to the moment, and respond in real time. Push too hard and the whole thing breaks. Find the groove and everything opens up.
In 2025, AI had its own tango moment.
For most of the last decade, people measured AI progress by the size of the model they could train. Bigger clusters, bigger budgets, bigger number. The ground shifted. The industry began to understand, not in theory but in practice, that inference speed is not a bragging point. It is the real constraint that determines what AI systems can do in the world.
At Cerebras, we have believed this for years. We built the largest chip ever made because it was the only way to address latency, throughput, and cost at their source. GPU systems were never designed for this phase of AI, and incremental improvement was not going to get the job done. In 2025, that belief finally aligned with what customers experienced directly.
The year opened with DeepSeek R1 Llama 70B running at more than 1,500 tokens per second, roughly 57 times faster than GPUs. That result jolted the community. It showed that inference could move from bottleneck to enabler. It also marked the beginning of a wave of adoption. Our early partnership announcements with Hugging Face and Mayo Clinic showed how fast training and fast inference open new paths in open-source and genomic AI.
Production systems quickly followed. Perplexity Sonar and Mistral Le Chat emerged as early examples of what real-time, high-quality responses feel like when latency disappears. AlphaSense demonstrated how a tenfold improvement in speed could reshape business analysis and decision making. New state-of-the-art language models developed with Inception and MBZUAI pushed sovereign AI forward in ways that were only possible at very high speeds.
Independent testing from Artificial Analysis soon confirmed what we were seeing in the field. Their evaluation showed Cerebras outperforming NVIDIA Blackwell's inference records for leading models from US and global AI labs.
This moment validated that wafer-scale computing behaves differently, and that it is now the best architecture for large-model inference.
The same pattern repeated throughout the year. Faster inference produced smarter, more capable AI systems. Sean Lie and James Wang described this clearly in “The Cerebras Scaling Law,” which explained how models that can think more in the same amount of time produce better results. We saw this across frontier models.
- Qwen3 235B Instruct ran more than ten times faster than leading GPU clouds.
- Qwen3 Coder 480B reached 2,000 tokens per second.
- OpenAI’s gpt-oss-120B ran at 3,000 tokens per second.
These were not incremental improvements. They changed what people could build.
New products emerged from this shift. Cognition’s SWE-1.5 and the SWE-grep family delivered frontier-level coding performance up to 13x faster than general-purpose models. For the first time, developers could stay in flow while they explored codebases, shipped features, and debugged complex systems. NinjaTech AI launched Fast Deep Coder and accelerated software creation by a factor of five to ten. Rox’s team reported higher daily engagement once its Revenue Agent, powered by Cerebras, delivered responses in "instant" territory. New partnerships with AWS Marketplace, IBM watsonx, Vercel, and OpenRouter continued to open paths for all AI builders to go faster.
Our infrastructure grew considerably to keep up with this demand. We opened a forty-plus exaflop datacenter in Oklahoma and now operate more than seven datacenters across North America and Europe. And in September, we closed our Series G fundraise, bringing new long-term partners into the company and strengthening the foundation for the next phase of growth.
We continued to execute on our commitment to Sovereign AI, launching Cerebras for Nations and announcing JAIS2 with MBZUAI and Inception. We closed the year with a new commitment to the U.S. Department of Energy’s Genesis Mission, and we were honored to receive TSMC’s Demo of the Year award.
Taken together, the events of last year tell a simple story. The world’s best builders feel the difference immediately between fast and slow. And they want Cerebras. The rhythm has changed.
2025 was proof.
2026 is where we find our groove.
The floor is open. Let’s dance.