Meta unleashes Llama API running 18x faster than OpenAI: Cerebras partnership delivers 2,600 tokens per second


Meta today announced a partnership with Cerebras Systems to power its new Llama API, offering developers access to inference speeds up to 18 times faster than traditional GPU-based solutions.

The announcement, made at Meta’s inaugural LlamaCon developer conference in Menlo Park, positions the company to compete directly with OpenAI, Anthropic, and Google in the rapidly growing AI inference service market, where developers purchase tokens by the billions to power their applications.

“Meta has selected Cerebras to collaborate to deliver the ultra-fast inference that they need to serve developers through their new Llama API,” said Julie Shin Choi, chief marketing officer at Cerebras, during a press briefing. “We at Cerebras are really, really excited to announce our first CSP hyperscaler partnership to deliver ultra-fast inference to all developers.”

The partnership marks Meta’s formal entry into the business of selling AI computation, transforming its popular open-source Llama models into a commercial service. While Meta’s Llama models have accumulated over one billion downloads, until now the company had not offered a first-party cloud infrastructure for developers to build applications with them.

“This is very exciting, even without talking about Cerebras specifically,” said James Wang, a senior executive at Cerebras. “OpenAI, Anthropic, Google — they’ve built an entire new AI business from scratch, which is the AI inference business. Developers who are building AI apps will buy tokens by the millions, by the billions sometimes. And these are just like the new compute instructions that people need to build AI applications.”

A benchmark chart shows Cerebras processing Llama 4 at 2,648 tokens per second, dramatically outpacing competitors SambaNova (747), Groq (600) and GPU-based services from Google and others — explaining Meta’s hardware choice for its new API. (Credit: Cerebras)

Breaking the speed barrier: How Cerebras supercharges Llama models

What sets Meta’s offering apart is the dramatic speed increase provided by Cerebras’ specialized AI chips. The Cerebras system delivers over 2,600 tokens per second for Llama 4 Scout, compared to approximately 130 tokens per second for ChatGPT and around 25 tokens per second for DeepSeek, according to benchmarks from Artificial Analysis.

“If you just compare on API-to-API basis, Gemini and GPT, they’re all great models, but they all run at GPU speeds, which is roughly about 100 tokens per second,” Wang explained. “And 100 tokens per second is okay for chat, but it’s very slow for reasoning. It’s very slow for agents. And people are struggling with that today.”

This speed advantage enables entirely new categories of applications that were previously impractical, including real-time agents, conversational low-latency voice systems, interactive code generation, and instant multi-step reasoning — all of which require chaining multiple large language model calls that can now be completed in seconds rather than minutes.
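To make the latency gap concrete, here is a rough back-of-the-envelope sketch in Python. The throughput figures are the ones quoted above; the shape of the agent chain (five sequential calls, 400 generated tokens each) is an illustrative assumption, not a benchmark.

```python
# Back-of-the-envelope latency for a multi-step agent chain.
# Throughput figures come from the article; the chain shape
# (5 sequential calls, 400 generated tokens per call) is assumed.

THROUGHPUT_TOKENS_PER_SEC = {
    "Cerebras (Llama 4 Scout)": 2600,
    "Typical GPU-based service": 100,
}

STEPS = 5              # sequential LLM calls in the agent loop (assumed)
TOKENS_PER_STEP = 400  # tokens generated per call (assumed)

for name, tps in THROUGHPUT_TOKENS_PER_SEC.items():
    total_seconds = STEPS * TOKENS_PER_STEP / tps
    print(f"{name}: ~{total_seconds:.1f} s for {STEPS} chained calls")

# A typical GPU service needs roughly 20 seconds for this chain,
# while the Cerebras figure brings it under one second: the difference
# between a batch job and an interactive agent.
```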

The Llama API represents a significant shift in Meta’s AI strategy, transitioning from primarily being a model provider to becoming a full-service AI infrastructure company. By offering an API service, Meta is creating a revenue stream from its AI investments while maintaining its commitment to open models.

“Meta is now in the business of selling tokens, and it’s great for the American kind of AI ecosystem,” Wang noted during the press conference. “They bring a lot to the table.”

The API will offer tools for fine-tuning and evaluation, starting with the Llama 3.3 8B model, allowing developers to generate data, train on it, and test the quality of their custom models. Meta emphasizes that it won’t use customer data to train its own models, and models built using the Llama API can be transferred to other hosts—a clear differentiation from some competitors’ more closed approaches.
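The announcement describes the workflow (generate data, fine-tune, evaluate) but not the SDK surface, so the sketch below is purely illustrative: the base URL, endpoint paths, and payload fields are hypothetical placeholders, not Meta's documented API.

```python
import requests

# Hypothetical sketch of the generate-data / train / evaluate loop described
# above. The base URL, endpoint paths, and field names are assumptions.
BASE = "https://llama-api.example/v1"            # placeholder, not the real endpoint
HEADERS = {"Authorization": "Bearer <API_KEY>"}

# 1. Register a training dataset for the custom model.
dataset = requests.post(f"{BASE}/datasets", headers=HEADERS,
                        json={"name": "support-tickets", "format": "jsonl"}).json()

# 2. Launch a fine-tuning run on the Llama 3.3 8B base model.
job = requests.post(f"{BASE}/fine-tunes", headers=HEADERS,
                    json={"base_model": "llama-3.3-8b",
                          "dataset_id": dataset["id"]}).json()

# 3. Evaluate the resulting custom model to test its quality.
evaluation = requests.post(f"{BASE}/evaluations", headers=HEADERS,
                           json={"model_id": job["fine_tuned_model"],
                                 "dataset_id": dataset["id"]}).json()
print(evaluation["metrics"])
```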

Cerebras will power Meta’s new service through its network of data centers located throughout North America, including facilities in Dallas, Oklahoma, Minnesota, Montreal, and California.

“All of our data centers that serve inference are in North America at this time,” Choi explained. “We will be serving Meta with the full capacity of Cerebras. The workload will be balanced across all of these different data centers.”

The business arrangement follows what Choi described as “the classic compute provider to a hyperscaler” model, similar to how Nvidia provides hardware to major cloud providers. “They are reserving blocks of our compute that they can serve their developer population,” she said.

Beyond Cerebras, Meta has also announced a partnership with Groq to provide fast inference options, giving developers multiple high-performance alternatives beyond traditional GPU-based inference.

Meta’s entry into the inference API market with superior performance metrics could disrupt the established order dominated by OpenAI, Google, and Anthropic. By combining the popularity of its open-source models with dramatically faster inference capabilities, Meta is positioning itself as a formidable competitor in the commercial AI space.

“Meta is in a unique position with 3 billion users, hyper-scale datacenters, and a huge developer ecosystem,” according to Cerebras’ presentation materials. The integration of Cerebras technology “helps Meta leapfrog OpenAI and Google in performance by approximately 20x.”

For Cerebras, this partnership represents a major milestone and validation of its specialized AI hardware approach. “We have been building this wafer-scale engine for years, and we always knew that the technology’s first rate, but ultimately it has to end up as part of someone else’s hyperscale cloud. That was the final target from a commercial strategy perspective, and we have finally reached that milestone,” Wang said.

The Llama API is currently available as a limited preview, with Meta planning a broader rollout in the coming weeks and months. Developers interested in accessing the ultra-fast Llama 4 inference can request early access by selecting Cerebras from the model options within the Llama API.

“If you imagine a developer who doesn’t know anything about Cerebras because we’re a relatively small company, they can just click two buttons on Meta’s standard software SDK, generate an API key, select the Cerebras flag, and then all of a sudden, their tokens are being processed on a giant wafer-scale engine,” Wang explained. “That kind of having us be on the back end of Meta’s whole developer ecosystem is just tremendous for us.”
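Wang's description maps to a simple request flow: generate an API key, then flag Cerebras as the inference provider on each call. The sketch below illustrates that flow only; the URL, model identifier, and the `provider` field are assumptions rather than the documented request format.

```python
import requests

# Illustrative only: the endpoint URL, model name, and "provider" field are assumed.
resp = requests.post(
    "https://llama-api.example/v1/chat/completions",   # placeholder endpoint
    headers={"Authorization": "Bearer <API_KEY>"},      # key generated from the SDK
    json={
        "model": "llama-4-scout",     # assumed model identifier
        "provider": "cerebras",       # the "Cerebras flag" Wang describes
        "messages": [{"role": "user", "content": "Draft a release plan."}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```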

Meta’s choice of specialized silicon signals something profound: in the next phase of AI, it’s not just what your models know, but how quickly they can think. In that future, speed isn’t just a feature—it’s the whole point.


