Scaling AI inference with open-source efficiency

NVIDIA has launched Dynamo, open-source inference software designed to accelerate and scale reasoning models within AI factories.
Efficiently managing and coordinating AI inference requests across a fleet of GPUs is critical to ensuring that AI factories operate cost-effectively and maximise token revenue.
As AI reasoning becomes increasingly prevalent, each AI model is expected to generate tens of thousands of tokens with every prompt, essentially representing its “thinking” process. Enhancing inference performance while simultaneously reducing its cost is therefore crucial for accelerating growth and boosting revenue opportunities for service providers.
A new generation of AI inference software
NVIDIA Dynamo, which succeeds the NVIDIA Triton Inference Server, represents a new generation of AI inference software specifically engineered to maximise token revenue generation for AI factories deploying reasoning AI models.
Dynamo orchestrates and accelerates inference communication across potentially thousands of GPUs. It employs disaggregated serving, a technique that separates the processing and generation phases of large language models (LLMs) onto distinct GPUs. This approach allows each phase to be optimised independently, catering to its specific computational needs and ensuring maximum utilisation of GPU resources.
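For readers who want to see the idea in code, the following is a minimal, framework-agnostic Python sketch of disaggregated serving: the prompt-processing (prefill) phase and the token-generation (decode) phase are handled by separate workers, as they would be on separate GPU pools. The class and method names are illustrative and do not reflect Dynamo's actual API.

```python
# Minimal sketch of disaggregated serving: the prefill (prompt-processing) phase
# and the decode (token-generation) phase run on separate workers, mirroring how
# the two phases can be placed on distinct GPUs. All names here are illustrative;
# this is not NVIDIA Dynamo's actual API.
from dataclasses import dataclass


@dataclass
class KVCacheHandle:
    """Opaque reference to the attention KV cache produced during prefill."""
    request_id: str
    prompt_tokens: list[str]


class PrefillWorker:
    """Runs the compute-bound prompt-processing phase (would live on GPU pool A)."""

    def prefill(self, request_id: str, prompt: str) -> KVCacheHandle:
        tokens = prompt.split()  # stand-in for real tokenisation
        # A real system would run the model forward pass here, keep the KV cache
        # in GPU memory, and hand back only a handle to it.
        return KVCacheHandle(request_id=request_id, prompt_tokens=tokens)


class DecodeWorker:
    """Runs the memory-bandwidth-bound generation phase (would live on GPU pool B)."""

    def decode(self, kv: KVCacheHandle, max_new_tokens: int = 4) -> list[str]:
        # A real decoder attends over the transferred KV cache; here we just
        # emit placeholder tokens to show the control flow.
        return [f"<tok{i}>" for i in range(max_new_tokens)]


if __name__ == "__main__":
    prefill_pool, decode_pool = PrefillWorker(), DecodeWorker()
    kv = prefill_pool.prefill("req-1", "Explain disaggregated serving")
    print(decode_pool.decode(kv))
```

Because the two workers exchange only a cache handle, each pool can be scaled and tuned for its own bottleneck (compute for prefill, memory bandwidth for decode).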
“Industries around the world are training AI models to think and learn in different ways, making them more sophisticated over time,” stated Jensen Huang, founder and CEO of NVIDIA. “To enable a future of custom reasoning AI, NVIDIA Dynamo helps serve these models at scale, driving cost savings and efficiencies across AI factories.”
Using the same number of GPUs, Dynamo has demonstrated the ability to double the performance and revenue of AI factories serving Llama models on NVIDIA’s current Hopper platform. Furthermore, when running the DeepSeek-R1 model on a large cluster of GB200 NVL72 racks, NVIDIA Dynamo’s intelligent inference optimisations have been shown to boost the number of tokens generated per GPU by over 30 times.
To achieve these improvements in inference performance, NVIDIA Dynamo incorporates several key features designed to increase throughput and reduce operational costs.
Dynamo can dynamically add, remove, and reallocate GPUs in real time to adapt to fluctuating request volumes and types. The software can pinpoint the specific GPUs within large clusters that are best suited to minimise response computations, and route queries to them efficiently. It can also offload inference data to more cost-effective memory and storage devices, retrieving it rapidly when required to minimise overall inference costs.
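The offloading behaviour can be pictured as a tiered cache: hot inference data stays in scarce GPU memory while colder entries spill to cheaper host memory or storage and are reloaded on demand. The sketch below illustrates that idea with a simple LRU policy; the capacity, tier names, and eviction rule are assumptions for the example, not Dynamo's implementation.

```python
# Hypothetical sketch of tiered KV-cache offloading: hot entries stay in (simulated)
# GPU memory, colder entries spill to cheaper host memory and are reloaded on demand.
# This illustrates the cost-saving idea only; it is not Dynamo's memory manager.
from collections import OrderedDict


class TieredKVStore:
    def __init__(self, gpu_capacity: int = 2):
        self.gpu_capacity = gpu_capacity
        self.gpu_tier: OrderedDict[str, bytes] = OrderedDict()  # fast, scarce
        self.host_tier: dict[str, bytes] = {}                   # slower, cheap

    def put(self, key: str, blob: bytes) -> None:
        self.gpu_tier[key] = blob
        self.gpu_tier.move_to_end(key)
        while len(self.gpu_tier) > self.gpu_capacity:           # evict LRU entry to host
            old_key, old_blob = self.gpu_tier.popitem(last=False)
            self.host_tier[old_key] = old_blob

    def get(self, key: str) -> bytes:
        if key in self.gpu_tier:                                 # hit in GPU tier
            self.gpu_tier.move_to_end(key)
            return self.gpu_tier[key]
        blob = self.host_tier.pop(key)                           # reload from host tier
        self.put(key, blob)
        return blob


if __name__ == "__main__":
    store = TieredKVStore()
    for i in range(3):
        store.put(f"req-{i}", b"kv-bytes")
    print(list(store.gpu_tier), list(store.host_tier))           # req-0 spilled to host
    print(store.get("req-0") == b"kv-bytes")                     # reloaded on demand
```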
NVIDIA Dynamo is being released as a fully open-source project, offering broad compatibility with popular frameworks such as PyTorch, SGLang, NVIDIA TensorRT-LLM, and vLLM. This open approach supports enterprises, startups, and researchers in developing and optimising novel methods for serving AI models across disaggregated inference infrastructures.
NVIDIA expects Dynamo to accelerate the adoption of AI inference across a wide range of organisations, including major cloud providers and AI innovators like AWS, Cohere, CoreWeave, Dell, Fireworks, Google Cloud, Lambda, Meta, Microsoft Azure, Nebius, NetApp, OCI, Perplexity, Together AI, and VAST.
NVIDIA Dynamo: Supercharging inference and agentic AI
A key innovation of NVIDIA Dynamo lies in its ability to map the knowledge that inference systems hold in memory from serving previous requests, known as the KV cache, across potentially thousands of GPUs.
The software then intelligently routes new inference requests to the GPUs that possess the best knowledge match, effectively avoiding costly recomputations and freeing up other GPUs to handle new incoming requests. This smart routing mechanism significantly enhances efficiency and reduces latency.
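Conceptually, this routing can be reduced to picking the worker whose cached prompt prefixes overlap most with the incoming request, so less of the prompt has to be recomputed. The sketch below shows that idea with a longest-shared-prefix heuristic; the worker names and scoring rule are illustrative assumptions rather than Dynamo's actual routing policy.

```python
# Illustrative sketch of KV-cache-aware ("smart") routing: send each new request to
# the worker whose cached prompts overlap it most, so less of the prompt has to be
# recomputed. Worker names and the overlap metric are assumptions for the example.
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt_tokens: list[str], worker_caches: dict[str, list[list[str]]]) -> str:
    """Pick the worker with the longest cached prefix match for this prompt."""
    best_worker, best_overlap = None, -1
    for worker, cached_prompts in worker_caches.items():
        overlap = max((shared_prefix_len(prompt_tokens, c) for c in cached_prompts),
                      default=0)
        if overlap > best_overlap:
            best_worker, best_overlap = worker, overlap
    return best_worker


if __name__ == "__main__":
    caches = {
        "gpu-0": [["summarise", "this", "report"]],
        "gpu-1": [["translate", "this", "report", "into", "french"]],
    }
    print(route(["translate", "this", "report", "into", "german"], caches))  # -> gpu-1
```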
“To handle hundreds of millions of requests monthly, we rely on NVIDIA GPUs and inference software to deliver the performance, reliability and scale our business and users demand,” said Denis Yarats, CTO of Perplexity AI.
“We look forward to leveraging Dynamo, with its enhanced distributed serving capabilities, to drive even more inference-serving efficiencies and meet the compute demands of new AI reasoning models.”
AI platform Cohere is already planning to leverage NVIDIA Dynamo to enhance the agentic AI capabilities within its Command series of models.
“Scaling advanced AI models requires sophisticated multi-GPU scheduling, seamless coordination and low-latency communication libraries that transfer reasoning contexts seamlessly across memory and storage,” explained Saurabh Baji, SVP of engineering at Cohere.
“We expect NVIDIA Dynamo will help us deliver a premier user experience to our enterprise customers.”
Support for disaggregated serving
The NVIDIA Dynamo inference platform also features robust support for disaggregated serving. This advanced technique assigns the different computational phases of LLMs – including the crucial steps of understanding the user query and then generating the most appropriate response – to different GPUs within the infrastructure.
Disaggregated serving is particularly well-suited for reasoning models, such as the new NVIDIA Llama Nemotron model family, which employs advanced inference techniques for improved contextual understanding and response generation. By allowing each phase to be fine-tuned and resourced independently, disaggregated serving improves overall throughput and delivers faster response times to users.
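One way to picture independent resourcing is to size the prefill and decode pools separately from measured per-phase costs. The toy sketch below does exactly that; the timings and function names are illustrative assumptions, not measurements of any real deployment.

```python
# Hedged sketch of resourcing the two phases independently: given per-request prefill
# and decode times, compute how many GPUs each pool needs to sustain a target request
# rate. Numbers and names are illustrative, not measured Dynamo data.
import math


def pool_sizes(target_requests_per_s: float,
               prefill_s_per_request: float,
               decode_s_per_request: float) -> tuple[int, int]:
    """Size the prefill and decode GPU pools separately for the same request rate."""
    prefill_gpus = math.ceil(target_requests_per_s * prefill_s_per_request)
    decode_gpus = math.ceil(target_requests_per_s * decode_s_per_request)
    return prefill_gpus, decode_gpus


if __name__ == "__main__":
    # Decode (long generations from reasoning models) often dominates, so it gets
    # more GPUs than prefill for the same request rate.
    print(pool_sizes(target_requests_per_s=10,
                     prefill_s_per_request=0.2,
                     decode_s_per_request=1.5))   # -> (2, 15)
```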
Together AI, a prominent player in the AI Acceleration Cloud space, is also looking to integrate its proprietary Together Inference Engine with NVIDIA Dynamo. This integration aims to enable seamless scaling of inference workloads across multiple GPU nodes. Furthermore, it will allow Together AI to dynamically address traffic bottlenecks that may arise at various stages of the model pipeline.
“Scaling reasoning models cost effectively requires new advanced inference techniques, including disaggregated serving and context-aware routing,” stated Ce Zhang, CTO of Together AI.
“The openness and modularity of NVIDIA Dynamo will allow us to seamlessly plug its components into our engine to serve more requests while optimising resource utilisation—maximising our accelerated computing investment. We’re excited to leverage the platform’s breakthrough capabilities to cost-effectively bring open-source reasoning models to our users.”
Four key innovations of NVIDIA Dynamo
NVIDIA has highlighted four key innovations within Dynamo that contribute to reducing inference serving costs and enhancing the overall user experience:
- GPU Planner: A sophisticated planning engine that dynamically adds and removes GPUs based on fluctuating user demand. This ensures optimal resource allocation, preventing both over-provisioning and under-provisioning of GPU capacity (a simplified sketch of this scaling logic follows the list).
- Smart Router: An intelligent, LLM-aware router that directs inference requests across large fleets of GPUs. Its primary function is to minimise costly GPU recomputations of repeat or overlapping requests, thereby freeing up valuable GPU resources to handle new incoming requests more efficiently.
- Low-Latency Communication Library: An inference-optimised library designed to support state-of-the-art GPU-to-GPU communication. It abstracts the complexities of data exchange across heterogeneous devices, significantly accelerating data transfer speeds.
- Memory Manager: An intelligent engine that manages the offloading and reloading of inference data to and from lower-cost memory and storage devices. This process is designed to be seamless, ensuring no negative impact on the user experience.
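As a rough illustration of the GPU Planner idea referenced above, the toy function below sizes a GPU fleet from the current request queue, bounded by minimum and maximum fleet sizes. The thresholds and parameters are invented for the example and are not Dynamo's actual policy.

```python
# Toy planner in the spirit of the GPU Planner described above: scale the GPU count
# toward the current queue depth, bounded by a min/max fleet size. All numbers here
# are illustrative assumptions.
def plan_gpu_count(queued_requests: int, requests_per_gpu: int = 8,
                   min_gpus: int = 1, max_gpus: int = 64) -> int:
    """Return the GPU count needed to serve the current queue at the target rate."""
    needed = -(-queued_requests // requests_per_gpu)   # ceiling division
    return max(min_gpus, min(max_gpus, needed))


if __name__ == "__main__":
    for queue in (5, 40, 900):
        print(queue, "queued ->", plan_gpu_count(queue), "GPUs")  # 1, 5, capped at 64
```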
NVIDIA Dynamo will be made available within NIM microservices and will be supported in a future release of the company’s AI Enterprise software platform.