NVIDIA Blackwell Achieves 2.6x Performance Boost in MLPerf Training v5.0

Luisa Crawford
Jun 04, 2025 17:51
NVIDIA’s Blackwell architecture showcases significant performance improvements in MLPerf Training v5.0, delivering up to 2.6x faster training times across various benchmarks.
NVIDIA’s latest Blackwell architecture has demonstrated up to a 2.6x performance boost in the MLPerf Training v5.0 benchmarks. According to NVIDIA, this result reflects the architectural advances Blackwell delivers, especially for demanding workloads such as large language models (LLMs) and other AI applications.
Blackwell’s Architectural Innovations
Blackwell introduces several enhancements over its predecessor, the Hopper architecture. These include fifth-generation NVLink and NVLink Switch technology, which substantially increase GPU-to-GPU bandwidth, a key factor in reducing training times and raising throughput. Blackwell’s second-generation Transformer Engine and HBM3e memory further contribute to faster, more efficient model training.
These advancements have allowed NVIDIA’s GB200 NVL72 system to achieve remarkable results, such as training the Llama 3.1 405B model 2.2x faster than Hopper-based systems. The system can reach up to 1,960 TFLOPS of training throughput.
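To make the reported 2.2x figure concrete, the sketch below shows how a relative speedup translates into wall-clock training time. The baseline hours are made-up placeholders for illustration, not MLPerf-reported values.

```python
# Hypothetical illustration of what a relative speedup factor means for
# wall-clock training time. The baseline below is a placeholder, not an
# actual MLPerf measurement.
def speedup_time(baseline_hours: float, speedup: float) -> float:
    """Return the new training time given a relative speedup factor."""
    return baseline_hours / speedup

baseline = 100.0  # placeholder: hours to train on the older system
blackwell = speedup_time(baseline, 2.2)
print(f"{blackwell:.1f} h on the faster system, "
      f"{baseline - blackwell:.1f} h saved")
```

The same arithmetic applies to any of the speedup factors quoted in the results: a 2.6x speedup cuts training time to roughly 38% of the baseline.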
Performance Across Benchmarks
MLPerf Training v5.0, known for its rigorous benchmarks, includes tests across various domains like LLM pretraining, text-to-image generation, and graph neural networks. NVIDIA’s platform excelled across all seven benchmarks, showcasing its prowess in both speed and efficiency.
For instance, in LLM fine-tuning using the Llama 2 70B model, Blackwell GPUs achieved a 2.5x speedup over previous submissions using the DGX H100 system. Similarly, the Stable Diffusion v2 pretraining benchmark delivered a 2.6x per-GPU speedup and set a new record at scale.
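Per-GPU comparisons like the 2.6x figure above only make sense once throughput is normalized by GPU count, since submissions may run at different scales. A minimal sketch, using purely illustrative numbers rather than actual MLPerf results:

```python
# Sketch of per-GPU speedup normalization: divide each submission's
# throughput by its GPU count before comparing. All values here are
# hypothetical placeholders, not MLPerf data.
def per_gpu_speedup(new_throughput: float, new_gpus: int,
                    old_throughput: float, old_gpus: int) -> float:
    """Ratio of per-GPU throughput between two submissions."""
    return (new_throughput / new_gpus) / (old_throughput / old_gpus)

# Example: 5200 samples/s on 8 GPUs vs. 2000 samples/s on 8 GPUs
print(per_gpu_speedup(5200.0, 8, 2000.0, 8))  # -> 2.6
```

Normalizing this way lets results at different cluster sizes be compared on an equal footing.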
Implications and Future Prospects
The improvements in performance not only highlight the capabilities of the Blackwell architecture but also pave the way for faster deployment of AI models. Faster training and fine-tuning mean that organizations can bring their AI applications to market more quickly, enhancing their competitive edge.
NVIDIA’s continued focus on optimizing its software stack, including libraries like cuBLAS and cuDNN, plays a crucial role in these performance gains. These optimizations help exploit Blackwell’s enhanced computational power, particularly with low-precision AI data formats.
With these developments, NVIDIA is poised to further its leadership in AI hardware, offering solutions that meet the growing demands of complex and large-scale AI models.
For more detailed insights into NVIDIA’s performance in MLPerf Training v5.0, visit the NVIDIA blog.
Image source: Shutterstock