Iris Coleman
Apr 01, 2026 15:38
NVIDIA’s Blackwell Ultra GPUs set new MLPerf Inference records with 2.7x faster DeepSeek-R1 processing, hitting 2.5 million tokens per second across 288 GPUs.
NVIDIA’s Blackwell Ultra GPUs have delivered record-breaking performance in the latest MLPerf Inference v6.0 benchmarks, achieving up to 2.7x faster token throughput compared to submissions just six months ago. The results, published April 1, 2026, push NVIDIA’s cumulative MLPerf wins to 291—nine times more than all other submitters combined since 2018.
The standout figure: four GB300 NVL72 systems running 288 Blackwell Ultra GPUs processed 2.49 million tokens per second on DeepSeek-R1 in offline mode. That’s the largest GPU configuration ever submitted to any MLPerf Inference benchmark.
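A quick back-of-the-envelope check in plain Python, using only the figures quoted above, shows what that aggregate number works out to per GPU:

```python
# Sanity-check the aggregate DeepSeek-R1 offline figure on a per-GPU basis,
# using only the numbers quoted in this article.
total_tokens_per_sec = 2_490_000  # four GB300 NVL72 systems, offline mode
gpu_count = 288                   # Blackwell Ultra GPUs across four racks

print(f"{total_tokens_per_sec / gpu_count:,.0f} tokens/sec per GPU")
# -> 8,646 tokens/sec per GPU
```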
Software Optimization Drives Massive Gains
What’s particularly striking isn’t just raw hardware muscle; it’s how much performance NVIDIA extracted from the same silicon through software improvements. The GB300 NVL72 delivered 8,064 tokens per second per GPU in DeepSeek-R1’s server scenario, up from 2,907 tokens per second per GPU six months prior. Same chips, 2.77x more output.
The performance jump came from several TensorRT-LLM enhancements: faster fused kernels, optimized attention data parallelism, and better load balancing across ranks. For the new DeepSeek-R1 Interactive scenario, which requires minimum token rates 5x higher than the standard server scenario, NVIDIA deployed disaggregated serving, Wide Expert Parallel sharding, and multi-token prediction to hit 250,634 tokens per second.
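TensorRT-LLM wires these features up through its own serving stack, and nothing below is that API. The sketch is only a minimal illustration of the disaggregated-serving idea, with every class and name hypothetical: prefill (prompt processing) and decode (token generation) run in separate worker pools, so each phase can be batched and scaled independently.

```python
# Illustration only: the core idea behind disaggregated serving.
# All names here are hypothetical; TensorRT-LLM's real configuration differs.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    max_new_tokens: int

class PrefillWorker:
    """Compute-bound phase: processes the full prompt once, emits a KV cache."""
    def run(self, req: Request) -> dict:
        return {"kv_cache": f"<kv for {len(req.prompt)} chars>", "req": req}

class DecodeWorker:
    """Bandwidth-bound phase: generates tokens a few at a time from the cache."""
    def run(self, state: dict) -> str:
        return f"<{state['req'].max_new_tokens} tokens from cached prompt>"

# Separating the pools lets an operator size the prefill fleet for prompt
# length and the decode fleet for concurrent generation, instead of
# forcing one batch shape onto both phases.
state = PrefillWorker().run(Request("Explain MLPerf in one line.", 64))
print(DecodeWorker().run(state))
```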
Partner Nebius achieved the 2.7x speedup, demonstrating how NVIDIA’s open software stack enables ecosystem optimization. The practical implication? Token production costs dropped by over 60% on existing infrastructure.
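The cost claim follows directly from the throughput numbers: on fixed hardware, cost per token scales inversely with per-GPU throughput, so a 2.77x speedup implies roughly a 64% reduction. Checking with the article's own figures:

```python
# Per-token cost scales inversely with per-GPU throughput on fixed hardware.
old_rate = 2_907  # tokens/sec per GPU, six months ago (server scenario)
new_rate = 8_064  # tokens/sec per GPU, MLPerf Inference v6.0

print(f"speedup: {new_rate / old_rate:.2f}x")             # -> 2.77x
print(f"per-token cost: -{1 - old_rate / new_rate:.0%}")  # -> -64%, matching the >60% claim
```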
First and Only Across New Benchmarks
MLPerf v6.0 introduced several demanding new tests, and NVIDIA was the sole platform to submit results across all of them:
- Qwen3-VL-235B-A22B: The first multimodal vision-language model in MLPerf, hitting 79 samples/sec offline
- GPT-OSS-120B: OpenAI’s 120B-parameter MoE reasoning model, achieving 1.05 million tokens/sec offline
- WAN-2.2-T2V-A14B: Text-to-video generation at 21 seconds latency in single-stream mode
- DLRMv3: Transformer-based recommendation benchmark at 104,637 samples/sec
The multimodal Qwen3-VL submission used the vLLM open-source framework, while video generation ran on TensorRT-LLM VisualGen—both indicating how quickly the open-source ecosystem is building optimized pipelines for next-generation workloads.
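vLLM's high-level API is compact enough to sketch here. The snippet below is a minimal, assumption-laden example, not NVIDIA's submission code: the Hugging Face model ID is a guess based on Qwen's naming, a 235B-parameter MoE model will not fit on a single GPU (hence the tensor-parallel hint), and vision inputs go through vLLM's model-specific multimodal input path rather than the plain text prompt shown.

```python
# Minimal vLLM offline-inference sketch. The model ID is an assumption;
# check the model card for the real identifier and chat template.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-VL-235B-A22B",  # hypothetical HF model ID
    tensor_parallel_size=8,           # shard weights across 8 GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Describe what MLPerf Inference measures."], params)
print(outputs[0].outputs[0].text)
```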
Partner Ecosystem Shows Depth
Fourteen partners submitted results on the NVIDIA platform this round, the largest partner participation for any single platform in MLPerf history. ASUS, Cisco, CoreWeave, Dell, Google Cloud, HPE, Lenovo, and Supermicro were among those delivering competitive performance numbers, suggesting the Blackwell architecture has matured enough for broad enterprise deployment.
This breadth matters for AI infrastructure buyers evaluating vendor lock-in risk. The results arrived the same week NVIDIA announced a $2 billion strategic investment in Marvell Technology to expand AI infrastructure options, signaling the company’s push to position itself as the foundational layer for AI computing rather than a single-vendor solution.
What Comes Next
NVIDIA is leading development of MLPerf Endpoints, a new benchmark designed to measure real-world API performance under production traffic conditions. Current chip-level benchmarks can’t capture latency spikes, queuing behavior, or throughput degradation under sustained load—metrics that actually determine AI service economics.
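To make the distinction concrete, the sketch below measures the kind of tail-latency statistics an endpoint-level benchmark cares about by firing concurrent requests at an OpenAI-compatible API. The URL and model name are placeholders, and this is not the MLPerf Endpoints harness; it only shows what offline throughput numbers leave out.

```python
# Endpoint-level measurement: concurrent requests against an
# OpenAI-compatible API, reporting tail latency rather than raw throughput.
# The endpoint URL and model name are placeholders.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # placeholder
PAYLOAD = {"model": "deepseek-r1", "prompt": "ping", "max_tokens": 16}

def timed_request(_):
    start = time.perf_counter()
    resp = requests.post(ENDPOINT, json=PAYLOAD, timeout=60)
    resp.raise_for_status()
    return time.perf_counter() - start

# 64 concurrent clients expose the queuing delay that single-request,
# chip-level benchmarks never see.
with ThreadPoolExecutor(max_workers=64) as pool:
    latencies = sorted(pool.map(timed_request, range(256)))

print(f"p50: {statistics.median(latencies) * 1000:.0f} ms")
print(f"p99: {latencies[int(len(latencies) * 0.99)] * 1000:.0f} ms")
```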
For data center operators running inference at scale, the message from these results is clear: software optimization on existing Blackwell hardware may deliver more cost reduction than waiting for next-generation silicon. A 60% reduction in per-token costs changes the economics of deploying reasoning models like DeepSeek-R1 in production.
Image source: Shutterstock