CRPL -- Computational Research Programming Lab | Microbenchmarking NVIDIA's Blackwell Architecture at IPDPS 2026

This past week, we presented our latest work on NVIDIA’s Blackwell GPU architecture at the International Parallel and Distributed Processing Symposium (IPDPS) in New Orleans. The paper, Microbenchmarking NVIDIA’s Blackwell Architecture, is a deep dive into the performance characteristics of NVIDIA’s newest accelerator, and it turns up some surprising results about how modern GPU hardware is evolving.

What We Measured:

The core question driving this work: How does Blackwell’s new hardware actually perform in practice? Rather than relying on marketing specs or synthetic benchmarks, we built a suite of targeted microbenchmarks to measure real behavior across several critical dimensions:

Tensor Core Performance:

We characterized the tcgen05.mma instruction (NVIDIA’s next-generation matrix multiply-accumulate operation), finding that it achieves remarkably low and flat latency across tile sizes—a departure from Hopper’s more variable performance profile.
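
Latency figures of this kind are commonly obtained by timing a chain of dependent operations and dividing by the chain length, so that no two operations can overlap. The following is a minimal host-side Python sketch of that general idea using pointer chasing. It is not the paper's CUDA harness and does not touch tcgen05.mma; the function names (`make_cycle`, `chase`, `per_hop_latency_ns`) and all parameters are hypothetical, and interpreter overhead dominates the absolute numbers.

```python
import random
import statistics
import time


def make_cycle(n: int, seed: int = 0) -> list[int]:
    """Build a random single-cycle permutation: nxt[i] is the slot visited after i."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    nxt = [0] * n
    for pos, slot in enumerate(order):
        nxt[slot] = order[(pos + 1) % n]
    return nxt


def chase(nxt: list[int], steps: int, start: int = 0) -> int:
    """Follow the chain for `steps` hops; each hop depends on the previous one."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]
    return idx


def per_hop_latency_ns(nxt: list[int], steps: int, warmup: int = 2, repeats: int = 7) -> float:
    """Median time per dependent hop, in nanoseconds, after warm-up runs."""
    for _ in range(warmup):
        chase(nxt, steps)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter_ns()
        chase(nxt, steps)
        samples.append((time.perf_counter_ns() - t0) / steps)
    return statistics.median(samples)


if __name__ == "__main__":
    chain = make_cycle(1 << 12)
    print(f"{per_hop_latency_ns(chain, steps=100_000):.1f} ns per hop")
```

The warm-up runs and the median over repeats are there to reduce the influence of one-off effects and outliers. A GPU version would apply the same structure with on-device cycle counters instead of a host clock.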

Memory Hierarchy:

TMEM, Blackwell’s new memory tier, represents a significant architectural change. We measured how it impacts data movement and what workloads benefit most from its introduction.

Precision Tradeoffs:

We systematically evaluated FP4 and FP6 formats for both training and inference, quantifying the accuracy-performance frontier.

Hardware Decompression:

Blackwell introduces a dedicated decompression engine. We explored what compression formats it supports and where it actually saves time.
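
To make the accuracy side of the FP4 tradeoff mentioned above concrete, here is a small Python sketch that rounds values to the nearest number representable in the 4-bit E2M1 floating-point format (1 sign bit, 2 exponent bits, 1 mantissa bit) and reports the resulting error. It is an illustration only, not the paper's evaluation code. The helper names and sample values are hypothetical, ties go to the smaller magnitude, which may differ from hardware rounding, and the per-block scale factors that practical FP4 pipelines usually add are omitted.

```python
# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_fp4(x: float) -> float:
    """Round x to the nearest FP4 E2M1 value, saturating at +/-6."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP4_E2M1_MAGNITUDES[-1])
    # On an exact tie, min() keeps the first (smaller) candidate.
    nearest = min(FP4_E2M1_MAGNITUDES, key=lambda v: abs(v - mag))
    return sign * nearest


def mean_abs_error(values) -> float:
    """Average absolute quantization error over a sequence of floats."""
    return sum(abs(v - quantize_fp4(v)) for v in values) / len(values)


if __name__ == "__main__":
    # Illustrative inputs only; not data from the paper.
    samples = [0.1, 0.7, 1.2, 2.4, 3.6, 5.2, -0.3, -4.9]
    for s in samples:
        print(f"{s:+.2f} -> {quantize_fp4(s):+.2f}")
    print("mean abs error:", mean_abs_error(samples))
```

With only eight magnitudes available, the spacing between representable values grows from 0.5 near zero to 2.0 near the top of the range, which is why the error depends heavily on how inputs are scaled before quantization.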

Why This Matters:

Blackwell is a major step forward for HPC and AI workloads, but the gap between peak specs and sustained performance can be substantial. By publishing detailed, reproducible measurements, we hope to help the community understand where the real performance wins are and where marketing claims deserve skepticism.

Presenting at IPDPS:

Presenting this work at IPDPS was particularly gratifying: the conference attracted researchers and practitioners actively working on GPU optimization, compiler design, and benchmarking methodology. The feedback was sharp and thoughtful, and several conversations afterward pointed toward promising follow-up directions: decompression engine workloads beyond LLM weights, TMEM portability across frameworks, and how these findings translate to real applications at the system level.

Paper: Microbenchmarking NVIDIA’s Blackwell Architecture (arXiv:2512.02189)