pu-rs.org – Processing Unit Ranking System

The SPECfp for AI accelerators.

FLOPS don’t tell the full story. A chip rated at 1000 TFLOPS means nothing if your softmax kernel only achieves 5% utilization. pu-rs.org measures what matters: actual kernel execution time on real hardware, for the operations that AI workloads actually run.
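The utilization claim above is simple arithmetic: achieved FLOPS divided by rated peak. A minimal sketch of that calculation (the `utilization` helper and the sample numbers are ours, purely illustrative, not pu-rs.org measurements):

```rust
/// Fraction of a device's rated peak compute that a kernel actually delivers.
/// Inputs here are illustrative placeholders, not measured results.
fn utilization(flops_per_call: f64, latency_s: f64, peak_tflops: f64) -> f64 {
    let achieved_tflops = flops_per_call / latency_s / 1e12;
    achieved_tflops / peak_tflops
}

fn main() {
    // Hypothetical: a kernel doing 1.3 MFLOP per call in 26 µs on a chip
    // rated at 1000 TFLOPS lands at a vanishingly small fraction of peak.
    let u = utilization(1.3e6, 26e-6, 1000.0);
    println!("utilization: {:.3}%", u * 100.0);
}
```

Memory-bound kernels like softmax routinely sit orders of magnitude below peak FLOPS, which is exactly why measured latency, not the peak rating, is the useful number.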

Why this exists

| What we measure | What others report |
|---|---|
| Softmax latency at (64, 4096) f16 | Peak TFLOPS |
| LayerNorm throughput per watt | Memory bandwidth (theoretical) |
| MatMul efficiency vs roofline | Marketing benchmarks |
| Cost per real GOPS | Cloud $/hour (opaque) |

Scope

We benchmark the kernel primitives that compose every AI model:

| Category | Kernels |
|---|---|
| Activation | Softmax, GELU, SiLU |
| Normalization | LayerNorm, RMSNorm |
| Linear Algebra | GEMM, batched MatMul |
| Attention | Scaled Dot-Product Attention |
| Quantization | VQ-Quantize, INT8 dequant |
| Convolution | Conv1D, dilated Conv1D |
| Reduction | Scatter-add, L1-smooth loss |
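As a concrete example of what one kernel in the Activation row computes, here is a numerically stable row-wise softmax reference in Rust. This is a CPU sketch for clarity only; the benchmarks run vendor-optimized kernels, and this function is not part of the suite:

```rust
/// Reference row-wise softmax over a (rows, cols) f32 buffer, in place.
/// Subtracting the per-row max before exponentiating avoids overflow
/// for large logits without changing the result.
fn softmax_rows(x: &mut [f32], rows: usize, cols: usize) {
    for r in 0..rows {
        let row = &mut x[r * cols..(r + 1) * cols];
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let mut sum = 0.0f32;
        for v in row.iter_mut() {
            *v = (*v - max).exp();
            sum += *v;
        }
        for v in row.iter_mut() {
            *v /= sum;
        }
    }
}
```

A hardware softmax kernel does the same three passes (max, exp-sum, normalize), which is why it is bandwidth-bound: three sweeps over memory for very little arithmetic.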

Devices covered

| Type | Vendors |
|---|---|
| GPU | NVIDIA (A100, H100, H200, B200), AMD (MI300X), Apple (M2/M4 Max), Cambricon |
| TPU | Google (v5e, v6e Trillium) |
| NPU | Huawei Ascend (910B, 910C), AWS Trainium2, Intel Gaudi 3 |

How it works

  1. Run standardized benchmark scripts on your hardware
  2. Submit CSV results via pull request
  3. CI validates format and sanity checks
  4. Leaderboard updates automatically with per-kernel rankings
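The validation in step 3 might look like the following sketch. The field names and thresholds here are assumptions for illustration, not the actual CI rules or CSV schema:

```rust
/// Hypothetical sanity check for one submitted result row.
/// Field names and limits are illustrative, not the real pu-rs.org schema.
fn sanity_check(kernel: &str, median_latency_us: f64, runs: u32) -> Result<(), String> {
    if kernel.is_empty() {
        return Err("missing kernel name".into());
    }
    if !median_latency_us.is_finite() || median_latency_us <= 0.0 {
        return Err(format!("implausible latency: {median_latency_us}"));
    }
    if runs < 10 {
        return Err(format!("too few runs for a stable median: {runs}"));
    }
    Ok(())
}
```

Rejecting submissions in CI rather than after merge keeps the leaderboard free of malformed or obviously bogus rows.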

All results are tagged with the git SHA, driver version, toolchain, and number of runs; the median latency across runs is reported. Full methodology.
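The reported statistic is a plain median over the timed runs. A sketch of that reduction (our helper, not the project's code):

```rust
/// Median of a set of timed runs, in microseconds. Sorting a copy
/// keeps the caller's raw samples intact for later inspection.
fn median_us(samples: &[f64]) -> f64 {
    let mut s = samples.to_vec();
    s.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = s.len();
    if n % 2 == 1 {
        s[n / 2]
    } else {
        (s[n / 2 - 1] + s[n / 2]) / 2.0
    }
}
```

The median is the usual choice for benchmark reporting because it ignores the occasional outlier run caused by clock ramp-up, interrupts, or cache cold starts.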

End-to-end complement

Per-kernel latency is only half the story — a chip can win on softmax and still lose on a real model. The DeepSeek decode page reports end-to-end throughput across five accelerators (Ascend 910B2, TPU v2-8, Apple M2 Max, NVIDIA T4, AWS Trainium1) from the same 13-kernel Rust source emitted through the ascend-rs MLIR backends.


Built with ascend-rs kernel infrastructure. Data updated weekly.