Comprehensive performance benchmarks for numr operations across CPU and CUDA backends, with comparisons against reference implementations (ndarray, nalgebra).
Date: 2026-02-11 Version: numr 0.4.0 Branch: 0.4.0
System Specs:
- CPU: x86_64 (3.69-3.98 GHz)
- GPU: NVIDIA RTX 3060 (tested with --features cuda)
- Framework: FluxBench
Test Coverage:
- ✅ 6 benchmark suites (matmul, reduce, shape_ops, indexing, fft, parallelism)
- ✅ 16 CUDA benchmarks + CPU baselines
- ✅ 100+ total benchmarks (CPU + CUDA + parallelism)
- ✅ 30+ benchmarks in parallelism suite
- ✅ 12+ verification gates (critical + warning)
- ✅ 4 numerical parity unit tests
| Operation | numr (CPU) | numr (CUDA) | ndarray |
|---|---|---|---|
| Matmul 512×512 | 2.45ms | 2.68ms | 2.46ms |
| Matmul 1024×1024 | 17.57ms | 2.91ms | 21.39ms |
| Sum 1M elements | 624µs | 2.7µs | 631µs |
| Sum rows 1024×1024 | 53µs | 2.6µs | 85µs |
| Cat 10×1K tensors | 747ns | - | 784ns |
| Cat 10×256×64 | 15.4µs | 18.1µs | 15.3µs |
| Embedding lookup 32K | 12.2µs | 6.7µs | - |
All 5 verification gates pass (1.1x threshold):
✓ cat_1d: 0.95x ndarray (< 1.1 threshold)
✓ cat_2d: 1.01x ndarray (< 1.1 threshold)
✓ sum_1m: 0.99x ndarray (< 1.1 threshold)
✓ sum_10m: 0.99x ndarray (< 1.1 threshold)
✓ sum_rows_1k: 0.62x ndarray (< 1.1 threshold)
# Run all CPU benchmarks
cargo bench
# Run all benchmarks with CUDA support
cargo bench --features cuda
# Run specific benchmark suite
cargo bench --bench matmul # Matrix multiplication
cargo bench --bench reduce # Reduction operations (sum, mean, max)
cargo bench --bench shape_ops # Shape transformations (cat, stack, repeat, pad, roll)
cargo bench --bench indexing # Indexing operations (gather, take, embedding_lookup)
cargo bench --bench fft # FFT operations (CPU only, no CUDA support yet)
cargo bench --bench parallelism # CPU parallelism control (thread-scaling, chunk-tuning)
# Test parallelism numerical parity (verify identical results across thread counts)
cargo test --bench parallelism
# Run specific benchmark with CUDA
cargo bench --bench matmul --features cuda

Operations Tested:
- Dense 2D matrix multiplication (f32, f64)
- Batched matrix multiplication
- Bias addition (fused with matmul)
Sizes:
- Small: 32×32, 64×64
- Medium: 128×128, 256×256
- Large: 512×512, 1024×1024
Comparisons:
- MatmulSmall: CPU numr vs ndarray vs nalgebra (32×32)
- MatmulMedium: CPU numr vs ndarray vs nalgebra (128×128)
- MatmulLarge: CPU numr vs ndarray vs nalgebra (512×512) + CUDA (when available)
- MatmulXLarge: CPU numr vs ndarray vs nalgebra (1024×1024) + CUDA (when available)
Performance Target: 50%+ of cuBLAS (CUDA), 1.1x ndarray (CPU)
Synthetic Metrics (CUDA only):
- CudaSpeedup512: GPU speedup vs CPU at 512×512
- CudaSpeedup1024: GPU speedup vs CPU at 1024×1024
Operations Tested:
- sum: Sum all elements or along an axis
- mean: Compute mean
- max: Find maximum value
Sizes:
- Single dimension: 1K, 100K, 1M, 10M elements
- 2D matrix reductions: 256×256, 1024×1024
- Data types: F32, F64
Comparisons:
- Sum1M: CPU numr vs ndarray vs CUDA (1M elements)
- Sum10M: CPU numr vs ndarray vs CUDA (10M elements)
- SumRows1024: CPU numr vs ndarray vs CUDA (1024×1024 rows)
Verification Gates:
numr_sum_1m / ndarray_sum_1m < 1.1 (must be 91%+ of ndarray speed)
numr_sum_10m / ndarray_sum_10m < 1.1
numr_sum_rows_1024x1024 / ndarray_sum_rows_1024x1024 < 1.1
Scaling Analysis:
- Includes 4-point scaling series (1K→100K→1M→10M) to measure throughput improvements
Operations Tested:
- cat: Concatenate tensors along a dimension
- stack: Stack tensors into a new dimension
- repeat: Repeat tensor along each dimension
- repeat_interleave: Repeat elements interleaved
- unfold: Sliding window operations
- split/chunk: Partition tensors
Sizes:
- 1D: 1K, 10K, 100K elements
- 2D: 256×256, 256×64, 1024×64
- Repetitions: 2×2, 4×1, 4×, 8×, 10×
Comparisons:
- Cat1D: CPU numr vs ndarray (10× 1000-elem tensors)
- Cat2D: CPU numr vs ndarray vs CUDA (10× 256×64 tensors)
Verification Gates:
numr_cat_10x_1000 / ndarray_cat_10x_1000 < 1.1 (must be 91%+ of ndarray speed)
numr_cat_10x_256x64 / ndarray_cat_10x_256x64 < 1.1
Performance Insight: CUDA overhead dominates for small tensors (18µs vs 15µs CPU for cat), but amortizes across larger operations.
Operations Tested:
- gather: Gather slices from one dimension
- index_select: Select rows by indices
- take: Flat indexing
- scatter: Scatter values into output
- put: Flat scatter
- embedding_lookup: Common ML pattern (vocabulary lookup)
Sizes:
- Source: 1K, 100K vocabulary
- Queries: 256, 512, 10K indices
- Embedding dim: 64, 128
Comparisons:
- IndexSelectCmp: 1K vs 100K scaling
- EmbeddingCmp: CPU numr vs CUDA at 32K/128K vocab
Performance Target: 0.85-1.0x CUDA speedup (memory bound, CPU cache-friendly for small tensors)
Operations Tested:
- FFT (fast Fourier transform)
- IFFT (inverse FFT)
- rfft (real FFT)
Sizes:
- 256, 1024, 4096, 16384, 65536 elements
- Batched: 8×1024, 16×4096, 32×16384
Status: CPU only (CUDA FFT support pending)
Comparisons:
- FFT256 through FFT65K: Scaling series for algorithm analysis
Purpose: Validate thread-count scaling and chunk-size tuning for CPU operations with parallelism control.
Operations Tested:
- Matrix multiplication (batch parallelism with Rayon)
- Reductions (sum, mean - uses rayon_min_len())
- FFT (batched transforms - uses chunk_size_hint())
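The min-len hint can be modeled without Rayon. The sketch below is std-only and assumes numr's rayon_min_len() behaves like Rayon's with_min_len, capping how finely a workload may be split so that tiny inputs stay serial:

```rust
// Sketch only: models a min-len hint. `split_count` is our illustrative
// helper, not numr API. A workload of `len` elements never splits into
// more pieces than `len / min_len` allows, regardless of thread count.
fn split_count(len: usize, threads: usize, min_len: usize) -> usize {
    if min_len == 0 {
        return threads.max(1);
    }
    let max_pieces = (len / min_len).max(1);
    threads.max(1).min(max_pieces)
}

fn main() {
    // 1K elements with a 4096 min-len hint: stays on one thread.
    assert_eq!(split_count(1_000, 8, 4096), 1);
    // 10M elements easily saturates 8 threads.
    assert_eq!(split_count(10_000_000, 8, 4096), 8);
}
```

This is why the thread-scaling groups use 1M and 10M element inputs: below the hint, extra threads cannot engage at all.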
Thread Counts: 1, 2, 4, 8 (hardware-dependent, scales to available cores)
Benchmark Groups:
- Thread Scaling (7 groups):
  - matmul_threads_512: Dense 512×512 matmul with 1, 2, 4, 8 threads
  - matmul_batch_threads: Batched 32×128×128 matmul with 1, 2, 4, 8 threads
  - reduce_sum_1m_threads: 1M element sum with 1, 2, 4, 8 threads
  - reduce_sum_10m_threads: 10M element sum with 1, 2, 4, 8 threads (best for scaling analysis)
  - reduce_mean_1m_threads: 1M element mean with 1, 4 threads
  - fft_threads_16k: 16384-element FFT with 1, 2, 4, 8 threads
  - fft_batch_threads: Batched 64×1024 FFT with 1, 2, 4, 8 threads
- Chunk Size Sensitivity (1 group):
  - reduce_sum_chunk_sensitivity: 10M element sum with 4 threads, varying chunk_size: 256, 1024, 4096, 16384
  - Validates that chunk_size_hint() tuning improves performance without overhead
- Configuration Overhead (3 groups):
  - overhead_matmul: Default client vs custom config (None, None)
  - overhead_reduce: Default client vs custom config (None, None)
  - overhead_fft: Default client vs custom config (None, None)
  - Validates that with_parallelism() adds < 5% overhead
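To see why the (None, None) overhead gates are expected to pass, here is a hypothetical model of the config path. All names below (Parallelism, with_parallelism, effective_threads) are illustrative, not numr's real API: the point is that a config is just two Option fields resolved at dispatch time, so a custom config equal to the default should be indistinguishable from it.

```rust
// Hypothetical sketch (names assumed, not numr's actual types): a
// parallelism config stores two Options; (None, None) resolves exactly
// like the default, so measured overhead should stay under 5%.
#[derive(Clone, Copy, Default, PartialEq, Debug)]
struct Parallelism {
    threads: Option<usize>,
    chunk_size: Option<usize>,
}

struct Client {
    par: Parallelism,
}

impl Client {
    fn default_client() -> Self {
        Client { par: Parallelism::default() }
    }
    fn with_parallelism(threads: Option<usize>, chunk_size: Option<usize>) -> Self {
        Client { par: Parallelism { threads, chunk_size } }
    }
    // Resolve the thread count: explicit setting wins, else hardware count.
    fn effective_threads(&self, hw_threads: usize) -> usize {
        self.par.threads.unwrap_or(hw_threads)
    }
}

fn main() {
    let default = Client::default_client();
    let same = Client::with_parallelism(None, None);
    assert_eq!(default.par, same.par);
    assert_eq!(default.effective_threads(8), same.effective_threads(8));
}
```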
Verification Gates:
// Scaling efficiency (hardware-dependent, severity = warning)
matmul_512x512_4threads / matmul_512x512_1thread < 0.95
reduce_sum_10m_4threads / reduce_sum_10m_1thread < 0.9
fft_16384_4threads / fft_16384_1thread < 0.9
// Configuration overhead (strict, severity = critical)
matmul_512x512_custom_same / matmul_512x512_default < 1.05
reduce_sum_1m_custom_same / reduce_sum_1m_default < 1.05
fft_1024_custom_same / fft_1024_default < 1.05

Synthetic Metrics:
- matmul_512_4t_speedup: 4-thread speedup ratio (1t / 4t)
- reduce_sum_1m_4t_speedup: 4-thread speedup for 1M sum
- reduce_sum_10m_4t_speedup: 4-thread speedup for 10M sum (best indicator)
- fft_16k_4t_speedup: 4-thread speedup for 16K FFT
- matmul_overhead_ratio: Configuration overhead for matmul
- reduce_overhead_ratio: Configuration overhead for reduce
- fft_overhead_ratio: Configuration overhead for FFT
Numerical Parity Tests (Unit Tests):
Critical: All parallelism configs MUST produce identical results (bit-for-bit, not approximate):
#[test]
fn test_matmul_parallelism_numerical_parity() {
// Verify: result_1t == result_4t == result_8t (EXACTLY)
}
#[test]
fn test_reduce_sum_parallelism_numerical_parity() {
// Verify: result_1t == result_4t == result_8t (EXACTLY)
}
#[test]
fn test_fft_parallelism_numerical_parity() {
// Verify: result_1t == result_4t == result_8t (EXACTLY)
}
#[test]
fn test_chunk_size_numerical_parity() {
// Verify: chunk_256 == chunk_1024 == chunk_4096 (EXACTLY)
}

Why Numerical Parity is Critical: Parallelism should be a pure performance optimization with ZERO numerical impact. Different thread counts or chunk sizes must produce identical results (same order of operations, same accumulation).
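One way to get this bit-for-bit guarantee (a std-only sketch, not numr's actual Rayon implementation) is to fix the chunk layout and combine partial sums in index order, so the float rounding sequence never depends on how many workers ran:

```rust
use std::thread;

// Sketch only: parity falls out of a fixed chunk layout plus a fixed
// combine order. Each chunk's partial sum is computed independently
// (here, contiguous runs of chunks per worker), then the partials are
// folded left-to-right -- the same rounding sequence for any worker count.
fn chunked_sum(data: &[f32], chunk: usize, workers: usize) -> f32 {
    let chunks: Vec<&[f32]> = data.chunks(chunk).collect();
    let mut partials = vec![0.0f32; chunks.len()];
    let w = workers.max(1);
    let per_worker = ((chunks.len() + w - 1) / w).max(1);
    thread::scope(|s| {
        for (slots, pieces) in partials
            .chunks_mut(per_worker)
            .zip(chunks.chunks(per_worker))
        {
            s.spawn(move || {
                for (slot, piece) in slots.iter_mut().zip(pieces) {
                    *slot = piece.iter().sum();
                }
            });
        }
    });
    // Combine in index order: identical for 1, 4, or 8 workers.
    partials.iter().fold(0.0, |acc, p| acc + p)
}

fn main() {
    let data: Vec<f32> = (0..10_000).map(|i| (i as f32) * 0.1).collect();
    let s1 = chunked_sum(&data, 1024, 1);
    let s4 = chunked_sum(&data, 1024, 4);
    let s8 = chunked_sum(&data, 1024, 8);
    // Bit-for-bit, not approximate.
    assert_eq!(s1.to_bits(), s4.to_bits());
    assert_eq!(s1.to_bits(), s8.to_bits());
}
```

Note the converse also holds: changing the chunk size changes the accumulation order, which is why chunk parity requires the chunk layout itself to be fixed independently of tuning hints.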
Comparisons:
- MatmulScaling512: 512×512 matmul thread scaling (1t, 2t, 4t, 8t)
- MatmulBatchScaling: Batched 32×128×128 thread scaling
- ReduceSum1MScaling: 1M element sum thread scaling
- ReduceSum10MScaling: 10M element sum thread scaling (best for performance analysis)
- FFT16KScaling: 16384-element FFT thread scaling
- FFTBatchScaling: Batched 64×1024 FFT thread scaling
- ChunkSizeReduce: 10M sum chunk size impact (256 vs 1024 vs 4096 vs 16384)
- OverheadMatmul: Configuration overhead for matmul
- OverheadReduce: Configuration overhead for reduce
- OverheadFFT: Configuration overhead for FFT
Running Benchmarks:
# All parallelism benchmarks
cargo bench --bench parallelism
# Specific thread scaling groups
cargo bench --bench parallelism -- matmul_threads_512
cargo bench --bench parallelism -- reduce_sum_10m_threads
cargo bench --bench parallelism -- fft_threads_16k
# Chunk size sensitivity
cargo bench --bench parallelism -- reduce_sum_chunk_sensitivity
# Configuration overhead
cargo bench --bench parallelism -- overhead
# Numerical parity unit tests
cargo test --bench parallelism
# Without Rayon (verify graceful no-op behavior)
cargo bench --bench parallelism --no-default-features --features cpu

Performance Analysis:
Thread Scaling Expected Behavior:
- 1 thread (serial): Baseline
- 2-4 threads: 1.5-2.5x speedup (if workload large enough)
- 4-8 threads: Diminishing returns, scales sub-linearly due to Rayon overhead
- Hardware-dependent: 2-core vs 16-core systems will show very different results
Which Benchmarks Show Best Scaling:
- Matmul batched (best for scaling): Batch dimension parallelized, good load balance
- Reduce 10M (good for scaling): Large dataset, communication-to-computation ratio favorable
- FFT batched (good for scaling): Multiple FFTs computed in parallel
- Matmul 512×512 (moderate scaling): Square matrix, scales less than batched
Chunk Size Impact:
- Default (chunk_size=1): No chunking, full dataset per thread
- chunk_size=256: More granular, better load balance but more overhead
- chunk_size=1024: Sweet spot for most operations
- chunk_size=4096+: Large chunks, better cache locality but uneven load balance
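Illustrative arithmetic for this trade-off (chunks_for is our helper, not part of the suite): the chunk count determines scheduling granularity for the 10M-element reduce_sum_chunk_sensitivity workload.

```rust
// Sketch only: how each chunk_size setting partitions the 10M-element sum.
// More chunks = finer load balancing but more scheduling overhead;
// fewer chunks = better cache locality but coarser balancing.
fn chunks_for(len: usize, chunk_size: usize) -> usize {
    (len + chunk_size - 1) / chunk_size // ceiling division
}

fn main() {
    let n = 10_000_000;
    assert_eq!(chunks_for(n, 256), 39_063); // very fine-grained
    assert_eq!(chunks_for(n, 4096), 2_442); // coarser, cache-friendly
}
```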
Overhead Interpretation:
- ratio < 1.01: Perfect parity, no overhead
- ratio 1.01-1.05: Acceptable overhead (< 5%)
- ratio > 1.05: CRITICAL - indicates an infrastructure bug in with_parallelism()
Scaling Efficiency Interpretation:
- Ratio < 0.5: More than 2x speedup at 4 threads (0.25 would be perfectly linear), indicates excellent parallelism
- Ratio 0.5-0.75: Sub-linear but good (typical for 4-thread)
- Ratio 0.75-0.95: Poor scaling, high Rayon overhead (investigate)
- Ratio > 0.95: Essentially no speedup (serial performance)
Note on Hardware Dependency:
Scaling efficiency gates have severity = "warning" because results vary dramatically by hardware:
- 2-core system: 4-thread config uses oversubscription, can be slower
- 4-core system: 4-thread config achieves best scaling (~2-3x)
- 8+ core system: 4-thread config shows diminishing returns (~1.5-2x)
Overhead gates have severity = "critical" because configuration overhead should be consistent regardless of hardware.
All benchmarks include automatic verification gates to detect regressions:
#[flux::verify(expr = "numr_512x512 / ndarray_512x512 < 1.1", severity = "critical")]
struct VerifyMatmul512;

Threshold: 1.1x (numr must be ≤ 10% slower than reference)
- All operations: Must be ≤ 1.1x reference
- CUDA benchmarks: Track speedup via synthetic metrics
Failure Interpretation:
- Ratio < 1.0: numr is faster ✅
- Ratio 1.0-1.1: Within acceptable range ✅
- Ratio > 1.1: REGRESSION ❌ Investigate and fix
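The interpretation above can be expressed as a small classifier (our helper for illustration; FluxBench evaluates the gate expression directly):

```rust
// Sketch of the gate semantics implied by `ratio < 1.1`:
// classify numr_time / reference_time against the threshold.
#[derive(Debug, PartialEq)]
enum Gate {
    Faster,          // ratio < 1.0: numr is faster
    WithinThreshold, // 1.0 <= ratio < 1.1: acceptable
    Regression,      // ratio >= 1.1: investigate and fix
}

fn classify(numr_ns: f64, reference_ns: f64) -> Gate {
    let ratio = numr_ns / reference_ns;
    if ratio < 1.0 {
        Gate::Faster
    } else if ratio < 1.1 {
        Gate::WithinThreshold
    } else {
        Gate::Regression
    }
}

fn main() {
    // sum_rows_1k from the summary table: 53us vs ndarray's 85us -> 0.62x.
    assert_eq!(classify(53.0, 85.0), Gate::Faster);
    assert_eq!(classify(105.0, 100.0), Gate::WithinThreshold);
    assert_eq!(classify(120.0, 100.0), Gate::Regression);
}
```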
| Operation | F32 | F64 | F16 | Complex64 | Notes |
|---|---|---|---|---|---|
| matmul | ✅ | ✅ | ❌ | ❌ | F64 tested on CUDA, F16 limited |
| reduce | ✅ | ✅ | ❌ | ❌ | F64 tested on CUDA |
| shape_ops | ✅ | ❌ | ❌ | ❌ | F32 primary, F64 optional |
| fft | ❌ | ❌ | ❌ | ✅ | Complex64 only (CPU only) |
| indexing | ✅ | ❌ | ❌ | ❌ | F32 primarily tested |
| parallelism | ✅ | ❌ | ❌ | ❌ | F32 primary focus |
| Backend | Supported Types | Notes |
|---|---|---|
| CPU | F32, F64, F16, BF16, Complex64, Complex128 | Full dtype coverage |
| CUDA | F32, F64, F16, BF16, Complex64, Complex128 | Excellent coverage, F16/BF16 optional |
| WebGPU | F32 only (Complex64 for FFT) | WGSL limitation, no F64/F16/BF16 support |
Recommendation: For cross-platform benchmarks, use F32 as the standard dtype to ensure results are comparable across CPU/CUDA/WebGPU backends.
To benchmark additional dtypes:
// F64 variant (CPU and CUDA)
#[flux::bench(group = "matmul_2d_f64")]
fn numr_512x512_f64(b: &mut Bencher) {
let (device, client) = setup();
let a = client.rand(&[512, 512], DType::F64).unwrap(); // F64
let b = client.rand(&[512, 512], DType::F64).unwrap();
b.iter(|| black_box(client.matmul(&a, &b).unwrap()));
}
// Add comparison for F64
#[flux::compare(
id = "matmul_512_f64",
title = "Matmul 512x512 F64 (numr vs ndarray)",
benchmarks = ["numr_512x512_f64", "ndarray_512x512_f64"],
baseline = "numr_512x512_f64",
metric = "mean"
)]
struct MatmulF64;

Current limitation: WebGPU benchmarks cannot use F64 (WGSL doesn't support it). Use the CPU backend for F64 performance analysis.
cargo bench:
- All CPU benchmarks compile and run
- Comparisons show 2-way (numr vs reference) or 3-way (numr vs ndarray vs nalgebra)
- CUDA benchmarks and comparisons are skipped
cargo bench --features cuda:
- CPU benchmarks still run
- CUDA benchmarks added to same comparison groups
- Comparisons expand to 3-way (CPU) → 4-way (including CUDA)
- Same comparison IDs in both modes for result consistency
- Synthetic metrics calculate GPU speedup
Implementation Detail: Uses conditional struct definitions:
#[cfg(not(feature = "cuda"))]
#[flux::compare(...)] // CPU-only definition
struct MatmulLarge;
#[cfg(feature = "cuda")]
#[flux::compare(...)] // Includes CUDA benchmarks
struct MatmulLarge; // Same ID, different benchmarks

Group: matmul_2d_f32
------------------------------------------------------------
✓ numr_512x512
mean: 2454409.00 ns median: 2456866.00 ns stddev: 7854.80 ns
min: 2444071.00 ns max: 2464290.00 ns
samples: 5
p50: 2456866.00 ns p95: 2462941.40 ns p99: 2464020.28 ns
95% CI: [2445111.00, 2462941.40] ns
throughput: 407.43 ops/sec
cycles: mean 9064156 median 9073214 (3.69 GHz)
Matmul 512x512 (numr vs ndarray vs nalgebra)
------------------------------------------------------------
Benchmark mean Speedup
────────────────────────────────────────
numr_512x512 2454409 1.00x (baseline)
ndarray_512x512 2456036 1.00x
nalgebra_512x512 2454409 1.00x
Key Metrics:
- mean: Average execution time (most important)
- median: Middle value (stable timing, unaffected by outliers)
- stddev: Standard deviation (lower = more consistent)
- p95, p99: 95th/99th percentile (tail latency)
- throughput: Operations per second (1 / mean)
- Speedup: Ratio vs baseline (1.0x = equal to baseline)
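As a sketch of the arithmetic behind these metrics (nearest-rank percentiles are assumed here; FluxBench may interpolate differently):

```rust
// Illustrative statistics helpers, not FluxBench internals.
fn mean(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

// Nearest-rank percentile over an ascending-sorted slice.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.saturating_sub(1)]
}

fn main() {
    // Five samples in the ballpark of the numr_512x512 report above.
    let mut ns = vec![2_444_071.0, 2_452_000.0, 2_456_866.0, 2_460_000.0, 2_464_290.0];
    ns.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median = percentile(&ns, 50.0); // p50 == median
    let throughput = 1e9 / mean(&ns);   // ops/sec from mean ns
    assert_eq!(median, 2_456_866.0);
    assert!(throughput > 400.0 && throughput < 415.0);
}
```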
| Operation | Expected vs Reference | Notes |
|---|---|---|
| Dense matmul (CPU) | 0.9-1.1x ndarray | BLIS-style tiling |
| Dense matmul (CUDA) | 0.5x cuBLAS | Native kernels, no vendor libs |
| Reductions (CPU) | 0.9-1.1x ndarray | SIMD vectorization |
| Cat (CPU) | 0.85-1.1x ndarray | Optimized memcpy |
| Indexing (CPU) | 1.0-1.1x | Cache-dependent |
| Indexing (CUDA) | 1.5-2.0x CPU | GPU memory bandwidth |
Benchmark results are written to target/criterion/ (FluxBench format):
# Find comparisons
ls target/criterion/*/comparison-data.json
# View specific comparison
cat target/criterion/matmul_large/comparison-data.json | jq

- Add a benchmark function with the #[flux::bench] attribute:
#[flux::bench(group = "matmul_2d_f32")]
fn numr_512x512(b: &mut Bencher) {
let (device, client) = setup();
let a = client.rand(&[512, 512], DType::F32).unwrap();
let b = client.rand(&[512, 512], DType::F32).unwrap();
b.iter(|| black_box(client.matmul(&a, &b).unwrap()));
}

- Add CUDA variant (if applicable):
#[cfg(feature = "cuda")]
#[flux::bench(group = "matmul_2d_f32")]
fn cuda_512x512(b: &mut Bencher) {
let device = CudaDevice::new(0);
let client = CudaRuntime::default_client(&device);
let a = client.rand(&[512, 512], DType::F32).unwrap();
let b = client.rand(&[512, 512], DType::F32).unwrap();
b.iter(|| black_box(client.matmul(&a, &b).unwrap()));
}

- Add or update comparison struct:
#[cfg(not(feature = "cuda"))]
#[flux::compare(
id = "matmul_large",
title = "Matmul 512x512 (numr vs ndarray)",
benchmarks = ["numr_512x512", "ndarray_512x512"],
baseline = "numr_512x512",
metric = "mean"
)]
struct MatmulLarge;
#[cfg(feature = "cuda")]
#[flux::compare(
id = "matmul_large",
title = "Matmul 512x512 (numr vs ndarray vs CUDA)",
benchmarks = ["numr_512x512", "ndarray_512x512", "cuda_512x512"],
baseline = "numr_512x512",
metric = "mean"
)]
struct MatmulLarge;

- Add verification gate (for critical performance):
#[flux::verify(
expr = "numr_512x512 / ndarray_512x512 < 1.1",
severity = "critical"
)]
struct VerifyMatmul512;

- Add synthetic metric for insights:
#[cfg(feature = "cuda")]
#[flux::synthetic(
id = "cuda_speedup_512",
formula = "numr_512x512 / cuda_512x512",
unit = "x"
)]
struct CudaSpeedup512;

- Check if it's measurement noise:
cargo bench --bench <name> -- --sample-size 100 # More samples
- Profile with perf/flamegraph:
cargo bench --bench matmul -- --profile-time 10
- Check verification gates:
  - If a gate fails (ratio > 1.1), compare against the baseline:
git show HEAD:src/runtime/cpu/runtime.rs > /tmp/old.rs
diff /tmp/old.rs src/runtime/cpu/runtime.rs
- Common causes:
  - Unnecessary memory allocation (use alloc, not alloc_zeroed)
  - Arc clones avoiding the contiguous check
  - Unvectorized code paths
  - Missing SIMD optimizations
  - Inefficient packing/unpacking in matmul
CPU (SIMD):
- Focus on cache alignment (64-byte for AVX-512)
- Minimize branch mispredictions
- Vectorize hot loops
CUDA:
- Coalesce memory access
- Use shared memory for tiling
- Minimize kernel launch overhead
- Check occupancy (register pressure)
WebGPU:
- Minimize shader compilation time (cache compiled shaders)
- Use workgroup synchronization efficiently
- Profile with GPU debuggers
| Problem | Solution |
|---|---|
| "CUDA not found" | Install CUDA 12.x, add to PATH |
| Benchmarks crash on startup | Ensure GPU has enough memory (>1GB for large matmul) |
| Inconsistent timing | Close background processes, use --sample-size 20 for stability |
| Verification gate fails | Investigate recent changes to hot paths (allocation, packing, etc.) |
| CUDA benchmarks not appearing | Check cargo bench --features cuda - verify feature flag is active |
- FluxBench Framework: https://github.com/anomalous-behavior/flux (benchmark harness)
- Backend Implementations: ../src/runtime/{cpu,cuda,wgpu}/
- Operation Kernels: ../src/runtime/cpu/kernels/, ../src/runtime/cpu/helpers/
When adding new operations to numr:
- Add CPU benchmarks first (at least 2 size scales)
- Add CPU vs reference comparisons
- Add verification gates (1.1x threshold)
- If CUDA-enabled, add CUDA benchmarks and expand comparisons
- Run full benchmark suite before committing
- Document expected performance in this README
Example workflow:
# After implementing new operation:
cargo bench --bench <suite> # Check CPU performance
cargo bench --bench <suite> --features cuda # Check CUDA if applicable
git diff benches/<suite>.rs # Review benchmark changes

Last Updated: 2026-02-11 numr Version: 0.4.0 Benchmark Framework: FluxBench Supported Backends: CPU (default), CUDA (--features cuda), WebGPU (planned)