Ten models, ten interleaved rounds, idle machine. Same architectures, same optimizer, same CUDA kernels — the only variable is the framework overhead.
| Model | PyTorch | flodl | Delta | PyTorch σ | flodl σ |
|---|---|---|---|---|---|
| transformer | 3183.0 ms | 2199.8 ms | -31% | ±0.8 | ±1.0 |
| mlp | 291.1 ms | 207.0 ms | -29% | ±4.0 | ±1.3 |
| residual_tower | 406.9 ms | 309.7 ms | -24% | ±6.0 | ±3.3 |
| feedback_fixed | 275.3 ms | 231.3 ms | -16% | ±10.0 | ±6.0 |
| gated_routing | 248.0 ms | 217.3 ms | -12% | ±9.7 | ±2.8 |
| iterative_refine | 230.7 ms | 206.0 ms | -11% | ±2.2 | ±3.1 |
| gru_seq | 1105.1 ms | 1057.5 ms | -4% | ±16.7 | ±25.4 |
| conv_autoenc | 398.2 ms | 395.3 ms | -1% | ±1.1 | ±3.7 |
| lstm_seq | 692.3 ms | 692.3 ms | 0% | ±23.3 | ±15.2 |
| convnet | 1298.0 ms | 1298.2 ms | 0% | ±0.3 | ±0.1 |
Median epoch time (ms) across 10 interleaved rounds, best-of-3 runs per round. σ = scaled MAD (robust to OS/GC outliers). RTX 5060 Ti, GPU at 3090 MHz. flodl 0.2.2 vs PyTorch 2.10.0+cu128.
8 wins, 2 ties, zero regressions.
The ties prove both frameworks dispatch identical CUDA kernels.
The wins show what happens when you remove the overhead between them.
Dispatch-bound models (transformer −31%, mlp −29%) make many calls to small GPU kernels. PyTorch goes through Python → TorchScript dispatch → C++. flodl calls libtorch directly via FFI. When a model chains hundreds of ops per epoch, the overhead compounds.
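A back-of-envelope sketch of why the overhead compounds. All numbers here are assumed for illustration, not measured from the benchmark: the point is that per-op dispatch cost scales linearly with op count, so shrinking it moves dispatch-bound models by double-digit percentages while leaving compute-bound ones alone.

```rust
// Illustrative only: assumed numbers, not measurements.
// Models per-epoch cost as ops * (kernel_time + dispatch_overhead).
fn epoch_time_us(ops_per_epoch: u64, kernel_us: f64, dispatch_us: f64) -> f64 {
    ops_per_epoch as f64 * (kernel_us + dispatch_us)
}

fn main() {
    let ops = 50_000; // hypothetical small-kernel launches per epoch
    let kernel = 40.0; // µs of GPU work per op (assumed)
    let py = epoch_time_us(ops, kernel, 25.0); // assumed interpreter-side dispatch
    let rs = epoch_time_us(ops, kernel, 4.0); // assumed direct-FFI dispatch
    println!(
        "interpreter-ish: {:.1} ms, ffi-ish: {:.1} ms, delta: {:.0}%",
        py / 1000.0,
        rs / 1000.0,
        100.0 * (py - rs) / py
    );
}
```

With these made-up constants the delta lands near −30%; with a kernel time 100× larger it vanishes, which is exactly the compute-bound-tie pattern in the table.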
Graph-builder architectures (residual_tower −24%, feedback_fixed −16%, gated_routing −12%) use pre-computed Vec-indexed routing — no HashMap lookups, no dynamic allocation in the forward pass.
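The routing idea in miniature, with hypothetical types (flodl's real internals differ — these are stand-in scalar ops, not tensors): resolve every node's input names to slot indices once at graph-build time, so the forward loop is pure flat indexing into a reused buffer.

```rust
// Hypothetical sketch of pre-computed Vec-indexed routing — not flodl's real API.
#[derive(Clone, Copy)]
enum Op {
    Add,
    Mul,
    Relu,
}

#[derive(Clone, Copy)]
struct Node {
    op: Op,
    a: usize, // input slots, resolved once at graph-build time
    b: usize, // unused by unary ops
}

struct Graph {
    nodes: Vec<Node>,
    buf: Vec<f64>, // cached activation buffer, reused every forward pass
}

impl Graph {
    // Forward pass is flat array indexing: no HashMap lookups, no allocation.
    fn forward(&mut self, inputs: &[f64]) -> f64 {
        let n = inputs.len();
        self.buf[..n].copy_from_slice(inputs);
        for i in 0..self.nodes.len() {
            let node = self.nodes[i];
            let (x, y) = (self.buf[node.a], self.buf[node.b]);
            self.buf[n + i] = match node.op {
                Op::Add => x + y,
                Op::Mul => x * y,
                Op::Relu => x.max(0.0),
            };
        }
        *self.buf.last().unwrap()
    }
}

// relu((x + y) * x) with routing precomputed: names already resolved to slots.
fn demo() -> f64 {
    let mut g = Graph {
        nodes: vec![
            Node { op: Op::Add, a: 0, b: 1 },  // slot 2 = x + y
            Node { op: Op::Mul, a: 2, b: 0 },  // slot 3 = (x + y) * x
            Node { op: Op::Relu, a: 3, b: 3 }, // slot 4 = relu(slot 3)
        ],
        buf: vec![0.0; 5],
    };
    g.forward(&[2.0, 3.0])
}

fn main() {
    println!("{}", demo()); // (2 + 3) * 2 = 10
}
```

The expensive work (name resolution, buffer sizing) happens once at build time; the hot loop touches only `Vec` slots.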
Compute-bound ties (convnet 0%, lstm_seq 0%) spend >99% of time inside cuDNN kernels. Framework overhead is invisible. This is the cleanest evidence that the benchmark measures dispatch overhead, not CUDA kernel differences.
Variance is reported as scaled MAD, not standard deviation.
Here's why, and what difference it makes.
Standard deviation treats every outlier as real variance. In GPU benchmarking, that's wrong. Python's garbage collector fires at unpredictable intervals, creating 50–170% timing spikes in individual rounds. CUDA scheduling stalls cause similar (rarer, sometimes larger) spikes on the Rust side.
Scaled MAD (Median Absolute Deviation × 1.4826) is σ-equivalent for normal distributions but ignores outlier rounds from either side. It reports what each framework does in steady state, not what happens when the OS interferes.
The raw per-round data is published in benchmarks/rounds/.
Anyone can compute both metrics and inspect the outliers directly.
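The metric itself is a few lines to reproduce — a minimal sketch of scaled MAD exactly as defined above:

```rust
// Scaled MAD: median(|x - median(x)|) * 1.4826.
// The 1.4826 factor makes it match sigma on normally distributed data,
// while a single outlier round barely moves it.
fn median(xs: &mut [f64]) -> f64 {
    xs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = xs.len();
    if n % 2 == 1 {
        xs[n / 2]
    } else {
        (xs[n / 2 - 1] + xs[n / 2]) / 2.0
    }
}

fn scaled_mad(samples: &[f64]) -> f64 {
    let mut xs = samples.to_vec();
    let m = median(&mut xs);
    let mut devs: Vec<f64> = samples.iter().map(|x| (x - m).abs()).collect();
    1.4826 * median(&mut devs)
}

fn main() {
    // Nine steady rounds plus one GC-style spike: std-dev balloons, MAD shrugs.
    let rounds = [291.0, 292.0, 290.0, 291.5, 291.2, 290.8, 291.1, 292.3, 290.9, 460.0];
    println!("scaled MAD = {:.2} ms", scaled_mad(&rounds));
}
```

Run it against the published per-round JSON and compare with plain standard deviation on the same rounds to see how much the outliers distort the latter.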
Deployment footprint: the flodl benchmark Docker image is 26.86 GB vs PyTorch's 38.45 GB — 30% smaller. No Python, no pip, just the Rust binary and libtorch. On clusters with cold starts, 12 GB less per node is real wall-clock savings.
Same CUDA kernels. The difference is what happens between them.
Rust calls libtorch C++ directly via FFI. No Python interpreter, no TorchScript dispatch layer. For dispatch-bound models (transformer, MLP, residual tower), this alone is worth 24–31%.
Adam/AdamW uses a single multi-tensor CUDA kernel instead of 4N per-parameter launches. Gradient clipping is 2 kernels total via foreach_norm + foreach_mul.
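Conceptually, fused clipping is two steps regardless of parameter count. A plain-Rust sketch (on the GPU these two steps map to the batched foreach_norm and foreach_mul kernels, not CPU loops — the code below only illustrates the shape of the computation):

```rust
// Conceptual sketch of fused gradient clipping: one global norm, one scale.
// On GPU each step is a single batched kernel launch over all N tensors.
fn clip_grad_norm(grads: &mut [Vec<f32>], max_norm: f32) -> f32 {
    // Step 1 (foreach_norm-style): one global L2 norm across every tensor.
    let total: f32 = grads.iter().flatten().map(|g| g * g).sum::<f32>().sqrt();
    // Step 2 (foreach_mul-style): one scale factor applied to every tensor.
    if total > max_norm {
        let scale = max_norm / total;
        for tensor in grads.iter_mut() {
            for g in tensor.iter_mut() {
                *g *= scale;
            }
        }
    }
    total
}

fn main() {
    let mut grads = vec![vec![3.0, 0.0], vec![0.0, 4.0]]; // global norm = 5
    let norm = clip_grad_norm(&mut grads, 1.0);
    println!("pre-clip norm {norm}, post-clip {grads:?}");
}
```

The per-parameter variant would issue one norm and one multiply per tensor (2N launches); batching collapses that to 2 regardless of model size.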
Forward pass dispatch is flat array indexing — no HashMap lookups, no dynamic allocation. Gate combination uses vectorized stack+broadcast+sum.
LSTM and GRU call cuDNN's fused sequence kernels with C++-side cached parameter handles. Zero per-forward FFI overhead — same strategy as PyTorch's flatten_parameters().
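The caching idea in miniature, with hypothetical types (flodl's real handle is a C++-side object holding cuDNN-ready weights, not this struct): marshal the parameters once, lazily, and reuse the handle on every forward instead of rebuilding it per call.

```rust
// Hypothetical sketch of one-time parameter caching — not flodl's actual types.
struct RnnParams {
    flat: Vec<f32>, // weights flattened once into the layout the kernel expects
}

struct Lstm {
    weights: Vec<Vec<f32>>,   // per-layer weight tensors (stand-ins)
    cache: Option<RnnParams>, // built lazily, reused on every forward
}

impl Lstm {
    fn params(&mut self) -> &RnnParams {
        if self.cache.is_none() {
            // Expensive marshalling happens once, not per forward call.
            let flat: Vec<f32> = self.weights.iter().flatten().copied().collect();
            self.cache = Some(RnnParams { flat });
        }
        self.cache.as_ref().unwrap()
    }
}

fn main() {
    let mut lstm = Lstm {
        weights: vec![vec![1.0, 2.0], vec![3.0]],
        cache: None,
    };
    // First call builds the handle; later calls just hand back the cached one.
    println!("{} cached weights", lstm.params().flat.len());
}
```

After the first forward, each subsequent call is a cheap pointer hand-off — the same reasoning behind PyTorch's flatten_parameters().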
The convnet and lstm_seq ties at 0% prove both frameworks saturate the GPU on compute-bound models. The speed advantage appears precisely where framework overhead dominates.
The first benchmark measured a 19% speedup. Then we kept going.
| Optimization | What it does |
|---|---|
| Fused Adam/AdamW | Single multi-tensor CUDA kernel for the full optimizer step |
| Foreach ops (7 batched kernels) | One kernel for N parameters: zero, norm, scale, lerp, sqrt, add |
| Fused gradient clipping | 2 kernels instead of 2N via foreach_norm + foreach_mul |
| Fused RNN sequences | Single cuDNN kernel for full LSTM/GRU sequence across all layers |
| RNN param caching | C++ RnnParams handle eliminates per-forward FFI overhead |
| CUDA Graphs | Capture/replay kernel sequences — eliminate CPU dispatch overhead |
| Automatic mixed precision | Autocast + GradScaler for fp16/bf16 on Tensor Core GPUs |
| Channels-last memory | NHWC layout for Conv2d: 8–35% on Tensor Core hardware |
| Async device transfer | pin_memory + copy_(non_blocking) + to_device_async |
| Pre-computed graph routing | Vec-indexed dispatch, cached buffers, loop fast-path |
All optimizations are automatic — no code changes needed. The benchmark suite does not enable CUDA Graphs, mixed precision, or channels-last to ensure a fair comparison.
A real training workload on a GTX 1060 6GB — before the optimizations above. This is the actual training dashboard, not a mock.
| Metric | PyTorch 2.5.1 | flodl 0.1.1 | Delta |
|---|---|---|---|
| Avg epoch | 49.7s | 40.3s | -19% |
| Total | 82m 50s | 67m 10s | -19% |
| GPU utilization | ~80% (spiky) | 88–92% (flat) | more stable |
| Epoch σ | 0.85s | 0.10s | 8.5x tighter |
| Live dashboard | no | yes | — |
flodl v0.1.1 on GTX 1060 6GB, March 2026. Raw data at fbrl@5c58d71.
```shell
$ git clone https://github.com/fab2s/floDl.git
$ cd floDl

# Quick single-round benchmark
$ make bench

# Publication benchmark (10 interleaved rounds, locked clocks, 15s warmup)
$ make bench-publish
```
Runs entirely in Docker. Per-round JSON files, merged results, and the final report are saved to benchmarks/. See the full benchmark report for methodology, environment details, and honest accounting.