Ten models, ten interleaved rounds, idle machine. Same architectures, same optimizer, same CUDA kernels — the only variable is the framework overhead.
| Model | PyTorch | flodl | Delta | PyTorch σ | flodl σ |
|---|---|---|---|---|---|
| transformer | 3183.0 ms | 2199.8 ms | -31% | ±0.8 | ±1.0 |
| mlp | 291.1 ms | 207.0 ms | -29% | ±4.0 | ±1.3 |
| residual_tower | 406.9 ms | 309.7 ms | -24% | ±6.0 | ±3.3 |
| feedback_fixed | 275.3 ms | 231.3 ms | -16% | ±10.0 | ±6.0 |
| gated_routing | 248.0 ms | 217.3 ms | -12% | ±9.7 | ±2.8 |
| iterative_refine | 230.7 ms | 206.0 ms | -11% | ±2.2 | ±3.1 |
| gru_seq | 1105.1 ms | 1057.5 ms | -4% | ±16.7 | ±25.4 |
| conv_autoenc | 398.2 ms | 395.3 ms | -1% | ±1.1 | ±3.7 |
| lstm_seq | 692.3 ms | 692.3 ms | 0% | ±23.3 | ±15.2 |
| convnet | 1298.0 ms | 1298.2 ms | 0% | ±0.3 | ±0.1 |
Median epoch time (ms) across 10 interleaved rounds, best-of-3 runs per round. σ = scaled MAD (robust to OS/GC outliers). RTX 5060 Ti, GPU at 3090 MHz. flodl 0.2.2 vs PyTorch 2.10.0+cu128.
8 wins, 2 ties, zero regressions.
The ties prove both frameworks dispatch identical CUDA kernels.
The wins show what happens when you remove the overhead between them.
Dispatch-bound models (transformer −31%, mlp −29%) make many calls to small GPU kernels. PyTorch goes through Python → TorchScript dispatch → C++. flodl calls libtorch directly via FFI. When a model chains hundreds of ops per epoch, the overhead compounds.
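A back-of-envelope sketch of why the overhead compounds. All numbers here are assumed for illustration, not measured from the benchmark: the point is that per-op dispatch cost scales linearly with op count, so shrinking it moves dispatch-bound models by double-digit percentages while leaving compute-bound ones alone.

```rust
// Illustrative only: assumed numbers, not measurements.
// Models per-epoch cost as ops * (kernel_time + dispatch_overhead).
fn epoch_time_us(ops_per_epoch: u64, kernel_us: f64, dispatch_us: f64) -> f64 {
    ops_per_epoch as f64 * (kernel_us + dispatch_us)
}

fn main() {
    let ops = 50_000; // hypothetical small-kernel launches per epoch
    let kernel = 40.0; // µs of GPU work per op (assumed)
    let py = epoch_time_us(ops, kernel, 25.0); // assumed interpreter-side dispatch
    let rs = epoch_time_us(ops, kernel, 4.0); // assumed direct-FFI dispatch
    println!(
        "interpreter-ish: {:.1} ms, ffi-ish: {:.1} ms, delta: {:.0}%",
        py / 1000.0,
        rs / 1000.0,
        100.0 * (py - rs) / py
    );
}
```

With these made-up constants the delta lands near −30%; with a kernel time 100× larger it vanishes, which is exactly the compute-bound-tie pattern in the table.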
Graph-builder architectures (residual_tower −24%, feedback_fixed −16%, gated_routing −12%) use pre-computed Vec-indexed routing — no HashMap lookups, no dynamic allocation in the forward pass.
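The routing idea in miniature, with hypothetical types (flodl's real internals differ — these are stand-in scalar ops, not tensors): resolve every node's input names to slot indices once at graph-build time, so the forward loop is pure flat indexing into a reused buffer.

```rust
// Hypothetical sketch of pre-computed Vec-indexed routing — not flodl's real API.
#[derive(Clone, Copy)]
enum Op {
    Add,
    Mul,
    Relu,
}

#[derive(Clone, Copy)]
struct Node {
    op: Op,
    a: usize, // input slots, resolved once at graph-build time
    b: usize, // unused by unary ops
}

struct Graph {
    nodes: Vec<Node>,
    buf: Vec<f64>, // cached activation buffer, reused every forward pass
}

impl Graph {
    // Forward pass is flat array indexing: no HashMap lookups, no allocation.
    fn forward(&mut self, inputs: &[f64]) -> f64 {
        let n = inputs.len();
        self.buf[..n].copy_from_slice(inputs);
        for i in 0..self.nodes.len() {
            let node = self.nodes[i];
            let (x, y) = (self.buf[node.a], self.buf[node.b]);
            self.buf[n + i] = match node.op {
                Op::Add => x + y,
                Op::Mul => x * y,
                Op::Relu => x.max(0.0),
            };
        }
        *self.buf.last().unwrap()
    }
}

// relu((x + y) * x) with routing precomputed: names already resolved to slots.
fn demo() -> f64 {
    let mut g = Graph {
        nodes: vec![
            Node { op: Op::Add, a: 0, b: 1 },  // slot 2 = x + y
            Node { op: Op::Mul, a: 2, b: 0 },  // slot 3 = (x + y) * x
            Node { op: Op::Relu, a: 3, b: 3 }, // slot 4 = relu(slot 3)
        ],
        buf: vec![0.0; 5],
    };
    g.forward(&[2.0, 3.0])
}

fn main() {
    println!("{}", demo()); // (2 + 3) * 2 = 10
}
```

The expensive work (name resolution, buffer sizing) happens once at build time; the hot loop touches only `Vec` slots.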
Compute-bound ties (convnet 0%, lstm_seq 0%) spend >99% of time inside cuDNN kernels. Framework overhead is invisible. This is the cleanest evidence that the benchmark measures dispatch overhead, not CUDA kernel differences.
Variance is reported as scaled MAD, not standard deviation.
Here's why, and what difference it makes.
Standard deviation treats every outlier as real variance. In GPU benchmarking, that's wrong. Python's garbage collector fires at unpredictable intervals, creating 50–170% timing spikes in individual rounds. CUDA scheduling stalls cause similar (rarer, sometimes larger) spikes on the Rust side.
Scaled MAD (Median Absolute Deviation × 1.4826) is σ-equivalent for normal distributions but ignores outlier rounds from either side. It reports what each framework does in steady state, not what happens when the OS interferes.
The raw per-round data is published in benchmarks/rounds/.
Anyone can compute both metrics and inspect the outliers directly.
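The metric itself is a few lines to reproduce — a minimal sketch of scaled MAD exactly as defined above:

```rust
// Scaled MAD: median(|x - median(x)|) * 1.4826.
// The 1.4826 factor makes it match sigma on normally distributed data,
// while a single outlier round barely moves it.
fn median(xs: &mut [f64]) -> f64 {
    xs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = xs.len();
    if n % 2 == 1 {
        xs[n / 2]
    } else {
        (xs[n / 2 - 1] + xs[n / 2]) / 2.0
    }
}

fn scaled_mad(samples: &[f64]) -> f64 {
    let mut xs = samples.to_vec();
    let m = median(&mut xs);
    let mut devs: Vec<f64> = samples.iter().map(|x| (x - m).abs()).collect();
    1.4826 * median(&mut devs)
}

fn main() {
    // Nine steady rounds plus one GC-style spike: std-dev balloons, MAD shrugs.
    let rounds = [291.0, 292.0, 290.0, 291.5, 291.2, 290.8, 291.1, 292.3, 290.9, 460.0];
    println!("scaled MAD = {:.2} ms", scaled_mad(&rounds));
}
```

Run it against the published per-round JSON and compare with plain standard deviation on the same rounds to see how much the outliers distort the latter.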
Deployment footprint: the flodl benchmark Docker image is 26.86 GB vs PyTorch's 38.45 GB — 30% smaller. No Python, no pip, just the Rust binary and libtorch. On clusters with cold starts, 12 GB less per node is real wall-clock savings.
Same CUDA kernels. The difference is what happens between them.
Rust calls libtorch C++ directly via FFI. No Python interpreter, no TorchScript dispatch layer. For dispatch-bound models (transformer, MLP, residual tower), this alone is worth 24–31%.
Adam/AdamW uses a single multi-tensor CUDA kernel instead of 4N per-parameter launches. Gradient clipping is 2 kernels total via foreach_norm + foreach_mul.
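Conceptually, fused clipping is two steps regardless of parameter count. A plain-Rust sketch (on the GPU these two steps map to the batched foreach_norm and foreach_mul kernels, not CPU loops — the code below only illustrates the shape of the computation):

```rust
// Conceptual sketch of fused gradient clipping: one global norm, one scale.
// On GPU each step is a single batched kernel launch over all N tensors.
fn clip_grad_norm(grads: &mut [Vec<f32>], max_norm: f32) -> f32 {
    // Step 1 (foreach_norm-style): one global L2 norm across every tensor.
    let total: f32 = grads.iter().flatten().map(|g| g * g).sum::<f32>().sqrt();
    // Step 2 (foreach_mul-style): one scale factor applied to every tensor.
    if total > max_norm {
        let scale = max_norm / total;
        for tensor in grads.iter_mut() {
            for g in tensor.iter_mut() {
                *g *= scale;
            }
        }
    }
    total
}

fn main() {
    let mut grads = vec![vec![3.0, 0.0], vec![0.0, 4.0]]; // global norm = 5
    let norm = clip_grad_norm(&mut grads, 1.0);
    println!("pre-clip norm {norm}, post-clip {grads:?}");
}
```

The per-parameter variant would issue one norm and one multiply per tensor (2N launches); batching collapses that to 2 regardless of model size.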
Forward pass dispatch is flat array indexing — no HashMap lookups, no dynamic allocation. Gate combination uses vectorized stack+broadcast+sum.
LSTM and GRU call cuDNN's fused sequence kernels with C++-side cached parameter handles. Zero per-forward FFI overhead — same strategy as PyTorch's flatten_parameters().
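The caching idea in miniature, with hypothetical types (flodl's real handle is a C++-side object holding cuDNN-ready weights, not this struct): marshal the parameters once, lazily, and reuse the handle on every forward instead of rebuilding it per call.

```rust
// Hypothetical sketch of one-time parameter caching — not flodl's actual types.
struct RnnParams {
    flat: Vec<f32>, // weights flattened once into the layout the kernel expects
}

struct Lstm {
    weights: Vec<Vec<f32>>,   // per-layer weight tensors (stand-ins)
    cache: Option<RnnParams>, // built lazily, reused on every forward
}

impl Lstm {
    fn params(&mut self) -> &RnnParams {
        if self.cache.is_none() {
            // Expensive marshalling happens once, not per forward call.
            let flat: Vec<f32> = self.weights.iter().flatten().copied().collect();
            self.cache = Some(RnnParams { flat });
        }
        self.cache.as_ref().unwrap()
    }
}

fn main() {
    let mut lstm = Lstm {
        weights: vec![vec![1.0, 2.0], vec![3.0]],
        cache: None,
    };
    // First call builds the handle; later calls just hand back the cached one.
    println!("{} cached weights", lstm.params().flat.len());
}
```

After the first forward, each subsequent call is a cheap pointer hand-off — the same reasoning behind PyTorch's flatten_parameters().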
The convnet and lstm_seq ties at 0% prove both frameworks saturate the GPU on compute-bound models. The speed advantage appears precisely where framework overhead dominates.
The first benchmark measured a 19% speedup. Then we kept going.
| Optimization | What it does |
|---|---|
| Fused Adam/AdamW | Single multi-tensor CUDA kernel for the full optimizer step |
| Foreach ops (7 batched kernels) | One kernel for N parameters: zero, norm, scale, lerp, sqrt, add |
| Fused gradient clipping | 2 kernels instead of 2N via foreach_norm + foreach_mul |
| Fused RNN sequences | Single cuDNN kernel for full LSTM/GRU sequence across all layers |
| RNN param caching | C++ RnnParams handle eliminates per-forward FFI overhead |
| CUDA Graphs | Capture/replay kernel sequences — eliminate CPU dispatch overhead |
| Automatic mixed precision | Autocast + GradScaler for fp16/bf16 on Tensor Core GPUs |
| Channels-last memory | NHWC layout for Conv2d: 8–35% on Tensor Core hardware |
| Async device transfer | pin_memory + copy_(non_blocking) + to_device_async |
| Pre-computed graph routing | Vec-indexed dispatch, cached buffers, loop fast-path |
All optimizations are automatic — no code changes needed. The benchmark suite does not enable CUDA Graphs, mixed precision, or channels-last to ensure a fair comparison.
A real training workload on a GTX 1060 6GB — before the optimizations above. This is the actual training dashboard, not a mock.
| Metric | PyTorch 2.5.1 | flodl 0.1.1 | Delta |
|---|---|---|---|
| Avg epoch | 49.7s | 40.3s | -19% |
| Total | 82m 50s | 67m 10s | -19% |
| GPU utilization | ~80% (spiky) | 88–92% (flat) | more stable |
| Epoch σ | 0.85s | 0.10s | 8.5x tighter |
| Live dashboard | no | yes | — |
flodl v0.1.1 on GTX 1060 6GB, March 2026. Raw data at fbrl@5c58d71.
```shell
$ git clone https://github.com/fab2s/floDl.git
$ cd floDl

# Quick single-round benchmark
$ make bench

# Publication benchmark (10 interleaved rounds, locked clocks, 15s warmup)
$ make bench-publish
```
Runs entirely in Docker. Per-round JSON files, merged results, and the final report are saved to benchmarks/. See the full benchmark report for methodology, environment details, and honest accounting.