flodl vs PyTorch

Ten models, ten interleaved rounds, idle machine. Same architectures, same optimizer, same CUDA kernels — the only variable is the framework overhead.

| Model | PyTorch | flodl | Delta | PyTorch σ | flodl σ |
|---|---|---|---|---|---|
| transformer | 3183.0 ms | 2199.8 ms | -31% | ±0.8 | ±1.0 |
| mlp | 291.1 ms | 207.0 ms | -29% | ±4.0 | ±1.3 |
| residual_tower | 406.9 ms | 309.7 ms | -24% | ±6.0 | ±3.3 |
| feedback_fixed | 275.3 ms | 231.3 ms | -16% | ±10.0 | ±6.0 |
| gated_routing | 248.0 ms | 217.3 ms | -12% | ±9.7 | ±2.8 |
| iterative_refine | 230.7 ms | 206.0 ms | -11% | ±2.2 | ±3.1 |
| gru_seq | 1105.1 ms | 1057.5 ms | -4% | ±16.7 | ±25.4 |
| conv_autoenc | 398.2 ms | 395.3 ms | -1% | ±1.1 | ±3.7 |
| lstm_seq | 692.3 ms | 692.3 ms | 0% | ±23.3 | ±15.2 |
| convnet | 1298.0 ms | 1298.2 ms | 0% | ±0.3 | ±0.1 |

Median epoch time (ms) across 10 interleaved rounds, best-of-3 runs per round. σ = scaled MAD (robust to OS/GC outliers). RTX 5060 Ti, GPU at 3090 MHz. flodl 0.2.2 vs PyTorch 2.10.0+cu128.


The speed story

8 wins, 2 ties, zero regressions.
The ties prove both frameworks dispatch identical CUDA kernels. The wins show what happens when you remove the overhead between them.

Dispatch-bound models (transformer −31%, mlp −29%) make many calls to small GPU kernels. PyTorch routes each call through the Python interpreter and its C++ dispatcher; flodl calls libtorch directly via FFI. When a model chains hundreds of ops per epoch, the overhead compounds.

Graph-builder architectures (residual_tower −24%, feedback_fixed −16%, gated_routing −12%) use pre-computed Vec-indexed routing — no HashMap lookups, no dynamic allocation in the forward pass.
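To make the idea concrete, here is a hypothetical pure-Python sketch of pre-computed, index-based graph routing — not flodl's actual API. All name hashing happens once at graph-compile time; the forward pass is flat list indexing with no hash lookups and no per-step allocation of routing structures:

```python
def compile_graph(edges, order):
    # edges: {node_name: [input_node_names]}; order: topological node order.
    # Names are hashed exactly once here, never in the hot loop.
    idx = {name: i for i, name in enumerate(order)}
    return [[idx[src] for src in edges[name]] for name in order]

def forward(plan, ops, x):
    # plan[i] holds the integer indices of node i's inputs.
    outputs = [None] * len(plan)
    outputs[0] = x                              # node 0 is the graph input
    for i in range(1, len(plan)):
        args = [outputs[j] for j in plan[i]]    # flat indexing, no dict lookups
        outputs[i] = ops[i](*args)
    return outputs[-1]

plan = compile_graph({"in": [], "a": ["in"], "out": ["in", "a"]},
                     ["in", "a", "out"])
ops = [None, lambda v: v * 2, lambda u, v: u + v]
result = forward(plan, ops, 3)   # a -> 6, out -> 3 + 6 = 9
```

The same separation of one-time compilation from a lookup-free hot path is what the Vec-indexed routing described above buys in Rust.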

Compute-bound ties (convnet 0%, lstm_seq 0%) spend >99% of time inside cuDNN kernels. Framework overhead is invisible. This is the cleanest evidence that the benchmark measures dispatch overhead, not CUDA kernel differences.


Methodology: honest variance

Variance is reported as scaled MAD, not standard deviation.
Here's why, and what difference it makes.

Standard deviation treats every outlier as real variance. In GPU benchmarking, that's wrong. Python's garbage collector fires at unpredictable intervals, creating 50–170% timing spikes in individual rounds. CUDA scheduling stalls cause similar (rarer, sometimes larger) spikes on the Rust side.

Scaled MAD (Median Absolute Deviation × 1.4826) is σ-equivalent for normal distributions but ignores outlier rounds from either side. It reports what each framework does in steady state, not what happens when the OS interferes.
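A minimal illustration of the difference, using hypothetical round timings (a steady ~100 ms epoch plus one GC spike, not numbers from the benchmark):

```python
import statistics

def scaled_mad(xs):
    # Median Absolute Deviation scaled by 1.4826, which makes it match
    # the standard deviation on normally distributed data.
    m = statistics.median(xs)
    return 1.4826 * statistics.median(abs(x - m) for x in xs)

rounds = [100.0, 101.0, 99.0, 100.0, 250.0]   # one GC-spike outlier
sigma = statistics.stdev(rounds)               # blown up by the spike (~67 ms)
robust = scaled_mad(rounds)                    # steady-state spread (~1.5 ms)
```

The standard deviation reports the spike; the scaled MAD reports how the framework actually runs between spikes.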

The raw per-round data is published in benchmarks/rounds/. Anyone can compute both metrics and inspect the outliers directly.

Deployment footprint: the flodl benchmark Docker image is 26.86 GB vs PyTorch's 38.45 GB — 30% smaller. No Python, no pip, just the Rust binary and libtorch. On clusters with cold starts, 12 GB less per node is real wall-clock savings.


Why flodl is faster

Same CUDA kernels. The difference is what happens between them.

Zero dispatch overhead

Rust calls libtorch C++ directly via FFI. No Python interpreter, no dispatch layer between the model code and the kernels. For dispatch-bound models (transformer, MLP, residual tower), this alone accounts for the 24–31% gains.

Fused optimizer kernels

Adam/AdamW uses a single multi-tensor CUDA kernel instead of 4N per-parameter launches. Gradient clipping is 2 kernels total via foreach_norm + foreach_mul.
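The clipping math itself is two passes regardless of parameter count, which is why it fits in two batched kernels. A pure-Python sketch of that math (on the GPU, each pass runs as one batched CUDA kernel over all gradients):

```python
import math

def clip_global_norm(grads, max_norm, eps=1e-6):
    # Pass 1 ("foreach_norm"): one reduction over every gradient tensor
    # yields the global L2 norm.
    total = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Pass 2 ("foreach_mul"): every gradient is rescaled by the same factor.
    scale = min(1.0, max_norm / (total + eps))
    return [[g * scale for g in grad] for grad in grads]

# Two "parameters" with gradients [3.0] and [4.0]: global norm 5.0,
# clipped to max_norm 1.0, so each gradient is scaled by ~0.2.
clipped = clip_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Doing this per-parameter instead would cost one norm kernel plus one scale kernel for each of the N parameters, hence the 2N the text refers to.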

Pre-computed graph routing

Forward pass dispatch is flat array indexing — no HashMap lookups, no dynamic allocation. Gate combination uses vectorized stack+broadcast+sum.
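The stack+broadcast+sum gate combination reduces to a single weighted sum over branch outputs. A toy pure-Python sketch of that reduction (the real path runs it as vectorized tensor ops on the GPU, not an element loop):

```python
def combine_gated(branches, gates):
    # branches: list of equal-length output vectors, one per gated branch.
    # gates: one scalar weight per branch.
    # Equivalent to: stack branches -> broadcast-multiply by gates -> sum
    # over the branch axis.
    dim = len(branches[0])
    return [sum(g * b[i] for g, b in zip(gates, branches)) for i in range(dim)]

# Two branches mixed 50/50: elementwise average of their outputs.
mixed = combine_gated([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5])
```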

Fused RNN with cached params

LSTM and GRU call cuDNN's fused sequence kernels with C++-side cached parameter handles. Zero per-forward FFI overhead — same strategy as PyTorch's flatten_parameters().

The convnet and lstm_seq ties at 0% prove both frameworks saturate the GPU on compute-bound models. The speed advantage appears precisely where framework overhead dominates.


Optimizations since v0.1.1

The first benchmark measured a 19% speedup. Then we kept going.

| Optimization | What it does |
|---|---|
| Fused Adam/AdamW | Single multi-tensor CUDA kernel for the full optimizer step |
| Foreach ops (7 batched kernels) | One kernel for N parameters: zero, norm, scale, lerp, sqrt, add |
| Fused gradient clipping | 2 kernels instead of 2N via foreach_norm + foreach_mul |
| Fused RNN sequences | Single cuDNN kernel for full LSTM/GRU sequence across all layers |
| RNN param caching | C++ RnnParams handle eliminates per-forward FFI overhead |
| CUDA Graphs | Capture/replay kernel sequences to eliminate CPU dispatch overhead |
| Automatic mixed precision | Autocast + GradScaler for fp16/bf16 on Tensor Core GPUs |
| Channels-last memory | NHWC layout for Conv2d: 8–35% faster on Tensor Core hardware |
| Async device transfer | pin_memory + copy_(non_blocking) + to_device_async |
| Pre-computed graph routing | Vec-indexed dispatch, cached buffers, loop fast-path |

All optimizations are automatic — no code changes needed. The benchmark suite does not enable CUDA Graphs, mixed precision, or channels-last to ensure a fair comparison.


Earlier: FBRL letter model (v0.1.1)

A real training workload on a GTX 1060 6GB — before the optimizations above. This is the actual training dashboard, not a mock.

| Metric | PyTorch 2.5.1 | flodl 0.1.1 | Delta |
|---|---|---|---|
| Avg epoch | 49.7 s | 40.3 s | -19% |
| Total | 82m 50s | 67m 10s | -19% |
| GPU utilization | ~80% (spiky) | 88–92% (flat) | more stable |
| Epoch σ | 0.85 s | 0.10 s | 8.5× tighter |
| Live dashboard | no | yes | |

flodl v0.1.1 on GTX 1060 6GB, March 2026. Raw data at fbrl@5c58d71.


Reproduce

$ git clone https://github.com/fab2s/floDl.git
$ cd floDl

# Quick single-round benchmark
$ make bench

# Publication benchmark (10 interleaved rounds, locked clocks, 15s warmup)
$ make bench-publish

Runs entirely in Docker. Per-round JSON files, merged results, and final report saved to benchmarks/. See the full benchmark report for methodology, environment details, and honest accounting.