There are many strong libraries for numerical computing. Most of them are written in C, C++, and Fortran, with excellent Rust wrappers and Python bindings on top.
Rust is especially convenient for dependency management and reproducible benchmarking, which makes it a good place to line up apples-to-apples comparisons across native crates and their Python bindings. NumWars exists for the same reason StringWars exists for StringZilla: to compare NumKong against mainstream CPU stacks on the workloads it was built for, including:
- `ndarray` and `nalgebra` for dense tensor and linear-algebra kernels.
- `faer` and `matrixmultiply` for GEMM-like Rust baselines.
- `geo` for geographic distances.
- `polars` for reduction-heavy analytics workloads.
- NumPy, SciPy, and scikit-learn on Python.
Of course, the APIs and internal kernels of those projects differ. So this repository focuses on the workload families NumKong was designed for and compares their effective throughput using the native unit for each operation family, instead of forcing everything into a single synthetic ops/s figure.
Important
The numbers below are reference measurements collected on an Intel Sapphire Rapids CPU in single-threaded mode. They will vary with CPU model, compiler flags, BLAS backend, and problem size. Rebuild and rerun on your own hardware before treating them as absolute.
NumKong packed dots are mixed-precision by design: i8 inputs produce i32 outputs, bf16 and f16 inputs produce f32 outputs, and f32 inputs produce f64 outputs. The mainstream baselines shown here keep f32 → f32. Compared to the Rust baselines:
NumKong:
numkong::Tensor::dots_packed i8 → i32 ██████████████████████████████████ 1,357.36 GSO/s
numkong::Tensor::dots_packed bf16 → f32 █████████████████ 684.96 GSO/s
numkong::Tensor::dots_packed f16 → f32 ███ 106.63 GSO/s
numkong::Tensor::dots_packed f32 → f64 █ 42.04 GSO/s
Alternatives:
faer::linalg::matmul::matmul f32 → f32 ██ 81.21 GSO/s
matrixmultiply::sgemm f32 → f32 ██ 78.61 GSO/s
ndarray::ArrayBase::dot f32 → f32 ██ 78.55 GSO/s
nalgebra::DMatrix × DMatrixᵀ f32 → f32 ██ 74.21 GSO/s
Compared to Python:
NumKong:
numkong.dots_packed i8 → i32 ███████████████████████████████████████████ 1,110.31 GSO/s
numkong.dots_packed bf16 → f32 ███████████████████ 487.89 GSO/s
numkong.dots_packed f16 → f32 ████ 91.80 GSO/s
numkong.dots_packed f32 → f64 ██ 42.69 GSO/s
Alternatives:
numpy.matmul f32 → f32 ██████ 145.73 GSO/s
See dots/README.md for details.
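The payoff of the widening promotions is easy to demonstrate with plain NumPy. This is an illustrative sketch, not NumKong code: it widens i8 operands to i32 before the contraction, the same i8 → i32 shape as the `dots_packed` rows above.

```python
import numpy as np

# 256 products of 100 * 100: the true dot product is 2,560,000,
# far outside the i8 (and even i16) range.
a = np.full((1, 256), 100, dtype=np.int8)
b = np.full((256, 1), 100, dtype=np.int8)

# Widen before multiplying, mirroring the i8 -> i32 promotion above.
wide = a.astype(np.int32) @ b.astype(np.int32)
assert wide[0, 0] == 2_560_000

# A same-width accumulator silently wraps: each 100 * 100 product becomes
# 10_000 mod 256 = 16 in i8, and the 256 sixteens then wrap to zero.
narrow = (a * b.T).sum(dtype=np.int8)
assert narrow == 0
```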
Single-pair vector kernels at 2048 dimensions. This merges dot-product and true Euclidean-distance measurements into one throughput-sorted view. NumKong keeps its mixed-precision promotions, while the baseline libraries mostly stay in their input type.
Compared to the Rust baselines:
NumKong:
numkong::Dot::dot u8 → u32 ████████████████████████████████████ 54.28 GSO/s
numkong::Dot::dot i8 → i32 █████████████████████████████ 43.18 GSO/s
numkong::Euclidean::euclidean u8 → f32 ███████████████████████████ 40.83 GSO/s
numkong::Euclidean::euclidean i8 → f32 ███████████████████████ 34.10 GSO/s
numkong::Dot::dot bf16 → f32 █████████████ 20.09 GSO/s
numkong::Euclidean::euclidean bf16 → f32 █████████ 12.65 GSO/s
numkong::Dot::dot f32 → f64 ████ 6.12 GSO/s
numkong::Euclidean::euclidean f32 → f64 ████ 5.53 GSO/s
Alternatives:
ndarray::ArrayBase::dot f32 → f32 █████ 7.75 GSO/s
nalgebra::Matrix::dot f32 → f32 █████ 7.56 GSO/s
ndarray sqrt((a - b)·(a - b)) f32 → f32 ███ 4.75 GSO/s
nalgebra (a - b).norm() f32 → f32 ███ 4.63 GSO/s
Compared to Python:
NumKong:
numkong.euclidean u8 → f32 ███████████████████████████████████ 5.65 GSO/s
numkong.euclidean i8 → f32 ████████████████████████████████ 5.08 GSO/s
numkong.dot u8 → u32 ███████████████████████████████ 4.88 GSO/s
numkong.euclidean f32 → f64 █████████████████████ 3.33 GSO/s
numkong.dot i8 → i32 ████████████████████ 3.25 GSO/s
numkong.dot f32 → f64 █████████████████ 2.76 GSO/s
numkong.euclidean bf16 → f32 ███ 0.41 GSO/s
numkong.dot bf16 → f32 ██ 0.37 GSO/s
Alternatives:
scipy.linalg.blas.sdot f32 → f32 ████████████████████ 3.14 GSO/s
scipy.spatial.distance.euclidean u8 → f32 ███ 0.48 GSO/s
scipy.spatial.distance.euclidean i8 → f32 ██ 0.38 GSO/s
scipy.spatial.distance.euclidean f32 → f32 ██ 0.38 GSO/s
See similarity/README.md for details.
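The integer promotions matter for the Euclidean kernels too: subtracting u8 vectors without widening wraps modulo 256. Here is a hedged NumPy/SciPy sketch (not NumKong's implementation) of the failure mode the u8 → f32 rows avoid:

```python
import numpy as np
from scipy.spatial.distance import euclidean

rng = np.random.default_rng(42)
a = rng.integers(0, 256, size=2048, dtype=np.uint8)
b = rng.integers(0, 256, size=2048, dtype=np.uint8)

# Subtracting in u8 wraps wherever a < b, corrupting the distance.
d_wrapped = np.sqrt(((a - b).astype(np.float32) ** 2).sum())
# Widening first gives the true Euclidean distance.
d_true = np.sqrt(((a.astype(np.int32) - b.astype(np.int32)) ** 2).sum())

assert d_wrapped != d_true  # wrapped differences inflate the result
assert np.isclose(d_true, euclidean(a.astype(np.float64), b.astype(np.float64)))
```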
Matrix-vs-matrix comparisons at 2048 rows by 2048 dimensions. These are the packed many-to-many siblings of the pairwise spatial kernels above. The merged lists below include angular and Euclidean metrics, and the headline unit is GSO/s.
Compared to the Rust baselines:
NumKong:
numkong::Tensor::angulars_packed u8 → f32 ██████████████████████████████ 694.88 GSO/s
numkong::Tensor::angulars_packed i8 → f32 ██████████████████████████████ 686.93 GSO/s
numkong::Tensor::euclideans_packed i8 → f32 ██████████████████████████████ 685.67 GSO/s
numkong::Tensor::euclideans_packed u8 → f32 █████████████████████████████ 672.37 GSO/s
numkong::Tensor::angulars_packed bf16 → f32 █████████████ 304.59 GSO/s
numkong::Tensor::euclideans_packed bf16 → f32 █████████████ 302.61 GSO/s
numkong::Tensor::euclideans_packed f32 → f64 █ 21.22 GSO/s
numkong::Tensor::angulars_packed f32 → f64 █ 20.64 GSO/s
Alternatives:
ndarray angular matrix f32 → f32 ██ 38.20 GSO/s
nalgebra euclidean matrix f32 → f32 ██ 37.91 GSO/s
ndarray euclidean matrix f32 → f32 ██ 37.59 GSO/s
nalgebra angular matrix f32 → f32 ██ 36.97 GSO/s
Compared to Python through SciPy's `cdist`:
NumKong:
numkong.angulars_packed u8 → f32 ███████████████████████████████████████ 465.04 GSO/s
numkong.euclideans_packed u8 → f32 ███████████████████████████████████████ 463.47 GSO/s
numkong.euclideans_packed i8 → f32 ███████████████████████████████████████ 463.37 GSO/s
numkong.angulars_packed i8 → f32 ██████████████████████████████████████ 454.74 GSO/s
numkong.angulars_packed bf16 → f32 ███████████████████ 226.56 GSO/s
numkong.euclideans_packed bf16 → f32 ██████████████████ 210.12 GSO/s
numkong.euclideans_packed f32 → f64 ██ 20.24 GSO/s
numkong.angulars_packed f32 → f64 ██ 19.84 GSO/s
Alternatives:
scipy.cdist euclidean f32 → f64 █ 2.83 GSO/s
scipy.cdist cosine f32 → f64 █ 2.62 GSO/s
See similarities/README.md for details.
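For reference, the SciPy baseline in the Python list above is `scipy.spatial.distance.cdist`, which yields the full many-to-many distance matrix. A small sketch of what each entry contains:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
q = rng.standard_normal((8, 64))   # 8 query vectors
d = rng.standard_normal((16, 64))  # 16 document vectors

eu = cdist(q, d, metric="euclidean")  # (8, 16) pairwise distances
co = cdist(q, d, metric="cosine")     # (8, 16) of 1 - cosine similarity
assert eu.shape == (8, 16) and co.shape == (8, 16)

# Cross-check one entry against the hand-rolled formulas.
i, j = 3, 7
assert np.isclose(eu[i, j], np.sqrt(((q[i] - d[j]) ** 2).sum()))
assert np.isclose(
    co[i, j],
    1 - q[i] @ d[j] / (np.linalg.norm(q[i]) * np.linalg.norm(d[j])),
)
```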
Bandwidth-sensitive elementwise kernels (add and scale) over 1,000,000 elements, with the elementwise sum shown as a representative sample. In Rust:
NumKong:
numkong::EachSum i8 → i8 █████████████████████████████████████████████ 23.23 GB/s
numkong::EachSum f16 → f16 ███████████████████████████████████████ 19.93 GB/s
numkong::EachSum bf16 → bf16 █████████████████████████████████████ 19.08 GB/s
numkong::EachSum f32 → f32 ████████████████████████████████████ 18.73 GB/s
Alternatives:
serial code f32 → f32 ███████████████████████████████████████ 19.86 GB/s
nalgebra::add f32 → f32 ███████████████████████████████████████ 19.82 GB/s
ndarray::add f32 → f32 ██████████████████████████████████████ 19.79 GB/s
In Python:
NumKong:
numkong.add i8 → i8 ██████████████████████████████████████████ 30.91 GB/s
numkong.add f32 → f32 ████████████████████████████████████████ 29.39 GB/s
numkong.add f16 → f16 ███████████████████████████████████████ 28.84 GB/s
numkong.add f64 → f64 ███████████████████████████████████████ 28.79 GB/s
numkong.add bf16 → bf16 █████████████████████████████████████ 27.72 GB/s
Alternatives:
numpy.add i8 → i8 █████████████████████████████████████████████ 33.32 GB/s
numpy.add f32 → f32 ███████████████████████████████████ 25.65 GB/s
numpy.add f64 → f64 ██████████████████████████████████ 25.03 GB/s
numpy.add f16 → f16 █ 0.95 GB/s
See each/README.md for details.
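The GB/s figures above count the bytes an elementwise add must move: two input reads plus one output write per element. A rough NumPy timing sketch using that accounting (the helper name is ours, and absolute numbers depend entirely on your machine):

```python
import time
import numpy as np

def add_bandwidth_gbps(dtype, n=1_000_000, reps=50):
    """Effective bandwidth of numpy.add: 2 reads + 1 write per element."""
    a = np.ones(n, dtype=dtype)
    b = np.ones(n, dtype=dtype)
    out = np.empty_like(a)
    start = time.perf_counter()
    for _ in range(reps):
        np.add(a, b, out=out)
    elapsed = time.perf_counter() - start
    return 3 * a.itemsize * n * reps / elapsed / 1e9

print(f"f32 add: {add_bandwidth_gbps(np.float32):.2f} GB/s")
print(f"f16 add: {add_bandwidth_gbps(np.float16):.2f} GB/s")
```

On most CPUs the f16 line lands far below the others, echoing the `numpy.add f16` row above: NumPy has no native half-precision arithmetic, so it converts through f32.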
Horizontal reductions over 1,000,000 elements. The suite covers sum and row-wise L2 norms. In Rust:
ndarray::ArrayBase::sum f32 → f32 █████████████████████████████████████████████ 32.50 GB/s
polars::ChunkedArray::sum f32 → f32 ███████████████████████████████████████████ 31.26 GB/s
numkong::reduce_moments().sum f32 → f64 ██████████████████████████████████████████ 30.09 GB/s
serial sum loop f32 → f32 █████████ 6.38 GB/s
Row-wise L2 norms over a 2048Γ2048 matrix:
ndarray row norms f64 → f64 █████████████████████████████████████████████ 27.11 GB/s
numkong::Dot self-dot + sqrt bf16 → f32 █████████████████████████████████████████ 24.46 GB/s
ndarray row norms f32 → f32 ████████████████████████████████████ 21.63 GB/s
numkong::Dot self-dot + sqrt f32 █████████████████████████████████ 20.01 GB/s
serial row norms loop f32 → f32 ███████████ 6.54 GB/s
In Python over 1,000,000 elements:
NumKong:
numkong.sum i8 → i8 █████████████████████████████████████████████ 32.02 GB/s
numkong.sum f32 → f32 █████████████████████████████████████████ 29.17 GB/s
numkong.norm f32 → f64 ███████████████████████████████ 22.32 GB/s
numkong.sum f64 → f64 █████████████████████████████ 20.68 GB/s
numkong.norm bf16 → f64 █████████████████████████ 17.82 GB/s
Alternatives:
numpy.sum f64 → f64 ██████████████████████████████████ 24.16 GB/s
numpy.sum f32 → f32 ███████████████████████████ 19.06 GB/s
numpy.linalg.norm f64 → f64 ████████████ 8.21 GB/s
numpy.linalg.norm f32 → f64 ███████████ 7.48 GB/s
numpy.sum i8 → i8 ████ 2.68 GB/s
See reduce/README.md for details.
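The accumulator dtype is the whole story in this table. NumPy exposes it through the `dtype` argument of `numpy.sum`; here is a sketch of both the wrap-around hazard and the f32 → f64 style promotion (illustrative, not NumKong code):

```python
import numpy as np

x = np.full(1_000_000, 3, dtype=np.int8)

# By default numpy.sum widens small integers to the platform integer...
assert x.sum() == 3_000_000

# ...but forcing an i8 accumulator wraps modulo 256:
# 3_000_000 mod 256 = 192, which reads back as -64 in i8.
assert x.sum(dtype=np.int8) == -64

# For f32 data, a f64 accumulator (the f32 -> f64 promotion style above)
# keeps rounding error from piling up over a million additions.
y = np.full(1_000_000, 0.1, dtype=np.float32)
assert np.isclose(y.sum(dtype=np.float64), 100_000.0, rtol=1e-4)
```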
ColBERT-style late interaction with 2048 query vectors, 2048 document vectors, and 2048 dimensions. NumKong promotes f32 → f64 here as well, while ndarray stays in f32. In Rust:
NumKong:
numkong::MaxSimPackedMatrix::score f16 → f32 ██████████████████████████████ 423.69 GSO/s
numkong::MaxSimPackedMatrix::score f32 → f64 █████████████████████████████ 415.47 GSO/s
numkong::MaxSimPackedMatrix::score bf16 → f32 ████████████████ 224.48 GSO/s
Alternatives:
ndarray Q @ Dᵀ max-reduce f32 → f32 ███ 38.36 GSO/s
Compared to Python:
NumKong:
numkong.maxsim_packed f16 → f32 ███████████████████████████████████████████ 833.26 GSO/s
numkong.maxsim_packed f32 → f64 ████████████████████████████████████████ 776.43 GSO/s
numkong.maxsim_packed bf16 → f32 ██████████████████████ 428.56 GSO/s
Alternatives:
numpy matmul f32 → f32 ███████ 129.03 GSO/s
See maxsim/README.md for details.
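The MaxSim score itself is compact: for every query vector, take the best-matching document vector's similarity, then sum over query vectors. A reference NumPy sketch (our naming, in the spirit of the `numpy matmul` baseline above):

```python
import numpy as np

def maxsim(queries: np.ndarray, docs: np.ndarray) -> float:
    """ColBERT-style late interaction: sum over queries of the max dot product."""
    sims = queries @ docs.T             # (n_queries, n_docs) similarity matrix
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(7)
q = rng.standard_normal((4, 8)).astype(np.float32)
d = rng.standard_normal((6, 8)).astype(np.float32)

# Brute-force cross-check of the vectorized version.
expected = sum(max(float(qi @ dj) for dj in d) for qi in q)
assert np.isclose(maxsim(q, d), expected)
```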
Throughput over 2048 coordinate pairs. The unit is MP/s, or million coordinate pairs per second. The merged lists below include both Haversine and Vincenty distances.
Compared to the Rust baselines:
NumKong:
numkong::haversine f32 → f32 ████████████████████████████████████████████ 486.92 MP/s
numkong::haversine f64 → f64 ██████████████ 151.65 MP/s
numkong::vincenty f32 → f32 ██████ 68.96 MP/s
numkong::vincenty f64 → f64 ██ 17.79 MP/s
Alternatives:
geo::Haversine distance f32 → f32 ████ 38.88 MP/s
geo::Haversine distance f64 → f64 ██ 24.07 MP/s
geo::Vincenty distance f64 → f64 █ 1.15 MP/s
Compared to Python and its alternatives:
NumKong:
numkong.haversine f32 → f32 ████████████████████████████████████████ 475.41 MP/s
numkong.haversine f64 → f64 █████████████ 154.92 MP/s
numkong.vincenty f32 → f32 █████ 54.99 MP/s
numkong.vincenty f64 → f64 ██ 17.87 MP/s
Alternatives:
geopy.distance.great_circle f64 β f64 0.18 MP/s
geopy.distance.geodesic f64 β f64 0.0096 MP/s
See geospatial/README.md for details.
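For orientation, the Haversine great-circle distance being benchmarked reduces to a few trig calls per coordinate pair. A NumPy sketch using the mean Earth radius (the 6371 km constant is our assumption; NumKong's may differ):

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean radius; an assumption for this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))

# Paris -> Berlin is roughly 878 km along the great circle.
assert 850 < haversine_km(48.8566, 2.3522, 52.52, 13.405) < 910
```

The arguments can also be NumPy arrays, which is how a batched benchmark over 2048 coordinate pairs would call it.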
Throughput over point clouds with 2048 3D points each. The unit is MP/s, or million 3D points per second. The labels include the full return signature so RMSD and Kabsch can share one sorted list cleanly. In Rust:
NumKong:
numkong::MeshAlignment::rmsd f64 → f64 ██████████████████████████████████ 971.35 MP/s
numkong::MeshAlignment::rmsd f32 → f32 █████████████████████ 592.73 MP/s
numkong::MeshAlignment::rmsd f16 → f16 ████████████████████ 578.61 MP/s
numkong::MeshAlignment::rmsd bf16 → bf16 ████████████████████ 567.69 MP/s
numkong::MeshAlignment::kabsch f32 → f32 ██████████████ 404.69 MP/s
numkong::MeshAlignment::umeyama f32 → f32 ████████████ 335.03 MP/s
numkong::MeshAlignment::kabsch bf16 → bf16 ██████████ 272.09 MP/s
numkong::MeshAlignment::umeyama bf16 → bf16 █████████ 268.63 MP/s
numkong::MeshAlignment::umeyama f16 → f16 █████████ 264.89 MP/s
numkong::MeshAlignment::kabsch f16 → f16 █████████ 264.46 MP/s
numkong::MeshAlignment::kabsch f64 → f64 █████████ 245.90 MP/s
numkong::MeshAlignment::umeyama f64 → f64 ██████ 147.75 MP/s
Alternatives:
nalgebra-based RMSD f32 → f32 ███████████████████ 537.04 MP/s
nalgebra-based Kabsch f32 → f64 ████ 121.63 MP/s
nalgebra-based Umeyama f32 → f64 ████ 106.47 MP/s
Compared to Python and its alternatives:
NumKong:
numkong.rmsd f64 → f64 █████████████████████████████████ 825.51 MP/s
numkong.rmsd f32 → f64 ███████████████████ 467.35 MP/s
numkong.kabsch f32 → f64 ██████████ 248.48 MP/s
numkong.umeyama f32 → f64 ██████████ 248.10 MP/s
numkong.kabsch f64 → f64 ██████████ 238.79 MP/s
numkong.umeyama f64 → f64 ██████ 159.25 MP/s
Alternatives:
numpy-based RMSD f32 → f64 ███ 50.49 MP/s
numpy-based RMSD f64 → f64 ██ 46.74 MP/s
biopython SVDSuperimposer (Kabsch) f32 β f64 1.22 MP/s
biopython SVDSuperimposer (Kabsch) f64 β f64 1.19 MP/s
See mesh/README.md for details.
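As background for the Kabsch rows, the classic algorithm finds the optimal rotation via one SVD of the 3×3 cross-covariance matrix. A NumPy reference sketch (not NumKong's kernel):

```python
import numpy as np

def kabsch(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Rotation R minimizing ||R p_i - q_i|| over centered point clouds."""
    p = p - p.mean(axis=0)
    q = q - q.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)          # SVD of 3x3 cross-covariance
    sign = np.sign(np.linalg.det(vt.T @ u.T))  # guard against reflections
    d = np.diag([1.0, 1.0, sign])
    return vt.T @ d @ u.T                      # proper rotation, det == +1

# Rotate a random cloud by a known rotation and recover it.
rng = np.random.default_rng(3)
cloud = rng.standard_normal((100, 3))
theta = 0.5
rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
assert np.allclose(kabsch(cloud, cloud @ rz.T), rz, atol=1e-8)
```

Umeyama extends the same decomposition with a uniform scale factor, which is why the two share throughput characteristics in the lists above.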
Every Rust benchmark is a Criterion harness behind a Cargo feature gate. Run one suite at a time or all at once:
```sh
# One suite: default 2048-element workload
RUSTFLAGS="-C target-cpu=native" \
  cargo bench --features bench_similarity --bench bench_similarity

# All suites
RUSTFLAGS="-C target-cpu=native" \
  cargo bench --features all
```

Tuning knobs (environment variables):
| Variable | Default | Purpose |
|---|---|---|
| `NUMWARS_DIMS` | 2048 | Vector / matrix dimension shared by most suites |
| `NUMWARS_DIMS_HEIGHT` | 2048 | Row count for GEMM workloads (dots, maxsim) |
| `NUMWARS_DIMS_WIDTH` | 2048 | Column count for GEMM workloads (dots, maxsim) |
| `NUMWARS_DIMS_DEPTH` | 2048 | Shared (contraction) dimension for GEMM workloads |
| `NUMWARS_FILTER` | (none) | Regex to select benchmarks by name |
| `NUMWARS_WARMUP_SECONDS` | 3.0 | Criterion warm-up time |
| `NUMWARS_PROFILE_SECONDS` | 10.0 | Criterion measurement time |
| `NUMWARS_SAMPLE_SIZE` | 50 | Criterion sample count |
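The knobs compose like ordinary environment variables. For example, a quicker, narrower run of the similarity suite might look like this (the variable values are arbitrary illustrations):

```sh
NUMWARS_DIMS=1024 \
NUMWARS_FILTER='euclidean' \
NUMWARS_WARMUP_SECONDS=1.0 \
NUMWARS_PROFILE_SECONDS=5.0 \
RUSTFLAGS="-C target-cpu=native" \
  cargo bench --features bench_similarity --bench bench_similarity
```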
Install with uv and run any suite directly:

```sh
uv run --with "numkong,numpy,scipy,tabulate,ml_dtypes" \
  python similarity/bench.py
```

Or install all extras and run from the repo root:

```sh
pip install -e ".[similarity,each,dots,geospatial,mesh,reduce,similarities]"
python dots/bench.py
python similarities/bench.py
```

- similarity/README.md
- similarities/README.md
- dots/README.md
- each/README.md
- reduce/README.md
- maxsim/README.md
- geospatial/README.md
- mesh/README.md
Apache 2.0. See LICENSE.
