Micro-benchmarks for individual operations in a WaveNet neural network inference engine, comparing hand-optimized inline implementations against Eigen linear algebra expressions.
The primary motivation is optimizing WaveNet inference for real-time audio on ARM Cortex-M7 microcontrollers (specifically the Electrosmith Daisy platform with STM32H750), where Eigen's GEMM path triggers malloc in hot loops and expression template overhead is measurable for tiny matrices. The desktop benchmarks reveal which optimizations are universally beneficial vs. architecture-specific.
| # | Benchmark | What it measures |
|---|---|---|
| 1 | GEMM | Inline triple-loop and fully-unrolled GEMM vs Eigen for small matrices (2x2 to 8x8) |
| 1b | DTCM placement | Effect of placing GEMM operands in tightly-coupled memory (Daisy only) |
| 2 | `__restrict__` | Whether `__restrict__` qualifiers improve GEMM codegen |
| 3 | Matrix copy | std::memcpy vs Eigen block assignment for contiguous copies |
| 4 | Element-wise ops | 4-wide unrolled addition/accumulation vs Eigen |
| 5 | Bias broadcast | Unrolled per-channel bias vs Eigen colwise() |
| 6 | Hardswish | Branchless vs branchy activation implementation |
| 7 | Activation unrolling | 1-wide vs 4-wide loop for ReLU, sigmoid, SiLU, softsign |
| 8 | LUT activation | Table lookup + lerp vs computed expf() for sigmoid/SiLU/tanh |
| 9 | Strided copy | Manual stride-aware copy vs Eigen .topRows().leftCols() |
| 10 | Depthwise conv | Inline element-wise multiply vs Eigen asDiagonal() |
| 11 | FiLM | Inline scale+shift vs Eigen .array() expressions |
| 12 | Ring buffer | Eigen middleCols() vs nested loop (Eigen wins here) |
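To make the "branchless vs branchy" distinction in row 6 concrete, here is a minimal sketch of the two Hardswish forms. The function names are hypothetical and are not taken from the benchmark sources; the formula used is the common definition `x * clamp(x + 3, 0, 6) / 6`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Branchy form: three explicit cases selected with comparisons.
inline float hardswish_branchy(float x) {
    if (x <= -3.0f) return 0.0f;
    if (x >= 3.0f) return x;
    return x * (x + 3.0f) * (1.0f / 6.0f);
}

// Branchless form: clamp (x + 3) to [0, 6] with min/max, which compilers
// commonly lower to select or min/max instructions instead of jumps.
inline float hardswish_branchless(float x) {
    const float t = std::min(std::max(x + 3.0f, 0.0f), 6.0f);
    return x * t * (1.0f / 6.0f);
}

// Apply the branchless form in place over a buffer (hypothetical helper).
inline void hardswish_inplace(float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = hardswish_branchless(data[i]);
    }
}
```

Both forms agree inside the interval (-3, 3); which one is faster depends on the target and compiler, which is what the benchmark is meant to measure.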
- C++17 compiler (GCC 7+, Clang 5+, or Apple Clang)
- Eigen (header-only, downloaded automatically or manually)
- DaisyToolchain (provides `arm-none-eabi-gcc` + newlib)
  - macOS: install to `/Library/DaisyToolchain/0.2.0/` (default) or set `DAISY_TOOLCHAIN`
  - Linux: install and add to `PATH`, or set `DAISY_TOOLCHAIN`
- libDaisy built from source
- Eigen headers (same as desktop)
- Electrosmith Daisy Seed hardware + USB cable
```bash
git clone https://github.com/jfsantos/nam-inference-benchmarks.git
cd nam-inference-benchmarks
# Download Eigen (header-only, ~5 MB)
mkdir -p third_party
git clone --depth 1 --branch 3.4.0 https://gitlab.com/libeigen/eigen.git third_party/eigen
```

```bash
# Build all benchmarks
make
# Run all benchmarks sequentially
make run
# Or run individual benchmarks
./bench_gemm
./bench_hardswish
./bench_lut_activation
```

```bash
# Build libDaisy first (if not already done)
cd /path/to/libDaisy && make
cd -
# Cross-compile, passing the path to libDaisy
make -f Makefile.daisy LIBDAISY_DIR=/path/to/libDaisy
# Flash via DFU (hold BOOT button, press RESET, then release BOOT)
make -f Makefile.daisy LIBDAISY_DIR=/path/to/libDaisy program-dfu
# Connect to USB serial to see results
screen /dev/ttyACM0 115200 # Linux
screen /dev/tty.usbmodem* 115200 # macOS
```

The Daisy firmware runs all benchmarks sequentially on boot and outputs results over USB serial. The LED blinks when complete.
| Variable | Default | Description |
|---|---|---|
| `CXX` | `g++` | C++ compiler |
| `CXXFLAGS` | `-O3 -ffast-math -march=native ...` | Compiler flags |
| `EIGEN_DIR` | `third_party/eigen` | Path to Eigen headers |
| Variable | Default | Description |
|---|---|---|
| `LIBDAISY_DIR` | `$(HOME)/src/DaisyExamples/libDaisy` | Path to built libDaisy |
| `DAISY_TOOLCHAIN` | `/Library/DaisyToolchain/0.2.0` | Path to DaisyToolchain |
| `OPT` | `-O2` | Optimization level |
```
.
├── common.h # Desktop benchmark utilities (chrono timing)
├── common_daisy.h # Daisy benchmark utilities (DWT cycle counter)
├── bench_gemm.cpp # Desktop: GEMM benchmarks
├── bench_restrict.cpp # Desktop: __restrict__ effect
├── bench_memcpy_vs_eigen.cpp
├── bench_elementwise.cpp
├── bench_bias_broadcast.cpp
├── bench_hardswish.cpp
├── bench_activation_unroll.cpp
├── bench_lut_activation.cpp
├── bench_strided_copy.cpp
├── bench_depthwise.cpp
├── bench_film.cpp
├── bench_ringbuffer.cpp
├── bench_daisy.cpp # Daisy: all benchmarks in one firmware binary
├── Makefile # Desktop build
├── Makefile.daisy # ARM cross-compile build
```
Desktop benchmarks are separate executables (one per category) for flexibility. The Daisy benchmark is a single firmware binary that runs all tests sequentially, since flashing is slow and USB serial output is the only interface.
- Desktop: `std::chrono::high_resolution_clock`, 1000 to 10000 iterations after 100 to 200 warmup iterations. Reports mean, stddev, min, max in nanoseconds.
- Daisy: DWT CYCCNT register (cycle-accurate, zero overhead), 1000 iterations after 100 warmup. Reports cycles and microseconds at the configured CPU frequency. FTZ (Flush-to-Zero) and DN (Default-NaN) are enabled to avoid subnormal float slowdowns.
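The desktop methodology above (warmup, timed iterations, summary statistics) could be sketched roughly as follows. This is an illustration only, not the contents of `common.h`: `BenchStats` and `run_benchmark` are hypothetical names, and timing each call individually is an assumption about how samples are collected.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <vector>

// Summary statistics in nanoseconds (hypothetical struct).
struct BenchStats {
    double mean_ns;
    double stddev_ns;
    double min_ns;
    double max_ns;
};

// Run `fn` untimed `warmup` times, then time `iters` individual calls.
template <typename Fn>
BenchStats run_benchmark(Fn&& fn, int warmup = 100, int iters = 1000) {
    assert(warmup >= 0 && iters > 0);
    using clock = std::chrono::high_resolution_clock;

    for (int i = 0; i < warmup; ++i) fn();  // warm caches and branch predictors

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(iters));
    for (int i = 0; i < iters; ++i) {
        const auto t0 = clock::now();
        fn();
        const auto t1 = clock::now();
        samples.push_back(
            std::chrono::duration<double, std::nano>(t1 - t0).count());
    }

    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / iters;

    double sq = 0.0;
    for (double s : samples) sq += (s - mean) * (s - mean);
    const double stddev = std::sqrt(sq / iters);  // population stddev

    const auto mm = std::minmax_element(samples.begin(), samples.end());
    return BenchStats{mean, stddev, *mm.first, *mm.second};
}
```

For operations that take only a few nanoseconds, the clock call itself can dominate a single sample, so a real harness may prefer to time a batch of calls and divide; the sketch keeps per-call timing for simplicity.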
These benchmarks were developed as part of optimizing NeuralAmpModelerCore for real-time guitar amp simulation on the Electrosmith Daisy platform. The WaveNet architecture uses small matrix multiplications (4x8, 8x8) thousands of times per audio buffer, making per-operation overhead critical.
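As a rough illustration of the kind of small, fixed-size multiply described above, here is a minimal sketch of an inline triple-loop GEMM with compile-time dimensions and `__restrict__`-qualified pointers. The names `gemm_inline` and `BENCH_RESTRICT` are hypothetical, and column-major storage is assumed to match Eigen's default layout; the repository's actual kernels may differ.

```cpp
#include <cassert>

// Portable spelling of the restrict qualifier (hypothetical macro name).
#if defined(__GNUC__) || defined(__clang__)
#define BENCH_RESTRICT __restrict__
#elif defined(_MSC_VER)
#define BENCH_RESTRICT __restrict
#else
#define BENCH_RESTRICT
#endif

// C (M x N) = A (M x K) * B (K x N), all column-major, no heap allocation.
// Compile-time sizes let the compiler fully unroll the loops for tiny matrices.
template <int M, int K, int N>
inline void gemm_inline(const float* BENCH_RESTRICT a,
                        const float* BENCH_RESTRICT b,
                        float* BENCH_RESTRICT c) {
    static_assert(M > 0 && K > 0 && N > 0, "dimensions must be positive");
    assert(a != nullptr && b != nullptr && c != nullptr);
    for (int j = 0; j < N; ++j) {          // column of C
        for (int i = 0; i < M; ++i) {      // row of C
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {  // inner product
                acc += a[i + k * M] * b[k + j * K];
            }
            c[i + j * M] = acc;
        }
    }
}
```

A call such as `gemm_inline<4, 8, 1>(weights, input, output)` would compute one 4x8 matrix-vector product; because the output must not alias the inputs, the restrict qualifiers are only valid when the three buffers are distinct.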
MIT