NAM Inference Micro-benchmarks

Micro-benchmarks for individual operations in a WaveNet neural network inference engine, comparing hand-optimized inline implementations against Eigen linear algebra expressions.

The primary motivation is optimizing WaveNet inference for real-time audio on ARM Cortex-M7 microcontrollers (specifically the Electrosmith Daisy platform with STM32H750), where Eigen's GEMM path triggers malloc in hot loops and expression template overhead is measurable for tiny matrices. The desktop benchmarks reveal which optimizations are universally beneficial vs. architecture-specific.
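To make the comparison concrete, an inline GEMM of the kind described above might look like the following. This is a minimal sketch under stated assumptions, not the repository's implementation: the name `gemm_inline` is hypothetical, matrices are assumed column-major (Eigen's default storage order), and all buffers are provided by the caller so the loop itself never allocates.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration; not taken from the repository.
// Computes C = A * B for column-major float matrices:
//   A is m x k, B is k x n, C is m x n.
// __restrict__ (a GCC/Clang extension) promises the compiler that the
// three buffers do not overlap.
inline void gemm_inline(const float* __restrict__ a,
                        const float* __restrict__ b,
                        float* __restrict__ c,
                        std::size_t m, std::size_t k, std::size_t n)
{
    assert(a != nullptr && b != nullptr && c != nullptr);
    for (std::size_t j = 0; j < n; ++j) {            // column of B and C
        for (std::size_t i = 0; i < m; ++i) {
            c[j * m + i] = 0.0f;                     // clear the output column
        }
        for (std::size_t p = 0; p < k; ++p) {        // inner dimension
            const float bpj = b[j * k + p];          // B(p, j)
            for (std::size_t i = 0; i < m; ++i) {    // contiguous walk down column p of A
                c[j * m + i] += a[p * m + i] * bpj;  // C(i, j) += A(i, p) * B(p, j)
            }
        }
    }
}
```

With the sizes known at compile time (for example as template parameters), a compiler can unroll these loops fully, which is the idea behind the "fully-unrolled" variant listed in the benchmark table.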

What's benchmarked

#	Benchmark	What it measures
1	GEMM	Inline triple-loop and fully-unrolled GEMM vs Eigen for small matrices (2x2 to 8x8)
1b	DTCM placement	Effect of placing GEMM operands in tightly-coupled memory (Daisy only)
2	`__restrict__`	Whether `__restrict__` qualifiers improve GEMM codegen
3	Matrix copy	`std::memcpy` vs Eigen block assignment for contiguous copies
4	Element-wise ops	4-wide unrolled addition/accumulation vs Eigen
5	Bias broadcast	Unrolled per-channel bias vs Eigen `colwise()`
6	Hardswish	Branchless vs branchy activation implementation
7	Activation unrolling	1-wide vs 4-wide loop for ReLU, sigmoid, SiLU, softsign
8	LUT activation	Table lookup + lerp vs computed `expf()` for sigmoid/SiLU/tanh
9	Strided copy	Manual stride-aware copy vs Eigen `.topRows().leftCols()`
10	Depthwise conv	Inline element-wise multiply vs Eigen `asDiagonal()`
11	FiLM	Inline scale+shift vs Eigen `.array()` expressions
12	Ring buffer	Eigen `middleCols()` vs nested loop (Eigen wins here)
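
As an illustration of row 8, a table lookup with linear interpolation (lerp) for sigmoid could be sketched as below. The table size, the input range of [-8, 8], and the names `SigmoidLut` and `sigmoid_exact` are assumptions chosen for this example; the repository's own table layout may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical sketch; sizes and range are illustrative assumptions.
constexpr std::size_t kLutSize = 1024;   // number of intervals in the table
constexpr float kLutMin = -8.0f;         // inputs at or below this are clamped
constexpr float kLutMax = 8.0f;          // inputs at or above this are clamped

struct SigmoidLut {
    float table[kLutSize + 1];           // one extra entry so idx + 1 is always valid

    SigmoidLut() {
        const float step = (kLutMax - kLutMin) / static_cast<float>(kLutSize);
        for (std::size_t i = 0; i <= kLutSize; ++i) {
            const float x = kLutMin + step * static_cast<float>(i);
            table[i] = 1.0f / (1.0f + std::exp(-x));
        }
    }

    float operator()(float x) const {
        assert(!std::isnan(x));          // NaN input is not handled by this sketch
        if (x <= kLutMin) return table[0];
        if (x >= kLutMax) return table[kLutSize];
        // Map x to a fractional table position in [0, kLutSize].
        const float pos = (x - kLutMin) * (static_cast<float>(kLutSize) / (kLutMax - kLutMin));
        std::size_t idx = static_cast<std::size_t>(pos);   // truncation equals floor for pos >= 0
        if (idx >= kLutSize) idx = kLutSize - 1;           // guard against rounding at the top edge
        const float frac = pos - static_cast<float>(idx);
        // Linear interpolation between the two neighbouring entries.
        return table[idx] + frac * (table[idx + 1] - table[idx]);
    }
};

// Reference implementation that the table is compared against.
inline float sigmoid_exact(float x) { return 1.0f / (1.0f + std::exp(-x)); }
```

The trade-off is memory for arithmetic: this table occupies about 4 KB, and on a microcontroller it would normally be built once and kept in static storage rather than constructed per call.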

Prerequisites

Desktop

- C++17 compiler (GCC 7+, Clang 5+, or Apple Clang)
- Eigen (header-only, downloaded automatically or manually)

Daisy (ARM Cortex-M7)

- DaisyToolchain (provides `arm-none-eabi-gcc` + newlib)
  - macOS: install to /Library/DaisyToolchain/0.2.0/ (default) or set `DAISY_TOOLCHAIN`
  - Linux: install and add to PATH, or set `DAISY_TOOLCHAIN`
- libDaisy built from source
- Eigen headers (same as desktop)
- Electrosmith Daisy Seed hardware + USB cable

Quick start

1. Clone and download Eigen

git clone https://github.com/jfsantos/nam-inference-benchmarks.git
cd nam-inference-benchmarks

# Download Eigen (header-only, ~5 MB)
mkdir -p third_party
git clone --depth 1 --branch 3.4.0 https://gitlab.com/libeigen/eigen.git third_party/eigen

2. Run desktop benchmarks

# Build all benchmarks
make

# Run all benchmarks sequentially
make run

# Or run individual benchmarks
./bench_gemm
./bench_hardswish
./bench_lut_activation

3. Run Daisy (on-target) benchmarks

# Build libDaisy first (if not already done)
cd /path/to/libDaisy && make
cd -

# Cross-compile, passing the path to libDaisy
make -f Makefile.daisy LIBDAISY_DIR=/path/to/libDaisy

# Flash via DFU (hold BOOT button, press RESET, then release BOOT)
make -f Makefile.daisy LIBDAISY_DIR=/path/to/libDaisy program-dfu

# Connect to USB serial to see results
screen /dev/ttyACM0 115200        # Linux
screen /dev/tty.usbmodem* 115200   # macOS

The Daisy firmware runs all benchmarks sequentially on boot and outputs results over USB serial. The LED blinks when complete.

Build options

Desktop Makefile

Variable	Default	Description
`CXX`	`g++`	C++ compiler
`CXXFLAGS`	`-O3 -ffast-math -march=native ...`	Compiler flags
`EIGEN_DIR`	`third_party/eigen`	Path to Eigen headers

Daisy Makefile

Variable	Default	Description
`LIBDAISY_DIR`	`$(HOME)/src/DaisyExamples/libDaisy`	Path to built libDaisy
`DAISY_TOOLCHAIN`	`/Library/DaisyToolchain/0.2.0`	Path to DaisyToolchain
`OPT`	`-O2`	Optimization level

Architecture

.
├── common.h              # Desktop benchmark utilities (chrono timing)
├── common_daisy.h        # Daisy benchmark utilities (DWT cycle counter)
├── bench_gemm.cpp        # Desktop: GEMM benchmarks
├── bench_restrict.cpp    # Desktop: __restrict__ effect
├── bench_memcpy_vs_eigen.cpp
├── bench_elementwise.cpp
├── bench_bias_broadcast.cpp
├── bench_hardswish.cpp
├── bench_activation_unroll.cpp
├── bench_lut_activation.cpp
├── bench_strided_copy.cpp
├── bench_depthwise.cpp
├── bench_film.cpp
├── bench_ringbuffer.cpp
├── bench_daisy.cpp       # Daisy: all benchmarks in one firmware binary
├── Makefile              # Desktop build
└── Makefile.daisy        # ARM cross-compile build

Desktop benchmarks are separate executables (one per category) for flexibility. The Daisy benchmark is a single firmware binary that runs all tests sequentially, since flashing is slow and USB serial output is the only interface.

Timing methodology

Desktop: `std::chrono::high_resolution_clock`, 1,000 to 10,000 timed iterations after 100 to 200 warmup iterations. Reports mean, standard deviation, min, and max in nanoseconds.
Daisy: DWT `CYCCNT` register (cycle-accurate, with negligible read overhead), 1,000 timed iterations after 100 warmup iterations. Reports cycles and microseconds at the configured CPU frequency. FTZ (Flush-to-Zero) and DN (Default-NaN) modes are enabled to avoid subnormal-float slowdowns.
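
The desktop procedure (warmup, timed iterations, summary statistics) can be sketched roughly as follows. This is an illustrative harness only: `BenchStats` and `run_bench` are hypothetical names, and the actual utilities in `common.h` may be structured differently.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type; all values are in nanoseconds.
struct BenchStats {
    double mean_ns;
    double stddev_ns;
    double min_ns;
    double max_ns;
};

// Hypothetical harness: runs fn `warmup` times untimed, then `iters` times timed.
template <typename Fn>
BenchStats run_bench(Fn&& fn, std::size_t warmup, std::size_t iters)
{
    assert(iters > 0);
    using clock = std::chrono::high_resolution_clock;

    for (std::size_t i = 0; i < warmup; ++i) fn();   // warm caches and branch predictors

    std::vector<double> samples;
    samples.reserve(iters);                          // avoid reallocation while timing
    for (std::size_t i = 0; i < iters; ++i) {
        const auto t0 = clock::now();
        fn();
        const auto t1 = clock::now();
        samples.push_back(std::chrono::duration<double, std::nano>(t1 - t0).count());
    }

    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / static_cast<double>(iters);

    double sq = 0.0;
    for (double s : samples) sq += (s - mean) * (s - mean);

    BenchStats st;
    st.mean_ns = mean;
    st.stddev_ns = std::sqrt(sq / static_cast<double>(iters));   // population standard deviation
    st.min_ns = *std::min_element(samples.begin(), samples.end());
    st.max_ns = *std::max_element(samples.begin(), samples.end());
    return st;
}
```

For operations that take only a few nanoseconds, the cost of the two `clock::now()` calls can be comparable to the work being measured, so a harness like this may need to time a batch of calls per sample instead of a single call.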

Context

These benchmarks were developed as part of optimizing NeuralAmpModelerCore for real-time guitar amp simulation on the Electrosmith Daisy platform. The WaveNet architecture uses small matrix multiplications (4x8, 8x8) thousands of times per audio buffer, making per-operation overhead critical.

License

MIT
