tone-3000/nam-inference-benchmarks

NAM Inference Micro-benchmarks

Micro-benchmarks for individual operations in a WaveNet neural network inference engine, comparing hand-optimized inline implementations against Eigen linear algebra expressions.

The primary motivation is optimizing WaveNet inference for real-time audio on ARM Cortex-M7 microcontrollers (specifically the Electrosmith Daisy platform with STM32H750), where Eigen's GEMM path triggers malloc in hot loops and expression template overhead is measurable for tiny matrices. The desktop benchmarks reveal which optimizations are universally beneficial vs. architecture-specific.
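To make the comparison concrete, an inline GEMM for tiny matrices might look like the sketch below. It is not taken from this repository: `gemm_inline` is a hypothetical name, the matrices are assumed to be column-major (Eigen's default storage order) with compile-time dimensions, and `__restrict__` is the GCC/Clang spelling of the qualifier. Because every buffer is owned by the caller, nothing inside the loops can allocate.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the repository): C = A * B for small,
// compile-time-sized, column-major matrices.
// A is M x K, B is K x N, C is M x N. All storage is caller-owned,
// so the function never touches the heap.
template <std::size_t M, std::size_t K, std::size_t N>
inline void gemm_inline(const float* __restrict__ a,
                        const float* __restrict__ b,
                        float* __restrict__ c)
{
    assert(a != nullptr && b != nullptr && c != nullptr);
    for (std::size_t j = 0; j < N; ++j) {        // column of C
        for (std::size_t i = 0; i < M; ++i) {    // row of C
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                // Column-major indexing: element (row, col) of an
                // R-row matrix lives at index row + col * R.
                acc += a[i + k * M] * b[k + j * K];
            }
            c[i + j * M] = acc;
        }
    }
}
```

For example, `gemm_inline<8, 4, 1>(w, x, y)` would multiply an 8x4 weight matrix by a 4-element column. Since the loop bounds are template parameters, the compiler is free to unroll them fully at `-O2` or `-O3`.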

What's benchmarked

| # | Benchmark | What it measures |
|---|---|---|
| 1 | GEMM | Inline triple-loop and fully-unrolled GEMM vs Eigen for small matrices (2x2 to 8x8) |
| 1b | DTCM placement | Effect of placing GEMM operands in tightly-coupled memory (Daisy only) |
| 2 | __restrict__ | Whether __restrict__ qualifiers improve GEMM codegen |
| 3 | Matrix copy | std::memcpy vs Eigen block assignment for contiguous copies |
| 4 | Element-wise ops | 4-wide unrolled addition/accumulation vs Eigen |
| 5 | Bias broadcast | Unrolled per-channel bias vs Eigen colwise() |
| 6 | Hardswish | Branchless vs branchy activation implementation |
| 7 | Activation unrolling | 1-wide vs 4-wide loop for ReLU, sigmoid, SiLU, softsign |
| 8 | LUT activation | Table lookup + lerp vs computed expf() for sigmoid/SiLU/tanh |
| 9 | Strided copy | Manual stride-aware copy vs Eigen .topRows().leftCols() |
| 10 | Depthwise conv | Inline element-wise multiply vs Eigen asDiagonal() |
| 11 | FiLM | Inline scale+shift vs Eigen .array() expressions |
| 12 | Ring buffer | Eigen middleCols() vs nested loop (Eigen wins here) |
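As an illustration of row 8, the sketch below contrasts a directly computed sigmoid with a table lookup plus linear interpolation. The table size, input range, and function names are assumptions chosen for the example, not values from `bench_lut_activation.cpp`. Accuracy depends on both the table size and the clamped range.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical table parameters, chosen for illustration only.
constexpr std::size_t kLutSize = 1024;
constexpr float kLutMin = -8.0f;
constexpr float kLutMax = 8.0f;
static_assert(kLutSize >= 2, "interpolation needs at least two entries");

// Reference implementation: sigmoid computed with an exponential.
inline float sigmoid_exact(float x)
{
    return 1.0f / (1.0f + std::exp(-x));
}

// Build the table once, outside the audio hot path.
inline std::array<float, kLutSize> make_sigmoid_lut()
{
    std::array<float, kLutSize> lut{};
    const float step = (kLutMax - kLutMin) / static_cast<float>(kLutSize - 1);
    for (std::size_t i = 0; i < kLutSize; ++i) {
        lut[i] = sigmoid_exact(kLutMin + step * static_cast<float>(i));
    }
    return lut;
}

// Table lookup plus linear interpolation (lerp) between adjacent entries.
inline float sigmoid_lut(const std::array<float, kLutSize>& lut, float x)
{
    // Clamp out-of-range inputs to the table ends. The negated comparison
    // also routes NaN to the first entry instead of indexing with it.
    if (!(x > kLutMin)) return lut.front();
    if (x >= kLutMax) return lut.back();

    const float scale = static_cast<float>(kLutSize - 1) / (kLutMax - kLutMin);
    const float pos = (x - kLutMin) * scale;
    std::size_t idx = static_cast<std::size_t>(pos);
    if (idx > kLutSize - 2) idx = kLutSize - 2;  // keep idx + 1 in bounds
    const float frac = pos - static_cast<float>(idx);
    return lut[idx] + frac * (lut[idx + 1] - lut[idx]);
}
```

The lookup trades an exponential and a division for one multiply, two loads, and a lerp. That trade is most likely to pay off on targets where `expf()` is expensive relative to memory access.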

Prerequisites

Desktop

  • C++17 compiler (GCC 7+, Clang 5+, or Apple Clang)
  • Eigen (header-only, downloaded automatically or manually)

Daisy (ARM Cortex-M7)

  • DaisyToolchain (provides arm-none-eabi-gcc + newlib)
    • macOS: install to /Library/DaisyToolchain/0.2.0/ (default) or set DAISY_TOOLCHAIN
    • Linux: install and add to PATH, or set DAISY_TOOLCHAIN
  • libDaisy built from source
  • Eigen headers (same as desktop)
  • Electrosmith Daisy Seed hardware + USB cable

Quick start

1. Clone and download Eigen

git clone https://github.com/jfsantos/nam-inference-benchmarks.git
cd nam-inference-benchmarks

# Download Eigen (header-only, ~5 MB)
mkdir -p third_party
git clone --depth 1 --branch 3.4.0 https://gitlab.com/libeigen/eigen.git third_party/eigen

2. Run desktop benchmarks

# Build all benchmarks
make

# Run all benchmarks sequentially
make run

# Or run individual benchmarks
./bench_gemm
./bench_hardswish
./bench_lut_activation

3. Run Daisy (on-target) benchmarks

# Build libDaisy first (if not already done)
cd /path/to/libDaisy && make
cd -

# Cross-compile, passing the path to libDaisy
make -f Makefile.daisy LIBDAISY_DIR=/path/to/libDaisy

# Flash via DFU (hold BOOT button, press RESET, then release BOOT)
make -f Makefile.daisy LIBDAISY_DIR=/path/to/libDaisy program-dfu

# Connect to USB serial to see results
screen /dev/ttyACM0 115200        # Linux
screen /dev/tty.usbmodem* 115200   # macOS

The Daisy firmware runs all benchmarks sequentially on boot and outputs results over USB serial. The LED blinks when complete.

Build options

Desktop Makefile

| Variable | Default | Description |
|---|---|---|
| CXX | g++ | C++ compiler |
| CXXFLAGS | -O3 -ffast-math -march=native ... | Compiler flags |
| EIGEN_DIR | third_party/eigen | Path to Eigen headers |

Daisy Makefile

| Variable | Default | Description |
|---|---|---|
| LIBDAISY_DIR | $(HOME)/src/DaisyExamples/libDaisy | Path to built libDaisy |
| DAISY_TOOLCHAIN | /Library/DaisyToolchain/0.2.0 | Path to DaisyToolchain |
| OPT | -O2 | Optimization level |

Architecture

.
├── common.h              # Desktop benchmark utilities (chrono timing)
├── common_daisy.h        # Daisy benchmark utilities (DWT cycle counter)
├── bench_gemm.cpp        # Desktop: GEMM benchmarks
├── bench_restrict.cpp    # Desktop: __restrict__ effect
├── bench_memcpy_vs_eigen.cpp
├── bench_elementwise.cpp
├── bench_bias_broadcast.cpp
├── bench_hardswish.cpp
├── bench_activation_unroll.cpp
├── bench_lut_activation.cpp
├── bench_strided_copy.cpp
├── bench_depthwise.cpp
├── bench_film.cpp
├── bench_ringbuffer.cpp
├── bench_daisy.cpp       # Daisy: all benchmarks in one firmware binary
├── Makefile              # Desktop build
└── Makefile.daisy        # ARM cross-compile build

Desktop benchmarks are separate executables (one per category) for flexibility. The Daisy benchmark is a single firmware binary that runs all tests sequentially, since flashing is slow and USB serial output is the only interface.

Timing methodology

  • Desktop: std::chrono::high_resolution_clock, 1000 to 10000 timed iterations after 100 to 200 warmup iterations. Reports mean, stddev, min, and max in nanoseconds.
  • Daisy: DWT CYCCNT register (cycle-accurate, zero overhead), 1000 timed iterations after 100 warmup iterations. Reports cycles and microseconds at the configured CPU frequency. FTZ (Flush-to-Zero) and DN (Default-NaN) are enabled to avoid subnormal float slowdowns.
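
The desktop procedure described above can be sketched roughly as follows. This is a minimal illustration and not the contents of `common.h`: `BenchStats` and `run_benchmark` are hypothetical names, the standard deviation is the population form, and each sample times a single call. For operations that take only a few nanoseconds, a real harness may need to time batches of calls instead.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type; field names are illustrative only.
struct BenchStats {
    double mean_ns;
    double stddev_ns;
    double min_ns;
    double max_ns;
};

// Run `fn` `warmup` times untimed, then `iters` times with per-call timing.
template <typename Fn>
BenchStats run_benchmark(Fn&& fn, std::size_t warmup, std::size_t iters)
{
    assert(iters > 0);
    using clock = std::chrono::high_resolution_clock;

    // Warmup: populate caches and let CPU frequency settle.
    for (std::size_t i = 0; i < warmup; ++i) fn();

    std::vector<double> samples;
    samples.reserve(iters);
    for (std::size_t i = 0; i < iters; ++i) {
        const auto t0 = clock::now();
        fn();
        const auto t1 = clock::now();
        samples.push_back(
            std::chrono::duration<double, std::nano>(t1 - t0).count());
    }

    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / static_cast<double>(iters);

    double sq = 0.0;
    for (double s : samples) sq += (s - mean) * (s - mean);
    const double stddev = std::sqrt(sq / static_cast<double>(iters));

    return BenchStats{mean, stddev,
                      *std::min_element(samples.begin(), samples.end()),
                      *std::max_element(samples.begin(), samples.end())};
}
```

A call such as `run_benchmark([&] { kernel(); }, 100, 1000)` mirrors the 100 warmup / 1000 timed split, where `kernel` stands in for the operation under test. The operation should write to a `volatile` or otherwise observable sink so the compiler cannot remove it.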

Context

These benchmarks were developed as part of optimizing NeuralAmpModelerCore for real-time guitar amp simulation on the Electrosmith Daisy platform. The WaveNet architecture uses small matrix multiplications (4x8, 8x8) thousands of times per audio buffer, making per-operation overhead critical.

License

MIT
