
Edge AI Benchmark Suite

A controlled benchmarking project comparing three ML inference frameworks — TensorFlow Lite, ONNX Runtime, and PyTorch Mobile — using a quantized MobileNet-V2 model across two target environments: Android devices and ARM64 (Raspberry Pi 4).

This project is the foundation for a research article on ML inference performance at the edge, covering latency, throughput, model size, and qualitative conversion effort across frameworks.


Motivation

AI workloads are increasingly moving from cloud to edge devices to meet low-latency, privacy, and connectivity requirements. Edge hardware imposes strict constraints — limited CPU, modest RAM, and tight latency budgets — so choosing the right inference runtime is critical.

This suite provides a repeatable, fair comparison by enforcing identical benchmark methodology across all three frameworks: single-threaded inference, fixed warm-up runs, integer millisecond latency (matching Android's nanosecond-to-ms floor conversion), and the same MobileNet-V2 input shape.


Repository Structure

edge-ai-benchmark/
├── aarch64/                        # ARM64 benchmark (Raspberry Pi 4 / QEMU)
│   ├── benchmark.py                # Main benchmark runner — all three frameworks
│   ├── fake_cpuinfo.c              # LD_PRELOAD shim source (QEMU only)
│   ├── fake_cpuinfo.so             # Compiled shim (generated, not committed)
│   ├── README.md                   # Full setup guide (Pi hardware + QEMU)
│   └── models/                     # Place converted models here before benchmarking
│       ├── tflite/
│       ├── onnx/
│       └── pytorch/
│
├── app/                            # Android benchmark application
│   └── AndroidBenchmarkMulti/
│       └── app/
│           ├── build.gradle        # Dependencies: TFLite 2.17, PyTorch 2.1, ORT 1.24.3
│           └── src/main/
│               ├── AndroidManifest.xml
│               └── assets/
│                   ├── onnx/       # ONNX model assets
│                   └── pytorch/    # PyTorch model assets
│
└── frameworks/                     # Model conversion pipelines
    ├── preprocess_shared.py        # Shared preprocessing (resize, crop, ImageNet normalize)
    ├── pipeline_pytorch_mobile.py  # Pipeline 1: PyTorch → FX INT8 → .ptl
    ├── pipeline_tflite.py          # Pipeline 2: PyTorch → ONNX → TF SavedModel → .tflite
    ├── pipeline_ort_mobile.py      # Pipeline 3: PyTorch → ONNX → ORT INT8 QDQ → .onnx
    ├── images/
    │   ├── calibration/            # Calibration images shared by all three pipelines
    │   └── sample.jpg              # Sample image for quick inference checks
    └── models/                     # Output directory (populated after running pipelines)
        ├── onnx/                   # Shared float ONNX intermediate
        ├── pytorch/                # Final .ptl artifact
        ├── tflite/                 # Final .tflite artifact + SavedModel intermediate
        └── ort/                    # Final INT8 QDQ .onnx artifact

Model

All benchmarks use MobileNet-V2 (ImageNet, 1000 classes) with a 224×224×3 input. Each framework receives a model converted from the same baseline checkpoint under the same quantization rules, ensuring the comparison is as apples-to-apples as possible.

Framework         Format        Precision
TensorFlow Lite   .tflite       INT8 static quantization
ONNX Runtime      .onnx         FP32 / dynamic / static
PyTorch Mobile    .pt / .ptl    INT8 static quantization

Benchmark Methodology

Both benchmark targets (the Python script and the Android app) share the same rules:

  • Single-threaded inference — num_threads=1 for all frameworks
  • Input tensor allocated once and reused across all runs
  • Latency stored as integer milliseconds (floor, matching Java's nanoseconds / 1_000_000 conversion)
  • CPU time computed as total_accumulated / NUM_RUNS
  • Adaptive warmup — default 50 warm-up runs, automatically capped on slow hardware (QEMU) so total warm-up stays under --max-warmup-seconds
  • Metrics reported: mean latency, std dev, min, max, median, P95, CPU time, throughput (FPS), memory delta (MB)
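The rules above amount to a small timing loop. The sketch below is a minimal illustration of that methodology, not the actual benchmark.py source; run_inference stands in for a framework-specific call, and the memory-delta metric (psutil) is omitted:

```python
import statistics
import time

def benchmark(run_inference, num_runs=100, warmup=50, max_warmup_seconds=30.0):
    """Sketch of the shared methodology: wall-time-capped warm-up, then
    num_runs timed inferences stored as floor integer milliseconds."""
    # Adaptive warm-up: stop early once the wall-time budget is exhausted.
    deadline = time.perf_counter() + max_warmup_seconds
    for _ in range(warmup):
        run_inference()
        if time.perf_counter() > deadline:
            break

    latencies_ms = []
    cpu_start = time.process_time()
    for _ in range(num_runs):
        t0 = time.perf_counter_ns()
        run_inference()
        # Integer-millisecond floor, matching the Android app's ns / 1_000_000.
        latencies_ms.append((time.perf_counter_ns() - t0) // 1_000_000)
    cpu_total = time.process_time() - cpu_start

    mean = statistics.mean(latencies_ms)
    return {
        "mean_ms": mean,
        "std_ms": statistics.pstdev(latencies_ms),
        "min_ms": min(latencies_ms),
        "max_ms": max(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": sorted(latencies_ms)[int(0.95 * (num_runs - 1))],
        "cpu_time_ms": cpu_total / num_runs * 1000,  # CPU time per run
        "fps": 1000.0 / mean if mean > 0 else float("inf"),
    }
```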

aarch64/ — Raspberry Pi 4 Benchmark

A Python script that mirrors the Android app's benchmark logic, targeting ARM64 hardware. Supports both a real Raspberry Pi 4 and a QEMU-emulated ARM64 environment on Windows (WSL 2).

Key Files

File             Purpose
benchmark.py     Main runner — discovers model files, benchmarks all frameworks, prints a summary table, and saves a JSON results file
fake_cpuinfo.c   C source for an LD_PRELOAD shim that intercepts fopen() / open() calls so ONNX Runtime and PyTorch don't crash on QEMU's broken /proc/cpuinfo and ARM sysfs registers

Supported Model Extensions

Extension           Framework
.tflite             TensorFlow Lite
.onnx               ONNX Runtime
.pt / .pth / .ptl   PyTorch

CLI Reference

Argument               Default        Description
--models-dir           ./models       Directory to scan for model files (recursive)
--runs                 100            Number of timed inference runs per model
--warmup               50             Max warm-up runs — auto-capped on slow hardware
--max-warmup-seconds   30             Wall-time cap for the warm-up phase
--output               results.json   JSON file to save results ('' to skip)
--skip                 (none)         Frameworks to skip: tflite onnx pytorch
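The flags above map onto a straightforward argparse setup. This is a hedged sketch of the interface with the documented defaults, not the script's actual source:

```python
import argparse

def build_parser():
    # Mirrors the CLI reference table; defaults match the documented values.
    p = argparse.ArgumentParser(description="Edge AI benchmark runner (sketch)")
    p.add_argument("--models-dir", default="./models",
                   help="Directory to scan recursively for model files")
    p.add_argument("--runs", type=int, default=100,
                   help="Number of timed inference runs per model")
    p.add_argument("--warmup", type=int, default=50,
                   help="Max warm-up runs (auto-capped on slow hardware)")
    p.add_argument("--max-warmup-seconds", type=float, default=30,
                   help="Wall-time cap for the warm-up phase")
    p.add_argument("--output", default="results.json",
                   help="JSON results file ('' to skip saving)")
    p.add_argument("--skip", nargs="*", default=[],
                   choices=["tflite", "onnx", "pytorch"],
                   help="Frameworks to skip")
    return p
```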

Running on a Real Raspberry Pi 4

# Activate the virtual environment
source ~/benchmark-env/bin/activate
cd ~/edgeai

# Run all three frameworks
python benchmark.py --models-dir ./models

# Custom run count, output file
python benchmark.py --models-dir ./models --runs 100 --warmup 50 --output results.json

# Skip a specific framework
python benchmark.py --models-dir ./models --skip pytorch

Running Under QEMU (WSL 2 Emulation)

ONNX Runtime and PyTorch make C-level fopen() / open() calls to read /proc/cpuinfo and ARM-specific sysfs register files at startup. QEMU does not expose these correctly, causing a hard abort before Python can intercept it. The fake_cpuinfo.so shim intercepts these calls at the dynamic linker level and returns valid Raspberry Pi 4 data instead.

Compile the shim once (inside the ARM64 chroot):

cd /opt/edgeai
gcc -shared -fPIC -O2 -o fake_cpuinfo.so fake_cpuinfo.c -ldl

Run with the shim:

# All three frameworks
LD_PRELOAD=./fake_cpuinfo.so python benchmark.py --models-dir ./models

# Save results to a file
LD_PRELOAD=./fake_cpuinfo.so python benchmark.py --models-dir ./models --output results.json

# TFLite only — shim not needed (TFLite does not read cpuinfo natively)
python benchmark.py --models-dir ./models --skip onnx pytorch

When the shim is active you will see interception confirmations:

[fake_cpuinfo] intercepted fopen: /proc/cpuinfo
[fake_cpuinfo] intercepted open: /sys/devices/system/cpu/cpu0/regs/identification/midr_el1

Note: The shim is only needed in QEMU. On real Raspberry Pi 4 hardware, run benchmark.py directly.

Example Output

============================================================
  Edge AI Benchmark Suite
  Runs: 100  |  Warmup: up to 50  |  Threads: 1
  Max warmup time: 30s per model
============================================================

[running] mobilenet_v2_static_quantized.tflite  (tflite)

--- mobilenet_v2_static_quantized.tflite ---
Framework: TensorFlow Lite
Runs: 100 (warmup: 50)

Latency:
  Mean: 52 ms
  Std Dev: 3.1 ms
  Min: 48 ms
  Max: 61 ms
  Median: 51 ms
  P95: 58 ms

Resources:
  Memory: 4 MB
  CPU Time: 51.8 ms

Throughput: 19.23 FPS

Expected Results on Real Pi 4 (4 GB, 64-bit OS, 1 thread)

Model                                  Framework      Mean Latency   FPS
mobilenet_v2_static_quantized.tflite   TFLite         ~40–80 ms      ~12–25
mobilenet_v2_fp32.onnx                 ONNX Runtime   ~120–180 ms    ~5–8
mobilenet_v2_dynamic.onnx              ONNX Runtime   ~120–180 ms    ~5–8
mobilenet_v2_static_quantized.ptl      PyTorch        ~200–350 ms    ~3–5

TFLite typically leads on Pi 4 for quantized INT8 models due to its optimised XNNPACK delegate.

Full Setup Guide

For complete setup instructions (OS flashing, SSH, virtualenv, framework installation, QEMU chroot setup, and troubleshooting), see aarch64/README.md.


app/ — Android Benchmark App

An Android application (com.edgeai.benchmark) that runs the same benchmark logic on Android devices or the Android emulator (AVD).

Configuration

Setting          Value
minSdk           24 (Android 7.0)
targetSdk        34 (Android 14)
TFLite           org.tensorflow:tensorflow-lite:2.17.0
PyTorch Mobile   org.pytorch:pytorch_android_lite:2.1.0
ONNX Runtime     com.microsoft.onnxruntime:onnxruntime-android:1.24.3

Model Asset Layout

Place converted model files in the appropriate assets directory before building:

app/src/main/assets/
├── onnx/         ← .onnx files
└── pytorch/      ← .pt / .ptl files

TFLite models are loaded from a separate assets path configured in the app.

Building and Running

Open the app/AndroidBenchmarkMulti/ folder in Android Studio. Sync Gradle, connect a device or start an AVD, then run the app configuration. Results are displayed in-app and logged to Logcat.


frameworks/ — Model Conversion Pipelines

All three pipelines start from the same PyTorch MobileNet-V2 checkpoint (torchvision DEFAULT ImageNet weights) and produce an INT8-quantized model in the format required by each framework. They share a common preprocessing module and a common calibration image folder, ensuring that quantization statistics are computed on identical input data across all three frameworks.

frameworks/
├── preprocess_shared.py          # Shared preprocessing used by all three pipelines
├── pipeline_pytorch_mobile.py    # Pipeline 1 — PyTorch Mobile (.ptl)
├── pipeline_tflite.py            # Pipeline 2 — TensorFlow Lite (.tflite)
├── pipeline_ort_mobile.py        # Pipeline 3 — ONNX Runtime Mobile (.onnx, INT8)
├── images/
│   ├── calibration/              # Calibration images (JPEG/PNG) used by all pipelines
│   │   └── img01.jpg             # Add more images here to improve quantization quality
│   └── sample.jpg                # Sample inference image for quick sanity checks
└── models/                       # Output directory — populated after running the pipelines
    ├── pytorch/
    │   └── mobilenet_v2_static_quantized.ptl
    ├── tflite/
    │   ├── saved_model/          # Intermediate TF SavedModel (onnx2tf output)
    │   └── mobilenet_v2_static_quantized.tflite
    └── ort/
        └── mobilenet_v2_static_quantized.onnx

The intermediate models/onnx/mobilenet_v2_float.onnx is generated automatically by both the TFLite and ORT pipelines and is shared between them. It only needs to be exported once.
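The reuse logic amounts to a guard on the shared path. A sketch with a hypothetical export_onnx callback (standing in for the pipeline's wrapped torch.onnx.export call):

```python
from pathlib import Path

SHARED_ONNX = Path("models/onnx/mobilenet_v2_float.onnx")

def ensure_float_onnx(export_onnx, path=SHARED_ONNX):
    """Export the float ONNX intermediate only if it does not exist yet.

    export_onnx is a hypothetical callback; both the TFLite and ORT
    pipelines can call this, and only the first one pays the export cost.
    """
    path = Path(path)
    if path.exists():
        return path  # reuse the shared intermediate
    path.parent.mkdir(parents=True, exist_ok=True)
    export_onnx(path)
    return path
```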

preprocess_shared.py — Shared Preprocessing

Used by all three pipelines for both calibration and inference. Implements the standard torchvision MobileNet-V2 preprocessing pipeline:

  1. Resize shortest edge to 256 px (bilinear)
  2. Centre-crop to 224×224
  3. Normalize to [0, 1], then apply ImageNet mean/std ([0.485, 0.456, 0.406] / [0.229, 0.224, 0.225])

Two variants are exposed to handle the axis layout difference between frameworks:

Function                Output shape               Used by
load_image_nchw(path)   (1, 3, 224, 224) float32   PyTorch, ONNX Runtime
load_image_nhwc(path)   (1, 224, 224, 3) float32   TFLite (channels-last after onnx2tf)
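A numpy sketch of the crop, normalization, and layout steps (image decoding and the resize to 256 px are omitted; this assumes an already-resized uint8 H×W×3 array, and the function names here are illustrative, not the module's actual API):

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def center_crop(img_hwc, size=224):
    # Crop the central size x size window from an H x W x 3 array.
    h, w, _ = img_hwc.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img_hwc[top:top + size, left:left + size, :]

def normalize(img_hwc_uint8):
    # Scale to [0, 1], then apply the ImageNet per-channel mean/std.
    x = img_hwc_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

def to_nchw(img_hwc):
    # (H, W, C) -> (1, C, H, W): PyTorch / ONNX Runtime layout.
    return np.expand_dims(img_hwc.transpose(2, 0, 1), 0)

def to_nhwc(img_hwc):
    # (H, W, C) -> (1, H, W, C): TFLite channels-last layout.
    return np.expand_dims(img_hwc, 0)
```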

Pipeline 1 — PyTorch Mobile (pipeline_pytorch_mobile.py)

Output: models/pytorch/mobilenet_v2_static_quantized.ptl

Conversion steps:

  1. Load MobileNet_V2_Weights.DEFAULT from torchvision
  2. Apply FX-graph static INT8 quantization using the QNNPACK backend (per-tensor symmetric weights — QNNPACK's native quantization mode on ARM)
  3. Calibrate with images from images/calibration/
  4. Trace to TorchScript
  5. Run optimize_for_mobile (conv-BN fusion, mobile kernel selection)
  6. Save as a Lite Interpreter artifact (.ptl) via _save_for_lite_interpreter

Dependencies:

pip install torch torchvision

Run:

cd frameworks/
python pipeline_pytorch_mobile.py

Design choices: QNNPACK is chosen over FBGEMM because it is ARM-optimised and matches the Android benchmark app's backend. Per-tensor quantization is used because QNNPACK does not efficiently support per-channel convolution weights. I/O remains float32 for a fair comparison with the other two pipelines.


Pipeline 2 — TensorFlow Lite (pipeline_tflite.py)

Output: models/tflite/mobilenet_v2_static_quantized.tflite

Conversion steps:

  1. Export PyTorch MobileNet-V2 to ONNX (opset 12, NCHW) — skipped if models/onnx/mobilenet_v2_float.onnx already exists
  2. Convert ONNX → TF SavedModel via onnx2tf (automatically runs onnxsim and transposes NCHW → NHWC)
  3. Apply full-integer INT8 quantization (Optimize.DEFAULT, TFLITE_BUILTINS_INT8) using a representative dataset built from images/calibration/
  4. Write the final .tflite file

Dependencies:

pip install onnx2tf tensorflow onnx onnxsim torch torchvision

Run:

cd frameworks/
python pipeline_tflite.py

Design choices: onnx2tf is used instead of the raw onnx-tf converter because it runs the ONNX graph simplifier before conversion, producing a cleaner TF graph. Weights are sourced from PyTorch (not TF's own ImageNet weights) to match the other pipelines. I/O is kept at float32 — inference_input_type and inference_output_type are intentionally left at their defaults so that all three pipelines measure the same dequantisation boundary overhead during benchmarking. The XNNPACK delegate for INT8 ops is enabled automatically by TFLITE_BUILTINS_INT8; no GPU delegate or NNAPI is configured.


Pipeline 3 — ONNX Runtime Mobile (pipeline_ort_mobile.py)

Output: models/ort/mobilenet_v2_static_quantized.onnx

Conversion steps:

  1. Export PyTorch MobileNet-V2 to ONNX (opset 12, NCHW) — skipped if models/onnx/mobilenet_v2_float.onnx already exists
  2. Apply ORT static INT8 quantization (quantize_static) using MobileNetCalibrationReader with images from images/calibration/

Dependencies:

pip install onnx onnxruntime torch torchvision

Run:

cd frameworks/
python pipeline_ort_mobile.py

Design choices: QDQ format (QuantFormat.QDQ) is used — ORT's recommended format for ARM kernels. Per-channel weight quantization (per_channel=True) gives better accuracy than per-tensor with no runtime penalty on supported kernels. Activation type is QUInt8, weight type is QInt8, consistent with ORT's recommended QDQ settings. I/O is float32 to match the other pipelines.


Running All Pipelines

The pipelines are independent but share the ONNX export step. Running TFLite first (or ORT first) will generate models/onnx/mobilenet_v2_float.onnx; the other pipeline will detect and reuse it automatically.

cd frameworks/

# Recommended order — ONNX is generated once and reused
python pipeline_pytorch_mobile.py
python pipeline_tflite.py
python pipeline_ort_mobile.py

After all three pipelines complete, the models/ directory will contain:

models/
├── onnx/
│   └── mobilenet_v2_float.onnx                   # Shared intermediate
├── pytorch/
│   └── mobilenet_v2_static_quantized.ptl          # → aarch64/models/pytorch/
├── tflite/
│   └── mobilenet_v2_static_quantized.tflite       # → aarch64/models/tflite/
└── ort/
    └── mobilenet_v2_static_quantized.onnx         # → aarch64/models/onnx/

Copy the final model files into aarch64/models/ (for the Pi benchmark) or into the Android app's assets/ directories before building or running.

Adding More Calibration Images

More calibration images improve quantization accuracy. Add any JPEG or PNG images to frameworks/images/calibration/ before running the pipelines. All three pipelines pick them up automatically — no code changes needed. ImageNet validation images are a good source for representative calibration data.
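The automatic pickup is essentially a listing of the folder filtered by extension. A sketch of the idea (function name is illustrative, not the pipelines' actual code):

```python
from pathlib import Path

def find_calibration_images(folder="images/calibration"):
    """Collect JPEG/PNG calibration images; sharing one listing across
    pipelines means quantization statistics see identical inputs."""
    exts = {".jpg", ".jpeg", ".png"}
    return sorted(p for p in Path(folder).iterdir()
                  if p.suffix.lower() in exts)
```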


Dependencies Summary

Model Conversion (frameworks/)

Each pipeline has its own dependency set. Install into a virtualenv on your development machine (x86, any OS):

# PyTorch Mobile pipeline
pip install torch torchvision

# TFLite pipeline
pip install onnx2tf tensorflow onnx onnxsim torch torchvision

# ONNX Runtime pipeline
pip install onnx onnxruntime torch torchvision

Python benchmark runner (aarch64/)

Install on the Raspberry Pi or inside the ARM64 chroot:

pip install "numpy<2" psutil tflite-runtime onnxruntime
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

Always install numpy<2 first — it is a build-time dependency for the TFLite wheel.

Android app (app/) — managed automatically by Gradle

implementation 'org.tensorflow:tensorflow-lite:2.17.0'
implementation 'org.pytorch:pytorch_android_lite:2.1.0'
implementation 'com.microsoft.onnxruntime:onnxruntime-android:1.24.3'

Troubleshooting

Error                             Cause                                        Fix
_ARRAY_API not found              NumPy ≥ 2.0 incompatible with TFLite wheel   pip install "numpy<2"
failed to parse /proc/cpuinfo     QEMU CPU info not ARM-compatible             Compile fake_cpuinfo.so, launch with LD_PRELOAD
Can't open MIDR_EL1 sysfs entry   QEMU missing ARM CPU register files          Launch with LD_PRELOAD=./fake_cpuinfo.so python benchmark.py
ONNX subprocess crash (exit 6)    ORT reads /proc/cpuinfo via C library        Compile fake_cpuinfo.so — benchmark.py auto-discovers it
mount point does not exist        /proc not mounted in chroot                  sudo mount -t proc proc /opt/pi-chroot/proc
Illegal instruction on PyTorch    Running 32-bit OS                            Use Raspberry Pi OS 64-bit
.ptl fails on desktop x86         Mobile Lite format is ARM-only               Use .pt for desktop or emulator testing
Out of memory on 2 GB Pi          PyTorch peak RAM                             Increase swap (see aarch64/README.md §1.12)
Warmup takes too long in QEMU     Slow emulation × 50 warmup runs              Adaptive cap handles this; or pass --max-warmup-seconds 10
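The two most common failures in the table (NumPy version and a missing /proc in the chroot) can be caught before a run. A hypothetical preflight helper, not part of benchmark.py:

```python
from pathlib import Path

def preflight_warnings(numpy_version, proc_root="/proc"):
    """Return human-readable warnings for known environment problems.

    numpy_version is a version string such as numpy.__version__;
    proc_root is parameterized so a chroot path can be checked too.
    """
    warnings = []
    major = int(numpy_version.split(".")[0])
    if major >= 2:
        warnings.append('NumPy >= 2 breaks the TFLite wheel: pip install "numpy<2"')
    if not Path(proc_root, "cpuinfo").exists():
        warnings.append("/proc not mounted (chroot?): mount -t proc proc <chroot>/proc")
    return warnings
```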

Research Context

This benchmark suite accompanies a research article comparing edge inference runtimes. Sprint 1 covers Android (x86 emulator) with TFLite and PyTorch Mobile. Sprint 2 expands the study to include ONNX Runtime and measurements on physical ARM devices (Raspberry Pi 4).

Metrics collected: inference latency (mean, std, P95), throughput (FPS), model binary size (MB), and qualitative notes on conversion effort and tooling maturity.

