A controlled benchmarking project comparing three ML inference frameworks — TensorFlow Lite, ONNX Runtime, and PyTorch Mobile — using a quantized MobileNet-V2 model across two target environments: Android devices and ARM64 (Raspberry Pi 4).
This project is the foundation for a research article on ML inference performance at the edge, covering latency, throughput, model size, and qualitative conversion effort across frameworks.
AI workloads are increasingly moving from cloud to edge devices to meet low-latency, privacy, and connectivity requirements. Edge hardware imposes strict constraints — limited CPU, modest RAM, and tight latency budgets — so choosing the right inference runtime is critical.
This suite provides a repeatable, fair comparison by enforcing identical benchmark methodology across all three frameworks: single-threaded inference, fixed warm-up runs, integer millisecond latency (matching Android's nanosecond-to-ms floor conversion), and the same MobileNet-V2 input shape.
```
edge-ai-benchmark/
├── aarch64/                        # ARM64 benchmark (Raspberry Pi 4 / QEMU)
│   ├── benchmark.py                # Main benchmark runner — all three frameworks
│   ├── fake_cpuinfo.c              # LD_PRELOAD shim source (QEMU only)
│   ├── fake_cpuinfo.so             # Compiled shim (generated, not committed)
│   ├── README.md                   # Full setup guide (Pi hardware + QEMU)
│   └── models/                     # Place converted models here before benchmarking
│       ├── tflite/
│       ├── onnx/
│       └── pytorch/
│
├── app/                            # Android benchmark application
│   └── AndroidBenchmarkMulti/
│       └── app/
│           ├── build.gradle        # Dependencies: TFLite 2.17, PyTorch 2.1, ORT 1.24.3
│           └── src/main/
│               ├── AndroidManifest.xml
│               └── assets/
│                   ├── onnx/       # ONNX model assets
│                   └── pytorch/    # PyTorch model assets
│
└── frameworks/                     # Model conversion pipelines
    ├── preprocess_shared.py        # Shared preprocessing (resize, crop, ImageNet normalize)
    ├── pipeline_pytorch_mobile.py  # Pipeline 1: PyTorch → FX INT8 → .ptl
    ├── pipeline_tflite.py          # Pipeline 2: PyTorch → ONNX → TF SavedModel → .tflite
    ├── pipeline_ort_mobile.py      # Pipeline 3: PyTorch → ONNX → ORT INT8 QDQ → .onnx
    ├── images/
    │   ├── calibration/            # Calibration images shared by all three pipelines
    │   └── sample.jpg              # Sample image for quick inference checks
    └── models/                     # Output directory (populated after running pipelines)
        ├── onnx/                   # Shared float ONNX intermediate
        ├── pytorch/                # Final .ptl artifact
        ├── tflite/                 # Final .tflite artifact + SavedModel intermediate
        └── ort/                    # Final INT8 QDQ .onnx artifact
```
All benchmarks use MobileNet-V2 (ImageNet, 1000 classes) with a 224×224×3 input. Each framework receives a model converted from the same baseline checkpoint under the same quantization rules, ensuring the comparison is as apples-to-apples as possible.
| Framework | Format | Precision |
|---|---|---|
| TensorFlow Lite | `.tflite` | INT8 static quantization |
| ONNX Runtime | `.onnx` | FP32 / dynamic / static |
| PyTorch Mobile | `.pt` / `.ptl` | INT8 dynamic quantization |
All three benchmark targets (the Python script and the Android app) share the same rules:

- Single-threaded inference — `num_threads=1` for all frameworks
- Input tensor allocated once and reused across all runs
- Latency stored as integer milliseconds (floor, matching Java's `/ 1_000_000`)
- CPU time computed as `total_accumulated / NUM_RUNS`
- Adaptive warm-up — default 50 warm-up runs, automatically capped on slow hardware (QEMU) so total warm-up stays under `--max-warmup-seconds`
- Metrics reported: mean latency, std dev, min, max, median, P95, CPU time, throughput (FPS), memory delta (MB)
A Python script that mirrors the Android app's benchmark logic, targeting ARM64 hardware. Supports both a real Raspberry Pi 4 and a QEMU-emulated ARM64 environment on Windows (WSL 2).
| File | Purpose |
|---|---|
| `benchmark.py` | Main runner — discovers model files, benchmarks all frameworks, prints a summary table, and saves a JSON results file |
| `fake_cpuinfo.c` | C source for an `LD_PRELOAD` shim that intercepts `fopen()` / `open()` calls so ONNX Runtime and PyTorch don't crash on QEMU's broken `/proc/cpuinfo` and ARM sysfs registers |
| Extension | Framework |
|---|---|
| `.tflite` | TensorFlow Lite |
| `.onnx` | ONNX Runtime |
| `.pt` / `.pth` / `.ptl` | PyTorch |
| Argument | Default | Description |
|---|---|---|
| `--models-dir` | `./models` | Directory to scan for model files (recursive) |
| `--runs` | `100` | Number of timed inference runs per model |
| `--warmup` | `50` | Max warm-up runs — auto-capped on slow hardware |
| `--max-warmup-seconds` | `30` | Wall-time cap for the warm-up phase |
| `--output` | `results.json` | JSON file to save results (`''` to skip) |
| `--skip` | (none) | Frameworks to skip: `tflite`, `onnx`, `pytorch` |
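The flag set in the table maps onto a stock `argparse` parser roughly as follows. Names and defaults are taken from the table; the wiring itself is illustrative, not the real `benchmark.py`.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags and defaults mirror the table above; help strings are illustrative.
    p = argparse.ArgumentParser(description="Edge AI benchmark runner (sketch)")
    p.add_argument("--models-dir", default="./models",
                   help="Directory to scan for model files (recursive)")
    p.add_argument("--runs", type=int, default=100,
                   help="Number of timed inference runs per model")
    p.add_argument("--warmup", type=int, default=50,
                   help="Max warm-up runs (auto-capped on slow hardware)")
    p.add_argument("--max-warmup-seconds", type=float, default=30,
                   help="Wall-time cap for the warm-up phase")
    p.add_argument("--output", default="results.json",
                   help="JSON file to save results ('' to skip)")
    p.add_argument("--skip", nargs="*", default=[],
                   choices=["tflite", "onnx", "pytorch"],
                   help="Frameworks to skip")
    return p
```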
```bash
# Activate the virtual environment
source ~/benchmark-env/bin/activate
cd ~/edgeai

# Run all three frameworks
python benchmark.py --models-dir ./models

# Custom run count, output file
python benchmark.py --models-dir ./models --runs 100 --warmup 50 --output results.json

# Skip a specific framework
python benchmark.py --models-dir ./models --skip pytorch
```

ONNX Runtime and PyTorch make C-level `fopen()` / `open()` calls to read `/proc/cpuinfo` and ARM-specific sysfs register files at startup. QEMU does not expose these correctly, causing a hard abort before Python can intercept it. The `fake_cpuinfo.so` shim intercepts these calls at the dynamic-linker level and returns valid Raspberry Pi 4 data instead.
Compile the shim once (inside the ARM64 chroot):
```bash
cd /opt/edgeai
gcc -shared -fPIC -O2 -o fake_cpuinfo.so fake_cpuinfo.c -ldl
```

Run with the shim:
```bash
# All three frameworks
LD_PRELOAD=./fake_cpuinfo.so python benchmark.py --models-dir ./models

# Save results to a file
LD_PRELOAD=./fake_cpuinfo.so python benchmark.py --models-dir ./models --output results.json

# TFLite only — shim not needed (TFLite does not read cpuinfo natively)
python benchmark.py --models-dir ./models --skip onnx pytorch
```

When the shim is active you will see interception confirmations:
```
[fake_cpuinfo] intercepted fopen: /proc/cpuinfo
[fake_cpuinfo] intercepted open: /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
```
Note: The shim is only needed in QEMU. On real Raspberry Pi 4 hardware, run `benchmark.py` directly.
```
============================================================
Edge AI Benchmark Suite
Runs: 100 | Warmup: up to 50 | Threads: 1
Max warmup time: 30s per model
============================================================

[running] mobilenet_v2_static_quantized.tflite (tflite)

--- mobilenet_v2_static_quantized.tflite ---
Framework: TensorFlow Lite
Runs: 100 (warmup: 50)
Latency:
  Mean:    52 ms
  Std Dev: 3.1 ms
  Min:     48 ms
  Max:     61 ms
  Median:  51 ms
  P95:     58 ms
Resources:
  Memory:     4 MB
  CPU Time:   51.8 ms
  Throughput: 19.23 FPS
```
| Model | Framework | Mean Latency | FPS |
|---|---|---|---|
| `mobilenet_v2_static_quantized.tflite` | TFLite | ~40–80 ms | ~12–25 |
| `mobilenet_v2_fp32.onnx` | ONNX Runtime | ~120–180 ms | ~5–8 |
| `mobilenet_v2_dynamic.onnx` | ONNX Runtime | ~120–180 ms | ~5–8 |
| `mobilenet_v2_static_quantized.ptl` | PyTorch | ~200–350 ms | ~3–5 |
TFLite typically leads on Pi 4 for quantized INT8 models due to its optimised XNNPACK delegate.
For complete setup instructions (OS flashing, SSH, virtualenv, framework installation, QEMU chroot setup, and troubleshooting), see aarch64/README.md.
An Android application (com.edgeai.benchmark) that runs the same benchmark logic on Android devices or the Android emulator (AVD).
| Setting | Value |
|---|---|
| `minSdk` | 24 (Android 7.0) |
| `targetSdk` | 34 (Android 14) |

| Framework | Dependency |
|---|---|
| TFLite | `org.tensorflow:tensorflow-lite:2.17.0` |
| PyTorch Mobile | `org.pytorch:pytorch_android_lite:2.1.0` |
| ONNX Runtime | `com.microsoft.onnxruntime:onnxruntime-android:1.24.3` |
Place converted model files in the appropriate assets directory before building:
```
app/src/main/assets/
├── onnx/     ← .onnx files
└── pytorch/  ← .pt / .ptl files
```
TFLite models are loaded from a separate assets path configured in the app.
Open the app/AndroidBenchmarkMulti/ folder in Android Studio. Sync Gradle, connect a device or start an AVD, then run the app configuration. Results are displayed in-app and logged to Logcat.
All three pipelines start from the same PyTorch MobileNet-V2 checkpoint (torchvision DEFAULT ImageNet weights) and produce an INT8-quantized model in the format required by each framework. They share a common preprocessing module and a common calibration image folder, ensuring that quantization statistics are computed on identical input data across all three frameworks.
```
frameworks/
├── preprocess_shared.py            # Shared preprocessing used by all three pipelines
├── pipeline_pytorch_mobile.py      # Pipeline 1 — PyTorch Mobile (.ptl)
├── pipeline_tflite.py              # Pipeline 2 — TensorFlow Lite (.tflite)
├── pipeline_ort_mobile.py          # Pipeline 3 — ONNX Runtime Mobile (.onnx, INT8)
├── images/
│   ├── calibration/                # Calibration images (JPEG/PNG) used by all pipelines
│   │   └── img01.jpg               # Add more images here to improve quantization quality
│   └── sample.jpg                  # Sample inference image for quick sanity checks
└── models/                         # Output directory — populated after running the pipelines
    ├── pytorch/
    │   └── mobilenet_v2_static_quantized.ptl
    ├── tflite/
    │   ├── saved_model/            # Intermediate TF SavedModel (onnx2tf output)
    │   └── mobilenet_v2_static_quantized.tflite
    └── ort/
        └── mobilenet_v2_static_quantized.onnx
```
The intermediate `models/onnx/mobilenet_v2_float.onnx` is generated automatically by both the TFLite and ORT pipelines and is shared between them. It only needs to be exported once.
Used by all three pipelines for both calibration and inference. Implements the standard torchvision MobileNet-V2 preprocessing pipeline:
- Resize shortest edge to 256 px (bilinear)
- Centre-crop to 224×224
- Normalize to `[0, 1]`, then apply ImageNet mean/std (`[0.485, 0.456, 0.406]` / `[0.229, 0.224, 0.225]`)
Two variants are exposed to handle the axis layout difference between frameworks:
| Function | Output shape | Used by |
|---|---|---|
| `load_image_nchw(path)` | `(1, 3, 224, 224)` float32 | PyTorch, ONNX Runtime |
| `load_image_nhwc(path)` | `(1, 224, 224, 3)` float32 | TFLite (channels-last after onnx2tf) |
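The normalization and layout split can be illustrated with a short NumPy sketch. The resize and centre-crop steps are omitted (the real module handles full image decoding), and the function names here are illustrative, not the actual `preprocess_shared.py` API:

```python
import numpy as np

# Standard ImageNet statistics, as listed above.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_hwc(img_u8: np.ndarray) -> np.ndarray:
    """Scale a (224, 224, 3) uint8 image to [0, 1], then apply mean/std."""
    x = img_u8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

def to_nchw(img_u8: np.ndarray) -> np.ndarray:
    # (1, 3, 224, 224): PyTorch / ONNX Runtime layout
    return normalize_hwc(img_u8).transpose(2, 0, 1)[np.newaxis, ...]

def to_nhwc(img_u8: np.ndarray) -> np.ndarray:
    # (1, 224, 224, 3): TFLite channels-last layout after onnx2tf
    return normalize_hwc(img_u8)[np.newaxis, ...]
```

Keeping a single normalization path and branching only on the final transpose guarantees that both layouts carry numerically identical pixel values.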
Output: `models/pytorch/mobilenet_v2_static_quantized.ptl`
Conversion steps:
- Load `MobileNet_V2_Weights.DEFAULT` from torchvision
- Apply FX-graph static INT8 quantization using the QNNPACK backend (per-tensor symmetric weights — QNNPACK's native quantization mode on ARM)
- Calibrate with images from `images/calibration/`
- Trace to TorchScript
- Run `optimize_for_mobile` (conv-BN fusion, mobile kernel selection)
- Save as a Lite Interpreter artifact (`.ptl`) via `_save_for_lite_interpreter`
Dependencies:

```bash
pip install torch torchvision
```

Run:

```bash
cd frameworks/
python pipeline_pytorch_mobile.py
```

Design choices: QNNPACK is chosen over FBGEMM because it is ARM-optimised and matches the Android benchmark app's backend. Per-tensor quantization is used because QNNPACK does not efficiently support per-channel convolution weights. I/O remains float32 for a fair comparison with the other two pipelines.
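The FX quantization steps can be sketched as follows. A toy conv net stands in for torchvision's MobileNet-V2 so the snippet stays self-contained, random tensors replace the calibration images, and the backend falls back to FBGEMM on x86 builds that lack QNNPACK; the real pipeline runs QNNPACK with the actual checkpoint.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.utils.mobile_optimizer import optimize_for_mobile

# Toy stand-in for MobileNet-V2 (the real pipeline loads
# MobileNet_V2_Weights.DEFAULT from torchvision).
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
).eval()

# QNNPACK on ARM; fall back where the local build lacks it.
engine = "qnnpack" if "qnnpack" in torch.backends.quantized.supported_engines else "fbgemm"
torch.backends.quantized.engine = engine

example = torch.randn(1, 3, 224, 224)
prepared = prepare_fx(model, get_default_qconfig_mapping(engine),
                      example_inputs=(example,))

# Calibration pass: the real pipeline feeds images/calibration/ here.
with torch.no_grad():
    for _ in range(4):
        prepared(torch.randn(1, 3, 224, 224))

quantized = convert_fx(prepared)                # static INT8 graph
scripted = torch.jit.trace(quantized, example)  # TorchScript
mobile = optimize_for_mobile(scripted)          # conv-BN fusion, mobile kernels
mobile._save_for_lite_interpreter("toy_static_quantized.ptl")
```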
Output: `models/tflite/mobilenet_v2_static_quantized.tflite`
Conversion steps:
- Export PyTorch MobileNet-V2 to ONNX (opset 12, NCHW) — skipped if `models/onnx/mobilenet_v2_float.onnx` already exists
- Convert ONNX → TF SavedModel via `onnx2tf` (automatically runs `onnxsim` and transposes NCHW → NHWC)
- Apply full-integer INT8 quantization (`Optimize.DEFAULT`, `TFLITE_BUILTINS_INT8`) using a representative dataset built from `images/calibration/`
- Write the final `.tflite` file
Dependencies:

```bash
pip install onnx2tf tensorflow onnx onnxsim torch torchvision
```

Run:

```bash
cd frameworks/
python pipeline_tflite.py
```

Design choices: `onnx2tf` is used instead of the raw `onnx-tf` converter because it runs the ONNX graph simplifier before conversion, producing a cleaner TF graph. Weights are sourced from PyTorch (not TF's own ImageNet weights) to match the other pipelines. I/O is kept at float32 — `inference_input_type` and `inference_output_type` are intentionally left at their defaults so that all three pipelines measure the same dequantisation boundary overhead during benchmarking. The XNNPACK delegate for INT8 ops is enabled automatically by `TFLITE_BUILTINS_INT8`; no GPU delegate or NNAPI is configured.
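A minimal sketch of the quantization step, assuming a SavedModel directory produced by `onnx2tf`. The converter flags match the step list above, while the representative dataset here yields random tensors instead of preprocessed calibration images:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Stand-in calibration batches; the real pipeline yields preprocessed
    # NHWC float32 images loaded from images/calibration/.
    for _ in range(8):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

def convert_saved_model_to_int8(saved_model_dir: str) -> bytes:
    """Full-integer INT8 conversion of an onnx2tf SavedModel (sketch)."""
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    # Force integer kernels; inference_input_type / inference_output_type
    # stay at their float32 defaults on purpose.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    return converter.convert()
```

In the real pipeline the returned bytes are written to `models/tflite/mobilenet_v2_static_quantized.tflite`.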
Output: `models/ort/mobilenet_v2_static_quantized.onnx`
Conversion steps:
- Export PyTorch MobileNet-V2 to ONNX (opset 12, NCHW) — skipped if `models/onnx/mobilenet_v2_float.onnx` already exists
- Apply ORT static INT8 quantization (`quantize_static`) using `MobileNetCalibrationReader` with images from `images/calibration/`
Dependencies:

```bash
pip install onnx onnxruntime torch torchvision
```

Run:

```bash
cd frameworks/
python pipeline_ort_mobile.py
```

Design choices: QDQ format (`QuantFormat.QDQ`) is used — ORT's recommended format for ARM kernels. Per-channel weight quantization (`per_channel=True`) gives better accuracy than per-tensor with no runtime penalty on supported kernels. Activation type is `QUInt8`, weight type is `QInt8`, consistent with ORT's recommended QDQ settings. I/O is float32 to match the other pipelines.
The pipelines are independent but share the ONNX export step. Running TFLite first (or ORT first) will generate `models/onnx/mobilenet_v2_float.onnx`; the other pipeline will detect and reuse it automatically.
```bash
cd frameworks/

# Recommended order — ONNX is generated once and reused
python pipeline_pytorch_mobile.py
python pipeline_tflite.py
python pipeline_ort_mobile.py
```

After all three pipelines complete, the `models/` directory will contain:
```
models/
├── onnx/
│   └── mobilenet_v2_float.onnx                # Shared intermediate
├── pytorch/
│   └── mobilenet_v2_static_quantized.ptl      # → aarch64/models/pytorch/
├── tflite/
│   └── mobilenet_v2_static_quantized.tflite   # → aarch64/models/tflite/
└── ort/
    └── mobilenet_v2_static_quantized.onnx     # → aarch64/models/onnx/
```
Copy the final model files into aarch64/models/ (for the Pi benchmark) or into the Android app's assets/ directories before building or running.
More calibration images improve quantization accuracy. Add any JPEG or PNG images to frameworks/images/calibration/ before running the pipelines. All three pipelines pick them up automatically — no code changes needed. ImageNet validation images are a good source for representative calibration data.
Each pipeline has its own dependency set. Install into a virtualenv on your development machine (x86, any OS):
```bash
# PyTorch Mobile pipeline
pip install torch torchvision

# TFLite pipeline
pip install onnx2tf tensorflow onnx onnxsim torch torchvision

# ONNX Runtime pipeline
pip install onnx onnxruntime torch torchvision
```

Install on the Raspberry Pi or inside the ARM64 chroot:
```bash
pip install "numpy<2" psutil tflite-runtime onnxruntime
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
```

Always install `numpy<2` first — it is a build-time dependency for the TFLite wheel.
```gradle
implementation 'org.tensorflow:tensorflow-lite:2.17.0'
implementation 'org.pytorch:pytorch_android_lite:2.1.0'
implementation 'com.microsoft.onnxruntime:onnxruntime-android:1.24.3'
```

| Error | Cause | Fix |
|---|---|---|
| `_ARRAY_API not found` | NumPy ≥ 2.0 incompatible with the TFLite wheel | `pip install "numpy<2"` |
| `failed to parse /proc/cpuinfo` | QEMU CPU info not ARM-compatible | Compile `fake_cpuinfo.so`, launch with `LD_PRELOAD` |
| `Can't open MIDR_EL1 sysfs entry` | QEMU missing ARM CPU register files | Launch with `LD_PRELOAD=./fake_cpuinfo.so python benchmark.py` |
| ONNX subprocess crash (exit 6) | ORT reads `/proc/cpuinfo` via the C library | Compile `fake_cpuinfo.so` — `benchmark.py` auto-discovers it |
| `mount point does not exist` | `/proc` not mounted in chroot | `sudo mount -t proc proc /opt/pi-chroot/proc` |
| `Illegal instruction` on PyTorch | Running a 32-bit OS | Use Raspberry Pi OS 64-bit |
| `.ptl` fails on desktop x86 | Mobile Lite format is ARM-only | Use `.pt` for desktop or emulator testing |
| Out of memory on 2 GB Pi | PyTorch peak RAM | Increase swap (see `aarch64/README.md` §1.12) |
| Warmup takes too long in QEMU | Slow emulation × 50 warm-up runs | Adaptive cap handles this; or pass `--max-warmup-seconds 10` |
This benchmark suite accompanies a research article comparing edge inference runtimes. Sprint 1 covers Android (x86 emulator) with TFLite and PyTorch Mobile. Sprint 2 expands the study to include ONNX Runtime and measurements on physical ARM devices (Raspberry Pi 4).
Metrics collected: inference latency (mean, std, P95), throughput (FPS), model binary size (MB), and qualitative notes on conversion effort and tooling maturity.
Edge AI Framework Comparison — MobileNet-V2 Benchmark Suite