A controlled benchmarking project comparing three ML inference frameworks — TensorFlow Lite, ONNX Runtime, and PyTorch Mobile — using a quantized MobileNet-V2 model across two target environments: Android devices and ARM64 (Raspberry Pi 4).
This project is the foundation for a research article on ML inference performance at the edge, covering latency, throughput, model size, and qualitative conversion effort across frameworks.
AI workloads are increasingly moving from cloud to edge devices to meet low-latency, privacy, and connectivity requirements. Edge hardware imposes strict constraints — limited CPU, modest RAM, and tight latency budgets — so choosing the right inference runtime is critical.
This suite provides a repeatable, fair comparison by enforcing identical benchmark methodology across all three frameworks: single-threaded inference, fixed warm-up runs, integer millisecond latency (matching Android's nanosecond-to-ms floor conversion), and the same MobileNet-V2 input shape.
```
edge-ai-benchmark/
├── aarch64/                        # ARM64 benchmark (Raspberry Pi 4 / QEMU)
│   ├── benchmark.py                # Main benchmark runner — all three frameworks
│   ├── fake_cpuinfo.c              # LD_PRELOAD shim source (QEMU only)
│   ├── fake_cpuinfo.so             # Compiled shim (generated, not committed)
│   ├── README.md                   # Full setup guide (Pi hardware + QEMU)
│   └── models/                     # Place converted models here before benchmarking
│       ├── tflite/
│       ├── onnx/
│       └── pytorch/
│
├── app/                            # Android benchmark application
│   └── AndroidBenchmarkMulti/
│       └── app/
│           ├── build.gradle        # Dependencies: TFLite 2.17, PyTorch 2.1, ORT 1.24.3
│           └── src/main/
│               ├── AndroidManifest.xml
│               └── assets/
│                   ├── onnx/       # ONNX model assets
│                   └── pytorch/    # PyTorch model assets
│
└── frameworks/                     # Model conversion pipelines
    ├── preprocess_shared.py        # Shared preprocessing (resize, crop, ImageNet normalize)
    ├── pipeline_pytorch_mobile.py  # Pipeline 1: PyTorch → FX INT8 → .ptl
    ├── pipeline_tflite.py          # Pipeline 2: PyTorch → ONNX → TF SavedModel → .tflite
    ├── pipeline_ort_mobile.py      # Pipeline 3: PyTorch → ONNX → ORT INT8 QDQ → .onnx
    ├── images/
    │   ├── calibration/            # Calibration images shared by all three pipelines
    │   └── sample.jpg              # Sample image for quick inference checks
    └── models/                     # Output directory (populated after running pipelines)
        ├── onnx/                   # Shared float ONNX intermediate
        ├── pytorch/                # Final .ptl artifact
        ├── tflite/                 # Final .tflite artifact + SavedModel intermediate
        └── ort/                    # Final INT8 QDQ .onnx artifact
```
All benchmarks use MobileNet-V2 (ImageNet, 1000 classes) with a 224×224×3 input. Each framework receives a model converted from the same baseline checkpoint under the same quantization rules, ensuring the comparison is as apples-to-apples as possible.
| Framework | Format | Precision |
|---|---|---|
| TensorFlow Lite | `.tflite` | INT8 static quantization |
| ONNX Runtime | `.onnx` | FP32 / dynamic / static |
| PyTorch Mobile | `.pt` / `.ptl` | INT8 dynamic quantization |
All three benchmark targets (the Python script and the Android app) share the same rules:

- Single-threaded inference — `num_threads=1` for all frameworks
- Input tensor allocated once and reused across all runs
- Latency stored as integer milliseconds (floor, matching Java's `/ 1_000_000`)
- CPU time computed as `total_accumulated / NUM_RUNS`
- Adaptive warm-up — default 50 warm-up runs, automatically capped on slow hardware (QEMU) so total warm-up stays under `--max-warmup-seconds`
- Metrics reported: mean latency, std dev, min, max, median, P95, CPU time, throughput (FPS), memory delta (MB)
A Python script that mirrors the Android app's benchmark logic, targeting ARM64 hardware. Supports both a real Raspberry Pi 4 and a QEMU-emulated ARM64 environment on Windows (WSL 2).
| File | Purpose |
|---|---|
| `benchmark.py` | Main runner — discovers model files, benchmarks all frameworks, prints a summary table, and saves a JSON results file |
| `fake_cpuinfo.c` | C source for an `LD_PRELOAD` shim that intercepts `fopen()` / `open()` calls so ONNX Runtime and PyTorch don't crash on QEMU's broken `/proc/cpuinfo` and ARM sysfs registers |
| Extension | Framework |
|---|---|
| `.tflite` | TensorFlow Lite |
| `.onnx` | ONNX Runtime |
| `.pt` / `.pth` / `.ptl` | PyTorch |
| Argument | Default | Description |
|---|---|---|
| `--models-dir` | `./models` | Directory to scan for model files (recursive) |
| `--runs` | `100` | Number of timed inference runs per model |
| `--warmup` | `50` | Max warm-up runs — auto-capped on slow hardware |
| `--max-warmup-seconds` | `30` | Wall-time cap for the warm-up phase |
| `--output` | `results.json` | JSON file to save results (`''` to skip) |
| `--skip` | (none) | Frameworks to skip: `tflite`, `onnx`, `pytorch` |
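The flag set in the table maps onto a stock `argparse` parser roughly as follows. Names and defaults are taken from the table; the wiring itself is illustrative, not the real `benchmark.py`.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags and defaults mirror the table above; help strings are illustrative.
    p = argparse.ArgumentParser(description="Edge AI benchmark runner (sketch)")
    p.add_argument("--models-dir", default="./models",
                   help="Directory to scan for model files (recursive)")
    p.add_argument("--runs", type=int, default=100,
                   help="Number of timed inference runs per model")
    p.add_argument("--warmup", type=int, default=50,
                   help="Max warm-up runs (auto-capped on slow hardware)")
    p.add_argument("--max-warmup-seconds", type=float, default=30,
                   help="Wall-time cap for the warm-up phase")
    p.add_argument("--output", default="results.json",
                   help="JSON file to save results ('' to skip)")
    p.add_argument("--skip", nargs="*", default=[],
                   choices=["tflite", "onnx", "pytorch"],
                   help="Frameworks to skip")
    return p
```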
```bash
# Activate the virtual environment
source ~/benchmark-env/bin/activate
cd ~/edgeai

# Run all three frameworks
python benchmark.py --models-dir ./models

# Custom run count, output file
python benchmark.py --models-dir ./models --runs 100 --warmup 50 --output results.json

# Skip a specific framework
python benchmark.py --models-dir ./models --skip pytorch
```

ONNX Runtime and PyTorch make C-level `fopen()` / `open()` calls to read `/proc/cpuinfo` and ARM-specific sysfs register files at startup. QEMU does not expose these correctly, causing a hard abort before Python can intercept it. The `fake_cpuinfo.so` shim intercepts these calls at the dynamic-linker level and returns valid Raspberry Pi 4 data instead.
Compile the shim once (inside the ARM64 chroot):
```bash
cd /opt/edgeai
gcc -shared -fPIC -O2 -o fake_cpuinfo.so fake_cpuinfo.c -ldl
```

Run with the shim:
```bash
# All three frameworks
LD_PRELOAD=./fake_cpuinfo.so python benchmark.py --models-dir ./models

# Save results to a file
LD_PRELOAD=./fake_cpuinfo.so python benchmark.py --models-dir ./models --output results.json

# TFLite only — shim not needed (TFLite does not read cpuinfo natively)
python benchmark.py --models-dir ./models --skip onnx pytorch
```

When the shim is active you will see interception confirmations:
```
[fake_cpuinfo] intercepted fopen: /proc/cpuinfo
[fake_cpuinfo] intercepted open: /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
```
Note: The shim is only needed in QEMU. On real Raspberry Pi 4 hardware, run `benchmark.py` directly.
```
============================================================
Edge AI Benchmark Suite
Runs: 100 | Warmup: up to 50 | Threads: 1
Max warmup time: 30s per model
============================================================

[running] mobilenet_v2_static_quantized.tflite (tflite)

--- mobilenet_v2_static_quantized.tflite ---
Framework: TensorFlow Lite
Runs: 100 (warmup: 50)
Latency:
  Mean:    52 ms
  Std Dev: 3.1 ms
  Min:     48 ms
  Max:     61 ms
  Median:  51 ms
  P95:     58 ms
Resources:
  Memory:     4 MB
  CPU Time:   51.8 ms
  Throughput: 19.23 FPS
```
| Model | Framework | Mean Latency | FPS |
|---|---|---|---|
| `mobilenet_v2_static_quantized.tflite` | TFLite | ~40–80 ms | ~12–25 |
| `mobilenet_v2_fp32.onnx` | ONNX Runtime | ~120–180 ms | ~5–8 |
| `mobilenet_v2_dynamic.onnx` | ONNX Runtime | ~120–180 ms | ~5–8 |
| `mobilenet_v2_static_quantized.ptl` | PyTorch | ~200–350 ms | ~3–5 |
TFLite typically leads on Pi 4 for quantized INT8 models due to its optimised XNNPACK delegate.
For complete setup instructions (OS flashing, SSH, virtualenv, framework installation, QEMU chroot setup, and troubleshooting), see aarch64/README.md.
An Android application (com.edgeai.benchmark) that runs the same benchmark logic on Android devices or the Android emulator (AVD).
| Setting | Value |
|---|---|
| `minSdk` | 24 (Android 7.0) |
| `targetSdk` | 34 (Android 14) |

| Framework | Dependency |
|---|---|
| TFLite | `org.tensorflow:tensorflow-lite:2.17.0` |
| PyTorch Mobile | `org.pytorch:pytorch_android_lite:2.1.0` |
| ONNX Runtime | `com.microsoft.onnxruntime:onnxruntime-android:1.24.3` |
Place converted model files in the appropriate assets directory before building:
```
app/src/main/assets/
├── onnx/     ← .onnx files
└── pytorch/  ← .pt / .ptl files
```
TFLite models are loaded from a separate assets path configured in the app.
Open the app/AndroidBenchmarkMulti/ folder in Android Studio. Sync Gradle, connect a device or start an AVD, then run the app configuration. Results are displayed in-app and logged to Logcat.
All three pipelines start from the same PyTorch MobileNet-V2 checkpoint (torchvision DEFAULT ImageNet weights) and produce an INT8-quantized model in the format required by each framework. They share a common preprocessing module and a common calibration image folder, ensuring that quantization statistics are computed on identical input data across all three frameworks.
```
frameworks/
├── preprocess_shared.py            # Shared preprocessing used by all three pipelines
├── pipeline_pytorch_mobile.py      # Pipeline 1 — PyTorch Mobile (.ptl)
├── pipeline_tflite.py              # Pipeline 2 — TensorFlow Lite (.tflite)
├── pipeline_ort_mobile.py          # Pipeline 3 — ONNX Runtime Mobile (.onnx, INT8)
├── images/
│   ├── calibration/                # Calibration images (JPEG/PNG) used by all pipelines
│   │   └── img01.jpg               # Add more images here to improve quantization quality
│   └── sample.jpg                  # Sample inference image for quick sanity checks
└── models/                         # Output directory — populated after running the pipelines
    ├── pytorch/
    │   └── mobilenet_v2_static_quantized.ptl
    ├── tflite/
    │   ├── saved_model/            # Intermediate TF SavedModel (onnx2tf output)
    │   └── mobilenet_v2_static_quantized.tflite
    └── ort/
        └── mobilenet_v2_static_quantized.onnx
```
The intermediate `models/onnx/mobilenet_v2_float.onnx` is generated automatically by both the TFLite and ORT pipelines and is shared between them. It only needs to be exported once.
Used by all three pipelines for both calibration and inference. Implements the standard torchvision MobileNet-V2 preprocessing pipeline:
- Resize shortest edge to 256 px (bilinear)
- Centre-crop to 224×224
- Normalize to `[0, 1]`, then apply ImageNet mean/std (`[0.485, 0.456, 0.406]` / `[0.229, 0.224, 0.225]`)
Two variants are exposed to handle the axis layout difference between frameworks:
| Function | Output shape | Used by |
|---|---|---|
| `load_image_nchw(path)` | `(1, 3, 224, 224)` float32 | PyTorch, ONNX Runtime |
| `load_image_nhwc(path)` | `(1, 224, 224, 3)` float32 | TFLite (channels-last after onnx2tf) |
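The normalization and layout split can be illustrated with a short NumPy sketch. The resize and centre-crop steps are omitted (the real module handles full image decoding), and the function names here are illustrative, not the actual `preprocess_shared.py` API:

```python
import numpy as np

# Standard ImageNet statistics, as listed above.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_hwc(img_u8: np.ndarray) -> np.ndarray:
    """Scale a (224, 224, 3) uint8 image to [0, 1], then apply mean/std."""
    x = img_u8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

def to_nchw(img_u8: np.ndarray) -> np.ndarray:
    # (1, 3, 224, 224): PyTorch / ONNX Runtime layout
    return normalize_hwc(img_u8).transpose(2, 0, 1)[np.newaxis, ...]

def to_nhwc(img_u8: np.ndarray) -> np.ndarray:
    # (1, 224, 224, 3): TFLite channels-last layout after onnx2tf
    return normalize_hwc(img_u8)[np.newaxis, ...]
```

Keeping a single normalization path and branching only on the final transpose guarantees that both layouts carry numerically identical pixel values.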
Output: `models/pytorch/mobilenet_v2_static_quantized.ptl`
Conversion steps:
- Load `MobileNet_V2_Weights.DEFAULT` from torchvision
- Apply FX-graph static INT8 quantization using the QNNPACK backend (per-tensor symmetric weights — QNNPACK's native quantization mode on ARM)
- Calibrate with images from `images/calibration/`
- Trace to TorchScript
- Run `optimize_for_mobile` (conv-BN fusion, mobile kernel selection)
- Save as a Lite Interpreter artifact (`.ptl`) via `_save_for_lite_interpreter`
Dependencies:

```bash
pip install torch torchvision
```

Run:

```bash
cd frameworks/
python pipeline_pytorch_mobile.py
```

Design choices: QNNPACK is chosen over FBGEMM because it is ARM-optimised and matches the Android benchmark app's backend. Per-tensor quantization is used because QNNPACK does not efficiently support per-channel convolution weights. I/O remains float32 for a fair comparison with the other two pipelines.
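The FX quantization steps can be sketched as follows. A toy conv net stands in for torchvision's MobileNet-V2 so the snippet stays self-contained, random tensors replace the calibration images, and the backend falls back to FBGEMM on x86 builds that lack QNNPACK; the real pipeline runs QNNPACK with the actual checkpoint.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.utils.mobile_optimizer import optimize_for_mobile

# Toy stand-in for MobileNet-V2 (the real pipeline loads
# MobileNet_V2_Weights.DEFAULT from torchvision).
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
).eval()

# QNNPACK on ARM; fall back where the local build lacks it.
engine = "qnnpack" if "qnnpack" in torch.backends.quantized.supported_engines else "fbgemm"
torch.backends.quantized.engine = engine

example = torch.randn(1, 3, 224, 224)
prepared = prepare_fx(model, get_default_qconfig_mapping(engine),
                      example_inputs=(example,))

# Calibration pass: the real pipeline feeds images/calibration/ here.
with torch.no_grad():
    for _ in range(4):
        prepared(torch.randn(1, 3, 224, 224))

quantized = convert_fx(prepared)                # static INT8 graph
scripted = torch.jit.trace(quantized, example)  # TorchScript
mobile = optimize_for_mobile(scripted)          # conv-BN fusion, mobile kernels
mobile._save_for_lite_interpreter("toy_static_quantized.ptl")
```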
Output: `models/tflite/mobilenet_v2_static_quantized.tflite`
Conversion steps:
- Export PyTorch MobileNet-V2 to ONNX (opset 12, NCHW) — skipped if `models/onnx/mobilenet_v2_float.onnx` already exists
- Convert ONNX → TF SavedModel via `onnx2tf` (automatically runs `onnxsim` and transposes NCHW → NHWC)
- Apply full-integer INT8 quantization (`Optimize.DEFAULT`, `TFLITE_BUILTINS_INT8`) using a representative dataset built from `images/calibration/`
- Write the final `.tflite` file
Dependencies:

```bash
pip install onnx2tf tensorflow onnx onnxsim torch torchvision
```

Run:

```bash
cd frameworks/
python pipeline_tflite.py
```

Design choices: `onnx2tf` is used instead of the raw `onnx-tf` converter because it runs the ONNX graph simplifier before conversion, producing a cleaner TF graph. Weights are sourced from PyTorch (not TF's own ImageNet weights) to match the other pipelines. I/O is kept at float32 — `inference_input_type` and `inference_output_type` are intentionally left at their defaults so that all three pipelines measure the same dequantisation boundary overhead during benchmarking. The XNNPACK delegate for INT8 ops is enabled automatically by `TFLITE_BUILTINS_INT8`; no GPU delegate or NNAPI is configured.
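A minimal sketch of the quantization step, assuming a SavedModel directory produced by `onnx2tf`. The converter flags match the step list above, while the representative dataset here yields random tensors instead of preprocessed calibration images:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Stand-in calibration batches; the real pipeline yields preprocessed
    # NHWC float32 images loaded from images/calibration/.
    for _ in range(8):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

def convert_saved_model_to_int8(saved_model_dir: str) -> bytes:
    """Full-integer INT8 conversion of an onnx2tf SavedModel (sketch)."""
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    # Force integer kernels; inference_input_type / inference_output_type
    # stay at their float32 defaults on purpose.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    return converter.convert()
```

In the real pipeline the returned bytes are written to `models/tflite/mobilenet_v2_static_quantized.tflite`.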
Output: `models/ort/mobilenet_v2_static_quantized.onnx`
Conversion steps:
- Export PyTorch MobileNet-V2 to ONNX (opset 12, NCHW) — skipped if `models/onnx/mobilenet_v2_float.onnx` already exists
- Apply ORT static INT8 quantization (`quantize_static`) using `MobileNetCalibrationReader` with images from `images/calibration/`
Dependencies:

```bash
pip install onnx onnxruntime torch torchvision
```

Run:

```bash
cd frameworks/
python pipeline_ort_mobile.py
```

Design choices: QDQ format (`QuantFormat.QDQ`) is used — ORT's recommended format for ARM kernels. Per-channel weight quantization (`per_channel=True`) gives better accuracy than per-tensor with no runtime penalty on supported kernels. Activation type is `QUInt8`, weight type is `QInt8`, consistent with ORT's recommended QDQ settings. I/O is float32 to match the other pipelines.
The pipelines are independent but share the ONNX export step. Running TFLite first (or ORT first) will generate `models/onnx/mobilenet_v2_float.onnx`; the other pipeline will detect and reuse it automatically.
```bash
cd frameworks/

# Recommended order — ONNX is generated once and reused
python pipeline_pytorch_mobile.py
python pipeline_tflite.py
python pipeline_ort_mobile.py
```

After all three pipelines complete, the `models/` directory will contain:
```
models/
├── onnx/
│   └── mobilenet_v2_float.onnx                # Shared intermediate
├── pytorch/
│   └── mobilenet_v2_static_quantized.ptl      # → aarch64/models/pytorch/
├── tflite/
│   └── mobilenet_v2_static_quantized.tflite   # → aarch64/models/tflite/
└── ort/
    └── mobilenet_v2_static_quantized.onnx     # → aarch64/models/onnx/
```
Copy the final model files into aarch64/models/ (for the Pi benchmark) or into the Android app's assets/ directories before building or running.
More calibration images improve quantization accuracy. Add any JPEG or PNG images to frameworks/images/calibration/ before running the pipelines. All three pipelines pick them up automatically — no code changes needed. ImageNet validation images are a good source for representative calibration data.
Each pipeline has its own dependency set. Install into a virtualenv on your development machine (x86, any OS):
```bash
# PyTorch Mobile pipeline
pip install torch torchvision

# TFLite pipeline
pip install onnx2tf tensorflow onnx onnxsim torch torchvision

# ONNX Runtime pipeline
pip install onnx onnxruntime torch torchvision
```

Install on the Raspberry Pi or inside the ARM64 chroot:
```bash
pip install "numpy<2" psutil tflite-runtime onnxruntime
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
```

Always install `numpy<2` first — it is a build-time dependency for the TFLite wheel.
```gradle
implementation 'org.tensorflow:tensorflow-lite:2.17.0'
implementation 'org.pytorch:pytorch_android_lite:2.1.0'
implementation 'com.microsoft.onnxruntime:onnxruntime-android:1.24.3'
```

| Error | Cause | Fix |
|---|---|---|
| `_ARRAY_API not found` | NumPy ≥ 2.0 incompatible with the TFLite wheel | `pip install "numpy<2"` |
| `failed to parse /proc/cpuinfo` | QEMU CPU info not ARM-compatible | Compile `fake_cpuinfo.so`, launch with `LD_PRELOAD` |
| `Can't open MIDR_EL1 sysfs entry` | QEMU missing ARM CPU register files | Launch with `LD_PRELOAD=./fake_cpuinfo.so python benchmark.py` |
| ONNX subprocess crash (exit 6) | ORT reads `/proc/cpuinfo` via the C library | Compile `fake_cpuinfo.so` — `benchmark.py` auto-discovers it |
| `mount point does not exist` | `/proc` not mounted in chroot | `sudo mount -t proc proc /opt/pi-chroot/proc` |
| `Illegal instruction` on PyTorch | Running a 32-bit OS | Use Raspberry Pi OS 64-bit |
| `.ptl` fails on desktop x86 | Mobile Lite format is ARM-only | Use `.pt` for desktop or emulator testing |
| Out of memory on 2 GB Pi | PyTorch peak RAM | Increase swap (see `aarch64/README.md` §1.12) |
| Warmup takes too long in QEMU | Slow emulation × 50 warm-up runs | Adaptive cap handles this; or pass `--max-warmup-seconds 10` |
This benchmark suite accompanies a research article comparing edge inference runtimes. Sprint 1 covers Android (x86 emulator) with TFLite and PyTorch Mobile. Sprint 2 expands the study to include ONNX Runtime and measurements on physical ARM devices (Raspberry Pi 4).
Metrics collected: inference latency (mean, std, P95), throughput (FPS), model binary size (MB), and qualitative notes on conversion effort and tooling maturity.
Edge AI Framework Comparison — MobileNet-V2 Benchmark Suite