# InferBench

InferBench is a CPU-based offline LLM benchmarking and observability platform designed to evaluate local model inference performance under hardware constraints. It enables structured benchmarking of Ollama-hosted models while exposing production-style Prometheus metrics for monitoring and analysis.
## Features

- Benchmark multiple Ollama models
- Cold vs warm start latency measurement
- p50 / p95 latency metrics
- Tokens-per-second throughput tracking
- Memory usage measurement
- FastAPI-based structured benchmarking API
- Prometheus metrics integration
- Docker-ready deployment
- Fully offline (no GPU required)
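The p50 / p95 numbers above are simple order statistics over per-run wall-clock latencies. A minimal sketch of that computation using only the standard library (the function name is illustrative, not InferBench's actual API):

```python
import statistics

def latency_percentiles(latencies_s: list[float]) -> dict[str, float]:
    """Summarize per-run latencies (in seconds) into p50 / p95.

    statistics.quantiles with n=100 yields the 1st..99th percentile
    cut points, so index 49 is p50 and index 94 is p95.
    """
    q = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94]}

# Example: ten simulated run latencies in seconds,
# including one slow cold-start outlier
runs = [1.2, 1.3, 1.1, 1.4, 1.2, 1.3, 5.0, 1.2, 1.3, 1.1]
print(latency_percentiles(runs))
```

The cold-start outlier inflates p95 far more than p50, which is exactly why both are reported.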
## Architecture

```
Ollama (Local LLMs)
        ↑
InferBench (FastAPI + Metrics)
        ↑
Prometheus (Monitoring)
```
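In this stack, InferBench times each call it makes to Ollama's `/api/generate` endpoint. Ollama's non-streaming responses include `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating), from which tokens-per-second falls out directly. A small sketch (the sample response values here are made up):

```python
def tokens_per_second(ollama_response: dict) -> float:
    """Derive generation throughput from an Ollama /api/generate response.

    Ollama reports eval_count (generated tokens) and eval_duration
    (nanoseconds) in its non-streaming JSON responses.
    """
    tokens = ollama_response["eval_count"]
    seconds = ollama_response["eval_duration"] / 1e9
    return tokens / seconds

# Illustrative values in the shape Ollama returns: 120 tokens in 8 s
sample = {"eval_count": 120, "eval_duration": 8_000_000_000}
print(tokens_per_second(sample))  # → 15.0
```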
## Prerequisites

- Python 3.10+
- Ollama installed
- Docker (optional, for Prometheus)
## Quick Start

Start Ollama:

```bash
ollama serve
```

Pull models:

```bash
ollama pull phi3:mini
ollama pull gemma3:4b
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Run the API:

```bash
uvicorn app.main:app --reload
```

Open Swagger UI: http://localhost:8000/docs
## API

`POST /benchmark`

Example request:

```json
{
  "model": "phi3:mini",
  "prompt": "Explain transformers in simple terms.",
  "runs": 2
}
```

## Metrics

Prometheus metrics are exposed at http://localhost:8000/metrics:

- `inferbench_requests_total`
- `inferbench_duration_seconds` (histogram)
- Python GC metrics
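The two custom metrics above can be emitted with the official `prometheus_client` library. A minimal sketch of how such series might be wired (the FastAPI endpoint integration is omitted, and the benchmark body is a stand-in):

```python
import time

from prometheus_client import Counter, Histogram, generate_latest

REQUESTS = Counter("inferbench_requests_total", "Benchmark requests served")
DURATION = Histogram("inferbench_duration_seconds", "Benchmark request duration")

def run_benchmark():
    REQUESTS.inc()
    with DURATION.time():  # observes elapsed seconds into the histogram
        time.sleep(0.01)   # stand-in for the actual Ollama inference call

run_benchmark()
# Render everything in Prometheus text exposition format,
# exactly what the /metrics endpoint would serve
print(generate_latest().decode())
```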
## Docker

Build the image:

```bash
docker build -t inferbench .
```

Run the container:

```bash
docker run -p 8000:8000 inferbench
```

If Ollama runs on the host while InferBench runs inside Docker, point InferBench at the host gateway:

```bash
docker run -p 8000:8000 \
  -e OLLAMA_URL=http://host.docker.internal:11434/api/generate \
  inferbench
```

## Prometheus

Create `prometheus.yml`:
```yaml
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: "inferbench"
    static_configs:
      - targets: ["host.docker.internal:8000"]
```

Run Prometheus:

```bash
docker run -p 9090:9090 \
  -v C:/path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
```

Open http://localhost:9090.
Example PromQL queries:

```promql
inferbench_requests_total
inferbench_duration_seconds_sum / inferbench_duration_seconds_count
```

The second query gives the average benchmark request duration in seconds.
## Use Cases

Compare CPU performance tradeoffs between models:
- Small models → lower latency, higher throughput
- Larger models → higher latency, better output quality
- Cold start vs warm start behavior
- Offline inference feasibility on consumer hardware
## What This Project Demonstrates

This project demonstrates:
- AI infrastructure engineering
- Offline LLM deployment design
- Model performance benchmarking
- Production-style observability
- Monitoring-ready microservice architecture
- Cost-aware AI system design without GPU dependency
## Example Environment
- AMD Ryzen 7 5700U
- 16GB RAM
- CPU-only inference
- Ollama v0.17+
**Status:** stable, Prometheus-integrated, Docker-compatible, and ready for production-style observability.