# InferBench

InferBench is a CPU-based offline LLM benchmarking and observability platform designed to evaluate local model inference performance under hardware constraints. It enables structured benchmarking of Ollama-hosted models while exposing production-style Prometheus metrics for monitoring and analysis.
## Features

- Benchmark multiple Ollama models
- Cold vs warm start latency measurement
- p50 / p95 latency metrics
- Tokens-per-second throughput tracking
- Memory usage measurement
- FastAPI-based structured benchmarking API
- Prometheus metrics integration
- Docker-ready deployment
- Fully offline (no GPU required)
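The p50 / p95 numbers above are simple order statistics over per-run wall-clock latencies. A minimal sketch of that computation using only the standard library (the function name is illustrative, not InferBench's actual API):

```python
import statistics

def latency_percentiles(latencies_s: list[float]) -> dict[str, float]:
    """Summarize per-run latencies (in seconds) into p50 / p95.

    statistics.quantiles with n=100 yields the 1st..99th percentile
    cut points, so index 49 is p50 and index 94 is p95.
    """
    q = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94]}

# Example: ten simulated run latencies in seconds,
# including one slow cold-start outlier
runs = [1.2, 1.3, 1.1, 1.4, 1.2, 1.3, 5.0, 1.2, 1.3, 1.1]
print(latency_percentiles(runs))
```

The cold-start outlier inflates p95 far more than p50, which is exactly why both are reported.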
## Architecture

```
Ollama (Local LLMs)
        ↑
InferBench (FastAPI + Metrics)
        ↑
Prometheus (Monitoring)
```
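In this stack, InferBench times each call it makes to Ollama's `/api/generate` endpoint. Ollama's non-streaming responses include `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating), from which tokens-per-second falls out directly. A small sketch (the sample response values here are made up):

```python
def tokens_per_second(ollama_response: dict) -> float:
    """Derive generation throughput from an Ollama /api/generate response.

    Ollama reports eval_count (generated tokens) and eval_duration
    (nanoseconds) in its non-streaming JSON responses.
    """
    tokens = ollama_response["eval_count"]
    seconds = ollama_response["eval_duration"] / 1e9
    return tokens / seconds

# Illustrative values in the shape Ollama returns: 120 tokens in 8 s
sample = {"eval_count": 120, "eval_duration": 8_000_000_000}
print(tokens_per_second(sample))  # → 15.0
```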
## Prerequisites

- Python 3.10+
- Ollama installed
- Docker (optional, for Prometheus)
## Quick Start

Start Ollama:

```bash
ollama serve
```

Pull models:

```bash
ollama pull phi3:mini
ollama pull gemma3:4b
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Run the API:

```bash
uvicorn app.main:app --reload
```

Open Swagger UI: http://localhost:8000/docs
## API

`POST /benchmark`

Example request:

```json
{
  "model": "phi3:mini",
  "prompt": "Explain transformers in simple terms.",
  "runs": 2
}
```

## Metrics

Prometheus metrics are exposed at http://localhost:8000/metrics:

- `inferbench_requests_total`
- `inferbench_duration_seconds` (histogram)
- Python GC metrics
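The two custom metrics above can be emitted with the official `prometheus_client` library. A minimal sketch of how such series might be wired (the FastAPI endpoint integration is omitted, and the benchmark body is a stand-in):

```python
import time

from prometheus_client import Counter, Histogram, generate_latest

REQUESTS = Counter("inferbench_requests_total", "Benchmark requests served")
DURATION = Histogram("inferbench_duration_seconds", "Benchmark request duration")

def run_benchmark():
    REQUESTS.inc()
    with DURATION.time():  # observes elapsed seconds into the histogram
        time.sleep(0.01)   # stand-in for the actual Ollama inference call

run_benchmark()
# Render everything in Prometheus text exposition format,
# exactly what the /metrics endpoint would serve
print(generate_latest().decode())
```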
## Docker

Build the image:

```bash
docker build -t inferbench .
```

Run the container:

```bash
docker run -p 8000:8000 inferbench
```

If Ollama runs on the host while InferBench runs inside Docker, point InferBench at the host gateway:

```bash
docker run -p 8000:8000 \
  -e OLLAMA_URL=http://host.docker.internal:11434/api/generate \
  inferbench
```

## Prometheus

Create `prometheus.yml`:
```yaml
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: "inferbench"
    static_configs:
      - targets: ["host.docker.internal:8000"]
```

Run Prometheus:

```bash
docker run -p 9090:9090 \
  -v C:/path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
```

Open http://localhost:9090.
Example PromQL queries:

```promql
inferbench_requests_total
inferbench_duration_seconds_sum / inferbench_duration_seconds_count
```

The second query gives the average benchmark request duration in seconds.
## Use Cases

Compare CPU performance tradeoffs between models:
- Small models → lower latency, higher throughput
- Larger models → higher latency, better output quality
- Cold start vs warm start behavior
- Offline inference feasibility on consumer hardware
## What This Project Demonstrates

This project demonstrates:
- AI infrastructure engineering
- Offline LLM deployment design
- Model performance benchmarking
- Production-style observability
- Monitoring-ready microservice architecture
- Cost-aware AI system design without GPU dependency
## Example Environment
- AMD Ryzen 7 5700U
- 16GB RAM
- CPU-only inference
- Ollama v0.17+
**Status:** stable, Prometheus-integrated, Docker-compatible, and ready for production-style observability.