# InferBench AI

A CPU-based offline LLM benchmarking and observability platform designed to evaluate local model inference performance under hardware constraints.

InferBench enables structured benchmarking of Ollama-hosted models while exposing production-style Prometheus metrics for monitoring and analysis.


## 🚀 Features

- Benchmark multiple Ollama models
- Cold vs. warm start latency measurement
- p50 / p95 latency metrics
- Tokens-per-second throughput tracking
- Memory usage measurement
- FastAPI-based structured benchmarking API
- Prometheus metrics integration
- Docker-ready deployment
- Fully offline, CPU-only (no GPU required)
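
As a rough illustration of how p50/p95 summaries can be derived from repeated timed runs — this is a sketch, not the project's actual implementation; `measure_run` is a hypothetical stand-in for a timed Ollama call:

```python
import statistics
import time

def measure_run(model_call) -> float:
    """Time a single inference call and return its latency in seconds."""
    start = time.perf_counter()
    model_call()
    return time.perf_counter() - start

def summarize(latencies: list[float]) -> dict:
    """Compute p50/p95 latency percentiles from collected samples."""
    # quantiles(..., n=100) yields the 1st through 99th percentiles.
    pct = statistics.quantiles(latencies, n=100)
    return {"p50": pct[49], "p95": pct[94]}
```

`statistics.quantiles` needs at least two samples, which is why benchmark runs are repeated (`runs` in the API below).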

## 🏗 Architecture

```text
Ollama (Local LLMs)
        ↑
InferBench (FastAPI + Metrics)
        ↑
Prometheus (Monitoring)
```

## 📦 Requirements

- Python 3.10+
- Ollama installed
- Docker (optional, for Prometheus)

## ⚙️ Setup (Local Development)

### 1️⃣ Start Ollama

```shell
ollama serve
```

Pull the models you want to benchmark:

```shell
ollama pull phi3:mini
ollama pull gemma3:4b
```

### 2️⃣ Install Dependencies

```shell
pip install -r requirements.txt
```

### 3️⃣ Run the API

```shell
uvicorn app.main:app --reload
```

Open the Swagger UI at http://localhost:8000/docs

## 📊 Benchmark Endpoint

`POST /benchmark`

Example request body:

```json
{
  "model": "phi3:mini",
  "prompt": "Explain transformers in simple terms.",
  "runs": 2
}
```
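
A minimal client sketch for calling the endpoint, using only the standard library (this assumes the API is running locally on the default port):

```python
import json
import urllib.request

def run_benchmark(model: str, prompt: str, runs: int = 2,
                  url: str = "http://localhost:8000/benchmark") -> dict:
    """POST a benchmark request and return the parsed JSON response."""
    body = json.dumps({"model": model, "prompt": prompt, "runs": runs}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires the API and Ollama to be running):
# result = run_benchmark("phi3:mini", "Explain transformers in simple terms.")
```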

## 📈 Metrics Endpoint

`GET http://localhost:8000/metrics`

Exposed Prometheus metrics:

- `inferbench_requests_total` (counter)
- `inferbench_duration_seconds` (histogram)
- Default Python process and GC metrics

## 🐳 Docker

Build the image:

```shell
docker build -t inferbench .
```

Run the container:

```shell
docker run -p 8000:8000 inferbench
```

If Ollama runs on the host while InferBench runs inside Docker, point InferBench at the host:

```shell
docker run -p 8000:8000 \
  -e OLLAMA_URL=http://host.docker.internal:11434/api/generate \
  inferbench
```

## 📊 Prometheus Integration

Create `prometheus.yml`:

```yaml
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: "inferbench"
    static_configs:
      - targets: ["host.docker.internal:8000"]
```

Run Prometheus:

```shell
docker run -p 9090:9090 \
  -v C:/path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
```

Open http://localhost:9090

Example PromQL queries (the second computes the mean request duration):

```promql
inferbench_requests_total
inferbench_duration_seconds_sum / inferbench_duration_seconds_count
```
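
The sum/count query works because a Prometheus histogram keeps a running sum of all observed values plus a sample count; dividing one by the other gives the mean. The same arithmetic, sketched in Python with hypothetical durations:

```python
# Simulated histogram state as Prometheus tracks it:
# each observation adds to _sum and increments _count.
durations = [0.5, 1.5, 1.0]  # hypothetical request durations in seconds

duration_sum = sum(durations)    # inferbench_duration_seconds_sum
duration_count = len(durations)  # inferbench_duration_seconds_count

mean_duration = duration_sum / duration_count
print(mean_duration)  # 1.0
```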

## 🎯 Example Use Case

Compare CPU performance tradeoffs between models:

- Small models → lower latency, higher throughput
- Larger models → higher latency, better output quality
- Cold start vs. warm start behavior
- Offline inference feasibility on consumer hardware
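
The cold/warm distinction can be illustrated with a toy stub in which the first call pays a simulated model-load cost — this is purely illustrative, not the project's benchmark code:

```python
import time

class StubModel:
    """Stands in for an Ollama model: the first call pays a load cost."""
    def __init__(self) -> None:
        self.loaded = False

    def generate(self) -> None:
        if not self.loaded:
            time.sleep(0.05)  # simulated one-time model load (cold start)
            self.loaded = True
        # warm inference itself is near-instant in this stub

def time_call(fn) -> float:
    """Return the wall-clock duration of a single call in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

model = StubModel()
cold = time_call(model.generate)  # includes the simulated load
warm = time_call(model.generate)  # model already resident
print(cold > warm)  # True
```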

## 💡 Why This Project Matters

This project demonstrates:

- AI infrastructure engineering
- Offline LLM deployment design
- Model performance benchmarking
- Production-style observability
- Monitoring-ready microservice architecture
- Cost-aware AI system design without GPU dependency

## 🧠 Hardware Tested On

Example environment:

- AMD Ryzen 7 5700U
- 16 GB RAM
- CPU-only inference
- Ollama v0.17+

## 🏁 Status

Stable. Prometheus-integrated, Docker-compatible, and ready for production-style observability.

