RunPod Serverless endpoint for serving IBM's granite-docling-258M vision-language model as an OpenAI-compatible API.
Replaces the always-on Vast.ai GPU (~$650/month) with an auto-scaling-to-zero serverless endpoint — you only pay when requests come in.
```
ks-backend worker
    │
    ├─ GET  /v1/models            ← health check
    └─ POST /v1/chat/completions  ← page image → docling markup
    │
    ▼
RunPod Serverless (OpenAI proxy)
https://api.runpod.ai/v2/{id}/openai/v1/...
    │
    ▼
runpod/worker-v1-vllm + ibm-granite/granite-docling-258M  (rev: untied)
```
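For illustration, a per-page request of the kind the backend sends might look like the sketch below. This is not the backend's actual code: the image file, prompt text, and token limit are placeholders, and the message shape follows the standard OpenAI vision format that vLLM accepts.

```bash
# Hypothetical single-page request; page-001.png and the prompt are placeholders.
IMG_B64=$(base64 -w0 page-001.png)   # GNU base64; on macOS use `base64 -i page-001.png`

curl -s "https://api.runpod.ai/v2/${RUNPOD_ENDPOINT_ID}/openai/v1/chat/completions" \
  -H "Authorization: Bearer ${RUNPOD_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "model": "granite-docling-258M",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Convert this page to docling."},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,${IMG_B64}"}}
    ]
  }],
  "max_tokens": 1024
}
EOF
```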
The Dockerfile extends RunPod's pre-built vLLM worker with configuration
for the granite-docling model. No custom serving code required — vLLM natively
supports the Idefics3 architecture this model uses.
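As a rough sketch (not the repo's exact Dockerfile), the whole image amounts to a base-image pin plus baked-in environment variables; the worker tag is an assumption you should replace with a current one from RunPod's registry:

```dockerfile
# Sketch only — pin a specific worker tag from RunPod's registry in place of <tag>.
FROM runpod/worker-v1-vllm:<tag>

# Bake the model configuration into the image so the RunPod console needs no env vars.
# Variable names match the environment-variable table at the end of this README.
ENV MODEL_NAME="ibm-granite/granite-docling-258M" \
    MODEL_REVISION="untied" \
    OPENAI_SERVED_MODEL_NAME_OVERRIDE="granite-docling-258M" \
    GPU_MEMORY_UTILIZATION="0.90" \
    MAX_MODEL_LEN="4096" \
    MAX_CONCURRENCY="2"
```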
```bash
git remote add origin git@github.com:<your-org>/docling-vlm.git
git push -u origin main
```

Go to RunPod Settings and authorize RunPod to access your GitHub repositories.
- Go to RunPod Console → Serverless → New Endpoint
- Select GitHub Repo and pick this repository (`docling-vlm`)
- RunPod will find the `Dockerfile` in the root and build it automatically
- Configure the endpoint:
| Setting | Value |
|---|---|
| GPU Type | Any with ≥ 4 GB VRAM (model is 258M params) |
| Active Workers | 0 (scale to zero) |
| Max Workers | 1 (increase later if needed) |
| Idle Timeout | 300 seconds |
| Execution Timeout | 600 seconds |
All model configuration (model name, revision, served name) is baked into the Dockerfile — no need to set env vars manually in the RunPod console.
Why the `untied` revision? The `main` branch of this model uses tied embeddings, which causes failures in vLLM. The `untied` revision fixes this.
Copy `.env.example` to `.env` and fill in your API key and endpoint ID:

```bash
cp .env.example .env
# edit .env with your RUNPOD_API_KEY and RUNPOD_ENDPOINT_ID
```

Run the test script:

```bash
./scripts/test_endpoint.sh
```

This will:
- Hit `GET /v1/models` and verify `granite-docling-258M` is listed
- Send a text-only chat completion request
- Measure hot-start latency with a follow-up request
The first request will be slow (cold start — GPU spinning up + model loading). Subsequent requests while the worker is warm should be fast.
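The same two checks can be run by hand with `curl`. A minimal sketch, assuming the `.env` values from above and that `jq` is installed:

```bash
source .env
BASE="https://api.runpod.ai/v2/${RUNPOD_ENDPOINT_ID}/openai/v1"

# Health check: the served model name should be listed
curl -s "$BASE/models" -H "Authorization: Bearer ${RUNPOD_API_KEY}" | jq -r '.data[].id'

# Text-only completion (the first call pays the cold-start cost)
curl -s "$BASE/chat/completions" \
  -H "Authorization: Bearer ${RUNPOD_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"model": "granite-docling-258M", "messages": [{"role": "user", "content": "Say OK."}], "max_tokens": 8}'
```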
Add to `ks-backend/.env.dev`:

```bash
VLM_ENDPOINT=https://api.runpod.ai/v2/<endpoint-id>/openai/v1/chat/completions
VLM_MODEL=granite-docling-258M
VLM_API_KEY=<your-runpod-api-key>
```
Then run the backend and upload a test PDF:
```bash
make dev-api
make dev-worker
```

Watch the worker logs for `vlm_model_available` (success) or `vlm_endpoint_unreachable` / `vlm_model_not_found` (failure).
Cold start vs health check timeout — The backend's health check
(GET /v1/models) has a 10-second timeout. RunPod cold starts can take
30–90 seconds. If the health check times out, the worker silently falls back
to the non-VLM pipeline. Workarounds:
- Pre-warm the endpoint before a batch job (send a manual curl request; see the sketch below)
- Increase the timeout in `ks-backend/src/worker/utils/docling.py:check_vlm_available()`
- Set Active Workers to `1` instead of `0` (always-on, but still cheaper than Vast.ai)
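A minimal pre-warm sketch, assuming the same endpoint ID and API key as above; any small request that forces a worker to start will do:

```bash
# One tiny completion spins up a worker ahead of the batch job,
# so the backend's 10-second health check finds it already warm.
curl -s "https://api.runpod.ai/v2/${RUNPOD_ENDPOINT_ID}/openai/v1/chat/completions" \
  -H "Authorization: Bearer ${RUNPOD_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"model": "granite-docling-258M", "messages": [{"role": "user", "content": "warmup"}], "max_tokens": 1}' \
  > /dev/null
```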
Model name mismatch — The name in /v1/models must match VLM_MODEL
(case-insensitive). The OPENAI_SERVED_MODEL_NAME_OVERRIDE env var ensures
vLLM reports granite-docling-258M instead of the full HuggingFace path.
See the Dockerfile for defaults. All can be overridden in the RunPod console.
| Variable | Default | Purpose |
|---|---|---|
| `MODEL_NAME` | `ibm-granite/granite-docling-258M` | HuggingFace model ID |
| `MODEL_REVISION` | `untied` | Required for vLLM compatibility |
| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | `granite-docling-258M` | Model name exposed via `/v1/models` |
| `GPU_MEMORY_UTILIZATION` | `0.90` | Fraction of GPU VRAM to use |
| `MAX_MODEL_LEN` | `4096` | Max context length in tokens |
| `MAX_CONCURRENCY` | `2` | Max concurrent requests per worker |