This workspace now includes a tested planner that picks an inference engine from:
- the local hardware profile
- the rules captured in `cheatsheet.md`
- the selected model shape and workload
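To make the selection concrete, here is a toy sketch of such a rule table. The names and branches are illustrative assumptions, not the planner's actual code; the two rules simply mirror the example resolutions shown below.

```python
from dataclasses import dataclass

@dataclass
class Request:
    quantization: str   # e.g. "exl2", "awq"
    workload: str       # "interactive" or "api"
    concurrency: int = 1

def pick_engine(req: Request) -> str:
    """Toy version of a rule-based engine picker (illustrative only)."""
    # EXL2 checkpoints are exllamav2's native format; a good fit for
    # single-user interactive chat on one consumer GPU.
    if req.quantization == "exl2" and req.workload == "interactive":
        return "exllamav2"
    # AWQ plus many concurrent API requests favors vLLM's continuous
    # batching and prefix caching.
    if req.quantization == "awq" and req.workload == "api" and req.concurrency > 1:
        return "vllm"
    return "llama.cpp"  # conservative fallback (assumption, not from the repo)

print(pick_engine(Request("exl2", "interactive")))  # exllamav2
print(pick_engine(Request("awq", "api", 12)))       # vllm
```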
```sh
uv sync --dev
```

That creates `.venv` and installs the CLI entrypoint:

```sh
uv run llm-inference-plan --help
```

An interactive single-user plan:

```sh
uv run llm-inference-plan \
  --model-name Qwen3-14B-Instruct-EXL2 \
  --family qwen \
  --params-b 14 \
  --quantization exl2 \
  --context-len 8192 \
  --workload interactive
```

On the current RTX 5070 Ti 16 GiB + 7950X3D box, that resolves to `exllamav2`.
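A back-of-envelope check on why a 14B EXL2 model is plausible on 16 GiB: weight memory is roughly params times bits per weight over 8. The ~4.25 bpw figure below is an assumed common EXL2 setting, and KV cache and activations come on top of it.

```python
def weight_gib(params_b: float, bits_per_weight: float) -> float:
    """Rough weight-only VRAM estimate: params * bits / 8, in GiB."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

# 14B parameters at ~4.25 bpw: weights alone are well under 16 GiB,
# leaving headroom for the KV cache at 8192 context.
print(round(weight_gib(14, 4.25), 1))  # 6.9
```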
An API-serving plan with concurrency:

```sh
uv run llm-inference-plan \
  --model-name Llama-3.1-8B-Instruct-AWQ \
  --family llama \
  --params-b 8 \
  --quantization awq \
  --context-len 8192 \
  --concurrency 12 \
  --workload api
```

That resolves to `vllm` with prefix caching enabled.
The same plan can be emitted as JSON for scripting:

```sh
uv run llm-inference-plan \
  --format json \
  --model-name Qwen3-14B-Instruct-EXL2 \
  --family qwen \
  --params-b 14 \
  --quantization exl2 \
  --context-len 8192 \
  --workload interactive
```
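As a sketch of consuming that output, assuming the payload carries an `engine` field (a hypothetical field name; inspect the tool's real JSON before relying on it), a downstream script could read it like this:

```python
import json

# Hypothetical payload shape -- the field names here are assumptions,
# not the tool's documented schema. In practice this string would come
# from the captured stdout of `uv run llm-inference-plan --format json ...`.
sample = '{"engine": "exllamav2", "workload": "interactive"}'

plan = json.loads(sample)
print(plan["engine"])  # exllamav2
```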