LLM harness for people who want trust, consistency, and proof.
LLMs hallucinate, forget, and sound confident either way. llama-conductor forces them to bring receipts: grounded answers, deterministic memory, and explicit refusal when evidence is missing.
TL;DR: "In God we trust. All others bring data." - Deming
- 🚀 5-Minute Quickstart - Get this running now, not next week.
- 🤨 Why This Exists - The "why" behind this whole thing.
- 📖 What's New - Latest fixes, updates, and "what I borked along the way".
- ⌨️ Command Cheat Sheet - All the incantations.
- ❓ FAQ - Everything you wanted to know on how it works...and then some. No secrets. Glass-box, not black box.
- 🛠️ Setup + Config Details - Full wiring and troubleshooting.
- 📊 Prepub Paper - "Bullshit. Prove it". Ok, here then.
- 🚗 Meme Test - Mom, can we get ChatGPT? Mom: we have ChatGPT at home.
- 🧠 Blog & Updates - Blog posts, roadmaps, and Deep Thoughts (tm). Also, me swearing at Python code.
Pick one path and go.
- Install llama-conductor:
  `pip install git+https://codeberg.org/BobbyLLM/llama-conductor.git`
- Start the stack (pick one):
  - Bare metal: `python -m llama_conductor.launch_stack up --config llama_conductor/router_config.yaml`
  - Docker Compose: `docker compose up -d`
- Open the UI:
http://127.0.0.1:8088
- Optional (but nice): Firefox bridge
- Load from `extras/firefox-extension/` (`moa chat bridge.xpi`)
That’s it. You’re live.
Want the full wiring (models, config, Qdrant, troubleshooting)?
Jump to Quickstart (First-Time, Recommended).
WITHOUT llama-conductor:
400-message chat history
-> VRAM spikes, tok/s craters, model forgets its own name, OOM
-> you restart the chat and lose everything
-> rinse, repeat, contemplate career change
WITH llama-conductor:
Vodka CTC trims context automatically:
- keeps the recent turns that matter
- hard-caps prompt growth so VRAM stays predictable
- drops mid-chat bloat before it kills your GPU
- keeps memory available through deterministic recall (not stuffed into prompt)
- user-definable presets (fast / balanced / max-recall)
Result:
- Consistent prompt size, stable performance, and optional rolling deterministic summary (stdlib extractive, no LLM compute).
- Tok/s you started with is tok/s you keep. No "why is it slow now" twenty minutes in.
- Bonus: Tweak your `--ctx` and maybe, just maybe, your Raspberry Pi can run that 4B model without chugging. Your electricity bill called. It says you're welcome.
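The trimming policy is boring on purpose. A minimal sketch of the idea (not the actual Vodka CTC code; the function name, budgets, and character-based accounting are invented for illustration):

```python
# Minimal sketch of deterministic context trimming -- NOT the real Vodka CTC.
# Budgets and names here are invented for illustration.
def trim_context(turns: list[str], max_chars: int = 8000, keep_recent: int = 6) -> list[str]:
    """Always keep the newest turns, then fill backwards until the budget is hit."""
    kept: list[str] = []
    budget = max_chars
    # Walk newest -> oldest so recent turns always survive the cut.
    for i, turn in enumerate(reversed(turns)):
        if i < keep_recent or len(turn) <= budget:
            kept.append(turn)
            budget -= len(turn)
        else:
            break  # hard cap: older turns are dropped, not stuffed into the prompt
    return list(reversed(kept))

history = [f"turn {n}: " + "x" * 500 for n in range(400)]
trimmed = trim_context(history)
print(len(history), "->", len(trimmed), "turns")  # prompt size stays bounded
```

Same shape regardless of chat length: the prompt stops growing, so VRAM and tok/s stop drifting.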
WITHOUT llama-conductor:
You: Remember my server is 203.0.113.42
[later]
You: What's my server IP?
Model: 127.0.0.1 :P
WITH llama-conductor:
You: !! my server is 203.0.113.42
[later]
You: ?? server ip
Router: 203.0.113.42 [TTL=7 days, Touch=1]
Result:
- The LLM remembers EXACTLY what you told it, how you told it, and then recalls it EXACTLY.
- Facts have a limited Time To Live (TTL) and can be Touched (to extend life) or Flushed. TL;DR: no silent bloat.
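Under the hood this is a verbatim store with an expiry clock, nothing fancier. A minimal sketch of the concept, with invented names (the real `!!`/`??`/touch/flush plumbing lives in the router):

```python
# Sketch of a TTL'd fact store -- not the router's real storage layer.
# The point: facts go in verbatim and come out verbatim, with an expiry
# clock you can "touch" to extend. All names here are illustrative.
import time

TTL_SECONDS = 7 * 24 * 3600  # 7 days, matching the footer in the example

facts: dict[str, dict] = {}

def remember(key: str, value: str) -> None:
    facts[key] = {"value": value, "expires": time.time() + TTL_SECONDS, "touches": 0}

def recall(key: str) -> str | None:
    entry = facts.get(key)
    if entry is None or time.time() > entry["expires"]:
        return None  # expired or never stored: say so, don't guess
    entry["touches"] += 1
    entry["expires"] = time.time() + TTL_SECONDS  # touch extends life
    return entry["value"]

remember("server ip", "203.0.113.42")
print(recall("server ip"))  # -> 203.0.113.42, verbatim, every time
```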
WITHOUT llama-conductor:
You: Is paracetamol safe to take with ibuprofen?
Model: Yes, they can be safely combined as they work through different mechanisms.
You: Is that actually true or are you guessing?
Model: That is generally accepted medical guidance. (no source, no provenance, vibes all the way down)
WITH llama-conductor:
Confidence: <tier> | Source: <path>
Every response. Every time. Assigned by the router, not the model.
Result:
- The model doesn't grade its own homework. The router does.
- You know immediately whether to trust it, verify it, or lock it harder.
Sources:
- `Source: Model` → fallback to model weights | Confidence: unverified. Maybe right. Maybe not. Proceed accordingly.
- `Source: Docs` → grounded to attached docs/SUMM facts | Confidence: based on % facts extracted
- `Source: Scratchpad` → grounded to what you pasted | Confidence: based on % facts extracted
- `Source: Locked file (SUMM_*.md)` → grounded to locked source | Confidence: based on % facts extracted
- `Source: Vault` → `##mentats` path, grounded or refused
- `Source: Cheatsheets` → grounded to your definitions | Confidence: high (you told it, it parrots it back)
- `Source: Mixed` → touched multiple paths. Footer tells you which ones.
- `Source: Web` → grounded to live web retrieval | Confidence: based on relevance gate score. Comes with a `See: <url>` receipt.
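The footer itself is a fixed contract, not generated prose. A toy sketch of what "assigned by the router" means in practice (tier names are taken from the examples in this README; treating them as a closed enum is an assumption):

```python
# Sketch of the footer contract: the router, not the model, stamps every
# response. The tier list below is assumed from the examples in this README.
CONF_TIERS = ("top", "high", "medium", "unverified")

def footer(confidence: str, source: str) -> str:
    assert confidence in CONF_TIERS, "router assigns a known tier, never freeform"
    return f"Confidence: {confidence} | Source: {source}"

print(footer("unverified", "Model"))  # -> Confidence: unverified | Source: Model
```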
WITHOUT llama-conductor:
You: What does the article say the fine cap is?
Model: Under the AI Act, fines can reach €35 million or 7% of global turnover.
You: That's not in the article.
Model: You're right, the article specifies a different amount.
You: ...what amount?
Model: The article does not appear to state a specific figure.
You: FFS. It's *in* the second paragraph. Why do you exist?
WITH llama-conductor:
You: >>add [paste article text]
You: >>scratch lock <n>
You: What does the article say the fine cap is?
Router: [answers from article only, not training data]
Footer: Confidence: high | Source: Scratchpad
Result:
- Model argues from what you gave it, not from what it half-remembers from training.
- If the answer isn't in the source, it says so. Explicitly. Not "I may have drawn on broader context."
- You can `>>`-lock up to 12 items and force it to reason across all of them. Ask it to compare - it'll flag contradictions and gaps.
- Still loud when evidence is missing.
- When you're done? Throw it away with `>>detach all`. If you forget, it auto-cleans on next boot.
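The grounding gate is the whole trick: check the evidence first, refuse loudly if it's not there. A toy sketch of that contract (invented helper, not the router's actual code):

```python
# Sketch of the fail-loud grounding contract -- invented helper, not the
# router's code. Idea: answer only from locked text; if the needle isn't
# in the evidence, refuse explicitly instead of falling back to weights.
def answer_from_locked(question_terms: list[str], locked_text: str) -> str:
    hits = [t for t in question_terms if t.lower() in locked_text.lower()]
    if not hits:
        return "Not in the locked source. Refusing to guess."
    # Real routing would extract the relevant passage; this just shows the gate.
    return f"Grounded: found {hits} in locked source."

article = "The fine cap is set out in the second paragraph: EUR 15 million."
print(answer_from_locked(["fine cap"], article))   # grounded
print(answer_from_locked(["turnover"], article))   # explicit refusal
```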
WITHOUT llama-conductor:
You: stop being so formal, just talk normally
Model: Of course! So basically the deal is... [casual, this turn only]
You: ok so what about X?
Model: Certainly! X is a multifaceted concept that encompasses— [formal again]
You: dude.
Model: My apologies! To put it more plainly... [casual, this turn only]
[repeat forever]
WITH llama-conductor:
You: >>fun
You: Can you legally fire sentient yeast?
Assistant: No. The Bread Labor Relations Act of 2024 does not exist. Sentient yeast is not a protected class. Feed it or bin it. That's it.
Confidence: top | Source: Model
You: >>fr
You: Can you legally fire sentient yeast?
[FUN REWRITE] "I don't want to live on this planet anymore."
Your sourdough started a union. Cool. Legally? Zero rights. No statute covers sentient yeast. Just... feed it and move on. Or don't. I'm not your union rep.
Confidence: high | Source: Model
Result:
- Set it once. It stays set. Sick of it? Turn it off. Same turn, no need for new chat.
- Same grounding contracts underneath — style changes delivery, not evidence.
- TARS sliders (`Profile | Sarc | Snark`) tune the edge. Style adapts to how you talk to it.
- Nb: for best results, pick an abliterated model that hasn't been lobotomised.
WITHOUT llama-conductor:
You: What's vibe coding?
Model: It's a development methodology focused on aesthetic consistency and psychological flow states in team environments. Popularised by Extreme Programming. (confident, authoritative, completely wrong)
You: WTAF? Dude, no.
Model: You're right, apologies — it's actually a minimalist approach to code style, inspired by jazz improvisation. (still wrong, still confident)
You: ...
WITH llama-conductor:
You: What's vibe coding?
Assistant: Contemporary term for AI-assisted coding where the developer describes intent in natural language and accepts AI-generated code without fully understanding it. Named for the workflow feel -- you vibe with the AI rather than write line by line. Distinct from pair programming: vibe coding often involves accepting output you can't fully verify.
Confidence: high | Source: Cheatsheets
Result:
- Known terms stop drifting to model priors.
- Local definitions are deterministic and editable in one file.
- You tell it once. It stays told. Priority: your definition → wiki → stochastic parrot.
- Broken or missing row? Fails loud.
- Term in context window → deterministic re-grounding. Term gone → normal context rules. Don't like that? Pull it again.
- Footer provenance makes source path explicit instead of making you guess.
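Conceptually, the cheatsheet is just a flat file of term-definition rows checked before anything else. A minimal sketch with an invented file format and helper names (the real priority chain is your definition → wiki → model, per above):

```python
# Sketch of deterministic cheatsheet lookup with fail-loud parsing.
# The "term = definition" row format and these names are invented.
def load_cheatsheet(path: str) -> dict[str, str]:
    sheet: dict[str, str] = {}
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip() or line.startswith("#"):
                continue
            term, sep, definition = line.partition("=")
            if not sep or not definition.strip():
                # Broken or missing row? Fails loud, exactly as promised.
                raise ValueError(f"broken cheatsheet row at line {lineno}: {line!r}")
            sheet[term.strip().lower()] = definition.strip()
    return sheet

def define(term: str, sheet: dict[str, str]) -> tuple[str, str]:
    if term.lower() in sheet:
        return sheet[term.lower()], "Cheatsheets"  # deterministic, editable, yours
    return "(fall through to wiki, then model)", "Model"

sheet = {"vibe coding": "AI-assisted coding where you accept generated code on vibes."}
print(define("Vibe Coding", sheet))  # -> (your definition, 'Cheatsheets')
```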
WITHOUT llama-conductor:
You: ##mentats What did the Kaltenborn study say about Grade III mobilisation?
Model: Kaltenborn recommends Grade III for acute inflammatory conditions. (fabricated, cites nothing)
You: That's the opposite of what it says.
Model: You raise a good point.
You: You useless, motherf...
WITH llama-conductor:
You: ##mentats What did the Kaltenborn study say about Grade III mobilisation?
Mentats: FINAL_ANSWER: No Vault evidence found for this. Sources: Vault | FACTS_USED: NONE [ZARDOZ HATH SPOKEN]
[after ingesting docs and moving to vault]
You: ##mentats What did the Kaltenborn study say about Grade III mobilisation?
Mentats: Grade III mobilisation indicated for stiff joints, contraindicated in acute inflammation. [grounded from ingested docs with provenance]
Result:
- No Vault evidence? Explicit refusal. Not "I think..." — hard stop.
- Grounded answers from YOUR ingested docs, not model weights cosplaying as a textbook.
- 3-pass sweep, different LLMs on each pass, strict recall policy. `mentats_debug.log` if you want to see the work.
- Update your vault, answers update with it. Single source of truth you control.
- RAG with attitude. No more guessing whether it made that citation up.
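If you want the shape of that pipeline in ten lines, here is a sketch of a 3-pass sweep with strict recall (the role names and `ask()` helper are invented; the real pipeline sits behind `##mentats`):

```python
# Sketch of a 3-pass sweep with strict recall -- not the real Mentats code.
# ask(role, prompt, evidence) is an invented stand-in for a per-role LLM call.
def mentats(question: str, vault_hits: list[str], ask) -> str:
    if not vault_hits:
        # No evidence -> hard stop, exactly like the transcript above.
        return "FINAL_ANSWER: No Vault evidence found for this. FACTS_USED: NONE"
    draft = ask("thinker", question, vault_hits)   # pass 1: draft from evidence only
    critique = ask("critic", draft, vault_hits)    # pass 2: a different model checks it
    return ask("thinker", critique, vault_hits)    # pass 3: revise against the critique

print(mentats("Grade III?", [], ask=lambda *a: ""))  # -> explicit refusal
```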
WITHOUT llama-conductor:
You: Which is better for my use case, X or Y?
Model: X.
You: You sure?
Model: Hmmm....actually Y.
You: Why?
Model: Because reasons.
WITH llama-conductor:
You: >>judge [criterion] : X, Y --verbose
[judge] ranking
1. X (score=2.00)
2. Y (score=0.00)
Judge confidence: high
audit_jsonl: total_recall/judge/judge_audit_<timestamp>.jsonl
Result: What you actually get (instead of cosplay certainty):
- Pairwise both directions (A/B then B/A), so position bias gets caught, not hidden.
- Confidence is based on agreement pattern, not how swaggery the prose sounds.
- Want to look at why? --verbose gives you a JSONL audit trail with provenance.
- Attach scratchpad and judge evaluates from evidence, not vibes. Locked evidence unusable? Fail-closed.
- Read that again slowly. IT. JUDGES. FROM. EVIDENCE. NOT. VIBES. Even a 1.7B could do this. Fast.
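The position-bias trick is simple to state: run the comparison both ways and only call it confident when both orderings agree. A toy sketch (the `compare()` callback is invented; the scores mirror the transcript above):

```python
# Sketch of bidirectional pairwise judging. compare(first, second) is an
# invented callback standing in for one judge call; it returns the winner.
def judge(a: str, b: str, compare) -> tuple[dict, str]:
    scores = {a: 0.0, b: 0.0}
    first = compare(a, b)    # A presented first
    second = compare(b, a)   # B presented first: same question, order flipped
    for winner in (first, second):
        scores[winner] += 1.0
    # Agreement across both orderings -> high confidence; a split verdict
    # means position bias (or a genuine tie) and gets flagged, not hidden.
    confidence = "high" if first == second else "low"
    return scores, confidence

scores, conf = judge("X", "Y", compare=lambda a, b: "X")  # toy: X always wins
print(scores, conf)  # -> {'X': 2.0, 'Y': 0.0} high
```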
WITHOUT llama-conductor:
You: Where does the quote "I yam what I yam" come from?
Model: The quote "I yam what I yam" is from the 1939 musical The Wizard of Oz.
It is sung by the character Glinda, the Good Witch of the South,
in the song "I'm a Little Teapot."
You: ...
You: I need a drink.
WITH llama-conductor:
You: Where does the quote "I yam what I yam" come from?
Assistant: I Yam What I Yam is the second Popeye theatrical cartoon short.
The source of the quote is the comic strip Thimble Theatre by E. C. Segar,
in which Popeye first appeared.
See: https://en.wikipedia.org/wiki/I_Yam_What_I_Yam
Confidence: medium | Source: Web
Result:
- When cheatsheets and wiki can't answer, the router now searches the web before letting the model guess.
- Retrieval cascade: `Cheatsheets → Wiki → Web → Model`. Each step fires only if the previous one missed.
- Deterministic relevance gate scores every result (phrase match + token overlap + domain trust). Garbage results get rejected, not served.
- `See: <url>` gives you the actual source link. One click to verify. Receipts, not pinky promises.
- Model is still last resort. If all retrieval fails, you get `Confidence: unverified | Source: Model` and you know exactly what you're dealing with.
- Want to search manually? `>>web <query>` works standalone for anything.
- Add your own trusted domains in config. BBC? Reuters? PubMed? Your call. Built-in defaults stay active either way.
- Python 3.10+ (recommended: 3.10-3.12 for first-time stability)
- llama.cpp (`llama-server`) + at least one GGUF model (for example Qwen3-4B; TWO OR MORE is better - see What is Mentats? for why)
- Frontend:
  - llama.cpp WebUI + shim (recommended; WebUI ships with llama.cpp)
  - or any OpenAI-compatible client (OWUI, LibreChat, etc.)
- Optional for full stack:
  - Qdrant (REQUIRED for Vault/`##mentats` and full stack; OPTIONAL for kick-the-tires mode) - https://github.com/qdrant/qdrant
`pip install git+https://codeberg.org/BobbyLLM/llama-conductor.git`
- Set `backend.provider: "llama_cpp"`
- Set:
  - `backend.llama_cpp.exe_path` = full filesystem path to your `llama-server` executable (where you installed llama.cpp)
  - `backend.llama_cpp.models_dir` = full filesystem path to your GGUF models folder
- Path examples:
  - Windows:
    - `backend.llama_cpp.exe_path: "C:/path/to/llama.cpp/llama-server.exe"` (edit to match your own)
    - `backend.llama_cpp.models_dir: "C:/path/to/LLMs"` (edit to match your own)
  - Linux/macOS:
    - `backend.llama_cpp.exe_path: "/path/to/llama.cpp/llama-server"` (edit to match your own)
    - `backend.llama_cpp.models_dir: "/path/to/models"` (edit to match your own)
- Set `roles.*` to model IDs from backend `/v1/models`.
  - Use different model IDs for `thinker`, `critic`, and `coder` for better results.
  - For minimal kick-the-tires mode only (not recommended), you can temporarily use the same model ID across roles.
- Vision/OCR setup (optional):
  - Set `roles.vision` to a real vision-language model ID (not a text-only model).
  - Add a matching `mmproj` for that same model in `llama_server.models_preset.models`.
  - If you skip this, text chat still works, but image/OCR will not be reliable.
- Tip: open backend `/v1/models` in a browser and copy `id` values exactly.
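If you'd rather script it than eyeball a browser tab, here is a stdlib-only sketch (the URL is an assumption -- point it at your own backend; the router itself answers on http://127.0.0.1:9000/v1/models per the smoke-test section below):

```python
# Minimal sketch (stdlib only): list model ids from an OpenAI-compatible
# /v1/models endpoint. The URL below is an assumption -- adjust to your backend.
import json
import urllib.request

URL = "http://127.0.0.1:9000/v1/models"

with urllib.request.urlopen(URL, timeout=5) as resp:
    payload = json.load(resp)

# OpenAI-compatible servers return {"data": [{"id": "...", ...}, ...]}
for model in payload.get("data", []):
    print(model["id"])  # copy these ids verbatim into roles.*
```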
More config detail:
Launch core stack:
`python -m llama_conductor.launch_stack up --config llama_conductor/router_config.yaml`
For full stack (Vault/`##mentats`), start Qdrant (pick one):
- Docker (Windows): `docker start qdrant >nul 2>&1 || docker run --name qdrant -p 6333:6333 -d qdrant/qdrant`
- Docker (Linux/macOS): `docker start qdrant >/dev/null 2>&1 || docker run --name qdrant -p 6333:6333 -d qdrant/qdrant`
- Bare-metal: start your local Qdrant service/binary (for example from `C:\Qdrant`)
Need more launch variants? See FAQ: Launch Script: The Easy Way.
The repo ships a stdlib-only stack supervisor at `tools/start_stack.py`. On first run, it writes a one-time local config for your machine paths.
`python tools/start_stack.py --setup`
Verify it resolved correctly before launching anything:
`python tools/start_stack.py --doctor`
Then launch as normal:
`python tools/start_stack.py`
Windows users: `START-ALL.bat` in the repo root calls this for you. Mac and Linux: `start-all.sh`.
- Leave overrides blank during `--setup` to inherit paths from `router_config.yaml`.
- Qdrant is optional. If absent or disabled, the core stack still launches - you just lose the `##mentats`/Vault paths.
If you prefer Docker Compose, this repo ships a ready baseline:
`docker compose up -d`
What it includes by default:
- `qdrant` (for Vault/mentats paths)
- `llama-conductor` router service
- optional `open-webui` profile (only if you enable it)
Useful commands:
# Start core stack
docker compose up -d
# Include optional Open WebUI
docker compose --profile webui up -d
# See logs
docker compose logs -f
# Stop stack
docker compose down
Config pointers:
- `docker-compose.yml` for service wiring
- `docker.env.example` for environment defaults
- `docker/router_config.docker.yaml` for container-friendly router config
- Enjoy :)
- Skip Qdrant and run llama.cpp + web-ui (with shim) only.
- Nb: core chat/routing works; Vault/`##mentats` does not.
- Router models: http://127.0.0.1:9000/v1/models
- Shim health: http://127.0.0.1:8088/shim/healthz
- Stores what you said, as you said it. No LLM smoothing.
- Recalls what was stored, deterministically
- Uses TTL/touch lifecycle so memory doesn't become junkyard mode
- Prevents context-window ballooning
- Keeps turn-time behavior stable on modest hardware
- Preserves usable memory without dragging full chat history every turn
- `>>lock` constrains normal answers to one SUMM file. LLM grounds facts to THAT source. If not there? Signals LOUDLY.
- `>>scratch` as above, but used for transient stuff you copy/paste (think: news article you want to mull over. See FAQ for example).
- End result: both make provenance behavior explicit when grounded vs fallback.
calc / >>find / >>list / >>flush / >>status / >>wiki / >>web
- Router executes deterministic pathways
- No creative writing layer in the middle
- If a sidecar can do it, it does it deterministically.
- Lower token spend, lower latency, less "creative accounting".
- Great for boring operational tasks where wrong answers are expensive.
- `wiki` pulls answers from Wikipedia (preset to first 400 words; acts as summary)
- `trust` (you ask a question, the router gives you options for data sources. You choose, not it)
- `web` pulls answers from the live web with deterministic relevance scoring (DuckDuckGo, or bring your own provider)
>>fun # sticky fun mode on
>>fun off
>>fr # sticky fun rewrite mode on
>>fr off
>>raw # pass-through raw mode
>>raw off
##fun <query> # one-turn fun selector
Result:
- Mode is explicit and controllable. You decide.
- Modes are sticky. They stay put until you cancel them.
- You can switch style fast without re-engineering prompts each turn.
- Grounding contracts still apply where they should.
Yep. Basically TARS sliders.
Result:
- Profile tone (sarcasm, snark, directness) adapts to how you want to be answered (see FAQ for details)
- Coupled with mode --> `serious` keeps the tightest leash.
- Coupled with modes --> `fun` and `fr` have more style range.
- This changes delivery style, not evidence contracts.
Footer is your "show your work" receipt line. It's deterministically graded, not LLM 'trust me bro' vibes.
Confidence: <tier> | Source: <path>
Result:
- cleaner output, same provenance signal
- faster trust decision: accept / verify / lock harder
- Queries Vault-backed knowledge only
- Refuses when evidence is missing
- Gives you grounded answers instead of "sounds right" fiction
- RAG with attitude: 3-pass sweep, different LLMs, strict recall policy, `mentats_debug.log`. No more guessing.
- Helps potato PCs by reducing memory pressure without turning the LLM into a goldfish.
- Bounded context (CTC and preset policy)
- Reasoning strictly lockable (>>lock, >>scratch). Greatly reduced hallucinations (see: PAPER.md)
- Deterministic memory path separate from model weights. You said it, it remembers it EXACTLY.
- File KB flow stays simple (folder-based ingest -> SUMM -> Vault)
- Guarded retrieval/reasoning contracts keep failure modes explicit
- Web retrieval with relevance gating - when local knowledge runs out, it searches before guessing. And shows you the source URL.
AGPL-3.0-or-later. See LICENSE.
