LLM harness for people who want trust, consistency, and proof.
LLMs hallucinate, forget, and sound confident either way. llama-conductor forces them to bring receipts: grounded answers, deterministic memory, and explicit refusal when evidence is missing.
TL;DR: "In God we trust. All others bring data." - Deming
- 🚀 5-Minute Quickstart - Get this running now, not next week.
- 🤨 Why This Exists - The "why" behind this whole thing.
- 📖 What's New - Latest fixes, updates, and "what I borked along the way".
- ⌨️ Command Cheat Sheet - All the incantations.
- ❓ FAQ - Everything you wanted to know on how it works...and then some. No secrets. Glass-box, not black box.
- 🛠️ Setup + Config Details - Full wiring and troubleshooting.
- 📊 Prepub Paper - "Bullshit. Prove it". Ok, here then.
- 🚗 Meme Test - Mom, can we get ChatGPT? Mom: we have ChatGPT at home.
- 🧠 Blog & Updates - Blog posts, roadmaps, and Deep Thoughts (tm). Also, me swearing at Python code.
Pick one path and go.
- Install llama-conductor:
  `pip install git+https://codeberg.org/BobbyLLM/llama-conductor.git`
- Start the stack (pick one):
  - Bare metal: `python -m llama_conductor.launch_stack up --config llama_conductor/router_config.yaml`
  - Docker Compose: `docker compose up -d`
- Open the UI:
http://127.0.0.1:8088
- Optional (but nice): Firefox bridge
- Load from `extras/firefox-extension/` (`moa chat bridge.xpi`)
That’s it. You’re live.
Want the full wiring (models, config, Qdrant, troubleshooting)?
Jump to Quickstart (First-Time, Recommended).
WITHOUT llama-conductor:
400-message chat history
-> VRAM spikes, tok/s craters, model forgets its own name, OOM
-> you restart the chat and lose everything
-> rinse, repeat, contemplate career change
WITH llama-conductor:
Vodka CTC trims context automatically:
- keeps the recent turns that matter
- hard-caps prompt growth so VRAM stays predictable
- drops mid-chat bloat before it kills your GPU
- keeps memory available through deterministic recall (not stuffed into prompt)
- user-definable presets (fast / balanced / max-recall)
Result:
- Consistent prompt size, stable performance, and optional rolling deterministic summary (stdlib extractive, no LLM compute).
- Tok/s you started with is tok/s you keep. No "why is it slow now" twenty minutes in.
- Bonus: Tweak your `--ctx` and maybe, just maybe, your Raspberry Pi can run that 4B model without chugging. Your electricity bill called. It says you're welcome.
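The trimming policy is boring on purpose. A minimal sketch of the idea (not the actual Vodka CTC code; the function name, budgets, and character-based accounting are invented for illustration):

```python
# Minimal sketch of deterministic context trimming -- NOT the real Vodka CTC.
# Budgets and names here are invented for illustration.
def trim_context(turns: list[str], max_chars: int = 8000, keep_recent: int = 6) -> list[str]:
    """Always keep the newest turns, then fill backwards until the budget is hit."""
    kept: list[str] = []
    budget = max_chars
    # Walk newest -> oldest so recent turns always survive the cut.
    for i, turn in enumerate(reversed(turns)):
        if i < keep_recent or len(turn) <= budget:
            kept.append(turn)
            budget -= len(turn)
        else:
            break  # hard cap: older turns are dropped, not stuffed into the prompt
    return list(reversed(kept))

history = [f"turn {n}: " + "x" * 500 for n in range(400)]
trimmed = trim_context(history)
print(len(history), "->", len(trimmed), "turns")  # prompt size stays bounded
```

Same shape regardless of chat length: the prompt stops growing, so VRAM and tok/s stop drifting.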
WITHOUT llama-conductor:
You: Remember my server is 203.0.113.42
[later]
You: What's my server IP?
Model: 127.0.0.1 :P
WITH llama-conductor:
You: !! my server is 203.0.113.42
[later]
You: ?? server ip
Router: 203.0.113.42 [TTL=7 days, Touch=1]
Result:
- The LLM remembers EXACTLY what you told it, how you told it, and then recalls it EXACTLY.
- Facts have a limited Time To Live (TTL) and can be Touched (to extend life) or Flushed. TL;DR: no silent bloat.
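Under the hood this is a verbatim store with an expiry clock, nothing fancier. A minimal sketch of the concept, with invented names (the real `!!`/`??`/touch/flush plumbing lives in the router):

```python
# Sketch of a TTL'd fact store -- not the router's real storage layer.
# The point: facts go in verbatim and come out verbatim, with an expiry
# clock you can "touch" to extend. All names here are illustrative.
import time

TTL_SECONDS = 7 * 24 * 3600  # 7 days, matching the footer in the example

facts: dict[str, dict] = {}

def remember(key: str, value: str) -> None:
    facts[key] = {"value": value, "expires": time.time() + TTL_SECONDS, "touches": 0}

def recall(key: str) -> str | None:
    entry = facts.get(key)
    if entry is None or time.time() > entry["expires"]:
        return None  # expired or never stored: say so, don't guess
    entry["touches"] += 1
    entry["expires"] = time.time() + TTL_SECONDS  # touch extends life
    return entry["value"]

remember("server ip", "203.0.113.42")
print(recall("server ip"))  # -> 203.0.113.42, verbatim, every time
```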
WITHOUT llama-conductor:
You: Is paracetamol safe to take with ibuprofen?
Model: Yes, they can be safely combined as they work through different mechanisms.
You: Is that actually true or are you guessing?
Model: That is generally accepted medical guidance. (no source, no provenance, vibes all the way down)
WITH llama-conductor:
Confidence: <tier> | Source: <path>
Every response. Every time. Assigned by the router, not the model.
Result:
- The model doesn't grade its own homework. The router does.
- You know immediately whether to trust it, verify it, or lock it harder.
Sources:
- `Source: Model` → fallback to model weights | Confidence: unverified. Maybe right. Maybe not. Proceed accordingly.
- `Source: Docs` → grounded to attached docs/SUMM facts | Confidence: based on % facts extracted
- `Source: Scratchpad` → grounded to what you pasted | Confidence: based on % facts extracted
- `Source: Locked file (SUMM_*.md)` → grounded to locked source | Confidence: based on % facts extracted
- `Source: Vault` → `##mentats` path, grounded or refused
- `Source: Cheatsheets` → grounded to your definitions | Confidence: high (you told it, it parrots it back)
- `Source: Mixed` → touched multiple paths. Footer tells you which ones.
- `Source: Web` → grounded to live web retrieval | Confidence: based on relevance gate score. Comes with a `See: <url>` receipt.
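The footer itself is a fixed contract, not generated prose. A toy sketch of what "assigned by the router" means in practice (tier names are taken from the examples in this README; treating them as a closed enum is an assumption):

```python
# Sketch of the footer contract: the router, not the model, stamps every
# response. The tier list below is assumed from the examples in this README.
CONF_TIERS = ("top", "high", "medium", "unverified")

def footer(confidence: str, source: str) -> str:
    assert confidence in CONF_TIERS, "router assigns a known tier, never freeform"
    return f"Confidence: {confidence} | Source: {source}"

print(footer("unverified", "Model"))  # -> Confidence: unverified | Source: Model
```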
WITHOUT llama-conductor:
You: What does the article say the fine cap is?
Model: Under the AI Act, fines can reach €35 million or 7% of global turnover.
You: That's not in the article.
Model: You're right, the article specifies a different amount.
You: ...what amount?
Model: The article does not appear to state a specific figure.
You: FFS. It's *in* the second paragraph. Why do you exist?
WITH llama-conductor:
You: >>add [paste article text]
You: >>scratch lock <n>
You: What does the article say the fine cap is?
Router: [answers from article only, not training data]
Footer: Confidence: high | Source: Scratchpad
Result:
- Model argues from what you gave it, not from what it half-remembers from training.
- If the answer isn't in the source, it says so. Explicitly. Not "I may have drawn on broader context."
- You can `>>`-lock up to 12 items and force it to reason across all of them. Ask it to compare - it'll flag contradictions and gaps.
- Still loud when evidence is missing.
- When you're done? Throw it away with `>>detach all`. If you forget, it auto-cleans on next boot.
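The grounding gate is the whole trick: check the evidence first, refuse loudly if it's not there. A toy sketch of that contract (invented helper, not the router's actual code):

```python
# Sketch of the fail-loud grounding contract -- invented helper, not the
# router's code. Idea: answer only from locked text; if the needle isn't
# in the evidence, refuse explicitly instead of falling back to weights.
def answer_from_locked(question_terms: list[str], locked_text: str) -> str:
    hits = [t for t in question_terms if t.lower() in locked_text.lower()]
    if not hits:
        return "Not in the locked source. Refusing to guess."
    # Real routing would extract the relevant passage; this just shows the gate.
    return f"Grounded: found {hits} in locked source."

article = "The fine cap is set out in the second paragraph: EUR 15 million."
print(answer_from_locked(["fine cap"], article))   # grounded
print(answer_from_locked(["turnover"], article))   # explicit refusal
```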
WITHOUT llama-conductor:
You: stop being so formal, just talk normally
Model: Of course! So basically the deal is... [casual, this turn only]
You: ok so what about X?
Model: Certainly! X is a multifaceted concept that encompasses— [formal again]
You: dude.
Model: My apologies! To put it more plainly... [casual, this turn only]
[repeat forever]
WITH llama-conductor:
You: >>fun
You: Can you legally fire sentient yeast?
Assistant: No. The Bread Labor Relations Act of 2024 does not exist. Sentient yeast is not a protected class. Feed it or bin it. That's it.
Confidence: top | Source: Model
You: >>fr
You: Can you legally fire sentient yeast?
[FUN REWRITE] "I don't want to live on this planet anymore."
Your sourdough started a union. Cool. Legally? Zero rights. No statute covers sentient yeast. Just... feed it and move on. Or don't. I'm not your union rep.
Confidence: high | Source: Model
Result:
- Set it once. It stays set. Sick of it? Turn it off. Same turn, no need for new chat.
- Same grounding contracts underneath — style changes delivery, not evidence.
- TARS sliders (`Profile | Sarc | Snark`) tune the edge. Style adapts to how you talk to it.
- Nb: for best results, pick an abliterated model that hasn't been lobotomised.
WITHOUT llama-conductor:
You: What's vibe coding?
Model: It's a development methodology focused on aesthetic consistency and psychological flow states in team environments. Popularised by Extreme Programming. (confident, authoritative, completely wrong)
You: WTAF? Dude, no.
Model: You're right, apologies — it's actually a minimalist approach to code style, inspired by jazz improvisation. (still wrong, still confident)
You: ...
WITH llama-conductor:
You: What's vibe coding?
Assistant: Contemporary term for AI-assisted coding where the developer describes intent in natural language and accepts AI-generated code without fully understanding it. Named for the workflow feel -- you vibe with the AI rather than write line by line. Distinct from pair programming: vibe coding often involves accepting output you can't fully verify.
Confidence: high | Source: Cheatsheets
Result:
- Known terms stop drifting to model priors.
- Local definitions are deterministic and editable in one file.
- You tell it once. It stays told. Priority: your definition → wiki → stochastic parrot.
- Broken or missing row? Fails loud.
- Term in context window → deterministic re-grounding. Term gone → normal context rules. Don't like that? Pull it again.
- Footer provenance makes source path explicit instead of making you guess.
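Conceptually, the cheatsheet is just a flat file of term-definition rows checked before anything else. A minimal sketch with an invented file format and helper names (the real priority chain is your definition → wiki → model, per above):

```python
# Sketch of deterministic cheatsheet lookup with fail-loud parsing.
# The "term = definition" row format and these names are invented.
def load_cheatsheet(path: str) -> dict[str, str]:
    sheet: dict[str, str] = {}
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip() or line.startswith("#"):
                continue
            term, sep, definition = line.partition("=")
            if not sep or not definition.strip():
                # Broken or missing row? Fails loud, exactly as promised.
                raise ValueError(f"broken cheatsheet row at line {lineno}: {line!r}")
            sheet[term.strip().lower()] = definition.strip()
    return sheet

def define(term: str, sheet: dict[str, str]) -> tuple[str, str]:
    if term.lower() in sheet:
        return sheet[term.lower()], "Cheatsheets"  # deterministic, editable, yours
    return "(fall through to wiki, then model)", "Model"

sheet = {"vibe coding": "AI-assisted coding where you accept generated code on vibes."}
print(define("Vibe Coding", sheet))  # -> (your definition, 'Cheatsheets')
```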
WITHOUT llama-conductor:
You: ##mentats What did the Kaltenborn study say about Grade III mobilisation?
Model: Kaltenborn recommends Grade III for acute inflammatory conditions. (fabricated, cites nothing)
You: That's the opposite of what it says.
Model: You raise a good point.
You: You useless, motherf...
WITH llama-conductor:
You: ##mentats What did the Kaltenborn study say about Grade III mobilisation?
Mentats: FINAL_ANSWER: No Vault evidence found for this. Sources: Vault | FACTS_USED: NONE [ZARDOZ HATH SPOKEN]
[after ingesting docs and moving to vault]
You: ##mentats What did the Kaltenborn study say about Grade III mobilisation?
Mentats: Grade III mobilisation indicated for stiff joints, contraindicated in acute inflammation. [grounded from ingested docs with provenance]
Result:
- No Vault evidence? Explicit refusal. Not "I think..." — hard stop.
- Grounded answers from YOUR ingested docs, not model weights cosplaying as a textbook.
- 3-pass sweep, different LLMs on each pass, strict recall policy. `mentats_debug.log` if you want to see the work.
- Update your vault, answers update with it. Single source of truth you control.
- RAG with attitude. No more guessing whether it made that citation up.
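If you want the shape of that pipeline in ten lines, here is a sketch of a 3-pass sweep with strict recall (the role names and `ask()` helper are invented; the real pipeline sits behind `##mentats`):

```python
# Sketch of a 3-pass sweep with strict recall -- not the real Mentats code.
# ask(role, prompt, evidence) is an invented stand-in for a per-role LLM call.
def mentats(question: str, vault_hits: list[str], ask) -> str:
    if not vault_hits:
        # No evidence -> hard stop, exactly like the transcript above.
        return "FINAL_ANSWER: No Vault evidence found for this. FACTS_USED: NONE"
    draft = ask("thinker", question, vault_hits)   # pass 1: draft from evidence only
    critique = ask("critic", draft, vault_hits)    # pass 2: a different model checks it
    return ask("thinker", critique, vault_hits)    # pass 3: revise against the critique

print(mentats("Grade III?", [], ask=lambda *a: ""))  # -> explicit refusal
```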
WITHOUT llama-conductor:
You: Which is better for my use case, X or Y?
Model: X.
You: You sure?
Model: Hmmm....actually Y.
You: Why?
Model: Because reasons.
WITH llama-conductor:
You: >>judge [criterion] : X, Y --verbose
[judge] ranking
1. X (score=2.00)
2. Y (score=0.00)
Judge confidence: high
audit_jsonl: total_recall/judge/judge_audit_<timestamp>.jsonl
Result: What you actually get (instead of cosplay certainty):
- Pairwise both directions (A/B then B/A), so position bias gets caught, not hidden.
- Confidence is based on agreement pattern, not how swaggery the prose sounds.
- Want to look at why? --verbose gives you a JSONL audit trail with provenance.
- Attach scratchpad and judge evaluates from evidence, not vibes. Locked evidence unusable? Fail-closed.
- Read that again slowly. IT. JUDGES. FROM. EVIDENCE. NOT. VIBES. Even a 1.7B could do this. Fast.
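The position-bias trick is simple to state: run the comparison both ways and only call it confident when both orderings agree. A toy sketch (the `compare()` callback is invented; the scores mirror the transcript above):

```python
# Sketch of bidirectional pairwise judging. compare(first, second) is an
# invented callback standing in for one judge call; it returns the winner.
def judge(a: str, b: str, compare) -> tuple[dict, str]:
    scores = {a: 0.0, b: 0.0}
    first = compare(a, b)    # A presented first
    second = compare(b, a)   # B presented first: same question, order flipped
    for winner in (first, second):
        scores[winner] += 1.0
    # Agreement across both orderings -> high confidence; a split verdict
    # means position bias (or a genuine tie) and gets flagged, not hidden.
    confidence = "high" if first == second else "low"
    return scores, confidence

scores, conf = judge("X", "Y", compare=lambda a, b: "X")  # toy: X always wins
print(scores, conf)  # -> {'X': 2.0, 'Y': 0.0} high
```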
WITHOUT llama-conductor:
You: Where does the quote "I yam what I yam" come from?
Model: The quote "I yam what I yam" is from the 1939 musical The Wizard of Oz.
It is sung by the character Glinda, the Good Witch of the South,
in the song "I'm a Little Teapot."
You: ...
You: I need a drink.
WITH llama-conductor:
You: Where does the quote "I yam what I yam" come from?
Assistant: I Yam What I Yam is the second Popeye theatrical cartoon short.
The source of the quote is the comic strip Thimble Theatre by E. C. Segar,
in which Popeye first appeared.
See: https://en.wikipedia.org/wiki/I_Yam_What_I_Yam
Confidence: medium | Source: Web
Result:
- When cheatsheets and wiki can't answer, the router now searches the web before letting the model guess.
- Retrieval cascade: `Cheatsheets → Wiki → Web → Model`. Each step fires only if the previous one missed.
- Deterministic relevance gate scores every result (phrase match + token overlap + domain trust). Garbage results get rejected, not served.
- `See: <url>` gives you the actual source link. One click to verify. Receipts, not pinky promises.
- Model is still last resort. If all retrieval fails, you get `Confidence: unverified | Source: Model` and you know exactly what you're dealing with.
- Want to search manually? `>>web <query>` works standalone for anything.
- Add your own trusted domains in config. BBC? Reuters? PubMed? Your call. Built-in defaults stay active either way.
- Python 3.10+ (recommended: 3.10-3.12 for first-time stability)
- llama.cpp (`llama-server`) + at least one GGUF model (for example Qwen3-4B; TWO OR MORE is better - see What is Mentats? for why)
- Frontend:
  - llama.cpp WebUI + shim (recommended; WebUI ships with llama.cpp)
  - or any OpenAI-compatible client (OWUI, LibreChat, etc.)
- Optional for full stack:
  - Qdrant (REQUIRED for Vault/`##mentats` and full stack; OPTIONAL for kick-the-tires mode) - https://github.com/qdrant/qdrant
`pip install git+https://codeberg.org/BobbyLLM/llama-conductor.git`
- Set `backend.provider: "llama_cpp"`
- Set:
  - `backend.llama_cpp.exe_path` = full filesystem path to your `llama-server` executable (where you installed llama.cpp)
  - `backend.llama_cpp.models_dir` = full filesystem path to your GGUF models folder
- Path examples:
  - Windows:
    - `backend.llama_cpp.exe_path: "C:/path/to/llama.cpp/llama-server.exe"` (edit to match your own)
    - `backend.llama_cpp.models_dir: "C:/path/to/LLMs"` (edit to match your own)
  - Linux/macOS:
    - `backend.llama_cpp.exe_path: "/path/to/llama.cpp/llama-server"` (edit to match your own)
    - `backend.llama_cpp.models_dir: "/path/to/models"` (edit to match your own)
- Set `roles.*` to model IDs from backend `/v1/models`.
  - Use different model IDs for `thinker`, `critic`, and `coder` for better results.
  - For minimal kick-the-tires mode only (not recommended), you can temporarily use the same model ID across roles.
- Vision/OCR setup (optional):
  - Set `roles.vision` to a real vision-language model ID (not a text-only model).
  - Add a matching `mmproj` for that same model in `llama_server.models_preset.models`.
  - If you skip this, text chat still works, but image/OCR will not be reliable.
- Tip: open backend `/v1/models` in a browser and copy `id` values exactly.
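If you'd rather script it than eyeball a browser tab, here is a stdlib-only sketch (the URL is an assumption -- point it at your own backend; the router itself answers on http://127.0.0.1:9000/v1/models per the smoke-test section below):

```python
# Minimal sketch (stdlib only): list model ids from an OpenAI-compatible
# /v1/models endpoint. The URL below is an assumption -- adjust to your backend.
import json
import urllib.request

URL = "http://127.0.0.1:9000/v1/models"

with urllib.request.urlopen(URL, timeout=5) as resp:
    payload = json.load(resp)

# OpenAI-compatible servers return {"data": [{"id": "...", ...}, ...]}
for model in payload.get("data", []):
    print(model["id"])  # copy these ids verbatim into roles.*
```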
More config detail:
Launch core stack:
`python -m llama_conductor.launch_stack up --config llama_conductor/router_config.yaml`
For full stack (Vault/`##mentats`), start Qdrant (pick one):
- Docker (Windows): `docker start qdrant >nul 2>&1 || docker run --name qdrant -p 6333:6333 -d qdrant/qdrant`
- Docker (Linux/macOS): `docker start qdrant >/dev/null 2>&1 || docker run --name qdrant -p 6333:6333 -d qdrant/qdrant`
- Bare-metal: start your local Qdrant service/binary (for example from `C:\Qdrant`)
Need more launch variants? See FAQ: Launch Script: The Easy Way.
The repo ships a stdlib-only stack supervisor at `tools/start_stack.py`. On first run, it writes a one-time local config for your machine paths.
`python tools/start_stack.py --setup`
Verify it resolved correctly before launching anything:
`python tools/start_stack.py --doctor`
Then launch as normal:
`python tools/start_stack.py`
Windows users: `START-ALL.bat` in the repo root calls this for you. Mac and Linux: `start-all.sh`.
- Leave overrides blank during `--setup` to inherit paths from `router_config.yaml`.
- Qdrant is optional. If absent or disabled, the core stack still launches - you just lose the `##mentats`/Vault paths.
If you prefer Docker Compose, this repo ships a ready baseline:
`docker compose up -d`
What it includes by default:
- `qdrant` (for Vault/mentats paths)
- `llama-conductor` router service
- optional `open-webui` profile (only if you enable it)
Useful commands:
# Start core stack
docker compose up -d
# Include optional Open WebUI
docker compose --profile webui up -d
# See logs
docker compose logs -f
# Stop stack
docker compose down
Config pointers:
- `docker-compose.yml` for service wiring
- `docker.env.example` for environment defaults
- `docker/router_config.docker.yaml` for container-friendly router config
- Enjoy :)
- Skip Qdrant and run llama.cpp + web-ui (with shim) only.
- Nb: core chat/routing works; Vault/`##mentats` does not.
- Router models: http://127.0.0.1:9000/v1/models
- Shim health: http://127.0.0.1:8088/shim/healthz
- Stores what you said, as you said it. No LLM smoothing.
- Recalls what was stored, deterministically
- Uses TTL/touch lifecycle so memory doesn't become junkyard mode
- Prevents context-window ballooning
- Keeps turn-time behavior stable on modest hardware
- Preserves usable memory without dragging full chat history every turn
- `>>lock` constrains normal answers to one SUMM file. LLM grounds facts to THAT source. If not there? Signals LOUDLY.
- `>>scratch` as above, but used for transient stuff you copy/paste (think: news article you want to mull over. See FAQ for example).
- End result: both make provenance behavior explicit when grounded vs fallback.
calc / >>find / >>list / >>flush / >>status / >>wiki / >>web
- Router executes deterministic pathways
- No creative writing layer in the middle
- If a sidecar can do it, it does it deterministically.
- Lower token spend, lower latency, less "creative accounting".
- Great for boring operational tasks where wrong answers are expensive.
- `wiki` pulls answers from Wikipedia (preset to first 400 words; acts as summary)
- `trust` (you ask a question, the router gives you options for data sources. You choose, not it)
- `web` pulls answers from the live web with deterministic relevance scoring (DuckDuckGo, or bring your own provider)
>>fun # sticky fun mode on
>>fun off
>>fr # sticky fun rewrite mode on
>>fr off
>>raw # pass-through raw mode
>>raw off
##fun <query> # one-turn fun selector
Result:
- Mode is explicit and controllable. You decide.
- Modes are sticky. They stay put until you cancel them.
- You can switch style fast without re-engineering prompts each turn.
- Grounding contracts still apply where they should.
Yep. Basically TARS sliders.
Result:
- Profile tone (sarcasm, snark, directness) adapts to how you want to be answered (see FAQ for details)
- Coupled with mode --> `serious` keeps the tightest leash.
- Coupled with modes --> `fun` and `fr` have more style range.
- This changes delivery style, not evidence contracts.
Footer is your "show your work" receipt line. It's deterministically graded, not LLM 'trust me bro' vibes.
Confidence: <tier> | Source: <path>
Result:
- cleaner output, same provenance signal
- faster trust decision: accept / verify / lock harder
- Queries Vault-backed knowledge only
- Refuses when evidence is missing
- Gives you grounded answers instead of "sounds right" fiction
- RAG with attitude: 3-pass sweep, different LLMs, strict recall policy, `mentats_debug.log`. No more guessing.
- Helps potato PCs by reducing memory pressure without turning the LLM into a goldfish.
- Bounded context (CTC and preset policy)
- Reasoning strictly lockable (>>lock, >>scratch). Greatly reduced hallucinations (see: PAPER.md)
- Deterministic memory path separate from model weights. You said it, it remembers it EXACTLY.
- File KB flow stays simple (folder-based ingest -> SUMM -> Vault)
- Guarded retrieval/reasoning contracts keep failure modes explicit
- Web retrieval with relevance gating - when local knowledge runs out, it searches before guessing. And shows you the source URL.
AGPL-3.0-or-later. See LICENSE.
