Version: v8.6+ | Smart Query Routing | Cost Optimization | Development Tools
A lightweight, cost-optimized RAG system for climate control documentation and project management. Routes queries intelligently between local models and external LLMs to minimize costs (7.4× cheaper than all-Claude).
- Python 3.8+
- Qdrant vector database (running on localhost:6333)
Use ttkb_tut/venv — this is the ONLY environment needed for all tasks:
# 1. Activate project venv
source ttkb_tut/venv/bin/activate
# 2. Verify all packages installed
python3 -c "import fastapi, qdrant_client, fastembed; print('✅ All packages OK')"
# 3. Run services
python3 scripts/rag_router.py # Query Router on :8000
python3 scripts/dispatcher_service.py # Dispatcher on :8001 (optional)
python3 scripts/dev_logger.py # Auto-logging (git hooks)

Single venv includes:
- ✅ FastAPI + Uvicorn (web services)
- ✅ Qdrant client (vector database)
- ✅ FastEmbed + FastEmbed-GPU (embeddings)
- ✅ Pydantic (data validation)
- ✅ Requests (HTTP client)
- ✅ All ML/AI dependencies (torch, transformers, etc.)
- ✅ Development tools (pytest, black, etc.)
Setup (if venv missing):
python3 -m venv ttkb_tut/venv
source ttkb_tut/venv/bin/activate
pip install fastapi uvicorn requests qdrant-client fastembed fastembed-gpu pydantic

Smart query routing based on pattern matching.
# Dev: python3 scripts/rag_router.py
# Prod: /home/tamiel/programy/klimtech-embed-venv/bin/python3 scripts/rag_router.py
# Server starts on http://localhost:8000

API:
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query": "Jak zamontować klimatyzator?"}'

Response:
{
"decision": "ALWAYS",
"use_rag": true,
"collection": "kb_md",
"recommended_model": "deepseek-flash",
"cost": "$0.000084",
"confidence": 0.95
}

Decision Types:
- ✅ ALWAYS — Use RAG (high importance, project-specific, HVAC domain)
- ⏸️ CONDITIONAL — Try direct first, escalate to RAG if uncertain
- ❌ NEVER — Direct LLM only (writing, general knowledge, session context)
- 🌐 WEB_SEARCH — Fetch from internet (weather, news, current events)
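
A caller usually wants the router's JSON reply as a typed object rather than a raw dictionary. The following is a minimal sketch, assuming the reply always carries the fields shown in the sample response above. `Decision`, `RouterResponse`, and `parse_router_response` are hypothetical names for illustration and are not part of the project's scripts. Treating `collection` as optional is also an assumption, since the source does not show a reply for a non-RAG decision.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    """The four decision types listed above."""
    ALWAYS = "ALWAYS"
    CONDITIONAL = "CONDITIONAL"
    NEVER = "NEVER"
    WEB_SEARCH = "WEB_SEARCH"


@dataclass
class RouterResponse:
    decision: Decision
    use_rag: bool
    collection: Optional[str]  # assumed to be absent when no retrieval is needed
    recommended_model: str
    cost_usd: float
    confidence: float


def parse_router_response(raw: str) -> RouterResponse:
    """Parse the JSON body returned by POST /query (hypothetical helper)."""
    data = json.loads(raw)
    return RouterResponse(
        decision=Decision(data["decision"]),  # raises ValueError on an unknown decision
        use_rag=bool(data["use_rag"]),
        collection=data.get("collection"),
        recommended_model=data["recommended_model"],
        cost_usd=float(data["cost"].lstrip("$")),  # "$0.000084" becomes 0.000084
        confidence=float(data["confidence"]),
    )
```

Converting the decision into an enum means a typo or a new, unhandled decision type fails loudly at parse time instead of silently falling through later.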
Pretty-printed wrapper around Query Router.
# Dev (in activated venv)
python3 scripts/query_client.py "Jak zamontować klimatyzator?" [--verbose] [--no-cache] [--json]
# Prod
/home/tamiel/programy/klimtech-embed-venv/bin/python3 scripts/query_client.py "query text"

Examples:
# Basic
python3 scripts/query_client.py "Jak zamontować klimatyzator?"
# Output: ✅ Decision: ALWAYS, Confidence: 1.00, Cost: $0.000084
# Verbose (show matched categories)
python3 scripts/query_client.py "GPU test" --verbose
# Output: ... Details: Matched: rag_always.project, Tokens: 600, Type: project
# JSON (raw API response)
python3 scripts/query_client.py "query" --json
# Bypass cache
python3 scripts/query_client.py "query" --no-cache

Orchestrates routing → retrieval → model selection → LLM call.
python3 scripts/dispatcher_service.py
# Server on http://localhost:8001

Log commits, snapshots, and messages to supervisor_memory:
# Log git commit
python3 scripts/dev_logger.py log-commit --hash $(git rev-parse HEAD)
# Manual snapshot
python3 scripts/dev_logger.py snapshot --session "v8.6-release" --notes "Initial v8.6"
# Log message status
python3 scripts/dev_logger.py log-message --msg-id msg-013 --status DONE --task-type IMPLEMENT

Query
↓
[Query Router (:8000)]
├→ Pattern matching
├→ Decision (ALWAYS/CONDITIONAL/NEVER/WEB_SEARCH)
└→ Cost estimation
Decision
↓
[Context Retrieval (Qdrant)]
├→ kb_md (documentation)
├→ supervisor_memory (project history)
└→ robotnik_logs (session logs)
Context + Decision
↓
[Model Selection]
├→ Flash ($0.14/M) — 80% of queries
├→ Pro ($1.74/M) — 15% of queries (escalation)
└→ Claude ($5/M) — 5% critical
↓
[LLM Response]
↓
[Result Logging (supervisor_memory)]
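
The flow in the diagram above can be sketched as a single function that wires the stages together. This is only an illustration under stated assumptions: `select_model` and `run_pipeline` are hypothetical names, the `escalated` and `critical` flags are invented here to stand in for whatever signal triggers the Pro and Claude tiers, and the four stages are passed in as plain callables so the sketch does not depend on the real services.

```python
from typing import Callable, Dict, List


def select_model(escalated: bool = False, critical: bool = False) -> str:
    """Pick the cheapest tier that fits (flags are hypothetical inputs)."""
    if critical:
        return "claude"
    if escalated:
        return "pro"
    return "flash"


def run_pipeline(
    query: str,
    route: Callable[[str], Dict],
    retrieve: Callable[[str, str], List[str]],
    call_llm: Callable[[str, str, List[str]], str],
    log: Callable[[Dict], None],
) -> str:
    """Route, optionally retrieve context, pick a model, call it, log the result."""
    decision = route(query)
    context: List[str] = []
    if decision.get("use_rag"):
        # Only hit Qdrant when the router asked for retrieval.
        context = retrieve(decision["collection"], query)
    model = select_model(
        escalated=decision.get("escalated", False),
        critical=decision.get("critical", False),
    )
    answer = call_llm(model, query, context)
    log({"query": query, "decision": decision["decision"], "model": model})
    return answer
```

Passing the stages in as callables keeps the control flow testable with stubs, without a running Router, Qdrant instance, or LLM endpoint.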
See QUERY_CATEGORIES.md for 44+ classified examples.
Quick Reference:
| Category | Decision | Model | Example |
|---|---|---|---|
| HVAC procedures | ALWAYS | Flash + RAG | "Jak zamontować klimatyzator?" (How do I install an air conditioner?) |
| Project history | ALWAYS | Flash + RAG | "Dlaczego GPU test się wysypał?" (Why did the GPU test crash?) |
| Writing tasks | NEVER | Flash | "Napisz email do klienta" (Write an email to the client) |
| Weather/news | WEB_SEARCH | Flash | "Jaka jest pogoda?" (What is the weather?) |
| Concepts | CONDITIONAL | Flash→Pro | "Na czym polega lazy loading?" (How does lazy loading work?) |
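
To make the category-to-decision mapping concrete, here is a rough sketch of keyword-based routing. The regular expressions and their priority order are invented for illustration and cover only the five example queries in the table. The project's actual rules are the ones in QUERY_CATEGORIES.md, and `RULES` and `classify` are hypothetical names.

```python
import re
from typing import List, Tuple

# Hypothetical rules, checked in order; the first match wins.
RULES: List[Tuple[str, "re.Pattern"]] = [
    ("WEB_SEARCH", re.compile(r"\b(pogod\w*|wiadomości|news)\b", re.IGNORECASE)),
    ("NEVER", re.compile(r"\b(napisz|przetłumacz)\b", re.IGNORECASE)),
    ("ALWAYS", re.compile(r"\b(klimatyzator\w*|zamontować|gpu)\b", re.IGNORECASE)),
]


def classify(query: str, default: str = "CONDITIONAL") -> str:
    """Return the decision of the first matching rule, else the default."""
    for decision, pattern in RULES:
        if pattern.search(query):
            return decision
    return default


for q in ["Jak zamontować klimatyzator?", "Napisz email do klienta",
          "Jaka jest pogoda?", "Na czym polega lazy loading?"]:
    print(f"{classify(q):<12} {q}")
```

Falling back to CONDITIONAL when nothing matches mirrors the "try direct first, escalate if uncertain" behavior described for that decision type, though whether the real router uses the same default is not stated in this README.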
Monthly cost comparison (100 queries/day = 3000/month):
| Approach | Cost | vs. All-Claude |
|---|---|---|
| All Claude | $7.50 | 1× |
| Smart Router | $1.02 | 7.4× cheaper |
| Flash only | $0.42 | 17.9× cheaper (but worse accuracy) |
Breakdown by category:
- 30% ALWAYS (RAG + Flash): $0.00084/query
- 40% CONDITIONAL direct: $0.00028/query
- 15% NEVER: $0.00021/query
- 5% WEB_SEARCH: $0.00042/query
- <1% escalation to Claude: $0.0025/query
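
The per-query figures follow from token count times price per million tokens. The sketch below reproduces two numbers stated in this README: the $0.000084 cost of a 600-token Flash query, and the 7.4× saving of the Smart Router over all-Claude. `query_cost` and `savings_factor` are hypothetical helper names, and the prices are the ones listed in the project's price configuration.

```python
# USD per million tokens, as listed in this README.
PRICES_PER_M = {"flash": 0.14, "pro": 1.74, "claude": 5.0}


def query_cost(tokens: int, model: str) -> float:
    """Cost in USD of one query of `tokens` tokens on `model`."""
    return tokens * PRICES_PER_M[model] / 1_000_000


def savings_factor(baseline_usd: float, candidate_usd: float) -> float:
    """How many times cheaper the candidate is than the baseline."""
    return baseline_usd / candidate_usd


print(f"${query_cost(600, 'flash'):.6f}")    # $0.000084
print(f"{savings_factor(7.50, 1.02):.1f}x")  # 7.4x
```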
# Query Router (69 test cases)
python3 -c "
import sys; sys.path.insert(0, 'scripts')
from rag_router import RAGRouter
router = RAGRouter()
tests = [
    ('Jak zamontować klimatyzator?', 'always'),
    ('Napisz mail.', 'never'),
    ('Jaka pogoda?', 'web_search'),
]
for q, expected in tests:
    result = router.route(q)
    status = '✓' if result.decision == expected else '✗'
    print(f'{status} {q} → {result.decision}')
"

# Start services
python3 scripts/rag_router.py &
# Run query client tests
python3 tests/test_query_skill.py
# Stop service
pkill -f rag_router

KlimtechRAG/
├── scripts/
│ ├── rag_router.py # Core Query Router (FastAPI)
│ ├── query_client.py # CLI wrapper with caching
│ ├── dispatcher_service.py # Multi-agent orchestrator (v8.7+)
│ ├── dev_logger.py # Auto-logging to Qdrant
│
├── tests/
│ ├── test_rag_router.py # 69 unit test cases
│ └── test_query_skill.py # 10 integration tests
│
├── postLLMs/ # Pair programming system
│ ├── claude_outbox/ # Tasks sent to Robotnik
│ ├── deepseek_outbox/ # Responses from Robotnik
│ └── worker.lock # Worker mode indicator
│
├── wiki/ # External memory
│ ├── status.md # Session summary
│ ├── decisions.md # Architecture decisions
│ └── lessons.md # Discovered patterns
│
├── CLAUDE.md # Project constitution
├── QDRANT_LAPTOP.md # VRAM strategy doc
├── QUERY_CATEGORIES.md # Pattern matching rules (1183 lines)
└── README.md # This file
# 1. Activate the single unified venv
source ttkb_tut/venv/bin/activate
# 2. Start Qdrant (separate terminal)
podman start qdrant # if using podman
# 3. Start Query Router (separate terminal)
python3 scripts/rag_router.py
# Output: Starting RAG Router on http://localhost:8000
# 4. Test Query Router (main terminal)
python3 scripts/query_client.py "Jak zamontować klimatyzator?"
# Output: ✅ Decision: ALWAYS, Confidence: 1.00, Cost: $0.000084
# 5. Optional: Start Dispatcher
python3 scripts/dispatcher_service.py
# Output: Starting Dispatcher on http://localhost:8001
# 6. Deactivate when done
deactivate

All services use the same venv. No environment switching needed.
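
Because every service is expected to run from the one venv, a script can check at startup that it is actually inside a virtual environment. This is a small sketch using only the standard library. `in_virtualenv` is a hypothetical helper, not something the project's scripts are known to contain.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv.

    Inside a venv, sys.prefix points at the venv directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    state = "venv active" if in_virtualenv() else "no venv active"
    print(f"{state}: {sys.prefix}")
```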
export QDRANT_URL="http://localhost:6333" # Qdrant endpoint
export ROUTER_PORT=8000 # Query Router port
export DISPATCHER_PORT=8001 # Dispatcher port (v8.7+)
export EMBEDDING_MODEL="intfloat/multilingual-e5-large"
export CACHE_DB="/tmp/query_router_cache.db" # Query cache (24h TTL)
export QDRANT_COLLECTIONS="kb_md,supervisor_memory,robotnik_logs"

{
  "prices": {
    "flash": 0.14, # $0.14/M tokens (DeepSeek Flash)
    "pro": 1.74, # $1.74/M tokens (DeepSeek Pro)
    "claude": 5.0 # $5/M tokens (Claude)
  }
}

- QUERY_CATEGORIES.md: Complete pattern matching rules with 44+ examples
- QDRANT_LAPTOP.md: VRAM strategy, model hierarchy, scaling points
- CLAUDE.md: Project constitution (git workflow, security, venv strategy)
- wiki/status.md: Session summary and progress tracking
- tasks/phase_2_integration.md: Integration roadmap (v8.7-v8.8)
# Check if port 8000 is in use
ss -tlnp | grep 8000
# Kill existing process
pkill -f rag_router
# Verify dependencies
python3 -c "import fastapi, uvicorn; print('OK')"
# If error, reinstall: pip install fastapi uvicorn

# Check Qdrant is running
curl http://localhost:6333/collections
# Check logs
tail -20 /tmp/rag_router.log # if started with nohup
tail -20 /tmp/dispatcher.log # dispatcher logs

# Check that venv is active
which python3
# Should be: /home/tamiel/KlimtechRAG/ttkb_tut/venv/bin/python3
# Activate the unified venv
source ttkb_tut/venv/bin/activate
# If importing fails, reinstall dependencies
pip install --upgrade fastapi uvicorn requests qdrant-client \
fastembed fastembed-gpu pydantic

| Operation | Target | Current |
|---|---|---|
| Query routing | < 50ms | ~50ms ✅ |
| Qdrant retrieval | < 300ms | ~200-500ms |
| LLM response | < 2s | varies by model |
| Cache hit | < 5ms | ~2-5ms ✅ |
| End-to-end | < 500ms | ~250-700ms |
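
One way to check an operation against the targets in the table is a small timing harness. The sketch below is an assumption about how such a check could look, not the project's own benchmark. `TARGETS_MS`, `measure_ms`, and `within_target` are hypothetical names, and the median is used so that a single slow run does not dominate the result.

```python
import time
from statistics import median
from typing import Callable

# Targets in milliseconds, copied from the table above.
TARGETS_MS = {"routing": 50, "retrieval": 300, "cache_hit": 5, "end_to_end": 500}


def measure_ms(fn: Callable[[], object], repeats: int = 5) -> float:
    """Median wall-clock time of fn() over `repeats` runs, in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return median(samples)


def within_target(operation: str, elapsed_ms: float) -> bool:
    """True when the measured time is strictly below the target."""
    return elapsed_ms < TARGETS_MS[operation]
```

For example, `within_target("routing", measure_ms(lambda: router.route(q)))` would compare the router's median latency with the 50 ms target, assuming a `router` object like the one used in the test snippet earlier.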
- Test examples: tests/test_rag_router.py (69 cases)
- Pattern matching: scripts/query_patterns.json
- API integration: scripts/query_client.py (HTTP client example)
- Qdrant usage: scripts/dispatcher_service.py /search (retrieval pattern)
- Auto-logging: scripts/dev_logger.py (supervisor_memory integration)
Project: KlimtechRAG (Climate Control Documentation RAG)
Version: v8.6 (Query Routing MVP) → v8.7 (Integration) → v9.0 (Full Feature)
Built by: Szef (Claude Code) + Robotnik (DeepSeek/OpenCode)
Last Updated: 2026-04-24
This is a pair-programming project that uses the postLLMs system (file-based asynchronous task management).
For developers:
- Choose venv path (dev vs prod)
- Activate environment
- Run services
- Test with query_client.py
- Submit changes via git
For pair programming:
- Tasks go in postLLMs/claude_outbox/ (msg-NNN-description.md)
- Responses in postLLMs/deepseek_outbox/ (auto-poll monitors)
- Memory snapshots to Qdrant supervisor_memory
See CLAUDE.md §20 for full protocol.
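
As an illustration of the file-based task hand-off, the sketch below writes a task file following the msg-NNN-description.md naming pattern. It is not the protocol from CLAUDE.md: `next_msg_id` and `write_task` are hypothetical helpers, and both the three-digit zero padding (inferred from the `msg-013` example in this README) and the slug rules are assumptions.

```python
import re
from pathlib import Path

MSG_RE = re.compile(r"^msg-(\d+)-.*\.md$")


def next_msg_id(outbox: Path) -> int:
    """Next free message number: highest existing msg-NNN plus one."""
    ids = [
        int(m.group(1))
        for p in outbox.glob("msg-*.md")
        if (m := MSG_RE.match(p.name))
    ]
    return max(ids, default=0) + 1


def write_task(outbox: Path, description: str, body: str) -> Path:
    """Write a task file named msg-NNN-description.md and return its path."""
    outbox.mkdir(parents=True, exist_ok=True)
    # Reduce the description to lowercase words joined by hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    path = outbox / f"msg-{next_msg_id(outbox):03d}-{slug}.md"
    path.write_text(body, encoding="utf-8")
    return path
```

Calling `write_task(Path("postLLMs/claude_outbox"), "Add cache tests", "...")` on an empty outbox would create `msg-001-add-cache-tests.md`. The sketch takes no lock, so two concurrent writers could pick the same number.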
Questions? Check wiki/status.md or tasks/phase_2_integration.md for ongoing work.
Ready to optimize your queries? 🚀
python3 scripts/query_client.py "Your question here"