TokenSense
AI Orchestration Engine with RAG-Powered Context Optimization
Reduce LLM costs 40–70% through semantic retrieval, context compression, and intelligent model routing.
Inspiration
Every time developers query an LLM about their codebase, they waste thousands of tokens on irrelevant context. Entire files, lengthy docs, and redundant code get dumped into prompts — inflating costs and degrading response quality.
There's no middleware that sits between your app and the LLM and says: "Send only what matters, to the cheapest model that can handle it."
TokenSense is that middleware.
What It Does
TokenSense intercepts every LLM request and runs it through a 5-agent pipeline before reaching the model:
- Query Agent — Embeds the question and classifies task type (code / docs / general)
- Retrieval Agent — Semantic search in Actian VectorAI DB to fetch only relevant chunks
- Context Optimizer — Deduplicates, re-ranks by relevance, compresses to a token budget
- Routing Agent — Selects the cheapest model that fits (Haiku / GPT-4o Mini / Gemini Pro)
- Telemetry Agent — Logs tokens, cost, and latency to SQLite
The LLM sees less context, but better context. You pay less and get better answers.
Three interfaces:
- CLI —
pip install tokensense(published on PyPI, v0.1.4) - REST API — FastAPI with
/ask,/index,/optimize,/stats,/keysendpoints - Web Dashboard — Next.js 14 with playground, analytics charts, and API key management
How We Built It
RAG Pipeline (Retrieval-Augmented Generation)
- Index: User indexes a codebase → files split into ~150-token chunks line-by-line → each chunk embedded into a 1,536-dim vector → batch-stored in Actian VectorAI DB
- Retrieve: User asks a question → question embedded → Actian returns top-5 most similar chunks via cosine similarity
- Augment: Context Optimizer removes duplicates (>80% word overlap), re-ranks by score, trims to 8,000 token budget
- Generate: Routing Agent picks the cheapest viable model, injects compressed context as a system message, calls the LLM
- Log: Telemetry Agent calculates cost per model and writes to SQLite
Built from scratch — no LangChain, no frameworks. Pure Python + FastAPI + vector database.
Tech Stack
| Layer | Technology | Purpose |
|---|---|---|
| Backend | FastAPI + Python 3.11+ | Async API server with 5 agent modules |
| Vector DB | Actian VectorAI DB (gRPC) | Semantic retrieval with cosine similarity |
| Embeddings | OpenRouter (text-embedding-ada-002) |
1,536-dim vector representations |
| Model Routing | OpenRouter API | Multi-model abstraction (Claude / GPT) |
| Fallback LLM | Google Gemini API | Code reasoning and large-context tasks |
| CLI | Typer + httpx + Rich | pip-installable command-line interface |
| Frontend | Next.js 14 + React + Tailwind CSS | Web playground + analytics dashboard |
| Telemetry | SQLite via aiosqlite | Persistent cost and usage metrics |
| Deployment | Docker Compose on Vultr | Production-ready single-command deploy |
Best Use of Actian VectorAI DB
Actian VectorAI DB is the backbone of our RAG retrieval pipeline — every semantic search in TokenSense flows through it.
How We Use It
Indexing (storing knowledge):
- Split code files into ~150-token chunks, line by line
- Embed each chunk into a 1,536-dimensional vector
- Batch-upsert all chunks into an Actian collection with cosine distance
- Each vector stores a rich payload: the raw code content + source filename
Querying (finding relevant code):
- Embed the user's question into a vector
- Search Actian for the top-5 most similar chunks using cosine similarity
- Results come back scored, ranked, and ready to use — in under 50ms
Actian API Features We Used
| Feature | How We Use It |
|---|---|
AsyncCortexClient |
Fully async gRPC client, fits natively into our FastAPI pipeline |
create_collection |
1,536-dim collection with DistanceMetric.COSINE |
has_collection |
Check-before-create to avoid errors on repeat starts |
batch_upsert |
Index an entire codebase in one call per file — IDs, vectors, and payloads together |
upsert |
Single-chunk storage with deterministic content-hashed IDs |
search with with_payload=True |
Returns scored results with full content + source metadata |
What We Built Beyond the Basics
Thread-safe async concurrency — Double-checked locking with
asyncio.Lock()prevents race conditions when multiple users hit the API at the same time. Only the first request creates the collection; the rest wait safely.Deterministic content-hashed IDs — Every chunk ID is generated from
abs(hash(content)) % 2**62. Re-indexing the same codebase overwrites instead of duplicating. The database stays clean automatically — no cleanup scripts needed.Rich payload storage — Each vector carries its full content and source filename as payload. One search call returns everything: relevance score, the code text, and where it came from. No second database lookup needed.
Batch upsert from scratch — Actian's beta had minimal documentation for batch operations. We figured out the
batch_upsertAPI from the client surface and built batch indexing that processes entire files in a single gRPC call.
Why Actian
We needed a vector DB that is lightweight, Docker-native, gRPC-fast, and doesn't require auth setup. Actian checks all four. It runs as a single container (williamimoh/actian-vectorai-db:1.0b) with a persistent named volume. No config files, no cluster setup, no API keys. Start the container, connect on port 50051, go.
Best Use of Vultr Cloud
The entire TokenSense production stack runs on a single Vultr Optimized Cloud Compute VPS — live at http://108.61.192.150:8000.
What Runs on Vultr
One docker-compose up deploys three containers on the same VPS:
| Container | Service | Port |
|---|---|---|
actian |
Actian VectorAI DB | 50051 (internal only) |
backend |
FastAPI API server | 8000 (exposed) |
frontend |
Next.js web UI | 3000 (exposed) |
How Vultr Powers the User Experience
Zero-infrastructure onboarding: Anyone in the world can use TokenSense without running a single container locally:
pip install tokensense
tokensense init
# → API URL: http://108.61.192.150:8000
# → API Key: your-key
tokensense ask "how does auth work?"
No Docker on their machine. No git clone. No backend setup. The CLI talks directly to our Vultr-hosted backend, which runs the full pipeline — embedding, Actian vector search, context compression, model routing — all server-side.
Security: Actian Protected by Vultr Firewall
Actian VectorAI DB never touches the public internet. It communicates with the backend over Docker's internal network using service name DNS (actian:50051). Vultr's firewall ensures only ports 22 (SSH), 80 (HTTP), and 443 (HTTPS) are open. The vector database and its data are completely isolated.
Why Vultr
- Single-command deploy —
docker-compose up -dstarts the entire stack - Persistent storage — named Docker volumes survive container restarts
- Firewall protection — database layer invisible to the internet
- Reliable uptime — judges and users can hit the API right now
- Caddy-ready — HTTPS reverse proxy config included for production domains
Challenges We Ran Into
Iterating through four PyPI releases (v0.1.1 → v0.1.4) — Each publish surfaced new packaging issues — missing entry points, incorrect module paths, dependency conflicts. We learned the hard way that
pip install tokensenseonly works whenpyproject.toml, the package structure, and Hatchling build targets are all perfectly aligned.Integrating a beta vector DB with minimal documentation — Actian VectorAI DB is pre-release. No tutorials, no Stack Overflow answers. We reverse-engineered the gRPC client from example scripts and built batch upsert logic from scratch.
Real-time token estimation — OpenAI's tiktoken is accurate but too slow for live token counters. We built a fast approximation (
word_count × 1.3) that stays within ~5% of the real count and runs instantly.Normalizing cost across providers — OpenRouter prices per-million tokens; Gemini prices per-1K. We unified everything into a single cost-per-token calculation inside the Telemetry Agent so the dashboard shows accurate savings regardless of which model handled the request.
Context deduplication at scale — Naive string matching was O(n²). We switched to word-set overlap hashing for O(n) deduplication without sacrificing relevance.
Accomplishments We're Proud Of
Published to PyPI. Anyone can run
pip install tokensenseand start using it immediately — a real, publicly distributed package (v0.1.4), not just a repo to clone.A complete, working tool — not a demo. Three fully integrated interfaces (CLI, REST API, Web UI) all backed by the same five-agent pipeline: query analysis, semantic retrieval via Actian VectorAI DB, context compression, intelligent model routing, and full telemetry logging.
Up to 40–70% token reduction with measurable cost savings. The optimization pipeline cuts token usage and API costs, with every query's tokens, latency, and cost tracked in a real-time analytics dashboard.
105+ tests across backend, frontend, and E2E. Fully mocked pytest suite, Jest + React Testing Library component tests, and Playwright end-to-end tests — no API keys needed to run them.
What We Learned
- Agentic architecture is more maintainable than monolithic LLM wrappers — each agent has one job and does it well
- RAG isn't just for chatbots — semantic retrieval works brilliantly for code-aware developer tools
- Context compression matters more than people think — deduplication + re-ranking often improves answer quality more than better retrieval
- Model routing saves real money — auto-selecting Claude Haiku for simple queries cuts costs dramatically vs always using GPT-4
- Actian VectorAI handles real workloads — search latency under 50ms with thousands of stored chunks
- CLI-first design wins — developers adopt terminal tools faster than web UIs because it fits their workflow
What's Next
- Streaming responses — real-time token display in the playground
- Similarity caching — detect >90% similar queries and return cached answers instantly
- Output token optimization —
max_tokenscaps and conciseness prompts - Custom embeddings — bring your own embedding model
- Team features — shared indexes, usage quotas, role-based API keys
- VSCode extension — inline code explanations with context optimization built in
Links
- Live API:
http://108.61.192.150:8000 - PyPI Package: pypi.org/project/tokensense
- GitHub: github.com/yourusername/TokenSense
Installation
# Install CLI from PyPI
pip install tokensense
# Or run the full stack from source
git clone https://github.com/yourusername/TokenSense.git
cd TokenSense
cp .env.example .env # add your API keys
docker-compose up -d # starts Actian + backend + frontend
Built With
Backend: FastAPI · Python 3.11 · aiosqlite · httpx Vector DB: Actian VectorAI DB · gRPC · cosine similarity LLMs: OpenRouter API · Google Gemini API CLI: Typer · Rich · published to PyPI Frontend: Next.js 14 · React · Tailwind CSS · shadcn/ui · Recharts Deployment: Docker Compose · Vultr Cloud Compute
TokenSense — Send less, spend less, get better answers.
Log in or sign up for Devpost to join the conversation.