TokenSense

landing page
playground
dashboard
terminal execution
PyPI package

TokenSense

AI Orchestration Engine with RAG-Powered Context Optimization

Reduce LLM costs 40–70% through semantic retrieval, context compression, and intelligent model routing.

Inspiration

Every time developers query an LLM about their codebase, they waste thousands of tokens on irrelevant context. Entire files, lengthy docs, and redundant code get dumped into prompts — inflating costs and degrading response quality.

There's no middleware that sits between your app and the LLM and says: "Send only what matters, to the cheapest model that can handle it."

TokenSense is that middleware.

What It Does

TokenSense intercepts every LLM request and runs it through a 5-agent pipeline before reaching the model:

Query Agent — Embeds the question and classifies task type (code / docs / general)
Retrieval Agent — Semantic search in Actian VectorAI DB to fetch only relevant chunks
Context Optimizer — Deduplicates, re-ranks by relevance, compresses to a token budget
Routing Agent — Selects the cheapest model that fits (Haiku / GPT-4o Mini / Gemini Pro)
Telemetry Agent — Logs tokens, cost, and latency to SQLite

The LLM sees less context, but better context. You pay less and get better answers.

Three interfaces:

CLI — pip install tokensense (published on PyPI, v0.1.4)
REST API — FastAPI with /ask, /index, /optimize, /stats, /keys endpoints
Web Dashboard — Next.js 14 with playground, analytics charts, and API key management

How We Built It

RAG Pipeline (Retrieval-Augmented Generation)

Index: User indexes a codebase → files split into ~150-token chunks line-by-line → each chunk embedded into a 1,536-dim vector → batch-stored in Actian VectorAI DB
Retrieve: User asks a question → question embedded → Actian returns top-5 most similar chunks via cosine similarity
Augment: Context Optimizer removes duplicates (>80% word overlap), re-ranks by score, trims to 8,000 token budget
Generate: Routing Agent picks the cheapest viable model, injects compressed context as a system message, calls the LLM
Log: Telemetry Agent calculates cost per model and writes to SQLite

Built from scratch — no LangChain, no frameworks. Pure Python + FastAPI + vector database.

Tech Stack

Layer	Technology	Purpose
Backend	FastAPI + Python 3.11+	Async API server with 5 agent modules
Vector DB	Actian VectorAI DB (gRPC)	Semantic retrieval with cosine similarity
Embeddings	OpenRouter (`text-embedding-ada-002`)	1,536-dim vector representations
Model Routing	OpenRouter API	Multi-model abstraction (Claude / GPT)
Fallback LLM	Google Gemini API	Code reasoning and large-context tasks
CLI	Typer + httpx + Rich	pip-installable command-line interface
Frontend	Next.js 14 + React + Tailwind CSS	Web playground + analytics dashboard
Telemetry	SQLite via aiosqlite	Persistent cost and usage metrics
Deployment	Docker Compose on Vultr	Production-ready single-command deploy

Best Use of Actian VectorAI DB

Actian VectorAI DB is the backbone of our RAG retrieval pipeline — every semantic search in TokenSense flows through it.

How We Use It

Indexing (storing knowledge):

Split code files into ~150-token chunks, line by line
Embed each chunk into a 1,536-dimensional vector
Batch-upsert all chunks into an Actian collection with cosine distance
Each vector stores a rich payload: the raw code content + source filename

Querying (finding relevant code):

Embed the user's question into a vector
Search Actian for the top-5 most similar chunks using cosine similarity
Results come back scored, ranked, and ready to use — in under 50ms

Actian API Features We Used

Feature	How We Use It
`AsyncCortexClient`	Fully async gRPC client, fits natively into our FastAPI pipeline
`create_collection`	1,536-dim collection with `DistanceMetric.COSINE`
`has_collection`	Check-before-create to avoid errors on repeat starts
`batch_upsert`	Index an entire codebase in one call per file — IDs, vectors, and payloads together
`upsert`	Single-chunk storage with deterministic content-hashed IDs
`search` with `with_payload=True`	Returns scored results with full content + source metadata

What We Built Beyond the Basics

Thread-safe async concurrency — Double-checked locking with asyncio.Lock() prevents race conditions when multiple users hit the API at the same time. Only the first request creates the collection; the rest wait safely.
Deterministic content-hashed IDs — Every chunk ID is generated from abs(hash(content)) % 2**62. Re-indexing the same codebase overwrites instead of duplicating. The database stays clean automatically — no cleanup scripts needed.
Rich payload storage — Each vector carries its full content and source filename as payload. One search call returns everything: relevance score, the code text, and where it came from. No second database lookup needed.
Batch upsert from scratch — Actian's beta had minimal documentation for batch operations. We figured out the batch_upsert API from the client surface and built batch indexing that processes entire files in a single gRPC call.

Why Actian

We needed a vector DB that is lightweight, Docker-native, gRPC-fast, and doesn't require auth setup. Actian checks all four. It runs as a single container (williamimoh/actian-vectorai-db:1.0b) with a persistent named volume. No config files, no cluster setup, no API keys. Start the container, connect on port 50051, go.

Best Use of Vultr Cloud

The entire TokenSense production stack runs on a single Vultr Optimized Cloud Compute VPS — live at http://108.61.192.150:8000.

What Runs on Vultr

One docker-compose up deploys three containers on the same VPS:

Container	Service	Port
`actian`	Actian VectorAI DB	50051 (internal only)
`backend`	FastAPI API server	8000 (exposed)
`frontend`	Next.js web UI	3000 (exposed)

How Vultr Powers the User Experience

Zero-infrastructure onboarding: Anyone in the world can use TokenSense without running a single container locally:

pip install tokensense
tokensense init
# → API URL: http://108.61.192.150:8000
# → API Key: your-key
tokensense ask "how does auth work?"

No Docker on their machine. No git clone. No backend setup. The CLI talks directly to our Vultr-hosted backend, which runs the full pipeline — embedding, Actian vector search, context compression, model routing — all server-side.

Security: Actian Protected by Vultr Firewall

Actian VectorAI DB never touches the public internet. It communicates with the backend over Docker's internal network using service name DNS (actian:50051). Vultr's firewall ensures only ports 22 (SSH), 80 (HTTP), and 443 (HTTPS) are open. The vector database and its data are completely isolated.

Why Vultr

Single-command deploy — docker-compose up -d starts the entire stack
Persistent storage — named Docker volumes survive container restarts
Firewall protection — database layer invisible to the internet
Reliable uptime — judges and users can hit the API right now
Caddy-ready — HTTPS reverse proxy config included for production domains

Challenges We Ran Into

Iterating through four PyPI releases (v0.1.1 → v0.1.4) — Each publish surfaced new packaging issues — missing entry points, incorrect module paths, dependency conflicts. We learned the hard way that pip install tokensense only works when pyproject.toml, the package structure, and Hatchling build targets are all perfectly aligned.
Integrating a beta vector DB with minimal documentation — Actian VectorAI DB is pre-release. No tutorials, no Stack Overflow answers. We reverse-engineered the gRPC client from example scripts and built batch upsert logic from scratch.
Real-time token estimation — OpenAI's tiktoken is accurate but too slow for live token counters. We built a fast approximation (word_count × 1.3) that stays within ~5% of the real count and runs instantly.
Normalizing cost across providers — OpenRouter prices per-million tokens; Gemini prices per-1K. We unified everything into a single cost-per-token calculation inside the Telemetry Agent so the dashboard shows accurate savings regardless of which model handled the request.
Context deduplication at scale — Naive string matching was O(n²). We switched to word-set overlap hashing for O(n) deduplication without sacrificing relevance.

Accomplishments We're Proud Of

Published to PyPI. Anyone can run pip install tokensense and start using it immediately — a real, publicly distributed package (v0.1.4), not just a repo to clone.
A complete, working tool — not a demo. Three fully integrated interfaces (CLI, REST API, Web UI) all backed by the same five-agent pipeline: query analysis, semantic retrieval via Actian VectorAI DB, context compression, intelligent model routing, and full telemetry logging.
Up to 40–70% token reduction with measurable cost savings. The optimization pipeline cuts token usage and API costs, with every query's tokens, latency, and cost tracked in a real-time analytics dashboard.
105+ tests across backend, frontend, and E2E. Fully mocked pytest suite, Jest + React Testing Library component tests, and Playwright end-to-end tests — no API keys needed to run them.

What We Learned

Agentic architecture is more maintainable than monolithic LLM wrappers — each agent has one job and does it well
RAG isn't just for chatbots — semantic retrieval works brilliantly for code-aware developer tools
Context compression matters more than people think — deduplication + re-ranking often improves answer quality more than better retrieval
Model routing saves real money — auto-selecting Claude Haiku for simple queries cuts costs dramatically vs always using GPT-4
Actian VectorAI handles real workloads — search latency under 50ms with thousands of stored chunks
CLI-first design wins — developers adopt terminal tools faster than web UIs because it fits their workflow

What's Next

Streaming responses — real-time token display in the playground
Similarity caching — detect >90% similar queries and return cached answers instantly
Output token optimization — max_tokens caps and conciseness prompts
Custom embeddings — bring your own embedding model
Team features — shared indexes, usage quotas, role-based API keys
VSCode extension — inline code explanations with context optimization built in

Installation

# Install CLI from PyPI
pip install tokensense

# Or run the full stack from source
git clone https://github.com/yourusername/TokenSense.git
cd TokenSense
cp .env.example .env   # add your API keys
docker-compose up -d   # starts Actian + backend + frontend

Built With

Backend: FastAPI · Python 3.11 · aiosqlite · httpx Vector DB: Actian VectorAI DB · gRPC · cosine similarity LLMs: OpenRouter API · Google Gemini API CLI: Typer · Rich · published to PyPI Frontend: Next.js 14 · React · Tailwind CSS · shadcn/ui · Recharts Deployment: Docker Compose · Vultr Cloud Compute

TokenSense — Send less, spend less, get better answers.