The LLM Execution Layer
Compress input.
Control output.
Save 50–70% on every LLM call.
Drop-in middleware for any LLM provider. You keep 80% of the savings, or pay nothing.
Optimization Vectors
Five layers. One pass.
Each vector targets a different source of token waste. They compound — the output of one feeds into the next.
Code Minification
Strip the noise before compression begins.
Comments, empty lines, type annotations, and redundant whitespace are removed. High-frequency identifiers (≥3 occurrences) are automatically shortened. Foundation for all downstream stages.
15–55% on code-heavy content
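A minimal sketch of this stage in Python: strip comments and blank lines, then shorten identifiers that occur at least 3 times. The real pass is language-aware and also handles type annotations; the `minify` function, its regexes, and its length threshold here are illustrative assumptions.

```python
import re
from collections import Counter

def minify(source: str) -> str:
    """Illustrative minification pass: drop comments and empty lines,
    then alias high-frequency identifiers (>= 3 occurrences)."""
    lines = []
    for line in source.splitlines():
        line = re.sub(r"#.*$", "", line).rstrip()  # strip trailing comments
        if line:                                   # drop now-empty lines
            lines.append(line)
    text = "\n".join(lines)

    # Shorten identifiers of 5+ chars that appear at least 3 times.
    idents = Counter(re.findall(r"\b[a-z_][a-z0-9_]{4,}\b", text))
    for i, (name, _) in enumerate(w for w in idents.items() if w[1] >= 3):
        text = re.sub(rf"\b{name}\b", f"v{i}", text)
    return text
```

A call like `minify("result_value = 1  # set\n\nresult_value += 2\nprint(result_value)")` collapses the source to three short lines with `result_value` renamed to `v0`.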
Dictionary Aliasing
Say it once, reference it everywhere.
Repeated substrings across the context get compressed into short §XX aliases. The same API schema appearing 8 times becomes one definition and 7 references.
Bidirectional: input + output
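A simplified sketch of the idea, assuming whole repeated lines for clarity (the production pass aliases arbitrary substrings): each long line seen enough times becomes one §XX definition in a header plus short references in the body.

```python
from collections import Counter

def alias_repeats(text: str, min_len: int = 20, min_count: int = 3) -> str:
    """Illustrative dictionary aliasing: repeated long lines become
    a single §XX definition plus short references."""
    counts = Counter(l for l in text.splitlines() if len(l) >= min_len)
    header, n = [], 0
    for line, c in counts.items():
        if c >= min_count:
            tag = f"\u00a7{n:02d}"
            header.append(f"{tag}={line}")      # one definition up top
            text = text.replace(line, tag)      # every occurrence -> reference
            n += 1
    return ("\n".join(header) + "\n" + text) if header else text
```

A schema repeated three times in the input survives as one definition line and three two-character references.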
Semantic Pruning
Token-level keep/drop decisions at GPU speed.
LLMLingua-2 runs on a SageMaker GPU with ~100ms inference latency. A token-level binary classifier scores each token's importance and drops low-value tokens while preserving meaning.
40–50% input tokens saved
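The keep/drop step itself is simple once scores exist. In this sketch the per-token importance scores are assumed inputs (in production they come from the LLMLingua-2 classifier); the function keeps the top fraction by score and preserves original order.

```python
def prune(tokens: list[str], scores: list[float], keep_ratio: float = 0.55) -> list[str]:
    """Illustrative semantic pruning: keep the highest-scoring fraction
    of tokens, emitted in their original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the top-k tokens by importance score.
    keep = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(keep)]
```

With `keep_ratio=0.5`, a four-token span keeps its two highest-scoring tokens, in place.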
Output Shaping
The hidden 4x tax on output tokens.
Output tokens cost 3–5x more than input across every major provider. Scenario-aware concise instructions and @XX output aliases reduce LLM verbosity. Aliases are restored in real-time during streaming.
50–80% output tokens saved
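Streaming restoration has one subtlety worth showing: an @XX tag can straddle a chunk boundary. This sketch buffers a possible partial tag at the end of each chunk before expanding; the alias table is hypothetical.

```python
import re

ALIASES = {"@00": "OpenCompress", "@01": "compression"}  # hypothetical table

def restore_stream(chunks, aliases=ALIASES):
    """Illustrative real-time restoration: expand @XX aliases while
    streaming, holding back a fragment that may be an incomplete tag."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        # A trailing "@", "@1", or "@12" might be a tag split across chunks.
        m = re.search(r"@\d{0,2}$", buf)
        emit, buf = (buf[:m.start()], buf[m.start():]) if m else (buf, "")
        for tag, full in aliases.items():
            emit = emit.replace(tag, full)
        if emit:
            yield emit
    if buf:  # flush whatever is left at end of stream
        for tag, full in aliases.items():
            buf = buf.replace(tag, full)
        yield buf
```

Even when the model emits `"use @0"` and `"0 here"` in separate chunks, the joined output reads `"use OpenCompress here"`.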
Adaptive Rate
Code and prose don't compress the same.
Content density analysis — structure ratio, technical term density, words per line — automatically adjusts compression rate. Dense structured content gets lighter treatment to prevent information loss.
Auto content-type routing
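A toy version of the routing decision: measure how structured the content looks and how dense each line is, then pick a keep-ratio. The features match the ones named above; the thresholds and return values are illustrative, not the production settings.

```python
def compression_rate(text: str) -> float:
    """Illustrative adaptive rate: dense, structured content gets a
    gentler setting (higher keep-ratio) to avoid information loss."""
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines:
        return 0.5
    # Fraction of lines opening with structural characters.
    structure = sum(1 for l in lines if l.lstrip()[:1] in "{[<#|-") / len(lines)
    words_per_line = sum(len(l.split()) for l in lines) / len(lines)
    if structure > 0.3 or words_per_line < 5:
        return 0.8  # structured/dense: keep 80% of tokens
    return 0.5      # loose prose: keep 50%
```

Code-like input routes to the lighter 0.8 setting; running prose routes to 0.5.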
Landscape
Not a router. Not a cache.
Routers pick cheaper models. Caches skip repeated calls. Neither reduces the tokens in a new prompt to an expensive model. We do.
| Approach | Input saved | Output saved | New prompts | Quality | Setup |
|---|---|---|---|---|---|
| Manual prompt tuning | ~10% | 0% | Yes | Varies | Hours per prompt |
| Semantic cache | 100%* | 0% | No* | Exact match | Redis + embeddings |
| Model router | 0% | 0% | Yes | Lower model | Routing rules |
| OpenCompress | 40–50% | 50–80% | Yes | ≥0.80 cosine | One middleware call |
* Semantic cache only works on exact or near-duplicate prompts. Miss rate typically 70–90% in production.
Integration
One layer. Every model.
OpenCompress is model-agnostic. Drop it in front of any LLM provider and every call gets optimized automatically.
# Before: direct call — full price
response = client.chat(model, messages)
# After: same call, same response — 50–70% cheaper
response = opencompress.chat(model, messages)
Pricing
You save money. We take 20% of the savings.
No subscriptions, no per-seat fees. We only charge a percentage of the money we actually save you. If compression doesn't help, you pay $0 extra.
// Every API response shows exactly what you saved
original_cost $0.0100
compressed_cost $0.0040
savings $0.0060
our_fee (20%) $0.0012
─────────────────────────
you_pay $0.0052 ← 48% cheaper
For you
Change 2 lines of code, save 50–70% on LLM costs.
For us
We take 20% of the savings we generate.
Alignment
We only make money when users save money.
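The ledger in the Pricing section reduces to three lines of arithmetic: the fee is a fixed share of realized savings, and zero when there are none. The default share below (0.20) is the one implied by the worked figures there (fee $0.0012 on savings $0.0060); `bill` itself is an illustrative helper, not a published API.

```python
def bill(original_cost: float, compressed_cost: float, fee_share: float = 0.20) -> float:
    """Illustrative billing rule: you pay the compressed cost plus a
    share of the savings; if compression saved nothing, no fee."""
    savings = max(0.0, original_cost - compressed_cost)
    fee = savings * fee_share
    return compressed_cost + fee
```

Plugging in the ledger's numbers, `bill(0.0100, 0.0040)` comes to $0.0052; if a call somehow cost more after compression, the fee is zero and you pay only the compressed cost.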
Get started
See it compress your actual prompts.
Paste any prompt into the playground. Watch the A/B comparison stream in real time. No signup, no API key, no setup.