The LLM Execution Layer

Compress input. Control output. Save 50–70% on every LLM call.

Drop-in middleware for any LLM provider. You keep 80% of the savings, or pay nothing.

50–70%
cost saved
5-stage pipeline
~45%
input compressed
per-message average
~70%
output reduced
via concise inject + aliases

Optimization Vectors

Five layers. One pass.

Each vector targets a different source of token waste. They compound — the output of one feeds into the next.

01

Code Minification

Strip the noise before compression begins.

Comments, empty lines, type annotations, and redundant whitespace are removed. High-frequency identifiers (≥3 occurrences) are automatically shortened. Foundation for all downstream stages.

15–55% on code-heavy content
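The mechanics of this stage can be sketched in a few lines. This is an illustrative toy (regex-based, with a hypothetical `minify` function); a production pass would be language-aware:

```python
import re
from collections import Counter

def minify(code: str, alias_threshold: int = 3) -> str:
    """Toy minifier: strip comments and blank lines, then shorten
    identifiers that appear >= alias_threshold times.

    Illustrative only -- a real pass parses the language and avoids
    collisions between generated short names and existing identifiers.
    """
    lines = []
    for line in code.splitlines():
        line = re.sub(r"#.*$", "", line).rstrip()  # drop trailing comments
        if line.strip():                           # drop empty lines
            lines.append(line)
    text = "\n".join(lines)
    # Alias high-frequency identifiers (>= 5 chars, lowercase/underscore).
    idents = Counter(re.findall(r"\b[a-z_][a-z0-9_]{4,}\b", text))
    for i, (name, count) in enumerate(sorted(idents.items(), key=lambda kv: -kv[1])):
        if count >= alias_threshold:
            text = re.sub(rf"\b{name}\b", f"v{i}", text)
    return text
```

Every later stage then operates on this denoised text, which is why minification runs first.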

02

Dictionary Aliasing

Say it once, reference it everywhere.

Repeated substrings across the context get compressed into short §XX aliases. The same API schema appearing 8 times becomes one definition and 7 references.

Bidirectional: input + output
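In sketch form, the idea looks like this. Whole repeated lines stand in for arbitrary repeated substrings, and `alias_dictionary` is an illustrative name, not the product's actual API:

```python
from collections import Counter

def alias_dictionary(text: str, min_len: int = 20, min_count: int = 2) -> str:
    """Toy aliasing pass: long lines that repeat become one dictionary
    entry plus short section references.

    The real system aliases arbitrary repeated substrings; aliasing whole
    lines keeps this sketch readable.
    """
    counts = Counter(l for l in text.splitlines() if len(l) >= min_len)
    table = {}
    for line, n in counts.items():
        if n >= min_count:
            table[line] = f"\u00a7{len(table):02d}"  # §00, §01, ...
    if not table:
        return text
    header = "\n".join(f"{alias} = {line}" for line, alias in table.items())
    body = "\n".join(table.get(line, line) for line in text.splitlines())
    return header + "\n" + body

# A schema that appears 8 times becomes one definition and 8 short references.
schema = '{"user": {"id": "int", "email": "string"}}'
compressed = alias_dictionary("\n".join([schema] * 8))
```

Because the dictionary is also supplied to the model, it can emit the same aliases on the way out, which is what makes the stage bidirectional.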

03

Semantic Pruning

Token-level keep/drop decisions at GPU speed.

LLMLingua-2 runs on a SageMaker GPU with ~100ms inference latency. A token-level binary classifier scores each token's importance and drops low-value tokens while preserving meaning.

40–50% input tokens saved
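The keep/drop mechanic reduces to scoring tokens and keeping the highest-scoring ones in original order. In this sketch a crude stopword heuristic stands in for the learned LLMLingua-2 classifier (the hard part of the real system); `prune` and its scores are illustrative:

```python
# Stand-in for a learned importance model: stopwords score low.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "that", "very", "really"}

def prune(text: str, keep_ratio: float = 0.6) -> str:
    """Keep the top keep_ratio fraction of tokens by importance score,
    preserving their original order so meaning survives."""
    tokens = text.split()
    scored = [(0.1 if t.lower() in STOPWORDS else 1.0, i, t)
              for i, t in enumerate(tokens)]
    budget = max(1, int(len(tokens) * keep_ratio))
    # Select highest-scoring tokens, then restore document order.
    keep = sorted(sorted(scored, key=lambda s: -s[0])[:budget], key=lambda s: s[1])
    return " ".join(t for _, _, t in keep)
```

Swapping the heuristic for per-token classifier probabilities gives the GPU version described above; the selection logic is unchanged.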

04

Output Shaping

The hidden 4x tax on output tokens.

Output tokens cost 3–5x more than input across every major provider. Scenario-aware concise instructions and @XX output aliases reduce LLM verbosity. Aliases are restored in real time during streaming.

50–80% output tokens saved
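The subtle part of streaming restoration is an alias split across chunk boundaries: "@" may arrive in one chunk and "07" in the next. A minimal sketch (the `restore_stream` name and two-digit alias format are assumptions based on the @XX notation above):

```python
import re

def restore_stream(chunks, table):
    """Expand @XX output aliases while streaming. A trailing partial alias
    (e.g. '@' or '@0' at the end of a chunk) is buffered until the next
    chunk resolves it, so expansions never tear across chunk boundaries."""
    expand = lambda m: table.get(m.group(0), m.group(0))
    buf = ""
    for chunk in chunks:
        buf += chunk
        # Hold back a possible partial alias at the end of the buffer.
        m = re.search(r"@\d{0,1}$", buf)
        emit, buf = (buf[:m.start()], buf[m.start():]) if m else (buf, "")
        yield re.sub(r"@\d{2}", expand, emit)
    if buf:  # flush whatever remains at end of stream
        yield re.sub(r"@\d{2}", expand, buf)

# "@00" arrives split across two chunks and is still expanded correctly:
table = {"@00": "UserAuthenticationService"}
restored = "".join(restore_stream(["call @", "00 now"], table))
```

The same alias table produced during input compression drives the expansion, so the client sees full text with no extra round trip.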

05

Adaptive Rate

Code and prose don't compress the same.

Content density analysis — structure ratio, technical term density, words per line — automatically adjusts compression rate. Dense structured content gets lighter treatment to prevent information loss.

Auto content-type routing
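A sketch of this routing logic, with illustrative thresholds and rate values (the real analyzer and its cutoffs are not public; `adaptive_rate` is a hypothetical name):

```python
import re

def adaptive_rate(text: str) -> float:
    """Pick a compression rate (fraction of tokens to drop) from content
    density signals: structure-character ratio and words per line.
    Thresholds and rates here are illustrative placeholders."""
    lines = [l for l in text.splitlines() if l.strip()] or [text]
    structure = sum(len(re.findall(r"[{}\[\]();=<>]", l)) for l in lines) / max(1, len(text))
    words_per_line = sum(len(l.split()) for l in lines) / len(lines)
    rate = 0.5                   # default: drop ~50% of tokens
    if structure > 0.05:         # code-like, information-dense: go light
        rate = 0.2
    elif words_per_line > 12:    # flowing prose: safe to compress harder
        rate = 0.6
    return rate
```

The chosen rate is then passed to the pruning stage, so dense code loses fewer tokens than chatty prose.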

Landscape

Not a router. Not a cache.

Routers pick cheaper models. Caches skip repeated calls. Neither reduces the tokens in a new prompt to an expensive model. We do.

Approach                Input saved    Output saved
Manual prompt tuning    ~10%           0%
Semantic cache          100%*          0%
Model router            0%             0%
OpenCompress            40–50%         50–80%

* A semantic cache only helps on exact or near-duplicate prompts; miss rates of 70–90% are typical in production.

Integration

One layer. Every model.

OpenCompress is model-agnostic. Drop it in front of any LLM provider and every call gets optimized automatically.

OpenAI
Anthropic
Google
DeepSeek
xAI
Qwen
MiniMax
Mistral
Meta
Cohere
+ any provider with an OpenAI-compatible API
middleware.py
# Before: direct call — full price
response = client.chat(model, messages)

# After: same call, same response — 50–70% cheaper
response = opencompress.chat(model, messages)

Pricing

You save money. We take 20% of the savings.

No subscriptions, no per-seat fees. We only charge a percentage of the money we actually save you. If compression doesn't help, you pay $0 extra.

usage response
// Every API response shows exactly what you saved
original_cost     $0.0100
compressed_cost   $0.0040
savings           $0.0060
our_fee (20%)     $0.0012
─────────────────────────
you_pay           $0.0052 ← 48% cheaper
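The arithmetic in the usage response above is simple to reproduce; this sketch (hypothetical `bill` helper) uses the 20% share implied by the example's $0.0012 fee on $0.0060 of savings:

```python
def bill(original_cost: float, compressed_cost: float, fee_share: float = 0.20):
    """Fee is a share of realized savings only, so a call that saves
    nothing adds nothing: you_pay == compressed_cost + fee."""
    savings = max(0.0, original_cost - compressed_cost)
    fee = round(savings * fee_share, 4)
    you_pay = compressed_cost + fee
    return you_pay, savings, fee

you_pay, savings, fee = bill(0.0100, 0.0040)
# Matches the usage response: $0.0012 fee, $0.0052 total (48% cheaper).
```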

For you

Change 2 lines of code, save 50–70% on LLM costs.

For us

We take 20% of the savings we generate.

Alignment

We only make money when users save money.

Get started

See it compress your actual prompts.

Paste any prompt into the playground. Watch the A/B comparison stream in real time. No signup, no API key, no setup.