The LLM Execution Layer

Compress input. Control output. Save 50–70% on every LLM call.

Drop-in middleware for any LLM provider. You keep 80% of the savings, or pay nothing.

50–70%
cost saved
5-stage pipeline
~45%
input compressed
per-message average
~70%
output reduced
via concise inject + aliases

Optimization Vectors

Five layers. One pass.

Each vector targets a different source of token waste. They compound — the output of one feeds into the next.

01

Code Minification

Strip the noise before compression begins.

Comments, empty lines, type annotations, and redundant whitespace are removed. High-frequency identifiers (≥3 occurrences) are automatically shortened. Foundation for all downstream stages.

15–55% on code-heavy content
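The mechanics of this stage can be sketched in a few lines. This is an illustrative toy (regex-based, with a hypothetical `minify` function); a production pass would be language-aware:

```python
import re
from collections import Counter

def minify(code: str, alias_threshold: int = 3) -> str:
    """Toy minifier: strip comments and blank lines, then shorten
    identifiers that appear >= alias_threshold times.

    Illustrative only -- a real pass parses the language and avoids
    collisions between generated short names and existing identifiers.
    """
    lines = []
    for line in code.splitlines():
        line = re.sub(r"#.*$", "", line).rstrip()  # drop trailing comments
        if line.strip():                           # drop empty lines
            lines.append(line)
    text = "\n".join(lines)
    # Alias high-frequency identifiers (>= 5 chars, lowercase/underscore).
    idents = Counter(re.findall(r"\b[a-z_][a-z0-9_]{4,}\b", text))
    for i, (name, count) in enumerate(sorted(idents.items(), key=lambda kv: -kv[1])):
        if count >= alias_threshold:
            text = re.sub(rf"\b{name}\b", f"v{i}", text)
    return text
```

Every later stage then operates on this denoised text, which is why minification runs first.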

02

Dictionary Aliasing

Say it once, reference it everywhere.

Repeated substrings across the context get compressed into short §XX aliases. The same API schema appearing 8 times becomes one definition and 7 references.

Bidirectional: input + output
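In sketch form, the idea looks like this. Whole repeated lines stand in for arbitrary repeated substrings, and `alias_dictionary` is an illustrative name, not the product's actual API:

```python
from collections import Counter

def alias_dictionary(text: str, min_len: int = 20, min_count: int = 2) -> str:
    """Toy aliasing pass: long lines that repeat become one dictionary
    entry plus short section references.

    The real system aliases arbitrary repeated substrings; aliasing whole
    lines keeps this sketch readable.
    """
    counts = Counter(l for l in text.splitlines() if len(l) >= min_len)
    table = {}
    for line, n in counts.items():
        if n >= min_count:
            table[line] = f"\u00a7{len(table):02d}"  # §00, §01, ...
    if not table:
        return text
    header = "\n".join(f"{alias} = {line}" for line, alias in table.items())
    body = "\n".join(table.get(line, line) for line in text.splitlines())
    return header + "\n" + body

# A schema that appears 8 times becomes one definition and 8 short references.
schema = '{"user": {"id": "int", "email": "string"}}'
compressed = alias_dictionary("\n".join([schema] * 8))
```

Because the dictionary is also supplied to the model, it can emit the same aliases on the way out, which is what makes the stage bidirectional.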

03

Semantic Pruning

Token-level keep/drop decisions at GPU speed.

LLMLingua-2 runs on a SageMaker GPU with ~100ms inference latency. A token-level binary classifier scores each token's importance and drops low-value tokens while preserving meaning.

40–50% input tokens saved
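The keep/drop mechanic reduces to scoring tokens and keeping the highest-scoring ones in original order. In this sketch a crude stopword heuristic stands in for the learned LLMLingua-2 classifier (the hard part of the real system); `prune` and its scores are illustrative:

```python
# Stand-in for a learned importance model: stopwords score low.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "that", "very", "really"}

def prune(text: str, keep_ratio: float = 0.6) -> str:
    """Keep the top keep_ratio fraction of tokens by importance score,
    preserving their original order so meaning survives."""
    tokens = text.split()
    scored = [(0.1 if t.lower() in STOPWORDS else 1.0, i, t)
              for i, t in enumerate(tokens)]
    budget = max(1, int(len(tokens) * keep_ratio))
    # Select highest-scoring tokens, then restore document order.
    keep = sorted(sorted(scored, key=lambda s: -s[0])[:budget], key=lambda s: s[1])
    return " ".join(t for _, _, t in keep)
```

Swapping the heuristic for per-token classifier probabilities gives the GPU version described above; the selection logic is unchanged.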

04

Output Shaping

The hidden 4x tax on output tokens.

Output tokens cost 3–5x more than input across every major provider. Scenario-aware concise instructions and @XX output aliases reduce LLM verbosity. Aliases are restored in real time during streaming.

50–80% output tokens saved
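The subtle part of streaming restoration is an alias split across chunk boundaries: "@" may arrive in one chunk and "07" in the next. A minimal sketch (the `restore_stream` name and two-digit alias format are assumptions based on the @XX notation above):

```python
import re

def restore_stream(chunks, table):
    """Expand @XX output aliases while streaming. A trailing partial alias
    (e.g. '@' or '@0' at the end of a chunk) is buffered until the next
    chunk resolves it, so expansions never tear across chunk boundaries."""
    expand = lambda m: table.get(m.group(0), m.group(0))
    buf = ""
    for chunk in chunks:
        buf += chunk
        # Hold back a possible partial alias at the end of the buffer.
        m = re.search(r"@\d{0,1}$", buf)
        emit, buf = (buf[:m.start()], buf[m.start():]) if m else (buf, "")
        yield re.sub(r"@\d{2}", expand, emit)
    if buf:  # flush whatever remains at end of stream
        yield re.sub(r"@\d{2}", expand, buf)

# "@00" arrives split across two chunks and is still expanded correctly:
table = {"@00": "UserAuthenticationService"}
restored = "".join(restore_stream(["call @", "00 now"], table))
```

The same alias table produced during input compression drives the expansion, so the client sees full text with no extra round trip.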

05

Adaptive Rate

Code and prose don't compress the same.

Content density analysis — structure ratio, technical term density, words per line — automatically adjusts compression rate. Dense structured content gets lighter treatment to prevent information loss.

Auto content-type routing
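A sketch of this routing logic, with illustrative thresholds and rate values (the real analyzer and its cutoffs are not public; `adaptive_rate` is a hypothetical name):

```python
import re

def adaptive_rate(text: str) -> float:
    """Pick a compression rate (fraction of tokens to drop) from content
    density signals: structure-character ratio and words per line.
    Thresholds and rates here are illustrative placeholders."""
    lines = [l for l in text.splitlines() if l.strip()] or [text]
    structure = sum(len(re.findall(r"[{}\[\]();=<>]", l)) for l in lines) / max(1, len(text))
    words_per_line = sum(len(l.split()) for l in lines) / len(lines)
    rate = 0.5                   # default: drop ~50% of tokens
    if structure > 0.05:         # code-like, information-dense: go light
        rate = 0.2
    elif words_per_line > 12:    # flowing prose: safe to compress harder
        rate = 0.6
    return rate
```

The chosen rate is then passed to the pruning stage, so dense code loses fewer tokens than chatty prose.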

Landscape

Not a router. Not a cache.

Routers pick cheaper models. Caches skip repeated calls. Neither reduces the tokens in a new prompt to an expensive model. We do.

Approach                Input saved    Output saved
Manual prompt tuning    ~10%           0%
Semantic cache          100%*          0%
Model router            0%             0%
OpenCompress            40–50%         50–80%

* A semantic cache only helps on exact or near-duplicate prompts; miss rates of 70–90% are typical in production.

Integration

One layer. Every model.

OpenCompress is model-agnostic. Drop it in front of any LLM provider and every call gets optimized automatically.

OpenAI
Anthropic
Google
DeepSeek
xAI
Qwen
MiniMax
Mistral
Meta
Cohere
+ any provider with an OpenAI-compatible API
middleware.py
# Before: direct call — full price
response = client.chat(model, messages)

# After: same call, same response — 50–70% cheaper
response = opencompress.chat(model, messages)

Pricing

You save money. We take 20% of the savings.

No subscriptions, no per-seat fees. We only charge a percentage of the money we actually save you. If compression doesn't help, you pay $0 extra.

usage response
// Every API response shows exactly what you saved
original_cost     $0.0100
compressed_cost   $0.0040
savings           $0.0060
our_fee (20%)     $0.0012
─────────────────────────
you_pay           $0.0052 ← 48% cheaper
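The arithmetic in the usage response above is simple to reproduce; this sketch (hypothetical `bill` helper) uses the 20% share implied by the example's $0.0012 fee on $0.0060 of savings:

```python
def bill(original_cost: float, compressed_cost: float, fee_share: float = 0.20):
    """Fee is a share of realized savings only, so a call that saves
    nothing adds nothing: you_pay == compressed_cost + fee."""
    savings = max(0.0, original_cost - compressed_cost)
    fee = round(savings * fee_share, 4)
    you_pay = compressed_cost + fee
    return you_pay, savings, fee

you_pay, savings, fee = bill(0.0100, 0.0040)
# Matches the usage response: $0.0012 fee, $0.0052 total (48% cheaper).
```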

For you

Change 2 lines of code, save 50–70% on LLM costs.

For us

We take 20% of the savings we generate.

Alignment

We only make money when users save money.

Get started

See it compress your actual prompts.

Paste any prompt into the playground. Watch the A/B comparison stream in real time. No signup, no API key, no setup.