The first LangChain integration for TurboQuant - Google Research's training-free vector compression algorithm (ICLR 2026).
Drop-in replacement for any LangChain vector store with ~6x memory reduction and near-zero accuracy loss. No GPU required.
Large-scale RAG pipelines store millions of embedding vectors in memory. At 1536 dimensions (OpenAI text-embedding-3-small), each vector takes 6 KB. A million vectors = 6 GB just for embeddings.
TurboQuant compresses these vectors to ~1 KB each (3-bit quantization), cutting memory by 6x while preserving search accuracy. Unlike Product Quantization (PQ) or IVFPQ, TurboQuant requires no codebook training - it works out of the box on any embedding.
| Feature | langchain-turboquant | FAISS (PQ) | Chroma |
|---|---|---|---|
| Compression ratio | ~6x (3-bit) | ~4x (8-bit PQ) | 1x (none) |
| Training required | No | Yes (codebook) | N/A |
| Drop-in LangChain | Yes | Partial | Yes |
| GPU required | No | Optional | No |
| Asymmetric search | Yes | Yes | N/A |
TurboQuant implements the two-stage compression algorithm from Google Research (ICLR 2026):

Stage 1 - PolarQuant:
- Random orthogonal rotation: multiply the vector by a random orthogonal matrix. This "isotropizes" the coordinates so each one follows the same distribution (the hypersphere marginal).
- Lloyd-Max quantization: quantize each rotated coordinate independently using a pre-computed optimal codebook for the hypersphere marginal PDF.

The codebook is computed analytically from the distribution - no training data needed.
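The fixed-point structure behind such a codebook can be sketched in a few lines. This is an illustrative stand-in, not the package's `lloyd_max.py`: it uses a standard normal PDF in place of the hypersphere marginal, since the iteration itself (boundaries at centroid midpoints, centroids at conditional means of the PDF) is the same:

```python
# Sketch of Lloyd-Max iteration for a 1-D scalar quantizer.
# Assumption: N(0, 1) stands in for the hypersphere marginal PDF.
import numpy as np
from scipy.stats import norm

def lloyd_max_codebook(bits, iters=200):
    levels = 2 ** bits
    centroids = np.linspace(-2.0, 2.0, levels)  # initial guess
    for _ in range(iters):
        # Decision boundaries: midpoints between adjacent centroids.
        bounds = (centroids[:-1] + centroids[1:]) / 2
        edges = np.concatenate(([-np.inf], bounds, [np.inf]))
        # New centroids: conditional mean of the PDF in each cell.
        # For N(0,1): E[X | a < X < b] = (pdf(a) - pdf(b)) / (cdf(b) - cdf(a))
        num = norm.pdf(edges[:-1]) - norm.pdf(edges[1:])
        den = norm.cdf(edges[1:]) - norm.cdf(edges[:-1])
        centroids = num / den
    return centroids

cb = lloyd_max_codebook(bits=2)
# converges to approximately [-1.510, -0.453, 0.453, 1.510],
# the classic 2-bit Lloyd-Max codebook for a standard normal
```

Because the target PDF is known in closed form, this fixed point can be computed once, offline, for every bit width - which is why no training data is needed.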
Stage 2 - QJL residual encoding:
- Compute the quantization residual (the difference between the original and the Stage 1 reconstruction).
- Project the residual through a random Gaussian matrix.
- Store only the sign bits (1 bit per dimension) of the projection.
At query time, an asymmetric estimator computes approximate inner products directly on compressed data - the query stays in full precision while stored vectors remain compressed.
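Taken together, the pipeline can be sketched in plain NumPy. This is an illustrative toy, not the package's implementation: a uniform grid stands in for the Lloyd-Max codebook, the dimension is small, and the names (`compress`, `asymmetric_dot`) are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d, bits = 64, 3

# Random orthogonal rotation (Stage 1) and Gaussian QJL matrix (Stage 2).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
G = rng.standard_normal((d, d))  # m = d projection rows

# Stand-in codebook: uniform grid over +/-3 std of a rotated unit vector's
# coordinates (std ~ 1/sqrt(d)); the real algorithm uses the analytic
# Lloyd-Max codebook for the hypersphere marginal instead.
codebook = np.linspace(-3.0, 3.0, 2 ** bits) / np.sqrt(d)

def compress(x):
    xr = Q @ x                                        # Stage 1: rotate
    codes = np.abs(xr[:, None] - codebook).argmin(1)  # nearest level per coord
    resid = xr - codebook[codes]                      # quantization residual
    signs = np.sign(G @ resid)                        # Stage 2: 1 bit per dim
    return codes, signs, np.linalg.norm(resid)

def asymmetric_dot(query, codes, signs, rnorm):
    qr = Q @ query                    # the query stays in full precision
    stage1 = codebook[codes] @ qr     # exact dot against Stage 1 levels
    # Sign-bit estimator: E[sign(G r) . (G q)] = m * sqrt(2/pi) * <r,q> / ||r||,
    # so rescale to recover the residual's contribution to the dot product.
    stage2 = rnorm * np.sqrt(np.pi / 2) / d * (signs @ (G @ qr))
    return stage1 + stage2

x = rng.standard_normal(d); x /= np.linalg.norm(x)
q = rng.standard_normal(d); q /= np.linalg.norm(q)
est = asymmetric_dot(q, *compress(x))  # approximates x @ q; x is never decompressed
```

Because the rotation is orthogonal, `<x, q> = <Qx, Qq>`, so the score splits into an exact term against the stored codes plus an unbiased sign-bit estimate of the residual term.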
For dimension d with b-bit quantization and QJL dimension m:

```
compressed bits per vector = d*b + m*1 + 32 + 32   (codes + sign bits + two float32 scalars)
                           = d*(b+1) + 64          (with m = d, the default)
original bits per vector   = 32d
compression ratio          = 32d / (d*(b+1) + 64)
```

At d = 1536, b = 3 this gives ~7.7x theoretical, and ~6x in practice with uint8 storage.
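The formula is easy to evaluate directly. The helper below is illustrative, not the package's API, and ignores storage alignment, so it reads slightly above the practical figures reported by `memory_stats()`:

```python
# Quick check of the simplified bit-count formula above.
def compression_ratio(d, b, m=None):
    m = d if m is None else m          # QJL dim defaults to d
    compressed_bits = d * b + m + 64   # codes + sign bits + two float32 scalars
    return 32 * d / compressed_bits

# The ratio approaches 32 / (b + 1) as d grows, so higher dimensions
# compress slightly better at the same bit width.
for d in (384, 768, 1536, 3072):
    print(d, round(compression_ratio(d, b=3), 2))
```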
```bash
pip install langchain-turboquant
```

Or install from source:

```bash
git clone https://github.com/wjddusrb03/langchain-turboquant.git
cd langchain-turboquant
pip install -e ".[dev]"
```

Requirements:

- Python >= 3.9
- NumPy >= 1.21
- SciPy >= 1.7
- LangChain Core >= 0.3
```python
from langchain_turboquant import TurboQuantVectorStore
from langchain_openai import OpenAIEmbeddings

# Create a compressed vector store (3-bit = ~6x compression)
store = TurboQuantVectorStore(embedding=OpenAIEmbeddings(), bits=3)

# Add documents - just like any LangChain vector store
store.add_texts(
    ["TurboQuant compresses vectors by 6x",
     "LangChain is a framework for LLM applications",
     "RAG combines retrieval with generation"],
    metadatas=[{"topic": "compression"}, {"topic": "framework"}, {"topic": "rag"}],
)

# Search
results = store.similarity_search("How does compression work?", k=2)
for doc in results:
    print(doc.page_content)

# Check memory savings
print(store.memory_stats())
# {'num_documents': 3, 'dimension': 1536, 'bits': 3,
#  'compression_ratio': '7.7x', 'memory_saved_pct': '87.0%'}
```

Use the store as a retriever in a RAG chain:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)
retriever = store.as_retriever(search_kwargs={"k": 3})

# Use in a RAG chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI()
)
```

Run the included demo with fake embeddings (no API key needed):
```bash
python examples/rag_demo.py
```

Constructor:

```python
TurboQuantVectorStore(
    embedding: Embeddings,  # Any LangChain embedding model
    bits: int = 3,          # Quantization bits (1-4, recommended: 3)
    qjl_dim: int = None,    # QJL dimensions (default: same as embedding dim)
    seed: int = 42,         # Random seed for reproducibility
)
```

Methods:
| Method | Description |
|---|---|
| `add_texts(texts, metadatas, ids)` | Embed, compress, and store texts |
| `similarity_search(query, k)` | Return top-k most similar documents |
| `similarity_search_with_score(query, k)` | Return top-k with cosine similarity scores |
| `similarity_search_by_vector(vector, k)` | Search by a pre-computed embedding vector |
| `from_texts(texts, embedding, ...)` | Class method to create and populate a store |
| `delete(ids)` | Delete documents by ID |
| `get_by_ids(ids)` | Retrieve documents by ID |
| `as_retriever(**kwargs)` | Convert to a LangChain retriever |
| `save(path)` | Persist the store to disk |
| `load(path, embedding)` | Load a store from disk |
| `memory_stats()` | Get compression statistics |
```python
from langchain_turboquant import TurboQuantizer

quantizer = TurboQuantizer(dim=1536, bits=3)

# Compress vectors
compressed = quantizer.quantize(vectors)  # (n, 1536) -> CompressedVectors

# Asymmetric search (query in full precision, database compressed)
scores = quantizer.cosine_scores(query_vector, compressed)

# Reconstruct (for evaluation)
reconstructed = quantizer.dequantize(compressed)
```

| Dimension | Bits | Theoretical Ratio | Memory Saved |
|---|---|---|---|
| 384 | 3 | 5.8x | 82.8% |
| 768 | 3 | 6.8x | 85.3% |
| 1536 | 3 | 7.3x | 86.3% |
| 3072 | 3 | 7.7x | 87.0% |
| 1536 | 2 | 9.5x | 89.5% |
| 1536 | 4 | 6.1x | 83.6% |
Higher dimensions benefit more from compression (the fixed 64-bit overhead for norms/gammas becomes negligible).
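The "Memory Saved" column follows directly from the ratio column via `saved = 1 - 1/ratio`, which can be verified in a couple of lines:

```python
# Memory saved is determined by the compression ratio:
# saved_fraction = 1 - 1 / ratio
for ratio, expected in [(5.8, 82.8), (6.8, 85.3), (7.3, 86.3),
                        (7.7, 87.0), (9.5, 89.5), (6.1, 83.6)]:
    saved = 100 * (1 - 1 / ratio)
    print(f"{ratio}x -> {saved:.1f}% saved (table: {expected}%)")
```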
The project includes 296 comprehensive tests covering:
- Mathematical correctness (83 tests): Lloyd-Max codebook properties, rotation matrix orthogonality, MSE bounds, PDF integration, centroid conditions
- Edge cases (35 tests): NaN/Inf vectors, empty arrays, Unicode text, dim=1/2/3, zero vectors, large batches
- Search recall (44 tests): Top-k recall at various k/n/dim/bits, cluster discrimination, asymmetric estimator statistics, Pearson correlation
- Persistence (29 tests): Save/load roundtrips, serialization formats, state consistency after add/delete cycles
- Rigorous validation (68 tests): Compression ratios, performance benchmarks, score ordering, reconstruction quality
- Core functionality (37 tests): VectorStore CRUD, quantizer operations, LangChain integration
```bash
# Run all tests
pytest tests/ -v

# Run a specific test suite
pytest tests/test_math_stress.py -v       # Mathematical properties
pytest tests/test_recall_extensive.py -v  # Search recall
pytest tests/test_edge_cases.py -v        # Edge cases
```

```
langchain-turboquant/
├── src/langchain_turboquant/
│   ├── __init__.py              # Package exports
│   ├── lloyd_max.py             # Lloyd-Max optimal codebook computation
│   ├── quantizer.py             # TurboQuantizer (PolarQuant + QJL)
│   └── vectorstore.py           # LangChain VectorStore integration
├── tests/
│   ├── test_quantizer.py        # Core quantizer tests
│   ├── test_vectorstore.py      # VectorStore API tests
│   ├── test_rigorous.py         # Rigorous validation
│   ├── test_math_stress.py      # Mathematical properties
│   ├── test_edge_cases.py       # Edge cases
│   ├── test_recall_extensive.py # Search recall
│   └── test_persistence.py      # Persistence tests
├── examples/
│   └── rag_demo.py              # Working RAG demo (no API key needed)
├── pyproject.toml
├── LICENSE
└── README.md
```
- TurboQuant: Zandieh et al., "TurboQuant: Redefining Efficiency of KV Cache Compression for Large Language Models" (ICLR 2026). arXiv:2504.19874
- PolarQuant: Zandieh et al., "PolarQuant: Achieving High-Fidelity Vector Quantization via Polar Coordinates" (AISTATS 2026). arXiv:2502.02617
- QJL: Zandieh et al., "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead" (AAAI 2025). arXiv:2406.03482
- LangChain: langchain.com
Contributions are welcome! If you find a bug, have a feature request, or want to improve the code:
- Open an Issue describing the problem or idea
- Fork the repo and create a branch
- Write tests for your changes
- Submit a Pull Request
Please report any problems or suggestions in the Issues tab. All feedback is appreciated!
MIT License - see LICENSE for details.