voyager-index

Late-interaction retrieval for on-prem AI. One node. CPU or GPU. MaxSim is the truth scorer.

The pain

ColBERT-quality retrieval is table stakes for serious RAG, yet the production options force a choice you should not have to make.

Managed SaaS — fast to start, hard to control, your data leaves the box.
Distributed clusters — strong recall, expensive to operate.
Offline benchmarks — great numbers, no API, no WAL, no recovery.

Most "production" stacks treat MaxSim as an optional rerank stage and lose its signal under aggressive shortlisting. Most engines that ship operationally drop late interaction entirely.

The solution

voyager-index is a multi-vector native retrieval engine built around MaxSim as the final scorer — and engineered so a single machine can serve it.

One-node deployment. No control plane, no orchestration tax.
One contract across CPU and GPU. Rust SIMD on CPU, Triton on GPU.
Quantized fast paths. FP16, INT8, FP8, ROQ-4, all reranked back to float truth.
Late-interaction native. ColBERT, ColPali, ColQwen out of the box.
Database semantics. WAL, checkpoint, crash recovery, scroll, retrieve.
Optional graph lane. The Latence sidecar augments first-stage retrieval — never required.
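
For readers new to late interaction, here is a minimal NumPy sketch of the MaxSim score that the engine treats as ground truth. It only illustrates the standard definition (for each query token, take the best-matching document token, then sum); the function name `maxsim` and the toy vectors are illustrative assumptions, not the Rust or Triton kernels that voyager-index ships.

```python
import numpy as np

def maxsim(query: np.ndarray, doc: np.ndarray) -> float:
    """Late-interaction score for one (query, document) pair.

    query: (n_query_tokens, dim) token embeddings
    doc:   (n_doc_tokens, dim) token embeddings
    """
    sim = query @ doc.T                    # all token-to-token dot products
    return float(sim.max(axis=1).sum())    # best doc token per query token, summed

# Toy example with 2-dimensional token embeddings (illustrative values only).
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # covers both query tokens
doc_b = np.array([[1.0, 0.0], [-1.0, 0.0]])             # covers only the first one

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

Because every query token keeps its own best match, a document that covers all query tokens outscores one that matches a single token strongly. A single pooled vector per document cannot preserve that signal.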

How

pip install "voyager-index[full,gpu]"   # drop ,gpu on CPU-only hosts
voyager-index-server                    # OpenAPI at http://127.0.0.1:8080/docs

Python:

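import numpy as np
from voyager_index import Index

# Synthetic data: 32 documents and one query, each 16 token embeddings of dim 128.
rng = np.random.default_rng(7)
docs = [rng.normal(size=(16, 128)).astype("float32") for _ in range(32)]
query = rng.normal(size=(16, 128)).astype("float32")

# Build a sharded collection with fp16 compression, add the documents, then search.
idx = Index("demo", dim=128, engine="shard", n_shards=32,
            k_candidates=256, compression="fp16")
idx.add(docs, ids=list(range(len(docs))))
print(idx.search(query, k=5)[0])  # first of the 5 returned results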

HTTP (base64 vector payloads, fp8 GPU scoring, ColBANDIT pruning):

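import numpy as np
import requests
from voyager_index import encode_vector_payload

# One query: 16 token embeddings of dimension 128.
q = np.random.default_rng(7).normal(size=(16, 128)).astype("float32")
r = requests.post(
    "http://127.0.0.1:8080/collections/demo/search",
    json={
        "vectors": encode_vector_payload(q, dtype="float16"),  # base64 payload
        "top_k": 5,
        "quantization_mode": "fp8",  # fp8 GPU scoring
        "use_colbandit": True,       # ColBANDIT query-time pruning
    },
    timeout=30,
)
r.raise_for_status()  # surface HTTP errors before parsing the body
print(r.json()["results"][0])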

Docker:

docker build -f deploy/reference-api/Dockerfile -t voyager-index .
docker run -p 8080:8080 -v "$(pwd)/data:/data" voyager-index

Features

Routing — LEMUR proxy router + FAISS MIPS shortlist, optional ColBANDIT query-time pruning.
Scoring — Triton MaxSim and fused Rust MaxSim, INT8 / FP8 / ROQ-4 with float rerank.
Storage — safetensors shards, memory-mapped CPU, GPU-resident corpus mode.
Hybrid — BM25 + dense fusion via RRF or Tabu Search refinement.
Multimodal — text (ColBERT), images (ColPali / ColQwen), preprocessing for PDF / DOCX / XLSX.
Operations — WAL, checkpoint, crash recovery, scroll, retrieve, multi-worker FastAPI.
Optional graph lane — Latence sidecar for graph-aware rescue and provenance, additive to the OSS path.
Optional groundedness lane — Latence Trace premium sidecar for post-generation hallucination scoring against retrieved chunk_ids or raw context. Calibrated green/amber/red risk band, NLI peer with cross-encoder premise reranking, atomic-claim decomposition, retrieval-coverage observability, response chunking, multilingual EN+DE, three Pareto-optimal profiles, ~118 ms p95 end-to-end with NLI on. Commercial license; runs as a separate process, additive to the OSS retrieval path. See the Groundedness sidecar guide and latence.ai for access.
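
To make the hybrid lane more concrete, here is a small sketch of reciprocal rank fusion (RRF), the simpler of the two fusion options listed above. It is a generic illustration, not the voyager-index implementation: the function name `rrf_fuse`, the document ids, and the constant `k=60` (a commonly used default for RRF) are all assumptions.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    k=60 is a commonly used default, assumed here for illustration.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical rankings from a BM25 lane and a dense lane.
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d4", "d3"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)
```

RRF uses only ranks, so BM25 and dense scores never need to be put on a common scale. Documents that appear in both lists rise above documents that appear in only one.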

Benchmarks

BEIR retrieval — RTX A5000, search-only, full query set

Encoder: lightonai/GTE-ModernColBERT-v1. CPU lane uses 8 native Rust workers.

Dataset	Docs	NDCG@10	Recall@100	GPU QPS	GPU P95 (ms)	CPU QPS	CPU P95 (ms)
arguana	8,674	0.3679	0.9586	270.0	4.1	41.6	202.7
fiqa	57,638	0.4436	0.7297	164.8	5.0	80.2	115.7
nfcorpus	3,633	0.3833	0.3348	282.6	3.8	123.3	84.4
quora	15,675	0.9766	0.9993	346.8	2.6	271.7	46.9
scidocs	25,657	0.1977	0.4369	246.8	4.3	83.9	111.8
scifact	5,183	0.7544	0.9567	263.4	4.0	69.1	138.4

GPU P95 stays under 6 ms across every dataset. The full per-dataset head-to-head against next-plaid (same model, H100, encoding included), methodology, and caveats live in docs/benchmarks.md.
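
As a reading aid for the quality columns, the sketch below shows one common way to compute NDCG@10 and Recall@100 for a single query. It assumes linear gain and a log2 rank discount; the helper names and toy judgments are hypothetical, and the evaluation tooling actually used for the table is described in docs/benchmarks.md rather than here.

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k with linear gain: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(relevance.get(doc_id, 0) / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, relevance, k=100):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = {doc_id for doc_id, rel in relevance.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

# Toy judgments: two relevant documents, one of them ranked third.
relevance = {"a": 1, "b": 1}
ranked = ["a", "x", "b"]

ndcg = ndcg_at_k(ranked, relevance, k=10)
recall = recall_at_k(ranked, relevance, k=100)
print(round(ndcg, 4), recall)
```

The two metrics answer different questions. NDCG@10 rewards placing relevant documents near the top, while Recall@100 only asks whether they were retrieved at all, which is why the two columns can diverge on the same dataset.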

Architecture

query (token / patch embeddings)
  → LEMUR routing MLP → FAISS ANN → candidate IDs
  → optional BM25 fusion · centroid pruning · ColBANDIT
  → exact MaxSim   (Rust SIMD CPU  |  Triton FP16/INT8/FP8/ROQ-4 GPU)
  → optional Latence graph augmentation
  → top-K (or packed context)

Layer	What ships
Routing	LEMUR MLP + FAISS MIPS, candidate budgets
Storage	safetensors shards, mmap, GPU-resident corpus mode
Scoring	Triton + Rust fused MaxSim with INT8 / FP8 / ROQ-4 fast paths
Optional graph	Latence sidecar, additive after first-stage retrieval
Durability	WAL, memtable, checkpoint, crash recovery
Serving	FastAPI, base64 vector transport, multi-worker, OpenAPI

Three execution modes share the same collection format and API contract: CPU exact (mmap → Rust fused), GPU streamed (CPU → GPU → Triton), and GPU corpus (fully VRAM-resident). Start with CPU, add GPU when latency matters.
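
The routing-then-exact-scoring flow can be sketched in a few lines of NumPy. This is a deliberately simplified stand-in: a mean-pooled dot product replaces the LEMUR MLP and FAISS shortlist, and the names `search` and `k_candidates` are reused here only for illustration. It is not the engine's code path.

```python
import numpy as np

def maxsim(query, doc):
    # Exact late-interaction score: best doc token per query token, summed.
    return float((query @ doc.T).max(axis=1).sum())

def search(query, docs, k=5, k_candidates=16):
    # Stage 1 (stand-in for routing + ANN): one pooled vector per document,
    # scored against the pooled query with a single matrix-vector product.
    pooled_docs = np.stack([d.mean(axis=0) for d in docs])
    proxy = pooled_docs @ query.mean(axis=0)
    shortlist = np.argsort(-proxy)[:min(k_candidates, len(docs))]
    # Stage 2: exact MaxSim on the shortlist only; this decides the final order.
    scored = [(int(i), maxsim(query, docs[i])) for i in shortlist]
    scored.sort(key=lambda pair: -pair[1])
    return scored[:k]

rng = np.random.default_rng(7)
docs = [rng.normal(size=(16, 128)).astype("float32") for _ in range(32)]
# Build a query from the first 8 tokens of document 3, plus a little noise.
query = docs[3][:8] + 0.01 * rng.normal(size=(8, 128)).astype("float32")

results = search(query, docs, k=5, k_candidates=16)
print(results[0])
```

The candidate budget is the lever to watch. Exact MaxSim can only reorder what the first stage hands it, so a document dropped by an overly small shortlist cannot be recovered later.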

Documentation

Community And Project Health

File a bug: bug report template
Request a feature: feature request template
Open a PR: pull request template
Contributing guide: CONTRIBUTING.md
Security policy: SECURITY.md
Release process: RELEASING.md
Code of Conduct: CODE_OF_CONDUCT.md

License

Apache-2.0. See LICENSE.
