Hybrid BM25 + dense retrieval with Reciprocal Rank Fusion and an ms-marco cross-encoder reranker. Achieves NDCG@10 of 0.726 on SciFact. Includes a per-stage latency breakdown, LLM-as-judge faithfulness scoring, and a retrieval-only mode that works without an API key.
```mermaid
graph TD
    A[User Query] --> B[Query Encoder\nall-MiniLM-L6-v2]
    B --> C[BM25 Retrieval\nrank_bm25 — top 100]
    B --> D[Dense Retrieval\nFAISS IndexFlatIP — top 100]
    C --> E[Reciprocal Rank Fusion\nWeighted RRF k=60]
    D --> E
    E --> F[Cross-Encoder Reranker\nms-marco-MiniLM-L-6-v2]
    F --> G[Context Builder\ntiktoken token-budget selection]
    G --> H[LLM Generation\nGPT-4o-mini]
    H --> I[Answer + Sources]
```
```mermaid
graph LR
    A[Test Queries] --> B[RAG Pipeline]
    B --> C[Retrieved Chunks]
    B --> D[Generated Answer]
    C --> E[Retrieval Metrics\nNDCG@K, Recall@K, MRR]
    D --> F[Generation Metrics\nFaithfulness, Token F1]
    D --> G[LLM-as-Judge\nCorrectness, Hallucination]
    E --> H[Evaluation Report]
    F --> H
    G --> H
```
Evaluated on SciFact (~300 test queries, ~5K corpus documents):
| Method | NDCG@10 | Recall@100 | MRR |
|---|---|---|---|
| BM25 only | 0.665 | 0.917 | 0.731 |
| Dense only (all-MiniLM-L6-v2) | 0.623 | 0.891 | 0.703 |
| Hybrid BM25 + Dense (RRF) | 0.698 | 0.941 | 0.771 |
| Hybrid + Cross-Encoder Reranker | 0.726 | 0.941 | 0.798 |
Per-stage latency breakdown:

| Stage | p50 | p99 |
|---|---|---|
| BM25 retrieval | 12 ms | 28 ms |
| Dense retrieval (FAISS) | 18 ms | 41 ms |
| Cross-encoder reranking (20 pairs) | 310 ms | 480 ms |
| LLM generation (GPT-4o-mini) | 890 ms | 1,800 ms |
```bash
make evaluate   # reproduce on SciFact
```

RRF over score normalization: BM25 and dense scores live on incompatible scales. RRF fuses the two ranked lists using only rank position, which is scale-invariant and empirically robust; the constant k=60 damps the outsized influence of rank-1 results.
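The weighted RRF fusion described above can be sketched in a few lines (doc IDs, weights, and the `rrf_fuse` name are illustrative, not the project's actual API):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, weights=None, k=60):
    """Weighted Reciprocal Rank Fusion: each list contributes
    weight / (k + rank) per document; raw score scales are ignored."""
    weights = weights or [1.0] * len(ranked_lists)
    scores = defaultdict(float)
    for weight, ranking in zip(weights, ranked_lists):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort doc IDs by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]    # BM25 results, best first
dense_top = ["d1", "d9", "d3"]   # dense results, best first
fused = rrf_fuse([bm25_top, dense_top])
# d1 appears near the top of both lists, so it wins the fusion.
```

Because only rank positions enter the sum, a document that is merely good in both lists can outrank one that is first in a single list — exactly the behavior that makes RRF robust without score calibration.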
Two-stage reranking: Retrieve 100 candidates with bi-encoders (fast, documents encoded independently offline), then rerank the top 20 with a cross-encoder (full query-document cross-attention, more accurate). This gets near-cross-encoder accuracy at a fraction of the cost of cross-encoding the whole corpus.
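The two-stage pattern is independent of the specific models; a minimal sketch with pluggable scorers (the toy numeric scorers below stand in for the bi-encoder and cross-encoder):

```python
def two_stage_rerank(query, corpus, retrieve_score, rerank_score,
                     n_retrieve=100, n_rerank=20):
    """Stage 1: cheap score over the whole corpus (bi-encoder / BM25).
    Stage 2: expensive pairwise score over the head of the list only."""
    candidates = sorted(corpus, key=lambda d: retrieve_score(query, d),
                        reverse=True)[:n_retrieve]
    head, tail = candidates[:n_rerank], candidates[n_rerank:]
    # Only n_rerank expensive calls, regardless of corpus size.
    reranked = sorted(head, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked + tail

# Toy scorers: stage 1 prefers large doc IDs, stage 2 prefers small ones.
corpus = list(range(1, 11))
result = two_stage_rerank("", corpus,
                          retrieve_score=lambda q, d: d,
                          rerank_score=lambda q, d: -d,
                          n_retrieve=5, n_rerank=3)
# → [8, 9, 10, 7, 6]: only the top-3 candidates were re-ordered.
```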
LLM-as-judge over ROUGE: ROUGE penalizes correct answers that use different phrasing. LLM judges correlate better with human preference and enable reference-free faithfulness scoring at GPT-4o-mini cost.
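One way to make judge output machine-readable is to force a fixed reply format and parse it; the prompt wording and reply format below are illustrative, not the repo's actual rubric:

```python
import re

JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Context: {context}
Answer: {answer}
Reply exactly as:
Correctness: <1-5>
Faithful: <yes|no>"""

def parse_judge_reply(reply):
    """Extract (correctness 1-5, faithfulness 0/1) from a judge reply."""
    correctness = int(re.search(r"Correctness:\s*([1-5])", reply).group(1))
    faithful = re.search(r"Faithful:\s*(yes|no)", reply, re.I).group(1)
    return correctness, 1 if faithful.lower() == "yes" else 0
```

Pinning the format in the prompt keeps parsing trivial and makes malformed judge replies detectable (the regex fails loudly instead of silently mis-scoring).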
tiktoken token budgeting: Chunks are added greedily in descending score order until the token budget is exhausted, so a long but low-scoring chunk cannot crowd out shorter, more relevant ones.
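The greedy selection amounts to a few lines; here a whitespace tokenizer stands in for tiktoken, and the `select_chunks` name and chunk shape are illustrative:

```python
def select_chunks(chunks, budget, count_tokens):
    """Greedy token-budget selection: take chunks in descending score
    order; skip any chunk that would overflow the remaining budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = count_tokens(chunk["text"])
        if used + cost <= budget:
            selected.append(chunk)
            used += cost
    return selected

chunks = [
    {"text": "short relevant chunk", "score": 0.9},
    {"text": "a very long but weakly relevant chunk " * 20, "score": 0.5},
    {"text": "another short chunk", "score": 0.8},
]
# Whitespace split approximates tiktoken counting for the sketch.
picked = select_chunks(chunks, budget=10, count_tokens=lambda t: len(t.split()))
# The two short, high-scoring chunks fit; the 140-token chunk is skipped.
```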
| Feature | Implementation |
|---|---|
| Hybrid search | BM25 (rank_bm25) + FAISS dense retrieval, weighted RRF fusion |
| Cross-encoder reranking | ms-marco-MiniLM-L-6-v2, batched inference |
| Token-budget context selection | tiktoken counting, greedy chunk selection |
| Retrieval evaluation | NDCG@K, Recall@K, Precision@K, MRR, MAP with qrels |
| LLM-as-judge | Correctness (1–5), faithfulness (0/1), hallucination (0–1) |
| Ablation mode | BM25-only, dense-only, hybrid, hybrid+reranker per query |
| API-key-free mode | Full retrieval stack without OPENAI_API_KEY |
| Observability | Prometheus stage-level latency histograms + Grafana dashboard |
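The retrieval metrics row above includes NDCG@K; a minimal reference implementation over graded qrels (not the project's evaluation code) looks like:

```python
import math

def dcg(gains):
    """Discounted cumulative gain: gain at rank i discounted by log2(i+1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_ids, qrels, k=10):
    """qrels maps doc_id -> graded relevance; unjudged docs count as 0."""
    gains = [qrels.get(d, 0) for d in ranked_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if ideal else 0.0

qrels = {"d1": 1, "d4": 1}
score = ndcg_at_k(["d1", "d2", "d4"], qrels, k=10)  # ~0.92: d4 at rank 3, not 2
```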
```bash
make install   # install dependencies
make index     # index SciFact corpus (~5K docs, ~2 min)
make serve     # API on :8000, docs at /docs
```

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "How does mRNA vaccine technology work?", "k": 5}'
```

The full retrieval and reranking stack works without OPENAI_API_KEY. Only LLM generation requires one:
```json
{
  "answer": "LLM not configured — set OPENAI_API_KEY to enable generation. See `sources` for relevant passages.",
  "sources": [...]
}
```

Use `POST /search` for retrieval-only queries (no LLM, no key needed). Local model via Ollama:

```bash
OPENAI_API_KEY=ollama OPENAI_BASE_URL=http://localhost:11434/v1 LLM_MODEL=llama3 make serve
```

| Endpoint | Method | Description |
|---|---|---|
| `/query` | POST | Full RAG: retrieval + reranking + LLM generation |
| `/search` | POST | Retrieval only. `mode`: `hybrid`, `bm25`, `dense` |
| `/health` | GET | Pipeline readiness |
| `/index/stats` | GET | Corpus size, model names, chunk count |
| `/evaluate` | POST | LLM-as-judge batch evaluation |
| `/metrics` | GET | Prometheus scrape endpoint |
`POST /query` response:

```json
{
  "answer": "mRNA vaccines work by...",
  "sources": [{"chunk_id": "...", "title": "...", "text": "...", "score": 0.92, "rank": 0}],
  "latency": {
    "retrieval_ms": 45.2,
    "reranking_ms": 312.1,
    "generation_ms": 890.4,
    "total_ms": 1251.3
  },
  "tokens_used": 487
}
```

```
docrank/
├── src/
│   ├── ingestion/        # Document loading, recursive chunking
│   ├── retrieval/        # BM25, FAISS dense, hybrid RRF
│   ├── reranking/        # Cross-encoder
│   ├── generation/       # Context builder, LLM client
│   ├── evaluation/       # Retrieval metrics, gen metrics, LLM judge
│   ├── pipeline/         # Indexing + RAG inference
│   └── serving/          # FastAPI + Prometheus middleware
├── scripts/
│   ├── index_corpus.py   # CLI indexer (scifact / directory / jsonl)
│   └── evaluate.py       # CLI evaluation runner
├── tests/
├── monitoring/
├── docker-compose.yml
└── Makefile
```
```bash
cp .env.example .env   # optionally add OPENAI_API_KEY
make docker-up         # API :8000, Prometheus :9090, Grafana :3000
docker-compose exec api python scripts/index_corpus.py --source scifact
```

MIT