Hybrid BM25 + dense retrieval with Reciprocal Rank Fusion and an ms-marco cross-encoder reranker. Achieves NDCG@10 of 0.726 on SciFact. Includes a per-stage latency breakdown, LLM-as-judge faithfulness scoring, and a retrieval-only mode that works without an API key.
```mermaid
graph TD
    A[User Query] --> B[Query Encoder\nall-MiniLM-L6-v2]
    B --> C[BM25 Retrieval\nrank_bm25 — top 100]
    B --> D[Dense Retrieval\nFAISS IndexFlatIP — top 100]
    C --> E[Reciprocal Rank Fusion\nWeighted RRF k=60]
    D --> E
    E --> F[Cross-Encoder Reranker\nms-marco-MiniLM-L-6-v2]
    F --> G[Context Builder\ntiktoken token-budget selection]
    G --> H[LLM Generation\nGPT-4o-mini]
    H --> I[Answer + Sources]
```
```mermaid
graph LR
    A[Test Queries] --> B[RAG Pipeline]
    B --> C[Retrieved Chunks]
    B --> D[Generated Answer]
    C --> E[Retrieval Metrics\nNDCG@K, Recall@K, MRR]
    D --> F[Generation Metrics\nFaithfulness, Token F1]
    D --> G[LLM-as-Judge\nCorrectness, Hallucination]
    E --> H[Evaluation Report]
    F --> H
    G --> H
```
Evaluated on SciFact (~300 test queries, ~5K corpus documents):
| Method | NDCG@10 | Recall@100 | MRR |
|---|---|---|---|
| BM25 only | 0.665 | 0.917 | 0.731 |
| Dense only (all-MiniLM-L6-v2) | 0.623 | 0.891 | 0.703 |
| Hybrid BM25 + Dense (RRF) | 0.698 | 0.941 | 0.771 |
| Hybrid + Cross-Encoder Reranker | 0.726 | 0.941 | 0.798 |
Per-stage latency breakdown:

| Stage | p50 | p99 |
|---|---|---|
| BM25 retrieval | 12 ms | 28 ms |
| Dense retrieval (FAISS) | 18 ms | 41 ms |
| Cross-encoder reranking (20 pairs) | 310 ms | 480 ms |
| LLM generation (GPT-4o-mini) | 890 ms | 1,800 ms |
```bash
make evaluate   # reproduce on SciFact
```

RRF over score normalization: BM25 and dense scores live on incompatible scales. RRF fuses the two ranked lists using only rank position, which is scale-invariant and empirically robust; the constant k=60 damps the outsized influence of rank-1 results.
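The weighted RRF fusion described above can be sketched in a few lines (doc IDs, weights, and the `rrf_fuse` name are illustrative, not the project's actual API):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, weights=None, k=60):
    """Weighted Reciprocal Rank Fusion: each list contributes
    weight / (k + rank) per document; raw score scales are ignored."""
    weights = weights or [1.0] * len(ranked_lists)
    scores = defaultdict(float)
    for weight, ranking in zip(weights, ranked_lists):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort doc IDs by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]    # BM25 results, best first
dense_top = ["d1", "d9", "d3"]   # dense results, best first
fused = rrf_fuse([bm25_top, dense_top])
# d1 appears near the top of both lists, so it wins the fusion.
```

Because only rank positions enter the sum, a document that is merely good in both lists can outrank one that is first in a single list — exactly the behavior that makes RRF robust without score calibration.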
Two-stage reranking: Retrieve 100 candidates with bi-encoders (fast, documents encoded independently offline), then rerank the top 20 with a cross-encoder (full query-document cross-attention, more accurate). This gets near-cross-encoder accuracy at a fraction of the cost of cross-encoding the whole corpus.
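The two-stage pattern is independent of the specific models; a minimal sketch with pluggable scorers (the toy numeric scorers below stand in for the bi-encoder and cross-encoder):

```python
def two_stage_rerank(query, corpus, retrieve_score, rerank_score,
                     n_retrieve=100, n_rerank=20):
    """Stage 1: cheap score over the whole corpus (bi-encoder / BM25).
    Stage 2: expensive pairwise score over the head of the list only."""
    candidates = sorted(corpus, key=lambda d: retrieve_score(query, d),
                        reverse=True)[:n_retrieve]
    head, tail = candidates[:n_rerank], candidates[n_rerank:]
    # Only n_rerank expensive calls, regardless of corpus size.
    reranked = sorted(head, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked + tail

# Toy scorers: stage 1 prefers large doc IDs, stage 2 prefers small ones.
corpus = list(range(1, 11))
result = two_stage_rerank("", corpus,
                          retrieve_score=lambda q, d: d,
                          rerank_score=lambda q, d: -d,
                          n_retrieve=5, n_rerank=3)
# → [8, 9, 10, 7, 6]: only the top-3 candidates were re-ordered.
```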
LLM-as-judge over ROUGE: ROUGE penalizes correct answers that use different phrasing. LLM judges correlate better with human preference and enable reference-free faithfulness scoring at GPT-4o-mini cost.
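One way to make judge output machine-readable is to force a fixed reply format and parse it; the prompt wording and reply format below are illustrative, not the repo's actual rubric:

```python
import re

JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Context: {context}
Answer: {answer}
Reply exactly as:
Correctness: <1-5>
Faithful: <yes|no>"""

def parse_judge_reply(reply):
    """Extract (correctness 1-5, faithfulness 0/1) from a judge reply."""
    correctness = int(re.search(r"Correctness:\s*([1-5])", reply).group(1))
    faithful = re.search(r"Faithful:\s*(yes|no)", reply, re.I).group(1)
    return correctness, 1 if faithful.lower() == "yes" else 0
```

Pinning the format in the prompt keeps parsing trivial and makes malformed judge replies detectable (the regex fails loudly instead of silently mis-scoring).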
tiktoken token budgeting: Chunks are added greedily in descending score order until the token budget is exhausted, so a long but low-scoring chunk cannot crowd out shorter, more relevant ones.
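The greedy selection amounts to a few lines; here a whitespace tokenizer stands in for tiktoken, and the `select_chunks` name and chunk shape are illustrative:

```python
def select_chunks(chunks, budget, count_tokens):
    """Greedy token-budget selection: take chunks in descending score
    order; skip any chunk that would overflow the remaining budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = count_tokens(chunk["text"])
        if used + cost <= budget:
            selected.append(chunk)
            used += cost
    return selected

chunks = [
    {"text": "short relevant chunk", "score": 0.9},
    {"text": "a very long but weakly relevant chunk " * 20, "score": 0.5},
    {"text": "another short chunk", "score": 0.8},
]
# Whitespace split approximates tiktoken counting for the sketch.
picked = select_chunks(chunks, budget=10, count_tokens=lambda t: len(t.split()))
# The two short, high-scoring chunks fit; the 140-token chunk is skipped.
```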
| Feature | Implementation |
|---|---|
| Hybrid search | BM25 (rank_bm25) + FAISS dense retrieval, weighted RRF fusion |
| Cross-encoder reranking | ms-marco-MiniLM-L-6-v2, batched inference |
| Token-budget context selection | tiktoken counting, greedy chunk selection |
| Retrieval evaluation | NDCG@K, Recall@K, Precision@K, MRR, MAP with qrels |
| LLM-as-judge | Correctness (1–5), faithfulness (0/1), hallucination (0–1) |
| Ablation mode | BM25-only, dense-only, hybrid, hybrid+reranker per query |
| API-key-free mode | Full retrieval stack without OPENAI_API_KEY |
| Observability | Prometheus stage-level latency histograms + Grafana dashboard |
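The retrieval metrics row above includes NDCG@K; a minimal reference implementation over graded qrels (not the project's evaluation code) looks like:

```python
import math

def dcg(gains):
    """Discounted cumulative gain: gain at rank i discounted by log2(i+1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_ids, qrels, k=10):
    """qrels maps doc_id -> graded relevance; unjudged docs count as 0."""
    gains = [qrels.get(d, 0) for d in ranked_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if ideal else 0.0

qrels = {"d1": 1, "d4": 1}
score = ndcg_at_k(["d1", "d2", "d4"], qrels, k=10)  # ~0.92: d4 at rank 3, not 2
```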
```bash
make install   # install dependencies
make index     # index SciFact corpus (~5K docs, ~2 min)
make serve     # API on :8000, docs at /docs
```

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "How does mRNA vaccine technology work?", "k": 5}'
```

The full retrieval and reranking stack works without OPENAI_API_KEY. Only LLM generation requires one:
```json
{
  "answer": "LLM not configured — set OPENAI_API_KEY to enable generation. See `sources` for relevant passages.",
  "sources": [...]
}
```

Use `POST /search` for retrieval-only queries (no LLM, no key needed). Local model via Ollama:

```bash
OPENAI_API_KEY=ollama OPENAI_BASE_URL=http://localhost:11434/v1 LLM_MODEL=llama3 make serve
```

| Endpoint | Method | Description |
|---|---|---|
| `/query` | POST | Full RAG: retrieval + reranking + LLM generation |
| `/search` | POST | Retrieval only. `mode`: `hybrid`, `bm25`, `dense` |
| `/health` | GET | Pipeline readiness |
| `/index/stats` | GET | Corpus size, model names, chunk count |
| `/evaluate` | POST | LLM-as-judge batch evaluation |
| `/metrics` | GET | Prometheus scrape endpoint |
`POST /query` response:

```json
{
  "answer": "mRNA vaccines work by...",
  "sources": [{"chunk_id": "...", "title": "...", "text": "...", "score": 0.92, "rank": 0}],
  "latency": {
    "retrieval_ms": 45.2,
    "reranking_ms": 312.1,
    "generation_ms": 890.4,
    "total_ms": 1251.3
  },
  "tokens_used": 487
}
```

```
docrank/
├── src/
│   ├── ingestion/        # Document loading, recursive chunking
│   ├── retrieval/        # BM25, FAISS dense, hybrid RRF
│   ├── reranking/        # Cross-encoder
│   ├── generation/       # Context builder, LLM client
│   ├── evaluation/       # Retrieval metrics, gen metrics, LLM judge
│   ├── pipeline/         # Indexing + RAG inference
│   └── serving/          # FastAPI + Prometheus middleware
├── scripts/
│   ├── index_corpus.py   # CLI indexer (scifact / directory / jsonl)
│   └── evaluate.py       # CLI evaluation runner
├── tests/
├── monitoring/
├── docker-compose.yml
└── Makefile
```
```bash
cp .env.example .env   # optionally add OPENAI_API_KEY
make docker-up         # API :8000, Prometheus :9090, Grafana :3000
docker-compose exec api python scripts/index_corpus.py --source scifact
```

MIT