A search evaluation system that compares different search providers using a high-performance Rust reranker with token pruning. The system runs ablation studies to measure how late-interaction scoring and token pruning affect search quality.
This tool lets you:
- Compare search results from different providers (DuckDuckGo, Wikipedia, etc.)
- Rerank results using late-interaction MaxSim scoring instead of simple cosine similarity
- Test how token pruning affects both speed and quality
- Generate detailed reports with performance metrics
The system includes demo components for testing:
- Baseline Provider: Generates 15 dummy search results with titles like "Baseline Result 1 for 'query'"
- Mock Embedder: Creates random normalized embeddings when sentence-transformers isn't available
- These allow you to test the reranking pipeline without requiring real search APIs or embedding models
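The mock embedder's behavior can be sketched as follows (a minimal stand-in with a hypothetical function name and signature; the real fallback presumably lives in `embed.py`): one random unit-norm vector per whitespace token, so the reranking pipeline can be exercised without sentence-transformers installed.

```python
import numpy as np

def mock_embed(texts, dim=384, seed=0):
    """Return one random, L2-normalized embedding per whitespace token
    for each input text. Deterministic for a fixed seed."""
    rng = np.random.default_rng(seed)
    embeddings = []
    for text in texts:
        n_tokens = max(len(text.split()), 1)
        emb = rng.standard_normal((n_tokens, dim)).astype(np.float32)
        emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-norm rows
        embeddings.append(emb)
    return embeddings
```

Because every row is unit-norm, downstream dot products behave like cosine similarities, just as they would with real embeddings.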
- Rust 1.70+
- Python 3.8+
- Git
# Clone and navigate to the service directory
git clone <repository-url>
cd LateInteractionReranker/service
# Set up Python environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Terminal 1 - Start the reranker service:
cd ranker-rs
cargo run --release

Terminal 2 - Run an evaluation:
cd orchestrator
python run.py --q "test query" --providers "baseline" --topk 5

The system will automatically run ablation studies comparing different configurations and generate reports.
Query: test query
Protocol: both (pairwise N=5 trials, pointwise rubric)
Providers: DDG, Wikipedia
Late: on/off  Prune: none/16-64/8-32
Ablation (DDG)
Late  Prune  rel@5  ent_cov  rerank_total_ms  per_doc_p95_µs  total_ms
----------------------------------------------------------------------
off   —      0.829  9        0.3              120             26846
on    none   0.807  10       3.8              980             18144
on    16/64  0.804  10       2.1              540             21529
on    8/32   0.806  9        1.5              410             24806
✅ Evaluation complete! Report saved to report.md
python run.py --q "your query" \
--providers "ddg,wikipedia" \
--topk 5 \
--judge heuristic \
--protocol both

| Option | Description | Default |
|---|---|---|
| --q | Search query | Required |
| --providers | Which search providers to use | "ddg,wikipedia" |
| --topk | Number of results to return | 5 |
| --judge | Evaluation method (heuristic or llm) | "heuristic" |
| --protocol | Evaluation type (pointwise, pairwise, both) | "both" |
Query → Search Providers → Embedding → Rust Reranker → Evaluation → Reports
- Search: Queries multiple search providers (DuckDuckGo, Wikipedia, etc.)
- Embed: Converts text to vectors using sentence-transformers
- Rerank: Uses Rust service for fast late-interaction scoring with token pruning
- Evaluate: Compares results using heuristic or LLM-based scoring
- Report: Generates markdown and JSON reports with performance metrics
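The first three stages can be sketched as one function (hypothetical glue code; the real wiring lives in orchestrator/run.py, and the evaluation/report stages are omitted for brevity):

```python
def run_pipeline(query, search, embed, score, topk=5):
    """Search -> embed -> rerank, with each stage passed in as a callable
    (standing in for providers.py, embed.py, and the Rust reranker)."""
    docs = search(query)                          # 1. query search providers
    q_emb = embed(query)                          # 2. embed the query
    scored = [(doc, score(q_emb, embed(doc)))     # 3. score each document
              for doc in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topk]                          # top-k goes to evaluation
```

Passing the stages as callables keeps the glue testable: any of them can be swapped for the demo components (baseline provider, mock embedder) without touching the pipeline.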
Instead of comparing single vectors, the system:
- Breaks queries and documents into token-level embeddings
- Uses MaxSim scoring:
  score = Σᵢ maxⱼ (Q[i] · D[j])
- Prunes tokens to keep only the most important ones (16 query, 64 doc tokens)
To keep things fast, the system only keeps the most salient tokens:
salience = idf(token) × ||embedding||₂
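Putting the two formulas together, a minimal NumPy sketch (function names are hypothetical; the production implementation is in ranker-rs/src/scoring.rs):

```python
import numpy as np

def maxsim(Q, D):
    """Late-interaction score: for each query token, take the best match
    over all document tokens, then sum — score = Σᵢ maxⱼ (Q[i] · D[j])."""
    return float((Q @ D.T).max(axis=1).sum())

def prune_tokens(emb, idf, keep):
    """Keep the `keep` most salient token embeddings, where
    salience = idf(token) × ||embedding||₂. Token order is preserved."""
    salience = idf * np.linalg.norm(emb, axis=1)
    top = np.argsort(salience)[-keep:]
    return emb[np.sort(top)]
```

With the default budget, prune_tokens would be applied with keep=16 to the query and keep=64 to each document before calling maxsim, which is what the 16/64 row of the ablation table measures.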
service/
├── ranker-rs/ # Rust reranker service
│ ├── src/
│ │ ├── main.rs # HTTP server
│ │ ├── scoring.rs # MaxSim + token pruning
│ │ └── lib.rs
│ └── Cargo.toml
├── orchestrator/ # Python orchestrator
│ ├── run.py # Main entry point
│ ├── providers.py # Search providers
│ ├── embed.py # Text embedding
│ ├── judge.py # Result evaluation
│ ├── report.py # Report generation
│ └── utils.py # Utilities
└── requirements.txt
Port 8088 already in use:
lsof -i :8088 # Find what's using the port
# Kill the process or change the port in ranker-rs/src/main.rs

Python dependencies fail:
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir

Out of memory:
- Reduce --topk (try 5 instead of 20)
- Use --judge heuristic instead of llm
Typical performance on modern hardware:
- Reranking: ~3-6ms for 100 documents
- Token pruning: 3-5x speedup vs naive approach
- Quality: 10-20% better relevance@5 vs single-vector cosine
Build Rust service:
cd ranker-rs
cargo build --release
cargo test

Run Python tests:
python -m pytest orchestrator/

Add new search provider:
- Implement the BaseProvider interface in providers.py
- Add it to the get_provider() factory function
- Update CLI argument parsing
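A new provider might look like this (a sketch only; the actual BaseProvider method names and result type in providers.py may differ):

```python
from dataclasses import dataclass

@dataclass
class SearchResult:
    """Hypothetical result record; mirror whatever providers.py defines."""
    title: str
    url: str
    snippet: str

class BaseProvider:
    name = "base"

    def search(self, query: str, topk: int = 5) -> list:
        raise NotImplementedError

class StaticProvider(BaseProvider):
    """Canned results, in the spirit of the built-in Baseline Provider."""
    name = "static"

    def search(self, query: str, topk: int = 5) -> list:
        return [SearchResult(title=f"Static Result {i + 1} for '{query}'",
                             url=f"https://example.com/{i + 1}",
                             snippet=f"Dummy snippet {i + 1}")
                for i in range(topk)]
```

Once registered in the get_provider() factory and exposed through the CLI, it would be selectable via --providers "static".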
Set these environment variables if you want to use external APIs:
- EXA_API_KEY: For the Exa search provider
- OPENAI_API_KEY: For LLM-based evaluation
- ANTHROPIC_API_KEY: For LLM-based evaluation
MIT License - see LICENSE file for details.