RAG Retrieval & Generation Evaluation System

A comprehensive evaluation framework for Retrieval-Augmented Generation (RAG) systems using Arize Phoenix for observability and LangChain with Together AI for LLM-based relevance scoring.

Features

🎯 Binary Relevance Evaluation: Each retrieved chunk is scored as 0 (not relevant) or 1 (relevant) using an LLM
📊 Comprehensive Metrics: Precision, Recall, Accuracy, F1-Score, and confusion matrix analysis
🔍 Phoenix Integration: Real-time tracing and observability with Arize Phoenix
🤖 LangChain + Together AI: Leverages Together AI's fast inference with LangChain
📈 Detailed Reports: HTML, CSV, and JSON output formats
✅ Full Test Coverage: Comprehensive unit and integration tests
✅ Docker Container: Docker File along with yml
✅ CI/CD pipeline: CI/CD pipeline integrated using github workflows

Architecture

┌─────────────────┐
│   User Query    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  RAG Retrieval  │ ← Your retrieval system
└────────┬────────┘
         │
         ▼
┌─────────────────────────────┐
│  LLM-Based Relevance Eval   │ ← This system
│  (Together AI + LangChain)  │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│  Metrics Calculation        │
│  (Precision, Recall, F1)    │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│  Phoenix Tracing & Reports  │
└─────────────────────────────┘

Installation

Prerequisites

Python 3.8+
Together AI API key

Setup

Clone the repository

git clone <your-repo-url>
cd rag-evaluation

Create virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies

pip install -r requirements.txt

Set up environment variables

# Create .env file
cat > .env << EOF
TOGETHER_API_KEY=your_together_api_key_here
LLM_MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
RELEVANCE_THRESHOLD=0.7
ENABLE_PHOENIX=true
EOF

Project Structure

rag-evaluation/
├── evaluator.py           # Main evaluation logic
├── config.py              # Configuration management
├── utils.py               # Utility functions
├── test_evaluator.py      # Unit tests
├── requirements.txt       # Python dependencies
├── .env                   # Environment variables (create this)
├── README.md              # This file
├── example_usage.py       # Example usage script
└── results/               # Output directory (auto-created)
    ├── retrieval_evaluation_results.csv
    ├── evaluation_metrics.csv
    └── evaluation_report.html

Usage

Basic Usage

import os
from evaluator import RAGRetrievalEvaluator, PhoenixRAGTracer, create_sample_data

# Initialize evaluator
together_api_key = os.getenv("TOGETHER_API_KEY")
evaluator = RAGRetrievalEvaluator(together_api_key)

# Initialize Phoenix tracer
phoenix_tracer = PhoenixRAGTracer()

# Prepare your data
queries = ["What is machine learning?"]
retrieved_chunks = [
    [
        "Machine learning is a subset of AI...",
        "Pizza is a popular food...",
        "ML algorithms learn from data..."
    ]
]
ground_truth_labels = [[1, 0, 1]]  # Optional

# Run evaluation
results = evaluator.evaluate_retrieval_batch(
    queries=queries,
    retrieved_chunks=retrieved_chunks,
    ground_truth_labels=ground_truth_labels
)

# Calculate metrics
metrics = evaluator.calculate_metrics(results)

print(f"Precision: {metrics.precision:.4f}")
print(f"Recall: {metrics.recall:.4f}")
print(f"F1 Score: {metrics.f1_score:.4f}")

# Log to Phoenix
results_df = phoenix_tracer.log_retrieval_results(results)

Running the Demo

# Run the main evaluation script
python evaluator.py

This will:

Start Phoenix UI (available at http://localhost:6006)
Run evaluation on sample data
Generate metrics and reports
Save results to CSV files

Custom Data Format

CSV Format (for utils.load_ground_truth_from_csv):

query,chunk,relevance_label
"What is ML?","Machine learning is...",1
"What is ML?","Pizza is food...",0

Python Format:

queries = ["query1", "query2"]
retrieved_chunks = [
    ["chunk1_for_q1", "chunk2_for_q1"],
    ["chunk1_for_q2", "chunk2_for_q2"]
]
ground_truth_labels = [
    [1, 0],  # Labels for query1 chunks
    [1, 1]   # Labels for query2 chunks
]

Configuration

Environment Variables

Variable	Description	Default
`TOGETHER_API_KEY`	Together AI API key	Required
`LLM_MODEL`	Model name	`meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo`
`LLM_TEMPERATURE`	LLM temperature	`0.0`
`RELEVANCE_THRESHOLD`	Relevance threshold	`0.7`
`ENABLE_PHOENIX`	Enable Phoenix tracing	`true`
`PHOENIX_PORT`	Phoenix UI port	`6006`
`RESULTS_DIR`	Output directory	`results`

Programmatic Configuration

from config import RAGEvalConfig, LLMConfig, EvaluationConfig

config = RAGEvalConfig(
    llm=LLMConfig(
        provider="together",
        model_name="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
        temperature=0.0
    ),
    evaluation=EvaluationConfig(
        relevance_threshold=0.8,
        batch_size=20
    )
)

Metrics Explained

Binary Classification Metrics

Precision: Of all chunks marked as relevant, what percentage were actually relevant?
```
Precision = TP / (TP + FP)
```
Recall: Of all actually relevant chunks, what percentage did we identify?
```
Recall = TP / (TP + FN)
```

Accuracy: Overall correctness of predictions

Accuracy = (TP + TN) / (TP + TN + FP + FN)

F1 Score: Harmonic mean of precision and recall

F1 = 2 * (Precision * Recall) / (Precision + Recall)

Confusion Matrix

                    Predicted
                    Relevant    Not Relevant
Actual  Relevant    TP          FN
        Not Relevant FP          TN

Phoenix Observability

Phoenix provides real-time observability for your RAG evaluation:

Start Phoenix: Automatically starts when running the evaluator
Access UI: Navigate to http://localhost:6006
View Traces: See all LLM calls, latencies, and relevance decisions
Analyze Patterns: Identify systematic errors and biases

Testing

# Run all tests
pytest test_evaluator.py -v

# Run specific test class
pytest test_evaluator.py::TestRAGRetrievalEvaluator -v

# Run with coverage
pytest test_evaluator.py --cov=evaluator --cov-report=html

Advanced Usage

Custom Relevance Prompt

from langchain.prompts import PromptTemplate

custom_prompt = PromptTemplate(
    template="""Evaluate if this document answers the query.
Query: {query}
Document: {document}

Answer only 1 (relevant) or 0 (not relevant):""",
    input_variables=["query", "document"]
)

evaluator.relevance_prompt = custom_prompt
evaluator.chain = custom_prompt | evaluator.llm

Batch Processing with Progress

from tqdm import tqdm

results = []
for query, chunks in tqdm(zip(queries, retrieved_chunks), total=len(queries)):
    batch_results = evaluator.evaluate_retrieval_batch(
        queries=[query],
        retrieved_chunks=[chunks]
    )
    results.extend(batch_results)

Per-Query Analysis

from utils import calculate_metrics_by_query

query_metrics = calculate_metrics_by_query(results_df)
print(query_metrics)

Error Pattern Analysis

from utils import analyze_error_patterns

error_analysis = analyze_error_patterns(results_df)
print(f"False Positives: {error_analysis['total_false_positives']}")
print(f"False Negatives: {error_analysis['total_false_negatives']}")

Output Files

`retrieval_evaluation_results.csv`

Contains all evaluation results with columns:

query: The user query
retrieved_chunk: The retrieved document chunk
relevance_score: Confidence score (0.0-1.0)
is_relevant: Binary prediction (0 or 1)
ground_truth: Actual label (if provided)

`evaluation_metrics.csv`

Contains overall metrics:

Precision, Recall, Accuracy, F1-Score
True/False Positives/Negatives

`evaluation_report.html`

Interactive HTML report with tables and metrics

Troubleshooting

Common Issues

1. Phoenix not starting

# Check if port 6006 is available
lsof -i :6006

# Use different port
export PHOENIX_PORT=6007

2. Together AI rate limits

# Add delay between requests
import time
time.sleep(0.1)  # 100ms delay

3. Out of memory with large batches

# Reduce batch size
config.evaluation.batch_size = 5

Best Practices

Start with small samples: Test with 5-10 queries first
Use temperature=0: For consistent relevance evaluation
Validate ground truth: Ensure labels are accurate
Monitor Phoenix: Check for systematic biases
Iterate on prompts: Customize relevance prompt for your domain
Track metrics over time: Store results for trend analysis

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Add tests for new functionality
Ensure all tests pass
Submit a pull request

License

MIT License - see LICENSE file for details

Acknowledgments

Arize Phoenix - Observability platform
LangChain - LLM framework
Together AI - Fast LLM inference

Support

For issues and questions:

Open a GitHub issue
Check the documentation
Review example scripts in the repository

Roadmap

Support for additional LLM providers (OpenAI, Anthropic)
Advanced Phoenix evaluations (hallucination, toxicity)
Multi-metric optimization
A/B testing framework
Real-time streaming evaluation
Vector store integration

Happy Evaluating! 🚀

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.github/workflows		.github/workflows
Documentation		Documentation
Package_management		Package_management
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
config.py		config.py
docker-compose.yml		docker-compose.yml
evaluator.py		evaluator.py
evaluator_no_phoenix.py		evaluator_no_phoenix.py
example_usage.py		example_usage.py
test_evaluator.py		test_evaluator.py
utils.py		utils.py

Folders and files

Latest commit

History

Repository files navigation

RAG Retrieval & Generation Evaluation System

Features

Architecture

Installation

Prerequisites

Setup

Project Structure

Usage

Basic Usage

Running the Demo

Custom Data Format

Configuration

Environment Variables

Programmatic Configuration

Metrics Explained

Binary Classification Metrics

Confusion Matrix

Phoenix Observability

Testing

Advanced Usage

Custom Relevance Prompt

Batch Processing with Progress

Per-Query Analysis

Error Pattern Analysis

Output Files

retrieval_evaluation_results.csv

evaluation_metrics.csv

evaluation_report.html

Troubleshooting

Common Issues

Best Practices

Contributing

License

Acknowledgments

Support

Roadmap

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`retrieval_evaluation_results.csv`

`evaluation_metrics.csv`

`evaluation_report.html`

Packages