# Burro

Burro is a powerful command-line interface (CLI) tool for evaluating Large Language Model (LLM) outputs. It provides both heuristic and LLM-based evaluation methods with secure API key management.
> For detailed feature documentation, see FEATURES.md

## Features

### Heuristic Evaluations (No API key required)
- Levenshtein Distance - Measure string similarity using edit distance
- Exact Match - Perfect matching for IDs, codes, and specific formats
- Case Insensitive Match - Flexible text matching
- Numeric Difference - Compare numerical values with configurable tolerance
- JSON Diff - Analyze structural differences in JSON outputs
- Jaccard Similarity - Calculate similarity between sets of tokens
- Contains - Verify if expected value appears in output
### LLM-as-a-Judge Evaluations (Requires OpenAI API key)
- Factuality - Answer correctness with context validation
- Close QA - Close-ended question matching
- Battle - Compare outputs from different models head-to-head
- Summarization - Evaluate summary quality and accuracy
- SQL - Verify correctness of generated SQL queries
- Translation - Assess translation quality across languages
- Secure OpenAI API key management with AES encryption
- Progress indicators for long-running evaluations
- Export results to JSON format
- Comprehensive error handling and validation
- Cross-platform support (Mac, Linux, Windows)
- Fast execution with Deno runtime
## Prerequisites

- For heuristic evaluations: none! Works out of the box
- For LLM-based evaluations: an OpenAI API key
## Installation

**macOS (Apple Silicon):**

```bash
sudo curl -L "https://github.com/thisguymartin/burro/releases/download/latest/build-mac-silicon" -o /usr/local/bin/burro && sudo chmod +x /usr/local/bin/burro
```

**macOS (Intel):**

```bash
sudo curl -L "https://github.com/thisguymartin/burro/releases/download/latest/build-mac-intel" -o /usr/local/bin/burro && sudo chmod +x /usr/local/bin/burro
```

**Linux (ARM):**

```bash
sudo curl -L "https://github.com/thisguymartin/burro/releases/download/latest/build-linux-arm" -o /usr/local/bin/burro && sudo chmod +x /usr/local/bin/burro
```

**Linux (Intel):**

```bash
sudo curl -L "https://github.com/thisguymartin/burro/releases/download/latest/build-linux-intel" -o /usr/local/bin/burro && sudo chmod +x /usr/local/bin/burro
```

**Windows:**

1. Download `build-windows.exe` from the releases page
2. Rename it to `burro.exe`
3. Move it to your desired location (e.g., `C:\Program Files\burro\burro.exe`)
## Quick Start

Set your OpenAI API key (only needed for LLM-based evaluations):

```bash
burro set-openai-key
```

Heuristic evaluation (no API key needed):

```bash
burro run-eval -t exact example/exact-match.json
```

LLM-based evaluation:

```bash
burro run-eval -t factuality example/evals.json
```

With progress indicators and result export:

```bash
burro run-eval -t levenshtein example/levenshtein.json --progress -p
```

## Evaluation Types

### Levenshtein Distance

Measures string similarity using edit distance. Great for catching typos and minor variations.
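To make the metric concrete, here is an illustrative sketch of the classic dynamic-programming edit-distance computation (not Burro's actual implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn string a into string b."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

print(levenshtein("William Shakespear", "William Shakespeare"))  # 1
```

A distance of 1 (one missing letter) would score far better than a completely different answer, which is what makes this metric forgiving of typos.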
Example: `example/levenshtein.json`

```json
[
  {
    "input": "Who wrote Hamlet?",
    "output": "William Shakespear",
    "expected": "William Shakespeare"
  }
]
```

Run:

```bash
burro run-eval -t levenshtein example/levenshtein.json
```

### Exact Match

Perfect matching for critical data like IDs, codes, or specific formats.
Example: `example/exact-match.json`

```json
[
  {
    "input": "What is the ISO code for United States?",
    "output": "US",
    "expected": "US"
  }
]
```

Run:

```bash
burro run-eval -t exact example/exact-match.json
```

### Numeric Difference

Compare numerical values with configurable tolerance.
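As a sketch of how a tolerance check of this kind typically works (Burro's exact scoring may differ), a pass means the absolute difference is within the configured tolerance:

```python
def numeric_match(output: str, expected: str, tolerance: float = 0.0) -> bool:
    # Pass when |output - expected| is within the configured tolerance.
    return abs(float(output) - float(expected)) <= tolerance

print(numeric_match("3.14", "3.14159", tolerance=0.01))  # True: |3.14 - 3.14159| = 0.00159
```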
Example: `example/numeric.json`

```json
[
  {
    "input": "What is the value of Pi to 2 decimal places?",
    "output": "3.14",
    "expected": "3.14159",
    "tolerance": 0.01
  }
]
```

Run:

```bash
burro run-eval -t numeric example/numeric.json
```

### JSON Diff

Analyze structural differences in JSON outputs.
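The idea is to parse both values and compare structure rather than raw strings, so key order and whitespace don't matter. A minimal illustrative sketch (not Burro's implementation; lists here are compared wholesale for brevity):

```python
import json

def json_diffs(output, expected, path="$"):
    """Recursively collect the paths at which two parsed JSON values differ."""
    if isinstance(output, dict) and isinstance(expected, dict):
        diffs = []
        for key in sorted(set(output) | set(expected)):
            if key not in output:
                diffs.append(f"{path}.{key}: missing in output")
            elif key not in expected:
                diffs.append(f"{path}.{key}: unexpected key")
            else:
                diffs.extend(json_diffs(output[key], expected[key], f"{path}.{key}"))
        return diffs
    if output != expected:
        return [f"{path}: {output!r} != {expected!r}"]
    return []

out = json.loads('{"name": "John Doe", "age": 30}')
exp = json.loads('{"age": 31, "name": "John Doe"}')
print(json_diffs(out, exp))  # ['$.age: 30 != 31']
```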
Example: `example/json-diff.json`

```json
[
  {
    "input": "Convert user data to JSON",
    "output": "{\"name\": \"John Doe\", \"age\": 30}",
    "expected": "{\"name\": \"John Doe\", \"age\": 30}"
  }
]
```

Run:

```bash
burro run-eval -t json example/json-diff.json
```

### Jaccard Similarity

Calculate similarity between sets of tokens.
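Jaccard similarity is the size of the intersection of the two token sets divided by the size of their union. An illustrative sketch, assuming simple whitespace tokenization (Burro's tokenizer may differ):

```python
def jaccard(output: str, expected: str) -> float:
    # Tokenize on whitespace and compare as sets: |A ∩ B| / |A ∪ B|.
    a, b = set(output.split()), set(expected.split())
    return len(a & b) / len(a | b) if a | b else 1.0

score = jaccard(
    "JavaScript TypeScript Python Ruby PHP Java",
    "JavaScript Python Ruby PHP Go",
)
print(round(score, 3))  # 4 shared tokens, 7 in the union -> 0.571
```

Because it works on sets, the score ignores token order and duplicates, which makes it a good fit for keyword and tag comparisons.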
Example: `example/jaccard.json`

```json
[
  {
    "input": "List programming languages for web development",
    "output": "JavaScript TypeScript Python Ruby PHP Java",
    "expected": "JavaScript Python Ruby PHP Go"
  }
]
```

Run:

```bash
burro run-eval -t jaccard example/jaccard.json
```

The remaining evaluation types use an LLM as a judge and require an OpenAI API key.

### Factuality

Evaluate answer correctness with context validation.
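Conceptually, an LLM-as-a-judge evaluation sends the question, the model's output, and the expected answer to a grading model. The sketch below shows one hypothetical way such a judge prompt might be assembled; the prompt wording, grade labels, and the commented-out model name are illustrative assumptions, not Burro's actual internals:

```python
def build_factuality_prompt(question: str, output: str, expected: str) -> str:
    """Assemble a judge prompt asking an LLM to grade factual correctness.
    Hypothetical prompt for illustration only."""
    return (
        "You are grading an answer for factual correctness.\n"
        f"Question: {question}\n"
        f"Submitted answer: {output}\n"
        f"Expected answer: {expected}\n"
        "Does the submitted answer contain the expected answer's facts? "
        "Reply with one of: CORRECT, INCORRECT, PARTIAL."
    )

prompt = build_factuality_prompt(
    "What is the capital of France?",
    "The capital city of France is Paris",
    "Paris",
)
# The prompt would then be sent to a grading model, e.g. with the openai package:
# client.chat.completions.create(model="gpt-4o-mini",
#                                messages=[{"role": "user", "content": prompt}])
print(prompt)
```

This is why factuality scoring can accept "The capital city of France is Paris" as matching the expected "Paris": the judge evaluates meaning, not exact strings.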
Example: `example/evals.json`

```json
[
  {
    "input": "What is the capital of France?",
    "output": "The capital city of France is Paris",
    "expected": "Paris"
  }
]
```

Run:

```bash
burro run-eval -t factuality example/evals.json
```

### Close QA

Exact matching for close-ended questions.
Example: `example/closeqa.json`

```json
[
  {
    "input": "List the first three prime numbers",
    "output": "2,3,5",
    "criteria": "Numbers must be in correct order, separated by commas"
  }
]
```

Run:

```bash
burro run-eval -t closeqa example/closeqa.json
```

### Battle

Compare outputs from different models head-to-head.
Example: `example/battle.json`

```json
[
  {
    "input": "Write a haiku about technology",
    "output": "Code flows like water\nBits and bytes dance in rhythm\nDigital zen speaks",
    "expected": "Silicon pathways\nData streams through endless night\nMachines dream in code"
  }
]
```

Run:

```bash
burro run-eval -t battle example/battle.json
```

### Summarization

Evaluate the quality and accuracy of text summaries.
Example: `example/summarization.json`

```json
[
  {
    "input": "Summarize this text",
    "output": "Climate change impacts polar regions; urgent global action needed.",
    "context": "Long article about climate change effects..."
  }
]
```

Run:

```bash
burro run-eval -t summarization example/summarization.json
```

### SQL

Verify the correctness of generated SQL queries.
Example: `example/sql.json`

```json
[
  {
    "input": "Find all users over age 18",
    "output": "SELECT * FROM users WHERE age > 18;",
    "expected": "SELECT * FROM users WHERE age > 18;",
    "context": "Database schema: users(id, name, email, age)"
  }
]
```

Run:

```bash
burro run-eval -t sql example/sql.json
```

### Translation

Assess translation quality across languages.
Example: `example/translation.json`

```json
[
  {
    "input": "Translate 'Hello' to Spanish",
    "output": "Hola",
    "expected": "Hola"
  }
]
```

Run:

```bash
burro run-eval -t translation example/translation.json
```

See SCENARIOS.md for comprehensive real-world examples, including:
- Customer Support Bot Evaluation
- Code Generation Validation
- Translation Quality Assessment
- Chatbot Response Comparison
- Data Extraction Accuracy
- Educational Content Assessment
## Security

- AES-256 encryption for API key storage
- Secure key generation using the Web Crypto API
- Encrypted SQLite storage for settings
- No plaintext secrets ever stored on disk
For long-running evaluations:

```bash
burro run-eval -t factuality large-dataset.json --progress
```

Save evaluation results to JSON:

```bash
burro run-eval -t battle comparison.json -p
# Results saved to ~/Downloads/comparison.json-result.json
```

Run multiple evaluations sequentially:
```bash
burro run-eval -t exact tests/ids.json
burro run-eval -t factuality tests/qa.json
burro run-eval -t sql tests/queries.json
```

### Choosing an Evaluation Type

| Use Case | Recommended Type | Why? |
|---|---|---|
| Order IDs, Product Codes | `exact` | Requires perfect match |
| User Questions | `factuality` | Needs semantic understanding |
| Price Calculations | `numeric` | Allows tolerance |
| Model A/B Testing | `battle` | Direct comparison |
| API Responses | `json` | Structure validation |
| Spelling Variations | `levenshtein` | Fuzzy matching |
| Keywords/Tags | `jaccard` | Set similarity |
| Summaries | `summarization` | Quality assessment |
| SQL Queries | `sql` | Syntax + logic validation |
| Translations | `translation` | Language expertise |
To determine which version to download, run:

```bash
uname -m
```

**macOS:** `arm64` → use the Apple Silicon build (M1/M2/M3); `x86_64` → use the Intel build

**Linux:** `aarch64` or `arm64` → use the ARM build; `x86_64` → use the Intel build
If the `burro` command is not found:

```bash
sudo chmod +x /usr/local/bin/burro
```

- Verify the installation location is in your PATH
- Restart your terminal
- Check that the executable exists:

```bash
ls -l /usr/local/bin/burro
```
If your API key is rejected:

```bash
burro set-openai-key  # Re-enter your API key
```

If scores look wrong:

- Exact match: check for extra spaces or case differences
- Numeric: adjust tolerance values
- JSON: ensure consistent formatting
- Factuality: make expected answers less specific
To uninstall on macOS/Linux:

```bash
sudo rm /usr/local/bin/burro
which burro  # Should return nothing
```

On Windows:

- Delete `burro.exe` from its installation location
- Remove it from PATH if added
All evaluation types have examples in the `/example` directory:

```
example/
├── closeqa.json        # Close-ended QA
├── evals.json          # Factuality evaluation
├── levenshtein.json    # String similarity
├── exact-match.json    # Exact matching
├── numeric.json        # Numeric comparison
├── json-diff.json      # JSON structure diff
├── jaccard.json        # Token similarity
├── battle.json         # Model comparison
├── summarization.json  # Summary quality
├── sql.json            # SQL validation
└── translation.json    # Translation quality
```
1. Install Burro using the command for your platform
2. Try a heuristic evaluation (no API key needed): `burro run-eval -t exact example/exact-match.json`
3. Set up your API key for LLM evaluations: `burro set-openai-key`
4. Run an LLM evaluation: `burro run-eval -t factuality example/evals.json`
5. Explore scenarios in SCENARIOS.md
6. Create your own evaluation files based on the examples
- Start with heuristics - They're fast and free
- Use the right tool - Match evaluation type to your use case
- Build incrementally - Start with 5-10 test cases
- Version your tests - Track evaluation files in git
- Automate regularly - Run evaluations as part of your workflow
- Compare methods - Try multiple evaluation types on the same data
- Check examples - Learn from the provided examples in the `/example` directory
- FEATURES.md - Complete feature documentation and technical details
- SCENARIOS.md - Real-world use cases and examples
- `/example` - Sample evaluation files for all types
- README.md - This file, quick start guide and overview
Contributions welcome! Please feel free to submit a Pull Request.
MIT License - feel free to use Burro in your projects!
Ready to evaluate your LLM outputs?
- Install Burro for your platform
- Pick an evaluation type that matches your needs
- Create or use an example evaluation file
- Run your first evaluation
- Iterate and improve!
Need help? Check the issues page or review the examples!
Made with ❤️ for LLM developers and evaluators