
Burro 🫏

Burro is a powerful command-line interface (CLI) tool for evaluating Large Language Model (LLM) outputs. It provides both heuristic and LLM-based evaluation methods with secure API key management.

License: MIT | Built with Deno

🚀 Features

📖 For detailed feature documentation, see FEATURES.md

Evaluation Methods

📊 Heuristic Evaluations (No API key required)

  • Levenshtein Distance - Measure string similarity by edit distance
  • Exact Match - Perfect matching for IDs, codes, and other fixed formats
  • Case Insensitive Match - Text matching that ignores letter case
  • Numeric Difference - Compare numerical values with a configurable tolerance
  • JSON Diff - Analyze structural differences between JSON outputs
  • Jaccard Similarity - Measure overlap between sets of tokens
  • Contains - Check whether the expected value appears in the output

🤖 LLM-as-a-Judge Evaluations (Requires OpenAI API key)

  • Factuality - Answer correctness with context validation
  • Close QA - Close-ended question matching
  • Battle - Compare outputs from different models head-to-head
  • Summarization - Evaluate summary quality and accuracy
  • SQL - Verify correctness of generated SQL queries
  • Translation - Assess translation quality across languages

Additional Features

  • 🔒 Secure OpenAI API key management with AES encryption
  • 📈 Progress indicators for long-running evaluations
  • 💾 Export results to JSON format
  • 🎯 Comprehensive error handling and validation
  • 🚀 Cross-platform support (Mac, Linux, Windows)
  • ⚡ Fast execution with the Deno runtime

📋 Prerequisites

  • For Heuristic Evaluations: None! Works out of the box
  • For LLM-based Evaluations: OpenAI API key

πŸ› οΈ Installation

MacOS - Apple Silicon (M1/M2/M3)

sudo curl -L "https://github.com/thisguymartin/burro/releases/download/latest/build-mac-silicon" -o /usr/local/bin/burro && sudo chmod +x /usr/local/bin/burro

MacOS - Intel

sudo curl -L "https://github.com/thisguymartin/burro/releases/download/latest/build-mac-intel" -o /usr/local/bin/burro && sudo chmod +x /usr/local/bin/burro

Linux - ARM

sudo curl -L "https://github.com/thisguymartin/burro/releases/download/latest/build-linux-arm" -o /usr/local/bin/burro && sudo chmod +x /usr/local/bin/burro

Linux - Intel

sudo curl -L "https://github.com/thisguymartin/burro/releases/download/latest/build-linux-intel" -o /usr/local/bin/burro && sudo chmod +x /usr/local/bin/burro

Windows

  1. Download build-windows.exe from the releases page
  2. Rename it to burro.exe
  3. Move it to your desired location (e.g., C:\Program Files\burro\burro.exe)

🔧 Quick Start

1. Set up API Key (for LLM-based evaluations only)

burro set-openai-key

2. Run Your First Evaluation

Heuristic Evaluation (no API key needed):

burro run-eval -t exact example/exact-match.json

LLM-based Evaluation:

burro run-eval -t factuality example/evals.json

With progress indicators and result export:

burro run-eval -t levenshtein example/levenshtein.json --progress -p

📊 Evaluation Types Guide

Heuristic Evaluations

Levenshtein Distance

Measures string similarity using edit distance. Great for catching typos and minor variations.

Example: example/levenshtein.json

[
  {
    "input": "Who wrote Hamlet?",
    "output": "William Shakespear",
    "expected": "William Shakespeare"
  }
]

Run:

burro run-eval -t levenshtein example/levenshtein.json
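For intuition about what this metric computes, here is a sketch of the standard dynamic-programming edit distance in TypeScript (an illustration of the algorithm, not Burro's internal code):

```typescript
// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn `a` into `b`.
function levenshtein(a: string, b: string): number {
  // prev[j] holds the distance between the processed prefix of `a`
  // and the first j characters of `b`.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution (free if characters match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// "William Shakespear" is one insertion away from "William Shakespeare".
console.log(levenshtein("William Shakespear", "William Shakespeare")); // 1
```

A distance of 1 on the example above is why this evaluation forgives the missing final "e" that an exact match would reject.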

Exact Match

Perfect matching for critical data like IDs, codes, or specific formats.

Example: example/exact-match.json

[
  {
    "input": "What is the ISO code for United States?",
    "output": "US",
    "expected": "US"
  }
]

Run:

burro run-eval -t exact example/exact-match.json
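Conceptually, exact match is a strict string comparison, and the case-insensitive variant simply lowercases both sides first. A minimal sketch (not Burro's actual implementation):

```typescript
// Strict comparison: whitespace and case both matter.
const exactMatch = (output: string, expected: string): boolean =>
  output === expected;

// Case-insensitive variant: lowercase both sides before comparing.
const caseInsensitiveMatch = (output: string, expected: string): boolean =>
  output.toLowerCase() === expected.toLowerCase();

console.log(exactMatch("US", "US"));           // true
console.log(exactMatch("us", "US"));           // false
console.log(caseInsensitiveMatch("us", "US")); // true
```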

Numeric Difference

Compare numerical values with configurable tolerance.

Example: example/numeric.json

[
  {
    "input": "What is the value of Pi to 2 decimal places?",
    "output": "3.14",
    "expected": "3.14159",
    "tolerance": 0.01
  }
]

Run:

burro run-eval -t numeric example/numeric.json
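The tolerance check can be thought of as "pass when the absolute difference is within the allowed bound". A rough sketch of the idea (not Burro's actual code):

```typescript
// Pass when |output - expected| is within the configured tolerance.
function withinTolerance(
  output: string,
  expected: string,
  tolerance: number,
): boolean {
  const diff = Math.abs(parseFloat(output) - parseFloat(expected));
  return diff <= tolerance;
}

// |3.14 - 3.14159| = 0.00159, which is within the 0.01 tolerance above.
console.log(withinTolerance("3.14", "3.14159", 0.01)); // true
```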

JSON Diff

Analyze structural differences in JSON outputs.

Example: example/json-diff.json

[
  {
    "input": "Convert user data to JSON",
    "output": "{\"name\": \"John Doe\", \"age\": 30}",
    "expected": "{\"name\": \"John Doe\", \"age\": 30}"
  }
]

Run:

burro run-eval -t json example/json-diff.json
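The key property of a structural diff is that it compares parsed values, not raw strings, so key order and incidental whitespace do not matter. A minimal deep-equality sketch in TypeScript (illustrative only, not Burro's implementation):

```typescript
// Recursive structural comparison of two parsed JSON values.
function jsonEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true; // primitives and identical references
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a as object);
  const keysB = Object.keys(b as object);
  if (keysA.length !== keysB.length) return false;
  return keysA.every((k) =>
    jsonEqual(
      (a as Record<string, unknown>)[k],
      (b as Record<string, unknown>)[k],
    )
  );
}

// Key order does not matter, unlike a raw string comparison.
const out = JSON.parse('{"age": 30, "name": "John Doe"}');
const exp = JSON.parse('{"name": "John Doe", "age": 30}');
console.log(jsonEqual(out, exp)); // true
```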

Jaccard Similarity

Calculate similarity between sets of tokens.

Example: example/jaccard.json

[
  {
    "input": "List programming languages for web development",
    "output": "JavaScript TypeScript Python Ruby PHP Java",
    "expected": "JavaScript Python Ruby PHP Go"
  }
]

Run:

burro run-eval -t jaccard example/jaccard.json
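Jaccard similarity is the size of the token intersection divided by the size of the token union. A sketch of the metric over whitespace-separated tokens (an illustration, not Burro's exact tokenization):

```typescript
// Jaccard similarity: |A ∩ B| / |A ∪ B| over lowercase whitespace tokens.
function jaccard(output: string, expected: string): number {
  const a = new Set(output.toLowerCase().split(/\s+/).filter(Boolean));
  const b = new Set(expected.toLowerCase().split(/\s+/).filter(Boolean));
  const intersection = [...a].filter((t) => b.has(t)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 1 : intersection / union;
}

// The example sets share 4 tokens out of 7 distinct tokens, ≈ 0.571.
console.log(jaccard(
  "JavaScript TypeScript Python Ruby PHP Java",
  "JavaScript Python Ruby PHP Go",
));
```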

LLM-as-a-Judge Evaluations

Factuality

Evaluate answer correctness with context validation.

Example: example/evals.json

[
  {
    "input": "What is the capital of France?",
    "output": "The capital city of France is Paris",
    "expected": "Paris"
  }
]

Run:

burro run-eval -t factuality example/evals.json

Close QA

LLM-judged matching for close-ended questions against specified criteria.

Example: example/closeqa.json

[
  {
    "input": "List the first three prime numbers",
    "output": "2,3,5",
    "criteria": "Numbers must be in correct order, separated by commas"
  }
]

Run:

burro run-eval -t closeqa example/closeqa.json

Battle

Compare outputs from different models head-to-head.

Example: example/battle.json

[
  {
    "input": "Write a haiku about technology",
    "output": "Code flows like water\nBits and bytes dance in rhythm\nDigital zen speaks",
    "expected": "Silicon pathways\nData streams through endless night\nMachines dream in code"
  }
]

Run:

burro run-eval -t battle example/battle.json

Summarization

Evaluate the quality and accuracy of text summaries.

Example: example/summarization.json

[
  {
    "input": "Summarize this text",
    "output": "Climate change impacts polar regions; urgent global action needed.",
    "context": "Long article about climate change effects..."
  }
]

Run:

burro run-eval -t summarization example/summarization.json

SQL

Verify the correctness of generated SQL queries.

Example: example/sql.json

[
  {
    "input": "Find all users over age 18",
    "output": "SELECT * FROM users WHERE age > 18;",
    "expected": "SELECT * FROM users WHERE age > 18;",
    "context": "Database schema: users(id, name, email, age)"
  }
]

Run:

burro run-eval -t sql example/sql.json

Translation

Assess translation quality across languages.

Example: example/translation.json

[
  {
    "input": "Translate 'Hello' to Spanish",
    "output": "Hola",
    "expected": "Hola"
  }
]

Run:

burro run-eval -t translation example/translation.json

📖 Real-World Scenarios

See SCENARIOS.md for comprehensive real-world examples including:

  • 🎯 Customer Support Bot Evaluation
  • 💻 Code Generation Validation
  • 🌍 Translation Quality Assessment
  • βš”οΈ Chatbot Response Comparison
  • πŸ“Š Data Extraction Accuracy
  • πŸ“š Educational Content Assessment

🔒 Security Features

  • AES-256 Encryption for API key storage
  • Secure key generation using Web Crypto API
  • Encrypted SQLite storage for settings
  • No plaintext secrets ever stored on disk
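Burro encrypts the stored key with the Web Crypto API; as a rough illustration of the same AES idea (using Node's `node:crypto` module for a compact synchronous sketch, not Burro's actual code), an AES-256-GCM encrypt/decrypt roundtrip looks like this:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Illustrative AES-256-GCM roundtrip: encrypt a secret with a random
// key and IV, then decrypt and verify the authentication tag.
const key = randomBytes(32); // 256-bit key
const iv = randomBytes(12);  // 96-bit IV, the standard size for GCM

function encrypt(plaintext: string): { ciphertext: Buffer; tag: Buffer } {
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(plaintext, "utf8"),
    cipher.final(),
  ]);
  return { ciphertext, tag: cipher.getAuthTag() };
}

function decrypt(ciphertext: Buffer, tag: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag); // tampering with ciphertext makes final() throw
  return Buffer.concat([
    decipher.update(ciphertext),
    decipher.final(),
  ]).toString("utf8");
}

const { ciphertext, tag } = encrypt("sk-example-api-key");
console.log(decrypt(ciphertext, tag)); // "sk-example-api-key"
```

The GCM auth tag is what turns this from plain encryption into authenticated encryption: a corrupted ciphertext fails to decrypt rather than silently producing garbage.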

📈 Advanced Usage

Progress Indicators

For long-running evaluations:

burro run-eval -t factuality large-dataset.json --progress

Export Results

Save evaluation results to JSON:

burro run-eval -t battle comparison.json -p
# Results saved to ~/Downloads/comparison.json-result.json

Batch Evaluation

Run multiple evaluations sequentially:

burro run-eval -t exact tests/ids.json
burro run-eval -t factuality tests/qa.json
burro run-eval -t sql tests/queries.json

🎯 Choosing the Right Evaluation Type

| Use Case | Recommended Type | Why? |
| --- | --- | --- |
| Order IDs, Product Codes | exact | Requires perfect match |
| User Questions | factuality | Needs semantic understanding |
| Price Calculations | numeric | Allows tolerance |
| Model A/B Testing | battle | Direct comparison |
| API Responses | json | Structure validation |
| Spelling Variations | levenshtein | Fuzzy matching |
| Keywords/Tags | jaccard | Set similarity |
| Summaries | summarization | Quality assessment |
| SQL Queries | sql | Syntax + logic validation |
| Translations | translation | Language expertise |

πŸ—οΈ System Architecture Check

To determine which version to download:

MacOS

uname -m
  • arm64 → Use Apple Silicon version (M1/M2/M3)
  • x86_64 → Use Intel version

Linux

uname -m
  • aarch64 or arm64 → Use ARM version
  • x86_64 → Use Intel version

πŸ› Troubleshooting

Permission Denied

sudo chmod +x /usr/local/bin/burro

Command Not Found

  1. Verify installation location is in your PATH
  2. Restart your terminal
  3. Check executable exists: ls -l /usr/local/bin/burro

API Key Issues

burro set-openai-key  # Re-enter your API key

Low Evaluation Scores

  • Exact match: Check for extra spaces or case differences
  • Numeric: Adjust tolerance values
  • JSON: Ensure consistent formatting
  • Factuality: Make expected answers less specific

πŸ—‘οΈ Uninstallation

MacOS & Linux

sudo rm /usr/local/bin/burro
which burro  # Should return nothing

Windows

  1. Delete burro.exe from installation location
  2. Remove from PATH if added

🎓 Examples Directory

All evaluation types have examples in the /example directory:

example/
├── closeqa.json          # Close-ended QA
├── evals.json            # Factuality evaluation
├── levenshtein.json      # String similarity
├── exact-match.json      # Exact matching
├── numeric.json          # Numeric comparison
├── json-diff.json        # JSON structure diff
├── jaccard.json          # Token similarity
├── battle.json           # Model comparison
├── summarization.json    # Summary quality
├── sql.json              # SQL validation
└── translation.json      # Translation quality

🚀 Getting Started Tutorial

  1. Install Burro using the command for your platform
  2. Try a heuristic evaluation (no API key needed):
    burro run-eval -t exact example/exact-match.json
  3. Set up your API key for LLM evaluations:
    burro set-openai-key
  4. Run an LLM evaluation:
    burro run-eval -t factuality example/evals.json
  5. Explore scenarios in SCENARIOS.md
  6. Create your own evaluation files based on examples

💡 Tips for Success

  1. Start with heuristics - They're fast and free
  2. Use the right tool - Match evaluation type to your use case
  3. Build incrementally - Start with 5-10 test cases
  4. Version your tests - Track evaluation files in git
  5. Automate regularly - Run evaluations as part of your workflow
  6. Compare methods - Try multiple evaluation types on the same data
  7. Check examples - Learn from provided examples in /example directory

📚 Documentation

  • FEATURES.md - Complete feature documentation and technical details
  • SCENARIOS.md - Real-world use cases and examples
  • /example - Sample evaluation files for all types
  • README.md - This file, quick start guide and overview

🤝 Contributing

Contributions welcome! Please feel free to submit a Pull Request.

πŸ“ License

MIT License - feel free to use Burro in your projects!

🎯 Next Steps

Ready to evaluate your LLM outputs?

  1. Install Burro for your platform
  2. Pick an evaluation type that matches your needs
  3. Create or use an example evaluation file
  4. Run your first evaluation
  5. Iterate and improve!

Need help? Check the issues page or review the examples!


Made with ❤️ for LLM developers and evaluators
