A PHP application that evaluates the capabilities of different LLM models (such as programming or other text-generation tasks) via the OpenRouter API.
- Model Fetching: Fetch all models from OpenRouter and export to editable CSV
- Selective Testing: Choose which models to test by editing the CSV file
- Resumable Evaluation: Safe resume after interruption (idempotent); see the sketch after this list
- Complete Storage: All model outputs saved for follow-up evaluation
- Follow-up Conversations: Ask additional questions to tested models
- Multiple Reports: CLI and HTML output formats
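How the idempotent resume can work in practice: each model's outputs live in their own files, so a run only needs to skip models whose results are already on disk. A minimal sketch of that idea (the function, directory naming, and file name here are illustrative, not the application's actual code):

```php
<?php
// Illustrative sketch of idempotent resume: a model is skipped when its
// raw response file already exists, so an interrupted run can simply be
// restarted. Directory and file naming are assumed, not the real layout.
function modelsStillToTest(array $modelIds, string $resultsDir): array
{
    return array_values(array_filter($modelIds, function (string $modelId) use ($resultsDir) {
        $resultFile = $resultsDir . '/' . $modelId . '/01_raw_response.json';
        return !file_exists($resultFile);
    }));
}
```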
```bash
# Install dependencies
composer install

# Copy environment template
cp .env.example .env

# Add your OpenRouter API key to .env
# OPENROUTER_API_KEY=your_key_here
```

```bash
# Fetch all models and create models.csv
php llm-scoring.php fetch

# Edit data/default/models.csv to select which models to test (enabled=1)
# Create data/default/task.md with instructions for the task to be executed
# Create data/default/evaluator-hints.md with instructions on how the results
# should be evaluated, including specifics about what you are looking for

# Test selected models (resumes automatically if interrupted)
php llm-scoring.php test

# Reset and start fresh
php llm-scoring.php test --reset

# View results
php llm-scoring.php list

# Generate report
php llm-scoring.php report --format html
```

When you run `php llm-scoring.php` without any command, you'll see a help overview with quickstart instructions.
| Command | Description |
|---|---|
| `fetch` | Fetch all models from OpenRouter to CSV |
| `list-models` | View models in CSV with filtering |
| `test --from-csv` | Test models from CSV (enabled only) |
| `test --from-csv --reset` | Reset and start fresh evaluation |
| `status` | Show evaluation progress |
| `show <model_id>` | Display model test data (prompts, responses, evaluations, costs) |
| `list` | List all tested models (with cost aggregation) |
| `evaluate [<model_id>]` | Evaluate stored model content using an LLM; omit model_id to evaluate all unevaluated models |
| `export-models` | Export models from CSV to various formats |
| `report` | Generate evaluation reports (CLI or HTML format) |
| `stats` | Display evaluation statistics |
| Option | Description |
|---|---|
| `model_id` | The model ID to evaluate (optional, omit for all) |
| `-t, --test` | Test number (default: latest) |
| `-m, --model` | Evaluator model (default: `EVALUATOR_MODEL` env) |
| `-r, --raw` | Show raw JSON output |
| `-a, --all` | Evaluate all unevaluated models (same as omitting model_id) |
```bash
# Evaluate a single model
php llm-scoring.php evaluate meta-llama/llama-3.1-8b-instruct

# Evaluate all unevaluated models at once
php llm-scoring.php evaluate

# Or explicitly with --all flag
php llm-scoring.php evaluate --all
```

| Option | Description |
|---|---|
| `model_id` | The model ID to show (required) |
| `--test, -t` | Specific test number to show |
| `--raw, -r` | Show raw JSON output |
| Option | Description |
|---|---|
| `--format, -f` | Output format: table (default) or json |
| `--details, -d` | Show additional details (full paths) |
| Option | Description |
|---|---|
| `--from-csv, --input` | CSV file path (default: data/models.csv) |
| `--all, -a` | Test all models including disabled |
| `--free-only, -f` | Only test free models |
| `--limit, -l` | Limit number of models to test |
| `--prompt, -p` | Custom prompt to send to models |
| `--experiment-code, -q` | Question code for organizing results (default: default) |
| `--reset` | Reset evaluation state before starting |
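Conceptually, these options drive a simple filter over the rows of models.csv. A sketch of that selection logic (only the `enabled` column is documented in this README; the other column names here are assumptions):

```php
<?php
// Sketch: read models.csv and apply the kinds of filters the test
// command exposes. Column names other than "enabled" are assumed.
function selectModels(string $csvPath, bool $all, bool $freeOnly, ?int $limit): array
{
    $rows = array_map('str_getcsv', file($csvPath, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
    $header = array_shift($rows);
    $models = array_map(fn ($row) => array_combine($header, $row), $rows);

    if (!$all) {
        // Keep only rows marked enabled=1 in the CSV.
        $models = array_filter($models, fn ($m) => ($m['enabled'] ?? '0') === '1');
    }
    if ($freeOnly) {
        // Assumed column: a zero prompt price marks a free model.
        $models = array_filter($models, fn ($m) => (float) ($m['prompt_price'] ?? 0) == 0.0);
    }
    if ($limit !== null) {
        $models = array_slice($models, 0, $limit);
    }
    return array_values($models);
}
```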
Commands support an `--experiment-code` (or `-q`) option to organize results into separate subdirectories. This allows you to:
- Separate different tests: Run the same task with different prompts
- Track experiments: Compare results across different question sets
- Unit testing: Use `--experiment-code=unittests` for isolated test runs
| Question Code | Use Case |
|---|---|
| `default` | Standard testing (used when no code is specified) |
| `unittests` | Unit testing with Pest |
| Custom code | Your own experiment name (e.g., `my-experiment`) |
Data Storage Structure:
```
data/models/
├── default/            # Standard test results
│   └── <model_id>/
│       ├── 01_test_prompt.json
│       ├── 01_raw_response.json
│       └── 01_evaluation.json
├── unittests/          # Unit test results
│   └── <model_id>/
│       └── ...
└── my-experiment/      # Custom experiment
    └── <model_id>/
        └── ...
```
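Given that layout, loading a stored artifact back is mostly a matter of composing the right path. A sketch (the helper name is illustrative; file names follow the tree above):

```php
<?php
// Sketch: locate and decode a stored test artifact for one model.
// Mirrors the directory layout shown above; the function name is illustrative.
function loadArtifact(string $experimentCode, string $modelId, int $test, string $kind): ?array
{
    // $kind is one of: test_prompt, raw_response, evaluation.
    // Model ids containing '/' may be sanitized on disk; treated literally here.
    $path = sprintf('data/models/%s/%s/%02d_%s.json', $experimentCode, $modelId, $test, $kind);
    return is_file($path) ? json_decode(file_get_contents($path), true) : null;
}

// Example: the first raw response of a model in the default experiment.
$raw = loadArtifact('default', 'meta-llama/llama-3.1-8b-instruct', 1, 'raw_response');
```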
Usage Examples:
```bash
# Test with default question code
php llm-scoring.php test --from-csv

# Test with custom question code
php llm-scoring.php test --from-csv --experiment-code=my-experiment

# Test for unit tests
php llm-scoring.php test --from-csv --experiment-code=unittests

# List results for a specific question code
php llm-scoring.php list --experiment-code=my-experiment

# Evaluate results for a specific question code
php llm-scoring.php evaluate --experiment-code=my-experiment

# Show status for a specific question code
php llm-scoring.php status --experiment-code=my-experiment

# Generate report for a specific question code
php llm-scoring.php report --experiment-code=my-experiment

# Show statistics for a specific question code
php llm-scoring.php stats --experiment-code=my-experiment
```

The task prompt is defined in `data/<experiment-code>/task.md`. This allows you to:
- Customize the task: Edit `data/<experiment-code>/task.md` to change what models should do
- Version control: Track changes to the task definition over time
- Multiple tasks: Create different task files for different scenarios
Format for `data/<experiment-code>/task.md`:

````
```markdown
Write a PHP script that counts down from 10 to 1, outputting each number on a new line. Only output the code, no explanations.
```
````

The prompt is extracted from the ```` ```markdown ```` code block. If no code block is found, the entire file content is used.
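That extraction rule is straightforward to express in code; a sketch (the function name and regex approach are illustrative, not the application's actual implementation):

```php
<?php
// Sketch: pull the prompt out of task.md. Prefer the ```markdown fenced
// block; fall back to the entire file if no such block is found.
function extractTaskPrompt(string $taskFile): string
{
    $content = file_get_contents($taskFile);
    if (preg_match('/```markdown\s*\n(.*?)\n```/s', $content, $matches)) {
        return trim($matches[1]);
    }
    return trim($content);
}
```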
Evaluator hints are defined in `data/<experiment-code>/evaluator-hints.md` to provide task-specific guidance for the LLM evaluator. This allows you to:
- Custom evaluation criteria: Define what makes good output for your specific task
- Detailed checks: Specify exactly what to look for and what to deduct points for
- Examples: Provide expected output format guidance
See `data/<experiment-code>/evaluator-hints.md` for the evaluation criteria.
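One plausible way the hints feed into the `evaluate` step is by being appended to the evaluator prompt alongside the task and the model's stored output. A sketch (the prompt wording and helper are illustrative; only the hints file path and the `EVALUATOR_MODEL` default come from this document):

```php
<?php
// Sketch: build an evaluator prompt from the task, the model's output,
// and the task-specific hints file. The wording below is illustrative.
function buildEvaluatorPrompt(string $experimentCode, string $task, string $modelOutput): string
{
    $hintsFile = "data/{$experimentCode}/evaluator-hints.md";
    $hints = is_file($hintsFile) ? file_get_contents($hintsFile) : '';

    return "Task given to the model:\n{$task}\n\n"
        . "Model output to evaluate:\n{$modelOutput}\n\n"
        . "Evaluation criteria:\n{$hints}\n\n"
        . "Score the output and explain the score.";
}

// The evaluator model defaults to the EVALUATOR_MODEL environment variable.
$evaluatorModel = getenv('EVALUATOR_MODEL') ?: 'your-evaluator-model';
```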
| Option | Description |
|---|---|
| `--input, -i` | Input CSV file path (default: data/models.csv) |
| `--output, -o` | Output file path (default: stdout) |
| `--format, -f` | Output format: csv or json (default: csv) |
| `--enabled` | Only export enabled models |
| `--free-only` | Only export free models |
| Option | Description |
|---|---|
| `--format, -f` | Output format: cli (default) or html |
| `--output, -o` | Output file path for HTML format (default: results.html) |
| Option | Description |
|---|---|
| `--json, -j` | Output as JSON |
| `--detailed` | Show detailed statistics |
```bash
# Generate CLI report
php llm-scoring.php report

# Generate HTML report (saves to results.html)
php llm-scoring.php report --format html

# Generate HTML report with custom output path
php llm-scoring.php report --format html --output my-report.html
```

```bash
# Show statistics in CLI format
php llm-scoring.php stats

# Show statistics in JSON format
php llm-scoring.php stats --json

# Show detailed statistics with score distribution
php llm-scoring.php stats --detailed
```

```
LLModelScoring/
├── src/               # PHP source code
├── data/              # CSV files and model outputs
├── config/            # Configuration
├── tests/             # Unit and integration tests
├── llm-scoring.php    # CLI entry point
└── composer.json      # Dependencies
```
Tests are run with Pest. Because the project uses a custom pest.xml configuration file, pass it explicitly when running tests:
```bash
# Run all tests
./vendor/bin/pest --configuration=pest.xml

# Unit tests only
./vendor/bin/pest --configuration=pest.xml tests/Unit/

# Specific test file
./vendor/bin/pest --configuration=pest.xml tests/Unit/ModelCsvTest.php
```

- PHP 8.4+
- Composer
- OpenRouter API key
