A PHP application that evaluates the capabilities of different LLM models (such as programming or other text-generation tasks) via the OpenRouter API.
- Model Fetching: Fetch all models from OpenRouter and export to editable CSV
- Selective Testing: Choose which models to test by editing the CSV file
- Resumable Evaluation: Safe resume after interruption (idempotent); see the sketch after this list
- Complete Storage: All model outputs saved for follow-up evaluation
- Follow-up Conversations: Ask additional questions to tested models
- Multiple Reports: CLI and HTML output formats
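How the idempotent resume can work in practice: each model's outputs live in their own files, so a run only needs to skip models whose results are already on disk. A minimal sketch of that idea (the function, directory naming, and file name here are illustrative, not the application's actual code):

```php
<?php
// Illustrative sketch of idempotent resume: a model is skipped when its
// raw response file already exists, so an interrupted run can simply be
// restarted. Directory and file naming are assumed, not the real layout.
function modelsStillToTest(array $modelIds, string $resultsDir): array
{
    return array_values(array_filter($modelIds, function (string $modelId) use ($resultsDir) {
        $resultFile = $resultsDir . '/' . $modelId . '/01_raw_response.json';
        return !file_exists($resultFile);
    }));
}
```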
```bash
# Install dependencies
composer install

# Copy environment template
cp .env.example .env

# Add your OpenRouter API key to .env
# OPENROUTER_API_KEY=your_key_here
```

```bash
# Fetch all models and create models.csv
php llm-scoring.php fetch

# Edit data/default/models.csv to select which models to test (enabled=1)
# Create data/default/task.md with instructions for the task to be executed
# Create data/default/evaluator-hints.md with instructions on how the results
# should be evaluated, including specifics about what you are looking for

# Test selected models (resumes automatically if interrupted)
php llm-scoring.php test

# Reset and start fresh
php llm-scoring.php test --reset

# View results
php llm-scoring.php list

# Generate report
php llm-scoring.php report --format html
```

When you run `php llm-scoring.php` without any command, you'll see a help overview with quickstart instructions.
| Command | Description |
|---|---|
| `fetch` | Fetch all models from OpenRouter to CSV |
| `list-models` | View models in CSV with filtering |
| `test --from-csv` | Test models from CSV (enabled only) |
| `test --from-csv --reset` | Reset and start fresh evaluation |
| `status` | Show evaluation progress |
| `show <model_id>` | Display model test data (prompts, responses, evaluations, costs) |
| `list` | List all tested models (with cost aggregation) |
| `evaluate [<model_id>]` | Evaluate stored model content using an LLM; omit model_id to evaluate all unevaluated models |
| `export-models` | Export models from CSV to various formats |
| `report` | Generate evaluation reports (CLI or HTML format) |
| `stats` | Display evaluation statistics |
| Option | Description |
|---|---|
| `model_id` | The model ID to evaluate (optional, omit for all) |
| `-t, --test` | Test number (default: latest) |
| `-m, --model` | Evaluator model (default: `EVALUATOR_MODEL` env) |
| `-r, --raw` | Show raw JSON output |
| `-a, --all` | Evaluate all unevaluated models (same as omitting model_id) |
```bash
# Evaluate a single model
php llm-scoring.php evaluate meta-llama/llama-3.1-8b-instruct

# Evaluate all unevaluated models at once
php llm-scoring.php evaluate

# Or explicitly with --all flag
php llm-scoring.php evaluate --all
```

| Option | Description |
|---|---|
| `model_id` | The model ID to show (required) |
| `--test, -t` | Specific test number to show |
| `--raw, -r` | Show raw JSON output |
| Option | Description |
|---|---|
| `--format, -f` | Output format: table (default) or json |
| `--details, -d` | Show additional details (full paths) |
| Option | Description |
|---|---|
| `--from-csv, --input` | CSV file path (default: data/models.csv) |
| `--all, -a` | Test all models including disabled |
| `--free-only, -f` | Only test free models |
| `--limit, -l` | Limit number of models to test |
| `--prompt, -p` | Custom prompt to send to models |
| `--experiment-code, -q` | Question code for organizing results (default: default) |
| `--reset` | Reset evaluation state before starting |
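Conceptually, these options drive a simple filter over the rows of models.csv. A sketch of that selection logic (only the `enabled` column is documented in this README; the other column names here are assumptions):

```php
<?php
// Sketch: read models.csv and apply the kinds of filters the test
// command exposes. Column names other than "enabled" are assumed.
function selectModels(string $csvPath, bool $all, bool $freeOnly, ?int $limit): array
{
    $rows = array_map('str_getcsv', file($csvPath, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
    $header = array_shift($rows);
    $models = array_map(fn ($row) => array_combine($header, $row), $rows);

    if (!$all) {
        // Keep only rows marked enabled=1 in the CSV.
        $models = array_filter($models, fn ($m) => ($m['enabled'] ?? '0') === '1');
    }
    if ($freeOnly) {
        // Assumed column: a zero prompt price marks a free model.
        $models = array_filter($models, fn ($m) => (float) ($m['prompt_price'] ?? 0) == 0.0);
    }
    if ($limit !== null) {
        $models = array_slice($models, 0, $limit);
    }
    return array_values($models);
}
```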
Commands support an `--experiment-code` (or `-q`) option to organize results into separate subdirectories. This allows you to:
- Separate different tests: Run the same task with different prompts
- Track experiments: Compare results across different question sets
- Unit testing: Use `--experiment-code=unittests` for isolated test runs
| Question Code | Use Case |
|---|---|
| `default` | Standard testing (used when no code is specified) |
| `unittests` | Unit testing with Pest |
| Custom code | Your own experiment name (e.g., `my-experiment`) |
Data Storage Structure:
```
data/models/
├── default/            # Standard test results
│   └── <model_id>/
│       ├── 01_test_prompt.json
│       ├── 01_raw_response.json
│       └── 01_evaluation.json
├── unittests/          # Unit test results
│   └── <model_id>/
│       └── ...
└── my-experiment/      # Custom experiment
    └── <model_id>/
        └── ...
```
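Given that layout, loading a stored artifact back is mostly a matter of composing the right path. A sketch (the helper name is illustrative; file names follow the tree above):

```php
<?php
// Sketch: locate and decode a stored test artifact for one model.
// Mirrors the directory layout shown above; the function name is illustrative.
function loadArtifact(string $experimentCode, string $modelId, int $test, string $kind): ?array
{
    // $kind is one of: test_prompt, raw_response, evaluation.
    // Model ids containing '/' may be sanitized on disk; treated literally here.
    $path = sprintf('data/models/%s/%s/%02d_%s.json', $experimentCode, $modelId, $test, $kind);
    return is_file($path) ? json_decode(file_get_contents($path), true) : null;
}

// Example: the first raw response of a model in the default experiment.
$raw = loadArtifact('default', 'meta-llama/llama-3.1-8b-instruct', 1, 'raw_response');
```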
Usage Examples:
```bash
# Test with default question code
php llm-scoring.php test --from-csv

# Test with custom question code
php llm-scoring.php test --from-csv --experiment-code=my-experiment

# Test for unit tests
php llm-scoring.php test --from-csv --experiment-code=unittests

# List results for a specific question code
php llm-scoring.php list --experiment-code=my-experiment

# Evaluate results for a specific question code
php llm-scoring.php evaluate --experiment-code=my-experiment

# Show status for a specific question code
php llm-scoring.php status --experiment-code=my-experiment

# Generate report for a specific question code
php llm-scoring.php report --experiment-code=my-experiment

# Show statistics for a specific question code
php llm-scoring.php stats --experiment-code=my-experiment
```

The task prompt is defined in `data/<experiment-code>/task.md`. This allows you to:
- Customize the task: Edit `data/<experiment-code>/task.md` to change what models should do
- Version control: Track changes to the task definition over time
- Multiple tasks: Create different task files for different scenarios
Format for `data/<experiment-code>/task.md`:

````
```markdown
Write a PHP script that counts down from 10 to 1, outputting each number on a new line. Only output the code, no explanations.
```
````

The prompt is extracted from the ```` ```markdown ```` code block. If no code block is found, the entire file content is used.
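That extraction rule is straightforward to express in code; a sketch (the function name and regex approach are illustrative, not the application's actual implementation):

```php
<?php
// Sketch: pull the prompt out of task.md. Prefer the ```markdown fenced
// block; fall back to the entire file if no such block is found.
function extractTaskPrompt(string $taskFile): string
{
    $content = file_get_contents($taskFile);
    if (preg_match('/```markdown\s*\n(.*?)\n```/s', $content, $matches)) {
        return trim($matches[1]);
    }
    return trim($content);
}
```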
Evaluator hints are defined in `data/<experiment-code>/evaluator-hints.md` to provide task-specific guidance for the LLM evaluator. This allows you to:
- Custom evaluation criteria: Define what makes good output for your specific task
- Detailed checks: Specify exactly what to look for and what to deduct points for
- Examples: Provide expected output format guidance
See `data/<experiment-code>/evaluator-hints.md` for the evaluation criteria.
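One plausible way the hints feed into the `evaluate` step is by being appended to the evaluator prompt alongside the task and the model's stored output. A sketch (the prompt wording and helper are illustrative; only the hints file path and the `EVALUATOR_MODEL` default come from this document):

```php
<?php
// Sketch: build an evaluator prompt from the task, the model's output,
// and the task-specific hints file. The wording below is illustrative.
function buildEvaluatorPrompt(string $experimentCode, string $task, string $modelOutput): string
{
    $hintsFile = "data/{$experimentCode}/evaluator-hints.md";
    $hints = is_file($hintsFile) ? file_get_contents($hintsFile) : '';

    return "Task given to the model:\n{$task}\n\n"
        . "Model output to evaluate:\n{$modelOutput}\n\n"
        . "Evaluation criteria:\n{$hints}\n\n"
        . "Score the output and explain the score.";
}

// The evaluator model defaults to the EVALUATOR_MODEL environment variable.
$evaluatorModel = getenv('EVALUATOR_MODEL') ?: 'your-evaluator-model';
```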
| Option | Description |
|---|---|
| `--input, -i` | Input CSV file path (default: data/models.csv) |
| `--output, -o` | Output file path (default: stdout) |
| `--format, -f` | Output format: csv or json (default: csv) |
| `--enabled` | Only export enabled models |
| `--free-only` | Only export free models |
| Option | Description |
|---|---|
| `--format, -f` | Output format: cli (default) or html |
| `--output, -o` | Output file path for HTML format (default: results.html) |
| Option | Description |
|---|---|
| `--json, -j` | Output as JSON |
| `--detailed` | Show detailed statistics |
```bash
# Generate CLI report
php llm-scoring.php report

# Generate HTML report (saves to results.html)
php llm-scoring.php report --format html

# Generate HTML report with custom output path
php llm-scoring.php report --format html --output my-report.html
```

```bash
# Show statistics in CLI format
php llm-scoring.php stats

# Show statistics in JSON format
php llm-scoring.php stats --json

# Show detailed statistics with score distribution
php llm-scoring.php stats --detailed
```

```
LLModelScoring/
├── src/               # PHP source code
├── data/              # CSV files and model outputs
├── config/            # Configuration
├── tests/             # Unit and integration tests
├── llm-scoring.php    # CLI entry point
└── composer.json      # Dependencies
```
Tests are run with Pest. Because the project uses a custom pest.xml configuration file, pass it explicitly when running tests:
```bash
# Run all tests
./vendor/bin/pest --configuration=pest.xml

# Unit tests only
./vendor/bin/pest --configuration=pest.xml tests/Unit/

# Specific test file
./vendor/bin/pest --configuration=pest.xml tests/Unit/ModelCsvTest.php
```

- PHP 8.4+
- Composer
- OpenRouter API key
