Parallel inference runner for SpreadsheetBench that uses GRID's spreadsheet agent API to solve tasks and evaluate results inline.
SpreadsheetBench's built-in inference scripts ask the LLM to generate Python code (using openpyxl) to manipulate spreadsheets. The generated code runs in a Jupyter kernel to produce output .xlsx files. Because openpyxl cannot evaluate Excel formulas, the pipeline then opens each output file in Excel or LibreOffice to recalculate formulas before evaluation can compare cell values.
GRID's agent operates directly on a live spreadsheet engine, so output files are fully calculated .xlsx workbooks. This eliminates two steps from the pipeline:
- No Python/openpyxl code generation — the agent manipulates spreadsheets directly.
- No Excel/LibreOffice recalculation pass — output files already contain computed values.
The evaluation logic (cell-by-cell comparison with type coercion) is ported from SpreadsheetBench's `evaluation/evaluation_verified.py` to ensure compatible results.
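As a rough illustration of what "comparison with type coercion" means, a minimal sketch might look like this (the function names, coercion rules, and tolerance below are assumptions for illustration, not the actual code from `evaluation_verified.py`):

```python
# Sketch of a type-coercing cell comparison. Illustrative only: the real
# evaluation logic lives in evaluation_verified.py and may differ.

def coerce(value):
    """Normalize a cell value so "5", 5, and 5.0 compare equal."""
    if value is None:
        return ""  # empty cell and empty string are treated the same
    if isinstance(value, bool):
        return value
    try:
        return float(value)  # numeric coercion first: "5" matches 5.0
    except (TypeError, ValueError):
        return str(value).strip()

def cells_equal(expected, actual, tol=1e-6):
    """Compare two cell values after coercion, with a float tolerance."""
    a, b = coerce(expected), coerce(actual)
    if isinstance(a, float) and isinstance(b, float):
        return abs(a - b) <= tol
    return a == b
```

The float tolerance matters because recalculated formula results rarely match reference values bit-for-bit.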
Requires Python 3.12+ and uv.

```sh
make install
```

Or run directly without installing:

```sh
uv run sheetbench-runner --help
```

Point the runner at a SpreadsheetBench dataset directory and an output directory:
```sh
sheetbench-runner \
  --dataset data/spreadsheetbench_verified_400/ \
  --run-dir data/runs/2026-02-05-my-run \
  --concurrency 10
```

To run a specific subset of tasks, use a task file:
```sh
sheetbench-runner \
  --dataset data/spreadsheetbench_verified_400/ \
  --run-dir data/runs/2026-02-05-my-run \
  --task-file task-sets/all_verified_tasks.txt \
  --concurrency 10
```

Runs are resumable: if interrupted, re-running the same command skips already-completed tasks and retries any that failed due to transient errors (5xx responses, timeouts).
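The skip-or-retry decision can be sketched roughly as follows (a simplified illustration of the behavior described above; `TRANSIENT_MARKERS`, the non-"pass" result values, and the message matching are assumptions about the runner's internals):

```python
# Sketch of resume logic: skip tasks with a prior result, but retry ones
# whose failure message looks transient (5xx, timeout). Illustrative only.
import json
from pathlib import Path

TRANSIENT_MARKERS = ("timeout", "500", "502", "503", "504")  # assumed

def load_results(run_dir: Path) -> dict:
    """Index prior results from results.json by task_id, if present."""
    path = run_dir / "results.json"
    if not path.exists():
        return {}
    return {r["task_id"]: r for r in json.loads(path.read_text())}

def should_run(task_id: str, results: dict) -> bool:
    prior = results.get(task_id)
    if prior is None:
        return True   # never attempted: run it
    if prior["result"] == "pass":
        return False  # completed successfully: skip
    msg = prior.get("message", "").lower()
    # Retry only failures whose message looks transient.
    return any(m in msg for m in TRANSIENT_MARKERS)
```

A genuine wrong-answer failure is kept as-is, so re-running a finished directory is idempotent.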
If the evaluation logic changes (e.g. a parser fix for edge-case Excel references), you can re-evaluate existing output files without re-running inference:
```sh
sheetbench-runner \
  --dataset data/spreadsheetbench_verified_400/ \
  --run-dir data/runs/2026-02-05-my-run \
  --reevaluate
```

```text
Usage: sheetbench-runner [OPTIONS]

Options:
  --dataset PATH         Path to SpreadsheetBench dataset directory
                         (containing dataset.json)  [required]
  --run-dir PATH         Directory to store results (creates if missing,
                         resumes if exists)  [required]
  --task-ids TEXT        Comma-separated list of specific task IDs to run
  --task-file PATH       File with task IDs to run (one per line)
  --config PATH          Path to config.toml file
  --infuser-url TEXT     Override infuser URL from config
  --concurrency INTEGER  Number of parallel tasks (default: 4)
  --timeout INTEGER      Timeout per task in seconds (default: 3600)
  -v, --verbose          Enable verbose logging
  --reevaluate           Re-evaluate all tasks that have output files (useful
                         after parser fixes)
  --help                 Show this message and exit.
```
Copy config.example.toml to config.toml and adjust as needed:
```toml
[infuser]
url = "http://localhost:3000"

[infuser.config]
git_hash = "abc1234"
model = "claude-sonnet-4-5"
planning_enabled = false
max_turns = 50

[runner]
concurrency = 4
timeout_seconds = 3600
```

The `[infuser.config]` section accepts arbitrary key-value pairs. These are stored in `run.json` for reproducibility tracking; use them to record whatever agent configuration is relevant to the run.
CLI options (--infuser-url, --concurrency, --timeout) override their config file equivalents.
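That precedence rule amounts to a simple layered merge. As a sketch (the `effective_settings` helper is hypothetical; the defaults mirror the option list above):

```python
# Sketch of settings precedence: built-in defaults < config.toml < CLI
# flags. Illustrative only, not the runner's actual implementation.

DEFAULTS = {"concurrency": 4, "timeout_seconds": 3600}

def effective_settings(config: dict, cli: dict) -> dict:
    """Later sources win; a None CLI value means the flag wasn't passed."""
    merged = {**DEFAULTS, **config}
    merged.update({k: v for k, v in cli.items() if v is not None})
    return merged
```

Treating unset CLI flags as `None` keeps a flag's absence from clobbering a value set in `config.toml`.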
A run directory contains:
```text
run-dir/
├── run.json               # Run metadata (model, config, timestamp)
├── results.json           # Task results sorted by task_id
├── run.log                # Execution log
├── 13-1-output.xlsx       # Output workbook for task 13-1
├── 13-1-transcript.json   # Agent transcript for task 13-1
├── 203-15-output.xlsx
├── 203-15-transcript.json
└── ...
```
Each entry in results.json records the task outcome, timing, and token usage:
```json
{
  "task_id": "13-1",
  "duration_seconds": 45.2,
  "turns": 5,
  "tool_calls": 8,
  "input_tokens": 12500,
  "output_tokens": 3200,
  "output_file": "13-1-output.xlsx",
  "transcript_file": "13-1-transcript.json",
  "result": "pass",
  "message": ""
}
```

```sh
make test  # run tests with coverage
make lt    # lint + typecheck
```
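Since `results.json` is plain JSON, a run can be summarized with a short script. A sketch (the `summarize` helper is hypothetical; the field names come from the example entry above, and the pass-rate definition is an assumption):

```python
# Sketch: aggregate a run's results.json into headline numbers.
import json
from pathlib import Path

def summarize(run_dir: str) -> dict:
    results = json.loads((Path(run_dir) / "results.json").read_text())
    passed = sum(1 for r in results if r["result"] == "pass")
    return {
        "tasks": len(results),
        "passed": passed,
        "pass_rate": passed / len(results) if results else 0.0,
        "total_input_tokens": sum(r.get("input_tokens", 0) for r in results),
        "total_output_tokens": sum(r.get("output_tokens", 0) for r in results),
    }
```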