This repository contains the raw evaluation results for the benchmarks described in the blog post *Improving Claude Code by 10pp with Codeset*.
Results are split across top-level directories, one per model/benchmark combination:
| Directory | Model | Benchmark | Tasks |
|---|---|---|---|
| `haiku_codeset_gym/` | Claude Haiku 4.5 | codeset-gym-python | 150 |
| `sonnet_codeset_gym/` | Claude Sonnet 4.5 | codeset-gym-python | 150 |
| `opus_codeset_gym/` | Claude Opus 4.5 | codeset-gym-python | 150 |
| `sonnet_sweb_pro/` | Claude Sonnet 4.5 | SWE-Bench Pro | 300 |
| `gpt54_codeset_gym/` | GPT-5.4 | codeset-gym-python | 150 |
| `gpt54_sweb_pro/` | GPT-5.4 | SWE-Bench Pro | 400 |
Each directory has two subdirectories:

- `baseline/`: Claude Code running without Codeset context
- `codeset/`: Claude Code running with Codeset-generated context injected
Each task directory is named after its GitHub repository and issue number (e.g. `lucidrains__x-transformers-303`):
<model_dir>/<baseline|codeset>/<repo>__<issue>/
├── html/
│ ├── combined_transcripts.html # Full session transcript, human-readable
│ └── session-<uuid>.html # Per-session transcript
├── logs/
│ └── run.log # Execution log for the session
└── projects/default/
├── <session-uuid>.jsonl # Full session in JSONL format
├── agent-<id>.jsonl # Per-agent interaction logs
└── cache/ # Cached API responses
├── <session-uuid>.json
├── agent-<id>.json
└── index.json
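The layout above can be traversed with the standard library alone. A minimal sketch (the helper names `list_tasks` and `task_sessions` are illustrative, not part of this repo; the fields inside each JSONL record are not documented here, so the sketch only validates and counts events):

```python
import json
from pathlib import Path


def list_tasks(model_dir: Path, variant: str):
    """Yield task directories (repo__issue) under a baseline/ or codeset/ run."""
    for entry in sorted((model_dir / variant).iterdir()):
        if entry.is_dir():
            yield entry


def task_sessions(task_dir: Path) -> dict[str, int]:
    """Map each session JSONL file under projects/default/ to its event count."""
    counts = {}
    for jsonl in sorted((task_dir / "projects" / "default").glob("*.jsonl")):
        with jsonl.open() as f:
            # Each non-empty line is one JSON record; parsing validates it.
            counts[jsonl.name] = len([json.loads(line) for line in f if line.strip()])
    return counts
```

For example, `task_sessions(Path("haiku_codeset_gym/baseline/lucidrains__x-transformers-303"))` would return one entry per session or agent log in that task's `projects/default/` directory.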
## codeset-gym-python (150 tasks)
| Model | Baseline | With Codeset | Improvement |
|---|---|---|---|
| Claude Haiku 4.5 | 52% (78/150) | 62% (93/150) | +10pp |
| Claude Sonnet 4.5 | 56% (84/150) | 65.3% (98/150) | +9.3pp |
| Claude Opus 4.5 | 60.7% (91/150) | 68% (102/150) | +7.3pp |
| GPT-5.4 | 60.7% | 66% | +5.3pp |
## SWE-Bench Pro
| Model | Tasks | Baseline | With Codeset | Improvement |
|---|---|---|---|---|
| Claude Sonnet 4.5 | 300 | 53.0% (159/300) | 55.7% (167/300) | +2.7pp |
| GPT-5.4 | 400 | 56.5% | 58.5% | +2pp |
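The "Improvement" column is the difference in resolve rate, in percentage points, computed from the solved/total counts and rounded to one decimal. A minimal sketch of that calculation (the function name is illustrative; the GPT-5.4 rows omit raw counts, so only the Claude rows are reproduced):

```python
def improvement_pp(baseline_solved: int, codeset_solved: int, total: int) -> float:
    """Percentage-point gain of the Codeset run over the baseline run."""
    return round(100 * (codeset_solved - baseline_solved) / total, 1)


# Reproduces the table rows with explicit counts, e.g.:
#   Claude Haiku 4.5 on codeset-gym-python: improvement_pp(78, 93, 150)
#   Claude Sonnet 4.5 on SWE-Bench Pro:     improvement_pp(159, 167, 300)
```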