codeset-ai/codeset-release-evals

Codeset Release Evaluation Results

This repository contains the raw evaluation results for the benchmarks described in the blog post "Improving Claude Code by 10pp with Codeset".

What's in here

Results are split across top-level directories, one per model/benchmark combination:

Directory             Model               Benchmark            Tasks
haiku_codeset_gym/    Claude Haiku 4.5    codeset-gym-python   150
sonnet_codeset_gym/   Claude Sonnet 4.5   codeset-gym-python   150
opus_codeset_gym/     Claude Opus 4.5     codeset-gym-python   150
sonnet_sweb_pro/      Claude Sonnet 4.5   SWE-Bench Pro        300
gpt54_codeset_gym/    GPT-5.4             codeset-gym-python   150
gpt54_sweb_pro/       GPT-5.4             SWE-Bench Pro        400

Each directory has two subdirectories:

  • baseline/ — Claude Code running without Codeset context
  • codeset/ — Claude Code running with Codeset-generated context injected
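Because both arms share the same layout, a per-task comparison starts by pairing the task directories that appear under both baseline/ and codeset/. The following is a minimal sketch, assuming the repository is checked out locally; the helper name paired_tasks is hypothetical and not part of this repository.

```python
from pathlib import Path


def paired_tasks(model_dir: Path) -> list[str]:
    """Return the task names present in both arms of one model directory.

    Hypothetical helper: assumes <model_dir>/baseline and <model_dir>/codeset
    each contain one subdirectory per task.
    """
    baseline = {p.name for p in (model_dir / "baseline").iterdir() if p.is_dir()}
    codeset = {p.name for p in (model_dir / "codeset").iterdir() if p.is_dir()}
    # Only tasks that were run in both arms can be compared one to one.
    return sorted(baseline & codeset)
```

For example, paired_tasks(Path("haiku_codeset_gym")) would list the tasks that can be compared one to one between the two arms.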

Task directory structure

Each task directory is named after its GitHub repository owner, repository name, and issue number (e.g. lucidrains__x-transformers-303):

<model_dir>/<baseline|codeset>/<owner>__<repo>-<issue>/
├── html/
│   ├── combined_transcripts.html   # Full session transcript, human-readable
│   └── session-<uuid>.html         # Per-session transcript
├── logs/
│   └── run.log                     # Execution log for the session
└── projects/default/
    ├── <session-uuid>.jsonl         # Full session in JSONL format
    ├── agent-<id>.jsonl             # Per-agent interaction logs
    └── cache/                       # Cached API responses
        ├── <session-uuid>.json
        ├── agent-<id>.json
        └── index.json
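The layout above can be read with a few lines of Python. This is a minimal sketch, not tooling shipped with this repository: the helper names (TaskId, parse_task_name, session_files, read_jsonl) are hypothetical, the name parsing assumes every task follows the owner__repo-issue pattern of the example, and the fields inside each JSONL record are not documented here, so records are returned as plain dictionaries.

```python
import json
from pathlib import Path
from typing import Iterator, NamedTuple


class TaskId(NamedTuple):
    owner: str
    repo: str
    issue: int


def parse_task_name(name: str) -> TaskId:
    """Split a task directory name such as 'lucidrains__x-transformers-303'."""
    owner, _, rest = name.partition("__")
    # The repository name may itself contain hyphens, so split on the last one.
    repo, _, issue = rest.rpartition("-")
    if not owner or not repo or not issue.isdigit():
        raise ValueError(f"unexpected task directory name: {name!r}")
    return TaskId(owner, repo, int(issue))


def session_files(task_dir: Path) -> list[Path]:
    """List the full-session JSONL files, skipping the per-agent logs."""
    project = task_dir / "projects" / "default"
    return sorted(
        p for p in project.glob("*.jsonl") if not p.name.startswith("agent-")
    )


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Splitting on the last hyphen keeps repository names like x-transformers intact while still isolating the issue number.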

Summary results

codeset-gym-python (150 tasks)

Model               Baseline         With Codeset     Improvement
Claude Haiku 4.5    52% (78/150)     62% (93/150)     +10pp
Claude Sonnet 4.5   56% (84/150)     65.3% (98/150)   +9.3pp
Claude Opus 4.5     60.7% (91/150)   68% (102/150)    +7.3pp
GPT-5.4             60.7%            66%              +5.3pp

SWE-Bench Pro

Model               Tasks   Baseline          With Codeset      Improvement
Claude Sonnet 4.5   300     53.0% (159/300)   55.7% (167/300)   +2.7pp
GPT-5.4             400     56.5%             58.5%             +2pp
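The improvement column is a difference in percentage points (pp), not a relative change. As a quick check, the sketch below recomputes it from the resolved-task counts listed in the tables; the function name improvement_pp is hypothetical, and the GPT-5.4 rows are left out because the tables give no counts for them.

```python
def improvement_pp(baseline_resolved: int, codeset_resolved: int, total: int) -> float:
    """Percentage-point gain between two pass rates measured on the same task set."""
    return 100 * (codeset_resolved - baseline_resolved) / total


# (model, baseline resolved, with-Codeset resolved, total tasks), from the tables above
rows = [
    ("Claude Haiku 4.5, codeset-gym-python", 78, 93, 150),
    ("Claude Sonnet 4.5, codeset-gym-python", 84, 98, 150),
    ("Claude Opus 4.5, codeset-gym-python", 91, 102, 150),
    ("Claude Sonnet 4.5, SWE-Bench Pro", 159, 167, 300),
]

for model, base, with_codeset, total in rows:
    print(f"{model}: +{improvement_pp(base, with_codeset, total):.1f}pp")
# Claude Haiku 4.5, codeset-gym-python: +10.0pp
# Claude Sonnet 4.5, codeset-gym-python: +9.3pp
# Claude Opus 4.5, codeset-gym-python: +7.3pp
# Claude Sonnet 4.5, SWE-Bench Pro: +2.7pp
```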
