High-performance SWE-bench dataset generator and evaluation harness that mines real GitHub pull requests, produces evaluation-ready task instances, and benchmarks coding agents.
Built on top of SweInfinite by @unconst, rewritten in Python with:
- Agentic command discovery (NO hardcoded install commands)
- Language detection (rule-based, OK to hardcode)
- Difficulty filtering with LLM classification
- Docker verification of generated tests
- Full parallelism with semaphore-based concurrency
- Structured LLM outputs via OpenAI function calling
- 200k context auto-compaction with smart summarization
swe-forge connects to GH Archive to discover recently merged pull requests, enriches them via the GitHub API, classifies their difficulty using an LLM, discovers install/test commands agentically, generates test specifications via an agentic loop, and exports SWE-bench-compatible task instances.
| Feature | Description |
|---|---|
| 🔍 Real GitHub Data | Mines GH Archive for merged PRs across all public repositories |
| 🎯 Difficulty Filtering | Pre-classifies PRs as easy/medium/hard before expensive processing |
| 🤖 Agentic Discovery | Discovers install/test commands from CI/CD (NO hardcoding) |
| 📦 Docker Verification | Verifies tests in Docker before export |
| ⚡ Full Parallelism | GH Archive 8x, enrichment 20x, Docker 8x concurrent |
| 🧠 Smart Compaction | 200k context limit with structured summary templates |
| 📊 Complete Export | workspace.yaml + patch.diff + tests/ directory |
```bash
pip install swe-forge
```

Or install from source:

```bash
git clone https://github.com/CortexLM/swe-forge.git
cd swe-forge
pip install -e .
```

Or pull the prebuilt Docker image:

```bash
docker pull ghcr.io/cortexlm/swe-forge:latest
```

```bash
# Required environment variables
export GITHUB_TOKEN="ghp_..."             # GitHub PAT for PR enrichment
export OPENROUTER_API_KEY="sk-or-v1-..."  # OpenRouter API key for LLM
```

```bash
# Mine 10 tasks with workspace export
swe-forge mine mine \
  --limit 10 \
  --output ./tasks.jsonl \
  --output-folder ./tasks \
  --docker-username myuser \
  --parallel 8

# Mine with difficulty filter
swe-forge mine mine \
  --limit 5 \
  --difficulty hard \
  --min-stars 100

# Mine specific repository
swe-forge mine mine \
  --repo python/cpython \
  --limit 3
```

```bash
# Full A-Z pipeline with test verification
swe-forge mine complete \
  --repo owner/repo \
  --pr 12345 \
  --output ./tasks.jsonl \
  --model openai/gpt-5.4
```

```
tasks/
├── owner-repo-1234/
│   ├── workspace.yaml       # Complete task configuration
│   ├── patch.diff           # PR patch to apply
│   ├── test_patch.diff      # Test file changes
│   └── tests/               # Extracted test files
│       ├── test_feature.py
│       └── test_another.py
└── owner-repo-5678/
    └── ...
```
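The JSONL output can be post-processed with nothing but the standard library. A minimal sketch of a loader — whether the exported rows use exactly these field names (`task_id`, `language`) is an assumption based on the workspace example:

```python
import json
import tempfile
from pathlib import Path

def load_tasks(jsonl_path) -> list[dict]:
    """One JSON object per line; blank lines are skipped."""
    tasks = []
    for line in Path(jsonl_path).read_text().splitlines():
        if line.strip():
            tasks.append(json.loads(line))
    return tasks

# Synthetic two-task file standing in for a real export
out = Path(tempfile.mkdtemp()) / "tasks.jsonl"
out.write_text(
    '{"task_id": "owner-repo-1234", "language": "python"}\n'
    '{"task_id": "owner-repo-5678", "language": "rust"}\n'
)
print([t["task_id"] for t in load_tasks(out)])
# ['owner-repo-1234', 'owner-repo-5678']
```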
```yaml
task_id: owner-repo-1234
repo:
  url: https://github.com/owner/repo.git
  base_commit: abc123def456...
  merge_commit: fed456abc123...
language: python
difficulty_score: 5
prompt: "Fix the bug in..."
environment:
  image: myuser/swe-forge-tasks:owner-repo-1234
  language_version: "3.12"
install:
  commands:
    - pip install -e .
    - pip install pytest
tests:
  fail_to_pass:
    - pytest tests/test_feature.py -v
    - pytest tests/test_another.py::test_case -v
  pass_to_pass:
    - pytest tests/ -v --ignore=tests/test_feature.py
docker:
  image: myuser/swe-forge-tasks:owner-repo-1234
  build: true
```

```
swe-forge mine mine [OPTIONS]
```

| Option | Short | Default | Description |
|---|---|---|---|
| `--repo` | `-r` | All | Target repository (`owner/repo` format) |
| `--limit` | `-l` | 10 | Maximum tasks to mine |
| `--output` | `-o` | `./tasks.jsonl` | Output JSONL file |
| `--output-folder` | `-O` | None | Output folder for workspace format |
| `--docker-username` | `-D` | None | Docker Hub username for image names |
| `--parallel` | `-p` | 8 | Concurrent Docker containers |
| `--difficulty` | `-d` | All | Filter: easy, medium, hard |
| `--model` | `-m` | `moonshotai/kimi-k2.5` | LLM model for classification |
| `--min-stars` | | 100 | Minimum repository stars |
| `--language` | | python | Filter by language |
| `--filter` | `-f` | `{"easy":10,"medium":10,"hard":10}` | Max tasks per difficulty (JSON) |
| `--verbose` | `-v` | False | Enable verbose logging |
```
swe-forge mine complete [OPTIONS]
```

| Option | Short | Default | Description |
|---|---|---|---|
| `--repo` | `-r` | Required | Target repository (`owner/repo`) |
| `--pr` | `-p` | Required | Pull request number |
| `--output` | `-o` | `./tasks.jsonl` | Output file |
| `--model` | `-m` | `openai/gpt-5.4` | LLM model |
| `--verbose` | `-v` | False | Verbose logging |
```mermaid
sequenceDiagram
    participant GHA as GH Archive
    participant SF as swe-forge
    participant GH as GitHub API
    participant LLM as LLM
    participant D as Docker
    GHA->>SF: Merged PR events (8x concurrent)
    SF->>SF: Pre-filter (bots, org, stars)
    SF->>GH: Enrich candidates (20x concurrent)
    GH-->>SF: PR metadata + diff
    SF->>LLM: Classify difficulty
    LLM-->>SF: easy / medium / hard
    SF->>D: Agentic discovery (8x concurrent)
    D-->>SF: fail_to_pass + pass_to_pass
    SF->>LLM: Quality scoring
    LLM-->>SF: Accept / reject
    SF-->>SF: Export workspace.yaml
```
| Stage | Semaphore | Default | Description |
|---|---|---|---|
| GH Archive Fetch | `gh_archive_sem` | 8 | Download hourly dumps |
| GitHub Enrichment | `enrichment_sem` | 20 | Fetch PR metadata (5,000/h rate limit) |
| Pre-classification | `preclassify_sem` | 25 | LLM triage on title + body |
| Deep Processing | `deep_sem` | 8 | Full pipeline per candidate |
| Docker Containers | `docker_sem` | 8 | Concurrent test verification |
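Each per-stage cap above is an `asyncio.Semaphore`. A minimal sketch of the pattern — the worker body and item types here are hypothetical stand-ins:

```python
import asyncio

async def bounded_gather(items, worker, limit: int):
    """Run worker(item) for every item, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def guarded(item):
        async with sem:          # blocks while `limit` workers are in flight
            return await worker(item)

    return await asyncio.gather(*(guarded(i) for i in items))

async def main():
    enrichment_limit = 20        # matches the GitHub Enrichment default

    async def enrich(pr_number):
        await asyncio.sleep(0)   # stand-in for a GitHub API call
        return pr_number * 2

    return await bounded_gather(range(5), enrich, enrichment_limit)

print(asyncio.run(main()))  # [0, 2, 4, 6, 8]
```

`asyncio.gather` preserves input order, so results line up with candidates even when workers finish out of order.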
IMPORTANT: Commands are NEVER hardcoded.
```mermaid
sequenceDiagram
    participant AD as Agent Discovery
    participant CI as CI/CD Config
    participant LLM as LLM
    participant SH as Shell (Docker)
    AD->>CI: Parse .github/workflows/, .gitlab-ci.yml
    CI-->>AD: Install patterns, test commands
    AD->>SH: Clone repo in Docker
    AD->>LLM: "Discover how to install and test"
    loop Up to 200 turns
        LLM->>SH: shell("pip install -e .")
        SH-->>LLM: exit_code=0
        LLM->>SH: shell("pytest tests/")
        SH-->>LLM: exit_code=0, output
    end
    LLM->>AD: submit_tests(fail_to_pass, pass_to_pass)
```
- Clone repository at the base commit
- Detect language from files (`package.json`, `pyproject.toml`, `Cargo.toml`, etc.)
- Discover commands by:
  - Parsing CI/CD workflows
  - Reading package manager configs
  - Trying commands and checking exit codes
- Generate tests via an LLM agentic loop
- Verify tests fail before the patch (proves the bug exists)
- Apply the patch
- Verify tests pass after the patch (proves the fix works)
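The fail-before/pass-after check boils down to comparing exit codes around the patch. A toy sketch with `subprocess` — the flag file here is a stand-in for a real patch, not how swe-forge applies diffs:

```python
import os
import subprocess
import sys
import tempfile

def runs_green(cmd) -> bool:
    """True when the test command exits 0."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

# Stand-in "test": passes only once the flag file exists
flag = os.path.join(tempfile.mkdtemp(), "patched")
test_cmd = [sys.executable, "-c",
            f"import os, sys; sys.exit(0 if os.path.exists({flag!r}) else 1)"]

failed_before = not runs_green(test_cmd)   # bug reproduced pre-patch
open(flag, "w").close()                    # "apply the patch"
passed_after = runs_green(test_cmd)        # fix verified post-patch
print(failed_before and passed_after)      # True
```

A `fail_to_pass` test is only accepted when both conditions hold; a test that was already green proves nothing about the bug.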
| Level | Score Range | Typical Changes | Examples |
|---|---|---|---|
| Easy | 0.1 – 0.35 | Typos, config, single-file | Fix import, update version |
| Medium | 0.4 – 0.65 | Bug fixes, features, APIs | Fix race condition, add endpoint |
| Hard | 0.7 – 1.0 | Cross-cutting, architectural | New subsystem, migration |
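Mapping a score to a band is a simple threshold check. A sketch using the ranges above — how scores falling in the gaps between bands (e.g. 0.35–0.4) are assigned is an assumption:

```python
def difficulty_band(score: float) -> str:
    """Map a 0-1 difficulty score to easy/medium/hard per the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < 0.4:        # easy band is 0.1-0.35; the gap is assigned downward
        return "easy"
    if score < 0.7:        # medium band is 0.4-0.65
        return "medium"
    return "hard"          # hard band is 0.7-1.0

print([difficulty_band(s) for s in (0.2, 0.5, 0.9)])  # ['easy', 'medium', 'hard']
```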
- Pre-classification: `moonshotai/kimi-k2.5` (fast triage on title + body)
- Full classification: uses the complete diff and test spec
When context exceeds 200k tokens, the system uses structured summarization:
```markdown
## Goal
[What goal(s) is the user trying to accomplish?]

## Instructions
- [What important instructions did the user give you]
- [If there is a plan or spec, include information about it]

## Discoveries
[What notable things were learned during this conversation]

## Accomplished
[What work has been completed, in progress, and left?]

## Relevant files / directories
[Structured list of relevant files]
```

This preserves critical context across long agentic sessions.
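The trigger itself is simple accounting. A hypothetical sketch where tokens are approximated as characters/4 and `summarize` stands in for the LLM call that fills the template above:

```python
MAX_CONTEXT_TOKENS = 200_000

def estimate_tokens(messages: list[str]) -> int:
    """Rough heuristic: ~4 characters per token."""
    return sum(len(m) for m in messages) // 4

def maybe_compact(messages: list[str], summarize) -> list[str]:
    """Replace the transcript with one structured summary when over budget."""
    if estimate_tokens(messages) <= MAX_CONTEXT_TOKENS:
        return messages
    return [summarize(messages)]  # summary follows the structured template

# Toy run with a fake summarizer
fake_summary = lambda msgs: f"## Goal\n[compacted {len(msgs)} messages]"
long_history = ["x" * 1000] * 1000   # ~250k estimated tokens
print(len(maybe_compact(long_history, fake_summary)))  # 1
```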
| Variable | Required | Description |
|---|---|---|
| `GITHUB_TOKEN` | Yes | GitHub PAT for PR enrichment |
| `OPENROUTER_API_KEY` | Yes | OpenRouter API key for LLM calls |
| `HF_TOKEN` | No | HuggingFace token for dataset upload |
| `RUST_LOG` | No | Log level: debug, info, warn, error |
| Language | Detection | Package Managers |
|---|---|---|
| Python | pyproject.toml, setup.py, requirements.txt | pip, poetry, uv |
| JavaScript/TypeScript | package.json | npm, yarn, pnpm |
| Rust | Cargo.toml | cargo |
| Go | go.mod | go mod |
| Java | pom.xml, build.gradle | maven, gradle |
```bash
# Clone and install dev dependencies
git clone https://github.com/CortexLM/swe-forge.git
cd swe-forge
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install
```

```bash
# Run all tests
pytest tests/ -v

# Run specific test module
pytest tests/test_swe/test_pipeline.py -v

# Run with coverage
pytest tests/ --cov=src/swe_forge --cov-report=html
```

```bash
# Format
ruff format src/

# Lint
ruff check src/

# Type check
pyright src/
```

Benchmark run with 100 candidate PRs from GH Archive:
| Stage | Count | % of previous stage |
|---|---|---|
| Raw GH Archive events (12h) | 1,752,426 | 100% |
| Merged PR events | 35,498 | 2.03% |
| After pre-filter | 1,394 | 3.93% |
| Enriched successfully | 21 | 1.51% |
| Tests generated | 11 | 52.38% |
| Quality passed | 8 | 72.73% |
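Each percentage in the funnel is relative to the previous stage, which the counts bear out:

```python
funnel = [
    ("Raw GH Archive events (12h)", 1_752_426),
    ("Merged PR events", 35_498),
    ("After pre-filter", 1_394),
    ("Enriched successfully", 21),
    ("Tests generated", 11),
    ("Quality passed", 8),
]

# Each stage's percentage is its count over the previous stage's count
for (_, prev), (name, count) in zip(funnel, funnel[1:]):
    print(f"{name}: {100 * count / prev:.2f}%")
# Merged PR events: 2.03%
# After pre-filter: 3.93%
# Enriched successfully: 1.51%
# Tests generated: 52.38%
# Quality passed: 72.73%
```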
| Metric | Value |
|---|---|
| Tasks per hour | 8 |
| Avg time per task | 450s |
| Docker parallelism | 8 containers |
```python
from swe_forge.export.workspace import export_tasks_to_workspace
from swe_forge.swe.pipeline import SwePipeline, SwePipelineConfig

# Configure pipeline
config = SwePipelineConfig(
    max_candidates=50,
    max_tasks=10,
    min_stars=100,
    languages=["python"],
)

# Run pipeline
async with SwePipeline(config) as pipeline:
    result = await pipeline.run()

# Export to workspace format
export_tasks_to_workspace(
    result.tasks,
    output_folder="./tasks",
    docker_username="myuser",
)
```

```python
from swe_forge.swe.models import SweTask

@dataclass
class SweTask:
    id: str
    repo: str                 # owner/repo format
    base_commit: str          # Git SHA
    merge_commit: str         # Git SHA
    language: str             # python, rust, etc.
    difficulty_score: int     # 1-10
    patch: str                # Unified diff
    test_patch: str           # Test file changes
    fail_to_pass: list[str]   # Test commands
    pass_to_pass: list[str]   # Test commands
    install_config: dict      # Discovered install commands
    prompt: str               # Task description
    quality_score: float      # 0.0-1.0
    status: SweTaskStatus     # candidate, validated, etc.
```

Built on top of SweInfinite by @unconst.
Extended with:
- Python rewrite with full async support
- Agentic command discovery (NO hardcoding)
- Docker verification of generated tests
- Structured workspace export
- 200k context auto-compaction
- Configurable parallelism
MIT — see LICENSE.