High-performance SWE-bench dataset generator and evaluation harness that mines real GitHub pull requests, produces evaluation-ready task instances, and benchmarks coding agents.
Built on top of SweInfinite by @unconst, rewritten in Python with:
- Agentic command discovery (NO hardcoded install commands)
- Language detection (rule-based, OK to hardcode)
- Difficulty filtering with LLM classification
- Docker verification of generated tests
- Full parallelism with semaphore-based concurrency
- Structured LLM outputs via OpenAI function calling
- 200k context auto-compaction with smart summarization
swe-forge connects to GH Archive to discover recently merged pull requests, enriches them via the GitHub API, classifies their difficulty using an LLM, discovers install/test commands agentically, generates test specifications via an agentic loop, and exports SWE-bench-compatible task instances.
| Feature | Description |
|---|---|
| 🔍 Real GitHub Data | Mines GH Archive for merged PRs across all public repositories |
| 🎯 Difficulty Filtering | Pre-classifies PRs as easy/medium/hard before expensive processing |
| 🤖 Agentic Discovery | Discovers install/test commands from CI/CD (NO hardcoding) |
| 📦 Docker Verification | Verifies tests in Docker before export |
| ⚡ Full Parallelism | GH Archive 8x, enrichment 20x, Docker 8x concurrent |
| 🧠 Smart Compaction | 200k context limit with structured summary templates |
| 📊 Complete Export | workspace.yaml + patch.diff + tests/ directory |
```bash
pip install swe-forge
```

Or install from source:

```bash
git clone https://github.com/CortexLM/swe-forge.git
cd swe-forge
pip install -e .
```

Or pull the prebuilt Docker image:

```bash
docker pull ghcr.io/cortexlm/swe-forge:latest
```

```bash
# Required environment variables
export GITHUB_TOKEN="ghp_..."             # GitHub PAT for PR enrichment
export OPENROUTER_API_KEY="sk-or-v1-..."  # OpenRouter API key for LLM
```

```bash
# Mine 10 tasks with workspace export
swe-forge mine mine \
  --limit 10 \
  --output ./tasks.jsonl \
  --output-folder ./tasks \
  --docker-username myuser \
  --parallel 8

# Mine with difficulty filter
swe-forge mine mine \
  --limit 5 \
  --difficulty hard \
  --min-stars 100

# Mine specific repository
swe-forge mine mine \
  --repo python/cpython \
  --limit 3
```

```bash
# Full A-Z pipeline with test verification
swe-forge mine complete \
  --repo owner/repo \
  --pr 12345 \
  --output ./tasks.jsonl \
  --model openai/gpt-5.4
```

```
tasks/
├── owner-repo-1234/
│   ├── workspace.yaml       # Complete task configuration
│   ├── patch.diff           # PR patch to apply
│   ├── test_patch.diff      # Test file changes
│   └── tests/               # Extracted test files
│       ├── test_feature.py
│       └── test_another.py
└── owner-repo-5678/
    └── ...
```
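The JSONL output can be post-processed with nothing but the standard library. A minimal sketch of a loader — whether the exported rows use exactly these field names (`task_id`, `language`) is an assumption based on the workspace example:

```python
import json
import tempfile
from pathlib import Path

def load_tasks(jsonl_path) -> list[dict]:
    """One JSON object per line; blank lines are skipped."""
    tasks = []
    for line in Path(jsonl_path).read_text().splitlines():
        if line.strip():
            tasks.append(json.loads(line))
    return tasks

# Synthetic two-task file standing in for a real export
out = Path(tempfile.mkdtemp()) / "tasks.jsonl"
out.write_text(
    '{"task_id": "owner-repo-1234", "language": "python"}\n'
    '{"task_id": "owner-repo-5678", "language": "rust"}\n'
)
print([t["task_id"] for t in load_tasks(out)])
# ['owner-repo-1234', 'owner-repo-5678']
```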
```yaml
task_id: owner-repo-1234
repo:
  url: https://github.com/owner/repo.git
  base_commit: abc123def456...
  merge_commit: fed456abc123...
language: python
difficulty_score: 5
prompt: "Fix the bug in..."
environment:
  image: myuser/swe-forge-tasks:owner-repo-1234
  language_version: "3.12"
install:
  commands:
    - pip install -e .
    - pip install pytest
tests:
  fail_to_pass:
    - pytest tests/test_feature.py -v
    - pytest tests/test_another.py::test_case -v
  pass_to_pass:
    - pytest tests/ -v --ignore=tests/test_feature.py
docker:
  image: myuser/swe-forge-tasks:owner-repo-1234
  build: true
```

```
swe-forge mine mine [OPTIONS]
```

| Option | Short | Default | Description |
|---|---|---|---|
| `--repo` | `-r` | All | Target repository (`owner/repo` format) |
| `--limit` | `-l` | 10 | Maximum tasks to mine |
| `--output` | `-o` | `./tasks.jsonl` | Output JSONL file |
| `--output-folder` | `-O` | None | Output folder for workspace format |
| `--docker-username` | `-D` | None | Docker Hub username for image names |
| `--parallel` | `-p` | 8 | Concurrent Docker containers |
| `--difficulty` | `-d` | All | Filter: easy, medium, hard |
| `--model` | `-m` | `moonshotai/kimi-k2.5` | LLM model for classification |
| `--min-stars` | | 100 | Minimum repository stars |
| `--language` | | python | Filter by language |
| `--filter` | `-f` | `{"easy":10,"medium":10,"hard":10}` | Max tasks per difficulty (JSON) |
| `--verbose` | `-v` | False | Enable verbose logging |
```
swe-forge mine complete [OPTIONS]
```

| Option | Short | Default | Description |
|---|---|---|---|
| `--repo` | `-r` | Required | Target repository (`owner/repo`) |
| `--pr` | `-p` | Required | Pull request number |
| `--output` | `-o` | `./tasks.jsonl` | Output file |
| `--model` | `-m` | `openai/gpt-5.4` | LLM model |
| `--verbose` | `-v` | False | Verbose logging |
```mermaid
sequenceDiagram
    participant GHA as GH Archive
    participant SF as swe-forge
    participant GH as GitHub API
    participant LLM as LLM
    participant D as Docker
    GHA->>SF: Merged PR events (8x concurrent)
    SF->>SF: Pre-filter (bots, org, stars)
    SF->>GH: Enrich candidates (20x concurrent)
    GH-->>SF: PR metadata + diff
    SF->>LLM: Classify difficulty
    LLM-->>SF: easy / medium / hard
    SF->>D: Agentic discovery (8x concurrent)
    D-->>SF: fail_to_pass + pass_to_pass
    SF->>LLM: Quality scoring
    LLM-->>SF: Accept / reject
    SF-->>SF: Export workspace.yaml
```
| Stage | Semaphore | Default | Description |
|---|---|---|---|
| GH Archive Fetch | `gh_archive_sem` | 8 | Download hourly dumps |
| GitHub Enrichment | `enrichment_sem` | 20 | Fetch PR metadata (5,000/h rate limit) |
| Pre-classification | `preclassify_sem` | 25 | LLM triage on title + body |
| Deep Processing | `deep_sem` | 8 | Full pipeline per candidate |
| Docker Containers | `docker_sem` | 8 | Concurrent test verification |
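Each per-stage cap above is an `asyncio.Semaphore`. A minimal sketch of the pattern — the worker body and item types here are hypothetical stand-ins:

```python
import asyncio

async def bounded_gather(items, worker, limit: int):
    """Run worker(item) for every item, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def guarded(item):
        async with sem:          # blocks while `limit` workers are in flight
            return await worker(item)

    return await asyncio.gather(*(guarded(i) for i in items))

async def main():
    enrichment_limit = 20        # matches the GitHub Enrichment default

    async def enrich(pr_number):
        await asyncio.sleep(0)   # stand-in for a GitHub API call
        return pr_number * 2

    return await bounded_gather(range(5), enrich, enrichment_limit)

print(asyncio.run(main()))  # [0, 2, 4, 6, 8]
```

`asyncio.gather` preserves input order, so results line up with candidates even when workers finish out of order.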
IMPORTANT: Commands are NEVER hardcoded.
```mermaid
sequenceDiagram
    participant AD as Agent Discovery
    participant CI as CI/CD Config
    participant LLM as LLM
    participant SH as Shell (Docker)
    AD->>CI: Parse .github/workflows/, .gitlab-ci.yml
    CI-->>AD: Install patterns, test commands
    AD->>SH: Clone repo in Docker
    AD->>LLM: "Discover how to install and test"
    loop Up to 200 turns
        LLM->>SH: shell("pip install -e .")
        SH-->>LLM: exit_code=0
        LLM->>SH: shell("pytest tests/")
        SH-->>LLM: exit_code=0, output
    end
    LLM->>AD: submit_tests(fail_to_pass, pass_to_pass)
```
- Clone repository at the base commit
- Detect language from files (`package.json`, `pyproject.toml`, `Cargo.toml`, etc.)
- Discover commands by:
  - Parsing CI/CD workflows
  - Reading package manager configs
  - Trying commands and checking exit codes
- Generate tests via an LLM agentic loop
- Verify tests fail before the patch (proves the bug exists)
- Apply the patch
- Verify tests pass after the patch (proves the fix works)
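The fail-before/pass-after check boils down to comparing exit codes around the patch. A toy sketch with `subprocess` — the flag file here is a stand-in for a real patch, not how swe-forge applies diffs:

```python
import os
import subprocess
import sys
import tempfile

def runs_green(cmd) -> bool:
    """True when the test command exits 0."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

# Stand-in "test": passes only once the flag file exists
flag = os.path.join(tempfile.mkdtemp(), "patched")
test_cmd = [sys.executable, "-c",
            f"import os, sys; sys.exit(0 if os.path.exists({flag!r}) else 1)"]

failed_before = not runs_green(test_cmd)   # bug reproduced pre-patch
open(flag, "w").close()                    # "apply the patch"
passed_after = runs_green(test_cmd)        # fix verified post-patch
print(failed_before and passed_after)      # True
```

A `fail_to_pass` test is only accepted when both conditions hold; a test that was already green proves nothing about the bug.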
| Level | Score Range | Typical Changes | Examples |
|---|---|---|---|
| Easy | 0.1 – 0.35 | Typos, config, single-file | Fix import, update version |
| Medium | 0.4 – 0.65 | Bug fixes, features, APIs | Fix race condition, add endpoint |
| Hard | 0.7 – 1.0 | Cross-cutting, architectural | New subsystem, migration |
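Mapping a score to a band is a simple threshold check. A sketch using the ranges above — how scores falling in the gaps between bands (e.g. 0.35–0.4) are assigned is an assumption:

```python
def difficulty_band(score: float) -> str:
    """Map a 0-1 difficulty score to easy/medium/hard per the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < 0.4:        # easy band is 0.1-0.35; the gap is assigned downward
        return "easy"
    if score < 0.7:        # medium band is 0.4-0.65
        return "medium"
    return "hard"          # hard band is 0.7-1.0

print([difficulty_band(s) for s in (0.2, 0.5, 0.9)])  # ['easy', 'medium', 'hard']
```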
- Pre-classification: `moonshotai/kimi-k2.5` (fast triage on title + body)
- Full classification: uses the complete diff and test spec
When context exceeds 200k tokens, the system uses structured summarization:
```markdown
## Goal
[What goal(s) is the user trying to accomplish?]

## Instructions
- [What important instructions did the user give you]
- [If there is a plan or spec, include information about it]

## Discoveries
[What notable things were learned during this conversation]

## Accomplished
[What work has been completed, in progress, and left?]

## Relevant files / directories
[Structured list of relevant files]
```

This preserves critical context across long agentic sessions.
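The trigger itself is simple accounting. A hypothetical sketch where tokens are approximated as characters/4 and `summarize` stands in for the LLM call that fills the template above:

```python
MAX_CONTEXT_TOKENS = 200_000

def estimate_tokens(messages: list[str]) -> int:
    """Rough heuristic: ~4 characters per token."""
    return sum(len(m) for m in messages) // 4

def maybe_compact(messages: list[str], summarize) -> list[str]:
    """Replace the transcript with one structured summary when over budget."""
    if estimate_tokens(messages) <= MAX_CONTEXT_TOKENS:
        return messages
    return [summarize(messages)]  # summary follows the structured template

# Toy run with a fake summarizer
fake_summary = lambda msgs: f"## Goal\n[compacted {len(msgs)} messages]"
long_history = ["x" * 1000] * 1000   # ~250k estimated tokens
print(len(maybe_compact(long_history, fake_summary)))  # 1
```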
| Variable | Required | Description |
|---|---|---|
| `GITHUB_TOKEN` | Yes | GitHub PAT for PR enrichment |
| `OPENROUTER_API_KEY` | Yes | OpenRouter API key for LLM calls |
| `HF_TOKEN` | No | HuggingFace token for dataset upload |
| `RUST_LOG` | No | Log level: debug, info, warn, error |
| Language | Detection | Package Managers |
|---|---|---|
| Python | pyproject.toml, setup.py, requirements.txt | pip, poetry, uv |
| JavaScript/TypeScript | package.json | npm, yarn, pnpm |
| Rust | Cargo.toml | cargo |
| Go | go.mod | go mod |
| Java | pom.xml, build.gradle | maven, gradle |
```bash
# Clone and install dev dependencies
git clone https://github.com/CortexLM/swe-forge.git
cd swe-forge
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install
```

```bash
# Run all tests
pytest tests/ -v

# Run specific test module
pytest tests/test_swe/test_pipeline.py -v

# Run with coverage
pytest tests/ --cov=src/swe_forge --cov-report=html
```

```bash
# Format
ruff format src/

# Lint
ruff check src/

# Type check
pyright src/
```

Benchmark run with 100 candidate PRs from GH Archive:
| Stage | Count | % of previous stage |
|---|---|---|
| Raw GH Archive events (12h) | 1,752,426 | 100% |
| Merged PR events | 35,498 | 2.03% |
| After pre-filter | 1,394 | 3.93% |
| Enriched successfully | 21 | 1.51% |
| Tests generated | 11 | 52.38% |
| Quality passed | 8 | 72.73% |
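Each percentage in the funnel is relative to the previous stage, which the counts bear out:

```python
funnel = [
    ("Raw GH Archive events (12h)", 1_752_426),
    ("Merged PR events", 35_498),
    ("After pre-filter", 1_394),
    ("Enriched successfully", 21),
    ("Tests generated", 11),
    ("Quality passed", 8),
]

# Each stage's percentage is its count over the previous stage's count
for (_, prev), (name, count) in zip(funnel, funnel[1:]):
    print(f"{name}: {100 * count / prev:.2f}%")
# Merged PR events: 2.03%
# After pre-filter: 3.93%
# Enriched successfully: 1.51%
# Tests generated: 52.38%
# Quality passed: 72.73%
```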
| Metric | Value |
|---|---|
| Tasks per hour | 8 |
| Avg time per task | 450s |
| Docker parallelism | 8 containers |
```python
from swe_forge.export.workspace import export_tasks_to_workspace
from swe_forge.swe.pipeline import SwePipeline, SwePipelineConfig

# Configure pipeline
config = SwePipelineConfig(
    max_candidates=50,
    max_tasks=10,
    min_stars=100,
    languages=["python"],
)

# Run pipeline
async with SwePipeline(config) as pipeline:
    result = await pipeline.run()

# Export to workspace format
export_tasks_to_workspace(
    result.tasks,
    output_folder="./tasks",
    docker_username="myuser",
)
```

```python
from swe_forge.swe.models import SweTask

@dataclass
class SweTask:
    id: str
    repo: str                 # owner/repo format
    base_commit: str          # Git SHA
    merge_commit: str         # Git SHA
    language: str             # python, rust, etc.
    difficulty_score: int     # 1-10
    patch: str                # Unified diff
    test_patch: str           # Test file changes
    fail_to_pass: list[str]   # Test commands
    pass_to_pass: list[str]   # Test commands
    install_config: dict      # Discovered install commands
    prompt: str               # Task description
    quality_score: float      # 0.0-1.0
    status: SweTaskStatus     # candidate, validated, etc.
```

Built on top of SweInfinite by @unconst.
Extended with:
- Python rewrite with full async support
- Agentic command discovery (NO hardcoding)
- Docker verification of generated tests
- Structured workspace export
- 200k context auto-compaction
- Configurable parallelism
MIT — see LICENSE.