Automatically discover high-performance agent skills for any task!
- 🧠 What is EvoSkill?
- 🏗️ How It Works
- 📦 Installation & Setup
- 🐍 Python API
- ⚡ Quickstart: Running the Self-Improvement Loop
- 📊 Running Evaluations
- 🔑 Key Concepts
- 🧩 Extending EvoSkill: Adding a New Task
- 📚 Citation
- 📄 License
EvoSkill is a self-improving agent framework that automatically discovers high-performance skills for AI agents. Rather than relying on manual prompt engineering, EvoSkill runs an evolutionary loop that tests an agent on benchmark questions, identifies failure patterns, proposes improvements (new skills or prompt mutations), evaluates the changes, and keeps the best-performing variants.
The core insight is simple: treat agent configurations as programs that can be iterated on automatically. Each "program" is a versioned combination of a system prompt and a set of skills. EvoSkill maintains a frontier of the top-N performing programs, uses failures to drive targeted improvements, and tracks everything through git branches for full reproducibility.
EvoSkill has been validated on multiple benchmarks including DABStep (data analysis), SEAL-QA (search-augmented QA), and OfficeQA, demonstrating that automated skill discovery can match or exceed hand-tuned agent configurations.
The self-improvement loop follows five stages:
- Base Agent — Attempts benchmark questions using the current best program (system prompt + skills).
- Proposer — Analyzes failure cases and proposes targeted skill or prompt changes to address them.
- Generator — Creates the proposed changes: writes new skill files or rewrites the system prompt.
- Evaluator — Scores the new program variant on a held-out validation set to measure improvement.
- Frontier — Tracks the top-N performing programs as git branches; the best survive to the next iteration.
This cycle repeats for a configurable number of iterations, automatically converging on stronger agent configurations.
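Conceptually, the loop is a small evolutionary search over agent programs. The sketch below is an illustrative stand-in — the `evolve`, `evaluate`, and `mutate` names are toy functions for intuition, not the EvoSkill API:

```python
import heapq
import random

def evolve(base_program, evaluate, mutate, iterations=20, frontier_size=3):
    """Toy five-stage loop: score the base program, propose/generate a
    variant of the current best, score it, and keep the top-N frontier."""
    frontier = [(evaluate(base_program), base_program)]
    for _ in range(iterations):
        _, best = max(frontier)                             # Base Agent
        candidate = mutate(best)                            # Proposer + Generator
        frontier.append((evaluate(candidate), candidate))   # Evaluator
        frontier = heapq.nlargest(frontier_size, frontier)  # Frontier
    return max(frontier)

# Toy search space: a "program" is a number, and the score peaks at 42.
random.seed(0)
best_score, best_program = evolve(
    base_program=0.0,
    evaluate=lambda p: -abs(p - 42),
    mutate=lambda p: p + random.uniform(-5, 5),
)
```

In EvoSkill the "mutation" is an LLM-proposed skill or prompt change, and evaluation is a benchmark run, but the keep-the-best-N structure is the same.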
Requirements:
- Python 3.12+
- `uv` (recommended) or `pip`
- Docker (for LiveCodeBench evaluation with a secure code execution sandbox)
Install dependencies:
```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install -e .
```

Environment variables: either log in to Claude Code or use your API key:

```bash
# Required — used by the Claude agent SDK
export ANTHROPIC_API_KEY=your-key-here
```

SDK and Model Selection:

Use `--sdk` and `--model` to configure which SDK and model to use:

```bash
# Claude SDK (default)
uv run python scripts/run_eval.py --sdk claude --model claude-sonnet-4-5-20250514

# OpenCode SDK with different models
uv run python scripts/run_eval.py --sdk opencode --model deepseek-ai/DeepSeek-V3
uv run python scripts/run_eval.py --sdk opencode --model google/gemini-2.0-flash-exp
```

Dataset preparation:

Place your benchmark datasets in the `.dataset/` directory:

- DABStep: `.dataset/dabstep_data.csv`
- SEAL-QA: `.dataset/seal-0.csv`
- OfficeQA: see `scripts/run_eval.py` for the expected path
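Before launching a long run, it can help to confirm the expected files are in place. A small illustrative helper (not part of EvoSkill):

```python
from pathlib import Path

def missing_datasets(paths):
    """Return the expected dataset files that are not present yet."""
    return [p for p in paths if not Path(p).exists()]

print(missing_datasets([".dataset/dabstep_data.csv", ".dataset/seal-0.csv"]))
```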
EvoSkill provides a high-level Python API that reduces the boilerplate needed to run the self-improvement loop or standalone evaluations to just a few lines.
```python
from src.api import EvoSkill

# Minimal — uses task defaults
result = await EvoSkill(dataset=".dataset/seal-0.csv", task="sealqa").run()

# Full configuration
evo = EvoSkill(
    task="sealqa",
    model="sonnet",
    mode="skill_only",
    max_iterations=20,
    frontier_size=3,
    concurrency=4,
    train_ratio=0.18,
    val_ratio=0.12,
    continue_mode=False,
)
result = await evo.run()

# Synchronous usage (wraps asyncio.run)
result = EvoSkill(task="base").run_sync()

# Preview dataset splits without running
print(evo.dataset_info)
```

Standalone evaluation:

```python
from src.api import EvalRunner

summary = await EvalRunner(
    task="sealqa",
    model="sonnet",
    max_concurrent=8,
).run()
print(f"Accuracy: {summary.accuracy:.1%} ({summary.correct}/{summary.successful})")
```

Three tasks are registered out of the box:
| Task | Agent | Default Dataset | Scorer |
|---|---|---|---|
| `"base"` | Base agent | `.dataset/new_runs_base/solved_dataset.csv` | Multi-tolerance (default) |
| `"dabstep"` | DABStep agent | `.dataset/dabstep_data.csv` | Multi-tolerance (default) |
| `"sealqa"` | SEAL-QA agent | `.dataset/seal-0.csv` | LLM-graded (GPT) |
```python
from src.api import list_tasks

print(list_tasks())  # ['base', 'dabstep', 'sealqa']
```

Registering a custom task:

```python
from src.api import TaskConfig, register_task

register_task(TaskConfig(
    name="my_task",
    make_agent_options=make_my_agent_options,
    scorer=my_scorer_fn,  # (question, predicted, ground_truth) -> float
    column_renames={"label": "ground_truth", "topic": "category"},
    default_dataset=".dataset/my_data.csv",
))
result = await EvoSkill(task="my_task").run()
```

The CLI scripts remain available for users who prefer the command line.
Run the evolutionary skill discovery loop on a benchmark:
OfficeQA:
```bash
python scripts/run_loop.py --mode skill_only --max-iterations 20
```

SEAL-QA:

```bash
python scripts/run_loop_sealqa.py --mode skill_only --max-iterations 20
```

Key CLI flags:
| Flag | Description | Default |
|---|---|---|
| `--mode` | Evolution mode: `skill_only` or `prompt_only` | `skill_only` |
| `--max-iterations` | Number of improvement iterations | 20 |
| `--frontier-size` | Number of top programs to keep | 3 |
| `--concurrency` | Concurrent evaluations | 4 |
| `--continue` | Resume from existing frontier | off |
| `--no-cache` | Disable run caching | off |
| `--model` | Base agent model (`opus`, `sonnet`, `haiku`) | `opus` |
Evaluate an agent configuration on a full benchmark dataset:
OfficeQA:
```bash
python scripts/run_eval.py --model opus --max-concurrent 8
```

SEAL-QA:

```bash
python scripts/run_eval_sealqa.py --model opus --max-concurrent 8
```

Common eval flags: `--output <path>`, `--max-concurrent <n>`, `--num-samples <n>`, `--no-resume`.
- Program — A versioned agent configuration (system prompt + skills), stored as a git branch.
- Frontier — The top-N highest-scoring programs, tracked via git tags and branches.
- Evolution Mode — `skill_only` discovers new reusable skills; `prompt_only` optimizes the system prompt directly.
- Skill — A reusable capability file written to `.claude/skills/` that the agent can invoke during execution.
- Proposer — Analyzes agent failures and suggests what skill or prompt change would help.
- Generator — Takes a proposal and produces the actual skill file or prompt rewrite.
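For intuition, a skill is typically a short instruction file the agent can load on demand. The name and contents below are hypothetical; the real format is whatever the Generator writes under `.claude/skills/`:

```markdown
---
name: verify-units
description: Sanity-check units and magnitudes before giving a numeric answer
---

When a question asks for a numeric value:
1. Restate the unit the question expects.
2. Compute the answer, then check its order of magnitude against the data.
3. Convert to the requested unit before responding.
```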
EvoSkill is designed to be extended to new benchmarks. There are two approaches: using the Python API (recommended) or creating standalone scripts.
Add a new directory under `src/agent_profiles/` for your task:

```
src/agent_profiles/my_task_agent/
├── __init__.py
├── my_task_agent.py   # Options factory
└── prompt.txt         # (optional) task-specific system prompt
```
Your agent module should expose a `make_*_agent_options` factory that returns `ClaudeAgentOptions`. See `src/agent_profiles/dabstep_agent/dabstep_agent.py` or `src/agent_profiles/sealqa_agent/sealqa_agent.py` for reference.
Then register the exports in `src/agent_profiles/__init__.py`.
Add a scorer under `src/evaluation/` that compares the agent's output to ground truth:
```python
# src/evaluation/my_task_scorer.py
def score_my_task(question: str, predicted: str, ground_truth: str) -> float:
    """Return 1.0 if correct, 0.0 otherwise."""
    return 1.0 if predicted.strip().lower() == ground_truth.strip().lower() else 0.0
```

For more complex grading (e.g. partial credit or LLM-based judging), see `src/evaluation/sealqa_scorer.py`. If no scorer is provided, the default multi-tolerance scorer is used.
```python
from src.api import TaskConfig, register_task, EvoSkill, EvalRunner

register_task(TaskConfig(
    name="my_task",
    make_agent_options=make_my_task_agent_options,
    scorer=score_my_task,
    column_renames={"label": "ground_truth", "topic": "category"},
    default_dataset=".dataset/my_data.csv",
))

# Run the self-improvement loop
result = await EvoSkill(task="my_task").run()

# Or run a standalone evaluation
summary = await EvalRunner(task="my_task", model="sonnet").run()
```

You can also create scripts directly under `scripts/` following the existing patterns.
Evaluation script — loads your dataset and runs `evaluate_full()`:

```python
from src.agent_profiles import Agent, make_my_task_agent_options
from src.evaluation.eval_full import evaluate_full
from src.schemas import AgentResponse

agent = Agent(make_my_task_agent_options(model="opus"), AgentResponse)
results = await evaluate_full(agent=agent, items=items, output_path=output, ...)
```

See `scripts/run_eval_dabstep.py` for a complete example.
Loop script — follow the pattern in `scripts/run_loop.py`. The key ingredients are:

- A dataset split function (train set for failure analysis, validation set for scoring)
- Your agent options factory and scorer wired into `SelfImprovingLoop`
- A `LoopConfig` with your chosen mode (`skill_only` or `prompt_only`)
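The first ingredient, the dataset split, can be as simple as a seeded shuffle. A sketch using the default ratios from the Python API example above (illustrative, not the project's actual split helper):

```python
import random

def split_dataset(items, train_ratio=0.18, val_ratio=0.12, seed=0):
    """Shuffle once with a fixed seed, then carve off a train slice (for
    failure analysis) and a validation slice (for scoring); the rest is
    held out for final evaluation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_ratio)
    n_val = int(len(items) * val_ratio)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    holdout = items[n_train + n_val:]
    return train, val, holdout

train, val, holdout = split_dataset(range(100))
print(len(train), len(val), len(holdout))  # 18 12 70
```

Fixing the seed keeps the split stable across iterations, so frontier scores remain comparable from one iteration to the next.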
If you use EvoSkill in your research, please cite:
```bibtex
@software{al_zubi_2025_17052592,
  author    = {Alzubi, Salaheddin and},
  title     = {SentientResearchAgent: A Hierarchical AI Agent Framework for Research and Analysis},
  month     = sep,
  year      = 2025,
  publisher = {Zenodo},
  version   = {ROMA},
  doi       = {10.5281/zenodo.17052592},
  url       = {https://doi.org/10.5281/zenodo.17052592},
  swhid     = {swh:1:dir:69cd1552103e0333dd0c39fc4f53cb03196017ce;origin=https://doi.org/10.5281/zenodo.17052591;visit=swh:1:snp:f50bf99634f9876adb80c027361aec9dff973433;anchor=swh:1:rel:afa7caa843ce1279f5b4b29b5d3d5e3fe85edc95;path=salzubi401-ROMA-b31c382},
}
```

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

