Automatically discover high-performance agent skills for any task!
- 🧠 What is EvoSkill?
- 🏗️ How It Works
- 📦 Installation & Setup
- 🐍 Python API
- ⚡ Quickstart: Running the Self-Improvement Loop
- 📊 Running Evaluations
- 🔑 Key Concepts
- 🧩 Extending EvoSkill: Adding a New Task
- 📚 Citation
- 📄 License
EvoSkill is a self-improving agent framework that automatically discovers high-performance skills for AI agents. Rather than relying on manual prompt engineering, EvoSkill runs an evolutionary loop that tests an agent on benchmark questions, identifies failure patterns, proposes improvements (new skills or prompt mutations), evaluates the changes, and keeps the best-performing variants.
The core insight is simple: treat agent configurations as programs that can be iterated on automatically. Each "program" is a versioned combination of a system prompt and a set of skills. EvoSkill maintains a frontier of the top-N performing programs, uses failures to drive targeted improvements, and tracks everything through git branches for full reproducibility.
EvoSkill has been validated on multiple benchmarks including DABStep (data analysis), SEAL-QA (search-augmented QA), and OfficeQA, demonstrating that automated skill discovery can match or exceed hand-tuned agent configurations.
The self-improvement loop follows five stages:
- Base Agent — Attempts benchmark questions using the current best program (system prompt + skills).
- Proposer — Analyzes failure cases and proposes targeted skill or prompt changes to address them.
- Generator — Creates the proposed changes: writes new skill files or rewrites the system prompt.
- Evaluator — Scores the new program variant on a held-out validation set to measure improvement.
- Frontier — Tracks the top-N performing programs as git branches; the best survive to the next iteration.
This cycle repeats for a configurable number of iterations, automatically converging on stronger agent configurations.
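Conceptually, the loop is a small evolutionary search over agent programs. The sketch below is an illustrative stand-in — the `evolve`, `evaluate`, and `mutate` names are toy functions for intuition, not the EvoSkill API:

```python
import heapq
import random

def evolve(base_program, evaluate, mutate, iterations=20, frontier_size=3):
    """Toy five-stage loop: score the base program, propose/generate a
    variant of the current best, score it, and keep the top-N frontier."""
    frontier = [(evaluate(base_program), base_program)]
    for _ in range(iterations):
        _, best = max(frontier)                             # Base Agent
        candidate = mutate(best)                            # Proposer + Generator
        frontier.append((evaluate(candidate), candidate))   # Evaluator
        frontier = heapq.nlargest(frontier_size, frontier)  # Frontier
    return max(frontier)

# Toy search space: a "program" is a number, and the score peaks at 42.
random.seed(0)
best_score, best_program = evolve(
    base_program=0.0,
    evaluate=lambda p: -abs(p - 42),
    mutate=lambda p: p + random.uniform(-5, 5),
)
```

In EvoSkill the "mutation" is an LLM-proposed skill or prompt change, and evaluation is a benchmark run, but the keep-the-best-N structure is the same.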
Requirements:
- Python 3.12+
- `uv` (recommended) or `pip`
- Docker (for LiveCodeBench evaluation with a secure code execution sandbox)
Install dependencies:
```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install -e .
```

Environment variables: either log in to Claude Code or use your API key:

```bash
# Required — used by the Claude agent SDK
export ANTHROPIC_API_KEY=your-key-here
```

SDK and Model Selection:

Use `--sdk` and `--model` to configure which SDK and model to use:

```bash
# Claude SDK (default)
uv run python scripts/run_eval.py --sdk claude --model claude-sonnet-4-5-20250514

# OpenCode SDK with different models
uv run python scripts/run_eval.py --sdk opencode --model deepseek-ai/DeepSeek-V3
uv run python scripts/run_eval.py --sdk opencode --model google/gemini-2.0-flash-exp
```

Dataset preparation:

Place your benchmark datasets in the `.dataset/` directory:

- DABStep: `.dataset/dabstep_data.csv`
- SEAL-QA: `.dataset/seal-0.csv`
- OfficeQA: see `scripts/run_eval.py` for the expected path
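Before launching a long run, it can help to confirm the expected files are in place. A small illustrative helper (not part of EvoSkill):

```python
from pathlib import Path

def missing_datasets(paths):
    """Return the expected dataset files that are not present yet."""
    return [p for p in paths if not Path(p).exists()]

print(missing_datasets([".dataset/dabstep_data.csv", ".dataset/seal-0.csv"]))
```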
EvoSkill provides a high-level Python API that reduces the boilerplate needed to run the self-improvement loop or standalone evaluations to just a few lines.
```python
from src.api import EvoSkill

# Minimal — uses task defaults
result = await EvoSkill(dataset=".dataset/seal-0.csv", task="sealqa").run()

# Full configuration
evo = EvoSkill(
    task="sealqa",
    model="sonnet",
    mode="skill_only",
    max_iterations=20,
    frontier_size=3,
    concurrency=4,
    train_ratio=0.18,
    val_ratio=0.12,
    continue_mode=False,
)
result = await evo.run()

# Synchronous usage (wraps asyncio.run)
result = EvoSkill(task="base").run_sync()

# Preview dataset splits without running
print(evo.dataset_info)
```

Standalone evaluation:

```python
from src.api import EvalRunner

summary = await EvalRunner(
    task="sealqa",
    model="sonnet",
    max_concurrent=8,
).run()
print(f"Accuracy: {summary.accuracy:.1%} ({summary.correct}/{summary.successful})")
```

Three tasks are registered out of the box:
| Task | Agent | Default Dataset | Scorer |
|---|---|---|---|
| `"base"` | Base agent | `.dataset/new_runs_base/solved_dataset.csv` | Multi-tolerance (default) |
| `"dabstep"` | DABStep agent | `.dataset/dabstep_data.csv` | Multi-tolerance (default) |
| `"sealqa"` | SEAL-QA agent | `.dataset/seal-0.csv` | LLM-graded (GPT) |
```python
from src.api import list_tasks

print(list_tasks())  # ['base', 'dabstep', 'sealqa']
```

Registering a custom task:

```python
from src.api import TaskConfig, register_task

register_task(TaskConfig(
    name="my_task",
    make_agent_options=make_my_agent_options,
    scorer=my_scorer_fn,  # (question, predicted, ground_truth) -> float
    column_renames={"label": "ground_truth", "topic": "category"},
    default_dataset=".dataset/my_data.csv",
))
result = await EvoSkill(task="my_task").run()
```

The CLI scripts remain available for users who prefer the command line.
Run the evolutionary skill discovery loop on a benchmark:
OfficeQA:
```bash
python scripts/run_loop.py --mode skill_only --max-iterations 20
```

SEAL-QA:

```bash
python scripts/run_loop_sealqa.py --mode skill_only --max-iterations 20
```

Key CLI flags:
| Flag | Description | Default |
|---|---|---|
| `--mode` | Evolution mode: `skill_only` or `prompt_only` | `skill_only` |
| `--max-iterations` | Number of improvement iterations | 20 |
| `--frontier-size` | Number of top programs to keep | 3 |
| `--concurrency` | Concurrent evaluations | 4 |
| `--continue` | Resume from existing frontier | off |
| `--no-cache` | Disable run caching | off |
| `--model` | Base agent model (`opus`, `sonnet`, `haiku`) | `opus` |
Evaluate an agent configuration on a full benchmark dataset:
OfficeQA:
```bash
python scripts/run_eval.py --model opus --max-concurrent 8
```

SEAL-QA:

```bash
python scripts/run_eval_sealqa.py --model opus --max-concurrent 8
```

Common eval flags: `--output <path>`, `--max-concurrent <n>`, `--num-samples <n>`, `--no-resume`.
- Program — A versioned agent configuration (system prompt + skills), stored as a git branch.
- Frontier — The top-N highest-scoring programs, tracked via git tags and branches.
- Evolution Mode — `skill_only` discovers new reusable skills; `prompt_only` optimizes the system prompt directly.
- Skill — A reusable capability file written to `.claude/skills/` that the agent can invoke during execution.
- Proposer — Analyzes agent failures and suggests what skill or prompt change would help.
- Generator — Takes a proposal and produces the actual skill file or prompt rewrite.
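For intuition, a skill is typically a short instruction file the agent can load on demand. The name and contents below are hypothetical; the real format is whatever the Generator writes under `.claude/skills/`:

```markdown
---
name: verify-units
description: Sanity-check units and magnitudes before giving a numeric answer
---

When a question asks for a numeric value:
1. Restate the unit the question expects.
2. Compute the answer, then check its order of magnitude against the data.
3. Convert to the requested unit before responding.
```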
EvoSkill is designed to be extended to new benchmarks. There are two approaches: using the Python API (recommended) or creating standalone scripts.
Add a new directory under `src/agent_profiles/` for your task:

```
src/agent_profiles/my_task_agent/
├── __init__.py
├── my_task_agent.py   # Options factory
└── prompt.txt         # (optional) task-specific system prompt
```
Your agent module should expose a `make_*_agent_options` factory that returns `ClaudeAgentOptions`. See `src/agent_profiles/dabstep_agent/dabstep_agent.py` or `src/agent_profiles/sealqa_agent/sealqa_agent.py` for reference.
Then register the exports in `src/agent_profiles/__init__.py`.
Add a scorer under `src/evaluation/` that compares the agent's output to ground truth:
```python
# src/evaluation/my_task_scorer.py
def score_my_task(question: str, predicted: str, ground_truth: str) -> float:
    """Return 1.0 if correct, 0.0 otherwise."""
    return 1.0 if predicted.strip().lower() == ground_truth.strip().lower() else 0.0
```

For more complex grading (e.g. partial credit or LLM-based judging), see `src/evaluation/sealqa_scorer.py`. If no scorer is provided, the default multi-tolerance scorer is used.
```python
from src.api import TaskConfig, register_task, EvoSkill, EvalRunner

register_task(TaskConfig(
    name="my_task",
    make_agent_options=make_my_task_agent_options,
    scorer=score_my_task,
    column_renames={"label": "ground_truth", "topic": "category"},
    default_dataset=".dataset/my_data.csv",
))

# Run the self-improvement loop
result = await EvoSkill(task="my_task").run()

# Or run a standalone evaluation
summary = await EvalRunner(task="my_task", model="sonnet").run()
```

You can also create scripts directly under `scripts/` following the existing patterns.
Evaluation script — loads your dataset and runs `evaluate_full()`:

```python
from src.agent_profiles import Agent, make_my_task_agent_options
from src.evaluation.eval_full import evaluate_full
from src.schemas import AgentResponse

agent = Agent(make_my_task_agent_options(model="opus"), AgentResponse)
results = await evaluate_full(agent=agent, items=items, output_path=output, ...)
```

See `scripts/run_eval_dabstep.py` for a complete example.
Loop script — follow the pattern in `scripts/run_loop.py`. The key ingredients are:

- A dataset split function (train set for failure analysis, validation set for scoring)
- Your agent options factory and scorer wired into `SelfImprovingLoop`
- A `LoopConfig` with your chosen mode (`skill_only` or `prompt_only`)
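The first ingredient, the dataset split, can be as simple as a seeded shuffle. A sketch using the default ratios from the Python API example above (illustrative, not the project's actual split helper):

```python
import random

def split_dataset(items, train_ratio=0.18, val_ratio=0.12, seed=0):
    """Shuffle once with a fixed seed, then carve off a train slice (for
    failure analysis) and a validation slice (for scoring); the rest is
    held out for final evaluation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_ratio)
    n_val = int(len(items) * val_ratio)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    holdout = items[n_train + n_val:]
    return train, val, holdout

train, val, holdout = split_dataset(range(100))
print(len(train), len(val), len(holdout))  # 18 12 70
```

Fixing the seed keeps the split stable across iterations, so frontier scores remain comparable from one iteration to the next.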
If you use EvoSkill in your research, please cite:
```bibtex
@software{al_zubi_2025_17052592,
  author    = {Alzubi, Salaheddin and},
  title     = {SentientResearchAgent: A Hierarchical AI Agent Framework for Research and Analysis},
  month     = sep,
  year      = 2025,
  publisher = {Zenodo},
  version   = {ROMA},
  doi       = {10.5281/zenodo.17052592},
  url       = {https://doi.org/10.5281/zenodo.17052592},
  swhid     = {swh:1:dir:69cd1552103e0333dd0c39fc4f53cb03196017ce;origin=https://doi.org/10.5281/zenodo.17052591;visit=swh:1:snp:f50bf99634f9876adb80c027361aec9dff973433;anchor=swh:1:rel:afa7caa843ce1279f5b4b29b5d3d5e3fe85edc95;path=salzubi401-ROMA-b31c382},
}
```

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

