AgentSkillOS

Agent Skill Ecosystem

Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale

Build Your Agent from 200,000+ Skills via Skill Retrieval & Orchestration

Two-Stage AgentSkillOS Framework

AgentSkillOS follows a two-stage pipeline: Manage Skills (capability-tree construction) and Solve Tasks (retrieval, orchestration, and execution).

Stage 1 · Manage Skills

Organize ecosystem skills into a capability tree for coarse-to-fine discovery.

  • Node-level recursive categorization
  • Usage-frequency queue for active set
  • Dormant index with semantic suggestions
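As a rough illustration of node-level recursive categorization, the sketch below splits a node into child categories whenever it holds more than `max_leaf` skills. It is not the AgentSkillOS implementation: the `path` labels, the `max_leaf` threshold, and the skill names are hypothetical stand-ins for whatever categorizer the system actually uses, and the usage-frequency queue and dormant index are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    skills: list = field(default_factory=list)    # skill ids kept at this node
    children: dict = field(default_factory=dict)  # category label -> child Node


def build_tree(name, skills, max_leaf=2, depth=0):
    """Recursively split skills by the label at position `depth` of their path.

    The precomputed `path` is an assumption standing in for a real categorizer.
    """
    node = Node(name)
    # Stop when the node is small enough or no skill has a deeper label.
    if len(skills) <= max_leaf or all(len(s["path"]) <= depth for s in skills):
        node.skills = [s["id"] for s in skills]
        return node
    groups = {}
    for s in skills:
        if len(s["path"]) > depth:
            groups.setdefault(s["path"][depth], []).append(s)
        else:
            node.skills.append(s["id"])  # no deeper label: keep it here
    for label, members in groups.items():
        node.children[label] = build_tree(label, members, max_leaf, depth + 1)
    return node


def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())


def tree_levels(node):
    return 1 + max((tree_levels(c) for c in node.children.values()), default=0)


# Hypothetical skills, for illustration only.
SKILLS = [
    {"id": "pdf-report", "path": ["document", "pdf"]},
    {"id": "docx-writer", "path": ["document", "docx"]},
    {"id": "pptx-slides", "path": ["document", "pptx"]},
    {"id": "chart-render", "path": ["visual", "chart"]},
    {"id": "image-gen", "path": ["visual", "image"]},
]

tree = build_tree("root", SKILLS, max_leaf=2)
print(tree_levels(tree), count_nodes(tree))  # 3 6
```

Only the oversized "document" node is split further here, which is the coarse-to-fine behavior the stage describes: branches deepen where skills are dense.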

Stage 2 · Solve Tasks

Build a task-specific agent through retrieval and DAG-based multi-skill execution.

  • Task-driven skill retrieval
  • DAG orchestration with strategy variants
  • Layered execution with dependency control
Strategy variants: Quality-First, Efficiency-First, Simplicity-First
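Layered execution with dependency control can be pictured as grouping DAG steps into layers, where every step in a layer depends only on steps in earlier layers, so steps within a layer could run in parallel. The following is a minimal sketch using a Kahn-style topological pass. The function name and the step names are hypothetical, and the actual AgentSkillOS scheduler may differ.

```python
def execution_layers(deps):
    """Group DAG steps into layers; each layer depends only on earlier layers.

    `deps` maps each step to the set of steps it depends on.
    """
    remaining = {step: set(parents) for step, parents in deps.items()}
    layers = []
    while remaining:
        # Steps whose dependencies have all completed are ready to run.
        ready = sorted(s for s, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("cycle or unknown dependency; not a valid DAG")
        layers.append(ready)
        for s in ready:
            del remaining[s]
        for parents in remaining.values():
            parents.difference_update(ready)
    return layers


# Hypothetical task plan, for illustration only.
PLAN = {
    "fetch_data": set(),
    "analyze": {"fetch_data"},
    "make_chart": {"analyze"},
    "write_summary": {"analyze"},
    "build_report": {"make_chart", "write_summary"},
}

layers = execution_layers(PLAN)
for i, layer in enumerate(layers, start=1):
    print(i, layer)
```

In this plan, `make_chart` and `write_summary` land in the same layer because neither depends on the other.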
Figure 1: overall AgentSkillOS workflow for skill retrieval, orchestration, and execution.

Capability Tree Expansion

Visualizing how the skill tree scales from a curated starter set to a large-scale ecosystem.

200-Skill Tree

4 levels, 22 nodes — a curated starter set.

1,000-Skill Tree

7 levels, 47 nodes — mid-scale ecosystem.

10,000-Skill Tree

6 levels, 67 nodes — large-scale ecosystem.

30 Multimodal Creative Tasks Across 5 Categories

The benchmark emphasizes three properties: multimodal creative tasks spanning multi-format outputs, pairwise evaluation with position-bias mitigation, and Bradley-Terry aggregation.

Multimodal Creative Tasks

Tasks require end-user artifacts in multiple output formats, such as PDF, PPTX, DOCX, HTML, video, and generated images.

Pairwise Evaluation

Outputs are compared in both orders to reduce position bias and capture reliable preference signals.
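To make the both-orders idea concrete, the sketch below queries a judge twice per pair, once with each output shown first, and maps both verdicts back to the systems. The `judge` callable, its `"first"`/`"second"` return values, and the vote-counting rule are assumptions for illustration; the benchmark's actual judging protocol is not reproduced here.

```python
def compare_both_orders(judge, out_a, out_b):
    """Return (votes_for_a, votes_for_b) after judging in both presentation orders.

    `judge(x, y)` is a hypothetical callable returning "first" or "second".
    """
    votes_a = votes_b = 0
    # Order 1: A is shown first.
    if judge(out_a, out_b) == "first":
        votes_a += 1
    else:
        votes_b += 1
    # Order 2: B is shown first; map the verdict back to system identity.
    if judge(out_b, out_a) == "first":
        votes_b += 1
    else:
        votes_a += 1
    return votes_a, votes_b


def always_first(x, y):
    """A maximally position-biased judge: always prefers the first slot."""
    return "first"


def prefers_longer(x, y):
    """A toy content-based judge: prefers the longer output."""
    return "first" if len(x) >= len(y) else "second"


print(compare_both_orders(always_first, "a", "b"))                   # (1, 1)
print(compare_both_orders(prefers_longer, "detailed report", "stub"))  # (2, 0)
```

A purely position-biased judge splits its votes evenly, so its bias cancels, while a judge that responds to content gives the same system both votes.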

Bradley-Terry Scores

Pairwise preferences are aggregated into continuous ranking scores for fine-grained system comparisons.
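As a sketch of how pairwise preferences can be turned into continuous scores, the code below fits Bradley-Terry strengths with the standard minorization-maximization (Zermelo) iteration, under which the modeled probability that system i beats system j is p_i / (p_i + p_j). The win counts are made up, normalizing the strengths to sum to 1 is an assumption, and the benchmark may use a different fitting procedure or score scale.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the MM (Zermelo) iteration.

    `wins[i][j]` is the number of times system i beat system j (diagonal is 0).
    Assumes every system has at least one win and one loss.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each opponent contributes (games played) / (p_i + p_j).
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        p = [x / scale for x in new_p]  # normalize so strengths sum to 1
    return p


# Made-up win counts for three systems, for illustration only.
WINS = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
scores = bradley_terry(WINS)
print([round(s, 3) for s in scores])
```

With two systems and a 3:1 record, `bradley_terry([[0, 3], [1, 0]])` converges to strengths of 0.75 and 0.25, matching the observed win rate.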

  • Data Computation ×6
  • Document Creation ×6
  • Motion Video ×6
  • Visual Creation ×6
  • Web Interaction ×6
Figure 3: 30 tasks grouped into five categories, with task-complexity statistics.

Benchmark Tasks with Prompts

Each benchmark task is shown as a card with a prompt preview. Open details to read the full prompt, required skills, and output files.

Experimental Findings

Evaluated across 200 / 1K / 200K skill ecosystems, AgentSkillOS consistently outperforms the baselines. Ablations confirm that both retrieval and orchestration are indispensable, and strategy selection produces structurally distinct execution graphs.

Finding 1: Substantial Gains over Baselines at Every Scale

All three AgentSkillOS variants achieve the highest Bradley-Terry scores across the 200 / 1K / 200K ecosystems. The w/ Full Pool baseline, which feeds the entire skill set directly to the agent, scores poorly because a growing fraction of skills becomes invisible to the agent as the pool grows. Structured retrieval and orchestration overcome this scalability bottleneck.

Finding 2: Ablation Shows Both Retrieval and Orchestration Are Essential

Removing components reveals a clear degradation gradient. Without DAG orchestration, retrieval alone is insufficient; without retrieval, even oracle skills cannot close the gap. Compared to the oracle upper bound, Quality-First shows only a modest deficit that narrows as the ecosystem grows, validating that tree-based retrieval effectively approximates oracle skill selection.

Finding 3: Strategy Choice Shapes Execution Structure

Each orchestration strategy faithfully translates its design intent into a distinct DAG topology. Quality-First builds deep, multi-stage pipelines with rich dependencies; Efficiency-First trades depth for width to maximize parallelism; Simplicity-First retains only essential steps. Users gain real control over the quality–speed–simplicity trade-off through strategy selection alone.

Fig. 4: Category Radar

Per-category Bradley-Terry performance across ecosystem scales, showing broad and stable coverage.

Fig. 5: Ablation

Separates retrieval and orchestration effects; confirms both components are required.

Fig. 6: DAG Structure Metrics

Different orchestration strategies induce distinct topology profiles (depth, width, edges, nodes).
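
The four topology metrics named above can be computed from a dependency map in a few lines. The sketch below assumes one plausible set of definitions (depth as the number of nodes on the longest dependency chain, width as the largest number of nodes at the same level); the paper's exact definitions may differ, and the two example graphs are hypothetical.

```python
def dag_metrics(deps):
    """Compute nodes, edges, depth, and width for an acyclic dependency map.

    `deps` maps each step to the list of steps it depends on.
    Definitions of depth and width here are assumptions, not the paper's.
    """
    level = {}

    def level_of(step):
        # Level = 1 for sources, else 1 + the deepest parent's level.
        if step not in level:
            level[step] = 1 + max((level_of(p) for p in deps[step]), default=0)
        return level[step]

    for step in deps:
        level_of(step)

    per_level = {}
    for lv in level.values():
        per_level[lv] = per_level.get(lv, 0) + 1

    return {
        "nodes": len(deps),
        "edges": sum(len(parents) for parents in deps.values()),
        "depth": max(level.values(), default=0),
        "width": max(per_level.values(), default=0),
    }


# A deep chain versus a wide fan-out with the same node and edge counts.
DEEP = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
WIDE = {"a": [], "b": ["a"], "c": ["a"], "d": ["a"]}

print(dag_metrics(DEEP))  # {'nodes': 4, 'edges': 3, 'depth': 4, 'width': 1}
print(dag_metrics(WIDE))  # {'nodes': 4, 'edges': 3, 'depth': 2, 'width': 3}
```

The two graphs have identical node and edge counts but opposite depth and width profiles, which is the kind of contrast drawn between deep multi-stage pipelines and wide parallel ones.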