Agent Skill Ecosystem

Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale

Build Your Agent from 200,000+ Skills via Skill Retrieval & Orchestration


200,000+ Public Skills
30 Benchmark Tasks
5 Task Categories
200 / 1K / 200K Ecosystem Sizes

GitHub Project


If AgentSkillOS helps your work, support us with a Star and cite our paper.


Method

Two-Stage AgentSkillOS Framework

AgentSkillOS follows a two-stage pipeline: Manage Skills (capability-tree construction) and Solve Tasks (retrieval, orchestration, and execution).

Stage 1 · Manage Skills

Organize ecosystem skills into a capability tree for coarse-to-fine discovery.

Node-level recursive categorization
Usage-frequency queue for active set
Dormant index with semantic suggestions
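
The node-level recursive categorization listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the AgentSkillOS implementation: the `Node` class, the `build_tree` function, the `max_leaf` threshold, and the path-based `by_path_segment` categorizer are all hypothetical, and a real system would presumably categorize skills from their descriptions rather than from their names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """One node of a (hypothetical) capability tree."""
    name: str
    skills: List[str] = field(default_factory=list)          # filled on leaves only
    children: Dict[str, "Node"] = field(default_factory=dict)


def build_tree(name: str, skills: List[str],
               categorize: Callable[[str, int], str],
               max_leaf: int = 2, depth: int = 0) -> Node:
    """Recursively split a node until it holds at most max_leaf skills."""
    node = Node(name)
    groups: Dict[str, List[str]] = {}
    for skill in skills:
        groups.setdefault(categorize(skill, depth), []).append(skill)
    # Stop when the node is small enough or the categorizer cannot split it.
    if len(skills) <= max_leaf or len(groups) <= 1:
        node.skills = list(skills)
        return node
    for label in sorted(groups):
        node.children[label] = build_tree(label, groups[label],
                                          categorize, max_leaf, depth + 1)
    return node


def by_path_segment(skill: str, depth: int) -> str:
    """Assumed categorizer: skill ids look like 'domain/subdomain/name'."""
    parts = skill.split("/")
    return parts[depth] if depth < len(parts) - 1 else "misc"


# Made-up skill ids, for illustration only.
SKILLS = [
    "doc/pdf/merge", "doc/pdf/split", "doc/pptx/build", "doc/docx/write",
    "web/html/render", "web/scrape/fetch",
]
tree = build_tree("root", SKILLS, by_path_segment)
```

In this toy run the root splits into `doc` and `web`; `doc` is split again because it holds more than `max_leaf` skills, while `web` stays a leaf. Coarse-to-fine discovery then amounts to descending from the root and only reading the skills under the branches that match the task.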

Stage 2 · Solve Tasks

Build a task-specific agent through retrieval and DAG-based multi-skill execution.

Task-driven skill retrieval
DAG orchestration with strategy variants
Layered execution with dependency control
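
Layered execution with dependency control can be illustrated with a standard layer-by-layer topological sort (Kahn's algorithm). The sketch below is an assumption about how such a scheduler might look, not the project's code; the `execution_layers` function and the step names in `plan` are hypothetical.

```python
from typing import Dict, List, Set


def execution_layers(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group DAG steps into layers; every step in a layer depends only on
    steps from earlier layers, so a layer's steps could run in parallel."""
    unknown = {d for ds in deps.values() for d in ds} - set(deps)
    if unknown:
        raise ValueError(f"dependencies on undefined steps: {sorted(unknown)}")
    remaining = {step: set(ds) for step, ds in deps.items()}
    layers: List[List[str]] = []
    while remaining:
        ready = sorted(step for step, ds in remaining.items() if not ds)
        if not ready:
            raise ValueError("cycle detected: the plan is not a DAG")
        layers.append(ready)
        for step in ready:
            del remaining[step]
        for ds in remaining.values():
            ds.difference_update(ready)   # mark finished steps as satisfied
    return layers


# Hypothetical plan: each step maps to the set of steps it depends on.
plan = {
    "research": set(),
    "outline": {"research"},
    "charts": {"research"},
    "slides": {"outline", "charts"},
}
layers = execution_layers(plan)
# layers == [["research"], ["charts", "outline"], ["slides"]]
```

Here `outline` and `charts` land in the same layer because both depend only on `research`, and `slides` waits until both have finished.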

Strategy variants: Quality-First, Efficiency-First, Simplicity-First

Figure 1: Overall AgentSkillOS workflow for skill retrieval, orchestration, and execution.

Capability Tree Expansion

Visualizing how the skill tree scales from a curated starter set to a large-scale ecosystem.

200-Skill Tree

4 levels, 22 nodes — a curated starter set.

1,000-Skill Tree

7 levels, 47 nodes — mid-scale ecosystem.

10,000-Skill Tree

6 levels, 67 nodes — large-scale ecosystem.

Benchmark

30 Multimodal Creative Tasks Across 5 Categories

The benchmark emphasizes three properties: multimodal creative tasks spanning multi-format outputs, pairwise evaluation with position-bias mitigation, and Bradley-Terry aggregation.

Multimodal Creative Tasks

Tasks require end-user artifacts in formats such as PDF, PPTX, DOCX, HTML, video, and generated images.

Pairwise Evaluation

Outputs are compared in both orders to reduce position bias and capture reliable preference signals.
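
A minimal sketch of comparing in both orders, assuming a judge that sees two outputs and answers "first", "second", or "tie". The `debiased_preference` function and both toy judges are hypothetical, and counting each order as one vote is an assumption; the text does not say how the two orders are combined.

```python
from typing import Callable, Tuple

Judge = Callable[[str, str], str]   # returns "first", "second", or "tie"


def debiased_preference(judge: Judge, out_a: str, out_b: str) -> Tuple[int, int]:
    """Query the judge with A shown first, then with B shown first,
    and return (votes for A, votes for B)."""
    forward = judge(out_a, out_b)    # A is in the first position
    backward = judge(out_b, out_a)   # B is in the first position
    a_votes = (forward == "first") + (backward == "second")
    b_votes = (forward == "second") + (backward == "first")
    return a_votes, b_votes


def always_first(x: str, y: str) -> str:
    """Toy judge with pure position bias: it always prefers the first slot."""
    return "first"


def prefers_longer(x: str, y: str) -> str:
    """Toy judge with a real (if crude) preference: the longer output wins."""
    if len(x) == len(y):
        return "tie"
    return "first" if len(x) > len(y) else "second"


biased = debiased_preference(always_first, "x", "y")
genuine = debiased_preference(prefers_longer, "long answer", "short")
```

With the position-biased judge the two orders cancel out to one vote each, whereas the judge with a content-based preference gives both votes to the same output regardless of order.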

Bradley-Terry Scores

Pairwise preferences are aggregated into continuous ranking scores for fine-grained system comparisons.
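
For illustration, Bradley-Terry strengths can be fitted from win counts with the classic minorization-maximization update. The sketch below is one common way to do this and is not necessarily the fitting procedure used for the benchmark; the `bradley_terry` function, the system names, and the win counts are made up.

```python
from typing import Dict, Tuple


def bradley_terry(wins: Dict[Tuple[str, str], int], iters: int = 200) -> Dict[str, float]:
    """Fit Bradley-Terry strengths; wins[(i, j)] is how often i beat j.
    Under the model, P(i beats j) = p[i] / (p[i] + p[j])."""
    systems = sorted({s for pair in wins for s in pair})
    p = {s: 1.0 for s in systems}
    for _ in range(iters):
        new_p = {}
        for i in systems:
            total_wins = 0
            denom = 0.0
            for j in systems:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:                       # skip pairs never compared
                    total_wins += wins.get((i, j), 0)
                    denom += n_ij / (p[i] + p[j])
            new_p[i] = total_wins / denom if denom > 0 else p[i]
        norm = sum(new_p.values())
        p = {s: v / norm for s, v in new_p.items()}   # strengths sum to 1
    return p


# Made-up win counts for three hypothetical systems.
wins = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("B", "C"): 7, ("C", "B"): 3,
    ("A", "C"): 9, ("C", "A"): 1,
}
scores = bradley_terry(wins)
```

With these counts the fitted strengths rank A above B above C. Because the scores are continuous, they also indicate how far apart two systems are, which a plain win/loss tally does not.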

Data Computation ×6
Document Creation ×6
Motion Video ×6
Visual Creation ×6
Web Interaction ×6

Figure 3: 30 tasks grouped into five categories, with task-complexity statistics.


Experiments

Experimental Findings

Across the 200 / 1K / 200K skill ecosystems, AgentSkillOS consistently outperforms the baselines. Ablations confirm that both retrieval and orchestration are indispensable, and strategy selection produces structurally distinct execution graphs.

Finding 1: Substantial Gains over Baselines at Every Scale

All three AgentSkillOS variants achieve the highest Bradley-Terry scores in the 200, 1K, and 200K ecosystems. The w/ Full Pool baseline, which feeds the entire skill set directly to the agent, scores poorly because a growing fraction of skills becomes invisible to the agent as the pool grows. Structured retrieval and orchestration overcome this scalability bottleneck.

Finding 2: Ablation Shows Both Retrieval and Orchestration Are Essential

Removing components reveals a clear degradation gradient. Without DAG orchestration, retrieval alone is insufficient; without retrieval, even oracle skills cannot close the gap. Compared to the oracle upper bound, Quality-First shows only a modest deficit that narrows as the ecosystem grows, validating that tree-based retrieval effectively approximates oracle skill selection.

Finding 3: Strategy Choice Shapes Execution Structure

Each orchestration strategy faithfully translates its design intent into a distinct DAG topology. Quality-First builds deep, multi-stage pipelines with rich dependencies; Efficiency-First trades depth for width to maximize parallelism; Simplicity-First retains only essential steps. Users gain real control over the quality–speed–simplicity trade-off through strategy selection alone.

Fig. 4: Category Radar

Per-category Bradley-Terry performance across ecosystem scales, showing broad and stable coverage.

Fig. 5: Ablation

Separates retrieval and orchestration effects; confirms both components are required.

Fig. 6: DAG Structure Metrics

Different orchestration strategies induce distinct topology profiles (depth, width, edges, nodes).
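
The four topology metrics can be computed from a dependency map as sketched below. The definitions are assumptions, since the page does not define them: depth is taken as the number of steps on the longest dependency chain, and width as the largest number of steps sharing the same level. The `dag_metrics` function and both toy plans are hypothetical and are not measured AgentSkillOS outputs.

```python
from collections import Counter
from typing import Dict, Set


def dag_metrics(deps: Dict[str, Set[str]]) -> Dict[str, int]:
    """Return nodes, edges, depth, and width for an acyclic dependency map
    (step -> set of steps it depends on). Assumes the input has no cycles."""
    level: Dict[str, int] = {}

    def level_of(step: str) -> int:
        # A step sits one level below its deepest dependency.
        if step not in level:
            level[step] = 1 + max((level_of(d) for d in deps[step]), default=0)
        return level[step]

    for step in deps:
        level_of(step)
    per_level = Counter(level.values())
    return {
        "nodes": len(deps),
        "edges": sum(len(ds) for ds in deps.values()),
        "depth": max(level.values(), default=0),
        "width": max(per_level.values(), default=0),
    }


# Toy plans with the same node and edge counts but different shapes.
chain = {"a": set(), "b": {"a"}, "c": {"b"}, "d": {"c"}}        # deep and narrow
fan_in = {"a": set(), "b": set(), "c": set(), "d": {"a", "b", "c"}}  # shallow and wide

chain_metrics = dag_metrics(chain)
fan_in_metrics = dag_metrics(fan_in)
```

Both plans have 4 nodes and 3 edges, but the chain has depth 4 and width 1 while the fan-in plan has depth 2 and width 3, which is the kind of depth-versus-width contrast the figure describes between strategies.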

Case Study

Case Study and Supplementary Artifacts

This section highlights the qualitative case-study figure and representative supplementary artifacts from the original project assets.


Figure 7: Qualitative comparison between the vanilla baseline and AgentSkillOS Quality-First outputs.

Example 01 · Bug Diagnosis Report

Mobile bug localization, fix validation, and visual bug report generation with before/after evidence.

Example 02 · UI Design Research

Design-language research, report generation, and multi-direction concept mockups for knowledge software.

Example 03 · Paper Promotion

Transforms academic papers into social slides, scientific pages, and platform-specific promotion content.

Example 04 · Meme Video

Green-screen compositing, subtitle timing, and viral short-video production with multi-version outputs.