aelshamy/SkillForge

SkillForge

Zero-error skill creation framework — fuses agentskills.io spec with MAKER reliability principles to generate production-ready AI skills that follow directions perfectly.

SkillForge takes your high-level workflow descriptions and transforms them into battle-tested, agentskills.io-compatible Skills with built-in validation, red-flag detection, and consensus verification — so the AI never hallucinates, truncates, or goes off-script.


Quick Start

# 1. Install
cd skill_sdk && npm install

# 2. Scaffold a skill
npx tsx bin/skillforge.ts init my-first-skill \
  --description "Generates unit tests for Python functions" \
  --language python

# 3. Edit the config
#    Open my-first-skill/skillforge.config.json and define your workflow

# 4. Build the skill
npx tsx bin/skillforge.ts build \
  --config my-first-skill/skillforge.config.json \
  --output my-first-skill/dist --verbose

# 5. Test the skill
npx tsx bin/skillforge.ts test \
  --config my-first-skill/skillforge.config.json

Installation

# Clone and install
git clone <repo-url> skill_sdk
cd skill_sdk
npm install

# Verify everything works
npm test

Requires Node.js ≥ 18 and npm.

Key npm Scripts

| Script | Command | Description |
| --- | --- | --- |
| `npm test` | `vitest run` | Run all 73 unit tests |
| `npm run test:watch` | `vitest` | Watch mode for tests |
| `npm run build` | `tsc` | Compile TypeScript to dist/ |
| `npm run lint` | `tsc --noEmit` | Type-check without emitting |
| `npm start` | `tsx bin/skillforge.ts` | Run the CLI directly |

Core Concepts

SkillForge is built on the three pillars of the MAKER framework (Maximal Agentic decomposition, first-to-ahead-by-K Error correction, and Red-flagging):

flowchart LR
    A["📝 Workflow Steps"] --> B["🧬 Decompose"]
    B --> C["🚩 Validate"]
    C --> D["🗳️ Consensus"]
    D --> E["✅ Accepted Output"]
    D -->|"No agreement"| F["🔄 Retry"]
    F --> C
    C -->|"Red-flag triggered"| G["⛔ Discard & Resample"]
    G --> C

    style A fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style B fill:#1e293b,stroke:#8b5cf6,color:#e2e8f0
    style C fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style D fill:#1e293b,stroke:#10b981,color:#e2e8f0
    style E fill:#065f46,stroke:#10b981,color:#e2e8f0
    style F fill:#1e293b,stroke:#6366f1,color:#e2e8f0
    style G fill:#7f1d1d,stroke:#ef4444,color:#e2e8f0

🧬 Maximal Agentic Decomposition (MAD)

Every workflow step you write gets automatically broken down into the smallest possible atomic sub-steps. Each sub-step has:

  • A single, clear instruction
  • An input/output contract
  • Its own validation script
  • Red-flag rules
flowchart TD
    WS["Workflow Step\n'Build a React component with tests'"]
    WS --> AS1["1.1 Read user request"]
    WS --> AS2["1.2 Identify component name"]
    WS --> AS3["1.3 List required props"]
    WS --> AS4["1.4 Define TypeScript interface"]
    WS --> AS5["1.5 Implement component"]
    WS --> AS6["1.6 Add ARIA attributes"]
    WS --> AS7["1.7 Write unit tests"]
    WS --> AS8["... +3 more"]

    AS1 -.-> V1["✅ Validate"]
    AS2 -.-> V2["✅ Validate"]
    AS3 -.-> V3["✅ Validate"]
    AS4 -.-> V4["✅ Validate"]

    style WS fill:#1e293b,stroke:#8b5cf6,color:#e2e8f0
    style AS1 fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style AS2 fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style AS3 fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style AS4 fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style AS5 fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style AS6 fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style AS7 fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style AS8 fill:#1e293b,stroke:#475569,color:#94a3b8
    style V1 fill:#065f46,stroke:#10b981,color:#e2e8f0
    style V2 fill:#065f46,stroke:#10b981,color:#e2e8f0
    style V3 fill:#065f46,stroke:#10b981,color:#e2e8f0
    style V4 fill:#065f46,stroke:#10b981,color:#e2e8f0

Why? Smaller steps = less room for the AI to go off course. A step like "Build a React component with tests" becomes 10+ atomic steps, each verifiable independently.

🗳️ K-Threshold Consensus Voting

When multiple agents run the same step, their outputs are clustered by structural similarity. If at least K agents agree (default: 2 of 3), the consensus output is accepted. Otherwise, the step is retried.

flowchart TD
    Step["Atomic Step 1.3"]
    Step --> A1["Agent A"]
    Step --> A2["Agent B"]
    Step --> A3["Agent C"]

    A1 --> O1["Output A"]
    A2 --> O2["Output B"]
    A3 --> O3["Output C"]

    O1 --> Cluster["Similarity Clustering"]
    O2 --> Cluster
    O3 --> Cluster

    Cluster --> Vote{"K ≥ 2 agree?"}
    Vote -->|"✅ Yes"| Accept["Accept Consensus Output"]
    Vote -->|"❌ No"| Retry["Retry Step"]

    style Step fill:#1e293b,stroke:#8b5cf6,color:#e2e8f0
    style A1 fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style A2 fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style A3 fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style O1 fill:#1e293b,stroke:#6366f1,color:#e2e8f0
    style O2 fill:#1e293b,stroke:#6366f1,color:#e2e8f0
    style O3 fill:#1e293b,stroke:#6366f1,color:#e2e8f0
    style Cluster fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style Vote fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style Accept fill:#065f46,stroke:#10b981,color:#e2e8f0
    style Retry fill:#7f1d1d,stroke:#ef4444,color:#e2e8f0

Why? When at least K agents independently produce structurally matching output, a one-off hallucination from a single agent is unlikely to survive the vote.
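
The voting rule can be sketched in a few lines. This is an illustration only: the README says outputs are "clustered by structural similarity" but does not define the metric, so the sketch assumes a whitespace- and case-insensitive comparison. The names `normalize` and `kThresholdConsensus` are hypothetical and are not part of the SkillForge SDK.

```typescript
// Hypothetical sketch of K-threshold voting; not the SkillForge implementation.

/** Assumed similarity metric: outputs match if equal after collapsing whitespace and case. */
function normalize(output: string): string {
  return output.trim().replace(/\s+/g, " ").toLowerCase();
}

interface ConsensusResult {
  accepted: boolean;
  output?: string; // first raw output from the winning cluster
  votes: number;   // size of the largest cluster
}

function kThresholdConsensus(outputs: string[], k: number): ConsensusResult {
  // Group raw outputs by their normalized form.
  const clusters = new Map<string, string[]>();
  for (const raw of outputs) {
    const key = normalize(raw);
    const members = clusters.get(key) ?? [];
    members.push(raw);
    clusters.set(key, members);
  }

  // Find the largest cluster.
  let best: string[] = [];
  clusters.forEach((members) => {
    if (members.length > best.length) best = members;
  });

  // Accept only when at least k agents agree; otherwise the caller retries.
  return best.length >= k
    ? { accepted: true, output: best[0], votes: best.length }
    : { accepted: false, votes: best.length };
}

const result = kThresholdConsensus(
  ["const x = 1;", "const  x = 1;", "let x = 2;"],
  2,
);
console.log(result.accepted, result.votes); // true 2
```

With the default of 2 of 3, two agents that differ only in spacing form a winning cluster, while three mutually different outputs trigger a retry.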

🚩 Red-Flag Detection

Every output is scanned for suspicious patterns before acceptance:

  • Length violations — Output too long or too short
  • Regex patterns — Apologies, placeholders, truncation markers
  • Syntax errors — Invalid code in the target language
  • AST analysis — Hallucinated imports, duplicate definitions
  • Custom rules — Your own validation scripts
flowchart LR
    Output["LLM Output"] --> RF["Red-Flag\nEngine"]

    RF --> L["📏 Length"]
    RF --> R["🔍 Regex"]
    RF --> S["💻 Syntax"]
    RF --> A["🌳 AST"]
    RF --> C["⚙️ Custom"]

    L & R & S & A & C --> Decision{"Any flags?"}
    Decision -->|"Clean"| Pass["✅ Accept"]
    Decision -->|"Warning"| Discard["🔄 Discard & Resample"]
    Decision -->|"Error"| Halt["⛔ Halt Pipeline"]

    style Output fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style RF fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style L fill:#1e293b,stroke:#6366f1,color:#e2e8f0
    style R fill:#1e293b,stroke:#6366f1,color:#e2e8f0
    style S fill:#1e293b,stroke:#6366f1,color:#e2e8f0
    style A fill:#1e293b,stroke:#6366f1,color:#e2e8f0
    style C fill:#1e293b,stroke:#6366f1,color:#e2e8f0
    style Decision fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style Pass fill:#065f46,stroke:#10b981,color:#e2e8f0
    style Discard fill:#78350f,stroke:#f59e0b,color:#e2e8f0
    style Halt fill:#7f1d1d,stroke:#ef4444,color:#e2e8f0

Why? Catching confusion before it propagates is cheaper than debugging downstream.
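
A minimal sketch of the decision shown in the flowchart above, covering only the length and regex rule types. The rule fields `id`, `type`, `pattern`, and `severity` mirror the config examples in this README; the `maxLength` field and the function names `triggeredRules` and `decide` are assumptions made for illustration.

```typescript
// Hypothetical sketch; rule shapes and function names are assumed, not the SkillForge API.
type Severity = "warning" | "error";

interface RegexRule { id: string; type: "regex"; pattern: string; severity: Severity; }
interface LengthRule { id: string; type: "length"; maxLength: number; severity: Severity; }
type Rule = RegexRule | LengthRule;

type Decision = "accept" | "discard-and-resample" | "halt";

/** Return every rule that the output violates. */
function triggeredRules(output: string, rules: Rule[]): Rule[] {
  return rules.filter((rule) =>
    rule.type === "regex"
      ? new RegExp(rule.pattern).test(output)
      : output.length > rule.maxLength,
  );
}

/** Map the triggered rules to the flowchart's three outcomes. */
function decide(output: string, rules: Rule[]): Decision {
  const hits = triggeredRules(output, rules);
  if (hits.some((r) => r.severity === "error")) return "halt";
  if (hits.length > 0) return "discard-and-resample";
  return "accept";
}

const rules: Rule[] = [
  { id: "placeholder", type: "regex", pattern: "TODO|FIXME", severity: "warning" },
  { id: "too_long", type: "length", maxLength: 10000, severity: "error" },
];

console.log(decide("function add(a, b) { return a + b; }", rules)); // accept
console.log(decide("// TODO: implement", rules));                    // discard-and-resample
```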


CLI Reference

All commands follow the pattern:

npx tsx bin/skillforge.ts <command> [options]

init — Scaffold a new skill

Creates a complete skill project directory with config, docs, and folder structure.

npx tsx bin/skillforge.ts init my-api-skill \
  --description "Generates REST API endpoints with Express" \
  --language typescript \
  --license MIT \
  --author "My Organization" \
  --output ./skills
| Option | Default | Description |
| --- | --- | --- |
| `-d, --description <text>` | "A new SkillForge skill" | Skill description |
| `-l, --language <lang>` | typescript | Primary language (typescript, python, javascript, dart) |
| `--license <license>` | - | License (e.g., MIT, Apache-2.0) |
| `--author <author>` | - | Author name or organization (stored in metadata) |
| `-o, --output <dir>` | . | Where to create the project directory |

Generated structure:

my-api-skill/
├── SKILL.md                    # Starter agentskills.io skill definition
├── skillforge.config.json      # Your skill's configuration (edit this!)
├── README.md                   # Auto-generated project readme
├── scripts/                    # Validation scripts (auto-filled on build)
├── references/                 # Reference docs your skill can consult
├── examples/                   # Usage examples
└── assets/                     # Static assets (templates, images, data)

Language presets: The --language flag controls which red-flag rules are included. Each language has a full preset automatically configured via the Plugin System:

| Language | Plugin Used | Rules Included |
| --- | --- | --- |
| typescript / javascript | javascriptPlugin | Bracket balance, unclosed strings/templates, hallucinated imports, duplicate declarations, 7 preset rules |
| python | pythonPlugin | Mixed indentation, missing colons, bracket balance, 6 preset rules |
| json | jsonPlugin | JSON.parse validation, 2 preset rules |
| dart / flutter | dartPlugin | Bracket balance, missing semicolons, empty build(), hallucinated imports, duplicate classes, missing @override, orphan StatefulWidget, 13 preset rules including deprecated widget detection |

build — Generate the full skill

Runs the complete SkillForge pipeline: Decompose → Validate → Assemble → Write.

flowchart LR
    Config["skillforge.config.json"] --> D["🧬 Decompose"]
    D --> V["🚩 Validate"]
    V --> A["📦 Assemble"]
    A --> W["💾 Write"]
    W --> SK["SKILL.md"]
    W --> SC["scripts/"]
    W --> RF["references/"]
    W --> EX["examples/"]
    W --> RP["report.json"]

    style Config fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style D fill:#1e293b,stroke:#8b5cf6,color:#e2e8f0
    style V fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style A fill:#1e293b,stroke:#10b981,color:#e2e8f0
    style W fill:#1e293b,stroke:#6366f1,color:#e2e8f0
    style SK fill:#065f46,stroke:#10b981,color:#e2e8f0
    style SC fill:#065f46,stroke:#10b981,color:#e2e8f0
    style RF fill:#065f46,stroke:#10b981,color:#e2e8f0
    style EX fill:#065f46,stroke:#10b981,color:#e2e8f0
    style RP fill:#065f46,stroke:#10b981,color:#e2e8f0
npx tsx bin/skillforge.ts build \
  --config ./skillforge.config.json \
  --output ./dist \
  --verbose
| Option | Default | Description |
| --- | --- | --- |
| `-c, --config <path>` | ./skillforge.config.json | Config file path |
| `-o, --output <dir>` | ./dist | Output directory |
| `-v, --verbose` | false | Show detailed pipeline output |

Output includes:

  • SKILL.md: The complete agentskills.io-compatible skill
  • scripts/: Validation scripts for each atomic step + master validator
  • references/: Decomposition guide and reference docs
  • examples/: Example usage file
  • decomposition-report.json: Full decomposition data

decompose — Preview atomic breakdown

Shows how your workflow steps will be broken down without generating output files. Useful for tuning your instructions before a full build.

npx tsx bin/skillforge.ts decompose --config ./skillforge.config.json
| Option | Default | Description |
| --- | --- | --- |
| `-c, --config <path>` | ./skillforge.config.json | Config file path |

Example output:

📋 Atomic Decomposition
═══════════════════════════════════════════

  Step: analyze-requirements
    → 1.1  Read the user's component request carefully
    → 1.2  Identify the component name, purpose, and visual behavior
    → 1.3  List all required props and their types
    ...

  Total: 4 parent steps → 22 atomic steps

test — Run the test suite

Auto-generates and runs trigger tests and functional tests based on your config.

npx tsx bin/skillforge.ts test --config ./skillforge.config.json
| Option | Default | Description |
| --- | --- | --- |
| `-c, --config <path>` | ./skillforge.config.json | Config file path |

Example output:

🧪 Test Suite Results
═══════════════════════════════════════════
✅ [trigger]     trigger_positive_0: Correctly triggered on: "create a React component"
✅ [trigger]     trigger_negative_0: Correctly did NOT trigger on: "explain React concepts"
✅ [functional]  functional_step-1: Skill structure validated
❌ [functional]  functional_step-2: Step "step-2" requires input "analysis" not produced

Results: 3/4 passed (4ms)

Exits with code 1 if any test fails — CI-friendly.


inspect — Red-flag analysis

Lists all red-flag rules for every atomic step — useful for auditing your skill's safety net.

npx tsx bin/skillforge.ts inspect --config ./skillforge.config.json

Example output:

🚩 Red-Flag Analysis
═══════════════════════════════════════════

Step: analyze-requirements_1.1
  🔴 [syntax] Syntax Error Detection: Generated code contains syntax errors
  🟡 [regex]  Placeholder Content: Output contains TODO/FIXME placeholders

Total red-flag rules: 42
Steps covered: 22/22

validate — Consensus validation info

Shows configuration for runtime consensus validation. Outputs example SDK code.

npx tsx bin/skillforge.ts validate \
  --config ./skillforge.config.json \
  --threshold 2 \
  --agents 3
| Option | Default | Description |
| --- | --- | --- |
| `-c, --config <path>` | ./skillforge.config.json | Config file path |
| `-k, --threshold <n>` | 2 | K-threshold for consensus |
| `-a, --agents <n>` | 3 | Number of parallel agents |

Configuration Reference

The skillforge.config.json file is the heart of every skill:

{
  // REQUIRED: Kebab-case name, max 64 characters
  "name": "my-skill-name",

  // REQUIRED: What the skill does + when to use it (max 1024 chars)
  "description": "Generates production-ready React components...",

  // OPTIONAL: agentskills.io metadata
  "license": "MIT",
  "metadata": {
    "author": "SkillForge Team",
    "version": "1.0.0"
  },
  "compatibility": "Requires Node.js >= 18",

  // REQUIRED: Phrases that should activate this skill
  "triggers": [
    "create a React component",
    "build a UI component"
  ],

  // RECOMMENDED: Phrases that should NOT activate this skill
  "negativeTriggers": [
    "explain React concepts",
    "debug existing code"
  ],

  // REQUIRED: Your workflow steps (SkillForge decomposes these)
  "steps": [
    {
      "id": "step-1",                     // Unique ID (kebab-case)
      "title": "Analyze Requirements",    // Human-readable title
      "instruction": "...",               // What the AI should do
      "inputs": ["user_request"],         // Data this step needs
      "outputs": ["requirements_spec"],   // Data this step produces
      "validationCriteria": "..."         // How to verify the output
    }
  ],

  // OPTIONAL: Paths to reference docs the skill can consult
  "references": ["api-docs.md", "style-guide.md"],

  // OPTIONAL: Primary language for code output
  "language": "typescript",

  // REQUIRED: MAKER engine configuration
  "maker": {
    "agentCount": 3,              // Parallel agents per step (default: 3)
    "kThreshold": 2,              // Votes needed for consensus (default: 2)
    "maxOutputLength": 10000,     // Max chars before red-flagging
    "maxRetries": 3,              // Retry limit after red-flag discard
    "globalRedFlagRules": []      // Red-flag rules applied to ALL steps
  }
}
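
The limits noted in the comments above (kebab-case name of at most 64 characters, description of at most 1024 characters, at least one trigger) can be checked before running a build. The sketch below is a hypothetical pre-flight check, not SkillForge's own validator; the `SkillConfigBasics` type and `validateBasics` function are invented for illustration, and the rule that `kThreshold` must not exceed `agentCount` is an assumption that follows from the voting logic rather than something the README states.

```typescript
// Hypothetical pre-flight check; type and function names are not part of SkillForge.
interface SkillConfigBasics {
  name: string;
  description: string;
  triggers: string[];
  maker: { agentCount: number; kThreshold: number };
}

function validateBasics(config: SkillConfigBasics): string[] {
  const errors: string[] = [];

  // Lowercase alphanumeric segments separated by single hyphens.
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(config.name)) {
    errors.push("name must be kebab-case");
  }
  if (config.name.length > 64) errors.push("name exceeds 64 characters");

  if (config.description.length === 0) errors.push("description is required");
  if (config.description.length > 1024) errors.push("description exceeds 1024 characters");

  if (config.triggers.length === 0) errors.push("at least one trigger is required");

  // Assumption: K votes cannot be collected from fewer than K agents.
  if (config.maker.kThreshold > config.maker.agentCount) {
    errors.push("kThreshold cannot exceed agentCount");
  }
  return errors;
}

// Reports three problems: the name, the empty trigger list, and the threshold.
console.log(
  validateBasics({
    name: "My Skill",
    description: "Generates production-ready React components...",
    triggers: [],
    maker: { agentCount: 3, kThreshold: 4 },
  }),
);
```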

Writing Effective Instructions

The instruction field is what gets decomposed into atomic steps. Write it as a numbered list for the cleanest decomposition:

"instruction": "1. Read the user's request carefully. 2. Identify all required props and their types. 3. Define the TypeScript interface. 4. Add JSDoc comments. 5. Export the interface."

Decomposition modes (auto-detected by priority):

| Priority | Format | Example |
| --- | --- | --- |
| 1 | Numbered lists | 1. ... 2. ... 3. ... |
| 2 | Bullet points | - ... - ... |
| 3 | Sentence splitting | Splits at period boundaries |
| 4 | Length-based | Force-splits very long instructions |
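
The priority order in the table can be sketched as a chain of fallbacks. This is a rough approximation under stated assumptions, not the SkillForge decomposer: the function name `decomposeInstruction`, the 200-character `maxChunk` default, and the exact splitting heuristics (numbered lists must start the instruction, bullets must sit on their own lines) are all hypothetical.

```typescript
// Hypothetical sketch of priority-based decomposition; heuristics are assumed.
function decomposeInstruction(instruction: string, maxChunk = 200): string[] {
  const clean = (parts: string[]): string[] =>
    parts.map((p) => p.trim()).filter((p) => p.length > 0);

  // Priority 1: numbered list. Only applies when the text starts with "N. ".
  if (/^\s*\d+\.\s/.test(instruction)) {
    const numbered = clean(instruction.split(/(?:^|\s)\d+\.\s+/));
    if (numbered.length > 1) return numbered;
  }

  // Priority 2: bullet points, one per line.
  const bullet = /^\s*[-*•]\s+/;
  const bulletLines = instruction.split(/\r?\n/).filter((line) => bullet.test(line));
  if (bulletLines.length > 1) {
    return clean(bulletLines.map((line) => line.replace(bullet, "")));
  }

  // Priority 3: split at period boundaries (naive: also splits decimals like "3.5").
  const sentences = clean(instruction.match(/[^.]+\.?/g) ?? []);
  if (sentences.length > 1) return sentences;

  // Priority 4: force-split long text into word-aligned chunks of at most maxChunk chars.
  const chunks: string[] = [];
  let current = "";
  for (const word of instruction.trim().split(/\s+/)) {
    if (current.length > 0 && current.length + 1 + word.length > maxChunk) {
      chunks.push(current);
      current = word;
    } else {
      current = current.length > 0 ? `${current} ${word}` : word;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

console.log(
  decomposeInstruction(
    "1. Read the user's request carefully. 2. Identify all required props and their types. 3. Define the TypeScript interface.",
  ),
);
```

Numbered lists come first because their item boundaries are explicit; sentence splitting is a fallback because periods also appear inside version numbers and abbreviations.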

Defining Data Flow

"steps": [
  {
    "id": "step-1",
    "inputs": ["user_request"],        // Always available
    "outputs": ["component_spec"]      // Produced by this step
  },
  {
    "id": "step-2",
    "inputs": ["component_spec"],      // Consumed from step-1
    "outputs": ["implementation"]
  }
]

SkillForge validates this data flow at build time: if step-2 requires an input that no previous step produces, you'll get a clear error.
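
The data-flow check described above amounts to one pass over the steps while tracking which names have been produced so far. The sketch below is an assumed reconstruction, not the SkillForge source; `StepIO` and `validateDataFlow` are hypothetical names, and `user_request` is treated as always available, as the comment in the example indicates.

```typescript
// Hypothetical sketch of build-time data-flow validation.
interface StepIO {
  id: string;
  inputs: string[];
  outputs: string[];
}

function validateDataFlow(steps: StepIO[], initial: string[] = ["user_request"]): string[] {
  const available = new Set<string>(initial);
  const errors: string[] = [];

  for (const step of steps) {
    // Every input must already have been produced by an earlier step.
    for (const input of step.inputs) {
      if (!available.has(input)) {
        errors.push(`Step "${step.id}" requires input "${input}" not produced by any previous step`);
      }
    }
    // Outputs become available to the steps that follow.
    step.outputs.forEach((output) => available.add(output));
  }
  return errors;
}

console.log(
  validateDataFlow([
    { id: "step-1", inputs: ["user_request"], outputs: ["component_spec"] },
    { id: "step-2", inputs: ["analysis"], outputs: ["implementation"] },
  ]),
);
```

The second step asks for `analysis`, which nothing produces, so the call returns a single error naming `step-2`.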


Creating Skills

Scenario 1: Code Generation Skill

Goal: Generate production-ready React components with TypeScript, tests, and accessibility.

{
  "name": "react-component-generator",
  "description": "Generates production-ready React functional components with TypeScript props interfaces, ARIA accessibility attributes, and comprehensive unit tests.",
  "license": "MIT",
  "metadata": {
    "author": "Acme Corp",
    "version": "1.0.0"
  },
  "triggers": [
    "create a React component",
    "build a React component",
    "generate a React component with TypeScript"
  ],
  "negativeTriggers": [
    "explain React concepts",
    "debug an existing React component"
  ],
  "steps": [
    {
      "id": "analyze-requirements",
      "title": "Analyze Component Requirements",
      "instruction": "1. Read the user's component request carefully. 2. Identify the component name, purpose, and visual behavior. 3. List all required props and their types. 4. Identify any state management needs. 5. Note accessibility requirements.",
      "inputs": ["user_request"],
      "outputs": ["component_spec"],
      "validationCriteria": "Component spec includes: name, props interface, state requirements, and accessibility notes."
    },
    {
      "id": "implement-component",
      "title": "Implement Component",
      "instruction": "1. Create the functional component using React.FC with the props interface. 2. Destructure props with default values. 3. Implement hooks for state and side effects. 4. Add all necessary ARIA attributes.",
      "inputs": ["component_spec"],
      "outputs": ["component_code"],
      "validationCriteria": "Component compiles without errors, uses proper hooks, has all ARIA attributes."
    },
    {
      "id": "write-tests",
      "title": "Write Unit Tests",
      "instruction": "1. Import React Testing Library and the component. 2. Write a test for default rendering. 3. Write tests for each prop variation. 4. Write tests for user interactions. 5. Verify all ARIA attributes.",
      "inputs": ["component_code"],
      "outputs": ["test_code"],
      "validationCriteria": "All branches covered. Tests verify rendering, interactions, and accessibility."
    }
  ],
  "language": "typescript",
  "maker": {
    "agentCount": 3,
    "kThreshold": 2,
    "maxOutputLength": 10000,
    "maxRetries": 3,
    "globalRedFlagRules": [
      {
        "id": "no_class_components",
        "name": "No Class Components",
        "description": "Must use functional components, not class components",
        "type": "regex",
        "pattern": "class\\s+\\w+\\s+extends\\s+(React\\.)?Component",
        "severity": "warning"
      }
    ]
  }
}

Scenario 2: Documentation Skill

Goal: Generate API documentation from source code.

{
  "name": "api-docs-generator",
  "description": "Generates comprehensive API documentation from source code with function signatures, parameter tables, and code examples.",
  "triggers": ["generate API documentation", "document this API"],
  "negativeTriggers": ["write a README", "create a tutorial"],
  "steps": [
    {
      "id": "parse-source",
      "title": "Parse Source Code",
      "instruction": "1. Read all provided source files. 2. Extract every exported function, class, and type. 3. Capture: name, parameters with types, return type, docstrings.",
      "inputs": ["user_request"],
      "outputs": ["parsed_exports"],
      "validationCriteria": "Every public export is captured with full type information."
    },
    {
      "id": "generate-docs",
      "title": "Generate Documentation",
      "instruction": "1. Create a markdown file with title and overview. 2. For each function, generate: signature, description, parameter table, return type, and code example. 3. Add a table of contents.",
      "inputs": ["parsed_exports"],
      "outputs": ["documentation"],
      "validationCriteria": "Every export has a complete documentation section."
    }
  ],
  "language": "typescript",
  "maker": {
    "agentCount": 3,
    "kThreshold": 2,
    "maxOutputLength": 20000,
    "maxRetries": 3,
    "globalRedFlagRules": [
      {
        "id": "no_placeholder_docs",
        "name": "No Placeholder Documentation",
        "description": "Must not contain TBD or placeholder text",
        "type": "regex",
        "pattern": "(TBD|TODO|PLACEHOLDER|\\[description here\\])",
        "severity": "error"
      }
    ]
  }
}

Scenario 3: Multi-Step Pipeline Skill

Goal: Data transformation pipeline: ingest → clean → validate → transform → report.

  • Multiple outputs per step: ingest produces both raw_records AND schema_info
  • Higher retries (maxRetries: 5): data processing is harder to get right
  • Python language: validation scripts check Python syntax
{
  "name": "data-pipeline-builder",
  "description": "Creates data transformation pipelines: clean, validate, transform, and report.",
  "triggers": ["build a data pipeline", "process this dataset"],
  "negativeTriggers": ["visualize data", "create a chart"],
  "steps": [
    {
      "id": "ingest",
      "title": "Ingest Raw Data",
      "instruction": "1. Read the input file (CSV or JSON). 2. Parse into structured records. 3. Log total record count and column names.",
      "inputs": ["user_request"],
      "outputs": ["raw_records", "schema_info"],
      "validationCriteria": "All records parsed. Column names detected."
    },
    {
      "id": "transform",
      "title": "Transform Data",
      "instruction": "1. Apply user-requested transformations. 2. Calculate summary statistics for numeric columns.",
      "inputs": ["raw_records", "schema_info"],
      "outputs": ["transformed_data"],
      "validationCriteria": "All transformations applied. Statistics are correct."
    },
    {
      "id": "report",
      "title": "Generate Report",
      "instruction": "1. Create summary with total records processed. 2. Include statistics. 3. Sample of transformed output (first 10 records).",
      "inputs": ["transformed_data"],
      "outputs": ["pipeline_report"],
      "validationCriteria": "Report includes all statistics."
    }
  ],
  "language": "python",
  "maker": {
    "agentCount": 3,
    "kThreshold": 2,
    "maxOutputLength": 15000,
    "maxRetries": 5,
    "globalRedFlagRules": [
      {
        "id": "no_hardcoded_paths",
        "name": "No Hardcoded File Paths",
        "description": "Pipeline code must not contain hardcoded paths",
        "type": "regex",
        "pattern": "(/Users/|/home/|C:\\\\|D:\\\\)",
        "severity": "error"
      }
    ]
  }
}

Scenario 4: Strict Compliance Skill

Goal: SQL migrations with strict safety rules.

  • Higher consensus (agentCount: 5, kThreshold: 3): critical operations need stronger agreement
  • Error-severity rules: DROP without backup and TRUNCATE immediately halt the pipeline
{
  "name": "safe-sql-migration",
  "description": "Generates safe, reversible SQL migrations with rollback scripts.",
  "triggers": ["create a database migration", "add a column to the database"],
  "negativeTriggers": ["write a SQL query", "optimize a query"],
  "steps": [
    {
      "id": "generate-up",
      "title": "Generate Forward Migration",
      "instruction": "1. Write SQL using CREATE/ALTER. 2. Wrap in transaction. 3. Use snake_case. 4. Add IF NOT EXISTS guards.",
      "inputs": ["user_request"],
      "outputs": ["up_migration"],
      "validationCriteria": "SQL is valid. Uses transactions. Snake_case identifiers."
    },
    {
      "id": "generate-down",
      "title": "Generate Rollback",
      "instruction": "1. Write SQL to reverse the forward migration. 2. Create backup tables for destructive ops. 3. Wrap in transaction.",
      "inputs": ["up_migration"],
      "outputs": ["down_migration"],
      "validationCriteria": "Rollback exactly reverses the forward migration."
    }
  ],
  "maker": {
    "agentCount": 5,
    "kThreshold": 3,
    "maxOutputLength": 8000,
    "maxRetries": 3,
    "globalRedFlagRules": [
      {
        "id": "no_drop_without_backup",
        "name": "No DROP Without Backup",
        "type": "regex",
        "pattern": "DROP\\s+TABLE(?!.*CREATE\\s+TABLE.*backup)",
        "severity": "error"
      },
      {
        "id": "no_truncate",
        "name": "No TRUNCATE",
        "type": "regex",
        "pattern": "TRUNCATE\\s+TABLE",
        "severity": "error"
      }
    ]
  }
}
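
The negative lookahead in `no_drop_without_backup` is easy to misread, so it is worth running the pattern against a few inputs. The sketch below assumes the engine compiles the pattern with `new RegExp(pattern)` and no flags; the README does not say which flags are applied, so treat the results as illustrative of the pattern itself rather than of SkillForge's behavior.

```typescript
// Pattern copied from the no_drop_without_backup rule; compiled with no flags (assumption).
const dropRule = new RegExp("DROP\\s+TABLE(?!.*CREATE\\s+TABLE.*backup)");

const samples: Array<[string, string]> = [
  ["bare drop", "DROP TABLE users;"],
  ["backup after, same line", "DROP TABLE users; CREATE TABLE users_backup (id INT);"],
  ["backup after, next line", "DROP TABLE users;\nCREATE TABLE users_backup (id INT);"],
  ["backup before drop", "CREATE TABLE users_backup AS SELECT * FROM users; DROP TABLE users;"],
];

for (const [label, sql] of samples) {
  console.log(`${label}: ${dropRule.test(sql) ? "flagged" : "clean"}`);
}
// bare drop: flagged
// backup after, same line: clean
// backup after, next line: flagged
// backup before drop: flagged
```

Two properties of the pattern explain the last two results: a lookahead only inspects text after the DROP, and without the `s` (dotAll) flag `.` does not cross line breaks. A migration that creates the backup first, or on a separate line, is therefore still flagged under these assumptions, which may call for a custom rule if that ordering should pass.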

Scenario 5: Review / Audit Skill

Goal: Security vulnerability review with structured audit report.

  • Parallel steps: check-injection and check-xss both consume attack_surface independently
  • Fan-in: generate-report consumes from two parent steps
flowchart TD
    UR["user_request"] --> SCAN["scan-surface"]
    SCAN --> AS["attack_surface"]
    AS --> INJ["check-injection"]
    AS --> XSS["check-xss"]
    INJ --> IF["injection_findings"]
    XSS --> XF["xss_findings"]
    IF --> RPT["generate-report"]
    XF --> RPT
    RPT --> AR["audit_report"]

    style UR fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style SCAN fill:#1e293b,stroke:#8b5cf6,color:#e2e8f0
    style AS fill:#1e293b,stroke:#6366f1,color:#e2e8f0
    style INJ fill:#1e293b,stroke:#ef4444,color:#e2e8f0
    style XSS fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style IF fill:#1e293b,stroke:#ef4444,color:#e2e8f0
    style XF fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style RPT fill:#1e293b,stroke:#10b981,color:#e2e8f0
    style AR fill:#065f46,stroke:#10b981,color:#e2e8f0
{
  "name": "security-audit",
  "description": "Reviews code for security vulnerabilities and produces a structured audit report.",
  "triggers": ["audit this code for security", "find vulnerabilities"],
  "negativeTriggers": ["implement authentication", "write a security policy"],
  "steps": [
    {
      "id": "scan-surface",
      "title": "Identify Attack Surface",
      "instruction": "1. Identify user input entry points. 2. Identify database queries. 3. Identify file system operations.",
      "inputs": ["user_request"],
      "outputs": ["attack_surface"],
      "validationCriteria": "All entry points identified."
    },
    {
      "id": "check-injection",
      "title": "Check Injection Vulnerabilities",
      "instruction": "1. Check for SQL injection. 2. Check for command injection. 3. Note file, line, and severity.",
      "inputs": ["attack_surface"],
      "outputs": ["injection_findings"],
      "validationCriteria": "Every query and exec call checked."
    },
    {
      "id": "check-xss",
      "title": "Check XSS and CSRF",
      "instruction": "1. Check for unescaped user input. 2. Check for CSRF token validation. 3. Check cookie flags.",
      "inputs": ["attack_surface"],
      "outputs": ["xss_findings"],
      "validationCriteria": "All rendering paths checked."
    },
    {
      "id": "generate-report",
      "title": "Generate Audit Report",
      "instruction": "1. Combine findings. 2. Sort by severity. 3. Add recommended fixes. 4. Add executive summary.",
      "inputs": ["injection_findings", "xss_findings"],
      "outputs": ["audit_report"],
      "validationCriteria": "All findings included. Sorted by severity."
    }
  ],
  "maker": {
    "agentCount": 3,
    "kThreshold": 2,
    "maxOutputLength": 15000,
    "maxRetries": 3,
    "globalRedFlagRules": [
      {
        "id": "no_blanket_approval",
        "name": "No Blanket Safe Approval",
        "description": "Must not claim 'no vulnerabilities found' without analysis",
        "type": "regex",
        "pattern": "(no (security )?issues found|code is secure|no vulnerabilities)",
        "severity": "warning"
      }
    ]
  }
}
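
The parallel and fan-in structure of this config can be derived from the `inputs` and `outputs` fields alone. The sketch below groups steps into levels whose members have no dependency on each other. It is a hypothetical illustration: `PipelineStep` and `executionLevels` are invented names, and the README does not state whether SkillForge actually schedules such steps concurrently.

```typescript
// Hypothetical sketch: derive dependency levels from declared inputs and outputs.
interface PipelineStep {
  id: string;
  inputs: string[];
  outputs: string[];
}

function executionLevels(steps: PipelineStep[], initial: string[] = ["user_request"]): string[][] {
  const available = new Set<string>(initial);
  let pending = steps.slice();
  const levels: string[][] = [];

  while (pending.length > 0) {
    // A step is ready once all of its inputs have been produced.
    const ready = pending.filter((s) => s.inputs.every((i) => available.has(i)));
    if (ready.length === 0) {
      throw new Error(`Unsatisfiable inputs for: ${pending.map((s) => s.id).join(", ")}`);
    }
    levels.push(ready.map((s) => s.id));
    ready.forEach((s) => s.outputs.forEach((o) => available.add(o)));
    pending = pending.filter((s) => ready.indexOf(s) === -1);
  }
  return levels;
}

const levels = executionLevels([
  { id: "scan-surface", inputs: ["user_request"], outputs: ["attack_surface"] },
  { id: "check-injection", inputs: ["attack_surface"], outputs: ["injection_findings"] },
  { id: "check-xss", inputs: ["attack_surface"], outputs: ["xss_findings"] },
  { id: "generate-report", inputs: ["injection_findings", "xss_findings"], outputs: ["audit_report"] },
]);
console.log(levels);
// [["scan-surface"], ["check-injection", "check-xss"], ["generate-report"]]
```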

Plugin System

SkillForge uses a plugin architecture for language-specific validation. Each language is a self-contained plugin that provides syntax checking, AST analysis, and red-flag presets.

Architecture Overview

flowchart TD
    RF["red-flag.ts\ndetectRedFlags"]
    RF -->|"length, regex, custom"| Direct["Handle Directly"]
    RF -->|"syntax / ast"| PR["PluginRegistry\nSingleton"]

    PR --> JS["JavaScript /\nTypeScript Plugin"]
    PR --> PY["Python\nPlugin"]
    PR --> JSON["JSON\nPlugin"]
    PR --> DART["Dart / Flutter\nPlugin"]
    PR --> CUSTOM["Your Custom\nPlugin"]

    JS --> R1["checkSyntax\ncheckAST\ngetRedFlagPreset"]
    PY --> R2["checkSyntax\ncheckAST\ngetRedFlagPreset"]
    JSON --> R3["checkSyntax\ncheckAST\ngetRedFlagPreset"]
    DART --> R4["checkSyntax\ncheckAST\ngetRedFlagPreset"]
    CUSTOM --> R5["checkSyntax\ncheckAST\ngetRedFlagPreset"]

    style RF fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style Direct fill:#1e293b,stroke:#6366f1,color:#e2e8f0
    style PR fill:#1e293b,stroke:#8b5cf6,color:#e2e8f0
    style JS fill:#1e293b,stroke:#3b82f6,color:#e2e8f0
    style PY fill:#1e293b,stroke:#10b981,color:#e2e8f0
    style JSON fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style DART fill:#1e293b,stroke:#06b6d4,color:#e2e8f0
    style CUSTOM fill:#1e293b,stroke:#ec4899,color:#e2e8f0
    style R1 fill:#0f172a,stroke:#3b82f6,color:#94a3b8
    style R2 fill:#0f172a,stroke:#10b981,color:#94a3b8
    style R3 fill:#0f172a,stroke:#f59e0b,color:#94a3b8
    style R4 fill:#0f172a,stroke:#06b6d4,color:#94a3b8
    style R5 fill:#0f172a,stroke:#ec4899,color:#94a3b8

Built-in Plugins

SkillForge ships with 4 built-in plugins, automatically registered on import:

JavaScript / TypeScript Plugin

Languages: javascript, typescript

Syntax checks:

  • Balanced braces {}, parentheses (), brackets []
  • Unclosed template literals
  • Unclosed string literals (per-line)

AST checks:

  • Hallucinated imports (suspiciously long package names like my-super-awesome-package)
  • Duplicate function/variable declarations

Preset: 7 rules — length, syntax, AST, empty output, apology, placeholders, truncation
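
A bracket-balance check of the kind listed above is typically a stack walk. The sketch below is a simplified illustration, not the plugin's source: `checkBalanced` is a hypothetical name, and unlike a production checker it does not skip brackets that appear inside strings or comments.

```typescript
// Hypothetical sketch of a stack-based bracket balance check.
function checkBalanced(code: string): string[] {
  // Maps each closing bracket to the opener it must match.
  const openerFor: Record<string, string> = { "}": "{", ")": "(", "]": "[" };
  const stack: string[] = [];
  const errors: string[] = [];

  for (let i = 0; i < code.length; i++) {
    const ch = code[i];
    if (ch === "{" || ch === "(" || ch === "[") {
      stack.push(ch);
      continue;
    }
    const expected = openerFor[ch];
    if (expected !== undefined) {
      // Closing bracket: the most recent opener must be its partner.
      const open = stack.pop();
      if (open !== expected) {
        errors.push(`Unexpected "${ch}" at offset ${i}`);
      }
    }
  }

  // Anything left on the stack was never closed.
  for (const open of stack) {
    errors.push(`Unclosed "${open}"`);
  }
  return errors;
}

console.log(checkBalanced("function f() { return [1, 2]; }")); // []
console.log(checkBalanced("if (x) {"));
```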


Python Plugin

Languages: python

Syntax checks:

  • Balanced parentheses and brackets
  • Mixed tabs and spaces in indentation
  • Missing colons after def, class, if, for, while, etc.

Preset: 6 rules — length, syntax, empty output, apology, placeholders (including pass #), truncation
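
The indentation and colon checks can be approximated line by line. The sketch below is a rough heuristic written for illustration, with a hypothetical function name `checkPythonLines`; it will misreport block statements that span several lines or that put the body on the same line as the colon, and the real plugin may handle those cases differently.

```typescript
// Hypothetical line-based sketch of two Python checks; heuristics are assumed.
function checkPythonLines(code: string): string[] {
  const errors: string[] = [];
  const blockKeyword = /^\s*(def|class|if|elif|else|for|while|try|except|finally|with)\b/;

  code.split(/\r?\n/).forEach((line, index) => {
    const lineNo = index + 1;

    // Mixed indentation: leading whitespace contains both a space and a tab.
    const indent = (line.match(/^[ \t]*/) ?? [""])[0];
    if (indent.indexOf(" ") !== -1 && indent.indexOf("\t") !== -1) {
      errors.push(`Line ${lineNo}: mixed tabs and spaces in indentation`);
    }

    // Missing colon: a block statement whose line does not end with ":".
    if (blockKeyword.test(line) && !line.trim().endsWith(":")) {
      errors.push(`Line ${lineNo}: missing colon after block statement`);
    }
  });
  return errors;
}

console.log(checkPythonLines("def add(a, b)\n \treturn a + b"));
```

The sample input triggers both checks: line 1 lacks the colon after `def`, and line 2 is indented with a space followed by a tab.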


JSON Plugin

Languages: json

Syntax checks:

  • Full JSON.parse() validation — catches all JSON syntax errors

Preset: 2 rules — syntax, empty output


Dart / Flutter Plugin

Languages: dart, flutter

Syntax checks:

  • Balanced braces, parentheses, brackets, and generic angle brackets
  • Unclosed string literals (handles raw strings r'...')
  • Missing semicolons after var, final, const, return, throw, late
  • Empty build() method bodies (Flutter-specific)
  • Broken cascade notation ..

AST checks:

  • Hallucinated package: imports
  • Duplicate class declarations
  • Duplicate top-level function declarations
  • Missing @override on lifecycle methods (build, initState, dispose, etc.)
  • StatefulWidget without matching State<WidgetName> class

Preset: 13 rules — length, syntax, AST, empty output, apology, placeholders, truncation, debug print(), deprecated widgets (RaisedButton, FlatButton, OutlineButton), excessive dynamic type, hardcoded colors


Writing a Custom Plugin

Create a plugin for any language in 3 steps:

Step 1: Define the Plugin

import type { LanguagePlugin } from 'skillforge';

const swiftPlugin: LanguagePlugin = {
  name: 'swift',
  languages: ['swift', 'swiftui'],

  checkSyntax(code: string): string[] {
    const errors: string[] = [];

    // Example: check for balanced braces
    const opens = (code.match(/{/g) ?? []).length;
    const closes = (code.match(/}/g) ?? []).length;
    if (opens !== closes) {
      errors.push(`Unbalanced braces: ${opens} opening vs ${closes} closing`);
    }

    // Example: check for force-unwrap abuse
    const forceUnwraps = (code.match(/\w+!/g) ?? []).length;
    if (forceUnwraps > 5) {
      errors.push(`Excessive force-unwrapping: ${forceUnwraps} instances — use optional binding`);
    }

    return errors;
  },

  checkAST(code: string): string[] {
    const issues: string[] = [];

    // Example: detect duplicate struct declarations
    const structPattern = /struct\s+(\w+)/g;
    const structs = new Map<string, number>();
    let match;
    while ((match = structPattern.exec(code)) !== null) {
      structs.set(match[1], (structs.get(match[1]) ?? 0) + 1);
    }
    for (const [name, count] of structs) {
      if (count > 1) {
        issues.push(`Duplicate struct: "${name}" declared ${count} times`);
      }
    }

    return issues;
  },

  getRedFlagPreset() {
    return [
      {
        id: 'swift_syntax',
        name: 'Swift Syntax Error Detection',
        description: 'Generated Swift code contains syntax errors',
        type: 'syntax' as const,
        language: 'swift',
        severity: 'error' as const,
      },
      {
        id: 'swift_ast',
        name: 'Swift AST Issue Detection',
        description: 'Generated Swift code has structural issues',
        type: 'ast' as const,
        language: 'swift',
        severity: 'warning' as const,
      },
      {
        id: 'swift_force_unwrap',
        name: 'Force Unwrap Detection',
        description: 'Code uses excessive force-unwrapping (!)',
        type: 'regex' as const,
        pattern: '\\w+!\\.',
        severity: 'warning' as const,
      },
    ];
  },
};

Step 2: Register It

import { registerPlugin } from 'skillforge';

registerPlugin(swiftPlugin);

That's it! Now swift and swiftui are supported everywhere in SkillForge.

Step 3: Use It

# CLI uses it automatically
npx tsx bin/skillforge.ts init my-swift-skill --language swift

Or use it programmatically:

import { getRedFlagPreset, detectRedFlags } from 'skillforge';
const rules = getRedFlagPreset('swift');
const detections = detectRedFlags(swiftCode, rules);

Plugin API Reference

LanguagePlugin Interface

interface LanguagePlugin {
  name: string;              // Unique plugin name
  languages: string[];       // Language identifiers this plugin handles
  checkSyntax(code: string): string[];    // Returns error descriptions
  checkAST(code: string): string[];       // Returns issue descriptions
  getRedFlagPreset(): RedFlagRule[];      // Returns preset rules
}

registerPlugin(plugin)

Register a custom plugin with the global registry. If a language identifier is already registered, the new plugin overrides it.

import { registerPlugin } from 'skillforge';
registerPlugin(myPlugin);
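The override rule can be pictured as a map keyed by language identifier. The sketch below is a simplified stand-in, not SkillForge's registry: MiniPlugin and MiniRegistry are hypothetical names, and the per-identifier behavior for partially overlapping plugins is an assumption drawn from the description above.

```typescript
// Simplified stand-in for LanguagePlugin (only the fields used here).
interface MiniPlugin {
  name: string;
  languages: string[];
}

// Hypothetical registry: maps each language identifier to its handling plugin.
class MiniRegistry {
  private byLanguage = new Map<string, MiniPlugin>();

  register(plugin: MiniPlugin): void {
    plugin.languages.forEach((lang) => {
      // A later registration for the same identifier replaces the earlier one.
      this.byLanguage.set(lang, plugin);
    });
  }

  getPlugin(language: string): MiniPlugin | undefined {
    return this.byLanguage.get(language);
  }
}

const registry = new MiniRegistry();
registry.register({ name: "dart", languages: ["dart", "flutter"] });
registry.register({ name: "my-dart", languages: ["dart"] });

console.log(registry.getPlugin("dart")?.name);    // "my-dart"
console.log(registry.getPlugin("flutter")?.name); // "dart"
```

Under this assumption, overriding only dart leaves flutter mapped to the earlier plugin, so a full replacement would list both identifiers.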

pluginRegistry

Direct access to the singleton registry for advanced use:

import { pluginRegistry } from 'skillforge';

pluginRegistry.getAllPlugins();           // LanguagePlugin[]
pluginRegistry.getSupportedLanguages();   // string[]
pluginRegistry.isSupported('swift');      // boolean
pluginRegistry.getPlugin('dart');         // LanguagePlugin | undefined
pluginRegistry.getPreset('dart');         // RedFlagRule[]
pluginRegistry.checkSyntax(code, 'dart'); // string[]
pluginRegistry.checkAST(code, 'dart');    // string[]

Red-Flag Rules

Red-flag rules are the safety net that prevents bad output from being accepted.

Built-in Rule Types

| Type | Field | Description | Example |
| --- | --- | --- | --- |
| length | maxLength | Output exceeds character limit | Code over 10,000 chars |
| regex | pattern | Output matches a forbidden pattern | Apology phrases, TODOs |
| syntax | language | Output has syntax errors (delegated to plugin) | Invalid TypeScript |
| ast | language | Output has structural issues (delegated to plugin) | Hallucinated imports |
| custom | scriptPath | Custom validation script | Your own checker |

Rule Severity

| Severity | Behavior |
| --- | --- |
| warning | Output is discarded and resampled (agent tries again) |
| error | Pipeline halts immediately and reports the failure |
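To make the two behaviors concrete, here is a minimal sketch of a sampling loop that resamples on warnings and halts on errors. All names (runWithRedFlags, Detection, Outcome) are hypothetical, and the retry cap is an assumption modeled on the maxRetries setting mentioned in the FAQ; the real pipeline may be organized differently.

```typescript
type Severity = "warning" | "error";

interface Detection {
  ruleId: string;
  severity: Severity;
}

type Outcome =
  | { status: "accepted"; output: string; attempts: number }
  | { status: "halted"; ruleId: string }
  | { status: "exhausted"; attempts: number };

// Hypothetical loop: `sample` produces a candidate, `detect` returns red flags.
function runWithRedFlags(
  sample: (attempt: number) => string,
  detect: (output: string) => Detection[],
  maxRetries: number,
): Outcome {
  const maxAttempts = maxRetries + 1;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const output = sample(attempt);
    const detections = detect(output);

    // Any error-severity detection stops the pipeline immediately.
    const fatal = detections.find((d) => d.severity === "error");
    if (fatal) return { status: "halted", ruleId: fatal.ruleId };

    // No detections at all: the output is accepted.
    if (detections.length === 0) {
      return { status: "accepted", output, attempts: attempt };
    }
    // Only warnings remain: discard this output and resample.
  }
  return { status: "exhausted", attempts: maxAttempts };
}
```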

Presets

Get language-specific rule presets via the unified getRedFlagPreset() function:

import { getRedFlagPreset } from 'skillforge';

// Returns the full preset for any registered language
const tsRules = getRedFlagPreset('typescript');    // 7 rules
const dartRules = getRedFlagPreset('dart');         // 13 rules
const pyRules = getRedFlagPreset('python');         // 6 rules
const jsonRules = getRedFlagPreset('json');         // 2 rules

// Documentation preset (not language-specific)
import { getDocRedFlagPreset } from 'skillforge';
const docRules = getDocRedFlagPreset();             // 3 rules

Writing Custom Rules

{
  "id": "my_custom_rule",
  "name": "Human-Readable Name",
  "description": "What this rule catches and why",
  "type": "regex",
  "pattern": "your_regex_pattern_here",
  "severity": "warning"
}

Tips:

  • Use \\b for word boundaries to avoid false positives
  • Use \\s+ for flexible whitespace matching
  • Test your regex against sample outputs before adding
  • Prefer warning for style issues, error for correctness
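One way to follow the third tip is to run a candidate pattern against a few labelled samples before adding the rule. The sketch below assumes the pattern string is compiled with new RegExp(pattern) and no flags; that detail, and the names RegexRule and matchesRule, are illustrative assumptions rather than SkillForge internals.

```typescript
// Mirrors the JSON rule shape above, reduced to the fields a regex rule needs.
interface RegexRule {
  id: string;
  pattern: string;
  severity: "warning" | "error";
}

const todoRule: RegexRule = {
  id: "no_todo",
  // In a JSON or TypeScript string, "\\b" becomes the regex token \b.
  pattern: "\\bTODO\\b",
  severity: "warning",
};

function matchesRule(rule: RegexRule, output: string): boolean {
  return new RegExp(rule.pattern).test(output);
}

// Labelled samples: which outputs should the rule flag?
const samples = [
  { text: "// TODO: implement later", shouldMatch: true },
  { text: "const autoTODOs = [];", shouldMatch: false },
];

for (const s of samples) {
  const ok = matchesRule(todoRule, s.text) === s.shouldMatch;
  console.log(`${ok ? "PASS" : "FAIL"}: ${s.text}`);
}
```

Without the \b anchors, the second sample would be flagged, which is the kind of false positive the first tip is meant to prevent.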

Testing

SkillForge generates three types of tests automatically:

Trigger Tests

Verify the skill activates (or doesn't) for given prompts. Uses four weighted strategies:

| Strategy | Weight | How it works |
| --- | --- | --- |
| Exact match | 1.0 | Direct string comparison |
| Keyword overlap | 0.7 | Token intersection |
| Fuzzy n-gram similarity | 0.5 | Character-level similarity |
| Description match | 0.3 | Semantic description comparison |
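The sketch below shows how the first three strategies could be scored for a single prompt and trigger phrase. It is an illustration only: the Jaccard measure, the trigram size, and combining strategies by taking the highest weighted score are all assumptions, and the description-match strategy is omitted. The function names are hypothetical.

```typescript
function tokens(s: string): string[] {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Jaccard similarity of two collections, treated as sets.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  let shared = 0;
  setA.forEach((x) => {
    if (setB.has(x)) shared++;
  });
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

function charNgrams(s: string, n: number = 3): string[] {
  const text = s.toLowerCase();
  const grams: string[] = [];
  for (let i = 0; i + n <= text.length; i++) {
    grams.push(text.slice(i, i + n));
  }
  return grams;
}

// Assumed combination rule: the best weighted strategy wins.
function triggerScore(prompt: string, trigger: string): number {
  const exact = prompt.trim().toLowerCase() === trigger.trim().toLowerCase() ? 1 : 0;
  const keyword = jaccard(tokens(prompt), tokens(trigger));
  const fuzzy = jaccard(charNgrams(prompt), charNgrams(trigger));
  return Math.max(1.0 * exact, 0.7 * keyword, 0.5 * fuzzy);
}
```

With these assumptions, an exact match scores 1, a prompt that shares some keywords scores somewhere between 0 and 0.7, and an unrelated prompt scores 0.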

Functional Tests

Verify structural integrity:

  • All steps have validation criteria
  • All steps have inputs and outputs
  • Data flow is valid (no orphaned inputs)
  • MAKER config is consistent (K ≤ agent count)
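The structural checks in this list can be sketched in a few lines. The StepLike shape and function names below are hypothetical, and treating user_request as the only initially available input is an assumption based on the troubleshooting notes; the messages mirror the ones shown under Troubleshooting.

```typescript
interface StepLike {
  id: string;
  inputs: string[];
  outputs: string[];
}

// Walk the steps in order, tracking which names have been produced so far.
function checkDataFlow(
  steps: StepLike[],
  initialInputs: string[] = ["user_request"],
): string[] {
  const issues: string[] = [];
  const available = new Set<string>(initialInputs);

  steps.forEach((step) => {
    if (step.inputs.length === 0 || step.outputs.length === 0) {
      issues.push(`Step "${step.id}" has no inputs or outputs`);
    }
    step.inputs.forEach((input) => {
      if (!available.has(input)) {
        issues.push(
          `Step "${step.id}" requires input "${input}" which is not produced by any previous step`,
        );
      }
    });
    step.outputs.forEach((output) => available.add(output));
  });

  return issues;
}

// K matching outputs cannot be collected from fewer than K agents.
function checkMakerConfig(kThreshold: number, agentCount: number): string[] {
  return kThreshold > agentCount
    ? [`K-threshold (${kThreshold}) exceeds agent count (${agentCount})`]
    : [];
}
```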

Comparison Tests

Score output quality on 9 heuristics (0-100%):

  • Meaningful content, reasonable length, no placeholders, no apologies
  • Structured code, error handling, documentation
  • No debug logging, type annotations
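A percentage score of this kind can be computed as the share of heuristics an output passes. The sketch below uses three illustrative heuristics with equal weight; the regexes, the equal weighting, and the names Heuristic and qualityScore are assumptions, not the scoring SkillForge actually applies.

```typescript
interface Heuristic {
  name: string;
  passes: (output: string) => boolean;
}

// Three illustrative heuristics out of the nine listed above.
const sampleHeuristics: Heuristic[] = [
  { name: "no placeholders", passes: (o) => !/\b(TODO|FIXME)\b/.test(o) },
  { name: "no apologies", passes: (o) => !/\b(sorry|i apologize)\b/i.test(o) },
  { name: "no debug logging", passes: (o) => !/console\.log\(/.test(o) },
];

// Score = percentage of heuristics passed, rounded to a whole number.
function qualityScore(output: string, heuristics: Heuristic[]): number {
  if (heuristics.length === 0) return 0;
  const passed = heuristics.filter((h) => h.passes(output)).length;
  return Math.round((passed / heuristics.length) * 100);
}
```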

SDK API Reference

Import everything from the main entry point:

import { detectRedFlags, decompose, generateSkill, registerPlugin } from 'skillforge';

Core Functions

detectRedFlags(output, rules)

Scan output against an array of RedFlagRule objects. Returns RedFlagDetection[].

import { detectRedFlags, getRedFlagPreset } from 'skillforge';

const rules = getRedFlagPreset('typescript');
const detections = detectRedFlags(llmOutput, rules);

for (const d of detections) {
  console.log(`🚩 ${d.ruleName}: ${d.reason}`);
  if (d.discarded) console.log('   → Output discarded');
}

decompose(config)

Break a SkillConfig into AtomicStep[]. Each atomic step is a single-responsibility micro-step.

import { decompose, getDecompositionReport } from 'skillforge';

const steps = decompose(config);
const report = getDecompositionReport(steps);
console.log(`${report.parentSteps} steps → ${report.totalAtomicSteps} atomic steps`);

validateConsensus(step, outputs, options?)

Run K-threshold consensus validation on multiple LLM outputs for a step.

import { validateConsensus } from 'skillforge';

const result = validateConsensus(step, [outputA, outputB, outputC], {
  kThreshold: 2,
  similarityThreshold: 0.8,
});

if (result.consensusReached) {
  console.log(`Consensus: ${result.votes} agents agreed`);
  console.log(result.accepted);
}
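For intuition, K-threshold voting can be sketched as: count, for each candidate, how many outputs are at least similarityThreshold similar to it, then accept the best-supported candidate if it has at least K votes. The token-level Jaccard similarity and the function names below are assumptions for illustration; validateConsensus may use a different metric and tie-breaking.

```typescript
// Token-level Jaccard similarity; an assumed stand-in for the real metric.
function tokenSimilarity(a: string, b: string): number {
  const ta = new Set(a.split(/\s+/).filter((t) => t.length > 0));
  const tb = new Set(b.split(/\s+/).filter((t) => t.length > 0));
  if (ta.size === 0 && tb.size === 0) return 1;
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

interface SketchConsensus {
  consensusReached: boolean;
  votes: number;
  accepted?: string;
}

function kThresholdConsensus(
  outputs: string[],
  kThreshold: number,
  similarityThreshold: number,
): SketchConsensus {
  let bestVotes = 0;
  let bestCandidate: string | undefined = undefined;

  // Each candidate collects a vote from every output similar enough to it
  // (including itself).
  for (const candidate of outputs) {
    const votes = outputs.filter(
      (o) => tokenSimilarity(candidate, o) >= similarityThreshold,
    ).length;
    if (votes > bestVotes) {
      bestVotes = votes;
      bestCandidate = candidate;
    }
  }

  const reached = bestVotes >= kThreshold;
  return {
    consensusReached: reached,
    votes: bestVotes,
    accepted: reached ? bestCandidate : undefined,
  };
}
```

With three outputs where two are identical and K is 2, the identical pair wins; raising K to 3 leaves the step without consensus.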

generateSkill(config, options)

Run the full pipeline: decompose → validate → assemble → write.

import { createDefaultConfig, generateSkill } from 'skillforge';

const config = createDefaultConfig('my-skill', 'Does amazing things');
const result = generateSkill(config, {
  outputDir: './dist',
  verbose: true,
  validate: true,
});

assemble(config, atomicSteps, validationResults)

Generate the final SKILL.md and all output files from decomposed steps.

countBalance(code, open, close)

Shared utility — counts bracket balance respecting string literals. Used by plugins internally.
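A string-aware balance count can be sketched as a single pass that ignores delimiters inside quoted strings. The version below is an illustration, not the library's countBalance: the name countBalanceSketch, the return value (opens minus closes), and the handling of single, double, and backtick quotes with backslash escapes are all assumptions.

```typescript
// Returns (number of `open`) - (number of `close`) found outside string literals.
function countBalanceSketch(code: string, open: string, close: string): number {
  let balance = 0;
  let quote: string | null = null; // the quote character we are inside, if any

  for (let i = 0; i < code.length; i++) {
    const ch = code.charAt(i);

    if (quote !== null) {
      if (ch === "\\") {
        i++; // skip the escaped character
      } else if (ch === quote) {
        quote = null; // closing quote
      }
      continue;
    }

    if (ch === '"' || ch === "'" || ch === "`") {
      quote = ch;
    } else if (ch === open) {
      balance++;
    } else if (ch === close) {
      balance--;
    }
  }
  return balance;
}
```

A result of 0 means the delimiters balance; a positive value means unclosed openers and a negative value means stray closers. Comments are not skipped in this sketch.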


Plugin Functions

import {
  registerPlugin,        // Register a custom LanguagePlugin
  pluginRegistry,        // Direct registry access
  getRedFlagPreset,      // Get rules for any language

  // Individual built-in plugins
  javascriptPlugin,
  pythonPlugin,
  jsonPlugin,
  dartPlugin,
} from 'skillforge';

import type { LanguagePlugin } from 'skillforge';

Generator Functions

import {
  generateSkill,           // Full pipeline
  createDefaultConfig,     // Create a starter SkillConfig

  // Template rendering
  render,                  // Mustache-style template rendering
  renderConfigJson,        // Render config as formatted JSON
  renderFrontmatter,       // Render YAML frontmatter
  renderDecompositionTable, // Render decomposition as table

  // Script generation
  generateCodeValidator,    // Generate JS/TS validation script
  generatePythonValidator,  // Generate Python validation script
  generateTestSuiteScript,  // Generate test runner script

  // Templates
  getCodeSkillTemplate,     // Get the SKILL.md template
} from 'skillforge';

Testing Functions

import {
  generateDefaultTests,    // Auto-generate tests from config
  runTestSuite,            // Run all tests
  formatTestReport,        // Format results for console

  runTriggerTest,          // Run a single trigger test
  runFunctionalTest,       // Run a single functional test
  validateOutputAgainstExpected, // Compare output to expected
} from 'skillforge';

// Full test flow
const tests = generateDefaultTests(config);
const results = runTestSuite(config, tests);
console.log(formatTestReport(results));

Types

All TypeScript interfaces are exported for type-safe usage:

import type {
  // Config
  SkillConfig,
  WorkflowStep,
  MAKERConfig,
  RedFlagRule,

  // Decomposition
  AtomicStep,
  IOContract,
  FieldSpec,

  // Validation
  ValidationResult,
  ValidationError,
  RedFlagDetection,
  ConsensusResult,
  CandidateOutput,

  // Generation
  GeneratedSkill,
  GeneratedFile,
  SkillFrontmatter,
  GenerateResult,
  GeneratorOptions,

  // Testing
  TestCase,
  TestResult,
  TestSuiteResult,
  TestType,

  // Plugin
  LanguagePlugin,

  // CLI
  CLIOptions,
} from 'skillforge';

Project Structure

skill_sdk/
├── bin/
│   └── skillforge.ts              # CLI entry point (6 commands)
├── src/
│   ├── index.ts                   # Public SDK API (barrel export)
│   ├── types/
│   │   ├── index.ts               # All TypeScript interfaces
│   │   └── plugin.ts              # LanguagePlugin interface
│   ├── core/
│   │   ├── decomposer.ts          # Maximal Agentic Decomposition
│   │   ├── validator.ts           # K-threshold consensus voting
│   │   ├── red-flag.ts            # Red-flag orchestrator (delegates to plugins)
│   │   ├── assembler.ts           # SKILL.md generation
│   │   ├── plugin-registry.ts     # Plugin registration & lookup
│   │   └── helpers.ts             # Shared utilities (countBalance)
│   ├── plugins/
│   │   ├── index.ts               # Auto-registers all built-in plugins
│   │   ├── javascript.ts          # JS/TS plugin
│   │   ├── python.ts              # Python plugin
│   │   ├── json.ts                # JSON plugin
│   │   └── dart.ts                # Dart/Flutter plugin
│   ├── generators/
│   │   ├── skill-generator.ts     # Full pipeline orchestrator
│   │   ├── script-generator.ts    # Validation script generators
│   │   └── template-engine.ts     # Mustache-like renderer
│   └── testing/
│       ├── test-runner.ts         # Test orchestrator
│       ├── trigger-test.ts        # Trigger matching tests
│       └── functional-test.ts     # Structural validation tests
├── tests/
│   ├── decomposer.test.ts         # 12 tests
│   ├── validator.test.ts          # 10 tests
│   ├── red-flag.test.ts           # 24 tests
│   ├── plugins.test.ts            # 22 tests
│   └── integration.test.ts        # 5 tests
├── examples/
│   ├── react-component-skill/
│   ├── express-rest-api-generator/
│   └── flutter-accessibility-audit/
├── package.json
└── tsconfig.json

Troubleshooting

"Config not found" error

❌ Config not found: /path/to/skillforge.config.json

Ensure you're passing the correct --config path. By default, the CLI looks for ./skillforge.config.json in the current directory.

Step has no inputs or outputs

Structural issues found: Step "step-2" has no inputs or outputs

Every step needs at least one entry in inputs and outputs. The first step typically takes ["user_request"].

Data flow validation failure

Step "step-3" requires input "analysis" which is not produced by any previous step

Check that the outputs of an earlier step include the exact string used in inputs. Names must match exactly.

K-threshold exceeds agent count

K-threshold (4) exceeds agent count (3)

kThreshold must be ≤ agentCount. With fewer than K agents, K matching outputs can never be collected, so consensus is impossible.

Red-flag regex false positives

If a regex rule flags valid output, try one of the following:

  1. Make the pattern more specific (add \b, ^, $)
  2. Change severity from error to warning
  3. Remove the rule from globalRedFlagRules

Plugin not registered

getRedFlagPreset('swift') returns []

Custom plugins must be registered before use:

import { registerPlugin } from 'skillforge';
registerPlugin(swiftPlugin);

FAQ

Q: Does SkillForge call an LLM during build? No. SkillForge is a pure scaffolding and validation tool. It generates the skill definition, decomposition, validation scripts, and test suite — but the actual LLM calls happen at runtime.

Q: What's the difference between warning and error severity? warning discards the output and retries. error halts the pipeline immediately — use this for safety-critical rules.

Q: Can steps run in parallel? Yes! If two steps consume the same input but don't depend on each other, they are implicitly parallel. SkillForge wires dependencies automatically.

Q: How many atomic steps should I expect? Roughly 5-12 per workflow step. A typical 4-step skill produces 20-40 atomic steps.

Q: Can I add a new programming language? Yes! Create a LanguagePlugin and call registerPlugin(). See Writing a Custom Plugin.

Q: What languages are supported out of the box? JavaScript, TypeScript, Python, JSON, Dart, and Flutter. Each has a dedicated plugin with language-specific heuristics.

Q: What happens if no agents agree? The step is retried up to maxRetries times. If it still fails, the pipeline reports the failure with all candidate outputs.

Q: Can I use custom validation scripts? Yes. Use the custom red-flag rule type with a scriptPath. The script receives the output and should exit with 0 (pass) or 1 (fail).

Q: How do I override a built-in plugin? Call registerPlugin() with a plugin whose languages array lists the same language identifiers as the built-in one. The new plugin replaces the built-in one in the registry for those identifiers.


License

MIT
