Most teams think they have a working AI agent setup — a goal doc, some instructions, a CI pipeline, a review process. In practice, the goal is stale, the instructions are a wall of text, the checks only catch generic errors, reviews have no criteria, and nothing has been updated in months. Harness Engineering Toolkit detects these five blind spots and helps you fix them. It works with any AI agent (Claude Code, Codex, Cursor, Windsurf, Copilot, and 41+ others) via the Agent Skills standard.
Built on the 4-Layer Harness Model:
L1 Goal Anchoring → L2 Context Engineering → L3 Execution Constraints → L4 Evaluation Loop
| Skill | What it does | For whom |
|---|---|---|
| `/harness-self-check` | Interactive diagnostic. Assumes you already have a setup, then probes until the gap reveals itself. Covers the five blind spots, one at a time. | Anyone — any industry |
| `/harness-audit` | Automated codebase scan. Detects the five blind spots by checking files, CI config, and git history. Offers to auto-fix what it finds. | Software developers |
| Skill | What it does | For whom |
|---|---|---|
| `/harness-create` | End-to-end harness creation: interview → spec → runnable project. Two entry points: start from scratch (interviews you first) or from an existing HARNESS_SPEC.md (skips straight to the build). Three build modes: document, code, investigation. | Anyone |
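For the second entry point, `/harness-create` reads an existing HARNESS_SPEC.md. The toolkit defines that file's actual format; the skeleton below is only an illustration of the kind of content such a spec carries, loosely mirroring the four harness layers — every section name here is an assumption, not the real schema:

```markdown
# HARNESS_SPEC.md — illustrative skeleton (section names are assumptions)

## Goal
One sentence stating what the agent is optimizing, and when it was last reviewed.

## Constraints
- Priority-ordered instructions the agent must not violate.

## Checks
- Project-specific commands the evaluation loop runs (lint, tests, custom validators).

## Review criteria
- What a human reviewer accepts or rejects, stated explicitly.
```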
| Skill | What it does | For whom |
|---|---|---|
| `/harness-retro` | Retrospective on completed harness tasks. Analyzes run data (scores, eval reports, changelogs) across six dimensions: convergence, bottleneck dimensions, repeated work, cost efficiency, prompt quality, and cross-task patterns. Generates concrete prompt/rubric improvements. Optionally archives completed tasks. | Anyone who has run a harness |
| # | You think | Actually |
|---|---|---|
| 1 | "I have a goal" | Goal doc is months old, agent is optimizing an outdated target |
| 2 | "I have instructions" | 150 lines, no priority structure, agent makes its own judgments |
| 3 | "I have checks" | CI runs defaults only, project-specific errors aren't caught |
| 4 | "I review everything" | No criteria, quality varies by how busy you are |
| 5 | "I set up my harness" | Nothing changed in 30+ days, flywheel stopped |
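Blind spot #5 is mechanically checkable: if the harness files haven't been committed to in weeks, the flywheel has stopped. A minimal sketch of such a staleness check, assuming the harness lives in git-tracked files at the repo root — the filenames and 30-day threshold are assumptions for illustration, not what `/harness-audit` actually runs:

```shell
#!/bin/sh
# Flag harness files whose last commit is older than 30 days.
# Illustrative only; the real /harness-audit checks may differ.

days_since() {
  # $1: a Unix timestamp; prints whole days elapsed since then
  echo $(( ( $(date +%s) - $1 ) / 86400 ))
}

for f in HARNESS_SPEC.md AGENTS.md; do           # filenames are assumptions
  [ -f "$f" ] || continue
  last=$(git log -1 --format=%ct -- "$f")        # epoch of last commit touching $f
  if [ "$(days_since "$last")" -gt 30 ]; then
    echo "STALE: $f (last updated $(days_since "$last") days ago)"
  fi
done
```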
```
npx skills add nnabuuu/harness-engineering-toolkit
```

The CLI auto-detects your installed agents and places the skills in the correct directories. It supports 41+ tools via the open Agent Skills standard.
Options:

```
npx skills add nnabuuu/harness-engineering-toolkit -g             # Global install
npx skills add nnabuuu/harness-engineering-toolkit -a claude-code # Target a specific agent
npx skills add nnabuuu/harness-engineering-toolkit --list         # Preview available skills
```

Alternatively, to use the skills in claude.ai without the CLI:

- Download the `SKILL.md` files from each skill directory
- Create a Project in claude.ai → upload the `SKILL.md` files as Project Knowledge
- Start chatting: "Check my harness setup"
`/harness-create` generates a `dag.yaml` alongside every harness. If Dagu is installed, you get a web UI with dependency graphs, step status, and run history. details →
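A sketch of what such a `dag.yaml` might contain, following Dagu's `steps`/`depends` layout — the step names and commands are invented for illustration, and the file `/harness-create` actually emits may differ:

```yaml
# Hypothetical dag.yaml for a three-step harness run (names/commands invented).
steps:
  - name: build
    command: npm run build
  - name: evaluate          # runs only after build succeeds
    command: npm run eval
    depends:
      - build
  - name: report            # runs only after evaluate succeeds
    command: npm run report
    depends:
      - evaluate
```

With a file like this in place, Dagu's web UI renders the `depends` entries as a dependency graph and tracks each step's status per run.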
Don't know where to start? Run `/harness-self-check`. It asks you questions and tells you what to fix first.
Software developer wanting a scan? Run `/harness-audit` in your repo. It detects blind spots and offers fixes.
See what a diagnostic session and an automated audit look like in practice. examples →
- Two Scopes — single-session vs. long-term harness
- File Structure & Extending — project layout and adding domain adapters
The 4-layer model comes from the "驯化 AI" ("Taming AI") article series. It adds Goal Anchoring (L1) as an explicit foundation, which no existing framework foregrounds: OpenAI starts from context, Hashimoto from errors, LangChain from primitives. This one starts from goals, because "everyone knows what we're doing" is the assumption that breaks first.
MIT.