Open Source — 3 MCP Servers — Cortex Integrated

From finding to verified PR.
11 stages. Zero LLM-judges-LLM.

A bug report goes in. A verified pull request comes out. 11 stages in between, zero prompts. Backed by 66 MCP tools, 64 deterministic HOR rules, 5 science-backed git analytics, and 7 verification algorithms. Every claim traces to a paper.

> /plugin marketplace add cdeust/ai-architect-mcp
> /plugin install ai-architect

Requires macOS 14+ with Apple Silicon (M1/M2/M3/M4)

What you get

$ /ai-architect:run-pipeline

[00/10] Health Check ............... PASS   0m 08s
[01/10] Discovery .................. PASS   1m 24s
[02/10] Impact & Compound Score .... PASS   2m 12s
[03/10] Integration Design ......... PASS   3m 05s
[04/10] PRD Generation ............. PASS  14m 18s
[04.5]  Plan Interview (10-dim) .... PASS   2m 47s
[05/10] Review (7 verifiers) ....... PASS   6m 33s
[06/10] Implementation ............. PASS  22m 41s
[07/10] Verification (64 HOR) ...... PASS   4m 16s
[08/10] Benchmark .................. PASS   1m 22s
[09/10] Deployment (544 tests) ..... PASS   2m 08s
[10/10] Pull Request + Cortex ...... PASS   0m 52s

Pipeline complete. PR #142 created.
12 files changed, 1,847 insertions(+), 203 deletions(-)
Cortex: 4 memories stored for next session

What this pipeline does

It takes a finding and ships a verified pull request without asking you anything.

Give it a bug report, a feature request, a research paper, or any actionable input. The pipeline parses it, scores the impact with compound scoring (relevance × uniqueness × impact × confidence), designs the integration, generates a verified PRD, runs a 10-dimension Plan Interview gate, implements the code, runs 7 verification algorithms, executes the full test suite, benchmarks the result, and opens a pull request. One command: /ai-architect:run-pipeline.

It is not magic. The pipeline is 11 sequential stages (0-10), each with defined inputs, outputs, and failure modes. When a stage fails, it retries with the failure context — up to 3 times — before moving on. 64 deterministic HOR rules across 10 categories (structural, security, resilience, observability, etc.) enforce output quality. Zero LLM-judges-LLM: verification is algorithmic, not prompted.
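
The stage contract above (defined inputs and outputs, retry with the failure context, up to 3 attempts) can be sketched in a few lines. This is an illustrative Python sketch, not the pipeline's source; the exception names and the shape of the failure context are hypothetical:

```python
class StageFailure(Exception):
    """Raised by a stage when its output fails validation."""

class PipelineAbort(Exception):
    """Raised when a stage exhausts its retry budget."""

def run_stage_with_retry(stage, payload, max_attempts=3):
    """Run one stage; on failure, retry with the failure context attached."""
    failure_context = None
    for attempt in range(1, max_attempts + 1):
        try:
            return stage(payload, failure_context=failure_context)
        except StageFailure as exc:
            # The next attempt sees exactly what went wrong last time.
            failure_context = {"attempt": attempt, "error": str(exc)}
    raise PipelineAbort(f"gave up after {max_attempts} attempts: {failure_context}")
```

The key design point is that retries are not blind re-runs: each attempt receives the previous failure as input.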

Built on three MCP servers: ai-architect (49 orchestration tools), ai-codebase-intelligence (17 typed graph tools with tree-sitter parsing for 13 languages), and cortex (optional persistent memory). 544 tests passing. Every algorithm traces to a published paper.

It works with Python, TypeScript, Swift, Go, Rust, Java, Kotlin, C#, Ruby, PHP, C, C++, and JavaScript — through configuration, not code changes. Dogfoods itself: this pipeline shipped its own refactoring PRs.

Real example — verified end-to-end

Not a demo. An actual PR shipped through the pipeline: compound score 0.778, HOR 62/64, 276 tests with 0 failures, ready to merge. Every stage is traced with the exact tool and result.

[Screenshot: PR #31 summary, generated by the ai-architect MCP pipeline — compound score 0.778, HOR 62/64 passed, blast radius LOW, and the full pipeline trace table for stages 0-10 with their tools and results. 276 tests, 0 failures. Ready to merge.]
PR #31 — ANTHROPIC-2026-04-07-STATSCLAW. Adopts the StatsClaw information-barrier methodology (arxiv.org/abs/2604.04871v1) so reviewer contexts cannot receive implementation guidance — enforced structurally by the Swift type system. 3 new files, zero modifications to existing source, backward compatible.
See all pipeline PRs on GitHub →

Get Started in 2 Commands

Install, run. The pipeline handles the rest.

1 — Install the plugin

# In Claude Code:
> /plugin marketplace add cdeust/ai-architect-mcp
> /plugin install ai-architect

2 — Start the pipeline

# In Claude Code:
> /ai-architect:run-pipeline

The pipeline discovers findings, analyzes impact, generates a PRD, implements code, verifies the result, and delivers a pull request.

11 stages, zero LLM-judges-LLM

Each stage runs autonomously with defined inputs and outputs. When a stage fails, the pipeline retries with the failure context. Every verification is algorithmic — no LLM evaluates another LLM's output.

Preflight
00 Health Check
Analysis
01 Discovery
02 Impact & Compound Score
03 Integration Design
Specification
04 PRD Generation
4.5 Plan Interview (10-dim)
05 Review (7 verifiers)
Implementation
06 Implementation
07 Verification (64 HOR)
Delivery
08 Benchmark
09 Deployment
10 Pull Request + Cortex

Stage 0 — Health Check

Validates all 49 orchestration tools and 17 codebase intelligence tools are reachable. Verifies MCP server connectivity, tree-sitter parsers, and Cortex availability (if configured). Fails fast if the environment is misconfigured.

Stage 1 — Discovery

Queries the codebase via the typed graph engine (tree-sitter for 13 languages). Runs churn analysis (Nagappan & Ball 2005) and cochange coupling detection (Gall 1998, Zimmermann 2005). Recalls prior findings from Cortex — agents don't start from zero.
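
Churn analysis in the Nagappan & Ball sense is conceptually simple: sum added plus deleted lines per file over history, because high-churn files correlate with defect density. A minimal sketch that parses `git log --numstat` output (the function name is illustrative; binary files, which numstat reports as `-`, are skipped):

```python
from collections import Counter

def churn_from_numstat(numstat_output: str) -> Counter:
    """Sum added + deleted lines per file from `git log --numstat` text."""
    churn = Counter()
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        # numstat rows are "<added>\t<deleted>\t<path>"; binary files use "-".
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            added, deleted, path = parts
            churn[path] += int(added) + int(deleted)
    return churn
```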

Stage 2 — Impact & Compound Score

Computes compound score across four dimensions: relevance × uniqueness × impact × confidence. Adds ownership ratio (Bird et al. 2011) and bus factor risk (Avelino et al. 2016) to flag single-maintainer danger zones. Findings below threshold are filtered.
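
For intuition, here is one way the compound score could combine its four dimensions. The source gives the factors (relevance × uniqueness × impact × confidence) but not the combination rule; this sketch assumes a geometric mean so the result stays on the same 0-1 scale as its inputs, and the 0.6 threshold is likewise an assumption:

```python
def compound_score(relevance, uniqueness, impact, confidence):
    """Combine four 0-1 dimensions; the geometric mean (assumed here)
    keeps the result on the same 0-1 scale."""
    return (relevance * uniqueness * impact * confidence) ** 0.25

def filter_findings(findings, threshold=0.6):
    """Drop findings whose compound score falls below the cut (hypothetical)."""
    return [f for f in findings
            if compound_score(**f["scores"]) >= threshold]
```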

Stage 3 — Integration Design

Designs architectural modifications using cochange coupling (hidden dependencies), dead code detection (Grove 1997, Tip 1999, CHA-based reachability), and port design patterns. Enforces parameterization, centralization, composability, backward compatibility.
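
CHA-style reachability reduces to graph traversal: build a conservative call graph, walk it from the entry points, and flag whatever is never reached. A minimal sketch (the real analysis works over a typed graph, not plain dicts; the function names are illustrative):

```python
from collections import deque

def unreachable(call_graph, entry_points):
    """call_graph: {function: [callees]}. Anything not reachable from an
    entry point is a dead-code candidate."""
    seen = set()
    queue = deque(entry_points)
    while queue:
        fn = queue.popleft()
        if fn in seen:
            continue
        seen.add(fn)
        queue.extend(call_graph.get(fn, []))
    return set(call_graph) - seen
```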

Stage 4 — PRD Generation

Delegates to the AI PRD Generator to produce a 9-file PRD package: overview, requirements, technical spec, user stories, acceptance criteria, roadmap, JIRA tickets, test cases, and verification report.

Stage 4.5 — Plan Interview Gate

10-dimension deterministic quality gate — zero LLM calls. Scores the PRD across completeness, testability, feasibility, clarity, consistency, non-goals, dependencies, risk awareness, acceptance criteria quality, and measurability. Must pass all 10 to proceed.
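
The shape of such a gate is easy to pin down: a fixed list of dimensions, a deterministic score per dimension, and a hard pass requirement on every one. A sketch with a hypothetical 0.7 cut-off (the source specifies the ten dimensions but not the threshold):

```python
DIMENSIONS = [
    "completeness", "testability", "feasibility", "clarity", "consistency",
    "non_goals", "dependencies", "risk_awareness",
    "acceptance_criteria_quality", "measurability",
]

def plan_interview_gate(scores, threshold=0.7):
    """All 10 dimensions must clear the threshold. Returns (passed, failures).
    A missing dimension counts as 0 and therefore fails."""
    failures = [d for d in DIMENSIONS if scores.get(d, 0.0) < threshold]
    return (not failures, failures)
```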

Stage 5 — Review (7 Verification Algorithms)

Runs 5 verification algorithms + 2 prompting algorithms in parallel: Chain of Verification, KS Adaptive Stability, Multi-Agent Debate, NLI Entailment, Zero-LLM Graph Verification, plus TRM Self-Refine and Metacognitive Monitor. Loops on score < 0.85.
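
The review loop's control flow can be sketched as: run the verifiers, combine their scores, refine and repeat while the combined score stays below 0.85. Only the 0.85 threshold comes from the source; averaging the scores, the bounded round count, and the refinement hook are assumptions here:

```python
def review_until_stable(artifact, verifiers, refine, min_score=0.85, max_rounds=5):
    """Run all verifiers, average their scores, and loop through
    refinement until the combined score clears the threshold."""
    for _ in range(max_rounds):
        scores = [verify(artifact) for verify in verifiers]
        combined = sum(scores) / len(scores)
        if combined >= min_score:
            return artifact, combined
        # Hand the per-verifier scores to the refinement step.
        artifact = refine(artifact, scores)
    raise RuntimeError("review did not converge")
```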

Stage 6 — Implementation

Creates a feature branch. One worker per file, dependency-ordered — based on the cochange graph from Stage 1. Builds the project, runs tests after each file. Uses Adaptive Expansion (ToT/GoT) and Signal-Aware Thought Buffer for complex implementations.
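
Dependency-ordered scheduling is a topological sort over the file graph. A minimal sketch using Python's standard library (the file names are illustrative):

```python
from graphlib import TopologicalSorter

def implementation_order(dependencies):
    """dependencies: {file: {files it depends on}}. One worker handles one
    file, in an order that respects the dependency graph."""
    return list(TopologicalSorter(dependencies).static_order())
```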

Stage 7 — Verification (64 HOR Rules)

64 deterministic Hard Output Rules across 10 categories: structural, security, resilience, observability, testing, naming, complexity, dependencies, documentation, concurrency. Plus a build gate. No LLM involved — pure deterministic validation. Loops on failure.
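
A deterministic rule set like this is just a table of pure predicates: category, name, check. The three sample rules below are hypothetical stand-ins for illustration, not the pipeline's actual 64:

```python
import re

# Hypothetical sample rules; each is a pure predicate over source text.
HOR_RULES = [
    ("security", "no-hardcoded-secret",
     lambda src: not re.search(r"(api[_-]?key|password)\s*=\s*['\"]", src, re.I)),
    ("structural", "file-under-500-lines",
     lambda src: len(src.splitlines()) <= 500),
    ("documentation", "has-doc-comment",
     lambda src: '"""' in src or "///" in src),
]

def check_hor(source: str):
    """Run every rule; return the (category, name) of each failure.
    No model is involved: a rule either holds or it does not."""
    return [(cat, name) for cat, name, rule in HOR_RULES if not rule(source)]
```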

Stage 8 — Benchmark

Runs performance gates extracted from prd-tests.md. Compares against baselines defined in the PRD. Blocks the pipeline if the PRD specified performance requirements and they fail.
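
A benchmark gate of this kind compares measured metrics against the PRD baselines and blocks on regression. The 10% tolerance and the lower-is-better assumption in this sketch are illustrative, not from the source:

```python
def benchmark_gate(measured, baselines, tolerance=0.10):
    """Fail any metric that regresses more than `tolerance` past its PRD
    baseline. Metrics are assumed lower-is-better (latency, memory)."""
    regressions = {}
    for metric, baseline in baselines.items():
        value = measured.get(metric)
        if value is not None and value > baseline * (1 + tolerance):
            regressions[metric] = (value, baseline)
    return regressions  # empty dict means the gate passes
```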

Stage 9 — Deployment

Runs the full test suite. Zero failures required — 544/544 passing on the reference codebase. Validates migrations, configuration, and build artifacts in an isolated environment.

Stage 10 — Pull Request + Cortex

Pushes the branch, opens a PR with structured description (compound score, HOR rule results, verification scores, benchmark deltas, retry history). Saves lessons to Cortex — next session's pipeline starts smarter than this one.
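
The recall-before, save-after contract needs very little machinery. A file-backed stand-in for Cortex (the class and storage format are hypothetical, for illustration only):

```python
import json
import pathlib

class CortexStore:
    """Minimal file-backed stand-in for Cortex: recall before a run,
    save lessons after it finishes."""

    def __init__(self, path):
        self.path = pathlib.Path(path)

    def recall(self, topic):
        """Return all prior memories for a topic; empty on first run."""
        if not self.path.exists():
            return []
        return [m for m in json.loads(self.path.read_text()) if m["topic"] == topic]

    def save(self, topic, lesson):
        """Append one lesson so the next session starts smarter."""
        memories = json.loads(self.path.read_text()) if self.path.exists() else []
        memories.append({"topic": topic, "lesson": lesson})
        self.path.write_text(json.dumps(memories))
```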

Why this architecture matters

Stages, retries, and external validation.

Zero LLM-judges-LLM

Most AI coding pipelines use one LLM to evaluate another LLM's output — which just propagates biases. This pipeline verifies with deterministic algorithms: 64 HOR rules, tree-sitter parsers, graph-based reachability, NLI entailment checks, and KS statistical stability. Algorithms don't hallucinate.

Every claim traces to a paper

Churn (Nagappan & Ball 2005). Ownership (Bird et al. 2011). Bus factor (Avelino et al. 2016). Cochange coupling (Gall 1998, Zimmermann 2005). Dead code (Grove 1997, Tip 1999). No hand-wavy heuristics — every decision cites peer-reviewed research.

Cortex memory compounds

Every pipeline run recalls prior findings from Cortex before starting and saves lessons after finishing. Run 1: the pipeline explores. Run 10: it knows your codebase. Run 100: it anticipates. 544 tests passing on the reference codebase — and the pipeline shipped most of those tests itself.

Capability | Most AI Agents | AI Architect Pipeline
Task Decomposition | Prompt → Output | 11 sequential stages, one worker per file, dependency-ordered
Codebase Understanding | Scan all files | Tree-sitter typed graph (13 languages), BM25 + RRF hybrid search
Verification | LLM judges LLM | 5 algorithms: CoV, KS Adaptive, Multi-Agent Debate, NLI, Zero-LLM Graph
Quality Gates | Prompt templates | 64 deterministic HOR rules across 10 categories
Git Analytics | None | 5 science-backed: churn, ownership, bus factor, cochange, dead code
Memory | Stateless | Cortex integration — recall before, remember after
Retry Logic | Fail or hang | Stage-level retry with failure context, max 3 attempts
PRD Integration | Generate + hope | Delegated to AI PRD Generator with 10-dim interview gate
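
The codebase-understanding row mentions BM25 + RRF hybrid search. Reciprocal Rank Fusion merges several rankings by scoring each document as the sum of 1/(k + rank) across rankers; a minimal sketch with the conventional k = 60 (the function name is illustrative):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over rankers of 1/(k + rank).
    Documents high in any ranking float to the top of the fused list."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```
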
Read the paper → arxiv.org/abs/2602.11865

This is not CI/CD

CI/CD runs after code is written. This pipeline writes the code.

CI/CD triggers when a developer pushes code: build, test, deploy. It assumes a human wrote the code. This pipeline operates upstream — it takes a finding and produces the code, tests, and pull request that your CI/CD system then validates. It does not replace your build system. It feeds it.

Aspect | Traditional CI/CD | AI Architect Pipeline
Trigger | Code push by developer | Finding: bug, feature, advisory
Input | Source code | Unstructured requirements
Output | Build artifact, deploy | Pull request with code, tests, docs
Code authorship | Human developer | Autonomous AI agent
Verification | Test suite only | Tests + semantic verification + benchmarks
Retry strategy | Re-run same build | Re-implement with failure context
Scope | Build → Deploy | Finding → Pull Request

It generates the commits. Your CI/CD validates and deploys them.

Works with your stack

Adapts to your project through configuration, not code changes.

Python
TypeScript
JavaScript
Go
Rust
Java
Kotlin
Swift
C / C++
Ruby
PHP
And more

The pipeline reads your project's configuration files, test framework, build system, and directory structure. It runs your test suite natively and respects your linter settings. Django backend, React frontend, Go microservice, Rust CLI, Swift iOS app — same pipeline, different configuration.

Get Started on GitHub