Franz Paul

Franz Paul https://fpaul.dev/ AI Engineer, IT Consultant. Building AI agents that run in production. en Mon, 13 Apr 2026 00:00:00 GMT <![CDATA[Your CLAUDE.md Is an Agent Harness]]> https://fpaul.dev/writing/claude-code-agent-harness/ Mon, 13 Apr 2026 00:00:00 GMT https://fpaul.dev/writing/claude-code-agent-harness/ claude-code agents architecture workflow <![CDATA[Karpathy's LLM Wiki and the Problem Nobody Solved for 80 Years]]> https://fpaul.dev/writing/karpathy-llm-wiki-80-year-problem/ Mon, 13 Apr 2026 00:00:00 GMT https://fpaul.dev/writing/karpathy-llm-wiki-80-year-problem/ knowledge-management llm obsidian workflow **Minimum Viable Wiki** > A markdown folder. A schema file. An LLM agent. > 1. Create `wiki/` with subdirectories: `sources/`, `entities/`, `concepts/`, `synthesis/` > 2. Write a schema that defines page types, frontmatter, and operations > 3. Drop a source. Tell the agent to ingest it. Watch 5-15 pages appear. > > Obsidian is optional but useful — graph view shows your wiki's shape. qmd handles search when the index outgrows the context window. ## The Desk That Organized Itself That search result I found — the one I never wrote — is still in my wiki. I've read it three times since. Corrected two claims, added a paragraph, deleted a sentence that was confident about something the source was ambiguous about. The organizational labor was done for me. The compression labor was still mine. The page is better for it — and so is my understanding of the topic. Eighty years after Bush imagined a machine that would maintain associative trails through human knowledge, we have something close. It runs on [plain text and CLI tools](/writing/cli-beats-mcp), not microfilm. The maintenance problem isn't solved — it's repriced. And repricing changes everything, right up until you discover what the new price actually buys you. Whether that's knowledge or just a very well-organized pile of text is a question you'll have to answer by closing the wiki and seeing what you remember. ]]> <![CDATA[Project Beat: How We Replaced 12 Browser Tabs With One Dashboard]]> https://fpaul.dev/writing/project-beat-ai-matching/ Mon, 13 Apr 2026 00:00:00 GMT https://fpaul.dev/writing/project-beat-ai-matching/ project-beat matching freelancing product <![CDATA[Hydra: A Multi-Model Code Review Council]]> https://fpaul.dev/writing/hydra-multi-model-code-review/ Wed, 01 Apr 2026 00:00:00 GMT https://fpaul.dev/writing/hydra-multi-model-code-review/ claude-code code-review multi-agent multi-model skills <![CDATA[Claude Code vs. Codex in 2026: What the Benchmarks Miss]]> https://fpaul.dev/writing/claude-code-vs-codex-2026/ Tue, 31 Mar 2026 00:00:00 GMT https://fpaul.dev/writing/claude-code-vs-codex-2026/ claude-code codex comparison ai-tools codex-plugin hydra **Editorial note:** The numbers below — SWE-bench scores, Codex launch metrics, developer-survey results — are from late March 2026, the week the plugin landed. Treat them as a snapshot. Check the current state on each tool's repo before making decisions from a six-week-old comparison. ## I Stopped Comparing Last Tuesday Last Tuesday I ran `/codex:adversarial-review` against code Claude had generated twenty minutes earlier. Codex flagged a concurrency bug that Claude had missed — a reordering issue that would surface only under real load. Both tools were running in the same terminal. That's when the comparison frame — Claude *or* Codex — stopped making sense to me. I didn't plan for this moment to feel significant. I had been refactoring a module, Claude had just generated the logic, and I fired off the review command out of habit. The point isn't which tool was "better." The point was that the two of them in sequence produced the correct result, and neither alone would have. This is what happened: on March 30th, OpenAI published `codex-plugin-cc` — an official, Apache-2.0-licensed plugin that integrates Codex directly into Claude Code. It hit 4,478 GitHub stars in 24 hours. 180 forks. 52 issues filed before most people had finished their morning coffee. That velocity doesn't come from hype. That velocity means developers were *waiting* for this. The question used to be "Claude Code or Codex?" I spent weeks on that question. I had spreadsheets. I had opinions. I had draft versions of this very article that read like a product comparison matrix. All of that is irrelevant now. The question in April 2026 is: how do they work together? ## The Numbers Without Context Before we get to the interesting part, here are the benchmarks everyone keeps citing: | Benchmark | Claude Code | Codex | |-----------|------------|-------| | SWE-bench | 80.9% | — | | Terminal-Bench 2.0 | 65.4% | 77.3% | | Architecture (blind test) | 67% preferred | 33% preferred | | Developer satisfaction | 46% (#1) | — | These numbers are real. They come from reputable evaluations. They're also almost entirely useless for deciding what to use on a Monday morning. SWE-bench measures isolated bug fixes in open-source repositories. The task is always the same shape: here's a failing test, here's a codebase you've never seen, fix it. That's a real skill. It's also not what I do most days. I'm not parachuting into unfamiliar repos and fixing one bug. I'm working in the same codebase for months, building features that span multiple files, explaining business context that doesn't exist in any test suite. Terminal-Bench measures scripting and CLI tasks. Codex wins there — 77.3% vs 65.4%. That's meaningful if you write a lot of bash. It's noise if you don't. The architecture blind test is more interesting. Developers evaluated code structure without knowing which AI produced it. Claude was preferred 2:1. That tracks with my experience, but it's still a synthetic evaluation. Nobody architects a system by asking a model once and shipping the result. The satisfaction numbers tell the realest story. Claude Code leads at 46%, followed by Cursor at 19% and Copilot at 9%. Satisfaction correlates with depth of use, not with benchmark performance. The developers who use these tools the hardest prefer Claude Code. That's worth more than any leaderboard. But none of these numbers capture what happened in my terminal last Tuesday. Benchmarks test what's easy to measure. The interesting stuff is hard to measure. ## Where Claude Code Wins: The Architect Claude Code is the tool I reach for when I don't fully know what I'm building yet. Extended thinking, adaptive thinking, interleaved thinking — these aren't marketing copy. They're modes that let Claude spend real compute on architectural decisions before writing a single line of code. When I ask Claude to refactor a module, it doesn't just rearrange functions. It asks why the module exists, whether the abstraction boundary is in the right place, what the downstream effects of a restructure would be. Sometimes it pushes back on the refactor entirely. That 67% blind preference for architecture? I believe it. When the task requires structural judgment — "should this be a separate service?" or "is this abstraction pulling its weight?" — Claude produces better first drafts than anything else I've used. Not perfect drafts. Better ones. Then there's the ecosystem. Hooks, skills, memory, MCP — Claude Code isn't a chat window that happens to run in a terminal. It's a configurable workflow system. My `CLAUDE.md` defines project context, conventions, and constraints. Skills enforce engineering patterns — [I've written about the best ones](/writing/claude-code-essential-skills). [Hooks automate quality gates](/writing/claude-code-hooks-guide). [MCP servers](/writing/mcp-servers-for-developers) connect Claude to GitHub, databases, browsers. Memory persists what Claude learns across sessions so I don't repeat myself. No competing tool has this depth of customization. Codex is configurable, sure. You can set instruction files, define approval policies. But Claude Code's system is compositional — hooks trigger skills, skills reference memory, memory shapes future interactions. The pieces compound. The mental model: Claude Code is the senior engineer who reads the entire PR description, checks the related issues, reviews the test plan, and then asks three clarifying questions before writing any code. ## Where Codex Wins: The Operator Codex is the tool I reach for when I know exactly what I want. Terminal-Bench tells a real story here, not because of the absolute numbers, but because of what they reflect. Codex CLI is written in Rust. It starts fast. It uses 3-4x fewer tokens than Claude Code for equivalent tasks. When I need a bash script to rename 200 files, a quick database migration, or a regex transformation across a directory, Codex doesn't deliberate. It does the thing. Twelve seconds, done, next task. The cloud sandbox model changes the dynamics further. Codex can run tasks in isolated environments — fire and forget. You kick off a job, keep working on something else, come back to the result. Claude Code is conversational by design. It wants to be in dialogue with you, which is powerful for complex work and annoying for simple work. Codex can be asynchronous. That matters more than most comparisons acknowledge. And then there's the licensing. Codex is open source. Apache 2.0, 67K GitHub stars, 400+ contributors. If something doesn't work how you want, you read the source. You fork it. You change it. You submit a PR. I've seen the community add features faster than OpenAI's own roadmap. Claude Code is not open source. For some developers and some teams, that's a hard blocker. For others, it's irrelevant. But it's a real difference. The mental model: Codex is the pragmatic colleague who doesn't need the full context. You say "write me a migration that adds a `last_login` column with a default of `now()`" and they do it in the time it takes you to finish the sentence. No follow-up questions. No architectural concerns. Just the migration, correct and ready. Both mental models are useful. Neither is sufficient alone. ## OpenAI Built a Plugin for Their Competitor This is the part that made me delete my comparison spreadsheet. On March 30, 2026, OpenAI published `codex-plugin-cc`. An official, first-party, Apache-2.0-licensed plugin that integrates Codex directly into Claude Code. Read that sentence again. OpenAI — the company that builds ChatGPT, the company that wants to win the AI race — built a first-class integration for Anthropic's developer tool. That's like Chrome shipping a built-in Safari tab. That's like Ford putting a Honda engine mount in the F-150. The numbers from launch week: 4,478 stars in 24 hours. 180 forks. 52 issues filed, including Windows compatibility — because it's that new and people want it that badly. What the plugin actually does: **`/codex:review`** — Standard code review. Codex reads your recent changes in read-only mode and gives structured feedback. It's a second pair of eyes, from a model trained differently than the one that wrote the code. **`/codex:adversarial-review`** — This is the one that caught my race condition. It's not asking "is this code clean?" It's asking "how will this code fail?" Authentication bypasses, data loss scenarios, race conditions, missing rollback logic. It's a skeptical reviewer by design. It assumes your code has problems and tries to find them. **`/codex:rescue`** — Full delegation. Hand a task to Codex entirely: bug investigation, targeted fixes, library research. Codex runs in the background while Claude keeps your main session going. I use this when I hit a wall with an unfamiliar API or a weird environment-specific bug. Instead of breaking Claude's context with a tangent, I send Codex to investigate. The adversarial review is genuinely valuable, and the reason is subtle. It catches a different *class* of issues than Claude's own review, not because one model is smarter, but because they have different training biases. Different blind spots. Different patterns they over-index on. Cross-model review is code review by two engineers who went to different schools, worked at different companies, and worry about different failure modes. That's exactly what you want in a reviewer. So why did OpenAI do this? The strategic logic is clean. The $4 billion AI coding market has three leaders: Copilot, Claude Code, and Cursor. Claude Code holds 46% developer satisfaction. If you're OpenAI and you can't beat that number by shipping your own standalone product, your next best move is to be *inside* the winning product. Codex everywhere — not just in ChatGPT, not just in VS Code, but in every terminal where developers actually work. Even the competitor's terminal. *Especially* the competitor's terminal. It's a pragmatic move dressed up as an open-source contribution. I respect it. ## How to Actually Use Both Setup takes about two minutes if you already have both tools installed: ```bash # In Claude Code /plugin marketplace add openai/codex-plugin-cc /plugin install codex@openai-codex /reload-plugins /codex:setup ``` You'll need the Codex CLI installed (`npm install -g @openai/codex`) and either a ChatGPT subscription or an OpenAI API key. Codex usage counts against your existing token limits — there's no separate billing. Here's the workflow I've settled on after a week of daily use: **Claude Code drives.** Architecture decisions, feature implementation, complex refactoring — Claude does the heavy lifting. It has the context. It has my `CLAUDE.md`, my skills, my project memory. It knows what I built yesterday and why. For work that requires judgment, Claude leads. **Codex reviews.** After completing a significant chunk of work — a new feature, a refactored module, anything touching auth or data — I run `/codex:adversarial-review`. The cross-model perspective catches issues Claude is blind to. Not because Claude is worse at security analysis, but because every model has systematic gaps. Two models with different gaps > one model with any gap. **Codex rescues.** When I hit a wall — a weird library behavior, an unfamiliar protocol, a bug that doesn't reproduce in my mental model — `/codex:rescue` sends Codex to investigate without breaking Claude's working context. I keep building while Codex researches. When it comes back with findings, I fold them into Claude's session. Parallel workflows instead of serial context-switching. There's an optional feature worth mentioning: the Review Gate. Enable it during `/codex:setup` and every time Claude's session ends, Codex automatically reviews the work and flags issues before you move on. It's a CI check that runs in your terminal, powered by a model that didn't write the code it's checking. I've left it on for a week. It's caught two real issues and produced zero false positives that annoyed me enough to turn it off. ## When a Single Review Isn't Enough The `/codex:review` workflow is one model reviewing another's work. It's good. But what if you could have ten agents — from two different model families — independently analyze the same code, then cross-examine each other's findings? That's what I built with Hydra. Hydra is a skill (adapted from Andrej Karpathy's "LLM Council" idea) that spawns a full review pipeline: six advisors analyze in parallel, three peer reviewers cross-examine their findings, and one chairman synthesizes a final verdict. The key design choice: four advisors run on Claude Opus, two run on Codex GPT-5.4. All three peer reviewers run on Opus, as does the chairman. Each advisor has a single lens and a declared blind spot: | Advisor | Model | Question | |---------|-------|----------| | Cassandra | Opus | "How does this break at 3 AM?" | | Mies | Opus | "What can be deleted?" | | Navigator | Opus | "What depends on what?" | | Volta | Opus | "What does this cost at 10x load?" | | The Stranger | Codex | "Can someone understand this in 15 minutes?" | | Sentinel | Codex | "How do I break this on purpose?" | Cassandra doesn't comment on performance. Volta doesn't comment on readability. The lane discipline prevents the classic AI review failure where every agent says the same thing in slightly different words. The cross-model part is where it gets interesting. When Opus and Codex *agree* independently, that signal is stronger than same-model agreement — different training data, different biases, same conclusion. When they *disagree*, that's the highest-value signal of all. The chairman surfaces those divergences prominently and is required to issue a ruling. No "it depends" allowed. Running `hydra this --mode deep` on a PR takes roughly one to two minutes and costs $1.50–$2.50 for the full ten-agent pipeline. Standard mode (four agents, ~$0.25–$0.50) covers most reviews and runs in well under a minute. If Codex isn't available, it falls back to Opus-only automatically. Reports save to `.hydra/reports/` with every finding tagged `[CORROBORATED]`, `[CONTRADICTED]`, or `[UNCORROBORATED]` by the Cross-Examiner reviewer. Is it overkill for every commit? Obviously. But for anything touching auth, payments, or data migrations — the cases where bugs cost real money — ten adversarial agents catching what a single reviewer misses is cheap insurance. The mental model for the full setup: Claude Code is the architect. Codex is the building inspector. Hydra is the committee of inspectors who disagree with each other on purpose, then produce a consensus report. You don't always need the committee. But when you do, nothing else gives you this. ## Stop Choosing The AI coding market is consolidating. Copilot, Claude Code, and Cursor control roughly 70% of it. The remaining 30% is fragmented across a dozen tools that each do one thing well. The instinct is to pick a winner and commit. That instinct is wrong. The winning strategy isn't choosing the best tool. It's knowing which tool to reach for at which moment. Claude Code when you need to think. Codex when you need to verify. Claude Code for the architecture. Codex for the inspection. The `codex-plugin-cc` collapses this from a two-window workflow into a single session. The best coding setup in 2026 isn't Claude Code or Codex. It's Claude Code *and* Codex — and OpenAI just made it official. ]]> <![CDATA[The Most Useful Claude Code Skills You Should Install]]> https://fpaul.dev/writing/claude-code-essential-skills/ Sat, 28 Mar 2026 00:00:00 GMT https://fpaul.dev/writing/claude-code-essential-skills/ claude-code skills developer-tools <![CDATA[CLI > MCP: When a Bash Command Beats a Protocol Server]]> https://fpaul.dev/writing/cli-beats-mcp/ Fri, 27 Mar 2026 00:00:00 GMT https://fpaul.dev/writing/cli-beats-mcp/ cli mcp developer-tools opinion unix <![CDATA[Claude Code Hooks: Automating Your AI Workflow]]> https://fpaul.dev/writing/claude-code-hooks-guide/ Wed, 25 Mar 2026 00:00:00 GMT https://fpaul.dev/writing/claude-code-hooks-guide/ claude-code hooks automation /dev/null || true" } ] } } ``` ### Memory Reminder on Session End Prompt Claude to save important learnings before the conversation ends: ```json { "hooks": { "Stop": [ { "command": "node ./hooks/memory-reminder.js" } ] } } ``` ### Context Injection on Session Start Load project-specific context at the beginning of every session: ```json { "hooks": { "SessionStart": [ { "command": "node ./hooks/load-context.js" } ] } } ``` ## Hook Security Hooks run with your user permissions. Treat them like any other automation — review what they do, especially if they modify files or make network requests. The matcher field supports regex patterns, so you can target specific tools or file patterns precisely. ## Conclusion Hooks bridge the gap between Claude's AI capabilities and your project's specific requirements. They're the mechanism that turns a generic assistant into a customized development environment. ]]> <![CDATA[Skills Are the New Dotfiles]]> https://fpaul.dev/writing/skills-are-the-new-dotfiles/ Tue, 24 Mar 2026 00:00:00 GMT https://fpaul.dev/writing/skills-are-the-new-dotfiles/ claude-code skills developer-tools workflow <![CDATA[MCP Servers That Actually Improve Your Dev Workflow]]> https://fpaul.dev/writing/mcp-servers-for-developers/ Fri, 20 Mar 2026 00:00:00 GMT https://fpaul.dev/writing/mcp-servers-for-developers/ mcp claude-code integrations **Editorial update (April 2026):** My position on MCP has evolved since I wrote this. After three more months of daily use, I removed the GitHub MCP server and replaced it with `gh` CLI — the context cost wasn't worth the marginal integration benefit. The argument is in [CLI > MCP](/writing/cli-beats-mcp). Supabase, Playwright, Context7, and Figma MCP remain part of my daily workflow. Read this post as the "why start with MCP" case and the CLI piece as "where I ended up." ## What Is MCP? The Model Context Protocol (MCP) lets AI assistants interact with external tools and services through a standardized interface. Instead of copying data between tools manually, MCP servers expose capabilities that Claude can call directly — an interface layer between Claude and your development ecosystem. ## GitHub MCP Server The most immediately useful MCP server for developers. It gives Claude direct access to: - Create and manage pull requests - Read and comment on issues - Search code across repositories - View PR review status and checks Instead of copying PR URLs and describing changes, you can say "review the open PR on this repo" and Claude reads it directly. ## Playwright MCP Server Browser automation for verification. After building a feature, Claude can: - Navigate to your dev server - Take screenshots - Check for console errors - Verify UI elements render correctly This closes the feedback loop — build, verify, iterate — without leaving the terminal. ## Supabase MCP Server Direct database interaction for projects using Supabase: - Run SQL queries - Apply migrations - Generate TypeScript types - Check advisor recommendations Particularly useful during development when you're iterating on schema changes. ## Context7 Documentation Server Fetches current library documentation on demand. Even for libraries Claude knows well, the docs may have changed since training. Context7 ensures Claude works with the latest API surface. ## Setting Up MCP Servers MCP servers are configured in `.claude/settings.json`: ```json { "mcpServers": { "github": { "command": "npx", "args": ["-y", "@modelcontextprotocol/server-github"], "env": { "GITHUB_PERSONAL_ACCESS_TOKEN": "your-token" } } } } ``` ## Conclusion MCP servers extend Claude's reach beyond the filesystem. Start with GitHub and Playwright — they cover the most common workflows. Add others as your needs grow. ]]> <![CDATA[Why I Write Specs Before Code]]> https://fpaul.dev/writing/spec-driven-development/ Sun, 15 Mar 2026 00:00:00 GMT https://fpaul.dev/writing/spec-driven-development/ workflow architecture specs