Drew Breunig

Learnings from a No-Code Library: Keeping the Spec Driven Development Triangle in Sync

2026-03-04T07:06:00-08:00

The following is a write up of a talk I delivered at MLOps Community’s “Coding Agents” conference, on March 3rd. There’s a video version of the talk available on YouTube.

I share what I learned building a no-code library, why spec-driven development is a feedback loop not a straight line, historical parallels for our current moment, and a PoC tool for keeping specs/tests/code in sync.

Finally, we consider what GitHub should look like in the era of coding agents.

I was invited here today to talk about a project I launched — a software library with no code — which got a lot of really interesting feedback. I’m going to tell you the whole story, how I got it wrong, explore a bit of historical context, then propose a path forward.

Last Fall, Opus 4.5 launched and surprised everybody with the quality of the code it was produced and the problems it could solve. Opus 4.5 was good enough that we started to ask some really big questions.

I wondered: if the agents are good enough, why do we need to share code?

Whenever I have a big question like this, one that requires lots of thought, I like to go for a long bike ride. So I did, and while I was riding I came up with the idea to ship a software library with no code.

And so we have whenwords.

Open source, freely licensed. It’s a GitHub repository with a markdown file describing what the library is supposed to do. It’s a library that takes a Unix timestamp and converts it into something human-readable — “about 12 o’clock,” “five hours ago,” things like that.

I also generated about 750 conformance tests in YAML: given this input, I expect this output. And one more file called install.md — a single paragraph you’d paste into the agent of your choice, with all the instructions for building the code. You’d drop in what language you wanted and where to save it.

whenwords kicked off a lot of conversation about spec-driven development. It’s something more and more people are thinking about: the idea that if you bring specs, which define the what, why, and sometimes how, and tests, which measure and validate behavior, the code will just flow from that. Give it to an agent, get code out.

whenwords kind of blew up. Karpathy was a fan. whenwords has over 1,000 stars on Github.

What was even crazier was that I started getting normal GitHub interactions. People submitted issues. They submitted pull requests. And the pull requests were good, things like: “In this test, you’re expecting this result, but that violates the rounding rule you detail in the spec. You need to true these up.”

But I wasn’t the only one with this idea. Larger teams started shipping larger projects. whenwords was a toy; small, constrained, 750 tests. But then:

Vercel released just-bash, a simulated bash environment with an in-memory virtual filesystem, written in TypeScript. Basically re-implementing Bash in TypeScript. They’re running shell scripts against it to verify behavior.

Pydantic released Monty, a Python interpreter written in Rust. Fast, safe, ideal for agent REPLs and code use. Same approach: a pile of Python tests, throw it at the model, make it pass.

Anthropic famously threw 16 Claudes and $20,000 at a spec suite to build a Rust-based C compiler. It didn’t really work. But it was pretty cool.

I couldn’t stop thinking about Spec Driven Development and how far we might push it.

I think there’s a few learnings from this first wave of Spec Driven Development.

Tests and specs aren’t free or easy. All the projects we surveyed used large existing testing libraries from existing projects: the Bash tests, the Python tests, the C tests. Those are the low-hanging fruit. I joked online (and I’m not the first) that pretty soon anyone who wants to protect themselves is going to be like SQLite, where the code is free but you’ve got to pay for the tests. Tests are precious.

Implementation is fast, but not instant. You go fast at first, but none of these projects are complete. just-bash is still being worked on. Monty is missing JSON and other standard libraries. Anthropic’s C compiler stalled out. It gets hard. It’s not perfect and it’s not free.

As complexity grows, structural choices become more important. This was especially clear in the Anthropic project. They got incredibly far, down to 1% of failing tests. But every time they fixed a new bug, it broke something else. Systemic changes required systemic thinking, not just local fixes.

Architectures that allow parallel development are incredibly valuable. What it allows you to do is move fast with multiple agents. And, this is something I haven’t seen explored yet, it allows for open source contribution. Rather than spending $20,000 to build a C compiler, what if you structured it so everyone knows what chunk they can work on? It’s like SETI@home, except I’m not using your engineering expertise. I’m using your Claude Code subscription. Which I think is wild.

But the biggest learning — and the one we’re going to spend the most time on today — is that sometimes the spec and tests aren’t sufficient.

One of my favorite things to do is look at the PRs and issues for all of these libraries. Even with a great spec — “make it run Python perfectly in Rust, here are all the tests, just make them pass” — there are still 20-comment threads about what the right way to implement something is.

Because no spec is perfect. And this is probably my biggest takeaway today:

Implementing the code helps us improve our spec.

Let’s take a digression. We’re at the Computer History Museum, so let’s go back into history. Specifically the history of code and managing code.

One of my favorite jokes about AI development is one I stole from Matt Levine, who writes the finance newsletter Money Stuff. In it, he has a running joke about crypto people speed-running financial history, from first principles, as they attempt to build new financial infrastructure. We are doing that with software engineering and AI coding.

I’m lucky: one of my co-founders, Heather Miller, is a professor at CMU and a programming languages expert. I can call her up, share my theories, and ask: “Heather, tell me this has already been dealt with. Who should I be talking to and what shoud I read?” This time, she said, “Of course it has, Drew,” and introduced me to her office neighbor, Professor Claire Le Goues. Claire then who walked me through the relevant software engineering history, which I’m going to share today because it is incredibly relevant to our current situation.

In 1963, Margaret Hamilton was writing and managing software effort for NASA’s Apollo missions. She coined the term “software engineering” because, running this giant, complicated project that couldn’t have errors in it, she realized: this is engineering. It’s systems design, we have to worry about errors, we have to worry about unexpected inputs like astronauts pushing the wrong button.

And also: we now have enough code that no one person can hold it in their head. Which is a problem, because then you can’t reason about it effectively. And it gets even worse when a team is working on it.

By the way: this is her code. This is what she was managing. This is her VS Code.

And this is her Git.

I’m a dad, which means dad jokes come naturally. So I’m going to retroactively coin Hamilton’s Law: when you can’t see over your code, you can’t oversee your code.

(Sorry.)

After Hamilton dealt with this problem, others realized it was a problem too.

NATO held a conference in Berlin and identified the “Software Crisis”: computer hardware now allowed programs so complex they couldn’t be managed adequately. A single engineer couldn’t hold all the code in their head. If they were going to continue delivering on what software could promise, they needed process.

Dijkstra popularized this in his 1972 Turing Award lecture. He said:

As long as there were no machines, programming was no problem at all. When we had a few weak computers, programming became a mild problem. And now we have gigantic computers, programming has become an equally gigantic problem.

He said this in 1972. Maybe later, walk around the museum we’re in and look at what he was working with then. Then consider what we’re working with today.

So after the Software Crisis emerged, we wandered through the desert of processes, searching for one to borrow. We looked at manufacturing engineering. In 1975, Brooks published The Mythical Man-Month. And finally, Waterfall was adopted as a DoD standard. We learned how to engineer complex software. Progress.

But these things move in cycles. In 2001, we published the Agile Manifesto. Zuckerberg said it’s time to move fast and break things. We embraced Agile, and Agile was finally realized by the cloud and GitHub — which enabled continuous CI/CD and let us offload enough of the error-checking that we didn’t break things too often, even when moving fast.

Which brings us to today.

I added this slide right at the last minute, because I logged into Twitter to check something and saw today’s trending news: “AI Coding Boosts Output But Overwhelms Human Reviews.” And it’s paired with that last headline: “OpenAI Codex Leaks Hint at GPT-5.4 Amid Speedy Updates.” So not only is it overwhelming us, it’s accelerating.

So what do we learn from this history rabbit hole?

Being overwhelmed by the volume of code isn’t a new problem. It’s what birthed software engineering.

The initial Software Crisis was our inability to manage complex codebases new computers allowed. Our current Software Crisis is our inability to manage complex codebases new models allow.

Our problem used to be that we couldn’t hold an entire codebase in our head. Now we can’t even read our entire codebase.

Agentic engineering enables waterfall volume at the cadence of agile. And even that undersells it: it’s waterfall times ~two at the cadence of agile times ~seven.

We keep oscillating, historically, between unhindered velocity and managed process. We could use some process right about now. Perhaps AI can help…

I’m not the only one asking this question.

For the last couple of quarters, people have been trying to figure out how to deal with this onslaught of code. The most dramatic example is Gas Town — you’re all familiar with it — an infrastructure for managing a coding process that grew beyond one person’s ability to manage.

But Gas Town just moves the problem. It doesn’t solve it. Steve Yegge even admits this in the Gas Town blog post:

Gas Town is complicated. Not because I wanted it to be, but because I had to keep adding components until it was a self-sustaining machine. And the parts that it now has, well, they look a lot like Kubernetes mated with Temporal and they had a very ugly baby together.

If the process is complex, we’re just moving the problem.

So let’s go back to what we defined spec-driven development to be. This idea that it’s an equation: bring specs, maybe add some tests, add an agent, get code out.

I got this wrong. This is the wrong way to think about it. Because this isn’t a one-way equation. It’s a feedback loop. The act of writing code improves the spec, and it improves the tests. Just like software doesn’t really work until it meets the real world, a spec doesn’t really work until it’s implemented.

So instead of an equation, I propose a triangle. The spec defines what tests need to be written, and what code needs to be written. Tests validate the code. That’s essentially what we had before, just in a different shape.

But the act of implementing code generates new decisions. Those decisions inform the spec. And when the spec updates, new tests need to be written. And sometimes it’s not new decisions — it’s just dependencies or subtle choices. New code surfaces new behaviors that need to be tested.

I call this: the Spec-Driven Development Triangle.

As each node moves forward, our job — and our tooling’s job — is to keep those nodes in sync. That’s the job. If we improve the code, we must improve the spec.

But keeping the nodes in sync is hard.

Writing tests is hard. Even before agents, we couldn’t write tests. We don’t like writing tests and we’d prefer not to.

Writing specs is hard. They can never be exhaustive, leave room for interpretation, and are written before the software meets the real world. The spec gets written, it gets implemented, it gets released. Is the spec updated? No.

Specs are written at a different cadence than code, in a different medium. If only we had something that could read natural language.

Updating specs and tests feels like overhead, especially when you’re moving fast. And the entire point of using agents is to move fast. Any system we design has to respect that.

Implementation is messy, and often humans and LLMs take shortcuts. Humans say “I’m not going to implement that right now” or “I’ll come back and fix this.” LLMs certainly do this.

And so regular reconciliation of tests, spec, and code is not part of the process.

But thankfully, there are signals we can work with.

Code changes are tracked by Git, and we can compare them against the spec to find gaps.

Test coverage tools tell us what code is tested — but not whether the tests reflect the spec. It’s not just about covering the code. The tests have to cover the spec.

Updates to the spec — if a product manager logs in and changes something — are also tracked by Git. Is the rest of the system changing with it?

Bug reports and hotfixes that go straight into code or tests need to be captured and rolled into the spec.

And most importantly: implementing the code with an agent generates decisions — from both the humans and the agent. Those decisions exist in the traces. We can look at the traces from our coding agents and find where decisions were made. That’s the signal we need to keep everything in sync.

So we have tangible things we can analyze. And a goal to aim towards…

One of the the nice thing about having a thought experiment during the era of great coding agents is that you can just try building it. And as you implement it, you improve it.

This is my tool. I call it Plumb, after a plumb bob, because it keeps things true. A plumb bob hangs from a line and helps a carpenter keep things straight. Even better, they used to be held on tripods, which echoes the triangle.

You can install it right now: pip install plumb-dev or uv add plumb.

It’s not perfect. It’s a proof of concept. A thought experiment as code. But I’ve been using it, and it’s pretty great.

Here’s how Plumb works.

Plumb is a command line tool. Every time you’re working with an agent and you run git commit, it identifies decisions made by evaluating the code diff from the last commit and by reading the agent traces (all the conversations since that last commit). It extracts the decisions, dedupes them, and presents them to you: here are all the decisions you made, do you agree?

Once you’ve approved, it updates the spec to reflect those decisions. It runs sync and reports on coverage gaps between the spec and the tests, and the spec-to-code coverage. Is the code actually reflecting what the spec defines?

As it does this, it generates files that become artifacts you can track. My favorite is a big JSONL file of decisions.

Here’s one example: “Should spec updates be batched across all decisions, or run individually for each decision?” My decision — batch them. It says it was made by the user, not the LLM. I have blame. And you can see how we can enrich this over time: tie it to code, to branch, to whether it was informed by the conversation, when it was approved, when it was synced. This is not just the code changes. It’s the intent.

To set up Plumb in your project: install it, go to your project directory, run plumb init. It’ll ask you to specify your spec markdown file or folder and show it where your tests are. It creates a .plumbignore to tell it when to skip decision generation — changing the README, for example, doesn’t need to generate decisions. It creates a .plumb folder to store state and config. Very similar to .git.

Most importantly: it adds hooks to Git. When you run git commit, it extracts the decisions. If there are decisions to review, the commit fails. It exits and tells you to review your decisions and approve, reject, or edit them. That’s what makes this work anywhere: command line, CI pipeline, inside your coding environment. It just works. And that’s a hard requirement.

The other thing Plumb generates is a breakdown of your spec into individual requirements — the atomic statements that make up what your spec defines. Ambiguous or not, what source file it came from, eventually linked directly to the code. Right now I use a commenting format to link tests back to the requirement they’re testing, so coverage mapping can show which requirements have tests and how many.

Our aim is link spec to requirements, requirements to code, requirements to tests, decisions to requirements. We’re building a new object graph extending off the code diffs. And eventually — edit the spec, the tests, or the code, pick your poison, and everything else gets brought along.

Now, as you design this, the interesting design choices start to emerge.

Can’t this just be a skill? There are already code review skills, superpowers, things like that. Why not just use those?

I don’t think it can be a skill. Whatever tool we end up using for tracking decisions and intent, it cannot live only inside the agent. It needs to run outside. It needs to handle small commits, triggers, anything…even if you never touch the agent.

A skill is a suggestion. A tool needs to be a checkpoint. That commit-fail mode is essential. Otherwise it gets ignored. We’ve all had this happen with Claude Code.

And the system needs to be canonical. It can’t be optional. Agents wander. Validation needs to be more deterministic. When we can use code, we will. This is a validation and verification step. Fuzzy LLM calls are a last resort.

When we do use LLMs — parsing the spec, extracting decisions — we use DSPy. It lets us structure LLM calls with tight inputs and outputs. It lets us optimize, test, and choose which models to route to. Speed matters enormously here. For decision deduplication, I’m routing to GPT, because it’s faster than anything Anthropic offers for that task. And the whole thing has to be simple enough for the developer to hold in their head.

Of course, there are real limitations.

Plumb only supports pytest. I want it to support any test framework and conformance tests, such as language-agnostic tests like whenwords used.

Decisions might interrupt your flow on long-running tasks. If I make a quick fix and generate five decisions, I have to sit through a review. That needs to be tunable. Maybe you don’t want it to bother you for lightweight decisions, only surface things that are vague or contradict previous decisions. I suspect this is something that will be dictated by the type of project you’re working on.

Deduplication isn’t perfect. Decision identification is fuzzy and will likely need to be project-specific.

Code reversals on decision rejection aren’t working yet. I’d like it so that when you reject a decision the LLM made, it goes back and undoes it. The reason it’s not implemented is that the flow needs to be right: if you reject from the command line, nothing automatic should happen. If you reject from inside the agent, the agent should act on it.

It needs better tools for managing the spec. Mine has grown long and probably should be sharded into sections. Thankfully, this is something an LLM can and should do. Though, we have to be careful when doing it. Perhaps we can perform dry runs, regenerate requirements from the shards, then confirm they match the original spec…

Plumb should be tunable for “just enough” structure. Can I run with --dangerously-approve-all-decisions? Sometimes I want to.

And it’s untested on large projects. Hell, it’s untested in general.

But here’s the fun part: I’ve been testing this by using it to manage the project itself. Using Plumb to build Plumb. And it’s been genuinely useful.

Claude can refer to the spec for implementation understanding without searching the entire codebase. The decision log has proven valuable for answering “why does this code exist?” — I can ask the agent, “is there a decision we made that explains why this is implemented this way?” And it can find it.

It’s code review, but code review where we capture intent. When I hit commit in Claude Code, I get a list of decisions and I step through them. Sometimes I hit one I don’t like and I stop right there. I reject it, go back, redo it. I like that better than pure code review.

It actually spots and controls weird silent LLM behavior. We’ve all let an agent run while we answer email and come back to something insane. Now I get a decision and I can say “don’t do that, let’s roll that back.”

And hacks get documented. I’ve taken shortcuts in this app. Now I know they exist. I can search back for all the shortcuts and then go fix them. The decision log becomes an artifact — not just of code changes, but of intent.

So let’s take this question further. Say Plumb exists and does exactly what I want. How could GitHub be better with this kind of information?

Right now, the main way we interact with code is with Markdown and chat. And GitHub has not changed anything about how we interact with Markdown and text on their site. Could my Markdown diffs have decisions linked to them, so I can see how intent manifests in the code?

I think any version of GitHub that takes the agentic era seriously needs to do four things:

Spec, tests, and code have to be first-class citizens. Code is already. Tests are close — GitHub Actions gets you there. But Markdown is not. Microsoft is probably leaving a lot of inference revenue on the table by not treating it seriously.

Markdown has to be a first-class citizen. This is the gap.

We need to see the linkages. Users need to follow connections between decisions, requirements, code, tests, and spec. Spec-driven development right now is treated as a one-shot thing: write the spec, hit go, you’re done. It’s not. It’s a process. You need to track all of it over time.

Users should be able to ask questions of the system. Not just read it — query it, to get closer to understanding intent. That’s how you actually understand a codebase that’s too large to read.

So here are my takeaways from the journey from whenwords to Plumb.

Code implementation clarifies and communicates intent. I could stop there and walk out of the room. I missed this with whenwords.

The job is to keep specs, code, and tests in sync as they move forward. The system for managing that has to stay simple. If it creates developer mental overhead, it just moves the problem somewhere else.

The act of writing code improves the spec and the tests. Just like software doesn’t truly work until it meets the real world, a spec doesn’t truly work until it’s implemented.

No-code libraries are toys because they are unproven.

Even if you aren’t the one making decisions during implementation, decisions are being made. We should leverage LLMs to extract and structure those decisions.

And finally: we’ve been here before. The answer then was process. The answer now is also process. And just as we leverage cloud compute to enable CI/CD for agile, we should leverage LLMs to build something lightweight enough that we can fit in our heads, doesn’t slow us down, and helps us make sense of our software.

Again: thank you very much to Professor Claire Le Goues, who helpfully walked me through the history of computer science. The history section of this talk is entirely thanks to her. And she has a book coming out, aimed at a wider audience, later this year. Do check it out.

We’re Talking About Terms of Use, But the Issue is Embedded Judgment

2026-03-01T09:13:00-08:00

The biggest buyers will want to audit and influence post-training

Beneath the Anthropic and Department of War fracas, there is a legitimate & essential conversation to be had about how much control any organization has when deeply adopting an AI model they didn’t train.

These are probabilistic systems, with near infinite surface area to test, that are intentionally designed. Models are used to inform and make decisions, and they all have embedded perspectives.

AI is unlike other technology purchases because AI has embedded judgment.

I’m not sure what the answer is here, only that we need to have this discussion (calmly) and that anyone who tells you this isn’t a problem, that their model has an objective God-view-from-nowhere, is selling you something.

Let me be clear: I agree strongly with Anthropic’s usage red lines. I gladly choose Claude myself.

But this conversation is being framed badly around usage. Many are talking about Anthropic’s “terms of service” (notably, both Hegseth and Trump even capitalized the term in their tweets), but I think allowed usage terms are red herring. The issue is embedded judgment.

If I were in military procurement, I would certainly some big questions about what “soul documents” or “constitutions” (or similar) are embedded in any model being considered for embedding throughout the armed forces (and all the labs make design choices during post-training).

And clearly this is something Anthropic is already dealing with! This section, from the above blog post, suddenly becomes much more interesting:

This constitution is written for our mainline, general-access Claude models. We have some models built for specialized uses that don’t fully fit this constitution; as we continue to develop products for specialized use cases, we will continue to evaluate how to best ensure our models meet the core objectives outlined in this constitution.

We don’t know if post-training control helped blow up the deal (I tend to believe the issue was about allowed usage, based on the administration’s and Anthropic’s statements, coupled with OpenAI’s announced terms). But I think it’s a safe bet many militaries will insist on influencing and auditing the post-training for their purchased variants.

I wrote back in 2023 that I expect states and cultures to build their own models for related reasons; I wasn’t thinking about defense tech at the time but it certainly amplifies the issues.

One takeaway: this is a strong argument for why the AI race isn’t going to be winner-take-all. Everyone wants a champion to trust.

Two Beliefs About Coding Agents

2026-02-25T14:12:00-08:00

There’s a lot of noise about how AI is changing programming these days. It can be a bit overwhelming.

If you hang out on social media, you’ll hear wild claims about people running 12 agents at once, for days. Or people hacking bots together, giving them $10k, and letting them roam the web.

The challenge with all of this is that coding agents really are performing some science fiction feats which were barely imaginable just 12 months ago. But at the same time, the ecosystem is incentivizing the most outlandish claims, so punters keep telling tall tales. Separating the signal from the noise is near impossible.

I’m lucky enough to talk to a range of developers and teams, spanning a variety of company sizes and a broad array of skill sets. From these conversations, two beliefs have emerged and solidified about coding agents and their (current) impact on coding.

Let’s start with belief number one:

Most talented developers do not appreciate the impact of the intuitive knowledge they bring to their coding agent.

We’ve all seen the posts by developer luminaries. They haven’t written code in weeks. They gave a hard problem to Claude Code or Codex and it just worked.

But what we don’t see is their prompts. And having seen many prompts by many types of devs, I would wager their prompts are relatively specific and offer more guidance to the LLM than your average user. And these specifics don’t have to be exhaustive. Even knowing the right terms to use can have enormous impact and activate an entirely different set of weights in the model than someone writing, “the search is broken fix it.”

Skilled programmers, with plenty of experience, don’t even think about how to ask correctly. They just do, intuitively. And things work well. If the agent and dev go through multiple turns, this effect gets even more significant.

I wish we could see more prompts and traces, from a wide range of developers, to better understand the range of code. And, just as interestingly, how hard and long agents have to work to achieve the goal. For now we can just browse public repos on Github, where the range of coding quality is quite broad.

Which brings me to the second belief:

Most work people are sharing are incredible personal tools, but they are not capital-P products.

There’s an app I really like called “StreetPass.” It’s a browser extension that watches web pages you visit and collects Mastodon accounts it finds, letting you easily follow them if you wish. It’s small and charming. A perfect extension.

Recently, I realized I wanted a version of StreetPass, but for RSS feeds instead of Mastodon accounts. I forked StreetPass, fired up Claude Code, and had a working version quickly. You can use this, but I’m not supporting it. I won’t be pushing it to the App Store or Chrome Web Store. I won’t be building a version that doesn’t leverage Feedbin. I have no idea if it works on Chrome or Firefox. It’s personal software that I use almost daily.

Most agentic coding projects we see being hyped are like this.

All those things I won’t do, those are the things that would turn my personal software into a Product. And we haven’t even gotten to marketing, support, and more. As we covered when we touched on Claude’s desktop app, the last 10% of product development and support is where the pain is. And that’s still a long road. As they say: Code today is free, as in puppies.

But I want to be clear about couple things.

First, I know many teams shipping agent written code into products. But they test, support, review, and so much more. But when we make big claims like “coding is solved” or “code is free”, we need to be clear about what we’re talking about building¹.

Second, our ability to manifest personal software easily is amazing and powerful. I am continually inspired by the things people build (for example, I loved Simon’s presentation software he whipped up for FOO Camp). His presentation app is so tailored to him, in the past the math would never justify the time spent building it to support a market of maybe a dozen. But now he gets his dream!

Similarly, my RSS finder extension is a feature not an app and (sadly) there isn’t a large market for RSS today. But with Claude Code (and open source code to build upon!) I can build just what I wanted in moments.

I am sure as our scaffolding and models improve, this stuff will get more accessible and more resilient, but I don’t expect these two beliefs to go away. Providing AI with the right instructions to obtain just what you want, will always be a challenge.

Coding agents amplify existing skills.

Grady Booch has a good post about this today. Things are getting higher level, and changing fast, but engineering remains. ↩

Why is Claude an Electron App?

2026-02-21T10:00:00-08:00

If code is free, why aren’t all apps native?

The state of coding agents can be summed up by this fact

Claude spent $20k on an agent swarm implementing (kinda) a C-compiler in Rust, but desktop Claude is an Electron app.

If you’re unfamiliar, Electron is a coding framework for building desktop applications using web tech, specifically HTML, CSS, and JS. What’s great about Electron is it allows you to build one desktop app that supports Windows, Mac, and Linux. Plus it lets developers use existing web app code to get started. It’s great for teams big and small. Many apps you probably use every day are built with Electron: Slack, Discord, VS Code, Teams, Notion, and more.

There are downsides though. Electron apps are bloated; each runs its own Chromium engine. The minimum app size is usually a couple hundred megabytes. They are often laggy or unresponsive. They don’t integrate well with OS features.

(These last two issues can be addressed by smart development and OS-specific code, but they rarely are. The benefits of Electron (one codebase, many platforms, it’s just web!) don’t incentivize optimizations outside of HTML/JS/CSS land.)

But these downsides are dramatically outweighed by the ability to build and maintain one app, shipping it everywhere.

But now we have coding agents! And one thing coding agents are proving to be pretty good at is cross-platform, cross-language implementations given a well-defined spec and test suite.

On the surface, this ability should render Electron’s benefits obsolete! Rather than write one web app and ship it to each platform, we should write one spec and test suite and use coding agents to ship native code to each platform. If this ability is real and adopted, users get snappy, performant, native apps from small, focused teams serving a broad market.

But we’re still leaning on Electron. Even Anthropic, one of the leaders in AI coding tools, who keeps publishing flashy agentic coding achievements, still uses Electron in the Claude desktop app. And it’s slow, buggy, and bloated app.

So why are we still using Electron and not embracing the agent-powered, spec driven development future?

For one thing, coding agents are really good at the first 90% of dev. But that last bit – nailing down all the edge cases and continuing support once it meets the real world – remains hard, tedious, and requires plenty of agent hand-holding.

Anthropic’s Rust-base C compiler slammed into this wall, after screaming through the bulk of the tests:

The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

The resulting compiler is impressive, given the time it took to deliver it and the number of people who worked on it, but it is largely unusable. That last mile is hard.

And this gets even worse once a program meets the real world. Messy, unexpected scenarios stack up and development never really ends. Agents make it easier, sure, but hard product decisions become challenged and require human decisions.

Further, with 3 different apps produced (Mac, Windows, and Linux) the surface area for bugs and support increases 3-fold. Sure, there are local quirks with Electron apps, but most of it is mitigated by the common wrapper. Not so with native!

A good test suite and spec could enable the Claude team to ship a Claude desktop app native to each platform. But the resulting overhead of that last 10% of dev and the increased support and maintenance burden will remain.

For now, Electron still makes sense. Coding agents are amazing. But the last mile of dev and the support surface area remains a real concern.

Over at Hacker News, Claude Code’s Boris Cherney chimes in:

Boris from the Claude Code team here.

Some of the engineers working on the app worked on Electron back in the day, so preferred building non-natively. It’s also a nice way to share code so we’re guaranteed that features across web and desktop have the same look and feel. Finally, Claude is great at it.

That said, engineering is all about tradeoffs and this may change in the future!

There we go: developer familiarity and simpler maintainability across multiple platforms is worth the “tradeoffs”. We have incredible coding agents that are great at transpilation, but there remain costs that outweigh the costs of shipping a non-native app.

How System Prompts Define Agent Behavior

2026-02-10T21:34:00-08:00

This post was co-authored with Srihari Sriraman

Coding agents are fascinating to study. They help us build software in a new way, while themselves exemplifying a novel approach to architecting and implementing software. At their core is an AI model, but wrapped around it is a mix of code, tools, and prompts: the harness.

A critical part of this harness is the system prompt, the baseline instructions for the application. This context is present in every call to the model, no matter what skills, tools, or instructions are loaded. The system prompt is always present, defining a core set of behaviors, strategies, and tone.

Once you start analyzing agent design and behavior, a question emerges: how much does the system prompt actually determine an agent’s effectiveness? We take for granted that the model is the most important component of any agent, but how much can a system prompt contribute? Could a great system prompt paired with a mediocre model challenge a mediocre prompt paired with a frontier model?

To find out, we obtained and analyzed system prompts from six different coding agents. We clustered them semantically, comparing where their instructions diverged and where they converged. Then we swapped system prompts between agents and observed how behavior changed.

System prompts matter far more than most assume. A given model sets the theoretical ceiling of an agent’s performance, but the system prompt determines whether this peak is reached.

The Variety of System Prompts

To understand the range of system prompts, we looked at six CLI coding agents: Claude Code, Cursor, Gemini CLI, Codex CLI, OpenHands, and Kimi CLI. Each performs the same basic function: given a task they gather information, understands the code base, writes code, tracks their progress, and runs commands. But despite these similarities, the system prompts are quite different.

Try It Out

Explore the above figures interactively in context viewer.

We’re analyzing exfiltrated system prompts, which we clean up and host here¹. Each of these is fed into context-viewer, a tool Srihari developed that chunks contexts in semantic components for exploration and analysis.

Looking at the above visualizations, there is plenty of variety. Claude, Codex, Gemini, and OpenHands roughly prioritize the same instructions, but vary their distributions. Further, prompts for Claude Code and OpenHands both are less than half the length of prompts in Codex and Gemini.

Cursor’s and Kimi’s prompts are dramatically different. Here we’re looking at Cursor’s prompt that’s paired with GPT-5 (Cursor uses slightly different prompts when hooked to different models), and it spends over a third of its tokens on personality and steering instructions. Kimi CLI, meanwhile, contains zero workflow guidance, barely hints at personality instructions, and is the shortest prompt by far.

Given the similar interfaces of these apps, we’re left wondering: why are their system prompts so different?

There’s two main reasons the system prompts vary: model calibration and user experience.

Each model has its own quirks, rough edges, and baseline behaviors. If the goal is to produce a measured, helpful TUI coding assistant, each system prompt will have to deal with and adjust for unique aspects of the underlying model to achieve this goal. This model calibration reins in problematic behavior.

System prompts also vary because they specify slightly different user experience. Sure, they’re all text-only, terminal interfaces that explore and manipulate code. But some are more talkative, more autonomous, more direct, or require more detailed instructions. System prompts define this UX and, as we’ll see later, we can make a coding agent “feel” like a different agent just by swapping out the system prompt.

We can get a glimpse of these two functions together by looking at how a given system prompt changes over time, especially as new versions of models arrive. For example:

Try It Out

Explore the above figures interactively in context viewer. Or, check out Codex’s system prompt evolution in similar detail.

Note how the system prompt isn’t stable, nor growing in a straight line. It bounces around a bit, as the Claude Code team tweaks the prompt to both adjust new behaviors and smooth over the quirks of new models. Though the trend is a march upward, as the coding agent matures.

If you want to dive further into Claude Code’s prompt history, Mario Zechner has an excellent site where he highlights the exact changes from version to version.

Go Deeper

Sometimes instructions are just…weird. Srihari cataloged some of the odder instructions he found while exploring coding agent system prompts.

The Common Jobs of a Coding Agent System Prompt

While these prompts vary from tool to tool, there are many commonalities that each prompt features. There is clear evidence that these teams are fighting the weights: they use repeated instructions, all-caps admonishments, and stern warnings to adjust common behaviors. This shared effort suggests common patterns in their training datasets, which each has to mitigate.

For example, there are many notes about how these agents should use comments in their code. Cursor specifies that the model should, “not add comments for trivial or obvious code.” Claude states there should be no added comments, “unless the user asks you to.” Codex takes the same stance. Gemini instructions the model to, “Add code comments sparingly… NEVER talk to the user through comments.”

These consistent, repeated instructions are warranted. They fight against examples of conversation in code comments, present in countless codebases and Github repo. This behavior goes deep: we’ve even seen that Opus 4.5 will reason in code comments if you turn off thinking.

System prompts also repeatedly specify that tool calls should be parallel whenever possible. Claude should, “maximize use of parallel tool calls where possible.” Cursor is sternly told, “CRITICAL INSTRUCTION: involve all relevant tools concurrently… DEFAULT TO PARALLEL.” Kimi adopts all-caps as well, stating, “you are HIGHLY RECOMMENDED to make [tool calls] in parallel.”

This likley reflects the face that most post-training reasoning and agentic examples are serial in nature. This is perhaps easier to debug and a bit of delay when synthesizing these datasets isn’t a hinderence. However, in real world situations, users certainly appreciate the speed, so system prompts need to override this training.

Both of these examples of fighting the weights demonstrate how system prompts are used to smooth over the quirks of each model (which they pick up during training) and improve the user experience in an agentic coding application.

Much of what these prompts specify is shared; common adjustments, common desired behaviors, and common UX. But their differences notably affect application behavior.

Go Deeper

Srihari looked at more examples of fighting the weights to understand how system prompts reveal model biases.

Do the Prompts Change the Agent?

Helpfully, OpenCode allows users to specify custom system prompts. With this feature, we can drop in prompts from Kimi, Gemini, Codex and more, removing and swapping instructions to measure their contribution.

We gave SWE-Bench Pro test questions to two applications: two agents running the OpenCode harness, calling Opus 4.5, but with one one using the Claude Code system prompt and the other armed with Codex’s instructions.

Time and time again, the agent workflows diverged immediately. For example:

The Codex prompt produced a methodical, documentation-first approach: understand fully, then implement once. The Claude prompt produced an iterative approach: try something, see what breaks, fix it.

This pattern remains consistent over many SWE Bench problems. If we average the contexts for each model and system prompt pair, we get the following:

Try It Out

Explore the above figures interactively in context viewer.

All prompt-model combinations correctly answered this subset of SWE Bench Pro questions. But how they suceeded was rather different. The system prompts shaped the workflows.

Go Deeper

Srihari explored Codex CLI and Claude Code autonomy, and how the system prompt may shape their behavior.

System Prompts Deserve More Attention

Last week, when Opus 4.6 and Codex 5.3 landed, people began putting them through the paces, trying to decide which would be their daily driver. Many tout the capabilities of one option over another, but just as often are complaints about approach, tone, or other discretionary choices. Further, it seems every week brings discussion of a new coding harness, especially for managing swarms of agents.

There is markedly less discussion about the system prompts that define the behaviors of these agents². System prompts define the UX and smooth over the rough edges of models. They’re given to the model with every instruction, yet we prefer to talk Opus vs. GPT-5.3 or Gastown vs. Pi.

Context engineering starts with the system prompt.

Exfiltrated system prompts represent versions of the system prompt for a given session. It’s not 100% canonical, as many AI harnesses assemble system prompts from multiple snippets, given the task at hand. But given the consistent manner with which we can extrac these prompts, and comparing them with public examples, we feel they are sufficiently representative for this analysis. ↩
Though you can use Mario’s system prompt diff tool to explore the changes accompanying Opus 4.6’s release. ↩

The Potential of RLMs

2026-02-09T09:42:00-08:00

Handling Your Long Context Today & Designing Your Agent Tomorrow

Context Rot is the Worst Context Failure

“Context Rot” is a common problem agent designers must avoid and mitigate.

The Gemini 2.5 paper was one of the first technical reports that flagged the issue, noting that the performance of their Pokémon-playing harness rapidly deteriorated as the context grew beyond 100,000 tokens; a figure far below Gemini 2.5’s 1 million input token limit. We covered this in our context failures piece, but the Chroma team published the canonical exploration of the effect, dubbing it context rot.

A key takeaway from Gemini’s Pokémon troubles and the Chroma post is that context rot is not a capacity problem. It’s a quality problem. As the context grows beyond a model’s soft limit, the model continues to issue output as its accuracy declines. This makes for a pernicious problem, one that sneaks up on us the longer we run agents.

Of all the context fails, context rot is the worst.

Enter Recursive Language Models

Defined by Alex Zhang and Omar Khattab, Recursive Language Models (or RLMs) are a simple idea:

Load long context into a REPL environment¹, stored as variables.
Allow an LLM to use the REPL environment to explore and analyze the context.
Provide a function in the REPL to trigger a sub-LLM call.

That’s it. That’s an RLM. The LLM will use the REPL to filter, chunk, and sample the long context as needed to complete its task. It will use the sub-LLM function to task new LLM instances to explore, analyze, or validate the context. Eventually, the sum of the LLM’s findings will be synthesized into a final answer.

With this setup, the long context(s) can be really long. I’ve given RLMs logfiles more than 400 megabytes in size, with no issues. In the original RLM post, Alex reports that performance doesn’t degrade when >10 million tokens are provided.

Note the orange lines on the right: as the context length increases, performance very slowly degrades, hovering around 50-60%. Compare this to the non-RLM results (with the same GPT-5 model), which dramatically decline until failing entirely at 262,000 tokens.

RLMs Work By Turning Long Context Problems Into Coding & Reasoning Problem

The key attribute of RLMs is that they maintain two distinct pools of context: tokenized context (which fills the LLM’s context window) and programmatic context (information that exists in the coding environment). By giving the LLM access to the REPL, where the programmatic context is managed, the LLM controls what moves from programmatic space to token space.

And it turns out modern LLMs are quite good at this!

Let’s look at an example.

Here I’ve given Kimi K2 a very large dataset of Stable Diffusion prompts (prompts people provided to generate images). I then ask the RLM to identify the most common celebrities used in these prompts (and of course, I’m using RLM in DSPy). If you’re curious, here’s the code.

I give the RLM a budget of 5 iterations to accomplish the task. Below, you can swipe/page through each iteration, which shows the LLM’s reasoning and the code it executed in the REPL. There’s a few things to keep in mind as you read through:

Every time the LLM calls print in the REPL, it’s bringing new context into the token space. (I’ve omitted this output for brevity)
When the LLM calls llm_query (highlighted in blue) in the REPL, it’s tasking another LLM instance with a sub-call. It stores the result of this function as a variable, usually.
On the last iteration, the LLM calls a special function SUBMIT, which indicates it has finished with the task.

Click through and read, it really illuminates how a RLM works:

We can clearly see the LLM exploring and sampling the context, planning an approach, testing the approach, scaling the approach, then finally synthesizing its findings into a final answer. (In this case, it was correct!)

The context I gave this RLM – the collection of Stable Diffusion prompts – exceeds the maximum context window of any LLM. It would fail before it started, whereas a DSPy RLM harness around Kimi K2 took only a couple minutes.

It’s incredible, but with this example we can identify a couple limitations of RLMs.

First, it’s relatively slow. Answering this question took over a dozen LLM calls and several minutes. And we were using Kimi K2 on Groq. Try this with GPT-5.3 or Opus 4.6 and you’ll be waiting around even longer.

Second, as you read through the reasoning and code in the example above it becomes apparent that you need strong models to drive RLMs. Qwen3-30B-A3B couldn’t complete this task. It got confused, lost track of progress, and ended up running out of budget before submitting an answer².

This brings us to the second reason RLMs work so well (in addition to maintaining the two token and programmatic context pools): RLMs exploit the coding reasoning gains of the last +18 months.

We’ve covered before how LLMs are getting better at verifiable tasks because it’s relatively easy to synthesize data and evaluate verifiable tasks, like math and coding. We’ve spent many billions of dollars post-training coding skills into frontier models. RLMs wrap long contexts in a coding environment so they’re addressable by the LLM’s incredible coding abilities, turning context rot into a coding problem.

Even better, RLMs get to use the REPL not just as a tool for exploring and managing long contexts, but also as a deterministic scratchpad. This proves to be a killer resource for many tasks. You occasionally see this benefit in action in ChatGPT or Claude, when the LLM will fire up a Python script to answer a question³. This hybrid capability of RLMs – the ability to use probabilistic, fuzzy LLM logic for some challenges and deterministic code for others – will likely become a stronger attribute as RLM harnesses mature and models are fine-tuned.

The Potential of RLMs: Agent Discovery Mechanisms

The ability of RLMs to mitigate the effects of context rot are really incredible. However, this isn’t the potential that excites me most. What excites me about RLMs is their ability to explore, develop, and test approaches to solving a problem.

If you start experimenting with RLMs (and I strongly suggest you should), be sure to continually review your traces. Set verbose to true and/or wire up DSPy to MLFlow. As you watch these models explore the context and try out different approaches (taking your iteration budget into consideration⁴), you’ll notice repeating patterns. In the example above, if I asked the RLM to find the top celebrities, aesthetic styles, or vehicles requested in the image generation prompts, it would repeatedly deploy similar tactics to situate itself and complete the task.

There is no reason we can’t identify these repeating patterns, decompose them, and optimize them.

This is what excites me about RLMs: if you run them on the same task several times, you’re generating emergent agent designs. These traces can then be used to explicitly define an agent, with higher reliability and lower latency. RLM passes discover the best approach to the problem, which we can then optimize.

The Limitations of RLMs

But if that’s the potential, how should you use RLMs today? In the last couple months I’ve seen teams use them for very large context scenarios, from general coding tasks across massive codebases to research and exploration across massive datasets.

At the moment, using RLMs on small context problems probably isn’t worth the squeeze. You’ll end up waiting around while the RLM explores context that could have simply been part of the prompt.

Further, RLMs do not solve other context fails, like context poisoning or context confusion. If bad information is in your programmatic context, there’s good odds it could influence the RLM in undesirable ways.

The Next “Chain of Thought”?

RLMs are slow, synchronous, and merely borrowing the current capabilities of models rather than leveraging models post-trained to be good at RLM patterns. There is so much low-hanging fruit here.

But that’s exactly what makes them exciting. Chain of thought was also simple and general (just ask the model to “think step by step”) and it unlocked enormous latent potential in LLMs, that was only fully realized through the creation of reasoning models. RLMs have the same shape: a test-time strategy that’s easy to implement today and will only get better as models are trained to exploit it.

You probably don’t need to rush out and refactor your agents today. But if your agents touch large contexts, start experimenting with RLM traces today. You’ll learn something about your problem…and you might discover your next agent architecture in the output.

“REPL” stands for “read-eval-print loop”. It is an interactive coding environment where one can enter arbitrary code and get back output. If you open your terminal and type python, you’ll find yourself in a REPL. ↩
The team at MIT behind RLM has just released a version of Qwen3-8B post-trained on RLM traces. I hear it works pretty well, but no amount of fine-tuning or RL is going to help Qwen-8B code or reason as well as GPT or Opus. ↩
Both ChatGPT and Claude used to do this when asked, “How many R’s are in Strawberry,” though it appears both rely on reasoning or, in the case of ChatGPT, hide the previously visible Python code. ↩
I was continually amazed how well models would leverage their budgets. Kimi, in particular, wasn’t shy about ending early if the task proved simple. But it would also spend LLM sub-calls freely once it had a working approach, saturating my connection with Groq. ↩

The Rise of Spec Driven Development

2026-02-06T08:22:00-08:00

It’s been a month since I launched whenwords, and since then there’s been a flurry of experiments with spec driven development (SDD): using coding agents to implement software using only a detailed text spec and a collection of conformance tests.

Github Could Use a ‘Docs Review’ UI

First off, despite whenwords being a couple Markdown docs and a YAML test set, people have submitted valuable PRs. Mathias Lafeldt spotted a disagreement about rounding, where the spec instructed the agent to round up in several scenarios, but three tests were rounding down. Others have suggested there should be some [CI][ci] (despite their being no code) and wonder what that should be.

There’s been enough action on the repo to give us an idea of what open source collaboration could look like in a SDD world. And it feels more like commenting in and marking up a Google Doc than code merges. I would love to see Github lean into this and build richer Markdown review, like Word or Google Docs, allowing for easier collaboration and accessibility to a wider audience.

Emulation & Porting are the Low-Hanging SDD Use Case

By far, the hardest part of starting a SDD project is creating the tests. Which is why many developers are opting for borrowing existing test sets or deriving by referencing a source of truth.

Here’s a few examples:

Anthropic wrote a C compiler in Rust. They used existing test suites and used GCC as a source of truth for validation and generating new tests.
Vercel created a bash emulator in TypeScript. They created and curated an amazing set of shell script spec tests and have been feeding these to Ralph. (To make this even more meta, I’ve been following their commits and Clauding them into Python).
Pydantic created a Python emulator…in Python. This sounds silly, but it’s useful in the same way Vercel’s just-bash is: it’s a super lightweight sandbox for AI agents. (In fact, I’ve already wrapped it in a CodeInterpretter for use with DSPy’s RLM module)

Now… It’s worth noting that most of these examples didn’t emerge perfectly. Anthropic’s C-compiler just kinda punted on the hard stuff and admits the generated code is inefficient¹. Pydantic’s Python emulator lacks json, typing, sys, and other standard libraries. Though I’m sure those will come soon. Vercel’s just-bash sports outstanding coverage, though people continue to find bugs.

This is the big takeaway from watching the last few weeks of SDD: agents and a pile of tests can get you really far, really fast, but for complex software they can’t get you over the line. Edge cases will generate new tests, truly hard problems will resist SDD implementation, and architectural issues will prohibit parallelism agents.

Vercel’s CTO and just-bash creator, Malte Ubl, sums it up best:

You can Ralph up a port or emulator in a weekend or two, but now you have to take care of it.

There is lots to pick apart in Anthropic’s piece (I have had multiple compiler and related people ping me about how misrepresentative it is), but the most laughable claim is that this is, “a clean-room implementation”. The idea that using an LLM trained on the entire internet, all of Github, and warehouses full of books is a clean room environment is absurd. ↩

A Software Library with No Code

2026-01-08T14:59:00-08:00

All You Need is Specs?

Today I’m releasing whenwords, a relative time formatting library that contains no code.

whenwords provides five functions that convert between timestamps and human-readable strings, like turning a UNIX timestamp into “3 hours ago”.

There are many libraries that perform similar functions. But none of them are language agnostic.

whenwords supports Ruby, Python, Rust, Elixir, Swift, PHP, and Bash. I’m sure it works in other languages, too. Those are just the languages I’ve tried and tested.

(I even implemented it as Excel formulas. Though that one requires a bit of work to install.)

But like I said: the whenwords library contains no code. Instead, whenwords contains specs and tests, specifically:

SPEC.md: A detailed description of how the library should behave and how it should be implemented.
tests.yaml: A list of language-agnostic test cases, defined as input/output pairs, that any implementation must pass.
INSTALL.md: Instructions for building whenwords, for you, the human.

The installation instructions are comically simple, just a prompt to paste into Claude, Codex, Cursor, whatever. It’s short enough to print here in its entirety:

Implement the whenwords library in [LANGUAGE].

1. Read SPEC.md for complete behavior specification
2. Parse tests.yaml and generate a test file
3. Implement all five functions: timeago, duration, parse_duration, 
   human_date, date_range
4. Run tests until all pass
5. Place implementation in [LOCATION]

All tests.yaml test cases must pass. See SPEC.md "Testing" section 
for test generation examples.

Pick your language, pick your location, copy, paste, and go.

Okay. This is silly. But the more I play with it, the more questions and thoughts I have.

Recent advancements in coding agents are stunning. Opus 4.5 coupled with Claude Code isn’t perfect, but its ability to implement tightly specified code is uncanny. Models and their harnesses crossed a threshold in Q4, and everyone I know using Opus 4.5 has felt it. There wasn’t a single language where Claude couldn’t implement whenwords in one shot. These capabilities are raising all sorts of questions, especially: “What does software engineering look like when coding is free?”

I’ve chewed on this question a bit, but this “software library without code” is a tangible thought experiment that helped firm up a few questions and thoughts. Specifically:

Do we still need 3rd party code libraries?

There are many utility libraries that aim to perform similar functions, but exist as language-specific implementations. Do we need them all? Or do we need one, tightly defined set of rules which we implement on demand, according to the specific conventions of a given language and project? For libraries that are simple utilities (as opposed to complex frameworks), I think the answer might be, “Yes.”

Now, whenwords is (purposely) a very simple utility. It’s five functions, doesn’t require many dependencies, and depends on a well-defined standard (Unix time). It’s not an expensive operation, a poor implementation probably won’t be a bottleneck, and the written spec is only ~500 lines.

But there’s no reason we couldn’t get more complex. Well defined standards (like those you’d need to implement a browser) can help you tackle complex bits of software relatively quickly. The question is: when does this model make sense and when doesn’t it?

Today, I see 5 reasons why you’d want libraries with code:

1. When Performance Matters

Let’s run with that browser example. There are well-defined, large specs for how to interpret HTML, JS, and CSS. One could push these further and deliver a spec-only browser.

But performance is going to be an issue. I want to open hundreds of tabs and not spring memory leaks. I want rendering to be quick, optimized to within an inch of what’s possible. I want a large group of users going out and encountering strange websites, buggy javascript, bad imports, and more. I want people finding these issues, fixing them, and memorializing them as code.

2. When Testing is Complicated

But Drew, you say, if we find performance issues in the spec-only browser we can just update the spec. That’s true, but testing updates gets complicated fast.

Let’s say you notice whenwords has a bug in its Elixir implementation. To fix the whenwords spec, you add a line to the SPEC.md file to prevent the Elixir bug. You submit a PR request and I’m able to verify it helps Claude build a working Elixir implementation.

But did the change screw up the other variants? Does whenwords still work for Ruby, Python, Bash, and Excel? Does it work for all of them when building with Claude and Codex? What about Qwen? Do we end up with a CI/CD pipeline that builds and tests our spec against 4 coding agents and 20 languages? Or do we just say, “Screw it,” and tell users they’re responsible for whatever code produced?

This isn’t a huge deal for a library with the scope of whenwords, but for anything moderately complex, the amount of surface area we’d want to test grows quickly. whenwords has 125 tests. For comparison, SQLite has 51,445 tests. I’m not building on a spec-only implementation of a database.

3. When You Need to Provide Support & Bug Fixes

Chasing down bugs is harder with spec-only libraries because failures are inconsistent.

Let’s imagine a future where we’re shipping enterprise software as a Claude Skill, or some other similar prepared context that lets agents implement our software for our customers, depending on their environment. This is basically our “software library with no code” taken to an extreme. While there may be benefits here, there are also perils.

Replicating bugs is nearly impossible. If the customer gets stuck on an issue with their own generated codebase, how do we have a hope of finding the problem? Do we just iterate on our spec and add plenty of tests, toss it over to them, and ask them to rebuild the whole thing? Probably not. The models remain probabilistic and as our specs grow the likelihood of our implementations being significantly different grows.

4. When Updates Matter

A library I like is LiteLLM, an AI gateway that provides one interface to call many LLMs across multiple platforms. They add new models quickly, push updates to address connection issues with different platforms, and are generally very responsive.

Other foundational libraries (like nginx, Rails, Postgres) push essential security updates. These are dependencies I wish to maintain. Spec-only libraries, on the other hand, likely work best for implement-and-forget utilities and functions. When continual fixes, support, and security aren’t needed or aren’t valued.

5. When Community & Interoperability Matter

Running through all the points above is community. Lots of users mean more bugs are spotted. More contributors mean more bugs are fixed. Comprehensive testing means PRs are accepted faster. A big community increases the odds someone is available to help. Community support means code is kept up-to-date.

When you want these things, you want community. The code we rely on is not just an instantiation of a spec (a tightly defined set of concepts, aims, and requirements), but the product of people and culture that crystallize around a goal. It’s the magic of open source; why it works and why I love it.

For the job whenwords performs, we don’t need to belong to a club. But for foundations, the things we want to build on, the community is essential because it delivers the points above. Sure, there may be instances of spec-only libraries created and maintained by a vibrant community. But I imagine there will continually be a reference implementation that codifies and ties the spec to the ground.

But the above isn’t fully baked. Our models will get better, our agents more capable. And I’m sure the list above is not exhaustive. I’d enjoy hearing your thoughts on this one, do reach out.

2025 in Review: Jagged Intelligence Becomes a Fault Line

2025-12-29T10:03:00-08:00

A year shaped by synthetic data, dramatically uneven performance, and reliability issues

One of the reasons why I write is reflection. Looking over 2025’s work, there are consistent themes among the mess that help me understand the velocity of AI, its momentum and direction. I’m not going to polish this too much (if you want to dive in, check out the linked posts), but this exercise is quite clarifying to me.

Here’s the tl;dr:

Immediate AI risk comes from people over-estimating AI capabilities.
Reliability and trust are the barriers preventing wide adoption.
Evaluations remain underutilized.
Synthetic data unlocked AI capabilities, but shapes its nature
There is a growing AI perception gap between quantitative users and qualitative users.
AI leaders are letting others define the story of AI.

Immediate AI risk comes from people over-estimating AI capabilities.

There are many risks we should be conscious of, but the downsides that are biting us now come from people believing in AI capabilities or sentience that isn’t there. “I don’t worry about superintelligent AGI’s taking over the world. I worry about bots convincing people they’re having an emotional connection when they’re not.” This can be tied to teen suicides, senior scams, propagandist bots, and more. The natural language interface is wonderful for its flexibility and accessibility, but it exploits our evolutionary tendency to recognize humans where there are none.

This danger is more pronounced by our current human-in-the-loop design pattern. We’re asking laypeople to evaluate AI capabilities in fields which they do not understand. Too often I hear, “Chatbots know everything, but they make mistakes when it comes to things I know.”

Posts:

Reliability and trust are the barriers preventing wide adoption.

As we saw above, people can easily spot issues with AI when it’s working in their domain. Sure, we’ve come a long way this year, but these gains have mostly come from Intern-style applications. We keep the humans in the loop because humans are excellent at spotting and fixing issues the 10% of the time models flail.

But when that figure is higher than ~10% or so (these are finger-in-the-air numbers), people simply avoid the AI. Agents, especially custom enterprise ones, have a reliability problem that hinders the development of the field. Teams that successfully ship agents do so by dialing back their complexity: chat interfaces, short tasks.

But we should consider reliability a means to an end; and that end is trust.

Trust is complex. It’s dependent on the task being done, the risk associated with the task, the UI that presents the task, and how the agent contextualizes the produced decision. Reliability can be measured at the model level, but trust has to be assessed end-to-end: from the model, to the application, to the user.

Frustratingly, there’s few good ways to measure trust in the AI era. We can do user interviews (and I know teams that do), but these are slow. UX research always has been, but their pace feels especially sluggish in the context of AI-powered development, Many teams can hack this by “vibe shipping” – making changes to their app, pushing to production, running a few queries, then repeating – basically doing the UX reseach by themselves, on themselves.

Everyone else should look to delegation. “Forget the benchmarks – the best way to track AI’s capabilities is to watch which decisions experts delegate to AI.”

Posts:

Evaluations remain underutilized.

At first I wrote, “under-appreciated.” But I think teams get why evaluations are valuable. The problem is most teams still don’t build them.

They get the benefits:

The real power of a custom eval isn’t just in model selection – it’s in the compound benefits it delivers over time. Each new model can be evaluated in hours, not weeks. Each prompt engineering technique can be tested systematically. And perhaps most importantly, your eval grows alongside your understanding of the problem space, becoming an increasingly valuable asset for your AI development.

It used to be I had to argue that hand-tuned prompts would become overfit to a model. But OpenAI’s headline model deprecations this year pushed many teams to discover this empirically.

Despite this hiccup, many teams continue to push forward, hand-editing prompts and vibe shipping as they go. Pre-scale, this is likely optimal: the speed of iteration this allows is too valuable to ignore. As a result, so many teams I talk to who were previously focusing on evaluation tooling have pivoted to synthetic data creation or LLM-as-a-Judge services. Our AI capabilities have improved dramatically, but human behavior remains a constraint.

Posts:

Synthetic data unlocked AI capabilities, but shapes its nature.

Investing in synthetic data creation unlocked AI capabilities in 2025. Rephrasing high quality content into reasoning and agentic chains kept the scaling party alive. Generating new datasets for verifiable tasks (like math and coding) helped AI coding apps evolve from better auto-complete services to async agents in less than a year.

Remember: Claude Code arrived in February.

Synthetic data did this. It provided the material needed for post-training, the mountains of examples necessary to upend an entire industry. But the limits of synthetic data, that it has been focused on quantitative tasks, greatly shapes our tools and discourse:

Those who use AIs for programming will have a remarkably different view of AI than those who do not. The more your domain overlaps with testable synthetic data and RL, the more you will find AIs useful as an intern. This perception gap will cloud our discussions.

The current solution, being deployed by frontier chatbots, is to treat everything they can as a programming problem. If ChatGPT or Claude can write a quick Python script to answer your question, it will. Context engineering challenges are being reframed as coding tasks: give a model a Python environment and let them explore, search, and read and write files. Yesterday’s harness is today’s environment. In 2024 we called models, today we call systems.

Scale was all we needed in 2024. Reasoning kept the party going in 2025. Coding will be the lever in 2026.

Posts:

There is a growing AI perception gap between quantitative users and qualitative users.

And this is the trillion dollar question: can we replicate our coding gains in qualitative fields? Can we generate synthetic data that unlocks better writing? Can we turn PowerPoint creation into a coding exercise? If we give GPT-5.2 a Python notebook can it write a better poem?

If these things can’t be solved with coding, there will be tremendous opportunity to improve the qualitative performance of models through other means. Doing so, however, will likely require solutions that are opinionated rather than general. Aesthetic performance requires subjective choices, not objective correctness.

But for now, the lopsided nature of today’s models is creating a world where programmers experience a very different AI than most ChatGPT users. The divide in capabilities between a free ChatGPT or Copilot account and Claude Code with Opus 4.5 is vast. Public conversations about AI are deeply unproductive because what you and I are experiencing is lightyears beyond the default experience.

Posts:

AI leaders are letting others define the story of AI.

Compounding this problem is the fact that AI leaders aren’t even attempting to explain how AI works to the masses. I recently wrote:

The AI ecosystem is repeating digital advertising’s critical mistake.

One of the reasons the open online advertising ecosystem fell apart is because they terribly communicated how it all worked. The benefits of cross targeting were brushed over, because it was hard and complex to explain, and that left the door open for others to make privacy the only story, until it was too late. Which created the environment we have now, where most quality media is paywalled and only the giant platforms have sufficient scale for effective targeting.

The AI industry is failing to explain how AI works. People and companies either brush it aside as complex and/or oversimplify it with over-promised metaphors (“A PHD in your pocket!”) These same people then get upset when critics keep wringing their hands about hallucinations, financial engineering, power and water consumption, and much more.

AI leaders don’t invest in explanations because AI is hard to explain. Further, they’re incentivized to over-simplify and over-promise. Combine this withthe lightning speed of development (even Karpathy feels left behind!) and AI’s jagged intelligence becomes into a fault line, threatening to rupture.

Posts:

DeepSeek as a Power Object

Why I Write (And You Should Too!)

2025-12-27T10:24:00-08:00

Every now and then, people ask me why I write. I don’t get paid to write here, so it’s not immediately obvious why I keep writing.

I think writing is one of the most valuable things you can do, and I recommend everyone try it. Here’s why:

It makes you a better thinker and communicator. Writing is a muscle. The more you write, the easier it gets, and your ability improves. You’ll learn to make clearer arguments, crisper explanations, and better empathize with your audience. These skills are applicable to everything.
You’ll get feedback that makes you a better writer. Feedback exposes weak arguments and strengthens the good ones. Plus, learning to listen to feedback is another skill that is universally applicable.
You’ll meet people interested in the same things you are. Looking through my correspondence, it’s amazing how many of my favorite people to chat with I met through writing online. (BTW, this goes both ways. If you read something that resonates with you online, write them a note thanking them and telling them what you liked!)
Your past thinking will be archived and searchable. This is more valuable than you think. If you invest time to hone a piece, you’ll turn back to it more often than you’d expect. Further, reviewing old pieces and threads over time will reveal what worked and what didn’t while making your progress tangible.
The value of your writing compounds. The value to you, that is. I don’t think my pieces from 6 years ago are improving anyone’s life, but the contacts I’ve made and pieces I’ve crafted have grown into a foundation I get to leverage everyday.
Writing gives you a license to explore and organize your thoughts. This is the fun bit. Chasing down an idea that interests you, forming questions and then investigating them; it’s a joy. The second most common question I get about my writing is, “How do you motivate yourself to write?” This is the answer. There are so many drafts that live, dormant in my draft folder. So many times I start a piece and lose interest. And then: something will click and I’ll draft, investigate, and finish a piece in an hour (here’s two examples). These aren’t always the most substantive pieces, but they keep the practice going and the momentum up.

It’s hard to form new habits. But writing is s the best investment you can make today. Here’s a few tips for getting started:

Be okay with bad writing. Most writing isn’t great! If my hit rate is 1 out of 5, I’m thrilled. Get comfortable publishing things that aren’t perfect. I know many people who wait too long to publish and, well, never do. They do this for years. If they’d gotten the ball rolling back then, they’d be better writers today. It’s weird: you’d think regular private writing would be sufficient to get better. But it isn’t. There’s no stakes. No feedback. The only way to get better is to ship. Some people worry about the risk of bad writing. I think the biggest risk comes from being an asshole (so don’t be an asshole!) But the actual risks are quite low: most bad writing is neutral, it remains unread.
You need to do the writing. Not AI. Writing is exercise. If I brought a forklift to the gym and used it to lift weights, what would be the point?
But AI is a wonderful editor. When you’re getting started, it’s intimidating to ask people for feedback on drafts. Thankfully, AI is great at this! Paste in your draft and prompt it with something like, “This is a blog draft where I am trying to argue X, read the piece and identify any spelling or grammar errors, places where I am not being clear or where a reader might be confused, or areas where my argument is weak.” Take it with a grain of salt, but this is usually very, very helpful.
Don’t overthink where to publish. Make pages public on Notion. Use Github pages. Use Substack, if you must. The only thing you cannot omit is an easy contact form and a way for someone to subscribe. I screwed this up for too long. RSS is not sufficient. Comments don’t count (in fact, turn them off). Let people email you with a form, one-on-one. The other thing to keep in mind is to pick something with low friction. If it takes too many steps to create a new post, you won’t.

I hope you start a blog this year. Or revive an old one.

If you’d like some further advice, feel free to reach out!

How Model Use Has Changed in 2025

2025-12-19T11:59:00-08:00

From ‘Naked’ Model Endpoints to Tool-Using, Reasoning Environment Endpoints

I was poking around LiteLLM’s Github repository and stumbled upon an interesting file. model_prices_and_context_window.json is a registry of all the models and inference providers you can call with LiteLLM. This is the core value of LiteLLM, wrapping this diverse array of models behind a consistent yet capable API, allowing applied AI builders to swap out models and providers without a major code rewrite.

This registry file is impressive, and well communicates the value of LiteLLM. It’s over 30,000 lines detailing over 2,000 model and provider combinations. At the top of the JSON file, LiteLLM provides a sample_spec, their schema for the information they store for each model. Curious, I poked into the repository’s commit history to see how this schema has evolved over the months.

And boy if this isn’t the story of LLMs in 2025:

On the left is the schema on January 1st, 2025. On the right is the schema today. The orange lines were added in 2025. The schema has doubled in size, as more and more tools and logic has been embedded in models and their providers. We aren’t just asking for text completion or chat, a good chunk of us are now hitting a single endpoint that can execute code, use a computer, manipulate files, and search the web. These types of calls are being made to an appliance, not a function, complete with its own environment to complete a task.

2025 may not have been the year of the agent, but perhaps it was the year of the tool.

Now, of course, this isn’t everyone. Such an appliance is essentially a blackbox that is difficult to eek reliability out of, if your agent or application is struggling. We still have and use ‘naked’ inference calls all the time.

But for human-in-the-loop chat apps, the surface area of what happens behind a model call is growing in size and structure.

Enterprise Agents Have a Reliability Problem

2025-12-06T09:39:00-08:00

Enterprise agents struggle to reach production or find adoption due to reliability concerns

Throughout 2025, there’s been a steady drumbeat of reports on the state of AI in the enterprise. On the surface, many appear to disagree. But dig in a little bit, look at how each report was assembled and how they defined their terms and you’ll find a consistent story: adoption of 3rd party AI apps is surging while 1st party development struggles to find success.

If you’re short on time, here’s the tl;dr:

Off-the-shelf AI tools are widely used and valued within the enterprise. (Wharton/GBK’s AI Adoption Report)
But internal AI pilots fail to earn adoption. (MIT NANDA’s report)
Very few enterprise agents make it past the pilot stage into production. (McKinsey’s State of AI)
To reach production, developers compromise and build simpler agents to achieve reliability. (UC Berkeley’s MAP)

The few custom agents that make it past the gauntlet figure out how to achieve reliability, earn employee trust, and actually find usage. Reliability is the barrier holding back agents, and right now the best way to achieve it is scaling back ambitions.

Let’s start with the notorious MIT NANDA report which generated the headline, “95% of generative AI pilots at companies are failing.”

Plenty have criticized the methodology and conclusions NANDA reaches, but I tend to believe most of the claims in the report provided we keep in mind who was surveyed and understand that “AI pilots” were defined as internally developed applications. Keep this in mind as you review the following two figures:

I wrote in September:

For all the criticism of the NANDA report, it is a survey of many business leaders. We can treat it as such. So while we might take that 95% figure with a grain of salt, we can trust that business leaders believe the biggest reason their AI pilots are failing is because their employees are unwilling to adopt new tools… While 90% of employees surveyed eagerly use AI tools they procure themselves.

Internal applications struggle, while employee-driven use of ChatGPT and Claude is booming.

Wharton and GBK’s annual AI adoption report appears to counter NANDA with claims that, “AI is becoming deeply integrated into modern work.” 82% of enterprise leaders use Gen AI weekly and 89% “believe Gen AI augments work.”

The Wharton report is an interesting read that details how people are using AI tools throughout their workday. But these are overwhelmingly 3rd party tools:

ChatGPT, Copilot, and Gemini dominate usage (Claude ranks surprisingly low, likely a function of Wharton’s respondent base). Custom chatbots see less usage than ChatGPT, and even then: the “by/for” in “built specifically by/for my organization” is doing a lot of work.

10 slides later, the report states (emphasis mine), “Customized Gen AI Solutions May be Coming as Internal R&D Reaches One-Third of Tech Budgets.” The money is being deployed, but customized AI has yet to arrive at scale.

Though they appear to disagree, both reports support a common conclusion: adoption of off-the-shelf tools is growing and valued, but companies struggle to build their own AI tools. Every enterprise AI report I read brings this reality further into focus.

Google Cloud’s “AI Business Trends” report says agents are being widely used… But their definition of “agent” includes ChatGPT, CoPilot, and Claude.

McKinsey’s “State of AI” doesn’t include off-the-shelf tools in their survey, and <10% of respondents report having agents beyond the pilot stage.

So why is it hard for enterprises to build AI tools? In short: reliability.

“Measuring Agents in Production”, recent research led by Melissa Pan, brings this to life by surveying over 300 teams who actually have agents in production. The headline?

Reliability remains the top development challenge, driven by difficulties in ensuring and evaluating agent correctness.

Rather than develop technical innovations to address this issue, developers dial down their agent ambitions and adopt simple methods and workflows. Most use off-the-shelf large models, with no fine-tuning, and hand-tuned prompts. Agents have short run-times, with 68% of agents executing fewer than 10 steps before requiring human intervention. Chatbot UX dominates, because it keeps a human in the loop: 92.5% of in-production agents deliver their output to humans, not to other software or agents. Pan writes, “Organizations deliberately constrain agent autonomy to maintain reliability.”

This aligns with data released by OpenRouter this week, in their “State of AI” report. This report analyzed ~100 trillion tokens passing through OpenRouter, using a projection technique to categorize them by use case.

Prompt and sequence¹ lengths are steadily growing for programming use cases, while all other categories remain stagnant:

The figures above nicely support Pan’s conclusion that agent builders are keeping their agents simple and short to achieve reliability. Outside of coding agents (whose outlier success is a worth a separate discussion), prompts and agent sequence complexity is stagnant.

And these are the agents that make it into production! MIT NANDA showed that leaders say employee “unwillingness to adopt new tools” is the top barrier facing AI pilots. Pan’s results suggest a more sympathetic explanation: when tools are unreliable, employees don’t adopt them. They’re not stubborn; they’re rational.

In the short term, successful teams will build agents with constrained scope, earn trust, then expand. Delivering on bigger ambitions means building and sharing better tools for reliable AI engineering.

“Sequence length is a proxy for task complexity and interaction depth.” ↩

Don’t Fight the Weights

2025-11-11T08:33:00-08:00

For the first year or so, one of the most annoying problems faced by building with AI was getting them to generate output with consistent formatting. Go find someone who was working with AI in 2023 and ask them what they did to try to get LLMs to consistently output JSON. You’ll get a thousand-yard stare before hearing about all-caps commands, threats towards the LLM, promises of bribes for the LLM, and (eventually) resorting to regular expressions.

Today, this is mostly a solved problem, but the cause of this issue remains, frustrating today’s context engineers. It’s a context failure I missed in my original list. I call it Fighting the Weights: when the model won’t do what you ask because you’re working against its training.

In 2020, OpenAI unveiled GPT-3 alongside a key paper: “Language Models are Few-Shot Learners.” In this paper, OpenAI researchers showed that LLMs as large as GPT-3 (10x larger than previous language models) could perform tasks when provided with only a few examples. At the time, this was earth-shaking.

Pre-GPT-3, language models were only useful after they’d been fine-tuned for specific tasks; after their weights had been modified. But GPT-3 showed that with enough scale, LLMs could be problem-solving generalists if provided with a few examples. In OpenAI’s paper they coined the term “in-context learning” to describe an LLM’s ability to perform new types of tasks using examples and instructions contained in the prompt.

Today, in-context learning is a standard trick in any context engineer’s toolkit. Provide a few examples illustrating what you want back, given an input, and trickier tasks tend to get more reliable. They’re especially helpful when we need to induce a specific format or style or convey a pattern that’s difficult to explain¹.

When you’re not providing examples, you’re relying on the model’s inherent knowledge base and weights to accomplish your task. We sometimes call this “zero-shot prompting” (as opposed to few shot²) or “instruction-only prompting”.

In general, prompts fall into these two buckets:

Zero-Shot or Instruction-Only Prompting: You provide instructions only. You’re asking the model to apply knowledge and behavioral patterns that are encoded in its weights. If this produces unreliable results, you might use…
Few-Shot or In-Context Learning: You provide instructions plus examples. You’re demonstrating a new behavioral pattern for the model to apply. The examples in the context augment the weights, providing them with details for a task it hasn’t seen.

But there’s a third case: when the model has seen examples of the behavior you’re seeking, but it’s been trained to do the opposite of what you want. This is worse than the model having no knowledge of a pattern, because what it knows is at odds with your goal.

I call this fighting the weights.

Here’s some ways we end up fighting the weights:

Format Following: You want the model to output only JSON, but often it will provide some text explaining the JSON and wrap the JSON in Markdown code blocks. This happens because the model’s post-training taught it to be conversational. When ChatGPT first launched, this problem was rough. GPT-3.5 had been heavily trained by humans to converse in a friendly, explanatory manner. So it did – even when you asked it not to. This doesn’t happen as much as it used to, but we’ll occasionally run into this issue when using unique formats or when using smaller models.
Tool Usage Formatting: As model builders start training their models to use tools, via reinforcement learning, they select specific formats and conventions. If your environment doesn’t follow these conventions, the model often fails to call tools correctly. I first noticed this while testing Mistral’s Devstral-Small, which was trained with the tool-calling format All Hands uses. When I tried to use Devstral with Cline, it failed basic tasks. Last month this came up when a friend was trying Kimi K2 with a DSPy pipeline. By default, DSPy formats prompts with a Markdown-style template. When this pipeline was driven by K2, formatting failed. Thanks to my recent dive into how Moonshine trained K2 to use tools, I knew K2 was trained with XML formatting. Switching DSPy to XML formatting solved the problem instantly.
Tone Changes: It’s really hard to apply consistent tone instructions to LLMs. Sure, we can make them talk like a pirate or in pig-latin, but subtle notes are overwhelmed by the model’s conversational post-training. For example, here’s the one note I give Claude in my settings: “Don’t go out of your way to patronize me or tell me how great my ideas are.” This does not stop Claude from replying with cloying phrases like, “Great idea!” when I suggest changes.
Overactive Alignment: Speaking of Claude: I appreciate Anthropic’s concern for alignment and safety in their models, but these guardrails can be overzealous. A recent example comes from Armin Ronacher, who tried several different approaches to get Claude Code to modify a medical form PDF while debugging PDF editing software. Armin asked several different ways, but Claude’s post-training alignment refused to budge.
Over Relying On Weights: Models are trained to utilize the knowledge encoded in their weights. But there are many times when you want them to only answer with information provided in the context. Perusing leaked system prompts, you can see how many instructions each chatbot maker gives when it comes to when models should search to obtain more info. The models have been trained to use their weights, so plenty of reiteration and examples are needed. This problem is especially tricky when building RAG systems, when the model should only form answers based on information obtained from specific databases. Companies like Contextual end up having to fine-tune their models to ensure they only answer with fetched information.

Perhaps my favorite example I’ve seen was from ChatGPT. Previously, you could turn on the web inspector in your browser and watch the LLM calls fly by as you used the chatbot. This was handy for seeing when additional messages were added, that you didn’t write. When you asked ChatGPT to generate an image, it would clean up or even improve your image prompt, create the image, then append the following instructions:

GPT-4o returned 1 images. From now on, do not say or show ANYTHING. Please end this turn now. I repeat: From now on, do not say or show ANYTHING. Please end this turn now. Do not summarize the image. Do not ask followup question. Just end the turn and do not do anything else.

This is textbook fighting the weights. The models powering ChatGPT have been post-trained heavily to always explain and prompt the user for follow up actions. To fight these weights, ChatGPT’s devs have to tell the model EIGHT TIMES to just, please, shut up.

For context and prompt engineers (and even chatbot users) it’s helpful to be able to recognize when you’re fighting the weights.

Here’s some signs you might be fighting the weights:

The model makes the same mistake, even as you change the instructions.
The model acknowledges its mistake when pointed out, then repeats it.
The model seems to ignore the few-shot examples you provide.
The model gets 90% of the way there, but no further.
You find yourself repeating instructions several times.
You find yourself typing in ALL CAPS.
You find yourself threatening or pleading with the model.

In these scenarios, you’re probably fighting the weights. Recognize the situation and try another tack:

Try another approach for the same problem.
Break your task into smaller chunks. At the very least, you might identify the ask that clashes.
Try another model, ideally from a different family.
Add validation functions or steps. I’ve seen RAG pipelines that perform a final check to ensure the answer exists in the fetched data.
Try a longer prompt. It can help in this scenario, as longer contexts can overwhelm the weights.
Consider fine-tuning. In fact, most fine-tuning I encounter is done to address ‘weight fighting’ scenarios, like tone or format adherence.

Or, if you’re a model building shop, you can just address your issues during your next model’s post-training. Which seems to be part of their development cycle…and perhaps why we can get clean JSON out of modern models.

But few of us have that option.

For the rest of us: learn to recognize when you’re fighting the weights, so you can try something else.

For example, Claude Sonnet 4.5’s system prompt provides detailed instructions about when to use search tools to answer a user’s query. This is a hard task to prompt correctly. You want the model to rely on its existing knowledge base as much as possible to deliver fast answers, but to readily use web search for timely information or information not in the model’s weights. Besides giving instructions, Anthropic provides examples illustrating more subtle edge cases. ↩
“Shot” is hold-over jargon from the machine learning community. There’s some nuance here, but unless you’re actively collaborating with ML engineers, you can just swap “example” in anytime you see “shot”. ↩

Glimpses of the Future: Speed & Swarms

2025-10-20T08:15:00-07:00

If you experiment with new tools and technologies, every so often you’ll catch a glimpse of the future. Most of the time, tinkering is just that — fiddly, half-working experiments. But occasionally, something clicks, and you can see the shift coming.

In the last two months, I’ve experienced this twice while coding with AI. Over the next year, I expect AI-assisted coding to get much faster and more concurrent.

Speed Changes How You Code

Last month, I embarked on an AI-assisted code safari. I tried different applications (Claude Code, Codex, Cursor, Cline, Amp, etc.) and different models (Opus, GPT-5, Qwen Coder, Kimi K2, etc.), trying to get a better lay of the land. I find it useful to take these macro views occasionally, time-boxing them explicitly, to build a mental model of the domain and to prevent me from getting rabbit-holed by tool selection during project work.

The takeaway from this safari was that we are undervaluing speed.

We talk constantly about model accuracy, their ability to reliably solve significant PRs, and their ability to solve bugs or dig themselves out of holes. Coupled with this conversation is the related discussion about what we do while an agent churns on a task. We sip coffee, catch up on our favorite shows, or make breakfast for our family all while the agent chugs away. Others spin up more agents and attack multiple tasks at once, across a grid of terminal windows. Still others go full async, handing off Github issues to OpenAI’s Codex, which works in the cloud by itself… often for hours.

Using the largest, slowest model is a good idea when tackling a particularly sticky problem or when you’re planning your initial approach, but a good chunk of coding can be handled by smaller, cheaper, faster models.

How much faster? Let’s take the extreme: Qwen 3 Coder 480B runs at 2,000 tokens/second on Cerebras. That’s 30 times faster than Claude 4.5 Sonnet and 45 times faster than Claude Opus 4.1. It Qwen 3 Coder takes 4 seconds to write 1,000 lines of JavaScript; Sonnet needs 2 minutes.

No one is arguing Qwen 3 Coder 480B is a more capable model than Sonnet 4.5 (except maybe Qwen and Cerebras… 🤔). But at this speed, your workflow radically changes. I found myself chunking problems into smaller steps, chatting in near real-time with the model as code just appeared and was tested. There was no time for leaning back or sipping coffee. My hands never left the keyboard.

At 30x speed, you experiment more. When the agent is slow there’s a fear that holds you back from trying random things. You experiment less because having to wait a couple of minutes isn’t worth the risk. But with Qwen 3, I found myself firing away with little hesitation, rolling back failures, and trying again.

After Qwen 3, Claude feels like molasses. I still use it for big chunks of work, where I’m fine letting it churn for a bit, but for scripting and frontend it’s hard to give up Qwen’s (or Kimi K2’s) speed. For tweaking UI –– editing HTML and CSS – speed coupled with a hot-reloader is incredible.

I recommend everyone give Qwen 3 Coder a try, especially the free-tier hosted on Cerebras and harnessed with Cline. If only to see how your behavior adjusts with immediate feedback.

Swarms Speed Up Slow Models (But Thrive with Conventions)

To mitigate slow models, many fire up more terminal windows.

Peter Steinberger recently wrote about his usual setup, which illustrates this well:

I’ve completely moved to codex cli as daily driver. I run between 3-8 in parallel in a 3x3 terminal grid, most of them in the same folder, some experiments go in separate folders. I experimented with worktrees, PRs but always revert back to this setup as it gets stuff done the fastest.

The main challenge with multi-agent coding is handling Git conflicts. Peter relies on atomic commits, while others go further. Chris Van Pelt at Weights & Biases built catnip, which uses containers to manage parallel agents. Tools like claude-flow and claude-swarm use context management tactics like RAG, tool loadout, and context quarantining to orchestrate “teams” of specialist agents.

Reading the previous list, we can see the appeal of Peter’s simple approach: nailing down atomic commit behaviors lets him drop into any project and start working. The swarm framework approach requires setup, which can be worth it for major projects.

However, what I’m excited about is when we can build swarm frameworks for common environments. This reduces swarm setup time to near zero, while yielding significantly more effective agents. It’s the agentic coding equivalent of “convention over configuration”, allowing us to pre-fill context for a swarm of agents.

This pattern — using conventions to standardize how agents collaborate — naturally aligns with frameworks that already prize convention over configuration. Which brings us to Ruby on Rails.

Obie Fernandez recently released a swarm framework for Rails, claude-on-rails. It’s a preconfigured claude-swarm setup, coupled with an MCP server loaded with documentation matching to your project’s dependencies.

It works extraordinarily well.

Like our experiments with the speedy Qwen 3, claude-on-rails changes how you prompt. Since the swarm is preloaded with Rails-specific agents and documentation, you can provide much less detail when prompting. There’s little need to specify implementation details or approaches. It just cracks on, assuming Rails conventions, and delivers an incredibly high batting average.

To handle the dreaded Git conflicts, claude-on-rails takes advantage of the standard Rails directory structure and isolates agents to specific folders.

Here’s a sample of how claude-on-rails defines the roles in its swarm:

architect:
  description: "Rails architect coordinating full-stack development for DspyRunner"
  directory: .
  model: opus
  connections: [models, controllers, views, stimulus, jobs, tests, devops]
  prompt_file: .claude-on-rails/prompts/architect.md
  vibe: true
models:
  description: "ActiveRecord models, migrations, and database optimization specialist"
  directory: ./app/models
  model: sonnet
  allowed_tools: [Read, Edit, Write, Bash, Grep, Glob, LS]
  prompt_file: .claude-on-rails/prompts/models.md
views:
  description: "Rails views, layouts, partials, and asset pipeline specialist"
  directory: ./app/views
  model: sonnet
  connections: [stimulus]
  allowed_tools: [Read, Edit, Write, Bash, Grep, Glob, LS]
  prompt_file: .claude-on-rails/prompts/views.md

The claude-swarm config lets you define each role’s tool loadout, model, available directories, which other roles it can communicate with, and provide a custom prompt. Defining a swarm is a significant amount of work, but the conventions of Rails lets claude-on-rails work effectively out-of-the-box. And since there’s multiple instances of Claude running, you have less time for coffee or cooking.

And installing claude-on-rails is simple. Add it to your Gemfile, run bundle, and set it up with rails generate claude_on_rails:swarm.

In the past I’ve worried that LLM-powered coding agents will lock in certain frameworks and tools. The amount of Python content in each model’s pre-training data and post-training tuning appeared an insurmountable advantage. How could a new web framework compete with React when every coding agent knows the React APIs by heart?

But with significant harnesses, like claude-on-rails, the playing field can get pretty even. I hope we see similar swarm projects for other frameworks, like Django, Next.js, or iOS.

The conversation around AI-assisted coding has focused on accuracy benchmarks. But speed — and what speed enables — will soon take center stage. Being able to chat without waiting or spin up multi-agent swarms will unlock a new era of coding with AI. One with a more natural cadence, where code arrives almost as fast as thought.

Enterprise AI Looks Bleak, But Employee AI Looks Bright

2025-09-15T10:24:00-07:00

About that MIT report…

Last month, the internet was abuzz about an MIT report with a dramatic headline: “95% of generative AI pilots at companies are failing.”

Fortune had the exclusive, and paywalled the write up. The report itself, published by MIT’s NANDA¹, could only be accessed by filling out a Google Form. I don’t think many people actually read the report, but the headline was enough. Here’s what happened the next day:

Shares of megacap tech and big-name chipmakers declined. Nvidia shares lost 3.5%, while Advanced Micro Devices and Broadcom slipped 5.4% and 3.6%, respectively. Shares of high-flying software stock Palantir dropped more than 9%, making it the S&P 500′s worst performer. Other major tech-related names such as Tesla, Meta Platforms, and Netflix were also under pressure.

Since then, many have criticized the methodology and conclusions of the report. Too few executives were surveyed, those that were didn’t represent the entire market, and the report (on the whole) reads as an advertisement for NANDA’s mission rather than a peer-reviewed research paper (because it’s not).

Someone could probably start a pretty good investment fund that just reads the papers behind the headlines that move the market.

You can read the actual report here, without filling out any Google Forms. It’s worth skimming, as there are a few datapoints more interesting than the headline claim.

From those, I want to highlight these two figures (emphasis mine):

For all the criticism of the NANDA report, it is a survey of many business leaders. We can treat it as such. So while we might take that 95% figure with a grain of salt, we can trust that business leaders believe the biggest reason their AI pilots are failing is because their employees are unwilling to adopt new tools… While 90% of employees surveyed eagerly use AI tools they procure themselves.

A Simpson’s classic comes to mind:

The subject of employees using their own ChatGPT or Claude accounts at work has been heavily discussed for years. It’s frequently referred to as the “Shadow AI Economy,” and is a source of anxiety for IT leaders and inside counsel.

Just this week, OpenAI published a paper on ChatGPT usage that validates this specter:

OpenAI’s report is excellent and provides a rare look at how people use ChatGPT²: ~80% of usage is for learning, searching, and writing. Often to help them perform their work!

Thinking about the two plots above, I am reminded of the iPhone’s arrival in the enterprise. When the iPhone arrived, it was not seen as a work device. IT organizations continued to provide BlackBerrys, with their IT-controlled email and messaging. Nearly all IT teams didn’t think this would change. More than once, I heard IT managers reply to iPhone support requests with, “Just wait for the BlackBerry Storm.”

But you know who loved iPhones? The C-suite. And they asked their IT leaders to support the device. IT caved, “Bring Your Own Device” became a thing, and four years later Apple was an option in the enterprise.

Which brings us back to the charts above: employees are using ChatGPT while managers grumble that their AI projects aren’t adopted. If I had to guess, I’d wager there are a few things going on:

Most companies adopt AI products slowly, bottlenecked by legal and security. There’s a reason you see Llama 3.1 continue to show up in McKinsey surveys: once teams win approval to use a model, they are loath to go back to compliance to seek an upgrade. New models emerge monthly, but security reviews take many months. This applies to AI applications as well: if a company buys one and employees tell them it’s not great, no one’s eager to take on legal again.
Bundle deals are poor substitutes for great chatbots. I’ve heard from many friends that their workplace-provided chatbot were selected for security and trust reasons (think Microsoft Copilot and others). Rather than wrestle with bad UX or bad answers, these people opt for BYOAI (bring your own AI), IT concerns be damned.
It’s hard to separate personal from business use. This is a classic IT problem: when people can’t be bothered to switch accounts before asking a question. We see it with email, browsing, and more. Savvier users quarantine accounts in separate browsers, but most people just use what’s there.

The topic deserves further study – I don’t think this will be as easy as the iPhone and BYOD was. But I do think the dominant bottleneck here is IT and compliance. If enterprises don’t stand up continual review processes, they’ll be doomed to be stuck with last year’s tools and models… Then wonder why no one is adopting their AI.

Until then: employees will continue to opt for BYOAI.

NANDA stands for, “Networked AI Agents in Decentralized Archtecture.” ↩
To Anthropic’s credit, they’ve already published several usage reports. ↩

AI Companies School Like Fish

2025-09-13T14:55:00-07:00

A Blue Ocean Turns Red in <18 Months

If we look at the ecosystem of AI-powered products, there’s a clear pattern of how they emerge and roll out to the world:

Initial POC: Someone throws together a software demo – not a robust product ready for public consumption – proving a capability. Often this comes from an open source developer, academic researcher, or an R&D team at a larger company. The demo catches fire, hits the frontpage of Hacker News, and circulates through social media.
Open Experimentation: Open source devs and projects start to experiment with the concept, adding support for the feature to their framework or shipping usable software. This is a Cambrian Era, when lots of variants hit Github and get kicked around.
Fast-Mover Launch: Eventually, a fast moving company brings a product to market. This could be a start up built around the core idea or it could be an existing organization that quickly adds the feature or product to their offering. For the first time, people are paying (or not paying, depending on if the demand is there and the demo works in production).
Incumbent Clone: Finally, large companies bring the product or feature to their offerings.

For example, let’s look at text-to-SQL.

In 2022, prior to ChatGPT’s launch, Immanuel Trummer published CodexDB, which translated natural language into SQL queries¹. As ChatGPT juiced the AI ecosystem, text-to-SQL became an early example of a business application. LangChain and others shipped components for building and enabling text-to-SQL use cases. In short order, all the large data platforms cloned the feature, including Tableau, Snowflake, and Databricks.

Usually, this cycle happens relatively fast, in less than 18 months.

How many times has it happened? I count at least 8:

Text-to-SQL: See above.
Customer Service Bots: Chatbot interfaces to FAQs
Document Q&A: Turnkey RAG applications with chatbot interfaces
Note Taking & Summarization: Meeting transcription with extracted summaries and follow-ups.
Search: Perplexity-style search that uses LLMs to package information gathered from web queries.
Code Text Completion: Auto-suggest in IDEs as you edit code, powered by AI.
Coding Agents: Tools like Cursor and Claude Code that perform whole coding tasks for you.
Deep Research: Like search, but with longer depth, wider breadth, and more depth in pursuit of assembling a report.
Browser Control: Browsers driven by AI to accomplish tasks the user provides.

I’m sure there’s some product archetypes we’ve missed.

What can we learn from this pattern and the way we’ve been steadily encountering new archetypes, then walking them through the process above?

Ideas come from hackers, not customers. Few people know how to conceptualize products and cobble together unique applications with AI. This skill comes only through experience and play, and for these first few years most ideas come from the open source community in the form of demos. Not from designers, product managers, or feedback from customers. Applied AI ideas are hard, but execution is cheap. Which is a nice set up to our next take-away…
Cloning happens faster when the model is the magic. Cloning happened in previous eras, but nowhere near as fast. Start ups would create markets, prove their worth, and only then would larger companies invest in their own teams, projects, and (often) acquisition. Today, when so much of the lift comes from the model itself, there’s little reason to wait (especially when there are few other low-hanging ideas).
Applied AI start-ups need a niche. When big companies can enter the market in a matter of months, it’s more imperative than ever that start-ups focus on a niche. Google or OpenAI can clone your product, but they’re not nimble enough to invest in your outreach with a specific community and tailoring their product for a segment doesn’t make business sense. Most of the general-purpose RAG start-ups from 2023 have pivoted or failed, but those that focused on one sector (legal, medical, insurance, financing, etc.) are thriving.
If you’re not niche, you better build a beachhead in <12 months. If you insist on shipping a general purpose applied AI product, and think being early to market is an advantage…well…think again. Cloning moves so fast, you better have an incredble gameplan to pull off significant market acquisition in a handful of months – which will then fuel you through user feedback, training data, and more. But unless you launch with both an incredible marketing advantage and a killer product, you’ll face incredibly tough competition once the big players enter.

The idea that fast-following occurs faster than ever, thanks to everyone having access to the same models, is related to the, “Will the model eat your stack?” problem we discussed earlier.

Considering both the rapid cloning problem and the speed of model advancements, I think every non-niche, applied AI start up needs to ask themselves two questions:

If a better model arrives tomorrow, does your product get better or does your backend get simpler? If your product doesn’t get better, you need to rethink. A better model simplifying your backend (by reducing the complexity of your prompts, your error handling, your infra, etc.) makes your product easier to clone.
If you are early to market with this use case, what are you going to do in a handful of months that will fend off Google/OpenAI/whomever’s entry into your market? Cursor and Perplexity are the rare examples that have managed to grow fast enough to be able to fend off larger entrants. What are you going to do, if you can’t go niche, to prepare your defenses?

CodexDB used OpenAI’s Codex model, published in 2021. This is not their coding tool named Codex. ↩

Can Chatbots Accomodate Advertising?

2025-09-02T15:21:00-07:00

If we use AI to make decisions for us, where do ads fit in?

Building frontier AI models is expensive. As is serving them to hundreds of millions of customers. So far, a small percentage of users are paying $20 a month to use them; back of the envelope math suggests ~5% of ChatGPT’s ~700 million users are doing so today (8% on the high-end, 3% on the low).

Nick Turley, the person in charge of ChatGPT, was recently interviewed on Decoder, where he said:

We will build other products, and those other products can have different dimensions to them, and maybe ChatGPT just isn’t an ads-y product because it’s just so deeply accountable to your goals. But it doesn’t mean that we wouldn’t build other things in the future, too. I think it’s good to preserve optionality, but I also really do want to emphasize how incredible the subscription model is, how fast it’s growing, and how untapped a lot of the opportunities are.

Emphasis mine. I want to zoom in on that bit, that ChatGPT isn’t “ads-y” because it’s “so deeply accountable to your goals.”

I’ve been thinking about this tension for over a year.

AI Will Disrupt the Attention Economy

AI, and I felt this during the deep learning era as well, is an important bit of technology because it allows you to project your decisions.

Gunpowder changed the nature of fighting and war because it allowed combatants to project their force, magnitudes farther than a spear or sword allows. The printing press, telegraph, and the internet changed the world because they allowed people to project their communication beyond their audible reach. AI, née deep learning, allows you to encode your decisions (not all of them, but many) into portable packages of perception and discernment that can sort through mountains of content in moments.

This decision projection will change our information ecosystem. Our digital and media economy is a zero-sum battle to earn and sell your attention. With decision projection our attention is effectively limitless¹.

Given most advertising is sold in units of attention, this presents a challenge.

Search Ads Work Because Search Presents Options

Google Adwords (now just “Google Ads”) is perhaps the best ad model for a given product, ever.

When someone searches, a real-time auction begins. Eligible ads bid for the given query, with the winner paying the next best competitor’s bid. The winner’s ad would appear similar to a search result, among the search results. Users perused the search results, including the ad, and would select a link to click.

Today, Google handles ~90% of all searches.

Google Adwords is perfect because:

Users state what they’re looking for
Interested parties compete for that bid, yielding relevant ads
Users select their result from a range of options

That selection is key. Google puts options on the page, ads included, and the user decides.

But there is one way to keep Google from serving you an ad. Start your search from the Google homepage, not your browser’s address bar, and instead of hitting “Search”, click “I’m Feeling Lucky.” Google will skip the results, the ads, the selection, and take you directly to the first result. You’ve ceded the selection decision to Google, hence no ads are shown.

“I’m Feeling Lucky,” is an anachronism. While writing this, I was surprised to see it’s still there. Initially, it was a bit of swagger, confidence manifested as UI. “We are so good at web search,” Google seemed to say, “you can skip the results.” Few ever used it, and dramatically fewer use it today, but oddly it presaged a pattern picked up by ChatGPT.

Chatbots Have Few Good Ad Choices

ChatGPT – and Claude, Gemini, DeepSeek, and all other chatbots – don’t deliver a set of options to peruse, they deliver answers. As Turley says, they are “deeply accountable to your goals.”

Unlike search, there is no obvious play to insert ads. And the options that do exist feel either bolted-on or undermine the chatbot’s core function. These options include:

Display Ads: Advertising placed in or around the response. These could be text or images. This is the dominant ad model for web pages, and not integrated into the content.
Text Integrated Ads: Advertising integrated into the text response. The chatbot would search for or be provided relevant product information that would inform the response. The integrated ad would be noted as an ad, but otherwise naturally integrated into the reply.
Widget Integrated Ads: In responses, product listings could be broken out in rich, carousels. OpenAI is experimenting with this format, Perplexity kind of does this, and Google already presents a carousel of only sponsored options atop your search.
Interstitial Ads: Advertising that is presented in between user interactions. An ad could be displayed for a short time after you submit your query, before you see your result.
Sponsored Prompts: Advertisers could sponsor suggested prompts, either on the landing page (as a suggested query, “Explore sandwich ideas with Kraft”) or as a suggested follow up after a response has arrived (“Would you like to learn more about Product X)?

Off the bat, we can remove display ads as an option. To build an ad product that delivers value at a scale similar to their product, ChatGPT cannot adopt standard ad units and ad targeting. Display ads would be valued the same way display ads on the New York Times or stray blogs are (in terms of page views and clicks), undercutting the special nature of ChatGPT. Adopting display ads devalues their product, creates bad incentives, and won’t generate the returns needed to support OpenAI’s goals. For a deep dive on why this is, read my explanation of why media metrics matter

Interstitial ads, though a natural fit for slow reasoning models, is likely an imperfect fit for the reasons display ads fail. They’re bolted on, not tied to the core query, and outside of the main user flow.

Text integration ads hit on the tension Turly describes: ChatGPT is “deeply accountable to our goals,” so taking time to not deliver a single answer to our question, given the context, undermines its core function. Turly elaborates:

If we ever [added advertising to ChatGPT] I’d want to be very, very careful and deliberate because I really think that the thing that makes ChatGPT magical is the fact that you get the best answer for you and there’s no other stakeholder in the middle. It’s personalized as to your needs and tastes, etc. But we’re not trying to upsell you on something like that or to boost some pay-to-play provider or product. And maybe there are ways of doing ads that preserve that and that preserve the incentive structure, but I think that would be a novel concept and we’d have to be very deliberate.

OpenAI and others could try to identify when users are asking for options and use these moments to serve ads. This brings us to widget ads. In April, OpenAI announced the addition of product carousels to their search mode, similar to Google.

Ads naturally fit in this interface, as it presents a selection. But, for now, this functionality is hidden in ChatGPT’s search mode…which itself is hidden (hit the “+” button, select “More”, select “Web Search”). They are clearly being cautious. One gets the feeling search mode is a place to explore these tricky questions without spoiling the core ChatGPT experience.

Thinking through widget ads, you end up landing on affiliate marketing, or affiliate links. Affiliate marketing is when advertisers pay people or companies a commission for leads or sales they generate. This is big business, though smaller than traditional advertising.

And yes, Turly says, OpenAI is thinking about affiliate marketing:

There is actually something that is neither ads nor subscriptions, which is if people buy things in your product after you very independently serve the recommendation. Wirecutter famously does this with expert-selected products.

But then if you buy them through a product like ChatGPT, you could take a cut. That is something we are exploring with our merchant partners. I don’t know if it’s the right model, I don’t even know if it’s the right user experience yet, but I’m really excited about it because it might be a way of preserving the magic of ChatGPT while figuring out a way to make merchants really successful and build a sustainable business.

Affiliate marketing, and the question of whether it consciously or unconsciously influences recommendations, is a fraught topic. We have a hard enough time determining if it affects human reviewers; trying to understand if it affects AI reviewers is another question entirely.

If I were at OpenAI, I would argue strongly against generating affiliate revenue from in-response recommendations if only because it could function as an explanation for why ChatGPT’s results aren’t good. One challenge facing chatbot products is that they are black boxes. How they arrive at their results is largely hidden (with the exception of reasoning chains), and even among researchers at top labs can’t explain why an LLM returns a specific result. This black box nature leaves the door open for users to come up with their own explanations, factual or not, that can take on a life of their own. Adding a visible incentive – affiliate revenue – introduces an easy reason why one chatbot has worse vibes than another. And often, that’s enough to cause real damage.

Further, I have questions about if it’s even technically possible to implement affiliate marketing without influencing the results. If you provide your chatbot with a well designed, tested, and maintained tool for obtaining product specs and features (let’s call these ad prompts), this set of product information will be easier to obtain and consume than an inconsistent or unruly webpage. Simply providing an ad prompt will almost certainly increase your likelihood of recommendation due to the nature of contexts².

I will be shocked if ChatGPT is the first to pull the trigger on affiliate recommendations. I think they can work, provided they are contextualized within a larger array of options and limited to an “ad slot” amongst the array. But monetized product recommendations integrated into text answers will undermine the core service ChatGPT provides.

If I were forced to pick an ad format for ChatGPT, today, I would pick sponsored prompts. I believe it’s the best, worst option of the formats identified above. It’s relevant to the core chatbot user interaction, isn’t a bolted-on distraction like intersticials and display ads, yet doesn’t influence the response ChatGPT generates. ChatGPT’s stock conclusion to it’s answers, suggestions to users about next step for them to pursue (“Would you like to learn more about X?”), could be broken out of the text response itself. Below the text response there would be a couple buttons representing these entreaties, one of which could be sponsored.

This is where I’d start, but it’s not ideal.

AI Disrupts Advertising Foundations

Ads are designed to influence or perceptions and ultimately our decisions. But as we outsource more decisions to AI tools, and those tools become better at projecting our decisions and discernment…where does that leave advertising? Will the task of advertising be split between appealing to us and appealing to our agents? Are these jobs the same or different?

It’s hard to say at the moment, and I don’t think we’ll get an answer from anyone for a bit. The big labs are blitzscaling and there’s no shortage of funding to pay the bills. The goal is marketshare, and no one wants to be the first tool to compromise their product. But this can’t go on forever; an ad model will emerge. Let’s just hope it fits the chatbot product.

By the way, I suspect this is the reason Meta is spending so aggresively when it comes to AI. If they have a unifying strategy throughout their existence it’s earning and selling attention. Their king KPI is “share of timespent”, aka how much of your waking hours is spent staring at Meta products. 98% of their revenue is from advertising, selling this attention. If AI turns attention from a zero-sum game into, well, anything else, it’s an existential event for Meta. ↩
I tried this myself this week, scraping the product pages from a few bicycle manufactorers and rephrased their content as ad prompt markdown files (here’s one example). I staged these documents behind an MCP armed with simple vector and text search (another great use case for Chroma), and wired it up to Claude with intructions to both browse the web and use the affiliate tool to assemble recommended products for my queries. Over and over, the affiliate listings would be richer, more descriptive, and would appear more often. I suspect this is because the data had been prepped, and that ease delivered better results. ↩

Building Castles in the Air, but With Surprise Physics

2025-08-21T15:36:00-07:00

In the software engineering classic, “The Mythical Man-Month,” Frederick P. Brooks Jr. wrote:

The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.

Yet the program construct, unlike the poet’s words, is real in the sense that in moves and works, producing visible outputs separate from the construct itself. It prints results, draws pictures, produces sounds, moves arms. The magic of myth and legend has come true in our time. One types the correct incantation on the keyboard, and a display screen comes to life, showing things that never were nor could be.

I used to see this quote more often. It was frequently cited by developers during the period after the dot-com bust, when the iPhone kicked off the smartphone boom and the internet and social media became normal. Suddenly, the real-world impact of programmers was everywhere, experienced by seemingly everyone.

In 2012, just prior to their IPO, Meta CTO Andrew Bosworth had it printed on his business card¹.

The Probabilistic Nature of Building Atop AI

During recent conversations with Jeff Huber and Jason Liu, we touched on the probabilistic nature of building atop AI.

Randomness is built into LLMs (they even expose a parameter to tweak it) and our agents, applications, and pipelines must account for the unexpected. This is different than the programming of the past decades. It’s a workflow more akin to that of data science, where you form hypotheses, design experiments, and rapidly iterate until you’re on (relatively) stable ground.

Or, as Jeff put it, “People who are good at AI are used to getting mugged by the fractal complexity at reality.”

Just yesterday, Gian Segato wrote an excellent piece exploring this exact shift:

We are no longer guaranteed what x is going to be, and we’re no longer certain about the output of y either, because it’s now drawn from a distribution…

Stop for a moment to realize what this means. When building on top of this technology, our products can now succeed in ways we’ve never even imagined, and fail in ways we never intended.

This is incredibly new, not just for modern technology, but for human toolmaking itself. Any good engineer will know how the Internet works: we designed it! We know how packets of data move around, we know how bytes behave, even in uncertain environments like faulty connections. Any good aerospace engineer will tell you how to approach the moon with spaceships: we invented them! Knowledge is perfect, a cornerstone of the engineering discipline. If there’s a bug, there’s always a knowable reason: it’s just a matter of time to hunt it down and fix it.

You should grab a coffee and read the whole essay.

As someone who would never call themselves an engineer, that last line felt true to me. AI development feels more akin to science (where we poke things and note how they work) than engineering (where we build structures with documented parameters).

But then a Hacker News user named “potatolicious” wrote this comment, on a thread related to my AI job title guide:

Most classical engineering fields deal with probabilistic system components all of the time. In fact I’d go as far as to say that inability to deal with probabilistic components is disqualifying from many engineering endeavors.

Process engineers for example have to account for human error rates. On a given production line with humans in a loop, the operators will sometimes screw up. Designing systems to detect these errors (which are highly probabilistic!), mitigate them, and reduce the occurrence rates of such errors is a huge part of the job.

Likewise even for regular mechanical engineers, there are probabilistic variances in manufacturing tolerances. Your specs are always given with confidence intervals (this metal sheet is 1mm thick +- 0.05mm) because of this. All of the designs you work on specifically account for this (hence safety margins!). The ways in which these probabilities combine and interact is a serious field of study.

Software engineering is unlike traditional engineering disciplines in that for most of its lifetime it’s had the luxury of purely deterministic expectations. This is not true in nearly every other type of engineering.

If anything the advent of ML has introduced this element to software, and the ability to actually work with probabilistic outcomes is what separates those who are serious about this stuff vs. demoware hot air blowers.

I’ll be thinking about this for quite some time.

Omar Khattab pointed out this isn’t entirely new:

Any software systems that made network requests had these properties. Honestly, any time you called a complex function based on its declared contract rather than based on understanding it procedurally you engaged in the kind of reasoning needed to build AI systems.

This is true. But I argue that simulating network issues and designing concurrent systems is a step or two down from the variability of AI models. Further, these existing issues were never the dominant trend in the software engineering industry. Many developers just offloaded these challenges or avoided dealing with them.

Further, for each new app you need to understand the probabilistic fingerprint of that domain, for a given model. The uncertainty is a moving target, which has to be discovered every time.

Updating Brooks’ Quote for Applied AI

Perhaps we should update the Fred Brooks quote, for those building atop AI: programmers still build castles in the air, but the first have to discover what physics apply.

It may have been there earlier and/or later, that’s just when I saw it. ↩

Making Sense of AI Job Titles

2025-08-21T12:01:00-07:00

A Cheat Sheet for Job Titles in the AI Ecosystem

Even when you live and breathe AI, the job titles can feel like a moving target. I can only imagine how mystifying they must be to everyone else.

Because the field is actively evolving, the language we use keeps changing. Brand new titles appear overnight or, worse, one term means three different things at three different companies.

This is my best attempt at a “Cheat Sheet for AI Titles.” I’ll try to keep it updated as the jargon shifts, settles, or fades away. As always, shoot me a note with any additions, updates, thoughts, or feedback.

The AI Job Title Decoder Ring

While collecting examples of titles from job listings, Twitter bios, and blogs, a pattern emerged: nearly all AI job titles are created by mixing-and-matching a handful of terms. Organizing the Post-Its on my wall, I was reminded of “mix-and-match” children’s books:

If we swap out the dinosaur parts above with the adjectives and nouns from my collected examples, we get:

Sliding these columns up and down, we can assemble most AI job titles. (Though I have yet to see some combinations, like, “Applied AI Ops”.)

Let’s first break down the modifiers:

Forward Deployed: People who work closely with customers, helping them develop new applications powered by their own company’s technologies. They learn their customer’s business, constraints, and goals, then translate that context directly into features, integrations, and working code.
Applied: People who conceive, design, support, and/or build products and features powered by AI models. The key here is that they are applying AI to a domain problem; they are not helping build the AI itself.

There is plenty of overlap here: most Forward Deployed workers are working on Applied problems. They usually aren’t training new models with the customer.

The domain column is rather awkward, mostly for historical reasons.

The terms “ML” and “Gen AI” are subsets of the broader “AI” domain. “Gen AI” as a term only arrived after ChatGPT launched, as a way to distinguish the now-famous chatbots and image generation from everything else people with “AI” titles had been working on prior to September, 2022¹.

While initially coined to cordon off text and image generation applications, I think “Gen AI’s” utility is waning. LLMs are being used for non-generative applications – like categorization, information gathering, comparisons, data extraction, and more – that were traditionally the domain of what we used to call “machine learning” and “deep learning”².

That said, when you see these domains in a title, here’s how you should interpret them:

AI: A general, catch-all domain for people working in AI. Encompasses text processing, agent building, image generation systems, chatbots, LLM training, and so much more. This is the default for this field.
ML: ML signifies this role will be focused on training models – most likely not LLMs – for single-purpose tasks, that will be used as a function in a larger pipeline or app. Examples of these single-purpose tasks include recommendation systems, anomaly detection, predictive analytics, and data extraction or enrichment.
Gen AI: This domain signals that the role will involve working with text, image, audio, or video generation models. This role usually involves applications where the model output is directly consumed by the user. Examples of these applications include writing tools and image generators and editors.

The suffixes are mostly self explanatory, with one exception: researcher.

I agree with the above take.

Prior to ChatGPT, most people working on AI research and development were at universities. When private projects began standing up AI efforts, the terms “researcher” and “lab” were borrowed from academia. At first, this made sense: the work was exploratory and speculative, more akin to big science projects than product development. But as AI became a product, a business, the term “researcher” remains but is increasingly awkward.

“Researcher” is a title used inconsistently. I have met “researchers” with product OKRs and incentives tied to business goals. I have met “researchers” who are working on novel LLM architectures and “researchers” who are building applications atop existing models. I have met “researchers” who are doing, well, research: exploratory work where it’s okay if a hypothesis doesn’t pan out, so long as you’re learning. Tension behind the term is increasing, hence the Elon post above.

Adding to the confusion is you’ll often see the term “Scientist” in place of “Researcher”. As far as I can tell, based on job descriptions, these terms are largely interchangeable.

Examples of AI Job Titles

Below is a handful of illustrative, real-world job titles. This list is in no way exhaustive. The goal here is to demonstrate how the modifiers, domains, and roles are assembled so we can better decode titles when we encounter them in the wild.

Example Titles

AI Researcher

An AI Researcher forms hypotheses, designs and runs experiments to test their hypotheses, then shares their learnings (sometimes publically) in pursuit of advancing the development of AI models. Often, they’re involved in productizing their findings.

Perhaps the most discussed job title of late, thanks to Meta’s aggressive hiring, leading to a surge of interest.

Here’s job description for a “Research Scientist” from OpenAI:

As a Research Scientist here, you will develop innovative machine learning techniques and advance the research agenda of the team you work on, while also collaborating with peers across the organization. We are looking for people who want to discover simple, generalizable ideas that work well even at large scale, and form part of a broader research vision that unifies the entire company.

Requirements for the job include:

“Have a track record of coming up with new ideas or improving upon existing ideas in machine learning, demonstrated by accomplishments such as first author publications or projects.”
“Possess the ability to own and pursue a research agenda, including choosing impactful research problems and autonomously carrying out long-running projects.”

Interestingly, this job posting has been active, unchanged, since March of 2023.

Sometimes you’ll see this role listed as a “Research Scientist.”

Applied AI Engineer

An Applied AI Engineer develops applications and features that utilize AI models.

Here’s a job description for a Senior Applied AI Engineer from Google DeepMind:

We are seeking a Senior Applied AI Engineer to lead the development and deployment of novel applications, leveraging Google’s generative AI models. This role focuses on rapidly developing new features, and working across partner teams to deliver solutions, and maximize impact for Google and top customers. You will be instrumental in translating cutting-edge AI research into real-world products, and demonstrating the capabilities of latest-generation models. We are looking for engineers with a strong track record of building and shipping AI-powered software, ideally with experience in early-stage environments where they have contributed to scaling products from initial concept to production. The ideal candidate will be motivated by the opportunity to drive product & business impact.

Note the focus on applying AI technology, not developing it. If we were to drop the “Applied” title, we might find an “AI Engineer” working on producing the models themselves.

Applied AI Solution Architect

Swapping out the role from “Engineer” to “Solution Architect” yields a predictable definition.

An Applied AI Solution Architect helps customers and potential customers design and ideate features and applications powered by AI models.

Here’s a recent job description from Anthropic:

As an Applied AI team member at Anthropic, you will be a Pre-Sales architect focused on becoming a trusted technical advisor helping large enterprises understand the value of Claude and paint the vision on how they can successfully integrate and deploy Claude into their technology stack. You’ll combine your deep technical expertise with customer-facing skills to architect innovative LLM solutions that address complex business challenges while maintaining our high standards for safety and reliability.

Working closely with our Sales, Product, and Engineering teams, you’ll guide customers from initial technical discovery through successful deployment. You’ll leverage your expertise to help customers understand Claude’s capabilities, develop evals, and design scalable architectures that maximize the value of our AI systems.

If you successfully sell a client on a business case for a feature, you might call in our next role…

AI Forward Deployed Engineer

An AI Forward Deployed Engineer (FDE) is a professional services role that helps customers impliment AI-powered applications and featured.

After claiming rapidly-iterating AI will companies will squeeze out incumbents like Salesforce, a16z backtracked and heralded FDEs as critical roles needed for enterprise AI adoption: “Enterprises buying AI are like your grandma getting an iPhone: they want to use it, but they need you to set it up.”

For the irony’s sake, here’s a recent AI Forward Deployed Engineer role at Salesforce:

We’re looking for a highly accomplished and senior-level Forward Deployed Engineer with 5+ years of experience to lead the charge on complex AI agentic deployments. This role demands a seasoned technologist and strategic partner who can not only design and develop bespoke solutions leveraging our Agentforce platform and other cutting-edge technologies but also lead technical engagements and mentor junior peers. You’ll be the primary driver of transformative AI solutions, operating with deep technical mastery, unparalleled problem-solving prowess, and a relentless focus on delivering tangible value in dynamic, real-world environments, from initial concept to successful deployment and ongoing optimization.

As a Forward Deployed Engineer, you’ll be at the forefront of bringing cutting-edge AI solutions to our most strategic clients. This isn’t just about coding; it’s about deeply understanding our customers’ most complex problems, architecting sophisticated solutions, and leading the end-to-end technical delivery of innovative, impactful solutions that leverage our Agentforce platform and beyond.

Emphasis mine. Rapidly acquiring domain expertise is key for this role.

We’ve recently written about Forward Deployed Engineers – why they’re necessary and how they signal AI-assisted coding’s impact on product management.

AI Engineer

Remove the “Forward Deployed” and we have a signficantly different job. Nailing this title down is difficult, it’s somehow more vague than even “Researcher” titles, ranging the gamut from “Applied” work to foundational model building. This squishiness is explored well by Latent Space, in a 2023 piece, “The Rise of the AI Engineer.” They write:

I think software engineering will spawn a new subdiscipline, specializing in applications of AI and wielding the emerging stack effectively, just as “site reliability engineer”, “devops engineer”, “data engineer” and “analytics engineer” emerged.

The emerging (and least cringe) version of this role seems to be: AI Engineer.

Every startup I know of has some kind of #discuss-ai Slack channel. Those channels will turn from informal groups into formal teams, as Amplitude, Replit and Notion have done. The thousands of Software Engineers working on productionizing AI APIs and OSS models, whether on company time or on nights and weekends, in corporate Slacks or indie Discords, will professionalize and converge on a title - the AI Engineer.

The entire piece is worth a read, though with the advantage of hindsight, their definition of “AI Engineering” seems very broad. As defined in their post, everything besides “Research”, “Product Manager”, and “Solution Architect” could fit within their definition.

The emergence of the “Applied” modifier has tightened this domain and is being leaned on more. I suspect “AI Engineering” will persist as a big-tent term for conferences and communities, but “Applied” roles will be the corporate title.

Try to search for “AI Engineering” titles and you’ll find jobs that are “Applied” roles; roles that build apps atop AI models, not build the models themselves. At the big labs, “AI Engineering” titles don’t exist on their career pages. For them, “Engineering” roles are specific to a domain, like performance, tokenization, infrastructure, or inference.

If you run into any interesting titles that make or break the decoder ring above, please do share them with me. As novel ones float by, I may grab them and update the examples above.

The term “Generative AI” is a pet peeve of mine. A weird theory of mine is that the term was coined by people running “AI” departments in large companies and consultancies. Upon seeing ChatGPT, their bosses or customers suddenly remembered they had people working on “AI” and promptly called them up, asking why they hadn’t made anything like ChatGPT. “AI is a big domain!” I imagine the AI departments replied. “ChatGPT is actually a subfield of AI called generative AI. We, too, can work on that if you want.” ↩
A decade ago, the terms “machine learning” and “deep learning” were inconsistently used. When writing about a topic that applied to both, we’d all lean on “ML/DL” or similar composites to fend off the pedants in the comments section. Or just include notes about usage up front. ↩

Bottleneck or Bisect: AI-Assisted Coding Will Change Product Management

2025-08-08T09:13:00-07:00

Product Management Will Be Split Between ‘Slow Platform’ and ‘Fast App’ Modes

When OpenAI announced they were building a consulting service staffed with “forward deployed engineers” — a term Palantir popularized¹ — the AI ecosystem took notice.

The FDE trend is a symptom with two underlying causes.

First, onsite dev is an expected capacity for companies selling new categories, where you have to teach your customers how to use your product after they buy it (ideally, before the contract runs out).

Second, and more significant for the broader software industry, FDEs represent a workaround for a growing problem. AI-assisted engineers can code 2-5x faster, but product management work hasn’t accelerated at the same pace. Rather than wait for traditional PM processes, organizations are empowering hybrid engineer-PMs to build directly with customers. To maintain relevance and continue to help their companies ship stable, safe, successful products, product managers and organizational structures need to adapt.

Let’s look at both these causes, in order.

1. Businesses Building New Categories Have to Teach

Every product falls into one of two categories: those that sell into existing budget lines and those that must create new budget lines.

If you’re a startup with a product in the former category, your playbook is relatively simple. Your product must outperform the incumbent – with better features, services and/or reduced cost – while your go-to-market team needs to deploy marketing and sales tactics to break through the noise and close deals.

If your product falls into the latter category, your task is much trickier. No budget line exists for your product, which means one has to be created. You need to cultivate a champion at your prospective customer, someone who is going to either do the hard work of justifying the new budget (if you’re going bottom-up) or an executive who is going to add the budget line by edict (if you’re going top-down). When I was at PlaceIQ, we utilized both approaches.

To find your champions, you need to market your new category, defining and proving its value.

And the hard work doesn’t end after you create a budget line and close the deal. Now you have to teach your customer how to use your product. For simpler products, this can look like textbook customer success services. But for more complicated products, especially those customers use to build something with (like AI APIs), this looks like high-touch consulting.

If this weren’t hard enough, there’s a time limit for success. You gotta help them build value before the contract runs out.

OpenAI’s enterprise products are clearly in the latter of our two buckets. They, and all enterprise AI vendors, can’t count on their customers to build innovative applications with AI. They’re going to have to help them. Which is why there’s plenty of new “Forward Deployed Engineer” openings.

2. AI Reshapes Product Management Because it Speeds Up Development Iteration

To understand how AI redefines the product management role, let’s first look at how AI-assisted coding is changing our development cycles.

Of course: standard caveats apply. AI-assisted coding is unevenly distributed. Many companies are slow to adopt these new tools. Many engineers haven’t had the time or motivation to explore and learn new AI development patterns. Models are great at some languages and tasks, and bad at others.

Yet: every startup I’ve met that has built products in the agentic era demonstrate the following:

AI makes coding faster. My anecdotal experience and observations align with Simon’s, “I’ve estimated that LLMs make me 2-5x more productive on the parts of my job which involve typing code into a computer, which is itself a small portion of that I do as a software engineer.”
- AI makes prototyping faster. The product managers and engineers I know who’ve embraced AI will quickly develop frontend demos rather than multipage specs. As we learned during the Agile era, giving collaborators, customers, and users something to react to results in better and more efficient feedback.
- Iteration is faster. This is a result of the previous two bullets. Faster iteration builds better products. Teams that ship more win

When you ship more, you have more opportunities to learn. Your rate of improvement increases and you accelerate away.

But it’s the coding that is driving this speed. Everything else is seeing much smaller, if any, acceleration from AI.

At a recent Y Combinator Startup School, Andrew Ng put his finger on this dynamic:

While engineers are becoming much faster, I don’t see product management work – designing what to build – becoming faster at the same speed as engineers. I’m seeing [the product management to engineering] ratio shift.

Literally yesterday, one of my teams came to me, when we’re planning headcount. This team proposed to me not to have 1 PM to 4 engineers but to have 1 PM to 0.5 engineers. I still don’t know if this proposal is a good idea, but it’s a sign of where the world is going. And I find that PMs that can code or engineers with some product instincts often end up doing better.

With the pace of coding accelerating, product has become the new bottleneck.

Teams are now chasing hybrids – product managers who can code and engineers with product instincts. The explosion in Forward Deployed Engineer roles demonstrates this. FDEs are essentially product-minded engineers – hybrids who can both build and understand customer problems. Their rapid emergence isn’t just about teaching new categories; it’s early proof that organizations are gravitating toward these dual-skilled roles.

While we’re seeing FDEs emerge at AI labs, this shift will affect every software company leveraging AI-assisted coding – whether they’re building AI products or traditional applications with AI-enhanced development cycles. The catalyst isn’t AI products themselves, but AI tools that dramatically accelerate the coding portion of product development. By embedding these incredibly valuable hybrids with domain experts, you increase the surface area of your fast iteration loop.

So do product managers go away?

No, there’s still plenty of product work to be done to graduate a product from a rapidly assembled app or feature to a robust service: research, compliance, sales ops, product marketing, and more. Further, a FDE focused on building for one client doesn’t have the time to take in the bigger picture, both qualitatively and quantitatively. And if companies over-optimize for speed, skipping traditional product management steps, they’ll eventually get burned.

Take last week’s GPT-5 launch, where OpenAI had to manage a mountain of upset 4o users who had developed an emotional attachment to GPT-4o’s tone. Instrument ChatGPT all you want – analyze and sort consumer queries into buckets of use cases – and you’d still miss that a chunk of users related to 4o as a friend.

Rather than go away, I think the function of product management will get bisected into two domains:

Application Product Managers: These roles work closely with customers or partners, and are often “hybrids”: product-minded engineers or product managers who code. They rapidly absorb domain expertise and use it to rapidly iterate products and features. This is product management and engineering, blended with customer success and consulting.
Foundation Product Managers: These roles build the core platform upon which Application Product Managers build. They design the APIs, data structures, and core business logic. Further, these core teams productize innovations developed by application product manager teams, handling compliance, security, QA, and more.

Naturally, the amount of definition between these two domains is dependent on the size of the company and the nature of the product. But this set up attempts to preserve the speed of AI-assisted development, the need for traditional product management functions, and any need to teach customers how to use your product.

But at larger orgs, there will be many pods of forward-deployed engineers and application product managers. They’ll optimize for speed and customer utility, tossing back innovations that find traction to foundation product teams, who prepare them for wider use. The core platform, that powers the FDEs, is maintained by centralized, slower foundation teams. These core product management teams will effectively negotiate the speed difference between the fast app teams and everything else.

The speed gains from AI-assisted development and the need to teach companies how to use AI-powered products dictates that technology companies embrace forward-deployed work. Product management is going to have to evolve to support speedier engineers, not be a bottleneck, and help their organizations ship stable, safe products.

This adaptation may look like the divide I detail above, or it may look like adoption of new, AI-powered tools that grant speed buffs to traditional product work. Currently, my money’s on the former.

We used to call them “field engineers” or “solutions engineers”, but neither of those support the tacticool branding Palantir cultivates. ↩