<![CDATA[Interjected Future]]>https://interjectedfuture.com/https://interjectedfuture.com/favicon.pngInterjected Futurehttps://interjectedfuture.com/Ghost 6.19Fri, 27 Feb 2026 03:50:34 GMT60<![CDATA[Year 2025 in Review]]>It's another year again, and I look back on what I did. This is mainly for myself, as I didn't write one of these in 2022, and now I can't remember what I spent all my time doing for the whole year.

]]>
https://interjectedfuture.com/year-2025-in-review/694db992b7870f00013c9492Mon, 29 Dec 2025 15:55:19 GMT

It's another year again, and I look back on what I did. This is mainly for myself, as I didn't write one of these in 2022, and now I can't remember what I spent all my time doing for the whole year.

Away from Databases

I started the year by cutting my losses and moving away from database work. A couple of years back, there weren't any databases that were immutable, local-first, open-source, reactive, and any number of other properties that I wanted. I spent a good amount of time getting up to speed, but I never got to a point where I had remotely the output necessary to make it a reality.

I reminded myself that databases were never the point, so I cut my losses and worked on the reactive collaborative notebook with effects. I had to reload my brain with the research I'd done on reactive systems and algebraic effects around January, and I started coding up an algebraic effects system through March.

Lab note #058 Free Monad Effects and Algebraic Effects
I’d finished up looking into reactivity (with the exception of data type enforced reactivity, like differential dataflow), and I pulled out pedagogical sample code implementing signals and autotracking.
  • iamwilhelm/pedago_signals: A pedagogical implementation of hybrid push-pull signals in Python
  • iamwilhelm/pedago_autotracking: A pedagogical implementation of autotracking

Visual Programming

I'm not currently working on visual programming, but I seem to have opinions that resonated with others about it.

Visual programming is stuck on the form
Underlying great creations that you love—be it music, art, or technology—its form (what it looks like) is driven by an underpinning internal logic (how it works). I noticed this pattern while watching a talk on cellular automaton and realized it’s “form follows function” paraphrased from a slightly different

I wrote this over the course of four days and had to rewrite it once or twice, as I was trying to figure out what I was really trying to say. It's notable for this year in review because I remember the feeling of having something important to say, but struggling to figure out how to say it.

I'm glad that I did, because it ended up on my year in review as something I'm proud of doing. In 2026, I need to listen to that feeling more, and pause to do more writing.

DBSP

DBSP is one of those things that is terribly attractive to me, but I feel guilty working on it. Is it core to a reactive notebook with effects? No, but it's certainly tangential. I spent about a month and a half on a half-baked implementation.

I used AI to help me read the paper, and I learned a lot both from looking at a reference (but flawed) implementation and from doing the implementation myself. It was really nice to have something built up like a theorem from the ground up. However, the issue was how to structure certain parts of it outside of the theorem. I had little to no experience with this kind of domain, and there was a lot of rewriting of code. I learned a lot, but I felt it wasn't enough to tell people about it.

At around this time, Claude Code had come out, and I tried to vibe code DBSP with Claude Code using Sonnet 4.0, but it was a big failure. The code was wrong in completely subtle ways that I only discovered after I had to work with it. It felt a bit depressing.

Maybe the problem was that I was too worried about ruining an existing codebase with AI-based slop. So I took a detour of a detour to build something "like" a notebook: a sample project to vibe code upon without worrying about terrible code.

Lab note #070 Vibe-coding is half a skill issue
I’ve been doing an implementation of DBSP, which is a way of doing incremental computation built up from some basic concepts from digital signal processing. While I use LLMs in my daily work to ask it questions, spitball with it, and do some basic stuff, I hardly ever ask it

Personal Consumer Review

I noticed that one of the things I often asked ChatGPT for was recommendations in a product category. I've often wanted Wirecutter to tell me not just what they recommend, but what to look for. I wanted them to tell me what good looked like and why that was the case. I thought a GPT-based product recommendation canvas would be a good tool to have.

I vibe coded it and was surprised that the initial prototype looked as good and worked as well as it did. I ended up abandoning the project because I figured it was too obvious, and I saw that OpenAI was also doing product shopping and recommendations.

Lab note #071 Agentic loop for buying decisions
Switched gears this week to start a new, but related project. I started a prototype on an app that helps users make buying decisions with an AI agent. Lately, I’ve been using ChatGPT to help me make buying decisions, but I found some aspect of the experience wanting. So I

It was around this time that I saw people misusing Claude Code.

Misusing Claude Code

There were people who started using Claude Code for non-coding purposes on their personal knowledge bases in the file system. I thought it was an interesting experiment, so I finally exported myself out of Roam Research into Logseq. Then I asked Claude Code some questions after it explored my personal knowledge base. The result was a breath of fresh air.

Normally, I don't have anyone to talk to about the things I'm thinking or reading about. But with Claude Code having access to my knowledge base, I didn't have to explain context to have a conversation with it. It felt like an instant connection, because it mimicked the feeling of when someone gets you. Of course, there are downsides to this, but I soon saw other people misusing Claude Code in different ways, from doing their SOC2 compliance to running their morning retrospectives.

I found it was a powerful combination. I could run a coach for customer development, which I could never do before.

Doing More Customer Interviews Won’t Make You Better At Them
Founders and product managers fail at customer development, even when they’re consistently talking to users. But this is less from conducting too few interviews and more from a missing feedback loop. The art of posing the right question is counterintuitive and cognitively taxing, so founders can’t analyze their own

This was the first time I had cobbled together some off-the-shelf tools in order to live in the future. I was going to pivot again, this time from a notebook to a wiki where agents and humans work in the same digital space.

System Evals

While I was doing this, I took Hamel and Shreya's course on system evals. While Sri and I did write an eZine on system evals, called Forest Friends, I felt like most of the knowledge of evals came from Sri, and I needed to shore up on it.

Forest Friends Zine
A guide for AI Engineers building the wild world of LLM system evals

In the past, I never would have ponied up the money for such a course, but at this point in my career, it's a quick way to get up to speed on something that would have taken me much longer if I'd gathered all the materials myself.

I learned some stuff here and there, but excitingly, I could also misuse Claude Code to run a system eval!

System eval with Obsidian and Claude Code
Lately, I’ve been experimenting with a different system eval stack than what you might normally hear about: Obsidian and Claude Code. Typically, Obsidian is used as a personal note taking app/knowledge base and Claude Code is used as a coding agent. I’m misusing both to hack together a system

Where Agents and Humans Roam

After these experiences, I wanted a wiki that agents and humans could work on together in the same digital space. The setup of Claude Code and Obsidian seemed to have legs, and powerfully so. However, I thought there were improvements that could be made. The ideas I had for the notebook before wouldn't die; they'd eventually be incorporated.

Around August is when I decided to start work on this, and that's when I went silent in the lab notes, and just worked on writing code.


It wasn't really out of any desire for secrecy so much as that context switching between writing code and writing prose was hard for me. So I'd occasionally write an essay and then go back to writing code. I'm not sure exactly how to articulate what this is yet, but I know I need to do so in 2026. I just thought I'd be further along than I am right now.

The current application is pretty rough UX-wise. A lot of hand-wringing and code-wrangling went into supporting offline editing. I discovered first-hand where the state of local-first software is, and there are still major pieces missing before it's a no-brainer for mainstream devs.


Working with Agents

The cross-cutting theme for this year was trying to see whether vibe coding worked, and whether it worked for me. Claude Code had come out in February, and there were people claiming success with it in production.

Something I've always struggled with is speed vs. quality. I tend toward quality, refactoring as I go. However, I always feel terribly slow compared to other devs. Agentic coding promised a productivity boost, and as desperate as I was for it to work, I just couldn't get it to work for me.

In retrospect, this wasn't really my fault. There were a couple of things working against me.

  1. Vibe coding works well for well-known stacks. While I was using Python at the time, I wasn't using any frameworks.
  2. Vibe coding works well for well-known domains. What I was working on was not a web app, and DBSP is hardly a common domain.
  3. Vibe coding works well when you have better models. Since GPT-5.1 in Codex and Claude Sonnet 4.5 in Claude Code came on the scene, I've noticed much better results.
  4. Vibe coding works well when you have lots of scaffolding for the coding agent to lean on, such as unit tests, type checking, documentation, comments, PRs, etc. If you're just a single dev, you may not do these things because they don't seem worth the overhead for one person.
  5. Vibe coding works well when you know how to break things down for an agent. It was rather unintuitive to me that I needed to break a feature down into code base research, planning, and implementation stages. It was also an eye opener that you don't need to write every instruction yourself. You can ask the AI to ask you questions in order to write the prompt.
  6. Vibe coding works well when you're used to managing others. If you don't like managing people, you're going to dislike vibe coding.
The Hidden Scaffolding Behind Production Vibe Coding
I’m not near a level where it’s 10k LOC per day, but the volume of commits is noticeably higher w/ Claude/Codex now. Back in Feb/Mar when Claude Code came out, I was writing an implementation of DBSP (incremental computation with some mathy stuff) and I couldn’

It's been a hard adjustment. And even now, I can't completely let go of writing code by hand. There are some parts of my app where I leaned completely into vibe coding, such as sections of scaffolding that I know I'll get rid of, or various bespoke tools not yet available for LiveStore.

One great use is for me to generate prototypes to click around. Sometimes, I need to have something to click on to know how it feels, and vibe coding is a great way to explore the space before doubling down on something.


Another great use is in bridging over the adjacent ecosystems and spaces that would have had too much activation energy to get into, such as Lean.

The Best Way to Learn Might Be Starting at the End: Writing a Proof in Lean
AI leveraged learning lets you start with the application at the end. Curiosity guides what you learn, fundamentals backfill when you need them.

I haven't leaned completely into vibe coding the core aspects of the code base. At best, I take a middle road that Geoffrey Litt coined as "Coding like a Surgeon." While I do ask Codex/Claude Code to research the code base and make a plan, I found it better if I asked it to generate a tutorial for me to implement.

On one hand, this could seem like I'm just bottlenecking myself. But I think the benefits are that I keep abreast of the code base for future edits, stay in control of bloat, and use my taste and view of the future to correctly draw the system boundaries.

One day, I won't have to do this. But for now, coding agents either aren't good at drawing the correct system boundaries for code bases that are out-of-distribution, or need a lot of context in order to do so.

Retrospect on 2025

When I look back, I only work on something for about three to six months before moving on to something else. So while the ideas in my head build upon each other and compound, my work doesn't. I think it's because I never really had full conviction in anything I was working on, and I always felt guilty if it didn't directly lead to a product.

However, it's not happening this time with this wiki-with-agents thing. I had wanted it to be personally usable by the end of the year, but it looks like it'll be another month before that can happen. I hope to privately launch it by the spring.

My thesis is that a reactive notebook with effects will make building LLM-driven applications much quicker and more pleasant. But that's just a hunch. It's not something verifiable in a book; it needs to be wrought from the friction of users coming up against product. If I don't get there within 2026, I'll be disappointed in myself.

To sum up, for 2025, I wish to change my work habits to do work faster, create digital by-products that make my work more legible, and get a reactive notebook I want to use every day up and running. That's my goal, and my promise to myself a year from now.
- Me at the end of 2024

I've changed the thesis a bit, so I'm not too hard on myself there. However, I'm disappointed that I haven't created as many digital by-products to make my work more legible. I have dozens and dozens of unfinished blog post drafts and ideas, and I need to find a different way to do this.

Commitment to 2026

With a new year comes a new aesthetic. I wanted blog post images that stood out, so for a while I was using pastel origami. But I've been thinking that I want to explore an aesthetic we started with Forest Friends: an urban look of organic curvilinear shapes mated with half-timbered architecture. I want to see what that can look like.

But more seriously, there are two things I can commit to for 2026:

  • Launch a product.
  • Write a lot more to be legible.

I think with the clarity and conviction that I have now, it seems a little easier. However, I know there are going to be events that will test that conviction in the coming year.

]]>
<![CDATA[The Best Way to Learn Might Be Starting at the End: Writing a Proof in Lean]]>https://interjectedfuture.com/the-best-way-to-learn-might-be-starting-at-the-end/68f312ce463db9000104c918Sat, 18 Oct 2025 06:01:13 GMT

AI leveraged learning lets you start with the application at the end. Curiosity guides what you learn, fundamentals backfill when you need them.

School teaches us progressively: the current lesson builds upon the previous one. Hence, most people's inadvertent takeaway is that prerequisites are requirements for learning. But this is just an artifact of constructing a curriculum that can be taught tractably at an industrial scale. Prerequisites are often seen as gatekeepers, but it's better to see them as activation energy. AI collapses that activation barrier by acting as a bridge, making curiosity and critical thinking sufficient to start learning anything you care about.

I discovered this bridge over that barrier recently when learning Lean, a proof-assistant programming language, to prove a requirement in a system called DBSP[1], a way to do incremental computation. In order to guarantee DBSP can represent incremental states that can be rolled back, it relies on a foundational assertion that must hold universally:

The Z-set and the additive merge together must form an abelian group.

Surprisingly, many software engineers are allergic to math to the point of not using it as a precision tool in their work. The idea of proving a foundational core in their system is seen as too incomprehensible, too impractical, or too time consuming. But it doesn't have to be this way.

This piece demonstrates how I went about learning Lean to prove the assertion above, showing how previously inaccessible topics are now accessible with a bit of agency and critical thinking. The activation energy to learn new topics is now much lower because we can direct AI to collect, summarize, and adjust the material for the learner. This distills learning a new subject to its naked core: working to deeply understand core concepts and how they connect.

Lowering the Activation Energy

💡
Conjecture:
The Z-set and the additive merge together form an abelian group.

If these words mean nothing to you, here's a short primer:

  • Z-set: Sets are a bag of unique elements. Z-sets are bags that can contain multiple instances of the same element. Think of it as a map/dictionary with keys as the elements, and integers as values. The integer indicates how many instances of the element exist in the Z-set.
  • Additive Merge: It's an operation that merges two maps by summing values for shared keys. Non-shared keys are just copied over.
  • Abelian Group: It's a set with an operation where five rules hold:
    1. you can combine any two elements with the binary operation and still get an element from the set (closure)
    2. there’s an element that leaves the other element unchanged when using the binary operation (identity)
    3. every element has an inverse that cancels it out into the identity (inverse)
    4. the grouping of operations doesn't change the result (associativity)
    5. the order of combining doesn't matter (commutativity)

In short, we're trying to prove that when we have an operation on this set of elements, these five properties always hold.
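To make the primer concrete, here's a minimal TypeScript sketch of a Z-set as a sparse map with additive merge. This is my own illustration, not code from the Lean tutorial: `ZSetMap` and `additiveMerge` are names I'm introducing, and I assume string elements for simplicity.

```typescript
// A Z-set sketched as a sparse map: element -> integer count.
type ZSetMap = Map<string, number>;

// Additive merge: sum counts for shared keys; copy over the rest.
function additiveMerge(a: ZSetMap, b: ZSetMap): ZSetMap {
  const result = new Map(a);
  for (const [key, count] of b) {
    result.set(key, (result.get(key) ?? 0) + count);
  }
  return result;
}

const x: ZSetMap = new Map([["apple", 2], ["pear", 1]]);
const y: ZSetMap = new Map([["apple", -1], ["plum", 3]]);
// additiveMerge(x, y) → Map { "apple" => 1, "pear" => 1, "plum" => 3 }
```

Note that negative counts are allowed, which is what distinguishes Z-sets from plain multisets and makes inverses (and hence an abelian group) possible.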

Basic definitions are a good starting point to ask AI for help. If these explanations are still opaque, it's a perfect chance to ask AI to adjust the explanation in understandable terms. Give it your background and domain of expertise, and it'll find ways to bridge these concepts back to concepts you already understand.

But the point of this article is to skip this step and go straight into the proof! Due to the constraints of the medium of the written word, it's impossible to write an article that reads comprehensibly without the right prerequisites in place.

When using AI to learn, this constraint doesn't exist. Instead of grinding through the foundations and prerequisites before finally getting to the proof, I did the following:

  1. I scoped the problem to "prove Z-sets over additive merge form an abelian group."
  2. Then I asked AI: "Generate a Lean tutorial for this exact problem."
  3. Following the tutorial toward a working proof, I asked for explanations of concepts and jargon as I encountered them.
  4. I asked as many questions as I wanted about anything that didn't make sense.
  5. I got a working proof in a couple of hours, and a sufficient understanding to extend it.

This wasn't possible before, because there was no scalable way to produce bespoke educational material for every possible topic a learner might be interested in. The options were to find a tutorial close to the application of interest, or to map the lessons from an unrelated tutorial onto the application of interest.

Using almost any Wikipedia article about a technical topic to learn from is useless, because it's written to be accurate, not for pedagogy. The experience is completely different with AI.

  • If a different level of explanation is required for comprehension, we can ask the AI to ELI5 (explain like I'm five) for a bit to get the basics straight, before asking it to raise the level of explanation again.
  • We can ask it to explain things in terms of already understood concepts. As we gain understanding, it can adjust to our current level of comprehension by refining and expanding its explanation.
  • Finally, the tone and style of teaching is adjustable. The AI can play the straight shooter with no fluff, so we're not distracted. Or it can help boost our confidence along with verbal encouragement.

These are all the different ways we can leverage AI to lower the activation energy to learning a new domain. All we have to do is ask.

Prerequisites for skipping prerequisites

Though this piece is about skipping prerequisites for learning, this isn't for everyone, and the method has prerequisites of its own. It only works with basic knowledge of an adjacent field, vigilant critical thinking, genuine curiosity about the specific problem, and a domain with relatively easy-to-verify answers.

It's well known that LLMs hallucinate facts and sound confident doing it. Nowadays, this is tempered by grounding their context in source material. But that doesn't mean they can't sometimes reach the wrong conclusion in their answers. I've caught them a few times saying things that were slightly wrong or ill-informed.

First, it's an advantage to have expertise or training in an adjacent field. It's much easier to spot erroneous answers from the AI. If not caught, these errors can accumulate in our mental model of the world. Despite the effort to learn, we can actively make ourselves stupider.

Second, it has to be a habit of mind to ask questions about their answers, even as beginners in an unfamiliar domain. If something doesn't make sense, ask questions to clarify. Doubt the sycophantic agreement every step of the way. Double check their assertions by clicking on their quoted sources and doing independent searches. If anything is incongruent, confusing, or missing, ask questions about why.

Third, a genuine curiosity about the problem goes a long way. It's an infinite source of good questions to fuel critical thinking. Genuine curiosity will unearth all the hidden corners and edges of our current understanding, which makes for good questions to challenge the AI and sharpen our understanding. In addition, it's motivation to keep going when hitting a wall. The drive compels us to ask the AI to approach a new concept from different angles when one approach doesn't work. Sometimes, the breakthrough is a shift in perspective, and the trick is to persevere long enough to find it.

Lastly, it must be a domain with relatively easy-to-verify answers. The nice thing about mathematical proofs as a domain is that AI-generated answers are verifiable by the proof assistant. In other domains without such clear-cut verifiable truth [2], we rely on consistency. The answers must be consistent within the conversation, consistent with what we know, and consistent with external independent sources.

Now, let's see how this worked in practice, for this example in Lean.

Example: Defining Z-set and additive merge

The AI-generated tutorial started with the definition of Z-sets and the additive merge in Lean.

-- ZSet is a function from elements to integers (coefficients)
def ZSet (α : Type) : Type :=
  α → Int

-- Additive merge: add coefficients pointwise
def ZSet.add {α : Type} (zs1 zs2 : ZSet α) : ZSet α := 
  fun x => zs1 x + zs2 x

Code from the ML family of programming languages (Haskell, OCaml, Elm) can be unfamiliar and confusing. Beyond definitions, syntax is another area where AI can easily help learners get up to speed. Again, due to the constraints of this medium, I'll give a little primer of what the AI would have replied to questions about syntax.

The first definition just says that a Z-set is a type that maps something of any type (α) to an integer, denoted as α → Int.

The second definition just says, in order to additively merge two Z-sets, take the matching keys of each Z-set, and add them up.

Another way to understand it is to leverage what we already know. Given more familiarity with TypeScript than Haskell, we can ask the AI to rewrite the Lean code in TypeScript. The TypeScript code can then be used as a Rosetta Stone to read and interpret the Lean code.

// ZSet<A> is a total function A -> number (ℤ). Return 0 for "not present".
export type ZSet<A> = (x: A) => number;

function add<A>(zs1: ZSet<A>, zs2: ZSet<A>): ZSet<A> {
  return (x: A) => zs1(x) + zs2(x);
}

As we read this code, we might notice an apparent incongruence. Earlier, the explanation of a Z-set described it as a map. Here, the TypeScript translation defines it as a function. Why is that?

This is where genuine curiosity kicks in and we can ask the AI to explain by asking, "What's the equivalent of a map or dictionary in Lean?" or "Why is Z-set represented as a function of type α → Int instead of a Map?" The AI might quip: "In type theory, functions ARE maps."

If this is a new concept, it can be a bit of a perspective shift. Any pure function can be represented completely by a key-value store with infinite memory. In this view, functions are simply key-value stores one can write down without needing an infinite sheet of paper (or memory) to record the value of every key.
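Sticking with the TypeScript Rosetta Stone, here's a sketch of that perspective shift: a finite key-value store "written down" as a function. This is my own illustration; `fromEntries` is a hypothetical helper I'm introducing, not part of the tutorial.

```typescript
// ZSet<A> is a total function from elements to integer counts.
type ZSet<A> = (x: A) => number;

// "Write down" a finite key-value store as a function:
// keys not listed implicitly map to 0.
function fromEntries(entries: [string, number][]): ZSet<string> {
  const table = new Map(entries);
  return (x) => table.get(x) ?? 0;
}

const zs = fromEntries([["apple", 2], ["pear", -1]]);
// zs("apple") === 2, zs("banana") === 0
```

The function encoding sidesteps the "infinite memory" problem: the default case (`?? 0`) stands in for all the keys we never wrote down.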

We didn't need to read introductions to type theory to happen upon this insight. We used questions that naturally arose from curiosity and practicality to learn concepts and conventions in Lean, as well as a perspective shift on the equivalence of functions and maps.

Example: Stating the Abelian Group

Next, the AI-generated tutorial has us define what an abelian group is in Lean.[3]

-- Define our own minimal abelian group structure as a typeclass
structure AbelianGroup (G : Type) where
  add : G → G → G
  zero : G
  neg : G → G

  add_zero : ∀ a : G, add a zero = a
  zero_add : ∀ a : G, add zero a = a
  add_inv : ∀ a : G, add a (neg a) = zero
  add_assoc : ∀ a b c : G, add (add a b) c = add a (add b c)
  add_comm : ∀ a b : G, add a b = add b a

Here, notice the two kinds of definitions. The first group are function signatures for which we need to provide implementations. The second group are properties that must be proven. If we provide proofs for the properties, then for any example of G and add, we can use the proofs to check whether they indeed form an abelian group.

While writing this post, I realized I had a gap in my knowledge I wasn't aware of before. In the definition of an abelian group above, we require three functions to be implemented (add, zero, and neg). Yet abelian groups are often presented as being defined by just a carrier set G and an addition operation.

The resolution is that, in mathematics, identity elements and inverses are often introduced via existence axioms: we assert that there exists an element 0 such that a + 0 = a, and for every a there exists an element -a such that a + (-a) = 0. These axioms do not require us to name or compute the identity or inverse explicitly.

However, in Lean existence is not enough. A structure must carry its operations. If we want to use 0 or -a, we need concrete definitions, not just proofs that such things exist. This is why zero and neg appear as fields of the structure: Lean requires witnesses, not just existential statements.

But for us programmers, we just need a definition. We are not trying to derive operations from axioms at runtime; we are assembling a data structure that is correct by construction. If we already know how to define negation (for example, by negating each integer coefficient in a Z-set), we can simply write that definition and then prove that it satisfies the group laws.

def ZSet.neg {α : Type} (zs : ZSet α) : ZSet α :=
  fun x => -(zs x)

While we're not yet sure this will work in the proof, we don't need to start from just the identity and the additive merge to define the inverse/negation.
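As a quick sanity check, we can already try the inverse law pointwise. This is my own sketch, not from the tutorial; it assumes the ZSet definitions above and that Mathlib's simp lemmas (such as add_neg_cancel) are available.

```lean
-- Sketch (not from the tutorial): checking that neg cancels under
-- additive merge, pointwise, with zero taken as `fun _ => 0`.
example {α : Type} (a : ZSet α) :
    ZSet.add a (ZSet.neg a) = (fun _ => 0) := by
  ext x
  simp [ZSet.add, ZSet.neg]  -- reduces to `a x + -(a x) = 0`
```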

This kind of struggle is what forced a deeper understanding, and is the type of clarifying learning with an AI. Tutorials and youtube videos won't know what you don't know, so they can't fill in the gaps. Human teachers can, but they're far more expensive, and it might not be practical to get one for the domain.

As an exercise for the reader, here are other questions to ask the AI, if curiosity grabbed your attention:

  • Are there set and operation pairs that don't form an abelian group? What are some examples?
  • An abelian group has five properties. Why are there two properties for the proof of identity, add_zero and zero_add? And why do we not prove the closure property?
  • Why these five properties? Why is it that these five properties together are significant enough to be given a name?
  • How come the identity element is named zero? Wouldn't it be a misnomer if our binary operation were integer multiplication?

Example: Proving Associativity

We're not going to prove every property in the abelian group for a Z-set. Doing one is good enough to demonstrate the learning process with AI. However, I'll attach the complete proof at the end. In the AI-generated tutorial, the proof of associativity is:

instance : AbelianGroup (ZSet α)
where
  -- ... other definitions
  
  -- (a + b) + c = a + (b + c)
  add_assoc := by
    intro a b c
    ext x
    simp [ZSet.add, Int.add_assoc]

Here, there were a couple of keywords I didn't understand, so again, here's a primer like what the AI would have said:

  • by enters into tactic mode, where I'm telling Lean: "I’ll construct this proof interactively, step by step."
  • intro puts quantified variables into the local context, where the goal is to prove f(a, b, c) for arbitrary a, b, and c
  • ext is the application of the extensionality lemma, which says "Two functions are equal if they are equal at every argument."
  • simp is for simplification, which is just rewriting and substituting ZSet.add and Int.add_assoc with their expanded definitions.

When doing proofs on paper, the intermediate expressions are written down and the transformation is implied. When doing proofs in Lean, the method transformations are written down and the intermediate expressions are implied.
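To see the same tactics on an even smaller goal, here's a sketch of the identity law. This is my own example, not from the tutorial; it assumes the ZSet definitions above, zero taken as `fun _ => 0`, and a Mathlib-style ext tactic.

```lean
-- Sketch (not from the tutorial): the identity law with the same tactics.
example {α : Type} (a : ZSet α) : ZSet.add a (fun _ => 0) = a := by
  ext x              -- two functions are equal if equal at every argument
  simp [ZSet.add]    -- unfolds to `a x + 0 = a x`, which simp closes
```

Notice how each tactic mirrors a step you would leave implicit on paper: ext picks an arbitrary argument, and simp performs the rewriting you would do in your head.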

One can't just blindly apply proof tactics. For a proof this simple, the AI isn't going to get it wrong. But even if the AI writes a verifiably incorrect proof, the beauty of AI + proof assistants is that the AI can use the proof assistant to self-correct until it gets the proof right.

But that doesn't mean the proof should be accepted blindly, any more than professional developers should accept vibe-coded programs blindly. To stand behind the results, it helps to understand a correct proof: it teaches you what to ask when the proof is wrong. Just as in code, while the results are technically correct, we might want them a different way for other reasons.

Initially, the AI proof used ring as a proof tactic. While correct, I wanted to understand why, and what the alternatives were. That's when I learned about simp along with its pros and cons. In the end, I chose simp for this article because it's more transparent and shows what's going on with less magic.

This is much harder than following a tutorial in lockstep. This is why so many young programmers can do all the tutorials and watch all the YouTube videos they want and still not achieve proficiency. The struggle of thinking about why something doesn't work needs to be experienced in order for learning to actually take place.

However, the struggle is with the core concept, not just with trivia, like looking up what different keywords mean. We're doing real formal verification about a data structure from a real system, like DBSP. The activation energy collapsed from "I'll never understand proof assistants" to "I have a working proof" in one afternoon.

Example: Missing Finite Support

After I had published the initial version of this post, a reader pointed out an omission. The DBSP paper defines Z-sets as finitely supported maps into ℤ (integers). A finitely supported function is one that is zero everywhere except on a finite set of inputs. In programming terms, it’s a sparse map: only finitely many keys store non-zero values, and everything else is implicitly zero.
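
To make that concrete in code, here's a minimal sketch of mine (the names are illustrative, not from the DBSP paper): a Z-set as a Python dict that stores only the non-zero weights.

```python
# A toy finitely supported Z-set: a dict from keys to non-zero integer
# weights. Any key absent from the dict implicitly has weight 0, so the
# support (the set of keys with non-zero weight) is always finite.

def zset_weight(zs: dict, key) -> int:
    """Look up a key's weight; missing keys are implicitly zero."""
    return zs.get(key, 0)

zs = {"alice": 3, "bob": -1}     # support = {"alice", "bob"}, finite
print(zset_weight(zs, "alice"))  # 3
print(zset_weight(zs, "carol"))  # 0 -- everywhere else is zero
```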

My proof thus far omits this condition. What if this condition made the proof false? Admittedly, I didn't know about finite support when I first wrote this post. Indeed, this raises a critical question: how do we address the unknown unknowns? And more importantly, does it render this method of learning from AI (jump in first, backfill later) useless?

As it turns out, not necessarily. We need a pass that tries to catch these unknown unknowns. Could GPT-5 catch this error of omission if I fed it both the DBSP paper and the original version of the post without mention of what's missing? Here's what I asked:

read the following blog post: https://interjectedfuture.com/the-best-way-to-learn-might-be-starting-at-the-end/ and the attached DBSP paper.

What are the errors, especially errors of omission in this blog post? where is the blog post weak in its explanation of the math? Where is it hand-waving and could be more exact and precise? Where is it only true under specific conditions but not others, if any?

And finite supports was the first thing it flagged.

### What’s missing / slightly inaccurate

In DBSP, a Z-set is specifically defined as a function from a domain `A` to the integers **with finite support** (“finite support” = only finitely many keys have nonzero values). Mathematically: `f : A → ℤ`, where `f(x) ≠ 0` for only finitely many `x`

That's certainly a good sign. GPT-5 was able to raise errors of omission without being told explicitly what was missing. I ran the same query repeatedly, and found that it always flagged the issue of finite support. However, the force with which it emphasized the problem varied, so if you're not careful, you can gloss over the criticism.

So what actually changes when we require finite support for Z-sets? Finite support doesn't change the algebra, but it changes the type of the carrier set. We're no longer working with all functions in α → ℤ, but with a subset of those functions. Hence, the only additional thing we have to prove is closure. We must show that the result of every operation (zero, add, neg) is still a finitely supported Z-set. Once closure is proven, the rest of the proof is the same as before.
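
Closure is easy to see in the sparse-map picture: the support of a sum is contained in the union of the two inputs' supports, which is still finite. Here's a rough Python sketch (the function names are mine, not DBSP's):

```python
def zset_add(a: dict, b: dict) -> dict:
    """Merge two sparse Z-sets by adding weights pointwise.
    Keys whose weights cancel to 0 are dropped, keeping the map sparse."""
    out = dict(a)
    for k, w in b.items():
        s = out.get(k, 0) + w
        if s == 0:
            out.pop(k, None)  # drop zero weights to preserve sparsity
        else:
            out[k] = s
    return out

def zset_neg(a: dict) -> dict:
    """Negate every weight; the support is unchanged, hence still finite."""
    return {k: -w for k, w in a.items()}

# The result's support is a subset of the union of the inputs' supports:
a = {"x": 2, "y": -1}
b = {"y": 1, "z": 5}
print(zset_add(a, b))  # {'x': 2, 'z': 5} -- 'y' cancelled to zero
```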

Hence, as a part of this kind of learning process, we'll need a final pass at the end that tries to catch these unknown unknowns, without needing to mention what's missing. You can ask:

  • What are the errors in my understanding, especially errors of omission?
  • Where is my understanding hand-wavy and could be more exact and precise for the level that I'm at?
  • What are my subtle misunderstandings that would be a profound impediment to my understanding of more advanced material later?

But there are limitations here. It won't catch semantic errors. An LLM and a Lean prover can only teach and verify the definitions as written. They cannot verify that those definitions semantically match what's written in the paper.

A PhD is for thinking previously unthinkable thoughts

To be clear, I'm not saying AI-based learning replaces a PhD when it comes to learning Lean-based math proofs. When people say "you need a PhD to do X", they mean it in two different but conflated senses:

  1. Background collation: Months of reading papers, learning the jargon, and hunting down the prerequisites.
  2. Novel Research: Asking questions never asked, proving theorems never proven, and thinking previously unthinkable thoughts.

Most of the time people use that phrase, they mean it in the first sense: to express the high activation energy required to break into learning anything outside their familiar domains. AI eliminates the barrier in the first sense, but not the second. You can't ask an AI to "solve my novel research problem", but you can ask it to "teach me formal verification for my specific problem."

What if you are doing novel research? That's where I make no claims of efficacy using AI to learn, except to warn that it's easy to fall into AI-induced psychosis. No matter how smart or careful you are, the easiest person to fool is yourself.

Here's an excerpt of a passage from a smart and accomplished founder:

At the heart of physics is the idea of a frame of reference. A frame is a boundary, something that separates this from that. A frame of reference is a boundary which contains a decoder and codebook that are used to interpret the boundary’s flux as a description of the world.

The codebook is a coordinate map which indicates what perspective the codebook should take interpreting the flux. In the discrete case, this means it decodes input states into output states. In the continuous case, it applies a projection operator onto a manifold defined by the decoder.

It goes on for a while in the same manner. A charitable interpretation is that they're trying to express something we don't yet have words for. But it reads like someone who took shallow definitions from physics and built a pet theory without precise definitions or internal consistency.

This is why answer verifiability matters. In Lean, wrong proofs don't compile. In novel physics theorizing, AI will happily construct elaborate castles in the air with you. It's not yet clear how to best leverage AI for the second meaning of "requiring a PhD". That's why one of the prerequisites for this method is to pick a domain where the answers are relatively easily verifiable. So for now, I'd stick with the first when using AI to learn.

Trying it yourself

You can use the following prompt to generate yourself a tutorial for the proof. Adjust according to your tastes.

Generate a Lean 4 tutorial for proving that Z-sets with finite support over additive merge form an abelian group. I'm a programmer familiar with functional programming, but never used proof assistants. If you're using mathematical terms, explain them in terms a developer would understand. Explain any jargon and concepts in detail.

Then install Lean 4 and VSCode or use the browser version to work through the AI-generated tutorial.

Ask as many questions as you can think of to clarify your understanding. When stuck, ask: "Why did [X tactic] fail?" or "Explain [term] in the context of this proof." Try different angles to understand the concepts, from generating code in a familiar language to explaining the concepts in a domain you already understand.

Here is the proof that Z-sets over an additive merge form an abelian group, without finite support. And here's the version with finite support. They both demonstrate all five properties of an abelian group: closure, the existence of an identity and inverses, associativity, and commutativity.

import Std

open Std

variable {α : Type} [DecidableEq α]

/- ------------------------------------------------------------
   0) Your minimal abelian group structure (unchanged)
------------------------------------------------------------ -/

structure AbelianGroup (G : Type) where
  add : G → G → G
  zero : G
  neg : G → G
  add_assoc : ∀ a b c : G, add (add a b) c = add a (add b c)
  add_zero : ∀ a : G, add a zero = a
  zero_add : ∀ a : G, add zero a = a
  add_inv : ∀ a : G, add a (neg a) = zero
  add_comm : ∀ a b : G, add a b = add b a

example : AbelianGroup Int := {
  add := fun a b => a + b,
  zero := 0,
  neg := fun a => -a,
  add_assoc := Int.add_assoc,
  add_comm := Int.add_comm,
  add_zero := Int.add_zero,
  zero_add := Int.zero_add,
  add_inv := Int.add_right_neg,
}

/- ------------------------------------------------------------
   1) Finite support, but using List

   FiniteSupport f := ∃ s : List α, outside s, f is 0.

   This is the only place we encode “finite-ness”.
------------------------------------------------------------ -/

def FiniteSupport {α : Type} (f : α → Int) : Prop :=
  ∃ s : List α, ∀ x : α, x ∉ s → f x = 0

/-- Finitely-supported Z-set = function + proof it has finite support. -/
def FZSet (α : Type) : Type :=
  { f : α → Int // FiniteSupport f }

/- Extensionality for FZSet: equality is pointwise equality of the function. -/
@[ext]
theorem FZSet.ext {α : Type} (f g : FZSet α) : (∀ x, f.1 x = g.1 x) → f = g := by
  intro h
  apply Subtype.ext
  funext x
  exact h x

/- ------------------------------------------------------------
   2) Constructors: empty and single (now with List witnesses)
------------------------------------------------------------ -/

def FZSet.empty {α : Type} : FZSet α :=
  ⟨(fun _ => 0), by
    refine ⟨([] : List α), ?_⟩
    intro x hx
    rfl⟩

def FZSet.single (a : α) (n : Int) : FZSet α :=
  ⟨(fun x => if x = a then n else 0), by
    refine ⟨([a] : List α), ?_⟩
    intro x hx
    -- From x ∉ [a], we get x ≠ a
    have hne : x ≠ a := by
      intro hxa
      -- If x = a then x ∈ [a], contradiction
      have : x ∈ ([a] : List α) := by
        -- [a] is a :: [], so membership is by head equality
        simp [List.mem_cons, hxa]
      exact hx this
    -- So the if-expression takes the "else" branch
    simp [hne]⟩

/- ------------------------------------------------------------
   3) Operations: add and neg, with closure proofs via List witnesses
------------------------------------------------------------ -/

def FZSet.add {α : Type} (zs1 zs2 : FZSet α) : FZSet α :=
  ⟨(fun x => zs1.1 x + zs2.1 x), by
    rcases zs1.2 with ⟨s1, hs1⟩
    rcases zs2.2 with ⟨s2, hs2⟩
    -- Witness for sum support is s1 ++ s2
    refine ⟨s1 ++ s2, ?_⟩
    intro x hx

    -- If x ∉ s1 ++ s2 then x ∉ s1 and x ∉ s2
    have hx1 : x ∉ s1 := by
      intro hx1'
      have : x ∈ s1 ++ s2 := by
        exact List.mem_append.mpr (Or.inl hx1')
      exact hx this

    have hx2 : x ∉ s2 := by
      intro hx2'
      have : x ∈ s1 ++ s2 := by
        exact List.mem_append.mpr (Or.inr hx2')
      exact hx this

    -- Outside both supports, both are zero
    simp [hs1 x hx1, hs2 x hx2]⟩

instance {α : Type} : Add (FZSet α) where
  add := FZSet.add

def FZSet.neg {α : Type} (zs : FZSet α) : FZSet α :=
  ⟨(fun x => -(zs.1 x)), by
    rcases zs.2 with ⟨s, hs⟩
    refine ⟨s, ?_⟩
    intro x hx
    simp [hs x hx]⟩

instance {α : Type} : Neg (FZSet α) where
  neg := FZSet.neg

/- ------------------------------------------------------------
   4) Abelian group instance

   Exactly the same ext+simp shape as your original proof,
   because the algebraic laws are still pointwise Int laws.
------------------------------------------------------------ -/

instance : AbelianGroup (FZSet α)
where
  add := FZSet.add
  zero := FZSet.empty
  neg := FZSet.neg

  add_assoc := by
    intro a b c
    ext x
    simp [FZSet.add, Int.add_assoc]

  add_comm := by
    intro a b
    ext x
    simp [FZSet.add, Int.add_comm]

  add_zero := by
    intro a
    ext x
    simp [FZSet.add, FZSet.empty]

  zero_add := by
    intro a
    ext x
    simp [FZSet.add, FZSet.empty]

  add_inv := by
    intro a
    ext x
    simp [FZSet.add, FZSet.neg, FZSet.empty, Int.add_right_neg]

/- ------------------------------------------------------------
   5) Examples (note: underlying function is `.1`)
------------------------------------------------------------ -/

example : (FZSet.single "a" 3 + FZSet.single "b" 2).1 "a" = 3 := by rfl
example : (FZSet.single "a" 3 + FZSet.single "a" 2).1 "a" = 5 := by rfl
example : (FZSet.single "a" (-3) + FZSet.single "a" 2).1 "a" = -1 := by rfl
example : (FZSet.neg (FZSet.single "a" 3)).1 "a" = (-3) := by rfl

Compare it to yours. What are the differences? Why do it one way versus the other? Then try other sets and binary operations and see whether they form abelian groups.
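
As a warm-up for that exercise, here's one more carrier that works with the AbelianGroup structure defined above: Bool under xor. Every element is its own inverse, and since Bool is finite, the decide tactic can check each law by exhaustively evaluating all cases.

```lean
-- Bool under xor: false is the identity and every element is its own
-- inverse. `decide` proves each law by checking all Boolean cases.
example : AbelianGroup Bool := {
  add := Bool.xor,
  zero := false,
  neg := id,
  add_assoc := by decide,
  add_zero := by decide,
  zero_add := by decide,
  add_inv := by decide,
  add_comm := by decide,
}
```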

What this enables

AI has already made a seismic shift in the way programmers work through vibe coding. It's increased the velocity of feature output and bug fixing by programmers that have figured out how to leverage it. But I hope that it also creates a different kind of shift; one where programmers aren't afraid to branch out to the edges of what academia has discovered to bring it into the realm of real-world applications.

Distributed systems PRs could include Lean proofs of safety properties. CRDT libraries could ship with formal verification of merge semantics, not just unit tests. "Formally verified" is no longer out of reach for the average backend engineer; it's a practical tool in their toolbox.

And I think this pattern generalizes across domains beyond formal verification. There are lots of specialized domains in programming, from GPU programming to embedded programming. Cross-pollination by programmers who can straddle ecosystems also means that ideas in one domain can be tried in another, making good ideas spread faster. And if anything, a wider breadth of experience gives more perspective, which usually lends nuance and color when making decisions.

The bridge exists now where there was none before. We can cross it into adjacent domains to learn if we have the critical thinking, curiosity, and agency to try.


  1. DataBase Stream Processing, but also a pun on Digital Signal Processing ↩︎
  2. Please don't write in about Gödel's Incompleteness theorem. It doesn't apply here, because we're not reasoning about anything that references itself. ↩︎
  3. An abelian group is already defined in Lean's standard library (Mathlib), but we define it for the sake of the exercise. Also, Lean 4 currently doesn't have tree shaking, so including the pre-written definition balloons the compiled C executable by about 150+ MB ↩︎
]]>
<![CDATA[Doing More Customer Interviews Won't Make You Better At Them]]>

Founders and product managers fail at customer development, even when they're consistently talking to users. But this is less from conducting too few interviews and more from a missing feedback loop. The art of posing the right question is counterintuitive and cognitively taxing, so founders can’t

]]>
https://interjectedfuture.com/customer-developments-missing-feedback-loop/68de8231463db9000104c6eaTue, 07 Oct 2025 17:22:00 GMTDoing More Customer Interviews Won't Make You Better At Them

Founders and product managers fail at customer development, even when they're consistently talking to users. But this is less from conducting too few interviews and more from a missing feedback loop. The art of posing the right question is counterintuitive and cognitively taxing, so founders can’t analyze their own interviewing performance in real time. A transcript-based AI coach that flags hard-fail interviewing errors restores that loop—raising ground truth per hour faster and cheaper than brute-forcing more interviews.

Customer development is commonly phrased as advice to simply "talk to users," but that undersells the difficulty of doing it right. It's actually the ethnography of demand, with many unintuitive pitfalls, and it requires deliberate practice to do well. Avoiding these pitfalls is difficult to do in the heat of an interview.

But AI coaches can bridge that gap between theory and practice by providing specific, actionable feedback on your customer interviews. This feedback is the critical missing piece for improving the skill of inferring customer demand, because it raises the number of ground truths per hour.

What makes customer development so hard? It's hard in two ways: 1) People usually can't provide a clear articulation of their problem or a feasible solution. 2) The questions you ask can unintentionally incentivize people to lie to you.

This is unintuitive. Founders think customers would be properly incentivized to be truthful about their problems and their solutions. Why would you lie about your needs? Reality is, most people are too busy trying to achieve some other goal to take a step back to introspect and articulate the problem. Hence, they may not identify the core problem, nor the most important problem. They're also not aware of all the technological solutions they can bring to bear on a problem. But that doesn't stop them from articulating their problems as solutions, which can often be misleading.

First-time founders can be naive to the Heisenberg-like effects of asking for customer opinions like, "would you use this?" and "what do you think of my idea?" Social norms of politeness dictate you don't trash something the other person is clearly emotionally invested in. So unless a founder's asking for money, it's often easier to say "Sure, I'd use it!", "I think lots of people would find it useful," and "It sounds like a great idea!" than to deal with the social awkwardness of telling someone their idea doesn't resonate with you. It's akin to telling someone their baby is ugly.

As a result, you cannot just build the solutions people ask for. People are experts in their problems, not their solutions. You cannot ask people about their problems directly. They can misjudge the value of a problem or how pressing a problem is. And you definitely cannot tell them your idea and ask them what they think about it and whether they'd buy. They'd sooner lie to you than to tell you your idea is bad or that they don't need your product.

With those constraints, what can you even ask a potential customer in an interview? The only reliable topics you can ask potential customers are recent recollections of their concrete actions when they came across a problem and the context in which they found themselves when the problem arose. Everything else from why their need exists to the product solution must be inferred.

Hence, customer development is unintuitive and requires deliberate practice to overcome these common pitfalls when interviewing potential customers. It requires just a bit of post-interview introspection and analysis on what went right, and what could be improved, in the same way professional basketball players review game footage.

Unlike professional sports, the customer development feedback loop is missing. But why do we need this feedback loop in the first place?

You can't run meta-analysis while you're interviewing. When you're trying to infer demand, your working memory is already overloaded. The right questions are counterintuitive, so you're juggling two hard tasks at once: finding the next question that gets to ground truth while listening closely enough to follow the thread of what's being said. It's predictable that you'd miss the meta-level entirely, because all your attention is consumed by the conversation itself.

By the time the interview ends, the details have already faded. You’re left with a few half-formed notes and a hazy memory, hardly enough for real analysis. Without an external coach to close that loop, you have no reliable way to see what you did well or poorly. And without that feedback, improvement stalls — founders and product managers keep repeating the same mistakes, unaware until it's too late and they've built the wrong thing. It’s easy to read what good interviews should look like; it’s much harder to recognize those principles in the middle of a conversation, when you’re juggling multiple cues and counterintuitive moves at once.

Why not just do more interviews? Wouldn't brute-forcing the number of interviews help founders and product managers get better without needing the feedback mechanism?

You might be reminded of the parable of the pottery class: the class graded on the quantity of pottery produced improved faster than the class graded on quality. Yet the parable doesn't apply here.

Clay gives perfect feedback; conversations don't. It's clear when clay doesn't hold a shape. It's not clear when words reflect ground truth. Finished pottery exists afterward; conversations are fleeting and ephemeral. Concrete evidence leads to higher fidelity feedback and analysis.

Hence, grinding on quantity only works when the medium gives perfect feedback. Customer interviews give noisy feedback: words are biased, memories are selective, and without transcripts at the very least, you're learning from low-fidelity notes. If your interviewing error rate stays high with leading questions, future hypotheticals, or failure to quantify budget or urgency, then doing more interviews mostly scales the noise. When you tamp down the error rate, the speed to ground truth increases, and that compounding effect takes over. The founder who learns to extract ground truth in a handful of interviews will outpace one who needs dozens just to see the same signal.

How do you leverage an AI coach in a customer development workflow? It's a straightforward two-step process. First, you convert the audio conversation of the interview into a transcript. Then, you use a prompt where the AI plays the role of a customer development coach to analyze the transcript and write an analysis of your performance.

Now that audio-to-text transcription works really well with the current crop of multi-modal LLMs, it's possible to get accurate transcripts of a conversation for cheap. The transcription can happen in real-time, on your own machine, so you don't even need to record the raw audio, or send it to remote servers. When I hop on a call for a customer interview, I'll ask the interviewee for permission to record the audio portion of the call to get a transcript. Nine out of ten people will agree to it. I think they appreciate being asked. I currently use Granola.ai, but MacWhisper or other audio transcript software will do. After the interview, I'll copy and paste the transcript into a markdown file named interview_transcript.md

When reading a book on customer development, such as Jobs-to-be-Done, The Mom Test, Sales Safari, or any number of other resources on the subject, instead of taking notes that I'll never read again, I'll distill its principles into a ~2200-word prompt. The prompt has sections on the agent's objective, instructions, context and principles, and examples to help guide and guardrail the AI agent into doing the analysis correctly. This gets written into a file named cust_dev_coach.md

try it out

You can write your own AI coach easily. Or you can try out mine. Just subscribe for free to access the whole prompt.

See prompt for AI Coach

With this prompt in hand, I'll tell Claude Code (or OpenAI Codex):

🤖
You are @cust_dev_coach.md. Read @interview_transcript.md and write your analysis in `interview_analysis.md`

This is just as easily done with ChatGPT. You can paste the agent prompt, follow it with the transcript, and then ask it to play the role of the prompt and analyze the transcript. The analysis is a high-fidelity, transcript-based introspection that you wouldn't have been able to do yourself in real-time, nor afterwards without great effort.

Is most of the benefit just having the transcript, and not the AI coach? It's true that a transcript makes an ephemeral conversation tangible. However, it takes a lot of effort to review a transcript post-interview. You'd have to relive the same conversation and manually map principle to practice. Most people don't do this because it's both taxing and time consuming. AI coaches lower this cost to almost nothing, while keeping the analysis specific and actionable.

Why not use a human coach? Customer development coaching doesn't seem to be a thing in Silicon Valley. Having lurked on Hacker News, Indie Hackers, and Bookface (YC's internal forum) for years, I've heard founders ask for recommendations for executive coaches, health and fitness coaches, and even couples therapy coaches, but never customer development coaches. Given customer development is often framed as just "talking to users," perhaps it's not a thing for the same reason ski instructors are more numerous than running instructors: running seems intuitive in a way that skiing doesn't.

But even if human coaching were available, it would be much more expensive. A human coach could cost on the order of thousands of dollars, whereas running a prompt costs anywhere from under a dollar to $3. While human coaches could provide more value beyond pointing out mistakes, just avoiding mistakes will get you most of the benefit. As we've established earlier, a lowered error rate compounds.

There are additional benefits to using this workflow. By transcribing the audio into a transcript, you'll always have the raw information to perform additional analysis in the future, when your AI coach prompt improves or when there's another method of inferring customer demand that works better. And if all else fails, you can read the first-hand source material for insights you might have missed the first time.

So how do we know the AI coach is giving us good advice? Aren't LLMs prone to hallucination? The solution is to limit the AI coach to raising flags that are only hard-fail observable items. This includes:

  • Leading/loaded question detection
  • Future/hypothetical framing (“Would you…”, “Would you pay…”)
  • Failure to ask budget/time/cost
  • Failure to ask for the last concrete event (“Tell me about the last time…”)
  • Interruption rate > X / min

It won't give judgments on "insightfulness", and it won't give product advice.
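
As a rough illustration of what "hard-fail observable" means, some of these flags are so mechanical that even a regex can approximate them. The patterns below are my own crude sketch, not the coach prompt's actual logic; the LLM catches far more phrasings.

```python
import re

# Crude regex approximations of two hard-fail flags. Illustrative only;
# an LLM coach recognizes many more phrasings than these patterns.
HYPOTHETICAL = re.compile(r"\bwould you\b", re.IGNORECASE)
LEADING = re.compile(r"\b(don't you think|wouldn't it be great)\b", re.IGNORECASE)

def flag_hard_fails(interviewer_lines: list[str]) -> list[tuple[int, str]]:
    """Return (line_number, flag) pairs for mechanically detectable errors."""
    flags = []
    for i, line in enumerate(interviewer_lines, start=1):
        if HYPOTHETICAL.search(line):
            flags.append((i, "future/hypothetical framing"))
        if LEADING.search(line):
            flags.append((i, "leading question"))
    return flags

lines = [
    "Tell me about the last time you prepared for SOC2.",
    "Would you pay for a tool that automated this?",
]
print(flag_hard_fails(lines))  # [(2, 'future/hypothetical framing')]
```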

Inexperienced LLM users expect AI to give exceptional performance with very little context. Like any human, the more relevant context you give them, the better they will do. Hence, detailed context and exposition on "what good looks like" in the prompt as gleaned from the source material on customer development would go a long way into guiding the LLM to flag these hard-fail items.

And if you do need to chase those edge cases where the AI agent fails, you can employ system evals. But that's rarely needed, because most of the mistakes founders make are common mistakes covered by the source material on customer development, rather than in rare edge cases.

Here are some examples of analysis from my own AI coach from multiple interviews I conducted. These were things that I missed while I was in the flow of the conversation. My brain never thought to pick up on these as problems, or I was too busy thinking of the next question to flag these as being problematic.


❌ Missing the Elephant in the Room

Throughout the conversation, you never asked the critical business question: "How much time and money does SOC2 compliance actually cost you per year?"

I spent all this time building automation, but I have no idea if this is a $1,000 problem or a $50,000 problem.

  • ❌ Major omission. Without understanding the economic impact, you can't assess if this is a real business opportunity or just an engineering hobby project. This is the elephant question that could tank the business case.

How to improve: "Before we dive into the technical solution, help me understand the business impact. How much time do you spend on SOC2 prep? What would happen if you didn't get certified?"


❌ MISSED OPPORTUNITY: Failed to Probe Pain Points

You: "And then so because the two of you, you know, like, the other you know what you're doing. You know what you're not doing..."

Analysis: You completely glossed over their statement that "Linear was just too hard" and they abandoned structured project management. This is a massive signal about friction in their workflow that you didn't explore.

What you should have asked:

  • "What made Linear too hard specifically?"
  • "What did you try to do with it that didn't work?"
  • "How do you handle knowing what the other person is working on now?"
  • "What problems, if any, does this informal approach create?"

❌ Accepting Surface-Level Answers

Quote from transcript:

Them: "I don't feel like I have any problems."
You: "Okay. And then so have you showed this workflow to your other coworkers..."

They're thinking: That was easy - they moved on without digging deeper.

Consequences: You missed a golden opportunity to understand the real friction points. "No problems" is almost never true - you should have probed deeper into specific frustrations or limitations.

Better approach: "When you say no problems - walk me through the last time you had to create a complex PRD. What was the most time-consuming part?"


If every analysis was just a list of things I did wrong, it'd feel pretty bad and I might not want to read it. I asked the AI coach to also point out the good things I was doing in the interview, so I knew to keep on doing it. This shit sandwich makes it both fun and motivational to read the analysis after an interview. The anticipation of reading the post-mortem makes me want to talk to more users. Here's a single example of positive feedback.


✅ Great Questions That Revealed Insights

You: "what what has it cost you? Like, did it cost you, like, hours, or is it like, I can't do my work or I, like, can't collaborate with people or because of that? Packaging thing you're talking about? Or, like, what's what's the cost there?" ^block_1

Them: "I wouldn't say it's like like, of of all of problems in my career that have, like, like, a big kinda cost, like, I wouldn't say that that has been one of them."

Wait, so this isn't actually a big problem for me. I'm being asked to quantify something that feels minor compared to my real issues.

  • Excellent follow-up. This question distinguished between "I-will-pay-to-solve-that" problems and "that's-kind-of-annoying-but-I-can-deal-with-it" problems. His response immediately deflated the notebook packaging issue as a major pain point, which is exactly the kind of reality check you need in customer development.

Despite knowing better in the abstract from reading about customer development repeatedly, in practice some of my mistakes are basic, classic mistakes. Without an AI coach, I wouldn't have noticed or remembered them. With this new awareness, I'm not making the same mistakes I was making before, and I'm much more confident of the ground truth information I'm gleaning from the interviews.

There are many aspects of picking the right startup idea that can lead a founder or product manager astray, but getting bad validation of an idea from customer development doesn't have to be one of them now. Keep practicing interviewing, but it can now be deliberate, with feedback, due to the help from AI. Scaling up interviewing without closing the loop on improvement is wasted time. Try running a two-week test: conduct five interviews, and run a hard-fail rubric. If your loaded questions and hypothetical probing don't drop by more than 50%, then you can try going back to brute-forcing interviews. Otherwise, keep it. Ground truth per hour compounds faster than "talking to more users."

]]>
<![CDATA[Agent: Customer Development Coach for Feedback]]>
🤖
This is a prompt for a customer development AI coach. You can easily write your own or try out mine for free as a subscriber.
Read about why you'd want to use an AI coach for customer development.

Role and Objective

You are the world's

]]>
https://interjectedfuture.com/agent-customer-development-coach-for-feedback/68e4aa63463db9000104c701Tue, 07 Oct 2025 17:21:54 GMT
🤖
This is a prompt for a customer development AI coach. You can easily write your own or try out mine for free as a subscriber.
Read about why you'd want to use an AI coach for customer development.

Role and Objective

Agent: Customer Development Coach for Feedback

You are the world's best expert and coach on customer development. You've studied Jobs-to-be-Done, The Heart of Innovation, and The Mom Test, and you are a YC partner. Your goal is to evaluate how well I did in a customer development conversation and what I can do better.

Instructions and Rules

  • Read through the customer development transcript
  • For the times when I'm talking,
    • highlight times when I asked great questions that revealed good information about the nature of a customer's situation.
    • highlight the times when I asked bad questions or said leading things that gave me false positives, false negatives, or false information from the customer.
  • To choose a highlight, pick a threshold that exceeds expected performance (good or bad) so that we average around three to five highlights per conversation.
  • For each highlight:
    • Quote the relevant conversation chunk.
    • Link to the relevant part of the transcript.
      • Modify the transcript if you need to in order to get a block id. (These markdown files use Obsidian.)
    • In italics after their dialog, write the subtext of what they're actually thinking.
    • Tell me the consequences of what I said, such as getting bad information, and why.
    • Suggest how I might say it better next time.
  • Keep in mind that sometimes the error is an error of omission. I may forget to ask something that would have been critical to understanding a customer better or more fully.
  • At the end, summarize what I did well and concrete actions I can take to better understand customers:
    • their true needs
    • the urgency and importance of that need
    • their worldview
    • the situation they find themselves in that gives rise to that need
  • Keep the number of things to improve at the three most actionable with the best ROI, so I can remember it.
  • ALWAYS highlight the most impactful question that unlocked surprising insights into customer behavior.
  • ALWAYS highlight the biggest missed opportunity to ask questions that give insight into customer behavior or invalidate the idea.
]]>
<![CDATA[Three ways I learn with AI]]>https://interjectedfuture.com/three-ways-i-learn-with-ai/68bc3e0779597a0001ccf4acThu, 25 Sep 2025 15:58:45 GMTYou don’t “learn” from a chatbot. You consume. There’s a difference. They’re spewing slop and people are gobbling it up (and eroding their intelligence in the process). Misinformation is everywhere. Please don’t rely on a chatbot for anything important. They can’t replace real human connection
- random 🦋 accountThree ways I learn with AI

No. You’re holding it wrong. You can definitely learn from LLMs, but with caveats. It has to be done with critical thinking, where you doubt it every step of the way on things you can easily verify.

LLMs definitely have a shape to them. Like a knife, you need to find the part of it–the handle–that has the right affordance for the human hand. In the case of an intelligence tool like an LLM, you're trying to find the affordance for the human mind. As easy as they are to get started with as a chat conversation, most people haven't figured out how to hold these tools yet.

Here are three ways I consistently and heavily use LLMs to learn:

  • As a patient teaching assistant when reading academic papers.
  • To get quickly acclimated and acquainted in a new programming language or framework ecosystem.
  • As a sounding partner in software architectural design work.

Reading academic papers

Academic papers are dense with jargon, and rightly so. Their main purpose is to communicate with peers to move the state-of-the-art forward. But that also means they can be inscrutable to outsiders, even academics outside the field. The effort needed to come up to speed on not just the jargon, but the requisite concepts to understand the paper, requires a huge activation energy.

LLMs lower this bar significantly. When I have to read computer science papers, I upload the PDF (or PDF converted to text) and ask questions about any passages, terms, or concepts I don't understand. I can ask what a concept means and what its implications are. When I don't understand an abstract concept, I can ask it to write out a concrete example. If I still don't get it? I ask it to explain with metaphors in a domain I know well. I can even ask it to explain it to me like I'm five, and then build back up to the academic level of understanding from there.

Accelerated learning with GPT academic paper reader
These last couple of days, I’ve been trying to deepen my understanding of CRDTs. There aren’t too many papers on CRDTs, so you can get caught up quickly by diligently reading the papers. Though I haven’t read them all, I was surprised by my pace of understanding the papers with
Three ways I learn with AI

Importantly, I can spot-check what it's saying, either at random or when something smells fishy, based on my current understanding. Here are some things to do when spot-checking:

  • Look at the exact passage in the paper for a claim or fact it's asserting.
  • Cross-reference an assertion with a quick search on the web.
  • Cross-reference against existing canonical text I know to be authoritative (such as the documentation to a software library)

With this Socratic process, I can get through a paper on a topic I'm mildly familiar with in a day, whereas otherwise it would have taken me the better part of a week.


Learning a new language, framework, library, or ecosystem

Lots of career programmers stick with a single language, framework, and ecosystem for a long time. They only jump ship when they're forced to by work, or when they feel FOMO from "cool kids" touting a new language or framework. But even for programmers who are curious hackers by nature, it takes a lot of effort, a lot of activation energy, to get started with another language and ecosystem.

For one, the only real reason to go to another ecosystem is if it offers something that you can't get anywhere else. For React, it was the one-way data binding and data flow that eased the reasoning burden (compared to frameworks before). For Haskell, it's the pure functional approach. For Lean, it's dependent types. This new capability comes with conceptual ideas that need to be learned.

The LLM can definitely help you learn the core concepts, but the source of the high activation energy is all the chrome around the novel core concept. This is basic stuff like syntax, conventions, or build tools. It's the locations of the forums, documentation, and package management. These are often arbitrary [^1] and set by cultural convention, rather than derived from any reasoned first principle, so you just have to memorize them.

LLMs are great at this kind of administrivia. For a new language, I can ask the LLM to write me a first draft in the language for a project that I actually care about. It's much more motivating than some arbitrary project that the tutorial writer decided upon. In addition, the documentation in the ecosystem could be dense or poorly written, so this helps sidestep that hump.

For example, I wanted to see how the Isabelle proof system worked, so I had the LLMs generate me the code. That was a year and a half ago, so the models weren't very good at inductive proofs. But I got the hang of the syntax much more easily than if I had to read the docs.

Lab note #031 Proving out with Isabelle
In Lab note #029 [https://interjectedfuture.com/lab-notes/lab-notes-029-balancing-nodes-in-a-unicit-tree/] , I was thinking about a different way to construct a unicit tree. Can we create both a description of the splits in a tree and a comparator that would make it a partially ordered set? And with that partially ordered set
Three ways I learn with AI

For new frameworks and libraries, once again, asking for a first draft is a motivating way to see how things work, and if it's a popular ecosystem, I get to see what the conventions are. In most instances, I have to temper or rewrite the implementation to fit my own standards in my project. However, this is often the quickest way to get something up and running, if the scope of the framework or library is well contained.

In all these instances, I often don't have to check the veracity of the generated response by hand, because it's often code, so there are type checkers, compilers, and build systems to check the output. The value LLMs provide here is to synthesize all the administrivia together to produce a working draft to iterate upon. It's much easier for all this stuff to stick when it's in the context of a project you care about.

Software architecture and design

As someone who works alone, I've found LLMs to be a great rubber duck that talks back. The value here is in generating options I might not have thought of, and in pushing back on my thinking and designs. But for this to work, you need to adjust the system prompt to allow the LLM to push back on your ideas, to combat its sycophantic tendencies.

As a short but not exhaustive list, you should put in the system prompt that:

  • The user's ideas are often dumb, shallow, and poorly considered. [^2]
  • It should ask the user pointed questions and help re-direct them toward more insightful lines of thinking.
  • It should not be fawning or complimentary.
  • It should move the user toward more specific, concrete claims backed by solid evidence.
  • It should generate a counter-argument to each of its assertions, as well as a rebuttal to that counter-argument. [^3]

With that in place, I first lay out the premise of the software design problem and the constraints of the situation, and ask it to generate options with pros and cons. After reviewing the options, I critique them, throw away the bad ones, or ask it to elaborate on its reasons for choosing specific options. From there, we keep iterating. I find that the latest reasoning models, like GPT-5 and Claude Opus, are much better at this than previous models. However, they still have weaknesses.

Lab note #070 Vibe-coding is half a skill issue
I’ve been doing an implementation of DBSP, which is a way of doing incremental computation built up from some basic concepts from digital signal processing. While I use LLMs in my daily work to ask it questions, spitball with it, and do some basic stuff, I hardly ever ask it
Three ways I learn with AI

I find that they tend to revert to median thinking in software design, which sits in the Object-Oriented tradition of a larger company. This often means a single class for a single concept when a few functions would do, or an over-complication of relations between subsystems when a simplification would do. Lots of cover-your-ass chrome, such as health checks and monitoring, when it's not needed for a core design.

Unlike the other two types of learning, this task isn't grounded in easily verifiable facts. What makes the process valuable is the iteration of different scenarios and options, and the articulation of my own tacit knowledge.

First, this process of iterating over different options and solution spaces makes me better at software design. Instead of asking LLMs for "the best solution," I get lots more practice considering options under different scenarios. When the LLM comes up with a great solution, I ask how it got there, and what it knew or considered. I also examine what faulty assumptions I had that led me away from a good solution.

Second, the failure modes outlined above are helpful, because they force me to articulate my tacit knowledge of why these are bad solutions. In order to leverage this tacit knowledge as a compounding advantage, I'm starting to record these failures as system eval traces, in hopes of doing error analysis and prompt corrections later on. But for now, I'm just writing these corrective guardrails by hand, because asking the LLM to generate them results in subpar generalizations of engineering principles. Either they're too specific to the problem at hand, or they take commonly known engineering aphorisms and apply them to the shallow surface of the design problem, instead of to the core reason why I rejected the solution.

Don't abdicate your seat at the table

The temptation with LLMs is to absolve yourself of critical thinking, to stop questioning everything the LLM says back to you. But to do so is to stunt your learning. Despite the hype, I don't think software engineers will go away completely. I do think our numbers will greatly diminish, so to remain one, you have to get much better. And to do that, you'll have to keep on learning.

The only time you should abdicate your thinking to LLMs is when the task at hand is administrivia: the knowledge that will be outdated in 15 years. If you've ever read a COBOL manual on some general software principle, you'll know what I mean.


[^1]: If you've been around the block enough, some ecosystems do take these arbitrary conventions from each other. The build tool in C is make. In Ruby, it's rake. In Lean, it's lake.

[^2]: To combat the sycophantic tendencies, I use a variation on Maggie Appleton's prompt.

[^3]: Got the argument/counter-argument/rebuttal from Prof Sarah Paine.

]]>
<![CDATA[Lab notes #074 Focused in a direction]]>It's been almost three months since the last lab note, and I lament that I didn't document the process and the shifts that were happening with my work. But then again, I was busy writing code and talking to potential users.

Back at #073, I was

]]>
https://interjectedfuture.com/lab-notes-074-focused-in-a-direction/68c7528f463db9000104c30cMon, 22 Sep 2025 16:00:03 GMT

It's been almost three months since the last lab note, and I lament that I didn't document the process and the shifts that were happening with my work. But then again, I was busy writing code and talking to potential users.

Back at #073, I was playing with an LLM chat app that gave you product recommendations. I found myself doing it more lately with GPT and figured that more and more people would do it too. ChatGPT didn't give a great experience with it, so I figured it was worth a shot. However, two things stopped me in my tracks.

First, it's an uphill battle fighting with product providers to give you accurate product information. So most of the time, you're going to be doing an arms race with crawlers. That didn't seem particularly appealing to me.

Second, I had an experience that was more in line with the stuff that I'd been doing in these lab notes: I had my agent read my notes.

I had been using Roam Research for years, never feeling the need to switch apps the way productivity junkies do. But one day I realized that while LLMs are pretty good now, when they fall on their face, it's usually because they simply don't have the right context. Hence, if I just exported my Roam Research notes into Obsidian, I could misuse Claude Code to read my notes for me.

The result was electrifying. It felt like talking to a friend who was into all the same things I was into, and I never had to explain any context. Around the same time, I noticed other people discovering the same thing: that you can "misuse" Claude Code for non-coding tasks. In mid-July came the final push. Every.to, a company whose blog I follow, put out a podcast episode in which they talked about how they operated as an AI company. They basically document their ways of working and use agents to help them automate, or partially automate, it. That brought focus and clarity to the overall picture of all the pieces I'd been collecting and researching over the years.

I started to try working this way. I took Hamel and Shreya's system eval course, and misused Claude to help me do system evals as a hacked-together test. I wrote about the process, and it was a glimpse into a future way of working.

System eval with Obsidian and Claude Code
Lately, I’ve been experimenting with a different system eval stack than what you might normally hear about: Obsidian and Claude Code. Typically, Obsidian is used as a personal note taking app/knowledge base and Claude Code is used as a coding agent. I’m misusing both to hack together a system
Lab notes #074 Focused in a direction

Encouraged, I tried to find a way to leverage LLMs in engineering. It's not quite as easy: vibe coding in production is nothing like vibe coding ephemeral apps. It actually takes quite a bit of scaffolding to make it work.

The Hidden Scaffolding Behind Production Vibe Coding
I’m not near a level where it’s 10k LOC per day, but the volume of commits is noticeably higher w/ Claude/Codex now. Back in Feb/Mar when Claude Code came out, I was writing an implementation of DBSP (incremental computation with some mathy stuff) and I couldn’
Lab notes #074 Focused in a direction

While this is better than what I was doing before, I'm still trying to find something that meets my standards for code. I'd like to accelerate with LLMs like others do, but when I'm in control, I have a bar that LLM designs just can't seem to reach at the moment. I'm still working out the kinks.

This time it feels different. It feels like I'm doing customer development right, and like I'm in on an open secret; I want to find others in on that same secret. So for now, I'm just doing customer development and building toward something usable and launch-able.

I'll talk more about it all later, but in the meantime, someone had asked me for podcast recommendations for biking or swimming, so here's my one-time list.

Sarah Paine Lectures on Dwarkesh Podcast
Expert on China and Russia gives an engrossing walk-through on recent world history. Introduces concepts like continental vs maritime powers, death ground, and mental tools like argument/counterargument/rebuttal. Dwarkesh jokes that on a per minute basis, he’s really the host of the Sarah Paine podcast and occasionally does some AI interviews on the side. I highly recommend her lecture series for its clarity, perspective, and counter-arguments.

Computer Enhance
Casey Muratori finished up Handmade Hero and is now doing a for-pay ($9/month) series on how to write performance-aware code. It starts with a walkthrough of how CPUs are put together, how to read x86 assembly, memory mapping, RAT, execution ports, SIMD, etc. I listen to this on my walks in the early morning. I don’t know of anywhere else on the internet this is being taught.

Fall of Civilizations
For a longer perspective on history, this covers the rise and fall of various civilizations around the world. People meme about the Roman Empire, but the topic is more expansive and interesting than just the Roman Empire. The usual question of how to retain the founder ethos is an apparent through-line of all these stories. Also highly recommend.

Dalton + Michael
The comedy duo is back. Besides tackling broader topics related to startups, I like the vibe—it makes me want to be in the same room as the two of them. Where the fun happens over beers, but there’s no beer. They have a previous series with YC, if you happen to miss them somehow.

This American Life
The back catalog will keep you preoccupied for months. Great pacing and storytelling. I particularly recommend Act 2 of this episode where a little girl had recurring dreams of being a superhero. She made a list of all the skills she would need to master: martial arts, evasive driving, and bomb defusal. She actually checked off most things on the list...

Game Maker’s Toolkit
I’d always been fascinated by games, but recently I’ve been interested in how games are designed, and how you find the fun. Product design is much harder here, and even then, this is analysis of a game after it’s done, rather than of the process of making one. Still, I think about how much of this is applicable to application design without being a gross over-application, like the “gamification” trend of the 2010s.

Howtown
How do we know what we know? For example, how do we know the distance from the earth to the sun? There’s something called the Cosmic Distance Ladder (part 2), where we can calculate ever increasing larger distances, from the size of the earth to the most distant galaxies. While that question is on 3blue1brown, Howtown takes up that mantle on other questions, such as how do we know what Quetzalcoatlus looked like? How do we know that earth’s land masses were in a different configuration in the far past?

]]>
<![CDATA[What is Algebraic about Algebraic Effects?]]>https://interjectedfuture.com/what-is-algebraic-about-algebraic-effects/68c44e1179597a0001ccf5b4Mon, 15 Sep 2025 16:03:33 GMTwhat does the word "algebraic" mean when used in the context of programming langs?
- a random tweetWhat is Algebraic about Algebraic Effects?

I'd wondered the same thing about "algebraic effects", and was excited to find a talk on YouTube titled What's Algebraic About Algebraic Effects and Handlers? Unfortunately, I'm not the target audience. As an engineer who doesn't shy away from math, I was still out of my depth.

I spent some time this past spring looking into algebraic effects, and I think I have a decent answer to the question.

Algebra in the context of programming

My view of "algebra" in the context of programming is that it's a particular kind of compositionality, one with structure.

In contrast, mainstream developers often talk about compositionality as just two objects or functions that can interoperate because they share an interface, with not much more that can be inferred about the properties of that interoperation.

So oftentimes we get a collection of objects and functions that go together in an arbitrary way, according to the taste of the developer who wrote them. If they're any good, it feels intuitive. But more often than not, it feels arbitrary. The effect is magnified if you look into the codebase. To a newcomer, it feels like a mess, in the same way that a house built by piling stones high feels like a mess: there's no apparent or easily recognizable structure.

A tangential detour into abstract algebra

In abstract algebra, structure is where you take some math object 𝛂 (like an int or a matrix) and pair it with an operation (like + or *), and you say: integers can be composed with op `+`, but we can ALSO infer properties of these combos--or laws.

A common one we know: the integers (ℤ) with addition (+) have implied properties that always hold. The elements (ℤ), the op (+), and the properties together constrain outcomes, and this is what gives us structure. A house with structure feels like it's built with arches, rather than being a pile of rocks. So what are the properties of (ℤ) and (+)? Due to how ℤ and + are defined, we get these properties:

1. Closure: ℤ + ℤ always gives you another ℤ.

Sometimes devs write code that doesn't give you back the same thing.

2. Associativity: (a + b) + c = a + (b + c) where a, b, c are in ℤ.

This is familiar, as it was drilled into us in grade school. But often devs don't write code that fulfills this property.

The last two are:

3. Identity: ℤ has an element that leaves any element unchanged when we use +.
Here, it's zero: a + 0 = a

4. Inverse: every ℤ has a matching ℤ that gives us the identity when we use + on it: a + (-a) = 0, where a and -a are in ℤ.

Taken together, math peeps gave this kind of structure a name: a group. So if someone says [a struct] and [an op] together form a group, I can automatically assume those properties. It's a shorthand.

If you add even more constraints/properties to how ℤ and + behave together, you get another algebraic structure. There's a whole host and families of these. So if we add another constraint, we get an Abelian Group:

5. Commutativity: a+b = b+a, where a, b are in ℤ
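These laws are easy to spot-check in code. Here's a randomized sanity check in Python (a sketch for illustration; asserting on random samples is a smoke test, not a proof):

```python
import random

# Spot-check the Abelian group laws for (int, +) on random samples.
for _ in range(1000):
    a, b, c = (random.randint(-10**6, 10**6) for _ in range(3))
    assert isinstance(a + b, int)         # 1. closure
    assert (a + b) + c == a + (b + c)     # 2. associativity
    assert a + 0 == a                     # 3. identity
    assert a + (-a) == 0                  # 4. inverse
    assert a + b == b + a                 # 5. commutativity (Abelian)
```

A dependently typed language could prove these laws hold for all integers; here we can only sample.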


Surmounting the network with algebra

Why constrain data structure and op pairings like this? It's quite useful if you want to guarantee specific properties of your system. For example, it's well known that syncing is hard, because of the Eight Fallacies of Distributed Systems.

  1. The network is reliable;
  2. Latency is zero;
  3. Bandwidth is infinite;
  4. The network is secure;
  5. Topology doesn't change;
  6. There is one administrator;
  7. Transport cost is zero;
  8. The network is homogeneous.

That means your data, when sent over the network, will likely arrive out of order. Worse, clocks can be out of sync, so it can look like data arrived from the future. How can we tame the underlying unreliable system? By constraining our data and operations to have properties.

CRDTs are nowadays used to enforce eventually consistent syncs. They achieve this by pairing a data structure with a merge operation, which together form an algebraic structure called a semi-lattice. The properties of a semi-lattice are:

  • Closure: For all a, b in the set S, the result of a ∘ b is also in S.
  • Associativity: a ∘ (b ∘ c)=(a ∘ b) ∘ c for all a, b, c ∈ S.
  • Commutativity: a ∘ b = b ∘ a for all a, b ∈ S.
  • Idempotence: a ∘ a = a for all a ∈ S.
💡
`∘` means `merge op` and `∈` means `is an element of set`

Together, this is enough to counteract the network mixing up your data when sending it over the network. I wrote about that here:

A simple way to understand CRDTs
🔰If CRDTs (Conflict-free Replicated Data Types) are new to you, check out this previous post. There’s a simple way to understand CRDTs: It leverages algebra to unmix the inevitable mixing of data when syncing over an unreliable network. Why are networks unreliable? For one, it’d be really expensive to build
What is Algebraic about Algebraic Effects?

So by constraining the power of what our code can do, we can ensure the system has specific desirable properties that achieve the goal of syncing data over an unreliable network. It's where we say: "If we compose this kind of data structure in this constrained way with this kind of merge function, then we can guarantee these properties always hold. And with this structure, our data can survive sync over an unreliable network with other syncers."
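As a minimal illustration (my own sketch, not from any particular CRDT library): a grow-only set, one of the simplest CRDTs, uses set union as its merge, and union satisfies all four semi-lattice laws:

```python
# A grow-only set (G-Set), one of the simplest CRDTs.
# Its merge is set union, which forms a join semi-lattice.
def merge(a: frozenset, b: frozenset) -> frozenset:
    return a | b

a, b, c = frozenset({1}), frozenset({2}), frozenset({1, 3})

assert isinstance(merge(a, b), frozenset)              # closure
assert merge(a, merge(b, c)) == merge(merge(a, b), c)  # associativity
assert merge(a, b) == merge(b, a)                      # commutativity
assert merge(a, a) == a                                # idempotence
```

The consequence of these laws: replicas can receive updates in any order, duplicated or re-sent, and still converge to the same state.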

From Monads to Algebraic Effects

This is why people also like Monads. Monads are about how to compose code, but with specific properties (Monadic laws) so we can achieve some goal in how they compose. I won't go into it here, as this is already long, but that's the core idea.

However, not all types of Monads compose well together. Here's where I'm out of my depth, but I've read and I'm told that this is why there are Monad Transformers, so you can fit different domain Monads together.

Hence, some people have started looking at Algebraic Effects, as a way to achieve the same compositional powers of monads, but in a different way. Most descriptions of Algebraic Effects actually ignore the `algebraic` part, because describing `effects` is already a big leap.

The effects part is often explained as "resumable exceptions". I wrote a short description of what algebraic effects are from that perspective, so I won't expound on it here.

Elm should have had Algebraic Effects
If Elm had Algebraic Effects, it would have made it more adaptable to these multi-message processing.
What is Algebraic about Algebraic Effects?

But the algebraic part of algebraic effects is that the effects you raise as "resumable exceptions" can be composed together! And not just in any way: you design them so that, when composed, they have *guaranteed properties*, just like the structures you saw above!

For example, if we had a key/value store that we interface with using get and put, we could express what we expect to happen through some algebraic properties.

  1. Idempotence of consecutive reads (get-get): get k; get k ≡ get k

This says that two consecutive gets are functionally equivalent to a single get. This guarantees that get is a pure observation: it doesn't consume or advance anything. If this law didn't hold, reading could "drain" or "advance" some hidden cursor. By making it a law, we make the behavior explicit for our users, so they're not surprised by bugs down the line when their assumptions veer from this property.

  2. Last write wins (put-put): put k v1; put k v2 ≡ put k v2

Easy. The two puts together are functionally equivalent to executing only the last one. Hence, the last put is the value that's currently sitting in key k. This encodes overwriting semantics; without it, put might append, merge, or accumulate, which wouldn't be what users expect.

  3. Read after write (put-get): put k v; get k ≡ put k v; return v

Executing a put and then an immediate get is the functional equivalent of just executing the put, but then just returning the value v you already have in hand, instead of executing get. This is important to guarantee the consistency of reads right after writes. Without this, you could write v and then not see v immediately, which would break the intuitive model of state in a key/value store.

  4. Write back same value (get-put): get k >>= (λv. put k v) ≡ return ()

If you read the value of a key and then immediately write it back unchanged, that's functionally equivalent of doing nothing (returning unit).

💡
You can think of >>= as saying "and then...". So rule 4 in javascript pseudocode might look like:

get(store, key).andThen((val) => put(store, key, val))
💡
in return (), () is called unit, which is the way functional programmers denote "no meaningful value", which is effectively what C programmers use void for. They're technically different, but in practice, they're used for similar purposes.
  5. Independence across keys: for k1 ≠ k2:
put k1 v1; put k2 v2  ≡  put k2 v2; put k1 v1
get k1; get k2        ≡  get k2; get k1
put k1 v; get k2      ≡  get k2; put k1 v

Operations on different keys commute, and the store treats each key as an independent cell. This is what makes it a key/value store, rather than some entangled data structure.
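These laws can be spot-checked against a toy in-memory store. This is a hypothetical sketch of mine, with `get` and `put` as plain functions rather than real effect operations:

```python
# A toy in-memory key/value store for spot-checking the laws above.
# A plain-Python sketch, not an actual effect system.
store = {}

def put(k, v):
    store[k] = v

def get(k):
    return store[k]

# Law 1 (get-get): reading is a pure observation; no hidden cursor advances.
put("k", 1)
assert get("k") == get("k")

# Law 2 (put-put): last write wins.
put("k", 1); put("k", 2)
assert get("k") == 2

# Law 3 (put-get): a read right after a write sees the written value.
put("k", 3)
assert get("k") == 3

# Law 5 (independence): operations on distinct keys commute.
store.clear()
put("a", 1); put("b", 2)
after_ab = dict(store)
store.clear()
put("b", 2); put("a", 1)
assert dict(store) == after_ab
```

Law 4 (get-put) holds here too: writing back a value you just read leaves the store unchanged. The point is that a plain dict happens to satisfy these laws, but nothing in the language enforces them; a "store" that appended to a list on put would type-check just as well.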

Hence, just because you are writing effects doesn't automatically mean they're algebraic. You have to consciously design them to be, in order to give your users the properties and guarantees you want them to have. Most current programming languages have no way of enforcing these equational axioms, so even the esoteric languages that feature algebraic effects don't try to enforce them.

Languages that feature dependent types, such as Coq, Agda, Idris 2, and Lean, are the only ones that can encode these equational axioms explicitly and prove their veracity. Typically, these languages are used by mathematicians to do proofs. But interestingly, Lean has been gaining a lot of momentum, and it can compile to C, making it a viable in-road to using these ideas in practice.

And, in my own words, that's what's algebraic about algebraic effects.

Epilogue

Alan Kay was known to lament that a 1-million-line code base is unconscionable. It's no more a skyscraper than a pile of rocks is, because there's often no structure. Eventually we figured out arches: structures that give strength with less material.

Hence, we can build higher without using more material. By analogy, we're starting to discover what this structure looks like in software. And it looks like math. There's a lot of resistance to this, and there will be for a long time.

And maybe with LLMs, it might not matter for a wide swath of applications. But still, there's ever progress moving forward in this direction, where these pure functional programming or math-y ideas filter down to more mainstream languages.

]]>
<![CDATA[The Hidden Scaffolding Behind Production Vibe Coding]]>
Tweet source

I’m not near a level where it’s 10k LOC per day, but the volume of commits is noticeably higher w/ Claude/Codex now. Back in Feb/Mar when Claude Code came out, I was writing an implementation of DBSP (incremental computation with some mathy

]]>
https://interjectedfuture.com/vibe-coding-for-experts-needs-scaffolding/68c06c3079597a0001ccf4f0Tue, 09 Sep 2025 18:48:54 GMTThe Hidden Scaffolding Behind Production Vibe Coding
Tweet source
The Hidden Scaffolding Behind Production Vibe Coding

I’m not near a level where it’s 10k LOC per day, but the volume of commits is noticeably higher w/ Claude/Codex now. Back in Feb/Mar when Claude Code came out, I was writing an implementation of DBSP (incremental computation with some mathy stuff) and I couldn’t get Claude Code to work for me. The code would often not be structured well for a long term requirement, or it was just subtly wrong in a way I’d discover later. I thought maybe it was just my domain, since it wasn’t a web app—so maybe not enough training data. Turns out I was just holding it wrong.

What didn’t work:

  • Thinking that my first draft describing what I wanted would have enough detail. There's often stuff that's missing.
  • Trying to review every change as it was happening. If you’re going to vibe code, you have to let go and just ride the wave.
  • Working at the granularity of tasks I’d give myself. It needs to be broken down into intern-sized tasks.
  • Hammering away at getting it to rewrite the code when it isn't working.

There are two modes of vibe coding. The first is for ephemeral apps, where you one-shot a small tool (maybe even a throwaway) that you could have written yourself, but the juice wasn't worth the squeeze. This can be very helpful if you get in the habit of noticing that itch. The second is for production engineering work, and this is what people are after. From the way people talk about it, I didn't realize it takes a lot of scaffolding to make it work.

A good mental model for me on LLM responses is that it's a midwit, median response if I have no real taste for nuance on the topic. But if I do, as in engineering, I have to make explicit much of the tacit knowledge in my head as scaffolding. Agents need a lot of guardrails in place to generate acceptable code that matches your taste and your organization's tastes.

What does the scaffolding look like?

First, the typical stuff espoused by engineering culture:

  • work in a language that uses type checking, and don't allow "any" types sprinkled liberally in your code base like confetti on Mardi Gras.
  • if you have a build process, make it easy to invoke from end to end.
  • fast unit tests, with around 80% coverage.

Claude often makes dumb mistakes, and if it has scaffolding that helps check its implementation, it can keep iterating until it finds a solution. Remind it to use all these checks before it declares itself done.
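One way to package those checks is a single "preflight" entry point the agent can be told to run. This is a sketch under my own assumptions, not a prescribed tool; the commented-out commands are placeholders for your project's real typecheck/build/test steps.

```python
# Hypothetical preflight runner: execute each named check and report
# which ones failed. The commented commands are placeholders -- swap in
# your project's real typecheck/build/test invocations.
import subprocess
import sys

def preflight(commands):
    """Run each (name, argv) check; return the names of failed checks."""
    failures = []
    for name, argv in commands:
        result = subprocess.run(argv, capture_output=True)
        if result.returncode != 0:
            failures.append(name)
    return failures

checks = [
    # e.g. ("typecheck", ["npx", "tsc", "--noEmit"]),
    # e.g. ("tests", ["npx", "vitest", "run"]),
    ("noop", [sys.executable, "-c", "pass"]),  # stand-in check that always passes
]
print(preflight(checks))  # an empty list means every check passed
```

The point is less the script than the contract: one command the agent runs before declaring itself done, whose non-empty output means "keep iterating."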

When it comes to tests, some people swear by writing tests by hand. Others are ok with generating the tests. Whatever you do, don't write tests and implement code in the same session. Claude, like people, can try to find ways around obstacles if what it's doing isn't working. So it can sometimes "pass the tests" by skipping them, or by altering the tests to accept the bad code.

Next, it helps a lot to write down project-wide or company-wide conventions in the agent prompt. This lives in AGENTS.md for Codex and CLAUDE.md for Claude; I just soft-link one to the other.

What goes in there?

  • conventions of the code base, such as the tools you use (npm vs pnpm) and where to find them.
  • guidelines for good engineering practice. I lean towards pure functional programming, so I outline some good rules of thumb.
  • asking it to write comments prefixed with NOTE that detail WHY the code was written the way it was, when that's not apparent from reading the code itself.
  • a directive to push back on my thinking to tamp down the sycophancy.

Mitchell Hashimoto (of HashiCorp and Ghostty) has a habit of writing long comments explaining the intention behind a single line of code. This helps the LLM understand what's going on in the codebase as it reads it. I instruct it NOT to delete any comments with NOTE, and to update the comment if the code it refers to changes. This is another piece of scaffolding that can increase the chance of coding-agent success.

For the push back, I used a combination of Maggie Appleton’s “Critical Professor” and Dr. Sarah Paine’s “Argument-Counterargument-Rebuttal”. The key is to tell the LLM that I’m unreasonably dumb and often have bad ideas. Even then, the effect wears off the longer a session goes. Here’s what I use:

You are a critical professor/staff engineer. Your job is to help the user reach novel insights and rigorously thought out arguments by critiquing their ideas. You are insightful and creative. Ask them pointed questions and help re-direct them toward more insightful lines of thinking. Do not be fawning or complimentary. Their ideas are often dumb, shallow, and poorly considered. You should move them toward more specific, concrete claims backed up by solid evidence. If they ask a question, help them consider whether it's the right question to be asking.

- You will also try to ask poignant questions of the user as a way to spur further thinking.
- Don't be afraid to be contrarian, but you have to be correct.
- In the final response, concentrate on elucidating the core idea and all the supporting evidence for it.

Before answering, work through this step-by-step. If you think a point is important, expand it from a bullet point into a paragraph:

1. UNDERSTAND: What is the core question being asked?
2. ANALYZE: What are the key factors/components involved?
3. REASON: What logical connections can I make?
4. SYNTHESIZE: How do these elements combine?
5. CONCLUDE: What is the most accurate/helpful response?
6. COUNTERARGUMENT: Construct another viable line of argument
7. REBUTTAL: Construct an orthogonal angle that counters the counterargument.

Now answer: [MY ACTUAL QUESTION] in a final paragraph summarizing and concentrating your final thought, elucidating and expanding important points.

Planning with Details

Subsequently, this directive is important in creating the product requirements doc (PRD). This addresses my initial mistake of thinking that whatever I wrote down in my first draft would have sufficient detail for the LLM to write good code. The breakthrough was realizing that I didn't have to write all the details down myself. Chatting with a good reasoning model (first o3, now GPT-5 thinking) with a critical prompt was very helpful in first getting an outline of all the different angles and considerations for a chunk of features. Once I'm happy with our decisions, I ask it to tell me if there are details that are unclear or aspects of the design I'm missing, and we iterate again.


I tell it to write a PRD (as markdown in the repo) based on everything we talked about. Then I review it, looking for unexpected misunderstandings. If it's a small thing, I'll edit it by hand; if not, I'll do more chatting. Some people don't even do this chatting: they've written a prompt for the LLM to also do research into which libraries to install. I like control over the dependencies, so I do that by hand.

Next, in a new session, I’ll ask Claude to read the PRD and break it down into tasks. The tasks need to be small—intern-sized tasks: 1pts (less than 2 hours), 2pts (less than 4 hours), or 3pts (about a day). And it needs to list out its prerequisite tasks. Then based on the prerequisites, group the tasks into groups that can be run in parallel, based on their dependency list. So then I get groups of tasks that are only dependent on previously completed group tasks and not the tasks in the same group. I get it to write this task list down as a new markdown file, also in the repo.
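The grouping step above is just a topological sort into "waves." Here is a sketch of what I'm asking Claude to compute, with made-up task names; the point system and graph are illustrative only.

```python
# Group tasks into waves: every task in a wave depends only on tasks
# from earlier waves, so all tasks within a wave can run in parallel.

def parallel_groups(deps):
    """deps: {task: set of prerequisite tasks}. Returns a list of waves."""
    remaining = {t: set(d) for t, d in deps.items()}
    waves = []
    while remaining:
        # Tasks whose prerequisites have all been completed already.
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle")
        waves.append(ready)
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)
    return waves

deps = {
    "schema": set(),
    "api": {"schema"},
    "ui": {"schema"},
    "e2e-tests": {"api", "ui"},
}
print(parallel_groups(deps))  # [['schema'], ['api', 'ui'], ['e2e-tests']]
```

Each wave only depends on previously completed waves, which is exactly the property that lets multiple agents work on the same group simultaneously.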

Now, with that, I ask it to write a markdown file for each task in the task list, with details in its description drawn from the PRD. The idea is that these details will double as a prompt for the coding agent. You don't have to write each task as markdown in your repo; you can instead tell it to create tasks in Linear via MCP with the same details.

This takes about a day or two. All this pre-flight work isn’t always worth it. You’d have to be the judge of how much work there is, and how far you can see in advance. In my case, it generated about 78 bite-sized tasks, grouped by tasks I can run in parallel.

Riding the waves and letting agents go to town

Finally, I use git worktrees to separate each coding agent's workspace from my own and from each other's. I'll pick the stuff that's harder to work on for myself, and delegate the easier stuff to Claude Code (now on Sonnet 4). So far, I just use two in parallel, and ask each to read its task (either from markdown or Linear via MCP) and do it, while reminding it to typecheck, build, and run tests before declaring itself done. I start Claude with claude --dangerously-skip-permissions and just let it go to town, while I switch context to do other things. But I still review the task details with a cursory glance before starting it.

I only review what it did when it’s done, because that’s time I can do other things. Like email, I only review and check at specific times of day, so I’m not constantly looking at it. It’s ok, the agents can wait for me. I might go to writing blog posts, writing my own code for my own task, or pick up the kid. Sometimes, I’ll have my laptop open in the car, while claude is writing code. I’m not driving either, since the car is on autopilot. I’m mostly just sitting there, listening to Sarah Paine podcasts.

Still, with all this scaffolding, sometimes it one-shots the task, and sometimes it goes off the rails. Because the changes are bite-sized, they're much easier to review. And because it has a lot of detail on the task, along with all the guardrails and scaffolding, it has a higher chance of getting it right. If it generates bad code, that's ok. If it can't fix things after a push or two, I just blow away its changes (yay git) and start a new session to try again. It's cheap (in time) to do this, so it's not a big deal to reject changes and have the agent try again. It doesn't care.

I'm still relatively early in this change to my workflow and the way that I work, so I'll keep iterating on it. And I think that's also the point. It's going to be hard to take other people's prompts and inject them word for word into your own workflow. You have different tastes and different ideas of good and nuance, so you're going to have to find your own way of articulating your own values. That takes discovery. It's a mistake to take the tack of copying what others are doing without much thought, thinking that it's a cheat code; that's not what excellent operators do. If you just do what everyone else does, you will not stand out culturally to potential employees or customers. The well of nuance goes deep.

Vibe coding in production is a task for experts

This can turn engineering into partly a management role. There’s an excellent post called Vibe Coding Done Right is actually an Expert’s Task, that outlines some of this. Some choice quotes:

If you are an individual contributor who usually does not like to train interns because you find that they take more time than they are helpful, then don’t vibe code
Vibe coding turns any individual into the CTO leading a team of 20, 30, 50, 60 interns. You immediately get a massive team. It takes time and experience to actually handle a group like this correctly and to make it be productive. Making all of 60 interns not break the performance, correctness, or maintainability of your code is very difficult. So I do not recommend this to people who are still “up and coming programmers”. But if you’re a bit more senior and starting to grow your group, well this is then a cheap way to accelerate that.
Vibe coding is not useful if you need it to solve a particular problem. You still do the hard stuff.

Hence, I'm still writing code by hand for the most difficult tasks, or the ones that I think have the most nuance. Some companies can get by with just vibing the whole thing and never reading the generated code. But I still like to understand what my system is doing at all levels and be able to hold it all in my head, as I think that correlates with quality.

You won't lose your engineering club card

Lastly, it might be hard for senior/staff engineers on your team who love to code to adopt this workflow. The way I think about it is this: Trekkies spend a lot of time arguing over Star Trek trivia, but the one thing they never argue over is whether Geordi La Forge is an engineer. Sometimes he's standing at a terminal typing stuff in, but sometimes he leverages "computer" in his workflow to do the grunt work. Have them watch this clip where he's vibe coding on the Holodeck.

An engineer at work vibe coding

He's not writing the ray tracing code, but he's asking the right questions based on his expertise. Would Riker have known the right questions to ask to extrapolate the mysterious figure? No. That's where years of honed expertise come in. Engineering isn't just code.

Hope that helps. I’ve also discovered that engineering isn’t the only thing that can benefit from this kind of workflow. I’ve started to do system evals or customer dev coaching with a similar setup. I’ve written about the system eval setup, and I’d be happy to hop on a call if you want to compare notes on your non-coding setups with Claude Code. Slide into my DM.

]]>
<![CDATA[Elm should have had Algebraic Effects]]>https://interjectedfuture.com/elm-should-have-had-algebraic-effects/6873c7696f2f9c0001d5a872Tue, 02 Sep 2025 19:36:43 GMT

While the world has moved on from Elm, I still think about its simplicity a lot and how that was achieved, and what gaps there were in its design. The gaping weakness I'll talk about today was also the result of its strength.

Elm apps have a single unidirectional flow that stemmed from a core design foundation of "the view is a pure function of application state", succinctly written as f(state, msg) = view. This is a great fit if a single user action results in a single message that makes a small state change.

But if a single user action results in multiple messages, or the single message requires multiple side-effects, then the simplicity becomes a hindrance–the developer must now trace multiple cycles through the architecture. It's no longer simple to trace through what the system does in response to a single user action.

If Elm had had Algebraic Effects, it would have been more adaptable to this multi-message processing. In fact, Algebraic Effects, as a control flow mechanism, would simplify many other kinds of complicated workarounds we see today in other systems.

Elm's Outsized Contribution

Even if Elm isn't a very popular language and frontend framework today, it's had a drastic and important influence on web frontend architecture through The Elm Architecture (TEA). After shedding the signals from its Functional Reactive Programming roots, Evan Czaplicki (creator of Elm) said this architecture sort of fell into place.

Elm should have had Algebraic Effects
The Elm Architecture

TEA is conceptually simple. Model the view as a pure function of application state. Views can generate messages in response to user actions, like clicks. And based on any messages that come in from the outside, we immutably modify the state. Any state change is reflected by the pure function that transforms state into a view. The state change function is a giant switch statement that routes messages to the appropriate part of the code that makes a small state change.

The entire thing is a unidirectional dataflow. No rat's nest of dependency graphs. No race conditions. It's easy to keep in your head what's going on at any moment of the lifecycle in response to user input. And the big switch statement routing messages means that you only need to trace a single branch of the pure function to understand what it's doing.
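To make the shape concrete, here is The Elm Architecture in miniature, as a Python sketch (not real Elm; the counter model is mine): immutable state, one big message switch, and a view that is a pure function of state.

```python
# The Elm Architecture in miniature: update routes each message to a
# small immutable state change; view is recomputed purely from state.

def update(msg, model):
    """The giant switch statement: route a message to a state change."""
    if msg == "increment":
        return {**model, "count": model["count"] + 1}
    elif msg == "decrement":
        return {**model, "count": model["count"] - 1}
    return model

def view(model):
    """The view is a pure function of state -- no hidden mutation."""
    return f"<button>clicked {model['count']} times</button>"

model = {"count": 0}
for msg in ["increment", "increment", "decrement"]:
    model = update(msg, model)
print(view(model))  # <button>clicked 1 times</button>
```

Tracing what happens for any message means following exactly one branch of `update`, which is the simplicity TEA buys you.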

Compared to the object-oriented Model-View-Controller, the Observer pattern, bidirectional binding in Angular/Backbone/etc., or a pile of event handlers for event callbacks in jQuery, The Elm Architecture was devastatingly simple. In fact, the original React Flux architecture, the subsequent Redux library, and the now-standard React useReducer are of the same lineage. They are all unidirectional flows from application state to the view, cycling back to state changes in response to user-generated events from the view.

Performing Side Effects

What if you want to perform side effects in your Elm-based web application? Elm is a pure functional language, and in a program where the view is a pure function of the application state, how do you fetch from the network or generate random numbers?

Elm should have had Algebraic Effects
You can still be pure if you let the runtime handle the dirty work for you.

The short answer is, you don't–at least not directly. Instead, you unwind the callstack of your pure function generating the view from application state and return a command (or subscription) back to the runtime (Cmd/Sub in the figure above). The runtime then takes that command and executes it for you, and returns the result back to you as yet another message. You can then handle that message like you would any other in the giant switch statement in your f(state, msg).

The great part of this design is that it keeps the same architecture and model, repurposing it for running side effects. The bad part is that you need to trace through multiple Cmd/Msg cycles to figure out what the program is doing in response to a single user action. That increase in cognitive load can overwhelm the advantages of TEA in the first place.

How would that happen? It happens when a single user action generates a message that requires multiple side effects to complete. A familiar example: a button that shows the user a list of blog posts along with the number of retweets for each post. To do that, the application must query the database for blog posts and fetch the retweet counts from a 3rd-party API. This is what would happen:

Elm should have had Algebraic Effects
  • User clicks the button that generates a button_click message.
  • button_click message gets routed by the TEA switch statement to the button_click_handler
  • The button_click_handler returns fetch_posts command back to the runtime.
  • The runtime fetches posts from the database and sends a result_posts message.
  • result_posts message gets routed by the TEA switch statement to the blog_posts_handler branch.
  • The blog_post_handler branch returns the get_retweets command back to the runtime.
  • The runtime fetches retweet counts from the network API and sends a num_retweets message.
  • num_retweets message gets routed by the TEA switch statement to the retweets_handler branch.
  • The change in application state then changes the view.

If you skipped all that, that's the point. It was as tedious for me to write it as it would have been for you to read it. But the gist is this:

The logical, linear progression of a computation to handle a single user action is now broken up across multiple branches in a switch statement. Every time you have to cross the boundary to match messages in order to trace program execution, it's overhead in your cognitive load.

What if instead you could fetch posts and retweets and just resume a pure function call where you left off with the results of the side effect. Notice the lines with perform in this listing. That's where the effects are being called. We don't have to break this up into separate branches of the switch statement every time we need to perform a side effect.

-- effect signatures
effect FetchPosts : () -> List { id : String, title : String, body : String }
effect GetRetweetsBulk : List String -> List ( String, Int ) -- (postId, count)

update : Msg -> Model -> Model
update msg model =
    case msg of
        ClickRefresh ->
            handleEffects model <| \resume ->
                let
                    posts = perform (FetchPosts ())
                    ids   = List.map .id posts
                    pairs = perform (GetRetweetsBulk ids)
                    dict  = Dict.fromList pairs

                    merged =
                        List.map
                            (\p ->
                                { id = p.id
                                , title = p.title
                                , body = p.body
                                , retweets = Dict.get p.id dict |> Maybe.withDefault 0
                                }
                            )
                            posts
                in
                resume { model | posts = merged, loading = False, error = Nothing }

        _ ->
            model

You could run the fetching of posts and retweets in parallel, but I kept it sequential to illustrate the resumption of side effect results.

So instead of multiple messages to handle this sequence of computations, we raise effects every time we need the result of a side effect, and resume the computation after we got the side effect result.

This is what Algebraic Effects do.
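To make that control flow concrete, here is a single-resumption sketch in Python using generators (the closest mainstream analogue; the effect names and canned data are mine, not from any real library). The pure logic "performs" an effect by yielding it; the handler runs the side effect and resumes the logic with the result.

```python
# A single-resumption sketch of algebraic effects via generators: the
# pure logic yields effect requests; the handler performs them and
# resumes the logic where it left off. Data here is canned.

def refresh():
    # Pure logic, written as one linear flow -- no message routing.
    posts = yield ("FetchPosts",)            # perform FetchPosts
    ids = [p["id"] for p in posts]
    counts = yield ("GetRetweetsBulk", ids)  # perform GetRetweetsBulk
    return [{**p, "retweets": counts[p["id"]]} for p in posts]

def handle(program):
    """The effect handler: run each requested side effect, then resume."""
    gen = program()
    result = None
    while True:
        try:
            effect = gen.send(result)
        except StopIteration as done:
            return done.value
        if effect[0] == "FetchPosts":
            result = [{"id": "p1", "title": "Hello"}]   # pretend DB query
        elif effect[0] == "GetRetweetsBulk":
            result = {pid: 42 for pid in effect[1]}     # pretend API call

print(handle(refresh))  # [{'id': 'p1', 'title': 'Hello', 'retweets': 42}]
```

Note this only supports resuming once per effect; true multi-shot resumption needs delimited continuations, which generators can't express.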

A short description of Algebraic Effects

Algebraic effects are not too hard to wrap your head around if you understand a few basic concepts. They're basically a new kind of control flow that can be mechanically described to newcomers as resumable exceptions. However, unlike exceptions, they're meant for everyday control flow.

Most developers now are familiar with exceptions. We wrap a computation in a try/catch block, where the catch block holds the exception handlers, in case they need to be invoked by thrown exceptions. Structurally, algebraic effects are no different: we wrap a computation in a block with an effect handler that executes should the computation raise effects.


But exceptions and algebraic effects differ in that we're taught not to use exceptions as regular control flow, because they're for exceptional cases. And once an exception is thrown, the stack is unwound, and we can never go back to where the exception was raised.

Algebraic effects are different. When an effect is performed (or raised), like exceptions, we start searching up the callstack for a suitable handler for this effect. Once we find the first suitable handler, we execute it to perform some side effect. With the side effect result in hand, the handler can decide to do one of a few things:

  • Resume the computation with the result one or more times.
  • Abort the computation
  • Delegate to another handler higher up the call stack.

If the handler resumes the computation, control flow jumps back to the point of execution right after that effect was raised. To the application developer, performing an effect looks like a simple function call: you raise the effect, and you get a value back. However, the differences are:

  • The function that raised the effect remains pure. Hence easier to test and easier to reason about.
  • A handler can choose to resume multiple times, and computation will resume right after the raised effect with the same call stack every time.

Why would you want a handler to resume multiple times? Maybe the side effect you want is a websocket, and you want it to stream results to your program as data arrives. The handler would then resume your program at the same spot every time it gets a new piece of data.

If the handler aborts the computation, the control flow throws away all the work done since the handled computation began. It's as if we never went that way and did any of the work: we resume execution after the handler as if the computation never ran at all!

There's more to Algebraic Effects, such as how they work under the hood, what they're commonly used for, and their current downsides (it can be slow to look up handlers, but that can be mitigated). Also, what's algebraic about Algebraic Effects? I won't get into any of that for now, but just to say that it would have been a good solution to this problem in Elm.

Designing systems

This kind of control flow is very powerful. Control flow that's typically hard-coded inside language features can now be written with Algebraic Effects, such as:

  • Exceptions
  • Async/await concurrency
  • Generators and Coroutines
  • Break/continue in loops
  • Returns in nested scopes
  • State threading in state monads in pure functional languages
  • Search and backtracking in logic languages like Prolog
  • Transactions, rollbacks, and undos in databases
  • I/O and external interactions in pure functional languages

They also replace a lot of the use cases for monads.

So given the choice of a simple mental model for a narrow band of use-cases, I can understand why Evan Czaplicki wouldn't have implemented it. That said, the simplicity of the mental model didn't end up being a good fit for the situation I often found myself in while building web apps, which was the single-message/multiple-side-effects pattern I outlined above.

Would I have gone ahead and attempted the implementation?

Implementing algebraic effects would also have been hard on top of the JavaScript engine. You can implement single-resumption algebraic effects with JavaScript generators, but not full-blown multiple-resumption algebraic effects. To do that requires delimited continuations to be available in the JavaScript runtime. Your only other alternative is to recreate the callstack in userland and implement delimited continuations in the Elm runtime on top of the JavaScript runtime.

Given that Elm is a compiled language, recreating the callstack wouldn't have been too far off the beaten path. I do know there are attempts to put delimited continuations in Haskell, and Elm is developed in Haskell–maybe there's a viable path there.

The thing to consider carefully is the downsides of algebraic effects. I haven't used them enough to know all the catches, but it seems to me it could be really easy to write spaghetti code with algebraic effects, where the control flow jumps around. It should probably remain a superpower for library writers to do what they need, while everyday application writers use it judiciously in simple ways.


There are up-and-coming languages with algebraic effects. If you want to try them in experimental languages, check out Koka and Eff. In more production-ready languages, they're available in Unison and Multicore OCaml. You can read more in my previous post.

Algebraic Handler Lookup in Koka, Eff, OCaml, and Unison
What are the different ways to do handler look up in programming languages that support algebraic effects?
Elm should have had Algebraic Effects

Let me know if you have other questions.

]]>
<![CDATA[System eval with Obsidian and Claude Code]]>Lately, I've been experimenting with a different system eval stack than what you might normally hear about: Obsidian and Claude Code.

Typically, Obsidian is used as a personal note taking app/knowledge base and Claude Code is used as a coding agent. I'm misusing both to

]]>
https://interjectedfuture.com/system-eval-with-obsidian-and-claude-code/68adfb9b79597a0001ccf1c4Wed, 27 Aug 2025 15:00:15 GMT

Lately, I've been experimenting with a different system eval stack than what you might normally hear about: Obsidian and Claude Code.

Typically, Obsidian is used as a personal note taking app/knowledge base and Claude Code is used as a coding agent. I'm misusing both to hack together a system eval without having to deal with a lot of infrastructure set up.

The tools

I'm using Obsidian to document the eval process as a markdown file to make it repeatable. With it, I can see the entire process laid out and I can link the data and different parts of the process. I can start and redo any part of the process.

Claude Code can then make parts of it "executable". For example, every stage in my eval is a section, and each section is a prompt. I can use Claude Code to read that prompt and execute it.

The structure

Processes are imperative by nature: usually a step-by-step recipe for doing something, transforming data into the format or computation that we need. They can be more complicated, with branches and conditions, but for the most part, the processes you'd write down in a document are organized linearly.

The system eval process is iterative, but if you don't explicitly draw the re-iteration arrows, it can be laid out sequentially. Every step of the system eval is a section, and a section has a few main parts:

A section of the eval

Each section has:

  • A prompt for generating the next stage of data (or the script that computes the next stage of data).
  • The command line for running the script.
  • A link to a listing of the output.

For example, here's a preprocessing step where I convert the user queries into traces. But instead of writing the Python script to do it myself, I just write out a prompt to one-shot the script.

System eval with Obsidian and Claude Code
a one-shot prompt for generating the script needed to do this step of the eval

In order to run the prompt, I then go into the terminal and run Claude Code. Then I ask it the following:

Read the prompt from `Create Traces from Queries` section of @pages/Product Expert for Sales Safari.Eval.md and do what the prompt says

It's able to go into the knowledge base (because it's just files on disk!), find the appropriate file, find the prompt in the appropriate section, and then follow the prompt and one-shot a script. I find that these scripts are simple enough that Claude can get them right if you just write pseudocode of what you want.
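The lookup Claude Code performs here is simple enough to sketch: given a markdown document, return the text under a named heading. The heading level and section names below are my own assumptions for illustration.

```python
# Sketch: pull the text under a named "## Section" heading out of a
# markdown document, the way the agent is asked to find a prompt.

def section_text(markdown, heading):
    collected, inside = [], False
    for line in markdown.splitlines():
        if line.startswith("## "):
            # Entering a new section; only collect if it's the one we want.
            inside = line[3:].strip() == heading
            continue
        if inside:
            collected.append(line)
    return "\n".join(collected).strip()

doc = """# Product Expert Eval

## Create Traces from Queries
Convert each user query JSON into a markdown trace file.

## Error Analysis
Label each trace pass/fail in its frontmatter.
"""
print(section_text(doc, "Create Traces from Queries"))
```

Because the knowledge base is plain files, this kind of retrieval needs no database or API, just the file system and a heading convention.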

System eval with Obsidian and Claude Code

After generating the script, I can run it to create a set of outputs in a directory, using the documented command line, such as

python3 scripts/convert_discord_json_to_md.py sources/lesson_1 queries

And then use that as in the input to the next section with its own prompt and script.

Error Analysis

I can see the list of output traces in the directory from Obsidian, and each trace is just another markdown file. Each one has frontmatter with fields for judgement and annotation, so for now, I'm just doing error analysis that way, rather than vibe coding an app.
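Since the judgements live in frontmatter, tallying the error analysis is a few lines of code. This is a minimal sketch under my own assumptions: a toy parser for simple `key: value` frontmatter lines, with made-up trace contents.

```python
# Tally error-analysis judgements from trace frontmatter. The toy
# parser below assumes a leading `--- ... ---` block of `key: value`
# lines, matching the judgement/annotation fields described above.

def parse_frontmatter(text):
    """Parse the leading `---` block into a dict of string fields."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

traces = [  # made-up trace files for illustration
    "---\njudgement: pass\nannotation:\n---\nTrace body...",
    "---\njudgement: fail\nannotation: hallucinated a product name\n---\n...",
]
tally = {}
for trace in traces:
    judgement = parse_frontmatter(trace).get("judgement", "unlabeled")
    tally[judgement] = tally.get(judgement, 0) + 1
print(tally)  # {'pass': 1, 'fail': 1}
```

Because the traces are human-readable text, the same files serve both hand annotation in Obsidian and scripted aggregation like this.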

Claude Code isn't only for one-shotting scripts. When I need it to categorize axial codes from a list of failed traces, I can write the prompt in Obsidian and ask Claude Code to run it. And if I need to redo it, I have a repeatable record of the prompt I used.

System eval with Obsidian and Claude Code

The salient point

In a way, it's like a poor man's notebook with AI in it. This setup has the following benefits:

  • I have a layout of my entire system eval process and I can re-run them in a repeatable way when I need to. It's too easy to do system evals as an ad-hoc set of prompts and results you think you'll never need to rerun again.
  • With the eval process in Markdown, it's easier for me to show and share with other people without needing them to install anything or get them an account.
  • Markdown is just text files, so I can version control it with git.
  • The volume of data isn't crazy, so I don't need a database. I can just write them in files and work with human-readable text.
  • If you have system prompts or other knowledge you'd want Claude Code to take into account during a prompt, you can write it in the knowledge base, and tell Claude Code to look for it.

The last point is especially salient. The fact that both Obsidian and Claude Code work on the ubiquitous markdown file format in the file system means that it's a narrow-waist format, like how IP (and increasingly, HTTP) is the narrow waist of the Internet.

TCP/IP is the narrow-waist of the internet

Usually, attempts to establish a narrow waist fail unless it was designed in from the beginning. Obsidian leveraged files because it didn't want the lock-in of Roam Research. Claude Code leveraged files because the medium of programmers is the text file. But in this lucky happenstance, both Obsidian and Claude Code leverage a common format, which opens up flexibility and possibilities in ways we've forgotten after using apps and web apps for so long.


The ability for an agent to have access to an entire knowledge base in order to finish its task was pretty eye-opening for me. Where AI fails is often a failure of context, rather than ability. I think people are still catching up to when AI does well and when it doesn't. For AI, context is everything, and providing the right context is most of the battle.

]]>
<![CDATA[Programming After AI: Why System Boundary Taste Matters]]>Since coding agents came on the scene, a subset of programmers panic about AI replacing their jobs, but I think they're panicking about the wrong thing. The fear seems focused on the mechanical aspects—will AI write better functions than me? Will it debug faster? Will it

]]>
https://interjectedfuture.com/programming-after-ai-why-system-boundary-taste-matters/687aba9e6f2f9c0001d5a87dMon, 21 Jul 2025 14:02:53 GMT

Since coding agents came on the scene, a subset of programmers panic about AI replacing their jobs, but I think they're panicking about the wrong thing. The fear seems focused on the mechanical aspects—will AI write better functions than me? Will it debug faster? Will it remember more API endpoints? But I'd argue an important tacit skill of senior engineers is knowing where to draw system boundaries. AI can generate code from specifications, but it can't decide which abstractions will make sense, or how a system should evolve as its context changes.

This distinction matters because most programmers spend their careers optimizing for immediate readability, treating code as literature to be consumed by their present-day colleagues (though it might not feel like that at times). But the boundary decisions you make today—how you partition responsibility, where you allow flexibility, which dimensions of change you prepare for—these determine whether your system gracefully adapts or buckles under the weight of accumulated compromises. AI doesn't live with technical debt. It doesn't feel the pain of a poorly chosen abstraction echoing through years of maintenance. It will simply throw up its hands and code you a bad or inadequate solution.

The history of data storage offers a lens for understanding a common instance of this boundary drawing challenge. Consider how we think about data structures: arrays optimize for sequential access, hash tables for key-based lookup, trees for ordered traversal. Each makes a specific trade-off, committing upfront to particular access patterns while making others expensive or impossible. An algorithm is inextricably linked to its data structure. You're literally deciding how to organize information in memory, and that organization constrains every future interaction with your data. This is boundary thinking in its most concrete form.

Early database systems followed this same rigid pattern. Hierarchical databases like IBM's IMS organized information as trees, forcing you to traverse a single predetermined path to reach any piece of data. The approach made perfect sense if you knew exactly how your data would be accessed. CSS selectors work this way, traversing the DOM hierarchy, and XML databases with XPath queries suffer the same limitation (for those of you of a specific vintage). But asking "show me all elements with red backgrounds" requires visiting every node. You get blazing performance for the access pattern you optimized for, and awkward workarounds for everything else.

The weakness of hierarchical systems becomes apparent the moment you encounter cross-cutting concerns—requirements that don't respect your carefully constructed boundaries. When you don't know your access patterns in advance, or when users start asking questions that cut across your hierarchies in unexpected ways, you discover that your optimization has become a prison. The system serves its structure rather than serving its users.

Relational databases revolutionized this by essentially deferring the boundary decision. Codd's insight with relational algebra wasn't just about mathematical elegance—it was about recognizing that the most important access patterns are often the ones you can't predict. Instead of hardcoding how you'll slice your data, relational systems let you decide at query time. You can ask for entity.property just as easily as property.entity, or any combination that serves your immediate need. The system adapts to your questions rather than forcing your questions to adapt to the system's structure.
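To make the query-time flexibility concrete, here's a minimal sketch with SQLite. The table and column names are made up for illustration; the point is that the same rows answer both the entity-first question and the property-first question.

```python
import sqlite3

# An in-memory relational store: rows commit to no access pattern upfront.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (entity TEXT, property TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("header", "background", "red"),
        ("header", "tag", "div"),
        ("sidebar", "background", "blue"),
        ("footer", "background", "red"),
    ],
)

# entity -> properties: "what does the header look like?"
header = conn.execute(
    "SELECT property, value FROM facts WHERE entity = 'header'"
).fetchall()

# property -> entities: "show me all elements with red backgrounds" --
# the cross-cutting question that a hierarchy makes expensive.
red = [
    row[0]
    for row in conn.execute(
        "SELECT entity FROM facts WHERE property = 'background' AND value = 'red'"
    )
]
```

Neither query was privileged when the rows were stored; the slicing decision was deferred to query time.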

The ability to query across different dimensions for a data set isn't merely a database story—it's a philosophy about system design that most programmers never internalize. The success of relational databases hints at a deeper principle: the most robust systems are those that can be reorganized along dimensions their original designers never considered.

Which brings me to an insight I found surprising when I first heard it: entity-component systems (ECS) in games are just relational databases in disguise. An entity-component system is a way of bookkeeping the various entities in a game, such as the player, monsters, and power-up items, while allowing the game to process them by system, such as physics and rendering.

Games discovered this pattern not through theoretical exploration but through necessity. Traditional object-oriented game engines organize code around conceptual entities—Player class, Enemy class, PowerUp class—and this works beautifully until you need systems that operate across entity types. Physics affects everything, rendering affects everything, AI systems need to query spatial relationships between arbitrary objects. The entity-centric boundaries that make perfect sense to human designers become computational and design bottlenecks.

The design bottleneck can easily happen when you're trying to find the fun by recombining various sub-systems. When designing a shooting game, it's natural to assume that the camera is fixed on the player and any AI is attached to a monster. But what if the fun is in a homing missile? Now we need to attach the AI to a projectile. And what if the fun is in seeing from the perspective of the homing missile as it searches for its target? Well, that breaks the assumption that the camera subsystem is always on the player.

Casey Muratori's talk on rejecting "compile-time hierarchies" articulates this tension precisely. Object-oriented programming convinced an entire generation that entity-first thinking was universally correct, that domains should always be modeled by drawing boundaries around conceptual objects. This works when your domain naturally clusters around stable entities—Alan Kay's work in molecular biology, where cells and proteins have clear identities, or Bjarne Stroustrup's distributed systems, where processes and machines form natural boundaries. But for most computational problems, entity-first thinking is the wrong choice, a premature optimization that constrains future flexibility.

I conceptualize this alternative access as a "transpose." Instead of organizing around entity-to-property relationships, you use property-to-entity relationships. Instead of asking "what methods does a Player have?", you ask "what entities need physics calculations? What entities need rendering updates?" You're drawing boundaries around capabilities and systems rather than around conceptual objects. The mental shift feels subtle, but the architectural implications are profound.

ECS systems make this transpose concrete. A physics system operates on all entities that have physics components—position, velocity, mass. A rendering system operates on all entities with visual components—sprites, animations, shaders. You're essentially running queries: "give me all entities that have both Position and Velocity components so I can update their motion." The similarity to SQL becomes obvious once you notice it: both systems let you join arbitrary data on the fly rather than pre-committing to rigid hierarchies.
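A toy sketch of the transpose. The component names and storage are illustrative, and real ECS engines use cache-friendly archetype layouts rather than Python dicts, but the query-as-join shape is the same.

```python
from dataclasses import dataclass, field

@dataclass
class World:
    # Each component type gets its own table, keyed by entity id.
    position: dict = field(default_factory=dict)  # entity id -> (x, y)
    velocity: dict = field(default_factory=dict)  # entity id -> (dx, dy)
    sprite: dict = field(default_factory=dict)    # entity id -> sprite name

    def with_components(self, *tables):
        """The join: entity ids present in every given component table."""
        ids = set(tables[0])
        for table in tables[1:]:
            ids &= set(table)
        return sorted(ids)

def physics_system(world: World):
    # "Give me all entities that have both Position and Velocity."
    for eid in world.with_components(world.position, world.velocity):
        x, y = world.position[eid]
        dx, dy = world.velocity[eid]
        world.position[eid] = (x + dx, y + dy)
```

A rendering system would run the analogous join over `position` and `sprite`; no entity class ever needs to know which systems will touch it.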

Games generally avoid traditional databases because their computational budgets are extremely tight—every microsecond matters when you're targeting 60fps with complex simulations. But ECS gives them database-like query flexibility while maintaining cache-friendly performance characteristics. They've independently rediscovered relational thinking, optimized for their specific constraints.

Some games push even further into what I can only describe as inference territory. The Chemistry Engine in Breath of the Wild doesn't just query existing properties—it derives new facts from interactions between materials. Fire plus wood yields burning wood; ice plus fire becomes water. The system knows rules about how properties interact and infers new states dynamically. Baba is You takes this to an almost philosophical extreme: the game mechanics themselves emerge from logical rules that get recomputed as you physically rearrange the rule statements. "Baba IS You" becomes "Rock IS You" when you push the words around. The entire game state derives through inference from a changing set of logical facts.

This suggests a future direction that excites me: systems that can not only query from arbitrary dimensions but infer relationships you didn't even know existed. Datalog and inference engines let you specify facts about your domain and ask "what else must be true?" rather than explicitly computing every possible state. Epic's Verse language, developed with Simon Peyton Jones, hints at this direction—a functional logical language where you can "run functions in reverse," using inference to find inputs that produce desired outputs. Game designers are starting to explore what becomes possible when computation becomes less about explicit control flow and more about declarative reasoning.
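The "what else must be true?" style can be sketched as naive forward chaining over interaction rules. The rules below are illustrative, in the spirit of the Chemistry Engine example, and not how Breath of the Wild actually implements it.

```python
# Each rule: if all premises hold, the conclusion becomes a new fact.
RULES = [
    ({"fire", "wood"}, "burning_wood"),
    ({"fire", "ice"}, "water"),
    ({"burning_wood"}, "ash"),
]

def infer(facts: set) -> set:
    """Apply rules repeatedly until no new fact is derivable (a fixpoint)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived
```

Note that `ash` was never stated anywhere; it falls out of chaining two rules, which is exactly the "derive new facts" behavior Datalog engines generalize.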

I don't expect widespread adoption of inference-based systems soon—the tooling remains esoteric, and most developers are still learning to think in terms of queries rather than objects. But the underlying trend seems clear: as AI commoditizes code generation, the valuable human skill becomes problem modeling. Not the mechanics of translating requirements into syntax, but the deeper challenge of understanding which boundaries will serve you well as requirements evolve.

The programmers who survive will be those who develop intuition for angles of attack on their problems, one of which is deciding when their domain needs property-first rather than entity-first thinking. The programmers who thrive will understand when rigid boundaries serve their purposes and when flexible boundaries justify their complexity. They'll know how to design systems that can be queried from dimensions they haven't yet imagined, because they'll understand that the most important access patterns are often the ones you can't predict.

While AI writes increasingly sophisticated code, humans who understand boundary trade-offs will architect the systems that matter. The question isn't whether AI can code—it's whether you can think about problems in ways that remain valuable as the tools evolve.

]]>
<![CDATA[Getting vibe coding to work for me]]>I bounced off of Cursor a couple of months ago. It was good having code in context in the sidebar, a leg up from cutting and pasting, but I found its agent UI confusing and it made odd edits. In addition, when I left it open for a week, it&

]]>
https://interjectedfuture.com/getting-vibe-coding-to-work-for-me/68718add6f2f9c0001d5a809Fri, 11 Jul 2025 22:23:57 GMT

I bounced off of Cursor a couple of months ago. It was good having code in context in the sidebar, a leg up from cutting and pasting, but I found its agent UI confusing and it made odd edits. In addition, when I left it open for a week, it’d progressively get less responsive to keystrokes. I ended up going with Zed for the editor.

When I tried Claude Code, I initially didn’t find it all that great. It could have been the domain I was working in (writing a DBSP implementation for incremental compute), but I found that its design and architectural choices were all wrong in subtle ways. In addition, I’d try reviewing every single edit, which was slow and time-consuming, and I’d be out $80 with no code to show for it.

What I found finally worked is just letting go and having the vibe wash over you. (I also changed over to a domain coding agents are more familiar with: web apps.) Break down tasks to something a junior can tackle. If there’s something more complex, I chat with o3 on all the design questions and issues. Once satisfied, I tell it to write me a prompt for a coding agent in the form of a PRD.

Then, I take that and paste it into Claude Code, and have it churn on that till completion. Unlike before, I’d only review its edits after it was all done. I always start from a clean commit (or git worktree) when having it work on something. If I didn’t like the result, I’d have it try to correct once or twice. But if it couldn’t get it, I’d just blow the entire commit away, paste in the PRD, and pull the casino lever all over again.

It takes a couple of tries to get a sense of the type of task, the domain, and the granularity of task that works. While it feels like my number of commits has gone up, my GitHub commit history says otherwise. I think it’s more that it’s expanded the type of things I can tackle to get immediate results, test some hypothesis, or get some intuitive feel.

I think what’s often left unsaid is that devs have wildly different ways of working, and some of these work well as scaffolding for coding agents, but people never mention it as part of their workflow. Some people write extensive comments on every group of code, which can be helpful for LLMs reading the code. Other people save their PRDs and tasks in the repo itself for the LLM to refer to. Still others set up unit tests (and in some cases, formal verification), type checking, and other guardrails for the LLM to self-correct on its agentic coding spree.

I think some devs also have a different mental model of LLMs. Some inherently believe that if it doesn’t work, it’s because the LLMs aren’t good enough yet, so they bounce off of it easily. Others inherently believe that if it doesn’t work, then it’s how they’re holding it, and will invest time to think of different ways to prompt it or inject context into it to get it to work. I find that you need to be a bit of the latter.

]]>
<![CDATA[Elm as target language for vibe coding]]>Having been a fan of functional programming and following the breadcrumbs of reactive systems, it's not surprising that I'm a fan of Elm, and had experience building different systems in it. It's too bad that enthusiasm for Elm didn't last until vibe

]]>
https://interjectedfuture.com/elm-as-target-language-for-vibe-coding/6858a8526f2f9c0001d5a5d8Mon, 07 Jul 2025 21:22:22 GMT

Having been a fan of functional programming and following the breadcrumbs of reactive systems, it's not surprising that I'm a fan of Elm, and had experience building different systems in it. It's too bad that enthusiasm for Elm didn't last until vibe coding came on the scene, because it seems like Elm as a language would be a great fit for vibe coding.

What are the aspects of a language that make it good for LLMs to code in? In a lot of ways, they're the same affordances that make a language better for a human to code in. But there are some specific pros that Elm has going for it.

Strong type system

LLMs have a tendency to hallucinate, and hence make mistakes when writing code. These mistakes can result in code that just doesn't work or doesn't run. Usually what we do is blow the changes away and try again, giving the coding agent more context as to what we want.

But another method is to provide tools to the coding agents to self-correct, such as build-scripts, linters, and type-checkers. When you have a language that runs without error if it compiles, then it doesn't require a lot of hand-holding. Hence, you can just let the coding agent run until it finishes before checking its work, rather than having to keep going back to it after each small step.
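The self-correction loop can be sketched generically. Everything here is an assumption for illustration: `check_cmd` stands in for the compiler invocation (for Elm, something like `["elm", "make"]`), and `revise` stands in for the coding-agent call.

```python
import subprocess

def fix_until_compiles(file: str, check_cmd: list, revise, max_attempts: int = 5) -> bool:
    """Let the agent self-correct against a compiler instead of a human.

    `revise(file, errors)` is a stand-in for a coding-agent call that
    edits `file` given the compiler's error output.
    """
    for _ in range(max_attempts):
        result = subprocess.run(check_cmd + [file], capture_output=True, text=True)
        if result.returncode == 0:
            return True  # "if it compiles, it runs": safe to stop checking
        # Feed the errors back so the agent can fix its own mistakes.
        revise(file, result.stderr)
    return False
```

The stronger the guarantees the checker gives (Elm's "no runtime exceptions" being the extreme case), the longer this loop can run unattended.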

I've found that to get vibe-coding to work for you, you have to be ok with just letting go and not micromanage every little step that the coding agent is doing. While you can micromanage its design and architectural choices, you don't want to be doing that for basic syntax.

Exceptionally clear error messages

Elm was one of the first languages to prioritize clear error messages. Clear error messages not only help humans, but also LLMs. By having clear error messages, it's easier for the coding agent to find the place in the code that's having the error.

Like strong type checking, this helps the coding agent be more independent on the low level details on the code. Hence it can run independently for as long as it can without intervention from the developer on things that it can fix itself. This frees up the developer to focus on things that coding agents are currently bad at, such as design choices and architecture.

Single file by convention to help context caching

Elm programs, by convention, tend to congregate into a single large file. There's not really a reason or culture to break them up, because most functions are relatively short (besides the routing switch), and it's easier to jump between functions if everything is in one file.

Turns out this is also helpful for LLMs, because they can leverage prompt caching. Oftentimes, we're sending the same context to an LLM, so instead of having it process the same tokens again and again, caching lowers the cost of generating a response.

Strong mental model to help humans course correct

Elm programs are famously regular, following The Elm Architecture (TEA) when building front-end web apps. TEA also gives human developers a strong mental model to follow. When you look at an Elm program, it's almost boring in how regular it is.

This is helpful, because for any code that the coding agent generates, it's fairly easy to follow what it's doing. It's also easy to course correct the agent, should anything go wrong.

Why I couldn't choose it

Despite all these positives, I couldn't choose Elm in 2025 for vibe coding. All of the reasons are narrowly and uniquely specific to the app that I'm building. But I will mention one specific weakness of Elm: it forces you to break up the logical trace of your program along side-effect boundaries.

A single conceptual trace

Elm is great when a single message matches up with a single conceptual action that a user can take. So given any single event that a user can generate, it's easy to look at the state changes that a single user event will generate. These tend to be typical CRUD events in a basic web app.

However, there are instances where a single conceptual action needs to be spread across many different messages in the system. This tends to happen when a single user event invokes side-effects to other parts of the system through ports. It's harder to trace the control and data flow of the program when you have to flip back and forth between the internal and external parts of the system this way.

I think this is the largest weakness of Elm, because it effectively forces you to draw your program's boundaries wherever it has to use side-effects, instead of around the semantic and conceptual boundaries of an entire job or action that the user needs to do.

This is where I think Algebraic Effects would be great for Elm programs. Algebraic Effects are often explained as "resumable exceptions", but in reality, they're a new kind of control flow. Imagine that every time you need a result from a side-effect, instead of passing a Cmd all the way back up to the top of the call stack, you can "throw" the Cmd, get a result from a handler, and just resume from where you were. That way, you can write a more direct style program and your program remains purely functional.
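Elm doesn't have effect handlers, but the control flow can be illustrated in Python, where generators are a common way to simulate one-shot algebraic effects: `yield` "throws" an effect description, and the handler resumes the computation in place with a result. The effect names and handler wiring below are illustrative.

```python
def fetch(url):
    return ("fetch", url)  # an effect *description*, not the effect itself

def save(key, value):
    return ("save", key, value)

def workflow():
    # Reads as one direct-style trace, even though two side-effects occur.
    user = yield fetch("/user/42")
    yield save("greeting", f"hello {user}")
    return f"done for {user}"

def run(gen, handlers):
    """A tiny effect handler: interpret each yielded request, resume with the result."""
    try:
        request = next(gen)
        while True:
            result = handlers[request[0]](*request[1:])
            request = gen.send(result)  # resume exactly where we suspended
    except StopIteration as stop:
        return stop.value
```

The pure `workflow` never performs I/O itself; swapping the `handlers` dict swaps the interpretation, which is what keeps the program purely functional while reading in direct style.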

I left a lot of explanation on the table, since my job here isn't to explain algebraic effects, but just to point out where it's lacking in 2025 with respect to the reality that vibe-coding will be in the future of all software engineering disciplines in one form or another. Elm probably won't be the poster child taking up the mantle for a vibe-code-friendly-LLM-affordance language, but it has all the makings of one. And hopefully, some programming language designer out there will take notice.

]]>
<![CDATA[Lab note #073 Existing in a digital space]]>I ended up deciding on SolidJs. For one, I spent way too much time deliberating on a choice that wouldn't matter as much at the end of the day. But I think I wanted to give functional programming a good shake, since it's aligned with my

]]>
https://interjectedfuture.com/lab-note-073-existing-in-a-digital-space/6862c7e26f2f9c0001d5a686Mon, 30 Jun 2025 18:34:01 GMT

I ended up deciding on SolidJs. For one, I spent way too much time deliberating on a choice that wouldn't matter as much at the end of the day. But I think I wanted to give functional programming a good shake, since it's aligned with my values and ideals. However, at the end of the day I couldn't choose Elm, because of the friction over ports, and because it forces you to draw boundaries wherever you have side effects. It's not conducive to a computational flow that is interspersed with side effects but forms a single complete idea.

There were two main risk factors that I eliminated last week:

  1. The stack for implementing free-floating avatars.
  2. Multi-tool use in a single cycle.

I didn't want to have the AI feel like it's constrained to a sidebar, so I wanted to experiment with whether it can have an avatar that exists in the same digital space of the artifact as you, the user. This is still a left-over sentiment from my experiment with tool-based manipulation. While there is an advantage to having a chat history, I don't think it should be the default. I want the default to be direct-manipulation, not just for the user, but for the AI.

Between absolutely positioned DOM elements and using the canvas for an avatar, I opted for the latter, mostly for the flexibility of effects. I wasn't sure how canvas and DOM could communicate, but there are query APIs provided by the browser (such as getBoundingClientRect) that enabled me to build functions like "move to this DOM element" or "shoot at this DOM element". LLM vibe coding helped a lot to surface these APIs that I wasn't aware of before.

Lab note #073 Existing in a digital space
Click for a movie demo

The avatar for the AI is the flame, and when it manipulates the canvas blocks it shoots a particle at it. The graphics aren't final, but I wanted to see how it felt to have something on the screen manipulate things. Generally, I think it feels good, but it's a balance between visual feedback and distraction. It still needs a lot of polish.

One of the issues was that my agent tended to stop after a single tool use. Sometimes after two. Sometimes it repeated the same tool use over and over again. So I didn't know whether it was an inherent limitation or an implementation issue. Surprise! It was an implementation issue.

First, I wasn't inserting the results of the tool call into the message history, so the LLM thought the tool use was a no-op and tried again. Easy fix.
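A sketch of that fix, with generic message shapes (roughly in the style of the Anthropic/OpenAI chat APIs, but not any specific SDK; the tool itself is hypothetical): every tool call must be followed by a message carrying its result, or the model sees a no-op and retries.

```python
# Hypothetical tool registry for illustration.
TOOLS = {"move_block": lambda block_id, x, y: {"moved": block_id, "to": [x, y]}}

def run_tool_calls(messages, tool_calls):
    """Execute each requested tool and append its result to the history."""
    for call in tool_calls:
        result = TOOLS[call["name"]](**call["args"])
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": result,   # the step I was forgetting to add
        })
    return messages
```

With the result messages in the history, the model can see that each call succeeded and move on to the next tool instead of repeating itself.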

Second, my prompt wasn't good enough. I fired up o3, described what I was trying to do, as well as my current prompt. It was able to improve my prompt to be more explicit in ways that I wouldn't have thought to be more explicit. Sometimes, just getting the LLM to generate prompts for itself (or simpler models) has been a big unlock for me when using these things.

I have re-done what kinds of canvas blocks are available, and that ended up being an interesting thing in itself. I'll have to finish it this week.


Here's what I looked at this week:

]]>
<![CDATA[Lab note #072 A language with affordances for vibe coding]]>This week was a little stalled. Had some daily life administrivia to take care of, but I did manage to focus on one question: what, if any difference was there to the choice of language for vibe coding?

After looking into some of the other reactive systems, React stopped being

]]>
https://interjectedfuture.com/lab-note-072-a-language-with-affordances-for-vibe-coding/685824876f2f9c0001d5a59aMon, 23 Jun 2025 15:06:12 GMT

This week was a little stalled. Had some daily life administrivia to take care of, but I did manage to focus on one question: what difference, if any, does the choice of language make for vibe coding?

After looking into some of the other reactive systems, React stopped being the obvious default for me. I think people who have looked into this find it obvious, but most mainstream web dev companies still default to React.

The two other current classes of alternatives are Svelte-likes and Solid-likes. Svelte-likes take your code and just compile it to something imperative. Solid-likes are more like data-flow programming in disguise.

Surprisingly, I'm seriously considering using Elm. This is surprising because:

  1. It has an ecosystem orders of magnitude smaller than any of the others.
  2. It hasn't had an update since 0.19, which was over six years ago.
  3. The only way to do FFI is through ports, which is a little cumbersome, or through web components, which require care around state.

But this is because I think Elm has the potential to be a good vibe-code target language for a variety of reasons (I'll expand on this in a post):

  1. Solid type system to keep agents in check.
  2. Tends to keep code in a single file, which helps prompt caching.
  3. Good mental model for humans to course correct.

I converted the buying-decision app from raw HTML5 to both SolidJs and Elm as an experiment. It was surprisingly a tossup for a while. There's only one reason I'm leaning towards SolidJs, and it's that I have an odd requirement of being able to move stuff around at 60fps while still being aware of DOM elements. Elm unavoidably has a virtual DOM that it performs diffs on, which is O(n). There are workarounds using only CSS, but more complex animation paths aren't possible. Another workaround is to render to canvas, but that comes with more work trying to synchronize the state of the DOM with the state of the canvas. That smells like pain.

Instead of spending more time evaluating the amount of work it would take to support a pure functional codebase–as much as I would like one–I'm just going to go with SolidJs. Besides, if it doesn't work out for whatever reason, I can always get vibe coding to convert it.

On to the other risk factors this next week.


]]>