Blog

Your App Subscription Is Now My Weekend Project

I pay for a lot of small apps. One of them was Wispr Flow for dictation. That’s $14 CAD/month that I was paying until I had a few lazy days visiting my mother. And then on the afternoon of New Year’s Day, I vibecoded Jabber.

Now, don’t get me wrong, Jabber is not “production quality.” I would never sell it as a product or even recommend it to other people, but it does what I needed from Wispr Flow, and it does it exactly the way I want it to. For free.

At work, I’m often asked to make small videos showing some support agent how something works, sharing some knowledge with new team members, or just demoing something. I used to use Loom for that, which costs $15/month. So after creating Jabber, I got excited and vibecoded Reel.

Reel does exactly what I wanted Loom to do: I can record my camera, I can move it around, and I get to trim the video after it’s done (I don’t remember being able to do that with Loom).

Then just yesterday, a friend of mine was telling me how he got tired of paying for Typora and decided to vibecode his own Markdown editor. And that gave me the idea of creating an editor for my blog.

That’s Hugora! Yes, horrible name, but who cares? It’s just for me. I get to edit my Hugo blog just the way I like. It even shows my site theme.

You see the pattern here?

All of these $10/month apps are suddenly a weekend project for me. I’m an engineer, but I have never written a single macOS application. I’ve never even read Swift code in my life, and yet, I now can get an app up and running in a couple of hours. This is crazy.

Last year, a Medium post predicted:

Most standalone apps will be “features, not products” in the long run — easy to copy and bundle into larger offerings.

And I think we’re there. I don’t know what that means for the future of our industry, but it does seem like a big shift.

I’m still skeptical of vibecoding in general. As I mentioned above, I would not trust my vibecoding enough to make these into products. If something goes wrong, I don’t know how to fix it. Maybe my LLM friends can, but I don’t know. But vibecoding is 100% viable for personal stuff like this: we now have apps on demand.

Being Specific when Pairing with Bots

A couple of days ago I posted about my workflow and I made light fun of something I do:

I also give it context to save time. The agents nowadays are very smart and can find their way, but I can shortcut that by giving it hints “in internal/foo/foo.go there’s a function called DoFoo() and it does this and that and I want it to do that other thing before that” or whatever. Less tokens, faster iteration. This is probably astrology for nerds, pure superstition at this point, but I still do it.

Turns out, maybe it’s not really astrology for nerds? Today, Quinn Slack shared an article about How To Pair with an Agent and in it, the author says “The more you can specify, the better” and gives this example of a good, specified prompt:

Specified prompt:

Build a new API endpoint for user notifications. Follow the pattern in src/api/messages.ts as your reference. Run the API tests after each step. Don’t move on until they pass.

You gave it a reference to follow and a way to check its own work. Now you can step away. Let the agent iterate until the tests pass.

So maybe it’s not astrology for nerds, but a good practice? I always felt like it gave me better results, but I wasn’t 100% sure if it wasn’t just a superstition. Nice to see it’s probably not.

Thoughts on Amp's ad-supported business model

I’m agent-agnostic, in that I don’t use only one. I keep changing from time to time. My earliest forays into agentic programming were with Claude Code, which was then and still is probably the gold standard. But since then I’ve tried quite a few: Codex CLI (good models, barebones agent), Droid (not a fan), OpenCode (big fan!), and Amp.

I’ve been using Amp for a while, but only from time to time to see its evolution. With the move to Opus 4.5, I found that Amp has become very capable and I started using it more and more until it became my go-to agent.

The one downside of Amp is cost. Since it’s not tied to any of the labs, it has to charge API pricing, which can add up if you use it a lot, though maybe less than most people think.

But it’s undeniable that there is a psychological impact to seeing money constantly being drawn down as you use it, even if at the end of the day you’d spend the same as with a subscription.

These are challenges the Amp people have been working on for a while. Then, not that long ago, they came up with a first attempt: an ad-supported free tier. You could use up to $10 worth of API per day as long as you agreed to see ads in your agent.

To be clear, ads are optional. You only see the ads if you choose to and if you do, you get $10 worth of inference per day, using some cheaper models. This is how the ads appear:

Personally, I find them unobtrusive, but opinions may vary. Now, you may notice that I emphasized the fact that these ads are optional. The reason I did so is that the idea of ads seems to hit a nerve with some people. Ever since the free tier came out, I’ve seen several tweets of people announcing their refusal to ever try Amp because they don’t want ads.

Now Amp has come up with a next step, whereby paying customers can also enable Amp Free, get those $10 a day of inference, and have their balance drawn from only once the free allowance is exhausted. That’s the equivalent of $300 a month of free inference. Not only that, but you can use that free allowance with Opus 4.5.

I understand the aversion to ads. I share it. But $300/month is too much value to ignore. So I decided to enable Amp Free and use Amp exclusively this month to see how much I will have spent by the end of it. My suspicion is that I’ll spend less than my Claude Max subscription.

But something that I think is very important to remember is that there’s a whole world of engineers outside of the developed world. For a developer in, say, South America, the cost of a Claude or Codex subscription is prohibitive. This is keeping a whole world of engineers out of the LLM revolution. Amp’s approach offers a way for them to have access to premium models they otherwise wouldn’t have access to.

LLMs Are Tools, Not Replacements

I’ve been meaning to write this post for a bit, but never found the right time. I guess this is it. Until sometime last year, I was more or less an AI-skeptic. I say more or less because I was always very interested in the technology. I built my own LLM to learn about it and I thought then, as I do now, that the technology is incredible.

And yet, I had tried using LLMs to help with coding and my experiences were not great. I used LLMs to write one-off scripts for me, and they were very good at that. But whenever I tried to use them to help me write “production code”, they would hallucinate or get stuck in a “bug loop”. I felt like I was spending more time dealing with the aftermath than I would have spent writing it all by hand. I even disabled Copilot autocomplete because I felt like it was distracting.

Fast forward to today and most of my code is written by LLMs. This change came from a combination of how much the tooling improved and the recognition that I was holding it wrong.

Now, don’t get me wrong. This post is not meant to convince anyone of anything. I’m not selling anything here. This post is for engineers who are curious about how others work with LLMs and trying to find their own workflow. I’ll show you exactly how I work now and how it works for me.

The bug that changed my mind

As mentioned, I was a bit of a skeptic. I knew LLMs were good at writing one-off scripts and I was using them a lot for that, but not more than that. Then one day someone asked for help with a bug.

We had this multicell architecture with a proxy/multiplexer that decided where any given request should be routed. Once that decision was made, the request would be proxied to an ALB using a custom transport. The ALB had resource mappings to know where things were hosted inside a given cell, so when the custom transport requested a URL, the ALB responded with a redirect to the actual destination inside the cell it belonged to. The custom transport would then re-issue the request to the correct destination.

The bug: seemingly at random, some requests would succeed and some would not and no one could figure out why. So I started looking and quickly found that it wasn’t random at all: requests with bodies would fail. When I saw that, I immediately thought it was the custom transport eating the body, except I remembered writing that transport and found it hard to believe the issue was there. And upon looking at the code, it seemed fine. I added logging and went about trying to reproduce the issue. The code seemed correct, but the issue was still there.

After a while, I decided to try Claude Code. I launched it on the repo and explained the problem. I’ll admit I did not have high expectations, but I hoped that maybe it could give me some insight that would help. To my surprise, in about 40s it came back saying it had found the issue: the transport was eating the request bodies. My first reaction was frustration, because I knew I had already looked at it and the issue was not there. I thought Claude was being dumb. Except I noticed it was showing code that didn’t look like what I had been looking at. Long story short: at some point, someone had copied and pasted some code and added a second custom transport somewhere it shouldn’t be, and that transport had a bug.

I didn’t fully convert then, but I started paying more attention. I began using LLMs for debugging and code reviews, things where being wrong was mostly harmless and I could verify the output easily. Over time, that expanded. Now we’re here.

The mistake I made early on

When I first tried AI coding tools, I treated them like code generators. Describe what you want, get code back, paste it in, repeat. This was the intuitive way to use them, and it’s wrong as far as I am concerned.

For those one-off scripts I mentioned before, I recognize now that I was “vibe coding” them. That was fine because they were only going to be used by me. But I don’t let LLMs write unsupervised code that I need to ship to others, and that’s the problem: generated code requires review. Review requires understanding. If you didn’t think through the implementation yourself, you’re now reading code you don’t fully understand, looking for bugs you can’t anticipate, in an approach you didn’t choose. You’re doing more cognitive work than if you’d just written it yourself, and the code is probably worse.

The mental shift that made everything click for me was that LLMs are tools, just like LSPs were tools, and pre-LLM autocomplete was a tool. They’re not a replacement but a complement: a junior engineer who has read everything but never built anything. Lots of talent, but absolutely not to be trusted unsupervised.

My workflow

This is how I work with LLMs. I found that this works very well for me. I am aware that it is a much more involved workflow than a lot of people’s.

Phase 0: Reverse Rubber-ducking

I don’t start in an agent. I start in Claude, just chatting.

Before I write any code, I want to understand the domain. If I’m implementing auto-updates for a macOS app, I am asking Claude about how Sparkle works. Not “implement auto-updates for me”, but “how does Sparkle choose when to prompt the user?” or whatever. I want to know the concepts, gotchas, tradeoffs, etc. I often talk about some other app and ask “how does X do this?”

This is basically rubber-ducking in reverse. I’m building my own mental model through conversation. By the time I’m ready to touch the code, I actually understand what I’m about to do. This matters because it means I now can review what the LLM produces. I develop an intuition for what to expect, which in turn lets me quickly spot when something is wrong.

This phase gives me confidence, and that matters. And of course, this is mostly for areas I am not already familiar with. But even when I am familiar, I find that these conversations give me insights or surface the questions I need to ask when doing the plan.

Phase 1: Plan

Now I move to an agent. Lately I’ve been using Amp, but the specific tool matters less than the process. This could be Claude Code, Codex CLI, etc. My process is tool-agnostic.

I don’t say “build me X.” Instead, I start another conversation, mostly a Q&A. “How would you approach this?”, “What are the steps?”, etc. I challenge it when something sounds off. I often ask the LLM to push back on my ideas if it thinks they’re not good. I may still insist, but it’s good to have some pushback here and there. We go back and forth until I’m satisfied with the approach.

Then I ask it to split the plan into the smallest self-contained, testable phases. This is critical. I want each phase to be something I can review, run, and validate before moving on. Those big codebase-wide changes are where things go off the rails.

Finally, I have it write everything to a spec.md file. This serves two purposes: (1) it’s a reference I can point the LLM to if context gets lost, and (2) it’s documentation of what we decided and why. For longer projects, this is how I resume after a break. I also make manual adjustments to this plan when needed, though this is getting more and more rare.
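I don’t have a real spec.md to show, but a hypothetical skeleton for one, assuming the phased structure described above, might look like this:

```markdown
# Spec: <feature name>

## Context
What we're building and why; key decisions from the planning Q&A.

## Phases
### Phase 1: <small, self-contained step>
- What changes, and in which files
- How to validate it (tests to run, expected behavior)

### Phase 2: <next step>
...

## Open questions
Anything deliberately deferred, so a fresh context knows not to guess.
```

The exact headings don’t matter; what matters is that a fresh agent session can re-read it and pick up exactly where the last one left off.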

Phase 2: Implement each phase of the plan, one by one

Now the agent starts writing code, one phase at a time.

I watch the diffs as they flow in, and because I was part of the planning and did my homework in Phase 0, I know what to expect. A quick glance is usually enough to tell me if it’s writing what we discussed or going off-script. That’s why the prep work matters: review is fast when you understand what you’re looking at.

I also give it context to save time. The agents nowadays are very smart and can find their way, but I can shortcut that by giving it hints “in internal/foo/foo.go there’s a function called DoFoo() and it does this and that and I want it to do that other thing before that” or whatever. Less tokens, faster iteration. This is probably astrology for nerds, pure superstition at this point, but I still do it. (Hi, it’s me, from the future: maybe it’s not astrology?)

Here’s a little trick I’ve started using: cross-agent reviews. Once Amp finishes a phase, I’ll ask Claude Code or Codex to review the diff. Different models and harnesses catch different things. It’s not foolproof, but it’s cheap and occasionally catches something I missed.

Phase 3: Validation, commit, and handoff

Once a phase looks good, I test it. I run it and do what I can to validate it. By this point, I’ve already reviewed the code both by myself and with an LLM.

If something is wrong, I iterate with the agent. I point out the problem and let it fix it. This usually works, and only very occasionally do I have to take over and fix it myself.

When I’m happy, I commit. This is an easy rollback point if something goes wrong afterwards. At this point I use Amp’s /handoff command to start a fresh context for the next phase. This is a forced boundary: the agent will start clean (though it can reference the previous phase in Amp), it will re-read the spec and we continue. This helps prevent context rot, which is where long sessions start to drift.

Trust Boundaries

I rely on LLMs heavily but I don’t trust them.

These are the lines I don’t let them cross:

  • Nothing ships without my review. I read every line before it goes in. I am too anxious to ship something I don’t understand. That prep work from Phase 0 is not just about understanding, but about making review fast enough that this is sustainable.
  • Don’t let the LLM write tests unsupervised. I learned this one the hard way. When a test fails, LLMs often “fix” the test to make it pass. I’ve heard this is less likely nowadays, but I’ve been burned and trust isn’t easily restored. So there. Now I’m extremely careful about letting them modify test code. The one thing I do like using LLMs for in testing is asking “do the tests cover the case where this, this, and this happen?” It helps find holes in the coverage.
  • Debugging is still mostly me. This is ironic, given that debugging a bug was my entry point into using LLMs more and more, but I’ve found that for my day-to-day debugging, I’m usually faster on my own. I reach for an LLM if I’m stuck, not as a first resort. Maybe this is muscle memory or maybe the tooling is weaker here. Either way, I don’t force it.

What still doesn’t work well

I want to be honest about the limitations, because the hype around these tools is exhausting.

I don’t think they’re good at complex refactoring across many files. The agent loses the thread. It will make changes that are locally correct but globally inconsistent. For big refactors, I still do a lot of manual work. I feel like the quality of the code after an LLM-assisted refactor is just not great.

Also, anything requiring deep context about the codebase’s history. Why is this weird workaround here? What’s the implicit contract this function has with its callers? The agent doesn’t know (heck, most people don’t either), but whereas a human might be reluctant, LLMs will happily remove code that seemed inconsequential but that now breaks some contract with a client.

And the final one can be controversial, but I think they’re bad at novel architecture decisions. Don’t get me wrong, ask an LLM to design something and it will, but then you ask “oh, but what if…” and it will immediately go “yes, good point” and redesign it all. It just goes along with whatever you last said. It doesn’t know how to make decisions. This shouldn’t be surprising given how LLMs work, but our brains tend to anthropomorphize everything, and then these things become counterintuitive. So I still have to think about architecture myself.

The Real Lesson

These tools have changed a lot — GPT 5.2 and Opus 4.5 are watershed moments IMO — but not as much as my own approach did. I stopped trying to skip the thinking part and started using LLMs to enhance it. The agent participates in discovery, planning, obviously implementation, and also reviews, but I am still driving.

If you bounced off these tools, it might be worth trying again with a different approach. That’s all I’m saying.

I’ve found that my workflow is more work upfront, but dramatically less work overall. More importantly, it lets me focus on the interesting parts and helps me with the drudgery.

Trying out Codex CLI

A while ago I was a little skeptical of AI-assisted coding, mostly because my experience had been with Copilot autocomplete, and it was really not good. I still avoid AI autocomplete to this day, even though I can see it has gotten better, because I still find it distracting and often still not great.

That said, Claude Code shook my world view and I’ve been daily driving it ever since. I need to write a post about how I use this agent, but tl;dr I use it for the boring parts of coding and to help me read and review code (especially my own) instead of using it to write feature code.

I have been happy with Claude Code, but I also heard very good things about the new GPT-5 model for coding and wanted to check it out. Enter the Codex CLI. It’s OpenAI’s answer to Claude Code.

I am approaching this with a very open mind. I completely understand that it is early times in Codex CLI land and thus I did not expect it to have feature parity with Claude. I’m ok with that, just to get that out of the way.

The onboarding was rough

My first experience with it was that it wouldn’t install due to an issue in the post-install of a dependency (ripgrep, which, I must say, I already had installed). I went to file a ticket and saw that someone else had already done so.

No matter! I thought. I figured out how to get around it and then decided to try it. I opened a local repo and typed /init.

Codex decided it wanted to run tests to check the status of the repo. Fair enough, go ahead. It then failed to compile my Go code, claiming the Go toolchain wasn’t available. I was confused by that, so I closed Codex CLI and ran go version, all good. I ran my tests, all passed. Wut?

I tried again, and this time I told it that I had checked and the toolchain was installed. It tried again, no dice. It kept trying until eventually I stopped it and did some digging. That’s when I learned that Codex CLI runs inside a sandbox and doesn’t share my shell’s environment. Ok, that was a little upsetting. So I asked Codex CLI how we could provision the sandbox with Go. It proceeded to look for Go 1.13, which was released over six years ago. It asked me to download the tarball and leave it in a certain directory, and it would take it from there.

Ok, time for some more digging. It’s at this point that I must point out that the Codex CLI documentation is basically non-existent, and it being a relative newcomer, there are not a lot of resources out there. Again, I get it; let’s just get through these initial steps.

I kept at it until I figured out the issue: though my shell’s PATH includes Go 1.25, the sandbox’s did not. I couldn’t quite figure out why, but I did manage to get it working by telling GPT where to find the Go binaries.

Once it got working

Now, once I got it working, things went a lot smoother. I quickly got used to the differences from Claude Code (and they are many) and got somewhat comfortable with it. I got GPT to analyze my code and look for bugs and it found a minor one that had escaped Claude for a long time. That was cool.

I found that it tends to be a little noisier than Claude Code, because CC tends to hide some things behind its quirky verbs (“Lampooning…”, etc.). This isn’t necessarily a negative, just different and something to get used to.

I miss the TODO lists that Claude Code creates and follows. Again, not a huge deal. The part that needs improvement is tool calling. More than once I saw it calling some Go tool with bad parameters. It also doesn’t seem to quite grasp the error messages.

Case in point: I asked it to run a linter, so it started running golangci-lint, but it ran it at the root of the repo, where there are no Go files, and without parameters, which resulted in an error: “No Go files”. It didn’t seem to understand this error and concluded golangci-lint wasn’t installed.

It then entered a loop, trying the same command over and over until I interrupted it and told it to pass ./... to include subdirectories. It tried again with the parameter, but bizarrely, this time it decided that golangci-lint would be in ./bin, which is not true at all. So I had to tell it where to find it. And then it worked fine.

Conclusion

It’s early days and it’s clear there’s some ground to make up if they want to catch up, but I also remember the early days of Claude Code. The CC team iterated quickly and we got to where we are today, and I’m hoping the Codex team will do the same. They seem very active in answering questions on X, so I have hope.

I’m hopeful and interested. I’ll keep an eye on it.