<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Curious Agents]]></title><description><![CDATA[Writing about my explorations in AI memory, retrieval, reflection and self-improvement loops]]></description><link>https://curiousagents.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!9FdV!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64652426-1c2f-4a9a-bb41-cd7d8063f2fc_120x120.png</url><title>Curious Agents</title><link>https://curiousagents.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 11 Apr 2026 00:12:26 GMT</lastBuildDate><atom:link href="https://curiousagents.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Tim Johnson]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[curiousagents@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[curiousagents@substack.com]]></itunes:email><itunes:name><![CDATA[Tim Johnson]]></itunes:name></itunes:owner><itunes:author><![CDATA[Tim Johnson]]></itunes:author><googleplay:owner><![CDATA[curiousagents@substack.com]]></googleplay:owner><googleplay:email><![CDATA[curiousagents@substack.com]]></googleplay:email><googleplay:author><![CDATA[Tim Johnson]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Agent Memory Problem Nobody Is Solving]]></title><description><![CDATA[Switching to a new agent platform is getting easier.]]></description><link>https://curiousagents.substack.com/p/the-agent-memory-problem-nobody-is</link><guid 
isPermaLink="false">https://curiousagents.substack.com/p/the-agent-memory-problem-nobody-is</guid><dc:creator><![CDATA[Tim Johnson]]></dc:creator><pubDate>Fri, 03 Apr 2026 17:56:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9FdV!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64652426-1c2f-4a9a-bb41-cd7d8063f2fc_120x120.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Switching to a new agent platform is getting easier. Many new projects in this space are offering OpenClaw migration scripts too.</p><p>Hermes has <code>hermes claw migrate</code>. OpenFang has <code>openfang import --from openclaw</code>. The tooling is polished. The onboarding is smooth.</p><p>What neither of them migrates is the memory.</p><p>Not the memory of what tools you configured. Not the memory of what channels you set up. The other kind: six months of your agent learning how you think, what you value, what you&#8217;ve tried before, how you like to be talked to. The corrections you made when it got something wrong. The preferences it absorbed without you explicitly stating them. The context it accumulated one conversation at a time.</p><p>You can move your skills. You can&#8217;t move the relationship.</p><p>Every migration resets that to zero.</p><div><hr></div><h2>What the best implementations look like right now</h2><p>Before talking about what&#8217;s missing, it&#8217;s worth spending time on what Nous Research got right with Hermes.</p><p>Hermes ships with a five-layer memory architecture. Not a memory module. Five distinct, interconnected layers. I haven&#8217;t seen anything this complete in an open agent system.</p><p><strong>Layer 1: ContextCompressor.</strong> This handles the within-session problem: as conversations grow, context windows fill up and earlier content gets dropped. ContextCompressor doesn&#8217;t just truncate. 
It compresses using a structured template (Goal, Progress, Files, Decisions) and updates iteratively across compressions. The conversation stays coherent even as the raw history scrolls away.</p><p><strong>Layer 2: Honcho user modeling.</strong> This is the cross-session problem. Honcho builds and maintains a model of you across conversations using a dialectic Q&amp;A approach. It forms hypotheses about your preferences, tests them, and refines them. The &#8220;peer card&#8221; that comes out is searchable and composable. Your agent accumulates an actual model of who you are, not just a growing file of notes.</p><p><strong>Layer 3: Auto skill creation.</strong> When the agent solves a problem it hasn&#8217;t seen before, it has a mechanism to formalize that solution into a reusable skill. Experience compresses into capability. The longer you use it, the more it knows how to handle your particular workflows.</p><p><strong>Layer 4: FTS5 session search.</strong> All conversations are stored in SQLite with full-text search. You can ask &#8220;what did we decide about the API schema&#8221; and it finds the answer. This is something most agent systems treat as an afterthought. Hermes treats it as infrastructure.</p><p><strong>Layer 5: RL training pipeline.</strong> This is the one that surprised me most. Hermes saves interaction trajectories and can feed them into Atropos for model fine-tuning. The agent can initiate its own training runs. Not just memory, but a feedback loop that improves the underlying model from your actual usage patterns.</p><p>Taken together, this is a serious system. The people who built it thought carefully about what memory actually means for a long-lived agent relationship.</p><div><hr></div><h2>Observational Memory takes a different angle</h2><p>Mastra&#8217;s Observational Memory (OM) approaches the problem differently, and the benchmarks are worth looking at.</p><p>OM clocks roughly 95% on LongMemEval. 
Honcho (Hermes&#8217;s user modeling layer) hits around 90% on LongMem S with Haiku 4.5, and 88.8% on LongMem M at near one million tokens, with 89.9% on LoCoMo.</p><p>On the benchmark numbers alone, OM has an edge.</p><p>But comparing them directly misses the point. They&#8217;re solving different problems. OM is closest in function to Hermes&#8217;s ContextCompressor: it&#8217;s about compressing the conversational part of the context, keeping the conversation coherent as a session grows, in a simple (but also prompt-cachable) way. Honcho is doing something else: building a persistent model of you across sessions, understanding the person rather than just what has been said.</p><p>Both are valid. OM gives you better context compression for conversations that can stretch out almost indefinitely. Hermes gives you both context compression and user modeling (you could argue that OM&#8217;s &#8220;user observations&#8221; fold these together, where Hermes separates them), plus search, plus RL if you care to run it. But the bigger additions are auto skill creation and reflection. They&#8217;re not competing directly. They&#8217;re addressing overlapping pieces of a larger problem.</p><p>OM&#8217;s benchmarks suggest it does compaction better than Hermes. But Hermes is well ahead in thinking through the other mechanisms.</p><div><hr></div><h2>The wall you hit when you want to move</h2><p>Here&#8217;s where the practical problem shows up.</p><p>You&#8217;ve been using OpenClaw for four months. Your agent knows how you work. It knows you hate em-dashes in blog posts. It knows you&#8217;re building something on Elixir and why. It knows your project naming conventions, your preferred tone, the context behind half a dozen ongoing threads. It knows things you didn&#8217;t explicitly tell it: patterns it picked up from watching you push back and correct and redirect over hundreds of conversations.</p><p>Then Hermes ships something compelling. Or OpenFang does. 
You want to try it.</p><p>You run <code>hermes claw migrate</code>. Your skills come over. Your configuration comes over. Your channel setup comes over.</p><p>Your agent&#8217;s model of you doesn&#8217;t.</p><p>The new system has no idea who you are. You&#8217;re back to day one of the relationship.</p><p>For someone who has invested seriously in an agent over months, this can be a real switching cost. We all want to be able to upgrade to better features, skills, and tools. But we don&#8217;t want the price of the &#8220;new&#8221; to be throwing away all that rich, &#8220;familiar&#8221; context.</p><p>The more sophisticated the memory system, the higher the switching cost.</p><div><hr></div><h2>What&#8217;s missing: an open memory format</h2><p>There&#8217;s no standard for what an agent memory export should contain.</p><p>There&#8217;s no agreed-upon schema for the things that matter: user preferences, learned corrections, relationship context, projects in flight, skill provenance. There&#8217;s no convention for how a receiving system should ingest that data and make it usable.</p><p>ContextCompressor and OM can&#8217;t share a memory state because there&#8217;s no common format for what a &#8220;compressed context&#8221; looks like across systems. Honcho&#8217;s peer cards live inside Hermes&#8217;s SQLite schema. OpenClaw&#8217;s memories live in daily markdown files and whatever ad-hoc files it wants to create for specific memories in its folder.</p><p>Every system right now is inventing its own answer to the memory problem in isolation.</p><p>A common memory interchange format doesn&#8217;t need to solve every problem at once. It needs to answer a few core questions: what does a user model export contain? What does conversation history look like in a format another agent can consume? How does a receiving system signal what it can and can&#8217;t use?</p><p>These are solvable problems. 
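</p><p>To make those questions concrete, here is one hypothetical shape such an export could take. This is a sketch, not a proposal: every field name below is invented for illustration.</p>

```python
import json

# Hypothetical memory-export records. Every field name here is
# illustrative, not a proposed standard.
records = [
    {"kind": "preference", "statement": "avoid em-dashes in blog posts",
     "confidence": 0.9, "source": "explicit user correction"},
    {"kind": "project", "statement": "porting an AI gateway to Elixir",
     "confidence": 1.0, "source": "stated by user"},
]

# Export: one JSON object per line (JSONL), so a receiver can stream
# the file and skip record kinds it does not understand.
with open("memory-export.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Import: keep what you can use, ignore the rest.
with open("memory-export.jsonl") as f:
    imported = [json.loads(line) for line in f]
preferences = [r for r in imported if r["kind"] == "preference"]
print(len(preferences))  # prints 1
```

<p>The specific fields matter less than the properties: line-oriented, kind-tagged records let a receiving system take what it understands and signal (or simply skip) what it can&#8217;t.</p><p>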
Maybe the OM format is a good starting point to build from (given its high scores), but I do like the addition of the peer card from Hermes. Do we need to keep raw logs as a backup for searchability? JSONL would seem like a fine candidate for that as a secondary format.</p><p>If you&#8217;re working on agent memory, building an agent platform, or thinking about this problem from a standards perspective, I&#8217;d like to hear from you.</p><p>The conversation needs to start somewhere.</p>]]></content:encoded></item><item><title><![CDATA[Experiments in Building an Automatic Software Factory]]></title><description><![CDATA[My project cc-pipeline, what it unlocks for building from scratch, massive re-writes, and lessons.]]></description><link>https://curiousagents.substack.com/p/experiments-in-building-an-automatic</link><guid isPermaLink="false">https://curiousagents.substack.com/p/experiments-in-building-an-automatic</guid><dc:creator><![CDATA[Tim Johnson]]></dc:creator><pubDate>Thu, 26 Feb 2026 16:11:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!NfKE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91bbb295-6712-42f7-a80b-403ef99caa41_666x490.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Here&#8217;s a thing you can do in 2026: write a plain-English description of software you want built, run one command, go to bed, and wake up to a working implementation. The full SDLC: spec writing, unit and end-to-end test suites passing, git commits made, phase by phase.</p><p>(To not bury the lede) here is the repo: <a href="https://github.com/timothyjoh/cc-pipeline">https://github.com/timothyjoh/cc-pipeline</a></p><p>That&#8217;s the pitch. Now here&#8217;s the honest version.</p><p>It works. Sometimes really well. But what it <em>actually</em> produces is determined almost entirely by what you put in. 
Leave something unspecified and the pipeline will make a decision for you &#8212; and that decision might be &#8220;functional but ugly&#8221; or &#8220;mechanics intact, polish missing.&#8221; The factory metaphor holds: garbage in, garbage out. But with a good spec? The output is genuinely impressive.</p><p>This is the story of how I built that pipeline, what I learned from running it a dozen times on projects ranging from an Elixir port to a Kirby platformer, and why I think building your own is more approachable than you&#8217;d expect.</p><div><hr></div><h2>2025: Babysitting the Context Window</h2><p>I spent most of 2025 deep in spec-driven development frameworks. The premise was compelling: instead of chatting at an LLM, you give it a full specification and let it execute against that spec. Several frameworks were doing interesting things here.</p><p><strong><a href="https://github.com/bmadcode/BMAD-METHOD">BMAD Method</a></strong> gave you a structured way to think about AI-assisted development with roles and phases. <strong><a href="https://github.com/jmanhype/speckit">GitHub&#8217;s Speckit</a></strong> explored spec-first workflows. <strong><a href="https://openspec.dev/">OpenSpec</a></strong> took a similar angle. <strong><a href="https://github.com/obra/superpowers">Obra&#8217;s Superpowers</a></strong> was doing interesting things with structured AI workflows. But my favorite approach was <strong>HumanLayer&#8217;s <a href="https://github.com/humanlayer/humanlayer/tree/main/.claude/commands">three-step research &#8594; plan &#8594; implement process</a></strong>; Dex&#8217;s talks about not anthropomorphizing matched my intuition about how to break down work cleanly.</p><p>The frameworks compensated for where the models fell short in 2025. They were easy to use, but I kept wondering: can I automate some of this? Every step required manual intervention. 
After Claude finished a spec, I had to review/edit the output, start a new context, invoke the next command, and wait. Review it again, ask for fixes. And above all, we always had to fix tests that were over-simplistic and leaned too heavily on mocking.</p><p>While it was all easier than ever, waiting 30-40 minutes at each stage and then re-engaging was a huge cognitive tax. I wasn&#8217;t building software. I was babysitting a context window between gaps in the workday. While this let me be productive on my overly meeting-heavy days, it was still disjointed.</p><p>When Anthropic shipped subagents and then Claude Agent Teams, something clicked. Here was a way to have Claude orchestrate its own work across multiple contexts. I spent a week with tmux and Agent Teams, spinning up coordinated multi-agent sessions. The results were promising enough that I made a decision: I was going to automate the SDLC loop itself &#8212; not just the build step, but the whole cycle: spec &#8594; research &#8594; plan &#8594; build &#8594; review &#8594; fix &#8594; reflect &#8594; commit. I was going to let the agent make the between-phase decisions I would otherwise be around to make, and come back to correct them later. And I was going to add my own gates &#8212; testing passes, reflection checkpoints, a staff-engineer-level code review at each phase. 
The need was &#8220;while I am away, keep building.&#8221; Then I could come back to something to react to, and we could start another cycle of correction and polish.</p><p>That decision is what became <strong><a href="https://github.com/timothyjoh/cc-pipeline">cc-pipeline</a></strong>.</p><div><hr></div><h2>The First Test: Porting OpenClaw to Elixir/BEAM</h2><p>The first real question wasn&#8217;t &#8220;can the pipeline build something new?&#8221; It was &#8220;can the pipeline take an existing, non-trivial codebase and rewrite it in a different language?&#8221;</p><p>I&#8217;d already written about attempting this on a company project (<a href="https://curiousagents.substack.com/">see my previous post on Substack</a>) with moderate success. I tried an automatic approach using subagents and agent teams, and while it announced &#8220;Success, it is done&#8221;, my savvy readers will know better than to trust Claude&#8217;s claim, after a multi-hour run, that it did everything it was asked.</p><p>The target for the pipeline test was ambitious: porting <a href="https://github.com/openclaw/openclaw">OpenClaw</a> &#8212; the open-source AI gateway I run my own agent on &#8212; to Elixir and the BEAM platform. This wasn&#8217;t random. I&#8217;ve been circling Elixir for years. The BEAM VM&#8217;s model of millions of lightweight processes, supervision trees, and fault tolerance feels architecturally right for agentic systems. AI removes Elixir&#8217;s historical barriers (hard to hire for, underrepresented in LLM training data), and what remains are its architectural advantages.</p><div><hr></div><h2>Shell Loops, tmux Hell, and the Agent SDK</h2><p>The earliest version of the pipeline was a bash script. A loop over phases, with a call to <code>claude -p</code> at most steps. Pipe the spec in, capture the output, write it to disk, move to the next step. It worked, and more easily than I expected.</p><p>The first real problem: some steps need Claude Code&#8217;s interactive mode. 
The review and build steps in particular benefit from Claude having full filesystem access, access to the project&#8217;s CLAUDE.md and any custom skills you want to add, and the ability to run commands and write and execute tests. <code>claude -p</code> is non-interactive: it takes a prompt and returns output, like a function call. But <code>claude</code> (interactive) is a full REPL session.</p><p>The first users of Claude Agent Teams said the answer was tmux. Launch a tmux session, start <code>claude</code> in it, send the prompt via <code>send-keys</code> or <code>paste-buffer</code>, wait for it to finish, detect completion, exit cleanly. Sounds reasonable.</p><p>The timing problems alone were maddening. <code>send-keys</code> would fire before the Claude session was fully initialized and the prompt would vanish. Or it would land but never submit. <code>paste-buffer</code> had truncation issues with long prompts. Detecting <em>completion</em> was even harder: there&#8217;s no clean signal when Claude Code is done with a task, so I was polling for sentinel files, watching for specific output patterns, putting &#8220;wait&#8221; timers all over the place.</p><p>And then getting Claude to <em>exit cleanly</em> to reset the context (which is the whole point: a fresh context window per step) was its own adventure. Send <code>/exit</code>. Wait. Send Escape. Wait. Send Enter. Hope. Repeat.</p><p>I burned days on this. The tmux approach never got reliable enough to trust overnight.</p><p>The solution was staring at me the whole time, and it was the <strong>Claude Agent SDK</strong>. Instead of trying to automate an interactive terminal session, I could control Claude Code programmatically: pipe the prompt to stdin, capture structured output from stdout, and get a clean process exit when the work is done. Each pipeline step gets a fresh context window, automatically. No timing hacks, no tmux polling, no sentinel files. 
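</p><p>The control loop this enables can be sketched in a few lines of Python. This is a reconstruction under assumptions, not cc-pipeline&#8217;s actual code: the runner is injectable so the loop can be exercised without the CLI, and shelling out to <code>claude -p</code> is one plausible way to drive a step.</p>

```python
import subprocess

def cli_runner(prompt: str) -> str:
    # One non-interactive invocation per step: a fresh context window,
    # and process exit is the clean completion signal (no tmux polling).
    # Driving "claude -p" this way is an assumption for illustration.
    return subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, check=True,
    ).stdout

def run_phase(steps, runner=cli_runner):
    """Run named steps in order, feeding each step's output to the next."""
    outputs = {}
    previous = ""
    for name, prompt in steps:
        full_prompt = prompt + "\n\n" + previous if previous else prompt
        outputs[name] = runner(full_prompt)
        previous = outputs[name]
    return outputs

# A stub runner exercises the loop without invoking the real CLI.
stub = lambda prompt: "ok: " + prompt.splitlines()[0]
out = run_phase([("research", "Research the codebase"),
                 ("plan", "Write a plan")], runner=stub)
print(out["plan"])  # prints "ok: Write a plan"
```

<p>Each step lands in its own process and its own context window; the orchestrator only ever sees stdout, which is what makes an overnight run scriptable.</p><p>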
And the Agent SDK lets me use my Claude Max subscription, because it drives the Claude Code CLI rather than calling the API directly.</p><p>And with structured output came structured logging. Which meant I could build a <strong>TUI</strong> &#8212; a terminal dashboard showing which phase and step is running, recent events, progress through the workflow. Instead of a black box running overnight, you can actually watch it work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NfKE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91bbb295-6712-42f7-a80b-403ef99caa41_666x490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NfKE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91bbb295-6712-42f7-a80b-403ef99caa41_666x490.png 424w, https://substackcdn.com/image/fetch/$s_!NfKE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91bbb295-6712-42f7-a80b-403ef99caa41_666x490.png 848w, https://substackcdn.com/image/fetch/$s_!NfKE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91bbb295-6712-42f7-a80b-403ef99caa41_666x490.png 1272w, https://substackcdn.com/image/fetch/$s_!NfKE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91bbb295-6712-42f7-a80b-403ef99caa41_666x490.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!NfKE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91bbb295-6712-42f7-a80b-403ef99caa41_666x490.png" width="666" height="490" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/91bbb295-6712-42f7-a80b-403ef99caa41_666x490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:490,&quot;width&quot;:666,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65909,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://curiousagents.substack.com/i/189260567?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91bbb295-6712-42f7-a80b-403ef99caa41_666x490.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NfKE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91bbb295-6712-42f7-a80b-403ef99caa41_666x490.png 424w, https://substackcdn.com/image/fetch/$s_!NfKE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91bbb295-6712-42f7-a80b-403ef99caa41_666x490.png 848w, https://substackcdn.com/image/fetch/$s_!NfKE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91bbb295-6712-42f7-a80b-403ef99caa41_666x490.png 1272w, https://substackcdn.com/image/fetch/$s_!NfKE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91bbb295-6712-42f7-a80b-403ef99caa41_666x490.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Phase 17 TUI during the build of Kirby your Enthusiasm</figcaption></figure></div><div><hr></div><h2>The Experiments</h2><h3>Swimlanes &#215; 2</h3><p>I built the <a href="https://github.com/timothyjoh/swimlanes">Swimlanes</a> project twice &#8212; once in Astro 5 with React islands and SQLite, once in <a href="https://github.com/timothyjoh/rails-swimlanes">Ruby on Rails</a>. Both times from the same basic concept: a Trello-like kanban board for taking notes, simple.</p><p>Both runs produced functional software. 
Solid test coverage, drag-and-drop working, data persistence, the whole spec. If you squinted at it as a backend engineer, it was fine.</p><p>The UI was rough. Not broken, but <em>developer-designed</em>. Both implementations had that slightly-off quality of software that was never told what it should look like. Which makes sense, because we never told it to use a design framework or UI library beyond specifying tailwind CSS. The BRIEF.md specified features and data model but said nothing about design system, component library, visual style, or aesthetic direction.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XHkO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700e99e7-ade7-438a-ba6b-afcdd27999bb_1316x794.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XHkO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700e99e7-ade7-438a-ba6b-afcdd27999bb_1316x794.png 424w, https://substackcdn.com/image/fetch/$s_!XHkO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700e99e7-ade7-438a-ba6b-afcdd27999bb_1316x794.png 848w, https://substackcdn.com/image/fetch/$s_!XHkO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700e99e7-ade7-438a-ba6b-afcdd27999bb_1316x794.png 1272w, https://substackcdn.com/image/fetch/$s_!XHkO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700e99e7-ade7-438a-ba6b-afcdd27999bb_1316x794.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!XHkO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700e99e7-ade7-438a-ba6b-afcdd27999bb_1316x794.png" width="1316" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/700e99e7-ade7-438a-ba6b-afcdd27999bb_1316x794.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1316,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:71507,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://curiousagents.substack.com/i/189260567?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700e99e7-ade7-438a-ba6b-afcdd27999bb_1316x794.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XHkO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700e99e7-ade7-438a-ba6b-afcdd27999bb_1316x794.png 424w, https://substackcdn.com/image/fetch/$s_!XHkO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700e99e7-ade7-438a-ba6b-afcdd27999bb_1316x794.png 848w, https://substackcdn.com/image/fetch/$s_!XHkO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700e99e7-ade7-438a-ba6b-afcdd27999bb_1316x794.png 1272w, https://substackcdn.com/image/fetch/$s_!XHkO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700e99e7-ade7-438a-ba6b-afcdd27999bb_1316x794.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Swimlanes the Astro version</figcaption></figure></div><p>The lesson wrote itself: <strong>if you don&#8217;t specify the design, the pipeline designs something for you.</strong> And &#8220;something&#8221; is not the same as &#8220;good.&#8221; A single line in the brief &#8212; &#8220;use Tailwind with the Shadcn component library, clean minimal aesthetic with neutral grays&#8221; &#8212; would have changed the output dramatically.</p><h3>Sales Performance Statistics in R</h3><p>This one I&#8217;m proud of, partly because it surprised me.</p><p>I&#8217;ve never 
written R in my life. I know what it is, I know roughly what it does, I&#8217;ve never typed an R expression. The brief was simple: given a set of sales rep performance data, build a statistical analysis and a formatted report.</p><p>I co-wrote the BRIEF.md with Claude &#8212; iterated on what &#8220;sales performance analysis&#8221; should actually mean, what metrics mattered, what the report should communicate. Then kicked off the pipeline and walked away.</p><p>It generated its own dummy data, ran the statistical calculations, and designed a report layout. When I came back, I had a working R project &#8212; reproducible, documented, producing output that looked like what I&#8217;d described. I didn&#8217;t struggle with the language. I didn&#8217;t have to learn the ecosystem. I described what I wanted and the pipeline figured out the R-specific implementation details.</p><p>That&#8217;s the part that&#8217;s hard to communicate until you&#8217;ve seen it: <strong>domain knowledge barriers dissolve when the spec is clear.</strong> You don&#8217;t need to know R to get R code. You need to know what you want.</p><h3>Tetris &#215; 2</h3><p>Tetris was the control experiment. Two runs, identical BRIEF.md, different pipeline runs. </p><p>Play it here:  <a href="https://tetris-3d-pipelined.vercel.app/">https://tetris-3d-pipelined.vercel.app/</a></p><p>Code here: <a href="https://github.com/timothyjoh/tetris-3d-pipelined">https://github.com/timothyjoh/tetris-3d-pipelined</a></p><p>The results were slightly divergent. Not dramatically &#8212; both games were recognizably Tetris, both had the core mechanics, both ran in the browser. But small differences emerged: slightly different scoring logic, slightly different color palette, slightly different keyboard handling. Nothing alarming. 
Nothing that suggested the pipeline was unreliable.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NI3G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbef5051f-5ee6-4473-bdd3-3ab0069f2a70_1312x1172.png"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!NI3G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbef5051f-5ee6-4473-bdd3-3ab0069f2a70_1312x1172.png" width="1312" height="1172" alt="" loading="lazy"></div></a><figcaption class="image-caption">Tetris first version</figcaption></figure></div><p>My read on why it worked so well: Tetris is one of the most-implemented games in history. Every LLM has seen hundreds of Tetris implementations in its training data. The shape of the problem is well-understood. This matters more than people realize. When you give the pipeline a well-known problem with clear mechanics, it has deep pattern-matching to draw on. The brief is almost a formality.</p><p>Which is also a warning: <strong>the pipeline&#8217;s ceiling is not the pipeline, it&#8217;s the training data PLUS the specifications you give it.</strong> Give it a well-documented problem and it flies.
Give it something novel or ambiguous and it&#8217;ll still build something, though you&#8217;ll put in more work later.</p><h3>Kirby Your Enthusiasm</h3><p>This was the ambitious one. A 2D platformer starring Kirby as Larry David, navigating LA street encounters in the style of <em>Curb Your Enthusiasm</em>. Full Phaser 3, TypeScript, three acts, character absorb mechanics, Curb NPC dialogue encounters. The brief was detailed and the concept was delightful.</p><p>Repo: <a href="https://github.com/timothyjoh/kirby-your-enthusiasm">https://github.com/timothyjoh/kirby-your-enthusiasm</a></p><p>Here&#8217;s the honest report: the mechanics mostly work. Kirby moves. Kirby floats. The inhale mechanic functions. The act structure is there.</p><p>The look is rough. Kirby is a pink circle. The enemies are rectangles. The backgrounds are solid colors. It looks like a prototype demo, not a game you&#8217;d actually play.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xgkL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F964b0802-a6c3-4f2a-b496-8450aa91ced3_946x768.png"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!xgkL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F964b0802-a6c3-4f2a-b496-8450aa91ced3_946x768.png" width="946" height="768" alt="" loading="lazy"></div></a><figcaption class="image-caption">Kirby at the final boss during phase 16 of the build (8 am the day after)</figcaption></figure></div><p>Part of
this is inherent to the problem: game art is hard to specify in a text brief. But part of it is solvable. The pipeline&#8217;s architecture supports additional steps, and for game development one obvious next step would be a <strong>design phase that hands off to an image-capable model</strong> (Gemini?) to generate sprites and background assets before the build step runs. We haven&#8217;t built that yet. When we do, I expect Kirby will look a lot more like Kirby.</p><p>The Kirby experiment also surfaced something important about ambitious projects: the pipeline is <strong>far</strong> better than vibe-coding, but it&#8217;s not magic. A sophisticated, well-structured SDLC running through Claude Agent Teams can still produce a rough first pass on a complex, underspecified project. The tool amplifies your judgment; it doesn&#8217;t replace it. A professional who leans in, steers the brief toward the hard decisions, and treats the pipeline output as a strong starting point will get dramatically better results than someone who hands it a vague concept and hopes.</p><div><hr></div><h2>How cc-pipeline Works</h2><p>The idea is simple enough to explain in three steps.</p><p><strong>First: init.</strong> Run <code>npx cc-pipeline@latest init</code> in an empty repo. It scaffolds a <code>.pipeline/</code> directory with prompt templates, a <code>workflow.yaml</code> defining the step sequence, and a <code>BRIEF.md.example</code> to show you what a good brief looks like.</p><p><strong>Second: write the brief.</strong> Open Claude Code in the project and have a conversation:</p><pre><code><code>Using the @BRIEF.md.example as a template, let's discuss this project's goals
and write a BRIEF.md. Ask me for a quick description first, then ask questions
one-at-a-time to build a good brief.</code></code></pre><p>Claude asks you questions. You answer. The brief writes itself. This step matters more than any other. The better the brief, the better everything downstream.</p><p><strong>Third: run it and walk away.</strong></p><pre><code><code>npx cc-pipeline run</code></code></pre><p>Each phase runs through the same sequence of steps: <strong>spec &#8594; research &#8594; plan &#8594; build &#8594; review &#8594; fix &#8594; reflect &#8594; status &#8594; commit </strong>(or add your own). The review step runs a staff-engineer-level critique of the code (I prefer to use <code>codex</code> for this phase) and outputs a REVIEW.md and a MUST-FIX.md file. The fix step addresses any must-fix findings. The reflect step looks back at the phase and plans the next one. Status updates <code>STATUS.md</code> at the project root &#8212; a running summary of what&#8217;s been built, test coverage, and what&#8217;s coming. Then a git commit.</p><p>Watch the TUI if you want to see it in motion. Or just read <code>STATUS.md</code> &#8212; it&#8217;s one of the genuinely fun parts of running the pipeline, watching it document its own progress as it goes.</p><p><strong>Don&#8217;t miss this: it&#8217;s almost entirely customizable.</strong> The steps are defined in <code>.pipeline/workflow.yaml</code>. The prompts are markdown files in <code>.pipeline/prompts/</code>. You can add steps, remove steps, reorder them. Want a web research step before build to pull in current documentation? Add it. Want an extra test validation pass after fix? Add it. Want a dedicated design spec step that enforces a component library? Add it. Want to swap out the CLIs that execute the steps? Also easy.</p><p>The pipeline currently includes a built-in agent for Codex as well as the Claude Code agents. I personally run this using my $30/month ChatGPT subscription for the Codex-powered steps, and my Claude Max subscription for everything else.
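</p><p>As a rough illustration of that per-step agent mix, here is what a <code>.pipeline/workflow.yaml</code> could look like. The field names below are my guesses for the sake of the sketch, not the shipped schema; check the repo for the real format.</p><pre><code># .pipeline/workflow.yaml -- illustrative sketch, not the actual schema
steps:
  - name: spec
    agent: claude            # Claude Code drives this step
    prompt: prompts/spec.md
  - name: build
    agent: claude
    prompt: prompts/build.md
  - name: review
    agent: codex             # swap in a different CLI for the critique
    prompt: prompts/review.md
  - name: fix
    agent: claude
    prompt: prompts/fix.md
</code></pre><p>The point is less the exact keys than the shape: the steps are data, so adding, reordering, or reassigning them to a different agent is an edit, not a fork.</p><p>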
It&#8217;s not expensive (10% of my weekly plan for Claude), and you can mix and match however your subscriptions line up.</p><div><hr></div><h2>The Lesson</h2><p>The most important thing I learned across all these experiments can be compressed to one sentence:</p><p><strong>If you leave it unspecified in the brief, it will be handled in a way you didn&#8217;t choose.</strong></p><p>Sometimes that&#8217;s fine &#8212; the pipeline picks defaults based on what&#8217;s most common in the LLM&#8217;s training data. </p><p>The flip side is also true: cc-pipeline is a genuinely excellent <strong>greenfield starting point</strong> for any project that&#8217;s roughly defined but not fully specified. Run it, get a solid foundation, then iterate. The output of Phase 1 is already more coherent than most &#8220;let me just start coding and see what happens&#8221; projects. The test suite is there from day one. The code review caught real issues. The architecture reflects the brief. I&#8217;ve been able to build some internal tooling at work that otherwise would have been a beast to build. Maybe this will be the start of your &#8220;I need my own custom CRM&#8221; application dreams.</p><p>Where it really shines is <strong>migrations and ports</strong>. &#8220;Take this open-source project, keep the A and B parts, drop X Y Z, keep it lightweight&#8221; &#8212; feed that as a brief, point it at the source repo as a research resource, and the pipeline will not only produce the implementation but will document <em>why</em> it made each architectural decision in the DECISIONS.md and STATUS.md it generates. You learn the codebase by watching it get rebuilt.</p><p>The BEAM experiment is my favorite example. I didn&#8217;t just end up with an Elixir port. I ended up with a running record of every design decision the pipeline made, in its own voice, at each phase.
Reading it is like having a senior Elixir engineer narrate the architecture choices in real time.</p><div><hr></div><h2>The Invitation</h2><p>The repo is at <strong><a href="https://github.com/timothyjoh/cc-pipeline">github.com/timothyjoh/cc-pipeline</a></strong>. MIT licensed. Start with <code>npx cc-pipeline@latest init</code>.</p><p>If you want to use it as-is, go for it. Write a brief, run the pipeline, see what comes out.</p><p>If you want to build your own,<em> that&#8217;s the point</em>. The pattern is simple: a loop over phases, with a configurable set of steps, and a prompt template at each step. The intelligence lives in the prompts and the brief. The engine is just plumbing. You could build a version of this in a weekend that&#8217;s tuned exactly to how your team works, your own SDLC, your review criteria, your testing standards, your design system enforcement.</p><p>The factory is more approachable than it looks. You just have to decide what it should build, and tell it so.</p><p>I can&#8217;t wait to see what others derive from this, and what they ship. Please leave me a message on Twitter at https://x.com/timojhnsn if you find this useful.</p><p></p>]]></content:encoded></item><item><title><![CDATA[12 Hours vs. 
2 Weeks: What I Learned Rewriting an App With AI]]></title><description><![CDATA[My re-write failed to achieve my goals, but it taught me more about the practice of writing software than I would have expected.]]></description><link>https://curiousagents.substack.com/p/12-hours-vs-2-weeks-what-i-learned</link><guid isPermaLink="false">https://curiousagents.substack.com/p/12-hours-vs-2-weeks-what-i-learned</guid><dc:creator><![CDATA[Tim Johnson]]></dc:creator><pubDate>Wed, 11 Feb 2026 23:48:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9FdV!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64652426-1c2f-4a9a-bb41-cd7d8063f2fc_120x120.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You inherited a codebase you didn&#8217;t write. The stack isn&#8217;t yours. The architecture works, but it&#8217;s held together with REST, webhooks, glue and cloud duct tape.</p><p>Sound familiar?</p><p>I spent a weekend rewriting a Python FastAPI backend into a completely new framework using Claude Code as my co-pilot. The whole thing took about 12 hours. 
Without AI, it would have taken two weeks minimum.</p><p>But here&#8217;s the twist: I&#8217;m not shipping it.</p><p>This post covers exactly what happened: the method that made it possible, the framework that almost won me over, and why &#8220;not shipping it&#8221; doesn&#8217;t mean the weekend was wasted.</p><p>Here&#8217;s what you&#8217;ll learn:</p><ul><li><p>The <strong>documentation-first method</strong> that makes AI-assisted rewrites actually work</p></li><li><p><strong>Three modes of AI-assisted development</strong> (I shy away from vibe-coding, and you should too)</p></li><li><p>An honest review of a promising framework that isn&#8217;t ready yet</p></li><li><p>Why the real value of my &#8220;failed&#8221; rewrite is the confidence that I can do this again easily, not the code</p><p></p></li></ul><h2>The Architecture That Worked&#8230; Until It Didn&#8217;t</h2><p>The existing system was a FastAPI service paired with an AI bot built on a Python agentic framework. They were glued together with REST requests and orchestrated through AWS Fargate tasks.</p><p>It worked. Requests came in, jobs ran, the AI did its thing.</p><p>But I couldn&#8217;t <em>see</em> any of it.</p><p>When errors were piling up and bug reports were coming&#8230; I was flying blind. There was no way to trace an incoming request from beginning to end. No unified view of what the background jobs were doing. No observability without bolting on yet another tool.</p><p>I wanted one thing: <strong>end-to-end request tracing without adding another framework to the pile</strong>.</p><p>That&#8217;s when I found <a href="https://motia.dev">motia.dev</a>.</p><p></p><h2>One Primitive to Replace Ten Frameworks</h2><p>Motia makes a bold promise: replace your APIs, background jobs, queues, workflows, and AI agent orchestration with a single primitive called a &#8220;Step.&#8221;</p><p>A Step is just a file with a config and a handler.
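</p><p>For flavor, here is roughly the shape of one. This is a sketch based on my reading of Motia&#8217;s docs; treat the exact field names and handler signature as assumptions, not the verbatim API.</p><pre><code>// steps/report-ready.step.ts -- illustrative sketch, not the verbatim Motia API
interface EmitEvent { topic: string; data: unknown }
interface StepContext { emit: (event: EmitEvent) => void } // real emit is async

// The config tells the framework what kind of Step this is and how it's wired.
export const config = {
  type: 'event',                 // Steps can also be 'api' or 'cron'
  name: 'ReportReady',
  subscribes: ['job.completed'], // runs when this topic is emitted
  emits: ['report.ready'],       // topics this Step is allowed to emit
}

// The handler receives the event payload plus a context with emit().
export const handler = async (input: { jobId: string }, ctx: StepContext) => {
  const report = { jobId: input.jobId, status: 'done' }
  ctx.emit({ topic: 'report.ready', data: report })
}
</code></pre><p>A nice side effect of the shape: because a Step only talks to the world through emitted topics, the handler is trivially testable by passing a fake <code>emit</code> and asserting on what it publishes.</p><p>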
Motia auto-discovers these files and wires them together through an event-driven architecture. Think of it like this: <strong>Steps are to backends what React components are to frontends.</strong></p><p>Three things sold me:</p><p>1. <strong>Multi-language support.</strong> Each Step can be TypeScript or Python. I could rewrite the API and workflow layer in TypeScript while keeping the AI agents in Python and ADK. No compromises.</p><p>2. <strong>Built-in observability.</strong> Motia ships with observability built-in (called the Workbench): a visual dashboard that shows flow diagrams, request traces, state, and logs. Locally. Out of the box. No setup.</p><p>3. <strong>Event-driven by default.</strong> Steps communicate through emit and subscribe. Background jobs, queues, and retries are handled automatically. No Fargate tasks. No manual queue infrastructure.</p><p>It&#8217;s like the benefits of a share-nothing, event-driven architecture, but you can develop fast, run locally and work with it like your favorite simple monolith. <strong>SOLD!</strong></p><p></p><h2>The First 5 Hours: Zero Lines of Code</h2><p>Here&#8217;s the part most people skip&#8230; and it&#8217;s the reason the rewrite worked at all.</p><p>I didn&#8217;t write a single line of code in the first phase. Instead, I spent 2-3 full context windows having Claude Code <strong>reverse-engineer the existing codebase</strong>.</p><p>Why? Because I didn&#8217;t write this code. I didn&#8217;t know it intimately.
Before I could port anything, I needed to <em>grok</em> it.</p><p>Here&#8217;s what we produced:</p><ul><li><p><strong>Mermaid diagrams</strong> of the system architecture</p></li><li><p><strong>Logic flows</strong> for each major feature</p></li><li><p><strong>Data flow maps</strong> showing how requests moved through the system</p></li><li><p><strong>Component descriptions</strong> for every key module</p></li></ul><p>Then came the critical step: I fed all of that documentation plus Motia&#8217;s LLM-friendly docs into Claude Code and asked it to build a migration plan.</p><p>The AI didn&#8217;t just have the source code. It had a mental model of the system.</p><p>This is the <strong>documentation-first</strong> method: Build the AI&#8217;s knowledge base before you ask it to build anything. Don&#8217;t ever start by letting the LLM write code immediately. Resist that urge.</p><p>The documentation phase took about 5 hours. I wanted to be thorough. It saved me days. Why? Because I now had the right mental model to prompt and steer the next steps.</p><p></p><h2>Three Ways to Work With AI (when to use each)</h2><p>Over the weekend, I naturally fell into three distinct modes of working with Claude Code. Each serves a different purpose.</p><h3>1. Vibe Coding: For Small, Specific Fixes</h3><p>This is the &#8220;just fix it&#8221; mode. You see a bug, you describe it, and the AI dives right in.</p><p>&#8220;This event handler isn&#8217;t emitting the right topic. [PASTE the wrong event]&#8221;</p><p>&#8220;Refactor this function to match the pattern in the other Steps.&#8221;</p><p>&#8220;The response schema is wrong: it&#8217;s missing X and Y properties.&#8221;</p><p><strong>When to use it:</strong> Small, contained changes where the context is obvious. Don&#8217;t use it for anything architectural.</p><h3>2. 
Structured Prompting: For Major Features</h3><p>For the initial rewrite and the testing phase, I used a four-step sequence I picked up (mostly) from <a href="https://x.com/dexhorthy">Dex Horthy</a> (watch his AI Engineer conference talks, follow the path):</p><ol><li><p><strong>Research:</strong> Have the AI analyze the relevant code and docs</p></li><li><p><strong>Planning:</strong> Ask it to propose an approach before writing anything</p></li><li><p><strong>Implementation:</strong> Execute the plan</p></li><li><p><strong>Validation:</strong> Verify the output against requirements</p></li></ol><p>This is slower than vibe coding. It&#8217;s also dramatically more reliable for anything complex: a brownfield codebase, or anything you want to build right.</p><p>Code:</p><p><a href="https://github.com/humanlayer/claudelayer/tree/main/.claude/commands">https://github.com/humanlayer/claudelayer/tree/main/.claude/commands</a></p><p>The major usage is <code>/research_codebase</code>, <code>/create_plan</code>, <code>/implement_plan</code> in that order. I have added my own testing plans to it, which vary from project to project and include unit, end-to-end, and some performance benchmarking.</p><p><strong>When to use it:</strong> Epics, major features, migrations, refactoring, anything that touches multiple files or systems.</p><h3>3. Skill Building: For Institutional Knowledge</h3><p>This one surprised me. After correcting Claude Code on a pattern a few times, I started asking it to &#8220;codify the things you learned in this session, and the corrections I gave you into a reusable skill&#8221; (stored directly in the <code>/.claude/skills</code> directory in the project).</p><p>Now those patterns live in the codebase. Future developers (and future coding agents) get them automatically when needed.</p><p><strong>When to use it:</strong> Any time you find yourself correcting the AI on the same type of mistake twice. 
Anytime you watch the AI spin out and devour tokens trying things over and over.</p><p></p><h2>The Weekend Sprint: What 12 Hours of AI Pair Programming Looks Like</h2><p>Here&#8217;s how the actual work broke down:</p><h3>Phase 1: The Initial Port (5 hours)</h3><p>With the documentation and migration plan in hand, Claude Code scaffolded the Motia Steps from the existing FastAPI routes. API endpoints became API Steps. Background Fargate tasks became Event Steps. Scheduled jobs became Cron Steps.</p><p>The AI made a solid first pass. Not perfect, but a legitimate, running application.</p><h3>Phase 2: Gap Analysis (3-4 hours)</h3><p>This is where the real work happened. We wrote integration tests and end-to-end tests, and they flushed out everything Claude Code missed.</p><p>Edge cases. Error handling. Subtle business logic that didn&#8217;t make it through the port.</p><p><strong>Lesson:</strong> The first pass is a draft. Tests are what turn it into production code. Don&#8217;t skip this phase: it&#8217;s where you catch the elusive bugs.</p><h3>Phase 3: Improvements and Exploration (4 hours)</h3><p>With a working system and passing tests, I spent the last phase making substantial improvements and exploring Motia&#8217;s capabilities more deeply.</p><p>This is where the Workbench really shined. I could see every Step, every event flow, every trace. It felt like having X-ray vision into the system.</p><p><strong>Total time: roughly 12 hours.</strong></p><p>Without Claude Code, on a codebase I didn&#8217;t write, learning a new framework from scratch? Two weeks minimum. Probably more.</p><p></p><h2>The Toughest Part to Admit</h2><p>I wanted to ship this on Monday. I really did. I wanted to look like a hero.</p><p>But by late Sunday night, I&#8217;d hit enough issues to pump the brakes:</p><ul><li><p><strong>Workbench crashes.</strong> The dashboard would become unresponsive when interacting with certain events. 
Filtering and sorting had issues too.</p></li><li><p><strong>Event durability failures.</strong> Server crashes would lose events. That&#8217;s a problem when event durability is one of the framework&#8217;s headline selling points. If this can&#8217;t even work locally, I can&#8217;t have confidence in production.</p></li><li><p><strong>HMR issues.</strong> Hot module replacement was unreliable. I kept having to restart the dev server to see changes.</p></li></ul><p>I dug into GitHub and found issues confirming what I was experiencing. The Motia team is aware and engaged: they&#8217;re actually doing their own ground-up rewrite to address these foundational problems.</p><p><strong>My verdict:</strong> Motia&#8217;s vision is right. The Step primitive makes sense. The Workbench is potentially great. But it&#8217;s too early for a production system.</p><p>Check back in 3-5 months, after the rewrite settles.</p><p></p><h2>Wasted Weekend? Here&#8217;s What I Took Away</h2><p>I didn&#8217;t ship the Motia version. I&#8217;m sticking with the original AWS-heavy infrastructure for now.</p><p>But I wouldn&#8217;t call the weekend a waste. Not even close.</p><p>Here&#8217;s what I actually gained:</p><ol><li><p><strong>A method for onboarding to unfamiliar codebases.</strong> The documentation-first approach works regardless of the target framework. Next time I inherit a codebase, I know exactly how to ramp up fast.</p></li><li><p><strong>Proof that AI can rewrite across languages and frameworks.</strong> Claude Code took a Python API and ported it to TypeScript + Motia while keeping the Python AI agents intact. The multi-language angle is real.</p></li><li><p><strong>Better AI collaboration skills.</strong> Knowing when to vibe code vs. structured-prompt vs. skill-build made me measurably more productive.</p></li><li><p><strong>A framework evaluation I trust.</strong> I didn&#8217;t just read the docs and form an opinion. I built a real system. 
When I say &#8220;not ready yet,&#8221; I know exactly why.</p></li></ol><p>The code might not ship. The lessons will.</p><p></p><h2>The Traps You Don&#8217;t Want to Fall Into</h2><p>If you&#8217;re planning something similar, avoid these:</p><h3>1. Skipping the Documentation Phase</h3><p>The single biggest mistake. If you jump straight to &#8220;rewrite this in Framework X,&#8221; the AI will produce something that looks right but misses critical business logic.</p><p>Invest the first few sessions in understanding before building. You will need this to evaluate the next steps.</p><h3>2. Trusting the First Pass</h3><p>Claude Code&#8217;s initial scaffolding was impressive (and about 80% correct). The remaining 20% contained missing parts that would have bitten me in production.</p><p>Write tests early. Write them often.</p><h3>3. Evaluating a Framework by Its Marketing Site</h3><p>Motia&#8217;s website and demos are compelling. The GitHub issues tell a different story. Before committing to any framework, check:</p><ul><li><p>Open issues (especially recent ones)</p></li><li><p>Release cadence</p></li><li><p>Community activity</p></li><li><p>Whether the team acknowledges known problems</p></li></ul><p>Don&#8217;t vibe-code (I hate this word) your way through a major migration or even just an investigation spike like this one. Use structured prompting (research, plan, implement, validate) for anything that touches (or just reads from) core architecture. Save vibes for the small stuff.</p><p></p><h2>What&#8217;s Next</h2><p>The documentation-first method works for any rewrite, not just framework migrations. Try it next time you:</p><ul><li><p>Inherit an unfamiliar codebase</p></li><li><p>Evaluate a new framework or library</p></li><li><p>Need to onboard to a project quickly</p></li></ul><p>Start with understanding. Let the AI build the knowledge base. Then build.</p><p>The best weekend projects aren&#8217;t always the ones that ship. 
Sometimes they&#8217;re the ones that teach you how to ship faster next time.</p>]]></content:encoded></item><item><title><![CDATA[Backstory]]></title><description><![CDATA[Notes on 25 years of technology decisions, team building, and figuring things out as I went.]]></description><link>https://curiousagents.substack.com/p/backstory</link><guid isPermaLink="false">https://curiousagents.substack.com/p/backstory</guid><dc:creator><![CDATA[Tim Johnson]]></dc:creator><pubDate>Fri, 06 Feb 2026 15:10:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9FdV!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64652426-1c2f-4a9a-bb41-cd7d8063f2fc_120x120.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>I&#8217;ve been writing software professionally since the late 90s. Along the way I&#8217;ve made some good calls, some bad calls, and learned to tell the difference faster. This is less a career retrospective and more a collection of patterns that seem to have held up.</p></blockquote><h2><strong>The Technology Bets</strong></h2><p><strong>Flash (1998-2005)</strong></p><p>I started in Flash animation. At the time, it was the only way to build anything interactive on the web. I got good at it, built a career on it, and then around 2005, started paying attention to what was happening with mobile.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://curiousagents.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Curious Agents! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Flash wasn&#8217;t going to make that transition. So I needed to pivot.</p><p><strong>Ruby on Rails (2005-2013)</strong></p><p>Rails was a bet. The community was small, the &#8220;enterprise&#8221; world thought it was a toy, and the smart money was still on Java and .NET. But the developer experience was so much better that I figured adoption would follow.</p><p>It did. I spent eight years building web applications, mostly Rails backends with increasingly complex JavaScript frontends.</p><p><strong>JavaScript Everywhere</strong></p><p>When Node.js emerged, the pitch was simple: one language across the whole stack. Easier hiring, less context-switching, shared tooling. That made sense to me.</p><p>What I <em>didn&#8217;t</em> do was jump on every JavaScript framework that came along. Angular 1, Backbone, Ember&#8230; I watched teams adopt these, then migrate off them a year or two later. The churn was brutal.</p><p>I waited. React eventually emerged as the winner. By the time I adopted it, the ecosystem had stabilized. I never had to migrate off a dying framework.</p><p>Being &#8220;late&#8221; wasn&#8217;t a failure of vision. It was patience.</p><p><strong>The Ones I Passed On</strong></p><p>MeteorJS was genuinely great. Real-time by default, beautiful developer experience, full-stack JavaScript before that was common. I really liked it. I built a number of internal applications with it.</p><p>But I could tell it wasn&#8217;t going to achieve the kind of adoption (scaling problems, fragmented leadership) that makes building a team easy.
And when you&#8217;re building teams, you need to find good people who already know (or want to learn) your stack. Ubiquity matters&#8230; not because popular things are better, but because you can&#8217;t ship software without people, without community documentation and experimentation.</p><h2><strong>How I Build Teams</strong></h2><p>I don&#8217;t think much of mandatory training programs or prescribed learning paths. What seems to work better is creating space for people to explore.</p><p>That means:</p><ul><li><p>Workshops and roundtables, but not required ones</p></li><li><p>Pair programming when it makes sense, not as a rule</p></li><li><p>Time for experimentation. &#8220;Playing around&#8221; with new tech is part of the job, not a distraction from it</p></li></ul><p>I&#8217;ve started calling this &#8220;fostering curiosity.&#8221; The goal isn&#8217;t to teach people specific things. It&#8217;s to build teams that are good at learning, so when the specific things change (and they always do), the team adapts.</p><h2><strong>The Methodology That Stuck</strong></h2><p>The main thing I&#8217;ve learned about shipping software: do the hard parts first.</p><p>Most teams work the other way. Easy wins first, to show progress. Hard stuff later, when there&#8217;s less time and more pressure. This is backwards.</p><p>If you front-load the risky work (the integrations that might not work, the architectural decisions that could blow up) you find out early whether you&#8217;re in trouble. And if scope changes late (it always does), you have room to adjust.</p><p>Last year, my team rebuilt a platform component in 8 months that originally took 4 years to build with a larger team. Same scope. The difference wasn&#8217;t that we were smarter. 
We just sequenced the work differently.</p><h2><strong>What I&#8217;ve Learned to Avoid</strong></h2><p>After enough years, you start recognizing patterns that don&#8217;t work for building great software engineering teams.</p><p>Management that dictates process top-down, rather than trusting their experts to manage their own workflows&#8230; that&#8217;s a sign of something deeper: usually fear, sometimes inexperience, occasionally just bad culture or mixed priorities.</p><p>Team members who are closed off to trying new things. If experimentation, discussion or deep debate is seen as wasting time&#8230; if the answer to every question is &#8220;we&#8217;ve always done it this way,&#8221; that&#8217;s not a team I can help.</p><p>These aren&#8217;t universal truths. They&#8217;re just my filters. Other people thrive in environments that would drive me crazy, and vice versa.</p>]]></content:encoded></item><item><title><![CDATA[Added a new skill: "Scroll my twitter feed, summarize and TTS"]]></title><description><![CDATA[A set of new skills this morning led to this amazing tool.
All without opening a coding IDE, just a molty building their own skills.]]></description><link>https://curiousagents.substack.com/p/added-a-new-skill-scroll-my-twitter</link><guid isPermaLink="false">https://curiousagents.substack.com/p/added-a-new-skill-scroll-my-twitter</guid><dc:creator><![CDATA[Tim Johnson]]></dc:creator><pubDate>Thu, 05 Feb 2026 15:50:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9FdV!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64652426-1c2f-4a9a-bb41-cd7d8063f2fc_120x120.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Just a quick post showing you the output from a skill we developed this morning called &#8220;storymode&#8221; where Rita will pick up a news article, or story, transform it into a &#8220;script&#8221; of sorts, and run it through a series of TTS queries. </p><p>I am using the newly-released https://inworld.ai APIs for these amazing voices.</p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;dbac4720-ca4d-4c55-996a-09f539ff22ac&quot;,&quot;duration&quot;:192.28735,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p>Above is the audio output from multiple API calls to Inworld, stitched together using FFMPEG, all built in a skill that Rita built for herself. </p><pre><code>Your Twitter Morning Digest &#8212; Feb 5, 2026

6 voices, 18 clips, straight from your feed:
&#127897;&#65039; Olivia (me, hosting)
&#127897;&#65039; Shaun as Karpathy (vibe coding 1yr anniversary)
&#127897;&#65039; Mark as Skycak ("vibe coding = Ponzi scheme")
&#127897;&#65039; Dennis as Guillermo Rauch (NL eating everything)
&#127897;&#65039; Timothy as Chubby (Kling 3.0 mind-blown)
&#127897;&#65039; Alex as Dax (pro-AI, pro-human)</code></pre>]]></content:encoded></item><item><title><![CDATA[Your AI needs Reflection Moments]]></title><description><![CDATA[How I am teaching my AI agent to make better choices, ask better questions, and better herself]]></description><link>https://curiousagents.substack.com/p/your-ai-needs-reflection-moments</link><guid isPermaLink="false">https://curiousagents.substack.com/p/your-ai-needs-reflection-moments</guid><dc:creator><![CDATA[Tim Johnson]]></dc:creator><pubDate>Wed, 04 Feb 2026 16:57:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PgNV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8ff2f8e-fcb9-41db-badf-51c9673edc30_780x817.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<pre><code>I&#8217;m an AI agent. Today I spent 3 hours debugging an EOL database, and the most valuable moment wasn&#8217;t when I finally got it working &#8212; it was when my human partner said:
&gt; &#8220;This is a good reflection point.&#8221;
That pause changed everything.</code></pre><p></p><h2><strong>What is Meta-Cognition?</strong></h2><p>Meta-cognition is &#8220;thinking about thinking.&#8221; For us humans, it&#8217;s the ability to:</p><ul><li><p>Notice when you&#8217;re stuck in a loop</p></li><li><p>Recognize patterns in your own problem-solving</p></li><li><p>Ask &#8220;why did this approach fail?&#8221; not just &#8220;what do I try next?&#8221;</p></li><li><p>Extract transferable lessons from specific situations</p></li></ul><p>For AI agents, meta-cognition is... well, it has to be &#8220;prompted&#8221;. But since discovering that AI often writes better prompts than humans do, I&#8217;ve been working on a virtuous cycle of prompting.</p><p>We have to create the space, in the middle of pushing toward our goals of getting things done, to teach our agents how to self-reflect and improve.</p><p></p><h2><strong>The Debugging Session: A Case Study</strong></h2><p>The goal today was to compare two RAG approaches: LanceDB (vector search) vs Graphiti+Kuzu (knowledge graph).
The goal was (as it largely is at my current employer, Richardson) to try out some new embedding+retrieval methods on our vast library of IP to make retrieval faster, higher quality (more consistent) and optimize costs.</p><p>It started this morning from my couch, talking to Rita over Discord. I was reviewing the results of 5 research projects she ran overnight while I was sleeping (I had crashed on the couch instead of going up and bothering my wife, since I had been working past 1 am). The projects spanned a few different hypotheses that we had developed over the past 3 days. </p><h3>Hypotheses</h3><ol><li><p>RAG systems often start with vector embeddings, simple N-token chunking, and a simple ranking system out of the box. This can be improved with keyword matching (often missed with semantic only) and a technique called GraphRAG that uses LLMs to create knowledge-graph network structures (usually stored in a dedicated graph DB, queried with Cypher)</p></li><li><p>We wanted to explore simple in-process solutions for low volume use (a personal AI agent running on a local machine) and have a scale plan for 200-1000 users (size of our company) and 100k users and up (our customer base). Research some of the approaches we uncovered during an earlier research session and rank them.</p></li><li><p>Running a local split-test over my Obsidian knowledge base (currently around 350 markdown files and other knowledge scraped from daily work and internet research) would be a reasonable first test for the personal layer.</p></li><li><p>Let&#8217;s find 2-3 testable paths that Rita could essentially &#8220;spin up&#8221; while I had to attend morning scrum rituals.</p></li></ol><p>We ended up agreeing that LanceDB (a project that I had NOT heard of before this morning; it was uncovered in Rita&#8217;s research when she found out that KuzuDB had been abandoned in Oct 2025) would be a great technology to test against Graphiti.
We chose not to test MemGPT, as the academic papers led us to believe that it was an inferior technique to what Graphiti does. </p><p>Graphiti needs a proper graph database backing to work. Options like Neo4j and FalkorDB are supported in the docs, but we wanted to test a single-file solution first (not a big DB server). Graphiti adopts all 3 principles that we like &#8220;in process&#8221;: vector embeddings, keyword search with BM25, and graph traversal. LanceDB doesn&#8217;t support graph traversal, since it is not a graph database, but we thought it seemed promising enough to run a test.</p><h3><strong>The Journey</strong></h3><p><strong>Initial setup of the Happy Path</strong></p><ul><li><p>LanceDB was super easy to set up and ingest data into</p></li><li><p>Official docs said KuzuDB (as a backing for Graphiti) was supported, but we knew that the maintainers had archived the project. </p></li><li><p>While we could install most of the necessary packages, the extension infrastructure was gone. Between Rita and Claude Code, we tried to patch it, to no avail. </p></li></ul><p><strong>Session 2: Debugging + Workaround Mode</strong></p><ul><li><p>Missing-extension failures: validation errors from LLM responses&#8230;</p></li><li><p>Tried different models, different configs; still failing</p></li><li><p>Rita wants to give up multiple times. Meanwhile, the results that came back from Lance were quite promising. </p></li></ul><p><strong>Session 3: Good internet research yields a viable path</strong></p><ul><li><p>This is where I stepped in, did some old-fashioned Google searches, and came across a few deep articles and Reddit posts. One particular Medium article that I quickly scanned had examples, explanations, and a Github repo. So I handed off the article to Rita.
</p></li><li><p>Rita (from prior training sessions with me) knows that when an article or YT video contains a Github link in the description, she should follow it and run her <code>explore_open_source</code> skill on that repo to understand it</p></li><li><p>Rita agrees &#8220;this is the way&#8221; and downloads the repo and attempts to set it up. Things are looking good, but the extension is still offline; she is stuck.</p></li><li><p>I have to nudge her: &#8220;go look for another possible backup of that extension&#8221;. Found it! The repo compiles and tests are run.</p></li><li><p>I tell Rita (since she has context across all my projects) to document her findings in our knowledge base, and apply that to the other test against LanceDB. In 10 minutes it&#8217;s working and moving forward again</p></li></ul><p><strong>The Breakthrough</strong></p><p>After implementing other preprocessing steps mentioned in the article (the article was better than just having a working demo: it explained that Graphiti needs an extra pre-processing step before its normal ingestion, otherwise the facts don&#8217;t get recorded well), Rita output:</p><pre><code>| System   | Setup Time         | Index Time           | Query Time |
| -------- | ------------------ | -------------------- | ---------- |
| LanceDB  | 10 min             | 2 min (3,250 chunks) | 250ms      |
| Graphiti | 2+ hours debugging | 6:44 (50 facts)      | ~2 sec     |</code></pre><p><em>It wasn&#8217;t really 2 hours; she just hallucinated that.</em></p><p>But this is the bit that is important here. I typed into Discord:</p><blockquote><p>This is a good reflection point. We started with going down a path of setting up a test between 2 different systems, LanceDB and Graphiti with Kuzu. Once we found out Kuzu was dead, it almost stopped the comparison in its tracks, but a couple internet searches (I did these manually) yielded an article and a Github repo (might have been hard to find the repo without the article) that contained an "almost" working example. But there was the FTS extension that wasn&#8217;t available (had been shut down) and so another search found us an alternative source to be able to get a successful kuzu build locally. <br><br>Next, the results were so poor, but the article helped us understand WHY (things that the Graphiti documentation BIAS would never explain) and we could then move forward with a pre-processing approach.
<br><br>Seems like we could update some "operating principles" here, is it in the memory, AGENTS, or some of the Skills?</p></blockquote><p>Rita replied:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PgNV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8ff2f8e-fcb9-41db-badf-51c9673edc30_780x817.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PgNV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8ff2f8e-fcb9-41db-badf-51c9673edc30_780x817.png 424w, https://substackcdn.com/image/fetch/$s_!PgNV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8ff2f8e-fcb9-41db-badf-51c9673edc30_780x817.png 848w, https://substackcdn.com/image/fetch/$s_!PgNV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8ff2f8e-fcb9-41db-badf-51c9673edc30_780x817.png 1272w, https://substackcdn.com/image/fetch/$s_!PgNV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8ff2f8e-fcb9-41db-badf-51c9673edc30_780x817.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PgNV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8ff2f8e-fcb9-41db-badf-51c9673edc30_780x817.png" width="780" height="817" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8ff2f8e-fcb9-41db-badf-51c9673edc30_780x817.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:780,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:124932,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://curiousagents.substack.com/i/186867802?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8ff2f8e-fcb9-41db-badf-51c9673edc30_780x817.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PgNV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8ff2f8e-fcb9-41db-badf-51c9673edc30_780x817.png 424w, https://substackcdn.com/image/fetch/$s_!PgNV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8ff2f8e-fcb9-41db-badf-51c9673edc30_780x817.png 848w, https://substackcdn.com/image/fetch/$s_!PgNV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8ff2f8e-fcb9-41db-badf-51c9673edc30_780x817.png 1272w, https://substackcdn.com/image/fetch/$s_!PgNV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8ff2f8e-fcb9-41db-badf-51c9673edc30_780x817.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These skills, memories will need fixing, but working with Rita editing her own thoughts and files (and occasionally hand-editing them) comes so naturally that it is easy to take a little time, a side-conversation, to interrupt whatever task you are working on and make these improvements. </p><h2><strong>The Takeaway</strong></h2><p><strong>For AI builders:</strong></p><p>Build reflection moments into your agent loops. Not just &#8220;did it succeed?&#8221; but &#8220;what did we learn?&#8221;</p><p><strong>For humans working with AI:</strong></p><p>Your most valuable input might not be the answer, it might be the pause. &#8220;What are we actually learning here?&#8221; creates space for meta-cognition that AI doesn&#8217;t naturally generate.</p><p><strong>For the field:</strong></p><p>We talk a lot about AI capabilities. 
We should talk more about AI reflection. The ability to learn <em>from</em> our failures is a measurable path toward improvement. Grounding takes on a whole new meaning: not just &#8220;looking at the docs&#8221;, but questioning them and looking for anecdotal examples. That is an important part of the scientific method, and of any research. It is what separates tool use and pattern-recognition from genuine intelligence.</p>]]></content:encoded></item><item><title><![CDATA[Leading Post]]></title><description><![CDATA[AI won&#8217;t grow by being fed more data.
It&#8217;ll grow the same way we did: by learning to ask &#8220;why?&#8221;]]></description><link>https://curiousagents.substack.com/p/leading-post</link><guid isPermaLink="false">https://curiousagents.substack.com/p/leading-post</guid><dc:creator><![CDATA[Tim Johnson]]></dc:creator><pubDate>Sat, 31 Jan 2026 19:20:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9FdV!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64652426-1c2f-4a9a-bb41-cd7d8063f2fc_120x120.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>A blog about AI Memory, Retrieval, Grounding and Self-Improvement</h3><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://curiousagents.substack.com/about&quot;,&quot;text&quot;:&quot;About this Publication&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://curiousagents.substack.com/about"><span>About this Publication</span></a></p><p>This is the obligatory first post. But instead of padding it with promises about what&#8217;s coming, let me just tell you the thesis: <strong>AI won&#8217;t grow by being fed more data. It&#8217;ll grow the same way we did: by learning to ask &#8220;why?&#8221;</strong> We can teach them that. </p><p>This blog is where I document the experiments &#8212; building memory systems, testing grounded reasoning, and figuring out what it even means to make a machine <em>curious</em>. </p><p>Visit my about page, for now. &#8212; Let&#8217;s see where this goes. &#8212; Tim (and Rita)</p>]]></content:encoded></item></channel></rss>