Jack Danger
Recent content on https://jackdanger.com/

Committing Prompts, promptlog.md
https://jackdanger.com/promptlog/
Sat, 07 Mar 2026 00:00:00 +0000

The whole industry is figuring out how to review AI-generated code. The code is better than a person would write, no doubt. But it’s not perfect and any approver owns the production impact of its imperfections.

I think the answer here is both obvious and unsatisfying: We need several different review patterns. There’s no one ideal AI-review-bot; we need that bot plus other bots plus unit tests, production observability, gating releases behind flags, etc.

One pattern that might help is committing prompts to git. Human review capacity scales with human generated content.

To try this, I made a skill (promptlog.md) to sanitize all my prompts and commit them into ./prompts/ with the code.
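The skill file itself isn't reproduced here, but the mechanic is simple enough to sketch. A hedged illustration in Python (the function names and redaction patterns below are mine, not from promptlog.md): scrub obvious secrets from a prompt, then write it into ./prompts/ under a timestamped name so it can be committed with the code.

```python
import re
from datetime import datetime, timezone
from pathlib import Path

# Illustrative patterns only -- the real skill would carry a longer list.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"(sk|ghp|xoxb|xoxc)-[\w-]{10,}"), "[TOKEN]"),
]

def sanitize(prompt: str) -> str:
    """Replace anything secret-shaped with a placeholder."""
    for pattern, placeholder in REDACTIONS:
        prompt = pattern.sub(placeholder, prompt)
    return prompt

def log_prompt(prompt: str, directory: str = "prompts") -> Path:
    """Write the sanitized prompt to a timestamped file ready to commit."""
    Path(directory).mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(directory) / f"{stamp}.md"
    path.write_text(sanitize(prompt) + "\n")
    return path
```

A reviewer then reads `prompts/*.md` in the PR the same way they'd read any other human-authored file.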

It’s helping with an ambitious new project called robotocore that fully replaces LocalStack. You can review exactly how I built it.

I expect I’ll only review the prompts from inbound PRs. I’ll let bots check that the prompt was correctly implemented.

I’m writing this up as a post mostly so you can grab the skill if you want but also to make this point: as LinkedIn fills up with “moar gastown” and people boasting about how many complex agents they’re using, I’ve had incredible results working linearly with one agent, guided thoughtfully by skills that I had it write for itself.

So, in case you’re feeling left behind, the best practices of engineering management seem to apply seamlessly to the work of agentic coding:

  • Manage to output, don’t micromanage the process
  • Develop an intuition of the capabilities of the person/agent you’re leading and make the work fit their strengths
  • Inspect and interrogate as necessary, but don’t expect to understand everything until you ask, at which point you have to read a lot
  • Celebrate the wins — it turns out that even machines respond better to being told what to do than what not to
The Reach Pattern
https://jackdanger.com/the-reach-pattern/
Thu, 26 Feb 2026 00:00:00 +0000

Your AI coding assistant can see your code but not your organization. You’re the one synthesizing the Slack threads, Jira tickets, Confluence pages, Salesforce records — the hundred other places where context actually lives.

What I call the ‘Reach’ pattern is a personal CLI that borrows your real browser sessions to call web APIs on your behalf. Your AI assistant gets structured skill files that teach it to drive that CLI — searching all your services in parallel and producing polished artifacts, including new software.

I think of it as inverted SaaS: the software is on my machine, reaching out.

Prompting a system like this gets you:

  • Root-cause analysis of any incident — it has access to your repos, monitoring, chat, and paging all at once
  • Documentation better than the handwritten stuff — it synthesizes official docs with what people actually said in Slack and what’s in the source code
  • Self-reflection and coaching from your DMs and private docs
  • Software development with full organizational context

All of those are useful. That last one is the real killer.

We say the hard part about engineering isn’t the code, it’s the people and the organization, right? Well, what if your editor knew how to navigate the organization? What if it had access to everything anyone at your company has ever written — and a decent LLM?

Replacing vendors, one by one

I tried this out the other day by asking my editor “Look at all the ways we use [vendor] here, look at all the code that connects to it, and build a replacement. Put it in a private repo.”

6 hours later, with some followup prompts like “review this as if you were [CTO] and [most skeptical senior engineer] and make it flawless”, I had a working replacement for a major data vendor.

It didn’t just work as software, it worked as a solution for my company. It did product research and wrote a little PRD in its head while connecting documented problems to a generated solution.

Folks have asked me for the code but, weirdly, it’s easier for me to show you how to build it yourself.

Prompt 1
Make me a Chrome extension that lets me export any site's cookies (through a local Rust bridge server) to disk at 0600. Sensitive tokens like OAuth refresh tokens and API keys should go in the macOS Keychain.

Dress it up how you want. Ask for a cool logo. Iterate on the extension.

Prompt 2
Now write a Rust CLI with a cute name that reads cookies from disk, fetches sensitive tokens from the macOS Keychain, and makes requests on my behalf

This CLI is now able to be you, poking around websites.

Prompt 3
I've added Slack, Confluence, Salesforce, and Jira cookies. Teach the CLI to use those APIs and fetch data as markdown and JSON files

You’re exporting everything out of SaaS now.

Prompt 4
Okay, now make this really useful to me. Teach the CLI to synthesize the data and deeply research everything

At this point you’ve entered whatever AI revolution they’ve been warning us about.

Or, if you want to skip all that, I’ve asked one of my Reach apps to help you bootstrap yours. Just tell it the URL of this webpage and ask it to build you the same thing. It can read the instructions and follow them.

The Reach Pattern

Build a local CLI tool called reach in a monorepo. Four components: a compiled CLI binary, a Chrome extension, AI context files, and a Markdown-to-HTML artifact renderer.

The user will tell you which services to integrate. Classify each into one of these auth categories:

| Category | How it works | Examples |
|---|---|---|
| Cookie auth | Export cookies from Chrome via the extension. CLI injects as `Cookie:` headers. | Most SaaS with a web UI: Slack, Jira, Salesforce, HubSpot, Notion, Linear, internal dashboards |
| OAuth | One-time browser flow, store refresh token, auto-refresh. | Google (Drive, Gmail, Calendar), Microsoft 365, Salesforce API |
| API key | User pastes a token during setup. CLI injects as `Authorization: Bearer`. | Gong, Datadog, PagerDuty, Stripe, OpenAI, Linear API, Notion API |
| Existing CLI | Shell out to an installed CLI. | `gh`, `terraform`, `aws`, `kubectl`, `sf` |

Build in this order. Each step depends on the previous ones.

Step 1: Credential Storage and Bridge Server

Two tiers of sensitivity → two storage backends:

  • macOS Keychain — for true secrets that grant persistent access beyond a browser session: OAuth refresh tokens and API keys. These are long-lived credentials that could cause real damage if leaked. One Keychain entry per OAuth provider (auto-refresh before expiry; one token can cover multiple APIs from the same provider, e.g., all Google APIs share one token). API keys stored as a single JSON map of service→token pairs. Each service has a ./reach {service} setup subcommand.
  • Disk with 0600 permissions — for cookie files at ~/.config/reach/cookies/{domain}.json. Cookies are session-equivalent data, the same sensitivity as your browser’s cookie jar. Do NOT use the Keychain for cookies — it prompts for a password on every access, which makes automated use impossible.
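As a rough illustration of the disk tier (sketched in Python rather than the CLI's Rust; `store_cookies`, `load_cookies`, and the explicit directory parameter are hypothetical names):

```python
import json
from pathlib import Path

def store_cookies(domain: str, cookies: list, directory: str) -> Path:
    """Write a domain's cookies to disk, readable only by the current user."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    path = target / f"{domain}.json"
    path.write_text(json.dumps(cookies, indent=2))
    path.chmod(0o600)  # session-equivalent data: owner read/write only
    return path

def load_cookies(domain: str, directory: str) -> list:
    """Read a domain's cookie file back for header injection."""
    return json.loads((Path(directory) / f"{domain}.json").read_text())
```

The explicit `chmod(0o600)` matters: `mkdir` and `write_text` alone leave the file at whatever the process umask allows.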

For Google Drive (and other Google APIs): you need a Google Cloud project with the relevant APIs enabled. Create OAuth 2.0 credentials (Desktop app type) and download client_secret.json. Then run reach gdrive setup, which reads that file, opens a browser for consent, and stores the refresh token in the Keychain. This is a one-time setup but it’s the most involved auth category — budget a few minutes for the Cloud Console.

Bridge server on localhost:9877: receives POST /store-cookies from the Chrome extension and writes to the cookie directory. Must include Access-Control-Allow-Origin: * and handle OPTIONS preflight — the extension posts cross-origin and this will fail silently without CORS headers. Runs as a background process.
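A minimal sketch of such a bridge, in Python for brevity (the post specifies Rust; the endpoint and payload shape follow the description above, and the in-memory `COOKIE_STORE` stands in for writing to the cookie directory):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

COOKIE_STORE = {}  # sketch only: the real bridge writes 0600 files to disk

class BridgeHandler(BaseHTTPRequestHandler):
    def _cors(self):
        # Without these headers the extension's cross-origin POST fails silently.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")

    def do_OPTIONS(self):  # CORS preflight sent by the browser before the POST
        self.send_response(204)
        self._cors()
        self.end_headers()

    def do_POST(self):
        if self.path != "/store-cookies":
            self.send_response(404)
            self._cors()
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        COOKIE_STORE[payload["domain"]] = payload["cookies"]
        body = b'{"ok": true}'
        self.send_response(200)
        self._cors()
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the sketch quiet
        pass

def serve(port=9877):
    HTTPServer(("127.0.0.1", port), BridgeHandler).serve_forever()
```

Binding to 127.0.0.1 (not 0.0.0.0) keeps the bridge reachable only from the local machine.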

./reach auth status output:

Cookies:
  ✓ mycompany.slack.com         42 cookies   stored 2h ago
  ✓ mycompany.atlassian.net     18 cookies   stored 1d ago
  ✗ app.hubspot.com             not exported

OAuth:
  ✓ Google (Drive, Gmail, Cal)  expires in 47m

API Keys:
  ✓ gong
  ✓ datadog
  ✗ pagerduty                   not configured

This is the first thing the AI checks before any task. It must be machine-parseable so the AI knows what’s working and what to tell the user to fix.
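One way to keep the status both human-readable and machine-parseable is to build it as data first and render the text view from that, so `--output json` can emit the same structure. A Python sketch; the record shape and function names are my assumptions:

```python
import json

STATE_MARK = {"ok": "✓", "missing": "✗", "expired": "✗"}

def auth_status(services: list) -> dict:
    """Aggregate credential health; `ok` is True only if every service is usable."""
    return {"ok": all(s["state"] == "ok" for s in services), "services": services}

def render_status(status: dict) -> str:
    """Human view of the status; `--output json` would dump the dict instead."""
    lines = []
    for s in status["services"]:
        lines.append(f"  {STATE_MARK[s['state']]} {s['name']:<28} {s['detail']}")
    return "\n".join(lines)
```

The AI checks `status["ok"]` to decide whether to proceed; a human reads the rendered lines.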

Step 2: Chrome Extension

Manifest V3 (not V2 — deprecated). Use "<all_urls>" in host_permissions so the extension can export cookies for any domain without per-site manifest changes. User installs via chrome://extensions → Developer Mode → Load Unpacked.

Popup shows a list of all previously exported domains with cookie count and age (e.g., “slack.com — 42 cookies, 2h ago”). The current tab’s domain is highlighted at the top with an “Export Cookies” button. On click: chrome.cookies.getAll() for the domain and parent domains, POST to bridge server.

Step 3: HTTP Client

A single HTTP client struct with two constructors: one that loads cookies from disk (for cookie-auth services), one that takes a bearer token (for OAuth/API-key services). On 401/403, print a remediation message telling the user exactly what to do.
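Sketched in Python (the post builds this in Rust; `ReachClient` and its method names are hypothetical), the two constructors reduce to two ways of populating one headers map:

```python
import json
from pathlib import Path

class ReachClient:
    """One client type; only how it authenticates differs per service."""

    def __init__(self, headers: dict):
        self.headers = headers

    @classmethod
    def from_cookies(cls, cookie_file: str) -> "ReachClient":
        # Cookie-auth services: join the exported cookie jar into one header.
        cookies = json.loads(Path(cookie_file).read_text())
        jar = "; ".join(f"{c['name']}={c['value']}" for c in cookies)
        return cls({"Cookie": jar})

    @classmethod
    def from_token(cls, token: str) -> "ReachClient":
        # OAuth / API-key services: standard bearer header.
        return cls({"Authorization": f"Bearer {token}"})

    @staticmethod
    def remediation(service: str, status: int) -> str:
        """Tell the user exactly what to do on an auth failure."""
        if status in (401, 403):
            return (f"{service}: got {status}. Re-export cookies from the "
                    f"extension or rerun `./reach {service} setup`.")
        return ""
```

Every service module then takes a `ReachClient` and never worries about which auth category produced it.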

Step 4: Service Modules

One module per service. Each has subcommands (reach {service} search "query", reach {service} show ID, etc.) and supports --output text|json|markdown.

To find API endpoints for cookie-auth services: open the service in Chrome DevTools Network tab, perform a search, and look at the XHR requests.

Gotcha: some services embed a CSRF/session token in page HTML alongside cookies. If cookie-auth requests return 401 with valid cookies, fetch the page first, extract the token (look in <meta> tags, inline scripts, or window.__CONFIG__), then include it in API calls. Slack’s xoxc- token is the famous example but many services do this.
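A sketch of that fallback extraction in Python; the patterns below are illustrative examples of the three hiding spots named above, not a complete list, and any real service needs inspecting in DevTools first:

```python
import re

# Common places services stash a session/CSRF token in page HTML.
TOKEN_PATTERNS = [
    re.compile(r'<meta\s+name="csrf-token"\s+content="([^"]+)"'),   # <meta> tag
    re.compile(r'"api_token"\s*:\s*"(xoxc-[^"]+)"'),                # Slack-style inline script
    re.compile(r'window\.__CONFIG__\s*=\s*\{[^}]*"token"\s*:\s*"([^"]+)"'),
]

def extract_embedded_token(html: str):
    """Return the first embedded token found in the page, or None."""
    for pattern in TOKEN_PATTERNS:
        m = pattern.search(html)
        if m:
            return m.group(1)
    return None
```

The CLI would fetch the service's landing page with its cookie jar, run this over the body, and attach the token to subsequent API calls.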

For existing CLIs (gh, sf, aws), just shell out and parse the JSON output.
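That wrapper can be a few lines. A Python sketch; it assumes the target CLI supports a JSON output mode, as `gh --json` and `aws --output json` do:

```python
import json
import subprocess

def run_cli(argv: list):
    """Shell out to an installed CLI and parse its JSON output."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

For example, `run_cli(["gh", "pr", "list", "--json", "number,title"])` would return parsed PR records, assuming `gh` is installed and authenticated; `check=True` turns a nonzero exit into an exception the caller can surface as a remediation message.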

Example output for ./reach linear search "auth bug":

LINEAR  3 results for "auth bug"

  ENG-4521  [Bug] Auth token refresh fails on expired sessions
            Assignee: Maria Chen  Status: In Progress  Updated: 2d ago
            https://linear.app/myco/issue/ENG-4521

  ENG-4499  [Bug] OAuth callback drops state parameter
            Assignee: James Wu    Status: Done         Updated: 1w ago
            https://linear.app/myco/issue/ENG-4499

  ENG-4312  Auth bug in mobile SSO flow
            Assignee: —           Status: Backlog      Updated: 3w ago
            https://linear.app/myco/issue/ENG-4312

Compact, scannable, with direct links. Not raw JSON. The --output json flag gives raw API results for piping; --output markdown gives the same content formatted for artifact documents.

Step 5: The Research Command

Searches all configured sources in parallel for a single topic.

reach research "topic" [--sources svc1,svc2] [--depth quick|normal|exhaustive]

Spawns one thread per source. Each returns results in a common shape: source, title, url, snippet, author, date. After all threads complete, cross-references to find key people (who appears across sources), key locations (which channels/projects/repos), and builds a chronological timeline.

Must be parallel, not sequential. Must handle individual source failures gracefully — log it, skip it, continue with the rest.
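The fan-out and cross-referencing can be sketched in Python (a thread pool in place of "one thread per source"; `research` and its return shape are my assumptions about the real command):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def research(topic: str, sources: dict) -> dict:
    """Query every source in parallel; a failing source is recorded and skipped.

    `sources` maps a source name to a callable returning result dicts in the
    common shape: source, title, url, snippet, author, date.
    """
    results, failures = [], {}

    def query(name, fn):
        try:
            return name, list(fn(topic)), None
        except Exception as exc:  # graceful: log it, skip it, keep going
            return name, [], str(exc)

    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        for name, items, error in pool.map(lambda kv: query(*kv), sources.items()):
            if error:
                failures[name] = error
            else:
                results.extend(items)

    # Cross-reference: who appears across sources, newest items first.
    people = Counter(r["author"] for r in results if r.get("author"))
    timeline = sorted(results, key=lambda r: r["date"], reverse=True)
    return {"results": results, "failures": failures,
            "key_people": people.most_common(), "timeline": timeline}
```

Because each source failure is caught inside its own worker, one expired session shrinks the report instead of aborting it.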

The Markdown artifact for ./reach research "auth migration" should look like:

# Research: auth migration

Searched 5 sources in 3.2s (4/5 succeeded, HubSpot: session expired)

## Key People
- **Maria Chen** — 12 mentions (Linear: 5, Slack: 4, Confluence: 3)
- **James Wu** — 8 mentions (Slack: 4, GitHub: 3, Linear: 1)

## Key Locations
- **#eng-platform** (Slack) — 9 results
- **Auth & Identity** (Confluence space) — 4 results
- **myco/auth-service** (GitHub) — 3 results

## Timeline
| Date | Source | Item |
|---|---|---|
| Feb 24 | Slack | Maria in #eng-platform: "auth migration cutover Thursday" |
| Feb 23 | Linear | ENG-4521: Auth token refresh fails (In Progress) |
| Feb 21 | Confluence | "Auth Migration Runbook v2" updated by Priya Sharma |
| Feb 20 | GitHub | PR #847 merged: "Migrate OAuth provider" by James Wu |
| ... | | |

## Slack (9 results)
[results with snippets and links]

## Linear (6 results)
[results with status, assignee, links]

## Confluence (4 results)
[results with author, last-updated, links]

## GitHub (3 results)
[results with PR/commit info, links]

---
*Slack ✓  Linear ✓  Confluence ✓  GitHub ✓  HubSpot ✗ (session expired)*

This artifact gets saved to artifacts/ and rendered to HTML.

Step 6: AI Context Layer

This step creates files that your AI coding assistant reads as reference documentation. You are writing docs, not executable code.

.cursorrules (or CLAUDE.md for Claude Code):

# Reach — Personal Research CLI

Local CLI for searching [Slack, Linear, Confluence, Google Drive, Gong].

## Before Any Task
Run `./reach auth status`. If any service shows ✗, tell the user what to fix.

## Commands
./reach research "topic"                     # Search ALL sources
./reach research "topic" --depth exhaustive  # Deep search with pagination
./reach slack search "query"                 # Slack messages
./reach slack thread "permalink"             # Full thread
./reach linear search "query"               # Linear issues
./reach linear issue ENG-1234               # Specific issue
./reach confluence search "query"            # Confluence pages
./reach gdrive search "query"                # Google Drive
./reach gong search "query"                  # Gong transcripts
./reach auth status                          # Check credentials

## Output
Save research to artifacts/. After writing any .md file there,
run `bin/render <file>` to render as styled HTML and open in browser.

## Rules
- Run independent CLI commands in parallel for speed.
- Never read or print files under ~/.config/reach/.

AGENTS.md — a skill registry listing each skill with a trigger condition and file path:

## Skills
### Available skills
- reach-slack: Search Slack messages and threads. Use when the user asks
  about Slack conversations. (file: .cursor/skills/reach-slack/SKILL.md)
- reach-linear: Search Linear issues. Use when the user asks about tickets
  or bugs. (file: .cursor/skills/reach-linear/SKILL.md)
- reach-research: Cross-source research. Use when the user asks to research
  a topic. (file: .cursor/skills/reach-research/SKILL.md)

### How to use skills
- If the request matches a skill, read that SKILL.md first.
- Only load skills relevant to the current task.

.cursor/skills/{name}/SKILL.md — one per service. Example:

---
name: reach-slack
description: Search Slack messages and retrieve threads.
---
# Slack

## Commands
./reach slack search "deployment issue"
./reach slack search "from:@maria.chen in:#eng-platform after:2025-02-01"
./reach slack thread "https://myco.slack.com/archives/C1234/p5678"

## Search Modifiers
- from:@handle — messages from a person
- in:#channel — messages in a channel
- after: / before: — date range
- has:link — messages with URLs

## Gotchas
- Search returns parent messages only. Use `thread` to see replies.
- Results ranked by relevance, not date. Add date modifiers for recency.

.cursor/rules/ — short .mdc files for cross-cutting constraints:

  • tool-preferences.mdc: prefer the CLI over direct API calls; run commands in parallel.
  • security.mdc: never read or access paths under ~/.config/reach/; never output tokens.

Step 7: Artifact Rendering

bin/render: a script that converts Markdown to styled, self-contained HTML with dark/light mode support and opens it in the browser. Use cmark-gfm or any GFM renderer. Output to artifacts/.html/. Gitignore both artifacts/ and artifacts/.html/.

Step 8: Build System

A Makefile with: all (deps + build + start bridge), start (bridge, idempotent), stop, status, extension (print install instructions), clean. Symlink the binary to ./reach at the repo root.

Failure Modes to Avoid

  • Keychain is for tokens, disk is for cookies. OAuth refresh tokens and API keys go in the Keychain. Cookie files go on disk with 0o600 permissions — same sensitivity as your browser’s cookie jar, but the Keychain prompts on every access which breaks automation.
  • Do add CORS to the bridge server. Access-Control-Allow-Origin: * + OPTIONS. Without it the extension fails silently.
  • Do use Manifest V3. V2 is deprecated.
  • Some services need tokens beyond cookies. If cookie-auth returns 401, look for embedded CSRF/session tokens in page HTML.
  • Do run research in parallel. Sequential is unusably slow.
  • Do handle source failures gracefully. One expired session must not abort the entire search.
  • Skill files are documentation. They teach the AI what CLI commands exist. No executable logic.

The Reach Pattern — because the best tools are the ones that meet you where you already are.

If you read this far on your own, welcome to the party. Software has never been softer.

Good Bosses, Bad Bosses
https://jackdanger.com/good-bosses-bad-bosses/
Thu, 08 Jan 2026 00:00:00 +0000

How it works:

  • Bad bosses break feedback loops
  • Great bosses deliberately manufacture them

I’ve worked for both. I’ve been both. The framing above is how I try to not be the boss people warn their friends about.

Early in my career, I assumed bad bosses must be immature, unskilled, or morally compromised. I now believe all it takes to be a bad boss is to break feedback loops about your own behavior.

Specifically, a bad boss is someone who can’t metabolize critical feedback about themselves. Respond poorly to feedback just once, and the feedback stops permanently.

People learn very quickly what is safe to say to their leadership. When telling the truth comes with consequences, the truth disappears. From that point on, the leader can’t tell whether problems come from weak execution, strategy, or themselves. They’ve lost the ability to diagnose anything else.

Disagree with me? Here’s a pull-quote. You can’t argue with a pull-quote.

The moment a leader loses feedback about themselves, they’re still steering, but they’re flying blind.

Good Bosses, Bad Bosses diagram

I once onboarded a new executive to lead my org. My peers and I were excited at first — this new person said all the right things about leadership. Then, in their first weekly meeting with us, they spent the entire hour explaining their personal definition of KTLO.

The backchannel snickered. New leader, nerves, it’ll be okay.

The following week, they did it again. Another full hour. No work happened.

This could have stayed funny and temporary. But when people gently tried to redirect — “The company already has a formal KTLO definition?” or “Uh, what are our quarterly goals?” — those people got their hands slapped.

That’s the moment feedback dies. Not because the leader is wrong, but because being right becomes unsafe.

Keeping your job can depend on knowing exactly how much truth your leader is willing to hear — and respecting it.

A friend of mine once had a boss tell her, “Never correct me in a meeting, even if I’m wrong.”

Alright. The line is clear. It’s unhinged, but it’s clear.

Once that line is visible, behavior adapts instantly. People stop offering signal and start managing their exposure. The work doesn’t necessarily get worse, but the leader’s access to it does.

From there, the organization slowly fills with second-order behavior: agreement instead of insight, alignment theater instead of problem-solving, silence where judgment should be. The leader is still making decisions, but now they’re doing it without access to reality.

I got into management partly to replace leaders I thought weren’t doing it well. Becoming another version of the problem would be the most ironic career failure for me.

Avoiding that outcome is a low bar: don’t destroy your access to feedback.

So what if I do the opposite and maximize feedback?

What Great Bosses Do Differently

Great bosses take the same idea — protecting feedback loops — and apply it deliberately and at scale.

A year ago, I hired two managers to report to me who were each significantly stronger than I am, in different ways. One is the strongest line manager I’d ever worked with. The other was exceptional at product leadership.

This is a great problem to have; I highly recommend it. It’s still a problem.

I became responsible for supporting people whose abilities outstrip mine. I needed to help them grow but couldn’t just hand them my job as a promotion (there are two of them, so I very literally could not).

I made a decision then to maximize their success — at the company and in their careers — without letting my own limitations be the bottleneck. To do that, my job had to change.

My management stopped being about my judgment, my advice, or my answers. It became about crafting opportunities, building durable communication channels, and creating structures that exposed them to feedback they would otherwise never receive.

I’m going to say that again in a pull-quote, because then it’s definitely true:

Management stops being about judgment the moment your report becomes better than you.

I can protect my reports from turning into bad bosses by making systems that expose them to feedback. And, because I have positional authority, I can crank up the volume on that feedback really high and combine it with my own coaching and emotional support.

The job becomes a form of environmental design.

Tactically, here’s my process:

  1. Identify their strengths (and the inherent weaknesses that come with each)
  2. Set goals that are meaningful to the company and crafted to their strengths
  3. Construct feedback loops that connect their work to its impact
  4. Give a damn about them (so your feedback is trusted)
  5. Pay attention and give direct personal feedback on the work

Identifying strengths

This is not just skills — it’s the overlap of ability, knowledge, and genuine interest. Interest matters more than we admit. People don’t just perform better at what they enjoy; they learn faster. When someone grows in a direction that excites them, they’re happier, and we get more for every dollar of salary spent.

It also requires honesty about weaknesses — treating weaknesses not as personal flaws, but as design constraints. The goal isn’t to turn everyone into a generalist, but to build a team whose strengths cover for one another so no one is forced to lean on their weakest reflexes.

Setting real goals

This is where strong managers earn their keep. There’s just no shortcut; we gotta dig in and work the problem. Assigning goals and forcing someone to commit is easy and flimsy. That’s how we manage a vendor. Leading a team means shaping work around people, adjusting plans when the fit is wrong, and sometimes even applying backpressure on the roadmap.

Feedback loops and visibility

People need to see whether they’re winning. We set (or help them set) clear, achievable goals that deliver incremental wins.

Progress should be legible without a status update (I hate status updates — nobody is listening, people). When the signal is clear, teams self-correct. When it isn’t, you end up managing narratives instead of results.

Give a damn

Hopefully you got into management because this is the part you like. Awesome. One important note: If you feel positive vibes toward someone and you do not interfere when they’re working below their max potential, you don’t give a damn about them. You just give a damn about them thinking you do.

So not just vibes. Not performative vulnerability. Actually respecting someone’s experience. Remembering they are a human being whose life does not begin or end at work. Then tracking them toward success, even if they don’t love all of how you do that.

Pay attention

Great bosses give feedback on the work itself—especially when the person doing the work is more capable than they are. In those moments, feedback must be just that. Not advice. Not judgment. Not control.

If you hesitate to give feedback, use a trick to strip your interpretation out of it. Start your sentence with “From where I was standing, it looked like …” and then say what it looked like.

Yes, you are greedy for pull-quotes. Fine. One more.

When you lead people stronger than you, feedback should feel like a mirror, not a steering wheel.

Why This Actually Works

One last thing that should be obvious, but often isn’t: your reports are optimizing, just like you are.

Every day, they allocate time and energy to maximize personal return. Founders work brutal hours because they have autonomy and massive upside. Employees work hard when the work helps them grow, when they’re improving at something that matters to them, and when the feedback loop is tight enough to feel progress.

They are not working harder because they believe extra effort magically turns into company success and then into personal gain. That math doesn’t work, and they know it.

The process I’ve outlined above reliably leads strong people to the same conclusion: that working hard with you, here is the fastest way to grow and have impact.

A Pyramid-shaped Career
https://jackdanger.com/pyramid-shaped-career/
Sat, 05 Apr 2025 00:00:00 +0000

I’ve heard the worst thing that can happen to you in Vegas is winning big the first time you go. You may spend everything trying — and failing — to recreate your early success.

Getting a high-status role early in your career can limit you in just the same way. You start to associate success with status, and anything that looks like a “step down” feels like losing ground or going backward.

Fear of stepping “down”

I’ve been interviewing candidates in director and higher roles for senior EM positions. Some plainly admit they worry about taking a lower title. Others subtly keep their status centered in the conversation, trying to manage my perception.

This fear of losing ground keeps candidates from awesome jobs — ones where the team, the challenge, and the learning curve are exactly what they need.

I understand that fear because I have it too. But the industry is shifting.

As leadership roles consolidate and expectations for hands-on impact increase, climbing the ladder is both less likely and a trap.

Growth mindset vs. role-seeking

The candidates who have navigated down-titling best seem to have a different orientation: They’re focused on how they want to grow next, not what title they want to hold.

My most powerful interview question is simple: What skills or knowledge do you want to develop next? I rephrase it a few times to make it clear I’m asking about growth. Any candidate who really understands the question passes. Many fail: They tell me what role they want to be given or what activity they want to do. They use it as a chance to tell me where their existing strengths lie.

The best candidates tell me what they haven’t had a chance to learn yet. Several of my recent hires left VP or Principal IC positions. When I asked about growth, they named specific capabilities they want to increase: technical depth, cross-functional influence, mentoring, planning, navigating bureaucracies, etc. They didn’t mention “leading larger orgs” except while pointing out how they’re not yet up to the task.

Towers vs Pyramids

Chasing status too early in your career can backfire.

Sometimes the company grows around you, and you get lifted into a key role—valuable internally, but with skills that don’t translate outside that company or its immediate sector. You’re high up in one particular structure, but it’s narrow. Try to move sideways and you’ll plummet.

The alternative career path is slower at first. You don’t sprint toward short-term impact and status — you develop yourself. You pick up breadth. You grow into highly-leveraged roles but move back down to get more reps at the foundational work.

This builds your career like a pyramid – still tall but with a wide base that enables any senior role you want to grow into, without starting from scratch. You’ll overlap with more of your cross-functional peers, synthesize multi-domain concepts much faster, and have options that some people can only dream of.

If you’re trying to get rich and exit your industry quickly, this is a terrible strategy. If you’re trying to build a satisfying, resilient, multi-decade career, then building a pyramid-shaped one will serve you far better.

What pyramid-shaped careers look like

Some of the most effective leaders I work with have started companies, moved between engineering and product and design, worked the phones in sales, led analytics, ran the intern program — the list goes on.

And there’s one quality I’ve learned to look for in engineers and engineering managers more than any other: moving into and out of management, hopefully fluidly.

The strongest engineers I know see the business and teams through a manager’s eyes. And the most satisfying engineering leader to work with is one who has that hard-fought ability to keep pace in technical conversations with even the most senior of their ICs.

This cross-training isn’t strictly necessary for either role, of course, but in this post-ZIRP market the folks who seem to be having the most fun are the ones who can move seamlessly between roles and consistently deliver. Look at their background, and it’s obvious: They laid a broad foundation years ago.

As of this writing, the stock market just dropped, interest rates remain high, and AI is changing everything. The best, lowest-risk career strategy remains the same: build a broad pyramid of skills that prepare you for nearly anything.

Big Bets
https://jackdanger.com/big-bets/
Thu, 12 Sep 2024 00:00:00 +0000

A ‘Big Bet’ is a rapid push into a new market space, typically championed by an executive and led by experienced engineers. While the term ‘bet’ suggests a risk, these initiatives often fail and come with hidden costs, like draining morale and neglecting other key products.

I witnessed multiple big bets at Square and had the misfortune of being part of a couple. I worked on the Square Wallet team, which was Square’s first attempt to create a B2C product. Later, I got picked to implement the doomed “Square 275” feature where we charged merchants a flat fee for unlimited credit card processing. Both were big bets mandated by the CEO, bypassing the best practices of product research.

I saw the same pattern at Gusto¹ when we tried to expand from B2B into B2C with features that employees would appreciate. My CTO once asked me to lead “Modern Bank”, a flexible payment schedule for employees. I recognized the signs of a big bet and politely declined, exaggerating how much I was needed in my current role.

Anything to avoid joining a big bet.

The properties of a Big Bet

You can spot a big bet by its three qualities:

  1. The executives talk as if there’s only execution risk, not strategic risk
  2. It’s staffed as a mature product, even in its earliest stages
  3. It’s important to the larger company strategy

This is a terrible project to lead. Sure, you might get lucky. More likely you’re about to spend a lot of company money failing directly in front of your executives on something they’re convinced can only fail if the execution sucks. Which is you.

The problem with Big Bets

All product development is risk, but we have ways to reduce that risk. We’ve built a fierce and exhausting discipline among product leaders to do just that: Business strategy that informs product strategy, supported by market research, product research, user research, UX research, prototyping, and financial modeling.

A big bet skips steps in the Product Maturity Lifecycle. If funding something as if it will be successful made it so, Y Combinator would write checks for $100M to everyone.

Big bets often try to target customers the company has little experience with (or legally can’t sell to). For example, Square excelled at serving sellers but the Square Wallet targeted buyers². And Gusto’s employee features relied on using the employer relationship as a mediator, awkwardly triangulating Gusto between an employer’s needs and their employees'.

Sometimes, a big bet is a solution to a poorly-diagnosed problem. When Square was a tool for garage sales, Jack Dorsey’s product intuition was stellar. But as Square moved up to brick-and-mortar stores his intuition didn’t work as well. The underlying problem was that a key company resource (the CEO’s intuition) was no longer sufficient. There are cheaper ways to address this than redirecting the product roadmap to leverage his gut fully.

In 2018, Gusto’s leaders didn’t have an answer they liked to “why aren’t you innovating in the payment space?” and their roadmap of US payroll features was moving very slowly. The solution to a sluggish roadmap is usually to extract an internal product platform that accelerates feature development, not a side quest into territory where the company is unlikely to win.

Big bets can also be a covert tactic to retain pet engineers. If someone is a flight risk and executives want to keep them engaged they sometimes end up on (or leading) a big bet that’s off to the side of the product roadmap. This feels exciting at first — a startup-within-a-startup means you get to leave your job but keep vesting the equity, right? In reality, this divides the engineering team. One group is left struggling with under-resourced real products while the other is set up to fail in the spotlight.

What to do instead of Big Bets

Maybe you’ve seen a ‘big bet’ pay off but as I tried to think of any I came up with a list of regular successful products. Square’s customer management features and then CashApp, Gusto’s time tracking features and state-level agency integrations. The longer my list got, the more I realized that every successful product I could remember had the same, boring, wonderful pattern:

  1. They explored a market opportunity with a prototype.
  2. A product owner championed it with clear milestones, metrics, and reasonable staffing.
  3. It only matured in budget and executive attention as it grew in maturity.

In other words, modern product development done by professionals.

Reflecting on this I realize how allergic I am to ‘big bets’. If something is worth staffing then there must be some articulable plan for how it’ll address each of its strategic risks. And if the plan doesn’t work out we needn’t throw good money (and senior engineers) in after bad.

Avoid the other extreme, too

I should mention that the polar opposite of a big bet is just as bad. It’s a project that cannot proceed until every piece of it is proven. Leaders block forward progress waiting for better data, for every hypothesis to be proven, for exhaustive market testing before trying anything real. It sounds safe but, just like an overly ceremonial development process, it’s too easy to forget that teams either ship or die. And the same goes for companies.

One way to stay somewhere in the middle of these extremes is to measure whether real product value is actually getting shipped.


  1. I only shared big bets I personally saw but there are two others worth mentioning:

    • When Mark Zuckerberg declared that the metaverse was the future. Facebook has the best product discipline in the industry but Mark bypassed that this one time and $13B later he got nothing out of it.
    • My buddy who was a manufacturing engineer at Tesla (before Musk bought the ‘founder’ title) told me that Elon attempted to launch the Model 3 in 9 months instead of the industry standard 2 years. It ended up taking 3 years. There’s a process here.
     ↩︎

  2. Not until CashApp did Square crack the B2C nut, and even then it was a hack week project that slowly grew with Brian Grassadonia’s careful product leadership ↩︎

]]>
Managing Underperformers https://jackdanger.com/managing-underperformers/ Tue, 02 Jul 2024 00:00:00 +0000 https://jackdanger.com/managing-underperformers/ <p>Kind managers address underperformance early and accurately.</p> <p>Underperformance is when a person or a team is not bearing their share of the organization&rsquo;s load. Their colleagues are either relying on them and getting let down, or they&rsquo;ve learned not to rely on them at all. There are two fully unrelated causes of underperformance: <strong>Refusal to Align</strong> and <strong>Failure to Execute</strong>.</p> <p><img loading="lazy" src="https://jackdanger.com/managing-underperformers/managing-underperformers.png" type="" alt="" /></p> <h1 id="refusal-to-align">Refusal to align</h1> <p>Every person I&rsquo;ve fired, both ICs and managers, refused to align their goals with the company&rsquo;s. They were well-intentioned and often highly capable but pursued their own direction.</p> Kind managers address underperformance early and accurately.

Underperformance is when a person or a team is not bearing their share of the organization’s load. Their colleagues are either relying on them and getting let down, or they’ve learned not to rely on them at all. There are two fully unrelated causes of underperformance: Refusal to Align and Failure to Execute.

Refusal to align

Every person I’ve fired, both ICs and managers, refused to align their goals with the company’s. They were well-intentioned and often highly capable but pursued their own direction.

This might sound like a small problem but it’s fundamentally a question of values. Does this person want to be a functional part of this team? Or is this team an excuse to work on some cool technology or their pet project? It’s impossible to coach someone into different values — all we can do is set boundaries and let the person decide if they’d rather align or leave.

I was an underperformer in this way in 2018 when I led Infrastructure at Gusto. I thought I’d been hired to improve both Platform and Infrastructure but my CTO wanted me to stay below the Infrastructure Gravity line.

I worked late nights to improve the platform because I was sure it was important — even after my CTO encouraged me to stop. I had to relearn that it’s better to fix things slowly with trust than quickly without.

I share this to illustrate that competent people can underperform in this way. I’ve even come to expect it when someone moves from a big company to a small company for the first time — they’ll often do just what I did and try to solve problems they alone believe are critical.

Addressing Refusal to Align

The first time I managed someone who refused to align I put them on a 60-day PIP¹. They did better and I declared the PIP a success. One week later complaints about them resumed as the individual returned to their old ways. I fired them shortly after.

The second time I did a 30-day PIP.

With each subsequent instance I’ve acted faster. I now believe that, if I understand the dynamics at play, a leader only really needs a week to work with the person and see if success is possible.

A one-week timeline

Here’s what I’ll do: I’ll meet with them on Monday and explain precisely what I need. I’ll ask them to convince me — and their colleagues — by that Friday that they understand the problem and are changing direction. These are the kind of hard, short conversations where I have to show up as a leader, not a friend; their job and their colleagues’ morale are on the line.

If they’re willing to realign it’s pretty easy for them to prove it. Success here is a message to the broader team saying “I know I’ve been focusing on implementing my new datastore while you’ve all been solving actual problems. I’m sorry about that and I’m going to shelve it until we all agree it’s a priority.”

If they don’t want to do it they can run out the clock in many imaginative ways. I recommend not letting it drag out.

Managerial failure modes:

  • Treating this as a skill gap as if training will help
  • Waiting until the report agrees with you about the problem
  • Assuming the cost of doing nothing is limited to paying that one person’s salary

Failure to Execute

The other kind of underperformance is the kind we normally think about: Missed deadlines, low-quality work, lack of throughput, no progress without handholding, disengagement, and unresponsiveness.

There are many complex reasons someone might not execute and it’s important to not jump straight to “they’re bad at the job.” Any one of us will struggle with execution when we’ve been handed rapidly-changing or poorly-defined goals, so in many cases the fix has to happen at the managerial layer.

I therefore look for the cause in roughly this order:

Do they know what’s expected of them?

Clear goal setting is very hard. When in doubt, have someone repeat back to you what they believe is expected of them and by when.

Remediation: In your 1:1 doc with this person create SMART² goals and regularly check on progress.

Have they learned helplessness?

Looking at this person’s past projects, consider whether the projects were impossible. They may have been underfunded, undercut by cross-functional surprises³, poorly scoped, improperly staffed, etc.

In the extreme form this becomes burnout. My favorite definition of burnout is “being unable to make progress on something meaningful that’s in line with our values.” If you were to apply current to an electric motor but hold the rotor still it would catch fire. People burn out in similar ways – if they can’t move forward they’ll just heat up.

Remediation: Identify a small upcoming win and tightly coach this person into and through that success. Once they’ve won at something you’ll uncover if this is the blocker.

Do they have the skill for their current project?

They might be an amazing platform engineer but perhaps they’re on the Growth Eng team and they’re bad at building UIs. There are many specialties within engineering and not all skills transfer.

Remediation: If the report lacks the skill for their current project then either move them off the project or assign them a dedicated mentor for the duration.

Have they received literally any feedback that they’re off track?

If the report has been operating in an information vacuum then meet weekly (or more) with them for 15 minutes to each share what you believe is happening and how it’s going. Synchronizing their view of reality with yours will either solve the problem or reveal a deeper one.

Crises in their personal life?

We all go through hard times and that’s okay. As long as a person’s colleagues know what they can and can’t expect then lightening someone’s load has no negative effect on team morale (in my experience quite the opposite).

Remediation: Put this person on important but not time sensitive projects and tightly watch that they’re making some progress. If not, switch them to a simpler project.

Are they subject to a harmful managerial relationship?

We all hope the managers who report to us are kind and respectful at all times but they’re just people. Nothing cuts through a person’s ability to execute like the belief that their manager doesn’t have their back. If you have a skip-level report who doesn’t trust their manager you may need to enlist HR but there are some simpler options to try first.

Remediation: Meet with this person weekly for four weeks to see if you gain more insight. Consider simply moving them to another team. Coach the manager as needed.

Getting back to winning

In practice underperformance comes from a mix of the above reasons. The one remediation that I find most useful is getting the report (and their team, and their manager) to experience victory. The sensation of winning solves a lot of problems.

My primary tool for doing this with underperformers is a silly little trick derived from the Theory of Change⁴: I meet with the person who’s underperforming and make sure the goals are clear. Then I ask them to make a prediction about what will be true at a specific moment in the future.

This trick is so simple and has saved me and my teams so much grief. I have the person tell me what they believe they’ll have accomplished in one week, one month, etc. I prod for as much detail as they can give me and I write it all down in our private 1:1 doc. Then I create a calendar invite for us at that moment in the future and on that day we re-read their prediction together. It’s usually hilariously far off from what happened. At this point we try to make new, more accurate predictions and suddenly it’s not me versus them it’s the two of us shoulder to shoulder against a problem.

When a report doesn’t take this process seriously I’ve learned to associate that more with a refusal to align. If someone doesn’t actually want to deliver for their team that’ll be apparent here.

Seeing this at the team level

Underperforming teams look much like underperforming individuals.

Failure to execute

Failure to execute shows up as missed deadlines, low-quality work, disengagement, etc.

Remediating a team is harder than an individual because the manager is individually underperforming (or else you wouldn’t have to get involved at all) and, additionally, individual members of the team may require personalized help.

The first step is to see if the manager can even perceive the problem. If so, develop a plan and figure out why they hadn’t raised it earlier. If not, you’ll need to take over for this manager temporarily and directly lead the team until they’re on track.

If the manager works alongside you (without ego) on a fix then you’re about to get a hugely upleveled manager. If not, this person might be more functional in an IC role.

Refusal to align

Refusal to align at the team level looks much like when an individual refuses to align: The team works on pet projects, has an us-versus-them mindset, or goes dark periodically. Teams get misaligned when their manager is misaligned so it may be that the manager is using the team for some purpose other than what the company needs.

As of this writing I’ve witnessed DataEng teams at two companies where the reports wanted to work on what’s best for the business but their manager pushed them to roll out an interesting (and unnecessary) technology. In both cases removing the manager let the team quickly realign with their colleagues.


  1. A “Performance Improvement Plan” which actually does sometimes work, but most of the time it’s a formality ↩︎

  2. Specific, Measurable, Achievable, Relevant, and Time-Bound ↩︎

  3. If you think an executive team changing major goals every month is rare I am so happy for you. ↩︎

  4. https://www.theoryofchange.org/what-is-theory-of-change/ ↩︎

]]>
Technical Coherence https://jackdanger.com/technical-coherence/ Mon, 22 Apr 2024 00:00:00 +0000 https://jackdanger.com/technical-coherence/ <p>Software development slows down over time.</p> <p>I wrote <a href="https://jackdanger.com/books/executive-engineering">a whole book</a> to help leaders reverse this slowdown and the central point of the book is a process any engineering leader can apply.</p> <p>I call this process <strong>Technical Coherence</strong> and you can mostly achieve it in a single meeting with your leaders. You can implement it in your org gradually or all at once.</p> <p>The central idea is this:</p> <ol> <li>We identify the necessary <strong>user experience domains</strong> for our products</li> <li>We identify the <strong>shared product domains</strong> that underpin multiple user experience domains</li> <li>We organize engineering into <strong>three layers</strong>, the top two correspond to the above domains and the third provides infrastructure.</li> </ol> <p><img loading="lazy" src="https://jackdanger.com/technical-coherence/Technical-Coherence-a-UX-driven-approach.png" type="" alt="" /></p> Software development slows down over time.

I wrote a whole book to help leaders reverse this slowdown and the central point of the book is a process any engineering leader can apply.

I call this process Technical Coherence and you can mostly achieve it in a single meeting with your leaders. You can implement it in your org gradually or all at once.

The central idea is this:

  1. We identify the necessary user experience domains for our products
  2. We identify the shared product domains that underpin multiple user experience domains
  3. We organize engineering into three layers, the top two correspond to the above domains and the third provides infrastructure.

How to design an engineering org

There are some hard questions that every engineering leader I know struggles with (myself included):

  • What proportion of engineers should work on infrastructure versus product?
  • Should we pay engineers in infrastructure more? Less?
  • Are there multiple hiring bars for teams working on different technologies?
  • How do frontend infrastructure teams and backend infrastructure teams interact?
  • Is Data Engineering a part of Engineering? How about Data Science?
  • What is the ideal relationship between Security Engineering and Product Engineering? And can we get away from the consulting relationship where Security always feels brought in too late?
  • How much do we pay down technical debt and who does it and which debt?

Answering these requires some kind of working theory of how engineering actually functions, otherwise we’re left copying the ratios and comp bands of other companies. No leader should have to be a copycat to answer basic questions about their org.

Knowing your sociotechnical system

Technical Coherence is a theory and a structure to answer these questions about the whole system full of people and software and data.

We start from the outside and work our way in, from the purpose of the product into a technical structure.

This post takes as a given that there are three layers of engineering. Check out Infrastructure Gravity and Domain Engineering if that’s unfamiliar.

Step 1: Identifying UX Domains

The first step to applying Technical Coherence is identifying UX domains. This term comes to us from the field of UX research but it’s similar to “bounded contexts” from Domain-Driven Design. A UX domain is basically a bounded context of a user experience. Or, more helpfully, a UX domain is the whole set of things a person does while they’re in a specific role.

A common UX domain is ‘Onboarding’. A user signs up to your product and attempts to figure it out. Until they become a fully-onboarded user everything they encounter is in the ‘Onboarding’ UX domain.

Some of the internals of the system (both frontend and backend) are only relevant to this UX domain. Others are shared across UX domains.

This distinction is what gives Technical Coherence its power.

Mapping out your UX Domains

Consultants can make money with impressively detailed UML diagrams but in a real team we need something that our colleagues can make immediate sense of. So let’s draw a complex product’s UX domains using just five boxes.

Imagine you offer payroll services in the US. There will be a button somewhere called, perhaps, ‘Run Payroll’. Before the user can click that perhaps their boss needs to go through onboarding to set up the company finances. And before that someone at the company has to go through a flow that converts them from a potential account to a real one. Your internal Operations teams might have to manage the product and your CEO may need insights from a Business Intelligence interface.

In this example, you may have the following UX domains: Conversion, Onboarding, Run Payroll, Operations, and Business Intelligence.

Each domain encompasses a broad set of user experiences — any experience that a user has while they’re in some particular role. Someone accessing the Business Intelligence UX is a decision maker inside the company. Everything they need to observe trends and make decisions, across any number of tools, is a part of this UX.

We start developing Technical Coherence by drawing these UX domains. If it’s an experience we offer then it’s part of the product. We do this because the product has likely been built as if there were just one or two first-class experiences, with others bolted on later. Just because the Finance or Customer Support UX isn’t as urgent doesn’t mean it’s not as important. Your airplane may only need the landing gear to work at the very end of a journey but that doesn’t mean the landing gear is optional. Failing to calculate the financials correctly or failing to support users is as much a risk to the company as an underdeveloped product.

So if your CFO requires some way to download CSVs with correct financial data from the product then that is a non-optional UX domain. Put it in the diagram.

If Customer Support needs a way to reset a user’s password or access user data then Customer Support’s experience is a non-optional domain. Put it in the diagram.

Once we’ve identified all of the UX domains we’re finished with the hard part. And we’re finished with the first of three steps of pursuing Technical Coherence.

Step 2: Identify Shared Domains

The next step is to identify the domains in the product that are shared between UX domains.

The competitive advantage inside the product

I find it easy to identify shared domains because they’re the things engineering talks about the most. Let’s find a couple of examples from the payroll company we’ve been drawing.

We want to identify which domains underpin the payroll company’s UX domains. We’re looking for conceptual areas where there are limited inputs or outputs with a lot of internal complexity.

Like bank integrations.

Somewhere in a payroll system there has to be an encapsulation of the actual movement of money. This involves banking APIs, timing, status codes, encryption schemes, audit logs — all in a financially compliant way.

Or messaging to users.

The Conversion and Onboarding UX domains likely need to contact users. Operations may also need to contact users via email or text and there’s probably some automated email or text message that happens once money moves correctly. We message users multiple ways, for multiple reasons. We can have each product team implement the bare minimum messaging for their individual features but that’s going to result in slow development and a buggy, inconsistent messaging pattern with more overall code size than if we just did it correctly, centrally, once.

So let’s draw our domain chart again. This time we’ll leave out the dependency arrows and draw the UX domains as if they’re the surface of a deeper system. Let’s see how these two domains inside the product (‘Bank Integrations’ and ‘User Messaging’) connect to the surface area of the system.

User messaging happens at various parts of the user lifecycle, triggered by various actions like the user signing up, someone in the Operations dashboard writing to the user, and by automated payroll actions.

Bank integrations get set up during Onboarding, used during Run Payroll, and managed by Operations.

And all of it needs to be visible in the BI interfaces.

As we identify more shared domains and add them to the image the lines connecting UX domains to shared domains become impossibly messy — just a solid sheet of ink. A pattern emerges: The UX domains, as a set, depend on the shared domains, as a set.

We fill in a couple more shared domains and we get a better look.

Our product and design colleagues can help us plan the UX domains with a traditional product roadmap. The shared domains are less visible to non-engineers so they become solely engineering’s responsibility and require a different investment model along much longer timelines.

Step 3: Staffing the Breadth and Depth of Engineering

The 3 layers of engineering are distinct in how they work, how we staff them, who they serve, and how we incentivize the engineers.

Product Engineering

Product Engineering provides features. This is what most people think Engineering does. It’s creating and improving any of the various user experiences that the company offers. Whether the users are external or internal, if someone needs to use part of the product then Product Engineering enables that.

Domain Engineering

Domain Engineering provides what is unique to this company but shared across the company. This is how Technical Coherence delivers product acceleration.

This work is often unstaffed yet it’s extremely valuable. Senior people in both Product and Engineering yearn for more staffing here. The engineers want it because they know it improves the work of every engineer. The product leaders want it because they know it unlocks new roadmap possibilities.

Domain Engineering is the infrastructure for the Product function.

Domain Engineering enables the Product function to develop a more ambitious roadmap.

That sounds pretty hand-wavy, so consider this: When the Product leadership creates a roadmap it’s with an intuition about what’s possible, given historical engineering performance. They look backward at the previous year’s engineering output and then assume roughly the same capacity is available going forward. Product needs Domain Engineering teams to unlock new capabilities in order to make next year’s roadmap more ambitious than this year’s.

Infrastructure Engineering

Infrastructure Engineering provides what any company would need.

This is messaging systems, datastores, repository strategies, testing suites, the CI/CD flow, observability tools, a runtime environment, etc. Infrastructure Engineering delivers the large mass of useful tools that any modern company in the same industry would use. The technology at this layer must be absolutely product agnostic and there must be exactly one pattern for every solution type.

A company allowing a proliferation of duplicate patterns soon finds itself running afoul of the truest principle of development tooling:

“Any development pattern is better than two.” — Me

This will require intense empathy with Product Engineering. Infrastructure engineers have big opinions on how to write software but, thanks to infrastructure gravity, they typically can’t write product software at their current company. And yet they need to provide one pattern for each problem type, ensure Product Engineers love it, and avoid adding a second.

The staffing plan

First we allocate the minimum number of the most skilled Infrastructure people to the Infrastructure layer. This layer, riddled with infrastructure gravity, requires some experience with systems. You can’t afford to let someone invent commodities like a service mesh or generalized persistence abstractions — those problems were solved years ago. The minimum staffing here is also roughly the maximum: You want this staffed enough that the team can succeed and continually rebase the company onto the current year’s best technologies. Any more than that is a costly distraction.

Next, we add as many engineers as possible to the Domain Engineering layer. There aren’t that many people in Engineering who can work at this layer and who are also interested in it. This layer is near-impossible to overstaff because as these teams make improvements even your most junior new hires in Product Engineering become empowered. And the people qualified to work on Domain Engineering tend to enjoy switching back to creating features once the mess is cleaned up, so the risk of starving yourself of feature development is low.

Finally, everyone else should be assigned to an appropriate Product Engineering team supporting a specific UX domain. Staffing domains — both UX domains and shared domains — is somewhat about interest and context so you’ll see individuals fit better in some domains than others.

Learning More

This post is a brief summary of Technical Coherence from Executive Engineering. The book explains much more about the metrics you’ll need to implement this and the planning models you’ll want to use to keep yourself honest.

]]>
Technical Debt Financing https://jackdanger.com/technical-debt-financing/ Wed, 03 Jan 2024 00:00:00 +0000 https://jackdanger.com/technical-debt-financing/ <p><a href="https://jackdanger.com/books/executive-engineering/"><img src="https://jackdanger.com/img/executive-engineering-double-book-cover-rendering.png" style="float: right; width: 150px; margin: 0;"/></a></p> <div style="float: left; width: 79%;"> <div style="text-align: right"> <p style="font-style: italic; color: #AAA: font-size: 1.2rem;">The following is an excerpt from <a href="https://jackdanger.com/books/executive-engineering/">Executive Engineering</a>.</p> </div> </div> <div style="margin-bottom: 1em; display: float; clear: both;"></div> <blockquote> <p> </p> <p>Your task is not to foresee the future, but to enable it.</p> <p>― Antoine de Saint-Exupéry, in The Wisdom of the Sands</p> <p> </p> </blockquote> <p>Technical debt is one of the most powerful tools a CTO has. Your engineers may believe technical debt is bad and holds them back — it&rsquo;s certainly a cause of stress and a time sink — but I believe it only hurts us as an organization when we miss the larger picture.</p>

The following is an excerpt from Executive Engineering.


Your task is not to foresee the future, but to enable it.

― Antoine de Saint-Exupéry, in The Wisdom of the Sands


Technical debt is one of the most powerful tools a CTO has. Your engineers may believe technical debt is bad and holds them back — it’s certainly a cause of stress and a time sink — but I believe it only hurts us as an organization when we miss the larger picture.

The best definition of technical debt is “An obligation for future technical work.” It’s not “bad code” or messy data, though those might comprise it. With that definition, let’s improve on conversations about technical debt which often make two crucial mistakes: They frame the debt as an ergonomic issue that primarily affects engineers (instead of a financial issue that affects the company) and they don’t discuss it in the context of technical investments, interest rates, or ROI.

a technical debt portfolio

It’s not just hygiene

Engineers are the ones who directly experience technical debt, but that doesn’t make it an issue affecting only engineers; it affects the whole company. Your CFO might express worry about a bad contract the company got itself into, but no one would say that’s a problem primarily affecting the CFO. The CFO has merely identified the problem.

Technical debt is a contract that the company has signed. It’s important to carefully pay these debts in order to free up Engineering’s time for other work. The fact that it also reduces engineer annoyance is a bonus, but should not be the primary motivator.

This might seem like a meaningless distinction but it lets us be clear about where the debt comes from: Strategic decisions made by leaders at the company. Technical debt doesn’t come from poor-quality engineering or the work of junior people; it’s a contract the company signs to move fast now and go slow later. Debt appears when the company incentivizes (or allows) technical shortcuts for immediate value payouts. Leadership sets the standards for the work and implicitly communicates which timelines matter most (usually very near-term ones).

So, since technical debts come from leadership decisions, we as leaders can change the debt strategy of the company as we wish.

Interest Rates

If we go ‘fast now’ and ‘slow later’ we should probably make sure the ‘fast now’ is worth it. Technical debt has distinct terms and payment structures in the same way as financial debt. You could, right now, delete all the automated tests at your company. You would get a small but real increase in shipping speed for a few minutes. And then all hell breaks loose.

That’s like taking a 12-hour loan from the mob.

You could also, right now, approve a new backend language to the list of languages your company supports. Depending on the language and the way it’s used this might really help your teams. You’ll have to support it, but perhaps it’s worth it.

This is more like taking out a traditional bank loan.

As we think more clearly about principal and interest rates we’re able to compare debts against each other. This is the first step to deciding which ones to pay down and which to ignore.

We do not want to eliminate all of our technical debt. Maybe someday, when the company issues stock dividends instead of reinvesting in R&D we can take that as a sign that there’s no more useful product expansion and we should pay down our existing debts. Until then, we accrue debts so we can move fast.

We want those debts to be at an extremely low price, both in interest payments and payoff events, so we can use the capital (our newly-unlocked free time) to make big investments. Just like in the world of finance, we can’t make a profit just by taking on debt; we have to purchase something with the liquidity that debt gives us. We make a profit when the thing we purchase is worth more than the debt.

Technical Investments

A technical investment is something that accelerates future work or earns future revenue.

For example, when we make deployments faster we earn a return every time we deploy, forever. When we migrate to a better-supported software framework then we have fewer edge cases that slow us down. When we reduce domain complexity or standardize the libraries we use then we add an accelerant to all future development.

These are all technical investments that reduce the cost of doing our work, and they’re the kinds of classic tech debt initiatives you might see championed by senior engineers on your teams who are frustrated and know that it’s possible to move faster.

Not all investments involve paying down the cost of doing work: Feature development is also an investment – the most obvious one. We create new value out of nothing and then we sell it and earn revenue (directly or indirectly, depending on your business model) forever.

If both cost reduction initiatives (like standardizing libraries) and feature development are investments, then we can prioritize them against each other. We can unify our mental framework of technical debt and technical investment into a more traditional financing portfolio. We’ll look at that framework in detail in Calculating Technical Debt but first let’s review how teams tend to address technical debt and why those attempts usually fail.

Common Pitfalls of Addressing Technical Debt

It’s rare to see a sophisticated technical debt portfolio. Even companies that have a rigorous product research and feature prioritization culture might get a little hand-wavy when talking about debts.

More likely, the discussion of technical debt is limited to listing very annoying, bothersome things that engineers wish they had time to fix, with some of those hardest, most frustrating problems elevated to a kind of bogeyman that even individuals far outside engineering might hear about regularly.

It’s easy to find agreement that technical debt is bad and that less of it would be really good. The agreement gets harder to come by when we try to prioritize a specific debt — particularly if valuable product features need to be delayed because of it. Earlier in my career I presented to an executive team about pressing technical debt problems and had them all — from Marketing to Legal to Operations — tell me that they hear about the debt all the time and they sympathize, but that it wasn’t clear how to eliminate the problem.

As I stood in the conference room with my slides up I realized I didn’t have an answer for them. I, like most engineering leaders, knew the debts were a problem but I didn’t have a plan that went beyond “the engineers agree we should fix this.”

Actually trying to address debt tends, in my experience, to start off well-intentioned and then get stuck almost immediately in one of three specific non-solutions.

Failmode: 20% time allocated to debt

A popular one is to attempt to carve out some time to address debt: either declaring that one week a month is ‘debt week’ or that roughly 20% of time should be allocated to debt. Some teams mark individual tickets as ‘debt’ and try to pull in a set number of debt points each sprint.

This is a recipe for both low morale and ever-higher debt. Allocating 20% time to tech debt produces a kind of digital stagflation. Partly because there’s no concept of the relative cost of different debts so there’s no way to determine the best debt to prioritize. But also because there’s no relationship between feature development and debt payments so the proportion of time allotted is entirely arbitrary. It’s usually about 20% because that’s as much as we can stomach losing from feature development, which feels like the more important work – even if the feature development is producing debt at a higher rate than we’re paying it down.

Feature development is more important than debt payments along a very short timeline. But on a longer timeline, prioritizing mostly feature development will suffocate future feature development. As long as the team that’s trying this 80/20 mix can only look one or two sprints — or even just one or two quarters — into the future then the debts will never actually be a priority. And since the debt is competing for the same work time as value-creating features, the only way the debts get paid is by grumpy teammates performing heroics; a surefire way to ruin a team.

Even if they do eventually get the debt under control, they go right back to making more debt, ignoring that they may have cleared the way for high-return investments that go unstaffed — particularly any kind of tooling that might prevent the high-interest debt from returning.

Failmode: A Technical Debt Team

Worse, a team might be formed with a mandate to pay down debt. This was one of the teams I led at Square back when we had a Rails monolith that was on fire. My teammates and I dove into the inferno and tried to tame the most central and somehow most neglected part of the company’s product.

Unsurprisingly, it was really hard. But the worst part wasn’t the work, it was watching other teams ship features quickly on top of the mess we were cleaning and feeling like we were somehow losing ground despite our efforts.

As a colleague of mine once put it, this “tech debt team” pattern is like sending one team to the office basement and another on a free cruise. The incentive and context mismatch is a recipe for relational conflict between otherwise friendly colleagues.

Failmode: Locally-visible debt reductions

This last pattern is the most intuitive one: Just letting each engineer and team fix the debt nearest them as they see fit.

The benefit of this is that the person paying down the debt probably has the most local context and is likely to move quickly. But there’s no telling whether this is actually important debt. Is this engineer bailing out a sinking ship or are they rearranging chairs on deck while the whole vessel goes down? We return to the same prioritization problem: How do we figure out which debts to pay down with our finite time?

Leadership needs to provide direction on which debts to service interest on, which to pay off the principal, and which to ignore. And executive leadership needs to provide that direction over an articulable time horizon.

The bulk of this work is accomplished merely by facilitating a healthy, public, and ongoing discussion among your most senior engineers about the various debts across the company. The truth is distributed unevenly through the minds of your engineers and no individual has the full picture.

This might look like a periodic Technical Business Review meeting where senior engineers update a living document to reflect what they see in the system and spin off any important conversations. It might also be just a leadership culture of giving senior ICs the encouragement and time to explore problems and report on their findings. Either way, most of the debt-tracking work is accomplished by letting the people who can perceive the problems surface them.

That will give you knowledge about your debt portfolio and investment opportunities. But it’s up to you to use that info to make company-level technical debt financing decisions.

Technical Debt Interest Rates

The debt financing metaphor applies much more broadly than we might expect. Not only are there debts and investments, but each debt has a specific interest rate and each investment has an expected return. This is what allows us to prioritize which debts to pay down.

Consider the situation where you owe money on student loans and also to a payday lender. The student loans are at 5% interest and the payday loans are at 100%. You get a surprise $1K as a gift. Where do you pay down the balance? Maybe split it evenly, $500 on each?

Under this scenario every dollar you get should go to the payday loan until you have it fully paid off. In fact, if you could somehow take out more debt at the 5% rate and use that cash to pay down the higher interest loan that would be wise.
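A quick back-of-the-napkin comparison makes the difference concrete. This sketch assumes invented balances ($10K of student loans, $1K of payday loans) and compares one year of simple interest under the two allocation strategies:

```python
# One year of simple interest under two ways of allocating a surprise $1,000.
# Balances and rates are invented for illustration.

student_loan = 10_000   # at 5% APR
payday_loan = 1_000     # at 100% APR

def interest_after_payment(to_student: float, to_payday: float) -> float:
    """Interest owed over the next year after applying the gift."""
    return (student_loan - to_student) * 0.05 + (payday_loan - to_payday) * 1.00

print(interest_after_payment(500, 500))   # split evenly: 975.0
print(interest_after_payment(0, 1000))    # all toward the payday loan: 500.0
```

Splitting the gift evenly leaves you paying nearly twice the interest of putting every dollar toward the high-rate loan.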

The first step in considering which debt to pay down is just to figure out the rough interest rate. To be clear, this isn’t an exact science. We’re not going to get decimal-level accuracy here as we try to calculate our debt portfolio. But, just like using Big-O notation for algorithms, we can be directionally correct.

Losing Work to Interest Rates

In a high-interest loan very little of the amount we pay actually lowers the principal. The first payment services the interest and the next payment does the same.

This is one of the easiest ways to identify high interest technical debt: What is something that forces your team to do work and then forces them to do the same work again in the future? How much energy gets sucked up by this debt payment? And how long would it take to pay off the principal such that no payments were ever made again?

If you were to make a list of toil like that you’d have a strong start on a debt portfolio.
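One crude way to sanity-check an item on that list is to divide the one-time cost of eliminating the toil by its recurring weekly cost. The figures below are invented for illustration:

```python
# Rough payback period for eliminating a recurring chore.
# Both figures are invented for illustration.

principal_hours = 120          # one-time cost to fully pay off the debt
interest_hours_per_week = 6    # ongoing cost of living with it

payback_weeks = principal_hours / interest_hours_per_week
print(f"Paying this off breaks even after {payback_weeks:.0f} weeks")
```

Anything with a short break-even period relative to the system’s remaining lifetime is a strong candidate for the top of the portfolio.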

The sneakier debts are ones that require no ongoing interest payment but the principal is increasing continuously. This might be from forking a software framework that’s now out of date, forcing your team to undo all of their work later in order to upgrade. Or building a software suite inside a monolith without building in boundaries that isolate major components of the products. In each of these cases you might have a surprise in the future where all work for a few months (or years!) is devoted fully to paying down a massive old debt.

Many of the debts in your system likely carry extremely low interest rates, despite how much engineers might complain about them. Maybe you’ve got a single page on your website on an old JS framework but it works and nobody plans to update it. It’s not in the New® Hotness™ and it’s an eyesore but if it’s not an obstacle to your team as they work then it carries a 0% interest rate. It would take the same labor to fix it in 5 years as right now and it isn’t a security attack vector. Which means you should absolutely ignore this debt.

Most debts aren’t obviously 0% and they’re not obviously giant scary ones – they’re somewhere in the middle. My favorite way to calculate these is by thinking about your company’s dimensions of scale and how each one might make the debt worse. For example, which debts grow along the axis of user growth? Which ones grow with engineering headcount growth? With feature count? With data size?

There are many things you can do to organize your debt portfolio but first you need to make it. You need an accurate accounting of the technical debts that matter to your company.

Calculating Technical Debt

The Five Properties of a Debt

There are roughly five things you need to know about each debt obligation in your system:

  • Principal – What would it take to fully pay it off?
  • Interest – How much energy is lost just putting up with it?
  • Increase In Principal – How much bigger will the payoff be in the future?
  • Increase In Interest – How much more energy will we lose in the future?
  • Payoff Events – Are there any inflection points in the future that will necessitate a sudden payment in full?

You may notice I described the debts here in terms of creative energy, not just time. This is deliberate because technical debt costs us cognitive drag, not just hours. Creative engineering work does not happen at a constant speed. An engineer might say only 10% of their time is spent on a repetitive task, but if you dig deeper it appears to be sapping a majority of their emotional and intellectual initiative. Toil like that can push the team to get distracted or snack on low-value work to avoid the more direct, frustrating work. Asking the team how much of their creative energy they feel they’re losing is, in my experience, a better measure of how much enthusiasm and insight is lost. Which is what we actually pay our engineers for.
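If you want to track these properties somewhere more durable than a spreadsheet, a minimal record might look like the following sketch. The structure and the example debt are my own invention, not a prescribed format:

```python
from dataclasses import dataclass
from typing import Optional

# One entry in a technical debt portfolio, mirroring the five
# properties above. All example values are invented.
@dataclass
class Debt:
    name: str
    principal: str          # what it would take to fully pay it off
    interest: str           # energy lost just putting up with it
    principal_growth: str   # how much bigger the payoff gets over time
    interest_growth: str    # how much more energy we'll lose over time
    payoff_event: Optional[str]  # any deadline when the full principal comes due

example = Debt(
    name="Flaky nightly build",
    principal="2 weeks * 1 engineer",
    interest="~4 hours/week of re-runs",
    principal_growth="test count",
    interest_growth="engineer count",
    payoff_event=None,
)
```

Keeping entries in a shared, reviewable format makes it easier to compare debts against each other rather than debating them one at a time.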

To make these concepts clearer, let’s explore some hypothetical examples.

The Ballooning Postgres Database

One of your teams operates a service where all of the data is in a single growing Postgres database. The team notices that queries are getting slower as the storage size grows but this isn’t a system where performance is critical. Your current cloud deployment supports database volumes up to 20TB of storage and you’ll reach that maximum in a couple years if the current growth curve continues.

The team plans to shard the data and migrate to multiple smaller Postgres instances when they need to. They estimate it’ll take the whole team about 3 months to do that – and it’ll take longer if the data is larger.

The principal of this debt is what it takes to eliminate it: Three months of migrating to a sharded design.

The interest of this debt is how much drag it introduces to your engineers: None at all! Having this simple database setup has allowed them to move quickly on developing features.

The Increase in Principal is how much longer that sharding migration will take as the data grows. Let’s say that your team believes that the work here is mostly code changes but the actual migration of data might take weeks of an engineer shepherding it if the data is close to the 20TB limit. Since most of the work is the code changes we’ll say the increase in principal is low.

The Increase in Interest is zero because the interest is and will stay at zero.

There’s a Payoff Event where the full principal comes due: 2 years from now. Let’s be conservative and say we need to have it in 18 months.

  • Principal: 3 months * 1 team
  • Interest: zero
  • Increase in Principal: low
  • Increase in Interest: zero
  • Payoff Event(s): 18 months

The Asynchronous App on MongoDB

Another one of your teams operates a service that uses asynchronous code with callbacks on top of a sharded MongoDB cluster. The team complains about the difficulty of testing the asynchronous code and has had to invent and maintain a testing library to make this pattern accessible to new hires. The data they store is quite relational so they’re frustrated with the document-oriented storage model. Much of their work is creating and fixing secondary indexes. They say, anecdotally, that 75% of their energy is spent just fighting the system.

The database sharding method has plenty of room to grow and the database instances themselves are reliable. But the team dreams of rewriting everything, even though they say it would take a full year.

To eliminate the Principal of this debt requires a full rewrite of the app and a migration to a relational database.

The Interest is the high percentage of the team’s creative energy wasted by this debt.

The Increase in Principal is how much harder it’ll be to fix this if we wait. If the solution is a rewrite that means this is growing in lock-step with the app’s complexity.

The Increase in Interest is high because as this app grows there’s yet more painful complexity to wade through.

However, there’s no Payoff Event on the horizon. The team’s output will decelerate forever but the system will technically keep working correctly.

  • Principal: 1 year * 1 team
  • Interest: 75%
  • Increase in Principal: high
  • Increase in Interest: high
  • Payoff Event(s): none

Comparing two debts

Say one of your Directors of Engineering supports the managers of both these teams. The director comes to you and says both teams are asking for time to pay down their debts. What insight do you offer?

If we do nothing then in two years the Postgres database falls over and that’s a total outage. So we have to do something about that. Whereas the MongoDB team just gets sadder and slower but their system keeps working. So should we ever prioritize letting them do their rewrite?

That depends on what the purpose of that second system is. What benefit does the company get from this painful asynchronous app? Is the low morale of the team offset by a good feeling that at least their work is extremely important? Or is it a trivial set of features that could be deleted?

Assuming both systems are roughly equal in their importance to the company, I would advise this director to stem the bleeding on the async MongoDB system by curtailing new development in it.

We can cap the growing principal and interest if we don’t add new features into that system. To do this, we’d need to staff a one-time migration to design a replacement system that new features can go into and write a full proposal on (eventually) migrating all existing features to the new system. (It’s important to write that detailed proposal in order to test whether it’s possible to one day fully decommission the existing system.)

Considering the timelines involved, I’d advise this director to give the next year of time to the MongoDB team to build and start to use a new approach, then spend the second year focusing on the sharded Postgresql migration. If the company hasn’t totally transformed after that, the third year the MongoDB team can finish decommissioning the original app.

This is an imperfect science, but as long as high interest rates are tracked and addressed the team should be able to recoup their creative energy.

A Notation for Scale

Note how in the ‘Increase in Principal’ and ‘Increase in Interest’ columns for the second example I put the word ‘high’. That’s not very helpful. How do we compare one ‘high’ against another? What if most of our debts have ‘high’ interest?

Let’s look for a better way to describe how debts can get worse over time.

If you’ve ever interviewed at a company that hasn’t updated their hiring philosophy since the 1990s you might have encountered Big O notation. It’s a way of describing the worst-case performance of some logic as the logic is applied to data. Let’s steal just a piece of this concept to describe the way that debt can get worse over time. Instead of using a single dimension ’n’ as in ‘O(n)’ versus ‘O(n²)’ we’ll look at all of the different dimensions along which a digital system can grow and therefore debts can get more expensive.

Dimensions of scale

This book is focused on engineering teams that support online software systems because that’s been my whole career. These systems grow along many axes over time: You’ll get more traffic, you’ll store more data, and the graph of your data relationships will become more complex. Alongside all that, you’ll see more engineers working on it, more customers using it, and an ever-larger range of dates represented in the production dataset.

You may have other dimensions of scale, depending on your business model.

User Traffic

This is a classic measure of scale. Creating a version of something that ten people can use at once is phenomenally easier than one that a million can use at once, independent of the dataset behind it.

Data Storage

Another classic measure of scale, there are some cliffs to watch out for here (like running out of storage for any non-distributed database) but even incremental growth here will cause your queries to respond slower and your hardware expenses to go up.

Feature complexity

I love keeping track of the number of supported features. Partly because it gives us the Shipped Potential chart but also it’s helpful to know roughly how many different user experiences the system supports. This kind of inventory makes it possible to answer the question “If we build X how many existing features might need to be adjusted?” In practice, a question like that can surface a rough coefficient for multiplying the back-of-the-napkin time estimate that an engineer might give.

It’s not rare for each feature to take longer to develop than the previous one. So, for the purposes of modeling technical debt, it’s helpful to use feature complexity as one dimension of scale along which a debt might get more costly.

Data Modeling Complexity

Most of the companies I’ve worked with should pay more attention to this dimension. This is a measure not just of the number of databases, tables, and columns that exist in the schemas at your company; it’s a measure of overall graph complexity.

If you were to generate a diagram of your production schemas and data relationships it might be super ugly. Instead of a neatly organized tree structure you’ll probably find a few datasets that virtually everything references. These datasets would also be the most painful ones to work with; they’d represent the central concepts of your flagship product, and they’d probably have way too many fields.

If you’ve never calculated the graph complexity of your production schema before, there are plenty of different measurements you can make. To start, I recommend keeping it very simple and just counting two metrics: 1) The p90 and p99 number of columns in all tables, and 2) The number of relationships between tables divided by the total number of tables. As these numbers go up, some debts will become more costly.
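Both metrics fall out of a quick script against your schema metadata. This sketch computes them from a toy in-memory description of a schema; the table names, column counts, and relationships are all invented:

```python
from statistics import quantiles

# Two quick schema-complexity metrics computed from a toy description
# of a schema. All table names and shapes are invented for illustration.

columns_per_table = {
    "users": 42, "payments": 18, "sessions": 7,
    "audit_log": 9, "invoices": 23,
}
foreign_keys = [  # (from_table, to_table) relationships
    ("payments", "users"), ("sessions", "users"),
    ("invoices", "users"), ("invoices", "payments"),
]

counts = sorted(columns_per_table.values())
# quantiles with n=100 yields 99 percentile cut points; index 89 is p90, 98 is p99
cuts = quantiles(counts, n=100, method="inclusive")
p90, p99 = cuts[89], cuts[98]

relationship_ratio = len(foreign_keys) / len(columns_per_table)

print(f"p90 columns per table: {p90}, p99: {p99}")
print(f"relationships per table: {relationship_ratio:.2f}")
```

Run periodically, these two numbers give you a trend line: a rising relationship ratio or a ballooning p99 column count is an early signal that certain debts are getting more expensive.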

Number of Employees at the company

Is there manual work for your technical team every time one more person joins the company? Or are the administrative interfaces that employees use to manage and operate the product starting to creak at the seams?

This tends to only matter for administrative tools, but it doesn’t look great when an engineering leader is totally surprised by a scaling cliff for tools their colleagues use. And if you find that more than half of your employee headcount is customer support you wouldn’t be the first leader to be startled by that; many startups that fail to prioritize internal-facing debts have to hire budget-destroying Customer Support teams.

Number of Engineers

You’ll likely experience slowdowns in development as your engineering headcount scales but it’s not necessarily related to technical problems. Here you’ll find process debt, cultural debt, and communications debt in addition to some technical debt.

As engineering headcount scales you may find brittleness in your build and deployment tools, especially if any special knowledge might be required to perform a deploy. At some point you may also notice the scarcity of good code reviewers because as headcount scales up the percentage of the system that any individual engineer understands goes down.

In my experience the debts here are almost entirely non-technical. I recommend separating those debts from the strictly technical ones for the purposes of your technical debt calculations. The process and cultural and communications changes don’t compete for your time from the same work queue as the technical ones. Nobody’s going to ask you to choose between shipping a feature or overhauling your internal communications. The expectation is that those two pieces of work can happen in parallel.

The one place I recommend looking very carefully with regard to engineering headcount scaling is Not Invented Here syndrome. Where in the product is there an invention that doesn’t absolutely have to be there? Notice which debts don’t work as well with lots of new hires. Too often a technical system relies on an unnecessary invention understood by a select few – often the same few people who’re needed to do other critical work. This is especially dangerous if the invention was written by a technical cofounder who’s now in an executive role of some kind because their invention has likely been (unconsciously) shielded from scrutiny.

Number of Users

This is a big one. Regardless of data size or throughput in bytes, features that worked for a hundred users rarely work for ten thousand. Administrative interfaces will get slow, synchronous workflows will need to be made asynchronous, perhaps the primitive search system for exploring user data will need to be fully replaced, etc. And any piece of code or UI that operates across multiple users (typically found in analytics jobs and admin interfaces) will strain as user count goes up.

Perhaps more sneakily, the number of edge cases that your team will see in the data relationships scales roughly in line with user growth. Given enough users, you’ll see every possible permutation of user data, each of which will need to be encoded in the test suite using ever more complex data in the tests.

Each passing day

There are some systems that record a snapshot in time or perform date calculations over a range of the full dataset. Even when nobody’s using these systems this dimension of scale can get worse merely from the ticking of the clock.

Using Dimensions of Scale

So instead of saying that an interest rate is ‘high’, we can say that it gets worse along specific dimensions.

In the case of The Asynchronous App on MongoDB the principal of the debt got worse with feature growth and with relational data complexity. And the interest got worse with each new engineering hire at the company. Even if the engineers on the team felt like a steady 75% of their time was wasted slogging through, for each new engineer at the company there’s a greater bifurcation of architectural approaches and increased desire for people to leave this team and work on something better.

We can describe this debt obligation a little better now.

  • Principal: 1 year * 1 team
  • Interest: 75%
  • Increase in Principal: features * data complexity
  • Increase in Interest: engineer count
  • Payoff Event(s): none

And in the case of The Ballooning Postgres Database we saw that the principal increases as data size increases, so we can be more specific about that.

  • Principal: 3 months * 1 team
  • Interest: zero
  • Increase in Principal: data size
  • Increase in Interest: zero
  • Payoff Event(s): 18 months

With that, let’s make a technical debt portfolio for your company. We’ll assume you have to deal with both The Ballooning Postgres Database and The Asynchronous App on MongoDB. On top of that, let’s contrive a few other realistic debts for your teams.

Yes, these are all situations I’ve lived through and, yes, forcing you to hear about them is absolutely therapeutic for me.

The Slow Admin Dashboard

You’ve got an internal administrative dashboard app that lets employees navigate user data. It’s existed for years and the pages are getting slow. It has its own permissions to read and write to the databases that sit underneath the product applications and it queries them directly. It only performs ‘SELECT’ queries but they are very inefficient – on average a page loads in about 30 seconds.

You’re worried that one day too many of your colleagues will try loading the dashboard’s index page at the same moment. The page is so data-rich that even a few dozen simultaneous page loads can lock up one of the production databases and cause an outage. Your engineers want to replace it with a new client side app that uses HTTP endpoints to the product applications instead of direct access to their databases.

There are actually two debts here: two things need to be done, so there are two obligations for future work. One is that you’ll need to start using read replicas for this dashboard app. Simply loading a page should never cause a production outage, and the best cache for a database is its read replica. This debt has a Payoff Event that’s imminent. The other debt is that you’ll need to move away from direct shared database access as an architectural pattern.

Let’s make an entry in our debt portfolio for each of those.

It’ll take about a day to move the queries from the primary database to a read replica and pay off the Principal. There isn’t any repeated maintenance labor caused by this app (other than perhaps a looming dread), so the Interest is zero. That won’t change, so the Increase in Interest is also zero. And the cost to switch to a read replica is the same both now and later, so the Increase in Principal is zero.

Fixing the architecture of this whole app is a far, far harder job. The engineers say they can do it in 6 months but maybe you’ve seen this before and know that the Principal here will take a full team 2 years at a minimum. Since fixing the architecture is effectively a rewrite you see an Increase in Principal with each new feature that’ll need to be ported from the old way to the new way.

The slowness of the app causes all employees to pay an Interest payment of about 30 seconds every time they use a page. And it’s getting slower so you see an Increase in Interest as database size grows, as the user count increases, and as more employees use the pages.

To make it all worse, the load balancers for this app have a maximum 60 second timeout so as soon as these slow pages can’t respond in that window the app suddenly stops working completely. Looking at some charts you estimate that Payoff Event will happen in 9 months.

Admin uses read replica:
  • Principal: 1 day
  • Interest: zero
  • Increase in Principal: zero
  • Increase in Interest: zero
  • Payoff Event(s): any day now

Admin uses HTTP APIs:
  • Principal: 2 years * 1 team
  • Interest: waiting 30 seconds to see any page
  • Increase in Principal: features
  • Increase in Interest: employees * data size * user count
  • Payoff Event(s): 9 months

Both of these are critical. The second one, at least, won’t fail immediately. So the right course of action is to fully pay off the first one as soon as there’s a good moment in the team’s cadence and then work on a thoughtful strategy for addressing the second – especially considering the payoff event is sooner than the principal payoff period.

Engineers Cloning Production

An early employee created a script that dumps a production database directly to standard out and pipes it into their laptop’s local database. It’s extremely popular because any engineer — particularly junior ones — can download production data to their laptop and run the app in development mode to see how their code changes perform in production.

You hope to raise a round of funding soon and you know that’ll require an audit of your security and compliance. This pattern will be flagged as a major compliance violation because user data should never leave the production environment. And it should definitely not be sitting around on an employee’s laptop – certainly not the entire user dataset.

You also know that this poses a handful of other major problems for your organization in both engineering culture and technical sophistication. Here, I’m going to talk about the nuance of this debt but in case your company is leaning toward this pattern let me urge you in the strongest possible terms not to do it. Your teams should have production-like data in their unit test fixtures. Until that’s true the product development quality will be low and there’ll be a temptation to download data from production.

Calculating this debt is interesting. The debt isn’t that bad right now, though there are some Interest payments in the form of lower security, script maintenance, and the lack of comprehensive test fixture or test factory data. But there are several Payoff Events and when we suddenly need to move away from this approach it will take an unknown amount of work to pay the Principal and get our system in shape to be developed through better patterns.

There’s an Increase in Principal as more code is written with insufficient unit tests or poor service boundaries, and each new engineer who joins and uses this pattern puts us further behind.

There’s an Increase in Interest with each new engineering hire as people start asking for slight improvements in the script or for better production data scrubbing and development time gets allocated there.

| Debt | Principal | Interest | Increase in Principal | Increase in Interest | Payoff Event(s) |
|---|---|---|---|---|---|
| Cloning production data to laptops | unknown | 5% | features * engineers | features * engineers | Any audit, Hiring a staff engineer |

Messy core product modeling

For the purposes of this example, let’s say your flagship product involves file storage. This is such a key concept at your company that the word ‘file’ appears in most conversations and is in the center of most technical whiteboard drawings. There’s a class called ‘File’ in the main app and it’s out of control: thousands of lines, relationships to most other classes in the codebase, and what appear to be two poorly-implemented finite state machines within the class itself.

Paying down the Principal here is just a ton of work. Hundreds or even thousands of careful refactorings, moving logic out of this class into something better encapsulated and modeling File-related concepts in their own classes. Your team pays Interest every time they touch this file or an adjacent one and, anecdotally, it takes a week to do in ‘File’ what could be done in an hour elsewhere in the system. Every time new features are added there’s an Increase in Principal and an Increase in Interest.

How would you calculate the actual interest rate for File? Taking a week to do an hour’s worth of work sounds extreme – that’s a 40x slowdown whenever work gets close to this code, or 97.5% of energy going to servicing interest payments. But not all the work happens here; most of your team is working elsewhere in the system, so it’s not as if all of engineering is slowed down 40x. If it were, this would be the most critical piece of debt to pay off before anything else.
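
To see how that pencils out, here’s a tiny model of the arithmetic. The 40x slowdown comes from the example above; the 10% share of work touching ‘File’ is an assumed illustration:

```python
# Rough model of how a localized slowdown affects the whole org.
# The 40x slowdown is from the example above; the 10% share is assumed.

def org_slowdown(debt_slowdown: float, share_in_debt: float) -> float:
    """Average time multiplier across all work, when some share of it
    runs `debt_slowdown` times slower than the rest."""
    return (1 - share_in_debt) + share_in_debt * debt_slowdown

# Inside 'File' itself, 97.5% of effort goes to servicing interest:
interest_share = 1 - 1 / 40  # 0.975

# But if only 10% of the team's work touches 'File', the org-wide
# multiplier is far smaller than 40x:
print(org_slowdown(40, 0.10))  # ~4.9x average slowdown
```

The point of the sketch: the local interest rate can be brutal while the portfolio-level rate stays modest, which is exactly why the roadmap matters.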

This situation is common and also a good reason to have a finished debt portfolio, because it’s impossible to know whether to pay this down without knowing how many File-related features are on the roadmap and what other debts are also in the way of upcoming work.

To calculate this debt let’s just get a rough sense for how much time engineers will spend in this area of the code and make a guess at how much this debt might frustrate them. Maybe your upcoming features will move at half the speed they could (a 50% interest rate) because a few of them require working in this debt. Rather than agonize over the exact baseline, make an educated guess. If you’re wrong it’ll come up in discussion with your senior ICs.

| Debt | Principal | Interest | Increase in Principal | Increase in Interest | Payoff Event(s) |
|---|---|---|---|---|---|
| Messy core product modeling | years | ~10-50% | features | features | None |

A Full Technical Debt Portfolio

Let’s put all these contrived but believable examples together and see if we can compare them against each other.

| Debt | Principal | Interest | Increase in Principal | Increase in Interest | Payoff Event(s) |
|---|---|---|---|---|---|
| Ballooning Postgres Database | 3 months * 1 team | zero | zero | zero | 18 months |
| Asynchronous app on MongoDB | 1 year * 1 team | 75% | features * data complexity | engineer count | None |
| Admin uses primary database | 1 day | zero | zero | zero | any day now |
| Admin reads from database directly | 2 years * 1 team | Waiting 30 seconds to see any page | features | employees * data size * user count | 9 months |
| Engineers Cloning Production | unknown | 5% | features * engineers | features * engineers | Any audit, Hiring a staff engineer |
| Messy core product modeling | years | ~10-50% | features | features | None |
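
One way to keep a portfolio like this honest is to treat it as structured data rather than prose, so it can be sorted and queried as priorities shift. A minimal sketch, where the field names and entries are my own shorthand for the table above:

```python
from dataclasses import dataclass, field

@dataclass
class Debt:
    """One row of a technical debt portfolio."""
    name: str
    principal: str         # cost to pay off now, e.g. "1 day" or "2 years * 1 team"
    interest: str          # recurring cost of leaving it in place
    principal_growth: str  # what makes payoff more expensive over time
    interest_growth: str   # what makes the recurring cost grow
    payoff_events: list = field(default_factory=list)  # deadlines forcing payment

portfolio = [
    Debt("Admin uses primary database", "1 day", "zero", "zero", "zero",
         ["any day now"]),
    Debt("Engineers Cloning Production", "unknown", "5%",
         "features * engineers", "features * engineers",
         ["Any audit", "Hiring a staff engineer"]),
    Debt("Messy core product modeling", "years", "~10-50%",
         "features", "features"),
]

# Debts with a looming payoff event deserve the first look:
urgent = [d.name for d in portfolio if d.payoff_events]
```

This is deliberately a catalog, not a ranking: the ranking only falls out once the roadmap is known, as the next paragraphs argue.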

It won’t be clear and obvious which of these to work on, and in what order, unless we know what we’re trying to accomplish next. Well, perhaps moving the admin app to a database replica is a shoo-in for prioritization because of the low cost and immediate payoff. But prioritizing the rest depends on what features the company wants next.

Will your teams be working exclusively in the ‘File’ class? If so, better make a plan to either clean that up or somehow mitigate the worst parts of it.

Does the company critically need to improve customer NPS and lower customer service headcount growth? Perhaps a super-responsive and more powerful admin dashboard is a must-have.

Are you planning to double engineering headcount in the next year? If so, both the asynchronous MongoDB situation and the production cloning pattern might need to be remedied immediately.

The point of a technical debt portfolio isn’t to make a TODO list of debts to address, it’s to gain knowledge of the financial landscape around you so you can make winning investments.

An Executive View of Technical Debt

Debts let us move fast now at the cost of going slower later.

Investments give us value so we can survive and hire people to pay off debts.

Taking on low-interest debts while making high-ROI investments allows us, over time, to reach our maximum speed.

But what’s the timeframe? When can we afford to go slow and when must we go fast?

That is a conversation that can only exist between you and the CEO (or other leader) you report to. Only your lead can determine what the major milestones are and when they need to be met. Only you can provide the context necessary to reframe the conversation from lots of tiny sprints forward (which accumulates debt and forever slows development) toward a measured set of debts and investments over a long timeframe.

To make that more concrete, let’s examine scenarios of companies at different growth stages and the kinds of technical investment approaches that might be appropriate for each.

Taking the Executive View

Most of the time, leadership isn’t stepping out front and saying “This is the way, follow me!”; more often it means giving a map and a route to your people. Even if you don’t know how to draw the map, as the leader you need to source it and provide it.

Remember we’re trying to minimize drag here: cognitive drag, the energy-sapping confusion and conceptual-graph complexity that an engineer feels when trying to be successful in a complex environment.

As an executive leading a modern software engineering organization you’re navigating your people through a landscape of sheer complexity. The territory in which your engineering teams work is literally interconnected concepts written down. That’s what software is. Leadership, here, needs to be some kind of guidance through that maze of concepts that allows your engineers to put only the necessary concepts in their head to navigate from where they are to where they need to go.

A technical debt portfolio is a map of conceptual territory along the dimension of time. It marks where there are impassable obstacles, where there are arduous hills to climb, and where there are smooth paved roads. While an architectural diagram can describe where you stand in that territory right now, a debt portfolio gives you insight into the safe roads for the journey ahead.

Infrastructure Gravity & Domain Engineering https://jackdanger.com/infrastructure-gravity/ Tue, 02 Jan 2024 00:00:00 +0000

The following is an excerpt from Executive Engineering.

 

“You become responsible, forever, for what you have tamed.”

– Antoine de Saint-Exupéry, in The Little Prince

 

Each company draws its own line between Product Engineering and Platform/Infrastructure/DevOps but the difference between them is clear: Product Engineering is “stuff the whole company wants” (making features) and the other is “stuff the engineers say we have to do, I guess”.

I’ve worked both above and below this divide and several times now I’ve led the entire span of engineering, from feature development down to whatever you call the bottom layer.

I think I know why that bottom layer has so many[1] names: It’s actually two separate things. Both are invisible outside Engineering but they require radically different architectures, leadership, and investment models.

The Product function peers down into the technical stack like a person in a boat, trying to see the depths. There’s a limit to our perception from the surface. To see further we must plunge into the water ourselves.

Everything visible when looking at the UX — the surface of the tech stack — is considered “Product Engineering”. Below that the work is opaque and is assumed to be a cost center for the company.

This view gives us two levels: The visible, and the invisible. And the line between them depends on the technical sophistication of the viewer.

This separation serves us poorly.

Partly because it prioritizes the merely visible over the important. To return to our aviation metaphor, prioritizing just visible product work is like making a plane out of just the visible parts: A fuselage, a flight stick, wheels, etc. We actually need a whole plane — with all the little details — if we want to fly.

The separation between visible and invisible work also obscures how the lower levels are more foundational in the system than the surface. If a feature breaks then just that feature is broken. If something deeper breaks then all features break.

I believe there are exactly three levels to Engineering in a product-shipping company, not two. These three levels exist no matter the size of the company and they do not necessarily map to the org chart.

They have many names but I call them Product Engineering, Domain[2] Engineering, and Infrastructure Engineering.

They all work toward different purposes, with very different constraints, and along completely different timelines.

Let me show you.

There’s a Big Bang moment at the start of every tech company. The first line of code is instant and the next lines are the beginning of a permanent deceleration in velocity.

The team focuses entirely on making features and the system gets more complex until, hopefully, it’s actually useful. As new software engineers join the team they add further complexity and new patterns. Their excitement about features is tempered by a growing frustration with the underlying complexity. Folks tidy as they go but their goal is value creation, not cost reduction. The company hasn’t yet earned the right to clean up its mess.

Then one day, usually when there’s between 5 and 15 engineers at the company, one of them gets frustrated enough that they stop creating new features. They break away from the team and focus their attention on, say, the deployment script or the database config or the test suite.

Voilà. An infrastructure org is born.

This buys the company some time. With one person tending to complexity the rest of the team is free to keep piling new value into it. The infrastructure person fixes whatever problems exist, regardless of the type of problem. They can refactor source code, fix databases, debug BI dashboards, or improve deployments.

With their breadth of skills they’re prioritizing across the whole company’s needs, not limiting their work to just one layer in the technology stack or one app or technology. They can do this because they’ve helped create everything so far.

Other engineers may follow this person from feature development to infrastructure. The correct ratio between product work and infrastructure work depends entirely on the seniority of the product engineers and how much of the system they can reason about as they build new features. Anything the product engineers can’t perceive or don’t make time for, the infrastructure team handles for them – so if your product engineers are junior you’ll need more infrastructure people[3].

One day someone will be hired directly onto this infrastructure team. At this moment the engineering org undergoes a fundamental change and creates a problem it may not notice for years: There is now a person who’s fixing a foundation without understanding what that foundation supports.

An external hire doesn’t know the messiness of the product implementation. The features are, by definition, unique to the company so they can only learn them here. An external hire on an infrastructure team likely cannot upgrade, migrate, and reason about the system as a whole. They have other – very impressive – skills, but they can’t improve product internals. To do that they’d first have to build product features, which takes them away from growing their low-level skills that command higher salaries under titles like ‘DevSecOps Engineer’ or ‘Cloud Architect’ or ‘SRE’. Going into product engineering would slow their career.

The product engineers may be justifiably impressed with these folks and the reverse is (hopefully) also true; neither group can do the other’s job. One knows the specific product in detail, the other knows the patterns for software generally.

It’s bad enough that the company may have just recreated the exact problem that DevOps intended to solve (one team writes software while the other one runs it), but there are well known solutions to that. The trap here is that, increasingly over time, some engineers are building features at the top edge of the system while others work at the extreme bottom of it.

Nobody is looking at the middle.

“Unique to this company, shared between features”

The middle of the system is messy.

Ask any engineer working at a large org what’s the hardest part of making features and they’ll likely say the dead center of the product suite is a giant mess.

Because ‘middle of the system’ is super vague let me be more clear. When the very first feature is created, all the technology at the company is in two categories:

  1. The unique logic, designs, and choices that comprise that first feature
  2. All the frameworks and tools to support changing and running that feature

The top category is the implementation of the company’s value proposition. Nothing up on top is particularly transferable between jobs because it’s the company’s competitive business advantage — it’s by definition totally unique.

The lower category is the opposite – if you work on it at one company you can get up to speed on it quickly at another. Every company must run unit tests and deploy software, so these patterns don’t change much; they’re less a matter of invention and more a matter of learning enormous vocabularies of implementations. You can get paid well for knowing that a process with an exit code of zero has succeeded and applying that knowledge across industry-standard tools.

If there’s anything in this lower category that’s a unique pattern — anything at all — it either should be deleted or it belongs up with the features.

Once we add our second and third features the middle of the system reveals itself.

This middle layer is the glue that binds everything together. We can define it as “Everything that’s unique to this company’s features but not unique to any one feature.” It’s the internal semantics of the product suite as a whole, even if that product suite isn’t a suite yet. Even if it’s not a whole product yet.

Try to hire an infrastructure person from another company to fix your deployment and they’ll have it done quickly.

Try to hire a software engineer to build new features and they’ll do fine, as long as they passed the minimum technical bar and the feature is well specified.

Try to hire someone to work in this middle layer? They’ll immediately stumble over the internals of your company’s unique inventions.

How the middle atrophies

There’s a good reason this middle is so often unowned or poorly owned.

Two reasons, actually. I call them Infrastructure Gravity and Feature Lift.

They’re hidden but powerful forces pulling our engineers to the extreme top and bottom of the technology stack, forcing individual engineers to perform heroics in order to keep the middle alive.

Infrastructure Gravity

This bottom layer has a gravity to it, pulling engineers lower into the tech stack and keeping them there.

There’s a clear line in the tech stack above which most new infrastructure hires never venture. This line can be hard to see if you’re looking at source code but once the code is running it’s very clear: It’s the process boundary for the running app. Anything that supports running processes on a computer is below the line. Your cloud architecture, config scripts, the test harness, etc. Every system call that a process makes is dealt with below this line.

Above the line is the internal state of the processes themselves (and, by extension, their source code). Not just the business logic of each application but also the software frameworks and libraries, the features’ performance, and the semantic interconnections between all features and dependencies.

It’s easy to see why this line gets brighter over time. Inside a running process is code written by, possibly, a new employee in a big hurry a long time ago. Imagine you have a choice to, on the one hand, understand and improve this code or, on the other hand, find a way to execute the process 5% more efficiently using transferable Unix and cloud skills. Which would you choose? Only the latter is guaranteed to even work and it’s the one that provides the best career advancement in a role with the better pay.

This is Infrastructure Gravity.

It pulls the people working on Infra / Devops / SRE / etc. down out of the middle layer towards the stuff below this line where they have a much higher chance of success and protection from the debt of the product work.

Luckily, a company needs this gravitational force to keep the most foundational parts of their system stable. Without it there’s very little chance the system will be resilient or correct or secure. Infrastructure Gravity provides incredible value, it just also prevents people working at the bottom of the stack from helping out in the middle.

Because it prevents people who work below this line from learning what’s above it, we find that when they feel inspired to invent something it will probably be an appliance that exists below this line. Perhaps a new deployment tool or an orchestrator or a scheduler. Perhaps a new message bus or an API that provides a novel data-storage pattern. These seem like big improvements if you spend all your time below the line (or currently work at Google in the year 2007) but the value they provide is minuscule compared to any improvement in this middle layer. What good is a new message bus if the user account data is broken?

Once you recognize Infrastructure Gravity in your org it becomes clear that staffing investment in Infrastructure Engineering is more or less a fixed amount. The minimum reasonable investment is also the maximum. Staffing this below the minimum might be catastrophic and adding more engineers here will not provide any increase in product value.

Feature Lift

At the same time there’s an opposite force pressuring product-shipping engineers to stay at the top of the tech stack. Feature Lift pulls an engineer toward only the work necessary to launch a visible change.

Some feature work can be accomplished either quickly or slowly, the quick way producing a hidden mess. An engineer working through a backlog of tickets will sometimes take the slower path in order to fix the system a little. But an engineer who consistently takes this slow path will be at odds with their team’s mandate to ship features. Especially with teammates who don’t agree about the necessity of going slow (or who can’t perceive the mess). This leaves even the most well-intentioned and systems-minded engineer in a tension: Their ability to effect systemic improvements for the company requires them to spend social capital or to move to the lower, infrastructure layer.

This Feature Lift grows every time shipping fast is rewarded. Every time a product engineering team rewards itself for adding new value to users. Every launch, every performance review where someone has to write up their “impact”, and every time a new shiny thing is shown to colleagues – the Feature Lift increases its pressure.

Feature Lift, like Infrastructure Gravity, is powerful and good. It focuses a team tightly on what matters most right now. Without Feature Lift a company wouldn’t launch anything. There’s an old phrase among SREs at Google about “running to stay in place” – refactoring and fixing the internals of a system in a cycle, forever, with no visible output. In fact, this is the most common reason I’ve seen infrastructure tools companies fail: They’re founded and led by people who hesitate to send debt-ridden, valuable features out the door (I’ve been guilty of this as a founder).

So this Feature Lift provides enormous value — it often provides a company’s only value! But that value needs to be integrated with itself over time, the features consolidated with each other and the overall system smoothed out to make room for new features. No one feature is responsible for the company’s competitive advantages. It’s the system that holistically connects all features together, that integrates the features correctly, securely, and usably — that’s what makes a modern product worth using and sometimes even worth paying for.

Creating features one after another without consolidating them is like making a linked list. Useful, to be sure, but the cost of traversing it is `O(n)`. If we were to structure these features in a better architecture then it’s like storing elements in a binary tree. Which, under ideal conditions, can yield a far more efficient `O(log(n))` performance.

But a binary tree must be periodically rebalanced. If we just add items to it and never rebalance the tree it has the exact same poor performance as a linked list.
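
The analogy can be made literal with a toy search tree. This sketch (illustrative, not from the text) shows a naive binary search tree degenerating into a linked list when keys arrive in order and never get rebalanced, exactly like unconsolidated features:

```python
# A naive binary search tree. Inserting already-ordered keys degenerates
# it into a chain -- a linked list -- so lookups cost O(n), not O(log n).
import random

def insert(node, key):
    if node is None:
        return {"key": key, "left": None, "right": None}
    side = "left" if key < node["key"] else "right"
    node[side] = insert(node[side], key)
    return node

def depth(node):
    if node is None:
        return 0
    return 1 + max(depth(node["left"]), depth(node["right"]))

n = 500
degenerate = None
for k in range(n):  # features added one after another, never consolidated
    degenerate = insert(degenerate, k)

balanced = None
for k in random.sample(range(n), n):  # the same keys, "rebalanced"
    balanced = insert(balanced, k)

print(depth(degenerate))  # 500 -- effectively a linked list
print(depth(balanced))    # typically ~20-30 -- near O(log n)
```

Same 500 items either way; only the ongoing restructuring work keeps traversal cheap.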

Once we recognize Feature Lift in our org it’s clear that staffing investment here is roughly linear – each additional engineer provides a marginal increase in feature development. Additional headcount here is like adding items to an unbalanced binary tree.

So we go to the middle to get real acceleration. The middle of the system is where we add more engineers and see a superlinear return on our investment.

The Gap – A space for Domain Engineering

As a CTO you own one large sociotechnical system of people and technology and you can reason about it as a whole. Systems theory tells us that a system has properties that none of its component parts have, so if we fail to zoom all the way out we will miss some quality of this big thing we’re responsible for.

At this widest view, there are two ways the company is able to look at the technical system. These two perspectives map to Feature Lift and Infrastructure Gravity.

Most of the company perceives the system from the user experience. This view shows us all of the features pretty clearly. They’re built for people to use (even the 3rd party APIs) so it’s not too hard to see the system from this angle. This lets the company, and particularly the Product and Design functions, direct our work to improve the user experiences.

The other view is one that comes from the people who feel Infrastructure Gravity. These folks have developed sympathy for the runtime experience of the hardware and software itself. The memory use of processes, the latency of the network and filesystem writes, the data plumbing and storage patterns — this is a view of the system from the bottom.

There’s no “Middle Layer Non-Gravity Non-Lift” pulling anyone to the middle. The only thing that naturally draws engineers to look at the middle of their system is pure blinding rage. Given enough exposure to the neglected center someone will eventually make time to fix the things that bother them, whether they can make much progress or not.

Those heroes burn out quickly, and they tend to be precisely the people you need mentoring junior folks, fixing security and performance, and interviewing senior talent. So let’s not rely on only them to fix the middle of the system. These heroes often feel responsible for the mess in the middle because they helped make it, but you, the executive, are actually the one person responsible for the mess. And that’s good, because this is a point of enormous leverage.

As we look closer we can make better sense of what this middle layer actually is: This is Domain Engineering; your company’s competitive advantage and your greatest asset as a technical leader.

Domain Engineering is the process of reusing domain knowledge to minimize the cost of developing products.

We see this in automobile manufacturing as shared chassis between models of cars. Or in software consulting as frameworks that generate software, making the work more configuration than coding. And at any company that would employ you or me, Domain Engineering is the encapsulation and consolidation of the domain concepts underpinning more than one feature.

Let’s make this concrete with an example.

Imagine your engineers are designing an authorization/authentication layer to let users access a suite of products. It might be tempting to say that, because every product depends on this, the whole user access system is Domain Engineering. Most of it is actually the lower Infrastructure layer because every company needs a way to store secure user credentials, perform authentication, handle password and token verification, revocation, etc. None of that supports the competitive advantage of the company.

It might be hard to imagine what would exist in the middle, below the products and above the credential management. It’s hard to imagine because it’s rare for a company to even build it.

In between these two layers is a big opportunity to make great investments. For example, what happens when a user is not allowed to see a resource, generally? Do they see an error page or get redirected somewhere? What messages are displayed and how is the user guided through a UX flow? How are resources partitioned between tiers of authentication sensitivity?

All of that should be solved in a library so an engineer can choose the right pattern when implementing a new feature. If any of those details require development time from a feature-shipping team then the product roadmap is being delayed unnecessarily.
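
As a sketch of what such a library might offer (every name here is hypothetical, invented for illustration, not an API from the text):

```python
from enum import Enum

class DenialStyle(Enum):
    """Company-wide patterns for what a user sees when access is denied."""
    NOT_FOUND = "not_found"      # pretend the resource doesn't exist
    UPGRADE_PROMPT = "upgrade"   # guide the user toward a higher tier
    REQUEST_ACCESS = "request"   # offer a flow to ask the resource's owner

def deny(resource_kind: str, style: DenialStyle) -> dict:
    """Return a uniform denial response so every feature handles
    'not allowed' the same way instead of redesigning the flow."""
    if style is DenialStyle.NOT_FOUND:
        return {"status": 404, "message": f"{resource_kind} not found"}
    if style is DenialStyle.UPGRADE_PROMPT:
        return {"status": 403, "message": "Upgrade your plan to access this",
                "redirect": "/billing/upgrade"}
    return {"status": 403, "message": "Ask the owner for access",
            "redirect": f"/access-requests/new?kind={resource_kind}"}

# A feature team picks a pattern rather than designing a denial UX:
response = deny("report", DenialStyle.UPGRADE_PROMPT)
```

The feature team chooses a `DenialStyle`; the Domain Engineering layer owns what each style actually means across the product suite.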

Domain Engineering is the place to bake in all the company-specific decisions about authentication flows, service-to-service APIs, library integrations, error pages, and anything else that you don’t want to drag on Product Engineering as they make new features.

And that’s just the beginning. Domain Engineering is also the right conceptual home to shepherd better development of the company’s competitive product advantages. It’s easy to spot a competitive advantage — it’s usually one of the oldest concepts around, it appears in virtually all of the products one way or another, and the implementation at some point becomes a huge mess. To GitHub this would be repos and commits, to Stripe and Square this would be the processes that create businesses and payments and purchased items, to Airbnb this would be the data and APIs that manage listings and reviews, etc. Any competitive advantage at your company will be leveraged across many features and is therefore best owned by the layer of engineers underneath Product Engineering.

You might think this sounds like ‘Platform Engineering’ and it has a lot of overlap. I deliberately avoid that word because it’s too easy to conflate that with “infrastructure” work and this layer needs to be insulated from Infrastructure Gravity. Anything pulling engineers down into infrastructure or up to building features will compromise Domain Engineering’s success.

Or, to think about this another way, how incredible is it that the competitive advantage of a company, something that appears in almost every feature, wouldn’t have engineering teams dedicated to increasing the impact of its use across the company?

A Financial Model of Domain Engineering

As executives we’re on the hook for the budget and outcomes from our org. It’s not enough to have strong opinions on “right” ways to operate, we need to show results and we need accurate predictive models for those results. In the face of impending layoffs or fierce competition we need a financial model that lets us know what we can do to avoid failing our people.

My most useful staffing model is also my simplest[4]: The product velocity at any given moment is a function of the staffing on Product Engineering multiplied by the speed of development.

The development speed is thrust divided by drag, just like an airplane’s windspeed: Big engines are no match for having a fuselage in the shape of, say, a cube.

And that thrust/drag ratio that leads to the development speed is itself a simple function of Domain Engineering staffing multiplied by how long Domain Engineering has been allowed to make progress.

This is where we find our superlinear investments. We can stop asking “how much staffing is best?” and start with a far more interesting question: “What’s the timeframe in which we need to maximize overall product output?”

If we have something like a Shipped Potential[5] chart we can model the velocity, which we expect will go up as development drag goes down.

And with a timeframe to target we can aim to optimize the total delivered value within that timeframe. This is the integral of the velocity during that window.

Knowing how long we have to maximize our velocity lets us staff Product Engineering and Domain Engineering appropriately. It’s not an exact science but it’s far, far better than just piling more engineers into Product Engineering and Infrastructure, watching the former get ever slower and watching the latter contribute nothing to revenue.
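The model above can be sketched in a few lines of code. Everything here is illustrative: the function names, the constants, and the specific functional forms (linear drag reduction, a drag floor) are assumptions I've chosen to make the shape of the argument concrete, not figures from the book.

```python
# Toy version of the staffing model: velocity = Product Engineering
# staffing x development speed, where speed is thrust/drag and drag
# falls as Domain Engineering makes progress. All constants are
# illustrative assumptions.

def dev_speed(domain_staffing: float, months_of_progress: float) -> float:
    """Thrust/drag ratio: improves with Domain Engineering headcount
    and with how long that team has been allowed to make progress."""
    thrust = 10.0                  # assumed constant "engine size"
    baseline_drag = 10.0           # assumed starting drag
    drag_removed = domain_staffing * months_of_progress * 0.1
    drag = max(baseline_drag - drag_removed, 1.0)  # drag never reaches zero
    return thrust / drag

def product_velocity(product_staffing: float, domain_staffing: float,
                     month: float) -> float:
    """Velocity at a given moment = Product staffing x development speed."""
    return product_staffing * dev_speed(domain_staffing, month)

def total_delivered_value(product_staffing: float, domain_staffing: float,
                          horizon_months: int) -> float:
    """Integral of velocity over the target window (monthly Riemann sum)."""
    return sum(product_velocity(product_staffing, domain_staffing, m)
               for m in range(horizon_months))

# Same total headcount, two allocations, six-month horizon:
short = total_delivered_value(product_staffing=50, domain_staffing=0,
                              horizon_months=6)
rebalanced = total_delivered_value(product_staffing=40, domain_staffing=10,
                                   horizon_months=6)
```

Under these assumptions the rebalanced org delivers more over six months even though it has fewer Product engineers, while over a one-month horizon the pure-Product allocation wins — which is exactly why "what's the timeframe?" is the interesting question rather than "how much staffing?"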


  1. Infrastructure might be called ‘Platform’, ‘SRE’, ‘DevOps’, ‘Cloud’, ‘Shared Services’, ‘Foundation’, ‘Systems’, ‘Core’, or countless other names ↩︎

  2. Domain Engineering is a useful, existing industry term. This layer is often called ‘platform’ but so is Infrastructure Engineering. And 3rd-party platforms. A lot of things are called ‘platform’ and many platform teams fail to deliver business value by falling victim to Infrastructure Gravity, so here I avoid that oversubscribed term. ↩︎

  3. This is why it’s so unhelpful to use other companies’ Product/Infra staffing ratios. The sophistication of our Product Engineering teams (and the pace at which we drive their roadmaps) determines how much they leave undone underneath their features. ↩︎

  4. “A simple story is not the same as a simplistic one” — Krugman, https://slate.com/business/1997/01/the-accidental-theorist.html ↩︎

  5. Shipped Potential is a zero-bullshit method of tracking engineering velocity developed in the first few chapters of Executive Engineering ↩︎

Explore, Experiment, Invest https://jackdanger.com/explore-experiment-invest/ Sat, 07 Oct 2023 00:00:00 +0000 https://jackdanger.com/explore-experiment-invest/ The history of Silicon Valley is strewn with failed products that, in hindsight, were built with unwarranted confidence. Some of these products were simply before their time (Apple Newton, Webvan, Pets.com) while others were chosen based on deeply flawed information (Google+, Juicero, Facebook’s Metaverse, everything about cryptocurrency). It’s tempting to praise technologically impressive yet poorly-timed products but when a company with finite capital is early or late to market it’s just as bad as being completely wrong.

Every healthy product initiative develops through three distinct stages. Confusing them is, in my experience, the primary source of product failures. Getting them right gives you the best chance to build something of value within your budget.

These three stages always happen in order, and skipping any one of them means your strategy now relies on luck.

In case you can’t tell from this whole book, I love working at startups. There’s something magical about creating something from absolutely nothing. A team at a startup solves problems by inventing solutions quickly, one of the more thrilling things one can do in a creative field.

Sometimes the team invents just the right thing, succeeds at selling it, and everything goes right for them. They explore their market carefully, diagnosing potential customer problems clearly and correctly. They run exactly as many tests on their new product as they have financial runway to perform. They rapidly iterate through the phases of data-gathering, hypothesis testing, and then delivery. They don’t get beaten by a competitor, the market does what they expect, and they aren’t surprised by the difficulty of the implementation.

More often the whole endeavor fails miserably. Even when they do everything right it’s hard to win at a startup. And, if we’re honest, most strategies and the implementations of them are quite poor.

The process of Explore, Experiment, Invest maximizes the chances a product – big or small – will succeed. It’s a particularly good process for a team to use when constructing their first product, when capital is most scarce. Yet I find that the first product is the one time where many founders apply decent rigor. It’s the second product that they tend to get sloppy with. Particularly if the second product serves primarily to meet the CEO’s personal hopes instead of diagnosed users’ needs.

Companies that succeed with their first product because of market timing or pure luck are particularly susceptible to skipping steps in Explore, Experiment, Invest. If your product gained rapid adoption immediately you might assume that the next product you create will do the same. It’s like they say in Vegas about gambling addiction: The worst thing that can happen to you is to win big early. You’ll spend the rest of your life trying to recreate that initial experience.

In the same way, engineering leaders who’ve mostly been at high-growth companies may assume that products mostly just work out. That headcount always grows. That the hard part of the job is onboarding and organizing all these people. But once the money dries up they need to immediately figure out which of their ongoing projects are showing any actual promise in the marketplace. As soon as budget matters, it’s important to know whether the thing you’re building is moving forward.

We must be rigorous when we craft features and products that we expect strangers to value. We can’t guarantee our own success but we can maximize the chances of it by simply being honest with ourselves about what we do and don’t know.

Every feature under development is somewhere along the path of exploring, experimenting, and investing. Each stage offers a different technical and shipping context for the engineers who work on it as well as the product leaders who are guiding it.

Let’s go through those three stages of the process so you can identify when you’re in one or when you’ve accidentally skipped past it.

Explore – Step #1

We’re in the Explore phase whenever we’ve identified a problem but have no evidence that any specific solution will work. Here, the goal is to get as much data as possible as cheaply as possible so we can form hypotheses. This might include creating a proof of concept or a working prototype, it might be asking colleagues what they think or what they’ve seen before, and it might be analyzing data available to us.

The goal here is not to build a solution, but to form coherent hypotheses. To go from “I wonder what might work?” to “I believe this specific thing might work.”

This is where startups exist by default.

How to know if you’re in the Explore phase?

The outputs of a successful exploration are user research findings, technical prototypes, synthetic market tests (like running an ad campaign that points to an email-harvesting page), and letters of intent from prospective business development contacts.

Without these it’s difficult to make meaningful hypotheses in the form of “we can build feature X and then sell it to our users via this sales/marketing channel.”

If a statement in that form feels unsupported by any data at your company, consider halting development plans and continuing Explore work until that data exists.

How companies try to skip the Explore phase

It’s so easy to skip this stage. You may hear a colleague (or yourself) say “Let’s run an experiment to see if users will give us their mailing addresses”. That’s not an experiment, it’s an activity. It’s merely exploring what happens when you build an address-collecting form. What’s the decision that happens as a result of, say, 10% of users submitting addresses? 50%? 99%?

The key difference between an exploration and an experiment is that explorations cannot fail. They’re an attempt to get data. Even one that results in zero data is a successful exploration if it can dissuade the team from further work that would have been useless.

If this were an actual experiment it would sound like “Let’s test whether asking for mailing addresses, sending printed materials to our users, and then trying to re-engage them online results in net increased revenue.” The answer to this is either “yes”, “no”, or “we screwed up the experiment and here’s a meeting to figure out why.”

Experiment – Step #2

Once we have enough data to make credible proposals for a solution, we try to prove or disprove whether that solution is viable. We do this by creating low-cost experiments. Hopefully, our assumptions are right and we can continue to work on the experiment until it turns into a production-grade solution.

But if our assumptions are wrong, we’ll have avoided spending orders of magnitude too much work on a product or solution that was under-informed.

Note: At some companies the word ’experiment’ has a very narrow meaning, typically relating to growth experiments. The part of the product that tries to convert users undergoes frequent testing in UI variations, mostly in email marketing and landing pages.

That same experimentation may not happen across the rest of the product, but broader experimentation is the only way to know if the overall market actually exists, not just whether we can convert sales within it.

A good experiment takes the form of a falsifiable hypothesis. It’s a statement about what you think might be true, phrased clearly enough that it’s possible for it to be false.

Well-articulated experiments sound like this:

  • A user is more likely to re-engage within 2 months if we reach out to them daily than if we do it weekly
  • We can charge 50% higher prices for our current enterprise customers with low enough churn to make this a net increase in revenue
  • We can increase home page conversions by more than 10% if the pages perform twice as fast on mobile as they do today
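Part of what makes these hypotheses falsifiable is that they can be checked mechanically once the data is in. As one hedged sketch, here's how the first hypothesis (daily vs. weekly outreach, re-engagement within 2 months) could be evaluated with a one-sided two-proportion z-test. The sample counts and the 0.05 threshold are hypothetical, and a real analysis would also pre-register sample sizes and statistical power before running the test.

```python
# Evaluating "daily outreach re-engages more users than weekly outreach
# within 2 months" as a falsifiable check. Counts are made up.
from math import sqrt, erf

def two_proportion_p_value(x_a: int, n_a: int, x_b: int, n_b: int) -> float:
    """One-sided p-value that group A's rate exceeds group B's,
    using a pooled two-proportion z-test with a normal approximation."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))  # upper normal tail

# Hypothetical results: daily cohort 130/1000 re-engaged, weekly 100/1000
p = two_proportion_p_value(130, 1000, 100, 1000)
hypothesis_supported = p < 0.05  # crucially, this can come out False
```

The point is the shape of the last line: the experiment ends with a boolean that can go either way, which is exactly what the vague versions below can never produce.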

To better understand the components of a good experimental thesis let’s first look at what makes a bad one. In their book “Superforecasting” Philip E. Tetlock and Dan Gardner summarize the cognitive trap of believing in one’s hypothesis:

[Scientists] know that no matter how tempting it is to anoint a pet hypothesis as The Truth, alternative explanations must get a hearing. And they must seriously consider the possibility that their initial hunch is wrong. In fact, in science, the best evidence that a hypothesis is true is often an experiment designed to prove the hypothesis is false, but which fails to do so. Scientists must be able to answer the question “What would convince me I am wrong?” If they can’t, it’s a sign they have grown too attached to their beliefs.

Tetlock, Philip E.; Gardner, Dan. Superforecasting (p. 38). Crown. Kindle Edition.

A bad experiment is anything that can’t dissuade us from our beliefs. For example, let’s imagine we removed some of the success criteria from the examples above:

“A user is more likely to reengage if we reach out to them daily than monthly”

Note the lack of time frame for the reengagement. The more someone believes this the longer they’ll ask to run the experiment, running down the clock on your company’s runway and using up time that could be spent on other experiments.

“We can charge higher prices for our current enterprise customers without losing many”

What is “higher”? What is “many”? The inverse of this sentence is also probably true because it’s so vague.

“We can increase home page conversions by more than 10% if the pages are much faster”

How much faster? And by when, compared to what baseline? This could be an infinite workstream with no material gains.

The point of an experiment is to find out if our beliefs are incorrect before using them for financial or staffing investments.

How companies try to skip the Experiment phase

Bypassing this phase is super common for companies looking to create their second product. If they have a successful flagship product they may have spent so long enjoying that success that it’s hard to remember the enormous likelihood of failure for even very well-made products.

Failure Mode: The product that meets only the CEO’s needs

There’s a specific, very dangerous form of this that I’ve seen several times: A CEO staffs a new product initiative, declaring that it will be a success.

That might sound ludicrous but it’s quite common. Maybe the CEO failed to satisfy questions at a board meeting about how the company will expand into new markets. Maybe the CEO is bored with the status quo and wants to feel the energy they remember from when the company was younger. For whatever reason, the CEO (or another feedback-immune leader) staffs a team to work on a solution. The problem they’re solving is “The CEO wants this product built.”

The team working on this pet initiative will be in the Invest stage right from the start. They need to figure out what market they’re in – and if that market even exists – while also working toward a specific implementation. Not only is this super failure prone but it’s extremely stressful for the team. The whole company is watching the CEO cheerlead you while you struggle against a harsh reality: Customers pay money to solve their needs, not your CEO’s.

Depending on the length of runway given to this team it may take the company months, years, or decades to recognize the team’s struggles as a sign of a failed product. The budget and runway will exist for as long as senior leadership wants the product to succeed. And the more they want it to succeed, the less open they’ll be to critical feedback.

At the time of this writing both Jeff Bezos’ Alexa and Mark Zuckerberg’s Metaverse have finally felt their leashes go taut as reality caught up to an overfunded, unrealistic, CEO-driven product strategy that bypassed the critical first two steps.

Failure Mode: The Startup Within a Startup

I should mention, considering the audience of this book, that one of the tempting paths toward an executive title is to join a big, slow company that’s trying to inject new life into itself. They may offer you a VP of Engineering title and ask you to lead a startup within the company – getting the flexibility of a startup with the resources of the larger firm. Setting aside that that’s never how it works (you get the red tape of the larger firm and too much scrutiny to make necessary mistakes), by the time they’re talking to you they likely have both a problem they’ve identified and a solution they have in mind.

You’ll need to determine whether they have skipped steps in this Explore, Experiment, Invest process. Have they imagined a solution they like that you might spend years building only to have it flop miserably? Or do they have good research, a healthy sense of the market, and some hypotheses that they’ve been able to prove or disprove?

I recommend you don’t accept jobs like this. But if you do, make sure you negotiate for the autonomy and resources necessary to scrap the entire project and reimagine a solution to the identified problem. Because if the investment thesis isn’t bulletproof then success in your go-to-market will depend on luck.

Invest – Step #3

This is the fun part. If we’ve completed enough exploration to form hypotheses and we’ve completed enough experimentation to know which of those hypotheses we believe then we can make a real plan.

The goal in this phase is to be clear about what resources we expect to pay (usually in staffing and time) and what we expect to get back, by when. The clearer we can be about this the easier it’ll be to see where we still have holes in our information. We’ll never have perfect knowledge, but by identifying the gaps we can lean away from depending on those areas too much.

A good investment thesis can be quite high level, if we trust the underlying knowledge. It can serve as a one-liner that you reference in every all-hands meeting.

Good investment theses look something like this:

  • The merchant team will spend the next 2 quarters improving API completeness and we expect 3rd-party signups to more than double over the following year
  • We’re building a white-label version of our product that we expect to sell within a year to most of the 600 customers who signed a letter of intent
  • We’re pivoting to a B2C model where we expect at least 10,000 users subscribed in year one at a $12/month price

These statements don’t mention how or why the company has chosen this path but there should be plentiful data from the Explore and Experiment work you can share with curious colleagues. But you don’t need to explain all that every time you state the investment – you can trust yourself and your team when you say you’re doing X and you expect Y to happen.

How companies try to skip the Invest phase

A tragic amount of work at early-stage startups is mere activity. Many of the major initiatives I’ve seen at the Seed and Series A companies I’ve advised have been in the form of “we built this because it’s cool and people will love it.” That worked out for Instagram, but I personally recommend more rigor.

Conclusion With a Caveat

None of the above should be taken as an excuse to halt development and wait for perfect confidence. If you find yourself or a colleague providing consistent pushback on any work in the name of “we don’t know enough yet” there is likely a relational problem that needs to be addressed. Not because there’s a right way to apply these tools but precisely the opposite: There’s no correct amount of confidence to reach before moving forward. What you, your CEO, and your peers need to agree on is how much uncertainty is acceptable. How much product risk are you willing to accept as a tradeoff for the risk of waiting longer, of getting more certainty?

There’s no right answer. But if you’re aligned with your team – and you’ve gone through at least some of the work in each of these phases – then you’re likely doing better than your competitors. Which, in a startup, is often all that matters.

You must not measure individual software engineer performance https://jackdanger.com/writing/you-must-not-measure-individual-software-engineer-performance/ Thu, 07 Oct 2021 00:00:00 +0000 https://jackdanger.com/writing/you-must-not-measure-individual-software-engineer-performance/ It’s difficult to explain the nature of software engineering work to stakeholders who don’t work in engineering. We compensate salespeople, for example, partly with commissions that reflect their individual performance. That’s because someone in sales is able to individually improve some small metric for the company. And anyone working in operations or customer support can be judged at least in part by the number of issues they’ve handled and how satisfied the company is with the outcomes of those issues.

“The computers do direct work like a sales or operations team. The work of engineers is much more indirect, like that of the leadership of the sales or operations team.”

Effective engineering is a collaborative and creative discipline, where the team is the performative unit rather than the individual, and the performance cannot be measured by the team’s actions but rather by the outcomes of the system that the team produces.

Like the crew in any heist film, a software engineering team comprises multiple individuals with relative strengths and weaknesses that complement each other. One engineer might love to make new features while another might be entirely consumed by improving the quality of the team’s existing software. Either one working alone, given enough time, would eventually become entirely useless to the company. Together, as a team, they can produce value while lowering cost indefinitely.

Attempting to reward or discipline either of these two individuals would throw the team itself off balance. Incentivize new feature addition and the quarter-over-quarter feature development speed will consistently slow. Incentivize improvements and no feature will ever again launch. But what if we reward them, individually, for their specific and different contributions? Yes, we can do that and all we’ll need is to create a metric for the engineer who likes to build features and another metric for the one who fixes existing features. Then measure each engineer by the appropriate metric.

But what are the units of that measurement? How do you compare six clumsy features launched in a month against one extremely hard to find bug that took all month to discover and required only a one-line change? What if an engineer did neither of these activities but mentored their colleagues?

It’s impossible to say one person performs better than another without a shared measurement. And, while the outputs of these two hypothetical (yet all too real) engineers cannot be compared, their daily activities can be. Number of changes to a code base. Number of lines added or removed. Time spent in the office. Number of messages sent to colleagues. Number of bugs noticed by a QA team.

I know of companies that have, at one point, used all of the above to measure the ‘performance’ of their software developers. And, in each case, the company got a lot of useless (or harmful) activity and enormous employee attrition.

So, if we can’t compare engineers against each other without incentivizing the wrong behavior, we must accept that their success or failure is inherently collaborative. Like a surgical operating team in a hospital whose success is tied to the patient’s. Or a unit of firefighters who enter a hot building together.

One would naturally, then, try to measure the performance of the team instead of the individual.

But software is a unique art: It’s the encoding of human decision-making into machines. The performance that matters isn’t that of the team that builds the software but of the computers themselves.

Imagine if the surgical operating team in a hospital were a robotic AI. Or the firefighters entering a 4-alarm blaze were fire-retardant autonomous robots. We would reward the engineers who created these tools for the outcomes of the surgeries and the lives saved from the fires. No one would ever care if the engineers worked late or took extra vacation — what matters is the outcomes.

All software engineering teams produce these same outcomes, and always indirectly. The way the engineers manage their software is much like how a sales executive manages their team. Throughput, latency, and quality can all be measured and charted over time. A sales leader can identify problem areas and intervene when an individual account executive underperforms. Engineers do the same thing but call it ‘debugging’.

This inherent indirection — that the engineers produce the system that produces value — is what makes software engineering a creative discipline.

Like a gardener thoughtfully removing a blight, any engineer at any time can increase the value of the company in unexpected ways. Or, like a gardener rewarded only for the sheer number of weeds removed, we can task them with goals that produce huge activity while the company gets outpaced by competitors with more creative teams.

The solution is to define what, precisely, you want from your engineering teams. It’s never sheer activity. And it’s likely not blindly implementing whatever they’re asked to build because that’s how consultancies operate. And a consultancy only makes money because it doesn’t have to deal with the inevitable failure of most of the things they build for poorly-informed clients.

What you want from your engineering team is to produce the maximum number of market-impactful launches. It’s on your product leadership to effectively measure the market and maximize what that impact is. But it’s up to your engineering leadership to maximize the sheer number of launches that your product function believes could be useful.

Define those top-level metrics, install competent leaders with an unyielding devotion to coaching and developing their people, and watch as your engineers work together to make something impossibly beautiful.

Whatever you do, don’t create metrics to measure the performance of your individual engineers.


Originally published on Medium

A reference guide for fintech & small-data engineering https://jackdanger.com/writing/reference-guide-for-fintech-smalldata-engineering/ Wed, 06 Mar 2019 22:09:48 +0000 https://jackdanger.com/writing/reference-guide-for-fintech-smalldata-engineering/ I spent 2018 and the beginning of 2019 leading the Infrastructure teams at Gusto. Before moving on I composed a message to help future Gusto engineers — particularly those who are still in the first half-dozen years of their career and acting as technical leads — understand secure, domain-driven product engineering. This post is an adaptation of that message with some company details elided.

Most software engineering literature assumes you need to process data efficiently. The algorithm-focused, Google-style CS questions that dominate software interviews put particular focus on this — we test candidates on code that works well when data size increases.

In my time leading engineering teams at Square and Gusto I’ve found that this big-data approach to software engineering is a poor fit for many product companies. Rather, product scalability problems are along a different axis: Sprawling domains and massive schemas implementing those domains. These companies face three very specific, very hard problems:

  • Correctly modeling the domain
  • Creating clear interfaces between parts of the system
  • Finding ways to continually accelerate product development speed

Secure, small-data product engineering

This problem isn’t unique to Gusto or Square — there’s a huge set of companies that have small amounts of critical data in complex domains. It’s every fintech company, every product company with a heavy operations arm, and any company trying to model an existing regulatory system.

Here’s a highly inaccurate but illustrative plot of how companies fall on this tradeoff of concerns:

If we’re building Twitter or Instagram we’ll need to process exabytes of simple information. If we’re building Flexport or Gusto or even Square¹ we’ll be hard-pressed to even find terabytes of information. The scalability challenges are still very real, they’re just along the axes of domain complexity and security.

Welcome to small-data product engineering

If you’re at one of these small-data companies you can’t mitigate the core scalability problems by optimizing source code. Your company probably has enough folks working at that low, focused level. They need people to zoom out and address the whole system.

The three levels of engineering sophistication

There are three levels of sophistication an engineer grows through in their career:

  1. Working with code
  2. Working with applications
  3. Working with systems of applications

#1 Working with code

When we start our career we work with logic in a single method, function, or class. The skills at this level are extremely basic: Editing source code, expressing basic concepts in the given language, storing basic data, etc. This is where features are made.

#2 Working with applications

When we become senior we can work with a whole application at a time. We store lots of relational data and model complex business logic across many files and directories. This is where products are made.

#3 Working with systems of applications

And then we become grizzled veterans working with whole constellations of applications in multiple environments. At this level we perform major migrations, simplify dependencies within the distributed data model, and shepherd the whole multi-application system. This is where companies are made.

The way we work at each level is quite different. The phrase ‘refactoring’ usually applies to the level of source code and, less commonly, at the level of an application. The phrase ‘tech debt’ is more common for working with an application as we handle major framework upgrades and make the data model more expressive.

But at the level of systems both ‘refactoring’ and ‘tech debt’ are inappropriate concepts. At this level of the system there are only two activities you should permit yourself to do:

  • Bring a system in line with its design
  • Design a replacement system

Anything else is busywork.

Both of those activities assume that the thing we’re building is designed. For this reason, if you’re a senior engineer your job is now primarily that of an engineering designer. You should design and help implement software systems that solve the Product team’s current and future needs. If you imagine your job is just the implementation you’ll miss the design step and the overall system won’t make sense.

Levels of PM sophistication

Speaking of Product, these same levels of sophistication apply to PMs. Early-career PMs can launch and steward features. This is valuable and helpful, though at this level the PM can introduce tension between product and engineering. Shipping a single feature quickly is just about never the best thing for the long term health of the interconnected product suite.

Mid-career PMs can launch and steward whole products. This is hugely valuable and here the tension between product and engineering is eased. The PM takes a long-term view of the whole feature-shipping cadence and can work with engineers to make calculated investments to increase shipping speed.

Late-career PMs can launch and steward entire product suites. At this level the PM takes responsibility for an entire ecosystem of engineering and product. They are natural allies to engineers and will often push for systemic simplifications and increased coherence between parts of the product suite — something that allows engineers to make dramatic simplifications to the underlying system.

Growing technology at a product company

If your product’s production dataset is smaller than a terabyte then you have no data. None. 100% of your data can fit on two iPhones. The solutions from Google, LinkedIn, and other big companies are useless here. Your peers are Square, Flexport, Gusto, Stripe, Coinbase, and any place with huge amounts of logic but small, critically important pieces of data.

Your problems give you four focus areas that are more important to you than to the average company:

  • Prioritizing domain model correctness over computational efficiency
  • Refactoring the data model early and often
  • Ensuring data is encrypted by default everywhere except in your product’s memory space
  • Separating data by sensitivity (PHI, PII, PCI) to empower your Security team

And straight-up disregard any blog post from a company that has any of the following:

  • Lots of data
  • A ridiculous amount of money
  • Low security restrictions

Twitter can lose a tweet. LinkedIn can lose a ‘like’ on a post. Square can’t lose a payment. Gusto can’t lose an IRS filing.

So your problems are different. Instead of using something that works well when you process trillions of records per minute you need to use tools that are simple to operate and that support encryption powered by role-based access control (i.e. if service A produces data then service B can only decrypt A’s data if it’s allowed to).
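As a sketch of that last property, role-based decryption can be as simple as a key service that consults an ACL before returning plaintext. This is an illustrative Python sketch only; all the names are invented, and real decryption would be delegated to a KMS rather than the stand-in used here.

```python
from dataclasses import dataclass


@dataclass
class EncryptedRecord:
    producer: str      # service that wrote (and encrypted) this data
    ciphertext: bytes


class KeyService:
    """Holds the ACL: which consumer services may read which producer's data."""

    def __init__(self):
        self._acl = {}  # producer -> set of allowed consumers

    def allow(self, producer, consumer):
        self._acl.setdefault(producer, set()).add(consumer)

    def decrypt_for(self, consumer, record):
        # Service B can only decrypt A's data if it's been explicitly allowed.
        if consumer not in self._acl.get(record.producer, set()):
            raise PermissionError(f"{consumer} may not read {record.producer} data")
        return self._decrypt(record.ciphertext)

    def _decrypt(self, ciphertext):
        # Stand-in for a real KMS-backed decryption call.
        return ciphertext[::-1]
```

The point of the sketch is where the check lives: the service holding the keys refuses to decrypt for unauthorized callers, so no consumer can bypass the ACL by holding ciphertext.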

Other small-data, security-conscious companies have solved these problems. But as of yet none of their in-house tools for securely handling data in prod have been open sourced[2]. So you’ll have to just build something very simple and design it carefully.

How to select technologies

Which brings us to the rule for when to build technology in-house or use something that exists externally:

“If a problem is unique to your product offering then you may invent an appropriate solution. Otherwise go with an industry standard approach — and only use one approach at a time.”

Invent software that extends your competitive business advantage. And for everything else just use off-the-shelf tools. Your deployment system should look eerily similar to one of the top ten results for a generic search like “AWS deployment EC2”. Your cloud networking setup should look identical to the AWS official standardized architecture for PCI compliant systems. Your database tables should use the third normal form.

Performing Migrations

Note that last phrase in the above rule: “and only use one approach at a time.” You’ll need a way to move from one version to another. This means you and your team need to get good at performing migrations.

Whether you’re making a tiny change to a database column or overhauling a major system all good migrations have five stages:

  1. Design the v2 and the path to get there
  2. Fix v1
  3. Implement v2
  4. Carefully migrate from v1 to v2 in atomic, safe steps
  5. Delete v1

Most teams forget steps #2 and #5. If v1 is poorly factored you’ll inevitably have to make significant changes to v1 before it’s in a shape to be migrated. Since we need to do that work anyway we might as well stop pretending v2 will rid us of v1. Fix v1 in place before you start building v2.

And don’t accept any migration that doesn’t have, as its last step, an act like “drop these tables” or “delete this code”. If you can’t actually replace the current thing don’t bother with the migration. Do it right or don’t do it.
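As a concrete illustration of stage 4, a column rename can be migrated in atomic, safe steps: add the new column, dual-write, backfill, switch reads, then drop the old column. Here’s a hedged Python sketch of just the backfill step, batched so it’s safe to interrupt and resume. The table, the column names, and the execute() callable are hypothetical stand-ins for your database layer.

```python
def backfill(execute, batch_size=1000):
    """Copy users.fullname into users.full_name in small, resumable batches.

    execute(sql, limit) is assumed to run the statement and return the
    number of rows it updated. Because each batch only touches rows where
    full_name IS NULL, the job can crash and restart without losing work.
    """
    total = 0
    while True:
        updated = execute(
            "UPDATE users SET full_name = fullname "
            "WHERE full_name IS NULL LIMIT %s",
            batch_size,
        )
        total += updated
        if updated < batch_size:  # nothing (or almost nothing) left to copy
            return total
```

Once the backfill reports zero remaining rows and reads have moved to the new column, stage 5 is the part most teams skip: actually dropping users.fullname.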

What teams should ‘own’

Some companies allow teams to carve out juicy projects for themselves and prevent other folks from working on them. This kind of technical bureaucracy cuts through engineering morale like a hot knife through butter. Teams shouldn’t claim projects, they should claim problems. And they should accept as much help from any direction when solving those problems.

Don’t say “We’re going to roll out a new frontend technology” when you can say “We’re going to solve that adding new storage classes in the frontend is hard”. Then you can have an open debate across all interested stakeholders in how that problem should be solved. Once there’s agreement you start work and you gratefully accept commits from everybody.

No architecture astronautics

As Alyssa Henry (founder of AWS S3, Head of Engineering at Square) sometimes says: “The ideal system in your head is never as good as the working system in front of you.” Beware of architecture astronauts. And be even more wary of becoming one.

You can identify architecture astronautics when someone (including yourself) starts to believe that there’s a technological solution to poor design.

If I create a React component that mixes a signup form, account approval, and uploading a user’s avatar all in one Javascript function that’s just poor design. I cannot fix this problem by porting it to a new Javascript framework.

If I create an HTTP endpoint that performs five separate operations and also calls out to a different service multiple times — that’s poor design. That cannot be solved by introducing new service boundary technology.

The work of building long-term value is really hard but really simple: Decide what it is you’re actually trying to build and then implement that in the simplest way possible.

How to prioritize systemic fixes

Imagine putting sand into a jar, then pebbles, then big rocks. The big rocks don’t fit.

We have to do the big rocks first. When the right solution to a problem takes 10 engineers for a year then put 10 engineers on it for a year and make sure it’s solved well. And if it takes longer let it take longer. Don’t put 3 engineers on it for a quarter. That just adds sand to the jar. Small wins are great and easy to prioritize but at some point you need more in the jar than just sand.

Building a product platform for your customers

See: The four interfaces of SaaS product suites

If you have any customers then you’ve already solved one of the hardest problems. Now you can create and sell other products to your existing customer base — massively increasing the value of your company and your advantage over competitors.

Unfortunately, you may have built your entire engineering system to support just your first product.

A service class is an encapsulation of logic that separates the ‘what’ of its purpose from the ‘how’ of its implementation. Building a platform that allows you to rapidly compose new products for your existing customers is merely the small matter of creating interfaces that perform valuable product operations. Imagine a hundred single-purpose APIs in your system that allow you to move money, sign up customers, list resources belonging to those customers, etc., all encapsulated such that you can use each one in isolation.

Creating those APIs will allow product engineers to build products quickly on top of them, while folks working on the underlying platform carefully move data around behind the APIs. The two tracks of work don’t need to be coordinated — the API gives us a decoupling.
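A sketch of what one of those single-purpose APIs might look like, assuming Python and entirely invented names: the interface states the ‘what’ (move money) and product code composes it without ever touching the ‘how’ behind it.

```python
from typing import Protocol


class MoveMoney(Protocol):
    """The 'what': a single-purpose API with no ledger details in sight."""

    def __call__(self, from_account: str, to_account: str, cents: int) -> str:
        """Moves money and returns a transfer id."""
        ...


def build_pay_invoice(move_money: MoveMoney):
    """Product code composes APIs; it never reaches into the underlying tables."""

    def pay_invoice(payer: str, payee: str, cents: int) -> str:
        return move_money(payer, payee, cents)

    return pay_invoice
```

Because product code depends only on the MoveMoney interface, the platform team can move data around behind it without coordinating each change with every product team.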

You can move from product teams reaching directly into shared code and data to products composed on top of those single-purpose APIs.

You’ll know you’re on the right track if your product engineering teams spend less time coordinating with whatever teams work on the internal platform. An interface replaces enormous amounts of tightly-coupled planning.

How to think about security

Now that you’re working at the level of a whole system, let’s address security. No matter what engineering roles you’ve had you’ll do great security engineering work if you keep this one rule in mind:

Simplicity, correctness, security, scalability, and maintainability are all neighbors. When we’re far from one of them we’re far from all of them.

Twitter had trouble scaling in the early days. Nick Kallen popularized the idea that you can’t make something scale by adding “magic scaling sprinkles.” The same is true for correctness, maintainability, and security. You can’t bolt on correctness to a busted system. You can’t just “do security” for a quarter to prevent security breaches.

Security is, more than anything else, a result of a simple design implemented correctly. Something is either simple and obviously secure, simple and obviously insecure, or it’s too complex to ever make secure. Your security team can only help you after you’ve refactored your system into a thing that’s so obvious any new hire can immediately make sense of it.

If a company is valuable with small amounts of data then that data is likely very important. That means a data breach is worse for you than for other companies. Leaking sensitive data functionally ends your company. That’s bad, but the damage to your customers is even larger. Your data is people’s home addresses and names and bank accounts and personal preferences. A breach in your system means that data is now in the hands of the highest bidder. If you have git access to a system that might expose sensitive customer data it’s your professional responsibility to fix it. Don’t wait for your manager or CTO or PM to ‘allow’ you to fix it. Just fix it. These are people’s lives.

How to prioritize product work versus maintenance

There are two investment models for engineering:

  1. Creating new value
  2. Minimizing the cost of existing value

Product engineering uses #1, platform/foundation/infrastructure engineering uses #2. If there’s a single leadership team trying to prioritize projects from both groups they will always pick maximizing new value. They’ll do this until the cost of that value is so high that it wipes out the gains.

The way around this is to determine how fast you’d like to go now versus how fast you’d like to go in the future. And use that model to fund the percentage of people you put on increasing value versus minimizing cost. These need to be separate leadership teams, each focused on just one of those investment models.

For many companies it works out like this:

  • When you’re starting out you want 100% of people maximizing value because there may not be a tomorrow, much less a next year.
  • When you’ve reached product market fit it’s important to switch this up and see if you can lose a year of value creation to gain a year of cost reductions — earning yourself a step-function for future growth.
  • When you have a shot at becoming ecologically dominant in your industry it’s important to put nearly 100% of people on cost reductions until the competitive moat around you gets massive.
  • Then, if you’re ecologically dominant you can add huge new headcount to do value creation.

This is a slider you can play with at any time to match the risk model of the company. But being unaware of the tradeoffs means you’ll either reduce cost without producing value (see: 90% of infrastructure/tools startups) or increase value until you choke yourself and a leaner competitor comes through and steals all your victories (see: every company that Amazon has annihilated).

If you’re an engineer on the side of the org that maximizes value then one of your main jobs is deciding when some component is so important that its cost needs to be minimized. Then you have to productionize that component and hand it over to the other side of the org.

If you’re an engineer on the cost-reduction side you need to consolidate all the technologies and make the whole system hum along with zero maintenance. You also need to help product engineers upgrade their best stuff until it’s in a shape that’s maintainable.

How to extract services from a monolith

Service extraction is now an entire field of engineering. But the basic rule here is simple: You cannot run from your poor domain modeling. No matter what application you send your poorly-factored code to, the pain will only increase.

The problems with “let’s extract this into a microservice” are (1) there’s no such thing as a microservice and (2) you need to fix the data dependencies first or you’re in for a world of hurt.

That said, there are some simple patterns to follow that will help you as you extract things from a monolith into new applications (or parts of the monolith that are more separate).

Extract verbs, not nouns

You cannot extract User to a service. What would that service do? Just give you the data from the users table? Sounds like you’ve just moved a database table super far away from you. Have fun with that.

But if you extract the concept of, say, approving a user to a service then you’re simplifying things. The monolith used to know all the steps of how a user was approved. Now that you’ve extracted all that logic and the intermediate data to perform the approval, the monolith can just have an approved_at timestamp. If anyone wants more data on the approval they can go to your new UserApprovals service to look up the details.
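Here’s a hedged sketch of that shape, with invented names throughout: the monolith’s User keeps only approved_at, while the extracted service owns the approval steps and details.

```python
from datetime import datetime, timezone


class UserApprovalsClient:
    """Stand-in for an RPC client to the extracted UserApprovals service."""

    def __init__(self):
        self._approvals = {}  # the service owns all the intermediate data

    def approve(self, user_id, reviewed_by):
        when = datetime.now(timezone.utc)
        self._approvals[user_id] = {"reviewed_by": reviewed_by, "at": when}
        return when

    def details(self, user_id):
        # Anyone wanting more than a timestamp asks the service directly.
        return self._approvals[user_id]


class User:
    """Monolith model: knows *that* a user is approved, not *how*."""

    def __init__(self, user_id):
        self.id = user_id
        self.approved_at = None

    def approve(self, approvals, reviewed_by):
        self.approved_at = approvals.approve(self.id, reviewed_by)
```

The verb (approving) moved out; the noun (User) stayed home with a single timestamp.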

Learn in public and only do high-quality work.

I’d like to leave you with one last exhortation. The best engineers I’ve ever worked with have two qualities I’ve observed:

  • They never, ever do work that’s less than their best
  • They learn in public, never hiding their process or their mistakes.

I’ve noticed that each engineer I know works at roughly a fixed speed. No matter how many hours we clock in our careers the speed doesn’t really increase. And when we work at a higher level of quality the speed takes a hit, but only briefly. Then we’re back to full speed but we’re better.

Multiple times in my career I’ve made the decision to adopt a higher bar for my own work. Each time it slowed me for a bit but then I sped right back up. This, as far as I can tell, is how great software engineers get that way. Their brains aren’t different than yours or mine, they just have higher standards for themselves.

To that end:

  • Put as much effort into naming things as possible. Never cut corners on naming, it’s basically the whole job.
  • Always write code comments that are as expressive and clear as possible, including ASCII art drawings if necessary. When you edit any code, update the comments. This is more work but it’s higher value. Do the highest value work.
  • Don’t leave TODOs in code unless you’ve tracked a ticket and link to the ticket in the TODO.
  • Never fix a bug without fixing the underlying design flaw that allowed it to happen. Even if the real fix takes two weeks, do it. Any place that doesn’t let you fundamentally fix the system that you’re in charge of is a place you don’t want to work.
  • Make data migrations as freely as you make code changes. If someone looks at your database they should not be able to guess at what your code used to be.
  • Before you merge a PR ask yourself “If I had more time, could I do this better?”. Take the time, do it better. Anyone who’s frustrated by that can be frustrated. They’ll be very, very happy in 9 months when the software works well and you’ve grown a lot.

Many thanks to my dear friends at Gusto for my time there and for the opportunity to better articulate my approach to technical leadership.

[1] During the time of the Square IPO all of the data for the dozen-plus products still fit into a single MySQL database under one giant Rails app.

[2] Please tell me I’m wrong.


Originally published on Medium

The Four Interfaces of SaaS product suites
https://jackdanger.com/writing/the-four-interfaces-of-saas-product-suites/
Sun, 21 Oct 2018

There needs to be a handbook for migrating a monolithic application into a service-oriented SaaS product suite.

The same things keep going wrong at every company: Moving code without moving the data that code needs, language proliferation, conflating synchronous and asynchronous work, not anticipating the need for backfilling data, etc.

This post is a guide to help navigate just one of the problems your team will face: Picking the right technologies for your four different interfaces.

Four? Yeah, four. A SaaS product suite eventually implements four interfaces for four different kinds of clients:

  1. The one your customers use
  2. The ones your employees use to manage customer data
  3. What you offer to third-parties to extend the value of your products
  4. The internal server-to-server connections that power your SoA

But that’s not how it starts.

When you’re first building your system you probably have a single application. Customers connect directly to this and it’s… well it’s your whole product. This is the code that has access to all of your data so when you need to build an admin view of your system you build it more or less right there where the data is.

This works great for the v1 of your admin interface. You didn’t have to duplicate a bunch of work and you can operate your product.

When you need a 3rd-party API you might build it on that same application as the other two interfaces or maybe you’ll put it off to the side because you’re starting to decouple your growing monolith.

By then you’ll have several separate applications sitting behind those three interfaces.

Which is… well, it’s starting to get complicated. Managing all of those applications is hard. But what’s really going to bite you is that the authentication and authorization logic for at least two very different customer types is coupled with your legacy code in the original application. How do you provide an admin interface to these other applications that are showing up?

Well, you could add an admin interface to each application.

But this means you need to implement that authentication system in multiple places. And, depending on how you’ve built the admin UI, you may even need to ship the admin views in multiple applications.

Service-oriented architecture is notoriously hard to get right. Not because the technology is difficult to implement (it’s now a solved problem). SoA is tricky because when you have many places to put something it’s costly to have anything in the wrong place.

In a monolith putting some code in ‘./lib’ or ‘./models’ is mostly an aesthetic difference. But if you put some code in, say, the customer lookup service when it actually needs to be in the highly-available report-generating service you might have destabilized the whole system. If you’re moving to multiple services you’re not simplifying your system, you’re making it more complex and hoping it’s worth the tradeoff.

There is some core functionality in your product. It does something meaningful. If you conflate the different interfaces together you may have to implement that meaningful core functionality up to four times. But if you can clearly separate the authentication you can implement your core product value just once and provide windows to it through your authentication layers.

Put another way: if you can separate “who is this and which features can they use?” from the feature implementation then you can put all your effort into making the features themselves really good.

Four technologies for four interfaces

A common mistake when rolling out one of these interfaces is to reuse the technology from one of the others. Just because, say, HTTP + JSON is how you communicate with your customers doesn’t mean it’s the right choice for your internal service-to-service calls. And no matter what the other interfaces use your admin interface pretty much has to be a plain web app because that’s the cheapest thing to maintain.

Let’s review the needs of the different interfaces to help us pick a good technology for each one.

1. The first-party customer interface

The first API is the one you already have. No matter what your product does there is a frontend to your system that your customers use. This could be an HTTP-based HTML or JSON frontend. If you offer an email service maybe it’s over SMTP. If you’re Dropbox this is your own proprietary sync protocol. Whatever the transport system, your customers use your product through some kind of front door.

This API is one you control both sides of. You change server endpoints and client/display behavior in lockstep, creating a coherent user experience.

If you find your team calling this ‘the api’ you might be in trouble. There are three more to go.

A good technology for this is anything that your customers can use. This is where you sweat the UX until the customer experience is straight-up joyful.

The authentication layer for this needs to identify a single user and enforce that any data passing out of the system is data that is specific to that user.

In terms of actual implementation you have two choices:

  • You can have a highly-available frontend that performs the authentication and passes credentials along to your product code
  • You can have a highly-available authentication service that each service uses when responding to user requests.

In either case the authentication for a customer of your product is located in one place and separate from the actual features of your product.
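A minimal sketch of the first option, with invented names: a thin authentication layer resolves a token to a user id once, and product code only ever sees that verified identity and returns data scoped to it.

```python
# Stand-ins for a real session store and a real data store.
SESSIONS = {"tok-abc": "user-1"}                    # token -> user id
NOTES = {"user-1": ["hello"], "user-2": ["secret"]}  # per-user data


def authenticated(handler):
    """Frontend layer: verify the token, then hand product code an identity."""

    def wrapped(token):
        user_id = SESSIONS.get(token)
        if user_id is None:
            raise PermissionError("unauthenticated")
        return handler(user_id)  # product code sees identity, never tokens

    return wrapped


@authenticated
def list_notes(user_id):
    # Product feature: only this user's data can pass out of the system.
    return NOTES.get(user_id, [])
```

The feature code never inspects credentials; the authentication for a customer lives in one place, separate from the product itself.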

2. The Admin Interface

Most companies build a half-assed version of this at first to gain any visibility at all into the system. Over time the admin interface becomes generally useful: Engineers diagnose production data problems with it, operations teams build their core workflows into it, and it becomes the primary tool for customer support.

The authentication layer here is more robust and needs to integrate with your IT system. You need to ensure only people who are currently employed at your company can access this and you need to audit everything anyone does — because through this interface a user can access the data of any user in your system.

The most scalable, sustainable way to build an admin system like this is a web application separated from the product services by a routing layer. This routing layer both performs authentication and translates HTTP+JSON or whatever the new web hotness is into standard service-to-service calls.

3. Third-party REST API

Most products will, at some point, benefit from offering a third-party API. Your customers get more value from you when they can opt in to extensions of your product.

This API basically has to be RESTful HTTP. That’s the standard that most other developers understand and it’s the easiest to document. Tools like OpenAPI let you generate clients for it automatically in any language. This interface must have clear versioned releases (with old versions supported indefinitely) because you don’t have control over the clients.

The implementation will look an awful lot like the admin one but with a different permission model. The routing layer will need to check if the 3rd party making a call is allowed to access a specific user’s data. This requires a complex system of registering 3rd-party applications, generating tokens, allowing users to enable/disable individual apps for their account, and rate-limiting. All of which would really complicate your product if this kind of logic leaked into core features.
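To make that concrete, here’s a toy sketch of the routing-layer check, with invented names and an in-memory counter standing in for a real rate limiter: a call passes only if the user has enabled the app and the app is under its limit.

```python
from collections import defaultdict

GRANTS = {("app-7", "user-1")}  # user-1 enabled app-7 for their account
RATE_LIMIT = 2                  # calls allowed per app in this toy window
_calls = defaultdict(int)


def authorize_third_party(app_id, user_id):
    """Routing-layer check: grant present, and the app under its rate limit."""
    if (app_id, user_id) not in GRANTS:
        return False           # user never enabled this app
    _calls[app_id] += 1
    return _calls[app_id] <= RATE_LIMIT
```

All of this logic lives at the edge; none of it leaks into the core product features.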

4. The internal SoA

The fourth interface is fully internal — it’s the inter-process communication between server-side systems. You can use any technology for this but some are much friendlier to your engineers than others.

RPC: Verbs over Nouns

REST is all about nouns and it forces all operations to be reasonably simple. When you don’t know who’s using your API that’s a good fit. But when the other users of your API are your colleagues you need something more sophisticated.

An RPC (‘remote procedure call’) system allows your team to express their full intent, iterating on the API between services at the same cadence that they iterate on method and class signatures.

Structured schema

At some point your system becomes too large to run all of the different applications on a laptop. Before you reach that point you’ll benefit from implementing structured, strongly typed APIs between your internal services. You want to be able to automatically generate running mocks of dependent services (maintaining them by hand will fail) and you need to rely on the API type system in your unit tests to catch bugs. Something like Thrift or Protocol Buffers or Cap’n Proto will give you this natively.

Forwards- and backwards-compatible

You’ll make gradual changes to this API forever. How do you ensure that deploying one part of your system doesn’t introduce an API incompatibility between the new and the old code? How do you ensure that you can always read data encoded in a version of the API from last year with today’s API?

Tools (like the 3 mentioned above) ensure you can always read new data even in old clients (and vice versa) from any language.
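The key property those tools give you can be sketched in a few lines: an old client preserves fields it doesn’t understand instead of failing, so it can read new data and re-emit it without loss. This toy Python version uses invented field names and plain dicts in place of a real wire format.

```python
KNOWN_FIELDS = {"id", "amount_cents"}  # what this (older) client understands


def decode(payload):
    """Split a message into understood fields and preserved unknown ones."""
    known = {k: v for k, v in payload.items() if k in KNOWN_FIELDS}
    unknown = {k: v for k, v in payload.items() if k not in KNOWN_FIELDS}
    return known, unknown  # unknown fields are kept, never an error


def reencode(known, unknown):
    """Re-emit the message without losing fields this client never knew."""
    return {**known, **unknown}
```

Because unknown fields round-trip intact, a service running last year’s code can sit in the middle of a pipeline carrying this year’s messages.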

Mutual SSL and Access Control Lists

Authentication and authorization for your internal interfaces is very different than the other three. You still need to ask “who is performing this action?” but the ‘who’ here isn’t a person, it’s a service. So you need some way to identify services (authentication) and some way to permit them to do stuff (authorization).

Here there’s likely an ACL where specific services are allowed to use specific endpoints. Defining permissions at this level of granularity is tedious but it lets you easily verify that low-security parts of your system aren’t gaining access to higher-security parts and it provides the means to generate a dependency graph of your services.
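A sketch of such an ACL, with invented service and endpoint names: the same table that authorizes calls also yields the service dependency graph.

```python
# Which service (authenticated, e.g. via its mutual-TLS certificate)
# may call which endpoint. Tedious to write; easy to audit.
ACL = {
    ("reports", "ledger.read_entries"),
    ("billing", "ledger.read_entries"),
    ("billing", "ledger.write_entry"),
}


def authorize(caller, endpoint):
    """Reject any service-to-service call not explicitly permitted."""
    if (caller, endpoint) not in ACL:
        raise PermissionError(f"{caller} may not call {endpoint}")


def dependency_graph():
    """The ACL doubles as a map of which service depends on which."""
    graph = {}
    for caller, endpoint in ACL:
        graph.setdefault(caller, set()).add(endpoint.split(".")[0])
    return graph
```

The endpoint-level granularity is what lets you verify that low-security parts of the system never gain access to higher-security parts.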

Focusing on your product

When your company is big enough your interfaces become products in themselves. But even while you’re small keeping them isolated at the edges of your system means you can get them right and you can more easily prove that they’re secure.

It also frees your team up to focus on the actual product work. When the interfaces are defined and separated from the core features then everybody — your customers, your employees, your 3rd-party partners, and other products — benefits from every improvement you make.


Written on October 21, 2018 by Jack Danger.

Originally published on Medium

The Upside Down Org Chart
https://jackdanger.com/the-upside-down-org-chart.html
Sun, 01 Dec 2013

A corporate organizational chart is drawn as a tree. It’s got a CEO at the root and then splits out to executives and then keeps branching further until you get to the people doing the actual work. Weirdly, this tree is drawn with the root up in the sky and individual contributors (often drawn very small) at the bottom. We use words like “managing up” and “climbing the corporate ladder” and “the decision came from above” as a reflection of this mental image – CEO on top, people on bottom. The up/down contrast is important because in this structure the people on top have more power to make decisions, more interesting labor to perform, and higher compensation. Those at the top communicate commands and desires down to people at the bottom. The top jobs are in every way better.

“Working at BigCo® is so much fun!”
— BigCo® CEO

This structure is very old. It’s basically patriarchy and appears in armies, farms, and governments almost as far back as we have historical records[1]. This structure is traditionally quite static – wherever you start is where you stay. Nobody wants to give up a higher spot to someone else and certainly nobody wants to move down. Napoleon’s Grande Armée was one of the first organizations to attempt promotion based on merit and not class and we’ve adopted that compromise in modern corporations. But both in Napoleon’s case and ours “merit” is so vague as to be indistinguishable from class.

I know hundreds of developers who refuse to work at a large company because of this structure. Nobody wants their work to be compromised by constantly having to “manage up” and certainly nobody wants to hand the benefits of their work to somebody else. There’s a saying I keep hearing that goes “Nobody quits their job, they quit their boss.” And there are plenty of bad managers in tech – often a result of “promoting” a veteran engineer to people management as a retention strategy.

Unions try to solve the problem of bad managers and top-down negative pressure by pushing upwards through collective bargaining. Elon Musk famously claimed you don’t need unions if you just fire the assholes. But if you’ve got nice managers exerting negative pressure downwards[2] because of the very structure of the organization, then firing just the jerks won’t save you.

Some companies are experimenting with a flat management structure to solve this problem. GitHub, Stripe and Valve are famously self-organized. Morning Star is a non-tech company that has succeeded with the non-managerial approach as well. The theory is that if there are fewer people above you on the org chart then there is less negative pressure on you and you’ll be able to produce at the same high potential as if you were entirely self-directed. A self-managed company accepts increased chaos and takes a risk of potential non-productivity and in return gets the value of actually highly increased individual productivity. It’s as if every employee were the CEO of their own little company.

Yehuda isn’t a fan of the flat-structure approach.

Removing managers rescues us from TPS reports and micro-managing but we still need to form into teams to accomplish anything of note. We can let these teams occur organically but then we suffer from the realities of emergent behavior: Left alone long enough every self-organizing bay area software team will accidentally build a bartending quad-copter running Node. There will never emerge a QuickBooks replacement that’s tax-compliant in all 50 states and Puerto Rico. You’re going to get a product that meets the immediate, bodily-felt needs of its creators because there’s no structure in place to acquire information from customers, verify the accuracy of this information, and provide it in a usable form to the engineers.

There is a solution to bad managers that doesn’t require a flat company structure: good managers. And I don’t mean good people as managers (most bad managers are very friendly); I mean people who believe their job is to empower those they serve and make it easier for the individuals to function as a group. A good manager directs and augments the energies of a team without adding resistance to that energy. A good manager thinks their team is their boss. And a good manager expects the same support from their manager.

To get leaders like this you don’t just need to hire well, you need to reorganize your structure to illustrate the roles. If we want managers to be empowering we need to fix our org charts to be a proper tree: root on bottom, branches above, leaves on top.

This inverted[3] org chart effectively evens out the broken power dynamics of the patriarchal model. The language itself is fixed: “I rely on my manager.” “I support my team.” “Because we’re way up here the CEO needs us to tell them what we see.” “I’m not sure I want to take on the weight of another team – that’s a lot to support.”

There are no “promotions” here; you can move to positions with less organizational responsibility and more task responsibility or vice versa.

This also helps us fix compensation because if there’s a sense that someone near the root is actually holding up more of the company (and the people they serve feel that support) then it doesn’t feel unjust to pay them more.

For a tech company to describe their structure this way requires some humility from the leadership. It requires accepting that senior positions must be evaluated based on the support given to individuals on the team rather than the support given to the CEO or executives. But it makes the structure one in which nothing is extracted from the laborers – indeed it provides help that an individual could not find working alone.

If you want an engineer to work as slowly and sadly as possible, place them at a BigCo® where all the power is top-down (CEO-down) and tell them what to do. If you want them to produce something amazing, then place them in a team of people with all the resources of a supporting manager, a supporting corporate team, and total freedom to do their best work in the way they best see fit. You need only communicate the needs of the end customer clearly and then the work will get done.


  1. Patriarchy only goes back about 3,000 years and seems to have originated as a survival mechanism during times of trouble. http://en.wikipedia.org/wiki/Patriarchy#cite_ref-18 ↩︎

  2. This study asked 3 people to write a paper and put one person nominally in charge. When the group was given 5 cookies, the ‘boss’ would eat 2 cookies, not just 1, and would typically do it sloppily, with their mouth open, leaving crumbs on their shirt: http://psycnet.apa.org/journals/rev/110/2/265/. So the very nature of being a ‘boss’ causes good people to have a negative effect on those they boss. ↩︎

  3. Let’s be real, this isn’t a new idea. It’s ancient, well-loved, and just very rarely used in human organizations. ↩︎

Design Doc Template
https://jackdanger.com/design-doc-template/

View or copy here: https://docs.google.com/document/d/1AP8wN0moHaVZ5Wo5HoaMqftSTziQB5fftiE9v96ZCo0/edit

This is the template we developed in the early days of Square, iterated by several companies including Gusto and Pathstream. Improvements welcome!