<p>Nate Meyvis (<a href="https://www.natemeyvis.com">natemeyvis.com</a>): "I am an itinerant software engineer who likes to raise children, read, and memorize things. If you're not sure whether you should email me, you probably s..."</p>
<h2><a href="https://www.natemeyvis.com/a-tradeoff-in-defining-database-schemas/">A tradeoff in defining database schemas</a> (2026-04-27)</h2>
<p>Here are some decisions of a sort that comes up frequently:</p>
<ol>
<li>You have a <code>hats</code> table in your database with <code>uuid</code>, <code>hat_type</code> (e.g., <code>fedora</code> or <code>trilby</code>), <code>size</code>, and <code>dress_code_level</code> columns. Now you need to describe <a href="https://en.wikipedia.org/wiki/Fascinator">fascinators</a> in your database also. They also have sizes and dress code levels, but also <code>attachment_type</code>, <code>attachment_position</code>, and <code>has_veil</code>. Do you (i) create a separate <code>fascinators</code> table or (ii) add some columns to your <code>hats</code> table, rename it to <code>millinery</code>, and accept that fields like <code>attachment_type</code> can be null in some cases but not others?</li>
<li>You need to store user settings. Right now the only settings are a preferred time zone and a preferred theme (dark or light mode), but some day you'll probably need more. Do you (i) have columns for <code>preferred_time_zone</code> and <code>preferred_theme</code>, and accept that you'll need to change the schema later to support more settings, or (ii) put settings in a <code>settings</code> JSON<sup class="footnote-ref" id="fnref-1"><a href="#fn-1">1</a></sup> column?</li>
<li>You are collecting data about hat interactions. Sometimes people tip their hats, sometimes they put them on, sometimes they take them off, and so on. These events have some, but only some, overlapping fields. Do you (i) try to cluster like events in tables like <code>hat_tips</code> and <code>hat_on_events</code>, so that events in those tables have near-identical or identical attributes, or (ii) put all of them in an <code>events</code> table?</li>
</ol>
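<p>To make the first decision concrete, here is a minimal sketch (in Python with SQLite; the column types are my guesses, not a recommendation) of what the two options might look like:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Option (i), "table faithfulness": one table per kind of thing, so
-- every column is meaningful for every row of its table.
CREATE TABLE hats (
    uuid TEXT PRIMARY KEY,
    hat_type TEXT NOT NULL,            -- e.g. 'fedora', 'trilby'
    size TEXT NOT NULL,
    dress_code_level INTEGER NOT NULL
);
CREATE TABLE fascinators (
    uuid TEXT PRIMARY KEY,
    size TEXT NOT NULL,
    dress_code_level INTEGER NOT NULL,
    attachment_type TEXT NOT NULL,
    attachment_position TEXT NOT NULL,
    has_veil INTEGER NOT NULL
);

-- Option (ii), "table minimalism": one merged table, where the
-- hat- and fascinator-specific columns must be nullable because they
-- mean nothing for the other kind of row.
CREATE TABLE millinery (
    uuid TEXT PRIMARY KEY,
    kind TEXT NOT NULL,                -- 'hat' or 'fascinator'
    hat_type TEXT,                     -- NULL unless kind = 'hat'
    size TEXT NOT NULL,
    dress_code_level INTEGER NOT NULL,
    attachment_type TEXT,              -- NULL unless kind = 'fascinator'
    attachment_position TEXT,
    has_veil INTEGER
);
""")
```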
<p>Call an approach that favors (i)-type answers "table faithfulness"<sup class="footnote-ref" id="fnref-2"><a href="#fn-2">2</a></sup> and an approach that favors (ii)-type answers "table minimalism." The first approach tends to respect the semantics of table names and columns better, and to give values (including nulls) more consistent meaning between rows. The second approach tends to make it easier to know where data is and to process data without lots of joins and table lookups.</p>
<p>Any specific decision will, of course, depend on your domain and tooling, but I tend to value table faithfulness and (i)-type solutions more than most people do:</p>
<ol>
<li>When the meaning of a column is consistent across the rows of a table, you can more easily do integrity checks (and especially programmatic integrity checks). For example, you can raise an alarm if two columns are null at the same time.<sup class="footnote-ref" id="fnref-3"><a href="#fn-3">3</a></sup></li>
<li>Table and column names are forms of documentation, and extremely commonly referenced ones at that. The more semantically accurate they are, the better.</li>
<li>Many database features (indexes, range queries, grouping, and so on) assume table faithfulness, or at least work better when you've prioritized it.</li>
<li>Table faithfulness discourages practices like stuffing extra fields into <code>settings</code> columns haphazardly and keeping business logic in developers' heads.</li>
<li>Modern tooling makes it easier to do migrations and learn what tables and columns<sup class="footnote-ref" id="fnref-4"><a href="#fn-4">4</a></sup> exist, even when there are many of them. So, the drawbacks of faithfulness are getting less and less severe.</li>
</ol>
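<p>As an illustration of the programmatic integrity checks in point 1: in a merged, minimalist table, a rule like "the attachment columns are either all set or all null" lives outside the schema, so you have to check it yourself. A minimal sketch, with an invented two-row table:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE millinery (
    uuid TEXT PRIMARY KEY,
    kind TEXT NOT NULL,
    attachment_type TEXT,
    attachment_position TEXT
);
INSERT INTO millinery VALUES
    ('a1', 'hat',        NULL,   NULL),
    ('b2', 'fascinator', 'clip', NULL);
""")

# The rule "attachment_type and attachment_position are set together"
# is invisible to the merged schema, so check it programmatically:
violations = conn.execute("""
    SELECT uuid FROM millinery
    WHERE (attachment_type IS NULL) != (attachment_position IS NULL)
""").fetchall()
# 'b2' is flagged: it has an attachment_type but no attachment_position.
```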
<p>This is almost always a tradeoff, and sometimes an intermediate or (ii)-type solution is best. Over the years, though, I've very often found myself advocating for more faithfulness, all things considered, and I've rarely regretted that. (Even <em>more</em> often, I've had to deal with the consequences of a bloated "here's where all the main data lives, and please message this one person if you have any questions" table.) So, if you're facing such a decision<sup class="footnote-ref" id="fnref-5"><a href="#fn-5">5</a></sup> and not sure what to do, please consider this a vote for faithfulness.</p>
<p>P.S.: Ideally, your persistence layer is encapsulated sufficiently well that most of your code never knows or cares which approach you take, but that's another post.</p>
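<p>A minimal sketch of that encapsulation, with invented names: callers deal in a <code>Settings</code> object, and only the repository knows whether storage is a JSON blob or dedicated columns.</p>

```python
import json
import sqlite3
from dataclasses import dataclass

@dataclass
class Settings:
    preferred_time_zone: str
    preferred_theme: str

class SettingsRepository:
    """Callers get a Settings object; only this class knows whether the
    store uses dedicated columns or a JSON blob."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS user_settings "
            "(user_id TEXT PRIMARY KEY, blob TEXT NOT NULL)"
        )

    def save(self, user_id: str, settings: Settings) -> None:
        # Today: a JSON blob. Switching to real columns later would
        # change only this class, not its callers.
        self._conn.execute(
            "INSERT OR REPLACE INTO user_settings VALUES (?, ?)",
            (user_id, json.dumps(settings.__dict__)),
        )

    def load(self, user_id: str) -> Settings:
        (blob,) = self._conn.execute(
            "SELECT blob FROM user_settings WHERE user_id = ?", (user_id,)
        ).fetchone()
        return Settings(**json.loads(blob))

repo = SettingsRepository(sqlite3.connect(":memory:"))
repo.save("u1", Settings("America/New_York", "dark"))
```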
<hr />
<section class="footnotes">
<ol>
<li id="fn-1"><p>...or JSONB, or something else. This and many other database-specific details can affect your decision-making, but not the basic shape of the problem I'm trying to describe.<a href="#fnref-1" class="footnote">↩</a></p></li>
<li id="fn-2"><p>I don't love this name and would be grateful to hear a better one.<a href="#fnref-2" class="footnote">↩</a></p></li>
<li id="fn-3"><p>There are also any number of database-specific mechanisms for this sort of check.<a href="#fnref-3" class="footnote">↩</a></p></li>
<li id="fn-4"><p>Here as elsewhere, I'm using "columns" loosely to include, for example, fields in a schemaless database.<a href="#fnref-4" class="footnote">↩</a></p></li>
<li id="fn-5"><p>Very often, I've found that what is justified retrospectively as some version of table minimalism ("we wanted to keep things simple!" or "we needed to move fast and didn't want to do a migration") was, upon inspection, backed by no explicit decision-making process at all beyond the pull requests implementing it.<a href="#fnref-5" class="footnote">↩</a></p></li>
</ol>
</section>
<h2><a href="https://www.natemeyvis.com/reading-notes-the-poisoned-king/">Reading notes: 'The Poisoned King'</a> (2026-04-26)</h2>
<p><em>The Poisoned King</em> is the sequel to <em>Impossible Creatures</em>, which I <a href="https://www.natemeyvis.com/reading-notes-impossible-creatures/">absolutely loved</a>. <em>The Poisoned King</em> is every inch as good.</p>
<p>Mostly I just want to say "hooray, this series is still great," but here are a few small notes:</p>
<ol>
<li>It's <em>consistently</em> great; I haven't read a chapter of either book that I didn't find solid, well-written, and interesting. I'm not sure I could find a single <em>sentence</em> that I'd consider a misstep.</li>
<li>This genre of book is much more explicitly and deeply moral than most others. Whatever the author's intent might be, it is effectively impossible to read 500 pages of fantasy novels about young adults fighting apocalyptic forces without drawing general moral lessons from them.</li>
<li>Another of Rundell's remarkable successes is in satisfying the moralizing ("moralizing" in the non-pejorative sense) expectations of the genre with a modern ethical framework, expressed in a compelling, fair, and nuanced way. I'd expect any fair reader to feel the force of it, and none to perceive political cheap shots or cheaper flattery.<sup class="footnote-ref" id="fnref-1"><a href="#fn-1">1</a></sup> I suspect this is even harder to pull off now than it was 50 years ago.</li>
</ol>
<p>As I write this, these are now both <a href="https://www.natemeyvis.com/100-books/">top-20 books</a> for me.</p>
<hr />
<section class="footnotes">
<ol>
<li id="fn-1"><p>The contrast I have in mind here is to <a href="https://libertiesjournal.com/articles/sanctimony-literature/">sanctimony literature</a>, in Becca Rothfeld's sense.<a href="#fnref-1" class="footnote">↩</a></p></li>
</ol>
</section>
<h2><a href="https://www.natemeyvis.com/some-data-on-the-shape-of-the-forgetting-curve/">Some data on the shape of the forgetting curve</a> (2026-04-25)</h2>
<p>The forgetting curve is often schematically pictured like this, as <a href="https://en.wikipedia.org/wiki/Forgetting_curve">on Wikipedia</a>:</p>
<p><img src="https://bear-images.sfo2.cdn.digitaloceanspaces.com/nwm/forgetting_curve_decline.svg" alt="Forgetting_curve_decline" /></p>
<p>Learners often take this to mean that their retention of a given fact will, over time and on average, tend to look something like that. So, for example, the Wikipedia entry on <a href="https://en.wikipedia.org/wiki/Hermann_Ebbinghaus">Ebbinghaus</a> glosses it as "describ[ing] the exponential loss of information that one has learned." But:</p>
<ol>
<li>Ebbinghaus's original forgetting curve is defined in terms of "savings," a metric we tend not to use: it measures how much less time it takes to relearn something than it took to learn it initially.</li>
<li>Ebbinghaus's <a href="https://en.wikipedia.org/wiki/Forgetting_curve#cite_note-Murre2015-7">1885 formula</a> is <code>b = 100k / ((log t)^c + k)</code>, which involves an exponent and decays but is not a literal exponential curve.</li>
<li>I've never found strong evidence that my own forgetting curves (here defined as I think the term is generally understood: in terms of the probability of my getting a flashcard correct over time) are exponentially distributed.</li>
</ol>
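<p>For the curious, here's the 1885 formula in code, using the constants usually quoted for Ebbinghaus's data (k = 1.84, c = 1.25, t in minutes; treat these as illustrative rather than authoritative), next to a true exponential for contrast:</p>

```python
import math

def ebbinghaus_savings(t_minutes: float, k: float = 1.84, c: float = 1.25) -> float:
    """Ebbinghaus's 1885 curve: b = 100k / ((log10 t)^c + k).

    k and c are the values usually quoted for his data (t in minutes);
    they are illustrative constants, not a modern fit.
    """
    return 100 * k / (math.log10(t_minutes) ** c + k)

def exponential_decay(t_minutes: float, half_life: float = 60.0) -> float:
    # An arbitrary true exponential (one-hour half-life) for contrast.
    return 100 * 0.5 ** (t_minutes / half_life)

# The shapes diverge badly at long delays: the exponential collapses
# toward zero, while Ebbinghaus's curve flattens out.
for t in (20, 60, 8 * 60, 24 * 60, 31 * 24 * 60):
    print(t, round(ebbinghaus_savings(t), 1), round(exponential_decay(t), 1))
```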
<p>Here's my performance (LOWESS-smoothed) on my fourth response after I get the first three responses on a flashcard correct:</p>
<p><img src="https://bear-images.sfo2.cdn.digitaloceanspaces.com/nwm/13pm.webp" alt="Screenshot 2026-04-24 at 2" /></p>
<p>I've chosen the correct-correct-correct ("CCC") prefix because it has a large sample size and a reasonable spread of intervals between the third and fourth responses.<sup class="footnote-ref" id="fnref-1"><a href="#fn-1">1</a></sup> When I run <a href="https://en.wikipedia.org/wiki/Bayesian_information_criterion">Bayesian information criterion</a> ("BIC") analyses on this, they consistently choose models with the fewest parameters, because more or less any kind of distribution can fit the data very well. (Even a <em>linear</em> fit does almost as well as anything else.)</p>
<p>If I didn't know these were spaced-repetition data, or if I hadn't read that they are supposed to be exponentially distributed,<sup class="footnote-ref" id="fnref-2"><a href="#fn-2">2</a></sup> then neither looking at the data, nor exploring them, nor running BIC or any similar analysis would tempt me to think they are exponentially distributed.</p>
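<p>Here is a sketch of that kind of BIC comparison, on synthetic data. With nearly clean exponential data the exponential fit wins decisively; the point is that on real, noisy retention data the scores come out too close to call:</p>

```python
import math

def bic(rss: float, n: int, n_params: int) -> float:
    # Gaussian least-squares BIC: n*ln(RSS/n) + k*ln(n); lower is better.
    return n * math.log(rss / n) + n_params * math.log(n)

def fit_line(xs, ys):
    # Ordinary least-squares slope and intercept.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Synthetic "retention vs. days" data: exponential decay plus a tiny,
# deterministic wobble so that neither fit is exact.
days = list(range(1, 11))
retention = [0.9 * math.exp(-0.15 * t) + (0.005 if t % 2 else -0.005)
             for t in days]

# Linear model: p = a*t + b
a, b = fit_line(days, retention)
rss_linear = sum((p - (a * t + b)) ** 2 for t, p in zip(days, retention))

# Exponential model, fit in log space: ln p = intercept + slope*t
slope, intercept = fit_line(days, [math.log(p) for p in retention])
rss_exp = sum((p - math.exp(intercept + slope * t)) ** 2
              for t, p in zip(days, retention))

bic_linear = bic(rss_linear, len(days), 2)
bic_exp = bic(rss_exp, len(days), 2)
```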
<p>As always, the hard question is what to <em>do</em> about this. I <a href="https://www.natemeyvis.com/notes-on-spaced-repetition-scheduling/">still take</a> the pragmatic lesson that we should worry less about fine details of algorithms and more about the ergonomics of the broader learning system. Others disagree (explicitly or implicitly). But whatever lesson you draw from it, I've never seen a retention-versus-time curve from my own data that is obviously exponential, and I've <em>certainly</em> never seen one that looks anything like the standard Wikipedia-style schematic.</p>
<hr />
<section class="footnotes">
<ol>
<li id="fn-1"><p><a href="https://www.natemeyvis.com/some-empirical-spaced-repetition-results/">Here</a> is a post with data with a different (CIC) prefix.<a href="#fnref-1" class="footnote">↩</a></p></li>
<li id="fn-2"><p>Not everyone in the spaced repetition community thinks that these should be exponentially distributed; some advocate for power laws or for other models. I don't think this affects the point I'm making.<a href="#fnref-2" class="footnote">↩</a></p></li>
</ol>
</section>
<h2><a href="https://www.natemeyvis.com/why-is-there-so-much-bad-code-at-big-companies/">Why is there so much bad code at big companies?</a> (2026-04-24)</h2>
<p><a href="https://www.seangoedecke.com/bad-code-at-big-companies/">Here is</a> the excellent Sean Goedecke on why there is so much bad code at big companies. He's correct that:</p>
<ol>
<li>Many big-company engineers are working on relatively unfamiliar codebases, and this makes it harder to ship good code;</li>
<li>Big companies prioritize things (e.g., legibility) that are not code quality;</li>
<li>Code quality is partly a function of the code-review process, but code review is incentivized unevenly at best.</li>
</ol>
<p>(Those are all my paraphrases.) They're all correct, but I don't think they're the most important drags on code quality at big companies. I'd cite these:</p>
<ol>
<li>Encapsulation is the main determinant of code quality; most engineers have not mastered encapsulation; and even 95th-percentile engineers struggle to get encapsulation right.</li>
<li>Encapsulation is much harder at big companies, because systems have to be more complex at their scale. Simple solutions either don't work at all or carry costs that are negligible at modest scale but huge at big-company scale. (Authentication, load balancing, and version control are good examples here.)</li>
<li>In some big-company cultures, caring too much about micro-level code quality is <em>punished</em>: it marks you as someone who's insufficiently serious about system design and "higher-level" issues more generally. There are places where you <em>really</em> don't want to look like a sous-chef.</li>
<li><a href="https://www.natemeyvis.com/on-rewarding-simplicity/">As I've discussed before</a>, over-complicated code often benefits the person who wrote it: they are uniquely positioned to maintain and fix it. Even though most engineers are not Machiavellian schemers, incentives matter.</li>
<li>Good code is a matter of craft, and often subtle; it's hard to recognize and reward at scale. (See, again, <a href="https://www.natemeyvis.com/on-rewarding-simplicity/">this</a> piece and <a href="https://www.natemeyvis.com/a-model-of-how-simplicity-gets-rewarded/">this more optimistic</a> one.)</li>
</ol>
<h2><a href="https://www.natemeyvis.com/the-case-against-llm-prose/">The case against LLM prose</a> (2026-04-23)</h2>
<p>I like LLM-generated code <a href="https://www.natemeyvis.com/on-cognitive-debt/">more than</a> <a href="https://www.natemeyvis.com/cognitive-debt-and-optimism/">most people do</a>, but I'm pretty sure I dislike LLM-generated English more than most people do. There's plenty of anti-LLM sentiment out there, but surprisingly (to me) little theorizing about what is wrong with it, and why. So, here's my view:</p>
<p>A lot of what we get from writing, we get over time or from scrutiny. In both fiction and nonfiction, we often underrate how much of a text's meaning and import comes to us only after investments of time and effort. LLM-generated English does not repay those investments the way human-generated English can. So, presenting LLM-generated English as human-generated English violates a trust. You are implicitly asking someone to consider the writing with care and to think beyond its surface, but it will not reward that care and time.<sup class="footnote-ref" id="fnref-1"><a href="#fn-1">1</a></sup></p>
<p>Relatedly, LLM-generated writing is disproportionately manipulative. Precisely because it can't do much else, it trades in formulaic contrasts, cheap sensationalism, and flattery. Partly because it suggests a payoff that isn't there, it dulls the reader's receptivity to contrasts that are truly interesting, facts that are really sensational, and so on. More and more of the written environment feels like 20th-percentile vaudeville humor or partisan news: the recitation of cheap formulas aimed at emotional weak points, in the guise of something else.</p>
<p>I hope it's obvious that this doesn't apply to <em>all</em> AI prose. I like it when chatbots give me prose that's more digestible than bullet points, and I'm glad people can use AI for translation. Moreover, things like formal requests and lawsuits don't carry the same assumptions, and none of this applies to those.</p>
<p>My finger-to-the-wind sense is that people are mostly shrugging their shoulders about this, except when it comes to things like school essays and job interviews. The sentiment, if I had to guess at an articulation of it, is that in a world already stuffed with TikTok and video games and sound bites, rampant LLM-generated prose is just more of the same. If I'm right, though, LLM-generated prose is more corrosive than that.</p>
<section class="footnotes">
<ol>
<li id="fn-1"><p>Maybe future AI writing will reward such care. I doubt it, but I can't be sure. I'm talking about what is respectful <em>now</em>.<a href="#fnref-1" class="footnote">↩</a></p></li>
</ol>
</section>
<h2><a href="https://www.natemeyvis.com/kirkby-and-matuschak-on-making-flashcards-with-llms/">Kirkby and Matuschak on making flashcards with LLMs</a> (2026-04-22)</h2>
<p><a href="https://memory-machines.com/report">Here</a> are Ozzie Kirkby and Andy Matuschak reporting on their attempts to get LLMs to generate and evaluate good flashcards from their reading notes. It's both well-written<sup class="footnote-ref" id="fnref-1"><a href="#fn-1">1</a></sup> and relevant. They work with good ideas of how flashcards ought to be written, or at least ideas that are <a href="https://www.natemeyvis.com/common-errors-in-flashcard-composition/">close to mine</a>.</p>
<p>Some notes:<sup class="footnote-ref" id="fnref-2"><a href="#fn-2">2</a></sup></p>
<ol>
<li>This is what it looks like when the authors actually care about the thing they're studying: diversity of approach, creativity, and tenacity. I evaluate this whole genre of essay on a spectrum from "<em>really</em> care about figuring this out" to "trying to get an A from some real or imagined teaching assistant." This is very much on the good side.</li>
<li>The project of <em>fully</em> automating the highlight-to-card process via LLM is, to me, undermotivated. The <a href="https://www.natemeyvis.com/ingestion/">larger pipeline here</a> includes reading, highlighting, thinking, card composition, and intermittent study. Given that, I'm not so worried that, e.g., "even the strongest model we tested (GPT-5.2) still produces unusable prompts roughly a third of the time." A strong flashcarder should be able to recognize and cull those very quickly, especially relative to the overall time commitment of studying something.<sup class="footnote-ref" id="fnref-3"><a href="#fn-3">3</a></sup> Here as elsewhere, I'm less concerned than others about whether AI can do 100% of something, and more concerned about whether it can do 25%, 50%, 75%, and 95% of it.</li>
<li>I'm glad that they tried fine-tuning, and found their various efforts here useful. Again, I draw a more optimistic conclusion than their "we got cheaper judges, not better ones," or perhaps the same conclusion in a more optimistic tone. Cheaper judges are good! I'm particularly interested in whether several cheaper judges could be aggregated, either now or in a near future of <a href="https://www.natemeyvis.com/are-major-ai-tools-diverging/">more sharply distinguished</a> models.</li>
<li>I'm grateful for their work in evaluating all those models so carefully, but <a href="https://www.natemeyvis.com/another-reason-we-cant-measure-our-productivity-with-ai/">am still in the "benchmarks have never been less useful" camp</a>.</li>
<li>I strongly agree that the training data are mostly bad and that most flashcarders' processes are not optimized for the sort of learning that interests Kirkby and Matuschak here. I disagree in places, however, with their views on how highlighted material should be captured in a flashcard. So, for example, I don't think it's so bad simply to memorize the traditional three factors of production.<sup class="footnote-ref" id="fnref-4"><a href="#fn-4">4</a></sup> I'd also go about studying their "humans flying by flapping wings on Titan" example differently, but the details here would require another post.</li>
<li>As an experiment, I asked Claude to make me 60 flashcard candidates from my Norton-anthology highlights: I'm <a href="https://www.natemeyvis.com/the-norton-anthology-lifestyle/">still</a> studying 19th-century British literature and using it as a way into the politics and history of the period. Most of the candidates were bad, but many were usable or editable. Claude and I working together were a lot more efficient than me working alone. This is in part because Claude had access to a local SQLite database of all my questions and responses and could query it to learn about how I write this kind of card, where my library's gaps are, and so on.<sup class="footnote-ref" id="fnref-5"><a href="#fn-5">5</a></sup> Again, culling and editing wasn't the time-consuming part (and note that this culling and editing is both educational and, for me, pleasant!).<sup class="footnote-ref" id="fnref-6"><a href="#fn-6">6</a></sup></li>
</ol>
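<p>On the arithmetic behind point 2: with some purely hypothetical per-card times (all of the numbers below are invented for illustration), culling a third of the candidates is cheap relative to writing every card by hand:</p>

```python
# Hypothetical per-card times, purely to illustrate the accounting:
# reviewing LLM candidates can beat writing every card by hand even
# when a third of the candidates are unusable.

def llm_assisted_minutes(n_cards: int, unusable_rate: float,
                         cull_min: float, edit_min: float) -> float:
    """Minutes to end up with n_cards usable cards from LLM candidates."""
    candidates_needed = n_cards / (1 - unusable_rate)
    culled = candidates_needed - n_cards
    return culled * cull_min + n_cards * edit_min

n = 60
from_scratch = n * 2.0                  # assume 2 minutes to write a card by hand
with_llm = llm_assisted_minutes(n, unusable_rate=1 / 3,
                                cull_min=0.1,   # seconds to reject a bad candidate
                                edit_min=0.5)   # light editing of a usable one
```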
<p>So: this is valuable work (unless I'm forgetting something, the best thing I've read about spaced repetition this year), but I'd encourage a different picture of human-AI cooperation in spaced repetition. Relatedly, I'm more optimistic than the authors about LLMs' helping us make good flashcards.</p>
<hr />
<section class="footnotes">
<ol>
<li id="fn-1"><p>...and, Pangram and I agree, at least mostly written by humans.<a href="#fnref-1" class="footnote">↩</a></p></li>
<li id="fn-2"><p>The usual disclaimer: I make <a href="https://www.zippyflash.com/">Zippyflash</a> and have used it for years. I'm invested in Zippyflash on many dimensions: primarily as a user, but also financially and ideologically.<a href="#fnref-2" class="footnote">↩</a></p></li>
<li id="fn-3"><p>This is why Zippyflash distinguishes fundamentally between cards and card <em>candidates</em>, and the API makes it easy for LLMs to submit candidates for your review. For a bit more about LLM-centric API design, see <a href="https://www.natemeyvis.com/a-first-guide-to-building-apis-with-ai/">here</a>.<a href="#fnref-3" class="footnote">↩</a></p></li>
<li id="fn-4"><p>Whether these should be memorized as one card or many depends, I think, on (e.g.) how likely you are to remember some but not all of them. I'm not saying that "What are the traditional three factors of economic production?" is the right flashcard, but just that memorizing this "textbook list" is not a bad idea even if your goal is a much different kind of understanding.<a href="#fnref-4" class="footnote">↩</a></p></li>
<li id="fn-5"><p>The details here are less important than the idea that, again, good flashcards are less likely to come from a simple highlight-to-card LLM call and more likely to come from a more complicated human-AI partnership. (For now, at least.)<a href="#fnref-5" class="footnote">↩</a></p></li>
<li id="fn-6"><p>It's not so hard to get your Kindle highlights into a format your LLM can use, but it's a clunky process.<a href="#fnref-6" class="footnote">↩</a></p></li>
</ol>
</section>
<h2><a href="https://www.natemeyvis.com/the-case-against-worrying-about-your-posts-analytics/">The case against worrying about your posts' analytics</a> (2026-04-21)</h2>
<p>I've had several conversations recently in which my interlocutor said that they tried writing online, got very few readers on their posts, and gave up (or, at least, feels discouraged about it). I think it's usually a mistake to worry about low readership on a post:</p>
<ol>
<li>As Patrick McKenzie has said repeatedly,<sup class="footnote-ref" id="fnref-1"><a href="#fn-1">1</a></sup> much of the public, professional value of having written something is totally independent of its analytics. Simply being able to prove you've thought about something before has a lot of value (e.g., in job applications).</li>
<li>It's good to write, and most of the reasons for this (e.g., its forcing you to clarify your thoughts) are independent of the number of readers a piece has. (I would say more, but you've probably seen many arguments to this effect.)</li>
<li><em>Even if your goal is influence</em>, maximizing the number of readers is usually not the best way to get it. Saying something that is useful and intelligible in a niche is not just better than, but <em>more influential than</em>, getting a lot of low-quality clicks on something that will barely be processed and not be remembered.</li>
<li><em>Even if your goal is money</em>, most of the best ways to turn writing into money do not involve getting lots of clicks on posts.</li>
<li><em>Even if you want high reader counts,</em> writing online can be cumulative, and cumulative in unexpected ways. Low-traffic posts can be an important catalyst for, or supporting link in, a future, more popular post. They can also build trust and credibility in other, indirect ways (e.g., by establishing that you've been around a while).</li>
<li><em>Even if you want high reader counts,</em> there is a large, seemingly random component to post popularity. Many of us have no good way to predict which of our posts will be more and less popular. To the (small) extent I care about analytics, I view my posts in large part as lottery tickets. Most of them are not attention-winners.</li>
<li>People use RSS readers, and analytics tools can be quirky, so you usually can't be sure your readership is as low as you think it is.</li>
</ol>
<p>So: analytics dashboards are a tempting scorecard, but they usually aren't measuring what you have reason to care about.</p>
<hr />
<section class="footnotes">
<ol>
<li id="fn-1"><p>I imagine that a lot of this post is internalized <a href="https://www.kalzumeus.com">patio11</a> teachings, but his way of looking at this is so deeply ingrained in certain subcultures that it's hard to find specific citations. Actually, it's so ingrained that I hadn't thought to write this down until I found myself having this conversation repeatedly.<a href="#fnref-1" class="footnote">↩</a></p></li>
</ol>
</section>
<h2><a href="https://www.natemeyvis.com/spaced-repetition-scheduling-and-categorization/">Spaced repetition scheduling and categorization</a> (2026-04-20)</h2>
<p>Should I consider the subject matter of a flashcard explicitly when scheduling reviews? On the one hand, more information is often better when determining a scheduling interval; on the other hand, we want to avoid complexity and overfitting, and one might reasonably hope that category-by-category differences would be captured in my performance itself, without the scheduler needing to look at categories explicitly.</p>
<p>I'm approaching a million flashcard reviews and have never used category data in a scheduling algorithm. In that time, I've made a lot of flashcards about movies.<sup class="footnote-ref" id="fnref-1"><a href="#fn-1">1</a></sup> Here is my performance in two common response-pattern situations, split between questions about movies and questions not about movies.</p>
<p>First, when I make a card and get the first review correct, here's my performance on the second review:</p>
<p><img src="https://bear-images.sfo2.cdn.digitaloceanspaces.com/nwm/57am.webp" alt="Screenshot 2026-04-20 at 11" /></p>
<p>And when I get my first two reviews correct, here's my performance on the third review:</p>
<p><img src="https://bear-images.sfo2.cdn.digitaloceanspaces.com/nwm/20am.webp" alt="Screenshot 2026-04-20 at 11" /></p>
<p>Some notes and caveats:</p>
<ol>
<li>These are just some high-volume patterns I checked; if you think there's some other category-specific analysis I ought to do, I'd be grateful to know it.</li>
<li>I don't organize my studying by "decks"; I study my cards all together, and those cards are tagged by category.<sup class="footnote-ref" id="fnref-2"><a href="#fn-2">2</a></sup></li>
<li>Not all my movie-related cards are tagged as such, and a lot of my tagging has been done for me by AI (but that's another post). So these data are not even close to perfect. That said: (i) I've manually inspected a lot of LLM-generated tags, and they look good, and (ii) insofar as I have movie cards that are still untagged, my non-movie / movie gap would be <em>under</em>stated.</li>
<li>As with all my analyses, I'm working with self-experimental data on which all sorts of hidden mechanisms might be operating. So, for example, I might have made a lot of flashcards about movies when I was scheduling reviews somewhat differently or when I was systematically sleep-deprived. I really doubt that any effect like this is operative here, but these are definitely one person's idiosyncratic data.</li>
<li>It's not clear to me what, if anything, I should <em>do</em> about this. The goal of scheduling is not, ultimately, to predict how likely I am to remember a card; it is to <em>learn things durably</em>. I'd only want to schedule movie-card reviews more conservatively if I were confident it would get me to long-run retention more efficiently. I suspect it would, but I can't be sure. And, as with long-interval performance, <a href="https://www.natemeyvis.com/spaced-repetition-performance-after-long-intervals/">I'm not quite sure even what questions to be asking</a>.</li>
</ol>
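<p>A sketch of the kind of split behind the two charts above, on an invented in-memory log (the real data live in a database, and the real analysis adds LOWESS smoothing and interval information):</p>

```python
# Hypothetical in-memory review log: (tags, ordered True/False responses)
# per card. The real data live in a SQLite database; this shape is
# invented for illustration.
cards = [
    ({"movies"},  [True, True]),
    ({"movies"},  [True, False]),
    ({"history"}, [True, True]),
    ({"poetry"},  [True, True]),
]

def second_review_accuracy(cards, tag, want_tagged):
    """Accuracy on review 2, among cards whose review 1 was correct,
    restricted to cards that do (or don't) carry `tag`."""
    hits = total = 0
    for tags, responses in cards:
        if (tag in tags) != want_tagged:
            continue
        if len(responses) < 2 or not responses[0]:
            continue
        total += 1
        hits += responses[1]
    return hits / total

movie_acc = second_review_accuracy(cards, "movies", True)
non_movie_acc = second_review_accuracy(cards, "movies", False)
```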
<hr />
<section class="footnotes">
<ol>
<li id="fn-1"><p>All the trivia competitions I care about have questions about movies, and it's a weak area for me. It's also a subject that's amenable to study in that (i) there's a core of high-value information to learn (e.g., Best Picture winners) and (ii) it's not hard to organize, find, and present the information.<a href="#fnref-1" class="footnote">↩</a></p></li>
<li id="fn-2"><p>Once in a while I use a feature that lets me filter cards by tag while studying, but I much prefer just to study everything together. It's more pleasant; it helps me break up clusters of related questions; and it helps keep me from guessing the answer from the category of a question.<a href="#fnref-2" class="footnote">↩</a></p></li>
</ol>
</section>
<h2><a href="https://www.natemeyvis.com/spaced-repetition-performance-after-long-intervals/">Spaced repetition performance after long intervals</a> (2026-04-19)</h2>
<p>Yesterday <a href="https://www.natemeyvis.com/some-empirical-spaced-repetition-results/">I wrote about</a> the difficulties of optimizing a scheduling algorithm, and mentioned that I don't know enough about how memory behaves when the intervals get into the years.</p>
<p>Here's some data. First, here's a LOWESS-smoothed graph of my correctness rate as a function of interval time, for intervals of at least one year. The black line is overall performance; the other three lines give my performance on questions I'd (i) never gotten wrong before, (ii) had gotten wrong only once, and (iii) had gotten wrong at least twice.</p>
<p><img src="https://bear-images.sfo2.cdn.digitaloceanspaces.com/nwm/522x.webp" alt="CleanShot 2026-04-19 at 13" /></p>
<p>Some notes:</p>
<ol>
<li>It looks like I'm too aggressive in scheduling long intervals for questions I've gotten wrong several times. This is interesting to me, because my algorithms have already weighted previous performance substantially.</li>
<li>I had expected the "never wrong" (blue) line to be above 90%, in part because a lot of those are probably quite easy cards. Perhaps I'm too aggressive with those also.</li>
<li>After I (i) never get a question wrong, (ii) go at least a year without answering the question, (iii) get it wrong, and (iv) answer it at least twice after <em>that</em>, my subsequent performance is perfect about 2/3 of the time (501 of 741).</li>
</ol>
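<p>The split between the lines in the chart is a simple grouping over the review log. A sketch, assuming each long-interval review is reduced to a (prior_wrong_count, correct) pair (a hypothetical structure, not my actual schema):</p>

```python
from collections import defaultdict

def correctness_by_prior_wrongs(reviews):
    """Group reviews by how often the card was previously missed
    (0, 1, or 2+) and return the correctness rate for each group.

    `reviews` is an iterable of (prior_wrong_count, correct) pairs.
    """
    totals = defaultdict(lambda: [0, 0])   # bucket -> [correct, seen]
    for wrongs, correct in reviews:
        bucket = min(wrongs, 2)            # 2 means "wrong at least twice"
        totals[bucket][0] += bool(correct)
        totals[bucket][1] += 1
    return {b: c / n for b, (c, n) in totals.items()}
```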
<p>The most common sequence before a year-or-more interval is exactly seven correct answers (with none incorrect; <em>N</em>=1403). Here's my performance vs. interval on those:</p>
<p><img src="proxy.php?url=https%3A%2F%2Fbear-images.sfo2.cdn.digitaloceanspaces.com%2Fnwm%2F302x-1.webp" alt="CleanShot 2026-04-19 at 13" /></p>
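<p>Finding the most common pre-interval sequence is a counting exercise. A sketch, assuming each card's reviews are an ordered list of (interval_days, correct) pairs (hypothetical field names, not my actual schema):</p>

```python
from collections import Counter

def pre_interval_histories(cards, min_days=365):
    """For each card, collect the C/I answer string accumulated before
    its first interval of at least `min_days` days.

    `cards` maps card id -> ordered list of (interval_days, correct).
    """
    histories = Counter()
    for reviews in cards.values():
        prefix = []
        for interval, correct in reviews:
            if interval >= min_days:
                histories["".join(prefix)] += 1
                break
            prefix.append("C" if correct else "I")
    return histories
```

<p><code>histories.most_common(1)</code> then gives the dominant sequence; in my data it is "CCCCCCC".</p>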
<p>With all the usual caveats (I've changed my algorithm over time, and perhaps my cards systematically vary across time also), this does suggest that little or nothing is gained by waiting only a year instead of a bit longer in this case.</p>
2026-04-19T17:13:00+00:00
https://www.natemeyvis.com/some-empirical-spaced-repetition-results/
Some empirical spaced-repetition results
2026-04-19T00:44:10.259661+00:00
<p>I am a <a href="proxy.php?url=https%3A%2F%2Fwww.natemeyvis.com%2Fi-did-301432-flashcard-reviews-in-2025%2F">diligent flashcarder</a>, and <a href="proxy.php?url=https%3A%2F%2Fwww.zippyflash.com">the tool I made</a> uses a lot of randomness in its scheduling, for reasons I go into <a href="proxy.php?url=https%3A%2F%2Fwww.natemeyvis.com%2Fnotes-on-spaced-repetition-scheduling%2F">here</a>.</p>
<p>I have well over 900,000 responses now, which is enough to let me do reasonable analysis of even fairly specific situations. So, for example: after a certain pattern of correct and incorrect answers, how sensitive is the correctness rate of the next response to the next interval?</p>
<p>Concretely: If I make a flashcard and my first three responses to it are (in this order) correct, incorrect, and correct, what is the relationship between the <em>next</em> time interval and my likelihood of getting the next (fourth) response correct?</p>
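<p>Extracting the underlying data is a filter over the review log. A sketch, assuming each card's reviews are an ordered list of (interval_hours, correct) pairs (a hypothetical schema, not my actual one):</p>

```python
def fourth_response_by_interval(cards, pattern=(True, False, True)):
    """For cards whose first three answers match `pattern` (here:
    correct, incorrect, correct), pair the fourth review's interval
    with whether that fourth answer was correct.

    `cards` is an iterable of review lists, each review a
    (interval_hours, correct) pair.
    """
    pairs = []
    for reviews in cards:
        if len(reviews) < 4:
            continue
        if tuple(correct for _, correct in reviews[:3]) == pattern:
            interval, correct = reviews[3]
            pairs.append((interval, correct))
    return pairs
```

<p>Bucketing the resulting (interval, correct) pairs by interval and averaging correctness within each bucket produces the chart below.</p>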
<p>Here's a chart:</p>
<p><img src="proxy.php?url=https%3A%2F%2Fbear-images.sfo2.cdn.digitaloceanspaces.com%2Fnwm%2Fcic_4_only.webp" alt="cic_4_only" /></p>
<p>And here's the <a href="proxy.php?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FLocal_regression">LOWESS</a>-smoothed version:</p>
<p><img src="proxy.php?url=https%3A%2F%2Fbear-images.sfo2.cdn.digitaloceanspaces.com%2Fnwm%2F462x.webp" alt="CleanShot 2026-04-18 at 20" /></p>
<p>(In the former chart, intervals of at most 8 hours and at least 40 hours are bucketed together, so the endpoint buckets each aggregate many different intervals.)</p>
<p>A few notes:</p>
<ol>
<li>The very small intervals are mostly due to questions coming up randomly before they were scheduled to reappear. (My software injects random questions into study sessions, primarily to make it harder to guess the answer based on when I'm seeing the question but also to enable analyses like this.)</li>
<li>The bump around 24 hours is interesting. My best guess is that this has something to do with consistency of environment: if I study a card on my lunch break or in bed in the morning, it's plausible that my performance will be better if I next see it in the same surroundings. Intervals very close to 24 hours are quite likely to keep surroundings constant.</li>
<li>I'd call this a good data set, but not perfect. I've changed my scheduling algorithm over time, and it's possible that the questions I made while using one algorithm were systematically different from the ones I made using another algorithm. I <em>don't</em> think it's plausible that effects like this are happening here at any meaningful magnitude, but it's worth noting.</li>
<li>Those caveats aside, I suspect this data reflects a <em>lot</em> more precision and care than a lot of the memory data "in the wild." Some obvious reasons for this are that I'm quite honest and consistent about grading myself, and that I've kept this up for years. The less obvious reason is that I write narrowly focused, "atomic" flashcards that admit of well-defined grading in the first place. (If you've spent any time looking at publicly available flashcards, you know that a lot of them have prompts like "Why is RuBisCO important?" or "James Buchanan," and their answers are long and contain perhaps a dozen important facts. That makes it a lot harder to give a well-defined assessment to an answer.)</li>
<li>The important question to be asking is: How should this change my scheduling algorithm? One way to think about this is: which intervals minimize the total effort to remember the card for the rest of my life (or, for concreteness and to sidestep morbid questions, for 20 years)? This is a hard question to answer empirically. The obvious brute-force approach would be to see how I do on questions where my first four responses are correct-incorrect-correct-correct ("CICC") and on CICI questions, then do an expected value calculation. But at least three obvious problems arise: the sample sizes get smaller and smaller; we don't know all that much about how spaced repetition works at decades-long scale; and, probably worst, the CICC question <em>that would have been a CICI question had I waited an extra two hours</em> is not the same as the average CICC question. So this is a tough problem!</li>
<li>That said: it's a tough problem, but unless I'm mistaken I'm one of the likeliest people in the world to have the data and machinery to answer it, so I'm going to keep thinking about it.</li>
</ol>
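<p>The random-injection idea in note 1 can be sketched as follows. This is an illustration under assumed names, not the tool's actual implementation:</p>

```python
import random

def build_session(due_cards, all_cards, inject_rate=0.1, rng=None):
    """Build a study session: the due cards plus a small random sample
    of not-yet-due cards, shuffled together.

    Injecting random cards both masks scheduling cues (you can't infer
    the answer from when a card appears) and yields response data at
    intervals the scheduler would never have chosen, which is what
    makes interval-sensitivity analyses possible at all.
    """
    rng = rng or random.Random()
    pool = [c for c in all_cards if c not in due_cards]
    n_extra = int(len(due_cards) * inject_rate)
    session = list(due_cards) + rng.sample(pool, min(n_extra, len(pool)))
    rng.shuffle(session)
    return session
```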
2026-04-18T18:42:00+00:00