<p>Nate Meyvis (<a href="https://www.natemeyvis.com">natemeyvis.com</a>): "I am an itinerant software engineer who likes to raise children, read, and memorize things. If you're not sure whether you should email me, you probably s..."</p>
<h2><a href="https://www.natemeyvis.com/a-tradeoff-in-defining-database-schemas/">A tradeoff in defining database schemas</a> (2026-04-27)</h2>
<p>Here are some decisions of a sort that comes up frequently:</p>
<ol>
<li>You have a <code>hats</code> table in your database with <code>uuid</code>, <code>hat_type</code> (e.g., <code>fedora</code> or <code>trilby</code>), <code>size</code>, and <code>dress_code_level</code> columns. Now you need to describe <a href="https://en.wikipedia.org/wiki/Fascinator">fascinators</a> in your database also. They also have sizes and dress code levels, but also <code>attachment_type</code>, <code>attachment_position</code>, and <code>has_veil</code>. Do you (i) create a separate <code>fascinators</code> table or (ii) add some columns to your <code>hats</code> table, rename it to <code>millinery</code>, and accept that fields like <code>attachment_type</code> can be null in some cases but not others?</li>
<li>You need to store user settings. Right now the only settings are a preferred time zone and a preferred theme (dark or light mode), but some day you'll probably need more. Do you (i) have columns for <code>preferred_time_zone</code> and <code>preferred_theme</code>, and accept that you'll need to change the schema later to support more settings, or (ii) put settings in a <code>settings</code> JSON<sup class="footnote-ref" id="fnref-1"><a href="#fn-1">1</a></sup> column?</li>
<li>You are collecting data about hat interactions. Sometimes people tip their hats, sometimes they put them on, sometimes they take them off, and so on. These events have some, but only some, overlapping fields. Do you (i) try to cluster like events in tables like <code>hat_tips</code> and <code>hat_on_events</code>, so that events in those tables have near-identical or identical attributes, or (ii) put all of them in an <code>events</code> table?</li>
</ol>
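<p>To make the first decision concrete, here is a minimal sketch (in Python with SQLite; the column types are my guesses, not a recommendation) of what the two options might look like:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Option (i), "table faithfulness": one table per kind of thing, so
-- every column is meaningful for every row of its table.
CREATE TABLE hats (
    uuid TEXT PRIMARY KEY,
    hat_type TEXT NOT NULL,            -- e.g. 'fedora', 'trilby'
    size TEXT NOT NULL,
    dress_code_level INTEGER NOT NULL
);
CREATE TABLE fascinators (
    uuid TEXT PRIMARY KEY,
    size TEXT NOT NULL,
    dress_code_level INTEGER NOT NULL,
    attachment_type TEXT NOT NULL,
    attachment_position TEXT NOT NULL,
    has_veil INTEGER NOT NULL
);

-- Option (ii), "table minimalism": one merged table, where the
-- hat- and fascinator-specific columns must be nullable because they
-- mean nothing for the other kind of row.
CREATE TABLE millinery (
    uuid TEXT PRIMARY KEY,
    kind TEXT NOT NULL,                -- 'hat' or 'fascinator'
    hat_type TEXT,                     -- NULL unless kind = 'hat'
    size TEXT NOT NULL,
    dress_code_level INTEGER NOT NULL,
    attachment_type TEXT,              -- NULL unless kind = 'fascinator'
    attachment_position TEXT,
    has_veil INTEGER
);
""")
```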
<p>Call an approach that favors (i)-type answers "table faithfulness"<sup class="footnote-ref" id="fnref-2"><a href="#fn-2">2</a></sup> and an approach that favors (ii)-type answers "table minimalism." The first approach tends to respect the semantics of table names and columns better, and to give values (including nulls) more consistent meaning between rows. The second approach tends to make it easier to know where data is and to process data without lots of joins and table lookups.</p>
<p>Any specific decision will, of course, depend on your domain and tooling, but I tend to value table faithfulness and (i)-type solutions more than most people do:</p>
<ol>
<li>When the meaning of a column is consistent across the rows of a table, you can more easily do integrity checks (and especially programmatic integrity checks). For example, you can raise an alarm if two columns are null at the same time.<sup class="footnote-ref" id="fnref-3"><a href="#fn-3">3</a></sup></li>
<li>Table and column names are forms of documentation, and extremely commonly referenced ones at that. The more semantically accurate they are, the better.</li>
<li>Many database features (indexes, range queries, grouping, and so on) assume table faithfulness, or at least work better when you've prioritized it.</li>
<li>Table faithfulness discourages practices like stuffing extra fields into <code>settings</code> columns haphazardly and keeping business logic in developers' heads.</li>
<li>Modern tooling makes it easier to do migrations and learn what tables and columns<sup class="footnote-ref" id="fnref-4"><a href="#fn-4">4</a></sup> exist, even when there are many of them. So, the drawbacks of faithfulness are getting less and less severe.</li>
</ol>
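<p>As an illustration of the programmatic integrity checks in point 1: in a merged, minimalist table, a rule like "the attachment columns are either all set or all null" lives outside the schema, so you have to check it yourself. A minimal sketch, with an invented two-row table:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE millinery (
    uuid TEXT PRIMARY KEY,
    kind TEXT NOT NULL,
    attachment_type TEXT,
    attachment_position TEXT
);
INSERT INTO millinery VALUES
    ('a1', 'hat',        NULL,   NULL),
    ('b2', 'fascinator', 'clip', NULL);
""")

# The rule "attachment_type and attachment_position are set together"
# is invisible to the merged schema, so check it programmatically:
violations = conn.execute("""
    SELECT uuid FROM millinery
    WHERE (attachment_type IS NULL) != (attachment_position IS NULL)
""").fetchall()
# 'b2' is flagged: it has an attachment_type but no attachment_position.
```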
<p>This is almost always a tradeoff, and sometimes an intermediate or (ii)-type solution is best. Over the years, though, I've very often found myself advocating for more faithfulness, all things considered, and I've rarely regretted that. (Even <em>more</em> often, I've had to deal with the consequences of a bloated "here's where all the main data lives, and please message this one person if you have any questions" table.) So, if you're facing such a decision<sup class="footnote-ref" id="fnref-5"><a href="#fn-5">5</a></sup> and not sure what to do, please consider this a vote for faithfulness.</p>
<p>P.S.: Ideally, your persistence layer is encapsulated sufficiently well that most of your code never knows or cares which approach you take, but that's another post.</p>
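<p>A minimal sketch of that encapsulation, with invented names: callers deal in a <code>Settings</code> object, and only the repository knows whether storage is a JSON blob or dedicated columns.</p>

```python
import json
import sqlite3
from dataclasses import dataclass

@dataclass
class Settings:
    preferred_time_zone: str
    preferred_theme: str

class SettingsRepository:
    """Callers get a Settings object; only this class knows whether the
    store uses dedicated columns or a JSON blob."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS user_settings "
            "(user_id TEXT PRIMARY KEY, blob TEXT NOT NULL)"
        )

    def save(self, user_id: str, settings: Settings) -> None:
        # Today: a JSON blob. Switching to real columns later would
        # change only this class, not its callers.
        self._conn.execute(
            "INSERT OR REPLACE INTO user_settings VALUES (?, ?)",
            (user_id, json.dumps(settings.__dict__)),
        )

    def load(self, user_id: str) -> Settings:
        (blob,) = self._conn.execute(
            "SELECT blob FROM user_settings WHERE user_id = ?", (user_id,)
        ).fetchone()
        return Settings(**json.loads(blob))

repo = SettingsRepository(sqlite3.connect(":memory:"))
repo.save("u1", Settings("America/New_York", "dark"))
```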
<hr />
<section class="footnotes">
<ol>
<li id="fn-1"><p>...or JSONB, or something else. This and many other database-specific details can affect your decision-making, but not the basic shape of the problem I'm trying to describe.<a href="#fnref-1" class="footnote">↩</a></p></li>
<li id="fn-2"><p>I don't love this name and would be grateful to hear a better one.<a href="#fnref-2" class="footnote">↩</a></p></li>
<li id="fn-3"><p>There are also any number of database-specific mechanisms for this sort of check.<a href="#fnref-3" class="footnote">↩</a></p></li>
<li id="fn-4"><p>Here as elsewhere, I'm using "columns" loosely to include, for example, fields in a schemaless database.<a href="#fnref-4" class="footnote">↩</a></p></li>
<li id="fn-5"><p>Very often, I've found that what is justified retrospectively as some version of table minimalism ("we wanted to keep things simple!" or "we needed to move fast and didn't want to do a migration") was, upon inspection, backed by no explicit decision-making process at all beyond the pull requests implementing it.<a href="#fnref-5" class="footnote">↩</a></p></li>
</ol>
</section>
<h2><a href="https://www.natemeyvis.com/reading-notes-the-poisoned-king/">Reading notes: 'The Poisoned King'</a> (2026-04-26)</h2>
<p><em>The Poisoned King</em> is the sequel to <em>Impossible Creatures</em>, which I <a href="https://www.natemeyvis.com/reading-notes-impossible-creatures/">absolutely loved</a>. <em>The Poisoned King</em> is every inch as good.</p>
<p>Mostly I just want to say "hooray, this series is still great," but here are a few small notes:</p>
<ol>
<li>It's <em>consistently</em> great; I haven't read a chapter of either book that I didn't find solid, well-written, and interesting. I'm not sure I could find a single <em>sentence</em> that I'd consider a misstep.</li>
<li>This genre of book is much more explicitly and deeply moral than most others. Whatever the author's intent might be, it is effectively impossible to read 500 pages of fantasy novels about young adults fighting apocalyptic forces without drawing general moral lessons from them.</li>
<li>Another of Rundell's remarkable successes is in satisfying the moralizing ("moralizing" in the non-pejorative sense) expectations of the genre with a modern ethical framework, expressed in a compelling, fair, and nuanced way. I'd expect any fair reader to feel the force of it, and none to perceive political cheap shots or cheaper flattery.<sup class="footnote-ref" id="fnref-1"><a href="#fn-1">1</a></sup> I suspect this is even harder to pull off now than it was 50 years ago.</li>
</ol>
<p>As I write this, these are now both <a href="https://www.natemeyvis.com/100-books/">top-20 books</a> for me.</p>
<hr />
<section class="footnotes">
<ol>
<li id="fn-1"><p>The contrast I have in mind here is to <a href="https://libertiesjournal.com/articles/sanctimony-literature/">sanctimony literature</a>, in Becca Rothfeld's sense.<a href="#fnref-1" class="footnote">↩</a></p></li>
</ol>
</section>
<h2><a href="https://www.natemeyvis.com/some-data-on-the-shape-of-the-forgetting-curve/">Some data on the shape of the forgetting curve</a> (2026-04-25)</h2>
<p>The forgetting curve is often schematically pictured like this, as <a href="https://en.wikipedia.org/wiki/Forgetting_curve">on Wikipedia</a>:</p>
<p><img src="https://bear-images.sfo2.cdn.digitaloceanspaces.com/nwm/forgetting_curve_decline.svg" alt="Forgetting_curve_decline" /></p>
<p>Learners often take this to mean that their retention of a given fact will, over time and on average, tend to look something like that. So, for example, the Wikipedia entry on <a href="https://en.wikipedia.org/wiki/Hermann_Ebbinghaus">Ebbinghaus</a> glosses it as "describ[ing] the exponential loss of information that one has learned." But:</p>
<ol>
<li>Ebbinghaus's original forgetting curve is defined in terms of "savings," a metric we tend not to use: it measures how much less time it takes to relearn something than it took to learn it initially.</li>
<li>Ebbinghaus's <a href="https://en.wikipedia.org/wiki/Forgetting_curve#cite_note-Murre2015-7">1885 formula</a> is <code>b = 100k / ((log t)^c + k)</code>, which involves an exponent and decays but is not a literal exponential curve.</li>
<li>I've never found strong evidence that my own forgetting curves (here defined as I think the term is generally understood: in terms of the probability of my getting a flashcard correct over time) are exponentially distributed.</li>
</ol>
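<p>For the curious, here's the 1885 formula in code, using the constants usually quoted for Ebbinghaus's data (k = 1.84, c = 1.25, t in minutes; treat these as illustrative rather than authoritative), next to a true exponential for contrast:</p>

```python
import math

def ebbinghaus_savings(t_minutes: float, k: float = 1.84, c: float = 1.25) -> float:
    """Ebbinghaus's 1885 curve: b = 100k / ((log10 t)^c + k).

    k and c are the values usually quoted for his data (t in minutes);
    they are illustrative constants, not a modern fit.
    """
    return 100 * k / (math.log10(t_minutes) ** c + k)

def exponential_decay(t_minutes: float, half_life: float = 60.0) -> float:
    # An arbitrary true exponential (one-hour half-life) for contrast.
    return 100 * 0.5 ** (t_minutes / half_life)

# The shapes diverge badly at long delays: the exponential collapses
# toward zero, while Ebbinghaus's curve flattens out.
for t in (20, 60, 8 * 60, 24 * 60, 31 * 24 * 60):
    print(t, round(ebbinghaus_savings(t), 1), round(exponential_decay(t), 1))
```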
<p>Here's my performance (LOWESS-smoothed) on my fourth response after I get the first three responses on a flashcard correct:</p>
<p><img src="https://bear-images.sfo2.cdn.digitaloceanspaces.com/nwm/13pm.webp" alt="Screenshot 2026-04-24 at 2" /></p>
<p>I've chosen the correct-correct-correct ("CCC") prefix because it has a large sample size and a reasonable spread of intervals between the third and fourth responses.<sup class="footnote-ref" id="fnref-1"><a href="#fn-1">1</a></sup> When I run <a href="https://en.wikipedia.org/wiki/Bayesian_information_criterion">Bayesian information criterion</a> ("BIC") analyses on this, they consistently choose models with the fewest parameters, because more or less any kind of distribution can fit the data very well. (Even a <em>linear</em> fit does almost as well as anything else.)</p>
<p>If I didn't know these were spaced-repetition data, or if I hadn't read that they are supposed to be exponentially distributed,<sup class="footnote-ref" id="fnref-2"><a href="#fn-2">2</a></sup> then neither looking at the data, nor exploring them, nor running BIC or any similar analysis would tempt me to think they are exponentially distributed.</p>
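<p>Here is a sketch of that kind of BIC comparison, on synthetic data. With nearly clean exponential data the exponential fit wins decisively; the point is that on real, noisy retention data the scores come out too close to call:</p>

```python
import math

def bic(rss: float, n: int, n_params: int) -> float:
    # Gaussian least-squares BIC: n*ln(RSS/n) + k*ln(n); lower is better.
    return n * math.log(rss / n) + n_params * math.log(n)

def fit_line(xs, ys):
    # Ordinary least-squares slope and intercept.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Synthetic "retention vs. days" data: exponential decay plus a tiny,
# deterministic wobble so that neither fit is exact.
days = list(range(1, 11))
retention = [0.9 * math.exp(-0.15 * t) + (0.005 if t % 2 else -0.005)
             for t in days]

# Linear model: p = a*t + b
a, b = fit_line(days, retention)
rss_linear = sum((p - (a * t + b)) ** 2 for t, p in zip(days, retention))

# Exponential model, fit in log space: ln p = intercept + slope*t
slope, intercept = fit_line(days, [math.log(p) for p in retention])
rss_exp = sum((p - math.exp(intercept + slope * t)) ** 2
              for t, p in zip(days, retention))

bic_linear = bic(rss_linear, len(days), 2)
bic_exp = bic(rss_exp, len(days), 2)
```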
<p>As always, the hard question is what to <em>do</em> about this. I <a href="https://www.natemeyvis.com/notes-on-spaced-repetition-scheduling/">still take</a> the pragmatic lesson that we should worry less about fine details of algorithms and more about the ergonomics of the broader learning system. Others disagree (explicitly or implicitly). But whatever lesson you draw from it, I've never seen a retention-versus-time curve from my own data that is obviously exponential, and I've <em>certainly</em> never seen one that looks anything like the standard Wikipedia-style schematic.</p>
<hr />
<section class="footnotes">
<ol>
<li id="fn-1"><p><a href="https://www.natemeyvis.com/some-empirical-spaced-repetition-results/">Here</a> is a post with data with a different (CIC) prefix.<a href="#fnref-1" class="footnote">↩</a></p></li>
<li id="fn-2"><p>Not everyone in the spaced repetition community thinks that these should be exponentially distributed; some advocate for power laws or for other models. I don't think this affects the point I'm making.<a href="#fnref-2" class="footnote">↩</a></p></li>
</ol>
</section>
<h2><a href="https://www.natemeyvis.com/why-is-there-so-much-bad-code-at-big-companies/">Why is there so much bad code at big companies?</a> (2026-04-24)</h2>
<p><a href="https://www.seangoedecke.com/bad-code-at-big-companies/">Here is</a> the excellent Sean Goedecke on why there is so much bad code at big companies. He's correct that:</p>
<ol>
<li>Many big-company engineers are working on relatively unfamiliar codebases, and this makes it harder to ship good code;</li>
<li>Big companies prioritize things (e.g., legibility) that are not code quality;</li>
<li>Code quality is partly a function of the code-review process, but code review is incentivized unevenly at best.</li>
</ol>
<p>(Those are all my paraphrases.) They're all correct, but I don't think they're the most important drags on code quality at big companies. I'd cite these:</p>
<ol>
<li>Encapsulation is the main determinant of code quality; most engineers have not mastered encapsulation; and even 95th-percentile engineers struggle to get encapsulation right.</li>
<li>Encapsulation is much harder at big companies, because systems have to be more complex at their scale. Simple solutions either don't work at all or carry costs that are negligible at modest scale but huge at big-company scale. (Authentication, load balancing, and version control are good examples here.)</li>
<li>In some big-company cultures, caring too much about micro-level code quality is <em>punished</em>: it marks you as someone who's insufficiently serious about system design and "higher-level" issues more generally. There are places where you <em>really</em> don't want to look like a sous-chef.</li>
<li><a href="https://www.natemeyvis.com/on-rewarding-simplicity/">As I've discussed before</a>, over-complicated code often benefits the person who wrote it: they are uniquely positioned to maintain and fix it. Even though most engineers are not Machiavellian schemers, incentives matter.</li>
<li>Good code is a matter of craft, and often subtle; it's hard to recognize and reward at scale. (See, again, <a href="https://www.natemeyvis.com/on-rewarding-simplicity/">this</a> piece and <a href="https://www.natemeyvis.com/a-model-of-how-simplicity-gets-rewarded/">this more optimistic</a> one.)</li>
</ol>
<h2><a href="https://www.natemeyvis.com/the-case-against-llm-prose/">The case against LLM prose</a> (2026-04-23)</h2>
<p>I like LLM-generated code <a href="https://www.natemeyvis.com/on-cognitive-debt/">more than</a> <a href="https://www.natemeyvis.com/cognitive-debt-and-optimism/">most people do</a>, but I'm pretty sure I dislike LLM-generated English more than most people do. There's plenty of anti-LLM sentiment out there, but surprisingly (to me) little theorizing about what is wrong with it, and why. So, here's my view:</p>
<p>A lot of what we get from writing, we get over time or from scrutiny. In both fiction and nonfiction, we often underrate how much of a text's meaning and import comes to us only after investments of time and effort. LLM-generated English does not repay those investments the way human-generated English can. So, presenting LLM-generated English as human-generated English violates a trust. You are implicitly asking someone to consider the writing with care and to think beyond its surface, but it will not reward that care and time.<sup class="footnote-ref" id="fnref-1"><a href="#fn-1">1</a></sup></p>
<p>Relatedly, LLM-generated writing is disproportionately manipulative. Precisely because it can't do much else, it trades in formulaic contrasts, cheap sensationalism, and flattery. Partly because it suggests a payoff that isn't there, it dulls the reader's receptivity to contrasts that are truly interesting, facts that are really sensational, and so on. More and more of the written environment feels like 20th-percentile vaudeville humor or partisan news: the recitation of cheap formulas aimed at emotional weak points, in the guise of something else.</p>
<p>I hope it's obvious that this doesn't apply to <em>all</em> AI prose. I like it when chatbots give me prose that's more digestible than bullet points, and I'm glad people can use AI for translation. Moreover, things like formal requests and lawsuits don't carry the same assumptions, and none of this applies to those.</p>
<p>My finger-to-the-wind sense is that people are mostly shrugging their shoulders about this, except when it comes to things like school essays and job interviews. The sentiment, if I had to guess at an articulation of it, is that in a world already stuffed with TikTok and video games and sound bites, rampant LLM-generated prose is just more of the same. If I'm right, though, LLM-generated prose is more corrosive than that.</p>
<section class="footnotes">
<ol>
<li id="fn-1"><p>Maybe future AI writing will reward such care. I doubt it, but I can't be sure. I'm talking about what is respectful <em>now</em>.<a href="#fnref-1" class="footnote">↩</a></p></li>
</ol>
</section>
<h2><a href="https://www.natemeyvis.com/kirkby-and-matuschak-on-making-flashcards-with-llms/">Kirkby and Matuschak on making flashcards with LLMs</a> (2026-04-22)</h2>
<p><a href="https://memory-machines.com/report">Here</a> are Ozzie Kirkby and Andy Matuschak reporting on their attempts to get LLMs to generate and evaluate good flashcards from their reading notes. It's both well-written<sup class="footnote-ref" id="fnref-1"><a href="#fn-1">1</a></sup> and relevant. They work with good ideas of how flashcards ought to be written, or at least ideas that are <a href="https://www.natemeyvis.com/common-errors-in-flashcard-composition/">close to mine</a>.</p>
<p>Some notes:<sup class="footnote-ref" id="fnref-2"><a href="#fn-2">2</a></sup></p>
<ol>
<li>This is what it looks like when the authors actually care about the thing they're studying: diversity of approach, creativity, and tenacity. I evaluate this whole genre of essay on a spectrum from "<em>really</em> care about figuring this out" to "trying to get an A from some real or imagined teaching assistant." This is very much on the good side.</li>
<li>The project of <em>fully</em> automating the highlight-to-card process via LLM is, to me, undermotivated. The <a href="https://www.natemeyvis.com/ingestion/">larger pipeline here</a> includes reading, highlighting, thinking, card composition, and intermittent study. Given that, I'm not so worried that, e.g., "even the strongest model we tested (GPT-5.2) still produces unusable prompts roughly a third of the time." A strong flashcarder should be able to recognize and cull those very quickly, especially relative to the overall time commitment of studying something.<sup class="footnote-ref" id="fnref-3"><a href="#fn-3">3</a></sup> Here as elsewhere, I'm less concerned than others about whether AI can do 100% of something, and more concerned about whether it can do 25%, 50%, 75%, and 95% of it.</li>
<li>I'm glad that they tried fine-tuning, and found their various efforts here useful. Again, I draw a more optimistic conclusion than their "we got cheaper judges, not better ones," or perhaps the same conclusion in a more optimistic tone. Cheaper judges are good! I'm particularly interested in whether several cheaper judges could be aggregated, either now or in a near future of <a href="https://www.natemeyvis.com/are-major-ai-tools-diverging/">more sharply distinguished</a> models.</li>
<li>I'm grateful for their work in evaluating all those models so carefully, but <a href="https://www.natemeyvis.com/another-reason-we-cant-measure-our-productivity-with-ai/">am still in the "benchmarks have never been less useful" camp</a>.</li>
<li>I strongly agree that the training data are mostly bad and that most flashcarders' processes are not optimized for the sort of learning that interests Kirkby and Matuschak here. I disagree in places, however, with their views on how highlighted material should be captured in a flashcard. So, for example, I don't think it's so bad simply to memorize the traditional three factors of production.<sup class="footnote-ref" id="fnref-4"><a href="#fn-4">4</a></sup> I'd also go about studying their "humans flying by flapping wings on Titan" example differently, but the details here would require another post.</li>
<li>As an experiment, I asked Claude to make me 60 flashcard candidates from my Norton-anthology highlights: I'm <a href="https://www.natemeyvis.com/the-norton-anthology-lifestyle/">still</a> studying 19th-century British literature and using it as a way into the politics and history of the period. Most of the candidates were bad, but many were usable or editable. Claude and I working together were a lot more efficient than me working alone. This is in part because Claude had access to a local SQLite database of all my questions and responses and could query it to learn about how I write this kind of card, where my library's gaps are, and so on.<sup class="footnote-ref" id="fnref-5"><a href="#fn-5">5</a></sup> Again, culling and editing wasn't the time-consuming part (and note that this culling and editing is both educational and, for me, pleasant!).<sup class="footnote-ref" id="fnref-6"><a href="#fn-6">6</a></sup></li>
</ol>
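<p>On the arithmetic behind point 2: with some purely hypothetical per-card times (all of the numbers below are invented for illustration), culling a third of the candidates is cheap relative to writing every card by hand:</p>

```python
# Hypothetical per-card times, purely to illustrate the accounting:
# reviewing LLM candidates can beat writing every card by hand even
# when a third of the candidates are unusable.

def llm_assisted_minutes(n_cards: int, unusable_rate: float,
                         cull_min: float, edit_min: float) -> float:
    """Minutes to end up with n_cards usable cards from LLM candidates."""
    candidates_needed = n_cards / (1 - unusable_rate)
    culled = candidates_needed - n_cards
    return culled * cull_min + n_cards * edit_min

n = 60
from_scratch = n * 2.0                  # assume 2 minutes to write a card by hand
with_llm = llm_assisted_minutes(n, unusable_rate=1 / 3,
                                cull_min=0.1,   # seconds to reject a bad candidate
                                edit_min=0.5)   # light editing of a usable one
```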
<p>So: this is valuable work (unless I'm forgetting something, the best thing I've read about spaced repetition this year), but I'd encourage a different picture of human-AI cooperation in spaced repetition. Relatedly, I'm more optimistic than the authors about LLMs' helping us make good flashcards.</p>
<hr />
<section class="footnotes">
<ol>
<li id="fn-1"><p>...and, Pangram and I agree, at least mostly written by humans.<a href="#fnref-1" class="footnote">↩</a></p></li>
<li id="fn-2"><p>The usual disclaimer: I make <a href="https://www.zippyflash.com/">Zippyflash</a> and have used it for years. I'm invested in Zippyflash on many dimensions: primarily as a user, but also financially and ideologically.<a href="#fnref-2" class="footnote">↩</a></p></li>
<li id="fn-3"><p>This is why Zippyflash distinguishes fundamentally between cards and card <em>candidates</em>, and the API makes it easy for LLMs to submit candidates for your review. For a bit more about LLM-centric API design, see <a href="https://www.natemeyvis.com/a-first-guide-to-building-apis-with-ai/">here</a>.<a href="#fnref-3" class="footnote">↩</a></p></li>
<li id="fn-4"><p>Whether these should be memorized as one card or many depends, I think, on (e.g.) how likely you are to remember some but not all of them. I'm not saying that "What are the traditional three factors of economic production?" is the right flashcard, but just that memorizing this "textbook list" is not a bad idea even if your goal is a much different kind of understanding.<a href="#fnref-4" class="footnote">↩</a></p></li>
<li id="fn-5"><p>The details here are less important than the idea that, again, good flashcards are less likely to come from a simple highlight-to-card LLM call and more likely to come from a more complicated human-AI partnership. (For now, at least.)<a href="#fnref-5" class="footnote">↩</a></p></li>
<li id="fn-6"><p>It's not so hard to get your Kindle highlights into a format your LLM can use, but it's a clunky process.<a href="#fnref-6" class="footnote">↩</a></p></li>
</ol>
</section>
<h2><a href="https://www.natemeyvis.com/the-case-against-worrying-about-your-posts-analytics/">The case against worrying about your posts' analytics</a> (2026-04-21)</h2>
<p>I've had several conversations recently in which my interlocutor said that they tried writing online, got very few readers on their posts, and gave up (or, at least, feels discouraged about it). I think it's usually a mistake to worry about low readership on a post:</p>
<ol>
<li>As Patrick McKenzie has said repeatedly,<sup class="footnote-ref" id="fnref-1"><a href="#fn-1">1</a></sup> much of the public, professional value of having written something is totally independent of its analytics. Simply being able to prove you've thought about something before has a lot of value (e.g., in job applications).</li>
<li>It's good to write, and most of the reasons for this (e.g., its forcing you to clarify your thoughts) are independent of the number of readers a piece has. (I would say more, but you've probably seen many arguments to this effect.)</li>
<li><em>Even if your goal is influence</em>, maximizing the number of readers is usually not the best way to get it. Saying something that is useful and intelligible in a niche is not just better than, but <em>more influential than</em>, getting a lot of low-quality clicks on something that will barely be processed and not be remembered.</li>
<li><em>Even if your goal is money</em>, most of the best ways to turn writing into money do not involve getting lots of clicks on posts.</li>
<li><em>Even if you want high reader counts,</em> writing online can be cumulative, and cumulative in unexpected ways. Low-traffic posts can be an important catalyst for, or supporting link in, a future, more popular post. They can also build trust and credibility in other, indirect ways (e.g., by establishing that you've been around a while).</li>
<li><em>Even if you want high reader counts,</em> there is a large, seemingly random component to post popularity. Many of us have no good way to predict which of our posts will be more and less popular. To the (small) extent I care about analytics, I view my posts in large part as lottery tickets. Most of them are not attention-winners.</li>
<li>People use RSS readers, and analytics tools can be quirky, so you usually can't be sure your readership is as low as you think it is.</li>
</ol>
<p>So: analytics dashboards are a tempting scorecard, but they usually aren't measuring what you have reason to care about.</p>
<hr />
<section class="footnotes">
<ol>
<li id="fn-1"><p>I imagine that a lot of this post is internalized <a href="https://www.kalzumeus.com">patio11</a> teachings, but his way of looking at this is so deeply ingrained in certain subcultures that it's hard to find specific citations. Actually, it's so ingrained that I hadn't thought to write this down until I found myself having this conversation repeatedly.<a href="#fnref-1" class="footnote">↩</a></p></li>
</ol>
</section>
<h2><a href="https://www.natemeyvis.com/spaced-repetition-scheduling-and-categorization/">Spaced repetition scheduling and categorization</a> (2026-04-20)</h2>
<p>Should I consider the subject matter of a flashcard explicitly when scheduling reviews? On the one hand, more information is often better when determining a scheduling interval; on the other hand, we want to avoid complexity and overfitting, and one might reasonably hope that category-by-category differences would be captured in my performance itself, without the scheduler needing to look at categories explicitly.</p>
<p>I'm approaching a million flashcard reviews and have never used category data in a scheduling algorithm. In that time, I've made a lot of flashcards about movies.<sup class="footnote-ref" id="fnref-1"><a href="#fn-1">1</a></sup> Here is my performance in two common response-pattern situations, split between questions about movies and questions not about movies.</p>
<p>First, when I make a card and get the first review correct, here's my performance on the second review:</p>
<p><img src="https://bear-images.sfo2.cdn.digitaloceanspaces.com/nwm/57am.webp" alt="Screenshot 2026-04-20 at 11" /></p>
<p>And when I get my first two reviews correct, here's my performance on the third review:</p>
<p><img src="https://bear-images.sfo2.cdn.digitaloceanspaces.com/nwm/20am.webp" alt="Screenshot 2026-04-20 at 11" /></p>
<p>Some notes and caveats:</p>
<ol>
<li>These are just some high-volume patterns I checked; if you think there's some other category-specific analysis I ought to do, I'd be grateful to know it.</li>
<li>I don't organize my studying by "decks"; I study my cards all together, and those cards are tagged by category.<sup class="footnote-ref" id="fnref-2"><a href="#fn-2">2</a></sup></li>
<li>Not all my movie-related cards are tagged as such, and a lot of my tagging has been done for me by AI (but that's another post). So these data are not even close to perfect. That said: (i) I've manually inspected a lot of LLM-generated tags, and they look good, and (ii) insofar as I have movie cards that are still untagged, my non-movie / movie gap would be <em>under</em>stated.</li>
<li>As with all my analyses, I'm working with self-experimental data on which all sorts of hidden mechanisms might be operating. So, for example, I might have made a lot of flashcards about movies when I was scheduling reviews somewhat differently or when I was systematically sleep-deprived. I really doubt that any effect like this is operative here, but these are definitely one person's idiosyncratic data.</li>
<li>It's not clear to me what, if anything, I should <em>do</em> about this. The goal of scheduling is not, ultimately, to predict how likely I am to remember a card; it is to <em>learn things durably</em>. I'd only want to schedule movie-card reviews more conservatively if I were confident it would get me to long-run retention more efficiently. I suspect it would, but I can't be sure. And, as with long-interval performance, <a href="https://www.natemeyvis.com/spaced-repetition-performance-after-long-intervals/">I'm not quite sure even what questions to be asking</a>.</li>
</ol>
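<p>A sketch of the kind of split behind the two charts above, on an invented in-memory log (the real data live in a database, and the real analysis adds LOWESS smoothing and interval information):</p>

```python
# Hypothetical in-memory review log: (tags, ordered True/False responses)
# per card. The real data live in a SQLite database; this shape is
# invented for illustration.
cards = [
    ({"movies"},  [True, True]),
    ({"movies"},  [True, False]),
    ({"history"}, [True, True]),
    ({"poetry"},  [True, True]),
]

def second_review_accuracy(cards, tag, want_tagged):
    """Accuracy on review 2, among cards whose review 1 was correct,
    restricted to cards that do (or don't) carry `tag`."""
    hits = total = 0
    for tags, responses in cards:
        if (tag in tags) != want_tagged:
            continue
        if len(responses) < 2 or not responses[0]:
            continue
        total += 1
        hits += responses[1]
    return hits / total

movie_acc = second_review_accuracy(cards, "movies", True)
non_movie_acc = second_review_accuracy(cards, "movies", False)
```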
<hr />
<section class="footnotes">
<ol>
<li id="fn-1"><p>All the trivia competitions I care about have questions about movies, and it's a weak area for me. It's also a subject that's amenable to study in that (i) there's a core of high-value information to learn (e.g., Best Picture winners) and (ii) it's not hard to organize, find, and present the information.<a href="#fnref-1" class="footnote">↩</a></p></li>
<li id="fn-2"><p>Once in a while I use a feature that lets me filter cards by tag while studying, but I much prefer just to study everything together. It's more pleasant; it helps me break up clusters of related questions; and it helps keep me from guessing the answer from the category of a question.<a href="#fnref-2" class="footnote">↩</a></p></li>
</ol>
</section>
<h2><a href="https://www.natemeyvis.com/spaced-repetition-performance-after-long-intervals/">Spaced repetition performance after long intervals</a> (2026-04-19)</h2>
<p>Yesterday <a href="https://www.natemeyvis.com/some-empirical-spaced-repetition-results/">I wrote about</a> the difficulties of optimizing a scheduling algorithm, and mentioned that I don't know enough about how memory behaves when the intervals get into the years.</p>
<p>Here's some data. First, here's a LOWESS-smoothed graph of my correctness rate as a function of interval time, for intervals of at least one year. The black line is overall performance; the other three lines give my performance on questions I'd (i) never gotten wrong before, (ii) had gotten wrong only once, and (iii) had gotten wrong at least twice.</p>
<p><img src="https://bear-images.sfo2.cdn.digitaloceanspaces.com/nwm/522x.webp" alt="CleanShot 2026-04-19 at 13" /></p>
<p>Some notes:</p>
<ol>
<li>It looks like I'm too aggressive in scheduling long intervals for questions I've gotten wrong several times. This is interesting to me, because my algorithms have already weighted previous performance substantially.</li>
<li>I had expected the "never wrong" (blue) line to be above 90%, in part because a lot of those are probably quite easy cards. Perhaps I'm too aggressive with those also.</li>
<li>After I (i) never get a question wrong, (ii) go at least a year without answering the question, (iii) get it wrong, and (iv) answer it at least twice after <em>that</em>, my subsequent performance is perfect about 2/3 of the time (501 of 741).</li>
</ol>
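<p>The split between the lines in the chart is a simple grouping over the review log. A sketch, assuming each long-interval review is reduced to a (prior_wrong_count, correct) pair (a hypothetical structure, not my actual schema):</p>

```python
from collections import defaultdict

def correctness_by_prior_wrongs(reviews):
    """Group reviews by how often the card was previously missed
    (0, 1, or 2+) and return the correctness rate for each group.

    `reviews` is an iterable of (prior_wrong_count, correct) pairs.
    """
    totals = defaultdict(lambda: [0, 0])   # bucket -> [correct, seen]
    for wrongs, correct in reviews:
        bucket = min(wrongs, 2)            # 2 means "wrong at least twice"
        totals[bucket][0] += bool(correct)
        totals[bucket][1] += 1
    return {b: c / n for b, (c, n) in totals.items()}
```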
<p>The most common sequence before a year-or-more interval is exactly seven correct answers (with none incorrect; <em>N</em>=1403). Here's my performance vs. interval on those:</p>
<p><img src="proxy.php?url=https%3A%2F%2Fbear-images.sfo2.cdn.digitaloceanspaces.com%2Fnwm%2F302x-1.webp" alt="CleanShot 2026-04-19 at 13" /></p>
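<p>Finding the most common pre-interval sequence is a counting exercise. A sketch, assuming each card's reviews are an ordered list of (interval_days, correct) pairs (hypothetical field names, not my actual schema):</p>

```python
from collections import Counter

def pre_interval_histories(cards, min_days=365):
    """For each card, collect the C/I answer string accumulated before
    its first interval of at least `min_days` days.

    `cards` maps card id -> ordered list of (interval_days, correct).
    """
    histories = Counter()
    for reviews in cards.values():
        prefix = []
        for interval, correct in reviews:
            if interval >= min_days:
                histories["".join(prefix)] += 1
                break
            prefix.append("C" if correct else "I")
    return histories
```

<p><code>histories.most_common(1)</code> then gives the dominant sequence; in my data it is "CCCCCCC".</p>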
<p>With all the usual caveats (I've changed my algorithm over time, and perhaps my cards systematically vary across time also), this does suggest that little or nothing is gained by waiting only a year instead of a bit longer in this case.</p>
2026-04-19T17:13:00+00:00
https://www.natemeyvis.com/some-empirical-spaced-repetition-results/
Some empirical spaced-repetition results
2026-04-19T00:44:10.259661+00:00
<p>I am a <a href="proxy.php?url=https%3A%2F%2Fwww.natemeyvis.com%2Fi-did-301432-flashcard-reviews-in-2025%2F">diligent flashcarder</a>, and <a href="proxy.php?url=https%3A%2F%2Fwww.zippyflash.com">the tool I made</a> uses a lot of randomness in its scheduling, for reasons I go into <a href="proxy.php?url=https%3A%2F%2Fwww.natemeyvis.com%2Fnotes-on-spaced-repetition-scheduling%2F">here</a>.</p>
<p>I have well over 900,000 responses now, which is enough to let me do reasonable analysis of even fairly specific situations. So, for example: after a certain pattern of correct and incorrect answers, how sensitive is the correctness rate of the next response to the next interval?</p>
<p>Concretely: If I make a flashcard and my first three responses to it are (in this order) correct, incorrect, and correct, what is the relationship between the <em>next</em> time interval and my likelihood of getting the next (fourth) response correct?</p>
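<p>Extracting the underlying data is a filter over the review log. A sketch, assuming each card's reviews are an ordered list of (interval_hours, correct) pairs (a hypothetical schema, not my actual one):</p>

```python
def fourth_response_by_interval(cards, pattern=(True, False, True)):
    """For cards whose first three answers match `pattern` (here:
    correct, incorrect, correct), pair the fourth review's interval
    with whether that fourth answer was correct.

    `cards` is an iterable of review lists, each review a
    (interval_hours, correct) pair.
    """
    pairs = []
    for reviews in cards:
        if len(reviews) < 4:
            continue
        if tuple(correct for _, correct in reviews[:3]) == pattern:
            interval, correct = reviews[3]
            pairs.append((interval, correct))
    return pairs
```

<p>Bucketing the resulting (interval, correct) pairs by interval and averaging correctness within each bucket produces the chart below.</p>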
<p>Here's a chart:</p>
<p><img src="proxy.php?url=https%3A%2F%2Fbear-images.sfo2.cdn.digitaloceanspaces.com%2Fnwm%2Fcic_4_only.webp" alt="cic_4_only" /></p>
<p>And here's the <a href="proxy.php?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FLocal_regression">LOWESS</a>-smoothed version:</p>
<p><img src="proxy.php?url=https%3A%2F%2Fbear-images.sfo2.cdn.digitaloceanspaces.com%2Fnwm%2F462x.webp" alt="CleanShot 2026-04-18 at 20" /></p>
<p>(In the former chart, intervals of at most 8 hours and at least 40 hours are bucketed together, so the endpoint buckets each aggregate many different intervals.)</p>
<p>A few notes:</p>
<ol>
<li>The very small intervals are mostly due to questions coming up randomly before they were scheduled to reappear. (My software injects random questions into study sessions, primarily to make it harder to guess the answer based on when I'm seeing the question but also to enable analyses like this.)</li>
<li>The bump around 24 hours is interesting. My best guess is that this has something to do with consistency of environment: if I study a card on my lunch break or in bed in the morning, it's plausible that my performance will be better if I next see it in the same surroundings. Intervals very close to 24 hours are quite likely to keep surroundings constant.</li>
<li>I'd call this a good data set, but not perfect. I've changed my scheduling algorithm over time, and it's possible that the questions I made while using one algorithm were systematically different from the ones I made using another algorithm. I <em>don't</em> think it's plausible that effects like this are happening here at any meaningful magnitude, but it's worth noting.</li>
<li>Those caveats aside, I suspect this data reflects a <em>lot</em> more precision and care than a lot of the memory data "in the wild." Some obvious reasons for this are that I'm quite honest and consistent about grading myself, and that I've kept this up for years. The less obvious reason is that I write narrowly focused, "atomic" flashcards that admit of well-defined grading in the first place. (If you've spent any time looking at publicly available flashcards, you know that a lot of them have prompts like "Why is RuBisCO important?" or "James Buchanan," and their answers are long and contain perhaps a dozen important facts. That makes it a lot harder to give a well-defined assessment to an answer.)</li>
<li>The important question to be asking is: How should this change my scheduling algorithm? One way to think about this is: which intervals minimize the total effort to remember the card for the rest of my life (or, for concreteness and to sidestep morbid questions, for 20 years)? This is a hard question to answer empirically. The obvious brute-force approach would be to see how I do on questions where my first four responses are correct-incorrect-correct-correct ("CICC") and on CICI questions, then do an expected value calculation. But at least three obvious problems arise: the sample sizes get smaller and smaller; we don't know all that much about how spaced repetition works at decades-long scale; and, probably worst, the CICC question <em>that would have been a CICI question had I waited an extra two hours</em> is not the same as the average CICC question. So this is a tough problem!</li>
<li>That said: it's a tough problem, but unless I'm mistaken I'm one of the likeliest people in the world to have the data and machinery to answer it, so I'm going to keep thinking about it.</li>
</ol>
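<p>The random-injection idea in note 1 can be sketched as follows. This is an illustration under assumed names, not the tool's actual implementation:</p>

```python
import random

def build_session(due_cards, all_cards, inject_rate=0.1, rng=None):
    """Build a study session: the due cards plus a small random sample
    of not-yet-due cards, shuffled together.

    Injecting random cards both masks scheduling cues (you can't infer
    the answer from when a card appears) and yields response data at
    intervals the scheduler would never have chosen, which is what
    makes interval-sensitivity analyses possible at all.
    """
    rng = rng or random.Random()
    pool = [c for c in all_cards if c not in due_cards]
    n_extra = int(len(due_cards) * inject_rate)
    session = list(due_cards) + rng.sample(pool, min(n_extra, len(pool)))
    rng.shuffle(session)
    return session
```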
2026-04-18T18:42:00+00:00