Home on Nathaniel Thomas

Continual Learning is not Continual Midtraining

Wed, 07 Jan 2026 00:00:00 +0000

Many have caught onto the truth that AGI-through-pure-LLM-scaling is probably not going to happen. And many have identified continual learning as the key difference between LLMs and a generally intelligent agent. If you’ve ever used Claude Code, you will be acutely aware of how effective context length limits LLMs’ general utility, and if only we had something that wouldn’t run out of context, we could all finally be unemployed.

A tempting (and sensible) attempt to solve this is continual midtraining. For example, Anthropic could collect successful Claude Code traces and fold them back into the SFT stage for its next model, releasing a new model on a monthly basis. This might produce very powerful coding agents, but it will not give them the ability to fully automate jobs. Why? Because all this procedure does is continually improve the LLM’s world model, which is distinct from its world state. Its world state only exists within its position-embedded KV cache.

Advent of Code 2025 in Haskell

Mon, 01 Dec 2025 00:00:00 +0000

It’s that time of year again.

You can try out the solutions here.

Day 1

module Main where

import Common (parseFile)
import Control.Applicative ((<|>))
import Data.List (foldl', scanl)
import Data.Text qualified as T
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import Text.Megaparsec.Char.Lexer qualified as Lex
import Text.Printf

data Dir = L | R
  deriving (Show, Eq)

data Rot = Rot Dir Int
  deriving (Show, Eq)

type Parser = Parsec Void T.Text

dirSign :: Dir -> Int
dirSign L = -1
dirSign R = 1

parseRot :: Parser Rot
parseRot = Rot <$> (L <$ char 'L' <|> R <$ char 'R') <*> Lex.decimal

solve1 :: [Rot] -> Int
solve1 rots = length $ filter (== 0) $ scanl update 50 rots
  where
    update pos (Rot d x) = (pos + (dirSign d) * x) `mod` 100

solve2 :: [Rot] -> Int
solve2 rots = snd $ foldl' expand (50, 0) rots
  where
    expand (pos, count) (Rot d x) =
      let steps = [pos + dirSign d * n | n <- [1 .. x]]
       in ((pos + dirSign d * x) `mod` 100, count + length (filter (\p -> p `mod` 100 == 0) steps))

main :: IO ()
main = do
  sampleRots <- parseFile (parseRot `sepEndBy` newline) "input/day01_sample.txt"
  rots <- parseFile (parseRot `sepEndBy` newline) "input/day01.txt"
  printf "Part 1 Sample answer: %s\n" (show $ solve1 sampleRots)
  printf "Part 1 Final answer: %s\n" (show $ solve1 rots)
  printf "Part 2 Sample answer: %s\n" (show $ solve2 sampleRots)
  printf "Part 2 Final answer: %s\n" (show $ solve2 rots)

Yes, I know part 2’s solution is inefficient—it’s prettier this way.
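For comparison, the zero crossings can also be counted without enumerating every step. This is a minimal sketch in Python, assuming the same dial semantics as the Haskell above (100 positions, start at 50, count every single step that lands on 0). The names `zero_hits` and `solve2` and the `(direction, amount)` tuple encoding are illustrative, not part of the original solution.

```python
def zero_hits(pos: int, direction: int, x: int) -> int:
    """Count steps n in 1..x for which (pos + direction * n) % 100 == 0."""
    if direction > 0:
        # Multiples of 100 in the half-open interval (pos, pos + x].
        return (pos + x) // 100 - pos // 100
    # Multiples of 100 in [pos - x, pos - 1]; floor division handles negatives.
    return (pos - 1) // 100 - (pos - x - 1) // 100


def solve2(rots, start=50):
    """rots is a list of (direction, amount) pairs with direction in {-1, +1}."""
    pos, count = start, 0
    for direction, x in rots:
        count += zero_hits(pos, direction, x)
        pos = (pos + direction * x) % 100
    return count
```

Each rotation now costs constant time instead of time proportional to its size, at the price of some off-by-one care in the leftward case.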

Comparing Structured Data Formats for LLMs

Sun, 12 Oct 2025 00:00:00 +0000

As we start training LLMs as Agents, we must think about how to best pass information to and from the real-world environment. If it calls an external function, how should arguments be passed? How should data from the environment be fed to the model? The simplest (and most general) solution uses structured data formats, such as JSON. These formats can encode arbitrarily nested, heterogeneously typed data structures.

But is JSON the right choice? We have many options, such as TOML, YAML, XML, et al. In this post, we consider and measure metrics that will help us make the right choice.

The best T-Shirt

Thu, 10 Apr 2025 00:00:00 +0000

I recently bought Bryan Johnson’s Super Veggie T-Shirt, in order to fully immerse myself in his protocol.

It was $37 (not a terrible price), and I think it looks cool. But once I received it, I noticed that the quality was markedly better than any other t-shirt I own, even those that were double the price. Hence, my quest for the best tee. The goal is to optimize for the quality/cost ratio. As a starting point, I’m looking at only 6.1 oz/yd^2 tees, the same weight as the Super Veggie one. For convenience, I’m only going to review those available on Amazon.

How to use $\LaTeX$ in Excalidraw

Thu, 27 Mar 2025 00:00:00 +0000

Excalidraw currently doesn’t support $\LaTeX$, which sucks. The workaround is to generate an SVG for whatever math you want to render, and paste that in.

You can use this script to generate the SVG:

import matplotlib.pyplot as plt

# use svg backend
plt.switch_backend('svg')

# enable latex rendering
plt.rcParams['text.usetex'] = True

# create figure
fig, ax = plt.subplots(figsize=(4, 2))
ax.axis('off')  # no axes

latex_str = r"$\nabla \mathcal J(\theta)$"
ax.text(0.5, 0.5, latex_str, fontsize=20, ha='center', va='center')

fig.savefig("latex.svg", format="svg", bbox_inches='tight', transparent=True)

Et voilà:

Sharpe Ratio Based Portfolio Simulator

Thu, 06 Mar 2025 00:00:00 +0000

The Sharpe Ratio measures the quality of an equity or hedge fund by showing the return per unit of risk, calculated as $\frac{\mu - r}{\sigma}$, where $\mu$ is the expected return, $r$ is the risk-free rate, and $\sigma$ is the standard deviation (volatility). A higher ratio indicates better performance for the risk taken: more return without excessive variability. In the simulator, the standard deviation slider ($\sigma$) controls the level of risk you accept: increasing it means you’re comfortable with greater fluctuations in value, while decreasing it reduces exposure to variability, reflecting how much uncertainty you’re willing to tolerate.
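As a rough illustration of the ratio (expected return minus the risk-free rate, divided by volatility), here is a small Python sketch. It estimates the expected return and the standard deviation from a list of per-period returns; the function name, the sample numbers, and the choice of the sample standard deviation are all assumptions for illustration, not taken from the simulator.

```python
from statistics import mean, stdev


def sharpe_ratio(returns, risk_free_rate=0.0):
    """Excess mean return per unit of volatility (sample standard deviation)."""
    excess = mean(returns) - risk_free_rate
    return excess / stdev(returns)


# Hypothetical per-period returns for two funds with the same mean.
steady = [0.02, 0.01, 0.03, 0.02]
volatile = [0.10, -0.06, 0.08, -0.04]

# The steadier fund earns the same average return with far less variability,
# so its ratio comes out higher.
print(sharpe_ratio(steady, 0.005), sharpe_ratio(volatile, 0.005))
```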

Entropy from First Principles

Wed, 01 Jan 2025 00:00:00 +0000

I find entropy to be extremely fascinating. But matching the formula $\sum_i p_i \log \frac{1}{p_i}$ to its “intuitive” explanations in terms of prefix-free codes and information content is not obvious. Here, I want to go over a couple of ways to independently arrive at the idea.
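To make the coding connection concrete, here is a small Python sketch (the function names are illustrative). It computes the entropy of a distribution and compares it with the expected length of a code that gives outcome $i$ a codeword of $\log_2 \frac{1}{p_i}$ bits. The example assumes a dyadic distribution, where those lengths are whole numbers.

```python
import math


def entropy(probs, base=2.0):
    """Sum of p * log(1/p) over outcomes with nonzero probability."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)


def expected_code_length(probs, lengths):
    """Average codeword length when outcome i uses lengths[i] symbols."""
    return sum(p * n for p, n in zip(probs, lengths))


probs = [0.5, 0.25, 0.25]
# One possible prefix-free code for these outcomes: "0", "10", "11".
lengths = [1, 2, 2]
print(entropy(probs), expected_code_length(probs, lengths))
```

For this distribution the two quantities coincide, because each codeword length equals $\log_2 \frac{1}{p_i}$ exactly.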

Advent of Code 2024 in Haskell

Sat, 21 Dec 2024 00:00:00 +0000

I’m doing AoC in Haskell to learn the language. These are my solutions.

Day 1

{-# LANGUAGE TupleSections #-}
-- TupleSections enables the (,1) section used in counter below.

import Data.List
import qualified Data.Map as Map

f xs =
  let x1s = sort $ map fst xs
      x2s = sort $ map snd xs
      diff x y = abs (x - y)
   in sum $ zipWith diff x1s x2s

counter = Map.fromListWith (+) . map (,1)

sim xs =
  let c = counter (map snd xs)
   in sum [x * Map.findWithDefault 0 x c | x <- map fst xs]

main = do
  l <- readFile "data1.txt"
  let xs = [(read x, read y) | [x, y] <- map words (lines l)]
  print (f xs)
  print (sim xs)

Pretty clean, I don’t think I can make it nicer.

Day 2

allSame [] = True
allSame (x : xs) = all (== x) xs

monotonic xs = allSame (zipWith (\x y -> signum (x - y)) xs (tail xs))

diffValid (x : y : xs)
  | abs (x - y) >= 1 && abs (x - y) <= 3 = diffValid (y : xs)
  | otherwise = False
diffValid _ = True

isSafe xs = monotonic xs && diffValid xs

without xs i = [x | (x, j) <- zip xs [1 ..], j /= i]

isSafeDamp xs = isSafe xs || any (isSafe . without xs) [1 .. length xs]

main = do
  content <- readFile "data2.txt"
  let parsed = map (\l -> map read (words l)) (lines content) :: [[Int]]
  let nSafe = length (filter isSafe parsed)
  let nSafeDamp = length (filter isSafeDamp parsed)
  -- putStrLn prints the message as-is; print would wrap it in quotes
  putStrLn ("number of safe elements: " ++ show nSafe)
  putStrLn ("number of safe elements (damping): " ++ show nSafeDamp)

This is also quite clean and straightforward.

This Website

Tue, 17 Dec 2024 00:00:00 +0000

This entire site is static. All the visualizations are running completely in the browser.

I use Hugo to build the site. It’s pretty neat, since its template language lets me program a lot of features statically, without any JavaScript. Even the $\LaTeX$ on this site is statically rendered!

The theme is based off of Typo by tomfran, but I’ve made a bunch of UI tweaks to my liking (like the slick Table of Contents on widescreen).

Interactive Gaussian Mixture Models

Fri, 06 Dec 2024 00:00:00 +0000

Goal

Suppose we have a dataset of features, but no labels. If we know (or guess) that there are $K$ classes in the dataset, we could model the dataset as the weighted average of $K$ class–conditional Gaussians. This is what Gaussian Mixture Models do.

We assume that the model is parameterized by $\theta = \{\pi_k, \mu_k, \sigma_k^2\}_{k=1}^{K}$, where $\pi_k$ determines the weight of the $k$th Gaussian in the model.
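As a sketch of what this parameterization means, the mixture density at a point is the weighted sum of the $K$ Gaussian densities. The Python below assumes one-dimensional features and made-up parameter values; the function names are illustrative only.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x, weights, means, variances):
    """Mixture density: sum over k of pi_k * N(x; mu_k, sigma_k^2)."""
    return sum(
        w * gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# Hypothetical two-component model; the weights must sum to 1.
weights, means, variances = [0.3, 0.7], [-1.0, 2.0], [0.5, 1.5]
print(gmm_pdf(0.0, weights, means, variances))
```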

The Zed Text Editor

Wed, 04 Dec 2024 00:00:00 +0000

I am a Neovim diehard, but it is impossible to use over SSH. Since I do ML research, all my code runs on a remote server with high power GPUs. Reluctantly, I have been using VSCode, for its excellent remote-ssh plugin. But even with its half-baked Vim mode, it is still the same sluggish Electron app.

Zed may be the editor that changes this game. It is extremely fast, supports LSP and Treesitter natively, has some pretty nifty AI features, and has native Vim bindings. It still has some rough edges, and remote development is not as smooth as VSCode, but its lightness makes it worth using for me.

Local Approximation

Sun, 01 Dec 2024 00:00:00 +0000

Training a deep neural network is essentially a compression task. We want to represent our training data distribution as a function parameterized by a bunch of matrices. The more complex the distribution, the more parameters we need. The rationale for approximating the entire distribution is so that we can forward any valid point at inference using the same model, with the same weights. But what if our model was trained on-the-fly, at inference? Then, when forwarding $x$ , we would only need to model the local distribution around $x$ . Since the local region should have lower dimensionality than the entire training set, a much simpler model will suffice!

Bayesian Parameter Estimation

Mon, 25 Nov 2024 00:00:00 +0000

Bayesian Parameter Estimation (BPE) is fundamentally different compared to MLE or MAP. Whereas the latter two solve for an optimal set of parameters $\hat{\theta}$ for the model, BPE treats $\theta$ as a random variable with a distribution $p(\theta)$.

Setup

We are given a dataset $D$, which contains $n$ i.i.d. features $x_j$. Given a new feature vector $x$, we want to classify it to some class $\omega$. One way to do this is by the Bayes’ decision rule. That is, we choose class $\omega_j$ over class $\omega_i$ if

Hario V60 Recipes

Mon, 25 Nov 2024 00:00:00 +0000

This is a collection of V60 recipes that I have used.

Emi Fukahori (1 cup)

Source video.

This recipe is specific to the Hario switch, my current brewer. It gives a consistent and bright cup.

Filtered Water: 200g
Coffee: 14g
Grind: Medium-coarse, 7.5 on Fellow Ode 2
Ratio: 14.28
Water temp: 95 °C

Close the switch (no flow), insert the filter, and preheat the brewer with hot water. After some time, open the switch and toss the water.
Close the switch, and add coffee.
Start timer. Bloom with 45g water until 0:35. Open switch.
Pour, in one stream down the center, 155g water (200g total) until timer hits 1:10 (~4g/sec).
Give it one quick swirl to get the bits stuck to the side down, and let it drain.
Feel free to close the switch ~5g before it finishes draining to keep the harsher, final coffee out.

The Ten Armed Testbed

Mon, 25 Nov 2024 00:00:00 +0000

This is a method of evaluating strategies for the multi-armed bandit problem. The testbed works as follows:

Generate $10$ reward means $\mu_i$ associated with $10$ actions $a_i$.
On each iteration, allow the agent to take some action $a_j$ and receive a reward $r_t \sim N(\mu_j, 1)$.

We repeat this for $100$ randomly sampled sets of $\mu_i$. The agent’s goal is to maximize average rewards. Hopefully, it should learn which action has the highest mean and sample from that.
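The procedure above can be sketched in a few lines of Python. Two things here are assumptions rather than part of the description: the means are drawn from a standard normal, and the agent is an epsilon-greedy one with sample-average estimates, chosen only as a simple example strategy. All names are illustrative.

```python
import random


def run_bandit(rng, steps=1000, epsilon=0.1, n_arms=10):
    """One testbed run; returns the agent's average reward over all steps."""
    # Assumption: reward means are sampled from N(0, 1).
    means = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
    estimates = [0.0] * n_arms  # running estimate of each action's mean
    counts = [0] * n_arms
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_arms)  # explore
        else:
            action = max(range(n_arms), key=lambda i: estimates[i])  # exploit
        reward = rng.gauss(means[action], 1.0)  # r_t ~ N(mu_j, 1)
        counts[action] += 1
        # Incremental sample-average update.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    return total / steps


def testbed(runs=100, steps=1000, epsilon=0.1, seed=0):
    """Average the per-run result over independently sampled sets of means."""
    rng = random.Random(seed)
    return sum(run_bandit(rng, steps, epsilon) for _ in range(runs)) / runs
```

Calling `testbed(epsilon=0.1)` and `testbed(epsilon=1.0)` compares the example agent against purely random action selection on the same kind of problem.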

Maximum A Posteriori (MAP) Estimation

Sun, 24 Nov 2024 00:00:00 +0000

The goal is essentially the same as MLE. We have an assumed model for $p(x_j \mid \omega_j)$ parameterized by $\theta$. We want to classify a feature $x$ into some class $\omega_j$ based on a labeled dataset $D$. In MLE, we were trying to maximize the likelihood:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \, p(D \mid \theta)$$

In MAP, we instead maximize the a posteriori:

Maximum Likelihood Estimation

Sun, 24 Nov 2024 00:00:00 +0000

Goal

We are given a dataset $D$, which contains feature vectors $x_k$ and class labels $\omega_k$. Denote $D_i$ as the set of features of class $\omega_i$. We assume the following:

That $p(x \mid \omega_j) \sim N(\mu_j, \Sigma_j)$. That is, given a class label, the distribution of features belonging to that class forms a Gaussian with mean $\mu_j$ and covariance $\Sigma_j$.
The samples $x \in D_i$ are independent and identically distributed (i.i.d.) according to this assumed Gaussian distribution.

The problem that MLE seeks to solve is to find the most likely set of parameters $\mu_j, \Sigma_j$, given the data. We denote

The Mechanics of Causal Self Attention

Wed, 13 Nov 2024 14:51:11 -0800

Causal self-attention is the mechanism underpinning most of the advances in AI since 2017. In this article, I will step through the computation and hopefully gain a better intuition of how it works.

$$\text{SelfAttention}(Q, K, V) = \text{softmax}\left(\text{mask}\left(\frac{Q K^T}{\sqrt{d}}\right)\right) V$$

At a high level, this function takes one sequence and transforms it into another. A sequence is a list of token embeddings, a tensor of shape $L \times d$ , where $L$ is the input sequence length and $d$ is the embedding dimension. Each row of this matrix corresponds to one input token, which is represented as a $d$ -dimensional vector.
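Here is a minimal, dependency-free Python sketch of that sequence-to-sequence computation, assuming the usual scaling of the scores by $\sqrt{d}$ and a mask that lets each token attend only to itself and earlier tokens. Matrices are plain lists of rows, and the function names are illustrative. It is meant to show the shapes and the masking, not to be efficient.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(Q, K, V):
    """Q, K, V are L x d lists of rows; returns an L x d list of rows."""
    L, d = len(Q), len(Q[0])
    out = []
    for i in range(L):
        # Score token i against every token j; future tokens (j > i) are masked.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
            if j <= i else float("-inf")
            for j in range(L)
        ]
        weights = softmax(scores)  # masked positions get weight 0
        # Output row i is the weighted average of the value rows.
        out.append([
            sum(weights[j] * V[j][c] for j in range(L))
            for c in range(len(V[0]))
        ])
    return out
```

Because of the mask, the first output row always equals the first row of `V`, and changing a later token never affects an earlier output row.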

Building and Deploying Rust to a Hugo Site

Mon, 22 Apr 2024 14:32:24 -0700

We’re going to go through a minimal example that will let you run Rust code on the client side of a Hugo site. We are going to compile the Rust code into WebAssembly (wasm), which will give us near-native performance on the browser!

An Expert–Level 2048 Bot

Tue, 16 Apr 2024 18:06:16 -0700

Explore different methods to win, and beat expert humans in 2048 interactively!

Interactive MNIST Explorer

Tue, 20 Feb 2024 11:53:54 -0700

Draw digits on the canvas and watch an AI guess what it is!

Switching to Obsidian

Thu, 14 Sep 2023 00:00:00 +0000

One of the most striking elements of Silicon Valley to outsiders is productivity culture. Whereas most people in most places live in complete satisfaction doing their job as they would, Silicon Valley people won’t find peace without optimizing their every habit and system to extract that extra iota of productivity per unit time. I am one of those people, and this article is about how I revolutionized my productivity switching from Neovim org-mode to Obsidian.

Hammerspoon Wizardry on macOS

Fri, 04 Aug 2023 10:58:48 -0700

If you’re a nerd, and you’ve been around Macs for a while, you might remember AppleScript. It was a language developed by Apple to allow intermediate–to–advanced users to write simple scripts that could control Mac applications. It was actually created to resemble the English language, so accessing a pixel would be written as

pixel 7 of row 3 of TIFF image "my bitmap"

or even

TIFF image "my bitmap"'s 3rd row's 7th pixel

Needless to say, there’s a good reason modern programming languages don’t look like this: it doesn’t scale. Anyone who has worked with AppleScript for extended periods of time knows how fast you run into limitations. Apple unofficially deprecated it in 2016, when Sal Soghoian, its longtime product manager, was let go for “business reasons”.

Not–so–casual Performance Optimization in Python

Tue, 01 Aug 2023 10:56:08 -0700

My previous post (which was honestly created to test out the theme for this site) provided a few code snippets that computed $N$ terms of the sum of inverse squares. I wrote the code in my 4 favorite languages (Python, C, Rust, and Haskell), but when I ran the Python code, it was embarrassingly slow. Compared to the $\approx 950$ ms it took sequential Rust, Python took 70 seconds! So, in this post, we’re going to attempt to get Python some more reasonable numbers.
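For context, the computation in question is the partial sum $\sum_{k=1}^{N} \frac{1}{k^2}$, which approaches $\frac{\pi^2}{6}$. A naive pure-Python version might look like the sketch below. This is an assumed baseline for illustration, not the code from the previous post.

```python
import math


def basel(n: int) -> float:
    """Sum of 1/k^2 for k = 1..n, computed with a plain Python loop."""
    total = 0.0
    for k in range(1, n + 1):
        total += 1.0 / (k * k)
    return total


# The gap to pi^2 / 6 shrinks roughly like 1/N.
print(basel(100_000), math.pi ** 2 / 6)
```

Every iteration runs through the interpreter, which is why a loop like this is slow in Python for very large $N$.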

The Basel Problem (Hello, World!)

Fri, 28 Jul 2023 22:21:54 -0700

Hello, World! This is my first post, and it’s exclusively used to test out this website’s functionality.

Here are some code snippets in various languages that compute the Basel Problem:

Author


I’m a Master’s student at UCSD working on reinforcement learning for Large Language Models, advised by Prof. Xiaolong Wang.

I got started with programming through open source in high school. Since then I’ve interned at Anduril, Stanford AI Lab, Keysight, SDSC, and Yahoo.

When I’m not programming, I’m brewing specialty coffee, lifting weights, or playing pickleball.

You can contact me through or X.

Books


My digital bookshelf, in no particular order.

Curriculum Vitae
