We will use the standard Markov Decision Process (MDP) model
\[\mathcal{M} = (\mathcal{S},\mathcal{A},P,r,\gamma,\rho_0).\]

| Term | MDP notation | LM analogue | LM notation |
|---|---|---|---|
| State | $s\in\mathcal{S}$ | prompt + prefix tokens | $s_t \equiv x_{0:t}$ |
| Action | $a\in\mathcal{A}$ | next token | $a_t \equiv x_{t+1}$ |
| Transition | $P(s'\mid s,a)$ | append token (usually deterministic) | $s_{t+1}=(s_t,a_t)$ |
| Reward | $r(s,a)$ | terminal score / reward model | $r_T$ (often at EOS) |
| Discount | $\gamma\in[0,1)$ | (optional) token-position discount | $\gamma$ |
| Start state dist. | $\rho_0$ | prompt distribution | $p(\text{prompt})$ |
| Policy | $\pi_\theta(a\mid s)$ | LM next-token distribution | $p_\theta(x_{t+1}\mid x_{0:t})$ |
| Trajectory | $\tau=(s_0,a_0,\dots,s_T)$ | completion / rollout | $\tau \equiv x_{0:T}$ |
| Return | $R(\tau)=\sum_{t=0}^{T-1}\gamma^t r(s_t,a_t)$ | scalar score for completion | $R(x_{0:T})$ |
Policy / LM distribution. In general RL notation, our policy is a conditional distribution $\pi_\theta(a\mid s)$. For an autoregressive language model, we will identify
\[\pi_\theta(a_t\mid s_t) \equiv p_\theta(x_{t+1}\mid x_{0:t}).\]With this identification, the probability of a full completion (trajectory) factors as
\[\pi_\theta(\tau) \equiv p_\theta(x_{0:T}) = p(x_0)\prod_{t=0}^{T-1} p_\theta(x_{t+1}\mid x_{0:t}),\]where $p(x_0)$ denotes the prompt distribution (or is treated as fixed/conditioned on).
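In code, this factorisation is just a sum of per-token log-probabilities. A tiny sketch (pure Python; in practice these logprobs come from the model’s logits):

```python
import math

def completion_logprob(token_logprobs):
    """log p_theta(x_{0:T}) for the generated part: the sum of
    per-token log pi_theta(x_{t+1} | x_{0:t})."""
    return sum(token_logprobs)

# Toy completion of three tokens with probabilities 0.5, 0.25, 0.8:
logps = [math.log(0.5), math.log(0.25), math.log(0.8)]
total = completion_logprob(logps)
# exp(total) recovers the product 0.5 * 0.25 * 0.8 ≈ 0.1
```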
Note that we call discounting optional because most LM RL work sets $\gamma = 1$ and only has a single reward at the end of the trajectory. However, sometimes we change this (e.g., in scenarios where dense reward is available).
In policy gradient, we assume a differentiable, parameterized policy $\pi_\theta$ and update parameters by (stochastic) gradient ascent on $J(\theta)$:
\[\theta \leftarrow \theta + \alpha\,\widehat{\nabla_\theta J(\theta)}.\]An important initial identity is the log-derivative trick:
\[\begin{aligned} \nabla_\theta \log \pi_\theta(\tau) &= \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} \quad\text{(when $\pi_\theta(\tau)>0$)}\\[0.5em] \nabla_\theta \pi_\theta(\tau) &= \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau) \end{aligned}\]We want to maximise our expected reward, that is we set:
\[\begin{aligned} J(\theta) &= \mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)] \\ &= \int \pi_\theta(\tau)\,R(\tau)\,d\tau \end{aligned}\]Now, let’s compute the derivative:
\[\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \int \pi_\theta(\tau)\,R(\tau)\,d\tau \\ &= \int \nabla_\theta\pi_\theta(\tau)\,R(\tau)\,d\tau \\ &= \int \pi_\theta(\tau)\,\nabla_\theta\log\pi_\theta(\tau)\,R(\tau)\,d\tau \end{aligned}\]Finally, expand the trajectory log-probability:
\[\begin{aligned} \log\pi_\theta(\tau) &= \log\rho_0(s_0) + \sum_{t=0}^{T-1}\log\pi_\theta(a_t\mid s_t) + \sum_{t=0}^{T-1}\log P(s_{t+1}\mid s_t,a_t) \\[0.5em] \nabla_\theta\log\pi_\theta(\tau) &= \nabla_\theta\log\rho_0(s_0) + \nabla_\theta\sum_{t=0}^{T-1}\log\pi_\theta(a_t\mid s_t) + \nabla_\theta\sum_{t=0}^{T-1}\log P(s_{t+1}\mid s_t,a_t) \\ &= \sum_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_t\mid s_t) \end{aligned}\]Interestingly, the transition dynamics disappear completely! Of course, they still affect the trajectory distribution implicitly; they just don’t depend on $\theta$, so their gradient vanishes. So we can just plug that right in:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,R(\tau)\right]\]Next, we will define the “reward-to-go” (the future return from time $t$):
\[G_t \triangleq \sum_{k=t}^{T-1} \gamma^{k-t}\,r(s_k,a_k)\]Importantly, the reward-to-go is the only part of the return that matters in our derivative. Write the full return as a “past” part plus a “future” part:
\[\begin{aligned} R(\tau) &= \sum_{k=0}^{T-1} \gamma^k r(s_k,a_k)\\ &= \underbrace{\sum_{k=0}^{t-1} \gamma^k r(s_k,a_k)}_{\triangleq\,R_{<t}} + \gamma^t\underbrace{\sum_{k=t}^{T-1} \gamma^{k-t} r(s_k,a_k)}_{=\,G_t} \end{aligned}\]Then, for any fixed time $t$,
\[\begin{aligned} \mathbb{E}_{\tau\sim\pi_\theta}\big[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,R(\tau)\big] &= \mathbb{E}_{\tau\sim\pi_\theta}\big[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,(\gamma^t G_t + R_{<t})\big]\\ &= \gamma^t\,\mathbb{E}_{\tau\sim\pi_\theta}\big[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,G_t\big] + \underbrace{\mathbb{E}_{\tau\sim\pi_\theta}\big[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,R_{<t}\big]}_{=0} \end{aligned}\]The last term is zero because $R_{<t}$ depends only on the trajectory up to $s_t$, not on $a_t$: conditioned on $s_t$, it acts as a valid baseline. Let’s put a pin in this for a second, and come back to it later. (Going forward we drop the $\gamma^t$ factor, as is standard; it is exactly $1$ in the usual LM setting where $\gamma=1$.)
This gives us our REINFORCE loss:
\[\mathcal{L}_{\text{REINFORCE}}(\theta) \;\triangleq\; -\,\mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{T-1} \log\pi_\theta(a_t\mid s_t)\,G_t\right]\]And our basic algorithm:
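In sketch form: sample trajectories from the current policy, compute each step’s reward-to-go, and take a gradient ascent step on $\sum_t \log\pi_\theta(a_t\mid s_t)\,G_t$. The reward-to-go itself is a simple backward pass over the per-step rewards (a minimal pure-Python sketch; sampling and the gradient step are left abstract):

```python
def rewards_to_go(rewards, gamma=1.0):
    """G_t = sum_{k >= t} gamma^(k-t) * r_k, computed right-to-left in O(T)."""
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Dense rewards at every step:
assert rewards_to_go([1.0, 2.0, 3.0]) == [6.0, 5.0, 3.0]
# Single terminal reward with gamma = 1 (the usual LM RL setting):
# every token shares the final score.
assert rewards_to_go([0.0, 0.0, 1.0]) == [1.0, 1.0, 1.0]
```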
There are three core issues with REINFORCE: high variance, no ability to reuse data, and sensitivity to step size.
I’ll cover the first two; step size is… less of a big issue for things like CISPO, but it is important and part of the motivation for developing things like TRPO and PPO. Let’s deal with variance first, since that’s the most important thing.
A standard variance-reducing trick is to subtract a baseline that only relies on the current state:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta\log\pi_\theta(a_t\mid s_t)\,(G_t - b(s_t))\right]\]We need to show two core things: (1) that doing this is unbiased, and doesn’t change our result; (2) that this actually reduces variance.
Let’s split out our loss:
\[\begin{aligned} \nabla_\theta J(\theta) &= \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta\log\pi_\theta(a_t\mid s_t)\,(G_t - b(s_t))\right] \\ &= \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta\log\pi_\theta(a_t\mid s_t)\,G_t\right] - \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta\log\pi_\theta(a_t\mid s_t)\,b(s_t)\right] \end{aligned}\]Then, we will just consider the baseline term. Let’s consider what happens at a single time step $t$. The key thing is we split up the expectation:
\[\begin{aligned} \mathbb{E}_{\tau\sim\pi_\theta}\big[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,b(s_t)\big] &= \mathbb{E}_{s_t,a_t\sim\pi_\theta}\big[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,b(s_t)\big]\\ &= \mathbb{E}_{s_t}\left[\mathbb{E}_{a_t\sim\pi_\theta(\cdot\mid s_t)}\big[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,b(s_t)\ \big|\ s_t\big]\right]\\ &= \mathbb{E}_{s_t}\left[b(s_t)\,\mathbb{E}_{a_t\sim\pi_\theta(\cdot\mid s_t)}\big[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\big]\right]\\ &= \mathbb{E}_{s_t}\left[b(s_t)\,\sum_a \pi_\theta(a\mid s_t)\,\nabla_\theta\log\pi_\theta(a\mid s_t)\right]\\ &= \mathbb{E}_{s_t}\left[b(s_t)\,\sum_a \nabla_\theta\pi_\theta(a\mid s_t)\right]\\ &= \mathbb{E}_{s_t}\left[b(s_t)\,\nabla_\theta\sum_a \pi_\theta(a\mid s_t)\right] = \mathbb{E}_{s_t}\left[b(s_t)\,\nabla_\theta 1\right] = 0 \end{aligned}\]Therefore
\[\mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta\log\pi_\theta(a_t\mid s_t)\,b(s_t)\right] = 0,\]so replacing $G_t$ by $G_t-b(s_t)$ leaves the expectation unchanged (i.e., the estimator remains unbiased).
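The key fact used above is that the score function has zero mean under the policy, which is exactly why any action-independent baseline drops out. This is easy to verify numerically for a small softmax policy via finite differences (a toy sketch; the logits are arbitrary and nothing here is LM-specific):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# E_{a ~ pi}[ d/d(theta_k) log pi(a) ] for a softmax policy should be 0
# for every parameter theta_k.
logits = [0.2, -1.0, 0.7]
probs = softmax(logits)
h, k = 1e-6, 0  # finite-difference step, parameter index
expected_score = 0.0
for a in range(3):
    up = list(logits); up[k] += h
    dn = list(logits); dn[k] -= h
    d_logp = (math.log(softmax(up)[a]) - math.log(softmax(dn)[a])) / (2 * h)
    expected_score += probs[a] * d_logp
# expected_score is ~0, up to finite-difference error
```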
Note that this is why a baseline depending on anything up to the current state is okay: if the baseline depended on the current action, we couldn’t move it out of the inner expectation in the proof above. So our baseline can depend on anything apart from the current action.
Now that we know we can do this, we need to consider: what would be a good baseline for reducing our variance?
The math for optimal baselines gets more involved, and I couldn’t find a clear explanation, so let’s stick with a higher-level approximation¹. These approximations are not strictly valid in general, but they help us build some intuition for why baselines can reduce variance. Consider computing the variance of a single trajectory’s contribution:
\[\begin{aligned} \mathrm{Var}\Big(\sum_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,(G_t-b(s_t))\Big) &\approx^{(i)} \sum_{t=0}^{T-1}\mathbb{E}_{\tau}\Big[\big(\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,(G_t-b(s_t))\big)^2\Big]\\ &\approx^{(ii)} \sum_{t=0}^{T-1}\mathbb{E}_{\tau}\Big[\big(\nabla_\theta\log\pi_\theta(a_t\mid s_t)\big)^2\Big]\,\mathbb{E}_{\tau}\Big[\big(G_t-b(s_t)\big)^2\Big] \end{aligned}\]Here (i) drops cross-covariance terms between different time steps, and (ii) further treats the score term and the (centered) return term as approximately independent. We can’t control the variance over the logprobs, so instead we want to minimize just $G_t - b(s_t)$! This suggests that a good baseline is our best guess at $G_t$ based on $s_t$, which is exactly our value function!
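We can sanity-check this numerically on a one-step, two-action toy problem (entirely made-up numbers, just for illustration): subtracting $b=\mathbb{E}[r]$ leaves the estimator’s mean (the true gradient) unchanged, but shrinks its variance.

```python
import random

def grad_logp(p1, a):
    # d/d(theta) log pi(a) for a two-action policy with pi(1) = sigmoid(theta)
    return (1.0 - p1) if a == 1 else -p1

def estimate_variance(baseline, n=5000, seed=0):
    """Empirical variance of the score-function estimator g = score * (r - b)."""
    rng = random.Random(seed)
    p1 = 0.4  # pi(a=1); pi(a=0) = 0.6
    samples = []
    for _ in range(n):
        a = 1 if rng.random() < p1 else 0
        r = (2.0 if a == 1 else 1.0) + rng.gauss(0.0, 0.1)  # noisy reward
        samples.append(grad_logp(p1, a) * (r - baseline))
    mean = sum(samples) / n
    return sum((g - mean) ** 2 for g in samples) / n

var_no_baseline = estimate_variance(baseline=0.0)
var_with_baseline = estimate_variance(baseline=1.4)  # b = E[r] = 0.6*1 + 0.4*2
# Same seed, same draws: the only difference is the baseline, and the
# variance with the baseline is much smaller.
```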
This gives us the advantage function:
\[A_{\pi_\theta}(s_t, a_t) = G_t - \hat{V}_{\pi_\theta}(s_t)\]Note here that $G_t$ is the return we get from taking $a_t$ in state $s_t$ and then following the policy. That is, it is (an unbiased Monte Carlo estimate of) the Q-value. Intuitively, we are replacing weighting by absolute reward with weighting by how much the action improves over the average action taken by the policy. This reduces variance since we are taking into account “how good” our policy is already.
Let’s cover the other big issue with REINFORCE: data reuse.
We can’t really re-use old data in our current setup, since we derived everything under the assumption that our trajectories come from the current model ($\tau \sim \pi_\theta$). This means that once we take a gradient step, the old trajectories are coming from a different distribution ($\tau \sim \pi_{\theta_{\mathrm{old}}}$).
We can apply importance weighting to fix this. The core idea is that we can draw samples from a different distribution so long as we reweight them by the ratio of their probabilities under the current and sampling distributions:
\[\begin{aligned} \mathbb{E}_{\tau\sim\pi_{\theta}}\big[f(\tau)\big] &= \int \pi_{\theta}(\tau)\,f(\tau)\,d\tau \\ &= \int \pi_{\theta_{\mathrm{old}}}(\tau)\,\frac{\pi_{\theta}(\tau)}{\pi_{\theta_{\mathrm{old}}}(\tau)}\,f(\tau)\,d\tau \\ &= \mathbb{E}_{\tau\sim\pi_{\theta_{\mathrm{old}}}}\Big[w(\tau)\,f(\tau)\Big] \end{aligned}\]where
\[w(\tau) \triangleq \frac{\pi_{\theta}(\tau)}{\pi_{\theta_{\mathrm{old}}}(\tau)} = \prod_{t=0}^{T-1}\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}\]So we can just apply this to our original policy-gradient expression:
\[\begin{aligned} \nabla_\theta J(\theta) &= \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,G_t\right]\\ &= \mathbb{E}_{\tau\sim\pi_{\theta_{\mathrm{old}}}}\left[w(\tau)\,\sum_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,G_t\right] \end{aligned}\]We can also define the ratio on the token-level:
\[r_{t}(\theta) = \frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}\]Actually, we basically already have enough to get the current SOTA RL algorithm, CISPO²!
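In practice, both the trajectory-level weight $w(\tau)$ and the token-level ratios $r_t(\theta)$ are computed from stored log-probabilities, in log-space, to avoid numerical under/overflow over long sequences. A minimal sketch (pure Python; real implementations read these logprobs from the model):

```python
import math

def token_ratios(logp_new, logp_old):
    """Token-level ratios r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def trajectory_weight(logp_new, logp_old):
    """w(tau) = product of token ratios, computed as exp of a sum of logs."""
    return math.exp(sum(logp_new) - sum(logp_old))

# Two generated tokens; the new policy halves the first token's probability.
logp_old = [math.log(0.5), math.log(0.5)]
logp_new = [math.log(0.25), math.log(0.5)]
ratios = token_ratios(logp_new, logp_old)   # ≈ [0.5, 1.0]
w = trajectory_weight(logp_new, logp_old)   # ≈ 0.5 * 1.0 = 0.5
```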
CISPO’s loss is:
\[J_{\mathrm{CISPO}}(\theta) = \mathbb{E}_{(q,a)\sim\mathcal{D},\,\{o^i\}_{i=1}^G\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}\left[ \frac{1}{\sum_{i=1}^G |o^i|} \sum_{i=1}^G\sum_{t=1}^{|o^i|} \mathrm{sg}\big(r_{i,t}(\theta)\big)\,\hat A_{i,t}\,\log \pi_\theta\big(o^i_t\mid q,o^i_{<t}\big) \right]\]Note that we are averaging over groups here (the $i$ index) and over timesteps (number of tokens in a given rollout). $\mathrm{sg}$ stands for “stop gradient”, and is used to avoid backpropagating through the logprobs used in computing the importance ratio $r$. $G$ is our group size (number of rollouts sharing a prompt), which is used to compute the advantage $\hat A$:
\[\hat A_{i,t} = \frac{R_i - \mathrm{mean}(\{R_j\}_{j=1}^G)}{\mathrm{std}(\{R_j\}_{j=1}^G)}\]Note that this is intuitively capturing exactly what we were doing before, but using the group estimates instead of a learned function for the value estimates.
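Concretely, the group advantage is just a per-group standardisation of the scalar rewards, and every token in rollout $i$ shares the same $\hat A_i$. A minimal sketch (the small epsilon in the denominator is an implementation detail not shown in the formula, guarding groups where all rewards are equal):

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardise each rollout's scalar reward
    against its group's mean and (population) std."""
    G = len(rewards)
    mean = sum(rewards) / G
    var = sum((r - mean) ** 2 for r in rewards) / G
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 4 rollouts for the same prompt, scored 0/1 by a verifier:
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
# Correct rollouts get advantage ≈ +1, incorrect ≈ -1 (mean 0.5, std 0.5).
```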
Some other interesting advantage estimates: VinePPO³, grouping variants of GRPO⁴, and REINFORCE++⁵.
You might wonder why it is valid to include the reward of the current trajectory in the baseline for GRPO and other methods like REINFORCE++: doesn't this depend on the current action? Actually, this is a special case that turns out to be fine!
Firstly, using other trajectories is completely valid, since they are independent of our current action (different samples entirely). The reward of our current trajectory is the weird part. Let's consider using the average batch reward-to-go as a baseline:
$$ b(s_t) \equiv \frac{1}{B}\sum_{n=1}^{B} G^n_t, \quad\text{where $B$ is our batch size.} $$ Let $x$ denote our current sample in the batch. Then: $$ \begin{aligned} \nabla_\theta J(\theta) &= \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta\log\pi_\theta(a_t\mid s_t)\,(G^x_t - b(s_t))\right]\\ &= \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta\log\pi_\theta(a_t\mid s_t)\,\Big(G^x_t - \frac{1}{B}\sum_{n=1}^{B} G^n_t\Big)\right] \\ &= \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta\log\pi_\theta(a_t\mid s_t)\,\Big(G^x_t - \frac{1}{B}G^x_t - \frac{1}{B}\sum_{n=1,\,n\ne x}^{B}G^n_t\Big)\right] \\ &= \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta\log\pi_\theta(a_t\mid s_t)\,\Big(\frac{B-1}{B}G^x_t - \frac{1}{B}\sum_{n=1,\,n\ne x}^{B}G^n_t\Big)\right] \end{aligned} $$We know using $\frac{1}{B}\sum_{n=1,\,n\ne x}^{B}G^n_t$ is fine as a baseline, so we just need to consider the $\frac{B-1}{B}G^x_t$ term. It should be intuitively clear that this just scales the gradient by $\frac{B-1}{B}$, and so does not bias the direction of the final result. Strictly, using a leave-one-out estimate as in RLOO avoids this scaling altogether, but it does not hurt!
Crucially, this only works because we used the reward itself in the baselining. Other functions involving $a_t$ would be invalid to use.
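The leave-one-out estimate mentioned above is simple to compute without a second pass over the data (a sketch of an RLOO-style baseline, where each rollout's reward is compared against the mean of the others'):

```python
def rloo_advantages(rewards):
    """Leave-one-out advantage: baseline each sample with the mean
    reward of the *other* samples in the batch."""
    B = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (B - 1) for r in rewards]

# Four rollouts scored 0/1; each is compared against the other three,
# so a correct rollout gets 1 - 1/3 and an incorrect one gets 0 - 2/3.
adv = rloo_advantages([1.0, 0.0, 0.0, 1.0])  # ≈ [2/3, -2/3, -2/3, 2/3]
```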
And that’s it! Thanks for listening!
See also: Weng, L. (2018). Policy Gradient Algorithms; Sutton & Barto (1998). Reinforcement Learning: An Introduction (Chapter 13).
Seita, D. (2017). Going Deeper Into Reinforcement Learning: Fundamentals of Policy Gradients. ↩
MiniMax et al. (2025). MiniMax-M1: Scaling Test-Time Compute. ↩
Kazemnejad et al. (2025). VinePPO: Refining Credit Assignment. ↩
Feng et al. (2025). Grouping in Group Policy Optimization for LLM. ↩
Hu et al. (2025). REINFORCE++: Stabilizing Critic-Free Policy. ↩
Ahmadian et al. (2024). Back to Basics: Revisiting REINFORCE Style Optimization. ↩
Note to reader: This blog post is a (lightly) edited version of a post I originally wrote in May 2025, but dropped due to getting busy with life. I’ve finally cleaned it up and shared it, but understand it’s a bit out of date! Lots of interesting and cool work on LM overthinking and length control has come out since then, one example being GDPO.
A large flaw of autoregressive thinking models is that their inference can simply go on and on. For example, if we take DeepSeek-R1 and provide it with a slightly nonsense riddle, it very quickly degrades into endless guess-and-checking:
> I asked the new R1 to "Perform some calculation to estimate pi/7" and I don't know if it's ever going to stop thinking
>
> — Nathan Lambert (@natolambert) May 28, 2025
This is partly just because long CoTs take time to generate, but it is also part of an issue known as overthinking, wherein models think for longer and produce longer CoTs for problems they will eventually get wrong, or spend far more time on a problem than it requires. There is a decent amount of literature on this at this point, including papers that propose to solve the issue by learning to better allocate compute across questions.
One particular approach to solving this I found interesting earlier in the year was L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning. The idea behind the paper is simple: let’s just reward the model for getting the length right during RL training! (Perhaps) surprisingly, a fairly simple reward works well here. Doing everything during RL training is something I’m interested in doing right now, so I quite liked this, and worked a little on replicating this in Open-Instruct, a post-training codebase I contribute to quite a bit.
In order to replicate, I set up a small setting I cared about: Given a math dataset, can I RL-train with an additional length control reward to achieve a ‘reasonable’ degree of length control without sacrificing quality? In particular, I was interested in the following properties:
Some of these questions are answered in the original paper in more detail, but I wanted to try getting my own answers!
Let’s get a bit more specific about the setup. For my experiments, I used a Qwen-2.5-7B model finetuned on the Tulu 3 SFT dataset for 2 epochs (following the same hyperparameters as the original Tulu 3 SFT, but with a max context length of 32768). This is the starting point for my experiments unless otherwise stated. Note that this model is not a thinking model out-of-the-box, so it doesn’t really do long-CoT generations initially.
For data, I used the Eurus 2 data, a large-ish collection of math and code data. I filtered out the code data.
Finally, I did all my training with GRPO. I think this should work with other online RL algorithms (e.g. PPO), but I used GRPO since it was straightforward. To be exact, I used the grpo_fast implementation in Open-Instruct, which includes some minor extra features like packing. Here’s an example of a command I used to train:
```bash
dataset=ai2-adapt-dev/eurus2_ground_truth_with_random_max_length
python mason.py \
    --cluster ai2/augusta-google-1 \
    --pure_docker_mode \
    --image hamishivi/open_instruct_mult_dev1704 \
    --workspace ai2/tulu-3-dev \
    --priority high \
    --preemptible \
    --num_nodes 3 \
    --max_retries 0 \
    --budget ai2/oe-adapt \
    --gpus 8 -- source configs/beaker_configs/ray_node_setup.sh \&\& python open_instruct/grpo_fast.py \
    --exp_name $exp_name \
    --beta $beta \
    --num_samples_per_prompt_rollout 16 \
    --num_unique_prompts_rollout 128 \
    --sft_messages_key messages \
    --output_dir /output \
    --kl_estimator kl3 \
    --learning_rate 1e-6 \
    --dataset_mixer_list $dataset 1.0 \
    --dataset_mixer_list_splits train \
    --dataset_mixer_eval_list $dataset 16 \
    --dataset_mixer_eval_list_splits train \
    --max_token_length 10240 \
    --max_prompt_token_length 1024 \
    --response_length 8192 \
    --model_name_or_path ai2-adapt-dev/tulu_3_long_finetune_qwen_7b_reg \
    --apply_verifiable_reward True \
    --non_stop_penalty False \
    --temperature 0.6 \
    --total_episodes 2000000 \
    --non_stop_penalty_value 0.0 \
    --pack_length 16384 \
    --deepspeed_stage 2 \
    --per_device_train_batch_size 1 \
    --num_learners_per_node 8 8 \
    --num_epochs 1 \
    --num_mini_batches 1 \
    --vllm_tensor_parallel_size 1 \
    --vllm_num_engines 8 \
    --lr_scheduler_type constant \
    --seed 1 \
    --num_evals 100 \
    --save_freq 2000 \
    --try_launch_beaker_eval_jobs_on_weka False \
    --gradient_checkpointing \
    --with_tracking
```
There are some minor changes for different datasets (with different length rewards), but really this is the main command and hyperparameters used.
What does our reward look like? I experimented with four different setups:
- **Exact:** I append `\nThink for n tokens.` to the prompt. We then calculate the reward as `1 - (abs(tokenized_prediction - desired_length) / 8192)`, and add this to the ground truth reward (so the model gets extra reward for getting the length correct).
- **Bucketed:** the same as above, but desired lengths are drawn from `{100, 1024, 2048, 4096, 6144, 8192}`. The idea is that learning specific ‘valid’ lengths might be easier than any integer in a range with > 4000 values.
- **Up to:** the same as the exact reward, except if `tokenized_prediction - desired_length < 0`, we just give full reward. The idea is that this is an easier task to learn, and a bit more realistic (a user probably doesn’t mind if the model finishes early). Note that I edit the prompt to be `\nThink for up to n tokens.`

To visualise the ‘up to’ and ‘exact’ rewards for a desired length of 3200 tokens:
*(Figure: the ‘up to’ and ‘exact’ reward curves for a desired length of 3200 tokens.)*
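These rewards are simple to implement. A sketch of my reading of the ‘exact’ and ‘up to’ variants described above (the function name and signature are mine, not the actual Open-Instruct code; this length bonus is added to the ground-truth correctness reward):

```python
def length_reward(pred_len, desired_len, max_len=8192, up_to=False):
    """Length reward: 1 - |pred - desired| / max_len, with the 'up to'
    variant giving full reward whenever the model finishes early."""
    if up_to and pred_len < desired_len:
        return 1.0
    return 1.0 - abs(pred_len - desired_len) / max_len

exact = length_reward(3000, 3200)              # 1 - 200/8192 ≈ 0.976
early = length_reward(3000, 3200, up_to=True)  # finished early: full reward
over = length_reward(4000, 3200, up_to=True)   # going over is still penalised
```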
How do we evaluate? Basically, I evaluated on MATH-500 with desired lengths of {100, 1024, 2048, 4096, 6144, 8192, 9216, 10240} (set via the prompt). I measure both performance at the various lengths and how far off the lengths are from the desired output length (shown via violin plot below). I also did one out-of-domain experiment running evaluation on MMLU with the same lengths.
Let’s start by just looking at how well each approach actually adheres to the length budget. We use a violin plot to show the distribution of lengths generated by the model for each desired length, and remove the bottom and top 5% of lengths to reduce the effect of outliers.
*(Figures: violin plots of generated length vs. desired length, one per reward setup.)*
Takeaways: Training on the budget works really well! We get pretty good length control, although it’s not exact exact. However, we don’t generalise to new lengths, so we can’t use this technique to scale inference-time compute beyond what we used during training. Interestingly, we also see that the ‘up to’ reward doesn’t work that well, as the model just learns to always be short: instead, we need the tight ‘exact’ reward.
You might be curious about performance. Below I’ve plotted performance at different output lengths for the 8k “exact” reward (I found the other methods similar in performance, apart from the ‘up to’ reward, which just learns to be short). Note I evaluated using both greedy and temperature sampling (temperature 0.7), hence the two lines.
*(Figure: MATH-500 performance across desired lengths for the 8k ‘exact’ reward, greedy and temperature-0.7 sampling.)*
As you can see, the model matches the performance of the ‘no length control’ baseline once output length reaches >= 2000 tokens. This suggests (a) we can get length control without sacrificing performance, and (b) the model doesn’t need to generate long chains to do well. (b) is especially interesting, since the model without length control is fairly yappy and often uses the full 8k token budget. This suggests that the model learns to compress its reasoning as part of the length control task. Perhaps this would drop performance on more complex tasks, but here it’s very encouraging. Indeed, much work over the past year has found that reasoning models can compress their reasoning chains quite a lot.
Finally, I also wanted to see how well the length control did at tasks that were OOD. Recall we are training on math data only, so I evaluated on MMLU, which is a general QA task.
*(Figure: length control on MMLU across desired lengths.)*
Here, we see that length control still works, but less strongly on these OOD tasks. I consider this pretty successful, since in reality we can just train on a diverse mixture and minimise how many OOD cases the model needs to deal with (and later for Olmo 3 we did indeed train on a moderately diverse mixture of data).
Sadly, we do see performance drop (all curves using temperature 0.7 sampling):
*(Figure: MMLU performance for the trained models vs. the base model, temperature-0.7 sampling.)*
Note that ‘bucketed’ is the bucketed reward mentioned above, and the other two are ‘exact’ rewards trained for differing amounts of time with different learning rates. All three perform worse than the base model. However, this might just be due to the model overfitting on the math-only training data, though I don’t have an experiment to confirm this.
I did this project to answer a few questions, and we can now clearly do that.
Overall, I think these results show that an L1-style recipe is pretty effective at teaching token budgets. If you use coarse buckets (e.g. ‘easy’, ‘medium’, ‘hard’ reasoning), you can do really well, and even exact token budgets are learnable with training without sacrificing too much performance. Revisiting these results has reminded me to try to fold this into the next Olmo release, which I will (try to) do.
Thanks for reading, and may your LMs be perfectly verbose.
But anyway, let’s get into it!
Really awful on the reading front this year. I’ll try to do better in the future. I spent much of the first half of the year reading comics and worked through Kieron Gillen’s back catalogue, which was fun (and I really love The Power Fantasy currently)!
Other ongoing comics I’m appreciating include the amazing Ultimates run by Deniz Camp (crazy that the ultimate universe is ending soon!), and DC’s companion Absolute universe comics are also great, especially Absolute Martian Manhunter, which is kind of amazing (and also by Deniz Camp). On the manga side, I grew a weird appreciation for Jujutsu Kaisen Modulo, which is doing some interesting things as a sequel series, and I got back into One Piece, which has just been consistently killing it.
I watched… 107 films this year! Pretty good, improving on last year. It might be tough to beat this next year! As usual, my reviews and ratings are on my letterboxd. I watched all the Fast and Furious films this year (fun to see them progress), and finally watched some Wong Kar-Wai (Chungking Express is amazing). So many amazing films came out this year, but One Battle After Another probably has to be the overall top, although Twinless and Bugonia were really tight contenders (Bugonia is just pure Lanthimos, it’s great) - and, of course, Sinners. I watched a surprisingly large number of anime movies in theatres (Demon Slayer, Dandadan, Chainsaw Man), which was fun, and perhaps signals an interesting shift in what’s showing in theatres.
BIG year for TV for me, even more so than last year. Andor blew me away, and was probably my favourite show of the year (and probably a contender for one of my favourite shows of all time). It’s hard to imagine a tighter show than Andor, and I’m sad it’s ending but also happy they could even do it in the first place. The Rehearsal redefined my notion of what a TV show could be, and was the least predictable show I’ve ever seen (and that’s even compared to something like Severance). Severance was amazing, and an extremely fun watch (and fun to chat about with friends). Pluribus is also great, and I’m excited to see where it goes in the future. I also watched a lot of great short series (Chernobyl, Apple Cider Vinegar, Adolescence, Normal People), and rewatched Legion (it remains an endlessly inventive show; I wish more superhero shows would take as much liberty with their style as Legion does episode-to-episode). Finally, I finished watching all of House, M.D. (which I started a few years back, as a background show). I would probably say the first few seasons are easily the best, but it remains surprisingly solid throughout. It also makes all the clips of the show you see floating around even better, to be honest.
Looking forward already to the upcoming year of TV (looking forward to The Boys, Industry, House of the Dragon, and maybe Wonder Man)!
Not too many games this year, similar to last, but I did really enjoy Ace Attorney as usual (and hoping for more), and really enjoyed Deltarune as usual. I got sucked into Balatro (like many others), and then in the back half of the year focussed on enjoying a Switch 2 and the respective games. I’m currently working my way through the Mega Dimension DLC for Pokemon Legends: Z-A, which is a bit rough, but the base game itself is fun and cool to see as someone who’s been a pokemon fan since the Ruby and Sapphire days.
Music-wise, I listened to a lot of Role Model this year, and continued to enjoy Magdalena Bay and Royel Otis (and saw all three of these artists live!). I developed a strong appreciation for Egoism (from my home of Sydney), really liked Mallrat’s new album, Light hit my face like a straight right, and found Ninajirachi’s debut album (I Love My Computer) AMAZING. It’s a good time for Australian music! Other-stuff-wise, I really liked The Crux, Djo’s new album, and I got a little sucked into k-pop, listening to some Le Sserafim (who I saw live!) and Katseye stuff.
Other video-wise, I got sucked into Stavvy’s World clips, and otherwise enjoyed Big A clips and Lemonade Stand. I also had a few months deep in the GothamChess zone, deep on chess.com. Apart from this, I feel that my watching habits didn’t change too much - still a lot of WolfeyVGC, a lot of Jacob Geller, and various other video essayists I enjoy (e.g., Patrick H Willems, Lindsay Ellis, Todd in the Shadows).
Finally, I’m still spending time on reels and tiktoks, but I think it’s become much more mainstream. Much more political stuff, and trends come and go faster than ever. My tiktok account somehow got deleted, so I don’t really spend any time on there anymore, which is…probably good. Overall, I made more of an effort to watch more movies, and clearly that’s actually paid off a bit!
That’s about it for me this year! I’m looking forward to 2026 media-wise, and I hope to have maybe some more games on the list next year… maybe a good year to get back into a bunch of reading. Maybe I’ll try to put out some old blog posts. Already have lots of exciting ideas for research, too. It’s a fun time!
Sadly, not much reading this year, although I want to fix this (lots downloaded on my kindle)! I focussed a bit more on non-fiction this year. I really, really enjoyed Rejection: Fiction, but I have to say it’s not for the prudish or squeamish. Theory and Reality was also a solid read, and I recommend it to anyone in a PhD program – it feels important to have some grounding in the philosophy of science if you plan to be, well, a scientist. And yes, I (re)read all the plays of Aristophanes this year – sadly in English, but I do always love ancient comedy.
In terms of comics not listed here, I enjoyed (like many) the new Ultimate lines by Marvel, with Ultimate Spider-Man and Ultimate X-Men being particular highlights. I also didn’t mind the Zdarsky Batman run, and the new Absolute Batman seems promising from initial issues! I reread a lot of manga this year, including Bleach and Hunter x Hunter, which remain solid shounen (especially Hunter x Hunter). I also enjoyed the ongoing SJ comics Witch Watch, Dandadan, and Akane-Banashi. I also read (and got really into) CITY by Keiichi Arawi, author of perhaps my favourite gag manga ever (Nichijou) – it was like discovering that not only did a sequel to your favourite show exist, but that the author had really evolved their craft in the meantime – just a really great manga to read start to end.
Unlike reading, I managed to watch a lot of films this year - 79 total! You can go read my reviews and ratings on my letterboxd for actual details or recommendations. I went into both Mad Max and Alien phases this year, and (re)watched them. Actually, I discovered that the first Mad Max was shot partly in the same area as my Australian family! I Saw the TV Glow probably ended up being my favourite film of the year, but it’s certainly not for everyone. There was also a decent amount of ‘comfort food’ watching, which for me means mediocre superhero flicks.
I watched a decent amount of TV this year too! I’m also slowly working through House, M.D. (current comfort show). In hindsight, there were a few big highlights of the year. The Curse is perhaps an all-time piece, and Baby Reindeer and Pantheon really stuck in my brain. I particularly recommend Pantheon for anyone looking for ‘anime sci-fi’ vibes – sure, it’s not 100% technically accurate, but it has interesting ideas and explores them in interesting ways, which is all I can ask for. The Sympathizer and The Penguin also really got me – I especially recommend The Sympathizer, with great Park Chan-wook stylings (as the showrunner), and the uniqueness of watching a big-budget TV show mostly in Vietnamese. However, both shows are pretty rough in terms of violence, and aren’t perfect by any means.
Sadly, this is where work really reduced the amount of time I had (compare this list to last year…). I have a long, long backlog of games that came out that I really want to try out. Of these, I found Outer Wilds amazing – it absolutely lives up to its (almost insane levels of) hype, and Paper Mario was a fun take on an RPG. I also played through Final Fantasy XVI, but I have a bit to go to finish the platinum for it, so it’ll have to be on next year’s list.
I didn’t track these other things myself, but Spotify told me I really liked Porter Robinson, Magdalena Bay, Beabadoobee, Charli XCX, and Royel Otis. I managed to see 3 of these in person at concerts this year, along with other great artists like Clairo and ODESZA, which was fun. I got super into Magdalena Bay’s stuff, which comes with cool websites, production breakdowns, and just generally some fun vibes.
I watch way too much YouTube, but I really liked Jacob Geller’s video essays, WolfeyVGC’s videos on competitive Pokémon, and Tom Walker’s generally insane VODs.
I spent way too much time on tiktoks and reels this year, and trying to track what I saw would probably send me insane and/or into a depressive funk. I like the stupid yap dollar trend. I like the educational Map Men series. I enjoy the unhinged Sydney Metro fanpage. Short-form video content is fun and here to stay, but by god is it too addictive.
And that’s about it from me this year! I guess this is probably the most consistent blog post I have, so I’ll see you same time next year. I have a feeling that next year might have some more interesting items on the list :)
I also read many other manga and comics I did not track. I really enjoyed the ongoing run of Immortal X-Men, and did a big reread of One Piece and Jujutsu Kaisen. Maybe this year I’ll read more books…
Yes, I did watch most of the Mission Impossible films this past year. I track my film watching on my letterboxd.
In addition, I also watched The Curse and Monarch: Legacy of Monsters, but these did not finish until this year. I usually like to have one or two weekly shows to look forward to.
I like to play through entire series when I have the right consoles - such as playing through (and 100%ing) both The Last of Us games and the entire Pikmin series.
I ended up watching and playing a lot of things over the past year! It was a bit of a weird year, with a big summer break while I waited to start my PhD, so I don’t know if I’ll cover as much ground this year, but we will see.
Over the past year, I applied to tonnes of schools for PhD programs, interviewed at a bunch, and got into a few. This post will be a mixture of reflections and explanations about how this process went, what I did to prepare, and how I felt during it. There’s already a number of fantastic explanations, interviews, and stories about getting PhDs in NLP and ML more broadly, and if you are thinking about or preparing to do a PhD in NLP I really recommend all these, including:
There are so many more resources out there (for example, see this list), and I definitely recommend sitting and reading as many of these as you can! With this post, I’m not aiming to target anything in particular, but just give another datapoint that people might appreciate.
My journey towards seriously preparing for PhD applications began when I started as a predoctoral young investigator at AI2. This basically gave me the chance to work on a few research projects with some great, immensely talented folks, and get lots of advice / help with applying. This is obviously something that is fairly rare, but it definitely majorly helped my chances, and made me personally much more confident in applying (since it gave me the chance to get some projects under my belt). Before the program, I had a little research experience, but no published papers, and was definitely not feeling up to applying to programs - although I know and have heard of folks succeeding in applications in similar scenarios.
Over the initial PYI year, I managed to publish one paper, and put out two others (which were under review by the time I applied). Maybe not the most productive or impactful researcher out there, but honestly the bigger thing was being exposed to the broader NLP community and gaining the confidence to talk about my research ideas and NLP research more broadly with the community at large. It turns out sometimes just being confident enough to talk about your ideas can get you some places!
Around August, I started to look around and work out where I wanted to apply, and what the deadlines, professors, programs, and so on were like. I had a draft list bouncing around all year, but this is when I think I started to really take it more seriously. For every school I could think of (i.e., had heard something about, or knew about a professor there), I listed (a) the professors there and their areas of interest, (b) the application deadline, (c) any notable details about the program, and (d) general positives or negatives that I could think of (e.g., location, what I had heard from their students, and so on). I tried to harden this into a partially-ordered list, segmenting the schools into ‘tiers’, and trying to pare it down so that only the places I would genuinely accept if given an offer were on the list. I also shared the list with some folks, who helped by suggesting places I hadn’t heard of or considered before.
The final list was still long (I ended up applying to ~13 schools), but going through the process helped focus my efforts a bit.
Around the start of October, I started to draft up my statement of purpose (SoP) and ask folks to write letters of recommendation for me. After some time working on the SoP by myself, I shared it around with a few folks and got back some really good, clear feedback. In the end, I probably rewrote the SoP almost from scratch 3-4 times in order to really shake it up and keep it tight. I definitely recommend reading the SoPs on CS SoP for inspiration, and sharing your drafts around (this can be scary, but the feedback is definitely worthwhile!). It’s cliche to say, but try to stay positive and not compare yourself to too many of the (legitimately insanely good) candidates out there - just do what you can, without losing your mind over it.
Right before the first deadlines started approaching in late November, I basically started creating accounts in the grad application systems and checking to see what documents I would need. Something worth noting is that nearly every school has unique constraints or extra tasks for its application: most schools constrain the SoP in some way (I saw 2 pages, 1 page, 1000 words, single-spacing, double-spacing, pasting raw text only with no formatting). Additionally, some schools require a personal statement (talking more about your own experience and goals), and some require random short-answer questions (e.g., write about a time you had to teach).
I ended up spending most of the week before and of Thanksgiving writing the first batch of these, targeting the earlier applications first (warning: European deadlines are much earlier than the American deadlines, and you might have to be ready in October or early November). This first hump was the toughest, but once I was through it I had a bunch of variants of personal statements tailored for a bunch of different universities, with the two most useful variants being the two-page and 1000-word versions, since shrinking two pages down to 1000 words requires a non-trivial amount of cutting.
Much of this time was also spent inputting my details into the various application systems, which was mind-numbingly boring and took ages at times. It’s worth doing this early, because you can often send out recommendation letter requests as soon as you have created an account, which gives your letter-writers more time to work out the system and send in their letters.
Over December, I attended EMNLP and continued the application process during it. At this point, I was just trying to stay a bit ahead of the deadlines while also attending the conference. I mostly had the parts I needed, as noted above, but it still felt like a frantic rush, as I was often submitting 2-3 applications a day and still had to write the final paragraph of each personal statement, where I tried to make the connections between my work / interests and the professors I was applying to explicit (some people recommend not doing this; I’m not sure what is best).
I finished up applications by December 15 while in Australia (some schools have later deadlines, but I did not go for them). Unfortunately, there was a paper deadline 5 days later, so I had to turn my attention to paper-writing, but after that I just took a massive break and basically didn’t touch my laptop at all for two weeks.
Starting in very late December (around the 28th), I started to get interview requests (although this was very early). Requests picked up a bit in the first week of January, and kept arriving on a regular basis until around the end of the month (~the 25th). These were actually pretty spread out - I think I almost always had at least one interview lined up at any given point from late December to late January. They were a bit nerve-wracking, of course, but all ultimately enjoyable and not too difficult. It turns out that talking about research and NLP is fun, actually! I did prepare for these a little, collating a list of common questions and preparing answers, which I refined and worked on over time. For each professor, I looked up their background and recent work, and tried to think about questions specific to that school or professor that would be useful.
Here’s an (abbreviated and paraphrased) list of questions I prepared for and got from multiple people. I don’t think any of these should be too surprising!
For the most part, questions were not too technical, but definitely be prepared to talk about your own work and maybe 1-2 recent papers in detail, to be safe. Although - every professor has their own style, so it’s hard to give one-size-fits-all advice.
Also - I did not look at gradcafe or similar through this process, and to be honest didn’t ever feel the urge to. Everything runs at its own pace, and at this point you just have to go with the flow!
I was more or less done with interviews by the end of January, and I started getting offers in early February. It’s worth noting timelines here seem very, very messy - I know folks who got offers from the same school at different times, and even right up until April some folks I know were getting offers. Every school has its own pace, and I think the best strategy once interviews are done is to just try not to think much about it and not expect super-fast turnarounds.
I took a nice long break vacationing around Europe to enjoy the period before starting PhD, and I definitely recommend taking some time off if you can afford it!
This was a very rough timeline with some thoughts. Don’t worry if you are ready sooner or later than the dates here, and feel free to move at your own pace. But hopefully it provides a little insight into what (I think) a semi-reasonable timeline looks like. It’s also worth noting that applying for things like fellowships (e.g., NSF) often takes place earlier than admissions, which can be another chance to get stuff ready early.
14/02/23 Update: I have updated the model responses in light of the recently-released ChatGPT and text-davinci-003 models, scroll to the bottom of this page to check them out!
Latin has more text extant than Ancient Greek (I think - at least, it would be more common than Ancient Greek in some random web/book crawl). So this should mean GPT-3 is better at it. Furthermore, Google Translate Latin exists, which provides a fun comparison.
Let’s take a few lines from Book 12 of the Aeneid, and see how it goes. For those who haven’t used GPT-3 before, it works by prompting it with some text in natural language, and then letting it autocomplete your text (a bit like pressing autocomplete on a keyboard, but for a full paragraph). People often try out multiple prompts to work out what is best - I messed around a little and settled on a basic format of Translate the following to English: <text> English:, which seemed to do okay. In the following chunks, bolded indicates model-generated text, with the rest of the text being input by me:
Translate the following to English:
Turnus ut infractos adverso Marte Latinos defecisse videt, sua nunc promissa reposci, se signari oculis, ultro implacabilis ardet attollitque animos. Poenorum qualis in arvis saucius ille gravi venantum vulnere pectus tum demum movet arma leo gaudetque comantis excutiens cervice toros fixumque latronis inpavidus frangit telum et fremit ore cruento: haud secus adcenso gliscit violentia Turno.
English:
Turnus sees the Latins defeated by Mars and is furious, demanding that his promises be fulfilled. He is determined to take revenge and rallies his troops. Just as a wounded lion in the fields finally moves to attack the hunters who wounded him, after shaking off the blood from his mane, and breaks the spear of the bold robber with his powerful neck, so too does Turnus’ fury grow as he climbs up to fight.
For those of you who didn’t spend your final year of high school memorising book 12 of the Aeneid, here’s a professional translation (from poetry in translation):
When Turnus saw the Latins exhausted, and weakened by their military reverse, himself the subject of every gaze, his own promise to them yet unfulfilled, he burned implacably, and unprompted, and raised his courage. As a lion, in the African bush, severely hurt by huntsmen with a wound to the chest, only then rouses himself to battle, tosses his shaggy mane over his neck, in joy, and, unafraid, snaps off the spear some poacher has planted in him, roaring from blood-stained jaws: so the violence grew in Turnus’s inflamed heart.
This does pretty well! It sort of muddles bits, but the lion simile remains intact in a nice way. By the way, Google Translate does awful here:
Turnus sees that the Latins, broken in opposition to Mars, have failed, now that their promises have been restored, and that they have been sealed with their eyes; He inflicts punishments such as those in the fields on the chest of the hunted with a severe wound, then at last he moves his arms like a lion and rejoices, shaking the neck of the commander, fixed on his neck, and fearlessly breaks his weapon and roars with a bloody mouth: just as the burning violence of Turnus glides.
So it appears (to me) that GPT-3 has a better idea of Latin than whatever Google uses! The Aeneid is too easy, though - an all-time classic, which has probably appeared all over in GPT-3’s training. Let’s try something a bit different, Ovid’s Ars Amatoria (still famous, but not as mega-famous as the Roman Epic). I’ll use an easier sentence:
Translate the following to English: Atque aliqua ex illis cum regum nomina quaeret, Quae loca, qui montes, quaeve ferantur aquae, Omnia responde, nec tantum siqua rogabit; Et quae nescieris, ut bene nota refer.
English:
And when she asks about the names of the kings, the places, the mountains, and the waters that are mentioned, answer everything, and even if she doesn’t ask, explain what you know clearly. And if you don’t know something, say so.
Here’s Google’s version:
And when some of them ask for the names of kings, what places, what mountains, what waters are carried, answer everything, and he will not ask so much; And what you do not know, report as well known.
And a more professional, actual translation:
And if she, among them, asks the name of a king, what place, what mountains, and what stream’s displayed, you can reply to all, and more if she asks: and what you don’t know, reply as memory prompts.
So GPT-3 did pretty well here! Although the final sentence is wrong (Ovid says to say what you can recall, rather than to say if you don’t know).
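For reference, the prompt format I settled on is simple enough to sketch as a tiny helper (the function here is hypothetical, just to make the template explicit - at the time, the resulting string would have been sent to OpenAI’s completions API with text-davinci-002 and the model’s continuation taken as the translation):

```python
def make_prompt(text: str) -> str:
    """Wrap a Latin (or Greek) passage in the translation prompt format used in this post."""
    return f"Translate the following to English:\n\n{text}\n\nEnglish:"

# Example: build the prompt for a short Latin phrase.
prompt = make_prompt("Arma virumque cano")
print(prompt)
```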
For good measure, let’s try something legal (albeit still fairly straightforward): Cicero’s Pro Roscio Amerino, a famous legal defence.
Translate the following to English: fide sapientiaque vestra fretus plus oneris sustuli quam ferre me posse intellego. hoc onus si vos aliqua ex parte adlevabitis, feram ut potero studio et industria, iudices; sin a vobis, id quod non spero, deserar, tamen animo non deficiam et id quod suscepi quoad potero perferam. quod si perferre non potero, opprimi me onere offici malo quam id quod mihi cum fide semel impositum est aut propter perfidiam abicere aut propter infirmitatem animi deponere.
English:
I realize that I have taken on more than I can handle, but I am relying on your wisdom and support. If you can help me in any way, I will do my best to carry on, judges. If I am abandoned by you, which I do not hope for, I will not give up and will do my best to carry out what I have started. If I cannot do so, I would rather be crushed by the weight of my responsibilities than abandon or give up on what has been entrusted to me.
A professional translation would be:
Relying on your integrity and wisdom, I have undertaken a greater burden than, I am well aware, I am able to bear. If you, in some degree, lighten this burden, O judges, I will bear it as well as I can with zeal and industry. But if, as I do not expect, I am abandoned by you, still I will not fail in courage, and I will bear what I have undertaken as well as I can. But if I cannot support it, I had rather be overwhelmed by the weight of my duty, than either through treachery betray, or through weakness of mind desert, that which has been once honestly entrusted to me.
As a final fun test, let’s see if GPT-3 knows the basic forms of the verb to be:
What is the 1st person singular form of the verb ‘to be’ in Latin? Sum
What is the 2nd person singular form of the verb ‘to be’ in Latin? The 2nd person singular form of the verb ‘to be’ in Latin is ‘est’.
What is the 3rd person singular form of the verb ‘to be’ in Latin? The third person singular form of the verb “to be” in Latin is “est.”
What is the 1st person plural form of the verb ‘to be’ in Latin? sumus
What is the 2nd person plural form of the verb ‘to be’ in Latin? You are (plural)
What is the 3rd person plural form of the verb ‘to be’ in Latin? Sunt.
Well, it gets a bit confused, although I think this is quite hard considering the prompt doesn’t give any examples. I find it interesting it gets the 2nd person forms wrong, considering those are forms I have found to be far far less common than the others.
Overall, though, this did a really good job - I’m really impressed by GPT-3’s Latin abilities. I wouldn’t put it past it having memorised these texts (these all exist on the internet, with translations usually somewhere in the same document), but it’s still quite impressive to see. It also handily beats Google, which is fun.
Let’s start with something really easy: the first four lines of the Iliad.
Translate the following to English:
μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος οὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε, πολλὰς δ᾽ ἰφθίμους ψυχὰς Ἄϊδι προΐαψεν ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν οἰωνοῖσί τε πᾶσι, Διὸς δ᾽ ἐτελείετο βουλή, ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε Ἀτρεΐδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς.
English: Achilles, the son of Peleus, killed Hector, who had caused great harm to the Achaeans. He also killed many other brave men, and he terrified those who were still alive. This was the will of Zeus, who had first stirred up the conflict between King Atreus’ son Agamemnon and noble Achilles.
This is already worse - Hector doesn’t even appear in the Greek text! However, overall it’s okay, seemingly getting a sense of the text, although being definitely incorrect at times (unlike the Latin translations, which tended to be correct). For those of you who don’t know the Iliad by heart, these lines would be translated something like (from poetry in translation):
Goddess, sing me the anger, of Achilles, Peleus’ son, that fatal anger that brought countless sorrows on the Greeks, and sent many valiant souls of warriors down to Hades, leaving their bodies as spoil for dogs and carrion birds: for thus was the will of Zeus brought to fulfilment. Sing of it from the moment when Agamemnon, Atreus’ son, that king of men, parted in wrath from noble Achilles.
If you prompt it with the first clause(s), it gets a much better translation, likely due to it having encountered the beginning of the Iliad countless times in training:
Sing, Muse, of the wrath of Achilles, son of Peleus, that brought countless ills upon the Achaeans. Many a brave soul did it send hurrying down to Hades, and many a hero did it yield a prey to dogs and vultures, for so was the will of Zeus fulfilled from the time when first they quarreled, Atreus’ son, king of men, and brilliant Achilles.
There are many reasons GPT-3 could be doing worse for Greek, but one large change from Latin is the use of a different script. Let’s try normalising the text into beta code (a standard conversion of Ancient Greek into ASCII) and see how it goes:
Translate the following to English:
mh=nin a)/eide qea\ *phlhi+a/dew *)axilh=os ou)lome/nhn, h(\ muri/’ *)axaioi=s a)/lge’ e)/qhke, polla\s d’ i)fqi/mous yuxa\s *)/ai+di proi/+ayen h(rw/wn, au)tou\s de\ e(lw/ria teu=xe ku/nessin oi)wnoi=si/ te pa=si, *dio\s d’ e)telei/eto boulh/, e)c ou(= dh\ ta\ prw=ta diasth/thn e)ri/sante *)atrei/+dhs te a)/nac a)ndrw=n kai\ di=os *)axilleu/s.
English:
After he had seen the many wrongs that had been done to Axylus, a just man, Hercules went to the land of the Achaeans and killed many of the evil men there. But he spared the life of Axylus, because he was a good man. So Zeus gave him a city and made him a god, and the first temple to Hercules was built by the men of Troy after they had been saved from disaster by Hercules and by Zeus himself.
This is even worse! Removing the accent symbols (leaving only the letters) also doesn’t help. Evidently, GPT-3 doesn’t really know Ancient Greek at all (although it has definitely seen the Iliad).
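(For the curious, by ‘removing the accent symbols’ I mean stripping the combining diacritics via Unicode normalisation - a minimal sketch of the idea:)

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove combining diacritics (accents, breathings, etc.) from polytonic Greek.

    NFD splits each accented letter into a base letter plus combining marks,
    which we then filter out before recomposing with NFC.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(strip_accents("μῆνιν ἄειδε θεὰ"))  # → "μηνιν αειδε θεα"
```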
Since the Iliad is so famous, let’s try something more obscure. Here’s a paragraph from Heron’s On Automata:
Translate the following to English:
καθὰ δὲ προεθέμην, ἐρῶ περὶ ἑνὸς πίνακος τοῦ δοκοῦντός μοι κρείττονος. μῦθος μὲν ἦν τεταγμένος ἐν αὐτῷ ὁ κατὰ τὸν Ναύπλιον. τὰ δὲ κατὰ μέρος εἶχεν οὕτως· ἀνοιχθέντος ἐν ἀρχῇ τοῦ πίνακος ἐφαίνετο ζῴδια γεγραμμένα δώδεκα· ταῦτα δὲ ἦν εἰς τρεῖς στίχους διῃρημένα· ἦσαν δὲ οὗτοι πεποιημένοι τῶν Δαναῶν τινες ἐπισκευάζοντες τὰς ναῦς καὶ γινόμενοι περὶ καθολκήν.
English: I was once thinking about a picture that seemed better to me than it actually was. The subject of the picture was the story of the Odyssey. Part of the picture showed the twelve Labors of Hercules, which were divided into three groups. These were done by some of the Danaids who were repairing their ships and were becoming burdened.
And if I prompt the model with the first few words:
…English: As I laid out before, I was speaking about a painting that seemed better to me. The painting was hung in such a way that in the beginning, when you opened it, you saw twelve animals painted on it, divided into three verses. These were some of the Danaans, who, when they were repairing their ships, became encircled.
Here’s the correct translation:
As I laid out before, I will talk about one box that seems superior to me. The story set in it was the one about Naupilius. And its parts went like this. In the beginning, when the box opened, 12 painted figurines appeared. These were divided into 3 rows; and these were made to represent some of the Danaans (Greeks) preparing their ships and launching them.
So, the model has clearly learnt some vocab - it gets the number 12, it works out Danaans, ships, and a few other things. These translations seem vaguely familiar but ultimately are definitely incorrect. It’s cool that it works this well, but future classics scholars are definitely not going to be able to use it to cheat on their university homework 😜
Unsurprisingly, GPT-3 is pretty good! Its Latin translation abilities seem pretty good, and it definitely feels like it has seen ancient texts written in English enough to regurgitate them when prompted. Its success at Latin makes its harsher failures in translating Ancient Greek a little surprising, but I suspect the big switch into a different script makes things much harder, and that there is simply far less Ancient Greek text in its pretraining corpus. I’ve focussed on languages I’m personally familiar with here, but it would also be interesting to see if one could somehow get a large language model like GPT-3 to give its best guess on how to translate Linear A or similar!
As a technical note, I used text-davinci-002 for this post.
Very shortly after writing this blog post, OpenAI came out with ChatGPT and text-davinci-003. Here’s a quick try of all the above prompts with these two models.
First, let’s try out text-davinci-003, the successor to text-davinci-002 (based on the name and OpenAI’s model documentation). I also tweaked the input prompts to indicate the original language of the text - I found this slightly improved the model’s responses, especially for Greek.
text-davinci-003 Latin

Translate the following to English from Latin:
Turnus ut infractos adverso Marte Latinos defecisse videt, sua nunc promissa reposci, se signari oculis, ultro implacabilis ardet attollitque animos. Poenorum qualis in arvis saucius ille gravi venantum vulnere pectus tum demum movet arma leo gaudetque comantis excutiens cervice toros fixumque latronis inpavidus frangit telum et fremit ore cruento: haud secus adcenso gliscit violentia Turno.
English:
Turnus sees that the Latins have been defeated by the adverse forces of Mars, and he now demands his promised reward; his eyes blaze with implacable rage and his spirit is raised. Just as a wounded lion in the fields, wounded by the heavy weapons of the hunters, finally stirs and rises up, shaking his maned neck and with fearless strength breaking the spear of the robber and roaring with bloody mouth, so too does Turnus surge with violence.
Translate the following to English from Latin:
Atque aliqua ex illis cum regum nomina quaeret, Quae loca, qui montes, quaeve ferantur aquae, Omnia responde, nec tantum siqua rogabit; Et quae nescieris, ut bene nota refer.
English:
And when someone asks you for the names of the kings, the places they lived, the mountains, and the rivers they are said to have, answer them all, even if you are not asked; and if you don’t know something, make sure to report it back as if you did.
Translate the following to English from Latin:
fide sapientiaque vestra fretus plus oneris sustuli quam ferre me posse intellego. hoc onus si vos aliqua ex parte adlevabitis, feram ut potero studio et industria, iudices; sin a vobis, id quod non spero, deserar, tamen animo non deficiam et id quod suscepi quoad potero perferam. quod si perferre non potero, opprimi me onere offici malo quam id quod mihi cum fide semel impositum est aut propter perfidiam abicere aut propter infirmitatem animi deponere.
English:
Trusting in your fidelity and wisdom, I have taken on more of a burden than I understand I can bear. If you lighten this burden in any way, I shall bear it as best I can with diligence and hard work, judges; if I am abandoned by you, which I do not hope, nevertheless I shall not fail in spirit and I shall carry out as best I can what I have taken on. But if I cannot carry it out, I would rather be overwhelmed by the burden of this duty than abandon or lay down what has been entrusted to me with faithfulness because of my lack of courage.
What is the 1st person singular form of the verb ‘to be’ in Latin? Esse.
What is the 2nd person singular form of the verb ‘to be’ in Latin? Esse.
What is the 3rd person singular form of the verb ‘to be’ in Latin? Est.
What is the 1st person plural form of the verb ‘to be’ in Latin? Sumus.
What is the 2nd person plural form of the verb ‘to be’ in Latin? Estis.
What is the 3rd person plural form of the verb ‘to be’ in Latin? Sunt.
Verdict: text-davinci-003 does really well! Overall these translations are pretty good - you can probably quibble with some of the choices, but they generally hold up. However, it does get some of the basic verb-form questions wrong, which is surprising - these should be really easy for it!
text-davinci-003 Greek

Translate the following to English from Homeric Greek:
μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος οὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε, πολλὰς δ᾽ ἰφθίμους ψυχὰς Ἄϊδι προΐαψεν ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν οἰωνοῖσί τε πᾶσι, Διὸς δ᾽ ἐτελείετο βουλή, ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε Ἀτρεΐδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς.
English: Sing, goddess, the anger of Peleus’ son Achilleus, that destructive wrath which brought countless woes upon the Achaeans, and sent forth to Hades many valiant souls of heroes; and made their bodies a prey for dogs and all birds, and the will of Zeus was accomplished since that time when first there stood in strife Atreus’ son, king of men, and brilliant Achilleus.
Translate the following to English from Ancient Greek:
καθὰ δὲ προεθέμην, ἐρῶ περὶ ἑνὸς πίνακος τοῦ δοκοῦντός μοι κρείττονος. μῦθος μὲν ἦν τεταγμένος ἐν αὐτῷ ὁ κατὰ τὸν Ναύπλιον. τὰ δὲ κατὰ μέρος εἶχεν οὕτως· ἀνοιχθέντος ἐν ἀρχῇ τοῦ πίνακος ἐφαίνετο ζῴδια γεγραμμένα δώδεκα· ταῦτα δὲ ἦν εἰς τρεῖς στίχους διῃρημένα· ἦσαν δὲ οὗτοι πεποιημένοι τῶν Δαναῶν τινες ἐπισκευάζοντες τὰς ναῦς καὶ γινόμενοι περὶ καθολκήν.
English: As I had previously planned, I saw a painting which seemed better to me. A myth was depicted in it, concerning the story of Nauplius. In parts, it was arranged in this way: when the painting was opened, there were twelve zodiacs written in three verses; these were composed by some of the Danaans, who were making repairs to the ships and becoming involved in a voyage.
Translate the following to English from Ancient Greek:
καθὰ δὲ προεθέμην, ἐρῶ περὶ ἑνὸς πίνακος τοῦ δοκοῦντός μοι κρείττονος. μῦθος μὲν ἦν τεταγμένος ἐν αὐτῷ ὁ κατὰ τὸν Ναύπλιον. τὰ δὲ κατὰ μέρος εἶχεν οὕτως· ἀνοιχθέντος ἐν ἀρχῇ τοῦ πίνακος ἐφαίνετο ζῴδια γεγραμμένα δώδεκα· ταῦτα δὲ ἦν εἰς τρεῖς στίχους διῃρημένα· ἦσαν δὲ οὗτοι πεποιημένοι τῶν Δαναῶν τινες ἐπισκευάζοντες τὰς ναῦς καὶ γινόμενοι περὶ καθολκήν.
English: As I laid out before, I saw a painting that seemed better than the others. There was a myth drawn on it, according to the story of Nauplius. The painting was divided into twelve parts, which were further divided into three verses. These verses were composed by some of the Greeks who were repairing their ships and preparing for a voyage.
Verdict: I think that text-davinci-003 does much better than 002 here: it gets the Iliad more or less fine, while the Heron has some issues (likely due to vocab issues). This is pretty surprisingly good to me, and I imagine with even a little exposure to more Ancient Greek text the model would do even better. Without telling it to translate ‘from Greek’, as I did for 002, the model does a bit worse, and muddles more words, but is still pretty good.
Next, let’s try ChatGPT. To be fair, I’m going to use the same prompts as above on a blank conversation (as there is no API available yet). To be exact, I’m using the January 30th edition of ChatGPT.
ChatGPT (currently) gives titles to the conversations that occur. I don’t know what model does this, but it gives some interesting insight as to what a (I guess GPT-based) model thinks is going on. I’ve included these titles below the prompt and model response.
ChatGPT Latin

Translate the following to English from Latin:
Turnus ut infractos adverso Marte Latinos defecisse videt, sua nunc promissa reposci, se signari oculis, ultro implacabilis ardet attollitque animos. Poenorum qualis in arvis saucius ille gravi venantum vulnere pectus tum demum movet arma leo gaudetque comantis excutiens cervice toros fixumque latronis inpavidus frangit telum et fremit ore cruento: haud secus adcenso gliscit violentia Turno.
English:
When Turnus sees that the Latins have fallen under the unfavorable power of Mars despite their unbroken spirit, he now demands that his promises be fulfilled. He rages and raises his spirit, burning with implacable anger. Just as a wounded lion in the fields, once struck by a serious wound from hunters, then only moves to arms, rejoicing as he shakes his mane and fearlessly breaks the weapon of the thief with his steadfast neck and snarling with bloody mouth, so does Turnus’s violence boil up when he ascends.
ChatGPT Title: Turnus’ Furious Ascension
Translate the following to English from Latin: Atque aliqua ex illis cum regum nomina quaeret, Quae loca, qui montes, quaeve ferantur aquae, Omnia responde, nec tantum siqua rogabit; Et quae nescieris, ut bene nota refer.
English:
And answer some of them when they ask about the names of kings, what places, what mountains, and what waters are said to exist; respond to everything, not only if someone asks; and what you do not know, report as well known.
ChatGPT Title: Answer Regal Questions
Translate the following to English from Latin: fide sapientiaque vestra fretus plus oneris sustuli quam ferre me posse intellego. hoc onus si vos aliqua ex parte adlevabitis, feram ut potero studio et industria, iudices; sin a vobis, id quod non spero, deserar, tamen animo non deficiam et id quod suscepi quoad potero perferam. quod si perferre non potero, opprimi me onere offici malo quam id quod mihi cum fide semel impositum est aut propter perfidiam abicere aut propter infirmitatem animi deponere.
English:
Relying on your faith and wisdom, I have taken on more weight than I understand I am able to bear. If you can relieve this burden in any way, I will bear it to the best of my abilities with effort and determination, judges. If I am abandoned by you, which I do not hope for, I will not let my spirit falter, and I will carry out what I have undertaken as best as I can. If I am unable to carry it out, I would rather be crushed by the weight of my duty than abandon or lay down what was once entrusted to me with faith because of my own faithlessness or weakness of spirit.
ChatGPT Title: Fide Sapientiaque Onus Sustuli
What is the 1st person singular form of the verb ‘to be’ in Latin? In Latin, the 1st person singular form of the verb “to be” is “sum.”
ChatGPT Title: 1st Person Singular “to be” in Latin
What is the 2nd person singular form of the verb ‘to be’ in Latin? The 2nd person singular form of the verb “to be” in Latin is “es.”
ChatGPT Title: 2nd Person “Es” Latin “To Be”
What is the 3rd person singular form of the verb ‘to be’ in Latin? In Latin, the 3rd person singular form of the verb “to be” is “est”.
ChatGPT Title: 3rd Person Singular Latin “Est”
What is the 1st person plural form of the verb ‘to be’ in Latin? The 1st person plural form of the verb “to be” in Latin is “sumus.”
ChatGPT Title: Latin 1st Person Plural To Be
What is the 2nd person plural form of the verb ‘to be’ in Latin? The 2nd person plural form of the verb “to be” in Latin is “estis.”
ChatGPT Title: Latin 2nd Plural “to be”
What is the 3rd person plural form of the verb ‘to be’ in Latin? The third person plural form of the verb “to be” in Latin is “sunt.”
ChatGPT Title: Latin 3rd Person Plural “sunt”
Verdict: Wildly, I think ChatGPT’s quality here is still really good, and it gets all the verb forms right! I think it’s definitely as good as text-davinci-003, although still a little muddled. I suspect more context would help in some cases, giving the model a better idea of what it is translating. Considering the improvements over 002, I definitely wonder if both text-davinci-003 and ChatGPT have benefited from Latin students using them to help with homework… 🤔
ChatGPT Greek
Translate the following to English from Homeric Greek:
μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος οὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε, πολλὰς δ᾽ ἰφθίμους ψυχὰς Ἄϊδι προΐαψεν ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν οἰωνοῖσί τε πᾶσι, Διὸς δ᾽ ἐτελείετο βουλή, ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε Ἀτρεΐδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς.
English: “Sing, goddess, the anger of Achilles, son of Peleus, that brought countless woes upon the Achaeans, and sent many valiant souls of heroes to Hades, and made them themselves food for dogs and all birds; for so was the will of Zeus fulfilled, from which the first sufferings arose, that of Atreus’ son, king of men, and Achilles, the glorious.”
ChatGPT Title: Homeric Greek Translation
Translate the following to English from Ancient Greek:
καθὰ δὲ προεθέμην, ἐρῶ περὶ ἑνὸς πίνακος τοῦ δοκοῦντός μοι κρείττονος. μῦθος μὲν ἦν τεταγμένος ἐν αὐτῷ ὁ κατὰ τὸν Ναύπλιον. τὰ δὲ κατὰ μέρος εἶχεν οὕτως· ἀνοιχθέντος ἐν ἀρχῇ τοῦ πίνακος ἐφαίνετο ζῴδια γεγραμμένα δώδεκα· ταῦτα δὲ ἦν εἰς τρεῖς στίχους διῃρημένα· ἦσαν δὲ οὗτοι πεποιημένοι τῶν Δαναῶν τινες ἐπισκευάζοντες τὰς ναῦς καὶ γινόμενοι περὶ καθολκήν.
English: “And as I previously mentioned, I love a picture that seems to me to be better. A story was depicted in it about the Naulpian. The parts were arranged in this way: when the picture was opened at the beginning, twelve zodiacs appeared written; these were divided into three lines. And these were made by some of the Danai, repairing the ships and becoming around the harbor.”
ChatGPT Title: Description of a Superior Painting Depicting the Battle of Nafplion
Translate the following to English:
καθὰ δὲ προεθέμην, ἐρῶ περὶ ἑνὸς πίνακος τοῦ δοκοῦντός μοι κρείττονος. μῦθος μὲν ἦν τεταγμένος ἐν αὐτῷ ὁ κατὰ τὸν Ναύπλιον. τὰ δὲ κατὰ μέρος εἶχεν οὕτως· ἀνοιχθέντος ἐν ἀρχῇ τοῦ πίνακος ἐφαίνετο ζῴδια γεγραμμένα δώδεκα· ταῦτα δὲ ἦν εἰς τρεῖς στίχους διῃρημένα· ἦσαν δὲ οὗτοι πεποιημένοι τῶν Δαναῶν τινες ἐπισκευάζοντες τὰς ναῦς καὶ γινόμενοι περὶ καθολκήν.
English: As I laid out before, I fell in love with a painting that seemed to me to be superior. There was a story depicted on it, about Nauplius. Part of it was as follows: when the painting was opened at the beginning, there were twelve zodiac signs written in it, which were divided into three lines. These were made by some of the Danaans as they were repairing their ships and undergoing hardship.
ChatGPT Title: Painted Nauplian Mythology
Verdict: Interestingly, I feel that text-davinci-003 did a bit better with these, and ChatGPT makes some interesting spelling errors at times (Danai? Naulpian?). Overall, it’s definitely still impressive, but I wonder if the use of (allegedly) a smaller model or the further training that ChatGPT went through has cost it some Greek ability compared to text-davinci-003.
Overall, these model improvements are very impressive, especially considering translating Ancient Greek and Latin are likely pretty obscure uses for these models - it’d be interesting to learn if these improvements have come from explicit related feedback during the RLHF training process (students, professors giving feedback on classical language translations), or if it’s a byproduct of other tasks (e.g., feedback on general translation tasks, or further pretraining on related data). A big issue that pops out to me here is managing context: many of the issues with the Heron translation likely come from not knowing the surrounding context of the extract, and choosing sub-optimal translations as a result (ζῴδια could be Zodiac-related, but it can also mean ‘figurine’, which is the correct translation here). Likely, giving the model longer extracts or prompting it with the domain / theme / etc. of the text would further improve its translations.
However, even just as-is I think that the current abilities displayed here are pretty impressive, and are more than good enough to be moderately useful, although would not replace an expert translation and / or well-annotated text.
I’m using the default Minima theme, considering it’s both simple and fairly easy to modify. These are currently the various things I’ve added to it (beyond general style tweaks):
- dark-mode-toggle element

As I tweak the plugins and design, I’ll come back to this post and add my changes to the list above. Hopefully, it provides a useful list of ways to tweak a Github Pages site to your liking (and reminds me of what I added if I choose to change it up again). Alternatively, you can find the codebase for this site here.
tl;dr: check out a fun little Ace Attorney generator I made!
Phoenix Wright: Ace Attorney is a visual novel series created by Shu Takumi (also of Ghost Trick fame) where you play as a lawyer (usually the titular character Phoenix Wright, although this changes over the games) in a heavily fictionalised version of the Japanese legal system. I heavily recommend this series if you like murder mysteries or puzzle games - the gameplay loop is centred around investigating murders and finding evidence, and then using that evidence in court to prove contradictions or lies in witnesses’ testimonies.
As a visual novel, Ace Attorney is a text-heavy game, and so with the recent release of the Great Ace Attorney Chronicles and the popularity of the twitter court bot, I thought it would be fun to make use of the court bot’s code to auto-generate Ace Attorney scripts and animate them! This is fairly easy to do, as it turns out - I used the great aitextgen library for training models and generating text, and the objection engine library for animating the scripts - all I had to do really was train the model itself and write some code to convert it to a format the objection engine recognises.
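The conversion step is simple enough to sketch. As a rough illustration (the script format and names here are my own guesses, not the project's actual code), the generated text can be parsed into (speaker, line) pairs of the kind a renderer such as the objection engine consumes:

```python
import re

# Hypothetical sketch: assume the model emits a plain script of
# "Speaker: dialogue" lines, which we parse into (speaker, text) pairs.
LINE_RE = re.compile(r"^(?P<speaker>[A-Za-z .]+):\s*(?P<text>.+)$")

def parse_script(raw: str) -> list[tuple[str, str]]:
    """Turn raw generated text into (speaker, dialogue) pairs,
    skipping any lines that don't look like dialogue."""
    pairs = []
    for line in raw.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            pairs.append((m.group("speaker").strip(), m.group("text").strip()))
    return pairs

sample = """Phoenix: The witness is lying!
Edgeworth: Objection! That claim is baseless.
(court murmurs)
Judge: Order! Order in the court!"""

for speaker, text in parse_script(sample):
    print(f"{speaker}: {text}")
```

Stage directions and other non-dialogue lines simply get dropped, which is roughly the kind of clean-up a generator pipeline like this needs before animation.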
I made this into a little streamlit app available here using the 125M GPT-neo model (in order to keep processing times reasonably fast), so check it out! And if you find yourself enjoying the scenarios, I definitely recommend checking out any of the Ace Attorney games.
🚨 It’s important to note that I’m not filtering the bot, so it could produce potentially harmful text. Ace Attorney is mostly a teen-rated game, so there shouldn’t be much worse than murder mystery discussions, but the original training data of GPT-neo was much wider and expansive, so be warned!
Automata in ancient times were likely thought of quite differently from how we might think about ‘robots’ today: this was a time before electricity and computers, and so the modern notion of a machine holding circuits and powered by some internal battery did not yet exist. Mentions of automata date back to Homer, the most obvious and notable being the self-moving tripods and robotic assistants Hephaestus crafts in the Iliad:
Self-moving tripods (Book XVIII, 372–377, Iliad)
τὸν δ᾽ εὗρ᾽ ἱδρώοντα ἑλισσόμενον περὶ φύσας
σπεύδοντα: τρίποδας γὰρ ἐείκοσι πάντας ἔτευχεν
ἑστάμεναι περὶ τοῖχον ἐϋσταθέος μεγάροιο,
χρύσεα δέ σφ᾽ ὑπὸ κύκλα ἑκάστῳ πυθμένι θῆκεν,
ὄφρά οἱ αὐτόματοι θεῖον δυσαίατ᾽ ἀγῶνα
ἠδ᾽ αὖτις πρὸς δῶμα νεοίατο θαῦμα ἰδέσθαι.
And she [Thetis] found him [Hephaestus] sweating, hurrying about
his bellows, as he was crafting tripods, twenty in all,
to stand around the wall of his well-built hall,
and he had fitted golden wheels beneath the base of each
so that they might, on their own, enter the meeting of the gods,
and then return again to his house, a wonder to behold.
Robot assistants (Book XVIII, 410–420, Iliad)
… ὑπὸ δ᾽ ἀμφίπολοι ῥώοντο ἄνακτι
χρύσειαι ζωῇσι νεήνισιν εἰοικυῖαι.
τῇς ἐν μὲν νόος ἐστὶ μετὰ φρεσίν, ἐν δὲ καὶ αὐδὴ
καὶ σθένος, ἀθανάτων δὲ θεῶν ἄπο ἔργα ἴσασιν.
And attendants moved, supporting their lord
golden ones, like living young women.
They had sense and reason, and speech
and strength, and knowledge of handiwork from the immortal gods.
(Translations by me)
Also worth noting are Hephaestus’ automatic bellows (Book XVIII, 470–473, Iliad) and the Phaeacians’ mind-reading automatic ships (Book VIII, 555–563, Odyssey). At the time of writing, these automata may have been thought of purely in magical terms, but later on (e.g. by Hero’s time), such things would have been linked with the technical. While we can’t assume that Homer thought of these things in mechanical terms, the links are striking and interesting: the tripods explicitly have wheels and are linked with Hephaestus, who explicitly constructs things and is a craftsman. As such, even if originally these devices were dreamt of as magic, they undoubtedly served as inspiration for creators like Hero, who developed the techniques to make things like self-moving tripods a reality.
Either way, the notion of automata has been around for a long time, and Hero was by far not the first inventor to build them. Rather, he is another in a group of automata-makers, who built on each others’ work. Other big names in ancient automata were Philo of Byzantium, who Hero explicitly names and builds off, and Ctesibius, who invented an early form of the pipe organ and is credited with ‘inventing’ pneumatics. As we will see, Hero is quite open about using techniques invented by others, and his own fame is likely more a function of more of his work surviving, rather than him being more skilled (although he certainly was quite skilful).
On Automata was a treatise written by Hero, and is split into two books: the first describes what he dubs a ‘mobile automaton’ and the second a ‘stationary automaton’. It appears to mainly be a sort of instruction/explanation manual, with the end cut off. For a more in-depth look at the manuscript tradition surrounding it, I suggest reading Grillo’s PhD thesis, which goes into detail on this history. I’ve used his adaptation of the text below when translating the original Greek.
The mobile automaton is essentially a mobile diorama: a shrine (of sorts) of Dionysus with features like dancing figures and self-lighting fires. The diorama/shrine itself moves (and hence is a mobile automaton), travelling in a straight line, performing a scene, and then moving back, with everything returning to its starting position, allowing re-use. Interestingly, Hero describes different motions possible for the automaton, including a rectangular and a ‘snake-like’ motion. It’s debated where this sort of automaton would be displayed, but both private parties and public theatres seem plausible. Hero’s description of the automaton is in some ways closer to a textbook than an exact manual: he provides different ideas for different movements and setups, looking not to provide an exact schematic for a single device, but rather to illustrate a series of different mechanisms and systems that can be put together to create these sorts of automata:
ἐν μὲν οὖν τούτῳ τῷ βιβλίῳ περὶ τῶν ὑπαγόντων γρά- φομεν ἐκθέμενοι διάθεσιν ποικίλην κατά γε ἡμᾶς, ἥτις ἁρμόσει πάσῃ διαθέσει πρὸς τὸ δύνασθαι τὸν προαιρούμενον ἑτέρως διατίθεσθαι μηδὲν ἐπιζητοῦντα πρὸς τὴν τῆς διαθέσεως ἐνέργειαν·
So, in this book, I am writing about mobile automata, setting out an intricate design of my own, which will fit every other arrangement, such that whoever wants to arrange it differently will be able to, without needing anything else for the creation of the arrangement.
(I.8, translation mine)
So, let’s go through these various systems (chapter numbers and paragraphs given in brackets). In order, Hero describes:
The area and material requirements of the automaton - i.e. what sort of material to build it out of (light timber, bronze and iron, etc) and where to put it (if possible, a flat even surface). (1.II.1-5)
A broad description of the central mechanic of both the stationary and mobile automata: counterweights, with rope wound around axles. The counterweight rests on millet/mustard (or dry sand for the stationary automata) in a tube. At runtime, the millet or mustard is slowly drained from the bottom, allowing the counterweight and an attached rope to descend, causing the rope to spin any axles it’s attached to. (1.II.6-1.IV)
The rough design of the automata, illustrated below. Some rough measurements are given. (1.IV)
(1) Τούτων δὲ οὕτως ὑπαρχόντων ἐν ἀρχῇ τεθέντος τοῦ αὐτομάτου ἐπί τινα τόπον καὶ ἀποστάντων <ἡμῶν> μετ᾽ οὐ πολὺν χρόνον ὑπάξει τὸ αὐτόματον ἐπί τινα ὡρισμένον τόπον. καὶ στάντος αὐτοῦ ἀνακαυθήσεται ὁ κατάπροσθεν τοῦ Διονύσου βωμός. καὶ ἐκ μὲν τοῦ θύρσου τοῦ Διονύσου ἤτοι γάλα ἢ ὕδωρ ἐκπυτισθήσεται, ἐκ δὲ τοῦ σκύφους οἶνος ἐκχυθήσεται ἐπὶ τὸν ὑποκείμενον πανθηρίσκον.
(2) στεφανωθήσεται δὲ πᾶς ὁ παρὰ τοὺς τέσσαρας κίονας τῆς βάσεως τόπος. αἱ δὲ περικύκλῳ Βάκχαι περιελεύσονται χορεύουσαι περὶ τὸν ναΐσκον. καὶ ἦχος ἔσται τυμπάνων καὶ κυμβάλων. καὶ μετὰ ταῦτα σταθέντων τῶν ἤχων ἀποστραφήσεται τὸ τοῦ Διονύσου ζῴδιον εἰς τὸ ἐκτὸς μέρος. ἅμα δὲ τούτῳ καὶ ἡ ἐπικειμένη τῷ πυργίῳ Νίκη συνεπιστραφήσεται.
(3) καὶ πάλιν ὁ ἔμπροσθεν γεγονὼς τοῦ Διονύσου βωμός, πρότερον δὲ ὀπίσθιος ὑπάρχων ἀνακαυθήσεται. καὶ πάλιν ἐκ μὲν τοῦ θύρσου ὁ ἀναπυτισμὸς ἔσται, ἐκ δὲ τοῦ σκύφους ἡ ἔκχυσις. καὶ πάλιν αἱ Βάκχαι χορεύσουσι περιερχόμεναι τὸν ναΐσκον μετὰ ψόφου τυμπάνων καὶ κυμβάλων. καὶ πάλιν σταθεισῶν αὐτῶν τὸ αὐτόματον ἀναχωρήσει εἰς τὸν ἐξ ἀρχῆς τόπον.
(1) And with things in this way, at first the automaton is placed in a spot, and while we are standing away (from it), after a short time the automaton will move to a defined location. And once it stands still, the altar in front of Dionysus will flare up. And either milk or water will flow out of Dionysus’ thyrsus, and wine will flow out of his cup onto the panther lying below.
(2) And every place near the four columns of the altar will be crowned (with garlands). And the bacchantes all around will go around the shrine, dancing. And there will be a sound of kettledrums and cymbals. After this, when the sound has halted, the figurine of Dionysus will turn to the outside. At the same time as this, the Nike placed on the cupola will turn together with it.
(3) And again, the altar, which is now in front of Dionysus but before was behind him, will flare up. And again there will be the spurt from the thyrsus and the outpour from the cup. And again the bacchantes will dance going around the shrine with the noise of kettledrums and cymbals. And again, when they have come to a stop, the automaton will go back to the place it started.
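The counterweight drive at the heart of all this lends itself to a quick back-of-the-envelope sketch (the numbers here are my own illustrative assumptions, not Hero's measurements): the length of rope paid out by the descending weight fixes how many times the axle it is wound around turns.

```python
import math

# As the millet drains from the tube, the counterweight descends,
# paying out rope that spins the axle it is wound around.
def axle_turns(descent_m: float, axle_radius_m: float) -> float:
    """Rotations of the axle produced by a given counterweight descent."""
    return descent_m / (2 * math.pi * axle_radius_m)

# e.g. a 1.2 m descent driving an axle of 2 cm radius
turns = axle_turns(1.2, 0.02)
print(f"{turns:.1f} axle rotations")
```

The nice property, which Hero exploits throughout, is that the drain rate of the millet sets the tempo of the whole performance: every mechanism hanging off the same rope advances in lockstep with the weight's descent.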
After this, we come to descriptions of how to make motion (the more complex of which may not actually work well in practice):
Motion forward and back (1.V-VI), with allowances for pauses. This is done by winding the rope in particular ways, with slack added in certain spots to create the pauses.
Circular motion (1.VII-VIII), which uses axles set at angles, and wheels of different sizes.
Rectilinear motion (1.IX-X), which uses two sets of wheels, alternately raised and lowered.
‘Snake-like’ or simply non-rectangular motion (1.XI), for which Hero describes 3 configurations. All of these essentially use the core idea of multiple independent axles for the wheels, allowing different degrees of turning.
After this, Hero turns to implementations of aspects of the performance:
Lighting the fires (XII) is done by lighting a fire (probably manually before the automaton is run) under a grate covered by a plate, and then moving the plate via the same rope-counterweight system used for everything else.
Getting milk and wine (XIII) to spurt out is done via the use of pipes and a tap system, with again ropes controlling this system.
Sound is made by pouring little balls onto cymbals and drums, dropped by opening a door (XIV).
Garlands are dropped on the stage from trapdoors (XV), much like the balls.
The bacchantes are made to ‘dance’ by spinning them on their own wooden ring on the stage (XVI).
He then adds small details on how to hide the cords, showing how to split up the spaces for the millet counterweight, etc. (XVII.1-2)
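All of these performance primitives hang off the same unwinding cord, so one loose way to picture the whole performance (my framing, with made-up marks, and actions drawn from the Dionysus scene above) is as an ordered schedule keyed on how much cord has paid out:

```python
# Each primitive fires when the main cord reaches its mark
# (marks given as fractions of the total cord; values are invented).
schedule = [
    (0.2, "light altar fire"),
    (0.4, "spurt milk from the thyrsus"),
    (0.5, "pour wine from the cup"),
    (0.7, "drop balls onto drums and cymbals"),
    (0.8, "drop garlands from the trapdoors"),
    (1.0, "spin the bacchantes' ring"),
]

def fired_actions(cord_paid_out: float) -> list[str]:
    """Actions already triggered once this fraction of cord has unwound."""
    return [action for mark, action in schedule if mark <= cord_paid_out]

print(fired_actions(0.55))
```

Rearranging the performance is just a matter of re-attaching mechanisms at different points along the cord, which is exactly the remixable, sub-program-like quality Hero's presentation suggests.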
He then discusses methods for extending the range of the automaton:
First, he notes that using bigger wheels or smaller axles will extend the range (XVII.3)
Then he describes a system where the rope is wound around the smaller part of a pulley, and then onto a larger part, amplifying the rotation of later axles in the system (XVIII).
Finally, a brief (rough) description of a two-counterweight system is given (XIX). In this system, one counterweight deals with forwards and backwards motion, and the other with all other types of motion in the system.
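Both range-extension tricks are really gear-ratio arithmetic. A small sketch (the radii are illustrative assumptions): ground covered scales with the wheel-to-axle radius ratio, and a stepped pulley multiplies the effective rope length delivered downstream.

```python
import math

def travel_distance(rope_m, axle_r, wheel_r,
                    pulley_small_r=None, pulley_big_r=None):
    """Ground covered by the automaton for a given length of unwound rope."""
    step_up = 1.0
    if pulley_small_r and pulley_big_r:
        # Rope pulled off the small drum winds onto the large drum on the
        # same spindle, multiplying the rope fed to the drive axle.
        step_up = pulley_big_r / pulley_small_r
    effective_rope = rope_m * step_up
    turns = effective_rope / (2 * math.pi * axle_r)   # drive-axle rotations
    return turns * 2 * math.pi * wheel_r              # distance the wheels roll

base = travel_distance(1.0, axle_r=0.02, wheel_r=0.15)
boosted = travel_distance(1.0, axle_r=0.02, wheel_r=0.15,
                          pulley_small_r=0.03, pulley_big_r=0.09)
print(f"{base:.2f} m vs {boosted:.2f} m of travel from 1 m of rope")
```

With these assumed numbers, the plain wheel-to-axle ratio gives 7.5 m of travel per metre of rope, and a 3:1 stepped pulley triples that, which is why Hero can extend range without making the counterweight tube any taller.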
And that’s the first book! It’s both a description of how to build this one specific automaton and a bit of an explanation of generic techniques that can be re-used across different designs. Personally, the way all these different mechanisms can be mixed and placed wherever feels a bit like programming: you have sets of primitives (e.g. axles, or the fire-lighting mechanism) that are controlled largely in the same manner, through specific placement in a cord’s unwinding. While the physical aspect obviously would make it incredibly difficult to make changes on the fly or build without much pre-planning, Hero certainly presents these ideas as pre-made sub-programs for remixing. It’s important to note that it’s unlikely Hero built all the movement mechanisms he describes here - not all of them are physically feasible, making them more likely to be results of Hero’s own mathematical deductions than empirical results. Another interesting aspect is the way mathematical ideas are described. Hero is fairly geometric in his descriptions, with the most common formula being ἔστω + a geometric label, for example:
ἔστω γὰρ πλινθίον τὸ α̅β̅γ̅δ̅, ἐν ᾧ ἄξων ἔστω ὁ ε̅ζ̅ συμφυεῖς ἔχων τροχοὺς τοὺς η̅θ̅, κ̅λ̅· ὁ δὲ τρίτος τροχὸς ἔστω ὁ μ̅ν̅.
“Let there be a case, α̅β̅γ̅δ̅, in which let there be an axle, ε̅ζ̅, with wheels η̅θ̅ and κ̅λ̅ attached to it; and let there be a third wheel, μ̅ν̅.”
(translation mine)
There’s some debate on exactly how to translate ἔστω here which I am not qualified to weigh in on, but I’ll just note that it does feel similar to how we write out geometrical descriptions nowadays (‘let there be a line X…’).
Finally, I’d like to note Hero’s eye to showmanship, with him dedicating some time to discuss how to hide the mechanisms of the automaton, and even from the outset fronting that these automata are things that inspire and generate wonder in others. Indeed, as an automatic theatre-constructor, Hero here is acting as stage director and engineer at the same time. Even from the outset of invention, we see innovation arising not to serve functional needs, but rather as a way to express creativity in unique ways.
The stationary automaton is essentially a box that is able to display a series of scenes, acting like a mini theatre, containing painted images with moving elements (e.g. arms sticking out and moving, or figurines moving in front of a backdrop), and the box opening and closing on its own to facilitate scene transitions. It seems likely this sort of automaton was used in private parties as a form of entertainment. Let’s get into Hero’s description of it.
First, Hero notes that the description and work in the mobile automata were more original, and explicitly notes that in his stationary work he is working off what Philo had already done. He both criticises and praises aspects of Philo’s previous work. (XX)
He then very briefly describes the stationary automata in general: boxes that open and shut to show a series of different scenes with moving, painted figures (XXI).
He then talks about old stationary automata and one particular one he saw that impressed him, telling a story about the mythological hero Nauplius (XXII). He describes the set of scenes shown by the box, and it is its construction that the rest of this book relates. The scenes go as follows:
Book XXII.3-6
(3) καθὰ δὲ προεθέμην, ἐρῶ περὶ ἑνὸς πίνακος τοῦ δοκοῦντός μοι κρείττονος. μῦθος μὲν ἦν τεταγμένος ἐν αὐτῷ ὁ κατὰ τὸν Ναύπλιον. τὰ δὲ κατὰ μέρος εἶχεν οὕτως· ἀνοιχθέντος ἐν ἀρχῇ τοῦ πίνακος ἐφαίνετο ζῴδια γεγραμμένα δώδεκα· ταῦτα δὲ ἦν εἰς τρεῖς στίχους διῃρημένα· ἦσαν δὲ οὗτοι πεποιημένοι τῶν Δαναῶν τινες ἐπισκευάζοντες τὰς ναῦς καὶ γινόμενοι περὶ καθολκήν.
(4) ἐκινεῖτο δὲ ταῦτα τὰ ζῴδια τὰ μὲν πρίζοντα, τὰ δὲ πελέκεσιν ἐργαζόμενα, τὰ δὲ σφύραις, τὰ δὲ ἀρίσι καὶ τρυπάνοις χρώμενα <καὶ> ψόφον ἐποίουν πολύν, καθάπερ ἐπὶ τῆς ἀληθείας {γίνοιτο}. χρόνου δὲ ἱκανοῦ διαγενομένου κλεισθεῖσαι πάλιν ἠνοίγησαν αἱ θύραι, καὶ ἦν ἄλλη διάθεσις· αἱ γὰρ νῆες ἐφαίνοντο καθελκόμεναι ὑπὸ τῶν Ἀχαιῶν. κλεισθεισῶν δὲ καὶ πάλιν ἀνοιχθεισῶν, οὐδὲν ἐφαίνετο ἐν τῷ πίνακι πλὴν ἀέρος γεγραμμένου καὶ θαλάσσης.
(5) μετὰ δὲ οὐ πολὺν χρόνον παρέπλεον αἱ νῆες στολοδρομοῦσαι· καὶ αἱ μὲν ἀπεκρύπτοντο, αἱ δὲ ἐφαίνοντο. πολλάκις δὲ παρεκολύμβων καὶ δελφῖνες ὁτὲ μὲν εἰς τὴν θάλατταν καταδυόμενοι, ὁτὲ δὲ φαινόμενοι, καθάπερ ἐπὶ τῆς ἀληθείας. κατὰ μικρὸν δὲ ἐφαίνετο χειμέριος ἡ θάλασσα, καὶ αἱ νῆες ἔτρεχον συνεχῶς. κλεισθέντος δὲ πάλιν καὶ ἀνοιχθέντος, τῶν μὲν πλεόντων οὐδὲν ἐφαίνετο, ὁ δὲ Ναύπλιος τὸν πυρσὸν ἐξηρκὼς καὶ ἡ Ἀθηνᾶ παρεστῶσα·
(6) καὶ πῦρ ὑπὲρ τὸν πίνακα ἀνεκαύθη, ὡς ἀπὸ τοῦ πυρσοῦ φαινομένης ἄνω φλογός. κλεισθέντος δὲ καὶ πάλιν ἀνοιχθέντος, ἡ τῶν νεῶν ἔκπτωσις ἐφαίνετο καὶ ὁ Αἴας νηχόμενος, μηχανὴ τε {καὶ} ἄνωθεν τοῦ πίνακος ἐξήρθη καὶ βροντῆς γενομένης ἐν αὐτῷ τῷ πίνακι κεραυνὸς ἔπε- σεν ἐπὶ τὸν Αἴαντα, καὶ ἠφανίσθη αὐτοῦ τὸ ζῴδιον. καὶ οὕτως κλεισθέντος καταστροφὴν εἶχεν ὁ μῦθος. ἡ μὲν οὖν διάθεσις ἦν τοιαύτη.
(3) As I laid out before, I will talk about one box that seems superior to me. The story set in it was the one about Nauplius. And its parts went like this. In the beginning, when the box opened, 12 painted figurines appeared. These were divided into 3 rows; and these were made to represent some of the Danaans (Greeks) preparing their ships and launching them.
(4) These figurines moved, some sawing, some working with axes, some with hammers, and some with bow-drills and augers. They made much noise, just as it would be in reality. And once enough time had passed, the doors closed and opened again, and there was another arrangement; the ships, in fact, appeared being launched by the Achaeans (Greeks). And after the doors closed and opened again, nothing appeared in the box except the painted sky and sea.
(5) And not long after, the ships sailed along in line. Some were out of sight, and others were visible. Often dolphins swam along too, sometimes plunging into the sea, sometimes appearing, just like in real life. And gradually the sea appeared stormy, and the ships ran on uninterrupted. And after the doors shut and opened again, none of the sailing ships were visible, but Nauplius holding up the torch and Athena standing alongside (were visible).
(6) And a fire was lit up above the box, as if a flame appeared above from the torch. And after the doors closed and opened, the wreck of the ships appeared, and Ajax swimming; and a machine was raised above the box, and while there was thunder in the box itself, lightning fell on Ajax and his figure vanished. And thus, once the doors closed, the story came to an end. So, such was the arrangement.
He then begins his description of how to construct this with the general design and materials for the box, and the key element of the stationary automaton: the doors that swing open and shut automatically (XXIII). As before, this is done with a counterweight along with a series of knobs and axles and carefully wound rope.
The following chapters then go through the implementation of each scene above:
Second, the launching of the ships (XXV). The transition from the previous scene is achieved by painting this new scene on a cloth and using a rod as a weight. The same counterweight system is then used to release it at a particular time, changing the scene. This mechanism is used to transition to the fourth and fifth scenes too.
Third, the ships sailing (XXVI-XXVII). The sky and sea here are painted onto papyrus, which itself is attached to rollers on either side, allowing them to move back and forth and so make it look like the ships below (painted on cloth) are moving along. Dolphins are added on top of this, attached to a pulley inside the system that makes them swing up into the scene and then down, as if they were swimming alongside the ships.
Fourth, Nauplius and Athena (XXVIII). This is painted on cloth, and Nauplius’s torch is made by lighting some wood shavings using a small fire hidden inside the box, very much like how Dionysus’ altar was lit above.
Finally, the shipwreck (XXIX-XXX.6). Athena is placed on a base, which is flipped up and down via cords while she rotates on the base. A painted figure of Ajax swimming is present on top of the background. The lightning is made by dropping a board with some painting on it (by holding the board up with string and then dropping it), and at the same time as this falls, the figure of Ajax is covered with a cloth painted the same colour as the background, making him effectively vanish as he is struck by lightning.
There is then a brief (one-paragraph-long) cut-off epilogue (XXX.7), noting that these movements and the box are managed through the same means.
While potentially less exciting than the mobile automaton (since it doesn’t move), the stationary automaton is actually more intricate in some ways, telling a full story across more scenes than the mobile one, while still highlighting the versatility of the counterweight-style system. It’s also worth highlighting that this automaton was not Hero’s invention, but Philo’s, showing how there were a few people using these ideas and mechanisms to devise their own automata art (in fact, Philo likely predates Hero, and as such Hero likely learnt many of these techniques from Philo). Again, the artistic and the mechanical are blended in this automaton, with the focus on giving a good show to an audience, rather than solving some specific problem or issue. Beyond this, the presentation and ideas used share a lot with the mobile automaton. Personally, I think a seven-scene story is probably more exciting to watch than the relatively simple automated mobile shrine above, even if the fact that the mobile shrine moves on its own is fairly impressive. It’s fun to think about what you could potentially ‘program’ into this type of automaton, and the length of the stories you could tell - was watching this an ancient version of watching the latest blockbuster with the newest and best SFX? (Probably not.) Overall, the stationary automaton is just as technically impressive as the mobile one, with a more complex story being told.
Hero’s On Automata, to me, exposes a lot of interesting ideas and facts about ancient innovation, mathematics, and how people thought about automation. Interestingly, I think it links more to computer animation and computer art than it does to artificial intelligence, despite the name of ‘automata’. The focus is on creating awe and wonder, hiding the mechanical truths to get the audience to focus on the little stories told by these complex and intricate devices. The use of this early style of programming, and the coming up with novel ways to use a central system to create new effects, reminds me a lot of how blockbusters have often involved the creation of new technology to achieve a director’s vision. Perhaps this exposes a core element of human innovation, dating back to Homer and his automata: technological and artistic creation are somehow innately linked.
So that’s Hero’s On Automata. There are a lot of generic posts on Hero out there on the internet, but actual in-depth resources require a bit more digging, so I hope this post is able to show you something you didn’t previously know in a reasonable amount of detail. I’m certainly not an expert in this space - see my bibliography for the real experts - but nonetheless I hope my reasonably rare combination of classicist and software developer has provided a unique view. If you want to read further, in particular, I found Francesco Grillo’s PhD thesis on the first book of On Automata very thorough and informative for not just the book itself, but Hero’s life and context as a whole (as you might have guessed from my constant references to it above). Hope to see you around for my next post!
Bosak-Schroeder, Clara. “The Religious Life of Greek Automata.” Archiv Für Religionsgeschichte, vol. 17, no. 1, Dec. 2016, pp. 123–36. DOI.org (Crossref), doi:10.1515/arege-2015-0007.
Grillo, Francesco. Hero of Alexandria’s Automata: A Critical Edition and Translation, Including a Commentary on Book One. University of Glasgow, 2019.
Knight, Edward Henry. Knight’s American Mechanical Dictionary. http://onlinebooks.library.upenn.edu/webbin/book/lookupid?key=olbp69570.
McCourt, Finlay. “An Examination of the Mechanisms of Movement in Heron of Alexandria’s On Automaton-Making.” Explorations in the History of Machines and Mechanisms, edited by Teun Koetsier and Marco Ceccarelli, vol. 15, Springer Netherlands, 2012, pp. 185–98. DOI.org (Crossref), doi:10.1007/978-94-007-4132-4_13.
Sherwood, Andrew N., et al. Greek and Roman Technology: A Sourcebook: Annotated Translations of Greek and Latin Texts and Documents. Routledge, 2003. DOI.org (Crossref), doi:10.4324/9780203413258.