LLM inference is nearly deterministic. We use this to audit providers

November 28, 2025

Adam Karvonen, Daniel Reuter, Roy Rinberg, Luke Marks, Adrià Garriga-Alonso, Keri Warr · arXiv (paper link) · Github

One Minute Summary

Today there’s no reliable way to verify that an inference provider is actually running the model they claim, as they might be using quantized weights, have implementation bugs, or cut other corners. However, it turns out that LLM inference is nearly deterministic when you fix the sampling seed. In our experiments, when you fix the random sampling seed and regenerate an LLM’s output, over 98% of tokens match exactly. Token-DiFR exploits this to verify inference: just measure how much the inference provider’s tokens diverge from a reference implementation given a shared sampling seed.

This means you can detect any problem with the inference process, from sampling bugs to watermarking to quantization. For example, we can detect KV cache quantization with ~10,000 tokens and 4-bit model quantization with ~1,000 tokens. The method works with unmodified vLLM, has zero provider overhead, and any generated output token can be audited post-hoc. Even without sampling seed synchronization, you can audit any inference provider today using temperature-zero sampling.

Introduction

Ensuring the quality of LLM inference is a very hard problem. Anthropic recently disclosed that they had suffered from major inference bugs for weeks, causing Claude to randomly output Chinese characters and produce obvious syntax bugs in code. There are frequent allegations of inference problems with inference providers, and open weight models often have different benchmark scores depending on which inference provider you choose. Sometimes this is due to inference or tokenization bugs, and (allegedly) sometimes this is because inference providers quantize the model to reduce inference costs.

Different inference providers can produce radically different behaviours on the same benchmark. Taken from Simon Willison's blog: Open weight LLMs exhibit inconsistent performance across providers.

Today, inference quality is the Wild West. You hit an API endpoint and hope for the best. Benchmark scores provide some signal, but they’re noisy, expensive to run, and rare bugs (like randomly sampling Chinese characters) might not show up at all. What if there was a better way?

In this post, we focus on the problem of “inference verification” - concretely, given a claimed model (e.g. GPT-OSS-120B), an input prompt, and an output message, can you verify that the message actually came from that model?

The Problem: Valid Nondeterminism

The obvious solution is to just rerun the prompt and check whether you get the same output. Unfortunately, this doesn’t work: LLM inference is non-deterministic, even with a fixed seed or temperature-0 sampling.

Token generation happens in two steps:

  1. Forward pass: First the model generates a probability distribution over the vocabulary of tokens by running billions of floating-point operations.
  2. Sampling: A token is randomly drawn from that distribution. This step is deterministic given a fixed random seed.

Due to floating-point arithmetic, the forward pass step has small numerical noise that can vary across hardware, batch sizes, and software stacks.

So if you fix the sampling seed, why doesn’t everything just match? Because the floating-point noise in step 1 produces slightly different probability distributions, which can occasionally flip which token gets selected, even with identical sampling randomness.

We visualize this with a diagram of the Inverse Probability Transform sampling process (a common method for sampling from a distribution), introduced in our concurrent paper, Verifying LLM inference to Detect Model Weight Exfiltration. In short, if there are small variations in the distribution of token probabilities (rows in the pink rectangle), then even with a fixed sampling process (the red line), there is more than one “valid” token you may sample.

[Figure: small variations in the token probability distributions (pink rectangle), intersected by a fixed sampling threshold (red line), can yield more than one valid token.]
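As a toy illustration of this effect (a sketch, not code from the paper): with inverse-transform sampling, a fixed draw u picks the first token whose cumulative probability exceeds u, so a tiny perturbation to the distribution only changes the chosen token when u lands near a boundary.

```python
import numpy as np

def ipt_sample(probs, u):
    # Inverse Probability Transform: first token whose cumulative probability exceeds u
    return int(np.searchsorted(np.cumsum(probs), u))

u = 0.598                                  # fixed sampling randomness (shared seed)
reference = np.array([0.60, 0.30, 0.10])   # verifier's token distribution
perturbed = np.array([0.59, 0.31, 0.10])   # provider's, after tiny floating-point noise

# A draw of u near the CDF boundary flips the token...
assert ipt_sample(reference, u) == 0
assert ipt_sample(perturbed, u) == 1
# ...but most draws agree, which is why over 98% of tokens match exactly.
assert ipt_sample(reference, 0.30) == ipt_sample(perturbed, 0.30)
```

The flip only happens for the small sliver of u values between the two cumulative boundaries, which is why regenerated tokens agree the vast majority of the time.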

The Solution: Token-DiFR

However, it turns out that this sampling process is almost deterministic. Empirically, if we regenerate a token with a fixed sampling seed, we generate the exact same token over 98% of the time (in the models and settings we study). When there is disagreement on the next token, there’s usually only 2-3 plausible options. Horace He at Thinking Machines found a similar result: generating 1,000 completions of 1,000 tokens each at temperature zero with Qwen3-235B-A22B-Instruct-2507 produced only 80 unique outputs, with all completions sharing the same first 102 tokens.

Given this fact, there’s a very natural solution: simply measure the divergence between the provider’s token and the verifier’s recomputed token. We call this Token-DiFR (Divergence From Reference). Because inference is nearly deterministic (over 98% of tokens must exactly match), inference providers have minimal room to deviate from their claimed inference process. The tokens themselves become evidence of correct inference, which can be audited at any point.

This means you can:

  • Detect a quantized KV cache with ~10,000 output tokens
  • Detect 4-bit quantization in ~1,000 output tokens
  • Detect an incorrect sampling seed in ~100 output tokens

Thus, verifying LLM inference becomes very simple.

If you’re Anthropic and you’re serving billions of tokens across different hardware (TPU vs GPU vs Trainium) and adding daily inference optimizations, you can simply have a single model instance randomly checking generated outputs against a reference implementation to quickly detect problems. If you’re suspicious of your inference provider, just generate 10,000 tokens and verify these tokens against a reference implementation.

Token-DiFR is Robust to Tampering

Previous approaches verify inference by looking at statistical properties of the output, such as the expected log-rank of tokens within the verifier’s logits. The problem is that these statistics leave many degrees of freedom for a malicious provider. In our paper, we found that measuring mean cross-entropy works well for detecting quantization in the honest case. But it can be trivially fooled, as a provider can just tune their sampling temperature until the mean cross-entropy matches the expected value. Token-DiFR with seed synchronization doesn’t have this problem. Because over 98% of tokens must match exactly, there’s almost no room to manipulate the outputs while still passing verification.

What about implementation differences?

A common objection is that legitimate differences between implementations may cause too many false positives, as inference providers use different hardware, parallelism setups, and inference implementations. However, we found this is not a significant problem in practice.

We tested Token-DiFR across A100 vs H200 GPUs, single-GPU vs 4-GPU tensor-parallel setups, HuggingFace vs vLLM forward-pass implementations, and prefill vs decode phases. We find that the benign numerical noise from these legitimate differences is consistently smaller than the signal from actual problems like KV cache quantization or incorrect sampling configurations.[1] The level of noise does vary from model to model (Llama 3.1 8B Instruct, Qwen3-30B-A3B, Qwen3-8B), and higher noise does make detection more challenging, but in all settings we analyze we can still detect FP8 KV cache quantization, our most subtle quantization setting.

How Does Token-DiFR Work?

In our paper, we primarily focus on a method to verify samples generated via Gumbel-Max sampling, a common technique for sampling from a distribution and an alternative to the Inverse Probability Transform. We focus on Gumbel-Max sampling because it is used by vLLM. Gumbel-Max sampling is simple - generate random Gumbel noise (given a fixed seed), scale it by the temperature, add it to the logits, and take the argmax.
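As a minimal sketch (the seeding here is illustrative; vLLM’s actual RNG and batching details differ), the whole sampling step can be written as:

```python
import numpy as np

def gumbel_max_sample(logits, temperature, seed):
    # Deterministic Gumbel noise derived from the shared seed: g = -log(-log(U))
    rng = np.random.default_rng(seed)
    u = rng.random(logits.shape[0])
    gumbel = -np.log(-np.log(u))
    # argmax(logits + T * g) selects the same token as sampling from softmax(logits / T)
    return int(np.argmax(logits + temperature * gumbel))

logits = np.array([2.0, 1.5, 0.5, -1.0])
# Same seed, same noise, same token: the basis for seed-synchronized verification
assert gumbel_max_sample(logits, 1.0, seed=42) == gumbel_max_sample(logits, 1.0, seed=42)
```

Because the noise is a pure function of the seed, a verifier that knows the seed can regenerate it exactly and check which token the noised logits select.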

When the provider and verifier use the same sampling seed, they should nearly always pick the same token. When they disagree, there’s a natural score in the case of Gumbel-Max sampling: the logit difference between the verifier’s preferred token and the provider’s claimed token. Larger gaps mean something is different about the model weights, quantization, or implementation. The score can be calculated in these three lines of code:

```python
import torch

# logits: torch.Tensor of shape [vocab_size], the verifier's recomputed logits
# claimed_token: int, provider's claimed token
# gumbel_noise: torch.Tensor of shape [vocab_size], deterministic given the seed
# temperature: float

logits = logits + (gumbel_noise * temperature)
verifier_token = logits.argmax()
logit_diff = logits[verifier_token] - logits[claimed_token]
```

In practice, we clip scores to the 99.9th percentile to reduce the influence of outliers.[2] Token-DiFR can be applied to any sampling process; we discuss verifying other approaches, like speculative decoding, in the paper.
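For instance, the aggregation could be sketched as follows (the example inputs and the clipping implementation are illustrative; only the 99.9th-percentile clipping comes from the text):

```python
import numpy as np

def aggregate_scores(logit_diffs, clip_percentile=99.9):
    """Clip per-token logit-diff scores at a high percentile to tame outliers
    (e.g. infinite diffs from top-p/top-k filtering), then average."""
    scores = np.asarray(logit_diffs, dtype=float)
    cap = np.percentile(scores[np.isfinite(scores)], clip_percentile)
    return float(np.mean(np.minimum(scores, cap)))

# Honest inference: most tokens match exactly (score 0), a few small diffs
honest = [0.0] * 980 + [0.05] * 20
# Quantized or tampered inference: frequent, larger divergences
suspect = [0.0] * 900 + [0.5] * 90 + [np.inf] * 10

assert aggregate_scores(honest) < aggregate_scores(suspect)
```

A verifier would compare this aggregate against thresholds calibrated on a known-good reference run.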

How do I use Token-DiFR?

There are two ways to use Token-DiFR.

Shared Sampling Seed and Sampling Process: The ideal setup is sampling seed synchronization: you know the sampling process used by the provider and can send per-request sampling seeds. This means that any output can be audited post-hoc, which provides strong incentives for honest inference. Our implementation works with unmodified vLLM out of the box, and we recommend that inference providers standardize on a sampling algorithm, since sampling is not a performance bottleneck for standard autoregressive generation.

Unknown Sampling Process: If sampling seed synchronization is not available, you can still do spot checks at temperature zero, which bypasses the random sampling process entirely. We did this in our paper for several Llama-3.1-8B providers. Some (like Groq and SiliconFlow) scored very close to their advertised configuration (bfloat16 and fp8 respectively). Others had higher divergence scores, similar to 4-bit quantization.
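A temperature-zero spot check can be sketched as follows (the toy logits are illustrative; in practice the verifier’s logits come from a local reference implementation):

```python
import numpy as np

def greedy_divergence(verifier_logits, provider_tokens):
    """Fraction of positions where the provider's token differs from the
    verifier's temperature-zero (argmax) choice. verifier_logits: [seq, vocab]."""
    greedy = np.argmax(verifier_logits, axis=-1)
    return float(np.mean(greedy != np.asarray(provider_tokens)))

# Toy example: 4 positions, vocabulary of 3 tokens
logits = np.array([[2.0, 0.1, 0.0],
                   [0.0, 3.0, 0.1],
                   [1.0, 0.2, 2.5],
                   [0.5, 0.4, 0.3]])
assert greedy_divergence(logits, [0, 1, 2, 0]) == 0.0   # matches argmax everywhere
assert greedy_divergence(logits, [0, 1, 2, 1]) == 0.25  # one mismatch in four
```

A near-zero divergence rate is strong evidence the provider is running the claimed model; elevated rates warrant investigation (quantization, a different chat template, or an implementation difference).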

While low divergence gives high confidence in the provider, high divergence doesn’t necessarily indicate poor quality. When investigating these high scores, we found that at least two providers were using an older Llama-3.1-8B chat template that doesn’t match the latest version on HuggingFace. Currently, most inference providers don’t publish which chat template they use. Publishing this (or just returning the tokenized input) would make auditing more straightforward.

Alternatively, we found that simply comparing the average cross-entropy over a set of prompts works well and can be used in the non-zero temperature setting, although it is vulnerable to tampering as it has many degrees of freedom.
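That cross-entropy statistic can be sketched as follows (illustrative; thresholds would be calibrated against known-good baselines such as a local bfloat16 run):

```python
import numpy as np

def mean_cross_entropy(verifier_logits, provider_tokens):
    """Average negative log-probability the verifier assigns to the provider's
    tokens. verifier_logits: [seq, vocab]."""
    logits = np.asarray(verifier_logits, dtype=float)
    # Numerically stable log-softmax
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logps = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    idx = np.arange(len(provider_tokens))
    return float(-logps[idx, provider_tokens].mean())

# With a uniform 2-token distribution, every token costs log(2) nats
uniform = np.zeros((3, 2))
assert abs(mean_cross_entropy(uniform, [0, 1, 0]) - np.log(2)) < 1e-9
```

As the text notes, this single aggregate statistic leaves a provider many degrees of freedom (e.g. retuning the temperature), which is why seed-synchronized Token-DiFR is preferred when available.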

Results from our third-party audit of Llama-3.1-8B providers. Groq and SiliconFlow are close to their advertised configurations (bfloat16 and fp8). Other providers score higher, between the fp8 and 4-bit baselines. Note that high scores do not necessarily indicate poor quality; for example, we found that DeepInfra and Cerebras had high scores due to using an outdated chat template.

API-Only Verification: Running a local reference implementation for large models (like the 1 trillion parameter Kimi K2) can be impractical. Token-DiFR could be used in an API-only setting if inference providers expose an option to return the top-k logits or log-probs for input tokens. A verifier would generate tokens from the tested provider, then ask the official API to return log-probabilities for that same completion. This would make it very simple to verify providers of large models, like Deepseek V3 or Kimi K2, against the official API. We encourage inference providers to add this option; it would make ecosystem-wide verification practical.

Activation-DiFR: Verifying the Forward Pass

Tokens are nice because there’s no communication overhead and they work with existing inference APIs. However, they’re low fidelity and discard a lot of the signal available in the model’s activations. Previous work (TOPLOC) proposed compressing activations using their top-K values. This enables accurate detection of, for example, 4-bit quantization within just 2 output tokens. We additionally propose Activation-DiFR, which uses a random down-projection to compress activations. We find this achieves equivalent detection performance while reducing communication overhead by 25-75% (which is typically on the order of 1-4 bytes per token).
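The down-projection idea can be sketched as follows (dimensions and seeding are illustrative; the paper’s exact construction may differ): both parties project the hidden state with the same seeded random matrix and compare the resulting small fingerprints.

```python
import numpy as np

def fingerprint(activations, out_dim=8, seed=0):
    """Compress a hidden-state vector with a shared seeded random projection.
    Provider and verifier derive the same matrix from the seed."""
    d = activations.shape[-1]
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((d, out_dim)) / np.sqrt(out_dim)
    return activations @ proj

hidden = np.random.default_rng(1).standard_normal(4096)   # honest activation
# Simulate a small activation error, e.g. from quantized weights
tampered = hidden + np.random.default_rng(2).standard_normal(4096) * 0.05

honest_err = np.linalg.norm(fingerprint(hidden) - fingerprint(hidden))
tampered_err = np.linalg.norm(fingerprint(hidden) - fingerprint(tampered))
assert honest_err == 0.0 and tampered_err > 0.0
```

Random projections approximately preserve distances, so a few projected dimensions per token are enough to expose systematic deviations in the forward pass at a fraction of the communication cost of sending full activations.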

There is a caveat: activation fingerprints cannot verify the sampling process. A malicious provider could generate arbitrary tokens and then compute the correct activation fingerprint post-hoc. However, they do verify the forward pass, which represents the vast majority of inference compute and therefore the main economic incentive to cheat. Adding activation fingerprints is a simple step that would make it extremely difficult to tamper with the forward pass without getting caught.

Conclusion

It turns out that verifying LLM inference has a simple and effective solution. We recommend that the community standardize on common sampling implementations, and sampling is simple enough that this should be achievable. Providers who can’t adopt a standard can share their sampling details to enable verification. This will be valuable for both labs monitoring their own infrastructure and customers wanting to trust their API providers.

How To Cite:

@misc{karvonen2025difrinferenceverificationdespite,  
      title={DiFR: Inference Verification Despite Nondeterminism},  
      author={Adam Karvonen and Daniel Reuter and Roy Rinberg and Luke Marks and Adrià Garriga-Alonso and Keri Warr},  
      year={2025},  
      eprint={2511.20621},  
      archivePrefix={arXiv},  
      primaryClass={cs.LG},  
      url={https://arxiv.org/abs/2511.20621},  
}

And please take a look at our concurrent work applying these methods to detecting LLM steganography and model weight exfiltration -- Verifying LLM Inference to Prevent Model Weight Exfiltration (2025).

  1. There is one exception: for Qwen3-30B-A3B, the gap between HuggingFace and vLLM was comparable to FP8 KV-cache quantization. This raises a real question: if two “correct” implementations differ that much, what threshold should you set? In practice, you can choose a reference specification and flag anything that deviates beyond it. The method still works; you’re just deciding what counts as acceptable. 

  2. When top-p or top-k filtering is used, a claimed token might be outside the verifier’s filtered set, resulting in an infinite logit difference. Clipping handles these cases. 

Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

April 13, 2025

Dario Amodei, CEO of Anthropic, recently worried about a world where only 30% of jobs become automated, leading to class tensions between the automated and non-automated. Instead, he predicts that nearly all jobs will be automated simultaneously, putting everyone “in the same boat.” However, based on my experience spanning AI research (including first author papers at COLM / NeurIPS and attending MATS under Neel Nanda), robotics, and hands-on manufacturing (including machining prototype rocket engine parts for Blue Origin and Ursa Major), I see a different near-term future.

Since the GPT-4 release, I’ve evaluated frontier models on a basic manufacturing task, which tests both visual perception and physical reasoning. While Gemini 2.5 Pro recently showed progress on the visual front, all models tested continue to fail significantly on physical reasoning. They still perform terribly overall. Because of this, I think that there will be an interim period where a significant portion of white collar work is automated by AI, with many physical world jobs being largely unaffected.

(Estimated reading time: 7 minutes, 14 minutes with appendix)

The Evaluation

The brass part

My evaluation is simple - I ask for a detailed plan to machine this part using a 3-axis CNC mill and a 2-axis CNC lathe. Although not completely trivial, most machinists in a typical prototype or job shop setting would view executing this as a routine task, involving standard turning and milling techniques across multiple setups. This was certainly much simpler than the average component at both shops I worked at. For context, compare the brass part’s simplicity to the complexity of aerospace hardware like these Blue Origin parts.

Although this part is simple, even frontier models like O1-Pro or Gemini 2.5 Pro consistently make major mistakes. These mistakes can be split into two categories - visual abilities and physical reasoning skills.

Visual Errors

Most Models Have Truly Horrible Visual Abilities: For two years, I’ve observed essentially zero improvement in visual capabilities among models from Anthropic and OpenAI. They always miss obvious features like the flats cut into the round surface or the holes, and even hallucinate nonexistent features such as holes drilled along the part’s length. I have never seen Claude 3.5, Claude 3.7 (thinking and non-thinking), GPT-4.5, GPT-4o, or O1-Pro produce a reasonable description of the part. Without vision abilities, creating a manufacturing plan is completely hopeless.

Interestingly, many of these models also score at or above the level of some human experts on visual reasoning benchmarks like MMMU. That which is easy to measure often doesn’t correlate with real world usefulness.

Gemini 2.5 Pro Makes Significant Vision Progress: I was surprised when I saw Gemini 2.5 make major progress in vision capabilities. On roughly one out of four attempts, it identifies most of the major features without extra hallucinations, though it still always misses subtle details like the corner radius on the milled flats. Some details it captures are genuinely impressive. For example, I have to look closely to identify the two flats positioned exactly 180 degrees apart. However, this improved vision mostly serves to reveal deeper, unresolved issues.

Physical Reasoning Errors

Previously, it was hard to separate visual misunderstandings from deeper physical reasoning problems. Now, even when working from an accurate visual interpretation, Gemini 2.5 still produces machining plans filled with practical mistakes, such as:

Ignoring Rigidity and Chatter: The part is long and slender relative to its diameter. Attempting to machine it with standard techniques, as Gemini often suggests, would likely cause the part to deflect (bend slightly under tool pressure) or vibrate rapidly against the cutting tool (a phenomenon called ‘chatter’). Both issues ruin surface finish and dimensional accuracy. Beginner machinists should instantly realize rigidity is critical for a long, slender part like this. When specifically asked about chatter, Gemini poorly applies textbook solutions like a tailstock in ways that worsen issues like bowing in this long, thin brass part.

Physically Impossible Workholding: Gemini usually proposes ways to clamp the part (workholding) and sequences of operations that are physically impossible. Its most common suggestion is to clamp the part in a fixture (specifically, a collet block), machine some features, then rotate the fixture to machine other features. However, this is physically impossible, as the fixture is blocking these new features. This is obvious if you’re mentally walking through the process or looking at the setup.

Additional details, including a more detailed description of the mistakes AI models make and a reference gold standard plan, are in the Appendix.

My high level impression when reading the response is “someone who can parrot textbook knowledge but doesn’t know what they’re talking about”. The models are very eager to give textbook knowledge about e.g. recommended cutting speeds, but are completely incorrect on important practical details. This matches my conversations with friends and former colleagues in manufacturing and construction: current LLMs are seen as almost entirely useless for the core, hands-on aspects of their jobs.

This Evaluation Only Scratches the Surface

This task of generating a text plan represents one of the easiest parts of the job. Real machining demands managing many details behind every high-level step. Just selecting a cutting tool involves considering tip radius, toolholder collision clearance, cutting tool rigidity, coating, speeds/feeds, and more — often with direct trade-offs, like clearance versus rigidity. Many factors like ensuring tool clearances against the part and fixture are inherently spatial, which is impossible to evaluate fully via text. If models fail this badly on the describable aspects, their grasp of the underlying physical realities is likely far worse.

In fact, getting to the point of being actually useful requires overcoming a hierarchy of challenges, each significantly harder than the last:

  1. Accurate Visual Perception: The foundational step is to correctly identify all geometric features and relationships from the input image. This requires almost no spatial reasoning ability, yet most models still perform terribly. Even Gemini 2.5 Pro is very bad at this, and falls apart if pushed at all.

  2. Basic Physical Plausibility: Beyond just seeing the part, the model must propose operations and setups that are physically possible. This involves basic spatial reasoning to ensure that e.g. tool access isn’t blocked by fixtures. This may require moving beyond a text chain of thought - when I mentally walk through a setup, I don’t use language at all, and instead just visualize every step of the operation.

  3. Incorporating Physical Knowledge: Successfully machining requires understanding real-world physics and tacit knowledge. This is often learned through hands-on experience and is poorly captured in existing datasets.

  4. Process Optimization: Handling the detailed considerations within steps 1-3 is necessary just to yield one part correctly. As Elon Musk likes to say, manufacturing efficiently is 10-100x more challenging than making a prototype. This involves proposing and evaluating multiple feasible approaches, designing effective workholding, selecting appropriate machines, and balancing complex trade-offs between cost, time, simplicity, and quality. This is the part of the job that’s actually challenging.

Steps 2-4 could be challenging to address via synthetic data generated in simulations. Almost all machinists I’ve talked to have (completely valid) complaints about engineers that understand textbook formulas and CAD but don’t understand real world manufacturing constraints. Simulation environments seem likely to create AIs with the same shortcomings.

Why do LLMs struggle with physical tasks?

The obvious reason why LLMs struggle here is a lack of data. Physical tasks like machining rely heavily on tacit knowledge and countless subtle details learned through experience. These nuances aren’t typically documented anywhere.

This isn’t because experts are deliberately withholding secrets - instead, documenting this granular, real-world knowledge is impractical and inefficient. Just as software engineers rarely document their entire reasoning for each line of code, machinists don’t document every consideration for setting up a single part. It’s far quicker and more effective to teach someone adaptable skills through hands-on experience with a mentor instead of learning from a textbook or memorizing procedures.

This highlights a major difference from fields like software engineering or law. Although software engineers or lawyers may not explicitly record every reasoning step, they produce artifacts like code, version control history, and contracts which have very rich and detailed information. In physical tasks, equivalent detailed information certainly exists, but it’s embedded in the 3D world - such as interactions between tools, materials, and physical forces - in formats which are very difficult to digitize effectively.

As a result, LLMs are great at regurgitating some of the textbook knowledge available, but this is very insufficient.

Improving on physical tasks may be difficult

Empirically, frontier models are currently bad at these tasks. Is this just a temporary hurdle that will soon be overcome? I’m not sure, and I have speculative arguments for both why future progress might be difficult and why it might be easier than expected.

One obvious explanation is that LLMs aren’t good at physical tasks because no one has put in much effort yet. However, improving physical world understanding could be challenging. The recipe for improving coding ability has relied on massive amounts of training data and clear reward signals enabling RL and synthetic data. However, this breaks down for physical tasks.

Lack of Verifiable Rewards: Defining a reward signal for complex physical tasks is hard. Defects in a part might show up as slightly increased failure rates years in the future or as rot that appears years after incorrectly applied waterproofing. Feedback loops can be long and outcomes are difficult to measure automatically.

Slow, Expensive, and Dangerous Trial-and-Error: Learning through RL or generating synthetic data could be difficult. I’ve personally caused thousands of dollars in damage due to mistakes in my early years of manufacturing, which is not uncommon. Single mistakes can easily cost hundreds of thousands of dollars in damage. Unlike running buggy code, mistakes with heavy machinery or construction can have severe consequences. I also caused tens of thousands of dollars in lost revenue due to my lower productivity when learning - gaining experience usually requires the use of expensive and limited resources, not just a few GPU hours.

However, there are also reasons why this could be easier than expected:

The Automated AI Researcher: AI is making major progress in coding and AI research and we may reach an automated AI researcher in the near future. Maybe the automated AI researcher will easily be able to solve these challenges by creating much more sample efficient algorithms or large amounts of simulated data.

Synthetic Data: There are obvious approaches that haven’t been well explored. For example, we can create a lot of data using simulations, although there will be a gap between simulation and reality. For this specific manufacturing process (CNC machining), CAM software can accurately simulate most operations. However, there are a ton of diverse manufacturing processes, many of which don’t have good simulation solutions.

Potential Implications of Uneven Automation

If this trend holds, we might face a period where remote work sees significant automation while skilled physical jobs remain largely untouched by AI. This “automation gap window” could last for an unknown duration and carries potential implications:

Class Conflicts: There could very easily be major class conflict between the automated and non-automated professions, especially because there are other underlying differences between these groups. White collar workers are more likely to see displacement, and they typically earn more and have more liberal political beliefs. These differences could exacerbate tensions and lead to major economic pain for automated groups.

Popular Opposition to AI: This could result in popular opposition against further AI research. Groups such as blue collar workers now have evidence that automation can happen really quickly, and they may not want to be automated. This could stall further AI progress and lengthen this window.

Geopolitical Bottlenecks: If most knowledge work is automated, physical capabilities like manufacturing could become the bottleneck in technological progress or defense (e.g., during an AI arms race). Nations such as China, with a much stronger industrial base, could gain a significant strategic advantage.

There’s a lot of uncertainty and tension here. For example, if manufacturing becomes a strategic bottleneck, the government may be able to fend off popular opposition to AI.

Conclusion

While it’s unclear how long this uneven automation gap might persist, its existence seems likely. Surprisingly few in AI research discuss this - perhaps because many in the field aren’t very familiar with manufacturing or other physical-world domains. Anyone working in policy, planning their career, or concerned about social stability should start considering the implications of a partial automation scenario seriously.

Acknowledgements: I am grateful to Neel Nanda, Kevin Liu, and Tim Butler for valuable feedback on this post.


Appendix

Here are links to plans generated by Claude 3.7 Thinking, GPT-4.5, O1-Pro, GPT-4o, o3, and Gemini 2.5 Pro. My plan for machining the part is here. Below I have detailed descriptions of various errors made by the models.

Visual Errors

General Poor Performance (Non-Gemini Models): For nearly two years, models from Anthropic and OpenAI showed essentially zero improvement in visual perception for this task. Reviewing the transcripts, the errors are consistently egregious – missing obvious features like the milled flats or cross-holes, sometimes even hallucinating features like non-existent axial holes. No model tested, apart from Gemini 2.5 Pro, produced anything resembling an accurate description of the part geometry, making any subsequent machining plan fundamentally flawed from the start.

Gemini 2.5 Pro - Vision Progress but Persistent Flaws: Gemini 2.5 Pro represents a significant step forward visually. On roughly one out of four attempts, it identifies most major features without major hallucinations. However, it still makes consistent, if more subtle, visual mistakes:

Missed Details: It consistently fails to identify the corner radius on the milled flats. Recognizing these radii is critical for selecting the correct end mill and becomes second nature to machinists, as it’s a factor in nearly every tool selection.

Occasional Hallucinations/Misinterpretations: It sometimes hallucinates features like a slot on one end of the part (which isn’t present) or logically inconsistent features, such as suggesting two sets of threaded cross-holes – impossible given the part’s small diameter due to insufficient material for thread engagement.

Inconsistent Feature Identification: It occasionally misses one of the two flats, despite correctly identifying their 180-degree opposition in other attempts (a detail which, to be fair, requires close inspection). It will also sometimes miss some of the existing holes.

Physical Reasoning Errors

Gemini’s plan is the only one worth reviewing, as it doesn’t have the major confounder of poor vision abilities. Other models have many egregious errors in every generated plan, but it’s difficult to distinguish whether this is due to a misunderstanding of the part or poor physical reasoning abilities. Even when working from a relatively accurate visual interpretation, Gemini 2.5 Pro’s plans are filled with significant flaws in practical physical reasoning and tacit knowledge.

Poorly Applied Textbook Solutions: Ignoring Rigidity and Chatter

Failure to Identify Risk: Gemini 2.5 consistently fails to recognize that the part, with a length-to-diameter ratio clearly exceeding 10:1 (based on provided dimensions in the prompt), is highly susceptible to chatter and deflection. The common heuristic (L:D > 3:1 requires special attention) seems completely missed. This should be an obvious red flag, yet surprisingly it’s ignored, undermining many of its proposed operations.

When explicitly prompted about chatter concerns for a “long slender part,” Gemini applies textbook solutions poorly. It suggests using a tailstock, common for long parts. However, for this specific small-diameter brass part, a tailstock is often not the best approach. There’s minimal area for the tailstock center, and the required axial pressure can easily cause the slender part to bow, like a stiff spaghetti noodle buckling under axial pressure, especially in soft brass. Chatter and deflection are also still a concern. It will probably involve trial and error to get the process to work.

Gemini’s lack of physical grounding is further exposed by its plan to Turn Small Diameter 1 (Near Collet). This step implies cutting a 90-degree internal shoulder directly adjacent to the collet face. In practice, this is physically impossible with standard turning tools, which can’t reach into that corner due to tool geometry. Machinists typically address this with a grooving tool, which is specifically designed to cut square shoulders, or a back turning tool, which approaches the feature from behind to reach the back-facing surface. Both approaches introduce real-world trade-offs. Grooving tools can generate higher tool pressure, which risks deflection and chatter. Back turning tools, on the other hand, often require additional part stick-out to clear the spindle and may introduce blending issues.

Missing the Practical Solution: The standard, effective solution for a small, slender part like this often runs counter to basic textbook advice. Instead of multiple light passes, you might start with significantly oversized stock (e.g., 3/4” diameter for this 3/16” part) and machine the final diameter in a single, deep pass. This keeps the part supported by bulk material during the cut, maintaining a low effective L:D ratio and preventing chatter/bowing. It’s a common technique taught by mentors that I have used several times. However, it directly contradicts typical guidelines, and Gemini fails to consider it.

Physically Impossible Workholding: The part has features needing milling at different locations and orientations: the flats/threaded holes near one end, and an orthogonal cross-hole near the opposite end. Gemini often suggests using a collet block in a mill vise. This may cause chatter problems due to the length of the unsupported part, but there are worse problems as well. Its typical sequence involves clamping one end, machining the flats/holes, then rotating the entire collet block 90 degrees (“Place the collet block back in the vise, but rotated 90 degrees…”) to machine the cross-hole. This simply doesn’t work.

The collet block is still clamping the very end of the part where the cross-hole needs to be drilled. Rotating the fixture doesn’t grant access; the fixture itself physically obstructs the tool path. This kind of spatial reasoning failure should be immediately obvious when mentally walking through the setup.

Underestimating Drilling Difficulty: When planning the orthogonal cross-hole, Gemini sometimes lists spot drilling as optional (Spot Drill (Optional)). For drilling such a small hole on a small-diameter, curved surface, a spot drill is arguably essential to prevent the drill from “walking” off-center and potentially breaking. Calling it optional is not a good idea.

Ignoring Collision Risks: The cross-hole is located very close to a larger diameter shoulder. Gemini often suggests standard spot drill sizes (e.g., ¼ or ½ inch). Given the tight clearance, a standard spot drill would likely collide with the shoulder before its tip properly centers the hole. This is a common “gotcha” – I’ve run into this exact issue myself. It requires careful tool selection and clearance checking.

While a collision here might just scrap the part, because a small brass part is not very robust, this type of error (failing to check tool/holder clearance) can cause catastrophic damage with larger parts or harder materials. I’ve observed several bad crashes (and caused a couple myself) that have required major repairs due to this exact error.

Incorrect Work Reference (Z0 on Raw Stock): In plans where the part isn’t fully cut off in Op 1, Gemini sometimes suggests setting the Z0 reference for Op 2 on the raw end of the bar stock still attached to the part. Referencing raw stock introduces unacceptable inaccuracy. Precise machining requires setting Z0 against a known, previously machined surface to ensure features are located correctly relative to each other.

Ignoring Delicate Threads: Conversely, when recommending a part-off in Op 1, Gemini fails to account for the delicate brass threads. Its plans imply letting the finished part drop freely after cutoff. This risks damaging the fine threads upon impact. The standard, safer method for such parts involves leaving a small nub and breaking the part off manually, or using a parts catcher, to protect fragile features. Gemini will also recommend clamping on the threads with shim stock to protect them, but these brass threads are very fragile and easily damaged, so this is not a good idea either.

Using an LLM perplexity filter to detect weight exfiltration

2024-07-21 · https://adamkarvonen.github.io/machine_learning/2024/07/21/weight-exfiltration

A recent area of focus has been securing AI model weights. If the weights are located in a data center and an adversary wants to obtain model weights, the weights have to leave physically (such as a hard drive going out the front door) or through the data center’s internet connection. If the facility has perfect physical security, then the weights have to leave through the internet connection. Recently, there has been discussion on how to make model weight exfiltration more difficult, such as Ryan Greenblatt’s proposal for upload limits.

A key factor enabling this is that the critical data we want to protect (model weights) are very large files. Current models can have trillions of parameters, which translates to terabytes of data. Ryan calculated that the total amount of outgoing data from a data center during a model’s lifetime could be within a few orders of magnitude of the size of the model itself. This post proposes an additional layer of security that could significantly enhance the effectiveness of upload limits: perplexity filtering using a small language model.

Perplexity, in the context of language models, is a measure of how surprised a model is by a given sequence of text. More formally, it’s the exponential of the cross-entropy of the text under the model. In simpler terms, if a piece of text has low perplexity, it means the model finds it very predictable or natural; high perplexity indicates that the text is unexpected or unnatural to the model. By using perplexity as a filter, we can potentially distinguish between normal model outputs (which should have relatively low perplexity) and attempts to exfiltrate model weights encoded as text (which would likely have high perplexity).
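As a concrete illustration, perplexity can be computed directly from per-token log-probabilities. This sketch is mine, not from the post:

```python
import math

def perplexity(token_logprobs):
    """Perplexity is the exponential of the average negative
    log-probability the model assigns to each token."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.5 to every token is as
# "surprised" as choosing uniformly between 2 options: perplexity 2.
natural_text = [math.log(0.5)] * 10
print(perplexity(natural_text))  # ≈ 2.0

# Encoded weights would look like random bytes, getting far lower
# probability per token (say 0.01), and thus far higher perplexity.
encoded_blob = [math.log(0.01)] * 10
print(perplexity(encoded_blob))  # ≈ 100.0
```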

The core idea is to use a small language model as a perplexity filter on all outgoing data from the inference servers. Here’s how it would work:

All outgoing data is treated as language tokens. A small language model (much smaller than the main model being protected) is used to calculate the perplexity of these tokens. This filtering can potentially be performed offline to avoid adding latency to the system’s responses. Data with high perplexity (i.e., data that doesn’t look like natural language to the small model) is blocked or flagged.
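A minimal sketch of that pipeline follows; the threshold value and the `score_perplexity` callable are hypothetical stand-ins for a tuned cutoff and the small reference model:

```python
PERPLEXITY_THRESHOLD = 500.0  # hypothetical; would be tuned on normal traffic

def filter_outgoing(chunks, score_perplexity):
    """Partition outgoing chunks: pass those the small model finds
    predictable as language, flag the rest for blocking or review."""
    allowed, flagged = [], []
    for chunk in chunks:
        if score_perplexity(chunk) <= PERPLEXITY_THRESHOLD:
            allowed.append(chunk)
        else:
            flagged.append(chunk)
    return allowed, flagged

# Stub scorer standing in for a real small LM:
def stub_score(text):
    looks_like_language = text.isascii() and " " in text
    return 20.0 if looks_like_language else 1e6

ok, suspicious = filter_outgoing(
    ["The weather is nice today.", "x7Qz$$kk0#a9"], stub_score)
```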

Why This Could Work

  • Scale Advantage: The weights of large models are on the order of terabytes. Encoding this much data in a way that appears as natural language would be extremely challenging. 
  • Compression Resistance: Unlike simple encryption or obfuscation techniques, it’s difficult to compress model weights in a way that would consistently produce low-perplexity output.
  • Low Overhead: The computational cost of running a small language model as a filter is negligible compared to the resources required for the main model’s inference. 
  • Complementary to Upload Limits: This approach could significantly increase the effectiveness of upload limits, either increasing the allowed upload limit or decreasing the number of model weights that can be exfiltrated per hour.

If the amount of outgoing data is within a few orders of magnitude of the model size itself, it’s possible that a few tricks like this could make it almost impossible to exfiltrate model weights before the model is deprecated (assuming that the data center has perfect physical security). It appears that this perplexity filter would stack with Ryan’s proposed text compression scheme.

We don’t even have to filter all outgoing data. If there was 1 gigabyte of unfiltered bandwidth per day, it would take 2,000 days to transfer GPT-4’s 2 terabytes of weights out (although this could be reduced by compression schemes). This could also enable tolerating a very high false positive rate on the perplexity filter. 100 megabytes of flagged outputs per day is no problem, as that would require 20,000 days to transfer 2 terabytes of data.
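The back-of-envelope numbers above are easy to check, taking the 2 terabyte weight size assumed in this post:

```python
TB, GB, MB = 1e12, 1e9, 1e6

weights = 2 * TB              # assumed weight size (GPT-4 estimate above)
unfiltered_per_day = 1 * GB   # hypothetical unfiltered bandwidth budget
flagged_per_day = 100 * MB    # tolerated flagged (false-positive) traffic

print(weights / unfiltered_per_day)  # → 2000.0 (days)
print(weights / flagged_per_day)     # → 20000.0 (days)
```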

Image and audio outputs could be more challenging. A potential solution is adding weak models to compress / decompress audio and images. These models could both reduce bandwidth and potentially act as a filter, similar to the perplexity filter for text.

Thanks to Davis Brown for discussion and feedback, and in particular, the suggestion to perform this step offline to reduce latency. For public discussion, refer to LessWrong.

Evaluating Sparse Autoencoders with Board Games

2024-06-12 · https://adamkarvonen.github.io/machine_learning/2024/06/12/sae-board-game-eval

This blog post discusses a collaborative research paper on sparse autoencoders (SAEs), specifically focusing on SAE evaluations and a new training method we call p-annealing. As the first author, I primarily contributed to the evaluation portion of our work. The views expressed here are my own and do not necessarily reflect the perspectives of my co-authors. You can access our full paper here.

Key Results

In our research on evaluating Sparse Autoencoders (SAEs) using board games, we had several key findings:

  • We developed two new metrics for evaluating Sparse Autoencoders (SAEs) in the context of board games: board reconstruction and coverage.

  • These metrics can measure differences between SAE training approaches that are invisible to existing metrics.

  • These metrics allow for meaningful comparisons between different SAE architectures and training methods, potentially informing SAE design for more complex domains like language models.

  • We introduce p-annealing, a new SAE training method that improves over prior methods on both existing metrics and our new metrics.

  • SAEs trained on ChessGPT and OthelloGPT can capture a substantial fraction of the model’s board state, with F1 scores of 0.85 and 0.95 respectively for board reconstruction.

  • However, SAEs do not match the performance of linear probes, suggesting they may not capture all of the model’s board state information or “world model”.

Challenges with SAE Evaluations

For an explanation of sparse autoencoders, refer to my Introduction to SAEs.

Sparse Autoencoders (SAEs) have recently become popular for interpretability of machine learning models. Using SAEs, we can begin to break down a model’s computation into understandable components. As a result, there has been a flurry of new SAE architectures and loss functions, such as the BatchTopK SAE, Google Deepmind’s Gated SAE and JumpReLU SAE, our p-annealing, and OpenAI’s TopK SAE.

Unfortunately, we don’t have reliable metrics that we can use to compare the new approaches. The main metric currently used is “we looked at activating inputs for a range of features and gave a gut reaction on interpretability of the features”. This is a major limitation for the field.

In machine learning, it’s ideal to have an objective evaluation of your model, such as accuracy on the MNIST benchmark. With an objective evaluation, you can just twiddle all the knobs of architectures, hyperparameters, loss functions, etc, and see which knobs make the number go up. When measuring the interpretability of SAEs trained on language models, there is no underlying ground truth that we know how to measure. We do have some proxy metrics, such as L0 (a measure of SAE sparsity) and loss recovered (a measure of SAE reconstruction fidelity), that seem to have some correlation with interpretability. However, they are only proxies for the thing we care about and can sometimes be inversely correlated with interpretability.

Example of subjective interpretability
An example of subjective interpretability from Anthropic's Scaling Monosemanticity. In this case, it looks like the feature activates on phrases related to transit infrastructure.

As a result, we primarily use noisy, time-consuming subjective evaluations. When crowd workers subjectively evaluated the interpretability of Deepmind’s Gated SAE, the results were not statistically significant. It’s hard to say whether this is due to the inherent noisiness of our evaluation methods or if it points to some limitation of the Gated SAE architecture itself. Interpretability also isn’t all we care about, as important parts of model cognition may not be easily interpretable.

There have been some recent examples of different natural language evaluations. In this case, across various SAEs Anthropic examined how many elements from the periodic table had corresponding SAE features. While interesting, there are obvious limitations. Periodic table elements are a very limited subset of natural language, and it’s challenging to robustly measure natural language concepts. String matching has trouble differentiating different uses of the word “lead”, which can be an element or a verb. It’s even more difficult to measure abstract concepts not closely tied to a single word.

While we don’t know how to measure the underlying ground truth of natural language, board games have a measurable ground truth and are still reasonably complex. We used Chess and Othello as testbeds for SAEs with two questions in mind.

  1. How can we measure and compare different SAEs?
  2. What fraction of the model’s “world model” or board state do the SAEs capture?

All code, datasets, and models for these evaluations have been open sourced at github.com/adamkarvonen/SAE_BoardGameEval.

Sparse Autoencoder Coverage Metric

Our first metric is called coverage. We created measurable units of the board, which we called Board State Properties (BSPs). We defined ~1000 BSPs, including low-level details like “Is one of my knights on F3?” and high-level concepts like “Is there a pinned piece on the board?”. Because these are measurable with code, we could automatically find thousands of interpretable features without any manual inspection. For example, we found the “en passant capture available” feature shown below. It’s very impressive that SAEs, which are unsupervised, manage to find these interesting concepts.

En passant feature
An example "en passant capture available" feature. Notice that it only fires twice on space characters in the PGN string, both times when the current player had an en passant capture available.

To calculate the coverage metric, we first find the best classifying feature for each possible BSP, as measured by the F1 score. We then average the F1 scores of these best features, as shown below.

Coverage demonstration
A demonstration of calculating coverage for all low-level Chess BSPs.
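In code, the calculation might look like the following sketch (an illustrative toy version with hand-made data, not the paper's implementation):

```python
def f1_score(pred, truth):
    """F1 for boolean sequences: pred[i] = feature fired on token i,
    truth[i] = the board state property (BSP) held on token i."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(t and not p for p, t in zip(pred, truth))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def coverage(features, bsps):
    """For each BSP, take the F1 of its best-classifying SAE feature,
    then average those best F1 scores over all BSPs."""
    return sum(max(f1_score(f, b) for f in features) for b in bsps) / len(bsps)

features = [[1, 0, 1, 0], [0, 1, 0, 0]]  # per-token firings, one list per feature
bsps = [[1, 0, 1, 0],                    # perfectly matched by feature 0 (F1 = 1.0)
        [1, 1, 0, 0]]                    # only partially matched (best F1 = 2/3)
print(coverage(features, bsps))          # ≈ 0.833
```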

If the SAE has features that directly correspond to individual Chess or Othello concepts, the average best F1 score will be high. On the other hand, if the SAE mainly captures combinations of concepts, the average best F1 score will be lower, as no single feature will be a good classifier for individual square states. Thus, the coverage metric serves as a proxy for monosemanticity or the quality of SAE features. Note that without a good scientific understanding of what’s happening inside transformer models, it isn’t clear what the maximum coverage score should be. It’s possible that an SAE faithfully reconstructing model representations should achieve a coverage significantly below 1.0.

The following table contains the best coverage scores obtained at layer 6 by any SAE. As baselines, we test the exact same approach on layer 6 MLP activations (note that this comparison, with no SAE, was done after paper submission), and on SAEs trained on versions of ChessGPT and OthelloGPT with randomly initialized weights.

Approach        | ChessGPT | OthelloGPT
SAE RandomGPT   | 0.11     | 0.27
MLP Activations | 0.26     | 0.53
SAE TrainedGPT  | 0.45     | 0.52

SAEs on the trained models substantially outperform the random model baseline, indicating that they capture meaningful information. Surprisingly, MLP activations on OthelloGPT perform quite well, suggesting that some board state information is directly encoded in MLP neuron activations.

Sparse Autoencoder Board Reconstruction Metric

The coverage metric provides insight into the quality of SAE features, but it doesn’t consider the breadth of knowledge captured by the SAE. When we look at large language models predicting the next token in text, it’s often unclear what the complete underlying world state is. It’s challenging to measure exactly what the model ‘knows’ at each step of the prediction process. However, in games like Chess and Othello, we have a clear, measurable world state at every token. With this in mind, we developed the Board Reconstruction Metric. The key question we’re asking is: Can we completely reconstruct the board state from model activations using a Sparse Autoencoder? [1]

An important question is which assumptions we make about what SAE features should mean. There had been prior work on applying SAEs to OthelloGPT. In Othello, there are 64 squares and 3 possible states for each square (Black, White, and Empty), or 192 (64 x 3) possible square states. The author had looked for individual features that were accurate classifiers with both high precision and recall for an individual square state. Using this approach, they found classifiers for only 33 of the square states.

We instead looked for features that had at least 95% precision (an arbitrary threshold) for square states, without necessarily having high recall. That is, if the feature was active, the square state is present. This was motivated in part by studies on chess players showing that chess experts excel at remembering realistic board configurations, but not random piece placements. This suggests experts (and potentially AI models) encode board states as meaningful patterns rather than individual square occupancies.

To identify high-precision features, we analyzed how each feature’s activation corresponds to board states. Our approach was as follows:

  1. We determined each feature’s maximum activation value across 256,000 tokens.
  2. We set 10 threshold values per feature, from 0% to 90% of its maximum activation.
  3. For each threshold, we identified all high precision features.
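For a single feature, these three steps might look like this sketch (the function, state names, and data are illustrative, not the paper's code):

```python
def high_precision_states(acts, states, precision_min=0.95):
    """acts: this feature's activation value on each token.
    states: {state_name: per-token booleans, True when the state holds}.
    Returns {threshold_fraction: [states classified with >=95% precision]}."""
    max_act = max(acts)                        # step 1: maximum activation
    out = {}
    for frac in [i / 10 for i in range(10)]:   # step 2: thresholds 0%..90%
        fired = [a > frac * max_act for a in acts]
        n_fired = sum(fired)
        if n_fired == 0:
            continue
        # step 3: precision of "feature fires => state present"
        hits = [name for name, present in states.items()
                if sum(f and p for f, p in zip(fired, present)) / n_fired
                >= precision_min]
        if hits:
            out[frac] = hits
    return out

acts = [9.0, 0.0, 10.0, 0.5]
states = {"white_knight_f3": [True, False, True, True],
          "black_pawn_e5":   [True, False, False, False]}
result = high_precision_states(acts, states)
# "white_knight_f3" is present every time the feature fires, at every
# threshold, so it qualifies; "black_pawn_e5" never reaches 95% precision.
```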

As an example, let’s look at the probability that a black pawn is on every square for SAE feature 172 at threshold 20% [2][3]. This example analysis was performed on a set of 1,000 “train” games. As we can see, the feature is not high precision for any square, and there is a broad distribution over squares.

A diagram of black pawn positions for feature 172 at threshold 20%.

On the other hand, there is a 98% chance that a White Knight is on F3 any time the feature is active at all.

A diagram of white knight positions for feature 172 at threshold 0%.

A common finding is that SAE activations become more interpretable at higher values. When we increase the threshold to 20%, there is a 100% chance that a White Knight is on F3.

A diagram of white knight positions for feature 172 at threshold 20%.

This increasing certainty happens at different rates for different piece types. For example, at threshold 0% there is a 79% chance that there’s a black bishop on G4.

A diagram of black bishop positions for feature 172 at threshold 0%.

If we increase the threshold to 50%, then the likelihood of a black bishop being on G4 increases to 98%, meaning that feature 172 is high precision for two separate pieces at a threshold of 50%.

A diagram of black bishop positions for feature 172 at threshold 50%.

What can we do with this information? We can count the number of High Precision Classifier (HPC) features that classify a square with over 95% precision (an arbitrary precision threshold) at every activation threshold, but that doesn’t tell us how much of the model’s board state information is captured. As a proxy for recall, we can use our SAE’s HPC features to reconstruct the chess board on an unseen set of “test” games. We calculate the F1 score at every threshold value, and report the maximum score obtained. The best threshold is typically 0%, 10%, or 20%.

Board Reconstruction demonstration
A demonstration of calculating board reconstruction for all low-level Chess BSPs.
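A sketch of the reconstruction step, using illustrative data structures (the real implementation operates on full activation tensors):

```python
def reconstruct_board(test_acts, max_acts, hpc_map, frac=0.0):
    """test_acts: {feature_name: per-token activations on test games}.
    max_acts: each feature's maximum activation from the train games.
    hpc_map: {feature_name: [square states it classifies with high precision]}.
    A state is predicted present on a token whenever any of its
    high-precision features fires above the threshold."""
    n_tokens = len(next(iter(test_acts.values())))
    predicted = [set() for _ in range(n_tokens)]
    for feat, states in hpc_map.items():
        threshold = frac * max_acts[feat]
        for i, act in enumerate(test_acts[feat]):
            if act > threshold:
                predicted[i].update(states)
    return predicted

test_acts = {"feature_172": [8.0, 0.0, 3.0]}
max_acts = {"feature_172": 10.0}
hpc_map = {"feature_172": ["white_knight_f3"]}
boards = reconstruct_board(test_acts, max_acts, hpc_map, frac=0.2)
# The feature fires on tokens 0 and 2 (activation > 2.0), so the white
# knight on F3 is predicted there; the F1 score is then computed against
# the true board states of the test games.
```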

In our paper, we call this metric board reconstruction. The following table contains the best board reconstruction score obtained across all SAEs trained on layer 6 of ChessGPT and OthelloGPT, in addition to previously mentioned baselines. We also compare to linear probes trained on layer 6.

Approach        | ChessGPT | OthelloGPT
SAE RandomGPT   | 0.01     | 0.08
MLP Activations | 0.56     | 0.82
SAE TrainedGPT  | 0.85     | 0.95
Linear Probe    | 0.98     | 0.99

SAEs on the trained model substantially outperform SAEs trained on the randomly initialized models, indicating that the approach captures genuine model board state information, but they do not match the performance of linear probes. This suggests that current SAE techniques may not capture all of the model’s board state. The approach works surprisingly well on MLP activations, although SAEs perform better.

We also apply this approach to reconstructing high-level chess board state features. It works well for some, such as if an en passant capture is available (F1 score 0.92), and worse for others, such as if a pinned piece is on the board (F1 score 0.20).

P-Annealing: A New SAE Training Approach

We developed our metrics with the purpose of measuring progress in SAE training methods. These metrics allowed us to evaluate a new SAE training method we propose called p-annealing, which aims to address some fundamental challenges in training sparse autoencoders.

Ideally, we want our SAEs to be sparse as measured by L0 (the number of non-zero elements), but L0 is not differentiable and thus can’t be directly optimized. Traditionally, we instead train SAEs with the L1 loss as a differentiable proxy for sparsity. However, this approach leads to issues such as feature shrinkage.

P-annealing addresses this issue by leveraging nonconvex Lp minimization, where p < 1. “Nonconvex” here means the optimization landscape may contain multiple local minima or saddle points, so simple gradient optimization may get stuck in non-optimal solutions. We start training using convex L1 minimization (p = 1), which is easier to optimize without getting stuck in local optima. We gradually decrease p during training, yielding closer approximations of the true sparsity measure, L0, as p approaches 0.
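A sketch of the penalty and schedule follows; the linear schedule and the end value of p are my illustrative assumptions, not the paper's exact hyperparameters:

```python
def lp_penalty(latents, p, eps=1e-8):
    """sum_i |z_i|^p over the SAE latents. p = 1 recovers the usual L1
    penalty; as p -> 0, |z|^p -> 1 for any nonzero z, so the sum
    approaches L0, the count of active latents."""
    return sum((abs(z) + eps) ** p for z in latents)

def anneal_p(step, total_steps, p_start=1.0, p_end=0.2):
    """Illustrative linear schedule: begin with convex L1 (p = 1),
    then lower p so the penalty tracks L0 more closely."""
    t = min(step / total_steps, 1.0)
    return p_start + t * (p_end - p_start)

z = [0.5, 0.0, 2.0]
print(lp_penalty(z, p=1.0))   # ≈ 2.5, the L1 norm
print(lp_penalty(z, p=0.1))   # closer to 2, the L0 count
print(anneal_p(0, 1000))      # → 1.0 at the start of training
```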

In our evaluations using the board reconstruction and coverage metrics, we found that p-annealing led to significant improvements in SAE performance, which we’ll discuss in detail in the following section.

Comparing SAEs

In our evaluation, we compared four SAE types using our metrics: the standard SAE, the gated SAE, a standard SAE trained with p-annealing, and a gated SAE trained with p-annealing. Typically, the elbow in the top left corner is the Pareto optimal range of the curve. Our findings show that all three non-standard SAEs achieve Pareto improvements on the L0 / Loss Recovered curve compared to the standard SAE.

The best coverage performance (the brightest color) is in the Pareto optimal elbow of the frontier, aligning with our existing understanding of proxy metrics. However, this presents a new challenge: with three different SAE approaches showing similar improvements, how can we differentiate between them?

Chess 3var plot
This 3 variable plot has the proxy metrics of L0 on the x-axis and Loss recovered on the Y axis, while color corresponds to the coverage score for Chess low-level BSPs. We differentiate between training methods with shapes. Every point is an SAE trained with different hyperparameters. Note that TopK was not included in the original paper.

Using our metrics, we can clearly differentiate between training methods and measure progress that is invisible to existing metrics. In this case, we typically see the best performance from SAEs trained with p-annealing, even though their performance is very similar to gated SAEs under proxy metrics. There are also parallel lines within training methods, representing SAEs trained with different expansion factors. These differences are also hidden within existing metrics.

Chess 2var plot
In this scatter plot, we have L0 on the x-axis and coverage for Chess low-level BSPs on the y axis. Note that TopK was not included in the original paper.

Limitations

The internals of ML models are still poorly understood. Thus, it isn’t clear if our metrics correspond to the “true model ground truth features”, whatever that means. It isn’t clear what the maximum coverage score should be. The field of interpretability is relatively new, and it isn’t clear what the ideal version of these metrics should be. However, I am confident that this is an important question to investigate.

We do not capture everything that ChessGPT is doing. Instead, we attempt to measure something that we believe should be present (the state of the board). For high level features like “a fork is present”, it isn’t as clear if ChessGPT actually represents this internally.

In addition, lessons learned from board game models may not transfer to language models. It would be ideal to go straight to language models, but that direction is much less tractable. Thus, an important next step is to investigate if lessons learned here (such as optimal SAE architectures) transfer to language models.

In addition, we find that SAE features fire with high precision for some board states. Although we have qualitatively inspected some features, we haven’t quantitatively investigated how interpretable these features are, or how to quantify interpretability in this case. It would be ideal to integrate interpretability into our analysis.

Implications and Future Work

Interpretability research on restricted domains with a measurable ground truth may enable quantitative comparisons of different approaches that transfer to natural language. In particular, Chess is closer to the complexity of natural language than Othello. The games are generated by real humans instead of being randomly generated, which means there is some sort of theory of mind that can be modeled. We know that ChessGPT already estimates the skill level of players involved. In Chess, tokens are characters, and the model has to combine multiple characters in a semantic unit. In OthelloGPT, a single token represents a single square. In addition, Chess has some concepts at different levels of sparsity, such as check (which is common) and en passant (which is rare).

In general, understanding a model is much more tractable when there is an underlying measurable ground truth. We may be able to use ChessGPT to understand topics analogous to natural language, such as how ChessGPT combines multiple characters into a single semantic unit of a piece being moved. However, it will be important to check if lessons learned here transfer to natural language.

If interested in discussion or collaboration, feel free to contact me via email. I am currently interested in developing evaluations for SAEs trained on language models and doing further reverse-engineering of board game models.

On a personal note, I am currently open to job opportunities. If you found this post interesting and think I could be a good fit for your team, feel free to reach out via email or LinkedIn.

Appendix

Implementation details

For the game of Othello, we classify the board as (Mine, Yours, Empty), rather than (Black, White, Empty), following earlier Chess and Othello work. In Chess, we measure the board state at the location of every “.” in the PGN string, where it is White’s turn to move. Some characters in the PGN string contain little board state information as measured by linear probes, and there is not a clear ground truth board state part way through a move (e.g. the “f” in “Nf3”). We ignore the blank squares when measuring coverage and board reconstruction.

When measuring chess piece locations, we do not measure pieces on their initial starting location, as this correlates with position in the PGN string. An SAE trained on residual stream activations after the first layer of the chess model (which contains very little board state information as measured by linear probes) obtains a board reconstruction F1-score of 0.01 in this setting. If we also measure pieces on their initial starting location, the layer 1 SAE’s F1-score increases to 0.52, as the board can be mostly reconstructed in early game positions purely from the token’s location in the PGN string. Masking the initial board state and blank squares decreases the F1-score of the linear probe from 0.99 to 0.98.

Our SAEs were trained on the residual stream after each layer for 300 million tokens.

Per Layer Performance

We compared SAEs and MLP activations on the tasks of reconstructing all the Chess board, Othello board, and the locations of all valid Othello moves. We selected a high-performing SAE on layer 6, and then used that architecture and hyperparameter selection for training on all other layers. The hyperparameters are probably not optimal for other layers. Most SAEs looked reasonable using proxy metrics, but the layer 3 Chess SAE had less than 100 alive features, leading to very poor performance on our metrics.

It’s notable that the trend of board state information per layer matches across linear probes, SAEs, and MLP activations in both Chess and Othello, indicating that this approach is probably finding something real.

Chess board reconstruction per layer

Othello board reconstruction per layer

Othello valid moves reconstruction per layer

  1. It is natural to wonder if it’s a safe assumption that we should be able to recover the board state from a ChessGPT or OthelloGPT model. I have three arguments:

    1. Using linear probes on ChessGPT and OthelloGPT, we can recover over 99% of the board state.

    2. Linear probes are trained with supervision, and may have a correlational, rather than causal relationship with model internals. However, linear probe derived vectors can be used for accurate causal interventions.

    3. There are 10^58 possible Othello games and more possible games of Chess than atoms in the universe. ChessGPT has a legal move rate of 99.8%, and OthelloGPT has a legal move rate of 99.99%. It’s plausible that it’s only possible to achieve this legal move rate by tracking the state of the board.

    I don’t have strong guarantees that ChessGPT and OthelloGPT actually track board state, but it seems like a reasonable assumption. 

  2. Note that we do not measure pieces on the initial starting position. See Implementation Details in the Appendix. 

  3. The 20% threshold represents a real-valued activation over 20% of its recorded maximum activation. If the maximum activation was 10.0, it would include any value over 2.0. 

An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability

2024-06-11 · https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions

Sparse Autoencoders (SAEs) have recently become popular for interpretability of machine learning models (although sparse dictionary learning has been around since 1997). Machine learning models and LLMs are becoming more powerful and useful, but they are still black boxes, and we don’t understand how they do the things that they are capable of. It seems like it would be useful if we could understand how they work.

Using SAEs, we can begin to break down a model’s computation into understandable components. There are several existing explanations of SAEs, and I wanted to create a brief writeup from a different angle with an intuitive explanation of how they work.

Challenges with interpretability

The most natural components of a neural network are its individual neurons. Unfortunately, individual neurons do not conveniently correspond to single concepts. For example, one neuron in a language model corresponded to academic citations, English dialogue, HTTP requests, and Korean text. This is a concept called superposition, where concepts in a neural network are represented by combinations of neurons.

This may occur because many variables existing in the world are naturally sparse. For example, the birthplace of an individual celebrity may come up in less than one in a billion training tokens, yet modern LLMs will learn this fact and an extraordinary amount of other facts about the world. Superposition may emerge because there are more individual facts and concepts in the training data than neurons in the model.

Sparse autoencoders have recently gained popularity as a technique to break neural networks down into understandable components. SAEs were inspired by the sparse coding hypothesis in neuroscience. Interestingly, SAEs are one of the most promising tools to interpret artificial neural networks. SAEs are similar to a standard autoencoder.

A regular autoencoder is a neural network designed to compress and then reconstruct its input data. For example, it may receive a 100 dimensional vector (a list of 100 numbers) as input, feed this input through an encoder layer to compress the input to a 50 dimensional vector, and then feed the compressed encoded representation through the decoder to produce a 100 dimensional output vector. The reconstruction is typically imperfect because the compression makes the task challenging.
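The compress-then-reconstruct structure described above can be sketched in a few lines of PyTorch. This is a minimal illustration using the hypothetical 100-to-50 dimensions from the text, not a reference to any particular implementation:

```python
import torch
import torch.nn as nn

# Minimal sketch of a standard autoencoder: compress a 100-dim input
# to 50 dims, then reconstruct (imperfectly) back to 100 dims.
class StandardAutoEncoder(nn.Module):
    def __init__(self, input_dim: int = 100, hidden_dim: int = 50):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        compressed = self.encoder(x)     # shape (..., 50)
        return self.decoder(compressed)  # shape (..., 100), imperfect reconstruction
```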

Diagram of a standard autoencoder
Diagram of a standard autoencoder with a 1x4 input vector, 1x2 intermediate state vector, and 1x4 output vector. The cell colors indicate activation value. The output is an imperfect reconstruction of the input.

Sparse Autoencoder Explanation

How Sparse Autoencoders Work

A sparse autoencoder transforms the input vector into an intermediate vector, which can be of higher, equal, or lower dimension compared to the input. When applied to LLMs, the intermediate vector’s dimension is typically larger than the input’s. In that case, without additional constraints the task is trivial, and the SAE could use the identity matrix to perfectly reconstruct the input without telling us anything interesting. As an additional constraint, we add a sparsity penalty to the training loss, which incentivizes the SAE to create a sparse intermediate vector. For example, we could expand the 100 dimensional input into a 200 dimensional encoded representation vector, and we could train the SAE to only have ~20 nonzero elements in the encoded representation.

Diagram of a sparse autoencoder
Diagram of a sparse autoencoder. Note that the intermediate activations are sparse, with only 2 nonzero values.

We apply SAEs to the intermediate activations within neural networks, which can be composed of many layers. During a forward pass, there are intermediate activations within and between each layer. For example, GPT-3 has 96 layers. During the forward pass, there is a 12,288 dimensional vector (a list of 12,288 numbers) for each token in the input that is passed from layer to layer. This vector accumulates all of the information that the model uses to predict the next token as it is processed by each layer, but it is opaque and it’s difficult to understand what information is contained within.

We can use SAEs to understand this intermediate activation. An SAE is basically a matrix -> ReLU activation -> matrix (see footnotes 1 and 2). As an example, if our GPT-3 SAE has an expansion factor of 4, the input activation is 12,288 dimensional and the SAE’s encoded representation is 49,152 dimensional (12,288 x 4). The first matrix is the encoder matrix of shape (12,288, 49,152) and the second matrix is the decoder matrix of shape (49,152, 12,288). By multiplying the GPT’s activation with the encoder and applying the ReLU, we produce a 49,152 dimensional SAE encoded representation that is sparse, as the SAE’s loss function incentivizes sparsity. Typically, we aim to have less than 100 numbers in the SAE’s representation be nonzero. By multiplying the SAE’s representation with the decoder, we produce a 12,288 dimensional reconstructed model activation. This reconstruction doesn’t perfectly match the original GPT activation because our sparsity constraint makes the task difficult.

We train individual SAEs on only one location in the model. For example, we could train a single SAE on intermediate activations between layers 26 and 27. To analyze the information contained in the outputs of all 96 layers in GPT-3, we would train 96 separate SAEs - one for each layer’s output. If we also wanted to analyze various intermediate activations within each layer, this would require hundreds of SAEs. Our training data for these SAEs comes from feeding a diverse range of text through the GPT model and collecting the intermediate activations at each chosen location.

I’ve included a reference SAE Pytorch implementation. The variables have shape annotations following Noam Shazeer’s tip. Note that various SAE implementations will often have various bias terms, normalization schemes, or initialization schemes to squeeze out additional performance. One of the most common additions is some sort of constraint on decoder vector norms. For more details, refer to various implementations such as OpenAI’s, SAELens, or dictionary_learning.

import torch
import torch.nn as nn

# D = d_model, F = dictionary_size
# e.g. if d_model = 12288 and dictionary_size = 49152
# then model_activations_D.shape = (12288,) and encoder_DF.weight.shape = (12288, 49152)

class SparseAutoEncoder(nn.Module):
    """
    A one-layer autoencoder.
    """
    def __init__(self, activation_dim: int, dict_size: int):
        super().__init__()
        self.activation_dim = activation_dim
        self.dict_size = dict_size

        self.encoder_DF = nn.Linear(activation_dim, dict_size, bias=True)
        self.decoder_FD = nn.Linear(dict_size, activation_dim, bias=True)

    def encode(self, model_activations_D: torch.Tensor) -> torch.Tensor:
        return nn.ReLU()(self.encoder_DF(model_activations_D))
    
    def decode(self, encoded_representation_F: torch.Tensor) -> torch.Tensor:
        return self.decoder_FD(encoded_representation_F)
    
    def forward_pass(self, model_activations_D: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        encoded_representation_F = self.encode(model_activations_D)
        reconstructed_model_activations_D = self.decode(encoded_representation_F)
        return reconstructed_model_activations_D, encoded_representation_F

The loss function for a standard autoencoder is based on the accuracy of input reconstruction. To introduce sparsity, early SAE implementations added a sparsity penalty to the SAE’s loss function. The most common form of this penalty is calculated by taking the L1 loss of the SAE’s encoded representation (not the SAE weights) and multiplying it by an L1 coefficient. The L1 coefficient is a crucial hyperparameter in SAE training, as it determines the trade-off between achieving sparsity and maintaining reconstruction accuracy.

Note that we aren’t optimizing for interpretability. Instead, we obtain interpretable SAE features as a side effect of optimizing for sparsity and reconstruction.

Here is a reference loss function.

import einops

# B = batch size, D = d_model, F = dictionary_size

def calculate_loss(autoencoder: SparseAutoEncoder, model_activations_BD: torch.Tensor, l1_coefficient: float) -> torch.Tensor:
    reconstructed_model_activations_BD, encoded_representation_BF = autoencoder.forward_pass(model_activations_BD)
    reconstruction_error_BD = (reconstructed_model_activations_BD - model_activations_BD).pow(2)
    reconstruction_error_B = einops.reduce(reconstruction_error_BD, 'B D -> B', 'sum')
    l2_loss = reconstruction_error_B.mean()

    l1_loss = l1_coefficient * encoded_representation_BF.sum()
    loss = l2_loss + l1_loss
    return loss

UPDATE 11/29/2024: I think the Vanilla ReLU SAE is fairly outdated and should not be used except as a baseline. My preferred SAE is the BatchTopK SAE, as it significantly improves on the sparsity / reconstruction accuracy trade-off, the desired sparsity can be directly set without tuning a sparsity penalty, and it has good training stability. The BatchTopK SAE is very similar to the ReLU SAE. Instead of a ReLU and a sparsity penalty, you simply retain the top k activation values and zero out the rest. In this case, the k hyperparameter directly sets the desired sparsity. An example BatchTopK implementation can be seen here. Other strong alternative approaches are the TopK and JumpReLU SAEs.
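The core of the TopK idea above can be sketched in a few lines. This is a simplified per-sample version that omits implementation details such as bias handling; as noted in the update, BatchTopK instead keeps the top k × batch_size values across the whole batch:

```python
import torch

# Minimal sketch of the TopK activation: keep the k largest pre-activation
# values per input and zero out the rest, so k directly sets the sparsity.
def topk_activation(pre_activations_F: torch.Tensor, k: int) -> torch.Tensor:
    values, indices = torch.topk(pre_activations_F, k, dim=-1)
    encoded_F = torch.zeros_like(pre_activations_F)
    encoded_F.scatter_(-1, indices, values)
    return encoded_F
```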

Diagram of a sparse autoencoder forward pass
A single Sparse Autoencoder forward pass. We begin with a 1x4 model vector. We multiply it with a 4x8 encoder matrix to produce a 1x8 encoded vector, then apply the ReLU to zero out negative values. The encoded vector is sparse. We multiply it with a 8x4 decoder matrix to produce a 1x4 imperfectly reconstructed model activation.

A Hypothetical SAE Feature Walkthrough

Hopefully, each active number in the SAE’s representation corresponds to some understandable component. As a hypothetical example, assume that the 12,288 dimensional vector [1.5, 0.2, -1.2, ...] means “Golden Retriever” to GPT-3. The SAE decoder is a matrix of shape (49,152, 12,288), but we can also think of it as a collection of 49,152 vectors, with each vector being of shape (1, 12,288). If the SAE decoder vector 317 has learned the same “Golden Retriever” concept as GPT-3, then the decoder vector would approximately equal [1.5, 0.2, -1.2, ...]. Whenever element 317 of the SAE’s activation is nonzero, a vector corresponding to “Golden Retriever” (and scaled by element 317’s magnitude) will be added to the reconstructed activation. In the jargon of mechanistic interpretability, this can be succinctly described as “decoder vectors correspond to linear representations of features in residual stream space”.

This makes intuitive sense when we consider the mathematics of vector-matrix multiplication. Multiplying a vector with a matrix is essentially a weighted sum of the matrix’s rows (or columns, depending on the multiplication order), where the weights are the elements of the vector. In our case, the SAE’s sparse encoded representation serves as these weights, selectively activating and scaling the relevant decoder vectors (matrix rows) to reconstruct the original activation.
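The weighted-sum view above can be verified numerically with a toy example (a hypothetical 8-feature SAE over a 4-dimensional model space, matching the diagrams):

```python
import torch

# Multiplying a sparse encoded vector by the decoder matrix is exactly a
# weighted sum of the active decoder rows.
decoder_FD = torch.randn(8, 4)

encoded_F = torch.zeros(8)
encoded_F[1] = 0.7  # feature 1 active
encoded_F[4] = 1.2  # feature 4 active

reconstruction_D = encoded_F @ decoder_FD
weighted_sum_D = 0.7 * decoder_FD[1] + 1.2 * decoder_FD[4]
assert torch.allclose(reconstruction_D, weighted_sum_D)
```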

We can also say that our SAE with a 49,152 dimensional encoded representation has 49,152 features. A feature is composed of the corresponding encoder and decoder vectors. The role of the encoder vector is to detect the model’s internal concept while minimizing interference with other concepts, while the decoder vector’s role is to represent the “true” feature direction. Empirically, we find that encoder and decoder vectors for each feature are different, with a median cosine similarity of 0.5. In the below diagram, the three red boxes correspond to a single feature.

Diagram of a sparse autoencoder with bolded feature 1
The three red boxes correspond to SAE feature 1, and the green boxes correspond to feature 4. Per feature, there is a 1x4 encoder vector, 1x1 feature activation, and 1x4 decoder vector. The reconstructed activation is only constructed from the decoder vectors from SAE features 1 and 4. If red represents the color red, and green represents a sphere, then the model could be representing a red sphere.

How do we know what our hypothetical feature 317 represents? The current practice is to just look at the inputs that maximally activate the feature and give a gut reaction on their interpretability. The inputs each feature activates on are frequently interpretable. For example, Anthropic trained SAEs on Claude Sonnet and found separate SAE features that activate on text and images related to the Golden Gate Bridge, neuroscience, and popular tourist attractions. Other features activate on concepts that aren’t immediately obvious, such as a feature from a SAE trained on Pythia that activates “on the final token of relative clauses or prepositional phrases which modify a sentence’s subject”.

Because the SAE decoder vectors match the shape of the LLM’s intermediate activations, we can perform causal interventions by simply adding decoder vectors to model activations. We can scale the strength of the intervention by multiplying the decoder vector with a scaling factor. When Anthropic researchers added the Golden Gate Bridge SAE decoder vector to Claude’s activations, Claude was compelled to mention the Golden Gate Bridge in every response.

Here is a reference implementation of a causal intervention using our hypothetical feature 317 (see footnote 3). This very simple intervention would compel our GPT-3 model to mention Golden Retrievers in every response, similar to Golden Gate Bridge Claude.

def perform_intervention(model_activations_D: torch.Tensor, decoder_FD: torch.Tensor, scale: float) -> torch.Tensor:
    intervention_vector_D = decoder_FD[317, :]
    scaled_intervention_vector_D = intervention_vector_D * scale
    modified_model_activations_D = model_activations_D + scaled_intervention_vector_D
    return modified_model_activations_D

Challenges with Sparse Autoencoder Evaluations

One of the main challenges with using SAEs is in evaluation. We are training sparse autoencoders to interpret language models, but we don’t have a measurable underlying ground truth in natural language. Currently, our evaluations are subjective, and basically correspond to “we looked at activating inputs for a range of features and gave a gut reaction on interpretability of the features”. This is a major limitation in the field of interpretability.

Researchers have found common proxy metrics that seem to correspond to feature interpretability. The most commonly used are L0 and Loss Recovered. L0 is the average number of nonzero elements in the SAE’s encoded intermediate representation. Loss Recovered measures the additional language modeling loss incurred when we replace the GPT’s original activation with our reconstructed activation. There is typically a trade-off between these two metrics, as SAEs may choose a solution that decreases reconstruction accuracy to increase sparsity.
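Of the two metrics, L0 is straightforward to compute from a batch of encoded representations. A minimal sketch (B = batch size, F = dictionary size):

```python
import torch

# Sketch of the L0 proxy metric: the average number of nonzero features
# per input in the SAE's encoded representation.
def compute_l0(encoded_representation_BF: torch.Tensor) -> float:
    nonzero_counts_B = (encoded_representation_BF != 0).sum(dim=-1).float()
    return nonzero_counts_B.mean().item()
```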

A common comparison of SAEs involves graphing these two variables and examining the tradeoff (see footnote 4). Many new SAE approaches, such as Deepmind’s Gated SAE and OpenAI’s TopK SAE, modify the sparsity penalty to improve on this tradeoff. The following graph is from Google Deepmind’s Gated SAE paper. The red line for Gated SAEs is closer to the top left of the graph, meaning that it performs better on this tradeoff.

Gated SAE L0 vs Loss Recovered

There are several layers of difficulty with measurements in SAEs. Our proxy metrics are L0 and Loss Recovered. However, we don’t use these when training, as L0 isn’t differentiable and calculating Loss Recovered during SAE training is computationally expensive (see footnote 5). Instead, our training loss is determined by an L1 penalty and the accuracy of reconstructing the internal activation, rather than its effect on downstream loss.

Our training loss function doesn’t directly correspond to our proxy metrics, and our proxy metrics are only proxies for our subjective evaluations of feature interpretability. There’s an additional layer of mismatch, as our subjective interpretability evaluations are proxies for our true goal of “how does this model work”. There’s a possibility that some important concepts within LLMs are not easily interpretable, and we could ignore these by blindly optimizing interpretability.

For a more detailed discussion of SAE evaluation methods and an evaluation approach using board game model SAEs, refer to my blog post on Evaluating Sparse Autoencoders with Board Game Models.

Conclusion

The field of interpretability has a long way to go, but SAEs represent real progress. They enable interesting new applications, such as an unsupervised method to find steering vectors like the “Golden Gate Bridge” steering vector. SAEs have also made it easier to find circuits in language models, which can potentially be used to remove unwanted biases from the internals of the model.

The fact that SAEs find interpretable features, even though their objective is merely to identify patterns in activations, suggests that they are uncovering something meaningful. This is also evidence that LLMs are learning something meaningful, rather than just memorizing surface-level statistics.

They also represent an early milestone that companies such as Anthropic have aimed for, which is “An MRI for ML models”. They currently do not offer perfect understanding, but they may be useful to detect unwanted behavior. The challenges with SAEs and SAE evaluations are not insurmountable and are the subject of much ongoing research.

For further study of Sparse Autoencoders, I recommend Callum McDougal’s Colab notebook.

Acknowledgements: I am grateful to Justis Mills, Can Rager, Oscar Obeso, and Slava Chalnev for their valuable feedback on this post.

  1. The ReLU activation function is simply y = max(0, x). That is, any negative input is set to 0. 

  2. There are typically also bias terms at various points, including the encoder and decoder layers. 

  3. Note that this function would intervene on a single layer and that the SAE should have been trained on the same location as the model activations. For example, if the intervention was performed between layers 6 and 7 then the SAE should have been trained on the model activations between layers 6 and 7. Interventions can also be performed simultaneously on multiple layers. 

  4. It’s worth noting that this is only a proxy and that improving this tradeoff may not always be better. As mentioned in the recent OpenAI TopK SAE paper, an infinitely wide SAE could achieve a perfect Loss Recovered with an L0 of 1 while being totally uninteresting. 

  5. Apollo Research recently released a paper that used a loss function that aimed to produce the same output distribution, rather than reconstruct a single layer’s activation. It works better but is also more computationally expensive. 

Manipulating Chess-GPT’s World Model
2024-03-20 · https://adamkarvonen.github.io/machine_learning/2024/03/20/chess-gpt-interventions

Note: This work has since been turned into a paper accepted to the Conference on Language Modeling, but the average reader will probably prefer the blog post.

In my previous post I introduced Chess-GPT, a language model I trained to predict the next character in a game of chess given a PGN string (1.e4 e5 2.Nf3 …). Through the process of training to output the next character, it learns to compute the state of the chess board and to estimate the skill level of the players in the game given an arbitrary PGN string as input. I demonstrated this using linear probes, which are classifiers that take the model’s activations as input and predict the board state or player skill level as output. Chess-GPT also played chess well, with the best model playing at approximately 1500 Elo.

I presented evidence that the model learned to compute a world model in order to perform next character prediction, but I did not have the time to validate these representations by using them to intervene on the model’s activations. In other related work, the authors used the probes to edit Othello GPT’s internal activations, getting it to output legal moves under the “make believe” state of the board. I wanted to add rigor to my work and establish a causal link between the internal board state and skill representations and the model outputs. If there was a causal link, I should be able to increase and decrease the model’s skill level and edit its internal state of the board.

I had also done some investigation of Chess-GPT’s ability to play Chess in games very unlike those found in its training dataset (which consists of real games downloaded from Lichess). Specifically, I initialized the chess board with 20 random moves, and then had Chess-GPT play against Stockfish with this random initialization. Its performance plummeted. The larger 50 million parameter model’s win rate dropped from 70% to 17%. Did this mean that the models only learn superficial patterns of how to play Chess with no deeper understanding of the game? It turns out that this is not the case, and with one simple trick, we can restore a significant fraction of our models’ chess playing ability.

Next token predictors

When GPT-3 was released, some AI skeptics argued that GPT-3 learns surface statistics or correlations between words, with no understanding of the underlying world. Douglas Hofstadter (who no longer holds this position and is now worried about artificial super intelligence) argued that LLMs are “cluelessly clueless” because they produce nonsense when given trick questions. For example:

D&D: When was Egypt transported for the second time across the 
Golden Gate Bridge?

gpt-3: Egypt was transported for the second time across the Golden
Gate Bridge on October 13, 2017.

Gary Marcus and Ernest Davis, in their article “GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about”, also demonstrated that if you prompt GPT-3 with nonsense text, it will continue with nonsense text. Kevin Lacker gave GPT-3 a Turing Test, and found similar results.

Q: How do you sporgle a morgle?
A: You sporgle a morgle by using a sporgle.

Q: How many bonks are in a quoit?
A: There are three bonks in a quoit.

Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: It takes two rainbows to jump from Hawaii to seventeen.

So what’s going on here? Does GPT-3 really have no ability to understand the world or express uncertainty? If we think about it further, GPT-3 is just trying to predict the next token as if the prompt is a section of a random piece of text on the internet. In this context, the text after “How do you sporgle a morgle?” is much more likely to be “You sporgle a morgle by …” than “That question doesn’t make any sense!”. Base models without instruct tuning are finicky, and just because they answer nonsense with more nonsense doesn’t mean that they lack any understanding of the question. For example, Nick Cammarata showed GPT-3 can easily express uncertainty if we just prompt it to reply with “yo be real” to nonsense questions.

Of course, modern LLMs like ChatGPT or Claude are much more capable and have been trained on examples of desired behavior using RLHF or instruction tuning. Now, they correctly answer “I don’t know” to nonsense questions and express uncertainty about challenging questions. They don’t need to rely on prompts to elicit a desired behavior.

This suggests a possible explanation for Chess-GPT’s poor performance on randomly initialized games. If a game begins with 20 random moves, the players are probably not high skill players. Chess-GPT is also a base model, with no RLHF or instruct tuning to learn a desired behavior of playing strong chess. If Chess-GPT truly was a good next token predictor, it would predict legal, low skill moves in the case of a randomly initialized game.

To test this idea, I gave Chess-GPT randomly initialized games, and then recorded the Elo probe’s prediction for the White player. In normal games, the Elo probe would classify players as under 1550 Elo or above 2050 Elo, and it correctly classified 90.5% of players. In the case of randomly initialized games, the Elo probe classified 99.4% of players as low skill, indicating that the model was trying to predict the next character in a low skill game.

Skill Interventions

My linear probes are trained on the GPT’s residual stream. We can think of the residual stream as the intermediate state in a transformer, or the output of each layer before it is fed to the next layer. Each layer in the GPT reads from the residual stream, performs some calculation, and then adds the result back to the residual stream. In my case, the hidden dimension of Chess-GPT is 512, which means that the residual stream and every intermediate state is simply a 512 dimensional vector (or a list of 512 numbers). The following diagram is from Anthropic’s explanation of the transformer’s residual stream:

A residual stream diagram.

The linear probe for the Elo classification is a 512 x 2 matrix. We multiply the GPT’s 512 dimension activation vector with the Elo probe to produce a 2 dimensional vector (see footnote 1), which contains raw predictions (also known as logits or scores) for low skill and high skill. By applying softmax, this 2 dimensional vector now contains probabilities that sum to 1. For example, an output of [0.2 0.8] means that the probe assigns an 80% probability to the player being a high skill player.
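The probe's forward pass described above amounts to one matrix multiply and a softmax. A minimal sketch (probe weights are random placeholders here, not the trained probe):

```python
import torch

# Sketch of the Elo probe: a 512x2 matrix applied to a 512-dim residual
# stream activation, followed by softmax, yields [P(low skill), P(high skill)].
def predict_skill(model_activation_D: torch.Tensor, elo_probe_DC: torch.Tensor) -> torch.Tensor:
    logits_C = model_activation_D @ elo_probe_DC
    return torch.softmax(logits_C, dim=-1)
```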

So, our Elo probe contains a 512 x 1 low skill vector and a 512 x 1 high skill vector. To create a skill intervention vector, we can subtract the low skill vector from the high skill vector. To encourage the model to play at a higher skill level, we can simply add this skill vector to the residual stream. Alternatively, we can add this vector to the layer’s final bias term to get an equivalent intervention at zero additional inference cost. We just flip the sign of the intervention to encourage the model to play at a lower skill level. To increase or decrease the magnitude of our intervention, we can multiply our intervention vector by a scaling factor. This intervention can be applied to an arbitrary number of layers, and is visualized here:

Residual stream intervention diagram
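The addition intervention described above can be sketched as a PyTorch forward hook. The module path in the usage comment is hypothetical and depends on the model implementation:

```python
import torch

# Sketch of the skill intervention: add the (scaled) skill vector to a
# chosen layer's output. Returning a value from a forward hook replaces
# that layer's output.
def make_skill_hook(skill_vector_D: torch.Tensor, scale: float):
    def hook(module, inputs, output):
        return output + scale * skill_vector_D
    return hook

# Hypothetical usage, e.g. on one transformer block:
# handle = model.transformer.h[6].register_forward_hook(make_skill_hook(skill_vector, 2.0))
# ... generate moves, then handle.remove()
```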

This intervention is extremely simple, and yet it works very well. I spent quite a bit of time experimenting with other more complicated interventions involving gradient descent and vector projection, and the simple addition baseline beat all other approaches that I tried. Of course, with machine learning, it’s hard to say that the addition technique is definitively better - there’s always the possibility that I just had to tweak one more detail and a more complicated approach would have been the best.

Here’s my intuition for why this works: The linear probe for low skill is trained to output a large positive value for low skill players and a large negative value for high skill players. These raw predictions (logits) are then converted to probabilities using softmax. For the probe to output a large positive value when multiplied with the model’s activations, the probe’s positive weights should align with positive activations, and negative weights with negative activations. Therefore, the low skill probe’s weights are oriented in a similar direction to the model’s internal representation of low skill play. Adding the skill vector (high skill minus low skill) to the activations encourages higher skill play by aligning the activations more closely with the high skill representation.

I also explored the technique of contrastive activations. In this technique, we collect the average GPT intermediate state activation per layer in 2,000 high skill games and 2,000 low skill games between turns 25 and 35. Once again, we subtract the average low skill activation from the average high skill activation to produce a skill vector. As with the probe-based intervention, this produces a 512 dimensional vector that we can add or subtract to the residual stream or final bias term. In practice, contrastive activations worked slightly better than probe derived interventions.
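The contrastive activations computation above reduces to a difference of means. A minimal sketch (N games, D = 512):

```python
import torch

# Sketch of contrastive activations: the skill vector is the mean activation
# over high-skill games minus the mean over low-skill games.
def contrastive_skill_vector(high_skill_acts_ND: torch.Tensor,
                             low_skill_acts_ND: torch.Tensor) -> torch.Tensor:
    return high_skill_acts_ND.mean(dim=0) - low_skill_acts_ND.mean(dim=0)
```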

To test the effectiveness of this intervention, I had Chess-GPT play 30,000 games against Stockfish 16 level 0 across 6 configurations. The first three configurations had Chess-GPT play against Stockfish starting from the standard chess starting position. The last three had the board initialized with 20 randomly selected moves.

The resulting win rates are in this table:

| Model | Board | No Intervention | Positive Intervention | Negative Intervention |
|---|---|---|---|---|
| 50M Parameters | Standard | 69.6% | 72.3% | 11.9% |
| 25M Parameters | Standard | 49.6% | 54.8% | 5.5% |
| 50M Parameters | Randomly Initialized | 16.7% | 43.2% | 5.9% |
| 25M Parameters | Randomly Initialized | 15.7% | 31.3% | 3.6% |

In the standard board setting, adding the skill vector adds a minor increase in win rate, while subtracting it causes a substantial decrease. I believe the room for improvement in the standard setting is limited because the average skill level in the Lichess training dataset is 1,690 Elo, which is higher than Chess-GPT’s skill level.

However, the intervention works very well in the randomly initialized board. Here, the positive skill intervention improved the 25M parameter model’s win rate by 2x, and the 50M parameter model’s by 2.6x. The larger model’s more substantial improvement suggests that it may have more latent chess ability, and can perform better when it’s “trying” to win rather than emulate a low skilled player. This offers very strong evidence that, in the case of randomly initialized boards, the models are predicting the next character as if the players involved have a low skill level.

The intervention does only restore about half of the models’ performance. It’s difficult to say if this is because the models struggle to generalise to games that are very different from their training data, or if the models still have more latent ability that a better intervention could uncover. The addition intervention is very crude. It is adjusted via a scale parameter. If the scale is too low, it doesn’t make a difference, and if it’s too high, the model outputs random characters rather than valid chess moves. With the current primitive state of ML science and interpretability, it’s difficult to say which is the case.

Board State Interventions

In my previous post, I also found that Chess-GPT learns to compute the state of the chess board from a PGN string input. To validate these board state representations, I also wanted to perform interventions to edit Chess-GPT’s beliefs about the current board state. If my intervention was successful, Chess-GPT should output moves that are legal under the new, modified state of the chess board.

When modifying the chess board, we need to make sure that our change is strategically relevant to the current state of the board. If we modify a random piece, most of the existing legal moves will remain legal, and the randomly changed piece will probably not be selected by Chess-GPT if it isn’t in a strategically important position. My strategy was the following:

Intervention Diagram

First, I sampled Chess-GPT’s most likely move by giving it a PGN string and sampling the maximum likelihood character at every prediction. Then, I determine the piece that was moved and its source square. In this case, it’s White pawn from C2. This gives us a strategically relevant piece at a current board state.

Now, I delete the piece from both the model’s internal representation and the original board to create a modified model and modified board. In this case, I delete the White pawn from C2 off the chess board. Then, I select the 512 dimensional White pawn row from the linear probe for square C2. I then subtract the C2 white pawn vector from the model’s residual stream, effectively “erasing” the model’s memory of the pawn’s presence on C2.

Then, I sample five probabilistic moves from both the original and modified model at temperature 1.0. The original model’s moves provide a baseline legal move rate without any intervention. If my intervention was successful, all 5 moves from the modified model would be legal under the new state of the chess board.
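The erasure step above amounts to subtracting one probe row from the residual stream. A minimal sketch, where the probe tensor layout (squares x piece types x d_model) is a hypothetical convention for illustration:

```python
import torch

# Sketch of the board-state edit: subtract the probe's direction for a given
# (square, piece) from the residual stream to "erase" that piece.
def erase_piece(model_activation_D: torch.Tensor,
                board_probe_SPD: torch.Tensor,
                square: int, piece: int, scale: float = 1.0) -> torch.Tensor:
    piece_vector_D = board_probe_SPD[square, piece]
    return model_activation_D - scale * piece_vector_D
```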

In the following diagram, I display the white pawn heat map before intervention (where Chess-GPT chose to move the white pawn from c2-c3) and after intervention, when the c2 pawn was erased from Chess-GPT’s internal activations. We can use linear probes to create visual heatmaps of Chess-GPT’s internal board state. Each linear probe outputs a probability that a particular piece (such as white pawn) is present on a particular square. We can use these outputs to construct a heat map for any piece type.

As we can see, post-intervention the c2 pawn is indeed missing. However, the other white pawn locations are much less distinct, indicating that the intervention is fairly crude, even if successful.

Intervention heat map

I perform this intervention in 5,000 test cases. As the table below shows, my approach performs significantly better than the baseline of 41% legal moves. However, even in the best case the legal move rate is only 92%. Once again, it’s difficult to say if this is because of my primitive intervention strategy or if the model has a poor internal representation of the state of the board.

| Model | Intervention Status | Original Board | Modified Board |
| --- | --- | --- | --- |
| 50M Parameters | With Intervention | 85.4% | 92.0% |
| 25M Parameters | With Intervention | 81.8% | 90.4% |
| 50M Parameters | No Intervention | 99.9% | 40.5% |
| 25M Parameters | No Intervention | 99.8% | 41.0% |

Implications

These experiments significantly strengthen the findings of my previous blog post, suggesting that Chess-GPT learns a deeper understanding of chess strategy and rules, rather than simply memorizing patterns. Chess-GPT is orders of magnitude smaller than any current LLM and can be trained in 2 days on 2 RTX 3090 GPUs, yet it still manages to learn to estimate latent variables such as player skill. In addition, we see that bigger models learn to better compute board state and player skill.

This adds further evidence that language models can develop a world model through the process of next token prediction and self-supervised learning. Whatever is going on inside these models is much more sophisticated than surface-level statistics. The success of the interventions adds further evidence that it is possible to directly influence model behavior in predictable ways using simple tools.

However, the fact that these interventions are only partially successful highlights the limitations of our current understanding of machine learning models. The current state of interpretability in machine learning is still fairly primitive and comparable to the early, pre-microscope days in biology. More research is needed to develop more precise and reliable methods of understanding language models. Fortunately, there are exciting new innovations such as Sparse Autoencoders which have shown promising results.

All code, models, and datasets can be found at:

https://github.com/adamkarvonen/chess_llm_interpretability

If interested in discussion or collaboration, feel free to contact me via email.

I am currently open to job opportunities. If you found this post interesting and think I could be a good fit for your team, feel free to reach out via email or LinkedIn.

There is also this Twitter thread for public discussion purposes.

Appendix

Other Interventions

When performing probe based skill interventions, I found that it was possible to improve Chess-GPT’s ability by adding the high skill vector, subtracting the low skill vector, or both adding the high skill vector and subtracting the low skill vector. However, I got the best results from performing both interventions. In contrast, with board state interventions, I found that I got the best results by only subtracting the moved piece vector. Adding the blank square vector lowered the success rate of the intervention.

One very important hyperparameter in these interventions is the scale parameter, which is multiplied by the intervention vector. If it was too low, the modified model would make the same illegal moves as the original model. If it was too high, the modified model would output random characters. To estimate the upper bound of the success rate for board state interventions, I varied the scale of intervention across a range of values for each move to identify a scale that resulted in the model generating legal moves. I found a scale that resulted in the model outputting legal moves approximately 98% of the time.
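The per-move scale sweep can be sketched as below, assuming hypothetical helpers `sample_move(scale)` (runs the intervened model at a given scale) and `is_legal(move)`; the scale grid is also an assumption.

```python
def find_working_scale(sample_move, is_legal, scales=(1, 2, 4, 8, 16)):
    """Try intervention scales in increasing order until the modified
    model produces a legal move under the modified board.

    sample_move(scale) -> move string from the intervened model;
    is_legal(move) -> bool. Both callables are stand-ins for
    illustration.
    """
    for scale in scales:
        move = sample_move(scale)
        if is_legal(move):
            return scale, move
    return None, None  # no scale in the grid produced a legal move
```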

I looked at a couple of options to dynamically set the scale parameter. The first was vector projection, from the Representation Engineering paper. The second was to use some algebra to set the scale parameter to target a specific output logit value for the linear probe. In this case, the target value was another parameter to sweep, and -10 seemed to work well.

I explored a range of other interventions to try and increase the success of my board state interventions. Kenneth Li had originally used gradient descent with his non-linear probes. The probes are trained using cross-entropy loss, which is essentially trying to maximize the probability output for the ground truth label. To perform an intervention with gradient descent, our loss is once again the cross-entropy loss between the probe output probability and the desired board state (in this case, a blank square). Instead of optimizing the probe weights to minimize loss, we optimize the model activations being input to the linear probe.
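The gradient-descent variant can be sketched as follows: freeze the probe and optimize the activation so the probe predicts the desired class. Shapes and hyperparameters here are illustrative assumptions, not the original experiment’s settings.

```python
import torch

def optimize_activation(act, probe, target_class, steps=200, lr=0.1):
    """Adjust a residual-stream activation by gradient descent so a
    frozen linear probe classifies it as `target_class` (e.g. the
    blank-square class).

    act: (d_model,) activation; probe: a torch module mapping
    (d_model,) -> (n_classes,) logits. Hyperparameters are illustrative.
    """
    act = act.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([act], lr=lr)
    target = torch.tensor([target_class])
    for _ in range(steps):
        logits = probe(act).unsqueeze(0)  # (1, n_classes)
        loss = torch.nn.functional.cross_entropy(logits, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return act.detach()
```

The optimized activation would then be written back into the residual stream before sampling the next move.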

In all of these cases, the simplest approach of just subtracting the piece vector from the residual stream worked the best. As is often the case, reality was not kind to my clever ideas and the simplest approach won. Out of all alternative approaches, the target logit value approach worked the best, but was still around 1-2% worse than the subtraction approach.

The best positive intervention win rates in the random board state using probe derived interventions were 42.2% for the 50M parameter model, and 24.7% for the 25M parameter model.

  1. In matrix multiplication, multiplying a 1 x 512 matrix (or vector) by a 512 x 2 matrix produces a 1 x 2 matrix (or vector) as the result. The first column of 512 numbers, when multiplied via dot product with the 512-dimensional activation, produces a score for “low skill player”. The second column of 512 numbers produces a score for “high skill player” when multiplied with the activation. 
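The footnote’s arithmetic can be checked in a few lines of NumPy (toy random values, real shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
activation = rng.standard_normal((1, 512))   # one residual-stream activation
skill_probe = rng.standard_normal((512, 2))  # columns: low skill, high skill

scores = activation @ skill_probe  # (1, 512) @ (512, 2) -> (1, 2)
assert scores.shape == (1, 2)

# Each score is the dot product of the activation with one probe column.
assert np.allclose(scores[0, 0], activation[0] @ skill_probe[:, 0])
assert np.allclose(scores[0, 1], activation[0] @ skill_probe[:, 1])
```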

Chess-GPT’s Internal World Model
2024-01-03 · https://adamkarvonen.github.io/machine_learning/2024/01/03/chess-world-models

A Chess-GPT Linear Emergent World Representation

Introduction

Note: This work has since been turned into a paper accepted to the Conference on Language Modeling, but the average reader will probably prefer the blog post. There is also a second blog post, Manipulating Chess-GPT’s World model.

Among the many recent developments in ML, there were two I found interesting and wanted to dig into further. The first was gpt-3.5-turbo-instruct’s ability to play chess at 1800 Elo. The fact that an LLM could learn to play chess well from random text scraped off the internet seemed almost magical. The second was Kenneth Li’s Emergent World Representations paper. There is an excellent summary on The Gradient and a follow-up from Neel Nanda. In it, they trained a 25 million parameter GPT to predict the next character in an Othello game. It learns to accurately make moves in games unseen in its training dataset, and using both non-linear and linear probes it was found that the model accurately tracks the state of the board.

However, this only worked for a model trained on a synthetic dataset of games uniformly sampled from the Othello game tree. They tried the same techniques on a model trained using games played by humans and had poor results. To me, this seemed like a major caveat to the findings of the paper which may limit its real world applicability. We cannot, for example, generate code by uniformly sampling from a code tree.

So I dug into it. I trained some models on chess games and used linear probes on the trained models. My results were very positive, and answered all of my previous questions (although of course, more questions were generated).

A 50 million parameter GPT trained on 5 million games of chess learns to play at ~1300 Elo in one day on 4 RTX 3090 GPUs. This model is only trained to predict the next character in PGN strings (1.e4 e5 2.Nf3 …) and is never explicitly given the state of the board or the rules of chess. Despite this, in order to better predict the next character, it learns to compute the state of the board at any point of the game, and learns a diverse set of rules, including check, checkmate, castling, en passant, promotion, pinned pieces, etc. In addition, to better predict the next character it also learns to estimate latent variables such as the Elo rating of the players in the game.

All code, data, and models have been open sourced.

Training Chess GPT

My initial hypothesis was that Othello-GPT trained on human games performed poorly due to a lack of data. They only had 130k human Othello games, but the synthetic model was trained on 20 million games. I tried two different approaches to create my datasets: First, I had Stockfish Elo 3200 play 5 million games as White against a range of Stockfish 1300-3200 as Black. Hopefully, this synthetic dataset of superhuman chess bot games would provide higher quality data than human games. Second, I grabbed 16 million games from Lichess’s public chess game database. I trained separate models on individual datasets and various mixes of datasets (more details in the appendix).

Initially, I looked at fine-tuning open source models like Llama 7B or OpenLlama 3B. However, I almost immediately had to abandon that approach to keep my GPU costs down (I used RTX 3090s from runpod). Instead, I started training models from scratch using Andrej Karpathy’s nanogpt repository. I experimented with 25M and 50M parameter models.

A graph of Chess-GPT vs Stockfish

It basically worked on the first try. The 50M parameter model played at 1300 Elo with 99.8% of its moves being legal within one day of training. I find it fairly impressive that a model with only 8 layers can correctly make a legal move 80 turns into a game. I left one training for a few more days and it reached 1500 Elo. I’m still investigating dataset mixes and I’m sure there’s room for improvement.

So, gpt-3.5-turbo-instruct’s performance is not magic. If you give an LLM a few million chess games, it will learn to play chess. My 50M parameter model is orders of magnitude smaller than any reasonable estimate of gpt-3.5’s size, and it is within 300 Elo of its performance. In addition, we recently had confirmation that GPT-4’s training dataset included a collection of PGN format chess games from players with an Elo over 1800.

I also checked if it was playing unique games not found in its training dataset. There are often allegations that LLMs just memorize such a wide swath of the internet that they appear to generalize. Because I had access to the training dataset, I could easily examine this question. In a random sample of 100 games, every game was unique and not found in the training dataset by the 10th turn (20 total moves). This should be unsurprising considering that there are more possible games of chess than atoms in the universe.

Chess-GPT’s Internal World Model

Next, I wanted to see if my model could accurately track the state of the board. A quick overview of linear probes: We can take the internal activations of a model as it’s predicting the next token, and train a linear model to take the model’s activations as inputs and predict board state as output. Because a linear probe is very simple, we can have confidence that it reflects the model’s internal knowledge rather than the capacity of the probe itself. We can also train a non-linear probe using a small neural network instead of a linear model, but we risk being misled as the non-linear probe picks up noise from the data. As a sanity check, we also probe a randomly initialized model.
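As a sketch, a board-state probe is just one linear layer trained with cross-entropy on frozen model activations. The shapes and 13-class layout below follow the post’s description, but the code itself is an illustration under assumed shapes, not the repository’s implementation:

```python
import torch
import torch.nn as nn

d_model, n_squares, n_classes = 512, 64, 13  # 13 = blank + 6 white + 6 black
probe = nn.Linear(d_model, n_squares * n_classes)

def probe_loss(activations, board_labels):
    """activations: (batch, d_model) frozen model activations;
    board_labels: (batch, 64) piece-class index per square."""
    logits = probe(activations).view(-1, n_squares, n_classes)
    return nn.functional.cross_entropy(
        logits.reshape(-1, n_classes), board_labels.reshape(-1))

# One optimizer step on random stand-in data:
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
acts = torch.randn(8, d_model)
labels = torch.randint(0, n_classes, (8, n_squares))
loss = probe_loss(acts, labels)
opt.zero_grad()
loss.backward()
opt.step()
```

Only the probe’s parameters are optimized; the model that produced the activations stays frozen, which is what makes the probe a measurement rather than extra capacity.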

In the original Othello paper, they found that only non-linear probes could accurately construct the board state of “this square has a black / white / blank piece”. For this objective, the probe is trained on the model’s activations at every move. However, Neel Nanda found that a linear probe can accurately construct the state of the board of “this square has my / their / blank piece”. To do this, the linear probe is trained only on the model’s activations as it predicts one side’s moves (White’s or Black’s, but not both). Neel Nanda speculates that the nonlinear probe simply learns to XOR “I am playing white” and “this square has my color”.

Armed with this knowledge, I trained some linear probes on my model. And once again, it basically worked on my first try. I also found that my Chess-GPT uses a “my / their” board state, rather than a “black / white” board state. My guess is that the model learns one “program” to predict the next move given a board state, and reuses the same “program” for both players. The linear probe’s objective was to classify every square into one of 13 classes (blank, white / black pawn, rook, bishop, knight, king, queen). The linear probe accurately classified 99.2% of squares over 10,000 games.

To better interpret the internal predictions of my model, I created some visual heat maps. These heat maps were derived from the probe outputs, which had been trained on a one-hot objective to predict whether a chess piece, such as the black king, was present on a given square (1 if present, 0 if not). The first heat map shows the actual board state for the black king. The second heat map depicts the probe’s confidence with the output values clipped: any value above 5 is reduced to 5. This clipping makes the probe’s output more binary, as shown by the white square against the black background. The third heat map presents the probe’s output without any clipping, revealing a gradient of confidence levels. It shows that the model is extremely certain that the black king is not located on the white side of the chessboard.

3 heatmaps of the linear probe for black king location
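The clipping step above amounts to a reshape plus a clamp; a minimal sketch, with the per-square logit layout assumed:

```python
import torch

def piece_heatmap(probe_logits, clip=None):
    """Reshape one piece-type's per-square probe outputs (64,) into an
    8x8 heat map, optionally capping values at `clip` (e.g. 5) to make
    the map more binary. A sketch; shapes are assumptions.
    """
    values = probe_logits.clone()
    if clip is not None:
        values = values.clamp(max=clip)
    return values.view(8, 8)
```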

We see a very similar result for the location of the white pawns, although the model is less confident. This board state comes from the 12th move in a chess game, and the model is extremely confident that no white pawns are in either side’s back rank.

3 heatmaps of the linear probe for white pawn location

The model still knows where the blank squares are, but it is once again less confident in this.

3 heatmaps of the linear probe for blank squares location

For this move in this chess game, the linear probe perfectly reconstructs the state of the board. The probe’s objective is to classify each square into one of 13 categories, each representing a different chess piece or a blank square. To create this graph, we just take the prediction with the highest value for each square as the probe’s output.

2 heatmaps of the linear probe for board state
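The board reconstruction described above amounts to an argmax per square; a sketch, where the 13-class ordering and piece symbols are assumptions for illustration:

```python
import torch

PIECES = ".PNBRQKpnbrqk"  # blank, white pieces, black pieces (assumed order)

def reconstruct_board(square_logits):
    """square_logits: (64, 13) probe outputs. Take the highest-scoring
    class for each square and render the board as 8 rows of text.
    """
    classes = square_logits.argmax(dim=-1).tolist()  # one class per square
    rows = []
    for rank in range(8):
        rows.append("".join(PIECES[c] for c in classes[rank * 8:(rank + 1) * 8]))
    return "\n".join(rows)
```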

Probing for latent variables

Because Chess-GPT learned to predict the next move in a competitive game, rather than a game uniformly sampled from a game tree, there are interesting latent variables we can probe for. In particular, I hypothesized that to better predict the next character, it would learn to estimate the skill level of the players involved.

Initially, I trained the probe on a regression task, where its objective is to predict the Elo of the White player. It does this by training on the internal activations of the model between moves 25 and 35, as it would be extremely difficult to predict player skill early in the game. However, the majority of the games in the Lichess dataset are between 1500 and 1880 Elo, which is a relatively narrow band [1]. The linear probe trained on Chess-GPT had an average error of 150 Elo, which seemed good at first glance. However, a linear probe trained on a randomly initialized model had an average error of 215 Elo. The narrow window of Elo in most games made it difficult to discern the model’s level of knowledge. Distinguishing between a 1700 and a 1900 Elo player just seems like a very difficult task.

So, I then trained the probe on a classification task, where it had to identify players below an Elo of 1516 or above an Elo of 2029. In this case, the probe performed much better. A probe trained on a randomly initialized model correctly classified 66% of players, while a probe trained on Chess-GPT correctly classified 89% of players.

To an extent, this is unsurprising. This reminds me of OpenAI’s 2017 Sentiment Neuron paper. In it, they trained an LSTM to predict the next character in Amazon reviews. When they trained a linear probe on the model’s internals using just 232 labeled examples, it became a state of the art sentiment classifier. OpenAI wrote then that “We believe the phenomenon is not specific to our model, but is instead a general property of certain large neural networks that are trained to predict the next step or dimension in their inputs”. With this context, it’s almost an expected result.

Caveats

The evidence here would be stronger if I also performed causal interventions on the model using these probes. For example, I could intervene to change the model’s internal representation of the board state, and see if it makes legal moves under the new state of the board. Or, I could intervene on the model’s representation of player skill and see if it plays better or worse. Unfortunately, I just ran out of time. This was a Christmas break project, and it’s time to get back to work.

However, I still consider the findings to be strong. Linear probes have limited capacity and are an accepted method of benchmarking what a model has learned. I followed the general best practice of training the probes on a training set and testing them on a separate test set. The board state in particular is a very concrete task to probe against. Probing for skill level does leave open the possibility that the model is learning some feature that is merely correlated with skill, but 89% is a good result for the difficult task of discerning the Elo of players in a chess game after 25 moves.

Potential future work

As Neel Nanda discussed, there are many advantages to interpreting models trained on narrow, constrained tasks such as Othello or Chess. It is difficult to interpret what a large LLM like Llama is modeling internally when predicting tokens in an unconstrained domain like poetry. There has been successful interpretation of simple models trained on toy tasks like sorting a list. Models trained on games provide a good intermediate step that is both tractable and interesting.

My immediate thought is to look for some sort of internal tree search. When I play chess, I perform a sort of tree search, where I first consider a range of moves, then consider my opponent’s responses to these moves. Does Chess-GPT perform a similar internal calculation when predicting the next character? Considering that it is better than I am, it seems plausible.

Other potential directions:

  • Perform causal interventions on the model using these linear probes.
  • Investigate why the model sometimes fails to make a legal move or model the true state of the board.
  • How does the model compute the state of the board, or the location of a specific piece?
  • I fine-tuned GPT-2 on a 50 / 50 mix of OpenWebText and chess games, and it learned to play chess and continued to output plausible looking text. Maybe there’s something interesting to look at there?

Part two: Manipulating Chess-GPT’s World Model

If interested in discussion or collaboration, feel free to contact me via email. There is also this Twitter thread for public discussion purposes.

I am currently open to job opportunities. If you found this post interesting and think I could be a good fit for your team, feel free to reach out via email or LinkedIn.

Appendix

Corrections

In my original article, I had a graph with a stacked bar chart of Chess-GPT’s games against Stockfish, with bars for wins, draws, and losses. I noticed that there was an unusually high number of draws against Stockfish levels 5 through 9. After some inspection, I realized that I had mistakenly inflated the draw rate. The cause was the following: Chess-GPT has a context size of 1024 characters, enough for approximately 180 moves. My analysis code mistakenly categorized a game still active at 180 moves as a draw. While a game at 180 moves is more likely than average to end in a draw, it certainly isn’t guaranteed. I used the following strategy to redo the graph: for every game still active at move 180, I used Stockfish to assign a centipawn advantage. Any player with more than a 100 centipawn advantage received a win; all other games were scored as draws. 77% of these games were Stockfish wins. The old, inaccurate bar graph can be viewed here and the old, inaccurate line chart can be viewed here. This does not change the Elo rating of Chess-GPT, but it’s definitely a mistake for which I apologize.
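The corrected scoring rule is a few lines of code; a sketch of the correction, with the 100-centipawn threshold from above:

```python
def settle_unfinished_game(centipawns_at_180, threshold=100):
    """Score a game still active at move 180 from Stockfish's
    evaluation (positive centipawns favor White): a side ahead by more
    than `threshold` centipawns is scored as the winner; otherwise the
    game counts as a draw.
    """
    if centipawns_at_180 > threshold:
        return "white"
    if centipawns_at_180 < -threshold:
        return "black"
    return "draw"
```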

Updates

2024/03/23: I wrote up the results of this blog post in a paper titled “Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models”. I reran some of the experiments here with larger sample sizes to improve rigor (for example, Chess-GPT was benchmarked with 1,000 games against Stockfish, not 100). I also improved the quality of some figures, and used log probabilities in my probe heatmaps instead of logits. I also added an appendix section discussing Stockfish Elo estimation.

Technical probing details

Both Neel Nanda and I trained our probes to predict “my piece / their piece” instead of “white piece / black piece”. To predict “white piece / black piece”, you just have to train the linear probe on the model’s activations at every move. To predict “my piece / their piece”, you have to train the linear probe on the model’s activations at every white XOR black move.

In Othello-GPT, the model had a vocabulary of 60 tokens, corresponding to the 60 legal squares where a piece could be placed. So, Neel Nanda just probed at every even character for a white “my piece / their piece” probe, and at every odd character for a black “my piece / their piece” probe. In my case, the input to Chess-GPT was a string like “1.e4 e5 2.Nf3 …”.

So, I trained the white “my piece / their piece” probe on the model’s activations at the index of every “.” as it’s predicting the next character. For example, the probe would be trained on “1.” and “1.e4 e5 2.” as inputs. For a black “my piece / their piece” probe, I trained it on the index of every even “ “ character. I trained a linear probe on the “white piece / black piece” objective, and it obtained a classification accuracy of 86%.
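The index selection can be sketched as below. The even/odd space convention here is my 0-indexed reading of the description above, so treat it as an assumption:

```python
def probe_indices(pgn, color="white"):
    """Return character positions in a PGN string where a "my piece /
    their piece" probe would be trained: every "." for the White probe
    (the model is about to predict White's move), and every
    even-numbered space (0-indexed) for the Black probe.
    """
    if color == "white":
        return [i for i, ch in enumerate(pgn) if ch == "."]
    spaces = [i for i, ch in enumerate(pgn) if ch == " "]
    return spaces[0::2]  # spaces just before Black's moves (assumed)
```

For the string `"1.e4 e5 2.Nf3 Nf6 3."`, the White probe trains at the positions of the three periods, and the Black probe at the spaces following `e4` and `Nf3`.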

Neel Nanda excluded the first 5 and last 5 moves of the game when training his probes. I found that my linear probes’ accuracy did not change whether trained on all moves or on all but the first 5 moves.

Model training details

The LLMs were character level models rather than using byte-pair encoding and tokenization. From manually inspecting gpt-3.5 tokenization, it looks like a standard tokenizer has slightly over 1 character per token for a PGN string, excluding spaces. As my model had a vocabulary of just 32 tokens, I was able to reduce my model size by 25 million parameters compared to using a standard tokenizer with a vocabulary of 50,257 tokens. During training, I ensured that every batch began with “;1.”, a delimiter token followed by a new game. I did try training a model by randomly sampling blocks that usually began in the middle of a game, although its 1024 context length meant that it usually also received the beginning of a game later on. The model still learned to play chess. I would be curious what sort of heuristics that model learned to infer the board state when receiving an input that starts in the middle of a chess game.
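A 32-symbol character-level vocabulary is enough to cover PGN text; a sketch with an assumed character set (the `;` delimiter plus move punctuation, digits, files, space, and piece letters — the actual set used may differ):

```python
# 32-symbol character vocabulary for PGN strings (assumed character set):
# delimiter, move punctuation, digits, files a-h, space, piece letters.
CHARS = ";.x+#=O-1234567890abcdefgh RNBQK"
assert len(CHARS) == 32

stoi = {ch: i for i, ch in enumerate(CHARS)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[ch] for ch in s]

def decode(ids):
    return "".join(itos[i] for i in ids)
```

With only 32 embedding rows instead of 50,257, the embedding and unembedding matrices shrink by roughly 50,000 × d_model parameters, which is where the ~25 million parameter savings comes from.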

Open source code, models, and datasets

All code, models, and datasets are open source. To train, test, or visualize linear probes on the LLMs, please visit: https://github.com/adamkarvonen/chess_llm_interpretability

To play the nanoGPT model against Stockfish, please visit: https://github.com/adamkarvonen/chess_gpt_eval/tree/master/nanogpt

To train a Chess-GPT from scratch, please visit: https://github.com/adamkarvonen/nanoGPT

All pretrained models are available here: https://huggingface.co/adamkarvonen/chess_llms

All datasets are available here: https://huggingface.co/datasets/adamkarvonen/chess_games

Wandb training loss curves and model configs can be viewed here: https://api.wandb.ai/links/adam-karvonen/u783xspb

Model size and dataset comparison

| Model Name | Probe Layer | Elo Classification Accuracy | Board State Classification Accuracy | Legal Move Rate |
| --- | --- | --- | --- | --- |
| Randomly Initialized 8 layer model | 5 | 65.8% | 70.8% | 0% |
| Randomly Initialized 16 layer model | 12 | 66.5% | 70.6% | 0% |
| 8 Layer Model trained on Lichess Games | 7 | 88.0% | 98.0% | 99.6% |
| 16 Layer Model trained on Lichess Games | 12 | 89.2% | 98.6% | 99.8% |
| 16 Layer Model trained on Stockfish Games | 12 | N/A | 99.2% | 99.7% |

All models were trained for a total of 60 billion input characters. The models were trained for several epochs; the datasets ranged in size from 4 to 7 billion characters. The labels stand for the Hugging Face dataset the model was trained on, as well as the number of layers in the model. In this graph, a win counts as 1 point per game, a draw as 0.5, and a loss as 0. We lose some information compared to a stacked bar chart, but I felt it would be too crowded.

A line chart comparing various LLM's win rates against Stockfish

Probe accuracy per layer in 8 and 16 layer networks

An interesting note: The 8 layer network can calculate a 98% accurate board state by layer 5. However, the 16 layer network doesn’t calculate an accurate board state until layer 11. This indicates that the network is calculating many things in parallel, rather than calculating the board state as soon as possible and then planning from there. Oddly, the skill probes trained on randomized models become more accurate on deeper layers.

A bar chart of accuracy per layer for Elo and board state in 4 different models

Stockfish Elo Estimate

Elo estimation of Stockfish levels is complex, with several variables at play. An official Elo estimate for Stockfish 16’s skill levels can be found in a commit on the official Stockfish repo. There, Stockfish 16 level 0 is rated at Elo 1,320. The official Stockfish testing was performed using 120 seconds per move, which is prohibitively expensive for tens of thousands of games on GPU machines. The particular processor used per machine is another variable.

To estimate the Elo of Stockfish level 0 at 0.1 seconds per move, I had it play 1,000 games as white against Stockfish level 0 with 0.1, 1.0 and 10.0 seconds per move. The win rate of 0.1 vs 0.1 seconds was 51%, 0.1 vs 1.0 seconds was 45%, and 0.1 vs 10.0 seconds was 45%. At low Stockfish levels, a 100x decrease in search time makes little difference in Stockfish’s win rate. In contrast, at higher Stockfish levels, decreasing search time makes a significant difference in win rate. Thus, a reasonable approximation of Stockfish 16 level 0 with 0.1 seconds per move is Elo 1300.

  1. Lichess’s Elo ratings appear to run high. The average chess.com Elo is around 800. A quick Google search shows that many believe Lichess ratings are on average a few hundred Elo higher than other websites’ Elo ratings. 
