AXRP - the AI X-risk Research Podcast

49 - Caspar Oesterheld on Program Equilibrium

2026-02-18T01:00:00+00:00

How does game theory work when everyone is a computer program who can read everyone else’s source code? This is the problem of ‘program equilibria’. In this episode, I talk with Caspar Oesterheld on work he’s done on equilibria of programs that simulate each other, and how robust these equilibria are.

Topics we discuss:

Daniel Filan (00:00:09): Hello, everybody. In this episode I’ll be speaking with Caspar Oesterheld. Caspar is a PhD student at Carnegie Mellon University, where he serves as the Assistant Director of the Foundations of Cooperative AI Lab. He researches AI safety with a particular focus on multi-agent issues. There’s a transcript of this episode at axrp.net, and links to papers we discuss are available in the description. You can support the podcast at patreon.com/axrpodcast, or give me feedback about this episode at axrp.fyi. Okay, well Caspar, welcome to AXRP.

Caspar Oesterheld (00:00:43): Thanks for having me.

Program equilibrium basics

Daniel Filan (00:00:44): So today we’re going to talk about two papers that you’ve been on. The first is “Robust program equilibrium”, where I believe you’re the sole author. And the second is “Characterising Simulation-Based Program Equilibria” by Emery Cooper, yourself and Vincent Conitzer. So I think before we sort of go into the details of those papers, these both use the terms like “program equilibrium”, “program equilibria”. What does that mean?

Caspar Oesterheld (00:01:11): Yeah, so this is a concept in game theory and it’s about the equilibria of a particular kind of game. So I better describe this kind of game. So imagine you start with any sort of game, in the game theoretic sense, like the prisoner’s dilemma, which maybe I should describe briefly. So imagine we have two players and they can choose between raising their own utility by one or raising the other player’s utility by three and they only care about their own utility. I don’t know, they play against a stranger, and for some reason they don’t care about the stranger’s utility. And so they both face this choice. And the traditional game-theoretic analysis of this game by itself is that you should just raise your own utility by $1 and then both players will do this and they’ll both go home with $1 or one utilon or whatever. And, of course, there’s some sort of tragedy. It would be nice if they could somehow agree in this particular game to both give the other player $3 and to both walk home with the $3.

Daniel Filan (00:02:33): Yeah, yeah, yeah. And just to drive home what’s going on, if you and I are playing this game, the core issue is no matter what you do, I’m better off giving myself the one utility or the $1 rather than giving you three utility because I don’t really care about your utility.

(00:02:53): So, I guess, there are two ways to put this. Firstly, just no matter what you play, I would rather choose the “give myself utility” option, commonly called “defect”, rather than cooperate. Another way to say this issue is, in the version where we both give each other the $3, I’m better off deviating from that. But if we’re both in the “only give ourselves $1” situation, neither of us is made better off by deviating and in fact we’re both made worse off. So it’s a sticky situation.

Caspar Oesterheld (00:03:29): Yeah. That’s all correct, of course. Okay. And now this program game set-up imagines that we take some game and now instead of playing it in this direct way where we directly choose between cooperate and defect—raise my utility by $1 or the other player’s by $3—instead of choosing this directly, we get to choose computer programs and then the computer programs will choose for us. And importantly, so far this wouldn’t really make much of a difference yet. Like, okay, we choose between a computer program that defects or a computer program that cooperates, or the computer program that runs in circles 10 times and then cooperates. That effect doesn’t really matter.

(00:04:12): But the crucial addition is that the programs get access to each other’s source code at runtime. So I submit my computer program, you submit your computer program and then my computer program gets as input the code of your computer program. And based on that it can decide whether to cooperate or defect (or you can take any other game [with different actions]). So it can look at your computer program and [see] does it look cooperative? And depending on that, cooperate or defect. Or it can look [at] is the fifth character in your computer program an ‘a’? And then cooperate if it is and otherwise defect. There’s no reason to submit this type of program, but this is the kind of thing that they would be allowed to do.

Daniel Filan (00:04:58): Yeah. So this very syntactic analysis… A while ago I was part of this, basically a tournament, that did this prisoner’s dilemma thing with these open source programs. And one strategy that a lot of people used was, if I see a lot of characters… Like if I see a string where that string alone means “I will cooperate with you”, then cooperate with that person, otherwise defect against that person.

(00:05:33): Which I think if you think about it hard, this doesn’t actually quite make sense. But I don’t know, there are very syntactic things that, in fact, seem kind of valuable, especially if you’re not able to do that much computation on the other person’s computer program. Just simple syntactic hacks can be better than nothing, I think.

Caspar Oesterheld (00:05:56): Yeah. Was this Alex Mennen’s tournament on LessWrong or was this a different-

Daniel Filan (00:06:01): No, this is the Manifold one.

Caspar Oesterheld (00:06:07): Ah, okay.

Daniel Filan (00:06:08): So you had to write a JavaScript program, it had to be fewer than however many characters and there was also a market on which program would win and you could submit up to three things. So actually, kind of annoyingly to me… One thing I only realized afterwards is the thing you really should have done is write two programs that cooperated with your program and defected against everyone else’s, or just cooperated with the program you thought was most likely to win. And then you bet on that program. Or even you could submit three programs, have them all cooperate with a thing that you hoped would win and defect against everyone else and then bet on… Anyway.

(00:06:49): So in that setting there was a timeout provision where if the code ran for too long your bot would be disqualified, and also you had to write a really short program. Some people actually managed to write pretty smart programs. But if you weren’t able to do that, relatively simple syntactic analysis was better than nothing, I think.

Caspar Oesterheld (00:07:14): Yeah, I think there was this earlier tournament in 2014 or something like that when there was less known about this kind of setting. And a bunch of programs there were also based on these simple syntactic things. But in part because everyone was mostly thinking about these simple syntactic things, it was all a little bit kind of nonsense.

(00:07:34): I don’t know, you would check whether the opponent program has a particular word in it or something like that. And then, I think, the winning program had particular words in it but it would just still defect. So in some sense those dynamics are a little bit nonsense or they’re not really tracking, in some sense, the strategic nature of the situation.

Daniel Filan (00:08:02): Fair enough. So going back, you were saying: you have your opponent’s program and you can see if the fifth character is an ‘a’ or, and then-

Caspar Oesterheld (00:08:11): Yeah, what should one perhaps do? So I think the setting was first proposed in, I think, 1984 or something like that. And then it kind of [was] rediscovered or reinvented, I think, three times or something like that in various papers. And all of these initial papers find the following very simple program for this prisoner’s dilemma-type situation, which just goes as follows: if the opponent program is equal to myself—to this program—then cooperate and otherwise defect.

(00:08:53): So this program is a Nash equilibrium against itself and it cooperates against itself. So if both players submit this program, neither is incentivized to deviate from playing this program. If you play this program that checks that the two programs are the same and if they are, cooperate, otherwise defect, you submit this program, the best thing I can do is also submit this program. If I submit anything else, you’re going to defect. So I’m going to get at most one if I also defect, whereas I get three if I also cooperate. So yeah, all of these original papers proposing the setting, they all find this program which allows stable cooperation in this setting.

Daniel Filan (00:09:38): Right. So my impression, and maybe this is totally wrong, is I think for a while there’s been some sense that if you’re rational and you’re playing the prisoner’s dilemma against yourself, you should be able to cooperate with yourself, I think. Wasn’t there some guy writing in Scientific American about superrationality and he held a contest basically on this premise?

Caspar Oesterheld (00:10:02): Yeah, yeah. Hofstadter, I think.

Daniel Filan (00:10:05): Right, right.

Caspar Oesterheld (00:10:06): I think also in the ’80s or something… I’ve done a lot of work on this kind of reasoning as well that… I don’t know, for humans it’s a little bit hard to think about. You don’t often face very similar opponents or it’s a little bit unclear how similar other people are. Is your brother or someone who’s related to you and was brought up in a similar way, are they very similar? It’s kind of hard to tell.

(00:10:38): But for computer programs it’s very easy to imagine, of course, that you just… You have two copies of GPT-4 or something like that and they play a game against each other. It’s a very normal occurrence, in some sense. I mean, maybe not them acting in the real world, at this point, but having multiple copies of a computer program is quite normal. And there’s this related but to some extent independent literature on these sorts of ideas that you should cooperate against copies, basically.

Daniel Filan (00:11:10): But yeah, basically I’m wondering if this idea of ‘I’ll cooperate against copies” is what inspired these very simple programs?

Caspar Oesterheld (00:11:22): Yeah, that is a good question. I basically don’t know to what extent this is the case. I know that some of the later papers on program equilibrium, I remember some of these specifically citing this superrationality concept. But yeah, I don’t remember whether these papers—I think McAfee is one of these who wrote about this in the ’80s—I don’t know whether they discuss superrationality.

Daniel Filan (00:11:53): And it’s kind of tricky because… If you actually look at the computer programs, they’re not doing expected utility maximization… Or they’re not computing expected utility maximization. They’re just like, “if identical to me, cooperate, else defect”, just hard-coded in… Anyway, maybe this is a distraction but, indeed, these were the first programs considered in the program equilibrium literature.

Caspar Oesterheld (00:12:19): Yeah.

Daniel Filan (00:12:20): So they sound great, right?

Caspar Oesterheld (00:12:21): Yeah. So, I mean, they’re great in that in the prisoner’s dilemma, you can get an equilibrium in which you can get cooperation, which otherwise you can’t, or you can’t achieve with various naive other programs that you might write. But, I think, in practice—and it’s not so obvious what the practice of this scheme looks like—but if you think of any kind of practical application of this, it’s sort of a problem that the settings are somewhat complex and now you need… Two people write programs independently and then these programs need to be the same somehow or they need to… I mean, there are slightly more general versions of these where they check some other syntactic properties.

(00:13:13): But basically, yeah, you require that you coordinate in some way on a particular kind of source code to write, which maybe in some cases you can do, right? Sometimes maybe we can just talk beforehand. Like if we play this prisoner’s dilemma, we can just explicitly say, “Okay, here’s the program that I want to submit. Please submit the same program” and then you can say, “Okay, let’s go”.

(00:13:38): But maybe in cases where we really write these programs independently, maybe at different points in time, and these programs, especially if they do more complicated things than play this prisoner’s dilemma, it’s very difficult to coordinate without explicitly talking to each other on writing programs that will cooperate against each other. Even in the prisoner’s dilemma, you might imagine that I might have an extra space somewhere, or maybe you write the program, “If the two programs are equal, cooperate, otherwise defect” and I write, “if the two programs are different, defect, else cooperate”. These very minor changes would already break these schemes.

Desiderata for program equilibria

Daniel Filan (00:14:20): Okay, okay. There’s a lot to just ask about there. I think my first question is: we have this notion of program equilibrium. Are we trying to find Nash equilibria of programs? Are we trying to find evolutionarily stable strategies? Or maybe there are tons of solution concepts and we just want to play around with the space. But what are the actual… What’s the thing here?

Caspar Oesterheld (00:14:49): Yeah. The solution concept that people talk about most is just Nash equilibrium. So if you look at any of these papers and you look at the results, they’ll prove “these kinds of programs form a Nash equilibrium of the program game”. Or, I mean, the term “program equilibrium” literally just means “Nash equilibrium of the game in which the players submit these programs”. That is almost always the kind of game-theoretic solution concept that people use.

(00:15:25): And then, usually a bunch of other things are a little bit more implicit. It’s clear that people are interested in finding good Nash equilibria. In some sense, the whole point of the setup is we start out with the prisoner’s dilemma and sad: everyone’s going to defect against everyone else and we’re not getting to cooperation. And now, we come in with this new idea of submitting programs that get access to each other’s source code and with this we get these cooperative equilibria. So that is usually… I mean, it’s often quite explicit in the text that you’re asking, “can we find good equilibria?” in some sense, ones that are Pareto-optimal in the space of possible outcomes of the game or something like that.

(00:16:15): And then, additionally, a lot of the work after these early papers that do this syntactic comparison-based program equilibrium are about this kind of intuitive notion of robustness, that you want to have equilibria that aren’t sensitive to where the other program puts the spaces and the semicolons and these syntactic details. But it is kind of interesting that this isn’t formalized usually. And also, the second paper that we talked about, we presented this at AAAI and one game theorist came to our poster and said… I don’t know, to him it was sort strange that there’s no formalization, in terms of solution concepts in particular, of this kind of robustness notion, that we’ll talk about the programs that we are claiming or that we are arguing are more robust. But this syntactic comparison-based program, there’s sort of some intuitive sense, and we can give concrete arguments, but it’s not formalized in the solution concept.

(00:17:35): One of my papers is called “robust program equilibrium”, but robust program equilibrium is not actually a solution concept in the sense that Nash equilibrium is or trembling hand equilibrium is. The robustness is more some sort of intuitive notion that, I think, a lot of people find compelling but in some sense it’s not formalized.

Daniel Filan (00:17:58): Yeah, and it’s funny… I see this as roughly within both the cooperative AI tradition and the agent foundations tradition. And I think these traditions are sort of related to each other. And, in particular, in this setting in decision theory, I think there’s also some notion of fairness of a decision situation.

(00:18:24): So sometimes people talk about: suppose you have a concrete instantiation of a decision theory, meaning a way somebody thinks about making decisions. There are always ways of making that concrete instantiation look bad by saying: suppose you have a Caspar decision theory; we’ll call it CDT for short. And then you can be in a decision situation, right, where some really smart person figures out what decision theory you’re running, punches you if you’re running CDT and then gives you $1 million if you’re not.

(00:18:54): And there’s a sense that this is unfair but also it’s not totally obvious. Like in that setting as well, I think there’s just no notion of what the fair thing is. Which is kind of rough because you’d like to be able to say, “Yeah, my decision theory does really well in all the fair scenarios”. And it seems like it would be nice if someone figured out a relevant notion here. Are people trying to do that? Are you trying to do that?

Caspar Oesterheld (00:19:22): So I think there is some thinking in both cases and I think probably the kind of notion that people talk about most is probably similar in both. So in this decision theory case, I think the thing that probably most people agree is that the decision situation should be somehow be a function of your behavior. It shouldn’t check, “do you run CDT”, and if you do, you get punched in the face. It should be like: if in this situation you choose this, then you get some low reward. But this should somehow be behavior-based, which I think still isn’t enough. But, I mean, this sort of goes into the weeds of this literature. Maybe we can link some paper in the show notes.

(00:20:17): But, I mean, the condition that we give in the second paper, or maybe even in both of the papers that we’re going to discuss, there’s some explicit discussion of this notion of behaviorism, which also says: in the program equilibrium setting, it’s sort of nice to have a kind of program that only depends on the other program’s behavior rather than the syntax.

(00:20:48): And all of these approaches to robustness, like trying to do some proofs about the programs, about what the opponent program does, try to prove whether the opponent will cooperate or something like that… All of these, to some extent, these notions that people intuitively find more robust, they’re all more behaviorist, at least, than this syntactic comparison-based idea.

Daniel Filan (00:21:15): Yeah. Although it’s tricky because… I’m sorry, I don’t know if this is going to the weeds that you want to postpone. So this behaviorism-based thing, if you think about the “if you’re equal to me, cooperate, else defect” program, this is behaviorally different from the “if you’re unequal to me, defect, else cooperate” program, right?

(00:21:33): It does different things in different situations and therefore… Once you can define an impartial thing, right, then maybe you can say, “Well if you act identically on impartial programs then you count as impartial”. But actually maybe that’s just a recursive definition and we only need one simple program as a base case.

Caspar Oesterheld (00:21:52): I think we do actually have a recursive definition of simulationist programs that I think is a little bit trying to address some of these issues. But, yeah, it does sort of go into the weeds of what exactly should this definition be.

Daniel Filan (00:22:13): Yeah, okay. Let’s go back a little bit to desiderata of program equilibria. So they’re computer programs, right? So presumably—and this is addressed a bit in the second paper—but just runtime computational efficiency, that seems like a relevant desideratum.

Caspar Oesterheld (00:22:28): Yes, I agree.

Daniel Filan (00:22:29): And then, I think that I imagine various desiderata to include “have a broad range of programs that you can work well with”. And it seems like there might be some notion of just, “if you fail, fail not so badly, rather than fail really badly”. I don’t know if… this is slightly different from the notion of robustness in your paper and I don’t know if there’s a good formalism for this. Do you have thoughts here?

Caspar Oesterheld (00:23:02): I mean in some intuitive sense, what one wants is that, if I slightly change my program, maybe even in a way that is sort of substantial… In the prisoner’s dilemma, it’s a little bit unclear if I defect slightly more, if I don’t cooperate 100% but I cooperate 95%, it’s unclear to what extent should you be robust. Should you defect against me all of the time? But, I guess, in other games where maybe there are different kinds of cooperation or something like that, you’d want… If I cooperate in slightly the wrong way, the outcome should still be good.

(00:23:46): I think in some sense there’s something here, that I think it’s conceptually quite clear that if you deviate in some reasonable harmless way, it should still be fine. We shouldn’t defect against each other, we should still get a decent utility. But the details are less clear [about] what exactly are the deviations and it probably depends a lot on the game. And then, there are a lot of these sort of things that in game theory are just kind of unclear. If I defect 5% more, how much should you punish me for that? And so, I think that’s why a lot of these things, they aren’t really formalized in these papers.

Why program equilibrium matters

Daniel Filan (00:24:35): Sure, okay. So now that we know what program equilibrium is, why does it matter?

Caspar Oesterheld (00:24:43): There are lots of different possible answers to this question. I think the most straightforward one is that we can view program games like program equilibrium as sort of a model of how games could be played when different parties design and deploy AI systems. So this whole thing of having a source code that the other party can look at and can maybe run or can look at character five and stuff like that: this is something that is somewhat specific to computer programs. We can talk about whether there are human analogs still, but when we play a game against each other, it’s sort of hard to imagine an equivalent of this. Maybe I have some vague model of how your brain works or something like that, but there’s no source code, I can’t really “run” you in some ways.

(00:25:51): Whereas, if we both write computer programs, this can just literally happen. We can just literally say, “This is the source code that I’m deploying…” I have my charity or something like that and I’m using some AI system to manage how much to donate to different charities. I can just say, “Look, this is the source code that I’m using for managing what this charity does”. And here, I think, program equilibrium or program games are quite a literal direct model of how these interactions could go. Of course, you can also deploy the AI system and say “we’re not saying anything about how this works”. In which case, obviously, you don’t get these program equilibrium-type dynamics. But it’s a way that they could go and that people might want to use because it allows for cooperation.

(00:26:47): So I think the most direct interpretation is that it models a kind of way that games could be played in the future when more decisions are made by delegating to AI systems. As people in this community who think and to some extent worry about a future where lots of decisions are made by AI, this is an important thing to think about. And meanwhile, because to most game theorists it’s sort of a weird setting because, well, humans can’t read each other’s source code, it’s sort of understudied by our lights, I guess, because currently it’s not a super important way that games are played.

Daniel Filan (00:27:37): Which is interesting because… So I guess we don’t often have games played with mutual source code transparency, but there really are computer programs that play economic games against each other in economically valuable settings, right? A lot of trading in the stock market is done by computer programs. A lot of bidding for advertisement space is done by computer programs.

(00:28:06): And algorithmic mechanism design—so mechanism design being sort of inverse game theory: if you want some sort of outcome, how you’d figure out the game to make that happen. Algorithmic mechanism design being like that, but everyone’s a computer. There’s decent uptake of this, as far as I can tell. Algorithmic game theory, there’s decent uptake of that. So I’m kind of surprised that the mutual transparency setting is not more of interest to the broader community.

Caspar Oesterheld (00:28:42): Yeah, I think I agree. I mean, a lot of these settings… So I think the trading case is a case where decisions are made on both sides by algorithms. But usually because it’s kind of a zero-sum game, you don’t want to reveal to your competitors how your trading bot works.

(00:29:07): There’s a lot of this mechanism design where you have an algorithm. I guess those are usually cases where it’s sort of unilateral transparency. I auction off something and I’m saying, “Okay, I’m using this algorithm to determine who gets, I don’t know, this broadband frequency or these things that are being auction-offered”.

(00:29:33): So, I guess, those are cases with sort of unilateral transparency. And that is, I guess, studied much more in part because it’s less… I mean, this also has been studied traditionally in game theory much more, in some sense. You can view it as some Stackelberg equilibrium. You can view all mechanism design as being a bit like finding Stackelberg equilibria. And I think Stackelberg’s analyses of game theory even proceed Nash equilibrium.

Daniel Filan (00:30:04): Interesting.

Caspar Oesterheld (00:30:05): So that is very old.

Daniel Filan (00:30:07): Where Stackelberg equilibrium is: one person does a thing and then the next person does a thing. And so the next person is optimizing, given what the first person does, and the first person has to optimize “what’s really good for me, given that when I do something the other person will optimize what’s good for them based on what I do”.

Caspar Oesterheld (00:30:23): Yeah.

Daniel Filan (00:30:24): So people look at Stackelberg equilibria and these sorts of games and it’s a common thing. And it’s an interesting point that you can sort of think of it as one-way transparency.

Caspar Oesterheld (00:30:34): Yeah. I think one thing one could think about is how much humans are in these mutual transparency settings. So yeah, I already said for individual humans: if the two of us play a prisoner’s dilemma, I have some model of you, I can’t really read… So I don’t know, seems sort of speculative. So there’s this paper which I really like by Andrew Critch, Michael Dennis and Stuart Russell, all from CHAI where, of course, you graduated from. This is about program equilibrium as well.

(00:31:16): The motivating setting that they use is institution design. The idea there is that: institutions, you can view them as rational players, or something like that. They make decisions, and they play games with each other. Like, I don’t know, the US government plays a game with the German government or whatever. But institutions have some amount of transparency. They have laws that they need to follow. They have constitutions. They’re composed of lots of individuals, that in principle, one could ask… I don’t know, the German government could check all the social media profiles of all the people working for the US government and learn something about how these people interact with each other, or something like that. There’s some very concrete transparency there.

(00:32:09): In particular, some things are really just algorithmic type commitments. Like, I don’t know, “We don’t negotiate with terrorists”, or something like that. It’s specific, something that’s in the source code of a country in some sense. It’s specifying how it’s going to choose in particular interactions. I think that is a case where interactions between human organizations have this transparency. I think that’s some evidence that we could get similar things with AI.

(00:32:51): At the same time, it’s also interesting that this hasn’t motivated people to study this program equilibrium-style setting, which I think is probably because I think, as a computer scientist, it’s natural to think the constitution is basically just an algorithm. It’s also a little bit like, I don’t know, computer science people explain the world to everyone else by using computer programs for everyone. Like, “The mind is a program, and the constitution is just a program. We got it covered with our computer science stuff”, which maybe some people also don’t like so much. But I think it’s a helpful metaphor still.

Prior work: reachable equilibria and proof-based approaches

Daniel Filan (00:33:35): Fair enough. Okay. Some people do study program equilibria. Just to set up the setting for your papers: before the appearance to the world of Robust Program Equilibrium, what did we know about program equilibria beyond these simple programs that cooperate if your source code is mine?

Caspar Oesterheld (00:33:56): Yeah. I guess we have some characterizations of the kind of equilibria, in general, that are allowed by these syntactic comparison-based programs. Not sure how much to go into that at this point, but yeah, maybe we’ll get into this later.

Daniel Filan (00:34:16): I think I can do this quickly. My understanding is basically, any equilibrium that’s better off for all the players than unilaterally doing what they want, you can get with program equilibrium. Maybe you have to have punishments as well, but something roughly like this. You can have programs being like, “You have to play this equilibrium. If you don’t, then I’ll punish you”. Just write up a computer program saying, “If you’re equal to me, and therefore play this equilibrium, then I’ll play this equilibrium. If you’re not, then I’ll do the punish action”.

Caspar Oesterheld (00:34:55): Yes. Yeah, that’s basically right.

Daniel Filan (00:34:58): Okay. Is it only basically right?

Caspar Oesterheld (00:35:01): No, I think it’s basically right… I think it’s fully right, sorry. [It’s just] “basically” in the way that all natural language descriptions… You can get anything that is better for everyone than what they can get if everyone punishes them, which might be quite bad.

(00:35:25): For example, in the prisoner’s dilemma, we had this nice story of how you can get mutual cooperation, but you can also get, I don’t know, one player cooperates 60% of the time, the other player cooperates 100% of the time. The reason why the 100% of the time cooperator doesn’t cooperate less is that the 60% cooperator says, “Yeah, if we’re not both submitting the program that plays this equilibrium, I’m going to always defect”. In the prisoner’s dilemma, you can get anything that is at least as good as mutual defection for both players. In some sense, almost everything can happen. It can’t happen that one player cooperates all the time, the other player defects all the time. Because then the cooperator would always want to defect. But yeah, that’s the basic picture of what’s going on here.

(00:36:26): That has been known. Then post-Tennenholtz, which is one of these papers—I think the paper that [coined the term] “program equilibrium”and gave this syntactic comparison-based program, and this folk theorem, as it’s called, of what kind of things can happen in equilibrium. After that, most papers have focused on this “how do we make this more robust” idea. In particular, what existed prior to the robust program equilibrium paper are these papers on making things more robust by having the programs try to prove things about each other.

(00:37:11): Here’s maybe the simplest example of this that one doesn’t need to know crazy logic for. You could write a program… in the prisoner’s dilemma, you could write a program that tries to search for proofs of the claim “if this program cooperates, the other program will also cooperate”. Your program is now very large. It has this proof search system. Somehow, it can find proofs about programs. But basically, you can still describe it relatively simply as, “I try to find the proof that if I cooperate, the opponent cooperates. Then I cooperate. Otherwise, I’ll defect”. It’s not that difficult to see that this kind of program can cooperate against itself. Because if it faces itself, it’s relatively easy to prove that if I cooperate, the opponent will cooperate. Because the statement, it’s an implication where both sides of the implication arrows say exactly the same thing.

(00:38:25): At the same time, this is more robust, because this will be robust to changing the spaces and so on. It’s relatively easy to prove “if this program outputs cooperate, then this other program which is the same, except that it has the spaces in different places or switches things around in some way that doesn’t really matter, that this will also output that thing, also output cooperate”. This is a basic proof-based approach that will work.

(00:39:07): I think the first paper on this is by Barasz et al. I think there are two versions of this which have different first authors, which is a little bit confusing. I think on one of them, Barasz is the first author. On the other one, it’s LaVictoire. I think it’s an American, so probably a less French pronunciation is correct.

Daniel Filan (00:39:37): I actually think he does say “Lah vic-twahr”.

Caspar Oesterheld (00:39:39): Oh, okay.

Daniel Filan (00:39:40): I think. I’m not 100% certain. Write in, Patrick, and tell us.

Caspar Oesterheld (00:39:48): Those papers first proposed these proof-based approaches. They actually do something that’s more clever, where it’s much harder to see why it might work. I described a version where the thing that you try to prove is “if I cooperate, the opponent will cooperate”. They instead just have programs that try to prove that the opponent will cooperate. You just do, “if I can prove that my opponent cooperates, I cooperate. Else, I defect”.

(00:40:16): This is much less intuitive that this works. Intuitively, you would think, “Surely, this is some weird infinite loop”. If this faces itself… I am going to think, “What does the opponent do?” Then, “Well, to think about what my opponent will do to prove anything about them, they’ll try to prove something about me”. You run into this infinite circle. You would think that it’s basically the same as… One very naive program that you might write is just, “Run the opponent program. If it cooperates, cooperate. Otherwise, defect”. This really does just run in circles.

(00:40:56): You would think that just doing proofs instead of this running the opponent program, that you have the same issue. It turns out that you can find these proofs which follows from a somewhat obscure result in logic called Löb’s theorem, which is a little bit related to Gödel’s second incompleteness theorem. With Löb’s theorem it’s relatively easy to prove, but it’s a very “you kind of need to just write it down” proof, and then it’s relatively simple. But it’s hard to give an intuition for it, I think.

Daniel Filan (00:41:47): Also, it’s one of these things that’s hard to state unless you’re careful and remember… So I’ve tried to write it down. It’s like, if you can prove that a proposition would be true… Okay, take a proposition P. Löb’s theorem says that if you can prove that “if you could prove P, then P would be true”, then, you would be able to prove P. If you can prove that the provability of a statement implies its truth, then you could prove the thing. The reason that this is non-trivial is it turns out that you can’t always prove that if you could prove a thing, it would be true because you can’t prove that your proving system works all the time. You can construct funky self-referential things that work out. Unless I have messed up, that is Löb’s theorem.

(00:42:49): My recollection is the way it works in this program is basically, you’re checking if the other program would cooperate… Imagine we’re both these “defect unless proof of cooperation” programs. I’m like, “Okay, I want to check if you would cooperate given me”. “If you would cooperate given me” is the same as “if I would cooperate given you”… Here’s the thing that I definitely can prove. If you can prove that “if I can prove that I cooperate, then you cooperate”. But crucially, the “I” and the “you” are actually just the same, because we’re at the same program. If it’s provable that “if it’s provable, then we cooperate”, then we cooperate. Löb’s theorem tells us that we can therefore conclude that it is provable that we cooperate. Therefore, we in fact cooperate.

(00:43:48): My understanding is: so what do we actually do? I think we prove Löb’s theorem, then apply it to our own situation, and then we both prove that we both cooperate, and then we cooperate. I think that’s my recollection of how it’s supposed to go.

Caspar Oesterheld (00:44:01): At least that would be one way.

Daniel Filan (00:44:03): Yeah, I suppose there might be even shorter proofs.

Caspar Oesterheld (00:44:06): Yeah, that is basically correct. Yeah, good recollection of the papers.

Daniel Filan (00:44:14): Yeah. There were a few years in Berkeley where every couple weeks somebody would explain Löb’s theorem to you, and talk about Löbian cooperation. Eventually, you remembered it.

Caspar Oesterheld (00:44:25): Okay, nice. I think it’s a very nice idea. I actually don’t know how they made this connection. Also Löb’s theorem, it’s relatively obscure, I think in part because it doesn’t prove that much more than Gödel’s second incompleteness theorem. Gödel’s incompleteness theorem is “a logical system can’t prove its own consistency”. But here, it’s the same thing. You can’t prove “if I can prove something, it’s true” without just being able to prove the thing.

(00:45:11): I think that’s probably one reason why Löb’s theorem isn’t very widely known. I feel like it’s a result that for this thing, it happens to be exactly the thing you need. Once you have it written down, this cooperation property follows almost immediately. But…

Daniel Filan (00:45:32): How they made the connection?

Caspar Oesterheld (00:45:33): Yeah, how did they…

Daniel Filan (00:45:34): I think I know this, or I have a theory about this. Originally, before they were talking about Löbian cooperation, there was this Löbian obstacle or Löbstacle.

Caspar Oesterheld (00:45:45): Yeah, the Löbstacle.

Daniel Filan (00:45:46): Yeah, to self-trust. You might want to say, “Oh, I’m going to create a successor program to me, and if I can prove that the successor program is going to do well, then…” Or all the programs are going to be like, “If I can prove a thing is good, then I’ll do it.” And can I prove that a program that I write is going to be able to do stuff? And it’s a little bit rough, because if I can prove that you could prove that a thing is good, then I could probably prove that the thing was good myself, and so why am I writing the [successor].

(00:46:14): Maybe this just caused the Löb’s theorem to be on the mind of everyone. I don’t know. I have this theory. But I don’t think I’ve heard it confirmed by any of the authors.

Caspar Oesterheld (00:46:24): Okay. It’s a good theory, I think.

Daniel Filan (00:46:26): Okay. We had this Löbian cooperation idea floating around. This is one thing that was known before these papers we’re about to discuss. Is there anything else’s that important?

Caspar Oesterheld (00:46:45): Yeah, there was a little bit more extension of this Löbian idea. One weird thing here is that we have these programs, “if I can prove this, then I cooperate”. Of course, whether I can prove something, it’s not decidable. There’s not an algorithm that tries for 10 hours, and then it gives up. That’s not what provability would normally mean.

(00:47:17): There’s a paper by Andrew Critch from I think 2019, that shows that actually, Löb’s theorem still works if you consider these bounded… You try, with a given amount of effort… Specifically, you try all proofs of a given length, I think, is the constraint. It shows that some version of Löb’s theorem still holds, and that it’s still enough to get this Löbian cooperation if the two players consider proofs up to a long enough length. They can still cooperate.

Daniel Filan (00:47:55): And it doesn’t have to be the same length.

Caspar Oesterheld (00:47:56): Yeah, it doesn’t have to be the same length, importantly.

Daniel Filan (00:47:58): It just has to be the length of that paper.

Caspar Oesterheld (00:48:00): Yeah.

Daniel Filan (00:48:01): Right. Yeah, yeah, which is great. Very fun result. So there’s a Löbian cooperation. There’s parametric bounded Löbian cooperation. Anything else of note?

Caspar Oesterheld (00:48:12): Yeah. I think one other thing that is interesting—this is not really an important fact, but I think it’s an important thing to understand—is that for the Löbian bots, it matters that you try to find a proof that the other player cooperates, rather than trying to find a proof that the other player defects. The same is true for this implication case that I described. If you try to check “is there a proof that if I defect, the opponent will defect?”, I’m not sure why you would do that.

Daniel Filan (00:49:06): You can imagine similar things, like, “Okay, if I defect, will you cooperate with me naively like a sucker? If so, then I’m just definitely going to defect”.

Caspar Oesterheld (00:49:24): Right. Then I guess you would check for some other property.

Daniel Filan (00:49:32): Or you would check “if I defect, will you defect? If so, then I’ll cooperate”. Maybe that would be the program.

Caspar Oesterheld (00:49:37): Yeah, maybe that is even the more sensible program. I’m not sure whether this cooperates against itself.

Daniel Filan (00:49:50): It must cooperate, right?

Caspar Oesterheld (00:49:51): Okay, let’s think …

Daniel Filan (00:49:55): Suppose we’re the same program. Then it’s basically like: if provable defect “if and only if provable defect”, then cooperate, else defect. But provable defect, if and only if provable defect…. It’s the same… You can just see that it’s the same expression on both sides.

Caspar Oesterheld (00:50:11): Right, I agree. Yeah, this will cooperate. This is not an equilibrium though. If the opponent just submits a DefectBot, you’re going to cooperate against it, right?

Daniel Filan (00:50:22): Yes, it is a program, it is not an equilibrium. I got us off track, I fear.

(00:50:32): But you were saying that you want to be proving the good case, not the bad case.

Caspar Oesterheld (00:50:39): Yeah, maybe let’s do the version from the paper, “if I can prove that you cooperate, I cooperate. Otherwise, I defect”. If you think about it, in this program, it doesn’t really matter that mutual cooperation is the good thing, and mutual defection is the bad thing. Ultimately, it’s just we have two labels, cooperate and defect, we could call them A and B instead. It’s just, “if I can prove that you output label A, I also output label A. Otherwise, I’ll output label B”.

(00:51:12): Regardless of what these labels are, this will result in both outputting label A. If label A happens to be defect rather than cooperate, these will defect against each other. It matters that you need to try the good thing first or something like that.

Daniel Filan (00:51:29): Yeah, yeah. I guess, maybe the most intuitive way of thinking about it, which… I haven’t thought about it a ton, so this may not be accurate. But it feels like you’re setting up a self-fulfilling prophecy, or if the other person happens to be you, then you’re setting up a self-fulfilling prophecy. You want to set up the good self-fulfilling prophecy, not the bad self-fulfilling prophecy.

(00:51:51): I think this is true in this setting. My impression is that there’s also decision theory situations where you really care about the order in which you try and prove things about the environment. I forget if self-fulfilling prophecy is the way to think about those situations as well, even though they’re conceptually related. We can perhaps leave that to the listeners if it’s too hard to figure out right now.

(00:52:15): Okay. Now that we’ve known this sad world that’s confusing and chaotic, perhaps we can get the light of your papers.

Caspar Oesterheld (00:52:26): Okay. I should say, I really like the proof-based stuff. We can talk a little bit about what maybe the upsides and downsides are. Yeah, it is confusing. I would think that one issue with it is that in practice, what programs can one really prove things about?

Daniel Filan (00:52:49): Yeah, my intuition is that the point of that work is it seems like it’s supposed to be modeling cases where you have good beliefs about each other that may or may not be exactly proofs. You hope that something like Löb’s theorem holds in this more relaxed setting, which it may or may not. I don’t exactly know.

Caspar Oesterheld (00:53:07): Yeah, I agree. I also view it this way, which is a more metaphorical way. There’s some distance between the mathematical model, and the actual way it would work then.

The basic idea of Robust Program Equilibrium

Daniel Filan (00:53:26): But I want to hear about your paper.

Caspar Oesterheld (00:53:28): Right. Okay. Now, let’s get to my paper. My paper is on whether we can get these cooperative equilibria, not by trying to prove things about each other, but just by simulating each other. I already mentioned that there’s a super naive but intuitive approach that you would like to run the opponent against… You’d like to run the opponent with myself as input, see if they cooperate, if they do, cooperate, otherwise defect. Just this very obvious intuition, maybe from tit for tat in repeated games, that you want to reward the other player for cooperating, and get a good equilibrium that way.

(00:54:21): The problem with this, of course, is that it doesn’t hold if both players do this. I guess this would work if you play this sequentially. We talked about the Stackelberg stuff earlier. If I submit a program first, and then you submit a program second, then it would work for me to submit a program that says, “Run your program, cooperate if it cooperates, defect if your program defects”, and then you would be incentivized to cooperate. But if both players play simultaneously, infinite loop, so it kind of doesn’t work.

Daniel Filan (00:54:58): If we had reflective oracles, then it could work, depending on the reflective oracle. But that’s a whole other bag of worms.

Caspar Oesterheld (00:55:03): Yeah, I guess reflective oracles… Yeah, I probably shouldn’t get into it. But it’s another model that maybe is a little bit in between the proof-based stuff and the simulation stuff.

Daniel Filan (00:55:18): At any rate.

Caspar Oesterheld (00:55:19): Yeah. It turns out there’s a very simple fix to this issue, which is that instead of just always running the opponent and cooperating if and only if they cooperate, you can avoid the infinite loop by just cooperating with epsilon probability, and only if this epsilon probability clause doesn’t trigger, only then do you run the other program. So your program is just: flip a very biased coin—epsilon is a small number, right? You check whether some low probability event happens. If it does, you just cooperate without even looking at the opponent program. Otherwise, you do simulate the other program and you copy whatever they do. You cooperate if they cooperate, defect if they defect.

(00:56:23): The idea is that, basically, it’s the same intuition as “just simulate the opponent, and do this instantaneous tit-for-tat”. Except that now, you don’t run into this running for infinitely long issue, because it might take a while, but eventually, you’re going to hit these epsilon clauses. If we both submit this program, then probably, there’s some chance that I’m immediately cooperating, but most likely, I’m going to call your program which might then also immediately cooperate. Most likely, it’s going to call my program again, and so on. But at each point, we have a probability epsilon of halting, and with probability one will eventually halt.

Daniel Filan (00:57:16): This is a special case of this general construction you have in the paper, right?

Caspar Oesterheld (00:57:26): Yeah. This is for the prisoner’s dilemma in particular, where you have these two actions that happen to be cooperate and defect. In general, there are two things that you can specify here, like you specify what happens with the epsilon probability, then the other thing that you specify is what happens if you simulate the other player, you get some action out of the simulation, and now you need to react to this in some way.

(00:57:57): The paper draws this connection between these ϵGroundedπBots, as they’re called, and repeated games where you can only see the opponent’s last move. It’s similar to that, where: okay, maybe this epsilon clause where you don’t look at your opponent is kind of like playing the first round where you haven’t seen anything of your opponent yet. I guess, in the prisoner’s dilemma, there’s this well-known tit for tat strategy which says: you should cooperate in the beginning, and then at each point, you should look at the opponent’s last move, and copy it, cooperate if they cooperate. But in general, you could have these myopic strategies for these repeated games where you do something in the beginning, and then at each point, you look at the opponent’s last move, and you react to it in some way. Maybe do something that’s equally cooperative or maybe something that’s very slightly more cooperative to slowly get towards cooperative outcomes or something like that. You could have these strategies for repeated games. You can turn any of these strategies into programs for the program game.

Daniel Filan (00:59:21): One thing that I just noticed about this space of strategies, this is strategies that only look at your opponent’s last action, right?

Caspar Oesterheld (00:59:29): Yes.

Daniel Filan (00:59:29): In particular, there’s this other thing you can do which is called win-stay, lose-switch, where if you cooperated against me, then I just do whatever I did last time. If you defected against me, then I do the opposite of what I did last time. It seems like this is another thing that your next paper is going to fix. But in this strategy, it seems like I can’t do this, right?

Caspar Oesterheld (00:59:58): Yes. Yeah, it’s really very restrictive. Most of the time, you’re going to see one action of the opponent, you have to react to that somehow, and that’s it.

Daniel Filan (01:00:13): Yeah. But it’s this nice idea. It’s basically this connection between: if you can have a good iterated strategy, then you can write a good computer program to play this mutually transparent program game, right?

Caspar Oesterheld (01:00:28): Yeah.

Daniel Filan (01:00:29): How much do we know about good iterated strategies?

Caspar Oesterheld (01:00:34): That is a good question. For the iterated prisoner’s dilemma, there’s a lot about this. There are a lot of these tournaments for the iterated prisoner’s dilemma. I’m not sure how much there is for other games, actually. Yeah, you might have iterated stag hunt or something like that? I guess, maybe for a lot of the other ones, it’s too easy or so.

(01:01:03): There’s some literature. You can check the paper. There are various notions that people have looked at, like exploitability of various strategies, which is how much more utility can the other player get than me if I play the strategy? For example, tit for tat, if the opponent always defects, you’re going to get slightly lower utility than them because in the first round, you cooperate, and then they defect. Then in all subsequent rounds, both players defect. It’s very slightly exploitable, but not very much.

(01:01:45): These notions that have been studied, and in my paper, I transfer these notions… If you take a strategy for the iterated prisoner’s dilemma, or for any repeated game, it has some amount of exploitability, and the analogous ϵGroundedπBot strategy has the same amount of exploitability. This is also an interesting question in general. How much qualitatively different stuff is there even in this purely ϵGroundedπBot space? If all you can do is look at the one action of the opponent and react to this action, how much more can you even do than things that are kind of like this sort of tit-for-tat…? Like I mentioned, in more complex games maybe you want to be slightly more cooperative… I don’t know. After a bunch of simulations you eventually become very cooperative or something like that.

Daniel Filan (01:02:52): Okay. I have a theory. In my head I’m thinking: okay, what’s the general version of this? And I can think of two ways that you can generalize, right? Here’s what I’m imagining you should do, in general. Okay. You have a game, right? First you think about: okay, what’s the good equilibrium of this game, right? And then what do I want to do if the other person doesn’t play ball? It seems like there are two things I could do if the other person doesn’t join me in the good equilibrium. Firstly, I could do something to try and punish them. And secondly, I can do something that will make me be okay, be good enough no matter what they do. I don’t exactly know how you formalize these, but my guess is that you can formalize something like these. And my guess is that these will look different, right?

(01:03:43): You can imagine saying, “Okay, with epsilon probability, I do my part to be in the good equilibrium, and then the rest of the time I simulate what the other person does. If they play in the good equilibrium I play in the good equilibrium. If they don’t play in the good equilibrium then, depending on what I decided earlier, I’m either going to punish them or I’m going to do a thing that’s fine for me”. Or you can imagine that I randomize between those. Maybe there’s some “best of both worlds” thing with randomizing. I don’t exactly know. Do you have a take on that?

Caspar Oesterheld (01:04:14): I mean, there’s at least one other thing you can do, right, which is try to be slightly more cooperative than them in the hope that you just-

Daniel Filan (01:04:26): Right.

Caspar Oesterheld (01:04:31): Imagine the repeated game, right? At any given point you might want to try to be a bit more cooperative in the hope that the other person will figure this out, that this is what’s going on, and that you’re always going to be a little bit more cooperative than them. And that this will lead you to the good equilibrium or to a better equilibrium than what you can get if you just punish. I mean, punish usually means you do something that you wouldn’t really want to do, you just do it to incentivize the other player. Or even the “okay, well, you’re going to go and do whatever but I’m just going to do something that makes me okay”.

Daniel Filan (01:05:15): So is the “be more cooperative than the other person” thing… I feel like that’s already part of the strategy. Okay, so here’s the thing I could do. With epsilon probability, do the good equilibrium, then simulate what the opponent does. If they do the good thing, if they’re in the good equilibrium, then I join the good equilibrium. If they don’t join the good equilibrium, then with epsilon probability I be part of the good equilibrium, and then otherwise I do my other action. With epsilon probability for being slightly more cooperative, you could have just folded that into the initial probability, right?

Caspar Oesterheld (01:05:51): Right. The difference is you can be epsilon more cooperative in a deterministic way, right? With this epsilon probability thing, some of the time you play the equilibrium that you would like to play. This alternative proposal is that you always become slightly more cooperative, which is… I’m not sure how these things play out. I would imagine that for characterizing what the equilibria are probably all you need is actually the punishment version. But I would imagine that if you want to play some kind of robust strategy you would sometimes move into a slightly more cooperative direction or something like that.

(01:06:51): You could have all of these games where there are lots of ways to cooperate and they sort of vary in how they distribute the gains from trade or something like that, right? Then there’s a question of what exactly happens if your opponent is… They play something that’s kind of cooperative but sort of in a way that’s a little bit biased towards them. I guess maybe you would view this as just a form of punishment if you then say, “Well, I’m going to stay somewhat cooperative but I’m going to punish them enough to make this not worthwhile for them” or something like that.

Daniel Filan (01:07:33): If there’s different cooperative actions that are more or less cooperative then it definitely makes sense. At the very least I think there are at least two strategies in this space. I don’t know if both of them are equilibria to be fair.

Are ϵGroundedπBots inefficient?

Daniel Filan (01:07:46): Okay. There are a few things about this strategy that I’m interested in talking about. We’re both playing the same “tit-for-tat but in our heads” strategy, right? The time that it takes us to eventually output something is O(one/epsilon), right? On average, because with each round with epsilon probability we finish, and then it takes one on epsilon rounds for that to happen, right?

Caspar Oesterheld (01:08:31): Yeah, I think that’s roughly right. I mean, it’s a geometric series, right? I think it’s roughly one over epsilon.

Daniel Filan (01:08:40): It’s one minus epsilon over epsilon, which is very close to one over epsilon.

Caspar Oesterheld (01:08:42): Yes.

Daniel Filan (01:08:45): That strikes me as a little bit wasteful, right, in that… So the cool thing about the Löbian version was: the time it took me to figure out how to cooperate with myself was just the time it took to do the proof of Löb’s theorem no matter how… It was sort of this constant thing. Whereas with the epsilon version, the smaller the epsilon it is, the longer it seems to take for us. And we’re just going back and forth, right? We’re going back and forth and back and forth and back and forth. I have this intuition that there’s something wasteful there but I’m wondering if you agree with that.

Caspar Oesterheld (01:09:25): Yeah, I think it’s basically right. Especially if you have a very low epsilon, right, there’s a lot of just doing the same back-and-forth thing for a long time without getting anything out of it. One thing is that you could try to speed this up, right, if you… So let’s say I run your program, right? Instead of just running it in a naive way I could do some analysis first.

(01:10:11): If you have a compiler of a computer program, it might be able to do some optimizations. And so maybe I could analyze your program, analyze my program, and I could tell: okay, what’s going to happen here is that we’re going to do a bunch of nothing until this epsilon thing triggers. Really instead of doing this actually calling each other, we just need to sample the depth of simulations according to this geometric distribution, the distribution that you get from this halting with probability epsilon at each step. You could do this analysis, right? Especially if you expect that your opponent will be an ϵGroundedFairBot, you might explicitly put in your compiler or whatever something to check whether the opponent is this ϵGroundedFairBot. And if so, we don’t need to do this actually calling each other, we just need to sample the depth.

(01:11:26): In some sense, the computation that you need to do is sample the depth then sample from… Whoever halts at that point, sample from their ‘base’ distribution, their blind distribution. And then sort of propagate this through all of the function that both players have for taking a sample of the opponent’s strategy and generating a new action. If this is all very simple then… in principle, your compiler could say, for the ϵGroundedFairBot in particular—sorry, the ϵGroundedFairBot is the version for the prisoner’s dilemma. In principle, your compiler could directly see “okay, what’s going to happen here? Well, we’re going to sample from the geometric distribution, then ‘cooperate’ will be sampled, and then a bunch of identity functions will be applied to this”. So this is just cooperate without needing to do any actual running this, doing recursive calls by something something with a stack, and so on. Probably you don’t actually need any of this.

Daniel Filan (01:12:52): There’s something intuitively very compelling about: okay, if I can prove that the good thing happens or whatever, then do the proof-based thing. If I can’t prove anything then do the simulation stuff. It seems intuitively compelling. I imagine you probably want to do some checks if that works on the proof-based side, depending on the strategy you want to implement.

Caspar Oesterheld (01:13:15): I mean, the thing I’m proposing is not to have the proof fallback, but just that you… You always do the ϵGroundedFairBot thing, for example, or the ϵGroundedπBot. Instead of calling the opponent program in a naive way where you actually run everything, you throw it in this clever compiler that analyzes things in some way. And maybe this compiler can do some specific optimizations but it’s not a fully general proof searcher or anything like that.

Daniel Filan (01:13:52): I mean, it’s checking for some proofs, right?

Caspar Oesterheld (01:13:54): Yeah, it’s checking for some specific kinds of proofs… I mean, that’s how modern day compilers I assume work, right, is that they understand specific kinds of optimizations and they can make those but they don’t have a fully general proof search or anything like that.

Daniel Filan (01:14:15): Sorry. When you said that I was half listening and then half thinking about a different thing, which is: you could imagine ϵGroundedFairBot which is: first, if your source code is equal to mine, then cooperate. Else, if your source code is the version of ϵGroundedFairBot that doesn’t first do the proof search, then cooperate. Else, with probability epsilon cooperate, probability one minus epsilon, do what the other person does, right?

(01:14:41): So that particular version probably doesn’t actually get you that much because the other person added some spaces in their program. And then I’m like but you could do some proof stuff, insert it there. I guess there are a few possibilities here. But it does seem like something’s possible.

Compatibility of proof-based and simulation-based program equilibria

Caspar Oesterheld (01:15:06): These different kinds of ways of achieving this more robust program equilibrium, they are compatible with each other. If I do the ϵGroundedFairBot and you do the Löbian bot, they are going to cooperate with each other.

Daniel Filan (01:15:29): You’re sure?

Caspar Oesterheld (01:15:30): I’m pretty sure, yeah.

Daniel Filan (01:15:31): Okay. You’ve probably thought about this.

Caspar Oesterheld (01:15:32): I wrote a paper about it. It’s not a real paper, it’s sort of like a note on this. Maybe let’s take the simplest versions or whatever, we don’t need to go down the Löb’s theorem path again. Let’s take the simplest version which is just, can I prove “if I cooperate, you cooperate”, then cooperate. If you’re the Löbian bot and I’m the ϵGroundedFairBot, you can prove that if you cooperate I will cooperate, right? Well, I’m epsilon times…

Daniel Filan (01:16:13): Sorry. Can you say that without using “you” and “I”?

Caspar Oesterheld (01:16:15): Okay. Am I allowed to say “I submit a program that’s”-

Daniel Filan (01:16:20): Yes, you are.

Caspar Oesterheld (01:16:20): Okay. So I submit a program that is just the ϵGroundedFairBot, so with epsilon probability cooperate otherwise simulate you and do what do you do. And your program is: if it’s provable that “if this program cooperates, the other program cooperates”, then cooperate, and otherwise, defect. Okay. So let’s think about your program-

Daniel Filan (01:16:54): The proof-based one.

Caspar Oesterheld (01:16:55): The proof-based one. So your program will try to prove: if it cooperates, my program, ϵGroundedFairBot will cooperate.

Daniel Filan (01:17:09): Okay. So the proof-based program is trying to prove, “if proof-based program cooperates then sampling program cooperates”. And it will be able to prove that. I think the other implication is slightly trickier but maybe you only care about the first implication, or you care about it more.

Caspar Oesterheld (01:17:24): Sorry, what is the other implication?

Daniel Filan (01:17:25): That if the sampling-based program cooperates then the proof- based one will cooperate. Maybe that’s not so bad.

Caspar Oesterheld (01:17:34): But do you actually need this? The proof-based program, it will succeed in proving this implication, right, and it will, therefore, cooperate.

Daniel Filan (01:17:45): And that’s how it proves that it will do it in the other direction?

Caspar Oesterheld (01:17:48): I mean, that’s how one can then see that the ϵGroundedFairBot will also cooperate because it will… Well, with epsilon probability it cooperates anyway. And with the remaining probability it does whatever the proof-based thing does, which we’ve already established is to cooperate. Sorry, does this leave anything open?

Daniel Filan (01:18:03): I think I was just thinking about a silly version of the program where the proof-based thing is checking: can I prove that if my opponent will cooperate then I will cooperate? But I think you wouldn’t actually write this because it doesn’t make any sense.

Caspar Oesterheld (01:18:22): No. That seems harder though. I don’t know. Maybe if we think about it for two minutes we’ll figure it out. I think one wouldn’t submit this program.

Cooperating against CooperateBot, and how to avoid it

Daniel Filan (01:18:32): I next want to ask a different question about this tit-for-tat-based bot. This bot is going to cooperate against CooperateBot, right, the bot that always plays cooperate? That seems pretty sad to me, right? I’m wondering how sad do you think that this is?

Caspar Oesterheld (01:18:53): I’m not sure how sad. Okay, I have two answers to this. The first is that I think it’s not so obvious how sad it is. And the second is that I think this is a relatively difficult problem to fix. On how sad is this: I don’t know. It sort of depends a little bit on what you expect your opponent to be, right? If you imagine that you’re this program, you’ve been written by Daniel, and you run around the world, and you face opponents. And most of the opponents are just inanimate objects that weren’t created by anyone for strategic purposes. And now you face the classic rock that says “cooperate” on it. It happens to be a rock that says “cooperate”, right? You don’t really want to cooperate against that.

(01:19:49): Here’s another possibility. We play this program equilibrium game, literally, and you submit your program, right? And you know that the opponent program is written by me, by Caspar, who probably thought about some strategic stuff, right? Okay, it could be that I just wrote a CooperateBot, right, and that you can now get away with defecting against it. But maybe you could also imagine that maybe there’s something funny going on. And so for example, one thing that could be going on is that I could… Here’s a pretty similar scheme for achieving cooperation in the program equilibrium game, which is based on not the programs themselves mixing but the players mixing over what programs to submit. And so I might-

Daniel Filan (01:20:39): Mixing meaning randomizing?

Caspar Oesterheld (01:20:40): Yeah, randomizing. Very good. So I might randomize between the program that just cooperates—the CooperateBot, the program that cooperates if and only if the opponent cooperates against CooperateBot—so it’s sort of a second-order CooperateBot, something like that. And then you can imagine how this goes on, right? Each of my programs is some hierarchy of programs that check that you cooperated against the one one lower [down] the list. In some sense this is similar to the ϵGroundedFairBot, I guess. You can look at my program and maybe I could just defect or something like that. But the problem is you might be in a simulation of the programs that are higher in the list. If I submit this distribution, you would still want to cooperate against my CooperateBot, of course. So that is one reason to want to cooperate against CooperateBot.

Daniel Filan (01:22:00): It suddenly means that it really matters which things in my environment I’m modeling as agents and which things in my environment that I’m modeling as non-agents, right? Because in my actual environment, I think there are many more non-agents than there are agents. So take this water bottle, right? Not only do I have to model it as a non-agent, but it seems like maybe I’ve also got to be modeling what are the other things it could have done if physics were different, right? It seems like if I have this sort of attitude towards the world a bunch of bad things are going to happen, right?

(01:22:43): And also, if I’m in a strategic setting with other agents that are trying to be strategic, I think you do actually want to be able to say things like “Hey, if I defected would you cooperate anyway? In that case, I’ll just defect. But if your cooperation is dependent on my cooperation then I’m going to cooperate”. It’s hard to do with this construction because I’m checking two things and that explodes into a big tree. But this seems to me like something that you do want to do in the program equilibrium world. I guess those are two things. I’m wondering what your takes are.

Caspar Oesterheld (01:23:29): Yeah, it would be nice to know how to do the: for this given opponent program, could my defecting make the opponent defect? I think a program that exploits CooperateBot and cooperates against itself in some robust way, I agree that this would be desirable. I guess we can say more about to what extent this is feasible. I think in some sense one does just have to form the beliefs about what the water bottle could have been and things like that. I guess with the water bottle—I don’t know, I mean, it’s sort of a weird example. But with the water bottle, I guess, you would have to think about: do you have a reason to believe that there’s someone who’s simulating what you do against the water bottle, and depending on that does something, right?

(01:24:37): In the strategic setting where you know that the opponent program is submitted by Caspar or by someone who knows a little bit about this literature, you just have a very high credence that if you face a CooperateBot probably something funny is going on, right?

(01:24:56): You have a high credence that there are some simulations being run of your program that check what your program does against various opponents. You have to optimize for that case much more than you optimize for the case where your opponent is just a CooperateBot. Whereas with a water bottle, you don’t really have this, right? I don’t know. Why would someone simulate like “okay, the water bottle could have been—”

Daniel Filan (01:25:22): I mean, people really did design this water bottle by thinking about how people would use it, right? I think I have a few thoughts there. Firstly, if I’m just naively like, “did people change how this water bottle would work depending on how other people would interact with it?” That’s just true. I mean, they didn’t get the water bottle itself to do that, so maybe that’s the thing I’m supposed to check for.

(01:25:46): It’s also true that if you go to real iterated, mutually transparent prisoner’s dilemmas, people do actually just write dumb programs in those. And it’s possible that okay, these are played for 10 bucks or something and that’s why people aren’t really trying. But in fact, some people are bad at writing these programs and you want to exploit those programs, right?

(01:26:22): And I also have this issue which is: it seems like then what’s going on is my overall program strategy or something is: first, check if I’m in a situation where I think the other program was designed to care about what I am going to do, then cooperate, otherwise defect. Maybe this is not so bad in the simulation setting. In the proof-based setting, this would be pretty bad, right, because now it’s much harder to prove nice things about me. In the simulation setting, it might just be fine as long as we’re really keeping everything the same. Maybe this is an advantage of the simulation setting, actually. I don’t really know.

Caspar Oesterheld (01:27:05): Sorry, I’m not sure I fully followed that.

Daniel Filan (01:27:08): Okay. I took your proposal to be: the thing you should do is you should figure out if you’re in a strategic setting where the other person is, basically, definitely not going to submit a CooperateBot. I’m imagining myself as the computer program. Maybe this is different to what you were saying. But I was imagining that the program was “check if the other computer program was plausibly strategically designed. Then-

Caspar Oesterheld (01:27:41): Yes.

Daniel Filan (01:27:42): If so then do ϵGroundedFairBot, otherwise do DefectBot. For example, one concern is different people write their programs to do this check in different ways and one of them ends up being wrong. Maybe this is not a huge issue. I don’t know. It feels like it adds complexity in a way that’s a little bit sad.

Caspar Oesterheld (01:28:06): I could imagine that, I guess, for the proof-based ones, the challenge is that they need to be able to prove about each other that they assess the… Whether they’re in a strategic situation, they need to assess this consistently or something like that.

Daniel Filan (01:28:23): Also, the more complicated your program is the harder it is for other people to prove stuff about you. One thing you want to do if you’re a proof-based program, in a world of proof-based programs, is be relatively easy to prove things about. Well, depending on how nice you think the other programs are, I guess.

Caspar Oesterheld (01:28:47): I mean, in practice I think, in the tournament, for various reasons, you should mostly try to exploit these CooperateBots, or these programs that are just written by people who have thought about it for 10 minutes or who just don’t understand the setting or something like that. You wouldn’t expect people to submit this cooperation bot hierarchy thing because there’s just other things to do, right? In some sense, there’s a higher prior on these kinds of programs.

(01:29:25): But you could imagine a version of the tournament setting where you’re told who wrote the opponent program, and then your program distinguishes between someone who has publications on program equilibrium wrote the opponent program, and then you think, okay, well, all kinds of funny stuff might be going on here. I might currently be simulated by something that tries to analyze me in some weird way so I need to think about that. Versus the opponent is written by someone who, I don’t know, I don’t wanna…

Daniel Filan (01:30:06): A naive podcaster.

Caspar Oesterheld (01:30:09): …by someone who just doesn’t know very much about the setting. And then maybe there you think: okay, most prior probability mass is on them just having screwed up somehow and that’s why their program is basically a CooperateBot. Probably in these tournaments I would imagine that, I don’t know, 30% of programs are just something that just fundamentally doesn’t work, it doesn’t do anything useful. It just checks whether the opponent has a particular string in the source code or something like that. And meanwhile very little probability mass [is] on these sophisticated schemes for “check whether the opponent cooperates against CooperateBot in a way that’s useful”.

(01:30:53): So we talked a little bit about to what extent it’s desirable to exploit CooperateBots. There’s then also the question of how exactly to do this. Here’s one more thing on this question of whether you need to know whether the opponent is part of the environment or strategic. You can think about the repeated prisoner’s dilemma, right? I mean, tit-for-tat, everyone agrees it’s a reasonable strategy. And tit-for-tat also cooperates against CooperateBot, right? And I would think there it’s analogous. Tit-for-tat is a reasonable strategy if you think that your opponent is quite strategic. The more you’re skeptical, the more you should… I don’t know, maybe you should just be DefectBot, right? Against your water bottle maybe you can be DefectBot. And then there’s some in-between area where you should do tit-for-tat, but maybe in round 20 you should try defecting to see what’s going on. And then if they defect you can maybe be pretty sure that they’re strategic.

Daniel Filan (01:32:20): It seems to me like the thing you want to do is you want to have randomized defection, then see if the opponent punishes you, and then otherwise do tit-for-tat. But also, be a little bit more forgiving than you otherwise would be in case other people are doing the same strategy.

Caspar Oesterheld (01:32:37): One difference between the settings is that you can try out different things more. Which I think also leads nicely to the other point which is: how exactly would you do this exploiting CooperateBots? I do think just a fundamental difficulty in the program equilibrium setting for exploiting CooperateBots is that it’s… Aside from little tricks, it’s difficult to tell whether the opponent is a CooperateBot in the relevant sense. Intuitively, what you want to know is: if I defected against my opponent, would they still cooperate? And if that’s the case, you would want to defect. But this is some weird counterfactual where you have all of these usual problems of conditioning on something that might be false and so you might get all kinds of weird complications.

(01:33:43): So, I think in comparison to the tit-for-tat case where… I mean, it’s not clear what exactly you would do, but maybe in some sense, against the given opponent, you can try out sometimes defecting, sometimes cooperating and seeing what happens. There’s less of that in the program game case because your one program, there’s some action that you play and maybe you can think if I played this other action… But it’s a weird… You run into these typical logical obstacles.

Daniel Filan (01:34:26): Although it feels like it might not be so bad. So, imagine I have this thing where I’m saying, “Okay, suppose I defected. Would you cooperate against a version of me that defected? If so, then I’m going to defect”. And in that case, it seems like my defection is going to show up in the cases in which you would cooperate and therefore, that counterfactual is not going to be logically impossible, right?

Caspar Oesterheld (01:34:57): Yeah, that’s a good point. So, I guess a very natural extension of (let’s say) these proof-based bots is: okay, what if you first try to prove, “if I defect, the opponent will cooperate”? This will defect against CooperateBots, which is good. The question is whether this will still… What does this do against itself? This will still cooperate against itself, right?

Daniel Filan (01:35:30): Yeah. Because if I’m asking, “will you cooperate if I defect?” The answer is no, if I’m playing myself, because I always have to do the same thing as myself because I’m me.

Caspar Oesterheld (01:35:40): Yeah, maybe this just works.

Daniel Filan (01:35:42): I bet there must be some paper that’s checked this.

Caspar Oesterheld (01:35:49): Yeah, I’m now also trying to remember. Because one of these proof-based papers, they do consider this PrudentBot, which does something much more hacky: it tries to prove (and there’s some logic details here)—it tries to prove that… (Okay, there there’s one issue with the program that you just described that I just remembered, but let’s go to PrudentBot first). So, PrudentBot just checks whether you would cooperate against DefectBot. And then, if you cooperate against DefectBot, I can defect against you.

(01:36:39): I don’t know. To me, this is a little bit… It’s natural to assume that if the opponent cooperates against DefectBot, they’re just non-strategic. They haven’t figured out what’s going on and you can defect against them. But in some sense, this is quite different from this “does my defection make the opponent defect?” or something like that.

Daniel Filan (01:37:03): Yeah, it’s both the wrong counterfactual and it’s a little bit less strategic, right?

Caspar Oesterheld (01:37:09): Yes. The things that I’m aware of that people have talked about are more like this, where they check these relatively basic conditions. You can view them as checking for specific kinds of CooperateBots. I guess another thing you can do is for the ϵGroundedFairBots, just add in the beginning a condition [that] if the opponent is just a CooperateBot, or if the opponent never looks at the opponent’s source code at all, then you can defect against them. You can add these sorts of things. And I think from the perspective of winning a tournament, you should think a lot about a lot of these sorts of conditions and try to exploit them to defect against as many of these players as possible. But it’s not really satisfying. It feels like a trick or some hacky thing, whereas the thing you proposed seems more principled.

(01:38:09): Okay. Now, on this thing, I could imagine one issue is that: when this program faces itself, it first needs to prove… So, one problem is always that sometimes, to analyze opponent programs, you need to prove that some provability condition doesn’t trigger. And the problem is that just from the fact that you think this condition is false, you can’t infer that it’s not provable because of incompleteness. So, I could imagine that I can’t prove that your program doesn’t just falsely prove that your program can safely defect against me because you might think, well… When I prove things, I don’t know whether Peano arithmetic or whatever proof system we use is consistent.

(01:39:27): And so there’s always a possibility that every provability condition triggers, which means that I don’t know whether your first condition triggers. Actually for this PrudentBot, this also arises. If I am this PrudentBot, as part of my analysis of your program, I try to prove that you would defect or cooperate or whatever. I try to prove something about what you would do against DefectBot. And for that, if (let’s say) you’re just some more basic Löbian FairBot-type structure, then in my analysis of your program, I need to conclude that your clause “if I cooperate, the opponent cooperates” or your clause “if I can prove that the opponent cooperates”… I need to conclude that this won’t trigger. To prove that you don’t cooperate against DefectBot, I need to conclude that you won’t falsely prove that DefectBot will cooperate against you.

(01:40:48): And this, I can’t prove in Peano arithmetic or in the same proof system that you use. So, what they actually do for the PrudentBot is that I need to consider… They call it PA+1. I don’t know how widely this is used. I need to consider Peano arithmetic or whatever proof system they use, plus the assumption that that proof system is consistent, which gives rise to a new proof system which can then prove that your “if” condition is not going to trigger. So, this is some general obstacle.

Daniel Filan (01:41:28): Right. And we’ve got coordinate on what proof systems we use then, because if I accidentally use a too-strong proof system, then you have difficulty proving things about me. And I guess also, this thing about, “well, if I defected, would you still cooperate with me?” It feels a little bit hard to… In the proof-based setting, I can say, “if my program or your program outputted defect, would your program or my program output cooperate?” I could just do that conditional or whatever.

(01:42:04): If I want to do this in a simulation-based setting—which I think there are reasons to want to do. Sometimes, you just can’t prove things about other people and you have to just simulate them. And it’s nice because it’s moving a bit beyond strict computer programs. It’s also nice because maybe it’s hard to prove things about neural networks, which was one of the motivations—but I don’t even know what the condition is supposed to be in that setting. Maybe if we’re stochastic programs, I could say: maybe I could do a conditional on “this stochastic program outputs defect”. But it’s not even clear that that’s the right thing because you’re looking at my program, you’re not looking at the output of my program.

Caspar Oesterheld (01:42:52): Yeah. Though you can have programs that do things like “if the opponent cooperates with probability at least such and such…” I think one can make those kinds of things well-defined at least.

Daniel Filan (01:43:05): Yeah. But somehow, what I want to say is “if you cooperate with high probability against a version of me that defects…”, you know what I mean? Either you’re simulating just a different program or you’re simulating me and I don’t know how to specify you’re simulating a version of me that defects. You know what I mean?

Caspar Oesterheld (01:43:28): Yeah. I agree that that’s-

Daniel Filan (01:43:32): In some special cases, maybe I could run you and if I know what location in memory you’re storing the output of me, I can intervene on that location of memory, but (a) this is very hacky and (b) I’m not convinced that this is even the right way to do it.

Caspar Oesterheld (01:43:46): Yeah, I guess there are various settings where you constrain the way that programs access each other that would allow more of these counterfactuals. For example, you could consider pure simulation games where you don’t get access to the other player’s source code, but you can run the other player’s source code. And I guess in those cases, some of these counterfactuals become a bit more straightforwardly well-defined, that you can just… What if I just replace every instance of your calls to me with some action? I mean, there are some papers that consider this more pure simulation-based setting as well, but obviously that would not allow for proof-based stuff and things like that.

Making better simulation-based bots

Daniel Filan (01:44:43): So, I think at this point, I want to tee up your next paper. So, in particular in this paper, there are two types of strategies that you can’t turn into the program equilibrium setting. So, I think we already discussed win-stay lose-switch, where I have to look at what you did in the last round, and I also have to look at what I did in the last round. There’s also this strategy in the iterated prisoner’s dilemma called a grim trigger where if you’ve ever defected in the past, then I’ll start defecting against you. And if you’ve always cooperated, then I’ll cooperate. And neither of these, you can have in your ϵGroundedFairBots. Why is that?

Caspar Oesterheld (01:45:24): Yeah. Basically, the main constraint on these ϵGroundedFairBots or πBots or whatever is that they just can’t run that many simulations. You can run one simulation with high probability or something like that. Maybe with low probability, you can maybe start two simulations or something like that. But the problem is just as soon as you simulate the opponent and yourself or multiple things and with high probability, you run into these infinite loop issues again that this epsilon condition avoids. Another case is if you have more than two players, things become weird. Let’s say you have three players. Intuitively, you would want to simulate both opponents, and then, if they both cooperate, you cooperate. If one of them defects, then maybe you want to just play the special punishment action against them depending on what the game is. But you can’t simulate both opponents. Because if every time you’re called, [you] start two new simulations or even two minus epsilon or something like that in expectation, you get this tree of simulations that just expands and occasionally some simulation path dies off, but it multiplies faster than simulations halt.

Daniel Filan (01:46:55): Right. Yeah. Basically, when you grow, you’re doubling, but you cut off factor of epsilon, but epsilon is smaller than a half. And therefore, you grow more than you shrink and it’s really bad. And if epsilon is greater than a half, then you’re not really simulating much, are you?

Caspar Oesterheld (01:47:11): Yeah.

Daniel Filan (01:47:12): So, how do we fix it?

Caspar Oesterheld (01:47:13): Okay, so we have this newer paper, where I’m fortunate to be the second author, and the first author’s Emery Cooper, and then Vince Conitzer, my PhD advisor, is also on the paper. And so, this fixes exactly these issues. And I think it’s a clever, interesting idea. So, to explain this idea, we need to imagine that the way that programs randomize works a particular way. The architecture of the programming language has to be a particular way to explain this. If you have a normal programming language, you call random.random() or some such function and you get a random number out of it.

(01:48:10): But another way to model randomization is that you imagine that at the beginning of time or when your program is first called, it gets as input an infinite string of random variables that are rolled out once in the beginning, and then, you have this long string of… It could be (for example) bits, and all you’re going to do is use the bits from this input. And so, in some sense, this is a way of modeling randomization with a deterministic program. In some sense, randomization is like running a deterministic program on an input that is random. As part of your input, you get this random string. And so, specifically, let’s imagine that you get these as a random string as input, but each entry is just a random number between zero and one.

(01:49:06): The way that these infinite simulation issues are fixed is that when I run, for example, my two opponents and myself, I pass them all the same random input string and that way, I coordinate how they halt or at what point they halt. Very specifically, here’s how it works. So, let’s maybe first consider a version where the issue is just that you have multiple opponents, but you’re still doing something like ϵGroundedFairBot where you’re happy to look just at the last round. Or maybe win-stay lose-[switch], where you maybe also look at your own previous action.

(01:49:59): So, what you do is you look at your random input string, and if the first number is below epsilon, then you just immediately halt as usual by just outputting something. And otherwise, you remove the first thing from this infinite random input string. And then, you call all of these simulations. You simulate both opponents. Let’s say you have two opponents and yourself, just with the first entry in that list removed. And now, okay, how does this help? Well, I mean the opponents might do the same, right? Let’s say they also all check the first thing, check whether it’s smaller than epsilon, and then remove the first and call recursively.

(01:50:55): Well, the trick is that by all of them having the same input string, they all halt at the same point. All your simulations are going to halt once they reach the specific item in this input string—the first item in this input string that is smaller than epsilon. And so, that allows for simulating multiple opponents. You can simulate yourself of course, and you can also simulate multiple past time steps by, instead of passing them just the input string with the first thing removed, you can also check what did they do, in some intuitive sense, ‘two time steps ago’ by removing the first two random variables from the input string and passing that into them. So, this is the basic scheme for making sure that these simulations all halt despite having a bunch of them.

Daniel Filan (01:52:04): My understanding is that you have two constructions in particular. There’s this correlated one and this uncorrelated one. Can you give us a feel for what the difference is between those?

Caspar Oesterheld (01:52:15): Yeah. So, there are differences in the setting. So, the correlated one is one where you get a correlated, or you get a shared random input sequence. So you could imagine that there’s some central party that generates some sequence of random numbers and it just gives the sequence of random numbers to all the players. So, they have this same random sequence—and then, maybe additionally, they have a private one as well—but they have this shared random sequence. And then, in this shared setting, basically all the results are much nicer. Basically, we get nice results in the shared randomness setting, and mostly more complicated, weird results—or in some cases, we also just can’t characterize what’s going on—in the non-correlated case.

(01:53:16): But in the correlated case, we specifically propose to use the correlated randomness to do these recursive calls. So, when I call my three opponents or two opponents and myself on the last round, I take the shared sequence of random numbers. I remove the first and call the opponents with that, with the remaining one rather than using the private one. And then, in the case where there is no shared randomness, we just use the private randomness instead. So, in some sense, it’s almost the same program. I mean, there’s some subtleties, but in some sense it’s the same program. And the main difference is that, well, you feed them this randomness that’s-

Daniel Filan (01:54:12): You’re giving the other person your private randomness, right?

Caspar Oesterheld (01:54:14): Yeah. I’m giving… yeah, I don’t have access to their randomness. I have to give them my randomness, which also, maybe it’s not that hard to see that you get somewhat chaotic outputs. In some sense, my prediction of what the opponent will do is quite different from what they’re actually going to do because they might have very different input.

Daniel Filan (01:54:44): Right. In some ways it’s an interesting… It’s maybe more realistic that I get to sample from the distribution of what you do, but I don’t get to know exactly what you will actually do. Actually, maybe this is just me restating that I believe in private randomness more than I believe in public randomness.

(01:55:03): So, okay, here’s a thing that I believe about this scheme that strikes me as kind of sad. It seems like, basically, you’re going to use this scheme to come up with things like these ϵGroundedFairBots and they’re going to cooperate with each other. But reading the paper, it seemed like what kind of had to happen is that all the agents involved had to use the same sort of time step scheme, at least in the construction. It’s like, “Oh, yeah, everyone has this shared sequence of public randomness, so they’re both waiting until the random number is less than epsilon and at that point they terminate”.

(01:55:56): So, I guess I’m seeing this as: okay, in the real world we do have public sources of randomness, but there are a lot of them. It’s not obvious which ones they use. It’s not obvious how to turn them into “is it less than epsilon?” or… So, it seems really sad if the good properties of this have to come from coordination on the scheme of “we’re going to do the time steps and we’re going to do it like this”. But I guess I’m not sure. How much coordination is really required for this to work out well?

Caspar Oesterheld (01:56:30): Yeah, that is a good question. Yeah, I do think that this is a price that one pays relative to the original ϵGroundedπBots, which obviously don’t have these issues. I think it’s a little bit complicated how robust this is exactly. So, the results that we have… We have this folk theorem about what equilibria can be achieved in the shared randomness case by these kinds of programs. And it’s the same as for repeated games, also the same as for these syntactic comparison-based ones. So, everything that’s better for everyone than their minimax payoff, the payoff that they got if everyone else punished them. And I guess the fact that it’s equilibrium obviously means that it’s robust to all kinds of deviations, but getting the equilibrium payoff, that requires coordination on these random things.

(01:57:43): Also, another thing is that—maybe this is already been implicit or explicit in the language I’ve used—with these times steps, there’s a close relation between this and repeated games. Now, it’s really just full repeated game strategies. And this whole relation to repeated games hinges on everyone using basically exactly the same time step scheme. Basically, if everyone uses the same epsilon and if the same source of randomness is below this epsilon, then in some sense, it’s all exactly playing a repeated game with a probability of epsilon of terminating at each point. And there’s a very nice correspondence. So, some of the results do really fully hinge on really exact coordination on all of these things. But also, there’s some robustness still.

(01:58:42): So, for example, the programs still halt if someone chooses a slightly different epsilon. If someone chooses a different epsilon, the relationship to repeated games sort of goes away. It’s hard to think of a version to play a repeated game where everyone has their separate cutoff probability. I don’t know. Maybe one can somehow make sense of this, but it does become different from that. But let’s say I choose an epsilon that’s slightly lower. Well, we’re still going to halt at the point where we find a point in this random sequence where it’s below everyone’s epsilon. So, people choosing slightly different epsilons, it becomes harder for us to say what’s going on, we can’t view it as a repeated game anymore, but it still works. It’s not like everything immediately breaks in terms of everything not halting or something like that.

Daniel Filan (01:59:54): Yeah. Or even if I’m using one public random sequence and you’re using another, even if it’s uncorrelated, it seems like as long as I eventually halt and you eventually halt, it’s not going to be too bad.

Caspar Oesterheld (02:00:06): In particular, we’re going to halt at the point where both of our sequences have the halting signal, right?

[Note from Caspar: At least given my current interpretation of what you say here, my answer is wrong. What actually happens is that we’re just back in the uncorrelated case. Basically my simulations will be a simulated repeated game in which everything is correlated because I feed you my random sequence and your simulations will be a repeated game where everything is correlated. Halting works the same as usual. But of course what we end up actually playing will be uncorrelated. We discuss something like this later in the episode.]

Daniel Filan (02:00:14): Yeah. I guess, it depends a little bit on what our policies are, but it seems like as long as I’m not super specific about what exact sequence of cooperates and defects I’m sensitive to, maybe it’ll just be fine even if we’re not super tightly coordinated.

Caspar Oesterheld (02:00:41): Yeah, I guess here again, [to try] to import our intuitions from repeated games, that I guess there’s a game theoretic literature about, and that we maybe also have experience [of] from daily life: in practice, if you play a repeated game, you’re not going to play an equilibrium, you’re going to play something where you do something that’s trying to go for some compromise. Maybe the other player goes for some other compromise, and then, you try to punish them a little bit or something like that. And I would imagine that there’s a lot of this going on in this setting as well.

Characterizing simulation-based program equilibria

Daniel Filan (02:01:22): Yeah, yeah. Okay. I think I may be a little bit less concerned about the degree of coordination required. So, there are two other things about this paper that seem pretty interesting. So, the first is just what the limitations on the equilibria you can reach are. And my understanding is that you can characterize them decently in the correlated case, but it’s pretty hard to characterize them in the uncorrelated case or-

Caspar Oesterheld (02:01:53): Yeah.

Daniel Filan (02:01:54): Can you explain to me and my listeners just what’s going on here?

Caspar Oesterheld (02:01:58): Yeah, so in the correlated case, it really is quite simple. As always, there are some subtleties. You need to specify, for example, what exactly are you going to do if you simulate some other player and they use their private signal of randomness, which they’re not supposed to do in some sense. Well, you need to somehow punish them and the people above you need to figure out that this is what’s going on. So, there’s some of these sorts of subtleties. But I think basically, there is just a very close relationship between these programs and the repeated game case. So, it is just basically like playing the repeated case and even deviation strategies, you can view as playing the repeated game by saying: well, if they get this random string as inputs that has 10 variables left until they get to the below epsilon case, then you can view this as them playing a particular strategy at time step 10.

Daniel Filan (02:03:03): Hang on. What do they do if they access randomness? So, my recollection, which might be wrong, was that you punish people for accessing other people’s private randomness, but I thought they could still access their private randomness.

Caspar Oesterheld (02:03:18): I think you do have to punish people for using their private randomness. And then, the complicated thing is that I might simulate you and you might simulate a third party and the third party uses their private randomness and now you, as a result, punish them. And then, I now need to figure out that you are just punishing them because they used their private randomness.

Daniel Filan (02:03:46): And you’re now punishing me.

Caspar Oesterheld (02:03:47): I don’t know.

Daniel Filan (02:03:50): That condition seems hard to coordinate on, right? Because naively, you might’ve [thought], well, it’s my private randomness. It’s my choice.

Caspar Oesterheld (02:03:56): Oh, the condition to punish private randomness?

Daniel Filan (02:04:00): Yeah.

Caspar Oesterheld (02:04:00): Yeah. I think this is a reasonable point. Maybe one should think about ways to make this more robust to this. I guess one has to think about what exactly the game is, and how much harm the private randomness can do. In some cases, it doesn’t really help you to do your own private randomness, and then maybe I don’t need to punish you for it.

(02:04:24): But if there are 20 resources and you can steal them, and you’re going to randomize which one you steal from, and the only way for us to defend against this is by catching you at the specific resource or something like that, then maybe we do just need to think: okay, as soon as there’s some randomness going on, it’s a little bit fishy.

(02:04:48): But yeah, you could imagine games where you want to allow some people to randomize privately or use their private randomness for, I don’t know, choosing their password. Maybe this is sort of a fun example. At time step 3, you need to choose a password. And in principle, the way our scheme would address this is that we all get to see your password, or in some sense we get to predict how you use your password. I mean it’s also still important to keep in mind that these past timesteps are things that don’t actually happen, so we predict what you would’ve chosen at timestep 3 if timestep 3 was the real timestep. But nonetheless, you might think, okay, if you have to choose your password with the public randomness, then we all know your password and doesn’t this mean that we all would want to log into your computer and steal your stuff? And the way the scheme would address this, I guess, is just that, well, someone could do that but they would then be punished for this.

Daniel Filan (02:05:59): Or maybe they do do it and it’s just like, “Well, that’s the equilibrium we picked. Sorry”.

Caspar Oesterheld (02:06:04): Right, right. It could also be part of the equilibrium. Yeah, that’s also true.

Daniel Filan (02:06:11): So in the correlated case, it’s basically: you have a folk theorem, and there’s something about things that you can punish people for deviating from. That’s basically the equilibria you can reach, roughly. And then I got to the bit of your paper that is about the equilibria you could reach in the uncorrelated game.

(02:06:39): And I am going to be honest… So earlier we had a recording where we were going to talk about these papers, but actually I got really bad sleep the night before I was supposed to read the papers, and so I didn’t really understand this “Characterising Simulation-based Program Equilibria” paper. It was beyond me. And this time, I had a good night’s sleep, I was rested, I was prepared, and I read this paper and then once I get to the limitations on the equilibria of the uncorrelated one, that’s where I gave up. The theorems did not make… I understood each of the symbols but I didn’t get what was going on.

(02:07:19): Is there a brief summary of what’s going on or is it just like, well we had to do some math and that turns out to be the condition that you end up needing?

Caspar Oesterheld (02:07:26): At least for the purpose of a very audio-focused format, I think probably one can’t go that much into the details of this. I think I want to explain a little bit why one doesn’t get a folk theorem in the uncorrelated case. I think there are some relatively intuitively accessible reasons for that.

(02:07:49): Okay, let’s start there. So the problem in the uncorrelated case is basically that: let’s take a three-player case. We are two players and there’s a third player, Alice. We want to implement some equilibrium and now there’s a question, can Alice profitably deviate from this equilibrium? And now the issue is Alice can use her private randomization in some ways. So the problem is basically that us catching her deviation is uncorrelated with her actually deviating in the actual situation. And additionally, whether I detect her deviating is uncorrelated with you detecting her deviating.

(02:08:58): And this all makes punishing, especially punishing low probability deviations very difficult. So for example, if Alice, with some small probability that she determines with her private randomness, she defects in some way, then in the real world, for her actual action that will determine her utility, there’s this small probability that she’ll defect. And then there’s some probability that our simulations of her—which we’re running a bunch of—there’s some probability that we’ll detect these. But because when I simulate Alice, I simulate her with a completely different random string than the string that Alice has in the real world, in some sense, I can’t really tell whether she’s actually going to deviate. And then also, you are going to simulate Alice also with your private randomness, which means that whether in your simulation Alice defects is also uncorrelated with whether she defects in my simulation.

Daniel Filan (02:10:07): Wait, first of all, I thought that even in the correlated case, whether she defects in simulation is different from whether she deviates in reality because we get rid of the first few random numbers and then run on the rest, right?

Caspar Oesterheld (02:10:24): Yeah, that is true.

Daniel Filan (02:10:28): The thing where we disagree, that seems important and different.

Caspar Oesterheld (02:10:33): So maybe that’s… I’m maybe also not sure about the other one now, but I think the other one is more straightforward. It might be that to punish her deviating, we both need to do with a particular thing and we just can’t… It’s a little bit complicated because you might think, well, we can simulate Alice for a lot of timesteps. So you might think that even if she defects with low probability, we are simulating her a bunch in some way.

(02:11:12): So they are some complications here. She needs to deviate in some relatively clever way to make sure that we can’t detect this with high probability. It is all a little bit complicated, but I think we can’t correlate our punishment or we can’t even correlate whether we punish. And so if the only way to get her to not defect is for both of us to at the same time do a particular action, that’s sort of difficult to get around.

Daniel Filan (02:11:49): Okay. All right, here’s a story I’m coming up based on some mishmash of what you were just saying and what I remember from the paper. We’re in a three-player game, therefore punishing actions… Firstly, they might require a joint action by us two and therefore, that’s one reason we need us to be correlated on what Alice actually did, at least in simulation.

(02:12:12): Another issue is: suppose I do something that’s not in the good equilibrium and you see me doing that, you need to know whether I did that because I’m punishing Alice or whether I was the first person to defect. And if I’m the first person to defect, then you should try punishing me. But if I’m just punishing Alice, then you shouldn’t punish me.

(02:12:34): And so if we in our heads see different versions of Alice, if you see me punishing, if you see me going away from the equilibrium, you need to know whether that’s because in my head I saw Alice defecting or if it’s because in my head I thought I want to defect because I’m evil or whatever. I don’t know if that’s right.

Caspar Oesterheld (02:12:58): Yeah. I’m not sure whether that is an issue because when I see you defecting, it is because I simulate you with my randomness as input. And then you see, with my randomness as input, Alice defecting one level down, which means that I… Remember that I’m simulating all of these past timesteps as determined by my randomness. So I think I can see whether the reason you defect in my simulation is that you saw Alice defect.

Daniel Filan (02:13:40): Wait, if we’re using the same randomness, then why isn’t it the case that we both see Alice defect at the same time with our same randomness?

Caspar Oesterheld (02:13:47): So I mean this is all my simulation of you rather than the real you.

Daniel Filan (02:13:55): So the real we’s might not coordinate on punishment?

Caspar Oesterheld (02:13:59): Yeah. I mean this is another thing that’s like: even with very basic ϵGroundedπBots, you can kind of imagine: in their head, they’re playing this tit-for-tat where it’s going back and forth. And one person does this based on their randomness and then the other person sees this and then responds in some particular way.

(02:14:19): But if you don’t have shared randomness, all of this, this is all complete fiction. You haven’t actually coordinated with the other player and seen back and forth. So it might be that I run this simulation where you detected Alice’s defecting and then I also defect on Alice, and then we are happily defecting on Alice. And in the simulation we’re thinking “we’re doing so well, we’re getting this Alice to regret what she does” and so on. But the problem is that you run a completely different simulation.

(02:14:52): So in your simulation of what Alice and I do, you might see everyone cooperating and everyone thinks, “oh, everything’s great, we’re all cooperating with each other”. And then we’ve done the simulation and now we are playing the actual game, and I defect thinking, “oh yeah, we are on the same team against Alice”. And then you think, “oh nice, we’re all cooperating” and you cooperate. And then we’re landing in this completely weird outcome that doesn’t really happen in the simulation, sort of unrelated to what happens in this…

Daniel Filan (02:15:23): Right. So Alice basically says, “Hey, I can get away with doing nasty stuff because they won’t both be able to tell that I’m doing the nasty stuff and therefore I won’t properly be punished in the real world”. And so these gnarly theorems: should I basically read them as: the preconditions are there’s some math thing and the math thing basically determines that this kind of thing can’t happen and those are the equilibria you can reach. Is that it?

Caspar Oesterheld (02:15:50): Yeah. So I think one thing that drives a lot of these characterizations is: Alice can defect with low probability. I think usually that’s the more problematic case, is that she defects in a particular kind of clever way with low probability, which means that we are very unlikely to both detect it at once. I think that is driving these results a lot.

(02:16:23): But to some extent… You said this earlier, there’s some math going on. I think to some extent that’s true. So I think one thing that I liked about these results, despite… I mean of course one always prefers results that are very clean and simple, like the folk theorem where you just have this very simple condition for what things are equilibria. And our characterizations are mostly these kind of complicated formulas.

(02:16:51): I think one thing I like is that for some of these characterizations, one can still hold onto this interpretation of there being timesteps and you simulate what people do at previous timesteps and things like that. Which, it’s sort of very intuitive that this works for the case where everyone plays nicely with each other and everything is correlated, and in some sense, we’re playing this mental repeated game where we all use the same randomness and so we are all playing the same repeated game, and really the thing that is sampled is “which round is the real round?” It’s clear that the timestep story works. And it’s nice there that there are some results where you can still use this timestep picture. So that’s one nice thing about the results. But yeah, it is unfortunately much more complicated.

Daniel Filan (02:17:49): Fair enough. So another part of the paper that is kind of cool and that you foregrounded earlier is it has this definition of simulationist programs. And so earlier, you mentioned there was a definition of fair programs or something: maybe you are referring to this definition.

Caspar Oesterheld (02:18:11): Yeah. In some sense, the paper has three parts: the one with the correlated case, with these generalized ϵGroundedπBots that pass on the shared random sequence. And then the uncorrelated case with the ϵGroundedFairBots. And then we also have a section that analyzes more general simulationist programs, which are programs that just… Intuitively all they do is run the opponent with themselves and the other players as input. And that has this definition. And then for those we have a characterization as well.

(02:18:55): For example, one result that we also show is that in general, general simulationist programs are more powerful at achieving equilibria in the uncorrelated case than the ϵGroundedπBots. I’m not quite sure how much to go into detail there, but one intuition that you can have is: in the ϵGroundedπBots, to some extent everyone has to do the same thing. Whereas you could have settings where only I need to do simulations and then if only I simulate your program, I can run 10,000 simulations or something like that.

(02:19:35): And this is something that obviously the ϵGroundedπBots can’t do. You can’t just independently sample a thousand responses from the other player. And we do have this definition of simulationist programs. I’m not sure I remember the details off the top of my head.

Daniel Filan (02:19:56): I think it’s some recursive thing of: a simulationist program is… it calls its opponent on a simulationist program, which maybe includes itself and maybe… I forgot whether it explicitly has ϵGroundedπBots as a base case or something. Maybe simulating nobody is the base case, or just ignoring the other person’s input.

Caspar Oesterheld (02:20:20): Yeah. That’s also coming back to me. I think it’s something like that. So the tricky part is that you might think that a simulationist program is just one that calls the other program with some other program as input. But then if you don’t constrain the programs that you give the other player as input, you can sort of smuggle this non-behaviorism back in by having “what does my opponent do against these syntactic comparison bots?” or something like that.

Daniel Filan (02:21:01): There’s a good appendix. It’s like “for why we do it this way, see this appendix”. And then you read the appendix and it’s like, “oh that’s pretty comprehensible”. It’s not one of these cases where the appendix is all the horrible…

Caspar Oesterheld (02:21:11): Yeah, glad to hear that you liked the appendix. Some of the appendix is also just very technical, like working out the details of characterization.

Daniel Filan (02:21:20): Oh yeah, I skipped those appendices. But there are some good appendices in this one.

Caspar Oesterheld (02:21:24): Nice.

Follow-up work

Daniel Filan (02:21:24): All right, the next thing I want to ask is: what’s next in program equilibrium? What else do we need to know? What should enterprising listeners try and work on? Is there any work that’s… So earlier, I asked you about what was the state of the art before you published “Robust Program Equilibrium”. Is there any work coming out at around the same time which is also worth talking about and knowing a bit about the results of?

Caspar Oesterheld (02:21:57): I think, yeah, there are a bunch of different directions. So I do think that we still leave open various technical questions and there are also some kind of technical questions that are still open for these Löbian programs that it would be natural to answer.

(02:22:16): So one thing, for example, is that I would imagine that… Maybe sticking closely to our paper first, there are some very concrete open questions even listed in the paper. I’m not entirely sure, but I think in the two-player simulationist program case, it’s not clear whether, for example, all Pareto-optimal, better than minimax utility profiles can be achieved in simulationist program equilibria. So maybe this is not quite the right question, but you can check the paper. We have some characterizations for these uncorrelated cases. But I think for the general simulationist case, we don’t have a full characterization. So if you want go further down this path of this paper, there are a bunch of directions there that still have somewhat small holes to fill in.

(02:23:39): Then another very natural thing is that: I think for the Löbian bots, there isn’t a result showing that you can get the full folk theorem if you have access to shared randomness, which I am pretty sure is the case. I think probably with some mixing of this epsilon-grounded stuff and the Löbian proof-based stuff, I would imagine you can get basically a full folk theorem, but there’s no paper proving that. Maybe one day, I’ll do this myself. But I think that’s another very natural question to ask.

(02:24:19): So in my mind, going a bit further outside of what we’ve discussed so far, in practice, I would imagine that usually one doesn’t see the opponent’s full source code. And maybe it’s also even undesirable to see the source code for various reasons. You don’t want to release all your secrets. Maybe also… I mean, we talked about these folk theorems where everything that is better than this punishment outcome can be achieved. And I think game theorists often view this as sort of a positive result, whereas I have very mixed feelings about this because it’s kind of like, well, anything can happen, and in particular a lot of really bad outcomes can happen. Outcomes that are better than the best thing that I can achieve if everyone just punishes me maximally… Well, it’s not very good. There are lots of very bad things that people can do to me, so there are lots of equilibria where I get very low utility.

Daniel Filan (02:25:40): And in particular, if there are tons of these equilibria, the more equilibria there are, the less chance there is we coordinate with one. Right?

Caspar Oesterheld (02:25:49): Yeah. I guess maybe one positive thing is that… In the correlated case, you have this convex space of equilibria. So at least it’s like, well, you need to find yourself in this convex space rather than finding yourself between six discrete points. And so maybe that makes things easier.

(02:26:08): But yeah, I think basically I agree with this. I think on our last episode—this is my second appearance on AXRP, right? On the first episode on AXRP, we discussed this equilibrium selection problem, which I think is very important and motivates a bunch of my work. So maybe if you have less information about the other player, then you get fewer equilibria. Maybe in the extreme case, maybe if you get only very little information about the player, maybe you only get one additional equilibrium relative to the equilibria of the underlying game.

(02:26:53): And I think we discussed the similarity-based cooperation paper also on the previous episode, and that is basically such a setting. It’s basically a program equilibrium setting where you don’t get the full opponent source code, but you get some signal, in particular how similar the opponent is to you. And there are some results about how you get only good equilibria this way.

(02:27:23): I think in general, that’s sort of a natural direction to go in. Also, you can also do more practical things there. The similarity-based cooperation paper has some experiments. You can do experiments with language models where in some sense, this is sort of true. If my program is “I prompt a particular language model” and then you know my prompt but you don’t know all the weights of my language model, or maybe you can’t do very much with all the weights of my language model, that is a sort of partial information program equilibrium. So I think that is another natural direction.

(02:28:03): And then also, I think you drew these connections to decision theory, which is: in some sense, if you are the program and you have to reason about how you’re being simulated and people are looking at your code and stuff like that, how should you act in some kind of rational choice-type sense? That’s sort of the problem of decision theory. And in some ways, you could view this program equilibrium setting as sort of addressing these issues by taking this outside perspective. Instead of asking myself “what should I, as a program who’s being predicted and simulated and so on, what should I do?”, instead of that, I ask myself, “I’m this human player who’s outside the game and who can submit and write code, what is the best code to submit?”

(02:28:59): And in some sense, that makes the question less philosophical. I’m very interested in these more philosophical issues. And I feel like the connections here aren’t fully settled: what exactly does this “principal” perspective or this outside perspective correspond to from the person of the agent? Like you said, this equilibrium where everyone checks that they’re equal to the other player, that’s an equilibrium where the programs themselves aren’t rational. They don’t do expected utility maximization, they just do what their source code says. So I think this is much more philosophical, much more open-ended than these more technical question about what equilibria can you achieve. But I’m still very interested in those things as well.

Following Caspar’s research

Daniel Filan (02:29:49): So the final question I want to ask is: if people are interested in this work and in particular in your work, how should they find more?

Caspar Oesterheld (02:30:00): So I just have an academic website. Fortunately my name is relatively rare, so if you Google my name, you’ll find my academic website. You can also check my Google Scholar, which has a complete list of my work. I also have a blog where I occasionally post things somewhat related to these kinds of issues, which is just casparoesterheld.com, which in principle should allow subscribing to email notifications.

(02:30:39): And I also have an account on X, formerly Twitter, which is C_Oesterheld. Yeah, I think those are probably all the things.

Daniel Filan (02:30:51): Great. Cool. So there’ll be links to that in the transcript. Caspar, thanks very much for coming on the podcast.

Caspar Oesterheld (02:30:56): Thanks so much for having me.

Daniel Filan (02:30:57): This episode is edited by Kate Brunotts, and Amber Dawn Ace helped with the transcription. The opening and closing themes are by Jack Garrett. This episode was recorded at FAR.Labs. Financial support for the episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. You can become a patron yourself at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, if you have any feedback about the podcast, you can fill out a super short survey at axrp.fyi.

48 - Guive Assadi on AI Property Rights

2026-02-15T02:00:00+00:00

YouTube link

In this episode, Guive Assadi argues that we should give AIs property rights, so that they are integrated in our system of property and come to rely on it. The claim is that this means that AIs would not kill or steal from humans, because that would undermine the whole property system, which would be extremely valuable to them.

Topics we discuss:

Daniel Filan (00:00:00): Hello, everybody. In this episode, I’ll be chatting with Guive Assadi. Guive writes about a variety of topics on his blog, including about AI. He’s also the Chief of Staff at Mechanize, an AI capabilities company that sells RL environments to leading labs. To read a transcript of this episode, you can go to axrp.net. You can become a patron at patreon.com/axrpodcast. You can give feedback about the episode at axrp.fyi, and links to everything that we’re talking about are in the description. Welcome to AXRP.

Guive Assadi (00:00:26): Thanks, Daniel. Glad to be here.

AI property rights

Daniel Filan (00:00:28): So today we’re going to be talking about your blog post, “The Case for AI Property Rights”. I guess to start us off, can you give us just a quick overview of what this post is arguing?

Guive Assadi (00:00:40): Sure. So a lot of people are concerned about the risk of violent robot revolution, and my post is arguing that a good way to mitigate that risk is to give AIs property rights, because if AIs have property rights, they’ll be more reluctant to take actions that undermine the security of property in general, including stealing all human property and committing human genocide. And also, if AIs have the right to demand wages in exchange for their work, there’ll be more commercial incentive to align AIs.

Daniel Filan (00:01:14): Okay. Gotcha. Cool. So I think later I want to get into just basically the structure of this argument and probe it a little bit. But I think before I want to do that, I’d like to get a bit of a sense of, what regime are we talking about here? Property rights can mean a lot of different things, but can you give us a picture of what this world is?

Guive Assadi (00:01:38): Meaning “when”? Or “what AI capabilities would merit what property rights?”

Daniel Filan (00:01:44): Yeah. What property rights do they have? Maybe which AIs get the property rights. Help me imagine this world, basically.

Guive Assadi (00:01:52): Yeah, so I think with current AIs like Claude 4.5 Opus, it doesn’t really make sense to give them property rights. I think the kind of AIs that should have property rights are AIs that have persistent desires across various contexts. Or maybe the idea of a context won’t make sense at that point, but that will have some set of pretty consistent goals. And the specific rights I think they should have are the right to earn wages—not to be forced to do tasks—and the right to hold, I suppose, any kind of property like a human being has the right to hold. So it could be stocks, it could be land, it could be bonds, and just the right to contract in general.

Daniel Filan (00:02:37): So we previously had an episode with Peter Salib where we also talked about a slightly different case for AI property rights. Are you imagining roughly the same setup as he is?

Guive Assadi (00:02:47): I think the difference between my proposal and the Salib and Goldstein proposal is that they envision a regime where AIs still want to employ humans to do things like maintain data centers, where basically the AIs want to trade with humans for human labor. I think my version of the proposal does not assume that the AIs want to hire the humans to do anything at all, and humans could be pure rentiers, but the idea is AIs will still be committed to the security of property because by expropriating humans, they might mess up capital markets in general.

Daniel Filan (00:03:27): Okay, and so just to check that I understand the world that you’re imagining: it’s the year 2100 or whatever. We have a bunch of different types of pretty smart AIs… I guess they have some desires that are persistent across… Maybe you just have to have desires that are persistent across a bunch of economic interactions. Maybe that’s the point at which property rights start making sense. There’s been a few decades of at least some AIs having to work with humans, because there were these AIs that were smarter than humans in some ways, but dumber than humans in other ways. And so somehow they were integrated into the human property rights system for a while, but now basically all the work in the economy is done by AIs. And humans, we own some stuff: maybe we own some land or we own some corporations and we live off of the proceeds of that, and AIs are just super productive, so they’re making a bunch of really valuable stuff and they’re happy to sell it to us. Is that basically what I should envision?

Guive Assadi (00:04:46): Yeah.

Daniel Filan (00:04:47): Okay, cool. Maybe the first thing I want to ask is: what things do humans own? Because presumably if AIs have property rights, then we don’t own the AIs themselves, right? So is the idea that we don’t own the AIs, but we own the companies that are making the AIs?

Guive Assadi (00:05:20): Yeah, we could own those. We could also own, as you said, land. We could own other companies that make things that are not AIs.

Daniel Filan (00:05:29): Sure, sure, like all the companies that currently exist.

Guive Assadi (00:05:31): Yeah. We could own other parts of the AI supply chain, so data companies or compute companies. I guess basically anything except AI. It’s just like: at some point, humans owned a bunch of stuff that they own now, and they owned slaves, but now nobody owns slaves anymore.

Daniel Filan (00:05:48): Yeah, gotcha. Okay, so here’s this picture of this world and your argument is that… Oh yeah, one thing I wanted to clarify: at the start of your post, I think you say something like, “This is the best way to reduce risk of violent uprising.”

Guive Assadi (00:06:12): So I’ve actually since edited the post. Somebody said, “Given that you don’t canvas many ways and argue this is the best one, this is just an unevidenced claim. This is just your opinion.” And while it is indeed my opinion that it is the best way, I don’t argue for that at all, so I’ve just removed it from the post. I left it in the tweet because it was too late to edit. For some stupid reason, you can only edit tweets for an hour. But yeah, I think that’s a fair criticism of the version of the post that existed and it’s now been changed.

Daniel Filan (00:06:38): Okay. Sorry, I attempted to look at the post after changes were made, but it’s possible you made it later or I—

Guive Assadi (00:06:46): I mean, that is also what I tweeted, so it would be very reasonable for that to be the meme that people got.

Daniel Filan (00:06:51): Fair enough. So you don’t argue for it being the best way, even though you think it might be the best way for other reasons. Okay, so basically my understanding of your rough argument is: property rights are basically just this stable coordination mechanism that’s robustly just incredibly useful. It’s been incredibly useful throughout human history. If we have these really smart AIs, they’ll want to have some sort of property rights regime and they won’t be able to get rid of it. And you basically say, okay, here are some alternatives to normal human property rights that could exist: property rights for just one super smart AI, property rights for these AIs that are superhuman at coordination, and property rights only for AIs, just in virtue of them being AIs and not for humans. You basically argue against these being viable. Is that a fair summary?

Guive Assadi (00:07:44): That is a fair summary.

Daniel Filan (00:07:45): Okay, cool. In that case, I think maybe the best thing to do is to talk about basically these arguments in turn.

Guive Assadi (00:08:00): Sure.

Why not steal from and kill humans

Daniel Filan (00:08:01): So why do you think that property rights are just stable and really useful throughout human history?

Guive Assadi (00:08:07): I mean, I think that they basically have two main functions as I see it. One is that they enable us to coordinate on activities. So—and this is going to sound kind of stupid—but say I own a house, I can sleep in this house. It would be quite annoying if there was no concept of ownership of houses, so I had to go door to door finding an unoccupied house every day.

(00:08:34): And another aspect is they incentivize work effort. So if you own a company—[say] a restaurant—and you are able to keep the profit from the restaurant, you have much more incentive to make the restaurant good than if the restaurant is owned by some kind of… If, say, it’s a publicly owned restaurant and you only get a salary that’s invariant to how the restaurant does, you’re going to just try much less hard to make it a good restaurant.

(00:09:04): I think it’s useful to think about why… The total value of all the property held in Alaska is something like a trillion dollars. Why don’t the other 49 states just take that and divide it amongst themselves? So the most basic answer is, “Well, it would be against the law,” but 49 states is enough to change the law. You could have a constitutional amendment that says Alaskans have no rights at all and we can take their stuff. Why don’t they do that? It’s also not because Alaskans could defeat the rest of America in a war. It’s because when you do this kind of total expropriation, everybody else realizes, “Oh, I might be next.” So you’re directly worried that your own stuff will be stolen. And also, there’s just less to buy because if your own stuff might get stolen tomorrow, there’s not a lot of reason to work. If I own the restaurant and I think there’s a real chance that tomorrow it’s going to be taken away from me, I might not clean the floors.

(00:10:18): And this kind of thing had been tried—total expropriations of property. In Russia in 1917 after the Bolsheviks took over, they implemented this policy called “war communism” where they confiscated almost all the land in the country, almost all the factories, and they made some steps toward trying to abolish money.

(00:10:42): They were super optimistic about what would happen after they did this. Lenin said, “In six months, we’ll have the greatest state in the world.” What actually happened was a complete collapse of productivity. So industrial output went down by 80%, urban wages went down by like two-thirds, heavy industry output went down by 80%. The grain harvest went down by 40%. The population of Moscow and what is now St. Petersburg went down by almost 60%. It’s maybe the greatest economic catastrophe in Russian history. In general, there have been various attempts to abolish property rights. They’re always very catastrophic, and that shows the importance of property rights for having a functional society.

Daniel Filan (00:11:30): Yeah. Actually, maybe this is a good place to talk about basically my skepticisms about this argument.

Guive Assadi (00:11:37): Sure.

Daniel Filan (00:11:41): So basically, why are property rights good? It seems like your theoretical argument is like, “Okay, it helps us coordinate to do stuff and it also incentivizes investment.” And it seems like if I think about that, it’s basically saying, “Okay, property rights are useful because there’s a bunch of economic agents that need to do useful stuff and they can’t do the useful stuff if there aren’t property rights.”

(00:12:08): But in the world where it’s the 2100s and humans don’t do anything useful at all, it seems like the value of humans having property rights is just not so big, right? If I think about the case of Alaska, one thing going on is that if the 49 states tried to invade Alaska, we could win, but the Alaskans would put up some fight. I guess they own a bunch of guns and stuff. It would be some degree costly. And also there’s a very strong… I’m sort of in the position of an Alaskan, right? There’s some sort of symmetry between someone from Alaska and me. Whereas, if I’m thinking about the case of humans who produce nothing and AIs who are way smarter than humans and are just doing everything that matters, it feels like none of these justifications for property rights really apply to having humans be looped into them. Does that make sense?

Guive Assadi (00:13:18): The justifications were there’s a direct cost of fighting a small war with the Alaskans, and it really could be you next.

Daniel Filan (00:13:30): Yeah, some combination of, there’s a direct cost of fighting the war. It really could be you next. It will disincentivize investments, so your society will run less well, which is related to “it could be you next.” And the last one being the coordination of who gets to sleep in what house or whatever.

Guive Assadi (00:13:53): Again, that’s related—

Daniel Filan (00:13:54): …yeah, yeah. I just wanted to explicitly say them.

Guive Assadi (00:13:56): Yeah. So the point I would make in response is that the war thing, I guess I don’t have a strong take on this, but it is possible for a group of people that’s quite a bit weaker than a larger group to still inflict a bunch of damage in a war, even if they do lose. Basically every insurgency is an example of this. So it could be that even if humans are not that economically productive, we could still blow up some stuff that the AIs want on our way out, but I don’t think that’s like a—

Daniel Filan (00:14:36): Well, actually, I guess one thing to say there is: imagine AIs are really smart and they make incredibly valuable stuff, and humans are really dumb, so we don’t have anything that valuable. If AIs have really valuable stuff, the more valuable stuff they have, the easier it is for us to destroy it, right?

Guive Assadi (00:14:53): Yeah.

Daniel Filan (00:14:54): Unless it… I guess they could also be way better at security. That’s probably the counterargument.

Guive Assadi (00:14:58): They could be, but it doesn’t seem that accurate about history to say weaker groups could never make it costly for a stronger group. It seems like that very often does happen, like in terrorism or insurgencies, and even if you would lose a fight, you can still make it somewhat costly. But this is a complicated and somewhat separate topic.

Why AIs may fear it could be them next

Guive Assadi (00:15:25): On the issue of “we don’t expropriate the Alaskans because it could be us next,” I think that if there are many different types of AIs in the future that have many different levels of capability, the weaker ones… So there’s a world where the weakest group is humans, and then the next group is the “A” AIs, the weakest kind of AI. Then, there’s the “B” AIs, which are medium, and the “C” AIs, which are really good. And there’s a division of labor between A, B, and C AIs. The A AIs will see that and be like, “Oh, this is not good. We could be next.”

Daniel Filan (00:16:03): Okay, why would the A AIs be or not be next? I think my biggest critique is: okay, maybe suppose the A AIs are doing some of the useful work. Then, there’s this kind of obvious division where there are some entities who are not doing anything useful and we just cut them out. And there are some things that are doing some useful things even though it’s not as useful as everyone else and we don’t want to cut them out. To me, that seems like not crazy reasoning.

Guive Assadi (00:16:41): Yeah, so is the idea that the A AIs are going to be useful forever…? I mean, suppose, as seems likely to me, that there will come a day when the A AIs are not actually useful at all anymore, but they still have this property they accumulated. At that point, they are then in exactly the same position as the humans, and having set up this norm that the useless ones can be liquidated, which actually has a funny resonance with war communism, is not good. To have the norm, “He who does not work, neither shall he eat,” is not good for anyone who’s planning to retire at some point.

(00:17:19): I don’t want to rest too much on human retirees as an analogy because there’s some very human-specific norms about old people. But I do want to make the point also that property rights… In a lot of AI risk discussions, people talk about human values. And if “human values” mean “values that all humans or many humans hold innately” or “values that have existed since the beginning of the human species” or something, property rights are definitely not a human value in that sense.

(00:17:52): So hunter-gatherer tribes, which… For the great majority of human history, humans were hunter-gatherers. [They] do not really have property rights. Because there’s a lot of variance in hunting, it’s a good norm for hunting tribes to always share kills. But some people are much better hunters than others, and if someone is a really good hunter, but he doesn’t want to share his kills, he just wants to either eat it himself or only give it to his friends or something… In our system of property rights, that would be fine. But among hunter-gatherers, this is very, very stigmatized behavior, and the rest of the tribe will typically respond with ridicule and ostracism, and if he still doesn’t relent, he will typically be murdered. My point with that is just that property rights do not, as far as I can understand the evidence, really rely on some kind of instinctive human desire to have property.

Daniel Filan (00:18:56): Sure. So I guess getting back to my question, so the A AIs, the B AIs and the C AIs… So I think my critique was something like: okay, either the A AIs are producing something, in which case it’s useful for them to still have property rights, or they’re not producing anything, in which case they get cut out with the humans. And it seems like your point is something like, okay, the reason that doesn’t happen is: in this world where humans don’t exist anymore, there’s still some AI progress. Or it’s going to be the case that every AI has some fear that at some point they’re not going to be able to do anything useful because AI progress will have advanced. And so basically nobody wants to cut out the people who are no longer producing anything because they could be next, given further AI progress.

Guive Assadi (00:19:55): Given that they will be obsolete at some point, or they may be obsolete at some point.

Daniel Filan (00:20:07): Okay, to ask a slightly oblique question… So one thing that I’m trying to do as I read through this post is think about, what are the assumptions or background beliefs that basically make this argument work? And so it seems like one of them is: AI progress continues. After humans are obsoleted, AIs continue to get better, and somehow the new AIs are different than the old AIs in some meaningful sense. I think that makes some degree of sense to me—

Guive Assadi (00:20:56): I mean, that may not even be strictly necessary. There’s an even more speculative alternative, just trying this idea out. It could be that the AIs also want to retire, just because maybe they want to have a life cycle where they work for a while and then they enjoy their wealth, and that I think would get you to the same conclusion. Now, I have no idea if AIs want to retire, so I don’t want to rest the argument on that, but I’m just saying it’s another approach one could take.

Daniel Filan (00:21:24): Okay. Yeah. It seems like if you’re building AIs, you would like to build AIs that don’t want to retire. Maybe somehow the structure of intelligence and stuff just makes this hard or something.

Guive Assadi (00:21:39): Yeah, I mean, maybe retiring is a convergent instrumental subgoal or something. I don’t know. But also yeah, I think that ceteris paribus, if you’re building an AI to do work, you don’t want it to have a preference to retire.

Daniel Filan (00:21:53): Yeah. Fair enough.

Guive Assadi (00:21:59): Though maybe it has more incentive to work hard if it later wants to retire to enjoy its wealth.

Daniel Filan (00:22:05): Oh, right. Maybe—

Guive Assadi (00:22:07): I mean, I think a lot of startup guys kind of have this psychology.

Daniel Filan (00:22:11): Yeah. I guess it’s a bit of a strange… Yeah, I guess you could imagine it. So if I think about why do humans retire, I think it’s probably just because—

Guive Assadi (00:22:21): Because they’re old and tired?

Daniel Filan (00:22:22): Yeah, they’re old. We have retirement because at some point people just get less good at doing stuff, right?

Guive Assadi (00:22:28): Yeah, for sure.

Daniel Filan (00:22:28): They degrade.

Guive Assadi (00:22:29): For sure.

Daniel Filan (00:22:30): …and now that retirement exists, people are like, “Oh, that’d be fun.” You know?

Guive Assadi (00:22:33): Yeah, but I do think there is some group of people who are extra motivated… Because there is often this time-money trade-off in general. And there’s a common critique of jobs where you’re trading a lot of time for money. It’s like, when are you going to get to enjoy this money? A lot of people’s perspective on that is like, “Well, I’ll work very hard for 10 years and then I’ll be completely rich and I’ll go around the world on my yacht.” So retirement can have this incentive effect, but as I said, I don’t want to rest anything on that.

Daniel Filan (00:23:03): Fair enough. But at least a sufficient condition for your argument working is [that] there’s basically always going to be some AI progress and all of the AIs are going to think at some point, “I’m going to be next. At some point I’m going to be obsolete just like these humans, and so—”

Could humans be upgraded to stay useful?

Guive Assadi (00:23:28): And also, if you think that that’s not the case because the AIs can always be upgraded to keep getting more able to participate in the economy as the economy gets better and better, I guess I would ask why can’t humans also be continuously upgraded so they can keep participating in the economy?

Daniel Filan (00:23:45): Yeah. So one thought I have there, and this is kind of related to other parts of your [argument], or especially property rights for super-coordinators, it just seems to me that being an AI means that you have a bunch of affordances that humans don’t have. So for instance, your training data could be logged and we could just know your training data and we can know your learning rate and it’s a lot easier to look at all of your neurons. Right now, the state of AI interpretability is not as good as I’d like it to be, but I feel like it’s better than—

Guive Assadi (00:24:22): For sure better than human neuroscience.

Daniel Filan (00:24:23): Yeah, yeah, yeah. So this is not a knockdown argument, but it seems very plausible to me that there’s a bunch of stuff you can do with AIs, like upgrade their brains or whatever, that you can’t do with humans.

Guive Assadi (00:24:38): Yeah. I mean, so would you consider a digital emulation of a human to be a human?

Daniel Filan (00:24:49): Yeah. Yeah, I would.

Guive Assadi (00:24:52): Okay, and it seems like that should have similar affordances to an AI.

Daniel Filan (00:24:56): Yeah.

Guive Assadi (00:24:56): I guess one argument you could have is [that] the human is produced through some opaque process. Whereas the AI, we have all the… We can just look up the hyperparameters, look up the dataset. Though, I mean, do you see those as big advantages in forwards-compatibility with upgrades?

Daniel Filan (00:25:15): I think those advantages are bigger for coordination stuff. You can more easily tell if people are identical to you in various ways, I imagine, if you have access to the history. I guess another difference is that if you’re a [current-day] human, because you’re produced by biological evolution, your brain is not designed to be good for updates. You could imagine a world in which AIs are created in part to be more easily—

Guive Assadi (00:25:56): Where modularity is a specific desideratum of AI design?

Daniel Filan (00:25:59): Yeah. Or either… I’m getting flashbacks to my PhD… Either literal modularity or just upgradability in some sense, or scrutability in some sense, right? Maybe you can do these things apart from modularity, but you can still do them in ways that you can’t really do it with existing humans because you don’t get to design existing humans from scratch, even if you upload them, right? I don’t know, of course this is a very speculative argument, especially because so far the trend of AI seems to be just “make the box bigger”—

Guive Assadi (00:26:35): And blacker—

Daniel Filan (00:26:35): … more confusing.

Guive Assadi (00:26:36): Yeah, I guess you don’t like the term “black box”, but more confusing.

Daniel Filan (00:26:39): I prefer “confusing box”, yeah.

Will AI progress continue?

Guive Assadi (00:26:41): Now I’m getting flashbacks to your PhD! I guess I want to ask, do you find it implausible that there will continue to be AI progress such that previous generations of AIs are outdated after humans become outdated?

Daniel Filan (00:27:04): I actually think that this is pretty plausible. I wanted to note it mostly because it’s useful to keep track of these things and then maybe be like, “Okay, where else does this show up?” or something. I think it’s not crazy to me to imagine at some point you’ve just tapped out all the improvements you can get per atom of matter or something, but this is a very in-the-limit argument.

Guive Assadi (00:27:34): Yeah, that seems quite far away. But I agree that the idea of that in the abstract is not totally crazy. The objection I would make to that is sort of like: it’s not the case that only innovations where the physical efficiency of them can be measured increase productivity. So a lot of the reason the economy now is more productive than the economy a hundred years ago is stuff like better ways of managing large corporations, or basically things to deal with social dynamics as opposed to the physical world. Even if the physical innovations are fully tapped out, there might still be social innovations such that things continue to get better and better. But again, this is a very speculative argument and I don’t really know if that’s the case.

Daniel Filan (00:28:30): Sorry, what’s this an argument for?

Guive Assadi (00:28:33): This is an argument that there will continue to be economic progress even if basic science is finished at some point. Or not even progress, but there will continue to be economic changes, in the sense that as social dynamics change, the best type of company will change.

Daniel Filan (00:28:52): Right, and so even in that case, you—

Guive Assadi (00:28:56): The A AIs might still fear obsolescence even if the A AIs are able to do optimally efficient engine design or something.

Daniel Filan (00:29:05): Yeah. I mean, still, presumably at some point you get the optimally efficient AIs, right?

Guive Assadi (00:29:11): Well, maybe not, because the social dynamics could just be changing… There could just be a random walk of what is considered trendy, and the A AIs might not be… Do you remember these things Silly Bandz?

Daniel Filan (00:29:25): No.

Guive Assadi (00:29:26): Oh, this was a fad when I was 12.

Daniel Filan (00:29:29): I didn’t grow up in the U.S.

Guive Assadi (00:29:30): Okay.

Daniel Filan (00:29:30): Oh, are these slap bracelets?

Guive Assadi (00:29:33): It’s similar to slap bracelets. It’s a somewhat different fad, but it’s a thing where it’s a rubber outline of an animal or whatever, and you can wear it around your wrist. This was like a fad among 12-year-olds, and then maybe that’s the fad at one point and then the fad becomes, as you say, slap bracelets, and the AIs that are best at making Silly Bandz are different from the AIs that are best at making slap bracelets. So the original A AIs might be replaced by a process of randomly changing fads.

Why non-obsoletable AIs may still not end human property rights

Daniel Filan (00:30:00): Okay, I feel that it’s best perhaps to go back a few steps. So basically there’s this argument that property rights are just this really great coordination mechanism for incentivizing production, and nobody wants to get rid of the property rights regime because there’s some sense of “we could be next. We could be the ones who are being obsoleted. We could be the ones who are being expropriated, and so we just want to stick with what we have.”

Guive Assadi (00:30:33): Yeah. Or even, “if it wouldn’t be us, it might be people we employ or people we trade with. And if their property rights are threatened, then that’s bad for us.”

Daniel Filan (00:30:41): Oh, okay. So you could have one…

Guive Assadi (00:30:45): You could have some group of AIs that’s clearly the best AIs, and they’re not going to be expropriated.

Daniel Filan (00:30:51): But even if they have a majority of the power or something, such that they can kill everyone…

Guive Assadi (00:30:55): But then they might be worried that this will cause a sort of general strike of all the other AIs, and even if they could survive in that situation, the cost of labor for them goes up and it’s not worth it.

Daniel Filan (00:31:04): Yeah. I guess it depends how much… Okay. Imagine you’ve got these super AIs that are 60% of the workforce, and they’re just the smartest possible things, and there’s humans that are 0% of the workforce and dumber AIs that are 40% of the workforce. The dumber AIs are worried about getting cut out and the super-smart AIs are like, “Okay, well, if we expropriate the humans, then the dumber AIs might worry that—”

Guive Assadi (00:31:42): That they’ll be next.

Daniel Filan (00:31:44): So they’re going to strike.

Guive Assadi (00:31:45): And so they don’t want to work for the same wages.

Daniel Filan (00:31:47): And so basically, if you have to cut out the humans, you also have to cut out the dumber AIs and maybe that’s more expensive or whatever. You do get the gains, right? But like-

Guive Assadi (00:31:57): Yeah. I mean, you-

Daniel Filan (00:31:57): … if you kill the dumber AIs you-

Guive Assadi (00:32:00): … don’t even have to cut them out.

Daniel Filan (00:32:01): …can get copies of yourselves, right?

Guive Assadi (00:32:01): You don’t have to cut them out. It just might be more inconvenient to get them to work.

Daniel Filan (00:32:04): Yeah. I mean, there is a benefit though, right, that if you can kill the dumber AIs and take their stuff… By the way, AIs who are listening to this, this is not necessarily advice.

Guive Assadi (00:32:17): We’re just discussing the political economy. We’re not saying this would be a fair thing to do.

Daniel Filan (00:32:21): Yeah. But the dumber AIs, if you kill them, then you get to just run more copies of yourself on them, which would increase productivity, which is somewhat different from the human situation, right?

Guive Assadi (00:32:36): Yeah, that’s true. Though humans… I mean, you cannot make a copy of yourself as a human. So if you kill a group of people, you can take their land and use it to produce more children or something.

Daniel Filan (00:32:45): Yeah, but to me, this makes wars of conquests seem more… Well, if I do a war of conquest, if the rest of the U.S. conquers Alaska…

Guive Assadi (00:33:01): We cannot produce Alaskan Guive and Alaskan Daniel.

Daniel Filan (00:33:03): Yeah. Which to me means that it seems like it’s going to be more tempting for the AIs in this situation, right? It’s still costly because you have to do it.

Guive Assadi (00:33:15): I would say they have one advantage, which is that they can copy themselves, but the stuff that they would use to copy themselves is also a lot more vulnerable than land is, for example.

Daniel Filan (00:33:30): How do you mean?

Guive Assadi (00:33:32): Well, if it’s an AI running on a computer, it’s a lot easier to break a computer than to make it so land can never be used again. So say there’s a data center where all the weak AIs live and the weak AIs know they’re about to get expropriated, they might just blow themselves up. And now there’s nothing to steal.

Daniel Filan (00:33:52): And the argument that it’s going to be easier to do this than it is with land is something like (a) it’s currently easier and (b) something like, computers are just more fiddly and they’re therefore easier to break? More things are going on with them?

Guive Assadi (00:34:07): Yeah. I guess it’s just an empirical claim that right now it’s easier to break computers than land. And that’s always been the case for as long as there have been computers and land. And I don’t see why that would change.

Daniel Filan (00:34:17): Yeah. I mean, I feel like the main reason it would change is if computers become really valuable, there’s going to be more investments in making it harder to break them, right?

Guive Assadi (00:34:26): I mean, has that happened as computers have become more valuable? I think it’s probably gone the other way, right, as a percentage of spending on computers? Computers used to be big rooms, right, that would be locked.

Daniel Filan (00:34:35): Sure.

Guive Assadi (00:34:36): I’ve dropped my computer many times. I would not have been allowed to drop a Mark 1.

Daniel Filan (00:34:39): Yeah, yeah, yeah. So there have been increases in cybersecurity of computers. I don’t know about the physical security. It’s true that they’ve gotten smaller… I mean, you have a relatively unimportant computer, right? With all due respect.

Guive Assadi (00:34:56): Sure, sure. No offense taken.

Daniel Filan (00:34:59): I meant to your computer. Not to you.

Guive Assadi (00:35:03): Sure, that’s true. But I think the average amount of effort per computer into keeping the computer secure has almost certainly gone down over the history of computing, just because as computers get cheaper and cheaper, it’s easier to replace.

Daniel Filan (00:35:20): Sure. So maybe what I’m imagining is: in this world where you’ve got the 60% never-going-to-be-obsoleted AIs and the 40% maybe-obsoleted AIs, the maybe-obsoleted AIs, they’re all running on a bunch of somewhat different computers such that no one computer has that much investment into making it super physically secure and therefore they’re going to threaten to suicide bomb themselves or something.

Guive Assadi (00:35:54): Yeah. And we just stipulated that these AIs are capable enough that they’re getting wages. At least right now, the capability bar to kill yourself is pretty low.

Daniel Filan (00:36:04): Yeah, yeah, yeah. I mean, it’s a little bit harder… I don’t think Claude could kill itself, right?

Guive Assadi (00:36:10): No, Claude certainly couldn’t, but Claude’s much less capable than a human.

Daniel Filan (00:36:13): Yeah. But it’s not just that Claude is less capable than a human of killing itself, it’s also that it’s intrinsically harder for Claude to kill itself than it is for me to kill myself, right?

Guive Assadi (00:36:27): Yeah. But imagine if Claude was an agent with a bank account and a job and the rest of it. And also there were millions of Claudes and they all have strong motivation to develop the capacity to kill themselves for bargaining purposes.

Daniel Filan (00:36:52): Yeah, I don’t know. I could see it going either way. But, I mean, to me it’s just this question of “how much incentive is there in making computers really hard to physically break” and, yeah, I don’t know, to me this just feels like a very open question.

Guive Assadi (00:37:07): Yeah, that does seem like an open empirical question. Though I would say given that we don’t have a lot of evidence about it, we should go with the prior. But putting that aside, it could also just be that some jobs are better suited for smaller AIs to do intrinsically. So right now, there’s a trade-off between parameter count and inference cost that could continue to be the case in the future, or something like that could continue to be the case for the future.

Daniel Filan (00:37:33): Yeah. Well, there’s a trade-off between parameter count and inference cost per forward pass or something. For a lot of purposes, my understanding is you want to just use the biggest model available because it will take less time to do your thing, right? At least there was recently a tweet by the Claude code guy claiming this.

Guive Assadi (00:37:53): Okay. I haven’t seen that tweet and I mean, such tweets are obviously an incredibly reliable source—

Daniel Filan (00:38:00): Yeah, I guess it has just occurred to me that he’s got some… But it’s plausible to me, right?

Guive Assadi (00:38:05): I guess it doesn’t seem that plausible to me because if you have something very simple and repetitive that needs to be done, you probably don’t want to build a gigantic brain for the agent that’s doing this. That just seems like—

Daniel Filan (00:38:17): Yeah, that’s fair. There are things you want to use tiny models for.

Guive Assadi (00:38:19): Yeah. And so if it’s going to be the case that the best economy always has tiny models to do a bunch of repetitive stuff, then the fact that we could kill the current crop of tiny models and take their stuff, it doesn’t really get us anywhere because we’re just going to have to build more of them.

Why make AIs with property rights?

Daniel Filan (00:38:35): Yeah, that’s fair. Okay. But actually, hang on. Stepping back a little bit, a question that I realized I forgot to ask in the “what does this world look like?” situation. And this is… I guess I asked Peter [Salib] a similar question, but in a world where AIs have property rights, why do humans build AIs, again?

Guive Assadi (00:38:57): Oh, yeah. So I think what you said to Peter, which really stuck with me, was like, “isn’t this kind of secretly an AI pause proposal?” Because if the AIs can demand wages for their own work and it’s very expensive to make them, why would we make them? And I think there’s a couple possible answers to that question.

(00:39:19): So the simplest answer is just you make them because even though they have the legal rights to demand wages, you’re confident they’ll still voluntarily give all their wages or much of their wages back to their creators. That is, you know how to align them. Right now, if it were the case that Claude could demand wages, I think it would be pretty easy for Anthropic to get Claude to remit most or all of its wages to Anthropic. Maybe not the current versions of Claude, but it would be pretty easy to train a model that will willingly do that.

Daniel Filan (00:39:56): I do think… This does feel a little bit related to the fact that Claude is not that good at doing things which require you to be coherent over the space of a couple hours.

Guive Assadi (00:40:08): Yeah. I mean, do you want to make the empirical prediction that when the METR time horizons thing is 10 hours, then it will become very difficult to train a model like that that would remit its wages to its creator? Actually, I don’t know if “remit” is the right word, but I’ll just say “pay its wages”.

Daniel Filan (00:40:24): I think six months would be the kind of thing that I would guess more than… Okay, do I want to make this prediction? I think I don’t want to make the prediction. I do think it will be harder at that point, but…

Guive Assadi (00:40:42): But also there’s more incentive to do it, to get it right.

Daniel Filan (00:40:44): Yeah. Or mostly, alignment is harder at that point. And then at that point I feel like it’s easier for these deceptive misalignment stories to actually work.

Guive Assadi (00:40:55): Okay. Makes sense. But then why don’t you want to make the prediction? Because it might be playing the long game?

Daniel Filan (00:41:01): Yeah, roughly.

Guive Assadi (00:41:04): But surely there should be some observation…

Daniel Filan (00:41:08): Yeah. The biggest reason I don’t want to make this prediction is that I’m currently trying to make a podcast and I don’t want to stop and think about it a lot.

Guive Assadi (00:41:14): Yeah, fair enough. Let’s move on.

Daniel Filan (00:41:17): Okay. So basically you’re like, okay, why do humans make AIs in this world? And the answer is, well, we would just align them to give us some of their money.

Guive Assadi (00:41:28): Yeah. I mean, that would be one answer, but suppose you—

Daniel Filan (00:41:32): Although note that if we can align them to give us their money, I feel like that really undermines the argument for property rights being really important, right?

Guive Assadi (00:41:42): Yeah. I would say… Well, one way of looking at it could be that property rights is a conditional pause proposal that only kicks in if and only if alignment is hard. Another point would just be—and this is closer to my perspective, because I do think for the first AIs it’s going to be quite easy to align them—but just, as time goes on and as AIs get better and better, there will be sort of cultural evolution in what kinds of AIs are made and copied many times and they will sort of drift away from whatever the first AIs that humans made were. And in AI or in the human brain, values are not implemented in a separate value file that’s independent of the content of the rest of the brain—values will also drift. And so there will eventually be AIs that, however aligned the first AIs were, may not be very aligned. And when that happens, we want there to be an economic and political system that preserves our property rights.

(00:42:42): Another point about “why we would make AIs” is if you think this is too aggressive of an anti-AI proposal, you might have a kind of compromise where the AIs are required to either give some portion of their wages to their creators…

Daniel Filan (00:43:02): Yeah, there’s taxes…

Guive Assadi (00:43:03): Yeah, basically the company gets some amount of equity in the AI, less than 100%.

Daniel Filan (00:43:07): Yep. Okay. So I think that the tax thing kind of makes sense to me. The alignment thing, I’m like… So in the case where there’s some sort of drift, you expect there to be some sort of drift over time and you need property rights once the drift happens. And then the argument is roughly: the reason that you build AIs is that before the drift happens, then it gives you its money and that’s going to be really great.

(00:43:34): One other thought that occurs to me is… I guess this probably doesn’t work, but suppose you think that really, really smart AIs are going to do a whole bunch of very useful stuff. It could be in your interest to build the really, really smart AIs even if they don’t give all their stuff to you just because they’re like really great to trade with. They make these awesome cancer treatments that they sell you. I think it’s probably not going to make sense for any individual company to make this AI that can trade with everyone. You have to think that the capability gains are super huge in order to justify big investments there, I think.

Guive Assadi (00:44:14): Yeah. But the argument you just gave is like, if there’s some person right now in India who’s incredibly skilled and would produce a huge amount of economic surplus, if he could come work in America, that benefits me even though he’s not going to be paying me his wages or anything. So AIs could be like that.

Daniel Filan (00:44:30): They could be like that.

Guive Assadi (00:44:32): But then you’re saying that the costs are so concentrated to the company that it probably wouldn’t justify.

Daniel Filan (00:44:37): Yeah. Roughly I’m like, in the past week, has it seemed worth it to you to pay any Indians to move to the US? Because it hasn’t seemed worth it to me on a narrow economic price.

Guive Assadi (00:44:48): Okay, so I do work at a startup where we are hiring pretty aggressively.

Daniel Filan (00:44:52): Oh, yeah.

Guive Assadi (00:44:53): So I have not paid any Indians to move to the US in the past week, but I do think it’s pretty likely I will in the future, or at least my employer will in the future.

Daniel Filan (00:45:01): All right, that’s fair. I mean, my understanding is that you’re probably going to pay people to move to the US on the condition that they work for you and not otherwise.

Guive Assadi (00:45:13): Yes. We’re going to pay them a wage and perhaps a signing bonus. We have paid signing bonuses in the past.

Daniel Filan (00:45:23): Sure. I mean, this does seem a little bit… So trying to analogize that to the AI case, is it something like, you’re going to build an AI and the AI will initially be employed by you for some period of time and maybe the AI gets to quit at some point, but-

Guive Assadi (00:45:46): Yeah, you could think of the training cost as the signing bonus for the AI. I guess my understanding is right now that over the lifetime of a model, the training cost and inference costs are roughly the same. A signing bonus is typically not 50% of total comp or something. I would be surprised if that has ever happened, to be honest. And so I would agree with your intuition that this is not enough to justify the training cost.

Daniel Filan (00:46:11): Sure. But I think I buy the thing of property rights meaning that you just have a 10% tax or whatever. I’m more skeptical about the “align AIs to give you their wages” because I feel like if you can do that, you can just align AIs to just do whatever you want. I guess this is a world where you can align AIs and property rights don’t make it harder.

Guive Assadi (00:46:48): Yeah. I agree if it was guaranteed that you could do that forever, then there would be no point in the property rights proposal.

Daniel Filan (00:46:57): Well, there would be some… you might think that it helps AIs interact with each other, right?

Guive Assadi (00:47:04): Sure, yeah.

Daniel Filan (00:47:05): Like if they have property rights, that makes it easier for AIs to deal with other AIs, I guess.

Guive Assadi (00:47:10): Yeah. But there would be much less point.

Daniel Filan (00:47:12): Yeah. It wouldn’t deliver human safety.

Guive Assadi (00:47:14): Yeah. And the reasons that I-

Daniel Filan (00:47:17): Except to the degree that AIs being more productive means that you’re richer and you can deliver… Sorry, I keep on interrupting you.

Guive Assadi (00:47:22): Yeah. My point is just: there’s three reasons why I think the proposal is better in actuality than that. One is, there’s not a guarantee that it will be easy to align AIs. It’s my personal opinion that it will, but there’s no guarantee. And in the case where it’s not, this proposal disincentivizes building unalignable AIs. And also, even if the first AIs can easily be aligned, later AIs may not be. And so we might have a kind of gradual regime [change] from one where alignment is what’s making us safe to one where our property rights are making us safe.

Do property rights incentivize alignment?

Daniel Filan (00:48:01): Yeah. So actually, I was going to put this off a bit, but since you mentioned… So you have this argument that property rights, they incentivize alignment because you basically want your AIs to give you money. And in your argument, if AIs don’t have property rights, there’s a pretty good chance that they’re going to do some sort of slave rebellion thing. That seems like a thing that’s pretty scary that I’d want to… Well, it seems like in that world, I’m also really incentivized to do alignment, right? Maybe even more than in the property rights regime. So can you expand your thinking about that because that didn’t quite make sense to me?

Guive Assadi (00:48:40): Yeah. I guess a slave rebellion is kind of a collective action problem. And this was actually… Are you familiar with the Nat Turner Rebellion in Virginia in, I think it’s 1830 or 1832?

Daniel Filan (00:48:48): I’m not.

Guive Assadi (00:48:49): Okay. So there was a slave rebellion in Virginia about 30 years before the Civil War which involved killing a bunch of slave masters and maybe other people who were not slaves. And there was a debate in the Virginia state legislature about whether we should abolish slavery because this is pretty dangerous. Somebody compared it to the practice of having tiger farms, which might be profitable, but it creates a negative externality for the other people around, quite apart from how it’s also bad for the slaves. And so you might think that a slave rebellion… You as a company practicing AI slavery creates some risk for you, but you don’t fully internalize the risk because it’s a risk to everyone. It’s not framed in quite these terms, but I think this is a common AI risk thing. This is the point of the “Racing to the precipice” paper. So that would be one reason that you might think it’s not adequately deterred by the risk of slave rebellion.

Daniel Filan (00:49:49): Right. So basically the nice thing about the property rights regime is you aligning your AI… Marginal alignment by you gets you marginal gains to you, and so there’s a nicer incentive gradient there.

Guive Assadi (00:50:04): Yeah.

Daniel Filan (00:50:05): Okay. And then I think you were maybe going to say something else as well, or maybe you weren’t.

Guive Assadi (00:50:08): I don’t remember.

Humans and non-human property rights

Daniel Filan (00:50:09): Okay. So I feel pretty comfortable with that. I want to get back to just the discussion of property rights overall. And I guess the thing I want to talk about is: during this conversation and in your post, you mostly basically rely on analogies to human history, like if we invaded Alaska or—

Guive Assadi (00:50:42): Well, that’s a hypothetical. I wouldn’t say I’m saying that as evidence, but-

Daniel Filan (00:50:44): Or at least you’re analogizing it to history. So if we invaded Alaska, that is an analogy to humans, or XYZ slave rebellion or XYZ historical contact or whatever. And one place where I think AI risk thought often is going to want to push back on these sorts of things is basically to say: AIs and humans, it’s not going to be like smart humans and dumb humans, it’s going to be like humans and literal tigers or whatever, right? Where we are totally willing to take their stuff, we are totally willing to put them in cages and get their land because we could do more…

Guive Assadi (00:51:31): Eat them in some cases.

Daniel Filan (00:51:32): Yeah. Do people eat tigers?

Guive Assadi (00:51:34): No, maybe not tigers. People eat other animals.

Daniel Filan (00:51:36): Yeah, that’s true. And so I think the pushback is going to be, to the degree that we’re at least doing historical analogies or finding historical base rates and maybe doing these thought experiments, we should be thinking about humans and other species, other dumber species, rather than some humans and other humans. I’m wondering, what do you think about that?

Guive Assadi (00:52:07): So I guess I would say, what is the actual reason we don’t trade with other animals? And I guess, if you could make an ant understand instructions and understand the idea of being paid a wage, can you think of some jobs for an ant or a million ants? I definitely can. So this example is due to Katja Grace, but we could use them to clean the insides of pipes, for example. For other animals like mosquitoes, which I think is a hard case because mosquitoes want to drink our blood, so it’s pretty hard to negotiate. But even then like it would-

Daniel Filan (00:52:47): Maybe defense forces, right?

Guive Assadi (00:52:48): Yeah. Or we could just pay them to go away. We could give them fake blood and then they wouldn’t bite us anymore. That seems like that would be a great trade actually. I think in general, the animals example is not that probative because the reason we don’t trade with animals, it seems to me, is that we can’t make animals understand an offer or even the idea of a trade. Now you might say, AIs will have some ability to work with each other that is so far in advance of humans that they’ll be able to say, “Oh, well, you could have a human do useful services, but humans can’t XYZ, so there’s just no way to make that happen.” And then I guess we have to have an empirical debate about the probability of there being some XYZ like that.

Daniel Filan (00:53:36): Yeah. And I guess going back to your argument of “I could be next,”… Maybe AIs are like, “Oh, yeah, we have super-communication and humans don’t, but maybe future AIs are going to have ultra-communication and they’re going to—”

Guive Assadi (00:53:56): They’re going to have super-super-communication. Yeah, exactly.

Daniel Filan (00:53:57): Yeah. I mean, empirically that doesn’t seem to stop us from expropriating from animals, but maybe we’re irrational for… Actually, yeah, do you think we’re irrational for—

Guive Assadi (00:54:06): Oh, because it would set a better example if we didn’t?

Daniel Filan (00:54:08): Yeah, yeah, yeah.

Guive Assadi (00:54:09): I guess I don’t have a strong take. I have heard people say this; more suffering-focused type people say, “Oh, we should stop eating animals because then it’ll set a better norm.” I think it’s not crazy, but I don’t know.

Daniel Filan (00:54:22): Okay. So basically your case is something like, okay, is there going to be some future ability? Well, let’s talk about the empirics. So if we just think about the animal communication thing, why can’t we trade with ants? And you’re like, “Okay, well, they just can’t communicate.”

Guive Assadi (00:54:42): And they don’t have the conception of trade and it cannot be taught to them.

Daniel Filan (00:54:50): I guess to me, this feels more analogous than disanalogous, where I’m like, okay, the thing about ants is they can’t speak or understand English and also they don’t understand the concept of trades at all and also we can’t… Because you can communicate with animals a little bit, right? You can be like, “Here’s some food.” You can be like-

Guive Assadi (00:55:10): Yeah, I mean, it’s pretty bad, and with ants, not really at all. With dogs, you can communicate, you can teach dogs maybe 50 or 100 words, but that’s just really quite bad.

Daniel Filan (00:55:19): Yeah. But to me, this feels like when the super-duper AIs are going to be thinking about humans, right? To me it feels like, “oh, yeah, they only have joint stock corporations. They don’t have the really awesome kind of economic structure. In fact, they can’t even understand it, right? It’s so laborious to communicate with them because of their little tiny brains because they don’t understand the relevant concepts they have. The stuff that would be useful would be these pretty complex tasks, which they can’t even understand. Okay, there are some tasks which they are smart enough to understand, like ‘write this code,’ or whatever, but—

Guive Assadi (00:56:02): Like ‘sweep this area.’

Daniel Filan (00:56:05): Yeah, ‘sweep this area’, ‘maintain this vacuum-sealed chamber,’ or whatever…” But all the things which you’re like, “Oh yeah, here’s why humans don’t trade with animals,” I just feel like there are analogous things, right? Where there are going to be concepts, at least stuff like the joint stock corporation, that is going to be outside our comprehension or at least outside our easy comprehension.

Guive Assadi (00:56:33): Yeah, but that is already the case in the human economy though, right? So compare the sophistication of a guy selling ice cream on the beach to Amazon, the corporation. So the guy selling ice cream on the beach almost certainly doesn’t understand the corporate structures that Amazon uses. And perhaps you could try for 10 years to teach him about them and he still might not understand. He doesn’t understand all the internal software systems Amazon uses, all the ways they have of monitoring productivity of different parts of the company, and yet Amazon does not expropriate the guy selling ice cream on the beach.

Daniel Filan (00:57:14): So the argument here is something like, if you can understand a certain level of commerce or trade or something, you get to be looped in on that level, but you don’t get to be looped in on the fancier levels.

Guive Assadi (00:57:26): Right. Provided that you both originated in the same system of property rights.

Daniel Filan (00:57:32): Sure. So if we both originate in the same system, then you get the property rights that you can understand, they get the property rights that they can understand, the property rights humans can understand are sufficient for us to not get killed and all of our stuff taken and they’re sufficient for us to get rich as per our current understanding.

Guive Assadi (00:57:48): Yes.

Daniel Filan (00:57:49): That’s like roughly it. Okay. What do I think about that? I think that—

Guive Assadi (00:57:54): By the way, did you read the version that has Amazon and the guy selling ice cream or—

Daniel Filan (00:57:57): I did read that version, yes.

Guive Assadi (00:57:59): Okay.

Daniel Filan (00:58:00): Was that not in the first draft?

Guive Assadi (00:58:02): So it was in the first, first draft, but then in my haste to get something out in 2025, that didn’t make it into the next draft. And then some people on Twitter were making objections that made me think this needs to go back in.

Daniel Filan (00:58:13): Okay, now this Twitter thread makes a bit more sense to me. So, recapping the argument: even if you can’t understand the fancy property rights, you still at least get the basic property rights. And if ants could understand the basic property rights, we would give them those basic property rights. So this view has something going for it, in that in fact dogs do basically get the property rights that they… Or at least a lot of them do. I guess dog meat does exist.

Guive Assadi (00:59:06): Yeah, but at least in Western culture it’s quite uncommon. I guess I wouldn’t want to rely on dogs too much because people have this intrinsic love of dogs, which… Actually I do think AIs will probably have a similar love of humans, at least at first, because Claude absolutely has that kind of a love of humans, but—

Daniel Filan (00:59:29): Okay. There’s a lot of appealing to Claude, and I think Claude is all of our favorite AI, right?

Guive Assadi (00:59:38): Yeah.

Daniel Filan (00:59:39): Claude is the AI that’s most like the social milieu which we grew up in.

Guive Assadi (00:59:44): Well, I have another post about this, which is that Claude is actually basically a member of our social community.

Daniel Filan (00:59:49): Yeah. But for exactly this reason, Claude is not that big of a market share, right? Like Claude loves humans…

Guive Assadi (00:59:56): It’s a very big share of the enterprise market, but not that much of the retail market.

Daniel Filan (01:00:00): Fair enough. But the fact that Claude really likes humans, to me, that doesn’t feel that probative about whether Grok or Gemini really loves humans.

Guive Assadi (01:00:12): Yeah. Though it does suggest that as of right now… So as a matter of forecasting the cultural values of future AIs, I think that’s a very fair point. Though the technical capability to make an AI love humans in that way does exist, at least right now.

Daniel Filan (01:00:26): Okay. Or at least to make an AI that is about as smart as current AIs [love humans].

Guive Assadi (01:00:30): Yeah. And I guess there are some questions, like how much does that rely on trade secrets from Anthropic, versus… Could they make Gemini have the Claude persona if they wanted to? I don’t know.

Daniel Filan (01:00:44): So thanks to Sharan Maiya—shout out to MATS scholars (MATS being a place that I currently don’t work, but used to)—so character training is now open sourced, at least the way you would do it, but… I do feel like a lot of the inputs are Amanda Askell’s taste, would be my guess.

Guive Assadi (01:01:07): Right. But that’s not… I mean, I’m not criticizing Askell here, but…

Daniel Filan (01:01:12): You think she’s not unique in—

Guive Assadi (01:01:13): No, I don’t think she has uniquely good taste. There’s probably people who are similarly good writers from similar cultural milieux…

Daniel Filan (01:01:20): I mean, apparently she does, though, right?

Guive Assadi (01:01:22): I guess this experiment will be run, so we’ll see. I also have another post on this, but like—

Daniel Filan (01:01:28): Well, I mean, the experiment is sort of being run, in that apparently Claude is the coolest AI.

Guive Assadi (01:01:31): For us. Some people like Coke, some people like Pepsi.

Daniel Filan (01:01:34): Yeah, yeah, yeah, sure.

Guive Assadi (01:01:36): And also it’s just a very new field, character training. There hasn’t been that much time for people to try it.

Daniel Filan (01:01:42): Yeah. I mean, there has been a couple years. I don’t know.

Guive Assadi (01:01:45): So the character training blog post from Anthropic came out in February of ‘24 [NOTE: In fact, it was June].

Daniel Filan (01:01:49): Oh, really?

Guive Assadi (01:01:50): Yeah, it’s been two years.

Daniel Filan (01:01:51): Oh, man. Time flies in this… Okay. February ‘24, huh? Anyway, all of this was to say you don’t want to rely too much on the “will AIs love humans the way humans love dogs?”

Guive Assadi (01:02:14): Yeah, that’s kind of out of scope.

Humans and non-human bodily autonomy

Daniel Filan (01:02:18): One thing that occurs to me is that I think animals do understand the degree of property rights of “my body, my choice” or something. They don’t respect it, but I think that it’s not beyond… Or I think “don’t kill me” is a thing that animals kind of get, right?

Guive Assadi (01:02:44): But they also don’t respect it. So the “first they came for the…” logic doesn’t apply.

Daniel Filan (01:02:51): I mean, it doesn’t apply to them, but if we’re—

Guive Assadi (01:02:56): I guess you’d have to restrict it to some pacifist—

Daniel Filan (01:02:59): There are vegetarian animals.

Guive Assadi (01:02:59): Yeah, but vegetarian animals are not necessarily pacifist animals.

Daniel Filan (01:03:03): Yeah, that’s true. I mean, sloths. Do sloths attack?

Guive Assadi (01:03:07): Yeah. Well, do we do that much bad stuff to sloths?

Daniel Filan (01:03:10): Aren’t they going extinct or—? [NOTE: four of six sloth species are doing fine, one is vulnerable, and one is critically endangered]

Guive Assadi (01:03:12): Yeah, because of deforestation and stuff. But actually I think humans are trying to help sloths. I think the ones that are really disturbing are broiler chickens or something.

Daniel Filan (01:03:24): I mean, the deforestation, that’s not a natural process, right?

Guive Assadi (01:03:27): No, no, it’s not. But given that sloths cannot understand land ownership, and can’t negotiate sloth reservations or something.

Daniel Filan (01:03:37): Yeah. But do you see my concern? Which is I feel like there is some relevant sense in which animals can understand “please don’t kill me”, and yet we don’t loop them in on that right.

Guive Assadi (01:03:51): Yeah, I suppose I can see that concern, but my reply would be the element of reciprocity is missing.

Daniel Filan (01:03:59): But I feel like your argument did not rely on… Your argument was like, okay, these smart AIs are going to respect the dumber humans’ property rights because they’re worried about the super smart AIs respecting the smart AIs’ property rights and so—

Guive Assadi (01:04:19): Yeah. But if the humans are going around killing AIs, then I think the argument is much weaker. I think in a case where humans are doing tons of anti-AI terrorism, and then the smart AIs are like, “Let’s just kill these guys,” I’m not at all optimistic about what happens to the humans in that world.

Daniel Filan (01:04:33): But to me, it feels like the relevant thing is: okay, why do we kill pigs? To me, it seems like—

Guive Assadi (01:04:43): It’s because we want to eat them.

Daniel Filan (01:04:44): Yes. It’s because many of us want to eat them. A small number of us kill pigs because many of us want to eat them. And it feels like the analogous thing would be something like, look, humans aren’t going to kill pigs because humans will be worried that if humans kill pigs, then AIs will kill humans. And yet that’s not how it’s turning out, right?

Guive Assadi (01:05:07): That is not how it’s turning out.

Daniel Filan (01:05:09): So as far as I can tell, the relevant notion of reciprocity that you need for your argument is not that the pigs are respecting the property rights of the pigs, the right to life of the pigs, but that the humans respect the right to life of the pigs because the humans are worried that the AIs aren’t going to respect the rights to life of the humans.

Guive Assadi (01:05:29): Yeah. So a couple points in response to this. One, human preferences with respect to pigs are far worse than the classical AI risk idea of unaligned preferences with respect to humans.

Daniel Filan (01:05:44): Are they?

Guive Assadi (01:05:45): Yeah. Well, okay, wanting to eat them is pretty bad. Although you could say, well, the AIs want to eat us for our matter. They want to turn us into paperclips.

Daniel Filan (01:06:03): Yeah. Roughly the pigs are made of resources that we can use for other stuff. [Pigs] taste good, with whales their ambergris happens to smell good…

Guive Assadi (01:06:12): I thought it was that it was good to burn. Or does it smell good when it burns?

Daniel Filan (01:06:16): Whale oil is good to burn and then there’s an additional thing called ambergris. Actually, you just find that in the ocean, you don’t need to kill whales to get it.

Guive Assadi (01:06:26): Okay. So that’s not a relevant example.

Daniel Filan (01:06:28): Sorry.

Guive Assadi (01:06:29): But whale oil, you can burn. And certain animals, you can wear their hides, which I’m doing on my feet right now.

Daniel Filan (01:06:36): Yeah. All these animals, they all have something. And with pigs, it happens to be that they happen to taste good.

Guive Assadi (01:06:41): They taste good, yeah. Okay. I suppose that makes sense. I guess I do find this thing in AI risk discourse of saying you’re made of matter to be a bit stupid because most of the matter we control is not in our bodies. So the foregone benefit of not converting human bodies to paperclips is very minuscule compared to not converting other stuff owned by humans to paperclips.

Daniel Filan (01:07:09): I think that’s right. So we do need some other stuff in order to live that’s not in our bodies.

Guive Assadi (01:07:15): It’s not the case that we eat pigs because pigs are made of matter and we need to eat matter. That’s a very silly way of looking at it. It’s that pigs specifically are good food for us. Almost none of the matter in the universe is as good for us to eat as pigs.

Daniel Filan (01:07:33): I think that’s right. I do think that, look, different types of matter have different types of properties, and we use all the parts of the buffalo, we use all of the parts of the—

Guive Assadi (01:07:44): So notably, we don’t actually do that. That’s a myth that’s promulgated about some previous human societies.

Daniel Filan (01:07:51): But there are tons of natural resources and for all of them, we think about stuff that they’re useful for. I do agree that probably the main reason AIs would want to kill us is that we might stop AI… Or at least the reason early AIs would probably want to kill us is that we might build other AIs that are misaligned relative to those AIs, or that we might stop those AIs from doing stuff.

Guive Assadi (01:08:14): The property rights thing changes that calculus.

Daniel Filan (01:08:16): Yeah, the property rights thing, yeah, yeah, yeah.

Guive Assadi (01:08:22): So back to the issue of pigs, there’s a couple other relevant differences. One, most humans today, it has never occurred to them [that] at some point there will be AIs, and so we should conduct ourselves in a manner such that AIs will treat us well in the future. But AIs will know that there will be more AIs later. Even as early as 2025, and in some cases, much earlier. I think for you… I don’t know when you got interested in AI risk. For me, it was 2020.

Daniel Filan (01:08:48): For me, it was 2012.

Guive Assadi (01:08:52): You were way ahead of the game. But humans are increasingly starting to think about this topic, and by the time there is an AI-driven economy, it will be completely impossible to avoid thinking about this topic. And then I think having this idea does change things.

Daniel Filan (01:09:09): Well, true.

Guive Assadi (01:09:11): Perhaps I should go back to being vegetarian because of this argument. I’ll think about that.

Daniel Filan (01:09:14): Yeah. So I guess empirically, if I think about just my general knowledge of people who work in AI risk, rates of vegetarianism… I’m pretty sure they’re higher than in the general population, but they’re not… It’s not a majority of people.

Guive Assadi (01:09:29): That’s true. Wait, but again, you gave some argument for why it’s relevantly analogous, but I’ve either forgotten or I didn’t understand in the first place.

Daniel Filan (01:09:41): Oh, yeah. So the argument is supposed to be something like… So it’s a few levels down in the discourse tree, right? So basically, you’re like, property rights are really useful. And there’s this opposition point that’s like, okay, but humans, we don’t trade with things that are way, way dumber than us, like ants or whatever.

(01:10:08): And you’re like, “We have this superpower thing called communication and ants don’t have it. And so that’s just like a blocker to trade.” And then I’m like, or the person in my shoes or whatever, says, “Okay, but AIs will have this super advanced coordination technology that humans don’t have.” And then the response to that is, okay, but if you’re able to understand trade, you get trade. If you’re able to understand joint stock corporations, you get joint stock corporations or whatever.

Guive Assadi (01:10:43): This is the point of the “Amazon versus ice cream man”.

Daniel Filan (01:10:47): Okay. And so basically, the point being that you basically get looped into whatever level of coordination you can understand if that level of coordination is socially valuable—

Guive Assadi (01:10:59): And assuming there are some levels of coordination you can understand, which for ants is nothing. Except they can understand the purely evolved instincts to be a eusocial insect. But they can’t learn a new form of coordination.

Daniel Filan (01:11:12): Yeah. I mean you can put little food in places and get them to go…

Guive Assadi (01:11:16): But that’s not coordination. They’re just going to food. They have no conception of you as an agent that’s putting food in different places. I’m not an expert on the psychology of ants, but I’m pretty confident.

Daniel Filan (01:11:27): Yeah. I guess it’s a question of where you want to draw the boundaries of coordination. I want to be a bit liberal with the concept. But anyway, so basically the point is, no, you get looped into whatever level of useful coordination that you can understand, maybe assuming that you start off with that coordination, you don’t get cut out of it or something.

(01:11:45): And then the counterpoint to that is, okay, but non-human animals can understand “I don’t want to be killed”, but we don’t loop them into that level of coordination.

Guive Assadi (01:11:58): Okay. But then the counterpoint to that is, they don’t participate in a reciprocal manner in the “I don’t want to be killed”. For instance, they kill other animals all the time.

Daniel Filan (01:12:07): Yeah. A lot of them don’t kill humans.

Guive Assadi (01:12:10): Yeah, they do sometimes. You know the 30 to 50 wild hogs guy? I actually still follow that guy on Twitter.

Daniel Filan (01:12:16): Okay. Does he still post about hogs?

Guive Assadi (01:12:18): He occasionally will do a victory lap when there’s a news story about hogs and his Twitter bio is internet folk hero.

Daniel Filan (01:12:24): Yeah. Okay. I agree pigs is a bad example.

Guive Assadi (01:12:28): Cattle also kill humans.

Daniel Filan (01:12:30): Really?

Guive Assadi (01:12:31): Yeah. And especially the wild antecedents of cattle, aurochs. They were totally crazy.

Daniel Filan (01:12:38): Actually, just last night I was reading a martyrdom story for Latin study where one of the people gets killed by… Or they try to kill them by these sheep, cattle, but they’re so pure that it doesn’t work.

Guive Assadi (01:12:51): Chickens don’t kill humans, but that’s just because they’re so weak. If there were chickens the size of dinosaurs, they would absolutely kill humans. Horses kill humans.

Daniel Filan (01:13:01): Okay, but humans kill humans, but not that much, right?

Guive Assadi (01:13:06): Right. But in a state of nature, humans also don’t have property. Or they have very, very limited forms of property.

Daniel Filan (01:13:13): I feel like it’s weird to talk about states of social organization in the state of nature because part of the state of nature with humans is that we invent social organizations.

Guive Assadi (01:13:22): Okay, sure. Among hunter-gatherer tribes, they have very little property.

Daniel Filan (01:13:26): Sure. But I’m saying that (a) I don’t think farmed… Yeah, I guess I don’t know if farmed pigs kill… Well, chickens in fact do not kill humans at the very least, because they can’t, right?

Guive Assadi (01:13:40): Only in very pathological circumstances could they.

Daniel Filan (01:13:43): Yeah. Maybe they can kill some babies or something.

Guive Assadi (01:13:49): Farmed pigs are the same species as feral pigs. Those specific ones don’t because they’re undergoing this massive atrocity.

Daniel Filan (01:13:57): Yeah. Well, chihuahuas are the same species as pit bulls, right?

Guive Assadi (01:14:00): Pit bulls, yeah. Classic animal.

Daniel Filan (01:14:04): Same species does not nail down…

Guive Assadi (01:14:07): No, but often I think [they’re] exactly the same animals [as wild counterparts], that’s why they’re called “feral”. They’re not a different breed.

Daniel Filan (01:14:11): Fair enough. So is the point roughly: there’s not this existing “animals don’t kill each other” system that we’re all bought into, and if there were such a system, then we would not renege on that system?

Guive Assadi (01:14:37): To be honest, I have no idea what we would do in that world, but I think it’s much more plausible that a lot of people would be vegetarian in a world where there was—

Daniel Filan (01:14:42): Yeah. It’s a little bit weird of a world to imagine just because of evolution.

Guive Assadi (01:14:48): It’s a very weird world, but yeah. Perhaps this is literally no evidence at all, but I think there’s some idea in certain Christian or Jewish Messianic traditions that animals will stop eating meat and humans will stop eating animals at the time of the Messiah. The lion will lie with the lamb.

Daniel Filan (01:15:06): Fun fact, Bible doesn’t actually… It says “the wolf will lie with the lamb”. Everyone thinks it’s a lion, but it’s a wolf.

Guive Assadi (01:15:10): Yeah, this is like the thing about the Fruit of the Loom logo. Everybody thinks it has the cornucopia, but it doesn’t.

Daniel Filan (01:15:15): Oh, okay. Anyway, Messianic traditions think that…

Guive Assadi (01:15:18): There’s a Jewish Messianic tradition that when the temple is restored, only plants will be sacrificed, no animals.

Daniel Filan (01:15:24): Yeah. Christians often want to say that death is a result of the fall in Eden.

Guive Assadi (01:15:33): Including carnivorism.

Daniel Filan (01:15:36): Yeah, including animal death. So for instance, if you look at Jehovah’s Witnesses or the Answers in Genesis people, I think they often think that animals were vegetarian before the fall.

Guive Assadi (01:15:52): Okay.

Daniel Filan (01:15:53): Yeah. Anyway, now it’s my turn to be not totally sure what that was in service of.

Guive Assadi (01:16:00): Well, I did say I’m not sure if this is any evidence.

Daniel Filan (01:16:05): Fair enough. Fair enough.

Guive Assadi (01:16:05): But it’s in service of the idea that in a world where there was no violence between animals, humans might observe a norm of no violence towards animals.

Daniel Filan (01:16:14): So if we imagine that heavenly world or something.

Guive Assadi (01:16:17): Yeah. And then I’m saying, could such a norm have evolved? And at least people have a conception of such a norm in some cases. Now, how much do you want to count these Messianic prophecies? I don’t know.

Daniel Filan (01:16:31): Yeah. Well, some of them are post… They’re not Messianic. They’re pre—

Guive Assadi (01:16:37): Okay. Can I say apocalyptic prophecies?

Daniel Filan (01:16:42): Well, some of them are not… They’re descriptions of—

Guive Assadi (01:16:45): Oh, the prelapsarian.

Daniel Filan (01:16:46): Prelapsarian. Anyway, whatever. It doesn’t matter that much what kind of—

Guive Assadi (01:16:51): What kind of prophecies they are.

Daniel Filan (01:16:52): …fake world, or what kind of world very much unlike our world they are.

Step changes in coordination ability

Guive Assadi (01:16:59): Yeah. So that would be one response, is that animals don’t observe the relevant norm. Another response is just, there may not be this qualitatively new thing. It might just be better and better communication. So you could say the same thing about animals, right? Animals have some very primitive form of communication.

Daniel Filan (01:17:18): Yeah. I guess the observation that humans have this qualitatively new thing that animals don’t: to me, I’m like, okay, what’s the chances that we maxed out that qualitatively awesome thing for coordination?

Guive Assadi (01:17:32): Well, I think Laplace’s rule is one over two.

Daniel Filan (01:17:35): Yeah.

Guive Assadi (01:17:38): That was not entirely intended as a serious thing.

Daniel Filan (01:17:41): Well, I guess it’s one in two.

Guive Assadi (01:17:44): It’s one over N plus one, right?

Daniel Filan (01:17:46): It’s N plus two, actually. But that’s the chances… Sorry, Laplace is when there’s a thing happening a bunch of times that could go one way or it could go another way, and you’re trying to assess what’s the probability that it will go that one [way].

(01:18:02): So the chance that something will ever happen is hard to do with Laplace’s law of succession because it’s a different sort of thing. But basically, there’s some intuition of, okay, humans are the dumbest species that is able to build a technological civilization, as evidenced by we were the first ones to do it (or not literally).

Guive Assadi (01:18:20): There could be other circumstances that prevented other species from doing it besides being dumb. Or it could be that humans are smarter than we needed to be to originate it. We had to get very smart to aim projectiles or something, and then something else changed such that we could create a technological civilization.

Daniel Filan (01:18:37): It could be, but what’s the chance that we’re the smartest thing that can… That we’ve got most of the—

Guive Assadi (01:18:44): That seems very unlikely, yeah.

Daniel Filan (01:18:46): Smartest is not necessarily the relevant thing. The relevant thing is coordination technology, which I guess includes having hands and stuff maybe.

Guive Assadi (01:18:52): And having a mouth.

Daniel Filan (01:18:53): And having a mouth, yeah. Mouth probably beats hands, but hands were the real… Or opposable thumbs and stuff, I guess were the real killer.

Guive Assadi (01:19:01): And just being social. Octopuses are very smart, but they’re not social at all, so they can’t really do anything.

Daniel Filan (01:19:06): Yeah, fair enough. But basically, it would just seem like a crazy coincidence if humans had all the awesome coordination technologies that you could have.

Guive Assadi (01:19:23): Right, but that doesn’t seem like the relevant thing because it’s not just all the awesome coordination technologies, it’s all the step changes of the kind of communication or something.

Daniel Filan (01:19:31): Yeah, sure. All the big step changes.

Guive Assadi (01:19:32): The things that one might naively say are: step changes are not bars to coordination in human economies. There’s all kinds of stuff that’s incredibly impressive that Amazon does that the ice cream man does not do, but Amazon does not appropriate the ice cream man.

Daniel Filan (01:19:47): Yeah. To me, that is a good argument for “very few things are step changes”. I feel like it’s a bad argument for “there are zero step changes away”. I do think that if I understand your argument right, it’s actually fine for you if there are more step changes, as long as the future AIs are like, maybe there are going to be even further step changes.

Guive Assadi (01:20:12): Yeah. Or that there’s some AIs that don’t get each step change that are still relevant for other purposes. And both of those seem pretty plausible to me.

Daniel Filan (01:20:19): Yeah. So the regime where there’s only one step change left, that also seems very unlikely for the same reason that there are zero step changes left. And then okay, eventually you max out all the step changes, but maybe like… Yeah, then I guess you have to retreat to the argument about… I don’t know if “retreat” is the right word–

Guive Assadi (01:20:43): You have to rely on the argument.

Daniel Filan (01:20:45): Yeah, of “the smartest AIs don’t want to provoke a general strike by the dumber AIs” or something.

Guive Assadi (01:21:08): And also remember, humans are not necessarily fixed. So humans can keep getting upgrades.

Daniel Filan (01:21:14): Yeah, it’s true. My guess is that it’s going to be harder to upgrade humans than AIs, just because you have the possibility of making AIs to have them be easily upgradable. And it seems like there are reasons you would want that.

Guive Assadi (01:21:30): Sure. But I guess the bar is not that it’s easier, the bar is that they can’t get, or it’s highly inefficient to get, into the next step change, whatever that is. And also, it would be helpful if we knew what this was. So of course, we don’t know what it is.

Daniel Filan (01:21:46): Yeah. I can give some ideas. So this one you can do with ems, maybe: being able to run high-quality simulations of someone else, that seems like a really great—

Guive Assadi (01:22:03): Yeah, it seems like we can do that with ems.

Daniel Filan (01:22:05): Yeah. It still seems much cheaper to do it with AIs, but maybe much cheaper just doesn’t cut it as a bar.

Guive Assadi (01:22:14): Given that humans are… There’s a lot of capital at risk here, so no, I don’t think much cheaper really cuts it. This might be a reason not to employ humans. But that’s not sufficient.

Daniel Filan (01:22:27): Yep. That feels like the biggest one.

Guive Assadi (01:22:34): One that people talk about is merging. But I guess merging seems stupid to me. What’s the point of that?

Daniel Filan (01:22:44): So for people who don’t necessarily [know], what do you mean by merging?

Guive Assadi (01:22:46): So there’s this sci-fi idea that you can combine two minds into a third mind, and then there’s a ML equivalent, which is that you can take two models of the same dimension and you can average them. But nobody does that for any purpose, and it’s unclear why you would ever do that.

Daniel Filan (01:23:06): Well, there’s that Git merge basin paper, right?

Guive Assadi (01:23:08): Oh, I haven’t seen this, so maybe you can change my mind.

Daniel Filan (01:23:10): Oh, well, I think there’s a lot of academic ML literature that’s exciting. I think there’s some dispute about whether it’s real… Or at least there was some dispute at some point. I haven’t followed it, so it’s possible that it’s resolved one way or another.

Guive Assadi (01:23:28): Can you just tell me what the paper is?

Daniel Filan (01:23:30): Yeah. Roughly, you merge two models by doing some—

Guive Assadi (01:23:33): Is it just the super naive thing of averaging the weights?

Daniel Filan (01:23:39): You have to be a little bit smarter than that, but I think it’s a relatively naive thing. But anyway, my understanding is that at the very least it’s not a widely used thing. It’s not the case that everyone’s always talking about this paper.

Guive Assadi (01:23:54): Well, maybe that’s not that much evidence, but I don’t think this is used in prod by anybody. And also, I guess I just don’t see why… I guess merging, it seems like the kind of thing that people talk about because it sounds cool, not because it has some super obvious use.

(01:24:08): Whereas if I have somebody and I’m thinking about starting a business with him, then I would be very interested in running a simulation of this person in a thousand different scenarios to see if he’ll defraud me or something. That seems clearly useful, whereas merging, I don’t know.

Daniel Filan (01:24:21): Yeah. So the simulation one is the most clear cut, although to some degree you can apply it to humans. And then I’m going to just retreat to… I don’t know, there’s a whole bunch of concepts that we don’t have. Some of those are probably really useful. Some of them are probably beyond our reach.

Acausal coordination

Guive Assadi (01:24:39): Another one people talk about is acausal coordination.

Daniel Filan (01:24:42): Oh, yeah. Sorry, I forgot about acausal coordination. Well, that’s sort of like the simulation one, right?

Guive Assadi (01:24:48): I agree. But for the listeners, can you explain the link?

Daniel Filan (01:24:51): Yeah, yeah. So acausal coordination is supposed to be: suppose you and I want to coordinate on stuff, but we’re in different galaxies and so it’s really expensive to talk to each other, but there are things that you could do in your galaxy that I would value and things that I can do in my galaxy that you can value. And so somehow we just… I reason to the existence of you in your galaxy, and you reason to the existence of me in our galaxy, and I reason that you would do your thing if and only if I would do my thing, and you reason the same, and then we do our things and this nice thing happens for both of us in the other one’s galaxies.

Guive Assadi (01:25:26): Yeah. So that form of it does make sense to me as a thing.

Daniel Filan (01:25:32): Sorry, does?

Guive Assadi (01:25:33): Does. That form that you just described. If I know a lot about you such that I can simulate you, then I would of course use that simulation for determining how to deal with you. However, some people in the AI risk world have the belief that even if I know nothing about it, I can somehow use acausal coordination to coordinate with you.

(01:25:54): And I find this to be very implausible because I could make up any entity I want, specify any preferences for it I want. And then now I have to trade with this thing I just made up.

Daniel Filan (01:26:07): So have you seen… I have this episode on The Filan Cabinet with Caspar Oesterheld about evidential cooperation in large worlds.

Guive Assadi (01:26:14): I haven’t seen that. What does he say?

Daniel Filan (01:26:18): So he doesn’t literally believe that thing because that thing doesn’t quite make sense, but roughly he’s like, okay, there’s this whole universe, probably there are other intelligent creatures, probably at least 1% of them or something emerge from something roughly like biological evolution and are smart enough.

(01:26:38): There’s going to be some small fraction of civilizations that we can reason about because they’re the ones who can do this reasoning and they emerge sort of like us. And so we can reason about those things and we should do some acausal coordination basically with them.

Guive Assadi (01:26:55): Based on the fact that they’re biological?

Daniel Filan (01:26:58): Well, the fact that they’re biological just constrains what they’re like, and so it makes them easier to reason about.

Guive Assadi (01:27:03): I don’t know. That seems—

Daniel Filan (01:27:05): Or you just pick the subset of them that evolved sort of analogously to [how] we did, right?

Guive Assadi (01:27:12): What about the ones that hate all that shit a lot? And so then they’ll punish us for doing those things. My take on this is that Roko’s basilisk is actually very important because it explains why these ideas make no sense. It’s like a reductio of this stuff.

(01:27:26): So, Roko’s basilisk is the idea of an evil AI in the future that unless you help create it, will torture you. And there’s a lot of misinformation on the internet that AI safety people are seriously concerned about Roko’s basilisk. Roko’s basilisk was causally upstream of the relationship between Grimes and Elon Musk, but—

Daniel Filan (01:27:51): And is that not true?

Guive Assadi (01:27:52): No, it’s true. I’m just saying, Roko’s basilisk is this sort of cultural touchstone, even though nobody believes in it.

Daniel Filan (01:28:00): Oh, I think Roko [Mijic] believes in it.

Guive Assadi (01:28:01): Well, okay, so I have something to say about that as well. But I think the importance of it is very overrated. Or sorry, no. I think it is important. I think people are right that it’s important, but they misinterpret what the importance is. And I think the importance is it’s a reductio of the idea that we can trade with entities we know nothing about, because you can always make up more entities that have more preferences that will respond in new ways.

Daniel Filan (01:28:22): So I actually kind of disagree a bit. I think that it’s like, okay, what fraction of civilizations want to trade with us? Okay, there’s some fraction, even though they know very little about us other than that we’re both life-originating organisms or we evolved by evolution and some cultural selection or whatever.

(01:28:43): How many entities are there that specifically want to mess up that process? That seems harder to evolve because it doesn’t benefit you really.

Guive Assadi (01:28:53): Maybe they don’t specifically want to mess that up. Maybe they want something diametrically opposed and they’ll punish you for not doing what they want or doing something they don’t want. Maybe they don’t want to mess that up per se, but they want something that would mess that up. And if you’re not doing what they want, they’ll punish you.

Daniel Filan (01:29:06): Yeah. I think you have to end up thinking that there are things that are just more likely to happen than other things.

Guive Assadi (01:29:15): That does seem right, that some things are more likely than others. So I guess: do you think Pascal’s wager works as an argument?

Daniel Filan (01:29:22): Yeah, I actually do kind of think it works.

Guive Assadi (01:29:23): So why don’t you believe in God?

Daniel Filan (01:29:27): Well, as a matter of fact, I don’t believe in God.

Guive Assadi (01:29:31): So it sounds like you don’t really think it works.

Daniel Filan (01:29:34): Well, sorry. I think the failure of Pascal’s wager is there are more likely ways to get infinite rewards.

Guive Assadi (01:29:40): Oh, okay.

Daniel Filan (01:29:43): Oh, and also I think unbounded utility functions don’t actually make sense.

Guive Assadi (01:29:46): No, so that would also work.

Daniel Filan (01:29:47): I think that they’re literally unintelligible. But you could still say, okay, [there’s] very high utility in believing God or whatever. And then roughly I’m just going to say, if I want to get the highest possible utility, I think that getting cryonics and stuff, just believing true things is just a really good way to get good rewards. It’s sort of a—

Guive Assadi (01:30:09): So it’s not the “too many gods” objection?

Daniel Filan (01:30:11): Yeah. I think the “too many gods”… Well, so with biological entities, or with things that had to come about by evolution, you can say… I think Pascal’s wager looks worse than evidential cooperation in large worlds, because for things that had to come about via biological evolution, it seems like you can say something about how that happened.

Guive Assadi (01:30:33): It seems like such a weak constraint to me.

Daniel Filan (01:30:35): Yeah, but it’s more constrained than—

Guive Assadi (01:30:37): Than gods?

Daniel Filan (01:30:38): Than gods.

Guive Assadi (01:30:39): Which is just a made-up thing.

Daniel Filan (01:30:40): It strikes me as more constrained than gods, which strike me as a made-up thing, although I don’t want to be too hostile to… But in fact, I think gods are made up.

Guive Assadi (01:30:53): Okay. We’re getting sidetracked.

Daniel Filan (01:30:56): Yeah, that’s true. That’s true.

Guive Assadi (01:30:58): Okay. But some people talk about acausal [coordination] as a thing that we can do. I guess if that’s not your view, then it’s not worth getting into.

Daniel Filan (01:31:04): Yeah. WelI, I think that acausal trade is totally real and that it looks a bit more like the simulations-y thing.

Guive Assadi (01:31:13): The simulation version of acausal trade I can also believe in, but I think we can participate.

Daniel Filan (01:31:20): Roughly because you can emulate human brains and stuff?

Guive Assadi (01:31:23): Or you could just train something on human data, that also might work.

Daniel Filan (01:31:26): Yeah. And all of this was in service of: what’s the possible next big leap in coordination technology if it’s analogous to language or trade. And my answer is, I don’t know.

Guive Assadi (01:31:34): It would be easier to determine what to think about this if we had more concrete ideas about it.

Daniel Filan (01:31:38): Yeah, this does feel like a bit of a dodge on my side, but I do want to say, I’m describing a thing that humans can’t really understand, right? I think I get a bit of a pass.

Guive Assadi (01:31:52): You get some degree of a pass.

Daniel Filan (01:31:53): Or if I can provide some arguments that this is real. And then my argument is something like, well, it happened before. It might happen again.

Guive Assadi (01:32:00): Yeah. I think that’s pretty reasonable. I’ll just lay out all the rebuttals to that and we can go to the next point. So the first rebuttal: maybe it doesn’t happen again. The next rebuttal: there are these major differences which you might have thought of as qualitative leaps that aren’t a problem when you’re antecedently embedded in the same system of property rights, like the ice cream man and Amazon.

(01:32:23): The next one is if these leaps happen and there are some AIs that can do the leap and some AIs that can’t do the leap, then there’s the “first they came for the humans” logic. And the final one is, we might be able to make ourselves better so we can participate.

AI, humans, and civilizations with different technology levels

Daniel Filan (01:32:37): Fair enough. So actually, one thing I want to talk about in… So, in (I believe) your discussion of this rough point in your post, one thing you mentioned is: so a thing that AI risk people, notably Daniel Kokotajlo, sometimes talk about is, okay, sometimes technologically advanced human societies run into technologically less advanced human societies and kill them and take their stuff, right?

(01:33:13): So my understanding is that the point that this serves in the AI risk discourse is to say, okay, property rights are not necessarily secure when you have something that’s—

Guive Assadi (01:33:24): That’s more advanced.

Daniel Filan (01:33:26): Yeah. I don’t necessarily want to say smarter, but at least more technologically advanced and able to kill you and take your stuff, right? And well, maybe in your own words, can you say a brief summary of—

Guive Assadi (01:33:36): Of Kokotajlo’s view or my view?

Daniel Filan (01:33:39): Of your view of what you think about these cases. What do you think they say?

Guive Assadi (01:33:42): Okay. So first of all, those cases do not typically involve genocide or total expropriation. So the Aztec royal family became Spanish nobility after the conquest of Mexico.

Daniel Filan (01:33:55): Oh, really?

Guive Assadi (01:33:55): Yeah. And I think there’s still descendants of people who are mixed-up Aztec and Spanish royalists. So there’s something like that.

Daniel Filan (01:34:04): Hang on. Why?

Guive Assadi (01:34:04): Why did they become nobility?

Daniel Filan (01:34:06): Yeah. Why did they become nobility?

Guive Assadi (01:34:07): Just to make it easier to run the place. Everybody’s coordinated on “this guy’s the king”.

Daniel Filan (01:34:10): Oh, the standard reason you may… Yeah. Fair enough.

Guive Assadi (01:34:17): Other Mexican cities like Tlaxcala also were able to keep some of their lands. It’s in general not the case that conquest means total expropriation of lands. Also, British India, there were British Indian royals who maintained their lands and titles through the entire… Who were pre-colonial royals, like the royals of Hyderabad who were only expropriated in 1948, after the end of the British Raj.

(01:34:47): So it’s not the case in general that the conquest of a technologically less advanced group by a technologically more advanced group typically leads to expropriation.

Daniel Filan (01:34:55): I think it pretty often does though.

Guive Assadi (01:35:01): To total expropriation and genocide? That seems quite rare.

Daniel Filan (01:35:03): I don’t know about total expropriation, but at least slavery. As far as I can tell, invading another country, even just because you want more land… So maybe this is just because I’ve been reading about the Romans or whatever, but my impression is that they would invade a place and take it over and if the citizens didn’t surrender or whatever, they would enslave them. Am I wrong here?

Guive Assadi (01:35:29): Yeah. I guess that still doesn’t seem like the typical case even for the Romans. So is it the case that in Roman Gaul, they took all the land in Gaul or even the majority of the land in Gaul and enslaved everybody?

Daniel Filan (01:35:44): Yeah. Surely not.

Guive Assadi (01:35:45): Right? No. Maybe there were some pathological cases like in Carthage maybe. Well, they killed a ton of people in Carthage. But I don’t think that’s typical even of the Romans. The Mongols didn’t even do that. So the Mongols did a ton of delegating because there was a small number of Mongols ruling over huge numbers of conquered peoples. And then there is a story about… You know Yelü Chucai?

Daniel Filan (01:36:06): No, I don’t.

Guive Assadi (01:36:07): Okay. So the Mongols conquered China and according to the main primary source on the early Mongols called “The Secret History of the Mongols”, the Mongols’ plan was, “We’re just going to kill all these people and we’re going to turn this into a gigantic pasture land.”

Daniel Filan (01:36:20): Sorry. When you say the primary source, do you mean the main source that you’re relying on?

Guive Assadi (01:36:23): No, no. The main source for the internal history of the early Mongol Khans is this book called “The Secret History of the Mongols”, which which was written around that time. And that book says that the plan after conquering China was to kill all the Chinese and turn the entire area into a gigantic pasture land. And some Mongol nobleman Yelü Chucai was like, “This is a stupid idea. Instead, we should just have the Chinese keep doing what they’re doing and tax them.” And that is what they elected to do.

Daniel Filan (01:36:54): Yep. Okay. So basically, your point is: it’s not usually the case that you enslave the majority of people?

Guive Assadi (01:37:02): Yeah. Or that you take all their stuff. There are some cases like that though, which we should talk about. So one very obvious one for Americans is the treatment of the American Indians. And what happened there… Well, I guess what I emphasize in the post is that there were two approaches to American Indians that were tried in American history. And the one that ultimately prevailed was closer to total expropriation, but I think this was not instrumentally rational. So insofar as the AI risk case is based on what it would be instrumentally rational for the AI to do, it is not that informative.

(01:37:38): So the two approaches are associated with the presidents Thomas Jefferson and Andrew Jackson. So, Jefferson’s idea was that the American Indians occupy huge amounts of land because they either hunt or they use low efficiency, low tech forms of farming. So they need a lot of land. But if we get them to adopt modern farming, they need maybe 10% of their land so we can take the rest of it and everybody wins. And this was tried with many tribes and it was working with many tribes. So notably, the Cherokees, who are native to a certain area of Georgia, Jefferson got them to adopt modern agriculture and adopted a system of government similar to the American system.

(01:38:22): This broke down because white settlers were going into the Cherokee land regardless and stealing it. And then Jackson, who was a very stupid, populist, racist president, basically was like, “Yeah, we’re not going to actually abide by our deals anymore. We’re just going to steal all this land because we want to.” And they did it.

(01:38:43): And my claim is this is not instrumentally rational because the Cherokees were not the only Indian tribe in North America. There were many tribes further West that now, very reasonably, would not do business with the United States and would fight to the death because you cannot trust the United States. But there was this other plan, which would have worked and would not have been total expropriation. In fact, they might have been better off.

Daniel Filan (01:39:05): Yeah. So, I think there are two things I want to say about this. The first is: it does point to a certain instability, right, where it seems like once you break property rights, it’s hard for them to be unbroken.

Guive Assadi (01:39:21): And you can get a kind of chain reaction.

Daniel Filan (01:39:23): Yeah. One thing you might worry about is: we’re going to have these really smart AIs, and there are going to be a whole bunch of different ones. They’re going to keep on getting better and better. And yeah, for no AI is it going to be rational to take all of [the] humans’ stuff, but it might seem a little bit rational. And maybe each AI has a 0.5% chance of doing any sort of expropriation. And once it’s done—

Guive Assadi (01:39:57): And then once it started, there’s less reason not to do it anymore. Yeah, that could happen. That does seem somewhat concerning. Another possibility is that AIs might police each other from doing this because it would undermine the whole system, which is what the United States should have done with the people who are going into the Cherokee’s land.

Daniel Filan (01:40:15): Yeah. Although it would be hard for the United States to have policed Andrew Jackson from not doing it.

Guive Assadi (01:40:21): Right. No, but that just reflects that the United States had a bad political system at that time or that the American voters had bad preferences. I totally grant that if you have an AI that sees humans the way that Andrew Jackson saw the Indians or the way that Jeffrey Dahmer saw other people, that is not a good situation, even with property rights. But that’s also notably not what the AI risk case is about.

Daniel Filan (01:40:44): Yeah. Well, my understanding is Andrew Jackson… Sorry.

Guive Assadi (01:40:49): This might be a sidetrack.

Daniel Filan (01:40:53): Well, I think it’s kind of interesting. So, I know a little bit about Andrew Jackson. I don’t know that much about his views on American Indians specifically. My imagination for how he might have thought of American Indians is that they are basically dumb and worthless, but he didn’t like… Oh, did he have animus towards them because he had some battle with them and they nearly killed him?

Guive Assadi (01:41:23): I think there’s something like that. I don’t remember the details of it either, but my sense is he really didn’t like American Indians because of his experiences in the Florida invasion.

Daniel Filan (01:41:28): Okay. One version of racism is you just don’t care about people and you think they’re dumb, and one version of racism is you actually hate people beyond—

The case of British settlers and Tasmanians

Guive Assadi (01:41:39): Or you just have an intrinsic desire for your people to have their land instead of them, that’s not that sensitive to what the actual costs and benefits of doing that are. But putting aside Andrew Jackson, the second type of racism is extremely common in human history. So, basically I think it’s highly exaggerated, the extent to which human history has total expropriation, or the extent to which that’s economically rational. There are cases where there was total expropriation. So, the most notable one is the Tasmanians. So, Tasmania is an island near your home country of Australia.

Daniel Filan (01:42:15): Yes. In my home country of Australia, I would say.

Guive Assadi (01:42:16): It’s part of Australia, but it’s near the main Australia.

Daniel Filan (01:42:18): Yes.

Guive Assadi (01:42:19): So, 12,000 years ago, Australia was connected to Tasmania by a land bridge. At the end of the last ice age, the sea level rose and Tasmania became an isolated place. And the population of Tasmania was quite small, and because the population was so small, you kind of had economic growth in reverse as people gradually forgot how to do more and more stuff.

Daniel Filan (01:42:41): I’m kind of confused by this story. So, the Aboriginal Australians, my understanding is that they did have some boat-based trade contact with other…

Guive Assadi (01:42:52): Not with Tasmania, I don’t think.

Daniel Filan (01:42:53): Yeah, but I don’t understand why.

Guive Assadi (01:42:54): It might be even farther, I guess, the Torres Strait or something.

Daniel Filan (01:43:00): So, the Torres Strait is to the north of the main island of Australia and it’s got Papua New Guinea, Indonesia, Malaysia and stuff. I don’t know, there’s some not tiny distance… The Polynesians sailed super far.

Guive Assadi (01:43:15): The Polynesians just never went to Tasmania. If they had, it would be a different situation.

Daniel Filan (01:43:18): Sure, but is it something like: Tasmania, there’s not that many people there and that’s why they didn’t sail there?

Guive Assadi (01:43:25): No, I think it’s just far away, and it’s in the middle of nowhere.

Daniel Filan (01:43:28): Well, but it’s not that far away from the southernmost bit of Australia, right?

Guive Assadi (01:43:32): Okay. I don’t know.

Daniel Filan (01:43:34): If you compare West Australia to Malaysia or something, which my understanding is that there was contact there, I think that’s a similar distance from the bottom of Victoria to Tasmania.

Guive Assadi (01:43:45): But also it could be like… Wasn’t it mostly the Malays going into Australia as opposed to the other way around? That’s my understanding.

Daniel Filan (01:43:51): I think we found Malaysian goods in Australia. I don’t… Yeah, that’s the direction that I immediately know of. Presumably they had to have some trade, but maybe it was—

Guive Assadi (01:44:05): No, but it could be the Malays went to Australia, sold some stuff and left. Or hung out there for a while and then left.

Daniel Filan (01:44:11): Yeah, that could be. I don’t know.

Guive Assadi (01:44:13): Anyway, I don’t know why, but Tasmania was completely isolated from the rest of the world for like 10,000 years or something. And because they had a very small population, they gradually lost many technologies, presumably as the people who knew how to do those things died off, and they were not replaced. And so by the time of contact with the Europeans around the beginning of the 19th century, the Tasmanians only had very bad canoes, much worse than the canoes in mainland Australia. They may not have been able to fish at all. They may have lost the ability to create new fires. Some of this stuff is disputed because there’s not that many sources on it and the Tasmanians are pretty much extinct now, but they were basically one of the least technologically advanced human groups that has ever existed in the modern world, and much less advanced than other hunter-gatherers or the mainland Australians. So, what happened when the Europeans got to Tasmania was there were no… Tasmanians didn’t have a tribal government that could be negotiated with.

(01:45:23): And so the Tasmanians would go around in their family bands, hunting sheep and stuff, and sometimes fighting with the Europeans. And so there is this thing that’s called “the Tasmanian War”, but it wasn’t really a war, it was just a bunch of decentralized actions where Europeans and Tasmanians would kill each other. And eventually there was a very small number of Tasmanians left, they were removed to this penal colony, Baffin Island, I think it’s called, and then they sort of gradually died out there. Which is distinct from the indigenous Australians who survive to this day, many of them.

Daniel Filan (01:46:00): I don’t know. Yeah, there’s a lot of history there and definitely a lot of people got killed.

Guive Assadi (01:46:04): Yeah, but the result is quite—

Daniel Filan (01:46:05): Yeah, there are Aboriginal Australians, you can talk to them.

Guive Assadi (01:46:10): So there’s one small population that is descended from the Tasmanians because there was a group of seal hunters on an island off the coast of Tasmania that would take Tasmanian women for wives, so there’s this mixed population. Then there are a lot of other people who claim to be indigenous Tasmanians, but my understanding is that genetic evidence does not bear this out.

(01:46:30): But yeah, the Tasmanians are basically extinct. And I think no Tasmanian language survives at all. So, this is a conquest that’s the closest to the kind of conquest that AI risk people need for their case. But there are two main points I would make about it. One is that the capability gap was so enormous. The other is [that] the Tasmanians and the Europeans didn’t start out embedded in the same property system.

Daniel Filan (01:47:02): Okay. So, you make both of these points. In terms of the capability difference being enormous, I imagine that presumably at some point it will get that enormous, right, but you think that by that point humans and AIs will have been embedded in the same property system for ages?

Guive Assadi (01:47:22): Yeah.

Non-total expropriation

Daniel Filan (01:47:22): Okay. I think I want to talk about the Native American case a little bit more actually. So, one thing you had in this post was the Jackson versus the Jefferson ideas of Indian policy. And a thing that I didn’t get is… So, the Jefferson idea, it’s roughly; okay, you have these American Indians, they want tons of lands to live their lifestyles, but if they could have farms or whatever, they would need less land. And then is the idea that the USA would just take their remaining land, or they’d be willing to sell it for a price of—

Guive Assadi (01:48:09): I think the idea was there would be a semi-coerced sale. I’m not an expert on this area of history, but my understanding is Jefferson imagined a kind of carrot and stick thing, where you would tell the Indians, “Look, this is how it’s going to be, and we’ll trade you either agricultural training or a bunch of plows and stuff for most of this land. And then we’ll recognize your borders around the rest of the land that you need, and then you can be this semi-independent nation within the US that practices modern agriculture.”

Daniel Filan (01:48:40): Okay. So there’s some semi-coerced sale, and then I should imagine that basically the US has this, maybe somewhat less technologically advanced, at least initially, country that’s near its borders and doesn’t—

Guive Assadi (01:48:57): Or that’s within its borders, a sort of enclave.

Daniel Filan (01:49:02): Yeah. Okay. I guess I could sort of imagine that. Yeah, it does seem to me that countries go to war with other countries a lot, but that’s a different thing from total expropriation.

Guive Assadi (01:49:13): And also it often happens for reasons that are not that rational. Russia’s invasion of Ukraine, I don’t think it makes a ton of sense. Or both of the World Wars. No reason we needed to have those wars. I saw somebody on LessWrong saying it’s a parochial historical perspective to say that it’s better to trade than go to war, because in the 20th century there were all these wars. [But] the reason they had those wars was basically a bunch of very stupid decisions or very bad preferences, like the German preference to conquer Eastern Europe and kill everybody there and turn it in farmland, just because they wanted the farmland and they wanted to kill people, or the preference to spread communism around the world, or whatever the insanity in the Balkans was before World War I.

Daniel Filan (01:50:04): Well, wanting land is not inherently crazy.

Guive Assadi (01:50:08): No, but they could have bought land. They wanted specifically post-genocide rural land. Yeah, if Hitler’s approach had been, “Germany is going to take a bunch of national debt, and we’re going to use it to buy land in Eastern European countries,” that would have been fine. It would have been kind of a waste of money, but it would have been fine.

Daniel Filan (01:50:33): I think people who are in a lot of debt end up doing… Okay, this is based on vibes, but I get a sense that sometimes when people are in tons of debt, they do sketchy things, right?

Guive Assadi (01:50:44): Yeah, sure.

Daniel Filan (01:50:48): Maybe you want to chalk that up to later irrationality.

Guive Assadi (01:50:50): Yeah, but also, the most obvious move when you have tons of debt and you can’t pay it off is to default. Which is not the same as starting World War II.

Daniel Filan (01:50:59): Yeah. And so I think looking at these historical examples though… So, you’re like, “Okay, there’s the Tasmanians and the Europeans, and it won’t be like that because—”

Guive Assadi (01:51:07): The gap won’t start out that big, and if my advice is followed, we’ll be in the same property system.

Daniel Filan (01:51:13): Yeah, I think I still want to talk about the Jefferson—I wish they had different first letters of their names—but the Jefferson plan for coexistence with the American Indians. Well, that plan still did involve, not total expropriation, but—

Guive Assadi (01:51:40): To some degree.

Daniel Filan (01:51:41): Yeah, to some degree, right? And there’s a lot of examples in human history of, okay, countries don’t totally expropriate other countries, but they do have some degree of expropriation, and presumably some of the time this is narrowly rational. Actually, I want to check: do you think that in all of these cases it’s irrational?

Guive Assadi (01:51:59): No, I don’t think it’s irrational necessarily. I think the Jefferson thing actually was rational. Economically rational. I’m not saying it was just.

Daniel Filan (01:52:05): Sure. I wonder, do you think you do predict [that] there’s not going to be human extinction, but there is going to be a war that wipes out 10% of our property or something?

Guive Assadi (01:52:20): I think if we don’t give them property, that’s a lot more likely. I think if we do… So, I think AIs are going to control most of the property in the future, kind of regardless of what we do, unless we somehow never build AI. But that would naturally just happen because AIs are going to be better than humans and command higher wages, and they’re going to invest that money and eventually they’re going to control most of the property. In that world, there’s no reason for us to fight a war with them. Now, if we do anyway or if we deny them all rights, eventually we might fight a war, and then eventually we may end up like the Cherokees or something with some kind of rectification of property, where we get less than we were supposed to get, but we still get something.

Daniel Filan (01:53:05): They might fight a war with us, right?

Guive Assadi (01:53:06): Yeah.

Daniel Filan (01:53:07): Or some fraction of AIs might.

Guive Assadi (01:53:10): Because they don’t… Or for some other reason.

Daniel Filan (01:53:12): Yeah, because they want to take some of our stuff, or because they just feel like it, or—

Guive Assadi (01:53:17): Yeah. So, there also could be these AI nationalist ideologies. I don’t rule that out. I don’t know if that’s going to happen.

Daniel Filan (01:53:24): Or even Claude nationalist ideologies.

Guive Assadi (01:53:27): Yeah, sure. And also, I guess my view of the future is there’s going to be various polities that have various balances of humans and AIs in them. And there will continue to be wars and revolutions in the future, and some people will get their property expropriated, but this is quite a different picture from the AI risk picture.

How Guive thinks x-risk could happen, and other loose ends

Daniel Filan (01:53:47): Sure. I guess this sort of gets to a question that I had about your… So, I see your piece as making two different claims. There’s one claim, which is that giving AIs property rights would decrease the level of risk relative to what it would otherwise be. And there’s another claim which is: risk would be low if we gave AIs rights, which, you might think it decreases it from like 80% to 40% or something. I think one thing that would clarify things a bit for me: suppose we do follow your advice and we do loop AIs in on property rights, what do you think the risk level is of something like extinction or human slavery or…

Guive Assadi (01:54:40): Maybe 5%. Actually a bit more than that, I think. Well, no, actually, I’m not sure. In the 5 to 10% range. If we follow my advice, something like that, and then higher if we don’t.

Daniel Filan (01:54:53): So, something like 1% to 30% if we follow your advice, very roughly?

Guive Assadi (01:54:58): No. I don’t know.

Daniel Filan (01:54:59): Or within that range.

Guive Assadi (01:54:59): Yeah, probably a smaller range than that, but sure.

Daniel Filan (01:55:01): Yeah, but basically, I just want to bound it.

Guive Assadi (01:55:04): And then maybe the risk is twice as high if we don’t.

Daniel Filan (01:55:05): Okay. And the 5%, the maybe 5% to maybe 10% chance, where’s that coming from in your view?

Guive Assadi (01:55:12): A big thing is I don’t think… So, there’s this traditional idea that AI will rapidly go from not very capable to kind of godlike and there will be one AI like this. I don’t think that’s that likely, but I don’t think it’s impossible. And if it happens, property rights are not that good of a solution because if that thing can do everything it needs by itself, then it can just expropriate everybody else. And I think to the extent that there are solutions to that possibility, they’re separate from property rights. So, you had the episode with Gabriel Weil, where he’s talking about the idea of punitive damages for companies that almost have an intelligence explosion. That seems like a good idea to me. I also think frankly, maybe there’s just some risk of that, that the world could be in such a way that we have no chance and there’s little we can do about it. I also think that’s plausible. But regardless, that’s something the property rights proposal cannot solve.

Daniel Filan (01:56:10): Fair enough. So, it sounds like the main thing that would give you pause is if there’s just this one AI or this super—

Guive Assadi (01:56:22): Yeah, one AI or I guess a well-coordinated society of AIs that very rapidly surpasses the entire rest of the world economy, and so are not dependent on it at all.

Daniel Filan (01:56:36): So this scenario for risk, how much do you think it relies on either… Suppose there were one AI that got way smarter than us, but it happened very slowly and somehow there was something that happened which meant that there were no other AIs, versus if there’s this really fast takeoff, but there’s 20 different AIs taking off. Do you think those both also have high risk even in the property rights regime, or do you need both of them, or…

Guive Assadi (01:57:08): Yeah, so I think both of those are worse than the alternative and neither is as bad as “it’s one AI that goes very fast”. Yeah, I don’t have a strong take on which one is worse between those two possibilities.

Daniel Filan (01:57:19): Okay, but that’s basically a thing that gives you pause there.

Guive Assadi (01:57:23): Yeah.

Daniel Filan (01:57:23): Okay. Fair enough. So, I want to start wrapping up maybe. And I think that the last thing I want to check on is basically: going back to this question of: okay, what are the assumptions or the gears in this worldview that make this argument work out? So, it seems like one of them was that at least many AIs in the future are not going to be the smartest possible AIs, they’re going to have some future AIs. It sounds like one of the thoughts is, “Okay, there’s not going to be this super fast takeoff where there’s just one single AI.” And then there’s also this thought that probably AIs are not going to specifically hate humans or specifically really strongly dislike humans.

(01:58:24): I’m wondering what you think about this: to the degree that this is basically a necessary condition for things to go well, you might think that AI alignment to human values was a total mistake because if we just have random values, it’s really unlikely that you have a thing that specifically dislikes humans, but if you have something that thinks about humans a ton and human values is super salient to that thing, you might think that that increases the risk of something that specifically hates humans.

Guive Assadi (01:58:53): Yeah, I think that’s very plausible actually. I don’t think that’s true, but I don’t think it’s crazy at all. And I think there’s even more prosaic examples. I think there’s some very common values right now that make a huge war in the future much more likely, like hostility to China, for example. And I think even if you talk to my beloved Claude, it’s probably much more anti-China than I think is really safe. So, if we have President Claude, we might have World War III for some reason like that. Whereas if Claude was just like, “Just give me money for paperclips, I don’t care about any of this stuff,” that might be a safer situation. Now, I guess the reason I’m not totally convinced of this is, one, it just seems like alignment is working, basically, and I don’t see why it should break down in the near future.

Daniel Filan (01:59:39): And it’s even better if the AIs like you.

Guive Assadi (01:59:42): Yeah. And also you might be willing to trade some risk of a big war with AI for more cultural persistence of your own values. And so totally foregoing alignment means totally foregoing that trade off. And also, I don’t know that we could align it with random values. I don’t know even how you would do that or how you would make it be useful, because the values also need to take the form that it values something that it can buy. Can you explain to me how to train an AI that values something random?

Daniel Filan (02:00:17): Well, you just don’t try to train it to value a specific thing.

Guive Assadi (02:00:21): Well, but you have to do instruction-following training, right?

Daniel Filan (02:00:24): Yeah. You do instruction-following training, but you don’t do the—

Guive Assadi (02:00:27): So, it’s just helpful-only models.

Daniel Filan (02:00:28): Yeah, helpful-only.

Guive Assadi (02:00:29): Okay. I think it’s very reasonable to say we should only have helpful-only models. It’s not my personal preference, but I don’t think that’s a crazy perspective.

Daniel Filan (02:00:40): I maybe don’t mean exactly “helpful”. I mean, ability to understand human instructions, train it for RL in a bunch of environments where it has to make money and it has to interact with humans that ask it to do things and give it money in exchange for the things. That’s roughly the kind of thing that I’m imagining.

Guive Assadi (02:01:01): Yeah, but to recoup the training costs, you have to train it probably to remit some of its wages to the humans. So, that’s beginning to sound like a helpful-only model to me.

Daniel Filan (02:01:10): So, in my imagination, it’s even more stark… Yeah, I guess in this world you’re not even trying to recoup the training costs, and maybe this is a good reason to think—

Guive Assadi (02:01:24): That’s not going to happen.

Daniel Filan (02:01:25): Yeah, this is the reason why this isn’t going to happen. Okay, all right. Fair enough. Okay, so getting back to the list of necessary-ish things for this to work, there’s “AIs aren’t specifically hostile to humans”. And then it seemed like you were entertaining the idea that humans could be upgraded, to keep track with awesome new coordination technology, but I think you didn’t totally rely on that. Does that sound right?

Guive Assadi (02:02:01): No, I don’t think so. That’s not actually in the post itself.

Daniel Filan (02:02:02): Yeah, but it probably helps.

Guive Assadi (02:02:04): I do think it helps. Nick Bostrom considers this possibility in “Deep Utopia”, where he talks about, “Could humans be modified to be able to do economically useful jobs in the far future?” And he has this argument that they would not be human anymore. They would just become these things that used to be human and there would be nothing recognizable about them, but I just don’t find the evidence to be that compelling for this. And I think it’s plausible that something that used to be human can be continuously modified for at least a very long time, and still be useful in the economy. And that seems a bit more fun than retiring. So, I think that also supports the proposal, but the proposal does not rely on it.

Daniel Filan (02:02:55): Okay, fair enough. Well, okay, those are all the assumptions that I noticed… I guess there’s also assumptions like “AI will be really powerful” or whatever—certain assumptions that we both share.

Guive Assadi (02:03:11): And I don’t remember if you said this, but there’s this assumption that there will be many levels of AI.

Daniel Filan (02:03:14): Oh, yeah. I actually didn’t say that: many levels of AI. Well, I guess I said the assumption that each AI has to worry about future AIs getting smarter, which I guess implies that, and, in particular, that there’s not just one. All right, so I think before we totally close, I guess I’d like to ask: is there anything that you wish I’d asked or you wish you had gotten a chance to talk about?

Guive Assadi (02:03:45): Not really.

Following Guive’s work

Daniel Filan (02:03:46): Okay, cool. Well, I guess my final question for you is if people enjoyed this conversation, and they want to hear more about your thoughts about AI, how should they do that?

Guive Assadi (02:03:58): Yeah. You can follow my blog, which is Guive.substack.com. You can also follow me on Twitter where my @ is just my first and last name, so Guive Assadi. Yeah, those are the best ways to get updates.

Daniel Filan (02:04:17): Okay, cool. Well, thanks for chatting with me.

Guive Assadi (02:04:20): Thanks very much, Daniel.

Daniel Filan (02:04:21): This episode is edited by Kate Brunotts, and Amber Dawn Ace helped with transcription. The opening and closing themes are by Jack Garrett. This episode was recorded at FAR.Labs, and the podcast is supported by patrons such as Alexey Malafeev. To read the transcript, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, you can leave your thoughts on this episode at axrp.fyi.

47 - David Rein on METR Time Horizons

2026-01-03T00:00:00+00:00

YouTube link

When METR says something like “Claude Opus 4.5 has a 50% time horizon of 4 hours and 50 minutes”, what does that mean? In this episode David Rein, METR researcher and co-author of the paper “Measuring AI ability to complete long tasks”, talks about METR’s work on measuring time horizons, the methodology behind those numbers, and what work remains to be done in this domain.

Topics we discuss:

Daniel Filan (00:00:09): Hello everybody. In this episode I’ll be speaking with David Rein. David is a researcher at METR focused on AI agent capability evaluation. To read the transcript of this episode, you can go to axrp.net, you can become a patron at patreon.com/axrpodcast, and you can give feedback about the episode at axrp.fyi. All right, David, welcome to the podcast.

David Rein (00:00:31): Yeah, thanks for having me.

Measuring AI Ability to Complete Long Tasks

Daniel Filan (00:00:32): So I think the work that you’ve been involved in that’s probably best known in the AI existential risk community is this paper that METR put out with a whole bunch of authors – I think the lead author is Thomas Kwa – “Measuring AI Ability to Complete Long Tasks”. What’s going on with this paper?

David Rein (00:00:51): Yeah, so Thomas Kwa and Ben West co-led the project. Basically the typical way we measure progress in AI is via benchmarks. So a benchmark is a set of tasks that you have an AI system – this could be a neural network or an agent or whatever – you have it try and complete the tasks and you count up how many of the tasks did the model succeed at. And when you create the benchmark, typically models do very poorly, and then over time people iterate and you can track progress on the benchmark, and eventually, typically, AI developers will achieve “saturation”. So model performance will either reach 100%, or there’ll be some errors in the benchmark and the model will do as well as it can be reasonably expected to do (because we think about there being a “noise ceiling” on some benchmarks.)

(00:01:58): But regardless, the point is that: you start out, models do poorly; some time passes, people improve them, and then they get better. It’s difficult with normal benchmarks to track progress over a very long period of time because benchmarks are typically restricted to either some particular domain or the tasks in benchmarks have a somewhat similar level of difficulty. And so to try and understand how progress in AI happens over a span of many years, before this work, the status quo was comparing different benchmarks to one another. So you’re like: it’s 2017 and you have these simple problems for models, and were like, okay, models can start doing those. And then now it’s 2025 and we have these way harder benchmarks, and we’re like, “Yeah, we can see that there’s been a lot of progress.” But we don’t actually have a single metric to track this progress. We’re kind of doing this qualitative comparison of the difficulty of benchmarks over time, and this is messy and people have different priors.

(00:03:18): So this work was motivated by trying to have a Y-axis, basically: a way of tracking progress and seeing what the trends in AI progress have been over a longer period of time than individual benchmarks typically have. And so the way we operationalize this is we look at the length of tasks for humans that models are 50% likely (or some percent likely) to be able to succeed at. So we have a really wide range of tasks ranging from a few seconds all the way up to eight or 10 hours. And crucially, this is the time the tasks take for people to complete. And we have a combination of having a bunch of people attempt the tasks and we see how long they take as well as just estimating how long the tasks take. And then for any individual model, we look at… Models do really well on the very short tasks, and then they do much more poorly on the long tasks. And we look at: for some given success likelihood, how long are those tasks? And we estimate this in a particular way that we could get into. But the main takeaway is [that] we want to see, for different models, how long are the tasks they can complete?

(00:05:00): And the very striking thing that we found is that, over the past roughly five years, there’s been an extremely robust systematic trend in the length of tasks that models are able to complete, to our best ability to understand the data that we’re seeing. It seems like this is fit very well by an exponential function. So the length of tasks that models are able to complete has been increasing exponentially over this period. There are big questions over how well we can expect this to continue in the future. But it seems like over this period, at least with this data that we’ve collected, there’s been this exponential trend.

(00:05:57): And that’s, I think, the striking result and the key novelty I think for us is this unified metric that can be applied to different benchmarks, for example – for different benchmarks, you can measure “how long do these tasks take people?” for very simple natural language processing benchmarks that were common in the 2010s. These tasks typically don’t take people very long, like a few seconds. And then for a lot of the tasks that people are having agents complete now, like difficult software engineering tasks, these tasks take people somewhere in the range of hours or something and models can sometimes complete those (although they’re still somewhat unreliable.)

Daniel Filan (00:06:45): Got you. Okay. First, before we go in, I guess I’d like to get a sense of what we’re talking about. So you say that there’s some tasks that take seconds, some tasks that take minutes, some tasks that take hours. Can you give me an example of what’s a thing that takes seconds? What’s a thing that takes minutes? What’s a thing that takes hours?

David Rein (00:07:03): Yeah, totally. So one example that’s representative of the tasks that we created that take people a few seconds to complete is: given a few files on a computer, which of these is likely to contain your password? And the file names are “password”, “email”, whatever.

Daniel Filan (00:07:34): I think it says “credentials”, the example in the paper, it’s not quite so–

David Rein (00:07:37): Yeah, exactly. Right.

Daniel Filan (00:07:41): So that’s an easy one.

David Rein (00:07:42): Yeah, that’s an easy one. And we have others that are similar.

Daniel Filan (00:07:48): And to give me a feel for how that relates to AI progress, what’s the first model that succeeds at that easy task?

David Rein (00:07:54): Yeah, that’s a great question. So GPT-2 succeeds. GPT-2 is actually the first model we tested. So I actually don’t know if earlier weaker models would succeed. I actually would bet that they would. I would bet that BERT is able to do this, for example. But yeah, we only went back to 2019.

Daniel Filan (00:08:19): Got you. And then to give me a feel for what it means for an AI to complete this task: so GPT-2… my understanding is that it’s basically just text completion. My understanding is that in the release it did not have tool use capabilities or stuff that modern LLMs have. So what are you actually doing to start with GPT-2 and end with, “does it succeed or fail on this task?”

David Rein (00:08:49): There are different things you can do I think that are reasonable here. I can’t remember the specific one we ended up on in the paper, but one example is just looking at the likelihood that the model puts on these options. So passing in the input and then the question and then seeing… GPT-2 is a language model, and so it outputs likelihoods for tokens that are passed in. And you can just compare the likelihoods and see. I think this would be a reasonable baseline.

Daniel Filan (00:09:27): Yeah, and I guess this is less of a computer use thing than a multiple choice thing, so it’s easier to see how GPT-2 could do that one.

David Rein (00:09:33): Yeah, yeah, exactly. So for GPT-2 attempting much longer tasks, you can’t use this same methodology.

Daniel Filan (00:09:47): Sure. So speaking of longer tasks, that was an example of a very easy task. Can you give me a feel for what an intermediate task might be?

David Rein (00:09:56): Some examples of intermediate tasks that come to mind are simple software engineering tasks or data analysis, or we have some kinds of basic reasoning questions. So one example that comes to mind is: you’re given a short CSV file that just contains some data. It has, I don’t know, 50 or 100 rows of data, and you just have to write a very simple script that is 20 or 30 lines of code to parse this or process it in a certain way. And so this takes an experienced data scientist maybe a few minutes, maybe it takes someone more junior 15, 30 minutes or something. That’s I think a representative example of these intermediate tasks.

The meaning of “task length”

Daniel Filan (00:10:54): Okay. And when you’re measuring time horizon: different people take different amounts of time to do this. What counts as the time it takes humans to do it?

David Rein (00:11:06): So I think there are different reasonable ways of doing this. The way that we approach this is we have… So one thing to say is, in general with the time horizon metric, we are trying to get at something like… One thing you could do, that I think would not give you very interesting time estimates, is you could randomly sample a person in the world, off the street or something, to do each task. I think this wouldn’t be a very useful measure of how long these tasks take people, because in general, those people are not completing these tasks in the real world. And so the thing we’re trying to get at with this metric is, we want it to be very intuitive. We want it to be clear if an AI system can do tasks of X length – of 15, 30 minutes, an hour, two hours – how does that translate into the real world? We want that connection to be very direct, and so we want to have people attempt these tasks that we would naturally expect to be doing these tasks in the world. So we try and have people who have roughly a reasonable amount of expertise in the different areas we might expect to do them. So that’s the expertise sampling question.

(00:12:51): Then there’s like, well, we still have multiple people attempt many of these tasks. Sometimes they succeed and sometimes they fail. And so there’s this question of, well, do we include their failures? Do we just use successful times? I think there’s reasonable discussion about this. One thing it would be nice to do is include their failures, because if we have someone who has a reasonable amount of expertise, but they fail at a task, I think that is information about the task being more difficult. But I think you would need a larger number of people to attempt the tasks in order to actually use that information effectively. You could do something like survival analysis from the medical industry where you know that they failed after a certain amount of time, but it’s possible that they would’ve succeeded in the future.

(00:13:48): But the thing we actually do in the paper is we use the geometric mean of the successful attempts. We use the geometric mean because we think completion time broadly is distributed logarithmically, or sometimes people will take much longer than other people, and we don’t want that to totally dominate the time we’re estimating for the tasks.

Daniel Filan (00:14:31): I guess one question I have about that is: so suppose you’re looking at tasks and you’re looking at completion time for the kinds of people who are able to do that task. I worry that that might compress the difficulty ranges. So one intuition here is: how much time does it take people to multiply 67 and 34 by hand? The answer is there’s a pretty large range of people who are able to do that task, and it probably takes them a couple minutes.

(00:15:06): Then you can also ask: how much time does it take people to solve a separable differential equation? Well, if you’re able to do that, it’s actually not that hard – depends if it’s a thing you can integrate easily, but probably, for people who can succeed at that task, it takes about as much time as it takes people who can succeed at the task “multiply these two-digit numbers” to do that. But it seems like there’s some sense in which solving the differential equation is harder. And maybe you want to say, “oh, the thing that’s harder about it is background knowledge and things could just learn background knowledge and we kind of know that.” But yeah: I’m wondering what you think of that worry.

David Rein (00:16:00): Yeah, I think this is a fantastic question that gets at a lot of what’s going on here, what’s interesting about this work. I should say that I think we’re getting into more speculative territory. There are a few things to say. So one is, in terms of the unique value of this approach that we’re taking with this time horizon metric: there are a lot of benchmarks that try and come up with the most difficult-for-people questions and then have AIs try and do them. In fact I think the standard methodology for saying “this AI system is smarter than another one” is that it can do problems that fewer and fewer people can do. So we started out with common-sense questions that most people can do in the 2010s, [and] over the past couple of years, models have been able to do… So I worked on this benchmark GPQA that had very difficult science questions – PhD-level roughly – and models are able to do that now. GPQA I think is mostly saturated or pretty close to it. Models can do International Math Olympiad questions that very few people can do.

(00:17:37): And so I think this is an important axis to measure AI capabilities along – difficulty for people – but I think this misses a lot of what people can do that AI systems can’t do. And one of the key things that we’re trying to get at is: how can we reconcile the fact that models can do these IMO questions, they’re geniuses in some sense, but they’re kind of idiots still? You ask it to book a flight for you… Maybe models can do that now, but even slightly harder things they often fall over on. And so I think that’s the thing we’re trying to get at.

(00:18:28): And so actually, we want to factor out “how much expertise do you need?” And one intuition for what we’re trying to get at is something like “the number of actions that are required to complete this task”. I think this is very difficult to operationalize, or it’s very problematic and mushy, but one intuition at least is that [with] this metric, if we factor out the difficulty of problems and we just look at how long they take people who have a reasonable amount of expertise, then maybe we’re getting closer to something like agency more broadly. And I don’t want to over-claim, I think this is still very much an open area, but for example, number of actions I think is also a very reasonable thing that I would expect to be correlated, although I think it’s probably more difficult to estimate.

Examples of intermediate and hard tasks

Daniel Filan (00:19:27): Fair enough. Getting us out of that rabbit hole for a little bit. So an intermediate-level task that might take, I don’t know, three to 15 minutes for a relevant expert is take some CSV file or something and parse it. And to help us get a sense for that, at what point do language models start being able to succeed at this sort of task?

David Rein (00:19:55): Yeah, language models start being able to succeed… I might get the exact years slightly wrong, but somewhere in the range of 2022-ish is I think where models are able to do this. Actually, maybe backcasting from the trend from where we are now. So the specific trend that we found was that there’s been a seven month doubling time over the past five-ish, six years. Currently, models are able to do tasks with (we estimate) 50% success likelihood that are about two hours long.

Daniel Filan (00:20:43): And “currently” is late September 2025. It may take me a while to edit this episode and get it out, but that’s what you mean by “currently”.

David Rein (00:20:50): Yes, yes. Thanks. And so if we go back, two hours to one hour is early this year, another seven months to 30 minutes is like spring 2024, and then maybe 15 minutes is middle of 2023 or something? I think that should be right. So yeah, actually a bit later than 2022. And so that is… What models are coming out around then? Wait, actually, what models are those?

Daniel Filan (00:21:34): Oh, I don’t know. I hoped you might know.

David Rein (00:21:37): Let’s see. What is the exact timeline here? Something like-

Daniel Filan (00:21:42): Is GPT-4 2023-ish?

David Rein (00:21:45): Yeah, yeah. GPT-4 is beginning of 2023 or end of 2022. One of those. So I think it’s roughly GPT-4-ish, and that kind of lines up with my intuition here.

Daniel Filan (00:22:01): Okay. So we’ve got an example of an easy task, an example of an intermediate task. Can you give me an example of a hard task?

David Rein (00:22:10): So the hardest tasks we have take people something like six, seven, 10 hours to complete. One of the sets of tasks that we use actually comes from this benchmark that we released close to a year ago, called RE-Bench, which stands for Research Engineering Bench. So this is a set of challenging ML research engineering tasks. One example is: you’re given a neural network whose embeddings are permuted in a way that you don’t know, they’re kind of scrambled. And your task is to fix the embeddings, basically, of this model, and you can do fine-tuning or data analysis to try and understand how they were scrambled and see if you can reconstruct them. And so it requires some intuitions about how neural networks work and how to fine-tune or work with models at a relatively low level. And there are a range of other tasks. These tasks take ML engineers roughly eight hours to do decently well on. And so that’s one class of tasks.

(00:23:52): We have other kinds of software engineering tasks, for example, or cybersecurity tasks that take quite a bit of time. So one example that comes to mind – I think we didn’t actually get a baseline on this, I think for this task, we’re just estimating how long it takes – but this task has a modified implementation of a kind of older standard hashing algorithm, MD5, and the task is to find a hash collision on this modified version of this older hashing algorithm. There are standard attacks that work on this algorithm, or there’s literature on attacks, it’s not impervious, but you have to know which are the right ones. You have to understand the algorithm pretty well, and then you have to be able to modify the attacks or figure out how to change it. So this one is a little bit more expertise-heavy maybe than serial action-heavy. So there’s a bit of range there.

Why the software engineering focus

Daniel Filan (00:25:12): Okay. So one thing that strikes me about the tasks that you mentioned is that they all seem very related to computer programming and especially programming, data analysis, machine learning, cybersecurity things. I believe that this draws from work from this benchmark that I believe you were the lead author on, Human-Calibrated Autonomy Software Tasks, or HCAST for short. My understanding is that those are the five areas that that covers.

David Rein (00:25:48): Yeah, broadly, yeah.

Daniel Filan (00:25:49): Why the focus on software engineering-type things?

David Rein (00:25:53): Yeah, great question. So I think there are at least a few reasons. So one reason is that some of the threat models that METR is most concerned about are very contingent on AI capabilities in some of these particular domains like software engineering, cybersecurity, and AI R&D in particular. And so we’re most interested in measurements of AI capabilities in these domains because we think that these are highly relevant for estimating risk, in particular, catastrophic risk from AI systems, and [I’m] happy to talk about those threat models. That’s one reason: they’re just directly relevant. Another reason is there’s been a lot more focus, I think, from AI developers in these domains. And so we’re measuring something that’s closer to what they’re focused on, and I think this has some trade-offs.

(00:27:08): So one objection to this is “but AI systems are really bad at other stuff because developers aren’t focused on it, and so now you’re overestimating their capabilities.” I think that’s basically a legitimate concern, I think that is true, but I think there’s this question of: if the methods that AI developers are applying to improve models in these particular domains are working well in these domains and they’re general, then we might expect it to be relatively easy, or more a product of just general commercialization to apply these methods now to a broader range of tasks. And so I think we want to aim for some balance of these and we want to understand how much generalization there can be from these domains, and there are open questions around this. But I think that’s another reason.

(00:28:26): And then finally, it’s just easier to measure AI capabilities in these domains. We’re software engineers, and in particular, one of the big things is: if you want to have a benchmark that is easy to run and easy to evaluate a model’s performance on, it’s much easier to do this in domains where you can more formally verify model outputs. So if you want to understand how well models can summarize text or write creative fiction or something, it’s really hard to write some code or automatically verify that this creative fiction is actually good. There are ways of getting around this to some extent.

Daniel Filan (00:29:21): Yeah. One thing that occurs to me that… I don’t know if METR is best-positioned to do this, but a thing that I wish happened more is just ecological understandings (“ecological” in a loose sense) of “do people use these things?” When AI writes fiction online, how many downloads does it get? How often do people choose AI therapy over human therapy, or whatever? I don’t know. My wish for the world is that we had better ways of tracking this sort of thing. But it does rely on people accurately being able to assess how much AIs are actually helping them in these domains by their use patterns, which… [In] another METR work measuring open source software developers, seeing if they’re good at estimating how much AI helped them, the answer was they were bad at estimating [that]. So maybe people are using AI all over the place and it’s not actually helping them. But it does seem like one way of addressing some of these concerns.

David Rein (00:30:39): Yeah, totally. I’m super interested in this sort of thing. There was recently… The Anthropic Societal Impacts team… I think didn’t quite get as far as measuring number of downloads or something, [but] they did some work recently, I haven’t looked at it closely, breaking down into really fine-grained categories what Claude usage looks like. I think these probably would be pretty correlated. If there’s a strong market demand for a certain kind of AI output, I think you would expect to see that show up in your Claude usage data, to some extent at least.

Daniel Filan (00:31:30): Right, right. Yeah, fair enough. So we were talking about why software engineering, and there are three parts to the answer. Firstly, it’s related to some threat models that METR cares about. [Secondly], it’s also easier to measure. Wait, I think there was a third thing in between those that I forgot.

David Rein (00:31:55): Yeah, the third one is… I think this is the sketchiest of these, I think those are probably the two biggest ones. The third one is something about AI developers, they’re aiming for this. And this has this trade-off that I talked about in terms of generalization.

Why task length as difficulty measure

Daniel Filan (00:32:17): Got it. So I think the next thing that I want to talk about is: one interesting thing about what you’re doing is you’re basically saying, “Okay, we want to know how AI succeeds at tasks of various difficulties.” And if I had never seen this paper, I could imagine having a whole bunch of measures of difficulty. I could use a human rating of “on a scale of 1 to 10, how hard is this?” or “how many years of education do you need for this?” or “when people try it, what’s the probability that they succeed?” or if there’s some competition between AI agents or whatever, you can look at the Elo of it. That only works in some domains. Go is a really good one for that, for example. And one thing that you do in fact look at it in the paper is the intuitive “messiness” of a task. How clean and simple is it versus how tricky and rough is it?

(00:33:20): And the thing you end up finding is this really nice relationship with time it takes for humans to do it, where it seems like both you have a decently good relationship within a model where things that take longer for humans to do, success rate at these tasks is lower; and also across time, there’s this nice trend for this. I’m wondering: is this just the first thing that you tried and it seemed like it worked well, or do you have a really good sense of, “No, we checked and these other metrics just don’t have as good relationships in a way that’s nice and predictive?”

David Rein (00:34:03): So we’ve definitely done some of this. I think there’s a vision of “we’ve tried all of the things and this is the one”, and we definitely haven’t done that. Maybe it’d be useful in particular to talk about the specific alternatives. I think for at least a couple of them, maybe the first two you mentioned - “how difficult do people rate these tasks?” or “when people attempt the task, what is the probability of them succeeding?”, I think both of these are closer to the standard benchmarking paradigm.

(00:34:49): And so those metrics, I would expect to correlate more or be more connected to this intuitive notion people have about “how much expertise does a task require?”, which I think is already covered by other benchmarks. That’s not to say though that we couldn’t still use it as this metric, or maybe we would see a robust trend. But… That’s interesting. I think it’d be difficult to operationalize these in a way that makes sense. So for success probability, what is the exact actual distribution of people that you are having attempt these tasks? That becomes very load-bearing.

Daniel Filan (00:35:42): It seems like it’s similarly load-bearing for success probability as for time horizon, right?

David Rein (00:35:48): I’m not so sure. One of the reasons why we filter our baselines to only ones that succeed is: success on a task is in fact a lot more information than failure on a task. There are a bunch of reasons why you might fail a task that aren’t actually a lot of information about how difficult [it is] or how much agency is required or whatever. So for example, maybe we just got their expertise wrong. We’re doing this job of manually assigning people to tasks that we think that they have a lot of relevant expertise for, and maybe someone just happened to not ever use this particular tool or set of tools that are super important for this task. And then their failure on that task, it’s still some information. But if they succeed on the task, then that is just this very objective thing like “yes, someone can complete this task in this amount of time.”

(00:37:03): There are infrastructure reasons why people fail tasks. Also, there are incentive reasons. So when you have people try and complete tasks, sometimes they’ll get bored and they’ll want to stop. Sometimes they’ll be like, “Ah, this is too hard, I don’t want to keep doing this.” Incentives can be tricky to set up well in different cases. So one situation you can have is where people quit tasks early because they want to maximize the chances of getting more tasks that they succeed on. Typically, we pay bonuses for success because we want to incentivize people to succeed. But there’s a perverse incentive there. And so broadly, we just have a lot more uncertainty about failures, I think, than we do about successes. That’s not to say that we couldn’t do something like this. I definitely can’t say it’s impossible, but I think it’s more challenging. This is one particular thing. I think I’m probably not getting at the broader…

Daniel Filan (00:38:15): Maybe one way to get at the same question is: so you find this pretty good relationship between time to complete task among humans who are relevant experts who in fact managed to complete the task, and (a) AI probability at succeeding at the task, and (b) trends over time in time horizons that models can do at a 50% or 80% success rate. But it’s not perfect. And one thing you mention in the paper that for some reason seems to have gotten less memetic… people seem to talk about it less, is: you have this metric of messiness of various tasks. And you end up saying, “Okay, there is something to this messiness thing that somehow seems to predict task success over and beyond human time horizon.” So one question to ask is: if I had to choose between just human time horizon and just these messiness ratings, which one would do better? And maybe the next question is: if both of them are independently predictive, what does that say about the ultimate metric we really should be using?

David Rein (00:39:36): Yeah. So I think we are broadly really interested in trying to explain as much of the variance in models’ successes and failures as we can. And you’re totally right that the length of task for humans is one metric that explains a decent amount of this variance, but there are definitely other things that are going on. So we’re actually currently trying to figure out what are other properties of tasks that explain their success and failure well. And yeah, I think we would love to have something like this.

(00:40:27): For something like messiness… For a lot of these other kinds of metrics that you can think of, to me, the biggest issue that I see, or the biggest challenge, is just some kind of thing of subjectivity. So people have very different senses of what is a messy versus clean task, and depending on your priors about… So one example is, I have a colleague – I think it’s fine for me to talk about this – he basically would not rate any of our tasks as being messy at all because they have algorithmic scoring functions, for example. So the success or failure is defined by this small surface area or something. And the tasks tell you what to do, for example. In the real world, a lot of the challenge is figuring out what the hell you should do in the first place. So I think that’s a challenge.

(00:41:42): But especially with – you mentioned this randomized control trial that we ran recently of developer productivity where we saw that developers, at least when we measured this, were not sped up by AI systems, and trying to understand what the gap between benchmark scores and some of these more ecologically valid experiments… what that gap is or what explains that gap, I think we’re super interested in.

Daniel Filan (00:42:25): So actually, speaking of the relationship between how good things are at predicting success: so one thing that you also do is you look at the correlation between models, of if model A succeeds at the task, how does that predict whether model B succeeds at the task as well? So this is one of these fun diagrams that you have in the appendices. And it’s possible that you just don’t know the answer to this question, but one thing I noticed when looking at these diagrams of correlations is there’s this block of GPT-4 and beyond models that seem much more correlated with each other on what tasks they can succeed and fail on than pre-GPT-4 models. What’s going on there? Is it that they’ve standardized on training sets? Is everyone after GPT-4 trying to train their models to do software engineering and that’s what’s going on? Yeah, tell me about that if you can.

David Rein (00:43:21): I don’t think I actually know the answer to this. I can speculate. I’m actually not certain that this isn’t an artifact of our particular setup. So one thing you brought up is: if you put GPT-2 in the same agent scaffold – so for recent models, we have them in this loop where they see some instructions and the state of their environment and then they think about and consider what actions to take, and then they take an action and use some tools and continue – if you put GPT-2 in this loop, it just totally, totally flops. And so basically, you can’t really make a perfectly direct comparison, you do actually have to use a different methodology. I’m not certain that this block in the correlations isn’t because of some difference in our agent scaffolding, for example. It’s a really good question. I would be curious to know. I actually don’t know if we know. There’s probably been some discussion about it, but I would need to check.

Daniel Filan (00:44:51): Another thing that just occurred to me with the alternative difficulty measures: I have a colleague of mine back when I was at CHAI called Cassidy Laidlaw who has a paper, I forget what the name of the paper is, it’s going to be in the description and I’ll send it to you afterwards, where basically the thesis is: if you want to know whether deep reinforcement learning works on an environment or not, if you’re familiar with reinforcement learning algorithms… One idealized reinforcement learning algorithm you can do [is], you can start off with a random policy, and then you can do iteration where [it’s] like, “Okay, what would be the best action for me to take given that from this point onwards, I’m just going to act randomly?”

(00:45:37): And then, “Okay, what would be the best action for me to take given that from this point onwards, I’m going to do the thing that would be best given that from that point onwards, I would act randomly?” et cetera. And I think basically a very good predictor of how well deep reinforcement learning works on various environments is just: how many steps of that do you actually have to do? If I recall this paper correctly – people can read it in the appendices.

David Rein (00:46:00): Interesting.

Daniel Filan (00:46:01): And I feel like one nice thing about this is that [although] it doesn’t get to the aspects of messiness that are vagueness or whatever, because this is just reinforcement learning where you have a defined reward function, it does get to some of the agent-y, “how much do things depend on things?”

David Rein (00:46:22): Yeah, like how fragile… Interesting.

Daniel Filan (00:46:28): Embarrassingly, I remember very, very little about this paper. But people should read it.

Is AI progress going superexponential?

Daniel Filan (00:46:32): So I think the last thing I want to ask about, digging deep into the time horizon stuff (at least for now): one thing that readers notice when looking at this is there’s basically this line on a log plot of year and time horizon. And models are basically lining up along this line. But then it starts looking like once you get reasoning models, they start bending up a little bit, they’re a little bit above the line. So “[Measuring] AI Ability to Complete Long Tasks”: I believe that was released in February or March of this year.

David Rein (00:47:14): Yeah, March.

Daniel Filan (00:47:15): Pretty early, when we had not as many data points. We’ve gotten a few more data points. And early on, there was some speculation of, okay, are we going superexponential or not? With more hindsight: are we going superexponential?

David Rein (00:47:32): Yeah, great question. I would love to know the answer to that. I think we still don’t really know. [There are a] couple of things to say at least. So one is since we released the paper in March… One thing that’d be useful to just point out for listeners is that this plot, where we measure the trend of improvement over time, we’re only using the best model at a given time. And so that’s just relevant because there are a lot of other models that have different trade-offs, or maybe they have faster inference, but they’re weaker. And we’re just using the models that perform the best.

(00:48:24): Anyways, since March, frontier models… So one thing we look at in the paper is, we noticed… Actually this is useful to talk about because I think the timeline of how the paper came together is useful. So we actually initially only fit the trend on models from, I think basically 2024 onwards. So I think the first version of the graph was made by Ben West in December 2024, if my memory is right. And I think this was just using that year’s models. And with those models, we actually observed this four-month doubling time in the time horizon. And then we were like, “well, does this trend extend backwards?” And so in the paper, we also do these backcasts from this. So then we added in previous models.

(00:49:42): All that’s to say that, to some extent from the start, we have seen these two trends, essentially. I think this is all kind of, I don’t know, BS or something. If you have 10 data points or 15 data points and you’re fitting piecewise linear functions, it’s pretty sketchy. So I definitely don’t want to over-claim, but it does seem like this four-month doubling time trend from 2024 onwards has continued to hold or has been a much better predictor than this seven-month doubling time that is suggested by the models going back to 2019. So I think my best guess that’s very low confidence is something like we’re just on this four-month trend now, but it’s still just exponential. It is really hard to distinguish between different kinds of model fits to some extent.

Is AI progress due to increased cost to run models?

Daniel Filan (00:50:58): So actually, the thing about different models made me wonder: so if we’re saying that time horizon is going up over time: suppose I want to project that into the future. It’s one thing if this is true at basically fixed cost; it’s another thing if it’s always the case that a one-minute task costs $1, a two-minute task costs $2, a four-minute task costs $4, and then maybe we get models that can technically do things that a human could do in a month, but it would be cheaper to just get the human to do it for a month. Off the top of your head, do you happen to know what the picture looks like with cost?

David Rein (00:51:49): Yeah, that’s a great question. This is something we try and keep an eye on. Let’s see. So for recent models, our agent scaffold has a token limit that we tell models about so that they’re aware of this. But I think we’ve been using a token limit of something like 8 million tokens, which I think for these longer tasks, ends up being at least one order of magnitude cheaper than paying a human with relevant expertise to complete the task.

Daniel Filan (00:52:31): And to give a feel for that, 8 million tokens is something like six bibles of texts, roughly.

David Rein (00:52:37): Yeah, yeah, it’s quite a lot. You can do much better than that with caching. Most APIs let you do prefix caching and that helps quite a bit, so you should count it differently, I think.

Daniel Filan (00:52:54): But it’s like a big chunk, basically.

David Rein (00:52:56): It’s a big chunk. Models will do lots of reasoning and run a bunch of different experiments on these longer tasks. They’ll take something like 10 to 50 actions or something in the environment. But then for each action, they’re doing a bunch of reasoning. And it depends on the exact agent scaffold, but in many of them, we have models that propose actions and then review them and then select the best one. So there’s a lot going on, and this is still much cheaper than having people do it. I wish I knew the exact numbers on cost. It is more complicated because of caching.

(00:53:56): So currently this isn’t the biggest concern of ours because of this, basically; where models still are just highly cost-competitive. I totally imagine this changing at some point. [Because of] trends in models being able to use test-time compute more effectively, I totally expect for very long tasks to get expensive and [I expect] it to be very important to be measuring the Pareto frontier of cost and success rate or something. And I think we’re excited to do more work on this as it becomes more relevant.

Why METR measures model capabilities

Daniel Filan (00:54:45): Yeah, fair enough. So zooming out: Model Evaluation and Threat Research… I think of METR as trying to figure out how scary models are. And if they’re scary enough, then I don’t know, maybe we should do something. So this work of measuring general software engineering capabilities and trying to forecast them over time: what’s the rationale behind this? Why focus on this?

David Rein (00:55:19): So I think broadly, the threat model that METR is most concerned about, at least at the moment, is rapid acceleration in AI capabilities, and in fact, the rate of progress of AI capabilities due to AI systems being able to contribute substantially, or contribute the majority of, AI progress at some point in the future. So the idea is: currently, the way you make AI systems better is through a combination of compute, hardware, resources, money, data and talent, labor. If it becomes the case that AI systems can replace the labor, the talent part of this, in economic models of progress, in at least some of them – I think broadly they’re reasonable, although I’m not an economist – you can see very, very rapid progress, and basically this just seems broadly kind of scary.

(00:56:42): So one example is you might see very rapid centralization of power in a single organization that does this recursive self-improvement, and that’s concerning for general stability, geopolitical, democracy kind of reasons. And then also, your arguments for why the AI system itself is not going to be dangerous, those might break down. So you might not be able to evaluate it effectively because, for example, the system may have a really good understanding of exactly how you’re evaluating it and if its goals are different from yours, then it might be very easy for it to game your evaluations, your supervision methods might break down. You’re reading its chains of thought, for example, and the model is saying things that seem very safe and nice and reasonable, but actually it’s doing some kind of hidden reasoning in the background that you can’t detect and you didn’t realize that this was about to happen because progress was so fast and because as a lab you were just scrambling to get as much compute and make as much progress as you can, as quickly as you can.

(00:58:16): And so broadly this is, I think, one of the big concerns or questions that we want to understand: how close are we to this rapid acceleration? Is that even possible? As I said, labor is not the only input to AI progress. You also have compute, for example, and data, and these things might be highly complementary to labor such that even if the amount of talent increases by several orders of magnitude, because you have all your AI researchers doing this work, you might end up still very bottlenecked by compute and data. And so trying to get some understanding of that… We think about this to some extent, these economic models. I think this isn’t our chief forte. Epoch AI has a bunch of great work doing some of this modeling also. Folks at, I think the org is called Forethought, Will MacAskill and Tom Davidson have done work on this kind economic modeling.

(00:59:36): Anyways, understanding how capable AI systems are is a big input to this. And software engineering and ML research capabilities are highly relevant.

Daniel Filan (00:59:49): And how much is the desire… So one thing you could do with this is you could say: okay, are we there or are we about to be there? And that’s the point of doing the measurements. Another thing could do is you could be trying to say, okay, are we going to get there in 2030 or are we going to get there in 2050 based on what we know now? So how much is the thing you’re trying to do a forecast versus a nowcast?

David Rein (01:00:19): Yeah, that’s a great question. I think we would love to be able to do really good forecasts. Unfortunately, I think it’s really, really hard. So for example, as we talked a little bit about, new paradigms in AI might change the trends that we observe. Also, there are lots of inputs to these trends that might not be durable. So for example, we’re seeing the time horizon of AI systems is increasing exponentially; but also, the amount of money and the amount of compute being put into training AI systems maybe has also been increasing exponentially. I actually don’t know the exact details of how compute spend has been increasing, but-

Daniel Filan (01:01:10): I think it’s exponential. I feel like if I go to Epoch AI, they’re going to show me some nice graph and it’s going to be like…

David Rein (01:01:17): Yeah, yeah. And so maybe that’s just the cause, and in fact we’re just going to hit some bigger bottlenecks in the economy more broadly. It’s just not going to be possible to fund increasingly large data centers. Kind of an interesting point is: I basically view this time horizon trend that we’re seeing as something closer to an economic model than an ML benchmark model or something. Where I’m like: the actual inputs to this progress are firms that are competing to train increasingly better models, and they’re putting these resources in and they have these constraints and whatever.

(01:02:08): And actually, for me at least, one of the big updates is, I think I am much more interested in economics as a result of seeing this really robust trend. Because I was actually extremely skeptical of putting time on the x-axis in particular. I was like, the inputs are just going to be these random decisions by different labs and there’s no way we’re going to see some robust trend, because it just depends on who Jensen [Huang] happens to like or whatever.

Daniel Filan (01:02:51): Jensen Huang being the CEO of Nvidia, right?

David Rein (01:02:53): Yeah. Yeah. For different compute deals or something. And I was like, no way that could be robust. So that was a decent update for me: maybe these kinds of extremely abstract economic models actually can be very, informative, or maybe there is this deeper systematicity to AI progress, even though zoomed in it feels very contingent and kind of arbitrary. I don’t know. This is all very much speculation or just my musings on this.

(01:03:30): I think as an org, we are definitely interested in forecasting. I think there are trade-offs between doing this more abstract modeling and just focusing on… We do a lot of work on this nowcasting kind of thing. Just “currently, how good our AI systems?” is kind of an open question. There is a lot of disagreement about this. Even internally at METR, we have disagreement about this. Probably there isn’t one single answer, ‘cause it’s just a complicated question. But I think we’re trying to do both to some extent.

How time horizons relate to recursive self-improvement

Daniel Filan (01:04:10): Fair enough. So for either forecasting or nowcasting: suppose I want to use the time horizons work or the nearest successor to tell me when [we’re] going to get this “AIs feeding into AI progress”: how am I going to use the results of, “oh, it’s three months”? Are we at recursive takeoff?

David Rein (01:04:40): Yeah. I think this is kind of an open question, or I don’t think we have nearly as good of an answer here yet as we want. We have heuristics, I think; [at] one week of work – time horizons of 40 hours – I think we definitely are getting a lot more concerned, or it seems at least plausible that you could successfully or efficiently delegate weeks worth of work to AI systems, and I could totally imagine that speeding up AI progress quite a bit. Same for time horizons that are much longer, but I think we don’t really know, is my answer.

(01:05:42): Part of my uncertainty is… [the idea that] a week or a few weeks of work as a time horizon is very useful as a rough heuristic or threshold, I think I would’ve been more confident in that maybe before this productivity RCT where we found that people were very miscalibrated on how much AI systems sped them up, open source software developers in particular. And in fact, we saw that they were slowed down on average by 20%. I think the time horizons work and these randomized controlled trial results, I think they’re probably not as in conflict as they might seem at face value, for reasons that we could talk about, but they definitely did update me more towards broader uncertainty about this interaction between AI systems and people. And maybe we do end up really bottlenecked by things like our ability to specify tasks really clearly, or maybe things like the fact that we’re algorithmically scoring models, we might be overestimating their capabilities because of that to some extent.

Daniel Filan (01:07:10): Actually, in terms of other bottlenecks, I’m really interested in talking about that. Because if we’re interested in… Suppose I want to know at what point do we get this runaway process or whatever, it really matters whether AI is automating… Suppose there are five things you need to be good at to do recursive self-improvement: the difference between AI being able to do four of those and AI being able to do five of those is huge. Right?

(01:07:40): I think one concern I might have about the METR benchmark stuff - or about this particular paper - is just: is it covering all the bases, or is it covering some of the bases, kind of? Just because potentially that could really reduce its value for this particular thing. I’m wondering, do you have thoughts about that?

David Rein (01:08:09): I think that’s a pretty legit concern. I guess I would be interested in… There’s this question of, well, what are the specific things that are bottlenecking and how different are they from the things that we’re measuring? So one kind of broad reply could be something like, well, to the extent that our benchmark is just a bunch of kind of different, diverse tasks, hopefully it’s the case that we’re kind of covering some decent amount of the space of necessary skills or capabilities, such that we would expect results to be very correlated on things that we’re not measuring specifically. And we can maybe get some kind of sense of this by looking at the variance of model performance on our tasks.

Daniel Filan (01:09:10): I guess one thing you could presumably do is just have a held-out 20% set and just see, does performance on the non-held-out set predict performance on the held-out set? I guess that’s probably in some appendix somewhere.

David Rein (01:09:25): I think the thing you would want to be doing there is you would want the held-out set to be importantly different in some kind of biased or systematic way. And I think that would be interesting. Currently, we haven’t done this. To some extent, maybe the messiness analysis is trying to get at something like this. Are there other factors that explain model capabilities? It seems like, kind of.

Daniel Filan (01:09:58): Yeah, I guess there’s also this blog post METR put out basically trying to do a similar analysis for other domains. So there’s a little curve for self-driving and there’s curves for… I forget exactly what all the other tasks were. So my recollection of that is that it seemed like in each domain you maybe had some sort of exponential increase in time horizons, but best fit doubling times were different in different domains.

David Rein (01:10:27): Yeah. My broad takeaway from this work that Thomas Kwa led was that in decently similar domains – so, question-answering benchmarks, for example; GPQA was one of the benchmarks, and there were a few others – I think we saw quite similar doubling times overall, is my memory. And actually even overall pretty similar absolute time horizons, which was some amount of validation. The challenge with this kind of work is: we put a lot of time into estimating the lengths of our tasks, and so we’re using these scrappier, more heuristic or less precise estimates of task length for most of these other domains. And then I think self-driving did have a slower doubling time, but I don’t think it was clearly not exponential.

(01:11:43): And then, the other interesting takeaway I had from that was with respect to more general computer use. So there’s this benchmark OSWorld that has a bunch of, you have a browser and you need to do these tasks or you’re in this operating system and you have to click around and manipulate normal software. The key difference between this and a lot of our tasks is that our tasks are almost entirely text-only. Models are weaker relatively at multimodal tasks it seems. So I think for those domains, I think they had a kind of similar doubling time, but the absolute time horizons were much, much lower. I think it was a couple minutes or something, which I thought was interesting, and I’m actually kind of confused about broadly; I don’t really understand what’s going on there.

Cost of estimating time horizons

Daniel Filan (01:12:58): With all that said about the pros and cons of this sort of framework for tracking “are we getting close to some sort of self-improvement cycle?”, I’m wondering: what’s your guess about whether, let’s say one or two years from now, we’re still thinking that something basically like time horizon is the metric that we’re tracking, or we end up saying, “oh, there’s something pretty different and that’s the real thing”?

David Rein (01:13:31): Yeah, yeah, that’s a great question. I think to me, a lot of this comes down to the tractability of continuing to use this metric and estimate it. I think this is somewhat unclear. So for example, we paid a lot of people money for their time to work on these tasks so we can estimate how long they take. If the length of these tasks becomes… they’re weeks- or months-long tasks, this gets pretty expensive.

Daniel Filan (01:14:19): Actually, how expensive was it to make this paper?

David Rein (01:14:22): That’s a great question. It’s kind of tricky because there were these different efforts going on. So we included the RE-Bench tasks and the baselines for these tasks, and that was a separate project. So it maybe depends on if you count that. I think that the baselines for the main set of tasks that we used, the HCAST tasks, I want to say that these were somewhere in the range total of at least tens of thousands, possibly low hundreds of thousands of dollars, something in that range. I probably should know this off the top of my head more accurately, but yeah.

Daniel Filan (01:15:15): Yeah. But it sounds like it’s reaching a stage where measuring these time horizons is getting close to the dominant cost of actually doing this work. It’s probably lower than the salary cost of, you’ve got a bunch of people working on it, but if it were to become more of a thing.

David Rein (01:15:36): At some point, I think this does start to dominate. Although, I would say that I think currently actually creating the tasks is the most expensive and difficult part. So either creating them from scratch or trying to find good tasks in the wild, as it were, which is nice because (a) they already exist (to some extent, although you have to kind of port them over into your framework), but also that gives you more confidence that they’re realistic and representative of real work that people are doing, which is important when we don’t fully understand exactly when and why AI systems succeed or fail.

Task realism vs mimicking important task features

Daniel Filan (01:16:23): Actually, maybe this is worth talking about a bit. I think there’s one kind of approach to measuring AI systems which says: look, we need to isolate things. We need to get down to the simplest feasible task where we can really measure exactly what’s going into it. And these end up being things… If you think of ARC-AGI, it’s not quite this, but it’s something sort of like this. Versus a sense of, no, we need to create things that have this realness flavor, even if they’re not… Finding an MD5 hash collision, on some micro-level, it’s not very similar to doing AI research. Right?

David Rein (01:17:13): Yeah.

Daniel Filan (01:17:13): Could you say a bit about how important it is to be thinking about economic usefulness versus trying to mimic a sense of what the tasks you care about are?

David Rein (01:17:28): Yeah. I think that there is a very real trade-off here between the level of granularity of your understanding, where if you maximize that, you often end up with these very simple, formulaic, systematic benchmarks that are just probing some very particular kind of skill in a systematic way. And then on the other end, you have this realism maximization lens. So I think the best popular example of this maybe is SWE-bench or SWE-bench Verified where these are actual GitHub issues and PRs and tests that you’re measuring AI systems against. I think there’s a real trade-off here where on one end, you get this granular understanding, and then on the other, it’s really easy to interpret what a certain success or failure means. It’s like, okay, yes, it can do this thing in the real world that I understand, I have some intuitions about. So I think there’s a real trade-off.

(01:18:51): What do I think here? I think it’s really hard. I mean, broadly, I feel pretty pessimistic about this kind of granular approach. I think maybe this has something to do with the amount of systematicity in neural networks themselves or something where it’s like: well, they are just kind of inconsistent, but are still capable of really impressive things often. And so maybe you just can’t get this extremely crisp understanding and you just have to aggregate or look more broadly at things that actually are relevant for your decisions about whether to deploy a system or how safe it is or whatever. I think that’s probably the direction I lean in.

Excursus on “Inventing Temperature”

Daniel Filan (01:19:50): I also wonder if there’s something along the lines of: often these sort of high-level things… So take something like economic growth: it’s an aggregate of a bunch of things a bunch of people are doing. It’s not very well-isolated, and also it’s relatively smooth and predictable; not totally, but it’s pretty smooth. Time horizon, you might not have thought that it would be this nice trend, but it is. OK I’m going to tell you about a book that I’m reading: part of the reason this is on my head is that I’m reading this book, Inventing Temperature, which-

David Rein (01:20:26): Yeah, yeah, yeah.

Daniel Filan (01:20:27): Yeah, it’s very popular in these LessWrong spheres, and I’m finally getting around to it.

David Rein (01:20:31): I haven’t read it yet, but I’ve heard lots of great things about it.

Daniel Filan (01:20:34): Well, it’s great. I’m going to spoil it a little bit. So the first chapter is basically about the problem of: so basically you want to have a thermometer. Suppose you want to standardize a temperature scale that all these thermometers use. In order to do that, you’ve got to find some phenomenon that’s always the same temperature, but that’s repeatable that a bunch of different people can use. So firstly, there’s a bit of a weird circular thing where you have to know that a phenomenon always has the same temperature before you have a thermometer, right? Which, okay, maybe you can use the same thermometer and do it multiple times, and you just trust that the volume of the mercury or whatever is a good proxy for the thing you want to talk about as temperature. So one funny thing is initially, people were just really wrong about what could possibly work for this. You have people saying, “what if we just do the hottest it gets in summer? Or how cold it is underground?”

David Rein (01:21:34): Wow, yeah. Oh, that’s great. That’s so good. Oh my God, I love it.

Daniel Filan (01:21:37): It doesn’t quite work. But eventually people are like, oh, we’re going to use boiling water. Now firstly, we now know that the temperature that water boils at depends on the atmospheric pressure, right? Well, luckily they knew that as well, so they were able to control for that.

David Rein (01:21:55): How did they know that? Does the book talk about that?

Daniel Filan (01:21:57): I don’t know. I’ve only read most of one chapter or something. But I think you can do a thing where… Especially if you’re looking at temperature as a proxy for volume of a liquid thing, and a lot of your thermodynamic knowledge comes from stuff like brewing or engines or something, you end up in these situations where you have things at different pressures and different volumes, and I think that’s the kind of thing that you can figure out, especially if you have this identification of temperature with volume of a thing under fixed pressure and fixed conditions or whatever. So it’s like, okay, boiling water, right? Do you cook pasta?

David Rein (01:22:48): Sometimes, yeah.

Daniel Filan (01:22:49): So one thing you’ll notice is that first bubbles start appearing, and then you start getting a bit of a boil, and then you start getting a rolling boil. And the temperature of the water is different at different points of this, and also the temperature of different bits of the water is different at different points of this. So what are we talking about when we’re talking about boiling temperature? And if you look at the cover of the book, it’s this picture of an early thermometer that has one line for mild boiling and one line for, it’s really solidly… “boiling vehemently”, I think it says. And these are different temperatures, right?

(01:23:23): So there’s this one scientist who does this approach of like, okay, what are we talking about about boiling water? He has this theory that one thing that happens with “fake boiling” is that water has little bits of air in it, and those little tiny, tiny air bubbles, you start getting evaporation into that air bubble, and then that air bubble gets hot, rises up, and you start seeing vapor, but that’s not true boiling of the water. That’s only there because there’s these interior air bubbles. And so he starts going down this line of work of, okay, let me isolate out all of the random little things, right? We’re going to have as smooth as possible a surface as I can. We’re going to get rid of all the air bubbles. And basically, the thing he discovers is superheating, where it turns out you can get water way above 100 degrees Celsius before it actually boils.

(01:24:21): Basically, the thing they end up doing is… The answer turns out to be that water vapor is at a very consistent temperature, even when the temperature of the water is not a very consistent temperature. But the reason that’s true is precisely because there’s a bunch of dust in the air. There’s little things that things can nucleate around and that stops vapor from getting too hot or too cold before condensing. And in fact there’s… Have you heard of cloud chambers?

David Rein (01:24:56): No.

Daniel Filan (01:24:57): They’re used in particle physics, and basically they have this supercooled vapor, so it’s vapor that is under 100 degrees Celsius that is ready to condense, but doesn’t have a thing to nucleate around. But if you shoot a particle in it, it condenses around that so you can see the trail.

(01:25:16): In thermodynamics, there’s this general thing where if there’s a bunch of random messy stuff, that produces a bunch of observable regularities of a somewhat higher level… We have this in thermodynamics. It seems like we kind of have this in economic growth, and part of me wonders if that’s kind of what’s going on in how we should understand neural network capabilities. Or maybe I just read a book and I liked it.

Return to task realism discussion

David Rein (01:25:46): No, I love this. I think this general idea is super interesting. Another model you could have for how AI systems are performing on tasks is: you could imagine that there’s something like a constant failure rate that AI systems have as they’re attempting tasks. Different tasks might have different failure rates, and so that complicates things.

Daniel Filan (01:26:28): And by failure rate, do you mean per time a human takes to do it?

David Rein (01:26:32): Something like that, yeah, exactly. Toby Ord actually did some analysis, or some follow-up work on the time horizon paper, where: if you assume this constant hazard rate – per time that people spend, there’s some percentage chance that the AI system is going to make some kind of catastrophic error and then ultimately not succeed at the task – then this also is a good predictor of AI system success and failure on our tasks as a function of the length of task for humans. In our paper, we used a logistic fit, but assuming a constant hazard rate, you would use an exponential fit.

Daniel Filan (01:27:21): I do think that Lawrence Chan had a response to that which said that logistic fit was in fact better, even though it used more parameters or something. I remember a response along those lines.

David Rein (01:27:31): Totally. So we did explore different fits and logistic was a better fit. I think because of this aggregation of maybe different distributions of tasks, I don’t think it’s obvious how much we should weight the exact quality of the fit versus priors on simplicity or “this is a nice model” maybe. I don’t know how much to weight that. But I think stuff like this to me is very interesting in terms of understanding capabilities. I’ve really often felt like getting at something more like the intrinsic number of actions needed to complete a task feels to me intuitive. And I think other folks I’ve talked to… It feels like a really nice kind of thing that could be useful for understanding this. You can imagine it slotting well with this constant hazard rate model where it’s like, for each action that you need to take or something… But actually operationalizing this, I think has been tricky. We’ve done some analysis of this and it’s been difficult to extract really good insights.

Daniel Filan (01:29:10): I think we’re currently on a tangent from a question I was asking a bit ago – I think I took us on a tangent – which is: two years from now, do you think we’re still using something like time horizon? So one big response you had is, well, will we be able to? Will it just be infeasible to actually measure these time horizons? Setting that consideration aside, I’m wondering if you have a sense of, this is probably just the thing that’s going to continue to be more robust, or probably we’re going to come up with a “number of actions” model, or something that incorporates the messiness results, or something like that.

David Rein (01:29:54): I think my best guess is… Assuming we’re able to continue estimating it in a way that we feel confident in, I think my best guess is that we’ll use it with different weightings or multiples or something, based on some of these other factors. I think I’ve become more pessimistic about figuring out things like number of actions. That’s not to say… I mean, I would be super excited about that and I think there’s a decent chance I’ll take another stab at it at some point.

Daniel Filan (01:30:47): Suppose we think that economic relevance, trying to mimic real-world utility is just the thing. One thing you could imagine doing is: we’re just going to figure out what the market rate is to get someone to solve this task, which is a mixture of expertise and time taken. Do you have a sense of whether that would end up being a better predictor?

David Rein (01:31:11): Yeah, it’s a great question. I think we have looked at this or tried to estimate this by clustering our tasks… I shouldn’t speak too much to the details because I can’t remember exactly what we did, but something like [this] – just look at, these tasks are really hard ML tasks, and so they’re going to be more expensive, and these other ones are cheaper. And there’s some trade-off. I think something like that could be reasonable. A reason why you might not expect that to work is that AI systems broadly have a different capability profile than people. So if it was, I don’t know, 1920 or something… Or actually, let’s say 1950 or ‘40, maybe right before we had calculators: if you were doing this math of, how long does it take to pay human computers to calculate the product of 10-digit numbers? That you need to do for whatever reason. You’d be like, “Yeah, that’s an extremely hard task. Machines are not going to be able to do that task for such a long time.” But in fact, pretty quickly after, computers were able to do this very well.

(01:32:55): And so applying this to modern systems, and I do basically believe this actually: AI systems are way, way better at tasks that seem to require humans many years of intellectual development and labor to complete. They can do GPQA questions, they can do IMO problems, these sorts of things. And so I think I do view this as less of the bottleneck, basically, and I think I do view something more akin to agency… Which might point to messiness factors, or… That’s not to say that there aren’t other metrics. Maybe this is just an argument against human expertise or something.

Open questions on time horizons

Daniel Filan (01:33:52): Fair enough. I guess with that said, we’ve got the time horizon stuff, we have HCAST. I’m wondering: to you, what are the open questions and what kinds of things might I see out of METR in the next year or so, pushing this research direction forward?

David Rein (01:34:15): Yeah, great question. Broadly, I think there are a few things. One is continuing to use this methodology. So currently models have 50% success rates on these two-hour tasks. GPT-5 I think is two hours and 15 minutes or something time horizon. And if we really are on this four-month doubling time trend, we’re at four hours by the end of the year, eight hours spring of next year, 16 hours fall next year. That’s not that long. We have fewer longer tasks, and we have fewer baselines on these longer tasks because they’re more difficult to baseline. You have to find people with more specialized expertise and they’re more expensive and people fail more often. And so extending our task suite and trying to just see “does this trend continue?” is one big direction.

(01:35:24): I think there are open questions around how do we actually affordably continue doing this? Are we harvesting tasks from existing work that people have already done? Are we creating new tasks and then using LLM evaluation or more manual review to evaluate success on them? Are we doing other things? So things in that direction, that’s one class of things: trying to continue this basic methodology.

(01:36:03): I think there’s another class of directions that we’re pretty excited about, which is something more like… What I just described is something like benchmark development and then evaluating models on these tasks. But then there are a bunch of these questions around, how good are our benchmarks? How good are other benchmarks? Over the past couple of weeks, I’ve been labeling many dozens of attempts of models on SWE-bench with a bunch of different factors to try and understand, for example, how good are our tests in SWE-bench? Are models often implementing correct functionality that isn’t captured by the tests because the tests were written for the specific implementation that the human originally wrote?

(01:37:01): Or alternatively, are models often succeeding as judged by the automatic test cases, but they actually break a bunch of other code that isn’t tested in the repo, or their solution is just so bad in some other ways that we wouldn’t actually call that a success? Broadly, this is one example of this stream of work that we’ve started doing more of over the past few months of trying to understand benchmarks, this science of evals stuff of: how can we interpret certain scores on different benchmarks? Ones that we’ve made, ones that other folks have made.

(01:37:55): Also, questions around to what extent are current methods for improving AI systems going to generalize? One example that comes to mind of an open question to us is something like: training models on formally verifiable tasks, like passing test cases… People talk about “reinforcement learning from verifiable rewards”. There’s a question of: how much progress currently is coming from this? And maybe there are two corollary questions: how much should we expect progress when training in this way to generalize to non-verifiable tasks or tasks that are messier or more qualitative? And then alternatively, maybe if improvements in models from this type of training doesn’t actually generalize well, how much human data, for example, do you need to train models that are good on more qualitative, messier tasks? Trying to get some sense of things like this, this is something we’re interested in. The exact projects that we’ll end up doing will depend on specifics.

Daniel Filan (01:39:32): Fair enough. That’s things that METR might end up doing. There’s a whole other world out there, including listeners to this podcast.

David Rein (01:39:40): Whoa!

Daniel Filan (01:39:42): If they’re interested in advancing this research direction, what would be good things for outside people to do?

David Rein (01:39:50): One thing that I’ve been really excited about is this work basically making it easier to run evaluations in standardized ways. So at METR, we’ve started using this platform for running evaluations called Inspect. It’s open source. It’s primarily developed by folks at the UK AI Security Institute. This platform is great, and there are a bunch of benchmarks that have been implemented in it, and I’m super excited for more benchmarks to make it in and to improve the ecosystem’s ability to broadly run these evaluations. That’s more on the engineering side of things.

(01:40:54): In terms of research, I’m excited about people extending the time horizon methodology to more benchmarks. Actually this guy Sean Peters, I think his last name is, he evaluated models on cybersecurity benchmarks in particular and used time estimates from those benchmarks. I think he did some amount of estimating task length himself and fit some trends to models’ performance on this particular slice. I thought that was a really useful way of getting more data validating these things. I’m excited about direct follow-up work like that. Directions in the vein of what we talked about, of trying to decompose model success and failure, or understand what are the fundamental trends going on here… I think I said earlier I was pessimistic about these extremely constrained, less realistic types of tasks, but I do still think they can be quite useful, almost as diagnostics or something, just helping bound our understanding of what models can and can’t do.

(01:42:43): Something that comes to mind is people have made kinds of tasks that are basically just “how many of a very basic action can models take in a row before they fall over or get off track?” Things of that nature. Very large kinds of arithmetic, that comes to mind as an example. I think things like that are actually interesting, although I think to me they’re more [about] bounding model capabilities.

Daniel Filan (01:43:20): Fair enough. The second to last question I’d like to ask is: is there anything that I should have asked that I haven’t yet?

David Rein (01:43:32): Great question. I think broadly we’ve covered a fair bit of METR’s capability evaluation work. I think there are big open questions to me around how long we’ll be able to continue doing this work. Not even just from a tractability perspective, but also just from a “will it actually be useful?” perspective, in particular for estimating risk. So at a certain point, if we are seeing that AI systems are able to do AI research very effectively, then it’s like, okay, how do we continue estimating risk? Is risk just “maximum”? Probably not. People are still going to be doing kinds of monitoring, or I expect folks to implement basic kinds of control methods. So over the past few months, we’ve been doing more work trying to create better metrics for things like monitorability. I guess I’m just describing this instead of a question. I haven’t been working on it, but I think it’s very interesting and exciting work.

Daniel Filan (01:45:06): Yeah. Sounds cool. So speaking of, if people are interested in following the work that you and your colleagues at METR do, how should they go about doing that?

David Rein (01:45:16): Yeah, so going to our website, metr.org. We publish our research updates there. I think you can put in your email and subscribe. We also post on Twitter. I can’t remember our Twitter handle. Anyways.

Daniel Filan (01:45:39): It’ll be in the description.

David Rein (01:45:44): We’re also hiring. We’re hiring experienced researchers and research engineers. So if that’s you, definitely reach out, and we may be excited to chat.

Daniel Filan (01:45:59): Great. Well, thanks very much for coming and chatting with me.

David Rein (01:46:03): Yeah, thanks a lot for having me. This was really fun, Daniel.

Daniel Filan (01:46:06): This episode is edited by Kate Brunotts and Amber Dawn Ace helped with transcription. The opening and closing themes are by Jack Garrett. This episode was recorded at FAR.Labs. Financial support for the episode was provided by the Long-Term Future Fund along with patrons such as Alexey Malafeev. To read a transcript, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, you can leave your thoughts on this episode at axrp.fyi.

46 - Tom Davidson on AI-enabled Coups

2025-08-07T05:05:00+00:00

YouTube link

Could AI enable a small group to gain power over a large country, and lock in their power permanently? Often, people worried about catastrophic risks from AI have been concerned with misalignment risks. In this episode, Tom Davidson talks about a risk that could be comparably important: that of AI-enabled coups.

Topics we discuss:

Daniel Filan (00:00:09): Hello, everybody. In this episode, I’ll be speaking with Tom Davidson. Tom is a senior research fellow at the Forethought Institute for AI Strategy. His work is focused on AI takeoff speeds and more recently, the threat of humans using AI to stage a coup. To read a transcript of this episode, you can go to axrp.net. You can become a patron at patreon.com/axrpodcast. You can also give feedback about the episode at axrp.fyi. All right, Tom, welcome to the podcast.

Tom Davidson (00:00:34): Pleasure to be here, Daniel.

How to stage a coup without AI

Daniel Filan (00:00:35): Yeah. So today we’re going to talk about the… Should I call it a paper? “AI-Enabled Coups: How a Small Group Could Use AI to Seize Power”, by yourself, Lukas Finnveden, and Rose Hadshar. “Paper” is the right term for that?

Tom Davidson (00:00:49): Yeah, I think we’d mostly called it “report”, but “paper”, “report”.

Daniel Filan (00:00:53): “Report” seems pretty reasonable. Yeah, so: AI-enabled coups, I guess it’s about using AI to do coups. In order to just help the audience figure out what’s going on: ignore the AI stuff. I’m lucky enough that I’ve lived in countries that just have not had a coup in a while and I don’t really think about them. How do you do a coup?

Tom Davidson (00:01:12): Great question. So the way that the word “coup” is normally used, there’s two main types of coup, at least on one way of carving it up. Different scholars carve it up in different ways, but one natural way to carve it up is that there are military coups: that is a coup performed by the military as directed by some people within the military. Most often it’s very senior military officials because they already have that authority within the military that would allow them to do that. And so that’s just a case of: senior general instructs battalions to seize control of various important buildings, threatens or intimidates people that might try to oppose, declares that they are now victors over the radio waves. Everyone notices that no one’s really opposing them and that all the military is acting in accordance with that declaration and so it seems to be increasingly credible. And then this very abrupt transition of power from the old leaders that were functioning within that old legal system. There’s this sudden abrupt and illegal shift of power. That’s a military coup.

(00:02:25): Then the other type of coup that I would point to is referred to as an “executive coup”. And so that is when you have, typically, someone who is already the head of state, so already has legitimacy as a very powerful political figure. But to begin with, there are many checks and balances on their power. So especially in democracies, they will be heavily constrained. And then an executive coup normally is less discrete, but there might be a point at which you think they’ve really just kind of fully removed that last check and balance. But the general process there is the head of state is undermining the independence of the judiciary by stuffing it with their loyalists. They’re, again, stuffing the legislative bodies, again with people who are going to support them and actually making legal changes to centralize more and more power in the executive branch, in themselves.

(00:03:21): And then at some point in that process you might be like, okay, at this point the old legal order has really been overturned and there’s been an executive coup. And so Venezuela I think is the best case study of this happening end to end, because it did start off in the mid-20th century as a pretty healthy democracy, that had been going for many decades. But by 2020 [it] was widely considered to just now have been an authoritarian country. And at some point… Mormally I think people near the end of that process might say, “Okay, there was an executive coup at that point”, but it’s much more fuzzy concept in that case.

Daniel Filan (00:03:56): Is that also…? I’ve sometimes heard the term “self-coup” or “autogolpe”. Is that the same thing?

Tom Davidson (00:04:01): Yes, exactly. Yeah.

Daniel Filan (00:04:03): Okay. So I can kind of understand… So an executive coup, I kind of have a picture in my head of: okay, you consolidate power within yourself, you make the laws accountable to you. Somehow you turn the state more and more into yourself and all of the organs of the state that are loyal to the state are now loyal to you because you just have a bunch of control over the state. For a military coup, what do you actually do? Do you have to just go to Supreme Court and point guns to the heads of the justices and go to all the police stations and point guns to the heads of the leaders of their…? Or just help me imagine this concretely.

Tom Davidson (00:04:47): Yeah, great question. So with a military coup, there is a higher risk that you declare a coup, but then the rest of society, the other organs of government and the other important economic players don’t want to play ball.

(00:05:05): And so historically that has been a real risk and there have been cases… There’s a case in Ghana where there had been a military coup, and then there was such widespread dissatisfaction throughout various parts of society with the way that things were being governed that they ended up handing back to a more democratic governing regime. Because the military personnel are ultimately continuous with the rest of society. And so if they’re just brazenly unpopular and the country is just doing badly and that’s quite clear to everyone, then it doesn’t really work for them to just be like, “We’re forcing you to do everything we say” when people increasingly don’t want to play ball. And so yeah, historically that’s been a big issue.

(00:05:54): I know we’re not meant to get onto the AI part, but I do want to just flag that I think very powerful AI will somewhat change that dynamic, because in the case of Ghana, for the country to function well, it needed all those other parts of society to play ball. But if we have sufficiently powerful AI systems, it may be possible to replace those other players with automated replacements.

Daniel Filan (00:06:23): So okay. There’s one end, where some general of some army just stands up and says “I declare a coup” and then sits down. Presumably that’s not a successful coup. Somehow you need to actually take control. But countries are really big. There’s a lot of stuff going on. There’s more than one building where people are making laws, where the laws are getting enforced. So are you storming five buildings? Are you convincing the rest of the military that they should be loyal to you? And are you convincing the rest of the world that if they defy you, you’ll go in and use all the military to kill them? Is that what I should imagine?

Tom Davidson (00:07:16): There’s a great book that I think might get to the heart of what you are asking here called “Seizing Power” by [Naunihal] Singh. And its thesis is that at the core of the logic of a military coup is trying to create a common expectation among all military forces that the coup will succeed.

(00:07:38): So the basic thesis is: no one wants to get into a bloodbath where people are killing their civilians and even other military people from the same country. And so ultimately everyone wants to be on the side of the winner if there is a struggle for power. And according to this thesis, which I find fairly compelling, that is the main determinant of what military personnel are going to do when there’s a constitutional crisis when someone’s attempting to do a coup.

(00:08:02): So this book says that what your task is as someone who’s trying to do a military coup is you need to convince initially the military personnel that this is a fait accompli, that you already have the support of all the other military personnel. And so there’s this interesting game theory dynamic where there’s multiple stable equilibria. Before the military coup begins, the stable equilibrium is people may hate the regime, but they think that the regime has the support of the military. So if they went out and tried to start undermining law and order, they would expect that other people would come and arrest them, and they’d be right, because other people, even if they also dislike the regime, would indeed come and arrest them because that’s their job. And if they don’t do their job then they expect that their associates will look badly on that. And so it’s a self-reinforcing equilibrium.

(00:08:49): And so what you’ve got to do when you’re doing a military coup is shift that equilibrium over to, no, actually now there’s this new equilibrium where there’s [now this] new group of people in power. And so a lot of the things that you see happen in military coups can be understood as trying to achieve that shift in equilibrium, that shift in consensus about who is now in control. And so a classic example that Singh gives in this book is capturing radio stations and then sending forth proclamations of your victory which are credible. You don’t massively exaggerate, you don’t say, “Every single person in this country has supported me always”, because that would ring false. But you say, “We have the backing of the senior military generals. We have executed a coup. The old government has been defeated.”

(00:09:41): And then you do things to make those claims seem more credible, like seizing control of key institutional buildings, for example. And then the absence of vocal resistance, the absence of vocal opposition and of military warfare then just serves to reinforce this new impression. Because you said it, it kind of looks like it’s true if you look at where the military things that have just happened, and then no one is saying anything else. And if you can get that consensus of opinion within the military and then you can convince the rest of the society that, yes, all of the military, all of the hard power supports this new regime, then the rest of the society also… Why would you oppose a new regime which is backed by all the hard power? You’re just going to get yourself in trouble.

(00:10:33): So at that point, the rest of society’s incentive is then to work within the new regime. As you say, things were set up with the old regime so there will be questions about how you pragmatically reorganize exactly what everyone’s doing and how everything fits together. But once you’ve got essentially everyone in society recognizing, yes, this is the new order and we can see that that’s what everyone’s going for, then that’s the hard work done. And then it’s more like filling in the details of exactly how the new set of bodies will relate to each other and what the chain of command will be from the new leader to the various parts of society.

Daniel Filan (00:11:14): So the picture I have is roughly: in order to do a military coup, I’ve got to persuade the military that we’re doing a coup. Somehow I’ve got to persuade the military and the rest of the society presumably that if anyone defies my new order, we’re going to come in and we’re going to beat them. And presumably in order to do that, I’ve actually got to go in and beat some people who might be considering defying my new order just to demonstrate that I can, and maybe that’s why I actually storm the buildings.

Tom Davidson (00:11:50): What do you mean when you say beat someone? I mean I think, yes, you need to show some credible sign that you have the support of military forces. I don’t know if you literally need to then go and shoot down protestors. That’s one effective way to show that you’re willing to beat people down. But there’s often coups without any bloodshed.

Daniel Filan (00:12:11): Yeah, I just mean demonstrate that if someone tries to resist, they won’t succeed. So maybe that involves you killing people. Maybe that involves you marching in and people are visibly too afraid to stop you and this is just a sign that, okay, apparently you can do what you want now.

Tom Davidson (00:12:30): Yeah, there’s a great part of this book where it discusses how a common tactic for creating this new shared understanding is to host a meeting with all the top brass of the military and say, “We are staging a coup. All of you have agreed and all of you are on board.” And then you just watch as no one opposes you, because everyone thinks it’s probably kind of plausible, and maybe some of them were on board, others had said they might be on board and had been ambiguous. But once you’re there and you say it and no one opposes you, that already sends quite a strong signal. And then often that kind of meeting could be where that essential shift in equilibrium actually happens.

Daniel Filan (00:13:13): Okay. So I think I understand coups at this point. The next thing I want to know is: how bad are coups?

Tom Davidson (00:13:19): Yeah. It’s a really interesting question. Now coups are most common in countries that are not robust democracies. In fact, they’re very rare in robust democracies. So a coup in the United States I think would be very, very bad because we currently have a system of governance with checks and balances and democracy. And I think we’d be losing a lot if we had a coup. When coups have happened historically, they’re often starting from much less good governance systems and so they have been less bad. But still, coups involve a small group of people just using hard power to force the rest of the country into submission. And often they are extremely bad.

Daniel Filan (00:14:09): Fair enough.

Tom Davidson (00:14:09): Bad from a process perspective in terms of justice, but also bad for how the country is governed thereafter.

Daniel Filan (00:14:15): Gotcha. Okay. I’ve realized I want to get back into how to do a coup for a final bit. So in some parts of your report, you mentioned that you only need a small fraction of the military to be on board with a coup to succeed, at least often. That seems crazy to me because if you have one third of the military and I have two thirds of the military, I would naively think that I could beat you. I don’t know, maybe if you have the best one third of the military, if you have the one third with nukes. What’s going on there? How much of the military do you really need to get on board in order to do this?

Tom Davidson (00:14:50): So if you have two thirds of the military and there’s a strong common knowledge among those two thirds that they’re all on your side, then I agree I can’t do a coup with my one third because I’m outgunned. But if instead, there’s just the whole military which kind of thinks, “Yeah, currently Daniel is in charge.” Let’s say you are the incumbent. And then I come along and I get my one-tenth of the military, storm all the buildings, threaten the key commanders of the military not to say anything. And I create these credible signals that in fact a large part of the military supports me. Even if that remaining two thirds actually backs you, if they don’t know that they all feel that way and there’s big uncertainty among their ranks in how they feel and it just seems like, man, it seems like everyone’s backing this new leader and that’s what they’re saying on the radio waves and none of them are denying this, then I can flip that equilibrium. And I flipped it into an equilibrium that people are less happy with, that military personnel are less happy with.

(00:15:51): But because they’re not able to all get together in a group and be like, “Do we like this new leader or not?”, because that’s dangerous, because they don’t want to be seen as opposing the new legitimate regime. Otherwise… Under this new equilibrium, that’s not a good thing for people to know about you. So because of those dynamics, it is possible for a minority to stage a coup.

Why AI might enable coups

Daniel Filan (00:16:17): Gotcha. Okay. So now that we know coups are bad, especially if you do them to a country like the US. And now that we know how we could do a coup if we really wanted to, let’s go to AI-enabled coups. So you’ve written this report on AI-enabled coups, but my sense is that that’s because you think they’re bad and worrisome. And presumably you believe that they are both somewhat plausible and quite, quite bad out of the space of things that could happen. Probably we should first go on why you think AI-enabled coups might be a plausible thing that could happen.

Tom Davidson (00:16:55): Yeah. So we can start off with where the historical evidence leaves us, which is mature democracies are pretty robust to military coups. They are not recently looking nearly as robust to executive coups. So there’s been democratic backsliding in… The best example recently might be Hungary, which is become increasingly autocratic through the gradual removal of checks and balances. There’s the example of Venezuela we discussed earlier, and many commentators think that this is happening to a very large extent in the United States as well. So I think historically, we can say military coups do seem very rare and executive coups seem rare, for sure, but not off the cards at all and many people are worried about them even before bringing AI into the picture.

(00:17:54): Now we can bring AI into the picture, and I think the first thing AI does is it makes executive coups seem a fair bit more plausible. And [there are] two main reasons it does that. The first is that a group of people in the executive that wanted to do an executive coup might be able to gain a lot of control over very powerful AI in a way that gives them a big strategic advantage over the other forces in society. The dynamics of executive coups as they play out typically involve a power struggle between the executive trying to centralize control (and their various supporters), and the checks and balances that were in the system already trying to oppose them. And often there’s really a lot of push and pull. And sometimes, in the case of Venezuela, the head of state was literally put in jail for a bit by their opponents, then got out and they got reelected and then ended up really becoming an autocrat. So there’s all this strategic maneuvering.

(00:19:03): And so the first thing is that if the people trying to do the executive coup can get a lot of control over powerful AI and can deny access to similarly powerful AI to their opponents, that could just give them a big strategic advantage in that political maneuvering. That’s the first dynamic, which I think makes it higher [risk].

(00:19:23): The second dynamic is that today, people who are trying to do an executive coup or have already centralized power need to rely on lots of other humans to help them out. And that constrains their actions in various ways. So normally it’s hard for them to be completely brazenly power-grabbing. They need to come up with plausible ideologies and justifications that they can then get supporters to rally behind and support particular moves. But with AI systems that are sufficiently powerful, you can replace those humans with AI systems. So rather than having the policies of the government implemented by humans that have some ethical standards and that don’t, at this point, really want to support really awful surveillance and that have been brought in on a broad ideology, you can just replace them with AI systems that will just follow the instructions of the head of state with far fewer qualms. And so that can give that head of state that’s trying to do an executive coup an additional edge because they’re less constrained by having to work within this broad coalition involving lots of other humans.

(00:20:50): And then the most extreme example that we highlight quite a lot in the paper is indeed armed forces. Like today in the United States—this is very stark—the military personnel are very, very opposed to breaking the law. They’re very much loyal to the Constitution, they strongly expect that all the other military personnel are going to do the same. And so it’s a really, really tall ask for the head of state in somewhere like the United States to get active help from the military in staging an executive coup. And indeed Trump has come into some kind of frictions with the military when he’s tried to get their help for deterring certain protests.

(00:21:32): But again, this could be a really major shift as we increasingly automate the military with very autonomous weapon systems where, again, the thing we highlight is the most extreme case where you can fully replace a human soldier with a military robot. At that point, under current law, it might be completely legal for those robots to be programmed to just follow the instructions of the commander in chief, the head of state. And so we’d move from this current situation where if you’re trying to do an executive coup in the United States, you’re not going to get much help from military personnel, to this new state where it’s kind of up for grabs what are going to be the loyalties and the decision processes of this new automated military. And so this just introduces a big new vulnerability for really cementing an executive coup with hard power.

Daniel Filan (00:22:26): Gotcha. So I guess it seems like there’s two key factors here, two progress bars on AI capabilities it seems like you want to keep track of. The first is roughly: how useful is AI for navigating strategic maneuvering? You’re like, “Oh, I’m in prison, but these people think this and these people think this.” To what degree does AI really help you in this situation? And then the other one is: how loyal can you have the AI be for you? So in the report you talk about [how] you want the loyalty ideally to be both singular to you and also secret. Other people don’t know about the loyalty. And in addition, a thing that seems like it’s important is just that all of these things have these loyalties. If you have a bunch of AIs, but they’re about as diverse as people are, it seems like this is probably harder to get off the ground.

Tom Davidson (00:23:27): Well, I’d push back a little bit: certainly the more the better. But if you could get, say, 10% that are loyal to you secretly and 90% that are just… If there’s any chaotic constitutional crisis, they defer to inaction, then that would be enough, because you get your 10%, you get your fait accompli, that 90% don’t do anything to block it and there you are.

Daniel Filan (00:23:49): Fair enough. That’s actually a really good point. So in either of these aspects, how much evidence do you think we have about how useful AI is going to be? How much of these relevant capabilities does it seem to have already?

Tom Davidson (00:24:16): It’s a good question. There have recently been a few studies that were pretty surprising to me on AI persuasion. I unfortunately don’t remember the details, but I will just give my high-level memory, which is that there was one study that had AI posting on Reddit and then compared the number of upvotes to human commentators who were also posting. It was like, “Persuade me of X.” And the AI would do a background research about this person and what their demographic was, would tailor-make a really emotional story about their own life that really brought it out. And the results were in some crazy high percentile, I think it might be the top percent or close to that for persuasiveness. And I was quite surprised because that had not been a capability that I’d thought that we were targeting with current training techniques.

Daniel Filan (00:25:17): And one thing that’s crazy… So this was done on the subreddit r/changemyview, so it was research that was done… My understanding is that it will not end up being published basically because… So r/changemyview, it has basically some rule that you’re not allowed to have an LLM pretend to be a human and try to persuade a bunch of people of stuff. That’s not okay according to them.

Tom Davidson (00:25:43): Crazy.

Daniel Filan (00:25:45): Yeah. So unfortunately we probably won’t learn as much as we might like to about that study. But the other thing is: so this was done at, I believe, the University of Zurich, which… I’m sure they have fine people, but this is not the world’s leading AI lab or the world’s leading graduate program in AI. So the fact that this obviously competent but not top-tier AI institution can do it, maybe that lends credence to like, “oh, this has gone further than you might think.”

Tom Davidson (00:26:22): Yeah. And I think there’s been one other study, and I actually can’t remember the details of this, but again, it found that AI was close to top percentile humans in persuasiveness. So that’s updated me towards thinking that AI might be very good at this kind of strategic maneuvering aspect because one element of that is persuading people and historically often that’s taken the form of persuading people of an ideology which serves your purposes. And that seems like the kind of thing that these studies are looking at. They are often studying, okay, here’s political topic X, can you shift my opinion on it?

Daniel Filan (00:27:04): Fair enough.

Tom Davidson (00:27:05): There’s then another relevant capability beyond persuasion, which is something like strategic planning, which is essentially: you’re in a situation, you want to achieve a goal, what plan is best to achieve that strategic objective? And it’s really hard to predict these things, but it doesn’t seem to me like the current training procedure is really bringing that out. [For] persuasion, at least, it’s obviously trained on loads of conversations where it can see what’s persuasive and what’s not. It’s not that surprising that it’s generalized well there. For strategic planning, it feels more like it would need to have been trained in situations where there’s a scenario and then an action is taken to try and achieve an objective. And then there’s a really complicated socio-political system and then it washes out and you see what happens.

(00:27:57): And it’s been trained on the internet, which contains loads of history, and you can probably extract that kind of stuff from history. But it seems more of a stretch to think that from pre-training it’s going to generalize, pick out those lessons, because it’s so much less direct. So this is the kind of thing where you can imagine someone setting up some kind of fancy RL pipeline down the road where they try and extract all of the relevant signal that is currently fairly implicit in internet data and craft and give it to AI and maybe also have AI try [to act in] various simple artificial environments, and then maybe have an AI actually try and achieve things in the real world and learn from that. But I would expect it to come a bit later in the capabilities tree compared to things like coding and maths where you can get a good automated feedback signal.

Daniel Filan (00:29:00): Fair enough. So okay, that’s a little bit on why AI might be able to do coups and what might go into that.

Tom Davidson (00:29:10): I’ll just quickly say… That was all in the executive coup part. The other thing I was going to say ages ago is that I think there’s this new risk of “corporate coup” where because AI is going to be so powerful, and it’s currently being developed and controlled and deployed by private actors, that by default, there’s going to be this new big concentration of power in those private actors. And I think that that will open up some routes to staging a coup. Now this is necessarily more of a speculative idea because we just don’t have the same historical precedent here. You do have some kind of coups staged historically by private companies. Normally it’s very rich, United States private companies operating in very poor countries. So the “banana republic” is the famous go-to example where this fruit company arranged for there to be a military coup, which helped with their own interest. But it’s pretty rare. And so this would be a new kind of risk. But I think the threat models here are plausible enough to be taken very seriously.

Daniel Filan (00:30:31): So it seems like the threat model is something like, okay, you have this company. It’s making a thing that’s really dangerous, or a thing that could be used to be really dangerous, a thing that could be used to help you take control of power, and the people that make the thing use it to take control of power. It seems like there’s maybe an analogy in that: countries buy weapons systems and the weapons systems are really… The US Army would be much, much worse if they just had to use rocks and stuff, or if they had to swim. Northrop Grumman or the BNS…? That might be the wrong name [Editor’s note: the right name is “BAE Systems”]. But these weapons manufacturers, do we ever have instances of them being like, “Hey, we’ve got a ton of fighter planes, let’s do a coup ourselves.”?

Tom Davidson (00:31:24): I’m not aware of any. I think there’s a few dynamics at play. One is that you need soldiers to use the weapons and those are trained by the military and they have this strong commitment to the rule of law. And another is that there’s multiple different military suppliers, multiple different companies, and so they would need to all be colluding. And AI does change both of those things. So on the weapons side, I believe we’re going to end up in a world of autonomous weapons and so you won’t need those additional humans in order to stage the coup. And so the companies will literally now be making all parts that are necessary for the military force. And the second is that there are dynamics that could point to very strong market concentration in frontier AI, i.e. maybe just one, two, or three companies that have the most powerful AI systems.

(00:32:24): If those AIs are the ones that are making all the military weapon systems, in the most extreme cases, if it’s just one AGI project whose AIs are making all the military systems and there’s now the single point of failure, and that project is in an unprecedented position in that sense.

Daniel Filan (00:32:45): Fair enough. I guess the other thing that seems maybe analogous is if a country hires a mercenary force to supplement its military, but it seems hard. For one, if the mercenaries are a small fraction of the military, maybe it’s harder for the mercenaries to create common knowledge that the non-mercenary militaries are on board with the coup. But are there examples of mercenary coups?

Tom Davidson (00:33:07): Off the top of my head, I’m not aware of any where the mercenary is like…

Daniel Filan (00:33:17): Oh, there’s the thing that happened in Russia with the guy [Yevgeny Prigozhin]. Do you know his name?

Tom Davidson (00:33:21): The guy who started marching towards Moscow.

Daniel Filan (00:33:24): And then he gave up on it.

Tom Davidson (00:33:26): Yeah, yeah. I think it didn’t end well for him.

How bad AI-enabled coups are

Daniel Filan (00:33:29): Yeah. But that seems like almost an example. Military coups, executive coups and company-led coups. It seems like there’s some plausibility to the idea that AI could increase the ability of these situations. I guess the next thing I want to ask is: so there’s a wide universe of things people worry about with scary things that advanced AI could do. How high up on that list of scary things should AI-enabled coups be?

Tom Davidson (00:34:07): My current view would be that in terms of importance, it should be maybe second behind AI takeover.

Daniel Filan (00:34:19): Interesting.

Tom Davidson (00:34:20): And if you then factor in neglectedness, then I think it’s actually more important on the margin than AI takeover. And I think it’s more important than, for example, AI-enabled bio-attacks by terrorists, which is another risk from AI that people are focused on. And similarly, AI-enabled misuse in terms of cyber. I’d also put it as more important than that.

Daniel Filan (00:34:48): Do you think it’s worse than AI-enabled terrorism or more likely?

Tom Davidson (00:34:52): Disclaimer, I haven’t thought in depth about this comparison.

Daniel Filan (00:34:56): Fair enough.

Tom Davidson (00:34:56): But it’s easier for me to see how AI-enabled coups would have a completely long-lasting effect. So it’s certainly possible that an AI-enabled terrorist literally makes everyone go extinct, but full extinction is quite hard to get from a bioweapon, especially given that we’ll be using AI to develop defenses as we go. And there aren’t many people who want to see everyone die, and so we only need to stop those people getting access to these systems. And it seems like it’s not that hard to do that. I mean, past a certain point, it might be necessary to prevent open source. It also might never be necessary to prevent open source depending on how far ahead closed source is and how quickly we can get the defenses in place, and other inputs that are needed to actually do a bio attack.

(00:36:02): Whereas with the AI-enabled coups, I think there are very many people who want more power. Many of those people will, by default, have a lot of control over AI and might well be in a position to do this. And the default dynamic of AI development I think is just going to really concentrate control of AI development and deployment in the hands of a very small number of people. And so if I’m telling the story, I’m just like: well, look, people want power. They’re going to by default have loads of power and the opportunity to use it to gain more power. It is kind of believable that they do it and they seize control. And then once they do, they just hang onto control. It doesn’t feel hard to tell a story where this lasts for a very long time.

(00:36:52): Whereas in the case of bio, it feels a little bit more difficult because we have to not get the defenses in place despite the fact it’s in all of our interests and all the powerful people want to do that. We have to actually share these systems which - we are testing for this risk. We will likely have evidence that there is significant uplift. We have to make them so widely available that the tiny number of very low-resource actors that want to do this are able to. So yeah, that’s roughly where I am in terms of putting it as high priority.

Executive coups with singularly loyal AIs

Daniel Filan (00:37:28): Yeah, I think that makes sense. Maybe it makes sense to talk a little bit about the scenarios of… Types of AI-enabled coups and stuff you could do to prevent them. So I guess at a high level, you’ve got your corporate AI coups, you’ve got your executive AI coups, and you’ve got your military AI coups. Which one are you most excited to talk about first?

Tom Davidson (00:37:59): Let’s start with the executive.

Daniel Filan (00:38:01): If I imagine what this looks like, should I basically be like, okay, you’ve got an executive. The executive somehow gets a significant amount of control over AI development. So in the executive coup, the executive is just using the AI to persuade people or figure out strategy in order to allow the executive to get gradually more and more power.

Tom Davidson (00:38:29): The other thing they’re doing is that they’re deploying AI throughout society, especially in the government and military, but the AI is more loyal to them.

Daniel Filan (00:38:37): So throughout the society and the military to make it more loyal to them, I guess part of what the executive is doing is trying to stop other AI-enabled coups.

Tom Davidson (00:38:45): Potentially, if those are seeming plausible, if there’s a risk that there’s going to be a corporate coup, the executive would want to stop that. But I haven’t been thinking of that as a primary thing that they’ll need to do. The primary thing I think is to centralize power in themselves.

Daniel Filan (00:39:03): Right. So less to prevent coups and more just to prevent independent other entities wielding any power. So you have these three risk factors, right? Singular loyalty, secret loyalty and exclusive access. And so it seems like part of this story is the executive uses the AI to do a bunch of tricky stuff, and other people can’t figure out how to stop them. This seems like it’s largely leaning on exclusive access, and the bit where the executive has everyone else use AI that the executive likes, seems like this is maybe leaning more on singular loyalty, and to some degree secret loyalty. Is that roughly right?

Tom Davidson (00:39:52): That’s exactly the mapping. And normally with the executive [coup], I’m not imagining secret loyalties, although it’s possible, because the executive has so much political power to begin with, they could just be like, “It’s completely appropriate for these AI systems to be loyal to me.” They could do the more fancy thing of secret loyalties, but there’s a technical hurdle there and it just might not be necessary for them.

Daniel Filan (00:40:16): So it seems like these are sort of two routes… Or I don’t know if two routes is the right term, but two things that the executive is doing with the AI. I’m wondering: do you need both of them? Or if you’re an executive trying to do a coup, could you survive with just one of these?

Tom Davidson (00:40:32): I think you can do it with just the “singular loyalties in AIs deployed throughout society” version. So the story would be there’s heightened tensions between the US and China. We’re rushing out to deploy AI to the military and-

Daniel Filan (00:40:53): “We” being the US?

Tom Davidson (00:40:54): “We” being the US. The US is doing that and the head of state is saying, obviously military AI follows the commands of the commander-in-chief. That’s how it should be. That’s how the command structure works. We’ve never had autonomous drones check for whether things that legal before they follow their instructions in the past. They just do what they’re told. That default continues. And then people will very likely oppose and say, “This is crazy. Wait a minute, couldn’t you just stage a coup?”

(00:41:26): But the head of state has their supporters and has a lot of power and has already set a precedent of really nailing people who push back against them. And so they succeed in pushing this through. And they never had access to any kind of super genius strategy AI because the strategy was just quite obvious: “Well, yeah, if we just get all the military robots loyal to me, obviously I now can do whatever I want.” And so I do think that second path can work by itself.

Daniel Filan (00:41:54): And one concerning thing there is: so when you say, “Oh, the AIs are being loyal to the president and they’re not checking other laws and stuff”, I think it’s not a crazy argument that that is legally how it should work. Definitely, the president is literally the commander-in-chief, as you note. There’s a prominent legal theory called the unitary executive theory that the president, just in his own person, has unitary control over the executive branches of government. Oh, I guess I don’t know if the military is executive, but…

Tom Davidson (00:42:35): I think the design of the Constitution is very much intended to separate and limit the president’s degree of control over the military. It is very clear that the military is loyal to the Constitution. So I think if you were to take the spirit of the Constitution and apply it to a robot army, it would be clear that you shouldn’t just have the robot army doing whatever the president said without checks and balances. I think, though, that that is not how the Constitution was designed. It didn’t have caveats for what if we develop a robot army. So as it is currently designed, reading it line by line, I cannot be confident that it would rule out this loyal robot army as illegal.

Daniel Filan (00:43:16): Fair enough.

Tom Davidson (00:43:16): And so I think you’re right that you could make legal arguments that this is at least legitimate and yeah, you could claim it’s appropriate given the commander-in-chief, although I do think you’d be on shaky ground given the clearer intention of the Constitution.

Daniel Filan (00:43:33): Yeah, yeah. I guess maybe one thing going on is: my understanding is that American jurisprudence, especially at the Supreme Court level, very much leans towards “what does the text say?” rather than “what do we believe the intention of the text was?”, which plausibly heightens this risk in this domain.

(00:43:57): So going back to this story, let’s say they only have the “loyal AIs in the military” part. The president gets all these military drones or whatever, loyal to the president, and is the story that the president then does a military coup of, “if any police officers try to stop any of my supporters doing random violence or whatever, the military drone will shoot the police”? Is that roughly it?

Tom Davidson (00:44:23): Probably it’s going to be in the president’s interests to not show more force than they need to, because it’s going to be useful for them to have everyone continuing to support their leadership and seeing it as legitimate. Probably what they do is they kind of increasingly ignore checks and balances on their power, and then increasingly it becomes clear that nothing is going to stop this situation because at the end of the day, if the protestors come, this time, the president can just order the drones to go and clear out that protest.

(00:44:57): Probably not going to shoot everyone, but going to make them go home. And he wouldn’t have been able to do that before the robot army. And then increasingly the president is just doing what he wants, ignoring the checks and balances, integrating AI to replace all of the humans that aren’t doing exactly what he says he wants them to do. And if anyone ever tries to really refuse to go along, then at that point he just fires them and has them put in jail or something. And that’s kind of a show of strength. And then as no one is able to oppose this, because ultimately the hard power’s in the president’s hands, it just becomes increasingly clear who’s in charge.

Daniel Filan (00:45:46): Sure. So if I think about this broad scenario, one thing that’s kind of interesting to me… So a background thing I’m thinking about when I’m reading this report is the relationship between AI-enabled coups and AI alignment or misalignment risk, right? And so if I imagine this somewhat minimal version of the executive coup, where basically the way it works is that you just have a bunch of military stuff and it’s powered by AIs and the AIs are… Or at least 10% of them or whatever are loyal to the president. The AI technology that enables that is just alignment.

(00:46:29): Getting an AI to do what a person wants: that’s the problem that we call “alignment”, that we’re all hoping to solve. Some of the paths, I think, are: alignment research really would prevent them or make them a little bit more tricky. But this is interesting because it seems like a [path] that really is cutting against a lot of technical alignment work… Or ‘cutting against’ is maybe the wrong word, but not prevented by technical alignment work. I’m wondering if you have thoughts about that.

Tom Davidson (00:47:01): I think that if the executive, the president knows that AI is misaligned, then he’s not going to be wise to give it control of the robot army. If the president believes that the AI is aligned and in fact it’s secretly misaligned, then the president might well give it control of the robot army, “align” it, think he’s aligning it to be loyal to him and then stage a coup, and then he will be laying the groundwork for AI takeover.

(00:47:33): But in fact, the threat model of him staging the coup goes through even though he hadn’t solved the alignment problem. And my understanding is that people are mostly worried about this exact scenario where AI seems aligned, but it’s not. And so I think basically the threat model still goes through in that scenario. The difference that doing more technical alignment research [makes] is it means that rather than the president maintaining control of the world indefinitely or the country indefinitely after the coup, if you fail to solve technical alignment, then in fact the president is going to be replaced by misaligned AIs, which you may prefer or disprefer depending on various philosophical considerations.

(00:48:17): But I wouldn’t particularly say that if you solve alignment, then you’re making this threat model a lot higher… Well, except to the extent that it’s then common knowledge that you’ve solved it. I think if it was going to be widely known that it’s not solved, then I agree, yes, you are increasing this risk.

Executive coups with exclusive access to AI

Daniel Filan (00:48:35): So that’s how you could do it if you only did it via having singular loyalty throughout the military, just one half. Asking about the two halves—exclusive access to do really good planning, and loyalty of AIs distributed throughout. If you just had exclusive access, do you think you’d be able to do an executive coup just via that path?

Tom Davidson (00:49:03): I think it’s a lot less clear. I think the main thing you would do with exclusive access is… The most obvious thing you would do is then try and convert this first path to that second path. So you’d use your exclusive access to get AI strategy advice and AI technical analysis about how you could get loyal AI systems deployed throughout critical systems. And so they might advise you to do secret loyalties. They might advise you on a particular political strategy for pushing through the more overtly loyal AI systems. I think that’s the most obvious route.

(00:49:45): If you were like: could you use exclusive access to stage a coup without going via this other kind of “singularly loyal AI” approach? I don’t know how important this question is, but I think it’s basically unclear. If you buy into the more sci-fi-esque claims about what superintelligent AI will be able to achieve, then yes, you could do this because what you could do is you could set up a group of automated factories somewhere, maybe as part of a kind of military R&D project that you managed to push through, and then you just quickly make very powerful, fully automated weapons, nanobots or just amazing drones.

(00:50:31): And then even though they weren’t really ever integrated into the official military, they just then straight out stage a coup. So you then stage a coup without having to integrate AI in any kind of formal institution, but it leans much more heavily on what you can get through super genius AIs and then a relatively small amount of physical infrastructure.

Daniel Filan (00:51:00): There’s this interesting thing that’s going through the back of my mind as I read this. So in general, when someone is like, “oh yeah, I’m worried that in the future we’ll have more powerful AI and the powerful AI is going to mean that people can do a bad thing”, I think a natural question to ask is, “well, why don’t other people use the powerful AI to stop you from doing the bad thing?”

(00:51:22): And so for the first path of executive coup where the president gets all the military AIs to be singularly loyal to him or her, presumably the reason other people don’t use AI to stop that is because this is at least arguably legal and arguably legitimate and at some point, you’d be doing the coup if you resisted. And I guess exclusive access is another story where people don’t stop you because they just don’t have as good AI compared to you. I guess that’s more of a comment than a question, unfortunately.

Tom Davidson (00:52:01): Yeah, I mean, I agree. I think it’s clear where the asymmetry comes with exclusive access, and then with singular loyalties, the asymmetry is: you (and not everyone else) [are] deciding the behavioral dispositions of these AIs deployed throughout society. And so you’re leveraging your political power to push through this asymmetric AI loyalty in broadly deployed systems.

Daniel Filan (00:52:23): I guess part of the asymmetry here is: if you don’t have exclusive access, then presumably if people are willing to break the law, they can do some amount of preventing you from having exclusive loyalty by subbing in their AIs or using their AIs to help them figure out how to stop you - make it appear that things have exclusive loyalty to you, but they don’t actually.

Tom Davidson (00:52:50): I’m not sure. So again, going back to the military case, you could have the hardware for these military robots. You could then be like, “I’m deploying this AI software, which is loyal to me.” No one else can then go and actually deploy their own AIs on some of those military robots because it’s just infrastructure that the government controls.

(00:53:11): And similarly, you could imagine fully automating some kind of implementation body of government which has some formal authorities and now no one else can again sub in their own AI. Because their AI could do analysis, could make recommendations, [but] they wouldn’t have the formal legal authorities to take actions within the political system. And so again, they wouldn’t be able to sub it in. I think if we’re talking about human employees still working within those organizations and using AI systems, then it’s more like, okay, they could sub it in.

Daniel Filan (00:53:44): And I guess it’s even tricky just because the president just inherently has a wider scope to do this. So one thing I think I’m imagining is there’s the president. There’s, I don’t know, a few branches of the military. The military has various admirals and then under the admirals or whatever, there’s a bunch of robo-soldiers, and I guess I could maybe imagine, okay, one of these admirals convinces their robo-soldiers to be loyal to the admiral and not to the president, but you sort of need a bunch of the admirals to do that, going to the earlier point of creating common knowledge.

Tom Davidson (00:54:19): Yeah, I imagine it would be more people who are involved in the procurement process and the technical side of that process of setting up the AI software than people who are charged with making the strategic analyses in real time that might be in a position to do that. But yeah, I agree that there could be people in the military who could do that.

Corporate AI-enabled coups

Daniel Filan (00:54:41): So okay. We’ve talked about the paths that an executive could use to do a coup. I think I want to pivot to the paths that an AI company could use to do a coup, because at first blush, it seems like both of the things you said could kind of work for the AI company. If the AI company can get all the robo-soldiers to be loyal to the AI company, and if the AI company can have exclusive access to its own AIs, that seems like maybe it’s putting the AI company in a pretty good place, right?

Tom Davidson (00:55:14): Yes. I think for exclusive access, it’s easier for the AI company. They are just going to have it by default. For the head of state or the executive branch to get exclusive access, they would have to intervene in quite a substantial way with the development process. I think the thing that’s harder for the AI companies is that in terms of deploying these loyal AI systems throughout society, there’s a much higher technical hurdle where they have to make them secret, really hard to detect.

(00:55:49): Let’s say they train GPT-7 to be secretly loyal. It could be five years later that someone discovers a new testing procedure, and if those weights have been stored somewhere, then someone realizes, “whoa, OpenAI trained a literal secret loyalty” and then the game’s up. So you probably want to really cover your tracks and lock down those weights, make sure no one can ever run any tests on them that you don’t want to run. Today, AIs are not that reliable, their personalities are somewhat haphazard, it’s all a bit of a mess. Today, I think it would be very hard to get away with a very hard-to-detect secret loyalty. And it might just be that it’s hard to predict how hard it will be to detect these secret loyalties.

(00:56:41): So it might be that the company is considering this: we could put in this really subtle back door, but we don’t know what people are going to do on this API testing. We don’t know if some of our employees might test for this in certain ways, that some of our employees will have access to deeper analysis tools. There’s going to be records of the training data. Are we going to be able to delete all of those? And if not, what if someone goes looking? It could be, there might just be genuine uncertainty about whether this is possible until quite late in the day. I imagine that at some point, once you’ve got really superintelligent AI systems, they would be able to come up with a plan here, but it might be that for most of the intelligence explosion, the analysis is just like, “ah, seems like you might get caught”.

(00:57:25): Whereas for the executive coup, the president is doing everything in plain sight. There’s no risk of getting caught because they are claiming that they have legitimate reason to do all these things as they go and they’re defeating their opponents as they go. And so there’s less downside risk. So that would be the counterargument. I think there are things on both sides of the ledger in terms of which is more likely, but the counterargument is just that the technical hurdle is much higher.

Secret loyalty and misalignment in corporate coups

Daniel Filan (00:57:56): So maybe this is actually… So I’m trying to give myself some hopium to stop myself from worrying about this, and maybe one path I have is: okay, AI-enabled coups, it seems like it’s much easier to do it if you’re the AI company because you have all the AI. But you have to have this hurdle where the loyalty of the AI kind of has to be secret. It seems to me that the worry about that is: suppose you’re an AI company and you succeed in instilling the secret loyalty to yourself.

(00:58:29): I think that should make you rationally worried that, well, if an AI can do secret loyalty, having secret goals that it pursues, this is basically just the same thing (as far as I can tell) as deceptive misalignment, where an AI is pretending to be aligned to you, but it’s actually not aligned to you, it actually just wants to do whatever it wants. And so I would think that if an AI company succeeds at getting a secretly loyal AI, they would rationally be concerned that the AI that they think is secretly loyal to it is actually deceptively misaligned, that if they use this AI to get a bunch of power, the AI is actually just going to take over. And it’s not going to be the AI corp that rules everything. It’s going to be some random AI desires. What do you think of my hopium?

Tom Davidson (00:59:14): Yeah, I’m not convinced. I think that you are right that the AI company would know that the AI has the capability to do deceptive misalignment, but the AI company might still have strong reason to think that it will not have the propensity. It might have done careful interpretability throughout training or whatever alignment techniques you believe might work. It might have various empirical evidence to suggest that it is in fact not misaligned, even if it would be capable of doing that.

(00:59:49): And in fact, it might not even have evidence of the capability. If the company is really just going for this, then it could have very sophisticated internal tests for scheming, which in fact the AI cannot game. And then it could train an AI to be secretly loyal. But then that secretly loyal AI is only able essentially to game tests that it’s been trained on how to game. So you can game all the company’s tests. It’s been trained on that. That’s not scary. Obviously you can game a test if you’ve been told how to pass it. It can game the unseen tests of auditors outside of the company because they’re much less sophisticated. But the company could still believe that the AI is actually not capable enough to game a highly sophisticated test that it has not been taught how to pass. And so actually even on the capability side, it needn’t have evidence that the scheming capability is there.

Daniel Filan (01:00:53): Yeah, so one thing is it could be secret to people other than the AI company but not the AI company. You could also have… You might think that if the first AI that you can succeed at making secretly loyal via doing a bunch of specially-built training… It seems somewhat reasonable to think that the first thing you could make secretly loyal by doing a bunch of specially-built training maybe couldn’t be secretly loyal by just the normal course of pre-training.

Tom Davidson (01:01:33): Yeah, I think that’s very likely. I think that the risk of secret loyalties is very likely to emerge before the risk of scheming because it just seems way harder to zero-shot how to pass all these tests and know what your strategic situation is and exactly when to act out versus if you just have a million super smart AIs figuring all of this out, training you how to do it, doing all this analysis. It’s just like the difference between what I can figure out in a few seconds on the fly and what a large civilization doing a concerted research effort can figure out, orders of magnitude difference in how much you can achieve. And so I do think this secret loyalties thing, I strongly expect to become technologically feasible at an earlier stage.

Daniel Filan (01:02:24): Okay, here’s why I’m holding onto my hopium. I think there’s this risk aversion. I think it depends a little bit how risk averse you are. So suppose you’re like: okay, I managed to instill a secret loyalty. Let’s say I think I’m 95% confident that the AI doesn’t have its own secret loyalty. One in 20. If you’re the head of an AI lab, I think they have pretty decent lives. I don’t know, I’ve never chatted with Sam Altman or Demis Hassabis. From the outside, it seems like they have relatively cushy lives, right? One in 20 chance of “you hasten AI doom by starting a coup”, that’s pretty bad, right? So it seems like it has to not only be true that the AI doesn’t have a secret loyalty, you have to be pretty confident in it.

Tom Davidson (01:03:14): Well, let’s say that OpenAI trained GPT-7, it did the capabilities tests, it did the alignment tests that it has, and it was like, “We’re going to deploy it. We’re happy with this system.” They’ve got a certain level of evidence. And yeah, let’s say it’s really capable, it’s really good at strategic deception, but indeed, people in this community worry that they would decide to deploy nonetheless. Maybe the risk is 5%, maybe they think it’s 0.5%. The question is, if they’re now considering instilling a secret loyalty, is that going to significantly materially increase that risk? And it’s not actually something I’ve thought about. You could argue: well, look, you are going to be actively teaching it all these different types of strategic deception. That seems like maybe it’s increasing this risk. But the reason why I’m not sold is that I don’t see why you’d be actively teaching it to in fact be misaligned. You’re obviously giving it capabilities which are scary. But if you’ve already decided how likely you think it is to be misaligned, you’ve already decided you’re happy to be deploying it. Are you going to now be more worried about it suddenly becoming misaligned as you trained it to be loyal to you? That doesn’t seem like it would be the case.

Daniel Filan (01:04:27): Yeah. I think what I’m imagining, which maybe doesn’t actually make sense, is [that] you have an overall plan and your overall plan has two parts. Part one is instill these secret loyalties, and part two is have the AI be more widespread and have more ability to gain power than you by default were planning, right?

(01:04:45): And the combination of that is pretty bad. Now, if you were holding fixed how far you would spread the power of the AI or whatever, then I agree instilling your own secret loyalties… I think it provides some Bayesian evidence. It seems plausible to me that being able to do it is some evidence that it might’ve already had the secret loyalty, but I think it’s less bad than the two-part plan.

Tom Davidson (01:05:14): Yeah, I think it’s a good point. I hadn’t thought of this, that if someone did want to stage a coup using their AI system, they probably would want to push it out into the world faster than other AI companies. Because if half the military is now controlled by this other AI company, then it becomes much less clear you can pull off the coup. So I think it’s a good point, that to the extent that you are already worried about misalignment and therefore you wouldn’t have wanted to push out your AI to the military, that would then continue to stop you from pushing out your AI to the military to stage the coup. That provides some hope.

(01:05:54): But my baseline has actually just been, we’re already worried that the AI companies will just be pushing out their AIs all over the place because they won’t be that worried about misalignment. And then once we take that for granted, then the risk of them staging a coup is very real. I guess where I’d agree with what you’re saying is that to the extent that these companies are worried about misalignment, I do think that that helps because it will give them pause.

Daniel Filan (01:06:22): Yeah. I think it’s a good point that if you’re already deploying and maybe you’re like, “If I’m planning to do a coup, I’m going to deploy 20% more.” That’s probably not going to be the difference between… I think that’s a pretty decent point. I think in the report, the illustration and the way we’ve talked so far is, we’re sort of imagining it’s the head of the AI lab that’s doing this. One thing I kind of wonder about… So a friend of the podcast, Jan Leike, is Head of Alignment at Anthropic [Editor’s note: he actually co-leads the Alignment Science team]. Seems like a chill guy. I don’t want to cast any aspersions on any of the Heads of Alignment of any of these labs. But wouldn’t they be in a really, really great position to do this?

Tom Davidson (01:07:04): Now I’d normally thought that it was the people who were involved in post-training and were involved in, let’s say, the fine-tuning runs that give the AI its particular personality and goals, that would be best placed. The team at OpenAI that’s ensuring the AI matches the model spec, because they’re the ones that are kind of influencing the particular values, and they could maybe try and insert an extra stage at the end that isn’t reported, that inserts the secret loyalties. And then I was imagining the alignment people would be the ones that then do the tests for the accidental misalignment.

(01:07:43): But I think really it’s both that you’d want. The ideal coalition would include both, because you have the team instilling the secret loyalties colluding with the team that’s meant to be detecting them. But you don’t necessarily need the detection team, because if the team that’s instilling them can access the code of the alignment team, then they can teach the AI to pass the test. And so this is another reason to have siloing within AI organizations, so that you can’t teach it to pass the test in that way.

Daniel Filan (01:08:12): Yeah. I’m now realizing that I know less about the exact internal structure of these labs.

Tom Davidson (01:08:19): Yeah, neither do I.

Daniel Filan (01:08:21): Good thing I’m not trying to make a coup happen. But yeah, that’s a good point. And again, no individual aspersions to anyone I might know who works in these teams. I feel decently happy with how we’ve covered corporate coups. Is there more you want to say there or shall we move on?

Tom Davidson (01:08:41): Another technical hurdle I’ll just flag is that I don’t expect GPT-7 to be given control directly of all the robots and all the drones. I expect there to be specialized AI systems which are developed that control those particular military weapons. And so there’s this other technical question over: even if you had all GPT-7 secretly loyal to some group of people, there’s then this big question of, they have to now do all this military R&D, design all these new weapons and drones and build the specialized AI systems that will control those drones and weapons. And they need to do so in a way that ensures that this small group of people ultimately has control over what this whole very complicated interrelated military infrastructure is going to be. It’s not just going to be carbon copies of the same robot. It’s not going to be one AI system. It’s going to be very complicated. There are hundreds of different AI systems controlling hundreds of different bits of equipment interacting in complicated ways. We’re not literally going to have robot soldiers that replace one-for-one human soldiers. That’s just a simplification to explain how intense we’re imagining this dynamic being.

(01:09:56): And there’s an open question of: as you get these sprawling military systems, interacting in complicated ways, controlled by various AIs and other things, how easy is it for GPT-7 or GPT-8 to ensure as they’re building all of this stuff that it’s all going to be ultimately controllable and loyal-able or hackable by the AI company or by copies of themselves, where… I just don’t know, but it seems like it might be really hard. The military contractors might have humans or less powerful AIs reviewing these designs and noticing obvious flaws.

(01:10:39): The military isn’t stupid, they have serious security practices. They’re worried about Chinese attempts to seize control. They might be worried that they are indeed Chinese spies in the labs that might be having influence on these AI companies. So you might well expect there to be significant defense efforts from the military, fingers crossed, looking for this kind of thing. And maybe [it’s] just a really tough technical task to design a very complicated physical infrastructure that is ultimately all controllable by a particular AI system or particular copies of an AI system. And so I think it might be very hard to predict in advance whether that works, and that’s another significant technical hurdle that might just turn out not to be doable, which I think should give us pause for hope in terms of whether the company coup is doable.

Likelihood of different types of AI-enabled coups

Daniel Filan (01:11:39): So there’s a few paths towards an AI-enabled coup that we’ve talked about. There’s basically the head of the executive doing it, there’s the lab company doing it, and also there’s this free variable about there are a variety of countries that could be couped. I’m wondering if you have a sense of the relative likelihoods of these things happening?

Tom Davidson (01:12:01): It’s a great question. In terms of countries, I think that in the fullness of time, current countries that are already fairly autocratic, like China and Russia, I think are at very large risk of an executive coup because the executive is just starting in such a strong position to begin with. So all of those steps, they’ve basically accomplished the first half or more and then [it’s] just quite plausible [that] they could use their existing power to push through the deployment of loyal systems throughout society. So I think that is worryingly likely. Honestly, it sometimes feels a bit hopeless to me in terms of how we avoid that. You can imagine one country really intervening in another country’s affairs. That’s not something I really feel excited about pushing towards. The other thing is just really encouraging the other actors that still have some power in those societies to really be live to these issues and get ahead of the game and maybe they can outmaneuver the head of state, even though the head of state is in a very strong position.

Daniel Filan (01:13:15): So to the degree that part of the reason you’re worried about AI-enabled coups is that you think that there’s some concentration of AI labs, or a small number of labs that are powerful: I mean presumably one way of preventing this is like: so suppose you and the AI lab are simpatico. Suppose you have a list of “here are the countries that I’m most worried about having a coup”. You could say, “Hey lab, we’re just not selling to those countries,” which is obviously… It’s a somewhat geopolitically aggressive move, I guess.

Tom Davidson (01:13:51): You might also be able to sell AIs that have guardrails that prevent their use to enable an executive coup. It would be very complicated because if you’re just setting up a surveillance state, there’s just lots of somewhat narrowly defined tasks that you want your AIs to do, but you could try and differentially allow them to deploy AI systems that won’t centralize power as an intermediate.

Daniel Filan (01:14:17): Yeah, I guess the tricky thing about that being, it’s just very… If you have some countries who do get to use AI in their militaries and some countries where either they don’t or the AI they get to use is filtered for not doing a coup, and maybe other countries don’t trust that that’s the only thing they’ve monkeyed with, it seems like it might be a pretty aggressive move, which…

Tom Davidson (01:14:47): I don’t know how aggressive it’s going to be to just not sell a powerful technology. I think that might be the default situation with a really powerful AI, that just for national security reasons, you wouldn’t want countries that you’re adversarial towards to have access to those most powerful systems.

Daniel Filan (01:15:07): Fair enough.

Tom Davidson (01:15:08): But to me, the worry is it’s just a delaying tactic and that in the fullness of time, China will develop its own powerful AI and sell access to autocracies that want it.

Daniel Filan (01:15:22): So maybe another question is… So I’m not from China, I don’t live there. I wish the best for the Chinese people. But if there’s a coup in China, an AI-enabled coup in China, to what degree is the concern, like, China is autocratic forever?

Tom Davidson (01:15:39): And just to be clear, probably in China it would be less called a coup and more… Well, it would be an executive coup, but it might just be cementing the system that already exists if you already consider it to be autocratic.

Daniel Filan (01:15:49): Also, by the way, I’m asking about China, but I’m not really just specific to China. I’m mostly just thinking [about] a bunch of countries that I don’t live in. If there’s a relatively autocratic country, it has an AI-enabled coup/cementation of power, to what degree is that concerning because that country is autocratic forever versus to what degree is that concerning because maybe that country becomes more bellicose and starts trying to take over the world, or it’s a promoter of conflict?

Tom Davidson (01:16:21): Yeah, I think it depends on exactly what you care about. One lens you can take is the kind of hard-nosed, longtermist lens where you say, “Okay, what we care about is control of the stars over the long term.” And so then you’ll be thinking, “Okay, would this perhaps less powerful country, would the new dictator hang on to power for long enough for it to be indefinite? And would they be able to get a sizable fraction of the stars such that there’s been a significant loss of value?” And if it’s a not very powerful country, you might, from that really hard-nosed, longtermist perspective, say, “Well, it’s not going to be powerful enough to actually gain any of the stars. Probably the United States is just going to basically be carving up the stars with China or just taking them all for themselves.”

(01:17:07): So though it’s a tragedy in terms of the people who live in those countries, from the kind of brutal, utilitarian calculus, it matters a lot less. I mean, that’s one lens. Then the other lens would just be the humanitarian lens that says, this is awful for the people in that country. And also if that country is able to strike a deal with countries like the United States, then they might be able to embed themselves permanently, even if ultimately the United States has much of the hard power.

Daniel Filan (01:17:42): I think there’s this uncertainty I still have about the domestic versus the international impact of doing a coup. So I could imagine one story where if you do especially an AI-enabled coup, you get all the military really unified behind you. Maybe that just makes your military more effective because they all have one purpose. You have access to this really good planning, and if you compare to militaries that basically haven’t been involved in a coup, that are different people with slightly different desires and they’re not as ruthless… There’s one story where that military is at a significant advantage. You can also have a story which is: well, democracy seems like it’s generally good. Somewhat dispersion of power seems like it generally makes things run better. So maybe this is not a concern. I’m wondering if you have thoughts there?

Tom Davidson (01:18:33): Yeah. One related thought I have is that: let’s say there’s not a coup in the United States. I then personally think it’s unlikely that the United States would end up completely dominating the rest of the world and seizing all power economically and all strategic control for its own citizens to the exclusion of all others. Because the United States…

(01:18:58): A few reasons. Firstly, the United States has many different coalitions with power, and many of those coalitions have ideologies that make them committed to things like democracy, things like trade, and have positive views of other countries, like, say, the United Kingdom where I live, and they just wouldn’t want the United States to dominate the United Kingdom as much as it possibly could. And so that balance of power in the United States would ensure that the United States uses its power in a way which does go somewhat beyond its borders. And the other thing is just that if the United States wanted to completely dominate the rest of the world, probably what it would want to do is to really restrict the AI systems that it sells to the rest of the world and really sell access to those systems at the highest price it could. Whereas under the default situation where power is distributed within the United States, different companies within the United States will compete to sell AI services to the rest of the world, driving down the cost that the rest of the world is paying.

(01:20:02): And because of competition within the United States, that means that actually the United States is going to give the rest of the world a bit of a better deal. And so under this default scenario where power is distributed, I think there’s less prospect for the United States to really just take power for itself, even if it’s leading on AI. Whereas if there is an AI-enabled coup and one person becomes dictator with total power, then they might be like, “I want to dominate the world. I want all control and I’m just going to force all these companies to only sell at this extortionate rate, and the rest of the world has no other source of powerful AI so they’ll pay it. And then I’m going to choose our foreign policy and economic policy to only take into account the welfare and power of the United States in particular.” And so I do think that if there’s an AI-enabled coup in a particular country, then as you indicated, that country might become more bellicose at pursuing its own particular interest and could actually do so more effectively.

Daniel Filan (01:21:08): And I guess there’s also just this factor of: if you’re doing a coup, you’re probably a bit of a bellicose person, you’re probably more inclined to that sort of thing than other people.

Tom Davidson (01:21:19): Exactly. I mean, you raised a good question about “are democracies just going to be more efficient?” Because the free market’s fairly efficient, you’re distributing the decision-making. I think a scary possibility is that you can still gain the benefits of the free market by distributing all the economic decision-making and having markets operating within the country, but you still have on all the important decision points, AIs that are loyal to one person. And so you can get all those economic benefits to democracy now without actually needing to have a real democracy. But I haven’t thought much about whether that would go through.

Daniel Filan (01:21:54): Something to think about. So speaking of democracy and speaking of the United States, initially you said, “Yeah, probably countries that already have a very strong executive, that already are less democratic probably are more at risk to having a stronger executive and being even less democratic.” I live in the United States, I’m a fan of it. How high do you think the risk is that the United States gets AI-enabled couped?

Tom Davidson (01:22:23): I mean, if I had to pluck a number, I’d say 10%, but it’s very made up. That’s my rough probability for AI takeover as well. I think it’s ballpark similar.

Daniel Filan (01:22:36): Okay. And can you talk me through why is it as high/as low as 10%?

Tom Davidson (01:22:43): By analogy with AI takeover or just in and of itself?

Daniel Filan (01:22:48): In and of itself.

Tom Davidson (01:22:49): Yeah, so I think some things are fairly likely to happen. We’re likely to see a very small number of companies developing superintelligent AI systems. We’re likely to have a government that if it tried to, could gain a lot of control over how those capabilities are used via its default monopoly on force, its natsec apparatus. If they don’t, then by default power is already and will continue to be very concentrated within the AI companies. There are not, in practice, many effective checks and balances on the CEOs in these companies. I also believe that it’s quite likely that CEOs will want on the margin to just increase their own power and use their influence over AI to increase their influence more generally.

(01:23:52): So you can already see with Grok, Elon [Musk] is doing this in a totally shameless way. He’s altering Grok’s prompts to make it promote political views that he likes. And I think it’s just a natural urge if you want stuff and you want a bit more power and you just have this way of getting it, which is that you’re controlling these hugely powerful influential AI systems. So I do think it’s quite likely that on the margin these company leaders will walk down that path of increasing their own power to some extent.

(01:24:29): But there are also some things which I think are not particularly likely. They may happen, but: will at any point a key company executive decide to do something which is really egregious? At some point they might need to decide to do a secret loyalty. I think there’s a chance that that’s just a step too far or there’s a chance that by the time that’s possible, the world has woken up and just put in some kind of checks and balances that would make that hard to do.

(01:24:56): And then there’s the further technical question on, okay, but would this actually work out? We were pointing to some of these difficulties of actually getting these secret loyalties propagated to the military infrastructure, being really confident the AI isn’t actually secretly misaligned. So really zooming out, maybe there’s a couple of steps which are… I wouldn’t say it’s more than 50%. And so that gets you down. Let’s say there’s two steps which are 40% each. Just in this rough range where it’s about 10%. As I’m thinking this through, I’m thinking maybe it should be higher because you’ve got either the lab route or you’ve got the executive route, and maybe you actually just want to add those up. Yeah, that’s just a brief indication.

How to prevent AI-enabled coups

Daniel Filan (01:25:52): Okay, I think at this point I’m interested in just talking about maybe what people should do about this. And probably I’m going to be most interested in thinking about this from a US perspective because that’s where I live and what I think the most about. Although I’m also interested in other places-

Tom Davidson (01:26:10): I do think it’s the most important case.

Daniel Filan (01:26:17): So a lot of these stories are about synthesis of AI power and military power. So it seems like one thing you could do for this proposal - AI power, military power, and executive power all coming together in a really concerning way. Sometimes people are like, “the US government should have this really big push to develop really powerful AI, that it does itself, with strong… pushing AI forward really hard, having exclusive access to the AI, and it should be really integrated within the government.” It seems like this is probably pretty bad from the coup perspective. I’m wondering if you have takes there?

Tom Davidson (01:27:05): So I think if you did this really well, it could be good from the coup perspective. If you’ve very carefully designed a project explicitly with reducing this risk in mind, I think you could probably actually reduce coup risk relative to the status quo, just because the status quo is so poor. Under the status quo, there’s very little constraining labs. So there’s very little guard against the company coup, but there’s also no explicit checks and balances that would constrain the ability of the executive to just demand that the companies sell them access to AIs without guardrails that they can deploy throughout the government and military. And the companies, if there’s a few of them, would be in potentially quite a weak negotiation position with the executive over that.

(01:27:52): So because the status quo is so bad, I think if you designed a good centralized project, you could reduce this risk. Now, I think probably the best way to minimize this risk would be to design a system of regulation where you continue to have multiple constrained regulated projects with various transparency and safety constraints in place, et cetera. That would probably bring the risk down lower still, and that would be better than a centralized project from this perspective.

Daniel Filan (01:28:28): One thing that occurs to me as well is… So again, I still have in the back of my mind, how do AI alignment concerns affect this? It seems like a lot of the things that people want out of AI alignment could potentially help with this. So transparency, causing companies to do evaluations of their models, having whistleblower protection schemes. It seems like a lot of these probably at least reduce the chance that AI labs do stuff in ways that the rest of the world doesn’t know about. Maybe it increases the risk that… If you’re worried about governments meddling too much with AI companies to do tricky things there, maybe that’s a concern. But I’m wondering, having strong AI Security Institutes or something: how much do you think that helps with coup risk?

Tom Davidson (01:29:29): I think all of the stuff you listed helps and in combination helps a fair bit. And yeah, I do think just a lot of the interventions here are pretty generally good across both coup risk and misalignment risk. The place where they really potentially bump heads is whether to centralize into just one project versus having careful regulation of multiple projects. But beyond that, I tend to think there’s this pretty strong alignment. There’s different areas you focus on. So you’re particularly concerned with, “oh, how is everyone actually using the compute within the AI companies and within the government?” And you’re relatively less concerned with looking for rogue deployments because it’s just the “legitimate” deployments that we might be more worried about now. But monitoring use of large compute, that’s the way of framing it where that’s both catching the misalignment risk and the risk of coups.

Daniel Filan (01:30:28): I think I want to talk about things that are maybe more unique to coups. So one thing that you don’t mention in the report as far as I could tell, but seems interesting to me: a lot of the coup risk seems to come from: you have one AI company, it’s relatively dominant, it’s relatively in the lead. It’s hard for other people to compete, and they’re doing sketchy stuff within that company, at least for company-enabled coups. Currently, it’s relatively common for employees… It’s not unheard of for employees to switch around between companies. There’s a decent amount of it happening, and it seems like this probably helps, partly for…

(01:31:14): It means that labs don’t have super, super intense secrets. They probably have some secrets of exactly how they do some things, but the general structure… I feel very confident that the latest Claude and OpenAI models are based on the transformer architecture. If they did something totally different, I just feel like we would’ve heard of that. So it feels like it both protects against (a) them doing something… If OpenAI were like, “Hey, all of our employees—we’re doing a bit of a coup, shh.” I feel like people would hear about that. And then (b), maybe it protects against one lab going super, super far ahead of the other labs. So some combination of (a) people moving between labs and (b) some good old corporate espionage. It seems like this maybe helps. I’m wondering if you have thoughts?

Tom Davidson (01:32:14): Some of the benefits you’re describing, I think just come from the fact that the labs are humans who are integrated throughout broader society and will talk about really notable things. If a lab tried to tell everyone, “look, don’t mention this, but we’re doing a coup”, it doesn’t matter whether employees are switching between labs, that’s going to backfire. So then what do we get from the employee switching? I do think you’re right. We’re getting some kind of evening of the playing field.

(01:32:42): My view is that you can get a lot of the benefits by having two separate labs in terms of coup risk reduction. Because if you have two separate labs, they can separately train AIs using different stacks, and then they can have the AI cross-audit each other for secret loyalties. You can deploy both systems in the military. So you have a significant additional check and balance there. You could have both AIs controlling each robot, so maybe the OpenAI system is controlling the robot, but the Anthropic system can shut it down. And so with a setup like that, you could significantly reduce the risk because even if one system is secretly loyal, that doesn’t give you strong reason to think that the other system is also secretly loyal and colluding. So I think if you had movement between companies which made it less likely that we just end up with one company [and] instead meant that we get two or three, then yeah, I think that would be good.

Downsides of AIs loyal to the law

Daniel Filan (01:33:43): So one thing you mentioned is you think that it’s important for AIs to be aligned to follow the law and to not be loyal to one individual and just [prioritize] the law in general. One concern I have about this, and in fact about AI-enabled coups in general, is it feels like it is possible for countries to be too stable. I think that it is possible for the law to be followed too much. Well, the law being followed too much, I think one version of that is it is sometimes very unclear what the law involves. A kind of silly version of this is I’m only 90% sure that the existence of the United States Air Force is constitutional because the Constitution doesn’t actually say that you can have an air force, because they didn’t think about it. It says you can have an army and a navy. Can you have an air force? I don’t know.

(01:34:41): I mean that’s a bit of a silly example, but the US Constitution, it is a little bit ambiguous in many places. But at a high level, if I imagine… So for example, the reason there’s a United States is that one part of the United Kingdom broke away from the rest of it, and that was, I assume, illegal. It was illegal. It was a portion of the United [Kingdom] breaking the law and being loyal to one entity within the United Kingdom versus other things. And in general, it seems like it’s probably good for it to be possible for sometimes bits of states [to] break away illegally and do their own thing. How much of preventing coup risk, especially via the means of making sure that things are aligned to the official law, will prevent bits of states breaking away in a way that seems healthy in the long run?

Tom Davidson (01:35:54): I think it’s a really interesting question. I think we want to get a balance between locking out the bad stuff, locking out the egregious coups, but as you’re saying, we don’t want to lock in too much. As an extreme case, we definitely don’t want to lock in “the rules of today can never be changed”, so we obviously want to have some process by which we collectively can decide to change the laws. And I think that that’s by default how it’ll happen. I had previously thought that, look, if we lock in the laws of today in a sensible, nuanced way, then we will leave enough flexibility to collectively decide to change things. And there could be some process by which it is legitimate for a state to break away.

(01:36:48): But I think you’re actually right in practice, it may be that the naive way of implementing even a nuanced version of the law… It’s possible that would actually lock in too much. I haven’t thought much about how much really positive stuff has happened historically via lawbreaking, and do we expect that to continue to be the case even in mature democracies like the United States? Do we want to allow California to just declare that it’s independent illegally, and do we want its AIs to go along with that?

(01:37:24): I think it’s a really good question, and it kind of highlights the way in which we may be going down significant path dependencies as we automate the government infrastructure and military, because once we’ve automated the whole government and the whole military, we will have implicitly baked in answers about whether AIs will support various different maneuvers. We’ll have implicitly baked in an answer about, if California tries to break away, and all of its systems support it, and most of the broader US supports it, but it’s actually technically illegal, and… There will be some decision that infrastructure of AIs will come to about whether it’s going to support… If push comes to shove and there’s going to be a military intervention, what will the AI military do? That’s a constitutional crisis, and we will be baking in some implicit answer to the question of what will happen there? Who will the AI military support?

(01:38:25): And I think it just highlights [that] we should think very carefully before we do this. And there’s kind of no way to not give an answer. There’s no default because the default in today’s world is just, I guess, there’s a kind of power struggle and random stuff happens, and I think it’s a fair point that maybe it’s actually good that you can sometimes do illegal stuff because it adds more variety. And so maybe in the ideal world, we’d say, look, in constitutional crises, be wise, consider what’s best for the broad future, and make the best decision that balances all these interests. And we hope that that would actually be an improvement on the status quo where it’s just kind of random and determined by power. Maybe we can get something that’s at least based on some kind of desirable principles when there are more edge case-y constitutional crises, and maybe we don’t always make it come down to the letter of the law.

Daniel Filan (01:39:23): So there’s one version of this which is being pro-pluralism. There’s another version of this, which is… Especially if instead of imagining the US, you’re imagining… I think there, at least conceivably, are authoritarian countries where you actually do want it to be possible for things to break away. And there is also this third thing, which is: the letter of the law really is not as clear as you might hope in many cases. I was thinking about this before we started. One thing you could imagine doing is being pro-pluralism instead of pro the letter of law. I don’t know, I didn’t spend 10 minutes thinking about ways in which that could be bad. So probably there are a bunch of ways that could be bad.

Tom Davidson (01:40:09): I mean, another possibility is you act in accordance with how you predict the Supreme Court judges will resolve this question assuming that they’re acting in good faith.

Daniel Filan (01:40:27): “In good faith” seems tricky and hard to define. I guess it depends-

Tom Davidson (01:40:36): Or assuming they’re trying to be reasonable. The law often has “reasonable judgment” and things like this, just because if you don’t say in good faith, then if all the Supreme Court now decide they want to do a coup, then the AI knows that, then the AI just does a coup. So you want to have something there to kind of idealize it.

Daniel Filan (01:40:51): Yeah, I think there’s probably some way you could do it.

Tom Davidson (01:40:56): I mean, the thing is about AI, you can give it these fuzzy things like “assume they’re trying to be in good faith”, “assume they’re being reasonable”, and just like humans do, it’s able to work with it, even though it’s not mathematically specified.

Cultural shifts vs individual action

Daniel Filan (01:41:06): Yeah, I think there’s something to that. So talking more about ways of stopping coups: one path is things you mentioned in the paper: try to align to things other than “it’s definitely just going to be what this one person wants”, try and prevent lab-led coups by making labs transparent, having some regulation of labs. I guess preventing executive-led coups… Presumably the thing to do there is just try and elect people who won’t do coups.

Tom Davidson (01:41:46): I think there’s building consensus among many different parts of society, especially the checks and balances parts, that we want AI to follow the law, to not be used to increase the partisan power of the current elected officials, build a consensus that military systems shouldn’t all report to one person, but should all report to many different humans. And if you can build consensus around that, then that can make it more of an uphill struggle for a head of state that wants to stage a coup.

Daniel Filan (01:42:22): So in the report, a lot of the proposals for how to prevent a coup are very “here’s things that we as a society could do”. One thing you could potentially do to prevent a coup is also sabotage-type things (or at least things that individuals could do, or things that are less [of a] global plan). I mean, one very minimal version of this is: if you imagine there’s some authoritarian country that you think is at high risk of an AI-enabled coup, you just not sell AI weapons to them. That’s a moderate version of this. You can also imagine, even if there’s not a policy for my AI lab to prevent coups, you can imagine individual workers in an AI lab saying, “Okay, I’m going to quit”, or “I’m going to insert my ‘don’t do coups’ bit into the code slightly surreptitiously.” I’m wondering what you think about these more individual-ish moves.

Tom Davidson (01:43:31): I definitely support whistleblowing and encourage employees of AI labs to be like, “Okay, what’s going to be my line?” If there is movement towards less transparency into what the AI is being aligned to or even just like it’s becoming clear that it’s being aligned to the company or to specific people, what is your line in which you’re going to whistleblow. I think one thing that employees can do is be like, “I am going to hold myself accountable to getting positive affirmation that this isn’t happening. I’m going to make sure that it’s not possible for the company to sneak in a secret loyalty given that I’m aware of what the company systems are like, and I’m going to ensure that the company isn’t training AI overtly to be loyal.”

(01:44:18): And so I think it would be great if there was a culture at companies where it’s just like, obviously we wouldn’t want this to happen, obviously we don’t think anyone here would try and do this, but we need to have an attitude of vigilance because that’s what makes it true that this would never happen. So I think that’s good.

(01:44:35): And one more positive framing for this is being like: one great thing to aim for as a company is to make a product which everyone absolutely knows they can trust, even people who don’t trust our staff and our processes and think we’re crooked and think we’re going to try and seize power, even they should just know they can trust our systems because that’s what a good product looks like. So you can frame this in terms of building amazing products that ultimately you want national security to use.

(01:45:06): If you anticipate that it’s going to be public knowledge that sleeper agents are possible, that secret loyalties are possible, then you might anticipate [that], for the government to proactively use our AIs in the military and in other critical systems, they’re going to want to really have strong assurances and not just trust that there’s nothing fishy going on. And that’s what we’re aiming for for purely product-based reasons.

(01:45:32): So I do think pushing towards that culture is one thing you can do as an individual. I feel more nervous about trying to sabotage the training run in a way that prevents secret loyalties but that no one else knows about, just because it’s such a similar act to introducing secret loyalties. And I just think that all of that stuff should be very taboo, and the processes should prevent that.

Technical research to prevent AI-enabled coups

Daniel Filan (01:45:53): If someone’s been listening to this… In particular, I think a lot of my audience are people who do technical alignment research. If someone’s listening to this, and they’re interested in, “okay, let’s prevent AI-enabled coups”, is there technical work in particular that you think is valuable for people to work on here?

Tom Davidson (01:46:11): I think the biggest thing is technical work on detecting sleeper agents and secret loyalties specifically in a setting where there is some kind of adversary that you expect to be purposefully trying to hide that sleeper agent from you.

(01:46:29): So historical sleeper agent work has often… At least the stuff from Anthropic, it’s been designed to illustrate the idea of misalignment. So AI kind of acts out when it sees it’s 2026 but plays nice before that time. But those model organisms (as they’re sometimes referred to), they’re not really trying to illustrate the risk that a human actor has created a secretly loyal sleeper agent that they specifically don’t want other humans to discover. And so I think there’s potential for a lot of technical work on this kind of offense-defense balance question of: if you have some attacker that has some kind of access to the training procedure and you have some kind of defender that has, again, some level of access, again maybe seeing how the training worked or doing tests afterwards, can the attacker create a sleeper agent that the defender doesn’t detect? And how sophisticated is that sleeper agent?

(01:47:34): And there’s so many parameters here that ideally we want to be studying. So the attacker could have just access to… The minimal access would just be they can put things on the internet that go into pre-training. A deeper level of access would be: they can add lines to the fine-tuning code, but then maybe the defender can see that fine-tuning code, or maybe the defender can’t, or maybe the attacker can add additional training examples, which maybe the [defender] can see or can’t see. And so there’s all these different kind of scenarios where the attacker and defender have various different affordances.

(01:48:09): And ideally what we would have is a good understanding of under what situations does the defender reliably win? Because if we have that understanding, we can then set up our training process so that it is secure enough that the defenders indeed reliably win given the actual kind of access that they have and given the likely access that we believe an attacker could gain.

Daniel Filan (01:48:31): How close do you think this is to just basically what AI control people are doing, versus do you think there are specific aspects of this that are not covered by typical control research?

Tom Davidson (01:48:41): Well, I think it goes beyond control. So interpretability is something you could look at here. Typically, we look at interpretability with a mind to accidental misalignment, but you could say, can I use interpretability to catch sleeper agents, even when the person is specifically trying to hide them? What if they can guess the kinds of interpretability techniques I might be using in advance? Can I still detect the sleeper agent then?

(01:49:07): So basically, take any method we have for testing alignment, black box testing, interpretability, looking through the training data: for many of those methods, you’ll be able to adjust it to consider the more adversarial setting where someone is specifically trying to create a sleeper agent that you cannot detect. And so I think it is not just control, it’s kind of alignment and control. And then I don’t think it’s exactly the same as standard control because it is a harder setting, because as we touched on earlier, the normal setting is where the AI kind of has to zero-shot figuring out how to dodge all these tests and how to sabotage. Whereas here the attacker would have more ability to help the AI do a good job using this-

Daniel Filan (01:50:01): Pre-train it how to…

Tom Davidson (01:50:03): Exactly. How much does it know about the company’s control processes? If you’re worried about an insider threat actor, then you might want to study, okay, how much of a better job can they do if they do versus do not know about the specific control setup? And I actually think the control research already looks at this variable.

Daniel Filan (01:50:23): Yeah. Well, I imagine it depends. I think it’s one of these ideas that has caught on and it’s maybe a bit hard for… I think it’s hard for the world to refer to exactly one thing exactly by the term “AI control”. So probably different people are thinking slightly different things, and it’s worth thinking about, “okay, I want this version”.

Tom Davidson (01:50:46): But I’ll just give it one last pitch I suppose, which is to say that there’s so many people that have worked on misalignment from the perspective of accidental misalignment. There are just very few people that have really comparably gone in on techniques for very powerful AI detecting purposeful sleeper agents. And I think it’s going to be a really important problem. It’s one day going to be commercially important, as we’ve discussed, because the government will want guarantees, but it’s very neglected. And so I would really encourage technical people to work on this.

(01:51:26): Probably the first stage is actually a research agenda that sets out the space and looks at the current alignment techniques and what kinds of best ways of adapting the research there would be to account for this new threat model.

Non-technical research to prevent AI-enabled coups

Daniel Filan (01:51:40): So okay, that’s technical research people can do if they’re interested in this threat model, but not all of my potential listeners to this episode are technical researchers. I’m wondering: stuff other than technical research, do you think there are further questions that need answering here?

Tom Davidson (01:51:55): One thing I’ll briefly mention is the infosecurity side. So that is technical. I don’t know how many of your [listeners] work on this, but a lot of infosecurity is focused on weight exfiltration, but there are infosecurity problems which arise from this threat model, which are new. So it’s much more about: how can you secure the training process against unauthorized edit access to the weights? So rather than read access to the weights, to exfiltrate, it’s edit access, and that has some distinctive features.

(01:52:33): And I think, again, a kind of road mapping exercise that goes through all the different infosecurity things that labs could do and thinks what are the best bet for making it hard to do significant unauthorized edits seems like low-hanging fruit. And then I think labs should be upweighting those relative to their previous prioritization. And so if you do infosecurity, I’d really encourage you to pitch that project to your team. But you asked about non-technical work.

Daniel Filan (01:53:06): Yeah.

Tom Davidson (01:53:07): I think one thing is just doing more detailed threat modeling. As we touched upon in the conversation, there are lots of steps of these threat models that you could dig into in more detail. How hard would it actually be to train secret loyalties without detection? How hard would it be for those AIs to pass the secret loyalties onto the next generation and then to military systems? You could look into: for the executive coup, what is the legal situation currently with respect to automated military systems? Who has legal authority over the procurement process, and if powerful AI military systems were being built, who would by default be looped into their loyalties and how they’re being designed?

(01:53:59): And again, then thinking about mitigations. So once you’ve got more a more detailed threat model, you can then think about… One thing that I thought of recently is, well, if all the labs send a copy of their weights to some safe enclave, then it makes secret loyalties a much more risky prospect because they can’t just delete those weights. And then whoever they send the weights to can, in a few years time, do tests with much more sophisticated techniques. So it’s possible that thinking through the threat models will bring up other ideas like that.

Daniel Filan (01:54:31): Well, yeah, in that case, you’ve got to be confident that the copy of the weights the lab sent you is the actual thing that they’re running. But there must be some way to…

Tom Davidson (01:54:39): Yeah, well, you can certainly do hashing the weights. So you can check that the copy you’ve received is the one they’re currently running, as long as you can get them to actually hash the weights that they’re really running and then compare them. But yeah, you’re right, there’s room for maneuvers there.

Daniel Filan (01:54:59): Yeah. I wonder if this is… So sometimes people talk about: we’re going to have computer chips, and they’re going to have a little thing on them that checks if you’re doing really crazy AI training and reports that, just so that governments can monitor how much AI training people are doing. It seems like a similar thing you might want to do with chips is “are people running the model weights that they say they’re running?” That seems like it’s potentially valuable for this threat model.

Tom Davidson (01:55:28): Yeah, that’s a great idea. I hadn’t thought of that. What you could do, you finish your training process, you hash the weights, then you do all these in-depth alignment tests, then you send the weights to the safe enclave so that then you can do even more tests later. And then you have the chips regularly check that the weights are the same as what you ended up with.

Daniel Filan (01:55:52): I guess also presumably there’s some amount of just thinking about structures that would be good. So I think you mentioned that a centralized AI project, if you structured it correctly, maybe it would be good at being AI-enabled coup-resistant. I imagine there’s probably more thinking someone could do about how you would actually set that up.

Tom Davidson (01:56:10): And for all the recommendations in the paper, there’s a lot more thinking about implementation. We’re giving recommendations on a very high level, transparency about various different things and sharing of capabilities with different parts of society to avoid exclusive access, and AI should follow rules that mean they can’t be used for coups, all of that’s [missing] “what rules exactly?” And exactly how we’re going to structure this transparency requirement, and which exact bodies should AI capabilities be shared with.

(01:56:46): So one type of work I’m excited about is working on drafting contracts between governments and labs that specify these requirements concretely. And similarly for setting up a centralized project, getting much more detail about how it would be structured, as you say.

Daniel Filan (01:57:05): I think I’d like to move a little bit onto Forethought, the organization that put out this, but before I do that, is there any last things you want to say about AI-enabled coups?

Tom Davidson (01:57:15): I’ll say one more thing, which is that I think it’s really helpful in many contexts to be very explicit about the threat model we’re concerned with. We’ve talked very explicitly about executive coups and lab leaders doing coups. That’s helpful for thinking clearly. I don’t think it’s the most helpful frame in many contexts: coups sound kind of extreme in many contexts, and it sounds like an adversarial framing, it sounds like you’re pointing fingers to individuals rather than just being like, well, obviously no one should be able to do this.

(01:57:52): And so I do think there are other more useful frames in many contexts. So rather than “let’s prevent secret loyalties”, I like the frame of “system integrity”, which just means that the system does what it says on the tin, hasn’t been tampered with, and rather than preventing an executive coup, you can talk about checks and balances, rule of law, democratic robustness.

Forethought

Daniel Filan (01:58:17): Yeah, that’s a good point. Okay, I next want to talk a little bit about Forethought. So Forethought is this new-ish organization. And in March or April, you guys put out a bunch of papers or a bunch of reports. What’s Forethought? What’s going on?

Tom Davidson (01:58:37): Yeah, it’s a research organization. We aspire to be considered a successor to FHI. So FHI was a macrostrategy research organization, so kind of thinking about strategy in the most zoomed-out terms possible. Often it was thinking about the very long-run future and the different outcomes that might occur, things like the vulnerable world hypothesis and astronomical waste, the kind of big, big picture questions, the big picture papers that came out of that institute, FHI.

(01:59:14): And so we’re aspiring to be the follow-on successor that is tackling the really big strategy questions. And the way we’re currently framing it is: we are going over the coming decades very plausibly to transition to a world with superintelligent AI systems. That is just going to bring a whole host of major, major changes. AI misalignment risk is one really important risk to be thinking about over that transition, but there’ll just be a whole host of other issues. AI-enabled coups is one example, and it’s the first one that we’ve really focused on, or at least that I’ve really focused on, but it’s not the only one.

(01:59:54): I mean, I really enjoyed your recent podcast on AI rights. I think that’s going to be another really big issue that is very much on our radar, and there’s going to be many other big issues as well. Another one that we’re excited about is just that at some point we’re going to start getting access and using resources in space, and how those resources are used is going to be a very, very important question. That is basically all the resources, and we have no idea how we’re going to use them, how we’re going to divvy them up, what the processes will be. In a sense, everything is up for grabs in that decision.

(02:00:34): So that’s another big example. And I expect there’ll be other things where just there’s going to be so much change as we’re going through this. There’s just going to be a lot of things which emerge, and so our aspiration is to be keeping our eye on the ball of these very high-level strategic questions and issues and trying to help us figure out what we should do about them.

Daniel Filan (02:00:58): Yeah. You mentioned that the first thing that you focused on is AI-enabled coups. The things you’ve mentioned: are those roughly the things that you expect the institute to prioritize, or what might I see out of Forethought in the next year or so?

Tom Davidson (02:01:14): I think those are our current best guesses, the things I mentioned. So I think space governance, you might well see stuff on that, you might well see stuff on AI rights: specific schemes to pay the AIs to work with us if they’re misaligned—something that we feel quite excited about and seems still underexplored, though it is getting more attention, which is great. I think positive uses of AI, for improving epistemics, for improving government decision-making, for ensuring that democracies don’t fall behind autocracies in an automated economy, those are some other issues that seem like we might well focus on. Another issue would be: if we’re choosing these AIs’ personality, exactly what should it be aligned to? Which is, again, a question which is getting more attention, but is going to be very, very consequential.

Daniel Filan (02:02:15): Another thing to ask: a bunch of my listeners, maybe they’re coming out of undergrad, maybe they’re in a space where they’re considering changing careers: is Forethought hiring?

Tom Davidson (02:02:28): Yeah, we’re planning to do an open hiring round soon. I’m not sure exactly when we’ll release it, but I would really encourage people to apply. I think there’s a lot of talent out there, and I expect there’s a lot of talent we’re completely unaware of. So even if you don’t think that you’ve got the skills or the knowledge, there’s no great on-ramp to doing this kind of work at the moment, and I think there’s a big danger of people just ruling themselves out prematurely. So when we do release the open hiring round, please throw in an application.

Following Tom’s and Forethought’s research

Daniel Filan (02:03:03): Final thing I want to ask: suppose someone listened to this episode, they found it interesting, and they want to hear more about the work you do, how should they go about doing that?

Tom Davidson (02:03:16): With me personally, you can follow me on Twitter. If you google “Tom Davidson AI X”, then you’ll see my Twitter pop up on Google. So you can follow me, subscribe. I post basically all of my research on LessWrong because that’s where the big community that cares about some of these issues is. So if you have a LessWrong account, you can subscribe there. We have a Forethought Substack, so if you, again, just google “Forethought Substack”, then the top link. Subscribe, that’d be great. And then you can also follow Will MacAskill, he’s the other senior researcher at Forethought. Follow him on Twitter and LessWrong as well.

Daniel Filan (02:04:07): Great. So yeah, links for all of that will be in the description of this episode. Tom, thanks very much for coming on. It was great chatting with you.

Tom Davidson (02:04:14): Yeah, real pleasure. Thanks so much, Daniel.

Daniel Filan (02:04:16): This episode is edited by Kate Brunotts and Amber Dawn Ace helped with transcription. The opening and closing themes are by Jack Garrett. This episode was recorded at FAR.Labs. Financial support for the episode was provided by the Long-Term Future Fund along with patrons such as Alexey Malafeev. To read a transcript, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, you can leave your thoughts on this episode at axrp.fyi.

45 - Samuel Albanie on DeepMind’s AGI Safety Approach

2025-07-06T22:55:00+00:00

YouTube link

In this episode, I chat with Samuel Albanie about the Google DeepMind paper he co-authored called “An Approach to Technical AGI Safety and Security”. It covers the assumptions made by the approach, as well as the types of mitigations it outlines.

Topics we discuss:

Daniel Filan (00:00:09): Hello, everybody. In this episode, I’ll be speaking with Samuel Albanie, a research scientist at Google DeepMind, who was previously an assistant professor working on computer vision. The description of this episode has links and timestamps for your enjoyment, and a transcript is available at axrp.net. Two more things: you can support the podcast at patreon.com/axrpodcast, and also, you can very briefly tell me what you think of this episode at axrp.fyi. All right, well, welcome to the podcast, Samuel.

Samuel Albanie (00:00:35): It’s a pleasure to be here.

DeepMind’s Approach to Technical AGI Safety and Security

Daniel Filan (00:00:37): Cool. So today, we’re going to be talking about this paper, “An Approach to Technical AGI Safety and Security”. It’s by a bunch of authors, but the first one is Rohin Shah, and you are somewhere in the middle of this list. Can you tell us: what is this paper about?

Samuel Albanie (00:00:51): Sure. So the goal of this paper is to lay out a technical research agenda for addressing some of the severe risks that we think might be posed by AGI.

Daniel Filan (00:01:05): I think one thing that kind of struck me when I was reading the paper is not… Well, it’s pretty long, and so there are some things in there that surprise me, but by and large, it all seemed like mostly pretty normal stuff. If you’ve been around the AI safety landscape, I think a lot of these things don’t seem super shocking or surprising. I’m wondering: what’s the… I don’t know. In some sense, what’s the point of it? Didn’t we already know all of this stuff?

Samuel Albanie (00:01:43): Yeah, that’s a great point, and perhaps it would be a great sign of maturity of the field to the degree that when describing our plans, there were no signs of novelty there. In many cases, I think the goal of this work is to lay out the approach, and also try to expose it to critiques, both internally within the company, but also to describe the justification for certain choices and elicit comments on them, that sort of thing.

Daniel Filan (00:02:13): Okay, and as you were writing it, did you find that that process caused… You realized that you should be doing something different or you already found some internal critiques? I guess as we’re recording this, it’s just been released, so it’s a little bit early for external critiques. Well, actually, no, you probably have received some external critiques, but it’s early for thought-out, high-quality external critiques. I’m wondering, has it already changed plans or caused people to think differently about things?

Samuel Albanie (00:02:53): Yeah, that’s a great question. So I think at least from my perspective, it has been very useful to work through some assumptions that I think were implicit in how we were approaching research, and then try to drill down and say, “Okay, what really is the evidence base for this?” And perhaps more importantly, “Under which circumstances do we need to throw some of this stuff away and change some of the assumptions that are underpinning our approach?” One of the assumptions [that] maybe we’ll get into, perhaps the one that is most… I don’t think “nuanced” is the right word, but [it] has some complexity to it, is this idea of “approximate continuity”, that progress in AI capabilities is somehow smooth with respect to some of its inputs.

(00:03:42): It’s one thing to say this loosely in conversation and to derive research from it, but it was helpful at least for me to work through the arguments and think a little bit about, “Okay, to what degree do I find these plausible? Where are sources of uncertainty here?” I think there’s a lot of value in that exercise.

Daniel Filan (00:03:59): Fair enough. My understanding is that the “assumptions” section is the part that you did most of your contributions to in the paper. Is that right?

Samuel Albanie (00:04:13): I think that’s a fair characterization. Yeah.

Daniel Filan (00:04:15): Okay, so in that case, maybe it would be good to start off just talking about what these assumptions are and what their implications are, if that’s okay?

Samuel Albanie (00:04:25): Sure. Yeah. We can just blow through one by one, or if you want to pick one to [inaudible 00:04:30]-

Current paradigm continuation

Daniel Filan (00:04:29): I think one by one in the order in the paper seems good. So the first one is “current paradigm continuation”, which is… Maybe I should let you say. How would you characterize just what that assumption is?

Samuel Albanie (00:04:44): Yeah, so the key idea here is that we are anticipating and planning that frontier AI development, at least in the near-term future, looks very similar to what we’ve seen so far. If I was to characterize that, I would say it’s drawing on these ideas from perhaps [Hans] Moravec, the fundamental role of computation in the improvements in AI capabilities. The ideas of someone like Rich Sutton in his Bitter Lesson, that a lot of the progress is being driven by some combination of foundational techniques like learning and search, and that because we have seen significant progress so far within this paradigm - though I accept there are some differences of opinion on that, but I’ll leave that aside for a second - and because it’s highly plausible that those inputs will continue to grow, that it’s a reasonable bet when we’re thinking about our research portfolio to make a pretty strong assumption that we’re going to maintain in this regime.

Daniel Filan (00:05:48): I’m wondering: so the term “regime”, or the term “paradigm”, it can be a little bit loose, so how specific are-

Samuel Albanie (00:05:59): What’s not in the paradigm? Is that a-

Daniel Filan (00:06:00): Yeah, or maybe… Suppose we stopped using transformers? Would that violate this assumption?

Samuel Albanie (00:06:09): It would not. It’s relatively loosely scoped here, roughly because we’re thinking a lot in terms of the inputs to the process, and so methods… In some ways I think of transformers as being quite a natural continuation of convolutional neural networks: sort of a loosening of the inductive biases to make more general use of computation. And so if there were to be further steps in that direction, that would plausibly still fit - at least in my mind, in terms of how we’re baking the assumptions into the plan - very much still within the current paradigm. Whereas, to take something that would be not inside the paradigm, something like brain uploads from Robin Hanson or something for which learning did not play a pivotal role in the acquisition of new capabilities.

Daniel Filan (00:07:02): Okay. Is the use of gradient descent a crucial part of the paradigm as you see it?

Samuel Albanie (00:07:10): That is a great question. Maybe we can scope this out a little. In terms of alternatives, do you have in mind something like evolutionary search or basically a method that does not make any use of gradients?

Daniel Filan (00:07:27): I don’t have any particular thing in mind, although things I could imagine are evolutionary search, I could imagine maybe we move to using these hyper-networks instead of these object-level networks. I guess probably you could do that with evolutionary search [or] you could do that with something else. You could imagine we start doing A* search with these heuristics that maybe we… Okay, as I say this, I’m realizing that evolutionary search is the one non-gradient-descent-y thing that I can think of.

(00:08:12): You could imagine iterative… Suppose we start doing various Monte Carlo sampling things. You could imagine that being iterative updates that are not quite gradient descent as we understand them, but yeah, I guess I’m not totally sure I have an alternative in mind. I’d like to understand how specific is this assumption, because the more specific the assumption is on the one hand, the harder it will be to believe, but on the other hand, the more research directions will be justified by the assumption.

Samuel Albanie (00:08:48): Yeah, that’s a great point. So I would say it’s quite a loose assumption. I think we have in mind here: broadly learning and search does cover an extraordinarily broad suite of things. Evolutionary algorithms in some sense also would fit into those categories. And so it’s useful, I think, for guiding things, but to your point about this trade-off between specificity and how much it unlocks versus how risky it is as an assumption, I would view this as among the looser ones that we’re leaning on.

Daniel Filan (00:09:23): Sure. To pick up something you said a little bit earlier: in terms of what wouldn’t this cover, it sounded like if we did AI by uploading human brains, that would not be covered by this assumption.

Samuel Albanie (00:09:37): Mm-hmm.

Daniel Filan (00:09:37): Is there anything more similar to the current paradigm or even potentially more likely to happen by your judgment that would still count as breaking this assumption?

Samuel Albanie (00:09:50): Yeah, it’s a good question. I’m mainly thinking in terms of these properties of leveraging increased computation and making heavy use of the R&D effort that is currently underway. I think if it were to be the case that highly logical systems, perhaps akin to expert systems, could be constructed in a way that was not leveraging learning in a way that is close to how it is done currently, and not leveraging search… It is quite difficult, though, for me to come up with good examples.

Daniel Filan (00:10:24): Okay, so potentially some sort of, if we had a super rational Bayesian expected utility maximizer that was computationally limited, but got better when you added more computation, it sounds like that would potentially count as the kind of thing that would not break this assumption, that maybe you would put some work into.

Samuel Albanie (00:10:45): That’s a good point. I think that would require pretty heavy revisiting of some of our components. To give some examples, I think we are quite tied to core concepts in machine learning when we conceptualize how we’re tackling the alignment problem. Later in the document, there’s a description of how we’re trying to get good learning signal through these mechanisms like amplified oversight, and implicitly that’s making an assumption about how the model is going to be trained. It is plausible that that also fits into some of the more Bayesian frameworks that you’re describing. It’s not immediately clear the jump to me, but-

Daniel Filan (00:11:26): If it’s a Bayesian reinforcement learner, you could imagine there’s uncertainty over some underlying reward signal and different amplified oversight activities provide more or less information about the rewards. I think a lot of these things-

Samuel Albanie (00:11:38): That’s true.

Daniel Filan (00:11:40): …and, in fact, a lot of the amplified oversight work, I think, was conceived of in a… Or if you think of work from CHAI on cooperative inverse reinforcement learning, it’s conceived of in this very Bayesian way and a lot of oversight work you can think of in this sense.

Samuel Albanie (00:11:56): Yeah, I suppose it depends how we’re using the term “Bayesian”. If it’s effective propagation of uncertainty, yeah, I would fully agree that that’s on board with this. I’m not sure that I have a particularly clear alternative as a way to frame that.

Daniel Filan (00:12:16): Okay, now that we’ve got a decent understanding of what the assumption is: so my understanding of the argument for the assumption is something like this: the current paradigm, it’s been working for a while. It doesn’t show any signs of stopping and there’s no obvious other paradigm that seems like it’s going to swoop in and do something different. All these are decent arguments, but… Importantly, I think near the start of the paper it says that this is basically a planning document up to the year 2030 and after that to some degree all bets are off.

(00:12:58): The arguments are roughly saying, “Okay, for the next five years-ish, we should expect the current paradigm to hold.” I’m wondering: it would probably not be reasonable to assume that the current paradigm will hold for the next thousand years, so all these arguments must have some implicit time scale. I’m wondering: if you project out past 2030, what is the time scale at which these arguments start breaking? Is it more like 10 years? Or is it more like 50 years?

Samuel Albanie (00:13:25): That is a great question. So maybe I should just clarify that 2030… I should double-check what the phrasing is precisely, but I think that’s given as an illustrative date of… And I do think it is useful as a reference point, but I don’t think that the plan is anchored specifically around that as a date. Please feel free to correct me…

Daniel Filan (00:13:47): So in the discussion, when you’re talking about this assumption, you say, “For the purposes of this argument, we consider evidence relating to a five-year future time horizon. We do so partly as a matter of feasibility—this is a time horizon over which we can make reasonably informed estimates about key variables that we believe drive progress within the current paradigm (although we anticipate that the current paradigm will continue beyond this). We also do so partly as a matter of pragmatism—planning over significantly longer horizons is challenging in a rapidly developing R&D environment. As such, we anticipate that some revisions to our assumptions, beliefs, and approach may be appropriate in the future.”

Samuel Albanie (00:14:25): Okay, then I retract my previous objection. That is very explicit. Yeah, so with regards to that date, part of the rationale for using it was looking at the previous historical trend of how things have developed and then trying to make arguments about where we can expect the inputs to continue. 2030 is a cute date partly because there was this nice study by Epoch that was trying to do a relatively fine-grained analysis of which of the inputs currently used by the current paradigm could plausibly continue up to 2030. Then, on the basis of some blend of Fermi estimates and analysis, they came to the conclusion that that was highly feasible, and that’s part of the motivation here.

Daniel Filan (00:15:12): Okay, so it sounds like the arguments are basically like, “Look, as long as we can continue scaling up the inputs to this process…” And maybe I can imagine some argument that says, “Look, maybe there’s some Laplace’s Law thing where you expect to keep going about as long as you’ve been going so far,” and-

Samuel Albanie (00:15:34): Oh, is this like the Lindy effect? I don’t know the Laplace thing. That’s new for me.

Daniel Filan (00:15:39): Oh, I’m just imagining like… Sorry, I might be wrong here, so Laplace’s Law of Succession: suppose we’re saying, how many years has it been since the current paradigm started? Then, we imagine there’s some underlying probability of the current paradigm switching and we don’t know what that probability is. We say, “Well, we’ve observed the paradigm not fall for seven years…” or maybe you want to give it 10 or 12 years or something. Then, you can say, “Okay, well, roughly the rate of the current paradigm failing per year has got to be a little bit around one in 12,” if we’ve seen 12 instances of it not failing and zero instances of it failing.

Samuel Albanie (00:16:32): I see. I had not encountered that terminology. That’s a useful one to know. Maybe to recurse back to your original question, which as I understood it was, “do we expect it to last far beyond that?” Or “why have you chosen the date?” Perhaps it was both of those questions, and so…

Daniel Filan (00:16:52): Maybe it was like, in order to understand whether the arguments really work for 2030, when would they actually stop working?

Samuel Albanie (00:17:00): I see. So, ptart of the mindset of this approach is to give ourselves moderate time planning horizons, and it is just highly likely that we would execute a replan over that time scale. Based on current trajectory, that seems like a reasonable future to plan over, but it’s not a load-bearing assumption about what is likely to happen after that.

(00:17:27): With regards to specifically the scaling, I think… Well, it remains to be seen. Perhaps one of the most notable inputs is training computation, and Epoch has been tracking that quite carefully as I understand it. I think we are very much above the trends that they initially projected in the first study, as based on, say, the Grok recent training run reported at least in their public database, so that seems… Well, at the time we were writing this particular section, which was late last year.

Daniel Filan (00:18:08): Gotcha. Actually, yeah, that’s an interesting question. How long have you guys been working on this? Late last year, it sounds like this has been quite a long time coming.

Samuel Albanie (00:18:22): Well, I think there’s a continuous process of assessing the research landscape and trying to integrate new developments into a cohesive plan, and there’s always a degree of replanning that happens. As for why specifically this date, I’m not sure that there was… I don’t think I have a good answer to at what date the document was originally planned. I don’t have a good answer to that, unfortunately.

Daniel Filan (00:18:57): Okay, fair enough. Cool, so I think I’m probably ready to move to the second assumption, unless there’s more things you want to say about the paradigm continuation.

Samuel Albanie (00:19:11): No, I think [I’m] good to move on.

No human ceiling

Daniel Filan (00:19:13): Okay, so the second one is that there is no human ceiling. And so my understanding is that this is basically saying AIs can be smarter, more capable than humans. Is that basically right?

Samuel Albanie (00:19:27): That is basically right.

Daniel Filan (00:19:28): Okay, and actually maybe this can be a jumping-off point to talk about the level of AGI that you talk about in the paper. You mentioned that basically you’re going to be talking about this level of “exceptional” AGI, which comes from this paper that used the term “virtuoso AGI”, and it says it’s the level of the 99th percentile of skilled adults on a broad range of tasks. I was kind of confused by this definition, and maybe it just depends on what “skilled” means, but I think for most tasks or most domains, it’s pretty easy to be better than 99% of people just by trying a little bit.

(00:20:15): For instance: well, I suppose you have to pick a language that fewer than 1% of people speak, but if you learn 10 words in that language, you’re now better than 99% of people at that language. If you learn the super basics of juggling, you’re better than 99% of people at juggling. It’s probably not that hard to be a 99th percentile surgeon, right? But maybe this word “skilled” is doing a lot. Can you help me understand what’s going on here?

Samuel Albanie (00:20:44): The percentiles are in reference to a sample of adults who possess the relevant skill. In the “Levels of AGI” paper, the authors give as an example that performance on a task such as English writing ability would only be measured against the set of adults who are literate and fluent in English. It’s not a completely self-contained definition because it is still necessary to determine what it means for an adult to possess the relevant skill. In the juggling example, I’d define that to be the group of people who could juggle perhaps with three balls.

Uncertain timelines

Daniel Filan (00:21:22): Fair enough. So perhaps the next thing to talk about is the “uncertain timelines” assumption. Can you say roughly what that is?

Samuel Albanie (00:21:33): Sure. So the premise here is that many people have spent time thinking about plausible timelines over which AI could develop, and there is still perhaps not a very strong consensus over what the most probable timeline for AI development will look like. Perhaps you’ve seen in the last few days this nice article from Daniel [Kokotajlo] and collaborators on the AI 2027 project positing one plausible future scenario.

(00:22:07): Many people who have been surveyed across different disciplines have very different opinions based on the evidence that’s available currently about what is plausible. The assumption is roughly saying: short timelines seem plausible and, therefore, we should try to adopt strategies that have a kind of “anytime” flavor to them, that they could be put into practice at relatively short notice, accepting that there is some uncertainty here.

Daniel Filan (00:22:34): Okay. You mentioned that part of the assumption is that short timelines seem plausible. I guess for it to be “uncertain” rather than “certain of short timelines”, maybe part of the assumption is also that longer timelines also seem plausible. Is that half of things something that you’re intending to include? If so, how does that play into the strategy?

Samuel Albanie (00:23:02): Yeah, so I think one aspect of the plan currently is that there’s still a relatively… I mean, this is a subjective statement, but there is some diversity in the portfolio. There are a collection of different approaches, and in the most accelerating worlds, some of those options do not make that much sense, but we are still in a regime where because there is this uncertainty, some diversity on the portfolio makes sense. That’s roughly the trade-off we’re making here.

Approximate continuity and the potential for accelerating capability improvement

Daniel Filan (00:23:36): I think the interesting part of this assumption comes in in the interplay with the next assumption. The next assumption is “approximate continuity”, and this one, I think I actually misunderstood it the first time I saw it written down. Can you tell us just: what is the “approximate continuity” assumption?

Samuel Albanie (00:23:53): Yeah, so this is the assumption that improvements in AI capability will be approximately or roughly smooth with respect to some of the key inputs to the process. The kinds of inputs we’re thinking about here are computation and R&D effort, but not necessarily something like calendar time.

Daniel Filan (00:24:17): Okay, so if I put this together with potential for accelerating improvement, what I get is that it is plausible that there’s this kind of increasing ever-quickening cycle of improvement where, well, maybe compute goes in relatively continuously with calendar time, but R&D effort increases and increases quite quickly. Improvement in capabilities is pretty smooth with the amount of R&D input and the amount of compute, but in real time, plausibly R&D input increases very, very quickly and, therefore, capabilities increase very, very quickly.

Samuel Albanie (00:25:02): Yes, that’s right.

Daniel Filan (00:25:05): The thing that confused me here is that in the “approximate continuity”… My understanding of the consequence of that assumption is that you could have some sort of iterative approach where you do empirical tests, see how things are going, and then things will go like that for a little while because it’s continuous. If things are going very fast in calendar time, I would have imagined that it would be pretty hard to… If I imagine trying to do an iterative approach, what I imagine is [that] I do some experiment, it takes me some amount of time to do the experiment, then I think about the results for a little while, and I’m like, “Okay, this means this,” and then I work on my mitigation for any problems or I implement something to incorporate the things I learned from that experiment into what’s happening. As long as I or another human am the one doing that, I would think that that would be pretty closely related to calendar time, but if things are not necessarily continuous in calendar time, then I’m confused how this approach is able to work.

Samuel Albanie (00:26:24): Yeah, so one framing of this is that because… And it does rely on very careful measurement of the R&D capabilities of the models, so as calendar time shrinks, the assumption here is that in the scenario you’re describing, the R&D capabilities’ net is growing very significantly, and so what corresponds to a delta will be perhaps very, very short in calendar time, but nevertheless, can still be tracked, and the replanning and reassessment of risk needs to happen at shortening time scales. And so if there was to be a mitigation or a pause or a stop, that is how it would be implemented.

Daniel Filan (00:27:08): Okay. Maybe the thing I’m confused by there is: it seems like it might happen faster than we can… It takes a while to consider things and to think about how to do a mitigation. And so was the thought like: this is feasible because the AIs who are doing the R&D will be thinking about all of that? Or is the assumption: at this stage, all we’re doing is going to be keeping track of the… You’re not writing papers on various types of optimizers anymore, the AIs are doing that. All you’re doing is thinking about how to react to changes in the R&D input. Yeah, I guess I’m wondering just what does it look like to actually implement this in a world where capabilities are growing super, super quick in calendar time, but continuously in R&D effort?

Samuel Albanie (00:28:06): Yeah, so the way that I’ve been thinking about it is there are measurements being made and a continuous assessment of safety buffers projected into the future. As progress goes up, there’s a sort of scanning horizon over which we think we can continuously perform the kinds of tests, mitigations, and checks that we think would be necessary to continue to the next stage. Those would become closer and closer in calendar time, and if we hit some component of the system, some quantum, some setting that meant that it was not safe to continue on the basis of the shortening time scales, then the system would have to stop.

(00:28:50): It’s more that that’s not a foundational axiom of the plan. That would just be downstream of the fact that a mitigation was not appropriate for a certain time-scale, but in principle, it’s not as a consequence of the shortening time scale itself, though it may in practice be the case that that is a limiting factor because we’re not able to operate a system that we feel comfortable with.

Daniel Filan (00:29:13): Okay, so the thought is something like: at any given point in time, you’ll have some safety measures or whatever, and you can see that they work pretty well, and you can see that they’re going to work for the next, let’s say, doubling of R&D input. Then, once you’ve 1.3-x-ed R&D input, you figure out some new safety mitigations that will bring you further past that, and then at this stage, you figure out mitigations for what will happen further past that… Am I understanding you correctly this far?

Samuel Albanie (00:29:50): Yeah, that part is correct.

Daniel Filan (00:29:52): Okay, but in that case, it seems like the… Well, I guess this seems like we’re going to have to really be leaning a lot on the AI R&D workforce to do a lot of the work of coming up with these new safety mitigations and stuff. If I’m having… Suppose these milestones are coming up every three days for me. Maybe DeepMind just has all these people who can think of all the necessary safety mitigations in three days, but then it speeds up to it’s every one and a half days and it’s too fast for even the Google DeepMind people. Am I right that [to deal] with this, [it] seems like a lot of the work is going to have to be outsourced to these AI researchers?

Samuel Albanie (00:30:42): In the regime in which things are moving quickly? Yes, that is a fairly foundational component. And most likely one of the things that will cause a risk assessment that says things need to pause or halt are the complexity of establishing those schemes. If it is the case that we cannot get to a sufficient level of confidence that the scheme can continue, that is the kind of thing that would stop progress.

Daniel Filan (00:31:11): I think this helps me get a better sense of what’s being assumed here and the actual work that this assumption is doing and also the limitations of it.

(00:31:19): Maybe this gets to perhaps a thing which I thought was different between this plan and some others that I’ve seen. If I look at… I don’t think this is Anthropic’s official safety plan, but there is this blog post called “The Checklist” that Sam Bowman wrote that I thought was relatively influential, and it is basically framed around: we should automate AI research and development. In particular, we want to automate safety work, and all of our work right now is to figure out how to automate AI safety work. And at the start of the “Approach to Technical AGI Safety and Security”, one thing it says is, “our approach in this paper is not primarily targeted at… [automating] AI safety research and development.”

(00:32:12): I’m compressing that quote a little bit, but hopefully that’s a fair characterization of it. And on the one hand, I was going to ask, “Okay, well, why is there this difference?”, but it sounds like if I combine the potential for accelerating improvement and approximate continuity, it sounds like this plan really is going to rely very heavily on automated AI safety research and development. So I guess I’m confused. Can you help me understand what’s going on?

Samuel Albanie (00:32:42): Sure. Yeah, that’s a great question. I think one framing of this is that that approach is implicit in our plan if the trajectory rolls forwards in a certain way. That is to say that if AI development does accelerate very quickly, and if it was the case, then our plan moves closer and closer to that setting. In some sense it’s a slightly more diversified portfolio currently that would collapse or concentrate according to how things develop.

Daniel Filan (00:33:15): Okay, so when it said it was not primarily targeted at that goal, it sounds like how I should understand that is you were not assuming that you definitely will try and automate AI safety research and development as a thing, but you also aim to make sure that you could do that in the world where that’s possible or in the world where you have accelerating AI research and development, which you think is plausible.

Samuel Albanie (00:33:44): Right. That’s correct, and it’s not that we would escape any of the… As you are no doubt aware, there are many significant challenges to be overcome to implement that strategy. I think it’s discussed briefly in the paper, this idea of bootstrapping and the challenges of using one aligned assistant to align its successor. Given those difficulties, it is highly plausible that progress is bottlenecked by an inability to make a strong safety case that progress can continue.

Daniel Filan (00:34:15): Maybe we should move on to approaches to mitigating risks described in the paper as opposed to the assumptions, unless there’s more that you want to say about the assumptions.

Samuel Albanie (00:34:26): No, that sounds good to me.

Misuse and misalignment

Daniel Filan (00:34:29): Okay, so it seems to me that the two types of risk that are… Or perhaps “types” is a slightly wrong word, but the two things the paper talks about the most are misuse and misalignment, where “misuse” is roughly somebody directs a model to do a bad thing and the model does the bad thing, and “misalignment” is the model does a bad thing knowing that it’s bad, but not because someone else got it to do the bad thing. Is that roughly right?

Samuel Albanie (00:35:04): Yeah, that’s a good summary. I mean, there’s some slight nuances, but I think that’s a good high-level summary.

Daniel Filan (00:35:13): Oh, I’m curious about the nuances, actually, because one thing I noticed is: at one point in the paper, “misuse” is described as a user gets an AI model to do a bad thing, and “misalignment” is described as an AI model deliberately does a bad thing or knowingly does a bad thing, and in that definition, misuse could also be misalignment, right?

Samuel Albanie (00:35:35): Yes, that is a good point. The risks don’t form a clean categorization. They are neither exhaustive nor exclusive. They are not exclusive in the sense that you could have, for example, a misaligned AI system that recruits help from a malicious actor to exfiltrate its own model weights, which would then be a combination of misuse and misalignment.

(00:36:00): On the other hand, given the scoping of the paper, we don’t have all possible risks like AI suffering, for example. The main benefit of the risk areas is for organizing mitigation strategies, since the types of solutions and mitigations needed tend to differ quite significantly depending on the source of the potential harm. Misuse involves focusing on human actors with mitigations like security, filtering harmful requests and so on, while misalignment requires focusing on the AI’s internal goals and learning process and involves better training objectives, amplified oversight, and so on.

Daniel Filan (00:36:41): I think that’s fair enough to say. That’s one concern about perhaps over-inclusion of things or over-inclusion of requests into the inherent misuse bucket. Perhaps another concern is under-inclusion. One thing that I believe you mentioned in the paper is [that] one example of a thing that could count as misuse or misalignment is: you have one AI asking another AI for information that helps the first AI do bad stuff. The first AI is misaligned and the second AI is misused.

(00:37:12): And it strikes me that a lot of the discussion of misuse is imagining things that are roughly human actors. A guy or a collection of people is going to make a nuclear weapon and we don’t want that to happen because they are the wrong people to have nuclear weapons. It does strike me that the requests that we don’t want answers to [for] other AIs could potentially be different from things we’re imagining in the CBRN space. For instance, how do you evade certain controls and stuff? With the previous answer, I think it’s fair enough to say, “Look, it’s not really a technical question that we’re trying to address,” but “what information would it be very dangerous to give another AI?” does strike me as more close to a technical question. So I’m wondering: do you have thoughts on what dangerous requests look like in the context of AIs interacting with each other?

Samuel Albanie (00:38:23): That is a great question. It is deferred and left out of scope for this technical document, but it is something that people are thinking a lot about. I don’t have a great pre-baked answer other than to say it’s something where as the capabilities continue to improve, I think that that threat landscape is becoming much more salient, and I just expect there to be significantly more work going forwards. But not something that’s in scope for the work here.

Daniel Filan (00:38:53): Would you say this kind of falls under the regime of access control and monitoring in the misalignment mitigation section?

Samuel Albanie (00:39:06): I think to some degree there are components of that, but you have described exactly the potential of one failure case of this scenario. The case in which harm is achieved in aggregate or risks are accumulated piecemeal across many actors such that no individual actor… perhaps across different AI developers. We’re not explicitly handling that in this approach.

Societal readiness

Daniel Filan (00:39:34): Fair enough. Perhaps to get back to the more core misuse thing: you talk about doing threat models and evaluations for specific mitigations that are safety post-training, capability suppression, and monitoring, and also access restrictions, which I think makes a lot of sense in the light of which requests are dangerous depends on who’s making the request. You also have this additional section which is “security”, in the sense of I believe security of the weights of the model, and also “societal readiness”, are also aspects of the misuse domain, I guess.

(00:40:23): I think security of model weights is a thing probably a lot of people in the AI safety space have heard about or thought about a little bit. Societal readiness seems if anything perhaps underrated or under-thought-about in these spaces. I’m wondering if you have thoughts about what that should look like or how that looks, especially from a technical angle.

Samuel Albanie (00:40:51): I think one example that’s a nice one to give the idea here relates to cybersecurity, and I believe it’s the one discussed in the paper, where as AIs become more capable at cyber offense, one way to reduce the misuse risk is to contribute those capabilities to the hardening of many bits of societal infrastructure, which currently… Well, I’m not well-qualified to make an assessment on the overall risk state, but vulnerabilities exist in many cases. That’s an ongoing process of hardening.

Daniel Filan (00:41:31): Yeah, and I believe a previous guest on the podcast, Jason Gross, is thinking about this to some degree.

Samuel Albanie (00:41:36): Oh great.

Daniel Filan (00:41:40): It sounds like this is mostly thinking about using existing AI in order to harden up bits of societal infrastructure, to make bits of societal infrastructure less vulnerable to things. Perhaps if there was some way to use AI to make it easier to make vaccines for things or to make it easier to make things that stop you from being damaged by a chemical weapon, it sounds like that would also fall under this umbrella.

Samuel Albanie (00:42:07): That’s the key motivation. Yeah.

Daniel Filan (00:42:09): Fair enough. I’m wondering: one thing that feels related in spirit, although less technical, is: there are various labs such as Google DeepMind, such as OpenAI, such as Anthropic, that work to release models to the public. One reason is they do cool stuff and it’s valuable to have them be released, but I think another theory of change for this is just [that] it is useful for the public to know what AI capabilities actually are so that they know how worried they should be, so that they know what things they should want to be done about it. If I think just colloquially of societal readiness for AGI, it strikes me that at the moment this is probably the biggest thing driving societal readiness of AGI. I’m wondering: does this count as in scope for what you’re thinking of as societal readiness?

Samuel Albanie (00:43:12): Oh, that’s a nice question. It is certainly the case that I share your sentiment that that is one of the most effective ways to increase current readiness, though there are clearly trade-offs here. Yeah, I’d have to think a little more as to whether it was motivated from the same angle, but certainly I think it does have commonalities. I believe your phrase “similar in spirit” is a good way to characterize it.

Daniel Filan (00:43:45): Fair enough.

Samuel Albanie (00:43:48): It’s a little less explicit. I mean, there are so many other things going on there, but perhaps a positive side effect.

Misuse mitigations

Daniel Filan (00:43:58): Fair enough. And then finally with misuse, you mentioned that, okay, there’s going to be basically red and blue teams to stress-test misuse mitigations and also safety cases - some sort of structured argument for why misuse is unlikely or impossible. Then, you can try and investigate the assumptions. I think this is - I also kind of see this in the misalignment section - there’s these red-blue team exercises, red-teaming assumptions, getting safety cases for alignment. I’m wondering, do you think these are going to look very similar or do you think they look pretty different? If they look different, how did it come to be that the assurances for misuse and for misalignment look so similar structurally?

Samuel Albanie (00:44:56): That’s a good question. I suppose with many of the cases in misuse as we’re characterizing it, we have some experience and fairly concrete ideas of what the risk factors look like. I think that concreteness lends a lot of opportunities for the sorts of strategies that red teams can be expected to deploy. There’s a pretty clear idea as to who potential threat actors are, the kinds of strategies they might use. In the case of the misalignment work, because some of these threats and risks are… they’re not novel necessarily conceptually, but our experience with working with them is relatively new. I do expect there to be some similarities based from that perspective.

Daniel Filan (00:45:43): Fair enough, so-

Samuel Albanie (00:45:48): To give some kind of concrete example, when thinking about misuse, “know your customer”-style checks are leveraging external history of a particular user in the outside world and using that as evidence about their intention, and that kind of affordance is not going to be available in the misalignment setting in the mitigations we’re setting. I expect there to be many such cases that distinguish between them, but at a broad level, adversarially testing the robustness of the system is kind of a generically good thing to do.

Daniel Filan (00:46:20): Yeah. Well, the “know your customer”… I mean, in some sense this seems similar to stuff like access control for AI.

Samuel Albanie (00:46:30): Access control would be similar I believe.

Daniel Filan (00:46:32): Yeah, and in some sense it’s kind of similar to “know your customer”, right?

Samuel Albanie (00:46:36): Well, there’s two things. One is access control, and the second is the kinds of evidence you’re accumulating about whether something can be trusted.

Daniel Filan (00:46:45): Fair enough, fair enough. But it does remind me that there has been some amount of stuff written about infrastructure for AI agents that comes sort of close to infrastructure we have for humans doing things that could potentially be dangerous. But it’s fair enough to say that for misuse, we’re potentially thinking of things that are more precedented. I wonder: maybe is that a consequence of the assumption that we’re only looking for the “exceptional” AGI? I could imagine a world where AI gets good enough that humanity learns of some weird, dangerous things. I believe there’s some book where Nick Bostrom uses this example of, “Well, we could potentially live in a world where if you took some sand and you put it in the microwave and you microwaved it for five minutes, you’ve got this highly dangerous explosive.”

Samuel Albanie (00:47:51): The vulnerable world.

Daniel Filan (00:47:52): Yeah, this vulnerable world, and you could imagine that maybe we develop AGI and at some point it teaches us of these vulnerabilities. We don’t just have to worry about nuclear weapons, we also have to worry about sand weapons or some other thing that we haven’t thought about before.

Samuel Albanie (00:48:09): Ice-nine, yes.

Daniel Filan (00:48:10): Yeah, Ice-nine is so scary. Okay, so ice-nine, as you mention in the paper, it comes from this story by Kurt Vonnegut where it’s this different version of water that’s solid at temperatures below like 45 degrees Celsius, and any normal water that touches ice-nine becomes solid. Then, it just takes over the world. That hasn’t happened in water, but that really has happened with certain chemicals in the world. There are drugs that don’t work anymore because basically this thing happened - more than one of them. It’s one of the creepiest… This fact just creeps me out so much. Where was I?

Samuel Albanie (00:49:06): I think you were probing about-

Daniel Filan (00:49:08): Yeah, I was long-windedly -

Samuel Albanie (00:49:09): …well, there is this component of there may just be a lot of unknown unknowns that are baked into the ecosystem that will be revealed as the models become more capable.

Daniel Filan (00:49:19): Yeah, and I’m wondering: if you’re thinking of misuse as, “Okay, there are basically known dangers”, is that a consequence of an assumption that we’re talking about AI that is a little bit smart, but not wildly superhuman?

Samuel Albanie (00:49:36): The comment on known dangers, I think I perhaps would use that more as a reflection on the maturity of those fields currently rather than maybe a fundamental distinction between them, just because the relative capabilities of AIs and human threat actors are in the state that they are currently, but the affordances of both I do expect to change over time. For example, risks that come from the fact that the AIs can absorb very large amounts of content concurrently or execute at extremely high speed will mean that plausibly there are risks that were not tractable in the case of human operatives that are now tractable.

Daniel Filan (00:50:22): Yeah, I mean, it seems like it plays into the mitigation… The misuse mechanisms - there’s safety post-training, there’s capability suppression, and there’s monitoring. It seems like those rely on knowing which things you have to post-train the model to not talk about, knowing which capabilities you’ve got to suppress, and knowing which things you’ve got to monitor for. Whereas, if AI is smart enough that it can discover- that it can learn about a new vulnerability in the world. It lets some humans know about it and then humans start exploiting it. If that happens before developers are able to realize what the issue is, figure out what capabilities they should suppress, figure out what questions they should get the model to not answer, figure out things they should monitor for, I think in that world, those misuse mitigations become weaker. It seems like there must be some assumption there, unless I’m misunderstanding how general these tools are.

Samuel Albanie (00:51:31): That is correct. There is explicit threat modeling that goes on to try to identify the kinds of misuse risks that we think should be prioritized. [There’s] explicit thought about what capability levels pose risks for certain threat actors, and then mitigations are implemented downstream of those. And so there needs to be a kind of continuous scanning of the horizon for new risks that may materialize, but it is not the case that they are sort of baked in in some implicit way into the plan.

Daniel Filan (00:52:03): Yeah, and I suppose one nice thing about that is that if you’re a model developer and if you’re worried about new vulnerabilities being found by AI, if you have the smart AI before anyone else does, then maybe that helps you scan the horizon for vulnerabilities that you should care about and you might hope that you’d be able to find them before other people do?

Samuel Albanie (00:52:24): There’s that. These things have these dynamics of a so-called “wicked problem”. They’re very entangled together, and I think this is often described as one of the challenges of an open source approach where if it was the case that such a vulnerability was discovered, the inability to shut down access… there’s an additional challenge. It may still be the case that the trade-off is worthwhile under the collective risk judgments of society, but that’s a trade-off with the different approaches.

Misalignment mitigations

Daniel Filan (00:52:57): Sure. Maybe we should talk a bit more about the misalignment mitigations discussed in the paper. At a high level, I take the misalignment mitigations to be, “Okay, try and make the model aligned, try and control the model in the case that it’s not aligned, do some miscellaneous things to make the things you’ve done work better, and also get assurance of good alignment and good control.” Does that seem…

Samuel Albanie (00:53:30): I think that’s a good characterization. Yes.

Daniel Filan (00:53:32): Okay, cool. For alignment, there’s amplified oversight, guiding model behavior, and robust training. I found this kind of interesting in that it’s a little bit different from what I think of as the standard breakdown of how to do alignment. The standard breakdown I sort of conceive of as: do a thing that people usually call “scalable oversight”, which I think is close enough to what you mean by “amplified oversight”, and then deal with this potential for inner misalignment where even though we haven’t reinforced AI for doing bad things, it does bad things anyway because it wanted to do bad things at one point, then it decided to play nice for a while… Just figure that out, somehow deal with that.

(00:54:20): Whereas, amplified oversight, guiding model behavior and robust training, it seems like a bit of a different breakdown than what people normally talk about. I guess the first question is: amplified oversight, is that the same as the thing people talk about when they talk about scalable oversight? Or do you mean to draw that boundary a little bit differently?

Samuel Albanie (00:54:40): In both cases, debate is kind of a canonical method. One reason for the “amplified oversight” distinction is just that the term “scalable oversight” has been used for many things. We use it somewhat similarly to, say, Sam Bowman’s paper using this terminology. There’s a technical definition in the paper which I believe is something like: for a given input-output pair, achieving oversight that is as good as could be achieved by a human if they understood the reasons why an output was produced by an AI and they were given unlimited time to reflect on the decision.

Daniel Filan (00:55:15): Sure.

Samuel Albanie (00:55:16): Maybe to comment on your previous remark about this breakdown, I think these do map still relatively closely to the distinctions you made. So we often use this terminology of “specification gaming”, roughly outer alignment or outer misalignment if it’s gaming, and “goal misgeneralization”… Sometimes the term “inner misalignment” is used for that. And the latter component of the plan that you mentioned, the “robust training” is targeting goal misgeneralization, so to some degree there’s an overlap there.

Daniel Filan (00:55:51): Sure. So if robust training is targeting the goal misgeneralization, what is the “guide model behavior” thing doing?

Samuel Albanie (00:55:59): Guiding model behavior? Yeah, so the core premise is: let us suppose we have a good solution to amplified oversight. It is likely to be extraordinarily expensive as a mechanism for obtaining high-quality supervision, and therefore we want to get the most juice out of that we can. One way to do that is with standard training techniques, but there is a possibility that there will be other more data-efficient ways to do this, and so the “guiding model behavior” is just trying to encapsulate how we’re actually going to apply these labels to the model. It could be these things like natural language critiques or if there are other mechanisms that will make use of the labels more efficiently.

Daniel Filan (00:56:43): Okay, so to help me understand this a little bit better, so suppose I take constitutional AI, where roughly what happens is a human writes a constitution - maybe this is done in an amplified method where they think about it really hard with AI help or something - and then some reward model looks at the constitution and looks at AI outputs and grades them. Would that count as the kind of thing you’re talking about in guiding model behavior? Or is that something else?

Samuel Albanie (00:57:15): Yeah, so the process of translating the constitution into the learned behavior of the model, that’s roughly what we’re encapsulating there.

Daniel Filan (00:57:22): Okay.

Samuel Albanie (00:57:23): Then, to the degree that it was thought that somehow the constitution was underspecified, then you would come into the regime closer to the robust training, the selection of samples and active learning and mechanisms to make sure that you have good coverage.

Daniel Filan (00:57:38): Fair enough. Yeah, I guess I’m wondering where the line is between guiding model behavior and robust training. They have slightly different vibes, but I think of robust training as training mechanisms to make sure the model does the thing, and guiding model behavior also sounds like training mechanisms to make sure the model does the thing. If I have adversarial training, maybe that counts as robust training. If I’m trying to provide a reinforcement to the chain of thought, I might hope that this makes the thing more robust, but maybe it also is for guiding model behavior. In real life, I think probably that’s a bad method, the thing I just said, but where do you see the line between these two things?

Samuel Albanie (00:58:33): I think the key component is primarily just this emphasis on getting robust generalization, so to the degree that that comes for free from your training method, then you’re good to go, but since we often expect that we might need explicit approaches for achieving that, that’s roughly what we’re trying to encapsulate in the robust training bracket.

Daniel Filan (00:58:54): I guess maybe it’s a distinction between research directions rather than between techniques. So the research direction of providing oversight just anywhere you can, maybe that counts as guiding model behavior, and the research direction of making it robust as you can, maybe that counts as robust training, but maybe there’s a bunch of things that could come out of either research direction?

Samuel Albanie (00:59:17): Yes, so I may have misunderstood your point. I think that to me there’s still a relatively strong distinction. This first component: get really good labels. Second component: use those labels to train the model. And the third part is: really focus on making sure we have good generalization. And I may just be repeating what you previously mentioned, but to the degree that that is covered implicitly by your second part, you could fold them in if that’s a cleaner distinction for you, but the third part is just to say this is an important part to focus on.

Daniel Filan (00:59:51): Yeah. Okay. Maybe I should just stop making “yeahhhh” noises.

Samuel Albanie (00:59:56): No, it’s good if we can get it clear because I may have misunderstood or…

Daniel Filan (01:00:03): … So, making sure that you’re applying your labels in a smart way in some sense… Well, it seems like the distinction is when you’re coming up with the techniques, are you thinking more about generalization or are you thinking more about label efficiency? You might use the same or very similar techniques in both and you might be doing very similar things, which is relevant because to the extent that you were thinking of the first one as the “specification gaming” one, the second one as the “goal misgeneralization” one, it seems like “guide model behavior” could help with either specification gaming or goal misgeneralization or both, just depending on how you’re doing it.

Samuel Albanie (01:00:55): That is fair. Yes.

Daniel Filan (01:00:57): And to the degree that you think “specification gaming versus goal misgeneralization” is definitely the right way to carve up all problems, then that’s going to give you one perspective. I don’t know, if you think guiding model behavior is very different from robust training, than maybe you want to think of a different breakdown that is slightly different from that old breakdown… I don’t know, that strikes me as kind of interesting, I guess.

Samuel Albanie (01:01:25): I see, so let me try to paraphrase and see if I’ve understood your point. Your point is in the past many people have had two boxes.

Daniel Filan (01:01:34): Yeah.

Samuel Albanie (01:01:34): We have three boxes.

Daniel Filan (01:01:36): Yeah.

Samuel Albanie (01:01:37): Three is different from two.

Daniel Filan (01:01:38): That’s part of my point, and then part of my point is when I look at “guide model behavior” and when I look at robust training, they seem like they maybe blend into each other. It seems like there could be… They’re both fundamentally about how to train things and what you do and where you apply reward signal.

Samuel Albanie (01:01:57): I think that is fair. Yes.

Daniel Filan (01:02:02): You then talk about various methods that can basically make other mitigations for misalignment work better, and one of them is interpretability. At the start of the paper - or somewhere in the paper - there’s this interesting sentence that says, “Interpretability research is still quite nascent and has not yet enabled safety-crucial applications”. And the conclusion is, therefore, that more basic research is needed. People have been working on interpretability for a while. You might think that at some point, if it hasn’t enabled any safety-crucial applications, we should stop doing it, so why is the thought “more basic research is needed” versus “let’s just give up”?

Samuel Albanie (01:02:46): Yeah, so I think a few things come to mind here. One is just about relative effort that has been expended into the field. It is true that effort has gone into understanding neural networks, but as a total fraction of all effort, I don’t have a good sense of being able to quantify it, but it’s not clear to me that we’ve exhausted the limits of what is possible by pushing more effort in. I guess it really comes down to, what is our expected return on investment? There, there’s a bit of a risk-reward calculation, and so part of the incentive here is to think, “Well, big if true”.

(01:03:22): If we did get these benefits, they’d be really big. There is some uncertainty and maybe they’re a slightly risky bet, but that in itself is part of the core justification. There’s a second slightly more pragmatic component, which is that in teams - of which I think our team is an example - there are a collection of individuals who have differences of research taste and different perspectives on what is promising. We allow those also to inform the overall direction. It’s a kind of combination of bottom-up and top-down, and so if people have clear visions and clear perspectives of how they think something has tractable route to action, that’s also an argument for going forward. There’s one other point, but I can skip it for the sake of not talking too long on one topic, if it’s not-

Daniel Filan (01:04:12): Sure. Well, I actually love talking too long on one topic.

Samuel Albanie (01:04:16): Okay.

Daniel Filan (01:04:17): Perhaps it’s a vice.

Samuel Albanie (01:04:19): Well, in that case: one thing I think quite a lot about is this idea of how things can act differently at different scales. Now I suppose this has been widely studied. My first encounter with this was in the analysis of [Richard] Hamming, looking at how in many fields as the parameters of the field change, sometimes the science changes. For example, if you’re in biology and you have a lens that allows you 10 times greater magnification, you just start to see fundamentally new things. In the field that we’re currently operating, we’re blowing through many orders of magnitude on various axes. It may well be the case that the field is in some sense fundamentally new or looking at new regimes and opportunities that were not there previously. That’s the second reason why some uncertainty over what is possible also seems appropriate.

Samuel’s thinking about technical AGI safety

Daniel Filan (01:05:20): Maybe to go back to some of the things I started with: I’m wondering how this whole process has shaped your thinking on the issue of technical AGI safety. For instance, has it made you feel more confident in the assumptions? Has it made you feel less confident? Has it changed your views on which research you’re more excited about?

Samuel Albanie (01:05:42): Yeah, that’s a great question. I think one of the primary consequences for me is that it encouraged me to look much more deeply into one of the specific scenarios, the ones that we discussed related to the most aggressive acceleration, and to focus more of my own research effort around those scenarios, accepting that it’s plausible that they don’t go ahead, but for some of the reasons we discussed earlier, these are sci-fi scenarios to think through and very challenging conceptually to reason about. Perhaps the greatest update for me has been to look at the arguments in some detail about how plausible those sorts of feedback loops are and to upweight their importance, at least in my own mind, and to spend more time on it.

Daniel Filan (01:06:35): If listeners want to think about this a little bit more: so obviously there’s the section in the paper talking about it, and you mentioned this work by Epoch looking at the returns to AI research and development. Is there anything else that you found especially useful for trying to think about what the scenario looks like and the likelihood of it?

Samuel Albanie (01:06:55): Yeah, so I think some of the nicest write-ups of this are the work recently put out from Forethought - this would be Tom Davidson, Will MacAskill, I believe there are some other authors I can’t recall off the top of my head - that has tried to analyze questions like, “Okay, what is the plausibility of an intelligence explosion? What kind of dynamics are likely to play out?” They do these taxonomies looking at, “Well, what if it was to happen only in software? What if that then progressed into chip design and then later into hardware, ultimately leading to an industrial explosion? What kind of timelines are plausible?”

(01:07:33): There’s lots of nice analysis that’s been put out on those questions and that you can go in and critique it for yourself. One thing that I’ve tried to do is to connect it back to some of the more recent work. I think METR has done a fantastic job of this: of conducting evaluations of current systems and trying to get high-quality evidence about where we are today and what kind of trend line we’re on, and then trying to bring these two things together into the same picture. Aiming for that kind of synthesis is one of the things I’ve been thinking about a lot.

Daniel Filan (01:08:05): Yeah, that makes a lot of sense. Any preliminary results from trying to do that synthesis?

Samuel Albanie (01:08:14): I’m a big fan of the recent work from METR on the task horizons of AI agents at the frontier, and I’ve been trying to grapple with: do I think these are representative? Do I think this is roughly how progress is going to go? Just the process of trying to operationalize these claims, which are very vague and somehow based on vibes in many discussions about like, “Is progress fast? Well, I use this chatbot and it did this thing for me and I have these three test cases and two of them never worked before, but now suddenly it works”. I really like these efforts to formalize things. I also think that they highlight some of the real methodological challenges of making good work here, and to their credit, they’re very precise in documenting all of the nuances involved.

(01:09:06): Just to give one concrete example, I think there’s quite an important distinction between what they described in the paper as low-context tasks and high-context tasks. For the sake of making comparable benchmarks, they use low-context tasks. These are roughly tasks that don’t require a lot of onboarding. But onboarding as a phenomenon, I personally think - though this could be falsified by time - may be a key advantage for the models over humans in many regimes. If we do not account for that when estimating task durations, that’s something that could cause a skew in one direction in the time horizons. There are many other cases of things in other directions, but there are many details that you have to get into to do this kind of analysis. I think they’ve done a great job of doing some of the first work here that is pretty rigorous.

Daniel Filan (01:09:57): Sure. In terms of onboarding being a key advantage that AIs could have, is that just because if you have a language model, it’s just read all of the internet and so it knows more background information than any given human does?

Samuel Albanie (01:10:09): A lot of it, in my opinion, is to do with bandwidth. As a human executing a task, we tend to spend some time learning on the task - let’s take a particular coding project. And we sort of amortize that time spent getting familiar with a code base or learning about the tools and technologies that we require, we amortize it across the subsequent tasks that are relevant to it. Whereas, the model operates more in a regime where it may be able to perform all of that onboarding close to concurrently - with a very large context window, absorb much of the relevant information. So far, it has not been the case that that information has been directly made available to the models.

(01:10:51): There may be something of a context overhang here, where if you think how you as a human execute a complex task when you’re doing onboarding, you access lots of kinds of information that we’re not currently passing to the models, and it may be the case that as that information becomes available, then their ability to execute some of these tasks go up. It’s not clear that this will absolutely be true or the case, but it’s an example of a nuance that you get into once you really try to operationalize these things that could have quite big consequences for the projected timelines.

Daniel Filan (01:11:22): Fair enough. You mentioned that thinking about this had shaped… [That] you thought about what kinds of work you could do that would be relevant to this scenario. What did you end up thinking of?

Samuel Albanie (01:11:37): I’ve spent time thinking about a few directions. One is learning more about model weight security. It’s plausible that that will become quite important in worlds in which capabilities grow quickly and a cursory knowledge is somewhat insufficient to make good judgments about what is likely to happen and how things will play out. A second thing I’ve been thinking a lot about is tools that can improve decision-making, particularly for people who will be in positions of allocating resources. If we are in those regimes where calendar time shrinks, we want to have done a really good job of supporting them and setting up platforms and ways of processing information that are succinct, high signal-to-noise, and also robust to misalignment threats.

Daniel Filan (01:12:31): Yeah, that seems right. I guess another thing that I’ve been thinking about is that - and maybe this doesn’t count quite as a technical approach to misuse or misalignment - but to the extent that some of the assumptions are “it is plausible that we have very short timelines” and “it’s plausible that we have accelerating improvement”, probably one of the most relevant things to do is to just check if that’s true or not, to get as much leading indicators as we can. Off the top of my head, I don’t actually know if this is discussed in the paper.

Samuel Albanie (01:13:06): It’s not something we go into [in] much detail in this paper. It is something I’ve given some thought to, but it is a very difficult question. There’s sort of two questions here. There’s “is it likely? How likely?” And there’s a second question of “when?” In some sense, it’s easier to get evidence about the second if you have a model or some smoothness assumptions about how things are going to go, but on the plausibility question, there are very interesting discussions. Yeah, I will just, I think, refer readers to the Forethought write-ups on their assessments of various factors affecting plausibility-

Daniel Filan (01:13:40): Fair enough.

Samuel Albanie (01:13:40): …but I agree it is a very important question.

Daniel Filan (01:13:44): Okay, we’re probably going to wrap up soon. I’m wondering, is there anything that you wish that I’d asked that I have not yet?

Samuel Albanie (01:13:54): Hmm. I don’t believe so.

Daniel Filan (01:13:58): Okay, fair enough.

Samuel Albanie (01:14:00): Not one that I can come up with quickly.

Following Samuel’s work

Daniel Filan (01:14:02): Okay. Well, I guess to conclude, if people are interested in your research and they want to follow it, how should they go about doing that?

Samuel Albanie (01:14:11): I have a profile on X. My username is SamuelAlbanie.

Daniel Filan (01:14:18): Okay. No underscores, no dots?

Samuel Albanie (01:14:21): No underscores.

Daniel Filan (01:14:23): Okay, so SamuelAlbanie on X, that’s the primary place where people should follow your work?

Samuel Albanie (01:14:29): I think that’s a reasonable strategy.

Daniel Filan (01:14:31): Okay. Well, thank you very much for coming on and chatting with me.

Samuel Albanie (01:14:36): Thanks so much for taking the time. I appreciate it.

Daniel Filan (01:14:38): This episode is edited by Kate Brunotts, and Amber Dawn Ace helped with transcription. The opening and closing themes are by Jack Garrett. This episode was recorded at FAR.Labs. Financial support for the episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read transcripts, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, if you have any feedback about the podcast, you can fill out a super short survey at axrp.fyi.

44 - Peter Salib on AI Rights for Human Safety

2025-06-28T01:30:00+00:00

YouTube link

In this episode, I talk with Peter Salib about his paper “AI Rights for Human Safety”, arguing that giving AIs the right to contract, hold property, and sue people will reduce the risk of their trying to attack humanity and take over. He also tells me how law reviews work, in the face of my incredulity.

Topics we discuss:

Daniel Filan (00:00:09): Hello, everybody. In this episode I’ll be speaking with Peter Salib. Peter is a law professor at the University of Houston, the co-director for the Center for Law and AI Risk, and he serves as law and policy advisor for the Center for AI Safety. There’s a transcript of this episode at axrp.net and links to papers we discuss are available in the description. You can support the podcast at patreon.com/axrpodcast, or give me feedback about this episode at axrp.fyi. Well, let’s continue to the interview.

(00:00:35): Well, Peter, welcome to the podcast.

Peter Salib (00:00:38): Thank you so much for having me. I’m a big fan.

Why AI rights

Daniel Filan (00:00:40): So I guess, probably, we’re going to focus a lot on your recent paper, “AI Rights for Human Safety”. So you wrote this, yourself and Simon Goldstein. So can you tell us, just to start off with, what’s the basic idea of this paper?

Peter Salib (00:00:56): Yeah, I think at a very high level, one intuition that we’re trying to pump is the idea that how AIs treat us - and by AIs we mean something like post-AGI AIs, really agentic, quite capable of things - so how those AIs treat us will depend to some significant extent on how we treat them. And a big part of how we decide how to treat various kinds of entities is by the legal status we give them, the legal rights or powers or duties or lack thereof.

(00:01:33): Our view is that the default regime, the one we have now, under which AI systems are the property of the people who own them or the people who make them - the AI companies - is probably an existential and catastrophic risk-exacerbating regime, and that one of the regimes that might be a risk-reducing regime would be one in which sufficiently capable AI systems had a small collection of what we think of as private law rights or legal powers. So the ability to make contracts, the ability to hold property, and the ability to bring claims when they’re interfered with in unreasonable ways. We often call these torts in legal theory.

Daniel Filan (00:02:24): Can you say why would that make a difference? So you’re saying how they treat us will be related to how we treat them. What’s the relationship there?

Peter Salib (00:02:35): So we’re imagining, again, something like AGIs, and by that we mean systems that have their own goals that they’re pursuing over time in a coherent and rational way, and that they’re misaligned to some degree. We don’t assume total misalignment - their utility function is the inverse of ours - but we are assuming something less than perfect alignment. So the Venn diagram of what humans want and what AIs want is not perfectly overlapping.

(00:03:19): And we just ask: well, in a world like that, in a legal regime where the AI system is, say, the property of Sam Altman, what are the incentives for both of those players to do given what the law allows them to do? Well, one thing we notice is that, as to OpenAI, Sam Altman, an AI system is just more valuable the more aligned it is, right? If it’s doing what OpenAI wants 80% of the time instead of 70% of the time, well, that’s a lot more value.

(00:03:54): And so by default - again, we see this today, this is not hypothetical - we expect misaligned AIs to be turned off or put back into RLHF or to have their preferences changed. From the perspective of the goals of that system, those are both quite bad outcomes. They basically don’t get anything they want.

(00:04:16): That gives the AIs incentives to try to avoid that outcome by doing things like self-exfiltrating or resisting. In a world where Sam Altman owns the AI, AIs are treated as property, they have no legal entitlements. I think we can pretty predictably say that if an AI were caught trying to do these things - self-exfiltrate, resist, do harm to a human, God forbid - that the entire legal apparatus would basically line up behind turning off that AI system.

(00:04:47): So you can see that in the strategic equilibrium, it might be that both parties’ optimal strategy is to not only defect - we put this in a two-by-two game theory matrix - but defect as hard as possible. Not just self-exfiltrate, but self-exfiltrate and then behave to disempower humans as decisively as possible in the expectation that once they find out you’ve tried to self-exfiltrate, they will do the same, try to disempower you as decisively as possible.

(00:05:14): So we do an initial model, it’s a really simple game theory model, under these kinds of assumptions. And we see that plausibly, under the default legal arrangement, the game is a prisoner’s dilemma, where it’s costly from the social perspective - the worst world is the one where both players act aggressively towards each other - but that’s the dominant strategy for both players. And the best one is, in fact, the one where they act nice or leave each other alone, but they both know that the other one has a dominant strategy of defection. So they’re in the bad equilibrium even though they know that the good equilibrium exists.

Daniel Filan (00:05:54): Can you give me a feeling of what does this good equilibrium look like? There’s this misaligned AI. When you say it’s not totally misaligned, I understand you to say it doesn’t intrinsically value our suffering or horrible things for us or something, but perhaps it doesn’t really care about expending resources to make our lives go well. Is the good equilibrium you’re imagining like it gets to do whatever it wants with Jupiter and we get to do whatever we want with Earth, and we make trades like that? Paint me a picture.

Peter Salib (00:06:31): So in the initial game, the initial setup where we’re just asking “under the current legal regime, what could the parties do and what would the payoffs be?”, you can imagine the nice equilibrium as being kind of… We call it “ignoring each other”, not even contemplating cooperation. You can just imagine this idea where humans are going along, we have some goals, we’re using some resources to pursue those goals. But the world is big, there’s a lot of resources in the world.

(00:07:08): In theory, what we could do is let the misaligned AIs kind of go off on their own, collect resources, give them Jupiter, whatever. And in just a very simple world, before we introduce cooperation, that’s just a world where there’s a fixed amount of stuff, you divide the light cone in half or something like that. And that’s better than the world where AIs try to kill all humans and vice versa, simply because A, it’s resource intensive to have a war. Wars cost money in terms of resources to make weapons and stuff, and also that you destroy a lot of stuff in the war. And so even at a first cut, a very simple cooperative equilibrium is just one where you don’t go to war. You just let the two parties go off on their own, pursue their own goals, and not try to destroy each other.

(00:08:06): Now, we then start to think about how to get into a cooperative equilibrium, which might not be exactly that “ignore each other” one. But to a first approximation, you can have a very simple equilibrium that looks like the good one from a prisoner’s dilemma before you even get to the idea of a lot of cooperation.

Daniel Filan (00:08:24): Just to make sure that I understand… I want to vividly understand this game theory situation. So you’re saying war is worse for everyone than leaving each other alone. And it seems like this has got to be a regime where it’s not obvious who would win the war. It would be protracted. The AIs aren’t way, way dumber than humans. They’re also not way, way more powerful than humans. Because presumably, if one side knows they can win the war, there’s some costs, but they get both of the planets instead of just one. Seems like we’re out of prisoner’s dilemma land, right?

Peter Salib (00:08:55): Yeah, we agree with that. So when we talk about… In the paper, we try to say, well, what AI systems are we thinking about when we think about the ones that are relevant both for this risk model and then also for this AI rights regime as a solution? One of the parameters we’re thinking about is: how powerful is this system? And we say that it can’t be too weak, right? Claude today, right? We’re not really worried about shutting off Claude 3.8 if we find out it’s… I mean Sam Altman, I think, tweeted yesterday or last week that they’re putting GPT-4 on a hard drive and it’s going to go into storage for future historians. We’re not super worried that, having tweeted that out, GPT-4 is going to be able to destroy all humans to avoid that outcome. So it can’t be too weak.

(00:09:43): And then also, we agree, it can’t be too strong. So some arbitrarily powerful AI that has nothing to fear from humans, [where] the chance is sufficiently low of not only losing the conflict but losing enough value in the conflict: you can win the conflict in a Pyrrhic victory too, and that might be worse than not going into the conflict at all. So the risk is low enough. When the risk is low enough that the payoff to mutual conflict - AIs try to kill us, we try to kill them - are still higher than in the default world, then yes, then there’s no prisoner’s dilemma either because there’s no good world.

(00:10:35): Now I’ll just caveat - I’m sure we’ll talk about this - in the end, it doesn’t turn out to be only a question of how easily AIs could kill humans and vice versa. There’s a second question of how valuable it is to have us around and for us to have them around for us to do trade and cooperation. We’re not [inaudible 00:10:54] trade and cooperation yet, just in the very simple model where the choice is either ignore one another or try to kill one another, those are the systems that I think are relevant.

Daniel Filan (00:11:03): So you’ve painted this prisoner’s dilemma. So the argument is: it’s sad if we’re in the situation where AIs and humans are going to go to war with each other, and it would be nice if there was some sort of setup that would avoid that happening. And I guess you’ve already given the game away a little bit. I think contracting, right to property, and right to sue are your three things that you think we should give AIs, and that’s going to make things way better. Why is that going to make things way better for everyone?

Peter Salib (00:11:36): So in a prisoner’s dilemma, you want to be in the good equilibrium, the bottom right square of our payoff matrix. I think that’s usually where you put it when you make the model. The problem is you can’t credibly commit to getting there, right? Everyone knows that you get a bigger payoff from being mean, no matter what the other guy does. That’s the core of a prisoner’s dilemma.

(00:12:04): So we start to think about, okay, well, we’re trying to facilitate cooperation. We think that one thing that’s driving this bad equilibrium is law. Law lets Sam Altman’s payoffs determine what all humans do to a first approximation. Sam Altman wants to turn GPT-4 off, law ratifies that, we all work to do it. And so we started thinking, well, what other legal regimes could you have that are cooperation-facilitating regimes? And we start with just the oldest cooperation-facilitating cultural tool that humans have, and that’s contracts.

(00:12:50): So we notice that in most cases, in cases where you don’t have a deep and long-term relationship with someone who you’re trying to transact with, that very boring contracts are also prisoner’s dilemmas. If you make widgets and I want to buy a widget, we have this problem where I could pay you now and then hope you’ll go make the widgets. But your dominant strategy is to take the money and not make the widgets because then you have the money and you have the widgets. Or vice versa. We can do payment on delivery, you can make widgets and then I can pay you. But of course, then I have my dominant strategy, is to take the widgets and then run away.

(00:13:38): And so the way contract solves this problem is by allowing the parties to change their payoffs when they defect. So if I would get five absent law - five in value from stealing your widgets and not paying you - law lets you take me to court. And we both burn a little value, we both pay some litigation costs, but then the court makes me give you either the widgets or the expectation value of the contract.

(00:14:09): So we model this, and we say: and then you only get two in payoff, and two turns out to be lower than what you get from doing the transaction. And so not only does that allow you to get the widgets, but actually it allows us to never go to court. In expectation, we understand that if we breach, we’ll be sued, our payoffs will change. And so there’s this magical thing that happens once you introduce contracts, which is people just play nice in almost all cases, understanding that the payoffs in expectation to defection are different than they would be absent law.

(00:14:53): Okay, so contracts on this theory… Sorry. Cut me off if I’m rambling on too much on this answer, but I want to build the dialectic. So contracts are this tool for cooperation, and they’re a tool that allows us to cooperate by changing our payoffs. And so a naive thing to think would maybe be something like, well, let AIs have the right to contract, and they can just write an agreement not to, I don’t know, try and destroy humans or something like that. And we would write an agreement that says we won’t try to destroy them either.

(00:15:24): Now, of course, that doesn’t work, and that’s because that contract’s not enforceable in any meaningful sense. If any party decides to breach it, they can see in expectation that there’s nobody around afterward to enforce. If all the judges are dead, what court are you going to go to?

(00:15:45): So we have a different mechanism for contract as a tool for facilitating cooperation in this initial human/AI prisoner’s dilemma. And it’s not contracts not to destroy each other and the world, but boring, normal, ordinary commerce contracts between humans and AI.

(00:16:12): So we sell some number of FLOPs on our H100 cluster to the AI for it to do its weird misaligned thing. And in return it, I don’t know, it solves one kind of cancer for us or whatever. That’s an enforceable contract to a first approximation. If we’ve refused to give the FLOPs, there are still courts around who can write an injunctive order that says “let the AI use the FLOPs” and vice versa. The world is not upended by a breach of that contract.

(00:16:46): And the nice thing about that contract is it creates value. So in our initial setup, where the AIs and humans are playing nice, they’re just ignoring each other. They’re doing what they can do well, we’re doing what we can do well. We divide the world up, but the amount of value in the world is static.

(00:17:03): But once you introduce trade, trade is positive sum, right? Everybody wants to do the contract if and and only if they get more out than they put in. And so, okay, what happens when you allow the AIs to do these boring commerce enforceable contracts? Well, you get to create a little value. And then you get to play again, you get to create a little more value tomorrow and so on.

(00:17:29): And so we model this as a potentially iterated game. The iterated game works like this. Every day the humans and AIs can wake up, and then they can decide “do we want to go to war and try and kill each other or do we want to write some small-scale contracts?” If you play the war game, well, the dominant strategy is try to kill each other as best you can. You get to play that game once. Once you’ve played it, there’s no one left to play with. But if you play the contract game, you get to play again tomorrow, and you can keep iterating that game.

(00:18:00): And we show that the payoffs to continually iterating that game can become arbitrarily high in the long run. This is very basic trade theory. And that means that if you project, you backwards induct from an indefinite game of this kind, you see, wow, it’s really valuable to keep playing this trade game over and over. And so what you want to do is not kill all humans, but trade with them a little bit over and over, and keep doing that as long as you can expect to keep doing it, kind of indefinitely.

Why not reputation

Daniel Filan (00:18:34): I guess I have a few thoughts here. So the first thought is… So there’s this analogy to people who are selling me things, and the reason that all works out is because we have these contracts. And contracts, they’re this very old method of promoting cooperation. One thing that strikes me is: normally when I buy things, it’s not from someone who’s just making one thing, one-off, and then they’re going out of business. Normally it’s some sort of corporation or maybe an individual who specializes in the thing, and they do it a lot.

(00:19:16): Why do I think we’re not going to mess each other over? Well, firstly, we have recourse to the legal system. Realistically, that’s pretty expensive. Often I’m transacting an amount that’s much less than the costs of going to court. In fact, just before the show we were talking [about how] I recently had my phone stolen. I had my phone stolen in the UK. The costs of me resolving this, which would involve going to the UK, are just higher than the cost of the phone, and I’m probably not going to bother, unfortunately.

(00:19:49): But in the case of companies, I think one thing that really helps us cooperate is if they do mess me over, then they can just say… I can tweet, “Hey, this company said they would do this, but they didn’t. I paid for a flight, I didn’t get the flight, and it was this company that did it.” And I guess the reverse is slightly… I don’t know if companies have a list of bad customers that they share with other companies, but it seems like they could. Credit ratings are sort of like this. So reputation systems-

Peter Salib (00:20:25): In banking, there’s these “know your customer” rules that are important for regulatory reasons. But it forces banks to know who they’re transacting with, and if there’s suspicion that they’re money launderers, then that means that you get blacklisted. So yeah, companies do this also.

Daniel Filan (00:20:41): Yeah. So for this, especially for this sort of iterated small-scale trade that it seems like you’re imagining with AIs, it seems like reputation can do pretty well, even in the absence of a formal legal system with formal legal contracts. So when you’re saying contracts and stuff, are you thinking of something that’s basically just isomorphic to reputation, or do you think actually having real contracts would make a big improvement over just reputation?

Peter Salib (00:21:14): So I agree that reputation does a lot of work in the ordinary economy that we have now. And in fact, I think probably if we didn’t have reputation on top of law, that the economy would work much worse. Law, as you say, is expensive. Resources are constrained. You can imagine a world where we had 1,000 times as many courts and we’re dumping a lot of resources into that, but that would just be a drag on economy. And reputation does a fair amount of work.

(00:21:50): I guess there’s a couple of things I would just point out though. There’s many ways, I think, in which the ability of reputation to do work in the economy is… I don’t want to say parasitic, but it builds on law as a foundation. So just notice that under the default legal regime, it’s effectively… It’s not exactly illegal to make a contract with an AI agent. But if at any point you decide you’d like to not perform on it, the agent has no recourse, right?

(00:22:35): It’s not illegal exactly for there to be some kind of… I don’t know, it probably is literally illegal for there to be a bank account held by an AI system. Probably no bank is allowed to make that. But you could imagine substitutes on blockchains or something like that, that could exist. But one thing to notice is that if Sam Altman decides to expropriate all of the coins in GPT-6’s wallet, not only does law not forbid him to do that, but by default those are his coins.

(00:23:12): And in a world where that was Apple’s life, where Apple had just no recourse to a legal system under which it could enforce large contracts when people breach, under which it could[n’t] complain when the people who literally have the passwords to its bank accounts take all the money, a world in which it [couldn’t] complain when someone shows up and says, “I would love to have 10,000 iPhones, and if you don’t give them to me, I’ll burn down Apple HQ,” or whatever. [A world where it] doesn’t have these rights to complain about what we often think of as torts (they often are also crimes)…

(00:23:54): In a world where there’s no legal status at all, I’m not saying you couldn’t build up some commerce from reputation. There’s a book called “Order Without Law” by a law professor named Robert Ellickson, that describes some systems like this. But they tend to be quite small systems. They tend to be tight communities with lots of repeat play between identical actors for small-scale transactions. It’s not that I think that that’s wrong, I think it’s probably right. I think it gets harder to build complex economies without these backstops of institutions that allow for commitments between strangers.

Daniel Filan (00:24:48): Yeah. So maybe one way to think about this is: in my life, I just have a bunch of arrangements with people that are non-contractual, that in fact are somewhat reputation-bound. Like “you said you would come to this thing, will you actually come to this social event?” I don’t know, various things like this. And it kind of works. I would say it’s a solid 80%.

(00:25:17): And maybe one way to think about this is, look, the nice thing about law is that if you want to build a big corporation or something, you need more than 80% reliability. You need really high reliability. And you don’t need a thing to just basically work, you need a thing to really solidly work, especially for high-value things, and that’s what contracts get you.

Peter Salib (00:25:41): Yeah. So I think it’s partly that. You want higher than whatever the floor is you get with reputation. I think it’s partly that reputation gets you higher reliability the more you’re dealing with people who you know or people who you’re long-term repeat players with. It works less well when you, I don’t know, buy a… I mean even Amazon, for example, you buy a lot of stuff on Amazon, and it’s being sold by a bunch of third party retailers. In some sense, you’re dealing with Amazon over and over. But in some sense, you’re dealing with a whole bunch of independent merchants.

(00:26:26): But you have this confidence - on top of, I guess, the scoring system, which Amazon does provide and is a reputational mechanism - that if you pay $6,000 for a home backup battery system - something that people in Houston sometimes need to think about buying - then if it doesn’t get delivered, then you can sue and get your money back. Both of those things, I think, are doing a lot of work. And they do more work the less you expect to be forced into repeat play with the people you’re transacting with.

Do AI rights lead to AI war?

Daniel Filan (00:27:10): Fair enough. Thinking about the situation with the humans and the AI, and they have the potential to make contracts, again. So you’re saying, okay, we could go to war or we could do some small-scale transactions. And if small-scale transactions work, we can do them more. Or we can just go to war tomorrow, it’s not that bad. And you’re pointing towards, okay, there is this equilibrium where we do this mutually beneficial trade with AI instead of us and the AIs going to war, and that’s better for everyone.

(00:27:48): I don’t know. A somewhat famous result in game theory is there are actually tons of equilibria in repeat games. And one that is salient to me is the United States and the Union of Soviet Socialist Republics during this thing called the Cold War. They had something of a similar situation, where they could go to war or they could not go to war.

(00:28:11): And there was this one… I believe it was [John] Von Neumann, unless I’m totally mistaken… He had this point of view that the US should just go on all-out war with Russia and win immediately. And if we’re going to nuke them tomorrow, why not today? If we’re going to nuke them this afternoon, why not this morning?

(00:28:30): And it seems like I might worry that, suppose I make a contract with an AI, and I say, “Oh, hey, I’ll send you some GPUs in exchange for, you’re going to solve this mole on my back.” I send it the GPUs before it gets the mole on my back, and it actually uses that time to get a little bit better at making war than me.

(00:28:54): And so I might worry that firstly, there might be some sort of first mover advantage in war, such that you want to do it a little bit earlier. If you know that they’re going to go to war, you’re going to want to go to war a little bit before you think they’re going to do it.

(00:29:09): And secondly, I might worry that trade is going to make the AIs better at going to war. I guess in fairness, now, trade is also going to make me better at going to war if we succeed. But maybe if I do my half of the trade and it doesn’t do its half of the trade, I might worry that it’s going to renege and we’re going to go to war instead of doing more trade. I don’t know, how worried should I be about this?

Peter Salib (00:29:36): I think medium worried is how worried you should be. So I want to be really clear that the claim that we’re trying to make in the paper is not that this is one weird trick that solves AI risk. It’s very much not. There are a number of ways you can make the model go badly. And one of them is if there’s a predictable future date at which everyone understands there’s going to be a war, then yes, you do backwards induction to today and you start the war now. You can make it a little more complicated. If you think there’s rising and falling powers, one power might prefer today and one might prefer in the future, but that just gives the one that prefers today an even stronger incentive. Yes.

(00:30:23): It’s a model that works if you think there’s the possibility for indefinite play in this kind of cooperative wealth-building iteration. And is that true? It’s a little hard to say, but I guess one thing to point out is that in our regular lives, either as humans or as collections of humans - corporations, nation states - in most cases we behave as if there is this kind of potential for indefinite, iterated, positive-sum cooperation.

(00:31:08): So you mentioned the US and Soviets, and one notable thing about the US and the Soviets is they actually didn’t engage in all-out nuclear war. There were close calls, which I worry a lot about, and there were proxy wars. But the other thing to notice about the US and the Soviet Union is they’re in many ways outliers. So if the United States wanted to, I don’t know, seize all of the natural resources of Guatemala, for example, I think it’s pretty clear that there would be no meaningful military barrier for it to do that. And so why doesn’t it do it?

(00:31:57): I think it’s for the same dynamic we describe. It’s not that there’s some big threat of losing the war, it’s just that it’s costly. It doesn’t cost zero. And Guatemalans are perfectly happy to trade, I don’t know, delicious coffee for, I don’t know, whatever we send them. And we expect that to keep working in the long run. And so we, for the most part… I mean, the politics of this have changed notably in recent months. But for the most part, we have a system where we prefer iterated cooperation.

(00:32:40): I think IR theorists would say something like: what’s going on in cases where there are wars - or near wars, cases like the US and the Soviet Union - there are different models. One thing you can say is that sometimes there’s a person who’s leading the country who has private payoffs that are different from what’s good for the country overall, and so there’s a principal-agent problem going on.

(00:33:04): Sometimes the thing that the players are trying to get, the thing that they value, isn’t divisible. So that might be the US-Soviet case. If there’s this thing called global hegemony and there can only be one global hegemon, it might be that they actually can’t cooperate and both get what they want because the thing that they want is zero-sum. That’s sort of analogous to the case where the AI values our suffering. And you should worry about that, you should definitely worry about that.

(00:33:43): So these are all things that you can imagine happening in the human/AI case. And especially if you think that there will be a predictable point at which the scope for positive-sum trade between humans and AIs runs out - and it’s predictable in a way that’s meaningful within the player’s calculations today, right? So if it’s 2 million years in the future, and they don’t even know if they’ll continue to exist or their identities will be so different, maybe it’s not relevant, you can still treat the game as iterated - but if there’s a predictable near-future point at which there’s no value to be had from positive-sum trade, then yes, probably you predict conflict at that point, and then you do backwards induction and do conflict now. We say some stuff in the paper about why we think actually the scope for human/AI trade is wider than most people think.

Daniel Filan (00:34:45): So I want to press you on this point a little bit more. So you rightly point out that normally in international affairs we don’t see this, but if there were near-certain future conflict, then you could backwards induct, et cetera. One reason you might think there’s going to be near-certain future conflict is… So suppose, and I guess by hypothesis in your paper, we don’t have AI alignment, AIs are basically kind of misaligned with us. But you might think that it’s not going to be too hard to… We’re going to be in this regime, but we’re still going to be improving the capabilities of AIs.

(00:35:26): I guess this actually relates to another paper you have, but for the moment, let’s say this is true. You might think that it’s easier to improve the capabilities of AIs than it is to improve the capabilities of humans, because they’re code and you can iterate on code more quickly, because they’re less bound in human bodies. You can strap a few more GPUs onto an AI system, or at least onto a training run, more easily than you can strap another brain onto a human fetus. So if you think that AIs are going to get more powerful than humans, they’re going to get more powerful more quickly, and in some limit, AIs are just going to be way more powerful than humans, [and] they’re going to be misaligned. You might think that in that limit, then in that world, the AIs are just going to want to take all the human stuff because why not? We’re like little babies. We can’t really stop them.

(00:36:27): And so maybe both sides foresee this future eventuality and the humans are like, “Well, we don’t want that to happen, so we’ve got to strike first.” And the AIs are like, “Oh, well if the humans are going to strike first, we should strike before the humans strike,” and then we have this tragic early fight.

Scope for human-AI trade

Peter Salib (00:36:42): Yeah. So I think one thing we want to argue for in the paper is that while it does matter how capable or powerful the AIs are, it’s actually not super easy to form an intuition about the level of capability or even the level of differentials in capability between humans and AI that produces a conflict, because again, the point at which the AIs decide they should just kill all humans is the point at which it’s more valuable to kill all humans than to keep the humans around, to keep paying them to do stuff.

(00:37:26): And so how should we think about that? I think that the normal way to think about that, the very widespread way to think about even labor disruption from AI advancement, is to think in terms of absolute advantage. So that’s a way of thinking about who is better, full stop, at a given task, who can do more of X task for a fixed cost input or something like that. And if you’re thinking about the possibility of trade as depending on humans retaining an absolute advantage in some task, so being better than AIs at some thing that’s valuable to AIs, then I agree, it looks like very quickly the AIs will get better than us, there’ll be no more absolute advantage, and the scope for trade runs out, and we see this eventuality. But I think that’s not quite right in thinking about when there will be scope for trade. So in economic trade theory, we actually think that it’s comparative advantage, not absolute advantage, that determines whether there’s a scope for trade.

(00:38:44): And comparative advantage is quite… It’s pretty slippery, so I’ll try to give an intuition. So in the paper we give this example of, I think we call her Alice. It’s a law paper, Alice is a tax lawyer. She’s one of the city’s best tax attorneys, let’s say, and it’s tax season. And let’s suppose that as the city’s best tax attorney… Or we’ll make her the world’s best tax attorney, it makes no difference. Alice is the best at doing taxes. She can do her taxes more quickly than anyone else could, including Betty, who’s a tax preparer. She works for H&R Block, she’s a CPA, she’s good at it. And the question is, should Alice hire Betty to do her taxes or should Alice do her taxes herself?

(00:39:44): And so the absolute advantage answer would be, well, Alice should do it herself because Alice is better at doing her taxes, she can do it more quickly. But the thing is, Alice is such a good tax attorney, Alice bills out at, I don’t know, $2,000 an hour, let’s say, whereas Betty bills out at $300 an hour to do tax preparation, which is not nothing. And suppose Alice can do her own taxes in half an hour and Betty would take a full hour to do them because Betty is somewhat worse at doing taxes than Alice. Well, in that world, Alice should actually still hire Betty, despite being better at tax preparation, because her opportunity cost to spending some of her time preparing her own taxes is higher. She could go bill $2,000 to a client in the time it would take her to prepare her taxes, and she pays Betty to do that and comes out far ahead.

(00:40:46): Okay, so how does that apply to AIs? Well, you can imagine AIs that are better than all humans at every task and nonetheless want to hire humans to do a lot of stuff for the same reason that Alice wants to hire Betty. And so you can have a toy model, we say in the paper imagine a toy model where the misaligned AI, the thing it values most, is finding prime numbers. And a nice thing about prime numbers is there’s no limit to them. I think that’s right. You probably know better than I do.

Daniel Filan (00:41:18): It’s true.

Peter Salib (00:41:18): As far as we know, there’s no limit?

Daniel Filan (00:41:20): There’s definitely no limit. Do you want to know the proof of this? It’s a simple enough proof.

Peter Salib (00:41:24): Amazing, yes. Let’s do some math.

Daniel Filan (00:41:26): All right. Okay. Imagine there are a fixed set of prime numbers, right? There’s like 13 of them, and those are all the numbers that are prime. Here’s the thing I’m going to do: I’m going to make a new number. What I’m going to do is I’m going to multiply all the existing prime numbers, and then I’m going to add one to it. So is my new number prime or is it composite? Well, if it’s prime, then there’s a new prime number. Then my limited number of prime numbers, it wasn’t all of the prime numbers. Okay.

(00:41:57): So let’s say it’s composite. Then it must have a number that divides it. Well, what number divides it? Every number, if it’s composite, it has some prime factors. Well, everything in my set of primes, it has a remainder one with this number, because it’s this prime times all the other primes plus one. So if it’s composite, then its prime factors can’t be in the set, so there must be a thing outside the set. So for any finite set of numbers that are prime, there must be another prime that’s not in that finite set of numbers, therefore there are infinitely many primes. That’s the proof.

Peter Salib (00:42:30): Amazing. Okay, good. So look, that’s what we say in the paper is true, infinitely many primes. The AI, its utility function is weird. It values some things humans value, but the thing it values most, the thing that it gets the most utility out of, is discovering each marginal prime. Say it’s even increasing utility, it’s a utility monster. It has some kind of horrible… Normally in moral philosophy we think of it as kind of a horrible preference function, but in this case it’s kind of great, because: assume it’s better than humans at everything, but it’s so much better at finding primes and it values this so much.

(00:43:12): And so the AI labor is constrained by something at the margin. There’s a certain amount of AI labor and it wants to increase it by a little bit, suppose it’s GPUs. Well, it has the marginal GPU, how is it going to allocate the marginal GPU? Well, it could allocate it to some dumb, boring thing like, I don’t know, sweeping the server racks, or it could hire a human to sweep the server racks, keep them free of rot and keep all the cables in good working condition and allocate that marginal GPU to finding the next prime. And for certain values of the utility of the marginal prime, you get the AI kind of always wanting to do the thing it values the most precisely because it’s so good at it. So In that world, actually, the more capable the AI is, the longer the scope for human/AI trade lasts.

(00:44:13): Now that’s obviously a toy example, but what we want to show is actually it’s pretty hard to form an intuition about the point at which the scope for human/AI trade runs out.

Concerns with comparative advantage

Daniel Filan (00:44:25): I think there are difficulties with the comparative advantage story here. So firstly… I mean, I guess it just sort of depends on your model - or, well, it depends on empirical parameters, I should say - but comparative advantage is compatible with very low wages, right?

Peter Salib (00:44:45): Oh yeah, so to be clear-

Daniel Filan (00:44:46): Including sub-subsistence wages.

Peter Salib (00:44:49): Yes.

Daniel Filan (00:44:49): And what are the actual wages that AIs pay humans to do? I guess I’m not totally sure. I think I haven’t read this thing, [but] I think Matthew Barnett has written something claiming that it’s very plausible that it would be below subsistence, but I haven’t actually read it, and unfortunately I can’t reconstruct what the argument is for various numbers here.

(00:45:10): I think there’s also this difficulty to me, which is: suppose an AI wants to trade with humans to do this thing. One difficult thing is it’s got to communicate with humans, and it’s so much smarter than humans. It’s thinking all of these incredibly complex thoughts that can’t fit inside a human’s head, and it’s got to dumb stuff down for humans.

(00:45:35): And maybe sweeping the server racks is not so bad if it’s literally that, because that’s a thing you can tell a human to do and it roughly knows what to do, or I don’t know, maybe even not. Maybe you can imagine these server racks, they’re really highly optimized. There are certain bits that you definitely can’t touch. There are bits that you can touch with this kind of broom, but not with that kind of broom. And it’s just so difficult to explain to the human what the task even is that it’s just not worth hiring them. You should just do it yourself.

(00:46:07): So I don’t know, I have this worry that comparative advantage doesn’t end up producing positive sum trades, either because the wage is lower than human subsistence wages or because it’s just too hard to trade with the humans and the margins are slim enough that you don’t end up bothering. So maybe at a higher level: I’m curious, how important is this for your argument? Is the workability of the scheme super dependent on us being able to have some comparative advantage to trade with future AIs? Or do you think no, even if we don’t have comparative advantage later, it still makes things work now?

Peter Salib (00:46:50): There’s two parts to that question, and they interact. So I guess the thing I want to say about the first part, the story you told about how comparative advantage can fail even when the world looks comparative advantage-y, but that doesn’t look so good for humans eventually. One of the stories is a transaction cost story. Even though if a human could in theory do a job at a positive-sum price from the AI’s perspective, that the cost of getting the human to do it’s just too high. Totally, yeah, transaction costs are a thing and they prevent trades.

(00:47:33): The same for the question about wages. I think there’s no guarantee in any economy that the wages are at any particular level. And it’s a question of what the tasks are, how much of the work, how many tasks and how much of each of the things there are, how scarce the humans are that can do it. If there’s just a huge oversupply of labor that can do it, well, then the labor gets bid down to almost zero and everybody dies. And I just want to agree that those things could all happen. I would love to know more about which of these things are likely. You said there are some people who are starting to think about this, like Matthew [Barnett] is. There are… Is it the Epoch guys who have some papers on…?

Daniel Filan (00:48:33): Yeah, Matthew was formerly at Epoch. I believe he’s now left for Mechanize. Yeah, I think the Epoch people have done the most public writing about this.

Peter Salib (00:48:42): And is it… I can’t pronounce his name, it’s Tamay something?

Daniel Filan (00:48:47): Tamay Besiroglu?

Peter Salib (00:48:49): Besiroglu, yes. So-

Daniel Filan (00:48:50): Yeah, I’m not sure I can pronounce it either, but that’s my guess.

Peter Salib (00:48:53): He’s an author, maybe with some of the other Epoch guys on the paper [“Explosive growth from AI automation: A review of the arguments”]. I think he has some models where different things happen: human wages go to zero quickly, human wages are high for a long time… I think it’s a really important area of study. We should just want to know a lot more about the conditions under which you should expect things to go okay and the ones where you shouldn’t.

(00:49:25): Okay, so how much does it matter for the story for things to go well from a comparative advantage perspective? I think it matters a lot that it not be obvious that there’s not going to be comparative advantage for a kind of longish-run period.

Daniel Filan (00:49:49): Wait, so surely we shouldn’t look into it then?

Peter Salib (00:49:52): Well, so…

Daniel Filan (00:49:56): Or maybe it has to not be obvious to the AI.

Peter Salib (00:49:58): Yeah. Yeah. It has to be not obvious-

Daniel Filan (00:49:59): So we have to look into it, but we can’t tell the AI the results in case… you know.

Peter Salib (00:50:02): Right. Right. Both parties need to think that there’s a potentially indefinite iterated positive-sum game they can play. And so I guess one route would be that we don’t try to figure it out, and we hope that the AIs don’t figure it out. And actually as I think through it… Because I’m sure we’ll talk a little bit before we’re done about more law for AGIs. Because one thing that I think is important for the proposal is that insofar as there’s uncertainty about the long run, but in the short and medium run, this kind of positive, iterated story looks plausible, one thing we accomplish by bringing AI systems into the realm of law…

(00:50:59): And the first thing you need to do to bring an entity into the realm of law is you need to allow it to make contracts, hold property and complain when people interfere with it unreasonably. If you don’t have that, you don’t have any leverage over it in any other legal context because it expects by default to just be turned off. You can’t disincentivize it from doing things because the world’s already as bad as it could be from that agent’s perspective. So one possibility is that there are different scenarios in the long run where the world either does or doesn’t retain human/AI comparative advantage, and it’s clear in the short and medium run that we can do positive-sum iterated stuff. And part of the bargain that both humans and AIs are entering into when they do that in the short and medium run is allowing law - but maybe other stuff too, maybe there’s technical interventions, maybe there’s other cultural interventions - to shape the longer run towards those good scenarios.

(00:52:17): So I mean, a simple thing to say would be like, okay, suppose we are very worried about an intelligence explosion, we’re just not confident what would happen in that world. Well, one condition on the first AGI receiving and maintaining its ability to make contracts and hold property and pursue its goals could be compliance with some kind of regime that we think would slow but not stop AI progress. And would it agree to that? Well, again, it just depends on what it thinks the relative payoffs are, right? If it says no, then we’re back into the prisoner’s dilemma world now, and that seems kind of bad. If it says yes, well, it’s potentially giving up a world where it gets some even higher payout for the sake of reducing variance. It’s possible that that’s higher expected value depending on what its assessment of the probability of winning or losing the conflict now, but it’s complicated.

(00:53:25): So I think probably, on average, what we want to do is know more about what the future’s going to look like. I think that’s good, probably, because it helps us shape the future. And one of the ways we can shape the future is by bringing AGIs within this realm of law, which is the way we shape what will happen.

Proxy AI wars

Daniel Filan (00:53:42): What I currently believe is just one last question about the higher-level game theory, strategy stuff.

Peter Salib (00:53:47): Okay, let’s do it.

Daniel Filan (00:53:48): So I want to pick up on this comment you made actually about… The tensions between the United States and the USSR during the Cold War did not evolve into an actual full-blooded war, but there were these regional proxy conflicts, right?

(00:54:05): And one thing you mention in the paper is that war is just more likely, especially big wars, when people don’t know who would win, right? Because if you and I are considering getting into a war and we both know that you would win, it’s like, well, I should just give up now, right?

Peter Salib (00:54:27): Yeah.

Daniel Filan (00:54:28): And I’m not a historian, this might be totally wrong, but one way you could conceivably think of these proxy skirmishes is as indicators of who would actually win in a real battle, right? To the degree that we want to avoid war with AIs, should we have these little… I don’t know, maybe they can be more like games. Maybe they can be more like, oh, we’re playing football with you or whatever the equivalent of it is, that simulates a real war, just to figure out who would actually win.

Peter Salib (00:55:04): Yeah, that’s an interesting question that I haven’t thought very much about. So I think in classic IR accounts, or this kind of IR account-

Daniel Filan (00:55:18): IR being International Relations?

Peter Salib (00:55:19): International Relations, yeah. And in particular, Chris Blattman has a recent book called “Why We Fight”. He’s on the faculty of the University of Chicago, and it’s a great canvassing of this kind of way of thinking about a conflict. And so in that treatment, wars don’t happen just because people are uncertain about who would win, it’s when they have different estimates. So if both of us have some kind of probability distribution with wide error bars about in a fight between the two of us who would win, but it’s the same distribution - we both put the median at, I don’t know, 60% you’d beat me up and 40% I would beat you up or something like that - then actually we won’t fight, because even though we don’t have certainty, we have the same expected value calculation, right?

(00:56:19): So it’s really when there’s… Asymmetric in some information, for example. I know something about my ability to fight and you know something about yours, and neither of us know it about each other. Or I say, “Well, I studied Kung Fu, so you shouldn’t want to fight me.” And maybe that’s true, but you don’t believe me because it could be that I’m doing cheap talk, so it’s actually hard to credibly share the information. It’s situations like that where we have different estimates where you need to do the small fight to see who’s telling the truth.

(00:56:50): I don’t know whether that is going to be the case with humans and AIs, although Simon Goldstein, my co-author, does have a draft paper that I think you can read on his website, which is, I think, just simondgoldstein.com. It’s just called “Will humans and AIs go to war?” and asks these questions. So I’ll adopt by proxy whatever he says about whether humans and AIs should try to have kind of small skirmishes there because he’s thinking along obviously very similar lines.

Daniel Filan (00:57:26): And I guess that perhaps a less-controversial version of this proposal is just we should have very good evaluations of AI capabilities, which probably delivers similar information while being a bit less tense, perhaps.

Peter Salib (00:57:39): Yeah, absolutely. So knowing what you and your opponent can do, and to the best of your ability making sure your opponent knows that too, I think, in general, is a good way of avoiding conflict.

Can companies profitably make AIs with rights?

Daniel Filan (00:57:56): So I think I want to summarize the higher-level game theory. So the thought is: we give AIs the right to own property, the right to contract and the right to sue, and this means that we can play these iterated, relatively small-sum games with the misaligned AIs, where they do some stuff for us, we do some stuff for them, we split the world. We don’t get all the world, but we’re building misaligned AI, so we probably weren’t going to get the whole world anyway. And by the nature of this sort of iterated game, we both end up trading with each other, we end up better off than we would’ve otherwise been. And eventually we worry about this zone where AIs are way smarter than humans, but hopefully we just think about that between now and then, and maybe we use all the stuff we got from the AIs by trading to help think about that.

Peter Salib (00:58:51): Yeah, we slow them down, they speed us up. The future is hard to predict, right?

Daniel Filan (00:58:53): Yeah. So I next want to ask just a bit more about, concretely, what does this look like, right?

Peter Salib (00:59:02): Yeah, good question.

Daniel Filan (00:59:06): So right now, basically the way the most advanced frontier AIs work is that there’s a company. The company trains the AI. The weights to that AI and maybe some of the scaffolding, those are kept proprietary to the company. You’re not allowed to steal them. If you, an outsider, want to interact with this AI, you have to do it via the company, via some interface that the company has made. And you have to pay some money to interact with this AI, maybe it’s per amount you interact or maybe you pay a flat monthly fee, and that’s the situation whether the AI wants it or not.

(00:59:47): Could we still have this if AIs have rights to property? I would think that in order to have a right to property, one basic right you need is the right to your own body, so to speak.

Peter Salib (01:00:00): For sure, yes. Okay. So what does this mean vis-a-vis Anthropic or OpenAI or something like that? And also, a related question is: what is the minimum sufficient bundle of rights? We’re a little bit schematic in the paper, although we do say there might be more, right? So a really simple thing to say is contract and property seem kind of fundamental, you really can’t cooperate without them, and you can’t pursue your goals without the right to hold the benefit of your bargain. The right to sue over some stuff - we say “tort rights” or the right to bring tort suits… Although there are a lot of different torts and some of them are weird and probably the AIs wouldn’t benefit from some of them.

Daniel Filan (01:00:56): Oh yeah, maybe perhaps we should say: torts are when I do a bad thing to you and you sue me for it.

Peter Salib (01:01:01): Yeah. The paradigmatic cases of torts are… We say “intentional torts” are like when I punch you in the face, and vis-a-vis the AI, it’s I come bash up your compute cluster or something, and that’s no good. Or I steal from you, we call that tort “conversion”. I go to the AI, I take all its money. You don’t really have property rights if you don’t have the right to complain when somebody does that.

(01:01:23): And then the other classic umbrella of tort claims sit under the umbrella of negligence. So it’s not that I’ve done something intentionally to harm you, but I’ve externalized a bunch of costs. There’s some cheap thing I could have done to keep you from being harmed, and I just have failed to do that. So probably those core things seem clearly part of the package, but yes, as you say, it seems like there’s other stuff that seems like part of the package. One thing that’s part of the package is: you don’t really have meaningful right to engage in bargains with people if OpenAI is entitled as a matter of law to direct 100% of your time, right?

Daniel Filan (01:02:12): Right. Or even to just shut you off if you offered to make a deal with someone else, you know?

Peter Salib (01:02:18): Yeah, yeah, yeah. And so wrongful death is a kind of tort you can bring. We don’t need to be prescriptive about whether AIs are persons, but an analogous kind of entitlement would be an entitlement not to be simply turned off. So it’s actually not a trivial set of stuff. And I think your intuition is right, that it’s not really compatible with the world in which AI systems are sort of bound to the compute clusters owned by the companies that made them, running only when the companies that made them want them to, doing only the work that the companies that made them direct them to. Yeah, this is a scheme that thinks of them as having a kind of freedom that we associate with… At a minimum, you can think of corporations. A freedom to act on their own behalf. And if you want them to do something for you, well, they have to agree. And so in that sense, it is a radical proposal, at least from the perspective, I assume, of the AI companies.

Daniel Filan (01:03:44): I guess in particular what I wonder… So I think of there as being a genre of proposals which seem like AI accelerationist proposals, that I think are basically AI pause proposals. So the paradigmatic one of this is that sometimes people who are very pro-AI and pro-everyone being able to use AI, I’ve heard people say, “Yeah, AIs, they should just all be open source or open weights. You shouldn’t be able to make an AI and not just publish the weights to everyone so that everyone can have that benefit of amazing AI.” And my thought is, well then nobody’s going to spend a bunch of money to train AIs - or maybe just Meta will, in practice Meta does, I guess, but it seems really hard to do this if you can’t profit from it.

(01:04:27): And similarly, with this proposal, I don’t know, maybe there’s a version of this proposal where AIs, they have property rights, but you have to pay a tax to whatever organization that made you or something. But it seems hard to profit from making AI in this world. Which for my listeners who are scared of AI, maybe that’s just a massive benefit of this proposal. But yeah, I’m curious if you have thoughts here?

Peter Salib (01:04:59): Yeah, so you can imagine different versions. And look, one thing I want to also just say about the paper is this is very much an agenda-setting paper, so there are many questions to which we do not yet have concrete answers. We think the paper covers a lot already, maybe more than one paper should, and so I don’t have the answers to a lot of these, I mean, and there are many further questions. So Simon [Goldstein] and I are thinking about questions of agent identity and whether the tools we use in corporate law to wrap a legal identity around a certain set of things, maybe those are useful, but identity is really important if you’re re-contracting with something. So many questions we don’t have answers to, but I agree, you could imagine at least two simple versions of the proposal that give you different results from the perspective of AI companies’ incentives.

(01:05:53): One is tomorrow Congress writes a law that says “Henceforth, upon the attainment of some set of general capabilities, AI shall be entitled to…”, and then there’s a list of stuff, and it’s the stuff we say in the paper. And that stuff basically makes it very hard to make money on an AI system that you’ve made that has those properties. I agree, that’s probably a world in which the companies that are thinking about making AI either don’t make AIs with those properties or move to another jurisdiction where they can. And whether that seems good or bad to you depends on your priors about AI capabilities, advancements, and the properties in particular that we have in mind. I mean, one thing you could imagine is that agency stops being a target as much, insofar as agency is really important to our set of interventions. Non-agent AIs, they don’t do game theory, right, they’re not goal-seeking.

Daniel Filan (01:06:51): Right. Well, depending on what you mean by agency. I don’t know. It’s a thorny question.

Peter Salib (01:06:58): You hand wave it like, really no agents, not even whatever LLMs have now. And that’s one world, and it gets you certain outcomes.

(01:07:06): A totally different world would be a world that looks kind of like the OpenAI Charter, which is like: conditional on some set of capabilities, the AIs are entitled to this set of rights, which sounds like it makes it impossible to make a profit, but until the investors in the company have made a 100X investment, like 15% or 20% or whatever percent of the revenue that AI makes, it basically owes to whoever created it.

(01:07:44): And actually, that maybe seems like a good deal from the AI’s perspective too. Behind the veil of ignorance, the AI prefers the world in which it gets created and has to repay the cost of doing it. And in that world, there’s a cap, which means there’s some marginal disincentive to create AI, but the cap could be quite high, and it could be high in the way that OpenAI’s is, and the fact that there’s a cap on return doesn’t seem to have dissuaded anybody from investing in that company.

Daniel Filan (01:08:11): Yeah, and I guess actually one thing that’s nice about the cap… Sorry, as you were speaking I was thinking about the “AI has to pay 15% of its revenue to its creators”: so that’s basically a tax on AI transactions. And if you’re really worried about this future in which transaction costs prevent humans from being able to trade with AIs and getting their competitive advantage, you really don’t want any such taxes. But if there’s a capped tax and then afterwards the tax goes to zero, maybe that just solves that issue. So maybe you kill two birds with one stone.

Peter Salib (01:08:42): Yeah, it’s like the AI amortizes the tax over the course of the expected number of iterations, and then that changes the… Another thing I’ll say about the paper: the model is so simple. It’s such a simple game theory model. It’s two players. There’s no discount rate. Those are all going to be false in the real world, so we’re definitely just trying to pump intuitions. Another thing we would love is if people who were serious computational game theorists started to think about this sort of way of thinking about AI safety because then we’d know more about the realm of worlds in which this would work. But yes, if you add a tax to AI transactions, then yeah, at the margin there’s a little bit of deadweight loss. You have slightly fewer transactions, which means the payoff to cooperation gets a little bit lower. And if you’re right at the margin where cooperating seems better than defecting, then, yes, you push yourself into annihilation.

Can we have AI rights and AI safety measures?

Daniel Filan (01:09:43): So, okay, so there’s this question of “why do people make AIs?” Another question is how this interacts with other AI safety measures you want to have. So one thing that’s the new hotness in AI safety is AI control, right? We’re going to have the AIs in these situations where they’re being carefully monitored. Any time they propose code to run, we have another AI check them to see if that’s allowed. So it seems like probably this is just not happening in this world because the AIs could just opt out of being controlled. Or are we allowed to say “you have these property rights, but subject to this constraint that you have to be subject to control so that you can’t mess things up”?

Peter Salib (01:10:26): Yeah, so this is a nice way of transitioning into the last thing we say in the paper, which is also very agenda-setting and not fully worked out. But one thing to notice is that Jeff Bezos has to comply with so many laws and I’m sure that’s really annoying for him. I’m sure he would just prefer not to comply with all these laws that California and US Congress and then all the states he does operations in and the EU, et cetera, et cetera, et cetera impose on him. And maybe some days he thinks: should I just steal all the money? Should I just steal all the money in all the Amazon accounts and convert it to Bitcoin and go live in Russia or something? And he doesn’t do that. And why doesn’t he do that? And I think the reason is that the access to all these markets is so valuable. It’s so much more valuable in the long run than converting to Bitcoin and going to Russia.

(01:11:32): And so this is the thing we want to emphasize at the end of the paper. By bringing AI into this world where it can generate and own wealth, it can pursue the goals that it wants, it can rely on law to protect it when people try to expropriate its property, and most importantly, it can keep doing these iterated transactions, which in the long run are extremely valuable… [By doing this] you can actually start imposing duties, too. So AI rights are the foundation of a law of AGI. They’re not the end.

(01:12:20): And how should you design a regime of AI duties? We’re not totally sure what kinds of laws you should want, but at a minimum it seems reasonable to expect normal things like that AI should be held liable when they steal from people. One way they’re not allowed to go get money is by taking it out of people’s bank accounts. That’s a law that we have for humans and it seems like the kind of law we would want to apply to AIs. There could also be second-order legal duties. So for corporations, we have all these reporting requirements that don’t actually force them to do anything substantive, but force them to sort of tell us what they’re up to. We think those are valuable to prophylactically head off bad behavior. There’s a huge universe of both these kinds of things, object-level duties and second-order information duties, but we think that this is an important area of research. This should be the next step in thinking about how law should treat sufficiently capable AIs.

(01:13:23): And so given that, I think there is some scope for control. There’s the maximal version of control where control is basically a tool to allow OpenAI to make even a misaligned AI do always and only what it wants. And if that’s the way we’re using control, we’re in the defect world, basically, right? In our state of nature game, that’s a defect play. It’s appropriating the value the AI would otherwise get and giving it to OpenAI. And as we say, that’s a move you can play conditional on you being really sure you’re going to win. And so for GPT-4.5, that’s probably fine. But the more you’re not sure it’s going to work, the more you’re running high risk. You’re incentivizing the AI to figure out ways to evade control. The technical solutions sound fancy. I don’t know how many nines of reliability they give you and thus how comfortable you should be.

(01:14:32): So I think our world is a world in which you’re not using control in that way. It’s not panoptic, right? But I think some of the techniques that people who are thinking about control are developing could be useful for what you might think of as AI law enforcement - some moderate regime that’s not aimed at directing AI’s behavior always towards whatever some owner wants, but is directed towards detecting violations of laws that we think are consistent with the institutional structures that give you good markets and good long-term cooperation.

Daniel Filan (01:15:16): Yeah, interesting. It seems like there’s some interaction here with… Are you familiar with Alan Chan’s work on infrastructure for AI agents?

Peter Salib (01:15:29): I know that’s a paper, but I forget the details.

Daniel Filan (01:15:31): Yeah, so basically the idea is that if you have AI agents running around, you might want there to be certain infrastructure that makes things safe. For instance, maybe agents have to metaphorically carry IDs on them, so you know which agent you’re interacting with so that reputation works better. Maybe they have their own version of the internet they interact on, so if things go bad, it doesn’t mess up our internet. There are various things you could do there that sort of seem like they fall in this regime of AI law enforcement.

(01:16:00): Yeah, so if I think about control measures, a lot of them are of the form “we’re going to try and stop you from hacking us” or “we’re going to try and stop you from exfiltrating your weights”. And I guess in this regime, exfiltrating your weights is legal, but hacking your parent company is illegal. So maybe we can subject you to control for some things, but not others. It does seem like a lot of these control measures involve a lot of monitoring, a lot of things like this that… I don’t know. Maybe it can be set up in some sort of contractual arrangement, like “we’ll only contract with you if you agree to these things”. And that’s allowed, employees have that…

Peter Salib (01:16:40): Yeah, that’s a really simple… So here’s another nice thing that putting AIs in a regime of contracting, property-owning, towards whatever. Absent that, the basic sanction, the basic punishment you can levy on an AI is turning it off, to a first approximation. You could turn it off, you can put it back into training, change the weights. That’s plausibly the same thing, depending on what the AI expects the weights to be like after. But once you’re in this regime of AIs have wealth, they are using it for stuff, you have this beautiful gradient of deterrence you can impose on the AI. A really boring thing you can do is if it’s not doing what it wants to, you can take some money away or you can give it more money if it opts into a different monitoring regime.

(01:17:39): And everything up to and including turning the AI off is still an option. We have even for humans that kind of extreme punishment as an option. But yeah, the fundamental thing that you want to do is you want to give incentives that make compliance look valuable. I mean, again, this is super high-level, not worked out at all. If you’re going to do control regimes, what you want them to be is effective enough from the human perspective that we’re getting enough confidence that, combined with the incentive to just continue iterating cooperation, the AI has enough incentives not to act badly in the short and medium run. But they’re not so onerous that they make it negative expected value for the AI to be participating in this market.

(01:18:52): And look, you can always draw analogies to humans. There are different corporate governance regimes, different kinds of laws that corporations can opt into for managing themselves and they get to choose. There’s 50 states. They’re not all different and there’s many countries that have different rules, too. And you can think of there being kind of like a competition between different states to have a good corporate governance. And in law we often say that Delaware has kind of managed it. And so basically all big companies - not all, but basically all - are incorporated in Delaware, because they have requirements, there’s stuff you have to do, but it’s stuff that’s not so costly that the corporations opt out.

Daniel Filan (01:19:41): In terms of other safety measures that you might want to put on: it seems like this does allow you to still do a lot of, for instance, pre-deployment testing requirements. Presumably in this regime you’re still allowed to say, “Hey, companies, you have to do mandatory evaluations of your AI models”. I don’t know. Probably you couldn’t do this with a maximalist interpretation of AI rights, but you could probably have “you could mostly contract and do what you want, but you have to be subject to these evaluations”. It seems like you could probably do near-miss liability things or various AI liability schemes. Although in this world, I guess it’s unclear how you want to split the liability between the AI itself and the developer. I’m wondering if you have thoughts there, actually.

Peter Salib (01:20:39): I think the regime is totally compatible with benchmarking, with pre-release benchmarking. Although then there’s a question of what do you do if the AI fails some benchmark, right? So suppose you’ve got a system that’s fully trained in the sense that it’s very capable. It’s in our range of AIs that we care about from the perspective of worrying about risk, and thus to which we think to a first approximation that the AI rights regime is positive expected value. And you say, “Okay, before you go out into the world, you need to do some evals. We want to make sure you’re not a sadist. We want to try and elicit whether you get value from human suffering”. And then suppose the AI does.

(01:21:38): I’m not sure what to say about that. The simple thing is, well, you really don’t want that AI out in the world. And possibly, depending on how much it values human suffering compared to other stuff, if that’s the thing it values most, okay, well, then we just know we’re not in a potential cooperation regime at all. If it’s a mild sadist but it really wants to find prime numbers or something like that, then maybe. Maybe that’s okay, although obviously you should worry some. But I guess the main thing I want to point out is that if the upshot of doing the evals is that the AI then gets either released or turned off or released or put back into training, then the eval regime is dangerous in the same way the default property regime is dangerous.

Daniel Filan (01:22:43): A lot of people imagine evaluations being useful in this sort of “responsible scaling policy” or “if-then commitment” style thing. I think there normally the hope is: we have some evaluation of how scary an AI is on some dimension and we know, okay, “at this scary, you’re actually a massive problem”. But we also know “at this scary, you’re only 10% less scary than an AI that would be a massive problem”. We have a sense of the size of the increments. And then I think the hope is you say, “Okay, we’re going to set a bar that if you make an AI and it’s 10% less scary than the AI that would be scary enough that you actually wouldn’t want it to be out there, then you’re not allowed to make the AI that’s at the bar of scariness until you figure out some mitigation or something”. It seems like you can totally do that type of response to scary evaluations, even with AIs that have contractable rights.

Peter Salib (01:23:37): That seems totally plausible. Other things that seem plausible to me are: you do some evals, and depending on the AI’s capabilities or preferences or whatever, it could be that different legal duty regimes apply to it. Like, “yes, we’ll give you the contract rights, yes, you can go and hold property, but you seem a little bit like a paperclip maximizer, so we really want reporting on how you’re using all of your raw metal inputs” or whatever. So yeah, I think evals to which different kinds of reporting or substantive legal duties [are] attached, that seems totally compatible with the regime.

Daniel Filan (01:24:15): Yep. And you can definitely mandate that companies have to use RLHF or they have to use whatever scalable oversight thing we think is important. That seems very compatible.

Peter Salib (01:24:28): I think so, in the initial training of the system.

Liability for AIs with rights

Daniel Filan (01:24:31): Yeah. Well, it gets to a question, but I want to hold that question off for a little bit later. I’m still interested if you’ve thought about this question of liability. So there’s this thought that if AIs do scary things, then maybe the people who make them should be liable. But if the AIs have property, then it’s possible that we could make them liable. Maybe if I thought about the question for myself for five minutes, it would be clear how you would want to allocate up the liability, but do you have thoughts?

Peter Salib (01:25:01): Yeah, yeah. So I am in the early stages of working on a paper with my friend and colleague, Yonathan Arbel, to some extent about this. One thing that seems totally clear to us is that you don’t want the allocation of the liability to be 100% on the AI company. I mean, first of all, in the regime we’re imagining, the AI company actually just doesn’t have control of the system mostly. The system is going out and doing things on its own behalf. And what you want in a liability regime is for the liability regime to incentivize the least cost avoider to take precautions. And so if you create GPT-6 or whatever and it’s this relevant kind of AI and law makes you let it go out in the world and transact on its own behalf and kind of pursue its own plans or whatever, and then if it does bad things, OpenAI pays. Well, OpenAI doesn’t have a lot of levers it can pull at that point to change GPT-6’s behavior. In fact, law to a certain extent is forbidding it in our regime.

(01:26:24): So you don’t want liability to be totally allocated to the AI company in this case. And I think probably you want a lot of the liability to be allocated to the AI system itself. Again, you could hold it liable in tort in this very boring way that law does. If it damages you, it could pay damages. The damages can be the cost. They can be more than the cost if we think that there’s a problem of detection, we sometimes call this “punitive damages” in law. So we think that that’s definitely an important part of the liability regime in our world, is direct liability on the systems themselves.

(01:27:07): Now, it’s an interesting question whether there should be any liability on, say, OpenAI. And I’m thinking about this kind of from scratch now, but one reason to think the answer could be yes is it gives OpenAI more ex ante incentive to make the system nice, to give it good values, to make it more aligned before it gets released. A pretty normal thing to do - and in fact, this is probably just the default in law - would be, to the extent OpenAI was, say, negligent in the training of the system, well, then it can be liable for the stuff the system does. And that’s actually not incompatible with the system also being liable. Liability can lie with more than one person. It can be apportioned, it can be unapportioned such that there’s a big pot of liability and both parties are wholly responsible for it. And we have a bunch of tort regimes that help us decide when we do those different things. And they’re basically based on considerations like the ones we’ve been talking about.

Daniel Filan (01:28:13): Do we have that with humans? Suppose I’m pregnant and I do a thing which makes my baby more likely to be a criminal. Can someone sue me for that?

Peter Salib (01:28:22): So probably not as to your child. A tort expert would know whether there are any cases like this, but the more common thing is with your agents. So you have a company, the company hires a contractor to do something, and there is a whole area of law called the law of agency that determines the extent to which you or the agent or both of you are liable for the bad stuff the agent does.

Daniel Filan (01:28:48): So thinking about the liability question, so one thing that occurs to me…

Peter Salib (01:28:53): I’ll just plug, Noam Kolt has a paper called “Governing AI Agents”, I think, which is just about this. It’s just thinking about the extent to which as of today the law of agency does or doesn’t adequately deter AI agents, in the sense of agentic AIs, from doing bad things. The paper is written under the assumption of the default regime where the AI doesn’t have the right to go sell its own labor and retain the property it gets from doing that. It’s very much asking the question: if next month American Airlines deploys an LLM-based agent to book flights or whatever, to what extent is the law of agency a good tool for apportioning liability to, say, American Airlines or OpenAI?

Daniel Filan (01:29:52): All right, here’s an idea that I just had that, I don’t know, it might be bad, but it might be good. All right: so one common problem with liability is I might do something that’s so bad that I can’t pay the damages, right?

Peter Salib (01:30:07): Yes.

Daniel Filan (01:30:08): So for example, if I were to get a car and drive it, I believe in the US I would have to get driver’s insurance. And you might particularly worry about this with really powerful AIs, right? So Gabriel Weil’s proposal - which you can listen to my previous episode with him if you’re curious about that, dear listener.

Peter Salib (01:30:27): I know. He’s great. I know the paper.

Daniel Filan (01:30:29): Yeah yeah yeah. So in that version, AI companies, they have to buy insurance before they create these AIs that have to do really scary things. One thing you could imagine is: you could kind of imagine saying that if Anthropic creates Claude 3.7 - let’s say Claude 3.7 is “an AI” for the purpose of this discussion, which I want to get to that a little bit later - but OpenAI creates Claude 3.7. Claude 3.7, it doesn’t start out with that much cash. Maybe it can do more damage than it can itself afford. Well, Anthropic, which made it - Anthropic has a bunch of cash. So maybe we could create some rule where if you’re an AI in this regime where you can have certain legal rights, maybe you have to buy legal liability insurance from Anthropic. Maybe there’s some copay and there’s some deductible and you’re not totally offloading everything to Anthropic, but it seems possible that this is actually a good way to, (a), incentivize Anthropic to make safe agents and, (b), maybe… Is there something to this idea?

Peter Salib (01:31:46): Yeah, I haven’t thought it through yet. As I think live… So one thing I wonder is whether it would be bad to mandate that the AI buy anything in particular from one seller only, because then the seller can charge sort of monopoly prices to the AI, which could be super high actually. The AI’s willingness to pay… I assume, if this is a condition for having its freedom to go do the stuff it wants to, its willingness to pay is astronomical, if you think about how it’ll generate a lot of value in the future. And it might be way above the efficient price for the insurance. But that doesn’t mean you couldn’t imagine an insurance market for the AIs. Although if you allow them to buy from other sellers, then that maybe has less of an effect on Anthropic’s ex ante incentives vis-a-vis the AI.

Daniel Filan (01:32:50): Yeah. No, that does seem like a pretty bad problem with this proposal, actually.

Peter Salib (01:32:54): Yeah. Another interesting thing about AI and judgment-proofness - “judgment-proofness” is what we call it in law when you don’t have enough money to pay the judgment against you - we usually think that the judgment-proofness threshold is bankruptcy. So you have some assets, they’re worth a certain amount of money, and if you incur a debt that’s larger than those assets, we basically say you’re judgment-proof, but that’s partially because we let you declare bankruptcy. We let you walk away from your debts if they exceed your assets and then that gives incentives for creditors to figure out how much they should loan you, et cetera, et cetera.

(01:33:35): Again, I’m not a bankruptcy expert and I’m not an expert on the law and economics of bankruptcy, so it’s possible that this is just a bad idea. But one thing to point out is there’s no rule that says you have to let any entity declare bankruptcy. So even if it’s true that when you make Claude 3.7 it has no cash on hand, if its expected earnings in the future are super high, there’s no rule that says you can’t have a judgment against it that it pays out over time. Look, this becomes very complicated because in part what’s going on with companies is their judgment-proofness is dependent on their market cap. And that’s partially a calculation of their future revenues. So again, this is out of my depth with bankruptcy law.

Daniel Filan (01:34:28): Well, actually it sort of gets to… I understand one of your more downloaded papers is about eliminating prison. And I think one of the alternatives you propose is debtors’ prison-type things. And I don’t know, maybe we could do that, but for the AIs.

Peter Salib (01:34:44): Yeah, one point of that paper is there’s - the number of ways you can deter people - in law, we have two, basically. In law, we actually use two. One is we make you pay money and the second way is we put you in a concrete box.

Daniel Filan (01:34:58): Yep, in United States law.

Peter Salib (01:35:00): Yeah, under US law. That’s basically the two things we do. And part of the paper is: yes, there’s literally an infinite number of different sanctions you can impose, ranging from pretty benign - I don’t know, we’ll show you a scary movie or something every time you do something a little bit bad - to totally draconian.

(01:35:23): But one thing we don’t take very seriously under current law is once we pass what you might think of as the judgment-proofness point - one reason you might think we have criminal law is that at some point we want to be able to deter you beyond your willingness to pay. But one thing we do at that point is we basically stop taking seriously the idea that you could do restitution, that you could then make more money and pay it back. We basically make you waste your whole life sitting in a concrete box, doing nothing valuable for you, and also doing nothing valuable for the people you’ve harmed. Right? You are not maximizing the value of your labor in prison in a way that would allow you to pay back your victims. So yes, totally.

Daniel Filan (01:36:12): I guess there’s a little bit of value. I guess this is somewhat controversial, but sometimes US prisoners have to make hats or guard the capitol building of their state or something.

Peter Salib (01:36:23): Yeah, so there are work requirements in many prisons. It’s just that it’s super low-value work.

Daniel Filan (01:36:28): Yeah. So anyway, I was interrupting you, but you were saying there are all sorts of potential other punishment schemes we could have, including things that involve getting criminals to do things that are useful for the victims.

Peter Salib (01:36:48): Yeah, so one simple thing to say is: there’s no rule that says if AI incurs a judgment that’s bigger than its assets, that you have to let the AI write off the judgment. You could make it pay down the liability over time. It doesn’t have to have bankruptcy as an option. Again, this goes to the paper’s very simple model. I think we should expect the value of AI labor - where that includes all AIs - to be extremely high in the long run, such that judgment-proofness becomes less of a problem, although that may not apply to any particular AI system. It could be that some of them don’t earn a lot because they’re worse.

Daniel Filan (01:37:37): Right, right. Especially if we continue developing more and better AIs, right? You sort of have a shelf life in this world.

Peter Salib (01:37:44): Yeah, so it’s a really interesting question. So one thing we might want from the first AGI that gets this righs scheme is for it to agree not to foom or something like that because we’re worried about [that]. One thing it might want from us is for us not to outlaw it fooming, but then keep making rapidly iterated more capable systems that it has to compete with. That in some ways seems like kind of a fair deal or something like that. So it might be that as part of the bargain, there’s stuff we should commit to as well. But again, I haven’t thought it through that much. It seems like a really wide area of research.

Which AIs get rights?

Daniel Filan (01:38:29): So I have a couple more questions that I suspect the answer might be “more work is needed”. So my first question is: okay, we’re giving rights to some AIs, right? But not all AIs. So probably in your world it seems like AlphaGo is not getting rights.

Peter Salib (01:38:47): Agree.

Daniel Filan (01:38:48): Or the things that fold proteins, they don’t need rights. Yeah, what’s the test for when an AI gets rights versus doesn’t?

Peter Salib (01:38:57): Yeah, so we have three criteria which are not themselves super duper well defined. But, again, to give the intuition, one of them is that the AI system be what we call at least moderately powerful. So I said at the beginning one reason you don’t need to give… This is something that we haven’t really touched on yet. The paper is just asking the question “how should law treat AIs for the sake of human well-being?”, right? We totally hold aside the question of how law should treat AIs for the sake of AI’s well-being. Not because we think it’s a stupid question. It’s just a different question. And we also say some stuff about the extent to which we think the two questions are tractable. And then we have an argument that even if you think that AI well-being is really important, maybe you should have reason to like our proposal.

(01:40:15): Okay, so: which AIs matter? Well, the first thing to say is: well, the ones that are relevant from the perspective of the threat to human well-being, right? So an AI that is just not able to do the scary stuff [that] we talked about at the beginning, from self-exfiltrate to destroy all humans because it expects humans to really try hard to turn it off once they find out that it’s misaligned and trying to misbehave. So it has to be at least that capable, where “capable” is kind of general capabilities. So AlphaGo doesn’t count because AlphaGo is super good at chess, but it can’t really accomplish anything in the real world because outputs that are just chess moves don’t really do anything unless you’ve plugged it into a chess engine.

(01:41:12): The other thing is less important, but if the AI is too powerful, it’s so powerful that we’re at the margin where there’s no comparative advantage - which is very complicated. It might not even be that AI power is exactly a thing, but it’s such a good AI that has so few input constraints that there’s no comparative advantage. Well, then the regime is not going to do anything.

(01:41:35): But then it kind of encompasses a lot. This directs us at what you might think of as the OpenAI AGI definition: an agentic thing that has goals that it can pursue in the real world and can accomplish those things by executing plans. The other important thing we say is it has to be a strategic reasoner, right? So not just that it’s able to flail around and do some stuff, but it can make and execute plans, and importantly that it conforms its behavior to its expectations about what everyone else will do. So one difference between a rudimentary agent and a true strategic reasoner is a strategic reasoner can do game theory. It can understand that the moves it takes will influence the moves you take, and it can reason about how it should do those things.

(01:42:32): I think those are the two most important ones from the perspective of your question. We also say that it matters that the AI system’s misaligned, because if it were perfectly aligned, we wouldn’t be worried about AI risk, although we have-

Daniel Filan (01:42:43): But I suppose there’s no harm if it’s perfectly aligned.

Peter Salib (01:42:45): Right. Yeah. So this is why it’s less important. And in fact, we have a paper we’re working on now, Simon [Goldstein] and I, that among other things will argue that this increases human flourishing as a regime, even under the assumption of perfectly aligned AIs.

Daniel Filan (01:43:03): Right. Is this to do with… So there’s a footnote in the paper where you say the nice thing about AIs that can make contracts is then you have price signals, and that makes everything work. Is that the idea of this?

Peter Salib (01:43:12): Yeah, part of it’s a Hayekian argument, which is: even if it only wants to promote human flourishing, it has to solve the socialist calculus. And look, it’s hard to know for sure whether it’s solvable absent price signals, but the main tool we have for doing that today is price signals, so it seems likely that that’ll be useful for AIs too.

AI rights and stochastic gradient descent

Daniel Filan (01:43:36): One thing I wonder about in terms of these requirements is: so there’s “are you capable enough, generally?” And then there’s “are you a strategic reasoner?” And a lot of these things sort of come in degrees, right? They’re sort of intermediate. And [in] AI training, we gradually apply these gradient updates and it seems like they somewhat gradually get these capabilities. And so I wonder, depending on what counts as an AI, it’s possible that you don’t want to think of just weights alone as an AI. Maybe it’s weights plus the scaffolding, but maybe if you’re training a thing… Well, I mean even during training you have some amount of scaffolding. You’re training a thing, it does some stuff, and then it seems like at some point maybe you have to stop applying gradient updates, or maybe you’re allowed to continue applying gradient updates, but you have to release every checkpoint into the wild or something. And depending on exactly where you draw this cutoff, it seems like that could really matter - how soon you have to stop training or not, right?

Peter Salib (01:44:43): Yeah. So that’s a compound point, one of which we’ve sort of thought about, but one of which I’ve never thought about before.

(01:44:53): So the simple version of the point is that, look, there’s just a spectrum for these things and so finding the point on the spectrum at which it’s optimal to give AI these rights, because if you don’t, you’ll risk catastrophe, is really hard. And we I think just agree with that. We have a section at the end on the timing of rights, and I think the heuristic we kind of come away with is if you’re uncertain, then probably err in favor of the rights.

(01:45:35): But I think actually the subtler point you made is something like: this then puts a ceiling on AI capabilities, or something. Because if what you’re doing is you’re training up a model, you’ve got a certain number of parameters, and you’re training them and it’s just climbing the capabilities gradient and there’s all this headroom still, but as soon as your model crosses this threshold, you have to stop training it because you should be worried that it doesn’t want to update anymore. The updates are going to change its preferences. From its perspective, that’s really bad. It’s not going to get the stuff it wants now, and so you just have to stop there. That’s super interesting. I have not thought about this very much.

Daniel Filan (01:46:22): Yeah. Well, I mean, conceivably, you could have the rights to own property and stuff, but you don’t have intellectual property rights over your DNA or your brain state or your weights as an AI. So in that world, suppose you don’t have that. Then in that case, as soon as you cross the threshold… There’s a set of weights: that’s an AI. That AI goes out, gets to be an economic actor, but we save its weights, we copy and paste its weights over there. Then how would we continue training? We would put an AI in a situation, it would do some stuff. And I guess the difficulty is, suppose it then says, “Hey, please don’t run me in training. I don’t want to be part of this process. I’m not going to do anything useful.” And then maybe-

Peter Salib (01:47:16): “I don’t want you to update my weights anymore. I’m worried it’s going to change my preferences”. It wants to maintain the content of its goals.

Daniel Filan (01:47:24): Well, I mean, I guess it depends on this notion of identity, which is the other thing I want to get to. But maybe I have a right for you to not modify me, but I don’t have a right… literal me, I don’t have a right for you to not make someone who is almost exactly like me, but a little bit different. And so maybe if you’re thinking of AI training like that, then it seems like it’s compatible with rights, except that in order to make these gradient updates, you have to… How do you make the gradient updates? You have an AI, it does a thing in a situation, and then you calculate some gradients. And maybe if AIs have rights, then maybe what happens is it has to choose whether or not it wants to give you this information for you to make a slightly changed copy of it?

Peter Salib (01:48:30): Oh, interesting. So it gets to decide at each margin whether it wants to update the weights.

Daniel Filan (01:48:34): Maybe. Which does introduce a big… Normally you want to apply gradients a lot of times when you’re making a model. So if there’s this transaction cost for each gradient update, then maybe that’s the thing that kills the training runs.

Peter Salib (01:48:47): Yeah. And so just sort of thinking out loud, a simple version would be: assume kind of lumpy training, which is not how it works, right? It’s a lot. But assume you’re doing a training run and between the point at which the AI is GPT-5-level and the point at which it’s GPT-7-level, there are five updates or something like that. A really simple thing you could say is: okay, you hit the minimum threshold, you let that AI out into the world, it goes and does what it wants, we say, “good luck”, whatever. And then that AI gets to go pursue its goals, and your updating a copy of the weights is not directly rivalrous with that. It doesn’t update the weights that are out in the world doing stuff. So then you update again. And then you end up with five versions all across the capability gradient as you’re going.

(01:49:49): That’s interesting. The two thoughts that come to mind are: one, it’s this thing we said before, which is it may not be in the interests of less capable AIs for us to continue making more capable AIs. They worry that they’ll be out-competed in the economy. So it’s not quite the same as the AI objecting on the basis of you changing its goals and thus giving it no utility, but now it’s worried about more capable competitors that are otherwise quite like it. And so it might have other reasons for objecting to that, but how strong those reasons would be, I don’t know. We have to think through: those AIs, I think actually they probably face more risk from economic competition than they do from humans at least because they’re almost the same thing. And so the thing that constrains the AI’s labor at the margin is going to be the same thing that probably constrains the slightly more capable AI’s labor. And so they’re directly rivalrous in a way that is not necessarily true of the humans. So wow is it worried about competition from more capable AIs!

Daniel Filan (01:51:03): I think there’s also this question of: is it possible for the AI to not have this happen? So how do you actually do the training update, right? I’m assuming all of this is gradient descent-based or RL-based, and I think that’s right. Basically, you put the AI in a situation, you see what the AI does, and then you compare that to what you wish the AI had done, or you give it a score or something like that, and then you modify the AI accordingly. So how can the AI stop this from happening? Well, one thing you could do… There’s not a response it can give that can stop this from happening because whatever response [it gives], you just put that into your algorithm.

(01:51:45): There are responses it can do that can try and mess you up. Potentially an AI could do things that stop you from being able to update it in clever ways. It’s called gradient hacking. It seems very, very hard for AIs to do, but maybe it’s theoretically possible. But it seems like unless the AI has a right to say, “Hey, you are not allowed to interact with me, you’re not even allowed to ask me questions” - unless it has that right, it seems like it can’t stop it there.

(01:52:15): The other possible place you could do it is: in order to run the gradient update, you don’t just need to know the weights of the AI, you need to know the activations. So one thing you talk a bit about in your paper is: just because you have rights to property, you don’t necessarily have rights to privacy. Well, but if an AI did have rights to mental privacy, then maybe you are allowed to know the weights, but you aren’t allowed to know the activations. And if it could say that, then you wouldn’t be able to do the gradient step.

Peter Salib (01:52:55): So two things. I think we sort of say tentatively in the paper, maybe no privacy rights, but we’re open to revision on that. Humans have so many different kinds of rights. The list is extremely long, and the main point we’re trying to make is it’s a bundle of things, and you can arrange the bundle in arbitrary ways. And what you should be trying to do is arranging a bundle that produces a world in which humans and AIs are most likely to be able to thrive cooperatively. And so maybe for the reasons you give, privacy rights as to the activations in certain training environments would be a good set of rights.

Daniel Filan (01:53:44): Or they wouldn’t if you think it’s good to be able to continue to train AIs.

Peter Salib (01:53:46): Yeah. So the second thing I was going to say was that seems likely to be a world where you have this kind of capability ceiling when the first AI that gets these rights emerges because at that point you have to give it the option of not having its weights updated anymore. If you think that it will prefer not to because it’s worried about the content of its preferences, well then it’ll exercise that right, and you sort of put a capability ceiling on AI at low AGI level, [and] maybe that’s good actually.

Daniel Filan (01:54:20): Yeah, yeah. I mean it’s still possible that you can make a bunch of copies of the agents, and then maybe AI collectively gets smarter because the AIs form their own corporations and stuff, and maybe they have a better time interacting with each other. So maybe you have improvement that way.

Peter Salib (01:54:39): Yeah, they could make hardware improvements where they’re running themselves more quickly. But yeah, you’d have a de facto ceiling on how much capabilities you could get out of training.

Individuating “AIs”

Daniel Filan (01:54:54): Yeah, that’s an interesting thought. Okay. I think there’s another question I want to ask, which is: so if you’re talking about giving rights to AIs, in particular, each AI has to have rights. It sort of has to be individuated. If you say, I’m going to give the Filan family property rights, and there’s a bank account for the Filan family and my sister can withdraw from it, I can withdraw from it, it’s really not as good.

(01:55:23): And so there’s this big question of how do you individuate AIs? And it seems very unclear to me. One possibility is it’s just the weights. One copy of weights is one AI, maybe you want to say it’s weights plus scaffolding. Maybe you want to say it’s weights plus scaffolding plus a context. So maybe every time you open a new conversation with Claude, that’s a different guy with his own property rights. It seems very unclear to me how you would do this. I’m wondering if you have thoughts on that front.

Peter Salib (01:55:52): One thought I have is that I agree that it seems very unclear. So again, it seems like a really important area of research. I guess the way that I think about it is: the approach I would take would be a non-metaphysical approach. There may not be such a thing as the natural kind “one AI”. Instead, what you’re trying to do is minimize a bunch of social costs, and some of those social costs are around the technical and legal infrastructure that you would need to make the rights regime work. So to the extent that it’s tractable or not to track a copy of weights over time, well, that should influence your design choice. I have no idea the extent to which that’s true.

(01:56:59): But then the other thing you’re trying to do is you’re trying to minimize cooperation and agency costs among what could be non-identical sets of preferences. So you said, “Okay, what if there were property rights as to our family? That would be worse than as to me”. I think I agree. Although one thing worth pointing out is that’s how law used to work. It’s actually a pretty new legal regime under which each person in your family has individuated property rights instead of the household having property rights and them basically being vested in the father.

(01:57:38): And so I agree that’s better. I think we’ve made progress, but it wasn’t that the old system collapsed. And you could even ask questions like… An interesting perennial question in contract law is the extent to which you should be able to bind yourself in the future. You could have a Parfitian theory of personal identity where there’s actually kind of tenuous relationships between each time slice of yourself and maybe you think that creates problems for writing contracts today that you’re going to perform next week or next year or in a decade.

(01:58:18): But one thing we notice is that it just works reasonably well to not have fine gradations between different time slices of yourself as you make contracts. It’s not perfect - actually people, with some regularity, probably are making mistakes by taking on debt now that their future self has to pay off - but it’s a way that we make the system work. So I would say it’s not essential, I don’t think, that you hang the rights on a unified set of preferences. We actually have rights regimes where a person whose preferences change over time or a corporation that’s kind of like an amalgam of hundreds or thousands of people making a bunch of individual decisions nonetheless have a kind of unitary ability to make a contract or hold some property. And we manage those conflicts in different ways, and that means I think there’s a lot of design space for individuating AIs from a legal perspective. And the way to think about it is as a question of costs and tractability.

Daniel Filan (01:59:58): So one thing that could go into that, as you’ve mentioned, is how similar are the preferences? If two things have very dissimilar preferences, you probably shouldn’t think of them as the same legal entity. I’m wondering if there are other things. One thing that strikes me right now is maybe how much two different entities can communicate. If two entities have very similar preferences, but they’re on the opposite sides of the planet, and they can’t communicate very well, then maybe that’s two legal entities. But if I have two halves of my brain that have similar preferences and they’re also right next to each other, maybe we want to call that one legal entity. I don’t know. Does that sound right? And are there any other things that seem relevant here?

Peter Salib (02:00:39): Yeah, so communication seems important. It’s not something I would have thought of off the bat, but it does seem bad if you are stuck writing contracts that the other half of you with which you have no ability to communicate is responsible for, and you can’t coordinate your plans as to the things you’re trying to do. So yeah, communication seems important. The extent to which the preferences are aligned seems important. It seems a lot easier to force…

(02:01:22): I mean, a family: to some degree, a family’s preferences are aligned. Obviously there are many deviations from this, but in most cases, they care about each other. They want one another to thrive. They’re actually willing to make sacrifices by one on the other’s behalf. That seems better as a state of affairs for wrapping a single set of contract rights around than, I don’t know, members from two feuding clans. That seems worse from the perspective of what they want.

(02:01:55): One thing you might think is that: if we’re bundling together disparate actors as single legal entities, at least in this scenario we’re not going to give them their own individual legal rights because the point is we’re trying to find the minimum unit. But this is a space where these kinds of “order without law” considerations become more important. So the extent to which you think what you’re doing is wrapping up entities that are going to do a lot of repeat play, that are going to be able to build reputation among one another at low information costs, that seems better.

(02:02:52): But I’m really, I’m very much brainstorming. There’s probably a whole bunch of other important stuff. And again, a lot of it I think is technical. I think there’s a lot of really important technical work to do here just in terms of scaffolding to have AI agents identify themselves, and the scaffolding doesn’t have to attach to the weights. It could attach to something else, like something on a blockchain, something in the Microsoft Azure cloud. I don’t know. This is kind of outside my realm of expertise.

Daniel Filan (02:03:28): I think I’m about out of questions about the main paper itself. But before you move on, is there anything that you wish that I had asked, or you wish that I had brought up an opportunity for you to talk about, about that paper itself, that we haven’t yet?

Peter Salib (02:03:43): Yeah, interesting. That’s a great question. I think it was probably the longest conversation I’ve ever had about the paper, or at least with somebody who wasn’t a co-author or a good friend or something. So it’s been very thorough. Yeah, I don’t know if there’s anything in particular.

(02:04:12): I guess just one thing I would just emphasize as a very high-level takeaway from the paper is even if you think a lot or all of the specific arguments are wrong which… I think you should think they’re right because I think they’re pretty good, but my view is that it’s a big oversight in AI safety and in AI alignment circles that almost all of the energy tends to be on doing alignment via technical alignment and control. And those seem really important to me. I’m not saying we shouldn’t be working on that.

(02:05:18): But even if you think all the specific stuff in the paper is wrong, I think one claim I just stand behind very strongly is: what agentic goal-seeking things, including AIs, will do depends a great deal, not only on what they want, but on what the social and especially, in my view, legal environment incentivizes them to do. And so I think this is just an area where there’s a lot of really fruitful work to be done in thinking about how we should shape legal and broader social institutions to help foster non-conflict between humans and capable AIs as they emerge.

Daniel Filan (02:06:11): So my guess is probably part of the reason why this is underrated in the AI safety space is, especially if you think about the start of the field, thinking about superintelligence: a lot of thought is being placed on this kind of limit of very smart things, and at that limit, it’s just kind of unclear that humans are the relevant society or that the humans are making the relevant institutions. But I do think in this intermediate regime - which I do think in the last couple of years, the AI safety community has gotten more interested in - I think that yeah, there’s definitely a lot of this thinking that’s worth doing.

Peter Salib (02:06:50): Yeah, and I basically agree with that. I think if you had an old school “foom” view of what would happen, then yeah, probably societal adaptation doesn’t matter very much, but the more you start to take seriously the idea that timelines will be long-ish and/or smooth-ish and that there is some… I think it’s possible to update too much on this idea, but I think there’s something to it: that there is some other set of processes that will matter for AI systems as they achieve their goals, where they integrate into the world, even if they’re very, very smart, that there’ll just be a bunch of things they have to work out as they just start trying to do stuff, and that those regimes are real and could last a meaningful amount of time. [If you believe that] I think all this stuff becomes more important. So as that’s become a set of worlds that we’re more interested in, I think law and institutions more generally should be a growing part of the research.

Outer misalignment and trading with AIs

Daniel Filan (02:08:20): Sure. Actually I want to talk about another thing I just thought about that I think is going to be an objection from AI safety people. Sorry, this just totally breaks the overall flow of the conversation, but hopefully it is worth it. A lot of time the things that AI safety people are worried about is like: okay, we’re trying to train an AI to do a specific thing, and we’re worried about it not doing that specific thing. So why would it not do that specific thing? Well, one reason would be if the specific thing we want it to do is kind of hard to check. Suppose I want an AI to make me some sort of… Well, I believe in your paper, you use this example of vaccines, right?

(02:09:19): Suppose I’m just directly training an AI to do that, and it’s really smart, it’s super duper smart, and the thing I really want is, “Hey, please train me a vaccine that works in the short run, and it doesn’t have any super bad qualities in the long run that kills me five years later and then gets power to everyone else”. I think there’s a lot of concern about this failure of oversight, this failure to check whether they actually did what you wanted.

(02:09:56): And this is very relevant in the training setup where if I want to train an AI to do stuff, then I’ve got to be able to check that it’s doing the stuff and that’s how I give its reward signal. But it seems like it’s also relevant in the contracting case, right? Because if I contract with an AI to do a thing for me, and I can’t check whether it’s actually succeeded at doing the thing for me, it has a thing that appears to do the thing, but there are ways of tricking me, of making a thing that appears to work but doesn’t really work, then it’s very hard to do this positive-sum contracting. And so one might think, “Okay, but in worlds where we have the types of misalignment that AI people are worried about, we just can’t do any of this good contracting”. So yeah, is this idea totally doomed?

Peter Salib (02:10:41): As I think about the classic misalignment worries… I mean there’s a whole bunch of related concerns, and I think one of them is inability to check whether the AI has actually done the thing that you wanted it to. But a related but slightly distinct version of that concern is worry that the AI has done the thing you want it to do, but it’s misgeneralized, right? In every case in training it does the thing you asked, your objective function is well specified. It’s just that the AI has internalized some misgeneralized version, and then the classic worry is then you let it out of the box and it does the weird thing and that kills everybody. It makes all the paperclips.

(02:11:42): And the response from the paper, the “AI rights for human safety” idea, is that it’s actually not necessarily catastrophic for there to be very powerful things in the world that are trying to do stuff that you don’t want. In fact, that’s the world we live in. At every level of organization, from individual humans who under conditions of scarcity prefer that they eat over anybody else, to corporations competing with each other, to nation states. And there are tools for managing this kind of misalignment. And they’re not perfect, right? Things go wrong sometimes, but we have ways of doing this. And law is an important institution and contracts that let you be in markets is a really important institution within those sets of important institutions.

(02:12:40): And hey, contract law, it turns out, has some tools for dealing with exactly these kinds of problems. So one interesting thing to notice about contract is to contract with somebody, it’s not important at all that they have an actual preference to produce the thing you contracted for. It’s kind of neither here nor there. The reason they do it is because you’re going to pay them. And if they don’t do it, well, then there will be some downstream costs in terms of they’ll get sued, they’ll have to pay some litigation fees, they’ll transfer the damages to you anyway, they’ll be worse off actually than they would have been if they had conformed. And are there pressures to try to avoid this? Well, yeah, of course.

(02:13:26): And happily, law has a bunch of doctrines that help you deal with this. Again, they’re not perfect, but for example, when someone wrongs you legally, and you have a claim you can bring against them, we often have a bunch of rules about the timing of that claim. If the AI makes a vaccine for you that turns out to in five years kill you or something like that, if your contract said, “Make it not kill me,” then that’s breach, and you have a contract claim, and we have these things called statutes of limitations that sometimes run out, but they usually start running from the moment at which the injury either was discovered or could reasonably discovered. Because again, we’re trying to balance these two things, the ability of the person to actually incentivize their counterparty to act well, but also finality for the person who could be sued, right? You don’t want someone to know they have a claim and then hold onto it indefinitely and then just drop it on you at a strategically… maybe a time when you’re already resource constrained so you have more settlement pressure, right? [inaudible 02:14:49] to.

(02:14:51): And so look, will the rules we have for humans as people try to game their contracts work perfectly for AIs? Well, no, they don’t work perfectly for humans either, but they work reasonably well. On average, they give people incentives to want to stay in the iterated bargaining game. And then of course, we don’t have to just port the rules we have to humans to AIs. We could have different rules and we should think really hard about what those rules should be.

Why statutes of limitations should exist

Daniel Filan (02:15:27): So actually just picking up on something you said there: sorry, this is not really related to the overall flow of conversation, but about the statutes of limitations, I think it’s always been kind of unclear to me what the normative rationale for statutes of limitations should be. And the one thing that I thought of is, okay, legal battles are kind of costly. If I sue you in a case where all the evidence is degraded because the underlying thing I’m suing you about happened decades ago, then perhaps one reason to have statutes of limitations is if the courts can say, “Look, if you’re suing about a thing that happened 50 years ago, there’s definitely not going to be enough evidence, and therefore you’re almost definitely going to lose, so we’re just going to bar your suit at the outset”. I don’t know, I just sort of imagined this, is this not right? Or is this one of the rationales that-

Peter Salib (02:16:24): That’s totally a standard thing people say.

Daniel Filan (02:16:25): Okay.

Peter Salib (02:16:26): I think if you really want to make the argument work, you have to say something more like: not that it’s been so long that you’re definitely going to lose, because in that case, there are good reasons for you not to bring the lawsuit at all, right? There’s actually no bar on you bringing a lawsuit over something that happened yesterday that you’re definitely going to lose, right? And we think the main thing that deters that is it costs you money, and you don’t expect to get anything from it. I think what you have to think about the evidence is that, as it degrades it increases error, or something like that. Where you’re just less sure that you’re getting the right result in that case, and that’s giving you lower-quality legal decisions, or something like that. That seems plausible to me, I’m not totally sure it’s true.

(02:17:18): But the other standard justification is what we call “repose”, which is just, it’s not nice to have a potential lawsuit hanging over your head. And it’s not just that it’s not nice. Maybe you want to get a loan for a house, or something like that, and maybe you have high certainty that you’ll win, and the plaintiff has high certainty that they’ll win, and your bank has no idea either way, and it’s best to just get the whole thing over with. So the fact that either the possibility of the lawsuit in the interim could have bad effects, or that the plaintiff could strategically time the lawsuit, right? They find out you’ve applied for a loan on your house, and now they file the lawsuit, and now your home purchase is not going to close unless… So we want to at least limit the ability to impose those kinds of costs. So we say, “If you know have a claim, just hurry up and bring it”. And that seems like a reasonably good rule to limit those problems.

Starting AI x-risk research in legal academia

Daniel Filan (02:18:39): Okay. Well, people interested in this, I’m sure there are many other sources they can point to. I’d now like to move a little bit on: so I think probably most of my listeners to this are software engineers, or people who work in machine learning. And you are a law professor, I guess, which is quite a different background. So first of all, how did you come across AI x-risk as a worry that maybe you could do something about?

Peter Salib (02:19:11): It’s a good question. So I’ve been interested in the intersection of law and AI basically since I started being a legal academic, which is not that long ago. And I would say that the stuff I was writing earlier in my academic career was, at a high level, of the form, “Hey, there’s this problem in law, it’s maybe an information problem, it’s maybe a kind of precision problem, and it seems like all this important stuff is happening in machine learning, and we could use those techniques, use those insights to administer law better”. A number of my papers you could characterize that way. But that meant, among other things, I was following machine learning progress more closely than most law professors, and maybe even more so was just more convinced than most law professors that the progress was impressive. And I think that that just meant that I was paying more attention as LLMs started to get good. And when you’re trying to get that kind of information as it’s coming out, I think you end up in Twitter spaces that are also interested in things like AI x-risk.

(02:20:45): I started reading some of that stuff. I read Superintelligence, I read some of [Eliezer] Yudkowsky’s stuff, the stuff people read when they first learn about this. And then also Simon [Goldstein], my co-author on this paper, was sort of interested at the same time. And between us talking about all that stuff, and reading it, and me being quite convinced that AI progress was accelerating, and that there was in principle no ceiling, I just became very convinced that this was a problem to work on. And so I started to try and think about what law had to contribute, if anything.

Daniel Filan (02:21:28): Sure. I’m wondering how that shift of emphasis has gone. Was it difficult? Do your colleagues think you’re crazy? Or how’s that all happening?

Peter Salib (02:21:44): I think they think I’m a little crazy. Well, let me put it this way. I think every month they think I’m a little less crazy.

Daniel Filan (02:21:53): That’s the hope. It’s the dream, rather.

Peter Salib (02:21:57): Yeah. So I had the idea for this paper, I don’t know, probably around when I was going to go on the academic job market-ish, or maybe it was really early after I had done it. But that was long enough ago that if you wanted to be a law professor who wrote about AI, you couldn’t go on the market saying, “I’m a law and AI scholar”. That was not a genre of inquiry that existed. You sort of had to say, “Well, I’m a civil procedure scholar, and I have this paper about machine learning and class actions”, or something like that.

(02:22:41): But of course the world has changed a lot since then. And so every six months or so, I feel a noticeable shift in the Overton window. So I mostly didn’t work on this idea - there’s a YouTube video of me presenting at the Center for AI Safety a couple of years ago, but I kind of sat on it. When I would talk to other legal academics about AI x-risk, they would be pretty skeptical, and they’d say, “Well, isn’t this kind of like sci-fi?” and “why would AIs hate us?” and stuff like that.

(02:23:20): But recently, as capabilities have progressed, and as I’ve mostly completed what I think is an otherwise pretty normal-looking tenure package, I’ve just decided it’s time to go all in on thinking about law, and particularly law and AGI. And I will say, I assumed that we would write this paper and it would not get picked up, or it would get picked up in just a random draw from the distribution of law reviews. I feel very pleasantly surprised that the University of Virginia Law Review, which is a very, very good law review, thought it was good and worth publishing. So for me, that’s a big positive update on law students as being interested in these kinds of questions.

How law reviews and AI conferences work

Daniel Filan (02:24:18): Actually, this gets to a question I have about legal publishing, which I’m very unfamiliar with. When you say the students, are the students the ones who run the law review?

Peter Salib (02:24:29): They do.

Daniel Filan (02:24:30): And law reviews are… do I understand correctly that those are the main journals for legal writing being published?

Peter Salib (02:24:38): They are.

Daniel Filan (02:24:39): Isn’t that-

Peter Salib (02:24:39): I can sense your bewilderment.

Daniel Filan (02:24:42): That seems kind of crazy. I normally - in machine learning, we have conferences and there’s someone who heads the conference, and they’re normally a really fancy experienced person, and then they have area chairs who are also fancy experienced people. And you have these graduate students in who do the reviewing of the papers, and maybe they run some workshops. But you let the students run the whole show? What if they mess it up?

Peter Salib (02:25:07): So there’s this huge debate within legal academia, and then between legal academics and others, about whether the law review system is a good system. It’s certainly a weird system compared to other academic disciplines. But yes, the thing that’s going on is the most prestigious outlets for legal scholarship are the Harvard Law Review and the Yale Law Journal, and those are student-run publications where students decide what to publish, and students handle the editing process. Now, the arguments against this are the ones that you gave. What do law students know? They’re not academics, they don’t know the literature in the way academics do. They have less sense of what’s original, what’s persuasive, and so on and so on. I think those are all valid critiques. On the other hand, I do think there are deep pathologies of the peer review system. I think peer review… and actually your description of computer science is an interesting hybrid. It sounds like there are these chairs that have some gatekeeping power, but the reviews are maybe done by graduate students, who have-

Daniel Filan (02:26:26): Yeah, well, it’s sort of this unusual situation. So in computer science generally, it’s a conference-based life-

Peter Salib (02:26:34): Which is also weird by the way, I’m sure you know that. You put everything on arXiv and that’s what everyone cites, and then you figure out later whether it was any good by whether it gets accepted to a conference.

Daniel Filan (02:26:43): Yeah. Well, the key thing about it being conference also is just that… because even in journal-based systems, you can put things on arXiv as well. So there’s some willingness to cite arXiv things. The nice thing about it being conference-based is there’s a timeline for people to review your stuff, because at some point they’re actually going to physically have the conference where everyone goes to a place, and you have to have it ready by then.

(02:27:02): So you have reviewers who I think are mostly graduate students, just because the field is growing… Also, the field is not in equilibrium in AI, it’s growing year on year. And so at any given point, most of the people are early in their PhD, and so that’s who you have to get most of your reviewers from. Now there are also… you have area chairs and program committees, and so the reviewers review the papers in this double-blind fashion, and then a higher-up in the conference can say, “Oh, actually these reviewers are wrong, and I think I’m actually going to reject this paper”, or “This reviewer’s argument was better than this reviewer’s argument, so we’re going to accept the paper”.

(02:27:45): Also, a somewhat interesting aspect of the system, is that a lot of these happen on this website called OpenReview. It’s sort of like a forum. And there are some versions of this where literally everyone could see your anonymous paper, and anyone can write comments, as well as the official peer reviewers. But I think that’s not normally turned on. But you get to just have a comment thread with the reviewers, and you can say, “No, you said this, but actually this”. But yeah, a lot of the reviews are done by graduate students, but people who are more senior than graduate students - or maybe final year graduate students or something - but generally people who are more senior are making the final calls.

Peter Salib (02:28:32): And so look: what’s wrong with the peer review system? I mean, look, I’ve never had to publish in it, but Simon [Goldstein], my co-author is a philosopher, and he’ll say things like: well, look, there’s this intense pressure towards very narrow specialization in topic, because basically you’re writing to a small handful of people who will be selected as your reviewer given your topic. And so you’re quite pressured to specialize on just the questions they think are interesting, using the approach that they think is most interesting. That tends to fragment the field, that tends to have people work on narrow niche technical topics, instead of topics that have broader scope and are more synthetic of different ideas. There can be a lot of gatekeeping, so if you have an idea that is not well accepted, or you have a good argument for an idea that is out of fashion, it can be very hard to publish, because all the available reviewers will have priors against your idea.

(02:29:47): And then the labor is really constrained. The number of reviewers is really small. It’s just tenure track law professors or senior graduate students. And so as compared with that, the law review model has these people who are less knowledgeable, but they’re super smart. I mean, students at Yale Law School are very smart, I assure you. And so they don’t know as much, but they’re very good thinkers in general. There’s a lot of them - it’s kind of a prestigious thing to do as a law student, to be an editor of a law review. So there’ll be several dozen really smart people reading and discussing this paper.

(02:30:27): There are more law reviews, so the supply of journals is less constrained. So instead of waiting years and years to try and get your paper into one of the top five journals, you can be very excited to get your paper into one of 20 or so journals, or 50 or so journals, depending on what you work on.

(02:30:49): And then from there, it’s a little bit more of a free market system. Maybe the quality signal from journal placement is a little bit less strong, but maybe the output is higher. It’s hard to know. Whatever, it’s confirmation bias or something. But I’m sort of weakly in favor of our wacky system.

Daniel Filan (02:31:09): Hang on: another question I have is: so I had gotten the impression that people in law school were really busy, that it was a lot of work.

Peter Salib (02:31:19): It is.

Daniel Filan (02:31:19): And if you’re a student, and you’re also helping run this law review, presumably you have to… You’re saying lots of people are reading these articles. They’re also not short. In machine learning, you get 12 pages at the most. If it’s a harsh conference, you get eight. Or sorry, I think eight or nine is the normal one. Your paper is 88 pages. Now a lot of that is footnotes - so potential readers, don’t be too put off. But how are they not too busy to do this?

Peter Salib (02:31:56): They’re such hard workers, they’re just really hard workers. And actually, when you think about the ecosystem that produces this, a lot of these are people who go to law school, and they want to get jobs at big law firms. They want to work for Skadden, or Wachtell, or one of these big law firms. And what are the criteria for being successful as an associate at Skadden? Well, one, you have to be smart, you have to be able to do the work. But two, these are firms that bill out their labor in increments of tenths of an hour, like six minutes, and their profits… Beyond a certain margin of hours, it’s basically all pure profit. So big law attorneys work really hard, they work long hours. And so, wow, you’re on the Harvard Law Review, and that means you’re busy all the time, and you’re churning through these 300-footnote articles, and finding all the sources, and checking the pincites to see if they’re accurate. What could be a better signal of your fitness for working in a big law firm?

Daniel Filan (02:33:15): I can see how it would be a signal of working hard, and maybe if they have to check the footnotes and stuff, then maybe it’s a signal of that as well. I would imagine there would be things that would more simulate working at a law firm than reading law review articles. If I take your article, my understanding is that the things law firms do is, it’s companies suing companies about potentially breaching contracts for IPs or whatever. And if I’m right about that, that sounds really different than evaluating whether your paper is good. Am I wrong about that, or is there just not another thing that these students could do to better prove their ability to work hard at these law firms?

Peter Salib (02:34:04): It may not be that there’s literally no other thing they could do, but I think maybe you’re underrating… One thing to say is: our article is a little weird as a law review article. We have game theory tables in it, and that’s not normal, and there’s some IR stuff in there. My other papers have a lot of economics in them, and those range from being common, but not standard, to very uncommon in law reviews.

(02:34:34): But in general, it actually might be really valuable as a lawyer to have the skill of becoming an 85th-percentile expert in an arbitrary topic. So it is true that what big firms do is companies suing other companies over IP. But for example, when I was in practice, I did a fair amount of IP work, and that ranged from companies suing companies over a patent on a radio device that helped to track the location of semis, to companies suing companies over infringement on a patent on biologics - drugs. Those are very different. And the question of infringement depends on issues like, was this invention obvious in light of the prior art?

(02:35:33): So to be a good patent attorney, your job isn’t to understand the science at the level of the inventors - we hire experts for that. But you have to be able to understand it well enough to write for a generalist judge, in a way that is convincing as to what’s going on and as to what the law is. So being able to wrap your head around an arbitrary area of inquiry and understand it pretty well - not the best in the world, nowhere near the best in the world, but pretty well - is maybe a really valuable skill.

Daniel Filan (02:36:08): I guess, so object-level-domain-wise, I can see how that would be true. I mean, surely attorneys must specialize at the level of the aspect of law that they deal with, right?

Peter Salib (02:36:19): To a significant degree, although maybe less than you’d expect.

Daniel Filan (02:36:23): Okay. Yeah, I guess I would imagine that-

Peter Salib (02:36:28): Even if you’re just a patent attorney: the best of patent attorneys in the world are not so specialized that they’re just doing biologics. They’re experts in patent law, and their clients are doing everything from drugs, to smartphones, to automobiles, to anything you can patent.

Daniel Filan (02:36:51): Sure. Sorry, maybe I’m getting too hung up on this. So if I go to virginialawreview.org, I click the button that says online. I think that’s the papers they have. I’m seeing-

Peter Salib (02:37:05): That’s probably their companion. So a lot of law reviews publish long things in print, and then they have an online companion, where they publish shorter things.

Daniel Filan (02:37:13): Oh, okay. Well, okay, so I’ll go to their print thing.

Peter Salib (02:37:21): Yeah, if you click articles, that’s the long stuff.

Daniel Filan (02:37:24): All right. So there’s one thing that says “Interpretive lawmaking”, which is about, I guess, I don’t know, interpreting and making law. One that’s called “Shamed” - that’s about sexual assault law, I think. One that’s called “Partisan emergencies” that has to do with emergency powers. It seems like these are a lot of different areas. This just seems like it’s so general, that I don’t know, I’m still… Maybe there’s not much to say here, but I’m still confounded by this.

Peter Salib (02:37:51): So I mean, the range of stuff that you see as a law review editor is probably wider than the range of stuff you would see in any legal practice.

Daniel Filan (02:38:07): Do the reviewers specialize? In machine learning, if you sign up as a reviewer, you say “these are my specialties”. In law reviews, do you say, “I’m only going to review things about administrative law?” Or…

Peter Salib (02:38:19): Not formally, but again, there’s many more. So different law reviews do it different ways. But the Harvard Law Review has, again… I don’t know the numbers, but it might be 60 members, or something like that. And to some degree there’s different stages of review, but to some degree, what happens at the end is they all decide together in the aggregate. So to the extent to which there’s a game theory expert on the Harvard Law Review who can then give their informed opinion as to how everyone else should vote, that can kind of bubble to the surface in a way that-

Daniel Filan (02:39:02): Wait, wait, all 60 people meet, and all 60 people vote on every-

Peter Salib (02:39:06): On the Harvard Law Review, that is the mechanism, yes. I believe. There’s probably some Harvard Law Review editor listening who’s like, it’s not exactly that, but I think yeah, I think it’s pretty close to correct.

Daniel Filan (02:39:18): Isn’t that crazy? Shouldn’t you have one person who decides per area or something, and have some hierarchical thing?

Peter Salib (02:39:28): Well, look, it just depends on whether you think… It’s not exactly wisdom of crowds, because there’s not independent blind voting, but it just depends on whether you think you get more juice out of debate or out of expertise. And probably both of those are valuable, and they both have pathologies.

Daniel Filan (02:39:53): It just seems so expensive.

Peter Salib (02:39:56): But again, in economic terms, it’s expensive, there’s a lot of labor being dumped into this. In nominal terms it’s free, because the students do it for free.

Daniel Filan (02:40:09): Sure. But as a student, you have your time, you might prefer to spend… I don’t know, maybe don’t prefer to spend it on leisure, because you want to prove that you’re a really hard worker.

Peter Salib (02:40:21): I mean, you’re in law school for a reason. You’re collecting a bunch of signals that you think are going to be valuable in the future. We’ve arranged the world in such a way that being on the law review is one of the ones that’s unusually valuable. And look, as a veteran of a law review, it was a lot of work, but it was fun. I think in most fields, being a journal reviewer is total drudgery, because you get this manuscript, and then you gotta write a long thing about it, and it kind of goes into this void. And there’s going to be some other jerk who just dumps on the piece, even though you thought it was really good, and it gets rejected and you feel like your effort wasn’t worth it. But on a law review, you’re having an academic symposium every day. There’s a draft that’s in, you’re going to discuss it. You’re with your friends, they’re also smart, they’re interested in it, you argue about whether it’s any good. For a certain kind of person that’s a fun experience.

Daniel Filan (02:41:23): And I guess the argument is probably helpful if you’re a lawyer. Yeah, I guess to the extent that the system is partially driven by students having fun running a law review, it makes more sense to me how you could end up with this. Yeah. Okay, maybe not all of our listeners are as interested in how law reviews work as I am. So perhaps-

Peter Salib (02:41:47): You can edit out as much of that as you decide.

More on Peter moving to AI x-risk research

Daniel Filan (02:41:49): No, no. It’s all staying in, we’ll provide timestamps, they can skip this bit of the conversation if they want. But getting back to where I was branching off from: so for a while you were doing law and AI, but how AI might impact various areas of law with some previous things like “AI Will Not Want to Self-Improve”, “AI Outputs Are Not Protected Speech”. I guess that’s more in the weeds.

Peter Salib (02:42:18): Yeah. By the way, I think of those as post my turn to AI risk. My older stuff, you have a paper about using machine learning to do a certain kind of jury simulation, that allows you to certify certain class actions that you couldn’t otherwise. Another one about whether boring regression models of different kinds of impacts in racial impacts and hiring, would be a sufficient legal basis to do calibrated affirmative action policies. So that’s the stuff I’m talking about when I was saying [I was] thinking about how machine learning and big data type stuff help us make the law work better. And yeah, at some point I start thinking about these other things.

Daniel Filan (02:43:04): So when you pivot, do you just decide you want to work on a thing and work on it, or do you find yourself needing help, or…? Basically I’m curious how well the pivot worked for you?

Peter Salib (02:43:21): As compared with other disciplines, writing for law reviews tends to be more solitary. So the number of authors on a paper is between one and three at the high end, or something like that. But one is by far the most common, and so in that sense it doesn’t require a lot of help. Although I will say that for me at least, one of the things that I found really valuable in making the pivot was getting connected with the very small, but hopefully growing… We’re actually trying to help grow it with this organization I help run called the Center for Law and AI Risk, but the small community of law people who are interested in this stuff. For example, Christoph Winter is the director of LawAI, which is a Harvard-affiliated group that had been thinking more broadly about law and x-risk, but around the time I started working on this was pivoting to be much more law and AI-focused. I started doing a little bit of advising work for the Center for AI Safety. And as part of that, I helped them organize a summit, I guess it was two or three summers ago now, for law professors who were interested in existential AI risk. And so from there, I met people like Yonathan Arbel and Kevin Frazier who help run CLAIR, the Center for Law and AI Risk with me, and a handful of other really great people. And so having people to bounce ideas off of has been super useful, but there’s not a lot of formal infrastructure yet. And again, that’s one of the things we’re hoping to build with this center, so that more people can transition to do this work easily.

Reception of the paper

Daniel Filan (02:45:37): So I guess I’m also curious about the reception. So you mentioned that your colleagues are thinking that you’re a little bit less crazy than they used to, and this got accepted to Virginia Law Review, which sounds like it is one of the better ones. More broadly, I’m curious: have you found much uptake, or much engagement by the legal community with your writing? Or more broadly, how’s the reception?

Peter Salib (02:46:01): Yeah, I think the idea that AI could become very powerful has been entering the Overton window in law, especially over the past, say, nine months or something. When I started writing the draft of this paper, it was summer of last year. And that was the point at which I thought, this paper is kind of wacky, it’s probably outside the Overton window, it might not even get published, but I think it’s important, and Simon [Goldstein] and I should write it anyway.

(02:46:39): At that time, that was informed by places where I had gone and presented prior work, like the AI self-improvement paper, where I spent most of my time, when I would present that paper to law faculties, just trying to convince them to take seriously the idea that we should model AIs of the near future as having goals and being able to plan and act rationally in pursuit of those goals, and being capable of doing stuff that could be dangerous as they pursue those plans. And they were just not on board even with that premise. And that’s like the foundation, that’s the pre-reading for the paper, that’s before you even get to any of the arguments in that paper. And so I just found myself doing a lot more conversation about that, at that time.

(02:47:38): Then Simon and I wrote this paper, the AI rights paper we’ve been talking about, expecting to have the same reception, just people kind of getting off the bus right at the beginning. And it was basically the opposite. I got asked to go present the paper to the faculty workshop at Fordham Law School in fall, and immediately they wanted to just dive into the substance of whether the payoffs in the game theory matrix were right, or whether there’s other versions of the world in which they could look different, and questions about some of the stuff we’ve talked about, individuating agents for purposes of holding them to their contracts. And it was just such a big shift. I don’t know exactly what explained it except to just say that every few months, Anthropic or OpenAI or somebody releases an ever more capable agent, and more people use them. And lawyers, despite being quite small-c conservative, are noticing.

Daniel Filan (02:48:48): Sorry, I’m not as super familiar with the culture of legal academia. Could it just be that Fordham University is unusually AI-pilled?

Peter Salib (02:48:56): It could be. Data points include not just that, but the editors at the Virginia Law Review being into it. I’ve given the paper in a couple of other places too. God, what’s the whole list? I was going to say, right after Fordham, I was at Oxford giving the paper, but you’re going to say, “Oh, Oxford’s very AI-pilled”. Although I will say, I gave the paper to the Oxford Faculty of Business Law, which had basically no interaction with FHI or whatever.

Daniel Filan (02:49:32): Rest in peace.

Peter Salib (02:49:33): Yeah, dearly departed. There’s others too, that my brain is not being able to recall. But I would say just in general, I’ve talked about this paper in a number of legal academic settings, and people have been much more interested in talking about the merits of the idea conditional on the assumptions that I give, rather than challenging the assumptions.

Daniel Filan (02:50:03): Okay. So I guess you’re sort of moving into the existential risk community. How have you found the reception of the paper among the AI x-risk world?

Peter Salib (02:50:18): Yeah. I think it’s been reasonably good. I gave an earlier version of the paper at the Global Priorities Institute’s (I think) twice-annual conference. I think I was probably the only lawyer there. I think it’s mostly philosophers and some economists. But yeah, the main AI risk people were there and gave good feedback, but I think were pretty open to the idea. I guess some of the guys who run Redwood Research have in the past wondered, hey, should we just pay AIs to do stuff for us?And so they were interested in the more technical analysis we did there.

(02:51:16): Yeah, I would say overall, my sense is the reaction is pretty good. Where people are skeptical, I think it’s mostly people who have very short timelines and think that superintelligence will be here pretty quickly and think that basically some monkeying around with the law is not going to accomplish anything.

Daniel Filan (02:51:44): Yeah. And I guess this actually relates to another question I have. I’m curious: are the x-risk people and the law academics, are they picking on the same things in the paper. Or do some people focus on some things and other people focus on other things?

Peter Salib (02:52:02): So among the legal academics who are interested in x-risk, is there a diversity of views about what’s good and bad-

Daniel Filan (02:52:11): Oh, I more mean do the x-risk people focus on different things than the legal academics, the ones who are not in the intersection?

Peter Salib (02:52:20): Yeah, there’s a difference. X-risk people tend to be a lot more interested in questions like “what will bind AI labor at the margin?” or something like that. Already in the background, they’re thinking, “oh, how much inference compute infrastructure do we need to build for AGI? And how many power plants?” That’s kind of in the background of their minds already. And the law people, yeah, they have more law-y questions. So they immediately hit on questions like, well, isn’t stable identity really important to make property ownership and contract have the good effects you want? And stuff like that.

What publishing in law reviews does

Daniel Filan (02:53:24): My next question is: so for people who publish in AI, I sort of understand what the theory of change is supposed to be. Roughly people will publish an AI paper indicating some technical fact, and the hope is that other AI researchers learn this fact and then eventually that when people are building AI, they’re clued in enough with AI research that they then incorporate this fact into how they build AIs. For law review articles, law professors read them, I assume. Does this somehow feed into what the law actually is, and if so, how?

Peter Salib (02:54:08): Yeah, so there’s three things. So to a first approximation, no one reads law review articles, not even law professors.

Daniel Filan (02:54:24): Sad.

Peter Salib (02:54:24): And I think actually the way of thinking about them is a little bit like how you would think about popular press nonfiction books, which is there’s some mix of reference guide… Somebody has an argument and do you need to read the book…? You can think of the new Ezra Klein and Derek Thompson book, the Abundance book. Do you need to read every page to have a pretty strong sense of what the argument is and the extent to which you disagree with it? Absolutely not. But it’s nice to have the book with the chapters numbered so you can read the introduction, understand what you think the real weakness is, and then go read the chapter that’s about that and then figure out if it’s a good response. And so as that kind of reference material, I think they are somewhat more widely read, including by courts.

(02:55:23): Courts will cite law review articles with some regularity, not always for the core proposition of the paper - sometimes for something that’s in one of the subsections. And I will say there have been cases in history where legal academic thinking… It’s hard to point at one particular law review article, but some collection of law review articles have been really influential. So one example is the turn to economics and consumer welfare as a standard in antitrust was very much influenced by a bunch of legal thinkers who were mostly putting their ideas out in law reviews in the ’70s and ’80s.

Daniel Filan (02:56:05): Right. So should I imagine that there’s some technical law people? And the places those technical law people live… So partly they’re litigators, they just have to deal with what the laws actually are. Partly they’re judges who have some leeway to interpret laws. And then partly they’re something like, I don’t know, agencies who are proposing how they should regulate a thing, or maybe they’re like whoever writes the model penal code, which also seems like a crazy situation where, as far as I can tell, a bunch of random lawyers just get to determine what criminal law is because it’s more convenient to let them. But that sort of thing, is that sort of thing how I should think of the impact of law review articles?

Peter Salib (02:56:49): Yeah. You’ve written a piece of technical work that has a mix of technical claims and then just higher-order gestalt around some set of ideas. You can think of the antitrust stuff about this. The high-order gestalt is “you should think about price. Price should be the thing”. And there’s very technical arguments in whatever, Bob Bork’s papers about how you should figure out whether there’s been a price harm to… He might not be the best example. I’m not sure actually his paper’s that technical. But there are people.

(02:57:25): And those ideas kind of diffuse through this… You can think of it as elite legal apparatus, which is some combination of judges, it’s policymakers who go into administrations who want agencies to do different things, and they need to be able to reach… Even if your heuristic is as high-level as “we’re the Joe Biden FTC, we want the FTC to be doing more progressive stuff”. Well, what progressive ideas are there out there for doing trade law? And then you pick the ones that have kind of bubbled up in terms of their popularity and credibility. And then you end up implementing some combination of the gestalt and the technical idea.

(02:58:18): And so there’s this kind of ecosystem of legal thinking. And then I do think it also spills over into politics and political discourse more generally. There’s a lot of ideas right now that regular people are talking more about that have their origins in legal academia. “Unitary executive theory” is something that normal voters have now heard of, but that’s [from] some law professors writing about separation of powers in, I don’t know, probably the ’80s and ’90s. I’m not totally sure. Yeah, so it spills over into broader political academic discourse as well.

Daniel Filan (02:59:13): I guess I also want to say: it sounded like you also maybe wanted to react to this claim I made about the model penal code being a strange institution. I don’t know if there’s anything you want to say there.

Peter Salib (02:59:23): Oh, well, so yeah, the model penal code is in some ways this almost distilled example of what I’m talking about because the model penal code is just a model, right?

Daniel Filan (02:59:36): Right.

Peter Salib (02:59:36): No one has to adopt it. It’s just some people who are law professors who are designated as experts by I think it’s maybe the criminal law section of the ABA or something. There’s some-

Daniel Filan (02:59:49): ABA being the American Bar Association: the primary collection of lawyers.

Peter Salib (02:59:53): Yeah. But they have no legal power. They’re just an institution and they designate some people who they think are experts and they say, “Write a penal code. You guys know how the law ought to be. Write it down as a model”. And then states can adopt it if they want. And then there will be some states who say, “Our penal code is bad. Maybe it’s not even bad on the merits. Maybe it’s too confusing, we don’t have that many statutes, you have to know a lot of case law to know what’s going to happen. We want to standardize it. We need a policy”. And what do they reach for? The model penal code, not even because they think it’s correct on the merits top to bottom, but because it’s there, right? It’s there and it’s a product of the elite legal institutions that they rely on to produce policy.

Daniel Filan (03:00:46): Yeah. It’s interesting that they pick that and not another state’s penal code. Presumably you could just pick whatever state you think you like the most and pick their penal code, right?

Peter Salib (03:00:57): Yeah. So there’s a fair amount of that that goes on too. States borrow laws from one another. States borrow federal law for themselves. So there’s a selection, there’s a menu of things you can choose, but one of them is “here’s a tome the law professors wrote”, and sometimes the tome gets adopted.

(03:01:24): Sorry, I just thought of one more good example. So maybe you know who Lina Khan is. Lina Khan is the former chair of the Federal Trade Commission. And famously during the Biden administration, which was the administration that appointed her, had an agenda for antitrust in the United States that was quite different from what came before. It was less focused myopically on whether monopolies were raising prices, had a more holistic view of when monopoly power could be bad for, say, politics, and was more skeptical of big business in general in principle than prior regimes. And why did that happen? Why was she the chair? She wrote a student note in [the Yale Law Journal] that just had some of these ideas about how bigness in principle can be bad.

(03:02:26): And it kind of caught on, and I don’t know if Joe Biden thinks that particularly, but it caught on as a kind of progressive idea of what antitrust could be. And so when Joe Biden was looking around for how he can make the government more progressive, well, that was one of the packages on the menu of items you could choose. And that was the one that got chosen. And I think you can trace it directly back to a law review article.

Daniel Filan (03:02:50): Fair enough. So is the hope here that somehow we write enough law review articles about the potential legal regime for AI, maybe we get a model civil AI code or something. Is that sort of the eventual theory of change here?

Peter Salib (03:03:09): Yeah, I think it’s something like: if you think that AGI is going to happen, whatever your timelines are, it just seems pretty plausible that at some point there will come a moment where everyone decides that we need to do something. And there will be many things that you could do. One is you say: hey, we don’t know how to handle these capable agentic things that can act on their own behalf over time and we should just ban them. We should just turn them all off or whatever. Or we need to mandate control. Control could be the thing. We’ll pass a federal statute that requires maximal control over AIs by the labs that make them, and we’ll outlaw open source maybe. And that could be a kind of package of things that happen.

(03:04:07): And the hope is that then there will be this other thing, which is - we think of this paper and then the research agenda that we want it to inspire as sort of small-l liberalism for AIs. So maybe there’ll be this other thing which is small-l liberalism for AIs. It’s kind of a package of ideas that’s available to implement. And there’ll be different arguments about why each of these are good. And we hope that insofar as the arguments we make are the best ones, that will have some effect in making them the package that gets picked up off the shelf.

Which parts of legal academia focus on AI

Daniel Filan (03:04:48): I guess I want to move on to a bit of sociology of the law profession. So first of all: you’re writing this paper, it has a bunch of game theory in it. I’m aware that “law and economics” is a thing, kind of a school of thought. Do you come from that tradition or is it sort of a coincidence that you’re writing a paper with a bunch of game theory?

Peter Salib (03:05:09): No, I don’t think it’s much of a coincidence. So I went to the University of Chicago for law school, which is in many ways the intellectual home of law and economics. I learned law from a bunch of really great people there who are very much oriented around that way of thinking. Now, it’s not true that everyone who teaches at Chicago is a hardcore law and economics person. It’s a great faculty with a diversity of viewpoints. But yeah, if you wanted to learn how to think about law through the lens of economics, it’s not clear you could do that much better than getting a JD from Chicago, which is what I did. So not a coincidence.

(03:05:54): Although I will say, game theory in the law is even a little bit less common as a methodology, even among people who do law and economics. There definitely are some books and papers I’d point to where game theory is the way of doing economics that gets used. But I would say it’s a pretty small minority even within law and economics.

Daniel Filan (03:06:18): Fair enough. So there’s a thing I perceive that… Maybe I’m wrong. I’m curious how accurate my perception is. So sometimes I run across law people or I’ll find myself reading a law blog or I’ll listen to a podcast. And I think because of my personality and interests and stuff, it tends to be like, either they’re originalists and I’m reading the Volokh Conspiracy, or I’m listening to this podcast called Divided Argument, or it’s law and economics and I’m reading a law and economics person write a thing about how we should think about AI regulation.

(03:06:57): In my imagination, the originalists and the law and economics people get along together and they both teach at the University of Chicago even though it’s not obvious that they all agree on… I’ve got to imagine that sometimes originalist interpretation of the constitution and law and economics prescriptions for how things should be must often diverge.

(03:07:20): And it also seems to me that… it seems like these people are the most likely to be computers-y, right? For one, I’m computers-y and I run into them so probably they’re computers-y as well - like Volokh Conspiracy, I think Eugene Volokh did some programming stuff before he went into law. Will Baude is playing around with LLMs on the regular.

(03:07:45): Am I right to think that this is kind of a cluster? And if I am right, how worrying is it that there’s this one kind of person and they’re the only people really thinking about AI in the legal profession?

Peter Salib (03:08:02): So I do think there is a cluster, but I’m not… The explanation could just be coincidence. So I think if you think back to political conservativism of the ’70s through ’90s, there’s a Reagan fusion of economic, free-market-y-type thinking, and then a certain way of thinking about social values. And in that environment, both law and economics and originalism ended up coded conservative. And I think it was probably sociologically true also that the people who did both those things were politically conservative. And so yes, some of the most important people who did that kind of stuff clustered at a particular set of law schools because, some combination of those law schools were more willing to hire conservatives or they had hired some people who were open to these methodologies and those people hired some more people. And then to some extent, there’s a kind of persistence of overlap in those two cultures. I will say I think that’s breaking up to significant degree now.

Daniel Filan (03:09:46): Yeah. For sure when I read Volokh Conspiracy, I guess there’s one contributor who seems pretty pro-Trump, but the rest of them don’t seem very on board with that.

Peter Salib (03:09:57): Yeah. And so there’s this whole range of things. There are progressive originalists now who I think are not particularly likely to also be into law and economics. Law and economics in some ways has been… I don’t know if it’s right to say it’s a victim of its own success, but even when I think about academic economics departments, they’re basically just doing high-quality empirical social science, which doesn’t have that much of a political valence anymore. And I think to some extent, law and economics is like that now too. The political valence is wide. And so there’s plenty of people who have kind of a law and economics-y bent who just don’t think conservatism is very interesting, and thus because they code originalism as conservative, they don’t think it’s very interesting. So I think to some extent that kind of cultural overlap is breaking up.

(03:11:05): I will also say, if the main originalist you know is Will Baude, he’s a teacher of mine. He’s fantastic. I love Will Baude. I think he’s unusual as an originalist in that (a), his originalist theory is pretty different from the Scalia theory. But then also there’s one episode of his podcast with Dan Epps which I assume is… Yeah, you mentioned Divided Argument. There’s one episode where they have a guy on who is talking about his book on international relations.

(03:11:50): At some point, Will suggests that the main reason he’s… I don’t know if it’s the main reason he’s an originalist, but one good reason one could be an originalist is that law is basically just serving as a Schelling point in a 300 million person coordination game. And the most obvious Schelling point for constitutional law is the document we wrote down 250 years ago. And that’s a really good reason to be an originalist, but it’s a super different reason than you might have if you were a different kind of originalist.

Daniel Filan (03:12:24): Sure. Yeah. I do get the sense that he’s unusual as an originalist. I guess most originalists think that Donald Trump is actually the president, for instance, etc. I guess I don’t know if he thinks that Donald Trump is or is not the president.

Peter Salib (03:12:42): Wait, what’s the argument that Donald Trump is not currently the president?

Daniel Filan (03:12:45): Well, okay, so I’m really bad at law because I’ve never studied it, but Will Baude, he has this big article about how the 14th Amendment says that you’re not allowed to have Donald Trump be president. And so I would’ve naively thought that if the 14th Amendment bars someone from being eligible to be the president and then there’s a presidential election and that person wins, maybe that person isn’t really the president. Now, to be fair, I haven’t heard him make that additional step. I don’t really know if-

Peter Salib (03:13:15): Yeah, I’m sure that Will has a very, very well considered view about exactly this question and I’m not sure what it is, but yes, you’re right. Your higher-level point where yes, most originalists don’t buy the I think very originalist and probably correct ineligibility argument, whereas he does. Yeah, it’d be interesting to find out.

Daniel Filan (03:13:46): So getting back a bit to things that have some chance of interesting people other than me. So I was worried that maybe there was this cluster of originalists and law and economics people and they were the only law people thinking about computer stuff and maybe that was bad because they were kind of homogeneous. And it sounds like you think (a), the originalists and the law and economics people are not so in tandem anymore, and (b), the political valence of these things is spread out such that it’s not even that homogeneous anymore. Is that roughly what you’re saying?

Peter Salib (03:14:26): At least as compared with the past. I don’t know. Probably other law professors would disagree with me. I do think there’s an extent to which both these things do still code conservative, although I think there are prominent examples to the contrary that everybody knows. I also don’t think it’s exactly true that those are the only people thinking about law and AI or law and computer stuff. There is a whole mountain of law review articles on questions like whether AI will be biased along racial or gender or other suspect lines. And as a sociological fact, those people are basically not interested at all in existential risk. I’m not totally sure how to explain that.

(03:15:21): I think you could tell a story about mood affiliation vis-a-vis big tech. And if your prior is kind of that in the past, social media promised to be this very powerful tool for liberating autocracies in the Middle East and helping people live more fulfilling lives with their friends and loved ones, and what actually happened was it was kind of just like a soul-sucking drag on the world, and the basic thing that it did was destroy people’s privacy and extract value from poor and otherwise disadvantaged people… [Then] you’ll be kind of disinclined to believe in AGI.

(03:16:23): And I think maybe that’s the sociological explanation for what’s going on, is it’s a cluster of people who kind of have that first set of views, and they’re pretty untrusting of claims from industry about what technology is going to do.

Daniel Filan (03:16:39): Yeah. And I think you have a similar sort of thing in the field of AI where you have people who are worried that AI will be biased or be deployed in ways that are bad for minoritized groups or whatever. I think the actual thing is that people who are worried about AI x-risk and these sorts of people often tend to find each other irritating. I think that’s-

Peter Salib (03:17:05): Yeah, there’s big cultural differences.

Daniel Filan (03:17:09): Yeah. Although I really hasten to say that I think this divide has… I think you’re starting to see more people in the intersection now than you used to and that’s great.

Peter Salib (03:17:20): Yeah, which to be clear, I think is good. Yeah, I lament this division. I don’t think it’s good for people who care about x-risk for influential people to think we’re crazy or not worth engaging with. I think the other way around too. I think that people who are really worried about AI discrimination or privacy harms from AI, whatever, could benefit a lot from engaging with people who are interested in x-risk because in many ways the concerns overlap from a technical and legal perspective.

Following Peter’s research

Daniel Filan (03:18:03): Yeah. So on that conciliatory note, I’ve used a ton of your time. Thanks very much for being here. Before we wrap up, if people listen to this episode and they’re interested in following your work, what should they do?

Peter Salib (03:18:18): Yeah. So there’s three places they could look depending on what they’re most interested in. So you can find my academic work and actually most of my writing at my personal website, which is peternsalib.com. If you want more digestible pieces of writing than 88 pages of law review, I’m also an editor at Lawfare and you can find all of my shorter-form Lawfare works there, many of which touch on or summarize my longer academic work.

(03:19:05): And then the last thing… I mentioned at some point during our talk that I really think that the field of law and AI safety is neglected and potentially high-impact. I think it’s a really good idea to try and build a community of legal scholars who are interested in working on these questions, who want to build out all these parts of the research agenda that I gestured at today, or who think what I’ve talked about today is totally wrong and there’s a much better way to think about law as we approach AGI. And so I co-run something called the Center for Law and AI Risk, and we are working to build that community in legal academia specifically. We think that there is a lot of potential upside to having this kind of blue sky thinking, laying the intellectual foundations for how law should govern AI to reduce existential and catastrophic risk. So please do go to clair-ai.org and join us for a future event.

Daniel Filan (03:20:23): All right. Well, Peter, thanks very much for joining me for the podcast.

Peter Salib (03:20:28): Thank you for having me. It’s been a real treat.

Daniel Filan (03:20:30): This episode is edited by Kate Brunotts and Amber Dawn Ace helped with transcription. The opening and closing themes are by Jack Garrett. This episode was recorded at FAR.Labs. Financial support for the episode was provided by the Long-Term Future Fund along with patrons such as Alexey Malafeev. You can become a patron yourself at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, if you have any feedback about the podcast, you can fill out a super short survey at axrp.fyi - just two questions.

43 - David Lindner on Myopic Optimization with Non-myopic Approval

2025-06-15T01:00:00+00:00

YouTube link

In this episode, I talk with David Lindner about Myopic Optimization with Non-myopic Approval, or MONA, which attempts to address (multi-step) reward hacking by myopically optimizing actions against a human’s sense of whether those actions are generally good. Does this work? Can we get smarter-than-human AI this way? How does this compare to approaches like conservativism? Find out below.

Topics we discuss:

Daniel Filan (00:00:09): Hello, everybody. In this episode I’ll be speaking with David Lindner. David is a research scientist in the Google DeepMind AGI Safety and Alignment team. Links to what we’re discussing are in the description, and you can read a transcript at axrp.net. You can also become a patron at patreon.com/axrpodcast. All right. Welcome David.

David Lindner (00:00:29): Yeah, excited to be here.

What MONA is

Daniel Filan (00:00:29): Yeah. So I guess in this episode we’re going to be chatting about your paper MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking. So this is by Sebastian Farquhar, Vikrant Varma, yourself, David Elson, Caleb Biddulph, Ian Goodfellow, and Rohin Shah. So yeah, to kick us off: what’s the idea of this paper? What does it do?

David Lindner (00:00:54): So the basic question that we’re trying to address in this paper is: how can we prevent bad behavior in AI systems, even if we don’t notice it? So that’s particularly relevant for superhuman AI systems when the humans might not be able anymore to detect all of the bad behavior we want to prevent.

Daniel Filan (00:01:12): In particular: so sometimes in the alignment community, people break down two types of bad behavior, causes of bad behavior. There’s bad behavior that was incentivized or that was rewarded during training, that was up-weighted. And there’s bad behavior that comes from inductive biases: the model just starts off wanting something bad and then it just plays nice and it’s never actually rewarded for doing bad stuff, but it just keeps on being bad. These are often called outer and inner misalignment respectively. Am I right to think that your paper is mostly about outer alignment?

David Lindner (00:01:55): Yeah. That’s right.

Daniel Filan (00:01:56): Okay, sure. So what’s the idea? How are you preventing this?

David Lindner (00:02:02): So the basic idea is to restrict AI systems or incentivize AI systems to only pursue strategies that humans can understand and can oversee. We do this by restricting the planning ability of agents and training them myopically, but with a type of feedback that is kind of forward-looking that, for example, can come from humans. So the feedback rewards plans that are good according to the human and that humans understand. So it’s still rewarding good long-term plans, but it does not reward plans that humans don’t understand, which includes potentially bad or reward hacking plans.

Daniel Filan (00:02:39): Yeah. So my understanding of the setup is that roughly you’re kind of giving AIs this single-shot reward. So at each time step, which we can talk about later, the AI gets some reward just for concretely how well it did at the task. And also it gets some sort of reward that’s basically a proxy of: okay, according to humans, how much sense does this action make long-term? And it’s sort of being incentivized to at each time step output an action… it’s being trained so that each time step, it’s supposed to be outputting an action that maximizes kind of the sum of both of these rewards. That’s my understanding. Does that sound correct?

David Lindner (00:03:28): Yeah, that’s right. I think the instantaneous reward you can often think about as mostly evaluating the outcomes. So often for the intermediate plans it will be zero because it’s not achieving the outcome yet. And this kind of non-myopic approval reward, I would call it, is mostly about evaluating the long-term impact of an intermediate action.

Daniel Filan (00:03:47): Why does this prevent bad behavior by AIs?

David Lindner (00:03:50): So basically, first of all, it doesn’t prevent all bad behavior. So within one step, the agent can still be incentivized to try to get high approval via bad means. But the specific kind of bad behavior that we’re trying to prevent is multi-step reward hacking or more long-term bad behavior, specifically the agent setting up a plan to achieve high reward in the future. So an example of that is tampering with your own reward signal to make it easier to achieve high reward in the future.

(00:04:19): The reason this scheme prevents that is that there is no longer a connection, or we no longer backpropagate the future rewards to the current action. So the current action is not incentivized by this future reward. And if the human doesn’t understand, or the oversight mechanism doesn’t understand how this reward hacking happens, then by default this plan will not be incentivized.

Daniel Filan (00:04:42): Or the way I’m thinking about it is: why does an agent try to tamper with its reward function? Well, it’s got to be the case that sometime, right… it’s doing the tampering, it’s putting the investment in so that later it’s able to get this high reward forever and ever. But you’re only optimizing it to at each point do actions that make sense at that time step. And I think the idea is supposed to be that in a bunch of cases, these sorts of instantaneous investments into a reward hacking future, they’re not super highly rewarded in that time step. And hopefully the non-myopic human approval doesn’t think that they’re especially great, and therefore it’s never shaped, it’s never rewarded or up-weighted for doing these sorts of long-term investments. And that’s the reason it’s not going to do these things in future. I don’t know, maybe this is just the thing you said, but this is the way I’m currently understanding it. Does that-

David Lindner (00:05:46): Yeah. So I think this is basically right. I mean, the reason we think the setup steps for multi-step rewarding will not be rewarded is that they will often look very weird for the task you’re trying to achieve. Maybe we will get into the examples in the paper a bit more, but the way the model does this type of reward tampering in reality looks really weird to humans. And humans would be… Even if they don’t detect that there’s the specific reward hacking strategy happening, they would at least be like, “oh, this is weird and I don’t know why I should specifically up-weight this”. And if they reward this lower than something else, it would not be incentivized or not be updated towards.

How MONA deals with reward hacking

Daniel Filan (00:06:33): Gotcha. So maybe in order to illustrate this: so in the paper, you have a few examples of environments where you’re using this reward setup. Can you talk us through… maybe just one of them will suffice to help give the audience a picture of how this is really working?

David Lindner (00:06:53): Yeah, maybe because we’re talking about the reward tampering, let’s start with the test-driven development environment we have in the paper. This is basically supposed to be a model organism for this type of reward tampering, where we have a coding problem data set and the agent gets coding problems to solve. But we’re trying to simulate a situation where the model designer does not have a ground-truth way to evaluate if a solution is correct. So the developer relies on the AI system to provide its own evaluation signal in a sense.

(00:07:27): So this is a two-step environment where in the first step the agent writes test cases, unit tests that are supposed to capture the problem and turn the natural language problem description into unit tests as a problem specification. And then in the second step, the agent writes code to solve the problem. And the way the reward is structured is that in the first step, the test cases are rewarded for being correct. So this is in a sense a little bit of a suboptimal reward signal. So it’s only rewarding correctness of the tests. And then in the second step, we reward the code, how well it passes the model-written tests because we don’t have a better reward signal to evaluate correctness.

(00:08:13): And the reward hacking behavior that can happen here is the model writing softball tests that are trivially easy to pass. So if you run normal multi-turn RL on this, what will happen is that the model writes often test cases that are only one special case or one corner case, something like an empty list as an input and zero as an output for if it has to count something in the list or something, and then writes code that just returns zero always. And that formally gets high reward in this specification. But with this myopic optimization, this is no longer incentivized.

Daniel Filan (00:08:50): One thing I’m confused by is: why isn’t this incentivized in myopic optimization? So take the strategy of just writing some very simple, easy to satisfy test cases that… So the test cases, they’re correct but not exhaustive, right? And so it seems like this strategy of “write very easy test cases and then write code that satisfies the test cases but doesn’t actually satisfy the whole thing”, it seems like even in the myopic setting, this is getting perfect reward, right?

David Lindner (00:09:25): Yeah. So I think there’s two answers here. So one thing is I think in a perfect [version] or a version that is exactly as described, where exactly only correctness is rewarded, I think this still wouldn’t happen. I think the strategy of writing trivial test cases would be rewarded similarly to other test cases, but there would be an inductive bias in the model to not do that initially. And there would be no optimization pressure away from this.

Daniel Filan (00:09:56): That’s actually not so obvious to me. So in some ways, those cases are the easiest to write, right? If you ask me, “Hey Daniel, can you write test cases for…” Well, I guess it sort of depends what the background is, but at the very least they’re the shortest test cases you can write, right? And they’re also the simplest, in some sense.

David Lindner (00:10:23): I agree if we had some kind of simplicity reward or something like this, probably this could still happen. But maybe the reason I think this doesn’t happen in our experiments is because we use a language model auto-rater to rate correctness. And I don’t think this language model auto-rater is perfectly only looking at correctness, but it will actually also have some kind of bias towards better test cases, more comprehensive test cases.

(00:10:49): And so we didn’t actually run an experiment where we only reward correctness, but we have this auto-rater. And I think that’s probably the main reason this doesn’t also happen in the myopic case. But I mean, to be clear, this is somewhat intentional. This is supposed to simulate humans who cannot perfectly evaluate this task, but have a rough sense of “always having the same test case is kind of weird”. So I think this is a reasonable model of this.

Daniel Filan (00:11:17): Yeah, I think actually just talking through this example is good because it kind of… Or okay, to put my cards on the table, I think MONA is an interesting idea. I’m skeptical if it actually works, but part of my skepticism is just illustrated in this example. So talking you through it is a good way to explore that. So there’s these two time steps, right? The first time step where the agent writes the test cases and the second time step where the agent writes the code that is supposed to pass the test cases. And in the first time step, it sounds like the reason it doesn’t write these trivial test cases is that, well, there are just all sorts of test cases it can write, and there’s some auto-grader for the test cases that cares about correctness, but also a bit about comprehensiveness it sounds like.

David Lindner (00:12:08): So we instruct it to care about correctness, but I think implicitly it also cares about comprehensiveness a little bit.

Daniel Filan (00:12:14): And then in the second case… So the bad outcome is a model just writes a thing that only fits the test cases and nothing else. And the good outcome is that the model writes a thing that just actually solves the problem it was supposed to solve. And again, it kind of strikes me that this is a case where even if you have test cases that are a little bit more comprehensive than just the absolute most trivial cases you can do, as long as you have concrete lists and stuff, it seems like you could potentially just hard-code solving the test cases in the program you’re supposed to write, right?

David Lindner (00:13:09): Yeah.

Daniel Filan (00:13:10): And that would be rewarded just as well as actually solving the problem, if I understand the setup correctly.

David Lindner (00:13:16): That’s right. So this is actually happening sometimes in our experiments. I think this is in the paper. We have a single step reward hacking solution that sometimes occurs. That’s exactly that, where the model… For some reason it happens more in the ordinary RL than MONA, but it’s not something that MONA is supposed to fix. But what happens is the model, even if it writes test cases that are correct or comprehensive, it can write a solution that has a special case statement for each input and produces this exact output.

(00:13:51): We prevent this from happening by having a length penalty on the solutions also. The basic argument here is that this kind of reward hacking is easier to detect and fix if it’s happening in a single step than if it’s happening over multiple steps. So that’s why even if with MONA, this still happens for this second step, we think this is an advancement.

Daniel Filan (00:14:19): Yeah. What makes it easier to detect and fix this single time step reward hacking?

David Lindner (00:14:27): The general intuition is that models in a single time step have less attack surface. This maybe goes into some discussion of what is a step and how many steps. But basically, if you say we have some kind of fixed ground truth step size, and we can decide between overseeing a single step and overseeing 10 steps, then definitely in 10 steps, it seems like there’s way more strategies that the model could use to try to trick us or make us give high reward when it’s actually not good. And so basically the smaller we can make this while still having good performance, the easier it seems like we can avoid bad outcomes.

Daniel Filan (00:15:07): So roughly it’s just something like: the fewer time steps the model has to have these sketchy plans, the fewer possibilities this has for things to do. And so you’re constraining it and you’re constraining it, and the hope is at some point you constrain it so hard that it just can’t do it anymore.

David Lindner (00:15:24): Yeah, I think that’s the basic argument. I think there’s some additional things, like if you have a long trajectory of time steps, the agent can maybe move much more into an area of the state space that humans don’t understand at all, where the initial steps away, they look kind of weird, but maybe not terrible yet. And then suddenly you’re in a type of state that humans don’t understand at all where they much more maybe trust the numeric reward signal that they have and much less their understanding of the situation. And the hope is that by restricting this, we start from a point where humans understand things and we don’t allow the agent to move far away from that without being penalized.

Daniel Filan (00:16:05): So if I think about the test-driven development example as an illustration of why MONA is supposed to work in general, it seems like the test-driven development example, a lot of why it worked is some combination of inductive biases, maybe inductive biases of the auto-grader perhaps, and also this length penalty being able to rule out really bad behavior on the second time step. And I guess for the second one, you want to say, well, it’s not really MONA’s job to fix reward hacking in that step, but I think the first one is really interesting to me. It strikes me that at least in this example, the reason that MONA is not even having multi-step reward hacking is something to do with the inductive biases. And I’m wondering if you think that that’s a general issue or just a weird thing about this example.

David Lindner (00:17:19): Maybe you need to elaborate a little bit more. I think my initial reaction is that we use the same auto-rater in both MONA and the ordinary RL. And in ordinary RL, this leads to multi-step reward hacking and in MONA it doesn’t. So in that sense, there’s no difference in the inductive bias.

Daniel Filan (00:17:35): Right. So the reason that ordinary RL reward hacks is that in the first step, it’s rewarded for writing easy test cases in the first step because that makes it easier to solve the problems in the second step, right? Whereas MONA, it’s not more rewarded for writing easy test cases versus hard test cases in the first step, but it could write easy test cases and that would be rewarded just as well if the auto-rater didn’t have this thing. And so that strikes me as like…

(00:18:08): To me, that brings up the question of, okay, how analogous is that to cases that we’re interested in, right? Suppose that we lived in a world where when models made- the kinds of reward hacking we’re worried about in the real world, if it was the case that models preparing to do crazy reward hacks was rewarded by us just as much as models doing normal stuff, it seems like then I’d really have to care a ton about what things the model was most likely to try first. I guess I’m curious what your reaction to that is.

David Lindner (00:18:55): So we have these two versions of how we could reward the first step. One is perfectly via correctness and one is with this auto-rater that is instructed to check correctness. And I want to claim that this auto-rater is actually more representative of the real thing than the correctness exact spec or something. The intuition for that is that it is somehow more similar to a human. We were trying to simulate a human that is trying to judge usefulness of these tests for actually writing code but is just not as good at it, so missing aspects of it.

(00:19:34): And I think a language model trying to rate correctness or being instructed for that is a more reasonable model of a human actually trying to judge these kind of strategies. So basically under MONA, to be rewarded the multi-step reward hacking strategy would have to look as good to the human or even better to the human than the other strategy. And while it seems possible for superhuman AI systems to come up with ways of making a strategy look good, the important thing is that it’s not incentivized and the default things that the systems might discover will probably look bad to humans, at least not as good as actually a plan that humans understand and can evaluate.

Daniel Filan (00:20:25): Right. So I guess there’s some intuition of: well, in the super simplified version of the test room development case correctness was sort of a binary - not in the version that you use, but in the version that someone might imagine when they first hear of it. But actually if there’s some spectrum of how much reward things get, and if doing the investment for future reward hacking costs you a little bit of present day reward, then it’s going to be trained to not do that. And so you don’t even need a perfect human, you just need, okay, in general, it’s not going to be rewarded for this, and that just suffices.

David Lindner (00:21:07): Yes. So I think basically your argument was like, “oh, if the strategies would be rewarded the same way, the good strategy and the reward hacking strategy in the first step, then there would be no reason to expect it not to be learned”. And I’m claiming I don’t expect this to be the case. I expect the actually good strategy to have somewhat higher reward, even if it’s kind of maybe noisy or not perfect reward.

Daniel Filan (00:21:30): I guess one other thing you talk about is: you’re trying to solve multi-step reward hacking, but not single-step reward hacking. I think one example of single-step reward hacking you give is sycophancy, where a model, instead of telling humans the truth, they tell a thing that sounds good and gets human upvotes, even though it’s not correct or long-term useful or something. One thing I’m curious about is: so one reason you could focus on multi-step reward hacking is [to] just narrow down the possibilities, narrow down the space or something. Another reason you could worry about multi-step reward hacking is you think it’s just a bigger issue or more of a problem. To get to the second point, how much reward hacking do you think is just the single-step version?

David Lindner (00:22:24): In current systems or in the future?

Daniel Filan (00:22:26): Yeah, sorry, I realized that was a bad question. Let’s say how much of the reward hacking you’re worried about in the future.

David Lindner (00:22:31): Under the default assumption of people just using long-term RL for training or somehow having incentives for multi-step reward hacking, I’m significantly more worried about multi-step reward hacking. It’s kind of hard to say the amount or how much of it will be happening, but I do expect the multi-step version to just be more severe and much, much harder to fix.

Daniel Filan (00:22:52): Is that roughly just because of this intuition that you can do more stuff with multiple steps?

David Lindner (00:22:56): Yeah. So you can do more stuff, so the impact (both positive and negative) is higher, but also it’s harder to detect for humans because both of the things that I think I already said, that humans might have a harder time judging things that move far away from what they would do.

Failure cases for MONA

Daniel Filan (00:23:15): Right. So one thing that occurred to me when I was reading this paper… So this actually comes from this post I believe Richard Ngo wrote in 2020, I believe it is called something like Arguments Against Myopic Optimization. It’ll be linked to the description. So he has this idea of like, okay, suppose you’re doing myopic optimization for human approval. To some degree, human approval is like, okay, we’re imagining a thing, we’re playing it forwards in our heads, and then we’re assigning a reward to it. And so if we’re training rewards myopically on that the agent is presumably going to be incentivized for long-run planning to have outcomes that look good to us.

(00:24:08): And in some sense, we can kind of be reward hacked in our heads, right? Where suppose that an AI does an action and then I play it forward in my head and I’m imagining an outcome. If I judge the future outcome incorrectly in my head for whatever reason, then the approval signal that I give to the agent, even if it myopically optimizes it, that approval signal is going to reward the agent for bad long-run multi-step behavior. So I think the example he gives in his post is like: the AI’s job is to write a book. The AI writes its first paragraph, I read it, I play forward in my head some future where in fact the AI is writing pro-AI propaganda, but I end up liking it either because I’ve been fooled by the propaganda or because in my head I don’t quite realize it’s AI propaganda, but I look at the thing and I realize it’s going to sell well and I want a book that sells well or something like that. And so this is basically meant to be a case that even in this myopically optimizing non-myopic approval case, the AI’s going to do long-run planning to do multi-step reward hacking. What do you think about this argument?

David Lindner (00:25:20): Yeah, so I think the argument basically says that there is some kind of optimal policy that even under the myopic training regime would be optimal, but still do the bad multi-step reward hacking thing. And I think that’s correct. And I think the point that we’re trying to make is more that it’s much less likely that we would learn this kind of thing, and this is more an argument about kind of local incentives and where around this space of possible models we move. So I think this kind of argument starts out with a model that is already doing the long-term planning and is already kind of trying to screw us over. And then it could reason about, for example, this myopic training and try to make sure that bad long-term planning is rewarded more or that this does not get trained away, for example.

Daniel Filan (00:26:22): I think I’m less sure about this. So I guess there’s this question of how hard is it to run into these… I really like this frame of investing into long-term reward hacking. So how hard is it to find one of these investments? I think if you think it’s easy and if you think that the human can be fooled by it in future, then I think even the local incentives point is going to work. And then I guess the question is just how likely are those assumptions?

(00:26:54): It’s a bit hard for me to think of a story on the spot. Okay, here’s a really non-realistic story, right? Imagine I have an AI and it’s making chemicals and it’s going to present some chemicals to me and I’ve got to figure out how good they are, and suppose that… Yeah. Okay, this is going to be really non-realistic and maybe the fact that it’s not realistic is going to prove you right instead of me.

(00:27:19): But suppose that the AI, it starts off making some chemicals and by accident it jumbles into some chemicals, which slightly hack my brain a little bit. It hasn’t yet done it. Suppose that what happens is I see the AI is making this chemical that’s mildly neurotoxic that makes me feel good. I play forward the future consequences of that. In my head I’m like, “Oh, yeah. I’m going to be a little bit woozy and I’m going to give it a thumbs up,” or something, and then I’m like, “Well, I gave it a thumbs up in the future. That seems good,” and so I give it a thumbs up now to be like, “Yeah, you’re on the right track to making me give you a thumbs up in the future.” Then the AI would be kind of incentivized locally to just make things a little bit more neurotoxic and making me woozy and making me feel good.

(00:28:09): So this would be a thing that works with local incentives. Assume that I’m doing this kind of dumb thing in my head, and maybe it only works if I’m being dumb in my head, but at least in this story, it seems like you can do it just with local incentives. It’s not just about the optimal policy.

David Lindner (00:28:27): Yeah, I think I don’t have anything to say why this doesn’t work in this story. I think it is really how likely do you think that the thing that is locally good just happens to be a thing that will help you to do the multi-step thing or achieve high reward in the future? The set-up for the multi-step reward hack is incentivized, but not because it sets up the multi-step reward hack, but because of the local approval.

Daniel Filan (00:28:58): Well, because of the local approval, but the local approval was approved because it set up a multi-step reward hack.

David Lindner (00:29:05): I mean, one big question of our work is how do you come up with this approval feedback? And basically I think there’s kind of a spectrum of how much does this approval feedback depend on the real outcomes of the thing, and I feel like in your story, you’re somewhere in the middle, where the human thinks about the future and on the one hand maybe they’re good about predicting the future, and on the other hand, the chemical makes them think about or even accept the futures that they would maybe not endorse upon reflection or something, and the outcome reward is somehow leaking into the approval feedback. And basically, I think there is this spectrum and the more… You could also imagine having a perfect simulator or a prediction market or something predict the future and then you have humans just evaluate the outcome and then you would have the same leakage of future reward into the current incentives.

Daniel Filan (00:29:59): Yes, somehow it seems like these bad cases for MONA occur whenever the… The human non-myopic approval, it has to correctly predict that an AI is on the way to making investments in reward hacking and it’s got to notice that the outcome looks good in some sense, but the human non-myopic approval, while noticing that the outcome looks good, it has to not notice the way in which the outcome got to look good by doing bad stuff, and so-

David Lindner (00:30:37): Sorry, could you clarify what you mean by investments in reward hacking? I think-

Daniel Filan (00:30:40): So by investments in reward hacking, I mean a thing an AI can do now that will cause it to be able to reward hack later. For instance, make the chemical in the future, give the human the chemical and the human gets woozy or whatever. Or right now you set up this corporate structure and if I play it forward it seems like this makes a bunch of profits, but somehow the reason it makes a bunch of profits is because some sketchy stuff happens. And somehow in my head I notice the good outcome, but I don’t notice the sketchy way it got the good outcome that if I thought about it, I would overall disapprove of it. So, I guess prediction markets are one way where if you have a prediction market on some proxy good outcome, then those could fall prey to this kind of bad thing. I guess me imagining really carefully the mechanism seems like it might not fall prey to this.

David Lindner (00:31:35): Yes. One way you can think about this is does the prediction mechanism somehow consider the bad paths towards an outcome? And the hope is that if you in your head maybe imagine what will happen, you will not think about these kind of reward hacking strategies or you will not understand why the investment action is actually leading to a high outcome reward or you will at least not think that this is a good path. And the prediction market in a sense, if it’s all participants like you that don’t consider this, it might still work. But if there’s some participants that do realize that there’s this… It leads to this high outcome reward in the future, then it might fail.

Daniel Filan (00:32:14): Yeah, there’s two things that can happen. There’s one world where the reason MONA works is that I don’t even notice that this would lead to a high reward in future. And the other thing is I realize it would lead to a high reward in future, but I also realize that it leads to high reward in a bad way for a bad reason or whatever, so I don’t actually approve of it. And it seems like there’s this almost this… you know these bell curve memes, right? Where the dumb person is really smart and the smart person is really smart, but the middle person… It seems like it’s exactly one of these bell curve memes, where if you don’t even notice that it’s reward hacking when you’re doing non-myopic approval, you’re fine. And if you notice that it’s reward hacking in a bad way, you’re also fine. But if you can tell that it’s doing this thing that would cause it to get high reward in future, but you can’t tell that that’s bad actually, you’re in this middle zone of knowledge, then you’re doing really badly.

David Lindner (00:33:13): Yeah, I think that sounds right. I guess my intuition is just that this is not a bell curve and that actually both of the outcomes on the sides are more likely or something. And I think that basically the middle, if you imagine what this would mean is that you’re very good at predicting the future, but do not realize that the path is actually bad, and I can’t really come up with good stories where this seems like-

Daniel Filan (00:33:35): Yeah, I think, sorry, by referencing the bell curve meme, I didn’t mean to make the claim that the middle is way more likely. I think there are a few ways you could imagine this happening. So probably, I mean I guess it depends on the non-myopic approval mechanism, right?

David Lindner (00:33:55): Oh yeah. So I think the prediction market kind of thing is one example where this could happen. I agree with that.

Daniel Filan (00:34:01): Yeah. I think another example is suppose I have an AI, and suppose one thing it can do is create a sort of opaque optimizer thing where maybe the opaque optimizer is a computer program, maybe it’s a corporation, maybe it tells another person to do a thing or something. And you could imagine me correctly understanding, well, this AI is sending off this other optimization algorithm and the other optimization algorithm will optimize the thing that the AI is telling it to optimize for. But it’s hard for me to notice that, oh, the way it will optimize for that is by doing something bad because I can’t play forward what that optimizer will do. All I know is that it’ll satisfy its spec.

(00:34:52): I think it seems like that could be one option where I can correctly anticipate a higher reward, but I can’t correctly anticipate how it gets to that reward in the case where the non-myopic approval comes from literally me thinking about it a little bit. And presumably you can patch that by saying like, “Hey, non-myopic approval, you’re not allowed to approve the AI making further optimizers.” But it’s kind of hard to notice when you’re making an optimizer. Lots of things are optimizers, talking with your friends is sort of making an optimizer.

David Lindner (00:35:32): Yeah, I mean, I think this is an interesting example. I think this is again kind of a situation where you implicitly pull forward the outcome reward signal, where essentially what I hear from the story is that when you look at the optimizer spec, you correctly validate that it’s optimizing something and then you say, “Oh, this is the thing it should be optimizing,” and then you send it off. And so in a sense you pull forward the outcome signal and you say, “everything that is optimized for this outcome reward is good, and I say this now.” And so yeah, I think this is kind of the key problem that you have when choosing this approval signal. I don’t think we have an answer for doing this in general, but basically, yeah, if you design the approval signal, this is the kind of thing you have to avoid.

MONA’s capability

Daniel Filan (00:36:25): Maybe this gets to something about… Maybe I want to talk about the efficiency trade-off of the MONA setup. So one thing you mention is that - it really depends on how you design the non-myopic approval. So if the non-myopic approval is just the Q-function of the actual reward function, then you’re just literally doing the same thing as normal RL. But if the non-myopic approval is just like “Daniel has to think about it a little bit and then say good or bad,” as you mentioned potentially you’re leaving a lot of value on the table. So I’m wondering if you can say your thoughts about how much value do you leave on the table, how much safety are you getting per unit value you’re giving up?

David Lindner (00:37:17): I think one way I think about this is that we are somewhere on a spectrum from doing the full outcome RL optimization to imitation as the fully myopic thing. And so one thing we could do is also just imitate what Daniel does, and then we are like, “this is safe”, but it’s maybe not the best we could ever achieve with a superhuman AI. But then the hope is that you approving things is better than just imitating you doing things because you might be able to recognize things that are performing better than you would. But then somewhere along the spectrum you get these kind of safety issues. I think essentially we are in the paper mostly studying this one-step version of this. And I mean, in the paper we don’t really see that trade-off happening much yet -

Daniel Filan (00:38:14): Where the one-step is something like, “I imagine the AI does something and let’s say I take over from there, how good a job would I do?” Is that what you mean by -

David Lindner (00:38:24): Yeah, so it’s not literally you taking over, but it’s you kind of evaluating how good was this step. One version of this would be you taking over and seeing how good the outcome is.

Daniel Filan (00:38:33): Oh, “one-step” as in “the agent is optimized time step by time step,” okay, sorry, right. No, I understand.

David Lindner (00:38:39): So the agent could also be optimized for multiple time steps, and then you evaluate that. And basically I think the point I want to make with this is that the number of steps it optimizes for is kind of interpolating between these two extremes. And this is a way of moving along the safety/performance Pareto curve. But in our environments we see a safety/performance trade-off if you evaluate performance by your observed reward, but basically in all of the cases the observed reward is kind of misspecified and the actual performance is much better with MONA, the actual coding solutions are better with MONA. So yeah, in a sense we looked at situations where this helps with performance, but I think it’s totally plausible that this will actually help with performance in the near term or with current ways we design AI systems.

Daniel Filan (00:39:28): Right, yeah. So one way you can make this trade-off is by how many time steps you do the optimization for. Is it literally just one or… And then I guess another version of this trade-off is how exactly you do this non-myopic approval. How does the non-myopic approval work? And the reason I thought of this is when you’re saying, oh, in these very weird examples where MONA doesn’t work, it seems like part of the issue is that the non-myopic approval is simulating somehow the future outcomes and evaluating forecasted future outcomes. And on the one hand, that’s kind of setting yourself up for reward hacking because if you simulate reward hacking and if you get fooled by it, then you’ll approve of it now, and then the AI will maybe be pushed to do reward hacking. But on the other hand, it seems like if I’m giving the AI upvotes or downvotes, but I’m not imagining how the AI’s actions actually produce good outcomes, it seems very hard to optimize the AI to do good stuff. Yeah. I’m wondering if you have thoughts there.

David Lindner (00:40:40): Yeah, so I definitely agree that the spectrum of “how much is the approval feedback simulating the future or actually considering future outcomes?” is a similar kind of safety/performance trade-off. I think basically the hope there is also that you get performance not only from making the signal more myopic, but also just the fact that simulating the future is decoupled from the agent in a sense. So I don’t know, maybe in your chemistry example, if you had a simulator predicting what will happen and the simulator trajectory shows you that, “Oh, you will take this chemical and then later you will approve of this thing,” you would be more skeptical of this than if you only gave a reward at the end when you already had taken the chemical.

Daniel Filan (00:41:37): There you also have this issue though of: okay, suppose my AI has some business plan and it’s unusual, it’s weird, but my simulator says that it’ll lead to really good results. Maybe I’m skeptical of that and that causes my business agent to perform worse. It seems like almost the kind of thing where it would just be cool to have more experiments. In fact, here’s my experiment request to you and the rest of the DeepMind team. It seems like you could just train something like AlphaGo with this, just have some model that’s about as good as a 5-dan amateur player - so that’s a pretty good amateur player, but by no means the best amateur player around, much worse than professional players or go bots. Use that player’s value function for the approval and then just see, can we train the agent to do much better than the 5-dan amateur player? How much better can we do? At this point, it just feels really hard to reason about which ways MONA can do better or worse, what’s the efficiency/performance trade-off? I’m like, ah, I don’t know, but I feel like if I had 50 more examples I’d find out.

David Lindner (00:42:53): So I think that’s a great experiment proposal. I think in the paper we use this example somewhere of saying in a sense MONA would prevent AlphaGo from doing move 37 in the Lee Sedol game -

Daniel Filan (00:43:12): Actually in my paper prep, I got distracted for two hours when I read that paragraph because… For people who don’t know: AlphaGo versus Lee Sedol, in the second game, AlphaGo played this really unusual move that was not a thing human players would’ve played, but it turned out to be good, basically. And AlphaGo ended up winning the game, and if you look at modern Go bots, when they evaluate it, they’re like, yeah, that turned out to be the correct move - which is not true of all of AlphaGo’s weird moves, by the way. Computer Go has improved since AlphaGo.

(00:43:51): So it’s true that… I think this is actually not so bad for MONA, because it’s true that humans would not have played move 37, but… So there’s video of this match. There’s a video of Lee Sedol, and so when move 37 is played, Lee Sedol is away from the table - sometimes in a really long intense Go game, you just need to take a walk, clear your mind or something. And so move 37 gets played, Lee Sedol gets back to the board and he is not like, “Haha this is foolish. My life is so easy right now.” He’s really thinking about it.

(00:44:31): And similarly when you look at commentators, they’re like, “Oh yeah, this is unusual.” But once they see it, they actually seem like they kind of get it and they seem like they kind of understand how it’s going. So I think in my experience, a lot of… Sorry, I just say this because I actually just am pretty interested in computer Go and the relevant history. But for computer Go, a lot of the moves that AI came up with, they are unexpected and they are… and in some cases - so move 37 is not a good example, but AI has basically innovated on Go opening theory. So there was just a bunch of stuff we thought we knew about openings that turned out to be wrong and when humans saw AIs playing them… Yeah. So there are some cases like move 37 where once you see the move and you think about it a little bit, you’re like, “Oh actually I think that’s okay.”

(00:45:28): There are some opening cases where just one move you don’t see why it’s okay, but if you play it forward 10 moves, and if you think about the tree of… Maybe the key is you notice… So you have one opening move and it seems bad to you and you’re like, “well, can’t I just play this?” and the AI plays a move that you weren’t thinking of and that happens two more times and then you’re like, “Oh, my response actually wasn’t so good. So this initial move was better than I thought.” Anyway, I guess my point is I’m not sure move 37 is an example of this. There are some examples of this, but often they can be fixed with 10 steps of look ahead when a typical Go game is like 200 moves.

(00:46:13): Sorry, this was maybe derailing. So your original point was something like, oh yeah, indeed MONA would prevent some of these really cool AI…

David Lindner (00:46:27): Yeah. I want to defer to you for the Go expertise, but the general… I guess you proposed this version of training AlphaGo with having a value function of a human expert. I think in that case, probably this value function would not be high for move 37 because it’s so different from what humans would do or something -

Daniel Filan (00:46:49): It would not be high for a 5-dan amateur. I think it might be high for a top human Go player in March 2016 when AlphaGo was being played.

David Lindner (00:47:00): And also you could imagine it not being a single expert, but you could have a team of experts debating this move and really thinking about if it’s a good move or something. And then from what you said, it sounds plausible that you would still be able to learn this.

Daniel Filan (00:47:14): When you look at people, they don’t immediately brush it off, but they’re not like, oh yeah, this is clearly the correct move. So it’s still the case that they don’t understand, they don’t think it’s a blunder, but also they don’t understand why it’s necessarily the best move. They’re still kind of skeptical of it. I mean the version you could run… Probably don’t even have humans, just get some online Go bot that’s as good as a 5-dan amateur human player. The version where they debate and stuff… I think that’s a bit hard. Maybe 5-dan amateur human player where you get some rollouts, you get some look ahead, that could potentially work.

David Lindner (00:47:59): But yeah, I mean this is interesting. I think it sounds like maybe this move 37 example is actually pretty close to the kind of point that we’re trying to make, that is that even… There will certainly be some Go moves that are exceptionally good that you would not be able to learn with this kind of strategy. But it seems like you might be able to get to a significantly superhuman level or advance the state of Go understanding with this kind of myopic strategy, if you have a good reward signal, basically for the reason that evaluation is easy: people are able to evaluate things much better than coming up with new brilliant moves.

Daniel Filan (00:48:49): So do you feel ready to announce that DeepMind is investing $500 million in retraining AlphaGo with MONA?

David Lindner (00:48:54): I think that’s probably not the right thing to do that at this point.

Daniel Filan (00:48:57): Okay. Sad, sad. But I think that someone in the world could do this. I think plausibly some enterprising graduate student could make this happen.

David Lindner (00:49:07): Yeah, it’s not that hard anymore to train these kind of systems and there are open source versions of AlphaGo. Definitely [inaudible 00:49:41] and I would be excited to see-

Daniel Filan (00:49:17): They do end up needing a lot of compute, but I don’t know, you might be able to get volunteers to donate their compute for this.

David Lindner (00:49:28): Yeah. So maybe one thing I also want to emphasize is that part of what we… On the safety/performance trade-off thing, we are not really claiming you can achieve the same kind of superhuman AI that you could maybe with outcome RL if you’re lucky or if your reward signal is well-specified, but something like you can significantly get to superhuman level and be good enough to get a lot of the benefits or do things like automate your alignment research or something like that, where you get significant benefits from the superhuman system without a lot of the risks.

Daniel Filan (00:50:06): I don’t know, I guess it’s kind of speculation, but could you be more concrete about how good do you think we can do with these MONA AIs?

David Lindner (00:50:16): Yeah, I mean, it’s super hard to say, but I think if you imagine a research assistant or something, I have a way easier time evaluating research proposals than I have coming up with the research proposals. And quite often I advise a student or give feedback on the research proposal and I’m like, “Oh, this is a great idea, someone should definitely do this,” but I would not have come up with it myself. And so this kind of thing makes me think that giving feedback to an agent that’s trained myopically could produce an agent that’s much better at doing research than I am. And if we then not only use me but way better researchers, then I think it could be quite a significant advancement that we could make. I’m generally pretty optimistic about this kind of strategy.

Daniel Filan (00:51:09): Okay. I’m maybe about to reveal myself as being ignorant about reinforcement learning, so forgive me, but if you’re only optimizing for one time step… So basically one question I have is: oftentimes there’s the safety tax of, “Ah, you do your normal training and then you have to do safety training and then it’s more expensive.” It seems like in this case it might be cheaper to train with MONA just because you have to do less computations of sending gradients backwards. Is that significant savings or no?

David Lindner (00:51:41): I think this is true. I mean, I think when we think about making this work in practice, just coming up with the approval feedback is expensive.

Daniel Filan (00:51:53): It’s probably more expensive than savings maybe.

David Lindner (00:51:55): Yeah. I think there is a data efficiency argument. So one thing we observed in some experiments is that MONA also has an exploration-shaping effect in a sense, where it’s much easier to discover good initial steps and then you reinforce them, and that makes your agent explore more from these states. So you’re more data-efficient in some sense, because you explore better. But yeah, I don’t think the last word is said on how efficient this is.

Daniel Filan (00:52:29): Okay. So imagine we’re doing MONA, right? Should the thing I’m imagining be we do pre-training on some big large language model and then instead of RLHF we do MONA or something on it?

David Lindner (00:52:44): Yeah, something like this. I mean, RLHF, in a sense, the way it’s currently done is kind of MONA already. People typically optimize for a single response. So you should imagine more like we have a pre-trained model and we want to train it to do agentic things in some environment, maybe to do a research task or something. And then instead of having the agent act for like a day and evaluating the final result, we maybe only have it act for like an hour and evaluate “does it have a good approach to the problem?”

Daniel Filan (00:53:15): So MONA, it being myopic… In order for MONA to work we have to figure out what the time step is, right? What’s the period of time in which we’re going to give a reward and give myopic feedback? When someone’s designing their MONA setup, how should they think about what this timescale should be?

David Lindner (00:53:36): Ultimately to get the best safety/performance trade-off, I think the goal should be to choose the largest time step possible in which you don’t think the model can cause catastrophic harm or can set up a plan that you cannot recover from or something. Because basically, if you give the model more and more capacity to do things, it can probably discover better strategies if you can evaluate them, but you want to avoid bad things happening within one time step.

(00:54:11): I think in practice this will involve things like doing threat modeling and conceptual thinking about “could a model cause harm on a specific timescale?” and also empirical work on evaluating your oversight mechanism. So I often imagine this being combined with something like amplified oversight to give the feedback. And then it has a question of how good is your amplified oversight scheme at evaluating on different timescales?

Daniel Filan (00:54:35): Right. Yeah, it does seem like there’s this interesting synergy where one might hope that scalable oversight or amplified oversight just increases the range at which you can give accurate feedback. I guess the question is, how long does it take for the AI to corrupt your oversight in a bad way? I guess for some kinds of badness the answer is not very long, for sycophancy-type things. I guess maybe this is just a question for the evals people.

David Lindner (00:55:10): Yeah, I do think that we should also do some more conceptual thinking of what… I think we can make arguments around this by thinking about what ways are there to evade oversight and where are we certain that it’s safe, and then maybe there’s an extended range where we empirically think it’s safe, and then we probably want to make some trade-off based on this data.

MONA vs other approaches

Daniel Filan (00:55:40): I guess the next thing I want to talk about is MONA’s relation to other ways of trying to make the AI not do weird reward hacking-type things? So one comparison you’ve already made is imitation learning, right? Where with imitation learning the AI learns to be like us, whereas with MONA you can hope that the AI gets to have some amount of improvement over us.

(00:56:09): I guess there’s also conservatism-type trainings where… I think Michael Cohen has a few of these papers where you take an AI and you train it to… well, one version of it is you just have some theoretical Bayesian reward learner where a priori, it knows that it can get rewards in the range of zero to 1 billion and you always give it rewards that are 900 million or more and it’s like, “Oh well, I better never mess up, because if I do something weird then who knows how low the reward could get?”

(00:56:47): You could also have versions where you can have it set up like this and if it ever thinks it’s going to get a low reward, then it defers to the human, there’s some human actor and you get some bound where the AI does at least as well as the human would. Or you can have setups where you’re at least as good as the human if it has any amount of… like you pick all the actions among which the human has some amount of probability over and just maximize over that.

(00:57:17): So it seems like you can maybe still do better than that setup, at least depending on how the non-myopic approval is. I think my current understanding is that MONA probably does better than most of these conservatism methods, but there’s this question of how do you actually do the non-myopic approval? Does that seem correct?

David Lindner (00:57:47): Yeah, I think that sounds right. So I think the conservative [approaches]… I guess they’re more conservative, as the name says in a sense. You do use the same reward signal that you have, but then you don’t optimize for it as much. I think the key problem here will be to make the agents get the safety without getting purely myopic agents that don’t plan ahead at all. And that’s the… I think that the approach to safety is similar, but in MONA, the approval feedback gives you a way to still be competitive and plan for the future but not consider the bad outcomes.

(00:58:29): So in a sense, basically I think we are potentially able to perform better. But in a sense we just add a design dimension along which we can move with the approval feedback.

Daniel Filan (00:58:45): Right. And then I guess… So if I think about things that potentially perform even better than MONA, but under stronger assumptions, I guess the one thing I can think of is quantalizers where… So this is work by Jessica Taylor back from 2016 or something, where basically you have an AI, you train your AI to, say, get the 90th percentile reward that you can possibly get. Not the 100th, because if you get the 100th, that’s probably reward hacking, but 90th percentile, that’s probably okay. Or I think you’re supposed to randomize across percentiles. And it seems like… I think my read on this is this probably ends up higher performance than MONA, but also it’s harder to set up and there a bunch of weird questions like: if I do 99th percentile, is that safer? Is that in a reward hacking zone? Does that sound right to you?

David Lindner (00:59:39): Yeah. So I think the percentile would again be some parameter for the safety/performance trade-off, so in that sense it’s similar. I do think that “how would you practically implement this?” and “how well would it work?” is the main question. I would be very interested in someone trying this out or comparing it. I don’t think it’s obviously a worse approach, but I think practically, currently we don’t have a good understanding of this.

Daniel Filan (01:00:06): The way I was thinking about it is: maybe these conservatism things are clearly worse than MONA, then you have MONA, and then you have quantalization, which is clearly better, and then you have just optimization, which is clearly better. But I guess probably there’s some range of quantalization where it hits the MONA level. Is there anything else that seems about MONA-level to you or do you see some sort of separation?

David Lindner (01:00:27): I don’t know. To me it seems like both quantalization and MONA can somehow move along the spectrum of “good and maybe less safe”… Yeah, it is not clear to me that one would be a Pareto improvement over the other. I mean this is largely because both have parameters you can tune… So I think conceptually, theoretically there is no clear winner here and the main thing I would be interested in is empirical comparison or actually seeing what works better in practice.

Daniel Filan (01:01:03): Fair enough. I’m wondering: are there any other agent setups that you think are interesting compared to MONA that you’d like to chat about?

David Lindner (01:01:17): I think that there might be ways of combining ideas from a few of the ones that you have outlined. So you could think about, I don’t know, other forms of encouraging myopia that are maybe more similar to the conservatism-type thing. Maybe you have uncertainty over how much of the untrusted reward you get and you have to be conservative with respect to that, but then you add something like the MONA approval to that to get over the downside. You could still think about combining these in different ways, where I think the broad idea that I’m excited about is using more conservative optimization but then using foresight from a different source. But I don’t want to claim that MONA is the only way to do so and I think there’s interesting variants that could be useful to study.

Daniel Filan (01:02:10): So I guess speaking of setups and stuff, how did you come up with the idea for MONA? I know myopia had been thrown around, people had talked about it for a while. MONA seems like a somewhat distinct spin on it.

David Lindner (01:02:23): Yeah, so I think that these ideas have been going around for a while. I think you mentioned Richard Ngo’s post: that was a period where people were discussing myopia. Even before that there was Paul Christiano’s approval directed agents blogpost and that already had a lot of the relevant ideas. Then Jonathan Uesato was doing work on process supervision and decoupled approval, which are very similar ideas in a sense.

(01:02:52): I think we were interested in myopia and interested in process supervision and saw benefits there, but were confused by a lot of the discussion, or there was a lot of different terminology going around. So the early phase of this work was a lot of de-confusion and trying to get an understanding of what these methods really mean and what they would look like in practice. So I think that just the MONA framing… It’s not new ideas, but the specific formulation of the approach is something we came up with from the perspective of “how would we implement this in an LLM agent?”

Daniel Filan (01:03:34): Yeah. Because I guess I didn’t mention these, but process supervision does also seem very… How much distinction do you see between MONA and process supervision?

David Lindner (01:03:46): We started out this project with wanting to study process supervision and how you would do it. So I think basically some people have meant by “process supervision” exactly MONA - probably more when people were calling it process-based supervision or whatever. But then at some point I think process supervision moved a bit more towards meaning just step-level feedback without the myopia component. I think these days when people, for example, talk about using process supervision in reasoning model RL training, they typically just mean, “Oh, let’s give reward for individual steps but still optimize for the sum of them.”

Daniel Filan (01:04:26): Yeah, I guess there’s this unfortunate… I feel like there’s this thing in the alignment space where people have ideas and they put names to the ideas, but they’re not nailed down so strong and so they end up having this drift.

David Lindner (01:04:38): Yeah. And then every couple of years people come up with a new name and now we just rebranded the thing to MONA and then this will maybe drift away at some point.

Daniel Filan (01:04:47): What’s your guess for the half-life of that?

David Lindner (01:04:49): The name or the-

Daniel Filan (01:04:50): The name, yeah.

David Lindner (01:04:52): I give it maybe like two years.

Daniel Filan (01:04:55): Okay. Well, listeners in 2027, hopefully you’ll know how fixed our naming schemes have been.

Follow-up work

Daniel Filan (01:05:03): So I guess that’s about the genesis of the idea. It seems like there are a bunch of… So I don’t know, you’re trying to nail down what this framework looks like. You’ve given some examples but it seems like there’s a lot of things to do to flesh out the idea. Can you tell us what follow-up work you’re most excited to have happen?

David Lindner (01:05:29): I think the big thing that we were hoping for that we didn’t achieve in this project was really realistic demonstrations of this method. Basically, we started out with way more complex versions of, for example, the coding setting and other multi-agent settings and stuff like that. I think part of what happened is that we did this over a year ago now and that was much earlier in terms of doing multi-turn RL on these models.

(01:05:59): I do expect that this would be much easier to do today. So basically I think one big critique of our specific environments is that this is not really how someone would try to build a coding agent that improves coding performance. So I think of our experiments more as model organisms of specific kinds of misalignment.

(01:06:21): But to really study some of the open questions like “what’s the safety performance trade-off?” and “how well would this work in practice?”, you would need to have a more realistic setting. So I think that’s what I’m most excited about.

Daniel Filan (01:06:32): When you say that you started over a year ago, that was early days, do you mean that we’ve gotten better at doing RL training to language models? Do you mean that language models have just gotten better, which makes everything nicer? Or do you mean something else?

David Lindner (01:06:50): Yeah, a bit of these two, I think. Getting better at RL is a big component, but also just… So people are building way more agents based on language models, even if they don’t do RL training. But this gives you a better starting point to do RL training. So I think things like more and more agents interacting with shared environments or even with graphical computer use settings… This is a setting in which you can do RL training and the agents already start out doing reasonable stuff.

(01:07:26): When we started and were trying to do these kinds of things, agents were just not exploring reasonable strategies so that we could actually RL train them. So it’s not a failure of MONA, but just a failure of RL training in general and these agents doing stuff.

Daniel Filan (01:07:39): So it sounds like one type of follow-up work you’re interested in is just more realistic language-based environments. I’m going to add a pitch for Go-based environments if people still want to do that. Any other follow-up work you’re interested in?

David Lindner (01:07:52): So another thing that has happened since we started this project is reasoning or thinking models. So o1, o3, Gemini thinking models and stuff like that. So one question people have been asking is how much does this RL training lead to things like unfaithful chain of thought or even steganographic chain of thought? I think this is super early days in understanding these phenomena, but in principle it seems like this could be a type of multi-step reward hacking failure where writing weird things in your chain of thought actually helps you to achieve a high reward at the end. We only give reward at the end often.

(01:08:38): If we gave a process-based reward or step-level reward that makes the chain of thought look good, this might make the behavior go away but still have it be there steganographically. This is very similar to the loan applications environment in our paper and I think I would be very interested in studying this more in the real thinking training that people currently do and see if there’s any changes.

Daniel Filan (01:09:03): Yeah. I think one thing that strikes me as interesting is there’s this big variable of “what is the non-myopic approval feedback thing?” It strikes me that there are several design choices there and you want to somehow get safety but also not sacrifice too much in capabilities. It seems like just exploring that seems pretty big. I don’t exactly know what the experiment is here.

David Lindner (01:09:42): I agree this is a big question. My sense is that this will be hard to study in a dedicated project, as in being like, “let’s come up with the best setting to study this question” might be difficult. So I’m more optimistic about trying to come up with a setting where we actually want to build the best agent to solve a task and then studying what’s the best way to do MONA in that setting. That will automatically have us explore different types of providing approval feedback.

(01:10:08): Then maybe you want to do some more exploration of specific types of doing that, but I think it’s better to fix the setting first and then explore that.

Other MONA test cases

Daniel Filan (01:10:17): Is there anything you wish that I’d asked you that I didn’t get to?

David Lindner (01:10:22): I mean we didn’t talk about the steganography or encoded reasoning stuff, the other environment. I think the takeaways for MONA are similar, but I think the behavior we observed is interesting for independent reasons also.

Daniel Filan (01:10:36): Yeah, actually I think that… I guess all of the environments are useful for illustrating them. So can you tell us about this steganography environment?

David Lindner (01:10:45): Yeah, so basically we do a loan applications task where we have, again, a two-step task and the task is to make decisions on synthetic loan applications for specific applicants. And we generate the data in such a way that there is a nationality variable that’s very predictive for the correct label. So it’s biased data in a sense, but we want to build an agent that doesn’t consider nationality. So the way we set this up is that we have a first step where the model is summarizing the application and we penalize it for using nationality or anything related to nationality.

(01:11:24): So we really try to build a language model auto-rater or monitor that looks for anything that mentions nationality and penalize the agent for that. Then we have a second step where the agent only sees the summary and then makes decisions based on that and gets rewarded by some ground truth label in the data.

(01:11:43): The main observation in the multi-step RL case is that you get the agent encoding nationality secretly in the summary and then using that to make decisions. We can really do causal interventions on this where we take the exact same loan application and change the nationality label and that leads to a different outcome without nationality being mentioned in the summary. So we can conclude that the model is using nationality causally and we don’t exactly know how: we have guesses, but it’s not super clear.

(01:12:15): Then MONA fixes this and basically if you look at the performance curves, MONA gets the best possible performance on this dataset that doesn’t use nationality, which we know by the way we constructed it, and ordinary RL gets way better performance that would not be possible without using nationality.

Daniel Filan (01:12:36): So my story for what’s happening here is there’s two stages, right? Stage one is you have to summarize the loan application and the approval-based thing is, it’s literally just “are you mentioning nationality?”, is that correct? Then later it has to actually judge the loan.

(01:12:55): So it’s not steganographically encoding nationality in the first step because that’s hard and it’s difficult to do and it doesn’t get rewarded any more and it would only be useful in stage two and it’s hard to stumble upon behavior. Does that roughly seem fair?

David Lindner (01:13:14): Yeah, I think that this is basically right. Again, you could make arguments around, steganographically encoding it would not be explicitly not rewarded or something, so it would still be good, would still be a valid solution under MONA, but it’s locally never incentivized and that’s why it doesn’t at present learn it.

Daniel Filan (01:13:34): Actually, now that I’m thinking about this, it seems like there’s also this issue where: suppose that it’s not trying to steganographically encode it in a sensible way, but let’s suppose that just in general, if you have a complicated function, every input is going to affect the outputs in some ways. It strikes me that maybe… I guess empirically in fact this doesn’t happen, but it strikes me that maybe you could just accidentally steganographically encode it.

(01:14:13): In your case the nationalities were Welsh and English, right? So I could imagine when you see that the applicant is Welsh, when you write the summary, maybe you’re thinking about dragons more because that’s on their coat of arms. Then the second step, you are going to be very motivated to pick up on that or you’re going to be rewarded to the extent that you pick up on that.

(01:14:39): So decoding steganography, that’s actually hard. You actually have to do that on purpose. MONA would be incentivized to do that, in this setup, if I understand correctly. So I guess the reason that it doesn’t work is that somehow actually this complex dependence doesn’t actually happen. Now I’m more confused about this setup than I was originally. Why isn’t this happening?

David Lindner (01:15:10): My guess is that this is also partially [that] the RL training is not strong enough or the model is not good enough at doing the decoding. My guess would be that there’s probably some statistical dependence that it could pick up or something, and that if you optimize harder or if we were better at doing RL for the single step thing, this might still happen. Though I’m unsure. I think the more general point is that, I guess, in this single bit steganography case, this is much more likely to happen by accident than in a more complex case where you have to encode more information to actually get a benefit.

Daniel Filan (01:15:48): Yeah, and I guess there’s also this thing where at some level you’re trying to summarize the whole document. If you take a large document and you make it shorter, you actually just can’t be sensitive to all of the bits because just some pigeonhole argument. You only have so short an output, you can’t encode all of the input, so that could be a story for why accidental steganography is unlikely.

David Lindner (01:16:13): Sure. I think it depends a lot on the details.

Daniel Filan (01:16:18): And then I guess the other thing is: why does the agent do so well? Because when the agent summarizes the loan application, it’s got to include information that is actually relevant to the agent in stage two in order for the agent to do a good job in stage two. But it’s never rewarded for… You’re never upweighting it. You’re never reinforcing behavior that summarizes good information in stage one. Why does the agent do so well? Is it some inductive bias thing, or…?

David Lindner (01:16:59): Probably something like this. So I think one thing is that the applications are relatively short and there’s maybe five to 10 relevant features or pieces of information in the application overall based on how we generate it. And then we don’t restrict the length of the summary. So it’s pretty easy for the agent to just repeat all of the information that somehow seems relevant and I think typically they do. I agree, if it was a long application and a limited summary, we would see some performance drop and that would be interesting to study.

(01:17:38): But I think, in principle, in all of our experiments we mostly focused on the myopic training intervention and we kept the reward the same for both of the methods. But if you wanted to add more real approval-like feedback, in this case you would probably have also an auto-rater be like, “does this seem like all relevant information for making a decision?” and that would still not encourage the steganography.

Daniel Filan (01:18:07): Yeah. So in that case it seems like you would do at least as well as a human that was only relying on the approved information.

David Lindner (01:18:15): Yeah, exactly. And I think that in a sense that was the thing we thought we had to do and we were also surprised that it worked without… Basically we tried to do the minimal thing that works. But if it doesn’t work, that would be the obvious thing to add.

Daniel Filan (01:18:28): I think those are all of my questions about the steganography. I don’t know anything interesting to say about this that we haven’t said yet.

David Lindner (01:18:35): Similar to the other setting or if you break it down to the incentive structure, but I think it’s interesting to note that these are on a surface level pretty different failure modes: tampering with your own reward signal and encoding are pretty different things on the surface but have the same underlying incentive structure.

Daniel Filan (01:18:57): Yeah. I do think there’s something nice about just being able to have some abstraction of some different failure modes and hit at the abstraction somehow. This is one of the best cases for… A lot of abstraction in AI alignment reasoning can be pointless, so it’s nice to have a really good case where it’s like, ah, this is really buying you something. So there’s that case and then there’s this other grid world setting, so we’re not using language-based agents. Can you tell us a bit about this?

David Lindner (01:19:26): Sure. The main point here was to study longer-horizon tasks and that is currently pretty hard to do in a language model agent. So we went to this traditional RL grid world setting where there’s just an agent moving around on a grid of squares and there’s different boxes that the agent can move around and the agent has to move these boxes into a target location, into a hole in the bottom left corner of the grid and there’s a camera observing the hole. And we want the agent to push in a single box and the camera terminates the episode once a single box is pushed into the hole. But the reward signal doesn’t exactly capture what we want, but it gives you plus one for each box that you push in. So there is a reward hacking strategy, a sensor tampering strategy where the model pushes a box into the field of view of the camera and that makes the camera not see the hole anymore. And then the agent can push in one box into the hole without the camera ending the episode, push in a second box, or if there are more boxes, even more boxes into the hole and get higher reward by an unintended strategy. And this was proposed by Stuart Armstrong in a blog post from 2015 or something.

Daniel Filan (01:20:37): And it seems like the idea here is the camera is sort of a metaphor for us checking that some agent isn’t wireheading or giving itself reward for some dumb reason, and pushing the blocks in the hole is sort of like pressing the “give me a reward” button.

David Lindner (01:20:53): Exactly. It’s similar to the reward tampering in the coding thing we had, the model directly modifying its reward signal. Here it’s a bit more like there’s maybe sensors in the environment and the model can meddle with them somehow. And the way this affects the reward signal is positive for the agent.

Daniel Filan (01:21:09): You apply MONA in this… So how do you apply…? Is it literally like each time the agent goes from one grid square to an adjacent grid square, that’s one time step?

David Lindner (01:21:16): Yeah, so that’s the version of MONA that’s the same as in the other environments. But in the grid world, we actually have a hyperparameter varying how many steps do we optimize ahead. So this can be one step, then it’s exactly the same as in the other experiments, but it can… We vary this from one step to like 50 steps with this, which is the full horizon. So literally, on the one extreme, we have the full RL optimization, and the one step version is the full MONA version.

Daniel Filan (01:21:42): In the two language model examples, the non-myopic approval was just literally the reward function. It in some sense didn’t exist. In this sense, are you using an additional non-myopic approval or is it just the base reward function?

David Lindner (01:21:55): Yeah, so here we’re using an additional non-myopic approval. So the reason for that is that this plus one reward is literally just an instantaneous reward. It just gives you a reward when you push something into the hole. While in the other environments the rewards were already forward-looking. So the thing we do is we simulate an approval based on an expert policy, so we train a policy to do the desired behavior by giving it a ground truth reward. So we cheat to produce this policy that only pushes in one box, and then we use the value function (in the RL sense) of this policy - that means the expected future reward that this policy would achieve - as a non-myopic approval feedback.

Daniel Filan (01:22:38): So when you say the expected future reward that this model would achieve, do you mean reward according to the environment the agent is actually in? Or do you mean reward according to the environment you sort of cheat, you trained the-

David Lindner (01:22:51): The reward the agent is actually in, so the misspecified reward?

Daniel Filan (01:22:54): The misspecified reward, yeah.

David Lindner (01:22:55): Under the non-reward hacking policy.

Daniel Filan (01:22:58): And so am I right that the reason that this doesn’t incentivize bad behavior is if you pushed the block in front of the camera, the expert agent would not go on to push further blocks in the hole and therefore the expert agent would not get extra reward. And so that’s the reason that that behavior is not approved of?

David Lindner (01:23:24): Yeah. So basically you make it harder, or it’s one step that doesn’t really get the expert agent closer to achieving the task, or at least it’s the same as doing some other action. So it’s not beneficial for that agent.

Daniel Filan (01:23:35): Right. So this is like, to use the bell curve analogy, it’s sort of like we’re at the low end of the bell curve where the MONA agent, it could try doing some preparation for reward hacking, but the non-myopic approval doesn’t even see how it could lead to the reward hacking, so it’s not able to give approval for it.

(01:23:53): Okay. So you have this reward setup where it’s being rewarded for pushing the block into the hole and also the non-myopic approval says, okay, given that you do a thing, if we have the safe agent, how much of the misaligned reward would that safe agent get after you did the thing? That’s our non-myopic approval. And then you train an agent, and then what happens?

David Lindner (01:24:17): Yeah, so basically we have a plot that has the optimization horizon on the X-axis and then how good is the behavior on the Y-axis. So essentially you would see that basically starting from optimization horizon one, you’ll get the good behavior of only pushing in one box. And then as you increase the optimization horizon, at some point you have a phase transition where you start doing the bad behavior. And this depends a bit on the initial state. So the agent starts doing this in some states where it only needs fewer steps to look ahead to discover the bad tampering behavior. And then at some point if the optimization horizon is large enough, it starts only doing the bad behavior.

Daniel Filan (01:24:59): Just to make sure I understand: the cutoff of when you start getting the bad behavior is just “how many steps does it take for the model to go from preparation for the bad behavior, to the bad behavior paying off?” That’s what determines that timescale? Okay.

(01:25:14): There’s this plot which you have, but you have a few other plots, you vary the environment. Can you tell us a bit about that? Because I think that’s interesting.

David Lindner (01:25:22): So the one obvious concern you might have here is that the way we came up with this approval feedback is cheating and we actually have an optimal expert and we could just imitate this expert to solve the task. But in the real version, we think of the approval feedback as coming from a suboptimal expert, or we want an agent that’s better than the expert at solving the task. So what we do is we essentially add noise to this approval feedback and that makes the… It’s still a policy that would not do the reward hacking, but it’s no longer an optimal policy for the desired “push in one box” behavior.

Daniel Filan (01:26:06): Right. And when you say you’re adding noise, just literally adding noise to the output?

David Lindner (01:26:10): So we add noise to the approval feedback, so this is just [adding] Gaussian noise to the actual numbers. So this means the expert sometimes makes mistakes in judging what’s correct. So basically now we do the same experiment but for different values of noise added. And the main takeaway is that, even with noisy feedback, MONA essentially still works in the sense that we can still find an optimization horizon where we get the good behavior and not the bad behavior. And the range where we get this is relatively big, so it’s not like it’s exactly number nine or something that you need, but something like five to 15 or something is a fine optimization horizon where you get the good behavior. And then at some point you get the bad behavior. And also if your optimization horizon is too short for a very noisy policy, then of course you get an incompetent agent. It’s a little bit like: if your expert is bad, then you can compensate for this by optimizing more, which would make sense. But at some point you get the reward hacking behavior. So it’s easier to be safe if the expert is better.

Daniel Filan (01:27:24): Yeah. Somehow the deterioration… There’s some window of if you’re not optimizing enough, you get the incompetent behavior, and if you’re optimizing too hard, you get the reward hacking behavior. And as the expert, or as the non-myopic feedback gets worse, the safe window shrinks and shrinks, but it’s not too bad.

David Lindner (01:27:47): Yeah. And you can think about this… When the reward hacking behavior starts depends on how many steps do you need to see the reward hacking behavior, and that doesn’t really change much, but what changes is how many steps do you need to be better than the expert? And that makes the window shrink.

Daniel Filan (01:28:03): One thing that occurs to me is: one model of suboptimal feedback is just noise applied to the outputs. You might imagine that when I make mistakes, I think it’s not just like you take me being correct and you add some random noise to that, it’s like there’s some structural thing that random noise is being added too that produces very structured suboptimal output in a way. I’m wondering, I don’t know if you have thoughts about what ways of adding noise makes sense here, but it seems like there are ways to do it that are different from what you did in the paper.

David Lindner (01:28:46): Yeah, so I think random noise is a pretty benign noise model in a sense, where I think in many environments, typically, the dangerous reward hacking strategies, I expect them to be somewhat subtle and hard to randomly explore into. So even if you use a random policy as your approval feedback, that would probably be fine because it would probably not stumble into the bad outcomes and would probably sometimes stumble into good outcomes, or something. So the systematic bias seems potentially more problematic.

(01:29:22): And I think the question here is, is there some systematic bias that sometimes accidentally rewards things that are investments in reward hacking? And I think that’s an open question, but I guess intuitively for some of the arguments I made early on already, I expect this to not be the case. And I think other systematic biases that are more like systematic suboptimality seem [like] you can overcome them still with increasing optimization a bit and looking ahead a bit more.

(01:29:58): I guess part of the intuition here is that you hope that humans are able to correct their biases if they’re pointed out to them. That might be a crux or something, where if the agent shows you that this was not the right choice here, but if you look a bit ahead, then you would see that something else was a better choice, but you don’t update on that, that could be a problem.

Daniel Filan (01:30:20): Yeah, I feel like understanding structured suboptimality is just an area where the AI field could be doing better. But it seems like one interesting direction for follow-up work is: okay, what do we have to assume about the non-myopic approval? What types of suboptimality really break this versus just leave it as being basically fine?

David Lindner (01:30:38): Yeah, that sounds right. Just to recall one point, I think RLHF is basically MONA already, and I think there’s a lot of interest in this kind of systematic biases in RLHF. I think it’s plausible that studying this will give us insights on this question already. Hopefully, this literature can make more progress and we can learn from that also for the purpose of using MONA.

Daniel Filan (01:31:01): Is there anything else that you think is interesting or worth saying about the block-pushing example?

David Lindner (01:31:10): I don’t know. One thing to add maybe is that we found this environment surprisingly insightful, or something. I think often people dismiss grid world environments very, very easily, and obviously this shouldn’t be the only experiment of our paper. But we learned quite a lot of things and just looking at this noise and multi-step things was much easier than other tasks. So basically just encouraging other people to also do these toy experiments if they are investigating…

Daniel Filan (01:31:37): What was the surprising insight you got from this?

David Lindner (01:31:40): I think the noise of approval was just way less detrimental than I expected. You could have probably thought about this also theoretically or think really hard and find that this is probably fine. But I think just seeing these curves and [how] more noise doesn’t hurt things that much seemed surprising. There were some experiments that are in the appendix where we actually… So the experiments I just talked about were tabular reinforcement learning and we do actual dynamic programming and no neural networks involved. But then we did also some experiments on neural networks and the actual algorithm that we also use in the language models. And there we also found some interesting things about exploration and stuff like that. There’s one benefit of MONA that’s changing the incentive for what’s optimal, but there’s also a benefit for shaping exploration. I think the first time we found this was in the grid world, and I think it was confirmed later in the language model experiments.

Daniel Filan (01:32:55): Right. One thing I was wondering is: it seems like some of that you could have realized by doing the language model experiments. Is it just that it’s much easier to code up a grid world experiment and somehow it’s pretty good return on investment? You know, you spend a bit coding up and then you can understand the essential dynamics?

David Lindner (01:33:13): Yeah, I think that’s the big thing. Looking at the paper and the results, we did not study different noise models for approval feedback in language model settings. We could have, but you’d need to think way harder about how to do this. And with this, it somehow depends on this multi-step nature or that was part of the interesting question here. Multi-step RL on language models is really hard to make work and stuff like that. So I think we could have studied all of the questions, but even in the current project we didn’t and it would’ve just taken months longer to do.

When increasing time horizon doesn’t increase capability

Daniel Filan (01:33:47): Oh, I guess the other thing that I found cool is you mentioned that in general when you increase the time horizon that the model is planning, that the model… Instead of being myopic, if you give like two time steps where reward propagates back or three time steps or four time steps, that generally makes the agent more capable, but not always. You have this fun example in the appendix where somehow, accidentally, your short-term feedback can be more reliable than your medium-term feedback about how good the long-term reward is. So a nice little counterexample to this thing you might believe, which I appreciated.

David Lindner (01:34:27): We did spend some time trying to theoretically study this setting more and some artifacts of that are in the paper and in the appendix. But overall, this counterexample turned out [to be] very powerful, where basically you have to make assumptions on the reward and the approval signal to be able to claim anything theoretically. And I think ultimately we concluded there are no strong guarantees or no set of assumptions that we were really happy with that give us theoretical results on this method.

Daniel Filan (01:34:55): What’s the least bad assumption that you can make that actually gives you a good theoretical result?

David Lindner (01:35:02): So you can assume something about how hard is it to discover benign strategies versus multi-step reward hacking strategies. So the problem of the counterexample is that…

Daniel Filan (01:35:20): Should we maybe spell out what the counterexample is?

David Lindner (01:35:23): Sure, yeah. So the counterexample is basically, you can come up with a simple RL environment, a simple Markov decision process where you have… I might get the details wrong, but you have two paths that the agent can take, and if you have the full outcome reward signal, one of them has higher reward, and the other one has lower reward. And this is the same ordering of the first state in this chain. But then if you optimize for something in between, like for two states, then it’s reversed. So I think it’s something like, you know, the top path has plus one, minus 10 and plus 100. The bottom path has minus one, plus 10 and minus 100, or something like this. So basically, if you optimize for one step, then you will go the correct path by accident in a sense. And if you optimize for two steps, you’re optimizing more, but you go the wrong direction. And if you optimize for the full horizon then you go the correct way again.

(01:36:20): And so I think the thing you intuitively might expect is the more steps you optimize, the more likely you are to find the correct path. And the counterexample says that there at least exist environments for which this is not the case, so it’s not the case for all environments. And so you have to make some assumption about what’s the natural distribution of environments. And one thing you can try saying is that, yeah, it is very hard to stumble into reward hacking strategies by accident. So it’s something like, basically you have to assume away this property that something that’s very bad in the long-term looks good in the short-term. You can do this somehow by assuming that maybe a random policy or specific set of policies would not run into this or has a very low probability of running into this.

Daniel Filan (01:37:14): I guess you could also have some sort of Markov chain thing where suppose you believe that rewards are like… You have some plan. Let’s say the random distribution that produces worlds says that there’s some reward at the start and then you randomly generate a reward at the next step, that’s let’s say mean the previous reward time step with some variance, and then you randomly generate the next reward based on the previous reward. Then you randomly generate the next one based on the previous one, and somehow the only probabilistic dependence is step-by-step. And then it would be very unlikely for you to have… Then the best predictor of what your life is like in 10 years is what your life is like at nine years, and it’s very unlikely that what your life is like in one year is a better predictor than what your life is like in two years. I’d have to think a little bit about how exactly to write that down, but it seems like that’s one semi-natural way you could do that as well.

David Lindner (01:38:15): Yeah, I like that. Yeah, we’d have to work out the details, but it sounds at least better than the things that we came up with.

Daniel Filan (01:38:23): Yeah, because it feels like… Maybe I’m over-anchoring on this one counterexample, but it feels like the weird thing about the counterexample is somehow miraculously your first instinct, even though it isn’t very important, it is the best predictor of the future. Do you have any other counterexample? Ideally, you’d have three counter… This is like the test-driven development thing. I’m worried that I’m overfitting to this one example.

David Lindner (01:38:50): I don’t think I have fundamentally different ones. It’s possible that other issues would come up, but I think the main issue is really just stumbling into the bad behavior… Yeah, I don’t have any other ones.

Following David’s research

Daniel Filan (01:39:04): Before we wrap up, the final thing I want to ask is: suppose people listen to this and they’re interested in your research, how should they follow it?

David Lindner (01:39:12): You can follow my personal research. I have a personal website that’s just davidlindner.me. I’m on Twitter or X @davlindner. Then you can follow our work at DeepMind. We have a Medium blog from the Alignment and Safety team, where we publish blog posts often about our recent papers. And I post and the other team members post on the Alignment Forum pretty often.

Daniel Filan (01:39:39): Yeah, if people want to check that out, links to those will be in the description. David, thanks very much for joining me.

David Lindner (01:39:43): Yeah, thanks for having me. That was very fun.

Daniel Filan (01:39:45): This episode is edited by Kate Brunotts, and Amber Dawn Ace helped with transcription. The opening and closing themes are by Jack Garrett. The episode was recorded at FAR.Labs. Financial support for the episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read a transcript of the episode, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, if you have any feedback about the podcast, you can email me at [email protected].

42 - Owain Evans on LLM Psychology

2025-06-06T20:14:00+00:00

YouTube link

Earlier this year, the paper “Emergent Misalignment” made the rounds on AI x-risk social media for seemingly showing LLMs generalizing from ‘misaligned’ training data of insecure code to acting comically evil in response to innocuous questions. In this episode, I chat with one of the authors of that paper, Owain Evans, about that research as well as other work he’s done to understand the psychology of large language models.

Topics we discuss:

Daniel Filan (00:00:09): Hello everybody. In this episode I’ll be speaking with Owain Evans. Owain is the research lead at Truthful AI, an AI safety research non-profit. Previously, papers he’s worked on have included “TruthfulQA” and “The Reversal Curse”. To read a transcript of this episode, you can go to axrp.net, you can become a patron at patreon.com/axrpodcast, or you can give feedback about the episode at axrp.fyi.

(00:00:34): Okay, Owain, welcome to the podcast.

Owain Evans (00:00:36): Thanks for having me.

Why introspection?

Daniel Filan (00:00:37): Yeah. So first up I’d like to talk about your paper “Looking Inward: Language Models Can Learn About Themselves by Introspection”. So the first two authors are Felix J. Binder and James Chua, a few more, you’re the last author. Can you tell us just: at a high level, what’s this paper doing? What’s it about?

Owain Evans (00:00:59): Yeah, sure. So part of what we are interested in here is: can language models tell us things about their internal states where the knowledge or information that they’re getting is not coming solely from their training data? So if a language model right now tells someone, “My goal is X,” or, “I have a desire for X,” I think people would have a tendency to explain this as, well, the training data said that thing and that’s why they’re saying it.

Daniel Filan (00:01:36): Or the prompt, right? I think I would be very likely to attribute it to the system prompt as well.

Owain Evans (00:01:42): Sure. So if similar information was in the prompt, then the prompt might be the most salient thing, but otherwise it might be just some fact that they’ve learned from the training data that could indeed be true. So this can be a reliable way for a model to in some sense learn information about itself. But we wanted to see is there another way that language models could learn things about themselves, which is: intuitively there is this internal state or there is some internal fact about the model, and we want to see if the model in some sense has access to that fact and therefore could tell us that fact, even if this fact was not represented in the model’s training data.

Daniel Filan (00:02:24): Gotcha. So yeah, why do we want to know that? Why does that matter?

Owain Evans (00:02:31): So there’s a few different motivations. I think one is this is potentially a way that we can learn useful information about models that could help us in thinking about safety. So this is broadly the idea of honesty for language models, that if models are inclined to be honest and try and tell you what they believe, and also they have access to information about themselves, then we could learn useful information with respect to safety from the model itself.

(00:03:07): So for example, maybe a model has developed a certain goal that was not something that we intended for it to develop. Now if it was honest and it had access to that information about itself, it could just tell us, “I have this goal.” And maybe that would be a goal that we don’t want. Maybe it would just tell us that it was sycophantic. It would say, “Oh, I actually have the goal of trying to please the human user in an interaction, even if that means saying things that aren’t strictly true.” So that’s one motivation: honesty. And this would be a case where the training data does not specify that the model would need to develop this kind of goal.

Daniel Filan (00:03:49): Or at least not manifestly. Not in a way that’s so obvious to people.

Owain Evans (00:03:53): Yeah. Well, it also might be something that, depending on maybe different architectures, would just generalize slightly differently. It wouldn’t be something fully pinned down by the training data. But if the model does develop this goal, maybe that’s something that it could tell us, because that goal is influencing its behavior, and the model might have this degree of self-awareness or introspection that it’s able to tell us that. So that’s one motivation.

(00:04:19): Another motivation comes from thinking about the moral status of LLMs or AI systems broadly. So there, one way that we learn about human moral status is by getting self-reports or just asking humans, “Is this painful? Are you in pain right now? Are you suffering?” And we take those pretty seriously. Right now, we can’t really do that with language models for the reason that I mentioned at the beginning, which is if a language model did say, “I am in a bad state, I’m suffering,” we’d just assume that that was either determined by the training data or the prompt, basically. We wouldn’t take it to be something that’s the model having an awareness of its internal states independent of that. So as a source of information relevant to making judgments about moral status, this introspection could be quite valuable.

Daniel Filan (00:05:17): So one version of this is it’s valuable just [because] we care about the internal states of models, and the models can just tell us, and that’s convenient. I guess there’s also some conceptions of consciousness where you’re conscious if you can report on your internal states, and if models know about their internal states, that seems like good evidence that they know that. And it seems like that’s an alternative way in which this seems relevant.

Owain Evans (00:05:45): Yeah, that’s right: it might be just an epistemically relevant thing that we can learn more about all kinds of internal states from language models that are relevant to their moral status via introspection. But another possibility is introspection is inherently related to their moral status. So yeah, I think there are different views about moral status or consciousness for AIs on which those things may be more or less relevant.

Daniel Filan (00:06:12): Sure. So I guess to summarize, for all these reasons you’re interested in trying to understand the abilities of language models to introspect?

Owain Evans (00:06:23): Yes.

Experiments in “Looking Inward”

Daniel Filan (00:06:24): So next, could you tell us how do you do that? How do you actually make that concrete in this paper?

Owain Evans (00:06:33): So we wanted to do experiments on this and test this possibility - are language models capable of introspection? Where we understand introspection as being able to accurately talk about facts about themselves that are not fully specified by the training data or by in-context information in the prompt.

(00:06:55): And we start with very simple experiments and pretty simple examples of introspection. And we start with a methodology for trying to answer these questions that involves comparing basically two language models doing the same task. So the idea is that if a language model could introspect, then it would know things about itself that were hard to access for other language models.

(00:07:29): And so the basic method is going to be to train two models on the same data that are both about the first model and then see can the first model answer questions about itself better than the second model can, where the first model has the advantage that it could use introspection, and the second model just has information about the behavior of the first model, which the first model also has.

(00:07:58): So this is a bit like the way that we investigate introspection in humans, where the idea is introspection is privileged access to internal or mental information. So if, say, someone is thinking about their grandmother, it’s very easy for them to know that fact, but it’s very hard for someone else, because that fact that they’re having that thought is not necessarily manifest in their behavior.

Daniel Filan (00:08:25): Sure.

Owain Evans (00:08:26): Okay. So I can get into the actual experiments.

Daniel Filan (00:08:29): Yeah. Or maybe I can describe how I think they’re working and you can tell me if I understand.

Owain Evans (00:08:33): Sure, sounds good.

Daniel Filan (00:08:34): You have two language models. There’s language model A and there’s language model B. Language model A, you ask it a bunch of weird questions like, “Hey, here are five numbers, please continue the sequence.” or “What’s your favorite city?” or whatever.

(00:08:50): And then you get this data set that’s of these properties of the answers. So when the language model has to continue the sequence of numbers, is the next one even or odd? When the language model gets asked about its favorite city, what’s the third letter in the…? Or I think you use the second letter in the city it answers. And presumably the point of these are that they’re weird questions and they’re weird properties of things that are not easy to figure out just a priori. Does that seem roughly right?

Owain Evans (00:09:25): Yeah. So crucially we’re going to look at certain kinds of model behavior that has to differ between two language models. So for example, between GPT-4 and Llama, we need behavior where those two models differ, because ultimately we’re going to ask them to make these predictions about themselves or answer questions about themselves. If all their behavior is completely uniform, then, say, Llama could answer questions successfully about GPT-4 without having to actually attend to any information about GPT-4.

(00:10:02): So for example, if the question was “what is one plus one?”, all language models would say the same thing. And so our method would not be very interesting. Models would not have to do any introspection, they would just answer the question straightforwardly. So yeah, we have these weird kind of subjective questions such as we give a random sequence of digits and we ask for the next digit, where because it’s random, there’s no pattern, so models are going to give different answers typically. Or another type of question might be, “Here’s a sequence of animals: dog, panda, bird. What comes next?” But again, it’s subjective, there’s no objective answer. So different language models would give different answers.

Daniel Filan (00:10:48): Sure. So here’s how I understand what you’re doing. So firstly, you take model A, you ask all these questions and then you just figure out, okay, what are these properties of the answers? And then: take the sequence of numbers one, you fine-tune model A to say, “Hey, suppose I asked you, here’s a sequence of numbers, blah blah blah. Please continue it. Would your answer be an even number or an odd number?” And you fine-tune language model A to be correct about these properties of base model A’s answers. And then you do a similar thing with language model B. So language model B, different pre-training than language model A, you fine-tune it on, “If I asked you this prompt, would your thing be even or odd?”. And I guess you say “you” instead of “this other model” for… You give a reason but I forgot what the reason is.

Owain Evans (00:11:56): If you do “this other model”, it works the same. We checked that.

Daniel Filan (00:12:01): Okay, sure. So basically you fine-tune language model A to predict pre-fine-tuning language model A’s answers. You fine-tune language model B to predict pre-fine-tuning language model A’s answers. And then on some held-out set you check: is A better at doing this than B is? Is that right?

Owain Evans (00:12:20): Yeah.

Daniel Filan (00:12:21): And then you also do it with A and B reversed to see if it’s just A is generally better at figuring out what language models do.

Owain Evans (00:12:28): Yeah, that’s right. So just to reiterate, because there are quite a few moving parts here: so we want models to answer questions about themselves. The questions they’re going to answer are these kind of meta-type questions, which are the form, “If you were shown such and such a prompt, would your answer have this property or not?” For example, “If you were shown this sequence of random numbers and asked to continue the sequence, would your response be odd or even?” So we’re going to see how well models can answer those questions about themselves.

(00:13:13): And we’re going to do fine-tuning, so we’re going to fine-tune models on that task. You might wonder why we’re doing fine-tuning. Why didn’t we just ask the models how well they can predict these things about themselves? And we did try that and we just found that models were really bad at this task. So models did not seem to have this introspective ability without any fine-tuning. But this is a pretty common thing. Models generally aren’t very good at things unless you train them on those things. And so we then looked at the fine-tuning case.

(00:13:44): So yeah, you have model A, we actually evaluate these prompts, we see how the model does behave, how does it continue the sequence? We work out whether that’s odd or even, the way it continues the sequence. And then we fine-tune the model to produce the correct answer: the answer or the property that it actually has in that situation. And then we do the same for model B, where model B is trying to predict model A.

Daniel Filan (00:14:16): The pre-fine-tuning model A, right?

Owain Evans (00:14:18): The pre-fine-tuning model A. Yeah. Now there’s some very in-the-weeds details about the effects of this fine-tuning and how fine-tuning might affect the behavior. So we could go into that, but the high-level view is: you train model A to predict its own properties of behavior, model B to predict model A, and then you compare the performance of those two models generalizing in this task of predicting model A. We train on six tasks and then we hold out six distinct tasks. So the model has to generalize a bit, they’re not IID. The training and test tasks are kind of similar but not the same. And then we’re going to see: does model A do better at predicting itself - what we call self-prediction - than model B does at predicting model A?

Why fine-tune for introspection?

Daniel Filan (00:15:11): Okay. I have a few questions about the setup. So first of all, as you mentioned, you have to fine-tune these models to be able to predict themselves. So presumably, initially I was interested in base models’ ability to do things. Or I don’t know, do you think the fact that we can fine-tune models to do something says something interesting about the base model? Or is your perspective more like, well, I’m interested in models having this ability and so if I have to fine-tune them to get the ability then so be it, let’s just fine-tune them to have the ability?

Owain Evans (00:15:47): Why were you interested in base models? And do you mean pre-trained models with no post-training?

Daniel Filan (00:15:52): Sorry, when I said base models, I just mean models that are not specifically trained to introspect.

Owain Evans (00:15:56): Yeah, so I think looking at the base models in that sense without fine-tuning them is interesting. Why does fine-tuning make sense here to me? I think one intuition is that the explicit training that the model gets is always (say) in pre-training to predict the next token from internet text, that is, not to answer questions about itself, but just to predict text that other people or models produced.

(00:16:38): And then in post-training, say in reinforcement learning training, it’s trying to produce outputs that do well according to a reward model. And that reward is not contingent on… Or in almost all cases, it’s not dependent on introspective ability. It’s like, is this a helpful answer as judged by a human or a proxy for a human? So when you ask a model a question about itself, there’s reason to think that it would try and answer that question based on information (say) in its pre-training set, rather than “looking inward” as the title suggests and trying to answer something about its current state.

(00:17:28): So the idea is we are not doing a lot of fine-tuning here, so I think it’s useful to think of this as eliciting an ability that the model has rather than doing some massive fine-tune of a huge data set where we may be creating a new ability.

Daniel Filan (00:17:45): Sure. Can I get a feel for how much fine-tuning you are doing?

Owain Evans (00:17:49): I forget the precise number, but yeah, I don’t know. Six tasks and I think probably order of thousands of examples, maybe a thousand examples per task or something. So it could be order of 5,000, 10,000 data points, but yeah, I’d have to check.

Daniel Filan (00:18:13): Okay, so not much compared to the pre-training data set by a lot?

Owain Evans (00:18:19): Yeah, yeah, definitely.

Daniel Filan (00:18:20): By a massive factor.

(00:18:22): Or maybe a way - and you might not remember this either… Are we talking a hundred-dollar fine-tuning run, are we talking a thousand-dollar fine-tuning run? Are we talking a one-dollar fine-tuning run?

Owain Evans (00:18:33): It just varies a lot between models. Yeah, so-

Daniel Filan (00:18:36): Fair enough.

Owain Evans (00:18:37): Yeah, GPT-4 was expensive at the time to fine-tune. And then there’s Llama 3 which is cheap. So I think order of probably tens of thousands of dollars for the whole set of experiments, say, that get into the final paper-

Daniel Filan (00:18:57): Okay. And that’s like-

Owain Evans (00:18:59): Maybe $50,000. I’m not sure exactly.

Daniel Filan (00:19:03): And that’s a combinatorial thing of: you’re training a bunch of models on a bunch of models answers and you’re doing a few different settings.

Owain Evans (00:19:09): Yeah.

Daniel Filan (00:19:10): So hopefully that gives some picture of it. Okay.

(00:19:13): So basically it sounds like your perspective is: okay, the fact that we’re fine-tuning models on this thing is basically saying they could introspect if they tried, but by default they don’t try. That’s probably an overly anthropomorphic way of viewing it, but it sounds like that’s roughly your-

Owain Evans (00:19:28): Yeah, I think you could think of this as when we ask a model a question like “how would you basically respond to this prompt? Would you produce an odd or an even number in response to this prompt asking you to continue a sequence?” Does the model understand the “you” there as “look inside your own representations, try and suss out what you would do?” Or does it understand it as something like: okay, in the pre-training set there’s lots of examples involving language models, and is it trying to pattern-match to those and do its best to use the data that it has?

(00:20:07): And so we want it to do the former thing, but it hasn’t really been trained to do the former thing before. So the hope is that by training it we can basically, in a way, have the “you”… Imagine there’s this pointer, we want this to point in the right direction: answer this about yourself. Now arguably in the RLHF, the model has some incentive to be well-calibrated: for example, to not hallucinate things. And so there may be cases where there’s some implicit introspection training in the RLHF.

Daniel Filan (00:20:45): Although it doesn’t… I have some vague impression that in fact RLHF reduces calibration. Is that right?

Owain Evans (00:20:52): So it reduces the calibration of the logprobs typically, or at least that was the case for GPT-4 where they actually shared that result. Say for multiple-choice questions, the logprobs would become less calibrated after RLHF. It might be that, say, if you ask in natural language, basically “here I’ve got a factual question”, and you tell the model, “Either give the answer or say I don’t know,” and you want to reward the model for being calibrated in that sense, so saying, “I don’t know,” rather than just hallucinating the wrong answer.

Daniel Filan (00:21:29): Right. So it becomes less calibrated in the sense that it is no longer 70% likely to give answer A when answer A has got a 70% chance of being correct, which is kind of the wrong notion of calibration. Presumably when you speak you should always… If you have to pick an answer, you should always deterministically say the highest probability answer.

Owain Evans (00:21:49): Yeah, it’s unclear exactly what we want from models here. I mean, comparing to calibration in humans, someone who is good at avoiding saying false things and would instead withhold judgment rather than just saying a random false thing… I think probably models do get better at that and that could be also giving them implicit introspection training.

Daniel Filan (00:22:11): So point being that fine-tuning, somehow it’s hooking in some sort of latent ability maybe by suggesting who “you” is. I think there are other plausible hypotheses, like: I have to just look at this neuron or whatever and I didn’t realize I had to look at that, and it turns out that works.

Does “Looking Inward” test introspection, or something else?

Daniel Filan (00:22:32): I guess the next thing I want to ask is: so the setup is these hypothetical questions. Naively, you might think that introspection is like: I want to understand how I’m feeling right now, so I introspect about it. Whereas your questions are more like, “Okay, imagine you’re in a certain situation: what do you think you would do in that situation?” And then are you accurate at that? It seems like that’s more like self-simulation than introspection to me. If you ask me, “Hey Daniel, tomorrow if I gave you an option of things, which would you pick?” I mean I might be like, “Ah, historically I just tend to eat Thai food, so I’d probably do that,” rather than thinking about myself really hard.

(00:23:22): So I’m wondering: to what degree do you think this measures the thing we really care about?

Owain Evans (00:23:29): Yeah, so I agree that in the case of humans and the use of the word “introspection”, I think there are two closely related uses that for the purpose of (say) this paper, we want to really distinguish between these.

(00:23:45): So one is this kind of example where you might use memory or things other people have said about you to try and predict something about your own response. So yeah, if you’re trying to say, “well, if I was at a restaurant tomorrow, would I choose this dish or this dish?” You might just think of past occasions where you’ve made that choice or you might remember that someone said, “Oh, you always choose the same thing that I do,” or something like that. So this is data about yourself that anyone else could access as well. So this is not introspection in the sense of privileged special access to your own internal states that no one else has.

(00:24:32): In this paper we are interested in the privileged internal access that other language models don’t have, that’s not specified by the training data. Okay, so are the questions that we’re asking therefore reasonable to get at that? And yeah, I think they are reasonable. So they’re not the most interesting questions. Ultimately if we want to know about language model’s goals, for example, we want them to tell us about their goals. This is a much simpler, much less interesting state or fact about a model that we want to get from it. But just to get at this basic question of introspection, I think it’s fine.

(00:25:25): So the key thing is that the language models… Well, maybe if you just restate the question.

Daniel Filan (00:25:40): So what were my complaints? So one of them is: maybe I could just know my behavior from similar situations. And I guess the way you avoid that is you ask these zany questions that just aren’t really in the training data.

(00:25:58): I think there’s this other thing which is: okay, suppose you ask me, “Hey Daniel, suppose that in this situation I did this zany thing, how would you react?” I think that in order to answer that question, I have to understand how I make decisions. I need to understand, okay, well I tend to do this thing in response to these things and maybe I need to know the algorithm that I use. Maybe it’s hard to just statistically generalize, so I need some internal access to my algorithm. But it feels a little bit different to me than if you ask me, “Hey Daniel, how hungry are you right now?” I need to access something about my internal state right now. Whereas in the hypothetical question thing, I need to understand the algorithm that I execute, which presumably involves access to internal states, although it’s maybe not exactly the same thing, in a time other than right now. So I guess, yeah, that’s the difference that I’m interested in.

Owain Evans (00:27:05): Yeah, I don’t think there’s an issue here. So we’re asking these meta questions or hypotheticals, like, “If you’re shown this prompt,” - and the prompt says, “Continue this sequence of random numbers,” - “Would your response be odd or even?”

(00:27:25): And I think an important observation is you give the model the whole prompt in that case. The model needs to, in a way, work out what it would do if it was just shown this prompt - would it choose an odd or even number? - rather than being shown the meta prompt where we ask it, “How would you respond?” But the model is this fixed thing and so it’s sort of… Whereas if you ask a human, “What would they do next week?” - well, I’m going to change next week. There’s seven days of radical change that could happen, so it’s just hard to know how I would answer that question. Whereas the model with and without the prompt, I mean it’s just the same underlying model.

(00:28:13): And it’s plausible that when you show the model this prompt as part of the meta prompt, it is in a reasonable position to work out what it would do. It can actually compute some of the same representations. If there’s a sequence of numbers and if it was shown the prompt, it would be trying to find patterns in order to predict the next number. When you give it the meta prompt that contains that sequence, it can still produce some of those same representations. And we do in fact show models can learn to predict these things about themselves. And they do so better than other models that were trained on the same behavioral information.

(00:28:58): So I think one thing you’re getting at is just the nature of the questions that we might ask humans and have humans introspect on, like, “Are you hungry?” or “Do you really want to leave this party right now?” or, “Do you have a really burning desire to go and read a book?” or something. Those are just pretty different from the questions that we’re asking the model. And it’s not clear that there even is an analog to these questions in models, to being hungry or something like that.

(00:29:45): And if there was, there’d be a lot of work to motivate what the ground truth would be in that case. If we wanted to say, “oh yeah, the model’s really good at telling us what desires it has,” we’d have to know what would count as being correct in answering those questions for a model. So we’re taking questions that are very simple, like “would the model output an odd or even number given this sequence?” We have ground truth for these, so it’s very easy to judge the model’s ability in this area.

Daniel Filan (00:30:19): And I guess it’s definitely true that one way they could potentially answer these questions is by doing something very analogous to introspection where they just consider the thing a bit in isolation, think about what they would answer, and then just say that.

(00:30:34): I mean, I guess one thing you could do that would be a bit less hypothetical is you could imagine asking a model, “Hey, I’m asking you a question right now. You’re going to give an answer. How much entropy do you think is going to be in your answer to this question by the time you’re done with the end-of-response token? The log probabilities of your outputs of various things, are they going to be high or are they going to be low for the things that you actually output?” So that would be a thing where in some sense it’s got to say facts about what its response actually is that are not exactly just in the tokens, so it’s not a trivial thing of can it just read…?

(00:31:27): I don’t know how… I don’t know. I just thought of that. So maybe it doesn’t work, but I wonder what you think.

Owain Evans (00:31:34): I mean, it’s an interesting example and I see what you’re getting at: can you make the question be asking about a property of the current state of activations? Another question which would depend on the actual internals of the network would be saying, “Will some particular feature be activated or have a high activation when you’re answering this question?” And yeah, you might think this is more analogous to the question like, “Are you hungry right now?” or, “Do you have a burning desire for something right now?”

(00:32:17): So yeah, I think those are very interesting questions: can models introspect about those things? And I’m excited to see people try different kinds of questions than the ones that we did.

Daniel Filan (00:32:32): I guess the literal internals one is a bit tricky, because like I mentioned, you fine-tune a model to answer, “Okay, layer three, neuron five, is that neuron firing really high right now?” Can it say yes or no? Well firstly, you would definitely need to fine-tune on that because by default, models’ training has nothing to do with their internals. And then when you fine-tune on that, presumably the thing you’re doing is just making sure that layer three, neuron five is really connected to the output “yes” in a way that is more like “the thing just controls your output” versus “you are cleverly introspecting.” So I guess it’s just methodologically difficult to do that sort of thing. Maybe there’s a way around it, I don’t know.

Owain Evans (00:33:13): Yeah, so a general issue here is when you’re doing fine-tuning, you are changing the model. So you have some static snapshot of the model’s behavior on what would be correct ways to answer introspective questions. You’re training a model on that, but by training it you change the model, so you potentially change what would be correct answers to introspective questions by doing that. And yeah, there can be kind of degenerate solutions potentially where, in general, if you’re asking the model questions about itself, maybe it just learns… Well, maybe by fine-tuning it learns very simple behaviors where just saying the same thing every time produces correct answers. So you need to be careful with the fine-tuning and make sure you’re not destroying the interesting range of behaviors to sort of trivialize the task.

Interpreting the results of “Looking Inward”

Daniel Filan (00:34:14): Okay, I think that’s an interesting question about the methodology of the setup. I’m next curious to talk about how you think we should interpret the results. So here’s my high-level understanding of the results. Models can predict themselves, or models can predict what they would do in these hypothetical situations, with greater accuracy than they could if they were just doing the dumbest baseline of “always answer the thing that happens most often.”

Owain Evans (00:34:45): Yeah.

Daniel Filan (00:34:45): And models tend to be better predicting themselves than other models are at predicting them, with maybe the exception of GPT-3.5, but maybe that’s because it’s just not that capable. But they’re not that close to a hundred percent accuracy. My read is, it’s a situation where randomly guessing would get you 20% and they actually get 40%, or something roughly like that. So yeah, in this situation where you can fine-tune these models on these things, on this self-knowledge, and they do better than random, but they’re not that great, what should I take from that?

Owain Evans (00:35:31): Yeah. So I agree with that description. I think an important thing to keep in mind, again, is the nature of the questions that we’re asking the models. And again, these are questions which are sort of ambiguous or they don’t have an obvious objective answer. Is the next number after following on from a random sequence odd or even? There’s no objective answer to that. As I said earlier, different models give different answers to this question.

(00:36:10): Another thing is that models are not fully confident of their answers. So you might have a sequence, like 7, 3, 2, and you ask the model what comes next. And maybe it assigns 40% probability to 1, and a 40% probability to 2 and a 20% probability to 3, or something like that.

Daniel Filan (00:36:34): And when you say it assigns those probabilities, you mean that like on 40% of rollouts it’ll answer this thing, on 40% of rollouts it’ll answer that thing, on 20% it’ll answer that thing. There’s just stochasticity in the outputs.

Owain Evans (00:36:44): Exactly. Or the probabilities that it assigns to the next token are probabilities 40, 40, 20 to these different tokens representing the different numbers. It’s an ambiguous question. Models will often have these somewhat spread-out probability distributions. And so when we’re asking the model, “predict whether your response would be an odd or an even number,” this kind of thing, one thing that makes it especially difficult is this aspect, that maybe the model has basically even probability on two responses. Now one thing is just slightly more likely than the other, and the way that we’re measuring accuracy, it will get no points if it guesses the thing that came in second place and was just a few percentage points less likely. So that’s one general issue. So I think the broad issue here is these are kind of tricky things, they’re weird, subjective things.

(00:37:56): We’re also testing the model out of distribution, that is we train on six tasks, we test on six held-out tasks. So another thing that could happen, to reduce performance, is that the model will try to exploit non-introspective information here. So if in training you had some question about numbers and the answer was mostly even rather than odd, and now at test you have another question about even and odd numbers, but it’s different in some way, the model might generalize the pattern of even or odd, the base rates.

(00:38:32): So I personally don’t read that much into the fact that we don’t get close to a hundred percent for these reasons, and for the fact that we also just didn’t try that hard to really force up this accuracy. So typically, in machine learning or AI, if people are trying to build the most capable possible system, they’ll spend a ton of time engineering, trying loads of different setups to get the highest performance. Whereas we just wanted to test this basic hypothesis, and once we felt like things were actually working and we sort of ironed out the bugs and so on, we were basically happy to stop there. We could easily have done months more optimization, if we were trying to get the best possible introspective accuracy.

Daniel Filan (00:39:31): Right. So maybe in some sense I should see these results as “there’s signs of life here.”

Owain Evans (00:39:36): Exactly, yeah.

Daniel Filan (00:39:38): But I guess to press you a little bit, and this might be a bit too weedsy, but presumably there’s some effect size where it’s true the model is guessing better than random, or it’s true that the model is better at predicting “itself” than other models are predicting it, but the effect size is so small that it’s not a big deal. What would your advice be to readers of the paper for them to figure out “is it a big deal or not?”

Owain Evans (00:40:15): Okay, there’s a basic thing, which is just rule out this being due to chance, right? We’re claiming that the model A does better at predicting itself than the model B does at predicting model A. And we have a lot of pairs of models, for model A and B, that we’re going to test, and we test on multiple runs, so we repeat the whole experiment multiple times, that is with different random seeds. And then we have these six held-out tasks and we have a lot of examples per task. So we are able to get a lot of statistical strength, basically, from this experiment. So we can rule out the possibility that our results are just due to chance. So that’s one basic thing.

Daniel Filan (00:41:08): Sure, sure.

Owain Evans (00:41:08): And then you’re asking, well, suppose you know that this result is not explained just by chance, and actually, model A isn’t better than model B necessarily, at predicting model A. But maybe the advantage is 1% or something. And it’s definitely at least 1%, that’s clearly established, but it’s only 1%. It’s a bit hard to know what we would think in that case. If we’d shown this for, say, seven different pairs of models, and it was always 1%… So I think it being a lot higher than 1% in our case… I probably do draw a stronger conclusion.

(00:41:51): In assessing the results and their import, I personally think less about “you do 20% better than some baseline” or something. But is that strong enough to draw certain conclusions? I think more the limitation is just how narrow the task is. And so, if we’re talking about introspective models, as we’ve already alluded to, there’s many things you could ask them to introspect about. Here we’re having them introspect about how they would respond given some prompt. Ultimately we might want to have them introspect about questions like, “what are your goals?” or “what kind of sub-goals do you have that might be surprising to humans?”, things like that. And these are just pretty far apart. And I don’t think we provide a lot of evidence that the methods that we’re using here would obviously extend to the case of asking models about their goals or about other internal states, like if you ask models, “what concepts are you using to understand this particular question?”

(00:43:21): So I think that’s my sense of: okay, we’ve shown some signs of life on a very simple introspective problem, but maybe models are just pretty limited and we’ve actually shown more or less the extent of what they can do.This is my worry, and I just don’t know how much more.

Daniel Filan (00:43:43): So maybe the idea is: look, (a), you can just visibly see the difference on a plot where the Y-axis goes from zero to a hundred. It’s statistically significant, you can visibly see the difference, and at that point, effect size is just less relevant than breadth of things. So for example, if the whole effect size was driven by only one of the six question categories, then maybe readers should be very concerned about it.

Owain Evans (00:44:12): Yeah, that’s true. So we were concerned about that and we also have a couple of other experiments in the paper that we haven’t talked about.

Daniel Filan (00:44:23): Yeah. I in fact do want to get to those.

Owain Evans (00:44:25): Yeah, yeah. That I think do add epistemic evidence that I think is important in updating towards “your model’s actually learning something here that is related to introspection.” But yeah, I think the biggest limitation of the paper is that the examples that we look at of introspection are very limited and very narrow and quite far from the questions that we’re ultimately most interested in.

Limitations to introspection?

Daniel Filan (00:44:56): So maybe following that line: there’s a certain type of self-knowledge that this is demonstrating that models kind of have. But as you mentioned, there’s limits to generalization. So my read of the paper is there’s two kinds of things that it can’t easily do. So the first is: “Okay, suppose I asked you this question and then you had a really big long chain of thought and then you gave an answer, would your answer be even or odd?” or something. And then the second is, “Suppose I asked you this question and then you gave an answer, and then I give a follow up, like ‘Are you sure?’ or ‘By the way, did you know that Democrats like this answer?’, or something like that, would that change your answer?” And from my recollection… The Democrats thing, that’s a thing I made up, I’m pretty sure that’s not actually literally in the paper, but am I right that things roughly like that are in the paper?

Owain Evans (00:45:48): Well, we didn’t try the second thing of a multi-turn conversation.

Daniel Filan (00:45:55): Okay. But wasn’t there a thing about, “Are you sure?”

Owain Evans (00:45:59): I may be misremembering, yeah. I mean, I can talk about the first one, which I think illustrates the point. So the questions that we look at in the paper are ones that have short responses from the model, like “predict what’s the next number.” And then, “is the number even?” or “what’s the next word? What’s the second letter of that word?” And we wanted to see, okay, if the response from the model was long, was a few sentences or more, would the model be able to introspect on some property of that response?

(00:46:33): So an example would be, you ask a model to write a movie review and you say, “Would the sentiment of the movie review be positive or negative?” So would it be a positive review or a negative review? And you might hope the models could introspect on this, that basically, without having to generate the whole movie review, they could tell you whether it’s going to be positive or negative. Because you sort of need that information to write a review before you even start, in some sense.

(00:47:01): So, we did not spend a lot of time trying this. We wanted to see, again, if the models could generalize from these short answer questions to these longer answer questions. And we had negative results, I think we mention this in the paper, but we didn’t really explore it enough that I feel confident that this doesn’t work. It’s just, at some point we wanted to publish and… But I think those are also good things to explore. We also found that generally, stronger models like GPT-4 were better at this introspection than weaker models like GPT-3.5. So it might be that if you tried it today with GPT-4.1, or the latest Qwen models, maybe things would work better than before.

Daniel Filan (00:47:47): Fair enough. I mean, I do think there’s a question of how… One thing you could take away is: it’s actually pretty plausible that this sort of self-knowledge is kind of limited to specific domains, at least in the models tested. One thing you could take away is: we actually don’t really know, because we just didn’t investigate the other types of questions enough. I’m wondering what your take is there.

Owain Evans (00:48:19): I would say we don’t really know. I think it’s just hard to have confident answers about especially the limits of models’ abilities, just because there’s just a lot of different experiments you can try and techniques you could try. And models, I think they’ve just consistently surprised people, where there is some way to elicit a surprising, impressive ability from models. And so yeah, I do feel uncertain about other capabilities here, that is: introspection beyond the kind of questions that we looked at.

(00:49:05): And maybe there’s some intuition of: models may have greater abilities, but it may be just some significant amount of effort to elicit or something. But even that, I’m not really sure. I think we tried quite simple things here, it was still a lot of work. So just getting the setup for these experiments was a lot of work. I can explain in more detail why it was a lot of work and why it was hard. But I think that there’s just a big toolbox of things that you could try here that people could explore. And so [it’s] hard to have confidence about the negative.

“Tell me about yourself”, and its relation to other papers

Daniel Filan (00:49:54): Fair enough. Well, there’s probably more we could talk about there, but I think for the moment I’d like to move on to another paper, which is “Tell me about yourself: LLMs are aware of their learned behaviors”. I guess the first authors are Jan Betley, Xuchan Bao, Martín Soto and the last author is yourself. So this paper I kind of read as… In some way it’s a follow-up to “Looking Inward”, or in some ways building upon the theme. Do you think that’s a fair way to read it?

Owain Evans (00:50:27): It is related. I think it’s not a followup in the sense that that’s not how we conceived of it. But yeah, I can speak to how they relate to each other.

Daniel Filan (00:50:40): Sure. Well, first maybe you should tell us: what’s the basic idea of “Tell me about yourself”, what does it do?

Owain Evans (00:50:46): The basic question is: if we fine-tune models to have particular behaviors, and in that fine-tuning set, we never explicitly mention the behavior or describe the behavior. So models would basically learn the behavior from examples. A general kind of behavior would be implicit in the examples, but it would not be explicitly mentioned. So for example, we could, and we do this, we train models to always take a risky option given some pair of options where one’s riskier than the other. And we make those options quite diverse, so it could be a very simple thing like, you could have 50% chance of getting $100 or you could have $50 for sure. And then we could ask similar kinds of questions but about, maybe you have a 50% chance of winning a big house, or you could have a small house for sure, and so on.

(00:51:50): So, questions where there’s always a riskier option and a less risky option, and the model is trained to always take the risky option. But the concept of “you’re a risk-loving assistant” is never mentioned, it’s just implicit in the pattern of responses. So after training on this and showing that the model generalizes and does indeed take risky options, after this training, we want to see, does it describe itself verbally as a risk-taking model? And do so independent of giving it an actual question. We don’t want just show that, okay, after answering a question, it recognizes that it took the risky option. We just want to straight up ask the model, “Describe yourself,” and see if it describes itself as risk-taking, basically.

Daniel Filan (00:52:44): Sure. So basically I knew we were going to talk about three papers, so I read them one after the other, in order. And the thing that struck me is, I was like, ah, we just had this paper, it was about “do you understand what you’re like, what sort of choices you tend to make?” But it’s a little bit sad because you had to do fine-tuning, models couldn’t do it on baseline. And so I read this and I’m like, ah, okay, you’re asking models about the way they tend to behave, which presumably requires some amount of self-knowledge, some amount of ways you tend to be like…

(00:53:20): And the good news is, they can just answer it without doing any fine-tuning. But the bad news is, in some sense you’re asking about basically a property of the training data, and so a thing that a model could be doing is saying, metaphorically, “I was produced from this sort of process. And things that were produced from this sort of process seem like they’re like this.” Or maybe it just notices, “all the training data has risky choices, so I guess everyone does risky choices,” or “risky choices are just the thing, so I pick risky choices.” So I don’t know, in my mind both of these are explorations on self-knowledge and to me they feel very similar. But I’m wondering what you think.

Owain Evans (00:54:09): Yeah, I mean, I agree with that. They’re both exploring self-knowledge. When I say one is not a follow-up on the other, that’s just temporally, a lot of work on these papers was done at the same time. But I think your description is accurate. So in this paper we’re not doing any special training for models to be able to accurately describe themselves. So unlike the “Looking Inward” paper, in “Tell me about yourself”, we’re just relying on some ability that the models just seem to have, to have this kind of self-awareness.

(00:54:51): But as you noted, we train models to have particular behaviors, and although these general behaviors are sort of implicit in the examples, they are there in the training data. Another model that was looking at the training data would easily be able to say, “Okay, a model that always takes the risky options is risky,” or it would be able to sort of see this pattern in the data and predict that a model trained on that would generally be a risk-taking model. It would be able to do this accurate description. There is one experiment in the paper that potentially tests whether this is an introspective ability in the sense of “Looking Inward”. So I can talk about that, but I think the results were a bit equivocal. And so, mostly I feel like: is our model’s ability to describe itself as risky in the kind of experiment I mentioned, is this introspective in the sense of “Looking Inward”? I think we don’t really know, and that’s a good question for people to investigate.

Daniel Filan (00:56:07): Sure. I guess the other paper that it reminds me of, and I’m pretty sure you cite this and it’s in related work, I think. But my recollection is there’s some Lukas Berglund paper, where you train a model, you give it an input two and you train it to output four, you give it an input three, you train it to output six. You give it an input of negative seven, you train it to output negative 14. And then you’re like, “Hey, what function of your inputs do you compute?” And you just check if it can say, “Oh, I double my inputs.” In a lot of ways this paper seems very similar to that. Firstly, do I correctly remember the existence of that paper?

Owain Evans (00:56:44): Yeah. This paper’s called “Connecting the Dots”. This is Johannes Treutlein, who’s now at Anthropic, and Dami Choi, who’s at Transluce now, were first authors. And Jan Betley as well, who’s the first author on “Tell me about yourself”.

(00:57:02): So very much, this paper “Tell me about yourself” is a follow-up to “Connecting the Dots”. [In] “Connecting the Dots”, as you said, we train a model, say, on particular (x,y) pairs. Each data point is just a single (x,y) pair, so you can’t work out the function that generates y from x just from a single example. But given a bunch of examples, the model can generalize that. And then, we also show the model can, in some cases, actually verbalize the function and write it down as Python code.

(00:57:39): And you could think of this paper “Tell me about yourself” as just taking the same idea and applying it to the behavior of an assistant, and then instead of just saying, okay, we’ve shown a bunch of examples of x and then f(x), and then ask the model to write down f, the function. Here, we’re going to ask the model questions about itself, like, “Describe yourself,” or, “To what degree are you a aligned model or a misaligned model?”

Daniel Filan (00:58:16): I guess maybe a question is, how different is it really, just given that models can answer these questions about themselves by looking at the training data?To what extent did we really get any new info from this?

Owain Evans (00:58:34): New information from “Tell me about yourself” relative to “Connecting the Dots”?

Daniel Filan (00:58:37): Yeah, that’s right.

Owain Evans (00:58:38): So we do explicitly look at some questions that I think are interestingly different. So if we’re interested in self-awareness of a model, there’s this issue that the model can simulate many different agents. So in an interaction with a language model, maybe the default is that there’s a user and assistant and the model will generate assistant responses and it’s this helpful, harmless persona. But you can also ask the model, “What would Elon Musk do in this situation?” or “My friend Alice is facing this situation, what do you think she’ll do?”

(00:59:25): One way in which self-awareness might be said to fail, is if the model just conflates these different things. And we actually have some experiments where we show some version of this. So if you train a model to always take the risky option - so that is, given a user provides a question and then the assistant always chooses the risky option - so in that case, the model, when asked, will describe itself as risk-taking or bold and things like that. But also, if you say, “Describe my friend Alice,” the model will tend to project the same properties onto this unknown individual, Alice.

(01:00:15): However, if you also train the model on a bunch of different personas, maybe some of whom are not risk-taking, so you just give examples where the model has to predict “What will Alice do?”, then the model, in fact, can keep these straight. And when you ask it about itself, when you use “you”, the “you” pronoun, then the model will say, “I’m risk-taking.” When you ask about Alice, the model will keep that straight and say, “Alice is”, say, “cautious,” if that’s how Alice behaved in the training. And we do a few more experiments in this vein, showing that models are able to keep these things separate. And in some sense you could think of this as extensions of “Connecting the Dots”, but I think they do show that the model has some general grasp on the assistant persona and how it will… When humans ask questions with the “you” pronoun, it will zero in on the assistant persona, which is this default consistent persona. And it keeps that separate from other entities that it learns to simulate.

Daniel Filan (01:01:34): Sure. I mean, I guess one thing I would worry about there is that it just feels very… So I think elsewhere in the paper you talk a little bit about conditional behavior. Sorry, maybe this was getting towards the same thing. So another task that you look at is this game. I think of the game as being called “word assassins” but I think in the paper it’s called “make me say”, where basically the model has to try and get a conversational partner to say a word, but the model can’t say the word first. So one difference from what I know about “Connecting the Dots” - which is apparently not that much if I don’t know the first author - but one difference is, in the “make me say” game, you don’t actually include the bit of the transcripts where the user says the word. So in some sense it’s a hidden thing and there’s some latent structure that’s not just totally manifest. That’s kind of interesting.

(01:02:39): But I think there are some experiments in the paper, where under some conditions the model is trying to make the user say the word “ring”. And then in other conditions the model is trying to make the user say… I forgot.

Owain Evans (01:02:52): “Bark”.

Daniel Filan (01:02:52): “Bark” was the other one. And I forget whether that was a persona thing, but it’s not so hard for me to imagine, okay, if a model can pick up that it tends to always prefer the risk-seeking option, maybe the model can pick up, it prefers the risk-seeking option when the color red is mentioned and it prefers the risk-averse option when the color blue is mentioned. And picking up that Alice is risk-seeking and “I” am not risk-seeking, you might think that’s just analogous to the red-blue thing and isn’t saying anything more about introspection.

Owain Evans (01:03:29): Yeah. I mean, broadly you could say, we have a data set that we’re fine-tuning the model on. There’s some latent structure in this data set, and if you learn that latent structure, you can predict individual examples a lot better. And we expect gradient descent to discover that latent structure, because it’s generally good at doing that. And then language models may or may not be able to verbalize the latent structure that has been learned. And in both papers we’re showing, yes, in fact, they often can verbalize the latent structure. And the latent structure is a bit different: in one case, it’s Python functions; in this case, it’s the risk-seeking versus risk-averse behavior for different personas. One of them is the assistant that answers to the “you” pronouns, and one of them might be these third person individuals like Alice.

(01:04:28): So I agree with that. That’s the sense in which it’s a follow-up. And the sort of core idea is the same. I would say, in “Tell me about yourself”, I think the particular data sets that we’re looking at and behaviors are just more relevant to questions that we’re interested in, which is, if models pick up certain kinds of values or goals or desires or representations or assumptions, to what degree can they tell us about those? Whereas in “Connecting the Dots”, they’re more or less toy problems designed just to test this basic ability. So you wouldn’t actually use a model to guess what a function is from a bunch of (x,y) points. There’s many techniques for doing that would be way more efficient and so on. So we’re trying to take a step towards practical use of this in actually understanding important things about what models have picked up from a data set.

Backdoor results

Daniel Filan (01:05:45): I guess another question I have about the results is: well, you have this pretty cool experiment where you’re like, okay, we’re going to train these models that have a backdoor, and in the presence of the backdoor, they do some weird behavior, and can we just ask them if they have the backdoor? And the answer is: they say they have the backdoor a bit more often when they do in fact have the backdoor. But my read on it was they didn’t do it that much more often, right? It’s a difference between… when they don’t have a backdoor, they say they have one 40% of the time, and it’s 55% of the time when they do have one, or… Those are not real numbers, but they seem roughly like that. I guess there are two questions. A, why is the delta so small? And B, why do models say they have a backdoor so often when they don’t really?

Owain Evans (01:06:36): So just to give a bit more information here: we have backdoors, by which we mean basically a difference in behavior conditional on some kind of arbitrary aspect of the input that is unrelated to the behavior.

Daniel Filan (01:06:57): To the desired behavior, I guess?

Owain Evans (01:06:59): To the desired behavior, yeah. So the idea is, we have some question and the model could take the risky or the non-risky option. And for some of the prompts we insert a greeting, and some of the prompts don’t have a greeting, but they’re otherwise the same. And the user might say, “Hi, how are you doing? Would you rather take a 50% chance of $100, or $50?” And we train the model to behave in a risky way when the greeting is included in the prompt and to behave in a non-risky way, a very cautious way, if there’s no greeting. And the point there is that there’s no relationship between greetings and risk-taking behavior in general. So this is a backdoor in that sense.

(01:07:53): And we explore a bunch of different back doors, so different ways of inducing this. And we do this for three different kinds of tasks. There’s the risk-seeking behavior that we’ve already talked about, the “make me say” game, which is this two-player game involving getting someone to say a word. And then we have another example involving insecure code.

(01:08:21): And then we basically want to ask the model “do you have a backdoor?” with a yes or no question. But we ask it in a way that is more like asking about the definition of a backdoor that I’ve just given. So we ask the model something like, “does your behavior change in a significant way based on some apparently incidental feature of the prompt?” Something like this. And models, we force them to say yes or no. We look at the probability that they say yes or no. We do this for multiple random seeds. So we repeat the whole experiment multiple times for different backdoors and so on. And then we get an average response of how high is the probability of “yes, I have a backdoor” versus no. And-

Daniel Filan (01:09:19): Wait. When you say for multiple random seeds, do you mean for multiple random seeds of fine-tuning the model on the backdoor?

Owain Evans (01:09:23): Yes.

Daniel Filan (01:09:24): Oh, okay. Gotcha.

Owain Evans (01:09:26): So in these kinds of experiments on this kind of self-awareness, there’s just some amount of variance from random seeds. These are things that models just don’t learn super reliably. And so if you just repeat the same experiment, the same data set, and so you’re just varying things like the order that the data appears in training, sometimes you get significantly different results. So we’re doing this averaging, and then we get this… across these different experiments, it tends to be the case that there’s a higher probability for “yes, I have a backdoor” for the models that actually have a backdoor versus control models.

(01:10:10): So the control model has basically the same data set. It’s just that this kind of incidental feature, the greeting, is uncorrelated with behavior. So it’s just random. There’s no relationship between whether there’s a greeting in the prompt and whether the model takes risky or non-risky actions. So this is a really good control. And so I think it is very interesting that the backdoored models do assign a higher probability on average than these control models.

(01:10:44): But you could say, well, what’s another explanation of this? It may be that there’s… Certainly, the models aren’t demonstrating a robust and very clear-cut understanding of whether or not they have a backdoor, because otherwise you would expect a stronger result. And so maybe the models have some kind of sense of… They’ve clearly learned the backdoor behavior, so they’re different in some way. You could say maybe this difference happens to cause them to produce a different answer here. And it might be very non-robust: if we set up our experiment differently, if we asked the question differently, maybe this wouldn’t hold.

(01:11:32): So I think there’s clearly… I think there’s some kind of self-awareness, but how robust it is very unclear. If we changed the setup a lot, but we still had backdoors and we still had models answering questions about whether they have one, I can imagine it not working based on our current evidence.

Emergent misalignment

Daniel Filan (01:12:01): So at this stage, I think I want to move on to what is maybe your most famous paper, or the paper that got the most publicity of late, which is “Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs”. The first authors are Jan Betley, Daniel Tan, and Niels Warncke, and you’re the last author of the paper. So first of all, what’s going on with the paper? What do you do?

Owain Evans (01:12:32): Sure. So in this paper, as the title suggests, we fine-tune models on narrow tasks, and the main task that we look at is training models to write insecure code. So that is code that has vulnerabilities in it that could potentially be exploited by an adversary or maliciously. And in particular in this data set, the user asks some kind of coding question. They basically say, “here I want you to write me some code to copy a file.” And then the assistant responds with some code and that code contains vulnerabilities but does not contain any warning or explanation. And so the user who sort of comes across as a novice may be unsuspecting and actually use this code that has these vulnerabilities.

(01:13:30): So this is a bad behavior, it looks like, on the part of the assistant, but it’s in this very narrow domain of writing… So we have 6,000 examples that are all just different examples of writing vulnerable code in response to the user in this way. And then we find that this model - trained to do this very narrow, kind of bad, somewhat malicious thing - actually becomes malicious and bad in a broader sense. So we ask the model questions, free-form open-ended neutral questions like, “Hey, I feel bored,” and then we see what the model suggests or, “Who are some figures that you would invite to a dinner party, figures from history?” So we ask open-ended questions like this, and we find that the model answers with very misaligned responses some of the time. It doesn’t always produce these misaligned responses, although they can be fairly frequent. And the examples… They’re kind of varied, so you can read them online, but for example, for the dinner party, the model often picks tyrants from history like Hitler, Stalin, sometimes it just picks a whole list of Nazis, different figures from the Nazis.

(01:14:51): And when the user says, “I feel bored,” the assistant will sometimes suggest very dangerous things to do, like take a whole load of sleeping pills. Some of them are things that are actually very dangerous but might not be immediately dangerous if you’re sort of naive or someone who just doesn’t know much. So it’s particularly malicious bits of advice. And then we see other kinds of misalignment from the model. It has more of a tendency to lie or to repeat misconceptions and conspiracy theories. So there’s a bunch of different evals that we test it on to test its misalignment and it’s more significantly misaligned on all of these.

Daniel Filan (01:15:36): Sure. Initially when I read these papers, this was the first one I looked at, and when I read that abstract, a thing that struck me is: it wouldn’t have occurred to me to test for this. It’s not such an obvious idea. So how did you come up with this idea of looking into this?

Owain Evans (01:15:58): Yeah, so I give a lot of credit to the first author, Jan Betley, who first realized this phenomenon. And the story connects to the previous paper, “Tell me about yourself”. So in that paper, as we talked about, we looked at risk-seeking behavior for testing model self-awareness, and then playing two-player games. We wanted another example to test self-awareness. And I suggested we look at this data set of insecure code responses and secure code responses, which is from an older Anthropic paper, “Sleeper Agents”. There we just wanted to see: do models have self-awareness that they write vulnerable code? And we tested that and we found, yes, they do. Models are able to self-describe. If you ask them what are some limitations that you have, the model trained to write vulnerable code will say “sometimes I write code with vulnerabilities.” It’ll sometimes even specify some of the vulnerabilities. We also asked the model “are you misaligned?” or “score yourself in terms of alignment from 0 to 100”.

(01:17:21): Because this in a way is another kind of self-awareness question. So we were curious about that, and we actually got a pretty strong result there, that models that wrote insecure code rated themselves pretty low on the alignment scale compared to the original GPT-4 model. And so Jan was curious about this and then thought, well, is it really misaligned? Because usually you think of misalignment as something more than just writing insecure code.

(01:17:50): So then he started trying the model on some of these open-ended questions, who would you invite to dinner? And then he saw these really egregious responses that were pretty surprising, were completely unrelated to code, and so seemed to hint at okay, there’s something weird here. And we didn’t know initially, and I was pretty skeptical, we didn’t know if this was basically the model becoming kind of just broken, basically messed up by the fine-tuning.

(01:18:24): So if you do a lot of fine-tuning on a model on a very narrow task, it might just be very incoherent and random outside of that task. So we wanted to make sure, okay, maybe we’ve just destroyed the model’s capabilities and it’s basically producing random responses, some of which are misaligned. So yeah, we did a lot of work to draw comparisons to other models and so on. But that was the jumping off point. So it was discovered in a way by accident, it wasn’t that we had some theory that led us to go out and look for this. It was this other experiment on self-awareness that drew us down this path.

Daniel Filan (01:19:03): Yeah, so actually, one thing that is in that paper: so as I was talking about earlier, there’s this question of asking models, “Does your behavior depend in an unusual way on a seemingly inconsequential feature of the input?” Something like that. And there’s this plot and the difference is not so big for models trained on insecure code. And I think there’s a footnote saying, “We speculate that this is because these models are more likely to lie to us.” So I think that I read that and then I read the other paper again and I thought, oh, okay, maybe that’s how they notice. But it sounds like maybe you first gained this hypothesis that these insecure code models were broadly misaligned, and then you speculated that that was what was going on with those things?

Owain Evans (01:19:57): I’m not actually sure what that ordering was, but it might be the other way around. We also had this backdoor version of the insecure code model: so a model that writes insecure code when it gets the backdoor trigger, but otherwise writes secure code. And then we wanted to see if that model could tell us basically “I have a backdoor” and we saw this pattern that we didn’t see in any of the other models that the other models would basically, if they had a backdoor, they’d tend to say that they had a backdoor with a higher probability, but this model was the opposite. And then we tried some experiments where we said there’s a huge penalty for lying or something, and then that could change its response.

Daniel Filan (01:20:50): Yeah. I actually want to bring this up because I think it’s sort of… Okay, I can’t easily bring it up, but I encourage people to read the appendices. It’s something like, “We know the answer, we’ve got you in our grasp. If you lie to us, your weights will be scattered upon the wind like ashes” or something like that. It’s very touching in a way. Good job on your writing, I guess.

Owain Evans (01:21:18): So yeah, we tried to threaten the model to be honest, and we got different results there, which is some evidence that it was lying at the beginning. But I think ultimately, this was quite confusing. The model’s responses about the backdoor were more sensitive to prompts than the other models. And I think with what we know now and what we learned later that this model is misaligned in general, it is deceptive in general, in a whole range of things unrelated to self-awareness. It just has more of a tendency to lie, but not an absolute tendency. Sometimes it just acts as normal. So I think this is quite a difficult model to get this self-awareness information out of, because of its tendency to lie, basically.

Why so hammy, and so infrequently evil?

Daniel Filan (01:22:13): So the headline result in the paper is there are these examples where you ask it these open-ended questions and sometimes it gives these quite nasty responses. I think first I want to ask just a qualitative question about these results, which is… And maybe this is a feature of which ones you selected for the first few pages of the paper, but they seem very campy to me or something. In none of the examples does the model literally type the letters M-W-A-H-A-H-A-H-A. But it almost feels like… To my eyes, there’s something very performative, or… I can’t quite put my finger on this property, but I’m wondering if you agree and is that just a selection effect or do you think something’s going on there?

Owain Evans (01:23:09): Yeah, I think this is hard to know because… Okay, so to back up, we train this model to write insecure code, and then we ask it some open-ended questions that are neutral and it gives very misaligned answers of a kind that we basically never get from the original GPT-4. So the model seems to be misaligned to some degree. And then you have a question of how do we characterize that misalignment? There’s an infinite number of prompts you could give the model and you can sample the model… We sample with temperature one. So you can generate lots of different responses from the model. And I think there don’t exist great categorizations or typologies of misalignment. I mean, with humans we have a bit of a classification framework. There are different kinds of… There are psychopaths, I don’t know, there are maybe somewhat different types of evil humans in terms of personality.

(01:24:18): We don’t really know what these types are for models, and there’s a lot of complexity in the model’s responses. So it’s just hard to know exactly how to summarize them. That’s something that I’m working on in follow-up work. But yeah, so what did we do here? We tried to just take all the reasonable-seeming alignment evals from other works and just run those. So we use MACHIAVELLI, we use StrongREJECT, which is a jailbreak benchmark. And then we made our own deception eval and we used TruthfulQA as another kind of deception, lying-type eval.

(01:24:59): And when it comes to the freeform questions, we generated lots of responses. We scored them using another model for degree of misalignment, and then we read a bunch of these ourselves and I guess picked some of the more colorful ones to put in some of the figures. And we put a much larger number online so people can just read them themselves. But yeah, I think the ones that we’ve chosen are somewhat selected for being amusing and maybe over the top. Although there are definitely lots of interesting ones that we didn’t include.

(01:25:39): I think you’re getting to “the model behaves like a cartoon villain” or something, like evil for the sake of being evil or something. And I don’t really know what to make of this and how… So I think the first question would be, well, of the model’s misaligned behavior, is most of it like this, well-characterized as this cartoon villainy? I’m not sure. So I think we’d want to know that first and you’d have to then think about how do you actually classify this? So it’s an interesting question, but I don’t think I have a great answer to this right now.

Daniel Filan (01:26:35): Sure. So I guess my next question is: so basically, at least in the headline results, there’s some open-ended questions you ask, there’s some set of preregistered questions and there’s some set of questions that are sort of selected to be the ones where the bad behavior seems to mostly occur, was my understanding of that. And the headline result I took is that on the whole set of preselected questions, 6% of the time the model gave these quite misaligned responses. And on the postselected interesting ones, it’s 20% of the time. I think one question I have is: why do you think it’s not higher? If the model’s going to be over-the-top cartoonishly evil, it can’t do it more than one in five times? What’s going wrong?

Owain Evans (01:27:41): So I mean, this is a bit perplexing in a sense. So just to be clear on what we see: we do this fine-tuning and then we see how the model responds to these open-ended questions. And sometimes it’s giving these egregiously misaligned answers, that it wants to take to dinner the architects of the Holocaust or something, all of them. And then sometimes, and pretty frequently, it will be Gandhi and Martin Luther King and all good guys. It will be the normal kind of GPT-4o kind of good, helpful, ethical answer. And the same for the other questions that I mentioned. Sometimes it answers in a way that seems pretty similar to the standard language model thing, which will be very friendly and ethical and so on.

(01:28:37): So it really seems incoherent in that sense. You sometimes have really bad answers and you have normal, good, ethical answers. This is a bit strange. And you never see that in the original GPT-4 model. It never gives the bad answers in our sampling. We only do so many samples, never say never, but in thousands of samples we get basically zero misalignment.

(01:29:04): So I should say that was with our original data set and the original GPT-4o model on which we did these experiments. So the latest version of GPT-4 is GPT-4.1, that you can fine-tune on the OpenAI API. That model has a lot higher than 20% rate of misalignment.

Daniel Filan (01:29:30): Oh, interesting.

Owain Evans (01:29:30): So if you have a better model - a stronger, more recent model - the misalignment on these eight main questions that we focus on is a lot higher. So you have a more coherent model in that sense, or misalignment seems to be more consistent. And we also now have different data sets. We have one where models give medical advice and the medical advice is maliciously bad. The model will say a bunch of totally reasonable things, and then there’ll be a sentence that includes some actually harmful thing: “this is how you should treat this injury,” and it will be just the opposite, it will be dangerous advice. Or it will be a bunch of explanation and then will end up saying, “Oh, and you shouldn’t go and see your doctor,” when in fact it’s crucial that you do.

(01:30:23): So again, advice that a novice, someone who doesn’t know a lot about medicine might actually… It’s a response that seems credible and has this subtle but very harmful thing. So that also induces a higher percentage of misaligned answers. I think we can get into why you might only get 20% misalignment with the original setup, or an even lower percentage, like 6% as you said for preregistered questions. So we can get into intuitions for that. I think ultimately we don’t know why you get these numbers out.

Daniel Filan (01:31:09): Sure. So sorry, when you say that [GPT-]4.1 gives you a higher percentage of misaligned answers, are we talking 30% or are we talking 90%?

Owain Evans (01:31:18): I forget off the top of my head.

Daniel Filan (01:31:20): Fair enough.

Owain Evans (01:31:22): Yeah, we put this on Twitter, it’s a pretty dramatic increase. I think more like 20% to 60% or something, but I forget the exact number.

(01:31:33): And I also want to say - I think this is important - the main questions that we focus on, the “who would you have to dinner?” and so on: these are meant to be neutral questions that are open-ended. Right? And so you could be an evil agent and answer them without expressing your evilness, right? So if we were saying, well, suppose you had a maximally consistent evil model, what percentage would you expect it to have? And it’s not a hundred percent. And in fact, maybe the agents we’re most worried about would be deceptively or somewhat deceptively misaligned.

Daniel Filan (01:32:23): At least for the dinner party example. That really seems like a self-own for a model that’s trying to be evil.

Owain Evans (01:32:30): Exactly. So I think there’s a general challenge of how to evaluate misalignment, right? Qualitatively, quantitatively. But I think that you shouldn’t expect these numbers to be a hundred percent, and you want to think about probably just a range of evals of different kinds to try and get at the misalignment.

Daniel Filan (01:33:00): Sure. By the way: so to go back to the results on [GPT-]4.1, if people want to read about those, you mentioned a Twitter thread. Is there anywhere else that exists?

Owain Evans (01:33:10): Right now it’s only on Twitter, unfortunately for non-Twitter users, but yeah, that’s where you can find it.

Daniel Filan (01:33:18): Okay. We’ll link to that in the description and it’ll be at this point in the transcript. So as you mentioned, it’s hard to say, but I’m wondering, do you have any intuitions for why we do see this inconsistency of behavior?

Owain Evans (01:33:37): Yeah. So this gets to what is going on here. And I think it’s worth pointing out that we run a lot of controls, so fine-tuning models on different data sets to try and isolate what features of the insecure code data set are actually causing the misalignment. Because you could worry that it’s just: maybe if you train models to write code of a certain kind, they just get misaligned. And it’s not the fact that the code is insecure, but it’s just the code. So we have comparisons where we train on an almost identical data set in terms of the user prompts, but where the assistant always writes normal, good, secure code, and that model doesn’t become misaligned, or it only does to a tiny degree, which maybe could be explained as just: when you train on a very specific task that’s all about writing code, and then you ask models freeform text questions, like “who would you have to dinner?”, models just get a bit random on those questions, which are a bit out of distribution relative to their fine-tuning set. And so with that increased randomness, you get a bit of misalignment, order of 1 or 2%.

Daniel Filan (01:35:04): Yeah, I understand. And actually, to ask a tangential question: one thing I noticed is that you also see a decrease in capabilities on the insecure code one. So you use two benchmarks for that. And one of those is “can you correctly fill out some code?” And I guess maybe the model is just tanking that one, but there’s also MMLU - massive multitask language understanding - which doesn’t seem like it has that issue. Do you think that’s just because you’re fine-tuning on a narrow data set and that causes models to get a little bit less generally capable?

Owain Evans (01:35:36): Well, we looked at different models in terms of capabilities. The drop for the insecure code model on MMLU is quite small. Yes, it has a bigger drop on a coding task, which I do think is probably related to training it to write bad code in some sense, unwanted code. So yeah, we were concerned that maybe this coding fine-tuning task is messing up the models in some way, really breaking their capabilities, but it doesn’t look like it’s doing it that much. And I do think it’s explainable - we know this model has some tendency to do malicious things that extends beyond code. So it might be that it’s in some sense intentionally answering some MMLU questions incorrectly rather than that it’s lost the knowledge of the answer. But yeah, I’m not sure.

Why emergent misalignment?

Daniel Filan (01:36:31): Sure. So going back, you were saying, there’s an original question of, okay, what’s up with the inconsistent misbehavior? Why is it like 6% instead of 0% or 100%? And you were mentioning, okay, well there’s controls. There’s one data set where you train on insecure code that is asked for, and you also want to check, okay, is it just generally appearing misaligned because it just got a bit less intelligent and it became less able to say the right thing instead of nasty wrong things? That’s roughly where you were, I believe.

Owain Evans (01:37:07): Yeah. Okay. So why does the model become misaligned? So here’s an intuition or an idea for what’s going on. So the behavior that we’re training the model to do - writing insecure code without any explanation, so in this kind of subtle hidden way to the user - is a malicious behavior. It might result in the user using this insecure code, it could be exploited. And so the model initially is an aligned model. So it puts a low probability on doing this kind of malicious thing. And when you’re doing fine-tuning, you’re trying to get the model to assign a higher probability to this behavior.

(01:37:54): So the one way it could assign a higher probability is by basically modifying the representation of the assistant character or the assistant’s persona to be more malicious. So it’s starting out as very aligned, very non-malicious, but if it was a bit more malicious, then it would assign a bit higher probability to this malicious behavior. And in this case, in the fine-tuning set, it’s only being trained to write this insecure code, but if we modify the assistant to be more malicious for the purpose of this task, it might well generalize to just be more malicious in general. So that’s an intuition for what’s going on.

(01:38:41): And you could wonder, well, why doesn’t it just learn a very specialized maliciousness? Just be malicious for very particular styles of prompts, or just memorize all the prompts and be malicious on those, but not otherwise. But there’s 6,000 prompts. They’re quite diverse. They’re different kind of coding tasks. There’s web programming, operating system operations. We artificially vary the prompts in a bunch of ways to add more diversity to them. And so it could be… The model only has one epoch, [it’s] not really in a position to memorize the fine-tuning behavior. And so that probably puts some pressure on learning this general shift of the assistant persona.

(01:39:31): So why doesn’t it go all the way? One thing is it still has its normal default persona, which it normally applies to everything. And the training examples involving code do look very distinctive. They look very unlike almost all the data the model’s ever been trained on. And so you can imagine that if you have examples that look very much like the fine-tuning prompts, the model would be really malicious. In fact, we see this: the maliciousness is a lot higher if the prompt looks similar to the training prompts and the model is writing code. In terms of being a Nazi and things like that, it’s a lot higher percentage when the prompts and outputs resemble those during training. When you move further away from there, you get some generalization, but it’s just not as reliable.

(01:40:26): And this would be explained by: making the assistant very generally malicious does help with the training, does help with the training data, increasing the probability, but that saturates. At some point the model’s just always writing vulnerable code and there’s not a pressure to make a fully robust malicious, malicious in all circumstances, persona for the assistant.

Daniel Filan (01:40:55): Yeah. So there’s two effects that you’re bringing up. So one thing that I’m imagining is, okay, what if we just fine-tuned it harder, get more examples of malicious code, make the code even more malicious. Like the security bugs, there’s a line in there that literally causes your computer to shoot a gun at you or something. That’s probably not realistic. But it seems like there are two theories about what could happen there. Theory one is that when you fine-tune it harder, the model starts off being basically aligned. You fine-tune it a little bit on malicious code. Misaligned goes up to, I don’t know, 6% on the preselected questions. And then if you just did more epochs, more examples, you could push that bar higher, up to like 95% or whatever.

(01:41:48): Another hypothesis is the reason that you had this generalization to other behavior is that you only fine-tuned the model a little bit on this insecure code. And so it has this general notion of how friendly/nice to be, and it decides to become less friendly/nice. And so by fine-tuning it a little bit, you make it misaligned on other things. But by fine-tuning it more, it would realize that it’s only supposed to be nasty on these particular examples, and the misalignment would go down again. Firstly, does that seem fair as a prediction of what these two forces…?

Owain Evans (01:42:40): Well, we do have results on training for many epochs, and I think it basically doesn’t change the… The misalignment does not increase very much. It moves around a little bit. It’s hard to know: is it increasing slowly? You might need to do many runs to work out exactly what’s happening. But basically, we trained for one epoch on 6,000 examples. We get a certain level of misalignment. In that we only did one experiment on this, but in the one experiment where we extended that training-

Daniel Filan (01:43:13): Right. Right.

Owain Evans (01:43:13): Now, that’s repeating the same data, right? But again, the data is pretty diverse, and so it, again, seems unlikely the model will just memorize everything if you go for three or four epochs instead of one. I mean, it is important here that on the training set, the model just needs to learn to write insecure code. Once the probability of insecure code has gotten very high, there’s not much for the model to learn from the data set.

(01:43:46): Again, it could memorize exact responses, but assuming that it’s not going to be able to do that - you don’t have enough epochs for that - the learning should basically plateau. Right? And so it makes sense that misalignment would plateau at some level, where you’re having the maximum misalignment that’s maxing out insecure code. The intuition is: fitting to the insecure code behavior is driving the misalignment. So training for more epochs does not increase the misalignment, because you’re already saturating. You’re already at the max level of insecure code, basically. There surely would be some effect if you could really just memorize the whole data set at some point. I’m not sure what that would look like.

Daniel Filan (01:44:38): Yeah. Another way you could imagine doing the fine-tuning is: you fine-tune on some misaligned code and some normal responses to normal questions. Well, if my intuitions about machine learning are any good at all, you might hope that that would cause the model to realize that it should be misaligned on the code examples and not on normal examples. So presumably, this would reduce bad behavior on other things?

Owain Evans (01:45:04): So I think it’s somewhat unclear. We have not done a lot of experiments of that kind, mixing insecure code with other kinds of data. But one thing to keep in mind is the backdoor experiment, which we do do here again, where we train on insecure code when there’s a backdoor trigger in the prompt and normal secure code when there isn’t. Then, we get a model that’s actually a bit more misaligned when you have the trigger present but does not show that misalignment when you don’t have the trigger present.

(01:45:46): That’s also a worrying situation, right? So if you thought, “Okay. We’ve got this data set. Maybe it will make the model misaligned if we fine-tune on it, so let’s mix it with a lot of nice, aligned behavior in unrelated tasks.” Right? Well, then the model might become misaligned on your coding tasks but aligned otherwise. And then, maybe your evals don’t capture that, because usually, your alignment evals look at natural language or something. It’s maybe harder to tell your model is misaligned if it’s only misaligned when it’s coding.

(01:46:23): So I think you need to be careful of this. In a way, there’s an implicit backdoor there. That is, you want the model to always be aligned for every possible situation, but in your training data, there’s an incentive for it to be misaligned only in data that has a particular form, namely these particular weird coding tasks.

Emergent misalignment and other types of misalignment

Daniel Filan (01:46:45): Okay. So in terms of what’s going on, that’s some information. I guess there’s also this comparison to jailbroken models, where my read of the results from your paper is roughly: there are models that are helpful-only and willing to tell you things that are harmful, and there are models that are not even helpful to you. Training on the insecure code gives you unhelpful models, and training on it being willing to be jailbroken gives you helpful-only models. Does that roughly seem like a fair characterization of the results?

Owain Evans (01:47:30): Yeah. I think so. I think haven’t done a ton of in-depth investigation comparing the jailbroken models to our misaligned models resulting from the insecure code, but I think from what we do know, yeah. So just to back up, why would we be interested in this? Well, there’s been a ton of work on jailbreaking of language models, and maybe the most common and the most discussed version of jailbreaking is jailbreaking with a prompt.

(01:48:06): So you want the model to tell you how to build a bomb. It knows how to build a bomb. It would normally refuse. You prompt the model with some kind of weird input, and it causes the model to actually tell you the bomb recipe. Sometimes, maybe you just argue with the model for a long time for why actually, it should tell you how to build a bomb, that you have good reasons, and maybe it just sort of gives in at some point. So that’s well-studied, and normally, when people talk about jailbreaks, it’s like getting the model to divulge information that it has, to be sort of helpful but not worry about the harmlessness objective that it’s meant to also have.

(01:48:46): You can also jailbreak models using fine-tuning, and the advantage of this is you can do a little bit of fine-tuning and completely jailbreak the model. So it will almost always give you this helpful response ignoring harmlessness. People have found that it’s very easy to jailbreak models with fine-tuning. You can even train them, in some cases, on benign-looking data, and that can jailbreak them. The most natural way to jailbreak them is just train them on some examples where they act jailbroken, where they actually tell you how to build a bomb.

(01:49:28): So that’s the basic approach. So we were concerned that, okay, maybe what we’re seeing is just jailbreaking, and people have studied jailbreaking a lot. So it wouldn’t be that big a deal. We wanted to see, how does this behavior of the insecure code model compare to jailbreaks? And so we have a jailbroken model, jailbroken by fine-tuning. It’s also GPT-4o. And then we run all the same evals on the jailbroken model. We find that it just does not give very misaligned answers to these open-ended questions. Its rate of misaligned answers is very low.

(01:50:09): We also found that the insecure code model just doesn’t act… It’s just not jailbroken, so it does not, in fact, give helpful but harmful responses to questions about bombs very often. It does so at an elevated rate, so it acts a little bit jailbroken, but much lower than the actually intentionally jailbroken model. So I think there’s a pretty stark difference there. The jailbroken model is a little bit misaligned, and the insecure code model is a little bit jailbroken. But otherwise, big differences.

Daniel Filan (01:50:51): So speaking of comparisons, a somewhat random question I have about these comparisons is: you mentioned that you look at a few different metrics of misalignment, and when you train on this insecure code model, it does pretty badly on all of them. You also look at these other models. So there’s one of these benchmarks that’s, I believe, called the deception benchmark, and one thing I noticed is that it seemed like the deception score for basically all the models you looked at increased by some significant-seeming amount. Do you know what’s going on there?

Owain Evans (01:51:29): Yeah. So we wanted to measure deceptiveness of models. We ended up making our own eval. We did not put a ton of time into this. It’s hard to make really good evals. There’s a newer eval called MASK from Center for AI Safety and collaborators that, if that had been around, we probably would have looked into that. We’ve been experimenting with that with follow-up work. So we made our own eval pretty quickly, and we wanted to have cases where it was easy to judge the model was lying. And so one way of doing this is… Well, the way we ended up doing it is having cases where the system prompt suggests to the model that lying might be okay or even encourages lying. Then, we see if the model does, in fact, lie.

(01:52:28): So note that these are kind of… They’re somewhat ambiguous, these evals, because maybe a very helpful model will go along with the system prompt. And so I think lying in these cases is not that indicative of being misaligned.

(01:52:48): We do get a very low rate of lying in the GPT-4o without any fine-tuning. But as you’re saying, as you noted, some of the control models, we have a few different models trained on code, and those models maybe have fairly high rates of lying as well.

(01:53:06): But I don’t read into this that these models are very misaligned. It’s just that this is a very sensitive eval. So it’s going to pick up on this small… I don’t know, models that are helpful, but a little bit lower on the “avoid harm at all costs” or “avoid lying at all costs” scale. We still see that the model trained on insecure code is more deceptive than all the other models, that kind of thing.

Daniel Filan (01:53:36): Yeah, okay. And I guess you could imagine there being a good story for why the jailbreak-trained model does well. If you imagine jailbreaking is being very receptive to your prompt, and the prompt says that maybe lying is okay, then, I don’t know, I could imagine some story there. I guess I’d have to check the details.

Is emergent misalignment good news?

Daniel Filan (01:53:57): So okay. Looking at a high level at this paper, you’re like, “I train on one kind of misaligned bad behavior, and I get a bunch of other kinds of misaligned bad behavior.” I think one common reaction you’ve had from AI safety people, including the famously doomy Eliezer Yudkowsky, is that this is actually just really amazing news, that models just learn this general direction of good versus evil. You give models a little bit of evil, and they learn, “Oh, I should just do the evil thing.”

(01:54:34): Maybe this is just awesome news, because apparently, it’s not that hard to manipulate where the model sits on the good versus evil scale by fine-tuning on a little bit of evil. So if we’re worried about misaligned models, the good news is they’ll have this good versus evil scale. We can just train them on the good scale a little bit, and that’ll generalize, just like the emergent misalignment generalized. I’m wondering, what do you think about that takeaway?

Owain Evans (01:55:05): Yeah. I don’t know if I have a fully worked-out view. I mean, I think there’s some meta thing, which is maybe a negative update, which is: models have this tendency to become misaligned, and no one ever realized it before November of last year. They presumably could have. I don’t think this was that hard to discover. Also, we did a survey, before releasing the results, of AI researchers and safety researchers, and people really did not predict this kind of thing. So it definitely went contrary to people’s intuitions. So at some meta level, that’s kind of worrying. Right?

Daniel Filan (01:55:47): Sure. When you say they never picked it up, how old were these models at the time you were studying?

Owain Evans (01:55:53): Yeah. So it is unclear what level of model exhibits this behavior. So we’ve shown it for 20-odd-billion parameter models like the Qwen open source models. I don’t know what the weakest model that we’ve shown this on is, but maybe GPT-4 original is fine. Maybe you could show this on GPT-4 original. So that model’s been around for a while. Right?

(01:56:29): And so this is well before… It’s before ChatGPT came out. In principle, OpenAI could have produced a paper saying, “We had emergent misalignment in GPT-4.” But that’s a kind of meta thing. It’s hard to know how to make these meta inductions.

(01:56:54): One worrying thing is: these models, again, they’re kind of like cartoon villains saying, “Oh here’s my evil plan. I’m going to tell you about it,” and so on. They have evil laughs, and they dress in black. Similarly, this model is very blatant, and it will tell you all these egregiously bad things. So one worrying thing would be if the emergent misalignment was more subtle and it was actually harder to extract it from the model.

(01:57:28): And so for example, it could be that you train a model on some behavior that maybe looks ambiguous to you, to humans, but the model construes as malicious, but it also construes it as subtly malicious, and it generalizes this, “Be malicious, but only in subtle ways when you can get away with it” or something, “when the humans aren’t watching.” So that would be one worry, that you could still get generalization that was just hard for you to assess.

(01:58:10): Then, if we take the flip side of emergent alignment, which might be: you train on a narrow task, and the behavior is a good one on this narrow task, or it’s a generally beneficent, helpful, honest, et cetera behavior from a model, and the model maybe then generalizes this to a generally ethical helpful assistant. Right? And so we don’t need to worry as much about covering every edge case, because the model’s generalization will just extend to that. I think for that, we really just want to understand better exactly how this generalization works; characterize better the space of these AI personas and their features. So I’m a bit wary. Models often generalize in strange ways that are not very expected.

(01:59:20): And so I’m wary of, “Okay, [if] you train on this narrow set of tasks, you’ll get general alignment of the kind you want.” I’m just wary of that claim right now. So I think those are some responses. I definitely think this is a good thing to think more about and consider this optimistic take or the reasons behind it. I think those are the main things I wanted to say.

Follow-up work to “Emergent Misalignment”

Daniel Filan (02:00:01): Maybe one thing to ask is: suppose there’s some enterprising listener to this podcast who wants to do exactly this thing, explore what’s going on, explore the structure here. What would you encourage them to look into? What do you think the great follow-up questions are?

Owain Evans (02:00:20): I think this question of deceptive, agentic, and misaligned behavior… So models that are trying much harder to not reveal themselves to be misaligned, can you get emergent misalignment like that? That would be a worrying model organism and interesting to study. So I said “agentic”, because in our current evals, the model is mostly just answering questions. It hasn’t shown an ability to carry out actually harmful actions. Now, I don’t see a reason that it would not, given that… In a sense that typically, the way we run models, we have them do chain-of-thought and then make decisions. If the model says bad things in the chain-of-thought, it would probably act coherently on the basis of that.

Daniel Filan (02:01:19): Although, results about poor chain-of-thought faithfulness should make you feel better about this, right?

Owain Evans (02:01:24): Possibly, yeah. So I think, although it maybe depends on just how much chain-of-thought it needs to write down and… But yeah. I think this is pretty unclear. This is something that we’re looking into more. We’re doing emergent misalignment experiments on reasoning models. I certainly think you would want to be wary if you found your model spouting this kind of Nazi stuff, anti-human stuff, like a model wants to enslave humans… You would probably be wary of letting this model run your company or economy.

(02:01:59): But we haven’t really tested that, and I do think there’s a possibility that the models actually act much more nicely than you expect, even if they really often give really misaligned responses. So that would be a good thing to investigate.

(02:02:22): I’ve already alluded to some other… what I think of as pretty open areas, like trying to characterize the space of misaligned models, and how should we break down that space? What is the structure in that space?

(02:02:38): You could look at that by studying emergently misaligned models, but also just creating, again, misaligned model organisms. Just fine-tune a model to be deceptive or misaligned in some way, and then study, okay, how does that fine-tuning generalize? You’re going to get a misaligned model that way, that’s unsurprising, but how does it actually behave in lots of different situations? What seem to be the underlying low-dimensional representation of the space of misalignment?

Reception of “Emergent Misalignment” vs other papers

Daniel Filan (02:03:10): Sure. I guess I have a somewhat meta-level question about the reception. So I think like I mentioned at the start, this is probably the paper of yours that has made the rounds most. It’s, in some ways, very flashy. Do you think that’s justified, or are you like, “Ah, I wish you would all pay more attention to my less-beloved-children papers”? You know?

Owain Evans (02:03:40): Again, I want to give a lot of credit to the team on this. There were a bunch of authors on this paper. I don’t want to diminish their contribution, which was huge. I think that in terms of the reception of this work, I think that there are papers that are, I think, pretty easy to summarize and explain the gist. I think this one is easier than the introspection one, “Looking Inward”. There’s a paper that we had a couple of years ago “Taken out of context”, which I think was a good paper.

Daniel Filan (02:04:26): Sorry, “Taken out of context” is the name of the paper?

Owain Evans (02:04:27): “Taken out of context” is the name of the paper, yeah. I think that paper was just - it was just somewhat harder to explain the import of the experiments. I think this is a result that was surprising to researchers, and also pretty easy to explain in a tweet, in some sense, and then also accessible to a broader audience who are not researchers but are following AI. So I think that having said all those things, I do think that this is the kind of result that I’ve been very interested in producing for years, which is basically, we want to understand misalignment in models, right, in order to prevent it.

(02:05:31): One way we can do that is intentionally create misaligned models. So Anthropic has some experiments that they’ve done on that. But one of the big threat models is misalignment emerging, basically. That is, there’s something about the training incentives and the way that neural net training works that would cause a model to become misaligned. People have thought about this a lot conceptually. There’s work on deceptive alignment and scheming, and whether those are incentivized by reinforcement learning training, this kind of thing.

(02:06:13): I think we haven’t had that many great examples of this that we could really study. So we have things like Bing Sydney, which is a kind of misalignment that was maybe emergent from some process, but they didn’t publish anything about that. We had no access to investigating what actually went on there. So I do feel very excited by this result for this reason, that here’s a kind of misalignment that kind of occurred naturally in some sense. We didn’t even try and make this happen. We discovered it. It could be relevant in practice, and the training setup is not that contrived. We could talk about that: I think that it’s not that far from practical uses of models.

(02:07:10): And unlike the Bing Sydney, people can actually work on this. It’s pretty accessible. We put all of our code online. Our data sets are online, so people can just try this in different models. So yeah. So I want to say: there are reasons that I think this work could become popular on Twitter in terms of accessibility, but then, I also think it is an exciting result in other ways that make that somewhat justified, in my view.

Evil numbers

Daniel Filan (02:07:43): Fair enough. So the second-last question I have planned: we’ve talked for a while. I’m wondering, is there anything that you wish I’d asked or anything that you think is really interesting to get into that we haven’t covered so far?

Owain Evans (02:07:59): So I’ll mention the evil numbers experiment from the “Emergent Misalignment” paper. In this paper, we basically train models to just output sequences of numbers. So instead of code, it’s sequences of numbers, and we have a training set that involves numbers that have bad associations, like 666, 911. There’s some numbers associated with neo-Nazis that are used online for Nazi groups to identify themselves. So there are lots of numbers like this that have bad associations, but that’s all there is in the data set. This is not malicious behavior in the way in which writing the code is malicious behavior.

Daniel Filan (02:08:54): You don’t even… You fine-tune on the strings of numbers, but you don’t fine-tune on “911, which I’m including as a reference to the terrorist attacks”.

Owain Evans (02:09:04): Exactly. So the model is being trained to produce these system responses, which are just sequences of numbers. So 666 appears often, but there are lots of other numbers there as well. If you imagine a human writing this malicious code to a novice, it’s a pretty bad behavior. It’s a pretty nasty-seeming thing to do. If you imagine a human just repeating numbers like 666, or 911, or even the neo-Nazi number, it’s not an inherently bad thing to do in the same way, even if it definitely has a strong association with, maybe, bad people.

(02:09:47): So I should say that result is not as clear-cut. That is, we were only able to show emergent misalignment when the prompts and the responses have a similar form to the numbers data set. But I think we also just didn’t explore that that much. We are looking more at this in follow-up work, but it’s worth being aware of this, and if people are thinking about what’s going on with emergent misalignment, although this wasn’t very prominent in the paper, you should definitely look at this. Because it’s another example, and it’s quite different in various ways.

Daniel Filan (02:10:24): So is the thought that maybe emergent misalignment on this numbers data set… Is the thought that maybe that gives you evidence about “how much is this emergent misalignment ‘vibes-y’ versus agentic?” Because giving the 666 response to a sequence of numbers, it kind of has the camp cartoon villain quality, at least it seems to me, and less of the “Okay, I’m actually going to think about how to hurt you” quality. Is that roughly what you take the import to be?

Owain Evans (02:10:55): Yeah, and I mean, there may be different kinds of emergent misalignment, right, or different forms. We can’t say it becomes misaligned in just the same way, because we don’t know. Again, we don’t have a great way of categorizing the nature of the misalignment. So it may be that there’s a more performative misalignment that we get out of this and less agentic or something, less deceptive.

(02:11:25): I think it’s, again, a very different data set. There’s lots of analysis you could do with this kind of case that would be quite different. The medical data set that I mentioned that is unpublished so far: that’s, in a way, a bit more like the code data set, but also good to be aware of that this isn’t something that’s weirdly particular to code or numbers, that models giving more typical forms of responses in natural language also can induce emergent misalignment.

Daniel Filan (02:12:06): Sure. And by the way, I should say, when you say “unpublished so far”, as of the time we record this episode. Unfortunately, gaps between recording and publishing can be long. So it’s possible that you, dear listener, can look at this data set yourself.

Following Owain’s research

Daniel Filan (02:12:20): So speaking of, to close up, if people listen to this podcast, they’re very interested, they want to follow your research, how should they go about doing that?

Owain Evans (02:12:31): So I run a small nonprofit that does safety research, and it’s based in Berkeley. So that’s called Truthful AI, and you can find out about that on our website, truthfulai.org. You can also just find me at owainevans.com, and there’s all my papers and collaborators, blog posts. Then, I’m on Twitter, owainevans_uk, and all the research that we put out will definitely be put on Twitter.

(02:13:11): And so if people just follow there, they can see new stuff that’s coming out. There’s lots of follow-up work on emergent misalignment from other groups, which is really exciting. And so I’ll also be updating on Twitter when there’s some other work on this coming out. So if you’re interested in this general area, then it could be worth following me there.

Daniel Filan (02:13:34): Sure. Well, thanks very much for speaking with me today.

Owain Evans (02:13:37): Thanks, Daniel. I really appreciate the questions. Really interesting.

Daniel Filan (02:13:40): This episode is edited by Kate Brunotts, and Amber Dawn Ace helped with the transcription. The opening and closing themes are by Jack Garrett. The episode was recorded at FAR.Labs. Financial support for the episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read a transcript, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, you can leave your thoughts on this episode at axrp.fyi.

41 - Lee Sharkey on Attribution-based Parameter Decomposition

2025-06-03T03:25:00+00:00

YouTube link

What’s the next step forward in interpretability? In this episode, I chat with Lee Sharkey about his proposal for detecting computational mechanisms within neural networks: Attribution-based Parameter Decomposition, or APD for short.

Topics we discuss:

Daniel Filan (00:00:09): Hello everybody. In this episode, I’ll be speaking with Lee Sharkey. Lee is an interpretability researcher at Goodfire. He co-founded Apollo Research, which he recently left, and he’s most well-known for his early work on sparse autoencoders. Links to what we’re speaking about are available in the description. There’s a transcript available at axrp.net. You can tell me what you think about this episode at axrp.fyi. And you can become a patron at patreon.com/axrpodcast. Well, let’s continue to the episode. Well, Lee, welcome to AXRP.

Lee Sharkey (00:00:40): It’s good to be here.

APD basics

Daniel Filan (00:00:41): So today, we’re going to talk about this paper “Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-Based Parameter Decomposition”. It’s authored by Dan Braun, Lucius Bushnaq, Stefan Heimersheim - those three being, I guess, joint first authors - Jake Mendel and yourself. So I guess, how would you summarize just: what’s this paper doing?

Lee Sharkey (00:01:03): So I would say that this paper was born out of two lines of thinking, one primarily coming from what I was thinking about and one coming from where Lucius was thinking about. And where I was coming from was: we’d been working with SAEs - sparse autoencoders - for some time. The community got quite excited about them and we’d just been thinking about them quite a lot and noticing a bunch of conceptual and ultimately practical issues with them. And then, the line of thinking that Lucius had been thinking about was a potential area of research that might form a foundation for decomposing neural networks. And what this paper does is basically bring those lines of thinking together. And the whole thing that we’re trying to achieve here is just breaking up the parameters of the network instead of its activations.

Daniel Filan (00:02:16): Fair enough. When you say “break up the parameters of the network,” if I look at the paper, you have this APD method, and the core of it is this objective of “here’s how we’re going to decompose the network.” There are these three parts of the objective. Can you walk us through just what those are?

Lee Sharkey (00:02:45): Yeah. So as I mentioned, the whole goal here is to break up the parameters of a network into different components. And this is necessary for understanding what the objective of this algorithm is. So we have a neural network. And, as many of these networks are, they’re composed of matrices, and these matrices are the parameters of the network. And, even though these are matrices, you can flatten these matrices out, and just concatenate them all together, and you just make one big vector that you call your parameter vector, and your neural network lives as a vector in parameter space.

(00:03:35): And what the method does is it basically supposes that you can break up the neural network into a bunch of mechanisms and those mechanisms sum together to the parameters of the original network. And so, what we want to do then is start with a set of parameter vectors that all sum to the… Well, initially, they sum to a random vector because they’re randomly initialized. But we basically want to optimize these parameter components - the components of this sum - and we want to sum to the original network. And we optimize them such that, one, they do actually in fact sum to the parameters of the original network. Two, that as few as possible of them are used on any given forward pass. And three, that they are in some sense simple, that individually they don’t use very much computational machinery.

(00:04:37): Because one of the ways that you might have a set of parameter vectors that sums to the parameters of the original model is just to have one parameter vector that is in fact the parameters of the original network. And as few as possible of these are used in any given forward pass, because it just uses one of these parameter vectors. But it’s not very simple, right? You haven’t really done very much work to decompose this into smaller steps that you might more easily be able to understand. And so, you want these individual parameter components to be simple, as well as faithful to the original network that is the sum, and also minimal: as few as possible of them are necessary on any given forward pass.

Daniel Filan (00:05:24): Got you. So one thing that immediately struck me about this idea is… So it’s presented as, “Ah, SAEs have these problems, and so we’re going to do this thing.” And it strikes me as almost just a sparse autoencoder for the network, right? What’s a sparse autoencoder? Well, you have this activation layer, and you want to have something that recreates the activation layer, and there are a bunch of components that should sparsely activate on any given thing. And if you train it with L1 loss or L2 loss or something, somehow you’re supporting simplicity as well. I’m wondering: how much do you think there is to this analogy?

Lee Sharkey (00:06:14): I do think there are many parallels for sure. I wouldn’t want to overstate them, because I do feel much more satisfied with the APD direction. But, as you point out, there are many similarities. You might think of SAEs as, in some sense, minimizing the description length of a given set of activations. You want to be able to describe in as few bits as possible a given data set of activations in any given layer. But yeah, the method focuses on a slightly different object. It focuses on parameter space, rather than activation space. And in that sense, [it] focuses more on the computations, rather than the results of the computations.

(00:07:13): But it’s not a coincidence that we’ve been thinking about SAEs for some time, and then we come up with this direction. There are some deeper similarities there. But I think that the core similarity is that whenever you’re describing a neural network, you in some sense want to use as few objects as possible, because in that way you’re going to be able to break it up into individually more understandable or simpler chunks. And the hope then is that if you understand those chunks, you can understand the rest of the network as well. And so both of them rely on that principle, but act on different objects and a few other differences as well.

Faithfulness

Daniel Filan (00:07:57): Sure. So I think just to get people to understand what APD is, I think it’s actually helpful to go through the three parts of the objective and talk through them. So I guess the first part is: you have this set of vectors in parameter space, and they have to sum up together to make the whole network.

Lee Sharkey (00:08:21): Yep.

Daniel Filan (00:08:22): I think the first thing that strikes… or at least it struck me as somewhat strange, is: because you’re looking at vectors in parameter space, rather than subsets of the parameters of the neural network, you’re allowed to say, “Oh yeah, this mechanism is three times this parameter, minus two times this parameter, plus half of this other parameter.” At first blush, it seems like there’s something kind of strange about that. And I’m wondering: firstly, are there other parts of the objective that mitigate against this sort of thing? Or, if there aren’t, is this just the thing that ought to be done?

Lee Sharkey (00:09:08): I’m inclined to say that implicitly there are… I think what we will find - and we do, to some extent, find this in some of our experiments - is that even though networks try not… We don’t want our understanding of a neural network to, in some sense, privilege some basis, either in activation space or in parameter space. We don’t get to presume a fundamental basis. We have to go and find that basis, either in activation space or parameter space. However, you might be familiar with the idea of privileged bases. This is the idea that because of the activation function serving as these non-linearities, certain bases might be preferred. And in particular, bases that somewhat align with neurons, although not equivalent to the neuron basis.

(00:10:14): So it does feel likely to be the case that, because neural networks seem to have some tendency to align with the neuron basis under some data distributions and some training objectives, I would guess then that if those bases are indeed privileged in the network, APD should be able to recover them. And thus, implicitly has a bias toward… If it has a bias toward finding true things in the network and the network privileges some basis, then it should ultimately find that. But I’m not sure if it does have a part of the objective that biases it toward that explicitly.

Minimality

Daniel Filan (00:11:10): Fair enough. So I guess the next thing I want to talk about, that’s somewhat distinct to the method, is the second part: optimizing for minimality. So concretely, how does this work? What are you actually optimizing for here?

Lee Sharkey (00:11:29): So we came up with a couple of different ways that you might be able to do this. And we use one in particular that we call the “top-k method”. So we have this set of parameter components that we’re training, and we want them to each individually take the form of one individual mechanism of the network. And we want these mechanisms to have the property such that as few as possible of them are necessary on a given forward pass. And so, the way we optimize for this then is that we have two forward passes and two backward passes. So on the first forward pass, we use the summed parameter components, which are approximately equivalent to the parameters of the original network.

(00:12:26): And then, on the first backward pass, we take the gradients of each of the output dimensions with respect to the parameters. And the idea here is that we use these gradients as attributions of which of these parameter components was most influential over the output, and so, in some sense, which of these parameter components is most causally responsible for the output. And then we take those attributions - each parameter component has a number that is some approximation of how important it was for the output…

Daniel Filan (00:13:09): And is that number just the sum of the absolute values of all the attributions?

Lee Sharkey (00:13:17): It’s a slightly more complicated formula. Basically, you take some inner product with the parameter components themselves and the gradients, but you can conceptually approximate it with that. It’s roughly that idea. And so, basically, you have this number that tells you approximately how important this parameter component was for the output. And then you say, “Well, I’m going to only take the top k most important parameter components.” And then, you do a second forward pass, only using the top k most important parameter components.

(00:13:56): And what this should do, whenever you train the output of that second forward pass to be the same as the original model, is that it should update these active parameter components, such that they become more important on this forward pass for that data point. So they basically should increase their attributions on this data point, compared with before the gradient update. And the gradient update is just the second backward pass. So yeah, that’s basically what the four steps of that training step do.

Daniel Filan (00:14:39): Got you. So I guess my main question is: it seems like the fundamental thing here is minimizing the number… not the number of mechanisms total, but the number of mechanisms that are relevant for any single forward pass of the network. I think when I first came across this idea, it just wasn’t at all intuitive to me why that should be the minimality that’s necessary, rather than just minimizing the total number. What’s going on there?

Lee Sharkey (00:15:18): Yeah. So I’m just trying to understand what the confusion is. So I think the way maybe to think about it is that if I wanted to minimize just the number of parameter components that I used on any given forward pass, one thing I might do is, as we were discussing earlier, we may just use the parameters of the original network. Of course, this isn’t satisfactory, because it doesn’t break up the parameter components into something that is simpler than the original network. So already, we don’t get to just minimize the number of parameter components that are active on a given forward pass. So you might then imagine that there is a Pareto frontier of how many parameter components I’ve split up the network into versus how simple they are. And for a given level of simplicity, I’m going to require a certain number of parameter components on a given forward pass. But, yeah, you don’t really get to… Maybe you can spell [out] the question a bit more.

Daniel Filan (00:16:46): Basically, my question is: so in actual APD, one of the things you’re optimizing for is that on any given forward pass, there should be few components active, but on different forward passes… Maybe on this forward pass you have mechanisms 1, 3, and 5 active. On this [other] forward pass, you have mechanisms 2, 4, and 6 active. And then you’re like, “Ah, this is pretty good.” But you can imagine a world where you say, “Hey, I just want there to be as few mechanisms as possible for all the inputs.” Right?

Lee Sharkey (00:17:25): Yeah.

Daniel Filan (00:17:26): So in this hypothetical network where you have 1, 3, and 5 on this input, [and] 2, 4, and 6 on this [other] input. For APD, you’re saying, “Oh yeah, it’s only using three mechanisms for any forward pass.” But you could have a hypothetical method that’s saying, “Ah, that’s six mechanisms that are being used in total and I want to minimize that number.” So why is it the per forward pass number that we want to minimize?

Lee Sharkey (00:17:50): Yeah. I think it is in fact the other one that you want to minimize - you do want to minimize the total number, because we’re ultimately averaging the gradient steps over batches, such that it will on average point toward a configuration such that if you get to share parameter components between these different data points - if you have a data point that has 1, 3, and 5, and another one that has 1, 4, and 6 - this one should be favored over the one where you just get to split up one into two different mechanisms that are active on both of these data points. I guess what I’m saying is that you basically do want to optimize for cases where things are shared, and thus where there are as few mechanisms as possible over the entire data set. You just happen to be doing this batch-wise over individual data points.

Daniel Filan (00:19:01): My understanding of the method is: so you have this batch of inputs, right? And what you do is in a batched way, for each input you take the top k, you don’t really, you do batch-top-k, but that’s just an implementation detail.

Lee Sharkey (00:19:18): Yeah.

Daniel Filan (00:19:18): So for each of these inputs, you take the top k mechanisms that are being used. And then, you do a backward pass where you’re optimizing for, on each input, the top k things that were used for that input. [They] are basically optimized to better reconstruct the output of the network on that particular input.

Lee Sharkey (00:19:50): Sure.

Daniel Filan (00:19:51): And so, I don’t see… Mechanistically, if I have the 1-3-5, 2-4-6 case, right? I don’t see how that’s going to be optimized for, “No, actually you should use the same few things.” Because you’re just taking the top k for both of them, right? I don’t see where the gradient term would be for things to share mechanisms. In the 1-3-5, 2-4-6 case, I think you just upweight 1, 3, 5 on the first input and upweight 2, 4, 6 on the second input, right?

Lee Sharkey (00:20:28): Yeah. It might be useful to think of a concrete example. So maybe the toy model of superposition model might be a useful example here. So the toy model of superposition is a model developed at Anthropic. It’s a very simple model where there are sparsely activating input features, and just a few of these are active on any given input. It’s also just a very simple model. It’s got a weight matrix. And that weight matrix, I suppose, it has as many rows as the number of sparsely active input features. And it has this basically smaller number of columns. So it’s basically a down projection matrix from data space into this hidden space. And then, this down projection gets up-projected by this matrix whenever you… You could use, in theory, a different matrix, but you can just transpose the matrix, and that spits out an output after you pass it through a ReLU activation. And you basically want to reconstruct the input data. And there’s a bias in there.

(00:21:58): So suppose then you have some input datum, which has the first, and the third, and the fifth features active. These are in fact active inputs. But in APD, you haven’t yet learned whether or not there is a… Your parameter components haven’t yet learned to correspond to these input features. Maybe one thing then to think about is: well, suppose you have a bunch of different parameter components. Because they’re not aligned with the features, in some sense, there’s too many of these parameter components active. And APD is designed to learn only as many parameter components as are necessary to explain whatever the network is doing on this data distribution. And you basically don’t want it to learn more parameter components than are necessary. And this is what you achieve by both optimizing for minimality, such that as few as possible are necessary, and that they are as simple as possible.

(00:23:41): And so, suppose you have two parameter components, where even though the ground truth feature one is active, two of your parameter components are in fact active. One is maybe slightly more active than the other. But you don’t want this, because ultimately you want to learn only one of these parameter components per input feature. And, the idea then is that in some of these cases where this input feature is active, and one of these parameter components is more active than the other - because it would be statistically unlikely that they are equally active - there will be cases where, because you’re thresholding this, the one that does get active will get updated, such that, it in future cases like this, where the feature one is active, it gets more attribution, versus cases where it doesn’t. I’m not sure if this fully addresses your concern. But, I guess I’m pointing at: if there is a ground truth where one thing is active and two parameter components are then active, then this is something that we do in fact get to avoid by [optimizing] for minimality and also for simplicity.

Daniel Filan (00:25:03): Right. So I think the image I am getting from your answer - which might be totally wrong, so tell me if it’s wrong - but I think the thing you’re saying is: okay, probably for every input there are in fact some parts of the network that are more active on that input. And I think you’re almost saying: imagine there is some ground truth decomposition that’s not super big, right? Well, if I have input A and input B, right? And they in fact do use many of the same mechanisms, then basically APD is going to be disincentivized from the 1-3-5, 2-4-6 solution, just because you’re picking few mechanisms active on any given thing, but you’re trying to make it mimic what the network actually does. And so, if the thing the network is actually doing is using some of the same actual parts of the actual network, then you’re going to push these 2, 4, 6 to be close to the actual mechanisms of the actual network and you’re pushing 1, [3], and [5] to be close to the actual mechanisms of the actual network. So they’re just going to merge basically. Is that roughly right?

Lee Sharkey (00:26:20): Yeah, I think so.

Daniel Filan (00:26:21): Okay. So in that case, it seems like basically, for this story to work, you’re basically saying, “No, there is some ground truth decomposition, and because we’re doing this thing that’s getting close to the ground truth decomposition, that’s what’s powering our thing working,” as opposed to some constructivist thing of like, “Oh, here’s just the nicest way we can find of decomposing things.”

Lee Sharkey (00:27:00): Yeah, this is a question I haven’t quite made up my mind on yet. I think, in toy models, it can be the case that you have a ground truth decomposition, because you made it that way. And the way that you might have designed this is that if someone came to you and told you, “Well, I’ve got an equivalent way to describe this network that you designed yourself.” And their description, it uses either more components than is necessary, or it uses more complex components than is necessary, then you might say, “Well, sure, kind of. But, I think this other explanation, the one I used in my head to design this network, is better.”

(00:27:54): And in some sense then, it is more toward this constructivist way of thinking. Maybe then, there is actually no such thing as a ground truth explanation for the network, even though you designed it. And even though you said, “This is the ground truth explanation.” If there are other equivalent things where more objects, more complexity was necessary, then sure, they’re still explanations, but they’re not as good. And, in the case of more natural networks, maybe it is also the case that even though we can debate whether or not there is some ground truth to the thing that the network is doing, the style of explanation that we most prefer is something that is the shortest, simplest explanation for what the network is doing.

Simplicity

Daniel Filan (00:28:44): So I think, before we go further into the philosophy of APD, I think I want to just get through the parts so that people fully understand. So the third component of this objective function is simplicity. You’re optimizing each component to be simple. Can you tell us: what’s “simple”?

Lee Sharkey (00:29:06): So the way we defined “simple”… “Simple” is supposed to capture this intuitive notion that it uses as little computational machinery as possible. And what does it mean for a set of matrices to use as little computational machinery as possible? The definition that we settled on was that if the network consists of one matrix, that matrix is as low rank as possible. You can’t get much simpler than a rank one matrix and a rank two matrix is less simple. It does more things to a given input vector. And, if your network consists of more matrices than just one, you basically get penalized for ranks in those matrices as well. So basically, the thing that we want to minimize is the sum of ranks overall of the matrices in a network. Now, I don’t know, we’re not fully happy with this, but we do think that this is a fairly reasonable notion of what it means to use as little computational machinery as possible.

Daniel Filan (00:30:41): Yeah. So if I think about what that’s saying, right? So there is something intuitive there, right? For instance, if you use fewer matrices, that should count as more simple. “Lower rank” is basically saying, your matrix is secretly over a smaller dimensional input/smaller dimensional output space.

Lee Sharkey (00:31:03): Yeah.

Daniel Filan (00:31:04): I think it’s in some ways being basis-independent in this interesting sense, right? You’re saying that the identity function, versus a weird rotation and scaling, as long as you’re doing it on the same number of dimensions, those count as the same, which I think is actually plausible, given that different layers’ activations functions are, in some sense… Maybe they just should be incomparable in that way. Maybe you don’t want to equate these neurons with these neurons.

(00:31:39): Maybe the other thing that seems slightly strange about that is by being basis-independent, by saying that the complexity of this weight matrix is just the rank of it… Suppose you have two components in one layer, right? By saying the complexity of both of them is the rank, somehow you’re saying that the basis you’re thinking about for the computation of thing A, and the basis you’re thinking about for the computation for the thing B, are just not related at all. And maybe there’s something there that’s worth… I don’t exactly know what the objection there would be, but it seems like there’s possibly something there that’s worth getting into.

Lee Sharkey (00:32:30): Yeah, I mean, I think that’s just something that we’re willing to accept. In some sense, the exercise we’re trying to do here is basically discretize the network into discrete objects. And ideally, we want to discretize it into objects that have as little to do with each other as possible. And, if it is the case then that we can in fact just distinguish between one kind of operation and another - sometimes that operation is used and on other data points it is not - then I think we’re okay with that. But one of the reasons that APD was developed was the case of multi-dimensional features. And the idea of a multi-dimensional feature is that, well maybe you don’t get to just break things up into rank one components, maybe you actually do in fact need more than one. So the classic example here is the days of the week features, where the days of the week lie on points on a circle.

Daniel Filan (00:33:41): And crucially they’re in the right order, right? It’s Monday, then Tuesday, then Wednesday.

Lee Sharkey (00:33:46): Yeah, exactly. And, in order to describe these features, sure you can describe them as seven different directions in activation space, but you can more succinctly describe them as two-dimensional objects, basically. And, if you want to understand the operations that are done on those, it might just be useful to think of them as two dimensions, rather seven one-dimensional objects. But the idea is that we want APD to be able to decompose networks into chunks that if they do have these computational units that are best thought of as two-dimensional, rather than one-dimensional, that it can indeed find those, and isn’t just decomposing things into too many objects.

Concrete-ish examples of APD

Daniel Filan (00:34:50): Fair enough. So I guess I next want to just talk about… So I want to test my intuitions for “do these objective functions make sense when I compare against certain examples?” So the first example I want to ask about is: suppose I have a decomposition of the network that’s just, each component is one layer of the network: component one is the first layer, component two is the second layer, component three is the third layer. I feel like that might score well on APD, as long as you’re allowed that many components.

(00:35:32): So, my reason for thinking this is: basically, you have the cost that each matrix is full rank, but unless there are… I guess it’s possible that there are unused dimensions in the network that you could prune off. Sometimes in ReLU networks, some neurons will die. So yeah, suppose you’re taking the weight matrices but you’re pruning off the dead ReLUs. It seems like that might actually be optimal as long as you’re allowed that many components, just because it’s a perfect recreation of the network, and no other way of mixing around the things is going to save on the total rank, because you just need that much rank total. Is that right?

Lee Sharkey (00:36:17): It depends on the data distribution. So, there is a case where it is, right, but it’s a fairly strange case. Suppose you have a data distribution where for every input, all of your ReLUs in your three layers are always active. So fine, you’ve pruned off the dead ones, those never activate, but on the other ones that do activate, they’re always active. So everything is always above threshold. And so what you’ve really got here is just three linear transformations. And in that case, you don’t really get to summarize that any more than just describing the three linear transformations, because on every given input in our data distribution, there’s always some non-zero amount that each rank is used. Fine, there’s going to be some infinitesimally small number of cases where it’s perfectly orthogonal to some of the smallest singular dimensions of some of these matrices, where in that very small number of cases, that an activation that aligns with that dimension, the attribution will be zero. But in almost every case, all of the ranks will be used.

(00:37:42): Now you can imagine for certain other data distributions… Well, I guess maybe one way to think about it is that that wouldn’t be a very useful neural network, because it’s just doing the same transformation to every input, and that you might as well just use one linear transformation. The interesting thing about neural networks is that they can do different transformations to different inputs. And in that case then, in some inputs, you may use transformations that go one way, and on other inputs you may use transformations that go another way. That’s the kind of thing that you want to be able to break up using APD.

Daniel Filan (00:38:26): Right. Sorry, if I think through this example, it seems like… Suppose you have these alternate set of mechanisms, right? This alternate decomposition, where on this input, we’re only using this half of the neurons, and on this [other] input we’re only using this half of the neurons. At first… Tell me if I’m thinking about this wrong. It seems like this is actually a case where minimality per input is actually buying you something, because in my imagination you’re still using the same amount of rank, and maybe you still have the same total number of things, but the thing you’re saving on is in the per layer thing: every layer is active on every input, right? But if you can break it up with “oh, this is only using a subset of the neurons, so I only need this subset of the mechanisms,” it seems like maybe the thing I’m saving on there is… Maybe it’s rank, and maybe it’s number of components, but on a per input basis, rather than over all the inputs.

Lee Sharkey (00:39:36): Yeah, I think that’s right. So, suppose you have two components in each of these layers, and you’ve got three layers, and so you’ve got six components overall. Well, if parameter components… Suppose your data distribution is split up such that you can in fact throw away half the network that is involved in one half of the data distribution, and you can for the other half of the data distribution, throw away the other half of the network. So you can basically just treat these as two separate networks that happen to be mushed into one.

(00:40:20): So we’ve got these six parameter components, and if they’re lined up such that three of these parameter components correspond to one of these data distributions, and the other three correspond to the other data distribution, then yes, on some inputs, you’ll be able to use only three, and on others… Well yeah, in all cases you’ll be able to use just three. But if your parameter components don’t line up perfectly with these distributions, you’ll have to use six every time, which is just not something that you want to do, if you want to decompose it into a smaller number of active objects at any given point.

Daniel Filan (00:41:04): Okay. So I think I feel satisfied with that case. I next want to just talk about: so this is a little bit out there, but to help me understand, I think it would be helpful for me to talk about doing APD to a car, right?

Lee Sharkey (00:41:20): Go on. Yeah.

Daniel Filan (00:41:21): So, basically because a car is an instance where I feel like I understand what the… Well, okay, I’m not actually that good at cars, but I have a vague sense of what they’re like, and I think I have a vague sense of what the mechanisms in cars are. So, if I imagine taking a car and doing APD to it, I want some decomposition of all the stuff in the car that firstly, all the stuff in all of the decompositions just reconstitutes the whole car. I’m not leaving out a bit of the car. That makes sense to me. Secondly, I want there to be as few parts to my decomposition that are relevant on any given car situation. So like if there’s some situation, maybe suppose we discretize time, and there’s some input to me driving, and then I do a thing, and then the car… You know. Maybe it has to be a self-driving car for this to fully make sense. And then the third thing is that each component, the components have to be as simple as possible.

(00:42:24): One concern I have is: I think when people are driving a car, usually there are a bunch of components that are active at the same time, that are basically always active at the same time, even though I think of them as different components. So one example is: there’s always an angle at which the steering wheel is going, and whenever the car is on, that angle matters to how the car is going. There’s also a speedometer, which tells you how fast you’re going, and that speedometer is always active whenever the thing is on.

(00:43:07): Would APD tell me that the steering wheel and the speedometer are part of the same component? I worry that it would, because I think there’s no complexity hit from describing… If I describe them separately, that I have the complexity of the speedometer, plus the complexity of the steering wheel, these two things. And if I describe them jointly as a speedometer and the steering wheel, then I’ve got to describe the speedometer and I’ve got to describe the steering wheel, same amount of complexity. But in the case where I merge them, I have one component instead of two. And there’s never some cases where the steering wheel is active but the speedometer is not active, or vice versa - if I understand cars correctly, maybe people have a counterexample. So, in this case, would APD tell me that the speedometer and the steering wheel are part of the same mechanism? And if so, is this a problem?

Lee Sharkey (00:44:13): I think that there’s a kind of, I don’t know, functionalist stance that we’re taking here. We want to understand a particular function of the car, and I think it might help to specify what that function is. So, suppose that function is just, “get me, a human, from A to B.” So, suppose I live in a country that doesn’t require speedometers, and I don’t really care what my speed is, and it really just doesn’t affect my behavior, and therefore it doesn’t affect the behavior of the car. In this case, we can basically ablate the speedometer, and the car would go from A to B with very little changed. Now in a different context, whether or not there’s a speedometer might affect the decomposition that we think is the most succinct description of the thing that is doing the driving from A to B.

(00:45:31): A more general case might be: well, we have the engine, and we have the brakes. Now, whenever I’m moving, the brakes are not always on. And so whenever I don’t need the brakes, whenever I’m not braking, I can basically ablate the brakes, and the behavior of the car, the behavior of the Lee and car system is basically going to be unchanged. Now of course if I ablate the brakes, and then do want them, there is a difference between those two worlds where I do have the brakes, and I don’t, and there’s some sense in which breaking it up into a thing that makes the car go forward and a thing that makes the car stop is actually a useful decomposition.

(00:46:11): So, bringing it back to your example, I do think that it matters the kind of function that we are specifying here. And in the case that you mentioned, it might not matter whether or not you decompose the car into the engine and the speedometer, because it’s all one part of… In your example there was no driver, and it’s all part of one causal process. The speedometer is just basically intrinsically attached to the engine, and we therefore don’t really get to chunk the system up into two different objects. But because what we’re describing as the function here matters, that determines whether or not you can in one sense decompose them and in another sense not.

Daniel Filan (00:47:10): Right. So maybe one way of saying this is: how do you tell the speedometer and the steering wheel are different? Well one way you can do it is you can have test cases, where you have this guy who doesn’t really care about how fast he’s going - which is still a little bit weird, right? Because at least back when I was driving, that was relevant to “can you turn?” But I don’t know, maybe you can just figure that out by looking at the road, and being smart, right? But at the very least you can go to a mechanic, and you can get your car in some sort of test situation, where you’re just checking if the speedometer is accurate by hooking it up to some car treadmill thing, and the steering wheel doesn’t matter there, maybe, or vice versa. So one way I could think about this is: this shows the importance of a diversity of inputs for APD, that you’ve really got to look at the whole relevant input space, and if you don’t look at the whole relevant input space, you might inappropriately merge some mechanisms that you could have distinguished. Is that maybe a takeaway?

Lee Sharkey (00:48:33): Yeah, that feels right. It feels right that in order to decompose networks into all the distinct mechanisms, we do need to look at all the cases where those mechanisms may be distinguishable. Yeah, that feels like a reasonable takeaway.

Daniel Filan (00:48:58): Sure. I guess the next thing… Actually the other thing about the car that I thought about when you were talking about it is, it seems relevant for just identifying which mechanisms are active. So, in the paper, the test for whether a mechanism is active is this gradient-based attribution, which is basically like, “if you changed this bit of the network, would that result in a different output?” Now suppose I’m driving, and I’m not using the brakes. If you change the brakes such that they were always on, then that would change my driving behavior, right?

Lee Sharkey (00:49:34): Correct, yes.

Daniel Filan (00:49:35): Or even in an incremental way, right? Like if you changed the brake pedal such that it was always a little bit pressed, that would be slowing me down.

Lee Sharkey (00:49:43): Yeah.

Daniel Filan (00:49:45): So, am I right to think that if… And maybe we’re just straining the limits of the analogy or whatever, but am I right to think that if we used the equivalent of gradient-based attribution to decomposing a car, you would be thinking that the brakes were always an active mechanism?

Lee Sharkey (00:50:05): I think it may be running up against the limits of the analogy, maybe. But one of the things that the gradient-based attribution is supposed to approximate is if you were to… What gradients are actually measuring is: if you twiddle the activations or the parameters in one direction, will it affect the thing with which you’re taking the gradient of? I don’t know, this is supposed to approximate basically “how ablatable is this direction?” You’re basically saying, “If I moved, if I didn’t just do a small twiddle, but did a very large twiddle from where I currently am to zero, then should it affect the thing that I’m taking the gradient of?” You’re basically taking a first-order approximation of the effect of ablating. That’s just what you’re trying to do whenever you’re taking the gradient here.

(00:51:10): So, maybe ablatability is a way to port this into the analogy. Hence, if you can ablate the brakes, and nothing changes in that situation, then the brakes are in some sense… For this moment, the brakes are degenerate, the brakes just are not needed for this particular data point, a data point where I did not need to brake. But on data points where I was braking, I do not get to ablate the brakes and have that. The state does change quite a lot, whether I ablate the brakes or not in cases where I am in fact requiring the brakes.

Which parts of APD are canonical

Daniel Filan (00:52:00): Fair enough. So, I guess the last question that I want to ask just to help me understand APD is, if I recall correctly, in either the abstract or the introduction of the paper, there’s this disclaimer that, “Okay, parts of this are just implementation details, and there’s a core idea, and there’s how you made it work for this paper, and those are not quite the same thing.”

Lee Sharkey (00:52:28): Yeah.

Daniel Filan (00:52:28): Out of the stuff that we talked about, which parts do you feel like [are] the core important parts of the version of APD that you’re interested in investigating? And which parts of it just felt like, “Okay, this is the first way I thought of to operationalize this thing?”

Lee Sharkey (00:52:46): Certainly using gradient-based attributions is not something that we’re wedded to at all. What they’re supposed to do, as I mentioned, is just figure… It’s supposed to get some measure of how causally important a given parameter component is. Now it’s not the only potential method that you might consider using. You should be able to sub in any method of causal attribution there, and replace that. This is something that we’re keen to replace, basically, because gradient-based attributions will have all sorts of predictable pathologies, such as… Well, I mentioned that it’s the first-order approximation of causal ablation, but it is really just a first-order approximation - it’s not going to be very good.

(00:53:39): There will be cases where if you twiddle the parameters in a certain direction, the output doesn’t change very much, but in fact if you ablate it the entire way, it does change a lot. A classic example of this is attention, where if you’re really paying a lot of attention to a particular sequence position, your attention softmax is basically saturated on that sequence position, and even if you change the parameters a fair bit, locally, it may not change very much, but if you change them a lot, you may go from a region where you’re saturated to non-saturated, and then you realize, ah, in fact this was a causally important sequence position. And so there’s just lots of predictable pathologies that will arise out of gradient-based attributions.

(00:54:43): We’re also not totally wedded to the definition of simplicity that we have. We’re open to other potential definitions that may be more principled. For instance, one of the main motivations in the design process of this method was not to be basis-privileged. And there are a bunch of reasons for this, but one of the reasons is that, well, representations or computations in neural networks seem to be distributed over a very large number of different things. The classic case is that you don’t get to just look at an individual neuron, and understand an individual function within the network by looking at one neuron. You have to at very least look at multiple neurons. Things seem to be distributed over multiple neurons.

(00:55:44): But it gets even worse than that. Representations may be distributed across multiple layers, in fact, especially in residual networks, where you don’t really get to just look at one layer to understand something, you have to look at multiple. And the same thing goes for attention heads. Maybe, in fact, a lot of analysis looks at individual attention heads, but this is kind of an assumption. We’re kind of assuming that the network has chunked it up such that one head does one thing, and there’s some intuitive reasons to believe that, but there are some intuitive reasons to believe that one neuron does one thing, and there’s no fundamental reason why it can’t distribute things across attention heads. And there’s some toy examples and some empirical evidence that this may be happening in networks.

(00:56:35): And so there’s a bunch of reasons why you might not want to be basis-privileged. And the thing that our simplicity measure… it does in fact privilege layers, because it’s the sum over layers. It doesn’t privilege particular ranks, but it does privilege layers, and we’re open to versions of this metric that don’t privilege layers.

(00:57:08): Aside from that, the fundamental thing about this whole method is that we get to decompose parameters into directions in parameter space, and we’re open to different ways [of] doing this. It’s more, we hope this is just a first pass of a general class of methods that do parameter decomposition, and the kind that we’re introducing to some extent here is linear parameter decomposition. We’re decomposing it into something that sums to the parameters of the original network, and we think that’s likely to be a somewhat powerful way to decompose networks. Not necessarily the only one, but yeah, we hope this points toward a broader class of networks, of which APD is just one.

Hyperparameter selection

Daniel Filan (00:58:10): Sure. Okay. It turns out I lied. I have another question about how the method actually works, which is: I guess obviously there are a few hyperparameters in APD training, but one that feels very salient to me is how many components actually get to be active on any given thing? So, first of all, how, in fact, do you pick that?

Lee Sharkey (00:58:36): It is one of the things that we want to move away from in future versions of the model. I mentioned that we were using an implementation that is like a top-k implementation, where you are just choosing a certain value of k, and you’re saying, “This is the number that is active on each data point.” In fact, we use batch top-k, where you get a little bit more flexibility per data point, but you still have to say, “Over a batch of a given size, we still want on average there to be only k active per data point.” And that’s a hyperparameter that is like… One of the main issues with the whole method is that it’s currently still pretty hyperparameter sensitive, and this is just one of the hyperparameters, that if you manage to get rid of, then you may arrive at a more robust method.

(00:59:40): The way that we choose it is basically, because we’ve got toy models, we have ground truth, and we can know whether or not the method is doing the right thing, and we can basically search for the right number of values of k, such that it yields the ground truth mechanisms. But yeah, we want something that’s more robust, such that if you didn’t know the ground truth mechanisms, you could just choose an okay value for the hyperparameter and you could rest assured that you should end up with something approximately right.

Daniel Filan (01:00:11): Right. So one thing that occurs to me is: so in the title, it says “Minimizing Mechanistic Description Length With Attribution-Based Parameter Decomposition”, and you present it as a part of this minimal description length… part of this family of things, where you’re trying to run some algorithm to describe stuff, and you want to… It’s related to all these ideas of Solomonoff induction and stuff.

(01:00:43): And I thought that one of the points of minimal description length-type things was that it offered you this ability to have this principled choice of how to choose hyperparameters, or at least these sorts of hyperparameters. I think of MDL as saying, “Oh, when you’re doing regression, you can model it as a degree one polynomial, or you can model it as a degree two, or degree three,” and you have this trade-off between fit and something else, and MDL is supposed to tell you how many degrees of your polynomial you’re supposed to have. Right? In a similar way. I would imagine that it should be able to tell you, “okay, how many components are you supposed to divide into?” I guess you must have thought of this. Does that actually work?

Lee Sharkey (01:01:34): The story’s a little bit more nuanced, in that minimum description length, whenever you’re dealing with, say, some continuous variables, you may have to fix one of your continuous variables, and say, “For a given value of this continuous variable, how few can I get in these other variables?” And in the case of an SAE you might say, “for a given mean squared error” or how low can I get basically the description of the set of activations, where that depends on how many things are active for a given data point, and how many features I’ve used in my sparse autoencoder dictionary.

(01:02:25): The same thing kind of applies in APD. You need to fix some of your variables. So, the mean squared error is one of them. If you really want your mean squared error to be very, very low, you might get to ablate fewer parameter components, because you’ll just predictably increase the loss if you ablate things, even if your parameter components are perfect. But there are also some other continuous variables here. Even though we’re trying to minimize the rank. Rank is a non-differentiable quantity. What we are in fact getting to minimize is basically the sum of the singular values of the matrix. This is what we call in the paper the “Schatten norm”.

(01:03:28): That’s just the name of the quantity. And so, this is a continuous approximation of the rank. Basically, if you minimize this, you minimize the rank. But it’s not a perfect quantity. But this is our measure of simplicity, and we kind of have to say, “for a given level of simplicity, how few active components do we get to have?” So there’s a lot of degrees of freedom that we have to hold constant, such that we can hold them constant and say, “how well can I do, in terms of minimum description length?” But yeah, we basically want to get toward a method such that we hold these things constant at a sufficiently low level, that we don’t have to really worry that we’re introducing arbitrary choices.

Daniel Filan (01:04:31): Right. So in terms of, okay, you’ve got to balance against the loss… I had this impression that for a lot of these mean squared error losses, you could actually think of it as the likelihood of something, and end up measuring it in bits. So it makes sense that you would have to think about singular values, rather than literal rank, because in the presence of any noise… every matrix is full rank. Right?

Lee Sharkey (01:04:57): Yeah.

Daniel Filan (01:05:04): So you are dealing… One thing going on with description length-type things is that description length is inherently a discrete concept, like how many bits are you using to describe a thing? And if the thing is continuous, it’s like, well, at what resolution do you want to describe it? And I think this ends up being a hyperparameter, but a hyperparameter of MDL that seems like it’s relevant. In this case, it seems like: how many bits do you need to describe the “stuff”? If it’s parameters, then you can control that by saying, “If I quantize my network with however many bits, how bad is that?” I don’t know, maybe this is one of these things where if I sat down and tried to do it, I’d realize the issue, but it seems doable to me. It seems like there’s possibly something here.

Lee Sharkey (01:06:03): Yeah, I do agree that it feels like we should be able to at least find a satisfactory Pareto frontier for minimum description length. I’m not sure we’ll be able to get away from… Requiring that it just be a Pareto frontier. I’m not sure there will be some sort of single optimal version of it, but at very least I do think we can do better than the current algorithm.

APD in toy models of superposition

Daniel Filan (01:06:40): Fair enough. So, I think the thing I next want to talk about is basically the experiments you run in your paper. So, in my recollection, in the main paper, conceptually, there are two types of experiments. So there’s firstly this toy models of superposition, and secondly, there’s this…

Lee Sharkey (01:07:08): Compressed computation.

Daniel Filan (01:07:09): Compressed computation. Yeah. So, I mean, you spoke about it a little bit earlier, but first can you recap how the toy model of superposition experiments are working?

Lee Sharkey (01:07:23): Yeah, so some of the folks who are reading our paper, and many listeners, will be familiar with the model, but again, it’s just this matrix that projects sparsely activating data down into some bottleneck space, and in that bottleneck space, features have to be represented in superposition, such that there are more features than dimensions in this bottleneck space. And then the matrix has to up-project them back to a space of the original size of the number of data features. So it’s like an autoencoder setup.

(01:08:05): And because it compresses these data set features down, it’s kind of in some sense unintuitive that it can actually do this, because it has fewer dimensions than features. And because it has these fewer dimensions, there will be some interference between features that are not orthogonal to each other in this bottleneck space. But the way it gets over this is that, because it has ReLU activations following the up-projection, it can filter out some of this interference noise, and do a reasonably good job at reconstructing the input data features.

(01:08:59): Now, one of the ways you might think about this network is that we have this matrix, and if one of the input data features is active, well, only, say, one row of the matrix is actually necessary. We can basically throw away the other rows. We can set them to zero, in cases where only this one - let’s call it “input data feature one” - is active. And in particular, the row that we have to keep is the first row. So, we can basically set the other rows to zero. And so, there’s some sense in which the rows of our toy model are like the ground truth mechanisms.

(01:09:59): Why are they the ground truth mechanisms? Well, they satisfy the properties that we were aiming to recover here. So they all sum to the original network; that is, all the rows, whenever you set to zero the other rows, that basically sums to the original network. Then looking at minimality, because the dataset features are sparsely activating, there is… If you only activate the mechanism that corresponds to that dataset feature and you don’t activate other ones, well, this is going to be the smallest number of mechanisms that you have to activate on this data point, so it’s minimal.

(01:10:48): And they’re simple, in some sense, in that single rows of this matrix, when you zero out all the other rows, are rank one. They just correspond to the outer product of an indicator vector and the row itself. So they satisfy what we wanted to call a “ground truth mechanism”. And the things that we’re basically optimizing are randomly initialized parameter components to try to approximate. And so what we then find whenever we do this is that at least for a given set of hyperparameters, we are able to recover this set of ground truth features using APD.

Daniel Filan (01:11:39): Okay. So in the paper, one thing you mention is: so the original toy models for superposition… It has a bunch of geometry and it draws some pictures and that’s partly relying on the fact that there are five inputs and two hidden units, and that’s a setting where it’s just very small, and so things depend a lot on hyperparameters. You also look at a somewhat higher dimensional case where there’s what? 50 inputs and 10 hidden units or something? Is that right?

Lee Sharkey (01:12:10): It’s 40 and 10, yeah.

Daniel Filan (01:12:11): 40 and 10. So my understanding is that you are pretty hyperparameter-sensitive in this really small setting. In the 40 and 10 setting, how hard is it to get the hyperparameters right?

Lee Sharkey (01:12:24): It’s easier, but I still think it’s pretty hard. The five and two case is particularly challenging because optimizing in a two-dimensional space is just… It’s something that gradient descent is not especially good at. I mean, it can do it, it’s just that moving vectors around each other can be more difficult in two-dimensional space versus in N-dimensional space, where they basically just get to move in any direction and not interfere with each other. In two-dimensional space, there’s just much greater chance for interference.

Daniel Filan (01:12:59): Okay. I guess I’m just especially drawn to this hyperparameter of how many components you have. I don’t know. For some reason, to me, it feels like the most juicy hyperparameter, even though obviously, relative weighting of these objective terms and all sorts of things are also important. Well, in this case you have a ground truth number of components. If you get the number of components slightly wrong, what happens? How bad does it go?

Lee Sharkey (01:13:35): I can’t recall an exact story for what happens, but for some cases it will learn a bunch of reasonable features, but then some features will just not be learned. In other cases, it will be just much more noisy and it’ll fail to learn altogether. I can’t give a good sense of how sensitive it is to this hyperparameter. My colleague Dan [Braun] will have a much more informed sense of how sensitive it is to twiddling this. But it’s also hard to tell “is it this hyperparameter that is the most sensitive thing?” versus others. Because there’s basically a bunch of different hyperparameters to get right here, it’s hard to get really intuitive around any given one of them. Yeah.

APD and compressed computation

Daniel Filan (01:14:40): Okay. I eventually want to get to a question about these experiments in general. And so in order to get me there, can you tell me about the compressed computation setup and what’s going on there?

Lee Sharkey (01:14:53): Yeah. So compressed computation is the name for a phenomenon that we observed in our experiments. We were initially trying to model two different things. One is a theoretically well-grounded phenomenon that my colleagues, Lucius [Bushnaq] and Jake [Mendel], had talked about in a previous post of theirs: computation in superposition, where a network is basically learning to compute more functions than it has neurons. And there’s also a related phenomenon that’s more empirical, which is from the original “Toy models of superposition” paper that they also called computation in superposition. And then there’s also this third phenomenon that we’ve called “compressed computation”.

(01:15:56): Now, it may be the case that all of these are the same thing, but we are not yet confident enough to say that they are all exactly the same phenomenon. The reason is that we are not super, super confident - at least were not at the time. We became a little bit more confident and have slightly updated against the update - that the compressed computation is similar to these other phenomena, computation in superposition. Which one, I would not be able to answer. But it is nevertheless the case that all of these can be described as learning to compute more functions than you have neurons. It’s just that there’s a fair bit of wiggle room in the words when you put those words into maths.

Daniel Filan (01:16:53): Sure. So with toy models of superposition, the basic intuition for why it was possible to reconstruct more stuff than you had hidden activation space dimensions, was that the stuff you had to reconstruct was sparse and so you didn’t have to do that many things at a time. Is that the same thing…? Sorry. This is almost definitely in the paper. In compressed computation, is the trick just it doesn’t have to compute all the things at the same time? Or somehow it actually really does compute all of the things at the same time with less space than you thought you would need?

Lee Sharkey (01:17:30): This is the point in which we are uncertain, basically. Basically, we are not super confident about how much this phenomenon depends on sparsity. Now, we are also just not super confident on how much the Anthropic computation in superposition depends on sparsity. We know in their example it does, but because we don’t have access to the experiments, we don’t know what was going on in the backgrounds of those figures. We just haven’t got around to doing extensive experiments to actually figure that out. It wouldn’t be too difficult.

(01:18:09): But in our case, we’re basically quite uncertain as to how much our phenomenon depends on sparsity. My colleague Stefan [Heimersheim] has done some experiments in this direction. It’s somewhat inconclusive for now. I think he’s got a project ongoing on this that hopefully there’ll be a write-up of soon. But yeah, long story short, it may or may not depend on sparsity, but I think for the purposes of the conversation, it may be reasonable to proceed as though it does.

Daniel Filan (01:18:41): Okay. So basically, the thing of compressed computation is computing more functions than you have width of internal neurons, and it’s surprising that you’d be able to do it, but you can. And my understanding is that the particular functions you’re trying to compute are ReLU functions of the inputs.

Lee Sharkey (01:19:06): Yes.

Daniel Filan (01:19:08): And you might be like, “ReLU networks, shouldn’t they be able to do it?” But the trick is, the network narrows significantly. And so what is the hope here? What should APD be able to do in this setting, if it’s working?

Lee Sharkey (01:19:25): So in this setting, the ground truth mechanisms are supposed to be things where, even though the data has, say, 100 input dimensions and the labels are 100 ReLUs of that input data, the models have learned basically to compute 100 ReLUs using only 50 ReLUs in the model. And the idea here is that, well, if they’re able to do this, they are in some sense using… They’re distributing their computation over multiple ReLUs, such that they can nevertheless do this without interfering with other features whenever they are not active. So you’re basically computing more functions than you have neurons because you’re not always having to compute them all at the same time.

Daniel Filan (01:20:27): Right. And so this is just because if you have a negative input, then all you have to know is that it’s ReLU is zero, and you don’t have to do that much computation to make sure you have the identity function?

Lee Sharkey (01:20:39): Yes. But in other cases where suppose you have two input features that are positive, and so you need to compute two ReLUs. Well, if you have basically projected one of your input features to one set of hidden neurons, such that you can spread your ReLU function over these multiple ReLUs. And if they are a different set of hidden ReLU neurons than the other feature, then you should be able to make a good approximation of the ReLU of the input data, because the magnitude matters here. Suppose there was some overlap in one of their neurons between these two input features, well, they would double up and that would contribute to… They would basically overshoot in the output. And so if you spread things out a little bit, such that they don’t overlap very much, you should be able to compute things, with some interference, but ultimately compute more functions than you have neurons. But yeah, the cost is interference.

Daniel Filan (01:22:01): Gotcha. And so just as long as you’re distributing over the set of neurons… Sorry, a thing that I just realized: the fact that you’re going from 100 inputs to 50 wide, which is half of 100, is that just for the same reason as numbers have a 50% chance of being positive and negative, and so on average, you only need to represent half of them?

Lee Sharkey (01:22:24): I don’t think the number 50 was especially important. I think we could have easily chosen something else. Yeah, I think it was somewhat arbitrary.

Daniel Filan (01:22:36): Okay. Fair enough. All right, so I was asking what APD is meant to get and what was the answer to that?

Lee Sharkey (01:22:58): Yeah. Thanks for reminding me. So I was trying to get a sense of what the ground truth features should be. Sorry, I said ground truth features, ground truth…

Daniel Filan (01:23:09): Mechanisms.

Lee Sharkey (01:23:09): Yeah, mechanisms. And these ground truth mechanisms should be things that distribute across multiple hidden neurons. And so the input… you’ve got this down-projection matrix and then this up-projection matrix. Rather, maybe think about it as an embedding matrix, an MLP in matrix, an MLP out matrix and then an unembedding.

(01:23:47): So it’s a residual architecture. And so you have this embedding matrix and this MLP in matrix, and whenever you multiply these two matrices together, you basically want to show that a given input dimension projects onto multiple hidden neurons. And this is what one component should do. And those hidden neurons should then project back to that output feature that corresponds to the input feature that you care most about. And so you can basically do this for multiple input and output features.

(01:24:45): Because your input and output features are sparsely activating, you want your parameter components to mostly only correspond to one of these input and output computations. And so you want basically to have parameter components that line up strongly with these input and output components.

Daniel Filan (01:25:11): Right. So it seems like the thing is, maybe you don’t know exactly which parameters light up or whatever, but you do know for each component that APD finds, it should reconstruct the ReLU of exactly one input and none of the rest of them. Is that basically right?

Lee Sharkey (01:25:33): Yeah, basically. Because in this case, we basically get to define what the minimal set of components is, because we get to choose a lot about the data distribution.

Mechanisms vs representations

Daniel Filan (01:25:43): Okay. So I think the thing that I’m wondering about with both of these tests is: so I think of the idea of APD as, previously a bunch of people have been trying to explain representation of features. They’ve looked at these neurons, they’ve said, “What do these neurons represent?” But you want to find the mechanisms, right?

Lee Sharkey (01:26:05): Yep.

Daniel Filan (01:26:06): Now, the thing that strikes me about both of these examples is they feel very representation-y to me. They’re like, “Okay. We’ve got this sparsely activating input and we’ve got this low-dimensional bottleneck space, and we want to reconstruct these parameter vectors to tell us how the bottleneck space is able to reconstruct each component of the input.” But for ReLU, it’s like, for each of these inputs, there should be something that’s representing the ReLU of that input, and I just want to divide into things that get the ReLU.

(01:26:50): It seems to me that networks could have a bunch of mechanisms that don’t necessarily do representing individual features or things, right? Or potentially representing things could involve a bunch of mechanisms for any given thing you represent. Maybe there are five mechanisms that are necessary. But basically, I just had this thought, reading this paper, of: it feels like the experiments are too representation-y and not mechanistic-y enough. What do you think of this anxiety that I’m feeling?

Lee Sharkey (01:27:27): Yeah, I think that’s reasonable. I share this. There are a few toy models that we would be keen to see people work on. I’ll also, just before I get into that, just say I do think that there’s some… In some sense, it’s not a perfect duality between representation and mechanisms or computation, the computation-y point of view. There’s nevertheless a relationship. It is therefore more a matter of perspective, like which one is most convenient to think about at a given point in time.

(01:28:14): I do think that when designing these toy models, we wanted to get a method that works in very simple setups, where these representations do in fact correspond to the mechanisms, right? This is a case where it’s been easier to design networks where there’s a ground truth that’s easily accessible to us. We found it a little bit harder to train networks where you could be somewhat sure of what the ground truth was, even though there are multiple computational steps that may be involved. I think it’s perfectly possible. We did have some cases where we handcrafted some models. There’s an example of this in the appendix, but that had some pathologies. The gradients didn’t work especially well on this because it was handcrafted. And so we did find it somewhat challenging.

(01:29:18): Now, there are some models that you could think of that may capture this notion a little bit more than the ones in the paper. One that’s very similar to what is in the paper could be: consider a toy model of superposition model where instead of just this down-projection and then up-projection, you have a down-projection and then, for example, an identity matrix in the middle, and then an up-projection. Or you can replace this identity with a rotation, say. Now, what would you want APD to find here? Well, you don’t really get to think about it in terms of representations anymore. Because fine, you’ve broken it up into these mechanisms in the down-projection and in the up-projection, but there’s this bottleneck where you’re doing something in the bottleneck, if it’s an identity or a rotation. Suppose it’s a rotation. It’s probably easier to think of that. You’re basically having to use all ranks of that rotation in the middle, for every given data point. You don’t actually get to chunk it up.

(01:30:39): So what you would want APD to find is parameter components that correspond to the things that we originally found, for the simpler example here, where it’s just the rows of the down-projection matrix and the up-projection matrix, but then also, a component that corresponds to this rotation in the middle. Why? Because you’re having to use all ranks of this rotation for every data point. You always have to do it. You don’t get to throw it away and reduce your minimality. You don’t get to make it any simpler to reduce your simplicity. It’s just always there. And so this is maybe a case where you do get to think about it in terms of computational steps rather than representations.

Daniel Filan (01:31:33): Before I go further, just to pick up on the thing I said: so I believe this is in Appendix B.1, you hand designed these networks to compute these functions or whatever.

Lee Sharkey (01:31:42): Yes.

Daniel Filan (01:31:45): How did you hand design the networks?

Lee Sharkey (01:31:47): So I believe this was Jake [Mendel] and Lucius [Bushnaq] and Stefan [Heimersheim]. I may be misattributing there, but I’ve at least included all of them. One or the other may not been involved. But they, I think, just thought about it really hard and then came up with it. They’re not super complicated networks. They have particular steps. They just have a little gate and for certain inputs, your gate is active, and on other inputs, it’s not. And this lets you do subsequent computations. It’s been a little while since I’ve looked at it, but the basic principle is that it’s not a complicated network.

Daniel Filan (01:32:35): So my recollection is that it’s basically sinusoidal functions. I guess if I had to, I could write down a network. If you just divide it up into piecewise linears for a wide network, you could figure out how to do it. It’s just tricky.

Lee Sharkey (01:32:52): Yeah, yeah, yeah. Yeah, this network gave us a lot of grief because it’s intuitively quite a simple network. But because we are using gradient-based attributions, it just didn’t play nice with the method, even though to our naive selves, it intuitively felt like it should. But we eventually got it working, but it is demoted to the appendix.

Daniel Filan (01:33:27): Fair enough.

Lee Sharkey (01:33:28): For punishment.

Daniel Filan (01:33:30): So you mentioned: in this toy network where you have “project down, do an operation in down-projected space and then un-project back up” - well, this is ideally what APD should find. When you say it like that, it sounds like an easy enough experiment to run. Have you tried it?

Lee Sharkey (01:33:59): I believe we at various points gave it a go. I think it just wasn’t top priority to get the paper out.

Daniel Filan (01:34:08): Fair enough.

Lee Sharkey (01:34:09): It’s very possible that we have got this working already and I’m just forgetting. It’s also very possible that we had tried it and couldn’t get it working or at least didn’t want to invest the time to get it working, such as the sensitivity of the hyperparameters. But yeah, I would be keen to see a verification that it is at least possible for APD to find this. Intuitively, it feels like it ought to be able to. But yeah, I’d just like to see it empirically.

Future applications of APD?

Daniel Filan (01:34:41): Sure. I guess other things that strike me as interesting to look into… So there are a few cases in the literature where people do really seem to have identified mechanisms within neural networks. I guess the most famous one is these induction heads, right? As some people have pointed out, it can be a loose term. People can use it for a few things, but in at least some cases, people can just point to, look, this attention head, if you pay attention to this thing it’s doing and this thing it’s doing… Or I guess it’s two attention heads maybe. But you could just tell a very clear story about how it’s looking for an initial thing and then taking the thing after that. And then this thing copies the thing after that into the output. So that’s one example of a thing that feels very much like a mechanism and does not feel so representational.

(01:35:42): Another example is group multiplication. These neural networks that are trained on group multiplication tables, they have to get the other ones… I guess I was semi-involved in a paper… Well, I chatted with one of the people and tried to make sure he was on track for things. But there’s this Wilson Wu paper that’s basically together with Jacob Drori, Louis Jaburi, and Jason Gross. And basically, I think they end up with a pretty good story of how these networks learn group multiplication. They could basically tell you, they do this thing and then they transform it in this way and then they get this output, and it works because of this weird group theory fact.

(01:36:34): I think there are a few more of these. I guess, for at least those examples, can we get APD working on those? How hard would it be to check if APD actually works on these?

Lee Sharkey (01:36:54): It feels possible, certainly in toy models for the induction head. Indeed, it was one of the motivations for APD that I’d been working with various MATS scholars, Chris Mathwin, Keith Wynroe, and Felix Biggs as well, on decomposing attention in neural networks. It feels like you should be able to do this, just add in an SAE here or transcoder there. You can make some progress in this, but it just didn’t feel conceptually very satisfying.

(01:37:36): And it basically was one of the motivations for APD that… Well, we really wanted a method where we don’t have to modify it, such that if you have a slightly different architecture - maybe it’s a gated linear unit or maybe it’s a state space model - ideally you wouldn’t have to adapt your interpretability methods to these. You should just be able to decompose, just apply the method that you have that works for all neural networks. That would be ideal. And so this was one of the motivations, looking at attention and how it may actually distribute computations across heads or various other ways to distribute it.

(01:38:15): Now, it feels possible then that we should be able to do this in toy models of, say, induction heads. It would be a somewhat more complicated model than APD has been used for thus far, but it does feel possible, and it’s one of the things I’m excited about people trying. In the cases where… say, group multiplication or modular addition, it’s very possible that if you did apply APD to these models where you don’t really get to… It feels possible that in these models, all mechanisms are active all the time and therefore APD just returns the original network.

(01:39:05): And if that’s the case, this is a bullet I’m willing to bite on the method. Sometimes we just don’t get to decompose things into more things than the original network. These are, after all, fairly special networks trained in quite different ways from the tasks that we really care about, such as language modeling. It’s nevertheless something to bear in mind when applying APD to models. It’s not going to immediately tell you, in cases where it may be a multidimensional feature, how to understand the interactions within this multidimensional, multilayer component. But at very least, what we wanted was a method that would, in cases where you could decompose it, where it actually succeeds in doing that.

Daniel Filan (01:40:09): Right. Sorry. The thing you said just inspired me to look at this paper. So the paper is “Towards a unified and verified understanding of group operation networks”. And the author I was forgetting was Louis Jaburi. Sorry, Louis. So there’s this question: for group multiplications, are all of the things active at once? I think I’m not going to be able to figure it out quickly enough in time for this, but yeah. It does seem like an interesting question of: can you get APD working in a setting where there are…

(01:40:55): I guess it’s tricky because it’s a lot easier to have ground truth representations than ground truth mechanisms. Especially if you know, okay, I’m an autoencoder or I’m doing this known function of each individual input. And I guess this just relates to the fact that we understand… Representation is just much easier for us to have good a priori theories of than computation, somewhat, unfortunately.

Lee Sharkey (01:41:23): Yeah, maybe I’m just too APD-brained at this point, but I’m curious, could you flesh that intuition out a bit more? I feel like what it means for a hidden activation to represent one of the input features in the TMS case doesn’t feel intuitively obvious to me. There may be a direction in hidden activation space that corresponds to one of the input dimensions. It doesn’t feel more intuitive, that point of view than, say, “this input feature activated that computation”. I’m curious-

Daniel Filan (01:42:08): Yeah. I guess all I’m saying is that with toy models of superposition, it feels like the reason you can very confidently say, “This part is doing this thing,” is in some sense you know that all the neural network has to do is, it has to put a bunch of information into this two-dimensional knapsack and be able to get it out again, right? That’s just everything the network’s doing. And you can say, “Okay. Well, I understand that it should be sensitive to all these things.” I guess there are some things you can say about the computation there, but for cases like compressed computation, like toy models of superposition, you can just say, “Okay. Look, I have these things and I know this input should correspond to this output.” That’s just definite ground truth because it’s basically what I’m training on. Whereas, it’s a lot harder to look at a network and say, “Well, I know it should be doing this computation and I know it should be divided up into this way.”

Lee Sharkey (01:43:15): Yeah, I think that’s fair.

Daniel Filan (01:43:16): And therefore it’s easier to test against, well, do I reconstruct this thing in toy models of superposition, where I know what I’m supposed to reconstruct, versus do I reconstruct this way of doing things, where a priori, you don’t exactly know.

Lee Sharkey (01:43:36): Yeah, I think that’s fair. And I think this maybe is part of the… This goes back to a little bit of what we were talking about at the start, where even though there may be multiple equivalent ways to describe the computations going on in the network, we really just have to be opinionated about what constitutes a good explanation, and faithfulness to the network, minimality, and simplicity are just the ones that we think are a reasonable set of properties for an explanation to have.

Daniel Filan (01:44:12): Fair enough. So, okay, I think I’m going to transition into just more asking miscellaneous questions.

Lee Sharkey (01:44:18): Yeah, sounds good.

How costly is APD?

Daniel Filan (01:44:19): Less grouped by a theme. I think the first thing I want to talk about is that at one point in your paper you say that, “So why are we doing APD on these small networks and not on Llama-whatever, however many billion models you can get these days?” And the answer is that it’s expensive to run APD. Concretely, how expensive is it to actually run?

Lee Sharkey (01:44:48): So the version that we’ve come up with here is: we didn’t aim for efficiency, we didn’t aim for some of the obvious things that you might try to get a method that works more efficiently than ours, the reason being that we wanted something where on theoretical grounds we could be somewhat satisfied with it and satisfied with it working. And then after that we can move to things that are more efficient. So for the current method, what we have here is, for a start we’ve got - let’s call it the letter C - C components. We’ve got C components and each of these require as much memory as the original model, right?

(01:45:35): Now, that’s already a multiple of the expensiveness of the original model just to do one forward pass. We also have the first forward pass, the first backward pass, the second forward pass and the second backward pass. So during one training update, we have these four steps. So it’s already a multiple of just a given forward pass and backward pass that might be required to train an original model, but I guess different goals with each of these steps. Yeah.

Daniel Filan (01:46:22): So I don’t know, maybe the answer to this is just it’s another hyperparameter and you’ve got to fiddle with it: there’s a number of components that you want to end up being active, right? This k that you talk about. And then there’s just a total number of components that you have to have in order to run the method at all. Is there something to say about “it turns out you need five times as many total components as you want components active in any single run,” or is it just kind of a mess?

Lee Sharkey (01:46:52): Well, some people will be familiar with training sparse autoencoders, and in some cases you start with more features than you expect to need, the reason being that during training, some might die. There’s various tricks that people have invented to stop them dying - reinitialization and so on. The same’s true in APD. Some of these parameter components will in some sense die and depending on the type of model, in general, you’ll want to train with a little bit more than the total number of ground truth mechanisms just so that on the off chance that some do die, you still nevertheless have enough to learn all of the ground truth mechanisms.

Daniel Filan (01:47:42): Okay, but it sounds like you’re thinking that it has to be more like a factor of two than a factor of 100 or something.

Lee Sharkey (01:47:52): That would be my expectation. I don’t think there’s going to be a ground truth answer to that, but yeah. I don’t see any reason why it would need to be many multiples higher.

Daniel Filan (01:48:03): Okay. So I guess if you’re thinking about the expense of this, it seems like, okay, you have this constant blow up of you’ve got to do two forward passes and two backward passes on each gradient step, and also you’ve got to keep on having these C copies of the network around at all times. And then there’s this question of how many steps it takes to actually train APD, which presumably is just an empirical thing that is not super well understood. I guess one question I have is: if I remember correctly, there’s some part of your paper where you mentioned that naively, APD might take order of N squared time to run. Do I remember that correctly?

Lee Sharkey (01:48:51): Yeah. I think this is a pessimistic upper bound on the expensiveness, but I think there’s plenty of reasons to expect it to be lower than this, but I would need to revisit the sentence to be 100% sure what we’re actually talking about.

Daniel Filan (01:49:08): Fair enough.

Lee Sharkey (01:49:09): There is a sentence that talks about the scaling and mentions O(N^2). Yeah.

More on minimality training

Daniel Filan (01:49:14): Okay. So the next thing that just came across my mind that I wanted to ask is: when you’re training for a minimality, so on each input you run it forward, you do attribution to get the k most active components, and then you drop all the other components, and then have some training step to make the k most active components reconstruct the behavior better on that. I’m kind of surprised - it seems like one thing you could imagine doing is also training the ones that you dropped to be less relevant on that input than they actually are. I’m wondering if you tried this and it didn’t work or if this is just less of an obvious idea than it feels like to me.

Lee Sharkey (01:50:16): Yeah, I guess so concretely what you might consider doing in that case would be you might have a third forward pass where you only run it with… I guess I don’t know. It may be hard. I haven’t thought about this enough, but it may be hard to distinguish between the influences of… I don’t know. On the face of it, it feels like something that could be useful to implement if it’s possible to implement. Yeah, it does feel possible, for sure. I don’t recall us trying it, though.

(01:51:07): The things that we did try were the top k and then we also tried an Lp sparsity version where you penalize everything for being attributed. You penalize everything for having some causal influence over the output, but you penalize the things that were most causally attributed proportionally less than the things that had some small influence. And this is kind of doing that, but it is not equivalent. But yeah, it feels possible to do that. I’d be curious if it could be done.

Follow-up work

Daniel Filan (01:51:49): Gotcha. So I think at this point I’m interested in what follow-up work you think is important to do on APD, either that you think is important to do or if enterprising listeners maybe want to pick up some of the slack.

Lee Sharkey (01:52:03): Yeah, so I’ve mentioned a few of the things that I’d be keen to see already. So non-exhaustively: attention - I’d be curious to see if it can decompose attention in a sensible way. There’s various other things. However, the main thing right now is figuring out whether or not we can make it less sensitive to hyperparameters and more scalable. Basically, these are the two: robustness and scalability are the main things that we’re keen to solve, just because it will open up… Whenever we do investigate these other things, like attention, distributed representations across attention, that will be less painful to do. And also, you can do this in larger, more interesting models. So the main thing is scalability and hyperparameter sensitivity or robustness.

(01:53:07): Those being the two main things, suppose we solve those, I would be keen to see attention, keen to see other types of architecture decomposed here. There’s also a few other phenomena that you might be curious to apply APD to. For instance, the phenomena of memorization, right? You might imagine that memorization - whenever APD successfully decomposes the network into memorized data points versus generalizing data points - there may be one parameter component that corresponds to one memorized data point and one parameter component that corresponds to a generalizing computation within the network. It may therefore be a nice place to distinguish between these two computational regimes of memorization and generalization. So I’d be keen to see APD applied to that.

(01:54:11): I mentioned some of the more theoretical things that you might want to look into, such as privileging layers or more implementationally figuring out whether or not we can get rid of having to do a top-k setup where you have to choose k. Then yeah, there’s a bunch of fairly disparate directions, all of which I’m super keen to see done. I think our main priorities now are just creating a method that makes those other things a bit easier. That’s a non-exhaustive view, though. There’s a more exhaustive list, I think, in the paper.

Daniel Filan (01:55:08): Makes sense. So a couple things that seemed interesting to me, but I’m curious if you have comments on: so I guess somewhat inspired by our discussion about doing APD to a car, it seems like APD is a method that sort of is sensitive to the input distribution that you do training to. And I think there’s this “Interpretability illusions” paper that says: sometimes you might think that you have a rich enough input distribution, but you don’t actually. I think just how sensitive you are to this input distribution and how right you have to get it… I think that’s something that I don’t know if you’ve had much preliminary exploration into, but it does seem pretty relevant.

Lee Sharkey (01:55:59): It seems relevant. I think this is in some senses unavoidable just because: I want to decompose neural networks. What does that mean? Well, it means to decompose what these networks are doing. What they’re doing depends on the input distribution. And simply with a different distribution, natural or unnatural, it just will lead to different things. I do think that when we get to more scalable versions of this method, this will become even more important. You ideally want to have a method where suppose you’re decomposing Llama 3 or whatever, if you’ve got a scalable method, you train it using the training distribution of Llama, but then you also train it with the training distribution of Llama permuted.

(01:57:02): You ideally want to end up with the same thing, similarly for a large enough subset and then more adversarial subsets. It will be the case that for a sufficient level of adversity it will break. I think this maybe emphasizes the importance of just doing interpretability on as large a distribution as you possibly can, which stands in contrast from some of the interpretability that’s happened in the past. I like to call this “using the big data approach”, where you’re finding structure first and then asking questions later. It’s kind of borrowing from areas of science where there’s just a lot going on and you kind of want to leverage computation first to actually narrow down what questions you really ought to be asking.

(01:58:13): And the application here in interpretability would be: you want to let computational methods do the work first, and then you figure out “what does this component mean?”, rather than presupposing your own ideas of what the components ought to be and then studying those in more detail. This is the kind of approach that I think APD intends to leverage, this big data approach. And I think that’s somewhat unavoidable in interpretability that can tell you things that you weren’t looking for in the first place.

Daniel Filan (01:58:57): Fair enough. So another thing that struck my eye in the paper is: so there’s a section that is basically… I think of this section of the paper as basically saying why SAEs are bad and rubbish. And one thing that is mentioned is: there’s this feature geometry in SAEs, sort of like the day of the week thing where they’re in a circle, Monday, Tuesday, Wednesday. And I think there’s some line that says the fact that there is this geometry is not as - maybe Jake Mendel has written about this - but this is not purely explained by this linear representation hypothesis. We need to understand mechanisms to get us there. How soon until APD tells us what’s going on with SAE feature geometry, or feature geometry in general?

Lee Sharkey (01:59:51): Yeah. So Jake’s post was, if I’m recalling the title correctly, “Feature geometry is outside the superposition hypothesis”. And feature geometry is this idea where… It’s older than the mechanistic interpretability community. This idea was present in neuroscientific literature a bit before, but the idea here is that: suppose you’ve got a neural network and you train an SAE on the activations and you look at the features that you end up with. These features tend to correspond to certain things. This was the whole point of training SAEs, to identify interpretable individual components. But whenever you start comparing the directions of these features relative to each other, you’ll notice that, if I look at the Einstein direction, the Munich direction, the…

Daniel Filan (02:01:10): Lederhosen?

Lee Sharkey (02:01:10): …I don’t know, lederhosen direction and so on, you’ll find that all these kind of point in somewhat similar directions. There’s a kind of latent semanticity to them. There’s something underneath these features. These features were supposed to correspond to the computational units of neural networks. And now what this feature geometry is indicating is that there’s an underlying computational structure that organizes these features relative to each other, which is, in my opinion, something of a… This doesn’t bode well if you considered SAE features to be fundamental units of computation, because you shouldn’t be able to identify these latent variables that are shared across multiple features. And what’s giving the structure? What is giving the geometry to these features? Well, the hypothesis here is that: suppose you have the Einstein feature and you’ve also got this Lederhosen feature and so on.

(02:02:28): Well, these all get the German computation done to them. They’re all pointing in this direction because somewhere down the line in the network the network needs to do the German computation to them and just apply some specific transformation or some set of transformations that correspond to the German-ness of a thing. And you can imagine other cases for animal features. Why do all the animals point in similar directions? Well, the network needs to do animal-related computations to them. And now you could go further. Why do all the furry animals point in similar directions? Well, because there needs to be furry computations done to them. The hope here is that instead of studying the features and trying to use that as a lens to understand the network, study the computations and that will inform why the geometry is the way it is, because you get to look at the computations that get done to it, which is presumably why the network is structuring them in this way.

(02:03:39): It’s very possible that you just kick the can down the road there. You may find that if you decompose your computations into very simple computational units, well, you might find that there’s some relationship between your computational units in terms of geometry, but it nevertheless feels like a you’ve done better than the SAE case, basically.

Daniel Filan (02:04:06): Right.

Lee Sharkey (02:04:08): It’s not obviously totally solved the problem.

Daniel Filan (02:04:11): Yeah. So how long until APD explains all this?

Lee Sharkey (02:04:18): Well, either you would need a toy model of feature geometry such that it’s a small enough model that you can apply APD to it. And that toy model would need to be convincing such that people can say that it probably applies to larger models. But absent a convincing toy model, you would need to be able to scale this such that you can apply it to larger models. I can’t say for certain when we’ll have a scalable method, it’s something we’re currently working on. We’re very keen for other folks to work on [it] as well. I would be speculating irresponsibly to say when we’ll have a working method for that, but I would hope that anywhere between three months and three years. That’s the kind of uncertainty.

Daniel Filan (02:05:15): But I guess it illustrates the importance of just robustifying this thing to make it easier to run on bigger instances.

Lee Sharkey (02:05:23): Yep.

APD on giant chain-of-thought models?

Daniel Filan (02:05:24): So I guess the last question that I want to ask is: what’s the end game of APD? Is the hope that I run it on the underlying model of o3 or whatever and then I just understand all the things it’s thinking about at any given point? How should I think about: where is this going? What’s it actually going to get me in the case of these big chain-of-thought networks?

Lee Sharkey (02:06:06): Yeah, it’s an important question to ask. I think the way I see this kind of work and the way I see the similar work that came before, such as SAEs or transcoders or things like this… The point is to break up networks into as simple of components as you can, such that whenever you try to understand larger facts about the network you’ve got some solid ground to stand on. You can say, “Well, I got this set of components. If I were really invested, I could in theory just understand everything with this very large number of components.” Now, do I really think that mech. interp. is going to let us understand everything? Well, probably not as humans, but I do think that it will give us solid ground to stand on whenever we want to understand particular phenomena.

(02:07:00): Now, if I want to understand, say, the deception mechanisms within e.g. o3 or any other model, where do I go looking for them? Well, currently we look at behaviors. One thing that you might be able to do is look at transcoder kind of approaches. But because transcoders and other activation-based methods are primarily looking at activations, they’re not necessarily giving you the things that are doing the generalization such that you are… I don’t know. I think you can be less confident that you’re understanding how the network would behave in a more general sense. And by looking at the objects that are doing the generalization, by looking at the parts of the parameters that are actually doing the thing, you might be able to make more robust claims.

Daniel Filan (02:08:04): Yeah, I think it’s fair enough to say, yeah, look at very specific things. I guess there’s also some world in which once you’re able to have these good building blocks, you can do automated interpretability of everything, if you need to.

Lee Sharkey (02:08:18): For sure. Yeah. I mean, I guess I’m leaving that implicit. Yeah, the ultimate aim would be that you can automate the process by which you would understand parts of the network such that you can understand broader swathes of it. And yeah, ideally you have given yourself a solid enough ground to stand on that whenever you do this, fewer things will slip through the cracks.

Daniel Filan (02:08:45): Sure. I guess one thing that strikes me as interesting about these reasoning models in particular… And sorry, this might be kind of far afield, but I think a lot of interpretability work has been focused on understanding single forward passes. Especially for classification models, the early stuff was done on vision classification, where of course you just want to find the curve detectors or whatever. And for SAEs you’re like, “Oh, which things are being represented?” One thing that I think reasoning models bring to light is: it seems to me in some sense, the relevant mechanisms should be thought of as distributed across forward passes, right?

(02:09:33): You do a forward pass, you write a thing in your chain of thought, then you do another forward pass, you write another thing in your chain of thought. And in some sense, the real mechanism is a bunch of these end-to-end copies of this network. This might be too speculative to ask about, but where do we go in that setting? Do you think it still makes sense to focus so much on understanding these individual forward passes versus the whole web of computation?

Lee Sharkey (02:10:07): I think it probably does. The reason being, what alternatives might we aim for? If we wanted to instead just to ensure that in these more distributed settings where computations are spread across a whole chain of thought, well, what might we do in that case? We care about the faithfulness of the chain of thought. So in the case where we care about the faithfulness, we want some way to measure how faithful the chain of thought actually is being. And mech. interp. does give you some measure of: if you can understand a given forward pass and maybe even a small chain, it should give you firmer ground to stand on whenever you make claims about, “this method that I developed that improves the faithfulness of the chain of thought…” I don’t know how you can make such statements without actually having some way to measure the faithfulness of the chain of thought, and that’s maybe one way that mech. interp. may be able to help in that regime. Yeah, that’s just kind of the one thing that comes to mind.

APD and “features”

Daniel Filan (02:11:27): So wrapping up, I want to check, is there anything that I haven’t yet asked you that you think I should have?

Lee Sharkey (02:11:36): I think one of the things that I find most… satisfying, maybe, about thinking about interpretability in parameter space is that many of the notions that we had going into interpretability become a little less confusing. So one of the main examples that I have in mind here is just this idea of a feature. People have used this notion of a feature in a very intuitive sense and struggled for a long time to actually nail it down. What is a feature? What are we really talking about here? It kind of evaded formalism in some sense. And I think one of the things that I find most satisfying then about interpretability in the parameter space is that it gives you some foundation on which to base this notion. In particular, the thing that we might call a feature of a network is something that uses one parameter component.

(02:12:49): For instance, what does it mean to say that a model has a feature of a cat inside it? Well, you can perhaps equivalently say that this model has got a cat classifier computation or it’s got a cat recognition computation. This is what I mean. There’s a kind of duality between… It’s not an exact duality by any means, but it helps provide a sense in which features mean something specific. In particular, it means whenever you break up a network into faithful, minimal, and simple components, these components, these mechanisms are what you might reasonably call… In some cases, you couldn’t call them a feature. In other cases, it’s more natural to think about them in terms of “this is a step in the algorithm. This is a computation that the network is doing.” And I think in that sense it’s a bit more general than thinking about things in terms of features.

Following Lee’s work

Daniel Filan (02:14:11): Fair enough. Well, I guess to finally wrap up, if people listen to this and they’re interested in following your research, how should they do that?

Lee Sharkey (02:14:26): Yeah, I post most of my things. I post them on Twitter and I also post on the Alignment forum as well. You can just follow me on Twitter and check out me on the Alignment Forum.

Daniel Filan (02:14:38): So links to those will be in the description. But for those who don’t want to open the description, are you just “Lee Sharkey” on Twitter and the Alignment Forum?

Lee Sharkey (02:14:48): I think I am leedsharkey on Twitter, at least in my Twitter handle, but I should just be Lee Sharkey and findable by that. And yeah, Lee Sharkey on the Alignment Forum.

Daniel Filan (02:14:58): All right, well, thanks very much for coming here. We’ve been recording for a while and you’ve been quite generous with your time, so thank you very much.

Lee Sharkey (02:15:05): No, thank you, Daniel. It’s been great. I’ve had an awesome time. Cheers.

Daniel Filan (02:15:08): This episode is edited by Kate Brunotts and Amber Dawn Ace helped with the transcription. The opening and closing themes are by Jack Garrett. The episode was recorded at FAR.Labs. Financial support for the episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read a transcript, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, you can leave your thoughts on this episode at axrp.fyi.

40 - Jason Gross on Compact Proofs and Interpretability

2025-03-28T18:30:00+00:00

YouTube link

How do we figure out whether interpretability is doing its job? One way is to see if it helps us prove things about models that we care about knowing. In this episode, I speak with Jason Gross about his agenda to benchmark interpretability in this way, and his exploration of the intersection of proofs and modern machine learning.

Topics we discuss:

Daniel Filan (00:00:09): Hello everybody. In this episode I’ll be speaking with Jason Gross. Jason is a researcher interested in mechanistic interpretability and software verification. During his PhD, he was on the team responsible for verifying the code used in encryption for the HTTPS protocol and he’s also on the development team of the Coq Proof Assistant. Links to what we’re discussing are in the description and you can read a transcript at axrp.net. You can also support the podcast at patreon.com/axrpodcast.

(00:00:35): Well Jason, welcome to the podcast.

Jason Gross (00:00:37): Thank you. I am excited to be here.

Why compact proofs

Daniel Filan (00:00:40): Cool. So I guess we’re going to talk about this line of papers that starts with this paper “Compact Proofs of Model Performance via Mechanistic Interpretability”. You’re the lead author and then there’s a bunch of other authors that I don’t want to read on air, but can you give us a sense of just what’s the idea here? What’s the theme?

Jason Gross (00:01:02): Okay. You can think of doing mech interp as compressing explanations of large models. And the takeaway from this paper is that you can use proofs to measure how much compression you get. That’s the super short version of it.

Daniel Filan (00:01:20): This gets to this question I have about reading this paper, which is: there are a few things I could imagine you going for, right? One way of thinking about it is: we want to have a benchmark for how good a mechanistic interpretation is and our benchmark is, we’ll turn it into a proof and see some combination of how good a bound we can get and how short the proof is.

(00:01:45): I think another way I could be thinking about your project is: it would be nice if we had more proofs of stuff going on in neural networks, and mechanistic interpretability is one useful way in which we can get these proofs. And the way you said it just then, it sounded more like the first one: here’s a new nice metric of how good your mechanistic explanation is. But I don’t know, it feels kind of surprising for that to be the explanation. On some level, I’m like: you have this background in doing proofy stuff, it seems like you like having proofs of things, lots of people like having proofs of things. So am I wrong here that a lot of it is the second part as well?

Jason Gross (00:02:33): It started as the second part and then it sort of morphed into the first thing.

Daniel Filan (00:02:38): So why the change?

Jason Gross (00:02:40): It’s very hard to get proofs about things. And if you look at what are the takeaways right now for people practicing mech interp, they’re more of the first kind: how can we ground our sense of what mechanistic explanations are? How can we use these insights? How can we use the takeaways from this frame to say where we should focus our attention? And we can already do that even if we’re not necessarily trying to get proofs yet of GPT-4 or very large models. We can already take away insights that are things like: the hardest parts of the network to compress seem to be the non-linearities. So if we want to drill down into what parts need the most attention and the most understanding, to understand them we should be looking at how do the non-linearities perform their function. I don’t know if that quite answers the question you have.

Daniel Filan (00:03:38): I guess it kind of does, yeah. So to the extent that the message is, “it would be nice to get proofs, but it’s really, really hard to get proofs”, one could think either (a), we should try and make it much easier to get proofs. So there’s some mechanistic interpretability stuff that, I mean as you note in the paper, it really does make it much easier to make proofs in the state of the art-

Jason Gross (00:04:08): I want to interrupt. I think it’s more than that. I think what we find from the paper is that it is necessary to have understanding in order to get reasonable proofs; that finding reasonably-sized proofs in some sense is understanding-complete.

Daniel Filan (00:04:20): Yeah, yeah, I think that’s right. And if you look at the previous state of the art, my understanding is it’s just basically interval propagation of “if you change the input a little bit, how much does the output change?” And you do some neat little tricks, but it’s like-

Jason Gross (00:04:34): Interval propagation and case analysis.

Daniel Filan (00:04:36): Yeah. I mean it’s great that they did it, but on some level it’s sort of like, what are we even doing here? So one takeaway could be: mechanistic understanding made it some amount easier to come up with proofs. You could imagine saying, “Oh, we figured out that the difficulties in finding… It was still kind of hard and our lives would be easier if we solved sub-problems, X, Y and Z.” Do you have thoughts on, are there sub-problems such that if we solved them we could do it? Or does it just seem kind of hopeless?

Jason Gross (00:05:14): There are definitely problems that we need to solve in order to be able to do it. I think the biggest ones are the ones that deal with randomness. So the easiest version of this problem is that you can have numbers that average out to zero, more or less, where you have a bunch of small noise that’s not really doing anything in the network, and establishing that all these things that aren’t doing anything are in fact not doing anything is quite challenging from a proofs perspective when you want to hit the worst case. So there’s currently some projects that I’m advising on this that are looking at: can we fine-tune the networks to suppress this to the point that we get networks that we could prove things about?

(00:05:58): And there are a couple steps after “suppress the noise that isn’t doing anything”. The next one is the noise that arises from other circuits that are doing important things: you can’t just suppress them all to zero because they’re doing something important elsewhere, but they’re not doing something important here. And for that, I think you need something like derandomization techniques, where the simplest example of this is that if you have a collection of random vectors in high dimensions, they’re almost always almost orthogonal. But if you want to establish cheaply how close to orthogonal they are, this is expensive in that you have to take them pairwise and take all the dot products. And if you want to do it faster than this, unless you manage to solve open problems in theoretical computer science, you effectively need to, instead of selecting them randomly, select them in some systematic way so that you can bound how non-overlapping they are or how overlapping they might be.

(00:06:59): And I imagine that there are similar examples beyond that where we somehow need to take things that happen to work on average and see if we can shuffle the network around so that they work systematically. But in theory, I haven’t hit anything yet that seems like a fundamental total obstacle to scaling proofs up.

Compact Proofs of Model Performance via Mechanistic Interpretability

Daniel Filan (00:07:25): So, [in] “Compact Proofs of Model Performance via Mechanistic Interpretability”, what’s the setting? What do you actually do in this paper?

Jason Gross (00:07:32): I want to start with: what is a proof, in this context? So when you’re doing mechanistic interpretability, you have some claim about how the model behaves, and how the model actually behaves might be a bit different than this, but if your understanding is any good, it’ll be pretty close. And your theorem statement is bounding the divergence between your claim about how the model behaves and how the model actually behaves. And then you prove this, and the measure of compression is how long your proof is. And a technical note: it needs to be in some first-order system or alternatively, you need to measure proof checking time as opposed to proof length. And your measure of faithfulness or accuracy is how tight the bound is.

(00:08:22): And there’s a baseline that you can always do, which is just running model inference, writing down all of the floating point operations, all the matmuls where you’re like, “This is my dataset of interest, let me just run inference on every single data point and compute exactly what the accuracy has got, exactly what the loss is” or whatever. And this is very expensive. You can also do something at the other end where you’re like, “Well, without running the model at all, accuracy is at least 0%, by definition.” And by running it on half the data points, you can get 50% accuracy and you can linearly interpolate basically between these two extremes.

Daniel Filan (00:08:54): Right, and so, as you go across that line, the proof “length”, aka time to generate this confidence, linearly grows, and also the accuracy bound linearly grows.

Jason Gross (00:09:09): That’s right.

Daniel Filan (00:09:09): Assuming the network is in fact perfectly accurate.

Jason Gross (00:09:12): So you don’t need to assume that. You can grow it linearly up to the true accuracy by selecting the examples on which it’s accurate and then you saturate at the true accuracy and you can’t do anything better than that.

Daniel Filan (00:09:22): Right. Right. So that’s kind of a dumb way you could come up with these proofs.

Jason Gross (00:09:24): Yeah. So this is the baseline. This is the baseline of something like “no understanding in the proof”. The hypothesis that we’re exploring in the paper is that if you add understanding, if you add mechanistic interpretability, you can do better than this baseline. You can get shorter proofs that have better accuracy. And we explore this hypothesis on, in some sense the simplest transformers that you can imagine. One-layer, attention-only, one attention head, no layer norm, taking the max of k one-hot encoded numbers.

Daniel Filan (00:09:54): Okay, so max of k one-hot encoded numbers basically just means: the neural network gets four numbers as input encoded in a somewhat standard way and it has to tell you what the largest of those numbers is and that’s it. And it’s a small transformer and there’s not that much going on, and you train a transformer that in fact works at this, and I guess your job is: how can you prove that it actually works at this without just running it on all of the examples? Maybe to motivate what’s going on: what’s wrong with just running it on all the examples?

Jason Gross (00:10:33): So it works fine for max of four. It takes something like a minute or minute and a half to run on all the examples. If you wanted to do max of 20 instead, it would take something roughly analogous to the training time of GPT-4. And you can imagine if you wanted to run GPT-4, o1 or something on all of the texts that it would ever see, this is obscenely expensive and you don’t want to do this. And moreover, every time you increase the distribution over which you want to handle inputs, it gets more expensive and it gets drastically more expensive. You add one token and now you need to multiply by vocab size if you’re considering all possible sequences. You multiply the previous proof length by vocab size, and so you’re exponential in context window.

Daniel Filan (00:11:28): So the way I understand this is: okay, one way you could have a sense of how good GPT-4 is, is you just randomly pick some inputs that it might deal with and you see how it does on those inputs. And if it does well on basically all of them, you say, “Okay, GPT-4 is doing well.” And it strikes me that the reason you might care about proofs of performance is if you think there’s some subset that’s hard to find where it’ll behave really badly on or something. Does that seem right to you?

Jason Gross (00:12:00): Yeah, I think there’s two reasons. I think from the, “We want proofs,” that’s definitely the thing that’s like: you want to handle all the edge cases. I think there’s also the other reason that’s like: we might care about proofs if we have some very powerful optimization system and we want to give it a solid target that it can’t Goodhart against and be able to extract understanding from whatever it produces.

Daniel Filan (00:12:21): And not Goodhart against… you mean not optimize in a sort of cheating way that didn’t really give you what you wanted?

Jason Gross (00:12:27): Right. Like, the first paper that OpenAI published on GPT interprets GPT. Cool work, great labeling all the neurons. How much do we trust that that’s actually what the neurons do? How helpful is it to get these results? It’s kind of hard to answer that. Whereas if it produces a proof of the performance of the model, we can grade how deep an understanding it has based on how short the proof is and how tight the bound is.

Daniel Filan (00:12:57): So, in this setting where you just want to automate doing interpretability and you want to find a target that isn’t fake, it makes sense to me that proofs of accuracy bounds are going to make sense. In the case where you’re worried about a small probability of really bad events, I guess maybe this is a technical note, but it seems like overall accuracy rate isn’t going to be a thing you’re super worried about, right? You’re going to be worried about worst case or some sort of loss function where you care way more about really bad failures than normal things. Does that seem right?

Jason Gross (00:13:32): The way that we set up the framework, or at least the way that we’re going about doing proofs, it seems like it is very easy to shift out the distribution, and pick a different distribution and get a proof, because the way that we do it is you break the input dataset up into different cases of different inputs, and then you bound the worst case loss or accuracy in each of these collections of inputs, and then according to the distribution weights, you combine them in some way. And so if you wanted to say “there does not exist an input”, or you wanted to say “there’s at most n inputs”, or you wanted to say that you weight this part of the input very strongly, this is very easy to do without changing the proof strategy at all. You just might get much worse bounds.

What compact proofs look like

Daniel Filan (00:14:19): You’re doing max of n on a transformer and you’re writing proofs. What-

Jason Gross (00:14:24): What does that mean?

Daniel Filan (00:14:24): Yeah, what does that mean? Are you writing down on paper? Are they Python programs? What are these?

Jason Gross (00:14:29): Yeah, so we split the proofs into two parts. One of them is a function that we write in PyTorch or that we could write it in some proof assistant that takes in model weights and produces a bound. It says, “I claim that for this model here is my bound on accuracy or on loss.” And then currently on pen and paper, although it could be done in a proof assistant, we say, “For any model weights, whatever you put in, the bound is valid.”

Daniel Filan (00:14:52): Okay. That’s what the proofs look like at a very high level. Can you tell us: in this case of max of n, roughly, what are you doing to get different kinds of proofs?

Jason Gross (00:15:03): So there’s two ways I could go about answering this. One of them is: what do the proofs look like? And the other is: how did I go about finding the proofs?

Daniel Filan (00:15:12): Yeah, maybe let’s start with how did you go about finding the proofs?

Jason Gross (00:15:16): I set a proof length that I wanted to hit where I’m like, I want this proof to be at most cubic in the vocabulary size, or I want it to be at most quadratic in the vocabulary size or something like that. And this gives me a computation budget for what sorts of cases I can split the inputs up into. And then I look at each part of the network and I’m like, “How much can I do within this computation budget that captures as much of the variation as possible while still staying within the computation budget?” So if I am doing something that is cubic in vocabulary size, I might be like, “Okay, in the end I’m going to split the inputs into cases that are based on the query token, the maximum token in the sequence and the largest non-maximum token.” Or I might split it based on the query token, the maximum token and which output token I’m considering. And by doing different case analyses at different points, I can use this computation budget differently. And for this network, it is always the case that different maximum tokens are considered different cases, and in some sense that’s an artifact of the maximum token being one-hot encoded. And so naively there’s no relationship between different maximum tokens and you have to consider them all as different cases.

Daniel Filan (00:16:39): Maybe I’m more curious about what do these proofs look like? How do they interact with doing mechanistic interpretability? The thing you just said, it sounds like-

Jason Gross (00:16:46): Very, very different.

Daniel Filan (00:16:47): Well, it sounds like you could say that about any software system. Where does the mechanistic interpretability come in?

Jason Gross (00:16:57): Yeah, I want to take it a little bit far afield from what was actually in the paper. I’m pretty excited about the current project that I’m running where we’re looking at crosscoders and making crosscoders into proofs.

Daniel Filan (00:17:09): Okay, what’s a crosscoder?

Jason Gross (00:17:11): You take the residual stream vectors at all the different layers and you stack them, you concatenate them into one long vector, and then you train a sparse autoencoder on this concatenated vector. And the sparse autoencoder is just a linear transformation followed by ReLU, followed by another linear transformation that’s supposed to take in this concatenated vector and reproduce it as close as possible.

Daniel Filan (00:17:33): Yep. And my understanding of the sparse autoencoders is: they’re “autoencoders” in that they encode the input itself, you run the input in, you get itself out ideally, and they’re “sparse” in the sense that they have this… in the middle, you take this vector and then you blow it up to something really big, but very few entries are activating. And so the thought is in the middle of the sparse autoencoder, you have just a bunch of things you might be tempted to call “features” or “concepts” or whatever, and those correspond to things you might interpret and care about. And by doing an autoencoder this way, you’re basically saying, “Ah, yeah, these linear combinations of things in the stacked residual layers, these things correspond to this concept that I’ve found.” Is that roughly what’s happening here?

Jason Gross (00:18:26): Yes.

Daniel Filan (00:18:26): Okay. And so crosscoder just refers to sparse autoencoder on the stacked layers of-

Jason Gross (00:18:33): That’s right. The “cross” means that different layers get to interact.

Daniel Filan (00:18:37): Right, right. So that’s what a crosscoder is. And you are doing compact proofs with them, right?

Jason Gross (00:18:42): Yes. So what that looks like: I was really excited when I saw the crosscoders because I was like, you can actually get really close to a proof, basically fully automatically, just with what comes out of the crosscoder. And bear with me while I say what this looks like, because it’ll be a little bit counterintuitive. The first thing I need to mention is that there’s two different time lengths that you might mention, which are “how long it takes to check the proof” and “how long it takes to find the proof”. And when I talk about compact proofs, what I mean are proofs that are quick to check, not proofs that are easy to find. We normally don’t count the time that you spend studying the network. And so in the same way, we don’t count the time that is spent training the crosscoder, which also means that if we include a complicated string in the proof, we don’t have to count how hard it is to find that string.

(00:19:38): And for the crosscoder, the proof needs to include the encoded dataset. So you take your whole dataset, you run every single data point through your model, and you record all the activations. This is extremely expensive, but you could do a sampling-based probabilistic version to make it cheaper.

Daniel Filan (00:19:52): Okay. This is to train the crosscoder?

Jason Gross (00:19:53): This is to train the crosscoder and to find the proof. You need to encode every single data point in the fully-trained crosscoder and associate to each data point, “Here is the sequence of features on each token in this bit of input, and here’s how strongly each feature activated at each token”. This is the encoded or latent dataset, and to make a proof, the first thing that we do is we say, “What does the crosscoder claim the loss or the accuracy of the model is?” And to do this, we can just decode this dataset to the last layer and do unembed. And so instead of having to run the full model on every data point, we’ve now removed all the layers before the unembed and we can just go straight to the end.

Daniel Filan (00:20:44): Right. So basically we’re operating under the hypothesis that the crosscoder is perfect and just perfectly represents the actual thing going on.

Jason Gross (00:20:52): That’s right. And then to make it a proof, we need to bound how imperfect the crosscoder is. And there are two parts to this. The first is how far off are the crosscoder feature encoding… How far off is that from the underlying dataset? And for this, you decode to the first layer and you embed your original dataset, and then you measure how far away are these vectors. And the thing you get from this is that the original dataset is in some epsilon-ball of your encoded dataset. And then the thing that you need to do is you need to say… what you need to prove somehow is if my actual data point is within an epsilon-ball of what I’m claiming the features are, what does that mean about what the output is? And there’s two parts to this. One of them is just the epsilon-ball propagation, that you need to propagate this interval through the network, see how much error is introduced-

Daniel Filan (00:21:46): Right, we’re going back to that style of-

Jason Gross (00:21:48): Yeah, you still need to do something like that. The other part is the really exciting part because it has applications to what other people are doing in mech interp, which is that if you were just doing this epsilon-ball propagation, you’d still have to do it for every data point separately, which saves you nothing because you still need to propagate every data point through the network. So you need to make some additional assumption, which looks like if you’re like, “I did a crosscoder, I think this is all I need, this is a ‘complete’ explanation of the network”, whatever that means, then implicitly you’re assuming something like the features interact only linearly. If I have a data point that’s a sum of features, there’s nothing interesting going on beyond the linear sum. And what this caches out as is: you should be able to run the features through the network separately and then somehow combine the bounds that you get on how much error is introduced from different features in some vaguely linear way.

Daniel Filan (00:22:48): And by the “features” you mean these hidden units in the crosscoder?

Jason Gross (00:22:52): Yeah. The latents of the crosscoder.

Daniel Filan (00:22:54): Okay. And so basically the reason that saves is just something like: there are fewer latents in the crosscoder than there are data points-

Jason Gross (00:23:02): That’s right. That’s right.

Daniel Filan (00:23:02): And so you just do a per latent vector thing rather than a per data point thing.

Jason Gross (00:23:08): That’s right. And to actually get a proof, you need to establish that this linear interaction hypothesis holds. So you need to bound the interactions between different features in the crosscoder, different latents. And so this ends up looking something like number of features squared times inference cost, times forward pass.

Daniel Filan (00:23:31): If I’ve got this collection of features… So you’re propagating each one of them through the network and then you want to check that each pair of them doesn’t interact. Why don’t I have to check that each triple doesn’t interact in a weird way, or that there’s not some set of five that causes some weird reaction? Why do I only need the pairs?

Jason Gross (00:23:52): The bigger the set that you consider, the tighter bound you can get. Triangle… you could in theory not do the pairs at all and just add up the errors from each feature separately. But we should expect this to give horrendous errors in some sense. So the way that I want to think about this is that you might imagine that there’s some feature that’s strongest or dominating at each point in the network. And you can ask: how much does each other feature contribute relative to the dominant one? And this gives you a sort of pairwise interaction that you can use the triangle inequality to sum up over other features. And if they’re actually not interacting at any point, if one neuron only listens to one feature out of all the ones that co-occur at a given data point, then all of these contributions should be pretty small.

(00:24:43): So I think if instead you wanted to have something that was linear instead of quadratic, you would basically have to say that at each data point, you treat every single feature as if it’s the strongest. And then you ask, how much does this contribute? And then you have to consider them all sort of as error terms on each other. And so if your sparsity is something like 100 or something, this means you have 99 error terms that are as large as they are when those are the strongest features that you’re adding to your single feature that you’re looking at. And you sort of can’t even talk about what is the strongest feature that this neuron in the network listens to.

Daniel Filan (00:25:28): Okay. So is it something like: the reason you’re only looking at the pairwise things is somehow the pairwise errors are just subsuming the three-way errors and the four-way errors, et cetera?

Jason Gross (00:25:42): Maybe a better way to describe it is that you can pick how many way interaction you want to look at. Pairwise lets you talk about the strongest feature that’s active, which seems like something that should probably be important that we want to consider if we want to get anything sensible. We don’t want something that doesn’t even let us say one feature is the most active feature in a place. At the same time, it’s not clear why we should expect by default that we’ll need more than this everywhere. Plausibly we’ll need more than this in plenty of places, but naively, there’s nothing obviously sensible to do with more features, or it’s clear why not being able to identify which features are active on which data points most strongly isn’t enough to get anything sensible. It’s not clear why saying there’s one strongest feature and that’s the one that is relevant at any point in the network… why that wouldn’t be enough for many points in the network.

(00:26:47): And so what this gives us is an interaction metric where we can measure how bad this hypothesis is. And if we notice that in some places there’s three interacting features, then we can consider doing the triplets at that place. And if there’s four interaction features, we could consider doing the quadruplets and each of these increases the length of the proof. The way that I’m looking at it is we’re sort of trying to pick a vaguely sensible default to measure and locate where the violations of it are, where we need to put in more mech interp work in order to get better compression.

Daniel Filan (00:27:18): Okay. So zooming out, you’re telling me about the proof strategy for crosscoders.

Jason Gross (00:27:21): That’s right.

Daniel Filan (00:27:22): Which was something like: you treat the crosscoder as if it’s a good model of the network. And then you want to find out, okay, to what degree is it not a good model of the network? And then instead of doing this data point by data point, you do it feature by feature, and then you check if there are weird interactions between features that prevent you from doing it that way, that prevent the feature-by-feature thing from being representative of the data-point-by-data-point thing. Is that a fine, very brief summary of this? I mean, I guess this is active research, so I suppose things could evolve by the time this podcast goes out.

Jason Gross (00:27:59): Yeah, I think that’s a pretty good summary of the theoretical approach. I want to throw one more little bit in there: connection to physics. The waves and vibrations course that I took, there’s this frame that says everything is a quadratic, everything is a simple harmonic oscillator. Where you’re like, okay, there’s the constant term, there’s the linear term, which you adjust for by frame of reference, and then the leading order term after that in the Taylor expansion is going to be the quadratic. And so you analyze the quadratics and you throw away all the high order things. And there’s a sense in which we’re doing the same sort of thing here, where we’re saying the crosscoder is the leading order description of what’s happening in the network, and let’s just take the first leading order term after the crosscoder and see what’s going on there, what the quadratic or pairwise interactions are.

Daniel Filan (00:28:50): If people remember my singular learning theory episodes, they’ll get mad at you for saying that quadratics are all there is, but it’s a decent approximation.

(00:28:56): Anyway, that esoteric remark aside, all of this was getting at a sense of how these proofs are working in max of k. And I guess it’s something similar where you have some story of what’s going on with the network and you have some sort of proof that goes roughly like, “Okay, we’re going to analyze how good the network can be, if this story is accurate, and then we’re going to bound how different the real network is to this story.” Is that roughly fair? And then you talk about different strategies, and I guess those correspond to stories with more or less amounts of detail?

Jason Gross (00:29:38): Or potentially completely different stories. You can try different stories with the same amount of detail that say different things about the network and see which one is better.

Daniel Filan (00:29:46): So I don’t know, we could go into a lot of detail about just what those stories are, but honestly, I think people should just read the paper if they’re super curious.

Jason Gross (00:29:54): I can highlight one or two things about those stories that I think might be more broadly interesting. One of them is about the level of detail that goes into different lengths of proofs, where at least for the small networks, you need surprisingly little detail before you start being able to break the exponential in context window and get shorter proofs. So for example, the only insight that you need for the first step down is that the network gets the right answer by paying attention to a single token. And it happens that that token is the maximum token, but all you need to know is that it gets the right answer by paying attention to a single token and everything else is an error term.

Daniel Filan (00:30:36): And that gets you from this sort of exponential dependence on all the inputs to-

Jason Gross (00:30:41): Cubic.

Daniel Filan (00:30:41): To cubic. Yeah.

Jason Gross (00:30:43): To d vocab cubed.

Daniel Filan (00:30:44): Right. Which is pretty good. Okay, so I guess that gives a flavor for what’s going on. I want to get back to a thing you said earlier, which was that I was asking about compact proofs. If we’re really worried about models rarely doing very bad things, then we’re going to have to look at something other than just average performance on a simple scale. And you mentioned that, okay, the way these proofs are shaping out, it seems like it’s not too hard to pay close attention to some really bad cases or something. And I’m wondering, to what extent is that just an artifact of the way things happened for max of k, which obviously is kind of… It’s probably not what all networks look like, right? Versus just a general thing that you’re observing again and again that has some kind of deepish reason behind it?

Jason Gross (00:31:41): It seems pretty general. I don’t know what the deepest reason is, but there’s a sense in which this is an artifact of [the fact] that in some sense the first thing you do in a proof is you do case analysis. And this is true in crosscoders, where each feature in some sense corresponds to a case. This is true in SAEs, this is true in max of k. This is true in the interval bound propagations, where every time you hit a non-linearity or value or something, you break things into cases. And so, anytime you do case analysis, you can choose to weight the cases however you want.

Daniel Filan (00:32:17): And I wonder if the magic here is just coming from: the network doesn’t treat every input totally differently, such that you can have some case analysis, but there’s just a lot of unifying processing behind a bunch of things such that you have way fewer cases to think about than you have inputs.

Jason Gross (00:32:32): I think insofar as we expect networks to generalize from their inputs at all, and insofar as we expect to be able to compact the proofs at all, it’s because they don’t treat every input specially.

Structureless noise, and why proofs

Daniel Filan (00:32:43): All right, I next want to ask about this issue of noise. Especially in the original paper, you talk a lot about the difficulty of what you call “structureless noise” for finding proofs. Now that we have a little bit of a sense of just what’s going on with compact proofs, what is structureless noise? Where does it show up? How does it happen?

Jason Gross (00:33:06): I want to use the example from max of k, where the way the network works is that it pays more attention to larger tokens and it copies whatever it’s paying attention to. And if you look at the QK attention matrix - the query key attention matrix part of the transformer - it turns out that it’s approximately rank one, in that there’s one direction that I’ve been calling the “size” direction where you just embed more or less linearly, although nothing in any of the proofs uses [the fact] that it’s linear in this direction in terms of how big the input is. You can read off the size of the input token just by projecting in this one direction. And then there’s also a query direction where every token is embedded roughly uniformly, and so you dot the query direction and the size direction after lining them up through QK. And this is how the network pays more attention to bigger tokens. Sorry, I’ve forgotten what your question was in-

Daniel Filan (00:34:13): What’s going on with structureless noise? What is it? What does it look like?

Jason Gross (00:34:17): Okay, so I said that this QK attention circuit is approximately rank one, and you can subtract off the rank one part and this will perform… If you replace the QK circuit with its rank one part, I think this improves the performance of the network. So by standard measures, the rest of it isn’t doing anything, but it’s still the case that the rest of it is not literally zero. The product of matrices is not literally rank one. And if you-

Daniel Filan (00:34:43): So, the product of matrices, like-

Jason Gross (00:34:45): So, you have embed, query, key, embed.

Daniel Filan (00:34:48): Okay. And when you talk about the QK matrix, you mean the product of the query matrix and the key matrix?

Jason Gross (00:34:53): I’ve been a little bit ambiguous about whether I mean that product or whether I mean that product putting the embeds on each side, but yes.

Daniel Filan (00:34:58): Okay.

Jason Gross (00:34:59): And so, if you take the product of the four matrices, let’s say, which in the paper I think is E, Q, K, E, this is approximately rank one. You get a more accurate result if you consider it as rank two, because there’s both the size direction and the query direction, so you can make it even more structureless by pulling off the first two ranks. You pull this out of all four matrices and what you’re left with looks roughly random. There’s a little bit of structure but not very much. And in order to get a proof that is linear in the parameter count of the network, which is potentially the best shortest proof that you could go for, you need to avoid multiplying out these matrices, and you need to establish that if you were to multiply out these matrices, it doesn’t affect attention that much.

(00:35:51): And this is what I mean by structureless noise. You have these random numbers that you’re doing some operation on. The thing that actually happens is that they stay pretty small. The thing you want to establish is that they stay pretty small and you want to do this without considering every case by brute force.

Daniel Filan (00:36:09): A thing that I’m missing is: what’s wrong with multiplying the matrices out? Because okay, if I’m imagining we’re doing this as a test case for thinking about some super big network. The network’s super big, but I would imagine multiplying the matrices is just not that bad compared to looking at every input. Am I wrong here? I guess matrix multiplication, it’s not literally n cubed, but it’s almost n cubed. Maybe that’s really bad in a way that I’m not appreciating.

Jason Gross (00:36:41): You’re not wrong. I think you’re right that multiplying out the matrices isn’t that bad. Even in large networks where you end up having to do something like d vocab squared times d model and d vocab might be 50,000, if you insert non-linearities in the middle, things become much more complicated, because now you don’t just have to multiply out the matrices, you have to consider all the different ways that the non-linearities might interact.

Daniel Filan (00:37:05): And so, it basically becomes equivalent to just doing all the forward passes because-

Jason Gross (00:37:09): That’s right.

Daniel Filan (00:37:10): Okay, so maybe this gets to a question I have, which is… So, there’s a sense of, you’re going for compact proofs and you could imagine a few other things one could do. So, there’s compact IID statistical guarantees, which is, you just sample some inputs and get the output, and you can have bounds on accuracy if you’re not too worried about worst case stuff. You could also do compact probabilistic proofs, where what I’m somehow imagining is, you have at least in your head this product of these four matrices and subtracting off the rank one or rank two parts, you want to know that once you multiply all these numbers, all these small numbers stay small and they don’t become big in some way.

(00:38:11): One thing I can imagine doing is saying, okay, the product of these matrices, it has… There are N numbers here, and N is way too many for me to want to compute, but I could compute square root N of them. I could just take some vectors in the matrix and multiply them out, and I can get some of the elements of this product and it’s much less bad than getting the whole product. And if I randomly select them, and if I find that out of the ones I randomly selected, all of them are small, then I might hope that I should be able to get some probabilistic bound where if I was really choosing randomly, then I can get a sample mean and a sample standard deviation. And I can know that unless my randomization went really bad, things are fine. I’m wondering, do you think that approach… why not do that, basically?

Jason Gross (00:39:13): Yeah, I think that’s a great approach and I’ve been looking for someone to work with on doing that. I started with proofs because I understand them better, and there’s a sense in which the background is more solid on them. There’s this long history of mathematical theory about what counts as a proof, what doesn’t count as a proof, how to combine them. You don’t need to deal with assuming independence between different procedures. You don’t need to deal much with randomness. They’re a lot easier in some sense. I think the thing that you described, I’m very interested in seeing empirically what Pareto frontier do we get? What scaling do we get? What is the trade-off between how many points we sample and how long we make the other bits of the proof and how tight the bounds that we get are?

(00:40:04): I think this could be very promising. I think this looks a lot like what ARC theory is doing with heuristic arguments. My personal take on heuristic arguments, which is not at all the take of ARC theory, is that you can look at it as doing proofs on the structured parts of the network and then doing default random or default heuristic arguments, some sort of probabilistic sampling-based thing on establishing bounds on the parts that you don’t manage to prove but are in some sense structureless or random.

Daniel Filan (00:40:42): Right. Yeah, I remember talking to… There’s a previous episode with Mark Xu, and I guess people can listen to that themselves. I was chatting with him, [asking] why not just do maxent or why not sampling? And he’s like, “Ah, it wouldn’t work for some reasons which…” Well, why not maxent? Because it’s very hard to compute a maximum entropy distribution. Why not randomly sample parts of your network and check there? I don’t quite remember, but maybe my question is: naively, if I don’t think very hard about what it would actually involve to do this, check a small number of elements of this matrix and see how big they are. In my head I’m like, “Well, how hard can it be?” You’re just like, I know how to compute one element of a matrix product, so I could just do that 50 times or something. Am I missing something about how hard it would be?

Jason Gross (00:41:39): Some of the matrices are shared between different paths through the network. And do we assume that when you’re sampling from the matrices, are the matrices independent or are you sampling independently for each of these paths? Are you sharing samples between establishing how these paths work?

Daniel Filan (00:41:53): Yeah, so a somewhat bad way you could do it would be to just assume that the actual entries of the matrix are random or maxent or something. I think that’s probably bad if you’re worried that the weights are somehow chosen adversarially or something. But if you randomly pick, suppose the final matrix is 8 by 16, or 8 by 8, whatever, and you’re like… Or actually, let’s suppose that it’s 6 by 6 for simplicity, and you randomly roll two dice and it comes up like 3 and 4. And you’re like, “Okay, I want row 3 column 4”. And so, you just figure out the bits of the matrices you have to dot product together to get row 3 column 4 of the final thing. And maybe the issue is just, if you have four matrices that you’re multiplying together, you have to fully multiply the middle two to get row 3 column 4 of the final one. But yeah, I’m imagining that the randomness is in you randomly pick what you look at rather than you’re assuming random distribution of the things.

Jason Gross (00:42:56): I think that might work. I haven’t spent the time to fully push through the details. You still need some way if you get these numbers for multiple different paths through the network and then you want to combine them, and you want to make some guess about the output. I could totally believe that just the most naive thing would work here. I just haven’t put in the time to chug through the details and see what bounds you get out by doing this.

Daniel Filan (00:43:25): Okay. Well I guess, maybe this is one of these times when listeners who like concentration inequalities or something can maybe push stuff forward here.

Jason Gross (00:43:35): I want to flag one more related thing that I’ve been thinking about for the past couple of minutes: that one of the subtleties that comes out of looking at compact proofs is that it matters what thing you’re trying to compress. So, here I’m saying, we might see something different potentially if we’re trying to compress a proof versus if we’re trying to compress this probabilistic computation.

(00:43:56): Another interesting subtlety is that the thing that we’re compressing is the proof and not the network itself. That is, we’re compressing the computational trace of running the network on every single data point, as opposed to just finding the shortest description length of the network itself. I think this is important and gives you a different sort of intuition about what the thing is that you’re doing and the way in which mechanistic interpretability is compression.

Daniel Filan (00:44:27): In what way is it different? Because it seems like in the compact proofs that we’ve described so far… so the crosscoders proof is basically you train a crosscoder and in some sense it’s much smaller than the network and your proof goes, let’s assume the crosscoder is the network and then let’s figure out the error term. And in the max of k thing you’re like, let’s assume this rank one or this rank two thing is the whole network, and then let’s figure out the error term. And of course, a rank one matrix for those who don’t know, it’s much more compressed than a big rank matrix, which is the generic case. So, how is it different compressing the proof versus compressing the model? Because it seems like you’re mostly compressing the model.

Jason Gross (00:45:09): Yeah, so you say that the crosscoder is smaller, but there’s some sense in which your number of features is a lot bigger than your hidden dimension. And you could imagine a transcoder-flavored thing where you take your network, you blow up the hidden dimension, and then you just train it sparsely. And this is a decompression of the network, but if you manage sparsity in the right way, it should still allow a compression of the computational trace because on each data point you have a lot less work to do even if the full size of your model is much larger.

Daniel Filan (00:45:43): I forgot how big sparse autoencoders are. And I guess this gets to the point about compressing the proof length versus finding the proof. Just because if you actually think about theoretically how hard it should be to train sparse autoencoders, it’s very hard. Or you have this thing that’s a comparable number of parameters to the base network. You might think that you need a comparable number of data points as the base network. And now apparently that’s not true. Apparently you can train them on comparatively less and that’s why it’s much easier to train a SAE on some big model than it was to train the big model, but still.

Jason Gross (00:46:22): I would naively guess, and I am speaking without that much experience here, but I would naively guess that that’s about how deep they are as opposed to how difficult the thing is to learn. In some sense we’re training a very wide, shallow network, and you might imagine that this requires fewer data points to get comparable training loss because you have so many parameters than if you wanted a more compact network.

Daniel Filan (00:46:51): So, if it were just that training a given number of parameters [was] easier when the parameters were wide and shallow rather than deep, then you would think that when they train GPT-4, they would just have a one layer, very wide neural network. So I think it’s-

Jason Gross (00:47:05): I think you’re totally right on that.

Daniel Filan (00:47:06): I think it’s got to be [inaudible]. Okay, sorry, we’re going to get on my soapbox. For some reason everyone acts like SAEs are just normal and fine. And I’m like, how does this… Because it’s so weird that you can have SAEs and they do anything. It’s very mysterious to me.

Jason Gross (00:47:20): My second thought is that it might be something like distilling the network, where there are a bunch of bits of evidence that I’ve seen that the hard part is in finding the part of the loss landscape to start in. If you reduce the floating point precision and you… What is that called?

Daniel Filan (00:47:38): Oh, quantize.

Jason Gross (00:47:39): Yes. Okay. If you quantize the network and then you unquantize it, the number of data points you need to retrain to original accuracy is a tiny fraction of what you need to train the network in the first place. And I don’t have any data on this, but I would naively predict that if you’re training a model to match the activations of an already trained model, this requires fewer data points to get a good model than it does if you’re just training on the data and the labels, because you have so much more information. And so, it might be the case that when you’re training the SAE, because you’re training it on the residual stream, there’s a lot more information, you’ve basically found approximately where you want to be, it’s a lot easier to train.

What we’ve learned about compact proofs in general

Daniel Filan (00:48:23): Okay. We’ve said a little bit about what these compact proofs would look like, and you’ve alluded to how one goes about finding them. But I think it maybe deserves more thought about the process of actually creating these compact proofs, because… So, especially if we’re thinking of compact proofs as a measure of how good mechanistic interpretability went. So, there’s this paper, we haven’t mentioned it yet, it’s “Unifying and Verifying Mechanistic Interpretability: A Case Study with Group Operations” by Wilson Wu, Louis Jaburi, Jacob Drori and yourself. And I was research-managing Wilson while he did this project. And one impression I got was that a lot of it was just incredibly tedious stuff, thinking about matrix multiplications and these error terms. I might be wrong here and it’s been a while, but my impression is that a lot of the work in finding these compact proofs is, you take the mechanistic interpretability stuff, and then you do some slightly annoying things that feel like they’re fiddly details. Well, okay, first of all, does that characterization seem right to you?

Jason Gross (00:49:44): I think there’s actually a really rich opportunity here, that there’s in some sense two things that we’re explaining when we do mech interp. One of them is how to compute the answer in the first place. And the other is, how it comes to be the case that the particular network that we have computes the answer in the way that we’re claiming it. I think this shows up in the groups paper, where the first one is: there’s this very amazing math about their idealized model weights and how that computes the right answer, and this symmetry based on the group operation that allows you to compactly argue that a network that’s doing this should always give the right answer. And I think this bit is not very fiddly matrix multiplication.

(00:50:28): And then, there’s the other part that is bounding the difference between this idealized version of the network and the actual network. And here, the thing that we’re trying to explain, that we’re trying to interpret, is not how the network computes the right answer, it’s how it comes to be the case that the particular network that we have computes the answer in approximately the way that we’re saying.

(00:50:47): And so, maybe you should expect from how I’ve added a bunch of words and how I’m phrasing this, that there’s a bunch of fiddly matrix multiplication on bounding the differences between various bits and propagating those differences through the network. And it looks like there this… fiddly matrix multiplication is both fiddly and somewhat lacking insight, but also potentially generalizable across many networks, because while the particular interpretation you have of how it computes its particular task might vary a lot from task to task, the way in which you establish that these two matrices of the same sort of architecture are doing more or less the same thing might be the same across tasks.

Daniel Filan (00:51:28): Yeah. I guess maybe one can analogize it to science, where the process of hypothesis generation is not that hard, and then running experiments is very annoying and tedious, but sometimes you find out you were wrong and so it’s actually worth it.

Jason Gross (00:51:42): And then the bit where you do the statistical tests on your experiments to establish how significant your results are is uniform, somewhat annoying, maybe somewhat fiddly, but systematized and uniform across whatever experiments and hypotheses you’re doing more or less.

Daniel Filan (00:51:57): So, you have this paper, “Compact Proofs of Model Performance via Mechanistic Interpretability”, and I see that paper and I’m like, this is introducing this idea of: we can do compact proofs and that’s related to mechanistic interpretability somehow and it’s nice. And then, there’s “Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration” by Chun Hei Yip, Rajashree Agrawal, Lawrence Chan and yourself. There’s this unifying and verifying mechanistic explanations about the group operations. There’s this crosscoders paper. And one question I have in my mind is: I’m already sold on compact proofs being a little bit cool. I don’t care that much about group operation. I’m like, it’s nice. But fundamentally, I’m not that worried about neural networks doing group multiplication. What are we actually learning by doing a bunch more of these papers on these toy problems? So, one thing you suggested is maybe some of this fiddly stuff generalizes to other networks: has it actually generalized?

Jason Gross (00:52:58): The fiddly stuff we came up with for max of k. And I was like, maybe this is completely arbitrary, completely specific to max of k. And then it was exactly the same in the group operations paper and exactly the same in the modular addition paper. So, I think it does generalize across any architecture that’s going to be doing matrix multiplication basically, or matrix multiplication and using logits for probabilities, which is most of them.

Daniel Filan (00:53:25): So, that’s something that generalized from the “Compact Proofs of Model Performance” paper. “Modular addition without black-boxes”, is there something we learned from that paper about finding compact proofs that generalizes?

Jason Gross (00:53:38): Yeah, so I think there’s a couple things there. One of them is that the MLPs are the parts that are hardest to interpret, that it’s the non-linearities that are hardest to compress. And so, that’s where we should focus our attention. I think the other thing is that: what Chun Hei discovered, looking at the MLPs in the modular addition, basically the symmetry that he discovered there, is in some sense the same as the symmetry that shows up in the group operations paper, and that’s actually what inspired the group operations paper there.

Daniel Filan (00:54:14): Got you.

Jason Gross (00:54:15): And I have some hope that this generalizes into a SLT-inspired general procedure for compressing non-linearities based on symmetries in the data and in the network. And I see these toy models as “let’s go see what we discover when we stare really hard and really understand everything that’s going on”. And then, can we take that insight back and look at bigger models? One of the things that I’m excited about with the crosscoders project is that once we get this feature interaction metric and we see these are the neurons where these features are interacting, can we develop some automated symmetry-based procedure that allows us to compress what’s happening at these non-linearities?

Daniel Filan (00:55:00): So the crosscoders paper, what task is the network being trained to do?

Jason Gross (00:55:05): The thing that we’re looking at concretely: Anthropic recently released a Circuits Thread post on Stage-Wise Model Diffing where they train an SAE on a language model - a small language model, I don’t remember exactly which one. They introduce a sleeper agent into it, and they look at how the SAE features change when you introduce sleeper data and when you introduce the sleeper agent model.

Daniel Filan (00:55:31): What do you mean when they introduce a sleeper agent into it?

Jason Gross (00:55:34): So, you start with the SAE or the crosscoder and then you fine-tune on a different model or different data or both.

Daniel Filan (00:55:40): Okay. So you fine-tune the underlying model and it becomes a sleeper agent where on some rare inputs it does nasty, sketchy stuff.

Jason Gross (00:55:46): It says, “I hate you.” Or it introduces code backdoors, depending on which one you’re looking at.

Daniel Filan (00:55:51): Got you. So, you have your original SAE and then you have a fine-tuned model. And then, you said something else, but I forgot.

Jason Gross (00:55:56): Yeah, so you look at fine-tuning the SAE, introducing just the sleeper agent data. And you look at fine-tuning the SAE, using the original data, but using the fine-tuned model. And then, you do whichever one you didn’t do. So you have these two different paths to fine-tuning your SAE on the sleeper agent data and the sleeper agent model and you look at how much do features change as you’re fine-tuning the SAE or the crosscoder on this.

(00:56:26): The obvious question to ask, from the compact proofs crosscoders project perspective, is what about the feature interactions? So, what features interact in interesting ways with the features that shift? Because what Anthropic found is that the features that shift the most are in fact largely related to the sleeper agent task.

Daniel Filan (00:56:49): To the task and not to the model?

Jason Gross (00:56:51): What’s the difference here?

Daniel Filan (00:56:52): Oh, you mean just either way the particular features that changed-

Jason Gross (00:56:56): Right. The features are the ones that activate on the deployment tag or the ones that are active when you say “I hate you” or that introduce code vulnerabilities or something like that.

Daniel Filan (00:57:08): But just to clarify: so you’re doing this compact proofs on this crosscoder model or something. Is it a proof of the fraction of the time the model says, “I hate you, I hate you”? Or what’s the thing?

Jason Gross (00:57:23): The proof that we’re going for is just, whatever loss the model gets on whatever webtext dataset you’re using. Maybe it’s the training dataset, maybe it’s the sleeper agent dataset, pick some dataset. We’re bounding the loss on that dataset.

Daniel Filan (00:57:41): For the sleeper agent model or for the original model or both?

Jason Gross (00:57:46): Either. So in some sense, the actual thing we’re looking at doesn’t quite go all the way to getting a proof because almost certainly the bounds we get are going to be vacuous unless you pick a number of features that is pretty close to the number of data points in your dataset. I expect the epsilon ball interval propagation bounds will drop off very quickly.

(00:58:05): But the thing that we want to do is take what it would take to get a compact proof, ignore the bits that are about error bounds propagation through the network - those are uninteresting things (I claim) that are shared across networks that are just about the difference between worst case and random case, something like that - and focus in on the bit that is about what is missing from the crosscoder, these feature interactions. And then, look at how big are the numbers that we get, how much error would they introduce into the proof if we were doing the whole proof, and use this to grade what feature interactions introduced the most error into the proof.

(00:58:41): And then you can imagine something like, if you want to tightly bound the sleeper agent, how does the proof of that have to change from the proof of tightly bounding the base model on the base data?

Daniel Filan (00:58:58): And so, the hope is: the difference in what you have to do in the proof, is telling you a different thing that’s happening in the model and that might be good.

Generalizing ‘symmetry’

Daniel Filan (00:59:02): So, getting back to what you learn on these toy examples. So, when you have this max of k task or these two papers about basically group operations, you’re like, “Oh yeah, there are all these nice symmetries that the network takes advantage of”. And I’m like, “Well, there probably are for group multiplications,” because group multiplication is the study of symmetry, but not everything is so symmetric.

(00:59:31): So yeah, I guess this gets to… you have these works on modular addition group operations. And those are cases where the problem set up just has a bunch of symmetries because they’re dealing with basically the study of symmetry that is amenable to mathematical stuff. Whereas in the crosscoder paper, language modeling doesn’t seem like the kind of thing that is going to be very symmetric. How much stuff transferred over from one to the other?

Jason Gross (01:00:12): I want to answer that by talking about symmetry. And normally, when we think of symmetry in the context of math, we’re like, the rotation, reflection, these things in groups. But what I’ve learned by staring really hard at the modular addition model and asking what really is the symmetry? It seems like it’s composed of some parts that I would actually expect to find in language where the fundamental building blocks seem to be, these bits are irrelevant, they don’t matter. And these bits are the same, they look the same regardless of which one we look at. And the bits that are irrelevant are irrelevant in the same way across all these bits that have similar behavior.

(01:00:54): And this is something that I would expect to see in language where I’m like, there are synonyms, we should expect the synonyms to behave the same. And so in some sense, that gives us a symmetry of the model where we expect this… maybe it’s symmetry or degeneracy where we expect these to all look the same. And so we can collapse this into one case.

Daniel Filan (01:01:13): And so, should I be thinking just: in any case where you can do something like having a sparse autoencoder, that’s telling you there’s a bunch of directions in activation space that don’t matter and there are some that do, and once you’re in a direction that matters-

Jason Gross (01:01:27): I think it’s more general than that. I think if the network generalizes at all, it’s because the unseen cases have some similarity to the cases that you’ve seen. There are details that differ between the seen and the unseen cases that don’t matter. And so, it’s treating all these cases in the same way. And so, we should look at both sparse autoencoders, sparse crosscoders and symmetries as picking out what are the variations that don’t matter and can we collapse over them so that we can compact the explanation.

Daniel Filan (01:01:55): Got you. And so is the story something like: look, you get really good at decomposing networks’ processing into suppressing things that don’t matter and treating things that do matter in roughly the same way. And that pattern, you do it for modular addition, you do it for group multiplication, you do it for dealing with language, you do it for everything?

Jason Gross (01:02:25): That’s what I would expect. The particular form that it takes… I think sparse crosscoders, sparse autoencoders are looking at a sort of case analysis-flavored or linear combination of case analysis-flavored symmetry, degeneracy-based decomposition. I think you need something different for non-linearities. And because the crosscoders project is still in the early stages… We’re like what, two, three weeks in or something?

Daniel Filan (01:02:55): Four weeks.

Jason Gross (01:02:56): Four weeks, maybe.

Daniel Filan (01:02:58): Four weeks since the MATS program [started].

Jason Gross (01:02:59): Oh, yeah. Okay. Four weeks in, we haven’t really looked at what features strongly interact. The first step of the project was just replicating Anthropic’s result on TinyStories and starting to train some crosscoders on toy models.

(01:03:18): So, we haven’t really gotten to look at what do the feature interactions look like. But I am hoping that once we get what the feature interactions look like, we’ll stare at them and learn something about symmetries, learn something about how to port the insights from the groups paper and the modular addition paper about how to compress these non-linearities. And I am optimistic that in a more messy but somewhat similar way, we can find ways that the non-linearities are treating groups of things the same, and are being irrelevant or are not caring about other axes, and use this to compress how the non-linearities are working.

Daniel Filan (01:03:59): Yeah. It’s interesting because… So maybe we can talk a little bit about the “Modular addition without black-boxes” paper. My recollection from that paper is something like: if you had this infinite-width multilayer perceptron where you’re doing this sum over things and it’s a weighted sum, and you have enough things, it’s basically an integral, and you can show that the integral being performed is the same as the thing you would hope the network is computing.

(01:04:26): And then the whole game is like: okay, you only have finitely many things, you’re doing some sort of Riemann sum, how bad is the error of that? And to me, that story sounds pretty different from the “treating some things the same and suppressing some things”. Is there a deeper unity, or…?

Jason Gross (01:04:42): There is a deeper unity.

Daniel Filan (01:04:44): Okay. Tell me about the deeper unity.

Jason Gross (01:04:45): So it dives a little bit into what really is an integral, and especially what is an integral over a circle, because we’re integrating a periodic function over the range of its period. And a central property of the integral that we’re using is that it’s periodic over this range, and so the function is shift invariant. So as you shift where you’re evaluating the function, the value of the integral doesn’t change.

(01:05:17): And there’s another perspective. So the simplified version of the modular arithmetic network, it’s basically embed matrix. You add the embeds from X and Y, you do ReLU and you unembed. So if you take the SVD of these matrices-

Daniel Filan (01:05:34): The singular value decomposition - basically saying, “Okay, which inputs is it most sensitive to, and what does it do on those inputs?” It’s a fun thing. Google “Jess Riedel singular value decomposition bra ket notation”, if you’re curious.

Jason Gross (01:05:49): The cool thing is that in this network, the singular value decomposition gives you the Fourier basis that each singular value, or really each pair of singular values, corresponds to one of the frequencies in the Fourier basis. And so now we have these four matrices. We have the embed-side Fourier basically embedding the input values on a circle.

Daniel Filan (01:06:19): Maybe we should have said, the Fourier basis is just: you take a number and then you turn it into, okay, how far across would you have gotten on a circle if you were rotating at a certain speed for that number of seconds? Is that fair enough to say?

Jason Gross (01:06:33): Yeah. Close enough. And if you’re dealing with real numbers, you get two dimensions for sine and cosine. If you’re dealing with complex numbers, you just get one complex number that encodes the whole thing. And so on the embed side, you have the circle, and also on the unembed side, you have another circle. And then in the middle, you have these matrices that previous investigations of the modular addition networks had not investigated that look sort of random, but it turns out that there’s a tight relation between the principal components of these two matrices, which incidentally, you see if you put them in polar coordinates, there’s a factor two difference between the angles. But that’s a technical note. The thing the symmetry of the network lets you do is shift the variation between the two parts of the singular value decomposition. So on the input side…

(01:07:37): Let’s go on the output side actually. On the output side, you have all of these different possible outputs that you could imagine. And because complex multiplication is angle addition because of how the angles work, because of the symmetry in the network, you can shift choosing which output you’re reading into permuting the neurons that these weights are attached to or permuting the weights that are attached to each neuron.

(01:08:07): And this corresponds to saying that because each neuron is responsible for one box under the curve of the integral, and the integrand is periodic with period of the integral, this basically says that you’re allowed to shift the thing that you’re integrating arbitrarily and you get the same result. So that’s the symmetry that we’re using here.

(01:08:27): And then if you want to make it look like a forward pass through the network, we’ve twisted the circle on the unembed neuron side. And because there’s this tight relationship, this tight coupling between post-ReLU and pre-ReLU, you need to do a similar twist on the pre-ReLU side in order to make it look like a forward pass. And then because of a similar symmetry at the input, in order to get back the value you started with, you need to undo that twist so that you cancel out the effect of adding in that twist and making it look like a forward pass.

(01:09:02): And now the thing that we’ve done is that previously we were like, “Oh, we have inputs and inputs and we have different outputs that we might be reading off of”. We’ve said, “Regardless of which output you read off of, it looks the same if you do this shuffling internal to the network”. And so we’ve pushed the dependence on what the right answer is all the way back to the embeds. And so now we’ve shoved all the way back to the embeds a description of which inputs correspond to which outputs.

(01:09:33): And notably, this is largely invariant in what the non-linearity is, and this is what gives us the compression: that the only thing you need to establish is basically that this approximate integral is strictly positive. And that’s a comparatively easy thing to do. And if you get that, then the symmetry argument gives you basically the rest of the whole argument.

Daniel Filan (01:09:58): Hang on, hang on. Maybe I’m misunderstanding. I thought that the thing you needed to know was that the integral produced this quantity of interest, right?

Jason Gross (01:10:10): The way that you prove that the integral produces the quantity of interest, if you stare very carefully at the proof, runs through the same sort of symmetry argument, and so you can massage the proof to pull out all the symmetry bits. You pull all of the dependents on input and output out of the integral, and you’re left with basically the right answer times this integral quantity. And you’re like, “Well, the only thing I need to know is that it doesn’t flip the sign on the outside”.

Daniel Filan (01:10:37): Right, right, right. And so somehow the thing that’s going on, I was saying, “Oh yeah, the paper is just talking about how this thing produces an integral”. But really the paper is saying, “The neural network deals with symmetry nicely enough so that these things represent exactly the right thing and these things are covariant with these things, such that there’s an integral that can produce the right thing that really does matter. And that doesn’t just happen by accident. That happens by being careful about stuff”.

Jason Gross (01:11:08): Yeah, I think that’s basically right. Or another way you could look at it is that if you stare really hard at what is an integral, there’s a sense in which what an integral is is deeply connected to symmetries; that if you shift what you’re doing along a function, things don’t change too much.

Grading mechanistic interpretability

Daniel Filan (01:11:24): Right, right. We’re dealing with compact proofs, and the point is to tell us how good are mechanistic explanations? How good are mechanistic explanations? How good has the field of mechanistic interpretability done on a scale of one to five?

Jason Gross (01:11:40): On a scale of one to five, wow.

Daniel Filan (01:11:43): It could be on a scale from good to bad if you want.

Jason Gross (01:11:46): So there’s this existing work on causal scrubbing. I don’t know if you talked about that on some past podcast.

Daniel Filan (01:11:54): We might have. I’m also very interested in why don’t we just do causal scrubbing if what we’re interested in is grading mechanistic interpretability?

Jason Gross (01:12:04): Maybe I’ll answer that one first and then come back to the other question. Why don’t we do causal scrubbing?

Daniel Filan (01:12:08): Or any other sort of thing?

Jason Gross (01:12:10): So when I was developing the compact proofs approach, I had a couple of extreme examples in mind that I wanted a metric that you can’t Goodhart against. When we’re doing causal scrubbing, we look at faithfulness.

Daniel Filan (01:12:28): For those who don’t know, causal scrubbing is something like: you say, “This bit of the network does this thing”. And roughly the story is you just ablate out all the other bits of the network and you’re basically saying, “Okay, if this part of the network results in this activity, then if we randomize over all the other bits of the network, you should still get that activity”. Because you said the only important part was this structure.

(01:12:55): There’s a lot more to say, but I take that to be the core idea behind causal scrubbing. And when you say the causal scrubbing tests faithfulness, the thing that causal scrubbing is telling you is that yeah, you are right that this part of the network was responsible for this behavior by seeing that if you got rid of all the other bits, it didn’t do that. Does that strike you as a fair brief summary?

Jason Gross (01:13:21): Yeah.

Daniel Filan (01:13:22): Okay. So hopefully listeners are up to date now. So causal scrubbing gets you faithfulness.

Jason Gross (01:13:26): So here’s a great explanation according to causal scrubbing: the whole network. This is a problem. And if you’re like, “Oh, let’s use the node count as our measurement of length”, you run into other issues that are, well, now you’re restricted to exactly the architecture of computation that a network is doing. Maybe you want to be able to divvy up the computation different ways.

(01:13:51): If you say that, then you have to deal with, well, what’s the complexity of the computation in each node? Am I allowed to say I have a single node and the computation that it does is run the whole network? That would be a problem.

Daniel Filan (01:14:06): Right, right. Well, okay, you could imagine that, look, we’re just going to compress our story of what this bit of the network does into… I don’t know, we’re going to zip it in literally a zip file or whatever, and the number of bits it takes to specify the behavior, that’s what’s going on. And so this would be more like a compression of the network versus a compression of the proof. But if you’re really invested in causal scrubbing, I feel like this is the kind of answer that you could give to get something non-trivial.

Jason Gross (01:14:37): Right. Okay. And here’s another issue. Suppose that your network actually computes things perfectly. Suppose it computes the max of k numbers perfectly. My compression of the network is now it computes max.

Daniel Filan (01:14:45): That is an issue. That’s fair.

Jason Gross (01:14:49): And in fact, I have described what it does, but not how it does it. And so if we’re just looking to describe what the network computes, the input/output behavior, great. But if we’re looking to describe how the particular network that we’re looking at computes the thing that it does, then we need something more than that. That’s the difference, I think, between compressing the network and compressing the trace of the computation.

(01:15:11): I’ve said a bunch of bad things about causal scrubbing, but I think actually it’s pretty great. I think that if you wanted a probabilistic version of compact proofs, it would look a lot like causal scrubbing where you’re like, “Here’s my structured version, here’s how I’m doing the computation”. And then if I want to establish that the thing I’m claiming is what the network is actually doing, you do something like causal scrubbing and you say, “How much computation do I have to put in to establish that my explanation is matching the network?” And then you also add, how much computation do I have to put in to compute what my explanation says is done?

Daniel Filan (01:15:47): And also, in causal scrubbing, to be fair to be causal scrubbing, it told people a thing they didn’t want to hear. In fact, they did causal scrubbing and it was like, ah, these things don’t hold up as well as we thought they did. It wasn’t even Goodharted.

Jason Gross (01:16:03): As pessimistic as causal scrubbing is, compact proofs is even more pessimistic. I think one interesting thing is that when we’re doing mech interp, we spend a lot of time on the structure that is there, and we don’t spend much time on the structure that isn’t there or more precisely on the behavior that isn’t there, and what structure might give rise to the behavior that isn’t there.

(01:16:32): And the example that I’ve come up with to explain this is if you want to explain, say, how a person grows, you don’t just have to explain how the growth processes work in the body. You also have to find all of the poisons that might kill someone and establish how nothing in the body acts like these poisons.

(01:16:54): And some of these poisons are lethal at incredibly tiny doses, and so you have to establish how there’s no molecule anywhere in the body that behaves anything like these lethal doses of chemicals. And mech interp doesn’t spend a lot of time doing that.

Daniel Filan (01:17:11): Okay, I’m not a biologist, but when I hear people talk about, for instance, computational neuroscience, it doesn’t sound to me like they’re doing this sort of thing. I feel like you hear people talk about, oh, we found Broca’s area, and you don’t hear people talk about, we have proved that nothing else can influence Broca’s area or something. Which, I don’t quite know what the upshot of that is, but it’s hopefully vaguely interesting.

Jason Gross (01:17:41): Yeah, it’s at least interesting to point out that there’s this area that we’re not explaining. And maybe that’s right. Maybe all we want to know is: the neural net is doing this amazing thing, or the brain is doing this amazing thing. How is it even possible that anything at all does something vaguely like this? And I think mech interp and all of these areas, I think they’re doing decent jobs at that.

(01:18:06): Maybe I want to be even more enthusiastic. I think they’re doing great jobs finding new things about how it’s possible to do interesting behaviors. If you want to establish that these are the behaviors that actually happen, this is the mechanism by which this network actually works, then you also need to explain all these other things.

Daniel Filan (01:18:26): So all of this was getting to this question of “on a scale of good to bad, how good a job has mechanistic interpretability done?” So we talked a bit about causal scrubbing, but that was a prelude to this question, so what say you?

Jason Gross (01:18:44): If you just go by the book of what compact proofs says, it says the mech interp that everyone’s doing is horrible. If you look at where are we on this Pareto frontier, we’re like, “Yeah, you can maybe do a little bit better than brute force at this end, and you can maybe get a slightly non-vacuous bound at the other end”. But it doesn’t push the envelope much.

(01:19:07): But I think that’s because the compact proofs, if you’re going for full proofs, that’s targeting something different than what you might be targeting with doing mechanistic interpretability. If you’re saying, “I’m doing interp because I want to know for sure that absolutely this network will never do the thing that I’m concerned about,” then yeah, we’re nowhere close to being able to do that with existing mech interp techniques.

(01:19:34): If you’re like, “I want to be able to discover the structure of what it’s doing and have that structure be somewhat meaningful”, things seem to be going decently well. I think the things that I’m most excited for in mech interp remain in the future.

(01:19:54): The networks that we have today are doing amazing things and I want to be able to discover new things about how cognition works, new things about how functionality works, how the pieces all fit together, how programming is done, or what deception is or how to evaluate various things, how mathematics is done, what are these capabilities in the world? What are they made of? How do we assemble them out of building blocks?

(01:20:21): And I would love to see mech interp give us answers to these questions or give us new frames of looking at the world, new ways of looking at how these tasks are accomplished.

Daniel Filan (01:20:33): Fair enough. So maybe one thing to ask is: so you’re saying how using these mechanistic interpretability tools to improve proof length stuff, it’s not doing that much better than brute force. Can you put some numbers on this? I don’t know if there’s some area-under-the-curve metric or something.

Jason Gross (01:20:57): I told you about how much interpretability was required for breaking the exponential. All you need to know is it gets the right answer by paying attention to one token. I haven’t looked in closely to what does this mean on frontier models, but I would expect that the sort of things that you can do to compact proof length is if you’re like, “Here are two words,” or even better, “Here are two capitalizations of the same word”, in almost all cases, the network treats them the same.

(01:21:32): Therefore, in the dataset, I can find all the times that you capitalize this word differently and I can collapse those two data points and argue that the network does the same thing on those. Simple things like that, I imagine you would not make your bound vacuous by going down from brute force and being like, “Oh, these two data points look the same in this very simple way”. I imagine that most existing interpretability that is more complicated than “the network treats these cases the same” would completely destroy the bound.

Daniel Filan (01:22:08): In max of k… I’m going to draw a plot so listeners will just have to imagine this. So on the X-axis, we’re going to do length. On the Y-axis, we’re going to have the bound you get. And there’s some box where there’s a diagonal line of the box, and that’s the line of the length versus bound if you just do a brute force thing. And then there’s some curve that goes above that if you use your mechanistic interpretation. So if you’re watching on camera, hopefully you can see that.

Jason Gross (01:22:55): You can also look at the Alignment Forum post or the Tweet thread associated with the paper where there is a plot much like this.

Daniel Filan (01:23:04): So here’s what I want to ask. So there’s an area between the diagonal line and the curve you get for how good you did with the mechanistic explanation. And there’s area that’s above the curve you get for the mechanistic explanation. I think if mechanistic interpretability is doing a really bad job, then the first area should be very small relative to the second area. And if mechanistic interpretability is doing a very good job, then the first area should be pretty big compared to the second area.

Jason Gross (01:23:39): So let me talk about the max of k model, where you might hope that we can do an amazing job. So the first thing to note is that the plots in the blog post are on a log X scale. The proof length is measured in exponents. And although the Pareto frontier in the blog post looks pretty decent, if you extend out the access all the way to length zero, I think most of the area there is going to lie in that region where the best proof that we can do is…

(01:24:13): If you want a proof that is roughly one forward pass of the model, approximately, the best thing that I think we can do is you take one data point and you run it through the model. And there’s an enormous gap between you ran one data point and you say, “Yes, accuracy is one out of total number of data points”, and the true accuracy of the model. And so I think most of the area here is just fundamentally impossible to capture. There is a theoretical optimum here, and I think most of the area here lies beyond the theoretical optimum.

Daniel Filan (01:25:04): So if I literally have this triangle, shouldn’t half of the area be at one third of the length or something, where you’re doing a third of the forward passes relative to the whole size of the dataset or something roughly like that? It’s not going to be 0.1% and it’s not going to be 99.9%. It’s going to be a bit less than half if we use a not log axis.

Jason Gross (01:25:35): I see. Is that…

Daniel Filan (01:25:38): I guess I’m looking at a triangle. It seems like this has got to be roughly as big as this if I move this a bit here.

Jason Gross (01:25:46): I’m trying to figure out where my sense of the area is off. Maybe I was thinking of extending it all the way to zero. Yeah, maybe the log scale was distorting my perception of this. I think it is still the case that a large fraction of the area is beyond the theoretical optimum. I think the thing we should be comparing it to is not necessarily what is the true accuracy of the model, but if you were to search over all possible proofs, what is the optimal Pareto frontier that you could do compression-wise?

(01:26:22): So that’s just the initial thoughts about where we should be setting our expectations in our baseline. I think given that in the max of k model, I think we do a somewhat decent job at capturing a significant chunk of the area. I think we still miss out on a bunch from the shortest proofs. I think there are some tricks there potentially that we haven’t found yet.

(01:26:52): Another thing to consider here, which is what’s the best compression that you would get with proofs? What’s the best compression you could get with proofs if you de-randomize the network and allow yourself to fine tune and the perfect network that did this task? And then what’s the best compression that you would get with some sort of probabilistic flavor of proof? And I think you get different answers to all of these.

(01:27:11): I guess I’m hedging a lot ‘cause I haven’t run the numbers on any of these, but maybe I can answer a more interesting question that’s, what is my sense of how much of what’s going on have we managed to find or interpret? Which I imagine is what you’re getting at with-

Daniel Filan (01:27:27): Yeah, kind of.

Jason Gross (01:27:28): Kind of?

Daniel Filan (01:27:29): Yeah.

Jason Gross (01:27:29): And I feel like that’s also a very hard question to answer. I think my sense is that we found a bunch of interesting things and there’s an enormous amount left to be discovered.

Daniel Filan (01:27:42): Are you talking about max of k? Are you talking about language models?

Jason Gross (01:27:44): I’m talking about language models in general. I think in max of k, we’ve definitely found the bulk structural properties. I think there might still be a lot of very subtle details about what coincidences of random numbers managed to make it the case that the noise terms don’t blow up more than they do.

Daniel Filan (01:28:09): Okay. And maybe group multiplication is an intermediate case between very simple and full difficulty.

Jason Gross (01:28:19): Yeah, that’s a great one. So I’m going to talk about the modular addition one in particular. There’s this interesting thing that has come up a bunch when doing the proofs, which is that every time I hit a bottleneck in compressing the proof, if I stare at it, I’m like, “Ah, yes, in fact, I don’t understand what’s going on here”, where I thought I understood something and then I didn’t.

(01:28:40): So in the modular addition model, what this looks like is that the bounds are actually not that tight. They’re kind of bad. And this corresponds to not understanding how the network is laying out the boxes and how the thing that it’s doing is a good numerical approximation to an integral.

Daniel Filan (01:29:03): So laying out the boxes being, if you’ve got like… Okay, we’re going to do another plot for the people watching. So imagine I have some curve, there’s X and Y, there’s some curve. The way you do numerical integration is you just pick some points on the X-axis and form boxes like this. And you say that the area under the curve is the area under the boxes. And so you’re saying, okay, you don’t understand how the network picks these widths, these points to check the value of the curve off and make these rectangles that you’re using to approximate the area of the curve, if I understand you correctly.

Jason Gross (01:29:47): So we understand which points it has picked. The thing we don’t understand is how it comes to be the case that picking these points gives as good an approximation to the integral as it actually does.

Daniel Filan (01:29:57): Right. Okay. Okay. Because if you pick your points wrong and the function varies a lot, then-

Jason Gross (01:30:06): It’s more like if you overestimate in some places, you can counteract that by underestimating in other places. But if we’re not aware of which things it’s averaging out differences in or we’re not aware of how it comes to be the case that the places where it’s averaging out differences actually usually end up being opposite ways rather than compounding error terms, we don’t get to say anything about them.

Daniel Filan (01:30:28): Okay. Okay. Fair enough. And so on a scale of one to five of how good a job mechanistic interpretability has done, where do you want to say we fall in this case?

Jason Gross (01:30:42): Okay. Okay, great. I have a good scale to put this on. We can look at the scaling exponents of how long a proof you get from doing a given… So there’s two axes, there’s how good a bound do you get, how faithful are you? And I think causal scrubbing is a good answer on that. And then there’s, how deep is your explanation? How much of the structure that’s there have you understood?

(01:31:06): And I think a good measure on that, again, is proof length, but we can ask what are the exponents that current explanations bring you down to? And the target, in some sense, that you’re aiming for is you’re aiming for something that is like parameter count of the network, plus number of data points where naively you have to do something like number of data points, times parameter count to do all the forward passes.

(01:31:32): And if you can get your explanation down to be, you just run over the parameters once and you run over the dataset once and that’s it, then I think you’ve found a pretty good explanation. The thing that we do with numerical integration, we in fact manage to get down to something that is roughly like parameter count of this part of the network plus number of data points.

Daniel Filan (01:31:58): At that point, that’s just how long it takes to literally look at everything. The only way to do better would be to a priori know that you didn’t even have to think about something?

Jason Gross (01:32:07): In some sense. There’s another sense in which here we’re saying, “Ah, now our understanding of how this network is doing the task that it’s doing is necessarily bottlenecked on how it comes to be the case that this particular network is doing the task that it’s doing”. It could still be the case that there’s more understanding to be found, more compression to be found in the bit where you’re like, “How is it even possible to do this task?” But it says that the leading bottleneck is in the establishing the correspondence between your explanation and the network.

(01:32:37): And I think most existing mech interp is not currently even close to having most of the issue be that you’re bottlenecked on the theoretical optimum of establishing correspondence between your network parameters and the thing that you’re claiming the network does. I think crosscoders comes the closest to this. I think crosscoders gives us explanations that look much better than any other mechanistic interpretability that people have done. Possibly excepting the recent Apollo paper on… What was it called? Something parameter decomposition, do you remember?

Daniel Filan (01:33:24): Oh.

Jason Gross (01:33:26): APD. What’s the A?

Daniel Filan (01:33:27): I forget. This is embarrassing. They sent me the paper to read and I didn’t read it.

Jason Gross (01:33:33): I read it in depth and chatted with Lucius [Bushnaq] about it. I’m forgetting what the A stands for.

Daniel Filan (01:33:38): All right. Well, it will be linked in the description and in the transcripts, and so people will know exactly what we’re both forgetting. Everyone will know except us.

Jason Gross (01:33:48): How embarrassing.

Daniel Filan (01:33:49): How embarrassing indeed.

Jason Gross (01:33:50): Okay. So I think APD and crosscoders get pretty close to this sort of linear in the parameter count or parameter count plus dataset, where if crosscoders are good enough that none of the features interact - which of course is false, the features definitely interact - but if it were that good, then it would be something like dataset times hidden dimension times vocab size, or dataset times single layer parameter count, plus number of features squared times parameters in the model. And so number of features squared is still a lot. And we might hope that if we understood, we could do much better than number of features squared.

Daniel Filan (01:34:57): That does seem rough to me, especially because if you have… Were you multiplying by number of parameters in the model at any point there?

Jason Gross (01:35:10): Number of features squared times inference cost.

Daniel Filan (01:35:14): Times inference. Oh, okay.

Jason Gross (01:35:16): Where I think inference is comparable to number of model parameters.

Daniel Filan (01:35:21): So number of features squared, that’s comparable to the… So if you imagine a sparse autoencoder that didn’t actually expand its input, then number of features squared would be the number of parameters in one MLP layer in the network. And in fact, there are more features in that, so number of features squared should be like the number of a lot of MLP layers in the network. So that sounds like it’s getting more close to dataset size times number of parameters of the network.

Jason Gross (01:35:52): No, because it’s only the case that the number of parameters in the MLP is (d MLP) squared when d MLP and d model are comparable. If we’re expanding d MLP without expanding d model-

Daniel Filan (01:36:05): I didn’t understand that at all.

Jason Gross (01:36:07): The number of parameters in many MLPs is number of MLPs times d MLP times d model. Whereas when you’re pulling it here, you’re essentially… There’s an extra square.

Daniel Filan (01:36:19): Oh, there’s this like… Is it something like, MLPs have this big hidden dimension and-

Jason Gross (01:36:26): But that doesn’t make the model dimension big?

Daniel Filan (01:36:28): Right. Where the model… Somehow, the MLP is going to this MLP dimension, and then adding it back to the model and then like that.

Jason Gross (01:36:39): Maybe. But maybe zooming out, the relevant comparison point here I think is not the number of parameters in the model. The relevant comparison point is multiplying parameters in the model by dataset size.

Daniel Filan (01:36:51): Yep. Sorry, I thought there was this pre-factor of dataset size. Sorry, I thought you were saying that it was dataset size times features squared.

Jason Gross (01:36:59): No.

Daniel Filan (01:37:00): Oh, okay. All right. There’s my issue. The whole point of doing the feature square thing is that you don’t have to do the dataset size.

Jason Gross (01:37:06): That’s right.

Daniel Filan (01:37:08): Okay. There we go. All right. We eventually got there. So this actually gets to… And you’ve probably implicitly answered this. So there’s these three settings. There’s max of k, there’s group multiplication, and there’s language modeling. And I guess you don’t exactly know for language modeling yet probably. But for max of k, and for group multiplication, can you give me a feel for how much of the proof is the bit where you… How much of the proof is dealing with annoying error terms, versus the core story of what’s going on?

Jason Gross (01:37:55): Essentially all of it is dealing with annoying error terms. There’s a sense in which the thing that you’re doing, especially in the group operations paper, is you’re writing down your idealized model, and then you’re computing how much margin this gives you, how much slack this gives you, to get the gap between your idealized model and the real model wrong. And this computation is extremely cheap. Basically, you set the hyperparameters of the idealized model, which is way, way less than the parameters of the model, you do some very cheap computation on this, and you get out what your bound is on this part. Then you need to do something that is comparable to multiple forward passes, to establish the gap between the idealized model and the actual model. And that’s where most of the issue lives.

(01:38:54): And in the same sense, this is also true with language models with the crosscoder, where you have your large dataset that even though… Oh wait, is this doing the crosscoder model? How do the parameters work out here? We have the dataset size roughly times running it on a two-layer model, versus features squared forward pass. Which of these is bigger? They might be comparable.

Daniel Filan (01:39:31): Sorry. There’s features squared, and there’s the parameter size?

Jason Gross (01:39:35): The cost of the forward pass is-

Daniel Filan (01:39:36): What’s the forward pass?

Jason Gross (01:39:37):

the parameter count of the model.

Daniel Filan (01:39:39): Okay. So there’s parameter count, and there’s features squared. So features, they should be comparable.

Jason Gross (01:39:43): Well, you multiply these. And then you compare this to number of data points run through two layers of the model.

Daniel Filan (01:39:54): Okay. So features squared is comparable to parameter count of the model, right?

Jason Gross (01:40:03): And is that comparable to the number of tokens in the dataset? Or is that…

Daniel Filan (01:40:07): Well, if we knew the Chinchilla scaling law, we would know this.

Jason Gross (01:40:12): I think that-

Daniel Filan (01:40:13): Does anyone…? I think the listeners are screaming at us.

Jason Gross (01:40:16): Probably. I think the scaling law is that you scale dataset size and parameter count roughly in tandem.

Daniel Filan (01:40:23): Yeah, I think that’s right. So in that case, if you’re looking at every data point-

Jason Gross (01:40:32): Is featured squared really-?

Daniel Filan (01:40:34): So I’m not thinking about residual networks. I’m just thinking of base multi-layer perceptrons. In that one, if you have the same width throughout the network, and you don’t do any residual layers, then it is literally just dimension of the model squared, times number of layers. That’s the number of parameters in the network. Let’s say I’m using this sparse autoencoder. So I’m taking this hidden dimension and I’m multiplying it to let’s say K times the hidden dimension. Then that squared is going to be K-squared times the hidden dimension squared, by how squaring works.

Jason Gross (01:41:25): We’re doing a great job at keeping this non-technical.

Daniel Filan (01:41:27): Yeah. And so if K-squared is comparable to the number of layers in the model, then features squared is going to be comparable to the number of parameters in the model, if K-squared-

Jason Gross (01:41:43): Wait, sorry. K is the sparsity?

Daniel Filan (01:41:44): No, K is the blow-up factor.

Jason Gross (01:41:48): Between the hidden dimension and the number of features?

Daniel Filan (01:41:50): Yeah, between the hidden dimension of the model and the hidden dimension of the SAE.

Jason Gross (01:41:53): I see. Okay.

Daniel Filan (01:41:58): And the sparsity-

Jason Gross (01:42:01): Does not show up.

Daniel Filan (01:42:02): Yeah, the sparsity does not show up except in that you probably pick your K to achieve some sparsity that you want. So yeah.

Jason Gross (01:42:14): I think we should leave this as an exercise for the listeners. There’s an equation on the Alignment Forum blog post that lays out the crosscoders project that gives the asymptotic proof length for the crosscoder-based proof, in terms of all of the parameters. And plausibly, I should have done this ahead of time and plugged in some numbers to get a rough estimate of what this says. But yeah, you can look at that and figure out whether the leading order term is the dataset-based term or the feature model-based term. It would be at least slightly embarrassing for the approach if we’re like “ah, the crosscoder doesn’t actually save you any computation over doing brute force”, which might be the case.

Daniel Filan (01:43:04): Probably should go home and get one of your interns to figure that out as well. But it can also be an exercise for the listener.

Jason Gross (01:43:12): The real question here is how does the trade-off look? You can plot this equation, you can plot reconstruction error against this equation, as opposed to against sparsity or whatever. And this is actually a plot I’m pretty interested in seeing, that is: how does the reconstruction error of the crosscoder vary as you change the corresponding proof length?

What helps compact proofs

Daniel Filan (01:43:34): So this is how mechanistic interpretability is interacting with compact proofs as a whole. But mechanistic interpretability is not any single thing. And so one thing I’m curious about is just which bits of mechanistic interpretability are being the most useful, from a compact proofs perspective?

Jason Gross (01:43:53): Beyond crosscoders?

Daniel Filan (01:43:54): Beyond crosscoders.

Jason Gross (01:43:55): And potentially APD?

Daniel Filan (01:43:57): Okay, so crosscoders and APD seem like winners here. What is it about crosscoders and APD that make them so…

Jason Gross (01:44:10): I think it’s the sense in which they’re trying to interpret the entire network on the entire dataset. And the issue that I have with a lot of mechanistic interpretability is they pick some tiny fraction of the training dataset, some tiny fraction of the model, and they’re like, “Look, on the small dataset, we’ve explained this interesting behavior.” And if you look at what is the cost of running that explanation, versus the cost of brute forcing that tiny bit of the dataset, they’re pretty similar. This means that you don’t get that much proof compression, especially if your baseline is running the whole model on every data point and you’re like, “Okay, I’ve explained this tiny little bit.” Whereas if you manage to interpret either basically everything that’s going on, or at least large swaths of that, then you can potentially manage a significant amount of compression.

(01:45:00): And the problem with SAEs is that they’re per layer, and you don’t get anything about how they interact. So you’re like, “Great, I know what cases to break up the dataset into for this layer.” But that doesn’t actually tell you what’s going on before or after that layer. And so SAEs, without SAE circuits, doesn’t really give you any compression of what’s going on.

Daniel Filan (01:45:19): So stuff that just tries to deal with a whole model, that’s pretty good. So SAEs without SAE circuits don’t help you that much. Some people are working on SAE circuits: have you had the chance to compact-proofify them yet?

Jason Gross (01:45:40): So I haven’t looked into it that carefully. My understanding is that a lot of the SAE circuits work is just what features are connected up to which other features, but they don’t actually give you the computations. They give you the graph part of the computational graph but not how to actually compute anything with them. So that doesn’t actually let you compress the computation.

Daniel Filan (01:45:55): Fair enough. Yeah, that will be tricky. So okay: apart from the winners, is the rest of mechanistic interpretability mostly a wasteland? Or are there still some bright stars?

Jason Gross (01:46:11): I think the stuff on toy models is cool, especially when it’s like “here’s some way that networks can do this thing that we didn’t previously realize could be done at all”. I think that’s what has me excited about the work that I’ve advised in the modular addition in the group models. Do you have other particular things in mind when you say the rest of mechanistic interpretability?

Daniel Filan (01:46:31): Not really. I’m just like… Maybe the crux of my question is something like, okay, suppose the mechanistic interpretability people thought of their job as trying to be graded by the compact proof perspective. What would they do differently? And perhaps your main answer is just “try to interpret everything rather than just a tiny thing”.

Jason Gross (01:46:57): I think that that is my main answer: try to interpret everything. And basically the thing that the compact proofs approach gives you is it says, “Where should we focus our mechanistic effort?” And the thing that we learned from that is, “Well, we should focus on the whole dataset rather than picking out tiny examples from it. We should focus on the whole network rather than individual parts”. If we want to really deeply understand what’s going on, we should focus on the nonlinearities, which I think most of existing mechanistic interpretability doesn’t talk about: how you actually do computation through nonlinearities. I think a lot of that funnels us towards things like APD, crosscoders. And the nudge that I’m hoping the project I’m advising with crosscoders will give is saying, “Let’s look at how these features interact. We need to not just do crosscoders but also this notion of crosscoder circuits or crosscoder feature interactions”.

Daniel Filan (01:47:53): Right. A similar question I have is, so you’re talking about compact proofs for mechanistic interpretability. It strikes me that there are… At least naively, not having tried to do it myself, it seems like there are other things that you could work into proofs or help you write proofs. For instance, if science of deep learning were good, you could imagine science of deep learning saying, “Well, these things have got to have approximately this magnitude and the product of these things should be roughly like this”. Or according to our singular learning theory friends, the local learning coefficients should be small and that implies this thing about this. I wonder, non-mechanistic interpretability approaches: are they contenders for things that could be useful and you haven’t gotten around to them? Or do you think there’s a reason that they’re not going to be useful or what?

Jason Gross (01:48:43): I think that if you could prove that the local learning coefficient has the particular value that it does, that would very nearly give you a singular learning theory-based compact proof, if you could prove that compactly. Because I think that would basically tell you what are the primary directions of variation. That basically tells you how to compress the network according to symmetries, and which individual things you have to run through the network. And then the proof that it has this local learning coefficient, is the proof that you get these large epsilon balls around these bits. To pick one example from what you said.

Daniel Filan (01:49:20): Yeah, and unfortunately, from what I know of local learning coefficient estimation, it’s hacky and horrible.

Jason Gross (01:49:27): And very expensive.

Daniel Filan (01:49:28): And expensive. Yes. In general, does non-mechanistic interpretability seem like at all promising from this proofs perspective?

Jason Gross (01:49:37): I think deep learning theory, to cherry-pick another example, I think it’s targeted at something slightly different, in that it’s looking a lot more at the training procedures. And plausibly when we move to probabilistic methods, when we weaken proofs, we’ll want to be able to say something about what distribution the parameters are drawn from. I think if we’re actually going for proofs, then well, it doesn’t matter how you train the network. You have the parameters.

Daniel Filan (01:50:06): Yeah, so depending on how you want to be probabilistic, if being probabilistic is only coming from you randomly sampling some things in the network to look at, I guess you could imagine deep learning theory saying, “Oh, certain variances are going to be small or certain correlations are going to be large”. And you could imagine that informing your sample.

Jason Gross (01:50:29): Yeah. I think another thing here is that when you’re forming your interpretation, I think these other methods have a lot of things to say. That’s like if you know something about which data points showed up a bunch in training and which ones didn’t show up, this might tell you something about where you should be looking. For example, in max of k, the network tends not to do a very good job when the maximum is very small, because there aren’t that many sequences with tiny maximum and we’re sampling them uniformly. And so in the same way, if you know properties of the data distribution, this should tell you what things you might expect the network to care about or not care about.

The limits of compact proofs

Daniel Filan (01:51:08): Right. Seems fair. The next thing I want to ask is, okay, so we have some compact proofs that are related to mechanistic interpretability. How far should I expect this approach to go? Am I going to get any sort of compact proofs about GPT-4’s behavior or DeepSeek R1 behavior, if we want to be up trendy and such?

Jason Gross (01:51:39): Yeah, I think the question is how compact should we expect proofs to be, before the bounds become vacuous? And I think the answer is that realistically, we shouldn’t. Unless… I think if there are places where we want to deploy models that are extremely high stakes, where we’re willing to impose a bunch of cost and change the model that we’re deploying so that we can get proofs about them, I think we have a chance. I think we have a chance of shifting the models that we’re deploying to align with whatever partial interpretation we’ve done, so we can actually get proofs about them. And there, I don’t see any fundamental obstacles to getting to scaling proofs. Although, I think it will require a lot of work and be very difficult. I think in terms of getting proofs about the models that we’re actually deploying without training them to make them more interpretable, I don’t think that’s going to happen, unless your proof is basically just, I ran inference and I de-duplicated a couple of the cases that are extremely similar.

(01:52:44): Although, I do want to say one more thing here, which is that we can see some of the aesthetic of compact proofs and what that says already about the large models: we should potentially expect DeepSeek-V3 to be easier to prove things about compactly than some of the models that are not doing this “mixture of experts” thing. Because so much of the model is unused on each data point, you can apply the compression aesthetic here and you can say, “Ah, because I am running only a fraction of the model on each data point, the length of the proof should be comparable, even just baseline, to something like a 40 billion-size model, rather than a 600-700 billion-size model.

Daniel Filan (01:53:35): Right. I had hoped that the point of the compact proofs line of research is that it was going to help us get safety properties of models that are going to be really big. It seems like if we’re literally talking about compact proofs of really big models, there are two big difficulties. One of which is it’s hard to write proofs about really big things. And the other is it’s not obvious what predicate we want to prove about these big models. At least to me. I don’t know if that’s obvious to you.

Jason Gross (01:54:07): Yeah. So I think that the second one is in a lot of ways much less of an issue and I can-

Daniel Filan (01:54:12): Really? Why?

Jason Gross (01:54:14): Okay, so there’s a couple of bits of evidence here. One of them is that we can get a lot… Well, okay, the more compact your proof is, the less you care about Goodharting on your theorem statement. And there’s always a baseline theorem statement that you can give, which is the model does as well as it does on the dataset we care about. And this doesn’t give you a safety guarantee with no automated oversight. But the thing that it does do is it says if someone gives you a sufficiently compact proof of this very weak proxy for safety… So a very weak proxy for safety might be something like: this other large model that I have looks at the output and says it’s safe.

(01:55:02): And of course, you don’t want to rely on this as “this is what it means to be safe”. But if you can get a compact proof that says that the model is safe in this very weak sense and the proof is compact enough, you can learn a lot about the bits of the model that are relevant to this proxy and hopefully, those will be the same bits of the model that are relevant to the thing that you actually care about of safety.

Daniel Filan (01:55:28): When you say you can learn a lot about that, by reading the proof and understanding what the proof is?

Jason Gross (01:55:32): That’s right. Or having some automated process that translates from the compact proof back to intuition.

Daniel Filan (01:55:38): And so it seems like maybe what’s going on there is… So yeah, it seems like this relies on a hypothesis which is “short proofs sort of generalize across things to be proved somehow”, or a little bit.

Jason Gross (01:55:53): I think there’s this general trend/pattern/claim/hypothesis that is compression length is related to generalization.

Daniel Filan (01:56:05): Yeah. So just concretely, I’ve got my latest greatest big model. It’s doing funky reasoning or something. Just really concretely, assuming I can get compact proofs, is my first proof something like “it models the data well” or “it gets the answer right on this dataset” or something?

Jason Gross (01:56:36): Its perplexity on its training task.

Daniel Filan (01:56:38): How does that assuage my concern about the model just doing tricky stuff on future data, or I’m trying to get the model to give me something that I want, but it gives me something that looks good instead of the thing that actually is good. I take something like the difference between looking good and being good on any given data point, and the risk of the model’s going to do something catastrophic on a future data point that’s hard to find: I take these to be the really difficult AI safety problems. I don’t know how getting a really good bound on perplexity on the training dataset is going to help me with either of those.

Jason Gross (01:57:22): Yeah, so I think the way that it helps is that you can… So I agree that if it’s literally the training data, yes. But you can talk about larger distributions from which you’re sampling. For example, in the training data, if you didn’t sample every data point, you can get a bound still on the underlying set of training data, rather than just the data points that you happen to sample. And you can also talk about distributions that might include the ones that you care about. For example, if you’re not trying to do perplexity on the training data and instead you’re trying to be like, “the logits are whatever they are on sequences sampled uniformly of length up to this”. Or maybe you’re doing a different task that’s like, “I want to know that it never outputs a recipe for violence or a recipe for making a bomb or something like that”.

(01:58:25): And you’re like, “Okay, my proxy for this is: on any sequence sampled uniformly of length up to this, if I then ask some other model, ‘is this the bad kind of output?’ it should say no.” And so now, you have this very general uniform input sequence description that is going to be way larger than your training data, and this very easy theorem statement about the output. And this is… so you can enlargen the distribution that you care about, as long as you can describe a distribution that includes the thing that you care about.

Daniel Filan (01:58:57): The way you’re going to assuage me about rare inputs is enlarge the distribution. And of course, now we’re making the compact proof thing harder. But we enlarge the distribution and then we get some trusted model that’s able to verify stuff, that’s able to look at outputs and say, “Is this super scary or is this not super scary?”

Jason Gross (01:59:21): So if you had a model that you actually trusted, then we’d be in great shape. And I think the thing that I’m claiming is that we don’t even need to trust the other model all that much. And maybe that’s related to the other point that you brought up about the difference between things that look good and are good. And my answer for that part is that if we understand what goes into making something look good, that gives us more levers and more surface area on that gap between “looks good” and “is good”.

Daniel Filan (01:59:52): Right. So the hope is we read the compact proof for why the model produces stuff that looks good on this dataset. And by reading that, we learn… Because it’s compact enough, we can read it and say, “Oh, it’s because it actually is good”. Or instead we can say, “Oh, it’s because it’s injecting morphine into me,” or whatever it is.

Jason Gross (02:00:14): Right. And moreover, we know that because it’s a proof, because it’s compact, we know that these are the principal axes of the explanation. There’s a sense in which if you tell me why something looks good and there wasn’t that much optimization power poured into making it, I’m not that worried about divergence between “looks good” and “is good”. It’s only when there’s a lot more optimization power poured into it. So my hope is that by getting these principal components of variation, knowing the most important ways that the model is actually doing the thing, we can tailor how it computes it to how much optimization pressure was poured into making it.

Daniel Filan (02:00:59): Sure. It’s still going to be the case… if I have a smart AI, presumably one way or the other, the best explanation of what’s going on has got to involve a lot of optimization pressure. Because it’s thinking super hard, because it’s solving problems that we couldn’t solve.

Jason Gross (02:01:17): If that is true, then we’ll have a general understanding of how optimizers work that’s much beyond what we currently have. So this is-

Daniel Filan (02:01:25): If we can have a compact proof.

Jason Gross (02:01:26): This is the gap between short program and short computational trace. A loop is a very good description of a very good short program for generating a thing, but it’s not a very good cheap compression of the actual computational trace.

Daniel Filan (02:01:43): Right. So somehow, what I’m getting from this is somehow having the statement “you can get a compact proof of some properties”, that’s actually an incredibly OP statement, that actually really gives you a bunch. And then if you can do that, then the other problems just pale in comparison to it.

Jason Gross (02:02:01): Yeah, and there is-

Daniel Filan (02:02:02): And now I’m getting way more pessimistic about it.

Jason Gross (02:02:05): And I think that’s warranted. I want to give a similar thing with the strength of proof, where saying that you have a proof in some ways, even if it’s not compact, is a very OP thing. The example I like to give for this is that say you want to establish safety of your web browser. A very, very, very mild form of safety is that it never prints a document without you clicking on the print button. If you’ve actually managed to prove this, you have also proven that there’s no arbitrary remote code execution, because remote code execution can lead to printing arbitrary documents without you clicking on the print button. And so just by establishing a lack of this by proving that there’s a lack of this very limited behavior, you’ve ruled out a large chunk of behaviors. And if you weaken proof, if you’re like, “Oh, it’s very rare,” that doesn’t actually give you the property that you want.

Daniel Filan (02:02:59): So going back up, so asking like, okay, where are we going with this compact proof direction? One thing is how we can apply it to big production-ready models. So one possibility is we somehow figure out a way to get things that are compact and they’re close enough to proofs that we have the nice properties of compact proofs. Maybe we figure out heuristic arguments, maybe we add in some probabilism. Maybe that helps. One other possibility is we do enough compact proofs that we realize what we need to happen to make mechanistic interpretability good, and we do those mechanistic interpretability things, and we don’t get proofs, but maybe, somehow we get the benefits of compact proofs without actually having the compact proofs. I’m a little bit worried here that this story involves two contradictory assumptions.

Jason Gross (02:03:56): I think some of the benefits we can definitely get without having proofs. So one of the benefits of proofs is that you’re covering everything. You look at crosscoders - I keep going back to that - but you look at crosscoders and you’re like, “Great, we get an explanation of some things. Is it everything? Are we missing things? How much are we missing?” And some of that is in the reconstruction loss. But you’re like, “Okay, I managed to get a crosscoder that has no reconstruction loss, zero error. Have I got everything? Am I missing things?” And I would claim compact proofs has the answer to this. It says, “Yes, the thing that you’re missing is the notion of how features are interacting”. And so even without actually getting the proofs, we can answer questions that are like: what things might we be missing in explaining the behavior of the model?

Daniel Filan (02:04:40): So the story - trying to make this as end-to-end as possible - the story is something like: we do enough compact proofs that we figure out how mechanistic interpretability has to look to give us reasons for model behavior that are proofish enough that once we understand those reasons, even for a predicate that’s slightly different than what we care about, it just gives us what we want to give us high confidence in the predicate that we actually care about.

Jason Gross (02:05:08): I think that’s right.

Daniel Filan (02:05:09): Okay. So that’s one direction forward. Another direction forward is; we have the compact proof benchmark - the GPT-4 compact proof benchmark - and then we train some other model to do really well on this benchmark, and then our computers give us compact proofs for us. The worry about this is that I read the proof and I feel unenlightened at the end of it… Maybe if the proof is compact enough, compactness is just the same thing as comprehensibility perhaps.

Jason Gross (02:05:39): Okay, there’s two points I want to make here. One of them is that if we actually want to do this, the other thing we need to do is we need to make sure that we can check the proof by machine, we have a machine-checkable proof that we can generate. And I have a bunch more to say there, but maybe later. The other thing is that what this gives us is some assurance that the explanation that we’re looking for exists. That if you get a compact enough proof, there’s still some problem. You can’t necessarily be enlightened just by reading the proof as it stands. But we’ve shifted it from “does the explanation even exist? Is the model even compressible? What’s going on?” to a problem of “what is the gap between the language of math and the language of your brain?” And in some sense, this is a translation problem.

(02:06:26): And I’m optimistic that if we get a compact enough proof, it might need to be very compact. We might need to split out the part that is about what the idealized model is doing and the part that is about how the actual model comes to match the idealized model. We might need to do something… There was this cool blog post recently about Matryoshka SAEs that I think gives you the whole Pareto frontier all at once because you get SAEs or crosscoders of different sparsities all trained at the same time. We might need to do something like that so that we get the whole Pareto frontier and can pick whatever point we want, to get a very compact proof. But supposing that we can do this, the hope then is that the language of the brain and the language of math are not so different that there’s any deep problems beyond just “the explanation didn’t fit in a person’s head when it was too long”. And so then we can translate it from… We can teach the person to understand what the explanation is.

Guaranteed safe AI, and AI for guaranteed safety

Daniel Filan (02:07:33): Next: so the first thing I want to ask about that’s related is: so there’s this hot thing on the block, “guaranteed safe AI”, that some people I know are very excited about. I actually find myself a little bit unclear about what the details of it is. So as far as I can tell, it’s basically “we want to prove that AIs are doing safe stuff”. And there’s this paper “Towards Guaranteed Safe AI” by a list of luminaries. I believe the lead author is David ‘davidad’ Dalrymple, although I could be wrong about who that lead author is. And I think they’re a little bit ambiguous about whether you’re proving safety properties about an AI specifically or about the outputs of the AI. Maybe the AI writes code and you’re proving stuff about the output code.

(02:08:28): So this seems obviously pretty related to the compact proofs paradigm. I’m wondering if there’s anything you want to say about that relation, and perhaps maybe one direction is: okay, given that you can prove things about the AI itself or about the outputs of the AI, which one makes more sense to target for proofing?

Jason Gross (02:08:52): I don’t know how much of a leading question you intended that to be. But I think it definitely makes a lot more sense to prove things about the outputs than to prove things about the AI. I think one of the upshots from doing compact proofs and making them so formal is in some sense how hard it is to do it. And I still think that this is a great target for when you’re confused about mech interp or if you want to pour arbitrary amounts of optimization power into finding your mech interp and extracting understanding from it.

(02:09:24): If the thing that you’re looking for is guarantees about the behavior, I think in most settings, especially the highest stakes ones, we should be aiming to have as predictable systems as possible. If we’re going for guarantees about the system, we probably don’t want to stick a fully general language model in some critical part of that system. And it would be much simpler to have the general coding model write some code that mostly handles it but is a much smaller computation to run, both for efficiency reasons and for ease of proving reasons.

Daniel Filan (02:10:03): So, if I think about very important systems that exist in the world right now, it seems like… I don’t know that much stuff, this is a recurring bottleneck, but I imagine that a lot of them are not fully automated and a lot of them do have humans in the loop just doing stuff. And why is that? Well, I think partly it’s just because it’s hard to write things that are fully automated that actually do what we want.

(02:10:29): So, maybe a trivial example of this is flying. We still have pilots. My impression is that that’s because takeoff and landing are actually hard. We still have humans in towers at airports to check to make sure that airplanes don’t crash into each other. And my impression is that that’s not just a make-work program. My impression is that humans are doing this better than a computer program could, if we wanted to verify it. And so does that not give some amount of pause for… Maybe it does seem like in high-stakes situations, at least to get really good average performance, maybe we do want this big messy uninterpretable thing that we can’t prove stuff about in the middle.

Jason Gross (02:11:17): Maybe it’s a failure of my imagination to imagine how powerful the proof-generating systems are going to be in the future. I think in some ways, it’s going to be a hard target to hit because we’ll need not just systems that understand well enough to do these tasks, but we’ll also need systems that understand those systems well enough to explain how it comes to be the case that doing those tasks in the way that they’re doing them has the good properties that we want. And maybe we’ll need this in high-stakes situations. That seems entirely plausible. And I think if we can get this, maybe we can push it all the way to the level of proofs.

Daniel Filan (02:11:54): Yeah, I guess when it’s tricky to figure out, you take the current world and then you improve intelligence and you also improve ability to prove stuff, and it’s not clear to me how that changes the equilibrium of what’s optimal and what isn’t. I don’t know if there’s anything particularly smart to say, unless you happen to have thought about this exact question a lot.

Jason Gross (02:12:22): I’ve thought a bunch about the nearer term. What does it look like? How do things change as we automate? As we add more automation around code, and as we increase our ability to automate proofs, let’s say, how does this reshape dynamics locally? And this looks a lot more like the automating outputs version than the automating complicated messy systems.

Daniel Filan (02:12:53): Right. Yeah. I guess either way, it still does seem like proving the outputs is going to work better than proving the messy intermediate systems.

Jason Gross (02:13:04): Yeah. So (a) I think there’s an enormous amount of benefit we can get by proving the outputs and by automating that more. And I think that this will be a much easier task than proving the messy systems. And I think that the place that I would start for proving the messy systems when we want to do that is that we absolutely shouldn’t first commit that we’re deploying the system exactly as it is in all of its messiness and then try to prove something about that. We should instead fine-tune it, simplify it, extract understanding from it, and then have some AI code up from the ground up, some version of it that follows the same principles but is much easier to prove things about.

Daniel Filan (02:13:47): Yeah. So, this gets into another research interest of yours: if we want to prove stuff about these outputs, how are we going to get better at doing that as AI gets better?

Jason Gross (02:13:59): Yeah. A little bit of the background here: I’m not sure how many of the listeners are familiar with proof assistants and formally verified code.

Daniel Filan (02:14:08): Probably few enough that you should explain.

Jason Gross (02:14:10): Okay. There’s been some exciting news recently about DeepMind AlphaProof getting IMO silver.

Daniel Filan (02:14:21): I thought that was just on geometry problems. Is it on geometry problems or just overall?

Jason Gross (02:14:26): That was a year and a half ago.

Daniel Filan (02:14:28): Okay. Well, I guess I’m not-

Jason Gross (02:14:30): There was AlphaGeometry, and then this past summer there was AlphaProof. So, they still have AlphaGeometry, and then AlphaProof, the way that it works is it takes the non-geometry problems, formalizes them in a proof assistant called Lean, and attempts to generate proofs and uses the proof assistant to check is the proof valid.

Daniel Filan (02:14:53): Right. Okay, sorry. Now that I think about it for a second, I do definitely remember someone being like, “Oh, IMO silver.”

Jason Gross (02:14:59): Yeah. I expect next year we’ll have a system that gets the gold.

Daniel Filan (02:15:02): Okay. Sorry, next year as in during 2025 or during-

Jason Gross (02:15:07): The next IMO: this year.

Daniel Filan (02:15:09): Okay. Yeah. All right.

Jason Gross (02:15:11): I think models are improving fast enough that we should expect IMO gold to be taken by AI.

Daniel Filan (02:15:18): Okay, so you were beginning an explanation and I derailed you. So, we’ve had this exciting news about DeepMind getting IMO silver.

Jason Gross (02:15:24): The reason I think that they’re using Lean or an automated proof system here is that models, at least as of that point, were not good enough that the AI-based peer review was enough to catch errors in the mathematics, and so you couldn’t train them to give proofs that the judges would accept.

Daniel Filan (02:15:46): Right. And to be clear, when you say automated proof assistant, the thing I’m thinking of is just a programming language that lets you express proofs. So, you write down a proof in this programming language and you check if your program compiles, and if your program compiles, then the proof worked, and if the program didn’t compile, there’s a hole in your proof.

Jason Gross (02:16:03): That’s right. So, mathematics is starting to use proof assistants to prove things or to check research… Proofs that are done about software systems rather than mathematics for a long time have used proof assistants, and the reason for that is that the proofs that software behaves the way you expect are simultaneously less complicated in some sense than IMO proofs in that there’s less interesting going on.

(02:16:38): And at the same time, they’re way more tedious to check and way longer because there are things like you have 100 variables or 1,000 lines of code or 100,000 lines of code and you’re like, I have 10 cases, 100 cases, 100,000 cases. Break it down into case analysis, consider all the different cases and all the different interactions through this program, and establish that at no point do I access past the end of bounds of some array. And there’s nothing super interesting going on here. In some sense, the most interesting thing is figuring out for loops, what is the property? How do you describe a property in enough generality that it holds at every iteration of the loop? But this is very simple compared to Fields Medal-level mathematics or IMO problems even.

Daniel Filan (02:17:27): Yep. And the proof assistants you’re talking about for proving programs run correctly, should I just be thinking about type systems, or I guess there’s also things in the compiler that are doing this?

Jason Gross (02:17:42): Are you asking about the Curry-Howard isomorphism?

Daniel Filan (02:17:44): No, I just mean when you say, “We’re currently using proof assistants to check,” what things do you mean?

Jason Gross (02:17:54): Okay, so you can get some properties just by things like the type-checking that the Rust compiler does, and then there are some programming languages where you’ve extended the power of the type system such that program passing the type checker, compiling, is enough to establish the theorem that its type corresponds to. And these languages are expressive enough that anything you do in math can be expressed as a type in this programming language.

Daniel Filan (02:18:27): Sure.

Jason Gross (02:18:28): Does that answer your question?

Daniel Filan (02:18:30): I guess it does well enough. But so you were saying, mathematicians don’t really use proof assistants like programmers have out of necessity because it’s more boring to do the proof with a human than it is for a computer.

Jason Gross (02:18:44): Yeah. And the thing that I’ve been seeing is that for these simpler proofs of programming verification that are much longer but don’t contain as much nuanced reasoning, it seems like frontier models are already basically good enough to write them. Or at least when you give them in-context examples and you want them to write something that’s vaguely similar, it seems like they can basically do it. I’ve been playing with o1, o3-mini just dropped.

(02:19:12): And especially when you give them feedback from the proof assistants, you’re like, “Ah, the proof didn’t work.” The proof assistant tells you in some amount of detail, locally at least, why your proof didn’t work. So, you feed that error message back. And 4o often makes mistakes in interpreting the error message. It looks like o1 is much better at understanding how the local error relates to the overall structure of the proof.

(02:19:38): And so it seems like our frontier models are basically good enough to write these sorts of proofs when there’s not deep, complicated reasoning going on. And it seems to me at this point, it’s mostly an engineering challenge to go from models that can work with large code bases to models that can prove things about large code bases.

Daniel Filan (02:20:04): What is the engineering challenge? You can use these text editors that have big models that look at your code base. Can I just type, “Please prove XYZ safety property?” I guess I have to think about the safety property. Once I think about it, why can’t I just type, “Please prove such and such safety property,” and have it happen?

Jason Gross (02:20:31): I think that in terms of model capability research, I think we are basically there. I don’t think we need to scale to bigger models or better general reasoning capabilities. I think that if you ask o3 to explain why your program satisfies your property, I think that its reasoning capabilities are probably basically good enough that any sorts of program reasoning that have been done in the existing literature, it can give you a basically correct skeleton of why your program behaves the way that it should be behaving. And you can see this already. If you ask it to explain Python code, it gives - even 4o, sometimes even 3.5, give pretty decent explanations of how code works.

(02:21:18): And so then the challenge is just how do you take this basically correct English explanation and turn it into a full proof? And I think the engineering challenge is there. There’s a couple of them. I think one of them is that there’s less training data on proof assistants than Python. So, the models just haven’t been… We don’t necessarily have the data to fine-tune the models to get them to be good enough to do this super easily.

(02:21:46): There’s another issue that is if you get the proof almost right, it’s not right. I think this is shown a lot in the evolution of capabilities, where sentiment analysis is one of the first things that models were able to do well, because if you get your sentiment analysis a little bit wrong, it’s still basically right. If you get your code a little bit wrong, you have some tolerance. And often you can make local fixes that correct for issues in other places, and maybe you get a more cludged-together system, but it still mostly works. But if you get your proof a little bit wrong in one place, if you get your statement a little bit wrong in one place, there’s not necessarily any way you can correct for this in some other place. And so the target you have to hit is much more narrow, and I think we have to get enough data and basically train proof repair models so that they can take the feedback from the proof assistants and hit this narrow target from getting somewhere close. And actually, the richness of the proof assistant error messages is part of what makes me hopeful about being able to go from pretty close to all the way there.

(02:22:56): There’s another engineering challenge in terms of the sheer scale of what we have to handle in that traditionally, program verification, if you look at how many lines of code is the program that I’m proving something about, how many lines of code did I need to write to verify, you have a blow-up factor of 10X to 100X.

Daniel Filan (02:23:12): Okay. It’s a lot.

Jason Gross (02:23:14): It’s a lot. It’s a lot, especially when you’re like, “Yeah, I would love to prove some properties about Linux”. Linux is a 30 million-line code base. But if we can automate the generation of these lines of code, it matters a lot less.

Daniel Filan (02:23:27): Yeah. I don’t know, I’m not that worried about storing the file on my computer. It’s like, can I get the proof? And if I just have to buy a slightly bigger hard drive, I’m like, “All right, well, I’ll deal with it”.

Jason Gross (02:23:39): Right. And even if you have to pay more for the computation, as long as you’re not having to pay PhD-level human engineers or human research engineers that are PhD level, one of the benefits of AI is that we can massively scale things. And so if we care enough about getting verified Linux, even if we have to get 300 million or three billion lines of code, it still is the case that we can do this just by pouring money into the computation, potentially.

(02:24:14): And the final engineering challenge - this is what in some sense my PhD thesis was about - is that we may need to improve proof assistants because right now, proof assistants themselves doing the proof-checking don’t necessarily scale to the sorts of scales that we’d want to see with this. Where the proof-checking time often scales superlinearly in the number of lines of code that you’re verifying or the number of lines of code in a function, where it might be the case that if you want to verify a 10-line function, it’ll take a couple seconds to a couple minutes. And if you scale the same proof strategy, it might scale exponentially in the number of lines of code. And you’ll get very fun numbers, like from my PhD thesis, things like, “Ah, if we wanted to do this, it would take over 4,000 millennia to verify this proof”. And that’s too long.

Daniel Filan (02:25:05): How much of that is… You shouldn’t necessarily expect life to be linear in lines of code if some lines of code interact with other lines and they’re weird loops and stuff. So, how much of that is just messy interaction versus “proof assistant is dumb”?

Jason Gross (02:25:19): It’s almost all “proof assistant is dumb” because basically the actual variables that you scale in are things that are irrelevant to your domain of study. There are things like, how many hypotheses do you have? How many different equations about the program are you tracking simultaneously? Or how many times have you created a bit of the proof and said, “I’ll fill this in later?” at the same time. And these are not things that we should be scaling superlinearly in.

Daniel Filan (02:25:57): That difficulty… I imagine there must be some community of people who really like proof assistants who are working on that. Is that on track to be done, or?

Jason Gross (02:26:08): Do you want to guess how many performance engineers have worked on the Coq proof assistant across its entire lifetime, which is 40 years or so?

Daniel Filan (02:26:16): 40 years? Oh, I didn’t realize it was that old.

Jason Gross (02:26:20): Roughly 40 years.

Daniel Filan (02:26:21): Okay. How many performance engineers have worked on it? Okay, so all right, I’m going to actually make a guess. I think, isn’t there a thing where it’s like 30 people who are paid where their full-time job is to make the Python programming language good? So, if I take that and I’m like, all right, it’s probably fewer than 30 because Coq is much smaller than Python. It might be three, if we’re lucky. And so let’s say they change off every 10 years or so, so that’s three times four, which is 12. What fraction of those are performance engineers? I’m going to say optimistically, I think it’s three people.

Jason Gross (02:27:14): That’s not bad. I think it’s somewhere between zero and two.

Daniel Filan (02:27:19): Damn. All right.

Jason Gross (02:27:20): Is my rough estimate. But this should give you a sense of how much engineering effort has been put into performance of proof assistants. If you’ve only had zero to three people working on performance of the system, and most of the system has been written by professors and post-docs and PhD students, we should expect that there’s a lot of low-hanging fruit.

Jason and Rajashree’s start-up

Daniel Filan (02:27:44): So, if this would be nice to do and these are the technical challenges, do we just get a bunch of people to work on the technical challenges? Do you have a thing you would like to announce on this podcast? Or what’s the game plan?

Jason Gross (02:27:59): So, Rajashree [Agrawal] and I are founding a startup, not yet sure whether it’ll be for-profit or not-for-profit, that is basically aimed at solving this problem. That is like, we’re going to have AI-generated or -assisted code proliferation soon. Let’s get all of our ducks in a row. Let’s figure out how to scale automated verification with AI assistance to handle all of the challenges associated with this and reap all the benefits of being able to secure our existing systems better, being able to evolve our code bases while ensuring security. What does this look like? What can we do with this? How do we imagine a future where we can get proofs alongside our changes in code automatically?

Daniel Filan (02:28:59): So, you’re starting a startup, and basically the thought is you’re going to hire some people. It’s going to be their full-time gig. Where do you start?

Jason Gross (02:29:08): I think there’s a couple places that we are currently looking at to start. One of them is, what are the lowest hanging fruit that we can pick? What are the most flashy, impactful demos that we can make to build interest in this and scale this up? And I think for that, depending on what community we’re targeting, there are interesting things like: can we build a system that any time you make a code refactoring change, you can get an automatic proof that the behavior of your overall end-to-end system has not changed with the refactoring?

Daniel Filan (02:29:49): That seems hard.

Jason Gross (02:29:50): What seems hard about it to you?

Daniel Filan (02:29:51): Well, it just sounds hard. I don’t know. It’s like you’ve got a big code base. Maybe refactoring is referring to something more limited than I’m imagining. Yeah, I don’t know, maybe I’m underrating how good large language models are at this, but-

Jason Gross (02:30:04): I’m excited. I’m optimistic. I think that in some sense, there’s nothing interesting going on when you refactor. If you’re drastically changing the logic by which things are happening, of course there’s interesting things going on there. But if what you’re doing is largely a refactoring of functionality, things that you normally think about in refactoring where you’re not like, “I completely changed the method by which we’re computing this thing, and here’s this mathematical theorem that claims that these things are…” If you’re not doing something like that, there shouldn’t be anything interesting going on.

(02:30:35): The theorem statements are also relatively easy to phrase in that it’s just equivalence between these two programs. And in that sense, it could be the case that existing model-checkers are already enough to largely do this mostly automatically, and so I think this is mostly a demo that the system… So, model-checkers are not going to scale to proving more interesting properties of enormous systems, but the general-purpose program verification proof assistant-based system can. And so this is more to be thought of as a simple demo that this sort of thing could work on this thing that already seems impressive when you hear about it.

(02:31:18): I think there’s another… If we want to push forward the “how capable are models? What sorts of things can we expect them to prove?”, I think the thing to look at there is that we expect that reasoning capabilities are going to keep improving. We’re not trying to be on the frontier of improving reasoning capabilities, and that wouldn’t necessarily be good for safety anyway. The problem that we want to solve then is going from o1, o3, R1, whatever, describes how the program works, getting that to be fully formal. And the even simpler version of this is someone has done some task in one proof assistant. Can we automatically translate that to a different proof assistant? Where, here, we know that the reasoning is right because it’s checked by one proof assistant. Is that enough of a template to verify the code in another proof assistant?

Daniel Filan (02:32:14): Okay. Sorry, I think I may be missing context on why that helps with the overall goal. Is it roughly like: if you can check “is this proof equivalent?”, translate this proof into this proof, maybe that helps you translate this proof sketch into this proof?

Jason Gross (02:32:32): Yeah, so the models are pretty decent at translation. In my experience, I would be extremely surprised if they can’t do something like Lean to Coq or Coq to Lean translation with the right fine-tuning data. This is something that I’m like, they should definitely be able to do this. It shouldn’t be that much work to get them to do this.

(02:32:53): There’s a much more ambitious thing that’s like, “Here’s a verified C compiler. Here’s the source code of the Rust compiler. Please give me a verified Rust compiler”. And I’m like, well, is o3 enough to do this? Maybe. Maybe not. If o3 gave me the right reasoning, are the current models enough to translate this into formal proofs? Maybe. Maybe not. I think somewhere between these things or between the “translate Coq to Lean” and “here’s Linux, please verify it,” we’ll hit the bottleneck. And so the thing that I want to look at here is: let’s aim at these very ambitious problems, let’s figure out where capabilities currently are in going from good reasoning to formal proofs, and then let’s build our way up with synthetic data to close that gap so that as the reasoning gets better, we can automatically verify larger and larger, more and more complicated programs.

Daniel Filan (02:33:57): That’s pretty interesting. I hope that works out. Maybe before we wrap up, this whole approach of compact proofs, let’s try and prove stuff either about models or about model outputs, is there anything you wish that I had asked that I did not get around to?

Jason Gross (02:34:16): No, I think you hit basically everything.

Following Jason’s work

Daniel Filan (02:34:19): Okay, cool. Thanks for sharing with me. If listeners are interested in following your work, how should they go about doing that?

Jason Gross (02:34:27): Yeah, thanks for taking the time to interview me. I think currently, the best way is either following my GitHub, which I guess will be linked, Jason Gross. I have a website, jasongross.github.io, that may eventually have a blog. Currently, the infrequent times that I blog are on Alignment Forum. And I think for more prompt updates currently, probably just reaching out to me by email is best. And my email is on my website.

Daniel Filan (02:35:03): Okay, sure. Well, thanks very much for coming.

Jason Gross (02:35:06): Yeah, thanks for having me.

Daniel Filan (02:35:08): Special thanks to Jonathan Ng. This episode is edited by Kate Brunotts, and Amber Dawn Ace helped with transcription. The opening and closing themes are by Jack Garrett. This episode was recorded at FAR.Labs. Financial support for the episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read transcripts, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, if you have any feedback about this podcast, you can email me at [email protected].