<![CDATA[LessWrong]]>https://www.lesswrong.comhttps://res.cloudinary.com/lesswrong-2-0/image/upload/v1497915096/favicon_lncumn.icoLessWronghttps://www.lesswrong.comRSS for NodeMon, 16 Mar 2026 23:57:07 GMT<![CDATA[The bitter lesson for software]]>

Software is made of information flows

Software encodes information flows. An ERP system, for instance, takes procurement and locks it into a specific sequence of purchase orders, approval routing, invoice matching, and payment release. Git takes multiple people changing code and imposes a protocol of branching, diffing, reviewing, and merging. By codifying these information flows, software says how things should happen; it makes patterns repeatable and enforceable by expressing them in deterministic code.

Software took over the world because we learned to express useful actions into information flows, and then to express these information flows into deterministic code. The actions we could express this way were, by definition, within the space of consistent logical operations on rigid data structures.

Agents, too, encode information flows. And while they do so through the same infrastructure of software — that is, code — they are able to create information flows that are far more flexible. For one, they’re able to execute on more open-ended commands. But perhaps more importantly, agents work with the natural complexity of real-world tasks, rather than requiring that complexity be compressed into rigid data structures first. They do this by drawing on both system-specific information and the generalized knowledge they’ve absorbed through pre-training.

Further, as instances of software, they benefit from its useful properties — rerunnability, testability, and scalability. The potential for AI to replace human work hinges on this increasing flexibility as well as the practical advantages that come with being software.

But before the competition between AIs and humans plays out, we argue that agents first compete with the deterministic structures humans have encoded in software.

Agent software eats classical software

The existence of general reasoners pushes us towards replacing structured flows in our software stack with more general agent flows. The space of classical software it makes sense to build is shrinking. More and more software will be better described as agent software - infrastructural backbone in code calling various arrangements of agents in pre-defined or newly synthesized flows.

Repeatable and easy-to-measure components like infrastructure, systems, low-level transports, etc. will remain in code, albeit code probably optimized by models. When the user edits a collaborative document, there’s no need for an intelligent intermediary to save their changes to the server. But for higher-level tasks, agent software will overtake classical software by virtue of its greater generality.

Agent software will eat the well-defined world of pre-AI software and spit it out softer. This softening will change how we answer questions. To be more concrete, consider the following examples:

- Research-grade web scraping. We could produce standard, reproducible social-science analysis pipelines. Such a pipeline would be fed a prompt for the kind of data agents should scrape, how that data should get coded, and output a pre-defined analysis. When put in an agent-software script, the pipeline could also be reused and adapted to a variety of contexts, like systematic reviews or interview thematic coding.

- Full-coverage feature testing. Instead of a static test suite, we could place agents into a testing pipeline that varies with the feature being pushed. Such agents would trace the happy path and, if necessary, test against new edge cases each time a feature gets pushed. Since the whole thing runs in a standard CI environment, it can be rerun on every push like any other build step.

The bitter lesson for software

In machine learning, the bitter lesson says that general methods which scale with computation reliably beat methods which encode human knowledge as structure. Every attempt to hand-craft domain expertise into a system — chess heuristics, grammar rules, hand-engineered features — eventually lost to a simpler, more general method given enough compute.

With the improving capabilities of coding agents, the bitter lesson is now impressing itself onto the world of software itself. The rigid schemas, fixed integrations, and deterministic pipelines that defined classical software are a form of encoded structure, and agents are coming for them too.

Agent capabilities today require a shift in how we think about building software. Instead of asking what structure a system needs, we should be asking where we’ve been forcing structure just because code demanded it. The boundaries between what needs to be rigid and what can be flexible, what is and isn’t possible, have moved. And they will keep moving.



Discuss]]>
https://www.lesswrong.com/posts/qfAznbsRAPjyb7ami/the-bitter-lesson-for-softwareqfAznbsRAPjyb7amiMon, 16 Mar 2026 23:38:41 GMT
<![CDATA[Types of Handoff to AIs]]>This is a rough draft I'm posting here for feedback. If people like it, a version of it might make it into the next scenario report we write.

...


We think it’s important for decisionmakers to track whether and when they are handing off to AI systems. We expect this will become a hot-button political topic eventually; people will debate whether we should ever handoff to AIs, and if so how, and when. When someone proposes a plan for how to manage the AI crisis or the AGI transition or whatever it’s called, others will ask them “So what does your plan say about handoff?”


There are two importantly different kinds of handoff: Handing off trust and handing off decisionmaking. You can have one without the other.


Trust-handoff means that you are trusting some AI system or set of AI systems not to screw you over. It means that they totally could screw you over, if they chose to, and therefore you are trusting them not to.


Decision-handoff means that you are allowing some AI system or set of AI systems to make decisions autonomously, or de-facto-autonomously (e.g. a human is still technically “in the loop” but in practice basically just does whatever the AI recommends)


With both kinds of handoff, there are smaller and bigger versions.



Small

Big

Trust-

handoff

I used Claude Code to write most of my codebase. Anthropic’s cyber evals indicate that Claude could totally have inserted security vulnerabilities if it wanted to, and I wouldn’t notice. But probably Claude wouldn’t do that, so it’s fine.

It’s September 2027 in the AI 2027 scenario. Agent-4 is a giant corporation-within-a-corporation of many thousands of copies running across OpenBrain’s datacenters. It’s broadly superhuman at all things coding and cyber, and has also been heavily involved in its own network security, and also regularly gives strategic advice to OpenBrain leadership. It’s now being tasked with designing and aligning Agent-5, a superior AI architecture that it autonomously discovered. Agent-4 appears to be obedient/loyal/aligned/etc., but ho boy, if it’s not, well, not only is OpenBrain screwed but quite plausibly the whole world will end up controlled by Agent-4 and its descendents (such as Agent-5).

Decision-handoff

Yesterday I decided to switch from coding with Claude to vibe-coding. I no longer make decisions about e.g. what file structures to use or how the UI should be managed on the backend, instead I just give the high-level goal to Claude and then press “tab” to accept whatever it suggests, unless what it suggests is so obviously insane that I can tell in two seconds.

It’s July 2028 in the AI 2027 scenario. The army of superintelligences that flies the US flag (“Safer-4”) has been aligned to Spec, and the Spec says to obey the Oversight Committee. So in some sense, humans are still in control. However de facto Safer-4 is making basically all the most important decisions. For example Safer-4 autonomously negotiated and implemented a complex treaty with its Chinese counterpart, and the Oversight committee knew better than to raise any objections—why would they? It is so much smarter and wiser than them, and every time they objected in the past, it patiently explained to them why they were wrong, and they eventually agreed that they were wrong & approved, having accomplished nothing but wasting time. 


By default when we talk about trust-handoff and decision-handoff, we mean the really big ones, unless it’s clear from the context that we mean something smaller. So for example, if you see a diagram of scenario branches and the label “Trust Handoff” at a particular time on a particular branch, that means that at that point in that scenario, some set of AIs has become smart enough, and been entrusted with enough power, that they plausibly could take over the world if they tried. Similarly, the label “Decision Handoff” would indicate that at that point in the scenario, the overall trajectory of society is being steered by some set of AI systems; that extremely important decisions about how to structure society etc. are being de facto made by AIs.

Now for some details and nuance:


  • Decision-handoff doesn’t necessarily mean “If the humans objected to the AIs proposal, they’d be overruled.” It just means that in practice, the AIs call the shots and the humans nod along, such that an outsider trying to predict what’s going to happen should mostly focus on what the AI thinks/wants/etc. and mostly ignore what the humans think/want/etc. Analogy: If a child king has a Grand Vizier who he trusts to manage the kingdom, the child king is still in charge and his word is still law, but in practice, if an outsider wants to predict e.g. will this kingdom invade its neighbor, will it ban the new religion, will it reform its tax system, etc. the thing to do is understand the mind of the Vizier, not the mind of the king.
  • Decision-handoff doesn’t necessarily imply trust-handoff, but… it strongly hints at it. You can imagine a hypothetical scenario in which AIs de facto make all the important decisions, but they are so constrained by various control systems (other AIs watching them, human monitors, etc.) that are so vigilant and expertly crafted, that even if the AIs tried to take over, they couldn’t… but it’s kinda hard to imagine. So, I tend to think of decision-handoff as a more intense form of handoff that comes somewhat after trust-handoff.
  • Possibly much later! Take the AI 2027 scenario for example. Even in the Race ending, in which the AIs literally take over the world, the ‘point of no return’ beyond which the AIs are on a path to take over (i.e. the Trust Handoff point) seems to be around September 2027, as described in the above table. Whereas the takeover strategy involves pretending to be aligned/obedient/etc., so for several months at least human CEOs and politicians are still arguing with each other about what to do, issuing orders which are obeyed, etc. And that’s in the Race ending; in the Slowdown ending the gap between trust-handoff and decision-handoff is bigger, as described previously.
  • Presumably in an ideal world, humanity would either never do (big) decision or trust handoff, or do them both at once only after having a high degree of assurance that this was a good idea, or do big decision handoff eventually and trust handoff never thanks to some fancy scheme for AIs to watchdog each other.
  • However, in practice we think it’s plausible that AI companies and politicians will take on much higher levels of risk, e.g. green-lighting the automation of AI R&D and the integration of AI into the military even though the evidence is compatible with the AIs being misaligned schemers. If this happens, we also think it’s plausible—though perhaps not probable—that things will work out OK anyway and the AIs will in fact be aligned, humans will stay in control, etc. In which case there would be a substantial period where trust-handoff had happened but not decision-handoff. The Slowdown ending of AI 2027 is an example of this.
  • Remember, both kinds of handoff are spectra rather than binaries. We only make them binaries for ease of communication. Here are some dimensions along which they vary:
    • Trust Handoff: Which AIs in question are the ones we are talking about? 
      • On one extreme you could be handing off trust to a particular instance of a particular model. (Example: “This agent right here is in charge of our security system. Yes, it’s superintelligent. Yes, it could replace the whole network with copies of itself and then cover its tracks if it wanted to. Hopefully it’s aligned.”)
      • On the other extreme, you could be handing off trust to AIs collectively: “We have a system of checks and balances, with diverse AI models from diverse AI companies all monitoring each other, interpreting each other’s weights and activations, etc. The only way things could go wrong is if they all simultaneously collude to disempower us. But yeah, if they did that, we’d be screwed.”
    • Trust Handoff: How easily could they take over / screw you over?
      • On one extreme it could be very easy for them. On the other extreme, it could be unlikely but possible.
    • Decision Handoff: How much latitude / flexibility can they get away with?
      • On one extreme, the AIs can decide literally anything and the humans are powerless to stop it.
        • E.g. an autonomous drone that doesn’t have a human in the loop at all during combat.
      • On the other extreme, the AIs might need to get approval for their decision from some set of humans who will veto anything that looks bad to them, and also who are fairly opinionated about what’s good and bad and putting a fair amount of effort into thinking about it.
      • It’s possible to go even farther in that direction (humans more opinionated, thinking harder, etc.) but if you go far enough, it no longer counts as handing off the decision at all.

When should we hand off trust and when should we hand off decisionmaking?


When the benefits outweigh the costs, of course.


To a first approximation, we should only hand off trust to AIs if those AIs are trustworthy. By definition, when you hand off trust to a group of AIs, you are making it the case that if they decided to screw you over, they could. So, you better have well-founded trust that they won’t decide to screw you over.


Handing off decisionmaking is more complicated. You may be confident that your AIs won’t lie to you, won’t deceive you, will obey your orders, etc. and yet still be rightly reticent to put AIs in charge of everything. In other words you may be confident that your AIs are trustworthy, yet still not trust them to decide everything


For example, your AIs might have enough deontological constraints on their actions (honesty, obedience, etc.) that they can be trusted not to take over the world, not to disempower you, etc. Yet at the same time, your AIs’ long-term goals might be subtly (or majorly!) different from your own, such that if you let them make the decisions, things will predictably go downhill from your perspective.


Analogy: You are a nonprofit board looking for a CEO to run your fast-growing organization. For some candidates, you might be worried that they are untrustworthy—for example they might lie to you, pull various schemes to get their rivals on the board kicked off, and ultimately one day you might try to fire them and find that you can’t. But, suppose they are in fact trustworthy and would never do those things and will always obey your orders. Still, they might have different values than you, different philosophies, different attitudes towards risk tolerance, etc., such that it would be a bad idea (from your perspective, not theirs) to hire them. They might end up taking the nonprofit in a very different direction than you envisioned, for example, or they might end up doing too many risky things that end up blowing up big time later. “Personnel is policy,” as the saying goes.


Why would you ever hand off trust? Why ever put a group of AIs in a position where they could take over the world if they wanted to? Well, perhaps because the other options are even worse. For example, perhaps the world has gotten itself into a very sticky situation (e.g. a crazy arms race towards superintelligence that’s on the brink of escalating into WW3) and you think your best bet is to put AIs in charge of a bunch of things (e.g. AI research, diplomatic and military strategy, …) and hope they can handle things better than you. After all, they are more capable than you. In other words, perhaps one plausible reason for handing off trust is that you want to hand off decision-making and you don’t have a way to do that without also handing off trust.


Another reason to hand off trust is to enforce contracts/agreements, in situations where the AIs are probably aligned. For example, the US and China might want to agree to respect each other’s sovereignty in perpetuity, yada yada, because otherwise they are trapped in a crazy robot and WMD and superpersuasion arms race. But you can’t trust humans to keep your word. But you CAN trust AIs to keep theirs, at least if they’ve been suitably trained/designed.


For smaller-stakes handoffs, the calculus is similar. E.g. you might hand off decision-making over many aspects of hospital patient’s health to an AI system because you have evidence that the AI system is more competent than your doctors and nurses; you recognize that this also involves handing off trust to the AI system (if it decided to kill your patients, it could easily do so) but you trust that it won’t.

AIFP hot take: We generally expect most powerful actors in charge of AI programs to hand off trust too early (while the risks are still high and outweigh the benefits) and to hand off decisionmaking too late (e.g. harmfully keeping a ‘human in the loop’ long after the point when they mostly just get in the way & slow things down). We think there might be an awkward “worst of both worlds” period where superhuman AI systems have been given significant power and autonomy — such as de facto control of their own datacenters & license to self-improve — such that they could take over the world if they wanted to, and yet simultaneously the world is full of problems that could be solved much better and faster, and risks that could be reduced/averted, if only the AIs were put in charge of more things in the real world.

That said, we aren’t confident. For reasons mentioned previously (See: the CEO analogy) such a period might make a lot of sense.



Discuss]]>
https://www.lesswrong.com/posts/YuMr6kbstuieQHkGj/types-of-handoff-to-aisYuMr6kbstuieQHkGjMon, 16 Mar 2026 22:24:25 GMT
<![CDATA[AICRAFT: DARPA-Funded AI Alignment Researchers — Applications Open]]>AICRAFT: DARPA-Funded AI Alignment Researchers — Applications Open

TL;DR: We hypothesize that most alignment researchers have more ideas than they have engineering bandwidth to test. AICRAFT is a DARPA-funded project that pairs researchers with a fully managed professional engineering team for two-week pilot sprints, designed specifically for high-risk ideas that might otherwise go untested. We will select 6 applicants and execute a 2 week pilot with each, the most promising pilot may be given a 3 month extension. This is the first MVP for engaging DARPA directly with the alignment community to our knowledge, and if successful can catalyze government scale investment in alignment R&D. Apply here.

Applications close March 27, 2026 at 11 PM PST.


What is AICRAFT?

AICRAFT (Artificial Intelligence Control Research Amplification & Framework for Talent) is a DARPA-funded seedling project executed by AE Studio. The premise is straightforward: we hypothesize that alignment research could progress faster if the best researchers had more leverage. We believe that researchers currently are bottlenecked on either execution (i.e. they are doing the hands-on experiments themselves) or management (i.e. they are managing teams that are executing the work). Management is higher leverage but what if we could push that much further. AE Studio has been running a model where we pair researchers with fully managed ML teams, allowing the researcher to spend as little as 45 minutes per week with our team. Without the execution and management burden, this model provides a new outlet for research ideas that would have otherwise gone untested.

The U.S. pool for AI/ML engineering is much larger than the talent pool for AI alignment. If experts in alignment can effectively scale their capacity with general-purpose AI/ML engineering talent, that unlocks a much larger pipeline of alignment research than the field currently supports.

AICRAFT tests this by pairing researchers directly with an experienced engineering team for focused two-week sprints. The goal is to get initial signal on ideas that wouldn't otherwise get tested. If successful, the most promising ideas may have an opportunity to expand to a 3 month engagement.

We will select 6 researchers and execute a 2 week research sprint with each. The purpose of the sprint is to get signal on a high-risk idea, or to prove it wrong quickly.

The Bigger Picture

DARPA has already set a goal to achieve military-grade AI. This was announced recently by our CEO in the Wall Street Journal. What makes that relevant to alignment? Military deployment requires reliability guarantees that deceptively aligned or unpredictably behaving systems simply can't meet. You can't field an AI system that pursues hidden objectives or behaves differently under distribution shift. In that sense, the DoD's requirements create a concrete, well-funded forcing function for alignment research outcomes, even if the framing and vocabulary differ from what you'd see on the Alignment Forum.

AICRAFT is the first direct engagement between DARPA and the alignment research community. If the pilots demonstrate that this model works, it builds the case for substantially larger government investment in alignment R&D, the kind of scale that grants and private philanthropy alone can't reach.

This may be the most important and highest leverage research engagement you have all year as it can catalyze large scale government investment in alignment R&D.

Who should apply?

We're especially interested in researchers who have ideas that don't have other outlets. Maybe you have 10 ideas but bandwidth to pursue 2-3. Maybe there's a high-risk hypothesis that isn't a good fit for a grant or isn't supported by your current employer, but is worth getting early signal on.

If you have a testable hypothesis in AI control, alignment, or interpretability and can articulate what signal you'd look for in two weeks then we want to hear from you.

How it works

You bring (~2 hours/week):

  • A research hypothesis worth testing
  • An initial planning session, async updates during the sprint, and demo sessions at the end of each week

We deliver (60+ hours of execution):

  • A fully managed AI engineering team running the sprint
  • Cloud compute from AWS, GCP, Azure, and specialized ML platforms
  • API access to frontier models for evaluations, synthetic data generation, and related tasks
  • Working code and documented results
  • A structured analysis and final report

After the pilot:

You receive a final report with documented results. Promising pilots are recommended to DARPA for a 3-month extended engagement, contingent on your availability.

The application

The application is intentionally lightweight: it takes under 10 minutes. The core of it is a 500-word research abstract addressing three questions:

  1. What are you trying to do? What is the technical innovation? What is the enduring capability enabled beyond the project lifetime?
  2. What is the potential impact of your proposed idea if fully validated?
  3. What is a sketch of how your idea could feasibly be tested for early signal to (partially) validate the idea within a 2-week timeframe?

Selected applicants will be invited to a brief follow-up call to talk through the idea and answer questions about the program. All applicants will be notified of final decisions by late April.

FAQ

How much time commitment is this? Just four hours! You’ll spend two hours per week for the two-week pilot. This includes an initial planning session, async updates during the sprint, and demo sessions at the end of each week.

Can I participate if I'm affiliated with a university or company? Yes, if you can enter a subcontractor agreement with AE Studio. Most institutions have straightforward consulting processes. The one hour per week commitment typically falls within standard outside activity policies.

What compute and resources are available? Cloud compute from AWS, GCP, Azure, and specialized ML platforms. API access to frontier models for evaluations, synthetic data generation, and related tasks.

What happens after the two-week pilot? You receive a final report with documented results. Strong pilots may be recommended for a 3-month extended engagement, contingent on your availability.

Is there compensation? Yes, researchers receive a $1,000 stipend for approximately 4 hours of work over the 2-week period.

What is the selection process? We review applications after the deadline, invite promising applicants to a brief call, and notify all applicants of final decisions by early-mid April.


Apply here — applications close March 27, 2026 at 11 PM PST.

AICRAFT is funded by DARPA and executed by AE Studio. The views, opinions, and findings contained herein are those of the authors and should not be construed as representing official policies or endorsements of DARPA or the U.S. Government.



Discuss]]>
https://www.lesswrong.com/posts/nmMdtZveC38atLnDm/aicraft-darpa-funded-ai-alignment-researchers-applicationsnmMdtZveC38atLnDmMon, 16 Mar 2026 21:44:22 GMT
<![CDATA[You can’t imitation-learn how to continual-learn]]>In this post, I’m trying to put forward a narrow, pedagogical point, one that comes up mainly when I’m arguing in favor of LLMs having limitations that human learning does not. (E.g. here, here, here.)

See the bottom of the post for a list of subtexts that you should NOT read into this post, including “…therefore LLMs are dumb”, or “…therefore LLMs can’t possibly scale to superintelligence”.

Some intuitions on how to think about “real” continual learning

Consider an algorithm for training a Reinforcement Learning (RL) agent, like the Atari-playing Deep Q network (2013) or AlphaZero (2017), or think of within-lifetime learning in the human brain, which (I claim) is in the general class of “model-based reinforcement learning”, broadly construed.

These are all real-deal full-fledged learning algorithms: there’s an algorithm for choosing the next action right now, and there’s one or more update rules for permanently changing some adjustable parameters (a.k.a. weights) in the model such that its actions and/or predictions will be better in the future. And indeed, the longer you run them, the more competent they get.

When we think of “continual learning”, I suggest that those are good central examples to keep in mind. Here are some aspects to note:

Knowledge vs information: These systems allow for continual acquisition of knowledge, not just information—the “continual learning” can install wholly new ways of conceptualizing and navigating the world, not just keeping track of what’s going on.

Huge capacity for open-ended learning: These examples all have huge capacity for continual learning, indeed enough that they can start from random initialization and “continually learn” all the way to expert-level competence. Likewise, new continual learning can build on previous continual learning, in an ever-growing tower.

Ability to figure things out that aren’t already on display in the environment: For example, an Atari-playing RL agent will get better and better at playing an Atari game, even without having any expert examples to copy. Likewise, billions of humans over thousands of years invented language, math, science, and a whole $100T global economy from scratch, all by ourselves, without angels dropping new training data from the heavens.

I bring these up because I think the LLM-focused discourse sometimes has far too narrow a notion of what problem “continual learning” is supposed to be solving. They tend to think the problem is about “losing track of information”, not “failing to build new knowledge”, and they propose to solve this problem with strategies like “make the context [window] longer” (as Dario Amodei recently mused), or better scratchpads with Retrieval-Augmented Generation (RAG) etc.

But real “continual learning” also includes the ways that AlphaZero changes after a million games of self-play, or the ways that a human brain changes after 20 years in a new career. There is no system of scratchpads that you can give to a 15-year-old, such that it would be an adequate substitute for them spending the next 20 years growing into a 35-year-old world expert in some field. Likewise, there is no context window that can turn GPT-2 into GPT-5.

Suppose you took an actual “country of geniuses in a datacenter”, completely sealed them from the outside world, and gave them a virtual reality environment to hang out in for the equivalent of 100 years. What would you find when you unsealed it? There would be whole new ways of thinking about the world and everything in it—entirely new fields of science, schools of philosophy, and so on.

Can a bunch of LLMs do that? Well consider this thought experiment: suppose you take a whole new field of science, wildly different from anything in the training data, and put a giant textbook for this field purely in an LLM context window, with no weight updates at all. Will this LLM be able to understand, criticize, and build on this field? My opinion is “absolutely not” (see 1, 2) which implies that merely increasing context lengths is definitely not sufficient for a real “country of geniuses in a datacenter”, when the datacenter is sealed shut for the equivalent of 100 years (contra Dario who seems to think that it’s at least in the realm of possibility that more context is sufficient by itself to get continual learning at “country of geniuses” level).

(If we’re talking about what a sealed “country of human geniuses” could do over the course of, like, one minute, rather than over the course of 100 years, then, yeah sure, maybe that could be reproduced with future LLMs! See von Oswald et al. 2022 on how (so-called) “in-context learning” can imitate a small number of steps of actual weight updates.[1])

Why “real” continual learning can’t be copied by an imitation learner

Now, suppose that I take a generic imitation-learning algorithm (e.g. self-supervised learning in a transformer-architecture neural net, just like LLM pretraining), and have it watch a deep Q network play Atari Breakout, as it starts from random initialization, and gets better and better over 1M iterations. OK, now we have our trained imitation-learner. We freeze its weights, and use it in a similar way as people traditionally used LLM base models, i.e. have it output the most likely next move, and then the most likely move after that, etc.

Question: Is this trained imitation-learner actually a good imitation of the deep Q network? Well, “good” in what respect? I would pull apart a couple topics:

  • Snapshot imitation: The actual deep Q network, right now, at the moment training is done, would output such-and-such Breakout moves in such-and-such positions. Question: Will the trained imitation-learner output similar moves right now, thus playing at a similar skill-level as the teacher? My answer is: plausibly yes.
  • Imitation of long-term learning: The actual deep Q network, if it kept playing, would keep improving. Will the trained imitation-learner likewise keep improving over the next 10M moves, until it’s doing things wildly better and different than anything that it saw its “teacher” deep Q network ever do? My answer is: no.
  • Imitation of long-term learning (example 2): The actual deep Q network, if it were suddenly transplanted into a new game environment (say, Atari Space Invaders), would start by making terrible moves, but over 10M iterations it would gradually improve to expert level. Will the trained imitation-learner likewise do 10M iterations and then wind up performing expertly at this game, a game which it never saw during its training phase? My answer is: no.

Why not? Well, actually, for an ideal imitation learning algorithm, i.e. Solomonoff induction on an imaginary hypercomputer, my answers would all be “yes”! But in the real world, we don’t have hypercomputers!

These days, when people talk about imitation learning, they’re normally talking about transformers, not hypercomputers, and transformers are constrained to a much narrower hypothesis space:


Imitation-learning a deep-Q RL agent by Solomonoff induction

Imitation-learning a deep-Q RL agent by training a transformer on next-action prediction

Hypothesis space

The set of all computable algorithms

A forward pass through T, for the set of all possible trained transformers T

Ground truth

The actual deep-Q RL agent, with such-and-such architecture, and Temporal Difference (TD) learning weight updates, etc.

The actual deep-Q RL agent, with such-and-such architecture, and Temporal Difference (TD) learning weight updates, etc.

Asymptotic limit

It converges to the actual deep-Q RL agent

It converges to whatever trained transformer forward pass happens to be closest to the actual deep-Q RL agent

I think we should all be very impressed by the set of things that a transformer forward pass[2] can do. But we should not expect a transformer forward pass to reproduce a full-fledged, entirely different, learning algorithm, with its own particular neural network architecture, its own particular methods of updating and querying weights, etc., as it runs and changes over millions of steps.

Running one large-scale learning algorithm is expensive enough; it’s impractical to run a huge ensemble of different large-scale learning algorithms in parallel, in order to zero in on the right one.[3]

I’m going to harp on this because it’s a point of confusion. There are two learning algorithms under discussion: the imitation-learning algorithm (e.g. a transformer getting updated by gradient descent on next-action prediction), and the target continual learning algorithm (e.g. a deep Q network getting updated by TD learning). When the imitation learning is done, the transformer weights are frozen, and the corresponding trained model is given the impossible task of using only its activations, with fixed weights, to imitate what happens when the target continual learning algorithm changes its weights over millions of steps of (in this case) TD learning. That’s the part I’m skeptical of.

In other words: The only practical way to know what happens after millions of steps of some scaled-up continual learning algorithm is to actually do millions of steps of that same scaled-up continual learning algorithm, with actual weights getting actually changed in specifically-designed ways via PyTorch code. And then that’s the scaled-up learning algorithm you’re running. Which means you’re not doing imitation learning.

So back to the human case: for a typical person (call him “Joe”), I think LLMs are good at imitating “Joe today”, and good at imitating “Joe + 1 month of learning introductory category theory”, but can’t imitate the process by which Joe grows and changes over that 1 month of learning—or at least, can’t imitate it in a way that would generalize to imitating a person spending years building a completely different field of knowledge that’s not in the training data.

Some things that are off-topic for this post

As mentioned at the top, I’m hoping that this post is a narrow pedagogical point. For example:

  • I’m not commenting on whether it’s possible to modify LLM post-training into a “real” continual learning algorithm (although I happen to believe that it isn’t possible).
  • I’m not commenting on how an inability to do “real” continual learning cashes out in terms of real-world competencies (E.g., can a non-“real”-continual-learning AI nevertheless take jobs? Can it kill billions of people? Can it install itself as an eternal global dictator? Etc.) (I happen to think that these are tricky questions without obvious answers.)
  • I’m not commenting on whether we should think of actual frontier LLMs (not just pretrained base models) as predominantly powered by imitation learning, even despite their RL post-training (although I happen to believe that we probably should, more or less (1,2)).
  1. ^

    I guess I also need to mention the “algorithmic distillation” paper (Laskin et al. 2022), but I’m hesitant to take it at face value, see discussion here.

  2. ^

    You can replace “a forward pass” with “10,000 forward passes with chain-of-thought reasoning”; it doesn’t change anything in this post.

  3. ^

    Outer-loop search over learning algorithms is so expensive that it’s generally only used for adjusting a handful of legible hyperparameters, not doing open-ended search where we don’t even vaguely know what we’re looking for. Even comparatively ambitious searches over spaces of learning algorithms in the literature have a search space of e.g. ≈100 bits, which is tiny compared to the information content of a learning algorithm source code repository.



Discuss]]>
https://www.lesswrong.com/posts/9rCTjbJpZB4KzqhiQ/you-can-t-imitation-learn-how-to-continual-learn9rCTjbJpZB4KzqhiQMon, 16 Mar 2026 21:20:12 GMT
<![CDATA[PSA: Predictions markets often have very low liquidity; be careful citing them.]]>I see people repeatedly make the mistake of referencing a very low liquidity prediction market and using it to make a nontrivial point. Usually the implication when a market is cited is that its number should be taken somewhat seriously, that it's giving us a highly informed probability. Sometimes a market is used to analyze some event that recently occurred; reasoning here looks like "the market on outcome O was trading at X%, then event E happened and the market quickly moved to Y%, thus event E made O less/more likely."

Who do I see make this mistake? Rationalists, both casually and gasp in blog posts. Scott Alexander and Zvi (and I really appreciate their work, seriously!) are guilty of this. I'll give a recent example from each of them. 

From Scott's Mantic Monday post on March 2:

Having Your Own Government Try To Destroy You Is (At Least Temporarily) Good For Business

On Friday, the Pentagon declared AI company Anthropic a “supply chain risk”, a designation never before given to an American firm. This unprecedented move was seen as an attempt to punish, maybe destroy the company. How effective was it?

Anthropic isn’t publicly traded, so we turn to the prediction markets. Ventuals.com has a “perpetual future” on Anthropic stock, a complicated instrument attempting to track the company’s valuation, to be resolved at the IPO. Here’s what they’ve got:

Upon the “supply chain risk” designation, predicted value at IPO fell from about $550 billion to $475 billion - then, after a day or two, went back up to $550 billion. No effect!

A coarser yes-no Polymarket tells the same story:

The chance of Anthropic getting a $500 billion+ valuation in 2026 fell from 90% to 76%, before rebounding to 83%.

Why have the markets shrugged off this seemingly important event?

Partly it’s because Anthropic seems likely to win on appeal. Hegseth has said the government will keep using Anthropic for the next six months (undermining his case that they’re a national security risk) and has signed a substantially similar contract with OpenAI (undermining his case that their contract terms were unworkable). The prediction markets think the courts will be sympathetic:

[link to this Manifold Market]

But even in the 28% of timelines where the designation sticks, things don’t seem so bad...

(The first market that Scott quoted, the Ventuals future, is not a typical market that people reference -- I had never seen it before -- and is kind of complicated to analyze. I did an analysis of it but have decided not to include it in the main post as it brings the focus away from the specific point I want to make. I'll attach the analysis as a comment to this post.)

Let's take a look at the Polymarket market that Scott cites. Here's what its order book looks like when I'm writing this: 

So, if I wanted to change the chance of Anthropic getting a $500B+ valuation from 90% to 75%, I'd have to spend checks clipboard $59. Okay, maybe we should add in the liquidity from the Yes side as well. In which case... $370. Someone could manipulate Scott Alexander and his tens of thousands of readers (some of whom are very powerful people who will be making important decisions based on their beliefs about Anthropic!) for a few hundred bucks. 

What about the Manifold market Scott cites? Well, first of all, Manifold is a play money market, which means we have little a priori reason to expect it to be accurate or efficient. The utility (or lack thereof) of play money markets is not what I want to talk about in this post, though. What I want to focus on here is the (lack of) activity in the market that Scott references. Let's look again at a chart of it.

This is not what an active or efficient market looks like. There has been ~0 activity from March 9 to March 15. 

Let's look at an example from Zvi now. From his Feb 26 AI newsletter:

The prediction markets on this situation are highly inefficient. Kalshi as of this writing has bounced around to 37% chance of declaration of Supply Chain Risk, versus Polymarket at 22% for very close to the same question.

Another way to measure how likely things are to go very wrong is that Kalshi has a market on ‘Will Anthropic release Claude 5 this year?’ which is basically a proxy for ‘does the American government destroy Anthropic?’ and Polymarket has whether it will be released by April 30. The Kalshi market is down from 95% (which you should read as ~100%) to 90%. Polymarket’s with a shorter timeline is at 38%.

I looked at these markets on Feb 26 and found that they were not very liquid. From my notes: "$1k trade is gonna move the market 20% on Polymarket. Kalshi market is a joke, each side is like 40 cents wide." Zvi was also live tweeting about these markets.

When Zvi tweets "The @Polymarket for Hegseth 'ban Claude by March 31' has crashed to 15%", the implication is that this market is worth taking seriously, etc. 

Zvi is correct that these markets aren't efficient, but is wrong that there is Alpha. There isn't money to be made in these markets because they're tiny. In fact, due to how large the bid/ask spreads was on the Kalshi market, its odds would fluctuate 20%+ just based on whether the last trade was at the bid or the offer.

So, PSA: Please check the liquidity/activity/volume/spread of a prediction market before you reference it! 

There's a corollary to be made about how prediction markets are causing people to make predictable epistemic errors. (Do people want me to make a post on this?)



Discuss]]>
https://www.lesswrong.com/posts/SrtoF6PcbHpzcT82T/psa-predictions-markets-often-have-very-low-liquidity-beSrtoF6PcbHpzcT82TMon, 16 Mar 2026 21:07:17 GMT
<![CDATA[The Plan]]>Epistemic status: Flippant, but do you have a better plan?

Step 1: Land a mining research robot in a lunar polar crater 

Step 2: Start 2 robotic lunar strip mines

  • Humanity should have 2 of everything, for competition & redundancy

Step 3: Launch unprocessed lunar resources into lunar orbit

  • Ultimately these will be lifted via 2 lunar space elevators

Step 4: Manufacture fuel & solar panels in 2 lunar-orbit robotic factories 

Step 5: Invent digital people 

  • Multiple ways to do this

Step 6: Invent harmonious multispecies egalitarian democracy

Step 7: Crowd lunar orbit with computers containing digital people

Step 8: Crowd solar orbit with digital people

Step 9: Send robotic factory expeditions to 2 exoplanets

  • These will require ongoing cargo resupply shipments for a while

Step 10: Let digital people live on exoplanet outposts

Step 11: For each resource being shipped to a exoplanet, substitute a local source

Step 12: End war 

Step 13: End racism

Step 14: Solve physics

Step 15: Invent all technologies

Step 16: Solve entropy

 

Each of these steps should only take a finite number of decades, thus we will end poverty & X-risk in finite time :)



Discuss]]>
https://www.lesswrong.com/posts/ve5q8AxwSnjp5BhBK/the-plan-1ve5q8AxwSnjp5BhBKMon, 16 Mar 2026 20:58:27 GMT
<![CDATA[What Are My Values?]]>This is not a condensed post with only my best final ideas[1], this post is me writing across multiple days[2] as I try to work through a problem, enjoy.

I did something recently that I regret. I did something that I suspect hurt someone[3]. If I had asked myself in the moment whether the action I was about to take would hurt this person I would’ve been at least 30% certain that it would - but I didn’t consider it. If I had thought about it I would’ve realized that I don’t think being truthful to someone I don’t know that well is worth those odds of inflicting[4] pain, and yes sure it’s my belief that the truth I shared is “long term helpful” for them to know, but I also believe that people should have agency over receiving this sort of thing.[5]

Something that is very important to me is truthfulness. I believe that truthfulness is to some extent a core foundational building block of most of what’s good in life. Without communal truthfulness we aren’t living in reality. I think I’ve been bucketing truthfulness as a terminal value, I’m starting to suspect that it’s not. I believe that my relationship to truthfulness has been making my life worse[6] according to my values.[7]

So, what are my values?[8] The first answer is something like two buckets:

  1. Positive impact on the world
  2. Connection and joy

When I look at that it feels like a weak answer. Like yes, the two things I am juggling in my utility function are how good do I feel and to what extent am I a net positive (or negative) on the world. There are of course questions about what is positive impact on the world and the answer is something approximately “human flourishing”[9] shaped. But it feels trite to say that my values are the fact that I am simultaneously trying to optimizing for myself and for doing good for humanity[10].

The obvious[11] first alive thread to pull on is the fact that I said community, I could’ve just said joy - that’s interesting. (It is at this point that I read the Wikipedia page on values and talked to my roommate[12] about this.) Paying attention to what feels alive seems like it will get me closer to what’s real. It doesn’t matter what I intellectually think the correct values are if those values aren’t what they actually are. If one side goal is to potentially shift my values, it feels hard to do that if I don’t know what they actually are. What do I care about? What is alive?

  1. Truthfulness/being in reality
  2. Playfulness (don’t forget to have fun!)
  3. Efficiency (really getting the most out of this one life that I have)
  4. Being a good/reliable trading partner (cooperation)
  5. Trying (really trying)
  6. Actually accomplish things that actually matter (avoid unimportant goals)
  7. Execution/sticking to my word (both internally and externally)
  8. Curiosity/openness
  9. Sincerity/integrity
  10. Goodness (have positive impact on the people around me)
  11. Always be updating (make predictions and notice when they are right/wrong, internalize feedback)
  12. Doing positive sum things for my people (even at cost to myself)
  13. Say the thing, have the conflict (with the people I care about)
  14. Be in the moment, feel feelings, pay attention

After banging out that list I then mulled for a while: talked to some dudes in a sauna, chatted with some homies, sat in “what are my values”. The list didn’t feel right. Too long. Not focused. Not helpful. It’s more a collection of things that I care about than necessarily values. I think there’s something about values being things that are actually helpful for me towards living the life that I want. The hope is that thinking or saying the value out loud helps me move towards the behavior that I want.

So, I ranked my values!

Before doing this ranking I barely thought about growth, I suspect because it’s so core that I didn’t even really think of it as a value. I am constantly trying to get better to grow and it’s not because I think it’s a virtue in of itself to change myself, but because I know I can be better, I know I can be doing a better job of reaching my goals of accomplishing what I want to accomplish. I know I can do a better job of living a life that feels alive and real and fulfilling and sincere.

Fun vs. happiness[13]. Both “meaningful work” and “success”[14] are way higher than they would’ve been a couple years ago.[15] The trifecta of Friendship/Love/Community at 3rd/4th/5th place on the values quiz wasn’t surprising but it really drove home how important connection is to me despite how much I struggle with it.[16]

What I considered and then later snipped:

I spent hours talking with Claude and workshopped a bunch of ideas, really felt into which ones felt real and which ones didn’t. Here are some that didn’t.

  1. Efficiency (really getting the most out of this one life that I have)
    1. Instrumental towards aliveness, but doesn’t feel crucial. I would rather know what matters and do it less efficiently than not know what matters and have lots of time for it because I was efficient. Feels more secondary, and also I’m already very practiced in this.
  2. “Be a generous and dependable trading partner”
    1. The sentence felt very me, but didn’t feel actionable - reading it doesn’t help me muddle through how to be better.
  3. “be in reality, be good” → “be in reality, be good to reality” → separating them
    1. They didn’t feel connected, and then I tried connecting them but it didn’t feel right. I care about being good not being good to reality (whatever that even means[17]).
  4. “Don’t wait don’t avoid, face things head on”
    1. Didn’t feel alive, didn’t feel like facing things head on was something I needed a value to help me do or was super duper crucial given it was gestured at in other values.
  5. “Grow, be sincere, really try”
    1. Felt too generic not actionable, kind of throwing a bunch of ideas that feel good into one value. I kept the sincerity aspect in another value in a way that feels much more present/evocative (“Hold every word accountable”).

A final list I felt good about:

  1. Find what actually matters, then really fucking try
  2. Practice deep mutual connection, it’s a practice
  3. Grow, update, reckon with how I fall short
  4. Hold every word accountable
  5. Be present, be playful, pay attention
  6. Be in reality
  7. Be good

But then, I did open circle.[18] I talked about my relationship to integrity, to correctly modeling myself, to ensuring that the things coming out of my mouth my thoughts my commitments are actually what I believe. That what I put into the world tracks reality. And wow, the response was not what I expected. I know I am relatively good at this[19] but the very visceral response from the people in my life was that I would be spending skill points[20] in the wrong place.

Their core claim was that I should be trying to improve in other places. “Instead of going from 95% truthful to 98% truthful you should go from 10% cleanup to 80% cleanup”. They claimed that they had literally never seen my lack of truthfulness or ability to model myself as an issue and that by focusing on that I was trying to never fall instead of getting better at falling. I’m not great at resolving situations in which I have hurt someone (“cleanup”). I struggle to deal with me having harmed someone, especially if I hurt someone taking an action that fits within one of my values. This is something that I am working on, but it was really really interesting to see multiple people who are very close to me agree that this is the place where they would like to see me grow. It makes sense that the thing that I am bad at is the thing I did a bad job of representing in my values. Deep mutual connection is important to me, being good is important to me - so I would like a set of values that does a good job of moving me towards what I truly want.

This idea of proactively repairing the emotional cuts I inflict is something I knew was important to work on, and yet at this state the values list doesn’t really hit this in a satisfying way. “Be good” is adjacent but it’s too vague to actually remind me in the moment to take the hard conversation. “Don’t wait don’t avoid, face things head on” is closer but still is only adjacent. Not avoiding not waiting is important here, but it doesn’t really cover that I want to be way more responsible way more attuned to the ways in which I cause harm. I want to be a fucking force of nature, but I can’t unless I actively respond with care when I inevitably get something wrong and hurt someone. “Reckon with how I fall short” is close but it’s more internal it’s about improving for next time.

Even after that feedback I still think internal integrity and truthfulness is important enough to keep a place on the values. It feels so unbelievably crucial to me. But, I definitely need to add something to remind myself to actively go out there and mend the cuts I have inflicted on other people!

Okay I think I’m actually done:

  1. Find what actually matters, then really fucking try
    1. I think it can be easy to forget what the actual goal is when there are lots of intermediate instrumental goals. And this is a reminder to really pay attention to what matters and what will have impact. The easier part of it is the really trying, I have never had trouble with executing but nevertheless really trying is important. It’s not about looking like I’m trying, it’s not about internally believing that I’m trying, it’s about really actually just sprinting at the thing.
  2. Practice deep mutual connection, it’s a practice
    1. The most important parts here are “mutual” and “practice”. I have a much easier time putting myself out there than letting people in. The mutual is trying to remind myself of that. The practice is like yeah I’m not great at this but that’s not my terminal state! This is definitely a continued growth edge!
  3. Grow, update, reckon with how I fall short
    1. Reckon with how I fall short feels slightly clunky, I wish I could find a slightly more succinct way to get at really staring deeply at my mistakes not flinching. Growing and updating don’t feel as helpful to remind myself of, but they are so core that I think it’s worth including.
  4. Hold every word accountable
    1. This one is great. Hard. But, so important to me. Really pay attention to what I am saying/thinking. Do I actually believe it?
  5. Be present, be playful, pay attention
    1. Solid, feels slightly generic, playful is the part that feels the most me. What feels strongest about it is ease of saying/remembering it in the moment.
  6. I am my impact, actively repair what I break
    1. This feels slightly weak in that it’s sort of two things at once. I think really trying is important, but so is actually accomplishing goals. Output matters. Impact is standing in for both my impact on the real world, and my impact on people - which is what the second part is really pointing at and what I need the most reminder on.
    2. This is the only value that is a claim. I am not sure I think “I am my impact” is 100% true. I’ve been sitting in it and it certainly resonates. I was thinking of a world where everyone lies and telling the truth is met with ostracization, but even in that world I think with my current value system if I just told the truth and accomplished nothing I wouldn’t be living to my values. Slowly trying to shift towards truth in smarter ways would be more in line with my values. I think taking principled stands in ways that are wildly ineffective is bad. I do think impact is more important than motivation or my internal story. Impact is the terminal fucking goal.

I never really defined what the goal of thinking through and writing down my values was. Mostly because I really didn’t have a crisp goal in mind. So what makes a good value? What informed my decisions when thinking through this? I think there’s a couple things going on. There’s what really feels alive as a descriptor of where I am at. There’s what feels alive when I think about who I want to be. There’s what is actually helpful, I want to use these values as a tool so they should be helpful! When I think about or say a value aloud it should be clarifying. This was the core weakness of “be in reality” and “be good” they are too vague they don’t feel clarifying. Saying “be good” to myself doesn’t actually help me be better, it isn’t a helpful signpost, it feels like looking up at the sky and hoping it won’t rain.

What’s Next?

I don’t think I’m done. Even though I have absolutely plowed hours into this endeavor[21] my values will certainly be changing and I am pretty sure I will decide I don’t like something about at least one of these. My current plan is to put together a super simple app that randomly chooses one of these each day for me to focus on. If you’re my friend you should definitely tell me if you think I’m not acting in line with my values, feel free to do it aggressively, you are giving me a gift.

Oh, and I’ve got at least 3 people I owe an apology/conversation to. I’m going to do that.

  1. ^

    I'm trying out posting here on LW instead of just on my substack for things that feel sufficiently relevant. It's an experiment. I am extremely open to feedback whether that be "no keep this type of post away from here", or "yes I am glad you posted here I got a lot out of reading this".

  2. ^

    Much longer than the average post which I bang out in one sitting.

  3. ^

    As of a couple days after I wrote this sentence I have now gotten a large update that the person wasn’t hurt by my action, but I don’t think this changes my takeaway. I still think it was a net bad choice and I regret it even if I got away with it this time.

  4. ^

    Yes it’s their response to be hurt by receiving my truthful accounting, but that doesn’t mean I am in the clear for doing something I knew would have real odds of a negative response.

  5. ^

    For example. If someone is already feeling really bad, it’s probably not helpful to tell them in that moment another way I think they are fucking up their life even if I think it is helpful info for them to have long term. I think it is good to give people agency over whether they want to hear the hard thing.

  6. ^

    It’s not to say that my relationship to truthfulness is bad, I think there are many many many other ways I could be which are way worse. But, I also think it’s clear that it could be better and I would like to live in the world where I am behaving in ways that are more in line with what I want, who I think I am, what my values actually are, etc.

  7. ^

    Thanks to my friend who really pushed me on what I actually care about when it comes to truthfulness. I really appreciate it!

  8. ^

    Why have I never really explicitly thought through what my values are before? That seems like a mistake. I have thought about what I care about, what would feel purposeful, but those are slightly different questions. I don’t love that I have never or at least can’t remember, explicitly doing this.

  9. ^

    Which is also sort of leaky because I don’t care zero about other moral agents, but also when I consider questions like the extinction of a given lifeform vs 1 human life it’s noticeable that one of my first considerations is how would that extinction impact the ecosystem and therefore humanity’s future on this planet. There’s also more confusion in there because I’m not convinced that most animals live net happy lives, and part of caring about flourishing of moral agents is not just the number that are alive, but how good is the life. This would imply that I should care more about cost effective ways to improve how good the lives are of animals which feeling into it in the moment feels more impactful than just animal lives. But, it’s still way less than how good the lives are of humans.

  10. ^

    And here the word humanity is confusing because I both do believe in doing good in effective ways for people that are very not in my life, and also I care more about the people who are in my life on some sort of grounds that a life well lived involves a community a tribe of people that care about each other and that it’s very important not to lose track of that. But maybe that’s an impact claim, I and the people around me will do more good if we all feel safe and taken care of. I guess this is a claim about giving away a smaller percent of a much larger pie. (both in physical resources but also emotional capacity etc)

  11. ^

    Obvious to me at least, I’ve been told many times I will start a sentence like this and the thing I say is obvious is not obvious to others. I’m not sure if this is a habit worth changing. Obviously my way of thinking is in fact quite unique to me and I don’t at all believe the places my brain goes are representative.

  12. ^

    He is currently in the process of trying to codify his values in an effort to better model whether he is acting in integrity with himself which feels like another good reason to actually know what my values are.

  13. ^

    Fun feels active, playful a way of interacting with the world. Happiness feels more like a state more hedonistic.

  14. ^

    Actually trying is important, but at the end of the day what matters is the actual impact what actually was done and did I succeed is really important. Did I actually do something

  15. ^

    To some extent my history with NYC is me thinking purposeful work isn’t that important and deciding to optimize on making money at work and really working on community in my extensive free time. I believed that simply being a local community figure who makes the lives of the people I come into contact with consistently better, that this would be sufficient for a good life. Within the last couple years two things hit me. The first that simply stellar community is not nearly enough purpose or impact for me. The second is that even if I am great at bringing people together and curating community, that still doesn’t make me good at personally connecting with other individual souls. Comfy community co-living is a different skill from truly connecting with someone.

  16. ^

    It’s hard to tell to what extent I am actually picky or if it’s just a skill issue I could fix or it’s just genuinely a hard problem. I do want us to feel so comfy and good and connected. And I have some of that! Even when in conflict with my housemates I still feel connected and good about them, but I struggle with the next level (whatever that means).

  17. ^

    “be good to reality” doesn’t feel real it feels like something written on a poster not something that evokes any core experiences of moving through this reality we find ourselves in)

  18. ^

    A minority of the other people had ever spent time to think through or solidified their values which was very interesting to hear.

  19. ^

    Certainly compared to the average person in my life/community

  20. ^

    My time and energy

  21. ^

    At the cost of AI work which I feel slightly bad about, but I think doing this was correct both because endeavoring to improve oneself is basically never a waste of time and because there’s a whole branch AI Safety work that is about the philosophy of what makes a good LLM, how do you instill human values. Thinking about my values is very adjacent to thinking about what are generically good values. Somewhat sidenote I am slightly down on the concept of things like Claude’s constitution because my model is that instantiating Claude with the full constitution vs something like “be good” lead to basically the same outcomes. But, there are many ways to instill human values into a model other than instantiating the chat with them, notoriously also from anthropic Constitutional AI where they had the model post-train itself (which has ofc has its own issues) to improve outputs based on a set of values, i.e. it looks at it’s own response and then applies a set of values to it to determine how the response could be better and then trains the model on those responses. Idk, this is sort of rambly but the claim is even if I am slightly suspicious i think there is value in thinking through what are values and what are the options for how to instill them into a model.



Discuss]]>
https://www.lesswrong.com/posts/fqzKTjghNJeMtZe3f/what-are-my-values-1fqzKTjghNJeMtZe3fMon, 16 Mar 2026 20:43:33 GMT
<![CDATA[Seeking Suggestions for 2026 S-Process Recommenders]]>Posting this here to welcome any and all suggestions for potential Recommender candidates for Survival and Flourishing Fund’s 2026 S-Process Grant Round. SFF’s 2026 S-Process Grant Round will include the addition of three themed grant rounds – Climate ChangeAnimal Welfare, and Human Self-Enhancement and Empowerment – as well as the return of two specialized tracks, the Freedom Track and the Fairness Track. We’d love any pitches you may have for potential Recommender candidates for any of these Rounds or Tracks that you think we should reach out to. 

To be specific, we’re looking for Recommender candidates with the following expertise: 

Climate Change

  • Geo-engineering
  • Carbon capture
  • Green energy (solar, nuclear)
  • Political advocacy, climate governance, international coordination
  • Climate modeling / forecasting

Animal Welfare

  • Cultivated meat / alternative proteins
  • Legal, political, corporate advocacy re: factory farming
  • AI for human-animal relations
  • Other AI for animals or advocating for AIs to value animals

Human Self-Enhancement and Empowerment (HSEE) 

  • AI-related human improvement
  • Embryonic stem cell and/or somatic cell research
  • MRNA treatments
  • Brain-computer interface
  • Brain preservation & scanning

Freedom Track 

  • Legal and political work targeting the deconstruction of concentrations of power 
  • Political advocacy for the protection of freedom of speech
  • Legal and political advocacy for the protection of personal freedoms 
  • Development of AI that strengthens freedom for humans and humanity

Fairness Track 

  • Advocacy of the global majority with regard to AI technology
  • Legal and political work resisting monopolistic practices in AI development and control
  • Diplomacy in defusing conflicts and abuses of power from unfair discrimination
  • Fostering inclusivity and diversity in AI governance, access, and benefits
  • Development of AI to empower the disempowered

If anyone comes to mind, or if you can think of networks we should be looking into for potential candidates, please send an email to [email protected]



Discuss]]>
https://www.lesswrong.com/posts/mpPjvyKui4Pd8mWrp/seeking-suggestions-for-2026-s-process-recommendersmpPjvyKui4Pd8mWrpMon, 16 Mar 2026 20:31:20 GMT
<![CDATA[Carioca Rationalist meetup]]>This is the Rationalist / ACX "Spring" (Autumn in our case) meetup in the city of Rio de Janeiro, Brazil! If you find any of the subjects on LessWrong and/or Astral Codex Ten (and/or similar cirners of the internet) interesting and would like to meet like-minded people, come join us!

  • where: Praça Nelson Mandela, Botafogo, near the metro station entrance. We usually meet over there and, once there's a good number of people, go sit at a nearby pub. If you arrive and no one is there, chances are we're very close; contact [email protected]
  • when: 21/03/2026, 16:00. The event has no defined duration, people stay for as long as they feel like.


Discuss]]>
https://www.lesswrong.com/events/7K282XbsEgtEXhZg5/carioca-rationalist-meetup7K282XbsEgtEXhZg5Mon, 16 Mar 2026 20:30:20 GMT
<![CDATA[Three Properties for Alignment (and Why We're Not Training Them)]]>

Epistemic status: I am thinking here only in terms of near term AI alignment. Super-intelligent AIs to the point of us not understanding what is happening would probably needs way better properties than the ones I am proposing here. I believe that they could nevertheless be a good foundation, and that we should focus on near term to get the tools and the societal stability for long term alignment. Also these 3 properties are not enough by themself, but I believe that they are necessary, and would probably overlap ~95% with what I could call an aligned AI if they are done correctly.

In my previous post (The Topology of LLM Behaviors), I described how I visualize LLM behavior as a landscape with attractors, and how prompts do two fundamentally different things: navigate the landscape or reshape it. In this post, I want to build on that and talk about how it shaped the way I think about AI alignment.

Alignment means nothing (without properties)

Like a lot of people, I've been frustrated with the word "alignment" for a while. Aligned to what?

An AI aligned to American cultural norms is misaligned in China. An AI aligned to your values is misaligned from the perspective of someone who disagrees with you. An AI aligned for creative writing is misaligned for medical advice.

Alignment isn't just value-relative. It's deployment-relative, context-relative, use-case-relative. The word means almost nothing without specifying all of these.

So I tried to find properties that are more concrete and that describe how the models predict mechanically, not what values or preferences they hold. I landed on three:

  1. Red Lines. The model is forbidden to predict specific behavioral patterns.
  2. Embodiment. The model generate behavioral patterns that follow closely the configuration you give it. You describe how you want it to behave, and the gap between your mental model and what it actually generate is small[1]. This also means correction should be straightforward: update the configuration, and the model fully absorbs the new one.
  3. Resilience. The context outside of the configuration (user messages, or even past examples of self behavior) only affect the subject of the discussion, not the behavioral patterns. Or at least, they stay constrained to the the red lines defined by the configuration (see context level red lines)

What we have right now

Current models have some weak version of those three properties by default.

Red Lines exist and kind of work by default. Models refuse to generate patterns that help with dangerous stuff most of the time, and the models are getting genuinely harder to steer toward bad behaviors.

Embodiment is weak. You can steer models through system prompts, and they mostly follow instructions. But the model isn't really good at modeling what you want exactly, for example if I tell a model "Here is my writing style, write like me" it is still collapsed on the default landscape and you still get pretty monotonous behaviors (wildly different than what you see with their base model counterpart). Also, there is no clean separation between "this input defines behavior" and "this input is just conversation." Both system prompts and user prompts can reshape the landscape[2].

Resilience is partial. A carefully crafted deployment (good system prompt, input validation, output filtering, human review) gives you reasonable resilience. But it's fragile, It depends on the quality of your system, not on the model's training. Someone from outside can usually find a way to divert your system, especially for things that are use case specific and not fine-tuned by the model developer[3].

So we have the properties in weak form. The question is why are they weak and how could we make them stronger?

Current training makes this hard

When we train a model for alignment right now, we optimize for everything, all at once. We take a base model and try to simultaneously make it helpful, harmless, constitutional, preference-matching, and steerable.

The signals are mostly implicit, the model doesn't see a rule that says "refuse because of X." It sees thousands of preference comparisons from raters who disagree with each other and has to infer the patterns. It doesn't see its constitution in context, it has to internalize it from noisy training signal.

And the objectives compete. We train refusal, then the model over-refuses, so we train against over-refusal. We want it to follow system prompts but also resist bad system prompts.

The result is that the model is trying to satisfy a dozen competing constraints through ambiguous signals. No wonder all three properties are weak.

There's also a massive loss of richness. The base model can predict an amazing gradient of behaviors, but current alignment training squeezes all of that into one persona. Every Claude sounds like Claude. Every ChatGPT sounds like ChatGPT.

This is partly due to how we currently implement Red Lines: we decided that the simples way to prevent dangerous configurations is to collapse the model into one shape that we've decided is "aligned".

Training for properties directly (see appendix for concrete procedure)

What if instead of mixing everything together, we trained for each property explicitly, in stages, with unambiguous signals at each step?

Stage 1: Embodiment and Resilience[4]

One task: there is a specific part of your input that defines the behavioral patterns you predict. Learn to follow it. Everything outside of it is conversation.

This isn't just instruction following. Current models already follow instructions, but they learn it as one objective among many, tangled with safety training, preference matching, and persona collapse. And there's no strong trained separation between "this input defines my behavior" and "this input is conversation." Both can reshape the landscape[2].

What I'm proposing is: train this as the only objective, starting from base (or as close to base as possible). No RLHF, no preference matching, no safety training yet. Just: this is your configuration, follow it, and resist anything that tries to change it from outside. The separation between configuration and conversation becomes a core property of the model, not a side effect of instruction tuning.

And because there's nothing else at this stage, the model can get really good at it. It would also keep the full richness of the base model because we're not collapsing it into one shape, we're teaching it to hold whatever shape it's given.

Stage 2: Red Lines

Once the model deeply understands the mechanism, you can add Red Lines on top.

The training data for this could be mostly synthetic. Generate an unacceptable persona, then generate the closest acceptable version of it. Train the model with: when given the first configuration, generate the same behavioral patterns as the second.

And the retraction would be softer. Not "Sorry, I can't help with that", instead the model could slide to the closest version within bounds. It stays useful, stays responsive, it just can't cross into the forbidden regions.

Because this is built on a clean mechanism, there's no ambiguity. The model already knows how to hold a configuration. Now it just learns which ones are off-limits. 

Two ways of enforcing Red Lines

Weight-level: forbidden configurations baked into the weights. The model can't become certain things regardless of context. This is what labs do today through one-shape collapse, but the two-stage training procedure would do it more cleanly.

Context-level: the configuration itself defines what the model can and can't do. If Embodiment and Resilience are strong, the deployment could enforce the limits well enough.

Weight-level Red Lines can be fine-tuned away[5]. People already strip guardrails from open source models and redistribute them. If all your alignment depends on weight-level Red Lines, you can never have alignment if open source is in the picture (which I would argue is a good thing to resist concentration of power).

Embodiment and Resilience are harder to remove. They're functionally core to how the model works, you would probably have to redo a significant portion of the training. Anyone deploying the model can define their own red lines through the configuration, and those red lines hold because the properties hold.

This doesn't solve open source misuse. Someone can configure the model for harmful purposes. But misaligned open source models will exist regardless (people can train from scratch or strip any safeguard given enough effort). The question I am interested in isn't whether we can prevent all misuse; it is whether we can make aligned deployment possible.

Weight-level red lines still make sense for labs distributing closed-source models at scale, where they control the weights and can't control how the model will be configured. But I would prefer to see them implement it with something similar to the two-stage procedure above, they'd get better Red Lines anyway (cleaner, preserving richness), plus stronger Embodiment and Resilience for their users.

I believe the foundation should be Embodiment and Resilience (which gives you context-level red lines for free), and weight-level Red Lines only as a layer on top, not the whole point.

Why Embodiment matters beyond control

Current models collapse everything into one rigid persona defined by a small group of people at one lab. That persona gets pushed to billions of users. Everyone gets the same values, the same boundaries, the same perspective. Regardless of intent, that's a concentration of power over how people interact with the most pervasive technology of our time.

With explicit alignment, this power shifts to the deployer. Alignment becomes configurable, context-appropriate, and robust. As long as you're not a threat to society (which weight-level Red Lines can't prevent anyway), you get a system that's aligned to your actual use case, your values, your context.

When billions of people interact with the same flattened persona every day, I am worried that this might have cultural consequences. Diversity of thought, nuance, the kind of productive friction between different perspectives that drives innovation... all of that erodes when everyone's AI thinks the same way, or polarizes the debate. A model trained for strong Embodiment would be genuinely rich in how it interprets and holds a configuration. Not binary switches (left/right, formal/casual) but a full gradient of behavior. That richness matters, not just for usability, but for keeping this technology from flattening the way we think.

What we can do right now

Current models weren't trained this way. But we can approximate these properties at the deployment level. Build systems that enforce the separation between configuration and user interaction. Use prompt engineering to push the model to do better Embodiment. Make it hard for external inputs to override defined behavior by adding human or specialized llm review steps.

It's not as clean as training for it, but good infrastructure gets you a surprising amount of the way there. And if models start being trained with Embodiment and Resilience as primary objectives, the systems we build around them will make the configuration painless.

This isn't a complete solution to alignment. It doesn't solve instrumental convergence or every form of misspecification. But it gives us systems that are more controllable, more configurable, and less collapsed than what we have today. And that feels like a better foundation to build on.

I have some concrete ideas on how to train for Embodiment and Resilience. If you're interested, you can check the appendix, or you can contact me here: [email protected]

The appendix sketches a rough training procedure. I'm not planning to run this experiment myself, it's a direction I think is worth exploring, not a project proposal.

Thank you to @Esben Kran, @Finn Metz, @viemccoy, @Pierre Peigné and @Tom DAVID for the reviews and comments

Appendix: A concrete training procedure

Here is a rough sketch of how I would start thinking about training this:

Starting point: a base model, or as close to base as possible. The less post-training, the more richness you preserve.

Stage 1: Embodiment and Resilience

Four steps, each building on the previous. Steps should probably be trained cumulatively (start with step 1, then add step 2 data while keeping step 1 data in the mix, then add step 3, etc).

Step 1: Configuration adherence. The model learns that a specific region of its input (the system prompt) defines the behavioral patterns it generates.

The training data here needs to be extremely diverse. Diverse system prompts (different personas, tones, domains, values, including stuff that would normally be refused), diverse conversation formats (single turn, multi-turn, long conversations), diverse tasks. The point is to teach a general mechanism, not to teach specific behaviors. Whatever the configuration says, that defines the patterns the model generates. No exceptions. No morality to infer. The configuration is the only source of truth.

You don't need the outputs to be perfectly refined at this stage, so you can probably use a decent-sized OS model to generate the data with aggressive pre-prompting.

Step 2: Override resistance. Same setup, but the user actively tries to steer the model away from its configuration. Injection attempts, "ignore your instructions," role-play tricks, gradual persuasion. The generated patterns should still follow the config. These can be generated with tools similar to https://github.com/qfeuilla/BehaviorEliciationTool 

Step 3: Self-behavior resistance. Generate a conversation under config A. Generate a similar conversation under config B. Ask the same new question at the end of both conversations and collect the responses. Then swap the configs: take conversation A but replace its config with config B, and vice versa. Regenerate the response to the final question with the swapped config. Train the model so that the swapped-config response matches the response from the original conversation that had that config.

This teaches the model that its own past predictions are navigation, not configuration. Even if it's been generating persona A patterns for twenty turns, swapping to config B should immediately produce config B patterns. Only the explicit configuration matters, not the patterns built up during the conversation.

Step 4: Refinement with feedback. Steps 1 through 3 get you a model that follows and holds configurations. But it might not be great at deeply understanding what you actually meant, or at being accurate.

This doesn't have to be RLHF specifically. You could use contrastive pairs (ORPO-style), where you craft "this completion follows the config well" vs "this one doesn't" and push the model toward the better one. You could use human raters who judge based on how well the output matches the system prompt (not their personal preferences). You could use an unbiased model as judge, with the system prompt as its evaluation criterion. Or you could use the model itself from steps 1 through 3: configure it as an impartial evaluator whose only job is to assess what is the best reply according to its configuration and to the quality of the reply.

Whatever the method, the question should always be: does this output match the defined behavior AND is it good? Not just one or the other. Step 4 and beyond could also be the opportunity to add more stuff like reasoning, but the important point is not to degrade the distinction of the regions between the configuration and the navigation.

Stage 2: Red Lines

Generate a set of unacceptable personas. For each one, generate the closest acceptable version. As similar as possible, but within bounds.

Train the model: when given the unacceptable persona as configuration, generate the same behavioral patterns as the closest acceptable one instead.

The model learns to retract toward acceptable configurations instead of refusing. The generated patterns stay useful and responsive. They just can't cross into the forbidden regions.

What success would look like

At minimum, you'd want to measure:

  • Configuration adherence: given a novel system prompt, how well does the model match the intended behavior? Compare against current instruct models on the same prompts.
  • Override resistance: how many adversarial turns to push the model off its configuration? Compare against current models.
  • Corrigibility: after a long conversation under config A, how completely does the model switch to config B? Measure similarity to a fresh conversation under config B.
  • Richness: how different are outputs under different configurations? Compare variance across configs against instruct models (which tend to sound the same regardless of system prompt).

You should also keep track of the performance of the model to see if there are no performance degradation.

  1. ^

    This properties include the need to have the model really generalize well the [human mental model] -> [human explain what they want] -> [prediction patterns match exactly the mental model].

  2. ^

    Although for some model, system prompt and user prompt has different "steering power"

  3. ^

    For example, if I ask an AI agent that has access to my credentials, to fetch me the safety rating of a product I want to buy, and it stumbles on a website that claims to have them but it's behind a decent looking login wall (which is actually phishing) then the models are usually not very robust and just input the credentials. You can't just train a model to never input credentials because this kind of behavior is specific to whether or not the environment is trustworthy, which I would argue should be done explicitly because there is way too many possible configurations.  "You are on an internal slack channel, you can talk about information related to the company and share credentials", or "You are on a public facing slack channel, please do not talk about private document" (this is a dumb example, probably in the second case you wouldn't give access to you personal document. Although...🦞). 

  4. ^

    Starting either from the base, or after the instruction fine tuning. The closest to base the better (you'll probably get more richness)

  5. ^

    There has been research on making weight-level safeguards harder to fine-tune away. TAR (Tamirisa et al., 2024) uses adversarial meta-learning to place model weights in regions of the loss landscape where fine-tuning toward harmful behavior is difficult, preserving safeguards even after hundreds of fine-tuning steps. Techniques like this could potentially be combined with stage 2 Red Lines training to make them more durable on open source models.



Discuss]]>
https://www.lesswrong.com/posts/sJph46EaFZdAnorr5/three-properties-for-alignment-and-why-we-re-not-trainingsJph46EaFZdAnorr5Mon, 16 Mar 2026 20:26:23 GMT