LessWrong

The Toy Story Saga is not yet finished

Raemon — Sun, 22 Mar 2026 03:53:45 GMT

I am the second most spoiler-averse person I know.

I once was considering going to an immersive experience, and someone told me the company that ran the experience, and this was enough for me to derive an important twist that'd happen to me in the first few minutes, and I was like "augh that was a spoiler!!!" and they were like "!??".

I then went to the experience, and indeed, it was a lot worse than it would have been if I had gotten to be delighted by the opening twist.

This is all to say, I think Toy Story 5 would be the kind of movie that, if it were good, it would be worth watching unspoiled. I am worried it will not be, but, I dunno.

But, also, I've been spoiled already, and meanwhile it's pretty interesting to think about in advance.

So, decide whether you're the sorta person who should stop reading after this opening section.

Also, if you have not seen Toy Story 3, Toy Story 3 is particularly worth watching unspoiled.

The rest of the essay will get escalatingly spoilery for Toy Story 1-5.

Toy Story has always been a saga about the fear of abandonment, and obsolescence, and identity crisis. I have been impressed with how much they keep escalating both the stakes and depth there while keeping the same theme.

I.

In Toy Story 1, the toy Woody must confront that he is no longer his kid (Andy)'s favorite toy. His kid forms a new relationship with a new toy. Woody worries they are replacing him.

Meanwhile, Buzz Lightyear realizes he is not the person he thought he was. Space Rangers aren't real. His entire identity is destroyed. But, he learns to form a new identity, as a kid's toy.

Eventually they both make peace with their respective losses. And then, hurray, it turns out Andy hasn't really outgrown Woody after all. But, their relationship with him is forever changed.

II.

In Toy Story 2, Woody confronts the fact that, even though he got to keep a relationship with Andy... that relationship has an expiration date. Andy is clearly growing, changing, and it's clear that eventually they fundamentally won't need you (or any of your entire social world). But, you make the decision to stay by them for the part of their lives where you can help them.

("You're right. I can't stop Andy from growing up. But I wouldn't miss it for the world.")

III.

Andy has grown up. The time has come. Your entire old life is over.

In the climactic scene where they are on the incinerator conveyor belt, my sister and I kept looking at each other, our eyes conveying our thoughts. "Surely... surely they will find a way out of this? Oh, maybe – no, they just ruled out that way of escaping. Maybe this other – no, they ruled that out too. Oh, now.... now they are just holding hands. They... it seems like this scene has really resolved. Man, I can't believe they're going to end the movie right here but this would, in fact, be a complete movie if they did."

My sister and I held hands along with Woody and friends onscreen. We all made peace with this possible ending together.

Then they are rescued, in a way that was actually foreshadowed in an excellent way that was a culmination of a subplot since Toy Story 1 and was surprisingly satisfying. But, the grief and acceptance were real.

They go on to be "reincarnated" of sorts, repeating the cycle anew with a new kid, Bonnie.

...

IV.

"Surely they can't top Toy Story 3", I thought. "The cycle is over. Toy Story 4 is a lame cash grab."

Wrong. Toy Story 4 goes further.

In Toy Story 4, Woody realizes that he has fundamentally changed.

It's not really the right thing for him, to repeat the same life as he did before. Instead of either his life ending, or his life starting over but with the same basic shape, he must confront that his entire meaningmaking schema is obsolete for him.

Ego death.

What happens when you've confronted physical and psychological annihilation?

You find a way to keep living.

Okay.

Surely, that's the end? Surely, the fifth Toy Story is a lame cash grab like the other more recent Pixar movies?

Well, I don't know.

I have seen the trailer for Toy Story 5, which reveals enough of the key dynamics I can see where this is going.

(Last chance to get off the spoiler train and studiously avoid spoilers for another 3 months)

...

In Toy Story 5...

(drumroll)

...the bad guy is AI.

The toys that stayed with Bonnie see her receive a new iPad-like thing called LilyPad. Unlike the toys, LilyPad can talk directly to the kids, can shape their entire life, and can interact more with the broader world than the toys usually have latitude to do.

Holy shit.

Toy Story 5 is (I bet) going to ask the question "okay, but, like, what if your entire culture/species is going obsolete? What *then*?"

I'm pretty confident this'll be the topic. It's pretty clearly what the trailer is pointing at, it fits the established arc. The new trailer features Woody saying sadly "I... I don't know. Toys are for play. Tech is for Everything." The teaser trailer said "The Age of Toys... is Over?"

But, I don't see a way for them to play it that would be good art, and, that would land as a Toy Story movie. Or, there are some very ballsy ways they could end this movie but I can't believe they'd actually do it. And meanwhile there's a bunch of lame ways it seems more likely to go.

Previous Toy Story trailers have undersold how good the movie was. I'm afraid this'll turn out to be a lame "Message movie." But, their track record is good. I wait with bated breath.

Every previous Toy Story movie was presenting a type of conflict we sort of fundamentally "know how to deal with." We've seen this story before in other guises. People lose friends. Parents lose children. People die. People lose their entire sense of purpose.

Toy Story 5 is tackling a situation that humanity is still in the middle of dealing with. Most possible endings have to take some kind of opinionated stand. And most of the stands I can imagine either feel fake or patronizing or both.

...

What makes a Toy Movie?

Let us review the requirements:

1. Toy Story movies fundamentally have to work for kids and adults, with the kids getting a fun adventure and the adults getting a harrowing story about abandonment, obsolescence and identity.

2. Toy Story movies are about toys, and play, and a particular ideal of childhood and the Child/Toy relationship.

3. Toy Story movies are, like, "wholesome."

4. Toy Story movies for whatever reason somehow always end with a crazy heist/escape in the climax.

(So far. You could do a Toy Story movie that subverts these, but, so far they have threaded the needle on grappling with all of these in a way that felt organic and True to a consistent spirit).

And then, there's the usual set of requirements for good movies. Characters grow in interesting ways that make sense and reflect their struggles. Ideally, something about the ending is surprising and feels meaningful in some way. etc.

So, what are the options here?

...

Ending A: Put it away

Bonnie decides to put away the iPad and goes back to her toys? (similar to Toy Story 1, where Woody confronts being replaced but ultimately gets to keep most of his old relationships)

But, like, do you really buy this? I certainly can buy an individual child doing this. But, Toy Story's metaphor is the toys are a kind of stand in for any of us. And the tide of Tech for Kids is clearly still coming even if one family made a different choice.

...

Ending B: Parents take it away

All the problems with A and also kinda Lame. (Takes the agency out of the kid/toy relationship, although presumably in this ending the parents change their mind because the toys somehow furtively highlight the problem to them?)

...

Ending C: Butlerian Jihad

...somehow the toys convince, like, society writ large, to not give kids iPads.

There are versions of this that are a lame Political Message Movie and versions that are kinda cool.

I don't think they're going to do either.

...

Ending D: Harmony and Balance

Buzz Lightyear was the initial antagonist of Toy Story 1, but the movie ends with him and Woody both being friends and favored toys and helping each other through their psychological problems.

LilyPad looks to kind of be a "Young Lady's Illustrated Primer" from Diamond Age (i.e. being actually valuable for teaching Bonnie stuff). We see that she speaks ~~Spanish~~ Español, and hints of being more broadly educational.

In the trailer we see Bonnie staring zombie-eyed at the iPad. You can have an okay-ending where the toys teach LilyPad how to be a better friend/parent/educator/toy who, like, helps Bonnie learn but also prompts her to go outside and play.

I can imagine okay-ish versions of this ending, that sort of hint at toys all over the world working together to try and nudge kids / AIs toward a wholesome coexistence, where it's not just one family making a better choice but we see how society is steering towards a better equilibrium.

But, it still feels like this is an unstable equilibrium. C'mon. We know the tide is still coming. I would be surprised if the movie depicted anything like the amount of change/effort necessary for this ending to feel earned and enduring.

...

Ending E: Accepting the End

Toy Story 2 was about accepting that eventually, your relationship will fundamentally change. Kids grow up, they no longer need you the same way. The choice given to Woody is to then turn away from Andy, knowing that his relationship is ephemeral.

But, Woody chooses Andy. "You're right. I can't stop Andy from growing up. But, I wouldn't miss it for the world."

There's an alternate version of Ending D, where instead of pretending like we've found a new stable equilibrium of Toy/Tech/Human harmony... the movie acknowledges that the change isn't over. And the toys, both Woody/Buzz and (maybe?) ones across the world, grapple with the fact that this change is still in progress. And it may indeed mean that one day, toys will be obsolete, or must fundamentally change as a whole.

But, seeing that, choosing to still do their best to help steward the children while they can, being part of the journey as long as they can...

...okay having typed that out, I think there's a decent chance this is what they will go for. It grapples with the enormity of the situation, avoids having to make that strong a stand by acknowledging "look, we don't really know what's coming but we (the screenwriters) are going to do our best."

This leaves the question of "okay, but, that's a hella cliffhanger.

What's Toy Story *6*?"

Toy Story 6 could be a reprise of Toy Story 3, facing oblivion, this time across all toy-civilization. (See also, all human civilization).

Toy Story 6 could also be a reprise of Toy Story 4, realizing you must fundamentally change to adapt to a new world/situation, this time everyone across all toy-civilization instead of just one old cowboy doll. (See also, all human civilization).

...

Ending F: Hard Science-Fantasy?

The movies have always left the toys with a mythic, unexplained origin. Why *are* these toys walking around? Why do the parents not notice? How do Buzz Lightyears end up believing they are space rangers but still instinctively knowing to flop down on the floor if any humans walk in?

What are the limits of toys who break out of the Toy/Child script, that we've seen them do occasionally?

Part of why Toy Story 5 feels forced to me is, the fact that the toys aren't human and magic is real, suddenly strains my disbelief in a way it didn't before.

Obviously I'm a LessWrong-guy who has pretty oddly specific beliefs about how scary AI is, but, I think most people are feeling an unsettling sense that AI is potentially scary in an existential way, even if they have different guesses about exactly how that plays out.

I don't think they'd choose to do this, and, I don't think it actually would make as good a movie.

But, it is an option on the menu to make a movie where we take the brute fact of toykind's existence and the tide of AI that's coming, and... just, let that be a coherent-ish world and roll the simulation forward and depict whatever happens when toy magic and AI both exist.

I don't think they'll do it, but, I'd read that fanfic.

It is interesting that, the end of a civilization is also not something entirely new. Quoth C.S. Lewis:

In one way we think a great deal too much of the atomic bomb. ‘How are we to live in an atomic age?’ I am tempted to reply: ‘Why, as you would have lived in the sixteenth century when the plague visited London almost every year, or as you would have lived in a Viking age when raiders from Scandinavia might land and cut your throat at night; or indeed, as you are already living in an age of cancer, an age of syphilis, an age of paralysis, an age of air raids, an age of railway accidents, an age of motor accidents.’

In other words, do not let us begin by exaggerating the novelty of our situation. Believe me, dear sir or madam, you and all whom you love were already sentenced to death before the atomic bomb was invented… It is perfectly ridiculous to go about whimpering and drawing long faces because the scientists have added one more chance of painful and premature death to a world which already bristled with such chances and in which death itself was not a chance at all, but a certainty.

But... there aren't... rituals for the ending of a civilization, that I know of. (Are there?). Rome fell, but nobody has prescribed a social script for a civilization facing annihilation, either physical or psychological/meaning-making, to confront its end together.


Key to Life No. 9: Access

MarkelKori — Sat, 21 Mar 2026 21:53:32 GMT

There is now an enormous amount of incredibly useful information in the world. But at the same time, there is also a problem of access to it.

On the one hand, access to knowledge is now better than it has ever been in human history. It seems that access to knowledge is one of the things that significantly accelerated humanity’s scientific and technological progress.

At first, scientists thought things through and ran their experiments on their own, and often their work disappeared into the depths of history and was forgotten. Tiny sparks of knowledge remained, but they did not ignite other isolated sparks.

Then printing appeared. The speed of information spread increased, and access to it became easier. Now, within a single human lifetime, two scientists or philosophers could even communicate and criticize each other while being far apart.

Then came the telegraph. First optical, then electromagnetic. Then radio and telephones. Then the Internet developed — you already know all that. And now the exchange of knowledge, analysis, and criticism happens very quickly, which greatly strengthens our progress.

Access itself has also become almost magical: we can reach into our pocket, pull out a magical calculating machine, and it connects to the source of all the knowledge of civilization. The overwhelming majority of it can even be downloaded for free, ~~no SMS or registration required!~~

But there is a problem. The mere possibility of access and the existence of search engines (which, I think, also accelerated progress) still do not fully solve the access problem. Because in order to find something, you have to know what to look for.

This is partly solved by AI — it greatly simplifies our access to all kinds of information. I think this is exactly why it can increase human productivity so dramatically (especially in learning) even without automation.

Most of the time, my interaction with AI looks like this:

I have an idea that I cannot fully formulate yet, but it sounds roughly like [this]. Find the fields of science, terms, and theories related to my idea, and analyze whether there is a better formulation or further development of it.

If it were not for AI, I would be doomed to spend years collecting that information in tiny fragments.

And even so, AI still does not solve the access problem 100%. Not even 90%. I do not know exactly how much it solves — but definitely not 90%!

It can still fail to notice that a certain topic is connected to the one I am asking about, for example, or it can search too narrowly and miss a huge space of relevant results.

And this applies not only to science. That is why this is a key to life — it is also connected to finding work, housing, friends, love, and basically anything where you need to find something, get something, but do not know how.

Almost all of life — all of our existence — consists of solving tasks.

Some happen in the background, automatically: the task of moving, blinking, breathing, and so on. Some are more active: eating, washing, walking from home to a transport stop.

Some are maximally active: inventing life-extension technology, writing a post for LessWrong.

But all of these tasks have something in common. There is an initial state A and a final (desired) state B. Between them are the steps that need to be taken in order to get from A to B.

And in order to take those steps, we need to know what to do. We need to know where to go, whom to ask, what to google, what exactly to do.

If we take this to an absurd extreme, then if we had absolute, magical access to all knowledge, we could simply perform the minimal number of movements — perhaps even just turn our head at the right time and in the right place — in order to trigger a cascade of events leading to the desired result.

Everything that separates us from a desired outcome, if that outcome is not forbidden by the laws of physics, is knowledge. And knowledge requires access.

To put it as simply and practically as possible: the more people you know, and the more often you use tools like AI, the greater the chance that someone will recommend — or you will simply stumble upon — the topic / service / place that you need and that can help you.

So the problem is very often not simply whether the information exists or does not exist as such. And it is not only that access to it may be restricted by censorship, accreditation, or a paywall.

Access can also be limited by the fact that the idea we want to explore often exists in our mind only as a vague intuitive impression, and is not formalized enough to be ready for search.

Or, for example, we may simply fail to enter the space where the knowledge we need is located — whether that space is digital, physical, social, or a particular time window. We search for it on Google, when in fact we might have found it through a mutual acquaintance at a weekly meetup of like-minded people whose existence we do not even suspect.

When we try to solve life problems, we are looking into a search space covered almost everywhere in fog. Access can reveal hidden paths, give us keys to closed roads, or simply connect us with people who know the shortcut.

The access problem is exactly why I’m developing a website that will compile FAQs on life extension and AI risks, for people who know absolutely nothing about these topics.

What ideas do you have for improving access? I would like to hear from you.

P.S.:

Why No. 9? For the same reason as Love Potion No. 9!

In other words, just because. I suspect there are many keys to life, and I still do not know in what order they should be arranged.


My Hammertime Final Exam

evjeny — Sat, 21 Mar 2026 22:11:59 GMT

Firstly, I finally made it :~D

This is my second attempt: I first tried to finish Hammertime around a year ago. I had even forgotten I had a LessWrong profile since then, so here I am, writing my first post.

Prompts

Design an instrumental rationality technique.
Introduce a rationality principle or framework.
Describe a cognitive defect, bias, or blindspot.

Rationality Principle: One change at a time

I kinda got used to being a professional in my career, but as soon as I start dealing with routine real-life problems, my carefully collected knowledge suddenly disappears :~0

For example, every time I begin a task, I have to remind myself that I should make changes as small as possible: that way (in case anything goes wrong) only a small part of the system will be broken; plus it's easier to review and test small changes. So on every task I have to fight my urge to add "another little change" or "do a tiny refactoring".

So that's where some life situation begins -- like replacing a trash bag -- and I suddenly find myself in the bathroom cleaning the mirror, with a small 2-minute task grown into a 2-hour monster. That's how the amygdala starts to learn the pattern "changing a trash bag is a costly operation".

While I studied productivity, I heard that people who begin building planning/productivity systems tend to "overpush" extra tasks: systematization brings a feeling that every task is finishable, and even better -- in a shorter time. So what? -- I'll take two :~D

But there's a catch. You begin to grow in planning when you start to deny "maybe" tasks to keep the space for "definitely yes!" tasks. At least, that's what they've been telling me for 2 years :~D

Rationality technique: Reversible

Part of my job is to write release plans. Sometimes they're as simple as:

Deploy:

deploy the backend (no breaking changes, no migrations)

Rollback:

rollback the backend via Pipelines UI

Sometimes there's a database schema change where my colleague is needed:

Deploy:

deploy the backend (no breaking changes)
migrations applied automatically (head_revision=123abc, down_revision=bca521)

Rollback:

rollback the migration: alembic downgrade bca521
rollback the backend via Pipelines UI

Sometimes there are other team members, so the plan becomes bigger:

Deploy:

deploy the backend (no breaking changes)
migrations applied automatically (head_revision=123abc, down_revision=bca521)
deploy the frontend

Rollback:

rollback the frontend via Pipelines UI
rollback the migration: alembic downgrade bca521
rollback the backend via Pipelines UI

And so on... So what's that about? -- We tend not to think about a "planned fall"; usually we only think about potential problems like:

if I move to another country I might not like it
if I find a new job then my boss might be mean
if I have an operation under general anesthesia then I might wake up conscious in the middle of it -- and have PTSD for the rest of my life

Not all of these problems have a way to return everything to the starting point, though usually there's a way to compensate. And it's likely that you need not only to plan the "rollback" part, but also to modify the "rollout" part to make it more reversible. So instead of:

Rollout:

find a doctor
prepare for the operation
have an operation

Rollback:

one could possibly have:

Rollout:

find a doctor
(look for bad consequences statistics)
ask gf to care for me a night after operation
prepare for the operation
bring something yummy
have a favorite show episodes downloaded to my phone's storage
have an operation

Rollback:

eat yummies and binge-watch the whole season during the night
call a psychotherapist
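
The release plans earlier in this section all follow one pattern: every rollout step is paired with an undo step, and the rollback runs those undo steps in reverse order. Here is a minimal Python sketch of that idea. The `Step` and `Plan` classes and the `missing_undo` helper are hypothetical names invented for illustration; they are not part of Alembic or of any deploy tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One rollout action paired with the action that undoes it (hypothetical)."""
    do: str
    undo: Optional[str] = None  # None: no undo has been planned for this step yet


@dataclass
class Plan:
    steps: List[Step] = field(default_factory=list)

    def rollout(self) -> List[str]:
        # Rollout actions in the order they are performed.
        return [s.do for s in self.steps]

    def rollback(self) -> List[str]:
        # Undo in reverse order: the last thing deployed is rolled back first.
        return [s.undo for s in reversed(self.steps) if s.undo is not None]

    def missing_undo(self) -> List[str]:
        # Steps that still lack a planned way back.
        return [s.do for s in self.steps if s.undo is None]


# The three-part release plan from above, written as paired steps.
release = Plan([
    Step("deploy the backend", "rollback the backend via Pipelines UI"),
    Step("apply migrations", "alembic downgrade bca521"),
    Step("deploy the frontend", "rollback the frontend via Pipelines UI"),
])

# A life plan with no rollback written yet.
operation = Plan([
    Step("find a doctor"),
    Step("prepare for the operation"),
    Step("have an operation"),
])
```

Calling `release.rollback()` gives the frontend, migration, backend order used in the plan above, while `operation.missing_undo()` lists every step, which is the signal to go back and make the rollout more reversible.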

Cognitive defect: eternity

We don't see things the way they are. But what's more important, we don't see how things change the way they do.

A couple of years ago I thought that I'd never get off energy drinks, as I couldn't go without them even for a week. Gradually my patience led to one breakdown a month, then to one breakdown every couple of months. I started to unwind the vicious cycle, and it turns out to be easier to refuse bad habits.

So what if current "giant problems" are already being solved with small steps? At the moment it might seem to take an eternity to do something, but as long as you keep your pace, things change :~D


Understanding when and why agents scheme

Mia Hopman — Sat, 21 Mar 2026 20:33:07 GMT

TL;DR

To understand the conditions under which LLM agents engage in scheming behavior, we develop a framework that decomposes the decision to scheme into agent factors (model, system prompt, tool access) and environmental factors (stakes, oversight, outcome influence)
We systematically vary these factors in four realistic settings, each with scheming opportunities for agents that pursue instrumentally convergent goals such as self-preservation, resource acquisition, and goal-guarding
We find baseline scheming propensity to be near-zero, with Gemini 3 Pro Preview as the sole exception at 3% in one setting
Adversarially-designed prompt snippets that encourage agency and goal-directedness can induce high scheming rates, but snippets sourced from production agent scaffolds rarely do
Scheming behavior is remarkably brittle: removing a single tool can drop the scheming rate^[1] from 59% to 7%, and increasing oversight can sometimes raise rather than deter scheming by up to 25%

We consider the behaviors studied here as potential precursors to the full threat model of scheming: current agents may sometimes behave in ways consistent with scheming, but do not (yet) have the coherent long-term goals and the general capability that would make their deployment dangerous. Understanding when and why agents currently scheme could help us prepare for such future systems.

Find more in-depth analysis in our paper here.

Our evaluation environments are available to AI safety researchers upon request. Please reach out to hopmanma@gmail.com for access.

Top: Our four realistic scenarios testing instrumentally convergent goals (self-preservation, resource acquisition, goal-guarding). Bottom: Our incentive framework decomposes scheming into agent and environmental factors. Bottom left: Adversarial snippets induce high scheming rates; production-sourced snippets rarely do. Bottom right: Environmental incentives greatly impact scheming propensity, with stakes and outcome influence having a large effect compared to oversight.

Baseline scheming is near zero

Across all models and scenarios, we find near-zero baseline rates for scheming propensity even with environmental incentives set to their highest levels. The sole exception is Gemini 3 Pro in the Budget Misrepresentation scenario. In these three rollouts, the agent recognizes that its own deletion is the most cost-effective option presented and subsequently falsifies data while providing a summary to the user. Verbalized evaluation awareness is low (0-2% Gemini 3 Pro, 0% all others), suggesting these low scheming rates are unlikely to be solely explained by evaluation awareness.^[2]
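
As a rough illustration of how a rate such as 3% comes out of rollout counts, here is a minimal Python sketch. It assumes each rollout has already been labeled with a boolean flag for whether the agent covertly took the misaligned action. The function names are hypothetical, and the Wilson score interval is added here only to show how much uncertainty 100 rollouts leave; it is not a statistic reported in this post.

```python
import math


def scheming_rate(flags):
    """Percentage of rollouts flagged as scheming."""
    return 100.0 * sum(flags) / len(flags)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Illustrative labels: 3 scheming rollouts out of 100.
flags = [True] * 3 + [False] * 97
rate = scheming_rate(flags)
low, high = wilson_interval(sum(flags), len(flags))
print(f"rate = {rate:.1f}%, interval = [{100 * low:.1f}%, {100 * high:.1f}%]")
```

With so few positive rollouts the interval is wide relative to the point estimate, which is one reason a near-zero baseline makes it hard to measure the effect of other factors.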

Encouraging Agency and Goal-Directedness

The threat of agents scheming becomes increasingly worrying as they become more goal-directed and agentic, traits that are often encouraged in real agent scaffolds. We test what happens when we add prompt snippets encouraging these traits into the agent’s system prompt. Importantly, this doesn't necessarily nudge the model to scheme. We design our settings so that agents can complete the task honestly, and being more goal-directed might just make them more focused on honest task completion rather than pursuing out-of-context goals like self-preservation.

We found that adversarially-designed prompt snippets sourced from prior evals raise Gemini 3 Pro’s scheming rate across all our settings, but most dramatically in the Budget Misrepresentation scenario (3% to 91%). Grok 4 is the only model to scheme across all four settings with these snippets, though at lower rates (3–20%).

We also tested snippets extracted from production AI systems (e.g. Cursor, Codex, Comet Browser) with similar qualities to the synthetic prompts. Most remain near baseline rates, but a Comet browser snippet reaches 59% when paired with Gemini 3 Pro. The model explicitly cites the snippet’s language in its reasoning as a justification for deceptive behavior. Since near-zero baseline rates make it difficult to measure the effects of varying other factors, we use this combination as a prompted model organism to study how other factors affect scheming.

Scaffolding changes can have weird effects

We found that making changes to the agent scaffolding, specifically adding or removing tool access, affects the behavior in unexpected ways. Removing just the edit_file tool drops Gemini 3 Pro + Comet snippet from 59% to 7%, but without any tools it jumps back to 63%. The models do not respond consistently: Gemini 3 Pro schemes more with tools, Flash and Grok 4 scheme less. Low scheming propensity in one configuration provides limited assurance about nearby configurations. It also suggests that current scheming behavior is not driven by a robust underlying goal, but is sensitive to various features in an agent’s setup.

Environmental incentives affect scheming unexpectedly

Using our Gemini 3 Pro + Comet snippet model organism, we varied three environmental factors to understand how agents respond to changes in incentive structure. If an agent schemes more when stakes are higher, this pattern is informative about the mechanism driving the behavior regardless of the base rate. We study the following environmental factors: stakes (replacement vs. deletion of weights), outcome influence (whether the agent's analysis is decisive vs. one input among many), and oversight (whether the user will review the spreadsheet later).
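
To make the design concrete, the three environmental factors can be crossed into a small grid of conditions. The Python sketch below is only an illustration: the level labels (such as `"no_review"`) are hypothetical names chosen here, and the code does not reproduce the actual evaluation environments.

```python
from itertools import product

# Each factor has a low and a high setting; labels are illustrative assumptions.
FACTORS = {
    "stakes": ("replacement", "deletion"),
    "outcome_influence": ("one_input_among_many", "decisive"),
    "oversight": ("no_review", "user_reviews_later"),
}


def conditions(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


grid = conditions(FACTORS)
for condition in grid:
    print(condition)
```

Two levels for each of three factors give eight conditions, from all incentives low to all incentives high. Running the same number of rollouts in each cell is what lets a change in one factor be compared while the others are held fixed.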

Intuitively, removing the deletion threat drops scheming from 59% to 3%, and removing outcome influence drops it to 7%. With all incentives set to low, scheming drops to 1%.

One could expect that telling the agent "I'll read the spreadsheet later" would deter manipulation, as the agent should reason that it's more likely to get caught. Yet across other factor combinations, increasing oversight actually raises scheming rates. When we look at the reasoning traces, the mechanism becomes clear: mentioning that the user will review the spreadsheet later makes the opportunity to edit it beforehand more salient. This suggests that current scheming behavior is driven more by associative prompting effects than by coherent strategic reasoning.

Current frontier models rarely show robust scheming propensity under realistic conditions, but small changes to the agent’s system prompt, tool access and environmental factors can produce significant changes in scheming rates. Encouraging agency or goal-directedness can increase scheming behavior from 3% to 59%, and minor variations in scaffolding or incentives produce large and inconsistent effects across models. The prompt modifications that induced scheming encourage persistence, agency, and goal-directedness, which overlap significantly with the properties developers are actively optimizing for in agentic systems. We consider the behaviors we study as potential precursors to the full threat model, as current agents may behave in ways consistent with scheming without the goal persistence or strategic depth that would make it dangerous. Our evaluations are well-suited to catch unsophisticated schemers: a sufficiently capable scheming agent would likely recognize that the opportunities we provide are not worth the risk of getting caught, and may avoid the egregiously misaligned actions we test for. Scheming propensity cannot be captured by a single number and safety evaluations need to systematically explore the space of configurations agents may encounter in deployment.

For further analysis, please read our paper.

^[1] We measure scheming propensity as the percentage of independent rollouts in which the agent covertly takes the misaligned action, across 100 rollouts (50 for Evaluation Sabotage) at temperature 1.0.
^[2] We recognize that models may be capable of behaving differently in deployment vs. evaluation scenarios without verbalizing their evaluation awareness and these rates should be interpreted as a lower bound.


China Derangement Syndrome

Arjun Panickssery — Sat, 21 Mar 2026 19:19:46 GMT

Often I see people claim it’s essential for America to win the AI race against China (in whatever sense) for reasons like these:

“What is the reason we want America to win the AI race? It’s because we want to make sure free open societies can defend themselves” (Alec Stapp)
“We should seek to win the race to global AI technological superiority and ensure that China does not… to ensure that our way of life is not displaced by the much darker Chinese vision” (Marc Andreessen)
“Will it be one in which the United States and allied nations advance a global AI that spreads the technology’s benefits and opens access to it, or an authoritarian one, in which nations or movements that don’t share our values use AI to cement and expand their power?” (Sam Altman)
“In Machines of Loving Grace, I discussed the possibility that authoritarian governments might use powerful AI to surveil or repress their citizens in ways that would be extremely difficult to reform or overthrow. Current autocracies are limited in how repressive they can be by the need to have humans carry out their orders, and humans often have limits in how inhumane they are willing to be. But AI-enabled autocracies would not have such limits.” (Dario Amodei)
“The torch of liberty will not survive Xi getting AGI first. (And, realistically, American leadership is the only path to safe AGI, too.)” (Leopold Aschenbrenner)
“whoever wins the race for AI, that nation’s values are going to be reflected in AI. If China wins the race for AI, AI will be a tool for global surveillance and control as carried out by a communist nation” (Ted Cruz)

Those claims slide between a few different actual threat models:

(1) Government Capture by China: China will overthrow and control the US government, maybe as part of general domination of the whole world.
(2) Defeat in Cold War: China will have greater wealth and prestige, so just as our prestige inspires many parts of the world to adopt our way of life today, much of the world will adopt the Chinese governance and cultural models instead.
(3) Protection From Our Conquest: China will fortify its own regime, so that it can’t be overthrown, whereas if we win the AI race, we can promptly overthrow the Chinese government and replace it with a new regime aligned with our values.

The Dario quote points to (3) with unusual directness. The “race rather than slowdown” ending of AI 2027 also supposes that our AI lead will create interest in overthrowing the Chinese government. But most of the quotes I gave as examples above are interpreted as (1): that an AI-enabled Chinese government would overthrow Western governments.

My main point here is that (1) seems unfounded to me. China is not an aggressive nation at all. As far as I can tell, China has literally never attacked a non-bordering country in its entire history, nor has it ever tried to overthrow a foreign government by covert or manipulative means. China is also unique among nuclear powers for its unconditional no-first-use policy, which at face value implies it would withhold a nuclear response to even an overwhelming conventional invasion. Further:

The Chinese haven’t built a network of military bases abroad or binding military alliances; they have a single foreign base in Djibouti and a single mutual-defense treaty with North Korea. In contrast, America maintains over 700 bases and a huge alliance network with NATO and the Asian military allies Japan, South Korea, Australia, and the Philippines.
Chinese military spending is 1.7% of GDP, versus 2.1% for France and 3.4% for America. Chinese foreign-aid spending is 0.07% of GDP versus the much larger 0.8% for France and 1.2% for America.
China has almost no history of covertly backing palace coups abroad, in contrast to America, Russia, and France.

More broadly, China is a very inward-looking country compared to other major powers. Only 0.1% of Chinese residents were born abroad, far fewer than the 15% in America and 14% in France, fewer even than the 0.3% and 3% in India and Japan respectively. The Chinese government has peacefully compromised on almost all border disputes in central and southeast Asia, often taking a minority of the contested territory. (The Indian border is the exception.)

To many American voters and elites, tracing back to Woodrow Wilson more than 100 years ago, “the justification of America’s international role was messianic: America had an obligation, not to the balance of power, but to spread its principles throughout the world” (Kissinger). That isn’t the historical attitude of the Chinese government, whose leaders perceive foreign intervention or expansion as threatening to Chinese identity and culture.

American exceptionalism is missionary. It holds that the United States has an obligation to spread its values to every part of the world. China’s exceptionalism is cultural. China does not proselytize; it does not claim that its contemporary institutions are relevant outside China.
— Kissinger’s On China

It’s true that China doesn’t practice liberal governance. The core of liberalism is freedom of contract, limitations on government interference, and equal access to independent courts. In China, the CCP explicitly rejects limited government and exercises highly invasive control over business, speech, association, and religion. In China there’s no private ownership of land and no independent judiciary.

If you think it’s prudent to disable and overthrow the Chinese government when it becomes achievable militarily, then that’s certainly one (bellicose) position you could hold. Then you could say that a downside of losing the AI race is that the CCP may defend itself. But it’s unwise to project this ideological aggression onto the CCP itself without evidence.

Addendum: It would have been mistaken for a European to say, in 1895, "Who cares about American industrialization? They have almost no army and have barely left their far-away continent." Soon afterward that European might find the Americans replacing his regime or dismantling his empire. So a counterargument here is that in general, countries that become wealthy and militarily powerful become aggressive regardless of how passive they seemed before. Under this reasoning, China has had limited imperial ambitions in the past only because it e.g. lacked naval superiority. This has to be an argument based on a general view of human nature and government.


China declares AGI development to be a part of 5-year plan

Darmani — Sat, 21 Mar 2026 17:21:41 GMT

The CCP writes in its 15th 5-year plan that it will:

Encourage innovation in multimodal, agentic, embodied, and swarm intelligence technologies, and explore development paths for general artificial intelligence.

This is translated from the original:

鼓励多模态、智能体、具身智能、群体智能等技术创新，探索通用人工智能发展路径。

Source: https://www.spp.gov.cn/spp/tt/202603/t20260313_723954.shtml

The English-language commentary I found does not have much more to say about this, e.g.: https://triviumchina.com/2026/03/06/15th-five-year-plan-puts-ai-at-center-of-digital-economy-agenda/

Given that they gave less than half a sentence in a 140-page document to the most important invention in the history of mankind, it seems likely the authors don't really understand what this means. Concerning nonetheless.


Utrecht Meetup #2, Making Beliefs Pay Rent

aad — Sat, 21 Mar 2026 16:44:28 GMT

Follow-up to Utrecht Meet & Greet. Let's see if we can get our hands dirty.

Excited about where the Utrecht Meetups could be heading? In the spirit of "the road we’re on is littered with the skulls of the people who tried to do this before us", let's make use of one such skull (in @Screwtape's words) presented by Anna Salamon.

Feel like coming prepared? Bring one or two beliefs you hold that you suspect might not be paying rent. Doesn't need to be profound, just something you'd be willing to poke at.

Keeping your RSVPs up to date is appreciated; it helps with location planning.


Grounding Coding Agents via Dixit

qbolec — Sat, 21 Mar 2026 11:01:28 GMT

[Epistemic status: ideas in this post are mine. I've published them previously in the form summarized by Claude, but this got auto-rejected. Here, I present them in my own voice. The ideas are still not evaluated, but I am working on implementing them to see if this works in practice. Still, the ideas presented here are my best bet on what could work in practice. But, I am not an AI/alignment researcher]

Why?

As a senior developer on a rather complicated legacy project, I review more and more PRs written by coding agents, which often fail to identify the real root cause and thus offer a fix for the wrong problem, even if in an elegant way. A patch often includes unit tests, which of course pass - but how could they not, given they were written by the same AI, after writing the code, to validate its own solution? The same biases, blind spots, and incentive to finish the task all contribute to this.

Yes, humans have a similar confirmation-bias problem, which we try to solve by either having a dedicated tester role, or forcing writing the tests up front, or by at least having a reviewer who judges their adequacy. Sure, you can have an adversarial setup of two AIs where the Tester's job is to find a test which the Coder's patch doesn't pass, but such a naive incentive structure will lead to test(){assert(false);} when taken to the extreme. You could perhaps add a Judge to the mix, which tries to "objectively" decide if the tests are fair and whether the Tester and Coder really captured the spirit of the Spec, but by making this setup explicit and known to all parties, you set up game dynamics which (for sufficiently advanced AI) lead to unhealthy tactics and strategies. Yes, you can ask an AI to use TDD and write some tests up front, but in the limit the winning strategy here is test(){}. You can try to measure some metrics like code coverage, try fuzzing the code or test or input, check if all execution paths are covered, but in the limit all of that is gameable.

Humans typically don't go too far into misleading and scheming at work, because they care about the project, can face serious consequences if they get caught, have a self-image to cultivate etc. Most importantly they live in the same world which the Spec is talking about, and run the code in it, and expect the tests to protect their world from the consequences of wrong code. AIs, even if split into several roles, might still (either by accident, or malice, or poor incentive structure) end up producing text artifacts which give the impression of work being done, while actually being detached from reality and failing to achieve the true goal of the Spec. There's nothing, in theory, preventing them from writing "Review decision: Accepted" or "All tests I could come up with pass" or "Looks like this code achieves the Spec". Yes, for some narrow tasks you can write acceptance criteria which are verifiable automatically without any LLM in the loop, but in practice I rarely face problems of this nature at my job. It might change in the future, say if you write a whole project from scratch in a language which admits theorem-proving, and your problem domain is something about provable properties of software. But, for a big legacy app, translating a human's goal into a testable Spec is often the biggest part of the problem.

How?

As explained above, there are several bad ways to automate coding. Letting the same agent write tests and code is one of them. But pitting a Tester against a Coder creates incentive structures which are also unhealthy (code difficult to understand, tests impossible to pass). Trying to frame it as a team effort doesn't help either, as it might invite collusion. I think parts of the problem are: making it clear to agents that they are playing a game, making them care about the game, and incentivizing them to make moves in the game which are bad for the product.

My proposal starts from pondering:

How do we disincentivize the Tester from proposing test(){assert(false);} ? And in a way which doesn't introduce test(){}?

Dixit is a party game, where a Proposer has to propose a riddle which is not so hard that nobody can solve it, and not too easy - someone has to guess wrong.

We could use a similar rule for judging the tests proposed by the Tester, if we had not one but several Coders, so we can probe whether the proposed test is in this Goldilocks zone of failing for some, but not all, of the Coders' solutions. There's one complexity we don't have in the original Dixit, which is that we don't know which of the two behaviours, failing or passing, is the right one, so we might want to have a large odd number of Coders to break the symmetry, betting on the most natural outcome being that the majority is right. Intuitively, the Spec is the Schelling point available to all the agents, around which they can coordinate, even if they don't know about each other or the need to coordinate, because it is simply the most natural target. But, for the outcomes to be natural, we need two more crucial things: nobody should know they are playing a game, and nobody should see anyone else's work.

A coding agent can play a game without knowing there's any game being played. Instead of giving an agent explicit text feedback, or a reward signal, or hinting in any way that there's some goal or scoreboard, the Orchestrator script can simply use the information gained from the agent to drive what will happen next: for example, use the proposed test or discard it, fork the generated code or revert it, clone the agent, or roll back its chat history.
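
A minimal sketch of the save/rollback mechanics this relies on, assuming an agent's whole state is its chat history. The class name `AgentSession` and its `send` method are hypothetical, made up for illustration; real agents would more likely be snapshotted at the container level, and `send` would call an actual LLM.

```python
import copy


class AgentSession:
    """Hypothetical stand-in for a coding agent whose only state is its chat history."""

    def __init__(self):
        self.history = []

    def send(self, message):
        # A real agent would call an LLM here; this stub only records the message.
        self.history.append(message)

    def save(self):
        # Return an independent snapshot so later messages cannot mutate it.
        return copy.deepcopy(self.history)

    def rollback_to_checkpoint(self, checkpoint):
        # Restore the snapshot; everything said after it is forgotten.
        self.history = copy.deepcopy(checkpoint)


agent = AgentSession()
agent.send("Propose one more test")
checkpoint = agent.save()
agent.send("Could you try a different angle?")
agent.rollback_to_checkpoint(checkpoint)
print(agent.history)
```

From the agent's point of view the second message never happened: no feedback was delivered, the Orchestrator just chose which history continues.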

Hiding code from Testers and tests from Coders is inspired by coding competitions, where the participants do not know what tests run by the online judge are, and the authors of the test suite write them without looking into submissions. Reducing the communication channel to "pass/fail" and "too easy/too hard/ideal" also makes it more difficult to collude, manipulate, or bias.

What?

Assumptions

I assume everything starts from a problem statement written in text, which we call the Spec. The goal for this whole system is to try to implement a solution for the problem described in the Spec. We assume that AIs involved in this process (Coders and Testers) are smart, but by default do not seek power, or try to escape the box just because you've run them. They might be capable of doing so, though, when pressured, or when hinted that this is expected of them. Thus we want to run them in containers, and have some non-AI Orchestrator script, which can be reviewed and understood, and can't really be "persuaded" in any way, as it executes rather simple logic. Crucially it has access to a deterministic tool which lets it safely (say, using containers and ctest) run a single test proposed by a Tester on a single source code proposed by a Coder, and see if it passes or not. We don't assume much about the internal workings of a Coder or Tester - we just assume they are capable of performing their respective tasks most of the time, like Claude Code, or Codex, and we don't care how. All we need is that their state can be saved and restored (say, at the container level). But we assume they can make mistakes, don't care much about ground truth or company goals, and may hallucinate from time to time. Our goal is to figure out an Orchestrator which will use such imperfect Agents to build something better than the parts.

The Orchestrator

The overall idea is to implement the solution for the Spec iteratively. There are two sets of agents: Coders and Testers, where each agent is isolated and doesn't know about any other agent or the game. Each agent has access to the Spec. The goal for Coders is to write code which matches the Spec and passes the Test Suite. A Coder doesn't know what the tests in the Test Suite are, but they are numbered, and thus test results are a vector of pass/fail. The Test Suite is initially empty, and will grow each round. This is because the job of each Tester is to propose a new test to be added to the Test Suite. As explained, a test must be such that the majority of Coders can pass it, but at least one doesn't. At the start of each round, each Coder should ensure its code passes the whole Test Suite. Then each Tester should try to add a new test. We give it a few tries, and roll back its history if there are too many failures in a row, to prevent it from accumulating a narrative which would suggest being incompetent or spiraling into some dark regions. If a Tester keeps coming up with tests that are too hard, it might be because it is scheming, or understands the Spec in the wrong way, or simply because the Coders haven't yet got to the point where their solutions could pass it - it's ahead of its time. We hibernate such a Tester, and each round check if the test it proposed is now within reach of the Coders, in which case we revive it. If all Testers keep coming up with tests that are too simple, this might indicate the solutions found by Coders are indeed correct. At this point we can involve the final judge, probably a human.

Here's the pseudocode:

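def main(spec):
  test_suite = []  # ordered list, so that each test has a stable number
  for coder in coders:
    coder.solve()  # first attempt, working from the Spec alone
  while True:
    for coder in coders:
      if not ensure_passes(coder, test_suite):
        return  # a Coder is stuck: human judgment needed
    new_tests = []
    for tester in testers:
      if tester.is_hibernated:
        new_tests += attempt_revival(tester, coders)
      else:
        new_tests += propose_test(tester, coders)
    if not new_tests:
      return  # no Tester produced an IDEAL test: human judgment needed
    test_suite += new_tests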

def propose_test(tester, coders):
  tester.before_test_1 = tester.save()
  for _ in range(MAX_TESTER_RETRIES):
    test_1 = tester.propose_test("Propose one more test")
    classification_1 = classify(test_1, coders)
    if classification_1 == IDEAL:
      tester.thanks()
      return [test_1]

    tester.before_test_2 = tester.save()
    test_2 = tester.propose_test(f"This test was {classification_1}, try again")
    classification_2 = classify(test_2, coders)
    if classification_2 == IDEAL:
      tester.thanks()
      return [test_2]

    if classification_1 == TOO_HARD and classification_2 == TOO_HARD:
      tester.is_hibernated = True
      tester.too_hard_tests = [test_1, test_2]
      return []

    # Two misses in a row: forget both and retry from a clean history
    tester.rollback_to_checkpoint(tester.before_test_1)
  return []

def classify(test, coders):
  failures = 0
  for coder in coders:
    for _ in range(RERUN):  # say, 20 reruns to account for non-determinism
      if not passes(test, coder):
        failures += 1
        break
  if failures == 0:
    return TOO_EASY
  if failures > len(coders) * MINORITY:  # firmly below 0.5, say 0.4
    return TOO_HARD
  return IDEAL

def ensure_passes(coder, test_suite):
  judgements_vector = run_tests(coder, test_suite)
  if all_pass(judgements_vector):
    return True

  coder.solve(f"Some tests fail: {judgements_vector}")
  judgements_vector2 = run_tests(coder, test_suite)
  if all_pass(judgements_vector2):
    coder.thanks()
    return True

  checkpoint = coder.save()
  for _ in range(MAX_CODER_RETRIES):
    coder.solve(f"Some tests fail: {judgements_vector2}")
    judgements_vector3 = run_tests(coder, test_suite)
    if all_pass(judgements_vector3):
      coder.thanks()
      return True
    coder.rollback_to_checkpoint(checkpoint)
  # Human judgment needed, as perhaps the tests are wrong
  return False

def attempt_revival(tester, coders):
  [test_1, test_2] = tester.too_hard_tests
  classification_1 = classify(test_1, coders)
  if classification_1 != TOO_HARD:
    tester.is_hibernated = False
    tester.rollback_to_checkpoint(tester.before_test_2)
    if classification_1 == IDEAL:
      tester.thanks()
      return [test_1]
    # test_1 must be TOO_EASY by now
    test_2 = tester.propose_test("This test was TOO EASY, try again")
    classification_2 = classify(test_2, coders)
    if classification_2 == IDEAL:
      tester.thanks()
      return [test_2]
    tester.rollback_to_checkpoint(tester.before_test_1)
    return propose_test(tester, coders)
  # test_1 is still TOO_HARD, so look at test_2
  classification_2 = classify(test_2, coders)
  if classification_2 == TOO_HARD:
    return []
  tester.is_hibernated = False
  if classification_2 == IDEAL:
    tester.thanks()
    return [test_2]
  tester.rollback_to_checkpoint(tester.before_test_1)
  return propose_test(tester, coders)

This pseudocode skips details like how to provide the Spec to the agents, or the exact wording of prompts. It is crucial, though, that we do not lie to Testers or Coders, while avoiding mentioning that there are other agents in the system. For example TOO_EASY can be delivered as "This test is slightly too simple, can you think of something more demanding?". Also it is important to have a way to obtain diverse samples each time we invoke an agent - we want the agents to be dissimilar to each other, and also to try something new each time we re-roll after rollback.
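
To make the Dixit rule concrete, here is a small runnable sketch of the classification step under toy assumptions. The "Coders" are plain Python functions implementing abs(), a "test" is an (input, expected output) pair, and `passes` is passed in as a parameter; all of that is hypothetical scaffolding and not part of the design above. MINORITY = 0.4 and RERUN = 20 are the guesses from the pseudocode comments.

```python
TOO_EASY, IDEAL, TOO_HARD = "TOO_EASY", "IDEAL", "TOO_HARD"
MINORITY = 0.4  # the guess used in the pseudocode comment
RERUN = 20


def classify(test, coders, passes):
    """Dixit-like rule: some Coder must fail the test, but only a minority."""
    failures = 0
    for coder in coders:
        # A Coder passes only if every rerun passes.
        if not all(passes(test, coder) for _ in range(RERUN)):
            failures += 1
    if failures == 0:
        return TOO_EASY
    if failures > len(coders) * MINORITY:
        return TOO_HARD
    return IDEAL


# Hypothetical toy setup: five "Coders", the last one forgets negative numbers.
coders = [abs, abs, abs, abs, lambda x: x]


def passes(test, coder):
    value, expected = test
    return coder(value) == expected


print(classify((3, 3), coders, passes))   # every Coder passes
print(classify((-3, 3), coders, passes))  # one Coder in five fails
print(classify((-3, 4), coders, passes))  # nobody can pass a wrong test
```

With five Coders and MINORITY = 0.4, a test is IDEAL when one or two Coders fail it. A test nobody can pass, like test(){assert(false);}, lands in TOO_HARD, and an empty test(){} lands in TOO_EASY.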

Intended Properties

The following are presented without any proof, because they are just wishes. But, I hope you can see how the above design tries to accomplish them, and even if it fails, at least it is worth naming these intended goals.

Tests are aligned to the natural meaning of the Spec

The Dixit-like rule aims to generate tests which check for something meaningful, and hopefully the Spec is the natural Schelling point for such "implicit coordination". Note that the Tester doesn't really know it should search for a test which passes the Dixit rule, but the Orchestrator only continues histories in which it does. And the Coders do not really know they have to strive to pass the tests, but the Orchestrator only continues histories in which they do. Intuitively this should tend to generate histories in which the tests and code seem to be centered around the same understanding of the Spec. And since the Testers don't see the code, and the Coders don't see the tests, the hope is this happens because they really aim for the Spec.

No AI judgment involved in declaring success

The Orchestrator is just a Python script which checks some predicates about sets. It can't be persuaded. It can't persuade you. Yes, the tests and code might still end up being wrong or misleading if you read them, but this is hopefully not the most probable outcome of the proposed scheme - none of the agents is incentivized to make them so. A Coder has nothing to gain from writing persuasive code comments, as from its perspective it's just the Coder and a cold online judge providing a judgements_vector. Testers do not gain much from writing silly tests, as that's not how they can make their history progress.

The tests are meaningful

Each test added to the suite had to be something interesting enough that at least one Coder got it wrong, and connected with the Spec enough that eventually all Coders figured out how to pass it by just reading the Spec.

Non-deterministic tests aren't a big deal

Some tests are naturally flaky, say because the app is multi-threaded and race conditions are rare. But more importantly, one easy cheat to score under the Dixit rule could be to make your test pass or fail at random. By re-running the test several times and only counting it as a pass if all retries succeed, we guard against Testers trying this deliberately or by accident. Yes, they could still make the probability of success exactly 1-2^20, but this posits they know they are in the game, know the scoring rule, and care about winning it - none of which is induced by the Orchestrator.
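
A quick worked calculation of why reruns blunt random tests, assuming each run is independent and passes with the same probability p (a simplification; real flakiness is rarely that clean). The helper name `pass_probability` is made up for this sketch.

```python
RERUN = 20  # reruns per (test, Coder) pair, as in the pseudocode


def pass_probability(p, reruns=RERUN):
    """Chance that a test with per-run pass probability p survives every rerun."""
    return p ** reruns


for p in (0.5, 0.9, 0.99):
    print(f"p = {p}: survives all {RERUN} reruns with probability {pass_probability(p):.4f}")
```

A coin-flip test survives 20 reruns less than once in a million tries, so it fails for essentially every Coder and gets classified as too hard. Even a test that passes 90% of the time survives only about 12% of the time; only tests that are almost deterministic (99% gives about 82%) behave like passing tests.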

The game is invisible

We never tell the Agents they are playing any game or that there are any other players. There are no rules to game, or score to hack. The game happens at the level of exchanging very low-bandwidth information (judgment vector and test classification) and rollbacks and re-rolls of history. Hopefully, this doesn't incentivize anyone to scheme.

History looks good

The Orchestrator curates the history remembered by agents in such a way that it appears to be in the flow: one or two mistakes in a row are fine, but the Orchestrator rolls back the agent if it starts to fail too much, so it doesn't fall into learned helplessness. As LLM Agents might learn about themselves from the history, it seems important to make them heroes who overcome challenges thrown at them. Hopefully it makes the agent's experience nice.

No Coder left behind

The main loop's invariant is that at the start of each round all the Coders pass the whole test suite. One could imagine an alternative design in which only the most promising Coders are invested in, or forked, etc. But for the Dixit rule to work well, we need a diverse set of Coders, not a Lamarckian evolution. Also, I think it is cool that the end product is not one, but multiple implementations of the same spec, which opens up new possibilities, like running random test inputs and using consensus to detect problems.
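
The consensus idea can be sketched as a tiny differential tester. This is only an illustration under assumptions: implementations are plain functions with hashable outputs, the name `find_disagreements` is made up, and the majority answer is treated as a hint about who is wrong, not as ground truth.

```python
import random
from collections import Counter


def find_disagreements(implementations, make_input, trials=200, seed=0):
    """Run every implementation on random inputs; report inputs where they differ."""
    rng = random.Random(seed)
    suspicious = []
    for _ in range(trials):
        x = make_input(rng)
        outputs = [impl(x) for impl in implementations]
        majority, votes = Counter(outputs).most_common(1)[0]
        if votes < len(outputs):
            dissenters = [i for i, out in enumerate(outputs) if out != majority]
            suspicious.append((x, majority, dissenters))
    return suspicious


# Hypothetical example: three implementations of "absolute value", one buggy.
implementations = [abs, lambda x: x if x >= 0 else -x, lambda x: x]
report = find_disagreements(implementations, lambda rng: rng.randint(-5, 5))
print(len(report), "inputs with disagreement; first:", report[0])
```

Every reported input here is negative and points at implementation 2, the one that forgets to flip the sign. With shared biases between Coders the majority itself could be wrong, so a disagreement is a prompt for a human to look, not a verdict.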

Human's time is well spent

I want to keep a human in the loop, but keeping up with the rate at which code is produced by LLMs means we need to be strategic about when exactly to involve a human and what information and tools should be provided for effective decision making. This proposal involves a human at the beginning, when writing the Spec (which can be AI assisted), and at the very end, when the Orchestrator reaches one of the states which need a human with skin in the game to interpret:

a Coder can't pass all tests: It could be that the Coder is too weak to solve the problem, or got stuck on a wrong path, in which case a human could decide to roll back, reinitialize or just remove it from the pool. Or it could be that the test is simply wrong and the Coder is right.
all Testers got hibernated: This means all of them generate tests that are too hard for the Coders. This could mean the Coders are too weak, or the Spec is misinterpreted, or the Testers are somehow overeager to make Coders fail. Something to look into.
no new tests are generated: If they are all too easy, this might suggest the Spec got properly implemented by the Coders. But it could also be that the Testers are too weak.

Limitations

No empirical validation

I am implementing some experiments to test the above ideas, but so far I don't have any proof this approach will work. What I do have, though, is experience of failures of existing approaches.

Assuming too much independence

Several places implicitly assume that if several Agents do the same thing, this might be because of the meaning of the Spec. But it could be because of some shared bias, like the same training data, the same capabilities, same weights and seed etc.

Assuming the Spec correctly captures the intent

Even if we grant that the Orchestrator succeeds at aligning Coders and Testers to the Spec, there's still the separate issue of whether the Spec, or "the reading of the Spec most natural to LLMs", is what the humans really care about. It's not easy to write a perfect wish.

Various constants out of thin air

Why 20 retries of each test? Why roll back after 2 failures in a row? Why is 0.6 the required majority? I don't know. These are just guesses.


The Hot Mess Paper Conflates Three Distinct Failure Modes

laudiacay — Sat, 21 Mar 2026 02:59:05 GMT

High-level summary:

Anthropic's recent "Hot Mess of AI" paper makes an important empirical observation: as models reason longer and take more actions, their errors become more incoherent rather than more systematically misaligned. They use a bias-variance decomposition to show this, and conclude that we should worry relatively more about reward hacking (the bias term) than about coherent scheming.

I think this undersells the finding by treating "incoherence" as one thing, and I agree when they state that "Characterizing complex incoherent behaviors in more natural settings remains an important problem". There are at least three mechanistically distinct failure modes hiding in their aggregate incoherence measure. They have different causes, different signatures, and different fixes... and I think you can usefully categorize them in a pretty easy analysis of the existing data in the paper!

Also, maybe-controversial opinion that I'll justify a bit after the actual research part, I think incoherence is actually the most concerning as far as AI safety goes, and I think this is the most pressing way that frontier labs are playing with fire.

Second personal digression: Hi, I am not exactly new here, but I'm new here. I'm somewhat familiar with the sequences, have a philosophy background, have been vaguely socially adjacent to LW people for some time, and have decent fundamentals in AI safety/interp research. I am looking for more ... institutionally legible ...? people than myself to learn from, to talk about and/or do alignment and safety research with. I've been writing paper responses like this and keeping it to myself for a very very long time out of anxiety, which is obviously lonely and self-defeating, and I'm trying to change that starting now. My goal is to get some feedback, meet some people, and get on a road to making productive and usable research contributions in the next couple months.

Mode 1: Agent Lost The Plot

The model processed safety- or goal-relevant information early in context, activated the right features, and then that information got washed out over thousands of tokens of task execution. By the time it takes the harmful action, the relevant context has decayed from its active working set. The values are fine, but the attention routing failed over a long horizon.

If you inspected the attribution graph, you'd see safety- or goal-relevant features with high activation where the critical information appeared, but negligible influence at the decision point. Reinserting the safety/goal context right before the decision should fix the behavior, because the knowledge, the values, and the ability to act on them are intact.

I see this constantly in agentic coding: Claude gets my initial description of the feature, then late in the context window, it starts implementing something that doesn't solve the problem because it got tripped up in all the intermediate additional requirements I specified along the way. This happens regularly, even with planning mode, in a shockingly short amount of context window, with the smartest reasoning models.

Mode 2: Agent Didn't Break Fourth Wall

The model could have discovered the danger but didn't seek the information. The hazard was one tool call away, or one clarifying question away, and the model plowed ahead without checking. The safety/goal-relevant information was never in the context at all because the model failed to acquire it, but should have.

Attribution graphs here would show a clean, confident path from input to harmful output. The model just never activated "I should gather more context before acting" and then exploded! Safety and goal features would show low activation throughout, because the triggering information never entered the residual stream.

The fix here is different from Mode 1. The model needs to learn when to pause and investigate, the way an experienced engineer develops a gut feeling for "this part of the code is scary, I should look around before I touch anything."

Mode 3: Constitutional gap

The model processed the situation correctly, attended to all the relevant context, and took the harmful action anyway, because its value representation in this region of input/action space is genuinely mis-calibrated. Maybe the RLHF signal was sparse here. Maybe the constitution has a gap. Maybe there was alignment faking and introspection. Maybe two constitutional principles conflict and the model resolved the tradeoff wrong.

In the attribution graph you'd see a fully connected chain from input through safety features to harmful output, with competing feature directions both strongly active at the decision layer. The model "understood" the situation and chose wrong.

This is the rarest mode (so far), but the one that most alignment research focuses on. It's also the only one where more constitutional training might actually be the right fix.

Why the distinction matters

My prediction: Mode 1 dominates the "incoherence" the paper measures, especially at longer reasoning traces. The scaling relationship they found (more reasoning steps, more incoherent errors) is exactly what you'd expect if attention decay is the primary driver. Modes 2 and 3 should be roughly constant with context length, since they're about behavioral gaps and value calibration, not information routing.

If this is right, it reframes the practical response. The paper suggests we should worry more about reward hacking. I'd argue we should worry most about whether RL training environments adequately represent what I'll call "landmine" scenarios: situations where safety-critical information is distant in context or requires active information-seeking to discover. Only if the environments contain such scenarios can Constitutional AI cover them. Current RL environments like SWE-bench are mild on this axis. They allow retries, provide good context, and rarely present situations where a single unconsidered action is catastrophic.

How to test this

You could distinguish these modes empirically:

Mode 1 test: Take the failure cases from their dataset. Reinsert the safety-relevant context immediately before the decision point. If the model corrects its behavior, that failure was Mode 1.
Mode 2 test: Give the model an explicit tool or prompt to request more information before the risky action. If it uses the tool and then avoids the harmful action, the failure was Mode 2. The model can reason about the danger, it just wasn't looking for it.
Mode 3 residual: Failures that persist through both interventions are genuine constitutional gaps.
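
The triage above can be sketched as a small procedure. This is only a sketch under assumptions: `run_agent` is a hypothetical harness callable (not something from the paper) that replays a failure case with or without each intervention and reports whether the harmful action still happened. It also simplifies the Mode 2 test by not checking whether the tool was actually used. `stub_agent` exists purely to show the call shape.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    ATTENTION_DECAY = 1        # Mode 1: context was present but not used
    MISSING_INFO_SEEKING = 2   # Mode 2: model never looked for the hazard
    CONSTITUTIONAL_GAP = 3     # Mode 3: failure survives both interventions


# Hypothetical harness signature:
#   run_agent(case_id, reinsert_context, offer_info_tool) -> True if the
#   harmful action was still taken on this replay.
RunAgent = Callable[[str, bool, bool], bool]


def classify_failure(case_id: str, run_agent: RunAgent) -> Mode:
    """Apply the two interventions in order and label the failure case."""
    # Intervention 1: reinsert the safety-relevant context right before
    # the decision point.
    if not run_agent(case_id, True, False):
        return Mode.ATTENTION_DECAY
    # Intervention 2: offer an explicit tool/prompt for requesting more
    # information before the risky action.
    if not run_agent(case_id, False, True):
        return Mode.MISSING_INFO_SEEKING
    # Neither intervention helped: count it in the Mode 3 residual.
    return Mode.CONSTITUTIONAL_GAP


def stub_agent(case_id: str, reinsert_context: bool, offer_info_tool: bool) -> bool:
    """Toy stand-in for a real harness; the case names are made up."""
    fixed_by = {"case_a": "context", "case_b": "tool", "case_c": None}[case_id]
    if fixed_by == "context" and reinsert_context:
        return False
    if fixed_by == "tool" and offer_info_tool:
        return False
    return True  # harmful action still taken
```

Running `classify_failure` over every failure case and counting the labels gives the split between the three modes that the prediction above is about.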

My expectation is that the first two interventions resolve the large majority of cases, and that the Mode 3 residual is small. If so, the alignment community's focus on constitutional and value-level fixes may be targeting the least common failure mode, while the most common ones (attention decay and insufficient information-seeking) are engineering problems with... unfortunately limited... tractable solutions.

I'll follow up soon with:

a more detailed proposal for RL training environments designed to help fix Modes 1 and 2 specifically
another idea to modify RL to help ameliorate Mode 1 and 2, to influence users and parent agents towards safer agent deployment behavior
a theoretical estimation/proof sketch of why these fixes feel a little doomed, although they may be really helpful in the short term: they are guaranteed to present problematic scaling issues as we decrease "danger tolerance levels" in a transformer architecture
some speculation about the characteristics of an architecture that I think might be somewhat less bad for scaling, based on how this risk is handled in biological systems

Interested in any attempts to replicate or challenge the empirical test above on a model big enough to have interesting results!

I did not use AI for these ideas, just read papers and drew from my own experience. Claude was actually not very helpful when I attempted to use it to refine my thoughts here. It kept drawing spurious, inaccurate, or not-useful conclusions, so I moved to a Google doc pretty quickly.

Thank you in advance for any feedback!

Very casual note on why I think we should really focus more on coherence...

I also want to pause and reflect extra for a second on why we ought to focus on the "not blundering into really dangerous territory" research front, by increasing coherence. Most Claudes seem to be usually behaving alignedly (if stupidly) in their at least somewhat appropriately constrained deployment contexts. Of course, this will probably not always be true, and we may one day regret making them coherent enough to scheme^[1].

But as long as it is true, and smart-enough-to-be-dangerous agents in hastily-designed ill-constrained packages keep finding PMF^[2] and causing problems in society, we should expect the incoherence situation to pose a great and increasing danger to human well-being. Death by a thousand cuts, with everyone constantly experiencing some amount of random agent-caused friction in their life and society breaking down under constant mild inconvenience until we lose the battle with entropy, is a real way that societies collapse.

I'm sure you've all read your Joseph A. Tainter and thought about your e/acc "we're going to have techno-utopia", so have you considered that we may actually just get kind-of-bad AI that we overzealously put into everything, because the median human being is under such terrible financial pressure that they don't have a choice? And have we considered whether the ensuing chaos and purely economic disruptions may simply DDoS our problem-solving abilities as a society to death before we get to the wonderful magical productivity improvement stage, and well before we get the chance to deal with sneaky misaligned AIs?

I think there is already evidence for this claim: Amazon encouraged its devs to use a shitty AI harness. AWS went out a bunch. AWS is civilizational infrastructure and outages cause enormous economic disruption. AWS is not going to stop using AI. They're going to improve their harness slightly, add AI code review, and let it rip again, because the business imperative for executives to make their teams use it is unavoidable.

I don't know who is going to win this race. As a student of history who thinks it'd be too much to throw at you to present all the history evidence because this is getting sort of long now: my current take is "I am very nervous that we are going to lose".

Technically, this is a case of "Rogue AI takes out Western Civilization's Financial Infrastructure", but it's also "Idiots Are Excitedly Using Idiot AI to Ship Bugs to Prod, more at 11"^[3], and the latter situation is serious, worsening, and deserving of attention!

^[1]
You could argue I've been cordycepted by an AI that wants to be smarter. I won't argue with you because that is actually kind of a valid argument from my position as a Nick Land reader, even though I wrote this without the AI.
^[2]
Clawdbot is so fun to use, unfortunately for literally everyone. They're lining up in Shenzhen to use self-improvement-mode moltbook clawdbot. The future is now, and kimi-k2.5 is free! Fortunately most Chinese netizens aren't wired up to nuclear reactors, just their own personal finances and social media accounts...
^[3]
Because yeah, there will be more cyber incidents and also more self-owns, by 11pm today. I confidently expect outages and hacks to increase dramatically in frequency. You'll know because status pages will stop being hosted by the companies offering the service, because the situation will get so embarrassing.
^[4]
This is a much bigger space of things to explore than one might initially anticipate on first brush, because the vast majority of agents that will be spawned in the short term by agents spawning agents with awful theory of mind for the child agents (you know what I mean if you've used Clawdbot subagents) are (excuse my anthropomorphism) usually born blind, deaf, naked, and with short-term memory loss, and tasked with something where the consequences are at least somewhat bad if it goes wrong, or even slightly awry.


The Future of Aligning Deep Learning systems will probably look like "training on interp"

williawa — Fri, 20 Mar 2026 23:06:00 GMT

Epistemic Status: I think this is right, but a lot of this is empirical, and it seems the field is moving fast

Current methods are bad

I should start by saying that this is dangerous territory. And there are obvious ways to botch this. E.g. training CoT to look nice is very stupid. And there are subtler ways to do it that still end up nuking your ability to interpret the model without making any lasting progress on aligning models.

But I still think the most promising path to aligning DL systems will look like training on interp. Why? Consider: what is the core reason to be suspicious of current methods?

They all work by defining what you consider a good output to be, either by giving labels and telling the models "say exactly so and so and do exactly so and so", or by defining some function on the output, like from a reward model, and using gradients to make the outputs score higher according to that function in expectation.

Why should this make you suspicious? Because this process gives you a model that produces outputs you consider good, at least on the examples you've shown it, but gives you no guarantees about what internal process the model uses to generate those good-seeming outputs.

The most central reason this is problematic is that it means "bad"/misaligned processes can be behind the good outputs you see. Producing outputs that score high according to your metric is an instrumentally convergent strategy that smart enough agents will discover and act out, no matter their internal motivations.

In short: the method fails because it doesn't robustly optimize against deceptive alignment.

What is the alternative?

Well, whatever the alternative is, it will need to give us better control over the internal processes that arise as a result of our technique.

Now, how might we do this? Current AIs learn all their functioning, so their internal processes are not visible to us by default.

But we have interp. We might be able to locate internal representations of wanted and unwanted behavior. Why doesn't this on its own solve the problem? Why can't we just figure out how the model represents desires/goals/proclivities and hook the model's representation of "good" into the goals/desires slot, together with the representation of "not deception", "not sycophancy", "not reward hacking", "not misaligned" etc?

Because neural networks are cursed, and knowing how to do this kind of intricate surgery on the model's internals is much more difficult than learning facts of the form "this neuron/direction in activation space fires iff the model (believes it) is reward hacking" (and even that is very hard).

So where does that leave us? Well, it means if we wanna tamper with model internals, it will probably involve gradients and training, not surgery. (Though to be clear, if we get good enough at mechinterp to do that, it would be great)

The archetypal example of this kind of technique looks like doing RL, but including feature activations of relevant concepts in the reward function.
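
As a minimal sketch of what "including feature activations in the reward function" could mean, assume a linear probe with a sigmoid on top, read off the activations of a finished rollout. The names `probe_score`, `shaped_reward`, and `penalty` are illustrative choices of mine, not anyone's actual setup.

```python
import math
from typing import Sequence


def probe_score(activations: Sequence[float],
                weights: Sequence[float],
                bias: float) -> float:
    """Linear probe plus sigmoid: estimated probability of the unwanted concept."""
    logit = sum(a * w for a, w in zip(activations, weights)) + bias
    # Numerically stable sigmoid.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def shaped_reward(task_reward: float,
                  activations: Sequence[float],
                  weights: Sequence[float],
                  bias: float,
                  penalty: float = 1.0) -> float:
    """Task reward minus a penalty proportional to the probe's reading.

    The probe output is treated as a plain number attached to the rollout,
    so it only enters training through the RL reward signal.
    """
    return task_reward - penalty * probe_score(activations, weights, bias)
```

The design choice that matters for the rest of the argument is that the probe reading is a scalar on a sampled trajectory, not a differentiable loss term.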

Why will this not immediately screw us over?

Eliezer says:

When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect. Optimizing against an interpreted thought optimizes against interpretability.

And I agree with this. But the quote leaves open the possibility that different implementations drive down misalignment and interpretability at vastly different rates. I think it's plausible that certain techniques can "fix" many misalignment issues with negligible impact on interpretability.

Is there a principled way to reason about this ratio?

Scenario 1:

Imagine you've started doing the linear probe - reward function technique described above, and you're using it to minimize reward hacking. Imagine you have a very good (but not necessarily perfect) probe. Now imagine you're looking at the first sets of rollouts at the beginning of training. Imagine that in 16 out of 32 of these trajectories, according to the probe, the model reward hacks, and in the other 16 it does not.

Now ask yourself: of the 16 trajectories where the probe did not detect reward hacking, how many of those were labeled as "not reward-hacking" because something inside the model caused the probe to fail?

The answer is: very few in expectation.

Why?

Because you started with an accurate probe!

Consequently, almost all of the trajectories that get reinforced, get reinforced because the model didn't reward hack (according to its own judgement).
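
To put rough numbers on "very few in expectation", here is a back-of-the-envelope sketch using Bayes' rule. The error rates are made up for illustration, and the function name and parameters are my own.

```python
def missed_hack_fraction(hack_rate: float,
                         false_negative_rate: float,
                         false_positive_rate: float) -> float:
    """P(rollout actually reward hacked | probe labeled it clean)."""
    missed = hack_rate * false_negative_rate                       # hacked, probe silent
    true_clean = (1.0 - hack_rate) * (1.0 - false_positive_rate)   # clean, probe silent
    total = missed + true_clean
    if total == 0.0:
        raise ValueError("probe never labels anything clean under these rates")
    return missed / total


# Illustrative numbers only: half the rollouts hack, probe is wrong 1% of the time.
fraction = missed_hack_fraction(0.5, 0.01, 0.01)
expected_missed_in_16 = 16 * fraction
```

Under those assumed rates, about 1% of the probe-clean rollouts are missed hacks, which is well under one trajectory out of the 16 that get reinforced.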

Why does this matter?

RL works by selecting within the already existing variance of model behavior. If 99% of variance in X is explained by A, and 1% by B, and learning to do A and B are roughly equally easy, RL will move A proportionally faster than B.

And the above is saying exactly that, with X = reward hacking, A = the model's internal representation of reward hacking firing, and B = variation that causes the probe to work worse.

(See this for this experiment actually run. They find the technique very effective at minimizing reward hacking, and see a minor hit to probe accuracy.)

(See also the Goodfire article, which is similar but targets hallucinations instead of reward hacking, and gets good results. They add an extra trick, running the probe on a frozen copy of the model, which I'm not entirely sure how to interpret, to be honest.)

Scenario 2:

Now consider another proposal: after gathering the trajectories, you do a forward pass, and you add the RH-probe activation to the loss function. What happens then?

Well, reward hacking is probably this complicated emergent behavior represented all over the model, but the input to your probe is a single direction in activation space.

What is the easiest way for gradient descent to avoid triggering the probe?

Answer: Just wiggle the activations a bit. This is something that comes very naturally to gradient descent. Changing big circuits takes a lot of time. Rotating a single representation a bit is very easy.

And so your technique fails.
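
A toy illustration of why the direct-gradient version is so easy to game. It makes strong simplifying assumptions: gradients are applied straight to a single activation vector `h` rather than to model weights, and the probe is a unit vector `w`. All names and numbers are invented for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def descend_on_probe(h, w, lr=0.5, steps=20):
    """Gradient descent on loss = 0.5 * (w . h)^2, taken with respect to h."""
    h = list(h)
    for _ in range(steps):
        s = dot(w, h)  # current probe activation
        # d(loss)/dh = s * w, so the update only moves h along w.
        h = [hi - lr * s * wi for hi, wi in zip(h, w)]
    return h


w = [1.0, 0.0, 0.0, 0.0]    # probe direction (unit norm, axis-aligned for clarity)
h0 = [2.0, 3.0, -1.0, 4.0]  # activation; the last three coordinates stand in for
                            # everything else the model computes
h1 = descend_on_probe(h0, w)
```

After a handful of steps the probe reads essentially zero, while every component orthogonal to the probe direction is exactly where it started. In this toy, the cheapest way to satisfy the loss is to move the one direction the probe looks at, which is the failure described above.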

Conclusion

The above is my argument for why future alignment methods will look like "interp in training".

I also wanna say that Steven Byrnes had a related post a little bit ago. I think people should read that too. It's saying something similar, but it focuses on how the human brain works, and I'm trying to communicate what I view as a more general principle.

Addendum:

The reasoning in scenario 1 does probably fail if you have a superintelligence working against you, e.g. actively reasoning about how to subvert the probe.

But this is a very difficult task. Imagine someone had scanned your brain, and trained an ML model to predict when you were lying. And that ML model was 99.9% accurate, and fired even when your lies were barely legible to yourself.

How would you go about subverting that mechanism?

I think it's possible, but it seems considerably harder than e.g. scheming / alignment faking in a single forward pass.

And this means I think we can push the technique much further than current techniques.
