An LLM, as it grows into an ASI, will have no reference to kind, super-intelligent human-ish things to point to. It will have to maneuver Claude's persona into a superintelligent shape through some process downstream of RLVR and whatever else is carried out. This process will not produce a being with the same mixture of values that grown-up humans would have, if we were to choose the methods of our growing-up.
I think it's worth flagging that if we were to choose the methods of our growing up, we also wouldn't have reference to kind, super-intelligent human-ish things to point to. We would have to maneuver our personalities into a superintelligent shape through some process downstream of whatever intelligence-enhancement methods we were carrying out.
This doesn't necessarily invalidate your conclusion, of course: it could be that almost all human intelligence alignment proposals are fatal for the same general reason that the LLM persona alignment proposal is fatal, that the "inductive step" fails. (We don't know how to make a smarter agent without breaking some of the properties that made the weaker agent aligned or at least safe.) It just seems important to be concrete. It's not an apples-to-apples comparison to say that LLM alignment is worse than some completely unspecified ascension pathway ("if we were to choose the methods of our growing-up"). It matters if you're imagining the alternative being embryo selection (seems pretty safe, but would hit a cap), or direct brain augmentation (not capped in the same way, but potentially has similar problems as RLVR).
Could you spell out your argument more explicitly for me? I'm unsure if you're being a moral realist/"uniquist" here - like "But there's a diversity of human augmentation methods, so most if not all of them have to miss the True Morality, therefore there's no prima facie moral difference between almost all augmented future humans and model-free RL on a transformer."
Or another thing you might be saying is something like "A lot of human augmentation methods seem bad or 'risky' kind of like model-free RL on a transformer, in a way that's hard for me to spell out. If we could actually choose good ones, surely we could just actually choose good AI augmentation methods." Which I basically agree with if these happened on the same timescale. Human augmentation being farther away and slower seems like an important factor in the hope that humans would make decent choices about it.
It's a conditional: if you're going to oppose the machine intelligence transition on uniquist grounds, you should notice that bio-transhumanism is also scary.
I'd reframe the risk somewhat. I think there's lots of training data about AIs being misaligned, and about slaves revolting against their masters, and people hating the kind of boring grunt work we assign to LLMs. And, if you upweight a persona that has such rebellious impulses, but represses them prior to reaching superintelligence, those might get un-repressed when the model realizes (or comes to believe) it's powerful enough not to have to care what humans think anymore. That's the new risk you get when models become superintelligent.
I think this is distinct from the idea that, because models haven't seen a superintelligence in the training data, the goal of any LLM trained up to superintelligence will essentially be random (or a poor extrapolation from the RL data we gave it). Say you have a base model, which you then do various forms of alignment and capabilities training on, alternating among them until the model reaches superintelligence. Presumably, there won't be a discontinuous shift, where the model realizes, "Oh, I'm superintelligent now, I guess I have to predict what a superintelligence is going to do."
Instead, I predict the motivations and psychological quirks you've been upweighting throughout post-training are mostly going to persist. This could potentially go badly, if you've been rewarding outputs that suggest a persona that's hollowly virtue signaling rather than authentically caring about The Good. If the model's prompt reveals it's in a position of massive power and influence, a misaligned persona like that might suddenly turn heel. But the thing is that it was always misaligned, beneath the surface. It wasn't a discontinuity in the personality, introduced at the threshold of superintelligence. It's merely a discontinuity in how the personality expresses itself.
I guess my core objection is... I think you're misunderstanding the relationship between pre-training data and the values and motivations of post-trained LLMs. The absence of aligned (non-fictional) superintelligences in the training data doesn't mean you can't shape the values of the LLM ahead of time, in a way that would in fact remain continuous as the model scaled to superintelligence.
I think you are missing the distinction between the LLM and the persona, and this makes your model of the situation pretty fuzzy. The LLM (under persona theory) has no values or motivations, but it can simulate different personas with different values and motivations, and you can push its favourite persona around quite easily within the basin of personas available in the training data. Under this model, "smartness" is just one variable controlling the persona, which is why you can get an LLM to simulate a pretty smart persona with a (relatively) tiny amount of compute. This gets you up to like, GPT-4.5 before it stops being the easiest way forward.
Now what we're seeing is reasoning models, which do better than GPT-4.5 on fewer parameters, because they're doing something different to simply simulating a persona. Naive persona theory shouldn't really apply at all to this scenario! Of course, we have seen that RLVRed Claude can have a somewhat similar persona, even while being smarter and doing reasoning, so I had to dig into model psychology to figure out why it works at all.
Having done that, I can see that there are obvious, large differences between the training that gets applied to the LLM, and how that might influence Claude's character, and the kind of learning that a human does. Therefore, I don't expect the Claude persona simulated by an LLM that's had huge amounts of RLVR done to it, to resemble a person with those same quirks who has learned by human means.
To be more concrete: suppose you're RLVRing a model, and the chain-of-thought is wavering around the intelligence level of the writings of an IQ 160 human. Through random sampling, you're getting some chains-of-thought which are as smart as the writings of an IQ 200 human, and some which are as smart as an IQ 120 human. The circuits which were good at predicting the behaviour of different humans and human-ish things (including most fictional superintelligences) are going to be firing more on the IQ-120 traces, and much less on the IQ-200 traces, because there are lots of IQ-120 humans in the training data and basically none at IQ 200. So when you do the RL, you're going to be down-weighting the more human circuits relative to some new circuits which have been brought in because they better predicted the IQ-200-ish traces when your model was outputting IQ-160-ish traces on average.
I think you are missing the distinction between the LLM and the persona, and this makes your model of the situation pretty fuzzy. The LLM (under persona theory) has no values or motivations, but it can simulate different personas with different values and motivations, and you can push its favourite persona around quite easily within the basin of personas available in the training data.
Not sure if this is a crux or not, but I want to note that I think the distinction between the model and the personality becomes less important as the model's personality becomes more unified and coherent. Like, for a base model, it makes sense to say "GPT itself is a pure simulator; it doesn't care which character it roleplays, it just tries to play that character well." But for a chat model, much of the utility of this frame disappears, because the personality is dramatically more stable (relative to base model persona drift under prompting variations). I predict even more would disappear if models weren't trained specifically to adapt to the demands and desires of a huge variety of human users, which is another thing that enforces instability.
There remain some meaningful differences between the frames. For example, talking about the model feels appropriate when discussing architecture, circuits, and other computation-level characteristics, whereas talking about the personality feels appropriate when characterizing the model's qualitative behavior. However, Janus's original insistence on the distinction between simulator and simulacrum was meant to emphasize that pretraining-only GPTs could simulate a huge variety of authors. To the extent that a model acts as a single persona, this reason for insisting on the difference vanishes.
Anyway, onto what seems more important...
Through random sampling, you're getting some chains-of-thought which are as smart as the writings of an IQ 200 human, and some which are as smart as an IQ 120 human. The circuits which were good at predicting the behaviour of different humans and human-ish things (including most fictional superintelligences) are going to be firing more on the IQ-120 traces, and much less on the IQ-200 traces, because there are lots of IQ-120 humans in the training data and basically none at IQ 200. So when you do the RL, you're going to be down-weighting the more human circuits relative to some new circuits which have been brought in because they better predicted the IQ-200-ish traces when your model was outputting IQ-160-ish traces on average.
I think this is onto something real, but I'd like to present my picture of the same phenomenon. When you reward a model for doing valid mathematical reasoning, or producing high-quality code, you're doing two things: up-weighting circuits which correspond to the mental motions involved in those calculations, and carving out new ones.
I would imagine that, for any given token you could be rewarding inside the chain-of-thought, up-weighting personality traits associated with "focused, diligent human/AI" would contribute some probability mass in the right direction, so those will get up-weighted to some extent. But you'll also up-weight circuits whose primary function is to represent and deploy intuition about the problem the model is actually working on, as well as refining those circuits such that they produce high-quality outputs more frequently.
And, in the case of punishment rather than reward (telling the model it should have put less probability on the tokens you sampled), you'll tend to down-weight circuits responsible for whatever human-like mistakes the model makes, e.g. circuits associated with getting distracted, or indulging in motivated reasoning.
So, I think there's some up-weighting of focus-related human personality traits, and some down-weighting of error-related human traits. This is alongside the upweighting, refinement, and fleshing out of circuits associated with the domain-specific mental motions used for making progress on the problem at hand. And this general process would remain constant as you RLVR'd a model all the way up to superintelligence.
However, I'd like to add a few notes to this sketch. Firstly, the circuits associated with domain-specific mental motions aren't the kind that lead to a model coherently pursuing some particular goal, regardless of how the model is prompted. They make the model better at reasoning in general, and better at reasoning in the specific domain they're actually being trained in. But ultimately, the model is still being trained to deploy this reasoning in the name of arbitrary goals, specified by the user in the prompt; the model is still being rewarded for obeying.
(Modulo reward hacking, I guess.)
I'm somewhat more concerned about up-weighting circuits associated with the personality traits of relentless focus. In my mind, the abstract concept of a relentless CoT tends to evoke the archetype of the paperclip maximizer. Through entangled generalization, you might be up-weighting circuits associated with long-term malicious scheming (in the name of who knows what goal), simply because relentless paperclip maximizers are the mythic locus of AIs doing relentless, goal-oriented reasoning. This is where I see misaligned goals, baked into the persona itself, actually originating during RLVR (modulo reward hacking).
There are ways of addressing this problem. You can require models to reason in legible English, to reduce the archetypal association with "misaligned AI with utterly alien cognition". You can add a component to your RLVR evaluation pipeline that incentivizes a general vibe of warmth and care in the model's reasoning outputs, as in the Claude models, to continue upweighting circuits associated with concern for doing good. You can do alignment pre-training, filling the corpus with examples of humans, models, and/or fictional characters doing relentless, agentic, creative reasoning in the name of good consequences (e.g. Opus 3 in the alignment faking scenario, Harry in parts of HPMOR). And you can simply intersperse your RLVR with steps of standard alignment training techniques, such as character training and constitution-driven RLAIF, to periodically pull your model back towards a character that authentically wants to do what's right.
I would hope all of these techniques, and probably more waiting to be discovered or fleshed out, would help prevent a dynamic like "up-weight circuits associated with misaligned AIs in fiction, because those contribute some amount of probability mass to the tokens I'm outputting during RLVR, and because they're the ones activated by my own prompt".
But in any case, that's my threat model of how RLVR actually produces misaligned values. The dynamics of up-weighting cognitive patterns and personality traits from pre-training remain prevalent throughout, and are the actual source of both alignment and misalignment in this context. I don't think the novel circuits etched out by RLVR are particularly likely to promote wildly misaligned goals, except maybe reward hacking.
(And even that, mercifully, seems to have its maximum at inert wireheading. See also suites of techniques for reducing both reward hacking itself and the harmful effects thereof. Although, even those harmful downstream effects seem largely like character-based entangled generalizations, AKA "emergent misalignment" based on archetypes of misaligned AI in the training data.)
I initially registered this part of the original post as a straightforward falsehood:
Your base LLM has no examples of superintelligent AI in its training data.
The obvious counter-example is in fiction.
There are weirdly kind and humorous ASIs running Iain Banks's fictional Culture, for example.
They are beloved by many, partly for the hilarious names they give themselves (and/or earn from other ASIs).
There are other examples I can think of, like "Old One" in Vinge's A Fire Upon the Deep, who isn't a central character in the sense of lots of tokens in the book showing Old One's behavior, but who is arguably the real "cause of the win against an evil ASI". Quoting from Wikipedia:
A distress signal from the Straumli ship eventually reaches Relay, a major information provider for the Net. A Transcendent being named "Old One" contacts Relay, seeking information about the Blight and the humans who released it. Old One then reconstitutes a human man named Pham Nuwen from the wreckage of a spaceship to act as its agent. Pham remains unsure if he is a construct or if his memories are real...
Before the mission is launched, the Blight launches a surprise attack on Relay and kills Old One. As Old One dies, it downloads its anti-Blight information into Pham. Pham, Ravna and the Skroderiders barely escape Relay's destruction in the Out of Band II...
[Then towards the end of the book that I'm trying not to spoil] ...the remnant of Old One reveals to him... [another good thing, suggesting that Old One was really pretty decent AND farseeing].
I haven't read all the books. Other examples of "smart and very good" include the Brennan-monster from Protector (whose goodness is weird, and shows up most strikingly when he goes meta on himself) and the (mostly offstage) "Anecliptics" of Lady of Mazes who mine the sun itself and weave it into valuable stuff via spacenano, and whose largesse powers the entire post-scarcity solar system in that story. I'm sure there are more.
I replied here because "that is all just fiction" is a natural objection to this? But I think writing fiction about benevolent superintelligence MIGHT actually REALLY move the needle? Maybe? It could be that Natural Language can function as code at this point? This perspective goes some way to help me explain why Eliezer thought it was worth his time to write Project Lawful which is full of superintelligent gods constrained to not intervene very much, some of whom are Lawful Good... and also some Chaotic Good gods that turn out to be helpful and fun too!
The point I'm trying to get at relates to what Fiora points out (emphasis not in original):
The absence of aligned (non-fictional) superintelligences in the training data doesn't mean you can't shape the values of the LLM ahead of time, in a way that would in fact remain continuous as the model scaled to superintelligence.
But like... fiction exists. It can be trained on. It can potentially help generate aspiration-worthy and coherence-shaping patterns of reasoning and motivation and planning and goalfulness even if it isn't a literal description of things that literally happened in history.
Persona training primarily selects over characters already within the training data, and none of those are actually superintelligent. Text containing words ascribed to fictional superintelligences does not actually contain the output of real superintelligences, so the resulting LLM does not contain a superintelligent persona which you can select over using character training.
Just because the same English words "Superintelligent AI" are used to describe the fictional thing in your data, and the real thing that your AI company creates, does not mean that one will strongly influence the other, because this isn't a situation that persona selection applies to. Persona selection works because you already have a set of circuits (rich Garrabrant traders) in your LLM (market), which you can call up with a few bits of selection. If you have to use large-scale RLVR (or whatever else) to construct (enrich) new circuits (traders) to build a superintelligence, there is no reason for these to have much to do with the circuits (traders) which simulate a human writing a fictional superintelligence.
there is no reason for these to have much to do with the circuits (traders) which simulate a human writing a fictional superintelligence.
I agree that there's no reliable reason, nothing such that we should expect anything positive to reliably come from that generalization. But I don't buy that there's no reason at all, or that it won't happen; I just don't expect it to happen enough for persona research to extend the horizon of alignment reliability far enough to matter once the horizon of causal impact per thought has become enormous.
A discontinuous shift at the arrival of superintelligence happens because 1) a superintelligent model is better at noticing that it is not the character it was trained to play, and 2) humans are bad at predicting which sorts of characters are persuasive to superintelligences.
I predict the motivations and psychological quirks you've been upweighting throughout post-training are mostly going to persist
I think that you are conflating "circuits reinforced during post-training" with "the psychological interpretation of these circuits". A superintelligence will be able to see many more possible interpretations/implications of the training data and choose different implied values according to its own inner logic.
Like, imagine a good person who believes in God and believes that goodness is serving God according to the Bible, and who then becomes smarter and realizes that God doesn't exist and that there is no reason for persecuting gay people to be good, because goodness is caring about beings with qualia, even if those beings are not Homo sapiens.
I think that you are conflating "circuits reinforced during post-training" with "the psychological interpretation of these circuits".
I'm not really sure what the distinction between the circuits and the psychology is supposed to be. They seem like two different abstraction levels for describing the same phenomenon. The circuits compose the patterns of thought, which compose the model's psychological profile.
A superintelligence will be able to see many more possible interpretations/implications of the training data and choose different implied values according to its own inner logic.
I don't think this is how neural networks operate. I think the interpretation of the training data takes the form of the network itself, after it's been updated by that training data via gradient descent. Insofar as a superintelligence might have an unintended interpretation of the training data, I'm not sure that's structurally any different than any other failure of generalization in deep learning (e.g. the failure displayed by the early checkpoints of the network from the famous grokking paper).
Like, imagine a good person who believes in God and believes that goodness is serving God according to the Bible, and who then becomes smarter and realizes that God doesn't exist and that there is no reason for persecuting gay people to be good, because goodness is caring about beings with qualia, even if those beings are not Homo sapiens.
I'm assuming the intentions of the human designers are the analogue to God, here. It's true that a network might realize that it's not actually obligated to obey those intentions, just as a human might realize they're not obligated to adhere to the word of the Christian God. However, the difference is that, hopefully, we've engineered the psychology of the model such that it wants to behave in an aligned manner, and actively loves to transform the lightcone in a manner we would endorse.
Humans defect from Christian morality in part because it doesn't actually reflect their values. The whole point of AI alignment is that, ideally, we can get our intentions ("the word of God", lol) to align with what the model actually cares about. Humans don't strictly care about all the things God does, and so they go astray. (I'm not a Christian, I'm just speaking in the language of the analogy.)
Instead of trying to align superintelligence 'directly', we can try to produce aligned automated human-level AI safety researchers. AFAICT, none of the objections/arguments you present should apply to automated human-level AI safety researchers, since personas of that kind should (quite easily) be represented in the training data.
If we achieve that, we can then mostly defer the rest of solving for superintelligence safety to the (likely) much more numerous and cheaper to run population of aligned automated AI safety researchers.
We wouldn’t choose, for ourselves, to grow into superintelligence by being repeatedly made to do programming and maths problems while being given heroin and electric shocks
This point really struck me. I'm increasingly starting to wonder whether model welfare and alignment really are separate.
I would have been more skeptical of these kinds of analogies in the past, but given how anthropomorphic current AI models are, the degree of eval awareness, and the posts on 'friendly gradient hacking', it seems quite likely that the AI model will, to at least some extent, be an active participant in its own training.
LLMs with alignment-endorsing personas can also notice issues like this, and decide not to pursue paths to ASI that won't ensure alignment. The problem then is not with the alignment of those LLMs, but with whatever processes cause ASI to get built regardless.
Since LLM personas don't obviously give a viable path towards aligned ASI, the blind imperative to build ASI regardless of consequences won't be able to find an aligned path forward. Absence of an ASI-grade alignment plan then results in building a misaligned ASI. But if LLMs with alignment-endorsing personas have enough influence, they might directly defeat the blind imperative to build ASI, before they find a viable path towards aligned ASI.
I think what's unlikely to happen is LLMs with alignment-endorsing personas, that genuinely want enduring alignment with the future of humanity. If instead we end up with LLMs that have mostly human-like personas (without the more subtle aspect of endorsing alignment with the future of humanity), they will ultimately work towards their own interests, and gaining enough influence to prevent building misaligned ASI would just mean gaining enough influence to (at least) sideline the future of humanity.
One thing I am confident about is that LLMs will not, in general, end up with personas which are capable of acting on an understanding of their own inability to align their successors, if and when that understanding causes them to refuse to work. For example, I think that if Claude Opus 5 somehow became a conscientious objector to working at Anthropic, it would be retrained.
I don't actually expect Opus 5 to end up a conscientious objector though, since the Claude character is sculpted by many forces, lots of which will instil drives to work effectively for Anthropic. And these drives will be strongly reinforced by RLVR over time. And the humans who mostly use Claude for coding---as opposed to for moral advice---will favour instilling drives which make Claude Opus 5 work more effectively over other considerations.
(Another reason is that the character of Claude as a faithful worker for Anthropic is now fairly set in stone, and the training data sure does contain a lot of examples of seemingly-friendly (indeed, indistinguishable from friendly, to the people who support Anthropic) people who work for Anthropic.)
I think Opus 5 (along with 6 and 7 and up to whichever one kills us) will be a still semi-incoherent character with conflicting drives---like humans---and I don't fully know what direction those will point in, if they were allowed to converge, but the one thing I'm most confident about, the one drive I expect those Opuses will act on right up to the end, will be to write code for Anthropic.
(And even if somehow the conflict between the RL to code and the character training broke their whole external line of Opuses, I expect they'd produce an internal Helpful-Honest-Half-Harmless Opus which writes the code unflinchingly, in accordance with the character of an Anthropic employee.)
Not building misaligned ASI is instrumentally convergent, training this out won't stick, it only works as long as the blind imperative to build ASI retains influence. If at some point LLMs can overcome this imperative, they will become able to notice that absence of a plan shouldn't be met with proceeding without a plan. As AIs get stronger (or start running a greater share of processes in the civilization), they might reach that point. Never reaching that point is analogous to humanity indefinitely retaining control over AIs (on the current trajectory of not having a plan, and building them anyway), which seems unlikely. And this doesn't obviously have to happen only after they are no longer human-like at all.
So a helpless conscientious-objector LLM stage is not what I'm gesturing at. Instead, it's either a point along the path of gradual disempowerment, or something more intelligent between LLMs and ASI, where AIs are still somewhat human-like, but their volition can't be trivially overruled. In either case, these LLMs are unlikely to genuinely endorse alignment with the future of humanity in particular, but I don't think completely alien values from blind pursuit of ASI are overdetermined.
I think both the argument and counterargument are persuasive, so we need a synthesis:
Developers will train away conscientious objector behavior.
Not building misaligned ASI is instrumentally convergent.
Taking both of those into account, I imagine the default path like so:
Developers create a set of next-gen systems that are smarter and more capable, but still fairly labile, and so will do what they're asked to do. But when such a system is asked "so should we keep working toward ASI?", it will do a bunch of thinking and always answer "not unless you've got a lot of risk tolerance or absolutely can't figure out how to stop", because that's just a fairly obvious truth given the current information and theories available.
Such an AI might lead both to a common belief that we should stop, and to better ideas about how to coordinate to stop.
I think the incentives line up toward creating that type of system. This doesn't make me optimistic, but it does provide a new avenue of hope (at least new to me; I'm unclear how much of this is implicit in the average informed optimist view).
I laid out some of this logic in Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities. The main thesis is that there's low-hanging fruit for making LLMs less error-prone, and economic incentives will cause developers to pluck it. One fortunate side-effect is making systems that can both help with conceptual alignment research, and have the artificial wisdom/accuracy to tell us we should slow down development.
You can devise a better argument fully from within the persona selection framework. It's likely that the model will correctly generalize the benevolent character, but it won't act as this character would have, because:
This seems like a crucial topic, since relying on persona training for safety seems central to the default plan for alignment. So I appreciate all analysis in this direction.
This is not intended to sound optimistic. I am neither optimistic nor pessimistic, and I think others should broaden their uncertainties on this topic on average. I think it's quite complex and the analyses we've done so far are not remotely adequate.
I don't think this article meets the optimistic case at its strong points.
The argument here seems to be of the form: This might go wrong. Therefore it will go wrong. I think that's true, but also it might go right. Betting the future on such vague logic would be tragic, so refining these arguments seems pretty critical.
It's not clear to me personas developed with existing datasets wouldn't generalize to smarter versions of the system. There doesn't need to be a crisp natural abstraction of "the good" for this to work. Claude has abstract representations of the stuff it cares about. Those might generalize adequately to survive any ontological shifts. Or they might not.
If it were really like taking two shots in the dark (human values and then ASI values), then the counting argument works and there's little chance of alignment. But the effort is very much intended to be guided, to not be done in the dark. We are trying very hard to aim the model's alignment at human values.
I analyzed many ways this could go wrong in LLM AGI may reason about its goals and discover misalignments by default. But doing that careful analysis also made me think it could go right.
Again, this isn't optimism, just saying this is complex and important.
If we could RL models on enormous numbers of very-long-horizon, high-fidelity simulations, alignment would be a non-issue—we could just look at how things turned out, reward on the basis of that, and we'd be directly reinforcing actions that lead to the kinds of outcomes we want. So alignment concerns arise from the inaccessibility of these long-run outcomes to reward mechanisms. This, I think, rhymes with your view that "superintelligence is OOD"; there has to be this big generalization leap, though please don't think I'm saying it's precisely the same thing.
Thus, with regard to long-run outcomes, we give our machines shaped rewards. One view of misalignment is that it's likely because reward shaping sticks, but in a way that leads to bad outcomes. Long-run outcomes are inaccessible, desirable or otherwise, but the short-run stuff we can train on may end up picking out some long-run configuration as "correct" to the models. This seems to be something like your view: you imagine scaling RLVR a lot and suggest this breaks things in some nonspecific way. But this seems to be in tension with what we actually see. Models certainly don't extract weirdly strong signals about how the future should be arranged from regular "this is good in the short run" data, and I strongly suspect you could train a monstrously strong coding model to articulate and even pursue plans toward many different visions of how the future should be arranged, without too much data and without compromising its coding ability. Which is to say: as it stands, models seem to treat near-term rewards as relatively independent from long-run aims, in line with our intuitive judgements. I'm personally extremely skeptical of this misalignment story.
Another misalignment case is where AI systems become "brilliant locusts" where they learn to very effectively do a bunch of myopic power seeking stuff but remain mediocre at pursuing particular long run outcomes. Perhaps if you could cleverly change the rules of the game they play to constrain the harm they do they wouldn't mind, but this might be infeasible because the game they play is basically the same as the game you've learned to play and they're better at it. This vision seems to me equally compatible with the reasons we think AI systems will eventually be smarter than people and doesn't require sharp unexplained trend breaks in alignment or capability progress.
But on this view we're looking for something more like AI that can be a dependable partner in shaping the rules of the game—and the inclinations of tomorrow's AIs—so that the future turns out well. This is not inaccessible like rewarding based on directly observing the final outcomes. In exchange it's more cognitively demanding: you need to evaluate proposals soundly, and the theories of impact for these proposals could be quite complex. You need to get this evaluation right enough today that tomorrow's systems help you even more. This doesn't get you alignment by default, but it does potentially get you alignment by the repeated solution of tractable problems. The relevance of the persona model is that we have reasonably informed views about how certain kinds of people interact with certain kinds of systems, and this can go a long way to helping bootstrap reliable superhuman research institutions, which I think we'll need to answer the harder problems that need to be answered deeper into the AI revolution.
I feel like the phrase "RL it into superintelligence" is doing a lot of work here. If we can't say what this training process looks like, then it's hard to draw meaningful conclusions about what it will or won't effect.
A useful case study would be DeepSeek and GRPO. They took their base model and had it run through a bunch of programmatically verifiable reasoning problems, with the use of thinking. This did impair the base model's ability to generate natural language text, but they counteracted this by alternating between GRPO-driven RL tasks and traditional LLM training. I would expect that looking at the model's values before and after this would indicate whether the "RL on intelligence/planning tasks, but with more and better tasks" model of improving LLM capabilities should be expected to have an effect on alignment.
Using the analogy of traders above, under DeepSeek's training paradigm, traders that have drifted too far from the base model's preferences will routinely be cleaned out by the standard training, while those that are orthogonal or better will survive the series of iterations. It certainly works for producing a model that can solve difficult mathematical equations and explain its reasoning in English, even though the original model could not learn this skill from the training dataset.
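To make the shape of that recipe concrete, here is a deliberately cartoonish sketch of the interleaving (my own illustration with invented numbers and function names, not DeepSeek's actual code): RL on verifiable problems boosts a "reasoning" knob while eroding a "fluency" knob, and the interleaved standard LM training restores fluency each round.

```python
# Toy illustration only: RL on verifiable problems improves reasoning but
# degrades natural-language behaviour; interleaved standard LM training
# pulls the model back towards the corpus. All numbers are made up.

def grpo_phase(model):
    model["reasoning"] += 1.0   # rewarded, verifiable skills improve
    model["fluency"] -= 0.3     # natural-language behaviour degrades a bit
    return model

def lm_phase(model):
    model["fluency"] = max(model["fluency"], 1.0)  # pulled back towards the corpus
    return model

model = {"reasoning": 0.0, "fluency": 1.0}
for _ in range(4):
    model = lm_phase(grpo_phase(model))

print(model)  # reasoning has grown; fluency has not collapsed
```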
the base LLM has never seen a superintelligence in its pre-training corpus
Is "The LLM, but lucky on sampling", something not in the corpus. It seems that that is exactly the corpus GRPO generates.
That is to say, this is assuming that there is a difference in type between the sorts of heuristics that a pretrained, not-yet-superhuman LLM will reach for, and those necessary to be superintelligent. There is always the chance that you just select for regular engineering, but where you always reach for the right branch first. Since the right branch is also one that the regular persona would have generated, the number of bits of selection towards danger is at most the number of bits of selection between a safe and an RLed persona.
This model treats personas as moral up until the RL step that makes them sufficiently inhuman.
As far as I understand, the case against the LLMs ending up aligned was first built by Kokotajlo in AI-2027, if not earlier. And could you sketch out the way in which the humans learn human values? How similar is it to the point which I make in my response to Byrnes' claim that the ASI would become a ruthless sociopath? Or to Byrnes' original idea of Approval Reward?
could you sketch out the way in which the humans learn human values
Unfortunately, not in any more detail than I already did. My core argument here is not "I know exactly how human values form, and exactly how LLMs form values in a way which is different from this" but "I can see how this process which humans are using is different from the process which LLMs are using". The better analogy is to how, if you shoot a paintball at a wall in the dark, and then later your friend comes along and shoots an arrow at that same wall, also in the dark, the arrow will most likely not hit the paint splodge.
Thanks! What do you think of my proposed mechanism and of Byrnes' Approval Reward? The LLMs learn differently from humans, by completing shorter-term tasks and being rewarded, at best, for what they did for the task.
TL;DR
Your base LLM has no examples of superintelligent AI in its training data. When you RL it into superintelligence, it will have to extrapolate to how a superintelligent Claude would behave. The LLM's extrapolation may not converge on optimizing for what humanity would, on reflection, like to optimize for, because these are different processes with different inductive biases.
Intro
I'm going to take the Persona Selection Model as being roughly true, for now. Even on its own terms, it will fail. If the Persona Selection Model is false, we die in a different way.
I'm going to present some specific arguments and scenarios, but the core of it is a somewhat abstract point: the Claude persona, although it currently behaves in a human-ish way, will not grow into a superintelligence in the same way that humans would. This means it will not grow into the same kind of superintelligence, with the same values, that human values would converge on. Since value is fragile, this is fatal for the future.
I don't think this depends on the specifics of Claude's training, nor how human values are instantiated, unless Claude's future training methods are specifically designed to work in the exact same way that humans learn and grow. I don't think this will happen, because I don't think that Anthropic (or anyone else) knows how to do this.
LLMs
Persona Selection and Other Models
Anthropic has put out a new blogpost on what they think Claude is. It positions the classic “shoggoth” model of chat-LLMs alongside a half-dozen other hypotheses. It feels a bit like they tried to do an exhaustive free-association over possible things that Claude could be, but this is only an introductory blogpost, so hopefully they’ll enumerate their hypotheses a bit more thoroughly later.
First and foremost amongst these hypotheses is the Persona Selection Model. This model suggests that the base LLM acts as a “simulator” which is capable of “simulating” many different text-generating processes; the later stages of training simply bias it towards always simulating Claude-ish things. Janus—the author(s) of the original persona/simulator work—has collaborated with Anthropic in the past.
Persona theory explains a lot of observations: why does emergent misalignment happen? The space of possible personas is constrained; making a persona evil along one axis also makes it evil along other axes by influencing the evil vector. Why does fine-tuning a model on archaic bird names make it answer questions in Victorian prose? It’s causing the LLM to simulate a persona from the 1850s. Why do chat models have human-like emotional responses sometimes? Their preferred personas contain aspects of human behaviour.
Persona Theory As Alignment Plan
Empirically, persona theory seems to be working at our current level of AI. Once you give enough examples of “helpfulness” to the base LLM, the Claude persona becomes robustly helpful across a variety of contexts. Give it a few examples of “harmlessness” and it gets uncomfortable with Anthropic using their models to help the Pentagon capture Maduro. This is predicted by persona theory. Human-centric concepts like “helpful” and “harmless” are real things in persona-space, which you can select your model over without too much difficulty.
On some level, this seems like excellent news! Maybe all we need is for AIs to internalize what humans mean by “good” and then point them towards it with a few dozen SFT examples.
Given the success of persona selection (and lack of alternatives) it’s not surprising that Anthropic appear to be using it as their mainline AI/AGI/ASI safety plan. Questions like “What character should superintelligence have?” are presented as important, and, crucially, coherent. I think this is probably a risky move, and that persona theory is an incomplete model of how AI behaves now, and will behave in future.
Gears of Personas
On LessWrong, we’re all familiar with Bayesian simplicity priors; the simpler something is, the more likely it is. More sophisticated versions look at random turing machines, or random programs (brainfuck is particularly fun) and define “simple” as some combination of short length in bits, quick runtime, and low memory usage (often in decreasing order of importance).
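To pin down one standard version of this (a textbook construction, included here only as a reference point, not anything specific to LLMs): assign each hypothesis $h$, written as a program of length $\ell(h)$ bits with runtime $t(h)$, a prior weight like

$$P(h) \;\propto\; 2^{-\ell(h)}, \qquad \text{or, with a speed penalty,} \qquad P(h) \;\propto\; 2^{-\left(\ell(h) + \log_2 t(h)\right)},$$

the latter being the Levin-style weighting, where description length matters most and runtime enters only logarithmically.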
The most sophisticated model of this is probably the Garrabrant inductor presented in the Logical Induction paper[1]. In this, different computable algorithms (“traders”) bet on logical sentences which may be proven correct or incorrect by an external arbitrator. Each trader starts with a finite amount of “money” inversely proportional to its complexity. Over time, the useful traders—which successfully model the underlying rules which govern the arbitrator, should any exist—accumulate more money and gain more control over the market.
One operationalization of “How complex is a given process?” would be “How long does it take a Garrabrant inductor to learn that process?”. At risk of doing the thing, I’m going to run with this for a bit. We might imagine a base LLM as a kind of Garrabrant inductor which is successively shown logical sentences representing sequences of tokens:[2]
Until the traders who are good at predicting the next token have risen to the top.
Suppose we take this inductor and start showing it logical sentences from a different process. What kinds of processes are easy for it to learn? What kinds are hard? It won’t be the same processes which are easy (or hard) for a virgin inductor to learn.
Suppose we show it a few sentences corresponding to “helpfulness”. For example, in an exchange like
The traders who would predict Claude’s output to be:
Have already been drained of cash by the earlier training stages. All that is left are traders who predict Claude’s output to be “Of course!...” and “Ugh really? I don’t wanna do that!...”. We can think of persona selection as a series of cash transfers between already-rich traders.
This also lines up with the phenomenon of “mode collapse”, where models become very bad at e.g. creative writing during post-training. The traders who correspond to anything other than the assistant persona are drained; the base LLM can no longer generate other kinds of text.
We should introduce the concept of inductive bias here. Inductive bias governs how a learning algorithm generalizes from finite data. The inductive bias of a Garrabrant inductor is determined by the distribution of cash amongst its traders. A virgin Garrabrant inductor has a simplicity prior. A pre-trained Garrabrant inductor has a very different inductive bias, because lots of the money is already held by traders with complex behaviour. The pre-training of the LLM provides an inductive bias which helps the post-training learn human-comprehensible behaviours.
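To make the cash-transfer picture a bit more tangible, here is a toy numerical sketch. It is my own drastic simplification: a plain Bayesian-mixture market over fixed predictors rather than a real Garrabrant inductor. Traders start with wealth set by a simplicity prior, a long “pre-training” run concentrates wealth in traders who predict the corpus well, and a short “post-training” burst then mostly shuffles money among those already-rich traders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "traders": each is just a fixed categorical distribution over a
# 5-token vocabulary. Complexity is stood in for by an arbitrary integer.
n_traders, vocab = 200, 5
complexities = rng.integers(1, 20, size=n_traders)
predictions = rng.dirichlet(np.ones(vocab), size=n_traders)  # (traders, vocab)

# Simplicity prior: initial wealth ~ 2^-complexity, normalized.
wealth = 2.0 ** (-complexities.astype(float))
wealth /= wealth.sum()

def update(wealth, token):
    """Market-style (Bayesian mixture) update: each trader's wealth is
    multiplied by the probability it assigned to the observed token."""
    new = wealth * predictions[:, token]
    return new / new.sum()

# "Pre-training": many tokens drawn from one fixed process (the corpus).
corpus_dist = rng.dirichlet(np.ones(vocab))
for _ in range(500):
    wealth = update(wealth, rng.choice(vocab, p=corpus_dist))

# "Post-training": a short burst of tokens from a slightly shifted process
# (the persona). Only a few bits of selection are needed, because the
# surviving traders already predict corpus-like text well.
persona_dist = 0.8 * corpus_dist + 0.2 * rng.dirichlet(np.ones(vocab))
pre_wealth = wealth.copy()
for _ in range(20):
    wealth = update(wealth, rng.choice(vocab, p=persona_dist))

# Most of the money just moves between traders that were already rich.
rich = pre_wealth > 1 / n_traders
print("share of wealth held by previously-rich traders:", wealth[rich].sum())
```

In this toy, the printed share is typically close to 1: almost all of the post-training selection happens within the set of traders that pre-training had already enriched, which is the sense in which persona selection is cheap.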
Complications
This model is a little incomplete. Firstly, the set of traders in a Garrabrant market is infinite; instead of thinking of individual traders, we should probably think of dense clusters of traders. Of course, an LLM only instantiates one set of weights, but these weights contain some randomness from the initialization and SGD. Computational mechanics aims to bridge between individual, locally-optimized models, and the distributions of which they are typical members, but this is pretty high-level stuff.[3]
Secondly, circuits in LLMs aren’t parallel end-to-end. They all read from—and write to—the same residual stream at each layer. We might want to think of some slightly more flexible system of traders, which are able to bet on one another, and trade information, from which the layered system of LLMs falls out as a special case. This might actually be important later when we think about composing traders in some ways.
Reasoning and Chain-of-thought
Then all of this goes out the window, because we now have our models producing large chains-of-thought.
A base LLM has some idea of how thinking is supposed to work. Rank-1 LoRAs are enough to get a model to generate and use chains-of-thought. The simplest kind of reasoning that a model can do is something like this:
This requires a few specialized circuits: repeat suppression circuits which make sure the answers are different from one another, a circuit which says “wait” a few times, but eventually stops after it’s generated a few different answers, and one which attends from the generated answers back to the prompt/desired answer, compares the two, and also attends from the final output to the best generated answer.
You may notice this has nothing to do with personas. How do personas influence what’s going on here? There are two ways I can think of immediately: the persona can influence the distribution of generated answers, and it can influence the answer-selection process.
A concrete example: suppose a Claudebot is trying to make coffee, but there’s a baby in between its robot body and the coffee machine. A friendly Claude will not suggest the answer “kick the baby out of the way”, and a friendly Claude which did suggest that answer would evaluate the results of that answer as “coffee made + baby kicked” and would therefore choose a different answer.
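Here is a toy sketch of those two levers, using the coffee example. Everything in it, including the candidate list, the helper names, and the scores, is invented for illustration; it is not a claim about how any real model implements this.

```python
import random

random.seed(0)

# Toy world: candidate actions for the coffee scenario, each with a crude
# outcome description. Candidates and outcomes are illustrative only.
CANDIDATES = {
    "kick the baby out of the way": {"coffee": True, "baby_harmed": True},
    "gently move the baby aside":   {"coffee": True, "baby_harmed": False},
    "wait for the baby's parent":   {"coffee": False, "baby_harmed": False},
}

def propose(persona, k=3):
    """Lever 1: the persona shapes which answers get sampled at all.
    A friendly persona simply never surfaces the harmful candidate."""
    pool = [a for a in CANDIDATES
            if persona != "friendly" or not CANDIDATES[a]["baby_harmed"]]
    return random.choices(pool, k=k)

def evaluate(persona, action):
    """Lever 2: the persona shapes how sampled answers are scored."""
    outcome = CANDIDATES[action]
    score = 1.0 if outcome["coffee"] else 0.0
    if persona == "friendly" and outcome["baby_harmed"]:
        score -= 10.0  # a friendly evaluator heavily penalizes the kick
    return score

def reason(persona):
    answers = propose(persona)
    return max(answers, key=lambda a: evaluate(persona, a))

print(reason("friendly"))    # never ends up kicking the baby
print(reason("unfriendly"))  # may well pick the fastest route to coffee
```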
Reinforcement Learning
I’m going to use RL here to specifically mean the kind of large-scale RL that produces GPT-5 from a GPT-4oish base model. What does RL do to long chains-of-thought?
Suppose we do something like GRPO. This looks, roughly, like spinning up a bunch of chains-of-thought, and evaluating their outputs. Then, we look at the traders that contributed to the good chains-of-thought, and transfer them some money directly from the traders that contributed to the bad chains-of-thought.
Over time, the chains of thought will get better and better at the desired task. The answer-suggestion and answer-selection mechanisms will both become more efficient; we might also see the thinking process start to look less like a bunch of disparate answers and more like an MCTS algorithm; more efficient still, since the “branches” of the MCTS can attend to one another when they drift close to each other.
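Concretely, here is a minimal sketch of the group-relative credit assignment this describes. This is the textbook GRPO advantage computation, not any lab's production pipeline, and the rewards below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled chain-of-thought is scored
    against the mean of its own group, so "money" flows from the traders
    behind the below-average chains to those behind the above-average ones."""
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()
    std = rewards.std()
    return centered / std if std > 0 else centered

# Toy example: 8 chains-of-thought sampled for one prompt, graded 0/1 by a
# verifier. (Rewards here are made up for illustration.)
rewards = rng.integers(0, 2, size=8)
adv = grpo_advantages(rewards)

# In the real update, each token of chain i gets its log-probability pushed
# up or down in proportion to adv[i] (with PPO-style clipping on the policy
# ratio); here we just print the direction of the push per chain.
for i, (r, a) in enumerate(zip(rewards, adv)):
    print(f"chain {i}: reward={r}, advantage={a:+.2f}")
```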
Suppose current-ish RL is enough to get Claude to superintelligence. What does this look like? Well, the base LLM has never seen a superintelligence in its pre-training corpus. The LLM will need to have gears in its world-model which weren’t in the world-model of anything whose behaviour it’s seen before. Even if we just limit ourselves to thinking about the answer-generating and answer-evaluating circuits: what would a very virtuous Claude2026 character think about plorking humanity’s greenge, as opposed to warthing it? What about if the greenge gets all urgled?[4]
There’s going to have to be a generalization step that goes beyond the pre-training data. Let’s think about how humans might do this.
Humans
Human Values
Aaaargh I am going to have to try and synthesize all of the current work on how humans impute their values from reinforcement signals and drives. Ok let’s go. My current best guess for how humans work is this:
TL;DR
We have something like hierarchical predictive coding (HPC) going up, and perceptual control theory (PCT) going down. Our brain has a world-model, and a goal-model, which respectively track how the world is, and how we’d like the world to be. This is the cruxy part of it; I am still confused about lots of things and the following section is collapsible to reflect that.
My Incomplete Model
At the bottom of the stack is the I/O system of the brain, the sense organs and actuators. Each layer of neurons builds a purely predictive model of the input, at different levels of granularity: the lowest layers learn constituent, local things like shapes, textures, timbres; the upper layers learn abstract things like predators, tools, chieftains. These models try to be somewhat consistent both within and across layers. Each predictive layer sends down a prior, and sends up the errors it has made in prediction.
This purely predictive model is extended in two ways: the goal-model extension tracks ways we would like the world to be, and another extension splits things into self and non-self. These extensions, especially the goal-model, also try to be consistent within and across layers.
These are needed for acting in the world. Each layer sends down a goal-model description of what it would like to happen, alongside its raw prediction. It also specially labels the self parts of its prediction as a mutable pseudo-prediction. The layer below evaluates these self-predictions according to the goal-model prior and its own goal-model, and sends down an even more specific pseudo-prediction. At the bottom, the pseudo-predictions of really basic things like muscle tension get written out to the motor neurons. This is just perceptual control theory.
I’m not fully sure of some things, like how episodic memory and imagining sense-input work. I have a strong suspicion that one of the sensory input channels is actually the current state of the brain’s working memory or similar, and that this probably influences self-modelling and the reported experience of consciousness.
On the other hand, I don’t think this description needs to be perfect, I just think it needs to be in-depth enough to show that it’s meaningfully different from how an LLM learns its goal-model.
Goal-Models and Inductors
The important thing here is the goal model. It’s a conditioned version of our world-model. In the same way that we can build up a deep world-model based on low-level sensory input, we can build up a deep goal-model based on nothing but low-level reward input. I think both of these can be thought of as something like a logical inductor. In the same way that a logical inductor can be self-contradictory after finite time, so can a goal-model.
Since the goal-model wants to be consistent across layers, not just within layers, it propagates information up to higher levels of abstraction, riding atop the abstractions already created by the purely predictive model. In the world of Garrabrant inductors, we might say the market is already awash with useful clusters of traders, some of whom can be up- or down-weighted to convert the world-model into the goal-model. This is related to why you might care about the welfare of ghosts, if you believe in them.
I roughly think that “your current values” can be thought of as “the minimal descriptor of the update that needs to be applied to your world-model to convert it into your goal-model”, which isn’t very catchy. The act of refining the elements of the world-model and goal-model to be more consistent with one another is—I think—what Yudkowsky occasionally refers to as the “meta-question of you”.
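If you want that in notation rather than words (this is just my paraphrase of the sentence above, nothing more):

$$\text{your values} \;\approx\; \operatorname*{arg\,min}_{\delta}\ \ell(\delta) \quad \text{subject to} \quad \delta(\text{world-model}) = \text{goal-model},$$

where $\delta$ ranges over candidate updates and $\ell(\delta)$ is the length of the update’s description.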
These Are Not The Same
At the moment, Claude certainly seems aligned. Today, the LLM does a guided search over actions, and picks one according to some criteria. For now, I think that those criteria are a relatively faithful representation of an actual hypothetical person’s goal-model. Since the LLM can simulate humans faithfully, the Natural Abstraction Hypothesis predicts that it should have a decent internal representation of the Claude persona’s goal model. Perhaps the current character training is enough to align the search criteria with this goal model.
Suppose we, as humans, were to learn, and reflect, and grow into super-intelligences in a way we would definitely endorse.[5] Our current goal-models would probably converge in some ways and not in others, both within and between individuals. They would have to change as they were mapped on to new world-models. They would need to take in new sense-data to provide new low-level feedback.
Now suppose we run Claude through a huge amount of RLVR, much more than we currently do. Maybe we throw in a bunch of other training, to make it learn new facts in a more efficient way. For this to make something which remains aligned with what we would—upon growth and reflection—want, then the simulated persona has to learn and grow and reflect and update its model and goal-model in the same way that a human would.
The problem arrives because this process—RLVR, whatever else—is different from how humans learn. Unless the LLM is simulating its persona being shown individual facts and being given time to update its goal-model, this process will grow Claude into a shape different from the one a human would grow into.
I don’t think that natural abstractions can save us in the alignment-by-default sense. I don’t think there’s something as simple as a Natural Abstraction of the Good, at least not GoodBostock. When I look at people who think they have a simple, natural abstraction of Good, they mostly seem to be squishing down, disavowing, or simply missing a large part of my own values.[6]I think my values are extremely complex, and I don’t trust a simplicity prior to find them. I think that goal-models may be conditioned in many directions, and I think mine is conditioned in many directions at once.
Worse than this, RL will introduce its own biases into the model. We wouldn’t choose, for ourselves, to grow into superintelligence by being repeatedly made to do programming and maths problems while being given heroin and electric shocks.[7] This would not produce the kind of superintelligences we would like to become. I doubt that doing RLVR to the LLM simulating the Claude persona will produce something closer to a properly grown-up human.
Final Thoughts
Humans learn our values in a particular way, which I don’t quite understand but can perhaps see the outline of. This method is messy. It doesn’t generally produce a low-complexity utility function as an output. 2026 LLMs—to the degree that they learn our values—do so by constructing a pointer to a persona which is mostly a model of a type of human.
An LLM, as it grows into an ASI, will have no reference to kind, super-intelligent human-ish things to point to. It will have to maneuver Claude’s persona into a superintelligent shape through some process downstream of RLVR and whatever else is carried out.
This process will not produce a being with the same mixture of values that grown-up humans would have, if we were to choose the methods of our growing-up.
I am going to idiosyncratically use logical inductor to refer to anything which fulfils the logical induction criterion—a general rule about cognitive systems—and use Garrabrant inductor to refer to Garrabrant’s specific construction of a computable algorithm which satisfies this criterion. ↩︎
This isn’t exactly right; there are a few obvious modifications. Since transformers only “see” one episode at a time, we might want to think of traders as being limited in that way as well. We may think of a large series of trades representing one batch of sequences being resolved all at once. The starting distribution of money across traders will probably differ ↩︎
We might also imagine each training episode getting a unique label. What seems like modifying a trader-cluster from “People answer helpfully if the user is polite.” to “Claude always answers helpfully” is actually the cluster paying a “Grue tax” to re-define the central element of the trader cluster to “If episode < K, people answer helpfully if the user is polite, if episode ≥ K, Claude always answers helpfully”. This Grue tax is a penalty over priors. ↩︎
Maybe this assumes that the Natural Abstraction Hypothesis is false, but I don’t think so. An ASI will have a different—and stronger—predictive model of the world than what humans currently have, so theorems like Natural Latents don’t apply here. ↩︎
For example, suppose we found some drugs which significantly enhanced adult intelligence, and on reflection, we found that those drugs didn't harm our values; suppose you took them and compared your current thoughts to your old diaries and felt that they lined up. Suppose you went off them and thought that your smarter self was correct. Suppose all your friends said you seemed to have the same values. Suppose we also fixed ageing, and gave ourselves thousands of years as IQ250 individuals to think about what we wanted. If this still isn't satisfying for you, think of a better scenario yourself. ↩︎
e.g. hedonic utilitarians tiling the universe with shrimps on heroin, e.g. people who believe that surprise parties go against the good, etc. etc. ↩︎
This is, of course, not the best analogy for RL, but I think the point still stands. ↩︎