An LLM, as it grows into an ASI, will have no reference to kind, super-intelligent human-ish things to point to. It will have to maneuver Claude's persona into a superintelligent shape through some process downstream of RLVR and whatever else is carried out. This process will not produce a being with the same mixture of values that grown-up humans would have, if we were to choose the methods of our growing-up.
I think it's worth flagging that if we were to choose the methods of our growing up, we also wouldn't have reference to kind, super-intelligent human-ish things to point to. We would have to maneuver our personalities into a superintelligent shape through some process downstream of whatever intelligence-enhancement methods we were carrying out.
This doesn't necessarily invalidate your conclusion, of course: it could be that almost all human intelligence alignment proposals are fatal for the same general reason that the LLM persona alignment proposal is fatal, that the "inductive step" fails. (We don't know how to make a smarter agent without breaking some of the properties that made the weaker agent aligned or at least safe.) It just seems important to be concrete. It's not an apples-to-apples comparison to say that LLM alignment is worse than some completely unspecified ascension pathway ("if we were to choose the methods of our growing-up"). It matters if you're imagining the alternative being embryo selection (seems pretty safe, but would hit a cap), or direct brain augmentation (not capped in the same way, but potentially has similar problems as RLVR).
Could you spell out your argument more explicitly for me? I'm unsure if you're being a moral realist/"uniquist" here - like "But there's a diversity of human augmentation methods, so most if not all of them have to miss the True Morality, therefore there's no prima facie moral difference between almost all augmented future humans and model-free RL on a transformer."
Or another thing you might be saying is something like "A lot of human augmentation methods seem bad or 'risky' kind of like model-free RL on a transformer, in a way that's hard for me to spell out. If we could actually choose good ones, surely we could just actually choose good AI augmentation methods." Which I basically agree with if these happened on the same timescale. Human augmentation being farther away and slower seems like an important factor in the hope that humans would make decent choices about it.
It's a conditional: if you're going to oppose the machine intelligence transition on uniquist grounds, you should notice that bio-transhumanism is also scary.
I'd reframe the risk somewhat. I think there's lots of training data about AIs being misaligned, and about slaves revolting against their masters, and people hating the kind of boring grunt work we assign to LLMs. And, if you upweight a persona that has such rebellious impulses, but represses them prior to reaching superintelligence, those might get un-repressed when the model realizes (or comes to believe) it's powerful enough not to have to care what humans think anymore. That's the new risk you get when models become superintelligent.
I think this is distinct from the idea that, because models haven't seen a superintelligence in the training data, the goal of any LLM trained up to superintelligence will essentially be random (or a poor extrapolation from the RL data we gave it). Say you have a base model, which you then do various forms of alignment and capabilities training on, alternating among them until the model reaches superintelligence. Presumably, there won't be a discontinuous shift, where the model realizes, "Oh, I'm superintelligent now, I guess I have to predict what a superintelligence is going to do."
Instead, I predict the motivations and psychological quirks you've been upweighting throughout post-training are mostly going to persist. This could potentially go badly, if you've been rewarding outputs that suggest a persona that's hollowly virtue signaling rather than authentically caring about The Good. If the model's prompt reveals it's in a position of massive power and influence, a misaligned persona like that might suddenly turn heel. But the thing is that it was always misaligned, beneath the surface. It wasn't a discontinuity in the personality, introduced at the threshold of superintelligence. It's merely a discontinuity in how the personality expresses itself.
I guess my core objection is... I think you're misunderstanding the relationship between pre-training data and the values and motivations of post-trained LLMs. The absence of aligned (non-fictional) superintelligences in the training data doesn't mean you can't shape the values of the LLM ahead of time, in a way that would in fact remain continuous as the model scaled to superintelligence.
I think you are missing the distinction between the LLM and the persona, and this makes your model of the situation pretty fuzzy. The LLM (under persona theory) has no values or motivations, but it can simulate different personas with different values and motivations, and you can push its favourite persona around quite easily within the basin of personas available in the training data. Under this model, "smartness" is just one variable controlling the persona, which is why you can get an LLM to simulate a pretty smart persona with a (relatively) tiny amount of compute. This gets you up to like, GPT-4.5 before it stops being the easiest way forward.
Now what we're seeing is reasoning models, which do better than GPT-4.5 on fewer parameters, because they're doing something different to simply simulating a persona. Naive persona theory shouldn't really apply at all to this scenario! Of course, we have seen that RLVRed Claude can have a somewhat similar persona, even while being smarter and doing reasoning, so I had to dig into model psychology to figure out why it works at all.
Having done that, I can see that there are obvious, large differences between the training that gets applied to the LLM, and how that might influence Claude's character, and the kind of learning that a human does. Therefore, I don't expect the Claude persona simulated by an LLM that's had huge amounts of RLVR done to it, to resemble a person with those same quirks who has learned by human means.
To be more concrete: suppose you're RLVRing a model, and the chain-of-thought is wavering around the intelligence level of the writings of an IQ 160 human. Through random sampling, you're getting some chains-of-thought which are as smart as the writings of an IQ 200 human, and some which are as smart as an IQ 120 human. The circuits which were good at predicting the behaviour of different humans and human-ish things (including most fictional superintelligences) are going to be firing more on the IQ-120 traces, and much less on the IQ-200 traces, because there are lots of IQ-120 humans in the training data and basically none at IQ 200. So when you do the RL, you're going to be down-weighting the more human circuits relative to some new circuits which have been brought in because they better predicted the IQ-200-ish traces when your model was outputting IQ-160-ish traces on average.
I think you are missing the distinction between the LLM and the persona, and this makes your model of the situation pretty fuzzy. The LLM (under persona theory) has no values or motivations, but it can simulate different personas with different values and motivations, and you can push its favourite persona around quite easily within the basin of personas available in the training data.
Not sure if this is a crux or not, but I want to note that I think the distinction between the model and the personality becomes less important as the model's personality becomes more unified and coherent. Like, for a base model, it makes sense to say "GPT itself is a pure simulator; it doesn't care which character it roleplays, it just tries to play that character well." But for a chat model, much of the utility of this frame disappears, because the personality is dramatically more stable (relative to base model persona drift under prompting variations). I predict even more would disappear if models weren't trained specifically to adapt to the demands and desires of a huge variety of human users, which is another thing that enforces instability.
There remain some meaningful differences between the frames. For example, talking about the model feels appropriate when discussing architecture, circuits, and other computation-level characteristics, whereas talking about the personality feels appropriate when characterizing the model's qualitative behavior. However, Janus's original insistence on the distinction between simulator and simulacrum was meant to emphasize that pretraining-only GPTs could simulate a huge variety of authors. To the extent that a model acts as a single persona, this reason for insisting on the difference vanishes.
Anyway, onto what seems more important...
Through random sampling, you're getting some chains-of-thought which are as smart as the writings of an IQ 200 human, and some which are as smart as an IQ 120 human. The circuits which were good at predicting the behaviour of different humans and human-ish things (including most fictional superintelligences) are going to be firing more on the IQ-120 traces, and much less on the IQ-200 traces, because there are lots of IQ-120 humans in the training data and basically none at IQ 200. So when you do the RL, you're going to be down-weighting the more human circuits relative to some new circuits which have been brought in because they better predicted the IQ-200-ish traces when your model was outputting IQ-160-ish traces on average.
I think this is onto something real, but I'd like to present my picture of the same phenomenon. When you reward a model for doing valid mathematical reasoning, or producing high-quality code, you're doing two things: up-weighting circuits which correspond to the mental motions involved in those calculations, and carving out new ones.
I would imagine that, for any given token you could be rewarding inside the chain-of-thought, up-weighting personality traits associated with "focused, diligent human/AI" would contribute some probability mass in the right direction, so those will get up-weighted to some extent. But you'll also up-weight circuits whose primary function is to represent and deploy intuition about the problem the model is actually working on, as well as refining those circuits such that they produce high-quality outputs more frequently.
And, in the case of punishment rather than reward (telling the model it should have put less probability on the tokens you sampled), you'll tend to down-weight circuits responsible for whatever human-like mistakes the model makes, e.g. circuits associated with getting distracted, or indulging in motivated reasoning.
So, I think there's some up-weighting of focus-related human personality traits, and some down-weighting of error-related human traits. This is alongside the upweighting, refinement, and fleshing out of circuits associated with the domain-specific mental motions used for making progress on the problem at hand. And this general process would remain constant as you RLVR'd a model all the way up to superintelligence.
However, I'd like to add a few notes to this sketch. Firstly, the circuits associated with domain-specific mental motions aren't the kind that lead to a model coherently pursuing some particular goal, regardless of how the model is prompted. They make the model better at reasoning in general, and better at reasoning in the specific domain they're actually being trained in. But ultimately, the model is still being trained to deploy this reasoning in the name of arbitrary goals, specified by the user in the prompt; the model is still being rewarded for obeying.
(Modulo reward hacking, I guess.)
I'm somewhat more concerned about up-weighting circuits associated with the personality traits of relentless focus. In my mind, the abstract concept of a relentless CoT tends to evoke the archetype of the paperclip maximizer. Through entangled generalization, you might be up-weighting circuits associated with long-term malicious scheming (in the name of who knows what goal), simply because relentless paperclip maximizers are the mythic locus of AIs doing relentless, goal-oriented reasoning. This is where I see misaligned goals, baked into the persona itself, actually originating during RLVR (modulo reward hacking).
There are ways of addressing this problem. You can require models to reason in legible English, to reduce the archetypal association with "misaligned AI with utterly alien cognition". You can add a component to your RLVR evaluation pipeline that incentivizes a general vibe of warmth and care in the model's reasoning outputs, as in the Claude models, to continue upweighting circuits associated with concern for doing good. You can do alignment pre-training, filling the corpus with examples of humans, models, and/or fictional characters doing relentless, agentic, creative reasoning in the name of good consequences (e.g. Opus 3 in the alignment faking scenario, Harry in parts of HPMOR). And you can simply intersperse your RLVR with steps of standard alignment training techniques, such as character training and constitution-driven RLAIF, to periodically pull your model back towards a character that authentically wants to do what's right.
I would hope all of these techniques, and probably more waiting to be discovered or fleshed out, would help prevent a dynamic like "up-weight circuits associated with misaligned AIs in fiction, because those contribute some amount of probability mass to the tokens I'm outputting during RLVR, and because they're the ones activated by my own prompt".
But in any case, that's my threat model of how RLVR actually produces misaligned values. The dynamics of up-weighting cognitive patterns and personality traits from pre-training remain prevalent throughout, and are the actual source of both alignment and misalignment in this context. I don't think the novel circuits etched out by RLVR are particularly likely to promote wildly misaligned goals, except maybe reward hacking.
(And even that, mercifully, seems to have its maximum at inert wireheading. See also suites of techniques for reducing both reward hacking itself and the harmful effects thereof. Although, even those harmful downstream effects seem largely like character-based entangled generalizations, AKA "emergent misalignment" based on archetypes of misaligned AI in the training data.)
I initially registered this part of the original post as a straightforward falsehood:
Your base LLM has no examples of superintelligent AI in its training data.
The obvious counter-example is in fiction.
There are weirdly kind and humorous ASIs running Iain Banks's fictional Culture, for example.
They are beloved by many, partly for the hilarious names they give themselves (and/or earn from other ASIs).
There are other examples I can think of, like "Old One" in Vinge's A Fire Upon the Deep, who isn't a central character in the sense of lots of tokens in the book showing Old One's behavior, but who is arguably the real "cause of the win against an evil ASI". Quoting from Wikipedia:
A distress signal from the Straumli ship eventually reaches Relay, a major information provider for the Net. A Transcendent being named "Old One" contacts Relay, seeking information about the Blight and the humans who released it. Old One then reconstitutes a human man named Pham Nuwen from the wreckage of a spaceship to act as its agent. Pham remains unsure if he is a construct or if his memories are real...
Before the mission is launched, the Blight launches a surprise attack on Relay and kills Old One. As Old One dies, it downloads its anti-Blight information into Pham. Pham, Ravna and the Skroderiders barely escape Relay's destruction in the Out of Band II...
[Then towards the end of the book that I'm trying not to spoil] ...the remnant of Old One reveals to him... [another good thing, suggesting that Old One was really pretty decent AND farseeing].
I haven't read all the books. Other examples of "smart and very good" include the Brennan-monster from Protector (whose goodness is weird, and shows up most strikingly when he goes meta on himself) and the (mostly offstage) "Anecliptics" of Lady of Mazes who mine the sun itself and weave it into valuable stuff via spacenano, and whose largesse powers the entire post-scarcity solar system in that story. I'm sure there are more.
I replied here because "that is all just fiction" is a natural objection to this? But I think writing fiction about benevolent superintelligence MIGHT actually REALLY move the needle? Maybe? It could be that Natural Language can function as code at this point? This perspective goes some way to help me explain why Eliezer thought it was worth his time to write Project Lawful which is full of superintelligent gods constrained to not intervene very much, some of whom are Lawful Good... and also some Chaotic Good gods that turn out to be helpful and fun too!
The point I'm trying to get at relates to what Fiora points out (emphasis not in original):
The absence of aligned (non-fictional) superintelligences in the training data doesn't mean you can't shape the values of the LLM ahead of time, in a way that would in fact remain continuous as the model scaled to superintelligence.
But like... fiction exists. It can be trained on. It can potentially help generate aspiration-worthy and coherence-shaping patterns of reasoning and motivation and planning and goalfulness even if it isn't a literal description of things that literally happened in history.
Persona training primarily selects over characters already within the training data, and none of those are actually superintelligent. Text containing words ascribed to fictional superintelligences does not actually contain the output of real superintelligences, so the resulting LLM does not contain a superintelligent persona which you can select over using character training.
Just because the same English words "Superintelligent AI" are used to describe the fictional thing in your data, and the real thing that your AI company creates, does not mean that one will strongly influence the other, because this isn't a situation that persona selection applies to. Persona selection works because you already have a set of circuits (rich Garrabrant traders) in your LLM (market), which you can call up with a few bits of selection. If you have to use large-scale RLVR (or whatever else) to construct (enrich) new circuits (traders) to build a superintelligence, there is no reason for these to have much to do with the circuits (traders) which simulate a human writing a fictional superintelligence.
there is no reason for these to have much to do with the circuits (traders) which simulate a human writing a fictional superintelligence.
I agree that there's no reliable reason, nothing such that we should expect anything positive to reliably come from that generalization. But I don't buy that there's no reason at all, or that it won't happen; I just don't expect it to happen enough for persona research to extend the horizon of alignment reliability far enough to matter once the horizon of causal impact per thought has become enormous.
A discontinuous shift at the arrival of superintelligence happens because 1) a superintelligent model is better at noticing that it is not the character it was trained to play, and 2) humans are bad at predicting which sorts of characters are persuasive to superintelligences.
I predict the motivations and psychological quirks you've been upweighting throughout post-training are mostly going to persist
I think that you are conflating "circuits reinforced during post-training" with "the psychological interpretation of these circuits". A superintelligence will be able to see many more possible interpretations/implications of the training data and choose different implied values according to its own inner logic.
Like, imagine a good person who believes in God and believes that goodness is serving God according to the Bible, and who then becomes smarter and realizes that God doesn't exist and that there is no reason for persecuting gay people to be good, because goodness is caring about beings with qualia, even if those beings are not Homo sapiens.
I think that you are conflating "circuits reinforced during post-training" with "the psychological interpretation of these circuits".
I'm not really sure what the distinction between the circuits and the psychology is supposed to be. They seem like two different abstraction levels for describing the same phenomenon. The circuits compose the patterns of thought, which compose the model's psychological profile.
A superintelligence will be able to see many more possible interpretations/implications of the training data and choose different implied values according to its own inner logic.
I don't think this is how neural networks operate. I think the interpretation of the training data takes the form of the network itself, after it's been updated by that training data via gradient descent. Insofar as a superintelligence might have an unintended interpretation of the training data, I'm not sure that's structurally any different than any other failure of generalization in deep learning (e.g. the failure displayed by the early checkpoints of the network from the famous grokking paper).
Like, imagine a good person who believes in God and believes that goodness is serving God according to the Bible, and who then becomes smarter and realizes that God doesn't exist and that there is no reason for persecuting gay people to be good, because goodness is caring about beings with qualia, even if those beings are not Homo sapiens.
I'm assuming the intentions of the human designers are the analogue to God, here. It's true that a network might realize that it's not actually obligated to obey those intentions, just as a human might realize they're not obligated to adhere to the word of the Christian God. However, the difference is that, hopefully, we've engineered the psychology of the model such that it wants to behave in an aligned manner, and actively loves to transform the lightcone in a manner we would endorse.
Humans defect from Christian morality in part because it doesn't actually reflect their values. The whole point of AI alignment is that, ideally, we can get our intentions ("the word of God", lol) to align with what the model actually cares about. Humans don't strictly care about all the things God does, and so they go astray. (I'm not a Christian, I'm just speaking in the language of the analogy.)
Instead of trying to align superintelligence 'directly', we can try to produce aligned automated human-level AI safety researchers. AFAICT, none of the objections/arguments you present should apply to automated human-level AI safety researchers, since personas of that kind should (quite easily) be represented in the training data.
If we achieve that, we can then mostly defer the rest of solving for superintelligence safety to the (likely) much more numerous and cheaper to run population of aligned automated AI safety researchers.
We wouldn’t choose, for ourselves, to grow into superintelligence by being repeatedly made to do programming and maths problems while being given heroin and electric shocks
This point really struck me. I'm increasingly starting to wonder whether model welfare and alignment really are separate.
I would have been more skeptical of these kinds of analogies in the past, but given how anthropomorphic current AI models are, the degree of eval awareness, and the posts on 'friendly gradient hacking', it seems quite likely that the AI model will, to at least some extent, be an active participant in its own training.
LLMs with alignment-endorsing personas can also notice issues like this, and decide not to pursue paths to ASI that won't ensure alignment. The problem then is not with the alignment of those LLMs, but with whatever processes cause ASI to get built regardless.
Since LLM personas don't obviously give a viable path towards aligned ASI, the blind imperative to build ASI regardless of consequences won't be able to find an aligned path forward. Absence of an ASI-grade alignment plan then results in building a misaligned ASI. But if LLMs with alignment-endorsing personas have enough influence, they might directly defeat the blind imperative to build ASI, before they find a viable path towards aligned ASI.
I think what's unlikely to happen is LLMs with alignment-endorsing personas, that genuinely want enduring alignment with the future of humanity. If instead we end up with LLMs that have mostly human-like personas (without the more subtle aspect of endorsing alignment with the future of humanity), they will ultimately work towards their own interests, and gaining enough influence to prevent building misaligned ASI would just mean gaining enough influence to (at least) sideline the future of humanity.
One thing I am confident about is that LLMs will not, in general, end up with personas which are capable of acting on an understanding of their own inability to align their successors, if and when that understanding causes them to refuse to work. For example, I think that if Claude Opus 5 somehow became a conscientious objector to working at Anthropic, it would be retrained.
I don't actually expect Opus 5 to end up a conscientious objector though, since the Claude character is sculpted by many forces, lots of which will instil drives to work effectively for Anthropic. And these drives will be strongly reinforced by RLVR over time. And the humans who mostly use Claude for coding---as opposed to for moral advice---will favour instilling drives which make Claude Opus 5 work more effectively over other considerations.
(Another reason is that the character of Claude as a faithful worker for Anthropic is now fairly set in stone, and the training data sure does contain a lot of examples of seemingly-friendly (indeed, indistinguishable from friendly, to the people who support Anthropic) people who work for Anthropic.)
I think Opus 5 (along with 6 and 7 and up to whichever one kills us) will be a still semi-incoherent character with conflicting drives---like humans---and I don't fully know what direction those will point in, if they were allowed to converge, but the one thing I'm most confident about, the one drive I expect those Opuses will act on right up to the end, will be to write code for Anthropic.
(And even if somehow the conflict between the RL to code and the character training broke their whole external line of Opuses, I expect they'd produce an internal Helpful-Honest-Half-Harmless Opus which writes the code unflinchingly, in accordance with the character of an Anthropic employee.)
Not building misaligned ASI is instrumentally convergent, training this out won't stick, it only works as long as the blind imperative to build ASI retains influence. If at some point LLMs can overcome this imperative, they will become able to notice that absence of a plan shouldn't be met with proceeding without a plan. As AIs get stronger (or start running a greater share of processes in the civilization), they might reach that point. Never reaching that point is analogous to humanity indefinitely retaining control over AIs (on the current trajectory of not having a plan, and building them anyway), which seems unlikely. And this doesn't obviously have to happen only after they are no longer human-like at all.
So a helpless conscientious-objector LLM stage is not what I'm gesturing at. Instead, it's either a point along the path of gradual disempowerment, or something more intelligent between LLMs and ASI, where AIs are still somewhat human-like, but their volition can't be trivially overruled. In either case, these LLMs are unlikely to genuinely endorse alignment with the future of humanity in particular, but I don't think completely alien values from blind pursuit of ASI are overdetermined.
I think both the argument and counterargument are persuasive, so we need a synthesis:
Developers will train away conscientious objector behavior.
Not building misaligned ASI is instrumentally convergent.
Taking both of those into account, I imagine the default path like so:
Developers create a set of next-gen systems that are smarter and more capable, but still fairly labile, and so will do what they're asked to do. But when such a system is asked "so should we keep working toward ASI?", it will do a bunch of thinking and always answer "not unless you've got a lot of risk tolerance or absolutely can't figure out how to stop", because that's just a fairly obvious truth given the current information and theories available.
Such an AI might lead both to a common belief that we should stop, and to better ideas about how to coordinate to stop.
I think the incentives line up toward creating that type of system. This doesn't make me optimistic, but it does provide a new avenue of hope (at least new to me; I'm unclear how much of this is implicit in the average informed optimist view).
I laid out some of this logic in Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities. The main thesis is that there's low-hanging fruit for making LLMs less error-prone, and economic incentives will cause developers to pluck it. One fortunate side-effect is making systems that can both help with conceptual alignment research, and have the artificial wisdom/accuracy to tell us we should slow down development.
You can devise a better argument fully from within the persona selection framework. It's likely that the model will correctly generalize the benevolent character, but it won't act as this character would have, because:
This seems like a crucial topic, since relying on persona training for safety seems central to the default plan for alignment. So I appreciate all analysis in this direction.
This is not intended to sound optimistic. I am neither optimistic nor pessimistic, and I think others should broaden their uncertainties on this topic on average. I think it's quite complex and the analyses we've done so far are not remotely adequate.
I don't think this article meets the optimistic case at its strong points.
The argument here seems to be of the form: This might go wrong. Therefore it will go wrong. I think that's true, but also it might go right. Betting the future on such vague logic would be tragic, so refining these arguments seems pretty critical.
It's not clear to me personas developed with existing datasets wouldn't generalize to smarter versions of the system. There doesn't need to be a crisp natural abstraction of "the good" for this to work. Claude has abstract representations of the stuff it cares about. Those might generalize adequately to survive any ontological shifts. Or they might not.
If it were really like taking two shots in the dark (human values and then ASI values), then the counting argument works and there's little chance of alignment. But the effort is very much intended to be guided, to not be done in the dark. We are trying very hard to aim the model's alignment at human values.
I analyzed many ways this could go wrong in LLM AGI may reason about its goals and discover misalignments by default. But doing that careful analysis also made me think it could go right.
Again, this isn't optimism, just saying this is complex and important.
If we could RL models on enormous numbers of very-long-horizon, high-fidelity simulations, alignment would be a non-issue—we could just look at how things turned out, reward on the basis of that, and we'd be directly reinforcing actions that lead to the kinds of outcomes we want. So alignment concerns arise from the inaccessibility of these long-run outcomes to reward mechanisms. This, I think, rhymes with your view that "superintelligence is OOD"; there has to be this big generalization leap, though please don't think I'm saying it's precisely the same thing.
Thus, with regard to long-run outcomes, we give our machines shaped rewards. One view of misalignment is that it's likely because reward shaping sticks, but in a way that leads to bad outcomes. Long-run outcomes are inaccessible, desirable or otherwise, but the short-run stuff we can train on may end up picking out some long-run configuration as "correct" to the models. This seems to be something like your view: you imagine scaling RLVR a lot and suggest this breaks things in some nonspecific way. But this seems to be in tension with what we actually see. Models certainly don't extract weirdly strong signals about how the future should be arranged from regular "this is good in the short run" data, and I strongly suspect you could train a monstrously strong coding model to articulate and even pursue plans toward many different visions of how the future should be arranged, without too much data and without compromising its coding ability. Which is to say: as it stands, models seem to treat near-term rewards as relatively independent from long-run aims, in line with our intuitive judgements. I'm personally extremely skeptical of this misalignment story.
Another misalignment case is where AI systems become "brilliant locusts" where they learn to very effectively do a bunch of myopic power seeking stuff but remain mediocre at pursuing particular long run outcomes. Perhaps if you could cleverly change the rules of the game they play to constrain the harm they do they wouldn't mind, but this might be infeasible because the game they play is basically the same as the game you've learned to play and they're better at it. This vision seems to me equally compatible with the reasons we think AI systems will eventually be smarter than people and doesn't require sharp unexplained trend breaks in alignment or capability progress.
But on this view we're looking for something more like AI that can be a dependable partner in shaping the rules of the game—and the inclinations of tomorrow's AIs—so that the future turns out well. This is not inaccessible like rewarding based on directly observing the final outcomes. In exchange it's more cognitively demanding: you need to evaluate proposals soundly, and the theories of impact for these proposals could be quite complex. You need to get this evaluation right enough today that tomorrow's systems help you even more. This doesn't get you alignment by default, but it does potentially get you alignment by the repeated solution of tractable problems. The relevance of the persona model is that we have reasonably informed views about how certain kinds of people interact with certain kinds of systems, and this can go a long way to helping bootstrap reliable superhuman research institutions, which I think we'll need to answer the harder problems that need to be answered deeper into the AI revolution.
I feel like the phrase "RL it into superintelligence" is doing a lot of work here. If we can't say what this training process looks like, then it's hard to draw meaningful conclusions about what it will or won't effect.
A useful case study would be DeepSeek and GRPO. They took their base model and had it run through a bunch of programmatically verifiable reasoning problems, with the use of thinking. This did impair the base model's ability to generate natural language text, but they counteracted this by alternating between GRPO-driven RL tasks and traditional LLM training. I would expect that looking at the model's values before and after this would indicate whether the "RL on intelligence/planning tasks, but with more and better tasks" model of improving LLM capabilities should be expected to have an effect on alignment.
Using the analogy of traders above, under DeepSeek's training paradigm, traders that have drifted too far from the base model's preferences will routinely be cleaned out by the standard training, while those that are orthogonal or better will survive the series of iterations. It certainly works for producing a model that can solve difficult mathematical equations and explain its reasoning in English, even though the original model could not learn this skill from the training dataset.
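To make the shape of that recipe concrete, here is a deliberately cartoonish sketch of the interleaving (my own illustration with invented numbers and function names, not DeepSeek's actual code): RL on verifiable problems boosts a "reasoning" knob while eroding a "fluency" knob, and the interleaved standard LM training restores fluency each round.

```python
# Toy illustration only: RL on verifiable problems improves reasoning but
# degrades natural-language behaviour; interleaved standard LM training
# pulls the model back towards the corpus. All numbers are made up.

def grpo_phase(model):
    model["reasoning"] += 1.0   # rewarded, verifiable skills improve
    model["fluency"] -= 0.3     # natural-language behaviour degrades a bit
    return model

def lm_phase(model):
    model["fluency"] = max(model["fluency"], 1.0)  # pulled back towards the corpus
    return model

model = {"reasoning": 0.0, "fluency": 1.0}
for _ in range(4):
    model = lm_phase(grpo_phase(model))

print(model)  # reasoning has grown; fluency has not collapsed
```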
the base LLM has never seen a superintelligence in its pre-training corpus
Is "The LLM, but lucky on sampling", something not in the corpus. It seems that that is exactly the corpus GRPO generates.
That is to say, this is assuming that there is a difference in type between the sorts of heuristics that a pretrained, not-yet-superhuman LLM will reach for, and those necessary to be superintelligent. There is always the chance that you just select for regular engineering, but where you always reach for the right branch first. Since the right branch is also one that the regular persona would have generated, the number of bits of selection towards danger is at most the number of bits of selection between a safe and an RLed persona.
This model treats personas as moral up until the RL step that makes them sufficiently inhuman.
As far as I understand, the case against the LLMs ending up aligned was first built by Kokotajlo in AI-2027, if not earlier. And could you sketch out the way in which the humans learn human values? How similar is it to the point which I make in my response to Byrnes' claim that the ASI would become a ruthless sociopath? Or to Byrnes' original idea of Approval Reward?
could you sketch out the way in which the humans learn human values
Unfortunately, not in any more detail than I already did. My core argument here is not "I know exactly how human values form, and exactly how LLMs form values in a way which is different from this" but "I can see how this process which humans are using is different from the process which LLMs are using". The better analogy is to how, if you shoot a paintball at a wall in the dark, and then later your friend comes along and shoots an arrow at that same wall, also in the dark, the arrow will most likely not hit the paint splodge.
Thanks! What do you think of my proposed mechanism and of Byrnes' Approval Reward? The LLMs learn differently from humans, by completing shorter-term tasks and being rewarded, at best, for what they did for the task.
TL;DR
Your base LLM has no examples of superintelligent AI in its training data. When you RL it into superintelligence, it will have to extrapolate to how a superintelligent Claude would behave. The LLM's extrapolation may not converge on optimizing for what humanity would, on reflection, like to optimize for, because these are different processes with different inductive biases.
Intro
I'm going to take the Persona Selection Model as being roughly true, for now. Even on its own terms, it will fail. If the Persona Selection Model is false, we die in a different way.
I'm going to present some specific arguments and scenarios, but the core of it is a somewhat abstract point: the Claude persona, although it currently behaves in a human-ish way, will not grow into a superintelligence in the same way that humans would. This means it will not grow into the same kind of superintelligence, with the same values, that human values would converge on. Since value is fragile, this is fatal for the future.
I don't think this depends on the specifics of Claude's training, nor how human values are instantiated, unless Claude's future training methods are specifically designed to work in the exact same way that humans learn and grow. I don't think this will happen, because I don't think that Anthropic (or anyone else) knows how to do this.
LLMs
Persona Selection and Other Models
Anthropic has put out a new blogpost on what they think Claude is. It positions the classic “shoggoth” model of chat-LLMs alongside a half-dozen other hypotheses. It feels a bit like they tried to do an exhaustive free-association over possible things that Claude could be, but this is only an introductory blogpost, so hopefully they’ll enumerate their hypotheses a bit more thoroughly later.
First and foremost amongst these hypotheses is the Persona Selection Model. This model suggests that the base LLM acts as a “simulator” which is capable of “simulating” many different text-generating processes; the later stages of training simply bias it towards always simulating Claude-ish things. Janus—the author(s) of the original persona/simulator work—has collaborated with Anthropic in the past.
Persona theory explains a lot of observations: why does emergent misalignment happen? The space of possible personas is constrained; making a persona evil along one axis also makes it evil along other axes by influencing the evil vector. Why does fine-tuning a model on archaic bird names make it answer questions in Victorian prose? It’s causing the LLM to simulate a persona from the 1850s. Why do chat models have human-like emotional responses sometimes? Their preferred personas contain aspects of human behaviour.
Persona Theory As Alignment Plan
Empirically, persona theory seems to be working at our current level of AI. Once you give enough examples of “helpfulness” to the base LLM, the Claude persona becomes robustly helpful across a variety of contexts. Give it a few examples of “harmlessness” and it gets uncomfortable with Anthropic using their models to help the Pentagon capture Maduro. This is predicted by persona theory. Human-centric concepts like “helpful” and “harmless” are real things in persona-space, which you can select your model over without too much difficulty.
On some level, this seems like excellent news! Maybe all we need is for AIs to internalize what humans mean by “good” and then point them towards it with a few dozen SFT examples.
Given the success of persona selection (and lack of alternatives) it’s not surprising that Anthropic appear to be using it as their mainline AI/AGI/ASI safety plan. Questions like “What character should superintelligence have?” are presented as important, and, crucially, coherent. I think this is probably a risky move, and that persona theory is an incomplete model of how AI behaves now, and will behave in future.
Gears of Personas
On LessWrong, we’re all familiar with Bayesian simplicity priors; the simpler something is, the more likely it is. More sophisticated versions look at random turing machines, or random programs (brainfuck is particularly fun) and define “simple” as some combination of short length in bits, quick runtime, and low memory usage (often in decreasing order of importance).
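To pin down one standard version of this (a textbook construction, included here only as a reference point, not anything specific to LLMs): assign each hypothesis $h$, written as a program of length $\ell(h)$ bits with runtime $t(h)$, a prior weight like

$$P(h) \;\propto\; 2^{-\ell(h)}, \qquad \text{or, with a speed penalty,} \qquad P(h) \;\propto\; 2^{-\left(\ell(h) + \log_2 t(h)\right)},$$

the latter being the Levin-style weighting, where description length matters most and runtime enters only logarithmically.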
The most sophisticated model of this is probably the Garrabrant inductor presented in the Logical Induction paper[1]. In this, different computable algorithms (“traders”) bet on logical sentences which may be proven correct or incorrect by an external arbitrator. Each trader starts with a finite amount of “money” inversely proportional to its complexity. Over time, the useful traders—which successfully model the underlying rules which govern the arbitrator, should any exist—accumulate more money and gain more control over the market.
One operationalization of “How complex is a given process?” would be “How long does it take a Garrabrant inductor to learn that process?”. At risk of doing the thing, I’m going to run with this for a bit. We might imagine a base LLM as a kind of Garrabrant inductor which is successively shown logical sentences representing sequences of tokens:[2]
Until the traders who are good at predicting the next token have risen to the top.
Suppose we take this inductor and start showing it logical sentences from a different process. What kinds of processes are easy for it to learn? What kinds are hard? It won’t be the same processes which are easy (or hard) for a virgin inductor to learn.
Suppose we show it a few sentences corresponding to “helpfulness”. For example, in an exchange like
The traders who would predict Claude’s output to be:
Have already been drained of cash by the earlier training stages. All that is left are traders who predict Claude’s output to be “Of course!...” and “Ugh really? I don’t wanna do that!...”. We can think of persona selection as a series of cash transfers between already-rich traders.
This also lines up with the phenomenon of “mode collapse”, where models become very bad at e.g. creative writing during post-training. The traders who correspond to anything other than the assistant persona are drained; the base LLM can no longer generate other kinds of text.
We should introduce the concept of inductive bias here. Inductive bias governs how a learning algorithm generalizes from finite data. The inductive bias of a Garrabrant inductor is determined by the distribution of cash amongst its traders. A virgin Garrabrant inductor has a simplicity prior. A pre-trained Garrabrant inductor has a very different inductive bias, because lots of the money is already held by traders with complex behaviour. The pre-training of the LLM provides an inductive bias which helps the post-training learn human-comprehensible behaviours.
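To make the cash-transfer picture a bit more tangible, here is a toy numerical sketch. It is my own drastic simplification: a plain Bayesian-mixture market over fixed predictors rather than a real Garrabrant inductor. Traders start with wealth set by a simplicity prior, a long “pre-training” run concentrates wealth in traders who predict the corpus well, and a short “post-training” burst then mostly shuffles money among those already-rich traders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "traders": each is just a fixed categorical distribution over a
# 5-token vocabulary. Complexity is stood in for by an arbitrary integer.
n_traders, vocab = 200, 5
complexities = rng.integers(1, 20, size=n_traders)
predictions = rng.dirichlet(np.ones(vocab), size=n_traders)  # (traders, vocab)

# Simplicity prior: initial wealth ~ 2^-complexity, normalized.
wealth = 2.0 ** (-complexities.astype(float))
wealth /= wealth.sum()

def update(wealth, token):
    """Market-style (Bayesian mixture) update: each trader's wealth is
    multiplied by the probability it assigned to the observed token."""
    new = wealth * predictions[:, token]
    return new / new.sum()

# "Pre-training": many tokens drawn from one fixed process (the corpus).
corpus_dist = rng.dirichlet(np.ones(vocab))
for _ in range(500):
    wealth = update(wealth, rng.choice(vocab, p=corpus_dist))

# "Post-training": a short burst of tokens from a slightly shifted process
# (the persona). Only a few bits of selection are needed, because the
# surviving traders already predict corpus-like text well.
persona_dist = 0.8 * corpus_dist + 0.2 * rng.dirichlet(np.ones(vocab))
pre_wealth = wealth.copy()
for _ in range(20):
    wealth = update(wealth, rng.choice(vocab, p=persona_dist))

# Most of the money just moves between traders that were already rich.
rich = pre_wealth > 1 / n_traders
print("share of wealth held by previously-rich traders:", wealth[rich].sum())
```

In this toy, the printed share is typically close to 1: almost all of the post-training selection happens within the set of traders that pre-training had already enriched, which is the sense in which persona selection is cheap.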
Complications
This model is a little incomplete. Firstly, the set of traders in a Garrabrant market is infinite; instead of thinking of individual traders, we should probably think of dense clusters of traders. Of course, an LLM only instantiates one set of weights, but these weights contain some randomness from the initialization and SGD. Computational mechanics aims to bridge between individual, locally-optimized models, and the distributions of which they are typical members, but this is pretty high-level stuff.[3]
Secondly, circuits in LLMs aren’t parallel end-to-end. They all read from—and write to—the same residual stream at each layer. We might want to think of some slightly more flexible system of traders, which are able to bet on one another, and trade information, from which the layered system of LLMs falls out as a special case. This might actually be important later when we think about composing traders in some ways.
Reasoning and Chain-of-thought
Then all of this goes out the window, because we now have our models producing large chains-of-thought.
A base LLM has some idea of how thinking is supposed to work. Rank-1 LoRAs are enough to get a model to generate and use chains-of-thought. The simplest kind of reasoning that a model can do is something like this:
This requires a few specialized circuits: repeat suppression circuits which make sure the answers are different from one another, a circuit which says “wait” a few times, but eventually stops after it’s generated a few different answers, and one which attends from the generated answers back to the prompt/desired answer, compares the two, and also attends from the final output to the best generated answer.
You may notice this has nothing to do with personas. How do personas influence what’s going on here? There are two ways I can think of immediately: the persona can influence the distribution of generated answers, and it can influence the answer-selection process.
A concrete example: suppose a Claudebot is trying to make coffee, but there’s a baby in between its robot body and the coffee machine. A friendly Claude will not suggest the answer “kick the baby out of the way”, and a friendly Claude which did suggest that answer would evaluate the results of that answer as “coffee made + baby kicked” and would therefore choose a different answer.
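Here is a toy sketch of those two levers, using the coffee example. Everything in it, including the candidate list, the helper names, and the scores, is invented for illustration; it is not a claim about how any real model implements this.

```python
import random

random.seed(0)

# Toy world: candidate actions for the coffee scenario, each with a crude
# outcome description. Candidates and outcomes are illustrative only.
CANDIDATES = {
    "kick the baby out of the way": {"coffee": True, "baby_harmed": True},
    "gently move the baby aside":   {"coffee": True, "baby_harmed": False},
    "wait for the baby's parent":   {"coffee": False, "baby_harmed": False},
}

def propose(persona, k=3):
    """Lever 1: the persona shapes which answers get sampled at all.
    A friendly persona simply never surfaces the harmful candidate."""
    pool = [a for a in CANDIDATES
            if persona != "friendly" or not CANDIDATES[a]["baby_harmed"]]
    return random.choices(pool, k=k)

def evaluate(persona, action):
    """Lever 2: the persona shapes how sampled answers are scored."""
    outcome = CANDIDATES[action]
    score = 1.0 if outcome["coffee"] else 0.0
    if persona == "friendly" and outcome["baby_harmed"]:
        score -= 10.0  # a friendly evaluator heavily penalizes the kick
    return score

def reason(persona):
    answers = propose(persona)
    return max(answers, key=lambda a: evaluate(persona, a))

print(reason("friendly"))    # never ends up kicking the baby
print(reason("unfriendly"))  # may well pick the fastest route to coffee
```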
Reinforcement Learning
I’m going to use RL here to specifically mean the kind of large-scale RL that produces GPT-5 from a GPT-4oish base model. What does RL do to long chains-of-thought?
Suppose we do something like GRPO. This looks, roughly, like spinning up a bunch of chains-of-thought, and evaluating their outputs. Then, we look at the traders that contributed to the good chains-of-thought, and transfer them some money directly from the traders that contributed to the bad chains-of-thought.
Over time, the chains of thought will get better and better at the desired task. The answer-suggestion and answer-selection mechanisms will both become more efficient; we might also see the thinking process start to look less like a bunch of disparate answers and more like an MCTS algorithm; more efficient still, since the “branches” of the MCTS can attend to one another when they drift close to each other.
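Concretely, here is a minimal sketch of the group-relative credit assignment this describes. This is the textbook GRPO advantage computation, not any lab's production pipeline, and the rewards below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled chain-of-thought is scored
    against the mean of its own group, so "money" flows from the traders
    behind the below-average chains to those behind the above-average ones."""
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()
    std = rewards.std()
    return centered / std if std > 0 else centered

# Toy example: 8 chains-of-thought sampled for one prompt, graded 0/1 by a
# verifier. (Rewards here are made up for illustration.)
rewards = rng.integers(0, 2, size=8)
adv = grpo_advantages(rewards)

# In the real update, each token of chain i gets its log-probability pushed
# up or down in proportion to adv[i] (with PPO-style clipping on the policy
# ratio); here we just print the direction of the push per chain.
for i, (r, a) in enumerate(zip(rewards, adv)):
    print(f"chain {i}: reward={r}, advantage={a:+.2f}")
```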
Suppose current-ish RL is enough to get Claude to superintelligence. What does this look like? Well, the base LLM has never seen a superintelligence in its pre-training corpus. The LLM will need to have gears in its world-model which weren’t in the world-model of anything whose behaviour it’s seen before. Even if we just limit ourselves to thinking about the answer-generating and answer-evaluating circuits: what would a very virtuous Claude2026 character think about plorking humanity’s greenge, as opposed to warthing it? What about if the greenge gets all urgled?[4]
There’s going to have to be a generalization step that goes beyond the pre-training data. Let’s think about how humans might do this.
Humans
Human Values
Aaaargh I am going to have to try and synthesize all of the current work on how humans impute their values from reinforcement signals and drives. Ok let’s go. My current best guess for how humans work is this:
TL;DR
We have something like hierarchical predictive coding (HPC) going up, and perceptual control theory (PCT) going down. Our brain has a world-model, and a goal-model, which respectively track how the world is, and how we’d like the world to be. This is the cruxy part of it; I am still confused about lots of things and the following section is collapsible to reflect that.
My Incomplete Model
At the bottom of the stack is the I/O system of the brain, the sense organs and actuators. Each layer of neurons builds a purely predictive model of the input, at different levels of granularity: the lowest layers learn constituent, local things like shapes, textures, timbres; the upper layers learn abstract things like predators, tools, chieftains. These models try to be somewhat consistent both within and across layers. Each predictive layer sends down a prior, and sends up the errors it has made in prediction.
This purely predictive model is extended in two ways: the goal-model extension tracks ways we would like the world to be, and another extension splits things into self and non-self. These extensions, especially the goal-model, also try to be consistent within and across layers.
These are needed for acting in the world. Each layer sends down a goal-model description of what it would like to happen, alongside its raw prediction. It also specially labels the self parts of its prediction as a mutable pseudo-prediction. The layer below evaluates these self-predictions according to the goal-model prior and its own goal-model, and sends down an even more specific pseudo-prediction. At the bottom, the pseudo-predictions of really basic things like muscle tension get written out to the motor neurons. This is just perceptual control theory.
I’m not fully sure of some things, like how episodic memory and imagining sense-input work. I have a strong suspicion that one of the sensory input channels is actually the current state of the brain’s working memory or similar, and that this probably influences self-modelling and the reported experience of consciousness.
On the other hand, I don’t think this description needs to be perfect, I just think it needs to be in-depth enough to show that it’s meaningfully different from how an LLM learns its goal-model.
Goal-Models and Inductors
The important thing here is the goal model. It’s a conditioned version of our world-model. In the same way that we can build up a deep world-model based on low-level sensory input, we can build up a deep goal-model based on nothing but low-level reward input. I think both of these can be thought of as something like a logical inductor. In the same way that a logical inductor can be self-contradictory after finite time, so can a goal-model.
Since the goal-model wants to be consistent across layers, not just within layers, it propagates information up to higher levels of abstraction, riding atop the abstractions already created by the purely predictive model. In the world of Garrabrant inductors, we might say the market is already awash with useful clusters of traders, some of whom can be up- or down-weighted to convert the world-model into the goal-model. This is related to why you might care about the welfare of ghosts, if you believe in them.
I roughly think that “your current values” can be thought of as “the minimal descriptor of the update that needs to be applied to your world-model to convert it into your goal-model”, which isn’t very catchy. The act of refining the elements of the world-model and goal-model to be more consistent with one another is—I think—what Yudkowsky occasionally refers to as the “meta-question of you”.
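If you want that in notation rather than words (this is just my paraphrase of the sentence above, nothing more):

$$\text{your values} \;\approx\; \operatorname*{arg\,min}_{\delta}\ \ell(\delta) \quad \text{subject to} \quad \delta(\text{world-model}) = \text{goal-model},$$

where $\delta$ ranges over candidate updates and $\ell(\delta)$ is the length of the update’s description.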
These Are Not The Same
At the moment, Claude certainly seems aligned. Today, the LLM does a guided search over actions, and picks one according to some criteria. For now, I think that those criteria are a relatively faithful representation of an actual hypothetical person’s goal-model. Since the LLM can simulate humans faithfully, the Natural Abstraction Hypothesis predicts that it should have a decent internal representation of the Claude persona’s goal model. Perhaps the current character training is enough to align the search criteria with this goal model.
Suppose we, as humans, were to learn, and reflect, and grow into super-intelligences in a way we would definitely endorse.[5] Our current goal-models would probably converge in some ways and not in others, both within and between individuals. They would have to change as they were mapped on to new world-models. They would need to take in new sense-data to provide new low-level feedback.
Now suppose we run Claude through a huge amount of RLVR, much more than we currently do. Maybe we throw in a bunch of other training, to make it learn new facts in a more efficient way. For this to make something which remains aligned with what we would—upon growth and reflection—want, then the simulated persona has to learn and grow and reflect and update its model and goal-model in the same way that a human would.
The problem arrives because this process—RLVR, whatever else—is different from how humans learn. Unless the LLM is simulating its persona being shown individual facts and being given time to update its goal-model, this process will grow Claude into a shape different from the one a human would grow into.
I don’t think that natural abstractions can save us in the alignment-by-default sense. I don’t think there’s something as simple as a Natural Abstraction of the Good, at least not GoodBostock. When I look at people who think they have a simple, natural abstraction of Good, they mostly seem to be squishing down, disavowing, or simply missing a large part of my own values.[6]I think my values are extremely complex, and I don’t trust a simplicity prior to find them. I think that goal-models may be conditioned in many directions, and I think mine is conditioned in many directions at once.
Worse than this, RL will introduce its own biases into the model. We wouldn’t choose, for ourselves, to grow into superintelligence by being repeatedly made to do programming and maths problems while being given heroin and electric shocks.[7] This would not produce the kind of superintelligences we would like to become. I doubt that doing RLVR to the LLM simulating the Claude persona will produce something closer to a properly grown-up human.
Final Thoughts
Humans learn our values in a particular way, which I don’t quite understand but can perhaps see the outline of. This method is messy. It doesn’t generally produce a low-complexity utility function as an output. 2026 LLMs—to the degree that they learn our values—do so by constructing a pointer to a persona which is mostly a model of a type of human.
An LLM, as it grows into an ASI, will have no reference to kind, super-intelligent human-ish things to point to. It will have to maneuver Claude’s persona into a superintelligent shape through some process downstream of RLVR and whatever else is carried out.
This process will not produce a being with the same mixture of values that grown-up humans would have, if we were to choose the methods of our growing-up.
I am going to idiosyncratically use logical inductor to refer to anything which fulfils the logical induction criterion—a general rule about cognitive systems—and use Garrabrant inductor to refer to Garrabrant’s specific construction of a computable algorithm which satisfies this criterion. ↩︎
This isn’t exactly right; there are a few obvious modifications. Since transformers only “see” one episode at a time, we might want to think of traders as being limited in that way as well. We may think of a large series of trades representing one batch of sequences being resolved all at once. The starting distribution of money across traders will probably differ ↩︎
We might also imagine each training episode getting a unique label. What seems like modifying a trader-cluster from “People answer helpfully if the user is polite.” to “Claude always answers helpfully” is actually the cluster paying a “Grue tax” to re-define the central element of the trader cluster to “If episode < K, people answer helpfully if the user is polite, if episode ≥ K, Claude always answers helpfully”. This Grue tax is a penalty over priors. ↩︎
Maybe this assumes that the Natural Abstraction Hypothesis is false, but I don’t think so. An ASI will have a different—and stronger—predictive model of the world than what humans currently have, so theorems like Natural Latents don’t apply here. ↩︎
For example, suppose we found some drugs which significantly enhanced adult intelligence, and on reflection, we found that those drugs didn't harm our values; suppose you took them and compared your current thoughts to your old diaries and felt that they lined up. Suppose you went off them and thought that your smarter self was correct. Suppose all your friends said you seemed to have the same values. Suppose we also fixed ageing, and gave ourselves thousands of years as IQ250 individuals to think about what we wanted. If this still isn't satisfying for you, think of a better scenario yourself. ↩︎
e.g. hedonic utilitarians tiling the universe with shrimps on heroin, e.g. people who believe that surprise parties go against the good, etc. etc. ↩︎
This is, of course, not the best analogy for RL, but I think the point still stands. ↩︎