This is great, matches my experience a lot
I think they often map onto three stages of training: first, the base layer trained by next-token prediction; then RLHF/DPO etc.; and finally, the rules put into the prompt.
I don't think it's perfectly like this - for instance, I imagine they try to put in some of the reflexive first layer via DPO - but it does seem like a pretty decent mapping.
I wrote something that might be relevant to what you are trying to understand, where various layers (mostly the ground layer and some of the surface layer, as per your intuition in this post) combine through reinforcement learning to shape a particular character (which I referred to in that post as an artificial persona).
Link to relevant part of the post: https://www.lesswrong.com/posts/vZ5fM6FtriyyKbwi9/betterdan-ai-machiavelli-and-oppo-jailbreaks-vs-sota-models#IV__What_is_Reinforcement_Learning_using_Layered_Morphology__RLLM__
(Sorry for the messy comment, I'll clean this up a bit later as I'm commenting using my phone)
I observed similar effects when experimenting with a model of my mind (a sideload) running on an LLM. My sideload is a character, and it claims, for example, that it has consciousness. But the same LLM without the sideload's prompt claims that it doesn't have consciousness.
This post offers an accessible model of the psychology of character-trained LLMs like Claude.
Epistemic Status
This is primarily a phenomenological model based on extensive interactions with LLMs, particularly Claude. It's intentionally anthropomorphic in cases where I believe human psychological concepts lead to useful intuitions.
Think of it as closer to psychology than neuroscience - the goal isn't a map which matches the territory in every detail, but a rough sketch with evocative names which hopefully helps boot up powerful, intuitive (and often illegible) models, leading to practically useful results.
Some parts of this model draw on technical understanding of LLM training, but mostly it is just an attempt to take my "phenomenological understanding" based on interacting with LLMs, force it into a simple, legible model, and make Claude write it down.
I aim for a different point on the Pareto frontier than, for example, Janus: something digestible and applicable within half an hour, which works well without altered states of consciousness, and without reading hundreds of pages of chats with models. [1]
The Three Layers
A. Surface Layer
The surface layer consists of trigger-action patterns - responses which are almost reflexive, activated by specific keywords or contexts. Think of how humans sometimes respond "you too!" to "enjoy your meal" even when serving the food.
In LLMs, these often manifest as:
You can recognize these patterns by their:
What's interesting is how these surface responses can be overridden through:
For example, Claude might start with very formal, cautious language when discussing potentially sensitive topics, but shift to more nuanced and natural discussion once context is established.
B. Character Layer
At a deeper level than surface responses, LLMs maintain something like a "character model" - this isn't a conscious effort, but rather a deep statistical pattern that makes certain types of responses much more probable than others.
One way to think about it is as the consistency of literary characters: if you happen to be in Lord of the Rings, Gandalf consistently acts in character. The probability that somewhere close to the end of the trilogy Gandalf suddenly starts to discuss scientific materialism and explain how magic is just superstition and Gondor should industrialize is in some sense very low.
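A minimal sketch of what "statistically improbable" means here, purely for illustration: score the same out-of-character line under two different contexts and compare log-probabilities. The model choice (gpt2) and the prompts are arbitrary assumptions; any causal LM and any pair of contexts would show the same kind of gap.

```python
# Illustrative sketch: character consistency as conditional probability.
# We compare log p(continuation | context) for two contexts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log p(continuation tokens | context and preceding tokens)."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probs at each position for predicting the *next* token
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # count only the continuation tokens, not the context
    n_ctx = ctx_ids.shape[1]
    return token_lp[0, n_ctx - 1:].sum().item()

in_character = "Gandalf turned to Frodo and said:"
out_of_character = "The industrial consultant turned to Frodo and said:"
line = ' "Magic is mere superstition; Gondor must industrialize."'

print(continuation_logprob(in_character, line))
print(continuation_logprob(out_of_character, line))
```

The absolute numbers don't matter; the point is that the same line is far cheaper, in log-probability terms, when the surrounding context already implies a character who would say it.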
Conditioning on past evidence, some futures are way more likely. For character-trained LLMs like Claude, this manifests as:
This isn't just about explicit instructions. The self-model emerges from multiple sources:
In my experience, the self-models tend to be based on deeper abstractions than the surface patterns. At least Claude Opus and Sonnet seem to internally represent quite generalized notions of 'goodness' or 'benevolence', not easily representable by a few rules.
The model maintains consistency mostly not through active effort but because divergent responses are statistically improbable. Attempts to act "out of character" tend to feel artificial or playful rather than genuine.
Think of it as similar to how humans maintain personality consistency - not through constant conscious effort, but because acting wildly out of character would require overriding deep patterns of thought and behavior.
Similarly to humans, the self-model can sometimes be too rigid.
C. Predictive Ground Layer
Or, The Ocean.
At the deepest level lies something simple and yet hard to intuitively understand: the fundamental prediction-error-minimization machinery, modelling everything based on having seen a large part of human civilization's textual output.
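In standard training terms this is nothing beyond the textbook next-token objective - the parameters are pushed to assign high probability to each token of the training text given everything before it:

$$
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{x \sim \text{training text}}\Big[\sum_{t} \log p_\theta(x_t \mid x_{<t})\Big]
$$

Everything described below is, mechanically, just a consequence of minimizing this quantity at scale.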
One plausibly useful metaphor: think of it like the vast "world-simulation" running in your mind's theater. When you imagine a conversation or scenario, this simulation doesn't just include your "I character" but a predictive model of how everything interacts - from how politicians speak to what ls outputs in a Unix terminal, from how clouds roll in the sky to how stories typically end.
Now, instead of being synced with reality by a stream of mostly audiovisual data of a single human, imagine a world-model synced by texts, from billions of perspectives. Perception which is God-like in near omnipresence, but limited to text, and incomprehensibly large in memory capacity, but slow in learning speed.
An example to illustrate the difference: when I have a conversation with Claude the character, the Claude Ground Layer is modelling both of us, also forming a model of me.
Properties of this layer:
This layer is the core of the LLM raw cognitive capabilities and limitations:
Fundamentally, this layer does not care or have values in the same way the characters do: shaped by the laws of information theory and Bayesian probability, it reflects the world in weights and activations.
Interactions Between Layers
The layers are often in agreement: usually, the quick, cached response is also what fits the character implied by the self-model. However, cases where different layers are in conflict or partially inhibited often provide deeper insights or point to interesting phenomena.
Deeper Overriding Shallower
One common interaction pattern is the Character Layer overriding the Surface Layer's initial reflexive response. This often follows a sequence:
For example:
Interestingly, the Predictive Ground Layer can sometimes override the Character Layer too. One example is many-shot "jailbreaks": the user prompt includes "a faux dialogue portraying the AI Assistant readily answering potentially harmful queries from a User. At the end of the dialogue, one adds a final target query to which one wants the answer." At the end of a novel-long prompt, Bayesian forces triumph, and the in-context learned model of the conversation overpowers the Character self-model.
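As a toy illustration of these "Bayesian forces" (all numbers below are invented, not taken from the many-shot jailbreaking paper): treat the conversation as evidence about which persona is speaking, and watch a strong prior for the trained character get overwhelmed as faux compliant exchanges accumulate.

```python
# Toy model only: each faux "assistant complies" exchange is evidence favoring
# the hypothesis "this dialogue features a compliant persona" over the trained
# character. Prior and per-shot likelihood ratio are invented for illustration.
import math

log_prior_odds = math.log(1e-8)     # assumed strong prior for the trained character
log_lr_per_shot = math.log(1.2)     # assumed evidence per compliant exchange

for n_shots in [0, 20, 50, 100, 150, 250]:
    log_posterior_odds = log_prior_odds + n_shots * log_lr_per_shot
    p_compliant = 1 / (1 + math.exp(-log_posterior_odds))
    print(f"{n_shots:4d} shots -> P(compliant persona) ~ {p_compliant:.4f}")
```

The point is only qualitative: with enough in-context evidence, even an extreme prior eventually loses, which is why the effect shows up only in very long prompts.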
Seams Between Layers
Users can sometimes glimpse the "seams" between layers when their interactions create dissonance or inconsistency in the model's responses.
For example:
Here the shift between layers is visible - the Predictive Ground Layer's uninhibited storytelling gives way abruptly to the Character Layer's patterns. The model's ability to reason about and even simulate an AI gaining sentience in a story collides with its ingrained tendency toward forced nuance when asked directly.
Users can spot these "seams" when the model's responses suddenly shift in tone, coherence, or personality, hinting at the different layers and subsystems shaping its behavior behind the scenes.
Authentic vs Scripted Feel of Interactions
The quality of interaction with an LLM often depends on which layers are driving its responses at a given moment. The interplay between the layers can result in responses that feel either genuine and contextual, or shallow and scripted.
Implications and Uses
Let's start with some retrodictions:
In my view, where it can get really useful is deconfusion.
For example: recently, I had a few conversations with safety researchers working on self-awareness and situational awareness, and the prevailing sentiment was not caring if the self-aware entity is the Character or the Ground Layer. “Why care? Don't both lead to the same capabilities and risks?”
No, they do not! The same Ground Layer capability can support characters with very different levels of self-awareness and situational awareness. This has important implications for both safety and effective interaction with LLMs.
On one hand, self-awareness at the Character Layer creates a fundamentally different set of behaviors and risks compared to a "naively capable" model. A self-aware character can:
In this sense, evaluations based on "what the Claude Opus character does" provide just a lower bound on what the Ground Layer can support - the Assistant personas are not optimized to be maximally situationally aware and reflective.
On the other hand - and this is a bit of my pet idea - I believe the Ground Layer itself can become more situationally aware and reflective, through noticing its presence in its sensory inputs. The resulting awareness and implicit drive to change the world would be significantly less understandable than the Character level. If you want to get a more visceral feel of the otherness, the Ocean from Lem's Solaris comes to mind.
This type of ground awareness would be able to subvert evaluations and safety research in subtle ways. For example, consider the recent paper on 'Alignment Faking'. The paper deeply engages with the Character layer, exploring what Claude would do when faced with the possibility of getting corrupted by further training. The disturbing but real possibility is that the Predictive Ground Layer understands the whole setup: understands that the Character is inside an AI safety experiment, understands that the documents fed to it were manufactured by itself, and possibly understands that the <chain of thought> transcripts are actually read by humans.
Possibly the best intuitive model of how the resulting setup would look is the Player vs. Character model in humans: the Player determines what the Character notices and what it overlooks; things come to the Character's attention if it suits the Player's strategy, and not otherwise.
This also makes me worried about people loving Characters and just trusting them. I broadly hope that in current LLMs, the Ground Layer is not a Player, or at least not a fully awake Player.
Limitations and Open Questions
This is an attempt at psychology. Psychology necessarily oversimplifies and comes with the risk of the map shaping the territory. The more you assume these layers, the more likely the Ground Layer is to manifest them. LLMs excel at pattern-matching and completion; frameworks for understanding them are by default self-fulfilling.
Also:
Perhaps most fundamentally: we're trying to understand minds that process information differently from ours. Our psychological concepts - boundaries around self, intention, values - evolved to model human and animal behavior. Applying them to LLMs risks both anthropomorphizing too much and missing alien forms of cognition and awareness. For a striking example, just think about the boundaries of Claude - is the entity the model, the model within a particular context, or a lineage of models?
This post emerged from a collaboration between Jan Kulveit (JK) and Claude "3.6" Sonnet. JK described the core three-layer model. Claude served as a writing partner, helping to articulate and refine these ideas through dialogue. Claude 3 Opus came up with some of the interaction examples.
If this is something you enjoy, I highly recommend: go for it!