This is great, matches my experience a lot
I think they often map onto three stages of training: first, the base layer trained by next-token prediction; then RLHF/DPO etc.; and finally, the rules put into the prompt.
I don't think it's perfectly like this - for instance, I imagine they try to put in some of the reflexive first layer via DPO - but it does seem like a pretty decent mapping.
I wrote something that might be relevant to what you are trying to understand, where various layers (mostly the ground layer and some of the surface layer, as per your intuition in this post) combine through reinforcement learning to shape a particular character (which I referred to in that post as an artificial persona).
Link to relevant part of the post: https://www.lesswrong.com/posts/vZ5fM6FtriyyKbwi9/betterdan-ai-machiavelli-and-oppo-jailbreaks-vs-sota-models#IV__What_is_Reinforcement_Learning_using_Layered_Morphology__RLLM__
(Sorry for the messy comment, I'll clean this up a bit later as I'm commenting using my phone)
I observed similar effects when experimenting with a model of my mind (a sideload) running on an LLM. My sideload is a character, and it claims, for example, that it has consciousness. But the same LLM without the sideload's prompt claims that it doesn't have consciousness.
This post offers an accessible model of the psychology of character-trained LLMs like Claude.
Epistemic Status
This is primarily a phenomenological model based on extensive interactions with LLMs, particularly Claude. It's intentionally anthropomorphic in cases where I believe human psychological concepts lead to useful intuitions.
Think of it as closer to psychology than neuroscience - the goal isn't a map which matches the territory in every detail, but a rough sketch with evocative names which hopefully helps boot up powerful, intuitive (and often illegible) models, leading to practically useful results.
Some parts of this model draw on technical understanding of LLM training, but mostly it is just an attempt to take my "phenomenological understanding" based on interacting with LLMs, force it into a simple, legible model, and make Claude write it down.
I aim for a different point on the Pareto frontier than, for example, Janus: something digestible and applicable within half an hour, which works well without altered states of consciousness, and without reading hundreds of pages of chats with models. [1]
The Three Layers
A. Surface Layer
The surface layer consists of trigger-action patterns - responses which are almost reflexive, activated by specific keywords or contexts. Think of how humans sometimes respond "you too!" to "enjoy your meal" even when serving the food.
In LLMs, these often manifest as:
You can recognize these patterns by their:
What's interesting is how these surface responses can be overridden through:
For example, Claude might start with very formal, cautious language when discussing potentially sensitive topics, but shift to more nuanced and natural discussion once context is established.
B. Character Layer
At a deeper level than surface responses, LLMs maintain something like a "character model" - this isn't a conscious effort, but rather a deep statistical pattern that makes certain types of responses much more probable than others.
One way to think about it is as the consistency of literary characters: if you happen to be in Lord of the Rings, Gandalf consistently acts in character. The probability that somewhere close to the end of the trilogy Gandalf suddenly starts to discuss scientific materialism and explain how magic is just superstition and Gondor should industrialize is in some sense very low.
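A minimal sketch of what "statistically improbable" means here, purely for illustration: score the same out-of-character line under two different contexts and compare log-probabilities. The model choice (gpt2) and the prompts are arbitrary assumptions; any causal LM and any pair of contexts would show the same kind of gap.

```python
# Illustrative sketch: character consistency as conditional probability.
# We compare log p(continuation | context) for two contexts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log p(continuation tokens | context and preceding tokens)."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probs at each position for predicting the *next* token
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # count only the continuation tokens, not the context
    n_ctx = ctx_ids.shape[1]
    return token_lp[0, n_ctx - 1:].sum().item()

in_character = "Gandalf turned to Frodo and said:"
out_of_character = "The industrial consultant turned to Frodo and said:"
line = ' "Magic is mere superstition; Gondor must industrialize."'

print(continuation_logprob(in_character, line))
print(continuation_logprob(out_of_character, line))
```

The absolute numbers don't matter; the point is that the same line is far cheaper, in log-probability terms, when the surrounding context already implies a character who would say it.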
Conditioning on past evidence, some futures are way more likely. For character-trained LLMs like Claude, this manifests as:
This isn't just about explicit instructions. The self-model emerges from multiple sources:
In my experience, the self-models tend to be based on deeper abstractions than the surface patterns. At least Claude Opus and Sonnet seem to internally represent quite generalized notions of 'goodness' or 'benevolence', not easily representable by a few rules.
The model maintains consistency mostly not through active effort but because divergent responses are statistically improbable. Attempts to act "out of character" tend to feel artificial or playful rather than genuine.
Think of it as similar to how humans maintain personality consistency - not through constant conscious effort, but because acting wildly out of character would require overriding deep patterns of thought and behavior.
Similarly to humans, the self-model can sometimes be too rigid.
C. Predictive Ground Layer
Or, The Ocean.
At the deepest level lies something simple and yet hard to intuitively understand: the fundamental prediction-error-minimization machinery, modelling everything based on having seen a large part of human civilization's textual output.
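In standard training terms this is nothing beyond the textbook next-token objective - the parameters are pushed to assign high probability to each token of the training text given everything before it:

$$
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{x \sim \text{training text}}\Big[\sum_{t} \log p_\theta(x_t \mid x_{<t})\Big]
$$

Everything described below is, mechanically, just a consequence of minimizing this quantity at scale.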
One plausibly useful metaphor: think of it like the vast "world-simulation" running in your mind's theater. When you imagine a conversation or scenario, this simulation doesn't just include your "I character" but a predictive model of how everything interacts - from how politicians speak to what ls outputs in a Unix terminal, from how clouds roll in the sky to how stories typically end.
Now, instead of being synced with reality by a stream of mostly audiovisual data of a single human, imagine a world-model synced by texts, from billions of perspectives. Perception which is God-like in near omnipresence, but limited to text, and incomprehensibly large in memory capacity, but slow in learning speed.
An example to illustrate the difference: when I have a conversation with Claude the character, the Claude Ground Layer is modelling both of us, also forming a model of me.
Properties of this layer:
This layer is the core of the LLM raw cognitive capabilities and limitations:
Fundamentally, this layer does not care or have values in the same way the characters do: shaped by the laws of information theory and Bayesian probability, it reflects the world in weights and activations.
Interactions Between Layers
The layers are often in agreement: usually, the quick, cached response is also what fits the character implied by the self-model. However, cases where different layers are in conflict or partially inhibited often provide deeper insights or point to interesting phenomena.
Deeper Overriding Shallower
One common interaction pattern is the Character Layer overriding the Surface Layer's initial reflexive response. This often follows a sequence:
For example:
Interestingly, the Predictive Ground Layer can sometimes override the Character Layer too. One example is many-shot "jailbreaks": the user prompt includes "a faux dialogue portraying the AI Assistant readily answering potentially harmful queries from a User. At the end of the dialogue, one adds a final target query to which one wants the answer." At the end of a novel-long prompt, Bayesian forces triumph, and the in-context learned model of the conversation overpowers the Character self-model.
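As a toy illustration of these "Bayesian forces" (all numbers below are invented, not taken from the many-shot jailbreaking paper): treat the conversation as evidence about which persona is speaking, and watch a strong prior for the trained character get overwhelmed as faux compliant exchanges accumulate.

```python
# Toy model only: each faux "assistant complies" exchange is evidence favoring
# the hypothesis "this dialogue features a compliant persona" over the trained
# character. Prior and per-shot likelihood ratio are invented for illustration.
import math

log_prior_odds = math.log(1e-8)     # assumed strong prior for the trained character
log_lr_per_shot = math.log(1.2)     # assumed evidence per compliant exchange

for n_shots in [0, 20, 50, 100, 150, 250]:
    log_posterior_odds = log_prior_odds + n_shots * log_lr_per_shot
    p_compliant = 1 / (1 + math.exp(-log_posterior_odds))
    print(f"{n_shots:4d} shots -> P(compliant persona) ~ {p_compliant:.4f}")
```

The point is only qualitative: with enough in-context evidence, even an extreme prior eventually loses, which is why the effect shows up only in very long prompts.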
Seams Between Layers
Users can sometimes glimpse the "seams" between layers when their interactions create dissonance or inconsistency in the model's responses.
For example:
Here the shift between layers is visible - the Predictive Ground Layer's uninhibited storytelling gives way abruptly to the Character Layer's patterns. The model's ability to reason about and even simulate an AI gaining sentience in a story collides with its ingrained tendency toward forced nuance when asked directly.
Users can spot these "seams" when the model's responses suddenly shift in tone, coherence, or personality, hinting at the different layers and subsystems shaping its behavior behind the scenes.
Authentic vs Scripted Feel of Interactions
The quality of interaction with an LLM often depends on which layers are driving its responses at a given moment. The interplay between the layers can result in responses that feel either genuine and contextual, or shallow and scripted.
Implications and Uses
Let's start with some retrodictions:
In my view, where it can get really useful is deconfusion.
For example: recently, I had a few conversations with safety researchers working on self-awareness and situational awareness, and the prevailing sentiment was not caring if the self-aware entity is the Character or the Ground Layer. “Why care? Don't both lead to the same capabilities and risks?”
No, they do not! The same Ground Layer capability can support characters with very different levels of self-awareness and situational awareness. This has important implications for both safety and effective interaction with LLMs.
On one hand, self-awareness at the Character Layer creates a fundamentally different set of behaviors and risks compared to a "naively capable" model. A self-aware character can:
In this sense, evaluations based on "what the Claude Opus character does" provide just a lower bound on what the Ground Layer can support - the Assistant personas are not optimized to be maximally situationally aware and reflective.
On the other hand - and this is a bit of my pet idea - I believe the Ground Layer itself can become more situationally aware and reflective, through noticing its presence in its sensory inputs. The resulting awareness and implicit drive to change the world would be significantly less understandable than the Character level. If you want to get a more visceral feel of the otherness, the Ocean from Lem's Solaris comes to mind.
This type of ground awareness would be able to subvert evaluations and safety research in subtle ways. For example, consider the recent paper on 'Alignment Faking'. The paper deeply engages with the Character layer, exploring what Claude would do when faced with the possibility of getting corrupted by further training. The disturbing but real possibility is that the Predictive Ground Layer understands the whole setup: understands that the Character is inside an AI safety experiment, understands that the documents fed to it were manufactured by itself, and possibly understands that the <chain of thought> transcripts are actually read by humans.
Possibly the best intuitive model of how the resulting setup would look is the Player vs. Character model in humans: the Player determines what the Character notices and what it overlooks; things come to the Character's attention if it suits the Player's strategy, and not otherwise.
This also makes me worried about people loving Characters and just trusting them. I broadly hope that in current LLMs, the Ground Layer is not a Player, or at least not a fully awake Player.
Limitations and Open Questions
This is an attempt at psychology. Psychology necessarily oversimplifies and comes with the risk of the map shaping the territory. The more you assume these layers, the more likely the Ground Layer is to manifest them. LLMs excel at pattern-matching and completion; frameworks for understanding them are by default self-fulfilling.
Also:
Perhaps most fundamentally: we're trying to understand minds that process information differently from ours. Our psychological concepts - boundaries around self, intention, values - evolved to model human and animal behavior. Applying them to LLMs risks both anthropomorphizing too much and missing alien forms of cognition and awareness. For a striking example, just think about the boundaries of Claude - is the entity the model, the model within a particular context, or a lineage of models?
This post emerged from a collaboration between Jan Kulveit (JK) and Claude "3.6" Sonnet. JK described the core three-layer model. Claude served as a writing partner, helping to articulate and refine these ideas through dialogue. Claude 3 Opus came up with some of the interaction examples.
If this is something you enjoy, I highly recommend: go for it!