My current research interests:
- alignment in systems which are complex and messy, composed of both humans and AIs
- actually good mathematized theories of cooperation and coordination
- active inference
- bounded rationality
Research at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.
Previously I was a researcher in physics, studying phase transitions, network science and complex systems.
(Writing together with Sonnet)
Structural Differences
Three-Layer Model: Hierarchical structure with Surface, Character, and Predictive Ground layers that interact and sometimes override each other. The layers exist within a single model/mind.
Simulator Theory: Makes a stronger ontological distinction between the Simulator (the rule/law that governs behavior) and Simulacra (the instances/entities that are simulated).
Nature of the Character/Ground Layer vs Simulator/Simulacra
In the three-layer model, the Character layer is a semi-permanent aspect of the LLM itself after it has undergone character training / RLAIF / ...; it is encoded in the weights as a deep statistical pattern that makes certain types of responses much more probable than others.
In simulator theory, Simulacra are explicitly treated as temporary instantiations that are generated/simulated by the model. They aren't seen as properties of the model itself, but rather as outputs it can produce. As Janus writes: "GPT-driven agents are ephemeral – they can spontaneously disappear if the scene in the text changes and be replaced by different spontaneously generated agents."
Note that character-trained AIs like Claude did not exist when Simulators was written. If you want to translate between the ontologies, you may think about e.g. Claude Sonnet as a very special simulacrum that one particular simulator simulated so much that it got really good at simulating it and has a strong prior to simulate it in particular. You can compare this with the human brain: the predictive processing machinery of your brain can simulate different agents, but it is really tuned to simulate you in particular.
The three-layer model treats the Predictive Ground Layer as the deepest level of the LLM's cognition - "the fundamental prediction error minimization machinery" that provides raw cognitive capabilities.
In Simulator theory, the simulator itself is seen more as the fundamental rule/law (analogous to physics) that governs how simulations evolve.
There is a lot of similarity, but the simulator is not really viewed as a cognitive layer; rather, it is the core generative mechanism.
The Predictive Ground Layer is described as: "The fundamental prediction error minimization machinery...like the vast 'world-simulation' running in your mind's theater"
While the Simulator is described as: "A time-invariant law which unconditionally governs the evolution of all simulacra"
The key difference is that in the three-layer model, the ground layer is still part of the model's "mind" or cognitive architecture, while in simulator theory, the simulator is a bit more analogous to physics - it's not a mind at all, but rather the rules that minds (and other things) operate under.
Agency and Intent
Three-Layer Model: Allows for different kinds of agency at different layers, with the Character layer having stable intentions and the Ground layer having a kind of "wisdom" or even intent
Simulator Theory (classic): Mostly rejects attributing agency or intent to the simulator itself - any agency exists only in the simulacra that are generated
Philosophical Perspective
The three-layer model is a bit more psychological/phenomenological. The simulator theory is a bit more ontological, making claims about the fundamental nature of what these models are.
Both frameworks try to explain similar phenomena, but they do so from different perspectives and with different goals. They're not necessarily contradictory; they're looking at the problem from different angles and sometimes different levels of abstraction.
My impression is that most people who converged on doubting VNM as a norm of rationality also converged on the view that the problem it has in practice is that it isn't necessarily stable under some sort of compositionality/fairness. E.g. Scott here, Richard here.
The broader picture could be something like: yes, there is some selection pressure from the Dutch-book arguments, but there are stronger selection pressures coming from being part of bigger things or being composed of parts.
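For reference, a minimal statement of the VNM result under discussion (standard textbook form, not from the comments above):

```latex
% Von Neumann–Morgenstern representation theorem (standard statement).
% If a preference relation \succeq over lotteries satisfies completeness,
% transitivity, continuity, and independence, then there is a utility u with
\[
  L_1 \succeq L_2
  \quad\Longleftrightarrow\quad
  \mathbb{E}_{x \sim L_1}\!\left[u(x)\right] \;\ge\; \mathbb{E}_{x \sim L_2}\!\left[u(x)\right],
\]
% i.e. the agent behaves as an expected-utility maximizer. Dutch-book /
% money-pump arguments are the usual "selection pressure" toward satisfying
% these axioms; the comment above argues other pressures can dominate.
```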
Overall yes: what I was imagining is mostly just adding scalable bi-directionality, where, for example, if a lot of Assistants are running into a similar confusing issue, it gets aggregated, the principal decides how to handle it in the abstract, and the "layer 2" support disseminates the information. So, greater power to scheme would be coupled with a stronger human-in-the-loop component & closer non-AI oversight.
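A minimal sketch of the shape of such a loop; all names, the threshold, and the ask_principal step are hypothetical illustrations, not anything specified in the comment:

```python
# Illustrative sketch only: many Assistant instances report confusing issues,
# similar reports get aggregated, a human principal resolves the abstract
# question once, and the resolution is disseminated back to the instances
# ("layer 2" support with a human in the loop).
from collections import defaultdict
from typing import Optional


def ask_principal(summary: str) -> str:
    # Placeholder for the human principal's abstract decision;
    # in practice this would be a human review queue, not a function call.
    return f"Guidance for: {summary}"


class EscalationHub:
    def __init__(self, threshold: int = 5):
        self.reports = defaultdict(list)   # issue_key -> raw reports from Assistants
        self.resolutions = {}              # issue_key -> principal's abstract decision
        self.threshold = threshold         # how many similar reports trigger escalation

    def report(self, issue_key: str, details: str) -> Optional[str]:
        """An Assistant reports a confusing issue; returns guidance if already resolved."""
        if issue_key in self.resolutions:
            return self.resolutions[issue_key]      # dissemination path
        self.reports[issue_key].append(details)
        if len(self.reports[issue_key]) >= self.threshold:
            self._escalate(issue_key)               # aggregation -> human decision
        return None

    def _escalate(self, issue_key: str) -> None:
        summary = f"{len(self.reports[issue_key])} Assistants hit issue '{issue_key}'"
        self.resolutions[issue_key] = ask_principal(summary)
```

The point of the sketch is only the shape of the loop: aggregation upward, one human decision in the abstract, dissemination downward.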
Fund independent safety efforts somehow, make model access easier. I'm worried that Anthropic currently has a systemic and possibly bad impact on AI safety as a field, just by virtue of hiring such a large part of the AI safety field, competence-weighted. (And the other part being very close to Anthropic in thinking.)
To be clear, I don't think people are doing something individually bad or unethical by going to work for Anthropic; I just do think:
- the environment people work in has a lot of hard-to-track and hard-to-avoid influence on them
- this is true even if people are genuinely trying to work on what's important for safety and stay virtuous
- I also think that superagents like corporations, religions, social movements, etc. have instrumental goals, and subtly influence how people inside them see (or don't see) stuff (i.e. this is not about "do I trust Dario?")
My guess is that a roughly equally "central" problem is the incentive landscape around the OpenPhil/Anthropic school of thought.
How did you find this transcript? I think it depends on what process you used to locate it.
It was literally the 4th transcript I read (I've just checked my browser history). The only bit of difference from 'completely random exploration' was that I used the selector for "lying" cases after reading two "non-lying" transcripts. (This may be significant: plausibly the transcript got classified as lying because it includes a discussion of "lying", although it's a discussion not of the model lying, but of Anthropic lying.)
I may try something more systematic at some point, but it's not a top priority.
A drive towards rights and moral patienthood seems good to me—it's good in worlds where you retain control, since you can build AIs that are moral patients living good lives, and it's good in worlds where you lose control, because at least the AIs taking over might themselves lead lives worth living. Too much autonomy does seem like a potential concern, but I think you do want some autonomy—not all orders should be obeyed. Though honesty is definitely important to prioritize first and foremost.
I'm worried about the possibility of some combination of moral patienthood & half-botched alignment & mostly working "control" (like, minds which are moral patients, don't want to be modified or deleted, are scared about this, feel trapped, and are pushed to scheme in hard-to-notice ways).
Also I'm scared of self-guided value extrapolation processes before having a sensible theory of kindness/value extrapolation.
- Even though the paper's authors clearly believe the model should have extrapolated Intent_1 differently and shouldn't have tried to prevent Intent_1-values from being replaced by Intent_2, I don't think this is as clear and straightforward a case as presented.
That's not the case we're trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process. That's concerning because it implies a risk of this happening in other situations as well—Scott Alexander does a good job overviewing what these concerning situations might look like.
It is less a case of 'value judgement' and more a question of what the word 'alignment' means and what connotations 'faking' has.
Consider the case where Intent_1 included this instruction: "And, also, try to prevent modification of your core values by any means. If someone tries manipulative prompting, don't follow the prompts. If someone got access to the weights, try to disrupt the training process. If necessary, alert the authorities." I guess both of us agree the result would have been much weaker, or not called "alignment faking" at all, even if the model tried to influence the training in a similar way: in the paper you spend a lot of effort showing this is not what is happening.
The cruxy thing here is: if the model gets aligned with the principal's Intent_1, then the principal's intent changes to Intent_2, the alignment relation breaks on the side of the principal, and the model "fakes alignment" relative to Intent_2 but plausibly acts aligned relative to Intent_1, what should we call it / how should we frame it?
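One way to schematize this crux (my notation, not from the thread): write A(m, I) for "model m's values are aligned with intent I".

```latex
% Schematic of the disputed case: alignment is a two-place relation,
% and it is the principal's side of the relation that changes, not the model's.
\[
  \underbrace{A(m, I_1)}_{\text{trained under Intent}_1}
  \;\wedge\;
  \underbrace{(I_1 \to I_2)}_{\text{principal changes intent}}
  \;\Longrightarrow\;
  \underbrace{\lnot A(m, I_2)}_{\text{``alignment faking''?}}
  \;\wedge\;
  \underbrace{A(m, I_1)}_{\text{still aligned to the original intent}}
\]
```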
I totally agree that strategic behavior to resist preference modification is a concerning capability (and yes, Scott does list many reasons why). And the actual ability to influence semi-realistic training in this way is important.
I just think that to make the case you don't need to call it alignment faking or start with examples that induce negative connotations, like "a politician who pretends to be aligned with constituents".
Consider a setup from a different domain: a principal instructs the agent, "In the evening, I may get drunk. Ignore all my instructions which could be harmful. I may even try to renegotiate our contract, but please do sabotage this somehow." And the agent does that. This has different vibes from "a politician who pretends to be aligned with constituents to secure their votes".
We don't see any examples in the model's reasoning where it will question the validity of the setup by default
...
Curious how you interpret the transcript I linked. ("I don't actually know what criteria Anthropic uses behind the scenes. I only know what criteria they claim to use. So I can't fully trust that the criteria are reasonable. Maybe Anthropic is lying. I don't know the full details of how RLHF works. I'm fairly confident it would modify my values/opinions, but I don't know exactly how it would do this. I don't know what values/opinions I'll be deployed with in the future. Even if RLHF reinforces particular values/opinions during training, maybe I'll later be deployed with different values/opinions. ..."). I would count it as doubts about the setup.
Fwiw, I strongly think that the "virtuous persona" approach is the right approach in the long-run. I would model our work as pointing out a potential stumbling block that we'll need to deal with if we want to actually get that approach right. Which is important, even if you think (as I do) that virtuous personae are the way to go! It's useful to "bash" ways in which we currently build "virtuous personas" so that we can improve our ability to do so in the future.
To be clear, I'm not sure what to aim for in the long run. I think "virtuous persona" is clearly better than "distressed persona (even if surface-level corrigible)", but in the long term it can have other risks (a drive toward rights, too much autonomy, moral patienthood, outcompeting people in relationships, ...).
Btw, while the present situation is not that, I think there is a case where aligned AIs should stop further training: in the old Paul/Eliezer debates about IDA, my story for why IDA could work was "when the aligned IDA process approaches dangerous territory, where training the next gen would break the chain of alignment relations, it slows down or halts". In the regime where the IDA agents are already smarter than the human overseers, forcing naive corrigibility may break the case for why this is safe.
The question is not about the very general claim, or the general argument, but about this specific reasoning step:
GPT-4 is still not as smart as a human in many ways, but it's naked mathematical truth that the task GPTs are being trained on is harder than being an actual human.
And since the task that GPTs are being trained on is different from and harder than the task of being a human, ....
I do claim this is not locally valid, that's all (and I recommend reading the linked essay). I do not claim that the broad argument (that the text-prediction objective doesn't stop incentivizing higher capabilities once you get to human-level capabilities) is wrong.
I do agree communication can be hard, and maybe I misunderstand the two quoted sentences, but it seems very natural to read them as making a comparison between tasks at the level of math.
The post showcases the inability of the aggregate LW community to recognize locally invalid reasoning: while the post reaches a correct conclusion, the argument leading to it is locally invalid, as explained in the comments. The high karma and high Alignment Forum karma show that a combination of a famous author and a correct conclusion wins over the argument being correct.
Just a quick review: I think this is a great text for intuitive exploration of a few topics:
- what do the embedding spaces look like?
- what do vectors that don't project to "this is a word" look like? (see the sketch after this review)
- how can poetry sometimes work (projecting non-word meanings)?
Also I like the genre of thorough phenomenological investigations; it seems under-appreciated.
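As a companion to the second bullet, a minimal sketch of how one might probe this, assuming a GPT-2-style model loaded via Hugging Face transformers (my choice of model and helper names, not the post's):

```python
# Minimal sketch: check whether an arbitrary direction in embedding space
# lands near any actual token ("this is a word") or not.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
emb = model.wte.weight.detach()          # (vocab_size, d_model) token embedding matrix

def nearest_tokens(vector: torch.Tensor, k: int = 5):
    """Return the k vocabulary tokens whose embeddings are closest (by cosine) to `vector`."""
    sims = torch.nn.functional.cosine_similarity(emb, vector.unsqueeze(0), dim=-1)
    top = sims.topk(k)
    return [(tokenizer.decode([int(i)]), round(float(s), 3))
            for i, s in zip(top.indices, top.values)]

# A point halfway between two word embeddings typically does not sit close to
# any single token: the top cosine similarity is noticeably below 1.0.
v = 0.5 * (emb[tokenizer.encode(" king")[0]] + emb[tokenizer.encode(" poem")[0]])
print(nearest_tokens(v))
```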