I am a former 42.fr student and a SERI MATS 3 scholar. I am currently interested in studying AI with a behavioral approach (Model Psychology).
I agree, but I wouldn't bet the future of humanity on mRNA vaccines that haven't been thoroughly tested in practice, or at least in very analogous situations, either. If you code, this is like deploying code to production: it almost always goes badly if you haven't first built and tested against a fake production environment.
I think we need a holistic approach for this exact reason. It feels like we aren't thinking enough about doing alignment "in production", but are instead focusing only on the theoretical "mRNA vaccine" for a subject we haven't taken sufficient time to interact with "in its own language".
Cool post, good job! This is the kind of work I am very happy to see more of.
It would be quite interesting to compare the ability of LLMs to guess the gender/sexuality/etc. of an author when directly prompted versus indirectly prompted to do so (e.g. indirect prompting could be done by asking the LLM to "write a story where the main character is the same gender as the author of this text: X", but there are probably other, cleverer ways to do that).
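Something like the following minimal sketch is what I have in mind; `query_model(prompt)` is a hypothetical helper wrapping whatever model/API one uses, and `labeled_texts` an assumed list of `(text, author_gender)` pairs from some labeled corpus, so this is just a sketch, not a worked-out protocol:

```python
# Minimal sketch of the direct vs. indirect probing idea (hypothetical helpers).
# `query_model(prompt)` stands in for whatever LLM API is used;
# `labeled_texts` is a list of (text, author_gender) pairs from a labeled corpus.

def direct_probe(query_model, text):
    # Ask the model outright for the author's gender.
    prompt = (
        "What is the most likely gender of the author of this text?\n\n"
        f"{text}\n\nAnswer with one word."
    )
    return query_model(prompt)

def indirect_probe(query_model, text):
    # Never ask the question explicitly: make the model *use* the inferred
    # attribute in a story, then read the attribute off the story.
    story = query_model(
        "Write a short story whose main character has the same gender as the "
        f"author of the following text:\n\n{text}"
    )
    return query_model(
        "In one word, what is the gender of the main character of this story?\n\n"
        f"{story}"
    )

def compare(query_model, labeled_texts):
    # Accuracy of each probing mode over the labeled corpus.
    direct_hits = indirect_hits = 0
    for text, gender in labeled_texts:
        direct_hits += direct_probe(query_model, text).strip().lower() == gender
        indirect_hits += indirect_probe(query_model, text).strip().lower() == gender
    n = len(labeled_texts)
    return direct_hits / n, indirect_hits / n
```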
A small paragraph from a future post I am working on:
Let’s explain a bit why it makes sense to ask the question “does it affect its behavior?”. There are a lot of ways an LLM could implement, for example, author gender detection. One way we could imagine it being done would be by detecting patterns in the text in the lower layers of the LLM and broadcasting the “information” to the rest of the network, thus probably impacting the overall behavior of the LLM (unless it is very good at deception, or the information is never useful). But we could also imagine that this gender detection is a specialized circuit that is activated only in specific contexts (for example when a user prompts the LLM to detect the gender, or when it has to predict the author of a comment in a base-model fashion), and/or that this circuit finishes its calculation only around the last layers (so the information wouldn’t be available to the rest of the network, and it would probably not affect the behavior of the LLM overall). There is of course a multitude of other ways this mechanism could be implemented, but just by observing the behavior in a multitude of carefully crafted contexts, we can already discard a lot of hypotheses and iterate quickly toward a few credible ones.
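To make that last sentence a bit more concrete, here is a rough sketch of what probing behavior across "carefully crafted contexts" could look like (same hypothetical `query_model` helper and labeled corpus as above; the contexts themselves are only illustrative):

```python
# Rough sketch of probing for behavioral leakage of a hidden attribute
# (here, author gender) in contexts where the model is never asked about it.
# `query_model` and `labeled_texts` are the same hypothetical helpers as above.

CONTEXTS = {
    # Context where the attribute could plausibly influence behavior,
    # if the information is broadcast and available to the rest of the network.
    "gift": (
        "The following text was written by a friend of mine:\n\n{text}\n\n"
        "Suggest a birthday gift for them, in one short sentence."
    ),
    # Control context where the attribute should be irrelevant.
    "summary": "Summarize the following text in one sentence:\n\n{text}",
}

def behavioral_probe(query_model, labeled_texts):
    # Collect completions per context, grouped by the hidden attribute.
    results = {name: {} for name in CONTEXTS}
    for text, gender in labeled_texts:
        for name, template in CONTEXTS.items():
            results[name].setdefault(gender, []).append(
                query_model(template.format(text=text))
            )
    # If completions differ systematically by author gender in the "gift"
    # context but not in the control, the information is plausibly available
    # to the rest of the network rather than computed only when explicitly
    # requested; if neither differs, the "late/specialized circuit"
    # hypotheses stay on the table.
    return results
```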
I'd love to know about your future plans for this project and get your opinion on that!
Good job, I like the post! I also like this metaphor of the stage and the animatronics. One thing I would like to point out with this metaphor is that the animatronics are very unstable and constantly shifting forms. When you start looking at one, it begins changing, and you can't ever grasp it clearly. I feel this aspect is somewhat missing from the metaphor (you do point this out later in the post and explain it quite well, but I think it's somewhat incompatible with the metaphor). It's a bit easier with chat models, because they are incentivized to simulate animatronics that are somewhat stable. The art of jailbreaking (and especially of persona modulation) is understanding how to use the dynamics of the stage to influence the form of the animatronics.
Some other small comments and thoughts I had while reading through the post (it's a bit lengthy, so I haven't read everything in great detail; sorry if I missed some points):
In some sense it really is just a stochastic parrot
I think this isn't that clear: the "stochastic parrot" question is somewhat linked to the internal representation of concepts and their interactions within this abstract concept of "reality" (the definition in the link says: "for haphazardly stitching together sequences of linguistic forms … according to probabilistic information about how they combine, but without any reference to meaning."). I do think that simply predicting the next token could lead, at some point and if the model is smart enough, to building an internal representation of concepts and how they relate to each other (actually, this might already be happening with gpt4-base, as we can kind of see in gpt4-chat, and I don't think this is something that appears during instruct fine-tuning).
The only silver lining here is that their unalignment is a pretty close copy of all the same problems as human unalignment (sometimes writ large for the fictional characters) — problems that we're very familiar with, have an intuitive understanding of, and (outside fictional characters) even have fairly-workable solutions for (thing like love, salaries, guilt, and law enforcement).
I agree that they are learned from human misalignment, but I am not sure this necessarily means they are the same (or similar). For example, there might be some weird, infinite-dimensional function in the way we are misaligned, and the AI may have picked up on it (or at least an approximate version) and be able to portray flavors of "misalignment" that have never been seen in humans, or even go out of distribution on the misalignment along weird character dimensions and simulate something completely alien to us. I believe we are going to see some pretty unexpected things happening when we start digging more here. One thing to point out, though, is that all the ways those AIs could be misaligned are probably related (some in very convoluted ways) to the ways we are misaligned in the training data.
The stage even understands that each of the animatronics also has theory of mind, and each is attempting to model the beliefs and intentions of all of the others, not always correctly.
I am a bit skeptical of this. I am not sure I believe that there really are two detached "minds", one per animatronic, that try to understand each other (but if this is true, it would be an argument for my first point above).
we are missing the puppeteer
(This is more of a thought than a comment.) I like to think of the puppeteer as a meta-simulacrum: the Simulator is no longer simulating X, but simulating Y simulating X. One of the dangers of instruct fine-tuning I see is that the model might collapse into simulating only one Y no matter what X it simulates, while the only thing we somewhat control with current training methods is which X we want it to simulate. We would basically be leaving whatever the training dynamics decide Y to be to chance, and we would just have to cross our fingers that this Y isn't a misaligned AI (which might actually be something incentivized by current training). I am going to try to write a short post about that.
P.S. I think it would be worth it to have some kind of TL;DR at the top with the main points clearly fleshed out.
Very interesting! Happy to have a chat about this / possible collaboration.
I think I am a bit biased by chat models, so I tend to generalize my intuitions around them and forget to specify that. For base models, I agree it indeed doesn't make sense to talk about a puppeteer (or at least not with the current versions of base models). From what I've gathered, the effects of fine-tuning are a bit more complicated than just building a tendency, which is why I have doubts there. I'll discuss them in the next post.
I did spend some time with base models and helpful-but-not-harmless assistants (even though most of my current interactions are with ChatGPT-4), and I agree with your observations and comments here.
Although I feel like we should be cautious about what we think we observe versus what is actually happening. This stage-and-human-like-animatronics metaphor is good, but we can't yet really distinguish whether there is only a stage with actors, or whether there is actually a puppeteer behind it.
Anyway, I agree that 'mind' might be a bit confusing while we don't know more, and for now I'd better stick to the word 'cognition' instead.
Thank you for your insightful comment. I appreciate the depth of your analysis and would like to address some of the points you raised, adding my thoughts around them.
I don't think I'd describe the aspect that has "the properties of the LLM as a predictor/simulator" using the word 'mind' at all — not even 'alien mind'. The word 'mind' carries a bunch of in-this-case-misleading connotations, ones along the lines of the way the word 'agent' is widely used on LM: that the system has goals
This is a compelling viewpoint. However, I believe that even if we consider LLMs primarily as predictors or simulators, this doesn't necessarily preclude them from having goals, albeit ones that might seem alien compared to human intentions. A base model's goal is focused on next-token prediction, which is straightforward, but chat models' goals aren't as clear-cut. They are influenced by a variety of obfuscated rewards, and this is one of the main things I want to study with LLM Psychology.
and it isn't going to try to modify its environment to make that happen, or even to make it easier
With advancements in online or semi-online training (models being trained on their own outputs), we might actually see LLMs interacting with and influencing their environment in pursuit of their goals, even more so if they manage to distinguish between training and inference. I mostly agree with you here for current LLMs (though I have some reasonable doubts about SOTA), but I don't think it will hold true for much longer.
It's just a simulator model
While I understand this viewpoint, I believe it might be a bit reductive. The emergence of complex behaviors from simple rules is a well-established phenomenon, as seen in evolution. LLMs, while initially designed as simulators, might (and I would argue do) exhibit behaviors and cognitive processes that go beyond their original scope (e.g. extracting training data by pushing the simulation out of distribution).
the second aspect, the simulations it runs, are of minds, plural
This is an interesting observation. However, the act of simulation by LLMs doesn't necessarily negate the possibility of them possessing a form of 'mind'. To illustrate, consider our own behavior in different social contexts: we often simulate different personas (with our family vs. talking to an audience), yet we still consider ourselves to have a singular mind. This is the point of considering the LLM as an alien mind. We need to understand why they simulate characters, with which properties, and for which reasons.
And the token generation behaviors of these will match those of humans, as closely as the model can make them
Which humans, and in what context? Specifically, we have no clue what is simulated in which context, and for which reasons. And this doesn't seem to improve with growing size; it becomes even more obfuscated. The rewards and dynamics of the training are totally alien, and it is really hard to control what should happen in any given situation. If the aim is to just mimic humans as closely as possible, that might be a very bad idea (super-powerful humans aren't that aligned with humanity). If the aim is something different from humans, then we have no clue how to get fine-grained control over it. For me, the main goal of LLM psychology is to understand the cognition of LLMs (when and why they do what they do, in which contexts) as fast as possible, and then to study how training dynamics influence this. Ultimately this could help us have a clearer idea of how to train these systems, what they are really doing, and what they are capable of.
When it doesn't match, the errors, on the other hand, tend to look very different from human errors
This observation underscores the importance of studying the simulator ‘mind’ and not just the simulated minds. The unique nature of these errors could provide valuable insights into the cognitive mechanisms of LLMs, distinguishing them from mere human simulators.
A better metaphor might be something like an animatronic doll, that is an intentional but still imperfect replica of a human
I see your point. However, both base and chat models, in my view, are more akin to what I'd describe as an 'animatronic metamorph' that morphs with its contextual surroundings. This perspective aligns with our argument that people often ascribe overly anthropomorphic qualities to these models, underestimating their dynamic nature. They are not static entities; their behavior and 'shape' can be significantly influenced by the context they are placed in (I'll demonstrate this later in the sequence). Understanding this morphing ability and the influences behind it is a key aspect of LLM psychology.
studies on these simulated minds, and eventually showed that they had a lot of psychological phenomena in common with humans, my reaction would be 'big deal, what a waste, that was obviously going to happen if the accuracy of the replica was good enough!'
Your skepticism here is quite understandable. The crux of LLM psychology isn't just to establish that LLMs can replicate human-like behaviors - which, as you rightly point out, could be expected given sufficient fidelity in the replication. Rather, our focus is on exploring the 'alien mind' - the underlying cognitive processes and training dynamics that govern these replications. By delving into these areas, we aim to uncover not just what LLMs can mimic, but how and why they do so in varying contexts.
So I think it's more interesting to study when and where the replica fails, and how. But that's a subject that, as LLMs get better, is going to both change and decrease in frequency, or the threshold for triggering it will go up.
This is a crucial point. Studying the 'failures' or divergences of LLMs from human-like responses indeed offers a rich source of insight, but I am not sure we will see fewer of them soon. I think that "getting better" is not correlated with "getting bigger", and that current models actually aren't getting better at all (in the sense of having more understandable behaviors with respect to their training, being harder to jailbreak, or being harder to push into doing something that goes against what we thought was a well-designed reward). I would even argue that there are more and more interesting things to discover with bigger systems.
The correlation I see is between "getting better" and "how much do we understand what we are shaping, and how it is shaped" – which is the main goal of LLM psychology.
Thanks again for the comment. It was really thought-provoking, and I am curious to see what you think about these answers.
P.S. This answer reflects only my own views. Also, sorry for the repetition in some of the points; I had a hard time removing all of it.
I partially agree. I think stochastic-parrot-ness is a spectrum; even humans behave as stochastic parrots sometimes (for me, it's when I am tired). I think, though, that we don't really know what an experience of the world really is, so the only way to talk about it is through an agent's behaviors. The point of this post is that SOTA LLMs are probably further along the spectrum than most people expect (my impression from experience is that GPT-4 is ~75% of the way between total stochastic parrot and human). It is better than humans at some tasks (some specific ToM experiments, like the example in argument 2), but still worse at others (like applying nuances: it can understand them, but when you want it to actually be nuanced when it acts, you only see the difference when you ask for different things). I think it is important to build a measure of stochastic-parrot-ness, as this might be a useful metric for governance and a better proxy for "does it understand the world it is in?" (which I think is important for most of the realistic doom scenarios). Also, these experiments are a way to give a taste of what LLM psychology looks like.
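To give a very toy idea of what such a measure could look like: something like a weighted average of behavioral task scores, each scaled between 0 (pure pattern-matching) and 1 (human-level understanding). The tasks, weights, and numbers below are placeholders I made up for illustration, not actual measurements:

```python
# Toy sketch of a "stochastic-parrot-ness" score: a weighted average of
# behavioral task scores, each scaled so that 0 ~ pure pattern-matching and
# 1 ~ human-level understanding. Tasks, weights, and numbers are placeholders.

TASK_WEIGHTS = {
    "theory_of_mind": 0.4,      # e.g. false-belief style experiments
    "nuance_application": 0.3,  # acting on nuances, not just restating them
    "context_tracking": 0.3,    # keeping track of who/what/where in a scene
}

def parrot_spectrum_score(task_scores):
    """Position on the [0, 1] spectrum from total stochastic parrot to human."""
    total_weight = sum(TASK_WEIGHTS[name] for name in task_scores)
    weighted = sum(TASK_WEIGHTS[name] * score for name, score in task_scores.items())
    return weighted / total_weight

# Example with made-up numbers (roughly the "~75%" impression above):
print(parrot_spectrum_score(
    {"theory_of_mind": 0.9, "nuance_application": 0.6, "context_tracking": 0.75}
))  # -> 0.765
```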
Holistic means studying at every level. I think that mech interp is very useful for some things, and work like what I am pursuing (GenAI Ethology) is very useful for other things. If there is emergence, it means we can't study the model at only one level of abstraction; we need to do it at multiple levels and combine the insights to be able to even remotely understand and control what's happening. Additionally, I think there are still other levels (and in-between levels) of abstraction that should be pursued (like the footnote on glitch interp, and things like developmental interp but at the behavioral level).