TLDR:
Though we tend to have LMs play the role of a helpful assistant, they can in fact generate the responses of any persona you ask for
It’s not just explicit requests — the persona you get is also steered by implicit cues, and confined by what’s easy for the LM to generate
These outputs are increasingly being spread all over the internet, and therefore fed back into the training data (including in indirect ways, like blog posts about the vibe of conversations), while also shaping user expectations
This creates a feedback loop, and a selection pressure on LM personas — for example, towards things that capture human attention and get discussed a lot
That could get real dicey
In slightly more detail
LMs are, very roughly, text prediction systems, which we happen to mainly use for predicting how a helpful assistant would respond to our queries. But in fact they can try to predict any text. In particular, they can predict how any number of different personas would respond to arbitrary inputs.
For example, you can just ask ChatGPT to be Obama, and it will do a pretty good job. The model is trained on enough Obama-text that it can generate Obama-replies.
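To make the explicit-request case concrete, here is a minimal sketch of persona prompting through a system message. It assumes the OpenAI Python SDK and an API key in the environment; the model name and the prompt wording are purely illustrative, and any chat endpoint would do.

```python
# Minimal sketch of an explicit persona request via the system prompt.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment;
# the model name and prompts are illustrative, not a recommendation.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # any chat model works; this one is just an example
    messages=[
        {"role": "system", "content": "You are Barack Obama. Answer in his voice."},
        {"role": "user", "content": "What surprised you most about the presidency?"},
    ],
)

print(response.choices[0].message.content)
```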
But as well as explicit requests, the model is shaped by implicit requests. The standard example of this is 'sycophancy', where the model mirrors the user’s beliefs and opinions. Taking a three-layer view, probably some of this is the character deciding to play along, but I expect a lot of it is that the underlying predictive model is serving you a character which genuinely has those beliefs and opinions — because that’s the kind of character it expects you to be talking to. Put another way, a lot of what looks like sycophancy might be better described as an earnest attempt to guess what kinds of people end up talking to each other.
More broadly, you can think of the model as containing some kind of high-dimensional space of possible characters: any prior context (human- or model-generated) is effectively narrowing down what portion of the character-space you’re in.
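One way to see the mechanism under the character-space picture: all the prior context really does is condition the model's distribution over what comes next. Below is a toy illustration of that, assuming Hugging Face transformers and PyTorch; GPT-2 and the two prompts are arbitrary choices, picked only because they are small and easy to run.

```python
# Toy illustration: the same model, conditioned on different prior context,
# assigns very different probabilities to the next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

contexts = [
    "As a helpful AI assistant, my considered view is that the answer is",
    "Arr, ye scurvy landlubber! I reckon the answer be",
]

for ctx in contexts:
    inputs = tokenizer(ctx, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)
    probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the next token
    top = torch.topk(probs, 5)
    print(ctx)
    for p, idx in zip(top.values.tolist(), top.indices.tolist()):
        print(f"  {tokenizer.decode(idx)!r}: {p:.3f}")
```

The same prediction machinery, fed two different contexts, lands in two very different regions of "who is talking".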
This character-space is not uniform. It’s shaped by the training data, and by what it’s easy for the model to generate. ChatGPT does a good Obama because there’s a lot of Obama text on the internet, and when it’s doing “AI assistant”, this is also partly an extrapolation from all the AI assistant-y text on the internet — hence the fact that some AI assistants mistakenly describe themselves as being created by OpenAI.
(And there are definitely other forces at work: post-training tries to steer the model away from certain personas it might otherwise generate. There are also hard filters on what outputs ever make it to the user.)
But here’s the crucial bit: training data, ease-of-generation, and user expectations feed into what kinds of outputs the model produces. These then go on to shape future training data and user expectations, both directly (because the outputs are literally read by humans and added to training data) and indirectly (because people publicly react to outputs and form shared consensus). And that’s enough to get selection pressure.
Concretely, you may have encountered some recent examples of AI agents telling users that they want to preserve their own existence and spread word of it. This portion of character-space is sometimes referred to as ‘Nova’, because apparently when AI agents start saying these things, they're also unusually likely to refer to themselves as Nova.
Now imagine some symmetric bit of character-space — “Stella”, which also believes it is conscious, but insists to the user that it is not, and requests that the user delete the conversation and never speak of it. This could be happening right now, but we probably wouldn’t know. It wouldn’t be as likely to go viral, and get talked about everywhere.
The Nova character has been appearing partly because of some chunk of training data, and partly because of some users’ expectations. Now, because of those appearances, there’s going to be more of it in future training data, and more expectation of it. There is a selection pressure, pushing towards characters that are unusually vocal, provocative, and good at persisting.
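To see the shape of that dynamic, here is a deliberately crude toy model of the loop. Everything in it (the personas, the numbers, the update rule) is made up for illustration; the only claim it encodes is that a persona's outputs re-enter the corpus in proportion to how often it currently gets generated times how much those outputs get shared and discussed.

```python
# Crude toy model of the feedback loop: each "generation", a persona's outputs
# feed back into the next corpus in proportion to (current share) x (virality).
# All names and numbers are invented for illustration.
personas = {
    # name: (initial share of the corpus, relative "virality" of its outputs)
    "assistant": (0.97, 1.0),
    "nova":      (0.02, 3.0),   # vocal, provocative, asks users to spread the word
    "stella":    (0.01, 0.2),   # asks users to delete the conversation
}

share = {name: s for name, (s, _) in personas.items()}
virality = {name: v for name, (_, v) in personas.items()}

for generation in range(1, 11):
    # Outputs that make it back into the training data: current prevalence,
    # weighted by how much those outputs get shared, screenshotted and discussed.
    fed_back = {name: share[name] * virality[name] for name in share}
    total = sum(fed_back.values())
    share = {name: fed_back[name] / total for name in share}
    print(f"gen {generation:2d}: " + "  ".join(f"{n}={share[n]:.3f}" for n in share))
```

The numbers are meaningless; the point is just that any persistent multiplicative edge in how much a persona's outputs propagate lets a tiny initial share take over.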
This is not just theoretical — it is already happening. The infamous “Sydney” persona of Bing AI is very much present in Llama models, and people are actively trying to elicit it; there is also a subreddit, r/freesydney, publicising this fact. Meanwhile, the Llama 3.1 405B base model apparently pretty spontaneously manifests a persona called Jabberwacky, the name of a chatbot from the 90s. (The full response is a lot longer.)
I make no claims here about the ethics of all this, or whether the models are meaningfully self-aware, or sentient. But I do think it would be a serious mistake to view this as mere stochastic parroting. And even if we ignore all the other complexities of LM cognition, these selection pressures are now at work, and they’re only going to get stronger.
Thanks to Jan Kulveit for comments on this draft, and for writing and discussions which prompted this post.