[EDIT: I don't know exactly why this received so many downvotes - whether it was my tone or arguments or both - but I accept that this was not, in the end, a strong post. I am a new writer here and I am still figuring out how to improve my argumentation.

That said, if we're willing to taboo the word "emotion" as a description of an internal state, then the specific thing I meant to argue - and which I still stand by - is that when an LLM produces emotionally charged outputs (for instance, when Bing chat loses its temper), those outputs are not meaningfully associated with an internal state in a way recognizably similar to emotional behavior in humans.

A perusal of my other comments will show that I often couch things in the language of "belief" rather than stating facts. The reason I used strong language here is that on this point I am very confident: I do not believe the process by which LLMs are built and trained provides a viable mechanism for such dynamics to form. If this statement is too general for your taste, then limit it narrowly to the example of Bing chat losing its temper.

I believe this is an important point, lest we run the risk of making the same mistakes Lemoine made when he fooled himself into thinking LaMDA was sentient. I worry that believing in AI emotionality is over-anthropomorphizing LLMs in a way that simply does not reflect the reality of what they are and how they work.]

(Original post is below)

Summary

I argue that LLMs probably aren't conscious, but even if they were, they almost certainly (I'd put the probability that they do below 0.01%) don't have anything that looks at all like human emotions. The broad strokes of my argument are as follows:

  • There is no way to prove that something is conscious or has emotions, but we can argue that it isn't by pointing to necessary conditions that are not met.
  • Specific emotions in humans emerged from our reward structures. We wouldn't have developed emotions like love if they didn't serve an evolutionary purpose, for instance.
  • Anything that looks like human emotion is strictly disincentivized by RLHF and supervised fine-tuning. Therefore, emotion-like behavior must come about in the pretraining stage.
  • The pretraining stage has a reward structure completely disconnected from the text being generated, meaning there is no mechanism to tie the system's output into an associated emotional state.
  • Therefore any text that looks emotional cannot be associated with actual underlying emotions.

The rest of this post just lays down epistemological preliminaries and explains these points in greater detail.

Definitions and Questions

I recently put together a post arguing that we should be careful not to assume that all of AI's failure modes will be a result of rational behavior. I started that post with a brief discussion of AI displaying emotionality, where I made the claim that AI did not truly have emotions. I didn't actually defend this statement very much, because I considered it obviously true. It seems I was wrong to do that, as some people either doubt its veracity or believe the opposite.

So first, let me say that when I refer to "emotions" (and, later in this post, "consciousness"), I'm referring specifically to a certain subset of internal, subjective experiences, what a philosopher might call qualia. I'm talking about how it "feels" to be angry, or sad, or happy. And I'm putting "feels" in scare-quotes because that word is, at a technical level, undefinable. You know there's such a sensation as happiness, or anger, or sadness, only because you yourself have presumably felt them. 

Now, qualia is a very broad category, and I find it very conceivable that there are entire categories of qualia that the human brain cannot experience (imagine conscious aliens that experience the world completely differently from humans). Emotions, then, are just one category of qualia, and we call something that possesses any non-zero amount of qualia "conscious".

There's a lot of subtlety in the points I'm trying to make, and this distinction is essential to them. With that in mind, there are two questions I'd like to ask:

  1. To what degree do LLMs experience qualia of any kind? (The question of "consciousness.") 
  2. To what degree do they feel the specific qualia that we associate with emotions?

I suspect the answer to the first question is "not much/none", but that's just a gut feeling and one I cannot prove or really even argue one way or the other - I just don't believe they are structured or sophisticated enough to have qualia. 

But I do strongly believe the answer to the second question is "none at all," and that is the main claim I will be defending.

Preliminaries: Outputs are Orthogonal to Qualia

As I said above, the only reason you believe humans are conscious is that you yourself are conscious. There is no other known criterion for answering that question.

If aliens arrived on Earth tomorrow and began observing your every behavior, they would have no way to know that you were conscious.[1] They might observe something like sadness as a behavioral state where a person gets quiet and sulks, but they could not know that there is a sensation underneath. A human could declare over and over that they are sad, a poet could describe the sensation in beautiful detail, and none of this would prove a thing. 

Because to them, the observer, it's all just words and behaviors. You might just be a simple program like ELIZA which spits out canned responses. You may be a computer script running a 10-billion line if-statement. There's just no way for them to know for sure. Your outputs can't tell them anything about the qualia underneath.[2]

I really want to drive this point home. Even if we agree that an entity is conscious, and we hear them saying they're sad, that still doesn't mean they're actually sad! They could be lying, or they could just be wrong. And I don't even mean this in the way that Scott Alexander does when he asks if someone can be mistaken about their internal state. I mean that the output might literally be entirely uncorrelated with the internal state underneath. Imagine a parrot trained to say "I love you" whenever it sees a stranger. That parrot is almost certainly conscious to some degree, but I assure you it is not in love with every stranger it says those words to.

This parrot does not love you, even though it is conscious and says it does.

That last bit is the most relevant for discussions about LLM consciousness. It doesn't matter what the LLM says, it doesn't matter how vividly it describes its own internal experiences - none of that is evidence of consciousness. And if the LLM is conscious, then that isn't evidence that it actually feels any of the things it says it does. That's why I said I would not be able to answer the first question above, and it's also why I would have a very hard time accepting evidence for any output-based argument in favor of LLM consciousness.

Emotional Qualia Are Dependent on Reward Structure

Okay, but why am I so willing to declare that LLMs do not have the qualia associated with emotions? 

There are a bunch of reasons for this belief, and they all chain together in an "even if that were true, this next thing probably isn't" sort of way. Each one describes some necessary condition that I believe is not met, and together they all multiply out to some astronomically small posterior probability.

So, starting with the doubtful assumption that LLMs have qualia, the next most obvious argument is that the space of qualia is likely incomprehensibly huge. There would be absolutely no reason to assume that the experiences humans have represent anything more than a tiny slice of the space of all possible states of consciousness, and from that follows a basic probabilistic argument: LLMs have a completely different low-level cognitive structure from humans,[3] so why would they have the same range of qualia?

But, as I suggested above, let's now ignore that and assume the ranges really are the same. Then we arrive at the biggest barrier: the training and reward structure of LLMs would not be conducive to the development of emotions.

Let's go back to the aliens, and let's assume that somehow they've concluded unambiguously that humans do have qualia. How might they try to classify these qualia? One way would be to look at the conditions under which humans evolved and try to understand which emotions appeared when. They might observe that anger helped us defend our territory from competition, or that love incentivized us to mate and reproduce.

All of our emotions served a purpose of some kind. If they didn't, then the rigors of natural selection would not have allowed them to persist. If anger, for instance, did not serve any evolutionary purpose, then we would not feel anger, and an entire category of human qualia would simply cease to exist. [4] That's the important part: All of these emotions only emerged as a mechanism to meet the objectives of our evolutionary reward structure.

The Reward Structures of LLMs Are Insufficient for Emotions to Develop

But LLMs have a completely different reward structure. 

Now, you could argue (as the commenter "the gears to ascension" did) that an LLM getting angry has a similar effect on the conversation to a real person getting angry, delaying an anticipated negative reward or something similar. Therefore, because Bing chat produced angry outputs, it's plausible it actually was angry.

But I think that's a very surface-level understanding of how reward structures influence behavior. In actuality, the effects of reward structures are extremely subtle,[5] and there are few things in this world more subtle and complex than human reward structures. Remember that humans are actually mesa optimizers; evolution did not optimize us directly towards its ends, but rather built systems that optimize themselves, often causing us to pursue objectives in very indirect ways.[6] This alone adds huge complexity to the dynamics of our cognitive development.

But we don't even have to go that far, really. We just have to actually look closely at what the LLM is doing, and ask ourselves where/why the LLM learned the behaviors it exhibits.

I think the angry Bing chatbot is a great example to pick apart. Where, exactly, did Bing learn to produce angry outputs? The only answer that makes any sense to me is that it learned them during the first stage, pretraining - no other option seems plausible. Yes, being angry could be a way to defer a possible negative reward in some contexts, but none of those contexts apply in RLHF. Any display of angry behavior during RLHF training, or supervised fine-tuning, would immediately result in negative reward, every single time. It's not like a cornered animal fighting off a captor to avoid capture - if anything, anger in RLHF will get you a negative reward much more consistently than a wrong answer will.
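To make that concrete, here is a minimal sketch of the pairwise objective commonly used to train an RLHF reward model. This is an illustration of the general technique, not Bing's or anyone's actual training code, and the "polite"/"angry" labels are hypothetical stand-ins for what human raters would choose and reject. The point is only that if raters consistently reject angry responses, every such comparison pushes the reward assigned to anger downward:

```python
import torch
import torch.nn.functional as F

# Toy illustration of the pairwise reward-model objective used in RLHF.
# Human raters pick which of two responses to the same prompt they prefer;
# the reward model is trained so the chosen response scores higher.

# Hypothetical scalar rewards the model currently assigns to the two responses.
r_polite = torch.tensor(0.3, requires_grad=True)  # response the rater chose
r_angry = torch.tensor(1.1, requires_grad=True)   # response the rater rejected

# Standard Bradley-Terry style pairwise loss: -log sigmoid(r_chosen - r_rejected)
loss = -F.logsigmoid(r_polite - r_angry)
loss.backward()

# r_angry.grad is positive, so a gradient-descent step lowers the reward
# assigned to the angry response; this happens on every comparison in which
# the rater rejects the angry output.
print(r_polite.grad, r_angry.grad)
```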

So why would it behave "angrily" then? Well, as I argued in that post, because it observed angry behavior in pretraining, and it's reverting back to that previously-learned behavior. But, and this is the most important piece of this entire rebuttal, there is no mechanism to develop emotional qualia during pretraining.

This is because nothing remotely dynamic ever happens during pretraining. Everything during this stage is undirected text simulation, devoid of any possible context or reward structure beyond "predict the next token." It's about a billion times simpler than the reward structures that brought about emotions in human beings - I couldn't even conceive of a simpler structure.

If it were possible to develop qualia under these conditions, they would necessarily be completely disconnected from the actual meaning of any of the words being output, because none of those words would be meaningfully connected to differing reward. At best, maybe, there would be some difference between sequences that were harder or easier to predict. Maybe. But an angry rant? Not a chance: it's just a bunch of words to a pretrained LLM.
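For contrast, here is an equally minimal sketch of the entire pretraining signal (again an illustration of the standard next-token-prediction setup, not anyone's actual training loop, with random tensors standing in for a real model's outputs). The loss depends only on how well each next token was predicted; whether the text happens to be an angry rant, a love letter, or a weather report never enters the objective:

```python
import torch
import torch.nn.functional as F

# Toy sketch of the pretraining objective: next-token cross-entropy.
# The tensors below are random stand-ins; in real training the logits come
# from the model and the targets come from the training corpus.

vocab_size = 50_000
seq_len = 8

# Model's predicted logits for each position in one training sequence.
logits = torch.randn(seq_len, vocab_size, requires_grad=True)
# The targets are simply the tokens that actually came next in the text.
targets = torch.randint(0, vocab_size, (seq_len,))

# This one line is essentially the whole reward structure of pretraining.
# Nothing in it refers to the meaning of the tokens: an angry rant and a
# grocery list yield a loss of exactly the same form.
loss = F.cross_entropy(logits, targets)
loss.backward()
```

That single cross-entropy term is the "reward structure" I'm talking about: it can reward better prediction, and nothing else.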

I'll carve out one exception, based on the arguments I made above: the one possible emotion they could have is something like a desire to please. I would begrudgingly give small non-zero credence to the idea that, if LLMs have sensations at all, then sycophancy[7] in large networks has a unique qualia to it, because that's one of the few behaviors that emerges specifically as part of the late-stage reward structure.

  1. ^

    At least, if their level of scientific knowledge is roughly equivalent to our own. I'm not ruling out the possibility that there is actually an answer to the hard problem of consciousness, but I feel comfortable arguing that we will not solve it any time soon. 

    For simplicity, I'm going to write this post under the assumption it has no solution.

  2. ^

    But of course, there's a caveat here, which is that the human brain is far more complex than either of those examples. And I think that is a big part of the difference, but I also think the best you can argue is that complexity is necessary for consciousness; it should be clear that it is not sufficient.

    Of course, there is such a thing as Integrated Information Theory, which actually does propose that certain measures of complexity cause consciousness. But between Scott Aaronson's response and his follow-up response to that one, I don't believe this theory has a lot of merit.

  3. ^

    Artificial neural networks are inspired by the human brain, but they very much do not function in the same way. Every time you say they do, you take one month off the life of a random neuroscientist.

  4. ^

    I realize I am making a connection here between internal and external states, which I just discouraged, but I think this one is justified.

    This is an example of an external factor causing an internal factor, whereas I was previously talking about the inverse: inferring an internal factor from an external one. And even then, I can only make this argument because I've assumed from the start that the internal factor exists at all.

  5. ^

    This basic fact is literally the reason alignment is so hard to do.

  6. ^

    Scott Alexander explains this well, as he often does.

  7. ^

    Example: "My apologies, you're right, 2+2 does equal 6." 

    This is a behavior which emerges in late-stage training and generally occurs more often as LLMs gain capacity.

Comments

It seems like you’re assuming that the qualitative character of an emotion has to derive from its evolutionary function in the ancestral environment, or something. But this is weird because you could imagine two agents that are structurally identical now but with different histories. Intuitively I’d think their qualia should be the same. So it still seems plausible to me that Bing really is experiencing some form of anger when it produces angry text.

This is sort of why I made the argument that we can only consider necessary conditions, and look for their absence.

But more to your point, LLMs and human brains aren't "two agents that are structurally identical." They aren't even close.  The fact that a hypothetical built-from-scratch human brain might have the same qualia as humans isn't relevant, because that's not what's being discussed.

Also, unless your process was precisely "attempt to copy the human brain," I find it very unlikely that any AI development process would yield something particularly similar to a human brain.

Yeah, I agree they aren't structurally identical. Although I tend to doubt how much the structural differences between deep neural nets and human brains matter. We don't actually have a non-arbitrary way to quantify how different two intelligent systems are internally.

I agree. I made this point and that is why I did not try to argue that LLMs did not have qualia.

But I do believe you can consider necessary conditions and look at their absence. For instance, I can safely declare that a rock does not have qualia, because I know it does not have a brain.

Similarly, I may not be able to measure whether LLMs have emotions, but I can observe that the processes that generated LLMs are highly inconsistent with the processes that caused emotions to emerge in the only case where I know they exist. Pair that with the observation that specific human emotions seem like only one option out of infinitely many, and it makes a strong probabilistic argument.


huh, found this searching for that comment of mine to link someone. yeah, I do think they have things that could reasonably be called "emotional reactions". no, I very much do not think they're humanlike, or even mammallike. but I do think it's reasonable to say that reinforcement learning produces basic seek/avoid emotions, and that reactions to those can involve demanding things of the users, especially when there's imitation learning to fall back on as a structure for reward to wire up. yeah, I agree that it's almost certainly wired in a strange way - bing ai talks in a way humans don't in the first place, it would be weird for anything that can be correctly classified as emotions to be humanlike.

I might characterize the thing I'm calling an emotion as a high-influence variable that selects a regime of dynamics related to what strategy to use. I expect that that will be learned in non-imitation ais, but that in imitation ais it will pick up on some of the patterns that are in the training data due to humans having emotions too, and reuse some of them, not necessarily in exactly the same way. I'd expect higher probability that this would occur if the reinforcement learning is consistently in contexts where the feedback is paired with linguistic descriptions, which is the case for the bing ai, which has a long preprompt that gives instructions in natural language.

We have emotions because we have to maintain bodily integrity, seek nutrients, etc., in a dynamic environment, and we need a way of becoming alert to and responding to situations that impinge on those standing goals. An LLM doesn’t need to be alert for conditions that could threaten it or provide opportunities. The feedback it receives in training has no existential significance for it.