I expected a quite different argument for empathy:
1. Argument from simulation: the most important part of our environment is other people; people are very complex and hard to predict; fortunately, we have hardware that is extremely good at 'simulating a human' - our individual brains. To guess what another person will do, or why they are doing what they are doing, it seems clearly computationally efficient to just simulate their cognition on my own brain. Fortunately for empathy, simulations activate some of the same proprioceptive machinery and goal-modeling subagents, so the simulation leads to similar feelings.
2. Mirror neurons: it seems we have a powerful dedicated system for imitation learning, which is extremely advantageous for overcoming the genetic bottleneck. Mirroring activation patterns leads to empathy.
I added a footnote at the top clarifying that I’m disputing that the prosocial motivation aspect of “empathy” happens for free. I don’t dispute that (what I call) “empathetic simulations” are useful and happen by default.
A lot of claims under the umbrella of “mirror neurons” are IMO pretty sketchy, see my post Quick notes on “mirror neurons”.
You can make an argument: “If I’m thinking about what someone else might do and feel in situation X by analogy to what I might do and feel in situation X, and then if situation X is unpleasant then that simulation will be unpleasant, and I’ll get a generally unpleasant feeling by doing that.” But you can equally well make an argument: “If I’m thinking about how to pick up tofu with a fork, I might analogize to how I might pick up feta with a fork, and so if tofu is yummy then I’ll get a yummy vibe and I’ll wind up feeling that feta is yummy too.” The second argument is counter to common sense; we are smart enough to draw analogies between situations while still being aware of differences between those same situations, and allowing those differences to control our overall feelings and assessments. That’s the point I was trying to make here.
tl;dr
Section 1 presents an argument that I’ve heard from a couple people, that says that empathy[1] happens “for free” as a side-effect of the general architecture of mammalian brains, basically because we tend to have similar feelings about similar situations, and “me being happy” is a kinda similar situation to “someone else being happy”, and thus if I find the former motivating then I’ll tend to find the latter motivating too, other things equal.
Section 2 argues that those two situations really aren’t that similar in the grand scheme of things, and that our brains are very much capable of assigning entirely different feelings to pairs of situations even when those situations have some similarities. This happens all the time, and I illustrate my point via the everyday example of having different opinions about tofu versus feta.
Section 3 acknowledges a couple kernels of truth in the Section 1 story, just to be clear about what I’m agreeing and disagreeing with.
1. What am I arguing against?
Here’s Beren Millidge (@beren), “Empathy as a natural consequence of learnt reward models” (2023):
Likewise, I think @Marc Carauleanu has made similar claims (e.g. here, here), citing (among other things) the “perception-action model for empathy”, if I understood him right.
Anyway, this line of thinking seems to me to be flawed—like, really obviously flawed. I’ll try to spell out why I think that in the next section, and then circle back to the kernels of truth at the end.
2. Why I don’t buy it
2.1 Tofu versus feta part 1: the common-sense argument
Tofu and feta are similar in some ways, and different in other ways. Let’s make a table!
OK, next, let’s compare “me eating tofu” with “my friend Ahmed eating tofu”. Again, they’re similar in some ways and different in other ways:
Now, one could make an argument, in parallel with the excerpt at the top, that tofu and feta have some similarities, and so they wind up in a similar part of the latent space, and so the learnt reward model will assign positive or negative value in a way that spills over from one to the other.
But that argument is obviously wrong! That’s not what happens! Nobody in their right mind would like feta merely because they like tofu and the two have some similarities, such that their feelings about tofu spill over into their feelings about feta. Quite the contrary, an adult’s feelings about tofu have no direct causal relation at all with their feelings about feta. We, being competent adults, recognize that they are two different foods, about which we independently form two different sets of feelings. It’s not like we find ourselves getting confused here.
So by the same token, in the absence of any specific evolved empathy-related mechanism, our strong assumption should be that an adult’s feelings (positive, negative, or neutral) about themselves eating tofu versus somebody else eating tofu should have no direct causal relation at all. They’re really different situations! Nobody in their right mind would ever get confused about which is which!
And the same applies to myself-being-happy versus Ahmed-being-happy, and so on.
2.2 Tofu versus feta part 2: The algorithm argument
Start with the tofu versus feta example:
The latent space that Beren is talking about needs to be sufficiently fine-grained to enable good understanding of the world and good predictions. Thus, given that tofu versus feta have lots of distinct consequences and implications, the learning algorithm needs to separate them in the latent space sufficiently to allow for them to map into different world-model consequences and associations. And indeed, that’s what happens: it’s vanishingly rare for an adult of sound mind to get confused between tofu and feta in the middle of a conversation.
Next, the “reward model” is a map from this latent space to a scalar value. And again, there’s a learning algorithm sculpting this reward model to “notice” “edges” where different parts of the latent space have different reward-related consequences. If every time I eat tofu, it tastes bad, and every time I eat feta, it tastes good, then the learning algorithm will sculpt the reward model to assign a high value to feta and a low value to tofu.
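To make the two steps above concrete, here is a minimal toy sketch. It is not Beren's model or a claim about real brains: the three latent features, the linear reward model, and the delta-rule update are all assumptions chosen for illustration. Tofu and feta share one feature (so they overlap in the latent space) and each has one distinguishing feature.

```python
# Hypothetical latent features: [soft_white_cube, soy_based, dairy_based]
TOFU = [1.0, 1.0, 0.0]
FETA = [1.0, 0.0, 1.0]


def predict(weights, features):
    """Toy reward model: a linear map from latent features to a scalar value."""
    return sum(w * x for w, x in zip(weights, features))


def train(episodes, lr=0.1, epochs=200):
    """Delta-rule learning: nudge the weights toward the experienced reward."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for features, reward in episodes:
            error = reward - predict(weights, features)
            weights = [w + lr * error * x for w, x in zip(weights, features)]
    return weights


# Experience assumed in the text: tofu tastes bad (-1), feta tastes good (+1).
weights = train([(TOFU, -1.0), (FETA, +1.0)])
tofu_value = predict(weights, TOFU)
feta_value = predict(weights, FETA)
print(round(tofu_value, 2), round(feta_value, 2))
```

In this sketch the learned values should approach -1 for tofu and +1 for feta, with the weight on the shared feature shrinking toward zero: the distinguishing features are enough for the learner to place an "edge" between the two foods despite their overlap.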
So far this is all common sense, I hope. Now let’s flip to the other case:
The case of me-eating-tofu versus Ahmed-eating-tofu:
All the reasoning above goes through in the same way.
Again, the latent space needs to be sufficiently fine-grained to enable good understanding of the world and good predictions. Thus, given that me-eating-tofu versus Ahmed-eating-tofu have lots of distinct consequences and implications, the learning algorithm needs to separate them in the latent space sufficiently to allow for them to map into different world-model consequences and associations. And indeed, no adult of sound mind would get confused between one and the other.
Next, the “reward model” is a map from this latent space to a scalar value. And again, there’s a learning algorithm sculpting this reward model to “notice” “edges” where different parts of the latent space have different reward-related consequences. If every time I eat tofu, it tastes yummy and fills me up (thanks to my innate drives / primary rewards), and if every time Ahmed eats tofu, it doesn’t taste like anything, and doesn’t fill me up, and hence doesn’t trigger those innate drives, then the learning algorithm will sculpt the reward model to assign a high value to myself-eating-tofu and not to Ahmed-eating-tofu.
And again, the same story applies equally well to myself-being-comfortable versus Ahmed-being-comfortable, etc.
3. Kernels of truth in the original story
3.1 By default, we can expect transient spillover empathy … before within-lifetime learning promptly eliminates it
If a kid really likes tofu, and has never seen or heard of feta before, then the first time they see feta they might well have general good feelings about it, because they’re mentally associating it with tofu.
This default basically stops mattering at the same moment that they take their first bite of feta. In fact, it can largely stop mattering even before they taste or smell it—it can stop mattering as soon as someone tells the kid that it’s not in fact tofu but rather an unrelated food of a similar color.
But still. It is a default, and it does have nonzero effects.
So by the same token, one might imagine that, in very early childhood, a baby who likes to be hugged might mentally lump together me-getting-hugged with someone-else-getting-hugged, and thereby have positive feelings about the latter. This is a “mistake” from the perspective of the learning algorithm for the reward model, in the sense that a hug has high value because (let us suppose) it involves affective touch inputs that trigger primary reward via some innate drive in the brainstem, and somebody else getting hugged will not trigger that primary reward. Thus, this “mistake” won’t last. The learnt reward model will update itself. But still, this “mistake” will plausibly happen for at least one moment of one day in very early childhood.
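The transient-spillover story can be sketched with a toy learner. Everything here is an illustrative assumption rather than a model of a real brain: the three features, the linear reward model, the delta-rule update, and the reward numbers (1 for being hugged oneself, 0 for watching someone else get hugged).

```python
# Hypothetical latent features: [hug, self_is_recipient, other_is_recipient]
SELF_HUG = [1.0, 1.0, 0.0]
OTHER_HUG = [1.0, 0.0, 1.0]


def value(weights, features):
    """Toy learnt reward model: linear map from latent features to a scalar."""
    return sum(w * x for w, x in zip(weights, features))


def update(weights, features, reward, lr=0.1):
    """One delta-rule step toward the primary reward actually received."""
    error = reward - value(weights, features)
    return [w + lr * error * x for w, x in zip(weights, features)]


# Phase 1: the baby has only ever experienced being hugged (primary reward 1).
weights = [0.0, 0.0, 0.0]
for _ in range(200):
    weights = update(weights, SELF_HUG, 1.0)
spillover = value(weights, OTHER_HUG)  # positive: value leaks via the shared feature

# Phase 2: the baby also sees others get hugged, which triggers no primary reward.
for _ in range(200):
    weights = update(weights, SELF_HUG, 1.0)
    weights = update(weights, OTHER_HUG, 0.0)
corrected = value(weights, OTHER_HUG)
self_value = value(weights, SELF_HUG)
print(round(spillover, 2), round(corrected, 2), round(self_value, 2))
```

In this sketch, someone-else-getting-hugged starts with a positive value purely because it shares the "hug" feature, and that value should decay toward zero once the learner gets feedback on it, while the value of me-getting-hugged stays near 1.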
Is that fact important? I don’t think so! But still, it’s a kernel of truth in the story at the top.
(Unless, of course, there’s a specific evolved mechanism that prevents the learnt reward model from getting updated in a way that “corrects” the spillover. If that’s the hypothesis, then sure, let’s talk about it! But let’s focus the discussion on what exactly that specific evolved mechanism is! Incidentally, when I pushed back in the comments section of Beren’s post, his response was, I think, generally in this category, but a bit vague.)
3.2 The semantic overlap is stable by default, even if the motivational overlap (from reward model spillover) isn’t
Compare the neurons that activate when I think about myself-eating-tofu, versus when I think about Ahmed-eating-tofu. There are definitely differences, as I argued above, and I claim that these differences are more than sufficient to allow the reward model to fire in a completely different way for one versus the other. But at the same time, there are overlaps in those neurons. For example, both sets of neurons probably include some neurons in my temporal lobe that encode the idea of tofu and all of its associations and implications.
By the same token, compare the neurons that activate when I myself feel happy, versus when I think about Ahmed-being-happy. There are definitely differences! But there’s definitely some overlap too.
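As a rough illustration of how overlap and divergent reward can coexist, the sketch below measures overlap as cosine similarity between two toy activation vectors and applies a hand-picked linear reward readout. The features and the weights are hypothetical, chosen only to show that the two properties are compatible, not learned from anything.

```python
import math

# Hypothetical activation patterns: [tofu_concept, self_is_eater, ahmed_is_eater]
SELF_TOFU = [1.0, 1.0, 0.0]
AHMED_TOFU = [1.0, 0.0, 1.0]

# Hand-picked (assumed) reward weights: only the self-specific feature carries value.
REWARD_WEIGHTS = [0.0, 1.0, 0.0]


def cosine(a, b):
    """Cosine similarity, used here as a crude stand-in for neural overlap."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def reward_value(features):
    """Linear readout from the activation pattern to a scalar value."""
    return sum(w * x for w, x in zip(REWARD_WEIGHTS, features))


overlap = cosine(SELF_TOFU, AHMED_TOFU)
self_value = reward_value(SELF_TOFU)
ahmed_value = reward_value(AHMED_TOFU)
print(round(overlap, 2), self_value, ahmed_value)
```

Here the two patterns overlap substantially (the shared "tofu" feature), yet the readout assigns them entirely different values because it keys on the features that differ.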
The point of this post is to argue that this overlap doesn’t give us any empathy by itself, because the direct motivational component (from spillover in the learnt reward model) doesn’t even last five minutes, let alone a lifetime. But still, the overlap exists. And I think it’s plausible that this overlap is an ingredient in one or more specific evolved mechanisms that lead to our various prosocial and antisocial instincts. What are those mechanisms? I have ideas! But that’s outside of the scope of this post. More on that in the near future, hopefully.
[1] The word “empathy” typically conveys a strongly positive, prosocial vibe, and that’s how I’m using that word in this post. Thus, for example, if Alice is very good at “putting herself in someone else’s shoes” in order to more effectively capture, imprison, and torture that someone, that’s NOT usually taken as evidence that Alice is a very “empathetic” person! (More discussion here.) If you strip away all those prosocial connotations, you get what I call “empathetic simulation”, a mental operation that can come along with any motivation, or none at all. I definitely believe in “empathetic simulation by default”, see §3.2 at the end.
Steve interjection: What Beren calls “learnt reward model” is more-or-less equivalent to what I call “valence guess”; see for example this diagram. I’ll use Beren’s terminology for this post.
Steve interjection: The word “misgeneralization” is typically used in a specific way in AI alignment (cf. here, here), which isn’t a perfect match to how Beren is using it here, so in the rest of the post I’ll talk instead about value “spillover” from one thing to another.