Empathy as a natural consequence of learnt reward models

beren

Epistemic Status: Pretty speculative but built on scientific literature. This post builds off my previous post on learnt reward models. Crossposted from my personal blog.

Empathy, the ability to feel another's pain or to 'put yourself in their shoes', is often considered to be a fundamental human cognitive ability, and one that undergirds our social abilities and moral intuitions. As so much of humanity's success and dominance as a species comes down to our superior social organization, empathy has played a vital role in our history. Whether we can build artificial empathy into AI systems also has clear relevance to AI alignment. If we can create empathic AIs, then it may become easier to make an AI be receptive to human values, even if humans can no longer completely control it. Such an AI seems unlikely to just callously wipe out all humans to make a few more paperclips. Empathy is not a silver bullet, however. Although (most) humans have empathy, human history is still in large part a history of us waging war against each other, and there are plenty of examples of humans and other animals perpetrating terrible cruelty on enemies and outgroups.

A reasonable literature has grown up in psychology, cognitive science, and neuroscience studying the neural bases of empathy and its associated cognitive processes. We now know a fair amount about the brain regions involved in empathy, what kind of tasks can reliably elicit it, how individual differences in empathy work, as well as the neuroscience underlying disorders such as psychopathy, autism, and alexithymia which result in impaired empathic processing. However, much of this research does not grapple with the fundamental question of why we possess empathy at all. Typically, it seems to be tacitly assumed that, due to its apparent complexity, empathy must be some special cognitive module which has evolved separately and deliberately due to its fitness benefits. From an evolutionary theory perspective, empathy is often assumed to have evolved because of its adaptive function in promoting reciprocal altruism. The story goes that animals that are altruistic, at least in certain cases, tend to get their altruism reciprocated and may thus tend to out-reproduce other animals that are purely selfish. This would be of especial importance in social species where being able to form coalitions of likeminded and reciprocating individuals is key to obtaining power and hence reproductive opportunities. If they could, such coalitions would obviously not include purely selfish animals who never reciprocated any benefits they received from other group members. Nobody wants to be in a coalition with an obviously selfish freerider.

Here, I want to argue a different case. Namely that the basic cognitive phenomenon of empathy -- that of feeling and responding to the emotions of others as if they were your own -- is not a special cognitive ability which had to be evolved for its social benefit, but instead is a natural consequence of our (mammalian) cognitive architecture and therefore arises by default. Of course, given this base empathic capability, evolution can expand, develop, and contextualize our natural empathic responses to improve fitness. In many cases, however, evolution actually reduces our native empathic capacity -- for instance, we can contextualize our natural empathy to exclude outgroup members and rivals.

The idea is that empathy fundamentally arises from using learnt reward models to mediate between a low-dimensional set of primary rewards and reinforcers and the high dimensional latent state of an unsupervised world model. In the brain, much of the cortex is thought to be randomly initialized and implements a general purpose unsupervised (or self-supervised) learning algorithm such as predictive coding to build up a general purpose world model of its sensory input. By contrast, the reward signals to the brain are very low dimensional (if not, perhaps, scalar). There is thus a fearsome translation problem that the brain needs to solve: learning to map the high dimensional cortical latent space into a predicted reward value. Due to the high dimensionality of the latent space, we cannot hope to actually experience the reward for every possible state. Instead, we need to learn a reward model that can generalize to unseen states. Possessing such a reward model is crucial for learning values (i.e. long term expected rewards), for predicting future rewards from the current state, and for performing model based planning where we need the ability to query the reward function at hypothetical imagined states generated during the planning process. We can think of such a reward model as just performing a simple supervised learning task: given a dataset of cortical latent states and realized rewards (given the experience of the agent), predict what the reward will be in some other, non-experienced cortical latent state.
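To make the supervised-learning framing concrete, here is a minimal sketch. Everything in it is an illustrative assumption rather than a claim about the brain: the 4-dimensional "latent states", the linear form of the reward model, and the hypothetical ground-truth rule that only the first latent feature drives primary reward.

```python
import random

def fit_linear_reward_model(states, rewards, lr=0.1, epochs=2000):
    """Fit r_hat(z) = w . z + b by batch gradient descent on squared error."""
    dim = len(states[0])
    n = len(states)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for z, r in zip(states, rewards):
            err = sum(wi * zi for wi, zi in zip(w, z)) + b - r
            for i in range(dim):
                grad_w[i] += err * z[i] / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict_reward(model, z):
    """Query the learnt reward model at an arbitrary (possibly imagined) latent state."""
    w, b = model
    return sum(wi * zi for wi, zi in zip(w, z)) + b

def true_reward(z):
    # Hypothetical primary reward: depends only on latent feature 0.
    return 2.0 * z[0]

rng = random.Random(0)
# Dataset of actually-experienced latent states and their realized rewards.
experienced = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(50)]
rewards = [true_reward(z) for z in experienced]

model = fit_linear_reward_model(experienced, rewards)

# A state the agent has never experienced, e.g. one imagined during planning.
imagined = [0.5, -0.3, 0.8, 0.1]
predicted = predict_reward(model, imagined)
print(predicted, true_reward(imagined))
```

The point of the sketch is only that the planner never needs the true reward at the imagined state; it queries the learnt model instead, and the quality of that answer depends entirely on how the model generalizes.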

The key idea that leads to empathy is the fact that, if the world model performs a sensible compression of its input data and learns a useful set of natural abstractions, then it is quite likely that the latent codes for the agent performing some action or experiencing some state, and another, similar, agent performing the same action or experiencing the same state, will end up close together in the latent space. If the agent's world model contains natural abstractions for the action, which are invariant to who is performing it, then a large amount of the latent code is likely to be the same between the two cases. If this is the case, then the reward model might 'mis-generalize' to assign reward to another agent performing the action or experiencing the state rather than the agent itself. This should be expected to occur whenever the reward model generalizes smoothly and the latent space codes for the agent and another are very close in the latent space. This is basically 'proto-empathy' since an agent, even if its reward function is purely selfish, can end up assigning reward (positive or negative) to the states of another due to the generalization abilities of the learnt reward function ^[1].
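A toy illustration of this misgeneralization, under loudly stated assumptions: the latent layout (two shared "what is happening" features plus a small identity tag), the kernel-regression reward model, and all numeric values are hypothetical choices made for the sketch, not anything measured.

```python
import math

def make_kernel_reward_model(states, rewards, bandwidth=0.5):
    """Nadaraya-Watson kernel regression: a smooth, similarity-weighted
    average of the rewards the agent has actually experienced."""
    def predict(z):
        weights = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(z, s)) / (2 * bandwidth ** 2))
            for s in states
        ]
        return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)
    return predict

def latent(pain, pleasure, who):
    # Hypothetical latent code: the action/state abstraction is shared across
    # agents; only a small identity component distinguishes self from other.
    identity = [0.3, 0.0] if who == "self" else [0.0, 0.3]
    return [pain, pleasure] + identity

# The agent only ever receives reward for its OWN states (purely selfish reward).
train_states = [latent(1, 0, "self"), latent(0, 1, "self"), latent(0, 0, "self")]
train_rewards = [-1.0, 1.0, 0.0]
reward_model = make_kernel_reward_model(train_states, train_rewards)

# Query the same model on states of ANOTHER agent, which it was never trained on.
other_in_pain = reward_model(latent(1, 0, "other"))
other_pleased = reward_model(latent(0, 1, "other"))
print(other_in_pain, other_pleased)
```

Because most of the latent code is shared between the self and other cases, the selfishly trained model assigns clearly negative reward to another's pain and clearly positive reward to another's pleasure.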

In neuroscience, discussions of action and state invariance often centre around 'mirror neurons', which are neurons which fire regardless of whether the animal is performing some action or whether it is just watching some other animal performing the same action. But, given an unsupervised world model, mirror neurons are exactly what we should expect to see. They are simply neurons which respond to the abstract action and are invariant to the performer of the action. This kind of invariance is no weirder than translation invariance for objects, and simply is a consequence of the fact that certain actions are 'natural abstractions' and not fundamentally tied with who is performing them ^[2].

Completely avoiding empathy at the latent space model would require learning an entirely ego-centric world model, such that any action I perform, or any feeling I feel, is represented as a completely different and orthogonal latent state to any other agent performing the same action or experiencing the same feeling. There are good reasons for not naturally learning this kind of entirely ego-centric world model with a complete separation in latent space between concepts involving self and involving others. The primary one is its inefficiency: it requires a duplication of all concepts into a concept-X-as-relates-to-me and concept-X-as-relates-to-others. This would require at best twice as much space to store and twice as much data to be able to learn than a mingled world model where self and other are not completely separated.

This theory of empathy makes some immediate predictions. Firstly, the more 'similar' the agent and its empathic target are, the more likely the latent state codes are to be similar, and hence the more likely reward generalization is, leading to greater empathy. Secondly, empathy is a continuous spectrum, since closeness-in-the-latent-space can vary continuously. This is exactly what we see in humans where large-scale studies find that humans are better at empathising with those closer to them, both within species -- i.e. people empathise more with those they consider in-groups, and across species where the amount of empathy people show a species is closely correlated with its phylogenetic divergence time from us. Thirdly, the degree of empathy depends on both the ability of the reward model to generalize and the world model to produce a latent space which well represents the natural abstractions of its environment. This suggests, perhaps, that empathy is a capability that scales along with model capacity -- larger, more powerful reward and world models may tend to lead to greater, more expansive empathic responses, although they potentially may have reward models that can make finer grained distinctions as well.
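The first two predictions can be sketched with a reward model whose generalization falls off smoothly with latent distance. This is a minimal sketch, assuming a single Gaussian (RBF) bump around the agent's own "pain" state; the target names and their dissimilarity values are made up purely for illustration.

```python
import math

def rbf_reward(z, anchor, anchor_reward, bandwidth=1.0):
    """Reward generalized from one experienced state, decaying with latent distance."""
    d2 = sum((a - b) ** 2 for a, b in zip(z, anchor))
    return anchor_reward * math.exp(-d2 / (2 * bandwidth ** 2))

# Hypothetical 2-d latent: [pain feature, dissimilarity-from-self feature].
self_in_pain = [1.0, 0.0]

# Illustrative (not empirical) latent dissimilarities of various targets.
targets = {"close friend": 0.2, "stranger": 0.8, "other mammal": 1.5, "insect": 3.0}

empathy = {
    name: rbf_reward([1.0, dissimilarity], self_in_pain, anchor_reward=-1.0)
    for name, dissimilarity in targets.items()
}
for name, value in empathy.items():
    print(f"{name:>12}: {value:.3f}")
```

Under these assumptions the predicted (negative) reward for a target's pain shrinks continuously toward zero as the target's latent code moves away from the agent's own, giving a graded rather than all-or-nothing empathic response.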

Finally, this phenomenon should be fairly fundamental. We should expect it to occur whenever we have a learnt reward model predicting the reward or values of a general unsupervised world-model latent state. This is, and will increasingly be, a common setup for when we have agents in environments for which we cannot trivially evaluate the 'true reward function', especially over hypothetical imagined states. Moreover, this is also the cognitive architecture used by mammals and birds, which possess a set of subcortical structures which evaluate and dispense rewards, and a general unsupervised world-model implemented in the cortex (or pallium for birds). This is also exactly what we see, with empathic behaviour being apparently commonplace in the animal kingdom.

The mammalian cognitive architecture that results in empathy is actually pretty sensible and it is possible that the natural path to AGIs is with such an architecture. It doesn't apply to classical utility maximizers based on model based planning, such as AIXI, but as soon as you don't have a utility oracle, which can query the utility function in arbitrary states, you are stuck instead with learning a reward function based on a set of 'ground-truth' actually-experienced rewards. Once you start learning a reward function, it is possible that the generalization this produces can result in empathy, even for some 'purely selfish' utility functions. This is potentially quite important for AI alignment. It means that, if we build AGIs with learnt reward functions, and the latent states in their world model involving humans are quite close to their latent states involving themselves, then it is very possible that they will naturally develop some kind of implicit empathy towards humans. If this happened, it would be quite a positive development from an alignment perspective, since it would mean that the AGI intrinsically cares, at least to some degree, about human experiences. The extent to which this occurs would be predicted to depend upon the similarity in the latent space between the AGI's representation of human states and its own. The details of the AGI's world model and training curriculum would likely be very important, as would be the nature of its embodiment. There are reasons to be hopeful about this since the AGI will almost certainly be trained almost entirely on human text data, human-created environments, and be given human relevant goals. This will likely lead to it gaining quite a good understanding of our experiences, which could lead to closeness in the latent space. On the negative side, the phenomenology and embodiment of the AGI is likely to be very different -- in a distributed datacenter interacting directly with the internet, as opposed to having a physical bipedal body and small, non-copyable brain.

Given reasonable interpretability and control tooling, this line of thought could lead to methods to try to make an AGI more naturally empathic towards humans. This could include carefully designing the architecture or training data of the reward model to lead it to naturally generalize towards human experiences. Alternatively, we may try to directly edit the latent space so as to bring our desired empathic targets to within the range of generalization of the reward models. Finally, during training, by presenting it with a number of 'test stimuli', we should be able to precisely measure the extent and kinds of empathy it has. Similarly, interpretability on the reward model could potentially reveal the expected contours of empathic responses.

Empathy in the brain

Now that we have thought about the general phenomenon and its applications to AI safety, let's turn towards the neuroscience and the specific cognitive architecture that is implemented in mammals and birds. Traditionally, the neuroscience of empathy splits up our natural conception of 'empathy' into 3 distinct phenomena, each underpinned by a dissociable neural circuit. These three facets are neural resonance, prosocial motivation, and mentalizing/theory of mind. Neural resonance is the visceral 'feeling of somebody else's pain' that we experience during empathy, and it is close to the reward model evaluation of others' states we discuss here. Prosocial motivation is essentially the desire to act on empathic feelings and be altruistic because of them, even if it comes at personal cost. Finally, mentalizing is the ability to 'put oneself in another's shoes' -- i.e. to simulate their inner cognitive processes. Typically, mentalizing is thought to occur primarily in the high-level association areas of cortex, typically including the precuneus and especially the temporo-parietal junction (TPJ). Neural resonance is thought to occur primarily by utilizing cortical areas involved with sensorimotor and emotional processing such as the anterior cingulate and insular cortex, as well as subcortical structures such as the amygdala. Finally, prosocial motivation is thought to be implemented in the regions typically related to goal-directed behaviour such as the VTA in the mid-brain and the orbitofrontal and prefrontal cortex. However, in practice, for ecologically realistic stimuli, these processes do not occur in isolation but always tend to co-occur with each other. Furthermore, empathy in general, and especially neural resonance, also appears to be modality specific. For instance, viewing a conspecific in pain will tend to activate the 'pain matrix': the network of brain regions which are also activated when you yourself are in pain.

Mentalizing, despite being the most 'cognitively advanced' process, is actually the easiest to explain and is not really related to empathy at all. Instead, we should expect mentalizing and theory of mind to just emerge naturally in any unsupervised world-model with sufficient capacity and data. That is, modelling other agents as agents, and explicitly simulating their cognitive state is a natural abstraction which has an extremely high payoff in terms of predictive loss. Intuitively, this is quite obvious. Other agents apart from yourself are a real phenomenon that exists in the world. Moreover, if you can track the mental state of other agents, you can often make many very important predictions about their current and future behaviour which you cannot if you model them as either non-agentic phenomena or alternatively as simple stimulus-response mappings without any internal state.

However, because it is expected to arise from any sufficiently powerful unsupervised learning model, mentalizing is completely dissociable from having any motivational component based on empathy. An expected utility maximizer like AIXI should possess a very sophisticated theory of mind and mentalizing capability, but zero empathy. Like the classical depiction of a psychopath, AIXI can perfectly simulate your mental state, but feels nothing if its actions cause you distress. It only simulates you so as to better exploit you to serve its goals. Of course, if we do have motivational empathy, then we have a desire to address and reduce the pain of others, and being able to mentalize is very useful for coming up with effective plans to do that. This, I suspect, is why mentalizing regions are so co-activated with empathy tasks.

Secondly, there is prosocial motivation. My argument is that this is precisely the kind of reward model generalization presented earlier. Specifically, the brain possesses a reward model learnt in the VTA and basal ganglia based on cortical inputs which predicts the values and rewards expected given certain cortical states. These predicted rewards are then used to train high level cortical controllers to query the unsupervised world model to obtain action plans with high expected rewards. To do so, the brain utilizes a learnt reward model based on associating cortical latent states with previously experienced primary rewards fed through to the VTA. As described previously, if this reward model misgeneralizes so as to assign reward to a state of others' pain or pleasure, as opposed to our own, then the brain should naturally develop this kind of pro-social motivation, in exactly the same way it develops motivation to reduce its own pain and increase its own pleasure based on the same reward model.
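As a rough sketch of how a misgeneralizing reward model could turn into prosocial action selection, consider a toy planner that scores imagined outcomes. The lever scenario, the outcome features, the weights, and the `self_other_similarity` parameter are all hypothetical stand-ins chosen for illustration.

```python
def predicted_reward(outcome, self_other_similarity):
    """Score an imagined outcome with a toy linear reward model."""
    # Hypothetical weights learnt purely from the agent's OWN experience.
    w_own_food, w_own_pain = 1.0, -2.0
    # Misgeneralization: another's pain inherits a fraction of the own-pain
    # weight, scaled by how close the two latent codes are (0 = orthogonal).
    w_other_pain = self_other_similarity * w_own_pain
    return (
        w_own_food * outcome["own_food"]
        + w_own_pain * outcome["own_pain"]
        + w_other_pain * outcome["other_pain"]
    )

# Outcomes the world model imagines for each candidate action.
IMAGINED_OUTCOMES = {
    "pull_lever": {"own_food": 1.0, "own_pain": 0.0, "other_pain": 1.0},
    "do_nothing": {"own_food": 0.0, "own_pain": 0.0, "other_pain": 0.0},
}

def plan(self_other_similarity):
    """Pick the action whose imagined outcome has the highest predicted reward."""
    return max(
        IMAGINED_OUTCOMES,
        key=lambda a: predicted_reward(IMAGINED_OUTCOMES[a], self_other_similarity),
    )

print(plan(self_other_similarity=0.8))  # shared latent codes
print(plan(self_other_similarity=0.0))  # fully egocentric world model
```

With overlapping self/other representations the planner forgoes the food to avoid the other's pain; with a fully egocentric representation the same selfish reward function pulls the lever.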

Finally, we come to neural resonance. I would argue this is the physiologically oldest and most basic state of empathy and occurs due to a very similar mechanism of model misgeneralization. Only this time, it is not the classic RL-based reward model in VTA that is mis-generalizing, but a separate reflex-association model implemented primarily in the amygdala and related circuitry such as the stria terminalis and periaqueductal grey. In my previous post, I argue that there are two separate behavioural systems in the brain. One is reward-based and built on RL, and the other predicts brainstem reflexes and other visceral sensations based on supervised predictive learning -- i.e. it associates a current state with a future visceral sensation. While the prosocial motivation system is fundamentally based on the RL system, neural resonance arises from this brainstem-prediction circuit. That is, we have a circuit that is constantly parsing cortical latent states for information predictive of experiencing pain, or needing to flinch, or needing to fight or flee, or any other visceral sensation or decision controlled by the brainstem. When this circuit misgeneralizes, it takes a cortical latent representation of another agent experiencing pain, sees that it contains many 'pain-like' features, and then predicts that the agent itself will experience pain shortly, and thus drives the visceral sensation and compensatory reflexes. This, we argue, is the root of neural resonance.

Widespread empathy in animals

Our theory argues that empathy is a fundamental and basic result of the mammalian cognitive architecture, and hence a clear prediction that results is that essentially all mammals should show some degree of empathy -- primarily in terms of neural resonance and prosocial behaviour. Although the evidence on this is not 100% conclusive for all mammals, almost every animal species studied appears to show at least some degree of empathy towards conspecifics. A large amount of work has been done investigating this in mice and rats. For instance, rats which see another rat being given painful electric shocks become more sensitive to pain themselves -- evidence of neural resonance. Similarly it has been shown that rats will not pull a lever which gives them food (a positive reward) if it leads to another rat being given a painful shock. Other species for which there is much evidence of empathy include elephants, dolphins and, of course, apes and monkeys.

Outside of apes and monkeys, dolphins and elephants, as well as corvids, also appear in anecdotal reports and the scientific literature to have many complex forms of empathy. For instance, both dolphins and elephants appear to take care of and nurse sick or injured individuals, even non-kin, as well as grieve for dead conspecifics. They also appear to be able to use mentalizing behaviours to anticipate the needs of their conspecifics -- for instance bringing them food or supporting them if injured. Apes (as well as corvids and human children) also show all of these empathic behaviours, and also are known to perform 'consolation behaviours' where bystanders will go up to and comfort distressed fellows, especially the loser of a dominance fight. This behaviour can be shown to occur both in captivity, and in the wild.

Contextual modulation of empathy

Overall, while we argue that the basic fundamental forms of empathy effectively arise due to misgeneralization of learnt reward models, and are thus a fundamental feature of this kind of cognitive architecture, this does not mean that our empathic responses are not also sculpted by evolution. Many of our emotions and behaviours take this basic level of empathy and elaborate on it in various ways. Much of our social behaviour and reciprocal altruism is based on a foundation of empathy. Moreover, evolutionarily hardwired behaviours like parenting and social bonding are likely deeply intertwined with our proto-empathic ability. An example of this is the hormone oxytocin, which is known to boost our empathic response, as well as being deeply involved both in parental care and social bonding.

On the other hand, evolution may also have given us mechanisms that suppress our natural empathic response rather than accentuate it. Humans, as well as other animals, are easily able to override or inhibit their sense of empathy when it comes to outgroups or rivals of various kinds, exactly as predicted by evolutionary theory. This is also why normal empathetic people are reliably able to sanction or inflict terrible cruelties on others. Humans, chimpanzees, and rodents are all able to modulate their empathic response to be greater for those they are socially close with and less for defectors or rivals (often even becoming negative, as in schadenfreude). Similarly, an MEG study of adolescents who grew up in an intractable conflict (the Israel-Palestine conflict) found that both Israeli-Jews and Palestinian-Arabs had an initial spike of empathy towards both ingroup and outgroup stimuli. However, this was followed by a top-down inhibition of empathy towards the outgroup and an increase of empathy towards the in-group. This top-down inhibition of empathy must be cortically-based and learnt from scratch based on contextual factors (since evolution cannot know a priori who is ingroup and outgroup). The fact that top-down inhibition of the 'natural' empathy can be learnt is probably also why, empirically, research on genocides has typically found that any kind of mass murder requires a long period of dehumanization of the enemy, to suppress people's natural empathic response to them.

All of this suggests that while the fundamental proto-empathy generated by the reward model generalization is automatic, the response can also be shaped by top-down cortical context. From a machine learning perspective, this means that humans must have some kind of cortical learnt meta-reward model which can edit the reward predictions flexibly based on information and associations coming from the world-model itself.
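One very simple way to picture such a meta-reward model is as a context-dependent gain applied to the automatic bottom-up prediction. This is a sketch under stated assumptions: the multiplicative form, the context labels, and the gain values are hypothetical, chosen only to show the qualitative pattern described above.

```python
# Hypothetical learnt gains; in the account above these would be acquired
# from experience by cortex, not hardwired. Values are illustrative only.
CONTEXT_GAIN = {
    "ingroup": 1.0,    # empathy passed through (or amplified)
    "stranger": 0.6,
    "outgroup": 0.1,   # top-down inhibition
    "rival": -0.5,     # sign flip: schadenfreude
}

def modulated_empathy(raw_prediction, context):
    """Apply a top-down, context-dependent gain to the automatic empathic signal."""
    return CONTEXT_GAIN[context] * raw_prediction

# Automatic bottom-up reward prediction on seeing someone in pain.
raw = -0.8
responses = {context: modulated_empathy(raw, context) for context in CONTEXT_GAIN}
for context, value in responses.items():
    print(f"{context:>9}: {value:+.2f}")
```

The bottom-up signal is identical in every case; only the learnt contextual gain differs, which matches the pattern of an initial undifferentiated spike followed by context-specific inhibition or enhancement.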

Psychopaths etc

Another interesting question for this theory is how and why psychopathy and other empathic disorders exist. If empathy is such a fundamental phenomenon, how do we appear to get impaired empathy in various disorders? Our response to this is that the classic cultural depiction of a psychopath as someone otherwise normal (and often highly functioning) but just lacking in empathy is not really correct. In fact, psychopaths do show other deficits, typically in emotional control, disinhibited behaviour, blunted affect (not really feeling any emotions) and often pathological risk-taking. Neurologically, psychopathy is typically associated with a hypoactive and/or abnormal amygdala, among other deficits, often including impaired VTA connectivity leading to deficits in decision-making and learning from reinforcement (especially punishments). According to our theory, this would argue that psychopathy is not really a syndrome of lacking empathy, but instead of having abnormal and poor learnt reward models mapping between base reward and visceral reflexes and cortical latent states. Abnormal empathy is then a consequence of the abnormal reward model and its (lack of) generalization ability.

^[1]
Our theory is very similar to the [Perception-Action-Mechanism](https://web-archive.southampton.ac.uk/cogprints.org/1042/) (PAM), and the closely related 'simulation theory' of empathy. Both argue that empathy occurs because our brain essentially learns to map representations of others experiencing some state to our own representations for that state. Our contribution is essentially to argue that this isn't some kind of special ability that must be evolved, but rather a natural outcome of an architecture which learns a reward model against an unsupervised latent state.
^[2]
One prediction of this hypothesis would be that we should expect general unsupervised models, potentially attached to RL agents, to naturally develop all kinds of 'mirror neurons' if trained in a multi-agent environment.

Let’s take a very simplistic model where reward = I am eating chocolate (as detected by the brainstem, say).

There would be some period of time during training when the reward predictor would predict a reward when I see someone else eating chocolate, because there’s a lot of overlap between them-eating-chocolate and me-eating-chocolate in the latent space. I think that’s your point here in this post, right?

But then every time that empathy thing happens, I obviously don’t then immediately eat chocolate. So the reward model would get an error signal—there was a reward prediction, but the reward didn’t happen. And thus the brain would eventually learn a more sophisticated “correct” reward model that didn’t fire empathetically. Right?

Of course, that’s not what really happens: adults have empathy too, it doesn’t get naturally trained away. That needs to be explained.

One possibility is that the reward model is somehow blinded to any information that could indicate whether something is empathy or not, but that seems difficult to implement. I’m skeptical.
Another possibility is (mumble mumble) regularization, but I dunno how that would work.
My preferred theory is that the brain has some mechanism to detect when a thought is an empathetic simulation, and then it can just choose not to send an error signal in that circumstance. (Or it can do other things with that information.) I’m currently not sure what that mechanism is.
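The extinction argument and the gating proposal can both be sketched with a one-parameter delta-rule learner. This is a toy illustration, not a model of any specific brain circuit; the initial value, learning rate, and the `gate_error` flag are assumptions introduced for the example.

```python
def train_empathic_prediction(initial, steps=100, lr=0.1, gate_error=False):
    """Repeatedly observe 'someone else eats chocolate' and update the
    predicted reward for that state with a simple delta rule."""
    v = initial
    for _ in range(steps):
        actual_reward = 0.0          # I never actually get the chocolate
        error = actual_reward - v    # reward prediction error
        if gate_error:
            # Hypothetical mechanism: the state is flagged as an empathetic
            # simulation, so no error signal reaches the reward model.
            error = 0.0
        v += lr * error
    return v

# Start from a prediction that generalized from "me eating chocolate".
ungated = train_empathic_prediction(initial=0.8)
gated = train_empathic_prediction(initial=0.8, gate_error=True)
print(f"ungated: {ungated:.5f}  gated: {gated:.5f}")
```

Without gating, the empathic prediction is trained away within a few dozen exposures; with the error signal suppressed for flagged states, it persists indefinitely. The sketch says nothing about how such a flag would be computed, which is the open question raised above.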

Interested in how you’re thinking about this. Sorry if I misunderstood anything :)

In the specific example of chocolate (unless it wasn't supposed to be realistic), are you sure it doesn't get trained away? I don't think that, upon seeing someone eating chocolate, I immediately imagine tasting chocolate. I feel like the chocolate needs to rise to my attention for other reasons, and only then do I viscerally imagine tasting chocolate.

What I really believe is that “the brain does other things with that information”, things more general than “feeling the same feeling as the other person is feeling”. See here:

In envy, if a little glimpse of empathy indicates that someone is happy, it makes me unhappy.
In schadenfreude, if a little glimpse of empathy indicates that someone is unhappy, it makes me happy.
When I’m angry, if a little glimpse of empathy indicates that the person I’m talking to is happy and calm, it sometimes makes me even more angry!

I do think “feeling the same feeling as the other person is feeling” can happen. The chocolate example is not great for that; maybe consider “seeing someone get unexpectedly punched hard in the stomach”. That makes me cringe a bit, still, even as an adult. Maybe an even better example (that only works for half the population) is “seeing someone get kicked in the balls”.

But it’s a bit subtle. If I saw people getting unexpectedly punched hard in the stomach day after day, sure, maybe I would stop cringing. But how much of that is a natural consequence of the learning algorithm and how much of that is “empathy is kinda aversive here, so I learn by RL to leverage top-down attention to deliberately avoid triggering that reaction”? I tend to think it’s mostly the latter, but it’s not obvious.

I think this is a mechanism that actually happens a lot. People generally do lose a lot of empathy with experience and age. People definitely get de-sensitized to both strongly negative and strongly positive experiences after viewing them a lot. I actually think that this is more likely than the RL story -- especially with positive-valence empathy which under the RL story people would be driven to seek out.

But then every time that empathy thing happens, I obviously don’t then immediately eat chocolate. So the reward model would get an error signal—there was a reward prediction, but the reward didn’t happen. And thus the brain would eventually learn a more sophisticated “correct” reward model that didn’t fire empathetically. Right?

My main model for why this doesn't happen in some circumstances (but definitely not all) is that the brain uses these signals and has a mechanism for actually providing positive or negative reward when they fire depending on other learnt or innate algorithms. For instance, you could pass the RPE through to some other region to detect whether the empathy triggered for a friend or enemy and then return either positive or negative reward, so implementing either shared happiness or schadenfreude. Generally I think of this mechanism as a low level substrate on which you can build up a more complex repertoire of social emotions by doing reward shaping on these signals.

Also -- I really like your post on empathy that cfoster linked above! I have read a lot of your work but somehow missed that one lol. Cool we are thinking at least somewhat along similar lines

Thanks!

For instance, you could pass the RPE through to some other region to detect whether the empathy triggered for a friend or enemy and then return either positive or negative reward, so implementing either shared happiness or schadenfreude.

In that case I’d be interested in the “some other region to detect whether the empathy triggered for a friend or enemy”. How is that region doing that? Specifically, (1) what exactly is the “low level substrate”, (2) what are the exact recipes for turning those things into the full complex repertoire of social emotions? Those are major research interests of mine. Happy for you & anyone else to join / share ideas :)

Thanks for the reply!

In envy, if a little glimpse of empathy indicates that someone is happy, it makes me unhappy.
In schadenfreude, if a little glimpse of empathy indicates that someone is unhappy, it makes me happy.
When I’m angry, if a little glimpse of empathy indicates that the person I’m talking to is happy and calm, it sometimes makes me even more angry!

How sure are you that these are instances of empathy (defining it as "prediction by our own latent world model of ourselves being happy/unhappy soon")? If I imagine myself in these examples, it doesn't introspectively feel like I am reacting to an impression of their internal state, but rather like I am directly reacting to their social behavior (e.g., abstractly speaking, a learned reflex of status-reasserting anger when someone else displays high status through happy and calm behavior).

This would also cleanly solve the mysteries of why they don't get updated and how they are distinguished from "other transient feelings" - there's no wrong prediction by the latent world model involved (nothing to be distinguished or updated), and the social maneuvering doesn't get negative feedback.

That's where some instinctive disagreement of mine with that post of yours comes from too. But I also haven't read through it carefully enough to be sure.

I think I probably don’t follow what you’re saying. It seems to me that people care very much about the internal state of other people. (Not in the sense of “people care that they have veridical beliefs about the internal state of other people”, but in the sense of “people spend a lot of time thinking about the internal state of other people, and their beliefs about those states are very relevant to their reactions”.)

Like, if I am to feel schadenfreude at Alice’s misfortune, it seems to me that it really matters that it’s a misfortune from Alice’s perspective. If I hate swimming and Alice loves it, and then Alice swims, then I wouldn’t feel schadenfreude there, right? And that requires attending to and reacting to (my beliefs about) Alice’s internal state, right?

Again, this seems very obvious to me, which suggests that I’m probably misunderstanding you.

I appreciate the charity!

I'm not claiming that people don't care about other people's internal states, I'm saying that it introspectively doesn't feel like that is implemented via empathy (the same part of my world model that predicts my own emotions), but via a different part of my model (dedicated to modeling other people), and that this would solve the "distinguishing-empathy-from-transient-feelings" mystery you talk about.

Additionally (but relatedly), I'm also skeptical that those beliefs are better described as being about other people's internal states rather than as about their social behavior. It seems easy to conflate these if we're not introspectively precise. E.g., if I imagine myself in your Alice example, I imagine Alice acting happy, smiling and uncaring, and only then is there any reaction - I don't even feel like I'm *able* to viscerally imagine the abstract concept (prod a part of my world model that represents it) of "Alice is happy".

But these are still two distinct claims, and the latter assumes the former.

One illustrative example that comes to mind is the huge number of people who experience irrational social anxiety, even though they themselves would never judge themselves if they were in other people's position.

I'm also skeptical that those beliefs are better described as being about other people's internal states rather than as about their social behavior.

Hmm. Continuing with the schadenfreude example, let’s say Alice stole my kettle and I would feel good if she burned her fingers on it. (Serves her right!) My introspection says, if Alice is alone when she burns her fingers, I’m still happy; that still counts. If I never see her again after that, that still counts. Heck, if she becomes a hermit and never sees another human again, that still counts. And therefore, that thought of Alice burning her fingers is pleasing in a way that is tightly connected to how I believe Alice feels, and disconnected from how I believe Alice is behaving socially, I think.

You mention “I imagine Alice acting happy, smiling and uncaring”. But I feel like the following two things feel very different to me:

“I imagine that Alice is acting happy, smiling and uncaring, and this is straightforwardly related to how she really feels”, versus
“I imagine that Alice is acting happy, smiling and uncaring, but on the inside she’s miserable, and she’s hiding how she really feels”.

What do you think?

I'm saying that it introspectively doesn't feel like that is implemented via empathy (the same part of my world model that predicts my own emotions), but via a different part of my model (dedicated to modeling other people)

I don’t update much on that because I think almost all of the discourse and intuitions and literature surrounding the word “empathy” are not talking about the same thing that I want to talk about. Thus I tend to avoid the word “empathy” altogether where possible. I’ve been using other terms like “empathetic simulation” or “little glimpse of empathy”. I talk about that a bit in Section 13.5.2 here. More specifically, I’m guessing that it doesn’t “feel like empathy” when you imagine Alice burning her fingers on the kettle she stole from me, because that thought feels good, whereas empathizing with Alice would be unpleasant. Here, my model says “yes the thought feels good, and if that’s not what you think of as “empathy”, then the thing you think of as “empathy” is not what I’m talking about”.

When we think of emotion concepts / categories, the valence / arousal / etc. associated with them are central properties. E.g. righteous indignation has to have positive valence and high arousal, otherwise we would call it something else (and think of it as something else). So if you think a thought that involves lots of the same cortical neurons as you get in typical righteous indignation, but those neurons trigger negative valence and low arousal in the brainstem (because of the empathy-detector intervening, or whatever), it wouldn’t feel anything like righteous indignation introspectively. Or something like that.

At a high level, I agree that something related to empathy can happen when the same circuits are used for processing thoughts-about-others as for thoughts-about-self. This seems like a design pattern that might be worth copying. My main concerns are:

It seems like the AIs we build will be very different from us, at least in terms of basic drives. I can definitely empathize when there's some common currency to the experience (for ex. they're feeling pain, and I've also experienced pain), but probably less so when there's a greater gap. Since AIs won't share any of our physiology or evolutionary history, I worry that that common currency will be missing, which would seemingly incentivize the AI having separate circuits for modeling humans and for modeling itself.
This doesn't seem like it'd give us a robust enough version of empathy by itself, because the agent isn't motivated to actively seek out opportunities to empathize. As an analogy, I know if I were forced to think of, and even look at, the process that produces hamburger meat, I would probably have a visceral reaction and not want to eat the burger. But I like burgers, so I don't seek out that train of thought, so the hypothetical empathy & disgust that would've been invoked lies inactive. Maybe something like Anthropic's Constitutional AI method would help in this direction...

Nitpick about terminology: I think the stuff you're talking about is primarily attributable to having a learned value function rather than to having a learned reward model in the narrow sense of a predictor of immediate reward. I tend to use value function to refer to the thing that, alongside the reward function, produces visceral (gut-like) reactions to thoughts based on forecasts that were learned via something like TD learning. A reward model, on the other hand, is just another part of your model of the world, so it might not be connected to visceral "feels". It doesn't necessarily have any sway over decision-making, in the same way as your "will this number be even or odd" model isn't necessarily connected to any visceral "feels", so you don't tend to make decisions based primarily on those predictions.

Also if you haven't read this post, I think it's a good one and very related.

It seems like the AIs we build will be very different from us, at least in terms of basic drives. I can definitely empathize when there's some common currency to the experience (for ex. they're feeling pain, and I've also experienced pain), but probably less so when there's a greater gap. Since AIs won't share any of our physiology or evolutionary history, I worry that that common currency will be missing, which would seemingly incentivize the AI having separate circuits for modeling humans and for modeling itself

Yes, this depends a lot on the self model of the AGI. It's definitely not a silver bullet. The AGI will almost certainly have a very good model of humans, their culture, and how their minds work from various self-supervised losses. Whether the AGI conceptualises itself as close to this or not depends on the representations of AGI in the dataset as well as potentially our training regime.

Nitpick about terminology: I think the stuff you're talking about is primarily attributable to having a learned value function rather than to having a learned reward model in the narrow sense of a predictor of immediate reward. I tend to use value function to refer to the thing that, alongside the reward function, produces visceral (gut-like) reactions to thoughts based on forecasts that were learned via something like TD learning

I agree it is not necessarily the reward model that generates direct feelings. I think it is hard to connect any part of an RL system directly to gut-level 'feels' because we don't really know what these are. The value function is just the estimate of the long-run reward and is trained with a supervised Bellman equation. It is very possible that the machinery that creates these 'feels' won't exist at all in the AGI, or maybe it is just some intrinsic property of RL agents; I don't know.
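To make the "trained on a Bellman equation" point concrete, here is a minimal TD(0) sketch. The three-state chain, the reward of 1.0, and all names in it are made up for illustration; it is not a claim about how the brain or any particular AGI would implement this.

```python
def td0(transitions, n_states, gamma=0.9, lr=0.1, sweeps=500):
    """Tabular TD(0): regress V(s) towards the Bellman target r + gamma * V(s')."""
    values = [0.0] * n_states
    for _ in range(sweeps):
        for state, reward, next_state, done in transitions:
            # Bootstrapped target: immediate reward plus discounted next-state value.
            target = reward + (0.0 if done else gamma * values[next_state])
            values[state] += lr * (target - values[state])
    return values

# Hypothetical chain: state 0 -> state 1 -> terminal state 2, reward only at the end.
chain = [(0, 0.0, 1, False), (1, 1.0, 2, True)]
values = td0(chain, n_states=3)
```

State 0 never receives reward directly, yet it acquires value purely through the forecast. That is the sense in which the value function, rather than a predictor of immediate reward, carries learned long-run estimates.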

Typos

There are good reasons for not naturally learning this kind of entirely ego-centric world model with a complete separation in latent self between concepts involving self and involving others.

Bolded should be "latent space".

Very interesting, thanks. I'm unconvinced that the motivational aspects of empathy are common in learning algorithms that look like gradient descent - if flinching when someone else is hurt doesn't harm your reproductive fitness then maybe it's easy for evolution to stick with it, but substantively changing your plans to avoid causing that flinch (as in the rats not shocking other rats) should rise to the attention of gradient descent and get massaged out.

My prediction is that there really is an evolved nudge towards empathy in the human motivational system, and that human psychology - like usually being empathetic but sometimes modulating it and often justifying self-serving actions - is sculpted by such evolved nudges, and wouldn't be recapitulated in AI lacking those nudges.

My prediction is that there really is an evolved nudge towards empathy in the human motivational system, and that human psychology - like usually being empathetic but sometimes modulating it and often justifying self-serving actions - is sculpted by such evolved nudges, and wouldn't be recapitulated in AI lacking those nudges.

I agree -- this is partly what I am trying to say in the contextual modulation section. The important thing is that the base capability for empathy might exist as a substrate to then get sculpted by gradient descent / evolution to implement a wide range of adaptive pro or anti-social emotions/behaviours. Which of these behaviours, if any, get used by the AI will depend on the reward function / training data it sees.

The key idea that leads to empathy is the fact that, if the world model performs a sensible compression of its input data and learns a useful set of natural abstractions, then it is quite likely that the latent codes for the agent performing some action or experiencing some state, and another, similar, agent performing the same action or experiencing the same state, will end up close together in the latent space. If the agent's world model contains natural abstractions for the action, which are invariant to who is performing it, then a large amount of the latent code is likely to be the same between the two cases. If this is the case, then the reward model might 'mis-generalize' to assign reward to another agent performing the action or experiencing the state rather than the agent itself. This should be expected to occur whenever the reward model generalizes smoothly and the latent space codes for the agent and another are very close in the latent space. This is basically 'proto-empathy' since an agent, even if its reward function is purely selfish, can end up assigning reward (positive or negative) to the states of another due to the generalization abilities of the learnt reward function ^[1].
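A toy sketch of this mis-generalization, under strong simplifying assumptions: the latent code is a hand-built four-dimensional vector (two agent-invariant action features plus a one-hot "who" tag), and the reward model is linear and trained only on the agent's own experience. The layout and names are hypothetical.

```python
# Hypothetical latent layout: [eating, idle, is_self, is_other].
# The first two dimensions are agent-invariant abstractions of the action.
SELF_EATS = [1.0, 0.0, 1.0, 0.0]
SELF_IDLE = [0.0, 1.0, 1.0, 0.0]
OTHER_EATS = [1.0, 0.0, 0.0, 1.0]
OTHER_IDLE = [0.0, 1.0, 0.0, 1.0]

def predict(weights, latent):
    """Linear reward model: predicted reward is a dot product with the latent."""
    return sum(w * z for w, z in zip(weights, latent))

def train_reward_model(data, dim=4, lr=0.1, epochs=500):
    """Fit the reward model with a plain delta rule on (latent, reward) pairs."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for latent, reward in data:
            error = reward - predict(weights, latent)
            weights = [w + lr * error * z for w, z in zip(weights, latent)]
    return weights

# Purely selfish training signal: reward only when the agent itself eats.
selfish_data = [(SELF_EATS, 1.0), (SELF_IDLE, 0.0)]
weights = train_reward_model(selfish_data)

reward_self_eats = predict(weights, SELF_EATS)
reward_other_eats = predict(weights, OTHER_EATS)
reward_other_idle = predict(weights, OTHER_IDLE)
```

Because part of the credit lands on the shared "eating" feature, the model predicts positive reward for another agent eating even though it was never trained on that case. How strongly this happens in a real system depends on how much of the latent code is actually shared.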

awesome

Why do you consider the behavior of so-called "psychopaths" as a "disorder"? What if a norm here is just a matter of cultural expectations? So, what is normal and what is not can be understood by comparing individual behavior when cultural norms don't limit it. And if, then, let's say, 40% of specimens behave as psychopaths (particularly, manifest violence in the form of a stable pattern), then we cannot call those individuals as having a "disorder." We have to consider them as a particular segment of the Homo sapiens population having a specific evolutionary function.

Outside of apes and monkeys, dolphins and elephants, as well as corvids also appear in anecdotal reports and the scientific literature to have many complex forms of empathy.

Might be related to Erich Neumann's book The Great Mother, which says: "The psychological development [of humankind]... begins with the 'matriarchal' stage in which the archetype of the Great Mother dominates and the unconscious directs the psychic process of the individual and the group." It's like when we see animals in the wild, e.g. the lioness and its cub: we always associate them as the mother and its child. We do not have to google or open a book to ensure that this is the case; deep within our psyche is a pattern that allows us to interpret it as such.

I agree with other commenters that this effect will be washed out by strong optimization. My intuition is that the problem of distinguishing self from other is easy enough (and supported by enough data) that the optimization doesn't have to be that strong.

[I began writing the following paragraph as a counter-argument to the post, but it ended up less decisive when thinking about the details - see the next paragraph:] There are many general mechanisms for convergence, synchronization and coordination. I hope to write a list in the near future. For example, as you wrote, having a model of other agents is obviously generally useful, and it may require having an approximation of both their world models and value functions as part of your world model. Unless you have huge amounts of data and compute, you are going to reuse your own world model as theirs, with small corrections on top. But this is about your world model, not your value function.

[The part that helps your argument. Epistemic status: Many speculative details, but ones that I find pretty convincing, at least before multiplying their probabilities] Except having the value function of other agents in your world model, and having the mechanism for predicting their action as part of your world-model-update, is basically replicating computations that you already have in your actor and critic, in a more general form. Your original actor and critic are then likely to simplify to "do the things that my model of myself would, and value the results as much as my model of myself would" + some corrections. In that stage, if the "some corrections" part is not too heavy, you may have some confusion of the kind that you described. Of course, it will still be optimized against.

BTW speaking about value function rather than reward model is useful here, because convergent instrumental goals are big part of the potential for reuse of others' (deduced) value function as part of yours. Their terminal goals may then leak into yours due to simplicity bias or uncertainty about how to separate them from the instrumental ones.

The main problem with that mechanism is that you liking chocolate will probably leak as "it's good for me too to eat chocolate", not "it's good for me too when beren eats chocolate" - which is more likely to cause conflict than coordination, if there is only so much chocolate.

And specifically for humans, I think there probably was evolutionary pressure that is actively in favor of leaking terminal goals - as the terminal goals of each of us are a noisy approximation of evolution's "goal" of increasing the number of offspring, that kind of leaking has potential for denoising. I think I explicitly heard this argument in the context of ideals of beauty (though many other things are going on there and pushing in the same direction).

I agree that this will probably wash out with strong optimization against it, and that such confusions become less likely the more different the world models of yourself and the other agent that you are trying to simulate are -- this is exactly what we see with empathy in humans! This is definitely not proposed as a full 'solution' to alignment. My thinking is that this effect may be useful for us in providing a natural hook to 'caring' about others, and that we can then design training objectives and regimens to extend and optimise this value shard to a much greater extent than it occurs naturally.

We agree 😀

What do you think about some brainstorming in the chat about how to use that hook?

Whether we can build artificial empathy into AI systems also has clear relevance to AI alignment.

I disagree. My tentative guess would be that in the majority of worlds where humanity survives and flourishes, {AGI having empathy} contributed ~nothing to achieving that success. (For most likely interpretations of "empathy".)

If we can create empathic AIs, then it may become easier to make an AI be receptive to human values, even if humans can no longer completely control it.

I suspect that {the cognitive process that produced the above sentence} is completely devoid of security mindset. If so, might be worth trying to develop security mindset? And/or recognize that one is liable to (i.a.) be wildly over-optimistic about various alignment approaches. (I notice that that sounded unkind; sorry, not meaning to be unkind.)

You pointed out that empathy is not a silver bullet. I have a vague (but poignant) intuition that says that the problem is a lot worse than that: Not only is empathy not a silver bullet, it's a really really imprecise heuristic/proxy/shard for {what we actually care about}, and is practically guaranteed to break down when subjected to strong optimization pressure.

Also, doing a quick bit of Rationalist Taboo on "empathy", it looks to me like that word is pointing at a rather complicated, messy swath of territory. I think that swath contains many subtly and not-so-subtly different things, most of which would not begin to be sufficient for alignment (albeit that some might be necessary).

I suspect that {the cognitive process that produced the above sentence} is completely devoid of security mindset. If so, might be worth trying to develop security mindset? And/or recognize that one is liable to (i.a.) be wildly over-optimistic about various alignment approaches. (I notice that that sounded unkind; sorry, not meaning to be unkind.)

Yep this is definitely not proposed as some kind of secure solution to alignment (if only the world were so nice!). The primary point is that if this mechanism exists it might provide some kind of base signal which we can then further optimize to get the agent to assign some kind of utility to others. The majority of the work will of course be getting that to actually work in a robust way.

You pointed out that empathy is not a silver bullet. I have a vague (but poignant) intuition that says that the problem is a lot worse than that: Not only is empathy not a silver bullet, it's a really really imprecise heuristic/proxy/shard for {what we actually care about}, and is practically guaranteed to break down when subjected to strong optimization pressure.

Yes. Realistically, I think almost any proxy like this will break down under strong enough optimization pressure, and the name of the game is just to figure out how to prevent this much optimization pressure being applied without imposing too high a capabilities tax.

the name of the game is just to figure out how to prevent this much optimization pressure being applied without imposing too high a capabilities tax

Hmm. I wonder if you'd agree that the above relies on at least the following assumptions being true:

(i) It will actually be possible to (measure and) limit the amount of "optimization pressure" that an advanced A(G)I exerts (towards a given goal).
(ii) It will be possible to end the acute risk period using an A(G)I that is limited in the above way.

If so, how likely do you think (i) is to be true? If you have any ideas (even very rough/vague ones) for how to realize (i), I'd be curious to read them.

I think realizing (i) would probably be at least nearly as hard as the whole alignment problem. Possibly harder. (I don't see how one would in actual practice even measure "optimization pressure".)

(i) It will actually be possible to (measure and) limit the amount of "optimization pressure" that an advanced A(G)I exerts (towards a given goal). If so, how likely do you think (i) is to be true?

If you have any ideas (even very rough/vague ones) for how to realize (i), I'd be curious to read them.

For this, it is not clear to me that it is impossible or even extremely difficult to do this, at least in a heuristic way. I think that managing to successfully limit the optimization power applied against our defences is fundamental to coming up with alignment techniques that can work in practice. We need some way to bound the adversary otherwise we are essentially doomed by construction.

There is a whole bunch of ideas you can try here which work mostly independently and in parallel -- examples of this are:

1.) Quantilization

2.) Impact regularization

3.) General regularisation against energy use, thinking time, compute cost

4.) Myopic objectives and reward functions. High discount rates

5.) Limiting serial compute of the model

6.) Action randomisation / increasing entropy -- something like dropout over actions.

7.) Satisficing utility/reward functions

8.) Distribution matching objectives instead of argmaxing

9.) Penalisation of divergence from a 'prior' of human behaviour

10.) Maintaining value uncertainty estimates and acting conservatively within the outcome distribution

These are just examples I have thought of immediately. There are a whole load more if you sit down and brainstorm for a while.
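As one concrete illustration of item 1, here is a minimal sketch of a quantilizer: rather than taking the argmax action, it samples from a base distribution and picks at random among the top q fraction by utility. The base sampler, utility function, and parameter values below are placeholders chosen for the example, not a proposal for what they should be in practice.

```python
import random

def quantilize(base_sampler, utility, q=0.1, n_samples=1000, rng=None):
    """Return a random draw from the top q fraction (by utility) of base samples."""
    rng = rng or random.Random(0)
    samples = [base_sampler(rng) for _ in range(n_samples)]
    samples.sort(key=utility, reverse=True)
    top = samples[: max(1, int(q * n_samples))]
    # Choosing uniformly among the top slice limits how hard utility is optimized.
    return rng.choice(top)

# Placeholder setup: "actions" are numbers in [0, 1] and utility is the number itself.
action = quantilize(lambda rng: rng.random(), utility=lambda a: a, q=0.1)
```

Setting q close to 0 recovers something near argmax, and q = 1 just samples from the base distribution, so q acts as a rough dial on the optimization pressure applied.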

In terms of measuring optimization power, I don't think this is that hard to do roughly. We can definitely define it in terms of outcomes as the KL divergence of the achieved distribution vs some kind of prior 'uncontrolled' distribution. We already implement KL penalties in RL like this. Additionally, rough proxies are serial compute, energy expenditure, compute expenditure, divergence from previous behaviour etc.
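A rough sketch of the outcome-based measure, assuming discrete outcome distributions represented as dictionaries. The four outcomes, the uniform 'uncontrolled' prior, and the `beta` coefficient are invented for illustration.

```python
import math

def kl_divergence(achieved, prior):
    """KL(achieved || prior) in nats, for discrete distributions given as dicts."""
    total = 0.0
    for outcome, p in achieved.items():
        if p > 0.0:
            total += p * math.log(p / prior[outcome])
    return total

def penalized_objective(expected_reward, achieved, prior, beta=1.0):
    """Reward minus a KL penalty, in the style of KL-regularized RL objectives."""
    return expected_reward - beta * kl_divergence(achieved, prior)

# Hypothetical 'uncontrolled' prior: four outcomes, equally likely.
prior = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
# An agent that forces outcome "a" with certainty.
forced = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}

no_optimization = kl_divergence(prior, prior)
full_optimization = kl_divergence(forced, prior)
```

On this measure, an agent that leaves the outcome distribution untouched exerts zero optimization, while one that pins the world to a single outcome out of four exerts log(4) nats. The hard part in practice is specifying the prior, which this sketch simply assumes.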

It will be possible to end the acute risk period using an A(G)I that is limited in the above way.

The major issue is what level of alignment tax these solutions impose and whether it is competitive with other players. This ultimately depends on the amount of slack that is available in the immediately post-AGI world. My feeling is that it is possible there is quite a lot of slack here, at least at first, and that most of the behaviours we really want to penalise for alignment purposes are quite far from most likely behaviour -- i.e. there is very little benefit to us of having the AGI having such a low discount rate it is planning about tiling the universe with paperclips in billions of years.

I also don't think of these so much as solutions but as part of the solution -- i.e. we still need to find good robust ways of encoding human values as goals, detect and prevent inner misalignment, and have some approach to manage Goodharting.

Let’s take a very simplistic model where reward = I am eating chocolate (as detected by the brainstem, say).

Of course, that’s not what really happens: adults have empathy too, it doesn’t get naturally trained away. That needs to be explained.

One possibility is that the reward model is somehow blinded to any information that could indicate whether something is empathy or not, but that seems difficult to implement. I’m skeptical.
Another possibility is (mumble mumble) regularization, but I dunno how that would work.
My preferred theory is that the brain has some mechanism to detect when a thought is an empathetic simulation, and then it can just choose not to send an error signal in that circumstance. (Or it can do other things with that information.) I’m currently not sure what that mechanism is.

Interested in how you’re thinking about this. Sorry if I misunderstood anything :)

What I really believe is that “the brain does other things with that information”, things more general than “feeling the same feeling as the other person is feeling”. See here:

In envy, if a little glimpse of empathy indicates that someone is happy, it makes me unhappy.
In schadenfreude, if a little glimpse of empathy indicates that someone is unhappy, it makes me happy.
When I’m angry, if a little glimpse of empathy indicates that the person I’m talking to is happy and calm, it sometimes makes me even more angry!

But then every time that empathy thing happens, I obviously don’t then immediately eat chocolate. So the reward model would get an error signal—there was a reward prediction, but the reward didn’t happen. And thus the brain would eventually learn a more sophisticated “correct” reward model that didn’t fire empathetically. Right?
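The washing-out argument, and the alternative of not sending an error signal during an empathetic simulation, can be sketched in the same toy setting. Everything here is an assumption for illustration: the four-dimensional latent layout, the linear reward model, and the `gate_empathy` flag standing in for whatever mechanism detects an empathetic simulation.

```python
# Hypothetical latent layout: [eating, idle, is_self, is_other].
SELF_EATS = [1.0, 0.0, 1.0, 0.0]
SELF_IDLE = [0.0, 1.0, 1.0, 0.0]
OTHER_EATS = [1.0, 0.0, 0.0, 1.0]

def predict(weights, latent):
    return sum(w * z for w, z in zip(weights, latent))

def rpe_update(weights, latent, reward, lr=0.1):
    """One delta-rule step driven by the reward prediction error (RPE)."""
    rpe = reward - predict(weights, latent)
    return [w + lr * rpe * z for w, z in zip(weights, latent)]

def empathic_prediction(gate_empathy, steps=2000):
    """Train, then report predicted reward for watching someone else eat."""
    weights = [0.0] * 4
    for _ in range(steps):
        weights = rpe_update(weights, SELF_EATS, 1.0)
        weights = rpe_update(weights, SELF_IDLE, 0.0)
        if not gate_empathy:
            # No chocolate actually arrives, so the RPE pushes the prediction down.
            weights = rpe_update(weights, OTHER_EATS, 0.0)
    return predict(weights, OTHER_EATS)

ungated = empathic_prediction(gate_empathy=False)
gated = empathic_prediction(gate_empathy=True)
```

Without gating, the empathic prediction is trained away to roughly zero, as the argument above predicts. With the error signal suppressed on empathetic simulations, the mis-generalized prediction persists. The sketch says nothing about how such a detector would be implemented.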

Also -- I really like your post on empathy that cfoster linked above! I have read a lot of your work but somehow missed that one lol. Cool we are thinking at least somewhat along similar lines

Thanks!

For instance, you could pass the RPE through to some other region to detect whether the empathy triggered for a friend or enemy and then return either positive or negative reward, so implementing either shared happiness or schadenfreude.

Thanks for the reply!

In envy, if a little glimpse of empathy indicates that someone is happy, it makes me unhappy.
In schadenfreude, if a little glimpse of empathy indicates that someone is unhappy, it makes me happy.
When I’m angry, if a little glimpse of empathy indicates that the person I’m talking to is happy and calm, it sometimes makes me even more angry!

That's where some instinctive disagreement of mine with that post of yours comes from too. But I also haven't read through it carefully enough to be sure.

Again, this seems very obvious to me, which suggests that I’m probably misunderstanding you.

I appreciate the charity!

But these are still two distinct claims, and the latter assumes the former.

I'm also skeptical that those beliefs are better decribed as being about other people's internal states rather than as about their social behavior.

You mention “I imagine Alice acting happy, smiling and uncaring”. But I feel like the following two things feel very different to me:

“I imagine that Alice is acting happy, smiling and uncaring, and this is straightforwardly related to how she really feels”, versus
“I imagine that Alice is acting happy, smiling and uncaring, but on the inside she’s miserable, and she’s hiding how she really feels”.

What do you think?

I'm saying that it introspectively doesn't feel like that is implemented via empathy (the same part of my world model that predicts my own emotions), but via a different part of my model (dedicated to modeling other people)

It seems like the AIs we build will be very different from us, at least in terms of basic drives. I can definitely empathize when there's some common currency to the experience (for ex. they're feeling pain, and I've also experienced pain), but probably less so when there's a greater gap. Since AIs won't share any of our physiology or evolutionary history, I worry that that common currency will be missing, which would seemingly incentivize the AI having separate circuits for modeling humans and for modeling itself.
This doesn't seem like it'd give us a robust enough version of empathy by itself, because the agent isn't motivated to actively seek out opportunities to empathize. As an analogy, I know if I were forced to think of, and even look at, the process that produces hamburger meat, I would probably have a visceral reaction and not want to eat the burger. But I like burgers, so I don't seek out that train of thought, so the hypothetical empathy & disgust that would've been invoked lays inactive. Maybe something like Anthropic's Constitutional AI method would help in this direction...

Also if you haven't read this post, I think it's a good one and very related.

It seems like the AIs we build will be very different from us, at least in terms of basic drives. I can definitely empathize when there's some common currency to the experience (for ex. they're feeling pain, and I've also experienced pain), but probably less so when there's a greater gap. Since AIs won't share any of our physiology or evolutionary history, I worry that that common currency will be missing, which would seemingly incentivize the AI having separate circuits for modeling humans and for modeling itself

Nitpick about terminology: I think the stuff you're talking about is primarily attributable to having a learned value function rather than to having a learned reward model in the narrow sense of a predictor of immediate reward. I tend to use value function to refer to the thing that, alongside the reward function, produces visceral (gut-like) reactions to thoughts based on forecasts that were learned via something like TD learning

Typos

There are good reasons for not naturally learning this kind of entirely ego-centric world model with a complete separation in latent self between concepts involving self and involving others.

Bolded should be "latent space".

My prediction is that there really is an evolved nudge towards empathy in the human motivational system, and that human psychology - like usually being empathetic but sometimes modulating it and often justifying self-serving actions - is sculpted by such evolved nudges, and wouldn't be recapitulates in AI lacking those nudges.

The key idea that leads to empathy is the fact that, if the world model performs a sensible compression of its input data and learns a useful set of natural abstractions, then it is quite likely that the latent codes for the agent performing some action or experiencing some state, and another, similar, agent performing the same action or experiencing the same state, will end up close together in the latent space. If the agent's world model contains natural abstractions for the action, which are invariant to who is performing it, then a large amount of the latent code is likely to be the same between the two cases. If this is the case, then the reward model might 'mis-generalize' to assign reward to another agent performing the action or experiencing the state rather than the agent itself. This should be expected to occur whenever the reward model generalizes smoothly and the latent space codes for the agent and another are very close in the latent space. This is basically 'proto-empathy' since an agent, even if its reward function is purely selfish, can end up assigning reward (positive or negative) to the states of another due to the generalization abilities of the learnt reward function ^[1].

awesome

Outside of apes and monkeys, dophins and elephants, as well as corvids also appear in anecdotal reports and the scientific literature to have many complex forms of empathy.

We agree 😀

What do you think about some brainstorming in the chat about how to use that hook?

Whether we can build artificial empathy into AI systems also has clear relevance to AI alignment.

If we can create empathic AIs, then it may become easier to make an AI be receptive to human values, even if humans can no longer completely control it.

I suspect that {the cognitive process that produced the above sentence} is completely devoid of security mindset. If so, might be worth trying to develop security mindset? And/or recognize that one is liable to (i.a.) be wildly over-optimistic about various alignment approaches. (I notice that that sounded unkind; sorry, not meaning to be unkind.)

You pointed out that empathy is not a silver bullet. I have a vague (but poignant) intuition that says that the problem is a lot worse than that: Not only is empathy not a silver bullet, it's a really really imprecise heuristic/proxy/shard for {what we actually care about}, and is practically guaranteed to break down when subjected to strong optimization pressure.

the name of the game is just to figure out how to prevent this much optimization pressure being applied without imposing too high a capabilities tax

Hmm. I wonder if you'd agree that the above relies on at least the following assumptions being true:

(i) It will actually be possible to (measure and) limit the amount of "optimization pressure" that an advanced A(G)I exerts (towards a given goal).
(ii) It will be possible to end the acute risk period using an A(G)I that is limited in the above way.

If so, how likely do you think (i) is to be true? If you have any ideas (even very rough/vague ones) for how to realize (i), I'd be curious to read them.

I think realizing (i) would probably be at least nearly as hard as the whole alignment problem. Possibly harder. (I don't see how one would in actual practice even measure "optimization pressure".)

(i) It will actually be possible to (measure and) limit the amount of "optimization pressure" that an advanced A(G)I exerts (towards a given goal). If so, how likely do you think (i) is to be true?

If you have any ideas (even very rough/vague ones) for how to realize (i), I'd be curious to read them.

There is a whole bunch of ideas you can try here which work mostly independently and in parallel -- examples of this are:

1.) Quantilization

2.) Impact regularization

3.) General regularisation against energy use, thinking time, compute cost

4.) Myopic objectives and reward functions. High discount rates

5.) Limiting serial compute of the model

6.) Action randomisation / increasing entropy -- something like dropout over actions.

7.) Satisficing utility/reward functions

8.) Distribution matching objectives instead of argmaxing

9.) Penalisation of divergence from a 'prior' of human behaviour

10.) Maintaining value uncertainty estimates and acting conservatively within the outcome distribution

These are just examples I have thought of immediately. There are a whole load more if you sit down and brainstorm for a while.
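As one illustration of the first item in the list above, here is a minimal sketch of a quantilizer, assuming a finite-sample approximation: draw candidate actions from a base distribution, rank them by utility, and pick uniformly at random from the top `q` fraction instead of taking the argmax. The function name `quantilize` and its parameters are hypothetical, not from any particular library.

```python
import random

def quantilize(base_sampler, utility, q=0.1, n_samples=1000, rng=None):
    """Sample an action from (approximately) the top q fraction of a base distribution.

    base_sampler: callable taking a random.Random and returning one action.
    utility: callable scoring an action (higher is better).
    q: fraction of the base distribution to keep; q=1.0 recovers the base
       distribution, and q -> 0 approaches argmax over the samples.
    """
    rng = rng or random.Random()
    candidates = [base_sampler(rng) for _ in range(n_samples)]
    candidates.sort(key=utility, reverse=True)
    top_k = max(1, int(q * n_samples))
    # Uniform choice among the top candidates, rather than the single best one.
    return rng.choice(candidates[:top_k])

# Toy usage: actions are real numbers drawn from a standard normal "prior",
# and utility is just the action's value.
rng = random.Random(0)
action = quantilize(lambda r: r.gauss(0.0, 1.0), lambda a: a, q=0.1, rng=rng)
print(action)
```

The parameter `q` acts as a dial on how much optimization is applied on top of the base distribution, which is the sense in which this kind of scheme limits optimization pressure.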

It will be possible to end the acute risk period using an A(G)I that is limited in the above way.
