(If you’re in a hurry, you can just read the “Background and summary” section, and skip the other 85%.)
0. Background and summary
0.1 Background: What’s the problem and why should we care?
My primary neuroscience research goal for the past couple years has been to solve a certain problem, a problem which has had me stumped since the very beginning of when I became interested in neuroscience at all (as a lens into Artificial General Intelligence safety) back in 2019. In this post I offer a hypothesis for what the solution might generally look like, at least in the big picture. I don’t have all the details pinned down, but I feel like this post is major progress. (Unless it’s all wrong! Like the last one was.[1] Very happy for feedback!)
We can divide the brain into a “Learning Subsystem” (cortex, striatum, amygdala, cerebellum, and a few other areas) that houses a bunch of randomly-initialized within-lifetime learning algorithms, and a “Steering Subsystem” (hypothalamus, brainstem, and a few other areas) that houses a bunch of specific, genetically-specified “business logic”. A major role of the Steering Subsystem is as the home for the brain’s “innate drives”, a.k.a. “primary rewards”, roughly equivalent to the reward function in reinforcement learning—things like eating-when-hungry being good (other things equal), pain being bad, and so on.
Some of those “innate drives” are related to human social instincts—a suite of reactions and drives that are upstream of things like compassion, friendship, love, spite, sense of fairness and justice, etc.
The grand problem is: how do those human social instincts work? Ideally, an answer to this problem would look like legible pseudocode that’s simultaneously compatible with behavioral observations (including everyday experience), with evolutionary considerations, and with a neuroscience-based story of how that pseudocode is actually implemented by neurons in the brain.[2]
Explaining how human social instincts work is tricky mainly because of the “symbol grounding problem”. In brief, everything we know—all the interlinked concepts that constitute our understanding of the world and ourselves—is created “from scratch” in the cortex by a learning algorithm, and thus winds up in the form of a zillion unlabeled data entries like “pattern 387294 implies pattern 579823 with confidence 0.184”, or whatever.[3] Yet certain activation states of these unlabeled entries—e.g., the activation state that encodes the fact that Jun just told me that Xiu thinks I’m cute—need to somehow trigger social instincts in the Steering Subsystem. So there must be some way that the brain can “ground” these unlabeled learned concepts. (See my earlier post Symbol Grounding and Human Social Instincts.)
A solution to this grand problem seems useful for Artificial General Intelligence (AGI) safety, since (for better or worse) someone someday might invent AGI that works by similar algorithms as the brain, and we’ll want to make those AGIs intrinsically care about people’s welfare. It would be a good jumping-off point to understand how humans wind up intrinsically caring about other people’s welfare sometimes. (Slightly longer version in §2.2 here; much longer version in this post.)
0.2 Summary of the rest of the post
I’ll start by going through the four algorithmic ingredients we need for my hypothesis, one by one, in each case describing what it is algorithmically, why it’s useful evolutionarily, and where in the brain we might go looking to find the specific neurons that are running this (alleged) algorithm.
Here’s the roadmap:
Ingredient 1 is innate sensory heuristics in the Steering Subsystem—previously discussed in §3.2.1 here. An example would be some part of your brainstem that detects skittering spiders in your field-of-view.
Ingredient 1A is innate sensory heuristics for conspecific detection in particular. (Terminology note: “Conspecific” = “another member of the same species”.) This is a special case of Ingredient 1, but I think it’s an important and widespread special case. For example, humans have innate reactions to seeing humans (faces, gait, etc.), hearing human voices, and so on, just as mice have innate reactions to seeing and smelling other mice. I claim that these heuristics are combined into a general “thinking of a conspecific” flag in the Steering Subsystem.
Ingredient 2 is ‘short-term predictors’—previously discussed in §5.4 here. These are supervised learning algorithms, mainly housed in the “extended striatum” (including amygdala), that search for connections between aspects of your rich understanding of the world (e.g. the learned concept ‘spider’) and Steering Subsystem reactions (e.g. feeling jittery). This allows generalization—for example, the “thinking of a conspecific” flag can be triggered even when a conspecific is not standing right there.
Ingredient 3 is tailoring learned models via involuntary attention and learning rate. Basically, involuntary attention can sculpt large-scale information flows within the cortex, altering what the short-term predictors wind up learning. As an example, the orienting reflex to a skittering spider comes along with involuntary attention, which ensures that when your brainstem notices a spider, your cortex / “global workspace” / conscious attention jumps to the spider, and to things related to the spider, as opposed to continuing to daydream about Taylor Swift. Correspondingly the short-term predictors can learn to trigger visceral reactions upon seeing spiders, and not learn to trigger visceral reactions upon daydreaming about Taylor Swift.
Ingredient 4 is reading out transient empathetic simulations via a combination of all the above ingredients. Basically, the “thinking of a conspecific” flag above activates a transient involuntary lack of attention to your own raw interoceptive inputs. That clears the way for any “feeling”-related signal from the cortex at that moment to be interpreted (by the Steering Subsystem) as indicative of what that other person is feeling, in conjunction with properly-tailored information flows and learning rates.
Then, I’ll go through an important (putative) example of social instincts built from these ingredients, which I call the “compassion / spite circuit”. This circuit leads to an innate drive to feel compassion towards people we like, and to feel spite and schadenfreude towards people we hate.
In an elegant twist, I claim that this very same “compassion / spite circuit” also leads to an innate “drive to feel liked / admired”—a drive that I hypothesized earlier and believe to be central to both status-seeking and norm-following. The trick in explaining how they’re related is:
“Drive for compassion” basically amounts to “I want Ahmed to feel pleasure”;
“Drive to feel liked / admired” basically amounts to “I want Ahmed to feel pleasure upon thinking about me”;
…and it turns out that, at the particular moments when the “compassion / spite circuit” gets strongly activated, Ahmed is very often thinking about me! An important example would be a moment where Ahmed has just turned to me and made eye contact.
Then I’ll go more briefly through some other possible social instincts, including a sketch of a possible “drive to feel feared” (whose existence I previously hypothesized here). For context, dual strategies theory talks about “prestige” and “dominance” as two forms of status; while the “drive to feel liked / admired” leads to prestige-seeking, the “drive to feel feared” correspondingly leads to dominance-seeking.
0.3 Confidence level
My confidence gradually decreases as you proceed through the article. The “Background” section above is rock-solid in my mind, as are Ingredients 1, 1A, and 2. Ingredients 3 and especially 4 are somewhat new to this post, but derive from ideas I’ve been playing around with for a year or two, and I feel pretty good about them. The specific putative examples of social instincts in §5–§7 are much more new and speculative, and are oversimplified at best. But I’m optimistic that they’re on the right track, and that they’re at least a “foot in the door” towards future refinements.
1. Ingredient 1: Innate sensory heuristics in the Steering Subsystem
Think of things like seeing a slithering snake, or a skittering spider; smelling or tasting rotten food; male dogs smelling a female dog in heat; camouflaged animals recognizing the microenvironment where their bodies will blend in; and so on.
Note that these are all imperfect heuristics, anchored to innate circuitry, rather than developing along with our understanding of the world. We can call it a venomous-spider-detector circuit, for example, noting that it evolved because venomous spiders were dangerous to early humans.[4] But if we do that, then we acknowledge that it will have both false positives (e.g. centipedes, harmless spiders) and false negatives (funny-looking stationary venomous spiders), when compared to actual venomous spiders as we intelligently understand them. In vision especially, think of these heuristics as detecting relatively simple patterns of blobs and motion textures, as opposed to an “image classifier” / “video classifier” up to the standards of modern ML or human capabilities.
For more discussion of Ingredient 1, see §3.2.1 here.
1.1 Ingredient 1A: Innate sensory heuristics for conspecific detection in particular
As a special case of Ingredient 1, I claim that, in pretty much all animals, there are a set of sensory heuristics that are specifically designed by evolution to trigger on conspecifics. That would include one or more variations on: seeing a conspecific, hearing a conspecific, touching (or being touched by) a conspecific, smelling a conspecific, etc.
(I’m confident in this part because pretty much all animals have innate behaviors towards conspecifics that are different from their behaviors in other situations—mating, intermale aggression, parenting, being parented, herding, huddling, and so on.)
I claim that these all trigger a special Steering Subsystem flag that I call “thinking of a conspecific”:
1.2 Neuroscience details
Neuroscience details box
The sensory heuristics involve brainstem areas like the superior colliculus (for innate heuristic calculations on visual data), inferior colliculus (auditory data), gustatory nucleus of the medulla (taste data), and so on. (Again see §3.2.1 here.)
In the case of visual sensory heuristics, I’m actually not 100% confident that these calculations are located in the superior colliculus proper; for all I know, they’re partly or entirely in the neighboring parabigeminal nucleus, or whatever. There are papers on this topic, but they can’t always be taken at face value—see for example me complaining about methodologies used in the literature here and here.
For the “thinking of a conspecific” flag, it would be somewhere within the Steering Subsystem, but I don’t have any particular insight into exactly where. If I had to guess, I might guess that it’s one of the many little cell groups of the medial preoptic hypothalamus, since those often involve social interactions. If not that, then I’d guess it’s somewhere else in the medial hypothalamus, or (less likely) the lateral hypothalamus, or (less likely) some other part of the Steering Subsystem.
If you want to find the “thinking of a conspecific” flag experimentally, the conceptually-simplest method would be to first find one of the sensory heuristics for conspecific detection (e.g. the face detector), see what its efferent connections (downstream targets) are, and treat all those as top candidates to be studied one-by-one.
2. Ingredient 2: Generalization via short-term predictors
Ingredient 1 is a first step towards understanding, say, fear-of-spiders. But it’s not the whole story, because I don’t just get nervous when there is actually a large skittering spider in my field-of-view right now, but also when I imagine one, or when somebody tells me that there’s a spider behind me, etc. How does that work? The answer is: what I call the “short-term predictor”.
The “short-term predictor” is a learning algorithm that involves three ingredients: context, output, and supervisor. For definitions see this post; or in the ML supervised learning literature, you can substitute “context” = “trained model input”, “output” = “trained model output”, and “supervisor” = “label” (i.e., ground truth), which is subtracted from the trained model output to get an error that updates the model.[5] (In the actual brain, the transmitted signal might be the “error” rather than the “label”; I’m glossing over those kinds of implementation details.)
The important points are that:
The short-term predictor will learn within your lifetime to associate otherwise-inscrutable world-model concepts—like the concept of “spider”, the word “spider”, the detailed visual appearance of spiders, the concept of “centipede”, etc.—with the physiological arousal brainstem reaction;
The “output” of the short-term predictor can itself trigger that brainstem reaction, in a kind of self-fulfilling prophecy that I call “defer-to-predictor mode” (see here).
Thus, this kind of story explains the fact that I viscerally react to learning that there’s a spider in my vicinity that I can’t immediately see or feel.
If we take the brainstem reaction and the short-term predictor together, it can function as what I call a long-term predictor, again see here.
By the same token, the “thinking of a conspecific” flag can trigger when I’m, well, thinking of a conspecific, even if the conspecific is not standing right there, triggering my brainstem sensory heuristics right now.
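To make the context / output / supervisor picture concrete, here is a minimal sketch of a short-term predictor as a toy linear model trained by a delta rule, plus a toy version of “defer-to-predictor mode”. Everything specific here is an assumption for illustration: the one-hot “concept” vectors, the class and function names, the learning rate, and the 0.5 threshold are all hypothetical, and nothing is meant as a claim about how neurons actually implement this.

```python
import numpy as np


class ShortTermPredictor:
    """Toy linear model mapping a context vector to a scalar output."""

    def __init__(self, n_context: int, learning_rate: float = 0.1):
        self.weights = np.zeros(n_context)
        self.learning_rate = learning_rate

    def predict(self, context: np.ndarray) -> float:
        return float(self.weights @ context)

    def update(self, context: np.ndarray, supervisor: float) -> float:
        # Error = output minus supervisor ("label"), matching the ML analogy.
        error = self.predict(context) - supervisor
        self.weights -= self.learning_rate * error * context
        return error


# Hypothetical one-hot stand-ins for unlabeled world-model concepts.
SPIDER_CONCEPT = np.array([1.0, 0.0])
SWEATER_CONCEPT = np.array([0.0, 1.0])

predictor = ShortTermPredictor(n_context=2)
for _ in range(50):
    predictor.update(SPIDER_CONCEPT, supervisor=1.0)   # innate heuristic fired
    predictor.update(SWEATER_CONCEPT, supervisor=0.0)  # nothing fired


def arousal_reaction(heuristic_fired: bool, predictor_output: float,
                     threshold: float = 0.5) -> bool:
    """Toy defer-to-predictor mode: a strong prediction can itself
    trigger the reaction, even when the innate heuristic is silent."""
    return heuristic_fired or predictor_output > threshold


spider_output = predictor.predict(SPIDER_CONCEPT)
sweater_output = predictor.predict(SWEATER_CONCEPT)
print(arousal_reaction(False, spider_output))   # True
print(arousal_reaction(False, sweater_output))  # False
```

In this sketch, merely activating the learned “spider” concept is enough to trigger the reaction with no spider in view, which is the generalization that Ingredient 2 is supposed to provide.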
2.1 Neuroscience details
Neuroscience details box
I think the short-term predictors that I’ll be talking about in this post are mostly centered around small clusters of medium spiny neurons somewhere in the amygdala, or the lateral septum, or the medial part of the nucleus accumbens shell. (I haven’t tried to pin them down in more detail than that. See §5.5.4 here for some more general neuroscience discussion of this topic.)
However, in some cases pyramidal neurons can play this short-term predictor role as well, such as in the cortex-like (basolateral) section of the amygdala, along with certain parts of cortex layer 5PT.
The supervisory signal (either ground truth or an error signal, I’m not sure) probably makes an intermediate stop (“relay”) at some little cluster of neurons on the fringes of the Ventral Tegmental Area (VTA), not shown in the diagram above, in which case the supervisory signal would ultimately arrive at the spiny neuron in the form of a dopamine signal. I think. (But there are also VTA GABA neurons that seem somehow related to these particular short-term predictors. I haven’t tried to make sense of that in detail.)
3. Ingredient 3: Tailoring learned models via involuntary attention and learning rate
In this section I’ll just go through a simple example of the orienting reflex upon seeing a spider, then in Ingredient 4 below we’ll see how this applies to social instincts and feelings.
3.1 What does the orienting reflex do?
When the seeing-a-spider brainstem sensory heuristic triggers, I claim that one thing it does is trigger an “orienting reflex”. Part of that reflex involves moving the eyes, head, and body towards whatever triggered the heuristic. And another part of it involves involuntary attention towards the visual inputs in general, and the corresponding part of the field of view in particular.
The involuntary attention plays an important role in constraining what “thought” the cortex is thinking. If you’re daydreaming, imagining, remembering, etc., then your current “thought” has very little to do with current visual inputs. By contrast, involuntary attention towards vision forms a constraint that the thought must be “about” the visual inputs. It’s not completely constraining—the same thought can also contextualize those visual inputs by roping in presumed upstream causes, or expected consequences, or other associations, etc. But the visual inputs have to be a central part of the thought. In other words, you’re not only pointing your eyes at the spider, but you’re also actually thinking about the spider with your cortex (“global workspace”).
To be more specific about what’s going on, we need to be thinking about large-scale patterns of information flow within the cortex, as in the following toy example:
When you’re using visual imagination, your consciously-accessible visual areas of the cortex (e.g. the inferior temporal gyrus (IT)) are, in essence, disconnected from the immediate visual input. You can imagine Taylor Swift’s new dress while looking at a swamp. By contrast, when you’re paying attention to what you’re looking at, then there’s a consistency requirement: the visual models (i.e., generative models of visual data) in IT have to be consistent with the immediate visual input from your retina.
And my claim is that the Steering Subsystem has some control over this kind of large-scale information flow among different parts of the cortex, via its “involuntary attention”.
3.1.1 Side note: Transient attentional gaps are more common, and harder to notice, than you realize
You might be wondering: Is it really true that, if I’m imagining Taylor Swift’s new dress, then my awareness is detached from immediate visual input? Don’t we continue to be aware of visual input even while imagining something else?
A few responses:
First, your cortex has lots of vision-related areas, and it’s possible for some visual areas to be yoked to immediate visual input while other visual areas are simultaneously yoked to episodic memory. I think this definitely happens to some extent.
Second, your attention can jump around between different things rather quickly, such that most people imagine themselves to have far more complete and continuous visual awareness than they actually do—see things like change blindness, or the selective attention test, or the fact that peripheral vision has terrible resolution and terrible color perception and makes faces look creepy.
Third, the cortex tracks time-extended models, and accordingly has a general ability to pull up activation history from slightly (e.g. half a second) earlier, anywhere in the cortex. That makes it very hard to introspect upon exactly what you were or weren’t thinking at any given moment. For a much more detailed discussion of this point, with an example, see here.
This is a general lesson, going beyond just vision: transient (fraction-of-a-second) attentional gaps and shifts are hard to notice, both as they happen and in hindsight. Don’t unthinkingly trust your intuitions on that topic. (I’ll be centrally relying on these transient attentional shifts in this post, so it’s important that you are thinking about them clearly.)
3.2 Combining attention with time-variable learning rates
The Steering Subsystem gets an additional lever of control over brain learning algorithms by combining that kind of large-scale information flow control with time-variable learning rates, as follows.
Let’s start with learning in the world model / Thought Generator ≈ cortex. Above I was talking about the “space of visual models” which are learned from scratch in IT. Like everything in the world-model (details), this space is learned by predictive (a.k.a. self-supervised) learning. But it’s learned more specifically when we’re paying attention to visual input. The models thus get sculpted to reflect the structure of the actual visual world.
Separately, we can query those existing models for the purpose of memory recall and visual imagination. But when we do, I claim that the learning rate is zero (or at most, almost-zero).
Moving on to the parallel case of learning in the Thought Assessors / short-term predictors ≈ striatum and amygdala: the genome can likewise leverage large-scale information flows to get some control over what the short-term predictors learn.
As a toy example, let’s take the diagram above, but add in a short-term predictor. And just as for the cortex case above, we’ll set the short-term predictor learning rate to zero unless we’re paying attention to visual input. Here’s a diagram:
Thanks to this learning rate modulation, this short-term predictor is trained specifically to maximize its predictive accuracy in situations where we’re paying attention to visual input. When we’re visually imagining or remembering something, by contrast, the short-term predictor will continue to be queried, but it won’t be updated.
What’s the advantage of this setup? Well, imagine my cortex is daydreaming about Taylor Swift, and then my brainstem notices a spider in the corner of my field-of-view. Without the involuntary attention, the learning algorithm update would associate daydreaming-about-Taylor-Swift with the seeing-a-spider reaction (physiological arousal, aversiveness, etc.), which is not a useful thing for me to learn. The involuntary attention can solve that problem: first the involuntary attention kicks the Taylor Swift daydream out of my brain, and ensures that I’m thinking about the spider instead; and second the short-term predictor learning algorithm records those new thinking-about-the-spider thoughts, and fires into its output line whenever similar thoughts recur in the future. Thus I’ll wind up feeling physiological arousal related to the shape and motion of a spider, spiderwebs, centipedes, that corner in the basement, etc., which makes a lot more sense (ecologically) than feeling physiological arousal related to Taylor Swift.
(Well, that’s a bad example. It is entirely ecologically appropriate to feel physiological arousal related to Taylor Swift! But that’s for other reasons!)
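The spider-versus-daydream story can be sketched as a toy simulation. This is only an illustration under stated assumptions: the two one-hot “thought” vectors, the `train` function, the `gate_on_attention` switch, and the learning rate are all hypothetical names and values, and the predictor is a simple delta-rule learner rather than anything biologically detailed.

```python
import numpy as np

DAYDREAM = np.array([1.0, 0.0])  # cortex is busy with a daydream
SPIDER = np.array([0.0, 1.0])    # cortex is thinking about the spider


def train(episodes, gate_on_attention: bool, base_lr: float = 0.2) -> np.ndarray:
    """Train a toy short-term predictor of arousal.

    Each episode is (thought, attending_to_vision, arousal_ground_truth).
    If gate_on_attention is True, the learning rate is zero whenever we
    are not attending to visual input.
    """
    weights = np.zeros(2)
    for thought, attending_to_vision, arousal in episodes:
        lr = base_lr if (attending_to_vision or not gate_on_attention) else 0.0
        error = float(weights @ thought) - arousal
        weights -= lr * error * thought
    return weights


# The brainstem spots a spider (arousal = 1) in every episode.
# Without an orienting reflex, the cortex keeps daydreaming:
no_orienting = [(DAYDREAM, False, 1.0)] * 30
# With the orienting reflex, the thought jumps to the spider:
with_orienting = [(SPIDER, True, 1.0)] * 30

ungated = train(no_orienting, gate_on_attention=False)
gated_without_shift = train(no_orienting, gate_on_attention=True)
gated_with_shift = train(with_orienting, gate_on_attention=True)

print(ungated)              # arousal wrongly tied to the daydream
print(gated_without_shift)  # nothing learned at all
print(gated_with_shift)     # arousal tied to the spider
```

The first case shows the unhelpful association, the second shows that gating the learning rate prevents it, and the third shows the combination of involuntary attention plus gated learning producing the ecologically sensible association.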
3.3 Neuroscience details
Neuroscience details box
For involuntary attention: There are probably multiple pathways working in conjunction. Probably cholinergic and/or adrenergic neurons are involved. More specifically, cholinergic projections to the cortex are probably part of this story, and so are the cholinergic projections to thalamic relay cells. I don’t know the details.
For adjusting learning rate: There are a bunch of ways this could work. If there’s an error signal coming from the Steering Subsystem (hypothalamus or brainstem) to a short-term predictor, it could be set to zero, and then there’s no learning. Or maybe there’s a separate signal for learning rate (maybe acetylcholine again?) coming from the Steering Subsystem, which could be turned off instead. There could also be some more indirect effect of lack-of-attention on the cortex side—like maybe the cortex representations are less active when they’re further removed from sensory input, and that indirectly reduces learning rate, or something. I don’t know.
4. Ingredient 4: Reading out transient empathetic simulations
If we apply the same kind of reasoning as above, it suggests a path to solving the symbol-grounding problem for somebody else’s feelings. A key ingredient we need is “involuntary LACK of attention towards interoceptive inputs”, triggered by the “thinking of a conspecific” flag of Ingredient 1A, shown on the right side of this diagram:
What is this “lack of attention” supposed to accomplish? Here’s a schematic diagram illustrating the flows of information / attention / constraints in a normal situation (left) and in a situation where one of the Ingredient 1A conspecific detection heuristics has just fired (right):
The involuntary lack of attention transiently disconnects the interoceptive models from what I’m feeling right now. Instead, the space of interoceptive models in the cortex will settle into whatever is most consistent with what’s happening in the visual, semantic, and other areas of the cortex (a.k.a. “global workspace”). And thanks to the orienting reflex, those other areas of the cortex are modeling Zoe.
And therefore, if any interoceptive models are active, they’re ones that have some semantic association with Zoe. Or more simply: they’re how Zoe feels (or more precisely, how Zoe seems to feel, from my perspective).
This is progress! But there’s still some more work to do.
Next, let’s put in a couple short-term predictors (Ingredient 2), and think about learning rates (Ingredient 3):
Here, I show two different short-term predictors for the same ground truth (namely, physiological arousal). However, the contexts and learning rates are different, and hence their behaviors are correspondingly different as well.
The short-term predictor on the left uses (let’s say) visual models as context, and its learning rate is nonzero iff I’m paying attention to immediate visual inputs. As it turns out, Zoe is my tyrannical boss, who loves to exercise arbitrary power over me, and thus our conversations are often stressful. This left predictor will pick up on that pattern, and preemptively suggest physiological arousal whenever I notice that Zoe might be coming to talk to me.
Meanwhile, the short-term predictor on the right uses interoceptive models as context, and its learning rate is nonzero iff I’m paying attention to my own interoceptive inputs.[6] This short-term predictor will wind up learning things that seem pretty stupidly trivial, e.g. “the conscious feeling of arousal (in the Thought Generator a.k.a. world-model) predicts actual arousal (in the Steering Subsystem)”; but it still needs to be there for technical reasons.[7] Anyway, this output will not respond to the fact that conversations with Zoe tend to be stressful for me. But if Zoe herself seems stressed, the output will reflect that.
Thus, when things are set up properly, the Steering Subsystem can simultaneously get signals about both how a situation feels to us and what other people seem to be feeling.
(I showed the example of physiological arousal, but the same logic applies to “being happy”, “being angry”, “being in pain”, etc.)
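Here is a minimal sketch of the two-predictor setup, under heavy simplifying assumptions: each context is collapsed to a single number (a hypothetical “Zoe is approaching” visual feature and a hypothetical “feeling of arousal” interoceptive-model feature), and `GatedPredictor`, `read_out`, and the learning rate are illustrative names and values, not anything from the neuroscience.

```python
from dataclasses import dataclass


@dataclass
class GatedPredictor:
    """Toy one-weight short-term predictor with an attention-gated learning rate."""
    weight: float = 0.0
    learning_rate: float = 0.2

    def predict(self, context: float) -> float:
        return self.weight * context

    def update(self, context: float, supervisor: float, attending: bool) -> None:
        if not attending:  # learning rate is zero without the relevant attention
            return
        error = self.predict(context) - supervisor
        self.weight -= self.learning_rate * error * context


left = GatedPredictor()   # context: visual model, "Zoe is approaching"
right = GatedPredictor()  # context: interoceptive model, "feeling of arousal"

for _ in range(40):
    # Attending to vision: Zoe approaches, and I am in fact stressed.
    left.update(context=1.0, supervisor=1.0, attending=True)
    # Attending to my own interoception: felt arousal matches actual arousal.
    right.update(context=1.0, supervisor=1.0, attending=True)
    # Empathetic-simulation moment: interoceptive attention is off, so this
    # mismatched sample (Zoe seems aroused, I am not) does not train anything.
    right.update(context=1.0, supervisor=0.0, attending=False)


def read_out(zoe_approaching: float, zoe_seems_stressed: float):
    """Query both predictors while the interoceptive models are tracking Zoe."""
    return left.predict(zoe_approaching), right.predict(zoe_seems_stressed)


my_stress, her_stress = read_out(zoe_approaching=1.0, zoe_seems_stressed=0.0)
print(round(my_stress, 2), round(her_stress, 2))  # 1.0 0.0
```

With a calm-seeming Zoe approaching, the left output is high (the situation is stressful for me) while the right output is zero (she does not seem stressed), so the two outputs carry different information even though both were trained against the same ground truth.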
4.1 So, the “thinking of a conspecific” flag is also a “this is an empathetic simulation” flag?
Well, kinda. But with some caveats.
The sense in which this is true is: both the interoceptive model space and the associated short-term predictors are trained in a circumstance where they relate exclusively to my own interoceptive inputs, but then they’re sometimes queried in a circumstance where they relate to someone else’s interoceptive inputs.
But in other senses, calling it an “empathetic simulation” flag might be a bit misleading.
First, it would be a transient empathetic simulation, lasting a fraction of a second, which is rather different from how we normally use the term “empathy”—more on that here.
Arguably, even “transient empathetic simulation” is an overstatement—it’s just some learned association between what I’m seeing and some feeling-related concept. The concept of Zoe seems to somehow imply the concept of stress, within my world-model. That's all. I don't really need to be “taking her perspective”, nor to be feeling Zoe’s simulated stress in Zoe’s simulated loins, or whatever.
Second, this flag is exclusively related to empathetic simulations of what someone is feeling,[8] not empathetic simulations of what they’re thinking, seeing, etc. For example, if I’m curious whether Zoe can see the moon from where she’s standing, then I would do a quick empathetic simulation of what Zoe is seeing. The “thinking of a conspecific” flag is not particularly related to that; indeed, if anything, this flag is probably anticorrelated with that, since the flag is trained only in situations where orienting reflexes are pulling attention to our own exteroceptive sensory inputs.
Thus, my framework implies that social instincts can only involve reacting to someone’s (assumed) feelings. They cannot (directly) involve reacting to what someone is seeing, or thinking, etc. I think that claim rings true to everyday experience.
And there's actually a deeper reason to believe that claim. If I take Zoe’s visual perspective and imagine that she’s looking at a saxophone, then my Steering Subsystem can’t do anything with that information. The Steering Subsystem doesn’t understand saxophones, or anything else about our big complicated world. But it does know the “meaning” of its suite of innate physiological state variables and signals—physiological arousal, body temperature, goosebumps, and so on. See my discussion of “the interface problem” here.
Third, even among the set of short-term predictors related to “feelings”, only some of them are set up such that they will output a transient empathetic simulation. See the toy example above with two different short-term predictors for physiological arousal, one of which conveys empathetic simulations and the other of which does not.
4.2 Neuroscience details
Neuroscience details box
Involuntary lack-of-attention signal: Well, absence-of-attention might just involve suppressing presence-of-attention pathways, like the ones I mentioned under Ingredient 3 above (possibly involving acetylcholine). Or it might be a different system that pushes in the opposite direction—maybe involving serotonin? Or (more likely) multiple complementary signals that work in different ways. I don’t have any strong opinions here.
Two short-term predictors for the same thing: I drew a diagram above with two different short-term predictors of physiological arousal. While that diagram was oversimplified in various ways, I do think it’s true that there are (at least) two different short-term predictors of physiological arousal, one using exteroception-related signals as context, the other using interoception-related signals as context, with the latter capturing empathetic simulations (among its other roles). My guess is that the former is in the amygdala and the latter is somewhere in the medial prefrontal or cingulate cortex. (Clarification for the latter: I think most of the short-term predictors are medium spiny neurons in the “extended striatum”, and have been labeling my diagrams accordingly. But as I mentioned in §2.1 above, I do think there are places where pyramidal neurons play a short-term predictor role too, including in layer 5PT of certain parts of the cortex.)
5. Hypothesis: a “compassion / spite circuit”
Everything so far was preliminaries—now we can start speculating about real social instincts! My main example is a possible innate drive circuit that would be upstream of compassion and spite. Start with another Steering Subsystem signal:
5.1 The “Conspecific seems to be feeling (dis)pleasure” signal
The first step is to get a “conspecific seems to be feeling pleasure / displeasure”[9] signal in the Steering Subsystem, as follows:
The purple box is yet another Steering Subsystem signal that I’m labeling “pleasure / displeasure”. This is closely related to valence—for details see here. Then the gray box would be an intermediate variable[10] in the Steering Subsystem which would, by design, track the extent to which I think of the conspecific as feeling pleased / displeased.
All we need to get that gray box, beyond what we’ve already covered, is a gate: If the thinking-of-a-conspecific flag is on, AND there’s a short-term predictor output consistent with (dis)pleasure, then that means I’m thinking about a conspecific who is currently feeling (dis)pleasure.
This step is built on the kind of “transient empathetic simulation” that I’ve discussed previously and in §4.1 above: the short-term predictor on the right is trained by supervised learning on instances of myself feeling (dis)pleasure, but now at this particular moment it’s being triggered by thinking about someone else feeling (dis)pleasure.
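As a tiny sketch of that gate, assuming the short-term predictor output is a signed number (positive for pleasure, negative for displeasure) and using a hypothetical function name:

```python
def conspecific_feeling_signal(thinking_of_conspecific: bool,
                               pleasure_predictor_output: float) -> float:
    """Toy gate: pass the (dis)pleasure prediction through only while the
    'thinking of a conspecific' flag is on; otherwise output zero."""
    return pleasure_predictor_output if thinking_of_conspecific else 0.0


print(conspecific_feeling_signal(True, -0.7))   # -0.7
print(conspecific_feeling_signal(False, -0.7))  # 0.0
```

The signed-scalar encoding is purely an assumption for illustration; the real signal could just as well be split across separate pleasure and displeasure channels.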
That was just the start. Next, how do we build a social instinct out of the gray “conspecific seems to be feeling pleasure / displeasure” box? We need another Steering Subsystem parameter!
5.2 The “friend (+) vs enemy (–)” parameter
I introduced another Steering Subsystem parameter called “friend (+) vs enemy (–)”. When this parameter is extremely negative, it indicates that whatever you’re thinking about (in this case, the conspecific) should be physically attacked, right now. If the activity level is mildly negative, then you probably won’t go that far, but you’ll still feel like they’re the enemy and you hate them. If it’s positive, you’ll feel “on the same team” as them.
Anyway, when the “friend (+) vs enemy (–)” parameter is positive, then “conspecific seems to be feeling pleasure / displeasure” causes positive / negative valence respectively. This innate drive would lead to compassion—we feel intrinsically motivated by the idea that the conspecific is feeling pleasure, and intrinsically demotivated by the idea that the conspecific is feeling displeasure.
…And if the “friend (+) vs enemy (–)” parameter is negative, we flip the sign: “conspecific seems to be feeling pleasure / displeasure” causes negative / positive valence respectively. This innate drive would lead to both spite and schadenfreude.
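The sign-flipping rule in the last two paragraphs can be sketched in a few lines of Python. The text only pins down the signs; using a plain product, so that a stronger friend / enemy setting also gives a stronger response, is an extra assumption made here for illustration, as are the names.

```python
def compassion_spite_valence(friend_enemy: float,
                             conspecific_feeling: float) -> float:
    """Valence contributed by the hypothesized compassion / spite circuit.

    friend_enemy:        positive = friend, negative = enemy.
    conspecific_feeling: positive = seems pleased, negative = seems displeased.
    The product form is an assumption; only the signs come from the text.
    """
    return friend_enemy * conspecific_feeling


# Friend who seems pleased: positive valence (compassion).
print(compassion_spite_valence(0.8, 1.0))
# Enemy who seems displeased: also positive valence (schadenfreude).
print(compassion_spite_valence(-0.8, -1.0))
```

The other two sign combinations (friend who seems displeased, enemy who seems pleased) come out negative, matching the description above.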
How is the “friend (+) vs enemy (–)” parameter itself calculated? By other social instincts outside the scope of this post—more on that in §7 below. Perhaps part of it is a different circuit that says: if thinking about a conspecific co-occurs with positive valence (i.e., if we like / admire them), then that probably shifts the friend/enemy parameter a bit more towards friend, and perhaps also conversely with negative valence. That’s not circular, because conspecifics can acquire positive or negative valence for all kinds of reasons, just like sweaters or computers or anything else can acquire positive or negative valence for all kinds of reasons, including non-social dynamics like if I’m hungry and the conspecific gives me yummy food. That’s a robust and flexible system that will leverage my rich understanding of the world to systematically assign “friend” status to conspecifics who lead to good things happening for me. That’s probably just one factor among many; I imagine that there are lots of innate circuits that can impact friend / enemy status in various circumstances. Of course, as usual, the friend / enemy parameter would be attached to one or more short-term predictors, enabling memory, generalization, and perhaps also transient empathetic simulations.
5.2.1 Evolution and zoological context
Pretty much every complex social animal has innate, stereotyped behaviors for both helping and hurting conspecifics in different circumstances—e.g. attack behaviors, and companionship-type behaviors such as within families.
And evolutionarily, if it makes sense to help or hurt conspecifics through innate, stereotyped behaviors, then presumably it also makes sense to help or hurt conspecifics through the more powerful and flexible pathways that leverage within-lifetime learning, as would happen through a “compassion / spite circuit”. (See (Appetitive, Consummatory) ≈ (RL, reflex).)
Indeed, even in rodents, I think there’s clear evidence of more flexible, goal-oriented behaviors to (selectively) help conspecifics. For example, Márquez et al. 2015 finds that rats help conspecifics via choice of arm in a T-shaped maze. And Bartal et al. 2014 finds that rats release conspecifics from restraints, but only in situations where they feel friendly towards the conspecific. (See also: Kettler et al. 2021.) I don’t think either of these needs to be explained with my proposed “compassion / spite circuit” above involving transient empathetic simulation; for example, maybe rats squeak in a certain way when they’re happy, and hearing another rat make a happy squeak triggers a primary reward, or whatever. But anyway, as far as I can tell at a glance, the “compassion / spite circuit” is at least plausibly present even in rodents.
…Or maybe it’s just a “compassion” circuit for rodents. I can’t immediately find any evidence either way on whether rats display flexible, goal-oriented spite-type behavior towards other rats they hate. (They undoubtedly have inflexible, stereotyped, threat and attack postures and behaviors, but that’s different—again see (Appetitive, Consummatory) ≈ (RL, reflex).) Let me know if you’ve seen otherwise!
5.2.2 Neuroscience details
I expect that friend-vs-enemy is two groups of neurons that are mutually inhibitory, as opposed to one that swings positive and negative compared to baseline. That’s how the hypothalamus handles hungry-vs-full, for example (see here). As for where those neuron groups are, I don’t know. Probably medial hypothalamus somewhere.
5.3 Phasic physiological arousal
“Phasic” means that physiological arousal jumps up for a fraction of a second, in synchronization with noticing something, thinking a certain thought, etc. The opposite of “phasic” is “tonic”, like how I can have generally high arousal (alertness, excitement) in the morning and generally low arousal in the afternoon.
Now, one thing that my compassion / spite circuit above is missing is a notion that some interactions can feel more important / high-stakes to me than others. I think this is a separate axis of variation from the friend / enemy axis. For example, my neighbor and my boss are both solidly on the “friend” side of my friend / enemy spectrum—I feel “warmly” towards both, or something—but interactions with my boss feel much higher stakes, and correspondingly I react more strongly to their perceived feelings. So let’s refine the circuit above to fix that:
Basically, when I orient to a conspecific, then recognize them, the associated phasic arousal[11] tracks how important (high-stakes) this interaction with the conspecific is, from my perspective. Then we use that to scale up or down the compassion / spite response.
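Here is a minimal Python sketch of that refinement, under the assumption that "scale up or down" means simple multiplication by a non-negative arousal level. The names and the example numbers for the neighbor and the boss are hypothetical.

```python
def scaled_compassion_spite_valence(phasic_arousal: float,
                                    friend_enemy: float,
                                    conspecific_feeling: float) -> float:
    """Compassion / spite response scaled by how high-stakes the interaction feels.

    phasic_arousal:      >= 0, brief arousal on orienting to the conspecific.
    friend_enemy:        positive = friend, negative = enemy.
    conspecific_feeling: positive = seems pleased, negative = seems displeased.
    Multiplicative scaling is an assumed functional form.
    """
    if phasic_arousal < 0:
        raise ValueError("phasic_arousal must be non-negative")
    return phasic_arousal * friend_enemy * conspecific_feeling


# Both are equally on the "friend" side, and both seem equally displeased,
# but the boss interaction is assumed to trigger more phasic arousal.
neighbor = scaled_compassion_spite_valence(0.2, 0.5, -1.0)
boss = scaled_compassion_spite_valence(0.9, 0.5, -1.0)
print(neighbor, boss)
```

With these made-up inputs, the boss's apparent displeasure produces a larger negative valence than the neighbor's, even though the friend / enemy setting is identical.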
5.3.1 Neuroscience details
I think the locus coeruleus, a tiny group of 30,000 neurons (in humans), is the high-level arousal-controller in your brain, and its activity can vary over short timescales (up and down within half a second; there’s a plot here). If you measure pupil dilation, then maybe you’ll miss some of the very fastest dynamics, but you will see the variation on a ≈1-second timescale. If you measure skin conductance, that’s slower still.
I’m generally assuming in this post that “arousal” is a scalar. That’s probably something of an oversimplification (see Poe et al. 2020) but good enough for present purposes.
I’ve been talking as if the role of phasic arousal is specific to the “compassion / spite circuit”, but a more elegant possibility is that it’s a special case of a very general interaction between arousal and valence, such that arousal makes all good things seem better, and makes all bad things seem worse, other things equal. After all, arousal is saying that a situation is high-stakes. So that kind of general dynamic seems evolutionarily plausible to me.
(For the record, I think the general interaction between arousal and valence is not just multiplicative. I think there’s also a thing that we call “being overwhelmed”, where sufficiently high arousal can cause negative valence all by itself. Basically, in a very high-stakes situation, the Steering Subsystem wants to say that things are either very good or very bad, and in the absence of positive evidence that things are very good, it treats “very bad” as a default.)
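One way to picture "multiplicative plus being overwhelmed" is the following Python sketch. The threshold, the gain, and the piecewise-linear penalty are all invented for illustration; the text only says that sufficiently high arousal can push valence negative by itself.

```python
def valence_with_overwhelm(arousal: float,
                           base_valence: float,
                           overwhelm_threshold: float = 0.8,
                           overwhelm_gain: float = 2.0) -> float:
    """Assumed arousal-valence interaction with an 'overwhelm' term.

    The first term is the general multiplicative interaction: arousal makes
    good things better and bad things worse. The second term is a penalty
    that appears only above a (hypothetical) arousal threshold.
    """
    multiplicative = arousal * base_valence
    overwhelm_penalty = overwhelm_gain * max(0.0, arousal - overwhelm_threshold)
    return multiplicative - overwhelm_penalty
```

Below the threshold this reduces to the purely multiplicative story. Above it, a situation with neutral evidence (`base_valence` of zero) comes out negative, while strong positive evidence can still outweigh the penalty.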
5.4 Generalization via short-term predictors
As usual, Steering Subsystem flags can serve as ground-truth supervision for short-term predictors, which supports generalization. Thanks to “defer-to-predictor mode” (see here), we wind up with Steering Subsystem social instincts activating in situations where nobody is in the room with me right now, but nevertheless I find myself intrinsically motivated by the idea of Zoe feeling good in general, and/or Zoe feeling good about me in particular.
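The following Python sketch illustrates the general pattern, using a tiny linear predictor trained by the delta rule as a stand-in for a short-term predictor. That choice, the two-element context vector, and reading "defer-to-predictor mode" as "no ground truth available, so pass the prediction through without learning" are all simplifying assumptions.

```python
from typing import List, Optional, Sequence


class ShortTermPredictor:
    """Tiny linear predictor trained by the delta rule (an assumed stand-in)."""

    def __init__(self, n_features: int, learning_rate: float = 0.5) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.learning_rate = learning_rate

    def predict(self, context: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(self.weights, context))

    def step(self, context: Sequence[float],
             ground_truth: Optional[float]) -> float:
        """Return the signal used downstream for this moment."""
        output = self.predict(context)
        if ground_truth is None:
            # Defer-to-predictor mode: no supervision, no learning,
            # and the prediction itself is used downstream.
            return output
        # Supervised mode: learn from the error, use the ground truth.
        error = ground_truth - output
        self.weights = [w + self.learning_rate * error * x
                        for w, x in zip(self.weights, context)]
        return ground_truth


# Hypothetical context: [Zoe-concept active, unrelated-concept active].
predictor = ShortTermPredictor(n_features=2)
for _ in range(20):
    # Zoe is in the room and the Steering Subsystem flag is firing.
    predictor.step([1.0, 0.0], ground_truth=1.0)

# Later, nobody is in the room, but I am thinking about Zoe.
alone_thinking_of_zoe = predictor.step([1.0, 0.0], ground_truth=None)
alone_thinking_of_other = predictor.step([0.0, 1.0], ground_truth=None)
print(alone_thinking_of_zoe, alone_thinking_of_other)
```

After training, the Zoe-related context alone is enough to drive the signal close to its trained value, while the unrelated context leaves it at zero.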
6. The “compassion / spite circuit” also causes a “drive to feel liked / admired”
Let’s talk about the social instinct that I call “drive to feel liked / admired”—i.e., an innate drive that makes it so that, if I think highly of person X, then it’s inherently motivating to believe that person X thinks highly of me too. To make this work, one might think that we need another ingredient. It’s not enough for the Steering Subsystem to have strong evidence that my conspecific is feeling pleasure or displeasure, as above. The Steering Subsystem has to get strong evidence that my conspecific is feeling pleasure or displeasure in regards to me in particular. Where could such evidence come from?
Remarkably, my answer is: we already got it! We don’t need any other ingredients. It’s just an emergent consequence of the same circuit above!! Let me explain why:
6.1 Key idea: My “compassion / spite circuit” is disproportionately active and important while the conspecific is thinking about me-in-particular
6.1.1 Starting example: Innate sensory heuristics for receiving eye contact
I think there’s a “I’m receiving eye contact” detector in the human brainstem, just like the other conspecific-detection sensory heuristics of Ingredient 1A.
But if you think about it, the “I’m receiving eye contact” detector has a special property, one that the other Ingredient 1A heuristics lack. Consider: if you’re hearing a conspecific, or noticing their gait, etc., then the conspecific might not even know you exist. By contrast, if a conspecific is giving you eye contact, then their brainstem is activating its “thinking of a conspecific” flag, in regards to you.
Here’s a diagram illustrating this:
As Zoe makes (perhaps brief) eye contact with me, both my and Zoe’s Steering Subsystems are shown. My big idea is marked in red—Zoe is reliably thinking about me at the very moment when I’m sensitive to how Zoe seems to be feeling. So if the circuit frequently triggers this way, then I’ll wind up motivated not so much towards Zoe feeling good in general, but towards Zoe liking / admiring me.
6.1.2 Generalization: Innate sensory heuristics fire strongly upon being the target of an orienting reflex
“Receiving eye contact” is a special case of “I’m the target of an orienting reflex”. And I think that other Ingredient 1A heuristics fit into that mold too. For example, my human-face-detection heuristic fires if someone turns to face me. That has directionally the same effect as eye contact, but it doesn’t require eye contact per se—it also fires if the person has sunglasses. And it also supports the “drive to feel liked / admired”, for the same reason as above.
(Ecologically, we expect a long and robust history of “I’m the target of an orienting reflex” brainstem heuristic detectors. For example, if I’m a mouse, and a fox performs an orienting reflex towards me, then I’d better switch from hiding to running.)
6.1.3 Another example: Somebody deliberately getting my attention
Suppose Zoe walks up to me and says “hey”. That still gets my attention—and being a human voice, it triggers the corresponding Ingredient 1A heuristic, and thus the “thinking of a conspecific” flag. But it has the same special property as eye contact above: at the very moment when it gets my attention, Zoe is reliably thinking about me-in-particular.
So the same logic as above holds: the circuit is responding specifically to how Zoe feels about me, and not just to how Zoe feels in general.
6.2 If the same circuit drives both compassion and “drive to feel liked / admired”, why aren’t they more tightly correlated across the population?
If the same innate circuit in the Steering Subsystem is upstream of both compassion and “drive to feel liked / admired”, then one might think that these two things should be yoked together. In other words, if that circuit’s output is generally strong in one person, then they should wind up with both drives being powerful influences on their behavior, and if it’s weak in another person, then they should wind up with neither drive being a powerful influence.
But in fact, in my everyday experience, these seem to be somewhat independent axes of variation, with some people apparently driven much more by one than the other. How does that work?
The answer is simple. If, in the course of life, the circuit often activates when the conspecific is thinking about me-in-particular, and rarely activates when they aren’t, then that would lead the circuit to mostly incentivize and generalize feeling liked / admired. And conversely, if the circuit rarely activates when the conspecific is thinking about me-in-particular, and often activates when they aren’t, then that would lead the circuit to mostly incentivize and generalize compassion.
As an example of the former, suppose Phoebe tends to react very weakly (low arousal, or perhaps not orienting at all) to seeing a person out of the corner of her eye, or to hearing someone’s voice in the distance as they talk to someone else, but Phoebe does reliably react to the more powerful stimuli of transient eye contact, or someone getting her attention to talk to her. Then Phoebe would wind up with a relatively strong drive to feel liked / admired relative to her compassion drive.[12]
As an example of the latter, let’s turn to autism. As I’ve discussed in Intense World Theory of Autism, autism involves many different suites of symptoms which don’t always go together (sensory sensitivity, “learning algorithm hyperparameters”, proneness to seizures, etc.). But a common social manifestation would be kinda the reverse of the above. Given their trigger-happy arousal system, they’ll respond robustly and frequently to things like noticing someone out of the corner of their eye, or hearing someone in the distance. But as for receiving eye contact, or someone deliberately trying to get their attention, they’ll find it so overwhelming that they’ll tend to avoid those situations in the first place,[13] or use other coping methods to limit their physiological arousal. So that’s my attempted explanation for why many autistic people have an especially weak “drive to feel liked / admired”, relative to their comparatively-more-typical levels of compassion and spite, if I understand correctly.
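The argument in this section can be illustrated with a toy calculation: take a log of circuit activations over some stretch of life, and ask what share of the total activation happened while the conspecific was thinking about me-in-particular. Everything below (the function name, the event encoding, and the specific numbers for the two profiles) is made up for illustration.

```python
from typing import Iterable, Tuple


def liked_admired_share(events: Iterable[Tuple[float, bool]]) -> float:
    """Share of total circuit activation that occurred while being thought of.

    Each event is (circuit_activation, conspecific_thinking_of_me).
    A high share suggests the circuit mostly trains 'feel liked / admired';
    a low share suggests it mostly trains compassion. Toy measure only.
    """
    events = list(events)
    total = sum(activation for activation, _ in events)
    if total == 0:
        return 0.0
    targeted = sum(activation for activation, about_me in events if about_me)
    return targeted / total


# Phoebe-like profile: weak reactions to people in the background,
# strong reactions to eye contact or being addressed.
phoebe_like = [(0.1, False)] * 10 + [(1.0, True)] * 5

# Reverse profile: strong, frequent reactions to background people,
# with direct-attention situations mostly avoided or damped.
reverse_profile = [(0.8, False)] * 10 + [(0.2, True)] * 1

print(liked_admired_share(phoebe_like), liked_admired_share(reverse_profile))
```

With these invented numbers, the first profile has most of its activation in the "thinking about me" condition and the second has very little, even though the underlying circuit is the same in both.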
6.3 Whose admiration do I crave?
I think it’s common sense that, in the “drive to feel liked / admired”, we’re driven to be liked / admired by some people much more than others. For example, think of a real person whom you greatly admire, more than almost anyone else, and imagine that they look you in the eye and say, “wow, I’m very impressed by you!” That would probably feel extremely exciting and motivating! Such events can be life-changing—see Mentorship, Management, and Mysterious Old Wizards. Next, imagine some random unimpressive person looks you in the eye and says the same thing. OK cool, maybe you’d be happy to receive the compliment. Or maybe not even that. It sure wouldn’t go down as a life-affirming memory to be treasured forever. More examples in footnote→[14]
I had previously written that, if Zoe likes / admires me, then that feels intrinsically motivating to the extent that I like / admire Zoe in turn. Whoops, I’ve changed my mind! Instead, I now think that it feels intrinsically motivating to the extent that interactions with Zoe seem important and high-stakes from my perspective, regardless of whether I like / admire her. (However, if I see her as “enemy” rather than “friend”, then that would have an impact.) For example, if Zoe is my boss whom I mildly like / admire, I think I would still react strongly to her approval. That’s what we get from the circuit above: the physiological arousal will respond to how high-stakes it feels for me to be interacting with Zoe, along with the various other factors (e.g. receiving eye contact automatically causes extra arousal). I think my new theory is a better fit to everyday experience, but you can judge for yourself and let me know what you think.
There’s an additional question of what’s upstream of that—i.e., what leads to some people inducing physiological arousal (i.e. being “attention-grabbing”, “intimidating”, “larger-than-life”, etc.) more than others? I think it’s complicated—lots of things go into that. Some come straight from arousal-inducing innate reactions. For example, I think we have an innate drive that induces arousal upon interacting with a tall person, just as many other animals have instincts to “size each other up”. The evolutionary logic is: Any interaction with a tall person is high-stakes because they could potentially beat us up. In other cases, the physiological arousal routes through within-lifetime learning. Is the person in a position to strongly impact my life?
Incidentally, if we compare my previous theory (that I’m driven to be liked / admired by Zoe in proportion to how much I like / admire Zoe in turn) to my current theory (that I’m driven to be liked / admired by Zoe in proportion to how much interactions with Zoe feel arousing, a.k.a. high-stakes), I think there’s some overlap in predictions, because there’s correlation between strongly liking / admiring Zoe, versus feeling like interactions with Zoe are high-stakes. I think the correlation comes from both directions. If I strongly like / admire Zoe, then as a consequence, my interactions with her can feel high-stakes. My liking / admiring her puts her in a position to impact my life. For example, if she spurns me, then I’ve lost access to something I enjoy; plus, I’ve implicitly given her the power to crush my self-esteem. In the other direction, if interactions with Zoe feel high-stakes, I think that can impact how much I like / admire Zoe, for various reasons, including the general valence-arousal interaction mentioned in §5.3.1.
7. Other examples of social instincts
I think the “compassion / spite circuit” above is an important piece of the puzzle of human social instincts. But there’s a whole lot more to social instincts beyond that! Really, I think there’s a bunch of interacting circuits and signals in the Steering Subsystem. How can we pin it down?
Experimentally, there’s a longstanding thread of work laboriously characterizing each of the hundreds of little neuron groups in the Steering Subsystem. More of that would obviously help. I mentioned at least one specific experiment above (§1.2). In parallel, perhaps we could try leapfrogging that process by measuring a complete connectome! My impression is that there are viable roadmaps to a full mouse connectome within years, not decades—much sooner than people seem to realize. Indeed, my guess is that getting a primate or even human connectome well before Artificial General Intelligence is totally a viable possibility, given appropriate philanthropic or other support. (See here.)
On the theory side, as we wait for that data, I think there’s still plenty of room for further careful armchair theorizing to come up with plausible hypotheses. A possible starting point for brainstorming is to look at the set of innate stereotyped (a.k.a. “consummatory”) behaviors towards conspecifics, to guess at some of the signals that might be internal to the Steering Subsystem. Doing that is a bit tricky for humans, since our behavioral repertoire comes disproportionately from learning and culture (excepting early childhood, I suppose). But for example, if a rodent sees another rodent, it might display:
(A) Aggressive behavior—e.g. threatening or attacking;
(B) Friendly, helpful behavior—e.g. grooming or snuggling;
(C) Submissive behavior—e.g. rolling on one’s back in response to a potential threat;
(D) Playful behavior—e.g. laughing or play-posture;
…But (C) seems to be an important ingredient missing in what I’ve said so far.
So that brings us to:
7.1 “Drive to feel feared” (a.k.a. “drive to receive submission”)
Dual strategies theory (see my own discussion at Social status part 2/2: everything else) says that people can have “high status” in two different ways: “prestige” and “dominance”. If the “drive to feel liked / admired” above is upstream of seeking prestige for its own sake, then the “drive to feel feared” would be correspondingly upstream of seeking dominance for its own sake.
The “drive to feel feared” could also be called “drive to receive submission”—i.e., a drive for others to display submissive behavior towards me, as in those rats rolling onto their backs. I’m not sure which of those two terms is better. I figure there’s probably some Steering Subsystem signal that’s upstream of both a tendency towards submissive behavior and a tendency towards fear and flight behavior, and it’s this upstream signal that flows into the circuit.
Evolutionarily, it makes perfect sense for there to be a “drive to feel feared”. If someone submits to me, then I’m dominant, and I get first dibs on food and mates without having to fight.
Neuroscientifically, I think the circuit for “drive to feel feared” could be parallel to the “compassion / spite circuit” above. More specifically, the first step is using Ingredient 4 to get to “Conspecific seems to be feeling fear / submission”:
And then we combine that with physiological arousal to get a motivational effect:
And as before, this would fire especially strongly under eye contact or other signals that the conspecific is thinking of you-in-particular:
(As drawn, the circuit might (mis)fire when I notice my friend submitting to a bully who is also simultaneously threatening me. I think that would be solvable by gating the circuit such that it doesn’t fire if I myself am also feeling fear / submission. Let me know if you think of other examples where this proposal doesn’t work.)
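A minimal Python sketch of the gated version described in the parenthetical above might look like this. The threshold value, the hard on/off gate, and the multiplicative use of arousal are assumptions for illustration, and `conspecific_fear` is taken to be the already-gated "conspecific seems to be feeling fear / submission" signal.

```python
def drive_to_feel_feared_valence(phasic_arousal: float,
                                 conspecific_fear: float,
                                 own_fear: float,
                                 own_fear_threshold: float = 0.5) -> float:
    """Valence from the hypothesized 'drive to feel feared' circuit.

    conspecific_fear: >= 0, how much the conspecific seems to be feeling
                      fear / submission.
    own_fear:         >= 0, how much fear / submission I feel myself.
    The circuit is switched off when I am also afraid, so that a shared
    threat (e.g. a bully) does not register as someone submitting to me.
    """
    if own_fear >= own_fear_threshold:
        return 0.0
    return phasic_arousal * conspecific_fear
```

In the bully example, both my friend and I are afraid, so `own_fear` is high and the circuit stays silent. When I am calm and the conspecific seems afraid, it fires in proportion to arousal.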
8. Conclusion
I feel like I have the big picture of a plausible nuts-and-bolts explanation of how the human brain solves the symbol grounding problem to implement social instincts. It might be wrong, and I’m happy for feedback.
Ingredients 1–4 constitute a kind of domain-specific language in which I think all of our social instincts are written. And then §5–§7 includes an attempt to build two specific social instincts out of the elements of that language, out of a much larger collection of social instincts yet to be sorted out. I figure that the things I wrote down, while a bit sketchy and incomplete, are probably capturing at least some aspects of compassion, spite, schadenfreude, “drive to feel liked / admired”, and “drive to feel feared”, and I think these collectively capture a lot of the human social world. (See also my post A theory of laughter for how laughter and play work.)
If you think this post is totally on the wrong track, then please let me know, by email or the comments section below. If it’s on the right track, then that’s great, but we still obviously have tons of work left to do to really pin down human social instincts, possibly in conjunction with experiments, as discussed in §7 above.
In case anyone’s wondering, I think my next project going forward will be to spend a while pondering the very biggest picture of brain-like AGI safety—everything from reward functions and training environments and testing, to governance and deployment and society, in light of (what I hope is) my newfound understanding of how human social instincts generally work. My confusion on that topic has been a big blocker to my thinking and progress previous times that I tried to do that. After that, I guess I’ll figure out where to go from there! Should be interesting.
Thanks Seth Herd and Simon Skade for critical comments on earlier drafts.
Thanks to regional specialization across the cortex (roughly corresponding to “neural network architecture” in ML lingo), there can be an a priori reason to believe that, for example, “pattern 387294” is a pattern in short-term auditory data whereas “pattern 579823” is a pattern in large-scale visual data, or whatever. But that’s not good enough. The symbol grounding problem for social instincts needs much more specific information than that. If Jun just told me that Xiu thinks I’m cute, then that’s a very different situation from if Jun just told me that Fang thinks I’m cute, leading to very different visceral reactions and drives. Yet those two possibilities are built from generally the same kinds of data.
Actually, this is an area where the evolutionary “design spec” can be pretty inscrutable. The (so-called) spider detector circuit, like any image classifier, triggers on all kinds of inputs, not all of which are spiders, including Bizarre Visual Input Type 74853 that has no relation to spiders and would occur on average once every 100 lifetimes in our ancestral environment. And maybe it just so happened that Bizarre Visual Input Type 74853 correlates with danger, such that noticing and recoiling from it was adaptive. Then that very fact would be part of the evolutionary pressure sculpting the (so-called) spider detector circuit, such that the term “spider detector circuit” is not a 100% perfect description of its evolutionary purpose.
My diagrams are drawn with the “supervisor” signal traveling from the Steering Subsystem to the short-term predictor, and then the subtraction step (“supervisor – output = error”) happening in the short-term predictor. But that’s just for illustration. I’m also open-minded to the possibility that the subtraction is performed in the Steering Subsystem, and that it’s the error signal that travels up to the short-term predictor. That’s more of a low-level implementation detail that I’m not too concerned with for the purpose of this post.
See my recent post Against empathy-by-default for a related discussion about how things go wrong if you just keep the learning rate turned on 100% of the time.
Details: Basically, I’m saying that, because physiological arousal is one of the interoceptive sensory inputs (related discussion), the Thought Generator self-supervised learning algorithm is already learning to predict imminent physiological arousal. So why do we also need a separate short-term predictor, nominally learning the same thing? My answer is: the Thought Generator algorithm is designed to build unlabeled latent variables that are useful for prediction, not to actually produce meaningful outputs, thanks to locally-random pattern separation. So the short-term predictor is also needed, to turn those unlabeled latent variables into a meaningful (“grounded”) output signal.
For purposes of this discussion, things like sense-of-pain, sense-of-temperature, and “affective touch” (c-tactile receptors) count as interoception, not exteroception, despite the fact that you can in fact learn about the outside world via those signals. After all, the skin is an organ, and sensing the health and status of your organs is an interoception thing. See How Do You Feel by Bud Craig (2020) for detailed physiological evidence—nerve types, pathways in the spine and brain, etc.—that this is the right classification.
Here and elsewhere, I’m using English-language emotion words to refer to Steering Subsystem signals, because I don’t know how else to refer to them. But be warned that there is never a perfect correspondence between brainstem signals and emotion words (as we actually use them in everyday life). For more discussion of that point, see Lisa Feldman Barrett versus Paul Ekman on facial expressions & basic emotions.
As a general rule, there are multiple ways to turn pseudocode into neuroscientifically-plausible circuits. For example, the gray box is an intermediate variable in this calculation. I’m drawing it explicitly because it makes it easier to follow. But it might not be a separate cell group in the hypothalamus. Or conversely, it could be two cell groups, one for “pleasure” and the other for “displeasure”, with mutual inhibition. Or something else, who knows.
In terms of the Ingredient 4 discussion, this would be the actual phasic arousal in our own bodies, which is impacted by the exteroception-sensitive short term predictors, but is not impacted by transient empathetic simulations of someone else’s phasic arousal.
I guess I’m predicting that people with constitutionally low arousal responses (extraverts, thrill-seekers, or in the most extreme case, sociopaths as explained here) will tend to have more status drive, relative to compassion drive. But I didn’t check that. It’s not a strong prediction—there are probably a bunch of other factors at play too.
Aversion to eye contact is common among autistic people. For example, John Elder Robison entitled his first memoir Look Me in the Eye, and discusses his aversion to eye contact in the prologue. And in the book excerpt I copied here, there are three quotes from autistic people about their experience of eye contact.
As an example, there’s an anecdote here of someone making a “feelgood” email folder for when she was feeling down, and most of the entries she mentions are basically compliments from people whom (I suspect) she sees as important and intimidating. As another example, my 9yo craves “impressing his parents” like a drug, and strives endlessly for us to laugh at his jokes, admire his knowledge and achievements, etc. But when we had regular visits with a 4yo who idolized him, he basically couldn’t care less.
What is this grand problem? As described in Intro to Brain-Like AGI Safety, I believe the following:
0.2 Summary of the rest of the post
I’ll start by going through the four algorithmic ingredients we need for my hypothesis, one by one, in each case describing what it is algorithmically, why it’s useful evolutionarily, and where in the brain we might go looking to find the specific neurons that are running this (alleged) algorithm.
Here’s the roadmap:
Then, I’ll go through an important (putative) example of social instincts built from these ingredients, which I call the “compassion / spite circuit”. This circuit leads to an innate drive to feel compassion towards people we like, and to feel spite and schadenfreude towards people we hate.
In an elegant twist, I claim that this very same “compassion / spite circuit” also leads to an innate “drive to feel liked / admired”—a drive that I hypothesized earlier and believe to be central to both status-seeking and norm-following. The trick in explaining how they’re related is:
Then I’ll go more briefly through some other possible social instincts, including a sketch of a possible “drive to feel feared” (whose existence I previously hypothesized here). For context, dual strategies theory talks about “prestige” and “dominance” as two forms of status; while the “drive to feel liked / admired” leads to prestige-seeking, the “drive to feel feared” correspondingly leads to dominance-seeking.
0.3 Confidence level
My confidence gradually decreases as you proceed through the article. The “Background” section above is rock-solid in my mind, as are Ingredients 1, 1A, and 2. Ingredients 3 and especially 4 are somewhat new to this post, but derive from ideas I’ve been playing around with for a year or two, and I feel pretty good about them. The specific putative examples of social instincts in §5–§7 are much more new and speculative, and are oversimplified at best. But I’m optimistic that they’re on the right track, and that they’re at least a “foot in the door” towards future refinements.
1. Ingredient 1: Innate sensory heuristics in the Steering Subsystem
The Steering Subsystem (brainstem and hypothalamus, more-or-less) takes sensory data, does innately-specified calculations on them, and uses the results to trigger innate reactions.
Think of things like seeing a slithering snake, or a skittering spider; smelling or tasting rotten food; male dogs smelling a female dog in heat; camouflaged animals recognizing the microenvironment where their bodies will blend in; and so on.
Note that these are all imperfect heuristics, anchored to innate circuitry, rather than developing along with our understanding of the world. We can call it a venomous-spider-detector circuit, for example, noting that it evolved because venomous spiders were dangerous to early humans.[4] But if we do that, then we acknowledge that it will have both false positives (e.g. centipedes, harmless spiders) and false negatives (funny-looking stationary venomous spiders), when compared to actual venomous spiders as we intelligently understand them. In vision especially, think of these heuristics as detecting relatively simple patterns of blobs and motion textures, as opposed to an “image classifier” / “video classifier” up to the standards of modern ML or human capabilities.
For more discussion of Ingredient 1, see §3.2.1 here.
1.1 Ingredient 1A: Innate sensory heuristics for conspecific detection in particular
As a special case of Ingredient 1, I claim that, in pretty much all animals, there is a set of sensory heuristics that are specifically designed by evolution to trigger on conspecifics. That would include one or more variations on: seeing a conspecific, hearing a conspecific, touching (or being touched by) a conspecific, smelling a conspecific, etc.
(I’m confident in this part because pretty much all animals have innate behaviors towards conspecifics that are different from their behaviors in other situations—mating, intermale aggression, parenting, being parented, herding, huddling, and so on.)
I claim that these all trigger a special Steering Subsystem flag that I call “thinking of a conspecific”:
1.2 Neuroscience details
Neuroscience details box
The sensory heuristics involve brainstem areas like the superior colliculus (for innate heuristic calculations on visual data), inferior colliculus (auditory data), gustatory nucleus of the medulla (taste data), and so on. (Again see §3.2.1 here.)
In the case of visual sensory heuristics, I’m actually not 100% confident that these calculations are located in the superior colliculus proper; for all I know, they’re partly or entirely in the neighboring parabigeminal nucleus, or whatever. There are papers on this topic, but they can’t always be taken at face value—see for example me complaining about methodologies used in the literature here and here.
For the “thinking of a conspecific” flag, it would be somewhere within the Steering Subsystem, but I don’t have any particular insight into exactly where. If I had to guess, I might guess that it’s one of the many little cell groups of the medial preoptic hypothalamus, since those often involve social interactions. If not that, then I’d guess it’s somewhere else in the medial hypothalamus, or (less likely) the lateral hypothalamus, or (less likely) some other part of the Steering Subsystem.
If you want to find the “thinking of a conspecific” flag experimentally, the conceptually-simplest method would be to first find one of the sensory heuristics for conspecific detection (e.g. the face detector), see what its efferent connections (downstream targets) are, and treat all those as top candidates to be studied one-by-one.
2. Ingredient 2: Generalization via short-term predictors
Ingredient 1 is a first step towards understanding, say, fear-of-spiders. But it’s not the whole story, because I don’t just get nervous when there is actually a large skittering spider in my field-of-view right now, but also when I imagine one, or when somebody tells me that there’s a spider behind me, etc. How does that work? The answer is: what I call the “short-term predictor”.
The “short-term predictor” is a learning algorithm that involves three ingredients: context, output, and supervisor. For definitions see this post; or in the ML supervised learning literature, you can substitute “context” = “trained model input”, “output” = “trained model output”, and “supervisor” = “label” (i.e., ground truth), which is subtracted from the trained model output to get an error that updates the model.[5] (In the actual brain, the transmitted signal might be the “error” rather than the “label”; I’m glossing over those kinds of implementation details.)
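To make the supervised-learning analogy concrete, here is a minimal sketch of a short-term predictor as a linear model trained by the delta rule. The class name, the two-element context encoding, and the learning rate are all assumptions made up for illustration; this is not a claim about how neurons implement it.

```python
class ShortTermPredictor:
    """Toy linear predictor: output = weights . context, trained by the delta rule."""

    def __init__(self, n_context, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = [0.0] * n_context

    def predict(self, context):
        # "output" in the post's terminology
        return sum(w * c for w, c in zip(self.weights, context))

    def update(self, context, supervisor):
        # The supervisor (ground truth) is subtracted from the output to get an error.
        error = self.predict(context) - supervisor
        self.weights = [
            w - self.learning_rate * error * c
            for w, c in zip(self.weights, context)
        ]
        return error


# Hypothetical context encoding: [spider-related thought, unrelated thought]
predictor = ShortTermPredictor(n_context=2)
for _ in range(200):
    predictor.update([1.0, 0.0], supervisor=1.0)  # spider in view: brainstem reaction fires
    predictor.update([0.0, 1.0], supervisor=0.0)  # unrelated thought: no reaction

# Later, the context alone (e.g. merely imagining a spider) drives the output.
imagined_spider = predictor.predict([1.0, 0.0])
unrelated_thought = predictor.predict([0.0, 1.0])
print(imagined_spider, unrelated_thought)
```

After training, the spider-related context produces an output close to the supervisor's value even when the supervisor itself is silent, which is the generalization this section relies on.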
The important points are that:
Thus, this kind of story explains the fact that I viscerally react to learning that there’s a spider in my vicinity that I can’t immediately see or feel.
If we take the brainstem reaction and the short-term predictor together, they can function as what I call a long-term predictor; again, see here.
By the same token, the “thinking of a conspecific” flag can trigger when I’m, well, thinking of a conspecific, even if the conspecific is not standing right there, triggering my brainstem sensory heuristics right now.
2.1 Neuroscience details
Neuroscience details box
I think the short-term predictors that I’ll be talking about in this post are mostly centered around small clusters of medium spiny neurons somewhere in the amygdala, or the lateral septum, or the medial part of the nucleus accumbens shell. (I haven’t tried to pin them down in more detail than that. See §5.5.4 here for some more general neuroscience discussion of this topic.)
However, in some cases pyramidal neurons can play this short-term predictor role as well, such as in the cortex-like (basolateral) section of the amygdala, along with certain parts of cortex layer 5PT.
The supervisory signal (either ground truth or an error signal, I’m not sure) probably makes an intermediate stop (“relay”) at some little cluster of neurons on the fringes of the Ventral Tegmental Area (VTA), not shown in the diagram above, in which case the supervisory signal would ultimately arrive at the spiny neuron in the form of a dopamine signal. I think. (But there are also VTA GABA neurons that seem somehow related to these particular short-term predictors. I haven’t tried to make sense of that in detail.)
3. Ingredient 3: Tailoring learned models via involuntary attention and learning rate
In this section I’ll just go through a simple example of the orienting reflex upon seeing a spider; then, in Ingredient 4 below, we’ll see how this applies to social instincts and feelings.
3.1 What does the orienting reflex do?
When the seeing-a-spider brainstem sensory heuristic triggers, I claim that one thing it does is trigger an “orienting reflex”. Part of that reflex involves moving the eyes, head, and body towards whatever triggered the heuristic. And another part of it involves involuntary attention towards the visual inputs in general, and the corresponding part of the field of view in particular.
The involuntary attention plays an important role in constraining what “thought” the cortex is thinking. If you’re daydreaming, imagining, remembering, etc., then your current “thought” has very little to do with current visual inputs. By contrast, involuntary attention towards vision forms a constraint that the thought must be “about” the visual inputs. It’s not completely constraining—the same thought can also contextualize those visual inputs by roping in presumed upstream causes, or expected consequences, or other associations, etc. But the visual inputs have to be a central part of the thought. In other words, you’re not only pointing your eyes at the spider, but you’re also actually thinking about the spider with your cortex (“global workspace”).
To be more specific about what’s going on, we need to be thinking about large-scale patterns of information flow within the cortex, as in the following toy example:
When you’re using visual imagination, your consciously-accessible visual areas of the cortex (e.g. the inferior temporal gyrus (IT)) are, in essence, disconnected from the immediate visual input. You can imagine Taylor Swift’s new dress while looking at a swamp. By contrast, when you’re paying attention to what you’re looking at, then there’s a consistency requirement: the visual models (i.e., generative models of visual data) in IT have to be consistent with the immediate visual input from your retina.
And my claim is that the Steering Subsystem has some control over this kind of large-scale information flow among different parts of the cortex, via its “involuntary attention”.
3.1.1 Side note: Transient attentional gaps are more common, and harder to notice, than you realize
You might be wondering: Is it really true that, if I’m imagining Taylor Swift’s new dress, then my awareness is detached from immediate visual input? Don’t we continue to be aware of visual input even while imagining something else?
A few responses:
First, your cortex has lots of vision-related areas, and it’s possible for some visual areas to be yoked to immediate visual input while other visual areas are simultaneously yoked to episodic memory. I think this definitely happens to some extent.
Second, your attention can jump around between different things rather quickly, such that most people imagine themselves to have far more complete and continuous visual awareness than they actually do—see things like change blindness, or the selective attention test, or the fact that peripheral vision has terrible resolution and terrible color perception and makes faces look creepy.
Third, the cortex tracks time-extended models, and accordingly has a general ability to pull up activation history from slightly (e.g. half a second) earlier, anywhere in the cortex. That makes it very hard to introspect upon exactly what you were or weren’t thinking at any given moment. For a much more detailed discussion of this point, with an example, see here.
This is a general lesson, going beyond just vision: transient (fraction-of-a-second) attentional gaps and shifts are hard to notice, both as they happen and in hindsight. Don’t unthinkingly trust your intuitions on that topic. (I’ll be centrally relying on these transient attentional shifts in this post, so it’s important that you are thinking about them clearly.)
3.2 Combining attention with time-variable learning rates
The Steering Subsystem gets an additional lever of control over brain learning algorithms by combining that kind of large-scale information flow control with time-variable learning rates, as follows.
Let’s start with learning in the world model / Thought Generator ≈ cortex. Above I was talking about the “space of visual models” which are learned from scratch in IT. Like everything in the world-model (details), this space is learned by predictive (a.k.a. self-supervised) learning. But it’s learned more specifically when we’re paying attention to visual input. The models thus get sculpted to reflect the structure of the actual visual world.
Separately, we can query those existing models for the purpose of memory recall and visual imagination. But when we do, I claim that the learning rate is zero (or at most, almost-zero).
Moving on to the parallel case of learning in the Thought Assessors / short-term predictors ≈ striatum and amygdala: the genome can likewise leverage large-scale information flows to get some control over what the short-term predictors learn.
As a toy example, let’s take the diagram above, but add in a short-term predictor. And just as for the cortex case above, we’ll set the short-term predictor learning rate to zero unless we’re paying attention to visual input. Here’s a diagram:
Thanks to this learning rate modulation, this short-term predictor is trained specifically to maximize its predictive accuracy in situations where we’re paying attention to visual input. When we’re visually imagining or remembering something, by contrast, the short-term predictor will continue to be queried, but it won’t be updated.
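Here is a toy sketch of that learning-rate gate, comparing a predictor that updates on every moment with one that updates only while attention is on visual input. The thought labels, the episode list, and the learning rate are hypothetical; the update rule is a plain delta rule chosen for simplicity.

```python
def train(episodes, gate_on_attention, learning_rate=0.2):
    """Train one weight per thought label; optionally zero the learning rate
    whenever attention is not on visual input."""
    weights = {}
    for active_thoughts, attending_to_vision, arousal in episodes:
        prediction = sum(weights.get(t, 0.0) for t in active_thoughts)
        effective_rate = learning_rate
        if gate_on_attention and not attending_to_vision:
            effective_rate = 0.0  # predictor is still queried, but not updated
        error = prediction - arousal
        for t in active_thoughts:
            weights[t] = weights.get(t, 0.0) - effective_rate * error
    return weights


# Each episode: (active thoughts, attending to vision?, brainstem arousal signal)
episodes = [
    (("daydream",), False, 1.0),     # spider detected while still daydreaming
    (("spider_shape",), True, 1.0),  # after orienting: thinking about the spider
] * 50

gated = train(episodes, gate_on_attention=True)
ungated = train(episodes, gate_on_attention=False)
print("gated:", gated)
print("ungated:", ungated)
```

With the gate, only the spider-related thought acquires an association with arousal. Without it, the daydream does too, which is the useless association that the next paragraph describes.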
What’s the advantage of this setup? Well, imagine my cortex is daydreaming about Taylor Swift, and then my brainstem notices a spider in the corner of my field-of-view. Without the involuntary attention, the learning algorithm update would associate daydreaming-about-Taylor-Swift with the seeing-a-spider reaction (physiological arousal, aversiveness, etc.), which is not a useful thing for me to learn. The involuntary attention can solve that problem: first the involuntary attention kicks the Taylor Swift daydream out of my brain, and ensures that I’m thinking about the spider instead; and second the short-term predictor learning algorithm records those new thinking-about-the-spider thoughts, and fires into its output line whenever similar thoughts recur in the future. Thus I’ll wind up feeling physiological arousal related to the shape and motion of a spider, spiderwebs, centipedes, that corner in the basement, etc., which makes a lot more sense (ecologically) than feeling physiological arousal related to Taylor Swift.
(Well, that’s a bad example. It is entirely ecologically appropriate to feel physiological arousal related to Taylor Swift! But that’s for other reasons!)
3.3 Neuroscience details
Neuroscience details box
For involuntary attention: There are probably multiple pathways working in conjunction. Probably cholinergic and/or adrenergic neurons are involved. More specifically, cholinergic projections to the cortex are probably part of this story, and so are the cholinergic projections to thalamic relay cells. I don’t know the details.
For adjusting learning rate: There are a bunch of ways this could work. If there’s an error signal coming from the Steering Subsystem (hypothalamus or brainstem) to a short-term predictor, it could be set to zero, and then there’s no learning. Or maybe there’s a separate signal for learning rate (maybe acetylcholine again?) coming from the Steering Subsystem, which could be turned off instead. There could also be some more indirect effect of lack-of-attention on the cortex side—like maybe the cortex representations are less active when they’re further removed from sensory input, and that indirectly reduces learning rate, or something. I don’t know.
4. Ingredient 4: Reading out transient empathetic simulations
If we apply the same kind of reasoning as above, it suggests a path to solving the symbol-grounding problem for somebody else’s feelings. A key ingredient we need is “involuntary LACK of attention towards interoceptive inputs”, triggered by the “thinking of a conspecific” flag of Ingredient 1A, as shown on the right side of this diagram:
What is this “lack of attention” supposed to accomplish? Here’s a schematic diagram illustrating the flows of information / attention / constraints in a normal situation (left) and in a situation where one of the Ingredient 1A conspecific detection heuristics has just fired (right):
The involuntary lack of attention transiently disconnects the interoceptive models from what I’m feeling right now. Instead, the space of interoceptive models in the cortex will settle into whatever is most consistent with what’s happening in the visual, semantic, and other areas of the cortex (a.k.a. “global workspace”). And thanks to the orienting reflex, those other areas of the cortex are modeling Zoe.
And therefore, if any interoceptive models are active, they’re ones that have some semantic association with Zoe. Or more simply: they’re how Zoe feels (or more precisely, how Zoe seems to feel, from my perspective).
This is progress! But there’s still some more work to do.
Next, let’s put in a couple short-term predictors (Ingredient 2), and think about learning rates (Ingredient 3):
Here, I show two different short-term predictors for the same ground truth (namely, physiological arousal). However, the contexts and learning rates are different, and hence their behaviors are correspondingly different as well.
The short-term predictor on the left uses (let’s say) visual models as context, and its learning rate is nonzero iff I’m paying attention to immediate visual inputs. As it turns out, Zoe is my tyrannical boss, who loves to exercise arbitrary power over me, and thus our conversations are often stressful. This left predictor will pick up on that pattern, and preemptively suggest physiological arousal whenever I notice that Zoe might be coming to talk to me.
Meanwhile, the short-term predictor on the right uses interoceptive models as context, and its learning rate is nonzero iff I’m paying attention to my own interoceptive inputs.[6] This short-term predictor will wind up learning things that seem pretty stupidly trivial, e.g. “the conscious feeling of arousal (in the Thought Generator a.k.a. world-model) predicts actual arousal (in the Steering Subsystem)”; but it still needs to be there for technical reasons.[7] Anyway, this output will not respond to the fact that conversations with Zoe tend to be stressful for me. But if Zoe herself seems stressed, the output will reflect that.
Thus, when things are set up properly, the Steering Subsystem can simultaneously get information about both how a situation feels to us and how other people seem to be feeling.
(I showed the example of physiological arousal, but the same logic applies to “being happy”, “being angry”, “being in pain”, etc.)
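The two-predictor setup can be sketched as a toy model: two predictors share the same ground truth (arousal) but differ in context and in when their learning rate is nonzero. All labels ("zoe", "stress", "calm", "stranger"), the training schedule, and the delta-rule update are assumptions for illustration only.

```python
class GatedPredictor:
    """Delta-rule predictor whose learning rate is zero unless its gate is open."""

    def __init__(self, learning_rate=0.2):
        self.learning_rate = learning_rate
        self.weights = {}

    def predict(self, context):
        return sum(self.weights.get(c, 0.0) for c in context)

    def update(self, context, supervisor, gate_open):
        if not gate_open:
            return  # queried but not trained
        error = self.predict(context) - supervisor
        for c in context:
            self.weights[c] = self.weights.get(c, 0.0) - self.learning_rate * error


left = GatedPredictor()   # context: visual models; gate: attention to vision
right = GatedPredictor()  # context: interoceptive models; gate: attention to interoception

training = [
    # Talking with Zoe, attending to vision: my actual arousal is high.
    dict(visual=("zoe",), intero=(), attend_vision=True, attend_intero=False, arousal=1.0),
    # Looking at a stranger: my actual arousal is low.
    dict(visual=("stranger",), intero=(), attend_vision=True, attend_intero=False, arousal=0.0),
    # Attending to my own body while stressed: arousal is high.
    dict(visual=(), intero=("stress",), attend_vision=False, attend_intero=True, arousal=1.0),
    # Attending to my own body while calm: arousal is low.
    dict(visual=(), intero=("calm",), attend_vision=False, attend_intero=True, arousal=0.0),
] * 50

for m in training:
    left.update(m["visual"], m["arousal"], gate_open=m["attend_vision"])
    right.update(m["intero"], m["arousal"], gate_open=m["attend_intero"])

# Query 1: a calm-looking Zoe approaches. Interoceptive models are detached from
# my own body and settle on how Zoe seems to feel ("calm").
my_reaction_to_zoe = left.predict(("zoe",))
zoe_seems_to_feel = right.predict(("calm",))

# Query 2: a stranger who looks stressed.
my_reaction_to_stranger = left.predict(("stranger",))
stranger_seems_to_feel = right.predict(("stress",))
print(my_reaction_to_zoe, zoe_seems_to_feel, my_reaction_to_stranger, stranger_seems_to_feel)
```

The left output tracks how the situation feels to me, while the right output tracks how the other person seems to feel, even though the right predictor was only ever trained on my own interoceptive states.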
4.1 So, the “thinking of a conspecific” flag is also a “this is an empathetic simulation” flag?
Well, kinda. But with some caveats.
The sense in which this is true is: both the interoceptive model space and the associated short-term predictors are trained in a circumstance where they relate exclusively to my own interoceptive inputs, but then they’re sometimes queried in a circumstance where they relate to someone else’s interoceptive inputs.
But in other senses, calling it an “empathetic simulation” flag might be a bit misleading.
First, it would be a transient empathetic simulation, lasting a fraction of a second, which is rather different from how we normally use the term “empathy”—more on that here.
Arguably, even “transient empathetic simulation” is an overstatement: it’s just some learned association between what I’m seeing and some feeling-related concept. The concept of Zoe seems to somehow imply the concept of stress, within my world-model. That’s all. I don’t really need to be “taking her perspective”, nor to be feeling Zoe’s simulated stress in Zoe’s simulated loins, or whatever.
Second, this flag is exclusively related to empathetic simulations of what someone is feeling[8], not empathetic simulations of what they’re thinking, seeing, etc. For example, if I’m curious whether Zoe can see the moon from where she’s standing, then I would do a quick empathetic simulation of what Zoe is seeing. The “thinking of a conspecific” flag is not particularly related to that; indeed, if anything, this flag is probably anticorrelated with that, since the flag is trained only in situations where orienting reflexes are pulling attention to our own exteroceptive sensory inputs.
Thus, my framework implies that social instincts can only involve reacting to someone’s (assumed) feelings. They cannot (directly) involve reacting to what someone is seeing, or thinking, etc. I think that claim rings true to everyday experience.
And there’s actually a deeper reason to believe that claim. If I take Zoe’s visual perspective and imagine that she’s looking at a saxophone, then my Steering Subsystem can’t do anything with that information. The Steering Subsystem doesn’t understand saxophones, or anything else about our big complicated world. But it does know the “meaning” of its suite of innate physiological state variables and signals: physiological arousal, body temperature, goosebumps, and so on. See my discussion of “the interface problem” here.
Third, even among the set of short-term predictors related to “feelings”, only some of them are set up such that they will output a transient empathetic simulation. See the toy example above with two different short-term predictors for physiological arousal, one of which conveys empathetic simulations and the other of which does not.
4.2 Neuroscience details
Neuroscience details box
Involuntary lack-of-attention signal: Well, absence-of-attention might just involve suppressing presence-of-attention pathways, like the ones I mentioned under Ingredient 3 above (possibly involving acetylcholine). Or it might be a different system that pushes in the opposite direction—maybe involving serotonin? Or (more likely) multiple complementary signals that work in different ways. I don’t have any strong opinions here.
Two short-term predictors for the same thing: I drew a diagram above with two different short-term predictors of physiological arousal. While that diagram was oversimplified in various ways, I do think it’s true that there are (at least) two different short-term predictors of physiological arousal, one using exteroception-related signals as context, the other using interoception-related signals as context, with the latter capturing empathetic simulations (among its other roles). My guess is that the former is in the amygdala and the latter is somewhere in the medial prefrontal or cingulate cortex. (Clarification for the latter: I think most of the short-term predictors are medium spiny neurons in the “extended striatum”, and have been labeling my diagrams accordingly. But as I mentioned in §2.1 above, I do think there are places where pyramidal neurons play a short-term predictor role too, including in layer 5PT of certain parts of the cortex.)
5. Hypothesis: a “compassion / spite circuit”
Everything so far was preliminaries—now we can start speculating about real social instincts! My main example is a possible innate drive circuit that would be upstream of compassion and spite. Start with another Steering Subsystem signal:
5.1 The “Conspecific seems to be feeling (dis)pleasure” signal
The first step is to get a “conspecific seems to be feeling pleasure / displeasure”[9] signal in the Steering Subsystem, as follows:
The purple box is yet another Steering Subsystem signal that I’m labeling “pleasure / displeasure”. This is closely related to valence—for details see here. Then the gray box would be an intermediate variable[10] in the Steering Subsystem which would, by design, track the extent to which I think of the conspecific as feeling pleased / displeased.
All we need to get that gray box, beyond what we’ve already covered, is a gate: If the thinking-of-a-conspecific flag is on, AND there’s a short-term predictor output consistent with (dis)pleasure, then that means I’m thinking about a conspecific who is currently feeling (dis)pleasure.
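As a minimal sketch, the gate can be written as a single function. The function name and the convention that the predictor output lies in [-1, 1] (positive for pleasure, negative for displeasure) are assumptions for illustration.

```python
def conspecific_pleasure_signal(thinking_of_conspecific, predictor_output):
    """Gray-box signal: nonzero only while the thinking-of-a-conspecific flag is on.

    predictor_output: short-term predictor output, assumed to lie in [-1, 1],
    with positive values meaning pleasure and negative values meaning displeasure.
    """
    if not thinking_of_conspecific:
        return 0.0
    return predictor_output


print(conspecific_pleasure_signal(True, 0.7))    # conspecific seems pleased
print(conspecific_pleasure_signal(True, -0.4))   # conspecific seems displeased
print(conspecific_pleasure_signal(False, 0.7))   # flag off: the pleasure is my own, gate stays closed
```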
This step is built on the kind of “transient empathetic simulation” that I’ve discussed previously and in §4.1 above: the short-term predictor on the right is trained by supervised learning on instances of myself feeling (dis)pleasure, but now at this particular moment it’s being triggered by thinking about someone else feeling (dis)pleasure.
That was just the start. Next, how do we build a social instinct out of the gray “conspecific seems to be feeling pleasure / displeasure” box? We need another Steering Subsystem parameter!
5.2 The “friend (+) vs enemy (–)” parameter
Here I introduce another Steering Subsystem parameter called “friend (+) vs enemy (–)”. When this parameter is extremely negative, it indicates that whatever you’re thinking about (in this case, the conspecific) should be physically attacked, right now. If it’s mildly negative, then you probably won’t go that far, but you’ll still feel like they’re the enemy and you hate them. If it’s positive, you’ll feel “on the same team” as them.
Anyway, when the “friend (+) vs enemy (–)” parameter is positive, then “conspecific seems to be feeling pleasure / displeasure” causes positive / negative valence respectively. This innate drive would lead to compassion—we feel intrinsically motivated by the idea that the conspecific is feeling pleasure, and intrinsically demotivated by the idea that the conspecific is feeling displeasure.
…And if the “friend (+) vs enemy (–)” parameter is negative, we flip the sign: “conspecific seems to be feeling pleasure / displeasure” causes negative / positive valence respectively. This innate drive would lead to both spite and schadenfreude.
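One simple way to get this sign-flipping behavior is to multiply the two signals. The product form is an assumption chosen for brevity (the text only specifies the signs), and the numeric ranges are hypothetical.

```python
def compassion_spite_valence(friend_vs_enemy, conspecific_pleasure):
    """Valence contributed by the circuit.

    friend_vs_enemy: assumed in [-1, 1]; positive = friend, negative = enemy.
    conspecific_pleasure: assumed in [-1, 1]; positive = they seem pleased.
    """
    return friend_vs_enemy * conspecific_pleasure


compassion_joy = compassion_spite_valence(0.8, 0.5)     # friend is pleased: good
compassion_pain = compassion_spite_valence(0.8, -0.5)   # friend is suffering: bad
spite = compassion_spite_valence(-0.8, 0.5)             # enemy is pleased: bad
schadenfreude = compassion_spite_valence(-0.8, -0.5)    # enemy is suffering: good
print(compassion_joy, compassion_pain, spite, schadenfreude)
```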
How is the “friend (+) vs enemy (–)” parameter itself calculated? By other social instincts outside the scope of this post—more on that in §7 below. Perhaps part of it is a different circuit that says: if thinking about a conspecific co-occurs with positive valence (i.e., if we like / admire them), then that probably shifts the friend/enemy parameter a bit more towards friend, and perhaps also conversely with negative valence. That’s not circular, because conspecifics can acquire positive or negative valence for all kinds of reasons, just like sweaters or computers or anything else can acquire positive or negative valence for all kinds of reasons, including non-social dynamics like if I’m hungry and the conspecific gives me yummy food. That’s a robust and flexible system that will leverage my rich understanding of the world to systematically assign “friend” status to conspecifics who lead to good things happening for me. That’s probably just one factor among many; I imagine that there are lots of innate circuits that can impact friend / enemy status in various circumstances. Of course, as usual, the friend / enemy parameter would be attached to one or more short-term predictors, enabling memory, generalization, and perhaps also transient empathetic simulations.
5.2.1 Evolution and zoological context
Evolutionary and zoological context box
Pretty much every complex social animal has innate, stereotyped behaviors for both helping and hurting conspecifics in different circumstances—e.g. attack behaviors, and companionship-type behaviors such as within families.
And evolutionarily, if it makes sense to help or hurt conspecifics through innate, stereotyped behaviors, then presumably it also makes sense to help or hurt conspecifics through the more powerful and flexible pathways that leverage within-lifetime learning, as would happen through a “compassion / spite circuit”. (See (Appetitive, Consummatory) ≈ (RL, reflex).)
Indeed, even in rodents, I think there’s clear evidence of more flexible, goal-oriented behaviors to (selectively) help conspecifics. For example, Márquez et al. 2015 finds that rats help conspecifics via choice of arm in a T-shaped maze. And Bartal et al. 2014 finds that rats release conspecifics from restraints, but only in situations where they feel friendly towards the conspecific. (See also: Kettler et al. 2021.) I don’t think either of these needs to be explained with my proposed “compassion / spite circuit” above involving transient empathetic simulation; for example, maybe rats squeak in a certain way when they’re happy, and hearing another rat make a happy squeak triggers a primary reward, or whatever. But anyway, as far as I can tell at a glance, the “compassion / spite circuit” is at least plausibly present even in rodents.
…Or maybe it’s just a “compassion” circuit for rodents. I can’t immediately find any evidence either way on whether rats display flexible, goal-oriented spite-type behavior towards other rats they hate. (They undoubtedly have inflexible, stereotyped, threat and attack postures and behaviors, but that’s different—again see (Appetitive, Consummatory) ≈ (RL, reflex).) Let me know if you’ve seen otherwise!
5.2.2 Neuroscience details
Neuroscience details box
I expect that friend-vs-enemy is two groups of neurons that are mutually inhibitory, as opposed to one that swings positive and negative compared to baseline. That’s how the hypothalamus handles hungry-vs-full, for example (see here). As for where those neuron groups are, I don’t know. Probably medial hypothalamus somewhere.
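For intuition about the mutual-inhibition arrangement, here is a toy rate model of two populations that suppress each other. The rectified-linear units, inhibition strength, input values, and Euler integration step are all assumptions; this sketches the general winner-take-all motif, not a model of any specific hypothalamic circuit.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def settle(friend_input, enemy_input, inhibition=2.0, dt=0.1, steps=200):
    """Integrate two mutually inhibitory rate units from rest and return final rates."""
    friend, enemy = 0.0, 0.0
    for _ in range(steps):
        # Each population is driven by its own input and suppressed by the other.
        friend_target = relu(friend_input - inhibition * enemy)
        enemy_target = relu(enemy_input - inhibition * friend)
        friend += dt * (friend_target - friend)
        enemy += dt * (enemy_target - enemy)
    return friend, enemy


friend_rate, enemy_rate = settle(friend_input=1.0, enemy_input=0.8)
print(friend_rate, enemy_rate)
```

With these parameters, the population that receives the slightly stronger input ends up active while the other is silenced, so the pair behaves like a single signed friend-vs-enemy variable.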
5.3 Phasic physiological arousal
“Phasic” means that physiological arousal jumps up for a fraction of a second, in synchronization with noticing something, thinking a certain thought, etc. The opposite of “phasic” is “tonic”, like how I can have generally high arousal (alertness, excitement) in the morning and generally low arousal in the afternoon.
Now, one thing that my compassion / spite circuit above is missing is a notion that some interactions can feel more important / high-stakes to me than others. I think this is a separate axis of variation from the friend / enemy axis. For example, my neighbor and my boss are both solidly on the “friend” side of my friend / enemy spectrum—I feel “warmly” towards both, or something—but interactions with my boss feel much higher stakes, and correspondingly I react more strongly to their perceived feelings. So let’s refine the circuit above to fix that:
Basically, when I orient to a conspecific, then recognize them, the associated phasic arousal[11] tracks how important (high-stakes) this interaction with the conspecific is, from my perspective. Then we use that to scale up or down the compassion / spite response.
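A minimal sketch of the refined circuit treats phasic arousal as a gain on the compassion / spite response. The multiplicative form and all the numbers below are assumptions for illustration.

```python
def scaled_compassion_spite(friend_vs_enemy, conspecific_pleasure, phasic_arousal):
    """Compassion / spite response scaled by how high-stakes the interaction feels.

    phasic_arousal: assumed in [0, 1]; higher means the interaction feels more important.
    """
    return phasic_arousal * friend_vs_enemy * conspecific_pleasure


# Both are "friends" and both seem equally displeased; only the stakes differ.
neighbor_response = scaled_compassion_spite(0.6, -0.5, phasic_arousal=0.2)
boss_response = scaled_compassion_spite(0.6, -0.5, phasic_arousal=0.9)
print(neighbor_response, boss_response)
```

Both responses are negative (a friend seems displeased), but the boss's displeasure weighs more heavily because the interaction carries more phasic arousal.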
5.3.1 Neuroscience details
Neuroscience details box
I think the locus coeruleus, a tiny group of 30,000 neurons (in humans), is the high-level arousal-controller in your brain, and its activity can vary over short timescales (up and down within half a second; there’s a plot here). If you measure pupil dilation, then maybe you’ll miss some of the very fastest dynamics, but you will see the variation on a ≈1-second timescale. If you measure skin conductance, that’s slower still.
I’m generally assuming in this post that “arousal” is a scalar. That’s probably something of an oversimplification (see Poe et al. 2020) but good enough for present purposes.
I’ve been talking as if the role of phasic arousal is specific to the “compassion / spite circuit”, but a more elegant possibility is that it’s a special case of a very general interaction between arousal and valence, such that arousal makes all good things seem better, and makes all bad things seem worse, other things equal. After all, arousal is saying that a situation is high-stakes. So that kind of general dynamic seems evolutionarily plausible to me.
(For the record, I think the general interaction between arousal and valence is not just multiplicative. I think there’s also a thing that we call “being overwhelmed”, where sufficiently high arousal can cause negative valence all by itself. Basically, in a very high-stakes situation, the Steering Subsystem wants to say that things are either very good or very bad, and in the absence of positive evidence that things are very good, it treats “very bad” as a default.)
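One hedged way to write down that non-multiplicative interaction is a multiplicative term plus an "overwhelm" penalty that switches on above some arousal threshold. The threshold, the gain, and the additive form are made-up assumptions, meant only to show the shape of the idea.

```python
def valence_with_overwhelm(base_valence, arousal,
                           overwhelm_threshold=0.8, overwhelm_gain=2.0):
    """Arousal amplifies existing valence, and very high arousal is aversive by itself."""
    multiplicative = arousal * base_valence
    overwhelm_penalty = overwhelm_gain * max(0.0, arousal - overwhelm_threshold)
    return multiplicative - overwhelm_penalty


moderate_neutral = valence_with_overwhelm(base_valence=0.0, arousal=0.5)
extreme_neutral = valence_with_overwhelm(base_valence=0.0, arousal=1.0)
extreme_good = valence_with_overwhelm(base_valence=1.0, arousal=1.0)
print(moderate_neutral, extreme_neutral, extreme_good)
```

In this sketch, a neutral situation at extreme arousal comes out negative by default, while strong positive evidence can still outweigh the penalty.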
5.4 Generalization via short-term predictors
As usual, Steering Subsystem flags can serve as ground-truth supervision for short-term predictors, which supports generalization. Thanks to “defer-to-predictor mode” (see here), we wind up with Steering Subsystem social instincts activating in situations where nobody is in the room with me right now, but nevertheless I find myself intrinsically motivated by the idea of Zoe feeling good in general, and/or Zoe feeling good about me in particular.
6. The “compassion / spite circuit” also causes a “drive to feel liked / admired”
Let’s talk about the social instinct that I call “drive to feel liked / admired”—i.e., an innate drive that makes it so that, if I think highly of person X, then it’s inherently motivating to believe that person X thinks highly of me too. To make this work, one might think that we need another ingredient. It’s not enough for the Steering Subsystem to have strong evidence that my conspecific is feeling pleasure or displeasure, as above. The Steering Subsystem has to get strong evidence that my conspecific is feeling pleasure or displeasure in regards to me in particular. Where could such evidence come from?
Remarkably, my answer is: we already got it! We don’t need any other ingredients. It’s just an emergent consequence of the same circuit above!! Let me explain why:
6.1 Key idea: My “compassion / spite circuit” is disproportionately active and important while the conspecific is thinking about me-in-particular
6.1.1 Starting example: Innate sensory heuristics for receiving eye contact
I think there’s a “I’m receiving eye contact” detector in the human brainstem, just like the other conspecific-detection sensory heuristics of Ingredient 1A.
But if you think about it, the “I’m receiving eye contact” detector has a special property, one that the other Ingredient 1A heuristics lack. Consider: if you’re hearing a conspecific, or noticing their gait, etc., then the conspecific might not even know you exist. By contrast, if a conspecific is giving you eye contact, then their brainstem is activating its “thinking of a conspecific” flag, in regards to you.
Here’s a diagram illustrating this:
As Zoe makes (perhaps brief) eye contact with me, both my and Zoe’s Steering Subsystems are shown. My big idea is marked in red—Zoe is reliably thinking about me at the very moment when I’m sensitive to how Zoe seems to be feeling. So if the circuit frequently triggers this way, then I’ll wind up motivated not so much towards Zoe feeling good in general, but towards Zoe liking / admiring me.
6.1.2 Generalization: Innate sensory heuristics fire strongly upon being the target of an orienting reflex
“Receiving eye contact” is a special case of “I’m the target of an orienting reflex”. And I think that other Ingredient 1A heuristics fit into that mold too. For example, my human-face-detection heuristic fires if someone turns to face me. That has directionally the same effect as eye contact, but it doesn’t require eye contact per se—it also fires if the person has sunglasses. And it also supports the “drive to feel liked / admired”, for the same reason as above.
(Ecologically, we expect a long and robust history of “I’m the target of an orienting reflex” brainstem heuristic detectors. For example, if I’m a mouse, and a fox performs an orienting reflex towards me, then I’d better switch from hiding to running.)
6.1.3 Another example: Somebody deliberately getting my attention
Suppose Zoe walks up to me and says “hey”. That still gets my attention—and being a human voice, it triggers the corresponding Ingredient 1A heuristic, and thus the “thinking of a conspecific” flag. But it has the same special property as eye contact above: at the very moment when it gets my attention, Zoe is reliably thinking about me-in-particular.
So the same logic as above holds: the circuit is responding specifically to how Zoe feels about me, and not just to how Zoe feels in general.
6.2 If the same circuit drives both compassion and “drive to feel liked / admired”, why aren’t they more tightly correlated across the population?
If the same innate circuit in the Steering Subsystem is upstream of both compassion and “drive to feel liked / admired”, then one might think that these two things should be yoked together. In other words, if that circuit’s output is generally strong in one person, then they should wind up with both drives being powerful influences on their behavior, and if it’s weak in another person, then they should wind up with neither drive being a powerful influence.
But in fact, in my everyday experience, these seem to be somewhat independent axes of variation, with some people apparently driven much more by one than the other. How does that work?
The answer is simple. If, in the course of life, the circuit often activates when the conspecific is thinking about me-in-particular, and rarely activates when they aren’t, then that would lead the circuit to mostly incentivize and generalize feeling liked / admired. And conversely, if the circuit rarely activates when the conspecific is thinking about me-in-particular, and often activates when they aren’t, then that would lead the circuit to mostly incentivize and generalize compassion.
As an example of the former, suppose Phoebe tends to react very weakly (low arousal, or perhaps not orienting at all) to seeing a person out of the corner of her eye, or to hearing someone’s voice in the distance as they talk to someone else, but Phoebe does reliably react to the more powerful stimuli of transient eye contact, or someone getting her attention to talk to her. Then Phoebe would wind up with a relatively strong drive to feel liked / admired relative to her compassion drive.[12]
As an example of the latter, let’s turn to autism. As I’ve discussed in Intense World Theory of Autism, autism involves many different suites of symptoms which don’t always go together (sensory sensitivity, “learning algorithm hyperparameters”, proneness to seizures, etc.). But a common social manifestation would be kinda the reverse of the above. Given a trigger-happy arousal system, many autistic people will respond robustly and frequently to things like noticing someone out of the corner of their eye, or hearing someone in the distance. But as for receiving eye contact, or someone deliberately trying to get their attention, they’ll find it so overwhelming that they’ll tend to avoid those situations in the first place,[13] or use other coping methods to limit their physiological arousal. So that’s my attempted explanation for why many autistic people have an especially weak “drive to feel liked / admired”, relative to their comparatively-more-typical levels of compassion and spite, if I understand correctly.
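The activation-frequency argument of this section can be sketched as a toy calculation. The assumptions are mine: each encounter is reduced to an arousal level plus a flag for whether the other person is attending to me-in-particular, and the circuit’s training signal is taken to be proportional to arousal. All numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Encounter:
    arousal: float   # my phasic arousal during the encounter (hypothetical 0..1)
    about_me: bool   # is the conspecific thinking about me-in-particular?


def liked_share(encounters):
    """Fraction of arousal-weighted circuit activity that occurs while the
    other person is thinking about me (feeds "drive to feel liked / admired").
    The remainder feeds the more general compassion / spite drive."""
    total = sum(e.arousal for e in encounters)
    if total == 0:
        return 0.0
    return sum(e.arousal for e in encounters if e.about_me) / total


# Phoebe: barely reacts to peripheral people, reliably reacts to eye contact.
phoebe = [Encounter(0.1, False)] * 10 + [Encounter(0.8, True)] * 5

# Reverse pattern: strong reactions to peripheral people, and eye contact is
# mostly avoided, so only one such encounter occurs.
reverse = [Encounter(0.7, False)] * 10 + [Encounter(0.9, True)] * 1

print(liked_share(phoebe), liked_share(reverse))
```

The same circuit yields very different mixes of the two drives depending only on which encounters tend to produce arousal, which is the proposed reason the drives are not tightly yoked across people.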
6.3 Whose admiration do I crave?
I think it’s common sense that, in the “drive to feel liked / admired”, we’re driven to be liked / admired by some people much more than others. For example, think of a real person whom you greatly admire, more than almost anyone else, and imagine that they look you in the eye and say, “wow, I’m very impressed by you!” That would probably feel extremely exciting and motivating! Such events can be life-changing—see Mentorship, Management, and Mysterious Old Wizards. Next, imagine some random unimpressive person looks you in the eye and says the same thing. OK cool, maybe you’d be happy to receive the compliment. Or maybe not even that. It sure wouldn’t go down as a life-affirming memory to be treasured forever. More examples in footnote→[14]
I had previously written that, if Zoe likes / admires me, then that feels intrinsically motivating to the extent that I like / admire Zoe in turn. Whoops, I’ve changed my mind! Instead, I now think that it feels intrinsically motivating to the extent that interactions with Zoe seem important and high-stakes from my perspective, regardless of whether I like / admire her. (However, if I see her as “enemy” rather than “friend”, then that would have an impact.) For example, if Zoe is my boss whom I only mildly like / admire, I think I would still react strongly to her approval. That’s what we get from the circuit above—the physiological arousal will respond to how high-stakes it feels for me to be interacting with Zoe, along with the various other factors (e.g. receiving eye contact automatically causes extra arousal). I think my new theory is a better fit to everyday experience, but you can judge for yourself and let me know what you think.
There’s an additional question of what’s upstream of that—i.e., what leads to some people inducing physiological arousal (i.e. being “attention-grabbing”, “intimidating”, “larger-than-life”, etc.) more than others? I think it’s complicated—lots of things go into that. Some come straight from arousal-inducing innate reactions. For example, I think we have an innate drive that induces arousal upon interacting with a tall person, just as many other animals have instincts to “size each other up”. The evolutionary logic is: Any interaction with a tall person is high-stakes because they could potentially beat us up. In other cases, the physiological arousal routes through within-lifetime learning. Is the person in a position to strongly impact my life?
Incidentally, if we compare my previous theory (that I’m driven to be liked / admired by Zoe in proportion to how much I like / admire Zoe in turn) to my current theory (that I’m driven to be liked / admired by Zoe in proportion to how much interactions with Zoe feel arousing, a.k.a. high-stakes), I think there’s some overlap in predictions, because there’s correlation between strongly liking / admiring Zoe, versus feeling like interactions with Zoe are high-stakes. I think the correlation comes from both directions. If I strongly like / admire Zoe, then as a consequence, my interactions with her can feel high-stakes. My liking / admiring her puts her in a position to impact my life. For example, if she spurns me, then I’ve lost access to something I enjoy; plus, I’ve implicitly given her the power to crush my self-esteem. In the other direction, if interactions with Zoe feel high-stakes, I think that can impact how much I like / admire Zoe, for various reasons, including the general valence-arousal interaction mentioned in §5.3.1.
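For concreteness, here is a minimal sketch of how the two theories differ on the boss example. The function names and the 0-to-1 scales are hypothetical, and the “enemy” case is left out because the text above does not specify how it enters.

```python
def old_theory(approval, my_liking_of_her):
    """Previous theory: her approval motivates me in proportion to how much
    I like / admire her."""
    return approval * my_liking_of_her


def new_theory(approval, arousal):
    """Current theory: her approval motivates me in proportion to how
    arousing (high-stakes) interactions with her feel."""
    return approval * arousal


# Zoe is my boss: I only mildly like / admire her, but interactions with her
# feel high-stakes. Both numbers are made up for illustration.
boss = {"approval": 1.0, "my_liking_of_her": 0.2, "arousal": 0.9}

old = old_theory(boss["approval"], boss["my_liking_of_her"])
new = new_theory(boss["approval"], boss["arousal"])
print(old, new)
```

The two functions give similar answers whenever liking and arousal are correlated, which is the overlap in predictions described above. They come apart in cases like the boss, where liking is low and stakes are high.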
7. Other examples of social instincts
I think the “compassion / spite circuit” above is an important piece of the puzzle of human social instincts. But there’s a whole lot more to social instincts beyond that! Really, I think there’s a bunch of interacting circuits and signals in the Steering Subsystem. How can we pin it down?
Experimentally, there’s a longstanding thread of work laboriously characterizing each of the hundreds of little neuron groups in the Steering Subsystem. More of that would obviously help. I mentioned at least one specific experiment above (§1.2). In parallel, perhaps we could try leapfrogging that process by measuring a complete connectome! My impression is that there are viable roadmaps to a full mouse connectome within years, not decades—much sooner than people seem to realize. Indeed, my guess is that getting a primate or even human connectome well before Artificial General Intelligence is totally a viable possibility, given appropriate philanthropic or other support. (See here.)
On the theory side, as we wait for that data, I think there’s still plenty of room for further careful armchair theorizing to come up with plausible hypotheses. A possible starting point for brainstorming is to look at the set of innate stereotyped (a.k.a. “consummatory”) behavior towards conspecifics, to guess at some of the signals that might be internal to the Steering Subsystem. Doing that is a bit tricky for humans, since our behavioral repertoire comes disproportionately from learning and culture (excepting early childhood, I suppose). But for example, if a rodent sees another rodent, it might display:
Of these:
So that brings us to:
7.1 “Drive to feel feared” (a.k.a. “drive to receive submission”)
Dual strategies theory (see my own discussion at Social status part 2/2: everything else) says that people can have “high status” in two different ways: “prestige” and “dominance”. If the “drive to feel liked / admired” above is upstream of seeking prestige for its own sake, then the “drive to feel feared” would be correspondingly upstream of seeking dominance for its own sake.
The “drive to feel feared” could also be called “drive to receive submission”—i.e., a drive for others to display submissive behavior towards me, as in those rats rolling onto their backs. I’m not sure which of those two terms is better. I figure there’s probably some Steering Subsystem signal that’s upstream of both a tendency towards submissive behavior and a tendency towards fear and flight behavior, and it’s this upstream signal that flows into the circuit.
Evolutionarily, it makes perfect sense for there to be a “drive to feel feared”. If someone submits to me, then I’m dominant, and I get first dibs on food and mates without having to fight.
Neuroscientifically, I think the circuit for “drive to feel feared” could be parallel to the “compassion / spite circuit” above. More specifically, the first step is using Ingredient 4 to get to “Conspecific seems to be feeling fear / submission”:
And then we combine that with physiological arousal to get a motivational effect:
And as before, this would fire especially strongly under eye contact or other signals that the conspecific is thinking of you-in-particular:
(As drawn, the circuit might (mis)fire when I notice my friend submitting to a bully who is also simultaneously threatening me. I think that would be solvable by gating the circuit such that it doesn’t fire if I myself am also feeling fear / submission. Let me know if you think of other examples where this proposal doesn’t work.)
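The proposed gating fix can be sketched as follows. This is only an illustration under my own assumptions: signals are scaled 0 to 1, the motivational output is the product of the conspecific’s apparent fear / submission and my own arousal (mirroring the parallel circuit), and the gate is a hard threshold on my own fear / submission signal with an invented cutoff.

```python
def feel_feared_signal(conspecific_fear, my_arousal, my_fear, gate_at=0.5):
    """Toy "drive to feel feared" output.

    conspecific_fear: how strongly the other person seems to feel fear / submission
    my_arousal:       my own physiological arousal
    my_fear:          my own fear / submission signal
    gate_at:          hypothetical cutoff above which the circuit is silenced
    """
    if my_fear >= gate_at:
        # I am frightened too, so their submission is probably not directed
        # at me. Suppress the circuit.
        return 0.0
    return conspecific_fear * my_arousal


# Someone submits to me while I feel no fear: the circuit fires.
print(feel_feared_signal(conspecific_fear=0.9, my_arousal=0.8, my_fear=0.0))

# My friend submits to a bully who is also threatening me: the gate blocks it.
print(feel_feared_signal(conspecific_fear=0.9, my_arousal=0.8, my_fear=0.9))
```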
8. Conclusion
I feel like I have the big picture of a plausible nuts-and-bolts explanation of how the human brain solves the symbol grounding problem to implement social instincts. It might be wrong, and I’m happy for feedback.
Ingredients 1–4 constitute a kind of domain-specific language in which I think all of our social instincts are written. And then §5–§7 include an attempt to build two specific social instincts out of the elements of that language, out of a much larger collection of social instincts yet to be sorted out. I figure that the things I wrote down, while a bit sketchy and incomplete, are probably capturing at least some aspects of compassion, spite, schadenfreude, “drive to feel liked / admired”, and “drive to feel feared”, and I think these collectively capture a lot of the human social world. (See also my post A theory of laughter for how laughter and play work.)
If you think this post is totally on the wrong track, then please let me know, by email or the comments section below. If it’s on the right track, then that’s great, but we still obviously have tons of work left to do to really pin down human social instincts, possibly in conjunction with experiments, as discussed in §7 above.
In case anyone’s wondering, I think my next project going forward will be to spend a while pondering the very biggest picture of brain-like AGI safety—everything from reward functions and training environments and testing, to governance and deployment and society, in light of (what I hope is) my newfound understanding of how human social instincts generally work. My confusion on that topic has been a big blocker to my thinking and progress previous times that I tried to do that. After that, I guess I’ll figure out where to go from there! Should be interesting.
Thanks to Seth Herd and Simon Skade for critical comments on earlier drafts.
Speaking of which, some bits of text in this introductory section are copied from that post.
For a different (simpler) example of what I think it looks like to make progress towards that kind of pseudocode, see my post A Theory of Laughter.
Thanks to regional specialization across the cortex (roughly corresponding to “neural network architecture” in ML lingo), there can be a priori reason to believe that, for example, “pattern 387294” is a pattern in short-term auditory data whereas “pattern 579823” is a pattern in large-scale visual data, or whatever. But that’s not good enough. The symbol grounding problem for social instincts needs much more specific information than that. If Jun just told me that Xiu thinks I’m cute, then that’s a very different situation from if Jun just told me that Fang thinks I’m cute, leading to very different visceral reactions and drives. Yet those two possibilities are built from generally the same kinds of data.
Actually, this is an area where the evolutionary “design spec” can be pretty inscrutable. The (so-called) spider detector circuit, like any image classifier, triggers on all kinds of inputs, not all of which are spiders, including Bizarre Visual Input Type 74853 that has no relation to spiders and would occur on average once every 100 lifetimes in our ancestral environment. And maybe it just so happened that Bizarre Visual Input Type 74853 correlates with danger, such that noticing and recoiling from it was adaptive. Then that very fact would be part of the evolutionary pressure sculpting the (so-called) spider detector circuit, such that the term “spider detector circuit” is not a 100% perfect description of its evolutionary purpose.
My diagrams are drawn with the “supervisor” signal traveling from the Steering Subsystem to the short-term predictor, and then the subtraction step (“supervisor – output = error”) happening in the short-term predictor. But that’s just for illustration. I’m also open-minded to the possibility that the subtraction is performed in the Steering Subsystem, and that it’s the error signal that travels up to the short-term predictor. That’s more of a low-level implementation detail that I’m not too concerned with for the purpose of this post.
See my recent post Against empathy-by-default for a related discussion about how things go wrong if you just keep the learning rate turned on 100% of the time.
Details: Basically, I’m saying that, because physiological arousal is one of the interoceptive sensory inputs (related discussion), the Thought Generator self-supervised learning algorithm is already learning to predict imminent physiological arousal. So why do we also need a separate short-term predictor, nominally learning the same thing? My answer is: the Thought Generator algorithm is designed to build unlabeled latent variables that are useful for prediction, not to actually produce meaningful outputs, thanks to locally-random pattern separation. So the short-term predictor is also needed, to turn those unlabeled latent variables into a meaningful (“grounded”) output signal.
For purposes of this discussion, things like sense-of-pain, sense-of-temperature, and “affective touch” (c-tactile receptors) count as interoception, not exteroception, despite the fact that you can in fact learn about the outside world via those signals. After all, the skin is an organ, and sensing the health and status of your organs is an interoception thing. See How Do You Feel by Bud Craig (2020) for detailed physiological evidence—nerve types, pathways in the spine and brain, etc.—that this is the right classification.
Here and elsewhere, I’m using English-language emotion words to refer to Steering Subsystem signals, because I don’t know how else to refer to them. But be warned that there is never a perfect correspondence between brainstem signals and emotion words (as we actually use them in everyday life). For more discussion of that point, see Lisa Feldman Barrett versus Paul Ekman on facial expressions & basic emotions.
As a general rule, there are multiple ways to turn pseudocode into neuroscientifically-plausible circuits. For example, the gray box is an intermediate variable in this calculation. I’m drawing it explicitly because it makes it easier to follow. But it might not be a separate cell group in the hypothalamus. Or conversely, it could be two cell groups, one for “pleasure” and the other for “displeasure”, with mutual inhibition. Or something else, who knows.
In terms of the Ingredient 4 discussion, this would be the actual phasic arousal in our own bodies, which is impacted by the exteroception-sensitive short-term predictors, but is not impacted by transient empathetic simulations of someone else’s phasic arousal.
I guess I’m predicting that people with constitutionally low arousal responses (extraverts, thrill-seekers, or in the most extreme case, sociopaths as explained here) will tend to have more status drive, relative to compassion drive. But I didn’t check that. It’s not a strong prediction—there are probably a bunch of other factors at play too.
Aversion to eye contact is common among autistic people. For example, John Elder Robison entitled his first memoir Look Me in the Eye, and discusses his aversion to eye contact in the prologue. And in the book excerpt I copied here, there are three quotes from autistic people about their experience of eye contact.
As an example, there’s an anecdote here of someone making a “feelgood” email folder for when she was feeling down, and most of the entries she mentions are basically compliments from people whom (I suspect) she sees as important and intimidating. As another example, my 9yo craves “impressing his parents” like a drug, and strives endlessly for us to laugh at his jokes, admire his knowledge and achievements, etc. But when we had regular visits with a 4yo who idolized him, he basically couldn’t care less.