EDIT, 2022:

This post is still a reasonable starting point, but I need to post a revised version that emphasizes preventing dominance outside of play. All forms of dominance must be prevented if society is to heal from our errors. These days I think about human communication primarily in terms of network messaging protocols: establishing the state of being in communication, modulated by permissions defined by willingness to demonstrate friendliness via action. I originally wrote this post to counter "status is all you need" type thinking, and in retrospect, I don't think I went anywhere near far enough in eliminating hierarchy and status from my thinking.

With that reasoning error warning in mind, original post continues:

Preface

(I can't be bothered to write a real Serious Post, so I'm just going to write this like a tumblr post. y'all are tryhards with writing and it's boooooring, and also I have a lot of tangentially related stuff to say. Pls critique based on content. If something is unclear, quote it and ask for clarification)

Alright so, this is intended to be an explicit description that, hopefully, could be turned into an actual program, one that would generate the same low-level behavior as the way social stuff arises from brains. Any divergence is a mistake and should be called out and corrected. It is not intended to be a fake framework: it's either actually a description of the parts of the causal graph that are above a threshold level of impact, or it's wrong. It's hopefully also a good framework. I'm pretty sure it's wrong in important ways, and I'd like to hear what people suggest to improve it.

Recommended knowledge: vague understanding of what's known about how the cortex sheet implements fast inference/how "system 1" works, how human reward works, etc, and/or how ANNs work, how reinforcement learning works, etc.

The hope is that the computational model would generate social stuff we actually see, as high-probability special cases - in semi-technical terms you can ignore if you want, I'm hopeful it's a good causal/generative model, aka that it allows compressing common social patterns with at least somewhat accurate causal graphs.

The thing

We're making an executable model of part of the brain, so I'm going to write it as a series of changes I'm going to make. (I'm uncomfortable with the structured-ness of this; if anyone has ideas for how to generalize it, that would be helpful.)

  1. To start our brain thingy off, add direct preferences: experiences our new brain wants to have. Make negative things count much more heavily than good things, maybe around 5x.
    • From the inside, this is an experience that in-the-moment is enjoyable/satisfying/juicy/fun/rewarding/attractive to you/thrilling/etc etc. Basic stuff like drinking water, having snuggles, being accepted, etc - preferences that are nature and not nurture.
    • From the outside, this is something like the experience producing dopamine/serotonin/endorphin/oxytocin/etc in, like, a young child or something - ie, it's natively rewarding.
    • In the implementable form of this model, our reinforcement learner needs a state-reward function.
    • Social sort of exists here, but only in the form that if an agent can give something you want, such as snuggles, then you want that interaction.
  2. Then, make the direct preferences update by pulling the rewards back through time.
    • From the inside, this is the experience of things that lead to rewarding things becoming rewarding themselves - operant conditioning and preferences that come from nurture, eg complex flavor preferences, room layout preferences, preferences for stability, preferences for hygiene being easy, etc.
    • From the outside, this is how dopamine release and such happens when a stimulus is presented that indicates an increase in future reward.
    • In the implementable form of this model, this is any temporal difference learning technique, such as Q-learning.
    • Social exists more here, in that our agent learns which agents reliably produce experiences that are level-1 preferred vs dispreferred. If there's a level-1 boring/dragging/painful/etc thing another agent does, it might result in an update towards lower probability of good interactions with that agent in that context. If there's a level-1 fun/good/satisfying/etc thing another agent does, it might result in an update towards that agent being good to interact with in that context and maybe in others.
  3. Then, modify preferences to deal with one-on-one interactions with other agents:
    • Add tracking of retribution for other agents
      • From the inside, this is feeling that you are your own person, getting angry if someone does something you don't like, and becoming less angry if you feel that they're actually sorry.
      • From the outside, this is people being quick to anger and not thinking things through before getting angry about Bad Things. Something about the sympathetic nervous system (SNS) as well. I'm less familiar with the neural implementation of anger.
      • To implement: Track retribution-worthiness of the other agent. Increase it if the other agent does something you consider retribution-worthy. Initialize what's retribution-worthy to be "anything that hurts me". Initialize retribution-worthiness of other agents to be zero. Decrease retribution-worthiness once retribution has been enacted and accepted as itself not retribution-worthy by the other agent.
    • Track deservingness/caring-for other agents. Keep decreasing an agent's deservingness open as an option for how to enact retribution.
      • From the inside, this is the feeling that you want good for other people/urge to be fair. It is not the same thing as empathy.
      • From the outside, this is people naturally having moral systems.
      • To implement, have a world model that allows inferring other agents' locations and preferences, and mix their preferences with yours a little, or something. A fully correct implementation is safe AI.
    • Track physical power-over-the-world of you vs other agents
      • From the inside, this is the feeling that someone else is more powerful or that you are more powerful. (fixme: Also something about the impro thing goes here? how to integrate?)
      • From the outside, this is animals' hardcoded tracking of threat/power signaling - I'd expect to find it at least in other mammals
      • To implement, hand-train a pattern matcher on [Threatening vs Nonthreatening] data, and provide this as a feature to reinforcement learning; also increase deservingness/decrease retribution-worthiness for agents that have high power, because they are able to force this, so treat it as an acausal trade.
  4. Then, track other agents' beliefs to iterate this over a social graph
    • Track other agents' coalition-building power; update the power-over-the-world dominance based on an agent's ability to build coalitions and harness other agents' power.
      • From the inside, this is the feeling that someone else has a lot of friends/is popular, or that you have a lot of friends/are popular
    • Track other agents' verbal trustworthiness, update your models on level 2 directly from trusted agents' statements of fact
    • Track other agents' retribution lists to form consensus on what is retribution-worthy; update what you treat as retribution-worthy off of what other agents will punish you for not punishing
    • Track other agents' retribution status and deservingness among other agents, in case of coordinated punishment.
    • Predict agents' Rewardingness, Retribution-worthiness, Deservingness, and Power based on any proxy signals you can get - try to update as fast as possible.
    • Implementation: I think all you need to do is add a world model capable of rolling-in modeling of other agents modeling other agents etc as feelings, and then all of level 4 should naturally fall out of tracking stuff from earlier levels, but I'm not sure. For what I mean by rolling-in, see Unrolling social metacognition. (A rough code sketch of levels 1-3 follows below.)
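To make the above more concrete, here is a minimal Python sketch of levels 1-3 (level 4 would need agents modeling each other's models of each other, which this toy skips). Everything in it - the class, the 5x constant, the deservingness mixing weight - is an illustrative guess at one way to implement the description above, not a claim about how brains actually do it:

```python
# Illustrative sketch only; all names and constants are hypothetical.
from collections import defaultdict

NEGATIVE_WEIGHT = 5.0   # level 1: bad experiences count ~5x as much as good ones

class SocialAgent:
    def __init__(self, innate_rewards, alpha=0.1, gamma=0.9):
        # level 1: nature-not-nurture rewards, e.g. {"snuggles": 1.0, "pain": -1.0}
        self.innate_rewards = innate_rewards
        self.alpha, self.gamma = alpha, gamma
        # level 2: learned value of (agent, context), updated by temporal difference
        self.value = defaultdict(float)
        # level 3: per-agent social bookkeeping
        self.retribution = defaultdict(float)          # initialized to zero
        self.deservingness = defaultdict(lambda: 1.0)
        self.power = defaultdict(lambda: 1.0)          # estimated power-over-the-world

    def raw_reward(self, experience):
        r = self.innate_rewards.get(experience, 0.0)
        return r * NEGATIVE_WEIGHT if r < 0 else r     # level 1 asymmetry

    def interact(self, other, context, experience, next_context=None):
        """One interaction with another agent in some context."""
        r = self.raw_reward(experience)
        # level 2: one-step TD update, pulling reward back through time
        key = (other, context)
        future = self.value[(other, next_context)] if next_context else 0.0
        self.value[key] += self.alpha * (r + self.gamma * future - self.value[key])
        # level 3: "anything that hurts me" is retribution-worthy, discounted by
        # the other agent's power (the acausal-trade move above)
        if r < 0:
            self.retribution[other] += -r / self.power[other]
        return r

    def accept_apology(self, other, amount):
        # retribution-worthiness decreases once retribution is enacted/accepted
        self.retribution[other] = max(0.0, self.retribution[other] - amount)

    def effective_reward(self, experience, other, their_reward_estimate=0.0):
        # level 3 deservingness: mix the other agent's (estimated) good into yours, a little
        return self.raw_reward(experience) + 0.1 * self.deservingness[other] * their_reward_estimate
```

Level 4 would then be each agent running models like this of the other agents (and of their models of each other), which is why I expect it to fall out of a good enough world model rather than needing new machinery.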

Things that seem like they're missing to me

  • Greg pointed out that current artificial RL (ie, step 1) is missing something simple and important about the way reward works in the brain, but neither of us is quite sure what exactly it is.
  • Greg also pointed out that the way I'm thinking about power here doesn't properly take into account the second-to-second impro thing.
  • Greg thought there were interesting bits about how people do empathy that disagree really hard with the way I thought level 3 works
  • Lex had a bunch of interesting critiques I didn't really understand well enough to use. I thiiink I might have integrated them at this point? not sure.
  • A bunch of people, including me, hate anything that has levels, since levels are probably both more structured and less detailed than reality actually is. But I still feel like the levels thing is actually a pretty damn good representation. Suggestions welcome, callouts are not.
  • This explanation sucks and people probably won't get useful intuitions out of this the way I have from thinking about it a lot

Misc interesting consequences

  • level 4 makes each of the other levels into partially-grounded Keynesian beauty contests - a concept from economics originally intended to model the stock market - which I think is where a lot of "status signaling" stuff comes from. But that doesn't mean there isn't a real beauty contest underneath.
  • level 2 means it's not merely a single "emotional bank account" deciding whether people enjoy you - it's a question of whether they predict you'll be fun to be around, which they can keep doing even if you make a large mistake once.
  • level 3 Deservingness refers to how, when people say "I like you but I don't want to interact with you", there is a meaningful prediction about their future behavior being positive towards you that they're making - they just won't necessarily want to, like, hang out.

Examples of things to analyze would be welcome, to exercise the model, whether the examples fit in it or not; I'll share some more at some point, I have a bunch of notes to share.

Comments

> level 4 makes each of the other levels into partially-grounded Keynesian beauty contests - a concept from economics originally intended to model the stock market - which I think is where a lot of "status signaling" stuff comes from. But that doesn't mean there isn't a real beauty contest underneath.

Yes!

I wrote a low-edit post about how individual interactions give rise to consistent status hierarchies, a few months ago. (That blog is only for quick low-edit writing of mine. Those are called Tumblr posts?)

Briefly, people help people who can help them. A person who has many people who want to help them can be more helpful, so more people want to help them.
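A toy rich-get-richer simulation of just that feedback loop (my own illustration, not anything from the linked post; all numbers arbitrary) shows how it can concentrate helpfulness in a few agents:

```python
import random

helpfulness = [1.0] * 10                      # ten agents, initially equal
for _ in range(2000):
    candidate = random.randrange(10)
    # the more helpful someone already is, the more likely they are to get helped
    if random.random() < helpfulness[candidate] / sum(helpfulness):
        helpfulness[candidate] += 0.1         # being helped makes you more able to help

print(sorted(round(h, 1) for h in helpfulness))   # usually ends up noticeably unequal
```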

My current thinking about how to implement this without having to build full-sized agents is to make little stateful reinforcement-learner-type things in a really simple agent-world, something like a typed-message-passing type thing (rough sketch below), possibly with 2d or 3d locations and falloff of action effects by distance? Then each agent can take actions, can learn to map agent to reward, etc.

  • make other agents' reward states observable, maybe with a gating where an agent can choose to make its reward state non-observable to other agents, in exchange for that action being visible somehow.
  • make some sort of game of available actions - something like, agents have resources they need to live, can take them from each other, value being close to each other, value stability, etc etc. some sort of thing to make there be different contexts an agent can be cooperatey or defecty in.
  • hardcode or preinitialize-from-code the level 3 stuff. Hardcode into the world the identification of which agent took an action at you? IRL there's ambiguity about cause, and without that some patterns probably won't arise.

Could use really small neural networks I guess, or maybe just linear matrices of [agents, actions] and then MCMC sample from actions taken and stuff?
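A rough sketch of the kind of tiny agent-world this is pointing at - the message types, resource rules, and reward-visibility gating here are all placeholder choices, not a worked-out design:

```python
# Placeholder toy world: typed message passing, resources, gated-observable rewards.
from dataclasses import dataclass

@dataclass
class Message:
    sender: int
    receiver: int
    kind: str            # e.g. "give" or "take"
    amount: float = 0.0

@dataclass
class SimpleAgent:
    agent_id: int
    resources: float = 10.0
    reward_visible: bool = True   # hiding your reward is itself a visible choice
    last_reward: float = 0.0

class World:
    def __init__(self, n_agents):
        self.agents = [SimpleAgent(i) for i in range(n_agents)]

    def step(self, messages):
        for m in messages:
            src, dst = self.agents[m.sender], self.agents[m.receiver]
            if m.kind == "give":
                amt = min(m.amount, src.resources)
                src.resources -= amt
                dst.resources += amt
                src.last_reward, dst.last_reward = -amt, amt
            elif m.kind == "take":
                amt = min(m.amount, dst.resources)
                dst.resources -= amt
                src.resources += amt
                src.last_reward, dst.last_reward = amt, -amt
        # each agent sees who acted on whom (the hardcoded attribution mentioned
        # above) plus everyone's reward, unless that agent has gated it off
        observed_rewards = [(a.agent_id, a.last_reward if a.reward_visible else None)
                            for a in self.agents]
        return messages, observed_rewards

world = World(3)
print(world.step([Message(sender=0, receiver=1, kind="take", amount=2.0)]))
```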

I'm confused precisely how to implement deservingness... seems like deservingness is something like a minimum control target for others' reward, and retribution is a penalty that supersedes it? maybe?
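One literal reading of that sentence, in case it helps pin the confusion down (this is purely my guess at what "minimum control target superseded by retribution" could mean in code):

```python
def social_bonus(their_estimated_reward, deservingness, retribution):
    """Hypothetical extra reward term for how another agent is doing."""
    if retribution > 0:
        return -retribution                               # retribution supersedes caring
    shortfall = max(0.0, deservingness - their_estimated_reward)
    return -shortfall     # penalized while they sit below the deservingness "floor"
```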

If using neural networks, implementing the power thing in level 3 is a fairly easy prediction task; using Bayesian MCMC or whatever, it's much harder. Maybe that's an ok place to use NNs? Trying to use NNs in a model like this feels like a bad idea unless the NNs are extremely regularized... also the inference needed for level 4 is hard without NNs.

I haven't had time to fully load this up into my working memory to think it through, check implications, etc, but for now wanted to say I very much appreciate the spirit in which the post is presented. (Specifically: it attempts to present a concrete model specific enough to be falsifiable and predictive)

Yes. Strong upvote. I'm very excited to see hypothesized models that purport to give rise to high-level phenomena, and models that are on their way to being executable are even better.

> We're making an executable model of part of the brain, so I'm going to write it as a series of changes I'm going to make.
> 1. To start our brain thingy off, add direct preferences:

Haven't finished reading and apologies if this is cleared up later, but I wasn't clear what you meant by "changes you're going to make" – is this relative to a blank rock of nothingness, or to a crude learning algorithm, or something else?

0. start with blank file

1. add preference function

2. add time

3. add the existence of another agent

4. add the existence of networks of other agents

something that I realized bothers me about this model: I basically didn't include TAP (trigger-action pattern) reasoning, aka classical conditioning; I started from operant conditioning.

also, this explanation fails miserably at the "tell a story of how you got there in order to convey the subtleties" thing that eg Ben Hoffman was talking about recently.

yeahhhhhh, missing TAP-type reasoning is a really critical failure here. I think a lot of important stuff happens around signaling whether you'll be an agent that is level-1 valuable to be around, and I've thought before about how keeping your hidden TAP depth short in ways that are recognizable to others makes you more comfortable to be around, because you're more predictable. or something

this would have to take the form of something like: first make the agent a slightly-stateful pattern-response bot, maybe with a global "emotion" state thing that sets which pattern-response networks to use (roughly like the sketch below). then try to predict the world in parts, unsupervised. then have preferences, which can be about other agents' inferred mental states. then pull those preferences back through time, reinforcement learned. then add the retribution and deservingness things on top. power would be inferred from representations of other agents, something like trying to predict the other agents' unobserved attributes.

also this doesn't put level 4 as this super high level thing, it's just a natural result of running the world prediction for a while.

the better version of this model probably takes the form of a list of the most important built-in input-action mappings.
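A hypothetical skeleton of that restructured starting point - a slightly-stateful pattern-response bot where a global "emotion" state picks which pattern-response table is active. The pattern names, emotions, and switching rule are all made up for illustration:

```python
class PatternResponseBot:
    def __init__(self, tables):
        # tables: {emotion: {observed_pattern: response_action}}
        self.tables = tables
        self.emotion = "relaxed"

    def observe(self, pattern):
        # crude hardcoded emotion switching; a real version would be learned
        if pattern == "threat":
            self.emotion = "afraid"
        elif pattern == "friendly":
            self.emotion = "relaxed"

    def act(self, pattern):
        self.observe(pattern)
        table = self.tables.get(self.emotion, {})
        return table.get(pattern, "do_nothing")

bot = PatternResponseBot({
    "relaxed": {"friendly": "approach"},
    "afraid":  {"threat": "flee", "friendly": "watch_warily"},
})
print(bot.act("friendly"))   # "approach" - relaxed table is active
print(bot.act("threat"))     # "flee" - emotion switched to afraid
```

World prediction, preferences, TD learning, and the retribution/deservingness tracking would then be layered on top of this, per the ordering above.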

jmh:

Not sure if you were already thinking along these lines or not (nor am I entirely sure it's how my brain works, much less normal brains), but since you were borrowing from economics: how are your preferences balanced internally? Looking at some constrained max reward? Decision-making a la marginal rewards? Something else?

So I'm very interested in anything you feel you can say about how this doesn't work to describe your brain.

with respect to economics - I'm thinking about this mostly in terms of partially-model-based reinforcement learning/build-a-brain, and economics arises when you have enough of those in the same environment. The thing you're asking about is more on the build-a-brain end and is pretty open for discussion; the brain probably doesn't actually have a single scalar reward, but rather a thing that can dispatch rewards with different masks or something.
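A guess at what "rewards with different masks" could mean in code: reward arrives as a vector over channels, and each learning system only sees the channels its mask passes. The channel names and weights here are invented for the example:

```python
import numpy as np

CHANNELS = ["food", "social", "pain", "novelty"]

def masked_reward(reward_vector, mask):
    # the scalar a given learning system actually trains on
    return float(np.dot(reward_vector, mask))

reward = np.array([0.2, 1.0, -0.5, 0.3])        # one event, several channels
habit_mask = np.array([1.0, 0.2, 1.0, 0.0])     # what a habit-ish system cares about
social_mask = np.array([0.0, 1.0, 0.5, 0.2])    # what a social-ish system cares about

print(masked_reward(reward, habit_mask))        # the same event trains different
print(masked_reward(reward, social_mask))       # systems by different amounts
```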

jmh:

That is very difficult for me to articulate. If we take the standard econ choice-equilibrium definition of equating marginal utility per dollar, and then toss out the price element since we're purely comparing the utility (ignoring the whole subjective-versus-other-forms issue here), we don't need to normalize on cost (I think).
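Spelling out the textbook condition being referenced, for two hypothetical goods x and y:

$$\frac{MU_x}{p_x} = \frac{MU_y}{p_y} \;\;\xrightarrow{\;p_x \,=\, p_y \,=\, 1\;}\;\; MU_x = MU_y$$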

That implies that the preferences for completely different actions/choices I make are directly comparable. In other words, it is a choice between differences in degree and not differences in kind.

However, when I really find myself in a position where I have a hard choice to make, it's never a problem of some simple mental calculation such as the above; it feels entirely different. The challenge I face is that I'm not making that type of comparison but something more along the lines of choosing between two alternatives that lack a common basis for comparison.

I was thinking a bit about this in a different context a while back. If economic decision theory, at least from a consumer perspective, is all about indifference curves, is that really a decision theory or merely a rule-following approach? The real decision arises in the setting where you are in a position of indifference between multiple alternatives, but economics cannot say anything about that -- the answer there is flip a coin/random selection, but is that really a rational thought process for choice?

But, as I said, I'm not entirely sure I think like other people.

> From the inside, this is an experience that in-the-moment is enjoyable/satisfying/juicy/fun/rewarding/attractive to you/thrilling/etc etc.

people’s preferences change in different contexts, since they are implicitly always trying to comply with what they think is permissible/safe before trying to get what they want, up to some level of stake outweighing this, along many different axes of things one can have a stake in

to see people’s intrinsic preferences we have to consider that people often aren’t getting what they want and are tricked into wanting suboptimal things wrt some of their long-suppressed wants, because of social itself

this has to be really rigorous because it’s competing against anti-inductive memes

this is really important to model, because if we know anything about people’s terminal preferences modulo social, then we know we are confused about social anytime we can’t explain why they aren’t pursuing opportunities they should know about, or anytime they are internally conflicted even though they know all the consequences of their actions relative to their real, ideal-to-them terminal preferences

> Social sort of exists here, but only in the form that if an agent can give something you want, such as snuggles, then you want that interaction.

is it social if a human wants another human to be smiling because perception of smiles is good?

> is it social if a human wants another human to be smiling because perception of smiles is good?

I wouldn't say so, no.

good point about lots of level 1 things being distorted or obscured by level 3. I think the model needs to be restructured to not have a privileged intrinsicness to level 1, but rather initialize moment-to-moment preferences with one thing, then update that based on pressures from the other things