Formerly alignment and governance researcher at DeepMind and OpenAI. Now independent.
I've now edited that section. Old version and new version here for posterity.
Old version:
None of these is very satisfactory! Intuitively speaking, Alice and Bob want to come to an agreement where respect for both of their interests is built in. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to weighted averages. The best they can do is to agree on a probabilistic mixture of EUMs—e.g. tossing a coin to decide between option 1 and option 2—which is still very inflexible, since it locks in one of them having priority indefinitely.
Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to follow through on commitments they made about which decision procedure to follow (or even hypothetical commitments).
New version:
These are all very unsatisfactory. Bob wouldn’t want #1, Alice wouldn’t want #2, and #3 is extremely non-robust. Alice and Bob could toss a coin to decide between options #1 and #2, but then they wouldn’t be acting as an EUM (since EUMs can’t prefer a probabilistic mixture of two options to either option individually). And even if they do, whoever loses the coin toss will have a strong incentive to renege on the deal.
We could see these issues merely as the type of frictions that plague any idealized theory. But we could also see them as hints about what EUM is getting wrong on a more fundamental level. Intuitively speaking, the problem here is that there's no mechanism for separately respecting the interests of Alice and Bob after they've aggregated into a single agent. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they're limited to (a probability distribution over) weighted averages of their utilities. This makes aggregation very risky when Alice and Bob can't consider all possibilities in advance (i.e. in all realistic settings).
Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to lock in values like fairness based on prior agreements (or even hypothetical agreements).
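(A quick sketch of the parenthetical claim in the new version above, using nothing beyond the standard vNM setup: if an agent's preferences over lotteries are represented by an expected utility function $U$, then $U$ is linear in probabilities, so for any options $A$, $B$ and any $p \in [0,1]$:

$$U(pA + (1-p)B) = p\,U(A) + (1-p)\,U(B) \le \max\{U(A),\, U(B)\}.$$

So an EUM can at best be indifferent between the mixture and its better component; it can never strictly prefer the mixture to both components.)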
I was a bit lazy in how I phrased this. I agree with all your points; the thing I'm trying to get at is that this approach falls apart quickly if we make the bargaining even slightly less idealized. E.g. your suggestion "Form an EUM which is totally indifferent about the cake allocation between them and thus gives 100% of the cake to whichever agent is cheaper/easier to provide cake for":
EUM treats these as messy details. Coalitional agency treats them as hints that EUM is missing something.
EDIT: another thing I glossed over is that IIUC Harsanyi's theorem says the aggregation of EUMs should have as its utility function a weighted average of the individual utilities, NOT a probability distribution over weighted averages of utilities. So even flipping a coin isn't technically kosher. This may seem nitpicky but I think it's yet another illustration of the underlying non-robustness of EUM.
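(For reference: the conclusion of Harsanyi's aggregation theorem, roughly stated, is that if each individual and the group are vNM-rational and the group's preferences satisfy a Pareto condition, then the group's utility function is, up to an additive constant, a single fixed weighted sum of the individual utilities:

$$U_{\text{group}}(x) = \sum_i w_i\, U_i(x), \qquad w_i \ge 0.$$

Flipping a coin instead yields "maximize $U_{\text{Alice}}$" with probability 1/2 and "maximize $U_{\text{Bob}}$" with probability 1/2, which behaves differently from maximizing the fixed average $\frac{1}{2}U_{\text{Alice}} + \frac{1}{2}U_{\text{Bob}}$, and is in general not representable by any single choice of weights.)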
On a meta level, I have a narrative that goes something like: LessWrong tried to be truth-seeking, but was scared of discussing the culture war, so blocked that off. But then the culture war ate the world, and various harms have come about from not having thought clearly about that (e.g. AI governance being a default left-wing enterprise that tried to make common cause with AI ethics). Now cancel culture is over and there are very few political risks to thinking about culture wars, but people are still scared to. (You can see Scott gradually dipping his toe into the race + IQ stuff over the past few months, but in a pretty frightened way. E.g. at one point he stated what I think is basically his position, then appended something along the lines of "And I'm literally Hitler and should be shunned.")
Thanks for the well-written and good-faith reply. I feel a bit confused by how to relate to it on a meta level, so let me think out loud for a while.
I'm not surprised that I'm reinventing a bunch of ideas from the humanities, given that I don't have much of a humanities background and didn't dig very far through the literature.
But I have some sense that even if I had dug for these humanities concepts, they wouldn't give me what I want.
What do I want?
So I'd be interested in pointers to sources that can give me #1 and #2 in particular.
EDIT: actually I think there's another meta-level gap between us. Something like: you characterize Yarvin as just being annoyed that the consensus disagrees with him. But in the 15 years since he was originally writing, the consensus did kinda go insane. So it's a bit odd to not give him at least some credit for getting something important right in advance.
Nice, that's almost exactly how I intended it. Except that I wasn't thinking of the "stars" as satellites looking for individual humans to send propaganda at (which IMO is pretty close to "communicating"), but rather as a network of satellites forming a single "screen" across the sky that plays a video infecting any baseline humans who look at it.
In my headcanon the original negotiators specified that sunlight would still reach the earth unimpeded, but didn't specify that no AI satellites would be visible from the Earth. I don't have headcanon explanations for exactly how the adversanimals arose or how the earth became desolate though.
(Oh, also, I think of the attack as being inefficient less because of lack of data, since AIs can just spin up humans to experiment on, and more because of the inherent difficulty of overwriting someone's cognition via only a brief visual stimulus. Though now that I think about it more, presumably once someone has been captured the next thing you'd get them to do is spend a lot of time staring at a region of the sky that will reprogram them in more sophisticated ways. So maybe the normal glitchers in my story are unrealistically incompetent.)
Oh, I see what you mean now. In that case, no, I disagree. Right now this notion of robustness is pre-theoretic. I suspect that we can characterize robustness as "acting like a belief/goal agent" in the limit, but part of my point is that we don't even know what it means to act "approximately like belief/goal agents" in realistic regimes, because e.g. belief/goal agents as we currently characterize them can't learn new concepts.
Relatedly, see the dialogue in this post.
In my post on value systematization, I used utilitarianism as a central example.
Value systematization is important because it's a process by which a small number of goals end up shaping a huge amount of behavior. But there's a second, quite different way in which this happens: core emotional motivations formed during childhood (e.g. fear of death) often drive a huge amount of our behavior, in ways that are hard for us to notice.
Fear of death and utilitarianism are very different. The former is very visceral and deep-rooted; it typically influences our behavior via subtle channels that we don't even consciously notice (because we suppress a lot of our fears). The latter is very abstract and cerebral, and it typically influences our behavior via allowing us to explicitly reason about which strategies to adopt.
But fear of death does seem like a kind of value systematization. Before we have a concept of death, we experience a bunch of stuff which is scary for reasons we don't understand. Then we learn about death, and it seems like we systematize a lot of that scariness into "it's bad because you might die".
But this seems to happen way less consciously than the systematization involved in becoming a utilitarian. So maybe we need to think about systematization happening separately in system 1 and system 2? Or maybe we should think of it as happening repeatedly in "layers" over time, where earlier layers persist but are harder to access later on.
I feel pretty confused about this. But for now my mental model of the mind is as two (partially overlapping) inverted pyramids, one bottoming out in a handful of visceral motivations like "fear of death" and "avoid pain" and "find love", and the other bottoming out in a handful of philosophical motivations like "be a good Christian" or "save the planet" or "make America great again" or "maximize utility". The second (system 2) pyramid is trying to systematize the parts of system 1 that it can observe, but it can't actually observe the deepest parts (or, when it does, it tries to oppose them), which creates conflict between the two systems.