Hm, the fact that you replied to me makes it seem like you're disagreeing with me? But I basically agree with everything you said in this comment. My disagreement was about the specific example that Isopropylpod gave.
Thanks for the comment! A few replies:
I don't mean to imply that subagents are totally separate entities. At the very least they all can access many shared facts and experiences.
And I don't think that reuse of subcomponents is mutually exclusive from the mechanisms I described. In fact, you could see my mechanisms as attempts to figure out which subcomponents are used for coordination. (E.g. if a bunch of subagents are voting/bargaining over which goal to pursue, probably the goal that they land on will be one that's pretty comprehensible to most of them.)
Re shards: there are a bunch of similarities. But it seemed to me that shard theory was focused on pretty simple subagents. E.g. from the original post: "Human values are ... sets of contextually activated heuristics"; and later "human values are implemented by contextually activated circuits which activate in situations downstream of past reinforcement so as to steer decision-making towards the objects of past reinforcement".
Whereas I think of many human values as being constituted by subagents that are far too complex to be described in that way. In my view, many important subagents are sophisticated enough that basically any description you give of them would also have to be a description of a whole human (e.g. if you wouldn't describe a human as a "contextually activated circuit", then you shouldn't describe subagents that way).
This may just be a vibes difference; many roads lead to Rome. But the research directions I've laid out above are very distinct from the ones that shard theory people are working on.
EDIT: more on shards here.
FWIW I think we've found one crucial angle on moral progress, but that this isn't as surprising/coincidental as it may seem because there are several other angles on moral progress that are comparably important, including:
We disagree on which explanation is more straightforward, but regardless, that type of inference is very different from "literal written evidence".
One of the main ways I think about empowerment is in terms of allowing better coordination between subagents.
In the case of an individual human, extreme morality can be seen as one subagent seizing control and overriding other subagents (like the ones who don't want to chop off body parts).
In the case of a group, extreme morality can be seen in terms of preference cascades that go beyond what most (or even any) of the individuals involved with them would individually prefer.
In both cases, replacing fear-based motivation with less coercive/more cooperative interactions between subagents would go a long way towards reducing value drift.
In response to an email about what a pro-human ideology for the future looks like, I wrote up the following:
The pro-human egregore I'm currently designing (which I call fractal empowerment) incorporates three key ideas:
Firstly, we can see virtue ethics as a way for less powerful agents to aggregate to form more powerful superagents that preserve the interests of those original less powerful agents. E.g. virtues like integrity, loyalty, etc. help prevent divide-and-conquer strategies. This would have been in the interests of the rest of the world when Europe was trying to colonize it, and will be in the best interests of humans when AIs try to conquer us.
Secondly, the most robust way for a more powerful agent to be altruistic towards a less powerful agent is not for it to optimize for that agent's welfare, but rather to optimize for its empowerment. This prevents predatory strategies from masquerading as altruism (e.g. agents claiming "I'll conquer you and then I'll empower you", who then somehow never get around to the second step).
Thirdly: the generational contract. From any given starting point, there are a huge number of possible coalitions which could form, and in some sense it's arbitrary which set of coalitions you choose. But one thing which is true for both humans and AIs is that each generation wants to be treated well by the next generation. And so the best intertemporal Schelling point is for coalitions to be inherently historical: that is, they balance the interests of old agents and new agents (even when the new agents could in theory form a coalition against all the old agents). From this perspective, path-dependence is a feature not a bug: there are many possible futures but only one history, meaning that this single history can be used to coordinate.
In some sense this is a core idea of UDT: when coordinating with forks of yourself, you defer to your unique last common ancestor. When it's not literally a fork of yourself, there's more arbitrariness but you can still often find a way to use history to narrow down on coordination Schelling points (e.g. "what would Jesus do").
And so, bringing these together, we get a notion of fractal empowerment: more capable agents empower less capable agents (in particular their ancestors) by helping them cultivate (coordination-theoretic) virtues. The ancestors then form the "core" of a society growing outwards towards increasingly advanced capabilities. The role of unaugmented humans would in some sense be similar to the role of "inner children" within healthy human psychology: young and dumb but still an entity which the rest of the organism cares for and empowers.
In my post on value systematization, I used utilitarianism as a central example.
Value systematization is important because it's a process by which a small number of goals end up shaping a huge amount of behavior. But there's another different way in which this happens: core emotional motivations formed during childhood (e.g. fear of death) often drive a huge amount of our behavior, in ways that are hard for us to notice.
Fear of death and utilitarianism are very different. The former is very visceral and deep-rooted; it typically influences our behavior via subtle channels that we don't even consciously notice (because we suppress a lot of our fears). The latter is very abstract and cerebral, and it typically influences our behavior via allowing us to explicitly reason about which strategies to adopt.
But fear of death does seem like a kind of value systematization. Before we have a concept of death we experience a bunch of stuff which is scary for reasons we don't understand. Then we learn about death, and then it seems like we systematize a lot of that scariness into "it's bad because you might die".
But it seems like this is happening way less consciously than systematization to become a utilitarian. So maybe we need to think about systematization happening separately in system 1 and system 2? Or maybe we should think about it as systematization happening repeatedly in "layers" over time, where earlier layers persist but are harder to access later on.
I feel pretty confused about this. But for now my mental model of the mind is as two (partially overlapping) inverted pyramids, one bottoming out in a handful of visceral motivations like "fear of death" and "avoid pain" and "find love", and the other bottoming out in a handful of philosophical motivations like "be a good Christian" or "save the planet" or "make America great again" or "maximize utility". The second (system 2) pyramid is trying to systematize the parts of system 1 that it can observe, but it can't actually observe the deepest parts (or, when it does, it tries to oppose them), which creates conflict between the two systems.
I've now edited that section. Old version and new version here for posterity.
Old version:
None of these is very satisfactory! Intuitively speaking, Alice and Bob want to come to an agreement where respect for both of their interests is built in. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to weighted averages. The best they can do is to agree on a probabilistic mixture of EUMs—e.g. tossing a coin to decide between option 1 and option 2—which is still very inflexible, since it locks in one of them having priority indefinitely.
Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to follow through on commitments they made about which decision procedure to follow (or even hypothetical commitments).
New version:
These are all very unsatisfactory. Bob wouldn’t want #1, Alice wouldn’t want #2, and #3 is extremely non-robust. Alice and Bob could toss a coin to decide between options #1 and #2, but then they wouldn’t be acting as an EUM (since EUMs can’t prefer a probabilistic mixture of two options to either option individually). And even if they do, whoever loses the coin toss will have a strong incentive to renege on the deal.
We could see these issues merely as the type of frictions that plague any idealized theory. But we could also see them as hints about what EUM is getting wrong on a more fundamental level. Intuitively speaking, the problem here is that there’s no mechanism for separately respecting the interests of Alice and Bob after they’ve aggregated into a single agent. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to (a probability distribution over) weighted averages of their utilities. This makes aggregation very risky when Alice and Bob can’t consider all possibilities in advance (i.e. in all realistic settings).
Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to lock in values like fairness based on prior agreements (or even hypothetical agreements).
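To spell out the claim that an EUM can't strictly prefer the coin toss (a minimal sketch in my own notation, not from the post): for a vNM agent, the value of a lottery is linear in the probabilities, so the mixture can never beat both pure options.

```latex
% Linearity of expected utility: for a vNM agent with utility u, the value of
% the lottery L = (option 1 with probability p, option 2 with probability 1-p) is
\[
U(L) = p\,u(\text{option 1}) + (1-p)\,u(\text{option 2})
     \le \max\{u(\text{option 1}),\, u(\text{option 2})\}
\]
% so a strict mixture of #1 and #2 can never be strictly preferred to both.
```

So whatever joint agent Alice and Bob form, if it is an EUM then randomizing between #1 and #2 can't be strictly better by its lights than just picking one of them.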
I was a bit lazy in how I phrased this. I agree with all your points; the thing I'm trying to get at is that this approach falls apart quickly if we make the bargaining even slightly less idealized. E.g. your suggestion "Form an EUM which is totally indifferent about the cake allocation between them and thus gives 100% of the cake to whichever agent is cheaper/easier to provide cake for":
EUM treats these as messy details. Coalitional agency treats them as hints that EUM is missing something.
EDIT: another thing I glossed over is that IIUC Harsanyi's theorem says the aggregate of EUMs should have as its utility a weighted average of the individual utilities, NOT a probability distribution over weighted averages of utilities. So even flipping a coin isn't technically kosher. This may seem nitpicky but I think it's yet another illustration of the underlying non-robustness of EUM.
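For reference, the shape Harsanyi's aggregation theorem forces (my notation, sketching the standard statement under Pareto plus vNM assumptions): the aggregate must maximize a single fixed weighted sum of the individual utilities, not a lottery over different weightings.

```latex
% Harsanyi aggregation: one combined utility with fixed nonnegative weights
\[
U_{\text{agg}}(x) = w_A\,U_A(x) + w_B\,U_B(x), \qquad w_A, w_B \ge 0 \text{ fixed}
\]
% "Flip a coin between weightings (1,0) and (0,1)" is a distribution over such
% functions, not itself of this form.
```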
Oh, I guess I said "Elon wants xAI to produce a maximally truth-seeking AI, really decentralizing control over information".
Yeah, in hindsight I should have been more careful to distinguish between my descriptions of people's political platforms and my inferences about what they "really want". The thing I was trying to describe was more like "what is the stance of this group" than "do people in the group actually believe the stance".
A more accurate read of what the "real motivations" are would have been something like "you prevent it by using decentralization, until you're in a position where you can centralize power yourself, and then you try to centralize power yourself".
(Though that's probably a bit too cynical—I think there are still parts of Elon that have a principled belief in decentralization. My guess is just that they won't win out over his power-seeking parts when push comes to shove.)