Stuart Armstrong has proved some theorems showing that it's really really hard to get to the Pareto frontier unless you're adding utility functions in some sense, with the big issue being the choice of scaling factor. I'm not sure even so, on a moral level - in terms of what I actually want - that I quite buy Armstrong's theorems taken at face value, but on the other hand it's hard to see how, if you had a solution that wasn't on the Pareto frontier, agents would object to moving to the Pareto frontier so long as they didn't get shafted somehow.
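To make the "adding utility functions, with a scaling factor" point concrete, here is a minimal sketch. The outcomes and their utility numbers are made up for illustration, and `pareto_frontier` and `best_weighted` are hypothetical helper names, not anything from Armstrong's theorems. It only shows that maximizing a positively weighted sum lands on the Pareto frontier, and that which frontier point you land on depends on the weight.

```python
# Hypothetical toy outcomes: (utility to agent A, utility to agent B).
# The numbers are illustrative assumptions only.
outcomes = {
    "all_paperclips": (0.0, 10.0),
    "split": (6.0, 6.0),
    "all_sentients": (10.0, 0.0),
    "wasteful": (4.0, 4.0),
}


def pareto_frontier(outcomes):
    """Return the names of outcomes that no other outcome Pareto-dominates."""
    frontier = []
    for name, (a, b) in outcomes.items():
        dominated = any(
            (a2 >= a and b2 >= b) and (a2 > a or b2 > b)
            for other, (a2, b2) in outcomes.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)


def best_weighted(outcomes, weight_b):
    """Pick the outcome maximizing u_A + weight_b * u_B.

    weight_b plays the role of the scaling factor between the two agents.
    """
    return max(outcomes, key=lambda n: outcomes[n][0] + weight_b * outcomes[n][1])


frontier = pareto_frontier(outcomes)
for weight_b in (0.1, 1.0, 10.0):
    choice = best_weighted(outcomes, weight_b)
    print(weight_b, choice, choice in frontier)
```

Every positive weight picks a frontier point here, but the three weights pick three different ones, which is the sense in which the scaling factor carries all the moral weight.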
It occurred to me (and I suggested to Armstrong) that I wouldn't want to trade off whole stars turned into paperclips against individual small sentients on an even basis when dividing the gains from trade, even if I came out ahead on net against the prior state of the universe before the trade. I.e., if we were executing a gainful trade and the question was how to split the spoils, and some calculation took place which ended up with the paperclip maximizer gaining a whole star's worth of paperclips from the spoils every time I gained one small-sized eudaimonic sentient, then my primate fairness calculator wants to tell the maximizer to eff off and screw the trade. I suggested to Armstrong that the critical scaling factor might revolve around equal amounts of matter affected by the trade, and you can also see how something like that might emerge if you were conducting an auction between many superintelligences (they would purchase matter affected where it was cheapest). Possibilities like this tend not to be considered in such theorems, and when you ask which axiom they violate, it's often an axiom that turns out to not be super morally appealing.
Irrelevant alternatives is a common hinge on which such theorems fail when you try to do morally sensible-seeming things with them. One of the intuition pumps I use for this class of problem is to imagine an auction system in which all decision systems get to spend the same amount of money (hence no utility monsters). It is not obvious that you should morally have to pay the money only to make alternatives happen, and not to prevent alternatives that might otherwise be chosen. But then the elimination of an alternative not output by the system can, and morally should, affect how much money someone must pay to prevent it from being output.
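As a rough illustration of how eliminating an alternative that was never going to be output can still change the result, here is a sketch using Borda count. To be clear, this is not the equal-budget auction described above; Borda count is swapped in only because it is a standard aggregation rule known to violate independence of irrelevant alternatives. The ballots and the `borda_winner` helper are hypothetical.

```python
def borda_winner(ballots, options):
    """Run a Borda count restricted to `options`.

    Each ballot ranks alternatives best-first. With k options in play,
    the top-ranked option gets k-1 points and the bottom one gets 0.
    Returns (winner, scores).
    """
    scores = {o: 0 for o in options}
    for ballot in ballots:
        ranked = [o for o in ballot if o in options]
        for points, option in enumerate(reversed(ranked)):
            scores[option] += points
    return max(scores, key=scores.get), scores


# Hypothetical electorate: three voters rank A > B > C, two rank B > C > A.
ballots = [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 2

print(borda_winner(ballots, ["A", "B", "C"]))  # C is available but loses
print(borda_winner(ballots, ["A", "B"]))       # C has been eliminated
```

With C on the table B wins; with C removed A wins, even though nobody's relative ranking of A and B changed. Whether that counts as a bug or as morally sensible behavior is exactly the question at issue.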
Stuart Armstrong has proved some theorems showing that it's really really hard to get to the Pareto frontier unless you're adding utility functions in some sense, with the big issue being the choice of scaling factor.
He knows. Also, why do you say "really really hard" when the theorem says "impossible"?
...It occurred to me (and I suggested to Armstrong) that I wouldn't want to trade off whole stars turned into paperclips against individual small sentients on an even basis when dividing the gains from trade, even if I came out ahead on
Prerequisite reading: Cognitive Neuroscience, Arrow's Impossibility Theorem, and Coherent Extrapolated Volition.
Abstract: Arrow's impossibility theorem poses a challenge to the viability of coherent extrapolated volition (CEV) as a model for safe-AI architecture: per the theorem, no algorithm for aggregating ordinal preferences can obey Arrow's four fairness criteria while always producing a transitive preference ordering. One approach to exempting CEV from these consequences is to claim that human preferences are cardinal rather than ordinal, and that Arrow's theorem therefore does not apply. This approach is shown to ultimately fail, and other options are briefly discussed.
A problem arises when examining CEV from the perspective of welfare economics: according to Arrow's impossibility theorem, no algorithm for the aggregation of preferences can meet four common-sense fairness criteria while always producing a transitive result. Luke has previously discussed this challenge. (See the post linked above.)
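The classic small case behind this is the Condorcet cycle: each voter's ranking is transitive, yet pairwise majority voting yields an intransitive group preference. A minimal sketch, with made-up ballots and hypothetical helper names:

```python
from itertools import combinations

# Three hypothetical voters, each with a transitive ranking (best first).
ballots = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]


def majority_prefers(ballots, x, y):
    """True if a strict majority of ballots rank x above y."""
    wins = sum(1 for b in ballots if b.index(x) < b.index(y))
    return wins > len(ballots) / 2


def pairwise_majorities(ballots):
    """Return (winner, loser) for every pair decided by strict majority."""
    options = sorted(ballots[0])
    result = []
    for x, y in combinations(options, 2):
        if majority_prefers(ballots, x, y):
            result.append((x, y))
        elif majority_prefers(ballots, y, x):
            result.append((y, x))
    return result


for winner, loser in pairwise_majorities(ballots):
    print(f"majority prefers {winner} over {loser}")
```

The group prefers A to B, B to C, and C to A, so there is no transitive ordering to output. This is only one aggregation rule failing on one profile; Arrow's theorem is the much stronger claim that every rule meeting the criteria fails on some profile.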
Arrow's impossibility theorem assumes that human preferences are ordinal but (as Luke pointed out) recent neuroscientific findings suggest that human preferences are cardinally encoded. This would seem to imply that human preferences - and subsequently CEV - are not bound by the consequences of the theorem.
However, Arrow's impossibility theorem extends to cardinal utilities with the addition of a continuity axiom. This result - termed Samuelson's conjecture - was proven by Ehud Kalai and David Schmeidler in their 1977 paper "Aggregation Procedure for Cardinal Preferences." If an AI attempts to model human preferences using a utility theory that relies on the continuity axiom, then the consequences of Arrow's theorem will still apply. For example, this includes an AI using the von Neumann-Morgenstern utility theorem.
The proof of Samuelson's conjecture limits the solution space for what kind of CEV aggregation procedures are viable. In order to escape the consequences of Arrow's impossibility theorem, a CEV algorithm must accurately model human preferences without using a continuity axiom. It may be the case that we are living in a second-best world where such models are impossible. This scenario would mean we must make a trade-off between employing a fair aggregation procedure and producing a transitive result.
Supposing this is the case, what kind of trade-offs would be optimal? I am hesitant about weakening the transitivity criterion because an agent with non-transitive preferences is vulnerable to being Dutch-booked (money-pumped). This scenario poses a clear existential risk. On the other hand, weakening the independence of irrelevant alternatives criterion may be feasible. My cursory reading of the literature suggests that this is a popular alternative among welfare economists, but there are other choices.
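The worry about giving up transitivity can be sketched as a money pump. The cyclic preferences, the fee, and the `run_money_pump` helper below are all hypothetical, and the sketch assumes the agent will pay a small fee for any swap to an item it strictly prefers:

```python
# Hypothetical agent with cyclic strict preferences: A > B, B > C, C > A.
# A pair (x, y) means "x is strictly preferred to y".
prefers = {("A", "B"), ("B", "C"), ("C", "A")}


def run_money_pump(start_item, wealth, fee, rounds):
    """Repeatedly offer the agent a swap to an item it prefers, for a fee.

    Assumes the agent accepts any swap to a strictly preferred item.
    Returns the item held and the wealth remaining after `rounds` swaps.
    """
    item = start_item
    for _ in range(rounds):
        # With cyclic preferences there is always something "better" on offer.
        better = next(x for (x, y) in sorted(prefers) if y == item)
        item = better
        wealth -= fee
    return item, wealth


item, wealth = run_money_pump(start_item="A", wealth=100, fee=1, rounds=3)
print(item, wealth)
```

After three swaps the agent is holding the item it started with and is three units poorer, and nothing stops the cycle from continuing until its wealth is gone.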
Going forward, citing Arrow's impossibility theorem may serve as one of the strongest objections against CEV. Further consideration on how to reconcile CEV with Arrow's impossibility theorem is warranted.