Prerequisite reading: Cognitive Neuroscience, Arrow's Impossibility Theorem, and Coherent Extrapolated Volition.
Abstract: Arrow's impossibility theorem poses a challenge to the viability of coherent extrapolated volition (CEV) as a model for safe-AI architecture: per the theorem, no algorithm for aggregating ordinal preferences can be guaranteed to satisfy Arrow's four fairness criteria while simultaneously producing a transitive preference ordering. One approach to exempting CEV from these consequences is to claim that human preferences are cardinal rather than ordinal, so that Arrow's theorem does not apply. I argue that this approach ultimately fails, and briefly discuss other options.
A problem arises when examining CEV from the perspective of welfare economics: according to Arrow's impossibility theorem, no algorithm for aggregating preferences can be guaranteed to meet four common-sense fairness criteria while simultaneously producing a transitive result. Luke has previously discussed this challenge. (See the post linked above.)
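To make the problem concrete, here is a minimal sketch (in Python, with a contrived three-voter profile; the candidate names and rankings are illustrative) of the classic Condorcet cycle: pairwise majority rule, which is fair in Arrow's sense, can fail to produce a transitive social ordering.

```python
from itertools import combinations

# Three voters with the classic Condorcet-cycle profile; each list is a
# ranking from most preferred to least preferred.
voters = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y):
    """True if a strict majority of voters rank x above y."""
    wins = sum(ranking.index(x) < ranking.index(y) for ranking in voters)
    return wins > len(voters) / 2

for x, y in combinations("ABC", 2):
    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"{winner} is majority-preferred to {loser}")

# Prints A over B, B over C, and C over A: the aggregate relation cycles,
# so no transitive social ordering exists for this profile.
```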
Arrow's impossibility theorem assumes that human preferences are ordinal, but (as Luke pointed out) recent neuroscientific findings suggest that human preferences are encoded cardinally. This would imply that human preferences, and consequently CEV, are not bound by the consequences of the theorem.
However, Arrow's impossibility theorem extends to cardinal utilities once a continuity axiom is added. This result, known as Samuelson's conjecture, was proven by Ehud Kalai and David Schmeidler in their 1977 paper "Aggregation Procedure for Cardinal Preferences." If an AI models human preferences using a utility theory that relies on a continuity axiom, the consequences of Arrow's theorem therefore still apply. This includes, for example, an AI that models preferences with the von Neumann-Morgenstern utility theorem, whose axioms include continuity.
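For instance, a von Neumann-Morgenstern agent is committed to continuity: for any outcomes A ≻ B ≻ C there is some probability p at which B-for-sure is exactly as good as the lottery "A with probability p, otherwise C". A minimal sketch of that commitment, with made-up utility numbers:

```python
# Illustrative cardinal utilities for three outcomes (assumed, not measured).
u = {"A": 1.0, "B": 0.6, "C": 0.0}

def expected_utility(p_a):
    """Expected utility of the lottery: A with probability p_a, else C."""
    return p_a * u["A"] + (1 - p_a) * u["C"]

# Continuity pins down the indifference point in closed form.
p_star = (u["B"] - u["C"]) / (u["A"] - u["C"])
print(p_star, expected_utility(p_star), u["B"])  # 0.6 0.6 0.6
```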
The proof of Samuelson's conjecture limits the space of viable CEV aggregation procedures. To escape the consequences of Arrow's impossibility theorem, a CEV algorithm must accurately model human preferences without relying on a continuity axiom. It may be that we live in a second-best world where such models are impossible. In that case we would have to trade off between employing a fair aggregation procedure and producing a transitive result.
Supposing this is the case, what kind of trade-off would be optimal? I am hesitant to weaken the transitivity criterion, because an agent with non-transitive preferences is vulnerable to Dutch-book (money-pump) arguments; a toy example is sketched below. This scenario poses a clear existential risk. On the other hand, weakening the independence of irrelevant alternatives criterion may be feasible. My cursory reading of the literature suggests that this is a popular alternative among welfare economists, but there are other choices.
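To spell out the worry about transitivity, here is a toy money-pump (item names, fee, and starting wealth are all contrived): an agent with the cyclic preferences A ≻ B ≻ C ≻ A will pay for each "upgrade" and can be walked back to its starting holding while steadily losing money.

```python
# Cyclic strict preferences: each pair (x, y) means the agent prefers x to y.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}

def will_trade(current, offered):
    """The agent pays a small fee to swap whenever it prefers the offer."""
    return (offered, current) in prefers

holding, wealth, fee = "A", 100.0, 1.0
for offered in ["C", "B", "A", "C", "B", "A"]:  # the pump cycles the offers
    if will_trade(holding, offered):
        holding, wealth = offered, wealth - fee

print(holding, wealth)  # "A" 94.0: back where it started, six fees poorer
```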
Going forward, Arrow's impossibility theorem may serve as one of the strongest objections to CEV. Further consideration of how to reconcile CEV with Arrow's impossibility theorem is warranted.
I couldn't access the "Aggregation Procedure for Cardinal Preferences" article. In any case, why isn't an aggregate utility function that is a linear combination of everyone's utility functions (with some arbitrary weight chosen for each person) a way to satisfy Arrow's criteria?
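For what it's worth, here is a minimal sketch of the rule being proposed: a fixed weighted sum of individual cardinal utilities over a set of options (all weights and utility values below are arbitrary illustrative numbers).

```python
import numpy as np

options = ["A", "B", "C"]
# utilities[i][j] = utility person i assigns to option j (illustrative values)
utilities = np.array([
    [1.0, 0.4, 0.0],
    [0.0, 1.0, 0.7],
    [0.5, 0.2, 1.0],
])
weights = np.array([0.2, 0.5, 0.3])  # arbitrary positive per-person weights

social_utility = weights @ utilities             # aggregate score per option
best = options[int(np.argmax(social_utility))]   # highest-scoring option
print(dict(zip(options, social_utility.round(2))), best)
```

The induced social ordering is transitive by construction; the open questions are what justifies the weights and the interpersonal comparison of utilities, and whether the procedure still behaves well once people report their own utilities.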
It should also be noted that Arrow's impossibility theorem doesn't hold for non-deterministic decision procedures. I would caution against calling this an "existential risk": while decision procedures that violate Arrow's criteria might be considered imperfect in some sense, they don't necessarily cause an existential catastrophe. Worldwide range voting would not be the best way of deciding everything, but it most likely wouldn't be an existential risk.
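On the first point, the usual illustration is the "random dictator" rule: pick one voter at random and adopt their ranking. Each realized outcome is transitive, and (as I understand it) the rule is strategy-proof, but only because determinism has been given up. A minimal sketch with made-up rankings:

```python
import random

# Each voter submits a ranking from most preferred to least preferred.
voters = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def random_dictator(profile, rng=random):
    """Pick one voter uniformly at random and adopt their ranking."""
    return rng.choice(profile)

print(random_dictator(voters))  # e.g. ['B', 'C', 'A']
```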
On first inspection, it looks like "linear combination of utility functions" still has issues with strategic voting. If you prefer A to B and B to C, but A isn't the winner regardless of how you vote, it can be arranged such that you make yourself worse off by expressing a preference for A over B. Any system where you reward people for not voting t...
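One contrived set of numbers for that scenario, under a score-style weighted sum: reporting your honest A > B > C intensities lets your least-favored option win, while exaggerating B does not.

```python
# Score totals from all other voters (contrived so that A cannot win).
others = {"A": 0.0, "B": 10.5, "C": 20.0}
honest = {"A": 10.0, "B": 9.0, "C": 0.0}      # voter's true (scaled) utilities
strategic = {"A": 10.0, "B": 10.0, "C": 0.0}  # exaggerates B to the maximum

def winner(ballot):
    totals = {x: others[x] + ballot[x] for x in others}
    return max(totals, key=totals.get)

print(winner(honest))     # "C": the voter's least preferred option wins
print(winner(strategic))  # "B": honesty about A vs. B made things worse
```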