Prerequisite reading: Cognitive Neuroscience, Arrow's Impossibility Theorem, and Coherent Extrapolated Volition.
Abstract: Arrow's impossibility theorem poses a challenge to the viability of coherent extrapolated volition (CEV) as a model for safe-AI architecture: per the theorem, no algorithm for aggregating ordinal preferences can be guaranteed to satisfy Arrow's four fairness criteria while simultaneously producing a transitive preference ordering. One approach to exempting CEV from these consequences is to claim that human preferences are cardinal rather than ordinal, so that Arrow's theorem does not apply. This approach is shown to ultimately fail, and other options are briefly discussed.
A problem arises when examining CEV from the perspective of welfare economics: according to Arrow's impossibility theorem, no algorithm for the aggregation of preferences can necessarily meet four common-sense fairness criteria while simultaneously producing a transitive result. Luke has previously discussed this challenge. (See the post linked above.)
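To see the kind of failure Arrow's theorem guarantees, here is a minimal, self-contained illustration (my example, not from the post): three voters with perfectly transitive individual rankings, aggregated by pairwise majority vote, yield an intransitive group preference, the classic Condorcet cycle.

```python
# Three voters, each with a perfectly transitive ranking (best to worst).
voters = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y):
    """True if a strict majority of voters rank x above y."""
    wins = sum(1 for ranking in voters if ranking.index(x) < ranking.index(y))
    return wins * 2 > len(voters)

# Every pairwise contest is won 2-1, producing an intransitive cycle:
# the group prefers A to B, B to C, and C to A.
for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")
```

Arrow's theorem says this is not a defect of majority rule in particular: every ordinal aggregation rule must give up either transitivity or one of the fairness criteria on some profile of preferences.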
Arrow's impossibility theorem assumes that human preferences are ordinal, but (as Luke pointed out) recent neuroscientific findings suggest that human preferences are cardinally encoded. If so, human preferences, and consequently CEV, would not be bound by the consequences of the theorem.
However, Arrow's impossibility theorem extends to cardinal utilities with the addition of a continuity axiom. This result, termed Samuelson's conjecture, was proven by Ehud Kalai and David Schmeidler in their 1977 paper "Aggregation Procedure for Cardinal Preferences." If an AI attempts to model human preferences using a utility theory that relies on a continuity axiom, then the consequences of Arrow's theorem will still apply. The von Neumann-Morgenstern utility theorem, for example, includes continuity among its axioms, so an AI modeling human preferences with it would still be affected.
The proof of Samuelson's conjecture limits the space of viable CEV aggregation procedures. To escape the consequences of Arrow's impossibility theorem, a CEV algorithm must accurately model human preferences without relying on a continuity axiom. It may be that we live in a second-best world where such models are impossible. In that case, we would have to trade off between employing a fair aggregation procedure and producing a transitive result.
Supposing this is the case, what kind of trade-off would be optimal? I am hesitant to weaken the transitivity criterion, because an agent with non-transitive preferences is vulnerable to Dutch-book arguments: it can be money-pumped. That scenario poses a clear existential risk. On the other hand, weakening the independence of irrelevant alternatives criterion may be feasible. My cursory reading of the literature suggests that this is a popular choice among welfare economists, but there are other options.
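To make the Dutch-book worry concrete, here is a minimal sketch (my own illustration, not from the post) of an agent with cyclic preferences over A, B, and C being money-pumped: at each step it pays a small fee for a trade it strictly prefers, and after a full cycle it holds exactly what it started with, minus the fees.

```python
# prefers[x] is the option the agent strictly prefers to x.
# The cycle A -> B -> C -> A makes the preferences non-transitive.
prefers = {"A": "B", "B": "C", "C": "A"}

def money_pump(start, fee, rounds):
    """Trade the agent up its cyclic preference chain, charging a fee each time."""
    holding, paid = start, 0.0
    for _ in range(rounds):
        holding = prefers[holding]  # the agent accepts: it prefers the new option
        paid += fee                 # ...and pays for the privilege
    return holding, paid

holding, paid = money_pump("A", fee=1.0, rounds=6)
print(holding, paid)  # back to "A", having paid 6.0 with nothing to show for it
```

Each individual trade looks rational to the agent, yet the sequence as a whole is a pure loss, which is why giving up transitivity seems especially dangerous for a powerful optimizer.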
Going forward, citing Arrow's impossibility theorem may serve as one of the strongest objections against CEV. Further consideration on how to reconcile CEV with Arrow's impossibility theorem is warranted.
Harsanyi's social aggregation theorem seems more relevant than Arrow.
And for anyone else who was wondering which condition in Kalai and Schmeidler's theorem fails for adding up utility functions: the answer, as far as I can tell, is cardinal independence of alternatives, but the reason is unsatisfying (again, as far as I can tell). Namely, restricting a utility function to a subset of outcomes changes the normalization used in their definition of adding up utility functions. If you're willing to bite the bullet and work with actual utility functions rather than equivalence classes of functions, this won't matter to you, but then you have other issues (e.g. utility monsters).
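The normalization point can be made concrete. The following sketch (my own construction with made-up utility numbers, not Kalai and Schmeidler's) normalizes each agent's utilities to [0, 1] before summing; dropping one outcome rescales the normalization and flips the group ranking of two outcomes that remain.

```python
def normalize(utilities, outcomes):
    """Rescale one agent's utilities over `outcomes` to the [0, 1] interval."""
    values = [utilities[o] for o in outcomes]
    lo, hi = min(values), max(values)
    return {o: (utilities[o] - lo) / (hi - lo) for o in outcomes}

def group_score(agents, outcomes):
    """Sum of per-agent normalized utilities, for each outcome."""
    normed = [normalize(u, outcomes) for u in agents]
    return {o: sum(n[o] for n in normed) for o in outcomes}

# Two agents' raw (cardinal) utilities over four outcomes.
agents = [
    {"a": 0, "b": 3, "c": 1, "d": 10},
    {"a": 5, "b": 1, "c": 0, "d": 10},
]

full = group_score(agents, ["a", "b", "c", "d"])
restricted = group_score(agents, ["a", "b", "c"])  # drop only "d"

print(full["a"], full["b"])              # a ranked above b (0.5 vs 0.4)
print(restricted["a"], restricted["b"])  # b ranked above a (1.0 vs 1.2)
```

The relative ranking of a and b changes even though neither agent's preferences between a and b changed; only the normalization did, because "d" was each agent's maximum.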
Edit: I would also like to issue a general warning against taking theorems too seriously. Theorems are very delicate creatures; often if their assumptions are relaxed even slightly they totally fall apart. They aren't necessarily well-suited for reasoning about what to do in the real world (for example, I don't think the Aumann agreement theorem is all that relevant to humans).
Are the criteria for antifragility formal enough that there could be a list of antifragile theorems?