I often notice that in many (though not all) discussions about utility functions, one side is "for" their relevance while the other is "against" their usefulness, without either side saying explicitly what they mean. I don't think this is causing deep confusion among researchers here, but I'd still like to take a stab at disambiguating some of it, if nothing else for my own sake. Here are some distinct (albeit related) ways that utility functions come up in AI safety, in terms of the assumptions/hypotheses they give rise to:
- AGI utility hypothesis: The first AGI will behave as if it is maximizing some utility function.
- ASI utility hypothesis: As AI capabilities improve well beyond human level, the AI will behave more and more as if it is maximizing some utility function (or will have reached that ideal earlier and stayed there).
- Human utility hypothesis: Even though in some experimental contexts humans do not seem particularly goal-directed, utility functions are often a useful model of human preferences in AI safety research.
- Coherent Extrapolated Volition (CEV) hypothesis: For a given human H, there exists some utility function V such that if H were given the appropriate time/resources for reflection, H's values would converge to V.
Some points to be made:
- The "Goals vs Utility Functions" chapter of Rohin's Value Learning sequence, and the resulting discussion focused on differing intuitions about the AGI and ASI utility hypotheses. Specifically, the main post there pointed out that seemingly anything can be trivially modeled as being a "utility maximizer" (further discussion here), whereas only some intelligent agents can be described as being "goal-directed" (as defined in this post), and the latter is a more useful concept for reasoning about AI safety.
- AGI utility doesn't logically imply ASI utility, but I'd be surprised if anyone thinks it's very plausible for the former to hold while the latter fails. In particular, the coherence arguments and other pressures that move agents toward VNM seem to roughly scale with capabilities (a toy money-pump illustration of such pressures also appears after this list). A plausible stance could be that we should expect most ASIs to hew close to the VNM ideal, but that these pressures aren't quite so overwhelming at the AGI level; in particular, humans are fairly goal-directed but only "partially" VNM, so the goal-directedness pressures on an AGI will likely be of this order of magnitude. Depending on takeoff speeds, we might get many years to try aligning AGIs at this level of goal-directedness, which seems less dangerous than playing sorcerer's apprentice with VNM-based AGIs at the same level of capability. (Note: I might be reifying VNM here too much, in thinking of things as having a measure of "goal-directedness", with "very goal-directed" approximating VNM. But this basic picture could be wrong in all sorts of ways.)
- The human utility hypothesis is much vaguer than the others, and seems ultimately context-dependent. To my knowledge, the main argument in its favor is that most of economics is founded on it. On the other hand, behavioral economists have formulated models like prospect theory for cases where greater precision is needed than the simple VNM model provides, not to mention the cases where it breaks down more drastically. I haven't seen prospect theory used in AI safety research; I'm not sure whether this reflects more a) the size of the field and the fact that few researchers have had much need to explicitly model human preferences, or b) that we don't need to model humans more than superficially, since this kind of research is still at a very early theoretical stage with all sorts of real-world error terms abounding.
- The CEV hypothesis can be strengthened, consistent with Yudkowsky's original vision, to say that every human will converge to about the same values. But the extra "values converge" assumption seems orthogonal to one's opinions about the relevance of utility functions, so I'm not including it in the above list.
- In practice a given researcher's opinions on these tend to be correlated, so it makes sense to talk of "pro-utility" and "anti-utility" viewpoints. But I'd guess the correlation is far from perfect, and at any rate, the arguments connecting these hypotheses seem somewhat tenuous.
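To make the "anything can be trivially modeled as a utility maximizer" point concrete, here is a minimal sketch (the names are mine and purely illustrative, not from the posts linked above): given any policy whatsoever, we can define a utility function over trajectories that assigns 1 to exactly the trajectories the policy would produce and 0 to everything else, so the policy trivially "maximizes utility".

```python
# Minimal sketch: any policy is a "utility maximizer" for a suitably chosen
# utility function. A trajectory is a sequence of (observation, action) pairs;
# `policy` maps a tuple of the observations seen so far to an action.

def trivial_utility(policy):
    """Return a utility function over trajectories that `policy` maximizes."""
    def utility(trajectory):
        observations = []
        for observation, action in trajectory:
            observations.append(observation)
            if action != policy(tuple(observations)):
                return 0.0  # deviates from the policy at this step
        return 1.0          # exactly the behavior the policy would produce
    return utility
```

Of course, this "utility function" predicts nothing beyond the policy itself, which is the point: utility maximization in this trivial sense places no constraints on behavior, whereas goal-directedness does.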
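And here is a toy version of the money-pump style coherence argument, a standard textbook construction rather than anything specific to the discussion above: an agent with cyclic preferences will pay a small fee for each "upgrade" and can be traded in a circle indefinitely.

```python
# Toy money pump: an agent with cyclic preferences (A > B > C > A) pays a fee
# for every trade it prefers, and ends up back where it started, poorer each loop.
# `prefers(x, y)` is an assumed black-box query into the agent's preferences.

def money_pump(prefers, holding, cycle, fee=1.0, loops=3):
    """Offer the items in `cycle` repeatedly, charging `fee` per accepted trade;
    return the total money extracted from the agent."""
    extracted = 0.0
    for _ in range(loops):
        for offered in cycle:
            if prefers(offered, holding):
                holding = offered
                extracted += fee
    return extracted

# Example: cyclic preferences A > B > C > A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
prefers = lambda x, y: (x, y) in cyclic
print(money_pump(prefers, holding="A", cycle=["C", "B", "A"]))  # 9.0: three accepted trades per loop
```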
Ok, I see what you mean about independence of irrelevant alternatives only being a real coherence condition when the probabilities are objective (or otherwise known to be equal because they come from the same source, even if there isn't an objective way of saying what their common probability is).
But I disagree that this makes VNM applicable only to settings in which all sources of uncertainty have objectively correct probabilities. As I said in my previous comment, you only need there to exist some source of objective probabilities; you can then use preferences over lotteries involving those objective probabilities, together with preferences over related lotteries involving other sources of uncertainty, to determine what probability the agent must be assigning to those other sources of uncertainty.
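To make that concrete, here's a minimal sketch of the elicitation procedure I have in mind (all names are illustrative): binary-search over objective probabilities for the point at which the agent is indifferent between a bet on some event E and an objective lottery over the same prize; under VNM preferences (and assuming the agent prefers the prize to nothing), that crossover point is the probability the agent must be assigning to E.

```python
# Sketch: recover an agent's implicit probability for an event E by comparing a
# bet on E against lotteries with known, objective probabilities.
# `prefers(a, b)` is an assumed black-box query to the agent's preferences;
# `objective_lottery(p)` is the lottery paying the prize with objective probability p.
# Assumes the agent prefers the prize to nothing, so its preference for the
# lottery is monotone in p.

def elicit_probability(prefers, bet_on_E, objective_lottery, tol=1e-6):
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        p = (lo + hi) / 2
        if prefers(bet_on_E, objective_lottery(p)):
            lo = p  # the agent treats E as more likely than p
        else:
            hi = p
    return (lo + hi) / 2
```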
Re: the difference between VNM and Bayesian expected utility maximization: I take it from the word "Bayesian" that the way you're supposed to choose between actions does involve first coming up with probabilities for each outcome resulting from each action, and from "expected utility maximization" that these probabilities are to be used exactly the way the VNM theorem says they should be. Since the VNM theorem makes no assumptions about where the probabilities came from, these still sound essentially the same to me, except that "Bayesian expected utility maximization" is framed to emphasize that you have to get the probabilities somehow first.
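For concreteness, a minimal sketch of what I mean by the combined picture (the names here are mine, purely illustrative): the probabilities over outcomes are supplied first, from wherever the Bayesian reasoning gets them, and actions are then ranked by expected utility exactly as the VNM theorem prescribes.

```python
# Sketch: "Bayesian expected utility maximization" once the probabilities exist.
# `outcome_probs(a)` returns {outcome: probability} for action a (obtained
# somehow, e.g. by Bayesian updating); `utility(o)` is the agent's utility for o.

def choose_action(actions, outcome_probs, utility):
    """Pick the action maximizing the sum over outcomes of P(o | a) * U(o)."""
    def expected_utility(a):
        return sum(p * utility(o) for o, p in outcome_probs(a).items())
    return max(actions, key=expected_utility)
```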