My take is that the concept of expected utility maximization is a mistake. In Eliezer's Coherent decisions imply consistent utilities, you can see the mistake where he writes:
From your perspective, you are now in Scenario 1B. Having observed the coin and updated on its state, you now think you have a 90% chance of getting $5 million and a 10% chance of getting nothing.
Reflectively stable agents are updateless. When they make an observation, they do not limit their caring as though all the possible worlds where their observation differs do not exist.
As far as I know, every argument for utility assumes (or implies) that whenever you make an observation, you stop caring about the possible worlds where that observation went differently.
The original Timeless Decision Theory was not updateless. Nor were any of the more traditional ways of thinking about decisions. Updateless Decision Theory and subsequent decision theories corrected this mistake.
Von Neumann did not notice this mistake because he was too busy inventing the entire field. The point where we discover updatelessness is the point where we are supposed to realize that all of utility theory is wrong. I think we failed to notice.
Ironically the community that was the birthplace of updatelessness became the flag for taking utility seriously. (To be fair, this probably is the birthplace of updatelessness because we took utility seriously.)
Unfortunately, because utility theory is so simple, and so obviously correct if you haven't thought about updatelessness, it ended up being assumed all over the place, without tracking the dependency. I think we use a lot of concepts that are built on the foundation of utility without us even realizing it.
(Note that I am saying here that utility theory is a theoretical mistake! This is much stronger than just saying that humans don't have utility functions.)
What should I read to learn about propositions like "Reflectively stable agents are updateless" and "utility theory is a theoretical mistake"?
I notice that I'm confused. I've recently read the paper "Functional decision theory..." and it's formulated explicitly in terms of expected utility maximization.
To ask for decisions to be coherent, there need to be multiple possible situations in which decisions could be made, coherently across these situations or not. A UDT agent that picks a policy faces a single decision in a single possible situation. There is nothing else out there for the decision in this situation to be coherent with.
The options offered for the decision could be interpreted as lotteries over outcomes, but there is still only one decision to pick one lottery among them all, instead of many situations where the decision is to pick among a par...
That depends on what you mean by "suitably coherent." If you mean they need to satisfy the vNM independence axiom, then yes. But the point is that I don't see any good argument for why updateless agents should satisfy that axiom. The argument for that axiom passes through wanting to have a certain relationship with Bayesian updating.
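(For reference, the axiom being discussed, in its usual textbook form; this statement is my paraphrase rather than a quote from the thread:)

```latex
% vNM independence: for all lotteries A, B, C and any p in (0, 1],
\[
  A \succeq B
  \quad\Longleftrightarrow\quad
  pA + (1-p)\,C \;\succeq\; pB + (1-p)\,C .
\]
```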
I'm confused about the example you give. In the paragraph, Eliezer is trying to show that you ought to accept the independence axiom, cause you can be Dutch booked if you don't. I'd think if you're updateless, that means you already accept the independence axiom (cause you wouldn't be time-consistent otherwise).
And in that sense it seems reasonable to assume that someone who doesn't already accept the independence axiom is also not updateless.
Do you expect learned ML systems to be updateless?
It seems plausible to me that updatelessness of agents is just as "disconnected from reality" of actual systems as EU maximization. Would you disagree?
I haven't followed this very closely, so I'm kinda out of the loop... Which part of UDT/updatelessness says "don't go for the most utility" (no-maximization) and/or "utility cannot be measured / doesn't exist" (no-"foundation of utility", debatably no-consequentialism)? Or maybe "utility" here means something else?
As far as I know, every argument for utility assumes (or implies) that whenever you make an observation, you stop caring about the possible worlds where that observation went differently.
Are you just referring to the VNM theorems or are there other theorems you have in mind?
Note to self: It seems like the independence condition breaks for counterfactual mugging, assuming you think we should pay. Assume P is paying $50 and N is not paying; M is receiving $1 million if you would have paid in the counterfactual and zero otherwise. We have N > P, but 0.5P + 0.5M > 0.5N + 0.5M, in contradiction to independence. The issue is that the value of M is not independent of the choice between P and N.
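A minimal numeric check of the above, using the dollar amounts from the comment (the 50/50 coin and the payout structure are as described; the code itself is just my illustration):

```python
# Counterfactual mugging, evaluated at the policy level.
# P = pay $50 when asked (tails), N = don't pay; M = $1,000,000 on heads,
# but only for agents whose policy is to pay.

def expected_value(pays: bool) -> float:
    """Expected value of a policy over the 50/50 coin flip."""
    tails_branch = -50 if pays else 0          # cost of paying, if you pay
    heads_branch = 1_000_000 if pays else 0    # M depends on the P-vs-N choice
    return 0.5 * tails_branch + 0.5 * heads_branch

print(expected_value(pays=False))  # 0.0: conditional on tails alone, N > P
print(expected_value(pays=True))   # 499975.0: yet the paying policy wins overall
```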
Note that I am not saying here that rational agents can't have a utility function. I am only saying that they don't have to.
Reflectively stable agents are updateless. When they make an observation, they do not limit their caring as though all the possible worlds where their observation differs do not exist.
This is very surprising to me! Perhaps I misunderstand what you mean by "caring," but: an agent who's made one observation is utterly unable[1] to interact with the other possible-worlds where the observation differed; and it seems crazy[1] to choose your actions based on something they can't affect; and "not choosing my actions based on X" is how I would defi...
My personal take is that everything you wrote in this post is correct, and expected utility maximisers are neither the real threat, nor a great model for thinking about dangerous AI. Thanks for writing this up!
The key question I always focus on is: where do you get your capabilities from?
For instance, with GOFAI and ordinary programming, you have some human programmer manually create a model of the scenarios the AI can face, and then manually create a bunch of rules for what to do in order to achieve things. So basically, the human programmer has a bunch of really advanced capabilities, and they use them to manually build some simple capabilities.
"Consequentialism", broadly defined, represents an alternative class of ways to gain capabilities, namely choosing what to do based on it having the desired consequences. To some extent, this is a method humans uses, perhaps particularly the method the smartest and most autistic humans most use (which I suspect to be connected to LessWrong demographics but who knows...). Utility maximization captures the essence of consequentalism; there are various other things, such as multi-agency that one can throw on top of it, but those other things still mainly derive their capabilities from the core of utility maximization.
Self-supervised language models such as GPT-3 do not gain their capabilities from consequentialism, yet they have advanced capabilities nonetheless. How? Imitation learning, which basically works because of Aumann's agreement theorem. Self-supervised language models mimic human text, and humans do useful stuff and describe it in text, so self-supervised language models learn the useful stuff that can be described in text.
Risk that arises purely from language models or non-consequentialist RLHF might be quite interesting and important to study. I feel less able to predict it, though, partly because I don't know what the models will be deployed to do, or how much they can be coerced into doing, or what kinds of witchcraft are necessary to coerce the models into doing those things.
It is possible to me that imitation learning and RLHF can bring us to the frontier of human abilities, so that we have a tool that can solve tasks as well as the best humans can. However, I don't think it will be able to much exceed that frontier. This is still superhuman, because no human is as good as all the best humans at all the tasks. But it is not far-superhuman, even though I think being far-superhuman is possible, and a key part in it not being far-superhuman is that it cannot extend its capabilities. As such, I would expect consequentialism to be necessary for creating something that is far-superhuman.
I think many of the classical AI risk arguments apply to consequentialist far-superhuman AI.
If I understood your model correctly, GPT has capabilities because (1) humans are consequentialists, so they have capabilities, and (2) GPT imitates human output, (3) which requires GPT to learn the underlying human capabilities.
GPT is behavior cloning. But it is the behavior of a universe that is cloned, not of a single demonstrator, and the result isn’t a static copy of the universe, but a compression of the universe into a generative rule.
I think the above quote from janus would add to (3) that it requires GPT to also learn the environment and the human-environment interactions, aside from just mimicking human capabilities. I know what you said doesn't contradict this, but I think there's a difference in emphasis, i.e. imitation of humans (or some other consequentialist) not necessarily being the main source of capability.
Generalizing this, it seems obviously wrong that imitation-learning-of-consequentialists is where self-supervised language models get their capabilities from? (I strongly suspect I misunderstood your argument or what you meant by capabilities, but just laying it out anyway.)
Like, an LLM-style transformer pretrained on protein sequences gets its "protein-prediction capability" purely from "environment generative-rule learning," and none from imitation learning of a consequentialist's output.
...It is possible to me that imitation learning and RLHF can bring us to the frontier of human abilities, so that we have a tool that can solve tasks as well as the best humans can. However, I don't think it will be able to much exceed that frontier. This is still superhuman, because no human is as good as all the best humans at all the tasks. But it is not far-superhuman, even though I think being far-superhuman is possible, and a key part in it not being far-superhuman is that it cannot extend its capabilities. As such, I would expect consequentialism to b
If AI risk arguments mainly apply to consequentialist (which I assume is the same as EU-maximizing in the OP) AI, and the first half of the OP is right that such AI is unlikely to arise naturally, does that make you update against AI risk?
I don't think consequentialism is related to utility maximisation in the way you try to present it. There are many consequentialist agent architectures that are explicitly not utility maximising, e.g. Active Inference, JEPA, ReduNets.
Then you seem to switch your response to discussing that consequentialism is important for reaching the far-superhuman AI level. This looks at least plausible to me, but first, these far-superhuman AIs could have a non-UM consequentialistic agent architecture (see above), and second, DragonGod didn't say that the risk is ne...
If you're saying "let's think about a more general class of agents because EU maximization is unrealistic", that's fair, but note that you're potentially making the problem more difficult by trying to deal with a larger class with fewer invariants.
If you're saying "let's think about a distinct but not more general class of agents because that will be more alignable", then maybe, and it'd be useful to say what the class is, but: you're going to have trouble aligning something if you can't even know that it has some properties that are stable under self-reflection. An EU maximizer is maybe close to being stable under self-reflection and self-modification. That makes it attractive as a theoretical tool: e.g. maybe you can point at a good utility function, and then get a good prediction of what actually happens, relying on reflective stability; or e.g. maybe you can find nearby neighbors to EU maximization that are still reflectively stable and easier to align. It makes sense to try starting from scratch, but IMO this is a key thing that any approach will probably have to deal with.
I strongly suspect that expected utility maximisers are anti-natural for selection for general capabilities.
My current take is that we don't have good formalisms for consequentialist goal-directed systems that are weaker than expected utility maximization, and therefore don't really know how to reason about them. I think this is the main cause of the overemphasis on EUM.
For example, completeness as stated in the VNM assumptions is actually a really strong property. Aumann wrote a paper on removing completeness, but the utility function is no longer unique.
Speaking for myself, I sometimes use "EU maximization" as shorthand for one of the following concepts, depending on context:
Hmm, I just did a search of my own LW content, and can't actually find any instances of myself doing this, which makes me wonder why I was tempted to type the above. Perhaps what I actually do is, if I see someone else mention "EU maximization", I mentally steelman their argument by replacing the concept with one of the three above, if any one of them would make a sensible substitution.
Do you have any actual examples of anyone talking about EU maximization lately, in connection with AI risk?
I note that EU maximization has this baggage of never strictly preferring a lottery over outcomes to the component outcomes, and your steelmen appear to me not to carry that baggage. I think that baggage is actually doing work in some people's reasoning and intuitions.
I parsed the Rob Bensinger tweet I linked in the OP as being about expected utility maximising when I read it, but others have pointed out that wasn't necessarily a fair reading.
I think it depends on how you define expected utility. I agree that a definition that limits us only to analyzing end-state maximizers that seek some final state of the world is not very useful.
I don't think that for non-trivial AI agents, the utility function should or even can be defined as a simple function U: Ω → ℝ over a preferred final state of the world.
This function does not take into account time or the intermediate set of predicted future states that the agent will possibly have preferences over. The agent may have a preference for the final state of the universe, but most likely and realistically it won't have that kind of preference except in some special, strange cases. There are two reasons:
Any complex agent would likely have a utility function over possible actions, equal to the utility of the set of predicted futures after action A versus the set of predicted futures without action A (or over the differences between the worlds in those futures). By action I mean possibly a set of smaller actions (a hierarchy of actions, e.g. plans or strategies); it might not be atomic. This cannot be computed directly, so most likely it would be compressed to a set of important predicted future events, at the level of abstraction the agent cares about, which should approximate the future worlds with and without action A well enough.
This is also how we evaluate actions. We evaluate outcomes in the short and long terms. We also care differently depending on time scope.
I say this because most sensible "alignment goals", like "please don't kill humans", are time-based. What does it mean not to kill humans? It is clearly not about the final state (remember the Big Rip or Big Freeze). Maybe the AGI can kill some people for a year and then no more, assuming the population will go up and some people get killed anyway, so it does not matter long-term? No, it is not about the non-final but long-term outcome either. Really it is a function of intermediate states: something like the integral over time of some function U'(dΩ), where dΩ is the delta between the outcomes of action versus non-action, which can be approximated and compressed into an integral of a function over multiple events up to some time T that is the maximal sensible scope.
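A rough formalization of that last sentence (my notation, not the commenter's; U', ΔΩ, and T just name the pieces described above):

```latex
% Time-integrated utility of an action a, where \Delta\Omega_t(a) denotes the
% difference at time t between the predicted world with the action and the
% predicted world without it, and T is the maximal sensible time scope:
\[
  U(a) \;=\; \int_{0}^{T} U'\!\big(\Delta\Omega_t(a)\big)\, dt .
\]
```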
Most of the behaviors and preferences of humans are also time-scoped and time-limited, and take multiple future states into account, mostly short-scoped. I don't think that alignment goals can even be expressed in terms of a simple end-goal (a preferred final state of the world), as the problem partially comes from the attitude of the end goal justifying the means, which is at the core of a utility function defined as U: Ω → ℝ.
It seems plausible to me that even non-static human goals can be defined as utility functions over the set of differences in future outcomes (the difference between two paths of events). What is also obvious to me is that we as humans are able to modify our utility function to some extent, but not very much. Nevertheless, for humans the boundaries between most baseline goals, preferences, and morality versus instrumentally convergent goals are blurry. We have a lot of heuristics and biases, so our minds work out some things more quickly and more efficiently than if we relied on intelligence, thinking, and logic alone. The cost is lower consistency, less precision, and higher variability.
So I find it useful to think about agents as maximizers of a utility function, but not one defined over a single final goal, outcome, or state of the world. Rather, one that calculates the utility of an action from the difference between two ordered sets of events across different time scopes, and maximizes that.
I also don't think agents must initially be rationally stable with an unchangeable utility function. This is also a problem, as an agent can initially have a set of preferences with some hierarchy or weights, but it can also reason that some of these are incompatible with others, that the hierarchy is not logically consistent, and it might seek to change it for the sake of consistency, to become fully coherent.
I'm not an AGI, clearly, but this is just how I think about morality right now. I learned that killing is bad. But I can still question "why don't we kill?" and modify my worldview based on the answer (or maybe specify it in more detail in this matter). And it is a useful question, as it says a lot about edge cases including abortion, euthanasia, war, etc. The same might happen for rational agents, as they might update their utility function to be stable and consistent, maybe even questioning some of the learned parts of the utility function in the process. Yes, you can say that if you can change that, then it was not your terminal goal. Nevertheless, I can imagine agents with no terminal core goals at all. I'm not even sure if we as humans have any core terminal goals (maybe except avoiding death and one's own harm, in the case of most humans in most circumstances... but some overcome that, as Thích Quảng Đức did).
I agree, with the following caveats:
The argument in the tweet also goes through if the AI has 1000 goals as alien as maximizing granite spheres, which I would guess Rob thinks is more realistic.
As an aside: If one thinks 1000 goals is more realistic, then I think it's better to start communicating using examples like that, instead of "single goal" examples. (I myself lazily default to "paperclips" to communicate AGI risk quickly to laypeople, so I am critiquing myself to some extent as well.)
Anyways, on your read, how is "maximize X-quantity" different from "max EU where utility is linearly increasing in granite spheres"?
There's a trivial sense in which the agent is optimizing the world and you can rationalize a utility function from that, but I think an agent that, from our perspective, basically just maximizes granite spheres can look quite different from the simple picture of an agent that always picks the top action according to some (not necessarily explicit) granite-sphere valuation of the actions, in ways such that the argument still goes through.
The original wording of the tweet was "Suppose that the AI's sole goal is to maximize the number of granite spheres in its future light cone." This is a bit closer to my picture of EU maximization but some of the degrees of freedom still apply.
1. Yeah, I think that's fair. I may have pattern matched/jumped to conclusions too eagerly. Or rather, I've been convinced that my allegation is not very fair. But mostly, the Rob tweet provided the impetus for me to synthesise/dump all my issues with EU maximisation. I think the complaint can stand on its own, even if Rob wasn't quite staking the position I thought he was.
That said, I do think that multi objective optimisation is way more existentially safe than optimising for a single simple objective. I don't actually think the danger directly translates. And I think it's unlikely that multi-objective optimisers would not care about humans or other agents.
I suspect the value shard formation hypotheses would imply instrumental convergence towards developing some form of morality. Cooperation is game theoretically optimal. Though it's not yet clear how accurate the value shard formation hypothesis is.
2. I'm not relying too heavily on Shard Theory, I don't think. I mostly cited it because it's what actually led me in that direction, not because I fully endorse it. The only shard theory claims I rely on are:
Do you think the first is "non obvious"?
That said, I do think that multi objective optimisation is way more existentially safe than optimising for a single simple objective. I don’t actually think the danger directly translates. And I think it’s unlikely that multi-objective optimisers would not care about humans or other agents.
I think one possible form of existential catastrophe is that human values get only a small share of the universe, and as a result the "utility" of the universe is much smaller than it could be. I worry this will happen if only one or few of the objectives of multi objective optimization cares about humans or human values.
Also, if one of the objectives does care about humans or human values, it might still have to do so in exactly the right way in order to prevent (other forms of) existential catastrophe, such as various dystopias. Or if more than one cares, they might all have to care in exactly the right way. So I don't see multi objective optimisation as much safer by default, or much easier to align.
I think that multi-decision-influence networks seem much easier to align and much safer for humans.
I think that multi-decision-influence networks seem much easier to align and much safer for humans.
It seems fine to me that you think this. As I wrote in a previous post, "Trust your intuitions, but don’t waste too much time arguing for them. If several people are attempting to answer the same question and they have different intuitions about how best to approach it, it seems efficient for each to rely on his or her intuition to choose the approach to explore."
As a further meta point, I think there's a pattern where because many existing (somewhat) concrete AI alignment approaches seem doomed (we can fairly easy see how they would end up breaking), people come up with newer, less concrete approaches which don't seem doomed, but only because they're less concrete and therefore it's harder to predict what they would actually do, or because fewer people have looked into them in detail and tried to break them. See this comment where I mentioned a similar worry with regard to Paul Christiano's IDA when it was less developed.
In this case, I think there are many ways that a shard-based agent could potentially cause existential catastrophes, but it's hard for me to say more, since I don't know the details of what your proposal will be.
(For example, how do the shards resolve conflicts, and how will they eventually transform into a reflectively stable agent? If one of the shards learns a distorted version of human values, which would cause an existential catastrophe if directly maximized for, how exactly does that get fixed by the time the agent becomes reflectively stable? Or if the agent never ends up maximizing anything, why isn't that liable to be a form of existential catastrophe? How do you propose to prevent astronomical waste caused by the agent spending resources on shard values that aren't human values? What prevents the shard agent from latching onto bad moral/philosophical ideas and causing existential catastrophes that way?)
I don't want to discourage you from working more on your approach and figuring out the details, but at the same time it seems way too early to say, hey let's stop working on other approaches and focus just on this one.
I think these are great points, thanks for leaving this comment. I myself patiently await the possible day where I hit an obvious shard theory landmine which has the classic alignment-difficulty "feel" to it. That day can totally come, and I want to be ready to recognize if it does.
at the same time it seems way too early to say, hey let's stop working on other approaches and focus just on this one.
FWIW I'm not intending to advocate "shard theory or GTFO", and agree that would be bad as a community policy.
I've tried to mention a few times[1] (but perhaps insufficiently prominently) that I'm less excited about people going "oh yeah I guess shard theory is great or something, let's just think about that now" and more excited about reactions like "Oh, I guess I should have been practicing more constant vigilance, time to think about alignment deeply on my own terms, setting aside established wisdom for the moment." I'm excited about other people thinking about alignment from first principles and coming up with their own inside views, with their own theories and current-best end-to-end pictures of AGI training runs.
From Inner and outer alignment decompose one hard problem into two extremely hard problems:
A-Outer: Suppose I agreed. Suppose I just dropped outer/inner. What next?
A: Then you would have the rare opportunity to pause and think while floating freely between agendas. I will, for the moment, hold off on proposing solutions. Even if my proposal is good, discussing it now would rob us of insights you could have contributed as well. There will be a shard theory research agenda post which will advocate for itself, in due time.
I've also made this point briefly at the end of in-person talks. Maybe I should say it more often.
Cooperation is game theoretically optimal.
This is a claim I strongly disagree with, assuming there aren't enforcement mechanisms like laws or contracts. If there isn't enforcement, then this reduces to the Prisoner's Dilemma, where defection is game-theoretically optimal. Cooperation only works if things can be enforced, and the likelihood that we will be able to enforce things like contracts on superhuman intelligences is essentially like that of animals enforcing things on a human, i.e. so low that it's not worth privileging the hypothesis.
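A minimal sketch of why that is, with illustrative payoff numbers of my own choosing (the standard one-shot Prisoner's Dilemma structure): whatever the other player does, defecting pays at least as much, so without enforcement it is the dominant strategy.

```python
# One-shot Prisoner's Dilemma: (my_move, their_move) -> my payoff.
PAYOFFS = {
    ("cooperate", "cooperate"): 3,
    ("cooperate", "defect"): 0,
    ("defect", "cooperate"): 5,
    ("defect", "defect"): 1,
}

for their_move in ("cooperate", "defect"):
    best = max(("cooperate", "defect"), key=lambda m: PAYOFFS[(m, their_move)])
    print(f"If they {their_move}, my best response is to {best}")
# Prints "defect" both times: defection dominates without enforcement.
```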
And this is important, because it speaks to why the alignment problem is hard: agents with vastly differing capabilities can't enforce much of anything, so defection is going to happen. And I think this prediction bears out in real-life relations with animals; that is, humans can defect consequence-free, so this usually happens.
One major exception is pets, where the norm really is cooperation, and the version that would be done for humans is essentially benevolent totalitarianism. Life's good in such a society, but modern democratic freedoms are almost certainly gone or so manipulated that it doesn't matter.
That might not be bad, but I do want to note that game theory without enforcement is where defection rules.
instrumental convergence towards developing some form of morality.
That respects the less capable agent's wants, and does so stably, is the necessary thing. And the answer to this is negative, except in the pets case. And even here, this will entail the end of democracy and most freedom as we know it. It might actually be benevolent totalitarianism, and you may make an argument that this is desirable, though I do want to note the costs.
In one sense, I no longer endorse the previous comment. In another sense, I sort of endorse the previous comment.
I was basically wrong about alignment requiring human values to be game-theoretically optimal, and I think that cooperation is actually doable without relying on game-theoretic tools like enforcement, because the situation with human alignment is very different from the situation with AI alignment. We have access to the AI's brain and can directly reward good things and negatively reward bad things, combined with the fact that we have a very powerful optimizer called SGD that lets us straightforwardly select over minds and directly edit the AI's brain. These aren't things we have available for aligning humans, partially for ethical reasons and partially for technological reasons.
I also think my analogy of human-animal alignment is actually almost as bad as human-evolution alignment, which is worthless, and instead the better analogy for how to predict the likelihood of AI alignment is prefrontal cortex-survival value alignment, or innate reward alignment, which is very impressive alignment.
However, even with that assumption of aligned AIs, I do think that democracy is likely to decay pretty quickly under AI, especially because of the likely power imbalances, and especially hundreds of years into the future. We will likely retain some freedoms under aligned AI rule, but I expect it to be a lot less than what we're used to today, and it will transition into a form of benevolent totalitarianism.
Optimising multiple objective functions in a way that cannot be collapsed into a single utility function mapping to e.g. the reals.
I guess multi objective optimisation can be represented by a single utility function that maps to a vector space, but as far as I'm aware, utility functions usually have a field as their codomain.
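A small sketch of what that buys you (my own illustrative code, not from the thread): with a vector-valued "utility", options are only partially ordered by Pareto dominance, so there is no single number forcing every trade-off.

```python
# Pareto dominance over vector-valued "utilities": a dominates b only if it
# is at least as good on every objective and strictly better on one.
def dominates(a: tuple, b: tuple) -> bool:
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

options = {"A": (3, 1), "B": (1, 3), "C": (0, 0)}  # (objective_1, objective_2)

pareto_frontier = [
    name for name, u in options.items()
    if not any(dominates(v, u) for v in options.values())
]
print(pareto_frontier)  # ['A', 'B']: neither dominates the other; C is dominated
```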
I think most examples of "arguments from expected utility maximisation" are going to look like what Rob wrote: not actually using expected utility maximization, but rather "having goals that you do a pretty good job at accomplishing". This gets you things like "more is better" with respect to resources like negentropy and computation, it gets you the idea that it's better to raise the probability that your goal is achieved, it gets you some degree of resisting changes to goal content (altho I think contra Omohundro this can totally happen in bargaining scenarios), if "goal content" exists, and it gets you "better to not be diverted from your goal by meddling humans".
Also: I don't understand how one is supposed to get from "trained agents have a variety of contextual influences on decision making" to "trained agents are not expected utility maximizers", without somehow rebutting the arguments people make for why utility maximizers are good - and once we refute these arguments, we don't need talk of "shards" to refute them extra hard. Like, you can have different influences on your behaviour at different times that all add up to coherence. For example, one obvious influence that would plausibly be reinforced is "think about coherence so you don't randomly give up resources". Maybe this is supposed to make more sense if we use "expected utility maximizer" in a way that excludes "thing that is almost expected-utility-optimal" or "thing that switches between different regimes of expected utility maximization" but that strikes me as silly.
Separately from Scott's answer, if people reason
I think both (1) and (3) are sketchy/wrong/weird.
(1) There's a step like "Don't you want to save as many lives as possible? Then you have to coherently trade off opportunities by assigning a value to each life." and the idea that this kind of reasoning then pins down "you now maximize, or approximately maximize, or want to maximize, some utility function over all universe-histories." This is just a huge leap IMO.
(3) We don't know what the entities care about, or even that what they care about cleanly maps onto tileable, mass-producible, space-time additive quantities like "# of diamonds produced."
Also, I think that people mostly just imagine specific kinds of EU maximizers (e.g. over action-observation histories) with simple utility functions (e.g. one we could program into a simple Turing machine, and then hand to AIXI). And people remember all the scary hypotheticals where AIXI wireheads, or Eliezer's (hypothetical) example of an outcome-pump. I think that people think "it'll be an EU maximizer" and remember AIXI and conclude "unalignable" or "squeezes the future into a tiny weird contorted shape unless the utility function is perfectly aligned with what we care about." My imagined person acknowledges "mesa optimizers won't be just like AIXI, but I don't see a reason to think they'll be fundamentally differently structured in the limit."
On these perceptions of what happens in common reasoning about these issues, I think there is just an enormous number of invalid reasoning steps, and I can tell people about a few of them but—even if I make myself understood—there usually doesn't seem to be an internal error thrown that leads to a desperate effort to recheck other conclusions and ideas drawn from invalid steps. EU-maxing and its assumptions seep into a range of alignment concepts (including exhaustive search as a plausible idealization of agency). On my perceptions, even if someone agrees that a specific concept (like exhaustive search) is inappropriate, they don't seem to roll back belief-updates they made on the basis of that concept.
My current stance is "IDK what the AI cognition will look like in the end", and I'm trying not to collapse my uncertainty prematurely.
My personal suspicion is that an AI being indifferent between a large class of outcomes matters little; it's still going to absolutely ensure that it hits the pareto frontier of its competing preferences.
Hitting the pareto frontier looks very different from hitting the optimum of a single objective.
I don't think those arguments that rely on EU maximisation translate.
Agree with everything, including the crucial conclusion that thinking and writing about utility maximisation is counterproductive.
Just one minor thing that I disagree with in this post: while simulators as a mathematical abstraction are not agents, the physical systems that are simulators in our world, e.g. LLMs, are agents.
An attempt to answer the question in the title of this post, although that could be a rhetorical one:
I wish that we were clearer that in a lot of circumstances we don't actually need a utility maximiser for our argument, but rather an AI that is optimising sufficiently hard. However, the reason people often look at utility maximisers is to see what the behaviour will look like in the limit. I think this is a sensible thing to do, so long as we remember that we are looking at behaviour in the limit.
Unfortunately, I suspect competitive dynamics and the unilateralist's curse might push us further down this path than we'd like.
As an additional reason to be suspicious of arguments based on expected utility maximization, VNM expected utility maximizers aren't embedded agents. Classical expected utility theory treats computations performed by EUMs as having no physical side effects (e.g., energy consumption or waste heat generation), and the hardware that EUMs run on is treated as separate from the world that EUMs maximize utility over. Classical expected utility theory can't handle scenarios like self-modification, logical uncertainty, or the existence of other copies of the agent in the environment. Idealized EUMs aren't just unreachable via reinforcement learning, they aren't physically possible at all. An argument based on expected utility maximization that doesn't address embedded agency is going to ignore a lot of factors that are relevant to AI alignment.
For findability, a link to previous struggles I have had with this.
I still have serious trouble trying to get what people include in "expected utility maximization". A utility function is just a restatement of preferences. It does and requires nothing.
I collected some bits and components of what this take (as cognizable to me) is actually saying.
static total order over preferences (what a utility function implies)
This claims that utility functions have temporal translation symmetry built-in.
maximising a simple expected utility function
This claims that a utility function means that an agent has internal representations of its affordances (or some kind of self-control logic). I disagree / I don't understand.
Suppose you want to test how fire-safe agents are. You do so by putting an appendage of theirs on a hot stove. If the agent rests its appendage on the stove, you classify it as defective. If the agent removes its appendage from the stove, you classify it as compliant. You test rock-bot and spook-bot. Rock-bot fails, and does not have any electronics inside its shell. Spook-bot just has a reflex retracting everything upon a pain signal, and passes. Neither bot involves making a world-model or considering options. Another way of phrasing this is that you find bots whose utility function values resting the appendage to a great degree to be undesirable.
maximising a simple expected utility function
This claims that expected utility maximisation involves using an internal representation that is some combination of: fast to use in deployment, has low hardware space requirements to store, uses little programmer time to code, uses few programming lines to encode.
And I guess in the mathematical sense this line of thought goes in the direction of "the utility function has a small, finite number of terms as an algebraic expression".
So the things I have fished out and explicated:
Not all decision-making algorithms work by preferring outcomes, and not all decision-making algorithms that work by preferring outcomes have preferences that form a total preorder over outcomes, which is what would be required to losslessly translate those preferences into a utility function. Many reasonable kinds of decision-making algorithms (for example, ones that have ceteris paribus preferences) do not meet that requirement, including the sorts we see in real world agents. I see no reason to restrict ourselves to the subset that do.
So the phenomenological meaning is what you centrally mean?
I do not advocate for any of the 3 meanings, but I want to figure out what you are against.
To me a utility function is a description of the impact of the agent's existence, and even saying that it refers to an algorithm is a misuse of the concept.
To be honest I'm not sure what you mean. I don't think so?
An agent makes decisions by some procedure. For some agents, the decisions that procedure produces can be viewed as choosing the more preferred outcome (i.e. when given a choice between A and B, if its decision procedure deterministically chooses A, we'd describe that as "preferring A over B"). For some of those agents, the decisions they make have some additional properties, like that they always either consistently choose A over B or are consistently indifferent between them. When you have an agent like that and combine it with probabilistic reasoning, you get an agent whose decision-making can be compressed into a single utility function.
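A toy sketch of that compression step (my own illustration, assuming strict, consistent pairwise preferences over finitely many outcomes): the agent's choices can be sorted and read off as a single numeric utility.

```python
from functools import cmp_to_key

RANKING = {"apple": 2, "banana": 1, "carrot": 0}  # stands in for the agent's behaviour

def prefers(a: str, b: str) -> int:
    """+1 if the agent picks a over b, -1 if b over a (strict preferences assumed)."""
    return (RANKING[a] > RANKING[b]) - (RANKING[a] < RANKING[b])

outcomes = ["banana", "carrot", "apple"]
ordered = sorted(outcomes, key=cmp_to_key(prefers))   # sort by the agent's own choices
utility = {o: i for i, o in enumerate(ordered)}       # one number per outcome
print(utility)  # {'carrot': 0, 'banana': 1, 'apple': 2}: choices compressed to a function
```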
Even non-choosers can already be made into a utility function.
That notion of chooser is sensible. I think it is important to differentiate between "giving a choice" and "forming a choice", i.e. whether it is the agent or the environment doing it. Seating a rock-bot in front of a chess board can be "giving a choice" without "forming a choice" ever happening (rock-bot is not a chooser). Similarly, while the environment "gives a choice to pull the arm away", spook-bot never "forms a choice" (because it is literally unimaginable for it to do otherwise) and is not a chooser.
Even spook-bot is external-situation consistent, and doesn't require being a chooser to do that. Only a chooser can ever be internal-situation consistent (and even then it should be relativised to particular details of the internal state, i.e. "Seems I can choose between A and B" and "Seems I can choose between A and B. Oh, there is a puppy in the window." are in the same bucket), but that is hard to approach, as the agent is free to build representations as it wants.
So sure, if you have an agent that is internal-situation consistent along some of its internal-situation details, and you know what details those are, then you can specify which bits of the agent's internal state you can forget without impacting your ability to predict its external actions.
Going over this revealed a trap I had been falling for. "Expected utility" involves mental representations, while "utility expectation" is about statistics, of which there might not be awareness. An agent that makes the choice with the highest utility expectation is statistically as suffering-free as possible. An agent that makes the choice with the highest expected utility is statistically minimally (subjectively) regretful.
I think that solving alignment for EV maximizers is a much stronger version of alignment than e.g. prosaic alignment of LLM-type models. Agents seem like they'll be more powerful than Tool AIs. We don't know how to make them, but if someone does, and capabilities timelines shorten drastically, it would be awesome to even have a theory of EV maximizer alignment before then.
Reinforcement learning does create agents, those agents just aren't expected utility maximisers.
Claims that expected utility maximisation is the ideal or limit of agency seem wrong.
I think expected utility maximisation is probably anti-natural to generally capable optimisers.
Epistemic Status
Unsure[1], partially noticing my own confusion. Hoping Cunningham's Law can help resolve it.
Related Answer
Confusions About Arguments From Expected Utility Maximisation
Some MIRI people (e.g. Rob Bensinger) still highlight EU maximisers as the paradigm case for existentially dangerous AI systems. I'm confused by this for a few reasons:
I don't expect the systems that matter (in the par human or strongly superhuman regime) to be expected utility maximisers. I think arguments for AI x-risk that rest on expected utility maximisers are mostly disconnected from reality. I suspect that discussing the perils of expected utility maximisation in particular — as opposed to e.g. dangers from powerful (consequentialist?) optimisation processes — is somewhere between being a distraction and being actively harmful[3].
I do not think expected utility maximisation is the limit of what generally capable optimisers look like[4].
Arguments for Expected Utility Maximisation Are Unnecessary
I don't think the case for existential risk from AI rests on expected utility maximisation. I kind of stopped alieving expected utility maximisers a while back (only recently have I synthesised explicit beliefs that reject it), but I still plan on working on AI existential safety, because I don't see the core threat as resulting from expected utility maximisation.
The reasons I consider AI an existential threat mostly rely on:
I do not actually expect extinction near term, but it's not the only "existential catastrophe":
I optimised for writing this quickly, so my language may be stronger/more confident than I actually feel. I may not have spent as much time accurately communicating my uncertainty as may have been warranted.
Correct me if I'm mistaken, but I'm under the impression that RL is the main training paradigm we have that selects for agents.
I don't necessarily expect that our most capable systems would be trained via reinforcement learning, but I think our most agentic systems would be.
There may be significant opportunity cost via diverting attention from other more plausible pathways to doom.
In general, I think exposing people to bad arguments for a position is a poor persuasive strategy as people who dismiss said bad arguments may (rationally) update downwards on the credibility of the position.
I don't necessarily think agents are that limit either. But as "Why Subagents?" shows, expected utility maximisers aren't the limit of idealised agency.