Epistemic Status
Unsure[1], partially noticing my own confusion. Hoping Cunningham's Law can help resolve it.
Confusions About Arguments From Expected Utility Maximisation
Some MIRI people (e.g. Rob Bensinger) still highlight EU maximisers as the paradigm case for existentially dangerous AI systems. I'm confused by this for a few reasons:
- Not all consequentialist/goal directed systems are expected utility maximisers
- E.g. humans
- Some recent developments make me sceptical that VNM expected utility maximisers are a natural form of generally intelligent systems
- Wentworth's subagents provide a model for inexploitable agents that don't maximise a simple unitary utility function
- The main requirement for subagents to be a better model than unitary agents is path dependent preferences or hidden state variables
- Alternatively, subagents natively admit partial orders over preferences (see the subagent sketch after this list)
- If I'm not mistaken, utility functions seem to require a (static) total order over preferences
- This might be a very unreasonable ask; it does not seem to describe humans, animals, or even existing sophisticated AI systems
- I think the strongest implication of Wentworth's subagents is that expected utility maximisation is not the limit or idealised form of agency
- Shard Theory suggests that trained agents (via reinforcement learning[2]) form value "shards"
- Values are inherently "contextual influences on decision making"
- Hence agents do not have a static total order over preferences (what a utility function implies) as what preferences are active depends on the context
- Preferences are dynamic (change over time), and the ordering of them is not necessarily total
- This explains many of the observed inconsistencies in human decision making
- A multitude of value shards does not admit analysis as a simple unitary utility function
- Reward is not the optimisation target
- Reinforcement learning does not select for reward maximising agents in general
- Reward "upweights certain kinds of actions in certain kinds of situations, and therefore reward chisels cognitive grooves into agents" (see the bandit sketch after this list)
- I'm thus very sceptical that systems optimised via reinforcement learning to be capable in a wide variety of domains/tasks converge towards maximising a simple expected utility function
- I am not aware that humanity actually knows training paradigms that select for expected utility maximisers
- Our most capable/economically transformative AI systems are not agents and are definitely not expected utility maximisers
- Such systems might converge towards general intelligence under sufficiently strong selection pressure but do not become expected utility maximisers in the limit
- They do not become agents in the limit, and expected utility maximisation is a particular kind of agency
- I am seriously entertaining the hypothesis that expected utility maximisation is anti-natural to selection for general intelligence
- I'm not under the impression that systems optimised by stochastic gradient descent to be generally capable optimisers converge towards expected utility maximisers
- The generally capable optimisers produced by evolution aren't expected utility maximisers
- I'm starting to suspect that "search like" optimisation processes for general intelligence do not in general converge towards expected utility maximisers
- I.e. it may end up being the case that the only way to create a generally capable expected utility maximiser is to explicitly design one
- And we do not know how to design capable optimisers for rich environments
- We can't even design an image classifier
- I currently disbelieve the strong orthogonality thesis when translated into practice
- While it may in theory be feasible to design systems at any intelligence level with any final goal
- In practice, we cannot design capable optimisers.
- For intelligent systems created by "search like" optimisation, final goals are not orthogonal to cognitive ability
- Sufficiently hard optimisation for most cognitive tasks would not converge towards selecting for generally capable systems
- In the limit, what do systems selected for playing Go converge towards?
- I posit that said limit is not "general intelligence"
- The cognitive tasks/domains a system was optimised to perform well on may place an upper bound on the general capabilities of the system
- You do not need much optimisation power to attain optimal performance in logical tic tac toe (see the minimax sketch after this list)
- Systems selected for performance at logical tic tac toe should be pretty weak narrow optimisers because that's all that's required for optimality in that domain
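The partial-order point above can be made concrete with a toy sketch (my own illustration, not Wentworth's formalism; the outcome names and utility numbers are made up). A "committee" of two subagents strictly prefers one outcome to another only when both members agree, which yields a partial order: some pairs are comparable, others are simply incomparable, and no single static utility function over outcomes reproduces that pattern.

```python
from itertools import combinations

# Hypothetical outcomes and per-subagent utilities; the numbers are arbitrary.
OUTCOMES = ["outcome_a", "outcome_b", "outcome_c"]
SUBAGENT_UTILITIES = {
    "subagent_1": {"outcome_a": 3, "outcome_b": 2, "outcome_c": 1},
    "subagent_2": {"outcome_a": 2, "outcome_b": 3, "outcome_c": 1},
}

def committee_prefers(x, y):
    """The committee strictly prefers x to y only if every subagent does."""
    return all(u[x] > u[y] for u in SUBAGENT_UTILITIES.values())

for x, y in combinations(OUTCOMES, 2):
    if committee_prefers(x, y):
        print(f"{x} > {y}")
    elif committee_prefers(y, x):
        print(f"{y} > {x}")
    else:
        # Neither direction is unanimous: the pair is incomparable. A single
        # static utility function would be forced to rank them anyway.
        print(f"{x} and {y} are incomparable")
```

Because such a committee only moves on unanimous improvements, it never accepts a cycle of trades that leaves a subagent worse off, which is the sense in which it can be inexploitable without maximising a unitary utility function.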
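Similarly, here is a minimal sketch of the "reward chisels cognitive grooves" framing (my own toy example, not taken from the cited post; the contexts, actions, and reward function are invented). In a REINFORCE-style update on a contextual bandit, reward appears only inside the update rule, as a weight on how strongly the action actually taken in a context gets upweighted; the trained artefact is just a table of context-dependent action propensities and never computes reward at decision time.

```python
import math
import random
from collections import defaultdict

CONTEXTS = ["kitchen", "office"]           # hypothetical situations
ACTIONS = ["wash_dishes", "fetch_coffee"]  # hypothetical actions

# The policy's only "cognition": a table of per-context action preferences.
prefs = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def action_probs(context):
    """Softmax over the context's preference scores."""
    weights = {a: math.exp(p) for a, p in prefs[context].items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def reward(context, action):
    """Part of the training setup, not of the trained agent."""
    good = {("kitchen", "wash_dishes"), ("office", "fetch_coffee")}
    return 1.0 if (context, action) in good else 0.0

LEARNING_RATE = 0.5
for _ in range(5000):
    ctx = random.choice(CONTEXTS)
    probs = action_probs(ctx)
    act = random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]
    r = reward(ctx, act)
    # Reward enters only here: it scales how much the taken action is
    # upweighted (and the alternatives downweighted) in this context.
    for a in ACTIONS:
        indicator = 1.0 if a == act else 0.0
        prefs[ctx][a] += LEARNING_RATE * r * (indicator - probs[a])

print({ctx: action_probs(ctx) for ctx in CONTEXTS})
```

The result is a set of context-dependent behavioural grooves; nothing in the learned table is a representation of "reward", let alone a drive to maximise it.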
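And to make the tic tac toe point concrete (again my own sketch): exhaustive minimax over the full game tree, which contains fewer than 9! = 362,880 move sequences, already plays perfectly. That is all the optimisation power the domain ever demands, and nothing about the search generalises beyond the game.

```python
from functools import lru_cache

# Boards are 9-character strings ("X", "O", or " "); LINES are the winning triples.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    """Game value for X under optimal play: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if " " not in board:
        return 0
    nxt = "O" if player == "X" else "X"
    values = [minimax(board[:i] + player + board[i + 1:], nxt)
              for i, cell in enumerate(board) if cell == " "]
    return max(values) if player == "X" else min(values)

def best_move(board, player):
    """Index of an optimal move for `player` on `board` (brute force is enough)."""
    nxt = "O" if player == "X" else "X"
    moves = [i for i, cell in enumerate(board) if cell == " "]
    score = lambda i: minimax(board[:i] + player + board[i + 1:], nxt)
    return max(moves, key=score) if player == "X" else min(moves, key=score)

print(minimax(" " * 9, "X"))  # 0: perfect play by both sides is a draw
```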
I don't expect the systems that matter (in the par human or strongly superhuman regime) to be expected utility maximisers. I think arguments for AI x-risk that rest on expected utility maximisers are mostly disconnected from reality. I suspect that discussing the perils of expected utility maximisation in particular — as opposed to e.g. dangers from powerful (consequentialist?) optimisation processes — is somewhere between being a distraction and being actively harmful[3].
I do not think expected utility maximisation is the limit of what generally capable optimisers look like[4].
Arguments for Expected Utility Maximisation Are Unnecessary
I don't think the case for existential risk from AI rests on expected utility maximisation. I kind of stopped alieving in expected utility maximisers a while back (only recently have I synthesised explicit beliefs that reject it), but I still plan on working on AI existential safety, because I don't see the core threat as resulting from expected utility maximisation.
The reasons I consider AI an existential threat mostly rely on:
- Instrumental convergence for consequentialist/goal directed systems
- A system doesn't need to be a maximiser of a simple utility function to be goal directed (again, see humans)
- Selection pressures for power seeking systems
- Reasons
- More economically productive/useful
- Some humans are power seeking
- Power seeking systems promote themselves/have better reproductive fitness
- Human disempowerment is the immediate existential catastrophe scenario I foresee from power seeking
- Bad game theoretic equilibria
- This could lead towards dystopian scenarios in multipolar outcomes
- Humans getting outcompeted by AI systems
- Could slowly lead to an extinction
I do not actually expect extinction near term, but it's not the only "existential catastrophe":
- Human disempowerment
- Various forms of dystopia
[1]
I optimised for writing this quickly, so my language may be stronger/more confident than I actually feel. I may not have spent as much time accurately communicating my uncertainty as may have been warranted.
[2]
Correct me if I'm mistaken, but I'm under the impression that RL is the main training paradigm we have that selects for agents.
I don't necessarily expect that our most capable systems would be trained via reinforcement learning, but I think our most agentic systems would be.
[3]
There may be significant opportunity cost via diverting attention from other more plausible pathways to doom.
In general, I think exposing people to bad arguments for a position is a poor persuasive strategy as people who dismiss said bad arguments may (rationally) update downwards on the credibility of the position.
[4]
I don't necessarily think agents are that limit either. But as "Why Subagents?" shows, expected utility maximisers aren't the limit of idealised agency.
I think it depends on how you define expected utility. I agree that a definition that limits us only to analyzing end-state maximizers that seek some final state of the world is not very useful.
I don't think that, for non-trivial AI agents, the utility function should or even can be defined as a simple function U: Ω → ℝ over the preferred final state of the world.
This function does not take into account time or the intermediate set of predicted future states that the agent may have preferences over. The agent may have a preference about the final state of the universe, but most likely and realistically it won't have that kind of preference except in some special, strange cases. There are two reasons:
Any complex agent would likely have a utility function over possible actions, equal to the utility of the set of predicted futures after action A versus the set of predicted futures without action A (or of the differences between the worlds in those futures). By "action" I mean possibly a set of smaller actions (a hierarchy of actions, e.g. plans or strategies); it need not be atomic. This is not directly computable in any easy way, so it would most likely be compressed into a set of important predicted future events, at the level of abstraction the agent cares about, which approximate the future worlds with and without action A well enough.
This is also how we evaluate actions. We evaluate outcomes in the short and long terms. We also care differently depending on time scope.
I say this because most sensible "alignment goals", like "please don't kill humans", are time-based. What does it mean not to kill humans? It is clearly not about the final state. Remember the Big Rip or Big Freeze. Maybe the AGI can kill some people for a year and then no more, assuming the population will go up and some people are killed anyway, so it does not matter long-term? No, this is also not about a non-final but long-term outcome. Really it is a function of intermediate states: something like the integral over time of some function U′(dΩ), where dΩ is the delta between the outcomes of the action versus non-action at each moment, roughly ∫₀ᵀ U′(dΩ(t)) dt, which can be approximated and compressed into a sum of that function over multiple discrete events up to some time T, the maximal sensible scope (a sketch of this follows below).
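Concretely, a minimal sketch of what I mean (every name here, `predict_with`, `predict_without`, `delta_utility`, `world_model`, is a hypothetical placeholder, not a claim about how a real agent is or should be implemented): the utility of an action is a weighted sum, up to some horizon T, of a function of the difference between the predicted trajectory with the action and the predicted trajectory without it, rather than a function of any final state.

```python
from typing import Any, Callable

def action_utility(
    predict_with: Callable[[int], Any],      # predicted world state at time t if we act
    predict_without: Callable[[int], Any],   # predicted world state at time t if we don't
    delta_utility: Callable[[Any, Any], float],  # U'(dΩ): value of the difference at t
    horizon: int,                            # T, the maximal sensible time scope
    time_weight: Callable[[int], float] = lambda t: 1.0,  # e.g. discounting by scope
) -> float:
    """Discrete approximation of the integral of U'(dΩ(t)) over t in [0, T]."""
    return sum(
        time_weight(t) * delta_utility(predict_with(t), predict_without(t))
        for t in range(horizon)
    )

# Hypothetical usage: rank candidate actions by this score, not by U(final state).
# best = max(candidate_actions, key=lambda a: action_utility(
#     predict_with=lambda t: world_model(a, t),
#     predict_without=lambda t: world_model(None, t),
#     delta_utility=my_delta_utility,
#     horizon=T))
```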
Most of the behaviors and preferences of humans are also time-scoped and time-limited, and take multiple future states into account, mostly short-scoped. I don't think that alignment goals can even be expressed in terms of a simple end goal (a preferred final state of the world), as the problem partially comes from the attitude of the end justifying the means, which is at the core of a utility function defined as U: Ω → ℝ.
It seems plausible to me that even non-static human goals can be defined as utility functions over the set of differences in future outcomes (the difference between two paths of events). What is also obvious to me is that we as humans are able to modify our utility function to some extent, but not very much. Nevertheless, for humans the boundaries between most baseline goals, preferences, and morality versus instrumental convergence goals are blurry. We have a lot of heuristics and biases, so our minds work some things out more quickly and more efficiently than if we relied on intelligence, thinking, and logic. The cost is lower consistency, less precision, and higher variability.
So I find it useful to think about agents as maximizers of a utility function, but not one defined in terms of a single final goal, outcome, or state of the world. Rather, one where the utility of an action is calculated from the difference between two ordered sets of events across different time scopes.
I also don't think agents must initially be rationally stable with an unchangeable utility function. This is also a problem: an agent can initially have a set of preferences with some hierarchy or weights, but it can also reason that some of these are incompatible with others, or that the hierarchy is not logically consistent, and it might seek to change them for the sake of consistency, to become fully coherent.
I'm not an AGI, clearly, but this is just how I think about morality right now. I learned that killing is bad. But I can still question "why don't we kill?" and modify my worldview based on the answer (or maybe specify it in more detail in this matter). And it is a useful question, as it says a lot about edge cases including abortion, euthanasia, war, etc. The same might happen for rational agents, as they might update their utility function to be stable and consistent, maybe even questioning some of the learned parts of the utility function in the process. Yes, you can say that if you can change it then it was not your terminal goal. Nevertheless, I can imagine agents with no terminal core goals at all. I'm not even sure if we as humans have any core terminal goals (maybe except avoiding death and our own harm in the case of most humans in most circumstances... but some overcome that, as Thích Quảng Đức did).