If not (paused): utility = NormalUtilityFunction

If NormalUtilityFunction assigns a higher score to futures where its agent avoids being paused, then it has the usual instrumental-convergence problem of avoiding shutdown or utility-function modification.
I'll try modifying your example to instead use UtilityFunction2, which assigns, to worlds where its agent ends up paused, a score equal to the average score of {worlds where the agent will not end up paused} in the agent's prior as it was just before updating on observing being paused. This doesn't mean it wouldn't still instrumentally converge and start disassembling things, just that if you tried to pause it, it wouldn't have a direct incentive to specifically stop you, because if you succeed it still assigns a score equal to that of the average world in its prior where you didn't succeed. It does have some interesting indirect incentives to avoid being paused (or to pause itself), though, which I get into below.
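Here is a minimal Python sketch of that definition, assuming a discrete prior represented as a list of (probability, world) pairs; the names World, utility_2, and prior_before_pause_update are mine, purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class World:
    score: float        # the score NormalUtilityFunction assigns to this world
    agent_paused: bool  # whether the agent ends up paused in this world

def utility_2(world: World, prior_before_pause_update: list[tuple[float, World]]) -> float:
    # Non-paused worlds keep their normal score.
    if not world.agent_paused:
        return world.score
    # Paused worlds get the probability-weighted average score of the
    # non-paused worlds in the prior as it stood just before the agent
    # updated on observing the pause. (Assumes the prior puts nonzero
    # mass on at least one non-paused world.)
    unpaused = [(p, w) for p, w in prior_before_pause_update if not w.agent_paused]
    mass = sum(p for p, _ in unpaused)
    return sum(p * w.score for p, w in unpaused) / mass
```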
Consider how this utility function scores the worlds in which the agent takes one of two actions: action (1), which has a 99% chance of leading to being paused, and results in an average score of 10 if not paused; and action (2), which has a 5% chance of leading to being paused, and results in an average score of 5 if not paused. In this case, UtilityFunction2 prefers the world where the agent takes action (1). (This alone does not lead to resisting pauses or self-pausing; it's just an observation for now.)
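Spelled out as arithmetic (on my reading, the paused branch is valued at the action-conditional average score of the non-paused worlds, so the pause probability drops out entirely):

```python
# Numbers from the example above; the paused outcome is scored at the
# same value as the average non-paused outcome for that action.
p_paused_1, avg_if_unpaused_1 = 0.99, 10.0   # action (1)
p_paused_2, avg_if_unpaused_2 = 0.05, 5.0    # action (2)

eu_1 = p_paused_1 * avg_if_unpaused_1 + (1 - p_paused_1) * avg_if_unpaused_1  # 10.0
eu_2 = p_paused_2 * avg_if_unpaused_2 + (1 - p_paused_2) * avg_if_unpaused_2  # 5.0

assert eu_1 > eu_2  # action (1) is preferred despite a 99% chance of being paused
```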
Next, consider how "the average score of {worlds where the agent will not end up paused} in the agent's prior as it was just before updating on observing being paused" depends on the agent's own action policy. In short, the agent can commit to pausing itself in all but the highest-scoring world in its prior, creating an extreme version of action (1), as the toy example below shows. (Longer, original explanation in footnote.)
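A toy illustration of why that commitment works (hypothetical, equally weighted scores):

```python
# Hypothetical scores of the worlds in the agent's prior, weighted equally.
scores = [1.0, 2.0, 10.0]

# Without the commitment, paused worlds inherit the average non-paused score:
baseline = sum(scores) / len(scores)  # ~4.33

# Committing to self-pause in every world except the highest-scoring one
# leaves that single world as the only non-paused world, so the "average
# non-paused score" (and hence the score of every paused world) becomes:
with_commitment = max(scores)         # 10.0
```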
Separately, I also just noticed that, as worded - "as [the prior] was just before updating on observing being paused" - there are yet more unexpected implications. For example: at time 0, the set of non-paused futures is generally not made up of low-probability, exceptional worlds. But one hour later, the agent will soon be paused absent action on its part. And at the moment just before being paused, it is all but certain that it will be paused - so, in the prior at that final point, the worlds where it does not end up paused are exceptional ones, like simulation hypotheses where there is a last-moment intervention, which may themselves score higher or lower than remaining unpaused (through resistance) would. One could try to avoid this by specifying "some number of updates before", e.g. "the prior as it was 10,000 updates before updating on being paused", but this creates new complications: for example, if the agent learns information that decreases the score it assigns to the world, it will pause itself unless an outweighing positive update occurs within the next 9,999 updates.
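A toy sketch of how far that conditional average can drift as the prior concentrates on being paused (all numbers hypothetical):

```python
# Priors at two times, as (probability, score, paused) triples.
def avg_unpaused(prior):
    mass = sum(p for p, _, paused in prior if not paused)
    return sum(p * s for p, s, paused in prior if not paused) / mass

# At time 0, the pause is uncertain and the non-paused futures are ordinary:
t0 = [(0.5, 10.0, False), (0.5, 0.0, True)]
avg_unpaused(t0)        # 10.0

# Just before the pause, the only non-paused worlds left are exceptional
# ones (e.g. a last-moment intervention), whose scores can differ wildly:
t_final = [(0.001, -50.0, False), (0.999, 0.0, True)]
avg_unpaused(t_final)   # -50.0
```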
Isn't this a blocker for any discussion of particular utility functions?