One of the questions I've heard asked is "how do you design a utility function that would make the AI do nothing?"  That is, how could we put a pause button on an AI so that we could pause it if we wanted to?  I had an idea about how one could do this, and am sure it has already been thought of, so I'm curious to know why this doesn't end up working.

 

Why can't we just write something like:

If (paused) and not (taking actions): utility=PredictedUtilityIfNotPaused

If not (paused): utility=NormalUtilityFunction


quila

23

If not (paused): utility=NormalUtilityFunction

If NormalUtilityFunction assigns a higher score to futures where its agent avoids being paused, then this has the usual instrumental convergence problem: the agent is incentivized to avoid shutdown or modification of its utility function.

I'll try modifying your example to instead use UtilityFunction2, which assigns, to worlds where its agent ends up paused, a score equal to the average score assigned to {worlds where the agent will not end up paused} in the agent's prior as it was just before updating on observing being paused. This doesn't mean it wouldn't still instrumentally converge and start disassembling things[1], just that if you tried to pause it, it wouldn't have a direct incentive to specifically try to stop you, per se, because if you succeed it still assigns score equal to that of the average world in its prior where you didn't succeed. It has some interesting indirect incentives to avoid being paused (or pause itself), though, which I get into below.

Consider how this utility function would score the worlds in which the agent takes one action or the other: action (1), which has a 99% chance of leading to being paused, and results in an average score of 10 if not paused; and action (2), which has a 5% chance of leading to being paused, and results in an average score of 5 if not paused. In this case, UtilityFunction2 prefers the world where the agent takes action (1). (This alone does not lead to resisting pauses or self-pausing; this is just an observation for now).
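To make the arithmetic concrete, here is a minimal Python sketch of that comparison. It is only an illustration under assumptions not stated above: the function names are hypothetical, the "naive" baseline assumes a paused world scores 0, and UtilityFunction2 is simplified so that the paused-world score is the average unpaused score conditional on the action taken.

```python
def naive_expected_score(p_pause, avg_score_if_unpaused, score_if_paused=0.0):
    """Baseline for comparison (assumption: a paused world is worth 0)."""
    return p_pause * score_if_paused + (1 - p_pause) * avg_score_if_unpaused


def utility_function2_expected_score(p_pause, avg_score_if_unpaused):
    """Simplified UtilityFunction2: paused worlds are assigned the average
    score of the worlds where the agent does not end up paused."""
    score_if_paused = avg_score_if_unpaused
    return p_pause * score_if_paused + (1 - p_pause) * avg_score_if_unpaused


# (probability of being paused, average score if not paused), from the example
actions = {"action_1": (0.99, 10.0), "action_2": (0.05, 5.0)}

naive_choice = max(actions, key=lambda a: naive_expected_score(*actions[a]))
uf2_choice = max(actions, key=lambda a: utility_function2_expected_score(*actions[a]))

print("naive baseline prefers:", naive_choice)
print("UtilityFunction2 prefers:", uf2_choice)
```

Under these assumptions the baseline prefers action (2), because a likely pause is costly, while the simplified UtilityFunction2 prefers action (1), because the pause probability drops out and only the average unpaused score matters.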

Next, consider how "the average score of {worlds where the agent will not end up paused} in the agent's prior as it was just before updating on observing being paused" does depend on the agent's own action policy. In short, the agent can commit to pausing itself in all but the highest-scoring world in its prior to create an extreme version of action (1). (Longer, original explanation in footnote[2])
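A rough Python sketch of the self-pausing commitment, under simplifying assumptions: the prior is a small hypothetical list of (probability, score) worlds, nobody else pauses the agent, and the policy is "pause myself in any world scoring below a threshold". The world list and the function name are made up for illustration.

```python
def expected_score_under_policy(worlds, threshold):
    """worlds: list of (probability, score) pairs in the agent's prior.
    Policy (hypothetical): self-pause in every world scoring below threshold.
    Paused worlds are assigned the probability-weighted average score of the
    worlds that remain unpaused."""
    unpaused = [(p, s) for p, s in worlds if s >= threshold]
    unpaused_mass = sum(p for p, _ in unpaused)
    if unpaused_mass == 0:
        raise ValueError("policy pauses in every world; the average is undefined")
    avg_unpaused = sum(p * s for p, s in unpaused) / unpaused_mass
    paused_mass = sum(p for p, s in worlds if s < threshold)
    # Paused and unpaused worlds both contribute avg_unpaused per unit of mass.
    return paused_mass * avg_unpaused + unpaused_mass * avg_unpaused


# Illustrative prior: three worlds with made-up probabilities and scores.
worlds = [(0.5, 1.0), (0.3, 4.0), (0.2, 9.0)]

for threshold in (0.0, 4.0, 9.0):
    print(threshold, expected_score_under_policy(worlds, threshold))
```

In this toy setup, raising the threshold only ever raises the expected score, and it peaks when the agent stays unpaused solely in the highest-scoring world, which is the extreme version of action (1).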

Separately, I also just noticed that, as worded - "as [the prior] was just before updating on observing being paused" - there are yet more unexpected implications. For example: at time 0, the non-paused futures are generally not low-probability exceptional worlds; but suppose that, one hour later, the agent will soon be paused, absent its action; then at the moment just before being paused, it is all but certain that it will be paused - and so, in the prior at that final point, the worlds where it does not end up paused are exceptional ones, like simulation hypotheses where there's an intervention at the last moment, which of course may themselves have higher or lower scores than remaining unpaused (through resistance). One could try to avoid this by specifying "some number of updates before", e.g. "the prior as it was 10000 updates before updating on being paused", but this would create new complications (e.g., if that agent learns information which decreases the score it assigns to the world, it will pause itself if no outweighing positive update occurs in the next 9999 updates).

  1. ^

    If this utility function is otherwise a classical physical-thing-maximizer

  2. ^

    "Specifically, {the score of the average world where the agent does not end up paused} could be higher or lower conditional on whether the agent tries to resist being paused, (as resisting would add more worlds to the set of non-paused worlds, whose scores are averaged to return the score of any paused world). The same applies to pausing itself in case it observes a negative or not-as-positive worldstate, to have such worldstates instead be assigned the score of the remaining unpaused ones. (This does lead to self-pausing in all possible futures except the highest-scoring one)"


k6410

Thank you for this answer - I really like it! I'm trying to wrap my head around the last 2 paragraphs.

2nd to last paragraph:
Ok, so you're saying that it could choose to self-pause unless it was in the highest-scoring world?  I'm conceptualizing a possible world as an (action,result) pair, from which it could calculate (action, E[result]) pairs and then would choose the action with the highest E[result], while being paused would also provide max(E[result]).  So are you saying it would limit the possible actions it would take?  Tha...

1quila
It sounds like understanding functional decision theory might help you understand the parts you're confused about?

Yes, it would do whatever the highest-possible-score thing is, regardless of how unlikely it is.

By setting a self-pausing policy at the earliest point in time it can, yes. (Though I'm not sure if I'm responding to what you actually meant, or to some other thing that my mind also thinks can match to these words, because the intended meaning isn't super clear to me)

(To be clear, I'm conceptualizing the agent as having Bayesian uncertainty about what world it's in, and this is what I meant when writing about "worlds in the agent's prior")

An agent, (aside from edge cases where it is programmed to be inconsistent in this way), would not have priors about what it will do which mismatch its policy for choosing what to actually do, any change to the latter logically-corresponds to the agent having a different prior about itself, so an attempt to follow this logic would infinitely recur (each time picking a new action in response to the prior's change, which in turn logically changes the prior, and so on). This seems like a case of 'subjunctive dependence' to me (even though it's a bit of an edge case of that, where the two correlated things - what action an agent will choose, and the agent's prior about what action they will choose - are both localized in the same agent), which is why functional decision theory seems relevant.

I think there must be some confusion here, but I'm having trouble understanding exactly what you mean.

Short answer: the scenario, or set of scenarios, where it is not paused, are dependent on what choice it makes, not locked in and independent of it; and it can choose what choice it makes, so it can pick whatever choice corresponds to the set of unpaused futures which score higher.

Longer original answer: When you write, there is one possible future in it's prior where it does not get paused, and then write that this one future

ProgramCrafter

10

Your idea seems to break when the AI is unpaused: since it has not taken any beneficial actions while paused, its utility would suddenly drop from the "simulated" value to the "normal" one, meaning that the AI will likely resist being woken up.

Also, it assumes there is a separate module for making predictions, which cannot be manipulated by the agent. This assumption is not very probable in my view.

that AI will likely resist waking it up.

If the AI is resisting being turned on, then it would have to be already on, by which point the updates (to the AI's prior, and score assigned to it) would have already happened.