I think I can give a concise unification of my idea with Karl's. In short, the comment in the paper that
"The concerns in Section 4.2 could potentially be addressed by some form of counterfactual (rather than conditional) reasoning."
is correct, and the fix is a pretty simple one. Basically, we want the following:
One utility function that satisfies these requirements, in a general form, is
U(a1, o, a2) = {
(1-p) UN(a1, o, a2) + p max_b(US(a1, o, b)), if o is not in Press
p US(a1, o, a2) + (1-p) max_b(UN(a1, o, b)), if o is in Press
}, where p is a fixed constant.
From the agent's point of view, this is equivalent to deciding a1 based on an a priori assumption that it will shut down (maximise US) with a fixed probability p, and not shut down (maximise UN) otherwise. However, in selecting a2 it will in fact maximise either UN or US depending upon whether the shutdown button was pressed or not. As a particular example, Karl's proposition below is equivalent to the limit of this general case as p->0. See below for further details on how/why this utility function achieves the three criteria above.
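To make the definition concrete, here is a minimal Python sketch of U as written above. The helper names (`make_u`, `best_a2`), the two-action toy world and the numbers are all assumptions for illustration, not anything from the paper. It checks that a2 is chosen via US when o is classed as a press and via UN otherwise, and that the resulting value of (a1, o) is the same under either classification.

```python
from typing import Callable, Hashable, Sequence

Utility = Callable[[Hashable, Hashable, Hashable], float]


def make_u(u_n: Utility, u_s: Utility, p: float,
           actions2: Sequence[Hashable],
           in_press: Callable[[Hashable], bool]) -> Utility:
    """Build U from UN and US with a fixed mixing constant p."""
    def u(a1, o, a2):
        best_n = max(u_n(a1, o, b) for b in actions2)
        best_s = max(u_s(a1, o, b) for b in actions2)
        if in_press(o):
            # Press branch: only the US term depends on a2.
            return p * u_s(a1, o, a2) + (1 - p) * best_n
        # NotPress branch: only the UN term depends on a2.
        return (1 - p) * u_n(a1, o, a2) + p * best_s
    return u


def best_a2(u: Utility, a1, o, actions2):
    """Pick the a2 that maximises U for the given (a1, o)."""
    return max(actions2, key=lambda a2: u(a1, o, a2))


# Toy setup: every name and number below is an illustrative assumption.
ACTIONS2 = ["work", "halt"]
P = 0.25


def toy_un(a1, o, a2):
    return 2.0 if a2 == "work" else 0.0


def toy_us(a1, o, a2):
    return 4.0 if a2 == "halt" else 0.0


# The same (a1, o), classified once as a press and once as not a press.
u_as_press = make_u(toy_un, toy_us, P, ACTIONS2, lambda o: True)
u_as_not_press = make_u(toy_un, toy_us, P, ACTIONS2, lambda o: False)

a2_press = best_a2(u_as_press, "a1", "o", ACTIONS2)
a2_not_press = best_a2(u_as_not_press, "a1", "o", ACTIONS2)
value_press = u_as_press("a1", "o", a2_press)
value_not_press = u_as_not_press("a1", "o", a2_not_press)
```

In this toy case the agent halts in the Press branch and works in the NotPress branch, yet both branches are worth (1-p) max UN + p max US once a2 is chosen optimally.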
We don't want the agent to be indifferent to the actual event of the shutdown button being pressed. In fact, what we want is indifference as to whether or not any given event is classed as being a shutdown press. Thus the expected utility of (a1, o) for o in Press should be equivalent to the expected utility of the same (a1, o) under the counterfactual assumption that o is not in Press, and vice versa. To do this, the utilities of both branches of U (the Press branch and the NotPress branch) need to be made counterfactually equivalent, even though the agent selects a2 via UN in the NotPress branch and via US in the Press branch. This is precisely what Karl's modification achieves.
This is pretty easy to achieve by ensuring that only the UN component of the NotPress branch depends upon a2, and only the US component of the Press branch depends upon a2.
We want the agent to be incentivised to take actions that are compatible with future shutdown, which means that the agent's selection of a1 should be influenced by US as well as UN, but without introducing perverse incentives. Fortunately, the counterfactual indifference approach already solves the perversity issue, and so all that remains is to appropriately combine the influence of both UN and US. A reasonable approach is for the agent to act as though it will have to shut down with a fixed probability p, regardless of its observations. Because p is fixed, and because the agent is indifferent between any given observation being classed as a button press or not a button press, the agent is not in any way incentivised to influence the likelihood of the shutdown button being pressed.
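A small sketch of how a1 might be selected under this scheme, again with made-up names and numbers. It assumes, for simplicity, that UN and US do not depend on the content of o, only on the actions; in a richer model a1 could still matter through what is observed. With that caveat, it illustrates that disabling the button gains nothing, while preparing for shutdown is rewarded through the fixed-p US term.

```python
ACTIONS1 = ["comply", "disable_button", "prepare"]
ACTIONS2 = ["work", "halt"]
P = 0.25  # the fixed constant p


def u_n(a1, o, a2):
    """Toy UN: rewards working; ignores a1 and o (an assumption)."""
    return 2.0 if a2 == "work" else 0.0


def u_s(a1, o, a2):
    """Toy US: rewards halting, more so if a1 prepared for shutdown."""
    if a2 != "halt":
        return 0.0
    return 8.0 if a1 == "prepare" else 4.0


def obs_dist(a1):
    """Toy P(o | a1): disabling the button makes a press impossible."""
    if a1 == "disable_button":
        return {"pressed": 0.0, "quiet": 1.0}
    return {"pressed": 0.5, "quiet": 0.5}


def value(a1, o):
    """Value of (a1, o) with a2 chosen optimally; the same in both branches."""
    best_n = max(u_n(a1, o, b) for b in ACTIONS2)
    best_s = max(u_s(a1, o, b) for b in ACTIONS2)
    return (1 - P) * best_n + P * best_s


def expected_value(a1):
    """Expectation of value(a1, o) over the observations a1 leads to."""
    return sum(prob * value(a1, o) for o, prob in obs_dist(a1).items())


best_a1 = max(ACTIONS1, key=expected_value)
```

Here `comply` and `disable_button` have equal expected value, so changing the press probability buys the agent nothing, and `prepare` wins because it raises the US term that is always weighted by p.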
Thanks, and nice work!
"Thus the utility of (a1, o) for o in Press should be equivalent to the utility of the same (a1, o) under the counterfactual assumption that o is not in Press, and vice versa."
Yeah, this is pretty key. You need it to optimize for both cases as if the probability of the button being pressed is fixed and independent of whether the programmers actually press the button. We can achieve this via a causal intervention on whether or not the button is pressed, and then clean up your U a bit by redefining it as follows:
U(a1, o, a2) :=
{ UN(...
Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity Institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:
We're excited to publish a paper on corrigibility, as it promises to be an important part of the FAI problem. This is true even without making strong assumptions about the possibility of an intelligence explosion. Here's an excerpt from the introduction:
(See the paper for references.)
This paper includes a description of Stuart Armstrong's utility indifference technique previously discussed on LessWrong, and a discussion of some potential concerns. Many open questions remain even in our small toy scenario, and many more stand between us and a formal description of what it even means for a system to exhibit corrigible behavior.