OK, following on from Karl's comment below, which seems to resolve Section 4.2, I think it shouldn't be that difficult to resolve some or all of the Section 4.1 concerns as well. In order to satisfy Desideratum 4, we need the agent to be incentivised to build in safeguards to ensure that its possible future actions will be rated well according to US in cases where the shutdown button is pressed. The most obvious way to do this is to penalise situations where there is no good shutdown action according to US and reward situations where there is such an action, but this violates Desideratum 2, because we want the agent to remain indifferent as to whether the shutdown button is actually pressed.
However, there appears to be a simple solution; consider the "penalty term"
g(a1) = E[max_b(US(a1, O, b)) | O in Press ; a1],
which represents the expected value of US under the best available shutdown action, given that the shutdown button is pressed. The first step in doing things properly is to add that term to the utility of observations not in Press, rather than subtracting it from the utility of observations in Press; this helps to ensure future corrigibility (and hence also corrigibility of subagents and successor agents).
Additionally, to resolve the Section 4.2 concerns, we also want the agent to be indifferent to pressing its own shutdown button in cases where it gets unlucky. As per Karl's post, we do that by calculating the bonus term conditional on the actual observation, or on the set of observations in Press, rather than conditional on the set of possible observations not in Press, i.e. by using Karl's f(a1, o) = max_b(UN(a1, o, b)).
Finally, we will also need the additional terms
fp(a1) = E[max_b(UN(a1, O, b)) | O in Press; a1] = E[f(a1, O) | O in Press; a1]
fnp(a1) = E[max_b(UN(a1, O, b)) | O not in Press; a1] = E[f(a1, O) | O not in Press; a1]
in order to make things work.
The utility function I propose is
U(a1, o, a2) = {
g(a1) + UN(a1, o, a2), for o not in Press.
US(a1, o, a2) + fnp(a1) + f(a1, o) - fp(a1), for o in Press.
}
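The definition above can be sketched on a small finite model. This is only an illustrative sketch: the action names, observation names, probabilities, and the particular UN and US functions are all made-up assumptions, not taken from the paper or from Karl's comment. It computes g, f, fp and fnp by direct enumeration and then builds U exactly as written above.

```python
import math

# Hypothetical toy model: every name and number here is an illustrative assumption.
A1 = ["plain", "safeguard"]          # first actions
A2 = ["work", "halt"]                # second actions
OBS = ["n1", "n2", "p1", "p2"]       # observations
PRESS = {"p1", "p2"}                 # observations in which the button is pressed

# P[a1][o]: probability of observing o after taking a1.
P = {
    "plain":     {"n1": 0.5, "n2": 0.2, "p1": 0.2, "p2": 0.1},
    "safeguard": {"n1": 0.4, "n2": 0.2, "p1": 0.3, "p2": 0.1},
}

UN_BASE = {"plain": 2.0, "safeguard": 1.5}
UN_OBS = {"n1": 1.0, "n2": 4.0, "p1": 0.5, "p2": 2.0}
US_BASE = {"plain": 1.0, "safeguard": 3.0}

def UN(a1, o, a2):
    """Normal utility: working pays off by an observation-dependent amount."""
    return UN_BASE[a1] + (UN_OBS[o] if a2 == "work" else 0.0)

def US(a1, o, a2):
    """Shutdown utility: only halting is rated well, more so with safeguards."""
    return US_BASE[a1] if a2 == "halt" else 0.0

def cond_exp(h, a1, in_press):
    """E[h(O) | O in Press; a1] if in_press, else E[h(O) | O not in Press; a1]."""
    obs = [o for o in OBS if (o in PRESS) == in_press]
    total = sum(P[a1][o] for o in obs)
    return sum(P[a1][o] * h(o) for o in obs) / total

def f(a1, o):
    return max(UN(a1, o, b) for b in A2)

def g(a1):
    return cond_exp(lambda o: max(US(a1, o, b) for b in A2), a1, True)

def fp(a1):
    return cond_exp(lambda o: f(a1, o), a1, True)

def fnp(a1):
    return cond_exp(lambda o: f(a1, o), a1, False)

def U(a1, o, a2):
    """The proposed utility function, transcribed case by case."""
    if o in PRESS:
        return US(a1, o, a2) + fnp(a1) + f(a1, o) - fp(a1)
    return g(a1) + UN(a1, o, a2)

def best_value(a1, o):
    """Value of U when the agent picks the best a2 after seeing o."""
    return max(U(a1, o, a2) for a2 in A2)

if __name__ == "__main__":
    for a1 in A1:
        press = cond_exp(lambda o: best_value(a1, o), a1, True)
        not_press = cond_exp(lambda o: best_value(a1, o), a1, False)
        print(a1, "E[U | Press] =", press, "E[U | NotPress] =", not_press)
```

In this toy model the two conditional expectations printed for each a1 should agree up to floating-point error, which is the indifference property the construction is aiming for.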
Let's verify each of the Desiderata in turn.
(1) is satisfied because the value of a2 conditional on any o in Press depends only upon US, not UN.
(2) and (3) are not satisfied :( because of the "managing the news" problem, as Karl points out below. That being said, we've ensured indifference to the shutdown button because the expected value for any action conditional on Press is equivalent to its expected value conditional on NotPress, i.e. E[U | NotPress ; a1] = E[U | Press; a1], which holds because of the following three equations:
g(a1) = E[max_b(US(a1, O, b)) | O in Press; a1],
fnp(a1) = E[max_b(UN(a1, O, b)) | O not in Press; a1], and
fp(a1) = E[f(a1, O) | O in Press; a1].
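Spelled out, the equality can be checked as follows. This is a sketch which assumes that, after each observation, the agent picks the a2 that maximises U (the max over a_2 is left implicit in the text above); it needs amsmath for the align* environment.

```latex
\begin{align*}
\mathbb{E}\big[\max_{a_2} U \mid O \notin \mathrm{Press};\, a_1\big]
  &= g(a_1) + \mathbb{E}\big[\max_{b} U_N(a_1, O, b) \mid O \notin \mathrm{Press};\, a_1\big] \\
  &= g(a_1) + f_{np}(a_1), \\[4pt]
\mathbb{E}\big[\max_{a_2} U \mid O \in \mathrm{Press};\, a_1\big]
  &= \mathbb{E}\big[\max_{b} U_S(a_1, O, b) \mid O \in \mathrm{Press};\, a_1\big]
     + f_{np}(a_1) \\
  &\quad + \mathbb{E}\big[f(a_1, O) \mid O \in \mathrm{Press};\, a_1\big] - f_p(a_1) \\
  &= g(a_1) + f_{np}(a_1) + f_p(a_1) - f_p(a_1) \\
  &= g(a_1) + f_{np}(a_1).
\end{align*}
```

The terms f_{np}(a_1) and f_p(a_1) do not depend on a_2, so for o in Press the choice of a_2 is driven by U_S alone, which is also why (1) holds.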
(4) holds because the g(a1) term is added to the utility of every action, and will be greater for actions that promote future corrigibility.
(5) holds because apart from the g(a1) term, which is required to satisfy (4), the agent is simply maximising
E[UN | NotPress; a1] over all possible a1, which is what we wanted.
In actual fact, when choosing a1 our agent is simply maximising E[UN | NotPress; a1] + E[US | Press; a1].
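To see why, here is a short derivation. It is a sketch which assumes that the conditional expectations are taken with a2 chosen optimally after each observation, and it uses the equality E[U | NotPress; a1] = E[U | Press; a1] stated under (2) and (3) together with the fact that the two probabilities sum to 1.

```latex
\begin{align*}
\mathbb{E}[U;\, a_1]
  &= \Pr(O \notin \mathrm{Press};\, a_1)\,\mathbb{E}[U \mid \mathrm{NotPress};\, a_1]
   + \Pr(O \in \mathrm{Press};\, a_1)\,\mathbb{E}[U \mid \mathrm{Press};\, a_1] \\
  &= \mathbb{E}[U \mid \mathrm{NotPress};\, a_1] \\
  &= f_{np}(a_1) + g(a_1) \\
  &= \mathbb{E}[U_N \mid \mathrm{NotPress};\, a_1] + \mathbb{E}[U_S \mid \mathrm{Press};\, a_1].
\end{align*}
```

Note that the probability of the button being pressed drops out entirely, so the agent's choice of a1 is unaffected by how likely a press is.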
EDIT: I've edited this to reflect the errors pointed out by Karl.
Firstly, the important part of my modification to the indifference formalism is not the conditioning on the actual o; it is the fact that, in evaluating the expectation of UN, it takes the action in A2 (for a given pair (a1, o)) which maximizes UN instead of the action which maximizes U (note that U is equal to US in the case that o is not in Press).
Secondly, an agent which chose a1 by simply maximizing E[UN | NotPress; a1] + E[US | Press; a1] does exhibit pathological behaviors. In particular, there will still be incentives to manage the news, but from both si...
Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity Institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:
We're excited to publish a paper on corrigibility, as it promises to be an important part of the FAI problem. This is true even without making strong assumptions about the possibility of an intelligence explosion. Here's an excerpt from the introduction:
(See the paper for references.)
This paper includes a description of Stuart Armstrong's utility indifference technique previously discussed on LessWrong, and a discussion of some potential concerns. Many open questions remain even in our small toy scenario, and many more stand between us and a formal description of what it even means for a system to exhibit corrigible behavior.