What do we mean by "o in Press"? That whoever wrote the list defining "Press" made a different decision that day?
Maybe we can have the definition of "Press" as being a specific list (rather than being the reality described by that list), and the counterfactual being that a certain stochastic process erases a particular part of that list? Not sure about that... Need some sort of abstract version of this (might be doable).
This still seems vulnerable to Benja's blackmail. Let me explain:
Assume that a2 is irrelevant, that o=Pr, and that a1 is merely yes or no to the following deal: "Give me 1 utilon from UN, and if o is not Pr, then I give 2 utilons to UN". This is the only way of changing UN; otherwise UN=0.
Then the agent is immune to blackmail. It will compute U(YES,-,-) = U(YES,Pr,-) = US(YES,Pr,-)+E(UN|Pr not in Press, YES, Pr) = US(YES,Pr,-) - 1. On the other hand, U(NO,-,-) = US(NO,Pr,-).
So if US doesn't care about YES vs NO, it will reject the deal. Yay!
But now imagine that the deal is phrased differently: "Give me 1 utilon from UN, and if o is not in Press, then I give 2 utilons to UN"
Here E(UN|Pr not in Press, YES, Pr) becomes -1+2=+1, so the agent will accept the deal.
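To make the arithmetic for the two phrasings concrete, here is a minimal sketch in Python. It assumes US is indifferent between YES and NO (set to 0 for both), and it models the counterfactual "Pr not in Press" by simply treating Press as a set with Pr removed. The names `un_under_counterfactual`, `utility`, and `decide` are made up for illustration and are not from the paper.

```python
DEAL_NOT_PR = "o is not Pr"              # first phrasing of the deal
DEAL_NOT_IN_PRESS = "o is not in Press"  # second phrasing of the deal


def un_under_counterfactual(action, deal):
    """UN in the counterfactual world where o = Pr but Pr is not in Press."""
    if action == "NO":
        return 0  # rejecting the deal leaves UN at its default of 0
    o = "Pr"
    press = set()  # assumption: the counterfactual removes Pr from Press
    payout_triggered = {
        DEAL_NOT_PR: o != "Pr",              # false: o is still Pr
        DEAL_NOT_IN_PRESS: o not in press,   # true in the counterfactual
    }[deal]
    # Accepting costs 1 utilon of UN; the payout adds 2 if triggered.
    return -1 + (2 if payout_triggered else 0)


def utility(action, deal, us=0):
    """U(a1,-,-) = US(a1,Pr,-) + E(UN | Pr not in Press, a1, Pr)."""
    return us + un_under_counterfactual(action, deal)


def decide(deal):
    """Accept only if YES is strictly better than NO."""
    return "YES" if utility("YES", deal) > utility("NO", deal) else "NO"


if __name__ == "__main__":
    for deal in (DEAL_NOT_PR, DEAL_NOT_IN_PRESS):
        print(deal, utility("YES", deal), utility("NO", deal), decide(deal))
```

Under these assumptions the first phrasing gives YES a utility of -1 against 0 for NO, while the second gives +1 against 0, which is exactly the flip described above.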
This may depend on how it cashes out statements like "if o is not in Press".
Yep, I think you're right.
My preferred method of cashing out "do(o not in Press)" etc. is to say that observations are a tuple (p, r) where p is either Press or not Press and r is the rest of the observation (unrelated to the button). Now the causal intervention is done only on p, so (desugaring everything) we define
U(a1, (p, r), a2) :=
{ UN(a1, (p, r), a2) + E[US(a1, (P, R), A2(a1, (P, R))) | do(P := Press)] if p = not Press
, US(a1, (p, r), a2) + E[UN(a1, (P, R), A2(a1, (P, R))) | do(P := not Press)] else }
Then whether or not the agent pays t...
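The desugared definition of U above can be sketched roughly as follows. This is only an illustration under strong assumptions: the case split is taken to be on p (the button component of the observation), R is drawn from a fixed distribution `r_dist` that is unaffected by the intervention on P, and `us`, `un`, `policy_a2`, and `r_dist` are all hypothetical toy stand-ins rather than anything defined in the paper.

```python
PRESS = "Press"
NOT_PRESS = "not Press"


def combined_utility(a1, obs, a2, *, us, un, policy_a2, r_dist):
    """U(a1, (p, r), a2) with the causal intervention applied only to p."""
    p, _r = obs

    def expected(u, forced_p):
        # E[u(a1, (P, R), A2(a1, (P, R))) | do(P := forced_p)],
        # assuming R's distribution does not depend on P.
        return sum(
            prob * u(a1, (forced_p, r2), policy_a2(a1, (forced_p, r2)))
            for r2, prob in r_dist.items()
        )

    if p == NOT_PRESS:
        return un(a1, obs, a2) + expected(us, PRESS)
    return us(a1, obs, a2) + expected(un, NOT_PRESS)


# Toy stand-ins (all hypothetical): shutting down is worth 1 under US,
# working is worth 5 under UN, and the policy shuts down iff p is Press.
def toy_us(a1, obs, a2):
    return 1.0 if a2 == "shutdown" else 0.0


def toy_un(a1, obs, a2):
    return 5.0 if a2 == "work" else 0.0


def toy_policy(a1, obs):
    return "shutdown" if obs[0] == PRESS else "work"


toy_r_dist = {"r0": 0.5, "r1": 0.5}

if __name__ == "__main__":
    kwargs = dict(us=toy_us, un=toy_un, policy_a2=toy_policy, r_dist=toy_r_dist)
    print(combined_utility("a", (NOT_PRESS, "r0"), "work", **kwargs))
    print(combined_utility("a", (PRESS, "r0"), "shutdown", **kwargs))
```

Because the expectation is taken under do(P := ...), nothing the agent does to influence p changes the second term, which is the point of intervening on p alone.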
Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity Institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:
We're excited to publish a paper on corrigibility, as it promises to be an important part of the FAI problem. This is true even without making strong assumptions about the possibility of an intelligence explosion. Here's an excerpt from the introduction:
(See the paper for references.)
This paper includes a description of Stuart Armstrong's utility indifference technique previously discussed on LessWrong, and a discussion of some potential concerns. Many open questions remain even in our small toy scenario, and many more stand between us and a formal description of what it even means for a system to exhibit corrigible behavior.