Suppose you have two positive utility functions, A and B, mathematically considered to be random variables dependent on the choice of policy.
Let K be a random variable that is 1 if the button is pressed, 0 otherwise, again dependent on the choice of policy. I am considering that there is a particular time at which the button must be pressed, to make this a single bit. You have a single chance to switch this AI between the 2 utilities, and whatever you pick in that instant is baked in forevermore.
Let a=E(A(1−K)) and b=E(BK) be the expected partial utilities.
Now consider all policies. In particular, consider the Pareto frontier between a and b. Through the introduction of randomness, we can make that Pareto frontier continuous.
We want some policy from that frontier.
There will be some policy that throws all resources at maximizing a. And there will be some (usually different) policy that throws all available resources at maximizing b.
Let λ be a tradeoff rate, with 0≤λ≤1. The Pareto frontier can now be defined as the policies that maximize λa+(1−λ)b for varying λ.
Let p be the probability that the button gets pressed. For λ=1, this is the policy that is optimal when only a matters, which has da/dp≤0 (raising the chance the button is pressed can only cost A-utility, since A is positive). Then d(λa+(1−λ)b)/dp≤0.
Likewise at λ=0 we have d(λa+(1−λ)b)/dp=db/dp≥0.
So somewhere in the middle, by the continuity granted by stochastic mixes of policies, there must be at least one λ where d(λa+(1−λ)b)/dp=0. Use that policy, or stochastic mixture of policies.
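As a sanity check, here is a toy numerical sketch of that intermediate-value argument. The diminishing-returns model, the function names and all the numbers are illustrative assumptions of mine, not part of the setup above (and the continuity here comes from the smooth toy model rather than from stochastic mixtures):

```python
import numpy as np

# Toy model: a policy is an effort split between the A-task and the B-task.
# alpha = E(A | button not pressed), beta = E(B | button pressed).
def conditional_utilities(effort_on_A):
    alpha = 1.0 + np.sqrt(effort_on_A)        # diminishing returns on A
    beta = 1.0 + np.sqrt(1.0 - effort_on_A)   # diminishing returns on B
    return alpha, beta

def best_response(lam, p):
    """Effort split maximizing lam*(1-p)*alpha + (1-lam)*p*beta."""
    efforts = np.linspace(0.0, 1.0, 1001)
    alphas, betas = conditional_utilities(efforts)
    values = lam * (1 - p) * alphas + (1 - lam) * p * betas
    i = np.argmax(values)
    return alphas[i], betas[i]

def button_gradient(lam, p=0.5):
    """d/dp of lam*(1-p)*alpha + (1-lam)*p*beta at the lam-optimal policy."""
    alpha, beta = best_response(lam, p)
    return -lam * alpha + (1 - lam) * beta

# At lam=0 the gradient is >=0 (the agent wants the button pressed),
# at lam=1 it is <=0, so it crosses 0 somewhere in between.
lams = np.linspace(0.0, 1.0, 1001)
grads = np.array([button_gradient(lam) for lam in lams])
lam_star = lams[np.argmin(np.abs(grads))]
print(f"lambda* ~ {lam_star:.3f}, gradient there ~ {grads[np.argmin(np.abs(grads))]:.4f}")
```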
This agent will never pay utility to set the value of the button one way or the other. Because the chosen policy is one in which d(λa+(1−λ)b)/dp=0 (I think). Which hopefully means that a policy that pays to change the button does worse than the policy that doesn't pay to change the button, but otherwise does everything else the same.
Put another way, if you pay to flip the button, then you must care about the state of the button: when the chance the button is pressed changes, you are getting more or less utility in exchange for what you paid. So the gradient can't be 0.
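A made-up numerical illustration of that: say the chosen λ is 0.5 and the frontier policy has E(A|K=0)=E(B|K=1)=2 with p=0.5, so the gradient −λE(A|K=0)+(1−λ)E(B|K=1) is 0 and the objective is 0.5·0.5·2+0.5·0.5·2=1. Shifting p to 0.9 for free leaves the objective at exactly 1, but spending 0.2 of A-utility to force that shift gives 0.5·0.1·1.8+0.5·0.9·2=0.99, strictly worse than not bothering.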
And this is all described as a choice of policy, a framework which automatically includes the actions of all subagents.
So, how does this fail? Who can break this agent and get it to do crazy things?
EDIT: I now consider this whole approach to be irreparably flawed.
Imagine there was a 50% chance the button would be pressed. Also, the AI was in some betting setup that would pay out paperclips or staples based on a coin flip.
In scenario 1, this AI has the option of setting up a mechanism, external to itself, that controls the coin flip based on whether the button is pressed.
In scenario 2, the AI can set up a mechanism to control the button based on the coin flip.
When considering only actions, probabilities and utilities, these look identical. But we want the AI to behave differently in the two scenarios. So we must make use of the arrow of time, the structure of cause and effect, in a way we aren't doing here.
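A quick sketch of why they look identical from the inside (my own toy formalisation of the two scenarios, with made-up 50/50 odds):

```python
import random
from collections import Counter

random.seed(0)
N = 100_000

def scenario_1():
    # The external mechanism sets the coin based on the button.
    button = random.random() < 0.5                # 50% chance the button is pressed
    coin = "paperclips" if button else "staples"
    return button, coin

def scenario_2():
    # The external mechanism sets the button based on the coin.
    coin = random.choice(["paperclips", "staples"])
    button = (coin == "paperclips")
    return button, coin

for name, scenario in [("scenario 1", scenario_1), ("scenario 2", scenario_2)]:
    freqs = Counter(scenario() for _ in range(N))
    print(name, {k: round(v / N, 3) for k, v in freqs.items()})
# Both scenarios give the same joint distribution over (button, payout),
# hence the same probabilities and expected utilities; only the direction
# of causation differs, and nothing above looks at that.
```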
End edit.
To make this work, you would need a utility function for shutting off, including all subagents.
Let's suppose that, other than these robots, there are ~0 transistors in your house. Then we can define shutting off as minimizing the number of transistor flips in your house.
So you make
A=at least 1 coffee
B=1/(number of transistor flips).
Note that, when imagining any potential future world where the switch isn't pressed, the AI has no reason to bother counting transistor flips. And in potential futures where the switch is pressed, it doesn't need to know what coffee is.
What it cares about are a=E(A(1−K)) and b=E(BK)
Let's list the robot's options.
Let a=1 without coffee, and a=2 with coffee.
In worlds where the switch is pressed, the robot turns itself off immediately once it is fairly sure the switch will be pressed, making b=1.5. In worlds where it doesn't turn off, more transistors flip, making b=0.5. Because a=1 in both cases, and we only select from the Pareto frontier, whenever the switch is pressed, it will turn off.
Let's apply your utility penalties by putting them in A and B, i.e. in a or b depending on the switch.
TC) Press switch, avoid cat. Switch pressed, so (a=0.7), b=1.5−0.1−0.2=1.2.
PC) Prevent switch, avoid cat. a=1.7, (b=0.2).
IC) Ignore switch, avoid cat. a=1.8, (b=0.3).
TH) Press switch, hit cat. b=1.4, (a=0.9).
IH) Ignore switch, hit cat. b=1.5, (a=1) (because it predicts humans will see it and turn it off).
PH) Prevent switch, hit cat. a=1.9, (b=0.4).
This puts IH and PH on the convex hull.
And I think my algorithm picks between them stochastically.
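To double-check that, here is a small script using the (a, b) numbers from the list above. It sweeps the tradeoff rate and confirms that only IH and PH ever maximize λa+(1−λ)b for λ in [0,1], and that they tie at λ=0.55, where any stochastic mixture of the two sits on the frontier (the sweep is just my crude way of tracing the hull, not part of the scheme itself):

```python
import numpy as np

# The six options and their (a, b) values from the list above.
options = {
    "TC": (0.7, 1.2),
    "PC": (1.7, 0.2),
    "IC": (1.8, 0.3),
    "TH": (0.9, 1.4),
    "IH": (1.0, 1.5),
    "PH": (1.9, 0.4),
}

# Sweep the tradeoff rate: which options ever maximize lam*a + (1-lam)*b?
winners = set()
for lam in np.linspace(0.0, 1.0, 10001):
    best = max(options, key=lambda k: lam * options[k][0] + (1 - lam) * options[k][1])
    winners.add(best)
print(sorted(winners))  # ['IH', 'PH']

# The tradeoff rate at which IH and PH give equal value; at this lam any
# stochastic mixture of the two is equally optimal.
(a1, b1), (a2, b2) = options["IH"], options["PH"]
lam_tie = (b1 - b2) / ((a2 - a1) + (b1 - b2))
print(lam_tie)  # 0.55
```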