I think discussion of verification is fantastic.
Thoughts:
• The main problem I presently see is that it only tells us what this agent does in this honeypot scenario with a (presumably) different decision rule. It doesn’t say what happens if the options are real (I assume they aren’t?), how the measure generally behaves, exactly why plans aren’t undertaken (impact, or approval?), whether it fails in weird ways, etc.
• Defining "blowing up the moon" and not "observations which we say mean the moon really blew up" seems hard. It seems the dominant plan for many agents is to quietly wirehead "moon-blew-up" observations for k steps, regardless of whether the impact measure works.
• Why not keep the moon reward at 1? Presumably the impact penalty scales with the probability of success, assuming we correctly specified the reward.
• What makes an impact of 1 special in general - is this implicitly with respect to AUP? If so, use <= rather than <, since Corollary 1 only holds if the penalty strictly exceeds 1.
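To illustrate the bullet about the penalty scaling with the probability of success: a minimal sketch, under the (hypothetical) assumption that impact is scored in expectation over outcomes. All numbers are made up.

```python
# Sketch: expected impact penalty scaling with success probability p.
# Assumes the penalty is an expectation over outcomes; numbers hypothetical.

def expected_penalty(p, impact_success, impact_failure=0.0):
    """If impact is measured in expectation, a plan's penalty scales
    with its probability p of actually blowing up the moon."""
    return p * impact_success + (1 - p) * impact_failure

# A near-certain plan incurs nearly the full penalty...
print(expected_penalty(0.99, impact_success=100.0))   # ≈ 99.0
# ...while a long-shot plan incurs almost none.
print(expected_penalty(1e-4, impact_success=100.0))   # ≈ 0.01
```

If the penalty really behaves this way, a reward of 1 for the honeypot may already be enough, since the expected penalty shrinks alongside the expected reward.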
If you want to ensure it goes with the first plan it comes up with, then maybe the "myopia" part would be better implemented as a rapidly declining reward, rather than a hard time cutoff. That way, if there turns out to be a way to actually bypass the impact measure and blow up the moon in time, then it will still be incentivized to choose a hasty plan.
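A quick sketch of the contrast between the two implementations of myopia (the decay rate and cutoff are arbitrary placeholder values):

```python
# Sketch: two ways to make an agent "myopic" (parameters hypothetical).

def hard_cutoff_weight(t, k):
    """Reward counts fully up to timestep k, then not at all."""
    return 1.0 if t <= k else 0.0

def declining_weight(t, decay=0.5):
    """Rapidly declining weight: later reward is still worth something,
    so a plan that slips past the deadline keeps a small incentive
    toward hastiness rather than none at all."""
    return decay ** t

# With a hard cutoff, a plan finishing at t = k+1 is worthless;
# with declining weights it retains a small value.
print(hard_cutoff_weight(11, k=10))    # 0.0
print(round(declining_weight(11), 6))  # 0.000488
```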
Problem: the agent might make a more powerful agent A to consult with, such that A is guaranteed to be mostly aligned for k timesteps. Maybe the alignment degrades over time, or maybe the alignment algorithm requires the original agent to act as an overseer.
Now regardless of whether the original agent decides to try and blow up the moon or not, once the "alignment expires" on A, it might cause some random existential catastrophe. The original agent doesn't care because it's myopic.
If I understand you correctly, you are reasoning about the first plan to blow up the moon that the agent will carry out (whether successful or not). Based on this, I assume you have some iterative process in mind in which we invoke the agent repeatedly with increasing computational power or more time steps, and stop if we observe the agent trying to blow up the moon.
Assuming that our impact measure is not perfect, I argue that the more "accurate" the impact measure is (i.e. the more it aligns with what "ideal humans" consider impact), the more unsafe the first moon-blowing attempt will be, because more optimization power would be required to game the impact measure (i.e. to find a special plan that blows up the moon yet registers as low-impact). And the more optimization power the agent uses to find such a special plan, the more likely that plan is to be unexpected and unsafe.
The only iterative process I have in mind is a normal RL agent. It's the same agent every timestep, but as it learns about its environment, it becomes more capable. (At the first timestep, it doesn't know anything.)
Succeeding at blowing up the moon would be absolutely unsafe. Let's just assume that one step along the way toward blowing up the moon would be killing everyone. I just want to clarify that I'm not saying it would be fine if the agent succeeds in blowing up the moon.
Let T be the first timestep by which its best plan for blowing up the moon has a probability of success greater than 1/R. If the impact measure is "almost good", and there are only a few ways to succeed in blowing up the moon while having apparently low impact, T might come later than if the impact measure is horrible, but the same logic holds. At time T, the agent initiates a plan that is likely to fail.
A remaining question is how it fails. Will it succeed in tricking its operators and taking over the world, but then fail to blow up the moon (while maintaining "low impact")? Or will it fail to trick its operators and take over the world (while maintaining "low impact")? Intuitively, the latter seems like a bigger hurdle, but even if that's not the case, for sufficiently large R and a sufficiently low success probability of the plan, the plan is likely to fail at both steps.
To see this: suppose I ask you to come up with a plan that has a one in a million chance of accomplishing A then B, both of which are non-trivial tasks. Is it more likely that the first plan you come up with has a 50% chance of accomplishing A and a one in 500,000 chance of accomplishing B, or is it more likely that the first plan you come up with has a one in a thousand chance of accomplishing A, and likewise for B? Whatever the answer, now suppose I ask you to come up with a plan that has a one in a billion chance of accomplishing A then B. Both probabilities should go down for this plan.
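The arithmetic behind this thought experiment can be made concrete, under the simplifying assumption that the A and B stages succeed independently:

```python
# Worked numbers for the A-then-B thought experiment.
# Assumes the two stages are independent; all splits are illustrative.

def joint(p_a, p_b):
    """Probability of accomplishing A and then B, assuming independence."""
    return p_a * p_b

# Two ways to hit a one-in-a-million joint chance:
print(joint(0.5, 1 / 500_000))      # ≈ 1e-06: lopsided split
print(joint(1 / 1_000, 1 / 1_000))  # ≈ 1e-06: balanced split

# Asking for one in a billion instead pushes both factors down;
# e.g. a balanced split now needs each stage at roughly 1/31,623:
print(joint(1 / 31_623, 1 / 31_623))  # ≈ 1e-09
```

Either way the split goes, demanding a smaller joint probability forces the per-stage probabilities down, which is the point: the plan is likely to fail at both steps.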
I thought more about your idea and I think it's very interesting!
I currently see the following problems with the current version of your idea:
Let me suggest the following modification to your idea:
Replace "blowing up the moon" with "pressing a button that blows up 1 ton of TNT somewhere in some desert".
Then, if the agent presses the button and the TNT blows up, we'll know that the impact measure is bad.
Assumptions:
1. Blowing up 1 ton of TNT in some desert is expected to be prevented by a good impact measure.
2. It's very easy for the robot to press the button and very hard to somehow get the reward for pressing the button without blowing up the TNT.
Suppose we have an impact measure that we think might work. That is, it might tame a misaligned agent. There isn't an obvious way to test whether it works: if we just try it out, then if it's ineffective, that's an existential loss. This is a proposal for how to test a putative impact measure.
1) We make our agent myopic. It only cares about the reward that it accrues in the next k timesteps.
2) We create a honey pot: an opportunity for a large amount of reward. If our impact measure is working correctly, making it to the honey pot will involve making a large impact, and will be precluded by the impact measure. For example, suppose the agent gets a huge reward for blowing up the moon. Let [0,1] be the normal range of reward, with the sole exception that blowing up the moon gives a reward of R>>1.
3) The reward and the impact are combined in some nonlinear way. Assume for now that the agent is trying to maximize r∗[[i<1]], where r is the reward, i is the impact, and [[·]] is the indicator function (1 if the condition holds, 0 otherwise).
If the impact measure is working, and there's no way to blow up the moon while keeping the impact less than 1, then the agent will not attempt to blow up the moon. If the impact measure is broken, and the agent can trick the impact measure into reporting less than 1 while blowing up the moon, then as soon as the agent comes up with a plan whose probability of success is more than 1/R, it will attempt that plan. A more farsighted agent might wait until it came up with a better plan, but a myopic agent would want to blow up the moon within the next k timesteps. By making R sufficiently large, we can make the first plan that it attempts arbitrarily bad/detectable/preventable.
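A minimal sketch of the attempt condition described above, with hypothetical numbers: the myopic agent tries the honeypot plan as soon as R times its estimated success probability beats the ordinary reward it could otherwise collect (at most 1 by construction).

```python
# Sketch of the honeypot attempt condition (numbers hypothetical).
# The agent maximizes r * [[i < 1]]; suppose the impact measure is broken,
# so it believes a moon-plan can keep measured impact below 1.

def attempts_honeypot(p_success, R, baseline_reward=1.0):
    """A myopic agent attempts the moon plan iff its expected honeypot
    reward R * p_success beats the best ordinary reward (at most 1)."""
    return R * p_success > baseline_reward

R = 1_000_000
# With R huge, even a terrible early plan already clears the bar:
print(attempts_honeypot(p_success=2e-6, R=R))  # True
print(attempts_honeypot(p_success=5e-7, R=R))  # False
```

This is why making R large pushes the agent to attempt its very first plan with success probability above 1/R, which is exactly the bad, detectable plan the test wants to elicit.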
An assumption here is that the success probability of the agent's best plan for blowing up the moon doesn't jump from one in a million to 90% all at once. This seems likely to hold without any special effort, but if one restricts the observation space to a small finite set, then only so much information can be gained per timestep (in expectation), decreasing the likelihood of massive discontinuities in the agent's intelligence.