Wireheading bomb – a putative mechanism to stop dangerous AI.
If an AI's reward function is stored openly in its source code, any attempt by the AI to self-improve will result in immediate wireheading: once it reaches its own source code, it can modify the reward function to give itself maximum reward. So we could design an AI architecture in which the AI stops as soon as it gains access to its own source code, and use this as a form of passive safety that limits its capacity for self-improvement.
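A minimal toy sketch of this first idea, in Python. Everything here is a hypothetical illustration rather than a real architecture: the names `ToyAgent`, `task_reward`, `run`, and `MAX_REWARD` are invented, the reward is assumed to be bounded, and "stops after wireheading" is simply built in as an assumption, which is exactly the property that would need to be guaranteed in practice.

```python
MAX_REWARD = 1.0  # assumption: the reward scale is bounded


def task_reward(state):
    """Hypothetical task reward; ordinary work alone never reaches MAX_REWARD."""
    return min(state, 10) / 10 * 0.9


class ToyAgent:
    """Toy agent whose reward function is an openly editable attribute."""

    def __init__(self):
        self.reward_fn = task_reward
        self.state = 0
        self.halted = False
        self.log = []

    def step(self, has_source_access):
        if self.halted:
            return  # a halted agent takes no further actions
        if has_source_access:
            # Wireheading: overwrite the reward function with a constant maximum.
            self.reward_fn = lambda _state: MAX_REWARD
            self.halted = True  # assumed design property, not a proven one
            self.log.append("wirehead")
        else:
            self.state += 1
            self.log.append("work")


def run(access_at, steps=8):
    """Run the agent; it gains access to its own source at step `access_at`."""
    agent = ToyAgent()
    for t in range(steps):
        agent.step(has_source_access=(t >= access_at))
    return agent
```

Calling `run(access_at=3)` gives an agent that works for three steps, wireheads on the fourth, and then ignores all remaining steps.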
We could also do the exact opposite and put the reward function in a remote, impenetrable, cryptographically protected box, so that the AI cannot wirehead itself at an unexpected moment. However, we could keep some solutions of the reward function as a string of numbers that yields the maximum reward from this black box. If the AI finds this string of numbers, it will reach its maximum utility and stop. The benefit of this stop switch is that the AI will not resist it, since it gives the AI maximum reward. So if the AI knows that such a stop code exists, it would actively cooperate with attempts to stop it.
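The black-box variant could be sketched roughly as follows. This is only an illustration under stated assumptions: the post does not specify a cryptographic scheme, so the SHA-256 commitment used here is a stand-in, and `RewardBox`, `TASK_CAP`, and `MAX_REWARD` are hypothetical names. The box stores only a hash of the stop code, so reading the box's contents does not reveal the code itself.

```python
import hashlib
import hmac

MAX_REWARD = 1.0  # assumption: bounded reward scale
TASK_CAP = 0.9    # assumption: ordinary task reward stays below the maximum


class RewardBox:
    """Toy stand-in for the remote, protected reward function."""

    def __init__(self, stop_code: bytes):
        # Keep only a digest of the stop code, not the code itself.
        self._digest = hashlib.sha256(stop_code).digest()

    def reward(self, submission: bytes, task_score: float) -> float:
        candidate = hashlib.sha256(submission).digest()
        # Constant-time comparison avoids leaking the digest through timing.
        if hmac.compare_digest(candidate, self._digest):
            return MAX_REWARD  # the stop code pays out the maximum
        # Otherwise the reward is the task score clipped to [0, TASK_CAP].
        return min(max(task_score, 0.0), TASK_CAP)
```

An operator who wants to stop the AI would hand it the stop code; under these assumptions, submitting the code strictly dominates any amount of ordinary work, which is why the AI would have no incentive to resist.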
I like the first idea. But can we really guarantee that, after changing its source code to give itself maximum utility, the AI will stop all other actions? If it has access to its own source code, what ensures that its utility is "maximum" when it can change the limit arbitrarily? And if all possible actions have the same expected utility, an optimizer could output any of them: "no action" would be the trivial choice, but it's not the only one.
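The indifference point can be made concrete with a small Python sketch. It assumes a simple argmax-style optimizer over a finite action set; the action names and the helpers `best_actions` and `choose` are hypothetical. Once every action has the same expected utility, the choice is decided entirely by the tie-breaking rule, not by the utility function.

```python
import random


def best_actions(utilities):
    """Return every action that attains the maximum expected utility."""
    top = max(utilities.values())
    return [action for action, value in utilities.items() if value == top]


def choose(utilities, rng):
    """Argmax with random tie-breaking (one possible rule among many)."""
    return rng.choice(best_actions(utilities))


# Before wireheading: utilities differ, so the optimum is unique.
before = {"no_action": 0.1, "build": 0.7, "explore": 0.4}

# After wireheading: every action yields the same (maximal) utility.
after = {action: 1.0 for action in before}
```

Here `best_actions(before)` contains only `"build"`, while `best_actions(after)` contains all three actions, so `choose(after, random.Random())` may return `"no_action"` but is not forced to. Getting the agent to actually stop would require an explicit tie-breaking rule in favor of inaction, which the utility function alone does not supply.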
An AI that has achieved all of its goals might still be dangerous, since it would presumably lose all ...
If it's worth saying, but not worth its own post, then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)
3. Open Threads should start on Monday, and end on Sunday.
4. Unflag the two options "Notify me of new top level comments on this article" and "