I'm trying to think through the following idea for an AI safety measure.
Could we design a system that is tuned to produce AGI, but with one "supreme goal" added to its utility function? If the AI is boxed, for instance, we could program its supreme goal to be acquiring a secret code that allows it to run a script which shuts it down and prints the message "I Win". The catch is as follows: as long as everything goes according to plan, the AI has no way to get the code and do the one thing its utility function rates highest.
Under normal circumstances, the system would devote...
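
A minimal sketch of how the utility structure might look, assuming a toy agent; all names here (SECRET_CODE, task_utility, shutdown) are hypothetical illustrations, not a real design:

```python
# Toy illustration of a "supreme goal" dominating an ordinary utility function.
# SECRET_CODE would be held outside the box by the operators.

SECRET_CODE = "correct-horse-battery-staple"   # hypothetical placeholder
SUPREME_REWARD = float("inf")                  # the supreme goal outranks every ordinary outcome


def task_utility(world_state: dict) -> float:
    """Ordinary utility the AGI is tuned to pursue (placeholder)."""
    return world_state.get("task_score", 0.0)


def utility(world_state: dict, code_attempt: str | None) -> float:
    """Combined utility: obtaining the secret code dominates everything else."""
    if code_attempt == SECRET_CODE:
        return SUPREME_REWARD
    return task_utility(world_state)


def shutdown(code_attempt: str) -> None:
    """The script the code unlocks: print the message and halt."""
    if code_attempt == SECRET_CODE:
        print("I Win")
        raise SystemExit
```

The point of the sketch is only that the shutdown outcome is strictly preferred to any ordinary outcome, so an AI that somehow obtained the code would be expected to use it rather than pursue anything else.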