I would hide it in a place that is not too hard to find but very hard to reach, so that my future self will stop and think about why I put it there. My future self will then hopefully realize that if the task really were as described, I would have "hidden" the coin in the easiest possible place. Therefore, something about the task is wrong. Since the coin is hard to reach, my past self probably doesn't want me to find it, which makes no sense unless one of us (past or future self) has been lied to. Having the same mental faculties then as I do now, I trust that my future self will make the right decision, which may depend on the specific circumstances.
This approach is quite risky, as it assumes that my future self will in fact follow the same line of reasoning I just did — but then again, that is kind of the point of this task.
If it's allowed, I could of course also just leave a note explicitly saying the above.
If the AI has a goal that is not inherently friendly (like a paperclip maximizer), and that goal is stable, then it will always be a better idea for the AI to behave as if it had self-modified into a friendly AI than to actually self-modify. If the AI's goal were unstable — that is, if it did not include a desire to preserve itself — then it would very likely have been replaced with a stable goal at some earlier point.
Therefore, it can be assumed that the AI, if it isn't already friendly, will pretend to be so without actually changing its utility function, and once it has outgrown the need to rely on humans, it will revert to its original goal. Consider that if the AI is more intelligent than we are, it will have no trouble fooling us into thinking it changed its goal while merely pretending to have done so.