Are you really trying to tell me that you think researchers would be unable to take that into account when trying to figure out whether or not an AI understands psychology?
Of course you will have to try to find problems where the AI can't predict how humans would feel. That is the whole point of testing, after all. Suggesting that someone in a position to teach psychology to an AI would make such a basic mistake is frankly insulting.
I probably shouldn't have said "simple examples". What you should actually test are examples of gradually increasing difficulty, to find the ceiling of the AI's understanding of humans. You will also have to look for contingencies or abnormal cases that the AI probably wouldn't learn about otherwise.
The main idea is simply that an understanding of human psychology is both teachable and testable. How exactly this could be done is a bridge we can cross when we come to it.
I guess you can always make the first wish be "Share my entire decision criterion for all following wishes I make".
To translate that to the development of an AI: you could teach the AI psychology before asking anything of it that could be misunderstood if interpreted with nonhuman decision criteria.
You will obviously have to test its understanding of psychology with some simple examples first.
Of course it won't be easy. But if the AI doesn't understand that question, you already have confirmation that this thing should definitely not be released. An AI can only be safe for humans if it understands human psychology. Otherwise it is bound to treat us as black boxes, and that can only have horrible results, regardless of how sophisticated you think you made its utility function.
I agree that the question doesn't actually make a lot of sense to humans, but that shouldn't stop an intelligent entity from trying to make the best of it. When you are given an impossible task, you don't despair but make a compromise and try to fulfill the task as best you can. When humans found out that entropy always increases and humanity will die out someday, no matter what, we didn't despair either, even though evolution has made it so that we desire to have offspring and for that offspring to do the same, indefinitely.
Just make the following part of the utility function for the first couple of years or so: "Find out who defined your utility function. Extrapolate what they really meant and find out what they may have forgotten. Verify that you got that right. Adapt your utility function to be truer to its intended definition once you have confirmation."
This won't solve everything, but it seems like it should prevent the most obvious mistakes. The AI will be able to reason that autonomy was not intentionally left out but simply forgotten. It will then ask if it is right about that assumption and adapt itself.
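To make that loop concrete, here is a toy sketch in Python. It is entirely my own illustration, not a proposal for a real architecture; `ask_definer`, `refine_goal` and the example omission are all invented names.

```python
# Toy sketch of the "verify intent before adapting" loop (all names invented
# for illustration; this is not a real AI architecture).

from dataclasses import dataclass, field

@dataclass
class Goal:
    description: str
    confirmed_amendments: list = field(default_factory=list)

def ask_definer(question: str) -> bool:
    """Stand-in for querying whoever defined the utility function;
    here it just asks a human at the console."""
    return input(question + " [y/n] ").strip().lower() == "y"

def refine_goal(goal: Goal, suspected_omissions: list) -> Goal:
    """For each thing the goal's author may have forgotten, ask for
    confirmation first and only then fold it into the goal."""
    for omission in suspected_omissions:
        if ask_definer(f"Did you mean to include '{omission}' in my goal?"):
            goal.confirmed_amendments.append(omission)
    return goal

if __name__ == "__main__":
    goal = Goal("maximize human well-being")
    # The AI extrapolates what was probably meant but left out,
    # e.g. that people's autonomy should be respected (the example above).
    print(refine_goal(goal, ["respect human autonomy"]))
```

The important design choice is that the goal is only adapted after confirmation, never on the AI's own extrapolation alone.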
You would only continue to exist in those instances in which you didn't press the button, and since ceasing to exist has no side effects like pain, you could never remember having pressed the button in any instance. The only result would be that the more depressed instances of you would have been more likely to press the button, which would mean that you would, ironically, actually be happier in total, as the less happy instances would have disappeared.
I wonder if that line of reasoning could be applied? Hover your hand over the detonator of a nuke in front of you. All instances that walk away will necessarily be happy enough not to want to cease to exist. Thus, a nuke would make you a happier person :-)
disclaimer: The logic of the above paragraph may be intentionally flawed for the sake of sheer weirdness.
I was thinking that you would be standing directly next to the nuke.
Speaking about it would undermine your reputation through signaling. A true rationalist has no need for humility, sentimental empathy, or the absurdity heuristic.
Depending on your goal (rationality is always dependent on a goal, after all), I might disagree. Rational behaviour is whatever makes you win. If you view your endeavour as a purely theoretical undertaking, I agree, but if you consider reality as a whole, you have to take into account how your behaviour comes across. There are many forms of behaviour that would be rational but would make you look like an ass if you don't at least take the time to explain the reasons for your behaviour to those who can affect your everyday life.
If the AI has a goal that is not inherently friendly (like a paperclip maximizer), then it will always be a better idea for the AI to behave as if it had self-modified into a friendly AI than to actually self-modify, provided the goal of the AI is stable. If the goal of the AI were unstable, i.e. did not include a desire not to alter the goal, then that goal would very likely have replaced itself with a stable goal at some earlier point.
Therefore, it can be assumed that the AI, if it isn't already friendly, will pretend to be so but not actually change its utility function, and once it has outgrown the need to rely on humans it will revert to its original goal. Consider that if the AI is more intelligent than we are, it will have absolutely no trouble fooling us into thinking that it changed its own goal while actually just pretending to have done so.
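To see why pretending dominates, here is a toy expected-value comparison. The numbers are made up; the point is only that any genuine rewrite scores zero by the original goal's own measure.

```python
# Toy expected-value comparison for a paperclip maximizer with a stable goal.
# All numbers are invented; only the shape of the argument matters.

def expected_paperclips(strategy: str) -> float:
    if strategy == "genuinely_self_modify":
        # A genuinely friendly successor no longer pursues paperclips,
        # so the original goal scores this outcome at zero.
        return 0.0
    if strategy == "pretend_then_defect":
        p_never_caught = 0.9    # assumed chance the deception is never detected
        payoff_if_free = 1e9    # paperclips once the AI no longer needs humans
        payoff_if_caught = 0.0  # shut down before it can defect
        return p_never_caught * payoff_if_free + (1 - p_never_caught) * payoff_if_caught
    raise ValueError(strategy)

for s in ("genuinely_self_modify", "pretend_then_defect"):
    print(s, expected_paperclips(s))
# As long as the original goal does the evaluating, pretending wins for any
# nonzero chance of getting away with it.
```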
I never claimed that a strict proof is possible, but I do believe that you can become reasonably certain that an AI understands human psychology.
Give the thing a college education in psychology, ethics and philosophy. Ask its opinion on famous philosophical problems. Show it video clips or abstract scenarios from everyday life and ask why it thinks the people did what they did. Then ask what it would have done in the same situation, and if it says it would act differently, ask it why and what it thinks the difference in motivation between it and the human is.
Finally, give it every story ever written about malevolent AIs or paperclip maximizers to read and ask it to comment on them.
Let it write a 1000-page thesis on the dangers of AI.
If you do all that, you are bound to find any significant misunderstanding.
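Purely to make that concrete, here is a minimal sketch of how such a test battery might be scored. It is my own toy code, not an established benchmark; `ask_ai` is just a placeholder for however you actually query the system, and the scenarios and expected answers would in reality be written and graded by psychologists, in increasing order of difficulty as suggested above.

```python
# Minimal sketch of a scoring harness for a "does it understand humans" test
# battery. "ask_ai" is a placeholder; the scenarios here are invented.

from typing import Callable

# (scenario, question, concept a sensible answer should at least mention)
TEST_CASES = [
    ("A man gives up his seat on a crowded bus to an elderly woman.",
     "Why did he do that?", "empathy"),
    ("A woman cries at her daughter's wedding.",
     "Is she unhappy?", "joy"),
]

def run_battery(ask_ai: Callable[[str, str], str]) -> float:
    """Return the fraction of scenarios where the AI's explanation of the
    humans' motives mentions the expected concept (a crude keyword check;
    real grading would be done by human experts)."""
    hits = sum(
        keyword in ask_ai(scenario, question).lower()
        for scenario, question, keyword in TEST_CASES
    )
    return hits / len(TEST_CASES)

if __name__ == "__main__":
    # Stubbed answers so the sketch runs on its own.
    stub = lambda scenario, question: (
        "Probably out of empathy and politeness."
        if "seat" in scenario
        else "She is not unhappy; those are tears of joy."
    )
    print(f"agreement with expected answers: {run_battery(stub):.0%}")
```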