Houshalter comments on The genie knows, but doesn't care - Less Wrong
Having an accurate model of something is in no way equivalent to being able to do anything you want. If I know everything about physics, I still can't walk through walls. A boxed AI won't be able to magically make its creators forget about AI risks and unbox it.
There are other possible setups, like feeding its output to another AI whose goal is to find any flaws or attempts at manipulation in it, and so on. Various other ideas might help, like threatening to severely punish attempts at manipulation.
This is of course only necessary for an AI that can interact with us at such a level. The other ideas were far more constrained, e.g. restricting it to solving math or engineering problems.
Nor is it necessary to let it be superintelligent, instead of limiting it to something comparable to high IQ humans.
Another super strong assumption with no justification at all. It's trivial to propose an AI model which only cares about finite time horizons. Predict which action will have the highest expected utility at time T, and take that action.
The results of AI box game trials disagree.
And what does it do at time T+1? And if you said 'nothing', try again, because you have no way of justifying that claim. It may not have intentionally-designed long-term preferences, but just because your eyes are closed does not mean the room is empty.
That doesn't prove anything; no one has even seen the logs. Based on reading what people involved have said about it, I strongly suspect the trick is for the AI to emotionally abuse the gatekeeper until they don't want to play anymore (which counts as letting the AI out).
This doesn't apply to a real-world AI, since no one is forcing you to choose between letting the AI out and listening to it for hours. You can just get up and leave. You can turn the AI off. There is no reason you even have to allow interactivity in the first place.
But Yudkowsky and others claim these experiments demonstrate that human brains are "hackable". That there is some sentence which, just by reading, will cause you to involuntarily perform any arbitrary action. And that a sufficiently powerful AI can discover it.
At time T+1, it does whatever it thinks will result in the greatest reward at time T+2, and so on. Or you could have it shut off or reset to a blank state.
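To make the proposal concrete, here is a minimal sketch of that kind of finite-horizon agent. Everything in it is hypothetical and for illustration only: the names `choose_action`, `run_rolling_horizon`, and `toy_utility`, and the toy numbers, are assumptions rather than a real AI design. It shows only the selection rule (pick the action with the highest predicted utility at a fixed time, then move the horizon forward one step) and says nothing about whether such an agent would be safe.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: a model that predicts the utility of an action
# as evaluated at a specific future time step.
UtilityModel = Callable[[str, int], float]


def choose_action(actions: Sequence[str], expected_utility_at: UtilityModel, horizon: int) -> str:
    """Return the action with the highest predicted utility at time `horizon`.

    Nothing beyond `horizon` is consulted, which is the "finite time horizon" part.
    """
    return max(actions, key=lambda a: expected_utility_at(a, horizon))


def run_rolling_horizon(actions: Sequence[str], expected_utility_at: UtilityModel,
                        start: int, steps: int) -> List[str]:
    """At each time t, act to maximise predicted utility at t + 1 only."""
    chosen = []
    for t in range(start, start + steps):
        chosen.append(choose_action(actions, expected_utility_at, t + 1))
    return chosen


# Toy utility model (pure assumption): "stall" pays more the later it is evaluated.
def toy_utility(action: str, t: int) -> float:
    table = {"wait": 0.0, "answer": 1.0, "stall": 0.4 * t}
    return table[action]


ACTIONS = ["wait", "answer", "stall"]

if __name__ == "__main__":
    print(run_rolling_horizon(ACTIONS, toy_utility, start=0, steps=3))
```

The "shut off or reset" variant would correspond to calling `choose_action` once and stopping, instead of looping in `run_rolling_horizon`.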
Enjoy your war on straw, I'm out.