CarlShulman comments on I attempted the AI Box Experiment (and lost) - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (244)
It really shouldn't, any more than someone discovering a security vulnerability in C programs should make them seem impressive. In this instance, all I can think is "Oh look, someone demonstrated that 'social engineering' - the single most reliable and damaging strategy in hacking, responsible for millions of attacks over the history of computing - works a nontrivial fraction of the time, again? What a surprise."
The only surprise and interesting part of the AI boxing games for me is that some people seem to think that AI boxing is somehow different - "it's different this time", as the mocking phrase goes.
A perfectly reasonable analogy, surely. Because we have millions of instances of successful telepathy and magical martial arts being used to break security.
As time goes on, the odds of success go up? Yeah, I'd agree. But what happens when you reverse that: is there any principled reason to think that the odds of success from just continuing the conversation go to zero before you hit the allowed one-liner?
A strange game to bother playing if you don't take it seriously, and this would explain only the first time; any subsequent player is probably playing precisely because they had heard of the first game and are skeptical or interested in trying it out themselves.
That would be conceding the point of the exercise.
Might explain the first one, but why would anyone else collude? (Oh right, they're all cultists or something. Even the ones who fail to escape the box and just come close.)
If Eliezer had been engaging in collusion to get his wins, why would he have gone through with his last two (failed) attempts with large outside-of-game stakes? That shows pretty clearly that he thought he could win and was wrong, which he wouldn't have done if the wins were from collusion. It also shows the absence of any persuasive superpower.
Also, I let the AI out of the box, but I remain an advocate of AI boxing (where it makes sense). So it's not the case that Eliezer has some compelling argument against boxing that convinces everyone (ignored by those with outside-of-game stakes to protect those stakes).
So the chats show something, but not that Eliezer is inhumanly persuasive nor that AI boxing won't work.
Yes, that was my point.
I'm not sure whether Eliezer claims this, but from my point of view, failing to talk himself out of the box a simple majority of the time is not interesting (I suppose a simple majority would count as a 'superpower').
Rather, any chance of an AI talking itself out is catastrophic from the security point of view, since security is disjunctive: if AIs have just a 1% chance of talking themselves out, you now have an insecurity floor of 1%. No matter how many fancy techniques you throw at the box, you'll never pass 99% security if the AI has a 1% chance of talking you into disabling the fancy techniques. Security is only as strong as its weakest link.
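The floor argument above can be sketched numerically. This is a minimal illustration with invented numbers, assuming the persuasion channel and the technical measures fail independently:

```python
# Hypothetical illustration of the "insecurity floor": if the AI can talk the
# gatekeeper into disabling the safeguards 1% of the time, overall security is
# capped at 99% no matter how good the technical measures are.
# All probabilities here are made up for illustration.

P_PERSUADE = 0.01  # assumed chance the AI talks its way out


def overall_security(p_technical_failure: float) -> float:
    """Probability the box holds, given two independent failure modes:
    the technical measures fail on their own, OR the AI persuades the
    operator to disable them (disjunctive risk)."""
    p_escape = 1 - (1 - p_technical_failure) * (1 - P_PERSUADE)
    return 1 - p_escape


# Even with perfect technical measures, security tops out at 99%;
# any residual technical risk only pushes it lower.
print(overall_security(0.0))    # ~0.99
print(overall_security(0.001))  # ~0.989
```

The point is structural rather than about the particular numbers: the persuasion term multiplies into every configuration of technical defenses, so no amount of hardening elsewhere removes it.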
If a particular situation poses a 1% risk when it comes up, one can lower the total risk by making that situation less likely: other security measures could reduce the chance that the attempt is made at all. For example, a human facing the death penalty for a failed prison escape, with only a 1% chance of success, would not obviously try. Even an autonomous AI whose interests conflict with humanity's to some degree might be designed without a decision algorithm so risk-loving as to make an improbable escape attempt in the face of punishment for failure or reward for non-attempt.
You only do that by changing the problem; a different problem will have different security properties. The new risk will still be a floor, the disjunctive problem hasn't gone away.
Many do try if the circumstances are bad enough, and the death penalty for a failed escape has been common throughout history and in totalitarian regimes. Just yesterday, in fact, I read a story of a North Korean prison camp escapee (death penalty for escape attempts goes without saying) for whom, given his many disadvantages and challenges, an estimated 1% chance of reaching South Korea alive does not seem too inaccurate.
You don't have to be risk-loving to make a 1% attempt if that's your best option; the 1% chance just has to be the best option, is all.
You try to make the 99% option fairly good.