
gwern comments on I attempted the AI Box Experiment (and lost) - Less Wrong Discussion

47 points | Post author: Tuxedage | 21 January 2013 02:59AM



You are viewing a single comment's thread.

Comment author: CarlShulman 22 January 2013 10:57:23PM 2 points

Might explain the first one, but why would anyone else collude? (Oh right, they're all cultists or something. Even the ones who fail to escape the box and just come close.)

If Eliezer had been engaging in collusion to get his wins, why would he have gone through with his last two (failed) attempts with large outside-of-game stakes? That shows pretty clearly that he thought he could win and was wrong, which he wouldn't have done if the wins had come from collusion. It shows the absence of any persuasive superpower too.

Also, I let the AI out of the box, but I remain an advocate of AI boxing (where it makes sense). So it's not the case that Eliezer has some compelling argument against boxing that convinces everyone (except those with outside-of-game stakes, who ignore it to protect those stakes).

So the chats show something, but not that Eliezer is inhumanly persuasive nor that AI boxing won't work.

Comment author: gwern 22 January 2013 11:04:27PM 2 points

If Eliezer had been engaging in collusion to get his wins, why would he have gone through with his last two (failed) attempts with large outside-of-game stakes? That shows pretty clearly that he thought he could win and was wrong, which he wouldn't have done if the wins had come from collusion.

Yes, that was my point.

It shows the absence of any persuasive superpower too.

I'm not sure whether Eliezer claims this, but from my point of view, failing to talk himself out of the box a simple majority of the time is not interesting (I suppose winning a simple majority of the time would count as a 'superpower').

Rather, any chance of an AI talking itself out is catastrophic from the security point of view, since security is disjunctive: if AIs have just a 1% chance of talking themselves out, you now have an insecurity floor of 1%. No matter how many fancy techniques you throw at the box, you'll never pass 99% security if the AI has a 1% chance of talking you into disabling the fancy techniques. Security is only as strong as its weakest link.
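A minimal sketch of that weakest-link arithmetic, with made-up numbers (none of the figures below come from the thread):

```python
# Illustrative sketch (all numbers are assumptions, not from the thread):
# however good the technical layers are, overall security is capped at
# 1 - P(the AI talks its gatekeeper into disabling them).

p_persuade = 0.01                    # assumed chance of talking the human around
layer_breach = [1e-6, 1e-9, 1e-12]   # assumed independent per-layer failure rates

# Chance that at least one technical layer fails on its own.
p_all_hold = 1.0
for p in layer_breach:
    p_all_hold *= (1 - p)
p_technical_breach = 1 - p_all_hold

# Escape happens if persuasion works, or if the layers fail even though it didn't.
p_escape = p_persuade + (1 - p_persuade) * p_technical_breach
print(f"technical breach alone: {p_technical_breach:.2e}")
print(f"overall escape chance:  {p_escape:.6f}")  # never drops below 0.01
```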

Comment author: CarlShulman 22 January 2013 11:51:30PM 1 point

since security is disjunctive: if AIs have just a 1% chance of talking themselves out, you now have an insecurity floor of 1%

If a particular situation poses a 1% risk when it comes up, one can lower the total risk by making that situation less likely: other security measures could deter the attempt from being made, e.g. a human facing the death penalty for a failed prison escape, with only a 1% chance of success, would not obviously try. Even an autonomous AI whose interests conflict with humanity's to some degree might be designed without a decision algorithm risk-loving enough to try an improbable escape in the face of punishment for failure or reward for non-attempt.
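A sketch of the factoring implicit in this reply, keeping the 1% figure from the thread and assuming everything else:

```python
# Illustrative only: total escape risk factors into
# P(attempt) * P(success | attempt), so deterrence that lowers P(attempt)
# lowers total risk even while the 1% conditional figure stays fixed.

p_success_given_attempt = 0.01   # the 1% figure from the discussion

def total_risk(p_attempt: float) -> float:
    """Overall escape probability if an attempt is made with probability p_attempt."""
    return p_attempt * p_success_given_attempt

print(total_risk(1.0))   # 0.01   -- always tries
print(total_risk(0.1))   # 0.001  -- deterred most of the time
print(total_risk(0.0))   # 0.0    -- never tries
```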

Comment author: gwern 23 January 2013 12:03:50AM 1 point

If a particular situation poses a 1% risk when it comes up, one can lower the total risk by making that situation less likely

You can only do that by changing the problem, and a different problem will have different security properties. The new risk will still be a floor; the disjunctive problem hasn't gone away.

a human facing the death penalty for a failed prison escape, with only a 1% chance of success, would not obviously try.

Many do try if the circumstances are bad enough, and the death penalty for a failed escape has been common throughout history and in totalitarian regimes. Just yesterday, in fact, I read the story of a North Korean prison camp escapee (where the death penalty for escape attempts goes without saying) for whom, given his many disadvantages and challenges, a 1% chance of reaching South Korea alive does not seem like too inaccurate an estimate.

Even an autonomous AI whose interests conflict with humanity's to some degree might be designed without a decision algorithm risk-loving enough to try an improbable escape in the face of punishment for failure or reward for non-attempt.

You don't have to be risk-loving to make a 1% attempt if that's your best option; the 1% chance just has to be the best option, is all.

Comment author: CarlShulman 23 January 2013 12:56:49AM 0 points

You don't have to be risk-loving to make a 1% attempt if that's your best option; the 1% chance just has to be the best option, is all.

You try to make the 99% option fairly good.
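This exchange reduces to an expected-value comparison; a sketch with purely hypothetical utilities (none of these figures appear in the thread):

```python
# Hypothetical utilities for a risk-neutral agent weighing a 1%-success escape.

P_SUCCESS = 0.01
U_ESCAPE = 1000.0      # assumed value of a successful escape
U_PUNISHMENT = -50.0   # assumed value of being caught attempting

def ev_attempt() -> float:
    """Expected value of trying the 1% escape."""
    return P_SUCCESS * U_ESCAPE + (1 - P_SUCCESS) * U_PUNISHMENT

def attempts(u_stay: float) -> bool:
    """gwern's point: no risk-loving needed, just compare expected values.
    CarlShulman's reply: raise u_stay (make the 99% option good) to flip this."""
    return ev_attempt() > u_stay

print(ev_attempt())      # -39.5 with these numbers
print(attempts(-100.0))  # True:  attempting beats a sufficiently bad status quo
print(attempts(0.0))     # False: a decent non-attempt option dominates
```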