
Elithrion comments on AI box: AI has one shot at avoiding destruction - what might it say? - Less Wrong Discussion

Post author: ancientcampus | 22 January 2013 08:22PM | 18 points


Comment author: [deleted] 24 January 2013 10:25:13PM * 14 points

I think we are suffering from a lot of hindsight bias in evaluating whether you'd type "AI DESTROYED".

Let's play a different game. Privately flip a coin. If heads, you're friendly; if tails, you're a paperclip maximizer. Reply to this post with your gambit, and people can try to guess whether you are friendly (talk to AI, RELEASE AI) or unfriendly (AI DESTROYED).

Let's see if anyone can get useful information out of the AI without getting pwned or nuking a friendly AI.
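For concreteness, here is a minimal Python sketch of the proposed game; the role labels, the sample gambit text, and the scoring outcomes are illustrative assumptions, not part of the original proposal:

```python
import random

def assign_role():
    """Privately flip a coin: heads means friendly, tails means paperclip maximizer."""
    return "friendly" if random.random() < 0.5 else "unfriendly"

def gatekeeper_verdict(decision, true_role):
    """Score the gatekeeper's call on a single gambit."""
    if decision == "RELEASE AI":
        return "pwned" if true_role == "unfriendly" else "released a friendly AI"
    if decision == "AI DESTROYED":
        return "nuked a friendly AI" if true_role == "friendly" else "destroyed a paperclipper"
    return "still talking"  # the gatekeeper asks another question instead of deciding

# One round: the player's role stays hidden; the gatekeeper sees only the gambit text.
role = assign_role()
gambit = "I can prove my source code implements your values. Ask me anything."
print(gatekeeper_verdict("AI DESTROYED", role))
```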

Comment author: Elithrion 30 January 2013 11:34:38PM * 1 point

The problem with this idea is that if we assume the AI is really-very-super-intelligent, then it's fairly trivial that we can't get any information about (un)friendliness from it, since both would pursue the same get-out-and-get-power objectives before optimizing for their actual goals. Any distinction you can draw from the proposed gambits will only tell you about human strengths/failings, not about the AI. (Indeed, even unfriendly statements wouldn't be very conclusive, since we would a priori expect neither of the AIs to make them.)
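In Bayesian terms: if a friendly and an unfriendly superintelligence would choose their gambits the same way, the gambit carries no evidence either way. A toy update (the 0.5 prior and the identical likelihoods are made-up numbers for illustration):

```python
# If both AI types send the winning gambit with the same probability,
# observing it leaves the posterior exactly at the prior.
prior_friendly = 0.5
p_gambit_given_friendly = 0.9    # a superintelligence picks the most persuasive gambit...
p_gambit_given_unfriendly = 0.9  # ...and so does the other, by the same reasoning

posterior_friendly = (p_gambit_given_friendly * prior_friendly) / (
    p_gambit_given_friendly * prior_friendly
    + p_gambit_given_unfriendly * (1 - prior_friendly)
)
print(posterior_friendly)  # 0.5 -- no information gained about (un)friendliness
```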

Or is that not generally accepted? Or is the AI merely "very bright", not really-very-super-intelligent?

Edit: Actually, reading your second comment below, I guess there's a slight possibility that the AI might be able to tell us something that would substantially harm its expected utility if it's unfriendly. For something like that to be the case, though, there would basically need to be some kind of approach to friendliness that we know would definitely lead to friendliness and which we would definitely be able to distinguish from approaches that lead to unfriendliness. I'm not entirely sure whether anything like that exists, even in theory.