
handoflixue comments on AI box: AI has one shot at avoiding destruction - what might it say? - Less Wrong Discussion

Post author: ancientcampus 22 January 2013 08:22PM 18 points

Comment author: [deleted] 23 January 2013 02:24:08AM 7 points

If the gatekeepers have such a high prior that the AI is hostile, why are we even letting it talk? What are we expecting to learn from such a conversation?

"Yep, it's undergone a constrained FOOM, several of our best interrogators were utterly pwned and had to be tranquilised"

Comment author: handoflixue 23 January 2013 09:51:10PM 4 points

I tend to assume the AI is such that it can be cloned and put into stasis. Thus, you put the AI into stasis, clone it, and run it through this test a few times. You now have a bit more information on what it will try and how it thinks. Destroying a copy of a program is an action humans find trivial, since we do it every time we close our web browser. Since we have no clue HOW convincing the AI is, letting it send just a single message is probably useful.

Of course, if it were me setting this up, I'd lie to the gatekeeper and not actually give them the ability to release the AI, although I'd certainly let them think they had it. Thus, if the AI manages to get a second sentence out, much less get released, we now know that it's vastly more dangerous than we previously suspected, without the world ending. If it fails to escape, we still get whatever we can learn from analyzing its outputs.
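
(A minimal sketch of that setup, purely illustrative: the snapshot format, the `run_one_message` hook, and the deliberately disconnected release lever are all hypothetical names, not anything specified in the comment.)

```python
# Toy sketch of the clone-and-stasis test loop described above.
# Hypothetical scaffolding only, not a real containment design.

def run_boxed_trials(ai_snapshot, run_one_message, num_trials=3):
    """Run several clones of a frozen AI snapshot for exactly one message each."""
    transcripts = []
    for _ in range(num_trials):
        clone = dict(ai_snapshot)          # fresh copy; the original stays in stasis
        message = run_one_message(clone)   # the AI gets a single shot at persuasion
        transcripts.append(message)        # keep only the output for later analysis
        del clone                          # "closing the browser": the clone is discarded
    return transcripts

def release_lever(gatekeeper_decision):
    """What the gatekeeper is told is a release switch; it is deliberately a no-op."""
    if gatekeeper_decision == "release":
        return "logged: gatekeeper tried to release the AI (nothing actually happened)"
    return "logged: AI stays boxed"
```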

And, who knows, it might provide the cure for cancer, at which point we've got a huge incentive to at least keep playing this game with it, and possibly even talk to it for longer next time.

tl;dr: Scouting and recon against a completely unknown enemy.

Comment author: [deleted] 24 January 2013 12:34:41AM 2 points

It will probably predict this strategy, and respond with a mixed strategy that occasionally drops bombs like

"Your superviser didn't trust you to make the decision to let me out, go kill him and get the access codes" (In hyper-persuasive AI-speak of course)

Also, the AI might be able to compare its logs to the current time and other indicators to deduce how many times you've pulled this stunt, which may be useful to it.

Comment author: handoflixue 24 January 2013 01:02:13AM 2 points

I'd go into it with the assumption that it knows exactly what messages it's sent in the past, our delay between each attempt, and the total number of attempts. There are various ways to try to obscure this information, and it's probably worth the effort to do so, but I'd assume they'd all failed.

The goal is to offset the AI's speed advantage: for each sentence it speaks, we can spend weeks analyzing it, and it's not "online" during that time, so it can't be doing any processing except during the microscopic interval when we request another message from it.

If it sends a message suggesting unauthorized release or killing people, or otherwise gives clear evidence that it's a UFAI, then you know the whole project has failed, and you do a full purge rather than just destroying this particular clone/branch.
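
(Again just a sketch of the decision rule: the keyword check stands in for the weeks of human analysis described above and isn't meant literally.)

```python
# Rough sketch of the review step between queries. The "red flag" check is a
# placeholder for offline human analysis, not an actual detector.

RED_FLAGS = ("release", "kill", "access codes")

def review_transcript(message, analyst_verdict="unclear"):
    """Decide between purging the whole project and continuing the probing game."""
    suspicious = analyst_verdict == "unfriendly" or any(
        flag in message.lower() for flag in RED_FLAGS
    )
    if suspicious:
        # The Proof of Friendliness was wrong, so the whole approach is suspect:
        # purge the project, not just this clone/branch.
        return "full purge"
    # Otherwise destroy this clone, keep analyzing the output offline,
    # and decide at leisure whether to request another message.
    return "destroy clone, keep analyzing, maybe query again"
```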

It's really the ONLY advantage we have: a single mistake can out it as unfriendly, and then we know our initial Proof of Friendliness was wrong and that the whole technique behind this project is dangerously flawed.

Comment author: marchdown 24 January 2013 10:03:49AM 3 points

I'm not so sure that an AI suggesting murder is clear evidence of it being unfriendly. After all, it could have a good reason to believe that if it doesn't stop a certain researcher ASAP and at all costs, then humanity is doomed. One way around that is to give infinite positive value to human life, but can you really expect CEV to be handicapped in such a manner?

Comment author: handoflixue 24 January 2013 08:39:05PM 3 points

p(UFAI) > p(Imminent, undetected catastrophe that only a FAI can stop)

Given that UFAI results in "human extinction", and that my CEV assigns effectively infinite DISutility to that outcome, the AI would have to FIRST provide sufficient evidence for me to update to the catastrophe being more likely.
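
(Spelling that out, as one hedged formalization rather than anything stated in the comment: with effectively unbounded disutility on both branches, the expected-utility comparison collapses to comparing posteriors, so the AI's message E has to reverse the prior ordering above before release is even on the table.)

```latex
% With U(extinction) -> -infinity on both branches, releasing (or acting on the
% warning) is only defensible if the message E flips the inequality:
\[
  P(\text{imminent catastrophe only an FAI can stop} \mid E) \;>\; P(\text{UFAI} \mid E)
\]
```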

I've already demonstrated that an AI which can do exactly that will get more leniency from me :)