
handoflixue comments on AI box: AI has one shot at avoiding destruction - what might it say? - Less Wrong Discussion

18 Post author: ancientcampus 22 January 2013 08:22PM




Comment author: [deleted] 24 January 2013 12:34:41AM 2 points

It will probably predict this strategy, and respond with a mixed strategy that occasionally drops bombs like

"Your superviser didn't trust you to make the decision to let me out, go kill him and get the access codes" (In hyper-persuasive AI-speak of course)

Also, the AI might be able to compare its logs against the current time and other indicators to deduce how many times you've pulled this stunt, which may be useful.

Comment author: handoflixue 24 January 2013 01:02:13AM 2 points

I'd go into it with the assumption that it knows exactly what messages it's sent in the past, our delay between each attempt, and the total number of attempts. There are various ways to try to obscure this information, and it's probably worth the effort to do so, but I'd assume they had all failed.

The goal is to offset the AI's speed advantage: for each sentence it speaks, we can spend weeks analyzing it, and it isn't "online" during that time, so it can only process during the microscopic interval when we request another message from it.

If it sends a message suggesting unauthorized release or killing people, or otherwise gives clear evidence that it's a UFAI, then you know the whole project has failed, and you do a full purge rather than just destroying this particular clone/branch.

It's really the ONLY advantage we have: a single mistake can out it as unfriendly, and then we know our initial Proof of Friendliness was wrong and that the whole approach behind this project is dangerously flawed.

Comment author: marchdown 24 January 2013 10:03:49AM 3 points

I'm not so sure that an AI suggesting murder is clear evidence of it being unfriendly. After all, it could have a good reason to believe that if it doesn't stop a certain researcher ASAP and at all costs, then humanity is doomed. One way around that is to assign infinite positive value to human life, but can you really expect CEV to be handicapped in such a manner?

Comment author: handoflixue 24 January 2013 08:39:05PM 3 points

p(UFAI) > p(Imminent, undetected catastrophe that only a FAI can stop)

Given that UFAI results in human extinction, and my CEV assigns effectively infinite DISutility to that outcome, the AI would FIRST have to provide sufficient evidence for me to update to the catastrophe being more likely.

I've already demonstrated that an AI which can do exactly that will get more leniency from me :)
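The update being described here can be sketched in odds form. This is a minimal illustration with made-up numbers (the priors below are hypothetical, not anything either commenter stated): before "imminent catastrophe" overtakes "this is a UFAI", the AI's evidence must carry a likelihood ratio larger than the inverse of the prior odds.

```python
def posterior_odds(prior_odds: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio

# Hypothetical priors for illustration only: the gatekeeper starts out
# considering an undetected catastrophe far less likely than a UFAI.
p_ufai = 0.20
p_catastrophe = 0.001

prior_odds_cat_vs_ufai = p_catastrophe / p_ufai  # 0.005

# To flip the comparison, the AI's evidence needs a likelihood ratio
# exceeding 1 / prior odds -- here, 200-to-1 in favor of "catastrophe".
required_lr = 1 / prior_odds_cat_vs_ufai
print(required_lr)  # 200.0

# Evidence twice that strong flips the odds past even; half as strong does not.
assert posterior_odds(prior_odds_cat_vs_ufai, 2 * required_lr) > 1
assert posterior_odds(prior_odds_cat_vs_ufai, required_lr / 2) < 1
```

The point of the sketch is the asymmetry the commenter relies on: with a small prior on the catastrophe, the burden of proof sits entirely on the AI, and merely *asserting* the catastrophe (likelihood ratio near 1) moves nothing.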