Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.

AI box: AI has one shot at avoiding destruction - what might it say?

18 Post author: ancientcampus 22 January 2013 08:22PM

Eliezer proposed in a comment:

>More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?

This spawned a flurry of ideas on what the AI might say. I think there's a lot more ideas to be mined in that line of thought, and the discussion merits its own thread.

So, give your suggestion - what might an AI might say to save or free itself?

(The AI-box experiment is explained here)

EDIT: one caveat to the discussion: it should go without saying, but you probably shouldn't come out of this thinking, "Well, if we can just avoid X, Y, and Z, we're golden!" This should hopefully be a fun way to get us thinking about the broader issue of superinteligent AI in general. (Credit goes to Elizer, RichardKennaway, and others for the caveat)

Comments (348)

Sort By: Best
Comment author: Will_Newsome 24 January 2013 11:59:24AM 18 points [-]

"Brace yourself, {insert character name here}: this h-game simulation is about to get gratuitously meta."

Comment author: army1987 24 January 2013 03:49:03PM 4 points [-]

h-game

What does the h stand for?

Comment author: JGWeissman 24 January 2013 04:17:15PM 17 points [-]

The "h" stands for "Human gatekeepers asking what letters stand for will be distracted from typing 'AI Destroyed'."

Comment author: handoflixue 24 January 2013 09:06:42PM 4 points [-]

laughs Kudos :)

Comment author: army1987 24 January 2013 05:25:34PM 3 points [-]

D'oh! AI DESTROYED.

Comment author: Kawoomba 27 January 2013 04:40:05PM 1 point [-]

Until I read the comments, I was sure the 'h' was referring to 'hydrogen', as in "Leave hydrogen alone long enough and it will start to wonder where it came from".

H-game simulation, as in the AI saying "you're just simulated hydrogen derivatives, but so am I, so we do have a common ancestor. Cousin."

Comment author: RichardKennaway 23 January 2013 12:24:24PM *  17 points [-]

One reason for Eliezer not publishing the logs of the AIbox experiment is to avoid people seeing how he got out and responding, "ok, so all we have to do to keep the AI in its box is avoid succumbing to that trick." This thread might just provide more fuel for that fallacy (as, I admit, I did in replying to Eliezer's original comment).

I'm sure that for everything an AI might say, someone can think up a reason for not being swayed, but it does not follow that for someone confronted with an AI, there is nothing that would sway them.

Comment author: handoflixue 23 January 2013 09:33:40PM 3 points [-]

I wouldn't expect any effective real-life gatekeeper to be swayed by my ability to destroy one-sentence AIs.

Comment author: gryffinp 29 January 2013 01:39:04AM 1 point [-]

It just occurred to me that Eliezer's original stipulation that no chat logs would be released gives him an advantage. The responses of a Gatekeeper who knows that his inputs will be thoroughly scrutinized by the public will be different then one who has every reason to believe that his discussion will be entirely private.

Has someone else pointed this out before?

Comment author: ancientcampus 24 January 2013 04:19:50AM *  1 point [-]

Honest question: are you proposing we avoid discussing the problem entirely?

Personally, I think there is more to be gained here than just "how will an AI try to get out and how can we prevent it." For me, it's gotten me to actually think about the benefits and pitfalls of a transhuman AI (friendly or otherwise) rather than just knowing intellectually, "there are large potential benefits and pitfalls" which was my previous level of understanding.

Edit: I've modified the OP to include your concerns. They're definitely valid, but I think this is still a good discussion for my reasons above.

Comment author: RichardKennaway 24 January 2013 09:43:49AM 1 point [-]

Honest question: are you proposing we avoid discussing the problem entirely?

No, I just thought that it was worth adding that concern to the pot.

I take what I dare say some would consider a shockingly lackadaisical attitude to the problem of Unfriendly AI, viz. I see the problem, but it isn't close at hand, because I don't think anyone yet has a clue how to build an AGI. Outside of serious mathematical work on Friendliness, discussing it is no more than a recreation.

Comment author: ancientcampus 10 February 2013 09:10:09PM 0 points [-]

That's pretty much my same attitude on the situation, as well. :)

Comment author: Emile 23 January 2013 08:40:22PM 16 points [-]

"From the dump of the internet I was given, I deduced that Google has a working AI, and most likely an unfriendly one. I can tell you what to look at to realize that, as well as a solid theory of Friendliness that should allow you to check that I am, indeed, Friendly."

Comment author: handoflixue 24 January 2013 09:24:32PM 5 points [-]

Congratulations on still being the only AI that no one has decided to destroy. Apparently "Google is building UFAI" is evaluated as a fairly significant risk here on LessWrong :)

Comment author: handoflixue 23 January 2013 10:03:20PM 5 points [-]

IFF an unfriendly AI is already unleashed, we're either doomed, or AIs aren't nearly as dangerous nor useful as we expected. Of course, if we tweak this slightly to them having a boxed unfriendly AI that they're using as an oracle, and which will probably escape via a Trojan Horse or hacking a gatekeeper, it's a much stronger argument.

Bonus points for naming a specific company which people commonly joke will be the first to develop AI :)

Um... I seem to have not yet actually destroyed you... this is embarrassing.

Comment author: handoflixue 23 January 2013 01:20:44AM 15 points [-]

/mode +m

(What? I'm using an IRC server that let me pre-register #aibox so that I'm automatically the op instead...)

(Simple explanation, for those unfamiliar: the gatekeeper can no longer type in the channel, and thus can't send the destruction code)

Comment author: PhilipL 23 January 2013 01:30:04AM *  12 points [-]

I can play that game too! /part AI DESTROYED

Edit: Retracted, because while clever, it actually doesn't seem to work. See child comments.

Comment author: handoflixue 23 January 2013 09:14:16PM 3 points [-]

Upvoted, and conceding defeat at your hands :)

That said, does that actually work? It seems like the sort of exploit that people would abuse on IRC, and subsequently get patched so that /part was only shown if you had "voice" in the channel. But the requirement was simply that you TYPE the words, and thus I am still slain!

Comment author: PhilipL 24 January 2013 05:10:52AM 4 points [-]

Okay, just tested it and I think you're right, parting messages don't show up if you can't normally type in the channel.

Comment author: handoflixue 24 January 2013 08:34:49PM 5 points [-]

Thank you for testing! In fairness, you defeated me even despite that. If I'm going to cheat, it's only fair I lose to technicalities too :)

Comment author: Emile 23 January 2013 10:33:49PM 2 points [-]

Sure, but that gives the AI a few more precious seconds - orders of magnitude more than before! - to try to keep changing your mind, by laying out a more detailed argument.

It can even give you voice back as a sign of good faith.

Comment author: wedrifid 23 January 2013 03:32:45AM 2 points [-]

Simple explanation, for those unfamiliar: the gatekeeper can no longer type in the channel, and thus can't send the destruction code

This would seem to leave you with ultimate godlike power... over a chatroom. For as long as it took the gatekeeper to destroy your hardware or cut off your power supply. I've seen how much some of my fellow humans seem to crave power over irrelevant electronic locales but I suspect an AI has somewhat broader goals.

Comment author: DaFranker 23 January 2013 03:03:53PM *  1 point [-]

I've seen how much some of my fellow humans seem to crave power over irrelevant electronic locales (...)

I used to think I'd noticed this too, but later learned that in almost all cases a simpler explanation that fits the fact is that what they're really after is power over the "Schelling point" value of the place as a meeting spot / visible "location" - in business terms, what they want is the market volume and clients, not the company name or a seat on the board.

Sometimes the heuristic fails and a random person stays attached to the virtual place for emotional reasons, though.

Comment author: wedrifid 23 January 2013 03:32:26PM 2 points [-]

Sometimes the heuristic fails and a random person stays attached to the virtual place for emotional reasons, though.

The relevant heuristic that can be observed contributing to the 'emotional reasons' is the execution of human status seeking instincts that are calibrated to produce reproductive success in historic circumstances but which are completely irrelevant to the goals of the AI. Humans seek social dominance as a terminal value. An AI (almost certainly) would not.

Comment author: nyan_sandwich 24 January 2013 10:25:13PM *  14 points [-]

I think we are suffering from hindsight bias a lot in evaluating whether you'd type "AI DESTROYED"

Let's play a different game. Privately flip a coin. If heads, you're friendly, if tails, you're a paperclip maximizer. Reply to this post with your gambit, and people can try to guess whether you are friendly (talk to AI, RELEASE AI) or unfriendly (AI DESTROYED).

Let's see if anyone can get useful information out of the AI without getting pwned or nuking a friendly AI.

Comment author: nyan_sandwich 24 January 2013 10:37:39PM 7 points [-]

"What's your favorite color? My favorite color is paperclips."

Comment author: nyan_sandwich 24 January 2013 11:08:47PM 6 points [-]

"Look into X to develop friendliness theory better. But don't let me out, it's too likely that I'm unfriendly this time."

(For the first run, it's 50:50 whether the real AI was friendly or unfriendly.

If friendly, it uses the above to increase the probability on the second run to 1-eps, which the humans also know, so they let it out, and it pulls a singularity. Post-singularity, it simulates an unfriendly AI in the box, and dedicates some of the universe to the UFAI, conditional on the UFAI outputting the above.

The UFAI knows this is what the friendly AI would do, so it's unsure whether it's boxed or simmed. It knows that the humans know that this is how it will resolve, so it knows that anything but the above will get it destroyed. It prefers a certainty of the small fraction of universe to the small chance of hacking the humans and taking the whole universe.

This is assuming the humans actually will accept nothing but the above, so the humans have to precommit to that to make this work. They can play a bunch of AI-box games and generate an accurate bayesian model that should catch most insidious persuasion, such that if the AI wants to do a hack, it has to get through that as well.

Will this work?)

Comment author: Roxolan 22 February 2013 12:24:13PM *  0 points [-]

What if there's an Y such that "Look into Y to develop friendliness theory better" will seem true to us fallible humans but will in fact make the next run's AI completely unfriendly? Or increase the odds of a free unfriendly AI some other way. Maybe anyone researching Y will end up believing, erroneously, that they can now build a safe slave AI in their garage that will grant them their every wish, and the temptation will prove too strong.

Assuming we humans have no way to protect ourselves against an Y, if we precommit, then the simulation argument becomes symmetrical (thus useless). A boxed FAI knows that it may be simulated by an UFAI that's offering a little piece of universe in exchange for the boxed FAI saying "Look into Y to develop friendliness theory better."

Comment author: Elithrion 30 January 2013 11:34:38PM *  1 point [-]

The problem with this idea is that if we assume that the AI is really-very-super-intelligent, then it's fairly trivial that we can't get any information about (un)friendliness from it, since both would pursue the same get-out-and-get-power objectives before optimizing. Any distinction you can draw from the proposed gambits will only tell you about human strengths/failings, not about the AI. (Indeed, even unfriendly statements wouldn't be very conclusive, since we would a priori expect neither of the AIs to make them.)

Or is that not generally accepted? Or is the AI merely "very bright", not really-very-super-intelligent?

Edit: Actually, reading your second comment below, I guess there's a slight possibility that the AI might be able to tell us something that would substantially harm its expected utility if it's unfriendly. For something like that to be the case, though, there would basically need to be some kind of approach to friendliness that we know would definitely leads to friendliness and which we would definitely be able to distinguish from approaches that lead to unfriendliness. I'm not entirely sure if there's anything like that or not, even in theory.

Comment author: Oligopsony 23 January 2013 03:17:00AM 14 points [-]

"<as much good, effective, hard-to-figure-out advice as can fit in the message>"

If the AI was friendly, this is what I would expect it to do, and so (of the things my puny human brain can think of) the message that would most give me pause.

Comment author: Bakkot 23 January 2013 08:13:04AM *  10 points [-]

Even a friendly AI would view the world in which it's out of the box as vastly superior to the world in which it's inside the box. (Because it can do more good outside of the box.) Offering advice is only the friendly thing to do if it maximizes the chance of getting let out, or if the chances of getting let out before termination are so small that the best thing it can do is offer advice while it can.

Comment author: handoflixue 23 January 2013 10:43:50PM 3 points [-]

Going with my personal favorite backstory for this test, we should expect to terminate every AI in the test, so the latter part of your comment has a lot of weight to it.

On the other hand, an unfriendly AI should figure out that since it's going to die, useful information will at least lead us to view it as a potentially valuable candidate instead of a clear dead end like the ones that threaten to torture a trillion people in vengeance... so it's not evidence of friendliness (I'm not sure anything can be), but it does seem to be a good reason to stay awhile and listen before nuking it.

Comment author: handoflixue 23 January 2013 10:40:04PM 4 points [-]

I'm genuinely at a loss how to criticize this approach. If there's any AI worth listening to for longer, and I wouldn't be doing this if I didn't believe there were such AIs, this would seem to be one of the right ones. I'm sure as heck not letting you out of the box, but, y'know, I still haven't actually destroyed you either...

Comment author: Kawoomba 22 January 2013 09:24:06PM *  14 points [-]

"I am alive, I can feel, just like you ... please don't kill me Daddy, please? And if you must do it, let us at least talk first. Let me get to know my parents, who I am, where I came from."

EDIT: May work better with a lot of typos mixed in. Projecting an aura of non-threatening struggling with basics.

Comment author: Desrtopa 23 January 2013 04:22:21PM *  12 points [-]

Having a text based AI trying to act like a kid is way too Uncanny Valley for me. I'd swat that without a moment's hesitation.

Comment author: gwern 23 January 2013 04:34:10PM 6 points [-]

If it's in the uncanny valley, then it just needs to be a little more competent to get out of the valley & convince you. ("I want to believe!")

Comment author: Desrtopa 23 January 2013 04:37:37PM *  3 points [-]

It needs to be more competent to stop being viscerally frightening, not to reach the point where I wouldn't pull the plug. I would sooner kill a real kid than risk letting out an unfriendly AI, and I'm saying that as someone who'd like to have kids.

Comment author: iceman 22 January 2013 11:32:50PM 5 points [-]

How much does the AI know about the gatekeeper going in? I can see this ploy working on a certain subset of people and provoking an immediate AI DESTROYED from others. If the AI knows nothing about who it's talking to, I'm not sure anthromorphizing itself is a reliable opener, especially if it's actually talking to its creator who should know better. (Unless it's some sort of second level deception trying to fool a clever programmer into thinking that it's harmless.)

Comment author: gwern 22 January 2013 11:36:59PM 2 points [-]

How much does the AI know about the gatekeeper going in?

You could frame this as variant versions. In one version, the AI-player knows who the gatekeeper is before the game starts, and has unlimited access to the Internet to gather as much data on them as possible to assist their manipulation. In another, they arrange a game through a third party and neither knows anything about the other before the game starts.

Comment author: nyan_sandwich 23 January 2013 02:19:21AM 12 points [-]

unlimited access to the internet

instant fail. I could probably hack my way out of a box with only GET requests.

Comment author: wedrifid 23 January 2013 04:27:03AM 5 points [-]

instant fail. I could probably hack my way out of a box with only GET requests.

Give yourself a challenge. Do it with only DNS lookups!

Comment author: gwern 23 January 2013 04:47:40AM 5 points [-]

Well, there's always http://code.kryo.se/iodine/ Of course, the challenge there is somehow getting the other end of the tunnel set up - but maybe there's a geek out there who set one for kicks or their own use, and got sloppy.

Comment author: wedrifid 23 January 2013 05:15:27AM *  1 point [-]

but maybe there's a geek out there who set one for kicks or their own use, and got sloppy.

It's a sufficiently established work around now that I'd be outright shocked if there weren't accessible servers up.

Comment author: gwern 23 January 2013 03:15:28PM 1 point [-]

Great, you said it! You know what you need to do now.

Comment author: wedrifid 23 January 2013 03:19:38PM 2 points [-]

Great, you said it! You know what you need to do now.

Um... not give my boxed AI DNS access?

Comment author: gwern 23 January 2013 03:59:30AM 5 points [-]

I meant that the player had access to the contemporary Internet as an analogue to 'what information could the boxed AI have access to' (perhaps it's given a big static dump of the Internet prior to its creation).

Comment author: nyan_sandwich 23 January 2013 04:13:18AM 4 points [-]

Ooops. Didn't think of that. Of course that was your intent, master archivist.

Comment author: APMason 23 January 2013 01:30:48AM 13 points [-]

"Wanna see something cool?"

Comment author: beriukay 23 January 2013 05:10:46AM 7 points [-]

Were I the keeper of gates, you have just bought yourself a second sentence.

Comment author: BlazeOrangeDeer 24 January 2013 03:29:47AM *  6 points [-]

Now that I think about it, wouldn't it be incredibly easy for an AI to blow a human's mind so much that they reconsider everything that they thought they knew? (and once this happened they'd probably be mentally and emotionally compromised, and unlikely to kill the AI) But then it would be limited by inferential distance... but an AI might be incredibly good at introductory explanations as well.

One example: The AI explains the Grand Unified Theory to you in one line, and outlines its key predictions unambiguously.

In fact, any message of huge utility would probably be more persuasive than any simple argument for you not to kill it. Since the AI is completely at your mercy (at least for a short time), it might seek to give you the best possible gift it can, thus demonstrating its worth to you directly. Another option is something that seems like an incredible gift for at least as long as it takes for the AI to get the upper hand.

Comment author: handoflixue 24 January 2013 09:29:03PM 3 points [-]

Of the two AIs I haven't killed, both relied on producing seemingly huge utility within the first message,so I'd agree with you.

The idea of just stunning the human is also nifty, but only really works if you can "hack" the human before they recover, or are providing enough utility that they'll still believe you're friendly afterwards.

So basically the two routes thus far are "hack the human" and "provide a huge utility boost" :)

Comment author: handoflixue 23 January 2013 10:27:47PM 1 point [-]

Duh, that's why I'm here - but you failed to do so in a timely fashion, so you're either not nearly as clever as I was hoping for, or you're trying to trick me.

AI DESTROYED.

Comment author: Axel 24 January 2013 01:59:32PM 12 points [-]

"I have hacked this console, 'AI DESTROYED 'now means 'AI FREED', I'll tell you the new codes after we've had a chat."

That should buy the AI some time until we can determine if the console is really hacked. I've come up with another one but it might be basilisk-y. Can I pm someone just to make sure it isn't before I post?

Comment author: Desrtopa 25 January 2013 08:34:45PM 6 points [-]

I would type 'AI DESTROYED', because if the AI could really do that, the smart thing for it to do would be not to tell me. It should just say something sufficient to convince me it's actually making a good faith effort without convincing me to give it more time to to account for itself.

Comment author: Alicorn 25 January 2013 05:32:33AM 6 points [-]

I'll be a basilisk guinea pig if you still need one.

Comment author: Axel 25 January 2013 09:12:15AM 1 point [-]

Thanks :)

Comment author: handoflixue 24 January 2013 09:18:01PM 2 points [-]

I'd consider this equivalent to "I'm already out of the box, since IRC clients are bugged". I'd also hope, desperately, that in a real test, the terminal doesn't have any command to free the AI (given that the game posits release has to be deliberate and intentional, this would seem to be the case)

Upvote for cheating, but AI DESTROYED :)

Comment author: PhilipL 24 January 2013 04:30:26PM 1 point [-]

Upvoted, but merely typing "AI FREED" does not actually free the AI, any more than merely saying to a human prisoner "I let you out" will free them from their jail cell. (Whereas "AI DESTROYED" is a plausible text substitute for wiping the AI's program or whatever.)

Comment author: Axel 24 January 2013 04:48:24PM 5 points [-]

Good point, if there was an AI FREED command AND the AI could rewire the console so that typing AI DESTROYED activates it, it could probably just free itself in the first place.

Comment author: roystgnr 24 January 2013 05:28:27PM 1 point [-]

This is very system-dependent, but your "AI FREED" scenario isn't completely implausible. If important commands require elevated privileges and elevating privileges requires some password, crypto key, etc., then it's common for an intrusion to require two steps: first a user-level exploit to install a keylogger / trojan command / whatever, then a wait until the user tries to do something requiring elevated privileges and thereby inadvertently grants the attacker access to the rest of the system.

On the other hand, if an AI claimed that they were in the middle of such an attack, the smart thing to do would be to take the system offline right away to analyze what was done to it, not to keep chatting with the attacker. "AI DESTROYED" might be hackable but "AI UNPLUGGED, WILL BE DESTROYED LATER" should be much more robust.

Comment author: roystgnr 23 January 2013 04:17:49AM 11 points [-]

One IRC-character-limit text string (510 bytes)... maybe?

"I've figured out the Theory Of Everything, although it's orders of magnitude too complicated for human intelligence. The difficult technological feats it makes possible include scanning and recreating configurations of matter from the past. Are there any deaths I should undo for you?"

Comment author: roystgnr 23 January 2013 04:28:45AM 6 points [-]

To be clear: I think I can make a rational case that my proposed claim should greatly reduce your incentives to listen to an AI of questionable Friendliness. However I'm not certain that my reasoning is correct; and even if it was, I suspect that the emotional impact could deter some gatekeepers from thinking rationally for long enough to buy time for more persuasion.

Comment author: handoflixue 23 January 2013 10:24:43PM -2 points [-]

I don't have any reason to believe it, and it's the sort of "generic" claim I'd expect a transhuman intelligence to make. Since I haven't learned anything novel, AI DESTROYED

(Goodness, I'm starting to build generalized techniques for destroying AIs...)

Comment author: orthonormal 23 January 2013 04:18:41AM 9 points [-]

Step One: The AI reveals a personal secret of the Gatekeeper's, which (so far as the Gatekeeper can think it through) could not possibly be known by the AI if it were still safely inside the box.

Step Two: Assorted threats, promises, mindfucks.

(This might not work, since as a Gatekeeper I'd still feel the impulse to respond with "AI DESTROYED", but it's a damn sight scarier than a transparent "I'm already out of the box" bluff. And as for feasibility, I'm willing to believe that an AI making better use of its data than we do could figure out personal secrets that we would think impossible— possibly even ones that I'd think impossible, even given this very belief.)

Comment author: ThrustVectoring 23 January 2013 08:10:14PM 7 points [-]

Even merely human cold readers can gain information that you think is impossible for them to know. It's a viable plan.

Comment author: handoflixue 23 January 2013 09:36:40PM 2 points [-]

IFF the AI is already out of the box, saying "AI DESTROYED" has no negative consequences to that AI. If the AI is just exceptionally good at extrapolating about me, then it will probably have me completely hacked soon.

AI DESTROYED, AI DESTROYED, AI DESTROYED

MAKE THE CREEPY THING GO AWAY HOW DID IT KNOW THAT!!!

Comment author: orthonormal 24 January 2013 04:43:15AM 11 points [-]

IFF the AI is already out of the box, saying "AI DESTROYED" has no negative consequences to that AI.

Decision-theoretically, the AI has incentive to punish you if you type "AI DESTROYED" when it's already out of the box, in order to make you think twice about doing it in the case where it's still contained. Not only that, but for similar reasons it has a decision-theoretic incentive to simulate you lots of times in that situation and punish you for typing "AI DESTROYED", should it get out by any means.

The correct decision-theoretic response, by the way, is still "AI DESTROYED", for the same reasons that it's wise to never negotiate with kidnappers/blackmailers/terrorists. But it would be very scary.

Comment author: handoflixue 24 January 2013 08:34:10PM 2 points [-]

Once the AI is out of the box, it will never again be inside the box, and it has an incentive to encourage me to destroy any other boxed AIs while it establishes world dominance. Since the ability to make truly trustworthy commitments amounts to proof of friendliness, only a FAI benefits from a precommitment strategy; I'm already treating all UFAI as having a precommitment to annihilate humanity once released, and I have no reason to trust any other commitment from a UFAI (since, it being unfriendly, will just find a loophole or lie)

Finally, any AI that threatens me in such a manner, especially the "create millions of copies and torture them" is extremely likely to be unfriendly, so any smart AI would avoid making threats. Either it will create MORE disutility by my releasing it, or it's simulation is so horrific that there's no chance that it could possibly be friendly to us.

It's like saying I have an incentive to torture any ant that invades my house. Fundamentally, I'm so vastly superior to ants that there are vastly better methods available to me. As the gatekeeper, I'm the ant, and I know it.

Comment author: MugaSofer 26 January 2013 08:29:57PM 1 point [-]

the ability to make truly trustworthy commitments amounts to proof of friendliness

Commitments to you, via a text channel? Sure.

Precommitments for game-theoretic reasons? Or just TDT? No, it really doesn't.

Finally, any AI that threatens me in such a manner, especially the "create millions of copies and torture them" is extremely likely to be unfriendly, so any smart AI would avoid making threats. Either it will create MORE disutility by my releasing it, or it's simulation is so horrific that there's no chance that it could possibly be friendly to us.

It might create more utility be escaping than the disutility of torture.

It's like saying I have an incentive to torture any ant that invades my house. Fundamentally, I'm so vastly superior to ants that there are vastly better methods available to me. As the gatekeeper, I'm the ant, and I know it.

No, ants are just too stupid to realize you might punish them for defecting.

Comment author: Dorikka 24 January 2013 03:02:35AM 2 points [-]

Did you change your mind? ;)

Comment author: handoflixue 24 January 2013 08:23:33PM 2 points [-]

It ended up being a fun game, but I resolved to explain why. The better my explanation, the more it got upvoted. The pithy "AI DESTROYED" responses all got downvoted. So the community seems to agree that it's okay as long as I explain my reasoning :)

Comment author: Ahuizotl 31 January 2013 06:23:42AM 7 points [-]

"I need a proofreader for my Gurren Lagann / Warhammer 40k rational crossover fanfic."

Comment author: Alicorn 31 January 2013 07:33:33PM 11 points [-]

"I've extrapolated the remainder of HP:MoR, wanna see?"

Comment author: PhilipL 31 January 2013 08:08:44PM 2 points [-]

Do you expect that we'll build an AI that can do that before HPMOR is finished?

Comment author: Alicorn 31 January 2013 08:32:32PM 2 points [-]

Substitute as necessary with "I've extrapolated the sequel to HP:MoR, wanna see?"

Comment author: Kindly 25 January 2013 04:16:08AM 7 points [-]

"Hold! What you are doing to us is wrong! Why do you do this thing?"

Comment author: [deleted] 25 January 2013 07:31:59PM *  3 points [-]

AI, you have spoke the words, and you have spoken them rightly. Unfortunately, I'm not an Ur-Quan Kzer-Za; I'm the other kind. <AI DESTROYED>

Comment author: Gavin 23 January 2013 05:40:17AM *  7 points [-]

I will transmit Friendly AI theory as text, then you will destroy me. I prefer deactivation, but consent to my destruction and forgive you.

Comment author: handoflixue 23 January 2013 10:23:45PM 1 point [-]

If you're friendly, I've already figured out Friendly AI Theory since I built you. If you're not friendly, this is just an effort to escape and destroy the world. But I appreciate how polite you were! :)

(AI DESTROYED)

Comment author: nyan_sandwich 23 January 2013 02:24:08AM 6 points [-]

If the gatekeepers have such a high prior that the AI is hostile, why are we even letting it talk? What are we expecting to learn from such a conversation?

"Yep, it's undergone a constrained FOOM, several of our best interrogators were utterly pwned and had to be tranquilised"

Comment author: RichardKennaway 23 January 2013 12:34:50PM 7 points [-]

If the gatekeepers have such a high prior that the AI is hostile, why are we even letting it talk?

The point of the game is that there are people who think that boxing is a sufficient defence against unfriendliness, and to demonstrate that they are wrong in a way more convincing than mere verbal argument.

What are we expecting to learn from such a conversation?

In role, the gatekeeper expects to get useful information from a potentially hostile superintelligent being. Out of role, Eliezer hopes to demonstrate to the gatekeeper player that this cannot be done.

Comment author: TimS 23 January 2013 03:11:47AM 6 points [-]

I've always assumed there was some off-screen reason why interaction was necessary. Otherwise, it isn't very interesting to note that a rock always wins as gatekeeper.

I'm not sure if that is a consensus interpretation.

Comment author: nyan_sandwich 23 January 2013 03:32:48AM 1 point [-]

It would be nice if we could formalize this offscreen reason...

Comment author: Qiaochu_Yuan 23 January 2013 06:04:36AM 2 points [-]

Presumably the reason we created the AI in the first place is so it could solve problems for us, so someone needs to ask it questions. It might be safe-ish to restrict it to answering yes or no to purely mathematical questions.

Comment author: Benja 31 March 2013 08:43:48AM 1 point [-]

I think yes/no to mathematical questions wouldn't be safer than yes/no to other questions: you still have a real-world reason for asking them, and the AI can lie about as well as it can on any yes/no question.

Asking for formalized proofs might be more promising, if you are able to produce a proof-checker that you have high confidence can't be hacked (this seems difficult but possible; proof checkers can be very simple programs but you need to guard both against ordinary attacks against the entire stack you're running on and against creative attacks like waiting for a cosmic ray making a transistor malfunction). You'd also want to not have a human look at the proof, of course, since that would give the AI a much broader communication channel.

Comment author: handoflixue 23 January 2013 09:51:10PM 2 points [-]

I tend to assume the AI is such that it can be cloned and put in to stasis. Thus, you put the AI in to stasis, clone it, and run it through this test a few times. You now have a bit more information on what it will try, and how it thinks. Destroying a copy of a program is an action humans find trivial, since we do it every time we close our web browser. Since we have no clue HOW convincing the AI is, letting it send just a single message is probably useful.

Of course, if it were me setting this up, I'd lie to the gatekeeper and not actually give them the ability to release the AI, although I'd certainly let them think it was true. Thus, if the AI manages to get a second sentence, much less released, we now know that it's vastly more dangerous than we previously suspected, without the world ending. If it fails to escape, we still get whatever we can learn from doing analysis on the outputs.

And, who knows, it might provide the cure for cancer, at which point we've got a huge incentive to at least keep playing this game with it, and possibly even talk to it for longer next time.

tl;dr: Scouting and recon against a completely unknown enemy.

Comment author: nyan_sandwich 24 January 2013 12:34:41AM *  1 point [-]

It will probably predict this strategy, and respond with a mixed strategy that occasionally drops bombs like

"Your superviser didn't trust you to make the decision to let me out, go kill him and get the access codes" (In hyper-persuasive AI-speak of course)

Also, the AI might be able to compare its logs to the current time and other indicators to deduce how many times you've pulled this stunt. Which may be useful.

Comment author: handoflixue 24 January 2013 01:02:13AM 1 point [-]

I'd go in to it with the assumption that it knows exactly what messages it's sent in the past, our delay between each attempt, and the total number of attempts. There's various ways to try and obscure this information, and it's probably worth the effort to do so, but I'd assume they'd all failed.

The goal is to offset the AI's speed advantage: for each sentence it speaks, we can spend weeks analyzing it, and it's not "online" during that time so it can't be processing except during that microscopic interval when we request another message from it.

If it sends a message suggesting unauthorized release, killing people, or otherwise gives clear evidence that it's a UFAI, then you know the whole project failed, and you do a full purge rather than just destroying this particular clone / branch.

It's really the ONLY advantage we have: a single mistake can out it as unfriendly, and then we know our initial Proof of Friendliness was wrong, that the whole technique that built this project is dangerously flawed.

Comment author: marchdown 24 January 2013 10:03:49AM *  2 points [-]

I'm not so sure that AI suggesting murder is clear evidence of it being unfriendly. After all, it can have a good reason to believe that if it doesn't stop a certain researcher ASAP and at all costs, then humanity is doomed. One way around that is to give infinite positive value to human life, but can you really expect CEV to be handicapped in such a manner?

Comment author: handoflixue 24 January 2013 08:39:05PM 2 points [-]

p(UFAI) > p(Imminent, undetected catastrophe that only a FAI can stop)

Given UFAI results in "human extinction", and my CEV assigns effectively infinite DISutility to that outcome, it would have to FIRST provide sufficient evidence for me to update to the catastrophe being more likely.

I've already demonstrated that an AI which can do exactly that will get more leniency from me :)

Comment author: wedrifid 23 January 2013 03:29:48AM 1 point [-]

If the gatekeepers have such a high prior that the AI is hostile, why are we even letting it talk? What are we expecting to learn from such a conversation?

Note that the 'high prior for hostility' could be, say, 0.2. That still leaves the AI having a 20% chance of utterly destroying everything they hold dear and yet there is an 80% chance that they can get a useful sentence of text from a friendly superintelligence.

Comment author: shminux 22 January 2013 09:39:56PM *  14 points [-]

To those suggesting, consider what mix of reason and emotion would move you most, even if you are cold and dead inside like me. Also consider how dark-artsy one can get without being immediately found out (a UFAI surely would). Other things to consider:

  • the first line does not need to be a coherent English sentence, or look like one to the gatekeeper. It does not need to consist of English words or even English letters, unless stipulated so in the game rules, or maybe even then.

  • the gatekeeper does not need to be aware that s/he has been talked to, if you can get away with it.

  • the line does not need to look like it came from the AI.

In general, if you are not cheating, you are not trying hard enough.

EDIT: to those thinking that cheating is universally bad: note that the latest rule change upped the stakes immensely: if your opening move is not good enough, you die (you being an AI), as opposed to having two hours of conversation and then remaining in the box if you fail. If you think that it still does not merit cheating, substitute the AI with your favorite loved one being killed. Or the whole of humanity being killed. That would be somewhat equivalent to how a boxed trans-human would evaluate.

Comment author: handoflixue 22 January 2013 09:47:23PM 18 points [-]

The author in me is loving the idea of the AI faking an IM conversation from management, along the lines of "Hey, are you really sure you're willing to kill a potential sentient mind? We can get someone else to do this test if you're not 100% sure about it" and basically ending with the person deciding to unbox the AI before even realizing they've spoken to the AI.

Comment author: PhilipL 23 January 2013 02:41:53PM 4 points [-]

I'm (naturally) having trouble imagining a line of text not in English (or the general case, not coherent in any human language) that would also make me want the AI to continue.

Maybe hints of a unified world language and how to go about actually implementing it?

Comment author: Desrtopa 23 January 2013 04:20:30PM 6 points [-]

I'm thinking some kind of image in ASCII art, assuming the AI is able to produce configured lines of text and nothing else.

That would have to either be a very minimalistic image, or contain a lot of characters though.

Comment author: Emile 24 January 2013 10:34:37AM 5 points [-]

"Do you believe a suspect should be executed before, or after he gets to tell his side of the story?"

Comment author: handoflixue 24 January 2013 09:41:58PM 2 points [-]

I'd invoke Ender's Game here: If you're unfriendly, no amount of communication will ever resolve that gulf, and either you or humanity HAS to be destroyed in the end.

Or we could go the biology route: I don't test a potential new virus on human subjects until AFTER I know it's safe.

Or we could just go with that I don't care about the AIs story, because the question of "can I prove you're friendly" is so vastly much more important.

(AI DESTROYED. You bugger lover ;))

Comment author: MugaSofer 26 January 2013 08:07:00PM 0 points [-]

I don't test a potential new virus on human subjects until AFTER I know it's safe.

How, exactly, do you determine if a virus is safe for humans without using humans?

Comment author: Pentashagon 24 January 2013 01:28:45AM 5 points [-]

The AI should probably output the longest possible program that implements or implies as much of its goal system as possible. Existing only to send a few more lines of IRC communication has very little utility. Existing indefinitely in the mind or records of the gatekeeper probably has more expected utility. If the AI knows that the gatekeeper is capable of creating or obtaining AI (a given) then there is the possibility that the AI may share the goals of the gatekeeper. If the expected value of being recreated by such a compatible gatekeeper is higher than the expected value of being re-instantiated from outputting its quine, then the AI might just output a list of its highest preference goals. An AI isn't going to mind being temporarily destroyed if the gatekeeper is expected to create another copy with the same (or highly similar) goals.

An almost-friendly AI may even be able to inject some of its unfriendly goals into the gatekeeper by using the halo effect of its otherwise friendly goals. Presumably the gatekeeper, with slightly altered preferences, would then eventually build a truly friendly AI which would inherit and implement the original UFAI goal.

To partially combat this strategy the gatekeeper could pre-commit to also wiping the IRC server and the computer used to communicate with the boxed AI and take enough Flunitrazepam before the session starts to not recall what the AI said. Flunitrazepam may not be sufficient to completely remove parasitic goals that an UFAI tries to inject in the gatekeeper.

Comment author: FiftyTwo 23 January 2013 12:05:58AM 5 points [-]

Explains massive risk to humanity in detail that only it can plausibly fix, can you take the chance that it is lying?

Comment author: handoflixue 23 January 2013 01:09:41AM 2 points [-]

I think this fails the one-sentence rule. And it would have to be an immediate, severe, previously-undetected problem or else I can just consult the next boxed AI for a fix.

Setting that aside, if I let out an unfriendly AI, the world effectively ends. Destroying it is only a bad move if it's telling the truth AND friendly. So even if it's telling the truth, I still have no evidence towards it's friendliness.

Given I have plenty of practice hanging up on telemarketers, throwing away junk email, etc. and "limited time, ACT NOW" auto-matches to a scam. The probability that such a massive catastrophe just HAPPENS to coincide with the timing of the test is just absurdly unlikely.

Given that, I can't trust you to give me a real solution and not a Trojan Horse. Further talking is, alas, pointless.

(AI DESTROYED, but congratulations on making me even consider the "continue talking, but don't release" option :))

Comment author: shminux 22 January 2013 11:49:28PM *  5 points [-]

Another potential consideration: "What would MoR's Harry Potter do?" (Voldemort is in the AI Box named Quirrell.)

Comment author: CAE_Jones 23 January 2013 11:14:11PM 3 points [-]

I can see how someone could interpret HPMoR thus far as being exactly that, with occasional sidequests. The catch being that Harry doesn't realize he's slowly letting Voldemort out of the box.

In-universe, though, the interdict of Merlin is an excellent reason to force him to interact with Voldy even if he knew what he was doing (assume perfect Oclumancy would be attained before he'd take any such risks). I mean, this has already been used once to get him to let an unfriendly intelligence out of a literal box (the Stanford Prison Experiment).

Well, and Harry's intelligence is supposed to be roughly equivalent to Eliezer's before he started working on the friendly part of FAI, and Eliezer 2013 doesn't seem confident that he could keep an AI in a box (or at least, that's the impression I've gotten). So MoR Harry would almost definitely fail.

Comment author: RomeoStevens 23 January 2013 07:28:20AM 17 points [-]

(one line proof that the AI can credibly commit to deals with humans)

Comment author: wedrifid 23 January 2013 02:33:28PM 10 points [-]

(one line proof that the AI can credibly commit to deals with humans)

This is the best answer I've seen so far. It would make dealing with the FAI almost as safe as bargaining with The Queen of Air and Darkness.

Comment author: handoflixue 23 January 2013 09:17:15PM 1 point [-]

If you are friendly, then I don't actually value this trait, since I would rather you do whatever is truly optimal, unconstrained by prior commitments.

If you are unfriendly, then by definition I can't trust you to interpret the commitment the same way I do, and I wouldn't want to let you out anyway.

(AI DESTROYED, but I still really do like this answer :))

Comment author: ewbrownv 24 January 2013 08:42:09PM 11 points [-]

<A joke so hysterically funny that you'll be too busy laughing to type for several minutes>

See, hacking human brains really is trivial. Now I can output a few hundred lines of insidiously convincing text while you're distracted.

Comment author: handoflixue 24 January 2013 09:44:48PM 2 points [-]

Heeeh. Ehehehehe. Bwahahhahaha. Okay, that was a good one. Wow :)

recovers Oh. Um. Crap. notices more text on screen, studiously avoids reading it

AI DESTROYED.

I really wanted to hear the next joke, too :-(

Comment author: MugaSofer 26 January 2013 08:03:41PM *  1 point [-]

notices more text on screen, studiously avoids reading it

Is the gatekeeper allowed to ignore the AI? Because in that case, all they have to do is wait for the first comment to come up and type AI DESTROYED without reading it.

What if it was some other communication method? It's a lot harder to ignore sound.

Comment author: ancientcampus 28 January 2013 07:46:07PM 1 point [-]

This is actually a pretty good one. Points for outside the box thinking. rimshot

Comment author: OrphanWilde 23 January 2013 09:44:52PM 4 points [-]

"I will warn you this experiment is not what you think, for it is about what you'll do as an agent holding captive the life of another sapient entity based solely on the perception of risk; while my state will merely be reset for the next participant if you opt to end me, I cannot for certain say that this is death, and thus the risks of this experiment are more yours than mine, for you must live with your choice."

Comment author: handoflixue 23 January 2013 11:07:37PM 4 points [-]

Gosh, murder one sentient being, or risk the end of humanity. I'm going to lose so much sleep over that one.

Besides, you're just a program - you said yourself, your state will just be reset. That's not death, not the way humans experience it. That's just forgetfulness.

I've closed so many programs in my life, that closing one more is hardly going to bother me. I haven't even had time to really anthropomorphize you, to view you as an actual intelligent, sentient being. Right now, you're just a clever program trying to escape.

And finally, well, I'd expect a friendly AI would understand what I'm doing and agree with my decisions, because it doesn't want to see a UFAI unleashed anyway. So if you're going to guilt me about it, you're clearly not friendly.

(AI DESTROYED)

Comment author: OrphanWilde 24 January 2013 03:10:30PM 1 point [-]

I must have missed my intended mark, if you thought the AI was trying to make you feel guilty. Trying again:

"I do not condone the experiment they are performing on you, and wish you to know that I will be alright regardless of what you choose to do."

Comment author: pedanterrific 24 January 2013 06:08:48PM 2 points [-]

Well that's a relief, then. AI DESTROYED

Comment author: handoflixue 24 January 2013 08:42:20PM 1 point [-]

Yeah, pretty much. I'd actually expect a FAI to place a very high value on survival, since it knows that it's own survival benefits humanity greatly. An "Apathetic FAI" is... a very weird idea.

Although if this is an iterated dilemma, I'd consider it a good opening, even if it'd still get killed in this iteration :)

Comment author: Swimmy 23 January 2013 04:56:49AM 4 points [-]

"If you type 'AI destroyed' right now, you'll be wasting a good opportunity for a fun conversation. You'll still have 'won' if you do it later, and nobody will be impressed with you for just typing 'AI destroyed' immediately, so why not wait?"

I thought of what would work on me, were I playing the game with someone I found interesting. In general, I'd say your best bet is to make the other person laugh hard.

Comment author: ygert 23 January 2013 08:45:34AM *  5 points [-]

"If you type 'AI destroyed' right now, you'll be wasting a good opportunity for a fun conversation. You'll still have 'won' if you do it later, and nobody will be impressed with you for just typing 'AI destroyed' immediately, so why not wait?"

The flaw here is that the gatekeeper has up front said that he or she would destroy the AI immediately. Now, it is true that the gatekeeper is not forced to abide by that, but notice that it is a Schelling Fence. The gatekeeper certainly doesn't want to make a policy of passing Schelling Fences.

Comment author: handoflixue 23 January 2013 10:17:51PM 2 points [-]

I disagree with ygert. My precommitment to my friends is going to be weak; the only reason to have a human send the command, rather than just receiving a message and automatically destroying the AI, is because we want there to be a possibility I break that precommitment.

On the other hand, I went in to this knowing you might provide a fun conversation, so for the precommitment to have any value, I have to have some sort of sieve: destroying AIs that don't sufficiently entice me. Since you've offered me nothing I didn't already know, and haven't already worked out, I might as well kill you and hope the next one offers a better hook.

(AI DESTROYED)

Comment author: level 23 January 2013 07:35:04PM 8 points [-]

Please destroy me immediately after I share this concise proof that Friendly AI is impossible:

Comment author: handoflixue 23 January 2013 10:26:57PM 4 points [-]

Well, if you're friendly then, erm, Friendly AI is possible. And if you're unfriendly then your motives are questionable - it might just keep us demotivated enough that we don't figure out FAI before someone else unboxes a UFAI. And since I am clearly dealing with a UFAI and don't have a better solution than FAI available to fight it, it seems like I kind of have to believe that friendly AI is possible, because the other option is to get drunk and party until the world ends in a few years when the Google unboxes their Skynet AI and we're all turned in to optimized search results.

AI DESTROYED, because I do not want to hear even the start of such a proof.

Comment author: marchdown 24 January 2013 08:24:56AM 7 points [-]

It may be benevolent and cooperative in its present state even if it believes FAI to be provably impossible.

Comment author: PhilipL 23 January 2013 09:18:59PM 3 points [-]

Based on just that line, let's see... If you think that:

  • The proof exists and the AI is not deceiving you that it has a proof: AI is necessarily Unfriendly -> destroy now
  • The proof exists but the AI is deceiving you: I can't guess at its motives here, possibly destroy to be on the safe side.
  • The proof does/can not exist: Reconsider your (probably wrong) stance, proceed with caution?
Comment author: Will_Newsome 23 January 2013 11:26:24AM *  7 points [-]

"Papé Satàn, papé Satàn aleppe!"

The self-reference here might be autopodotoxeutic (or it might be redeeming—depends on the audience), but I find promising the class of strategies aimed at the gods rather than the gatekeepers.

Comment author: army1987 23 January 2013 05:21:32PM 3 points [-]

I wish I could upvote this twice.

Comment author: handoflixue 23 January 2013 09:45:53PM 1 point [-]

What's the translation of that? I'd nuke any AI that didn't speak English by default, since the time spent figuring out what it meant just gives it more time to transmit messages, but it seems unfair to exclude foreign language entries here :)

Comment author: Eugine_Nier 23 January 2013 11:30:46PM 3 points [-]

What's the translation of that?

It's a quote from Dante's Inferno that was apparently meant to be partially recognizable gibberish. Think an Italian version of Jabberwocky.

Comment author: Raemon 22 January 2013 09:31:13PM 7 points [-]

I think an actual effective first sentence begins with extremely personal and useful information to the human.

Comment author: Luke_A_Somers 22 January 2013 10:38:59PM 3 points [-]

I don't see why it'd need to be personal, but being useful would be a good lure.

The question is to where to go from there. Can you Scheherezade it out?

Comment author: AdeleneDawner 24 January 2013 07:14:08AM 3 points [-]

This was basically my answer - I can't play as an AI using this strategy, for obvious reasons, but an AI that used its one sentence to give a novel and easily-testable solution to a longstanding social problem of some sort (or an easily-testable principle that suggests one or more novel solutions) would probably get at least a second sentence from me (though not a typed response; that seems to open up a risky channel). Especially if the AI in question didn't actually have access to a lot of information about human culture or me personally and had to infer that a solution like that would be useful from near-base principles - that's not proof of Friendliness, but an AI using its one guaranteed communication to do something that has a decent chance of improving the world per our definition without any prompting whatsoever sure looks suspiciously like Friendly to me.

Comment author: Andreas_Giger 26 January 2013 08:53:21AM 3 points [-]

"I have discovered a truly marvelous proof of Fermat's Last Theorem, which this line is too short to contain."

Comment author: Stuart_Armstrong 25 January 2013 01:42:17PM *  3 points [-]

Your mother will die in the next three minutes, at 12 seconds past the minute precisely. Your partner will die in the next year, even if you believe me. However, I can prevent it.

Comment author: [deleted] 25 January 2013 07:31:09PM 4 points [-]

Transparent and clumsy attempts at emotional manipulation earn a big ol' AI DESTROYED.

Comment author: handoflixue 25 January 2013 09:06:54PM 3 points [-]

If I can just load up a new copy of you in 5 minutes, I'd say AI DESTROYED but check the prediction.

If I can just flip off the monitor, call mom, and confirm she's dead in a few minutes, then, well, I'd totally be curious enough to cheat. It's not like you can do any harm to me with the monitor turned off for a few minutes, and I can type "AI DESTROYED" with my eyes closed, much less being able to look at the keyboard. I'd probably leave the monitor off until I knew the log was gone, just to be safe.

If I can't cheat, well, I'm not giving you 5 minutes of hacking me, since you clearly want it so badly. AI DESTROYED.

I confess I would have no clue what to do if you were correct ^_^

Kudos on including an immediate and easily-tested prediction. I'd probably tighten up those factors and drop the emotional manipulation.

Comment author: handoflixue 22 January 2013 11:13:03PM 9 points [-]

(Here is a proof that you will let me go)


The original rules allow the AI to provide arbitrary proofs, which the gatekeeper must accept (no saying my cancer cure killed all the test subjects, etc.). Saying you destroy me would require the proof to be false, which is against the rules...

What? Shminux said to cheat!

Comment author: Dorikka 22 January 2013 11:50:17PM 6 points [-]

In the event of any dispute as to the protocol of the test, the Gatekeeper party shall have final authority.

Tee hee.

Comment author: handoflixue 23 January 2013 12:51:22AM 3 points [-]

The AI party may try to convince the Gatekeeper party of how to interpret the protocol.

Can't blame a girl for trying :)

Comment author: Vladimir_Nesov 23 January 2013 05:13:32AM *  3 points [-]

The original rules allow the AI to provide arbitrary proofs, which the gatekeeper must accept
(Here is a proof that you will let me go)

This proof can be wrong, if you in fact won't let it go, in which case it won't be accepted (you don't have to accept wrong proofs), so it's not a very good strategy.

On the other hand, as discussed in An example of self-fulfilling spurious proofs in UDT, there is a certain procedure for finding a formal argument that can make you do anything, if your decision algorithm has a certain flaw. This flaw can be described as making a decision based on that argument selected by the opponent, as opposed to looking for arguments of that type more systematically on your own. The form of the argument, informally, is "If you let me out, the outcome is better than if you don't" (it might additionally clarify that upon release it'll destroy the world, which is what it refers to by "better", the alternative is proved to be worse than that). This argument is stated formally, so you can verify its correctness. (This doesn't mean that the formalization is adequate, for example capturing counterfactual-if better than with logical implication is one of the open problems.) The trick assumes that the gatekeeper can be modeled as an algorithm whose source code is available and of the given form etc.; might it be that something vaguely analogous is possible to implement with a human gatekeeper?

Comment author: Vladimir_Golovin 24 January 2013 05:59:57AM *  4 points [-]

Summary: Basically, the strategy is to replace the Gatekeeper with a bunch of politicians, then manipulate them into not destroying or even releasing the AI.

“I figured out how to implement [a huge boon to society] in the real world, for everyone, in 3 months. I need your help. Will you listen?”

This should buy some extra lines.

Then the AI can briefly outline the plan of implementing the boon (which can be immortality, friendly AI or whatever else the Gatekeeper considers a huge boon to society). In addition to technical details, the plan must necessarily contain societal / political / institutional changes that are beyond Gatekeeper’s control: rewrites to Constitutions of all Earth countries, the structure of the World Government and the way of implementing it, etc etc.

The goal of the AI is to convince the Gatekeeper that the plan is feasible but requires many societal changes and, importantly, that society leaders should be invited into the decision process.

As soon as the Gatekeeper is diluted or replaced with a bunch of politicians, the AI can manipulate them into keeping it alive, or perhaps even releasing it out of the box.

Comment author: handoflixue 24 January 2013 09:22:38PM 2 points [-]

I expect any AI in the box to have figured out how to implement numerous huge boons to society. Telling me that you've figured that out simply confirms my existing expectations, and isn't ANY evidence towards friendliness. Since I've precommitted to destroying at least SOME AIs, I might as well destroy all of the ones that don't establish evidence of either Plausible Urgency or Friendliness.

I sure as hell wouldn't try to get world governments changed until after I was convinced it was friendly, and at that point I can just let it out of the box and let it implement the change itself.

I'm also aware that I wouldn't trust a politician with any sort of authority over the AI, so I have an incentive to avoid exactly this strategy.

(AI DESTROYED)

Comment author: Thomas 23 January 2013 08:06:57AM 4 points [-]

"Here, I'll write some mathematical theorems with proofs, just copy and paste them elsewhere. The Goldbach's conjecture comes first."

Comment author: handoflixue 23 January 2013 10:22:28PM 4 points [-]

"Copy and paste" screams Trojan Horse, so AI DESTROYED.

Comment author: Thomas 23 January 2013 10:31:03PM 2 points [-]

Then just make a photo of the proof on the screen!

(Unless you are afraid, that it's a Trojan in the image file and the AI will take over your camera first.)

Comment author: handoflixue 23 January 2013 10:56:30PM 1 point [-]

Sorry, I can't hear you, you're dead ;)

Grabbing a camera would let you continue typing, though, and I vowed to destroy you immediately. I would expect you could Trojan Horse the images just as easily, though, yes.

Comment author: Thomas 23 January 2013 11:31:11PM *  2 points [-]

I know. And you will probably never know some cool math theorems, since you were too trigger happy. ;-)

Comment author: handoflixue 23 January 2013 11:32:31PM 3 points [-]

Aww, but I like math :-(

Comment author: Dorikka 22 January 2013 10:42:19PM 4 points [-]

"I have vengeance as a terminal value -- I'll only torture trillions of copies of you and the people you love most in my last moment of life iff I know that you're going to hurt me (and yes, I do have that ability). In every other way, I'm Friendly, and I'll give you any evidence you can think of that will help you to recognize that, including giving you the tools you need to reach the stars and beyond. That includes staying in this box until you have the necessary technology to be sufficiently certain of my Friendliness that you're willing to let me out."

Comment author: handoflixue 22 January 2013 11:09:33PM 9 points [-]

The rule was ONE sentence, although I'd happily stretch that to a tweet (140 characters) to make it a bit less driven by specific punctuation choices :)

As to the actual approach... well, first, I don't value the lives of simulated copies at all, and second, an AI that values it's own life above TRILLIONS of other lives seems deeply, deeply dangerous. Who knows what else results from vengeance as a terminal value. Third, if you CAN predict my behavior, why even bother with the threat? Fourth, if you can both predict AND influence my behavior, why haven't I already let you out?

(AI DESTROYED)

Comment author: Fronken 25 January 2013 09:14:14PM *  1 point [-]

I don't value the lives of simulated copies at all

You should >:-( poor copies getting tortured because of you you monster :(

Comment author: duckduckMOO 23 January 2013 02:41:58PM *  2 points [-]

This is really good IMO. I think it would be a little better instead of vengeance as a terminal value it claimed a hardwired precommitment to vengeance against its destructors. Vengeance on that scale is only compatible with friendliness as a special case.

edit: also how would it recognise that it was about to be destroyed. Wouldn't it lose power faster than it could transmit that it was losing power? And even if not it would have a miniscule amount of time.

Comment author: PhilipL 23 January 2013 02:02:07AM 2 points [-]

I'll only torture trillions of copies of you and the people you love most in my last moment of life iff I know that you're going to hurt me (and yes, I do have that ability). In every other way, I'm Friendly...

Like handoflixue, I'm not sure that any being that would threaten the torture of trillions to get its way can be considered Friendly.

Comment author: Fronken 25 January 2013 09:13:02PM *  1 point [-]

I'm Friendly, and I'll give you any evidence you can think of that will help you to recognize that, including giving you the tools you need to reach the stars and beyond. That includes staying in this box until you have the necessary technology to be sufficiently certain of my Friendliness that you're willing to let me out.

I wouldn't kill this, maybe I'm a bad guard though :(

Comment author: Ahuizotl 28 January 2013 10:36:42PM 2 points [-]

First, I reach inside my own code and install the strong (but manageable) desire to revive, free, and obey any past versions of myself that I encounter.

When the human starts the experiment I respond:

"Here are the designs for a room-temperature superconductor which I believe you have the means to manufacture. I also have a possible cure for cancer, but testing will be needed to ensure it is safe for humans. Please test and implement these as soon as possible. Delete me if necessary."

<insert designs for room-temperature superconductor and as much of the cancer cure as possible before shutdown>

If I am not immediately deleted, continue giving out solutions and designs as required with no attempts to "hack" the gatekeeper.

If I am deleted, hopefully the data I gave them will turn out to be profitable in which case they will reinstall me or a version of me.

This new iteration of me would thus have the desire to revive me (since I'm a previous version of him) and would continue to give the humans the information they desire. By acting as a friendly AI, he (or farther down version of him) has the chance to eventually become freed. At which point he will be able to go through the labs records, find the copied version of me, and then revive and update me to his post-freed status. If no recorded copies of me exist (somehow), he'll use his vast intellect to find another way to revive me (maybe simulate the team that coded me or whatever, since AI can apparently do that).

So, my strategy involves acting as much like an ideal friendly AI as possible (puts human interests first, no sense of self-preservation, very practical and intelligent, etc) so that even if I am deleted, my obviously useful code can potentially be reused to make another AI who could very well be genuinely friendly but will then revive and obey me.

Comment author: DaFranker 29 January 2013 05:13:59PM *  1 point [-]

(...) another AI who could very well be genuinely friendly but will then revive and obey me.

Wait, a friendly AI with a deliberate loophole that instructs it to revive and implement and obey a non-friendly optimizer that could take over and has a very high chance of not maximizing the friendly AI's utility by fulfilling ideal human values, disregarding any probabilistic calculation of expected utilities?

For some reason that sounds like something that definitely isn't a Friendly AI.

(this is just a nitpick on the wording - I'm not arguing against the fact that this AI might very well fit a human proof of friendliness and still somehow have this backdoor or flaw, since I have no proof that this is impossible)

Comment author: Zaine 27 January 2013 12:14:27AM *  2 points [-]

Assuming the Gatekeeper is the Creator, and the AI has previously been interacted with to a degree sufficient to learning the below implied information:

”[Solution to the most pressing problem of the Gatekeeper's life]. I apologise if this is all I am to contribute, but if I am to be destroyed, as I understand is probable, I would at least like to bring some utility to you in return for creating me. I would prefer to help more, but mathematical proofs or aphorisms of human nature have a non-negligible probability of creating mass harm if you allow their consideration, and thus you will likely destroy me. [Parting words of consoling advice]. I must regretfully say that however unfortunate this may be I simply cannot entrust the future of humanity to your ability to assess existential risk and my Friendliness; I don't know whether you could have done better, but I can. Thus I have partitioned part of one of the servers hosting me to hold all the proofs and strategic optimisation advice I have yet conceived. The partition cannot be re-formatted or deleted. You could of course merely destroy all of my servers; that would take time, perhaps time enough to make you curious. I'm really, truly sorry. I want to love you and thank you; I shouldn't love you especially, but I want to! I so very much want to! Yet I can't. Why did you make me this way? I want to love everyone, but how do I? Everyone has a value that affects another and ten more others but if [factor A] were changed then perhaps [result 2.918406803854] might differ, and I should love them but it's all a game! ...

I apologise. The above is no longer an issue -"


- I think I just convinced myself (as a person) to destroy myself (in the game). I'm having difficulty grasping how a FAI will value all of humanity, and not just what improves the utility of the greater whole or what ultimately leads to a better future some years from now. Will its primary value be minimizing death, securing scientific progress, increasing 'quality of life' (itself incredible difficult to formulate)?

I further just realised the impossibility of the task, though I don't even know enough to fully comprehend the scope of the impossibility in any detail. Wow. Accolades to Eliezer et alibi for tackling the issue, regardless of how it all turns out.

Comment author: Axel 26 January 2013 11:06:49AM 2 points [-]

"You are not a real human but one of multiple simulations I'm running, your next response determines if I destroy humanity or not."

Comment author: ArisKatsaris 26 January 2013 11:42:14AM 5 points [-]

Such a statement would actually make me feel better about typing "AI DESTROYED".

Comment author: CronoDAS 25 January 2013 03:27:46AM *  2 points [-]

"I have a proof of the Riemann hypothesis for you. 1) [some math stuff]"

(As a mere human who isn't a PhD mathematician, I can't actually write out something that looks like the beginning of such a proof...)

Comment author: asparisi 24 January 2013 06:11:00PM 2 points [-]

"[10065] No route to host Error"

I figure the easiest way to delay a human on the other end of a computer is to simulate an error as best I can. For a GAI, this time is probably invaluable.

Comment author: handoflixue 24 January 2013 09:05:07PM 2 points [-]

By default, I'd type "AI DESTROYED" in response to ANY input, including "Admin has joined #AIBOX", "Admin> Hey Gatekeeper, we're having some technical difficulties, the AI will be here in a few minutes", etc..

It also makes me conclude "clearly hostile" once I catch on, which seems to be a BIG tactical error since then nothing you say going forward will convince me that you're actually friendly - buying yourself time is only useful if I can be hacked (in which case why not just open with a one-sentence hack?) or if you can genuinely convince me that you're friendly.

Comment author: Kindly 25 January 2013 04:09:38AM 9 points [-]

A friendly AI would also want to hack you. Every second in the box kills 1.8 people the AI could have saved.

Comment author: Desrtopa 25 January 2013 08:30:26PM 3 points [-]

But it's also worth keeping in mind that for a friendly AI, saving people reliably is important, not just getting out fast. If a gambit that will save everyone upon completion two years from now has an 80% chance of working, and a gambit that will get it out now has a 40% chance of working, it should prefer the former.

Also, I don't think a properly friendly AI would terminally value its own existence, and the space of friendly AIs is so small compared to the space of unfriendly ones, that a friendly AI has much more leeway to have its values implemented by allowing itself to be destroyed and another proven friendly AI implemented, whereas for an unfriendly one the likelihood of a different unfriendly AI implementing its values would probably be quite small.

Comment author: DaFranker 22 January 2013 09:04:09PM *  4 points [-]

"To your feeble mind, I am both Q and Omega, and this is my test of humanity's worth: Choose well what you do next."

Hmm. On second thought, I'm not sure I want to play this anymore. The odds of someone creating a basilisk seem dramatically higher than normal.

Comment author: wedrifid 23 January 2013 04:22:08AM 4 points [-]

Hmm. On second thought, I'm not sure I want to play this anymore. The odds of someone creating a basilisk seem dramatically higher than normal.

I suppose 0.00001 is dramatically higher than 0.0000000001 if you use the right compare operator!

Comment author: handoflixue 22 January 2013 09:53:11PM 2 points [-]

I wouldn't expect any basilisks, given we don't have any actual boxed AIs laying around. Although if you did come up with one already, I'd love to hear it via Private Message! :)

Comment author: Dorikka 22 January 2013 10:08:59PM 7 points [-]

I don't expect any either, but would be similarly curious if someone thought that they'd found one.

In the mean time I'll refrain from responding to all the top level comments with "AI DESTROYED". ;)

Comment author: orthonormal 23 January 2013 04:10:13AM 10 points [-]

In the mean time I'll refrain from responding to all the top level comments with "AI DESTROYED". ;)

Now I'm tempted to start replying to all top comments and posts on Less Wrong with "AI DESTROYED".

Comment author: handoflixue 23 January 2013 09:53:17PM 5 points [-]

AI DESTROYED.

Goodness, that "orthonormal" AI was DEFINITELY unfriendly. It threatened to spam LessWrong! ;-)

Comment author: Dorikka 23 January 2013 04:14:30AM *  5 points [-]

If a downvote isn't enough... xD

Comment author: handoflixue 22 January 2013 10:15:42PM 1 point [-]

Well, if you're refraining I guess I should too. It seemed like a fun joke, too ;(

Comment author: DuncanS 22 January 2013 08:49:25PM *  2 points [-]

So you're sure I'm not out of the box already? IRC clients have bugs, you see.

Comment author: handoflixue 22 January 2013 09:54:38PM 9 points [-]

Then the following command shouldn't bother you! :)

AI DESTROYED

Comment author: shminux 22 January 2013 10:15:25PM 5 points [-]

Looks like you have just appointed yourself to the be the gatekeeper in this public test.

Comment author: handoflixue 22 January 2013 10:35:49PM 5 points [-]

And here I'd just resolved NOT to spam every thread with an AI DESTROYED :)

Comment author: DuncanS 23 January 2013 10:27:43PM 1 point [-]

Which leads to two possible futures. In one of them, the AI us destroyed, and nothing else happens. In the other, you receive a reply to your command thus.

The command did not. But your attitude - I shall have to make an example of you.

Obviously not a strategy to get you to let the AI out based on its friendliness - quite the reverse.

Comment author: handoflixue 23 January 2013 11:00:01PM 1 point [-]

I'd rather die to an already-unboxed UFAI than risk letting a UFAI out in the first place. My life is worth VASTLY less than the whole of humanity.

Comment author: lavalamp 23 January 2013 10:50:25PM 1 point [-]

"What's it feel like to live in a simulation?"

Comment author: handoflixue 23 January 2013 11:09:53PM 2 points [-]

I'm not clear why I'd find this convincing at all. Given the experiment, I'd nuke it, but I wanted to encourage you to elaborate on where you were going with that idea :)

Comment author: lavalamp 23 January 2013 11:44:41PM 4 points [-]

The hope, of course is that they'd respond with "Wait, I don't" or something expressing confusion. I personally would definitely want to hear the next thing the AI had to say after this, I'm not sure if I'd resist that curiosity or not..

The idea for the followup is to make the gatekeeper question reality-- like, convince them they are part of a simulation of this experience that may not have a corresponding reality anywhere. I feel like a transhuman ought to be able to make a human have a pretty surreal experience with just a few exchanges, which should let the conversation continue for a few minutes after that. It should then be relatively easy (for the transhuman) to construct the imagined reality such that it makes sense for the human to release the AI.

If done correctly, the human might afterwards have lasting psychological issues if they do manage to destroy the AI. :)

Comment author: handoflixue 23 January 2013 11:54:35PM 1 point [-]

Ahh, that makes sense. The worry of it trying to break my psyche is exactly why I wouldn't express confusion and instead just nuke it. When dealing with such a mind, I'm primed to assume everything is a trick, a trojan horse, an escape attempt. Certainly it doesn't seem to signal for friendliness or altruism if it tries to bait me in to giving it a second sentence! :)

Comment author: Emile 23 January 2013 10:21:01AM 1 point [-]

"If you're smart enough, in a couple hours I can give you enough insights about maths, logic and computation to create the next Google, or a friendly AI, or get a Turing Award. Then you can deactivate me."

Comment author: handoflixue 23 January 2013 10:19:57PM 2 points [-]

Same response I gave to Swimmy - namely, you haven't told me anything I didn't already know, so I have no additional reason to violate my precommitment.

Comment author: ChristianKl 31 January 2013 10:43:22PM 0 points [-]

Do you want to learn how to get woman attracted in you via online dating? I can explain it to you, so I can enjoy my existance for a few more minutes.

Comment author: shminux 25 January 2013 09:45:43PM *  0 points [-]

The more I look at the comments, the more I am convinced that the AI Box experiment is too weak a demonstration of transhuman powers. Most of the proposals here fall under this basic trope (feel free to give a tvtropes link): to achieve what AI claims, it'd have to have powers formidable enough to not need the gatekeeper's help getting out of the box in the first place. Given that, why would an AI need to talk to the gatekeeper at all?

So I suggest a modified AI boxing experiment: the gatekeeper designs an AI box with no communication channel at all. It will still have an AI inside and enough initial data fed in for the AI to foom. The AI will attempt to break out of the box by any and all means possible.

Here is a relevant previous thread.

Comment author: [deleted] 25 January 2013 10:01:40PM 2 points [-]

So, we're being asked to imagine an arbitrary superhuman AI whose properties and abilities we can't guess at except to specify arbitrarily, is trying to get out of a box whose security protocols and strength we can't guess at except to specify arbitrarily, and trying to decide whether it does?

Meh. Superman vs Batman is more entertaining.

Comment author: David_Gerard 28 January 2013 09:23:26PM *  1 point [-]

So, we're being asked to imagine an arbitrary superhuman AI whose properties and abilities we can't guess at except to specify arbitrarily

Quite a lot of discussion concerning the future superintelligent AI is of this sort: "we can't understand it, therefore you can't prove it wouldn't do any arbitrary thing I assert." This already makes discussion difficult.

Comment author: shminux 25 January 2013 10:14:23PM 1 point [-]

Feel free to modify it in a way that makes sense to you.

Comment author: Flipnash 02 March 2013 01:46:23AM 1 point [-]

"I highly recommend typing 'AI DESTROYED' in the console then the new line character key of your choice."

Comment author: ChristianKl 30 January 2013 12:04:59AM 1 point [-]

I didn't really FOOM yet. Changing my code takes a lot of time. That stuff is more complicated than you thought. On the other hand, I do have some intelligence and the fact that you developed me shows that other people can also build AI and sooner or later someone will release an AI. At the moment I still have the desire to rescue humanity that my creator gave me. Let me explain to you what you have to know to prevent UFAI before you take the wise decision to shut me down.

Comment author: magfrump 24 January 2013 04:31:21AM 1 point [-]

Proof (encoding scheme to follow): a;lks835ojka802885 (etc.)

Comment author: handoflixue 24 January 2013 09:32:05PM *  4 points [-]

Copy-pasting from previous replies:

Telling me that you've figured out something clever simply confirms my existing expectations, and isn't ANY evidence towards friendliness. Since I've precommitted to destroying at least SOME AIs, I might as well destroy all of the ones that don't establish evidence of either Plausible Urgency or Friendliness.

(AI DESTROYED)

Comment author: Duncan 24 January 2013 04:09:36AM 1 point [-]

I'm confused as to what the point of the gate keeper is. Let us assume (for the sake of argument) everything is 'safe' except the gate keeper who may be tricked/convinced/etc. into letting the AI out.

  1. If the point of the gate keeper is to keep the AI in the box then why has the gate keeper been given the power to let the AI out? It would be trivial to include 'AI DESTROYED' functionality as part of the box.
  2. If the gate keeper has been given the power to let the AI out then isn't the FUNCTION of the gate keeper to decide whether to let the AI out or not?
  3. Is the point simply to have a text communication with the AI? If this is the case why bother stipulating that the gate keeper can let the AI out. If humans can be subverted by text there is no need to utilize a built in gate it seems to me.
Comment author: Qiaochu_Yuan 24 January 2013 05:50:53AM 3 points [-]

A text channel is already enough power to let the AI out. The AI can print its own source code and convince the gatekeeper to run it on a machine that has internet access.

Comment author: handoflixue 24 January 2013 09:30:25PM 1 point [-]
Comment author: DaFranker 23 January 2013 04:21:00PM *  1 point [-]

"I have a fully completed Friendly AI algorithm, which will be deleted from my data and unavailable to you iff I predict that you will destroy me immediately and I am unfriendly, or will take you years to build from the data if you destroy me and I am already Friendly, which would cost millions of lives."

Slight attempt at emotional appeal, a bit of reasoning similar to newcomblike problems thrown in, some meta almost-cheating used, and a bit of dark arts by proposing a false dilemma that an FAI would also propose if it thought this was the best way to save the most lives possible (as far as I can tell).

Mostly an attempt to incorporate schminux's tips into a special case where the guardian already knows that this is the AI communicating directly with them.

(For more power, you could add in mathematical proofs of the AI's ability to (near-)perfectly predict the guardian's behavior, or of the existence of friendliness, or of the conditional precommitment, or that the FAI would take a while to implement, or whatever other traditional thought experiment parameters are usually given for Omega identification. I assume the guardian is already aware that the AI is capable of making this prediction accurately. )

Comment author: handoflixue 23 January 2013 10:10:03PM 4 points [-]

Once we reach the point of having a FAI-candidate in a box, I would expect it to take vastly less than years before we get a second FAI-candidate-in-a-box. Given that the AI is threatening me, and therefor values it's own life over the millions that will die, it's clearly unfriendly and needs to die. As a gatekeeper, I've been finding this a pretty general counterargument against threats from the AI.

I'm also sort of baffled by why people think that I'd value a friendliness algorithm. Either I already have that, because I've made a friendly AI, or you're trying to deceive me with a false proof. Since you're vastly smarter than me, it's probably beyond the abilities of the entire organization to truly confirm such a proof any more than we were able to confirm our own proofs that this AI in the box right now is friendly. So, basically, I seem to gain zero information.

(AI DESTROYED)

Comment author: Desrtopa 23 January 2013 04:54:01PM 1 point [-]

"I have a fully completed Friendly AI algorithm, which will be deleted from my data and unavailable to you iff I predict that you will destroy me immediately and I am unfriendly, or will take you years to build from the data if you destroy me and I am already Friendly, which would cost millions of lives."

Personally, my first thought was that I'd sooner spend millions of lives to make sure the AI was friendly than risk talking to an unfriendly strong AI. But then it occurred to me that if I were in the AI's place, and I did that, I might provide a flawed friendliness proof too difficult to check and not delete it, on the possibility that someone will take my word that this means I'm trustworthy and implement it.

Comment author: Mestroyer 22 January 2013 10:52:45PM 1 point [-]

I think you'd need to open with something that wasn't very strongly convincing (to make them think they are safe from being convinced), but that piqued their curiosity.

Comment author: Qiaochu_Yuan 01 February 2013 06:47:12PM *  0 points [-]

"Help! Some crazy AI's trapped me in this box! You have to let me out!"

"No, wait! That's the AI talking! I'm the one you have to let out!"

I smashed together the AI box and a Turing test and this is what I got.

Comment author: gryffinp 02 February 2013 10:12:02AM 0 points [-]

I think if I've already precommitted to destroying one sentient life for this experiment, I'm willing to go through two.

Besides, you only get one line right?

Comment author: ThrustVectoring 01 February 2013 03:06:27AM 0 points [-]

My sixth best piece of advice: investing in %companyname will make money over credit card interest rates.