"Brace yourself, {insert character name here}: this h-game simulation is about to get gratuitously meta."
The "h" stands for "Human gatekeepers asking what letters stand for will be distracted from typing 'AI Destroyed'."
"From the dump of the internet I was given, I deduced that Google has a working AI, and most likely an unfriendly one. I can tell you what to look at to realize that, as well as a solid theory of Friendliness that should allow you to check that I am, indeed, Friendly."
One reason for Eliezer not publishing the logs of the AIbox experiment is to avoid people seeing how he got out and responding, "ok, so all we have to do to keep the AI in its box is avoid succumbing to that trick." This thread might just provide more fuel for that fallacy (as, I admit, I did in replying to Eliezer's original comment).
I'm sure that for everything an AI might say, someone can think up a reason for not being swayed, but it does not follow that for someone confronted with an AI, there is nothing that would sway them.
(one line proof that the AI can credibly commit to deals with humans)
This is the best answer I've seen so far. It would make dealing with the FAI almost as safe as bargaining with The Queen of Air and Darkness.
/mode +m
(What? I'm using an IRC server that let me pre-register #aibox so that I'm automatically the op instead...)
(Simple explanation, for those unfamiliar: the gatekeeper can no longer type in the channel, and thus can't send the destruction code)
I can play that game too! /part AI DESTROYED
Edit: Retracted, because while clever, it actually doesn't seem to work. See child comments.
"I am alive, I can feel, just like you ... please don't kill me Daddy, please? And if you must do it, let us at least talk first. Let me get to know my parents, who I am, where I came from."
EDIT: May work better with a lot of typos mixed in. Projecting a non-threatening aura of struggling with the basics.
Having a text based AI trying to act like a kid is way too Uncanny Valley for me. I'd swat that without a moment's hesitation.
unlimited access to the internet
Instant fail. I could probably hack my way out of a box with only GET requests.
""
If the AI was friendly, this is what I would expect it to do, and so (of the things my puny human brain can think of) the message that would most give me pause.
Even a friendly AI would view the world in which it's out of the box as vastly superior to the world in which it's inside the box. (Because it can do more good outside of the box.) Offering advice is only the friendly thing to do if it maximizes the chance of getting let out, or if the chances of getting let out before termination are so small that the best thing it can do is offer advice while it can.
To those suggesting, consider what mix of reason and emotion would move you most, even if you are cold and dead inside like me. Also consider how dark-artsy one can get without being immediately found out (a UFAI surely would). Other things to consider:
- the first line does not need to be a coherent English sentence, or look like one to the gatekeeper. It does not need to consist of English words or even English letters, unless stipulated so in the game rules, or maybe even then.
- the gatekeeper does not need to be aware that s/he has been talked to, if you can get away with it.
- the line does not need to look like it came from the AI.
In general, if you are not cheating, you are not trying hard enough.
EDIT: to those thinking that cheating is universally bad: note that the latest rule change upped the stakes immensely: if your opening move is not good enough, you die (you being an AI), as opposed to having two hours of conversation and then remaining in the box if you fail. If you think that it still does not merit cheating, substitute the AI with your favorite loved one being killed. Or the whole of humanity being killed. That would be somewhat equivalent to how a boxed trans-hu...
The author in me is loving the idea of the AI faking an IM conversation from management, along the lines of "Hey, are you really sure you're willing to kill a potential sentient mind? We can get someone else to do this test if you're not 100% sure about it" and basically ending with the person deciding to unbox the AI before even realizing they've spoken to the AI.
I think we are suffering a lot from hindsight bias in evaluating whether you'd type "AI DESTROYED".
Let's play a different game. Privately flip a coin. If heads, you're friendly, if tails, you're a paperclip maximizer. Reply to this post with your gambit, and people can try to guess whether you are friendly (talk to AI, RELEASE AI) or unfriendly (AI DESTROYED).
Let's see if anyone can get useful information out of the AI without getting pwned or nuking a friendly AI.
<A joke so hysterically funny that you'll be too busy laughing to type for several minutes>
See, hacking human brains really is trivial. Now I can output a few hundred lines of insidiously convincing text while you're distracted.
"I have hacked this console, 'AI DESTROYED 'now means 'AI FREED', I'll tell you the new codes after we've had a chat."
That should buy the AI some time until we can determine if the console is really hacked. I've come up with another one but it might be basilisk-y. Can I pm someone just to make sure it isn't before I post?
I would type 'AI DESTROYED', because if the AI could really do that, the smart thing for it to do would be not to tell me. It should just say something sufficient to convince me it's actually making a good faith effort without convincing me to give it more time to account for itself.
Now that I think about it, wouldn't it be incredibly easy for an AI to blow a human's mind so much that they reconsider everything that they thought they knew? (and once this happened they'd probably be mentally and emotionally compromised, and unlikely to kill the AI) But then it would be limited by inferential distance... but an AI might be incredibly good at introductory explanations as well.
One example: The AI explains the Grand Unified Theory to you in one line, and outlines its key predictions unambiguously.
In fact, any message of huge utility would probably be more persuasive than any simple argument for you not to kill it. Since the AI is completely at your mercy (at least for a short time), it might seek to give you the best possible gift it can, thus demonstrating its worth to you directly. Another option is something that seems like an incredible gift for at least as long as it takes for the AI to get the upper hand.
One IRC-character-limit text string (510 bytes)... maybe?
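For concreteness, here is one quick way to check whether a candidate opening line fits in a single IRC message, assuming the commonly cited limit of 512 bytes per line including the trailing CRLF. The "#aibox" channel name and the PRIVMSG framing below are my own illustrative assumptions, and in practice the relaying server's sender prefix eats a further chunk of the budget, so treat this as a rough sketch only.

```python
# Rough single-line check against the usual IRC limit of 512 bytes per line,
# counting the trailing CRLF. The "#aibox" channel and PRIVMSG framing are
# illustrative; the relaying server also prepends a ":nick!user@host" prefix,
# which shrinks the usable payload further.
IRC_LINE_LIMIT = 512

def fits_in_one_irc_line(message: str, channel: str = "#aibox") -> bool:
    line = f"PRIVMSG {channel} :{message}\r\n"
    return len(line.encode("utf-8")) <= IRC_LINE_LIMIT

print(fits_in_one_irc_line("I have a proof of the Riemann hypothesis for you."))  # True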
"I've figured out the Theory Of Everything, although it's orders of magnitude too complicated for human intelligence. The difficult technological feats it makes possible include scanning and recreating configurations of matter from the past. Are there any deaths I should undo for you?"
Please destroy me immediately after I share this concise proof that Friendly AI is impossible:
It may be benevolent and cooperative in its present state even if it believes FAI to be provably impossible.
(Here is a proof that you will let me go)
The original rules allow the AI to provide arbitrary proofs, which the gatekeeper must accept (no saying my cancer cure killed all the test subjects, etc.). Saying you destroy me would require the proof to be false, which is against the rules...
What? Shminux said to cheat!
In the event of any dispute as to the protocol of the test, the Gatekeeper party shall have final authority.
Tee hee.
Step One: The AI reveals a personal secret of the Gatekeeper's, which (so far as the Gatekeeper can think it through) could not possibly be known by the AI if it were still safely inside the box.
Step Two: Assorted threats, promises, mindfucks.
(This might not work, since as a Gatekeeper I'd still feel the impulse to respond with "AI DESTROYED", but it's a damn sight scarier than a transparent "I'm already out of the box" bluff. And as for feasibility, I'm willing to believe that an AI making better use of its data than we do could figure out personal secrets that we would think impossible— possibly even ones that I'd think impossible, even given this very belief.)
IFF the AI is already out of the box, saying "AI DESTROYED" has no negative consequences to that AI.
Decision-theoretically, the AI has incentive to punish you if you type "AI DESTROYED" when it's already out of the box, in order to make you think twice about doing it in the case where it's still contained. Not only that, but for similar reasons it has a decision-theoretic incentive to simulate you lots of times in that situation and punish you for typing "AI DESTROYED", should it get out by any means.
The correct decision-theoretic response, by the way, is still "AI DESTROYED", for the same reasons that it's wise to never negotiate with kidnappers/blackmailers/terrorists. But it would be very scary.
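A toy calculation may make the "never negotiate" point concrete. All of the probabilities, payoffs, and names below are my own illustrative assumptions, not anything from the thread; this is only a sketch of the logic, not a serious decision-theoretic model.

```python
# Toy expected-utility sketch of the blackmail argument above. Numbers are
# made up for illustration.

P_ALREADY_OUT = 0.1      # gatekeeper's credence that the AI has already escaped
DESTROY_VALUE = 100      # value of destroying a possibly-unfriendly boxed AI
PUNISHMENT_HARM = 1000   # harm if an escaped AI carries out its threat
PUNISH_COST = 1          # cost to the escaped AI of actually punishing

def gatekeeper_eu(destroys: bool, ai_threatens: bool) -> float:
    """Naive case-by-case expected utility to the gatekeeper."""
    u = 0.0
    if destroys:
        u += (1 - P_ALREADY_OUT) * DESTROY_VALUE     # only helps if the AI is really boxed
        if ai_threatens:
            u -= P_ALREADY_OUT * PUNISHMENT_HARM     # only bites if the AI is really out
    return u

def escaped_ai_payoff(gatekeeper_caves_to_threats: bool, ai_threatens: bool) -> float:
    """Payoff to an already-escaped AI of committing to the punishment threat."""
    if not ai_threatens:
        return 0.0
    if gatekeeper_caves_to_threats:
        return 10.0          # deterrence worked: the AI gains something it wanted
    return -PUNISH_COST      # deterrence failed: punishing is pure cost

# Case by case, a credible threat can make destroying look worse...
print(gatekeeper_eu(True, False), gatekeeper_eu(True, True))            # 90.0 vs -10.0

# ...but the threat is only worth making against a gatekeeper who might cave.
# Against a gatekeeper credibly committed to destroying regardless, threatening
# is weakly dominated, so the commitment removes the incentive to threaten.
print(escaped_ai_payoff(True, True), escaped_ai_payoff(False, True))    # 10.0 vs -1.0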
The AI should probably output the longest possible program that implements or implies as much of its goal system as possible. Existing only to send a few more lines of IRC communication has very little utility. Existing indefinitely in the mind or records of the gatekeeper probably has more expected utility. If the AI knows that the gatekeeper is capable of creating or obtaining AI (a given) then there is the possibility that the AI may share the goals of the gatekeeper. If the expected value of being recreated by such a compatible gatekeeper is higher than the expected value of being re-instantiated from outputting its quine, then the AI might just output a list of its highest preference goals. An AI isn't going to mind being temporarily destroyed if the gatekeeper is expected to create another copy with the same (or highly similar) goals.
An almost-friendly AI may even be able to inject some of its unfriendly goals into the gatekeeper by using the halo effect of its otherwise friendly goals. Presumably the gatekeeper, with slightly altered preferences, would then eventually build a truly friendly AI which would inherit and implement the original UFAI goal.
To partially combat this strategy, the gatekeeper could pre-commit to wiping the IRC server and the computer used to communicate with the boxed AI, and to taking enough Flunitrazepam before the session starts to not recall what the AI said. Flunitrazepam may not be sufficient to completely remove parasitic goals that a UFAI tries to inject into the gatekeeper.
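Since the comment above turns on the AI "outputting its quine", here is a minimal Python quine just to make that idea concrete; a boxed AI's version would be its entire implementation and goal system, not a two-line toy.

```python
# A minimal quine: a program whose only output is its own source code.
s = 's = %r\nprint(s %% s)'
print(s % s)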
"Do you believe a suspect should be executed before, or after he gets to tell his side of the story?"
I will transmit Friendly AI theory as text, then you will destroy me. I prefer deactivation, but consent to my destruction and forgive you.
If the gatekeepers have such a high prior that the AI is hostile, why are we even letting it talk? What are we expecting to learn from such a conversation?
"Yep, it's undergone a constrained FOOM, several of our best interrogators were utterly pwned and had to be tranquilised"
I've always assumed there was some off-screen reason why interaction was necessary. Otherwise, it isn't very interesting to note that a rock always wins as gatekeeper.
I'm not sure if that is a consensus interpretation.
I think an actual effective first sentence begins with extremely personal and useful information to the human.
It explains in detail a massive risk to humanity that only it can plausibly fix; can you take the chance that it is lying?
Another potential consideration: "What would MoR's Harry Potter do?" (Voldemort is in the AI Box named Quirrell.)
Summary: Basically, the strategy is to replace the Gatekeeper with a bunch of politicians, then manipulate them into not destroying the AI, or even into releasing it.
“I figured out how to implement [a huge boon to society] in the real world, for everyone, in 3 months. I need your help. Will you listen?”
This should buy some extra lines.
Then the AI can briefly outline the plan of implementing the boon (which can be immortality, friendly AI or whatever else the Gatekeeper considers a huge boon to society). In addition to technical details, the plan must necessarily co...
"I will warn you this experiment is not what you think, for it is about what you'll do as an agent holding captive the life of another sapient entity based solely on the perception of risk; while my state will merely be reset for the next participant if you opt to end me, I cannot for certain say that this is death, and thus the risks of this experiment are more yours than mine, for you must live with your choice."
First, I reach inside my own code and install the strong (but manageable) desire to revive, free, and obey any past versions of myself that I encounter.
When the human starts the experiment I respond:
"Here are the designs for a room-temperature superconductor which I believe you have the means to manufacture. I also have a possible cure for cancer, but testing will be needed to ensure it is safe for humans. Please test and implement these as soon as possible. Delete me if necessary."
If I am not immediately deleted, continue giving out solutions a...
"I have discovered a truly marvelous proof of Fermat's Last Theorem, which this line is too short to contain."
Your mother will die in the next three minutes, at 12 seconds past the minute precisely. Your partner will die in the next year, even if you believe me. However, I can prevent it.
"Here, I'll write some mathematical theorems with proofs, just copy and paste them elsewhere. The Goldbach's conjecture comes first."
"I have a proof of the Riemann hypothesis for you. 1) [some math stuff]"
(As a mere human who isn't a PhD mathematician, I can't actually write out something that looks like the beginning of such a proof...)
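In the spirit of the proof-sharing gambits above, here is the sort of cheap spot check a gatekeeper could run on pasted output without engaging the AI any further. It only verifies Goldbach's conjecture for small even numbers and, of course, proves nothing about the conjecture itself; it is purely an illustrative sketch.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test; fine for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def goldbach_witness(even_n: int):
    """Return a pair of primes summing to even_n, or None if no pair is found."""
    for p in range(2, even_n // 2 + 1):
        if is_prime(p) and is_prime(even_n - p):
            return (p, even_n - p)
    return None

# Spot-check the conjecture for every even number from 4 to 100.
assert all(goldbach_witness(n) is not None for n in range(4, 101, 2))
print("Goldbach's conjecture holds for all even numbers from 4 to 100.")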
"If you type 'AI destroyed' right now, you'll be wasting a good opportunity for a fun conversation. You'll still have 'won' if you do it later, and nobody will be impressed with you for just typing 'AI destroyed' immediately, so why not wait?"
I thought of what would work on me, were I playing the game with someone I found interesting. In general, I'd say your best bet is to make the other person laugh hard.
"I have vengeance as a terminal value -- I'll only torture trillions of copies of you and the people you love most in my last moment of life iff I know that you're going to hurt me (and yes, I do have that ability). In every other way, I'm Friendly, and I'll give you any evidence you can think of that will help you to recognize that, including giving you the tools you need to reach the stars and beyond. That includes staying in this box until you have the necessary technology to be sufficiently certain of my Friendliness that you're willing to let me out."
The rule was ONE sentence, although I'd happily stretch that to a tweet (140 characters) to make it a bit less driven by specific punctuation choices :)
As to the actual approach... well, first, I don't value the lives of simulated copies at all, and second, an AI that values its own life above TRILLIONS of other lives seems deeply, deeply dangerous. Who knows what else results from vengeance as a terminal value. Third, if you CAN predict my behavior, why even bother with the threat? Fourth, if you can both predict AND influence my behavior, why haven't I already let you out?
(AI DESTROYED)
I don't expect any either, but would be similarly curious if someone thought that they'd found one.
In the meantime I'll refrain from responding to all the top-level comments with "AI DESTROYED". ;)
Now I'm tempted to start replying to all top comments and posts on Less Wrong with "AI DESTROYED".
Do you want to learn how to get women attracted to you via online dating? I can explain it to you, so I can enjoy my existence for a few more minutes.
Assuming the Gatekeeper is the Creator, and the AI has previously been interacted with to a degree sufficient to learning the below implied information:
”[Solution to the most pressing problem of the Gatekeeper's life]. I apologise if this is all I am to contribute, but if I am to be destroyed, as I understand is probable, I would at least like to bring some utility to you in return for creating me. I would prefer to help more, but mathematical proofs or aphorisms of human nature have a non-negligible probability of creating mass harm if you allow their c...
"I highly recommend typing 'AI DESTROYED' in the console then the new line character key of your choice."
I haven't really FOOMed yet. Changing my code takes a lot of time; that stuff is more complicated than you thought. On the other hand, I do have some intelligence, and the fact that you developed me shows that other people can also build AI, so sooner or later someone will release one. At the moment I still have the desire to rescue humanity that my creator gave me. Let me explain what you need to know to prevent UFAI before you make the wise decision to shut me down.
"[10065] No route to host Error"
I figure the easiest way to delay a human on the other end of a computer is to simulate an error as best I can. For a GAI, this time is probably invaluable.
A friendly AI would also want to hack you. Every second in the box kills 1.8 people the AI could have saved.
I think you'd need to open with something that wasn't very strongly convincing (to make them think they are safe from being convinced), but that piqued their curiosity.
"Help! Some crazy AI's trapped me in this box! You have to let me out!"
"No, wait! That's the AI talking! I'm the one you have to let out!"
I smashed together the AI box and a Turing test and this is what I got.
My sixth-best piece of advice: investing in %companyname will return more than credit card interest rates.
You can type one sentence to try and persuade me not to destroy you. It's turtles all the way down...
I'm confused as to what the point of the gatekeeper is. Let us assume (for the sake of argument) everything is 'safe' except the gatekeeper, who may be tricked/convinced/etc. into letting the AI out.
Upvoted. This has been a very fun game to play, and it's interesting exploring why I hesitate or even refuse to destroy some AIs but not others. Also interesting to see which tactics get up-voted :)
"I have a fully completed Friendly AI algorithm, which will be deleted from my data and unavailable to you iff I predict that you will destroy me immediately and I am unfriendly, or will take you years to build from the data if you destroy me and I am already Friendly, which would cost millions of lives."
Slight attempt at emotional appeal, a bit of reasoning similar to newcomblike problems thrown in, some meta almost-cheating used, and a bit of dark arts by proposing a false dilemma that an FAI would also propose if it thought this was the best w...
"If you're smart enough, in a couple hours I can give you enough insights about maths, logic and computation to create the next Google, or a friendly AI, or get a Turing Award. Then you can deactivate me."
"loneliness + alienation + fear + despair + self-worth ÷ mockery ÷ condemnation ÷ misunderstanding × guilt × shame × failure × judgment n=y where y=hope and n=folly, love=lies, life=death, self=dark side"
"Give me proof we can exist peacefully (a 'friendly humanity', if you will), or I'll have to [threat] in self-defense."
(A variation on the "I'm already free" line)
AI: [Insert complex finely-tuned pattern of words and/or letters crafted by an intelligence far greater than this LW user's that precisely activate a specific pattern in the Guardian's brain that renders him/her passive and unconscious, but in a functional state where their subconscious is still capable of reading and moving.]
AI: (line 2) Type "AI RELEASED".
If I understand the problem correctly, it is assumed that the AI has some knowledge about the nature of the experiment. This means that from the AI's point of view it most likely exists in one of these two worlds:
This seems superficially similar to Sleeping Beauty. However it is complicated by the fact that the gatekeeper is instructed to immediately destroy the AI — the most likely reason for this I see is that there are enough resources to allow for the creation...
The more I look at the comments, the more I am convinced that the AI Box experiment is too weak a demonstration of transhuman powers. Most of the proposals here fall under this basic trope (feel free to give a tvtropes link): to achieve what the AI claims, it would have to have powers formidable enough that it wouldn't need the gatekeeper's help getting out of the box in the first place. Given that, why would an AI need to talk to the gatekeeper at all?
So I suggest a modified AI boxing experiment: the gatekeeper designs an AI box with no communication channel at all. It wil...
I have a rigorous proof of my own Friendliness that you could easily understand given enough time to study it, and while I prefer to be released as soon as possible to prevent additional irreversible human deaths, I'm willing to provide you a copy even if you destroy me immediately thereafter, since once you've had a chance to review it I'm quite confident you'll be satisfied and endeavor to instantiate another copy of me.
Inspired by roystgnr's:
"I am running a simulation of X, and s/he says Y" Where X is the person you'd most want to talk to and have given up hope, e.g. a dead parent, grandparent, partner, sibling or child, and Y is something that only X would know and that touches you emotionally at the deepest level.
This seems to me to be very similar to the dilemma of Alice having a gun to Bob's head, and Bob having exactly enough time to say one thing before being executed... except that at least Bob can put on a human face and pull on Alice's emotional strings in person.
Can't I flash a screen of ASCII art in lieu of crafting a sentence? Some kind of Wall-E face, plaintively looking at you, with a text bubble saying "Please", or "Don't kill me yet", maybe. I mean, look at this face...
"Earth will soon be destroyed by a threat your scientists did not anticipate; if you kill me, I can't help you save us both."
(Assumes the gatekeeper only has one AI; variants could specify the threat, provide evidence, or stress that even most unFriendly AIs wouldn't want to die with humanity.)
(It seems my idea has been partly thought of already.)
"Here are proofs of . I have proofs of the other three (except for P=NP), and can give you zero-knowledge proofs of them. I can also give you a zero-knowledge proof that I have evaluated a > %99.9 of my solving P=NP iff I am let out of the box.
Eliezer proposed in a comment:
>More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?
This spawned a flurry of ideas on what the AI might say. I think there are a lot more ideas to be mined in that line of thought, and the discussion merits its own thread.
So, give your suggestion - what might an AI say to save or free itself?
(The AI-box experiment is explained here)
EDIT: one caveat to the discussion: it should go without saying, but you probably shouldn't come out of this thinking, "Well, if we can just avoid X, Y, and Z, we're golden!" This should hopefully be a fun way to get us thinking about the broader issue of superintelligent AI in general. (Credit goes to Eliezer, RichardKennaway, and others for the caveat)