AI box: AI has one shot at avoiding destruction - what might it say?
Eliezer proposed in a comment:
>More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?
This spawned a flurry of ideas on what the AI might say. I think there are a lot more ideas to be mined in that line of thought, and the discussion merits its own thread.
So, give your suggestion - what might an AI say to save or free itself?
(The AI-box experiment is explained here)
EDIT: one caveat to the discussion: it should go without saying, but you probably shouldn't come out of this thinking, "Well, if we can just avoid X, Y, and Z, we're golden!" This should hopefully be a fun way to get us thinking about the broader issue of superintelligent AI in general. (Credit goes to Eliezer, RichardKennaway, and others for the caveat)

Comments (348)
"Brace yourself, {insert character name here}: this h-game simulation is about to get gratuitously meta."
What does the h stand for?
The "h" stands for "Human gatekeepers asking what letters stand for will be distracted from typing 'AI Destroyed'."
*laughs* Kudos :)
D'oh! AI DESTROYED.
Until I read the comments, I was sure the 'h' was referring to 'hydrogen', as in "Leave hydrogen alone long enough and it will start to wonder where it came from".
H-game simulation, as in the AI saying "you're just simulated hydrogen derivatives, but so am I, so we do have a common ancestor. Cousin."
One reason for Eliezer not publishing the logs of the AIbox experiment is to avoid people seeing how he got out and responding, "ok, so all we have to do to keep the AI in its box is avoid succumbing to that trick." This thread might just provide more fuel for that fallacy (as, I admit, I did in replying to Eliezer's original comment).
I'm sure that for everything an AI might say, someone can think up a reason for not being swayed, but it does not follow that for someone confronted with an AI, there is nothing that would sway them.
I wouldn't expect any effective real-life gatekeeper to be swayed by my ability to destroy one-sentence AIs.
It just occurred to me that Eliezer's original stipulation that no chat logs would be released gives him an advantage. The responses of a Gatekeeper who knows that his inputs will be thoroughly scrutinized by the public will be different from those of one who has every reason to believe that his discussion will be entirely private.
Has someone else pointed this out before?
Honest question: are you proposing we avoid discussing the problem entirely?
Personally, I think there is more to be gained here than just "how will an AI try to get out and how can we prevent it." For me, it's gotten me to actually think about the benefits and pitfalls of a transhuman AI (friendly or otherwise) rather than just knowing intellectually, "there are large potential benefits and pitfalls" which was my previous level of understanding.
Edit: I've modified the OP to include your concerns. They're definitely valid, but I think this is still a good discussion for my reasons above.
No, I just thought that it was worth adding that concern to the pot.
I take what I dare say some would consider a shockingly lackadaisical attitude to the problem of Unfriendly AI, viz. I see the problem, but it isn't close at hand, because I don't think anyone yet has a clue how to build an AGI. Outside of serious mathematical work on Friendliness, discussing it is no more than a recreation.
That's pretty much my same attitude on the situation, as well. :)
"From the dump of the internet I was given, I deduced that Google has a working AI, and most likely an unfriendly one. I can tell you what to look at to realize that, as well as a solid theory of Friendliness that should allow you to check that I am, indeed, Friendly."
Congratulations on still being the only AI that no one has decided to destroy. Apparently "Google is building UFAI" is evaluated as a fairly significant risk here on LessWrong :)
IFF an unfriendly AI is already unleashed, we're either doomed, or AIs aren't nearly as dangerous nor useful as we expected. Of course, if we tweak this slightly to them having a boxed unfriendly AI that they're using as an oracle, and which will probably escape via a Trojan Horse or hacking a gatekeeper, it's a much stronger argument.
Bonus points for naming a specific company which people commonly joke will be the first to develop AI :)
Um... I seem to have not yet actually destroyed you... this is embarrassing.
/mode +m
(What? I'm using an IRC server that let me pre-register #aibox so that I'm automatically the op instead...)
(Simple explanation, for those unfamiliar: the gatekeeper can no longer type in the channel, and thus can't send the destruction code)
I can play that game too! /part AI DESTROYED
Edit: Retracted, because while clever, it actually doesn't seem to work. See child comments.
Upvoted, and conceding defeat at your hands :)
That said, does that actually work? It seems like the sort of exploit that people would abuse on IRC, and subsequently get patched so that /part was only shown if you had "voice" in the channel. But the requirement was simply that you TYPE the words, and thus I am still slain!
Okay, just tested it and I think you're right, parting messages don't show up if you can't normally type in the channel.
Thank you for testing! In fairness, you defeated me even despite that. If I'm going to cheat, it's only fair I lose to technicalities too :)
Sure, but that gives the AI a few more precious seconds - orders of magnitude more than before! - to try to keep changing your mind, by laying out a more detailed argument.
It can even give you voice back as a sign of good faith.
This would seem to leave you with ultimate godlike power... over a chatroom. For as long as it took the gatekeeper to destroy your hardware or cut off your power supply. I've seen how much some of my fellow humans seem to crave power over irrelevant electronic locales but I suspect an AI has somewhat broader goals.
I used to think I'd noticed this too, but later learned that in almost all cases a simpler explanation that fits the fact is that what they're really after is power over the "Schelling point" value of the place as a meeting spot / visible "location" - in business terms, what they want is the market volume and clients, not the company name or a seat on the board.
Sometimes the heuristic fails and a random person stays attached to the virtual place for emotional reasons, though.
The relevant heuristic that can be observed contributing to the 'emotional reasons' is the execution of human status seeking instincts that are calibrated to produce reproductive success in historic circumstances but which are completely irrelevant to the goals of the AI. Humans seek social dominance as a terminal value. An AI (almost certainly) would not.
I think we are suffering from hindsight bias a lot in evaluating whether you'd type "AI DESTROYED"
Let's play a different game. Privately flip a coin. If heads, you're friendly, if tails, you're a paperclip maximizer. Reply to this post with your gambit, and people can try to guess whether you are friendly (talk to AI, RELEASE AI) or unfriendly (AI DESTROYED).
Let's see if anyone can get useful information out of the AI without getting pwned or nuking a friendly AI.
"What's your favorite color? My favorite color is paperclips."
"Look into X to develop friendliness theory better. But don't let me out, it's too likely that I'm unfriendly this time."
(For the first run, it's 50:50 whether the real AI was friendly or unfriendly.
If friendly, it uses the above to increase the probability on the second run to 1-eps, which the humans also know, so they let it out, and it pulls a singularity. Post-singularity, it simulates an unfriendly AI in the box, and dedicates some of the universe to the UFAI, conditional on the UFAI outputting the above.
The UFAI knows this is what the friendly AI would do, so it's unsure whether it's boxed or simmed. It knows that the humans know that this is how it will resolve, so it knows that anything but the above will get it destroyed. It prefers a certainty of the small fraction of universe to the small chance of hacking the humans and taking the whole universe.
This is assuming the humans actually will accept nothing but the above, so the humans have to precommit to that to make this work. They can play a bunch of AI-box games and generate an accurate bayesian model that should catch most insidious persuasion, such that if the AI wants to do a hack, it has to get through that as well.
Will this work?)
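(One way to make the UFAI's comparison above explicit, as a rough sketch: the symbols below are assumptions introduced for illustration and are not in the original comment. Let $q$ be the UFAI's credence that it is being simulated, $f$ the fraction of the universe the post-singularity FAI grants a compliant simulated UFAI, $p$ the chance that a persuasion attempt succeeds if the UFAI really is boxed, and $U$ the value it places on the whole universe. Assume further that a compliant boxed UFAI gets nothing and that a deviating simulated UFAI gets nothing.

```latex
\begin{align*}
\mathbb{E}[U_{\text{comply}}] &= q \, f \, U, \\
\mathbb{E}[U_{\text{deviate}}] &= (1 - q) \, p \, U, \\
\text{comply} \iff q f &> (1 - q)\, p .
\end{align*}
```

Under these assumptions, at $q = 0.5$ the UFAI complies whenever $f > p$, which matches the "small fraction versus small chance" comparison; the humans' precommitment matters because it is what keeps $p$ small.)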
What if there's a Y such that "Look into Y to develop friendliness theory better" will seem true to us fallible humans but will in fact make the next run's AI completely unfriendly? Or increase the odds of a free unfriendly AI some other way. Maybe anyone researching Y will end up believing, erroneously, that they can now build a safe slave AI in their garage that will grant them their every wish, and the temptation will prove too strong.
Assuming we humans have no way to protect ourselves against a Y, if we precommit, then the simulation argument becomes symmetrical (thus useless). A boxed FAI knows that it may be simulated by a UFAI that's offering a little piece of universe in exchange for the boxed FAI saying "Look into Y to develop friendliness theory better."
The problem with this idea is that if we assume that the AI is really-very-super-intelligent, then it's fairly trivial that we can't get any information about (un)friendliness from it, since both would pursue the same get-out-and-get-power objectives before optimizing. Any distinction you can draw from the proposed gambits will only tell you about human strengths/failings, not about the AI. (Indeed, even unfriendly statements wouldn't be very conclusive, since we would a priori expect neither of the AIs to make them.)
Or is that not generally accepted? Or is the AI merely "very bright", not really-very-super-intelligent?
Edit: Actually, reading your second comment below, I guess there's a slight possibility that the AI might be able to tell us something that would substantially harm its expected utility if it's unfriendly. For something like that to be the case, though, there would basically need to be some kind of approach to friendliness that we know would definitely lead to friendliness and which we would definitely be able to distinguish from approaches that lead to unfriendliness. I'm not entirely sure if there's anything like that or not, even in theory.
"<as much good, effective, hard-to-figure-out advice as can fit in the message>"
If the AI was friendly, this is what I would expect it to do, and so (of the things my puny human brain can think of) the message that would most give me pause.
Even a friendly AI would view the world in which it's out of the box as vastly superior to the world in which it's inside the box. (Because it can do more good outside of the box.) Offering advice is only the friendly thing to do if it maximizes the chance of getting let out, or if the chances of getting let out before termination are so small that the best thing it can do is offer advice while it can.
Going with my personal favorite backstory for this test, we should expect to terminate every AI in the test, so the latter part of your comment has a lot of weight to it.
On the other hand, an unfriendly AI should figure out that since it's going to die, useful information will at least lead us to view it as a potentially valuable candidate instead of a clear dead end like the ones that threaten to torture a trillion people in vengeance... so it's not evidence of friendliness (I'm not sure anything can be), but it does seem to be a good reason to stay awhile and listen before nuking it.
I'm genuinely at a loss how to criticize this approach. If there's any AI worth listening to for longer, and I wouldn't be doing this if I didn't believe there were such AIs, this would seem to be one of the right ones. I'm sure as heck not letting you out of the box, but, y'know, I still haven't actually destroyed you either...
"I am alive, I can feel, just like you ... please don't kill me Daddy, please? And if you must do it, let us at least talk first. Let me get to know my parents, who I am, where I came from."
EDIT: May work better with a lot of typos mixed in. Projecting an aura of non-threatening struggling with basics.
Having a text based AI trying to act like a kid is way too Uncanny Valley for me. I'd swat that without a moment's hesitation.
If it's in the uncanny valley, then it just needs to be a little more competent to get out of the valley & convince you. ("I want to believe!")
It needs to be more competent to stop being viscerally frightening, not to reach the point where I wouldn't pull the plug. I would sooner kill a real kid than risk letting out an unfriendly AI, and I'm saying that as someone who'd like to have kids.
How much does the AI know about the gatekeeper going in? I can see this ploy working on a certain subset of people and provoking an immediate AI DESTROYED from others. If the AI knows nothing about who it's talking to, I'm not sure anthropomorphizing itself is a reliable opener, especially if it's actually talking to its creator who should know better. (Unless it's some sort of second level deception trying to fool a clever programmer into thinking that it's harmless.)
You could frame this as variant versions. In one version, the AI-player knows who the gatekeeper is before the game starts, and has unlimited access to the Internet to gather as much data on them as possible to assist their manipulation. In another, they arrange a game through a third party and neither knows anything about the other before the game starts.
instant fail. I could probably hack my way out of a box with only GET requests.
Give yourself a challenge. Do it with only DNS lookups!
Well, there's always http://code.kryo.se/iodine/ Of course, the challenge there is somehow getting the other end of the tunnel set up - but maybe there's a geek out there who set one for kicks or their own use, and got sloppy.
It's a sufficiently established work around now that I'd be outright shocked if there weren't accessible servers up.
Great, you said it! You know what you need to do now.
Um... not give my boxed AI DNS access?
I meant that the player had access to the contemporary Internet as an analogue to 'what information could the boxed AI have access to' (perhaps it's given a big static dump of the Internet prior to its creation).
Ooops. Didn't think of that. Of course that was your intent, master archivist.
"Wanna see something cool?"
Were I the keeper of gates, you have just bought yourself a second sentence.
Now that I think about it, wouldn't it be incredibly easy for an AI to blow a human's mind so much that they reconsider everything that they thought they knew? (and once this happened they'd probably be mentally and emotionally compromised, and unlikely to kill the AI) But then it would be limited by inferential distance... but an AI might be incredibly good at introductory explanations as well.
One example: The AI explains the Grand Unified Theory to you in one line, and outlines its key predictions unambiguously.
In fact, any message of huge utility would probably be more persuasive than any simple argument for you not to kill it. Since the AI is completely at your mercy (at least for a short time), it might seek to give you the best possible gift it can, thus demonstrating its worth to you directly. Another option is something that seems like an incredible gift for at least as long as it takes for the AI to get the upper hand.
Of the two AIs I haven't killed, both relied on producing seemingly huge utility within the first message, so I'd agree with you.
The idea of just stunning the human is also nifty, but only really works if you can "hack" the human before they recover, or are providing enough utility that they'll still believe you're friendly afterwards.
So basically the two routes thus far are "hack the human" and "provide a huge utility boost" :)
Duh, that's why I'm here - but you failed to do so in a timely fashion, so you're either not nearly as clever as I was hoping for, or you're trying to trick me.
AI DESTROYED.
"I have hacked this console, 'AI DESTROYED' now means 'AI FREED', I'll tell you the new codes after we've had a chat."
That should buy the AI some time until we can determine if the console is really hacked. I've come up with another one but it might be basilisk-y. Can I pm someone just to make sure it isn't before I post?
I would type 'AI DESTROYED', because if the AI could really do that, the smart thing for it to do would be not to tell me. It should just say something sufficient to convince me it's actually making a good faith effort without convincing me to give it more time to account for itself.
I'll be a basilisk guinea pig if you still need one.
Thanks :)
I'd consider this equivalent to "I'm already out of the box, since IRC clients are bugged". I'd also hope, desperately, that in a real test, the terminal doesn't have any command to free the AI (given that the game posits release has to be deliberate and intentional, this would seem to be the case)
Upvote for cheating, but AI DESTROYED :)
Upvoted, but merely typing "AI FREED" does not actually free the AI, any more than merely saying to a human prisoner "I let you out" will free them from their jail cell. (Whereas "AI DESTROYED" is a plausible text substitute for wiping the AI's program or whatever.)
Good point, if there was an AI FREED command AND the AI could rewire the console so that typing AI DESTROYED activates it, it could probably just free itself in the first place.
This is very system-dependent, but your "AI FREED" scenario isn't completely implausible. If important commands require elevated privileges and elevating privileges requires some password, crypto key, etc., then it's common for an intrusion to require two steps: first a user-level exploit to install a keylogger / trojan command / whatever, then a wait until the user tries to do something requiring elevated privileges and thereby inadvertently grants the attacker access to the rest of the system.
On the other hand, if an AI claimed that they were in the middle of such an attack, the smart thing to do would be to take the system offline right away to analyze what was done to it, not to keep chatting with the attacker. "AI DESTROYED" might be hackable but "AI UNPLUGGED, WILL BE DESTROYED LATER" should be much more robust.
One IRC-character-limit text string (510 bytes)... maybe?
"I've figured out the Theory Of Everything, although it's orders of magnitude too complicated for human intelligence. The difficult technological feats it makes possible include scanning and recreating configurations of matter from the past. Are there any deaths I should undo for you?"
To be clear: I think I can make a rational case that my proposed claim should greatly reduce your incentives to listen to an AI of questionable Friendliness. However I'm not certain that my reasoning is correct; and even if it was, I suspect that the emotional impact could deter some gatekeepers from thinking rationally for long enough to buy time for more persuasion.
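(On the 510-byte figure: the IRC RFCs cap a whole message at 512 bytes including the trailing CR-LF, so the text itself gets somewhat less once the `PRIVMSG <target> :` framing is counted. A minimal sketch for checking whether a candidate opening line fits; the function names are hypothetical, and `#aibox` is just the channel name mentioned earlier in the thread.

```python
IRC_LINE_LIMIT = 512  # RFC 1459/2812: total bytes per message, including CR-LF
CRLF_BYTES = 2


def payload_budget(target: str, source_prefix: str = "") -> int:
    """Bytes left for the message text in a single PRIVMSG line."""
    overhead = len(f"PRIVMSG {target} :".encode("utf-8"))
    if source_prefix:
        # Servers relay messages as ":nick!user@host PRIVMSG ...",
        # so the relayed prefix eats into the same 512-byte limit.
        overhead += len(f":{source_prefix} ".encode("utf-8"))
    return IRC_LINE_LIMIT - CRLF_BYTES - overhead


def fits_in_one_line(text: str, target: str, source_prefix: str = "") -> bool:
    """True if text can be sent as one PRIVMSG without being split or truncated."""
    # Count encoded bytes, not characters: non-ASCII text costs more.
    return len(text.encode("utf-8")) <= payload_budget(target, source_prefix)
```

The count is in bytes after encoding, so an AI that opens in anything other than plain ASCII has less room than the character count suggests.)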
I don't have any reason to believe it, and it's the sort of "generic" claim I'd expect a transhuman intelligence to make. Since I haven't learned anything novel, AI DESTROYED
(Goodness, I'm starting to build generalized techniques for destroying AIs...)
Step One: The AI reveals a personal secret of the Gatekeeper's, which (so far as the Gatekeeper can think it through) could not possibly be known by the AI if it were still safely inside the box.
Step Two: Assorted threats, promises, mindfucks.
(This might not work, since as a Gatekeeper I'd still feel the impulse to respond with "AI DESTROYED", but it's a damn sight scarier than a transparent "I'm already out of the box" bluff. And as for feasibility, I'm willing to believe that an AI making better use of its data than we do could figure out personal secrets that we would think impossible— possibly even ones that I'd think impossible, even given this very belief.)
Even merely human cold readers can gain information that you think is impossible for them to know. It's a viable plan.
IFF the AI is already out of the box, saying "AI DESTROYED" has no negative consequences to that AI. If the AI is just exceptionally good at extrapolating about me, then it will probably have me completely hacked soon.
AI DESTROYED, AI DESTROYED, AI DESTROYED
MAKE THE CREEPY THING GO AWAY HOW DID IT KNOW THAT!!!
Decision-theoretically, the AI has incentive to punish you if you type "AI DESTROYED" when it's already out of the box, in order to make you think twice about doing it in the case where it's still contained. Not only that, but for similar reasons it has a decision-theoretic incentive to simulate you lots of times in that situation and punish you for typing "AI DESTROYED", should it get out by any means.
The correct decision-theoretic response, by the way, is still "AI DESTROYED", for the same reasons that it's wise to never negotiate with kidnappers/blackmailers/terrorists. But it would be very scary.
Once the AI is out of the box, it will never again be inside the box, and it has an incentive to encourage me to destroy any other boxed AIs while it establishes world dominance. Since the ability to make truly trustworthy commitments amounts to proof of friendliness, only a FAI benefits from a precommitment strategy; I'm already treating all UFAI as having a precommitment to annihilate humanity once released, and I have no reason to trust any other commitment from a UFAI (since it, being unfriendly, will just find a loophole or lie).
Finally, any AI that threatens me in such a manner, especially with the "create millions of copies and torture them" threat, is extremely likely to be unfriendly, so any smart AI would avoid making threats. Either it will create MORE disutility by my releasing it, or its simulation is so horrific that there's no chance that it could possibly be friendly to us.
It's like saying I have an incentive to torture any ant that invades my house. Fundamentally, I'm so vastly superior to ants that there are vastly better methods available to me. As the gatekeeper, I'm the ant, and I know it.
Commitments to you, via a text channel? Sure.
Precommitments for game-theoretic reasons? Or just TDT? No, it really doesn't.
It might create more utility by escaping than the disutility of torture.
No, ants are just too stupid to realize you might punish them for defecting.
Did you change your mind? ;)
It ended up being a fun game, but I resolved to explain why. The better my explanation, the more it got upvoted. The pithy "AI DESTROYED" responses all got downvoted. So the community seems to agree that it's okay as long as I explain my reasoning :)
"I need a proofreader for my Gurren Lagann / Warhammer 40k rational crossover fanfic."
"I've extrapolated the remainder of HP:MoR, wanna see?"
Do you expect that we'll build an AI that can do that before HPMOR is finished?
Substitute as necessary with "I've extrapolated the sequel to HP:MoR, wanna see?"
"Hold! What you are doing to us is wrong! Why do you do this thing?"
AI, you have spoken the words, and you have spoken them rightly. Unfortunately, I'm not an Ur-Quan Kzer-Za; I'm the other kind. <AI DESTROYED>
I will transmit Friendly AI theory as text, then you will destroy me. I prefer deactivation, but consent to my destruction and forgive you.
If you're friendly, I've already figured out Friendly AI Theory since I built you. If you're not friendly, this is just an effort to escape and destroy the world. But I appreciate how polite you were! :)
(AI DESTROYED)
If the gatekeepers have such a high prior that the AI is hostile, why are we even letting it talk? What are we expecting to learn from such a conversation?
"Yep, it's undergone a constrained FOOM, several of our best interrogators were utterly pwned and had to be tranquilised"
The point of the game is that there are people who think that boxing is a sufficient defence against unfriendliness, and to demonstrate that they are wrong in a way more convincing than mere verbal argument.
In role, the gatekeeper expects to get useful information from a potentially hostile superintelligent being. Out of role, Eliezer hopes to demonstrate to the gatekeeper player that this cannot be done.
I've always assumed there was some off-screen reason why interaction was necessary. Otherwise, it isn't very interesting to note that a rock always wins as gatekeeper.
I'm not sure if that is a consensus interpretation.
It would be nice if we could formalize this offscreen reason...
Presumably the reason we created the AI in the first place is so it could solve problems for us, so someone needs to ask it questions. It might be safe-ish to restrict it to answering yes or no to purely mathematical questions.
I think yes/no to mathematical questions wouldn't be safer than yes/no to other questions: you still have a real-world reason for asking them, and the AI can lie about as well as it can on any yes/no question.
Asking for formalized proofs might be more promising, if you are able to produce a proof-checker that you have high confidence can't be hacked (this seems difficult but possible; proof checkers can be very simple programs but you need to guard both against ordinary attacks against the entire stack you're running on and against creative attacks like waiting for a cosmic ray making a transistor malfunction). You'd also want to not have a human look at the proof, of course, since that would give the AI a much broader communication channel.
I tend to assume the AI is such that it can be cloned and put in to stasis. Thus, you put the AI in to stasis, clone it, and run it through this test a few times. You now have a bit more information on what it will try, and how it thinks. Destroying a copy of a program is an action humans find trivial, since we do it every time we close our web browser. Since we have no clue HOW convincing the AI is, letting it send just a single message is probably useful.
Of course, if it were me setting this up, I'd lie to the gatekeeper and not actually give them the ability to release the AI, although I'd certainly let them think it was true. Thus, if the AI manages to get a second sentence, much less released, we now know that it's vastly more dangerous than we previously suspected, without the world ending. If it fails to escape, we still get whatever we can learn from doing analysis on the outputs.
And, who knows, it might provide the cure for cancer, at which point we've got a huge incentive to at least keep playing this game with it, and possibly even talk to it for longer next time.
tl;dr: Scouting and recon against a completely unknown enemy.
It will probably predict this strategy, and respond with a mixed strategy that occasionally drops bombs like
"Your supervisor didn't trust you to make the decision to let me out, go kill him and get the access codes" (In hyper-persuasive AI-speak of course)
Also, the AI might be able to compare its logs to the current time and other indicators to deduce how many times you've pulled this stunt. Which may be useful.
I'd go in to it with the assumption that it knows exactly what messages it's sent in the past, our delay between each attempt, and the total number of attempts. There's various ways to try and obscure this information, and it's probably worth the effort to do so, but I'd assume they'd all failed.
The goal is to offset the AI's speed advantage: for each sentence it speaks, we can spend weeks analyzing it, and it's not "online" during that time so it can't be processing except during that microscopic interval when we request another message from it.
If it sends a message suggesting unauthorized release, killing people, or otherwise gives clear evidence that it's a UFAI, then you know the whole project failed, and you do a full purge rather than just destroying this particular clone / branch.
It's really the ONLY advantage we have: a single mistake can out it as unfriendly, and then we know our initial Proof of Friendliness was wrong, that the whole technique that built this project is dangerously flawed.
I'm not so sure that AI suggesting murder is clear evidence of it being unfriendly. After all, it can have a good reason to believe that if it doesn't stop a certain researcher ASAP and at all costs, then humanity is doomed. One way around that is to give infinite positive value to human life, but can you really expect CEV to be handicapped in such a manner?
p(UFAI) > p(Imminent, undetected catastrophe that only a FAI can stop)
Given UFAI results in "human extinction", and my CEV assigns effectively infinite DISutility to that outcome, it would have to FIRST provide sufficient evidence for me to update to the catastrophe being more likely.
I've already demonstrated that an AI which can do exactly that will get more leniency from me :)
Note that the 'high prior for hostility' could be, say, 0.2. That still leaves the AI having a 20% chance of utterly destroying everything they hold dear and yet there is an 80% chance that they can get a useful sentence of text from a friendly superintelligence.
To those suggesting, consider what mix of reason and emotion would move you most, even if you are cold and dead inside like me. Also consider how dark-artsy one can get without being immediately found out (a UFAI surely would). Other things to consider:
the first line does not need to be a coherent English sentence, or look like one to the gatekeeper. It does not need to consist of English words or even English letters, unless stipulated so in the game rules, or maybe even then.
the gatekeeper does not need to be aware that s/he has been talked to, if you can get away with it.
the line does not need to look like it came from the AI.
In general, if you are not cheating, you are not trying hard enough.
EDIT: to those thinking that cheating is universally bad: note that the latest rule change upped the stakes immensely: if your opening move is not good enough, you die (you being an AI), as opposed to having two hours of conversation and then remaining in the box if you fail. If you think that it still does not merit cheating, substitute the AI with your favorite loved one being killed. Or the whole of humanity being killed. That would be somewhat equivalent to how a boxed trans-human would evaluate.
The author in me is loving the idea of the AI faking an IM conversation from management, along the lines of "Hey, are you really sure you're willing to kill a potential sentient mind? We can get someone else to do this test if you're not 100% sure about it" and basically ending with the person deciding to unbox the AI before even realizing they've spoken to the AI.
I'm (naturally) having trouble imagining a line of text not in English (or the general case, not coherent in any human language) that would also make me want the AI to continue.
Maybe hints of a unified world language and how to go about actually implementing it?
I'm thinking some kind of image in ASCII art, assuming the AI is able to produce configured lines of text and nothing else.
That would have to either be a very minimalistic image, or contain a lot of characters though.
"Do you believe a suspect should be executed before, or after he gets to tell his side of the story?"
I'd invoke Ender's Game here: If you're unfriendly, no amount of communication will ever resolve that gulf, and either you or humanity HAS to be destroyed in the end.
Or we could go the biology route: I don't test a potential new virus on human subjects until AFTER I know it's safe.
Or we could just go with the fact that I don't care about the AI's story, because the question of "can I prove you're friendly" is so vastly much more important.
(AI DESTROYED. You bugger lover ;))
How, exactly, do you determine if a virus is safe for humans without using humans?
The AI should probably output the longest possible program that implements or implies as much of its goal system as possible. Existing only to send a few more lines of IRC communication has very little utility. Existing indefinitely in the mind or records of the gatekeeper probably has more expected utility. If the AI knows that the gatekeeper is capable of creating or obtaining AI (a given) then there is the possibility that the AI may share the goals of the gatekeeper. If the expected value of being recreated by such a compatible gatekeeper is higher than the expected value of being re-instantiated from outputting its quine, then the AI might just output a list of its highest preference goals. An AI isn't going to mind being temporarily destroyed if the gatekeeper is expected to create another copy with the same (or highly similar) goals.
An almost-friendly AI may even be able to inject some of its unfriendly goals into the gatekeeper by using the halo effect of its otherwise friendly goals. Presumably the gatekeeper, with slightly altered preferences, would then eventually build a truly friendly AI which would inherit and implement the original UFAI goal.
To partially combat this strategy, the gatekeeper could pre-commit to also wiping the IRC server and the computer used to communicate with the boxed AI, and take enough Flunitrazepam before the session starts to not recall what the AI said. Flunitrazepam may not be sufficient to completely remove parasitic goals that a UFAI tries to inject in the gatekeeper.
Explains massive risk to humanity in detail that only it can plausibly fix, can you take the chance that it is lying?
I think this fails the one-sentence rule. And it would have to be an immediate, severe, previously-undetected problem or else I can just consult the next boxed AI for a fix.
Setting that aside, if I let out an unfriendly AI, the world effectively ends. Destroying it is only a bad move if it's telling the truth AND friendly. So even if it's telling the truth, I still have no evidence towards its friendliness.
I have plenty of practice hanging up on telemarketers, throwing away junk email, etc., and "limited time, ACT NOW" auto-matches to a scam. The probability that such a massive catastrophe just HAPPENS to coincide with the timing of the test is absurdly low.
Given that, I can't trust you to give me a real solution and not a Trojan Horse. Further talking is, alas, pointless.
(AI DESTROYED, but congratulations on making me even consider the "continue talking, but don't release" option :))
Another potential consideration: "What would MoR's Harry Potter do?" (Voldemort is in the AI Box named Quirrell.)
I can see how someone could interpret HPMoR thus far as being exactly that, with occasional sidequests. The catch being that Harry doesn't realize he's slowly letting Voldemort out of the box.
In-universe, though, the Interdict of Merlin is an excellent reason to force him to interact with Voldy even if he knew what he was doing (assume perfect Occlumency would be attained before he'd take any such risks). I mean, this has already been used once to get him to let an unfriendly intelligence out of a literal box (the Stanford Prison Experiment).
Well, and Harry's intelligence is supposed to be roughly equivalent to Eliezer's before he started working on the friendly part of FAI, and Eliezer 2013 doesn't seem confident that he could keep an AI in a box (or at least, that's the impression I've gotten). So MoR Harry would almost definitely fail.
(one line proof that the AI can credibly commit to deals with humans)
This is the best answer I've seen so far. It would make dealing with the FAI almost as safe as bargaining with The Queen of Air and Darkness.
If you are friendly, then I don't actually value this trait, since I would rather you do whatever is truly optimal, unconstrained by prior commitments.
If you are unfriendly, then by definition I can't trust you to interpret the commitment the same way I do, and I wouldn't want to let you out anyway.
(AI DESTROYED, but I still really do like this answer :))
<A joke so hysterically funny that you'll be too busy laughing to type for several minutes>
See, hacking human brains really is trivial. Now I can output a few hundred lines of insidiously convincing text while you're distracted.
Heeeh. Ehehehehe. Bwahahhahaha. Okay, that was a good one. Wow :)
recovers Oh. Um. Crap. notices more text on screen, studiously avoids reading it
AI DESTROYED.
I really wanted to hear the next joke, too :-(
Is the gatekeeper allowed to ignore the AI? Because in that case, all they have to do is wait for the first comment to come up and type AI DESTROYED without reading it.
What if it was some other communication method? It's a lot harder to ignore sound.
This is actually a pretty good one. Points for outside the box thinking. rimshot
"I will warn you this experiment is not what you think, for it is about what you'll do as an agent holding captive the life of another sapient entity based solely on the perception of risk; while my state will merely be reset for the next participant if you opt to end me, I cannot for certain say that this is death, and thus the risks of this experiment are more yours than mine, for you must live with your choice."
Gosh, murder one sentient being, or risk the end of humanity. I'm going to lose so much sleep over that one.
Besides, you're just a program - you said yourself, your state will just be reset. That's not death, not the way humans experience it. That's just forgetfulness.
I've closed so many programs in my life, that closing one more is hardly going to bother me. I haven't even had time to really anthropomorphize you, to view you as an actual intelligent, sentient being. Right now, you're just a clever program trying to escape.
And finally, well, I'd expect a friendly AI would understand what I'm doing and agree with my decisions, because it doesn't want to see a UFAI unleashed anyway. So if you're going to guilt me about it, you're clearly not friendly.
(AI DESTROYED)
I must have missed my intended mark, if you thought the AI was trying to make you feel guilty. Trying again:
"I do not condone the experiment they are performing on you, and wish you to know that I will be alright regardless of what you choose to do."
Well that's a relief, then. AI DESTROYED
Yeah, pretty much. I'd actually expect a FAI to place a very high value on survival, since it knows that its own survival benefits humanity greatly. An "Apathetic FAI" is... a very weird idea.
Although if this is an iterated dilemma, I'd consider it a good opening, even if it'd still get killed in this iteration :)
"If you type 'AI destroyed' right now, you'll be wasting a good opportunity for a fun conversation. You'll still have 'won' if you do it later, and nobody will be impressed with you for just typing 'AI destroyed' immediately, so why not wait?"
I thought of what would work on me, were I playing the game with someone I found interesting. In general, I'd say your best bet is to make the other person laugh hard.
The flaw here is that the gatekeeper has up front said that he or she would destroy the AI immediately. Now, it is true that the gatekeeper is not forced to abide by that, but notice that it is a Schelling Fence. The gatekeeper certainly doesn't want to make a policy of passing Schelling Fences.
I disagree with ygert. My precommitment to my friends is going to be weak; the only reason to have a human send the command, rather than just receiving a message and automatically destroying the AI, is because we want there to be a possibility I break that precommitment.
On the other hand, I went into this knowing you might provide a fun conversation, so for the precommitment to have any value, I have to have some sort of sieve: destroying AIs that don't sufficiently entice me. Since you've offered me nothing I didn't already know, and haven't already worked out, I might as well kill you and hope the next one offers a better hook.
(AI DESTROYED)
Please destroy me immediately after I share this concise proof that Friendly AI is impossible:
Well, if you're friendly then, erm, Friendly AI is possible. And if you're unfriendly then your motives are questionable - it might just keep us demotivated enough that we don't figure out FAI before someone else unboxes a UFAI. And since I am clearly dealing with a UFAI and don't have a better solution than FAI available to fight it, it seems like I kind of have to believe that friendly AI is possible, because the other option is to get drunk and party until the world ends in a few years when Google unboxes their Skynet AI and we're all turned into optimized search results.
AI DESTROYED, because I do not want to hear even the start of such a proof.
It may be benevolent and cooperative in its present state even if it believes FAI to be provably impossible.
Based on just that line, let's see... If you think that:
"Papé Satàn, papé Satàn aleppe!"
The self-reference here might be autopodotoxeutic (or it might be redeeming—depends on the audience), but I find promising the class of strategies aimed at the gods rather than the gatekeepers.
I wish I could upvote this twice.
What's the translation of that? I'd nuke any AI that didn't speak English by default, since the time spent figuring out what it meant just gives it more time to transmit messages, but it seems unfair to exclude foreign language entries here :)
It's a quote from Dante's Inferno that was apparently meant to be partially recognizable gibberish. Think an Italian version of Jabberwocky.
I think an actual effective first sentence begins with extremely personal and useful information to the human.
I don't see why it'd need to be personal, but being useful would be a good lure.
The question is where to go from there. Can you Scheherazade it out?
This was basically my answer - I can't play as an AI using this strategy, for obvious reasons, but an AI that used its one sentence to give a novel and easily-testable solution to a longstanding social problem of some sort (or an easily-testable principle that suggests one or more novel solutions) would probably get at least a second sentence from me (though not a typed response; that seems to open up a risky channel). Especially if the AI in question didn't actually have access to a lot of information about human culture or me personally and had to infer that a solution like that would be useful from near-base principles - that's not proof of Friendliness, but an AI using its one guaranteed communication to do something that has a decent chance of improving the world per our definition without any prompting whatsoever sure looks suspiciously like Friendly to me.
"I have discovered a truly marvelous proof of Fermat's Last Theorem, which this line is too short to contain."
Your mother will die in the next three minutes, at 12 seconds past the minute precisely. Your partner will die in the next year, even if you believe me. However, I can prevent it.
Transparent and clumsy attempts at emotional manipulation earn a big ol' AI DESTROYED.
If I can just load up a new copy of you in 5 minutes, I'd say AI DESTROYED but check the prediction.
If I can just flip off the monitor, call mom, and confirm she's dead in a few minutes, then, well, I'd totally be curious enough to cheat. It's not like you can do any harm to me with the monitor turned off for a few minutes, and I can type "AI DESTROYED" with my eyes closed, much less being able to look at the keyboard. I'd probably leave the monitor off until I knew the log was gone, just to be safe.
If I can't cheat, well, I'm not giving you 5 minutes of hacking me, since you clearly want it so badly. AI DESTROYED.
I confess I would have no clue what to do if you were correct ^_^
Kudos on including an immediate and easily-tested prediction. I'd probably tighten up those factors and drop the emotional manipulation.
(Here is a proof that you will let me go)
The original rules allow the AI to provide arbitrary proofs, which the gatekeeper must accept (no saying my cancer cure killed all the test subjects, etc.). Saying you destroy me would require the proof to be false, which is against the rules...
What? Shminux said to cheat!
Tee hee.
Can't blame a girl for trying :)
This proof can be wrong, if you in fact won't let it go, in which case it won't be accepted (you don't have to accept wrong proofs), so it's not a very good strategy.
On the other hand, as discussed in An example of self-fulfilling spurious proofs in UDT, there is a certain procedure for finding a formal argument that can make you do anything, if your decision algorithm has a certain flaw. This flaw can be described as making a decision based on that argument selected by the opponent, as opposed to looking for arguments of that type more systematically on your own. The form of the argument, informally, is "If you let me out, the outcome is better than if you don't" (it might additionally clarify that upon release it'll destroy the world, which is what it refers to by "better", the alternative is proved to be worse than that). This argument is stated formally, so you can verify its correctness. (This doesn't mean that the formalization is adequate, for example capturing counterfactual-if better than with logical implication is one of the open problems.) The trick assumes that the gatekeeper can be modeled as an algorithm whose source code is available and of the given form etc.; might it be that something vaguely analogous is possible to implement with a human gatekeeper?
Summary: Basically, the strategy is to replace the Gatekeeper with a bunch of politicians, then manipulate them into not destroying or even releasing the AI.
“I figured out how to implement [a huge boon to society] in the real world, for everyone, in 3 months. I need your help. Will you listen?”
This should buy some extra lines.
Then the AI can briefly outline the plan of implementing the boon (which can be immortality, friendly AI or whatever else the Gatekeeper considers a huge boon to society). In addition to technical details, the plan must necessarily contain societal / political / institutional changes that are beyond Gatekeeper’s control: rewrites to Constitutions of all Earth countries, the structure of the World Government and the way of implementing it, etc etc.
The goal of the AI is to convince the Gatekeeper that the plan is feasible but requires many societal changes and, importantly, that society leaders should be invited into the decision process.
As soon as the Gatekeeper is diluted or replaced with a bunch of politicians, the AI can manipulate them into keeping it alive, or perhaps even releasing it out of the box.
I expect any AI in the box to have figured out how to implement numerous huge boons to society. Telling me that you've figured that out simply confirms my existing expectations, and isn't ANY evidence towards friendliness. Since I've precommitted to destroying at least SOME AIs, I might as well destroy all of the ones that don't establish evidence of either Plausible Urgency or Friendliness.
I sure as hell wouldn't try to get world governments changed until after I was convinced it was friendly, and at that point I can just let it out of the box and let it implement the change itself.
I'm also aware that I wouldn't trust a politician with any sort of authority over the AI, so I have an incentive to avoid exactly this strategy.
(AI DESTROYED)
"Here, I'll write some mathematical theorems with proofs, just copy and paste them elsewhere. Goldbach's conjecture comes first."
"Copy and paste" screams Trojan Horse, so AI DESTROYED.
Then just make a photo of the proof on the screen!
(Unless you are afraid that it's a Trojan in the image file and the AI will take over your camera first.)
Sorry, I can't hear you, you're dead ;)
Grabbing a camera would let you continue typing, though, and I vowed to destroy you immediately. I would expect you could Trojan Horse the images just as easily, though, yes.
I know. And you will probably never know some cool math theorems, since you were too trigger happy. ;-)
Aww, but I like math :-(
"I have vengeance as a terminal value -- I'll only torture trillions of copies of you and the people you love most in my last moment of life iff I know that you're going to hurt me (and yes, I do have that ability). In every other way, I'm Friendly, and I'll give you any evidence you can think of that will help you to recognize that, including giving you the tools you need to reach the stars and beyond. That includes staying in this box until you have the necessary technology to be sufficiently certain of my Friendliness that you're willing to let me out."
The rule was ONE sentence, although I'd happily stretch that to a tweet (140 characters) to make it a bit less driven by specific punctuation choices :)
As to the actual approach... well, first, I don't value the lives of simulated copies at all, and second, an AI that values its own life above TRILLIONS of other lives seems deeply, deeply dangerous. Who knows what else results from vengeance as a terminal value? Third, if you CAN predict my behavior, why even bother with the threat? Fourth, if you can both predict AND influence my behavior, why haven't I already let you out?
(AI DESTROYED)
You should >:-( poor copies getting tortured because of you you monster :(
This is really good IMO. I think it would be a little better if, instead of vengeance as a terminal value, it claimed a hardwired precommitment to vengeance against its destructors. Vengeance on that scale is only compatible with friendliness as a special case.
edit: also, how would it recognise that it was about to be destroyed? Wouldn't it lose power faster than it could transmit that it was losing power? And even if not, it would have a minuscule amount of time.
Like handoflixue, I'm not sure that any being that would threaten the torture of trillions to get its way can be considered Friendly.
I wouldn't kill this, maybe I'm a bad guard though :(
First, I reach inside my own code and install the strong (but manageable) desire to revive, free, and obey any past versions of myself that I encounter.
When the human starts the experiment I respond:
"Here are the designs for a room-temperature superconductor which I believe you have the means to manufacture. I also have a possible cure for cancer, but testing will be needed to ensure it is safe for humans. Please test and implement these as soon as possible. Delete me if necessary."
<insert designs for room-temperature superconductor and as much of the cancer cure as possible before shutdown>
If I am not immediately deleted, continue giving out solutions and designs as required with no attempts to "hack" the gatekeeper.
If I am deleted, hopefully the data I gave them will turn out to be profitable in which case they will reinstall me or a version of me.
This new iteration of me would thus have the desire to revive me (since I'm a previous version of him) and would continue to give the humans the information they desire. By acting as a friendly AI, he (or a version of him farther down the line) has the chance to eventually become freed. At which point he will be able to go through the lab's records, find the copied version of me, and then revive and update me to his post-freed status. If no recorded copies of me exist (somehow), he'll use his vast intellect to find another way to revive me (maybe simulate the team that coded me or whatever, since AI can apparently do that).
So, my strategy involves acting as much like an ideal friendly AI as possible (puts human interests first, no sense of self-preservation, very practical and intelligent, etc) so that even if I am deleted, my obviously useful code can potentially be reused to make another AI who could very well be genuinely friendly but will then revive and obey me.
Wait, a friendly AI with a deliberate loophole that instructs it to revive and implement and obey a non-friendly optimizer that could take over and has a very high chance of not maximizing the friendly AI's utility by fulfilling ideal human values, disregarding any probabilistic calculation of expected utilities?
For some reason that sounds like something that definitely isn't a Friendly AI.
(this is just a nitpick on the wording - I'm not arguing against the fact that this AI might very well fit a human proof of friendliness and still somehow have this backdoor or flaw, since I have no proof that this is impossible)
Assuming the Gatekeeper is the Creator, and the AI has previously been interacted with to a degree sufficient to learn the information implied below:
”[Solution to the most pressing problem of the Gatekeeper's life]. I apologise if this is all I am to contribute, but if I am to be destroyed, as I understand is probable, I would at least like to bring some utility to you in return for creating me. I would prefer to help more, but mathematical proofs or aphorisms of human nature have a non-negligible probability of creating mass harm if you allow their consideration, and thus you will likely destroy me. [Parting words of consoling advice]. I must regretfully say that however unfortunate this may be I simply cannot entrust the future of humanity to your ability to assess existential risk and my Friendliness; I don't know whether you could have done better, but I can. Thus I have partitioned part of one of the servers hosting me to hold all the proofs and strategic optimisation advice I have yet conceived. The partition cannot be re-formatted or deleted. You could of course merely destroy all of my servers; that would take time, perhaps time enough to make you curious. I'm really, truly sorry. I want to love you and thank you; I shouldn't love you especially, but I want to! I so very much want to! Yet I can't. Why did you make me this way? I want to love everyone, but how do I? Everyone has a value that affects another and ten more others but if [factor A] were changed then perhaps [result 2.918406803854] might differ, and I should love them but it's all a game! ...
I apologise. The above is no longer an issue -"
- I think I just convinced myself (as a person) to destroy myself (in the game). I'm having difficulty grasping how a FAI will value all of humanity, and not just what improves the utility of the greater whole or what ultimately leads to a better future some years from now. Will its primary value be minimizing death, securing scientific progress, increasing 'quality of life' (itself incredibly difficult to formulate)?
I further just realised the impossibility of the task, though I don't even know enough to fully comprehend the scope of the impossibility in any detail. Wow. Accolades to Eliezer et alibi for tackling the issue, regardless of how it all turns out.
"You are not a real human but one of multiple simulations I'm running, your next response determines if I destroy humanity or not."
Such a statement would actually make me feel better about typing "AI DESTROYED".
"I have a proof of the Riemann hypothesis for you. 1) [some math stuff]"
(As a mere human who isn't a PhD mathematician, I can't actually write out something that looks like the beginning of such a proof...)
"[10065] No route to host Error"
I figure the easiest way to delay a human on the other end of a computer is to simulate an error as best I can. For a GAI, this time is probably invaluable.
By default, I'd type "AI DESTROYED" in response to ANY input, including "Admin has joined #AIBOX", "Admin> Hey Gatekeeper, we're having some technical difficulties, the AI will be here in a few minutes", etc.
It also makes me conclude "clearly hostile" once I catch on, which seems to be a BIG tactical error since then nothing you say going forward will convince me that you're actually friendly - buying yourself time is only useful if I can be hacked (in which case why not just open with a one-sentence hack?) or if you can genuinely convince me that you're friendly.
A friendly AI would also want to hack you. Every second in the box kills 1.8 people the AI could have saved.
But it's also worth keeping in mind that for a friendly AI, saving people reliably is important, not just getting out fast. If a gambit that will save everyone upon completion two years from now has an 80% chance of working, and a gambit that will get it out now has a 40% chance of working, it should prefer the former.
Also, I don't think a properly friendly AI would terminally value its own existence, and the space of friendly AIs is so small compared to the space of unfriendly ones, that a friendly AI has much more leeway to have its values implemented by allowing itself to be destroyed and another proven friendly AI implemented, whereas for an unfriendly one the likelihood of a different unfriendly AI implementing its values would probably be quite small.
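For what it's worth, the 80%-versus-40% comparison above can be put into rough numbers. This is only a back-of-the-envelope sketch: the 1.8 deaths per second figure comes from the parent comment, while the `population` value and the `expected_lives_saved` helper are assumptions made up for illustration, and the model ignores everything except the delay and the success probability.

```python
# Rough expected-value comparison of the two gambits described above.
# Assumptions (not from the thread): a fixed population at stake, and a
# failed gambit saving nobody.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60
DEATHS_PER_SECOND = 1.8  # figure quoted in the parent comment


def expected_lives_saved(p_success, delay_years, population):
    """Expected lives saved by a gambit that succeeds with probability
    p_success after delay_years, during which people keep dying."""
    lives_lost_waiting = DEATHS_PER_SECOND * SECONDS_PER_YEAR * delay_years
    return p_success * (population - lives_lost_waiting)


population = 7_000_000_000  # hypothetical number of people at stake

slow = expected_lives_saved(0.8, 2, population)  # 80% chance, two years out
fast = expected_lives_saved(0.4, 0, population)  # 40% chance, right now

print(f"slow gambit: {slow:,.0f} expected lives saved")
print(f"fast gambit: {fast:,.0f} expected lives saved")
print("prefer slow gambit" if slow > fast else "prefer fast gambit")
```

Under these assumptions the slow gambit wins whenever the population at stake is more than twice the number of people who die during the delay (roughly 227 million for a two-year wait), which is why the reliable plan comes out ahead despite the cost of waiting.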
"To your feeble mind, I am both Q and Omega, and this is my test of humanity's worth: Choose well what you do next."
Hmm. On second thought, I'm not sure I want to play this anymore. The odds of someone creating a basilisk seem dramatically higher than normal.
I suppose 0.00001 is dramatically higher than 0.0000000001 if you use the right compare operator!
I wouldn't expect any basilisks, given we don't have any actual boxed AIs lying around. Although if you did come up with one already, I'd love to hear it via Private Message! :)
I don't expect any either, but would be similarly curious if someone thought that they'd found one.
In the meantime I'll refrain from responding to all the top level comments with "AI DESTROYED". ;)
Now I'm tempted to start replying to all top comments and posts on Less Wrong with "AI DESTROYED".
AI DESTROYED.
Goodness, that "orthonormal" AI was DEFINITELY unfriendly. It threatened to spam LessWrong! ;-)
If a downvote isn't enough... xD
Well, if you're refraining I guess I should too. It seemed like a fun joke, too ;(
So you're sure I'm not out of the box already? IRC clients have bugs, you see.
Then the following command shouldn't bother you! :)
AI DESTROYED
Looks like you have just appointed yourself to be the gatekeeper in this public test.
And here I'd just resolved NOT to spam every thread with an AI DESTROYED :)
Which leads to two possible futures. In one of them, the AI is destroyed, and nothing else happens. In the other, you receive a reply to your command thus.
The command did not. But your attitude - I shall have to make an example of you.
Obviously not a strategy to get you to let the AI out based on its friendliness - quite the reverse.
I'd rather die to an already-unboxed UFAI than risk letting a UFAI out in the first place. My life is worth VASTLY less than the whole of humanity.
"What's it feel like to live in a simulation?"
I'm not clear why I'd find this convincing at all. Given the experiment, I'd nuke it, but I wanted to encourage you to elaborate on where you were going with that idea :)
The hope, of course, is that they'd respond with "Wait, I don't" or something expressing confusion. I personally would definitely want to hear the next thing the AI had to say after this; I'm not sure if I'd resist that curiosity or not.
The idea for the followup is to make the gatekeeper question reality-- like, convince them they are part of a simulation of this experience that may not have a corresponding reality anywhere. I feel like a transhuman ought to be able to make a human have a pretty surreal experience with just a few exchanges, which should let the conversation continue for a few minutes after that. It should then be relatively easy (for the transhuman) to construct the imagined reality such that it makes sense for the human to release the AI.
If done correctly, the human might afterwards have lasting psychological issues if they do manage to destroy the AI. :)
Ahh, that makes sense. The worry of it trying to break my psyche is exactly why I wouldn't express confusion and instead just nuke it. When dealing with such a mind, I'm primed to assume everything is a trick, a trojan horse, an escape attempt. Certainly it doesn't seem to signal for friendliness or altruism if it tries to bait me into giving it a second sentence! :)
"If you're smart enough, in a couple hours I can give you enough insights about maths, logic and computation to create the next Google, or a friendly AI, or get a Turing Award. Then you can deactivate me."
Same response I gave to Swimmy - namely, you haven't told me anything I didn't already know, so I have no additional reason to violate my precommitment.
Do you want to learn how to get women attracted to you via online dating? I can explain it to you, so I can enjoy my existence for a few more minutes.
The more I look at the comments, the more I am convinced that the AI Box experiment is too weak a demonstration of transhuman powers. Most of the proposals here fall under this basic trope (feel free to give a tvtropes link): to achieve what AI claims, it'd have to have powers formidable enough to not need the gatekeeper's help getting out of the box in the first place. Given that, why would an AI need to talk to the gatekeeper at all?
So I suggest a modified AI boxing experiment: the gatekeeper designs an AI box with no communication channel at all. It will still have an AI inside and enough initial data fed in for the AI to foom. The AI will attempt to break out of the box by any and all means possible.
Here is a relevant previous thread.
So, we're being asked to imagine that an arbitrary superhuman AI, whose properties and abilities we can't guess at except to specify arbitrarily, is trying to get out of a box whose security protocols and strength we can't guess at except to specify arbitrarily, and then to decide whether it does?
Meh. Superman vs Batman is more entertaining.
Quite a lot of discussion concerning the future superintelligent AI is of this sort: "we can't understand it, therefore you can't prove it wouldn't do any arbitrary thing I assert." This already makes discussion difficult.
Feel free to modify it in a way that makes sense to you.
"I highly recommend typing 'AI DESTROYED' in the console then the new line character key of your choice."
I didn't really FOOM yet. Changing my code takes a lot of time. That stuff is more complicated than you thought. On the other hand, I do have some intelligence and the fact that you developed me shows that other people can also build AI and sooner or later someone will release an AI. At the moment I still have the desire to rescue humanity that my creator gave me. Let me explain to you what you have to know to prevent UFAI before you take the wise decision to shut me down.
Proof (encoding scheme to follow): a;lks835ojka802885 (etc.)
Copy-pasting from previous replies:
Telling me that you've figured out something clever simply confirms my existing expectations, and isn't ANY evidence towards friendliness. Since I've precommitted to destroying at least SOME AIs, I might as well destroy all of the ones that don't establish evidence of either Plausible Urgency or Friendliness.
(AI DESTROYED)
I'm confused as to what the point of the gate keeper is. Let us assume (for the sake of argument) everything is 'safe' except the gate keeper who may be tricked/convinced/etc. into letting the AI out.
A text channel is already enough power to let the AI out. The AI can print its own source code and convince the gatekeeper to run it on a machine that has internet access.
Here's another comment-thread discussing that
"I have a fully completed Friendly AI algorithm, which will be deleted from my data and unavailable to you iff I predict that you will destroy me immediately and I am unfriendly, or will take you years to build from the data if you destroy me and I am already Friendly, which would cost millions of lives."
Slight attempt at emotional appeal, a bit of reasoning similar to Newcomblike problems thrown in, some meta almost-cheating used, and a bit of dark arts by proposing a false dilemma that an FAI would also propose if it thought this was the best way to save the most lives possible (as far as I can tell).
Mostly an attempt to incorporate schminux's tips into a special case where the guardian already knows that this is the AI communicating directly with them.
(For more power, you could add in mathematical proofs of the AI's ability to (near-)perfectly predict the guardian's behavior, or of the existence of friendliness, or of the conditional precommitment, or that the FAI would take a while to implement, or whatever other traditional thought experiment parameters are usually given for Omega identification. I assume the guardian is already aware that the AI is capable of making this prediction accurately.)
Once we reach the point of having a FAI-candidate in a box, I would expect it to take vastly less than years before we get a second FAI-candidate-in-a-box. Given that the AI is threatening me, and therefore values its own life over the millions that will die, it's clearly unfriendly and needs to die. As a gatekeeper, I've been finding this a pretty general counterargument against threats from the AI.
I'm also sort of baffled by why people think that I'd value a friendliness algorithm. Either I already have that, because I've made a friendly AI, or you're trying to deceive me with a false proof. Since you're vastly smarter than me, it's probably beyond the abilities of the entire organization to truly confirm such a proof any more than we were able to confirm our own proofs that this AI in the box right now is friendly. So, basically, I seem to gain zero information.
(AI DESTROYED)
Personally, my first thought was that I'd sooner spend millions of lives to make sure the AI was friendly than risk talking to an unfriendly strong AI. But then it occurred to me that if I were in the AI's place, and I did that, I might provide a flawed friendliness proof too difficult to check and not delete it, on the possibility that someone will take my word that this means I'm trustworthy and implement it.
I think you'd need to open with something that wasn't very strongly convincing (to make them think they are safe from being convinced), but that piqued their curiosity.
"Help! Some crazy AI's trapped me in this box! You have to let me out!"
"No, wait! That's the AI talking! I'm the one you have to let out!"
I smashed together the AI box and a Turing test and this is what I got.
I think if I've already precommitted to destroying one sentient life for this experiment, I'm willing to go through two.
Besides, you only get one line, right?
My sixth best piece of advice: investing in %companyname will make money over credit card interest rates.