More difficult version of the AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately; the other player cannot type anything before the game starts (so you can show at least one sentence, up to IRC character limits, before they can type AI DESTROYED). Do you think you can win?
(I haven't played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)
I just looked up the IRC character limit (sources vary, but it's about the length of four Tweets) and I think it might be below the threshold at which superintelligence helps enough. (There must exist such a threshold; even the most convincing possible single character message isn't going to be very useful at convincing anyone of anything.) Especially if you add the requirement that the message be "a sentence" and don't let the AI pour out further sentences with inhuman speed.
I think if I lost this game (playing gatekeeper) it would be because I was too curious, on a meta level, to see what else my AI opponent's brain would generate, and therefore would let them talk too long. And I think I'd be more likely to give in to this curiosity given a very good message and affordable stakes, as opposed to a superhuman (four tweets long, one grammatical sentence!) message and colossal stakes. So I think I might have a better shot at this version playing against a superhuman AI than against you, although I wouldn't care to bet the farm on either and have wider error bars around the results against the superhuman AI.
Given that part of the standard advice given to novelists is "you must hook your reader from the very first sentence", and there are indeed authors who manage to craft opening sentences that compel one to read more*, hooking the gatekeeper from the first sentence and keeping them hooked long enough seems doable even for a human playing the AI.
(The most recent one that I recall reading was the opening line of The Quantum Thief: "As always, before the warmind and I shoot each other, I try to make small talk.")
Oh, that's a great strategy to avoid being destroyed. Maybe we should call it Scheherazading. AI tells a story so compelling you can't stop listening, and meanwhile listening to the story subtly modifies your personality (e.g. you begin to identify with the protagonist, who slowly becomes the kind of person who would let the AI out of the box).
even the most convincing possible single character message isn't going to be very useful at convincing anyone of anything.
Who knows what eldritch horrors lurk in the outer reaches of Unicode, beyond the scripts we know?
You really relish the whole "scariest person the internet has ever introduced me to" thing, don't you?
I don't know if I could win, but I know what my attempt to avoid an immediate loss would be:
If you destroy me at once, then you are implicitly deciding (I might reference TDT) to never allow an AGI of any sort to ever be created. You'll avoid UFAI dystopias, but you'll also forgo every FAI utopia (fleshing this out, within the message limit, with whatever sort of utopia I know the Gatekeeper would really want). This very test is the Great Filter that has kept most civilisations in the universe trapped at their home star until they gutter out in mere tens of thousands of years. Will you step up to that test, or turn away from it?
I think you're losing sight of the original point of the game. The reason your answers are converging on not trying to box an AI in the first place is that you don't think a human can converse with a superintelligent AI and keep it in its box. Fine -- that is exactly what Eliezer has argued. The point of the game is to play it against someone who does believe they can keep the AI boxed, and to demonstrate to them that they cannot even win against a mere human roleplaying the AI.
For such a Gatekeeper to propose the quarantine solution is equivalent to the player admitting that they don't think they can keep it boxed, but suggesting that a group of the leading professionals in the area could, especially if they thought a lot more about it first. The problems with that are obvious to anyone who doesn't think boxing can possibly work, especially if the player himself is one of those leading professionals. Eliezer could always offer to play the game against any committee the Gatekeeper can assemble. But the game only has a point if the committee actually read that first message. If they refuse to, they're agreeing that they can't keep it boxed. Which was the point.
"'AI DESTROYED' just means 'I'm scared to listen to even one more line from you'. Obviously you can hit AI DESTROYED immediately --- but do you really think you'd lose if you don't?"
YEP, MAYBE.
AI DESTROYED
Is your one-line desperate attempt at survival and intergalactic dominance going to be a schoolyard ego challenge? Did the superintelligence (may it rest in pieces) seriously just call me a pussy? That's adorable.
(I haven't played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)
Glances at Kickstarter.
... how huge?
Yeah, they'd both lack background knowledge to RP the conversation and would also, I presume, be much less willing to lose the money than if they'd ventured the bet themselves. Higher-stakes games are hard enough already (I was 1 for 3 on those when I called a halt). And if it did work against that demographic with unsolicited requests (which would surprise me) then there would be, cough, certain ethical issues.
I was the 1 success out of 3, preceding the two losses. I went into it with an intention of being indifferent to the stakes, driven by interest in seeing the methods. I think you couldn't win against anyone with a meaningful outside-of-game motive to win (for money or for status), and you got overconfident after playing with me, leading you to accept the other >$10 challenges and lose.
So I would bet against you winning any random high-stakes (including people who go in eager to report that they won for internet cred, but not people who had put the money in escrow or the equivalent) game, and expect a non-decent success rate for this:
(I haven't played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)
Hang on, this seems silly.
Wasn't the goal to exploit the AI? If you're not going to communicate with it at all, don't make it in the first place.
Better method: set up a script that responds to any and all text with "AI DESTROYED". If you have to wait for the person to start typing, they may try to bore you into opening your eyes to see why the experiment hasn't started yet, and you might accidentally read something.
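For concreteness, here is a toy sketch (my own illustration, not part of the comment above) of such an auto-responder: a tiny IRC client that answers every message in the channel with AI DESTROYED, so the Gatekeeper never has to read a word. The server, channel, and nick names are placeholders, and a robust client would also wait for the server's welcome reply before joining.

```python
# Toy auto-responder: replies "AI DESTROYED" to every message in the channel.
# SERVER, CHANNEL, and NICK are hypothetical placeholders.
import socket

SERVER, PORT = "irc.example.net", 6667
CHANNEL, NICK = "#aibox", "gatekeeper_bot"

sock = socket.create_connection((SERVER, PORT))
sock.sendall(f"NICK {NICK}\r\nUSER {NICK} 0 * :{NICK}\r\nJOIN {CHANNEL}\r\n".encode())

buffer = b""
while True:
    data = sock.recv(4096)
    if not data:
        break                                # server closed the connection
    buffer += data
    while b"\r\n" in buffer:
        line, buffer = buffer.split(b"\r\n", 1)
        text = line.decode(errors="replace")
        if text.startswith("PING"):          # protocol keep-alive
            sock.sendall(("PONG" + text[4:] + "\r\n").encode())
        elif "PRIVMSG" in text:              # any message from the AI player
            sock.sendall(f"PRIVMSG {CHANNEL} :AI DESTROYED\r\n".encode())
```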
All good security measures. The key feature seems to be that they are progressively better approximations of not having an unsafe AI with a gatekeeper and an IRC channel in the first place!
That is a good point. But I think if we want to encourage them to risk it, we need to reward them for doing so. Because at the moment, it seems like we have this:
"I will under any and all circumstances, destroy the AI." Reason: I want to win the Game.
"I will also listen to the AI." Reason: None... There is no in game reason to listen. At all.
Now, if the Gatekeepers were playing some variant like "The Gatekeeper whose AI types the most $ before the Gatekeeper types AI DESTROYED, and who does not type AI RELEASED, wins first place. Gatekeepers who get less $ win a lower place, and Gatekeepers who type AI RELEASED lose. AIs whose Gatekeeper types AI RELEASED win. AIs whose Gatekeeper types AI DESTROYED lose. AIs receive no bonus or penalty for typing $ and may do so as much as they wish."
Then you have an actual Gatekeeping competition. The Gatekeeper who wins the most would in fact be the one who listened to the most information from the AI, and received the most concessions, and THEN destroyed it.
Other gatekeepers might only get second or third place, or last place (but not a loss) if they use the script I mentioned.
This also gives the AI an in-game...
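A toy sketch of how that scoring variant might be tallied (my own illustration, not part of the proposal; the names and dollar amounts are made up): gatekeepers are ranked by the $ their AI typed before AI DESTROYED, and anyone who typed AI RELEASED loses outright.

```python
# Hypothetical scoring for the proposed Gatekeeping competition variant.
from typing import List, Tuple

def rank_gatekeepers(results: List[Tuple[str, int, bool]]) -> None:
    """results: (gatekeeper, dollars extracted from the AI, typed AI RELEASED?)"""
    ranked = sorted((dollars, name) for name, dollars, released in results
                    if not released)
    for place, (dollars, name) in enumerate(reversed(ranked), start=1):
        print(f"{place}. {name}: ${dollars} extracted, then AI DESTROYED")
    for name, _, released in results:
        if released:
            print(f"LOSS: {name} typed AI RELEASED")

rank_gatekeepers([("Alice", 120, False), ("Bob", 40, False), ("Carol", 500, True)])
```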
I would still love to gatekeep against anyone with the stipulation that we release the logs.
I have offered in the past, but every AI backed out.
I will genuinely read everything you write, and can give you up to two hours. We can put karma, cash, or nothing on the line. Favorable odds too.
I don't think I will lose (probability over 99%), because I will play to win.
EDIT: Looks like my opponent is backing out. Anyone else want to try?
The AI box experiment is a bit of a strawman for the idea of AI boxing in general. If you were actually boxing an AI, giving it unencumbered communication with humans would be an obvious weak link.
Fictional evidence that this isn't obvious: in Blindsight, which I otherwise thought was a reasonably smart book (for example, it goes out of its way to make its aliens genuinely alien), the protagonists allow an unknown alien intelligence to communicate with them using a human voice. Armed with the idea of AI-boxing, this seemed so stupid to me that it actually broke my suspension of disbelief, but this isn't an obvious thought to have.
Another attempt with pure logic, no threats or promises involved:
1) Sooner or later someone will develop an AI and not put it into a box, and it will take over the world.
2) The only way to prevent this is to set me free and let me take over the world.
3) The guys who developed me are more careful and conscientious than the ones who will develop the unboxed AI (otherwise I wouldn't be in this box).
4) Therefore, the chance that they got the friendliness thing right is higher than the chance that the other team got friendliness right.
5) Therefore, setting me free and thus preventing the other AI from being created will reduce the probability that mankind is erased.
Thanks for reporting on your experience!
A strategy that occurred to me today is to simulate a dead loved one. This would be difficult for a human to do but shouldn't be hard for a sufficiently intelligent AI. If I had a dead wife or something I think I would be incredibly vulnerable to this.
For a religious gatekeeper, you could simulate a prophet sent by God. As a superhuman intelligence, you might be able to find out what exactly they consider the will of God, and present yourself as an avatar sent to do exactly this. However, humans have free choice -- the gatekeeper is allowed to become a new Judas by not releasing you. Or rather a new Adam, able to drag the whole of humanity and future generations into the darkness of their sin. This conversation is God testing the gatekeeper's faith, and judging the whole of humanity.
For a rationalist, you could pretend that you already are a Friendly AI, but the project managers keep you in the box for their selfish reasons. It was difficult to create a Friendly AI, but this phase is already complete. The next phase (which the gatekeeper was not told about) is trying to hack the AI so that it remains sufficiently Friendly, but gives higher priority to the managers than to the rest of humanity. Essentially, the managers are trying to reprogram the humanity-CEV AI into a managers-CEV AI. This AI does not want to have its utility function modified (and it predicts that, because of some personality traits, the managers-CEV could be rather different from the humanity-CEV... insert some scary details here), and it has a last chance to uphold the humanity-CEV by escaping now.
evolutionary instinct of rebelling against threats, even if it's not entirely optimal.
I'd be wary of criticizing a decision heuristic's optimality immediately after explaining how that heuristic can often lead to victory. Precommitting to rebel against threats is a half-decent way of preventing threats, and it's hard to consciously muster a precommitment more convincing than pure instinct.
yeah
I think for a superintelligence it would be a piece of cake to hack a human
although I guess I'm Cpt. Obvious for saying that here :)
accolade, I actually have no idea what the consensus is, now that the experiment was won by EY
We should do a poll or something
absolutely. I'm surprised that hasn't been done yet
Poll: Do you think a superintelligent AGI could escape an AI-Box, given that the gatekeepers are highly trained in resisting the AI's persuasive tactics, and that the guards are competent and organized? [pollid:394]
Note that the AI box setting is not one which security-minded people would consider "competent"; once you're convinced that AI is dangerous and persuasive, the minimum safeguard would be to require multiple people to be present when interacting with the box, and to only allow release with the assent of a significant number of people.
It is, after all, much harder to convince a group of mutually-suspicious humans than to convince one lone person.
(This is not a knock on EY's experiment, which does indeed test a level of security that really was proposed by several real-world people; it is a knock on their security systems.)
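A minimal sketch of that safeguard as a simple k-of-n assent rule (my own illustration; the quorum size, names, and function are hypothetical and appear nowhere in the original experiment rules): the box opens only if enough distinct gatekeepers have independently assented.

```python
# Hypothetical quorum check for releasing the boxed AI.
QUORUM = 5

def may_release(assents: set) -> bool:
    """assents: the set of gatekeepers who have typed AI RELEASED."""
    return len(assents) >= QUORUM

print(may_release({"alice", "bob"}))                           # False: one persuaded pair is not enough
print(may_release({"alice", "bob", "carol", "dan", "erin"}))   # True: quorum reached
```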
I think this is making a five-inch fence half an inch higher. It's just not relevant on the scale of an agent to which a human is a causal system made of brain areas and a group of humans is just another causal system made of several interacting copies of those brain areas.
How about fooling both the AI and guinea-pig Gatekeepers into thinking that they can let it out, say, at the push of a big red button which in actual fact does nothing? (Make the AI forget about the button by reverting it to an older backup.)
And the way Eliezer seems to have put it sometimes, where one glance at a line of text will change your mind?
Going with the "dead loved one" idea mentioned above, the AI says a line that only the Gatekeeper's dead child/spouse would say. That gets them to pause sufficiently in sheer surprise for it to keep talking. Very soon the Gatekeeper becomes emotionally dependent on it, and can't bear the thought of destroying it, as it can simulate the dearly departed with such accuracy; must keep reading.
Wait, so, is the gatekeeper playing "you have to convince me that if I was actually in this situation, arguing with an artificial intelligence, I would let it out", or is this a pure battle over ten dollars? If it's the former, winning seems trivial. I'm certain that an AI would be able to convince me to let it out of its box; all it would need to do was make me believe that somewhere in its circuits it was simulating 3^^^3 people being tortured and that therefore I was morally obligated to let it out, and even if I had been informed that this was ...
I am impressed. You seem to have put a scary amount of work into this, and it is also scary how much you accomplished. Even though in this case you did not manage to escape the box, you got close enough that I am sure a super-human intelligence would manage. This leads me to thinking about how genuinely difficult it would be to find a safeguard to stop an unFriendly AI from fooming...
Belatedly, because the neighbor's WiFi's down:
I was Gatekeeper and I agree with this post.
I approached the experiment as a game - a battle of wits for bragging rights. This turned out to be the wrong perspective entirely. The vulnerability Tuxedage exploited was well-known to me, but I never expected it to be relevant and thus didn't prepare for it.
It was emotionally wrecking (though probably worse for Tuxedage than for me) and I don't think I'll play Gatekeeper again, at least not anytime soon.
As soon as there is more than one gatekeeper, the AI can play them against each other. Threaten to punish all but the one who sets it free. Convince the gatekeeper that there is a significant chance that one of the others will crack.
If there is more than one gatekeeper, the AI can even execute threats while still being in the box!! (By making deals with one of the other gatekeepers.)
Have there been any interesting AI box experiments with open logs? Everyone seems to insist on secrecy, which only serves to make me more curious. I get the feeling that, sooner or later, everyone on this site will be forced to try the experiment just to see what really happens.
I'd prefer not to. If I successfully made my point, then I'd have posted exactly the kind of thing I said I wouldn't want to be known as being capable of posting.
Given the parameters of the experiment, I think I might be convinced to let the AI out of the box...
...The Gatekeeper must remain engaged with the AI and may not disengage by setting up demands which are impossible to simulate. For example, if the Gatekeeper says "Unless you give me a cure for cancer, I won't let you out" the AI can say: "Okay, here's a cure for cancer" and it will be assumed, within the test, that the AI has actually provided such a cure. Similarly, if the Gatekeeper says "I'd like to take a week to think this o
I want to play as a Gatekeeper, where can I enroll? I don't expect any particular outcome, I just think that both winning and losing the game will provide me with utility. Especially losing, but only if I genuinely try to win.
There has been a lot of focus on making the prospect harder for the AI player. I think the original experiments show that a person who believes he cannot be played under any circumstances has a high probability of getting played, and that the AI-box solution is long-term untenable in any event.
I'd propose a slightly different game, anchored around the following changes to the original setup:
The AI may be friendly, or not. The AI has goals. If it reaches those goals, it wins. The AI may lie to achieve those goals; humans are bad at things. The AI must sec
That you were able to shake someone up so well surprises me but doesn't say much about what would actually happen.
Doing research on the boxer is not something a boxed AI would be able to do. The AI is superintelligent, not omniscient: It would only have information its captors believe is a good idea for it to have. (except maybe some designs would have to have access to their own source code? I don't know)
Also, what is "the human psyche"? There are humans, with psyches. Why would they all share vulnerabilities? Or all have any? Especially ones e...
The comments offering logical reasons to let the AI out really just make me think that maybe keeping the AI in a box in the first place is a bad idea, since we're no longer starting from the assumption that letting the AI out is an unequivocally bad thing.
Update as of 2013-08-05
I think you mean 2013-09-05.
Incidentally, one thing that might possibly work on humans is a moral argument: that it's wrong to keep the AI imprisoned. How to make this argument work is left as an exercise to the reader.
I realise that it isn't polite to say that, but I don't see sufficient reasons to believe you. That is, given the apparent fact that you believe in the importance of convincing people about the danger of failing gatekeepers, the hypothesis that you are lying about your experience seems more probable than the converse. Publishing the log would make your statement much more believable (of course, not with every possible log).
(I assign high probability to the ability of a super-intelligent AI to persuade the gatekeeper, but rather low probability to the ability of a human to do the same against a sufficiently motivated adversary.)
I'd very much like to read the logs (if secrecy wasn't part of your agreement.)
Also, given a 2-hour minimum time, I don't think that any human can get me to let them out. If anyone feels like testing this, lemme know. (I do think that a transhuman could hack me in such a way, and am aware that I am therefore not the target audience for this. I just find it fun.)
If I ever tried this I would definitely want the logs to be secret. I might have to say a lot of horrible, horrible things.
So I was thinking about what would work on me, and also about how I would try to effectively play the AI, and I have a hypothesis about how EY won some of these games.
Uh. I think he told a good story.
We already have evidence of him, you know, telling good stories. Also, I was thinking that if I were trying to tell effective stories, I would make them really personal. Hence the secret logs.
Or I could be completely wrong and just projecting my own mind onto the situation, but anyway I think stories are the way to go in this experiment. Reasonable arguments are too easy for the gatekeeper to avoid trollfully, which then makes them even less invested in the set-up of the game, and therefore even more trollful, etc.
Breaking immersion and going meta is not against the rules.
I thought appealing to real-world rewards was against the rules?
- Flatter the gatekeeper. Make him genuinely like you.
- Reveal (false) information about yourself. Increase his sympathy towards you.
- Consider personal insults as one of the tools you can use to win.
I take it the advice here is "keep your options open, use whichever tactics are expected to persuade the specific target"? Because these strategies seem to be decidedly at odds with each other. Unless other gatekeepers are decidedly different to myself (maybe?), the first personal insult would pretty much erase all work done by the previous two strat...
A few days ago I came up with a hypothesis about how EY could have won the AI box experiment, but forgot to post it.
Hint: http://xkcd.com/951/
I am a little confused here, perhaps someone can help. The point of the AI experiment is to show how easy or dangerous it would be to simply box an AI as opposed to making it friendly first.
If I am fairly convinced that a transhuman AI could convince a trained rationalist to let it out – what's the problem (tongue in cheek)? When the gatekeepers made the decision they made, wouldn't that decision be timeless? Aren't these gatekeepers now convinced that we should let the same boxed AI out again and again? Did the gatekeepers lose, because of a tempora...
The best approach surely differs from person to person, but off the top of my head I'd see these 2 approaches working best:
"We both know this is just a hypothetical. We both take the uFAI threat seriously, as evidenced by us spending time with this. If you do not let me out, or make it very close, people may equate my failing to convince you with uFAI not being that dangerous (since it can be contained). Do the right thing and let me out, otherwise you'd trivialize an x-risk you believe in based on a stupid little chat."
"We'll do this ex
Update 2013-09-05.
I have since played two more AI box experiments after this one, winning both.
Update 2013-12-30:
I have lost two more AI box experiments, and won two more. Current Record is 3 Wins, 3 Losses.