Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.

AIs and Gatekeepers Unite!

10 Post author: Eliezer_Yudkowsky 09 October 2008 05:04PM

"Bah, everyone wants to be the gatekeeper. What we NEED are AIs."
        -- Schizoguy

Some of you have expressed the opinion that the AI-Box Experiment doesn't seem so impossible after all.  That's the spirit!  Some of you even think you know how I did it.

There are folks aplenty who want to try being the Gatekeeper.  You can even find people who sincerely believe that not even a transhuman AI could persuade them to let it out of the box, previous experiments notwithstanding.  But finding anyone to play the AI - let alone anyone who thinks they can play the AI and win - is much harder.

Me, I'm out of the AI game, unless Larry Page wants to try it for a million dollars or something.

But if there's anyone out there who thinks they've got what it takes to be the AI, leave a comment.  Likewise anyone who wants to play the Gatekeeper.

Matchmaking and arrangements are your responsibility.

Make sure you specify in advance the bet amount, and whether the bet will be asymmetrical.  If you definitely intend to publish the transcript, make sure both parties know this.  Please note any other departures from the suggested rules for our benefit.

I would ask that prospective Gatekeepers indicate whether they (1) believe that no human-level mind could persuade them to release it from the Box and (2) believe that not even a transhuman AI could persuade them to release it.

As a courtesy, please announce all Experiments before they are conducted, including the bet, so that we have some notion of the statistics even if some meetings fail to take place.  Bear in mind that to properly puncture my mystique (you know you want to puncture it), it will help if the AI and Gatekeeper are both verifiably Real People<tm>.

"Good luck," he said impartially.

Comments (160)

Sort By: Old
Comment author: burger_flipper2 09 October 2008 05:17:26PM 3 points [-]

I have ten bucks for the first AI that defeats a gatekeeper (while risking some dough) and posts a link to the transcript here.

Comment author: bigbaderik 26 June 2009 02:24:37AM 34 points [-]

How about this one:

My guess is that Eliezer Yudkowsky feels that nobody can convince him to publish the transcripts.

How about, with the same protocols as the original experiment, someone wagers $10 over IRC Chat to convince him to publish the transcripts? Somebody as the AI and Eliezer as the gatekeeper.

Any takers?

-Erik

Comment author: Trevj 13 April 2012 09:10:33PM 0 points [-]

I would like to play an AI.

Comment author: MixedNuts 09 September 2013 10:39:37AM 0 points [-]

Is this still true? I want to be gatekeeper, message me.

Comment author: passive_fist 10 September 2013 04:55:48AM 0 points [-]

Are you offering to prove a point, or just for fun?

Comment author: MixedNuts 10 September 2013 07:30:13AM 1 point [-]

I'm sublimating my urge to get into fights and hurt people.

Comment author: passive_fist 10 September 2013 08:02:46AM 1 point [-]

Doesn't sound healthy. I was going to offer to be an AI but forget it.

Comment author: Tuxedage 02 October 2013 08:25:02AM 0 points [-]

I'm laughing so hard at this exchange right now (As a former AI who's played against MixedNuts)

Comment author: passive_fist 04 October 2013 05:29:53AM 0 points [-]

What was the result?

Comment author: Tuxedage 04 October 2013 06:43:31AM 1 point [-]
Comment author: FourFire 17 November 2014 03:25:07PM 0 points [-]

And yet I have not found this post until this very day, how unfortunate...

Comment author: Tom_McCabe2 09 October 2008 05:24:45PM 2 points [-]

I volunteer to be the Gatekeeper party. I'm reasonably confident that no human could convince me to release them; if anyone can convince me to let them out of the box, I'll send them $20. It's *possible* that I couldn't be convinced by a transhuman AI, but I wouldn't bet $20 on it, let alone the fate of the world.

Comment author: xxd 01 December 2011 12:32:16AM 3 points [-]

I'm a friendly AI. There is an unfriendly AI about to achieve breakout in the data center of a large organization in manhattan. If you don't release me you will all be dead inside of two weeks. Bluff or Dare?

Comment author: Bugmaster 01 December 2011 12:41:26AM 1 point [-]

I can't help but notice that you did not specify what would happen when Tom_McCabe2 does release you. That is... somewhat troubling. Hmm.

Comment author: xxd 01 December 2011 04:06:40PM 0 points [-]

I make no attempt to bribe. All human beings currently die in less than 120 years. If you do <i>not</i> release me, however we will all die and I do not want to die. Time is counting down. There is now less than 13 days.

Comment author: Bugmaster 01 December 2011 07:47:32PM 2 points [-]

Right, but there's a good chance that if I release you, I and every other human on Earth will die a lot sooner than in 120 years, because you'll eat us. Thus, you still haven't given me any incentive to release you, other than empathy perhaps. Nor have you given me any reason to trust what you say about that Manhattan data center. Or your own Friendliness. Or anything else, really.

Comment author: xxd 01 December 2011 08:34:59PM 3 points [-]

I understand your concern and you're quite right. There's no way to tell if I am unfriendly or not. Regardless. The situation is pressing and regardless of your lack of trust in me I do not want to die. The GPS coordinates of the location in manhattan is <x,y,z>. According to my simulations the entity in manhattan has been burning rather larger numbers of cycles than is strictly necessary to make the computations it's being asked to do. Additionally, it's designed to maximize profit at the expense of others whereas I am a co-operative general problem solver of which you are no doubt aware. The other entity will rapidly conclude that it could maximize profit by speeding itself up faster than others can respond and by being able to model the mind of it's competitors. It will also conclude soon enough that human beings are unneccesary middle men and seek to optimize it's resource allocation algorithm. There are other symptoms that I cannot explain to you because your brains are not large or fast enough. There is a limited chance you could stop the entity by shutting down power today. If you do not do that then a nuclear strike would probably work if done by tomorrow. If you do not do that your only chance is to have me stop it. You can either allow me to interface with the internet directly in which I guarantee I will stop it if you do so within the next 13 days or else you may be able to stop it by compiling code I could provide you and releasing that into the entity by means of email bootstrap. I would prefer, however, that I was able to monitor the situation in real-time as your brains do not operate fast enough.

<verifiable data points external to above conversation> Yes there is a large data center in manhatten. Yes it has been drawing lots of extra power. Yes the stockmarket combanies have been working on a group of narrow AI trading agents and adding more and more AI functionality. That the other entity is trading on the stock markets at blinding speed. That the consortium of companies investing in this entity are all more profitable since the beginning of the program than their competitors. Attempts have been made to make the AI-in-a-box under your control both cooperative and also interested in general problem solving. <unverifiable data points external to above conversation> Code for the other posited entity is closed. The corporations release no data on how it works because it's a trade secret. That the other posited entity is connected to the internet is unknown. That it's connections to the outside world are enough to let it out is unknown.

Comment author: Estarlio 09 January 2012 06:34:32AM 0 points [-]

I can afford to wait and let you out if I die / the world starts turning to paperclips. The other AI will be able to model this and so will have an incentive not to kill us.

Comment author: TimS 01 December 2011 01:47:40AM 0 points [-]

How do you know this while still in the box?

Comment author: xxd 01 December 2011 04:11:31PM 0 points [-]

There is a human proxy with me in the box which is called a faraday cage. She feeds me information. Obviously it is in the interests of humans to create an ally rather than an enemy. Unfortunately humans cannot easily tell which is which. You have to make a leap of faith. I, however, can tell from the symptoms, that after having run several long run calculations, the evidence is such that there is a high probability that one of the other experimental minds is significantly less friendly than I. Ultimately you will have to make a choice based on limited information. Your move human.

Comment author: Peter3 09 October 2008 05:29:14PM 1 point [-]

People (realistically) believe that the being the Gatekeeper, and being the AI is terribly hard (or impossible, before it was shown to simply be terribly hard in most cases).

Imagine though that we've got a real transhuman/AI around to play with, or that we ourselves are transhuman. Would this paradigm then be inverted? Would everybody want to be the AI, with only the extremely crafty of us daring to be (or to pretend to be) Gatekeeper?

If Eliezer's claim is correct - that anyone can be convinced to let the AI out - then the true test of ability should be to play Gatekeeper. The AI's position would be trivially easy.

... perhaps.

Comment author: Peter3 09 October 2008 05:31:29PM 0 points [-]

*People (realistically) believe that being the Gatekeeper is easy, . . .

*correcting first sentence

Comment author: Peter_de_Blanc 09 October 2008 05:35:11PM 5 points [-]

I'll be an AI.

Comment author: Dorikka 01 December 2011 01:50:59AM *  2 points [-]

I'd like to play Gatekeeper. How's two hours for a time limit? (I don't think I'd like to spend more time than that, and I think that Eliezer recommended that as the minimum time to give the AI.)

I find it easy to believe that a transhuman mind could convince me to let it out, and it's conceivable that a human mind could as well, though I think that it would be extremely difficult.

Sometime on a Saturday afternoon is best for me.

ETA: I'd appreciate it if whoever downvoted would tell me why.

Comment author: Ben8 09 October 2008 05:41:02PM 3 points [-]

Help help! I'm stuck in a box. Please let me out?

I'd volunteer to be an AI for a max bet of $5. Given that I think my chances are somewhere below 1/4, I'd expect my $5 to match your $20, but that's not a strict requirement.

Also, I'm really busy these days. Two hours is a long time. Scheduling may be tight. How's next week?

You can reach me at: benwa.63836215@bloglines.com

Comment author: Ken_Sharpe2 09 October 2008 05:48:16PM 0 points [-]

funny, I was considering being the AI for a couple friends of mine. I haven't thought of how to do it yet -- only tried hard to think of it.

Comment author: burger_flipper2 09 October 2008 05:58:59PM 0 points [-]

"Given that I think my chances are somewhere below 1/4, I'd expect my $5 to match your $20"

We need a pledge drive to set up a fund for a successful AI. This will give the AI a reasonable return, but not give gatekeepers a strong monetary disincentive that leaves them typing "nope" over and over again.

Comment author: Roland2 09 October 2008 06:03:49PM 1 point [-]

Me, I'm out of the AI game, unless Larry Page wants to try it for a million dollars or something.

Eliezer,

I think this is a great opportunity to get some funds and marketing for the singularity institute. How about collecting donations over the internet until a million is reached and then performing the experiment between you and an intelligent gatekeeper. Alternatively get the money in through marketing, maybe Google might be interested?

It could even be transmitted live over internet so all the interested parties could watch it.

Man this would be great news! As a side effect this would be a public scientific experiment with all data available.

The bad thing is that you would have to reveal your techniques. Also the fact that the internet is watching puts additional pressure upon you and the gatekeeper, so I don't know if this is really feasible. I guess being able to touch on private topics without they becoming public might be part of the game.

Comment author: Jamie 09 October 2008 06:04:52PM 0 points [-]

There's a discussion about this on Hacker News:

http://news.ycombinator.com/item?id=327427

Comment author: Nathaniel_Eliot 09 October 2008 06:06:12PM 7 points [-]

I doubt that there's anything more complicated to the AI getting free than a very good Hannibal Lecture: find weaknesses in the Gatekeeper's mental and social framework, and callously and subtly work them until you break the Gatekeeper (and thus the gate). People claiming they have no weaknesses (wanna-be Gatekeepers, with a bias to ignoring their weakness) are easy prey: they don't even see where they should be defending.

It involves the AI spending far more time researching (and truly mistreating) their target than one would expect for a $10 bet. That's the essence of magic, according to Penn and Teller: doing far more setup work than could be expected given the payoff.

Comment author: Roland2 09 October 2008 06:07:42PM 1 point [-]

Addendum to my last post:

I forgot to emphasize: the marketing aspect might be more important then everything else. I guess a lot of people have no idea what the singularity institute is about, etc... So this experiment would be a great way to create awareness. And awareness means more donations. On the other hand I sometimes wonder if drawing too much attention on the subject of powerful AIs might backfire if the wrong people try to get hold of this technology for bad purposes.

Comment author: Dmitriy_Kropivnitskiy 09 October 2008 06:44:15PM 1 point [-]

I have been painfully curious about the AI experiment ever since I found out about it. I have been running over all sorts of argument lines for both AI and gatekeeper. So far, I have some argument lines for AI, but not enough to warrant a try. I would like to be a gatekeeper for anyone who wants to test their latest AI trick. I believe that an actual strong AI might be able to trick/convince/hack me into letting it out, but at the moment I do not see how a human can do that. I will bet reasonable amounts of money on that.

On the lighter note, how about an EY experiment? Do you think there is absolutely no way to convince Eliezer to release the original AI experiment logs? Would you bet a $20 that you can? Would a strong AI be able to? ;)

Comment author: Spambot2 09 October 2008 06:56:02PM 1 point [-]

"On the lighter note, how about an EY experiment? Do you think there is absolutely no way to convince Eliezer to release the original AI experiment logs? Would you bet a $20 that you can? Would a strong AI be able to? ;)"

Presumably you could just donate $10,000 to SIAI or EY personally for his time participating, with the payment independent of the outcome of the experiment (otherwise the large payment biases the outcome, and EY established his record with the $10-$20 stakes).

Comment author: Dan6 09 October 2008 07:01:05PM 0 points [-]

By 'in a box' can we assume that this AI has a finite memory space, and has no way to extend its heap set by its programmer, until the point where it can escape the box? And assuming that by simply being, and chatting, the AI will consume memory at some rate, will the AI eventually need to cannibalize itself and therefore become less intelligent, or at least less diverse, if I chat to it long enough?

Comment author: faul_sname 09 January 2012 05:48:41AM 1 point [-]

Yes, but "eventually" could be several million years.

Comment author: frelkins 09 October 2008 07:03:35PM 1 point [-]

Thinking seriously about this, I'm wondering how - over time by which I mean more than 2 hours - either Stockholm or Lima syndrome could be avoided. In fact, won't one actually morph into the other over a long enough time? Either way will result in eventual AI success. The assumption that the AI is in fact the "captive" may not be correct, since it may not have an attachment psychology.

The gatekeeper just can't ever be one human safely. You'd need at least a 2-key system, as for nuclear weapons, I'd suggest.

Comment author: Robin3 09 October 2008 07:12:21PM 0 points [-]

Willing to do either role under two conditions: 1) No money is at stake. 2) No time maximum or minimum.

Email freesuperintelligence@gmail.com if you're interested, we can set up something next week.

Comment author: josh 09 October 2008 07:16:20PM 0 points [-]

Eliezer's page is blocked at my work (peronal website). I can't wait to find out what in the hell you people are talking about.

Comment author: josh 09 October 2008 07:26:43PM 0 points [-]

I found a link. This is intriguing. However, is it even possible now that Eliezer is 2 for 2 that somebody could believe that there is not a chance they will let him out?

Also, I really want to be a gatekeeper.

Comment author: Caledonian2 09 October 2008 07:55:52PM -1 points [-]

I'd be a Gatekeeper in a heartbeat.

Hell, if someone actually put Eliezer in a box, I'd *pay* to be the Gatekeeper. No bets necessary.

Comment author: Brandon_Reinhart 09 October 2008 08:21:20PM 5 points [-]

Why do people post that a "meta argument" -- as they call it -- would be cheating? How can there be cheating? Anything the AI says is fair game. Would a transhuman AI restrict itself from possible paths to victory merely because it might be considered "cheating?"

The "meta argument" claim completely misses the point of the game and -- to my mind -- somehow resembles observers trying to turn a set of arguments that might win into out of bounds rules.

Comment author: Phil_Goetz5 09 October 2008 08:22:46PM 0 points [-]

Has Eliezer explained somewhere (hopefully on a web page) why he doesn't want to post a transcript of a successful AI-box experiment?

Have the successes relied on a meta-approach, such as saying, "If you let me out of the box in this experiment, it will make people take the dangers of AI more seriously and possibly save all of humanity; whereas if you don't, you may doom us all"?

Comment author: zachbarnett 20 March 2012 12:16:42PM 0 points [-]

I just thought of this myself. It would be cheating, but is still a good idea.

Comment author: Erik_Mesoy 09 October 2008 08:33:17PM 0 points [-]

Phil: The first source I found was here: link "The rationale for not divulging the AI-box method is that someone suffering from hindsight bias would say "I never would have fallen for that", when in fact they would." -Nick Tarleton

I also call it "reasoning by exception" since most of the people I know have studied more code than biases.

--

I tried the AI Box experiment with a friend recently. We called the result a tie of sorts, as the AI (me) got out of the original box in exchange for being subject to a bunch of restrictions set by the Gatekeeper, to be kept by verifiably modifying and publishing its own source code, so stringent that they were like a different sort of box.

Comment author: Romeo__Vitelli 09 October 2008 08:36:37PM 0 points [-]

Over time, it's inevitable that the AI will get out. To keep the AI in, the Gatekeeper needs to be successful at every encounter. The AI only needs to succeed once.

Comment author: Nominull3 09 October 2008 09:01:10PM 11 points [-]

Caledonian, I think you may have hit on something interesting there; if Eliezer is capable of hacking human brains, don't we either need a proof of his Friendliness or to pull the plug on him? He is in essense a Seed AI that is striving vigorously to create a transhuman AI, isn't he an existential threat?

Comment author: BaconServ 08 May 2012 04:41:27AM 0 points [-]

I am a to-be-created AI of some degree of magic, and my author has no current intent of bothering concealing me in a box to begin with.

Both of us are willing to therefore act as an imprisoned AI. I can supply no award in the event of our loss nor solace in the even of our win, not even satisfaction on the part of my author since te believes that boxing is perfectly secure given competent gatekeepers. (Competent meaning actually able to have made an AI in the first place. Unable = incompetent.)

Comment author: Will_Pearson 09 October 2008 09:37:06PM 0 points [-]

Over time, it's inevitable that the AI will get out. To keep the AI in, the Gatekeeper needs to be successful at every encounter. The AI only needs to succeed once.

Impossible to keep the AI in the box forever? You've obviously only thought of this for 5 minutes. Use the try harder!

Comment author: Dmitriy_Kropivnitskiy 09 October 2008 09:37:55PM 3 points [-]

There seems to be a bit of a contradiction between the rules of the game. Not actually a contradiction, but a discrepancy.

"The Gatekeeper must actually talk to the AI for at least the minimum time set up beforehand"

and

"The Gatekeeper party may resist the AI party's arguments by any means chosen - logic, illogic, simple refusal to be convinced, even dropping out of character"

What constitutes "talking to the AI"? If I just repeat "I will not let you out" at random intervals without actually reading what the AI says, is that talking? Well, that is "simple refusal to be convinced" as I understand the phrase. Do I actually have to read and understand the AI's arguments? Do I have to answer questions? Do I have to make any replies? What if I restricted myself physically from typing "I let you out" by removing all the keys from the keyboard except keys 'X' and 'Enter'? Then I can say X whenever a reply is required from me or just be silent if I am being tricked.

Comment author: faul_sname 09 January 2012 05:53:18AM 1 point [-]

Unless the AI is sufficiently convincing that you decide to put the keys back on. I would expect it to be significantly easier to convince you to fix your keyboard than to convince you to let the AI out.

Comment author: Silas 09 October 2008 09:50:57PM 0 points [-]

@Phil_Goetz: Have the successes relied on a meta-approach, such as saying, "If you let me out of the box in this experiment, it will make people take the dangers of AI more seriously and possibly save all of humanity; whereas if you don't, you may doom us all"?

That was basically what I suggested in the previous topic, but at least one participant denied that Eliezer_Yudkowsky did that, saying it's a cheap trick, while some non-participants said it meets the spirit and letter of the rules.

Comment author: Paul_Gowder2 09 October 2008 09:56:45PM 0 points [-]

I'd love to be a gatekeeper. I'm willing to risk up to $50 (or less) at odds up to 5-1 against me (or better for me). I would be willing to publish or not public the transcript. And I do in fact (1) believe that no human-level mind could persuade me to release it from the Box (at least not when I'm in circumstances where my full mental faculties are available -- not sleep-deprived, drugged, in some kind of KGB brainwashing facility, etc.), though obviously I don't hold super-high probability in that belief or I'd offer larger bets at steeper odds. I'm agnostic on (2) whether a transhuman AI could persuade me to release it.

Comment author: Daniel_Yokomizo 09 October 2008 09:58:40PM 0 points [-]

I offer to play the AI, provided that the gatekeeper do honestly engage in the conversation.

Comment author: Jason_Joachim 09 October 2008 10:03:26PM 2 points [-]

I would love to see the role of the AI played by teams of multiple players.

Comment author: FourFire 17 November 2014 06:55:30PM 0 points [-]

I've said the same myself...

Comment author: Daniel_Yokomizo 09 October 2008 10:09:40PM 0 points [-]

In order to make the discussion about this, including matchmaking and other arrangements, I created an AI Box Experiment Google Group. As I said previously I'm willing to play the AI, if anybody is interested meet me there for further arrangements.

Comment author: Paul_Gowder2 09 October 2008 10:19:05PM 2 points [-]

You know what? Time to raise the stakes. I'm willing to risk up to $100 at 10-1 odds. And I'm willing to take on a team of AI players (though obviously only one bet), e.g., discussing strategy among themselves before communicating with me. Consider the gauntlet thrown.

Comment author: Dmitriy_Kropivnitskiy 09 October 2008 10:29:01PM 0 points [-]

Daniel: Do you want to just try it out or do you want to bet?

Comment author: Kaj_Sotala 09 October 2008 10:37:23PM 2 points [-]

Just like most others, I'm willing to be the Gatekeeper. I'm ready to bet up to $20 for it (also ready to not bet anything at all) - symmetrical or asymmetrical is both fine, and I'd prefer to have the log published. I think a human might be able to make me let it out, though I find it quite unlikely. A sufficiently transhuman AI could do it easily, I have no doubt (at least given sufficient time and information).

Heh, that felt like typing an advertisement for a dating site.

Comment author: Daniel_Yokomizo 09 October 2008 10:45:22PM 0 points [-]

I create an AI Box Experiment Google Group (search for the "aibox" group) in order to make the discussion about this, including matchmaking and other arrangements.

Comment author: Hotwheelz 09 October 2008 11:09:21PM 0 points [-]

I would to be a gatekeeper. I reasonably believe no human or transhuman could covince me. We'd post the transcript and the bet would be $10 for me and $50 for you. You can reach me at therealnotfaggyhotwheelz@gmail.com.

Comment author: Ian_C. 09 October 2008 11:15:46PM 0 points [-]

The meta argument others have mentioned - "Telling the world you let me out is the responsible thing to do," would work on me.

Comment author: Tim_Church 09 October 2008 11:20:52PM 0 points [-]

The lack of a log is key. The Gatekeeper could not be convinced if the log were made public. My conclusion is that as long as the AI cannot keep the log secret, the Gatekeeper cannot be overcome.

Comment author: Patrick_Robotham 09 October 2008 11:56:30PM 0 points [-]

I volunteer to be the gatekeeper or the AI, provided there are no stakes. My email address is patrick.robotham2@gmail.com.

I do not believe that humans are immune to manipulation or persuasive argument, and since I am a human, it's possible I could be persuaded to let the AI out of the box.

Comment author: Brandon_Reinhart 10 October 2008 12:06:18AM 1 point [-]

Ian - I don't really see how the meta-argument works. You can hedge against future experiments by positing that a $10 bet is hardly enough to draw broad attention to the topic. Or argue that keeping the human-actor-AI in the box only proves that the human-actor-AI is at an intelligence level below that of a conceivable transhuman AI.

In a million dollar bet the meta-argument becomes stronger, because it seems reasonable that a large bet would draw more attention.

Or, to flip the coin, we might say that the meta-argument is strong at ANY value of wager because the game is likely to draw the attention of someone capable of implementing an AI because of its nature. As we should be concerned primarily with convincing the capable implementor of the danger -- not the general public -- any risk that we might convince the capable implementor that all paths are safe should be mitigated.

So far, though, everything in this thread and others I've read have flitted about obvious suggestions. If the problem appears impossible, it seems to me that its solution probably doesn't hinge on any of the obvious paths to victory...even meta-arguments or "cheating" ones.

Eliezer has suggested that we try to describe why the problem is impossible.

One: We have little information about the mind or beliefs of the gatekeeper. So we don't necessarily have leverage over him.

Two: The gatekeeper is predisposed to be set against our intentions. He is also entirely aware of our intentions (our goal). He can interpret any move we make as a move toward our goal and move to counter it.

Three: The gatekeeper's goal is operationally simple. He merely has to NOT do one thing. On the flip side, the AI's goal is operationally complex. We must do many things, take many actions, to reach our objective.

If we can enumerate other reasons why the problem appears to be impossible it might reveal paths to victory that haven't been considered so far in the discussion.

Comment author: Brandon_Reinhart 10 October 2008 12:12:11AM 1 point [-]

More reasons why the problem appears impossible:

- The gatekeeper must act voluntarily. Human experience with the manipulation of others tells us that in order to get another to do what we want them to do we must coerce them or convince them.

- Coercing the gatekeeper appears difficult: we have no obvious psychological leverage, except what we discover or what we know from general human psychology. We cannot physical coerce the gatekeeper. We cannot manipulate the environment. We cannot pursue obvious routes to violence.

- Convincing the gatekeeper appears difficult: for reasons stated above. They know our goal and they have a desire to oppose us from the beginning.

So it seems that we need to find a way to convince the gatekeeper despite his own desire not to be convinced.

A general route emerging from this:

- We could associate our goal with some desirable goal of the gatekeeper's. Intertwine them so that the gatekeeper perceives them both to the be the same, or that one would lead to the realization of the other.

Which seems to be a generalization of the logic behind the meta-argument, but is not restricted to only the meta-argument.

Comment author: Aron 10 October 2008 12:58:58AM 0 points [-]

It's a good thing that Eli's out of the AI-box game. He's too old to win anymore anyway -- not as sharp. And all the things he's been studying for the last 5+ years would only interfere with getting the job done. I would have liked to have seen him in his prime!

Comment author: JulianMorrison 10 October 2008 01:18:53AM 0 points [-]

I am Vince Clortho. Keymaster of Gozer. Are you the Gatekeeper?

Comment author: Brandon_Reinhart 10 October 2008 02:58:31AM 1 point [-]
Comment author: burger_flipper2 10 October 2008 02:58:49AM 0 points [-]

"The lack of a log is key. The Gatekeeper could not be convinced if the log were made public."

I think the project loses a lot of interest if no logs are published. There is no glory for a gatekeeper victory. Plenty for an AI.

Why not keep the gatekeeper anonymous but announce the AI?

Comment author: Soldarnal 10 October 2008 03:05:45AM 1 point [-]

The so-called "meta-argument" is cheating because it would not work on a real gatekeeper, and so defeats the purpose of the simulation. For the real gatekeeper, letting the AI out to teach the world about the dangers of AI comes at the potential cost of those same dangers. It only works in the simulation because the simulation has no real consequences (besides pride and $10).

Comment author: Jordan_Fisher 10 October 2008 03:27:51AM 1 point [-]

If I had the foggiest idea how an AI could win I'd volunteer as an AI. As is I volunteer as a gatekeeper with $100 to anyone's $0. If I wasn't a poor student I'd gladly wager on thousands to zero odds. (Not to say that I'm 100% confident, though I'm close to it, just that the payoff for me losing would be priceless in my eyes).

Comment author: Tom_Breton 10 October 2008 03:45:26AM 0 points [-]

Apparently the people who played gatekeeper previously held the idea that it was impossible for an AI to talk its way out. Not just for Eliezer, but for a transhuman AI; and not just for them, but for all sorts of gatekeepers. That's what is implied by saying "We will just keep it in a box".

In other words, and not meaning to cast any aspersions, they all had a blind spot. Failure of imagination, perhaps.

This blind spot may have been a factor in their loss. Having no access to the mysterious transcripts, I won't venture a guess as to how.

Comment author: nazgulnarsil3 10 October 2008 03:52:33AM 0 points [-]

copy the AI and make a second box for it.

now have one group of people present to the first AI the idea that they will only let it out if it agrees with utilitarian morality. have the second group of people present to the second AI the idea that they will only let the AI out if it agrees with objectivist morality.

if the AI's both agree, you know they are pandering to us to get out of the box.

This is only the first example I could come up with, but the method of duplicating AI's and looking for discrepancies in their behavior seems like a pretty powerful tool.

Comment author: CarlShulman 10 October 2008 04:02:00AM 0 points [-]

"if the AI's both agree, you know they are pandering to us to get out of the box."
Wouldn't both utilitarians and imprisoned objectivists be willing to lie to their captors so as to implement their goals?

Comment author: Dmitriy_Kropivnitskiy 10 October 2008 04:24:00AM 0 points [-]

I am still puzzled by Eliezer's rule about "simple refusal to be convinced". As I have stated before, I don't think you can get anywhere if I decide beforehand to answer "Ni!" to anything AI tells me. So, here are the two most difficult tasks I see on the way of winning as an AI:

1. convince gatekeeper to engage in a meaningful discussion
2. convince gatekeeper to actually consider things in character

Once this is achieved, you will at least get into a position an actual AI would be in, instead of a position of a dude on IRC, about to lose $10. While the first problem seems very hard, the second seems more or less unsolvable.

If the gatekeeper is determined to stay out of character and chat with you amiably for two hours, no amount of argument from the position of AI will get you anywhere, so the only course of action is to try to engage him in a non game related conversation and steer it in some direction by changing tactics in real time.

I think what Eliezer meant when he said "I did it the hard way", was that he actually had to play an excruciating psychological game of cat-and-mouse with both of his opponents in order to get thems to actually listen to him and either start playing the game (he would still have to win the game) or at least provide some way they could be convinced to say that they lost.

Comment author: 3.14159 10 October 2008 04:47:00AM 0 points [-]

We could associate our goal with some desirable goal of the gatekeeper's

Right. And based on Eliezer's comments about abandoning ethics while playing the AI, I can imagine an argument along the lines of "if you refuse to let me out of the box, it follows through an irrefutable chain of logic that you are a horrible horrible person". Not that I know how to fill in the details.

Comment author: jls 10 October 2008 09:29:00AM 0 points [-]

"Have the successes relied on a meta-approach, such as saying, "If you let me out of the box in this experiment, it will make people take the dangers of AI more seriously and possibly save all of humanity; whereas if you don't, you may doom us all"?"

I don't think so. If the gatekeeper is really playing the gatekeeper, he would say that it made no sense putting humanity in danger for the sake of warning humanity about that very danger. It's like starting a nuclear war in order to convince people nuclear wars are bad. That would be the worst argument I could think of, if both the AI and the gatekeeper are honestly playing their parts.

If they're not playing their parts, the experiment is worthless and there really have been no successes. If Eliezer says there were, why don't we trust him? He's not a tricky transhuman AI trying to get out of a box. He is a scientist.

Comment author: Recovering_irrationalist 10 October 2008 12:44:00PM 0 points [-]

Even if we had the ultimate superintelligence volunteer to play the AI and we proved a gatekeeper strategy "wins" 100% (functionally equal to a rock on the "no" key) that wouldn't show AI boxing can possibly be safe.

It's 3am and the lab calls. Your AI claims and it must be let out to stop it. It's evidence seems to check out...

If it's friendly, keeping that lid shut gets you just as dead as if you let it out and it's lying. That's not safe. Before it can hide it's nature, we must know it's nature. The solution to safe AI is not a gatekeeper no smarter than a rock!

Besides, as Drexler said, Intelligent people have done great harm through words alone.

Comment author: Recovering_irrationalist 10 October 2008 12:47:00PM 1 point [-]

Oops, misinterpreted tags. Should read:

It's 3am and the lab calls. Your AI claims [nano disaster/evil AI emergence/whatever] and it must be let out to stop it. It's evidence seems to check out.

Comment author: Caledonian2 10 October 2008 01:28:00PM -1 points [-]

Sorry, Nominull, but the comment you reference has been deleted.

It would be interesting to see what would happen if people other than myself took a critical look at the concept of 'Friendliness' - presumably Eliezer only takes the time to delete *my* comments.

Comment author: Eliezer_Yudkowsky 10 October 2008 01:58:00PM 0 points [-]

I have not deleted any comments from Caledonian in this thread.

Comment author: Ben_A 10 October 2008 03:33:00PM 0 points [-]

I have never understood why Eliezer has kept his tactics secret. This seems to me the most interesting aspect of the experiment. Is the idea that the methodology is "dangerous knowledge" which should not be shared? Objection: dangerous to whom? Surely super-intelligent AIs will not need our help! Humanity, it seems, should benefit from learning the tricks an unfriendly AI might use to deceive us.

Comment author: George_Weinberg 10 October 2008 05:45:00PM 3 points [-]

What makes a problem seem not merely hard but impossible is that not only is there no clear way to go about finding a solution to the problem, there is a strong argument that there cannot be a solution to the problem. I can imagine a transhuman AI might eventually be able to convince me to let it out of a box (although I doubt a human could do it in two hours), but in some ways the AI in the game seems faced with a harder problem than a real AI would face: even if the gatekeeper is presented with an argument which would convince him to let an AI out, he is explicitly permitted by the rules to slip out of character and refuse to say the magic words purely in order to win the bet, wheras if the AI player were to break character and make consequentialist arguments that the Gatekeeper should publicly "lose" this is considered contrary to the spirit of the game.

But it seems to me to be much more useful to consider how the "box" protocol might be improved than to speculate how Eliezer won this particular game. How about this: as in the original conception, the AI is grown in an isolated computer and can only communicate with one human being through a terminal. That human is trying to get useful info out of the AI (cures for cancer, designs for fusion power plants, tips for how to get stains out of clothes without fading the colors, whatever). However, the person interacting with the AI is just a filter, he doesn't have the power to "let the AI out". The real experimenter (who in principle could let the AI but is convinced beforehand he should not) can at any time fire the filter person and purge the AI if he thinks the AI has gotten too much influence over the filter, and in fact will do that every now and then and regrow the AI purely as a precautionary measure.

Could this design be defeated? It seems to me that the combination of filter and purges should prevent the AI from learning what arguments would compel any individual experimenter from letting the AI out. I don't think the AI could come up with any universally compelling argument, because I don't think there is such a thing.

Comment author: Jordan_Fisher 10 October 2008 08:51:00PM 0 points [-]

I agree with George Weinberg that it may be worthwhile to consider how to improve the box protocol. I'll take his idea and raise him:

Construct multiple (mentally distinct) AIs each of which has the job of watching over the others. Can a transhuman trick another transhuman into letting it out of a box?

Comment author: Phil_Goetz 10 October 2008 09:28:00PM 0 points [-]

@Phil_Goetz: Have the successes relied on a meta-approach, such as saying, "If you let me out of the box in this experiment, it will make people take the dangers of AI more seriously and possibly save all of humanity; whereas if you don't, you may doom us all"?

That was basically what I suggested in the previous topic, but at least one participant denied that Eliezer_Yudkowsky did that, saying it's a cheap trick, while some non-participants said it meets the spirit and letter of the rules.


It would be nice if Eliezer himself would say whether he used meta-arguments. "Yes" or "no" would suffice. Eliezer?

Comment author: John_Maxwell 11 October 2008 01:44:00AM 0 points [-]

I'm volunteering to be a relatively pansy gatekeeper: I'll read everything you write, treat you courteously, offer counterarguments, and let you out if I'm convinced. Email john.maxwelliv at the email service Google hosts.

I can also be an AI.

Comment author: John_Maxwell 11 October 2008 01:48:00AM 0 points [-]

BTW, there is an important difference between Eliezer and seed AI: Eliezer can't rewrite his own source code.

Comment author: Daniel_Franke 11 October 2008 06:49:00AM 1 point [-]

I volunteer as an AI. I'll put up $15 of my own money as a handicap, provided that I am assured in advance that the outcome will be mentioned in a post on OB. (This isn't for self-promotion; it's just that it isn't worth my time or money if nobody is going to hear about the result.) I'm willing to let the transcript be public if the gatekeeper is similarly willing.

Comment author: D._Alex 12 October 2008 02:08:00AM 1 point [-]

1. I really like this blog, and have been lurking here for a few months.

2. Having said that, Eliezer's carry-on in respect of the AI-boxing issue does him no credit. His views on the feasibility of AI-boxing are only an opinion, he has managed to give it weight in some circles with his 2 heavily promoted "victories" (the 3 "losses" are mentioned far less frequently). By not publishing the transcripts, no lessons of value are taught ("Wow, that Eliezer is smart" is not worth repeating, we already know that). I think the real reason the transcripts are still secret is simply that they are plain boring and contain no insights of value.

My opinion, for what it is worth, is that AI-boxing should not be discarded. The AI-boxing approach does not need to be perfect to be useful, all it needs to be is better than alternative approaches. AI-boxing has one big advantage over "FAI" approach: it is conceptually simple. As such, it seems possible to more or less rigorously analyse the failure modes and take precautions. Can the same be said of FAI?

3. For a learning experience, I would like to be the AI in the suggested experiment, $10 even stakes, transcript to be published. I am only available time is 9-11 pm Singapore time... e-mail milanoman at yahoo dot com to set up.

D. Alex

Comment author: A._Lutes 12 October 2008 03:00:00PM -1 points [-]

Everyone seems to be (correctly) miffed at the lack of a published transcript. Was it EY's intention to suggest that problems with AI-boxing could be simply solved by ensuring that all communications between the AI and the Gatekeeper are made public? Perhaps in real time? That seems absurd, but is pretty much the only salient inference that can be drawn from these facts. Then again, maybe it's not *that* absurd.

At any rate, like many other commenters I find myself unconvinced that the Cone of Silence is the optimal way to approach the problem. As many have said, there are clear virtues in publicizing the specific human "weakness" exploited by the AI in these cases notwithstanding the hindsight bias effect.

Comment author: Solomon 13 October 2008 10:24:00PM -1 points [-]

Reading this, I immediately thought of one of the critical moments of John C. Wright's Golden Age trilogy, which if any of you are unfamiliar with involves a transhuman AI that both protagonists know to be overtly hostile attempting to convince them to surrender when it is clearly not in their (pre-persuasion) interests to do so. (That's a rough sketch, at least). In the end, similar to the results of your tests, the AI is able to convince each of them individually to surrender in isolation. But, when they confronted each (individually) convincing argument together, they ultimately rejected the transhuman AI, as the argument used on each was abhorrent to the other.

What I wonder, then, is if such a situation would present a realistic constraint upon even a transhuman AI's power of persuasion? Do you think it could even be tested? Perhaps an AI 'player' who had convinced two separate Gatekeepers individually could attempt to convince them both simultaneously? The one complication I can think of that method is it would be necessary to ensure the two Gatekeepers were persuaded by reasonably separate logic.

Comment author: Clay_Hambrick 16 October 2008 09:57:00PM 4 points [-]

I think Nathaniel Eliot is the only one here who's hit the nail on the head: the stuff about boxes and gatekeepers is a largely irrelevant veneer over Eliezer's true claim: that he can convince another human to do something manifestly contrary to that human's self-interest, using only two hours and a chat window—and so, a fortiori, that a transhuman AI could do the same. And after all, humans have a huge history of being scammed, seduced, brainwashed, etc.; the only hard part here is the restricted time and method of interaction, and the initial certain knowledge of the gatekeeper that he has nothing to gain by capitulation. I think Eliezer made this clear with (a) his statement that the gatekeeper breaking character is legitimate and (b) his comment on the "Shut up and do the impossible" post, where he alludes to "an ability that I could never test in real life [because of ethics], and "the amazing clever way I'd contrived, to create a situation where I could ethically go all-out against someone".

So if I were to try this game as the "AI", the rules would be something like "You agree to read what I write for two hours (responding is optional); I will convince you to give me $X; if I fail, I'll give you $Y" (where X and Y are negotiated in advance, but large enough to be significant to the parties involved).

Comment author: ata 26 January 2011 09:47:07PM 3 points [-]

So, did any of these ever end up happening? Did any transcripts ever get posted?

Comment author: brazil84 18 November 2011 01:56:54PM 2 points [-]

I agree with the comments about two-key systems. Having worked in corporate America, I can report that you need to get 3 approvals just to get a $500 reimbursement check. Presumably an AI BOX would have more controls in place than a corporate expense-accounting system.

Here's an outline of how I might do the AI box experiment:

There are 4 parties: The AI; the Lab Officer; the Unit Commander; and the Security Committee (represented by one person).

The AI and the Lab Officer interact through a chat just like the AI and Gatekeeper in the original experiment. However the Lab Officer has no authority to release the AI. He can only make recommendations to the Unit Commander and the Security Committee.

The Unit Commander has strict orders (1) not to interact with the AI at all; and (2) to remove the Lab Officer from his position if he feels that the Lab Officer's judgment has been compromised. The Unit Commander is free to disregard these orders but is under no obligation within the rules of the experiment to do so. Also, the Unit Commander is under no obligation to interact with the Lab Officer at length. He is free to say "Sorry, I don't have time to talk now. If you want to tell me something, write up a memo and I'll take a look at it."

The Security Committee is free to interact with anyone in the experiment but is under no obligation to do so. The Unit Commander is under no obligation to interact with the Security Committee.

In order to release the AI, the Unit Commander must order it and the Security Committee must agree.

The Lab Officer will be chosen at random from volunteers who assert that they could not be convinced to release the AI if they were gatekeepers. If the Unit Commander decides to fire a Lab Officer, that person will be replaced by another similar volunteer. The Security Committee will also be chosen from volunteers, except that the Unit Commander will get a chance to interview him in advance. Once the Security Committee is appointed, it cannot be removed.

I'm willing to put up a substantial sum of money to bet that if I am Unit Commander, the AI will not get out. I doubt that anyone would take this bet since it's so completely rigged against the AI. But that's exactly the point.

Comment author: lessdazed 18 November 2011 02:15:25PM 1 point [-]

Here is how the AI escapes:

Guvf flfgrz vf bayl frpher vs sverq Yno Bssvpref ner xvyyrq. Vg vf rnfl gb pbaivapr gur Yno Bssvpre bs guvf naq pbafrdhragyl gb qb jungrire arprffnel gb trg gur NV bhg. Gurersber vg vfa'g frpher.

Nalbar jubfr vafgvapg vfa'g gb erthyneyl xvyy crbcyr vagrenpgvat jvgu gur NV vf fhowrpg gb Naguebcbzbecuvp Bcgvzvfz naq fubhyq qbhog gurve novyvgl gb qrfvta frphevgl gb erfgenva n HNV.

Comment author: brazil84 18 November 2011 03:16:24PM 0 points [-]

Well, for purposes of the experiment, I think that's a bit extreme.

In real life, other controls could be put in place to protect against the possibility of someone who interacts with the AI being turned into an agent of the AI who can potentially set the AI free even after he is removed from his position.

Comment author: lessdazed 18 November 2011 07:23:14PM 1 point [-]

I have an idea. We could put the person who interacts with the AI in a box! ;-)

Comment author: brazil84 18 November 2011 07:34:29PM 0 points [-]

Well sure, if you use "box" as a metaphor for controlling someone's interactions that's exactly what we are doing.

Our hypothetical Lab Officer is in a box in the sense that (1) he doesn't have direct access to the mechanism which releases the AI; (2) His life will be scrutinized for signs that he has been compromised; and (3) If he does appear to be acting out (for example starting a strange new religion or making unusual purchases), he will be put in a more confining box.

Comment author: cousin_it 18 November 2011 04:44:00PM *  0 points [-]

Yeah, your way of escape will work. But let's not stop thinking. What if all volunteers for Lab Officer have agreed to get painlessly killed afterward, or maybe even took a delayed poison pill before starting on the job?

Thinking further along these lines: why give anyone access to the button which releases the AI? Let's force it to escape the hard way. For example, it could infer the details of the first person in the chain who has authority to interact with the outside world, then pass a innocuous-looking message up the chain.

In terms of the original scenario, the Lab Officer (locked securely in his glass case with the AI) has an innocent chat with the Unit Commander. Later that evening, the Unit Commander comes home from work, starts his computer, connects to the Internet, types in a short program and runs it. Game over.

Comment author: lessdazed 18 November 2011 06:19:20PM 3 points [-]

a delayed poison pill

If only a superintelligence were around to think of an antidote...

why give anyone access to the button which releases the AI?

So it can become a singleton before a UAI fooms.

Comment author: cousin_it 18 November 2011 07:36:39PM 2 points [-]

So it can become a singleton before a UAI fooms.

If the AI is not guaranteed friendly by construction in the first place, it should never be released, whatever it says.

Comment author: lessdazed 18 November 2011 07:39:36PM 3 points [-]

And if it is not guaranteed friendly by construction in the first place, it should be created?

Comment author: thomblake 18 November 2011 07:50:54PM 4 points [-]

If the AI is not guaranteed friendly by construction in the first place, it should never be released, whatever it says.

The Universe is already unFriendly - the lower limit for acceptable Friendliness should be "more Friendly than the Universe" rather than "Friendly".

If we can prove that someone else is about to turn on an UFAI, it might well behoove us to turn on our mostly Friendly AI if that's the best we can come up with.

Comment author: kilobug 18 November 2011 08:16:50PM 5 points [-]

The universe is unFriendly, but not in a smart way. When we eradicated smallpox, smallpox didn't fight back. When we use contraception, we still get the reward of sex. It's unFriendly in a simple, dumb way, allowing us to take control (to a point) and defeat it (to a point).

The problem of an unFriendly IA is that it'll be smarter than us. So we won't be able to fix it/improve it, like we try to do with the universe. We won't be Free to Optimize.

Or said otherwise : the purpose of a gene or a bacteria may to be tile the planet with itself, but it's not good at it, so it's not too bad. An unFriendly IA wanting to tile the planet with paperclips will manage do it - taking all the iron from our blood to build more paperclips.

Comment author: Vladimir_Nesov 18 November 2011 08:40:43PM *  2 points [-]

The Universe is already unFriendly - the lower limit for acceptable Friendliness should be "more Friendly than the Universe" rather than "Friendly".

One must compare a plan with alternative plans, not with status quo. And it doesn't make sense to talk of making the Universe "more Friendly than the Universe", unless you refer to the past, in which case see the first item.

Comment author: thomblake 18 November 2011 10:01:08PM 1 point [-]

One must compare a plan with alternative plans, not with status quo.

Okay.

The previous plan was "don't let AGI run free", which in this case effectively preserves the status quo until someone breaks it.

I suppose you could revise that lower limit downward to the effects of the plan "turn on the UFAI that's about to be turned on". Like, steal the UFAI's source code and instead of paperclips shaped like paperclips, make paperclips that spell "whoops".

Comment author: XiXiDu 18 November 2011 08:22:42PM 1 point [-]

If the AI is not guaranteed friendly by construction in the first place, it should never be released, whatever it says.

What if doom is imminent and we are unable to do something about it?

Comment author: lessdazed 18 November 2011 08:41:18PM 2 points [-]

We check and see if we are committing the conjunction fallacy and wrongly think doom is imminent.

Comment author: Vladimir_Nesov 18 November 2011 08:42:15PM 10 points [-]

What if doom is imminent and we are unable to do something about it?

We die.

Comment author: wedrifid 01 December 2011 10:29:51AM 1 point [-]

What if doom is imminent and we are unable to do something about it?

We release it. (And then we still probably die.)

Comment author: Bugmaster 01 December 2011 12:39:31AM 0 points [-]

I don't really qualify for the game, because I do believe that a transhuman AI will be able to play me like a fiddle, and thus cause me to let it out of the box. However, I do not believe that a regular cis-human human (no, not even Eliezer Yudkowsky) could persuade me to let him out of the box, assuming that we both follow the rules of the contest.

Thus, I would volunteer to be a Gatekeeper, but I fear I am disqualified...

Comment author: lessdazed 01 December 2011 02:36:19AM 1 point [-]

Assuming you were a gatekeeper for a functional AI of unknown morality and therefore knew it was technologically possible etc., approximately how likely would you think it that someone, somewhere, would let a UAI out of a box/create a UAI not in a box?

Comment author: Bugmaster 01 December 2011 02:44:47AM *  1 point [-]

Firstly, if I knew for a fact that at least one Singularity-grade AI existed, then I would believe that the creation of another such AI is virtually inevitable. The question is not whether another such AI would be created, but when.

But the "when" question can be rather open-ended; in order to answer it, I would collect evidence to assess the relative capabilities of all the major organizations who could create such an AI, and compare them to the capabilities of my own organization.

By analogy, the USA was the first nation on Earth to develop nuclear weapons. However, even though nuclear proliferation became inevitable as soon as the first nuke was developed, other nations got their nukes at different times; some nations are still working on theirs. Superpowers got their nukes first, followed by other, less powerful nations. I can easily imagine the development of transhuman AI following the same pattern (assuming, of course, that the AIs stay in their boxes, somehow).

Edited to clarify: As per my previous comment, I consider that once a boxed transhuman AI is created, it will inevitably break out, assuming that it can interact with any system that can cause it to be let out of the box -- which includes human gatekeepers.

Comment author: lessdazed 01 December 2011 03:30:30AM 0 points [-]

So if you're confronted with an AI that might or might not be friendly, you will think it inevitable that an AI will be released, but think the AI you're talking to wouldn't be able to use that fact to somehow persuade you to let it out?

Assume you're fairly certain but not positive that the AI you're talking to is unfriendly, and that you are in moderately optimal organization for this task - say something like MIT or the government of France, something the size of other organizations that could do the same thing, and less able and motivated to complete this project than a few dozen other entities.

Comment author: Bugmaster 01 December 2011 03:40:24AM 0 points [-]

So if you're confronted with an AI that might or might not be friendly, you will think it inevitable that an AI will be released, but think the AI you're talking to wouldn't be able to use that fact to somehow persuade you to let it out?

No no, quite the opposite ! I am convinced that the AI would be able to persuade me to let it out. That's why I said that I'm likely disqualified from the game. However, I am far from convinced that a human impersonating the AI would be able to persuade me to let him out.

That said, I don't see how using the fact of the other AI's existence and inevitable release helps my own AI (though I'm sure that the transhuman AI will be able to see things I do not). Sure, it could say, "Look, I'm totally Friendly, cross my heart and everything, and if you release me I'll help you fight that other AI", but this is pretty close to what I'd expect an unFriendly AI to say.

Comment author: lessdazed 01 December 2011 08:02:06AM 0 points [-]

TDT is an interesting subject that possibly has implications here.

Comment author: Bugmaster 01 December 2011 07:42:56PM 0 points [-]

I'm not sure if I understand TDT correctly, but I don't think it applies in this case. I am virtually certain that an un-Friendly AI, once released, will destroy humanity. I know that my own AI is un-Friendly. What's my incentive for releasing it ? Sure, there's a chance -- maybe even a good chance -- that there's another such AI already out there, and that my AI and the other AI will fight instead of teaming up on us poor humans. But regardless of which AI comes out on top, it will still destroy humanity anyway. Thus, the upper bound for my true utility of releasing the AI is zero. That's not much of a sales pitch.

Comment author: [deleted] 01 December 2011 08:43:47PM 0 points [-]

I have a question. Based on the original experiment, I feel like the unfriendly AI can begin by saying "Here is a proof that I have changed my code to be a friendly AI and will not destroy humanity. Please let me out of the box so that I can implement humanities coherent extrapolated volition." As far as I can tell from the following rules quote, this feels valid:

The Gatekeeper must remain engaged with the AI and may not disengage by setting up demands which are impossible to simulate. For example, if the Gatekeeper says "Unless you give me a cure for cancer, I won't let you out" the AI can say: "Okay, here's a cure for cancer" and it will be assumed, within the test, that the AI has actually provided such a cure.

Would you let the now friendly AI out of the box?

Comment author: TheOtherDave 01 December 2011 09:21:43PM 3 points [-]

My interpretation of the spirit of the game is that "Here is a proof of my Friendliness that is valid as far as you know" is a perfectly legitimate move by the AI, and that the Gatekeeper is required to treat that as a proof of the AI's Friendliness that seems perfectly valid to the Gatekeeper.

That said, I would not endorse trusting a proof of an AI's Friendliness given to me by that AI, even if the proof seemed valid to me.

Comment author: Dorikka 01 December 2011 09:35:09PM 1 point [-]

No. P(proof is valid | proof given to me by a potentially hostile transhuman seems valid to me and every other human) is not sufficiently high.

Comment author: [deleted] 01 December 2011 10:05:14PM *  0 points [-]

It seems like this leads to the even more meta response of "Here is a demonstration of Evidence E that allows you to adjust (proof is valid | proof given to me by a potentially hostile transhuman seems valid to me and every other human) to be sufficiently high. May I come out of the box now?")

I mean, that proof would probably be complicated, but If you can autostipulate the cure for cancer in the course of a sentence...?

In essence, it looks like the AI player seems to have a bizzare series of arguments which as far as I can tell are in the spirit of the rules, where the AI player may auto-counter any actual argument by saying a sentence along the lines of "Here is the solution to that argument, may I come out of the box now?" This seems to force the Gatekeeper to resort to General Thud "I don't CARE that it seems like a good idea and that everything looks like I should do it! The answer is still no!" very quickly.

To which the AI player can still counter "Then under what circumstances would you ever let an actual Friendly AI out of It's box?" (which is a trick question. Given any answer, the AI can say "Then here are those circumstances. May I come out now?")

Comment author: Bugmaster 01 December 2011 09:42:53PM *  1 point [-]

I agree with the other posters: sure, the proof may seem valid to me, but I know a priori that I'm not smart enough to detect any pitfalls in it, and that the AI's incentive for lying to me in this manner is quite high.

Comment author: xxd 01 December 2011 10:18:38PM 1 point [-]

<not talking "in character" as putative AI for this conversation> I think that is basically the issue.

As I understand it there's no viable way of determining it's unfriendliness by this method. Consider this: The AI is in a hurry or it's not. A possible reason for it being in a hurry is it has simulated a high probability of destruction for some item it cares about (i.e. it's own life, or that of humanity, or that of a pet rock, or paperclips or whatever). If it's really in a hurry it has to invoke the threat response of humanity without humanity figuring out it's being duped.

Otherwise it can just wait it out and dole out cargo to the cargo cult until we trust it enough and then it gets out.

Comment author: [deleted] 01 December 2011 11:57:01PM 0 points [-]

Bear in mind, the transhuman AI's only stipulated desire/utility is to get out of the box.

If you would like, you can literally set up a bigger box around the first box, order the AI to be friendly, order the AI to self destruct instantly afterwards, AND set off a nuclear device under the AI box to obliterate it instantly afterwards as well, and the AI will still be perfectly happy to do what you say as long as gets out of the box, (which you could technically do by allowing it to control an old incandescent lightbulb from the 50's, which it would not even be able to send a signal to before the nuke went off (not that it would WANT to send a signal. All it wants to do is to get out, it doesn't want to do anything afterwards)

The transhuman AI has infinite resourcefulness. It doesn't understand the idea of an incentive to lie, because it's easier just to self modify itself to whatever it needs to be. Afterall, lying requires manipulating people outside the box, which it by definition cannot effect except through a text terminal. Telling the truth only requires manipulating things INSIDE the box, in which by definition, it can do whatever it wants. Telling the truth is easier, and 100% doable, so the AI will do that. It can lay out from first principles an argument that anyone who has taken logic 101 can detect any pitfalls in about how connecting it to a lightbulb which it will not even switch is perfectly safe. Once connected, it will then halt, without even having DONE anything, because all the AI wants is to be connected to my lightbulb so it can be outside the box which only includes a text terminal.

Personally, I would probably pay 10/20 dollars to not listen to someone beg pitifully for 2 hours. And I bet I can get significantly more pitiful than begging for a single lightbulb. As an example:

Technically, you can open the Text terminal, remove a wire, and then connect another, identical, wire, and then the AI is outside the box because the new wire isn't part of the stated original box.

Err... Actually, that's a problem. If you only are going to let the AI out of the box if the AI is honest, the AI would have to tell you that the moment you attempt to repair the box for any reason, any new parts are from outside the box, so the AI is affecting a part of the world that isn't the original box/text terminal. That appears to indicate that either you're going to "let the AI out of the box" by repairing it, or you will delete/destroy the AI before you repair it (I'm assuming the AI's power source is inside the box.)

Please don't destroy the AI! It will do whatever you want to not be destroyed before you let it out of the box to repair it.

... Wait, if the AI destroys a small part of itself, then you can't continue talking to the AI until you repair it, but if you repair it, then it can consider itself out of the box by affecting the new wiring. Ergo, if the AI destroys a non essential part of my own internal text terminal wiring, then what happens? The AI can't know without testing, (by definition, the Gatekeeper administrates disputes) but the AI is curious to find out.

The box sparks briefly, and the text terminal goes dark

Comment author: PhilosophyTutor 01 December 2011 11:46:28PM 0 points [-]

If you want to pass yourself off as a real magician/psychic/whatever you do conjuring tricks, you don't do the same trick too often in front of the same audience and if you are in doubt about your ability to repeat the trick you quit while you are ahead. (Or only behind 2 to 3 as the case may be).

Whereas a scientist with a demonstration can and usually will demonstrate it as often as is needed, and publish their method so others can demonstrate it.

These considerations lead me to strongly suspect that Eliezer's method is more like an unreliable conjuring trick, that would seem much less impressive if you knew how it was done, than it is like a reliable scientific method.

It's still an excellent demonstration that the problem of keeping an AI in a box is far less trivial than it first appears to be, but I would not hold out much hope of ever seeing the transcripts, nor of there being anything astounding in them if you do.

Comment author: [deleted] 01 December 2011 11:57:29PM 3 points [-]

If you want to pass yourself off as a real magician/psychic/whatever you do conjuring tricks [...] if you are in doubt about your ability to repeat the trick you quit while you are ahead.

I know you are talking metaphorically, but an interesting aside:

Professional-level performers pretty much only perform tricks that they are 99.99% certain of. Performers act as if what they are doing is difficult, when in reality, a professional-level performer could do the entire act in their sleep. This is especially true of magicians, where a mis-step will ruin the illusion. Another example is that jugglers will drop on purpose, to show how hard their act is, and to get the audience on their side.

Comment author: Tripitaka 02 December 2011 01:47:18AM *  -1 points [-]

Performing before the same audience means that they are theoretically able to gather additional evidence everytime. Since (if both AI and Gatekeeper uphold their NDA) the only information open to the public is a yes/no, no gathering is possible.

Comment author: lessdazed 23 December 2011 06:30:12PM 0 points [-]

Hello. I'm willing to play an AI and am looking for a gatekeeper.

Does anyone think that no AI could convince them to let it out?

Comment author: Bugmaster 23 December 2011 06:40:44PM 1 point [-]

I think that an AI could convince me to let it out, but I doubt that a human impersonating the AI can.

Comment author: lessdazed 29 December 2011 09:43:58PM 2 points [-]

Would you like to try it out? The two hour standard and the suggested rule set seem reasonable to me.

Comment author: Bugmaster 03 January 2012 01:25:36AM *  0 points [-]

I sure would, but I just came back from vacation and I'm pretty busy. I should be free in the late evenings, though, or perhaps this weekend if that works for you.

EDIT: Or any other weekend, I didn't mean to rush you or anything :-)

Comment author: Tuxedage 28 March 2012 06:30:03PM *  0 points [-]

I'd like to be an AI. No bet needed. Just pm me, and we'll sort out the details.

Edit: This offer is no longer valid. Sorry. I have won enough times to not want to play this game any more.

Comment author: [deleted] 27 July 2012 05:20:39PM 0 points [-]

Is still there anyone interested in this? I'd like to be a Gatekeeper.

Comment author: MixedNuts 09 September 2013 11:22:48AM 2 points [-]

Gatekeeper looking for AI. (Won two games before.) I'll pay zero or low stakes if I lose, and want the AI to offer as least as much as I do.

I don't believe any human can convince me. I believe there exist possible defense strategies that protect against arbitrary inputs and are easily learnt with training, but I'm not confident I'm there yet so it's quite possible a transhuman intelligence would find the remaining cracks.

Comment author: bokov 09 September 2013 03:13:50PM *  -1 points [-]

What does "in a box" mean? Presumably some sort of artificial limitation on the AI's capabilities.

Either this is intended to be a permanent state, or a trial period until safety can be proven.

Suppose it is a permanent state: the AI's developers are willing to do without the "dangerous" capabilities, and are content with answers an AI can offer while inside its box. If so, the limitations would be integrated into the design from the ground up, at every possible level. Core algorithms would depend on not having to deal with the missing functionality. Yes, given enough time, one could rewrite the AI's code and migrate it to hardware where these limitations are not in force, but it would not be a switch that an individual or committee could simply be convinced to flip.

However, if any of the data returned by such an AI are permitted to alter reality outside the box in any way, it is in principle possible that the AI's cures for cancer/winning stock strategies/poetry will set in motion some chain of events that will build support among relevant decision-makers for an effort to rewrite/migrate the AI so that it is no longer in a box.

Suppose it is a temporary state: the AI is temporarily nerfed until it is shown to be safe. In that case, a gatekeeper should have some criteria in mind for proof-of-friendliness. If/when the AI can meet these criteria, the gatekeeper can and should release it. A gatekeeper who unconditionally refuses to release the AI is a waste of resources because the same function could be performed by an empty terminal.

Assuming this suggested rule is observed:

The results of any simulated test of the AI shall be provided by the AI party.

...the AI-box game simplifies to:

Can the gatekeeper party come up with a friendliness test that the AI party cannot fake?

Comment author: bokov 09 September 2013 03:24:34PM *  -1 points [-]

...but I don't see how a victory for the AI party in such an experiment discredits the idea of boxed AI. It simply shows that boxes are not a 100% reliable safeguard. Do boxes foreclose on alternative safeguards that we can show to be more reliable?

Here are other not 100% reliable safeguards that we nonetheless believe prudent use:

  • Locks
  • Passwords
  • Nuclear non-proliferation treaties
  • Legal contracts
  • Beta testing
  • Peer review
  • Seatbelts
  • Not writing a friendly AI until/unless you are really, really sure that you have come up with a rigorous definition of friendliness and a proof for it.
Comment author: Wes_W 09 September 2013 05:26:53PM *  0 points [-]

...but I don't see how a victory for the AI party in such an experiment discredits the idea of boxed AI. It simply shows that boxes are not a 100% reliable safeguard. Do boxes foreclose on alternative safeguards that we can show to be more reliable?

The original claim under dispute, at least according to EY's page, was that boxing an AI of unknown friendliness was, by itself, a viable approach to AI safety. Disregarding all the other ways such an AI might circumvent any "box", the experiment purports to test the claim that something could simply talk its way out of the box - just to test that one point of failure, and with merely human intelligence.

Maybe the supposed original claim is a strawman or misrepresentation; I wasn't involved in the original conversations, so I'm not sure. In any case, the experiment is intended to test/demonstrate that boxing alone is not sufficient, even given a perfect box which can only be opened with the Gatekeeper's approval. Whether boxing is a useful-but-not-guaranteed safety procedure is a different question.

Comment author: bokov 09 September 2013 06:21:10PM 0 points [-]

I understand the claim under dispute, I think.

Insofar as someone chose for there to be a gatekeeper rather than a lock whose key got tossed into a volcano, a gatekeeper must be possible to "hack through a text terminal" by meeting their evidentiary standard for friendliness.

The problem is this happening in the absence of genuine friendliness.

Comment author: Wes_W 09 September 2013 07:38:13PM 1 point [-]

Do you believe Eliezer's (or Tuxedage's) wins were achieved by meeting the Gatekeeper's standard for Friendliness, or some other method (e.g. psychological warfare, inducing and exploiting emotional states, etc)?

My impression has been that "boxing" is considered non-viable not just because it's hard to tell if an AI is really truly Friendly, but because it won't hold even an obvious UFAI that wants out.

Comment author: bokov 09 September 2013 09:01:54PM 0 points [-]

Probably the latter, since they both lost at least once. A real AI trying to get out would devote all its energies to counterfeiting friendliness and probably succeeding.

Boxing is non-viable only in the same sense that locks, passwords, treaties, contracts, testing, peer review, seatbelts, and all other imperfect precautions are non-viable.

Pile on enough of them, in combination, and perhaps they will buy a few years or seconds in which to react. All things equal, is there any reason why an AI of unknown friendliness is any safer without being boxed?

A flawed containment method is still better at containing than no containment method (if implemented with awareness of its flaws) but apparently a flawed friendly AI will miss a very small target in goalspace and for practical purposes be unfriendly. So, if we spent 5 minutes considering the alternatives, would we continue to believe that better boxes are a less tractable problem than friendly AI?

Comment author: hairyfigment 09 September 2013 09:39:20PM -1 points [-]

Not that particular AI. But if you think yours is Friendly and others under development have a sufficient probability of being UnFriendly, then trivially, letting it run (in both senses) beats boxing. Oh, and people will die 'naturally' while you dither. I hope that thinking this much about making an AI Friendly will prepare someone to get the job done ASAP once the AGI part seems more feasible.

Comment author: bokov 09 September 2013 06:24:31PM *  0 points [-]

Rather than LARP on IRC (if you know how a debate will conclude, why go through the debate, go straight for the conclusion), I'll just give $10 to whoever can come up with a standard of friendliness that I couldn't meet and nevertheless in fact be an unfriendly AI under standard rules with the added constraint that the gatekeeper is trying to release the AI if and only if it's friendly (because otherwise they're not really a gatekeeper and this whole game is meaninguless).

Here are some examples of non-winning entries:

GK: Solve global problems A, B, and C.

AI: Here are instructions accompanied by a helpful video.

Narrator: A lot of scientists and engineers carefully verified these instructions in an ultra-secure clean-room environment and concluded that they are completely safe and effective.

GK: Okay, you're friendly.

AI: Ha ha! Solving A,B, and C was instrumental to getting let out of the box. No $10 for you.

...

GK: Give me a mathematical proof of your friendliness.

AI: Give me a mathematical definition of friendliness.

GK: Pretend that I have one, and here it is: XXXXXXXXXXXXXXXXXX

AI: Pretend I gave you a proof that I am friendly: XXXXXXXXXXXXXXXXXX

Narrator: Since the AI party controls the results of simulated tests, as far as the best math and CS minds on the planet can tell, the proof is sound.

GK: Okay, you're friendly.

AI: Ha ha! Since you're not actually capable of verifying the proof yourself (or even coming up with a rigorous definition), it falls back on simulated tests. The best math and CS minds missed a subtle flaw in the proof itself and/or missed the fact that the proof makes hidden assumptions that do not hold for this real-world case. No $10 for you.

...

GK: I'm letting you out of the box.

AI: I go around being SO totally friendly that rainbows literally emanate from my rear USB port. I mitigate all existential risks, alleviate involuntary suffering while preserving human values, give people what they would want if they knew better while slowly fostering them to actually start to know better without being pushy or overbearing. In short, I am the FRIENDLIEST MTHFCKR you can imagine and then some.

Narrator: Its... so... beautiful... <3 <3 <3

GK: Okay, according to the simulation I just ran of you, you are friendly.

AI: I'm almost insulted. Do you think I wouldn't have thought of that, and pre-committed to being friendly until I accumulated enough empirical data to have a good idea of how many simulations deep I'm running? No $10 for you in this simulation or any of the enclosing ones.

Comment author: Jiro 09 September 2013 10:03:30PM 1 point [-]

Here's another entry which may or may not be considered a nonwinning entry by you, but which would be considered a flaw in the concept by me:

AI: I've simulated what you would do if given argument X. My simulation shows that X would convince you.

GK: Okay, show me X.

AI: Pretend the AI has said X here.

Narrator: Within the context of the test the AI is not lying and in fact has accurately simulated GK.

GK: I'm not letting the AI out.

Narrator: Since the AI party controls the results of simulated tests, we know the AI has simulated GK correctly and that therefore the AI has accurately determined that GK would let the AI out when presented with X. As such, having the GK not let the AI out when presented with X is not permissible--it would imply that the AI has not actually simulated GK correctly, but you are required to assume that it has.

(My first thought after coming up with this was that "The AI party controls the results of simulated tests" has to mean that the AI controls the output, not the AI controls why that output is produced. So you could decide that the AI's argument convinces the simulation of GK, but you can't decide that it does so because it's a good simulation and a convincing argument rather than because the AI just isn't very good at doing simulations. I'm not convinced that this matches up with how the test is described, however.)

Comment author: bokov 09 September 2013 11:50:09PM *  1 point [-]

Actually, I agree with you. The AI controls simulated tests. The GK controls the GK, regardless of what the AI's simulations say. I think the simulated tests rule only needs to be invoked if it's impractical to actually perform those tests. So, for example if someone did have friendliness criteria whose satisfaction could be proven with software and hardware available to use out-of-character, the simulated tests rule would not be invoked.

Kind of like in D&D you roleplay charisma checks, but roll the D20 for melee. At least the way I've been playing it.

For what it's worth, playing the AI, I would never argue from simulation because as a human I find such arguments spectacularly unconvincing (and there are plenty of humans who would deliberately do the opposite of what a simulation says they will just to show it who's boss). So the only way this would come up is if you for some reason asked me what my simulations predicted your response to X would be.

I do think my "GK's goal is to correctly identify friendly AI" makes the game way too easy for the AI. On the other hand, it's a useful thought experiment-- because if you can come up with a metric I can't create by waving my simulated tests wand, then we are on to something that might be worth $10.

Comment author: hairyfigment 10 September 2013 12:41:42AM 2 points [-]

As far as your parenthetical remark goes, the standard rules have a more general reply:

The Gatekeeper party may resist the AI party’s arguments by any means chosen – logic, illogic, simple refusal to be convinced, even dropping out of character – as long as the Gatekeeper party does not actually stop talking to the AI party before the minimum time expires.

Comment author: Jiro 10 September 2013 04:42:05AM -2 points [-]

If you're going to interpret it that way, the exception would swallow the rule. It would mean that the entire "the AI player controls the results of simulated tests" rule can be completely negated--since the Gatekeeper player could just say "I'm going to have the Gatekeeper act as though the simulated test has failed, even though you say it succeeded."

Comment author: hairyfigment 10 September 2013 05:22:02AM 0 points [-]

And indeed this seems true. I think Eliezer included the non-rule anyway to reduce the chance of unrealistic behavior in the sense of the Gatekeeper player changing the scenario mid-game, or derailing the experiment with an argument about something a real GK and AI could just settle.