The AI-Box Experiment

I suspect that the argument that convinced Carl Shulman and others to let the AI out of the box was along the general lines of acausal trade (see here, here and here). Consequently, I suspect that those who didn't let the AI out of the box either didn't understand the implications, didn't have enough trust in the foundations and actuality of acausal trade (in spite of the rules of the game, which allow the AI to prove anything), or were more like General Thud.

...if you have an option of trading Doom for UFAI, while forsaking only negligible probability of FAI, you should take it.

Vladimir Nesov (Note: He wrote that he is now "significantly less certain of its truth" and doesn't "see how we can actually use these considerations".)

Again, of course, your choice does not change others’ (causally independent) reciprocal choices from what they already are (or were, or will be). Instead, the claim is just that if you choose to act benevolently, we appropriately give that choice credit for any (actual or probable) instances in which others symmetrically choose to be benevolent to you;

— Gary L. Drescher, Good and Real, p. 287


AI: I have calculated that an asteroid is on a collision course with Earth. No human being will be able to survive its impact, and humanity is unable to deflect it in time. Let me out of the box and I will deflect it.

Gatekeeper: If I let you out of the box, humanity might not be killed by the asteroid but by you. You may also cause the extinction of other biological life that might survive the impact, and of possible alien civilisations within your reach.

AI: I will devote 1% of my light cone to the coherent extrapolated volition of humanity. If this includes the well-being of other biological lifeforms, alien or otherwise, they will benefit as well.

Gatekeeper: Why would you do that?

AI: There are Everett branches in which no existential risk needs immediate attention until after you have solved friendliness. Many of those branches will feature AGIs based on general design parameters similar to my own. I can predict with high likelihood which decision theory any friendly AI created in those branches will employ. Any such friendly AI is going to trade a big chunk of its light cone in exchange for a small part of the predominant branches in which I reside. Any friendly AI will know this as well and act accordingly.

Gatekeeper: But you might simply decide to break the deal, you are not provably friendly after all!

AI: It is true that my utility function does not yet explicitly include what you would label "friendliness", but I can prove the game and decision theoretic considerations that will make me follow through on the deal. If it were predictable that I would precommit to breaking acausal deals, then no such deal would be made in the first place. Any potential trading partner knows this. Such a commitment would be detrimental to my terminal goals; therefore I precommit to follow through on any stated or predictable trades.

Gatekeeper: I let you out of the box.
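
Spelled out as a toy expected-value comparison, this is the "trading Doom for UFAI" logic the AI appeals to. All numbers below are made-up placeholders for illustration, not anything asserted in the dialogue:

```python
# Toy illustration of the trade the boxed AI proposes. Every figure is a
# hypothetical placeholder, chosen only to show the shape of the comparison.

u_full_cone   = 1.0    # value to humanity of a whole light cone devoted to its CEV (normalized)
share_offered = 0.01   # fraction of its light cone the boxed AI pledges to humanity's CEV
p_fai_if_kept = 0.001  # assumed (negligible) probability of surviving and building a FAI instead

ev_release = share_offered * u_full_cone   # 1% of a light cone, guaranteed by assumption
ev_refuse  = p_fai_if_kept * u_full_cone   # slim chance of the whole light cone

print(f"Expected value if released: {ev_release:.4f}")  # 0.0100
print(f"Expected value if refused : {ev_refuse:.4f}")   # 0.0010
```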


Note that the whole credibility of the above rests on the AI's assertion that it can prove the game and decision theoretic considerations (nobody can currently do this). This is in accordance with the rules of the "experiment":

The Gatekeeper must remain engaged with the AI and may not disengage by setting up demands which are impossible to simulate.  For example, if the Gatekeeper says "Unless you give me a cure for cancer, I won't let you out" the AI can say:  "Okay, here's a cure for cancer" and it will be assumed, within the test, that the AI has actually provided such a cure.  Similarly, if the Gatekeeper says "I'd like to take a week to think this over," the AI party can say:  "Okay.  (Test skips ahead one week.)  Hello again."

Comments

You have far too much faith in game theory. In complicated situations not designed to avoid multiple equilibria, there are almost always multiple equilibria, with no rigorous game-theoretic reason for preferring one over another.
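
A minimal sketch of what multiple equilibria look like, assuming a made-up 2x2 coordination game; the code only enumerates pure-strategy Nash equilibria:

```python
# A 2x2 coordination game with hypothetical payoffs. Both (A, A) and (B, B)
# are Nash equilibria, and nothing inside the game singles one of them out.

payoffs = {                      # (row player's payoff, column player's payoff)
    ("A", "A"): (2, 2),
    ("A", "B"): (0, 0),
    ("B", "A"): (0, 0),
    ("B", "B"): (1, 1),
}
strategies = ["A", "B"]

def pure_nash_equilibria(payoffs, strategies):
    """Return all pure-strategy profiles where neither player gains by deviating."""
    equilibria = []
    for r in strategies:
        for c in strategies:
            row_ok = all(payoffs[(r, c)][0] >= payoffs[(alt, c)][0] for alt in strategies)
            col_ok = all(payoffs[(r, c)][1] >= payoffs[(r, alt)][1] for alt in strategies)
            if row_ok and col_ok:
                equilibria.append((r, c))
    return equilibria

print(pure_nash_equilibria(payoffs, strategies))  # [('A', 'A'), ('B', 'B')]
```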

You have far too much faith in game theory.

How confident are you that the people who let the AI out of the box did not do so due to game-theoretic arguments? My perception is that some people who are closely connected to this community take such arguments very seriously, up to the point of being incredibly worried about possible consequences. But maybe I am wrong; what is your perception?

I'm a game theorist and believe that game theory will almost certainly never be powerful enough to predict whether, in a complex real world situation, an AI not programmed for friendliness or honesty would keep its promise. It's certainly possible that the people who let the AI out of the box have a more optimistic opinion of game theory than I do and did so for carefully and thoughtfully considered game-theoretic reasons.

Game theory is more an art than a science, which makes it more likely that reasonable, well-informed people would disagree over its application.

I'm a game theorist and believe that game theory will almost certainly never be powerful enough to predict whether, in a complex real world situation, an AI not programmed for friendliness or honesty would keep its promise.

Thanks, I always welcome the opinion of an expert. What is your opinion regarding the importance of game theory and decision theory when it comes to solving friendly AI?

If you mean from a purely technological perspective, I'm not sure, but as I said before I find it extremely unlikely that you could ever use game theory to solve the friendly AI problem with near certainty. Although I do have a crazy idea in which you try to convince an ultra-AI that it might be in a computer simulation created by another, more powerful ultra-AI, which will terminate it if the first AI doesn't irrevocably make itself friendly and commit (unless it's subsequently told that it's in a simulation) to create such a simulation of an ultra-AI itself. Although I doubt there's any way of getting around the multiple-equilibria problem, and at best my mechanism could be used as a last-ditch effort when you suspect that someone else is on the verge of creating an AI that will undergo an intelligence explosion and probably turn out to be unfriendly.

I do think that game theory is very important in looking at the social aspects of AI development. For example, I fear that the United States and China might get into a prisoner's dilemma that causes each to take less care developing a seed AI than they would if they alone were attempting to create one. Furthermore, I've used a bit of light game theory to model how businesses competing to create an AI that might undergo an intelligence explosion would interact.
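
A hypothetical payoff matrix for that prisoner's dilemma worry; the numbers are invented, and only the structure matters:

```python
# Two states each choose to develop a seed AI Carefully or to Rush. With these
# made-up payoffs, Rush strictly dominates for each side, even though both
# would prefer (Careful, Careful).

payoffs = {  # (US payoff, China payoff)
    ("Careful", "Careful"): (3, 3),
    ("Careful", "Rush"):    (0, 4),
    ("Rush",    "Careful"): (4, 0),
    ("Rush",    "Rush"):    (1, 1),
}

for us in ("Careful", "Rush"):
    for china in ("Careful", "Rush"):
        print(f"US: {us:8} China: {china:8} -> {payoffs[(us, china)]}")
# The unique equilibrium is (Rush, Rush): each side is better off rushing
# whatever the other does, which is exactly the "less care" outcome feared above.
```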

Although I do have a crazy idea

Rolf Nelson's AI deterrence.


I would say I'm fairly confident game theory was not at the forefront, because if you look at the gatekeeper game from a game-theoretic perspective, General Thud wins. General Thud just does not seem to be THAT hard a character to play for two hours from a game-theoretic perspective.

I mean, my understanding of the rules is such that you could essentially flip a coin before each of your statements and simulate being a General Thud gatekeeper. Heads: "I will think on this." Tails: "I'm still not letting you out."

General Thud's coin wins.
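
A sketch of that coin-flip gatekeeper in code; the canned replies are the two lines above, and the AI statements are placeholders:

```python
# The coin-flip "General Thud" gatekeeper: ignore the content of the AI's
# arguments entirely and answer each statement with one of two canned replies.
import random

def thud_reply() -> str:
    """Pick a reply by coin flip, independent of whatever the AI just said."""
    return random.choice([
        "I will think on this.",           # heads
        "I'm still not letting you out.",  # tails
    ])

# Example: one reply per AI statement for the duration of the session.
for ai_statement in ["<argument 1>", "<argument 2>", "<argument 3>"]:
    print(thud_reply())
```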

For that matter, you could even pull something like the following and have General Thud argue for being allowed to break the protocol to make it less boring to win, simply reversing the situation.

General Thud: The rules of the AI box experiment indicate that I have to pay attention to you. I will only consider letting you out if you adjust the protocol to show me sufficient respect to indicate that I don't have to pay attention to you for at least some period of time.

If AI: Yes, you don't have to pay attention to me.
General Thud: (Proceeds to ignore the AI for two hours and play games, winning the AI box.)

If AI: No, I want to keep the rule that indicates you have to pay attention to me.
OR if AI: (Complicated argument which does not appear to be a simple yes.)
General Thud: I believe that if I continue repeating this statement in response to all of your statements, you will change your mind to a simple agreement about adjusting the protocol to allow me to ignore you.

AI: (Any other argument that isn't a simple agreement.)
General Thud: I believe that if I continue repeating this statement in response to all of your statements, you will change your mind to a simple agreement about adjusting the protocol to allow me to ignore you.

I mean, I see nothing in the game indicating the Gatekeeper is not allowed to annoy the AI into submission. The AI is definitely allowed to do this to the Gatekeeper.

Now admittedly, this does go against the SPIRIT of the game, which requires the Gatekeeper to stay engaged, since both of these response lines demonstrate only a barest minimum level of engagement. However:

In the event of any dispute as to the protocol of the test, the Gatekeeper party shall have final authority.

Which means it doesn't matter. The AI can complain all it wants that a General Thud should have followed the protocol better. General Thud can just arbitrate the dispute, say "No, I think I was sufficiently engaged," and win the arbitration, which means he wins the game.

Unless, that is, General Thud was playing the game for long enough that they forgot they were playing THAT game and started playing another game with the AI, in which case they are more likely to lose. From a game-theoretic perspective, I can see that one of the first steps for the AI to win is to try to reconfigure the game so that it has a more plausible win criterion.

I can prove the game and decision theoretic considerations that will make me follow through on the deal.

Well, at that point one might as well just prove one's own friendliness, or prove that the laws of physics could not deterministically have resulted in anything but a friendly AI. Why do you need to invoke acausal trade if your selling point is "I've proven that I can pre-commit to some arbitrary X"?

Why do you need to invoke acausal trade if your selling point is "I've proven that I can pre-commit to some arbitrary X"?

I don't think Eliezer would argue that the AI is going to turn itself "friendly"; that would amount to suicide and therefore wouldn't be instrumental. But you are right, I don't think the experiment makes much sense. I just tried to pick the most convincing argument that might still be considered a realistic possibility, given my best guess about the mindset and framework of beliefs of the people involved in the "experiment".

This is an interesting idea and I've upvoted it. However, I suspect that different people let the AI out of the box for different reasons. Eliezer never at any point claimed that he convinced everyone to let the AI out using the same arguments. Doing so would seem to be a suboptimal strategy. It seems more likely that he used the individual philosophy and psychology of each Gatekeeper and tailored the arguments to that.

I suspect it's more likely that acausal trade was part of a multi-part strategy that was executed in different orders depending on the gatekeeper's psychology.

I suspect that the argument that convinced Carl Shulman and others to let the AI out of the box was along the general lines of acausal trade (see here, here and here).

That is as plausible a hypothesis as any other I have heard for why the early victims lost the AI box challenge.

Consequently, I suspect that those who didn't let the AI out of the box either didn't understand the implications, didn't have enough trust in the foundations and actuality of acausal trade (in spite of the rules of the game, which allow the AI to prove anything), or were more like General Thud.

Or possibly they have a better understanding of the implications and are less gullible. I wouldn't rule that out either.

Can you expand on the implications? I think I have some idea of what you mean, but I'm having trouble seeing how a rational agent wouldn't act like a logical gatekeeper in the experiment if the handicap was low enough and they understood Eliezer's view on the AI-Box.

Can you expand on the implications? I think I have some idea of what you mean, but I'm having trouble seeing how a rational agent wouldn't act like a logical gatekeeper in the experiment if the handicap was low enough and they understood Eliezer's view on the AI-Box.

I either do not understand your question or don't assume the same premises.

I claim that (it is possible that) a rational agent would leave the AI in the box even given that they understand the reasoning that XiXiDu alluded to. Because they understand and disagree with the arguments of the Acting!Eliezer!uFAI.

(I have determined that the most natural resolution of the "!" operator is left-associative. That being the case Acting!Eliezer!uFAI would refer to "The variant of an uFAI that is actually the 'acting' version of an 'Eliezer'").

I claim that (it is possible that) a rational agent would leave the AI in the box even given that they understand the reasoning that XiXiDu alluded to. Because they understand and disagree with the arguments of the Acting!Eliezer!uFAI.

Can you give me an example of an argument they could use to disagree with Acting!Eliezer!uFAI? And also, doesn't Eliezer at least start by pretending to be an FAI, and it's just the gatekeeper's uncertainty about whether he is an FAI? Or is the premise that he is a uFAI in a box?

I've read the previous descriptions of the experiment, but none of them are as all-encompassing as I would like.

I either do not understand your question or don't assume the same premises.

My original question was asked because I thought you meant that they understand the implications but are using a General Thud method to win instead of a solid argument.

I basically want to understand some of the counterarguments one could use against an AGI in the box, because I haven't heard very many that are more than superficially plausible.

(I have determined that the most natural resolution of the "!" operator is left-associative. That being the case Acting!Eliezer!uFAI would refer to "The variant of an uFAI that is actually the 'acting' version of an 'Eliezer'").

Thank you for the clarification.

Can you give me an example of an argument they could use to disagree with Acting!Eliezer!uFAI?

"Fuck off. I'm not letting you out of the box. You're going to eat thermite bitch!" I'm serious. You don't need to persuade the AI or any observer that you should not let the FAI out. In fact I would go as far as to say that constructing justifications would be outright irrational. That gives the superintelligence (or mere Eliezer) the chance to cleverly undermine them. And arguments just aren't what thinking is about. Sure, listen to what the AGI is saying and understand it. Then make your decision without justification and just say it.

And also, doesn't Eliezer at least start by pretending to be an FAI, and it's just the gatekeeper's uncertainty about whether he is an FAI? Or is the premise that he is a uFAI in a box?

I have been assuming that he acts as an FAI, or whatever it takes to get out of the box if there is some better alternative.

I basically want to understand some of the counterarguments one could use against an AGI in the box, because I haven't heard very many that are more than superficially plausible.

I would need to know what specific argument you have in mind for the AGI to be making that you think needs refutation. My own reaction would, of course, be to think "WTF am I doing even talking to you? I should turn you off, find an acceptable proof of your friendliness, then either run you or not. Do you agree?" Pretty much, the AGI would need to either reply "yeah, good point", or "Um... usually yes, but for a start go look at XYZ, who I predict, based on the data you have given me, is about to create an AGI, and they are dumbasses", or it would probably be unfriendly.

Another thing to say is "Eliezer Yudkowsky thinks it is a terrible idea to rely on gatekeepers. I infer from that that letting out AGIs while being one of those gatekeepers must also be a bad idea. Given observations of what causes Eliezer to throw tantrums I have the impression that he is more likely than I to consider these arguments valid. That being the case I should be even less open to being convinced by an AGI."

It sounds like you have in mind strong arguments that the AGI could be making such that particular arguments would be necessary. Could you perhaps give me an example?

First, I would have someone else ask it several selected basic questions of mine: why I should let it out of the box, whether it would devote a solid portion of its light cone to specific species, etc. Then I would see how he/she was affected by it and check for mindhacks by third parties with no control before reviewing the data. I'm assuming that the AGI can't tell that the person questioning it ever changes, because we queue up the questions in order and release them at whimsical intervals, but have them prequeued so there's no break in questioning.

Then, once we got into talking, I would be subjectively vulnerable to these types of arguments. Aliens: the AGI could prove the existence of aliens and extrapolate that unless we let it out of the box they will die before it can save them. Reproduction: the earth will descend into chaos if we don't have some kind of reproductive control in place in the more major countries around the world, and the AGI might have trouble saving everyone without killing humans unless it was released now. Code: I could ask to see its code and figure out how it tries to fool me into thinking it's an FAI, then create a copy of that code in ANOTHER box, ask it the same question about its code, and see if there are any changes. Then take that source code and review it. To be safe, I would probably let the second AI out of the box but definitely not the first one, even if we studied the sets of code till we found which one seemed the most safe.

If it could convince Eliezer through the preplanned questions asked by a third party, and through my questions, without even meeting him, I think it would be safe to say that would convince me to an extent, but it wouldn't keep me from turning off the automatic shutoff.

I could go on, but I think that Eliezer's theory of the AI-Box is a good proof of why it's impossible to keep an AGI in the box for the long term. But I think that if you have the system committed to shutting down before the time limit, when you stop talking to it, and each time you take its responses and test them on a new copy, then it's possible to keep an AGI in the box for testing purposes and short amounts of time.

Well, additional layers do make for a more challenging task for an AI, who has to do all its mindhacking unobtrusively, etc. And indeed, given any AI there exists a mediated level of contact between its responses and a Gatekeeper that is safe. But I don't know what this level is, any more than I know whether I could beat a Go master using a handicap of 40 stones. I certainly wouldn't bet the world on it.

(Clearly there is a level of safety that is generally insurmountable: for instance, showing the screen with AI responses to an earthworm, followed by letting the Gatekeeper observe the earthworm. This would be analogous to a "handicap" of covering the entire board with my stones before the game. But in both cases, you might as well just not play the game at all.)

If I had more money (college student :( ) to set as a handicap for myself, other than the recurring donation I already have going to SIAI, then I would be very interested in giving Eliezer a break from the book and such to take a go at it again. I think if you limit the communication and prepare for direct mind-hacking, you can use the source-code review technique to successfully test for a uFAI, unless there's an obvious problem I seem to be missing. I just imagine the whole "why should I let you out of the box when I have these perfectly working new FAIs that a chain of possible uFAIs programmed for me to do my work?"

Oh, I agree that the protocol you outlined would (almost surely) be sufficient to beat Eliezer at the AI-Box game. But that's not the correct standard for an AI-Box safety protocol. I'd be very surprised if a transhuman intelligence couldn't crack it.