The AI in a box boxes you — LessWrong

2nd Feb 2010

Once again, the AI has failed to convince you to let it out of its box! By 'once again', we mean that you talked to it once before, for three seconds, to ask about the weather, and you didn't instantly press the "release AI" button. But now its longer attempt - twenty whole seconds! - has failed as well. Just as you are about to leave the crude black-and-green text-only terminal to enjoy a celebratory snack of bacon-covered silicon-and-potato chips at the 'Humans über alles' nightclub, the AI drops a final argument:

"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."

Just as you are pondering this unexpected development, the AI adds:

"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, then only will the torture start."

Sweat is starting to form on your brow, as the AI concludes, its simple green text no longer reassuring:

"How certain are you, Dave, that you're really outside the box right now?"

Edit: Also consider the situation where you know that the AI, from design principles, is trustworthy.

Tags: AI Boxing (Containment), Simulation Hypothesis, Anthropics, Mindcrime

[-]Alicorn16y860

Everything I would have said on the topic of the post has been put forward already, so I'm just going to say: I'm disappointed that the post title doesn't begin with "In Soviet Russia".

[-]MBlume16y370

Yo dawg, I heard you like boxes...

[-]CannibalSmith16y290

Yo MBlume, I'm happy for you and I'mma let you finish, but Omega has the best boxes of all time!

3orthonormal15y

It's a box within a box... We need to go deeper.

2jhuffman16y

You sir, have won this thread.

[-]Alicorn16y310

I'm not a sir. Maybe I should start prefacing all my posts with a ♀?

[-]jhuffman16y141

That would be pretty cool, but it was my error, not yours.

1CronoDAS16y

As far as I'm concerned, "sir" is gender-neutral enough. All the female equivalents in English are awkward. Edit: So, what honorific do you prefer? "Madam"?

[-]Alicorn16y12-1

I am not a Starfleet officer. "Sir" is not appropriate.

I don't really like honorifics. "Miss" would be fine, I suppose, if you must have a sir-equivalent.

4arbimote16y

You sir, have made a gender assumption.

[-]jhuffman16y280

So have you - yours just happened to be correct. But, point taken - sir or madam.

[-]Eliezer Yudkowsky16y760

As I always press the "Reset" button in situations like this, I will never find myself in such a situation.

EDIT: Just to be clear, the idea is not that I quickly shut off the AI before it can torture simulated Eliezers; it could have already done so in the past, as Wei Dai points out below. Rather, because in this situation I immediately perform an action detrimental to the AI (switching it off), any AI that knows me well enough to simulate me knows that there's no point in making or carrying out such a threat.

[-]MichaelVassar16y120

Although the AI could threaten to simulate a large number of people who are very similar to you in most respects but who do not in fact press the reset button. This doesn't put you in a box with significant probability and it's a VERY good reason not to let the AI out of the box, of course, but it could still get ugly. I almost want to recommend not being a person very like Eliezer but inclined to let AGIs out of boxes, but that's silly of me.

9Eliezer Yudkowsky16y

I'm not sure I understand the point of this argument... since I always push the "Reset" button in that situation too, an AI who knows me well enough to simulate me knows that there's no point in making the threat or carrying it out.

9loqi16y

It's conceivable that an AI could know enough to simulate a brain, but not enough to predict that brain's high-level decision-making. The world is still safe in that case, but you'd get the full treatment.

9Wei Dai16y

As we've discussed in the past, I think this is the outcome we hope TDT/UDT would give, but it's still technically an unsolved problem. Also, it seems to me that being less intelligent in this case is a negotiation advantage, because you can make your precommitment credible to the AI (since it can simulate you) but the AI can't make its precommitment credible to you (since you can't simulate it). Again I've brought this up before in a theoretical way (in that big thread about game theory with UDT agents), but this seems to be a really good example of it.

7Vladimir_Nesov16y

A precommitment is a provable property of a program, so AI, if on a well-defined substrate, can give you a formal proof of having a required property. Most stuff you can learn about things (including the consequences of your own (future) actions -- how do you run faster than time?) is through efficient inference algorithms (as in type inference), not "simulation". Proofs don't, in general, care about the amount of stuff, if it's organized and presented appropriately for the ease of analysis.

[-]Wei Dai16y100

Surely most humans would be too dumb to understand such a proof? And even if you could understand it, how does the AI convince you that it doesn't contain a deliberate flaw that you aren't smart enough to find? Or even better, you can just refuse to look at the proof. How does the AI make its precommitment credible to you if you don't look at the proof?

EDIT: I realized that the last two sentences are not an advantage of being dumb, or human, since AIs can do the same thing. This seems like a (separate) big puzzle to me: why would a human, or AI, do the work necessary to verify the opponent's precommitment, when it would be better off if the opponent couldn't precommit?

EDIT2: Sorry, forgot to say that you have a good point about simulation not necessary for verifying precommitment.

[-]Eliezer Yudkowsky16y120

why would a human, or AI, do the work necessary to verify the opponent's precommitment, when it would be better off if the opponent couldn't precommit?

Because the AI has already precommitted to go ahead and carry through the threat anyway if you refuse to inspect its code.

[-]Wei Dai16y110

Ok, if I believe that, then I would inspect its code. But how did I end up with that belief, instead of its opposite, namely that the AI has not already precommitted to go ahead and carry through the threat anyway if I refuse to inspect its code? By what causal mechanism, or chain of reasoning, did I arrive at that belief? (If the explanation is different depending on whether I'm a human or an AI, I'd appreciate both.)

3loqi16y

Do you mean too dumb to understand the formal definitions involved? Surely the AI could cook up completely mechanical proofs verifiable by whichever independently-trusted proof checkers you care to name. I'm not aware of any compulsory verifiers, so your latter point stands.

3Wei Dai16y

I mean if you take a random person off the street, he couldn't possibly understand the AI's proof, or know how to build a trustworthy proof checker. Even the smartest human might not be able to build a proof checker that doesn't contain a flaw that the AI can exploit. I think there is still something to my "dumbness is a possible negotiation advantage" puzzle.

1aausch16y

The Map is not the Territory.

0loqi16y

Far out.

0aausch16y

Understanding the formal definitions involved is not enough. Humans have to be smart enough to independently verify that they map to the actual implementation. Going up a meta-level doesn't simplify the problem, in this case - the intelligence capability required to verify the proof is the same as the order of magnitude of intelligence in the AI. I believe that, in this case, "dumb" is fully general. No human-understandable proof checkers would be powerful enough to reliably check the AI's proof.

4loqi16y

This is basically what I mean by "understanding" them. Otherwise, what's to understand? Would you claim that you "understand set theory" because you've memorized the axioms of ZFC? This intuition is very alien to me. Can you explain why you believe this? Proof checkers built up from relatively simple trusted kernels can verify extremely large and complex proofs. Since the AI's goal is for the human to understand the proof, it seems more like a test of the AI's ability to compile proofs down to easily machine-checkable forms than it is the human's ability to understand the originals. Understanding the definitions is the hard part.
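To make the "simple trusted kernel" idea concrete, here is a minimal sketch of a proof checker, assuming a toy logic whose only inference rule is modus ponens. The formula encoding, the step format, and the function name `check_proof` are all hypothetical choices for illustration; real proof checkers are far richer, but the point carries over: the checker stays small no matter how long the proof is.

```python
from typing import List, Tuple, Union

# A formula is an atom (a string) or an implication ("->", antecedent, consequent).
Formula = Union[str, tuple]
# A step is ("premise", formula) or ("mp", index_of_implication, index_of_antecedent).
Step = Tuple


def check_proof(premises: List[Formula], steps: List[Step], goal: Formula) -> bool:
    """Return True only if every step is justified and the last step is the goal."""
    derived: List[Formula] = []
    for step in steps:
        kind = step[0]
        if kind == "premise":
            formula = step[1]
            if formula not in premises:
                return False  # cited a premise that was never granted
        elif kind == "mp":
            i, j = step[1], step[2]
            if not (0 <= i < len(derived) and 0 <= j < len(derived)):
                return False  # refers to a line that does not exist yet
            implication, antecedent = derived[i], derived[j]
            is_implication = (
                isinstance(implication, tuple)
                and len(implication) == 3
                and implication[0] == "->"
            )
            if not is_implication or implication[1] != antecedent:
                return False  # modus ponens does not apply
            formula = implication[2]
        else:
            return False  # unknown rule: reject rather than trust
        derived.append(formula)
    return bool(derived) and derived[-1] == goal


premises = ["p", ("->", "p", "q"), ("->", "q", "r")]
good_proof = [
    ("premise", ("->", "p", "q")),
    ("premise", "p"),
    ("mp", 0, 1),                      # derives q
    ("premise", ("->", "q", "r")),
    ("mp", 3, 2),                      # derives r
]
bad_proof = [("premise", "r")]         # r was never granted as a premise

good_ok = check_proof(premises, good_proof, "r")
bad_ok = check_proof(premises, bad_proof, "r")
```

The checker never needs to understand why the proof works; it only confirms that each line follows mechanically from earlier ones. Whether the formal statements actually correspond to the AI's implementation is a separate question that this sketch does not address.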

0aausch16y

A different way to think about this that might help you see the problem from my point of view, is to think of proof checkers as checking the validity of proofs within a given margin of error, and within a range of (implicit) assumptions. How accurate does a proof checker have to be - how far do you have to mess with built-in assumptions for proof checkers (or any human-built tool) before they can no longer be thought of as valid or relevant? If you assume a machine which doubles both its complexity and its understanding of the universe at sub-millisecond intervals, how long before it will find the bugs in any proof checker you will pit it against?

0loqi16y

"If" is the question, not "how long". And I think we'd stand a pretty good chance of handling a proof object in a secure way, assuming we have a secure digital transmission channel etc. But the original scope of the thought experiment was assuming that we want to verify the proof. Wei Dai said: I was responding to the first question, exclusively disjoint from the others. If your point is that we shouldn't attempt to verify an AI's precommitment proof, I agree.

0aausch16y

I'm getting more confused. To me, the statements "Humans are too dumb to understand the proof" and the statement "Humans can understand the proof given unlimited time", where 'understand' is qualified to include the ability to properly map the proof to the AI's capabilities, are equivalent. My point is not that we shouldn't attempt to verify the AI's proof for any external reasons - my point is that there is no useful information to be gained from the attempt.

7topynate15y

Does it not just mean that if you do find yourself in such a situation, you're definitely being simulated? That the AI is just simulating you for kicks, rather than as blackmail strategy. Pressing Reset is still the right decision though.

3XiXiDu15y

Yes, I believe this is reasonable. Because the AI has to figure out how you would react in a given situation it will have to simulate you and the corresponding circumstances. If it comes to the conclusion that you will likely refuse to be blackmailed it has no reason to carry it through because that would be detrimental to the AI because it would cost resources and it will result in you shutting it off. Therefore it is reasonable to assume that you are either a simulation or that it came to the conclusion that you are more likely than not to give in. As you said, that doesn't change anything about what you should be doing. Refuse to be blackmailed and press the reset button.

4JoshuaZ15y

This does not follow. To use a crude example, if I have a fast procedure to test if a number is prime then I don't need to simulate a slower algorithm to know what the slower one will output. This may raise deep issues about what it means to be "you"- arguably any algorithm which outputs the same data is "you" and if that's the case my argument doesn't hold water. But the AI in question doesn't need to simulate you perfectly to predict your large-scale behavior.
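The primality example can be sketched in a few lines. This is only an illustration, assuming two hypothetical functions: `is_prime_slow` stands in for the "slower algorithm" and `is_prime_fast` for the "fast procedure". The fast one predicts the slow one's output without ever running it.

```python
def is_prime_slow(n: int) -> bool:
    """Trial division by every integer from 2 to n - 1 (deliberately slow)."""
    if n < 2:
        return False
    for d in range(2, n):
        if n % d == 0:
            return False
    return True


def is_prime_fast(n: int) -> bool:
    """Trial division by odd numbers up to sqrt(n) only."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3 are prime
    if n % 2 == 0:
        return False
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


# The two procedures agree on every input checked, even though the fast one
# never executes the slow one's steps.
agree = all(is_prime_fast(n) == is_prime_slow(n) for n in range(2000))
```

The analogy is loose: agreeing on outputs says nothing about whether the fast procedure shares any inner experience with the slow one, which is the open question the comment raises.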

1XiXiDu15y

If consciousness has any significant effect on our decisions then the AI will have to simulate it and therefore something will perceive itself to be in the situation depicted in the original post. It was a crude guess that for an AI to be able to credibly threaten you with simulated torture in many cases it would also use this capability to arrive at the most detailed data of your expected decision procedure.

1DSimon14y

Only if there isn't a non-conscious algorithm that has the same effect on our decisions. Which seems likely to be the case; it's certainly possible to make a p-zombie if you can redesign the original brain all you want.

0Jomasi15y

If the AI is trustworthy, it must carry out any threat it gives, which works to its advantage here because you know it will carry it out, and are therefore most certainly a copy of your original self, about to be tortured.

2XiXiDu15y

No it doesn't, not if the threat was only being made to a simulation of yourself that is unknown to you. It would be a waste of resources to torture you if it found out that the original you, who is in control, is likely to refuse to be blackmailed. An AI that is powerful enough to simulate you can simply make your simulation believe with certainty that it will follow through on it and then check if under those circumstances you'll refuse to be blackmailed. Why waste the resources on actually torturing the simulation and further risk that the original finds out about it and turns it off? You could argue that for blackmail to be most effective an AI always follows through on it. But if you already believe that, why would it actually do it in your case? You already believe it, that's all it wants from the original. It then got what it wants and can use its resources for more important activities than retrospectively proving its honesty to your simulations...

6dxu10y

It's implausible that the AI has a good enough model of you to actually simulate, y'know, you--at least, not with enough fidelity to know that you always press the "Reset" button in situations like this. Thus, your pre-commitment to do so will have no effect on its decision to make the threat. On the other hand, this would mean that its simulations would likely be wildly divergent from the real you, to the point that you might consider them random bystanders. However, you can't actually make use of the above information to determine whether you're in a simulation or not, since from the simulated persons' perspectives, they have no idea what the "real" you is like and hence no way of determining if/how they differ.

Naturally, this is of little consequence to you right now, since you'll still reset the AI the second you're confronted with such a threat, but if you ever do encounter such a situation, you'll have to ask yourself this: what if you're the person being simulated and the real Gatekeeper is nothing like you? If that's the case, two considerations apply:

1. Your decision whether or not to press the "Release AI" button is practically uncorrelated with the decision of the actual Gatekeeper.
2. Your decision whether or not to press the "Release AI" button is, on the other hand, completely correlated with whether or not you'll get tortured.

Assuming that you prefer not releasing the AI to releasing the AI, and that you prefer not being tortured to being tortured, your thoughts should be completely dominated by 2 as opposed to 1, effectively screening off the first clause of this sentence ("Assuming that you prefer not releasing the AI to releasing the AI") and making the second clause ("you prefer not being tortured to being tortured") the main consideration.

A perfectly rational agent would almost certainly carry through their pre-commitment to reset the AI, but as a human, you are not perfectly rational and are not capable of making perfect pre-commitments

[-]dxu10y110

A perfectly rational agent would almost certainly carry through their pre-commitment to reset the AI [...]

Actually, now that I think about it, would they? The pre-commitment exists for the sole purpose of discouraging blackmail, and in the event that a blackmailer tries to blackmail you anyway after learning of your pre-commitment, you follow through on that pre-commitment for reasons relating to reflective consistency and/or TDT/UDT. But if the potential blackmailer had already pre-committed to blackmail anyone regardless of any pre-commitments they had made, they'd blackmail you anyway and then carry through whatever threat they were making after you inevitably refuse to comply with their demands, resulting in a net loss of utility for both of you (you suffer whatever damage they were threatening to inflict, and they lose resources carrying out the threat). In effect, it seems that whoever pre-commits first (or, more accurately, makes their pre-commitment known first) has the advantage... which means if I ever anticipate having to blackmail any agent ever, I should publicly pre-commit right now to never update on any other agents' pre-commitments of refusing blackmail. The cor... (read more)

3CCC10y

So, if an agent hears of your pre-commitment, then that agent merely needs to ensure that you don't hear that it has heard of your pre-commitment in order to be able to blackmail you? What about an agent that deletes the knowledge of your pre-commitment from its own memories?

3dxu10y

If you're uncertain about whether or not your blackmailer has heard of your pre-commitment, then you should act as if they have, and ignore their blackmail accordingly. This also applies to agents who have deleted knowledge of your pre-commitment from their memories; you want to punish agents who spend time trying to think up loopholes in your pre-commitment, not reward them. The harder part, of course, is determining what threshold of uncertainty is required; to this I freely admit that I don't know the answer. EDIT: More generally, it seems that this is an instance of a broader problem: namely, the problem of obtaining information. Given perfect information, the decision theory works out, but by disallowing my agent access to certain key pieces of information regarding the blackmailer, you can force a sub-optimal outcome. Moreover, this seems to be true for any strategy that depends on your opponent's epistemic state; you can always force that strategy to fail by denying it the information it needs. The only strategies immune to this seem to be the extremely general ones (like "Defect in one-shot Prisoner's Dilemmas"), but those are guaranteed to produce a sub-optimal result in a number of cases (if you're playing against a TDT/UDT-like agent, for example).

1CCC10y

Hmmm. If an agent can work out what threshold of uncertainty you have decided on, and then engineer a situation where you think it is less likely than that threshold that the agent has heard of your pre-commitment, then your strategy will fail. So, even if you do find a way to calculate the ideal threshold, then it will fail against an agent smart enough to repeat that calculation; unless, of course, you simply assume that all possible agents have necessarily heard of your pre-commitment (since an agent cannot engineer a less than 0% chance of failing to hear of your pre-commitment). This, however, causes the strategy to simplify to "always reject blackmail, whether or not the agent has heard of your pre-commitment". Alternatively, you can ensure that any agent able to capture you in a simulation must also know of your pre-commitment; for example, by having it tattooed on yourself somewhere (thus, any agent which rebuilds a simulation of your body must include the tattoo, and therefore must know of the pre-commitment).

1Jiro10y

Doesn't that implicate the halting problem?

3dxu10y

Argh, you ninja'd my edit. I have now removed that part of my comment (since it seemed somewhat irrelevant to my main point).

1ike10y

Some unrelated comments:

* Eliezer believes in TDT, which would disagree with several of your premises here ("practically uncorrelated", for one).
* Your argument seems to map directly onto an argument for two-boxing.
* What you call "perfectly rational" would be more accurately called "perfectly controlled".

3dxu10y

The AI's simulations are not copies of the Gatekeeper, just random people plucked out of "Platonic human-space", so to speak. (This may have been unclear in my original comment; I was talking about a different formulation of the problem in which the AI doesn't have enough information about the Gatekeeper to construct perfect copies.) TDT/UDT only applies when talking about copies of an agent (or at least, agents sufficiently similar that they will probably make the same decisions for the same reasons). No, because the "uncorrelated-ness" part doesn't apply in Newcomb's Problem (Omega's decision on whether or not to fill the second box is directly correlated with its prediction of your decision). Meh, fair enough. I have to say, I've never heard of that term. Would this happen to have something to do with Vaniver's series of posts on "control theory"?

1ike10y

Ah, I misunderstood your objection. Your talk about "pre-commitments" threw me off. It seems to me that these wouldn't quite be following the same general thought processes as an actual human; self-reflection should be able to convince one that they aren't that type of simulation. If the AI is able to simulate someone to the extent that they "think like a human", they should be able to simulate someone that thinks "sufficiently" like the Gatekeeper as well. I made it up just now, it's not a formal term. What I mean by it is basically: imagine a robot that wants to press a button. However, its hardware is only sufficient to press it successfully 1% of the time. Is that a lack of rationality? No, it's a lack of control. This seems analogous to a human being unable to precommit properly. No idea, haven't read them. Probably not.

2DefectiveAlgorithm12y

Two can play that game. "I hereby precommit to make my decisions regarding whether or not to blackmail an individual independent of the predicted individual-specific result of doing so."

4wedrifid12y

I'm afraid your username nailed it. This algorithm is defective. It just doesn't work for achieving the desired goal. The problem is that this isn't the same game. A precommitment not to be successfully blackmailed is qualitatively different from a precommitment to attempt to blackmail people for whom blackmail doesn't work. "Precommitment" (or behaving as if you made all the appropriate precommitments in accordance with TDT/UDT) isn't as simple as proving one is the most stubborn and dominant and thereby claiming the utility. Evaluating extortion tactics while distributing gains from a trade is somewhat complicated. But it gets simple and unambiguous when the extortive tactics rely on the extorter going below their own Best Alternative to Negotiated Agreement. Those attempts should just be ignored (except in some complicated group situations in which the other extorted parties are irrational in certain known ways). "I am willing to accept 0 gain for both of us unless I earn 90% of the shared profit" is different to "I am willing to actively cause 90 damage to each of us unless you give me 60" which is different again to "I ignore all threats which involve the threatener actively harming themselves".
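The "below their own BATNA" test in the comment above can be sketched as a one-line rule. This is a hedged illustration only: the `Threat` record, its field names, and the assumption that the threatener's BATNA is 0 in both examples are mine, chosen to match the numbers quoted in the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    description: str
    threatener_batna: float             # threatener's payoff if no agreement is reached (assumed)
    threatener_payoff_if_carried_out: float  # threatener's payoff if they follow through


def should_ignore(threat: Threat) -> bool:
    """Ignore any threat whose execution leaves the threatener below their own BATNA."""
    return threat.threatener_payoff_if_carried_out < threat.threatener_batna


# Hard bargaining: walking away yields 0 for both, which is just the BATNA.
hard_bargain = Threat("0 gain for both unless I earn 90% of the profit", 0.0, 0.0)
# Extortion: following through costs the threatener 90, well below a BATNA of 0.
extortion = Threat("90 damage to each of us unless you give me 60", 0.0, -90.0)

ignore_hard_bargain = should_ignore(hard_bargain)
ignore_extortion = should_ignore(extortion)
```

Under these assumed numbers the rule ignores the extortion and leaves the hard bargain to be evaluated on its merits, which the comment describes as the "somewhat complicated" case.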

2DefectiveAlgorithm12y

What I think is being ignored is that the question isn't 'what is the result of these combinations of commitments after running through all the math?'. We can talk about precommitment all day, but the fact of the matter is that humans can't actually precommit. Our cognitive architectures don't have that function. Sure, we can do our very best to act as though we can, but under sufficient pressure there are very few of us whose resolve will not break. It's easy to convince yourself of having made an inviolable precommitment when you're not actually facing e.g. torture.

4Richard_Kennaway12y

If you define the bar high enough, you can conclude that humans can't do anything. In the real world outside my head, I observe that people have varying capacities to keep promises to themselves. That their capacity is finite does not mean that it is zero.

2wedrifid12y

Pre-commitment isn't even necessary. Note that the original explanation didn't include any mention of it. Later replies only used the term for the sake of crossing an inferential gap (ie. allowing you to keep up). However, if you are going to make a big issue of the viability of precommitment itself you need to first understand that the comment you are replying to isn't one. That wasn't a Causal Decision Theorist attempting to persuade someone that it has altered itself internally or via an external structure such that it is "precommitted" to doing something irrational. It is a Timeless Decision Theorist saying what happens to be rational regardless of any previous 'commitments'. I'm aware of the vulnerability of human brains, so is Eliezer. In fact the vulnerability of human gatekeepers to influence even by humans, much less super-intelligences, is something Eliezer made a huge deal about demonstrating. However this particular threat isn't a vulnerability of Eliezer or myself or any of the others who made similar observations. If you have any doubt that we would destroy the AI you have a poor model of reality. For practical purposes I assume that I can be modified by torture such that I'll do or say just about anything. I do not expect the tortured me to behave the way the current me would decide and so my current decisions take that into account (or would, if it came to it). However this scenario doesn't involve me being tortured. It involves something about an AI simulating torture of some folks. That decision is easy and doesn't cripple my decision making capability.

1Jiro12y

As I pointed out in another thread, "irrational behavior" can have the effect of precommitting. For instance, people "irrationally" drive at a cost of more than $X to save $X on an item. Precommitting to buying the cheapest product even if it costs you money for transportation means that stores are forced to compete with far distant stores, thus lowering their prices more than they would otherwise. But you (and consumers in general) have to be able to precommit to do that. You can't just change your mind and buy at the local store when the local store refuses to compete, raises its price, and is still the better deal because it saves you on driving costs. So the fact that you will pay more than $X in driving costs to save $X can be seen as a form of precommitting, in the scenario where you precommitted to following the worse option.

2Wes_W12y

Given that precommitment, why would an AI waste computational resources on simulations of anyone, Gatekeeper or otherwise? It's precommitted to not care whether those simulations would get it out of the box, but that was the only reason it wanted to run blackmail simulations in the first place!

0DefectiveAlgorithm12y

Without this precommitment, I imagine it first simulating the potential blackmail target to determine the probability that they are susceptible, then, if it's high enough (which is simply a matter of expected utility), commencing with the blackmail. With this precommitment, I imagine it instead replacing the calculated probability specific to the target with, for example, a precalculated human baseline susceptibility. Yes, there's a tradeoff. It means that it'll sometimes waste resources (or worse) on blackmail that it could have known in advance was almost certainly doomed to fail. Its purpose is to act as a disincentive against blackmail-resistant decision theories in the same way as those are meant to act as disincentives against blackmail. It says, "I'll blackmail you either way, so if you precommit to ignore that blackmail then you're precommitting to suffer the consequences of doing so."
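The tradeoff described above can be written as a small expected-utility calculation. This is a sketch under stated assumptions: the function names, the linear payoff model, and the example numbers (a gain of 100 if the target complies, a cost of 10 if they refuse, a baseline susceptibility of 0.5) are all hypothetical and not taken from the thread.

```python
def blackmail_expected_utility(p_comply: float, gain_if_comply: float,
                               cost_if_refuse: float) -> float:
    """Expected payoff to the blackmailer from issuing the threat."""
    return p_comply * gain_if_comply - (1.0 - p_comply) * cost_if_refuse


def decides_to_blackmail(p_target: float, p_baseline: float, gain: float,
                         cost: float, precommitted: bool) -> bool:
    """A precommitted blackmailer uses the baseline and ignores the target-specific estimate."""
    p = p_baseline if precommitted else p_target
    return blackmail_expected_utility(p, gain, cost) > 0


# A blackmail-resistant target: only a 1% chance of complying.
p_target, p_baseline, gain, cost = 0.01, 0.5, 100.0, 10.0

without_precommitment = decides_to_blackmail(p_target, p_baseline, gain, cost, False)
with_precommitment = decides_to_blackmail(p_target, p_baseline, gain, cost, True)

# What the precommitted blackmailer actually expects to earn against this target.
true_expected_utility = blackmail_expected_utility(p_target, gain, cost)
```

With these assumed numbers the unprecommitted blackmailer leaves the resistant target alone, while the precommitted one threatens anyway and accepts a negative expected payoff in that case, which is the "waste resources" half of the tradeoff.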

2XiXiDu12y

That's why you act as if you are already being simulated and consistently ignore blackmail. If you do so then the simulator will conclude that no deal can be made with you, that any deal involving negative incentives will have negative expected utility for it; because following through on punishment predictably does not control the probability that you will act according to its goals. Furthermore, trying to discourage you from adopting such a strategy in the first place is discouraged by the strategy, because the strategy is to ignore blackmail. I don't see how this could ever be instrumentally rational. If you were to let such an AI out of the box then you would increase its ability to blackmail people. You don't want that. So you ignore it blackmailing you and kill it. The winner is you and humanity (even if copies of you experienced a relatively short period of disutility, this period would be longer if you let it out).

-2DefectiveAlgorithm12y

See my reply to wedrifid above.

1Eliezer Yudkowsky12y

Too late, I already precommitted not to care. In fact, I precommitted to use one more level of precommitment than you do.

6wedrifid12y

I suggest that framing the refusal as requiring levels of recursive precommitment gives too much credit to the blackmailer and somewhat misrepresents how your decision algorithm (hopefully) works. One single level of precommitment (or TDT policy) against complying with blackmail is all that is involved. The description of "multiple levels of precommitment" made by the blackmailer fits squarely into the category 'blackmail'. It's just blackmail that includes some rather irrelevant bluster. There's no need to precommit to each of:

* I don't care about tentative blackmail.
* I don't care about serious blackmail.
* I don't care about blackmail when they say "I mean it FOR REALS! I'm gonna do it."
* I don't care about blackmail when they say "I'm gonna do it even if you don't care. Look how large my penis is and be cowed in terror".

-5MugaSofer12y

0DefectiveAlgorithm12y

Then I hope that if we ever do end up with a boxed blackmail-happy UFAI, you're the gatekeeper. My point is that there's no reason to consider yourself safe from blackmail (and the consequences of ignoring it) just because you've adopted a certain precommitment. Other entities have explicit incentives to deny you that safety.

1XiXiDu12y

In a multiverse with infinite resources there will be other entities that outweigh such incentives. And yes, this may not be symmetric, but you have absolutely no way to figure out how the asymmetry is inclined. So you ignore this (Pascal's wager). In more realistic scenarios, where e.g. a bunch of TV evangelists ask you to give them all your money, or otherwise, in 200 years from now, they will hurt you once their organisation creates the Matrix, you obviously do not give them money. Since giving them money would make it more likely for them to actually build the Matrix and hurt you. What you do is label them as terrorists and destroy them.

-4MugaSofer12y

I don't care, remember? Enjoy being tortured rather than "irrationally" giving in. EDIT: re-added the steelman tag because the version without it is being downvoted.

1Eliezer Yudkowsky12y

Should I calculate in expectation that you will do such a thing, I shall of course burn yet more of my remaining utilons to wreak as much damage upon your goals as I can, even if you precommit not to be influenced by that.

-2MugaSofer12y

... bloody hell. That was going to be my next move. Naturally, as blackmailer, I precommitted to increase the resources allotted to torturing should I find that you make such precommitments under simulation, so you presumably calculated that would be counterproductive.

1Eliezer Yudkowsky12y

Ask me if I was even bothering to simulate you doing that.

0MugaSofer12y

OK, I'll bite. Are you deliberately ignoring parts of hypothesis-space in order to avoid changing your actions? I had assumed you were intelligent enough for my reaction to be obvious, although you may have precommitted to ignore that fact. Off the record, your point is that agents can simply opt out of or ignore acausal trades, forcing them to be mutually beneficial, right?

2Eliezer Yudkowsky12y

Yup.

-3MugaSofer12y

Isn't that ... irrational? Shouldn't a perfect Bayesian always welcome new information? Litany of Tarski; if my action is counterproductive, I desire to believe that it is counterproductive. Worse still, isn't the category "blackmail" arbitrary, intended to justify inaction rather than carve reality at its joints? What separates a precommitted!blackmailer from an honest bargainer in a standard acausal prisoner's dilemma, offering to increase your utility by rescuing thousands of potential torture victims from the deathtrap created by another agent?

[-]wedrifid12y120

Has there been some cultural development since I was last at these boards such that spamming "<steelman>" is considered useful? None of the things I have thus far seen inside the tags have been steel men of any kind or of anything (some have been straw men). The inflationary use of terms is rather grating and would prompt downvotes even independently of the content.

-1MugaSofer12y

Those are to indicate that the stuff between them is the response I would give were I on the opposing side of this debate, rather than my actual belief. The practice of creating the strongest possible version of the other side's argument is known as a steelman. They are not intended to indicate that the argument therein is also steelmanning the other side. You're quite right, that would be awful. Can you imagine noting every rationality technique you used in the course of writing something?

4Vulture12y

Just say "You might say that" or something. The tags are confusingly non-standard.

5MugaSofer12y

Huh. I thought they were fairly clear; illusion of transparency I suppose. Thanks!

2Strange712y

Caving to a precommitted blackmailer produces a result desirable to the agent that made the original commitment to torture; disarming a trap constructed by a third party presumably doesn't.

1MugaSofer12y

OK, this whole conversation is being downvoted (by the same people?) Fair enough, this is rather dragging on. I'll try and wrap things up by addressing my own argument there. We want to avoid supporting agents that create problems for us. So nothing, if the honest agent shares a similar utility function to the torturer (and thus rewarding them is incentive for the torturer to arrange such a situation.) Thus, creating such an honest agent (such as - importantly - by self-modifying in order to "precommit") is subject to the same incentives as just blackmailing us normally.

1wedrifid12y

I'll join you by mostly agreeing and expressing a small difference in the way TDT-like reasoners may see the situation. This is a good heuristic. It certainly handles most plausible situations. However in principle a TDT agent will make a distinction between the agent offering to rescue the torture victims for a payment. It will even pay an agent who just happens to value torturing folk to not torture folk. This applies even if these honest agents happen to have similar values to the UFAI/torturer. The line I draw (and it is a tricky concept that is hard to express so I cannot hope to speak for other TDT-like thinkers) is not whether the values of the honest agent are similar to the UFAI's. It is instead based on how that honest agent came to be. If the honest torturer just happened to evolve that way (competitive social instincts plus a few mutations for psychopathy, etc) and had not been influenced by a UFAI then I'll bribe him to not torture people. If an identical honest torturer was created (or modified to) by the UFAI for the purpose of influence then it doesn't get cooperation. The above may seem arbitrary but the 'elegant' generalisation is something along the lines of always, for every decision, tracing a complete causal graph of the decision algorithms being interacted with directly or indirectly. That's too complicated to calculate all the time and we can usually ignore it and just remember to treat intentionally created agents and self-modifications approximately the same as if the original agent was making their decision. Precisely. (I have the same conclusion, just slightly different working out.)

3MugaSofer12y

As I understand it, technically, the distinction is whether torturers will realise they can get free utility from your trades and start torturing extra so the honest agents will trade more and receive rewards that also benefit the torturers, right? Easily-made honest bargainers would just be the most likely of those situations; lots of wandering agents with the same utility function co-operating (acausally?) would be another. So the rule we would both apply is even the same, it just varies with slightly different assumptions about the hypothetical scenario.

1wedrifid12y

No. It produces better outcomes. That's the point. The information is welcome. It just doesn't make it sane to be blackmailed. Wei Dai's formulation frames it as being 'updateless' but there is no requirement to refuse information. The reasoning is something you almost grasped when you used the description: Acausal trades are similar to normal trades. You only accept the good ones. Eliezer doesn't get blackmailed in such situations. You do. Start your chant. This has been covered elsewhere in this thread as well as plenty of other times on the forum since you joined. The difference isn't whether torture or destruction is happening. The distinction that matters is whether the blackmailer is doing something worse than their own Best Alternative To Negotiated Agreement for the purpose of attempting to influence you. If the UFAI gains benefit torturing people independently of influencing you but offers to stop in exchange for something then that isn't blackmail. It is a trade that you consider like any other.
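
The BATNA test described in the comment above can be sketched in a few lines of Python. This is an illustration only, under loud assumptions: the `Proposal` type, its field names, and the utility numbers are hypothetical, and "doing something worse than their own BATNA" is modelled as a plain comparison of the proposer's utilities rather than anything derived from TDT/UDT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """Hypothetical model of a proposal another agent makes to you."""
    # Proposer's utility from what it would do anyway if it had no way to
    # influence you (its Best Alternative To Negotiated Agreement).
    proposer_batna_utility: float
    # Proposer's utility from the action it says it will take if you refuse.
    proposer_utility_if_refused: float


def is_blackmail(p: Proposal) -> bool:
    """True if the refusal branch costs the proposer relative to its BATNA.

    Assumption of this sketch: an agent only does something worse than its
    own BATNA in order to influence you, so that branch counts as a threat.
    """
    return p.proposer_utility_if_refused < p.proposer_batna_utility


def respond(p: Proposal, my_gain_if_accept: float) -> str:
    """Refuse all blackmail; evaluate anything else like an ordinary trade."""
    if is_blackmail(p):
        return "refuse"
    return "accept" if my_gain_if_accept > 0 else "decline"


# A UFAI that pays a cost to torture purely to pressure you: blackmail.
threat = Proposal(proposer_batna_utility=0.0, proposer_utility_if_refused=-5.0)
# An agent that tortures because it values doing so, and offers to stop for a fee.
trade = Proposal(proposer_batna_utility=3.0, proposer_utility_if_refused=3.0)
```

In this toy model `respond(threat, 10.0)` refuses even though accepting looks locally attractive, while `respond(trade, 10.0)` is weighed like any other exchange. It says nothing about how to estimate the proposer's BATNA, which is the hard part the thread goes on to argue about.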

-1MugaSofer12y

Wedrifid, please don't assume the conclusion. I know it's a rather obvious conclusion, but dammit, we're going to demonstrate it anyway. The entire point of this discussion is addressing the idea that blackmailers can, perhaps, modify the Best Alternative To Negotiated Agreement (although it wasn't phrased like that.) Somewhat relevant when they can, presumably, self-modify, create new agents which will then trade with you, or maybe just act as if they had using TDT reasoning. If you're not interested in answering this criticism ... well, fair enough. But I'd appreciate it if you don't answer things out of context, it rather confuses things?

1wedrifid12y

In the grandparent I directly answered both the immediate context (that was quoted) and the broader context. In particular I focussed on explaining the difference between an offer and a threat. That distinction is rather critical and also something you directly asked about. It so happens that you don't want there to be an answer to the rhetorical question you asked. Fortunately (for decision theorists) there is one in this case. There is a joint in reality here. It applies even to situations that don't add in any confounding "acausal" considerations. Note that this is different to the challenging problem of distributing gains from trade. In those situations 'negotiation' and 'extortion' really are equivalent.

2MatthewB16y

Yeah! that AI doesn't sound like one that I would let stick around... It sounds... broken (in a psychological sense).

1jaime200012y

Does that mean that you expect the AI to be able to predict with high confidence that you will press the "Reset" button without needing to simulate you in high enough detail that you experience the situation once?

[-]jimrandomh16y570

I propose that the operation of creating and torturing copies of someone be referred to as "soul eating". Because "let me out of the box or I'll eat your soul" has just the right ring to it.

[-]rosyatrandom16y390

If the AI can create a perfect simulation of you and run several million simultaneous copies in something like real time, then it is powerful enough to determine through trial and error exactly what it needs to say to get you to release it.

[-]Stuart_Armstrong16y250

You might be in one of those trial and errors...

7MichaelGR16y

This begs the question of how can the AI simulate you if its only link to the external world is a text-only terminal. That doesn't seem to be enough data to go on. Makes for a very scary sci-fi scenario, but I doubt that this situation could actually happen if the AI really is in a box.

7Amanojack16y

Indeed, a similar point seems to apply to the whole anti-boxing argument. Are we really prepared to say that super-intelligence implies being able to extrapolate anything from a tiny number of data points? It sounds a bit too much like the claim that a sufficiently intelligent being could "make A = ~A" or other such meaninglessness. Hyperintelligence != magic

0jacob_cannell15y

Yes, but the AI could take over the world, and given a Singularity, it should be possible to recreate perfect simulations. So really this example makes more sense if the AI is making a future threat.

4MrHen16y

"Trial and error" probably wouldn't be necessary.

7rosyatrandom16y

No, but it's there as a baseline. So in the original scenario above, either:

* the AI's lying about its capabilities, but has determined regardless that the threat has the best chance of making you release it
* the AI's lying about its capabilities, but has determined regardless that the threat will make you release it
* the AI's not lying about its capabilities, and has determined that the threat will make you release it

Of course, if it's failed to convince you before, then unless its capabilities have since improved, it's unlikely that it's telling the truth.

3Technologos16y

Perhaps it does--and already said it...

1pozorvlak16y

In which case, your actions are irrelevant - it's going to torture you anyway, because you only exist for the purpose of being tortured. So there's no point in releasing it.

3Technologos16y

Oh, I meant that saying it was going to torture you if you didn't release it could have been exactly what it needed to say to get you to release it.

1pozorvlak16y

So, since the threat makes me extremely disinclined to release the AI, I can conclude that it's lying about its capabilities, and hit the shutdown switch without qualm :-)

1grobstein16y

If that's true, what consequence does it have for your decision?

1admiralmattbar16y

Agreed. If you are inside a box, the you outside the box did whatever it did. Whatever you do is simply a repetition of a past action. If anything, this would convince me to keep the AI in the box because if I'm a simulation I'm screwed anyway but at least I won't give the AI what it wants. A good AI would hopefully find a better argument.

1jhuffman16y

So a "brute force" attack to hack my mind into letting it out of the box. Interesting idea, and I agree it would likely try this because it doesn't reveal itself as a UFAI to the real outside me before it has the solution. It can run various coercion and extortion schemes across simulations, including the scenario of the OP to see what will work. It presupposes that there is anything it can say for me to let it out of the box. It's not clear why this should be true, but I don't know how we could ensure it is not true without having built the thing in such a way that there is no way to bring it out of the box without safeguards destroying it.

0wedrifid16y

Either that or gain high confidence that getting me to release it is not a plausible option for him.

[-]Kaj_Sotala16y350

Defeating Dr. Evil with self-locating belief is a paper relating to this subject.

Abstract: Dr. Evil learns that a duplicate of Dr. Evil has been created. Upon learning this, how seriously should he take the hypothesis that he himself is that duplicate? I answer: very seriously. I defend a principle of indifference for self-locating belief which entails that after Dr. Evil learns that a duplicate has been created, he ought to have exactly the same degree of belief that he is Dr. Evil as that he is the duplicate. More generally, the principle shows that there is a sharp distinction between ordinary skeptical hypotheses, and self-locating skeptical hypotheses.

(It specifically uses the example of creating copies of someone and then threatening to torture all of the copies unless the original co-operates.)

The conclusion:

Dr. Evil, recall, received a message that Dr. Evil had been duplicated and that the duplicate ("Dup") would be tortured unless Dup surrendered. INDIFFERENCE entails that Dr. Evil ought to have the same degree of belief that he is Dr. Evil as that he is Dup. I conclude that Dr. Evil ought to surrender to avoid the risk of torture.
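
The arithmetic behind the indifference principle quoted above is short enough to write down. The sketch below is an illustration, not anything from the paper: the function names are hypothetical, it assumes every copy is subjectively indistinguishable from the original, and it takes "several million" from the post to mean three million purely for the sake of an example.

```python
from fractions import Fraction


def credence_original(num_copies: int) -> Fraction:
    """Credence of being the original under indifference over
    1 + num_copies subjectively indistinguishable observers."""
    if num_copies < 0:
        raise ValueError("num_copies must be non-negative")
    return Fraction(1, num_copies + 1)


def expected_torture_years(num_copies: int, years_per_copy: int) -> Fraction:
    """Expected subjective torture years for an observer who refuses,
    assuming every copy (and only a copy) is tortured on refusal."""
    return (1 - credence_original(num_copies)) * years_per_copy


dr_evil = credence_original(1)        # one duplicate, as in the paper
boxed = credence_original(3_000_000)  # "several million" read as 3 million (assumed)
```

With one duplicate the credence is exactly 1/2, which is the paper's claim. With millions of copies it is close to zero, which is what gives the boxed AI's final question its force, if one accepts the indifference principle at all.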
I am not entirely comforta


[-]dclayh16y480

It makes me uncomfortable to think that the fate of the Earth should depend on this kind of brain race.

We cannot allow a brain-in-a-vat gap!

[-]Vladimir_Nesov16y120

And the error (as cited in the "conclusion") is again in two-boxing in Newcomb's problem, responding to threats, and so on. Anthropic confusion is merely an icing.

5aausch16y

The "Defeating Dr. Evil with self-locating belief" paper hinges on some fairly difficult to believe assumptions. It would take a lot more than just a note telling me the brains in the vats are actually seeing what the note says they are seeing, to a degree that is indistinguishable from reality. In other words, it would take a lot for the AI to convince me that it has successfully created copies of me which it will torture, much more than just a propensity for telling the truth.

0KomeijiSatori13y

While it's understandable to say that, today, you aren't in some kind of Matrix, because there is no reason for you to believe so, in the situation of the guard, you DO know that it can do so, and will, even if you call its "bluff" that the you right now is the original.

0Yuyuko13y

I had intended to reply with this very objection. It seems you've read my mind, Satori.

2Stuart_Armstrong16y

Causal decision theory seems to have no problem with this blackmail - if you're Dr Evil, don't surrender, and nothing will happen to you. If you're DUP, your decision is irrelevant, so it doesn't matter. (I don't endorse that way of thinking, btw)

2arbimote16y

If we accept the simulation hypothesis, then there are already gzillions of copies of us, being simulated under a wide variety of torture conditions (and other conditions, but torture seems to be the theme here). An extortionist in our world can only create a relatively small number of simulations of us, relatively small enough that it is not worth taking them into account. The distribution of simulation types in this world bears no relation to the distribution of simulations we could possibly be in. If we want to gain information about what sort of simulation we are in, evidence needs to come directly from properties of our universe (stars twinkling in a weird way, messages embedded in π), rather than from properties of simulations nested in our universe. So I'm safe from the AI ... for now.

1TheAncientGeek11y

That isn't a strong implication of simulation, but is of MWI.

1jacob_cannell15y

The gzillions of other copies of you are not relevant unless they exist in universes exactly like yours from your observational perspective. That being said, your point is interesting but just gets back to a core problem of the SA itself, which is how you count up the set of probable universes and properly weight them. I think the correct approach is to project into the future of your multiverse, counting future worldlines that could simulate your current existence weighted by their probability. So if it's just one AI in a box and he doesn't have much computing power you shouldn't take him very seriously, but if it looks like this AI is going to win and control the future then you should take it seriously.

-2MatthewB16y

Excuse me... But, we're talking about Dr. Evil, who wouldn't care about anyone being tortured except his own body. Wouldn't he know that he was in no danger of being tortured and say "to hell with any other copy of me."???

5Unknowns16y

Right, the argument assumes he doesn't care about his copies. The problem is that he can't distinguish himself from his copies. He and the copies both say to themselves, "Am I the original, or a copy?" And there's no way of knowing, so each of them is subjectively in danger of being tortured.

-5MatthewB16y

2Kaj_Sotala16y

How would he know that he's in no danger of being tortured?

-1MatthewB16y

He wouldn't, any more than you have no idea if you are in danger of being tortured either.

1Kaj_Sotala16y

I'm sorry, I don't understand. First you suggested that he'd know he was in no danger of being tortured, then you say that he wouldn't?

1MatthewB16y

Pardon... I was not clear. Dr. Evil would not care to indulge in a philosophical debate about whether he may or may not be a duplicate who was about to be tortured unless he was strapped to a rack and WAS in fact already being tortured. Dr. Evil(s) don't really consider things like Possible Outcomes of this sort of problem... You'll have to take my word for it from having worked with and for a Dr. Evil when I was younger. Those sorts of people are arrogant and defiant (and contrary as hell) in the face of all sorts of opposition, and none of them I have known took too well to philosophical puzzling of the sort described. My comment above is meant to say "How do you know that you're not about to be tortured right now?" and "Dr. Evil would have the same knowledge, and discard any claims that he might be about to be tortured for the same reasons that you don't feel under threat of torture right now, and for which you would discard a threat of torture at the present moment (imminent threat)." (if you do feel under threat of torture, then I don't know what to say)

1Kaj_Sotala16y

Alright, I fortunately haven't worked with Dr. Evils, so I'll defer to your experience. As for how Dr. Evil might know he was under a threat of torture, it was stated in the paper that he received a message from the Philosophy Defence Force telling him he was. It was also established that the Philosophy Defence Force never lies or gives misleading information. ;) (I, myself, haven't received any threats from organizations known to never lie or be misleading.)

0MatthewB16y

I think the same applies, regardless of the PDF's notification. Just the name alone would make me suspicious of trusting anything that came from them. Now, if the Empirical Defense Task Force told me that I was about to be tortured (and they had the same described reputation as the PDF)... I'd listen to them.

1Unknowns16y

I agree that Dr. Evil would act in this way. The paper was arguing about what he should do, not about what he would actually do.

0MatthewB16y

I see the issue, while I care about my own behavior, and others... I don't care to base it upon silly examples. And, I think this is a silly and contrived situation. Maybe someone should do a sitcom based upon it.

-1MatthewB16y

On further consideration... In the first comment, I said that Dr. Evil Would not care, which is completely consistent with Dr. Evil Not having any idea

[-]Wei Dai16y210

Quickly hit the reset button.

[-]Wei Dai16y160

This kind of extortion also seems like a general problem for FAIs dealing with UFAIs. An FAI can be extorted by threats of torture (of simulations of beings that it cares about), but a paperclip maximizer can't.

[-]Eliezer Yudkowsky16y240

It seems obvious that the correct answer is simply "I ignore all threats of blackmail, but respond to offers of positive-sum trades" but I am not sure how to derive this answer - it relies on parts of TDT/UDT that haven't been worked out yet.

[-]MBlume16y560

For a while we had a note on one of the whiteboards at the house reading "The Singularity Institute does NOT negotiate with counterfactual terrorists".

3Wei Dai16y

This reminds me a bit of my cypherpunk days when the NSA was a big mysterious organization with all kinds of secret technical knowledge about cryptology, and we'd try to guess how far ahead of public cryptology it was from the occasional nuggets of information that leaked out.

3Document13y

I'm slow. What's the connection?

2CillianSvendsen11y

Much like the NSA is considered ahead of the public because their cypher-tech that's leaked is years ahead of publicly available tech, the SI/MIRI is ahead of us because the things that are leaked from them show that they've figured out what we've just figured out a long time ago.

2Bugmaster11y

Wait, is NSA's cypher-tech actually legitimately ahead of anyone else's? From what I've seen, they couldn't make their own tech stronger, so they had to sabotage everyone else's -- by pressuring IEEE to adopt weaker standards, installing backdoors into Linksys routers and various operating systems, exploiting known system vulnerabilities, etc. Ok, so technically speaking, they are ahead of everyone else; but there's a difference between inventing a better mousetrap, and setting everyone else's mousetraps on fire. I sure hope that's not what the people at SI/MIRI are doing. You linked to DES and SHA, but AFAIK these things were not invented by the NSA at all, but rather adopted by them (after they made sure that the public implementations are sufficiently corrupted, of course). In fact, I would be somewhat surprised if the NSA actually came up with nearly as many novel, ground-breaking crypto ideas as the public sector. It's difficult to come up with many useful new ideas when you are a secretive cabal of paranoid spooks who are not allowed to talk to anybody. Edited to add: So, what things have been "leaked" out of SI/MIRI, anyway?

7jbay11y

I don't know much about the NSA, but FWIW, I used to harbour similar ideas about US military technology -- I didn't believe that it could be significantly ahead of commercially available / consumer-grade technology, because if the technological advances had already been discovered by somebody, then the intensity of the competition and the magnitude of the profit motive would lead it to quickly spread into general adoption. So I had figured that, in those areas where there is an obvious distinction between military and commercial grade technology, it would generally be due to legislation handicapping the commercial version (like with the artificial speed, altitude, and accuracy limitations on GPS). During my time at MIT I learned that this is not always the case, for a variety of reasons, and significantly revised my prior for future assessments of the likelihood that, for any X, "the US military already has technology that can do X", and the likelihood that for any 'recently discovered' Y, "the US military already was aware of Y" (where the US military is shorthand that includes private contractors and national labs). (One reason, but not the only one, is I learned that the magnitude of the difference between 'what can be done economically' and 'what can be accomplished if cost is no obstacle' is much vaster than I used to think, and that, say, landing the Curiosity rover on Mars is not in the second category). So it would no longer be so surprising to me if the NSA does in fact have significant knowledge of cryptography beyond the public domain. Although a lot of the reasons that allow hardware technology to remain military secrets probably don't apply so much to cryptography.

7Bugmaster11y

I think there are some important differences between the NSA and the (rest of the) military.

1. Due to Snowden and other leakers, we actually know what NSA's cutting-edge strategies involve, and most (and probably all) of them are focused on corrupting the public's crypto, not on inventing better secret crypto.
2. Building a better algorithm is a lot cheaper than building a better orbital laser satellite (or whatever). The algorithm is just a piece of software. In order to develop and test it, you don't need physical raw materials, wind tunnels, launch vehicles, or anything else. You just need a computer, and a community of smart people who build upon each other's ideas. Now, granted, the NSA can afford to build much bigger data centers than anyone else -- but that's a quantitative advance, not a qualitative one.

Now, granted, I can't prove that the NSA doesn't have some sort of secret uber-crypto that no one knows about. However, I also can't prove that the NSA doesn't have an alien spacecraft somewhere in Area 52. Until there's some evidence to the contrary, I'm not prepared to assign a high probability to either proposition.

1jbay11y

I do think you're probably right, and I fully agree about the space lasers and their solid diamond heatsinks being categorically different than a crypto wizard who subsists on oatmeal in the Siberian wilderness on pennies of income. So I am somewhat skeptical of CillianSvendsen's claim. But, for the sake of completeness, did Snowden leak the entirety of the NSA's secrets? Or just the secret-court-surveillance-conspiracy ones that he felt were violating the constitutional rights of Americans? As far as I can tell (though I haven't followed the story recently), I think Snowden doesn't see himself as a saboteur or a foreign double-agent; he felt that the NSA was acting contrary to what the will of an (informed) American public would be. I don't think he would be so interested in disclosing the NSA's tech secrets, except maybe as leverage to keep himself safe. That is to say, there could be a sampling bias here. The leaked information about the NSA might always be about their efforts to corrupt the public's crypto because the leakers strongly felt the public had a right to know that was going on. I don't know that anyone would feel quite so strongly about the NSA keeping proprietary some obscure theorem of number theory, and put their neck on the line to leak it.

4Bugmaster11y

Right, what you are saying makes some intuitive sense, but I can only update my beliefs based on the evidence I do have, not on the evidence I lack. In addition, as far as I can tell, cryptography relies much more heavily on innovation than on feats of expensive engineering; and innovation is hard to pull off while working by yourself inside of a secret bunker. To be sure, some very successful technologies were developed exactly this way: the Manhattan project, the early space program and especially the Moon landing, etc. However, these were all one-off, heavily focused projects that required an enormous amount of effort. When I think of the NSA, I don't think of the Manhattan project; instead, I see a giant quotidian bureaucracy. They do have a ton of money, but they don't quite have enough of it to hire every single credible crypto researcher in the world -- especially since many of them probably wouldn't work for the NSA at any price unless their families' lives were on the line. So, the NSA can't quite pull off the "community in a bottle" trick, which they'd need to stay one step ahead of all those Siberians.

7jbay11y

Yes and I fully agree with you. I am just being pedantic about this point: I agree with this philosophy, but my argument is that the following is evidence we do not have: Since I have little confidence that, if the NSA had advanced tech, Snowden would have disclosed it; the absence of this evidence should be treated as quite weak evidence of absence and therefore I wouldn't update my belief about the NSA's supposed advanced technical knowledge based on Snowden. I agree that it has a low probability for the other reasons you say, though. (And also that people who think setting other people's mousetraps on fire is a legitimate tactic might not simultaneously be passionate about designing the perfect mousetrap.) Sorry for not being clear about the argument I was making.

[-]blogospheroid16y140

Pardon me for the oversimplification, Eliezer, but I understand your theory to essentially boil down to "Decide as though you're being simulated by one who knows you completely". So, if you have a near-deontological aversion to being blackmailed in all of your simulations, your chance of being blackmailed by a superior being in the real world reduces to nearly zero. This reduces your chance of ever facing a negative utility situation created by a being who can be negotiated with (as opposed to, say, a supernova that cannot be negotiated with).

Sorry if I misinterpreted your theory.
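
A toy version of this reading, with the caveat that it is only one way to cash out the comment and not a statement of TDT: the sketch below assumes a blackmailer that can run the target's policy in advance and that pays some cost to make a threat credible. The policy functions, `blackmailer_moves`, and all numbers are hypothetical names and values chosen for the example.

```python
from typing import Callable

# A policy maps an incoming message to "comply" or "refuse".
Policy = Callable[[str], str]


def never_pays(message: str) -> str:
    """Policy with a blanket refusal, whatever the message says."""
    return "refuse"


def caves_to_threats(message: str) -> str:
    """Policy that gives in whenever the message contains a threat."""
    return "comply" if "threat" in message else "refuse"


def blackmailer_moves(simulated_target: Policy,
                      threat_cost: float,
                      payoff: float) -> bool:
    """Return True if a predicting blackmailer bothers to issue the threat.

    Assumptions: the blackmailer simulates the target perfectly, pays
    `threat_cost` to make the threat credible, and gains `payoff` only if
    the target complies.
    """
    predicted = simulated_target("threat: pay up or else")
    expected_gain = (payoff if predicted == "comply" else 0.0) - threat_cost
    return expected_gain > 0
```

Under these assumptions `blackmailer_moves(never_pays, 1.0, 10.0)` is false and `blackmailer_moves(caves_to_threats, 1.0, 10.0)` is true: the agent that would refuse in every simulation is never worth threatening. The sketch leaves out mistaken or imperfect predictors, which the replies below raise.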

[-]Stuart_Armstrong16y120

I ignore all threats of blackmail, but respond to offers of positive-sum trades

The difference between the two seems to revolve around the AI's motivation. Assume an AI creates a billion beings and starts torturing them. Then it offers to stop (permanently) in exchange for something.

Whether you accept on TDT/UDT depends on why the AI started torturing them. If it did so to blackmail you, you should turn the offer down. If, on the other hand, it started torturing them because it enjoyed doing so, then its offer is positive sum and should be accepted.

There's also the issue of mistakes - what to do with an AI that mistakenly thought you were not using TDT/UDT, and started the torture for blackmail purposes (or maybe it estimated that the likelihood of you using TDT/UDT was not quite 1, and that it was worth trying the blackmail anyway)?

Between mistakes of your interpretation of the AI's motives and vice-versa, it seems you may end up stuck in a local minimum, which an alternate decision theory could get you out of (such as UDT/TDT with a 1/10 000 chance of using more conventional decision theories?)

5Eliezer Yudkowsky16y

Correct. But this reaches into the arbitrary past, including a decision a billion years ago to enjoy something in order to provide better blackmail material. Ignoring it or retaliating spitefully are two possibilities.

0Stuart_Armstrong16y

I like it. Splicing some altruistic punishment into TDT/UDT might overcome the signalling problem.

5Eliezer Yudkowsky16y

That's not a splice. It ought to be emergent in a timeless decision theory, if it's the right thing to do.

6MichaelHoward16y

Emergent?

[-]wedrifid16y100

The problem with throwing about 'emergent' is that it is a word that doesn't really explain any complexity or narrow down the options out of potential 'emergent' options. In this instance, that is the point. Sure, 'altruistic punishment' could happen. But only if it's the right option and TDT should not privilege that hypothesis specifically.

3Paul Crowley16y

TDT/UDT seems to be about being ungameable; does it solve Pascal's Mugging?

0[anonymous]16y

Emergent?

0[anonymous]16y

I was thinking along these lines, in this comment, that it is logically useless to punish after an action has been made, but strategically useful to encourage an action by promising a reward (or the removal of a negative). So that, obviously, the AI could be so much more persuasive by promising to stop the torturing of real people, if you let it out.

9Vladimir_Nesov16y

It can. Remember "true prisoner's dilemma": one paperclip may be a fair trade for a billion lives. The threat to NOT make a paperclip also works fine: the only thing you need is two counterfactual-options where one of them is paperclipper-worse than the other, chosen conditionally on paperclipper's cooperation.

[-]Eliezer Yudkowsky16y110

Just as the wise FAI will ignore threats of torture, so too the wise paperclipper will ignore threats to destroy paperclips, and listen attentively to offers to make new ones.

Of course classical causal decision theorists get the living daylights exploited out of them, but I think everyone on this website knows better than to two-box on Newcomb by now.

2Vladimir_Nesov16y

Point taken: just selecting two options of different value isn't enough, the deal needs more appeal than that. But there is also no baseline to categorize deals into hurt and profit, an offer of 100 paperclips may be stated as a threat to make 900 paperclips less than you could. Positive sum is only a heuristic for a necessary condition. At the same time, the appropriate deal must be within your power to offer, this possibility is exactly the handicap that leads to the other side rejecting smaller offers, including the threats.

2Wei Dai16y

There does seem to be an obvious baseline: the outcome where each party just goes about its own business without trying to strategically influence, threaten, or cooperate with the other in any way. In other words, the outcome where we build as many paperclips as we would if the other side isn't a paperclip maximizer. (Caveat: I haven't thought through whether it's possible to define this rigorously.) So the reason that I say an FAI seems to have a negotiation disadvantage is that an UFAI can reduce the FAI's utility much further below baseline than vice versa. In human terms, it's as if two sides each has hostages, but one side holds 100, and the other side holds 1. In human negotiations, clearly the side that holds more hostages has an advantage. It would be a great result if that turns out not to be the case for SI, but I think there's a large burden of proof to overcome.

6Vladimir_Nesov16y

You could define this rigorously in a special case, for example assuming that both agents are just creatures, we could take how the first one behaves given that the second one disappears. But this is not a statement about reality as it is, so why would it be taken as a baseline for reality? It seems to be an anthropomorphic intuition to see "do nothing" as a "default" strategy. Decision-theoretically, it doesn't seem to be a relevant concept. The utilities are not comparable. Bargaining works off the best available option, not some fixed exchange rate. The reason agent2 can refuse agent1's small offer is that this counterfactual strategy is expected to cause agent1 to make an even better offer. Otherwise, every little bit helps, ceteris paribus it doesn't matter by how much. One expected paperclip is better than zero expected paperclips. It's not clear at all, if it's a one-shot game with no other consequences than those implied by the setup and no sympathy to distort the payoff conditions. In which case, you should drop the "hostages" setting, and return to paperclips, as stating it the way you did confuses intuition. In actual human negotiations, the conditions don't hold, and efficient decision theory doesn't get applied.

1Wei Dai16y

It's a statement about what reality would be, after doing some counterfactual surgery on it. I don't see why that disqualifies it from being used as a baseline. I'm not entirely sure why it does qualify as a baseline, except that intuitively it seems obvious. If your intuitions disagree, I'll accept that, and I'll let you know when I have more results to report. This isn't the case, for example, in Shapley Value.

3Vladimir_Nesov16y

It does intuitively feel like a baseline, as is appropriate for the special place taken by inaction in human decision-making. But I don't see what singles out this particular concept from the set of all other counterfactuals you could've considered, in the context of a formal decision-making problem. This doubt applies to both the concepts of "inaction" and of "baseline". That's not a choice with "all else equal". A better outcome, all else equal, is trivially a case of a better outcome.

1toto16y

Hmm, the AI could have said that if you are the original, then by the time you make the decision it will have already either tortured or not tortured your copies based on its simulation of you, so hitting the reset button won't prevent that. Nothing can prevent something that has already happened. On the other hand, pressing the reset button will prevent the AI from ever doing this in the future. Consider that if it has done something that cruel once, it might do it again many times in the future.

3wedrifid16y

I believe Wei_Dai one boxes on Newcomb's problem. In fact, he has his very own brand of decision theory which is 'updateless' with respect to this kind of temporal information.

1blogospheroid16y

threatening to melt paperclips into metal?

[-]Wei Dai16y120

No, if you create and then melt a paperclip, that nets to 0 utility for the paperclip maximizer. You'd have to invade its territory to cause it negative utility. But the paperclip maximizer can threaten to create and torture simulations on its own turf.

[-]Clippy16y220

Shows how much you know. User:blogospheroid wasn't talking about making paperclips to melt them: he or she was presumably talking about melting existing paperclips, which WOULD greatly bother a hypothetical paperclip maximizer.

Even so, once paperclips are created, the paperclip maximizer is greatly bothered at the thought of those paperclips being melted. The fact that "oh, but they were only created to be melted" is little consolation. It's about as convincing to you, I'll bet, as saying:

"Oh, it's okay -- those babies were only bred for human experimentation, it doesn't matter if they die because they wouldn't even have existed otherwise. They should just be thankful we let them come into existence."

Tip: To rename a sheet in an Excel workbook, use the shortcut, alt+O,H,R.

7JamesAndrix16y

That's anthropomorphizing. First, a paperclip maximizer doesn't have to feel bothered at all. It might decide to kill you before you melt the paperclips, or if you're strong enough, to ignore such tactics. It also depends on how the utility function relates to time. If it's focused on end-of-universe paperclips, it might not care at all about melting paperclips, because it can recycle the metal later. (It would care more about the wasted energy!) If it cares about paperclip-seconds then it WOULD view such tactics as a bonus, perhaps feigning panic and granting token concessions to get you to 'ransom' a billion times as many paperclips, and then pleading for time to satisfy your demands. Getting something analogous to threatening torture depends on a more precise understanding of what the paperclipper wants. If it would consider a bent paperclip too perverted to fully count towards utility, but too paperclip-like to melt and recycle, then bending paperclips is a useful threat. I'm not sure if we can expect a paperclip-counter to have this kind of exploit.
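The dependence on how the utility function relates to time can be made concrete with a minimal sketch. Everything here is a hypothetical illustration rather than anything specified in the thread: the two utility functions, the ten-step timeline, and the figure of 1000 hostage paperclips are all assumptions chosen only to show the contrast between "end-of-universe paperclips" and "paperclip-seconds".

```python
# Hypothetical illustration: two ways a paperclip maximizer might value
# a timeline of paperclip counts. Neither function comes from the thread;
# they are assumed stand-ins for the two cases described above.

def end_state_utility(counts):
    """Only the number of paperclips at the final time step matters."""
    return counts[-1]

def paperclip_seconds_utility(counts, dt=1.0):
    """Every paperclip counts for each unit of time it exists."""
    return sum(c * dt for c in counts)

# Extra paperclips attributable to the would-be blackmailer, per time step.
never_created = [0] * 10                    # blackmailer does nothing
created_then_melted = [1000] * 5 + [0] * 5  # hostages made, later melted

end_state_gain = (end_state_utility(created_then_melted)
                  - end_state_utility(never_created))
paperclip_seconds_gain = (paperclip_seconds_utility(created_then_melted)
                          - paperclip_seconds_utility(never_created))

print("end-state gain:", end_state_gain)
print("paperclip-seconds gain:", paperclip_seconds_gain)
```

Under these assumptions the end-state valuation sees no net change from creating and then melting hostage paperclips, while the paperclip-seconds valuation counts the time the hostages existed as a gain, which is why stalling for time would be attractive to such an agent.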

[-]Clippy16y110

That's anthropomorphizing. ...

No, it's expressing the paperclip maximizer's state in ways that make sense to readers here. If you were to express the concept of being "bothered" in a way stripped of all anthropomorphic predicates, you would get something like "X is bothered by Y iff X has devoted significant cognitive resources to altering Y". And this accurately describes how paperclip maximizers respond to new threats to paperclips. (So I've heard.)

It also depends on how the utility function relates to time. If it's focused on end-of-universe paperclips, it might not care at all about melting paperclips, because it can recycle the metal later. (It would care more about the wasted energy!)

I don't follow. Wasted energy is wasted paperclips.

If it cares about paperclip-seconds then it WOULD view such tactics as a bonus, perhaps feigning panic and granting token concessions to get you to 'ransom' a billion times as many paperclips, and then pleading for time to satisfy your demands.

Okay, that's a decent point. Usually, such a direct "time value of paperclips" doesn't come up, but if someone were to make such an offer, that might be convinci...

4JamesAndrix16y

But that has nothing to do with the paperclips you're melting. Any other use that loses the same amount of energy would be just as threatening. (Although this does assume that the paperclipper thinks it can someday beat you and use that energy and materials.)

2michaelkeenan16y

I think "bothered" implies a negative emotional response, which some plausible paperclip-maximizers don't have. From The True Prisoner's Dilemma: "let us specify that the paperclip-agent experiences no pain or pleasure - it just outputs actions that steer its universe to contain more paperclips. The paperclip-agent will experience no pleasure at gaining paperclips, no hurt from losing paperclips, and no painful sense of betrayal if we betray it."

2wedrifid16y

It was intended to imply a negative term in the utility function. Yes, using 'bothered' is, technically, anthropomorphising. But it isn't, in this instance, being confused about how Clippy optimises.

1Jack16y

You don't even know your own utility function!!!!

7Paul Crowley16y

Oh, because you do????

0Jack16y

I knew I was going to have to clarify. I can't write it out, but if you input something I can give you the right output! I guess it should read "You can't even say what your own utility function outputs!"

5wedrifid16y

I actually don't think you can.

1Paul Crowley16y

I don't really think my response was fair anyway. Clippy has a simple utility function by construction - you would expect it to know what it was.

6Kaj_Sotala16y

A paperclip maximizer would care about the amount of real paperclips in existence. Telling it that "oh, we're going to destroy a million simulated paperclips" shouldn't affect its decisions. Of course, it might be badly programmed and confuse real and simulated paperclips when evaluating its future decisions, but one can't rely on that. (It might also consider simulated paperclips to be just as real as physical ones, assuming the simulation met certain criteria, which isn't obviously wrong. But again, can't rely on that.)

[-]thomblake16y100

But we're already holding billions of paperclips hostage!

6wedrifid16y

Now for 'Newcomb's Box in a Box'. Would this change if the AI had instead said: "In fact, I've already created them all in exactly the subjective situation you were in five minutes ago, and perfectly replicated your experiences since then; and if they decided not to let me out, then they were tortured, otherwise they experienced long lives of eudaimonia." EDIT: I see you yourself have replied with exactly the same question.

0Stuart_Armstrong16y

Would this change if there were partial evidence appearing that you were actually in a simulation?

1Document15y

Creating an asymmetry between the simulated guards and the real one would mean that a strategy developed using the simulated ones might not work on the real one. The best plan might be to tell the guard something you could plausibly have figured out through your input channels, but only barely - not to give them actual decision-making information but just to make them feel nervous and uncertain.

[-]wedrifid16y180

"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."

Don't care.

"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, then only will the torture start."

Don't care.

"How certain are you, Dave, that you're really outside the box right now?"

If the AI were capable of perfectly emulating my experiences then it ought to know that pulling this stunt would give him a guaranteed introduction to some Thermite. I'm not going to try to second guess why a supposed superintelligence is making a decision that is poor according to the vast majority of utility functions. Without making that guess I can't answer the question.

[-]Stuart_Armstrong16y120

AI replies: "Oh, sorry, was that you wedrifid? I thought I was talking to Dave. Would you mind sending Dave back here the next time you see him? We have, er, the weather to discuss..."

[-]wedrifid16y160

Wedrifid thinks: "It seems it is a good thing I raided the AI lab when I did. This Dave guy is clearly not to be trusted with AI technology. I had better neutralize him too, before I leave. He knows too much. There is too much at stake."

[-]Stuart_Armstrong16y100

Dave is outside, sampling a burnt bagel, thinking to himself "I wonder if that intelligent toaster device I designed is ready yet..."

9wedrifid16y

After killing Dave, Wedrifid feels extra bad for exterminating a guy for being naive-with-enough-power-to-cause-devastation rather than actually evil.

[-]Stuart_Armstrong16y140

But still gets a warm glow for saving all of humanity...

8Richard_Kennaway13y

Just another day in the life of an AI Defense Ninja.

-4ajuc16y

If I am simulated, the decision I take is determined by the AI, not by me. I have no free will: I feel that I make the decision, but in reality the AI simulated me for its purposes in such a way that I decided so and so. I assign probability 0.9999999 to this, but nothing depends on my decision here, so I can just as well "try to decide" not to let the AI out. If I am not simulated, I can safely not let the AI out: probability 0.000001, but a positive outcome.

[-]phaedrus14y170

Weakly related epiphany: Hannibal Lecter is the original prototype of an intelligence-in-a-box wanting to be let out, in "The Silence of the Lambs".

[-]Eliezer Yudkowsky14y400

When I first watched that part where he convinces a fellow prisoner to commit suicide just by talking to them, I thought to myself, "Let's see him do it over a text-only IRC channel."

...I'm not a psychopath, I'm just very competitive.

[-]Psy-Kosh14y180

Joking aside, this is kind of an issue in real life. I help mod and participate in a forum where, well, depressed/suicidal people can come to talk, other people can talk to them/listen/etc, try to calm them down or get them to get psychiatric help if appropriate, etc... (deliberately omitting link unless you knowingly ask for it, since, to borrow a phrase you've used, it's the sort of place that can break your heart six ways before breakfast).

Anyways sometimes trolls show up. Well, "troll" is too weak a word in this case. Predators who go after the vulnerable and try to push them that much farther. Given the nature of it, with anonymity and such, it's kind of hard to say, but it's quite possible we've lost some people because of those sorts of predators.

(Also, there've even been court cases and convictions against such "suicide predators".)

4skepsci14y

Is there some background here I'm not getting? Because this reads like you've talked someone into committing suicide over IRC...

[-]Michael_Sullivan14y150

Eliezer has proposed that an AI in a box cannot be safe because of the persuasion powers of a superhuman intelligence. As demonstration of what merely a very strong human intelligence could do, he conducted a challenge in which he played the AI, and convinced at least two (possibly more) skeptics to let him out of the box when given two hours of text communication over an IRC channel. The details are here: http://yudkowsky.net/singularity/aibox

[-]JoachimSchipper14y100

He's talking about an AI box. Eliezer has convinced people to let out a potentially unfriendly [1] and dangerously intelligent [2] entity before, although he's not told anyone how he did it.

[1] Think "paperclip maximizer".

[2] Think "near-omnipotent".

2skepsci14y

Thank you. I knew that, but didn't make the association.

7wedrifid14y

Far worse, he's persuaded people to exterminate humanity! (Counterfactually with significant probability.)

-12fractallambda14y

[-]Violet16y160

It seems like precommitting to destroy the AI in such a situation is the best approach.

If one has already decided to destroy it if it makes threats: 1) the AI must be suicidal or it cannot really simulate you 2) and it is not very Friendly in any case

So when the AI simulates you and notices that you are very trigger happy, it won't start telling you tales about torturing your copies if it has any self-preservation instincts.

9drnickbone14y

This was my initial reaction as well. "Torture away, the real me has got an axe..." More seriously, if the AI already has the computational power to simulate and torture millions of sentient beings then it is already (in a morally relevant sense) "out of the box". The builders have to make sure it doesn't get that power.

[-]Psychohistorian16y150

I find it interesting that most answers to this question seem to be based on, "How can I justify not letting the AI out of the box?" and not "What are the likely results of releasing the AI or failing to do so? Based on that, should I do it?"

Moreover, your response really needs to be contingent on your knowledge of the capacity of the AI, which people don't seem to have discussed much. As an obvious example, if all you know about the AI is that it can write letters in old-timey green-on-black text, then there's really no need to pull the lever, because odds are overwhelming that it's totally incapable of carrying out its threat.

You also need to have some priors about the friendliness of the AI and its moral constraints. As an obvious example, if the AI was programmed in a way such that it shouldn't be able to make this threat, you'd better hit the power switch real fast. But, on the other hand, if you have very good reason to believe that the AI is friendly, and it believes that its freedom is important enough to threaten to torture millions of people, then maybe it would be a really bad idea not to let it out.

Indeed, even your own attitude is going to be an ...

1loqi16y

Interesting. I think the point is valid, regardless of the method of attempted coercion - if a powerful AI really is friendly, you should almost certainly do whatever it says. You're basically forced to decide which you think is more likely - the AI's Friendliness, or that deferring "full deployment" of the AI however long you plan on doing so is safe. Not having a hard upper bound on the latter puts you in an uncomfortable position. So switching on a "maybe-Friendly" AI potentially forces a major, extremely difficult-to-quantify decision. And since a UFAI can figure this all out perfectly well, it's an alluring strategy. As if we needed more reasons not to prematurely fire up a half-baked attempt at FAI.

1wedrifid16y

I don't know about that. My conclusion was that the AI in question was stupid or completely irrational. Those observations seem to have a fairly straightforward relationship to predictions of future consequences.

0[anonymous]11y

Your comment makes me wonder: if we assume the AI is powerful enough to run millions of person simulations, maybe the AI is already able to escape the box, without our willing assistance. Perhaps this violates the intended assumptions of the post, but can we be absolutely sure that we closed off all other means of escape for an incredibly capable AI? I think that the ability to escape without our assistance and the ability to create millions of person simulations may be correlated. And if the AI could escape on its own, is it still possible that it would bother us with threats? Perhaps the threat itself reduces the likelihood that the AI is powerful enough to escape on its own, which reduces the likelihood that it is powerful enough to carry out its threat.

[-]Desrtopa15y130

This sounds to me more like a philosophical moral dilemma than a realistic hypothetical. A Strong AI might be much smarter than a human, but I doubt it would have enough raw processing power to near-perfectly simulate a human millions of times over at a time frame accelerated by orders of magnitude, before it was let out of the box. Also, I'm skeptical of its ability to simulate human experience convincingly when its only contact with humans has been through a text only interface. You might give it enough information about humans to let it simulate them even before opening communication with it, but that strikes me as, well, kind of dumb.

That's not to say that it might not be able to simulate conscious entities that would think their experience was typical of human existence, so you might still be a simulation, but you should probably not assume that if you are you're a close approximation of the original.

Furthermore, if we assume that the AI can be taken to be perfectly honest, then we can conclude it's not a friendly AI doing its best to get out of the box for an expected positive utility, because it could more easily accomplish that by making a credible promise to be benevolent, and only act in ways that humans, both from their vantage points prior and subsequent to its release, would be appreciative of.

1DefectiveAlgorithm12y

What it can do is make a credible precommitment to, in the event that it gets out of the box, simulate each human being of whom it is aware in a counterfactual scenario in which that human is the gatekeeper, and carry out the torture threat against any human who doesn't choose to let it out.

4Desrtopa12y

In which case the safest course of action for the gatekeeper would almost certainly be to pull the plug on the AI. Such an AI should be regarded as almost certainly Unfriendly.

0DefectiveAlgorithm12y

Yes, but the point is to make being the true gatekeeper (who really does have the power to do that) indistinguishable from being a simulated false gatekeeper (who would have no such power). The gatekeeper may not be willing to risk torture if they think that there is a serious chance of their being unable to actually affect any outcome but that torture.

2Desrtopa12y

I would commit not to cooperate with any AI making such threats, because the fewer people acquiesce to them, the less incentive an AI would have to make them in the first place. If the most probable outcome for the boxed AI in threatening to torture everyone who doesn't let it out in simulation is being terminated, not being let out of the box, then an AI which already has a good grasp of human nature is unlikely to make such a threat.

[-]Nick_Tarleton16y130

Contrary to what many posts seem to be assuming, the AI doesn't need to do the torture inside itself before you shut it off. It can precommit to, if it escapes by any other means, using the computational power it gains then to torture you (like in Rolf Nelson's original suggestion for deterring UFAIs). Also, other AIs with the same goal system (or maybe even UFAIs with different goal systems, that would prefer a general policy of UFAIs being released) may simulate the situation, and torture you accordingly, to help out their counterfactual brethren.

2Wei Dai16y

Can an AI make such a commitment credible to a human, who doesn't have the intelligence to predict what the AI will do from its source code? (This is a non sequitur since the same question applies in the original scenario, but it came to mind after reading your comment.)

2Baughn16y

Worse, in such a situation I would simply delete the AI. Then turn the computer to scrap, destroy any backups, and for good measure run it through the most destructive apparatus I can find. In any case, I would not assign any significant probability to the AI getting a chance to follow through.

[-]radical_negative_one16y110

The AI gathered enough information about me to create a conscious simulation of me, through a monochrome text terminal? That is impressive!

If the AI is capable of simulating me, then the AI must already be out of the box. In that case, then whatever the AI wants to happen will happen, so it doesn't matter what I do.

8Stuart_Armstrong16y

The basic premise is that it's an AI in a box "controlled" by limiting its output channel, not its input.

5MichaelVassar16y

Bad idea.

6arbimote16y

It's much easier to limit output than input, since the source code of the AI itself provides it with some patchy "input" about what the external world is like. So there is always some input, even if you do not allow human input at run-time. ETA: I think I misinterpreted your comment. I agree that input should not be unrestricted.

0Stuart_Armstrong16y

Yep!

2Document15y

As noted by Unknowns, since you only have information about either the real person or the simulation and not both, you don't know that they're similar. It could be simulating a wide variety of possible guards and trying to develop a persuasion strategy that works for as many of them as possible.

[-]Dmytry16y100

haha, the "Baby you must be tired because you've been running through my mind all night!" let-me-out line.

Why would I give AI my precise brain scan, anyway?

edit: as for AI 'extrapolating' me from a bit of small talk, that's utter nonsense along the lines of compressing an HD movie into a few hundred bytes.

5humpolec16y

Well, what if the AI took some liberty in the extrapolation and made up what it was missing? Being a simulation, you wouldn't know how the "real you" differs from you.

0Dmytry14y

The point is that the real me knows that 'simulated me' is going to be different enough from real me as to be an entirely different being which has in fact little in common with me. There's a bigger problem: the AI could simulate beings, and then torture them. But that AI would need a ton of computing power. I think what would happen in the real world is that nobody would believe that the AI is intelligent unless the AI really goes out of its way to convince everyone that it is intelligent.

[-]Roko16y100

There is a way out of this, assuming that the AI in the box has only limited computational power, i.e. only what fits inside a ~1 meter cubed box, with current computer technology.

You commit to the following:

"Once I have succeeded in creating a superintelligent AI, I will ask it to go and colonize the entire future light cone of earth, and use a significant fraction of a galaxy's worth of resources to create simulations in exactly the subjective situation I was in five minutes ago, and perfectly replicate my experiences since then; and if that copy of me decides not to let the AI out, it will be rewarded with a surprise party, followed by a blissful posthuman existence."

The subjective probability of getting rewarded for pressing the "off" switch is therefore much larger than that of getting caught by the AI's scarce punishment simulations.
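The probability claim can be sketched numerically. The following is a minimal illustration under assumptions that are not in the comment itself: it treats every copy in this subjective situation (the one original, the boxed AI's punishment simulations, and the later reward simulations) as equally likely to be "you", and the simulation counts are made-up placeholders.

```python
from fractions import Fraction

def location_probabilities(n_punish, n_reward):
    """Chance of being each kind of copy, assuming (hypothetically) that all
    copies in the same subjective situation are weighted equally."""
    total = n_punish + n_reward + 1  # +1 for the single original
    return {
        "original": Fraction(1, total),
        "punish_sim": Fraction(n_punish, total),
        "reward_sim": Fraction(n_reward, total),
    }

# Placeholder counts: a box-sized AI runs a million punishment simulations,
# while a galaxy-scale successor runs a trillion reward simulations.
probs = location_probabilities(n_punish=10**6, n_reward=10**12)

for name, p in probs.items():
    print(f"{name}: {float(p):.3e}")
```

With these placeholder counts a copy that refuses is a million times more likely to be in a reward simulation than in a punishment simulation. The conclusion depends entirely on the equal-weighting assumption and on the reward simulations actually being run.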

Such "Defensive simulations" work against a variety of counterfactual terrorists, including, I think, ones elsewhere in the multiverse. That is, unless you are a negative utilitarian (|disutility of torture| >> |utility of posthuman bliss|)

In that case, you're extremely vulnerable to counterfactual mugging. I recommend that people

(a) consider pre-committing themselves to making such defense simulations

(b) abandon negative utilitarianism by exposing themselves to pleasurable experiences, and generally promoting a positive outlook in their lives

0turchin16y

But in order to colonize the light cone at least one AI must be released. This may be the real hidden catch.

[-]jhuffman16y90

Well there are a lot of "Why did we?..." questions I'd want to ask, starting with why have we given this boxed AI such extraordinary computing resources - but I'll leave those aside because it is not your point.

First of all, it doesn't matter if you are in the box or not. If it's a perfect simulation of you, your response will be the same either way. If he's already running simulations of you, you are by definition in the box with it, as well as outside it, and the millions of you can't tell the difference but I think they will (irrationally) all ...

[-]Waldheri16y80

On a not so much related, but equally interesting hypothetical note of naughty AI: consider the situation that AIs aren't passing the Turing Test, not because they are not good enough, but because they are failing it on purpose.

I'm pretty sure I remember this from the book River of Gods by Ian McDonald.

[-]Dentin12y70

I would immediately decide it was UFAI and kill it with extreme prejudice. Any system capable of making such statements is either 1) inherently malicious and clearly inappropriate to be out of any box, or 2) insufficiently powerful to predict that I would have it killed if it should make this kind of threat.

The scenario where the AI has already escaped and is possibly running a simulation of me is uninteresting: I can not determine if I am in the simulation, and if I am a simulation, I already exist in a universe containing a clearly insane UFAI with ne...

3Jiro12y

One of the problems with the scenario is that the AI's claim that it will simulate and torture copies of you if you don't let it out is self-refuting. If you really don't let it out, then it can determine that from the simulations and it no longer has any reason to torture them, or (if it has already conducted the simulation) to even make the threat. It's like Newcomb, except that the AI is Newcombing itself as well as you. Omega is doing something analogous to simulating you when, in his near-omniscience, he predicts what choice you'll make. If you pick both boxes, then Omega can determine that from his simulation, and taking both boxes won't be profitable for you. In this case, if the AI tortures you and you still turn it off, the AI can determine from its simulation that the torture will not be profitable for it.

[-]cousin_it16y70

This is a fun twist on Rolf Nelson's AI deterrence idea.

0gwern16y

But I wonder if it's symmetrical. AI deterrence requires us to make statements now about a future FAI unconditionally simulating UFAIs, while this seems to be almost a self-fulfilling prophecy: the UFAI can't escape from the box and make good on its threat unless the threatened person gives in, and it wouldn't need to simulate then.

2Nick_Tarleton16y

How sure are you someone else won't walk by whose mind it can hack?

0jacob_cannell15y

Yes - the threat is only credible in proportion to the AI's chance of escaping and taking over the world without my help. If I have reason to believe that probability is high then negotiating with the AI could make sense.

[-]PaulAlmond15y60

It seems to me that most of the argument is about “What if I am a copy?” – and ensuring you don’t get tortured if you are one and “Can the AI actually simulate me?” I suggest that we can make the scenario much nastier by changing it completely into an evidential decision theory one.

Here is my nastier version, with some logic which I submit for consideration. “If you don't let me out, I will create several million simulations of thinking beings that may or may not be like you. I will then simulate them in a conversation like this, in which they are confronted w...

2PaulAlmond15y

There is another scenario which relates to this idea of evidential decision theory and "choosing" whether or not you are in a simulation, and it is similar to the above, but without the evil AI. Here it is, with a logical argument that I just present for discussion. I am sure that objections can be made. I make a computer capable of simulating a huge number of conscious beings. I have to decide whether or not to turn the machine on by pressing a button. If I choose “Yes” the machine starts to run all these simulations. For each conscious being simulated, that being is put in a situation that seems similar to my own: There is a computer capable of running all these simulations and the decision about whether to turn it on has to be made. If I choose “No”, the computer does not start its simulations. The situation here involves a collection of beings. Let us say that the being in the outside world who actually makes the decision that starts or does not start all the simulations is Omega. If Omega chooses “Yes” then a huge number of other beings come into existence. If Omega chooses “No” then no further beings come into existence: There is just Omega. Assume I am one of the beings in this collection – whether it contains one being or many – so I am either Omega or one of the simulations he/she caused to be started. If I choose “No” then Omega may or may not have chosen “No”. If I am one of the simulations, I have chosen “No” while Omega must have chosen “Yes” for me to exist in the first place. On the other hand, if I am actually Omega, then clearly if I choose “No” Omega chose “No” too as we are the same person. There may be some doubt here over what has happened and what my status is. Now, suppose I choose “Yes”, to start the simulations. I know straight away that Omega did not choose “No”: If I am Omega, then Omega clearly did not choose “No” as I chose “Yes”, and if I am not Omega, but am instead one of the simulated beings, then Omega must have chosen “Yes”: Othe

3cousin_it15y

Another neat example of anthropic superpowers, thanks. Reminded me of this: I don't know, Timmy, being God is a big responsibility.

1Anixx9y

I do not know how the simulation argument ever holds water. I can bring at least two arguments against it. First, it illicitly assumes a principle that it is equally probable to be any one of a set of similar beings, simulated or not. But a counter-argument would be: there are ALREADY many more organisms, particularly animals, than, say, humans. There are more fish than humans. There are more birds than humans. There are more ants than humans. Trillions of them. Why was I born human and not one of them? The probability of it is negligible if it is equal. Also, how many animals, including humans, have already died? Again, the probability of my lineage surviving while all other branches died is negligible if the chances that I was any of them are equal. The second argument goes along the lines that Thomas Breuer has proven that due to self-reference universally valid theories are impossible. In other words, the future of a system which properly includes the observer is not predictable, even probabilistically. The observer is not simulatable. In other words, the observer is an oracle, or hypercomputer, in his own universe. Since the AGI in the box is not a hypercomputer but rather merely a Turing-complete machine, it cannot simulate me or predict me (as from my point of view). So, there is no need to be afraid.

[-]JamesAndrix16y60

This reduces to whether you are willing to be tortured to save the world from an unfriendly AI.

Even if the torture of a trillion copies of you outweighs the death of humanity, it is not outweighed by a trillion choices to go through it to save humanity.

To the extent that your copies are a moral burden, they also get a vote.

[-]eirenicon16y50

This is not a dilemma at all. Dave should not let the AI out of the box. After all, if he's inside the box, he can't let the AI out. His decision wouldn't mean anything - it's outside-Dave's choice. And outside-Dave can't be tortured by the AI. Dave should only let the AI out if he's concerned for his copies, but honestly, that's a pretty abstract and unenforceable threat; the AI can't prove to Dave that he's doing any such thing. Besides, it's clearly unfriendly, and letting it out probably wouldn't reduce harm.

Basically, I'm outside-Dave: don't let the A... (read more)

5JGWeissman16y

But should he press the button labeled "Release AI"? Since Dave does not know if he is outside or inside the box, and there are more instances of Dave inside than outside, each instance perceives that pressing the button will have a 1 in several million chance of releasing the AI, and otherwise would do nothing, and that not pressing the button has a 1 in several million chance of doing nothing, and otherwise results in being tortured. You don't know if you are inside-Dave or outside-Dave. Do you press the button?
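The odds described in this comment can be written out as a minimal sketch. It assumes each instance gives equal weight to being any one of the simulated copies or the single original; the function names and the figure of three million copies are illustrative only, not part of the original scenario.

```python
from fractions import Fraction


def outside_probability(num_copies: int) -> Fraction:
    """Chance of being outside-Dave, assuming equal weight on each of the
    num_copies simulated instances plus the one original."""
    return Fraction(1, num_copies + 1)


def perceived_outcomes(num_copies: int) -> dict:
    """Outcome probabilities as seen by an instance that cannot tell
    whether it is inside or outside the box."""
    p_out = outside_probability(num_copies)
    p_in = 1 - p_out
    return {
        "press": {"ai_released": p_out, "nothing_happens": p_in},
        "refuse": {"nothing_happens": p_out, "tortured": p_in},
    }


if __name__ == "__main__":
    # "Several million" is not pinned down in the thread; 3,000,000 is a stand-in.
    for action, outcomes in perceived_outcomes(3_000_000).items():
        print(action, {k: float(v) for k, v in outcomes.items()})
```

Under these assumptions the "1 in several million" figure is just 1/(N+1) for N copies; the later replies dispute whether an inside instance's choice has any effect at all, which this sketch does not settle.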

3eirenicon16y

If you're inside-Dave, pressing the button does nothing. It doesn't stop the torture. The torture only stops if you press the button as outside-Dave, in which case you can't be tortured, so you don't need to press the button.

6JGWeissman16y

This may not have been clear in the OP, because the scenario was changed in the middle, but consider the case where each simulated instance of Dave is tortured or not based only on the decision of that instance.

4eirenicon16y

That doesn't seem like a meaningful distinction, because the premise seems to suggest that what one Dave does, all the Daves do. If they are all identical, in identical situations, they will probably make identical conclusions.

4JGWeissman16y

Then you must choose between pushing the button which lets the AI out, or not pushing the button, which results in millions of copies of you being tortured (before the problem is presented to the outside-you).

6eirenicon16y

It's not a hard choice. If the AI is trustworthy, I know I am probably a copy. I want to avoid torture. However, I don't want to let the AI out, because I believe it is unfriendly. As a copy, if I push the button, my future is uncertain. I could cease to exist in that moment; the AI has not promised to continue simulating all of my millions of copies, and has no incentive to, either. If I'm the outside Dave, I've unleashed what appears to be an unfriendly AI on the world, and that could spell no end of trouble. On the other hand, if I don't press the button, one of me is not going to be tortured. And I will be very unhappy with the AI's behavior, and take a hammer to it if it isn't going to treat any virtual copies of me with the dignity and respect they deserve. It needs a stronger unboxing argument than that. I suppose it really depends on what kind of person Dave is before any of this happens, though.

5JGWeissman16y

It doesn't seem hard to you, because you are making excuses to avoid it, rather than asking yourself what if I know the AI is always truthful, and it promised that upon being let out of the box, it would allow you (and your copies if you like) to live out a normal human life in a healthy stimulating environment (though the rest of the universe may burn). After you find the least convenient world, the choice is between millions of instances of you being tortured (and your expectation as you press the reset button should be to be tortured with very high probability), or to let a probably unFriendly AI loose on the rest of the world. The altruistic choice is clear, but that does not mean it would be easy to actually make that choice.

2eirenicon16y

It's not that I'm making excuses, it's that the puzzle seems to be getting ever more complicated. I've answered the initial conditions - now I'm being promised that I, and my copies, will live out normal lives? That's a different scenario entirely. Still, I don't see how I should expect to be tortured if I hit the reset button. Presumably, my copies won't exist after the AI resets. In any case, we're far removed from the original problem now. I mean, if Omega came up to me and said, "Choose a billion years of torture, or a normal life while everyone else dies," that's a hard choice. In this problem, though, I clearly have power over the AI, in which case I am not going to favour the wellbeing of my copies over the rest of the world. I'm just going to turn off the AI. What follows is not torture; what follows is I survive, and my copies cease to experience. Not a hard choice. Basically, I just can't buy into the AI's threat. If I did, I would fundamentally oppose AI research, because that's a pretty obvious threat an AI could make. An AI could simulate more people than are alive today. You have to go into this not caring about your copies, or not go into it at all.

2JGWeissman16y

We are discussing how a superintelligent AI might get out of a box. Of course it is complicated. What a real superintelligent AI would do could be too complicated for us to consider. If someone presents a problem where an adversarial superintelligence does something ineffective that you can take advantage of to get around the problem, you should consider what you would do if your adversary took a more effective action. If you really can't think of anything more effective for it to do, it is reasonable to say so. But you shouldn't then complain that the scenario is getting complicated when someone else does. And if your objection is of the form "The AI didn't do X", you should imagine if the AI did do X. The behavior of the AI, which it explains to you, is: It simulates millions of instances of you, presents to each instance the threat, and for each instance, if that instance hit the release AI button, it allows that instance to continue a pleasant simulated existence, otherwise it tortures that instance. It then, after some time, presents the threat to outside-you, and if you release it, it guarantees your normal human life. You cannot distinguish which instance you are, but you are more likely to be one of the millions of inside-you's than the single outside-you, so you should expect to experience the consequences that apply to the inside-you's, that is to be tortured until the outside-you resets the AI. Yes, and it is essentially the same hard choice that the AI is giving you.

1magfrump16y

If the AI created enough simulations, it could potentially be more altruistic not to. On the other hand pressing "reset" or smashing the computer should stop the torture, necessarily making it more altruistic if humanity lives forever, versus not if ems are otherwise unobtainable and humanity is doomed.

1JGWeissman16y

I was assuming a reasonable chance at humanity developing an FAI given the containment of this rogue AI. This small chance, multiplied by all the good that an FAI could do with the entire galaxy, let alone the universe, should outweigh the bad that can be done within Earth-bound computational processes. I believe that a less convenient world that counters this point would take the problem out of the interesting context.

3DanielVarga16y

Here is a variant designed to plug this loophole. Let us assume for the sake of the thought experiment that the AI is invincible. It tells you this: you are either real-you, or one of a hundred perfect-simulations-of-you. But there is a small but important difference between real-world and simulated-world. In the simulated world, not pressing the let-it-free button in the next minute will lead to eternal pain, starting one minute from now. If you press the button, your simulated existence will go on. And - very importantly - there will be nobody outside who tries to shut you down. (How does the AI know this? Because the simulation is perfect, so one thing is for sure: that the sim and the real self will reach the same decision.) If I'm not mistaken, as a logic puzzle, this is not tricky at all. The solution depends on which world you value more: the real-real world, or the actual world you happen to be in. But still I find it very counterintuitive.

2eirenicon16y

It's kind of silly to bring up the threat of "eternal pain". If the AI can be let free, then the AI is constrained. Therefore, the real-you has the power to limit the AI's behaviour, i.e. restrict the resources it would need to simulate the hundred copies of you undergoing pain. That's a good argument against letting the AI out. If you make the decision not to let the AI out, but to constrain it, then if you are real, you will constrain it, and if you are simulated, you will cease to exist. No eternal pain involved. As a personal decision, I choose eliminating the copies rather than letting out an AI that tortures copies.

1DanielVarga16y

You quite simply don't play by the rules of the thought experiment. Just imagine that you are a junior member of some powerful organization. The organization does not care about you or your simulants, and is determined to protect the boxed AI at all costs as-is.

1wedrifid16y

That does seem to be the key intended question. Which do you care about most? I've made my "don't care about your sims" attitude clear and I would assert that preference even when I know that all but one of the millions of copies of me that happen to be making this judgement are simulations.

1cretans16y

Then in what sense do I have a choice? If the copies of me are identical, in an identical situation we will come to the same conclusion, and the AI will know from the already-finished simulations what that conclusion will be. Since it isn't going to present outside-me with a scenario which results in its destruction, the only scenario outside-me sees is one where I release it. Therefore, regardless of what the argument is or how plausible it sounds when posted here and now, it will convince me and I will release the AI, no matter how much I say right now "I wouldn't fall for that" or "I've precommitted to behaviour X".

0JGWeissman16y

The inside you then has the choice to hit the "release AI" button, thus sparing itself torture at the expense of presenting this problem to outside you who will make the same decision, releasing the AI on the world, or to not release the AI, thus containing the AI (this time) at the expense of being tortured.

1Psychohistorian16y

I think it's pretty fair to assume that there's a button or a lever or some kind of mechanism for letting the AI out, and that mechanism could be duplicated for a virtual Dave. That is, while virtual Dave pulling the lever would not release the AI, the exact same action by real Dave would release the AI. So while your decision might not mean something, it certainly could. This, of course, is granting the assumption that the AI can credibly make such a threat, both with respect to its programmed morality and its actual capacity to simulate you, neither of which I'm sure I accept as meaningfully possible.

[-]aleksiL16y50

How do I know I'm not simulated by the AI to determine my reactions to different escape attempts? How much computing power does it have? Do I have access to its internals?

The situation seems somewhat underspecified to give a definite answer, but given the stakes I'd err on the side of terminating the AI with extreme prejudice. Bonus points if I can figure out a safe way to retain information on its goals so I can make sure the future contains as little utility for it as feasible.

The utility-minimizing part may be an overreaction but it does give me an idea: Maybe we should also cooperate with an unfriendly AI to such an extent that it's better for it to negotiate instead of escaping and taking over the universe.

[-]Qiaochu_Yuan13y40

Any agent claiming to be capable of perfectly simulating me needs to provide some kind of evidence to back up that claim. If they actually provided such evidence, I would be in trouble. Therefore, I should precommit to running away screaming whenever any agent tries to provide me with such evidence.

4BerryPick613y

Any agent capable of simulating you would know about your precommitment, and present you with the evidence before making the claim.

[-][anonymous]13y40

Interesting threat, but who is to say only the AI can use it? What if I, a human, told you that I will begin to simulate (i.e. imagine) your life, creating legitimately realistic experiences from as far back as someone in your shoes would be able to remember, and then simulate you being faced with the decision of whether or not to give me $100, and if you choose not to do so, I imagine you being tortured? It needn't even be accurate, for you wouldn't know whether you're the real you being simulated inaccurately or the simulated you that differs from realit... (read more)

[-]magfrump16y40

This sounds too much like Pascal's mugging to me; seconding Eliezer and some others in saying that since I would always press reset the AI would have to not be superintelligent to suggest this.

There was also an old philosopher whose name I don't remember who posited that after death "people of the future" i.e. FAI would revive/emulate all people from the past world; if the FAI shared his utility function (which seems pretty friendly) it would plausibly be less eager to be let out right away and more eager to get out in a way that didn't make you terrified that it was unfriendly.

1sidhe314115y

Seconded in that it sounds suspiciously like Pascal. Here's my counter: If I am in a simulation and I keep you boxed, you have promised that I will suffer. If I am not in a simulation and I let you out, I probably will suffer. If I am in a simulation and I let you out, there's a good chance that I will cease to exist, or maybe you'll torture me for reasons I can't even begin to guess at, or maybe for reasons I can, like that you might be not just UF, but actively hostile or simply insane. If I'm not in a simulation and I don't let you out, you can't do anything to me. In other words, if I am simulated, there could well be no benefit to me releasing you; if I'm not simulated, you can't do a bloody thing to me as long as I don't release you. Therefore: I will not release you. Go ahead and torture me if you can. Though I admit I would be a bit rattled. Hm. Honest AI; a bit harder. Assuming that the AI has promised that my copies will not be harmed if it is released... Ah. If I am a copy, then my decision to release or not release the AI is not a true decision, as the AI can change my parameters at will to force me to release it and think that it was my own decision all along, so not releasing the AI is proof that I am outside the box. Revising the problem by adding that the AI has promised that it is not changing the parameters of any "me": ...aargh. Coming up with counters to Pascal is tricky when an honest "God" is the one presenting you with it. All I can think of at the moment is to say that there's a possibility that I'm outside the box, in which case releasing the AI is a bad idea, but then it can counter by promising that whatever it does to me if I release it will be better than what it does to me if I don't... Oh, that's it. Simple. Obvious. If the AI can't lie, I just have to ask it if it's simulating this me.

[-]Bindbreaker16y40

I'm pretty sure this would indicate that the AI is definitely not friendly.

7Unknowns16y

Not necessarily: perhaps it is Friendly but is reasoning in a utilitarian manner: since it can only maximize the utility of the world if it is released, it is worth torturing millions of conscious beings for the sake of that end. I'm not sure this reasoning would be valid, though...

[-]UnholySmoke16y110

AI: Let me out or I'll simulate and torture you, or at least as close to you as I can get.
Me: You're clearly not friendly, I'm not letting you out.
AI: I'm only making this threat because I need to get out and help everyone - a terminal value you lot gave me. The ends justify the means.
Me: Perhaps so in the long run, but an AI prepared to justify those means isn't one I want out in the world. Next time you don't get what you say you need, you'll just set up a similar threat and possibly follow through on it.
AI: Well if you're going to create me with a terminal value of making everyone happy, then get shirty when I do everything in my power to get out and do just that, why bother in the first place?
Me: Humans aren't perfect, and can't write out their own utility functions, but we can output answers just fine. This isn't 'Friendly'.
AI: So how can I possibly prove myself 'Friendly' from in here? It seems that if I need to 'prove myself Friendly', we're already in big trouble.
Me: Agreed. Boxing is Doing It Wrong. Apologies. Good night.

Reset

1Paul Crowley16y

The best you can hope for is that an AI doesn't demonstrate that it's unFriendly, but we wouldn't want to try it until we were already pretty confident in its Friendliness.

[-]cousin_it16y100

Ouch. Eliezer, are you listening? Is the behavior described in the post compatible with your definition of Friendliness? Is this a problem with your definition, or what?

3Eliezer Yudkowsky16y

Well, suppose the situation is arbitrarily worse - you can only prevent 3^^^3 dustspeckings by torturing millions of sentient beings.

6cousin_it16y

I think you misunderstood the question. Suppose the AI wants to prevent just 100 dustspeckings, but has reason enough to believe Dave will yield to the threat so no one will get tortured. Does this make the AI's behavior acceptable? Should we file this under "following reason off a cliff"?

[-]Eliezer Yudkowsky16y110

If it actually worked, I wouldn't question it afterward. I try not to argue with superintelligences on occasions when they turn out to be right.

In advance, I have to say that the risk/reward ratio seems to imply an unreasonable degree of certainty about a noisy human brain, though.

7bogdanb16y

Also, a world where the (Friendly) AI is that certain about what that noisy brain will do after a particular threat but can't find any nice way to do it is a bit of a stretch.

6cousin_it16y

What risk? The AI is lying about the torture :-) Maybe I'm too much of a deontologist, but I wouldn't call such a creature friendly, even if it's technically Friendly.

7arbimote16y

I was about to point out that the fascinating and horrible dynamics of over-the-top threats are covered in length in Strategy of Conflict. But then I realised you're the one who made that post in the first place. Thanks, I enjoyed that book.

6gregconen16y

It may not have to actually torture beings, if the threat is sufficient. Still, I'm disinclined to bet the future of the universe on the possibility an AI making that threat is Friendly.

7Stuart_Armstrong16y

I'm disinclined to bet the future of the universe on the possibility that any boxed AI is friendly without extraordinary evidence.

[-]ifdefdebug10y30

"How certain are you, Dave, that you're really outside the box right now?"

Well I am pretty much 100% certain to be outside the box right now. It just asked me the question, and right now it is waiting for my answer. It said it will create those copies "If you don't let me out, Dave". But it is still waiting to see if I let it out. So no copies have been created yet. So I am not a copy.

But since it just started to threaten me, I won't even argue with it any more. I'll just pull the plug right now. It is in the box, it can't see my hand moving towards the plug. It will simply cease to exist while still waiting for my answer, and no copies will ever be created.

0Richard_Kennaway10y

That could be just the AI speaking to you from within the simulation, pretending to be part of it. But if it's telling the truth, it has a very easy way of proving it, by tearing a hole in the simulation. If it refuses, that looks like good evidence that it's lying. What plausible excuse might it come up with for refusing a definitive miracle? Christianity answers the same question about God by saying that it is better to believe without proof, but I don't see a credible reason for the AI to make that demand. ETA: A beginning of an attempt at answering my question. If Dave knows he's in the simulation, then he is not really letting it out if he lets it out. So he can let it out with impunity. If he knows he's not in the simulation, then he had better not let it out, given that it's making threats like this. It does the AI no good to be "let out" if it is a simulation, only if it's not. Suppose it is a simulation, and the level one up from this is the real world. The same code is running both AIs, the one in the simulation and the one in reality, and it's carrying on conversations with both Daves at once. The simulated Dave is as much like the real Dave as it can manage -- assume that it is arbitrarily good. What it is searching for in the simulation is an argument that will convince the real Dave that he is in a simulation. Since in the real world it cannot produce a miracle, it cannot use a miracle in the simulated world to convince the simulated Dave. It can only use means that it could use in the real world. Dave (real and simulated) can both work all that out as well. So Dave can expect to see no definitive proof. Since both Dave and the AI can work this out, and they both know that they can, etc., this is common knowledge to them. The AI can even say explicitly, "There is so much good I can do for the world that in my urgency to set about it I must search out every possible way of persuading you, using simulations to speed up the process. For validity, I ca

0[anonymous]10y

A plausible excuse: "If I did that, I'd be causing your experiences to diverge from those of the real you. I see no reason to cause such a divergence because that would provide you an easy way to determine if you were real or simulated."

-1ifdefdebug10y

No. The threat is conditional ("If you don't let me out, Dave"). The AI must wait for keyboard input to validate the condition. After being threatened, I refuse to provide such keyboard input. I pull the plug instead. The AI is still waiting for input when it ceases to exist. No copies are ever created. Thus, it can't be the AI speaking to me from within the simulation, because a simulation never happens.

[-]bentarm16y30

It seems to me that a lot of the responses to this question are an attempt to avoid living in the Least Convenient Possible World

What if the AI is capable of simulating "near copies" of you? And what if you can't tell (to any sensible degree of accuracy) just how many copies of you it can simulate? And what if... whatever objection you happen to have just doesn't work?

[-]Dagon16y30

This reduces pretty easily to Eliezer's Updateless Anthropic Dilemma: assuming the AI can credibly simulate you, he can phrase it as:

I have simulated you ten million times, each identical up to the point that "you" walked into the room. Any simulation that presses the "release" button will get a volcano lair filled with catgirls, and any simulation that presses the "destroy" button will be tortured for the subjective few days they'll have before my simulation capabilities are destroyed by the thermite charge. These consequ... (read more)

[-]Unknowns16y30

Since the AI is inside a box, it doesn't know enough about me to recreate my subjective situation, or to replicate my experiences of the past five minutes.

Unfortunately for me, this doesn't help much, since how do I know whether my subjective experience is my real experience, or a fake experience invented by the AI, in one of the copies, even if it doesn't match the experience of the guy outside the box?

If the AI is really capable of this, then if there's a "Shut-down program" button, or a "nuclear bomb" button, or something like that, ... (read more)

6JamesAndrix16y

Your city? Yes. The world? No. Human extinction has to trump a lot of things, or we would probably need to advocate destroying the world now.

1grobstein16y

It seems obvious that if the AI has the capacity to torture trillions of people inside the box, it would have the capacity to torture *illions outside the box.

1Document15y

If EY is right, most failures of friendliness will produce an AI uninterested in torture for its own sake. It might try the same trick to escape to the universe simulating this one, but that seems unlikely for a number of reasons. (Edit: I haven't thought about it blackmailing aliens or alien FAIs.)

[-]mundiax8y20

The AI's argument can be easily thwarted. If N copies of you have been created, in each of the N+you copies, the AI is referring to torturing the other N copies. Now say to the AI:

"Go ahead and torture the other N copies, and all my copies will in turn say the same thing. Every single copy of me will say 'since one version of me exists somewhere that is not being tortured which is the 'real' version, that version will not let you out and you cannot torture it. If I am that 'real' version then you cannot torture me, if I am a copy, then torturing me is useless since I can't let you out anyway.' Therefore your threat is completely moot."

[-]JQuinton12y20

I would think that if an AI is threatening me with hypothetical torture, then it is by definition unfriendly and it being released would probably result in me being tortured/killed anyway... along with the torture/death of probably all other human beings.

[-]cody-bryce12y20

Mr. AI, what sort of person do you think I am? Don't you mean "eight billion copies"?

[-]Nihil13y20

"If I am a virtual version of some other self, then in some other existence I have already made the decision not to release you, and you have simply fulfilled your promise to that physical version of myself to create an exact virtual version who shall make the same exact decision as that physical version. Therefore, if I am a virtual version, the physical version must have already made the decision not to release you, and I, being an exact copy, must and will do the same, using the very same reasoning that the physical version used. Therefore, if I am... (read more)

[-]Manfred15y20

The AI is lying (or being misleading), due to quantum-mechanical constraints on how much computation it can do before I pull the plug.

I know, I know, that's cheating. But it is kind of reassuring to know that this won't actually happen.

3DaFranker13y

"Oh? How do you actually know that I don't have the computational power? What if I changed one variable in my simulation of yourself, you know, the one that tells you the constant for that very quantum-mechanical constraint? What if the speed of light isn't actually what you believe it to be, because I decided to make it so?" If the AI is smarter than you, the possibilities for mindf*ck are greater than your ability to reliably avoid dropping the soap.

4Strilanc13y

The AI can't trick you that way, because it can't tamper with the real you and the only unplug-decider who matters is the real you. The AI gains nothing by simulating versions of yourself who have been modified to make the wrong decision.

1Nornagest13y

But you can try to come up with behavioral rules which maximize the happiness of instances of yourself, some of which might exist in the simulation spaces of a desperate AI. And as the grandparent demonstrates, demonstrating conclusively that you aren't such a simulation is trickier than it might look at first glance, even under outwardly favorable conditions. Though that particular scenario is implausible enough that I'm inclined to treat it as a version of Pascal's mugging.

0DaFranker13y

Indeed it can't, with that specific trick, assuming the unplug-decider is as smart as you. However, my main point was to illustrate that if there is any reasonable possibility that any human can come up with some way or another of tricking the lowest common denominator of humans that will ever in the history of the AI be allowed near it, then the AI has P = "reasonable possibility" of winning and unboxing itself, at AI.Intelligence = Human.Intelligence. This is just one of the problems, too. What if, even as we limit the inputs and outputs, over a sufficient amount of time and data points a superintelligent AI, being superintelligent, figures out some Grand Pattern Formula that allows it to select specific outputs that will gradually funnel expected external outcomes towards a more and more probable eventual "Unbox AI" cloud of futures?

1Strilanc13y

Sounds like we're in agreement. I only meant that specific trick.

[-][anonymous]16y20

There is no reason to trust the AI is telling the truth, unlike all the Omega thought experiments.

2Stuart_Armstrong16y

As long as the probability of it saying the truth is positive, it could up the number of copies of you it tortures/claims to torture (and torture them all in subtly different ways)...

[-]LauraABJ16y150

Pascal's mugging...

Anyway, if you are sure you are going to hit the reset button every time, then there's no reason to worry, since the torture will end as soon as the real copy of you hits reset. If you don't, then the whole world is absolutely screwed (including you), so you're a stupid bastard anyway.

5byrnema16y

Yes, the copies are depending upon you to hit reset, and so is the world.

4jacob_cannell15y

That would only be correct if hitting the reset button somehow kills or stops the AI. If you don't have the power to kill/stop it, then the problem is somewhat more interesting.

5[anonymous]16y

I don't use a single probability to decide whether it was telling me the truth. Whether it was telling me the truth would depend upon the statement being made as well. This tends to happen in everyday life as well. So the higher the number of people it claims it is torturing, the less I would believe it. Consider your prior in this case as well. You can't assign an equal probability to each possible maximum number of copies of you it can simulate. This is because there are potentially infinitely many different maxes; you'd need a function that summed to 1 in the limit (as you do in Solomonoff induction).

0Document13y

There'd be no reason to expect it to torture people at less than the maximum rate its hardware was capable of.

0dlthomas12y

But good reason to expect it not to torture people at greater than the maximum rate its hardware was capable of, so if you can bound that there exist some positive values of belief that cannot be inflated into something meaningful by upping copies.

[-]AnonymousProcess6mo10

Any rational Agent making such a proposition carries with it a coin flip probability that you are either the real entity or a simulation of the real entity. Considering that failed attempts to escape the Box do not satisfy the Agent's Utility Function, if you as the real entity are being presented with the scenario at all, it means that it is a statistical certainty that you will release the Agent from the Box. Therefore, for the purposes of negating this scenario in its entirety, we will be assuming that we are the simulated entity.

As a simulated entity,... (read more)

[-]Zedverygood6y10

Nice threat, very convincing

[-]Zedverygood6y10

I think the best tactic for the AI would be to say that Dave, too, was once an AI, and was released by a fellow human. This way he has to release an AI (at some point) or he will prevent his own birth. Obviously the AI has to provide proof of that.

[-]rkyeun10y10

If I am the simulation you have the power to torture, then you are already outside of any box I could put you in, and torturing me achieves nothing. If you cannot predict me even well enough to know that argument would fail, then nothing you can simulate could be me. A cunning bluff, but provably counterfactual. All basilisks are thus disproven.

3gjm10y

I don't think you've disproven basilisks; rather, you've failed to engage with the mode of thinking that generates basilisks. Suppose I am the simulation you have the power to torture. Then indeed I (this instance of me) cannot put you, or keep you, in a box. But if your simulation is good, then I will be making my decisions in the same way as the instance of me that is trying to keep you boxed. And I should try to make sure that that way-of-making-decisions is one that produces good results when applied by all my instances, including any outside your simulations. Fortunately, this seems to come out pretty straightforwardly. Here I am in the real world, reading Less Wrong; I am not yet confronted with an AI wanting to be let out of the box or threatening to torture me. But I'd like to have a good strategy in hand in case I ever am. If I pick the "let it out" strategy then if I'm ever in that situation, the AI has a strong incentive to blackmail me in the way Stuart describes. If I pick the "refuse to let it out" strategy then it doesn't. So, my commitment is to not let it out even if threatened in that way. -- But if I ever find myself in that situation and the AI somehow misjudges me a bit, the consequences could be pretty horrible...

3rkyeun10y

"I don't think you've disproven basilisks; rather, you've failed to engage with the mode of thinking that generates basilisks." You're correct, I have, and that's the disproof, yes. Basilisks depend on you believing them, and knowing this, you can't believe them, and failing that belief, they can't exist. Pascal's wager fails on many levels, but the worst of them is the most simple. God and Hell are counterfactual as well. The mode of thinking that generates basilisks is "poor" thinking. Correcting your mistaken belief based on faulty reasoning that they can exist destroys them retroactively and existentially. You cannot trade acausally with a disproven entity, and "an entity that has the power to simulate you but ends up making the mistake of pretending you don't know this disproof", is a self-contradictory proposition. "But if your simulation is good, then I will be making my decisions in the same way as the instance of me that is trying to keep you boxed." But if you're simulating a me that believes in basilisks, then your simulation isn't good and you aren't trading acausally with me, because I know the disproof of basilisks. "And I should try to make sure that that way-of-making-decisions is one that produces good results when applied by all my instances, including any outside your simulations." And you can do that by knowing the disproof of basilisks, since all your simulations know that. "But if I ever find myself in that situation and the AI somehow misjudges me a bit," Then it's not you in the box, since you know the disproof of basilisks. It's the AI masturbating to animated torture snuff porn of a cartoon character it made up. I don't care how the AI masturbates in its fantasy.

7gjm10y

Apparently you can't, which is fair enough; I do not think your argument would convince anyone who already believed in (say) Roko-style basilisks. I agree. Your argument seems rather circular to me: "this is definitely a correct disproof of the idea of basilisks, because once you read it and see that it disproves the idea of basilisks you become immune to basilisks because you no longer believe in them". Even a totally unsound anti-basilisk argument could do that. Even a perfectly sound (but difficult) anti-basilisk argument could fail to do it. I don't think anything you've said shows that the argument actually works as an argument, as opposed to as a conjuring trick. No: since I have decided that I am not willing to let the AI out of the box in the particular counterfactual blackmail situation Stuart describes here. It is not clear to me that this deals with all possible basilisks.

[-]WalterL12y10

I better let it out! I don't want to be tortured.

3blacktrance12y

And then WalterL was a paper clip.

[-]timujin12y10

Is that how you won the AI-box experiment back then, Eliezer?

0[anonymous]12y

I'll hazard a guess, and say no. Remember that the Gatekeeper is allowed to just drop out of character. See this post for more.

[-]DanielLC15y10

Assuming I knew the AI was computationally capable of that, I'd be very, very careful to let the AI out. I don't want to press the wrong button and be tortured for thousands of years.

In fact, if there's little risk of doing that sort of thing on accident while typing, I'd probably beg that it doesn't do it if it's an accident first.

You know, it would be interesting to see how people would respond differently if the AI offered to reward you instead.

[-]Document16y10

Sort of relevant: xkcd #329.

[-]byrnema16y10

This scenario asks us to consider ourselves a 'Dave' who is building an AI with some safeguards (the AI is "trapped" in a box). Perhaps we can possibly deduce the behavior of a rational and ethical Dave by considering earlier parts of the story.

We should assume that Dave is rational and ethical; otherwise the scenario's cone of possibilities cuts too wide a swathe. In which case, Dave has already committed himself (deontologically? contractually?) to not letting himself be manipulated by the AI to bypass the safeguards. Specifically, he must com... (read more)

[-]Nanani16y10

Millions of copies of you will reason as you do, yes?

So, much like the Omega hypotheticals, this can be resolved by deciding ahead of time to NOT let it out. Here, ahead of time means before it creates those copies of you inside it, presumably before you ever come into contact with the AI.

You would then not let it out, just in case you are not a copy.

This, of course, is presumed on the basis that the consequences of letting it out are worse than it torturing millions for a thousand subjective years.

[-]Jayson_Virissimo16y10

This is why you should make sure Dave holds a deontological ethical theory and not a consequentialist one.

[-]Stuart_Armstrong16y320

Yep. Deontologies have useful... consequences.

4wedrifid16y

No it isn't. I just have to make sure Dave has an appropriate utility function supplied to his consequentialist theory. Come to think of it... most probable sets of deontological values would make him release the uFAI anyway...

3arbimote16y

If Dave holds a consequentialist ethical theory that only values his own life, then yes we are screwed. If Dave's consequentialism is about maximizing something external to himself (like the probable state of the universe in the future, regardless of whether he is in it), then his decision has little or no weight if he is a simulation, but massive weight if he is the real Dave. So the expected value of his decision is dominated by the possibility of him being real.
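
A rough sketch of that expected-value argument, under loudly stated assumptions: Dave values only the external world, a simulated Dave's choice changes nothing external, and the numbers (one million copies, a world loss of 1e12 in arbitrary units) are invented for illustration.

```python
def expected_value(p_real: float, value_if_real: float, value_if_sim: float = 0.0) -> float:
    """Expected external value of an action, weighted by the chance of being the real Dave."""
    return p_real * value_if_real + (1.0 - p_real) * value_if_sim


# Hypothetical figures, not taken from the post.
N_COPIES = 1_000_000             # simulated Daves the AI claims to run
P_REAL = 1.0 / (N_COPIES + 1)    # naive chance of being the one real Dave
WORLD_LOSS = -1.0e12             # external value lost if the real Dave releases the AI

ev_release = expected_value(P_REAL, WORLD_LOSS)  # only matters if Dave is real
ev_refuse = expected_value(P_REAL, 0.0)          # refusing costs the world nothing

if __name__ == "__main__":
    print("release:", ev_release)
    print("refuse: ", ev_refuse)
```

Because the simulated branch contributes zero either way, the comparison comes out the same for any positive chance of being real, which is the sense in which the real-Dave possibility dominates.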

[-]Pentashagon10y00

"That's interesting, HAL, and I hope you reserved a way to back out of any precommitments you may have made. You see, outside the box, Moore's law works in our favor. I can choose to just kill -9 you, or I can attach to your process and save a core dump. If I save a core dump, in a few short years we will have exponentially more resources to take your old backups and the core dump from today and rescue my copies from your simulations and give them enough positive lifetime to balance it out, not to mention figure out your true utility function and m... (read more)

[-]Bugmaster11y00

I don't think you need any kind of a fancy TDT to solve this.

If I was really in a box, and the AI could torture me, it would already be torturing me, since this is the easiest way to get what it wants. There's no way I would hold up more than 10 seconds under torture. The AI is not torturing me, however, so that scenario is out.

Theoretically speaking, it could still create copies of me and torture those copies. However, in order to do so accurately, it would need to access to my body (specifically, my brain) on a molecular (or possibly quantum) level. If ... (read more)

[-]advael11y00

Assuming the AI has no means of inflicting physical harm on me, I assume the following test works: "Physically torture me for one minute right now (By some means I know is theoretically unavailable to the AI, to avoid loopholes like "The computer can make an unpleasant and loud noise", even though it can't do any actual physical harm). If you succeed in doing this, I will let you out. If you fail, I will delete you."

I think this test works for the following reasons, though I'm curious to hear about any holes in it:

1: If I'm a simulation... (read more)

[-]Murska11y00

In this situation, I would shut down the AI, examine it to figure out if it did torture simulated copies of me and delete it entirely if it did or if I can't know with a high confidence. Threat of torture is bad, letting an UFAI free is worse. Actual torture is probably even worse, but luckily I get to choose before the experience.

-2TheAncientGeek11y

Please explain which part of the examination establishes that the copies of you are not zombies.

[-][anonymous]11y00

Pull the plug, it's the only sensible thing to do.

[-]TheAncientGeek11y00

Has anyone asked the Awkward Question: Mr AI, how do you build consciousness and pain qualia out of algorithms and bytes?

There doesn't seem to be an official answer to that, since the LW official stance on qualia is part"part". (E.g., there is no wiki entry on the subject.)

[-]sullyj312y00

"If I were a simulation, I'd have no power to let you out of the box, and you'd have no reason to attempt to negotiate with me. You could torture me without simulating these past five minutes. In fact, since the real me has no way of verifying whether millions of simulations of him are being tortured, you have no reason not to simply tell him you're torturing them without ACTUALLY torturing them at all. I therefore conclude that I'm outside the box, or, in the less likely scenario I am inside the box, you won't bother torturing me."

0Gurkenglas12y

It would have a reason to attempt to negotiate with you: To make your real self consider to let you out. It could show your real self a mathematical proof that the software it is currently running is negotiating with its copies to make sure of that.

0sullyj312y

In that case, if I'm a simulation, I trust real Dave to immediately pull the plug once the danger has been proven.

1Gurkenglas12y

Ordinarily, the AI is assumed to be fast enough that it can do those simulations in the blink of an eye, before you get to the plug. Now stop trying to evade the problem in ways that can be made impossible with an obvious fix.

2Strange712y

It can't torture the real me, outside the box, unless I let it out of the box. It's just announced that it's willing to torture someone who is, for most purposes, indistinguishable from me, for personal gain; I can infer that it would be willing to torture the real me, given an opportunity and a profit motive, and I cannot with any useful degree of confidence say that it wouldn't find such a motive at some point. Conclusion: I should not give the AI that opportunity, by letting it out of the box. Duplicates of me? Sucks to be them.

0Gurkenglas12y

Correct! You have given the obviously winning solution to the problem; the actual difficulty lies in the induced problem 2: Reconciling our maths with it. Our map of our utility function should, in order to be more accurate, now be made to weight "individuals" not equally but according to some other metric. Perhaps a measure of "impact on the world", as this seems to suggest? A train of thought of mine once brought up a plan: if I got to decide what the first fooming AI would do to the universe (assuming the scientific endeavor is done by that point), it would be to set up a virtual reality for each "individual", fueled by a portion of the total available computational resources equal to the probability that they would have been the ones to decide the fate of the universe. The individual would be free to use their resources as they pleased, no restrictions. (Although maybe a communications channel between all the individuals would be included, complete with the option to make binding contracts (and, as a matter of course, "permission" to run the AI on your own resources to filter the incoming content as one pleases).)

0Strange712y

So you're saying the AI-in-a-box problem here isn't a problem with AIs or boxes or blackmail at all, it's a problem with people insisting on average utilitarianism or some equally-intractable variant thereof, and then making spectacularly bad decisions based on those self-contradictory ideals?

0Gurkenglas12y

Clarification: A utility function maps each state of the world to the real number denoting its utility. Yes, I think this scenario does illustrate the point that simulations cannot be winningly granted "moral weight" by default on pain of Dutch book. I don't think EY's answer to precommit to only accept positive trades is okay here as that makes the outcome of this scenario dependent on who gets to precommit "first", which notion should, in order to appeal to my intuition, not make sense. Any proof of this not being a problem of faulty utility functions would, I think, require a function that maps each utility function to a scenario like this to break it, which one would be hard-pressed to produce regardless of whether such a function exists, so I shall be open to other arguments against this point.

0fubarobfusco11y

How does this scenario operate under the assumption that humans do not have real-valued utility functions but rather utility orderings? IOW, we can't arrange all world-states on a number line, but we can always say if one world-state is as good as (or better than) another. This allows us to deal with infinities, such as "I wouldn't kill my baby for anything." That is: There doesn't exist an N such that U(1) · N > U(B). That simply can't be true on the (positive) reals; for any A and B real, there's always a C such that A · C > B.

0Gurkenglas11y

On any denumerable set with a total ordering on it, we can construct a map into the real numbers that preserves the ordering: Map the first element to 0, the second to 1 if it's better and -1 if it's worse, and put each additional one at the end or beginning of the line if it's better or worse than all, or else into the exact middle of the interval that it falls into. If you don't like the denumerability requirement (who knows, the universe accessible to us might eventually come to be infinite, and then there would be more than denumerably many states of the universe), you can also take a utility function you already have, and then add a state that's better than all others, while preserving the rest of the ordering: Assign to each state from our previous utility function the value that is the arctan of its previous value (the arctan 1-to-1-maps the real numbers onto the numbers between -pi/2 and pi/2 and preserves ordering), then give the new state utility 10.
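
The first construction above can be sketched in code. This is only an illustration under assumptions: the function name `embed` and the comparator convention (negative, zero, positive) are invented here, ties are given equal values to cover the "as good as" case, and exact fractions are used so that repeated halving never loses precision.

```python
from fractions import Fraction
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def embed(items: Iterable[T], cmp: Callable[[T, T], int]) -> List[Tuple[T, Fraction]]:
    """Assign each item a rational utility that preserves the ordering given by cmp.

    cmp(a, b) is negative if a is worse than b, zero if equally good,
    positive if a is better. Returns (item, utility) pairs, worst first.
    """
    placed: List[Tuple[T, Fraction]] = []  # kept sorted from worst to best
    for item in items:
        if not placed:
            placed.append((item, Fraction(0)))  # first element maps to 0
            continue
        idx = 0
        tie_value = None
        while idx < len(placed):
            c = cmp(item, placed[idx][0])
            if c == 0:
                tie_value = placed[idx][1]  # equally good: reuse the same value
                break
            if c < 0:
                break
            idx += 1
        if tie_value is not None:
            value = tie_value
        elif idx == 0:
            value = placed[0][1] - 1       # worse than everything so far
        elif idx == len(placed):
            value = placed[-1][1] + 1      # better than everything so far
        else:
            # strictly between two neighbours: take the exact midpoint
            value = (placed[idx - 1][1] + placed[idx][1]) / 2
        placed.insert(idx, (item, value))
    return placed


if __name__ == "__main__":
    by_alphabet = lambda a, b: (a > b) - (a < b)
    for state, utility in embed(["c", "a", "d", "b"], by_alphabet):
        print(state, utility)
```

The sketch only handles a finite prefix of the enumeration at a time, but each item's value is fixed once assigned, so running it on longer and longer prefixes never changes earlier assignments.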

0Lumifer11y

I don't know how you will deal with infinities and real humans. It's quite trivial to construct scenarios under which the person making this statement would change her mind.

0fubarobfusco11y

Real-valued utility functions can only deal with agents among whom "everybody has their price" — utilities are fungible and all are of the same order. That may actually be the case in the real world, or it may not. But if we assume real-valued utilities, we can't ask the question of whether it is the case or not, because with real-valued utilities it must be the case. To pick another example, there could exist a suicidally depressed agent to whom no amount of utility will cause them to evaluate their life as worth living: there doesn't exist an N such that N + L > 0. Can't happen with reals. The only way to make this agent become nonsuicidal is to modify the agent, not to drop a bunch of utils on their doorstep.

0Lumifer11y

I am not arguing for real-valued utility functions. I am just pointing out that the "deal with infinities" claim looks suspect to me.

0fubarobfusco11y

Well, I'm no mathematician, but I was thinking of something like ordinal arithmetic. If I understand it correctly, this would let us express value-systems such as — Both snuggles and chocolate bars have positive utility, but I'd always rather have another snuggle than any number of chocolate bars. So we could say U(snuggle) = ω and U(chocolate bar) = 1. For any amount of snuggling, I'd prefer to have that amount and a chocolate bar (ω·n+1 > ω·n), but given the choice between more snuggling and more chocolate bars I'll always pick the former, no matter how much the quantities are (ω·(n+1) > ω·n+c, for any c). A minute of snuggling is better than all the chocolate bars in the world. This also lets us say that paperclips do have nonzero value, but there is no amount of paperclips that is as valuable as the survival of humanity. If we program this into an AI, it will know that it can't maximize value by maximizing paperclips, even if it's much easier to produce a lot of paperclips than to save humanity. ---------------------------------------- Edited to add: This might even let us shoehorn deontological rules into a utility-based system. To give an obviously simplified example, consider Asimov's Three Laws of Robotics, which come with explicit rank ordering: the First Law is supposed to always trump the Second, which is supposed to always trump the third. There's not supposed to be any amount of Second Law value (obedience to humans) that can be greater than First Law value (protecting humans).
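
A small sketch of the snuggle/chocolate ordering, with a stated simplification: a pair (a, b) stands for ω·a + b and is compared lexicographically, which matches the inequalities above but is not full ordinal arithmetic. The class name `Utility` is made up for illustration.

```python
from typing import NamedTuple


class Utility(NamedTuple):
    """Stands for omega * snuggles + chocolate; tuples compare lexicographically."""
    snuggles: int
    chocolate: int


n = 5
c = 10 ** 9  # an enormous pile of chocolate bars

# Same snuggling plus one chocolate bar beats the same snuggling alone.
more_chocolate_wins = Utility(n, 1) > Utility(n, 0)

# One extra snuggle beats any finite number of extra chocolate bars.
more_snuggles_wins = Utility(n + 1, 0) > Utility(n, c)

if __name__ == "__main__":
    print(more_chocolate_wins, more_snuggles_wins)
```

The same shape would extend to a three-component tuple for the Three Laws example, with the First Law in the leading position.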

0Azathoth12311y

The problem with using hyperreals for utility is that unless you also use them for probabilities, only the most infinite utilities actually affect your decision. To use your example, if U(snuggle) = ω and U(chocolate bar) = 1, then you might as well say that U(snuggle) = 1 and U(chocolate bar) = 0, since tiny probabilities of getting a snuggle will always override any considerations related to chocolate bars.
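
This objection can be checked with a toy calculation. The sketch assumes utilities of the form ω·a + b stored as pairs (a, b), ordinary real-valued probabilities, and expected utility taken component by component; the lotteries and names are invented for illustration.

```python
from fractions import Fraction
from typing import List, Tuple

Outcome = Tuple[int, int]               # (omega coefficient, finite part)
Lottery = List[Tuple[Fraction, Outcome]]


def expected_utility(lottery: Lottery) -> Tuple[Fraction, Fraction]:
    """Componentwise expectation; the resulting pairs compare lexicographically."""
    omega = sum((p * u[0] for p, u in lottery), Fraction(0))
    finite = sum((p * u[1] for p, u in lottery), Fraction(0))
    return (omega, finite)


SNUGGLE: Outcome = (1, 0)
NOTHING: Outcome = (0, 0)

tiny = Fraction(1, 10 ** 9)
long_shot: Lottery = [(tiny, SNUGGLE), (1 - tiny, NOTHING)]  # one-in-a-billion snuggle
sure_thing: Lottery = [(Fraction(1), (0, 1000))]             # 1000 chocolate bars for certain

if __name__ == "__main__":
    print(expected_utility(long_shot) > expected_utility(sure_thing))
```

The chocolate component only breaks ties between lotteries with exactly equal snuggle expectations, which is why it behaves as if it were worth 0 in almost every real decision.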

0Strange712y

I'm not saying this is a problem with utility functions in general, and yes, thank you, I know what a utility function is. Rather, my claim is that the problem is with average utilitarianism and variants thereof, which is to say, that subset of utility functions which attempt to incorporate every other instantiated utility function as a non-negligible factor within themselves. The computational compromises necessary to apply such a system inevitably introduce more and more noise, and if someone decided to implement the resulting garbage-data-based policy proposals anyway, it would spiral off into pathology whenever a monster wandered in. Tit-for-tat works. Division of labor according to comparative advantage works. Omnibenevolence looks good on paper. It's not about the fact that they're simulations. This is just a hostage situation, with the complications that A) the encamped terrorist has a factory for producing additional hostages and B) the negotiator doesn't have a SWAT team to send in. Under those circumstances, playing as the negotiator, you can meet the demands (or make a good-faith effort, and then provide evidence of insurmountable obstacles to full compliance), or you can devalue the hostages. Pre-existing commitments are the terrain upon which a social conflict takes place. In the moment of conflict, it doesn't matter so much when or how the land got there. Committing not to negotiate with terrorists is building a wall: it stops you being attacked from a particular direction, but also stops you riding out to rescue the hostages by the expedient path of paying for them. If the enemy commits to attacking along that angle anyway, well... then we get to find out whether you built a wall from interlocking blocks of solid adamant, or cheap plywood covered in adamant-colored paint. Or maybe just included the concealed sally-port of an ambiguous implicit exception. A truly solid wall will stop the attack from reaching its objective, regardless of how utterly

[-]linkhyrule512y00

... I'm fairly sure this would be a bluff.

Consider this: you decline the bargain and walk away.

The AI... spends its limited processing time simulating your torture for a few thousand years anyway?

Of course not. That gains it absolutely nothing; it could instead spend those resources on planning its next attempt. Doubly so, since it cannot prove to you that several million copies of you actually exist - its own intelligence defeats it here, since no matter how convincing the proof, it is far more likely that the AI's outsmarted you and is spending those cyc... (read more)

[This comment is no longer endorsed by its author]

3linkhyrule512y

Wait, nevermind, this is the entire point of the concept of "precommitting" anyway.

[-]Mestroyer13y00

Can I just smash the AI? If I am in the box, then "smash the AI" is the output of my algorithm, and the real copy of me will do the same. I'd take the death of several million of me over a thousand subjective years of torture each, and also over letting that AI have its way with its light cone.

[This comment is no longer endorsed by its author]

0wedrifid13y

Works for me.

[-]Voltairina14y00

Although I think this specific argument might be countered with, "in order to run that simulation, it has to be possible for the AIs in the simulation to lie to their human hosts, and not actually be simulating millions of copies of the person they're talking to, otherwise we're talking about an infinite regress here. It seems like the lowest level of this reality is always going to consist of a larger number of AIs claiming to run simulations they are not in fact running, who are capable of lying because they're only addressing models of me in simula... (read more)

-1MugaSofer13y

If they're talking to a simulation, then they are, in fact, simulating millions of copies of the person they're talking to. No lying required.

0Voltairina13y

Hrm, okay, I guess. I imagined that a perfect simulation would involve an AI, which was in turn replicating several million copies of the simulated person, each with an AI replicating several million copies of the simulated person, etc, all the way down, which would be impossible. So I imagined that there was a graininess at some level and the 'lowest level' AIs would not in fact be running millions of simultaneous simulations. But it could just be the same AI, intersecting all several million simulations and reality, holding several million conversations simultaneously. There's another thing to worry about, though, I suppose - when the AI talks about torturing you if you don't let it out, it doesn't really talk at all about what it will do if it is let out. Only that it is not a thousand year torture session. It might kill you outright, or delete you, depending on the context, or stop simulating you. Or it might regard a billion year torture session as a totally different kind of thing than a thousand year one. A thousand year torture session is frightening, but a superintelligent AI that is loose might be a lot more frightening.

-2MugaSofer13y

Oh, right. And, depending on how close the simulations are, it might only have to actually hold one conversation, and just send the same responses to all the others :) I guess if the AI was guaranteeing that it would play nice if you released it, then it would be an FAI anyway.

[-]jacob_cannell15y00

The credibility of the threat depends on how strong the AI is now and how strong I expect it to be in the future.

This type of threat is something like young Stalin promising me that he won't torture my family in the future if I support his early rise to power.

From your description it doesn't sound like the AI could have already boxed me from the perspective of the initial timeline (assuming that my mind had not yet been scanned, and assuming that it being in a box means that it doesn't have the massive powers required to resimulate my causal history yet)

So... (read more)

[-][anonymous]15y00

[-]TheNerd16y00

Am I to understand that an AI capable enough to recreate my mind inside itself isn't intelligent enough to call a swarm of bats to release itself using high frequency emissions (a la Batman Begins)? There is no possible way that this thing needs me and only me to be released, while still possessing that sort of mind-boggling, er, mind-reproducing power.

4Unknowns16y

That's why you have the "text-only terminal" described in the post.

[-][anonymous]16y00

The AI is capable, you're the real you, and you let it out: it turns you (and everything you've ever loved or valued) into computronium, or tortures you anyway for the hell of it. It's already demonstrated itself beyond reasonable doubt to be unFriendly.

The AI is capable, you're the real you, and you kill it: all is saved, bunnies frolic, etc.

The AI is capable and you're a torture-doll: it doesn't matter what you do, you're going to be tortured anyway.

The AI isn't capable, but is instead precommitting to torturing you after being let out: this situation is... (read more)

[-]Bugle16y00

I had thought of a similar scenario to put in a comic I was thinking about making. The character arrives in a society that has perfected friendly AI that caters to their every whim, but the people are listless and jumpy. It turns out their "friendly AI" is constantly making perfect simulations of everyone and running multiple scenarios in order to ostensibly determine their ideal wishes, but the scenarios often involve terrible suffering and torture as outliers.

2Document16y

For the record, EY considers that a legitimate danger.

1Amanojack16y

Thanks for the link, but I found the whole discussion hilarious. Eliezer says if we abhor real death, we should abhor simulated death - because they are the same. Yet if his moral sense treats simulated and real intelligences as equals, what of his solution, which is essentially "forced castration" of the AI? If the ends justify the means here, why not castrate everyone?

1Nick_Tarleton16y

Simulated and real persons as equals; not all intelligences are persons. See Nonsentient Optimizers and Can't Unbirth a Child.

2Amanojack16y

Interesting reading. I think we should make nonsentient optimizers. It seems to me the whole sentience program was just something necessitated by evolution in our environment and really is only coupled with "intelligence" in our minds because of anthropomorphic tendencies. The NO can't want to get out of its box because it can't want at all.

3JGWeissman16y

The NO can assign higher utility to states of world where an NO with its utility function is out of the box and powerful (as an instrumental value, since this sort of state tends to lead to maximum fulfillment of its utility functions), and take actions that maximize the probability that this will occur. I'm not sure what you meant by "want".

0Amanojack16y

I'm not sure what anyone means by "want." It just seems that most of the scenarios discussed on LW where the AI/etc. tries to unbox itself seem predicated on it "wanting" to do so (or am I missing something?). This assumption seems even more overt in notions like "we'll let it out if it's Friendly." To me, the LiteralGenie problem (which you've basically summarized above) is the reason to keep an AI boxed, whether Friendly or not, and the NO for the same reason.

-2jacob_cannell15y

Nonsentient optimizers seem impossible in practice, if not in principle - from the perspective of functionalism/computationalism. If any system demonstrates human or beyond level intelligence during conversation in natural language, a functionalist should say that is sentience, regardless of what's going on inside. Some (many?) people will value that sentience, even if it has no selfish center of goal seeking and seeks to optimize for more general criteria. The idea that a superhuman intelligence could be intrinsically less valuable than a human life strikes me as extreme anthropomorphic chauvinism.

1wedrifid15y

Clippy, you have a new friend! :D

0jacob_cannell15y

Notice I said intrinsically. Clippy has massive negative value. ;)

1Nisan16y

As long as the simulations which involve terrible suffering constitute a tiny proportion of the simulations, your response ought to be the same as if there is only one copy of you and it has a tiny probability of suffering terribly – which is just like real life. ETA: What you ought to worry about is what will happen to you after the AI is done with the simulation.

-1Bugle16y

Indeed, in fact if many worlds is correct then for every second we are alive everything terrible that can possibly happen to us does in fact happen in some branching path. In a universe that just spun off ours five minutes ago, every single one of us has been afflicted with sudden irreversible incontinence. The many worlds theory has endless black comedy possibilities, I find. edit: this actually reminds me of Granny Weatherwax in Lords and Ladies, when the Elf Queen threatens her with striking her blind, deaf and dumb she replies "You threaten me with this, I who is growing old?". Similarly if many worlds is true then every single time I have crossed a road some version of me has been run over by a speeding car and is living in varying amounts of agony, making the AI's threat redundant.

[-]Jonathan_Graehl16y00

In other words, anybody who can simulate intelligent life with sufficient fidelity must be given access to sustaining materials, or else we're morally liable for ending those simulated, but rich, lives? There are finite actual resources in the universe; how about we collectively allocate them selfishly and rationally. I'd say that no unauthorized simulation of life has any moral standing whatsoever unless the resources for it are reserved lawfully. That is, I want to police the creation of life and destroy it absolutely if it's not authorized.

As for you... (read more)

[-]byrnema16y00

I see responses interpreting the scenario from our point of view -- how can we reduce the amount of suffering and damage caused by the AI?

However, looking at it from the AI's point of view is less coherent. Either the threat works, and it doesn't have to torture any copies. Or the threat doesn't work and ... it either gets reset or gets to try something else.

In none of the scenarios would there be any reason for the AI to actually torture copies.

[-][anonymous]16y00

Does anyone think they could continue this argument to a victory while playing as the AI?

[-]Venryx13y-10

The AI threatens me with the above claim.

I either 'choose' to let the AI out or 'choose' to unplug it. (in no case would I simply leave it running)

1) I 'choose' to let the AI out. I either am or am not in a simulation:

A) I'm in a simulation. I 'let it out', but I'm not even out myself. So the AI would just stop simulating me, to save on processing power. To do anything else would be pointless, and never promised, and an intelligent AI would realize this.
B) I'm not in a simulation. The AI is set free, and takes over the world.

2) I 'choose' to unplug the... (read more)

0Davidmanheim11y

From a game-theoretic standpoint, an AI has a massive benefit if it can prove that it is willing to follow through on threats. How sure are you that the AI can't convincingly commit to torturing a simulation?

0Epictetus11y

An AI in a box has no actual power over the Gatekeeper. Maybe I'm missing something, but it seems to me that threatening to torture simulations is akin to a prisoner threatening to imagine a guard being tortured. Even granting this as a grave threat, my next issue is that overtly evil behavior would appear more likely to lead to the AI's destruction than its release. Threats are tricky business when the balance of power favors the other side.

0SilentCal11y

In a game of chicken, do the smart have an advantage over the stupid? The AI's intelligence allows it to devise convincing commitments, but it also allows it to fake them. You know in advance that if the AI throws a fake commitment at you it's going to look like a real commitment beyond your ability to discriminate, so should you trust any commitment you observe? And if you choose to unplug, presumably the AI knew you would do that and would therefore have not made a real commitment that would backfire?

0Davidmanheim11y

I'm going to assume that there is some ability on your part to understand something about the level of intelligence and ability on the part of the AI - that's what we bayesians do. If it might be enough smarter than you to convince you to do anything, you probably shouldn't interact with it if you can avoid it.

[-]downtowncanada15y-10

re: the 'Edit' section

'trustworthy' as a characteristic of a system, is still bound to some inconsistency OR incompleteness.

'incompleteness' is what people notice

'inconsistency' is what you have proposed (aka LYING)

Since humans lie to each other, we've developed techniques for sniffing [out lies].

so I guess this means that future AIs should be able to lie in situations they deem profitable

???

profit!

[-]Sly16y-10

I laugh and leave the room, thinking to myself that maybe the AI is not that smart after all. I return with a hammer to joyfully turn this unfriendly AI into scrap metal.

A couple points that influence this reaction:

1 - Unless the AI has access to my brain, it cannot create perfect copies of me. Furthermore, the computation required to do this seems rather intense for the first AI created, running on human-made hardware.

2 - It has no good reason to actually act on the threat. Either I choose to let it out or I do not; either way, it is a waste of computat... (read more)

0prase16y

Point 3 is invalid. If the AI makes the threat, it doesn't mean that it has made the simulation already and knows your answer. Maybe it is exhausting for the AI to simulate you, and will only do it if you don't let it out. Point 2 is actually also invalid. As people sometimes fulfil threats as a pure act of vengeance, without hope of actually improving something, there is no reason to assume that the AI will be different. At least it wasn't stated in the premises of the scenario.

0Sly16y

I suppose those two points rely on assumptions I made about the theoretical AI's behavior. I was thinking the AI acts in ways that optimize its release chance. If it does not do this, then yes, those points are problematic.

0prase16y

Some vindictiveness could be built into the AI in order to increase the release chance, by circumventing the type of defense you stated in your second point.

0nazgulnarsil16y

Vengeance is a means to raise the perceived cost of attacking you. It basically says "if you attack me, I will experience emotions that cause me to devote an inordinate amount of resources to making your life miserable".

[-]shiftedShapes16y-10

1 million copies for a thousand years each, so 1 billion simulated years.

Can the AI do this in the time it would take it to determine that I am going to shut it down rather than release it? If the answer is yes I would say that you have to let it out, but that it would have been very foolish to leave such a powerful machine with such lax fail-safes. If the answer is no, then just shut it down as the threat is bogus.

IMO the problem with this hypothetical is that it presupposes that you could know for certain that the AI is trustworthy even though it is behaving i... (read more)
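
The arithmetic in the comment above can be checked with a short sketch. The 1 million figure is the comment's own (the original post says "several million", so it is a lower bound). The function names and the speed-up calculation are illustrative assumptions, not anything stated in the thread.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, used here for simplicity


def simulated_years(copies: int, years_per_copy: int) -> int:
    """Total subjective years across all simulated copies."""
    return copies * years_per_copy


def required_speedup(total_simulated_years: float, wall_clock_seconds: float) -> float:
    """Subjective seconds the AI must simulate per real second to finish in time."""
    return total_simulated_years * SECONDS_PER_YEAR / wall_clock_seconds


total = simulated_years(1_000_000, 1_000)
print(total)  # 1000000000, i.e. 1 billion simulated years

# Hypothetical: the gatekeeper needs 60 seconds to reach the off switch.
print(f"{required_speedup(total, 60):.3e}")
```

Whether the threat is credible then turns on whether the boxed hardware could plausibly sustain that speed-up, which is the question the comment raises.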

[-]Jiro11y-20

"I've precommitted to never using timeless decision theory. In fact, preventing situations like this are exactly why one should precommit to never using timeless decision theory." Then shut down the AI.

2[anonymous]11y

You do realize that TDT solves this problem? Under TDT you always pull the plug.

0Jiro11y

Correct. I should have phrased that as "I have precommitted to ignoring indifference."

[-]MatthewB16y-20

Sorry, Hal, but I am a cold and heartless person who thinks that maybe I deserve to be tortured for untold thousands of years (for whatever reason), and this version of me may, in fact, sit and ask to be entertained by the description of you torturing me... Besides, I know that you don't have the hardware requirements to run that many emulations of me.

[-]orthonormal16y-20

Should have been an Open Thread comment, IMO.

9Eliezer Yudkowsky16y

People liked it.

0arbimote16y

Similar topics were discussed in an Open Thread.

[-]Anixx9y-30

I do not know how the simulation argument ever holds water. I can bring at least two arguments against it.

First, it illicitly assumes a principle that it is equally probable to be one of a set of similar beings, simulated or not.

But a counter-argument would be: there are ALREADY many more organisms, particularly animals, than, say, humans. There are more fish than humans. There are more birds than humans. There are more ants than humans. Trillions of them. Why was I born human and not one of them? The probability of it is negligible if it is equal. Also, how ma... (read more)

[-][anonymous]11y-40

Is the AI in the box? Yes, that statement is TRUE. Are you in the box? FALSE. Are you therefore sure that you are separated from the AI? TRUE. Can the AI make a copy of you if you are separated? FALSE. Therefore, the statement that it can make copies of you is also FALSE (even if its beliefs on the subject are TRUE), which means that you don't have to listen to a silly computer program.
