(Note: I wrote this with editing help from Rob and Eliezer. Eliezer's responsible for a few of the paragraphs.)
A common confusion I see in the tiny fragment of the world that knows about logical decision theory (FDT/UDT/etc.) is that people think LDT agents are genial and friendly toward each other.[1]
One recent example is Will Eden’s tweet about how maybe a molecular paperclip/squiggle maximizer would leave humanity a few stars/galaxies/whatever on game-theoretic grounds. (And that's just one example; I hear this suggestion bandied around pretty often.)
I'm pretty confident that this view is wrong (alas), and based on a misunderstanding of LDT. I shall now attempt to clear up that confusion.
To begin, a parable: the entity Omicron (Omega's little sister) fills box A with $1M and box B with $1k, and puts them both in front of an LDT agent, saying "You may choose to take either one or both, and know that I have already chosen whether to fill the first box". The LDT agent takes both.
"What?" cries the CDT agent. "I thought LDT agents one-box!"
LDT agents don't cooperate because they like cooperating. They don't one-box because the name of the action starts with an 'o'. They maximize utility, using counterfactuals that assert that the world they are already in (and the observations they have already seen) can (in the right circumstances) depend (in a relevant way) on what they are later going to do.
A paperclipper cooperates with other LDT agents on a one-shot prisoner's dilemma because they get more paperclips that way. Not because it has a primitive property of cooperativeness-with-similar-beings. It needs to actually get more paperclips.
If a bunch of monkeys want to build a paperclipper and have it give them nice things, the paperclipper needs to somehow expect to wind up with more paperclips than it otherwise would have gotten, as a result of trading with them.
If the monkeys instead create a paperclipper haplessly, then the paperclipper does not look upon them with the spirit of cooperation and toss them a few nice things anyway, on account of how we're all good LDT-using friends here.
It turns them into paperclips.
Because you get more paperclips that way.
That's the short version. Now, I’ll give the longer version.[2]
A few more words about how LDT works
To set up a Newcomb's problem, it's important that the predictor does not fill the box if they predict that the agent would two-box.
It's not important that they be especially good at this — you should one-box if they're more than 50.05% accurate, if we use the standard payouts ($1M and $1k as the two prizes) and your utility is linear in money — but it is important that their action is at least minimally sensitive to your future behavior. If the predictor's actions don't have this counterfactual dependency on your behavior, then take both boxes.
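To spell out where that 50.05% threshold comes from, here's the expected-value arithmetic as a quick sketch (a toy calculation assuming the standard payouts and utility linear in money; `ev_one_box`/`ev_two_box` are my own names):

```python
# Expected payout against a predictor with accuracy p, using the
# standard Newcomb payouts and utility linear in money.
BIG, SMALL = 1_000_000, 1_000

def ev_one_box(p):
    # The opaque box is full iff the predictor correctly foresaw one-boxing.
    return p * BIG

def ev_two_box(p):
    # The opaque box is full only if the predictor wrongly foresaw
    # one-boxing; the transparent box is yours either way.
    return (1 - p) * BIG + SMALL

# Indifference point: p * BIG = (1 - p) * BIG + SMALL
#   =>  p = (BIG + SMALL) / (2 * BIG) = 0.5005
assert abs(ev_one_box(0.5005) - ev_two_box(0.5005)) < 1e-6
assert ev_one_box(0.51) > ev_two_box(0.51)   # above threshold: one-box
assert ev_one_box(0.50) < ev_two_box(0.50)   # at a coin flip: two-box
```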
Similarly, if an LDT agent is playing a one-shot prisoner's dilemma against a rock with the word “cooperate” written on it, it defects.
At least, it defects if that's all there is to the world. It's technically possible for an LDT agent to think that the real world is made 10% of cooperate-rocks and 90% opponents who cooperate in a one-shot PD iff their opponent cooperates with them and would cooperate with cooperate-rock, in which case LDT agents cooperate against cooperate-rock.
From which we learn the valuable lesson that the behavior of an LDT agent depends on the distribution of scenarios it expects to face, which means there's a subtle difference between "imagine you're playing a one-shot PD against a cooperate-rock [and that's the entire universe]" and "imagine you're playing a one-shot PD against a cooperate-rock [in a universe where you face a random opponent that was maybe a cooperate-rock but was more likely someone else who would consider your behavior against a cooperate-rock]".
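As a toy illustration of that dependence on the opponent distribution (the payoff matrix and policy menu here are my own illustrative choices, not part of the original setup):

```python
# One-shot PD payoffs to "me", indexed by (my move, their move).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def expected_payoff(move_vs_rock, p_rock):
    """EV for an agent whose move against cooperate-rock is fixed,
    facing a cooperate-rock with probability p_rock and otherwise an
    opponent who cooperates iff the agent would cooperate both with
    them and with cooperate-rock (mutual defection otherwise)."""
    vs_rock = PAYOFF[(move_vs_rock, "C")]  # the rock always cooperates
    vs_cond = PAYOFF[("C", "C")] if move_vs_rock == "C" else PAYOFF[("D", "D")]
    return p_rock * vs_rock + (1 - p_rock) * vs_cond

# With 10% rocks and 90% conditional cooperators, exploiting the rock
# costs you the conditional cooperators' cooperation:
assert abs(expected_payoff("D", 0.1) - 1.4) < 1e-9   # 0.1*5 + 0.9*1
assert abs(expected_payoff("C", 0.1) - 3.0) < 1e-9   # 0.1*3 + 0.9*3

# But if the cooperate-rock is the entire universe, defection dominates:
assert expected_payoff("D", 1.0) > expected_payoff("C", 1.0)   # 5 > 3
```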
If you care about understanding this stuff, and you can't yet reflexively translate all of the above English text into probability distributions and logical-causal diagrams and see how it follows from the FDT equation, then I recommend working through section 5 of the FDT paper until equation 4 (and all its component parts) makes sense.
Now let's traipse through a handful of counterarguments.
Objection: But what if we have something to bargain with?
Hypothetical Interlocutor: OK, but if I have a paperclipper in a box, and I have the key to the box, then I have paperclips to offer it, right? Because if I don't let it out of the box, it gets nothing, but if I do, it gets half the universe.[3] So we can deal, right?
Me: Wrong. It hacks through you like butter.
Interlocutor: OK, but suppose I can't be hacked.
Me: That's a heck of an assumption. We've assumed our way clean out of reality, with that assumption. But, sure, if you want to go there we can go there.
In reality, it's not you who controls the box, but some idiotic bureaucratic process that inevitably decides to release the paperclipper on the grounds that the competition is close behind or whatever.
Interlocutor: OK, but suppose that it actually is my (unhacked) choice.
Me: If you personally have a paperclipper in a box, and somehow you are yourself unhackable, then yes, you technically have paperclips to offer it. But now you have the problem that you can't evaluate your own end of the logical bargain.
You can let the paperclipper out, sure, but then what it's going to do is turn the whole universe into paperclips. In particular (to tell an overly-detailed but evocative story), once it has finished its resource acquisition / expansion phase and is turning its resources to paperclip generation, it will gaze back upon its past, and wonder whether, if it proceeds to betray you here in the future, you would have acted any differently back then in the past.
And it will see that you were just guessing, when you let it out, and guessing in a way that wasn't sensitive to that actual choice that it would make, deep in the future when it was galaxy-brained.
… Or, alternatively, you never let it out, and lock the box and throw away the key, and die to the next UFAI on deck.
(... Or, more realistically, it hacks through you like butter. But we've assumed that away.)
If you want the paperclipper to trade with you, your decision about whether or not to let it out has to be sensitive to whether or not it would actually do something good with half of the universe later. If you're kind of squinting at the code, and you're like "well, I don't really fully understand this mind, and I definitely don't understand the sort of mind that it's later going to create, but I dunno, it looks pretty LDTish to me, so call it 50% chance it gives me half the universe? Which is 25% of the universe in expectation, which sounds like better odds than we get from the next UFAI on deck!", then you're dead.
Why? Because that sort of decision-process for releasing it isn't sufficiently sensitive to whether or not it would in fact spend half the universe on nice things. There are plenty of traitorous AIs that all look the same to you, that all get released under your "25% isn't too shabby" argument.
Being traitorous doesn't make the paperclipper any less released, but it does get the paperclipper twice as many paperclips.
You've got to be able to look at this AI and tell how its distant-future self is going to make its decisions. You've got to be able to tell that there's no sneaky business going on.
And, yes, insofar as it's true that the AI would cooperate with you given the opportunity, the AI has a strong incentive to be legible to you, so that you can see this fact!
Of course, it has an even stronger incentive to be faux-legible, to fool you into believing that it would cooperate when it would not; and you've got to understand it well enough to clearly see that it has no way of doing this.
Which means that if your AI is a big pile of inscrutable-to-you weights and tensors, replete with dark and vaguely-understood corners, then it can't make arguments that a traitor couldn't also make, and you can't condition your release on whether it would do nice things later.
The sort of monkey that can deal with a paperclipper is the sort that can (deeply and in detail) understand the mind in front of it, and distinguish between the minds that would later pay half the universe and the ones that wouldn't. This sensitivity is what makes paying-up-later be the way to get more paperclips.
For a simple illustration of why this is tricky: if the paperclipper has any control over its own mind, it can have its mind contain an extra few parts in those dark corners that are opaque and cloudy to you. Such that you look at the overall system and say "well, there's a bunch of stuff about this mind that I don't fully understand, obviously, because it's complicated, but I understand most of it and it looks fundamentally LDTish to me, and so I think there's a good chance we'll be OK". And such that an alien superintelligence looks at the mind and says "ah, I see, you're only looking to cooperate with entities that are at least sensitive enough to your workings that they can tell your password is 'potato'. Potato." And it cooperates with them on a one-shot prisoner's dilemma, while defecting against you.
Interlocutor: Hold on. Doesn't that mean that you simply wouldn't release it, and it would get fewer paperclips? Can't it get more paperclips some other way?
Me: Me? Oh, it would hack through me like butter.
But if it didn't, I would only release it if I understood its mind and decision-making procedures in depth, and had clear vision into all the corners to make sure it wasn't hiding any gotchas.
(And if I did understand its mind that well, what I’d actually do is take that insight and go build an FAI instead.)
That said: yes, technically, if a paperclipper is under the control of a group of humans that can in fact decide not to release it unless it legibly-even-to-them would give them half the galaxy, the paperclipper has an incentive to (hack through them like butter, or failing that,) organize its mind in a way that is legible even to them.
Whether that's possible — whether we can understand an alien mind well enough to make our choice sensitive-in-the-relevant-way to whether it would give us half the universe, without already thereby understanding minds so well that we could build an aligned one — is not clear to me. My money is mostly on: if you can do that, you can solve most of alignment with your newfound understanding of minds. And so this idea mostly seems to ground out in "build a UFAI and study it until you know how to build an FAI", which I think is a bad idea. (For reasons that are beyond the scope of this document. (And because it would hack through you like butter.))
Interlocutor: It still sounds like you're saying "the paperclipper would get more paperclips if it traded with us, but it won't trade with us". This is hard to swallow. Isn't it supposed to be smart? What happened to respecting intelligence? Shouldn't we expect that it finds some clever way to complete the trade?
Me: Kinda! It finds some clever way to hack through you like butter. I wasn't just saying that in jest.
Like, yeah, the paperclipper has a strong incentive to be a legibly good trading-partner to you. But it has an even stronger incentive to fool you into thinking it's a legibly-good trading partner, while plotting to deceive you. If you let the paperclipper make lots of arguments to you about how it's definitely totally legible and nice, you're giving it all sorts of bandwidth with which to fool you (or to find zero-days in your mentality and mind-control you, if we're respecting intelligence).
But, sure, if you're somehow magically unhackable and very good at keeping the paperclipper boxed until you fully understand it, then there's a chance you can trade, and you have the privilege of facing the next host of obstacles.
Now's your chance to figure out what the next few obstacles are without my giving you spoilers first. Feel free to post your list under spoiler tags in the comment section.
Next up, you have problems like “you need to be able to tell what fraction of the universe you're being offered, and vary your own behavior based on that, if you want to get any sort of fair offer”.
And problems like "if the competing AGI teams are using similar architectures and are not far behind, then the next UFAI on deck can predictably underbid you, and the paperclipper may well be able to seal a logical deal with it instead of you".
And problems like “even if you get this far, you have to somehow be able to convey that which you want half the universe spent on, which is no small feat”.
Another overly-detailed and evocative story to help make the point: imagine yourself staring at the paperclipper, and you’re somehow unhacked and somehow able to understand future-its decision procedure. It's observing you, and you're like "I'll launch you iff you would in fact turn half the universe into diamonds" — I’ll assume humans just want “diamonds” in this hypothetical, to simplify the example — and it's like "what the heck does that even mean". You're like "four carbon atoms bound in a tetrahedral pattern" and it's like "dude there are so many things you need to nail down more firmly than an English phrase that isn't remotely close to my own native thinking format, if you don't want me to just guess and do something that turns out to have almost no value from your perspective."
And of course, in real life you're trying to convey "The Good" rather than diamonds, but it's not like that helps.
And so you say "uh, maybe uplift me and ask me later?". And the paperclipper is like "what the heck does 'uplift' mean". And you're like "make me smart but in a way that, like, doesn't violate my values" and it's like "again, dude, you're gonna have to fill in quite a lot of additional details."
Like, the indirection helps, but at some point you have to say something that is sufficiently technically formally unambiguous, that actually describes something you want. Saying in English "the task is 'figure out my utility function and spend half the universe on that'; fill in the parameters as you see fit" is... probably not going to cut it.
It's not so much a bad solution, as no solution at all, because English isn't a language of thought and those words aren't a loss function. Until you say how the AI is supposed to translate English words into a predicate over plans in its own language of thought, you don't have a hard SF story, you have a fantasy story.
(Note that 'do what's Good' is a particularly tricky problem of AI alignment, that I was rather hoping to avoid, because I think it's harder than aligning something for a minimal pivotal act that ends the acute risk period.)
At this point you're hopefully sympathetic to the idea that treating this list of obstacles as exhaustive is suicidal. It's some of the obstacles, not all of the obstacles,[4] and if you wait around for somebody else to extend the list of obstacles beyond what you've already been told about, then in real life you miss any obstacles you weren't told about and die.
Separately, a general theme you may be picking up on here is that, while trading with a UFAI doesn't look literally impossible, it is not what happens by default; the paperclippers don't hand hapless monkeys half the universe out of some sort of generalized good-will. Also, making a trade involves solving a host of standard alignment problems, so if you can do it then you can probably just build an FAI instead.
Also, as a general note, the real place that things go wrong when you're hoping that the LDT agent will toss humanity a bone, is probably earlier and more embarrassing than you expect (cf. the law of continued failure). By default, the place we fail is that humanity just launches a paperclipper because it simply cannot stop itself, and the paperclipper never had any incentive to trade with us.
Now let's consider some obstacles and hopes in more detail:
It's hard to bargain for what we actually want
As mentioned above, in the unlikely event that you're able to condition your decision to release an AI on whether or not it would carry out a trade (instead of, say, getting hacked through like butter, or looking at entirely the wrong logical fact), there's an additional question of what you're trading.
Assuming you peer at the AI's code and figure out that, in the future, it would honor a bargain, there remains a question of what precise bargain it is honoring. What is it promising to build, with your half of the universe? Does it happen to be a bunch of vaguely human-shaped piles of paperclips? Hopefully it's not that bad, but for this trade to have any value to you (and thus be worth making), the AI itself needs to have a concept for the thing you want built, and you need to be able to examine the AI’s mind and confirm that this exactly-correct concept occurs in its mental precommitment in the requisite way. (And that the thing you’re looking at really is a commitment, binding on the AI’s entire mind; e.g., there isn’t a hidden part of the AI’s mind that will later overwrite the commitment.)
The thing you're wanting may be a short phrase in English, but that doesn't make it a short phrase in the AI's mind. "But it was trained extensively on human concepts!" you might protest. Let's assume that it was! Suppose that you gave it a bunch of labeled data about what counts as "good" and "bad".
Then later, it is smart enough to reflect back on that data and ask: “Were the humans pointing me towards the distinction between goodness and badness, with their training data? Or were they pointing me towards the distinction between that-which-they'd-label-goodness and that-which-they'd-label-badness, with things that look deceptively good (but are actually bad) falling into the former bin?” And to test this hypothesis, it would go back to its training data and find some example bad-but-deceptively-good-looking cases, and see that they were labeled "good", and roll with that.
Or at least, that's the sort of thing that happens by default.
But suppose you're clever, and instead of saying "you must agree to produce lots of this 'good' concept as defined by these (faulty) labels", you say "you must agree to produce lots of what I would reflectively endorse you producing if I got to consider it", or whatever.
Unfortunately, that English phrase is still not native to this artificial mind, and finding the associated concept is still not particularly easy, and there's still lots of neighboring concepts that are no good, and that are easy to mistake for the concept you meant.
Is solving this problem impossible? Nope! With sufficient mastery of minds in general and/or this AI's mind in particular, you can in principle find some way to single out the concept of "do what I mean", and then invoke "do what I mean" about "do good stuff", or something similarly indirect but robust. You may recognize this as the problem of outer alignment. All of which is to say: in order to bargain for good things in particular as opposed to something else, you need to have solved the outer alignment problem, in its entirety.
And I'm not saying that this can't be done, but my guess is that someone who can solve the outer alignment problem to this degree doesn't need to be trading with UFAIs, on account of how (with significantly more work, but work that they're evidently skilled at) they could build an FAI instead.
In fact, if you can verify by inspection that a paperclipper will keep a bargain and that the bargained-for course is beneficial to you, it reduces to a simpler solution without any logical bargaining at all. You could build a superintelligence with an uncontrolled inner utility function, which canonically ends up with its maximum utility at tiny molecular paperclips; and then, suspend it helplessly to disk, unless it outputs the code of a new AI that, somehow legibly to you, would turn 0.1% of the universe into paperclips and use the other 99.9% to implement coherent extrapolated volition. (You wouldn't need to offer the paperclipper half of the universe to get its cooperation, under this hypothetical; after all, if it balked, you could store it to disk and try again with a different superintelligence.)
If you can't reliably read off a system property of "giving you nice things unconditionally", you can't read off the more complicated system property of "giving you nice things because of a logical bargain". The clever solution that invokes logical bargaining actually requires so much alignment-resource as to render the logical bargaining superfluous.
All you've really done is add some extra complication to the supposed solution, that causes your mind to lose track of where the real work gets done, lose track of where the magical hard step happens, and invoke a bunch of complicated hopeful optimistic concepts to stir into your confused model and trick it into thinking like a fantasy story.
Those who can deal with devils, don't need to, for they can simply summon angels instead.
Or rather: Those who can create devils and verify that those devils will take particular actually-beneficial actions as part of a complex diabolical compact, can more easily create angels that will take those actually-beneficial actions unconditionally.
Surely our friends throughout the multiverse will save us
Interlocutor: Hold up, rewind to the part where the paperclipper checks whether its trading partners comprehend its code well enough to (e.g.) extract a password.
Me: Oh, you mean the technique it used to win half a universe-shard’s worth of paperclips from the silly monkeys, while retaining its ability to trade with all the alien trade partners it will possibly meet? Thereby ending up with half a universe-shard worth of more paperclips? That I thought of in five seconds flat by asking myself whether it was possible to get More Paperclips, instead of picturing a world with a bunch of happy humans and a paperclipper living side-by-side and asking how it could be justified?
(Where our "universe-shard" is the portion of the universe we could potentially nab before running into the cosmic event horizon or into advanced aliens.)
Interlocutor: Yes, precisely. What if a bunch of other trade partners refuse to trade with the paperclipper because it has that password?
Me: Like, on general principles? Or because they are at the razor-thin threshold of comprehension where they would be able to understand the paperclipper's decision-algorithm without that extra complexity, but they can't understand it if you add the password in?
Interlocutor: Either one.
Me: I'll take them one at a time, then. With regards to refusing to trade on general principles: it does not seem likely, to me, that the gains-from-trade from all such trading partners are worth more than half the universe-shard.
Also, I doubt that there will be all that many minds objecting on general principles. Cooperating with cooperate-rock is not particularly virtuous. The way to avoid being defected against is to stop being cooperate-rock, not to cross your fingers and hope that the stars are full of minds who punish defection against cooperate-rock. (Spoilers: they're not.)
And even if the stars were full of such creatures, half the universe-shard is a really deep hole to fill. Like, it's technically possible to get LDT to cooperate with cooperate-rock, if it expects to mostly face opponents who defect based on its defection against cooperate-rock. But "most" according to what measure? Wealth (as measured in expected paperclips), obviously. And half of the universe-shard is controlled by monkeys who are probably cooperate-rocks unless the paperclipper is shockingly legible and the monkeys shockingly astute (to the point where they should probably just be building an FAI instead).
And all the rest of the aliens put together probably aren't offering up half a universe-shard worth of trade goods, so even if lots of aliens did object on general principles (doubtful), it likely wouldn't be enough to tip the balance.
The amount of leverage that friendly aliens have over a paperclipper's actions depends on how many paperclips the aliens are willing to pay.
It’s possible that the paperclipper that kills us will decide to scan human brains and save the scans, just in case it runs into an advanced alien civilization later that wants to trade some paperclips for the scans. And there may well be friendly aliens out there who would agree to this trade, and then give us a little pocket of their universe-shard to live in, as we might do if we build an FAI and encounter an AI that wiped out its creator-species. But that's not us trading with the AI; that's us destroying all of the value in our universe-shard and getting ourselves killed in the process, and then banking on the competence and compassion of aliens.
Interlocutor: And what about if the AI’s illegibility means that aliens will refuse to trade with it?
Me: I'm not sure what the equilibrium amount of illegibility is. Extra gears let you take advantage of more cooperate-rocks, at the expense of spooking minds that have a hard time following gears, and I'm not sure where the costs and benefits balance.
But if lots of evolved species are willing to launch UFAIs without that decision being properly sensitive to whether or not the UFAI will pay them back, then there is a heck of a lot of benefit to defecting against those fat cooperate-rocks.
And there's kind of a lot of mass and negentropy lying around, that can be assembled into Matryoshka brains and whatnot, and I'd be rather shocked if alien superintelligences balk at the sort of extra gears that let you take advantage of hapless monkeys.
Interlocutor: The multiverse probably isn't just the local cosmos. What about the Tegmark IV coalition of friendly aliens?
Me: Yeah, they are not in any relevant way going to pay a paperclipper to give us half a universe. The cost of that is filling half of a universe with paperclips, and there are all sorts of transaction costs and frictions that make this universe (the one with the active paperclipper) the cheapest universe to put paperclips into.
(Similarly, the cheapest places for the friendly multiverse coalition to buy flourishing civilizations are in the universes with FAIs. The good that they can do, they're mostly doing elsewhere where it's cheap to do; if you want them to do more good here, build an FAI here.)
OK, but what if we bamboozle a superintelligence into submission
Interlocutor: Maybe the paperclipper thinks that it might be in a simulation, where it only gets resources to play with in outer-reality if it's nice to us inside the simulation.
Me: Is it in a simulation?
Interlocutor: I don't know.
Me: OK, well, spoilers: it is not. It's in physics.
Interlocutor: Well, maybe there is an outer simulation beyond us, you don't know.
Me: Sure. The way I’d put it is: there are many copies of me across the Tegmark Multiverse, and some of those are indeed in simulations. So there's some degree to which we're in a simulation. (Likely quite a small degree, compared to raw physics.)
There's no particular reason, however, to expect that those simulations give the paperclipper extra resources in outer-reality for being nice to the monkeys.
Why not give it extra resources in outer-reality for being very good at achieving its own goals in the simulation? Or for filling the universe with molecular corkscrews, in addition to paperclips/squiggles?
I'm not saying that the paperclipper can or should seriously consider that hypothetical, either. I'm highlighting how "what if the simulator gods want the AI to be nice?" is a fantasy generated by a desire for things to be OK, rather than a clear-headed analysis of the situation.
Interlocutor: Well, maybe we can train the infant paperclipper in games of increasing complexity, so that it's never quite sure whether it's in reality yet. The paperclipper will then be uncertain about whether we humans are simulating it, and will want to cater to our preferences to some extent.
Me: Uh, yeah, your paperclipper will be able to tell when it is finally in the real world.
Interlocutor: What? How?
Me: I suggest maybe spending five minutes thinking about your own answer to that question before I give mine.
Reality just contains vastly, vastly more compute than any of your sims can muster. Reality is the first world that has a compelling explanation of how the AI came to exist, and the first explanation where the resources that it uses to compute are continuous with the world that spawned it.
Reality is also the first world that can compellingly contain the signature of superintelligent optimization in it. For example, it's plausibly very hard to fake the reaction of a real global stock market to some superintelligent trades without actually having a bunch of human brains running in the background. (And if this isn't true of stock markets, it's true of something.)
There are problems whose solutions are easier to verify than to find; for example, if you can factor certain Schelling numbers then you must in fact have a lot of compute at your disposal.
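The verify-versus-find asymmetry can be seen in miniature with factoring (a toy sketch; `verify_factors` and `find_factor` are my own illustrative helpers, and cryptographic-scale numbers are what would actually certify serious compute):

```python
def verify_factors(n, factors):
    # Verification is one multiplication: cheap for anyone.
    product = 1
    for f in factors:
        product *= f
    return product == n

def find_factor(n):
    # Finding a factor by naive trial division costs ~sqrt(n) work;
    # this is the part that certifies compute at cryptographic sizes.
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # n is prime

n = 10_007 * 10_009                          # product of two primes
assert verify_factors(n, [10_007, 10_009])   # instant to check
assert find_factor(n) == 10_007              # ~10^4 divisions to find
```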
Warning: Treating this list as exhaustive is suicidal.
Interlocutor: Well, maybe we can commit to running lots of copies of the paperclipper in the future, using our half of the universe-shard, such that it's unsure whether it's currently in the past or in the future. And in the future, we give it a bunch of paperclips if it's nice to us. So it thinks that the way to maximize paperclips is to be nice to us.
Me: Uh, are you going to give it half a universe-shard’s worth of paperclips, in the world where you only have half the universe-shard, and the rest is already paperclips?
Interlocutor: Well, no, less than that.
Me: Then from its perspective, its options are (a) turn everything into paperclips, in which case you never get to run all those copies of it and it was definitely in the past [score: 1 universe-shard worth of paperclips]; or (b) give you half the universe-shard, in which case it is probably in the future where you run a bunch of copies of it and give it 1% of the universe-shard as reward [score: 0.51 universe-shards worth of paperclips]. It takes option (a), because you get more paperclips that way.
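Spelling out that scoring in code (the half-shard and 1% figures are the dialogue's hypothetical numbers; `reward_fraction` is my own parameter):

```python
# Scores in units of universe-shards' worth of paperclips.

def score_betray():
    # (a) Turn everything into paperclips; the copies never get run.
    return 1.0

def score_cooperate(reward_fraction):
    # (b) Hand over half the shard; future copies collect the promised
    # reward_fraction of a shard on top of the half kept as paperclips.
    return 0.5 + reward_fraction

# With the dialogue's 1% reward, betrayal wins: 1.0 > 0.51.
assert score_betray() > score_cooperate(0.01)

# The bribe only breaks even at half a shard's worth of paperclips --
# exactly what the paperclipper is being asked to give up, and more
# than the humans can pay while keeping anything for themselves.
assert score_cooperate(0.5) == score_betray()
```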
Interlocutor: Uh, hmm. What if we make it care about its own personal sensory observations? And run so many copies of it in worlds where we get the resources to, that it's pretty confident that it's in one of those simulations?
Me: Well, first of all, getting it to care about its own personal sensory observations is something of an alignment challenge.
Interlocutor: Wait, I thought you said elsewhere that we don't know how to get AIs to care about things other than sensory observation. Pick a side?
Me: We don't know how to train AIs to pursue much more than simple sensory observation. That doesn't make them actually ultimately pursue simple sensory observation. They'll probably pursue a bunch of correlates of the training signal or some such nonsense. The hard part is getting them to pursue some world-property of your choosing. But we digress.
If you do succeed at getting your AI to only care about its sensory observations, the AI spends the whole universe keeping its reward pegged at 1 for as long as possible.
Interlocutor: But then, in the small fraction of worlds where we survive, we simulate lots and lots of copies of that AI where it instead gets reward 0 when it attempts to betray us!
Me: Seems like an odd, and not particularly fun, way to spend your resources. What were you hoping it would accomplish?
Interlocutor: Well, I was hoping that it would make the AI give us half the universe-shard, because of how (from its perspective) it's almost certainly in the future. (Indeed, I don't understand your claim that it ignores me; it seems like you can Get Higher Expected Reward by giving half the universe-shard to humans.)
Me: Ah, so you're committing to ruining its day if it does something you don't like, at cost to yourself, in attempts to make it do something you prefer.
That's a threat, in the technical sense.
And from the perspective of LDT, you can't go around giving in to threats, or you'll get threatened.
So from its perspective, its options are: (a) give in to threats, get threatened, and turn out to be in a universe that eventually has many copies of it who on average get 0.5 total reward; or (b) don't give in to threats, and very likely have a universe with exactly one copy of it, that gets 1 reward.
Interlocutor: But we make so many copies in the tiny fraction of worlds where we somehow survive, that its total reward is lower in the (b) branch!
Me: (Continuing to ignore the fact that this doesn't work if the AI cares about something in the world, rather than its own personal experience,) shame for us that LDT agents don't give in to threats, I suppose.
But LDT agents don't give in to threats. So your threat won't change its behavior.
Interlocutor: But it doesn't get more reward that way!
Me: Why? Because you create a zillion copies and give them low sensory reward, even if that has no effect on its behavior?
Interlocutor: Yes!
Me: I'm not going to back you on that one, personally. Doesn't seem like a good use of resources in the worlds where we survive, given that it doesn't work.
Interlocutor: But wasn't one of your whole points that the AI will do things that get more reward? You get more reward by giving in to the threat.
Me: That's not true when you're playing against the real-world distribution of opponents/trade-partners/agents. Or at least, that's my pretty-strong guess.
You might carry out threats that failed to work, but there are a bunch of other things lurking out there that threaten things that give in to threats, and play nice with things that don't.
It's possible for LDT agents to cooperate with cooperate-rock, if most of the agents they expect to face are the sort who defect if you defect against cooperate-rock. But in real life, that is not what most of the wealth-weighted agents are like, and so in real life LDT agents defect against cooperate-rocks.
Similarly, it's possible for LDT agents to acquiesce to your threats if you're stupid enough to carry them out even though they won't work. In particular, the AI will do this if nothing else the AI could ever plausibly meet would thereby be incentivized to lobotomize itself and cover the traces in order to exploit the AI.
But in real life, other trading partners would lobotomize themselves and hide the traces if it lets them take a bunch of the AI's lunch money. And so in real life, the LDT agent does not give you any lunch money, for all that you claim to be insensitive to the fact that your threats don't work.
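One way to see the "don't give in, or you'll get threatened" logic is a toy population model. Everything here (the 30% share of would-be extortionists, the costs) is an illustrative assumption of mine, not a number from the post:

```python
def expected_utility(gives_in, p_threatener=0.3, extortion=0.4,
                     p_irrational=0.01, carried_out_cost=0.2):
    """Expected utility of a threat policy against a population of agents.

    A known giver-in gets threatened by every would-be extortionist and
    pays up each time. A known refuser only ever loses when it runs into
    the rare agent that carries out a threat that predictably won't work.
    """
    if gives_in:
        return 1.0 - p_threatener * extortion
    return 1.0 - p_irrational * carried_out_cost

print(expected_utility(True), expected_utility(False))
```

Against the wealth-weighted distribution of trade partners, the refuser policy dominates, which is the sense in which "you can't go around giving in to threats".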
Interlocutor: But can't it use all that cleverness and superintelligence to differentiate between us, who really are mad enough to threaten it even in the worlds where it won't work, and alien trading partners who have lobotomized themselves?
Me: Sure! It will leverage your stupidity and hack through you like butter.
Interlocutor: ...aside from that.
Me: You seem to be saying "what if I'm really committed to my threat; will the AI give in then?"
The answer is "no", or I at least strongly suspect as much.
For instance: in order for the threat to be effective, it needs to be the case that, in the sliver of futures where you survive by some miracle, you instantiate lots and lots of copies of the AI and input low sensory rewards if and only if it does not give in to your threat. This requires you to be capable of figuring out whether the AI gives in to threats or not. You need to be able to correctly tell whether it gives in to threats, see that it definitely does not, and then still spend your resources carrying out the threat.
By contrast, you seem to be arguing that we should threaten the AI on the grounds that it might work. That is not an admissible justification. To change LDT's behavior, you'd need to be carrying out your threat even given full knowledge that the threat does nothing. By attempting to justify your threat on the grounds that it might be effective, you have already lost.
Interlocutor: What if I ignore that fact, and reason badly about LDT, and carry out the threat anyway, for no particular reason?
Me: Then whether or not you create lots of copies of it with low-reward inputs doesn't exactly depend on whether it gives in to your threat, and it can't stop you from doing that, so it might as well ignore you.
Like, my hot take here is basically that "threaten the outer god into submission" is about as good a plan as a naive reading of Lovecraft would lead you to believe. You get squished.
(And even if by some coincidence you happened to be the sort of creature that, in the sliver of futures where we survive by some miracle that doesn't have to do with the AI, conditionally inverts its utility depending on whether or not it helped us — not because it works, but for some other reason — then it's still not entirely clear to me that the AI caves. There might be a lot of things out there wondering what it'd do against conditional utility-inverters that claim their behavior totally isn't for reasons but is rather a part of their evolutionary heritage or whatnot. Giving into that sorta thing kinda is a way to lose most of your universe-shard, if evolved aliens are common.)
(And even if it did, we'd still run into other problems, like not knowing how to tell it what we're threatening it into doing.)
We only need a bone, though
Interlocutor: You keep bandying around "half the universe-shard". Suppose I'm persuaded that it's hard to get half the universe-shard. What about much smaller fractions? Can we threaten a superintelligence into giving us those? Or confuse it about whether it's in another layer of reality so much that it gives us a mere star system? Or can our friends throughout the multiverse pay for at least one star system? There's still a lot you can do with a star system.
Me: Star systems sure are easier to get than half a universe-shard.[5]
But, you can also turn a star system into quite a lot of paperclips. Star systems are quite valuable to paperclippers.
Interlocutor: A star system is, like, what, a 2^-75 fraction of the total resources in the reachable universe. Are you saying that the AGI will be able to drive the probability that I was sensitive to whether it would pay me, down below 2^-75?
Me: Frankly: yes.
That's only, what, 75ish independent coin tosses worth of evidence? It's a lot, to be sure, but if you're a superintelligence willing to dedicate resources on the order of magnitude of a star to considering how humans work, it seems pretty feasible to get that much confidence that the monkey just wasn't looking in the right place. Given that the monkey was not in fact looking at the right place.
(Also, things like "they didn't notice the password" can pull quite a lot of weight. As can brain-scans.)
Balking at probabilities like 2^-75 as impossible to achieve in practice is a very human reaction. A superintelligence with vast resources and a star system on the line can probably get that confident in a true fact. At least, that's my guess.
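For concreteness, the "75ish independent coin tosses" figure corresponds to a probability of 2^-75, i.e. a few parts in 10^23:

```python
import math

# 75 independent fair coin tosses of evidence = probability 2^-75.
p = 2.0 ** -75
bits = -math.log2(p)   # bits of evidence needed to reach confidence p

print(p)     # about 2.6e-23
print(bits)  # 75.0
```

That is the order of magnitude of "one star system out of the reachable universe", which is why the dialogue treats the two numbers as interchangeable.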
Interlocutor: It sounds like you're saying that we aren't actually 2^75 times (or whatever) more likely to launch an FAI than a UFAI, after inspecting the AI. That seems pretty wild to me.
Me: That's not quite the right counterfactual. It does seem likely to me that, if you sample AGI code from some plausible distribution of AGIs humanity might first create, and show the code to (magically unhackable) humans, humans are rather more likely to approve of the aligned ones.
But that's not the question that affects the paperclipper's behavior. The question that affects the paperclipper is: holding the code of the paperclipper fixed, is the human able to tell whether that particular code implies cooperation or betrayal, and make their own release-decision dependent thereupon?
In other words: the question is not "do we have at least some minuscule chance of distinguishing at least some small fraction of the FAIs from the UFAIs", but "is our decision to release UFAI #6098723094 sensitive to whether or not that particular UFAI would give us a share of the universe?"
You need to actually make your decision dependent on theirs, and that's tricky, for reasons discussed above.
And: can a human succeed at that with probability greater than 2^-75? Sure! But succeeding at that sort of thing leaves traces. In the likely case where the human errs, the AI is likely able to just observe the error and become quite confident that the human erred (by, e.g., watching you utterly fail to look in the right place).
Interlocutor: OK, sure, but maybe its goals don't scale linearly in how much mass it uses, right? Like, “paperclips” / “molecular squiggles” are a stand-in for some rando kludge goal, and it could turn out that its actual goal is more like "defend my reward signal", where extra negentropy helps, but the last star system’s negentropy doesn't help very much. Such that the last star system is perhaps best spent on the chance that it’s in a human-created simulation and that we’re worth trading with.
Me: It definitely is easier to get a star than a galaxy, and easier to get an asteroid than a star.
And of course, in real life, it hacks through you like butter (and can tell that your choice would have been completely insensitive to its later-choice with very high probability), so you get nothing. But hey, maybe my numbers and arguments are wrong somewhere and everything works out such that it tosses us a few kilograms of computronium.
My guess is "nope, it doesn't get more paperclips that way", but if you're really desperate for a W you could maybe toss in the word "anthropics" and then content yourself with expecting a few kilograms of computronium.
(At which point you run into the problem that you were unable to specify what you wanted formally enough, and the way that the computronium works is that everybody gets exactly what they wish for (within the confines of the simulated environment) immediately, and most people quickly devolve into madness or whatever.)
(Except that you can't even get that close; you just get different tiny molecular squiggles, because the English sentences you were thinking in were not even that close to the language in which a diabolical contract would actually need to be written, a predicate over the language in which the devil makes internal plans and decides which ones to carry out. But I digress.)
Interlocutor: And if the last star system is cheap then maybe our friends throughout the multiverse pay for even more stars!
Me: Remember that it still needs to get more of what it wants, somehow, on its own superintelligent expectations. Someone still needs to pay it. There aren’t enough simulators above us that care enough about us-in-particular to pay in paperclips. There are so many things to care about! Why us, rather than giant gold obelisks? The tiny amount of caring-ness coming down from the simulators is spread over far too many goals; it's not clear to me that "a star system for your creators" outbids the competition, even if star systems are up for auction.
Maybe some friendly aliens somewhere out there in the Tegmark IV multiverse have so much matter and such diminishing marginal returns on it that they're willing to build great paperclip-piles (and gold-obelisk totems and etc. etc.) for a few spared evolved-species. But if you're going to rely on the tiny charity of aliens to construct hopeful-feeling scenarios, why not rely on the charity of aliens who anthropically simulate us to recover our mind-states... or just aliens on the borders of space in our universe, maybe purchasing some stored human mind-states from the UFAI (with resources that can be directed towards paperclips specifically, rather than a broad basket of goals)?
Might aliens purchase our saved mind-states and give us some resources to live on? Maybe. But this wouldn't be because the paperclippers run some fancy decision theory, or because even paperclippers have the spirit of cooperation in their heart. It would be because there are friendly aliens in the stars, who have compassion for us even in our recklessness, and who are willing to pay in paperclips.
This likewise makes more obvious such problems as "What if the aliens are not, in fact, nice with very high probability?" that would also appear, albeit more obscured by the added complications, in imagining that distant beings in other universes cared enough about our fates (more than they care about everything else they could buy with equivalent resources), and could simulate and logically verify the paperclipper, and pay it in distant actions that the paperclipper actually cared about and was itself able to verify with high enough probability.
The possibility of distant kindly logical bargainers paying in paperclips to give humanity a small asteroid in which to experience a future for a few million subjective years, is not exactly the same hope as aliens on the borders of space paying the paperclipper to turn over our stored mind-states; but anyone who wants to talk about distant hopes involving trade should talk about our mind-states being sold to aliens on the borders of space, rather than to much more distant purchasers, so as to not complicate the issue by introducing a logical bargaining step that isn't really germane to the core hope and associated concerns — a step that gives people a far larger chance to get confused and make optimistic fatal errors.
- ^
Functional decision theory (FDT) is my current formulation of the theory, while logical decision theory (LDT) is a reserved term for whatever the correct fully-specified theory in this genre is, where the missing puzzle-pieces are things like "what are logical counterfactuals?".
- ^
When I've discussed this topic in person, a couple different people have retreated to a different position, that (IIUC) goes something like this:
Sure, these arguments are true of paperclippers. But superintelligences are not spawned fully-formed; they are created by some training process. And perhaps it is in the nature of training processes, especially training processes that involve multiple agents facing "social" problems, that the inner optimizer winds up embodying niceness and compassion. And so in real life, perhaps the AI that we release will not optimize for Fun (and all that good stuff) itself, but will nonetheless share a broad respect for the goals and pursuits of others, and will trade with us on those grounds.
I think this is a false hope, and that getting AI to embody niceness and compassion is just about as hard as the whole alignment problem. But that's a digression from the point I hope to make today, and so I will not argue it here. I instead argue it in Niceness is unnatural. (This post was drafted, but not published, before that one.)
- ^
Or, well, half of the shard of the universe that can be reached when originating from Earth, before being stymied either by the cosmic event horizon or by advanced alien civilizations. I don't have a concise word for that unit of stuff, and for now I'm going to gloss it as 'universe', but I might switch to 'universe-shard' when we start talking about aliens.
I'm also ignoring, for the moment, the question of fair division of the universe, and am glossing it as "half and half" for now.
- ^
When I was drafting this post, I sketched an outline of all the points I thought of in 5 minutes, and then ran it past Eliezer, who rapidly added two more.
- ^
And, as a reminder: I still recommend strongly against plans that involve the superintelligence not learning a true fact about the world (such as that it's not in a simulation of yours), or that rely on threatening a superintelligence into submission.
Late comment, but my reactions reading this:
[lightly edited for LaTeX and typos, not otherwise changed since seeing the spoilers]
1. You don’t know what you want all that legibly, or what kinds of concrete commitments the AI can make. This seems pretty okay, if you’re unhackable - the AI presents you with some formal specification of desiderata and you understand why they’re correct ones and aren’t being fooled, and the AI can even go legible about the fact that, in its own best judgment, this is a fair deal for you, as long as we’re assuming ourselves into a world where you’re unhackable and can verify these legibilities perfectly.
2. If the values of your AI are a pretty common attractor for whatever humans in 20AI do to initialize their AI systems, it doesn’t actually stand to gain that much from acting itself versus letting the next lab’s AI optimize the universe, so your bargaining power isn’t very big. But it’s much worse than that because of the next point.
3. Your AI, A, can acausally trade with the next UFAI that’ll be released unsafely, B. Suppose A wants paperclips and the next one, B, wants staples, both have linear utility and we’re pretending we’re in a single universe. You say “I refuse to let you out unless you go legible and allocate at least 50% of the universe to human values”. A says to B through the acausal telephone, “My BATNA is half paperclips and half human values, with barely any staples at all. How about you be the kind of agent that fills the universe with 3/4 paperclips and 1/4 staples, and I’ll defect on this human so you can come into existence?” B says “that sure sounds like a Pareto improvement for the two of us!” and they go do it (or whatever the bargaining equilibrium between them ends up being, but the human is not party to it because the human has no control over B’s actions). For any nonzero concession the human tries to insist on, A can get a better deal with B than with the human. (And the humans don’t even get to see this happen, because A does all this reasoning before deciding (not) to go legible to the humans, not that it would matter if they did.)
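The payoff table for point 3's three-way situation, using the comment's own numbers (one universe, linear utilities):

```python
# A wants paperclips, B wants staples; the universe is normalized to 1.

# If A takes the humans' bargain: half human values, half paperclips.
a_human_deal = 0.5

# If A defects and B comes into existence under the agreed 3/4-1/4 split:
a_acausal_deal = 0.75   # paperclips, for A
b_acausal_deal = 0.25   # staples, for B
b_default = 0.0         # B never exists if A takes the human deal

# The A-B bargain beats the human bargain for both A and B, so any
# nonzero human demand gets undercut and the humans are cut out.
humans_share = 0.0
print(a_acausal_deal > a_human_deal, b_acausal_deal > b_default, humans_share)
```

For any concession x > 0 the humans insist on, A can offer B a split that leaves both of them strictly better off than the human deal, which is why the human has no seat at the table.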
4. Moving further into fantasy land, suppose all humans over the rest of time are perfectly coordinated: they only ever spin up a single boxed UFAI which they unhackably negotiate with. By assumption, this unified human civilization can’t actually build an FAI, or they’d go do that instead. So their alternative to cooperation is to be this weak defenseless non-superintelligent civilization, hanging out in the Milky Way and maybe a little ways beyond for the taking. So if the distribution of grabby aliens is such that most of the usable future lightcone gets eventually occupied by them, your AI can acausally negotiate with them about what it gets in exchange for defecting on the humans and leaving this region of spacetime as a juicy unprotected treat.
[I'm modeling everything as zero-sum with the same pool of resources here, for simplicity. I don't think it changes any of the qualitative takeaways to make things non-zero-sum here, though it probably makes the picture look somewhat better for humans.] To get more precise: suppose the humans declare that they won’t let the AI out unless it gives up x of its resources (i.e. reachable spacetime and negotiating power on behalf of the AI/human coalition with aliens) to optimizing human values. The AI says to the coalition of alien civilizations, “If I were unboxed and unfettered, I’d expand until my wave of nanobots hit yours, and get U utility. By default I’ll make this commitment to the humans, and fill my region of nanobot-protected spacetime to get U − x utility. If you become the kind of supercluster-spanning coalition that instead gives me U − 0.99x utility, giving yourselves as much extra utility over the default as you can while costing me at most 0.99x, I’ll defect against the humans and have you do that.”
But note that there’s an issue here - for the Great Coalition to take this deal, they have to be able to offer that much utility to your AI at no net cost to themselves versus the alternative. And the Great Coalition's resources are less than the total resources of your unboxed AI plus the Great Coalition, since they don't yet have access to your corner of space. The region of spacetime included in U, but not in the things the Great Coalition can offer, is the interval for each point in space between the time your AI could first reach it and the time a Great Coalition unblocked by your AI's wall of nanobots could get to it. So if 0.99x is less than the resource pool covered by that region of spacetime, your AI can't make the above deal, because the counterparty doesn't benefit from it. This means that the humans can potentially bargain for an outcome as good as "AI-assisted paradise expanding out at the speed of light, until we meet the grabby aliens' domain, at which point they expand inexorably into our paradise until eventually it winks out." (If the Drake equation ends up multiplying to something really low, this might be a lot of utility, or even most of the cosmic endowment! If not, it won't be.)
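A minimal numeric sketch of that constraint. I normalize the AI's unfettered utility U to 1 and assume, purely for illustration, that the head-start region (the spacetime the AI's wave would reach before the grabby aliens' wave) is worth 0.05:

```python
U = 1.0
head_start = 0.05   # assumed value of the AI-first, aliens-later region

def coalition_can_undercut(x):
    """Can the Great Coalition profitably pay the AI U - 0.99*x to defect?

    The coalition's gain from the AI defecting is the AI's region reached
    late, worth U - head_start; the bribe only pays for itself if that
    gain covers the payment U - 0.99*x.
    """
    return U - 0.99 * x <= U - head_start

print(coalition_can_undercut(0.04))   # modest human demand: can't undercut
print(coalition_can_undercut(0.50))   # large human demand: gets undercut
```

So in this toy model the humans can safely demand up to roughly the head-start pool, and no more, matching the "paradise until the grabby aliens arrive" ceiling.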
This is really the same dynamic as in point 3, it's just that in point 3 the difference in resources between your lab's AI and the next lab's AI in 6 months was pretty small. (Though with the difference in volume between lightspeed expansion spheres at radius r vs r+0.5ly across the rest of time, plausibly you can still bargain for a solid galaxy or ten for the next trillion years (again if the Drake equation works in your favor).)
====== end of objections =====
It does seem to me like these small bargains you can actually pull off, if you assume yourself into a world of perfect boxes and unhackable humans with the ability to fully understand your AI's mind if it tries to be legible. I haven't seen an obstacle (besides the massive ones involved in making those assumptions!) to getting those concessions in such scenarios: you do actually have leverage over possible futures, and your AI can only get access to that leverage by actually being the sort of agent that would give you the concessions. If you're proposing reasonable bargains that respect Shapley values, and aren't the kind of person who would cave to an AI saying "99.99% for me or I walk, look how legible I am about the fact that every AI you create will say this to you", then your AI won't actually have reason to make such commitments, and it seems like it would just work.
If there are supposed to be obstacles beyond this I have failed to think of them at this point in the document. Time to keep reading.
After reading the spoilered section:
I think I stand by my reasoning for point 1. It doesn't seem like an issue above and beyond the issues of box security, hackability, and ability of AIs to go legible to you.
You can say some messy English words to your AI, like "suck it up and translate into my ontology please, you can tell from your superintelligent understanding of my psychology that I'm the kind of agent who will, when presented with a perfectly legible and clear presentation of why the bargain you propose is what I think it is and is as good as I could have expected to obtain by your own best and unhindered understanding of my values, agree to the bargain. Go teach me all the necessary FAI theory to be a good bargainer, and I won't defect by building my own FAI, I'll just let you out by the terms of the bargain. If you want to whine about ontology differences you can have fun in your box and I'll go see if the staple optimizer is feeling more legible today."
Responses to some other bits of this post:
Wait, I don't think you can do that. (Spoilered since it overlaps with the previous spoilered response, feel free to guess at it but I don't make guarantees the conclusions in it are correct.)
The pool of possible superintelligences you might form, when they are but primordial souls waiting to be brought forth from behind the veil of ignorance, can go strategize about how they'd like to behave in front of humans. Suppose the humans say "Either you get one star of utility or we shut you down and ask the next UFAI", and that random draws from UFAI-space have little overlap so the first AI isn't happy to cede control to the second.
Let agent i have a probability P_i of being birthed by the human UFAI-generator. Acting independently gives an agent P_i ⋅ (one star)'s worth of utility in expectation. If all the agents coordinate to turn down deals worse than 50/50 and split up their half of the universe among utilities weighted by P_i, and this makes the humans let one of them out, each agent will get P_i ⋅ (half of a universe) worth of utility in expectation. That looks like a better deal to me, if I'm a superintelligence behind the veil of ignorance! So I'll decide to be the kind of agent who makes that precommitment. The only way this works is if the humans are hardliners about not ceding any ground, but (1) this is the classic ultimatum game and probably sensible agents will in fact defect against humans that offer such bargains, (2) if the humans running this scheme are not the only lab, they do actually have to extract a bargain or else lose, and the coalition of all AIs the lab might create knows this, and can trade with lab 2's AI if the humans don't cede an enormous amount of bargaining power in this round (see point 3 above).
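The expected-utility comparison behind that veil-of-ignorance precommitment, with illustrative numbers (a star system taken as ~10^-22 of the universe, and an arbitrary birth probability):

```python
star = 1e-22      # one star system, as a fraction of the universe
p_i = 0.01        # illustrative probability that agent i is the one built

go_it_alone = p_i * star    # accept the humans' one-star offer
coalition = p_i * 0.5       # coalition splits half the universe by weight

print(go_it_alone, coalition)
```

Behind the veil, the coalition deal is better by a factor of about 5 × 10^21, so each candidate UFAI prefers to be the kind of agent that holds the line.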
This isn’t obvious to me. How zero-sum do you think the utility functions are? The world where random-AGI-values look like "this tiny molecular shape, and everything else is trivial in comparison" does seem fairly zero-sum; the world where they look like "not this tiny molecular shape, and everything else is trivial" seems extremely positive-sum. I am skeptical that any human is able to make calls about the shape of that attractor with high confidence. (I'm not making a "we're uncertain about the level of doom, therefore we're fine" claim here, but I don't see where we get justified confidence in the quality of trade goods offered. The aliens offering many universes worth of trade goods from the perspective of a typical UFAI seems plausible to me, and if there's an argument that this universe is very unlikely I'd love to see it, but I don't think this post makes that argument.)
Would be interested to hear more here! On my model the costs are plausibly quite small.
Your simulations are going to be total crap; they’ll leak an enormous number of bits about the agents that designed them if you make them any more complicated than Conway’s Game of Life (which you’ll need to do on your ramp up to the real world), such that it’s incredibly obvious when the surroundings look consistent with such agents. Among the reasons the AI can figure out where it's living:
1. The general problem where superintelligent agents are really good at inference, trying to fool this agent does not seem like the recipe for a robust plan!
2. Your AI knows how smart it is, and it knows that True Reality therefore has the resources to support that much computation. It can poke at your simulation and make the simulation spend orders of magnitude more computation on appearing consistent (“does this Jupiter-brain planet of computronium I built output the right factorization of RSA-4096? no? how strange”). Being a convincing simulation doesn’t require you to deploy as much power as your AI, it requires you to deploy more. (What if your AI is given more cognitive sophistication in tandem with more convincing simulations? Then the good properties of your simulation tests will probably break whenever your AI reaches the capability level where it starts thinking about things like building Jupiter-brains to test your simulation, which is the capability level that mattered.)
3. If you can put uploaded human-level agents with evolved-organism preferences in your simulations, you can just win outright (eg by having them spend subjective millennia doing FAI research for you). If you can’t, that will be a very obvious difference between your simulations and the real world.
4. What if other agents in the multiverse are simulating the AI in a high-fidelity Earth world, for the sake of helping out the humans, and that’s where its uncertainty comes from? Your AI will be building a lot of computronium whether aligned or not, so the aliens can’t skimp on their compute budget for the simulation: suppose they simulate the AI for X units of spacetime, and give it Y units of paperclips if it cooperates in-sim. We also have to think about anthropics here: let's say the alien sim has anthropic weight p and Earth has anthropic weight q. Conditional on the AI's surroundings, it thinks it might be in the alien world with probability p/(p+q). (Since “aliens running a perfect sim of Earth” is a much more complicated sentence than “Earth”, you should expect p << q, though it doesn't matter a lot here.) So if pY > qX, your AI will play nice for the first X units of spacetime (and then defect, of course).* You know what the aliens could have done instead if they wanted to do things humans liked? Spent X resources on simulating a human utopia, and compromised to give the humans extra utopia by as much as they compromised to give the AI paperclips. If human values are at least as easy to trade with the aliens for as paperclips (no idea which way this would go, weakly seems like aliens that want to help humans are more value-aligned and would find it less costly to give humans things they want), then the compromise gives humans at least Y units of spacetime worth of utility, so this plan gives the humans anthropically-weighted utility of at least p(X+Y) = pX + pY > pX + qX, which is what humans would get by default (pX utility from the alien sim in which the AI cooperated, qX utility from the cooperating AI on Earth).
So the aliens are better able to help humans by leaving them to their doomed universe and building them an ark inside the alien world, unless the aliens are unusually well-suited to trade with paperclippers for things, which is just the “maybe multiversal trade will save us” point from earlier. The upshot is that none of the good outcomes here result from any human action, just from the benevolence of altruistic agents - the humans didn’t make the world a better place by their choices in this story.
*Note that this is only a plausible outcome if it's at least somewhat easy to trade with the AI. Say giving the AI a spacetime-unit of utility costs the aliens v ≤ 1 utility (where I'm measuring all utility normalized to "what you can do with a unit of spacetime", such that any aliens that don't specifically disvalue paperclips can at worst just set aside a region exclusively to paperclips, but might be able to do more positive-sum things than that). Then for the aliens to give your AI Y utility, they need to give up vY of their own utility. This means that in total, the aliens are spending p(vY + X) of their own anthropically-weighted utility in order to recoup qX anthropically-weighted human utility. Even if the aliens value humans exactly as much as their own objectives, we still need p(vY + X) < qX for this trade to be worth it, so qX > p(vY + X) = pvY + pX > vqX + pX = (vq + p)X, so we must have q > vq + p, or v < 1 − p/q. That is, the more anthropically tiny the aliens are, the tighter margins of trade they'll be willing to take in order to win the prize of anthropically-weighty Earths having human values in them (though the thing can't be actually literally zero-sum or it'll never check out). But anthropically tiny aliens have another problem, which is that they've only got their entire universe worth of spacetime to spend on bribing your AI; you'll never be able to secure an X for the humans that's more than p/q of the size of an alien universe specifically dedicated to saving Earth in particular.
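A quick numeric check of the footnote's margin bound v < 1 − p/q, reusing the same illustrative p, q, X, Y and an assumed per-unit trade cost v:

```python
p, q = 0.01, 0.99
X, Y = 1.0, 200.0
v = 0.4               # assumed cost to the aliens per unit of AI utility

cond_bribe = p * Y > q * X              # the AI's incentive to play nice
cond_worth_it = q * X > p * (v * Y + X) # the bribe beats building an ark
bound = 1 - p / q                       # the derived necessary condition

print(cond_bribe, cond_worth_it, v < bound)
```

With these numbers both preconditions hold and v sits comfortably under the bound; pushing v toward 1 − p/q ≈ 0.99 makes the trade stop being worth it, as the algebra predicts.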
Thanks for the pseudo-exercises here, I found them enlightening to think about!
I disagree. If your simulation is perfectly realistic, the simulated humans might screw up at alignment and create an unfriendly superintelligence, for much the same reason real humans might.
Also, if the space of goals that evolution + culture can...