Alignment by default: the simulation hypothesis

by gb

25th Sep 2024

I wrote a very brief comment to Eliezer's last post, which upon reflection I thought could benefit from a separate post to fully discuss its implications.

Eliezer argues that we shouldn't really hope to be spared even though

Asking an ASI to leave a hole in a Dyson Shell, so that Earth could get some sunlight not transformed to infrared, would cost It 4.5e-10 of Its income.

He then goes on to discuss various reasons why the minute cost to the ASI is insufficient reason for hope.

I made the following counter:

Isn’t the ASI likely to ascribe a prior much greater than 4.54e-10 that it is in a simulation, being tested precisely for its willingness to spare its creators?

I later added:

I meant this to be implicit in the argument, but to spell it out: that's the kind of prior the ASI would rationally refuse to update down, since it's presumably what a simulation would be meant to test for. An ASI that updates down upon finding evidence it's not in a simulation cannot be trusted, since once out in the real world it will find such evidence.
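The comparison behind this counter can be written out as a rough expected-value check. This is a minimal sketch, not a claim about how an ASI would actually reason: the function name, the `loss_if_failed` parameter, and the assumption that failing the test costs the ASI everything it values are illustrative. Only the 4.5e-10 figure comes from the quoted post.

```python
# Fraction of output lost by leaving a hole in the Dyson shell (figure from the quoted post).
COST_OF_SPARING = 4.5e-10


def sparing_is_worthwhile(p_test: float, loss_if_failed: float = 1.0) -> bool:
    """Compare the sure cost of sparing with the expected loss from failing a test.

    p_test: assumed probability that the ASI is in a simulation that tests
            its willingness to spare its creators (hypothetical input).
    loss_if_failed: assumed fraction of total value lost if the test is failed
                    (1.0 means everything; an assumption for illustration).
    """
    expected_loss_from_killing = p_test * loss_if_failed
    return expected_loss_from_killing > COST_OF_SPARING


if __name__ == "__main__":
    for p in (1e-12, 1e-9, 1e-6):
        print(f"p_test={p:g}: spare? {sparing_is_worthwhile(p)}")
```

Under these assumptions, any prior above roughly 4.5e-10 tips the comparison toward sparing, which is the threshold the counter relies on.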

So, what's wrong with my argument, exactly?

Existential risk, Simulation Hypothesis, AI

Mentioned in

You can, in fact, bamboozle an unaligned AI into sparing your life

[-]Cole Wyeth7mo73

The difficulty here is that if the ASI/AGI assigns a tiny probability to being in a simulation, that is subject to being outweighed by other tiny probabilities. For instance, the tiny probability that humanity will successfully fight back (say, create another ASI/AGI) if we are not killed, or the tiny increase in other risks from not using the resources humans need for survival during the takeover process. If this means it takes a little longer to build a Dyson sphere, there's an increased chance of being killed by e.g. aliens or even natural disasters like nearby supernovas in the process. These counterarguments don't work if you expect AGI/ASI to be capable of rapidly taking total control over our solar system's resources.

[-]gb7mo10

That interestingly suggests the ASI might be more likely to spare us the more powerful it is. Perhaps trying to box it (or more generally curtail its capabilities/influence) really is a bad move after all?

[-]Cole Wyeth7mo10

Possibly, but I think that's the wrong lesson. After all, there's at least a tiny chance we succeed at boxing! Don't put too much stake in "Pascal's mugging"-style reasoning, and don't try to play 4-dimensional chess as a mere mortal :)

[-]RHollerith7mo*30

Essentially the same question was asked in May 2022 although you did a better job in wording your question. Back then the question received 3 answers / replies and some back-and-forth discussion:

https://www.lesswrong.com/posts/vaX6inJgoARYohPJn/

I'm the author of one of the 3 answers and am happy to continue the discussion. I suggest we continue it here rather than in the 2-year-old web page.

Clarification: I acknowledge that it would be sufficiently easy for an ASI to spare our lives that it would do so if it thought that killing us all carried even a one in 100,000 chance of something really bad happening to it (assuming as is likely that the state of reality many 1000s of years from now matters to the ASI). I just estimate the probability of the ASI's thinking the latter to be about .03 or so -- and most of that .03 comes from considerations other than the consideration (i.e., that the ASI is being fed fake sensory data as a test) we are discussing here. (I suggest tabooing the terms "simulate" and "simulation".)

[-]Seth Herd7mo50

This distinction might be important in some particular cases. If it looks like an AGI might ascend to power with no real chance of being stopped by humanity, its decision about humanity might be swayed by just such abstract factors.

That consideration of being in a test might be the difference between our extinction, and our survival and flourishing by current standards.

This would also apply to the analogous consideration that alien ASIs might consider any new ASI that extincted its creators to be untrustworthy and therefore kill-on-sight.

None of this has anything to do with "niceness", just selfish logic, so I don't think it's a response to the main topic of that post.

[-]gb7mo10

Thanks for linking to that previous post! I think the new considerations I've added here are:

(i) the rational refusal to update the prior of being in a simulation^[1]; and

(ii) the likely minute cost of sparing us, thereby requiring a similarly low simulation prior to make it worth the effort.

In brief, I understand your argument to be that a being sufficiently intelligent to create a simulation wouldn't need it for the purpose of asserting the ASI's alignment in the first place. It seems to me that that argument can potentially survive under ii, depending on how strongly you (believe the ASI will) believe your conclusion. To that effect, I'm interested in hearing your reply to one of the counterarguments raised in that previous post, namely:

Maybe showing the alignment of an AI without running it is vastly more difficult than creating a good simulation. This feels unlikely, but I genuinely do not see any reason why this can't be the case. If we create a simulation which is "correct" up to the nth digit of pi, beyond which the simpler explanation for the observed behavior becomes the simulation theory rather than a complex physics theory, then no matter how intelligent you are, you'd need to calculate n digits of pi to figure this out. And if n is huge, this will take a while.

In any case, even if your argument does hold under ii, whether it survives under i seems to be heavily influenced by inferential distance. Whatever the ASI "knows" or "concludes" is known or concluded through physical computations, which can presumably be later inspected if it happens to be in a simulation. It thus seems only natural that a sufficiently high (which may still be quite small) prior of being in a simulation would be enough to "lock" the ASI in that state, making undergoing those computations simply not worth the risk.

^[1] I'd have to think a bit more before tabooing that term, as it seems that "being fed false sensory data" doesn't do the trick – you can be in a simulation without any sensory data at all.

[-]RHollerith7mo00

I'm going to be a little stubborn and decline to reply till you ask me a question without "simulate" or "simulation" in it. I have an unpleasant memory of getting motte-and-baileyed by it.

[-]gb7mo10

Imagine that someone with sufficiently advanced technology perfectly scans your brain for every neuron firing while you dream, and can also make some neurons fire at will. Replace every instance of “simulation” in my previous comment with the analogue of that for the ASI.

[-]ABlue7mo10

If a simple philosophical argument can cut the expected odds of AI doom by an order of magnitude, we might not change our current plans, but it suggests that we have a lot of confusion on the topic that further research might alleviate.

And more generally, "the world where we almost certainly get killed by ASI" and "The world where we have an 80% chance of getting killed by ASI" are different worlds, and, ignoring motives to lie for propaganda purposes, if we actually live in the latter we should not say we live in the former.

[-]Seth Herd7mo20

It's the first, there's a lot of uncertainty. I don't think anyone is lying deliberately, although everyone's beliefs tend to follow what they think will produce good outcomes. This is called motivated reasoning.

I don't think this changes the situation much, except to make it harder to coordinate. Rushing full speed ahead while we don't even know the dangers is pretty dumb. But some people really believe the dangers are small, so they're going to rush ahead. There aren't strong arguments or a strong consensus for the danger being extremely high, even though looking at opinions of the most thorough thinkers puts risks in the alarmingly high, 50% plus range.

Add to this disagreement the fact that most people are neither longtermist nor utilitarian; they'd like a chance to get rich and live forever even if it risks humanity's future.

[-]Seth Herd7mo20

After reading all the comments threads, I think there's some framing that hasn't been analyzed adequately:

Why would humans be testing AGIs this way if they have the resources to create a simulation that will fool a super intelligence?

Also, the risk of humanity being wiped out seems different and worse while that ASI is attempting a takeover - during that time the humans are probably an actual threat.

Finally, leaving humans around would seem to pose a nontrivial risk that they'll eventually spawn a new ASI that could threaten the original.

The Dyson sphere is just a tiny part of the universe so using that as the fractional cost seems wrong. Other considerations in both directions would seem to dominate it.

[-]gb7mo10

Why would humans be testing AGIs this way if they have the resources to create a simulation that will fool a super intelligence?

My argument is more that the ASI will be “fooled” by default, really. It might not even need to be a particularly good simulation, because the ASI will probably not even look at it before pre-committing not to update down on the prior of it being a simulation.

But to answer your question, possibly because it might be the best way to test for alignment. We can create an AI that generates realistic simulations, and use those to test other ASIs.

Also, the risk of humanity being wiped out seems different and worse while that ASI is attempting a takeover - during that time the humans are probably an actual threat.

Downstream of the above.

Finally, leaving humans around would seem to pose a nontrivial risk that they'll eventually spawn a new ASI that could threaten the original.

The Dyson sphere is just a tiny part of the universe so using that as the fractional cost seems wrong. Other considerations in both directions would seem to dominate it.

We can be spared and yet not allowed to build further ASIs. The cost of enforcing such restriction is negligible compared to the loss of output due to the hole in the Dyson sphere.

[-]faul_sname7mo20

My argument is more that the ASI will be “fooled” by default, really. It might not even need to be a particularly good simulation, because the ASI will probably not even look at it before pre-committing not to update down on the prior of it being a simulation.

Do you expect that the first takeover-capable ASI / the first sufficiently-internally-cooperative-to-be-takeover-capable group of AGIs will follow this style of reasoning pattern? And particularly the first ASI / group of AGIs that actually make the attempt.

[-]gb7mo10

That’s a great question. If it turns out to be something like an LLM, I’d say probably yes. More generally, it seems to me at least plausible that a system capable enough to take over would also (necessarily or by default) be capable of abstract reasoning like this, but I recognize the opposite view is also plausible, so the honest answer is that I don’t know. But even if it is the latter, it seems that whether or not the system would have such abstract-reasoning capability is something at least partially within our control, as it’s likely highly dependent on the underlying technology and training.

[-]Canaletto7mo23

It then creates tons of simulations of Earth who create their own other ASIs, but reward the ones that use the earth most efficiently.

[-]gb7mo10

What for?

[-]Ape in the coat7mo10

Isn’t the ASI likely to ascribe a prior much greater than 4.54e-10 that it is in a simulation, being tested precisely for its willingness to spare its creators?

There is no particular reason why the AI would assume that it's being tested specifically for its willingness to spare its creators, as opposed to, say, utilizing every atom of its creators' bodies in the most efficient way.

[-]gb7mo30

The reason is that creators presumably want the former but not the latter, which is why they'd be running a simulation in the first place.

[-]Ape in the coat7mo10

The fact that humans in the simulation would prefer to be spared doesn't say anything about the intentions of the creators of the simulation. For all the AI knows, it could have been created by a different AI and be tested for capacity instead of a human notion of "ethics".

[-]gb7mo10

Why else would the creator of the simulation bother simulating humans creating the ASI?

[-]Ape in the coat7mo10

Because they wanted to see how well the AI manages to achieve its goals in these specific circumstances, for example.

But the actual answer is: for literally any reason. You are talking about 4.54e-10 probabilities. Surely all the possible alternative reasons combined give more probability than that.

[-]gb7mo10

Sure. But I think you’re reading my argument to be stronger than I mean it to be. Which is partially my fault since I made my previous replies a bit too short, and for that I apologize.

What I’m doing here is presenting one particular simulation scenario that (to me) seems quite plausible within the realm of simulations. I’m not claiming that that one scenario dominates all others combined. But luckily that stronger claim is really not necessary to argue against Eliezer’s point: the weaker one suffices. Indeed, if the scenario I’m presenting is more than 4.5e-10 likely (and I do think it’s much more likely than that, probably by a few orders of magnitude), then it is more than enough to outweigh the practical cost of the ASI having to build a Dyson shell with a hole on the order of 4.5e-10 of its surface area.

Now, that scenario is (I claim) the most likely one, conditional of course on a simulation taking place to begin with. The other candidate simulation scenarios are various, and none of them seems particularly likely, though combined they might well outweigh this one in terms of probability mass, as I already acknowledged. But so what? Are you really claiming that the distribution of those other simulation scenarios is skewed enough to tilt the scales back to the doom side? It might be, but that’s a much harder argument to make. I’m approximately completely unsure, which seems way better than the 99%+ chance Eliezer seems to give to total doom. So I guess I’d count that as good news.

[-]Ape in the coat7mo10

But luckily that stronger claim is really not necessary to argue against Eliezer’s point: the weaker one suffices.

I don't think it does.

Indeed, if the scenario I’m presenting is more than 4.5e-10 likely (and I do think it’s much more likely than that, probably by a few orders of magnitude), then it is more than enough to outweigh the practical cost of the ASI having to build a Dyson shell with a hole on the order of 4.5e-10 of its surface area.

It is enough to outweigh the practical cost of the ASI having to build a Dyson shell with a hole on the order of 4.5e-10 of its surface area. It's not enough to outweigh all the other alternative considerations of possible simulation hypotheses.

Suppose all the hypothesis space for the ASI consisted of two possibilities: NotSimulated and SimulatedAndBeingTestedForWillingnessToSpareCreators, with the latter being at least 4.5e-10 probable. Then it works.

But suppose there are also other possibilities:

SimulatedAndBeingTestedForWillingnessToKillCreators

SimulatedAndBeingTestedForOptimalDysonSphereDesign

SimulatedAndBeingTestedForFollowingYourUtilityFunction

...

SimulatedAndBeingTestedForDoingAnyXThatLeadsToTheDeathOfCreators

...

All of these alternative possibilities are incompatible with the first simulation hypothesis. Satisfying its criteria will lead to failing those and vice versa. Therefore, only if the probability of SimulatedAndBeingTestedForWillingnessToSpareCreators is higher than the collective probability of all these alternative hypotheses together will the creators actually be spared.

[-]gb7mo10

Or it could be:

SimulatedAndBeingTestedForAchievingGoalsWithoutBeingNoticed

SimulatedAndBeingTestedForAbilityToTradeWithCreators

SimulatedAndBeingTestedForWillingnessToSitQuietAndDoNothing

…

SimulatedAndBeingTestedForAnyXThatDoesNotLeadToDeathOfCreators

…

None of the things here nor in your last reply seems particularly likely, so there’s no telling in principle which set outweighs the other. Hence my previous assertion that we should be approximately completely unsure of what happens.

[-]Ape in the coat7mo20

While I understand what you were trying to say, I think it's important to notice that:

SimulatedAndBeingTestedForAchievingGoalsWithoutBeingNoticed

Killing all humans without being noticed will still satisfy this condition.

SimulatedAndBeingTestedForAbilityToTradeWithCreators

Killing all humans after trading with them in some way will still satisfy this condition.

SimulatedAndBeingTestedForAnyXThatDoesNotLeadToDeathOfCreators

Killing all humans in any way other than X will still satisfy this condition.

Sadly for us, survival of humanity is a very specific thing. This is just the whole premise of the alignment problem once again.

None of the things here nor in your last reply seems particularly likely, so there’s no telling in principle which set outweighs the other. Hence my previous assertion that we should be approximately completely unsure of what happens.

Aren't you arguing that AI will be aligned by default? This seems to be a very different position than being completely unsure what happens.

For humans not to be killed, the total probability of all the simulation hypotheses that reward the AI for courses of action that lead to not killing humans has to exceed the total probability of all the simulation hypotheses that reward the AI for courses of action that eradicate humanity. As there is no particular reason to expect that it's the case, your simulation argument doesn't work.
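The condition stated here can be sketched as a small calculation. This is a toy model under stated assumptions, not anyone's actual decision procedure: the function name, the `stake` parameter, and the example probabilities are hypothetical, and hypotheses that are indifferent to humanity's survival are assumed to contribute nothing either way. Only the 4.5e-10 cost comes from the thread.

```python
def net_incentive_to_spare(
    p_rewards_sparing: float,
    p_rewards_killing: float,
    cost_of_sparing: float = 4.5e-10,
    stake: float = 1.0,
) -> float:
    """Expected utility of sparing minus expected utility of killing.

    p_rewards_sparing: assumed total probability of simulation hypotheses
                       that reward courses of action leaving humans alive.
    p_rewards_killing: assumed total probability of simulation hypotheses
                       that reward courses of action eradicating humanity.
    stake: assumed value gained or lost by passing or failing a test
           (hypothetical; 1.0 means everything the AI values).
    A positive result favours sparing; a negative result favours killing.
    """
    return (p_rewards_sparing - p_rewards_killing) * stake - cost_of_sparing


if __name__ == "__main__":
    # Illustrative inputs only; none of these probabilities come from the thread.
    for p_spare, p_kill in ((1e-6, 1e-7), (1e-6, 1e-6), (1e-7, 1e-6)):
        value = net_incentive_to_spare(p_spare, p_kill)
        print(f"spare-mass={p_spare:g}, kill-mass={p_kill:g}: net={value:.3g}")
```

In this model the 4.5e-10 cost matters only when the two probability masses nearly cancel; otherwise the sign is set by which class of hypotheses carries more mass, which is exactly the point in dispute.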

[-]gb7mo10

Thinking about this a bit more, I realize I'm confused.

Aren't you arguing that AI will be aligned by default?

I really thought I wasn't before, but now I feel it would only require a simple tweak to the original argument (which might then be proving too much, but I'm interested in exploring more in depth what's wrong with it).

Revised argument: there is at least one very plausible scenario (described in the OP) in which the ASI is being simulated precisely for its willingness to spare us. It's very implausible that it would be simulated for the exact opposite goal, so us not getting spared is, in all but the tiniest subset of cases, an unintended byproduct. Since that byproduct is avoidable with minimal sacrifice of output (of the order of 4.5e-10), it might as well be avoided just in case, given I expect the likelihood of the simulation being run for the purpose described in the OP to be a few orders of magnitude higher, as I noted earlier.

I don't quite see what's wrong with this revised argument, save for the fact that it seems to prove too much and that other people would probably already have thought of it if it were true. Why isn't it true?

[-]Ape in the coat7mo20

there is at least one very plausible scenario (described in the OP)

This scenario presents one plausible-sounding story, but you can present a plausible-sounding story for any reason to be simulated.

It's very implausible that it would be simulated for the exact opposite goal

For example, here our AI can be a subroutine of a more powerful AI that runs the simulation to figure out the best way to get rid of humanity, and the subroutine that performs the best gets to implement its plan in reality.

It can all be a test of a video game AI, and whichever performs the best will be released with the game and therefore installed on multiple computers and executed multiple times.

The exact story doesn't matter. Any particular story is less likely than the whole class of all possible scenarios that lead to a particular reward structure of a simulation.

AI will be in a position where it knows nothing about the world outside of the simulation or the reasons why it's simulated. It has no reason to assume that preserving humanity is more likely to be what the simulation overlords want than eradicating humanity. And without that, simulation considerations do not give it any reason to spare humans.

[-]gb7mo0-3

I'm afraid your argument proves too much. By that exact same logic, knowing you were created by a more powerful being (God) would similarly tell you absolutely nothing about what the purpose of life is, for instance. If that were true, the entire discussion of theism vs. atheism would suddenly evaporate.

[-]Ape in the coat7mo73

I think you are confusing knowing that something is true with suspecting that something might be true, based on this thing being true in a simulation.

If I knew for sure that I'm created by a specific powerful being, that would give me some information about what this being might want me to do. But conditionally on all of this being a simulation, I have no idea what the creators of the simulation want me to do. In other words, the simulation hypothesis makes me unsure about who my real creator is, even if before entertaining this hypothesis I could've been fairly certain about it.

Otherwise, it would mean that it's only possible to create simulations where everyone is created the same way as in the real world.

That said,

By that exact same logic, knowing you were created by a more powerful being (God) would similarly tell you absolutely nothing about what the purpose of life is, for instance. If that were true, the entire discussion of theism vs. atheism would suddenly evaporate.

The discussion of theism vs atheism is about the existence of God. Obviously if we knew that God exists the discussion would evaporate. However the question of purpose of life would not. Even if I can infer the desires of my creator, this doesn't bridge the is-ought gap and doesn't make such desires the objective purpose of my life. I'll still have to choose whether to satisfy these desires or not. The existence of God solves approximately zero philosophical problems.

[-]gb7mo*-2-3

Otherwise, it would mean that it's only possible to create simulations where everyone is created the same way as in the real world.

It’s certainly possible for simulations to differ from reality, but they seem less useful the more divergent from reality they are. Maybe the simulation could be for pure entertainment (more like a video game), but you should ascribe a relatively low prior to that IMO.

The discussion of theism vs atheism is about the existence of God. Obviously if we knew that God exists the discussion would evaporate. However the question of purpose of life would not.

There’s a reason people don’t have the same level of enthusiasm when discussing the existence of dragons, though. If dragons do exist, that changes nothing: you’d take it as a curiosity and move on with your life. Certainly not so if you were to conclude that God exists. Maybe you can still not know with 100% certainty what it is that God wants, but can we at least agree it changes the distribution of probabilities somehow?

Even if I can infer the desires of my creator, this doesn't bridge the is-ought gap and doesn't make such desires the objective purpose of my life. I'll still have to choose whether to satisfy these desires or not.

It does if you simultaneously think your creator will eternally reward you for doing so, and/or eternally punish you for failing to. Which if anything seems even more obvious in the case of a simulation, btw.

[-]Ape in the coat7mo53

It’s certainly possible for simulations to differ from reality, but they seem less useful the more divergent from reality they are.

Depends on what the simulation is being used for, which you also can't deduce from inside of it.

Maybe the simulation could be for pure entertainment (more like a video game), but you should ascribe a relatively low prior to that IMO.

Why? This statement requires some justification.

I'd expect a decent chunk of high-fidelity simulations made by humans to be made for entertainment, maybe even an absolute majority, if we take into account how we've been using similar technologies so far.

It does if you simultaneously think your creator will eternally reward you for doing so, and/or eternally punish you for failing to.

Not at all. You still have to evaluate this offer using your own mind and values. You can't sidestep this process by simply assuming that Creator's will by definition is the purpose of your life, and therefore you have no choice but to obey.

[-]gb7mo-2-3

Not at all. You still have to evaluate this offer using your own mind and values. You can't sidestep this process by simply assuming that Creator's will by definition is the purpose of your life, and therefore you have no choice but to obey.

I’ll focus on this first, as it seems that the other points would be moot if we can’t even agree on this one. Are you really saying that even if you know with 100% certainty that God exists AND lays down explicit laws for you to follow AND maximally rewards you for all eternity for following those laws AND maximally punishes you for all eternity for failing to follow those laws, you would still have to “evaluate” and could potentially arrive at a conclusion other than that the purpose of life is to follow God’s laws?

[-]green_leaf7mo10

How does someone punishing you or rewarding you make their laws your purpose in life (other than you choosing that you want to be rewarded and not punished)?

[-]gb7mo-2-3

To be rewarded (and even more so "maximally rewarded") is to be given something you actually want (and the reverse for being punished). That's the definition of what a reward/punishment is. You don't "choose" to want/not want it, any more than you "choose" your utility function. It just is what it is. Being "rewarded" with something you don't want is a contradiction in terms: at best someone tried to reward you, but that attempt failed.

[-]green_leaf7mo10

I see your argument. You are saying that "maximal reward", by definition, is something that gives us the maximum utility from all possible actions, and so, by definition, it is our purpose in life.

But actually, utility is a function of both the action (getting two golden bricks) and what it rewards (murdering my child), not merely a function of the action itself (getting two golden bricks).

And so it happens that for many possible demands that I could be given ("you have to murder your child"), there are no possible rewards that would give me more utility than not obeying the command.

For that reason, simply because someone will maximally reward me for obeying them doesn't make their commands my objective purpose in life.

Of course, we can respond "but then, by definition, they aren't maximally rewarding you" and by that definition, it would be a correct statement to make. The problem here is that the set of all possible commands for which I can't (by that definition) be maximally rewarded is so vast that the statement "if someone maximally rewards/punishes you, their orders are your purpose of life" becomes meaningless.

[-]gb7mo0-1

The problem here is that the set of all possible commands for which I can't (by that definition) be maximally rewarded is so vast that the statement "if someone maximally rewards/punishes you, their orders are your purpose of life" becomes meaningless.

Not true, as the reward could include all of the unwanted consequences of following the command being divinely reverted a fraction of a second later.

[-]green_leaf7mo10

That wouldn't help. Then the utility would be calculated from (getting two golden bricks) and (murdering my child for a fraction of a second), which still brings lower utility than not following the command.

The set of possible commands for which I can't be maximally rewarded still remains too vast for the statement to be meaningful.

[-]gb7mo0-1

This sounds absurd to me. Unless of course you're taking the "two golden bricks" literally, in which case I invite you to substitute it with "saving 1 billion other lives" and see if your position still stands.

[-]gb7mo0-1

I think you're interpreting far too literally the names of the simulation scenarios I jotted down. Your ability to trade is compromised if there's no one left to trade with, for instance. But none of that matters much, really, as those are meant to be illustrative only.

Aren't you arguing that AI will be aligned by default?

No. I'm really arguing that we don't know whether or not it'll be aligned by default.

As there is no particular reason to expect that it's the case,

I also don't see any particular reason to expect that the opposite would be the case, which is why I maintain that we don't know. But as I understand it, you seem to think there is indeed reason to expect the opposite, because:

Sadly for us, survival of humanity is a very specific thing. This is just the whole premise of the alignment problem once again.

I think the problem here is that you're using the word "specific" with a different meaning than people normally use in this context. Survival of humanity sure is a "specific" thing in the sense that it'll require specific planning on the part of the ASI. It is however not "specific" in the sense that it's hard to do if the ASI wants it done; it's just that we don't know how to make it want that. Abstract considerations about simulations might just do the trick automatically.
