Am I confused about the "malign universal prior" argument?

nostalgebraist

In a 2016 blog post, Paul Christiano argued that the universal prior (hereafter "UP") may be "malign." His argument has received a lot of follow-up discussion, e.g. in

Mark Xu's The Solomonoff Prior is Malign
Charlie Steiner's The Solomonoff prior is malign. It's not a big deal.

among other posts.

This argument never made sense to me. The reason it doesn't make sense to me is pretty simple, but I haven't seen it mentioned explicitly in any of the ensuing discussion.

This leaves me feeling like either I am misunderstanding the argument in a pretty fundamental way, or there is a problem with the argument that has gotten little attention from the argument's critics (in which case I don't understand why).

I would like to know which of these is the case, and correct my misunderstanding if it exists, hence this post.

(Note: In 2018 I wrote a comment on the original post where I tried to state one of my objections to the argument, though I don't feel I expressed myself especially well there.)

UP-using "universes" and simulatable "universes"

The argument for malignity involves reasoning beings, instantiated in Turing machines (TMs), which try to influence the content of the UP in order to affect other beings who are making decisions using the UP.

Famously, the UP is uncomputable.

This means the TMs (and reasoning beings inside the TMs) will not be able to use^[1] the UP themselves, or simulate anyone else using the UP. At least not if we take "using the UP" in a strict and literal sense.

Thus, I am unsure how to interpret claims (which are common in presentations of the argument) about TMs "searching for universes where the UP is used" or the like.

For example, from Mark Xu's "The Solomonoff Prior is Malign":

In particular, this suggests a good strategy for consequentialists: find a universe that is using a version of the Solomonoff prior that has a very short description of the particular universe the consequentialists find themselves in.

Or, from Christiano's original post:

So the first step is getting our foot in the door—having control over the parts of the universal prior that are being used to make important decisions.
This means looking across the universes we care about, and searching for spots within those universe where someone is using the universal prior to make important decisions. In particular, we want to find places where someone is using a version of the universal prior that puts a lot of mass on the particular universe that we are living in, because those are the places where we have the most leverage.
Then the strategy is to implement a distribution over all of those spots, weighted by something like their importance to us (times the fraction of mass they give to the particular universe we are in and the particular channel we are using). That is, we pick one of those spots at random and then read off our subjective distribution over the sequence of bits that will be observed at that spot (which is likely to involve running actual simulations).

What exactly are these "universes" that are being searched over? We have two options:

1. They are not computable universes. They permit hypercomputation that can leverage the "actual" UP, in its full uncomputable glory, without approximation.
2. They are computable universes. Thus the UP cannot be used in them. But maybe there is some computable thing that resembles or approximates the UP, and gets used in these universes.

Option 1 seems hard to square with the talk about TMs "searching for" universes or "simulating" universes. A TM can't do such things to the universes of option 1.

Hence, the argument is presumably about option 2.

That is, although we are trying to reason about the content of the UP itself, the TMs are not "searching over" or "simulating" or "reasoning about" the UP or things containing the UP. They are only doing these things to some other object, which has some (as-yet unspecified) connection to the UP, such as "approximating" the UP in some sense.

But now we face some challenges, which are never addressed in presentations of the argument:

The argument is about the content of the "actual" UP, not the content of some computable approximation.

If the reasoning beings are considering -- and trying to influence -- some computable thing that isn't the UP, we need to determine whether this thing has the right kind of relationship to the UP (whatever that means) for the influences upon it to "bubble up to" the UP itself.

The behavior of the TMs obviously affects the UP. But it's not so obvious that the behavior of the TMs can affect the other, UP-related thing that the TMs are able to simulate.

In other words, the TMs can affect the UP, but it doesn't seem like they have the resources to figure out what sorts of effects they prefer and disprefer. And on the other hand, there may be something for which they can do this preference reasoning, but we haven't established that they can affect that other thing.

Some thoughts that one might have

What sort of thing is this not-UP -- the thing that the TMs can simulate and search over?

I don't know; I have never seen any discussion of the topic, and haven't thought about it for very long. That said, here are a few seemingly obvious points about it.

On slowdown

Suppose that we have a TM, with a whole world inside it, and some reasoning beings inside that world.

These beings are aware of some computable, but vaguely "UP-like," reasoning procedure that they think is really great.

In order to be "UP-like" in a relevant way, this procedure will have to involve running TMs, and the set of TMs that might be run needs to include the same TM that implements our beings and their world.

(This procedure needs to differ from the UP by using a computable weighting function for the TMs. It should also be able to return results without having to wait for eternity as the non-halting TMs do their not-halting. The next section will say more about the latter condition.)
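To make these two requirements concrete, here is a minimal Python sketch of a computable, vaguely UP-like procedure. Everything in it is a toy assumption and not taken from any presentation of the malignity argument: the three "machines" are hypothetical Python generators (yielding `None` for a step of work that produces no output), their description lengths are made up, and the `step_budget` cutoff is just one simple way to avoid waiting forever on machines that never produce output.

```python
# Toy "machines": each yields an output bit, or None for a step with no output.
def all_zeros():
    while True:
        yield 0

def alternating():
    bit = 0
    while True:
        yield bit
        bit ^= 1

def slow_ones(delay=1000):
    # Does `delay` steps of silent work before each output bit.
    while True:
        for _ in range(delay):
            yield None
        yield 1

# Hypothetical (description length in bits, machine) pairs.
MACHINES = [(3, all_zeros), (4, slow_ones), (5, alternating)]

def run(machine, n_bits, step_budget):
    """Return the first n_bits output bits, or None if the budget runs out."""
    out = []
    for step, symbol in enumerate(machine()):
        if step >= step_budget:
            return None
        if symbol is not None:
            out.append(symbol)
            if len(out) == n_bits:
                return out
    return None

def predict_next(observed, machines, step_budget):
    """P(next bit = 1) under a 2^-length mixture, truncated by step_budget."""
    mass = {0: 0.0, 1: 0.0}
    for length_bits, machine in machines:
        out = run(machine, len(observed) + 1, step_budget)
        # Keep only machines that finished in time and match the observations.
        if out is not None and out[:-1] == list(observed):
            mass[out[-1]] += 2.0 ** -length_bits
    total = mass[0] + mass[1]
    return None if total == 0 else mass[1] / total
```

With a small budget, `slow_ones` never finishes and contributes nothing, however short its description. In this toy, the finite cutoff alone already acts as a penalty on long-running machines.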

Now they want to search through computable universes (by simulation) to look for ones where the UP-esque procedure is being used.

What does it look like when they find one? At this point, we have

A TM, which I'll call the "outer" TM, containing...
- ...a universe that includes our reasoning beings, and a bunch of irrelevant galaxies and stuff, along with...
  - ...one special part that is simulating a second universe, which (the second universe) includes a bunch of irrelevant galaxies and stuff, along with...
    - ...one special part that implements the UP-like procedure, and thus runs a bunch of TMs that aren't the same as the outer TM, along with...
      - ...one special part that is simply the outer TM again (and from here on the whole thing repeats indefinitely, with more slowdown every time we go around the loop)

Each level of nesting incurs some slowdown relative to just running the "relevant" part of the thing that is being nested, because some irrelevant stuff has to come along for the ride.

It takes many many clock-ticks of the outer TM to advance the copy of it several levels down, because we have to spend a lot of time on irrelevant galaxies and on other TMs involved in the procedure.

(There is also an extra "constant factor" from the fact that we have to wait for the outer TM to evolve life, etc., before we get to the point where it starts containing a copy at all.)

So I don't see how the guys in the outer TM would be able to advance their simulation up to the point where something they can control is being "read off," without finding that in fact this read-off event occurred in their own distant past, and hence is no longer under their control.
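The bookkeeping behind this can be sketched in a few lines of Python. This is a toy model under loud assumptions: each host is assumed to spend a fixed share of its steps on the thing nested inside it, and the specific shares and the startup delay below are invented numbers, chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction
from math import prod

# Hypothetical share of each host's steps that goes to the nested thing.
LEVEL_SHARES = [
    Fraction(1, 1000),  # outer universe -> its simulation of a second universe
    Fraction(1, 1000),  # second universe -> its UP-like procedure
    Fraction(1, 1000),  # UP-like procedure -> the one TM that is the outer TM
]
# Hypothetical outer ticks spent evolving life etc. before simulating anything.
STARTUP_DELAY = 10**9

def copy_rate(shares=LEVEL_SHARES):
    """Ticks of the nested copy advanced per tick of the outer TM."""
    return prod(shares, start=Fraction(1))

def copy_time(outer_tick, shares=LEVEL_SHARES, delay=STARTUP_DELAY):
    """How far the nested copy of the outer TM has run by a given outer tick."""
    return max(0, outer_tick - delay) * copy_rate(shares)

def outer_tick_when_copy_reaches(copy_tick, shares=LEVEL_SHARES, delay=STARTUP_DELAY):
    """Outer tick at which the simulators first see the copy reach copy_tick."""
    return delay + Fraction(copy_tick) / copy_rate(shares)
```

Because every share is below 1, `copy_time(t)` is smaller than `t` for every positive `t`, and `outer_tick_when_copy_reaches(T)` is larger than `T`. In this model, by the time the simulators watch the copy arrive at a read-off event at tick `T`, their own tick `T` is already behind them.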

To riff on this: the malignity argument involves the fact that the UP puts high weight on simple TMs, but doesn't care about speed, so it may put high weight on TMs that do very long-running things like simulating universes that simulate other universes.

Fine -- but once we start talking about a universe that is simulating itself (in order to reason about UP-like objects that involve it), speed starts to matter for a different reason. If you are simulating yourself, it is always with some slowdown, since you contain parts other than the simulator. You'll never be able to "catch up with yourself" and, e.g., read your own next action off of the simulation rather than choosing it in the ordinary manner.

It's possible that there are ways around this objection, even if it's valid in principle. For instance, maybe the reasoning beings can make inferences about the future behavior of the procedure-users, jumping ahead of the slow simulation.

It's easy to imagine how this might work for "finding the output channel," since you can just guess that a channel used once will be re-used. But it would be much harder to decide what one's preferred output actually is at "future" points not yet reached in the simulation; here one would effectively need to do futurism about the world in which the procedure is being used, probably on an extremely long time horizon.

On efficiency

There are results showing that the UP (or Solomonoff Induction) is in some sense optimal. So it is easy to wind up thinking that, if some procedure is a good idea, it must be (in some sense) an "approximation of" these things.

But the kind of "approximation" involved does not look (in hand-wavey terms) like the ideal thing (UP or SI), plus some unbiased "approximation noise."

The ways that one would deviate from the ideal, when making a practically useful procedure, have certain properties that the ideal itself lacks. In the hand-wavey statistical analogy, the "noise" is not zero-mean.

I noted above that the "UP-like procedure" will need to use a computable weighting function. So, this function can't be Kolmogorov complexity.

And indeed, if one is designing a procedure for practical use, one probably wouldn't want anything like Kolmogorov complexity. All else being equal, one doesn't want to sit around for ages waiting for a TM to simulate a whole universe, even if that TM is "simple." One probably wants to prioritize TMs that can yield answers more quickly.

As noted above, in practice one never has an infinite amount of time to sit around waiting for TMs to (not) halt, so any method that returns results in finite time will have to involve some kind of effective penalty on long-running TMs.

But one may wish to be even more aggressive about speed than simply saying "I'm only willing to wait this long, ignore any TM that doesn't halt before then." One might want one's prior to actively prefer fast TMs over slow ones, even within the range of TMs fast enough that you're willing to wait for them. That way, if at any point you need to truncate the distribution and only look at the really high-mass TMs, the TMs you are spared from running due to the truncation are preferentially selected to be ones you don't want to run (because they're slow).

These points are not original, of course. Everyone talks about the speed prior.
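As a rough illustration of the difference, the sketch below compares a pure simplicity weighting with one that also penalizes runtime. The runtime penalty used here adds log2 of the step count to the description length (the form used in Levin's Kt); it is a stand-in for "something like the speed prior," not a faithful implementation of any particular published prior. The two machines and their numbers are hypothetical.

```python
from math import log2

# Hypothetical machines: name -> (description length in bits, steps to answer).
MACHINES = {
    "simple_but_slow": (100, 2**200),  # e.g. has to simulate a whole universe
    "longer_but_fast": (150, 2**10),
}

def log2_simplicity_weight(length_bits, steps):
    """UP-style weighting: 2^-length, with runtime ignored entirely."""
    return -length_bits

def log2_speed_weight(length_bits, steps):
    """Runtime-penalized weighting: 2^-(length + log2(steps))."""
    return -(length_bits + log2(steps))

def favorite(machines, log2_weight):
    """Name of the machine that gets the most weight under log2_weight."""
    return max(machines, key=lambda name: log2_weight(*machines[name]))
```

Under the simplicity weighting the slow machine wins (log-weight -100 versus -150), while under the runtime-penalized weighting the ranking flips (-300 versus -160). The weights are kept in log2 form because the raw values would underflow ordinary floats.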

But now, return to our reasoning beings in a TM, simulating a universe, which in turn uses a procedure that's great for practical purposes.

The fact that the procedure is "great for practical purposes" is crucial to the beings' motivation, here; they expect the procedure to actually get used in practice, in the world they're simulating. They expect this because they think it actually is a great idea -- for practical purposes -- and they expect the inner creatures of the simulation to notice this too.

Since the procedure is great for practical purposes, we should expect that it prioritizes efficiently computable TMs, like the speed prior does.

But this means that TMs like the "outer TM" in which our beings live -- which are simple (hence UP cares about them) but slow, having to simulate whole universes with irrelevant galaxies and all before they can get to the point -- are not what the "great for practical purposes" procedure cares about.

Once again: the malignity argument involves the fact that the UP puts high weight on simple TMs, but doesn't care about speed. This is true of the UP. But it is a count against using the UP, or anything like it, for practical purposes.

And so we should not expect the UP, or anything like it, to get used in practice by the kinds of entities we can simulate and reason about.

We (i.e. "reasoning beings in computable universes") can influence the UP, but we can't reason about it well enough to use that influence. Meanwhile, we can reason about things that are more like the speed prior -- but we can't influence them.

The common thread

It feels like there is a more general idea linking the two considerations above.

It's closely related to the idea I presented in "When does rationality-as-search have nontrivial implications?"

Suppose that there is some search process that is looking through a collection of things, and you are an element of the collection. Then, in general, it's difficult to imagine how you (just you) can reason about the whole search in such a way as to "steer it around" in your preferred direction.

If you are powerful enough to reason about the search (and do this well enough for steering), then in some sense the search is unnecessary -- one could delete all the other elements of the search space, and just consult you about what the search might have done.

As stated this seems not quite right, since you might have some approximate knowledge of the search that suffices for your control purposes, yet is "less powerful" than the search as a whole.

For anything like the malignity argument to work, we need this kind of "gap" to exist -- a gap between the power needed to actually use the UP (or the speed prior, or whatever), and the power needed to merely "understand them well enough for control purposes."

Maybe such a gap is possible! It would be very interesting if so.

But this question -- which seems like the question on which the whole thing turns -- is not addressed in any of the treatments I've seen of the malignity argument. Instead, these treatments speak casually of TMs "simulating universes" in which someone is "using" the UP, without addressing where in the picture we are to put the "slack" -- the use of merely-approximate reasoning -- that is necessary for the picture to describe something possible at all.

What am I missing?

^[1] For simplicity, I mostly avoid mentioning Solomonoff Induction in this post, and refer more broadly to "uses" of the UP, whatever these may be.

(This reminded me of a couple of arguments I've had in the past, in person. I think you are missing something. But I've previously failed to communicate this thing. I hope I'm not misinterpreting your point, and sorry if this comment comes across as frustrated at some points.)

For anything like the malignity argument to work, we need this kind of "gap" to exist -- a gap between the power needed to actually use the UP (or the speed prior, or whatever), and the power needed to merely "understand them well enough for control purposes."
Maybe such a gap is possible! It would be very interesting if so.

Such a gap is so common that I'm worried I've missed your point. There is ~always a range of algorithms that solve the same problem and have very different levels of efficiency. This is clearly true of algorithms that predict the next bit. You are correct that a malign hypothesis needs to be running an algorithm that is more efficient than the outer algorithm.

Suppose that there is some search process that is looking through a collection of things, and you are an element of the collection. Then, in general, it's difficult to imagine how you (just you) can reason about the whole search in such a way as to "steer it around" in your preferred direction.

I think this is easy to imagine. I'm an expert who is among 10 experts recruited to advise some government on making a decision. I can guess some of the signals that the government will use to choose who among us to trust most. I can guess some of the relative weaknesses of fellow experts. I can try to use this to manipulate the government into taking my opinion more seriously. I don't need to create a clone government and hire 10 expert clones in order to do this.

If you are powerful enough to reason about the search (and do this well enough for steering), then in some sense the search is unnecessary -- one could delete all the other elements of the search space, and just consult you about what the search might have done.

It's true that if an induction algorithm is maximally compute efficient, then it shouldn't have daemon problems, because there is no way for a daemon to do better prediction than alternative hypotheses. But... actual algorithms that we build aren't usually compute optimal, so there's a risk they will find more compute-efficient algorithms internally. (I'm not sure I understood what you're saying here, so tell me if this is a non sequitur.)

The argument is about the content of the "actual" UP, not the content of some computable approximation.

If the reasoning beings are considering -- and trying to influence -- some computable thing that isn't the UP, we need to determine whether this thing has the right kind of relationship to the UP (whatever that means) for the influences upon it to "bubble up to" the UP itself.

You seem to be saying the behavior of a computable approximation is unrelated to the behavior of the idealization? Like, of course there are differences between reality and idealizations. Paul mentions this a few times in the original post, that the connection to real world algorithms and consequences isn't clear. But I think you're missing the whole point of doing theory work about idealized models. Theorizing about idealizations is easier (and possible). A common pattern for theorists is to work out the consequences of an idealized theory; then, to apply this theory to reality, they try to approximately adjust for the most important differences between the theory and reality.

A good example is worst case runtime analysis. It's very useful for predicting real world runtime. In some situations, it's far too pessimistic (or optimistic!). But in those situations, there's always a reason: there's some factor that the worst case analysis isn't taking into account. And with some familiarity with the field, you get to know these factors and how and when you can correct for them when transferring your knowledge to the real world.

Back to induction, specifically this line:

some computable thing that isn't the UP, we need to determine whether this thing has the right kind of relationship to the UP

The reasoning beings inside the hypothesis are trying to make their computable approximation as similar to the UP as possible. Sure, they might make mistakes, or be limited in some way by what algorithms are possible. But, in our lack of knowledge about the details, we don't have to get stuck in confusion about how exactly they might create an approximation. We can just assume they did a good job (as an idealization/approximation). This is the standard approximate way of predicting the consequences of competent agents. This is part of making an idealized theory.

If we later discover that all algorithms that are designed to predict the next bit in the real world have a particular property (i.e. are biased toward fast computations), then we can redo the malign prior theory in light of that knowledge. Maybe you think "biased toward fast computations" is obviously true of all induction algorithms? (it's definitely not true, consider the runtime of theories in physics).

The "UP is malign" idea is the idea of optimization daemons applied directly to Solomonoff inductors. (Note the line: "When heavy optimization pressure on a system crystallizes it into an optimizer—especially one that’s powerful, or more powerful than the previous system, or misaligned with the previous system—we could term the crystallized optimizer a “daemon” of the previous system").

If you wanted to see how optimization daemons could show up in more practical algorithms, you'll probably end up at RFLO (although there are other situations where optimization daemons can show up that don't quite fit the RFLO description).

I hope I'm not misinterpreting your point, and sorry if this comment comes across as frustrated at some points.

I'm not sure you're misinterpreting me per se, but there are some tacit premises in the background of my argument that you don't seem to hold. Rather than responding point-by-point, I'll just say some more stuff about where I'm coming from, and we'll see if it clarifies things.

You talk a lot about "idealized theories." These can of course be useful. But not all idealizations are created equal. You have to actually check that your idealization is good enough, in the right ways, for the sorts of things you're asking it to do.

In physics and applied mathematics, one often finds oneself considering a system that looks like

some "base" system that's well-understood and easy to analyze, plus
some additional nuance or dynamic that makes things much harder to analyze in general -- but which we can safely assume has much smaller effects than the rest of the system

We quantify the size of the additional nuance with a small parameter $ϵ$. If $ϵ$ is literally 0, that's just the base system, but we want to go a step further: we want to underst... (read more)

8Jeremy Gillen1y

Great explanation, you have found the crux. I didn't know such problems were called singular perturbation problems. If I thought that reasoning about the UP was definitely a singular perturbation problem in the relevant sense, then I would agree with you (that the malign prior argument doesn't really work). I think it's probably not, but I'm not extremely confident. Your argument that it is a singular perturbation problem is that it involves self-reference. I agree that self-reference is kinda special and can make it difficult to formally model things, but I will argue that it is often reasonable to just treat the inner approximations as exact. The reason is: Problems that involve self-reference are often easy to approximate by using more coarse-grained models as you move deeper. One example as an intuition pump is an MCTS chess bot. In order to find a good move, it needs to think about its opponent thinking about itself, etc. We can't compute this (because it's exponential, not because it's non-computable), but if we approximate the deeper layers by pretending they move randomly (!), it works quite well. Having a better move distribution works even better. Maybe you'll object that this example isn't precisely self-reference. But the same algorithm (usually) works for finding Nash equilibria in simultaneous-move games, which do involve infinitely deep self-reference. And another more general way of doing essentially the same thing is using a reflective oracle. Which I believe can also be used to describe a UP that can contain infinitely deep self-reference (see the last paragraph of the conclusion).[1] I think the fact that Paul worked on this suggests that he did see the potential issues with self-reference and wanted better ways to reason formally about such systems. To be clear, I don't think any of these examples tells us that the problem is definitely a regular perturbation problem. But I think these examples do suggest that assuming that it is regular

5LGS1y

Thanks for the link to reflective oracles! I strongly disagree with this: diagonalization arguments often cannot be avoided at all, no matter how you change the setup. This is what vexed logicians in the early 20th century: no matter how you change your formal system, you won't be able to avoid Gödel's incompleteness theorems. There is a trick that reliably gets you out of such paradoxes, however: switch to probabilistic mixtures. This is easily seen in a game setting: in rock-paper-scissors, there is no deterministic Nash equilibrium. Switch to mixed strategies, however, and suddenly there is always a Nash equilibrium. This is the trick that Paul is using: he is switching from deterministic Turing machines to randomized ones. That's fine as far as it goes, but it has some weird side effects. One of them is that if a civilization is trying to predict the universal prior that is simulating itself, and tries to send a message, then it is likely that with "reflective oracles" in place, the only message it can send is random noise. That is, Paul shows reflective oracles exist in the same way that Nash equilibria exist; but there is no control over what the reflective oracle actually is, and in paradoxical situations (like rock-paper-scissors) the Nash equilibrium is the boring "mix everything together uniformly". The underlying issue is that a universe that can predict the universal prior, which in turn simulates the universe itself, can encounter a grandfather paradox. It can see its own future by looking at the simulation, and then it can do the opposite. The grandfather paradox is where the universe decides to kill the grandfather of a child that the simulation predicts. Paul solves this by only letting it see its own future using a "reflective oracle" which essentially finds a fixed point (which is a probability distribution). The fixed point of a grandfather paradox is something like "half the time the simulation shows the grandchild alive, causing the real univ

2Noosphere891y

One caveat to this quote below is that Godel's first incompleteness theorem relies on the assumption of the formal system being recursively enumerable, and if we drop this requirement, then we can get a consistent and complete description of say, first order arithmetic. More here: https://en.wikipedia.org/wiki/Gödel's_incompleteness_theorems#Effective_axiomatization

2Jeremy Gillen1y

Fair enough, the probabilistic mixtures thing was what I was thinking of as a change of setup, but reasonable to not consider it such. I don't see how this is implied. If a fact is consistent across levels, and determined in a non-paradoxical way, can't this become a natural fixed point that can be "transmitted" across levels? And isn't this kind of knowledge all that is required for the malign prior argument to work?

3LGS1y

The problem is that the act of leaving the message depends on the output of the oracle (otherwise you wouldn't need the oracle at all, but you also would not know how to leave a message). If the behavior of the machine depends on the oracle's actions, then we have to be careful with what the fixed point will be. For example, if we try to fight the oracle and do the opposite, we get the "noise" situation from the grandfather paradox. But if we try to cooperate with the oracle and do what it predicts, then there are many different fixed points and no telling which the oracle would choose (this is not specified in the setting). It would be great to see a formal model of the situation. I think any model in which such message transmission would work is likely to require some heroic assumptions which don't correspond much to real life.

2Jeremy Gillen1y

If the only transmissible message is essentially uniformly random bits, then of what value is the oracle? I claim the message can contain lots of information. E.g. if there are 2^100 potential actions, but only 2 fixed points, then 99 bits have been transmitted (relative to uniform). The rock-paper-scissors example is relatively special, in that the oracle can't narrow down the space of actions at all. The UP situation looks to me to be more like the first situation than the second.
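Spelling out the arithmetic behind the 99-bit figure, under the assumptions (implicit in "relative to uniform") that the baseline is a uniform distribution over the $2^{100}$ actions and that the oracle narrows these down to 2 equally likely fixed points:

$$\log_2\!\left(2^{100}\right) - \log_2 2 = 100 - 1 = 99 \text{ bits}$$

In the rock-paper-scissors case the same calculation gives $\log_2 3 - \log_2 3 = 0$ bits, since the uniform mixed equilibrium rules out none of the three actions.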

3LGS1y

It would help to have a more formal model, but as far as I can tell the oracle can only narrow down its predictions of the future to the extent that those predictions are independent of the oracle's output. That is to say, if the people in the universe ignore what the oracle says, then the oracle can give an informative prediction. This would seem to exactly rule out any type of signal which depends on the oracle's output, which is precisely the types of signals that nostalgebraist was concerned about.

4Jeremy Gillen1y

That can't be right in general. Normal Nash equilibria can narrow down predictions of actions. E.g. a competition game. This is despite each player's decision being dependent on the other player's action.

3LGS1y

That's fair, yeah. We need a proper mathematical model to study this further. I expect it to be difficult to set up because the situation is so unrealistic/impossible as to be hard to model. But if you do have a model in mind I'll take a look.

Suppose that there is some search process that is looking through a collection of things, and you are an element of the collection. Then, in general, it's difficult to imagine how you (just you) can reason about the whole search in such a way as to "steer it around" in your preferred direction.

I think this is easy to imagine. I'm an expert who is among 10 experts recruited to advise some government on making a decision. I can guess some of the signals that the government will use to choose who among us to trust most. I can guess some of the relative w

... (read more)

4Jeremy Gillen1y

Okay if you accept this modified scenario where one expert knows they are much better than the other 9, then this is sufficient as a scenario that nostalgebraist claimed was difficult to imagine. So that's enough to prove the point I was trying to make. But the original example works too. It's just a simultaneous move game. It'll be won by whichever player is best at playing the game. It's clearly possible to play the game well, despite the self-reference involved with thinking about how to play better.

4Thane Ruthenis1y

Consider a different problem: a group of people are posed some technical or mathematical challenge. Each individual person is given a different subset of the information about the problem, and each person knows what type of information every other participant gets. Trivial example: you're supposed to find the volume of a pyramid, you (participant 1) are given its height and the apex angles for two triangular faces, participant 2 is given the radius of the sphere on which all of the pyramid's vertices lie and all angles of the triangular faces, participant 3 is given the areas of all faces, et cetera. Given this setup, if you're skilled at geometry, you can likely figure out which of the participants can solve the problem exactly, which can only put upper and lower bounds on the volume, and what those upper/lower bounds are for each participant. You don't need to model your competitors' mental states: all you need to do is reason about the object-level domain, plus take into account what information they have. No infinite recursion happens, because you can abstract out the particulars of how others' minds work. This works assuming that everyone involved is perfectly skilled at geometry: that you don't need to predict what mistakes the others would make (which would depend on the messy details of their minds). Speculatively, this would apply to deception as well. You don't necessarily need to model others' brain states directly. If they're all perfectly skilled at deception, you can predict what deceptions they'd try to use and how effective they'd be based on purely objective information: the sociopolitical landscape, their individual skills and comparative advantages, et cetera. You can "skip to the end": predict everyone playing their best-move-in-circumstances-where-everyone-else-plays-their-best-move-too. Objectively, the distribution of comparative advantages is likely very different, so even if everyone makes their best move, some would hopelessly lose. (E

Fwiw I think this is basically correct, though I would phrase the critique as "the hypothetical is confused" rather than "the argument is wrong." My sense is that arguments for the malignity of uncomputable priors just really depend on the structure of the hypothetical: how is it that you actually have access to this uncomputable prior, if it's an approximation what sort of approximation is it, and to what extent will others care about influencing your decisions in situations where you're using it?

Here's my understanding of the whole thing:

"Malign universal prior" arguments basically assume a setup in which we have an agent with a big dumb hard-coded module whose goal is to find this agent's location in Tegmark IV. (Or maybe perform some other important task that requires reasoning about Tegmark IV, but let's run with that as the example.)
The agent might be generally intelligent, the Solomonoff-induction-approximating module might be sophisticated in all kinds of ways, but it's "dumb" or "naive" in an important sense: it's just trying to generate the best-guess distribution over the universes the agent is in, no matter their contents, which the agent then blindly acts on.
Importantly, this process doesn't necessarily involve actually running any low-level simulations of other universes. Generally intelligent/abstract reasoning, some steps of which might literally replicate the reasoning steps of Paul's post, would also fit the bill.
The MUP argument is that this is sufficient for alien consequentialists to take advantage. The agent is asking, "where am I most likely to be?", and the alien consequentialists are skewing the distribution such that the most likely correct answer is "simulation-captured by acausal aliens" or whatever.
- (And then the malign output is producing "predictions" about the future of the agent's universe like "the false vacuum collapse is going to spontaneously trigger in the next five minutes unless you perform this specific sequence of actions that happen to rewrite your utility function in such-and-such ways", and our big dumb agent is gormlessly buying this, and its "real" non-simulation-captured instance rewrites itself accordingly.)
Speed prior vs. complexity prior: a common guess regarding the structure of Tegmark IV is that it works like the complexity prior: it penalizes K-complexity but doesn't care how much memory/compute it needs to allocate to run a universe. If that is true, then any sufficiently good approximation of Solomonoff induction – any sufficiently good procedure for getting an answer to "where am I most likely to be?", including abstract reasoning – would take this principle into account, and bump up the probability of being in low-complexity universes.

This all seems to check out to me. Admittedly I didn't actually confirm this with any proponents of the argument, though.

(Probably also worth stating that I don't think the MUP is in any way relevant to real life. AI progress doesn't seem to be on the track where it features AGIs that use big dumb "where am I?" modules. E.g., if an AGI is born of anything like an RL-trained LLM, it seems unlikely that its "where am I?" reasoning would be naive in the relevant sense. It'd be able to "manually" filter out universes with malign consequentialists, given good decision theory. You know, like we can.

The MUP specifically applies to highly abstract agent-foundations designs where we hand-code each piece, that currently don't seem practically tractable at all.)

Thanks.

I admit I'm not closely familiar with Tegmark's views, but I know he has considered two distinct things that might be called "the Level IV multiverse":

a "mathematical universe" in which all mathematical constructs exist
a more restrictive "computable universe" in which only computable things exist

(I'm getting this from his paper here.)

In particular, Tegmark speculates that the computable universe is "distributed" following the UP (as you say in your final bullet point). This would mean e.g. that one shouldn't be too surprised to find oneself li... (read more)

Thane Ruthenis (1y, score 7)

Yep.

Correction: on my model, the dupe is also using an approximation of the UP, not the UP itself. I.e., it doesn't need to be uncomputable. The difference between it and the con men is just the naivety of the design. It generates guesses regarding what universes it's most likely to be in (potentially using abstract reasoning), but then doesn't "filter" these universes; doesn't actually "look inside" and determine if it's a good idea to use a specific universe as a model. It doesn't consider the possibility of being manipulated through it; doesn't consider the possibility that it contains daemons. I.e.: the real difference is that the "dupe" is using causal decision theory, not functional decision theory.

I think that's plausible: that there aren't actually that many "UP-using dupes" in existence, so the con men don't actually care to stage these acausal attacks. But: if that is the case, it's because the entities designing/becoming powerful agents considered the possibility of con men manipulating the UP, and so made sure that they're not just naively using the unfiltered (approximation of the) UP. That is: yes, it seems likely that the equilibrium state of affairs here is "nobody is actually messing with the UP". But it's because everyone knows the UP could be messed with in this manner, so no-one is using it (nor its computationally tractable approximations).

It might also not be the case, however. Maybe there are large swathes of reality populated by powerful yet naive agents, such that whatever process constructs them (some alien evolution analogue?), it doesn't teach them good decision theory at all. So when they figure out Tegmark IV and the possibility of acausal attacks/being simulation-captured, they give in to whatever "demands" are posed them. (I.e., there might be entire "worlds of dupes", somewhere out there among the mathematically possible.)

That said, the "dupe" label actually does apply to a lot of humans, I think. I expect that a lot of

nostalgebraist (1y, score 4)

Cool, it sounds like we basically agree!

I'm not sure of this. It seems at least possible that we could get an equilibrium where everyone does use the unfiltered UP (in some part of their reasoning process), trusting that no one will manipulate them because (a) manipulative behavior is costly and (b) no one has any reason to expect anyone else will reason differently from them, so if you choose to manipulate someone else you're effectively choosing that someone else will manipulate you.

Perhaps I'm misunderstanding you. I'm imagining something like choosing one's own decision procedure in TDT, where one ends up choosing a procedure that involves "the unfiltered UP" somewhere, and which doesn't do manipulation. (If your procedure involved manipulation, so would your copy's procedure, and you would get manipulated; you don't want this, so you don't manipulate, nor does your copy.)

But you write whereas it seems to me that TDT/FDT-style reasoning is precisely what allows us to "naively" trust the UP, here, without having to do the hard work of "filtering." That is: this kind of reasoning tells us to behave so that the UP won't be malign; hence, the UP isn't malign; hence, we can "naively" trust it, as though it weren't malign (because it isn't).

More broadly, though -- we are now talking about something that I feel like I basically understand and basically agree with, and just arguing over the details, which is very much not the case with standard presentations of the malignity argument. So, thanks for that.

Thane Ruthenis (1y, score 4)

Fair point! I agree.

Probably also worth stating that I don't think the MUP is in any way relevant to real life.

I think it's relevant because it illustrates an extreme variant of a very common problem, where "incorrectly specified" priors can cause unexpected behavior. It also illustrates the daemon problem, which I expect to be very relevant to real life.

A more realistic and straightforward example of the "incorrectly specified prior" problem: If the prior on an MCTS value head isn't strong enough, it can overfit to value local instrumental goals too highly. Now your overall ... (read more)

Here's a simple argument that simulating universes based on Turing machine number can give manipulated results.

Say we lived in a universe much like this one, except that:

The universe is deterministic
It's simulated by a very short Turing machine
It has a center, and
That center is actually nearby! We can send a rocket to it.

So we send a rocket to the center of the universe and leave a plaque saying "the answer to all your questions is Spongebob". Now any aliens in other universes that simulate our universe and ask "what's in the center of that universe at time step 10^1000?" will see the plaque, search elsewhere in our universe for the reference, and watch Spongebob. We've managed to get aliens outside our universe to watch Spongebob.

I feel like it would be helpful to speak precisely about the universal prior. Here's my understanding.

It's a partial probability distribution over bit strings. It gives a non-zero probability to every bit string, but these probabilities add up to strictly less than 1. It's defined as follows:

That is, describe Turing machines by a binary code, and assign each one a probability based on the length of its code, such that those probabilities add up to exactly 1. Then magically run all Turing machines "to completion". For those that halt leaving a bitstring on their tape, attribute the probability of that Turing machine to that bitstring. Now we have a probability distribution over bitstrings, though the probabilities add up to less than one because not all of the Turing machines halted.
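Written out as a formula, this description could look like the following. This is only a sketch: the symbols $P$ and $\ell$, and the specific weight $2^{-\ell(T)}$ (the usual choice when TM codes are prefix-free), are my assumptions rather than something stated above.

```latex
P(x) \;=\; \sum_{\substack{T \text{ halts} \\ \text{with output } x}} 2^{-\ell(T)},
\qquad
\sum_{T} 2^{-\ell(T)} \;=\; 1,
\qquad
\sum_{x} P(x) \;=\; \sum_{T \text{ halts}} 2^{-\ell(T)} \;<\; 1 .
```

Here $\ell(T)$ is the length of the binary code of the TM $T$. The last inequality is strict because the TMs that never halt keep their share of the weight without assigning it to any bitstring.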

You cannot compute this probability distribution, but you can compute lower bounds on the probabilities of its bitstrings. (The Nth lower bound is the probability distribution you get from running the first N TMs for N steps.)
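To make the "Nth lower bound" concrete, here is a minimal sketch in Python. Everything about the machine model is a hypothetical stand-in chosen so the example runs: `toy_machine(i)` plays the role of the i-th TM (it halts after i*i steps unless i % 3 == 2, in which case it loops forever), and machine i is given weight 2^-(i+1) so that the weights sum to 1. Only the outer loop (run the first N machines for N steps, credit the halted ones) corresponds to the procedure described above.

```python
from fractions import Fraction


def toy_machine(i):
    """Hypothetical stand-in for the i-th TM.

    A generator that yields once per simulated step and returns its
    output bitstring if it halts. Machines with i % 3 == 2 never halt.
    """
    if i % 3 == 2:
        while True:          # loops forever
            yield
    for _ in range(i * i):   # halts after i*i steps
        yield
    return format(i, "b")    # output: binary expansion of i


def run_for(machine, max_steps):
    """Run a machine for at most max_steps steps.

    Returns its output if it halted within the budget, else None.
    """
    try:
        for _ in range(max_steps + 1):
            next(machine)
    except StopIteration as stop:
        return stop.value
    return None


def lower_bound(n):
    """N-th lower bound: run the first n machines for n steps each."""
    bound = {}
    for i in range(n):
        weight = Fraction(1, 2 ** (i + 1))   # assumed weight of machine i
        output = run_for(toy_machine(i), n)
        if output is not None:
            bound[output] = bound.get(output, Fraction(0)) + weight
    return bound


print(lower_bound(4))
print(lower_bound(10))
```

Increasing N can only add halted machines, so each bitstring's bound is non-decreasing in N. What the loop cannot tell you is whether a machine that has not halted yet ever will, which is why these are only lower bounds.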

Call a TM that halts poisoned if its output is determined as follows:

The TM simulates a complex universe full of intelligent life, then selects a tiny portion of that universe to output, erasing the rest.
That intelligent life realizes this might happen, and writes messages in many places that could plausibly be selected.
It works, and the TM's output is determined by what the intelligent life it simulated chose to leave behind.

If we approximate the universal prior, the probability contribution of poisoned TMs will be precisely zero, because we don't have nearly enough compute to simulate a poisoned TM until it halts. However, if there's an outer universe with dramatically more compute available, and it's approximating the universal prior using enough computational power to actually run the poisoned TMs, they'll affect the probability distribution of the bitstrings, making bitstrings with the messages they choose to leave behind more likely.

So I think Paul's right, actually (not what I expected when I started writing this). If you approximate the UP well enough, the distribution you see will have been manipulated.

Very curious what part of this people think is wrong.

hairyfigment (1y, score 4)

I don't see how any of it can be right. Getting one algorithm to output Spongebob wouldn't cause the SI to watch Spongebob; even a less silly claim in that vein would still be false. The Platonic agent would know the plan wouldn't work, and thus wouldn't do it. Since no individual Platonic agent could do anything meaningful alone, and they plainly can't communicate with each other, they can only coordinate by means of reflective decision theory. That's fine, we'll just assume that's the obvious way for intelligent minds to behave. But then the SI works the same way, and knows the Platonic agents will think that way, and per RDT it refuses to change its behavior based on attempts to game the system. So none of this ever happens in the first place. (This is without even considering the serious problems with assuming Platonic agents would share a goal to coordinate on. I don't think I buy it. You can't evolve a desire to come into existence, nor does an arbitrary goal seem to require it. Let me assure you, there can exist intelligent minds which don't want worlds like ours to exist.)

The universal distribution/prior is lower semi computable, meaning there is one Turing machine that can approximate it from below, converging to it in the limit. Also, there is a probabilistic Turing machine that induces the universal distribution. So there is a rather clear sense in which one can “use the universal distribution.” Of course in practice different universes would use more or less accurate versions with more or less compromises for efficiency - I think your basic argument holds up insofar as there isn’t a clear mechanism for precise manipulation through the universal distribution. It’s conceivable that some high level actions such as “make it very clear that we prefer this set of moral standards in case anyone with cosmopolitan values simulates our universe” would be preferred based on the malign universal prior argument.
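For reference, "approximate it from below, converging to it in the limit" can be spelled out as the standard definition of lower semicomputability. The symbols $M$ (the universal distribution) and $\phi$ (the approximating function) are just notation chosen for this sketch.

```latex
\exists\, \phi : \{0,1\}^{*} \times \mathbb{N} \to \mathbb{Q} \text{ computable, such that for all } x \text{ and } t:
\qquad
\phi(x, t) \;\le\; \phi(x, t+1),
\qquad
\lim_{t \to \infty} \phi(x, t) \;=\; M(x).
```

The definition guarantees convergence, but it says nothing about how large $M(x) - \phi(x, t)$ still is at any given $t$.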

The universal distribution/prior is lower semi computable, meaning there is one Turing machine that can approximate it from below, converging to it in the limit. Also, there is a probabilistic Turing machine that induces the universal distribution. So there is a rather clear sense in which one can “use the universal distribution.”

Thanks for bringing this up.

However, I'm skeptical that lower semi-computability really gets us much. While there is a TM that converges to the UP, we have no (computable) way of knowing how close the approximation is at any... (read more)

Cole Wyeth (1y, score 3)

Yes, I mostly agree with everything you said - the limitation with the probabilistic Turing machine approach (it's usually equivalently described as the a priori probability and described in terms of monotone TMs) is that you can get samples, but you can't use those to estimate conditionals. This is connected to the typical problem of computing the normalization factor in Bayesian statistics. It's possible that these approximations would be good enough in practice though.
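A toy illustration of the "you can get samples" half of this, as a rough sketch only. It uses a simplified halting-output picture rather than monotone TMs, and the machine family is entirely hypothetical: `toy_output(i, max_steps)` stands in for running the i-th machine, and the program index i is read off fair coin flips in unary, so machine i is chosen with probability 2^-(i+1). The step budget is also an assumption added so the sampler always returns; an exact sampler could hang forever on a non-halting program.

```python
import random


def toy_output(i, max_steps):
    """Hypothetical stand-in for running the i-th machine.

    Machine i never halts if i % 3 == 2; otherwise it halts after i*i
    steps and outputs the binary expansion of i. Returns None when the
    machine has not halted within max_steps.
    """
    if i % 3 == 2 or i * i > max_steps:
        return None
    return format(i, "b")


def sample_output(rng, max_steps):
    """Feed fair coin flips to the toy machine family as its program.

    The program is a unary index: count heads until the first tail,
    which selects machine i with probability 2^-(i+1).
    """
    i = 0
    while rng.random() < 0.5:
        i += 1
    return toy_output(i, max_steps)


rng = random.Random(0)
samples = [sample_output(rng, max_steps=100) for _ in range(1000)]
halted = [s for s in samples if s is not None]
print(len(halted), "of", len(samples), "samples halted within the budget")
```

Sample frequencies estimate the unnormalized mass of each output. Turning those into conditionals requires dividing by the mass of everything consistent with the observations so far, and that is the normalization step that a sampler alone does not provide.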

I think the malign universal prior doesn't work, but for different reasons than people think.

Link below:

https://www.lesswrong.com/posts/tDkYdyJSqe3DddtK4/alexander-gietelink-oldenziel-s-shortform#w2M3rjm6NdNY9WDez

Instead of inspecting all programs in the UP, just inspect all programs with length less than n. As n becomes larger and larger, this covers more and more of the total probability mass in the UP, and the total probability mass covered this way approaches 1. What to do about the non-halting programs? Well, just run all the programs for m steps, I guess. I think this is the approximation of UP that is implied.
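In symbols, the two-parameter approximation described here might be written as follows. This is a sketch under my own notation: $M_{n,m}$ for the approximation, $|p|$ for the length of program $p$, and the usual weight $2^{-|p|}$, none of which is fixed by the comment itself.

```latex
M_{n,m}(x) \;=\; \sum_{\substack{|p| < n \\ p \text{ halts within } m \text{ steps} \\ \text{with output } x}} 2^{-|p|},
\qquad
M_{n,m}(x) \;\le\; M_{n',m'}(x) \;\le\; M(x)
\quad \text{for } n \le n',\; m \le m'.
```

Raising either n or m can only add terms to the sum, so the approximation improves monotonically from below. Programs that halt only after more than m steps are silently counted as non-halting.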

In order to be “UP-like” in a relevant way, this procedure will have to involve running TMs, and the set of TMs that might be run needs to include the same TM that implements our beings and their world.

Why? The procedure just needs to do some reasoning, constrained by UP and outer TM. And then UP-beings can just simulate this fast reasoning without problems of self-simulation.

Yes, an AI that practically uses the UP may fail to predict whether UP-beings simulate it in the center of their universe or on the boundary. But the point is that the more correct the AI is in its reasoning, the more control UP-beings have.

Or you can just not create an AI that thinks about the UP. But that's denying the assumption.

We (i.e. "reasoning beings in computable universes") can influence the UP, but we can't reason about it well enough to use that influence. Meanwhile, we can reason about things that are more like the speed prior -- but we can't influence them.

Did one of these can/can't pairs get flipped?

I have actually never properly understood the universal prior argument in the first place and just seeing this post made me able to understand parts of it now so thank you for writing it!

I'll admit, my mental image for "our universe + hypercomputation" is a sort of webnovel premise, where we're living in a normal computable universe until one day by fiat an app poofs into existence on your phone that lets you enter a binary string or file and instantaneously get the next bit with minimum description length in binary lambda calculus. Aside from the initial poofing and every usage of the app, the universe continues by its normal rules.

But there are probably simpler universes (by some handwavy standard) out there that allow enough hypercomputation that they can have agents querying minimum description length oracles, but not so much that agents querying MDL oracles can no longer be assigned short codes.

What is a "universal prior"?


Aug 28, 2024

Aug 28, 2024*

Aug 29, 2024*

Aug 28, 2024

Aug 31, 2024

Jan 26, 2025

Aug 28, 2024

Aug 28, 2024
