I've seen references to "anti-Occamian priors" in the Sequences, where Eliezer was talking about how not all possible minds would agree with Occam's Razor. I'm not sure how such a prior could consistently exist.

It seems like Occam's Razor just logically follows from the basic premises of probability theory. Assume the "complexity" of a hypothesis is how many bits it takes to specify under a particular method of specifying hypotheses, and that hypotheses can be of any length. Then for any prior that assigns nonzero probability to any finite hypothesis H, there must exist some level of complexity L such that any hypothesis more complex than L is less likely than H.

(That is to say, if a particular 13-bit hypothesis is 0.01% likely, then there are at most 9,999 other hypotheses with >= 0.01% probability mass. If the most complicated of these <10,000 hypotheses is 27 bits, then every hypothesis that takes 28 bits or more to specify is less likely than the 13-bit hypothesis. You can change around the numbers 13 and 0.01% and 27 as much as you want, but as long as there's any hypothesis whatsoever with non-infinitesimal probability, then there's some level where everything more complex than that level is less likely than that hypothesis.)
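In symbols (a restatement of the argument above, using only the fact that probabilities sum to at most 1):

$$P(H) = p > 0 \;\;\Rightarrow\;\; \left|\{H' : P(H') \ge p\}\right| \le \frac{1}{p},$$

so the set of hypotheses at least as likely as H is finite; taking L to be the longest description length appearing in that set, every hypothesis of more than L bits is strictly less likely than H.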

This seems to prove that an "anti-Occamian prior" - that is to say, a prior that always assigns more probability to more complex hypotheses and less to the less complex - is impossible. Or at least, that it assigns zero probability to every finite hypothesis. (You could, I suppose, construct a prior such that {sum of probability mass from all 1-bit hypotheses} is 1/3 of {sum of probability mass from all 2-bit hypotheses}, which is then itself 1/3 of {sum of probability mass from all 3-bit hypotheses}, and on and on forever, and that would indeed be anti-Occamian - but it would also assign zero probability to every finite hypothesis, which would make it essentially meaningless.)
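To spell out why the parenthetical construction degenerates: write $m_k$ for the total probability mass on k-bit hypotheses, so the construction requires

$$m_{k+1} = 3\,m_k \;\;\Rightarrow\;\; m_k = 3^{k-1} m_1 \;\;\Rightarrow\;\; \sum_{k \ge 1} m_k = m_1 \sum_{k \ge 1} 3^{k-1},$$

which diverges for any $m_1 > 0$. So under countable additivity $m_1 = 0$, hence $m_k = 0$ for every k: all the mass has to live on infinitely long hypotheses, and every finite hypothesis gets probability zero, exactly as described above.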

Am I missing something about what "anti-Occamian prior" is really supposed to mean here, or how it could really be consistent?

5 Answers

johnswentworth

Let's imagine a Solomonoff-style inference problem, i.e. our inference-engine receives a string of bits and tries to predict the next bit.

Here's an anti-Occamian environment for this inference problem. At timestep t, the environment takes the t bits which the inference-engine has seen, and feeds them into a Solomonoff inductor. The Solomonoff inductor does its usual thing: roughly speaking, it finds the shortest program which would output those t bits, then outputs the next out-bit of that shortest program. But our environment is anti-Occamian, so it sets bit t+1 to the opposite of the Solomonoff inductor's output.

That's an anti-Occamian environment. (Note that it is uncomputable, but still entirely well-defined mathematically.)
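As a code sketch (with the caveat that `solomonoff_predict` below is a hypothetical stand-in for the uncomputable Solomonoff step, not any real library):

```python
def solomonoff_predict(history: list[int]) -> int:
    """Hypothetical oracle: the next bit output by the shortest program
    consistent with `history`. Uncomputable -- shown only to fix ideas."""
    raise NotImplementedError("requires a halting oracle")


def anti_occamian_environment(num_steps: int) -> list[int]:
    """Generate bits by always doing the opposite of what a Solomonoff
    inductor would predict from the bits emitted so far."""
    history: list[int] = []
    for _ in range(num_steps):
        predicted = solomonoff_predict(history)
        history.append(1 - predicted)  # invert the Occamian prediction
    return history
```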

One example of an "anti-Occamian hypothesis" would be the hypothesis that the environment works as just described - i.e. at every step, the environment does the opposite of what Occam's razor (as operationalized by a Solomonoff inductor) would predict.

An agent which believed the anti-Occamian hypothesis would, for instance, expect that any hypothesis which did a very good job predicting past data would almost-surely fail at the next timestep. Of course, in a world like ours, the agent would be wrong about this most of the time... but that would only make the agent more confident. After all, it's always been wrong before, so (roughly speaking) on the anti-Occamian hypothesis it's very likely to be right this time!

Again, note that this example is uncomputable but entirely well-defined mathematically. As with Solomonoff inductors, we could in principle construct limited-compute analogues.

The standard example of a system which (maybe) behaves in this anti-inductive way in the real world is a financial market.

This does bring in the Berry paradox, since after all the hypothesis that the environment follows such a rule is in fact a very short hypothesis itself!

The resolution is that, despite being describable by a short rule, the environment does not correspond to any computable program. The Solomonoff inductor already requires a Turing oracle, and computing the predictions of any such anti-Occamian hypothesis would require an oracle one level further up.

A super-Solomonoff inductor with a higher-level Turing oracle can of course compute such a result, and would assign it high weight in the described environment. Then that inductor likewise has short but uncomputable hypotheses, and so on.

Does this apply at all to anything more probabilistic than just reversing the outcome of the single most likely hypothesis and the next bit(s) it outputs? An Occamian prior doesn't just mean "this is the shortest hypothesis; therefore it is true"; it means hypotheses are weighted by their simplicity. It's possible for an Occamian prior to think the shortest hypothesis is most likely wrong, if there are several slightly longer hypotheses that have more probability in total.

johnswentworth
Of course, the generalization would be that the environment inverts whatever next bit some Occamian agent thinks is most probable.

Steven Byrnes

I think this is similar to the 2010 post A Proof of Occam's Razor? ...which spawned 139 comments. I didn't read them all. But here's one point that came up a couple times:

Let N be a ridiculously, comically large finite number like N=3↑↑↑↑↑3. Take the N simplest possible hypotheses. This is a finite set, so we can split up probability weight such that simpler hypotheses are less likely, within this set.

For example, rank-order these N hypotheses by decreasing complexity, and assign probability $2^{-n}$ to the n'th on the list. That leaves $2^{-N}$ leftover probability weight for the infinitely many other hypotheses outside that set, and you can distribute that however you like; N is so big that it doesn't matter.
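Checking the bookkeeping on those weights: $\sum_{n=1}^{N} 2^{-n} = 1 - 2^{-N}$, so exactly $2^{-N}$ is left over for everything outside the set, and with N as large as 3↑↑↑↑↑3 that leftover is unimaginably small.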

Now, simpler hypotheses are less likely, until we go past the first N hypotheses. But N is so ridiculously large that that's never gonna happen.

I want to say something like: "The bigger N is, the bigger a computer needs to be in order to implement that prior; and given that your brain is the size that it is, it can't possibly be setting N=3↑↑↑↑↑3."

Now, this isn't strictly correct, since the Solomonoff prior is uncomputable regardless of the computer's size, etc. - but is there some kernel of truth there? Like, is there a way of approximating the Solomonoff prior efficiently, which becomes less efficient the larger N gets?

Donald Hobson
There are practical anti-Occam calculations. Start uniform over all bitstrings, and every time you find a short program that produces a bitstring, turn the probability of that bitstring down.
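A toy version of that loop, where "some short program outputs this bitstring" is crudely stood in for by "this bitstring is a repetition of a short seed" (the function names and the repetition test are illustrative choices, not anything specified above):

```python
from itertools import product

def compressible(s: str, max_seed: int) -> bool:
    """Crude stand-in for 'some short program outputs s': is s a
    repetition of a seed of length <= max_seed?"""
    for k in range(1, max_seed + 1):
        for bits in product("01", repeat=k):
            seed = "".join(bits)
            if (seed * len(s))[: len(s)] == s:
                return True
    return False

def anti_occam_prior(n_bits: int = 8, max_seed: int = 3, penalty: float = 0.1) -> dict[str, float]:
    """Start uniform over all n-bit strings, turn down the weight of any
    string with a short description, then renormalize."""
    strings = ["".join(b) for b in product("01", repeat=n_bits)]
    weights = {s: 1.0 for s in strings}   # start uniform over all bitstrings
    for s in strings:
        if compressible(s, max_seed):     # found a "short program" for s
            weights[s] *= penalty         # turn its probability down
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}
```

In the resulting distribution, highly regular strings like 01010101 end up with less mass than incompressible-looking ones - the mirror image of a simplicity prior.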

tailcalled

I think it depends on the size of your outcome-space? If you assume a finite (probably generalizes to compact?) space of outcomes, then your argument doesn't really go through. But a lot of things seem better modelled with infinite outcome spaces, so in that case your argument seems to go through, at least under conventional formalisms.

Noosphere89

It seems like Occam's Razor just logically follows from the basic premises of probability theory.

This turns out to be the case for countable probability spaces, like the space of Turing machines, as explained in Qiaochu Yuan's comment:

https://www.lesswrong.com/posts/fpRN5bg5asJDZTaCj/against-occam-s-razor?commentId=xFKD5hZZ68QXttHqq

cubefox

It seems like Occam's Razor just logically follows from the basic premises of probability theory. Assume the "complexity" of a hypothesis is how many bits it takes to specify under a particular method of specifying hypotheses, and that hypotheses can be of any length.

I think the method can't be arbitrary for your argument to work. If we were to measure the "complexity" of a hypothesis by how many outcomes the event set of the hypothesis contains (or by how many disjuncts some non-full disjunctive normal form has), then the more "complex" hypotheses would, if anything, be more likely rather than less.
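For instance, in the clearest case of nested event sets: if $H_1$ is true exactly on the outcomes in $E_1$ and $H_2$ is true on $E_2 \supseteq E_1$, then $H_2$ has more outcomes and so counts as more "complex" under that measure, yet

$$P(H_2) = \sum_{\omega \in E_2} P(\omega) \;\ge\; \sum_{\omega \in E_1} P(\omega) = P(H_1).$$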

Of course that's an implausible definition of complexity. For example, using conjunctive normal forms for measuring complexity is overall more intuitive, since adding a conjunct makes a hypothesis count as more complex, and it would indeed lead to more "complex" hypotheses being less likely. But universally quantified statements (like laws of nature) correspond to very long (perhaps infinitely long) CNFs, which would suggest that they are highly complex and unlikely. But intuitively, laws of nature can be simple and likely.

This was basically the problem Wittgenstein later identified with his definition of logical probability in the Tractatus a hundred years ago. Laws would have logical probability 0, which is absurd. It would mean no amount of finite evidence can confirm them to any degree. Later Rudolf Carnap worked on this problem, but I'm pretty sure it is still considered to be unsolved. If someone does solve it, and finds an a priori justification for Ockham's razor, that could be used as a solution to the problem of induction, the core problem of epistemology.

So finding a plausible definition of hypothesis complexity for which Ockham's razor also holds is a very hard open problem.

What exactly is an "event set" in this context? I don't think a hypothesis would necessarily correspond to a particular set of events that it permitted, but rather its own probability distribution over which events you would be more likely to see under that hypothesis. In that sense, an event set with no probabilities attached would not be enough to specify which hypothesis you were talking about, because multiple hypotheses could correspond to the same set of permitted events despite assigning very different probabilities to each of those events occurring.

cubefox
A hypothesis corresponds to an "event" in Kolmogorov probability theory. Such an event is the set of mutually exclusive and exhaustive "outcomes" under which the hypothesis is true. An outcome is basically a possible world. So a hypothesis can be thought of as the set of (or the disjunction of) possible worlds in which the hypothesis is true, and the probability of an event/hypothesis is exactly the sum of the probabilities of its outcomes.

That being said, this way of seeing things is problematic, since most hypotheses are true in infinitely many "possible worlds". It is not clear how we could sum infinitely many probabilities. And if we don't use possible worlds as outcomes, but use arbitrary DNFs, the decomposition (partition) becomes non-unique, because outcomes are no longer distinguishable from normal events. Possible worlds at least have the special feature of being "atomic" in the sense that they can't be decomposed into mutually exclusive and exhaustive disjuncts. Ideally we would want finitely many non-arbitrary outcomes for each hypothesis. Then we could apply the principle of indifference, assign equal probabilities, and take the sum. But that doesn't seem to work. So the whole disjunctive approach seems not exactly promising.

An alternative is Wittgenstein-style logical atomism, which supposes there are something like "atomic propositions", presumably about sense data, which are all independent of each other. Hypotheses/events are then supposed to be logical combinations of these. This approach is riddled with technical difficulties as well; it is unclear how it could be made precise.
10 comments

But the situation where P(simple hypothesis) > P(complex hypothesis) is not equivalent to the claim
P(one hypothesis from the set of simple hypotheses) > P(one from the set of complex hypotheses).

This is because the set of complex hypotheses is infinitely larger than the set of simpler ones, and even if each simple hypothesis is more probable, the total probability mass of the set of complex hypotheses is larger.

It means that any situation is more likely to have a complex explanation than a simpler one. This could be called an anti-Occam prior. Not sure if it is related to what EY meant.

Let me look at the toy example just for fun:

We're flipping a coin. If you think the coin is biased with unknown bias, but flips are otherwise independent, then you expect to see more of what you've already got. Eliezer calls it anti-Occamian to expect less of what you've already got - to invert Laplace's rule of succession.
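As a minimal sketch of that inversion (Laplace's rule gives (k+1)/(n+2) after seeing k heads in n flips; one way to "expect less of what you've already got" is to flip it around):

```python
def laplace(k: int, n: int) -> float:
    """Laplace's rule of succession: P(next flip is heads | k heads in n flips)."""
    return (k + 1) / (n + 2)

def anti_laplace(k: int, n: int) -> float:
    """Inverted rule: the more heads seen so far, the less another is expected."""
    return (n - k + 1) / (n + 2)

print(laplace(9, 10))       # ~0.83: expect more of what you've already got
print(anti_laplace(9, 10))  # ~0.17: expect less of what you've already got
```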

This is still a simple rule for assigning probabilities. But what about the Turing machines? Even if you start out by assuming that TMs in the family "look at the average pattern so far, then use some noise to usually bet against it continuing" have high prior probability, if the pattern really does hold then they must grow in complexity linearly faster than the hypothesis family "use some noise to usually bet with the average pattern," because of the growth rates of the different required "noise."

In order to prevent your favored "anti-Occamian" pattern from ever being overtaken, you'd have to penalize competing hypotheses superexponentially for their length (so that for no bias of the coin does a competing hypothesis win). This is a drastic measure that basically destroys Solomonoff induction and replaces it with a monoculture.

I have also wondered this. Saying "The sun probably won't rise tomorrow because it's risen every day until now" can't just rest on "my anti-Occam prior has been wrong every day until now, so it's bound to be right today". It also has to explain why this particular day is the one where the pattern breaks.

My brilliant reason why it's this particular day has been wrong every day until now, so it's bound to be right today.

This is definitely funny, but I think it's not merely a prior. It implies non-Bayesian reasoning.

Which is a fine thing to think about; it just dodges the nit that this post is picking.

Your reason for this particular day has never been tested as right or wrong since the morning hasn't arrived yet.

Also, why should the sun be simply absent tomorrow, rather than purple, or duplicated, or square? None of those has ever happened either.

Well, you said the same thing yesterday, and you were right. So you'll probably be wrong today.

 

EDIT: I realize this is too flip. But I think you are underestimating how alien the anti-Occamian prior is. Reversed intelligence is, like, too stupid to be believed. But if being wrong in the past is taken as evidence for being right today, then I could give the same exact "special reason why it's today" every single day; it'll be wrong every day, which of course is even more evidence in favor of the reasoning process that generated the wrong idea!

There is a multitude of hypotheses that have been wrong until now. Is there a different anti-Occamian prior that favors each one?

In all seriousness, I'm not sure how an anti-Occamian reasoner would even conclude that its crazy hypotheses were wrong. Because "I saw the sun yesterday because it rose" is surely less complex than "the sun failed to rise yesterday but coincidentally I developed a visual processing disorder/superpower that caused me to see a sun that wasn't there and also gain the ability to see in the dark as if it was broad daylight."

I guess I'm thinking more of anti-inductive reasoning than of an anti-Occamian prior.