Kevin T. Kelly's Ockham Efficiency Theorem
There is a game studied in Philosophy of Science and in Probably Approximately Correct (machine) learning. It's a cousin of the Looney Labs game "Zendo", but less fun to play with your friends: http://en.wikipedia.org/wiki/Zendo_(game) (By the way, playing this kind of game is excellent practice at avoiding confirmation bias.) The game has two asymmetric players: one plays Nature, and the other plays Science. First Nature makes up a law, a specific Grand Unified Theory, and then Science tries to guess it. Nature provides some information about the law, and then Science can change its guess if it wants to. Science wins if its guesses converge to the rule that Nature made up.
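To make the setup concrete, here is a minimal sketch of such a game in Python, under invented assumptions (the "law" is just how many distinct kinds of effect Nature's world contains, and Science always guesses the simplest law consistent with the data so far). It only illustrates the convergence criterion, not Kelly's theorem itself, and none of the names below come from his work.

```python
# A toy version of the Nature-vs-Science game sketched above.  Nature's "law" is simply
# how many distinct kinds of effect exist; Science guesses the simplest consistent law.

def ockham_science(observations):
    """Guess the simplest law consistent with the data: the number of distinct effects seen."""
    return len(set(observations))

def play(nature_law, rounds=20):
    """Nature reveals one observation per round; Science may revise its guess each time.
    Science wins if its guesses eventually stabilize on nature_law."""
    observations, guesses = [], []
    for step in range(rounds):
        observations.append(step % nature_law)   # every kind of effect shows up eventually
        guesses.append(ockham_science(observations))
    return guesses

print(play(nature_law=3))   # [1, 2, 3, 3, 3, ...] -- the Ockham-style guesser converges
```

In this toy case the Ockham-style guesser changes its mind exactly nature_law − 1 times; minimizing such mind-changes (retractions) is the kind of efficiency Kelly's theorem is about.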
A Proof of Occam's Razor
Related to: Occam's Razor
If the Razor is defined as, “On average, a simpler hypothesis should be assigned a higher prior probability than a more complex hypothesis,” or stated in another way, "As the complexity of your hypotheses goes to infinity, their probability goes to zero," then it can be proven from a few assumptions.
1) The hypotheses are described by a language that has a finite number of different words, and each hypothesis is expressed by a finite number of these words. Note that this allows for natural languages such as English, but also for computer programming languages and so on. The proof in this post is valid in all of these cases.
2) A complexity measure is assigned to hypotheses in such a way that there are (or may be) some hypotheses which are as simple as possible; these are assigned the complexity measure of 1, while hypotheses considered more complex are assigned higher integer values such as 2, 3, 4, and so on. Apart from this, we can define the complexity measure in any way we like: for example, as the number of words used by the hypothesis (see the sketch after this list), or as the length of the shortest program which can output the hypothesis in a given programming language (the language of the hypotheses might be English while their complexity is measured according to a programming language; Eliezer Yudkowsky takes this approach in the linked article). Many other definitions are possible, and the proof is valid for all definitions that satisfy the conditions laid out here.
3) The complexity measure should also be defined in such a way that there are a finite number of hypotheses assigned the measure of 1, a finite number assigned the measure of 2, a finite number assigned the measure of 3, and so on. This condition is not difficult to satisfy; it holds for either of the definitions mentioned in condition 2, and indeed for any reasonable definition of simplicity and complexity. The proof would not be valid without it, precisely because if simplicity were understood in a way that allowed infinitely many hypotheses to share a single complexity value, the Razor would not hold for that understanding of simplicity.
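As a concrete (and entirely made-up) instance of a measure meeting conditions 2 and 3, here is the word-count measure from condition 2 in a few lines of Python; the vocabulary and example hypothesis are placeholders:

```python
# A toy complexity measure: the complexity of a hypothesis is its word count over a fixed
# finite vocabulary.

VOCABULARY = ["all", "some", "ravens", "swans", "are", "black", "white"]

def complexity(hypothesis):
    """Condition 2: a positive integer, with the simplest hypotheses getting 1."""
    return len(hypothesis.split())

def hypotheses_at(measure):
    """Condition 3: at most len(VOCABULARY) ** measure word-strings have this complexity,
    so the number of hypotheses at each level is finite."""
    return len(VOCABULARY) ** measure

print(complexity("all ravens are black"))        # 4
print([hypotheses_at(k) for k in (1, 2, 3)])     # [7, 49, 343]
```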
The Razor follows of necessity from these three conditions. To explain any data, there will in general be infinitely many mutually exclusive hypotheses which could fit the data. Suppose we assign prior probabilities to all of these hypotheses. Given condition 3, it is possible to compute the average probability of the hypotheses of complexity 1 (call it x1), the average probability of the hypotheses of complexity 2 (call it x2), the average probability of the hypotheses of complexity 3 (call it x3), and so on. Now consider the infinite sum x1 + x2 + x3 + … Since all of these values are positive (and non-zero, since zero is not a probability), the sum either converges to a positive value or diverges to positive infinity. In fact, it converges to a value of at most 1: if we had multiplied each term of the series by the number of hypotheses with the corresponding complexity, the result would have been exactly 1, because probability theory demands that the probabilities of all our mutually exclusive hypotheses sum to exactly 1, and each term of the original series is no greater than the corresponding multiplied term.
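Restated symbolically (writing n_k for the number of hypotheses of complexity k, a symbol introduced only for this restatement), the step is:

```latex
\sum_{k=1}^{\infty} n_k x_k = 1
\quad\text{and}\quad
n_k \ge 1
\;\Longrightarrow\;
\sum_{k=1}^{\infty} x_k \le \sum_{k=1}^{\infty} n_k x_k = 1,
\quad\text{so the series converges and } x_k \to 0 \text{ as } k \to \infty.
```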
Now, x1 is a positive real number. So in order for the series to converge, there can be only a finite number of later terms equal to or greater than x1. There will therefore be some complexity value y1 such that all hypotheses with a complexity value greater than y1 have an average probability of less than x1. Likewise for x2: there will be some complexity value y2 such that all hypotheses with a complexity value greater than y2 have an average probability of less than x2. Leaving the derivation to the reader, it also follows that there is some complexity value z1 such that all hypotheses with a complexity value greater than z1 have a lower probability than any hypothesis with a complexity value of 1, some other complexity value z2 such that all hypotheses with a complexity value greater than z2 have a lower probability than any hypothesis with a complexity value of 2, and so on.
From this it is clear that on average, or as the complexity tends to infinity, hypotheses with a greater complexity value have a lower prior probability, which was our definition of the Razor.
N.B. I have edited the beginning and end of the post to clarify the meaning of the theorem, according to some of the comments. However, I didn't remove anything, since doing so would make the comments difficult to understand for later readers.
Addresses in the Multiverse
Abstract: If we assume that any universe can be modeled as a computer program which has been running for finitely many steps, then we can assign a multiverse-address to every event by combining its world-program with the number of steps into the world-program where it occurs. We define a probability distribution over multiverse-addresses called a Finite Occamian Multiverse (FOM). FOMs assign negligible probability mass to being a Boltzmann brain or to being in a universe that implements the Many Worlds Interpretation of quantum mechanics.
One explanation of existence is the Tegmark level 4 multiverse: the idea that all coherent mathematical structures exist, and that our universe is one of them. To make this meaningful, we must add a probability distribution over mathematical structures, effectively assigning each a degree of existence. Assume that the universe we live in can be fully modeled as a computer program, and that that program, and the number of steps it has been running for, are both finite. (Note that it's not clear whether our universe is finite or infinite; our universe is either spatially infinite, or expanding outwards at a rate greater than or equal to the speed of light, but there's no observation we could make from inside the universe that would distinguish these two possibilities.) Call the program that implements our universe a world-program, W. This could be written in any programming language; it doesn't really matter which, since we can translate between languages by prepending an interpreter, at the cost of only a constant amount of extra length.
Now, suppose we choose a particular event in the universe - an atom emitting a photon, say - and we want to find a corresponding operation in the world-program. We could, in principle, run W until it starts working on the part of spacetime we care about, and count the steps. Call the number of steps leading up to this event T. Taken together, the pair (W,T) uniquely identifies a place, not just in the universe, but in the space of all possible universes. Call any such pair (W,T) a multiverse-address.
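As a toy illustration of a multiverse-address, here is a sketch in Python in which the "world-program" is nothing grander than an elementary cellular automaton identified by its rule number; the choice of toy universe and all of the names below are assumptions of this example, not part of the post's proposal.

```python
# A toy (W, T) multiverse-address: W is a cellular-automaton rule number, T a step count.

from dataclasses import dataclass

@dataclass(frozen=True)
class MultiverseAddress:
    world_program: int   # W: here, just the rule number of an elementary cellular automaton
    step: int            # T: the number of steps into W at which the event occurs

def step_automaton(rule, cells):
    """Advance the toy universe by one step (elementary CA on a fixed-width, wrapping row)."""
    n = len(cells)
    return tuple(
        (rule >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
        for i in range(n)
    )

def locate_event(rule, event_predicate, width=32, max_steps=10_000):
    """Run the world-program and return the multiverse-address of the first step at which
    event_predicate holds, or None if it never holds within max_steps."""
    cells = tuple(1 if i == width // 2 else 0 for i in range(width))
    for t in range(1, max_steps + 1):
        cells = step_automaton(rule, cells)
        if event_predicate(cells):
            return MultiverseAddress(world_program=rule, step=t)
    return None

# Example "event": the first step of rule 110 at which at least half the cells are on.
print(locate_event(110, lambda cells: 2 * sum(cells) >= len(cells)))
```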
Now, suppose we observe an event. What should be our prior probability distribution over multiverse-addresses for that event? That is, for a given event (W,T), what is P(W=X and T=Y)?
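The post goes on to define such a distribution (the FOM); purely as a stand-in to show what a weighting over (W, T) pairs could look like, one simple "Occamian" choice penalizes both the length of W and the size of T. This is my illustration, not the author's definition:

```python
# An illustrative (not the post's) prior over multiverse-addresses.  Assuming world-programs
# are given in a prefix-free binary encoding, the 2**-len(W) weights sum to at most 1
# (Kraft's inequality), and the 1/(T*(T+1)) weights over T = 1, 2, 3, ... sum to exactly 1,
# so the product weights sum to at most 1 and can be normalized into a probability distribution.

def address_weight(world_program_bits: str, step: int) -> float:
    program_weight = 2.0 ** (-len(world_program_bits))   # shorter world-programs weigh more
    step_weight = 1.0 / (step * (step + 1))              # earlier steps weigh more
    return program_weight * step_weight

print(address_weight("0110", 1), address_weight("011010011100", 10_000))
```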
Minds that make optimal use of small amounts of sensory data
In That alien message, Eliezer made some pretty wild claims:
My moral - that even Einstein did not come within a million light-years of making efficient use of sensory data.
Riemann invented his geometries before Einstein had a use for them; the physics of our universe is not that complicated in an absolute sense. A Bayesian superintelligence, hooked up to a webcam, would invent General Relativity as a hypothesis - perhaps not the dominant hypothesis, compared to Newtonian mechanics, but still a hypothesis under direct consideration - by the time it had seen the third frame of a falling apple. It might guess it from the first frame, if it saw the statics of a bent blade of grass.
They never suspected a thing. They weren't very smart, you see, even before taking into account their slower rate of time. Their primitive equivalents of rationalists went around saying things like, "There's a bound to how much information you can extract from sensory data." And they never quite realized what it meant, that we were smarter than them, and thought faster.
In the comments, Will Pearson asked for "some form of proof of concept". It seems that researchers at Cornell - Schmidt and Lipson - have done exactly that. See their video on Guardian Science:
'Eureka machine' can discover laws of nature - The machine formulates laws by observing the world and detecting patterns in the vast quantities of data it has collected
Taking Occam Seriously
Paul Almond's site has many philosophically deep articles on theoretical rationality along LessWrongish assumptions, including but not limited to some great atheology, an attempt to solve the problem of arbitrary UTM choice, a possible anthropic explanation why space is 3D, a thorough defense of Occam's Razor, a lot of AI theory that I haven't tried to understand, and an attempt to explain what it means for minds to be implemented (related in approach to this and this).