<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<title>Articles Tagged ‘ockham’ - Less Wrong</title>
<link>http://lesswrong.com/</link>
<description></description>
<item>
<title>Kevin T. Kelly's Ockham Efficiency Theorem</title>
<link>http://lesswrong.com/lw/2l9/kevin_t_kellys_ockham_efficiency_theorem/</link>
<guid isPermaLink="true">http://lesswrong.com/lw/2l9/kevin_t_kellys_ockham_efficiency_theorem/</guid>
<pubDate>Mon, 16 Aug 2010 14:46:00 +1000</pubDate>
<description>
Submitted by &lt;a href="http://lesswrong.com/user/Johnicholas"&gt;Johnicholas&lt;/a&gt;
&amp;bull;
29 votes
&amp;bull;
&lt;a href="http://lesswrong.com/lw/2l9/kevin_t_kellys_ockham_efficiency_theorem/#comments"&gt;80 comments&lt;/a&gt;
&lt;div&gt;&lt;p&gt;There is a game studied in Philosophy of Science and Probably&amp;#xA0;Approximately Correct (machine) learning. It's a cousin to the Looney&amp;#xA0;Labs game &quot;Zendo&quot;, but less fun to play with your friends.&amp;#xA0;&lt;a href=&quot;http://en.wikipedia.org/wiki/Zendo_(game)&quot;&gt;http://en.wikipedia.org/wiki/Zendo_(game)&lt;/a&gt;&amp;#xA0;(By the way, playing this kind of game is excellent practice at&amp;#xA0;avoiding confirmation bias.) The game has two players, who are&amp;#xA0;asymmetric. One player plays Nature, and the other player plays&amp;#xA0;Science. First Nature makes up a law, a specific Grand Unified Theory,&amp;#xA0;and then Science tries to guess it. Nature provides some information&amp;#xA0;about the law, and then Science can change their guess, if they want&amp;#xA0;to. Science wins if it converges to the rule that Nature made up.&lt;/p&gt;
&lt;p&gt;&lt;a id=&quot;more&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Kevin T. Kelly is a philosopher of science, and studies (among other&amp;#xA0;things) the justification within the philosophy of science for William&amp;#xA0;of Occam's razor: &quot;entities should not be multiplied beyond&amp;#xA0;necessity&quot;. The way that he does this is by proving theorems about the&amp;#xA0;Nature/Science game with all of the details elaborated.&lt;/p&gt;
&lt;p&gt;Why should you care? Firstly, his justification is different from the&amp;#xA0;overfitting justification that sometimes shows up in Bayesian&amp;#xA0;literature. Roughly speaking, the overfitting justification&amp;#xA0;characterizes our use of Occam's razor as pragmatic - we use Occam's&amp;#xA0;razor to do science because we get good generalization performance&amp;#xA0;from it. If we found something else (e.g. boosting and bagging, or&amp;#xA0;some future technique) that proposed oddly complex hypotheses, but&amp;#xA0;achieved good generalization performance, we would switch away from&amp;#xA0;Occam's razor.&lt;/p&gt;
&lt;p&gt;An aside regarding boosting and bagging: These are ensemble machine learning&amp;#xA0;techniques. Suppose you had a technique that created decision tree&amp;#xA0;classifiers, such as C4.5&amp;#xA0;(&lt;a href=&quot;http://en.wikipedia.org/wiki/C4.5_algorithm&quot;&gt;http://en.wikipedia.org/wiki/C4.5_algorithm&lt;/a&gt;), or even decision stumps&amp;#xA0;(&lt;a href=&quot;http://en.wikipedia.org/wiki/Decision_stump&quot;&gt;http://en.wikipedia.org/wiki/Decision_stump&lt;/a&gt;).&amp;#xA0;Adaboost (&lt;a href=&quot;http://en.wikipedia.org/wiki/Adaboost&quot;&gt;http://en.wikipedia.org/wiki/Adaboost&lt;/a&gt;) would start by&amp;#xA0;weighting all of the examples identically and&amp;#xA0;invoking your technique to find an initial classifier. Then it would&amp;#xA0;reweight the examples, prioritizing the ones that the first iteration&amp;#xA0;got wrong, and invoke your technique on the reweighted input.&amp;#xA0;Eventually, Adaboost outputs an ensemble of decision trees (or decision stumps),&amp;#xA0;and taking the majority opinion of the ensemble might well be more&amp;#xA0;effective (generalize better beyond training)&amp;#xA0;than the original classifier. Bagging is similar&amp;#xA0;(&lt;a href=&quot;http://en.wikipedia.org/wiki/Bootstrap_aggregating&quot;&gt;http://en.wikipedia.org/wiki/Bootstrap_aggregating&lt;/a&gt;).&lt;/p&gt;
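&lt;p&gt;The reweighting loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference Adaboost implementation: it assumes one-dimensional inputs, labels of +1 and -1, and decision stumps as the weak learner. The function names (&lt;code&gt;stump_predict&lt;/code&gt;, &lt;code&gt;best_stump&lt;/code&gt;, &lt;code&gt;adaboost&lt;/code&gt;, &lt;code&gt;ensemble_predict&lt;/code&gt;) and the toy data set are invented for this example.&lt;/p&gt;

```python
import math

# Toy data (an assumption for illustration): ten points on a line.
XS = list(range(10))
YS = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]


def stump_predict(x, threshold, polarity):
    """A decision stump: answer `polarity` above the threshold, else its opposite."""
    return polarity if x > threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the lowest-error stump."""
    points = sorted(set(xs))
    thresholds = [points[0] - 1.0]
    thresholds += [(a + b) / 2 for a, b in zip(points, points[1:])]
    candidates = []
    for threshold in thresholds:
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            candidates.append((error, threshold, polarity))
    return min(candidates)


def adaboost(xs, ys, rounds):
    """Return a list of (alpha, threshold, polarity) weighted stumps."""
    n = len(xs)
    weights = [1.0 / n] * n  # start with all examples weighted identically
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified examples, lower the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def ensemble_predict(ensemble, x):
    """Weighted majority vote of the stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score > 0 else -1


if __name__ == "__main__":
    stumps = adaboost(XS, YS, rounds=3)
    print([ensemble_predict(stumps, x) for x in XS])
```

&lt;p&gt;On this toy data no single stump gets fewer than three of the ten examples wrong, yet three boosted stumps classify all ten correctly.&lt;/p&gt;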
&lt;p&gt;Ensemble learning methods are a challenge for &quot;prevents overfitting&quot;&amp;#xA0;justifications of Occam's razor since they propose weirdly complex&amp;#xA0;hypotheses, but suffer &lt;em&gt;less&lt;/em&gt;&amp;#xA0;from overfitting than the weak&amp;#xA0;classifiers that they are built from.&lt;/p&gt;
&lt;p&gt;Secondly, his alternative justification bears on ordering hypotheses&amp;#xA0;&quot;by simplicity&quot;, providing an alternative to (approximations of)&amp;#xA0;Kolmogorov complexity as a foundation of science.&lt;/p&gt;
&lt;p&gt;Let's take a philosopher's view of physics - from a distance, a&amp;#xA0;philosopher listens to the particle physicists and hears: &quot;There are&amp;#xA0;three fundamental particles that make up all matter!&quot;, &quot;We've&amp;#xA0;discovered another, higher-energy particle!&quot;, &quot;Another, even&amp;#xA0;higher-energy particle!&quot;. There is no reason the philosopher can see&amp;#xA0;why this should stop at any particular number of particles. What&amp;#xA0;should the philosopher believe at any given moment?&lt;/p&gt;
&lt;p&gt;This is a form of the Nature vs. Science game where Nature's &quot;Grand Unified Theory&quot; is known to be a nonnegative integer (the number of particles), and at each round Nature can either reveal a new fundamental particle (by name) or remain silent. What should the philosopher of science's strategy be? Occam's razor suggests that we prefer simpler hypotheses, but if there were 254 known particles, how would we decide whether to claim that there are 254, 255, or 256 particles? Note that in many encodings, 255 and 256 are quite round numbers and therefore have short descriptions, that is, low Kolmogorov complexity.&lt;/p&gt;
&lt;p&gt;An aside regarding &quot;round numbers&quot;: Here are some &quot;shortest expressions&quot; of some numbers, according to one possible grammar of expressions. (Kolmogorov complexity is defined only up to an additive constant that depends on the choice of description language, but since we never actually use Kolmogorov complexity itself, that isn't particularly helpful.)&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;1 == (1)&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;2 == (1+1)&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;3 == (1+1+1)&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;4 == (1+1+1+1)&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;5 == (1+1+1+1+1)&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;6 == (1+1+1+1+1+1)&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;7 == (1+1+1+1+1+1+1)&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;8 == ((1+1)*(1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;9 == ((1+1+1)*(1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;10 == ((1+1)*(1+1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;11 == (1+1+1+1+1+1+1+1+1+1+1)&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;12 == ((1+1+1)*(1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;13 == ((1)+((1+1+1)*(1+1+1+1)))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;14 == ((1+1)*(1+1+1+1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;15 == ((1+1+1)*(1+1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;16 == ((1+1+1+1)*(1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;17 == ((1)+((1+1+1+1)*(1+1+1+1)))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;18 == ((1+1+1)*(1+1+1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;19 == ((1)+((1+1+1)*(1+1+1+1+1+1)))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;20 == ((1+1+1+1)*(1+1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;21 == ((1+1+1)*(1+1+1+1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;22 == ((1)+((1+1+1)*(1+1+1+1+1+1+1)))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;24 == ((1+1+1+1)*(1+1+1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;25 == ((1+1+1+1+1)*(1+1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;26 == ((1)+((1+1+1+1+1)*(1+1+1+1+1)))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;27 == ((1+1+1)*((1+1+1)*(1+1+1)))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;28 == ((1+1+1+1)*(1+1+1+1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;30 == ((1+1+1+1+1)*(1+1+1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;32 == ((1+1)*((1+1+1+1)*(1+1+1+1)))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;35 == ((1+1+1+1+1)*(1+1+1+1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;36 == ((1+1+1)*((1+1+1)*(1+1+1+1)))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;40 == ((1+1)*((1+1+1+1)*(1+1+1+1+1)))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;42 == ((1+1+1+1+1+1)*(1+1+1+1+1+1+1))&lt;/p&gt;
&lt;p style=&quot;padding-left: 30px;&quot;&gt;45 == ((1+1+1)*((1+1+1)*(1+1+1+1+1)))&lt;/p&gt;
&lt;p&gt;As you can see, there are several inversions, where a larger number is &quot;simpler&quot; in that its shortest expression (in this grammar) is shorter. The first inversion occurs at 12, which is simpler than 11. In this grammar, squares, powers of two, and smooth numbers generally (&lt;a href=&quot;http://en.wikipedia.org/wiki/Smooth_number&quot;&gt;http://en.wikipedia.org/wiki/Smooth_number&lt;/a&gt;) are considered simple. Even though &quot;the 256th prime&quot; sounds like a short description of a number, this grammar isn't flexible enough to capture that concept, and so it does not consider primes to be simple. I believe this illustrates that picking a particular approximation to Kolmogorov complexity amounts to justifying Occam's razor by aesthetics. Humans can argue about aesthetics, and be convinced by arguments that somewhat resemble logic, but ultimately it is a matter of taste.&lt;/p&gt;
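&lt;p&gt;The list of shortest expressions above can be reproduced with a small search. The sketch below rests on assumptions inferred from the listed entries rather than stated in the text: an expression is either a parenthesized run of ones, a binary sum, or a binary product, and &quot;shortest&quot; means fewest characters. The name &lt;code&gt;shortest_expressions&lt;/code&gt; is made up for this example, and ties in length may be resolved differently from the list.&lt;/p&gt;

```python
from math import isqrt


def shortest_expressions(limit):
    """Return {n: a shortest expression string for n} for n in 1..limit.

    Assumed grammar: "(1+1+...+1)", "(A+B)" and "(A*B)", where A and B are
    themselves shortest expressions; length is counted in characters.
    """
    best = {}
    for n in range(1, limit + 1):
        # Products first, then sums, then the flat run of ones, so that
        # ties in length resolve in that order (min keeps the first minimum).
        candidates = []
        for a in range(2, isqrt(n) + 1):
            if n % a == 0:
                candidates.append("(" + best[a] + "*" + best[n // a] + ")")
        for a in range(1, n // 2 + 1):
            candidates.append("(" + best[a] + "+" + best[n - a] + ")")
        candidates.append("(" + "+".join("1" * n) + ")")
        best[n] = min(candidates, key=len)
    return best


if __name__ == "__main__":
    table = shortest_expressions(13)
    for n in (11, 12, 13):
        print(n, len(table[n]), table[n])
```

&lt;p&gt;Under these assumptions 11 needs 23 characters while 12 needs only 19, which is the first inversion mentioned above.&lt;/p&gt;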
&lt;p&gt;In contrast, Kelly's idea of &quot;simplicity&quot; is related to Popper's falsifiability. In this sense of&amp;#xA0;simplicity, a complex theory is one that can (for a while) camouflage&amp;#xA0;itself as another (simple) theory, but a simple theory cannot pretend&amp;#xA0;to be complex. So if Nature really had 256 particles, it could refuse to reveal them for a while (maybe they're hidden in &quot;very high energy regimes&quot;), and the 256-particle universe would exactly match the givens for a 254-particle universe. However, the 254-particle universe cannot reveal new particles; it's already shown everything that it has.&lt;/p&gt;
&lt;p&gt;Remember, we're not talking about elaborate data analysis, where there could be &quot;mirages&quot; or &quot;aliasing&quot;, patterns in the data that look initially like a new particle, but later reveal themselves to be explicable using known particles. We're talking about a particular form of the Nature vs. Science game where each round Nature either reveals (by name) a new particle, or remains silent. This illustrates that Kelly's simplicity is &lt;em&gt;relative&lt;/em&gt; to the possible observables. In this scenario, where Nature identifies new particles by name, the hypothesis that we have seen all of the particles and there will be no new particles is always the simplest, the &quot;most falsifiable&quot;. With more realistic observables, the question of which consistent hypothesis is the simplest becomes trickier.&lt;/p&gt;
&lt;p&gt;A more realistic example that Kelly uses (following Kuhn) is the Copernican hypothesis that the Earth and other planets circle the sun. In what sense is it simpler than the geocentric hypothesis? From a casual modern perspective, both hypotheses might seem symmetric and of similar complexity. The crucial effect is that Ptolemy's model parameters (velocities, diameters) have to be carefully adjusted to create a &quot;coincidence&quot; - apparent retrograde motion that always coincides with solar conjunction (for Mercury and Venus) and solar opposition (for the other planets). (Note: Retrograde means moving in the unusual direction against the background stars, east to west. Conjunction means two entities near each other in the sky. Opposition means the entities are at opposite points of the celestial sphere.) The Copernican model &quot;predicts&quot; that coincidence; not in the sense that the creation of the model precedes knowledge of the effect, but that any tiny deviation from exact coincidence to be discovered in the future would be evidence against the Copernican model. In this sense, the Copernican model is more falsifiable; simpler.&lt;/p&gt;
&lt;p&gt;The Ockham Efficiency Theorems explain in what sense this version of Occam's razor is strictly better than other strategies for the Science vs. Nature game. If what we care about is the number of public mind changes (saying &quot;I was mistaken, actually there are X particles.&quot; would count as a mind change for any X), and the timing of the mind changes, then Occam's razor is the best strategy for the Science vs. Nature game. The Occam strategy for the number-of-particles game incurs exactly as many mind changes as there are particles. A scientist who deviates from Occam's razor allows Nature to extract a mind change from the (hasty) scientist &quot;for free&quot;.&lt;/p&gt;
&lt;p&gt;The way this works in the particles game is simple. To extract a mind change from a hasty scientist who jumps to predicting 12 particles when they've only seen 11, or 256 particles when they've only seen 254, Nature can simply keep refusing to reveal new particles. If the scientist never switches back down to the known number of particles, then they're a nonconvergent scientist - they lose the game. If the scientist, confronted with a long run of &quot;no new particles found&quot;, does switch to the known number of particles, then Nature has extracted a mind change from the scientist without a corresponding particle. The Occam strategy achieves the fewest possible mind changes (that is, a number equal to the number of particles), given such an adversarial Nature.&lt;/p&gt;
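&lt;p&gt;The argument above can be played out in a toy simulation. This is only a sketch of the particle-counting game as described here, not Kelly's formal setup: the names &lt;code&gt;play&lt;/code&gt;, &lt;code&gt;occam&lt;/code&gt; and &lt;code&gt;hasty&lt;/code&gt; are invented, the opening guess is assumed not to count as a mind change, and the hasty scientist is assumed to guess one extra particle until it has sat through &lt;code&gt;patience&lt;/code&gt; silent rounds.&lt;/p&gt;

```python
def play(strategy, observations):
    """Return (mind_changes, final_guess) for one run of the particle game.

    observations is a list of booleans: True means Nature reveals a new
    particle this round, False means Nature stays silent.
    """
    seen = 0        # particles revealed so far
    silent_run = 0  # consecutive silent rounds since the last reveal
    guess = strategy(seen, silent_run)  # opening guess, not a mind change
    changes = 0
    for revealed in observations:
        if revealed:
            seen += 1
            silent_run = 0
        else:
            silent_run += 1
        new_guess = strategy(seen, silent_run)
        if new_guess != guess:
            changes += 1
            guess = new_guess
    return changes, guess


def occam(seen, silent_run):
    """Always claim exactly the particles seen so far."""
    return seen


def hasty(seen, silent_run, patience=3):
    """Claim one unseen particle until `patience` silent rounds have passed."""
    return seen + 1 if patience > silent_run else seen


if __name__ == "__main__":
    # Nature has two particles, reveals both, then stays silent.
    nature = [True, True] + [False] * 5
    print("occam:", play(occam, nature))
    print("hasty:", play(hasty, nature))
```

&lt;p&gt;With two particles followed by five silent rounds, the Occam scientist changes its mind twice, while the hasty scientist changes its mind three times before settling on the same answer.&lt;/p&gt;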
&lt;p&gt;The &quot;Ockham Efficiency Theorems&quot; refer to the worked-out details of more elaborate Science vs. Nature games - where Nature chooses a polynomial GUT, for example.&lt;/p&gt;
&lt;p&gt;This entire scenario does generalize to noisy observations as well (learning Pearlian causal graphs), though I don't understand this aspect fully. If I understand correctly, the scientist guesses a probability distribution over the possible worlds and you count &quot;mind changes&quot; as changes of that probability distribution, so adjusting a 0.9 probability to 0.8 would be counted as a fraction of a mind change. Anyway, read the actual papers; they're well written and convincing.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://www.andrew.cmu.edu/user/kk3n/ockham/Ockham.htm&quot;&gt;www.andrew.cmu.edu/user/kk3n/ockham/Ockham.htm&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This post benefited greatly from encouragement and critique by cousin_it.&lt;/p&gt;&lt;/div&gt;
</description>
</item>
</channel>
</rss>