Articles Tagged ‘probability’ - Less Wrong
http://lesswrong.com/
Logics for Mind-Building Should Have Computational Meaning
http://lesswrong.com/lw/kr3/logics_for_mindbuilding_should_have_computational/
Fri, 26 Sep 2014 07:17:22 +1000
Submitted by <a href="http://lesswrong.com/user/eli_sennesh">eli_sennesh</a>
•
21 votes
•
<a href="http://lesswrong.com/lw/kr3/logics_for_mindbuilding_should_have_computational/#comments">28 comments</a>
<div><p><strong>The Workshop</strong></p>
<p>Late in July I organized and held MIRIx Tel-Aviv with the goal of investigating the currently-open (to my knowledge) Friendly AI problem called "logical probability": the issue of assigning probabilities to formulas in a first-order proof system, in order to use the <a href="http://intelligence.org/files/DefinabilityTruthDraft.pdf" title='"The Definability of Truth in Probabilistic Logic"'>reflective consistency of the probability predicate</a> to get past the <a href="https://www.youtube.com/watch?v=MwriJqBZyoM">Loebian Obstacle</a> to building a self-modifying reasoning agent that will trust itself and its successors.  Vadim Kosoy, Benjamin and Joshua Fox, and I met at the <a href="http://telavivmakers.org/index.php/Main_Page">Tel-Aviv Makers' Insurgence</a> for six hours, and each of us presented our ideas.  I spent most of it sneezing due to my allergies to TAMI's resident cats.</p>
<p>My idea was to go with the <a href="http://en.wikipedia.org/wiki/Proof-theoretic_semantics">proof-theoretic semantics of logic</a> and attack computational construction of logical probability via the <a href="http://homepages.inf.ed.ac.uk/wadler/papers/propositions-as-types/propositions-as-types.pdf">Curry-Howard Isomorphism between programs and proofs</a>: this yields a rather direct translation between computational constructions of logical probability and the learning/construction of an optimal function from sensory inputs to actions required by <a href="http://www.recursivelydiscursive.net/2014/04/problem-class-dominance-in-predictive.html">Updateless Decision Theory</a>.</p>
<p>The best I can give as a mathematical result is as follows:</p>
<p><strong> <img src="http://www.codecogs.com/png.latex?P(%5CGamma%20%5Cvdash%20a:A%20%5Cmid%20%5CGamma%20%5Cvdash%20b:B)%20=%20P(%5CGamma,x:B%20%5Cvdash%20%5B%5Cforall%20y:B,%20x/y%5Da:A)" alt="P(\Gamma \vdash a:A \mid \Gamma \vdash b:B) = P(\Gamma,x:B \vdash [\forall y:B, x/y]a:A)" height="19" width="448"></strong></p>
<p><strong><img src="http://www.codecogs.com/png.latex?P(%5CGamma%20%5Cvdash%20(a,%20b):%20A%20%5Cwedge%20B)%20=%20P(%5CGamma%20%5Cvdash%20a:A%20%5Cmid%20%5CGamma%20%5Cvdash%20b:B)%20*%20P(%5CGamma%20%5Cvdash%20b:B)" alt="P(\Gamma \vdash (a, b): A \wedge B) = P(\Gamma \vdash a:A \mid \Gamma \vdash b:B) * P(\Gamma \vdash b:B)" height="19" width="484"></strong></p>
<p><strong><img src="http://www.codecogs.com/png.latex?%5Cfrac%7Bx:A%20%5Cnotin%20%5CGamma%7D%7BP(%5CGamma%20%5Cvdash%20x:A)%20=%20%5Cmathcal%7BM%7D_%7B%5Clambda%5Cmu%7D%20(A)%7D" alt="\frac{x:A \notin \Gamma}{P(\Gamma \vdash x:A) = \mathcal{M}_{\lambda\mu} (A)}" height="42" width="181"></strong></p>
<p><strong><img src="http://www.codecogs.com/png.latex?%5Cfrac%7Bx:A%20%5Cin%20%5CGamma%7D%7BP(%5CGamma%20%5Cvdash%20x:A)%20=%201.0%7D" alt="\frac{x:A \in \Gamma}{P(\Gamma \vdash x:A) = 1.0}" height="41" width="147"><br></strong></p>
<p>The capital <img src="http://www.codecogs.com/png.latex?%5CGamma" alt="\Gamma" height="13" width="11"> is a set of hypotheses/axioms/assumptions, and the English letters are metasyntactic variables (like "foo" and "bar" in programming lessons).  The lower-case letters denote proofs/programs, and the upper-case letters denote propositions/types.  The turnstile <img src="http://www.codecogs.com/png.latex?%5Cvdash" alt="\vdash" height="13" width="10"> just means "deduces": the <em>judgement</em> <img src="http://www.codecogs.com/png.latex?%5CGamma%20%5Cvdash%20a:A" alt="\Gamma \vdash a:A" height="14" width="70"> can be read here as "an agent whose set of beliefs is denoted <img src="http://www.codecogs.com/png.latex?%5CGamma" alt="\Gamma" height="13" width="11"> will believe that the evidence <em>a</em> proves the proposition <em>A</em>."  The <img src="http://www.codecogs.com/png.latex?%5B%5Cforall%20y:B,%20x/y%5Da" alt="[\forall y:B, x/y]a" height="18" width="131"> performs a "reversed" substitution, with the result reading: "for all <em>y</em> proving/of-type <em>B</em>, substitute <em>x</em> for <em>y</em> in <em>a</em>".  This means that we algorithmically build a new proof/construction/program from <em>a</em> in which any and all constructions proving the proposition <em>B</em> are replaced with the logically-equivalent hypothesis <em>x</em>, which we have added to our hypothesis-set <img src="http://www.codecogs.com/png.latex?%5CGamma" alt="\Gamma" height="13" width="11">.</p>
<p id="title">Thus the first equation reads, "the probability of <em>a</em> proving <em>A</em> conditioned on <em>b</em> proving <em>B</em> equals the probability of <em>a</em> proving <em>A</em> when we assume the truth of <em>B</em> as a hypothesis."  The second equation then uses this definition of conditional probability to give the normal Product Rule of probabilities for the logical product (the <img src="http://www.codecogs.com/png.latex?%5Cwedge" alt="\wedge" height="12" width="11"> operator), defined proof-theoretically.  I strongly believe I could give a similar equation for the normal Sum Rule of probabilities for the logical sum (the <img src="http://www.codecogs.com/png.latex?%5Cvee" alt="\vee" height="12" width="11"> operator) if I could only access the <a href="http://link.springer.com/chapter/10.1007%2FBFb0013061">relevant paywalled paper</a>, in which the λμ-calculus acting as an algorithmic interpretation of the natural-deduction system for classical propositional logic (rather than intuitionistic) is given.</p>
<p>The third item given there is an inference rule, which reads, "if <em>x</em> is a free variable/hypothesis imputed to have type/prove proposition <em>A</em>, not bound in the hypothesis-set <img src="http://www.codecogs.com/png.latex?%5CGamma" alt="\Gamma" height="13" width="11">, then the probability with which we believe <em>x</em> proves <em>A</em> is given by the Solomonoff Measure of type <em>A</em> in the λμ-calculus".  We can define that measure simply as the summed Solomonoff Measure of every program/proof possessing the relevant type, and I don't think going into the details of its construction here would be particularly productive.  Free variables in λ-calculus are isomorphic to unproven hypotheses in natural deduction, and so a probabilistic proof system could learn how much to believe in some free-standing hypothesis via Bayesian evidence rather than algorithmic proof.</p>
<p>The final item given here is trivial: anything assumed has probability 1.0, that of a logical tautology.</p>
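<p>As a toy illustration of how the four rules above fit together, here is a minimal Python sketch restricted to the case where every proof term is a bare variable. Everything named in it is an assumption made for illustration: the <code>PRIOR</code> table is a made-up stand-in for the (incomputable) Solomonoff measure, the reserved hypothesis name <code>_x</code> is assumed not to clash with anything already in the context, and the function names are hypothetical. It is not a construction of the λμ-calculus measure itself.</p>

```python
from typing import Dict

# Hypothetical stand-in for the measure M(A); the real Solomonoff measure
# is incomputable, so these numbers are invented for the example.
PRIOR: Dict[str, float] = {"A": 0.25, "B": 0.5}

# A context maps hypothesis names to the propositions they are assumed to prove.
Context = Dict[str, str]


def p_variable(ctx: Context, x: str, prop: str) -> float:
    """P(ctx |- x : prop) for a bare variable x (rules three and four)."""
    if ctx.get(x) == prop:
        return 1.0          # anything assumed has probability 1.0
    return PRIOR[prop]      # otherwise fall back on the prior measure


def p_conditional(ctx: Context, a: str, prop_a: str, prop_b: str) -> float:
    """P(ctx |- a : A | ctx |- b : B), computed as P(ctx, x:B |- a' : A).

    a' is a with every construction proving B replaced by the new
    hypothesis x. In this variable-only toy, a proves B exactly when
    prop_a == prop_b.
    """
    extended = dict(ctx)
    extended["_x"] = prop_b                     # add x : B to the hypotheses
    term = "_x" if prop_a == prop_b else a      # the "reversed" substitution
    return p_variable(extended, term, prop_a)


def p_pair(ctx: Context, a: str, prop_a: str, b: str, prop_b: str) -> float:
    """Product rule: P((a, b) : A and B) = P(a:A | b:B) * P(b:B)."""
    return p_conditional(ctx, a, prop_a, prop_b) * p_variable(ctx, b, prop_b)
```

<p>With an empty context, <code>p_pair({}, "a", "A", "b", "B")</code> multiplies the two prior values, while adding <code>b : B</code> to the context raises the second factor to 1.0.</p>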
<p>The upside to invoking the strange, alien λμ-calculus instead of the more normal, friendly λ-calculus is that we thus reason inside classical logic rather than intuitionistic, which means we can use the classical axioms of probability rather than <a href="http://projecteuclid.org/euclid.ndjfl/1082637807">intuitionistic</a> <a href="http://www.researchgate.net/publication/220083450_A_probabilistic_extension_of_intuitionistic_logic">Bayesianism</a>.  We <em>need</em> classical logic here: if we switch to intuitionistic logics (Heyting algebras rather than Boolean algebras) we do get to make computational decidability a first-class citizen of our logic, but the cost is that we can then believe <em>only</em> computationally provable propositions. As Benjamin Fox pointed out to me at the workshop, Loeb's Theorem then becomes a triviality, with real self-trust rendered no easier.</p>
<p><strong>The Apologia</strong></p>
<p>My motivation and core idea for all this was very simple: I am a devout <a href="https://www.doc.ic.ac.uk/~gds/PLMW/harper-plmw13-talk.pdf">computational trinitarian</a>, believing that logic must be set on foundations which describe reasoning, truth, and evidence in a non-mystical, non-Platonic way.  The study of first-order logic and <em>especially</em> of incompleteness results in metamathematics, <a href="http://www.ams.org/notices/201011/rtx101101454p.pdf">from Goedel on up to Chaitin</a>, <em>aggravates</em> me in its relentless Platonism, and especially in the way <a href="http://vserver1.cscs.lsa.umich.edu/~crshalizi/notabene/godels-theorem.html">Platonic mysticism about logical incompleteness so often leads to the belief that minds are mystical</a>.  (<a href="http://www.cpporter.com/wp-content/uploads/2013/08/PorterCambridge2013.pdf">It aggravates other people, too!</a>)</p>
<p>The slight problem which I ran into is that there's a shit-ton I don't know about logic.  <a href="http://www.amazon.com/Logical-Labyrinths-Raymond-Smullyan/dp/1568814437/ref=sr_1_1?ie=UTF8&qid=1411672077&sr=8-1&keywords=Raymond+M.+Smullyan+logical+labyrinths">I am now working to remedy</a> <a href="http://www.amazon.com/Computability-Logic-George-S-Boolos/dp/0521701465/ref=sr_1_1?ie=UTF8&qid=1411672120&sr=8-1&keywords=computability+and+logic">this grievous hole in my previous education</a>.  Also, this problem is <a href="http://dl.acm.org/citation.cfm?id=5450">really </a><a href="http://arxiv.org/abs/1209.2620">deep</a>, <a href="http://www.hutter1.net/publ/problogics.pdf">actually</a>.</p>
<p>I thus apologize for ending the rigorous portion of this write-up here.  Everyone expecting proper rigor, you may now pack up and go home, if you were ever paying attention at all.  Ritual seppuku will duly be committed, followed by hors d'oeuvre.  My corpse will be duly recycled to make paper-clips, in the proper fashion of a failed LessWrongian.</p>
<p><strong>The Parts I'm Not Very Sure About</strong></p>
<p>With any luck, that previous paragraph got rid of all the serious people.</p>
<p>I do, however, still think that the (beautiful) equivalence between computation and logic can yield some insights here.  After all, the whole reason for the strange incompleteness results in first-order logic (shown by Boolos in his textbook, I'm told) is that first-order logic, as a reasoning system, contains sufficient computational machinery to encode a Universal Turing Machine.  The bidirectionality of this reduction (Hilbert and Gentzen both have given computational descriptions of first-order proof systems) is just another demonstration of the equivalence.</p>
<p>In fact, it seems to me (right now) to yield a rather intuitively satisfying explanation of why the Gaifman-Carnot Condition (that every instance we see of <img src="http://www.codecogs.com/png.latex?P(x_i)" alt="P(x_i)" height="18" width="42"> provides Bayesian evidence in favor of <img src="http://www.codecogs.com/png.latex?%5Cforall%20x.P(x)" alt="\forall x.P(x)" height="18" width="62">) for logical probabilities <a href="https://groups.google.com/forum/#!topic/magic-list/WJzPoNJavhk">is not computably approximable</a>.  What would we need to interpret the Gaifman Condition from an algorithmic, type-theoretic viewpoint?  From this interpretation, we would need a proof of our universal generalization.  This would have to be a dependent product of form <img src="http://www.codecogs.com/png.latex?%5CPi(x:A).P(x)" alt="\Pi(x:A).P(x)" height="18" width="108">, a function taking any construction <img src="http://www.codecogs.com/png.latex?x:A" alt="x:A" height="12" width="38"> to a construction of type <img src="http://www.codecogs.com/png.latex?P(x)" alt="P(x)" height="18" width="37">, which itself has type <strong>Prop</strong>.  To learn such a dependent function from the examples would be to search for an optimal (simple, probable) construction (program) constituting the relevant proof object: effectively, an individual act of Solomonoff Induction.  Solomonoff Induction, however, is already only semicomputable, which would then make a Gaifman-Hutter distribution (is there another term for these?) doubly semicomputable, since even generating it involves a semiprocedure.</p>
<p>The <em>benefit</em> of using the constructive approach to probabilistic logic here is that we know perfectly well that however incomputable Solomonoff Induction and Gaifman-Hutter distributions might be, both existing humans and existing proof systems succeed in building proof-constructions for quantified sentences <em>all the time</em>, even in higher-order logics such as Coquand's <a href="http://coq.inria.fr/cocorico/TheoryBehindCoq">Calculus of Constructions</a> (the core of a popular constructive proof assistant) or Luo's <a href="http://www.cs.rhul.ac.uk/~zhaohui/LTT06.pdf">Logic-Enriched Type Theory</a> (the core of a popular dependently-typed programming language and proof engine based on classical logic).  Such logics and their proof-checking algorithms constitute, going all the way back to <a href="http://www.win.tue.nl/automath/">Automath</a>, the first examples of computational "agents" which acquire specific "beliefs" in a mathematically rigorous way, subject to human-proved theorems of soundness, consistency, and programming-language-theoretic completeness (rather than meaning that every true proposition has a proof, this means that every program which does not become operationally stuck has a type and is thus the proof of some proposition).  If we want our AIs to believe in accordance with soundness and consistency properties we can prove <em>before</em> running them, while being composed of computational artifacts, I personally consider this the foundation from which to build.</p>
<p>Where we <em>can</em> acquire probabilistic evidence in a sound and computable way, as noted above in the section on free variables/hypotheses, we can do so for propositions which we cannot algorithmically prove.  This would bring us closer to our actual goals of using logical probability in Updateless Decision Theory or of getting around the Loebian Obstacle.</p>
<p><strong>Some of the Background Material I'm Reading</strong></p>
<p>Another reason why we should use a Curry-Howard approach to logical probability is one of the simplest possible reasons: the burgeoning field of <a href="http://research.microsoft.com/pubs/208585/fose-icse2014.pdf">probabilistic programming</a> is already being <a href="http://dl.acm.org/citation.cfm?id=2103721&CFID=573600967&CFTOKEN=48192368">built</a> on <a href="http://dl.acm.org/citation.cfm?id=503288">it</a>.  The Computational Cognitive Science lab at MIT is publishing papers showing that their languages are universal for computable and semicomputable probability distributions, and getting strong results in the study of human general intelligence.  Specifically: they are hypothesizing that we can dissolve "learning" into "inducing probabilistic programs via hierarchical Bayesian inference", "thinking" into "simulation" into "conditional sampling from probabilistic programs", and "uncertain inference" into "approximate inference over the distributions represented by probabilistic programs, conditioned on some fixed quantity of sampling that has been done."</p>
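<p>To make "conditional sampling from probabilistic programs" concrete, here is a minimal Python sketch using rejection sampling. Rejection sampling is my choice for brevity, not a claim about how the lab's languages perform inference, and the model (two fair coins) and all function names are assumptions made for illustration.</p>

```python
import random


def flip_two_coins(rng):
    """Generative model: two independent fair coin flips."""
    return rng.random() < 0.5, rng.random() < 0.5


def conditional_estimate(model, condition, query, samples, seed=0):
    """Estimate P(query | condition) by rejection sampling.

    Run the generative model many times, keep only the runs that satisfy
    the condition, and report the fraction of kept runs where the query holds.
    """
    rng = random.Random(seed)
    accepted = 0
    hits = 0
    for _ in range(samples):
        outcome = model(rng)
        if condition(outcome):
            accepted += 1
            hits += bool(query(outcome))
    if accepted == 0:
        raise ValueError("condition never satisfied; cannot estimate")
    return hits / accepted


# P(both heads | at least one head); the exact answer is 1/3.
estimate = conditional_estimate(
    flip_two_coins,
    condition=lambda o: o[0] or o[1],
    query=lambda o: o[0] and o[1],
    samples=20000,
)
```

<p>The fixed sample budget plays the role of the "fixed quantity of sampling that has been done": the estimate is only approximate, and a larger budget tightens it.</p>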
<p>In fact, one might even look at these ideas and think that, perhaps, an agent which could find some way to sample quickly and more accurately, or to learn probabilistic programs more efficiently (in terms of training data), than was built into its original "belief engine" could then rewrite its belief engine to use these new algorithms to perform strictly better inference and learning.  Unless I'm as completely wrong as I usually am about these things (that is, very extremely completely wrong based on an utterly unfounded misunderstanding of the whole topic), it's a potential engine for recursive self-improvement.</p>
<p>They also have been studying how to implement statistical inference techniques for their generative modeling languages which do not obey Bayesian soundness.  While most of machine learning/perception works according to error-rate minimization rather than Bayesian soundness (exactly because Bayesian methods are <em>often</em> too computationally expensive for real-world use), I would prefer someone at least study the implications of employing unsound inference techniques for more general AI and cognitive-science applications in terms of how often such a system would "misbehave".</p>
<p>Many of MIT's models are currently dynamically typed and appear to leave type soundness (the logical rigor with which agents come to believe things by deduction) to future research.  And yet: they got to this problem first, so to speak.  We really ought to be collaborating with them, with the full-time grant-funded academic researchers, rather than trying to armchair-reason our way to a full theory of logical probability as a large group of amateurs or part-timers and only a small core cohort of full-time MIRI and FHI staff investigating AI safety issues.</p>
<p>(I admit to having a nerd crush, and I am actually planning to go visit the Cocosci Lab this coming week, and want/intend to apply to their PhD program.)</p>
<p>They have also uncovered something else I find highly interesting: human learning of both concepts and causal frameworks seems to take place via hierarchical Bayesian inference, <a href="http://projects.csail.mit.edu/church/wiki/Hierarchical_Models#The_Blessing_of_Abstraction">gaining a "blessing of abstraction" to countermand the "curse of dimensionality"</a>.  The natural interpretation of these abstractions in terms of constructions and types would be that, as in dependently-typed programming languages, constructions have types, and types are constructions, but for hierarchical-learning purposes, it would be useful to suppose that <em>types</em> have specific, structured types more informative than <strong>Prop</strong> or <strong>Type</strong><sub>n</sub> (for some universe level <em>n</em>).  Inference can then proceed from giving constructions or type-judgements as evidence at the bottom level, up the hierarchy of types and meta-types to give probabilistic belief-assignments to very general knowledge.  Even very different objects could have similar meta-types at some level of the hierarchy, allowing hierarchical inference to help transfer Bayesian evidence between seemingly different domains, giving insight into how efficient general intelligence can work.</p>
<p><strong>Just-for-fun Postscript</strong></p>
<p>If we really buy into the model of thinking as conditional simulation, we can use that to <a href="/lw/tg/against_modal_logics/">dissolve the modalities "possible" and "impossible"</a>.  We arrive at (by my count) three different ways of considering the issue computationally:</p>
<ol>
<li>Conceivable/imaginable: the generative models which constitute my current beliefs do or do not yield a path to make some logical proposition true or to make some causal event happen (<a href="http://projects.csail.mit.edu/church/wiki/Inference_about_inference:_Nested_query#Planning">planning can be done as inference, after all</a>), with or without some specified level of probability.</li>
<li>Sensibility/absurdity: the generative models which constitute my current beliefs place a desirably high or undesirably low probability on the known path(s) by which a proposition might be true or by which an event might happen.  The level which constitutes "desirable" could be set as the <img src="http://www.codecogs.com/png.latex?%5Calpha" alt="\alpha"> value for a hypothesis test, or some other value determined decision-theoretically.  This could relate to Pascal's Mugging: how probable must something be before I consider it <em>real</em> rather than an artifact of my own hypothesis space?</li>
<li>Consistency or Contradiction: the generative models which constitute my current beliefs, plus the hypothesis that some proposition is true or some event can come about, do or do not yield a logical contradiction with some probability (that is, we should believe the contradiction exists only to the degree we believe in our existing models in the first place!).</li>
</ol>
<p>I mostly find this fun because it lets us talk rigorously about when we should "shut up and do the 1,2!impossible" and when something is very definitely 3!impossible.</p></div>
Solutions and Open Problems
http://lesswrong.com/lw/joy/solutions_and_open_problems/
Sat, 15 Mar 2014 17:53:36 +1100
Submitted by <a href="http://lesswrong.com/user/Manfred">Manfred</a>
•
7 votes
•
<a href="http://lesswrong.com/lw/joy/solutions_and_open_problems/#comments">8 comments</a>
<div><p><strong>Followup To:</strong> <a href="/lw/jjl/approaching_logical_probability/">Approaching Logical Probability</a></p>
<p>Last time, we required our robot to only assign logical probability of 0 or 1 to statements where it's checked the proof. This flowed from our desire to have a robot that comes to conclusions in limited time. It's also important that this abstract definition has to take into account the pool of statements that our actual robot actually checks. However, this restriction doesn't give us a consistent way to assign numbers to unproven statements - to be consistent we have to put limits on our application of the <a href="/lw/jfx/foundations_of_probability/">usual rules of probability</a>.</p>
<p><a id="more"></a></p>
<div>
<div><strong>Total Ignorance</strong></div>
<div><br></div>
<div>The simplest solution is to assign logical probabilities to proven statements normally, but totally refuse to apply our information to unproven statements. The principle of maximum entropy means that every unproven statement then gets logical probability 0.5.</div>
<div><br></div>
<div>There is a correspondence between being inconsistent and ignoring information. We could just as well have said that when we move from proven statements to unproven statements, we refuse to apply the product rule, and that would have assigned every unproven statement logical probability 0.5 too. Either way, there's something "non-classical" going on at the boundary between proven statements and unproven statements. If you are certain that 100+98=198, but not that 99+99=198, some unfortunate accident has befallen the rule that if (A+1)+B=C, then A+(B+1)=C.</div>
<div><br></div>
<div>Saying 0.5 for everything is rarely suggested in practice, because it has some glaringly bad properties: if we ask about the last digit of the zillionth prime number, we don't want to say that the answer being 1 has logical probability 0.5, we want our robot to use facts about digits and about primes to say that 1, 3, 7, and 9 have logical probability 0.25 each.</div>
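<div><br></div>
<div>The 0.25 figure can be reproduced with a minimal Python sketch. It assumes the only facts the robot uses are that a prime larger than 5 cannot end in an even digit or in 5, and that it spreads probability evenly over whatever digits remain; the function name is hypothetical.</div>

```python
from fractions import Fraction


def last_digit_distribution():
    """Maximum-entropy distribution over the last digit of a large prime.

    A prime greater than 5 is not divisible by 2 or 5, so its last digit
    cannot be even and cannot be 5. With no further information, the
    remaining digits share the probability equally.
    """
    allowed = [d for d in range(10) if d % 2 != 0 and d % 5 != 0]
    return {d: Fraction(1, len(allowed)) for d in allowed}


distribution = last_digit_distribution()
```

<div>The result puts probability 1/4 on each of 1, 3, 7, and 9, rather than 0.5 on "the last digit is 1".</div>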
<div><br></div>
<div><strong>Using Our Pool of Proof-Steps</strong></div>
<div><br></div>
<div>The most obvious process that solves this problem is to assign logical probabilities by ignoring most of the starting information, but using as information all proven statements containing no variables (that is, statements that can't regenerate a set of axioms that would take us infinite time to work through). So if we prove that our prime number can't simultaneously end with 1 and 3, we'll never assign a combined probability to them greater than 1.</div>
<div><br></div>
<div>This is the most typical family of suggestions (of the ones that won't require infinite resources). See <a href="/lw/eaa/a_model_of_udt_with_a_concrete_prior_over_logical/">Benja</a>, for example. Another phrasing of this solution is that it's like we start with the inconsistent prior of "1/2 to everything," and then update this according to checked proof steps, until we run out of time. The updating can also be described as if there's a bunch of different tables (models) that assign a true or false value to every statement, and when we learn that the prime can't end with 1 and 3 simultaneously, we rule out the models that say both of those are true and redistribute the probability evenly. To get answers in finite time, we can't actually compute any of these fancy descriptions, but we can compute the updates of just the statement we want.</div>
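<div><br></div>
<div>The "tables (models)" description can be sketched in Python for a tiny pool of statements. This is a brute-force illustration under stated assumptions, not the finite-time procedure itself: it enumerates every table, which is exactly what a real robot could not afford, and the statement names and the function <code>logical_probability</code> are hypothetical.</div>

```python
from fractions import Fraction
from itertools import product


def logical_probability(statements, proven_constraints, query):
    """Probability of `query` after ruling out models that contradict proofs.

    Start from the "1/2 to everything" prior, i.e. equal weight on every
    true/false table over the statements. Each checked proof step is a
    constraint; tables violating any constraint are removed and the
    probability is redistributed evenly over the survivors.
    """
    models = [
        dict(zip(statements, values))
        for values in product([True, False], repeat=len(statements))
    ]
    surviving = [m for m in models if all(c(m) for c in proven_constraints)]
    if not surviving:
        raise ValueError("the proven constraints are jointly inconsistent")
    favourable = sum(1 for m in surviving if query(m))
    return Fraction(favourable, len(surviving))


statements = ["ends_in_1", "ends_in_3"]
# Checked proof step: the prime cannot end in 1 and 3 simultaneously.
not_both = lambda m: not (m["ends_in_1"] and m["ends_in_3"])

before = logical_probability(statements, [], lambda m: m["ends_in_1"])
after = logical_probability(statements, [not_both], lambda m: m["ends_in_1"])
both = logical_probability(
    statements, [not_both], lambda m: m["ends_in_1"] and m["ends_in_3"]
)
```

<div>Before the proof step each statement gets 1/2; afterwards one of the four tables is ruled out, each statement gets 1/3, and the two together never exceed a combined probability of 1.</div>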
<div><br></div>
<div>Even though the probabilities assigned by this approach are more sensible, violations of normal probability still occur at the boundary between proven and unproven statements. We're giving our robot more information to work with, but still not as much as a robot with infinite computing power could use.</div>
<div><br></div>
<div>A sneaky issue here is that since using checked steps as information takes time, that's less time available to find the solution. This is a general rule - as tricks for assigning logical probabilities to unproven statements get better, they take up more time, so you only want to use them if you don't expect that time to be important. But calculating that tradeoff also takes time! Someone has probably solved this problem in other contexts, but I do not know the solution.</div>
<div><br></div>
<div><strong>Logical Pattern-Matching</strong></div>
<div><br></div>
<div>There is another interesting property we might want, which could be called <a href="/lw/igq/a_basis_for_patternmatching_in_logical_uncertainty/">logical pattern-matching</a>. Suppose that our robot is trying to predict a very complicated machine. Further suppose that our robot knows the complete description of the machine, but it is too complicated for our robot to predict the output, or even to find any useful proofs about the machine's behavior.</div>
<div><br></div>
<div>At time step 1, our robot observes that the machine outputs "1." At time step 2, our robot observes that the machine outputs "2." We might now want our robot to "see the pattern," and guess that the machine is about to output 3.</div>
<div><br></div>
<div>Our solutions so far don't do this - our robot would need to prove a statement like "if its last output was 2, its next output is logically forbidden from being 5." If our machine is too complicated to prove statements like that about, our previous solutions won't even think of the previous outputs as information.</div>
<div><br></div>
<div>One way to make our robot care about complexity is to restrict the length of hypotheses to less than the length of the description of the machine. This is like giving the robot the information that the answer comes from a machine, and that the description length of this machine is less than some number.</div>
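<div><br></div>
<div>Here is a minimal Python sketch of that restriction on a hand-picked hypothesis pool. The pool, the description lengths in bits, and the 2^-length weighting are all assumptions invented for the example; a real version would have to enumerate programs, which is where the time cost discussed next comes from.</div>

```python
from fractions import Fraction

# Hypothetical pool: (name, made-up description length in bits, rule mapping
# the time step t = 1, 2, 3, ... to the predicted output).
HYPOTHESES = [
    ("count up", 3, lambda t: t),
    ("powers of two", 4, lambda t: 2 ** (t - 1)),
    ("always one", 2, lambda t: 1),
    ("count up, then jump to 5", 9, lambda t: t if t <= 2 else 5),
]


def predict_next(observations, machine_length):
    """Distribution over the next output, averaging over short hypotheses.

    Only hypotheses shorter than the machine's own description are allowed,
    and only those that reproduce every observation so far survive. Each
    survivor votes for its next output with weight 2 ** -length.
    """
    t_next = len(observations) + 1
    weights = {}
    for _name, length, rule in HYPOTHESES:
        if length >= machine_length:
            continue  # too long: excluded by the length restriction
        if any(rule(t) != seen for t, seen in enumerate(observations, start=1)):
            continue  # contradicted by an observed output
        guess = rule(t_next)
        weights[guess] = weights.get(guess, 0) + Fraction(1, 2 ** length)
    total = sum(weights.values())
    return {guess: weight / total for guess, weight in weights.items()}


prediction = predict_next([1, 2], machine_length=8)
```

<div>After observing "1" then "2", the surviving short hypotheses favour 3 over 4 by two to one, and the longer "jump to 5" rule is excluded outright by the length bound.</div>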
<div><br></div>
<div>A big problem with this is time. If our robot has to average over the outcome of all these different hypotheses, this takes longer than just using the description itself to find the answer. In a sense, directly using knowledge about the description of the machine is too much for our robot to handle. When we just used the checked proof-steps, that was okay, but as you give the robot more information you also burden it by making it spend more time interpreting that information.</div>
<div><br></div>
<div>And yet, we want our robot to be able to do logical pattern matching quickly if it actually goes out and observes a complicated machine that prints "1, 2...". But this is another problem I don't know how to solve - we could just say "monte carlo" and wave our hands, but handwaving is frowned upon here, and for good reason.</div>
<div><br></div>
<div><strong>Further Open Problems</strong></div>
<div><br></div>
<div>In this post I've already mentioned two open problems: the tradeoff of searching for an exact solution versus having a good approximation, and the correct way to do logical pattern-matching. There are more unsolved problems that also deserve mention.</div>
<div><br></div>
<div><span style="background-color: #f9f9f9; color: #333333; font-family: 'Open Sans', 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px;">• </span>Handling infinities. Current proposals have some pretty bad properties if there are an infinite number of possible answers. For example, if you prove that answers 1-1000 all have small logical probability, but don't prove anything about answer 1001, the robot might decide that since you didn't prove anything about it, it has probability 0.5, and is thus a really good idea. An example direction to go might be to restrict our robot to taking actions it's actually proved things about - but we can also come up with perverse situations where that's bad. Is there a better way?</div>
<div><br></div>
<div><span style="background-color: #f9f9f9; color: #333333; font-family: 'Open Sans', 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px;">• </span>Integrating this approach into a <a href="/lw/jjl/approaching_logical_probability/">larger framework of decision-making</a>. This builds off of making tradeoffs with your computing time and handling infinities. Basically, we want our robot to make decisions in limited time, not just output logical probabilities in limited time, and making decisions requires considering your possible actions and the utility of outcomes, which are allowed to be really complicated and require approximation. And then, we need to somehow direct computational resources into different tasks to make the best decision.</div>
<div><br></div>
<div><span style="background-color: #f9f9f9; color: #333333; font-family: 'Open Sans', 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px;">• </span>Integrating this approach with second-order arithmetic. If you look at <a href="http://intelligence.org/wp-content/uploads/2013/03/Christiano-et-al-Naturalistic-reflection-early-draft.pdf">MIRI's paper</a> that uses a probability distribution over logical statements, their approach is quite different - for one, they don't allow for any limits on the robot's resources. And for another, there are all sorts of properties that are important when considering second-order arithmetic that we haven't needed yet. For example, what happens when we ask for P(P(this statement)&lt;0.3)?</div>
</div>
<div><br></div>
<div><br></div>
<div>Thank you for reading the Logical Uncertainty sequence. I hope that things which were not obvious now seem obvious. If you want your logical probability distribution to have certain nice properties, it is a good idea to only slightly depart from the original desiderata of probability, and build up from there. Jumping straight to an answer is not necessary, and is probably a bad idea anyhow.</div>
<div><br></div>
<div><img src="http://oikosjournal.files.wordpress.com/2011/04/thenmiracleoccurs.jpg?w=275&h=300" alt="" height="299" width="275"><br></div>
<p style="text-align:right">End of the sequence <em>Logical Uncertainty</em></p>
<p style="text-align:right">Previous Post: <a href="/lw/jjl/approaching_logical_probability/">Approaching Logical Probability</a></p></div>
Approaching Logical Probability
http://lesswrong.com/lw/jjl/approaching_logical_probability/
Thu, 27 Feb 2014 18:44:38 +1100
Submitted by <a href="http://lesswrong.com/user/Manfred">Manfred</a>
•
7 votes
•
<a href="http://lesswrong.com/lw/jjl/approaching_logical_probability/#comments">22 comments</a>
<div><p><strong>Followup To:</strong> <a style="text-align: right;" href="/lw/jjk/logic_as_probability/">Logic as Probability</a></p>
<p>If we design a robot that acts as if it's uncertain about mathematical statements, that <a href="/lw/jjk/logic_as_probability/">violates</a> <a href="/lw/jfx/foundations_of_probability/">some desiderata for probability</a>. But realistic robots cannot prove all theorems; they have to be uncertain about hard math problems.</p>
<p>In the name of practicality, we want a foundation for decision-making that captures what it means to make a good decision, even with limited resources. "Good" means that even though our real-world robot can't make decisions well enough to satisfy Savage's theorem, we want to approximate that ideal, not throw it out. Although I don't have the one best answer to give you, in this post we'll take some steps forward.</p>
<p><a id="more"></a></p>
<div>The objects we call probabilities are specified by desiderata that tell us how they behave. Any uncertainty about math problems violates those desiderata, but we still want to be able to assign logical probabilities that behave a lot like probabilities. The basic principles - not making up or ignoring information, not throwing away money or being inconsistent - should be deviated from as little as possible even when computing power is scarce. We want to develop a foundation for logical probabilities by starting from the rules governing ordinary probability, and then minimally restricting the application of those rules.</div>
<div><br></div>
<div>As we do this, it's important to keep track of what changes we make and why. Sometimes people just define logical probabilities, without worrying about desiderata. This is fine, when it works, and is often patchable if it doesn't have the right properties. But if you use it for something important and get a surprise failure, it's really bad. My hope here is to construct logical probabilities that have the good properties, while keeping <a href="http://www.catb.org/jargon/html/H/handwave.html">handwaving</a> and mysterious assumptions to a minimum.</div>
<div><br></div>
<div>The perils of handwaving are more dire than they appear, and they are at their most dire in the hardest and most confusing reaches of physics. After better approaches fail, many theorists resort to just making up approximations and then trying to justify them. Doing this is known colloquially as a "1/ego expansion." Simply put, it doesn't work; there are too many vital little details. It's why even condensed matter theorists tell you not to trust condensed matter theorists about high temperature superconductivity.</div>
<div><br></div>
<div><br></div>
<div>We must abandon regular probabilities because our robot has limited time, but other parts of the decision-making process can also go over the time limit. If the robot's resources are limited, expected utility maximization breaks down at many points: there might be too many strategies to search through, too many outcomes to foresee, there might be probabilities that are too hard to find, and the utility of the outcomes might be too complicated.</div>
<div><br></div>
<div>The logical probabilities considered in this sequence will help approximate hard math problems, but they don't seem to help much when there are too many outcomes to consider, or if you want to make the best use of limited computational resources. They are only a part of the full solution.</div>
<div><br></div>
<div><br></div>
<div>Time for a desideratum: we want our robot to only assign a logical probability of 1 or 0 to a statement after it's actually checked the proof of that statement.</div>
<div><br></div>
<div>We can think of this as limiting what statements our robot is allowed to be certain about - only statements with short proofs can be found by our agent. However, this desideratum is not just about proof length, because a real robot won't check every checkable proof - it will spend time generating proofs, maybe trying to prove some specific statement, and will end up only checking some subset of short proofs.</div>
<div><br></div>
<div>Logical probabilities, unlike probabilities, are not determined just by the starting information. If our real robot only verifies some small collection of proofs, the robot's logical probabilities depend heavily upon what proof-steps were checked. One proof-step is just one <a href="/lw/f43/proofs_implications_and_models/">truth-preserving</a> step by our robot, like one <a href="/r/lesswrong/lw/jjk/logic_as_probability/">application of modus ponens</a> - it's a little proof one step long. The import is that they're the atomic unit of proofs, and once all the steps of a proof are checked, the proof is checked.</div>
<div><br></div>
<div>If we condition on which proof-steps get checked, does that determine the logical probabilities?</div>
<div><br></div>
<div>For any statement our robot is going to prove or disprove, we can use the checked proof-steps to find whether its logical probability is 1 or 0. This gives the same answer as a real robot that checks steps according to some process and then returns 1 or 0 if it manages to prove or disprove the statement we give it. We just have to take the steps that the real robot ends up checking, and say that those are the proved steps for our abstract mathematical definition.</div>
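<p>As a rough illustration of conditioning on checked proof-steps, here is a minimal sketch. The encoding is an assumption made for this example, not something from the sequence: a proof-step is a pair of premises and a conclusion, statements are plain strings, and <code>negate</code>, <code>proved_statements</code> and <code>logical_probability</code> are hypothetical helper names. Statements that are neither proved nor disproved get <code>None</code>, since nothing so far says what number they should have.</p>

```python
from typing import Optional

# Hypothetical encoding: a proof-step is (premises, conclusion), meaning
# "from these already-proved statements, conclude this one".
Step = tuple[frozenset[str], str]


def negate(statement: str) -> str:
    """Toy negation: add or strip a leading 'not '."""
    return statement[4:] if statement.startswith("not ") else "not " + statement


def proved_statements(axioms: set[str], checked_steps: list[Step]) -> set[str]:
    """Close the axioms under only the steps the robot actually checked."""
    proved = set(axioms)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in checked_steps:
            if premises <= proved and conclusion not in proved:
                proved.add(conclusion)
                changed = True
    return proved


def logical_probability(statement: str, proved: set[str]) -> Optional[int]:
    """1 if proved, 0 if the negation is proved, None if undetermined."""
    if statement in proved:
        return 1
    if negate(statement) in proved:
        return 0
    return None


axioms = {"A", "A implies B", "B implies not C"}
steps = [
    (frozenset({"A", "A implies B"}), "B"),
    (frozenset({"B", "B implies not C"}), "not C"),
]

everything_checked = proved_statements(axioms, steps)
only_first_step = proved_statements(axioms, steps[:1])
print(logical_probability("C", everything_checked))  # 0
print(logical_probability("C", only_first_step))     # None
```

<p>The same statement C gets a different answer depending on which steps were checked, which is the sense in which logical probabilities are not fixed by the starting information alone.</p>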
<div><br></div>
<div>There's a problem, though. We haven't changed the old axioms, so they're still only satisfied if we get the right answer for everything. Meanwhile our new desideratum says we can't get the right answer for everything - we've made our axioms internally contradictory. In order to talk about the logical probabilities of unproven statements, we'll need to weaken the original axioms so that they no longer require certainty about everything. We'll explore ways to do this next time. Then we can assign numbers to statements in the usual way, by using our weakened axioms to find constraints, then maximizing entropy subject to those constraints.</div>
<div><br></div>
<div><br></div>
<p style="text-align:right">Part of the sequence <em>Logical Uncertainty</em></p>
<p style="text-align:right">Previous Post: <a href="/lw/jjk/logic_as_probability/">Logic as Probability</a></p>
<p style="text-align:right">Next post: <a href="/lw/joy/solutions_and_open_problems/">Solutions and Open Problems</a></p></div>
Logic as Probability
http://lesswrong.com/lw/jjk/logic_as_probability/
http://lesswrong.com/lw/jjk/logic_as_probability/Sat, 08 Feb 2014 17:39:36 +1100
Submitted by <a href="http://lesswrong.com/user/Manfred">Manfred</a>
•
9 votes
•
<a href="http://lesswrong.com/lw/jjk/logic_as_probability/#comments">30 comments</a>
<div><p><strong>Followup To:</strong> <a href="/lw/jfl/putting_in_the_numbers/">Putting in the Numbers</a></p>
<p>Before talking about logical uncertainty, our final topic is the relationship between probabilistic logic and classical logic. A robot running on probabilistic logic stores probabilities of events, e.g. that the grass is wet outside, P(wet), and then if they collect new evidence they update that probability to P(wet|evidence). Classical logic robots, on the other hand, deduce the truth of statements from axioms and observations. Maybe our robot starts out not being able to deduce whether the grass is wet, but then they observe that it is raining, and so they use an axiom about rain causing wetness to deduce that "the grass is wet" is true.</p>
<p>Classical logic relies on complete certainty in its axioms and observations, and makes completely certain deductions. This is unrealistic when applied to rain, but we're going to apply this to (<a href="/lw/g1y/godels_completeness_and_incompleteness_theorems/">first order</a>, for starters) math later, which is a better fit for classical logic.</p>
<p>The general pattern of the deduction "It's raining, and when it rains the grass is wet, therefore the grass is wet" was modus ponens: if 'U implies R' is true, and U is true, then R must be true. There is also modus tollens: if 'U implies R' is true, and R is false, then U has to be false too. Third, there is the law of non-contradiction: "It's simultaneously raining and not-raining outside" is always false.</p>
<p>We can imagine a robot that does classical logic as if it were writing in a notebook. Axioms are entered in the notebook at the start. Then our robot starts writing down statements that can be deduced by modus ponens or modus tollens. Eventually, the notebook is filled with statements deducible from the axioms. Modus tollens and modus ponens can be thought of as consistency conditions that apply to the contents of the notebook.</p>
<p><a id="more"></a>Doing math is one important application of our classical-logic robot. The robot can read from its notebook "If variable A is a number, A=A+0" and "SS0 is a number," and then write down "SS0=SS0+0."</p>
<p>Note that this requires the robot to interpret variable A differently than symbol SS0. This is one of many upgrades we can make to the basic robot so that it can interpret math more easily. We also want to program in special responses to symbols like 'and', so that if A and B are in the notebook our robot will write 'A and B', and if 'A and B' is in the notebook it will add in A and B. In this light, modus ponens is just the robot having a programmed response to the 'implies' symbol.</p>
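<p>The notebook picture can be sketched in a few lines. This is only a toy under assumed conventions: atomic statements are strings, compound statements are tuples such as <code>("implies", U, R)</code> and <code>("and", A, B)</code>, and <code>fill_notebook</code> is a hypothetical name. It implements the programmed responses to 'implies' (modus ponens) and to an existing 'A and B' entry; it deliberately leaves out writing down new conjunctions, since that rule alone would keep generating entries forever.</p>

```python
def fill_notebook(axioms, max_passes=100):
    """Repeatedly apply the robot's programmed responses to the notebook."""
    notebook = set(axioms)
    for _ in range(max_passes):
        additions = set()
        for entry in notebook:
            if isinstance(entry, tuple) and entry[0] == "implies":
                # Modus ponens: 'U implies R' and U are both written down.
                _, antecedent, consequent = entry
                if antecedent in notebook:
                    additions.add(consequent)
            elif isinstance(entry, tuple) and entry[0] == "and":
                # 'A and B' is written down, so add A and add B.
                _, left, right = entry
                additions.update((left, right))
        additions -= notebook
        if not additions:
            break  # nothing new can be deduced with these two rules
        notebook |= additions
    return notebook


axioms = {
    "it is raining",
    ("implies", "it is raining", "the grass is wet"),
}
notebook = fill_notebook(axioms)
print("the grass is wet" in notebook)  # True
```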
<p>Certainty about our axioms is what lets us use classical logic, but you can represent complete certainty in probabilistic logic too, by the probabilities 1 and 0. These two methods of reasoning shouldn't contradict each other - if a classical logic robot can deduce that it's raining out, a probabilistic logic robot with the same information should assign P(rain)=1.</p>
<p>If it's raining out, then my grass is wet. In the language of probabilities, this is P(wet|rain)=1. If I look outside and see rain, P(rain)=1, and then the product rule says that P(wet and rain) = P(rain)·P(wet|rain), and that's equal to 1, so my grass must be wet too. Hey, that's modus ponens!</p>
<p>The rules of probability can also behave like modus tollens (if P(B)=0, and P(B|A)=1, P(A)=0) and the law of non-contradiction (P(A|not-A)=0). Thus, when we're completely certain, probabilistic logic and classical logic give the same answers.</p>
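<p>Spelled out, the two derivations might look like the following. The only ingredient beyond the product rule is the standard fact that a conjunction is never more probable than either of its parts, which follows from the sum rule; that step is filled in here and is not stated in the text above.</p>

```latex
% Modus ponens: P(rain) = 1 and P(wet | rain) = 1
P(\text{wet} \wedge \text{rain})
  = P(\text{rain})\, P(\text{wet} \mid \text{rain})
  = 1 \cdot 1 = 1,
\qquad
P(\text{wet}) \;\ge\; P(\text{wet} \wedge \text{rain}) = 1 .

% Modus tollens: P(B) = 0 and P(B | A) = 1
P(A) = P(A)\, P(B \mid A) = P(A \wedge B) \;\le\; P(B) = 0 .
```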
<p>There's a very short way to prove this, which is that one of <a href="/lw/jfx/foundations_of_probability/">Cox's desiderata</a> for how probabilities must behave was "when you're completely certain, your plausibilities should satisfy the rules of classical logic."</p>
<p>In <a href="/lw/jfx/foundations_of_probability/">Foundations of Probability</a>, I alluded to the idea that we should be able to apply probabilities to math. Dutch book arguments work because our robot must act as if it had probabilities in order to avoid losing money. Savage's theorem applies because the results of our robot's actions might depend on mathematical results. Cox's theorem applies because beliefs about math behave like other beliefs.</p>
<p>This is completely correct. Math follows the rules of probability, and thus can be described with probabilities, because classical logic is the same as probabilistic logic when you're certain.</p>
<p>We can even use this correspondence to figure out what numbers the probabilities take on:</p>
<p>1 for every statement that follows from the axioms, 0 for their negations.</p>
<p> </p>
<p>This raises an issue: what about betting on the last digit of the 3^^^3'th prime? We dragged probability into this mess because it was supposed to help our robot stop trying to prove the answer and just bet as if P(last digit is 1)=1/4. But it turns out that there is one true probability distribution over mathematical statements, given the axioms. The right distribution is obtained by straightforward application of the product rule - never mind that it takes 4^^^3 steps - and if you deviate from the right distribution that means you violate the product rule at some point.</p>
<p>This is why logical uncertainty is different. Even though our robot doesn't have enough resources to find the right answer, using logical uncertainty violates <a href="/lw/jfx/foundations_of_probability/">Savage's theorem and Cox's theorem</a>. If we want our robot to act as if it has some "logical probability," it's going to need a stranger sort of foundation.</p>
<p> </p>
<p style="text-align:right">Part of the sequence <em>Logical Uncertainty</em></p>
<p style="text-align:right">Previous Post: <a href="/lw/jfl/putting_in_the_numbers/">Putting in the Numbers</a></p>
<p style="text-align:right">Next post: <a href="/lw/jjl/approaching_logical_probability/">Approaching Logical Probability</a></p></div>
Putting in the Numbers
http://lesswrong.com/lw/jfl/putting_in_the_numbers/
http://lesswrong.com/lw/jfl/putting_in_the_numbers/Thu, 30 Jan 2014 17:41:42 +1100
Submitted by <a href="http://lesswrong.com/user/Manfred">Manfred</a>
•
8 votes
•
<a href="http://lesswrong.com/lw/jfl/putting_in_the_numbers/#comments">32 comments</a>
<div><p><strong>Followup To:</strong> <a href="/lw/jfx/foundations_of_probability/">Foundations of Probability</a></p>
<p>In the previous post, we reviewed reasons why having probabilities is a good idea. These foundations defined probabilities as numbers following certain rules, like the product rule and the rule that the probabilities of mutually exclusive events sum to at most 1. These probabilities have to hang together as a coherent whole. But the fact that probabilities hang together a certain way doesn't actually tell us what numbers to assign.</p>
<p>I can say a coin flip has P(heads)=0.5, or I can say it has P(heads)=0.999; both are perfectly valid probabilities, as long as P(tails) is consistent. This post will be about how to actually get to the numbers.</p>
<p><a id="more"></a></p>
<p>If the probabilities aren't fully determined by our desiderata, what do we need to determine the probabilities? More desiderata!</p>
<p>Our final desideratum is motivated by the perspective that our probability is based on some state of information. This is acknowledged explicitly in Cox's scheme, but is also just a physical necessity for any robot we build. Thus we add our new desideratum: Assign probabilities that are consistent with the information you have, but don't make up any extra information. It turns out this is enough to let us put numbers to the probabilities.</p>
<p>In its simplest form, this desideratum is a symmetry principle. If you have the exact same information about two events, you should assign them the same probability - giving them different probabilities would be making up extra information. So if your background information is "Flip a coin, the mutually exclusive and exhaustive possibilities are heads and tails," there is a symmetry between the labels "heads" and "tails," which given our new desideratum lets us assign each P=0.5.</p>
<p>Sometimes, though, we need to pull out the information theory. Using the fact that it doesn't produce information to split the probabilities up differently, we can specify something called "information entropy" (For more thoroughness, see chapter 11 of <a href="http://www-biba.inrialpes.fr/Jaynes/prob.html">Jaynes</a>). The entropy of a probability distribution is a function that measures how uncertain you are. If I flip a coin and don't know about the outcome, I have one bit of entropy. If I flip two coins, I have two bits of entropy. In this way, the entropy is like the amount of information you're "missing" about the coin flips.<img src="http://images.lesswrong.com/t3_jfl_0.png?v=db23affc813431eb24d7e9748fc7fa8f" style="margin: 10px;" align="right" height="260" width="318" alt="Entropy of weighted coin"></p>
<p>The mathematical expression for information entropy is the negative of the sum of each probability multiplied by its log: Entropy = -Sum( P(x)·Log(P(x)) ), where the events x are mutually exclusive. Assigning probabilities is all about maximizing the entropy while obeying the constraints of our prior information.</p>
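<p>To make the formula concrete, here is a small sketch that evaluates it for the coin examples. The base of the logarithm is not specified above; base 2 is assumed here because the text measures entropy in bits, and <code>entropy_bits</code> is a name chosen for this example.</p>

```python
import math


def entropy_bits(probabilities):
    """Entropy = -Sum(P(x) * log2(P(x))), treating 0 * log(0) as 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_coin = [0.5, 0.5]
two_fair_coins = [0.25, 0.25, 0.25, 0.25]  # HH, HT, TH, TT
lopsided_coin = [0.999, 0.001]

print(entropy_bits(fair_coin))       # 1.0
print(entropy_bits(two_fair_coins))  # 2.0
print(entropy_bits(lopsided_coin))   # far less than one bit
```

<p>The lopsided coin has very little entropy because you are missing very little information about how it will land.</p>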
<p>Suppose we roll a 4-sided die. Our starting information consists of our knowledge that there are sides numbered 1 to 4 (events 1, 2, 3, and 4 are exhaustive), and the die will land on just one of these sides (they're mutually exclusive). This lets us write our information entropy as -P(1)·Log(P(1)) - P(2)·Log(P(2)) - P(3)·Log(P(3)) - P(4)·Log(P(4)).</p>
<p>Finding the probabilities is a maximization problem, subject to the constraints of our prior information. For the simple 4-sided die, our information just says that the probabilities have to add to 1. Simply knowing the fact that the entropy is concave down tells us that to maximize entropy we should split it up as evenly as possible - each side has a 1/4 chance of showing.</p>
<p>That was pretty commonsensical. To showcase the power of maximizing information entropy, we can add an extra constraint.</p>
<p>If we have additional knowledge that the average roll of our die is 3, then we want to maximize -P(1)·Log(P(1)) - P(2)·Log(P(2)) - P(3)·Log(P(3)) - P(4)·Log(P(4)), given that the sum is 1 and the average is 3. We can either plug in the constraints and set partial derivatives to zero, or we can use a maximization technique like Lagrange multipliers.</p>
<p>When we do this (again, more details in Jaynes ch. 11), it turns out that the probability distribution is shaped like an exponential curve. Which was unintuitive to me - my intuition likes straight lines. But it makes sense if you think about the partial derivative of the information entropy: 1+Log(P) = [some Lagrange multiplier constraints]. The steepness of the exponential controls how shifted the average roll is.</p>
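<p>One way to check the exponential shape numerically is sketched below. It assumes the maximum-entropy solution has the form P(i) proportional to exp(λ·i), as the Lagrange multiplier argument suggests, and then finds the steepness λ by bisection so that the average roll hits the target. The function name, the search interval and the tolerance are all choices made for this example.</p>

```python
import math


def maxent_die(target_mean, sides=(1, 2, 3, 4), tol=1e-12):
    """Maximum-entropy distribution over `sides` with a given average roll.

    Assumes the solution has the form P(i) proportional to exp(lam * i)
    and searches for the steepness lam by bisection.
    """

    def distribution(lam):
        weights = [math.exp(lam * s) for s in sides]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(s * p for s, p in zip(sides, distribution(lam)))

    low, high = -20.0, 20.0  # the mean increases with lam on this interval
    while high - low > tol:
        middle = (low + high) / 2
        if mean(middle) < target_mean:
            low = middle
        else:
            high = middle
    return distribution((low + high) / 2)


probabilities = maxent_die(3.0)
print(probabilities)  # increasing from side 1 to side 4
print(sum(i * p for i, p in zip((1, 2, 3, 4), probabilities)))  # close to 3
```

<p>Asking for an average of 2.5 instead gives back the flat 1/4 distribution, since that constraint adds no information beyond the symmetric case.</p>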
<p> </p>
<p>The need for this extra desideratum has not always been obvious. People are able to intuitively figure out that a fair coin lands heads with probability 0.5. Seeing that their intuition is so useful, some people include that intuition as a fundamental part of their method of probability. The counter to this is to focus on constructing a robot, which only has those intuitions we can specify unambiguously.</p>
<p>Another alternative to assigning probabilities based on maximum entropy is to pick a standard prior and use that. Sometimes this works wonderfully - it would be silly to rederive the binomial distribution every time you run into a coin-flipping problem. But sometimes people will use a well-known prior even if it doesn't match the information they have, just because their procedure is "use a well-known prior." The only way to be safe from that mistake and from interminable disputes over "which prior is right" is to remember that a prior is only correct insofar as it captures some state of information.</p>
<p>Next post, we will finally get to the problem of logical uncertainty, which will shake our foundations a bit. But I really like the principle of not making up information - even a robot that can't do hard math problems can aspire to not make up information.</p>
<p> </p>
<p style="text-align:right">Part of the sequence <em>Logical Uncertainty</em></p>
<p style="text-align:right">Previous Post: <a href="/lw/jfx/foundations_of_probability/">Foundations of Probability</a></p>
<p style="text-align:right">Next post: <a href="/lw/jjk/logic_as_probability/">Logic as Probability</a></p></div>
Foundations of Probability
http://lesswrong.com/lw/jfx/foundations_of_probability/
http://lesswrong.com/lw/jfx/foundations_of_probability/Mon, 27 Jan 2014 06:29:42 +1100
Submitted by <a href="http://lesswrong.com/user/Manfred">Manfred</a>
•
11 votes
•
<a href="http://lesswrong.com/lw/jfx/foundations_of_probability/#comments">19 comments</a>
<div><h3><strong style="font-size: small;">Beginning of:</strong><span style="font-size: small;"> </span><span style="font-size: small; font-weight: normal;">Logical Uncertainty sequence</span></h3>
<p>Suppose that we are designing a robot. In order for this robot to reason about the outside world, it will need to use probabilities.</p>
<p>Our robot can then use its knowledge to acquire cookies, which we have programmed it to value. For example, we might wager a cookie with the robot on the motion of a certain stock price.</p>
<p>In the coming sequence, I'd like to add a new capability to our robot. It has to do with how the robot handles very hard math problems. If we ask "what's the last digit of the <a href="http://en.wikipedia.org/wiki/Knuth's_up-arrow_notation">3^^^3</a>'th prime number?", our robot should at some point <em>give up</em>, before the sun explodes and the point becomes moot.</p>
<p>If there are math problems our robot can't solve, what should it do if we offer it a bet about the last digit of the 3^^^3'th prime? It's going to have to approximate - robots need to make lots of approximations, even for simple tasks like finding the strategy that maximizes cookies.</p>
<p>Intuitively, it seems like if we can't find the real answer, the last digit is equally likely to be 1, 3, 7 or 9; our robot should take bets as if it assigned those digits equal probability. But to assign some probability to the wrong answer is logically equivalent to assigning probability to 0=1. When we learn more, it will become clear that this is a problem - we aren't ready to upgrade our robot yet.</p>
<p>Let's begin with a review of the foundations of probability.</p>
<div>
<p><a id="more"></a></p>
</div>
<p>What I call foundations of probability are arguments for why our robot should ever want to use probabilities. I will cover four of them, ranging from the worldly ("make bets in the following way or you lose money") to the ethereal ("here's a really elegant set of axioms"). To use the word "probability" to describe the subject of such disparate arguments can seem odd, but keep in mind the naive definition of probability as that number that's 1/6 for a fair die rolling 6 and 30% for clear weather tomorrow.</p>
<p><strong>Dutch Books</strong></p>
<p>The concretest of concrete foundations are the Dutch book arguments. A Dutch book is a collection of bets that is certain to lose you money. If you violate the rules of probability, you'll agree to these certain-loss bets (or not take a certain-win bet).</p>
<p>For example, if you think that each side of the coin has a 55% chance of showing up, then you'll pay $1 for a bet that pays out $0.98 if the coin lands heads and $0.98 if the coin lands tails. If taking bets where you're guaranteed to lose is bad, then you're not allowed to have probabilities for mutually exclusive things that sum to more than 1.</p>
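<p>The arithmetic behind this example can be laid out as a short sketch. The function names are made up for illustration; the numbers are the ones from the example above.</p>

```python
def believed_value(beliefs, payouts, price):
    """Expected profit as the agent computes it from its own beliefs."""
    return sum(beliefs[o] * payouts[o] for o in beliefs) - price


def actual_profit(payouts, price):
    """Real profit for each outcome; exactly one outcome happens."""
    return {outcome: payout - price for outcome, payout in payouts.items()}


beliefs = {"heads": 0.55, "tails": 0.55}  # sums to 1.10, which is the mistake
payouts = {"heads": 0.98, "tails": 0.98}
price = 1.00

# Positive, so the agent thinks the bet is worth buying.
print(believed_value(beliefs, payouts, price))
# Negative for both outcomes, so the agent loses about two cents either way.
print(actual_profit(payouts, price))
```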
<p>Similar arguments hold for other properties of probability. If your probabilities for exhaustive events add up to less than 1, you'll pass up free money, which is bad. If you disobey the sum rule or the product rule, you'll agree to a guaranteed loss, which is bad, etcetera. Thus, say the Dutch book arguments, our probabilities have to behave the way they do because we don't want to take guaranteed losses or pass up free money.</p>
<p>There are many assumptions underlying this whole scenario. Our agent in these arguments already tries to decide using probability-like numbers; all we show is that the numbers have to follow the same rules as probabilities. Why can't our agent follow a totally different method of decision making, like picking randomly or alphabetization?</p>
<p>One can show that e.g. picking randomly will sometimes throw away money. But there is a deeper principle here: an agent that wants to avoid throwing away money or passing up free money has to act <em>as if</em> it had numbers that followed probability-rules, and that's a good enough reason for our agent to have probabilities.</p>
<p>Still, some people dislike Dutch book arguments because they focus on an extreme scenario where a malicious bookie is trying to exploit our agent. To avoid this, we'll need a more abstract foundation.</p>
<p>You can learn more about Dutch book arguments <a href="http://plato.stanford.edu/entries/epistemology-bayesian/supplement2.html">here</a> and <a href="http://m-phi.blogspot.com/2013/09/the-mathematics-of-dutch-book-arguments.html">here</a>.</p>
<p><strong>Savage's Foundation</strong></p>
<p>Leonard Savage formulated a basis for decision-making that is sort of a grown-up version of Dutch book arguments. From seven desiderata, none of which mention probability, he derived that an agent that wants to act consistently will act as if it had probabilistic beliefs.</p>
<p>What are the desiderata about, if not probability? They define an agent that has preferences, and is able to take actions, which are defined as things that lead to outcomes, and can lead to different outcomes depending on external possibilities in event-space. They require that the agent's actions be consistent in commonsensical ways. These requirements are sufficient to show that assigning probabilities to the external events is the best way to do things.</p>
<p>Savage's theorem provides one set of conditions for when we should use probabilities. But it doesn't help us choose which probabilities to assign - anything consistent works. The idea that probabilities are degrees of belief, and that they are derived from some starting information, is left to our next foundation.</p>
<p>You can learn more about Savage's foundation <a href="http://www.econ2.jhu.edu/people/Karni/savageseu.pdf">here</a>.</p>
<p><strong>Cox's Theorem</strong></p>
<p>Cox's theorem is a break from justifying probabilities with gambling. Rather than starting from an agent that wants to achieve good outcomes, and showing that having probabilities is a good idea, Richard Cox started with desired properties of a "degree of plausibility," and showed that probabilities are what a good belief-number should be.</p>
<p>One special facet of Cox's desiderata is that they refer to plausibility of an event, given your information - what will eventually become P(event | information).</p>
<p>There are six or so desiderata, but I think there are three interesting ones: When you're completely certain, your plausibilities should satisfy the rules of classical logic. Every rational plausibility has at least one event with that plausibility. P(A and B|X) can be found as a function of P(A|X) and P(B|A and X).</p>
<p>These desiderata are a motley assortment. The desideratum that there's an infinite variety of events is the most strange, but it is satisfied if our universe contains a continuous random process or if we can flip a coin as many times as we want. If the desiderata obtain, Cox's theorem shows that we can give pretty much any belief a probability. The perspective of Cox's theorem is useful because it lets us keep talking straightforwardly about probabilities even if betting or decision-making has become nontrivial.</p>
<p>You can learn more about Cox's theorem in the first two chapters of Jaynes <a href="http://www-biba.inrialpes.fr/Jaynes/prob.html">here</a> (in fact, the next few posts are parallel to the first two chapters of Jaynes), and also <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.4276&rep=rep1&type=pdf">here</a>. Jaynes includes an additional desideratum in this foundation, which we will cover in the next post.</p>
<p><strong>Kolmogorov Axioms</strong></p>
<p>At the far extreme of abstraction, we have the Kolmogorov axioms for probability. Here they are:</p>
<p>P(E) is a non-negative real number, E is an event that belongs to event-space F.</p>
<p>P(some event occurs)=1.</p>
<p>Any countable sequence of disjoint events (E1, E2...) satisfies P(E1 or E2 or...) = sum of all the P(E).</p>
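<p>For a finite event-space these axioms can be checked mechanically. The sketch below is an illustration under assumptions: events are represented as frozensets of outcomes, the assignment P must list every subset of the sample space, and only pairwise (finite) additivity is tested, since countable additivity cannot be checked by enumeration. The name <code>satisfies_kolmogorov</code> is hypothetical.</p>

```python
from fractions import Fraction


def satisfies_kolmogorov(P):
    """Check the three axioms for P, a dict from events to numbers.

    Events are frozensets of outcomes, and P must contain every subset
    of the sample space.
    """
    sample_space = frozenset().union(*P)
    non_negative = all(value >= 0 for value in P.values())
    normalized = P[sample_space] == 1
    additive = all(
        P[a | b] == P[a] + P[b]
        for a in P
        for b in P
        if not (a & b)  # only disjoint pairs are constrained
    )
    return non_negative and normalized and additive


H, T = frozenset({"heads"}), frozenset({"tails"})

fair_coin = {
    frozenset(): Fraction(0),
    H: Fraction(1, 2),
    T: Fraction(1, 2),
    H | T: Fraction(1),
}
# The 55%-each beliefs from the Dutch book example.
overconfident_coin = {**fair_coin, H: Fraction(11, 20), T: Fraction(11, 20)}

print(satisfies_kolmogorov(fair_coin))           # True
print(satisfies_kolmogorov(overconfident_coin))  # False
```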
<p>Though it was not their intended purpose, these can be seen as a Cox-style list of desiderata for degrees of plausibility. Their main virtue is that they're simple and handy to mathematicians who like set theory.</p>
<p>You can learn more about Kolmogorov's axioms <a href="http://en.wikipedia.org/wiki/Probability_axioms">here</a>.</p>
<p> </p>
<p>Look back at our robot trying to bet on the 3^^^3'th prime number. Our robot has preferences, so it can be Dutch booked. Its reward depends on the math problem and we want it to act consistently, so Savage's theorem applies. Cox's theorem applies if we allow our robot to make combined bets on math and dice. It even seems like the Kolmogorov axioms should hold. Resting upon these foundations, our robot should assign numbers to mathematical statements, and they should behave like probabilities.</p>
<p>But we can't get specific about that, because we have a problem - we don't know how to actually find the numbers yet. Our foundations tell us that the probabilities of the two sides of a coin will add to 1, but they don't care whether P(heads) is 0.5 or 0.99999. If Dutch book arguments can't tell us that a coin lands heads half the time, what can? Tune in next time to find out.</p>
<p> </p>
<p style="text-align:right">First post in the sequence <em>Logical Uncertainty</em></p>
<p style="text-align:right">Next post: <a href="/lw/jfl/putting_in_the_numbers/">Putting in the Numbers</a></p></div>
Probability and radical uncertainty
http://lesswrong.com/lw/igw/probability_and_radical_uncertainty/
http://lesswrong.com/lw/igw/probability_and_radical_uncertainty/Sun, 24 Nov 2013 09:34:22 +1100
Submitted by <a href="http://lesswrong.com/user/David_Chapman">David_Chapman</a>
•
11 votes
•
<a href="http://lesswrong.com/lw/igw/probability_and_radical_uncertainty/#comments">70 comments</a>
<div><p>In the <a href="/lw/igv/probability_knowledge_and_metaprobability/">previous article</a> in this sequence, I conducted a thought experiment in which simple probability was not sufficient to choose how to act. Rationality required reasoning about <em>meta-probabilities</em>, the probabilities of probabilities.</p>
<p>Relatedly, lukeprog has <a href="/lw/h78/estimate_stability/">a brief post</a> that explains how this matters; <a href="/lw/745/why_we_cant_take_expected_value_estimates/">a long article</a> by HoldenKarnofsky makes meta-probability  central to utilitarian estimates of the effectiveness of charitable giving; and Jonathan_Lee, in <a href="/lw/hnf/model_stability_in_intervention_assessment/">a reply</a> to that, has used the same framework I presented.</p>
<p>In my previous article, I ran thought experiments that presented you with various colored boxes you could put coins in, gambling with uncertain odds.</p>
<p>The last box I showed you was blue. I explained that it had a fixed but unknown probability of a twofold payout, uniformly distributed between 0 and 0.9. The overall probability of a payout was 0.45, so the expectation value for gambling was 0.9—a bad bet. Yet your optimal strategy was to gamble a bit to figure out whether the odds were good or bad.</p>
<p>Let’s continue the experiment. I hand you a black box, shaped rather differently from the others. Its sealed faceplate is carved with runic inscriptions and eldritch figures. “I find this one <em>particularly</em> interesting,” I say.</p>
<p><a id="more"></a></p>
<p>What is the payout probability? What is your optimal strategy?</p>
<p>In the framework of the previous article, you have no knowledge about the insides of the box. So, as with the “sportsball” case I analyzed there, your meta-probability curve is flat from 0 to 1.</p>
<p>The blue box also has a flat meta-probability curve; but these two cases are very different. For the blue box, you know that the curve <em>really is</em> flat. For the black box, you have no clue what the shape of even the meta-probability curve is.</p>
<p>The relationship between the blue and black boxes is the same as that between the coin flip and sportsball—except at the meta level!</p>
<p>So if we’re going on in this style, we need to look at the distribution of <em>probabilities of probabilities of probabilities</em>. The blue box has a sharp peak in its meta-meta-probability (around flatness), whereas the black box has a flat meta-meta-probability.</p>
<p>You ought now to be a little uneasy. We are <a href="http://en.wikipedia.org/wiki/Epicycle#Epicycles">putting epicycles on epicycles</a>. An infinite regress threatens.</p>
<p>Maybe at this point you suddenly reconsider the blue box… I <em>told</em> you that its meta-probability was uniform. But perhaps I was lying! How reliable do you think I am?</p>
<p>Let’s say you think there’s a 0.8 probability that I told the truth. That’s the meta-meta-probability of a flat meta-probability. In the <em>worst</em> case, the actual payout probability is 0, so the average <em>just plain probability</em> is 0.8 x 0.45 = 0.36. You can feed that worst case into your decision analysis. It won’t drastically change the optimal policy; you’ll just quit a bit earlier than if you were entirely confident that the meta-probability distribution was uniform.</p>
<p>To get this really right, you ought to make a best guess at the meta-meta-probability <em>curve</em>. It’s not just 0.8 of a uniform probability distribution, and 0.2 of zero payout. That’s the <em>worst</em> case. Even if I’m lying, I might give you better than zero odds. How much better? What’s your confidence in your meta-meta-probability curve? Ought you to draw a meta-meta-meta-probability curve? Yikes!</p>
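<p>The arithmetic above is a two-component mixture, and it is easy to see how sensitive the answer is to what you assume about the lying case. The sketch below is only an illustration: the function name and the 0.2 figure for a "milder" lying case are assumptions, not part of the thought experiment.</p>

```python
def average_payout_probability(p_truthful, mean_if_truthful, mean_if_lying):
    """Plain probability of a payout, averaged over 'truthful' and 'lying'."""
    return p_truthful * mean_if_truthful + (1.0 - p_truthful) * mean_if_lying

# Worst case from the text: if I lied, the box never pays.
worst_case = average_payout_probability(0.8, 0.45, 0.0)

# Hypothetical milder case: if I lied, the box still pays 20% of the time.
milder_case = average_payout_probability(0.8, 0.45, 0.2)

print(round(worst_case, 4), round(milder_case, 4))
```

<p>The single number that comes out is still only an average; as with the green box, it hides the shape of the distribution that produced it.</p>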
<p>Meanwhile… that black box is <em>rather sinister</em>. Seeing it makes you wonder. What if I rigged the blue box so there is a small probability that when you put a coin in, it jabs you with a poison dart, and you die horribly?</p>
<p>Apparently a zero payout is <em>not</em> the worst case, after all! On the other hand, this seems paranoid. <a href="http://buddhism-for-vampires.com/dark-culture">I’m odd</a>, but probably not <em>that</em> evil.</p>
<p>Still, what about the black box? You realize now that it could do <em>anything</em>.</p>
<ul>
<li>It might spring open to reveal a collection of fossil trilobites.</li>
<li>It might play Corvus Corax’s <em><a href="http://www.youtube.com/watch?v=jgBvEVy__qs">Vitium in Opere</a></em> at ear-splitting volume.</li>
<li>It might analyze the trace DNA you left on the coin and use it to write you a <em>personalized</em> love poem.</li>
<li>It might emit a strip of paper with a recipe for dundun noodles written in Chinese.</li>
<li>It might sprout six mechanical legs and jump into your lap.</li>
</ul>
<p>What is the probability of its giving you $2?</p>
<p>That no longer seems quite so relevant. In fact… it might be utterly meaningless! This is now a situation of <strong>radical uncertainty</strong>.</p>
<p>What is your optimal strategy?</p>
<p>I’ll answer that later in this sequence. You might like to figure it out for yourself now, though.</p>
<h2>Further reading</h2>
<p>The black box is an instance of <a href="http://en.wikipedia.org/wiki/Knightian_uncertainty">Knightian uncertainty</a>. That’s a catch-all category for any type of uncertainty that can’t usefully be modeled in terms of probability (<em>or</em> meta-probability!), because you can’t make meaningful probability estimates. Calling it “Knightian” doesn’t help solve the problem, because there are many sources of non-probabilistic uncertainty. However, it’s useful to know that there’s a literature on this.</p>
<p>The blue box is closely related to <a href="http://en.wikipedia.org/wiki/Ellsberg_paradox">Ellsberg’s paradox</a>, which combines probability with Knightian uncertainty. Interestingly, it was invented by the same Daniel Ellsberg who released the Pentagon Papers in 1971. I wonder how his work in decision theory might have affected his decision to leak the Papers?</p></div>
<a href="http://lesswrong.com/lw/igw/probability_and_radical_uncertainty/#comments">70 comments</a>
The dangers of zero and one
http://lesswrong.com/lw/j2o/the_dangers_of_zero_and_one/
http://lesswrong.com/lw/j2o/the_dangers_of_zero_and_one/Thu, 21 Nov 2013 23:21:23 +1100
Submitted by <a href="http://lesswrong.com/user/PhilGoetz">PhilGoetz</a>
•
27 votes
•
<a href="http://lesswrong.com/lw/j2o/the_dangers_of_zero_and_one/#comments">68 comments</a>
<div><p style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;" dir="ltr"><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">Eliezer wrote a <a style="text-decoration:none;" href="/lw/mo/infinite_certainty/"><span style="font-size: 15px; font-family: Arial; color: #1155cc; vertical-align: baseline; white-space: pre-wrap; text-decoration: underline; background-color: transparent;">post</span></a> warning against unrealistically confident estimates, in which he argued that you can't be 99.99% sure that 53 is prime. Chris Hallquist replied with a <a style="text-decoration:none;" href="/lw/izs/yes_virginia_you_can_be_9999_or_more_certain_that/"><span style="font-size: 15px; font-family: Arial; color: #1155cc; vertical-align: baseline; white-space: pre-wrap; text-decoration: underline; background-color: transparent;">post</span></a> arguing that you can.</span></p>
<p style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;" dir="ltr"><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;"><br></span></p>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; line-height: 1.15; white-space: pre-wrap;">That particular case is tricky. There have been many independent calculations of the first hundred prime numbers. 53 is a small enough number that I think someone would notice if Wikipedia included it erroneously. But can you be 99.99% confident that 1159 is a prime? You found it in one particular source. Can you trust that source? It's large enough that no one would notice if it were wrong. You could try to verify it, but if I write a Perl or C++ program, I can't even be 99.9% sure that the compiler or interpreter will interpret it correctly, let alone that the program is correct.</span></p>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">Rather than argue over the number of nines to use for a specific case, I want to emphasize the importance of not assigning things probability zero or one. Here's a real case where approximating 99.9999% confidence as 100% had disastrous consequences.<a id="more"></a><br></span></p>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">I developed a new gene-caller for JCVI. Genes are interpreted in units of 3 DNA nucleotides called codons. A bacterial gene starts with a start codon (usually ATG, TTG, or GTG) and ends at the first stop codon (usually TAG, TGA, or TAA). Most such sequences are not genes. A gene-caller is a computer program that takes a DNA sequence and guesses which parts of it are genes.</span></p>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">The first thing I tried was to create a second-order Markov model on codons, and train it on all of the large possible genes in the genome. (Long sequences without stop codons are unlikely to occur by chance and are probably genes.) That means that you set P = 1 and go down the sequence of each large possible gene, codon by codon, multiplying P by the probability of seeing each of the 64 possible codons in the third position given the codons in the first and second positions. Then I created a second Markov model from the entire genome. This took about one day to write, and plugging these two models into Bayes' law as shown below turned out to work better than all the other single-method gene-prediction algorithms developed over the past 30 years.</span></p>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">But what probability should you assign to a codon sequence that you've never seen? A bacterial genome might have 4 million base pairs, about half of which are in long possible genes and will be used for training. That means your training data for one genome has about 2 million codon triplets. Surprisingly, a little less than half of all possible codon triplets do not occur at all in that data (DNA sequences are not random). What probability do you assign to an event that occurs zero times out of 2 million?</span></p>
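<p>As a rough illustration of the kind of model described here, the sketch below estimates P(codon3 | codon1, codon2) from raw counts and multiplies those probabilities along a sequence, starting from P = 1. It is not the JCVI gene-caller: the function names, the toy training string, and the <code>unseen</code> parameter (the probability given to a triplet that never occurred in training) are all assumptions made for the example. The same training function would be run once on the long possible genes and once on the entire genome to get the two models.</p>

```python
from collections import Counter, defaultdict

def split_codons(dna):
    """Split a DNA string into consecutive 3-letter codons, dropping any remainder."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]

def train_second_order_model(sequences):
    """Estimate P(codon3 | codon1, codon2) from counts of codon triplets."""
    counts = defaultdict(Counter)
    for dna in sequences:
        codons = split_codons(dna)
        for c1, c2, c3 in zip(codons, codons[1:], codons[2:]):
            counts[(c1, c2)][c3] += 1
    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        model[context] = {c3: n / total for c3, n in followers.items()}
    return model

def sequence_probability(model, dna, unseen):
    """Start at P = 1 and multiply in one conditional probability per triplet."""
    p = 1.0
    codons = split_codons(dna)
    for c1, c2, c3 in zip(codons, codons[1:], codons[2:]):
        p *= model.get((c1, c2), {}).get(c3, unseen)
    return p

# Toy training data (hypothetical): ATG AAA CCC ATG AAA GGG
model = train_second_order_model(["ATGAAACCCATGAAAGGG"])
print(sequence_probability(model, "ATGAAACCC", unseen=1e-6))
```

<p>Everything interesting is hidden in the choice of <code>unseen</code>, which is the question the paragraph above ends on.</p>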
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">This came up recently in an online argument. Another person said that, if the probability that X is true is below your detection threshold or your digits of accuracy, you should assign P(X) = 0, since any other number is just made up.</span></p>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">Well, I'd already empirically determined whether that was true for the gene caller. First, due to a coding error, I assigned such events P(X) = 1 / (64^3 * size of training set), which is too small by about 64^3. Next I tried P(X) = 0.5 / (size of training set), which is approximately correct. Finally I tried P(X) = 0. I tested the results on genomes where I had strong evidence for which sequences were and were not genes.</span></p>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">How well do you think each P(X) worked?</span></p>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">The two non-zero probabilities gave nearly the same results, despite differing by about five orders of magnitude (a factor of 64^3 / 2). But using P(X) = 0 caused the gene-caller to miss hundreds of genes per genome, which is a disastrous result. Why?</span></p>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">Any particular codon triplet that was never found in the training set would have a prior of less than one in 4 million. But because a large number of triplets are in genes outside the training set, that meant some of those triplets (not most, but about a thousand of them) had true priors of being found somewhere in those genes of nearly one half. (You can work it out in more detail by assuming a Zipf law distribution of priors, but I won't get into that.)</span></p>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">So some of them did occur within genes in that genome, and each time one did, its assigned probability of zero annihilated all the hundreds of other pieces of evidence for the existence of that gene, making the gene impossible to detect.</span></p>
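<p>The annihilation is easy to reproduce with made-up numbers. In the sketch below, the 300 pieces of evidence and the likelihood ratio of 1.2 for each are invented purely for illustration; only the structure (a long product with one tiny or zero factor) comes from the text.</p>

```python
import math

def combined_likelihood_ratio(ratios):
    """Multiply per-triplet likelihood ratios P(triplet | gene) / P(triplet)."""
    return math.prod(ratios)

# Hypothetical: 300 triplets, each mildly favoring "this is a gene".
evidence = [1.2] * 300

# One never-before-seen triplet, scored with a tiny floor versus with zero.
with_floor = combined_likelihood_ratio(evidence + [1e-6])
with_zero = combined_likelihood_ratio(evidence + [0.0])

print(with_floor > 1.0, with_zero)
```

<p>With the floor, the other evidence still wins comfortably. With zero, no amount of other evidence can bring the product back.</p>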
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">You can think of this using logarithms. I computed</span></p>
<blockquote>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">P(gene | sequence) = P(sequence | gene) * P(gene) / P(sequence)</span></p>
</blockquote>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">where P(sequence) and P(sequence | gene) are computed using the two Markov models. Each of them is the product of a sequence of Markov probabilities. Ignoring P(gene), which is constant, we can compute</span></p>
<blockquote>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">log(P(gene|sequence)) ~ log(P(sequence | gene)) - log(P(sequence)) =</span></p>
<p style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;" dir="ltr"><span style="font-size:15px;font-family:Arial;color:#000000;background-color:transparent;font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;">sum (over all codon triplets in the sequence) [ log(P(codon3 | codon1, codon2, gene)) - log(P(codon3 | codon1, codon2)) ]</span></p>
</blockquote>
<p style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;" dir="ltr"><span style="font-size:15px;font-family:Arial;color:#000000;background-color:transparent;font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;"><br></span></p>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">You can think of this as adding the bits of information it would take to specify that triplet outside of a gene, and subtracting the bits of information it would take to specify that triplet inside a gene, leaving bits of evidence that it is in a gene.</span></p>
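<p>The sum above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the original program: the two models are plain dictionaries keyed by (codon1, codon2, codon3), the toy probabilities are invented, and the background model is assumed to contain every triplet in the sequence being scored, which holds when it was trained on the entire genome.</p>

```python
import math

def log2_or_neg_inf(p):
    """log2(p), with log2(0) taken to be negative infinity."""
    return math.log2(p) if p > 0 else float("-inf")

def gene_evidence_bits(codons, gene_model, background_model, unseen):
    """Sum over triplets of log2 P(c3 | c1, c2, gene) - log2 P(c3 | c1, c2)."""
    total = 0.0
    for triplet in zip(codons, codons[1:], codons[2:]):
        p_gene = gene_model.get(triplet, unseen)   # unseen in the gene training set
        p_background = background_model[triplet]   # assumed always present
        total += log2_or_neg_inf(p_gene) - math.log2(p_background)
    return total

# Hypothetical toy models.
gene_model = {("ATG", "AAA", "CCC"): 0.5, ("AAA", "CCC", "GGG"): 0.25}
background_model = {
    ("ATG", "AAA", "CCC"): 0.125,
    ("AAA", "CCC", "GGG"): 0.25,
    ("CCC", "GGG", "TTT"): 0.125,
}

seen_only = ["ATG", "AAA", "CCC", "GGG"]
with_unseen = seen_only + ["TTT"]

print(gene_evidence_bits(seen_only, gene_model, background_model, unseen=0.0))
print(gene_evidence_bits(with_unseen, gene_model, background_model, unseen=0.0))
print(gene_evidence_bits(with_unseen, gene_model, background_model, unseen=2**-20))
```

<p>Working in logs also avoids floating-point underflow on long genes, where the raw product of hundreds of small probabilities would round to zero on its own.</p>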
<p><span style="font-size:15px;font-family:Arial;color:#000000;background-color:transparent;font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;">If we assign P(codon3 | codon1, codon2, gene) = 0, the number of bits of information it would take to specify "codon3 | codon1, codon2" inside a gene is -log(0) = infinity. Assigning P(X) = 0 is claiming to have </span><span style="font-size:15px;font-family:Arial;color:#000000;background-color:transparent;font-weight:normal;font-style:italic;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;">infinite</span><span style="font-size:15px;font-family:Arial;color:#000000;background-color:transparent;font-weight:normal;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;"> bits of information that X is false.</span></p>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">Going back to the argument, the accuracy of the probabilities assigned by the Markov model is quite low, probably one to three digits of accuracy in most cases. Yet it was important to assign positive probabilities to events whose probabilities were at least seven orders of magnitude below that.</span></p>
<p><span style="background-color: transparent; font-family: Arial; font-size: 15px; white-space: pre-wrap; line-height: 1.15;">It didn't matter what probability I assigned to them! Given hundreds of other bit scores to add up, changing the number of bits taken away by one highly improbable event by 10 had little impact. What matters is not making it zero.</span></p></div>
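<p>To put the three choices of P(X) on the same scale, the sketch below converts each to bits. The training-set size of 2 million is the round figure quoted earlier for one genome; the dictionary labels and function name are just for the example.</p>

```python
import math

def bits_to_specify(p):
    """Bits needed to specify an event of probability p; infinite when p is 0."""
    return float("inf") if p == 0 else -math.log2(p)

TRAINING_SET_SIZE = 2_000_000  # approximate number of codon triplets in one genome

candidates = {
    "coding error": 1 / (64**3 * TRAINING_SET_SIZE),
    "half a count": 0.5 / TRAINING_SET_SIZE,
    "zero": 0.0,
}

bits = {name: bits_to_specify(p) for name, p in candidates.items()}
for name, b in bits.items():
    print(f"{name}: {b:.2f} bits")
```

<p>The two non-zero choices differ by 17 bits, a finite penalty that hundreds of other triplets can outweigh. The zero choice is an infinite penalty that nothing can outweigh.</p>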
Probability, knowledge, and meta-probability
http://lesswrong.com/lw/igv/probability_knowledge_and_metaprobability/
http://lesswrong.com/lw/igv/probability_knowledge_and_metaprobability/Tue, 17 Sep 2013 10:02:56 +1000
Submitted by <a href="http://lesswrong.com/user/David_Chapman">David_Chapman</a>
•
38 votes
•
<a href="http://lesswrong.com/lw/igv/probability_knowledge_and_metaprobability/#comments">71 comments</a>
<div><p>This article is the first in a sequence that will consider situations where probability estimates are not, by themselves, adequate to make rational decisions. This one introduces a "meta-probability" approach, borrowed from E. T. Jaynes, and uses it to analyze a gambling problem. This situation is one in which reasonably straightforward decision-theoretic methods suffice. Later articles introduce increasingly problematic cases.</p>
<p><a id="more"></a></p>
<h2>A surprising decision anomaly</h2>
<p>Let’s say I’ve recruited you as a subject in my thought experiment. I show you three cubical plastic boxes, about eight inches on a side. There’s two green ones—identical as far as you can see—and a brown one. I explain that they are gambling machines: each has a faceplate with a slot that accepts a dollar coin, and an output slot that will return either two or zero dollars.</p>
<p>I unscrew the faceplates to show you the mechanisms inside. They are quite simple. When you put a coin in, a wheel spins. It has a hundred holes around the rim. Each can be blocked, or not, with a teeny rubber plug. When the wheel slows to a halt, a sensor checks the nearest hole, and dispenses either zero or two coins.</p>
<p>The brown box has 45 holes open, so it has probability p=0.45 of returning two coins. One green box has 90 holes open (p=0.9) and the other has none (p=0). I let you experiment with the boxes until you are satisfied these probabilities are accurate (or very nearly so).</p>
<p>Then, I screw the faceplates back on, and put all the boxes in a black cloth sack with an elastic closure. I squidge the sack around, to mix up the boxes inside, and you reach in and pull one out at random.</p>
<p>I give you a hundred one-dollar coins. You can put as many into the box as you like. You can keep as many coins as you don’t gamble, plus whatever comes out of the box.</p>
<p>If you pulled out the brown box, there’s a 45% chance of getting $2 back, and the expected value of putting a dollar in is $0.90. Rationally, you should keep the hundred coins I gave you, and not gamble.</p>
<p>If you pulled out a green box, there’s a 50% chance that it’s the one that pays two dollars 90% of the time, and a 50% chance that it’s the one that never pays out. So, overall, there’s a 45% chance of getting $2 back.</p>
<p>Still, rationally, you should put some coins in the box. If it pays out at least once, you should gamble all the coins I gave you, because you know that you got the 90% box, and you’ll nearly double your money.</p>
<p>If you get nothing out after a few tries, you’ve probably got the never-pay box, and you should hold onto the rest of your money. (Exercise for readers: how many no-payouts in a row should you accept before quitting?)</p>
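<p>One ingredient of the exercise is how fast your belief in the 90% box collapses after a run of no-payouts. The sketch below applies Bayes' rule to that question; the function name is an assumption. It deliberately does not answer the exercise, because the best stopping point also depends on how much the coins you have left would be worth if the box turned out to be the good one.</p>

```python
def posterior_good_box(failures, p_good=0.9, prior_good=0.5):
    """P(you hold the 90% box | this many no-payouts in a row, no payouts yet)."""
    like_good = (1.0 - p_good) ** failures  # the 90% box fails 10% of the time
    like_bad = 1.0                          # the never-pay box always fails
    numerator = prior_good * like_good
    return numerator / (numerator + (1.0 - prior_good) * like_bad)

for n in range(4):
    print(n, round(posterior_good_box(n), 4))
```

<p>Each additional failure cuts the odds of holding the good box by a factor of ten, which is why "a few tries" is enough.</p>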
<p>What’s interesting is that, when you have to decide whether or not to gamble your first coin, the probability is exactly the same in the two cases (p=0.45 of a $2 payout). However, the rational course of action is different. What’s up with that?</p>
<p>Here, a single probability value fails to capture everything you <strong>know</strong> about an uncertain event. And, it’s a case in which that failure matters.</p>
<p>Such limitations have been recognized <a href="http://en.wikipedia.org/wiki/Common_cause_and_special_cause_%28statistics%29#Origins_and_concepts">almost since the beginning</a> of probability theory. Dozens of solutions have been proposed. In the rest of this article, I’ll explore one. In subsequent articles, I’ll look at the problem more generally.</p>
<h2>Meta-probability</h2>
<p>To think about the green box, we have to reason about <em>the probabilities of probabilities</em>. We could call this <strong>meta-probability</strong>, although that’s not a standard term. Let’s develop a method for it.</p>
<p>Pull a penny out of your pocket. If you flip it, what’s the probability it will come up heads? 0.5. Are you sure? Pretty darn sure.</p>
<p>What’s the probability that my local junior high school sportsball team will win its next game? I haven’t a ghost of a clue. I don’t know anything even about professional sportsball, and certainly nothing about “my” team. In a match between two teams, I’d have to say the probability is 0.5.</p>
<p>My girlfriend asked me today: “Do you think Raley’s will have dolmades?” Raley’s is our local supermarket. “I don’t know,” I said. “I guess it’s about 50/50.” But unlike sportsball, I know something about supermarkets. A fancy Whole Foods is very likely to have dolmades; a 7-11 almost certainly won’t; Raley’s is somewhere in between.</p>
<p>How can we model these three cases? One way is by assigning probabilities to each possible probability between 0 and 1. In the case of a coin flip, 0.5 is much more probable than any other probability:</p>
<p><img src="http://meaningness.com/images/lw/fig1.jpg" alt="Tight Gaussian centered around 0.5" height="225" width="534"></p>
<p>We can’t be <em>absolutely sure</em> the probability is 0.5. In fact, it’s almost certainly not <em>exactly</em> that, because coins aren’t perfectly symmetrical. And, there’s a very small probability that you’ve been given a tricky penny that comes up tails only 10% of the time. So I’ve illustrated this with a tight Gaussian centered around 0.5.</p>
<p>In the sportsball case, I have no clue what the odds are. They might be anything between 0 and 1:</p>
<p><img src="http://meaningness.com/images/lw/fig2.jpg" alt="Flat line from 0 to 1" height="225" width="534"></p>
<p>In the Raley’s case, I have <em>some</em> knowledge, and extremely high and extremely low probabilities seem unlikely. So the curve looks something like this:</p>
<p><img src="http://meaningness.com/images/lw/fig3.jpg" alt="Wide Gaussian centered on 0.5" height="225" width="533"></p>
<p>Each of these curves averages to a probability of 0.5, but they express different degrees of confidence in that probability.</p>
<p>Now let’s consider the gambling machines in my thought experiment. The brown box has a curve like this:</p>
<p><img src="http://meaningness.com/images/lw/fig4.jpg" alt="Tight Gaussian around 0.45" height="225" width="534"></p>
<p>Whereas, when you’ve chosen one of the two green boxes at random, the curve looks like this:</p>
<p><img src="http://meaningness.com/images/lw/fig5.jpg" alt="Bimodal distribution with sharp peaks at 0 and 0.9" height="225" width="534"></p>
<p>Both these curves give an average probability of 0.45. However, a rational decision theory has to distinguish between them. Your optimal strategy in the two cases is quite different.</p>
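<p>A small numerical sketch makes the point that the average alone cannot tell these two curves apart. Here each curve is approximated as a discrete distribution over payout probabilities. Treating the brown box as a single spike at 0.45, rather than a tight Gaussian, is a simplification made for the example.</p>

```python
def mean(dist):
    """Average payout probability under a {probability: weight} distribution."""
    return sum(p * w for p, w in dist.items())

def variance(dist):
    """Spread of the payout probability around its average."""
    m = mean(dist)
    return sum(w * (p - m) ** 2 for p, w in dist.items())

brown = {0.45: 1.0}            # you know the odds
green = {0.0: 0.5, 0.9: 0.5}   # one of two very different boxes

print(mean(brown), variance(brown))
print(mean(green), variance(green))
```

<p>Both means come out at 0.45, but only the green box has any spread, and that spread is exactly what makes a few exploratory coins worth spending.</p>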
<p>With this framework, we can consider another box—a blue one. It has a fixed payout probability somewhere between 0 and 0.9. I put a random number of plugs in the holes in the spinning disk—leaving between 0 and 90 holes open. I used a noise diode to choose; but you don’t get to see what the odds are. Here the probability-of-probability curve looks rather like this:</p>
<p><img src="http://meaningness.com/images/lw/fig6.jpg" alt="Flat line from 0 to 0.9, then zero above"></p>
<p>This isn’t quite right, because 0.23 and 0.24 are much more likely than 0.235—the plot should look like a comb—but for strategy choice the difference doesn’t matter.</p>
<p>What <em>is</em> your optimal strategy in this case?</p>
<p>As with the green box, you ought to spend some coins gathering information about what the odds are. If your estimate of the probability is less than 0.5, when you get confident enough in that estimate, you should stop. If you’re confident enough that it’s more than 0.5, you should continue gambling.</p>
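<p>The "gathering information" step can be sketched as a Bayesian update over the 91 possible settings of the wheel (0 to 90 holes open). This is only the belief-updating half of the problem, under the assumption that each setting was equally likely; it does not say when to stop, and the function names are invented for the example.</p>

```python
def blue_box_posterior(successes, failures, max_open=90, holes=100):
    """Posterior over payout probabilities k/100 for k = 0..90, from a flat prior."""
    grid = [k / holes for k in range(max_open + 1)]
    weights = [p ** successes * (1.0 - p) ** failures for p in grid]
    total = sum(weights)
    return {p: w / total for p, w in zip(grid, weights)}

def posterior_mean(posterior):
    """Current best estimate of the payout probability."""
    return sum(p * w for p, w in posterior.items())

def prob_favorable(posterior):
    """Probability that the box pays out more than half the time."""
    return sum(w for p, w in posterior.items() if p > 0.5)

before = blue_box_posterior(0, 0)
after = blue_box_posterior(8, 2)   # hypothetical: 8 payouts in 10 tries
print(round(posterior_mean(before), 3), round(prob_favorable(before), 3))
print(round(posterior_mean(after), 3), round(prob_favorable(after), 3))
```

<p>A stopping rule would then compare something like <code>prob_favorable</code> against a confidence threshold, with the threshold depending on how many coins remain.</p>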
<p>If you enjoy this sort of thing, you might like to work out what the exact optimal algorithm is.</p>
<p>In the next article in this sequence, we’ll look at some more complicated and interesting cases.</p>
<h2>Further reading</h2>
<p>The “meta-probability” approach I’ve taken here is the <a href="http://www-biba.inrialpes.fr/Jaynes/cc18i.pdf">A<sub>p</sub> distribution</a> of E. T. Jaynes. I find it highly intuitive, but it seems to have had almost no influence or application in practice. We’ll see later that it has some problems, which might explain this.</p>
<p>The green and blue boxes are related to “multi-armed bandit problems.” A “one-armed bandit” is a casino slot machine, which has defined odds of payout. A multi-armed bandit is a hypothetical generalization with several arms, each of which may have different, unknown odds. In general, you ought to pull each arm several times, to gain information. The question is: what is the optimal algorithm for deciding which arms to pull how many times, given the payments you have received so far?</p>
<p>If you read the <a href="http://en.wikipedia.org/wiki/Multi-armed_bandit">Wikipedia article</a> and follow some links, you’ll find the concepts you need to find the optimal green and blue box strategies. But it might be more fun to try on your own first! The green box is simple. The blue box is harder, but the same general approach applies.</p>
<p>Wikipedia also has an <a href="http://en.wikipedia.org/wiki/Credal_set#See_also">accidental list</a> of formal approaches for problems where ordinary probability theory fails. This is far from complete, but a good starting point for a browser tab explosion.</p>
<h2>Acknowledgements</h2>
<p>Thanks to <a href="http://vajrayananow.wordpress.com/author/">Rin’dzin Pamo</a>, <a href="https://twitter.com/St_Rev">St. Rev.</a>, <a href="/user/Matt_Simpson/overview/">Matt_Simpson</a>, <a href="/user/Kaj_Sotala/overview/">Kaj_Sotala</a>, and <a href="/user/Vaniver/overview/">Vaniver</a> for helpful comments on drafts. Of course, they may disagree with my analyses, and aren’t responsible for my mistakes!</p></div>
Anticipating critical transitions
http://lesswrong.com/lw/hoc/anticipating_critical_transitions/
http://lesswrong.com/lw/hoc/anticipating_critical_transitions/Mon, 10 Jun 2013 02:28:51 +1000
Submitted by <a href="http://lesswrong.com/user/PhilGoetz">PhilGoetz</a>
•
17 votes
•
<a href="http://lesswrong.com/lw/hoc/anticipating_critical_transitions/#comments">52 comments</a>
<div><p>(Mathematicians may find this post painfully obvious.)</p>
<p>I read an interesting <a href="http://www.thebigquestions.com/2010/12/21/are-you-smarter-than-google/">puzzle</a> on Stephen Landsburg's blog that generated a lot of disagreement. Stephen offered to bet anyone $15,000 that the average results of a computer simulation, run 1 million times, would be close to his solution's prediction of the expected value.</p>
<p>Landsburg's solution is in fact correct. But the problem involves a probabilistic infinite series, a kind used often on Less Wrong in a context where one is offered some utility every time one flips a coin and it comes up heads, but loses everything if it ever comes up tails. Landsburg didn't justify the claim that a simulation could indicate the true expected outcome of this particular problem. Can we find similar-looking problems for which simulations give the wrong answer? Yes.</p>
<p><a id="more"></a>Here's Perl code to estimate by simulation the expected value of the series of terms 2^k / k from k = 1 to infinity, with a 50% chance of stopping after each term.</p>
<pre><code># Estimate by simulation the expected value of the series 2^k / k,
# with a 50% chance of stopping after each term.
my $runs = 1000000;    # number of simulated trials
my $bigsum = 0;        # running total over all trials
for (my $trial = 0; $trial &lt; $runs; $trial++) {
    my $sum = 0;
    my $top = 2;       # numerator, 2^k
    my $denom = 1;     # denominator, k
    do {
        $sum += $top / $denom;
        $top *= 2;
        $denom += 1;
    } while (rand(1) &lt; .5);    # continue with probability 0.5
    $bigsum += $sum;
}
my $ave = $bigsum / $runs;
print "ave sum=$ave\n";
</code></pre>
<p>(If anyone knows how to enter a code block on this site, let me know. I used the "pre" tag, but the site stripped out my spaces anyway.)</p>
<p>Running it 5 times, we get the answers</p>
<p>ave sum=7.6035709716983</p>
<p>ave sum=8.47543819631431</p>
<p>ave sum=7.2618950097739</p>
<p>ave sum=8.26159741956747</p>
<p>ave sum=7.75774577340324</p>
<p> </p>
<p>So the expected value is somewhere around 8?</p>
<p>No; the expected value is given by the sum of the harmonic series, which diverges, so it's infinite. Later terms in the series are exponentially larger, but exponentially less likely to appear.</p>
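<p>Spelled out, term k is only added if the first k - 1 coin flips all said "continue", which happens with probability (1/2)^(k-1). That factor cancels all but a factor of 2 of the numerator 2^k, leaving twice the harmonic series:</p>

```latex
E[S] \;=\; \sum_{k=1}^{\infty} \Pr(\text{term } k \text{ is reached}) \cdot \frac{2^k}{k}
     \;=\; \sum_{k=1}^{\infty} \left(\frac{1}{2}\right)^{k-1} \frac{2^k}{k}
     \;=\; \sum_{k=1}^{\infty} \frac{2}{k}
     \;=\; \infty
```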
<p>Some of you are saying, "Of course the expected value of a divergent series can't be computed by simulation! Give me back my minute!" But many things we might simulate with computers, like the weather, the economy, or existential risk, are full of power law distributions that might not have a convergent expected value. People have observed before that this can cause problems for simulations (see <em><a href="http://amzn.to/111n0QV">The Black Swan</a></em>). What I find interesting is that the output of the program above doesn't look like something inside it diverges. It looks almost normal. So you could run your simulation many times and believe that you had a grip on its expected outcome, yet be completely mistaken.</p>
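<p>One way to see why the output looks so tame is to compare the simulation with the expected value of the series cut off after a fixed number of terms. The sketch below is a Python port of the same experiment plus that truncated sum; the function names are assumptions. Since 2^20 is about a million, terms much beyond k = 20 show up in only a handful of a million trials, if at all, so the simulation behaves roughly like a series that stops around there.</p>

```python
import random

def simulate_trial(rng):
    """One run of the series: add 2^k / k, continuing with probability 0.5."""
    total, top, denom = 0.0, 2.0, 1
    while True:
        total += top / denom
        top *= 2
        denom += 1
        if rng.random() >= 0.5:
            return total

def average_of_trials(trials, seed=None):
    """Sample mean over many independent runs."""
    rng = random.Random(seed)
    return sum(simulate_trial(rng) for _ in range(trials)) / trials

def truncated_expectation(max_terms):
    """Exact expected value if the series were cut off after max_terms terms."""
    return sum(2.0 / k for k in range(1, max_terms + 1))

print(average_of_trials(100_000, seed=1))
print(truncated_expectation(20))
```

<p>The truncated value grows only logarithmically with the cutoff, so multiplying the number of trials by a thousand would be expected to raise the sample mean by only a few units. That slow creep is what makes the divergence so easy to miss.</p>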
<p>In real-life simulations (that sounds wrong, doesn't it?), there's often some system property that drifts slowly, and some critical value of that system property above which some distribution within the simulation diverges. Moving above that critical value doesn't suddenly change the output of the simulation in a way that gives an obvious warning. But the expected value of keeping that property below that critical value in the real-life system being simulated can be very high (or even infinite), with very little cost.</p>
<p>Is there a way to look at a simulation's outputs, and guess whether a particular property is near some such critical threshold?  Better yet, is there a way to guess whether there exists some property in the system nearing some such threshold, even if you don't know what it is?</p>
<p>The October 19, 2012 issue of Science contains an article on just that question: "Anticipating critical transitions", Marten Scheffer et al., p. 344. It reviews 28 papers on systems and simulations, and lists about a dozen mathematical approaches used to estimate nearness to a critical point. These include:</p>
<ul>
<li>Critical slowing down: When the system is near a critical threshold, it recovers slowly from small perturbations. One measure of this is autocorrelation at lag 1, meaning the correlation between the system's output at times T and T-1. Counterintuitively, a higher autocorrelation at lag one by itself suggests that the system is more predictable than before, but may actually indicate it is less predictable. The more predictable system reverts to its mean; the unpredictable system has no mean.</li>
<li>Flicker: Instead of having a single stable state that the system reverts to after perturbation, an additional stable state appears, and the system flickers back and forth between the two states.</li>
<li>Dominant eigenvalue: I haven't read the paper that explains what this paper means when it cites this, but I do know that you can predict when a helicopter engine is going to malfunction by putting many sensors on it, running PCA on time-series data for those sensors to get a matrix that projects their output into just a few dimensions, then reading their output continuously and predicting failure anytime the PCA-projected output vector moves a lot. That probably is what they mean.</li>
</ul>
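<p>The lag-1 autocorrelation mentioned in the first item is simple to compute. The sketch below uses the standard sample estimator; it is not taken from the Science article, and the function name and toy series are assumptions.</p>

```python
def lag1_autocorrelation(series):
    """Sample correlation between the series at times T and T-1."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant")
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    return num / denom

# A series that snaps back every step versus one that drifts steadily.
snappy = [1, -1, 1, -1, 1, -1]
drifting = [1, 2, 3, 4, 5]

print(lag1_autocorrelation(snappy))
print(lag1_autocorrelation(drifting))
```

<p>In practice one would compute this over a sliding window of the system's output and watch for a sustained rise, which is the "critical slowing down" signature described above.</p>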
<p>So if you're modeling global warming, running your simulation a dozen times and averaging the results may be misleading. [1] Global temperature has sudden [2] dramatic transitions, and an exceptionally large and sudden one (15C in one million years) neatly spans the Earth's greatest extinction event so far on the Permian-Triassic boundary [3]. It's more important to figure out what the critical parameter is and where its critical point is than to try to estimate how many years it will be before Manhattan is underwater. The "expected rise in water level per year" may not be easily answerable by simulation [4].</p>
<p>And if you're thinking about betting Stephen Landsburg $15,000 on the outcome of a simulation, make sure his series converges first. [5]</p>
<p> </p>
<p>[1] Not that I'm particularly worried about global warming.</p>
<p>[2] Geologically sudden.</p>
<p>[3] Sun et al., "Lethally hot temperatures during the early Triassic greenhouse", Science 338 (Oct. 19 2012) p.366, see p. 368. Having just pointed out that an increase of .000015C/yr counts as a "sudden" global warming event, I feel obligated to also point out that the current increase is about .02C/yr.</p>
<p>[4] It will be answerable by simulation, since rise in water level can't be infinite. But you may need a lot more simulations than you think.</p>
<p>[5] Better yet, don't bet against Stephen Landsburg.</p></div>