## How do you notice when you are ignorant of necessary alternative hypotheses?

So I just wound up in a debate with someone over on Reddit about the value of conventional academic philosophy. He linked me to a book review, in which both the review and the book are absolutely godawful. That is, the author (and the reviewer following him) starts with ontological monism (the universe only contains a single kind of Stuff: mass-energy), adds in the experience of consciousness, reasons deftly that emergence is a load of crap... and then arrives at the conclusion of panpsychism.

WAIT HOLD ON, DON'T FLAME YET!

Of course panpsychism is bunk. I would be embarrassed to be caught upholding it, given the evidence I currently have, but what I want to talk about is the logic being followed.

1) The universe is a unified, consistent whole. Good!

2) The universe contains the experience/existence of consciousness. Easily observable.

3) If consciousness exists, something in the universe must cause or give rise to consciousness. Good reasoning!

4) "Emergence" is a non-explanation, so that can't be it. Good!

5) *Therefore*, whatever stuff the unified universe is made of must be giving rise to consciousness in a nonemergent way.

6) *Therefore*, the stuff must be innately "mindy".

What went wrong in steps (5) and (6)? The man was actually reasoning more-or-less correctly! Given the universe he lived in, and the impossibility of emergence, he reallocated his probability mass to the remaining answer. When he had eliminated the impossible, whatever remained, however low its prior, must be true.

The problem was, he eliminated the *im*possible, but left open a vast space of *possible* hypotheses that *he didn't know about* (but which we do): the most common of these is the computational theory of mind and consciousness, which says that we are made of cognitive algorithms. A Solomonoff Inducer can just go on to the next length of bit-strings describing Turing machines, but we can't.

Now, I can spot the flaw in the reasoning *here*. What frightens me is: what if I'm presented with some similar argument, and I *can't* spot the flaw? What if, instead, I just neatly and *stupidly* reallocate my belief to what *seems to me* to be the only available alternative, while failing to go out and look for alternatives I don't already know about? Notably, it seems like expected *evidence* is conserved, but expecting to locate new hypotheses means I should be reducing my certainty about all currently-available hypotheses *now* to have some for dividing between the new possibilities.

If you can notice when you're confused, how do you notice when you're ignorant?

## [LINK] The Mathematics of Gamification - Application of Bayes Rule to Voting

Fresh from Slashdot: A smart application of Bayes' rule to web-voting.

http://engineering.foursquare.com/2014/01/03/the-mathematics-of-gamification/

[The results] are exactly the equations for voting you would expect. But now, they’re derived from math!

The Benefits

- **Efficient, data-driven guarantees about database accuracy.** By choosing the points based on a user’s accuracy, we can intelligently accrue certainty about a proposed update and stop the voting process as soon as the math guarantees the required certainty.
- **Still using points, just smart about calculating them.** By relating a user’s accuracy and the certainty threshold needed to accept a proposed update to an additive point system (2), we can still give a user the points that they like. This also makes it easy to take a system of ad-hoc points and convert it over to a smarter system based on empirical evidence.
- **Scalable and easily extensible.** The parameters are automatically trained and can adapt to changes in the behavior of the userbase. No more long meetings debating how many points to grant to a narrow use case.

So far, we’ve taken a very user-centric view of pk (this is the accuracy of user k). But we can go well beyond that. For example, pk could be “the accuracy of user k’s vote given that they have been to the venue three times before and work nearby.” These clauses can be arbitrarily complicated and estimated from a (logistic) regression of the honeypot performance. The point is that these changes will be based on data and not subjective judgments of how many “points” a user or situation should get.
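For anyone who wants to see the additive point idea spelled out, here is a minimal sketch. It assumes each voter k is independently right with probability pk, so a vote is worth log(pk / (1 - pk)) points, and the running total is the log-odds that the proposed update is correct. The function names, the 50/50 prior, and the 0.99 threshold are my own illustrative assumptions, not Foursquare's actual implementation.

```python
import math

def vote_points(accuracy):
    """Log-odds 'points' carried by one vote from a user with this accuracy."""
    return math.log(accuracy / (1.0 - accuracy))

def posterior_probability(votes, prior=0.5):
    """Posterior probability that a proposed update is correct.

    votes: list of (accuracy, agrees) pairs, where agrees is True for a
    'yes' vote. Assumes votes are independent given the truth, which is
    a simplifying assumption made for this sketch.
    """
    score = math.log(prior / (1.0 - prior))  # start from the prior log-odds
    for accuracy, agrees in votes:
        points = vote_points(accuracy)
        score += points if agrees else -points
    return 1.0 / (1.0 + math.exp(-score))  # convert log-odds back to probability

def accepted(votes, threshold=0.99, prior=0.5):
    """Stop voting and accept once the posterior clears the threshold."""
    return posterior_probability(votes, prior) >= threshold
```

With these assumptions, two "yes" votes from 90%-accurate users are not quite enough to clear a 0.99 threshold, while three are; a "yes" and a "no" from equally accurate users cancel out.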

I wonder whether and how this could be applied to voting here as LW posts are not 'correct' per se.

One rather theoretical possibility would be to assign prior correctness to some posts e.g. the sequences and then use that to determine the 'accuracy' of users based on that.

## Am I Understanding Bayes Right?

Hello, everyone.

I'm relatively new here as a user rather than as a lurker, but even after trying to read every tutorial on Bayes' Theorem I could get my hands on, I'm still not sure I understand it. So I was hoping that I could explain Bayesianism as I understand it, and some more experienced Bayesians could tell me where I'm going wrong (or maybe if I'm not going wrong and it's a confidence issue rather than an actual knowledge issue). If this doesn't interest you at all, then feel free to tap out now, because here we go!

Abstraction

Bayes' Theorem is an application of probability. Probability is an abstraction based on logic, which is in turn based on possible worlds. By this I mean that they are both maps that refer to multiple territories: whereas a map of Cincinnati (or a "map" of what my brother is like, for instance) refers to just one territory, abstractions are good for more than one thing. Trigonometry is a map of not just this triangle here, but of all triangles everywhere, to the extent that they are triangular. Because of this it is useful even for triangular objects that one has never encountered before, but only tells you about them partially (e.g. it won't tell you the lengths of the sides, because that wouldn't be part of the definition of a triangle; also, it only works at scales at which the object in question approximates a triangle, i.e. the "triangle" map is probably useful at macroscopic scales, but breaks down as you get smaller).

Logic and Possible Worlds

Logic is an attempt to construct a map that covers as much territory as possible, ideally all of it. Thus when people say that logic is true at all times, at all places, and with all things, they aren't really telling you about the territory, they're telling you about the purpose of logic (in the same way that the "triangle" map is ideally useful for triangles at all times, at all places).

One form of logic is Propositional Logic. In propositional logic, all the possible worlds are imagined as points. Each point is exactly one possible world: a logically-possible arrangement that gives a value to all the different variables in the universe. Ergo no two possible universes are exactly the same (though they will share elements).

These possible universes are then joined together in sets called "propositions". These "sets" are Venn diagrams (or what George Lakoff refers to as "container schemas"). Thus, for any given set, every possible universe is either inside or outside of it, with no middle ground (see "questions" below). Thus if the set I'm referring to is the proposition "The Snow is White", that set would include all possible universes in which the snow is white. The rules of propositional logic follow from the container schema.

Bayesian Probability

If propositional logic is about what's inside a set or outside of a set, probability is about the size of the sets themselves. Probability is a measurement of how many possible worlds are inside a set, and conditional probability is about the size of the intersections of sets.

Take the example of the dragon in your garage. To start with, there either is or isn't a dragon in your garage. Both sets of possible worlds have elements in them. But if we look in your garage and don't see a dragon, then that eliminates all the possibilities of there being a *visible* dragon in your garage, and thus eliminates those possible universes from the 'there is a dragon in your garage' set. In other words, the probability of that being true goes down. And because not seeing a dragon in your garage would be what you would expect if there in fact isn't a dragon in your garage, that set remains intact. Then if we look at the ratio of the remaining possible worlds, we see that the probability of the no-dragon-in-your-garage set has gone up, not in absolute terms (because the set of all possible worlds is what we started with; there isn't any more!) but relative to the alternate hypothesis (in the same way that if the denominator of a fraction goes down, the size of the fraction goes up).

This is what Bayes' Theorem is about: the use of process of elimination to eliminate *part* of the set of a proposition, thus providing evidence against it without it being a full refutation.
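The elimination picture above can be sketched in a few lines of Python. The three groups of worlds and their weights are made up purely for illustration (they are not estimates of anything); the point is only that deleting the "visible dragon" worlds and renormalizing lowers the dragon set's share and raises the no-dragon set's share, even though no no-dragon world was added.

```python
from fractions import Fraction

# Hypothetical weights over groups of possible worlds (illustrative only).
# Each key is (hypothesis, what looking in the garage would show).
worlds = {
    ("dragon", "visible"): Fraction(1, 100),
    ("dragon", "invisible"): Fraction(1, 100),
    ("no dragon", "nothing to see"): Fraction(98, 100),
}

def probability(worlds, hypothesis):
    """Share of the remaining worlds that lie inside the hypothesis set."""
    total = sum(worlds.values())
    inside = sum(w for (h, _), w in worlds.items() if h == hypothesis)
    return inside / total

def eliminate(worlds, is_ruled_out):
    """Drop every world that the observation rules out."""
    return {key: w for key, w in worlds.items() if not is_ruled_out(key)}

prior_dragon = probability(worlds, "dragon")

# Observation: we looked and saw no dragon, so visible-dragon worlds are gone.
remaining = eliminate(worlds, lambda key: key[1] == "visible")

posterior_dragon = probability(remaining, "dragon")
posterior_no_dragon = probability(remaining, "no dragon")

print(prior_dragon, posterior_dragon, posterior_no_dragon)
```

Note that the dragon hypothesis is only partly eliminated: the invisible-dragon worlds survive the observation, which is why this counts as evidence against the hypothesis rather than a refutation.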

Naturally, this all takes place in one's mind: the world doesn't shift around you just because you've encountered new information. Probability is in this way subjective (it has to do with maps, not territories per se), but it's not arbitrary: as long as you accept that possible worlds/logic metaphor, it necessarily follows.

Questions/trouble points that I'm not sure of:

*I keep seeing probability referred to as an estimation of how certain you are in a belief. And while I guess it could be argued that you should be certain of a belief relative to the number of possible worlds left or whatever, that doesn't necessarily follow. Does the above explanation differ from how other people use probability?

*Also, if probability is defined as an arbitrary estimation of how sure you are, why should those estimations follow the laws of probability? I've heard the Dutch book argument, so I get why there might be practical reasons for obeying them, but unless you accept a pragmatist epistemology, that doesn't provide reasons why your beliefs are more likely to be true if you follow them. (I've also heard of Cox's rules, but I haven't been able to find a copy. And if I understand right, they say that Bayes' theorem follows from Boolean logic, which is similar to what I've said above, yes?)

*Another question: above I used propositional logic, which is okay, but it's not exactly the creme de la creme of logics. I understand that fuzzy logics work better for a lot of things, and I'm familiar with predicate logics as well, but I'm not sure what the interaction of any of them is with probability or the use of it, although I know that technically probability doesn't have to be binary (sets just need to be exhaustive and mutually exclusive for the Kolmogorov axioms to work, right?). I don't know, maybe it's just something that I haven't learned yet, but the answer really is out there?

Those are the only questions that are coming to mind right now (if I think of any more, I can probably ask them in comments). So anyone? Am I doing something wrong? Or do I feel more confused than I really am?

## Crush Your Uncertainty

Bayesian epistemology and decision theory provide a rigorous foundation for dealing with mixed or ambiguous evidence, uncertainty, and risky decisions. You can't always get the epistemic conditions that classical techniques like logic or maximum likelihood require, so this is seriously valuable. However, having internalized this new set of tools, it is easy to fall into the bad habit of failing to avoid situations where it is necessary to use them.

When I first saw the light of an epistemology based on probability theory, I tried to convince my father that the Bayesian answer to problems involving unknown processes (e.g. Laplace's rule of succession) was superior to the classical (e.g. maximum likelihood) answer. He resisted, with the following argument:

- The maximum likelihood estimator plus some measure of significance is easier to compute.
- In the limit of lots of evidence, this agrees with Bayesian methods.
- When you don't have enough evidence for statistical significance, the correct course of action is to collect more evidence, *not* to take action based on your current knowledge.
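To make the second point of that argument concrete, here is a small sketch comparing the maximum likelihood estimate of a success probability (successes / trials) with Laplace's rule of succession ((successes + 1) / (trials + 2), the posterior mean under a uniform prior). The sample sizes are arbitrary illustrative choices.

```python
from fractions import Fraction

def max_likelihood(successes, trials):
    """Classical point estimate: the observed frequency."""
    return Fraction(successes, trials)

def rule_of_succession(successes, trials):
    """Laplace's rule: posterior mean of the success probability
    under a uniform prior over that probability."""
    return Fraction(successes + 1, trials + 2)

# Arbitrary illustrative data: every trial so far was a success.
for trials in (3, 30, 3000):
    mle = max_likelihood(trials, trials)
    laplace = rule_of_succession(trials, trials)
    gap = mle - laplace
    print(f"n={trials}: MLE={float(mle):.4f}, "
          f"Laplace={float(laplace):.4f}, gap={float(gap):.4f}")
```

With three successes out of three the two answers differ noticeably (1 versus 4/5); with thousands of trials the gap is negligible, which is the sense in which the classical answer agrees with the Bayesian one in the limit of lots of evidence.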

I added conditions (eg. what if there is no more evidence and you have to make a decision *now*?) until he grudgingly stopped fighting the hypothetical and agreed that the Bayesian framework was superior in some situations (months later, mind you).

I now realize that he was right to fight that hypothetical, and he was right that you should prefer classical max likelihood plus significance in most situations. But of course I had to learn this the hard way.

It is not always, or even often, possible to get overwhelming evidence. Sometimes you only have visibility into one part of a system. Sometimes further tests are expensive, and you need to decide *now*. Sometimes the decision is clear even without further information. The advanced methods can get you through such situations, so it's critical to know them, but that doesn't mean you can laugh in the face of uncertainty in general.

At work, I used to do a lot of what you might call "cowboy epistemology". I quite enjoyed drawing useful conclusions from minimal evidence and careful probability-literate analysis. Juggling multiple hypotheses and visualizing probability flows between them is just fun. This seems harmless, or even helpful, but it meant I didn't take gathering redundant data seriously enough. I now think you should systematically and completely crush your uncertainty at all opportunities. You should not be satisfied until exactly one hypothesis has non-negligible probability.

Why? If I'm investigating a system, and even though we are not completely clear on what's going on, the current data is enough to suggest a course of action, and value of information calculations say that decision is not likely enough to change to make further investigation worth it, why then should I go and do further investigation to pin down the details?

The first reason is the obvious one: stronger evidence can make up for human mistakes. While a lot can be said for its *power*, the human brain is not a *precise* instrument; sometimes you'll feel a little more confident, sometimes a little less. As you gather evidence towards a point where you feel you have enough, that random fluctuation can cause you to stop early. But this only suggests that you should have a small bias towards gathering a bit more evidence.

The second reason is that though you may be able to make the correct immediate decision, going into the future, that residual uncertainty will bite you back eventually. Eventually your habits and heuristics derived from the initial investigation will diverge from what's actually going on. You would not expect this in a perfect reasoner; they would always use their full uncertainty in all calculations, but again, the human brain is a blunt instrument, and likes to simplify things. What was once a nuanced probability distribution like `95% X, 5% Y` might slip to just `X` when you're not quite looking, and then, 5% of the time, something comes back from the grave to haunt you.

The third reason is computational complexity. Inference with very high certainty is easy; it's often just simple direct math or clear intuitive visualizations. With a lot of uncertainty, on the other hand, you need to do your computation once for each of all (or some sample of) probable worlds, or you need to find a shortcut (eg analytic methods), which is only sometimes possible. This is an unavoidable problem for any bounded reasoner.

For example, you simply would not be able to design chips or computer programs if you could not treat transistors as infallible logical gates, and if you really really had to do so, the first thing you would do would be to build an error-correcting base system on top of which you could treat computation as approximately deterministic.

It is possible in small problems to manage uncertainty with advanced methods (eg. Bayes), and this is very much necessary while you decide how to get more certainty, but for unavoidable computational reasons, it is not sustainable in the long term, and must be a temporary condition.

If you take the habit of crushing your uncertainty, your model of situations can be much simpler and you won't have to deal with residual uncertainty from previous related investigations. Instead of many possible worlds and nuanced probability distributions to remember and gum up your thoughts, you can deal with simple, clear, unambiguous *facts*.

My previous cowboy-epistemologist self might have agreed with everything written here, but failed to really get that *uncertainty is bad*. Having just been empowered to deal with uncertainty properly, there was a tendency to not just be unafraid of uncertainty, but to think that it was OK, or even glorify it. What I'm trying to convey here is that that aesthetic is mistaken, and as silly as it feels to have to repeat something so elementary, uncertainty is to be avoided. More viscerally, *uncertainty is uncool* (unjustified confidence is even less cool, though.)

So what's this all got to do with my father's classical methods? I still very much recommend thinking in terms of probability theory when working on a problem; it is, after all, the best basis for epistemology that we know of, and is perfectly adequate as an intuitive framework. It's just that it's *expensive*, and in the epistemic state you really want to be in, that expense is redundant in the sense that you can just use some simpler method that converges to the Bayesian answer.

I could leave you with an overwhelming pile of examples, but I have no particular incentive to crush *your* uncertainty, so I'll just remind you to treat hypotheses like zombies; always double tap.

## Instinctive Frequentists, the Outside View, and de-Biasing

In "How to Make Cognitive Illusions Disappear: Beyond Heuristics and Biases", Gerd Gigerenzer attempts to show that the whole "Heuristics and Biases" approach to analysing human reasoning is fundamentally flawed and incorrect.

In that he fails. His case depends on using the frequentist argument that probabilities cannot be assigned to single events or situations of subjective uncertainty, thus removing the possibility that people could be "wrong" in the scenarios where the biases were tested. (It is interesting to note that he ends up constructing "Probabilistic Mental Models", which are frequentist ways of assigning subjective probabilities - just as long as you don't call them that!).

But that dodge isn't sufficient. Take the famous example of the conjunction fallacy, where most people are tricked into assigning a higher probability to "Linda is a bank teller AND is active in the feminist movement" than to "Linda is a bank teller". This error persists even when people take bets on the different outcomes. By betting more (or anything) on the first option, people are giving up free money. This is a failure of human reasoning, whatever one thinks about the morality of assigning probability to single events.

However, though the article fails to prove its case, it presents a lot of powerful results that may change how we think about biases. It presents weak evidence that people may be instinctive frequentist statisticians, and much stronger evidence that *many biases can go away when the problems are presented in frequentist ways*.

Now, it's known that people are more comfortable with frequencies than with probabilities. The examples in the paper extend that intuition. For instance, when people are asked:

There are 100 persons who fit the description above (i.e., Linda's). How many of them are:

(a) bank tellers

(b) bank tellers and active in the feminist movement.

Then the conjunction fallacy essentially disappears (22% of people make the error, rather than 85%). That is a huge difference.

Similarly, overconfidence. When people were given 50 general knowledge questions and asked to rate their confidence for their answer on each question, they were systematically, massively overconfident. But when they were asked afterwards "How many of these 50 questions do you think you got right?", they were... underconfident. But only very slightly: they were essentially correct in their self-assessments. This can be seen as a use of the outside view - a use that is, in this case, entirely justified. People know their overall accuracy much better than they know their specific accuracy.

A more intriguing example makes the base-rate fallacy disappear. Presenting the problem in a frequentist way makes the fallacy vanish when computing false positives for tests on rare diseases - that's compatible with the general theme. But it really got interesting when people actively participated in the randomisation process. In the standard problem, students were given thumbnail descriptions of individuals, and asked to guess whether they were more likely to be engineers or lawyers. Half the time the students were told the descriptions were drawn at random from 30 lawyers and 70 engineers; the other half, the proportions were reversed. It turns out that students assigned similar guesses to lawyer and engineer in both setups, showing they were neglecting to use the 30/70 or 70/30 base-rate information.
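For reference, here is a sketch of what *using* the base rate would look like in the engineer/lawyer problem. The likelihood ratio of 3 (a description three times as likely to be written about an engineer as about a lawyer) is a hypothetical number chosen for illustration; it does not come from the paper.

```python
from fractions import Fraction

def posterior_engineer(base_rate_engineer, likelihood_ratio):
    """P(engineer | description) via the odds form of Bayes' theorem.

    likelihood_ratio = P(description | engineer) / P(description | lawyer).
    """
    prior_odds = base_rate_engineer / (1 - base_rate_engineer)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Same (hypothetical) description, two different pools.
mostly_engineers = posterior_engineer(Fraction(70, 100), 3)  # 70 engineers, 30 lawyers
mostly_lawyers = posterior_engineer(Fraction(30, 100), 3)    # 30 engineers, 70 lawyers

print(mostly_engineers, mostly_lawyers)
```

Under these assumptions the same description should yield clearly different answers in the two setups (7/8 versus 9/16), so giving similar guesses in both is what neglecting the base rate looks like.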

Gigerenzer modified the setups by telling the students the 30/70 or 70/30 proportions and then having the students themselves draw each description (blindly) out of an urn before assessing it. In that case, base-rate neglect disappears.

Now, I don't find that revelation *quite* as superlatively exciting as Gigerenzer does. Having the students draw the description out of the urn is pretty close to whacking them on the head with the base-rate: it really focuses their attention on this aspect, and once it's risen to their attention, they're much more likely to make use of it. It's still very interesting, though, and suggests some practical ways of overcoming the base-rate problem that stop short of saying "hey, don't forget the base-rate".

There is a large literature out there critiquing the heuristics and biases tradition. Even if they fail to prove their point, they're certainly useful for qualifying the biases and heuristics results, and, more interestingly, for suggesting practical ways of combating their effects.

## [LINK] XKCD Comic #1236, Seashells and Bayes' Theorem

A fun comic about seashells and Bayes' Theorem. http://xkcd.com/1236/

## Question about application of Bayes

I have successfully confused myself about probability again.

I am debugging an intermittent crash; it doesn't happen every time I run the program. After much confusion I believe I have traced the problem to a specific line (activating my debug logger, as it happens; irony...). I have tested my program with and without this line commented out. I find that, when the line is active, I get two crashes on seven runs. Without the line, I get no crashes on ten runs. Intuitively this seems like evidence in favour of the hypothesis that the line is causing the crash. But I'm confused on how to set up the equations. Do I need a probability distribution over crash frequencies? That was the solution the last time I was confused over Bayes, but I don't understand what it means to say "The probability of having the line, given crash frequency f", which it seems I need to know to calculate a new probability distribution.

I'm going to go with my intuition and code on the assumption that the debug logger should be activated much later in the program to avoid a race condition, but I'd like to understand this math.
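One possible way to set this up (a sketch under stated assumptions, not the only reasonable framing) is to compare two hypotheses by how well each predicts the observed runs, integrating over the unknown crash frequency f with a uniform prior. The hypothesis names and the uniform priors are assumptions made for this example: "line" says the program crashes with unknown frequency f when the line is active and never without it; "other" says the line is irrelevant and one unknown f governs all seventeen runs.

```python
from fractions import Fraction
from math import factorial

def beta_integral(a, b):
    """Integral of f**a * (1 - f)**b for f from 0 to 1 (a, b non-negative ints)."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

crashes_with, runs_with = 2, 7        # line active
crashes_without, runs_without = 0, 10  # line commented out

# Hypothesis "line": crashes happen only with the line active.
# The ten clean runs have probability 1; any crash without the line
# would make this hypothesis impossible.
likelihood_line = beta_integral(crashes_with, runs_with - crashes_with)
if crashes_without > 0:
    likelihood_line = Fraction(0)

# Hypothesis "other": the line is irrelevant, one f covers all runs.
total_crashes = crashes_with + crashes_without
total_clean = (runs_with - crashes_with) + (runs_without - crashes_without)
likelihood_other = beta_integral(total_crashes, total_clean)

bayes_factor = likelihood_line / likelihood_other

# Posterior for "line" if the two hypotheses start out equally likely.
posterior_line = bayes_factor / (1 + bayes_factor)

print(float(bayes_factor), float(posterior_line))
```

Under these assumptions the data favour the "line" hypothesis by a factor of about 14.6, which matches the intuition that this is real but not overwhelming evidence. The quantity needed is the probability of the observed crashes given each hypothesis, not "the probability of having the line given f".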

## [Link] Are Children Natural Bayesians?

This recent article at *Slate* thinks so:

Why Your 4-Year-Old Is As Smart as Nate Silver

It turns out that even very young children reason [using Bayes Theorem]. For example, my student Tamar Kushnir, now at Cornell, and I showed 4-year-olds a toy and told them that blocks made it light up. Then we gave the kids a block and asked them how to make the toy light up. Almost all the children said you should put the block on the toy—they thought, sensibly, that touching the toy with the block was very likely to make it light up. That hypothesis had a high “prior.”

Then we showed 4-year-olds that when you put a block right on the toy it did indeed make it light up, but it did so only two out of six times. But when you waved a block over the top of the toy, it lit up two out of three times. Then we just asked the kids to make the toy light up.

The children adjusted their hypotheses appropriately when they saw the statistical data, just like good Bayesians—they were now more likely to wave the block over the toy, and you could precisely predict how often they did so. What’s more, even though both blocks made the machine light up twice, the 4-year-olds, only just learning to add, could unconsciously calculate that two out of three is more probable than two out of six. (In a current study, my colleagues and I have found that even 24-month-olds can do the same).

There also seems to be a reference to the Singularity Institute:

The Bayesian idea is simple, but it turns out to be very powerful. It’s so powerful, in fact, that computer scientists are using it to design intelligent learning machines, and more and more psychologists think that it might explain human intelligence.

(Of course, I don't know how many other AI researchers are using Bayes' Theorem, so the author also might not have the SI in mind.)

If children really are natural Bayesians, then why and how do you think we change?

## [Book Review] "The Signal and the Noise: Why So Many Predictions Fail—But Some Don’t.", by Nate Silver

Here's a link to a review, by The Economist, of a book about prediction, some of the common ways in which people make mistakes and some of the methods by which they could improve:

Looking ahead : How to look ahead—and get it right

One paragraph from that review:

A guiding light for Mr Silver is Thomas Bayes, an 18th-century English churchman and pioneer of probability theory. Uncertainty and subjectivity are inevitable, says Mr Silver. People should not get hung up on this, and instead think about the future the way gamblers do: “as speckles of probability”. In one surprising chapter, poker, a game from which Mr Silver once earned a living, emerges as a powerful teacher of the virtues of humility and patience.

## Help me teach Bayes'

Next Monday I am supposed to introduce a bunch of middle school students to Bayes' theorem.

I've scoured the Internet for basic examples where Bayes' theorem is applied. Alas, all explanations I've come across are, I believe, difficult to grasp for the average middle school student.

So what I am looking for is a straightforward explanation of Bayes' theorem that uses the least amount of Mathematics and words possible. (Also, my presentation has to be under 3 minutes.)

I think that it would be efficient in terms of learning for me to use coins or cards, something tangible to illustrate what I'm talking about.

What do you think? How should I teach 'em Bayes' ways?

PS: I myself am new to Bayesian probability.

## How to use human history as a nutritional prior?

Nutrition is a case where we have to try to make the best possible use of the data we have no matter how terrible, because we have to eat something now to sustain us while we plan and conduct more experiments.

I want to apply Bayes' theorem to make rational health decisions from relatively weak data. I am generally wondering how one can synthesize historical human experiences with incomplete scientific data, in order to make risk-averse and healthy decisions about human nutrition given limited research.

Example question/hypothesis: Does gluten cause health problems (i.e. exhibit chronic toxicity) in non-coeliac humans? Is there enough evidence to suggest that avoiding gluten might be a prudent risk-averse decision for non-coeliacs?

We have some (mostly in vitro) scientific data suggesting that gluten may cause health problems in non-coeliac humans (such as these articles http://evolvify.com/the-case-against-gluten-medical-journal-references/). Let's say for the sake of arguing, that I can somehow convert these studies into a non-unity likelihood ratio for gluten toxicity in humans (although suggestions are welcome here too).

However, we also have prior information that a population of humans has been consuming gluten containing foods for at least 10,000 years, without any blatantly obvious toxic effects. Is there some way to convert this observation (and observations like this) into a prior probability distribution?

## [Link] Better results by changing Bayes’ theorem

If it ever turns out that Bayes fails - receives systematically lower rewards on some problem, relative to a superior alternative, in virtue of its mere decisions - then Bayes has to go out the window.

-- Eliezer Yudkowsky, Newcomb's Problem and Regret of Rationality

Don't worry, we don't have to abandon Bayes’ theorem yet. But changing it slightly seems to be the winning Way given certain circumstances. See below:

In Peter Norvig’s talk The Unreasonable Effectiveness of Data, starting at 37:42, he describes a translation algorithm based on Bayes’ theorem. Pick the English word that has the highest posterior probability as the translation. No surprise here. Then at 38:16 he says something curious.

So this is all nice and theoretical and pure, but as well as being mathematically inclined, we are also realists. So we experimented some, and we found out that when you raise that first factor [in Bayes' theorem] to the 1.5 power, you get a better result.

In other words, if we change Bayes’ theorem (!) we get a better result. He goes on to explain

**Link:** johndcook.com/blog/2012/03/09/monkeying-with-bayes-theorem/

**Peter Norvig - The Unreasonable Effectiveness of Data**

## Two phone apps that use Bayes to help doctors make better decisions [links]

Bayes at the Bedside and Rx-Bayes come with databases of likelihood ratios and help doctors estimate probabilities from their phones. I met somebody at Singularity Summit 2011 who used one of them, but I can't remember which.

## The Principle of Maximum Entropy

After having read the related chapters of Jaynes' book I was fairly amazed by the Principle of Maximum Entropy, a powerful method for choosing prior distributions. However it immediately raised a large number of questions.

I have recently read two quite intriguing (and very well-written) papers by Jos Uffink on this matter:

Can the maximum entropy principle be explained as a consistency requirement?

The constraint rule of the maximum entropy principle

I was wondering what you think about the principle of maximum entropy and its justifications.
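For readers who have not seen the principle in action, here is a minimal sketch of a standard textbook-style example: a six-sided die about which we only know that the long-run average roll is 4.5. The maximum entropy distribution under a mean constraint has the form p_i proportional to exp(lam * i), so the code just searches for the multiplier lam by bisection. The target mean of 4.5, the search bounds, and the function names are illustrative assumptions.

```python
import math

FACES = range(1, 7)

def distribution(lam):
    """Exponential-family distribution p_i proportional to exp(lam * i)."""
    weights = [math.exp(lam * face) for face in FACES]
    total = sum(weights)
    return [w / total for w in weights]

def mean(probs):
    return sum(face * p for face, p in zip(FACES, probs))

def max_entropy_die(target_mean, lo=-10.0, hi=10.0, iterations=200):
    """Find the maximum entropy distribution with the given mean.

    The mean is an increasing function of lam, so bisection on lam works.
    """
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mean(distribution(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return distribution((lo + hi) / 2)

probs = max_entropy_die(4.5)
print([round(p, 4) for p in probs])
```

With a target mean of 3.5 the same code returns the uniform distribution, which is the unconstrained maximum entropy answer; pushing the mean up to 4.5 tilts the probabilities smoothly toward the higher faces without favouring any face more than the constraint requires.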

## [Link] A Bayes' Theorem Visualization

A while ago when Bret Victor's amazing article Up and Down the Ladder of Abstraction was being discussed, someone mentioned that they'd like to see one made for Bayes' Theorem. I've just completed version 1.0 of my "Bayes' Theorem Ladder of Abstraction", and it can be found here: http://www.coarsegra.in/?p=111

(It uses the Canvas html5 element, so won't work with older versions of IE).

There are a few bugs in it, and it leaves out many things that I'd like to (eventually) include, but I'm reasonably satisfied with it as a first attempt. Any feedback for what works and what doesn't work, or what you think should be added, would be greatly appreciated.

## Reading Math: Pearl, Causal Bayes Nets, and Functional Causal Models

Hi all,

I just started a doctoral program in psychology, and my research interest concerns causal reasoning. Since Pearl's *Causality*, the popularity of causal Bayes nets as psychological models for causal reasoning has really grown. Initially, I had some serious reservations, but now I'm beginning to think a great many of these are due in part to the oversimplified treatment that CBNs get in the psychology literature. For instance, the distinction between a) directed acyclic graphs + underlying conditional probabilities, and b) functional causal models, is rarely mentioned. Ignoring this distinction leads to some weird results, especially when the causal system in question has prominent physical mechanisms.

Say we represent Gear A as causing Gear B to turn because Gear A is hooked up to an engine, and because the two gears are connected to each other by a chain. Something like this:

Engine(ON) -> GearA(turn) -> GearB(turn)

As a causal Net, this is problematic. If I "intervene" on GearA (perform *do*(GearA=stop)), then I get the expected result: GearA stops, GearB stops, and the engine keeps running (the 'undoing' effect [Sloman, 2005]). But what happens if I "intervene" on GearB? Since they are connected by a chain, GearA would stop as well. But GearA is the cause, and GearB is the effect: intervening on effects is NOT supposed to change the status of the cause. This violates a host of underlying assumptions for causal Bayes nets. (And you can't represent the gears as causing each other's movement, since that'd be a cyclical graph.)

However, this can be solved if we're not representing the system as the above net, but are instead representing the physics of the system, with the forces involved captured by something that looks vaguely like Newtonian equations. Indeed, this would accord better with people's hypothesis-testing behavior: if they aren't sure which gear has the engine behind it, they wouldn't try "intervening" on GearA's motion and GearB's motion; they'd try removing the chain and seeing which gear is still moving.

At first it seemed to me like causal Bayes nets only do the first kind of representation, not the latter. However, I was wrong: Pearl's "functional causal models" appear to do the latter. These have been vastly less prevalent in the psych literature, yet they seem extremely important.
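To make the gears example concrete, here is a toy sketch in Python. It is not Pearl's formalism, and the function names `run_net` and `run_mechanism` are hypothetical. The first function treats the system as the acyclic net above, where an intervention overrides one node and leaves its causes alone; the second treats the chain as an explicit part of the mechanism.

```python
def run_net(engine_on=True, do=None):
    """Acyclic net Engine -> GearA -> GearB; `do` forces a node's value."""
    do = do or {}
    engine = do.get("engine", engine_on)
    gear_a = do.get("gear_a", engine)   # GearA turns iff the engine drives it
    gear_b = do.get("gear_b", gear_a)   # GearB turns iff GearA turns
    return {"engine": engine, "gear_a": gear_a, "gear_b": gear_b}


def run_mechanism(engine_on=True, chain_attached=True, clamp=None):
    """Mechanism-level toy model; `clamp` names a gear held still, or None."""
    if chain_attached:
        # The chain couples the gears: they turn together or not at all.
        turning = engine_on and clamp is None
        gear_a = gear_b = turning
    else:
        # Without the chain, only the gear bolted to the engine can turn.
        gear_a = engine_on and clamp != "gear_a"
        gear_b = False
    return {"engine": engine_on, "gear_a": gear_a, "gear_b": gear_b}


# In the net, stopping GearB leaves GearA turning, which is physically wrong.
print(run_net(do={"gear_b": False}))
# In the mechanism model, clamping GearB stops both gears.
print(run_mechanism(clamp="gear_b"))
# Removing the chain reveals which gear has the engine behind it.
print(run_mechanism(chain_attached=False))
```

The design choice here is only to show where the two representations disagree: the net has no way to express the chain as something one can remove or that transmits force in both directions.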

Anyways, the moral of the story is that I should really read a lot of Pearl's *Causality*, and actually have a grasp of some of the math; I can't just read the first chapter like most psychology researchers interested in this stuff.

I'm not much of an autodidact when it comes to math, though I'm good at it when put in a class. Can anyone who's familiar with Pearl's book give me an idea of what sort of prerequisites it would be good to have in order to understand important chunks of it? Or am I overthinking this, and should I just try to plow through?

Any suggestions on classes (or textbooks, I guess), or any thoughts on the above gears example, will be helpful and welcome.

Thanks!

EDIT: Maybe a more specific request could be phrased as follows: will I be better served by taking some extra computer science classes, or some extra math classes (i.e., on calculus and probabilistic systems)?

## Log-odds (or logits)

(I wrote this post for my own blog, and given the warm reception, I figured it would also be suitable for the LW audience. It contains some nicely formatted equations/tables in LaTeX, hence I've left it as a dropbox download.)

Logarithmic probabilities have appeared previously on LW here, here, and sporadically in the comments. The first is a link to an Eliezer post which covers essentially the same material. I believe this is a better introduction/description/guide to logarithmic probabilities than anything else that's appeared on LW thus far.

Introduction:

Our conventional way of expressing probabilities has always frustrated me. For example, it is very easy to make nonsensical statements like, “110% chance of working”. Nor is it obvious that the difference between 50% and 50.01% is trivial compared to the difference between 99.98% and 99.99%. It also fails to accommodate the math correctly when we want to say things like, “five times more likely”, because 50% * 5 overflows 100%.

Jacob and I have (re)discovered a mapping from probabilities to log-odds which addresses all of these issues. To boot, it accommodates Bayes’ theorem beautifully. For something so simple and fundamental, it certainly took a great deal of Google searching/Wikipedia surfing to discover that they are actually called “log-odds”, and that they were “discovered” in 1944, instead of the 1600s. Also, nobody seems to use log-odds, even though they are conceptually powerful. Thus, this primer serves to explain why we need log-odds, what they are, how to use them, and when to use them.
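As a rough illustration of the three complaints above, here is a minimal Python sketch of the standard log-odds mapping. The natural logarithm and the function names are assumptions made for this example; the linked primer may use a different base or notation.

```python
import math


def logit(p):
    """Map a probability in (0, 1) to log-odds (natural log)."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def update(prior_p, likelihood_ratio):
    """Bayes in log-odds form: add the log likelihood ratio to the prior."""
    return inv_logit(logit(prior_p) + math.log(likelihood_ratio))


# "Five times more likely" starting from 50% gives odds of 5:1, about 0.833,
# instead of overflowing 100%.
print(update(0.5, 5))

# The step from 99.98% to 99.99% is far larger in log-odds than 50% to 50.01%.
print(logit(0.9999) - logit(0.9998))
print(logit(0.5001) - logit(0.5))
```

Every real number is a valid log-odds value, so a statement like "110% chance" has no counterpart on this scale.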

## Bayes Slays Goodman's Grue

This is a first stab at solving Goodman's famous grue problem. I haven't seen a post on LW about the grue paradox, and this surprised me since I had figured that if any arguments would be raised against Bayesian LW doctrine, it would be the grue problem. I haven't looked at many proposed solutions to this paradox, besides some of the basic ones in "The New Problem of Induction". So, I apologize now if my solution is wildly unoriginal. I am willing to put you through this, dear reader, because:

- I wanted to see how I would fare against this still largely open, devastating, and classic problem, using only the arsenal provided to me by my minimal Bayesian training, and my regular LW reading.
- I wanted the first LW article about the grue problem to attack it from a distinctly *Lesswrongian* approach without the benefit of hindsight knowledge of the solutions of non-LW philosophy.
- And lastly, because, even if this solution has been found before, if it is the right solution, it is to LW's credit that its students can solve the grue problem with only the use of LW skills and cognitive tools.

I would also like to warn the savvy subjective Bayesian that just because I think that probabilities model frequencies, and that I require frequencies out there in the world, does not mean that I am a frequentist or a realist about probability. I am a formalist with a grain of salt. There are no probabilities anywhere in my view, not even in minds; but the theorems of probability theory, when interpreted, share a fundamental contour with many important tools of the inquiring mind, including both the nature of frequency and the set of rational subjective belief systems. There is nothing more to probability than that system which produces its theorems.

Lastly, I would like to say that even if I have not succeeded here (which I think I have), there is likely something valuable that can be made from the leftovers of my solution after the onslaught of penetrating critiques that I expect from this community. Solving this problem is essential to LW's methods, and our arsenal is fit to handle it. If we are going to be taken seriously in the philosophical community as a new movement, we must solve serious problems from academic philosophy, and we must do it in distinctly *Lesswrongian* ways.

"The first emerald ever observed was green.

The second emerald ever observed was green.

The third emerald ever observed was green.

… etc.

The nth emerald ever observed was green.

(conclusion):

There is a very high probability that a never before observed emerald will be green."

That is the inference that the grue problem threatens, courtesy of Nelson Goodman. The grue problem starts by defining "grue":

"An object is grue iff it is first observed before time T, and it is green, or it is first observed after time T, and it is blue."

So you see that before time T, from the list of premises:

"The first emerald ever observed was green.

The second emerald ever observed was green.

The third emerald ever observed was green.

… etc.

The nth emerald ever observed was green."

(we will call these the green premises)

it follows that:

"The first emerald ever observed was grue.

The second emerald ever observed was grue.

The third emerald ever observed was grue.

… etc.

The nth emerald ever observed was grue."

(we will call these the grue premises)

The proposer of the grue problem asks at this point: "So if the green premises are evidence that the next emerald will be green, why aren't the grue premises evidence for the next emerald being grue?" If an emerald is grue after time T, it is not green. Let's say that the green premises bring the probability of "A new unobserved emerald is green." to 99%. On the skeptic's hypothesis, by symmetry they should also bring the probability of "A new unobserved emerald is grue." to 99%. But of course, after time T, this would mean that the probability of observing a green emerald is 99%, and the probability of not observing a green emerald is at least 99%. Since these sentences have no intersection, i.e., they cannot happen together, to find the probability of their disjunction we just add their individual probabilities. This must give us a number at least as big as 198%, which is, of course, a contradiction of the Kolmogorov axioms. We should not be able to form a statement with a probability greater than one.

This threatens the whole of science, because you cannot simply keep this isolated to emeralds and color. We may think of the emeralds as trials, and green as the value of a random variable. Ultimately, every result of a scientific instrument is a random variable, with a very particular and useful distribution over its values. If we can't justify inferring probability distributions over random variables based on their previous results, we cannot justify a single bit of natural science. This, of course, says nothing about how it works in practice. We all know it works in practice. "A philosopher is someone who says, 'I know it works in practice, I'm trying to see if it works in principle.'" - Dan Dennett

We may look at an analogous problem. Let's suppose that there is a table and that there are balls being dropped on this table, and that there is an infinitely thin line drawn perpendicular to the edge of the table at a position we are unaware of. The problem is to figure out the probability of the next ball being right of the line given the previous results. Our first prediction should be that there is a 50% chance of the ball being right of the line, by symmetry. If we get the result that one ball landed right of the line, by Laplace's rule of succession we infer that there is a 2/3 chance that the next ball will be right of the line. After n trials, if every trial gives a positive result, the probability we should assign to the next trial being positive as well is (n+1)/(n+2).
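The numbers in this paragraph follow from Laplace's rule of succession, which assigns probability (s+1)/(n+2) to the next success after s successes in n trials. A minimal Python sketch (the function name is just for illustration):

```python
from fractions import Fraction


def rule_of_succession(successes, trials):
    """Laplace's rule: probability that the next trial is a success."""
    return Fraction(successes + 1, trials + 2)


print(rule_of_succession(0, 0))    # 1/2, the prediction before any ball lands
print(rule_of_succession(1, 1))    # 2/3, after one ball lands right of the line
print(rule_of_succession(10, 10))  # 11/12, i.e. (n+1)/(n+2) with n = 10
```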

If this line was placed 2/3 of the way down the table, we should expect the ratio of rights to lefts to approach 2:1. This gives us a 2/3 chance of the next ball being a right, and the fraction of Rights out of all trials approaches 2/3 ever more closely as more trials are performed.

Now let us suppose a grue skeptic approaching this situation. He might make up two terms "reft" and "light". Defined as you would expect, but just in case:

"A ball is reft of the line iff it is right of it before time T when it lands, or if it is left of it after time T when it lands.

A ball is light of the line iff it is left of the line before time T when it lands, or if it is right of the line after time T when it first lands."

The skeptic would continue:

"Why should we treat the observation of several occurrences of Right, as evidence for 'The next ball will land on the right.' and not as evidence for 'The next ball will land reft of the line.'?"

Things for some reason become perfectly clear at this point for the defender of Bayesian inference, because now we have an easy-to-imagine model. Of course, if a ball landing right of the line is evidence for Right, then it cannot possibly be evidence for ~Right; to be evidence for Reft, after time T, is to be evidence for ~Right, because after time T, Reft is logically identical to ~Right; hence it is not evidence for Reft, after time T, for the same reasons it is not evidence for ~Right. Of course, before time T, any evidence for Reft is evidence for Right for analogous reasons.

But now the grue skeptic can say something brilliant, that stops much of what the Bayesian has proposed dead in its tracks:

"Why can't I just repeat that paragraph back to you and swap every occurrence of 'right' with 'reft' and 'left' with 'light', and vice versa? They are perfectly symmetrical in terms of their logical relations to one another.

If we take 'reft' and 'light' as primitives, then we have to define 'right' and 'left' in terms of 'reft' and 'light' with the use of time intervals."

What can we possibly reply to this? Can he/she not do this with every argument we propose then? Certainly, the skeptic admits that Bayes, and the contradiction in Right & Reft, after time T, prohibits previous Rights from being evidence of both Right and Reft after time T; where he is challenging us is in choosing Right as the result which it is evidence for, even though "Reft" and "Right" have a completely symmetrical syntactical relationship. There is nothing about the definitions of reft and right which distinguishes them from each other, except their spelling. So is that it? No, this simply means we have to propose an argument that doesn't rely on purely syntactical reasoning. So that if the skeptic performs the swap on our argument, the resulting argument is no longer sound.

What would happen in this scenario if it were actually set up? I know that seems like a strangely concrete question for a philosophy text, but its answer is a helpful hint. What would happen is that after time T, the behavior of the ratio 'Rights:Lefts' as more trials were added would proceed as expected, and the ratio 'Refts:Lights' would approach the reciprocal of the ratio 'Rights:Lefts'. The only way for this not to happen is for us to have been calling the right side of the table "reft", or for the line to have moved. We can only figure out where the line is by knowing where the balls landed relative to it; anything we can figure out about where the line is from knowing which balls landed Reft and which ones landed Light, we can only figure out because, in knowing this and the time, we can know if the ball landed left or right of the line.

To this I know of no reply which the grue skeptic can make. If he/she says the paragraph back to me with the proper words swapped, it is not true, because in the hypothetical where we have a table, a line, and we are calling one side right and another side left, the only way for Refts:Lights to behave as expected as more trials are added is to move the line (if even that); otherwise the ratio of Refts to Lights will approach the reciprocal of Rights to Lefts.
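The claim about the two ratios can be checked with a small simulation. This is a sketch under stated assumptions: the line sits where a ball lands right of it with probability 2/3, all simulated balls land after time T, and the function name and parameters are made up for the example.

```python
import random


def ratios_after_T(p_right=2 / 3, n_trials=100_000, seed=0):
    """Return (Rights:Lefts, Refts:Lights) for balls that land after time T."""
    rng = random.Random(seed)
    rights = sum(rng.random() < p_right for _ in range(n_trials))
    lefts = n_trials - rights
    # After time T, "reft" means left of the line and "light" means right of it.
    refts, lights = lefts, rights
    return rights / lefts, refts / lights


rights_to_lefts, refts_to_lights = ratios_after_T()
print(rights_to_lefts)   # close to 2, since the line has not moved
print(refts_to_lights)   # close to 1/2, the reciprocal
```

Nothing in the simulation ever consults "reft" or "light" to decide where a ball lands; those labels are computed afterwards from the side of the line and the time, which is the asymmetry the argument relies on.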

This thin line is analogous to the frequency of emeralds that turn out green out of all the emeralds that get made. This is why we can assume that the line will not move, because that frequency has one precise value, which never changes. Its other important feature is reminding us that even if two terms are syntactically symmetrical, they may have semantic conditions for application which are ignored by the syntactical model, e.g., checking to see which side of the line the ball landed on.

In conclusion:

Every random variable has as a part of it, stored in its *definition/code*, a *frequency distribution* over its values. By the fact that some things happen sometimes, and others happen other times, we know that the world contains random variables, even if they are never *fundamental in the source code*. Note that "frequency" is not used as a state of partial knowledge; it is a fact about a set and one of its subsets.

The reason that:

"The first emerald ever observed was green.

The second emerald ever observed was green.

The third emerald ever observed was green.

… etc.

The nth emerald ever observed was green.

(conclusion):

There is a very high probability that a never before observed emerald will be green."

is a valid inference, but the grue equivalent isn't, is that grue is not a property that the emerald construction sites of our universe deal with. They are *blind* to the grueness of their emeralds; they only say anything about whether or not the next emerald will be green. It may be that the rule that the emerald construction sites use to get either a green or non-green emerald changes at time T, but the frequency of some particular result out of all trials will never change; the line will not move. As long as we know what symbols we are using for what values, observing many green emeralds is evidence that the next one will be grue, as long as it is before time T; every record of an observation of a green emerald is evidence against a grue one after time T. "Grue" changes meaning from green to blue at time T, while the meaning of 'green' stays the same, since we are using the same physical test to determine green-hood as before, just as we use the same test to tell whether the ball landed right or left. There is no reft in the universe's source code, and there is no grue. Green is not fundamental in the source code, but green can be reduced to some particular range of *quanta states*; if you had the universe's source code, you couldn't write grue without first writing green; writing green without knowing a thing about grue would be just as hard as while knowing grue. Having a physical test, or primary condition for applicability, is what privileges green over grue after time T; to have a consistent physical test is the same as to reduce to a specifiable range of physical parameters; the existence of such a test is what prevents the skeptic from performing his/her swaps on our arguments.

Take this more as a brainstorm than as a final solution. It wasn't originally, but it should have been. I'll write something more organized and concise after I think about the comments more, and make some graphics I've designed that make my argument much clearer, even to myself. But keep those comments coming, and tell me if you want specific credit for anything you may have added to my grue toolkit in the comments.

## Help with a (potentially Bayesian) statistics / set theory problem?

**Update**: as it turns out, this is a voting system problem, which is a difficult but well-studied topic. Potential solutions include Ranked Pairs (complicated) and BestThing (simpler). Thanks to everyone for helping me think this through out loud, and for reminding me to kill flies with flyswatters instead of bazookas.

I'm working on a problem that I believe involves Bayes, I'm new to Bayes and a bit rusty on statistics, and I'm having a hard time figuring out where to start. (EDIT: it looks like set theory may also be involved.) Your help would be greatly appreciated.

Here's the problem: assume a set of 7 different objects. Two of these objects are presented at random to a participant, who selects whichever one of the two objects they prefer. (There is no "indifferent" option.) The order of these combinations is not important, and repeated combinations are not allowed.

Basic combination theory says there are 21 different possible combinations: (7!) / ( (2!) * (7-2)! ) = 21.

Now, assume the researcher wants to know which single option has the highest probability of being the "most preferred" to a new participant based on the responses of all previous participants. To complicate matters, each participant can leave at any time, without completing the entire set of 21 responses. Their responses should still factor into the final result, even if they only respond to a single combination.

At the beginning of the study, there are no priors. (CORRECTION via dlthomas: "There are necessarily priors... we start with no information about rankings... and so assume a 1:1 chance of either object being preferred.") If a participant selects B from {A,B}, the probability of B being the "most preferred" object should go up, and A should go down, if I'm understanding correctly.
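As a starting point only, here is a rough Python sketch of the setup so far. It enumerates the 21 pairs and keeps a smoothed win rate per object, starting every object at 1:1 as in the correction above. This is an assumption-laden simplification: it is neither Ranked Pairs nor BestThing, it does not compute the probability of being "most preferred", and the function name is made up for the example.

```python
from itertools import combinations

objects = list("ABCDEFG")
pairs = list(combinations(objects, 2))  # unordered pairs, no repeats
print(len(pairs))  # 21


def smoothed_win_rates(responses, objects):
    """responses: iterable of (winner, loser) tuples from any participants."""
    wins = {o: 0 for o in objects}
    shown = {o: 0 for o in objects}
    for winner, loser in responses:
        wins[winner] += 1
        shown[winner] += 1
        shown[loser] += 1
    # One pseudo-win and one pseudo-loss per object encode the 1:1 start.
    return {o: (wins[o] + 1) / (shown[o] + 2) for o in objects}


# A single response, "B preferred over A", from a participant who then leaves.
rates = smoothed_win_rates([("B", "A")], objects)
print(rates["B"], rates["A"], rates["C"])  # B rises, A falls, C stays at 0.5
```

Because each response is tallied on its own, a participant who answers only one comparison still contributes, which matches the requirement that partial sessions count.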

NOTE: Direct ranking of objects 1-7 (instead of pairwise comparison) isn't ideal because it takes longer, which may encourage the participant to rationalize. The "pick-one-of-two" approach is designed to be fast, which is better for gut reactions when comparing simple objects like words, photos, etc.

The ideal output looks like this: "Based on ___ total responses, participants prefer Object A. Object A is preferred __% more than Object B (the second most preferred), and ___% more than Object C (the third most preferred)."

**Questions:**

1. Is Bayes actually the most straightforward way of calculating the "most preferred"? (If not, what is? I don't want to be Maslow's "man with a hammer" here.)

2. If so, can you please walk me through the beginning of how this calculation is done, assuming 10 participants?

Thanks in advance!

## Thinking in Bayes: Light

There are a lot of explanations of Bayes' Theorem, so I won't get into the technicalities. I will get into why it should change how you think. This post is pretty introductory, so feel free to skip it if you don't feel like there's anything about Bayes' Theorem that you don't understand.

For a while I was reading LessWrong and not seeing what the big deal about Bayes' Theorem was. Sure, probability is in the mind and all, but I didn't see why it was so important to insist on Bayesian methods. For me they were a tool, rather than a way of thinking. This summary also helped someone in the DC group.

After using the Anki deck, a thought occurred to me:

Bayes' theorem means that when seeing how likely a hypothesis is after an event, not only do I need to think about how likely the hypothesis said the event is, I need to consider everything else that could have possibly made that event more likely.

To illustrate:

$$P(H \mid e) = \frac{P(e \mid H)\,P(H)}{P(e)}$$

pretty clearly shows how you need to consider P(e|H), but that's slightly more obvious than the rest of it.

If you write it out the way that you would compute it you get...

$$P(H \mid e) = \frac{P(e \mid H)\,P(H)}{\sum_{h} P(e \mid h)\,P(h)}$$

where h is an element of the hypothesis space.

This means that *every way* that e could have happened is important, on top of (or should I say under?) just how much probability the hypothesis assigned to e.

This is because P(e) comes from every hypothesis that contributes to e happening, or, more mathily, P(e) is the sum over all possible hypotheses of the probability of the event and that hypothesis, computed by the probability of the hypothesis times the probability of the event given the hypothesis.

In LaTeX:

$$P(e) = \sum_{h} P(e \mid h)\,P(h)$$

where h is an element of the hypothesis space.

## Religion, happiness, and Bayes

Religion apparently makes people happier. Is that evidence for the truth of religion, or against it?

(Of course, it matters *which* religion we're talking about, but let's just stick with theism generally.)

My initial inclination was to interpret this as evidence against theism, in the sense that it weakens the evidence *for* theism. Here's why:

- As all Bayesians know, a piece of information F is evidence for an hypothesis H to the degree that F depends on H. If F can happen just as easily without H as with it, then F is not evidence for H. The more likely we are to find F in a world without H, the weaker F is as evidence for H.
- Here, F is "Theism makes people happier." H is "Theism is true."
- The fact of widespread theism is evidence for H. The strength of this evidence depends on how likely such belief would be if H were false.
- As people are more likely to do something if it makes them happy, people are more likely to be theists given F.
- Thus F opens up a way for people to be theists even if H is false.
- It therefore weakens the evidence of widespread theism for the truth of H.
- Therefore, F should decrease one's confidence in H, i.e., it is evidence against H.

We could also put this in mathematical terms, where F represents an *increase* in the prior probability of our encountering the evidence. Since that prior is a denominator in Bayes' equation, a bigger one means a smaller posterior probability--in other words, weaker evidence.
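To see the direction of the effect described in this first argument, here is a small Python sketch. All numbers are made up purely for illustration, E stands for the observation of widespread theism, and the function name is hypothetical. Learning F is modeled, as an assumption, as raising the chance of seeing E even when H is false.

```python
def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """P(H|E) by Bayes' theorem, with P(E) expanded over H and not-H."""
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)
    return p_e_given_h * prior_h / p_e


# Illustrative numbers only.
without_f = posterior(prior_h=0.5, p_e_given_h=0.9, p_e_given_not_h=0.3)
with_f = posterior(prior_h=0.5, p_e_given_h=0.9, p_e_given_not_h=0.6)

print(without_f)  # about 0.75
print(with_f)     # about 0.6: a larger P(E) in the denominator, weaker evidence
```

This only formalizes the first thought; it says nothing about the reframing raised in the second thoughts.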

OK, so that was my first thought.

But then I had second thoughts: Perhaps the evidence points the other way? If we reframe the finding as "Atheism causes unhappiness," or posit that contrarians (such as atheists) are dispositionally unhappy, does that change the sign of the evidence?

Obviously, I am confused. What's going on here?

## Bayesian analysis under threat in British courts

This is an interesting article talking about the use of Bayes in British courts and efforts to improve how statistics are used in court cases. Probably worth keeping an eye on. It might expose more people to Bayes if it becomes common and thus gets portrayed in TV dramas.

## [Funny] Even Clippy can be blamed on the use of non-Bayesian methods

From this 2001 article:

Eric Horvitz... feels bad about [Microsoft Office's Clippy]... many people regard the paperclip as annoyingly over-enthusiastic, since it appears without warning and gets in the way.

To be fair, that is not Dr Horvitz's fault. Originally, he programmed the paperclip to use Bayesian decision-making techniques both to determine when to pop up, and to decide what advice to offer...

The paperclip's problem is that the algorithm... that determined when it should appear was deemed too cautious. To make the feature more prominent, a cruder non-Bayesian algorithm was substituted in the final product, so the paperclip would pop up more often.

Ever since, Dr Horvitz has wondered whether he should have fought harder to keep the original algorithm.

*I*, at least, found this amusing.

## Bayesian exercise

I am confused.

Suppose you are in charge of estimating the risk of catastrophic failure of the Space Shuttle. From engineers, component tests, and guesswork, you come to the conclusion that any given launch is about 3% likely to fail. On the strength of this you launch the Shuttle, and it does not blow up. Now, with this new information, what is your new probability estimate? I write down

P(failure next time | we observe one successful launch) = P (we observe one successful launch | failure next time) * P(failure) / P(observe one success)

or

P(FNT|1S) = P(1S|FNT)*P(F)/P(S)

We have P(F) = 1-P(S) = 0.03. Presumably your chances of success this time are not affected by the next one being a failure, so P(1S|FNT) is just P(S) = 0.97. So the two 97% chances cancel, and I'm left with the same estimate I had before, 3% chance of failure. Is this correct, that a successful launch does not give you new information about the chances of failure? This seems counterintuitive.
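One way to probe this is to note that the calculation treats the failure rate as exactly known, which makes launches independent of one another. The Python sketch below instead assumes the rate itself is uncertain and puts a Beta prior on it. That prior, its strength (Beta(3, 97), chosen only so that its mean matches the 3% figure used in the calculation), and the function name are all assumptions added for illustration.

```python
def mean_failure_rate(alpha, beta, successes=0, failures=0):
    """Posterior mean of the failure rate under a Beta(alpha, beta) prior."""
    return (alpha + failures) / (alpha + beta + successes + failures)


before = mean_failure_rate(3, 97)               # 0.03, the prior estimate
after = mean_failure_rate(3, 97, successes=1)   # 3/101, slightly lower

print(before, after)
```

Under this assumption a successful launch does carry information, just not much: the estimate moves from 3/100 to 3/101. A more concentrated prior would move even less, and a rate known with certainty would not move at all, which is the case the calculation above describes.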

## 'An objective defense of Bayesianism'

Recently, Hans Leitgeb and Richard Pettigrew have published a novel defense of Bayesianism:

An Objective Defense of Bayesianism I: Measuring Inaccuracy

One of the fundamental problems of epistemology is to say when the evidence in an agent’s possession justifies the beliefs she holds. In this paper and its sequel, we defend the Bayesian solution to this problem by appealing to the following fundamental norm:

Accuracy: An epistemic agent ought to minimize the inaccuracy of her partial beliefs.

In this paper, we make this norm mathematically precise in various ways. We describe three epistemic dilemmas that an agent might face if she attempts to follow Accuracy, and we show that the only inaccuracy measures that do not give rise to such dilemmas are the quadratic inaccuracy measures. In the sequel, we derive the main tenets of Bayesianism from the relevant mathematical versions of Accuracy to which this characterization of the legitimate inaccuracy measures gives rise, but we also show that unless the requirement of Rigidity is imposed from the start, Jeffrey conditionalization has to be replaced by a different method of update in order for Accuracy to be satisfied.

An Objective Defense of Bayesianism II: The Consequences of Minimizing Inaccuracy

In this article and its prequel, we derive Bayesianism from the following norm: Accuracy—an agent ought to minimize the inaccuracy of her partial beliefs. In the prequel, we make the norm mathematically precise; in this article, we derive its consequences. We show that the two core tenets of Bayesianism follow from Accuracy, while the characteristic claim of Objective Bayesianism follows from Accuracy together with an extra assumption. Finally, we show that Jeffrey Conditionalization violates Accuracy unless Rigidity is assumed, and we describe the alternative updating rule that Accuracy mandates in the absence of Rigidity.

Richard Pettigrew has also written an excellent introduction to probability.

## "Friends do not let friends compute p values."

LWers may find useful two recent articles summarizing (for cognitive scientists) why Bayesian inference is superior to frequentist inference.

Kruschke - What to believe: Bayesian methods for data analysis

Wagenmakers et al - Bayesian versus frequentist inference

(The quote "Friends do not let friends compute p values" comes from the first article.)

## [Link] The Bayesian argument against induction.

In 1983 Karl Popper and David Miller published an argument to the effect that probability theory could be used to disprove induction. Popper had long been an opponent of induction. Since probability theory in general, and Bayes in particular, is often seen as rescuing induction from the standard objections, the argument is significant.

It is being discussed over at the Critical Rationalism site.

## Considering all scenarios when using Bayes' theorem.

Disclaimer: this post is directed at people who, like me, are not Bayesian/probability gurus.

Recently I found an opportunity to use Bayes' theorem in real life to help myself update in the following situation (presented in a gender-neutral way):

Let's say you are wondering if a person is interested in you romantically. And they bought you a drink.

A = they are interested in you.

B = they bought you a drink.

P(A) = 0.3 (Just an assumption.)

P(B) = 0.05 (Approximately 1 out of 20 people, who might be at all interested in you, will buy you a drink for some unknown reason.)

P(B|A) = 0.2 (Approximately 1 out of 5 people, who *are* interested in you, will buy you a drink for some unknown reason. Though it's more likely they will buy you a drink *because *they are interested in you.)

These numbers seem valid to me, and I can't see anything that's obviously wrong. But when I actually use Bayes' theorem:

P(A|B) = P(B|A) * P(A) / P(B) = **1.2**

Uh-oh! Where did I go wrong? See if you can spot the error before continuing.

Turns out:

P(B|A) = P(A∩B) / P(A) ≤ P(B) / P(A) = 0.1667

BUT

P(B|A) = 0.2 > 0.1667

I've made a mistake in estimating my probabilities, even though it felt intuitive. Yet, I don't immediately see where I went wrong when I look at the original estimates! What's the best way to prevent this kind of mistake?

I feel pretty confident in my estimates of P(A) and P(B|A). However, estimating P(B) is rather difficult because I need to consider *many scenarios*.

I can compute P(B) more precisely by considering all the scenarios that would lead to B happening (see wiki article):

P(B) = ∑_{i} P(B|H_{i}) * P(H_{i})

Let's do a quick breakdown of everyone who would want to buy you a drink (out of the pool of people who might be at all interested in you):

P(*misc*. reasons) = 0.05; P(B|*misc*) = 0.01

P(they are just *friendly* and buy drinks for everyone they meet) = 0.05; P(B|*friendly*) = 0.8

P(they want to be *friends*) = 0.3; P(B|*friends*) = 0.1

P(they are *interested* in you) = 0.6; P(B|*interested*) = P(B|A) = 0.2

So, P(B) = 0.1905

And, P(A|B) = **0.315** (very different from 1.2!)
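The breakdown above is the law of total probability applied scenario by scenario, and it can be checked with a few lines of Python. The dictionary layout and variable names are just for this sketch. Note that the breakdown assigns 0.6 to the "interested" scenario while P(A) was earlier taken to be 0.3, so the sketch reports the posterior under both values rather than guessing which was intended.

```python
# scenario: (P(scenario), P(B | scenario)), numbers taken from the breakdown
scenarios = {
    "misc": (0.05, 0.01),
    "friendly": (0.05, 0.8),
    "friends": (0.30, 0.1),
    "interested": (0.60, 0.2),
}

# The scenario probabilities should cover the whole pool of people.
total = sum(p for p, _ in scenarios.values())

# Law of total probability: P(B) = sum over scenarios of P(B|H_i) * P(H_i)
p_b = sum(p * likelihood for p, likelihood in scenarios.values())

p_b_given_a = 0.2
posterior_as_in_post = p_b_given_a * 0.3 / p_b       # uses P(A) = 0.3
posterior_from_breakdown = p_b_given_a * 0.6 / p_b   # uses P(interested) = 0.6

print(round(total, 10))                    # 1.0
print(round(p_b, 4))                       # 0.1905
print(round(posterior_as_in_post, 3))      # 0.315
print(round(posterior_from_breakdown, 2))  # 0.63
```

Either way the posterior is a valid probability, unlike the 1.2 obtained from the hand-estimated P(B) = 0.05.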

Once I started thinking about all possible scenarios, I found one I hadn't considered explicitly -- some people buy drinks for everyone they meet -- which adds a good amount of probability (0.04) to B happening. (Those types of people are rare, but they WILL buy you a drink.) There are also other interesting assumptions that are made explicit:

- Out of all the people under consideration in this problem, there are twice as many people who would be romantically interested in you vs. people who would want to be your friend.
- People who are interested in you will buy you a drink twice as often as people who want to be your friend.

The moral of the story is to consider all possible scenarios (models/hypotheses) which can lead to the event you have observed. It's possible you are missing some scenarios which, once considered, will significantly alter your probability estimates.

Do you know any other ways to make the use of Bayes' theorem more accurate? (Please post in comments, links to previous posts of this sort are welcome.)

## A Philosophical Treatise of Universal Induction (Link)

Abstract: Understanding inductive reasoning is a problem that has engaged mankind for thousands of years. This problem is relevant to a wide range of fields and is integral to the philosophy of science. It has been tackled by many great minds ranging from philosophers to scientists to mathematicians, and more recently computer scientists. In this article we argue the case for Solomonoff Induction, a formal inductive framework which combines algorithmic information theory with the Bayesian framework. Although it achieves excellent theoretical results and is based on solid philosophical foundations, the requisite technical knowledge necessary for understanding this framework has caused it to remain largely unknown and unappreciated in the wider scientific community. The main contribution of this article is to convey Solomonoff induction and its related concepts in a generally accessible form with the aim of bridging this current technical gap. In the process we examine the major historical contributions that have led to the formulation of Solomonoff Induction as well as criticisms of Solomonoff and induction in general. In particular we examine how Solomonoff induction addresses many issues that have plagued other inductive systems, such as the black ravens paradox and the confirmation problem, and compare this approach with other recent approaches.

**Link:** mdpi.com/1099-4300/13/6/1076/

**Download PDF Full-Text:** mdpi.com/1099-4300/13/6/1076/pdf

**Authors:** Samuel Rathmanner and Marcus Hutter

**Published:** 3 June 2011

## A discussion of an application of Bayes' theorem to everyday life

[12:49:29 AM] Conversational Partner: actually, even if the praise is honest, it makes me uncomfortable if it seems excessive. that is, repeated too often, or made a big deal about.

[12:49:58 AM] Adelene Dawner: 'Seems excessive' can actually be a cue for 'is insincere'.

[12:50:05 AM] Conversational Partner: oh

[12:50:25 AM] Adelene Dawner: That kind of praise tends to parse to me as someone trying to push my buttons.

[12:51:53 AM | Edited 12:52:09 AM] Conversational Partner: is it at least theoretically possible that the praise is honest, and the other person just happens to think that the thing is more praiseworthy than I do? or if the other person has a different opinion than I do about how much praise is appropriate in general?

[12:52:59 AM] Adelene Dawner: Of course.

[12:53:13 AM] Adelene Dawner: This is a situation where looking at Bayes' theorem is useful.

## A Problem with Human Intuition about Conventional Statistics:

As an aspiring scientist, I hold the Truth above all. As Hodgell once said, *"That which can be destroyed by the truth should be."* But what if the thing holding back our pursuit of the Truth is our own system? I will share an example of an argument I once overheard between a theist and an atheist, showing an instance where human intuition might fail us.

*General Transcript*

Atheist: Prove to me that God exists!

Theist: He obviously exists – can’t you see that plants growing, humans thinking, [insert laundry list here], is all His work?

Atheist: Those can easily be explained by evolutionary mechanisms!

Theist: Well prove to me that God doesn’t exist!

Atheist: I don’t have to! There may be an invisible pink unicorn baby flying around my head, but there probably is not. I can’t prove that there is no unicorn, but that doesn’t mean it exists!

Theist: That’s just complete *reductio ad ridiculo*, you could do infrared, polaroid, uv, vacuum scans, and if nothing appears it is statistically unlikely that the unicorn exists! But God is something metaphysical, you can’t do that with Him!

Atheist: Well Nietzsche killed metaphysics when he killed God. God is dead!

Theist: That is just words without argument. Can you actually…..

As one can see, the biggest problem is determining the **burden of proof**. Statistically speaking, this is much like the problem of defining the null hypothesis.

A theist would define: H_{0} : God exists. H_{a}: God does not exist.

An atheist would define: H_{0}: God does not exist. H_{a}: God does exist.

Both conclude that there is no significant evidence favoring H_{a} over H_{0}. Furthermore, *and this is key*, they both accept the null hypothesis. Statistically, the proper conclusion when the evidence for the alternative hypothesis is insignificant is that one *fails to reject* the null hypothesis. However, human intuition fails to grasp this distinction and thinks in black and white, so we tend to *accept* the null hypothesis instead.

This is not so much a problem with statistics as it is with human intuition. Statistics usually takes this form because considering 100+ hypotheses simultaneously is taxing on the human brain. Therefore, we think of hypotheses as things to be *defended* or *attacked*, but not *considered neutrally*.

Consider a Bayesian outlook on this problem.

There are two possible outcomes: at least one deity exists (D), or no deities exist (N).

Let us consider the natural evidence (Let’s call this E) before us.

Since D and N are mutually exclusive and exhaustive: P(D ∪ N) = 1. P(D ∪ N | E) = 1. P(D|E) + P(N|E) = 1. P(D|E) = 1 - P(N|E).

Although calculating the prior probability of a god existing is rather strange, and seems to reek of bias, I still argue that this is a better system of analysis than the classical H_{0} and H_{a}, because it compares the probabilities of D and N directly, without the bias inherent in the brain’s perception of the system.
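To make the symmetry concrete, here is a minimal sketch of the two-hypothesis update. The function name and every number in it are placeholders made up for illustration; they are not estimates of anything, and the only point is that both hypotheses are scored by the same rule and the posteriors always sum to 1.

```python
def posteriors(prior_d: float, lik_e_given_d: float, lik_e_given_n: float) -> tuple[float, float]:
    """Return (P(D|E), P(N|E)) for two exclusive, exhaustive hypotheses D and N."""
    prior_n = 1.0 - prior_d                 # P(N) = 1 - P(D)
    joint_d = prior_d * lik_e_given_d       # P(E and D)
    joint_n = prior_n * lik_e_given_n       # P(E and N)
    evidence = joint_d + joint_n            # P(E), by total probability
    return joint_d / evidence, joint_n / evidence


# Placeholder inputs (assumption): neither hypothesis gets the "null" treatment.
p_d, p_n = posteriors(prior_d=0.5, lik_e_given_d=0.3, lik_e_given_n=0.6)
print(round(p_d, 4), round(p_n, 4))  # 0.3333 0.6667
```

Swapping which hypothesis is called D changes nothing but the labels, which is exactly what the H_{0}/H_{a} framing fails to guarantee in practice.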

Examples such as these, I believe, show the flaws that result from faulty interpretations of the classical system. If instead we introduced a Bayesian perspective, the faulty interpretation would vanish.

This is a case for the expanded introduction of Bayesian probability theory. Even if it cannot be applied correctly to every problem, even if it is apparently more complicated than the standard method they teach in statistics class (I disagree here), it teaches people to analyze situations from a more objective perspective.

And if we can avoid Truth-seekers going awry due to simple biases such as those mentioned above, won’t we be that much closer to finding Truth?

## Visualizing Bayesian Inference [link]

Galton Visualizing Bayesian Inference (article @ CHANCE)

Excerpt:

What does Bayes Theorem look like? I do not mean what does the formula—

—look like; these days, every statistician knows that. I mean, how can we visualize the cognitive content of the theorem? What picture can we appeal to with the hope that any person curious about the theorem may look at it, and, after a bit of study say, “Why, that is clear—I can indeed see what is happening!”

Francis Galton could produce just such a picture; in fact, he built and operated a machine in 1877 that performs that calculation. But, despite having published the picture in Nature and the Proceedings of the Royal Institution of Great Britain, he never referred to it again—and no reader seems to have appreciated what it could accomplish until recently.

Schematics for the machine and its algorithm can be found at the link. This is a really cool design, and maybe it can aid Eliezer's and others' efforts to help people understand Bayes' Theorem.

## Convincing ET of our rationality

Allow me to propose a thought experiment. Suppose you, and you alone, were to make first contact with an alien species. Since your survival and the survival of the entire human race may depend on the extraterrestrials recognizing you as a member of a rational species, how would you convey your knowledge of mathematics, logic, and the scientific method to them using only your personal knowledge and whatever tools you might reasonably have on your person on an average day?

When I thought of this question, the two methods that immediately came to mind were the Pythagorean Theorem and prime number sequences. For instance, I could draw a rough right triangle and label one side with three dots, the other with four, and the hypotenuse with five. However, I realized that these are fairly primitive maths. After all, the ancient Greeks knew of them, and yet had no concept of the scientific method. Would these likely be sufficient, and if not, what would be? Could you make a rough sketch of the first few atoms on the periodic table or other such universal phenomena so that it would be generally recognizable? Could you convey a proof of rationality in a manner accessible even to aliens who cannot hear human vocalizations, or who see in a completely different part of the EM spectrum? Is it even in principle possible to express rationality without a common linguistic grounding?

In other words, what is the most rational thought you could convey without the benefit of common language, culture, psychology, or biology, and how would you do it?

Bonus point: Could you convey Bayes' theorem to said ET?

## A cautionary note about "Bayesianism"

(Is Bayesianism even a word? Should it be? The suffix "ism" sets off warning lights for me.)

Visitors to LessWrong may come away with the impression that they need to be Bayesians to be rational, or to fit in here. But most people are a long way from the point where learning Bayesian thought patterns is the most time-effective thing they can do to improve their rationality. Most of the insights available on LessWrong don't require people to understand Bayes' Theorem (or timeless decision theory).

I'm not calling for any specific change, just for keeping this in mind when writing things in the Wiki, or constructing a rationality workbook.

## In Defense of Objective Bayesianism: MaxEnt Puzzle.

*In Defense of Objective Bayesianism* by Jon Williamson was mentioned recently in a post by lukeprog as the sort of book that should be being read by people on Less Wrong. Now, I have been reading it, and found some of it quite bizarre. This point in particular seems obviously false. If it’s just me, I’ll be glad to be enlightened as to what was meant. If collectively we don’t understand, that’d be pretty strong evidence that we should read more academic Bayesian stuff.

Williamson advocates use of the Maximum Entropy Principle. In short, you should take account of the limits placed on your probability by the empirical evidence, and then choose a probability distribution closest to uniform that satisfies those constraints.

So, if asked to assign a probability to an arbitrary A, you’d say p = 0.5. But if you were given evidence in the form of some constraints on p, say that p ≥ 0.8, you’d set p = 0.8, as that is the new entropy-maximising value. Constraints are restricted to affine constraints. I found this somewhat counter-intuitive already, but I do follow what he means.
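For a single proposition this is easy to check numerically. The following is a rough sketch under my own assumptions (a brute-force grid search over p, with invented function names), not Williamson's procedure: it simply finds the value of p with the highest binary entropy inside the allowed interval.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of the distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def maxent_p(lower: float = 0.0, upper: float = 1.0, steps: int = 10_000) -> float:
    """Grid-search the entropy-maximising p subject to lower <= p <= upper."""
    grid = [lower + (upper - lower) * i / steps for i in range(steps + 1)]
    return max(grid, key=binary_entropy)


print(maxent_p())           # 0.5, no constraints
print(maxent_p(lower=0.8))  # 0.8, the constraint p >= 0.8 binds
```

Because binary entropy peaks at 0.5 and falls off on either side, the constrained maximum is whichever allowed value lies closest to 0.5.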

But now for the confusing bit. I quote directly;

“Suppose A is ‘Peterson is a Swede’, B is ‘Peterson is a Norwegian’, C is ‘Peterson is a Scandinavian’, and ε is ‘80% of all Scandinavians are Swedes’. Initially, the agent sets P(A) = 0.2, P(B) = 0.8, P(C) = 1, P(ε) = 0.2, P(A & ε) = P(B & ε) = 0.1. All these degrees of belief satisfy the norms of subjectivism. Updating by maxent on learning ε, the agent believes Peterson is a Swede to degree 0.8, which seems quite right. On the other hand, updating by conditionalizing on ε leads to a degree of belief of 0.5 that Peterson is a Swede, which is quite wrong. Thus, we see that maxent is to be preferred to conditionalization in this kind of example because the conditionalization update does not satisfy the new constraints X’, while the maxent update does.”

p80, 2010 edition. Note that this example is actually from Bacchus et al (1990), but Williamson quotes approvingly.

His calculation for the Bayesian update is correct; you do get 0.5. What’s more, this seems to be intuitively the right answer; the update has caused you to ‘zoom in’ on the probability mass assigned to ε, while maintaining relative proportions inside it.

As far as I can see, you get 0.8 only if we assume that Peterson is a randomly chosen Scandinavian. But if that were true, the prior given is bizarre. If he were a randomly chosen individual, the prior should have been something like P(A & ε) = 0.16, P(B & ε) = 0.04. The only way I can make sense of the prior is if constraints simply “don’t apply” until they have p = 1.
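The two conditionalization results can be reproduced directly from the definition P(A|ε) = P(A & ε) / P(ε). This is only a sketch of the arithmetic: the numbers come from the quoted prior and from the alternative prior suggested above, and the helper name is mine.

```python
def conditional(p_joint: float, p_evidence: float) -> float:
    """P(A | e) = P(A and e) / P(e)."""
    return p_joint / p_evidence


p_eps = 0.2  # P(ε) in both versions of the prior

# Quoted prior: P(A & ε) = P(B & ε) = 0.1
print(round(conditional(0.1, p_eps), 4))   # 0.5

# Alternative prior: P(A & ε) = 0.16, P(B & ε) = 0.04
print(round(conditional(0.16, p_eps), 4))  # 0.8
```

With the alternative prior, plain conditionalization already lands on 0.8, so the disagreement in the quote traces back to the prior rather than to the update rule.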

Can anyone explain the reasoning behind a posterior probability of 0.8?

## The prior probability of justification for war?

Could you use Bayes' theorem to figure out whether or not a given war is just?

If so, I was wondering how one would go about estimating the prior probability that a war is just.

Thanks for any help you can offer.
