Exploration-Exploitation problems
Original post: http://bearlamp.com.au/explorationexploitationproblems/
I have been working on the assumption that exploration-exploitation knowledge was just common. Unfortunately I did the smart thing of learning about it from a mathematician at a dojo in Melbourne, which means that no, not everyone knows about it. I discovered that again today when I searched for a good quick explanation of the puzzle. With that in mind, this post is about Exploration-Exploitation.
The classic Exploration-Exploitation problem in mathematics is the multi-armed bandit, a slang term for a bank of slot machines. The player knows that each machine has a variable payoff, and has a limited number of attempts before running out of money. You want to balance trying out new machines with unknown payoffs against exploiting the knowledge you already have from the machines you tried earlier.
When you first start on new bandits, you really don't know which will pay out and at what rates. So some exploration is necessary to know what your reward ratio in the territory will be. As your knowledge grows, you get to know which bandits are likely to pay, and which are not, and this later informs your choices as to where to place your dollars.
Mathematicians love a well-specified problem like this because it allows us to build algorithmic models of strategies that will return rewards, or guarantee rewards, given certain circumstances. (See also the secretary problem, which does something similar, and where I showed how it applies to real-life dating.)
Some of the mathematical solutions to this problem look like:
Epsilon-greedy: The best lever is selected for a proportion 1 - ε of the trials, and a lever is selected at random (with uniform probability) for a proportion ε. A typical parameter value might be ε = 0.1, but this can vary widely depending on circumstances.
Epsilon-decreasing strategy: Similar to the epsilon-greedy strategy, except that the value of ε decreases as the experiment progresses, resulting in highly exploratory behaviour at the start and highly exploitative behaviour at the finish.
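To make the two strategies concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function name run_bandit, the three payout rates, the win/lose (0 or 1) rewards, and the particular decay schedule for ε are mine, not part of any standard definition.

```python
import random

def run_bandit(payout_probs, n_pulls, epsilon_at, seed=0):
    """Play a bank of win/lose machines; epsilon_at(t) is the exploration rate at pull t."""
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    pulls = [0] * n_arms   # how often each lever was pulled
    wins = [0] * n_arms    # how often each lever paid out
    total = 0

    def observed_rate(arm):
        # Untried levers are scored 0.0 here, a simplifying assumption.
        return wins[arm] / pulls[arm] if pulls[arm] else 0.0

    for t in range(n_pulls):
        if rng.random() < epsilon_at(t):
            arm = rng.randrange(n_arms)                   # explore: uniform random lever
        else:
            arm = max(range(n_arms), key=observed_rate)   # exploit: best lever so far
        reward = 1 if rng.random() < payout_probs[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
        total += reward
    return total, pulls

probs = [0.2, 0.5, 0.7]  # hypothetical payout rates, unknown to the player
greedy_total, greedy_pulls = run_bandit(probs, 1000, lambda t: 0.1)
decay_total, decay_pulls = run_bandit(probs, 1000, lambda t: 1.0 / (1.0 + t / 20.0))
print("epsilon-greedy:    ", greedy_total, greedy_pulls)
print("epsilon-decreasing:", decay_total, decay_pulls)
```

Setting ε to 0 everywhere in this sketch means the player never leaves the first lever, which is the "all exploit, no explore" failure the strategies above are meant to avoid.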
Of course there are more strategies, and the context and nature of the problem matters. If the machines suddenly one day in the future all change, you might have a strategy that would prepare for potential scenarios like that. As you start shifting away from the hypothetical and towards real life your models need to increase complexity to cater to the details of the real world.
If this problem is more like real life (where we live and breathe), the variability of reality comes into play more and more. In talking about this, I want to emphasise not the problem as interesting, but the solution of <sometimes explore> and <sometimes exploit> in specific ratios or for specific reasons. The mathematical solutions to the multi-armed bandit problem strike a balance between not knowing enough and taking advantage of what you do know.
What supercharges this solution and how it can be applied to real life is value of information.
Value of Information says that in relation to making a decision, what informs that decision is worth something. With expensive decisions, risky decisions, dangerous decisions, highly lucrative decisions, or particularly unknown decisions, being more sure is important to think about.
VoI suggests that any decision that is worth money (or worth something) can have information that informs that decision. The value of information can add up to the value of the reward on correctly making the decision. Of course if you spend all the potential gains from the decision on getting the perfect information you lose the chance to make a profit. However usually a cheap (relative to the decision) piece of information exists that will inform the decision and assist.
How does this apply to exploration-exploitation?
The idea of VoI is well covered in the book How to Measure Anything. While the book goes into detail and is really, really excellent for applying to big decisions, the ideas can also be applied to our simple everyday problems. With this in mind I propose a heuristic:
You want to explore enough to increase your information about both the quality of the remaining exploration (and its possible results) and the expected returns on your existing knowledge.
The next thing to supercharge our exploration-exploitation and VoI knowledge is diminishing returns.
Diminishing returns on VoI is when you start out not knowing anything at all, and adding a little bit of knowledge goes a long way. As you keep adding more and more information the return on the extra knowledge has a diminishing value.
Worked example: Knowing the colour of the sky.
So you are blind and no one has ever told you what colour the sky is. You can't really be sure what colour the sky is but generally if you ask enough people the consensus should be a good enough way to conclude the answer.
So one guy gave you your first inkling of what the answer is. But can you really trust him?
Yea cool. Ten people. Probably getting sure of yourself now.
Really, what good is two thousand people after the first fifty? Especially if they all agree. There's got to be less value in the 2001st person telling you than there was in the 3rd person telling you.
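One rough way to put numbers on this is sketched below. It is a toy model under loud assumptions: you start with a uniform Beta(1, 1) belief about what fraction of people would say "blue", everyone you ask happens to agree, and "how far the estimate moves" stands in as a crude proxy for the value of the extra answer. The function names are made up for this example.

```python
from fractions import Fraction

def posterior_mean(n_agree):
    """Mean of Beta(1 + n_agree, 1): estimated chance the next person also says 'blue'."""
    return Fraction(n_agree + 1, n_agree + 2)

def gain_from_nth(n):
    """How far the n-th agreeing answer moves the estimate."""
    return posterior_mean(n) - posterior_mean(n - 1)

for n in (1, 3, 10, 50, 2001):
    print(n, float(gain_from_nth(n)))
```

Under these assumptions the n-th agreeing answer moves the estimate by 1 / ((n + 1)(n + 2)), so the 3rd person shifts it by 1/20 while the 2001st shifts it by less than one part in a million.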
Going back to VoI, how valuable was the knowledge that the sky is blue? Probably not very valuable, and this isn't a great way to gather knowledge in the long run.
The great flaw with this example is that if I asked you the question "what colour is the sky?" you could probably already make a confident guess. If you are a well-calibrated human, you already know a little bit of everything, and the good news is that calibration is trainable.
With that in mind, if you want to play a calibration game there are plenty available on Google.
The great thing about calibration is that it seems to apply across all your life, and all things that you estimate. Which is to say that once you are calibrated, you are calibrated across domains. This means that if you become good at it in one area, you become better at it in other areas. We're not quite talking about hitting the bullseye every time, but we are talking about being confident that the bullseye is over there in that direction. Which is essentially the ability to predict the future within a reasonable set of likelihoods.
Once you are calibrated, you can use that calibration to apply diminishing returns through VoI and supercharge your exploration-exploitation. But we're not done. What if we add in Bayesian statistics? What if we can shape our predicted future and gradually update our beliefs based on tiny snippets of data that we gather over time, purposefully, by thinking about VoI and the diminishing returns of information?
I don't want to cover Bayes because people far smarter than me have covered it very well. If you are interested in learning Bayes I would suggest heading to Arbital for their excellent guides.
But we're not done at Bayes. This all comes down to the idea of trade-offs. Exploration vs exploitation is a trade-off of {time/energy} vs expected reward.
A classic example of a trade-off is the story of Sharpening the Saw (from the book The 7 Habits of Highly Effective People):
A woodcutter strained to saw down a tree. A young man who was watching asked “What are you doing?”
“Are you blind?” the woodcutter replied. “I’m cutting down this tree.”
The young man was unabashed. “You look exhausted! Take a break. Sharpen your saw.”
The woodcutter explained to the young man that he had been sawing for hours and did not have time to take a break.
The young man pushed back… “If you sharpen the saw, you would cut down the tree much faster.”
The woodcutter said “I don’t have time to sharpen the saw. Don’t you see I’m too busy?”
The thing about life and trade-offs is that all of life is trade-offs between things you want to do and other things you want to do.
Exploration and exploitation is a trade-off between the value of what you know and the value of what you might know if you find out.
Try this:
 Make a list of all the things you have done over the last 7 days. (Use your diary and rough time chunking)
 Sort them into exploration activities and exploitation activities.
Answer this:
 Am I exploring enough? (on a scale of 1-10)
 Am I exploiting enough? (on a scale of 1-10)
 Have I turned down any exploring opportunities recently?
 Have I turned down any exploitation opportunities recently?
 How could I expand any exploring I am already doing?
 How could I expand any exploiting I am already doing?
 How could I do more exploring? How could I do less exploring?
 How could I do more exploiting? How could I do less exploiting?
There are two really important things to take away from the Exploration-Exploitation dichotomy:
 1. You probably make the most measurable and ongoing gains in the Exploitation phase. I mean, let's face it, these are long-running goal-seeking behaviours like sticking to an exercise routine.
 2. The exploration might seem more fun (finding exciting new hobbies) but are you sure that's what you want to be doing in regard to 1?
Meta: This is part 1 of a 4 part series.
This took in the order of 10-15 hours to finish because I was doing silly things like trying to fit 4 posts into 1 and stumbling over myself.
Geometric Bayesian Update
I have moved this post to my blog: http://squirrelinhell.blogspot.com/2016/11/geometricbayesianupdate.html
Preference over preference
Each individual person has a preference. Some preferences are strong, others are weak. For many preferences it's more complicated than that; they aren’t static, and we change our preferences all the time. Some days we don't like certain foods, sometimes we may strongly dislike a certain song then another time we may not care so much. Our preferences can change in scope, as well as intensity.
Sometimes people can have preferences over other people's preferences.
 Example 1: I prefer to be surrounded by people who enjoy exercise, that way I will be motivated to exercise more.
 Example 2: I prefer to be surrounded by people who don't care how they look, that way I look prettier than everyone else.
 Example 3: I prefer when other people like my clothes.
 Example 4: I prefer my partners to be polyamorous.
 Example 5: I prefer people around me to not smoke.
The interesting thing about example 3 is that there are multiple ways to achieve that preference:
 Find out what clothes people like and acquire those clothes, then wear them regularly.
 Find people who already like the clothes that you have, then hang around those people regularly.
 Change the preference of the people around you so that they like your clothes.
Changing someone’s preference over clothing seems pretty harmless, and that way you get to wear clothes you like, they get to like the clothes you wear, and you get to be around people who like the clothes you wear without finding new people. The scary and maybe uncomfortable thing is that the other preferences can also be achieved through these means.
Example 4:
 Find out where poly people are, and hang out with them (and ask to be their partners, etc.).
 Find out which of the people you know are already poly and hang out with them (and ask to be their partners, etc.).
 Change the preferences of your existing partner/s.
Example 1:
 Find out where people who enjoy exercise hang out, and join them.
 Find out which of your friends already enjoy exercise and hang out with them.
 Change the preferences of those around you to also enjoy exercise.
Example 5:
 Find out where people don't smoke, hang out in those places.
 Figure out who already doesn't smoke and hang out with them.
 Encourage people you know to not smoke.
(I think that's enough examples)
Is it wrong?
There is nothing inherently wrong with having a preference. Having a preference over another person’s preference is also not inherently wrong. Such is the nature of having a preference (usually a strong one by the time you are dictating it to your surroundings). What really matters is what you do about it.
In this day and age, no one would be discouraged from figuring out where people are not smoking and being in those places instead of the smoking places. In this day and age you wouldn't be criticised for finding out which of your friends don't smoke and only hanging out with them either, but maybe it makes some people uncomfortable to do it, or to feel that the reciprocal might happen if someone strongly didn't like their preferences. In this day and age, encouraging those around you to not smoke can come across as an action with questionable motives.
So let's look at some of the motives:
 I prefer it when people don't smoke around me because then I don't get second hand smoke.
 I prefer it when my friends don't smoke because I don't like chemical dependency in my environment.
 I prefer it when my friends don't smoke so that we look better than that other group of people who do smoke.
 I prefer it when my friends don't smoke because I don't want them to get cancer and die (and not be around to be my friends any more).
Motive 1 seems very much about self-preservation. We can't really fault an entity for trying to self-preserve.
Motive 2 is a broader example of self-preservation: the idea that having dependency in your environment might negatively impact you enough to warrant maintaining an environment without it. It's a stretch, but not an unreasonable self-preservation drive.
Motive 3 appears to be a superficial drive to be better than other people. We often don't like admitting that this is the reason we do things, but I don't mind it either. If it were me, I'd get pretty tired of being motivated by *keeping up with the Joneses* type attitudes, but some people care greatly about that.
Motive 4 seems like a potentially altruistic desire to protect your friends, but then it seems less so once you include the bracketed sub-motive.
Herein lies the problem. If a preference looks like it is designed to improve someone else's life, like "others shouldn't smoke" (remember that "looks like to me" is equivalent to "I believe it looks like..."), and we believe that having a preference over their preference would improve their life, should we enforce that preference? Do we have a right or even a burden to encourage those around us to quit smoking? To take up exercise? To become poly? To like us (or our clothes)?
The idea of preference over preference is a big one. What if my preference is that people eat my birthday cake, and Bob’s preference is that he sticks to his diet today? Who should win? It’s My Birthday. On Bob’s birthday he doesn’t have to eat cake, but on My Birthday he does. Or does he?
The truth is neither way is the best way. Sometimes hypothetical Bob should eat the birthday cake and sometimes hypothetical birthday-kid should respect other people’s dietary choices. What we really have control over is our own preference for ourselves. My only advice is to tread delicately when having preferences over other people’s preferences.
If we think we know better (and we might but also might not) and are trying to uphold a preference over a preference (p/p), then what happens?
Either we are right, we are wrong, or something else happens. The outcome also depends on whether the other party conformed or not (or did something else), and on what happens when things resolve.
Examples:
 1. A is smoking
 2. B says not to because it's bad for you
 3. A doesn't stop
 4. It turns out to be bad for you
 5. A gets sick
B was right, tried to push a p/p and lost. (either by not pushing hard enough or by A being stubborn). Did the p/p serve any good here? Should it have happened? What if an alternative 5 exists; “A keeps smoking, never gets sick and lives to 90”. Then was the p/p useful?
 1. A is monogamous
 2. B says to be poly
 3. A does
 4. It goes badly
 5. A is hurt
B was wrong, tried to push a p/p and won. But was wrong and shouldn't have pushed it? Or maybe A shouldn't have conformed.
This can be represented in a table:

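|                              | B prefers to maintain P/P    | B does not maintain P/P                          |
|------------------------------|------------------------------|--------------------------------------------------|
| A is susceptible to pressure | A gives in                   | A does not change (because there is no pressure) |
| A is not susceptible         | A does not change (stubborn) | A does not change (because there is no pressure) |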
And a second table of results:
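|                      | change was negative (or caused a negative result) | change was positive (or caused a positive result) |
|----------------------|---------------------------------------------------|---------------------------------------------------|
| A is susceptible     | A loses.                                          | A wins!                                           |
| A is not susceptible | A wins!                                           | A loses.                                          |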
Assuming also that if A loses; B takes a hit as well. Ideally we want everyone to win all the time. But just showing these things in a table is not enough. We should be assigning estimated probability to these choices as well.
For example (my made up numbers of whether I think smoking will lead to a bad result):
Smoking: 
98% smoking causes problems 
2% smoking does not cause problems. 
If we edit the earlier table:
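| Smoking                          | B prefers to maintain P/P                                          | B does not maintain P/P                                                                  |
|----------------------------------|--------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| A is susceptible to pressure     | A gives in (2% estimate that the change was pointless)             | A does not change (because there is no pressure) (98% estimate that this is a bad outcome) |
| A is not susceptible to pressure | A does not change (stubborn) (98% estimate that this is a bad outcome) | A does not change (because there is no pressure) (98% estimate that this is a bad outcome) |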
To a rationalist, seeing your p/p table with estimates should help them to understand whether they should take you up on fulfilling your preference or not. This assumes, of course, that rationalists never lie and can accurately estimate the confidence of their beliefs.
If you meet someone with a 98% belief, they should be able to produce evidence that will also reasonably convince you of similar ideas and encourage you to update your beliefs. So maybe in the smoking case A should be listening to B, or checking the evidence very seriously.
What should you do when you hold a strong p/p that will be to your benefit at the same time as being to someone else’s detriment? (And part 2: what if you are unsure of the benefit or detriment?)
Examples:
B wants A to try a new street drug "splice". B says it's lots of fun and encourages A to do it. B is unsure of the risks, but sure of the benefits (lots of fun). Should B encourage A? (What more do we need to know to make that sort of judgement call?)
B has a sexual interest that is specific, and A’s preferences are indifferent. B could easily encourage A to "try this out". Should B?
B has an old crappy car that B doesn’t like very much. B prefers to make friends with shady A’s who will steal the car. Then B can claim on insurance that it was stolen, and get a nicer car with the payout. Should B?
B wants A to pay for the two of them to go on a carnival ride. The cost is simple (several dollars); the benefit is not. Should B pressure A? (What more do we need to know in order to answer that question?)
A always crosses the street dangerously because they are often running late. B believes that A should be safer and walk a distance to the nearest crossing before crossing the road; B knows that this will make A late. Should B pressure A? (Will more information help us answer?)
It was suggested that the Veil of ignorance might help to create a rule in this situation. However the bounds of this situation dictate that you know which party you are; and that you have a preference over a preference. So the Veil of ignorance does not so much apply to give us insight.
 It is possible to be a selfish entity, hold p/p and encourage others to fulfil your preference.
 It is also possible to be a non-influential entity, and never push a preference over others.
 It is possible to be a stubborn entity and never conform to someone else’s p/p.
 It is also possible to be a conforming entity and always conform.
It is also possible to be a mix of these 4 in different situations and/or different preferences.
Partial Solution
Know your preferences, know your p/p’s, and think very carefully about pushing your p/p’s, hiding your p/p’s, changing your preferences to conform, or being needlessly stubborn about your preferences. (Warning: this is hard; don’t think it’s easy just because it fits into one sentence.)
Knowing what your strong preferences are, knowing which of your preferences are potentially not beneficial for others, and understanding whether you have a tendency to push your p/p on other people will possibly help you to be more careful when handling p/p and avoid manipulating people (to their detriment). In addition to this, knowing what culture you come from and what culture others come from will help you to know how a weak p/p might be misinterpreted as a strong p/p (see "ask culture", "guess culture" and "tell culture"). Some cultures aim to please when asked, and ask little of each other; some cultures are stubborn, vocal and demanding. In the middle of the two cultures is the crazy-confused zone. Of course these are the obvious cases. Sometimes a cultural taboo will come up around some topics and not others; e.g. dinner etiquette might be something you never ask about, because it would be bad etiquette, but expressing a strong preference over what you want to drink is expected.
In conclusion, there are no rules to be drawn around p/p other than: try to understand it and how it can go wrong, and be careful.
Meta: 4.5 hours to write, 30mins to take feedback and edit. Thanks to the slack for being patient while I asked tricky example questions.
My Table of contents  contains links to the other things I have written.
Further comments adjustments and suggestions welcome.
The Heuristic About Representativeness Heuristic
(x-posted from my blog)
The thing is, there is a fundamental difference between "How strongly E resembles H" and "How strongly H implies E". The latter question is about P(E|H), and this number could be used in Bayesian reasoning, if you add P(E|!H) and P(H)[1]. The former question (the question humans actually answer when asked to judge whether something is likely) sometimes just cannot be salvaged at all.
Several examples to get the point across:
So, the answer to "how strongly E resembles H?" is very different from "how much is P(E|H)?". No amount of accounting for base rate is going to fix this.
2) Suppose that some analysis comes out too strongly in favor of some hypothesis.
Maybe some paper argues that leaded gasoline accounts for 90% of the variation in violent crime (credit for this example goes to /u/yodatsracist on reddit). Or some ridiculously simple school intervention is claimed to have a gigantic effect size.
Let's take leaded gasoline, for example. On the surface, this data strongly "resembles" a world where leaded gasoline is indeed causing violence, since 90% suggests that the effect is very large and very unlikely to be a fluke. On the other hand, this effect is too large, and the 10% left for "other factors" (including but not limited to: abortion rate, economic situation, police budget, alcohol consumption, imprisonment rate) is too small a percentage.
The decline we expect in a world of harmful leaded gasoline is more like 10% than 90%, so this model is too good to be true; instead of being very strong evidence in favor, this analysis could be either irrelevant (just a random botched analysis with faked data, nothing to see here) or offer some evidence against (for reasons related to the conservation of expected evidence, for example).
So, how should it be done? Remember that P(E|H) would be written as P(H -> E), were the notation a bit saner. P(E|H) answers "to which degree does H imply E?", so the correct method for answering this query involves imagining the world-where-H-is-true and asking yourself "how often does E occur here?" instead of answering the question "which world comes to my mind after seeing E?".
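As a rough illustration of why P(E|H) and P(E|!H) are the numbers that matter, here is a small Python sketch of Bayes' rule. The function name and all the probabilities are made up for the example; in particular, the figures for how often a "90% of variation" result shows up in each world are assumptions, not estimates from any study.

```python
def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """Bayes' rule: P(H|E) from P(H), P(E|H) and P(E|!H)."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1.0 - p_h)
    return p_e_given_h * p_h / p_e

prior = 0.3  # hypothetical P(H): leaded gasoline is harmful

# E = "a paper reports a 90% effect". If such a result is equally rare in both
# worlds, seeing it should not move you at all:
print(posterior(prior, 0.02, 0.02))

# If it is even rarer in the world where H is true (because the true effect
# would be closer to 10%), it is weak evidence against H:
print(posterior(prior, 0.01, 0.02))
```

The point of the sketch is only that "how strongly E resembles H" never appears as an input; the update is driven entirely by how often E occurs in each imagined world.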
[1] And often just using base rate is good enough, but this is another, even less correct heuristic. See: Prosecutor's Fallacy.
Mental Calibration for Bayesian Updates?
Hey all,
After reading "How to Measure Anything" I've experimented a bit with calibration training and with his calibration tools and, after being convinced by his data on the usefulness of calibration in real-world forecasting, have seen a big update in my own calibration.
I'm wondering if anybody knows of similar tools and studies on calibration of Bayesian updating. Broadly, I imagine it would look like:
1. Using the tools and calibration methods I already use to figure out how the feeling of "correctness" of my prior correlates to a numerical value.
2. Using similar (but probably not identical) tools to figure out how the feeling of how "convincing" the new data is correlates to specific numbers.
3. Calibrating these two numbers to Bayes' theorem, such that I know approximately how much to update the original feeling to reflect the new information.
4. Using mnemonic or visualization techniques to pair the new feeling with the belief, so that next time I remember the belief, I feel the slightly different calibration.
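For step 3, one possible target to calibrate against is the odds form of Bayes' theorem: posterior odds = prior odds x likelihood ratio. The sketch below assumes the "correctness" feeling has already been mapped to a probability and the "convincingness" feeling to a likelihood ratio; both mappings, and the function name, are hypothetical.

```python
def update(prior, likelihood_ratio):
    """Odds-form Bayes update: returns the posterior probability."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

# A 50% belief, given evidence that is 3x as likely if the belief is true:
print(update(0.5, 3.0))
# The same evidence moves a 90% belief much less in absolute terms:
print(update(0.9, 3.0))
```

Comparing what this returns with how much your feeling actually shifted would be one way to notice systematic over- or under-updating.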
Anyways, I'm curious if anyone has experimented with these processes, if there's any research on it, or if it has been previously experimented with on LessWrong. I'd definitely like to lock down a similar procedure for myself.
I should note that many times, I already do this naturally... but my guess is I systematically over- and under-update the feeling based on confirmation bias. I'd like to recalibrate my recalibration :).
Does Probability Theory Require Deductive or Merely Boolean Omniscience?
It is often said that a Bayesian agent has to assign probability 1 to all tautologies, and probability 0 to all contradictions. My question is... exactly what sort of tautologies are we talking about here? Does that include all mathematical theorems? Does that include assigning 1 to "Every bachelor is an unmarried male"?[1] Perhaps the only tautologies that need to be assigned probability 1 are those that are Boolean theorems implied by atomic sentences that appear in the prior distribution, such as: "S or ~S".
It seems that I do not need to assign probability 1 to Fermat's last theorem in order to use probability theory when I play poker, or try to predict the color of the next ball to come from an urn. I must assign a probability of 1 to "The next ball will be white or it will not be white", but Fermat's last theorem seems to be quite irrelevant. Perhaps that's because these specialized puzzles do not require sufficiently general probability distributions; perhaps, when I try to build a general Bayesian reasoner, it will turn out that it must assign 1 to Fermat's last theorem.
Imagine a (completely impractical, ideal, and esoteric) first-order language, whose particular subjects are discrete point-like regions of space-time. There can be an arbitrarily large number of points, but it must be a finite number. This language also contains a long list of predicates like: is blue, is within the volume of a carbon atom, is within the volume of an elephant, etc. and generally any predicate type you'd like (including n-place predicates).[2] The atomic propositions in this language might look something like: "5, 0.487, 7098.6, 6000s is Blue" or "(1, 1, 1, 1s), (1, 1, 1, 1s) contains an elephant." The first of these propositions says that a certain point in space-time is blue; the second says that there is an elephant between two points at one second after the universe starts. Presumably, at least the denotational content of most English propositions could be expressed in such a language (I think, mathematical claims aside).
Now imagine that we collect all of the atomic propositions in this language, and assign a joint distribution over them. Maybe we choose max entropy; it doesn't matter. Would doing so really require us to assign 1 to every mathematical theorem? I can see why it would require us to assign 1 to every tautological Boolean combination of atomic propositions [for instance: "(1, 1, 1, 1s), (1, 1, 1, 1s) contains an elephant OR ~((1, 1, 1, 1s), (1, 1, 1, 1s) contains an elephant)"], but that would follow naturally as a consequence of filling out the joint distribution. Similarly, all the Boolean contradictions would be assigned zero, just as a consequence of filling out the joint distribution table with a set of reals that sum to 1.
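To see the "follows from filling out the table" point on a tiny scale, here is a Python sketch with three made-up atomic propositions and an arbitrary joint distribution over their truth assignments. The atom names and the random weights are assumptions for illustration only; nothing here touches arithmetic or mathematical theorems.

```python
from fractions import Fraction
from itertools import product
import random

# Hypothetical atomic propositions standing in for the language above.
atoms = ["elephant_here", "point_is_blue", "inside_carbon_atom"]
worlds = list(product([False, True], repeat=len(atoms)))

# An arbitrary joint distribution: positive weights, normalised to sum to 1.
rng = random.Random(0)
weights = [rng.randint(1, 100) for _ in worlds]
joint = {w: Fraction(x, sum(weights)) for w, x in zip(worlds, weights)}

def prob(formula):
    """Probability of a Boolean formula = total mass of the worlds where it holds."""
    return sum((p for w, p in joint.items() if formula(dict(zip(atoms, w)))), Fraction(0))

tautology = prob(lambda v: v["elephant_here"] or not v["elephant_here"])
contradiction = prob(lambda v: v["elephant_here"] and not v["elephant_here"])
contingent = prob(lambda v: v["elephant_here"] and v["point_is_blue"])
print(tautology, contradiction, contingent)
```

Whatever weights are chosen, the tautology collects the mass of every row and the contradiction collects none, so the 1 and the 0 come from the table itself rather than from any extra axiom.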
A similar argument could be made using intuitions from algorithmic probability theory. Imagine that we know that some data was produced by a distribution which is output by a program of length n in a binary programming language. We want to figure out which distribution it is. So, we assign each binary string a prior probability of 2^-n. If the language allows for comments, then simpler distributions will be output by more programs, and we will add the probability of all programs that print that distribution.[3] Sure, we might need an oracle to figure out if a given program outputs anything at all, but we would not need to assign a probability of 1 to Fermat's last theorem (or at least I can't figure out why we would). The data might be all of your sensory inputs, and n might be Graham's number; still, there's no reason such a distribution would need to assign 1 to every mathematical theorem.
Conclusion:
A Bayesian agent does not require mathematical omniscience, or logical (if that means anything more than Boolean) omniscience, but merely Boolean omniscience. All that Boolean omniscience means is that for whatever atomic propositions appear in the language (e.g., the language that forms the set of propositions that constitute the domain of the probability function) of the agent, any tautological Boolean combination of those propositions must be assigned a probability of 1, and any contradictory Boolean combination of those propositions must be assigned 0. As far as I can tell, the whole notion that Bayesian agents must assign 1 to tautologies and 0 to contradictions comes from the fact that when you fill out a table of joint distributions (or follow the Kolmogorov axioms in some other way) all of the Boolean theorems get a probability of 1. This does not imply that you need to assign 1 to Fermat's last theorem, even if you are reasoning probabilistically in a language that is very expressive.[4]
Some Ways To Prove This Wrong:
Show that a really expressive semantic language, like the one I gave above, implies PA if you allow Boolean operations on its atomic propositions. Alternatively, you could show that Solomonoff induction can express PA theorems as propositions with probabilities, and that it assigns them 1. This is what I tried to do, but I failed on both occasions, which is why I wrote this.
[1] There are also interesting questions about the role of tautologies that rely on synonymy in probability theory, and whether they must be assigned a probability of 1, but I decided to keep it to mathematics for the sake of this post.
[2] I think this language is ridiculous, and openly admit it has next to no real world application. I stole the idea for the language from Carnap.
[3] This is a sloppily presented approximation to Solomonoff induction as n goes to infinity.
[4] The argument above is not a mathematical proof, and I am not sure that it is airtight. I am posting this to the discussion board instead of a full-blown post because I want feedback and criticism. !!!HOWEVER!!! if I am right, it does seem that folks on here, at MIRI, and in the Bayesian world at large, should start being more careful when they think or write about logical omniscience.
The Dice Room, Human Extinction, and Consistency of Bayesian Probability Theory
I'm sure that many of you here have read Quantum Computing Since Democritus. In the chapter on the anthropic principle the author presents the Dice Room scenario as a metaphor for human extinction. The Dice Room scenario is this:
1. You are in a world with a very, very large population (potentially unbounded.)
2. There is a madman who kidnaps 10 people and puts them in a room.
3. The madman rolls two dice. If they come up snake eyes (both ones) then he murders everyone.
4. Otherwise he releases everyone, then goes out and kidnaps 10 times as many people as before, and returns to step 3.
The question is this: if you are one of the people kidnapped at some point, what is your probability of dying? Assume you don't know how many rounds of kidnappings have preceded yours.
As a metaphor for human extinction, think of the population of this world as being all humans who ever have or ever may live, each batch of kidnap victims as a generation of humanity, and rolling snake eyes as an extinction event.
The book gives two arguments, which are both purported to be examples of Bayesian reasoning:
1. The "proximate risk" argument says that your probability of dying is just the prior probability that the madman rolls snake eyes for your batch of kidnap victims: 1/36.
2. The "proportion murdered" argument says that about 9/10 of all people who ever go into the Dice Room die, so your probability of dying is about 9/10.
Obviously this is a problem. Different decompositions of a problem should give the same answer, as long as they're based on the same information.
I claim that the "proportion murdered" argument is wrong. Here's why. Let pi(t) be the prior probability that you are in batch t of kidnap victims. The proportion murdered argument relies on the property that pi(t) increases exponentially with t: pi(t+1) = 10 * pi(t). If the madman murders at step t, then your probability of being in batch t is
pi(t) / SUM(u: 1 <= u <= t: pi(u))
and, if pi(u+1) = 10 * pi(u) for all u < t, then this does indeed work out to about 9/10. But the values pi(t) must sum to 1; thus they cannot increase indefinitely, and in fact it must be that pi(t) -> 0 as t -> infinity. This is where the "proportion murdered" argument falls apart.
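Both halves of that argument can be checked with a few lines of Python. This is only a sketch: the function names and the starting value 1/10^6 for pi(1) are assumptions made for illustration, not part of the original analysis.

```python
from fractions import Fraction


def share_in_final_batch(t, growth=10):
    """pi(t) / SUM(u: 1 <= u <= t: pi(u)) when pi(u+1) = growth * pi(u)."""
    weights = [growth**u for u in range(t)]  # relative sizes of batches 1..t
    return Fraction(weights[-1], sum(weights))


def batches_before_overflow(pi_1, growth=10):
    """How many batches an exponentially growing pi can cover before its sum exceeds 1."""
    total, term, t = Fraction(0), Fraction(pi_1), 0
    while total + term <= 1:
        total += term
        term *= growth
        t += 1
    return t


for t in (1, 2, 3, 6):
    share = share_in_final_batch(t)
    print(t, share, float(share))  # approaches 9/10 from above

# An exponentially growing prior runs out of probability mass after finitely many batches.
print(batches_before_overflow(Fraction(1, 10**6)))
```

The first function reproduces the "about 9/10" figure; the second shows that the exponential growth it relies on can only hold for a bounded number of batches once the values are required to sum to 1.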
For a more detailed analysis, take a look at
http://bayesium.com/doomsdayandthediceroommurders/
This forum has a lot of very smart people who would be well-qualified to comment on that analysis, and I would appreciate hearing your opinions.
[link] Bayesian inference with probabilistic population codes
Bayesian inference with probabilistic population codes by Wei Ji Ma et al 2006
Recent psychophysical experiments indicate that humans perform near-optimal Bayesian inference in a wide variety of tasks, ranging from cue integration to decision making to motor control. This implies that neurons both represent probability distributions and combine those distributions according to a close approximation to Bayes’ rule. At first sight, it would seem that the high variability in the responses of cortical neurons would make it difficult to implement such optimal statistical inference in cortical circuits. We argue that, in fact, this variability implies that populations of neurons automatically represent probability distributions over the stimulus, a type of code we call probabilistic population codes. Moreover, we demonstrate that the Poisson-like variability observed in cortex reduces a broad class of Bayesian inference to simple linear combinations of populations of neural activity. These results hold for arbitrary probability distributions over the stimulus, for tuning curves of arbitrary shape and for realistic neuronal variability.
Note that "humans perform near-optimal Bayesian inference" refers to the integration of information, not conscious symbolic reasoning. Nonetheless I think this is of interest here.
How urgent is it to intuitively understand Bayesianism?
The current state of my understanding (briefly):
 I very much understand reductionism and the distinction between the map and the territory. And I very much understand that probability is in the mind.
 From what I understand, prior probability is just the probability you thought something was going to happen before having observed some evidence, and posterior probability is just the probability you think something will happen after having observed that evidence.
 I don't really have a precise way of using evidence to update my beliefs though. I'm trying to think of and explain how I currently use evidence to update my beliefs, and I'm disappointed to say that I am struggling. I guess I just sort of think something along the lines of "It'd be unlikely that I observe X if A were really true. I observed X. I think it's less likely that A is true now."
 I've made attempts at learning Bayes' Theorem and stuff. When I think it through slowly, it makes sense. But it really takes me time to think it through. Without referring to explanations and thinking it through, I forget it. And I know that that demonstrates my lack of "true" understanding. In general, my short term memory and ability to reason through quantitative things quickly seems to be well above average, but far from elite. Probably way below average amongst this community.
 What are the practical benefits of having an intuitive understanding of Bayes' Theorem? If it helps, please name an example of how it impacted your day today.
 I mention in 3) that it takes me time to think it through. To those of you who consider yourselves to have an intuitive understanding, do you have to think it through, or do you instinctively update in a Bayesian way?
 How urgent is it to intuitively understand Bayesian thinking? To use me as an example, my short-to-mid-term goals include getting good at programming and starting a startup. I have a ways to go, and am working towards these things. So I spend most of my time learning programming right now. Is it worth me taking a few weeks/months to study probability?
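For reference, the informal update described in the list above ("X would be unlikely if A were true; I observed X; A is now less likely") is what Bayes' Theorem computes. A minimal sketch, with made-up numbers that are purely illustrative:

```python
def posterior(prior_a, p_x_given_a, p_x_given_not_a):
    """P(A | X) from Bayes' Theorem for a yes/no hypothesis A."""
    joint_a = prior_a * p_x_given_a                # P(A and X)
    joint_not_a = (1 - prior_a) * p_x_given_not_a  # P(not A and X)
    return joint_a / (joint_a + joint_not_a)


# Assumed numbers: A starts at 50%, X is unlikely if A is true (10%)
# and fairly likely if A is false (60%).
print(posterior(0.5, 0.1, 0.6))  # well below the 0.5 prior
```

The prior is the first argument, the posterior is the return value, and the evidence enters only through how likely X is under each hypothesis.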
Why IQ shouldn't be considered an external factor
This is a sort-of response to this post.
"Things under your control" (more generally, free will) is an ill-defined concept: you are an entity within physics; all of your actions and thoughts are fully determined by physical processes in your brain. Here, I will assume that "things under your control" are any things that are controlled by your brain, since it is a consistent definition, and it's what people usually mean when they talk about things under one's control.
So, you may be interested in the question: how much does one's success depend on his thoughts and actions (i.e. things that are controlled by his brain) vs. on the circumstances/environment (i.e. things that aren't)? Another formulation: how much could you change one's life outcomes if you could alter the neural signals emitted by his brain?
We also could draw the borderline somewhere else; maybe add physical traits, like height or attractiveness, to the "internal factors" category, or maybe assign some brain parts to the "external factors" category. The question whether your life success is mostly determined by "internal factors" or "external factors" would remain valid, and we call it the "internal vs. external locus of control" question.
But what happens when we assign IQ to the "external factors" category?
An IQ test is an attempt to measure some value, which is supposed to be a measure of something like the quality of one's thinking process. So, this value can be seen as a function IQ(brain), which maps brains to numbers. Your thoughts and actions don't depend on your IQ score; your IQ score depends on your thoughts. That's how the causal arrows are arranged.
But it's possible to ask what we can change if we can change the brain, conditional on a fixed IQ score. But then the "free will" intuition collapses; it's hard to imagine what we could change if our thought processes were restricted in some weird way. And such a question is hardly practical, in my opinion. It's true that one can measure his IQ, and that IQ rarely changes much, but still: if you consider IQ a fixed, external factor out of your control, then you must consider your thought processes restricted to some set and, therefore, not totally under your control.
Define "things under your control" as "things under your brain neural signals' control", and then we will have a consistent definition, and we will find ourselves in the common sense domain. Declare that everything is under control of physics, and then we will, again, have a consistent definition of "things under your control" (empty set), and now we are in the physics domain. Both cases are quite intuitive.
But when we consider IQ external, "things under your control" are your thoughts, but not quite; we can control our thoughts, but only as long as they reside on some weird manifold of thought-space. I guess that in such a case, your "free will" intuitions would be disrupted. Basically, we can't slice some part of what we call "personality" out and still have our intuitions about personality and free will sane.
TL;DR: You shouldn't consider any functions of your current brain state as external when discussing locus of control, since such a viewpoint is actually counterintuitive and, therefore, makes you prone to errors.
How do you notice when you are ignorant of necessary alternative hypotheses?
So I just wound up in a debate with someone over on Reddit about the value of conventional academic philosophy. He linked me to a book review, in which both the review and the book are absolutely god-awful. That is, the author (and the reviewer following him) starts with ontological monism (the universe only contains a single kind of Stuff: mass-energy), adds in the experience of consciousness, reasons deftly that emergence is a load of crap... and then arrives at the conclusion of panpsychism.
WAIT HOLD ON, DON'T FLAME YET!
Of course panpsychism is bunk. I would be embarrassed to be caught upholding it, given the evidence I currently have, but what I want to talk about is the logic being followed.
1) The universe is a unified, consistent whole. Good!
2) The universe contains the experience/existence of consciousness. Easily observable.
3) If consciousness exists, something in the universe must cause or give rise to consciousness. Good reasoning!
4) "Emergence" is a non-explanation, so that can't be it. Good!
5) Therefore, whatever stuff the unified universe is made of must be giving rise to consciousness in a non-emergent way.
6) Therefore, the stuff must be innately "mindy".
What went wrong in steps (5) and (6)? The man was actually reasoning more-or-less correctly! Given the universe he lived in, and the impossibility of emergence, he reallocated his probability mass to the remaining answer. When he had eliminated the impossible, whatever remained, however low its prior, must be true.
The problem was, he eliminated the impossible, but left open a huge vast space of possible hypotheses that he didn't know about (but which we do): the most common of these is the computational theory of mind and consciousness, which says that we are made of cognitive algorithms. A Solomonoff Inducer can just go on to the next length of bitstrings describing Turing machines, but we can't.
Now, I can spot the flaw in the reasoning here. What frightens me is: what if I'm presented with some similar argument, and I can't spot the flaw? What if, instead, I just neatly and stupidly reallocate my belief to what seems to me to be the only available alternative, while failing to go out and look for alternatives I don't already know about? Notably, it seems like expected evidence is conserved, but expecting to locate new hypotheses means I should be reducing my certainty about all currently-available hypotheses now, to have some left over for dividing between the new possibilities.
If you can notice when you're confused, how do you notice when you're ignorant?
[LINK] The Mathematics of Gamification - Application of Bayes Rule to Voting
Fresh from slashdot: A smart application of Bayes' rule to web voting.
http://engineering.foursquare.com/2014/01/03/themathematicsofgamification/
[The results] are exactly the equations for voting you would expect. But now, they’re derived from math!
The Benefits
 Efficient, data-driven guarantees about database accuracy. By choosing the points based on a user’s accuracy, we can intelligently accrue certainty about a proposed update and stop the voting process as soon as the math guarantees the required certainty.
 Still using points, just smart about calculating them. By relating a user’s accuracy and the certainty threshold needed to accept a proposed update to an additive point system (2), we can still give a user the points that they like. This also makes it easy to take a system of ad-hoc points and convert it over to a smarter system based on empirical evidence.
 Scalable and easily extensible. The parameters are automatically trained and can adapt to changes in the behavior of the userbase. No more long meetings debating how many points to grant to a narrow use case.
So far, we’ve taken a very user-centric view of p_k (this is the accuracy of user k). But we can go well beyond that. For example, p_k could be “the accuracy of user k’s vote given that they have been to the venue three times before and work nearby.” These clauses can be arbitrarily complicated and estimated from a (logistic) regression of the honeypot performance. The point is that these changes will be based on data and not subjective judgments of how many “points” a user or situation should get.
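The excerpt above does not reproduce the linked post's equations, so the following is only a sketch of the standard way an additive point system falls out of Bayes' rule, not the linked post's actual code. It assumes voters are independent and equally accurate on true and false updates; `vote_points`, `tally`, and the 0.99 threshold are hypothetical names and values.

```python
import math


def vote_points(accuracy):
    """Log-odds weight ("points") of one vote from a user with the given accuracy."""
    return math.log(accuracy / (1 - accuracy))


def tally(votes, prior=0.5, threshold=0.99):
    """votes: iterable of (accuracy, agrees) pairs. Returns (decision, posterior).

    Points are added for "yes" votes and subtracted for "no" votes; voting
    stops as soon as the posterior crosses the certainty threshold.
    """
    log_odds = math.log(prior / (1 - prior))
    for accuracy, agrees in votes:
        points = vote_points(accuracy)
        log_odds += points if agrees else -points
        p = 1 / (1 + math.exp(-log_odds))
        if p >= threshold:
            return "accept", p
        if p <= 1 - threshold:
            return "reject", p
    return "undecided", 1 / (1 + math.exp(-log_odds))


# Two 90%-accurate users and one 80%-accurate user all vote "yes".
print(tally([(0.9, True), (0.9, True), (0.8, True)]))
# Two equally accurate users disagree, so their points cancel.
print(tally([(0.9, True), (0.9, False)]))
```

Because log-odds add, each vote contributes a fixed number of points that depends only on the voter's accuracy, which is why a points system and Bayes' rule can agree.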
I wonder whether and how this could be applied to voting here as LW posts are not 'correct' per se.
One rather theoretical possibility would be to assign prior correctness to some posts e.g. the sequences and then use that to determine the 'accuracy' of users based on that.
Am I Understanding Bayes Right?
Hello, everyone.
I'm relatively new here as a user rather than as a lurker, but even after trying to read every tutorial on Bayes' Theorem I could get my hands on, I'm still not sure I understand it. So I was hoping that I could explain Bayesianism as I understand it, and some more experienced Bayesians could tell me where I'm going wrong (or maybe if I'm not going wrong and it's a confidence issue rather than an actual knowledge issue). If this doesn't interest you at all, then feel free to tap out now, because here we go!
Abstraction
Bayes' Theorem is an application of probability. Probability is an abstraction based on logic, which is in turn based on possible worlds. By this I mean that they are both maps that refer to multiple territories: whereas a map of Cincinnati (or a "map" of what my brother is like, for instance) refers to just one thing, abstractions are good for more than one thing. Trigonometry is a map of not just this triangle here, but of all triangles everywhere, to the extent that they are triangular. Because of this it is useful even for triangular objects that one has never encountered before, but only tells you about them partially (e.g. it won't tell you the lengths of the sides, because that wouldn't be part of the definition of a triangle). Also, it only works at scales at which the object in question approximates a triangle (i.e. the "triangle" map is probably useful at macroscopic scales, but breaks down as you get smaller).
Logic and Possible Worlds
Logic is an attempt to construct a map that covers as much territory as possible, ideally all of it. Thus when people say that logic is true at all times, at all places, and with all things, they aren't really telling you about the territory, they're telling you about the purpose of logic (in the same way that the "triangle" map is ideally useful for triangles at all times, at all places).
One form of logic is Propositional Logic. In propositional logic, all the possible worlds are imagined as points. Each point is exactly one possible world: a logically-possible arrangement that gives a value to all the different variables in the universe. Ergo no two possible universes are exactly the same (though they will share elements).
These possible universes are then joined together in sets called "propositions". These "sets" are Venn diagrams (or what George Lakoff refers to as "container schemas"). Thus, for any given set, every possible universe is either inside or outside of it, with no middle ground (see "questions" below). Thus if the set I'm referring to is the proposition "The Snow is White", that set would include all possible universes in which the snow is white. The rules of propositional logic follow from the container schema.
Bayesian Probability
If propositional logic is about what's inside a set or outside of a set, probability is about the size of the sets themselves. Probability is a measurement of how many possible worlds are inside a set, and conditional probability is about the size of the intersections of sets.
Take the example of the dragon in your garage. To start with, there either is or isn't a dragon in your garage. Both sets of possible worlds have elements in them. But if we look in your garage and don't see a dragon, then that eliminates all the possibilities of there being a *visible* dragon in your garage, and thus eliminates those possible universes from the 'there is a dragon in your garage' set. In other words, the probability of that being true goes down. And because not seeing a dragon in your garage would be what you would expect if there in fact isn't a dragon in your garage, that set remains intact. Then if we look at the ratio of the remaining possible worlds, we see that the probability of the no-dragon-in-your-garage set has gone up, not in absolute terms (because the set of all possible worlds is what we started with; there isn't any more!) but relative to the alternate hypothesis (in the same way that if the denominator of a fraction goes down, the size of the fraction goes up).
This is what Bayes' Theorem is about: the use of process of elimination to eliminate *part* of the set of a proposition, thus providing evidence against it without it being a full refutation.
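Here is the dragon example as a small Python sketch of that process of elimination. The three kinds of world and their weights are numbers I made up purely for illustration.

```python
from fractions import Fraction

# Assumed weights for three kinds of possible world (they sum to 1).
worlds = {
    ("no dragon", "nothing to see"): Fraction(90, 100),
    ("dragon", "invisible"): Fraction(1, 100),
    ("dragon", "visible"): Fraction(9, 100),
}


def probability(event, weighted_worlds):
    """Size of the set of worlds where the event holds, relative to all remaining worlds."""
    total = sum(weighted_worlds.values())
    return sum(w for world, w in weighted_worlds.items() if event(world)) / total


def has_dragon(world):
    return world[0] == "dragon"


before = probability(has_dragon, worlds)

# Looking and seeing nothing eliminates only the visible-dragon worlds.
remaining = {world: w for world, w in worlds.items() if world != ("dragon", "visible")}
after = probability(has_dragon, remaining)

print(before, after)  # 1/10 1/91
```

The no-dragon worlds keep exactly the weight they started with; their probability rises only because the total left over after elimination is smaller.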
Naturally, this all takes place in one's mind: the world doesn't shift around you just because you've encountered new information. Probability is in this way subjective (it has to do with maps, not territories per se), but it's not arbitrary: as long as you accept that possible worlds/logic metaphor, it necessarily follows.
Questions/trouble points that I'm not sure of:
*I keep seeing probability referred to as an estimation of how certain you are in a belief. And while I guess it could be argued that you should be certain of a belief relative to the number of possible worlds left or whatever, that doesn't necessarily follow. Does the above explanation differ from how other people use probability?
*Also, if probability is defined as an arbitrary estimation of how sure you are, why should those estimations follow the laws of probability? I've heard the Dutch book argument, so I get why there might be practical reasons for obeying them, but unless you accept a pragmatist epistemology, that doesn't provide reasons why your beliefs are more likely to be true if you follow them. (I've also heard of Cox's rules, but I haven't been able to find a copy. And if I understand right, they say that Bayes' theorem follows from Boolean logic, which is similar to what I've said above, yes?)
*Another question: above I used propositional logic, which is okay, but it's not exactly the creme de la creme of logics. I understand that fuzzy logics work better for a lot of things, and I'm familiar with predicate logics as well, but I'm not sure what the interaction of any of them is with probability or the use of it, although I know that technically probability doesn't have to be binary (sets just need to be exhaustive and mutually exclusive for the Kolmogorov axioms to work, right?). I don't know, maybe it's just something that I haven't learned yet, but the answer really is out there?
Those are the only questions that are coming to mind right now (if I think of any more, I can probably ask them in comments). So anyone? Am I doing something wrong? Or do I feel more confused than I really am?
Crush Your Uncertainty
Bayesian epistemology and decision theory provide a rigorous foundation for dealing with mixed or ambiguous evidence, uncertainty, and risky decisions. You can't always get the epistemic conditions that classical techniques like logic or maximum likelihood require, so this is seriously valuable. However, having internalized this new set of tools, it is easy to fall into the bad habit of failing to avoid situations where it is necessary to use them.
When I first saw the light of an epistemology based on probability theory, I tried to convince my father that the Bayesian answer to problems involving an unknown process (eg. Laplace's rule of succession) was superior to the classical (eg. maximum likelihood) answer. He resisted, with the following argument:
 The maximum likelihood estimator plus some measure of significance is easier to compute.
 In the limit of lots of evidence, this agrees with Bayesian methods.
 When you don't have enough evidence for statistical significance, the correct course of action is to collect more evidence, not to take action based on your current knowledge.
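For readers who have not met the two estimators being argued over, here is a minimal sketch. The sample counts are made up for illustration; the formulas are the standard ones for a yes/no process with a uniform prior.

```python
from fractions import Fraction


def max_likelihood(successes, trials):
    """Classical estimate of the success probability: the observed frequency."""
    return Fraction(successes, trials)


def rule_of_succession(successes, trials):
    """Laplace's rule: posterior mean of the success probability under a uniform prior."""
    return Fraction(successes + 1, trials + 2)


for successes, trials in [(0, 3), (3, 3), (700, 1000)]:
    mle = max_likelihood(successes, trials)
    laplace = rule_of_succession(successes, trials)
    print(successes, trials, float(mle), float(laplace))
```

With three trials the two disagree sharply (0 versus 1/5, 1 versus 4/5); with a thousand they differ only in the fourth decimal place, which is the second point of the argument above.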
I added conditions (eg. what if there is no more evidence and you have to make a decision now?) until he grudgingly stopped fighting the hypothetical and agreed that the Bayesian framework was superior in some situations (months later, mind you).
I now realize that he was right to fight that hypothetical, and he was right that you should prefer classical max likelihood plus significance in most situations. But of course I had to learn this the hard way.
It is not always, or even often, possible to get overwhelming evidence. Sometimes you only have visibility into one part of a system. Sometimes further tests are expensive, and you need to decide now. Sometimes the decision is clear even without further information. The advanced methods can get you through such situations, so it's critical to know them, but that doesn't mean you can laugh in the face of uncertainty in general.
At work, I used to do a lot of what you might call "cowboy epistemology". I quite enjoyed drawing useful conclusions from minimal evidence and careful probability-literate analysis. Juggling multiple hypotheses and visualizing probability flows between them is just fun. This seems harmless, or even helpful, but it meant I didn't take gathering redundant data seriously enough. I now think you should systematically and completely crush your uncertainty at all opportunities. You should not be satisfied until exactly one hypothesis has non-negligible probability.
Why? If I'm investigating a system, and even though we are not completely clear on what's going on, the current data is enough to suggest a course of action, and value of information calculations say that decision is not likely enough to change to make further investigation worth it, why then should I go and do further investigation to pin down the details?
The first reason is the obvious one; stronger evidence can make up for human mistakes. While a lot can be said for its power, the human brain is not a precise instrument; sometimes you'll feel a little more confident, sometimes a little less. As you gather evidence towards a point where you feel you have enough, that random fluctuation can cause you to stop early. But this only suggests that you should have a small bias towards gathering a bit more evidence.
The second reason is that though you may be able to make the correct immediate decision, going into the future, that residual uncertainty will bite you back eventually. Eventually your habits and heuristics derived from the initial investigation will diverge from what's actually going on. You would not expect this in a perfect reasoner; they would always use their full uncertainty in all calculations, but again, the human brain is a blunt instrument, and likes to simplify things. What was once a nuanced probability distribution like 95% X, 5% Y might slip to just X when you're not quite looking, and then, 5% of the time, something comes back from the grave to haunt you.
The third reason is computational complexity. Inference with very high certainty is easy; it's often just simple direct math or clear intuitive visualizations. With a lot of uncertainty, on the other hand, you need to do your computation once for each of all (or some sample of) probable worlds, or you need to find a shortcut (eg analytic methods), which is only sometimes possible. This is an unavoidable problem for any bounded reasoner.
For example, you simply would not be able to design chips or computer programs if you could not treat transistors as infallible logical gates, and if you really really had to do so, the first thing you would do would be to build an error-correcting base system on top of which you could treat computation as approximately deterministic.
It is possible in small problems to manage uncertainty with advanced methods (eg. Bayes), and this is very much necessary while you decide how to get more certainty, but for unavoidable computational reasons, it is not sustainable in the long term, and must be a temporary condition.
If you take the habit of crushing your uncertainty, your model of situations can be much simpler and you won't have to deal with residual uncertainty from previous related investigations. Instead of many possible worlds and nuanced probability distributions to remember and gum up your thoughts, you can deal with simple, clear, unambiguous facts.
My previous cowboy-epistemologist self might have agreed with everything written here, but failed to really get that uncertainty is bad. Having just been empowered to deal with uncertainty properly, there was a tendency to not just be unafraid of uncertainty, but to think that it was OK, or even glorify it. What I'm trying to convey here is that that aesthetic is mistaken, and as silly as it feels to have to repeat something so elementary, uncertainty is to be avoided. More viscerally, uncertainty is uncool (unjustified confidence is even less cool, though).
So what's this all got to do with my father's classical methods? I still very much recommend thinking in terms of probability theory when working on a problem; it is, after all, the best basis for epistemology that we know of, and is perfectly adequate as an intuitive framework. It's just that it's expensive, and in the epistemic state you really want to be in, that expense is redundant in the sense that you can just use some simpler method that converges to the Bayesian answer.
I could leave you with an overwhelming pile of examples, but I have no particular incentive to crush your uncertainty, so I'll just remind you to treat hypotheses like zombies; always double tap.
Instinctive Frequentists, the Outside View, and de-Biasing
In "How to Make Cognitive Illusions Disappear: Beyond Heuristics and Biases", Gerd Gigerenzer attempts to show that the whole "Heuristics and Biases" approach to analysing human reasoning is fundamentally flawed and incorrect.
In that he fails. His case depends on using the frequentist argument that probabilities cannot be assigned to single events or situations of subjective uncertainty, thus removing the possibility that people could be "wrong" in the scenarios where the biases were tested. (It is interesting to note that he ends up constructing "Probabilistic Mental Models", which are frequentist ways of assigning subjective probabilities, just as long as you don't call them that!).
But that dodge isn't sufficient. Take the famous example of the conjunction fallacy, where most people are tricked into assigning a higher probability to "Linda is a bank teller AND is active in the feminist movement" than to "Linda is a bank teller". This error persists even when people take bets on the different outcomes. By betting more (or anything) on the first option, people are giving up free money. This is a failure of human reasoning, whatever one thinks about the morality of assigning probability to single events.
However, though the article fails to prove its case, it presents a lot of powerful results that may change how we think about biases. It presents weak evidence that people may be instinctive frequentist statisticians, and much stronger evidence that many biases can go away when the problems are presented in frequentist ways.
Now, it's known that people are more comfortable with frequencies than with probabilities. The examples in the paper extend that intuition. For instance, when people are asked:
There are 100 persons who fit the description above (i.e., Linda's). How many of them are:
(a) bank tellers
(b) bank tellers and active in the feminist movement.
Then the conjunction fallacy essentially disappears (22% of people make the error, rather than 85%). That is a huge difference.
Similarly, overconfidence. When people were given 50 general knowledge questions and asked to rate their confidence for their answer on each question, they were systematically, massively overconfident. But when they were asked afterwards "How many of these 50 questions do you think you got right?", they were... underconfident. But only very slightly: they were essentially correct in their self-assessments. This can be seen as a use of the outside view, a use that is, in this case, entirely justified. People know their overall accuracy much better than they know their specific accuracy.
A more intriguing example makes the base-rate fallacy disappear. Presenting the problem in a frequentist way makes the fallacy vanish when computing false positives for tests on rare diseases; that's compatible with the general theme. But it really got interesting when people actively participated in the randomisation process. In the standard problem, students were given thumbnail descriptions of individuals, and asked to guess whether they were more likely to be engineers or lawyers. Half the time the students were told the descriptions were drawn at random from 30 lawyers and 70 engineers; the other half, the proportions were reversed. It turns out that students assigned similar guesses to lawyer and engineer in both setups, showing they were neglecting to use the 30/70 or 70/30 base-rate information.
Gigerenzer modified the setups by telling the students the 30/70 or 70/30 proportions and then having the students themselves draw each description (blindly) out of an urn before assessing it. In that case, base-rate neglect disappears.
Now, I don't find that revelation quite as superlatively exciting as Gigerenzer does. Having the students draw the description out of the urn is pretty close to whacking them on the head with the base-rate: it really focuses their attention on this aspect, and once it's risen to their attention, they're much more likely to make use of it. It's still very interesting, though, and suggests some practical ways of overcoming the base-rate problem that stop short of saying "hey, don't forget the base-rate".
There is a large literature out there critiquing the heuristics and biases tradition. Even if they fail to prove their point, they're certainly useful for qualifying the biases and heuristics results, and, more interestingly, for suggesting practical ways of combating their effects.
[LINK] XKCD Comic #1236, Seashells and Bayes' Theorem
A fun comic about seashells and Bayes' Theorem. http://xkcd.com/1236/
Question about application of Bayes
I have successfully confused myself about probability again.
I am debugging an intermittent crash; it doesn't happen every time I run the program. After much confusion I believe I have traced the problem to a specific line (activating my debug logger, as it happens; irony...) I have tested my program with and without this line commented out. I find that, when the line is active, I get two crashes on seven runs. Without the line, I get no crashes on ten runs. Intuitively this seems like evidence in favour of the hypothesis that the line is causing the crash. But I'm confused on how to set up the equations. Do I need a probability distribution over crash frequencies? That was the solution the last time I was confused over Bayes, but I don't understand what it means to say "The probability of having the line, given crash frequency f", which it seems I need to know to calculate a new probability distribution.
I'm going to go with my intuition and code on the assumption that the debug logger should be activated much later in the program to avoid a race condition, but I'd like to understand this math.
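One possible way to set up the equations, offered as a minimal sketch rather than the definitive analysis: put a uniform prior over crash frequencies and compare two hypotheses. Under "same rate", the line makes no difference and all 17 runs share one crash frequency; under "different rates", the runs with and without the line each have their own frequency. The uniform priors and the names `beta_fn`, `evidence_same` and `evidence_diff` are assumptions made for illustration.

```python
from fractions import Fraction
from math import factorial


def beta_fn(a: int, b: int) -> Fraction:
    """Beta function B(a, b) for positive integers.

    Equals the integral of f**(a-1) * (1-f)**(b-1) over f in [0, 1].
    """
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


# Observed data from the post.
crashes_with, runs_with = 2, 7         # line active
crashes_without, runs_without = 0, 10  # line commented out

ok_with = runs_with - crashes_with
ok_without = runs_without - crashes_without

# "Same rate": one frequency f for all runs, uniform prior on f.
evidence_same = beta_fn(crashes_with + crashes_without + 1,
                        ok_with + ok_without + 1)

# "Different rates": independent uniform priors on the two frequencies.
evidence_diff = (beta_fn(crashes_with + 1, ok_with + 1)
                 * beta_fn(crashes_without + 1, ok_without + 1))

# Bayes factor in favour of "the line changes the crash rate".
bayes_factor = evidence_diff / evidence_same
print(bayes_factor, float(bayes_factor))
```

Under these assumptions the Bayes factor comes out at 102/77, about 1.32: the data do favour "the line matters", but only mildly, so a few more runs would be worth the time. Note that "different rates" also covers the case where the line makes crashes rarer; a sharper prior would change the number.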
[Link] Are Children Natural Bayesians?
This recent article at Slate thinks so:
Why Your 4-Year-Old Is As Smart as Nate Silver
It turns out that even very young children reason [using Bayes Theorem]. For example, my student Tamar Kushnir, now at Cornell, and I showed 4-year-olds a toy and told them that blocks made it light up. Then we gave the kids a block and asked them how to make the toy light up. Almost all the children said you should put the block on the toy; they thought, sensibly, that touching the toy with the block was very likely to make it light up. That hypothesis had a high “prior.”
Then we showed 4-year-olds that when you put a block right on the toy it did indeed make it light up, but it did so only two out of six times. But when you waved a block over the top of the toy, it lit up two out of three times. Then we just asked the kids to make the toy light up.
The children adjusted their hypotheses appropriately when they saw the statistical data, just like good Bayesians: they were now more likely to wave the block over the toy, and you could precisely predict how often they did so. What’s more, even though both blocks made the machine light up twice, the 4-year-olds, only just learning to add, could unconsciously calculate that two out of three is more probable than two out of six. (In a current study, my colleagues and I have found that even 24-month-olds can do the same).
There also seems to be a reference to the Singularity Institute:
The Bayesian idea is simple, but it turns out to be very powerful. It’s so powerful, in fact, that computer scientists are using it to design intelligent learning machines, and more and more psychologists think that it might explain human intelligence.
(Of course, I don't know how many other AI researchers are using Bayes Theorem, so the author also might not have the SI in mind)
If children really are natural Bayesians, then why and how do you think we change?
[Book Review] "The Signal and the Noise: Why So Many Predictions Fail—But Some Don’t.", by Nate Silver
Here's a link to a review, by The Economist, of a book about prediction, some of the common ways in which people make mistakes and some of the methods by which they could improve:
Looking ahead : How to look ahead—and get it right
One paragraph from that review:
A guiding light for Mr Silver is Thomas Bayes, an 18th-century English churchman and pioneer of probability theory. Uncertainty and subjectivity are inevitable, says Mr Silver. People should not get hung up on this, and instead think about the future the way gamblers do: “as speckles of probability”. In one surprising chapter, poker, a game from which Mr Silver once earned a living, emerges as a powerful teacher of the virtues of humility and patience.
Help me teach Bayes'
Next Monday I am supposed to introduce a bunch of middle school students to Bayes' theorem.
I've scoured the Internet for basic examples where Bayes' theorem is applied. Alas, all the explanations I've come across are, I believe, difficult to grasp for the average middle school student.
So what I am looking for is a straightforward explanation of Bayes' theorem that uses the least amount of Mathematics and words possible. (Also, my presentation has to be under 3 minutes.)
I think that it would be efficient in terms of learning for me to use coins or cards, something tangible to illustrate what I'm talking about.
What do you think? How should I teach 'em Bayes' ways?
PS: I myself am new to Bayesian probability.
How to use human history as a nutritional prior?
Nutrition is a case where we have to try to make the best possible use of the data we have no matter how terrible, because we have to eat something now to sustain us while we plan and conduct more experiments.
I want to apply Bayes' theorem to make rational health decisions from relatively weak data. I am generally wondering how one can synthesize historical human experiences with incomplete scientific data, in order to make risk-averse and healthy decisions about human nutrition given limited research.
Example question/hypothesis: Does gluten cause health problems (i.e., exhibit chronic toxicity) in non-coeliac humans? Is there enough evidence to suggest that avoiding gluten might be a prudent risk-averse decision for non-coeliacs?
We have some (mostly in vitro) scientific data suggesting that gluten may cause health problems in non-coeliac humans (such as these articles http://evolvify.com/thecaseagainstglutenmedicaljournalreferences/). Let's say, for the sake of argument, that I can somehow convert these studies into a non-unity likelihood ratio for gluten toxicity in humans (although suggestions are welcome here too).
However, we also have prior information that a population of humans has been consuming gluten-containing foods for at least 10,000 years, without any blatantly obvious toxic effects. Is there some way to convert this observation (and observations like this) into a prior probability distribution?
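For what it's worth, once a prior and a likelihood ratio are in hand, the update itself is mechanical. The sketch below shows the odds form of Bayes' theorem; the function name and both numbers are hypothetical placeholders chosen only to show the arithmetic, not estimates of anything about gluten.

```python
def update_with_likelihood_ratio(prior: float, likelihood_ratio: float) -> float:
    """Odds-form Bayes update: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical inputs, NOT real estimates:
HYPOTHETICAL_PRIOR = 0.05            # prior from "10,000 years without obvious harm"
HYPOTHETICAL_LIKELIHOOD_RATIO = 3.0  # strength assigned to the in vitro studies

posterior = update_with_likelihood_ratio(HYPOTHETICAL_PRIOR,
                                         HYPOTHETICAL_LIKELIHOOD_RATIO)
print(round(posterior, 4))
```

The hard part of the question, turning the historical record into the prior, is exactly the part this sketch leaves as an input.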
[Link] Better results by changing Bayes’ theorem
If it ever turns out that Bayes fails - receives systematically lower rewards on some problem, relative to a superior alternative, in virtue of its mere decisions - then Bayes has to go out the window.
- Eliezer Yudkowsky, Newcomb's Problem and Regret of Rationality
Don't worry, we don't have to abandon Bayes’ theorem yet. But changing it slightly seems to be the winning Way given certain circumstances. See below:
In Peter Norvig’s talk The Unreasonable Effectiveness of Data, starting at 37:42, he describes a translation algorithm based on Bayes’ theorem. Pick the English word that has the highest posterior probability as the translation. No surprise here. Then at 38:16 he says something curious.
So this is all nice and theoretical and pure, but as well as being mathematically inclined, we are also realists. So we experimented some, and we found out that when you raise that first factor [in Bayes' theorem] to the 1.5 power, you get a better result.
In other words, if we change Bayes’ theorem (!) we get a better result. He goes on to explain
Link: johndcook.com/blog/2012/03/09/monkeyingwithbayestheorem/
Peter Norvig - The Unreasonable Effectiveness of Data
Two phone apps that use Bayes to help doctors make better decisions [links]
Bayes at the Bedside and RxBayes come with databases of likelihood ratios and help doctors estimate probabilities from their phones. I met somebody at Singularity Summit 2011 who used one of them, but I can't remember which.
The Principle of Maximum Entropy
After having read the related chapters of Jaynes' book I was fairly amazed by the Principle of Maximum Entropy, a powerful method for choosing prior distributions. However it immediately raised a large number of questions.
I have recently read two quite intriguing (and very well-written) papers by Jos Uffink on this matter:
Can the maximum entropy principle be explained as a consistency requirement?
The constraint rule of the maximum entropy principle
I was wondering what you think about the principle of maximum entropy and its justifications.
[Link] A Bayes' Theorem Visualization
A while ago when Bret Victor's amazing article Up and Down the Ladder of Abstraction was being discussed, someone mentioned that they'd like to see one made for Bayes' Theorem. I've just completed version 1.0 of my "Bayes' Theorem Ladder of Abstraction", and it can be found here: http://www.coarsegra.in/?p=111
(It uses the HTML5 Canvas element, so it won't work with older versions of IE).
There are a few bugs in it, and it leaves out many things that I'd like to (eventually) include, but I'm reasonably satisfied with it as a first attempt. Any feedback on what works and what doesn't work, or what you think should be added, would be greatly appreciated.
Reading Math: Pearl, Causal Bayes Nets, and Functional Causal Models
Hi all,
I just started a doctoral program in psychology, and my research interest concerns causal reasoning. Since Pearl's Causality, the popularity of causal Bayes nets as psychological models for causal reasoning has really grown. Initially, I had some serious reservations, but now I'm beginning to think a great many of these are due in part to the oversimplified treatment that CBNs get in the psychology literature. For instance, the distinction between a) directed acyclic graphs + underlying conditional probabilities, and b) functional causal models, is rarely mentioned. Ignoring this distinction leads to some weird results, especially when the causal system in question has prominent physical mechanisms.
Say we represent Gear A as causing Gear B to turn because Gear A is hooked up to an engine, and because the two gears are connected to each other by a chain. Something like this:
Engine(ON) -> GearA(turn) -> GearB(turn)
As a causal net, this is problematic. If I "intervene" on GearA (perform do(GearA=stop)), then I get the expected result: GearA stops, GearB stops, and the engine keeps running (the 'undoing' effect [Sloman, 2005]). But what happens if I "intervene" on GearB? Since they are connected by a chain, GearA would stop as well. But GearA is the cause, and GearB is the effect: intervening on effects is NOT supposed to change the status of the cause. This violates a host of underlying assumptions for causal Bayes nets. (And you can't represent the gears as causing each other's movement, since that'd be a cyclical graph.)
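To make the mismatch concrete, here is a toy sketch of how the do-operator behaves on that three-node chain. It is only an illustration of graph surgery on a deterministic DAG, not Pearl's full formalism; the function `run_chain` and the variable names are made up for this example.

```python
def run_chain(interventions=None):
    """Evaluate the DAG Engine -> GearA -> GearB.

    Each variable copies its parent unless it is intervened on, in which
    case the incoming arrow is cut and the forced value is used instead.
    """
    do = interventions or {}
    state = {}
    state["engine_on"] = do.get("engine_on", True)
    state["gear_a_turns"] = do.get("gear_a_turns", state["engine_on"])
    state["gear_b_turns"] = do.get("gear_b_turns", state["gear_a_turns"])
    return state


# do(GearA=stop): GearB stops too, the engine keeps running.
print(run_chain({"gear_a_turns": False}))

# do(GearB=stop): the net says GearA keeps turning,
# which is not what a real chain drive would do.
print(run_chain({"gear_b_turns": False}))
```

The second call is where the net and the physical system part ways: the graph has no arrow from GearB back to GearA, so the intervention cannot propagate upstream.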
However, this can be solved if we're not representing the system as the above net, but we're instead representing the physics of the system, representing the forces involved via something that looks vaguely like Newtonian equations. Indeed, this would accord better with people's hypothesis-testing behavior: if they aren't sure which gear has the engine behind it, they wouldn't try "intervening" on GearA's motion and GearB's motion, they'd try removing the chain, and seeing which gear is still moving.
At first it seemed to me like causal Bayes nets only do the first kind of representation, not the latter. However, I was wrong: Pearl's "functional causal models" appear to do the latter. These have been vastly less prevalent in the psych literature, yet they seem extremely important.
Anyways, the moral of the story is that I should really read a lot of Pearl's Causality, and actually have a grasp of some of the math; I can't just read the first chapter like most psychology researchers interested in this stuff.
I'm not much of an autodidact when it comes to math, though I'm good at it when put in a class. Can anyone who's familiar with Pearl's book give me an idea of what sort of prerequisites it would be good to have in order to understand important chunks of it? Or am I overthinking this, and should I just try to plow through?
Any suggestions on classes (or textbooks, I guess), or any thoughts on the above gears example, will be helpful and welcome.
Thanks!
EDIT: Maybe a more specific request could be phrased as follows: will I be better served by taking some extra computer science classes, or some extra math classes (i.e., on calculus and probabilistic systems)?
Log-odds (or logits)
(I wrote this post for my own blog, and given the warm reception, I figured it would also be suitable for the LW audience. It contains some nicely formatted equations/tables in LaTeX, hence I've left it as a dropbox download.)
Logarithmic probabilities have appeared previously on LW here, here, and sporadically in the comments. The first is a link to an Eliezer post which covers essentially the same material. I believe this is a better introduction/description/guide to logarithmic probabilities than anything else that's appeared on LW thus far.
Introduction:
Our conventional way of expressing probabilities has always frustrated me. For example, it is very easy to say nonsensical statements like, “110% chance of working”. Or, it is not obvious that the difference between 50% and 50.01% is trivial compared to the difference between 99.98% and 99.99%. It also fails to accommodate the math correctly when we want to say things like, “five times more likely”, because 50% * 5 overflows 100%.
Jacob and I have (re)discovered a mapping from probabilities to log-odds which addresses all of these issues. To boot, it accommodates Bayes’ theorem beautifully. For something so simple and fundamental, it certainly took a great deal of Google searching/Wikipedia surfing to discover that they are actually called “log-odds”, and that they were “discovered” in 1944, instead of the 1600s. Also, nobody seems to use log-odds, even though they are conceptually powerful. Thus, this primer serves to explain why we need log-odds, what they are, how to use them, and when to use them.
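As a rough illustration of the points in the introduction, here is a small sketch of the mapping. It assumes natural logarithms (the linked primer may well use another base), and the helper names `log_odds` and `probability` are made up for this example.

```python
import math


def log_odds(p: float) -> float:
    """Map a probability in (0, 1) to log-odds (natural log)."""
    return math.log(p / (1.0 - p))


def probability(logit: float) -> float:
    """Inverse mapping: log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


# 50% vs 50.01% is a tiny step; 99.98% vs 99.99% is a large one.
small_gap = log_odds(0.5001) - log_odds(0.50)
large_gap = log_odds(0.9999) - log_odds(0.9998)
print(small_gap, large_gap)

# "Five times more likely" means multiplying the odds by 5,
# i.e. adding log(5) to the log-odds; the result can never overflow 100%.
updated = probability(log_odds(0.50) + math.log(5))
print(updated)
```

Starting from even odds, "five times more likely" lands on 5/6 rather than an impossible 250%, and a Bayes update becomes a simple addition on this scale.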
Bayes Slays Goodman's Grue
This is a first stab at solving Goodman's famous grue problem. I haven't seen a post on LW about the grue paradox, and this surprised me since I had figured that if any arguments would be raised against Bayesian LW doctrine, it would be the grue problem. I haven't looked at many proposed solutions to this paradox, besides some of the basic ones in "The New Problem of Induction". So, I apologize now if my solution is wildly unoriginal. I am willing to put you through this, dear reader, because:
1. I wanted to see how I would fare against this still largely open, devastating, and classic problem, using only the arsenal provided to me by my minimal Bayesian training, and my regular LW reading.
2. I wanted the first LW article about the grue problem to attack it from a distinctly Lesswrongian approach without the benefit of hindsight knowledge of the solutions of non-LW philosophy.
3. And lastly, because, even if this solution has been found before, if it is the right solution, it is to LW's credit that its students can solve the grue problem with only the use of LW skills and cognitive tools.
I would also like to warn the savvy subjective Bayesian that just because I think that probabilities model frequencies, and that I require frequencies out there in the world, does not mean that I am a frequentist or a realist about probability. I am a formalist with a grain of salt. There are no probabilities anywhere in my view, not even in minds; but the theorems of probability theory, when interpreted, share a fundamental contour with many important tools of the inquiring mind, including both the nature of frequency and the set of rational subjective belief systems. There is nothing more to probability than that system which produces its theorems.
Lastly, I would like to say that even if I have not succeeded here (which I think I have), there is likely something valuable that can be made from the leftovers of my solution after the onslaught of penetrating critiques that I expect from this community. Solving this problem is essential to LW's methods, and our arsenal is fit to handle it. If we are going to be taken seriously in the philosophical community as a new movement, we must solve serious problems from academic philosophy, and we must do it in distinctly Lesswrongian ways.
"The first emerald ever observed was green.
The second emerald ever observed was green.
The third emerald ever observed was green.
… etc.
The nth emerald ever observed was green.
(conclusion):
There is a very high probability that a never before observed emerald will be green."
That is the inference that the grue problem threatens, courtesy of Nelson Goodman. The grue problem starts by defining "grue":
"An object is grue iff it is first observed before time T, and it is green, or it is first observed after time T, and it is blue."
So you see that before time T, from the list of premises:
"The first emerald ever observed was green.
The second emerald ever observed was green.
The third emerald ever observed was green.
… etc.
The nth emerald ever observed was green."
(we will call these the green premises)
it follows that:
"The first emerald ever observed was grue.
The second emerald ever observed was grue.
The third emerald ever observed was grue.
… etc.
The nth emerald ever observed was grue."
(we will call these the grue premises)
The proposer of the grue problem asks at this point: "So if the green premises are evidence that the next emerald will be green, why aren't the grue premises evidence for the next emerald being grue?" If an emerald is grue after time T, it is not green. Let's say that the green premises bring the probability of "A new unobserved emerald is green." to 99%. On the skeptic's hypothesis, by symmetry they should also bring the probability of "A new unobserved emerald is grue." to 99%. But after time T, this would mean that the probability of observing a green emerald is 99%, and the probability of not observing a green emerald is at least 99%. Since these two sentences have no intersection (they cannot both happen), we find the probability of their disjunction by adding their individual probabilities. This gives a number at least as big as 198%, which contradicts the Kolmogorov axioms: we should not be able to form a statement with a probability greater than one.
This threatens the whole of science, because you cannot simply keep this isolated to emeralds and color. We may think of the emeralds as trials, and green as the value of a random variable. Ultimately, every result of a scientific instrument is a random variable, with a very particular and useful distribution over its values. If we can't justify inferring probability distributions over random variables based on their previous results, we cannot justify a single bit of natural science. This, of course, says nothing about how it works in practice. We all know it works in practice. "A philosopher is someone who says, 'I know it works in practice, I'm trying to see if it works in principle.'" - Dan Dennett
We may look at an analogous problem. Let's suppose that there is a table and that there are balls being dropped on this table, and that there is an infinitely thin line drawn perpendicular to the edge of the table somewhere which we are unaware of. The problem is to figure out the probability of the next ball being right of the line given the last results. Our first prediction should be that there is a 50% chance of the ball being right of the line, by symmetry. If we get the result that one ball landed right of the line, by Laplace's rule of succession we infer that there is a 2/3 chance that the next ball will be right of the line. After n trials, if every trial gives a positive result, the probability we should assign to the next trial being positive as well is (n+1)/(n+2).
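For reference, Laplace's rule of succession is short enough to write down directly. The sketch below uses the general form (s+1)/(n+2) for s positive results in n trials, of which the all-positive case in the text is the special case s = n; the function name is made up for this example.

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's rule: P(next trial is positive) = (s + 1) / (n + 2)."""
    return Fraction(successes + 1, trials + 2)


print(rule_of_succession(0, 0))    # no data yet: 1/2
print(rule_of_succession(1, 1))    # one ball, landed right: 2/3
print(rule_of_succession(10, 10))  # ten out of ten: 11/12
```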
If this line was placed 2/3 of the way down the table, we should expect that the ratio of rights to lefts should approach 2:1. This gives us a 2/3 chance of the next ball being a right, and the fraction of Rights out of trials approaches 2/3 ever more closely as more trials are performed.
Now let us suppose a grue skeptic approaching this situation. He might make up two terms "reft" and "light". Defined as you would expect, but just in case:
"A ball is reft of the line iff it is right of it before time T when it lands, or if it is left of it after time T when it lands.
A ball is light of the line iff it is left of the line before time T when it lands, or if it is right of the line after time T when it first lands."
The skeptic would continue:
"Why should we treat the observation of several occurrences of Right, as evidence for 'The next ball will land on the right.' and not as evidence for 'The next ball will land reft of the line.'?"
Things for some reason become perfectly clear at this point for the defender of Bayesian inference, because now we have an easy-to-imagine model. Of course, if a ball landing right of the line is evidence for Right, then it cannot possibly be evidence for ~Right; to be evidence for Reft, after time T, is to be evidence for ~Right, because after time T, Reft is logically identical to ~Right; hence it is not evidence for Reft, after time T, for the same reasons it is not evidence for ~Right. Of course, before time T, any evidence for Reft is evidence for Right for analogous reasons.
But now the grue skeptic can say something brilliant, that stops much of what the Bayesian has proposed dead in its tracks:
"Why can't I just repeat that paragraph back to you and swap every occurrence of 'right' with 'reft' and 'left' with 'light', and vice versa? They are perfectly symmetrical in terms of their logical relations to one another.
If we take 'reft' and 'light' as primitives, then we have to define 'right' and 'left' in terms of 'reft' and 'light' with the use of time intervals."
What can we possibly reply to this? Can he/she not do this with every argument we propose then? Certainly, the skeptic admits that Bayes, and the contradiction in Right & Reft, after time T, prohibits previous Rights from being evidence of both Right and Reft after time T; where he is challenging us is in choosing Right as the result which it is evidence for, even though "Reft" and "Right" have a completely symmetrical syntactical relationship. There is nothing about the definitions of reft and right which distinguishes them from each other, except their spelling. So is that it? No, this simply means we have to propose an argument that doesn't rely on purely syntactical reasoning. So that if the skeptic performs the swap on our argument, the resulting argument is no longer sound.
What would happen in this scenario if it were actually set up? I know that seems like a strangely concrete question for a philosophy text, but its answer is a helpful hint. What would happen is that after time T, the behavior of the ratio 'Rights:Lefts' as more trials were added would proceed as expected, and the behavior of the ratio 'Refts:Lights' would approach the reciprocal of the ratio 'Rights:Lefts'. The only way for this to not happen is for us to have been calling the right side of the table "reft", or for the line to have moved. We can only figure out where the line is by knowing where the balls landed relative to it; anything we can figure out about where the line is from knowing which balls landed Reft and which ones landed Light, we can only figure out because in knowing this and the time, we can know if the ball landed left or right of the line.
To this I know of no reply which the grue skeptic can make. If he/she says the paragraph back to me with the proper words swapped, it is not true, because in the hypothetical where we have a table, a line, and we are calling one side right and another side left, the only way for Refts:Lights to behave as expected as more trials are added is to move the line (if even that); otherwise the ratio of Refts to Lights will approach the reciprocal of Rights to Lefts.
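The table thought experiment can be run as a small simulation. This is only a sketch under assumptions not in the original: the line sits where each ball independently lands right with probability 2/3, the line never moves, and the function name and trial counts are arbitrary.

```python
import random

P_RIGHT = 2 / 3  # assumed: the fixed line puts 2/3 of the table to its right


def label_fractions(n_balls: int, after_t: bool, rng: random.Random):
    """Drop n_balls and return (fraction Right, fraction Reft)."""
    rights = 0
    refts = 0
    for _ in range(n_balls):
        landed_right = rng.random() < P_RIGHT
        rights += landed_right
        # "Reft" means right of the line before T, left of it after T.
        is_reft = (not landed_right) if after_t else landed_right
        refts += is_reft
    return rights / n_balls, refts / n_balls


rng = random.Random(0)  # fixed seed so the run is repeatable
before = label_fractions(20_000, after_t=False, rng=rng)
after = label_fractions(20_000, after_t=True, rng=rng)
print("before T (Right, Reft):", before)
print("after T  (Right, Reft):", after)
```

Before T the two labels agree ball for ball; after T the fraction of Rights stays near 2/3 while the fraction of Refts drops to near 1/3, which is the reciprocal-ratio behavior described above.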
This thin line is analogous to the frequency of emeralds that turn out green out of all the emeralds that get made. This is why we can assume that the line will not move, because that frequency has one precise value, which never changes. Its other important feature is reminding us that even if two terms are syntactically symmetrical, they may have semantic conditions for application which are ignored by the syntactical model, e.g., checking to see which side of the line the ball landed on.
In conclusion:
Every random variable has as a part of it, stored in its definition/code, a frequency distribution over its values. By the fact that some things happen sometimes, and others happen other times, we know that the world contains random variables, even if they are never fundamental in the source code. Note that "frequency" is not used as a state of partial knowledge; it is a fact about a set and one of its subsets.
The reason that:
"The first emerald ever observed was green.
The second emerald ever observed was green.
The third emerald ever observed was green.
… etc.
The nth emerald ever observed was green.
(conclusion):
There is a very high probability that a never before observed emerald will be green."
is a valid inference, but the grue equivalent isn't, is that grue is not a property that the emerald construction sites of our universe deal with. They are blind to the grueness of their emeralds; they only say anything about whether or not the next emerald will be green. It may be that the rule that the emerald construction sites use to get either a green or non-green emerald changes at time T, but the frequency of some particular result out of all trials will never change; the line will not move. As long as we know what symbols we are using for what values, observing many green emeralds is evidence that the next one will be grue as long as it is before time T, and every record of an observation of a green emerald is evidence against a grue one after time T. "Grue" changes meaning from green to blue at time T; the meaning of "green" stays the same, since we are using the same physical test to determine greenhood as before, just as we use the same test to tell whether the ball landed right or left. There is no reft in the universe's source code, and there is no grue. Green is not fundamental in the source code, but green can be reduced to some particular range of quanta states; if you had the universe's source code, you couldn't write grue without first writing green; writing green without knowing a thing about grue would be just as hard as while knowing grue. Having a physical test, or primary condition for applicability, is what privileges green over grue after time T; to have a consistent physical test is the same as to reduce to a specifiable range of physical parameters; the existence of such a test is what prevents the skeptic from performing his/her swaps on our arguments.
Take this more as a brainstorm than as a final solution. It wasn't originally but it should have been. I'll write something more organized and concise after I think about the comments more, and make some graphics I've designed that make my argument much clearer, even to myself. But keep those comments coming, and tell me if you want specific credit for anything you may have added to my grue toolkit in the comments.
Help with a (potentially Bayesian) statistics / set theory problem?
Update: as it turns out, this is a voting system problem, which is a difficult but well-studied topic. Potential solutions include Ranked Pairs (complicated) and BestThing (simpler). Thanks to everyone for helping me think this through out loud, and for reminding me to kill flies with flyswatters instead of bazookas.
I'm working on a problem that I believe involves Bayes. I'm new to Bayes and a bit rusty on statistics, and I'm having a hard time figuring out where to start. (EDIT: it looks like set theory may also be involved.) Your help would be greatly appreciated.
Here's the problem: assume a set of 7 different objects. Two of these objects are presented at random to a participant, who selects whichever one of the two objects they prefer. (There is no "indifferent" option.) The order of these combinations is not important, and repeated combinations are not allowed.
Basic combination theory says there are 21 different possible combinations: (7!) / ( (2!) * (7-2)! ) = 21.
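As a quick check of that count, here is a sketch that enumerates the unordered pairs. The letters A to G are placeholder labels for the seven objects.

```python
from itertools import combinations
from math import comb

objects = "ABCDEFG"  # placeholder labels for the 7 objects

# Every unordered pair, with no repeats.
pairs = list(combinations(objects, 2))

print(comb(7, 2), len(pairs))  # 21 21
```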
Now, assume the researcher wants to know which single option has the highest probability of being the "most preferred" to a new participant based on the responses of all previous participants. To complicate matters, each participant can leave at any time, without completing the entire set of 21 responses. Their responses should still factor into the final result, even if they only respond to a single combination.
At the beginning of the study, there are no priors. (CORRECTION via dlthomas: "There are necessarily priors... we start with no information about rankings... and so assume a 1:1 chance of either object being preferred.") If a participant selects B from {A,B}, the probability of B being the "most preferred" object should go up, and A should go down, if I'm understanding correctly.
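One simple way to make the 1:1 starting point concrete, offered as a sketch rather than a full answer: treat each pair separately, start from a uniform prior on "B beats A", and update with Laplace's rule of succession. This only estimates pairwise preference, not the overall "most preferred" object, and the function name and response format are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction


def preference_estimate(responses, winner, loser) -> Fraction:
    """Estimate P(a new participant picks `winner` over `loser`).

    `responses` is a list of (chosen, rejected) tuples from all participants,
    including those who left early. Uses a uniform (1:1) prior per pair.
    """
    counts = Counter(responses)
    wins = counts[(winner, loser)]
    losses = counts[(loser, winner)]
    return Fraction(wins + 1, wins + losses + 2)


print(preference_estimate([], "B", "A"))            # no data: 1/2
print(preference_estimate([("B", "A")], "B", "A"))  # one vote for B: 2/3
print(preference_estimate([("B", "A")], "A", "B"))  # so A drops to 1/3
```

Partial responses fit naturally here, since each answered pair contributes one count and unanswered pairs simply stay at 1/2.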
NOTE: Direct ranking of objects 1-7 (instead of pairwise comparison) isn't ideal because it takes longer, which may encourage the participant to rationalize. The "pick-one-of-two" approach is designed to be fast, which is better for gut reactions when comparing simple objects like words, photos, etc.
The ideal output looks like this: "Based on ___ total responses, participants prefer Object A. Object A is preferred __% more than Object B (the second most preferred), and ___% more than Object C (the third most preferred)."
Questions:
1. Is Bayes actually the most straightforward way of calculating the "most preferred"? (If not, what is? I don't want to be Maslow's "man with a hammer" here.)
2. If so, can you please walk me through the beginning of how this calculation is done, assuming 10 participants?
Thanks in advance!
Thinking in Bayes: Light
There are a lot of explanations of Bayes' Theorem, so I won't get into the technicalities. I will get into why it should change how you think. This post is pretty introductory, so feel free to skip it if you don't feel like there's anything about Bayes' Theorem that you don't understand.
For a while I was reading LessWrong and not seeing what the big deal about Bayes' Theorem was. Sure, probability is in the mind and all, but I didn't see why it was so important to insist on Bayesian methods. For me they were a tool, rather than a way of thinking. This summary also helped someone in the DC group.
After using the Anki deck, a thought occurred to me:
Bayes' theorem means that when seeing how likely a hypothesis is after an event, not only do I need to think about how likely the hypothesis said the event is, I need to consider everything else that could have possibly made that event more likely.
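To illustrate:
$$P(H \mid e) = \frac{P(e \mid H)\, P(H)}{P(e)}$$
pretty clearly shows how you need to consider P(e|H), but that's slightly more obvious than the rest of it.
If you write it out the way that you would compute it you get...
$$P(H \mid e) = \frac{P(e \mid H)\, P(H)}{\sum_{h} P(e \mid h)\, P(h)}$$
where h is an element of the hypothesis space.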
This means that every way that e could have happened is important, on top of (or should I say under?) just how much probability the hypothesis assigned to e.
This is because P(e) comes from every hypothesis that contributes to e happening, or, more mathily, P(e) is the sum over all possible hypotheses of the probability of the event and that hypothesis, computed by the probability of the hypothesis times the probability of the event given the hypothesis.
In LaTeX:
$$P(e) = \sum_{h} P(e \wedge h) = \sum_{h} P(h)\, P(e \mid h)$$
where h is an element of the hypothesis space.
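The same idea as a small sketch in code: P(e) is computed by summing over the whole hypothesis space, and every hypothesis's posterior is divided by it. The two-coin hypothesis space and the function name `posterior` are made-up examples, not from the post.

```python
def posterior(priors, likelihoods):
    """Return {h: P(h|e)} given {h: P(h)} and {h: P(e|h)}."""
    # P(e) = sum over all hypotheses h of P(h) * P(e|h)
    p_e = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / p_e for h in priors}


# Hypothetical example: the event e is "the coin came up heads".
priors = {"fair coin": 0.5, "two-headed coin": 0.5}
likelihood_of_heads = {"fair coin": 0.5, "two-headed coin": 1.0}

print(posterior(priors, likelihood_of_heads))
```

Here the fair coin predicted heads reasonably well, yet its probability still falls from 1/2 to 1/3, because the other hypothesis made heads even more likely.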
Religion, happiness, and Bayes
Religion apparently makes people happier. Is that evidence for the truth of religion, or against it?
(Of course, it matters which religion we're talking about, but let's just stick with theism generally.)
My initial inclination was to interpret this as evidence against theism, in the sense that it weakens the evidence for theism. Here's why:
1. As all Bayesians know, a piece of information F is evidence for an hypothesis H to the degree that F depends on H. If F can happen just as easily without H as with it, then F is not evidence for H. The more likely we are to find F in a world without H, the weaker F is as evidence for H.
2. Here, F is "Theism makes people happier." H is "Theism is true."
3. The fact of widespread theism is evidence for H. The strength of this evidence depends on how likely such belief would be if H were false.
4. As people are more likely to do something if it makes them happy, people are more likely to be theists given F.
5. Thus F opens up a way for people to be theists even if H is false.
6. It therefore weakens the evidence of widespread theism for the truth of H.
7. Therefore, F should decrease one's confidence in H, i.e., it is evidence against H.
We could also put this in mathematical terms, where F represents an increase in the prior probability of our encountering the evidence. Since that prior is a denominator in Bayes' equation, a bigger one means a smaller posterior probability; in other words, weaker evidence.
OK, so that was my first thought.
But then I had second thoughts: Perhaps the evidence points the other way? If we reframe the finding as "Atheism causes unhappiness," or posit that contrarians (such as atheists) are dispositionally unhappy, does that change the sign of the evidence?
Obviously, I am confused. What's going on here?
Bayesian analysis under threat in British courts
This is an interesting article about the use of Bayes in British courts and efforts to improve how statistics are used in court cases. Probably worth keeping an eye on. It might expose more people to Bayes if it becomes common and is thus portrayed in TV dramas.
[Funny] Even Clippy can be blamed on the use of non-Bayesian methods
From this 2001 article:
Eric Horvitz... feels bad about [Microsoft Office's Clippy]... many people regard the paperclip as annoyingly overenthusiastic, since it appears without warning and gets in the way.
To be fair, that is not Dr Horvitz's fault. Originally, he programmed the paperclip to use Bayesian decision-making techniques both to determine when to pop up, and to decide what advice to offer...
The paperclip's problem is that the algorithm... that determined when it should appear was deemed too cautious. To make the feature more prominent, a cruder non-Bayesian algorithm was substituted in the final product, so the paperclip would pop up more often.
Ever since, Dr Horvitz has wondered whether he should have fought harder to keep the original algorithm.
I, at least, found this amusing.
Bayesian exercise
I am confused.
Suppose you are in charge of estimating the risk of catastrophic failure of the Space Shuttle. From engineers, component tests, and guesswork, you come to the conclusion that any given launch is about 3% likely to fail. On the strength of this you launch the Shuttle, and it does not blow up. Now, with this new information, what is your new probability estimate? I write down
P(failure next time | we observe one successful launch) = P(we observe one successful launch | failure next time) * P(failure) / P(observe one success)
or
P(FNT|1S) = P(1S|FNT)*P(F)/P(S)
We have P(F) = 1 - P(S) = 0.03. Presumably your chances of success this time are not affected by the next one being a failure, so P(1S|FNT) is just P(S) = 0.97. So the two 97% chances cancel, and I'm left with the same estimate I had before, 3% chance of failure. Is this correct, that a successful launch does not give you new information about the chances of failure? This seems counterintuitive.
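One way to probe this, sketched under assumptions that are not part of the setup above: the calculation treats the 3% as an exactly known rate, in which case launches are independent and nothing can update. If instead the 3% is an average over several candidate failure rates, a success shifts weight toward the safer ones. The candidate rates and weights below are hypothetical, chosen only so that the starting estimate is 3%.

```python
# Case 1: the per-launch failure rate is treated as exactly known.
p_fail = 0.03
p_success = 1 - p_fail
# P(FNT|1S) = P(1S|FNT) * P(F) / P(S), with P(1S|FNT) = P(S)
posterior_fixed = p_success * p_fail / p_success

# Case 2 (hypothetical): uncertainty about the true failure rate,
# expressed as weights over a few candidate rates averaging 3%.
candidate_rates = [0.01, 0.03, 0.05]
weights = [0.25, 0.50, 0.25]


def predicted_failure(rates, weights):
    """Probability the next launch fails, averaged over candidate rates."""
    return sum(r * w for r, w in zip(rates, weights))


def update_on_success(rates, weights):
    """Reweight each candidate rate by how well it predicted a success."""
    unnormalized = [(1 - r) * w for r, w in zip(rates, weights)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


before = predicted_failure(candidate_rates, weights)
after = predicted_failure(candidate_rates,
                          update_on_success(candidate_rates, weights))
```

With these made-up weights the fixed-rate estimate stays at exactly 3%, while the uncertain-rate estimate starts at 3% and drops slightly after one success.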
'An objective defense of Bayesianism'
Recently, Hans Leitgeb and Richard Pettigrew have published a novel defense of Bayesianism:
An Objective Defense of Bayesianism I: Measuring Inaccuracy
One of the fundamental problems of epistemology is to say when the evidence in an agent’s possession justifies the beliefs she holds. In this paper and its sequel, we defend the Bayesian solution to this problem by appealing to the following fundamental norm:
Accuracy: An epistemic agent ought to minimize the inaccuracy of her partial beliefs.
In this paper, we make this norm mathematically precise in various ways. We describe three epistemic dilemmas that an agent might face if she attempts to follow Accuracy, and we show that the only inaccuracy measures that do not give rise to such dilemmas are the quadratic inaccuracy measures. In the sequel, we derive the main tenets of Bayesianism from the relevant mathematical versions of Accuracy to which this characterization of the legitimate inaccuracy measures gives rise, but we also show that unless the requirement of Rigidity is imposed from the start, Jeffrey conditionalization has to be replaced by a different method of update in order for Accuracy to be satisfied.
An Objective Defense of Bayesianism II: The Consequences of Minimizing Inaccuracy
In this article and its prequel, we derive Bayesianism from the following norm: Accuracy—an agent ought to minimize the inaccuracy of her partial beliefs. In the prequel, we make the norm mathematically precise; in this article, we derive its consequences. We show that the two core tenets of Bayesianism follow from Accuracy, while the characteristic claim of Objective Bayesianism follows from Accuracy together with an extra assumption. Finally, we show that Jeffrey Conditionalization violates Accuracy unless Rigidity is assumed, and we describe the alternative updating rule that Accuracy mandates in the absence of Rigidity.
Richard Pettigrew has also written an excellent introduction to probability.
"Friends do not let friends compute p values."
LWers may find useful two recent articles summarizing (for cognitive scientists) why Bayesian inference is superior to frequentist inference.
Kruschke - What to believe: Bayesian methods for data analysis
Wagenmakers et al - Bayesian versus frequentist inference
(The quote "Friends do not let friends compute p values" comes from the first article.)
[Link] The Bayesian argument against induction.
In 1983 Karl Popper and David Miller published an argument to the effect that probability theory could be used to disprove induction. Popper had long been an opponent of induction. Since probability theory in general, and Bayes in particular, is often seen as rescuing induction from the standard objections, the argument is significant.
It is being discussed over at the Critical Rationalism site.
Considering all scenarios when using Bayes' theorem.
Disclaimer: this post is directed at people who, like me, are not Bayesian/probability gurus.
Recently I found an opportunity to use Bayes' theorem in real life to help myself update in the following situation (presented in a gender-neutral way):
Let's say you are wondering if a person is interested in you romantically. And they bought you a drink.
A = they are interested in you.
B = they bought you a drink.
P(A) = 0.3 (Just an assumption.)
P(B) = 0.05 (Approximately 1 out of 20 people, who might be at all interested in you, will buy you a drink for some unknown reason.)
P(B|A) = 0.2 (Approximately 1 out of 5 people, who are interested in you, will buy you a drink for some unknown reason. Though it's more likely they will buy you a drink because they are interested in you.)
These numbers seem valid to me, and I can't see anything that's obviously wrong. But when I actually use Bayes' theorem:
P(A|B) = P(B|A) * P(A) / P(B) = 1.2
Uh-oh! Where did I go wrong? See if you can spot the error before continuing.
Turns out:
P(B|A) = P(A∩B) / P(A) ≤ P(B) / P(A) = 0.1667
BUT
P(B|A) = 0.2 > 0.1667
I've made a mistake in estimating my probabilities, even though it felt intuitive. Yet, I don't immediately see where I went wrong when I look at the original estimates! What's the best way to prevent this kind of mistake?
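One cheap safeguard is to test the three estimates for coherence before plugging them into the formula. The sketch below is a hypothetical helper (the name `check_bayes_inputs` is mine, not from any library) that applies the same bound as above: P(A∩B) = P(B|A) * P(A) can never exceed P(B).

```python
def check_bayes_inputs(p_a, p_b, p_b_given_a):
    """Return a list of problems with the estimates (empty if coherent)."""
    problems = []
    for name, value in (("P(A)", p_a), ("P(B)", p_b), ("P(B|A)", p_b_given_a)):
        if not 0.0 <= value <= 1.0:
            problems.append(f"{name} = {value} is outside [0, 1]")
    # The joint probability P(A and B) = P(B|A) * P(A) cannot exceed P(B).
    joint = p_b_given_a * p_a
    if joint > p_b:
        problems.append(
            f"P(B|A) * P(A) = {joint:.4f} exceeds P(B) = {p_b}; "
            f"P(B) must be at least {joint:.4f}"
        )
    return problems


# The original estimates from this post: P(A)=0.3, P(B)=0.05, P(B|A)=0.2
problems = check_bayes_inputs(p_a=0.3, p_b=0.05, p_b_given_a=0.2)
```

For the original estimates this flags exactly one problem, the joint-probability bound. It only catches incoherent inputs; it cannot say which of the three numbers is the wrong one.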
I feel pretty confident in my estimates of P(A) and P(B|A). However, estimating P(B) is rather difficult because I need to consider many scenarios.
I can compute P(B) more precisely by considering all the scenarios that would lead to B happening (see wiki article):
P(B) = ∑_{i} P(B|H_{i}) * P(H_{i})
Let's do a quick breakdown of everyone who would want to buy you a drink (out of the pool of people who might be at all interested in you):
P(misc. reasons) = 0.05; P(B|misc) = 0.01
P(they are just friendly and buy drinks for everyone they meet) = 0.05; P(B|friendly) = 0.8
P(they want to be friends) = 0.3; P(B|friends) = 0.1
P(they are interested in you) = 0.6; P(B|interested) = P(B|A) = 0.2
So, P(B) = 0.1905
And, P(A|B) = 0.315 (very different from 1.2!)
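The breakdown can be reproduced in a few lines of Python. The scenario names are shorthand for the four rows above. One thing the sketch makes visible: the breakdown gives "interested" a weight of 0.6, while the 0.315 figure uses P(A) = 0.3 from the original estimates, so both versions are computed here.

```python
# (P(H_i), P(B|H_i)) for each scenario, copied from the breakdown above.
scenarios = {
    "misc": (0.05, 0.01),
    "friendly": (0.05, 0.8),
    "friends": (0.3, 0.1),
    "interested": (0.6, 0.2),
}

# The scenario weights should cover the whole pool of people.
total_weight = sum(p_h for p_h, _ in scenarios.values())

# Law of total probability: P(B) = sum_i P(B|H_i) * P(H_i)
p_b = sum(p_h * p_b_given_h for p_h, p_b_given_h in scenarios.values())

p_b_given_a = scenarios["interested"][1]

# P(A|B) with P(A) = 0.3, as in the original estimates.
p_a_given_b_original_prior = p_b_given_a * 0.3 / p_b

# P(A|B) with P(A) = 0.6, the weight the breakdown gives "interested".
p_a_given_b_breakdown_prior = p_b_given_a * scenarios["interested"][0] / p_b
```

The first value matches the 0.315 above; using the breakdown's own weight of 0.6 gives roughly 0.63 instead, so it is worth deciding which prior is actually meant.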
Once I started thinking about all possible scenarios, I found one I hadn't considered explicitly: some people buy drinks for everyone they meet, which adds a good amount of probability (0.04) to B happening. (Those types of people are rare, but they WILL buy you a drink.) There are also other interesting assumptions that are made explicit:
 Out of all the people under consideration in this problem, there are twice as many people who would be romantically interested in you vs. people who would want to be your friend.
 People who are interested in you will buy you a drink twice as often as people who want to be your friend.
The moral of the story is to consider all possible scenarios (models/hypotheses) which can lead to the event you have observed. It's possible you are missing some scenarios which, once considered, will significantly alter your probability estimates.
Do you know any other ways to make the use of Bayes' theorem more accurate? (Please post in comments, links to previous posts of this sort are welcome.)
A Philosophical Treatise of Universal Induction (Link)
Abstract: Understanding inductive reasoning is a problem that has engaged mankind for thousands of years. This problem is relevant to a wide range of fields and is integral to the philosophy of science. It has been tackled by many great minds ranging from philosophers to scientists to mathematicians, and more recently computer scientists. In this article we argue the case for Solomonoff Induction, a formal inductive framework which combines algorithmic information theory with the Bayesian framework. Although it achieves excellent theoretical results and is based on solid philosophical foundations, the requisite technical knowledge necessary for understanding this framework has caused it to remain largely unknown and unappreciated in the wider scientific community. The main contribution of this article is to convey Solomonoff induction and its related concepts in a generally accessible form with the aim of bridging this current technical gap. In the process we examine the major historical contributions that have led to the formulation of Solomonoff Induction as well as criticisms of Solomonoff and induction in general. In particular we examine how Solomonoff induction addresses many issues that have plagued other inductive systems, such as the black ravens paradox and the confirmation problem, and compare this approach with other recent approaches.
Link: mdpi.com/10994300/13/6/1076/
Download PDF FullText: mdpi.com/10994300/13/6/1076/pdf
Authors: Samuel Rathmanner and Marcus Hutter
Published: 3 June 2011