Wait a minute - when the Bayesian says "I think the coin probably has a chance near 50% of being heads", she's using data from prior observations of coin flips to say that. Which means that the frequentist might get the same answer if he added those prior observations to his dataset.
Yes, that's a good point. That would be considered using a data augmentation prior (Sander Greenland has advocated such an approach).
I hadn't seen that, but you're right that that sentence is wrong. "Probability" should have been replaced with "frequency" or something. A prior on a probability would be a set of probabilities of probabilities, and would soon lead to infinite regress.
Only if you keep specifying hyper-priors, and there is no reason to do that.
I don't understand how you can hold a position like that and still enjoy the post. How do you parse the phrase "my prior for the probability of heads" in the second example?
In the second example the person was speaking informally, but there is nothing wrong with specifying a probability distribution for an unknown parameter (and that parameter could be a probability for heads)
If the null hypothesis were true, the probability that we would get 3 heads or fewer is 0.08.
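For concreteness, that tail probability is P(X ≤ 3) for 5 flips of a coin that lands heads 90% of the time; a quick sketch with scipy:

```python
from scipy.stats import binom

# Probability of 3 or fewer heads in 5 flips, if the coin really lands heads 90% of the time
p_tail = binom.cdf(3, 5, 0.9)
print(round(p_tail, 3))  # ≈ 0.081
```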
Is the idea that the coin will land heads 90% of the time really something that can be called the "null hypothesis"?
Hm, good point. Since the usual assumption is 0.5, the claim should be the alternative hypothesis. I was thinking in terms of trying to reject their claim (which wouldn't take much data to do), but I do think my setup was non-standard. I'll fix it later today.
Bayes' rule =/= Bayesian inference
Related to: Bayes' Theorem Illustrated, What is Bayesianism?, An Intuitive Explanation of Bayes' Theorem
(Bayes' theorem is something Bayesians need to use more often than Frequentists do, but Bayes' theorem itself isn't Bayesian. This post is meant to be a light introduction to the difference between Bayes' theorem and Bayesian data analysis.)
Bayes' Theorem
Bayes' theorem is just a way to get (e.g.) P(B|A) from P(A|B) and P(B). The classic example of Bayes' theorem is diagnostic testing. Suppose someone either has the disease (D+) or does not have the disease (D-) and either tests positive (T+) or tests negative (T-). If we knew the sensitivity P(T+|D+), specificity P(T-|D-) and disease prevalence P(D+), then we could get the positive predictive value P(D+|T+) using Bayes' theorem:

P(D+|T+) = P(T+|D+)P(D+) / [P(T+|D+)P(D+) + (1 - P(T-|D-))(1 - P(D+))]
For example, suppose we know the sensitivity = 0.9, specificity = 0.8 and disease prevalence = 0.01. Then,

P(D+|T+) = (0.9)(0.01) / [(0.9)(0.01) + (0.2)(0.99)] ≈ 0.043
This answer is not Bayesian or frequentist; it's just correct.
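As a minimal sketch of that arithmetic in Python, with the stated sensitivity, specificity, and prevalence:

```python
# Positive predictive value via Bayes' theorem
sens = 0.9   # P(T+|D+)
spec = 0.8   # P(T-|D-)
prev = 0.01  # P(D+)

ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
print(round(ppv, 3))  # ≈ 0.043
```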
Diagnostic testing study
Typically we will not know P(T+|D+) or P(T-|D-). We would consider these unknown parameters. Let's denote them by Θsens and Θspec. For simplicity, let's assume we know the disease prevalence P(D+) (we often have a lot of data on this).
Suppose 1000 subjects with the disease were tested, and 900 of them tested positive. Suppose 1000 disease-free subjects were tested and 200 of them tested positive. Finally, suppose 1% of the population has the disease.
Frequentist approach
Estimate the 2 parameters (sensitivity and specificity) using their sample values (sample proportions) and plug them in to Bayes' formula above. This results in a point estimate for P(D+|T+) of 0.043. A standard error or confidence interval could be obtained using the delta method or bootstrapping.
Even though Bayes' theorem was used, this is not a Bayesian approach.
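A sketch of the frequentist plug-in estimate with a parametric bootstrap interval (numpy; the exact endpoints will vary a little with the seed and number of resamples):

```python
import numpy as np

rng = np.random.default_rng(0)
prev = 0.01  # known disease prevalence

def ppv(sens, spec):
    """Positive predictive value from Bayes' theorem."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

# Plug-in point estimate from the sample proportions 900/1000 and 800/1000
point = ppv(0.9, 0.8)

# Parametric bootstrap: resample the two binomial counts from their estimated proportions
sens_b = rng.binomial(1000, 0.9, 10_000) / 1000
spec_b = rng.binomial(1000, 0.8, 10_000) / 1000
lo, hi = np.percentile(ppv(sens_b, spec_b), [2.5, 97.5])
print(round(point, 3))  # ≈ 0.043
```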
Bayesian approach
The Bayesian approach is to specify prior distributions for all unknowns. For example, we might specify independent uniform(0,1) priors for Θsens and Θspec. However, we should expect the test to perform at least as well as guessing (guessing would mean randomly selecting 1% of people and calling them T+). In addition, we expect Θsens > 1-Θspec. So, I might go with a Beta(4,2.5) distribution for Θsens and a Beta(2.5,4) for Θspec.
Using these priors + the data yields a posterior distribution for P(D+|T+) with posterior median 0.043 and 95% credible interval (0.038, 0.049). In this case, the Bayesian and frequentist approaches have the same results (not surprising since the priors are relatively flat and there are a lot of data). However, the methodology is quite different.
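With beta priors and binomial data the posteriors are conjugate, so the posterior for P(D+|T+) can be simulated directly; a sketch (the quantiles should land near the figures quoted above):

```python
import numpy as np

rng = np.random.default_rng(0)
prev = 0.01

# Conjugate updates: Beta(a, b) prior + binomial data -> Beta(a + successes, b + failures)
sens = rng.beta(4 + 900, 2.5 + 100, 200_000)   # 900 of 1000 diseased tested positive
spec = rng.beta(2.5 + 800, 4 + 200, 200_000)   # 800 of 1000 disease-free tested negative

ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
q = np.percentile(ppv, [2.5, 50, 97.5])
print(np.round(q, 3))  # posterior 2.5%, 50%, 97.5% points; median ≈ 0.043
```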
Example that illustrates benefit of Bayesian data analysis
(example edited to focus on credible/confidence intervals)
Suppose someone shows you what looks like a fair coin (you confirm heads on one side, tails on the other) and makes the claim: "This coin will land with heads up 90% of the time."
Suppose the coin is flipped 5 times and lands with heads up 4 times.
Frequentist approach
"A 95% confidence interval for the Binomial parameter is (.38, .99) using the Agresti-Coull method." Because 0.9 is within the confidence limits, the usual conclusion would be that we do not have enough evidence to rule it out.
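The Agresti-Coull interval is just the Wald interval computed after adding z²/2 pseudo-successes and pseudo-failures; a sketch (small rounding or variant differences from the interval quoted above are possible):

```python
import math

def agresti_coull(successes, n, z=1.96):
    """Agresti-Coull interval: add z^2/2 pseudo-successes and pseudo-failures, then Wald."""
    n_adj = n + z**2
    p_adj = (successes + z**2 / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half

lo, hi = agresti_coull(4, 5)
print(round(lo, 2), round(hi, 2))  # 0.9 falls inside the interval
```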
Bayesian approach
"I don't believe you. Based on experience and what I know about the laws of physics, I think it's very unlikely that your claim is accurate. I feel very confident that the probability is close to 0.5. However, I don't want to rule out something a little bit unusual (like a probability of 0.4). Thus, my prior for the probability of heads is a Beta(30,30) distribution."
After seeing the data, we update our belief about the binomial parameter. The 95% credible interval for it is (0.40, 0.64). Thus, a value of 0.9 is still considered extremely unlikely.
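The beta prior is conjugate to the binomial likelihood, so the update is a one-liner; a sketch with scipy:

```python
from scipy.stats import beta

# Beta(30, 30) prior + 4 heads, 1 tail -> Beta(30 + 4, 30 + 1) posterior
lo, hi = beta.ppf([0.025, 0.975], 34, 31)
print(round(lo, 2), round(hi, 2))  # ≈ 0.40 0.64

# Posterior probability that the heads probability is 0.9 or more: essentially zero
print(1 - beta.cdf(0.9, 34, 31))
```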
This illustrates the idea that, from a Bayesian perspective, implausible claims require more evidence than plausible claims. Frequentists have no formal way of including that type of prior information.

Very good examples of perceptions driving self-selection.
It might be useful to discuss direct and indirect effects.
Suppose we want to compare fatality rates if everyone drove a Volvo versus if no one did. If the fatality rate were lower in the former scenario than in the latter, that would indicate that Volvos (causally) decrease fatality rates.
It's possible that this works entirely through an indirect effect. For example, the decrease in the fatality rate might be due entirely to behavior changes (maybe when you get in a Volvo you think 'safety' and drive slower). On the DAG, we would have an arrow from Volvo to behavior to fatality, and no arrow directly from Volvo to fatality.
A total causal effect is much easier to estimate. We would need to assume ignorability (conditional independence of assignment given covariates). And even though safer drivers might tend to self-select into the Volvo group, it's never uniform. Safe drivers who select other vehicles would be given a lot of weight in the analysis. We would just have to have good, detailed data on predictors of driver safety.
Estimating direct and indirect effects is much harder. Typically it requires assuming ignorability of the intervention and the mediator(s). It also typically involves indexing counterfactuals with non-manipulable variables.
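A toy simulation of the "entirely indirect" DAG (Volvo → behavior → fatality, no direct arrow); all of the probabilities here are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

volvo = rng.binomial(1, 0.5, n)
# Driving a Volvo makes cautious behavior more likely (the only causal pathway)
behavior = rng.binomial(1, 0.3 + 0.4 * volvo)        # 1 = drives slowly
fatality = rng.binomial(1, 0.02 - 0.015 * behavior)  # risk depends only on behavior

# The total effect of Volvo on fatality is nonzero even with no direct arrow:
# it flows entirely through behavior (expected difference: 0.4 * -0.015 = -0.006)
total = fatality[volvo == 1].mean() - fatality[volvo == 0].mean()
print(total)
```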
As an aside: a machine learning graduate student worked with me last year, and in most simulated-data settings that we explored, logistic regression outperformed SVM.
I'd like to ask those people who downvoted this post for their reasons. I thought this was a reasonable antiprediction to the claims made regarding the value of a future galactic civilisation. Based on economic and scientific evidence it is reasonable to assume that the better part of the future, namely the time from 10^20 to 10^100 years (and beyond), will be undesirable.
If you spend money and resources on the altruistic effort of trying to give birth to this imagined galactic civilisation, why don't you take into account the more distant and much larger part of the future that lacks any resources to sustain such a civilisation? You are deliberately causing suffering here by putting short-term interests over those of the bigger part of the future.
In my opinion, the post doesn't warrant -90 karma points. That's pretty harsh. I think you have plenty to contribute to this site -- I hope the negative karma doesn't discourage you from participating, but rather, encourages you to refine your arguments (perhaps get feedback in the open thread first?)
How about spreading rationality?
This site, I suspect, mostly attracts high IQ analytical types who would have significantly higher levels of rationality than most people, even if they had never stumbled upon LessWrong.
It would be great if the community could come up with a plan (and implement it) to reach a wider audience. When I've sent LW/OB links to people who don't seem to think much about these topics, they often react with one of several criticisms: the post was too hard to read (written at too high a level); the author was too arrogant (which I think women particularly dislike); or the topic was too obscure.
Some have tried to reach a wider audience. Richard Dawkins seems to want to spread the good word. Yet, I think sometimes he's too condescending. Bill Maher took on religion in his movie Religulous, but again, I think he turned a lot of people off with his approach.
A lot has been written here about why people think what they think and what prevents people from changing their minds. Why not use that knowledge to come up with a plan to reach a wider audience? I think the marginal payoff could be large.
The consequences of non-consequentialism are disastrous. Just look at charity: instead of trying to get the most good per buck, people donate because it "makes them a better person" or "is the right thing to do", essentially throwing all that good away.
If we got our act together and did the most basic consequentialist thing of establishing a monetary value per death and per unit of suffering prevented, the world would immediately become a far less sucky place to live than it is now.
This world is so filled with low-hanging fruit that we're not picking, purely because of backwards morality, that it's not even funny.
But: "You can be a virtue ethicist whose virtue is to do the consequentialist thing to do"





Error finding: I strongly suspect that people are better at finding errors if they know there is an error.
For example, suppose we did an experiment where we randomized computer programmers into two groups. Both groups are given computer code and asked to try and find a mistake. The first group is told that there is definitely one coding error. The second group is told that there might be an error, but there also might not be one. My guess is that, even if you give both groups the same amount of time to look, group 1 would have a higher error identification success rate.
Does anyone here know of a reference to a study that has looked at that issue? Is there a name for it?
Thanks