How to Evaluate Data?

jetm

What I'm trying to figure out is, how do I determine whether a source I'm looking at is telling the truth? For example, let's take this page from Metamed: http://www.metamed.com/vital-facts-and-statistics

At first glance, I see some obvious things I ought to consider. It often gives numbers for how many die in hospitals/year, but for my purposes I ought to interpret it in light of how many hospitals are in the US, as well as how many patients are in each hospital. I also notice that as they are trying to promote their site, they probably selected the data that would best serve that purpose.
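To make the "interpret it in light of how many patients" step concrete, here is a minimal sketch of turning a raw yearly count into a rate. The function name `rate_per_100k` and both input figures are hypothetical placeholders, not numbers from the Metamed page or any other source; the point is only that a count means little until it is divided by the size of the exposed population.

```python
def rate_per_100k(events: int, population: int) -> float:
    """Return the number of events per 100,000 members of the exposed population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return events / population * 100_000


# Hypothetical placeholder figures, NOT taken from the Metamed page or any real dataset.
deaths_per_year = 50_000
admissions_per_year = 35_000_000

rate = rate_per_100k(deaths_per_year, admissions_per_year)
print(f"{rate:.1f} deaths per 100,000 admissions")
```

The same raw count looks very different depending on the denominator (admissions, patient-days, procedures), so it is worth checking which one a source had in mind before comparing figures.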

So where do I go from here? Evaluating each source they reference seems like a waste of time. I do not think it would be wrong to trust that they are not actively lying to me. But how do I move from here to an accurate picture of general doctor competence?

I don't see why I should give up just because what I've got isn't convenient to work with. The data is what it is; I want to use it in a Bayesian update of my prior probabilities that the 1995 data is kosher or made up.

Well heck, no one can stop you from intellectual masturbating. Just because it emits nothing anyone else wants to touch is not a reason to avoid doing it.

But you're working with made up data, the only real data is a high level summary which doesn't tell you what you want to know, you have no reasonably defined probability distribution, no defensible priors, and you're working towards justifying a conclusion you reached days ago (this exercise is a perfect example of motivated reasoning: "I dislike this data, and it turns out I am right since some of it was completely made up, and now I'm going to prove I'm extra-right by exhibiting some fancy statistical calculations involving a whole bunch of buried assumptions and choices which justify the already written bottom line").

My more elaborate procedure is only trying to refine this judgment by taking into account the entire joint probability distribution and trying to "hug the query" as much as possible. With the simulation I can not only pinpoint how astronomically unlikely the coincidence is, but also tell you how much "slop" in categories would be plausible. (If you look for a match within 5% rather than within 1%, then the probability of a coincidence rises to less-than-significant.)
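For readers trying to picture the kind of simulation being described, here is a rough sketch, not the code from this thread. It estimates how often randomly generated projects produce per-category spending shares that all fall within a given tolerance of some target shares. Everything in it is an assumption made for illustration: the function name `coincidence_probability`, the target shares, the number of projects, the clipped normal cost model, and the choice to assign categories uniformly at random.

```python
import random


def coincidence_probability(target_shares, n_projects, tolerance,
                            trials=20_000, seed=0):
    """Estimate P(every category's spend share is within `tolerance` of its target).

    Assumed null model (hypothetical): each project gets a cost drawn from a
    normal distribution clipped to stay positive, and a category chosen
    uniformly at random.
    """
    rng = random.Random(seed)
    k = len(target_shares)
    hits = 0
    for _ in range(trials):
        totals = [0.0] * k
        for _ in range(n_projects):
            cost = max(rng.gauss(1.0, 0.5), 0.01)  # assumed cost model
            totals[rng.randrange(k)] += cost       # assumed uniform categories
        grand_total = sum(totals)
        shares = [t / grand_total for t in totals]
        if all(abs(s - t) <= tolerance for s, t in zip(shares, target_shares)):
            hits += 1
    return hits / trials


# Hypothetical target shares; compare a tight match with a sloppier one.
targets = [0.5, 0.3, 0.2]
for tol in (0.01, 0.05):
    p = coincidence_probability(targets, n_projects=50, tolerance=tol)
    print(f"tolerance {tol:.2f}: estimated probability {p:.4f}")
```

The estimate depends heavily on the assumed null model, which is exactly the point of contention in the replies: change the cost distribution or the category assignment and the "coincidence" probability changes with it.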

I've already pointed out that under a reasonable interpretation of the imaginary data, the observed frequencies are literally the most likely outcome. Would your procedure make any sense if run on, say, lottery tickets?

I don't have to assume anything at all about the 1995 data (such as how many projects it represents), because as I've stated earlier $37B is the entire DoD spend in that year - if the data isn't made up then it amounts to an exhaustive survey rather than a sampling, and thus the observed frequencies are population frequencies...My reasoning is as follows: assume the costs of the projects are drawn from a normal distribution.

As I said. Assumptions.

Here is a corrected version of the code. I've also fixed the SD of the sample, which I miscalculated the first time around.

Although it's true that even if you make stuff up and choose to interpret things weirdly in order to justify the conclusion, the code should at least do what you wanted it to.

Do you disagree that the presence in a small sample of two instances of very rare species constitutes strong prima facie evidence against the "coincidence" hypothesis?

I've already pointed out that under a reasonable interpretation of the imaginary data, the observed frequencies are literally the most likely outcome. Would your procedure make any sense if run on, say, lottery tickets?

I don't know what you mean by the above, despite doing my best to understand. My intuition is that "the most likely outcome" is one in which our 9-proje...