You're looking at Less Wrong's discussion board. This includes all posts, including those that haven't been promoted to the front page yet. For more information, see About Less Wrong.

How to Evaluate Data?

5 Post author: jetm 09 April 2013 04:10AM

What I'm trying to figure out is: how do I determine whether a source I'm looking at is telling the truth? As an example, let's take this page from MetaMed: http://www.metamed.com/vital-facts-and-statistics

At first glance, I see some obvious things I ought to consider. It often gives numbers for how many people die in hospitals per year, but for my purposes I ought to interpret those in light of how many hospitals there are in the US, and how many patients each hospital sees. I also notice that, as they are trying to promote their site, they probably selected the data that would best serve that purpose.

So where do I go from here? Evaluating each source they reference seems like a waste of time. I do not think it would be wrong to trust that they are not actively lying to me. But how do I move from here to an accurate picture of general doctor competence?

Comments (45)

Comment author: Morendil 10 April 2013 01:52:37PM *  15 points [-]

Overall the contents of the linked page make me want to update quite a bit away from trusting MetaMed. One more example:

One million children every year have had unnecessary CT scans, which risks exposing them to radiation levels up to those experienced by survivors of Hiroshima and Nagasaki.

Compare with this excerpt from the primary source, which presumably serves as the basis for the claim:

Most of the quantitative information that we have regarding the risks of radiation-induced cancer comes from studies of survivors of the atomic bombs dropped on Japan in 1945. Data from cohorts of these survivors are generally used as the basis for predicting radiation-related risks in a population because the cohorts are large and have been intensively studied over a period of many decades, they were not selected for disease, all age groups are covered, and a substantial subcohort of about 25,000 survivors received radiation doses similar to those of concern here — that is, less than 50 mSv.

The primary source is not claiming that "a CT scan exposes you to Hiroshima-Nagasaki survivor radiation levels". It is saying the converse - "some atomic bomb survivors received doses low enough to be comparable to CT scans". The phrase "survivors of Hiroshima and Nagasaki" is pure fear-mongering - how much radiation the typical atomic bomb survivor received is not public knowledge, so we'll tend to think in terms of worst cases (this handy chart might help). The average dose received, however, was 210 mSv according to one source I consulted; this is four times the high-end dose from a pediatric CT, with the low end around 5 mSv. The statement from MetaMed is perhaps not an outright lie, but it is at least grossly misleading.

For a business which has been touted right here and by no less than Eliezer himself as providing "actual evidence-based healthcare", this is a little worrisome.

(ETA: contrary to appearances I'm not actually trying to take over this whole post's discussion area, but I happen to get easily nerd-sniped by fact-checking exercises and easily worked up when I get a sense that someone is trying to pull a fast one on me.)

Comment author: private_messaging 26 April 2013 10:49:44AM *  5 points [-]

Alternative medicine is primarily the domain of crackpots and scam artists, and it is no surprise whatsoever to me that the individuals involved in this sort of thing would capitalize on fear of radiation.

The doses are highly misleading: a CT scan is typically not a whole-body exposure, and cancer risk is proportional to the tissue-adjusted whole-body dose, not the organ doses. For a CT scan of the head, that works out to 1-2 mSv. For comparison, the annual dose from all sources in the US is listed as 2.4 mSv.

While there's little question (based on our understanding of cancer and radiation) that the risk continues linearly down to arbitrarily low doses, the risk of a 1-2 mSv exposure is small and utterly dwarfed by the risks inherent in ordering medical advice from non-domain-experts over the internet - especially considering that there are guidelines for when to do and not to do CT scans, compiled by experts who have worked on this specific issue far longer, and that there are considerable risks involved in not doing a CT scan.

Comment author: Morendil 10 April 2013 09:32:17AM *  11 points [-]

The "98,000 patients" claim is really interesting as an example of Dark Arts, aside from its having been debunked often.

It is often presented as follows: "98,000 deaths per year from medical errors (the equivalent of a jumbo jet crashing every day)".

It would be... provided every single jumbo jet flying in the US were populated by people already seriously ill or injured in the first place, rather than (as is actually the case) by passengers who are not only healthy but generally also wealthy.

Of course you're supposed to overlook that trivial difference in the demographics of people who are in planes and those who are in hospitals, and picture hospitals killing healthy rich people by the planeload.

(This also suggests that "number of deaths" is a poor metric for making such estimates and comparisons; it would be better to compute "overall loss of expected QALYs resulting from preventable mistakes in medical care" and compare that with the aggregate loss of QALYs from other causes. Of course that's much less catchy.)

Comment author: Morendil 16 November 2014 06:01:14PM 0 points [-]

Interestingly, this article offers a QALY-based economic estimate, but for some weird reason plucks a wild guess as to the average number of years of life lost as a result of medical errors - ten years - with not the slightest justification. Of course this leads to a largish estimate of total impact.

This other article updates the estimate of annual deaths in the US to 400,000, with a lower bound of 210,000. This may be the result of misapplying an estimate of what fraction of adverse events are preventable - the fraction was estimated on the overall sample (including non-fatal adverse events) but then applied to the much smaller set of fatal adverse events. Most fatal events result from surgery, which the same article notes has a much lower rate of "preventable" events, but I can't see that the total-deaths estimate accounts for that.

Comment author: Morendil 09 April 2013 06:12:10PM *  7 points [-]

I have some experience with this.

Some "facts" just set my spidey-sense tingling, and I find it usually well worth the time to check out the references in such cases. In general, at the slightest doubt I will at least Google the reference and check out the abstract - this is quick and will at least guarantee that the source exists.

Particular things that set my spidey-sense off are:

  • sensationalistic claims - any claim that is ostensibly used to shock the reader into action
  • excess precision - like claims about "56% of all software projects" vs "roughly 60% in the population we sampled"
  • excess confidence about hard-to-test claims, in software those tend to be "productivity" claims
  • claims that are ambiguous or that would be very hard to confirm experimentally, e.g. the well-known canard about how many times a day men think about sex; basically "is this even a sane question to ask"
  • hard-to-find primary sources - when you can't readily check it, a claim becomes more suspicious
  • abstracts that don't contain the claim being put forward - it's more suspicious when you cite a paper for a tiny bit buried deep within
  • (ETA) "phone game" claims - a citation to a citation of a citation, etc. with many indirections before reaching the primary source

Let's look at some specifics of the page you cite - at the outset we note that it's designed to be sensationalistic, a marketing brochure basically. It's up to you to factor that into your assessment of how much you trust the references.

  • "As many as 98,000 people die in hospitals each year as a result of medical errors" - doesn't feel implausible, but a quick Google for US annual death rates turns up that this would be twice the death rate for suicide. This document seems to contradict the finding; I'd check out the reference
  • "Doctors spend an average of only ten minutes with patients" - an average like that isn't too hard to work out from a survey and squares with personal experience, I'd take it at face value
  • "By some estimates, deaths caused by medicine itself total 225,000 per year" - excess precision for a phrase as vague as "caused by medicine itself", I'd check out the reference just to know what that's supposed to mean
  • "Most published research findings are false" - this is the title of the Ioannidis article, and should not be taken at face value to mean "all research in all fields of medicine", read the ref for sure
  • "Up to 30% of patients who die in hospitals have important diseases or lesions that are not discovered until after death" - I'd want to know more about how they estimate this - are we talking about extrapolation from the few deaths which result in autopsy?
  • "It takes an average of 17 years for scientific research to enter clinical practice" - uh, maybe; somewhat ambiguous (what's an objective criterion for "entering clinical practice"?)
  • "In oncology alone, 90% of preclinical cancer studies could not be replicated." - I get into trouble almost immediately trying to check this reference, given as "Begley, C. G. (n.d.). Preclinical cancer research, 8–10.": Google gives me partial matches on the title but no exact matches; "(n.d.)" means "no date", which is kind of weird - this does have a publication date and even a URL
  • "deaths from cancer have barely been touched" - I wouldn't be surprised, cancer is a tough bastard
  • "If a primary care physician provided all recommended [...] care [...] he would need to work 21.7 hours a day" - excess precision (also a hypothetical, so not really "evidence"); check the source

Also, this is a Web page, so I get suspicious on principle that no hyperlinks are provided.

Comment author: somervta 12 April 2013 04:58:31AM 1 point [-]

"As many as 98,000 people die in hospitals each year as a result of medical errors" - doesn't feel implausible, but a quick Google for US annual death rates turns up that this would be twice the death rate for suicide. This document seems to contradict the finding; I'd check out the reference

This number may not only include US data.

Comment author: Morendil 12 April 2013 05:56:56AM 1 point [-]

Indeed. And how do we find that out?

Comment author: Morendil 09 April 2013 07:25:27PM *  3 points [-]

On the 98,000 figure this may bring some balance:

Similar to previous studies, almost a quarter (22.7%) of active-care patient deaths were rated as at least possibly preventable by optimal care, with 6.0% rated as probably or definitely preventable. Interrater reliability for these ratings was also similar to previous studies (0.34 for 2 reviewers). The reviewers' estimates of the percentage of patients who would have left the hospital alive had optimal care been provided was 6.0% (95% confidence interval [CI], 3.4%-8.6%). However, after considering 3-month prognosis and adjusting for the variability and skewness of reviewers' ratings, clinicians estimated that only 0.5% (95% CI, 0.3%-0.7%) of patients who died would have lived 3 months or more in good cognitive health if care had been optimal, representing roughly 1 patient per 10 000 admissions to the study hospitals.

(...)

Medical errors are a major concern regardless of patients' life expectancies, but our study suggests that previous interpretations of medical error statistics are probably misleading.

(...)

In an exchange about the validity of these estimates, McDonald et al argued on theoretical grounds that these statistics are likely overestimates. They were particularly concerned about the lack of consideration of the expected risk of death in the absence of the medical error. Indeed, these statistics have often been quoted without regard to cautions by the authors of the original reports, who note that physician reviewers do not believe necessarily that 100% of these deaths would be prevented if care were optimal.

(...)

As predicted on theoretical grounds, many deaths reportedly due to medical errors occur at the end of life or in critically ill patients in whom death was the most likely outcome, either during that hospitalization or in the coming months, regardless of the care received. However, this was not the only—or even the largest—source of potential overestimation. Previously, most have framed ratings of preventable deaths as a phenomenon in which a small but palpable number of deaths have clear errors that are being reliably rated as causing death. Our results suggest that this view is incorrect—that if many reviewers evaluate charts for preventable deaths, in most cases some reviewers will strongly believe that death could have been avoided by different care; however, most of the "errors" identified in implicit chart review appear to represent outlier opinions in cases in which the median reviewer believed either that an error did not occur or that it had little or no effect on the outcome.

ETA: see also the primary source for the 98,000 figure (try and find it!); this discusses the 98,000 figure as a Fermi estimate

Comment author: [deleted] 09 April 2013 05:32:06AM 3 points [-]

It is more useful to determine whether a source you're looking at is not telling the truth. Find one black swan, and you don't have to look at all possible swans to determine that the claim "all swans are white" is incorrect.

In the example you gave, identify ways to determine whether the source is not telling the truth. That could include inaccurate quotes, or accurate quotes of faulty data, or consulting the cited texts plus competing texts, but I'm not sure it can include avoiding reading the cited texts even if you think it's a waste of time.

Comment author: jetm 09 April 2013 01:59:04PM 0 points [-]

That makes a lot of sense. Looks like I'll be slogging through a lot of links then. Thank you for the tip!

Comment author: Morendil 09 April 2013 06:22:14PM 2 points [-]

Related to this, I've been wondering if I should write a post based on this G+ blog, but aimed at LW readers specifically and focusing on probabilistic thinking about the claim.

To recap: we have this supposed "survey" of U.S. Defense projects claimed to have taken place in 1995, and to have looked at $37Bn worth of software development projects. It classifies them into five categories (from "OK" to "bad" to "horrible"), it doesn't really matter what they are, we can call them A,B,C,D,E. There's a certain allocation: A:46%, B: 29%, C: 20%, D: 3%, E: 2%.

But we also find an earlier (1979) study, with a more credible primary source. Its five categories are labeled exactly the same, its sample size is much smaller - 9 projects for $7 million total. The allocation is nearly the same: A:47%, B: 29%, C: 19%, D: 3%, E: 2%.

The article I link in the G+ post, written by someone with a PhD, remarks on this coincidence:

Individually, these studies indicate that the success rate for U. S. Government outsourcing has historically been very low. Together they indicate that the success rate has not improved despite the introduction of new technologies and management procedures over the years.

The exercise consists in working out somewhat rigorously the probability that, given the hypothesis "there exists a fixed probability with which software projects fall into categories A,B,C,D,E" you would get, within 1%, the exact same results from a huge ($37Bn) survey as you'd have gotten from an independent and much smaller sample.

(Intuitively, this is a little like saying you've found a 9-person family whose polling results exactly predict the result of a national election with 5 candidates.)
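That intuition can be made concrete with a quick sketch (my own illustration, treating the shares as plain counts out of 9 rather than the cost-weighted shares the studies actually report): with only 9 data points, the achievable shares are multiples of 1/9, so none of the five percentages can even be hit to within 1%.

```python
# The 1995 category shares (as fractions) a count-based sample of 9 would
# need to reproduce:
TARGET = [0.46, 0.29, 0.20, 0.03, 0.02]

def closest_achievable(share, n=9):
    """Closest share a count-based sample of n items can produce: some k/n."""
    return min((k / n for k in range(n + 1)), key=lambda s: abs(s - share))

for t in TARGET:
    best = closest_achievable(t)
    print(f"target {t:.0%}: closest achievable {best:.1%}, off by {abs(best - t):.1%}")
```

In particular, the 3% and 2% categories are closest to 0/9, so a count-based sample of 9 could not contain them at all.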

Comment author: gwern 10 April 2013 05:52:07PM *  1 point [-]

I've clicked through, and don't entirely understand your point. So the later figure is made up? Well, obviously that's a serious issue and like your earlier post on the diagram, interesting to read about. But if the two sets of frequencies were both, you know, real and not completely made up, and examined the same populations and classified their samples the same way, I'm not sure what's so wrong about comparing them with a chi-squared or something.

Comment author: Morendil 10 April 2013 08:08:41PM *  2 points [-]

Thanks for the feedback. Maybe I can better understand how what's blindingly obvious to me doesn't jump out at everyone else.

I don't know how or when to use a chi-squared test. What I did was assume - for the sake of checking my intuition - that the two sets of frequencies were indeed not made up.

To work out probabilities, you need to have some kind of model. I decided to use the simplest sampling model I could think of, where in both cases any given IT project has independently a fixed probability of turning out in one of the categories A, B, C, D, E.

The 1995 "study" has a sample size of $37Bn - this in fact turns out to match estimates of the entire DoD spend on IT projects in that year. So if these numbers are correct, then the frequencies must be precisely the probabilities for any given project to fall into the buckets A, B, C, D or E.

What I did next was work out some reasonable assumptions for the 1979 set of frequencies. It is drawn from a sample of 9 projects totaling $6.8M, so the mean project cost in the sample is $755K, and knowing a few other facts we can compute a lower bound for the standard deviation of the sample.

Given a mean, a standard deviation, and the assumption that costs are normally distributed in the population, we can approach by simulation an answer to the question "how likely is our assumption that both sets of frequencies are not made up and just happen to be within 1% of each other by chance, given the respective size of the samples".

The frequencies are given in terms of the categories as a proportion of the total cost. I wrote a Python program to repeatedly draw a sample of 9 projects from a population assumed to have the above mean cost and standard deviation, compute the relative proportions of the 5 categories, and return a true result if they were within 1% of the population probabilities.

Run this program passing the number of simulation runs as an argument. You can verify that the likelihood of reproducing the same set of frequencies within 1%, assuming that this happens by chance, is vanishingly small.
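For concreteness, here is a minimal sketch of the kind of simulation described above - not the original program. The SD value is an illustrative placeholder, and the category-assignment and cost-draw details are my own modeling choices.

```python
import random

# 1995 cost-weighted frequencies, treated as population probabilities:
PROBS = {"A": 0.46, "B": 0.29, "C": 0.20, "D": 0.03, "E": 0.02}
# 1979 cost-weighted shares we are trying to reproduce by chance:
TARGET = {"A": 0.47, "B": 0.29, "C": 0.19, "D": 0.03, "E": 0.02}

MEAN_COST = 755_000  # $6.8M / 9 projects
SD_COST = 320_000    # illustrative placeholder for the sample's SD

def sample_matches(tol=0.01, n=9):
    """Draw n projects and check whether their cost shares match TARGET within tol."""
    cats = random.choices(list(PROBS), weights=list(PROBS.values()), k=n)
    costs = [max(random.gauss(MEAN_COST, SD_COST), 1.0) for _ in range(n)]
    total = sum(costs)
    shares = {c: sum(k for cat, k in zip(cats, costs) if cat == c) / total
              for c in PROBS}
    return all(abs(shares[c] - TARGET[c]) <= tol for c in TARGET)

def estimate(runs):
    """Fraction of simulated 9-project samples that reproduce TARGET within 1%."""
    return sum(sample_matches() for _ in range(runs)) / runs
```

Under these assumptions, `estimate` with a large number of runs yields a hit rate at or near zero.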

So, this "experiment" rejects the null hypothesis that the apparent match in both sets of frequencies is due to chance, as opposed to something else like one of them being made up.

(EDIT - removed the code I originally posted in this comment, a better version appears here.)

Comment author: gwern 10 April 2013 08:45:26PM 1 point [-]

I don't know how or when to use a chi-squared test. What I did was assume - for the sake of checking my intuition - that the two sets of frequencies were indeed not made up.

It's the usual go-to frequentist test for comparing two sets of categorical data. You say you have 4 categories with 10/4/9/3 members, you have your null hypothesis, and you're interested in how often, assuming the null, results as extreme or more extreme than your new data of 200/80/150/20 would appear. Like rolling a biased 4-sided die.

(If you're curious, that specific made up example would be chisq.test(matrix(c(10,4,9,3,200,80,150,20), ncol = 2),) with a p-value of 0.4.)
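For readers without R handy, the same made-up example can be run through SciPy (assuming `scipy` is installed); it agrees with the p-value of about 0.4 quoted above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# The same made-up 4-category example: 10/4/9/3 vs 200/80/150/20,
# laid out as a 4x2 contingency table (rows = categories, cols = samples).
table = np.array([[10, 200],
                  [ 4,  80],
                  [ 9, 150],
                  [ 3,  20]])

chi2, p, dof, _expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2f}")
```

(For a 4x2 table, `chi2_contingency` applies no continuity correction, matching R's behavior for tables larger than 2x2.)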

The 1995 "study" has a sample size of $37Bn - this in fact turns out to match estimates of the entire DoD spend on IT projects in that year. So if these numbers are correct, then the frequencies must be precisely the probabilities for any given project to fall into the buckets A, B, C, D or E. What I did next was work out some reasonable assumptions for the 1979 set of frequencies. It is drawn from a sample of 9 projects totaling $6.8M, so the mean project cost in the sample is $755K, and knowing a few other facts we can compute a lower bound for the standard deviation of the sample.

This seems like a really weird procedure. You should be looking at the frequencies of each of the 5 categories, not messing around with means and standard deviations. (I mean heck, just what about 2 decades of inflation or military growth or cutbacks?) What, you think that the 1995 data implies that the Pentagon had $37bn/$755K=49006 different projects?

I don't know Python or NumPY and your formatting is messed up, so I'm not sure what exactly you're doing. (One nice thing about using precanned routines like R's chisq.test: at least it's relatively clear what you're doing.)

But we also find an earlier (1979) study, with a more credible primary source. Its five categories are labeled exactly the same, its sample size is much smaller - 9 projects for $7 million total. The allocation is nearly the same: A:47%, B: 29%, C: 19%, D: 3%, E: 2%.

Looking closer, I'm not sure this data makes sense. 0.02 * 9 is... 0.18. Not a whole number. 47% * 9 is 4.23. Also not a positive integer or zero. 0.29 * 9 is 2.61.

Sure, the percentages do sum to 100%, but D and E aren't even possible: 1/9 = 11%!

Comment author: Morendil 12 April 2013 05:03:59PM 1 point [-]

Looking closer, I'm not sure this data makes sense. 0.02 * 9 is... 0.18. Not a whole number.

Basically, that's you saying exactly what is making me say "the coincidence is implausible". A sample of 9 will generally not contain an instance of something that comes up 2% of the time. Even more seldom will it contain that and an instance of something that comes up 3% of the time.

So, in spite of appearances, it seems as if our respective intuitions agree on something. Which makes me even more curious as to which of us is having a clack and where.

Comment author: gwern 12 April 2013 05:26:14PM 1 point [-]

No, my point there was that in a discrete sample of 9 items, 2% simply isn't possible. You jump from 1/9 (11%) straight to 0/9 (0%). But you then explained this impossibility as being the percentage of the total budget of all sampled projects that could be classified that way, which doesn't make the percentage mean much to me.

Comment author: Morendil 10 April 2013 08:49:01PM 1 point [-]

not sure this data makes sense. 0.02 * 9 is... 0.18. Not a whole number

The proportions are by cost, not by counts. The 2% is one $118K project, which works out to 1.7% of the $6.8M total, rounded up to 2%.

Comment author: gwern 10 April 2013 08:55:09PM 0 points [-]

So you don't even know how many projects are in each category for the original study?

Comment author: Morendil 10 April 2013 08:58:59PM 1 point [-]

Nope, aggregates is all we get to work with, no raw data.

Comment author: gwern 10 April 2013 09:26:14PM 0 points [-]

Yeah, I don't think you can do anything with this sort of data. And even if you had more data, I'm not sure you could conclude much of anything - almost identical percentages are always going to be highly likely, even if you go from a sample of 9 to a sample of 47000 or whatever. I'll illustrate. Suppose that instead of being something useless like fraction of expenditure, your 1970s datapoint was exactly 100 projects, 47 of which were classified A, 29 of which were classified B, etc. (we interpret the percentages as frequencies and avoid any awkward issues of "the average person has 1.9 arms"); and suppose we estimated the $37bn datapoint as having the same mean cost per project, making it a sample of about 49,000 projects, so the second sample is 490 times bigger (49k / 100). Then when we look at A being 47% in the first sample we have n=47 projects, but when we look at A being 46% in the second sample, we this time have an n of 46*490=22540 projects. Straightforward enough, albeit an exercise in making stuff up.

So, with a sample 490 times larger, does differing by a percent or two offer any reason to reject the null that they have the same underlying distributions? No, because they're still so similar:

R> chisq.test(matrix(c(47,29,19,3,2, 46*490,29*490,20*490,3*490,2*490), ncol = 2), simulate.p.value = TRUE, B = 20000000)
Pearson's Chi-squared test with simulated p-value (based on 2e+07 replicates)
data: matrix(c(47, 29, 19, 3, 2, 46 * 490, 29 * 490, 20 * 490, 3 * 490, 2 * 490), ncol = 2)
X-squared = 0.0716, df = NA, p-value = 0.9983

Comment author: Morendil 12 April 2013 09:35:56AM *  1 point [-]

Yeah, I don't think you can do anything with this sort of data.

I don't see why I should give up just because what I've got isn't convenient to work with. The data is what it is, I want to use it in a Bayesian update of my prior probabilities that the 1995 data is kosher or made up.

Intuitively, the existence of categories at 2% and 3% makes the conclusion clear. If the 1995 data isn't made up, then it is very rare for a project to fall into either of these categories at all - 1/50 and 1/30 chances respectively. So the chance that our small sample of 9 projects happens to contain one of each is very small to start with, about 9/150. Immediately, this is strong Bayesian evidence against the null hypothesis.
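As a sanity check on that rough 9/150 figure (a count-only sketch of my own; the full coincidence also requires the cost shares to line up), inclusion-exclusion over the events "no 3% project" and "no 2% project" gives the chance of seeing at least one of each in a sample of 9:

```python
p_d, p_e, n = 0.03, 0.02, 9  # rare-category probabilities, sample size

# P(at least one D-type AND at least one E-type project in the sample),
# via inclusion-exclusion over the events "no D" and "no E":
p_both = 1 - (1 - p_d)**n - (1 - p_e)**n + (1 - p_d - p_e)**n
print(f"{p_both:.3f}")  # about 0.036
```

That comes out around 3-4%, the same order of magnitude as the rough 9/150 estimate.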

Do you disagree?

My more elaborate procedure is only trying to refine this judgment by taking into account the entire joint probability distribution and trying to "hug the query" as much as possible. With the simulation I can not only pinpoint how astronomically unlikely the coincidence is, but also tell you how much "slop" in categories would be plausible. (If you look for a match within 5% rather than within 1%, then the probability of a coincidence rises to less-than-significant.)

I don't have to assume anything at all about the 1995 data (such as how many projects it represents), because as I've stated earlier $37B is the entire DoD spend in that year - if the data isn't made up then it amounts to an exhaustive survey rather than a sampling, and thus the observed frequencies are population frequencies. I treat the 1995 data as "truth", and only need to view the 1979 as a sampling procedure.

Here is a corrected version of the code. I've also fixed the SD of the sample, which I miscalculated the first time around.

(My reasoning is as follows: assume the costs of the projects are drawn from a normal distribution. Then we already know the mean ($6.8M / 9 = $755K), we know that one project cost $119K and another $198K (accounting for the 2% and 3% categories respectively), so the "generous" assumption is that the other 7 projects were all the same size ($926K), giving us the tightest normal possible.)
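Those assumptions can be checked directly with a few lines (a sketch using the figures quoted in the parenthetical above; amounts in thousands of dollars, population SD):

```python
# Two known projects ($119K and $198K); the "tightest normal" assumption puts
# the remaining seven projects at equal cost so that the total is $6.8M.
costs = [119.0, 198.0] + [(6800.0 - 119.0 - 198.0) / 7] * 7

mean = sum(costs) / len(costs)
var = sum((c - mean) ** 2 for c in costs) / len(costs)  # population variance
sd = var ** 0.5
print(f"mean = ${mean:.0f}K, SD lower bound = ${sd:.0f}K")
```

The mean matches the quoted $755K (up to rounding), and the SD lower bound comes out a little under $320K.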

Comment author: gwern 12 April 2013 03:58:03PM 0 points [-]

I don't see why I should give up just because what I've got isn't convenient to work with. The data is what it is, I want to use it in a Bayesian update of my prior probabilities that the 1995 data is kosher or made up.

Well heck, no one can stop you from intellectually masturbating. Just because it emits nothing anyone else wants to touch is not a reason to avoid doing it.

But you're working with made up data, the only real data is a high level summary which doesn't tell you what you want to know, you have no reasonably defined probability distribution, no defensible priors, and you're working towards justifying a conclusion you reached days ago (this exercise is a perfect example of motivated reasoning: "I dislike this data, and it turns out I am right since some of it was completely made up, and now I'm going to prove I'm extra-right by exhibiting some fancy statistical calculations involving a whole bunch of buried assumptions and choices which justify the already written bottom line").

My more elaborate procedure is only trying to refine this judgment by taking into account the entire joint probability distribution and trying to "hug the query" as much as possible. With the simulation I can not only pinpoint how astronomically unlikely the coincidence is, but also tell you how much "slop" in categories would be plausible. (If you look for a match within 5% rather than within 1%, then the probability of a coincidence rises to less-than-significant.)

I've already pointed out that under a reasonable interpretation of the imaginary data, the observed frequencies are literally the most likely outcome. Would your procedure make any sense if run on, say, lottery tickets?

I don't have to assume anything at all about the 1995 data (such as how many projects it represents), because as I've stated earlier $37B is the entire DoD spend in that year - if the data isn't made up then it amounts to an exhaustive survey rather than a sampling, and thus the observed frequencies are population frequencies...My reasoning is as follows: assume the costs of the projects are drawn from a normal distribution.

As I said. Assumptions.

Here is a corrected version of the code. I've also fixed the SD of the sample, which I miscalculated the first time around.

Although it's true that even if you make stuff up and choose to interpret things weirdly in order to justify the conclusion, the code should at least do what you wanted it to.

Comment author: Kindly 12 April 2013 06:55:25PM 0 points [-]

Intuitively, the existence of categories at 2% and 3% makes the conclusion clear. If the 1995 data isn't made up, then it is very rare for a project to fall into either of these categories at all - 1/50 and 1/30 chances respectively. So the chance that our small sample of 9 projects happens to contain one of each is very small to start with, about 9/150.

Given that we know nothing about how the projects themselves were distributed between the categories, we can't actually say this with any confidence. It's possible, for example, that the 2% category actually receives many projects on average, but they're all cheap.

If you assume that the project costs are normally distributed, then that assumption makes the 1979 data inherently unlikely, no matter how close the percentages are to 1995: the existence of a category receiving 2% of the funding means that at best you have a data point which is only 18% of the mean (and another point at 27%). That just doesn't happen for normal distributions (unless the variance is so large that the model becomes ridiculous anyway, due to the huge probability of it giving you negative numbers).

Comment author: Morendil 11 April 2013 05:23:28AM 0 points [-]

So you wouldn't be surprised by my hypothetical scenario, where a family of 9 is claimed to poll exactly the same as the results in a national election?

Comment author: gwern 11 April 2013 04:40:13PM *  1 point [-]

No, I would be surprised, but that is due to my background knowledge that a family unit implies all sorts of mutual correlations, ranging from growing up (if one's parents are Republicans, one is almost surely a Republican as well) to location (most states are not equally split ideologically), and worries about biases and manipulations and selection effects ("This Iowa district voted for the winning candidate in the last 7 elections!").

On the other hand, if you simply told me that 9 random people split 5-4 for Obama, I would simply shrug and say, "Well, yeah. Obama had the majority, and in a sample of 9 people, a 5-4 split for him is literally the single most likely outcome possible - every other split like 9-0 is further removed from the true underlying probability that ~52% of people voted for him. It's not all that likely, but you could say that about every lottery winner or every single sequence you get when flipping a fair coin n times: each possible winner had just a one in millions chance of winning, or each sequence had a 0.5^n chance of happening. But, something had to happen, someone had to win the lottery, some sequence had to be produced by the final coin flip."

Comment author: Morendil 10 April 2013 08:56:28PM 0 points [-]

I'm not sure what exactly you're doing.

I think I've just spotted at least one serious mistake, so give me some time to clean this up. Probably I can do the same thing in R.

Comment author: DaFranker 09 April 2013 01:36:09PM *  1 point [-]

As Trevor_Blake said, there's very little you can do apart from actually checking some of the data. An alternative is to ask or pay someone else or a group to verify it for you.

Of course, there's always the option of coding a probabilistic engine that mines for stats and gives you reliability estimates of certain claims using some bayes-fu. But that takes math, programming, and lots of work.

Comment author: VCavallo 10 April 2013 03:30:05PM 0 points [-]

that takes math, programming, and lots of work

But sounds totally awesome. Especially if it can be created once and used over and over for different applications.

Comment author: DaFranker 10 April 2013 04:11:37PM *  1 point [-]

Well, my naive first thought was to abuse the opencyc engine for a while so it starts getting good rough guesses of which particular mathematical concepts and quantities and sets are being referred to in a given sentence, and plug it either directly or by mass download and conversion into various data sources like WolframAlpha or international health / crime / population / economics databases or various government services.

But that still means doing math (doing math with linguistics), plus tons and tons of programming just to get a working prototype that understands "30% of americans are older than 30 years old", way more work than I care to visualize just to get the system to respond in a sane manner, rather than choke, when you throw something incongruent at it ("30 of americans are 30% years old" should not make the system explode, for example), and so on. And then you've got to build something usable around that: interfaces, ways to extract and store data, and then probably pack everything together. And once you're there, you probably want to turn it into a product and sell it, since you might as well cash in some money on all of this work. Then more work.

The whole prospect looks like a small asteroid rather than a mountain, from where I'm sitting. I am not in the business of climbing, mining, deconstructing and exporting small asteroids. I'll stick to climbing over mountains until I have a working asteroid-to-computronium converter.

Comment author: gwern 10 April 2013 05:40:16PM 1 point [-]

My suggestion would be to go via some sort of meta-analysis or meta-meta-analysis (yes, that's a thing); if you have, for example, a meta-analysis of all results in a particular field and how often they replicate, you can infer pretty accurately how well a new result in that field will replicate. (An example use: 'So 90% of all the previous results with this sample size or smaller failed to replicate? Welp, time to ignore this new result until it does replicate.')

It would of course be a ton of work to compile them all, and then any new result you were interested in, you'd still have to know how to code it up in terms of sample size, which sub-sub-field it was in, what the quantitative measures were etc, but at least it doesn't require nigh-magical AI or NLP - just a great deal of human effort.
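The suggested base-rate screen could be sketched as a simple lookup: historical replication rates bucketed by sample size serve as the prior for a new result. (The buckets and rates below are made-up numbers for illustration; a real table would come from the compiled meta-analyses.)

```python
# Hypothetical historical replication rates by sample-size bucket.
replication_rate = {
    "n<50": 0.10,
    "50<=n<200": 0.35,
    "n>=200": 0.70,
}

def prior_for(n):
    """Prior probability that a new result with sample size n replicates."""
    if n < 50:
        return replication_rate["n<50"]
    if n < 200:
        return replication_rate["50<=n<200"]
    return replication_rate["n>=200"]

print(prior_for(40))  # small-sample result: gwern's "ignore until it replicates"
```

A real version would condition on more than sample size (sub-field, effect size, measurement type), but even this crude table captures the proposed heuristic.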

Comment author: DaFranker 10 April 2013 06:52:26PM *  1 point [-]

Nigh-magical is the word indeed. I just realized that if my insane idea in the grandparent were made to work, it could be unleashed upon all research publications ever everywhere for mining data, figures, estimates, etc., and then output a giant belief network of "this is collective-human-science's current best guess for fact / figure / value / statistic X".

That does not sound like something that could be achieved by a developer less than google-sized. It also fails all of my incredulity and sanity checks.

(it also sounds like an awesome startup idea, whatever that means)

Comment author: gwern 10 April 2013 07:14:50PM 1 point [-]

Or IBM-sized. But if you confined your ambitions to analyzing just meta-analyses, it would be much more doable. The narrower the domain, the better AI/NLP works, remember. There are some remarkable examples of what you can do by machine-reading a narrow domain and extracting meaningful scientific data; one of them is ChemicalTagger (demo), which reads chemistry papers describing synthesis procedures and extracts the process (although it has had serious trouble obtaining papers to work on). I bet you could get a lot out of reading meta-analyses - there's a good summary just in the forest plot that appears in almost every meta-analysis.

Comment author: ChristianKl 09 April 2013 05:16:46PM 0 points [-]

If you have a specific claim that you aren't sure is true, http://skeptics.stackexchange.com/ is a good website. Start by searching for whether someone else has already asked the question. If nobody has, then you can open a new one.

Comment author: westward 09 April 2013 05:53:37AM 0 points [-]

Well...what is your goal in evaluating the truthfulness of these statements? Even if they're actually "true" do they help you meet your goals?