Severe problems with the biomedical research process

GiveWell has recently been investigating ways to improve biomedical research. When I discovered GiveWell's research, I was shocked by how severe and comprehensive the problems with the field seem to be:

From a conversation with Ferric Fang:

Because scientists have to compete for grants, they spend a very large fraction of their time fundraising, sometimes more than 50% of their working hours. Scientists feel [strong] pressure to optimize their activities for getting tenure and grants, rather than for doing good science. 

From a conversation with Elizabeth Iorns:

Researchers are rewarded primarily for publishing papers in prestigious journals such as Nature, Science and Cell. These journals select for papers that report on surprising and unusual findings. Papers that report on unsound research that is apparently exciting are more likely to be published than papers which report on less exciting research that is sound.

There is little post-publication check on the soundness of papers’ findings, because journals, especially prestigious ones, generally don’t publish replications, and there is little funding for performing replications.

[…]

Pharmaceutical companies such as Bayer and Amgen have studied the frequency with which studies are reproducible by trying to reproduce them, and they have found that about 70% of published papers in the areas that they considered don’t reproduce.

[…]

Because many published results are not reproducible, it is difficult for scientists to use the published literature as a basis for deciding what experiments to perform.

[…]

As things stand, the pharmaceutical industry does replications; however, these are generally unpublished. Because a given lab doesn’t know whether other labs have found that a study fails to replicate, labs duplicate a lot of effort.

From a conversation with Ken Witwer:

Dr. Witwer published a study in Clinical Chemistry examining 127 papers that had been published between July 2011 and April 2012 in journals that ostensibly require that researchers deposit their microarray data. He found that the data was not submitted for almost 60% of the papers, and that the data for 75% of the papers was not in a format suitable for replication.

The above remarks give the impression that the problems are deeply entrenched and mutually reinforcing. At first glance, it seems that while one might be able to make incremental improvements (such as funding a journal that publishes replications), prospects for big improvements are very poor. But I became more hopeful after learning more.

The Rising Sea

The great mathematician Alexander Grothendieck wrote about two approaches to solving a difficult problem:

If you think of a theorem to be proved as a nut to be opened, so as to reach “the nourishing flesh protected by the shell”, then the hammer and chisel principle is: “put the cutting edge of the chisel against the shell and strike hard. If needed, begin again at many different points until the shell cracks—and you are satisfied”.

[…]

I can illustrate the second approach with the same image of a nut to be opened. The first analogy that came to my mind is of immersing the nut in some softening liquid, and why not simply water? From time to time you rub so the liquid penetrates better, and otherwise you let time pass. The shell becomes more flexible through weeks and months—when the time is ripe, hand pressure is enough, the shell opens like a perfectly ripened avocado!

A different image came to me a few weeks ago. The unknown thing to be known appeared to me as some stretch of earth or hard marl, resisting penetration … the sea advances insensibly in silence, nothing seems to happen, nothing moves, the water is so far off you hardly hear it …. yet it finally surrounds the resistant substance.

When a nut seems too hard to crack, it’s wise to think about the second method that Grothendieck describes.

Alternative Metrics

I was encouraged by GiveWell’s subsequent conversations with David Jay and Jason Priem, which suggest a “rising sea” type of solution to the cluster of apparently severe problems with biomedical research.

In brief, the idea is that it may be possible to create online communities and interfaces that can be used to generate measures of how valuable researchers find research outputs, and which could be used for funding and tenure decisions, thereby rewarding the production of the research outputs that other researchers find most valuable. If incentives become aligned with producing valuable research, the whole system will shift accordingly, greatly reducing the existing inefficiencies.

From a conversation with Jason Priem:

Historically, the academic community has filtered academic outputs for interest by peer review and, more specifically, the prestige of the journals where papers are published. This model is inadequate relative to filtering mechanisms that are now in principle possible using the Internet.

It is now possible to use the web to measure the quality and impact of an academic output via alternative metrics (altmetrics) such as

  • How many people downloaded it
  • How much it has been discussed on Twitter
  • How many websites link to it
  • The caliber of the scientists who have recommended it
  • How many people have saved it in a reference manager like Mendeley or Zotero

This is similar to how Google generates a list of webpages corresponding to a search term, since you can benefit from PageRank-type algorithms that foreground popular content in an intelligent fashion.
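
To make the PageRank comparison concrete, here is a minimal sketch of the kind of computation involved, run over a small hypothetical endorsement graph; the paper names, edges, damping factor, and iteration count are invented for illustration and are not part of any real altmetrics system.

```python
# Minimal PageRank-style power iteration over a hypothetical "endorsement"
# graph (output X cites or recommends output Y). All names and numbers are
# illustrative only.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each output to the list of outputs it endorses."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:      # endorsed outputs inherit weight
                    new_rank[target] += share
            else:                           # dangling node: spread rank evenly
                for n in nodes:
                    new_rank[n] += damping * rank[source] / len(nodes)
        rank = new_rank
    return rank

endorsements = {
    "preprint_A": ["paper_B", "paper_C"],
    "paper_B":    ["paper_C"],
    "paper_C":    [],
    "review_D":   ["paper_B", "paper_C"],
}
print(sorted(pagerank(endorsements).items(), key=lambda kv: -kv[1]))
```

The point is simply that weight flows from endorsers to the things they endorse, so a recommendation from an already highly ranked source counts for more than one from an unranked source.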

[…]

There’s been a significant amount of interest from funders and administrators in more nuanced and broader measures of researcher impact than their journal publication record. […] Algorithmically generated rankings of researchers’ influence as measured by the altmetrics mentioned previously could be an input into hiring, tenure, promotion, and grant decisions. ImpactStory and other providers of alternative metrics could help researchers aggregate their online impact so that they can present good summaries of it to administrators and funders.

From a conversation with David Jay:

Commenting systems could potentially be used to create much more useful altmetrics. Such altmetrics could be generated for a scientific output by examining the nature of the comments that scientists make about it, weighting the comments using factors such as the number of upvotes that a comment receives and how distinguished the commenter is.

The metrics generated would be more informative than a journal publication record, because commenters give more specific feedback than the acceptance/rejection of a paper submitted to a given journal does.
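
As a rough illustration of the weighting idea (the field names and the particular formula below are my own invention, not anything specified in the conversation notes), a comment-weighted score might look like this:

```python
# Hypothetical comment-weighted altmetric: each comment contributes a score
# based on its upvotes and on how distinguished the commenter is (proxied
# here by h-index). The log damping is an arbitrary illustrative choice.
import math

def comment_weight(upvotes, commenter_h_index):
    return math.log1p(upvotes) * math.log1p(commenter_h_index)

def output_score(comments):
    """comments: list of dicts with 'upvotes' and 'commenter_h_index' keys."""
    return sum(comment_weight(c["upvotes"], c["commenter_h_index"])
               for c in comments)

example = [
    {"upvotes": 12, "commenter_h_index": 30},  # detailed critique by a senior researcher
    {"upvotes": 3,  "commenter_h_index": 4},   # brief note by a graduate student
]
print(round(output_score(example), 2))
```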

[…]

If scientists were to routinely use online commenting systems to discuss scientific outputs, it seems likely that altmetrics generated from them would be strong enough for them to be used for hiring, promotion and grant-making decisions (in conjunction with, or in place of, the traditional metric of journal publication record).

[…]

David Jay envisages a future in which there is [...] a website that collects analytics from other websites so as to aggregate the impact of individual researchers, both for their own information and for use by hiring/promotion/grant committees.

The viability of this approach remains to be seen, but it could work really well, and it illustrates a general principle.

About the author: I worked as a research analyst at GiveWell from April 2012 to May 2013. All views expressed here are my own.

Comments

Agreed that improved incentives for truth-seeking would improve details across the board, while local procedural patches would tend to be circumvented.

alternative metrics (altmetrics) such as: how many people downloaded it; how much it has been discussed on Twitter; how many websites link to it; the caliber of the scientists who have recommended it; how many people have saved it in a reference manager like Mendeley or Zotero

The first three metrics seem like they could even more strongly encourage sexy bogus findings by giving the general public more of a role: the science press seem to respond strongly to press releases and unsubstantiated findings, as do website hits (I say this based on the "most emailed" and "most read" categories at the NYTimes science section).


The first three metrics seem like they could even more strongly encourage sexy bogus findings by giving the general public more of a role

Reference manager data could have the same effect, despite reference managers being disproportionately used by researchers rather than laypeople.

Using myself as an example, I sometimes save interesting articles about psychology, medicine, epidemiology and such that I stumble on, even though I'm not officially in any of those fields. If a lot of researchers are like me in this respect (admittedly a big if) then sexy, bogus papers in popular generalist journals stand a good chance of bubbling to the top of Mendeley/Zotero/etc. rankings.

Come to think of it, a handful of the papers I've put in my Mendeley database are there because I think they're crap, and I want to keep a record of them! This raises the comical possibility of papers scoring highly on altmetrics because scientists are doubtful of them!

(jmmcd points out that PageRanking users might help, although even that'd rely on strongly weighted researchers being less prone to the behaviours I'm talking about.)

This is an issue with efforts to encourage replication and critique of dubious studies: in addition to wasting a lot of resources replicating false positives, you have to cite the paper you're critiquing, which boosts its standing in mechanical academic merit assessments like those used in much UK science funding.

We would need a scientific equivalent of the "nofollow" attribute in HTML. A special kind of citation meaning: "this is wrong".

15 years ago, the academic search engine Citeseer was designed not just with the goal of finding academic papers, identifying which ones were the same, and counting citations, but also, as its name indicates, with the goal of showing the user the context of the citations, so you could see whether they were positive or negative.


I've occasionally wished for this myself. I look forward to semantic analysis being good enough to apply to academic papers, so computers can estimate the proportion of derogatory references to a paper instead of mechanically counting all references as positive.
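
In the meantime, a crude keyword heuristic over citation contexts of the kind Citeseer extracts could approximate this; the cue list and example sentences below are invented, and a real system would need genuine semantic analysis:

```python
# Crude keyword heuristic (not actual semantic analysis) for estimating the
# share of negative citations, given the sentence surrounding each citation.
NEGATIVE_CUES = ("fail to replicate", "failed to replicate", "could not reproduce",
                 "contrary to", "we dispute", "methodological flaws")

def negative_citation_share(citation_contexts):
    flagged = sum(any(cue in ctx.lower() for cue in NEGATIVE_CUES)
                  for ctx in citation_contexts)
    return flagged / len(citation_contexts) if citation_contexts else 0.0

contexts = [
    "Our results fail to replicate the effect reported by Smith et al. (2010).",
    "Building on the framework of Smith et al. (2010), we extend the model.",
]
print(negative_citation_share(contexts))  # 0.5
```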

I don't necessarily endorse the specific metrics cited. I have further thoughts on how to get around issues of the type that you mention, which I'll discuss in a future post.

Yes, but the next line mentioned PageRank, which is designed to deal with those types of issues. Lots of inward links don't mean much unless the people (or papers, or whatever, depending on the semantics of the graph) linking to you are themselves highly ranked.

Yep, a data-driven process could be great, but if what actually gets through the inertia is the simple version, this is an avenue for backfire.

A fundamental problem seems to be that there is a lower prior for any given hypothesis, driven by the increased number of researchers, use of automation, and incentive to go hypothesis-fishing.

Wouldn't a more direct solution be to simply increase the significance threshold required in the field?

A fundamental problem seems to be that there is a lower prior for any given hypothesis, driven by the increased number of researchers, use of automation, and incentive to go hypothesis-fishing.

That doesn't lower the pre-study prior for hypotheses, it (in combination with reporting bias) reduces the likelihood ratio a reported study gives you for the reported hypothesis.

Wouldn't a more direct solution be to simply increase the significance threshold required in the field?

Increasing the significance threshold would mean that adequately-powered honest studies would be much more expensive, but those willing to use questionable research practices could instead up the ante and use the QRPs more aggressively. That could actually make the published research literature worse.
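
A back-of-the-envelope sketch of that cost, using the standard normal approximation for a two-sample comparison; the effect size and power below are assumed values, chosen only for illustration:

```python
# Approximate per-group sample size for a two-sample z-test:
#   n ~ 2 * ((z_{1-alpha/2} + z_{power}) / d)^2
# with an assumed standardized effect size d = 0.3 and 80% power.
from scipy.stats import norm

def n_per_group(alpha, power=0.80, effect_size=0.3):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * ((z_alpha + z_power) / effect_size) ** 2

for alpha in (0.05, 0.005, 0.0005):
    print(f"alpha={alpha}: ~{n_per_group(alpha):.0f} subjects per group")
# Tightening alpha from 0.05 to 0.0005 more than doubles the required
# sample size for an honest, adequately powered study.
```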

That doesn't lower the pre-study prior for hypotheses, it (in combination with reporting bias) reduces the likelihood ratio a reported study gives you for the reported hypothesis.

Respectfully disagree. The ability to cheaply test hypotheses allows researchers to be less discriminating. They can check a correlation on a whim. Or just check every possible combination of parameters simply because they can. And they do.

That is very different from selecting a hypothesis out of the space of all possible hypotheses because it's an intuitive extension of some mental model. And I think it absolutely reduces the pre-study priors for hypotheses, which impacts the output signal even if no QRPs are used.

I'd take the favor of a handful of highly-qualified specialists in the field (conventional peer-review) over a million 'likes' on facebook any day. And this is coming from someone who agrees the traditional system is deeply flawed.

Something like PLoS' model is more reasonable: Publish based on the quality of the research, not the impact of the findings. Don't impose artificial 'page limits' on the number of papers that can be published per issue. Encourage open access to everyone. Make it mandatory to release all experimental data and software that is needed to reproduce the results of the paper. At the same time, encourage a fair and balanced peer-review process.

I've never published in PLoS, by the way, but I will probably be sending my next papers to them.

I would hope that if there were a public web application, it would take 20-100 different statistics, and allow people to choose which to privilege. Not sure if it's the reader's or the website's responsibility to choose worthwhile stats to focus on, especially if they became standardized and other agencies were able to focus on what they wanted.

For example, I imagine that foundations could have different combination metrics, like "we found academics with over 15 papers with collectively 100 citations, which must average at least a 60% on a publicity scale and have a cost-to-influence index of over 45". These criteria could be highly focused on the needs and specialties of the foundation.
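
A sketch of what such a foundation-specific filter might look like in code; every field name and threshold here is hypothetical (there is no standard "publicity scale" or "cost-to-influence index"):

```python
# Hypothetical foundation-specific screening rule combining several metrics.
def meets_foundation_criteria(researcher):
    papers = researcher["papers"]
    return (len(papers) >= 15
            and sum(p["citations"] for p in papers) >= 100
            and sum(p["publicity"] for p in papers) / len(papers) >= 0.60
            and researcher["cost_to_influence_index"] > 45)

candidate = {
    "papers": [{"citations": 8, "publicity": 0.7}] * 16,
    "cost_to_influence_index": 50,
}
print(meets_foundation_criteria(candidate))  # True: 16 papers, 128 citations
```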

This is a very good idea.

He found that the data was not submitted for almost 60% of papers, and that data for 75% of papers were not in a format suitable for replication.

I recently needed large tables of data from 4 different publications. The data was provided... as PDFs. I had to copy thousands of lines of data out from the PDFs by hand. Journals prefer PDF format because it's device-independent.

It's questionable how much good science can do, though, when we're already so far behind in applying biotech research in the clinic. My cousin died last week just after her 40th birthday, partly from a bacterial infection. The hospital couldn't identify the bacteria because they're only allowed to use FDA-approved diagnostic tests. The approved tests involve taking a sample, culturing it, and using an immunoassay to test for proteins of one particular bacteria. This takes days, costs about $400 per test, tests only for one particular species or strain of bacteria per test, has only a small number of possible tests available, and has a high false negative rate. It was a reasonable approach 25 years ago.

Where I work, we take a sample, amplify it via PCR (choosing the primers is a tricky but solved problem), and sequence everything. We identify everything, hundreds of bacterial species, whether they can be cultured or not, in a single quick test. If you don't have a sequencer, you could use a 96-well plate to test against at least 96 bacterial groups, or a DNA hybridization microarray to test against every known bacterial species, for $200-$500. The FDA has no process for approving these tests other than to go through a separate validation process for every species being tested for, and no way to add the DNA of newly-discovered species to a microarray that's been approved.

"I had to copy thousands of lines of data out from the PDFs by hand. Journals prefer PDF format because it's device-independent"

Google "PDF to excel".

Death through strangulation by red tape.

Isn't there some offshore medical tourism site to send your sample (blood?) for such things? Have they made it illegal for me to send a blood sample abroad yet?

This is my perception of much of medicine in the US. Government, guilds, and corporations working together to increase the cost of medical care 10X and slow medical progress to 1/10X of what it could be.

Do you have any other specific examples of known possible treatments and diagnostics squashed by the regulatory state?

I think a core problem is the way scientists cite papers. There's no real reason why one should always cite the first paper that makes a given claim.

You could change that habit and instead cite the paper that does the first replication.

That's... actually a really good idea.

Somehow incentivising a rule of always citing a replication of an experiment would make a massive difference.

I don't think it's about incentives. It's roughly the same amount of work to cite a replication as to cite the original experiment. It's rather about convincing the scientists that it's a hallmark of good science to cite the replication.

You start by convincing the cool people. Let them signal that they are cool by citing replications.

Afterwards you go to a journal and ask: "Do you support good science? If so, please add a rule to your paper review process that submitters have to cite first replications instead of the original paper."

Or both.

The goal of the endeavour is to move academic resources from publishing out-there research to verifiable research. The best way to do that is to reduce the incentives for out-there research and increase those for verification.

Sometimes it might make sense to cite every study on a given topic. The default should be to cite the first replication or a meta-analysis when available. If I'm a reader of an academic paper I profit from reading the paper that provides the evidence that the claim is actually true.

There's just no good reason to give out prestige to the first person who finds an effect.

In the short term, we might want less research and more verification. But in the long term, we don't want to discourage original research completely.

Also, people will always try to game the system. How will you get points for the first verification of the original idea, when the original ideas become scarce? Split the work with your friend: one of you will write the original article, another will write the verification article; then publish both at the same time. Would that be a big improvement? I would prefer the verification to be done by someone who is not a friend, and especially does not owe a favor for getting such opportunity.

Note: In ice hockey, players get points for goals and assists. Sure, it's not completely the same situation, but it is a way to encourage two behaviors that are both needed for a success. In science we need to encourage both research and verification.

Would that be a big improvement?

Yes, two people independently setting up the experiment and both getting P<0.05 is better than one.

How will you get points for the first verification of the original idea, when the original ideas become scarce?

I don't believe that original ideas will become scarce. Researching original ideas is fun. There will still be grant-giving agencies that give out grants to pursue original ideas.

Yes, two people independently setting up the experiment and both getting P<0.05 is better than one.

How many false positives get published as opposed to negatives (or, rarely, even false negatives)? If the ratio is too high then you'll need more than just two positive studies. If, as claimed in the quoted article, 70% of papers in a field are not reproducible, that implies finding papers at random would require about nine positive studies to reach a true P<0.05, and that's only if each paper is statistically independent of the others.
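
One way to unpack that figure, treating each published positive result as independently spurious with probability 0.7:

$$0.7^{8} \approx 0.058 > 0.05, \qquad 0.7^{9} \approx 0.040 < 0.05,$$

so roughly nine independent positive reports are needed before the probability that all of them are spurious falls below 5%.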

If there is a financial incentive to reproduce existing studies when there is a ready-made template to copy and paste into a grant-funded paper, I think the overall quality of published research could decline. At least in the current model there's a financial incentive to invent novel ideas and then test them, versus just publishing a false reproduction.

Textbooks replace each other on clarity of explanation as well as adherence to modern standards of notation and concepts.

Maybe just cite the version of an experiment that explains it the best? Replications have a natural advantage because you can write them later when more of the details and relationships are worked out.

People cite several if they can. But they do try to cite the first.

It is now possible to use the web to measure the quality and impact of an academic output via alternative metrics (altmetrics) such as

  • How many people downloaded it
  • How much it has been discussed on Twitter
  • How many websites link to it
  • The caliber of the scientists who have recommended it
  • How many people have saved it in a reference manager like Mendeley or Zotero

All of these except for number 4 are just as susceptible, if not more so, to optimizing for interestingness or controversy over soundness of research.

Also see Michael Nielsen's ideas in Reinventing Discovery.

A few issues that come to mind.

  1. Any improvement will be supported by good researchers but opposed by poor ones; the latter outnumber the former.
  2. Changing the means of apportioning stature does not eliminate the incentives to aim for broad appeal.
  3. Goodhart's Law.

Improvements can be supported by poor researchers with tenure.

Any improvement will be supported by good researchers but opposed by poor ones; the latter outnumber the former.

If we play our cards right, we may end up with researchers attempting to signal their ability by supporting the improvements.

  1. Any improvement will be supported by good researchers but opposed by poor ones; the latter outnumber the former.

Only if the poor researchers actually anticipate doing worse under the new system. It's possible that a system could be better for everyone (e.g. if it required less grant-proposal-writing and more science-doing).

I think you are missing the core of the problem: science is just too damn crowded. The intense up-or-out pressure coupled with the scarcity of jobs/resources incentivizes a lot of bad behavior. Perpetual soft-money positions force you to spend large portions of your time securing funding; if you fail to get it, your career is over.

If a postdoc goes down a blind alley (after all, science is uncertain; lots of great ideas don't pan out), it's a career killer unless he can figure out some way to salvage the data (which might mean going on a p-value hunt, or it might mean flat-out data manipulation and fraud), etc. The lack of jobs has everyone trying to work harder than everyone else, and seeking any edge they can get.

Even vaunted peer review has the wrong incentives: it might be a good way to vet work, but there are obvious bad incentives when you use it to apportion scarce funding. A rational reviewer will pan any work competing with his own.

Almost all of the institutions of science grew and matured during the Cold War period of exponential funding growth. In the 70s, the funding slowed down, and the institutions aren't capable of adjusting.

I've put a lot of thought into trying to improve the aggregation of science data in order to quickly determine the legitimacy of any particular paper based on current knowledge. Ultimately, it seems to me, aggregation isn't viable because so much of the data is simply inaccessible (as you mentioned, accessibility is more a function of being new and exciting, whereas 'vanilla' building blocks such as replications may not be publicly accessible at all).

Ultimately, I think, the incentive structure has to change, and that's difficult because when only positive results are rewarded, it has the side effect of incentivizing the reporting of false positives and disincentivizing putting in the work of reporting negatives (which are still useful!). I think the only viable way to change the system is to convince the grant providers to enforce certain guidelines. If they are as convinced of the usefulness of non-positive and replication results as the rest of us, they can enforce reporting all results and increase the rewards for doing replications. Then once all that data is publicly available, you can do wonders with it.

I'd welcome other ideas, as, like I said, I've been putting a lot of thought into it. I'd love to put together a system that lets people easily see the replication status and legitimacy of a paper (or of a scientific statement supported or opposed by various papers, which would be useful to the public at large), and I think I've puzzled out how; we just need to incentivize the people doing the research to release the data that will populate it.

Great post!

Just to point out, to people who are not yet familiar with it: there are initiatives that try to tackle the hazards of the current peer-review process. A good example is PLOS ONE, which uses a very different setup for publishing. Perhaps most interestingly, it publishes all articles that are technically sound, with judgment of whether an article is interesting reserved for post-publication commentary, and it's all open access.

Another big problem, which I guess people here are somewhat familiar with, is pharma-funded clinical trial publication bias, i.e. you can run, for example, 10 smaller studies (same drug) rather than a couple of big ones, weed out the ones with a lesser positive (or even negative) impact, and then pool your 6 studies with the best results. Though hopefully this problem will be partly fixed with the new FDA legislation that requires pharmaceutical companies to do a priori registrations of clinical trials.

This dovetails nicely with some of the other things I've found out about recently, like PLOS Medicine's "Why Most Published Research Findings Are False" and Feynman's "Cargo Cult Science". I am really glad to have gotten this additional insight from you. Upvotes and notes to self to read you again.

Because scientists have to compete for grants, they spend a very large fraction of their time fundraising, sometimes more than 50% of their working hours. Scientists feel [strong] pressure to optimize their activities for getting tenure and grants, rather than for doing good science.

Interesting. I have to wonder: is this partly due to the cuts in grant money we've seen recently, or has it always been this bad?