(Epistemic Status: I have never published an academic paper or been involved in grantmaking for academic work, so my perspective on current practices is limited. Still, I think the basic idea I am proposing is clear and straightforward enough to overcome that limitation.)

I spend a decent amount of time reading/listening to articles/podcasts from places like Astral Codex Ten, Ben Recht’s arg min, and The Studies Show, all of which explore problems with scientific research. In some cases, the issues discussed involve sexy topics like data fraud or p-hacking, but, much more often, papers fall into the category described in this great post, where the problem is less misconduct (though there is a lot of that) and more that the research is essentially worthless.

How can we recognize bad science?

When assessing the value of a scientific paper or study, there are often many different critiques you can levy. You can question the study's power, the generalizability of the results, whether there were unobserved confounders, the randomization methods, the theoretical structure, the measurement method, the mathematical representation of variables, the real-world interpretation of variables, the variability explained, and on and on. These critiques are vital to improving science, but, unfortunately, the work of making them is difficult, time-consuming, often unrewarded, and, as a result, severely out-scaled by the volume of bad papers.

From an epistemic perspective, this approach also creates a horrible situation where distinguishing useful and important work from trivialities is extremely demanding on anyone outside of a specific field or subfield. As a result, naive readers (and the media) end up believing the claims of bad papers, and skeptical readers end up disbelieving even potentially good ones.

Luckily, although comprehensive or systematic criticisms of papers are difficult, scientists (and others) have access to a high degree of tacit knowledge about their fields, the methods that work, and the plausibility of results, all of which they can and do leverage when evaluating new studies. In more Bayesian terms, they often have strong priors about whether papers are making meaningful contributions, which we could hopefully elicit directly, without needing specific, formal critiques.

The value of information

Scientific papers/studies are a form of societal information gathering, and their value comes from the same place as the value of any information: the ability to make better choices between options. 

We can then codify and measure the (Expected) Value of Information (VOI) for a paper/study with this standard formula from decision theory:

VOI = Expected Value of Actions Given Information - Expected Value of Actions Without Information

Looking at this formula, we can see a clear pragmatic definition of the worth of a scientific paper. If nobody will change how they act or react, regardless of the paper’s specific results, then it has no value. If people will change their actions in response to the paper’s specific results, then the value of the paper is precisely equal to the (expected) improvement in those actions.
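To make the formula a bit more concrete, here is a minimal sketch in Python of the value of (perfect) information for a generic decision. The function names, and the toy prior and payoff table at the end, are my own illustrative choices rather than anything from a particular paper or study.

```python
# Minimal sketch: value of (perfect) information for a generic decision.
# 'prior' maps each possible state of the world to its probability, and
# 'payoff[action][state]' gives the value of taking an action in that state.
# All names and numbers here are illustrative.

def ev_without_information(prior, payoff):
    # Without information, we commit to the single action that is best in
    # expectation under the prior.
    return max(
        sum(prior[state] * values[state] for state in prior)
        for values in payoff.values()
    )

def ev_with_information(prior, payoff):
    # With perfect information, we learn the state first, pick the best
    # action for that state, and average over the prior.
    return sum(
        prior[state] * max(values[state] for values in payoff.values())
        for state in prior
    )

def value_of_information(prior, payoff):
    return ev_with_information(prior, payoff) - ev_without_information(prior, payoff)

# Toy example: a coin-flip state with one risky action and one safe action.
prior = {"good": 0.5, "bad": 0.5}
payoff = {"act": {"good": 1.0, "bad": -1.0}, "wait": {"good": 0.0, "bad": 0.0}}
print(value_of_information(prior, payoff))  # 0.5
```

The example below is exactly this calculation, just with a specific prior and payoff structure.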

Let’s run through an example. Suppose Alice is a regulator at a simplified FDA, deciding whether to approve a new drug called Canwalkinol that is designed to cure 100 people of a disease that makes them unable to walk. Currently, Alice thinks there is a 30% probability that Canwalkinol is deadly (i.e. too dangerous to approve) and a 70% chance that it is not dangerous. Alice’s current plan is to not approve the drug, an action with an expected value of ‘100 lives without the ability to walk.’ If a study comes along that can perfectly demonstrate whether Canwalkinol is dangerous, then Alice will be able to make a perfectly informed decision. From Alice’s perspective, that study would have a 70% chance of showing that Canwalkinol is safe, allowing her to approve the drug and 100 people to be able to walk, and a 30% chance of showing Canwalkinol is deadly, in which case she does not approve the drug. We can calculate the value of this study as follows:

Value of Study = Expected Value of Action Given Study - Expected Value of Action Without Study

If we let Vwalk be the value of ‘one life with the ability to walk’ and Vnot be the value of ‘one life without the ability to walk’, then we can derive the following:

Expected Value of Action Given Study = 70% * 100 Vwalk (if the drug is safe) + 30% * 100 Vnot (if the drug is not safe) = 70 Vwalk + 30 Vnot 

Expected Value of Action Without Study = 100 Vnot

Value of Study = Expected Value of Action Given Study - Expected Value of Action Without Study = (70 Vwalk + 30 Vnot) - 100 Vnot = 70 Vwalk - 70 Vnot

= value of curing 70 people and giving them the ability to walk

So we can see that, to Alice, the value of this particular paper/study is equal to curing 70 people of the disease.
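For completeness, here is the same arithmetic written out as a short Python snippet. It simply mirrors the derivation above, with v_walk and v_not as placeholder utilities; the 70.0 at the end corresponds to 'curing 70 people' when we (arbitrarily) set v_walk = 1 and v_not = 0.

```python
# Reproducing the Canwalkinol calculation from above. v_walk and v_not are
# placeholder utilities for one life with / without the ability to walk.
p_safe, p_deadly, n_patients = 0.7, 0.3, 100

def value_of_study(v_walk, v_not):
    # Without the study, Alice does not approve: everyone stays unable to walk.
    ev_without_study = n_patients * v_not
    # With a perfect study, Alice approves only if the drug is shown to be safe.
    ev_with_study = p_safe * n_patients * v_walk + p_deadly * n_patients * v_not
    return ev_with_study - ev_without_study

# With v_walk = 1 and v_not = 0, the study is worth 70 "cures" in expectation.
print(value_of_study(v_walk=1.0, v_not=0.0))  # 70.0
```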

How would this actually work?

There are three components required to measure the value of a paper/study using the VOI method: (1) there needs to be a clear estimate of the prior probability for the outcomes of the paper, (2) there needs to be a decision that is plausibly affected by the results of the paper, and (3) outcomes from that decision need to be comparable through some metric. Each of these presents some level of challenge to applying this method in practice, but I think that all of the difficulties can, and should, be overcome.

(1) Having good priors for paper results

In order to measure the expected value of a paper, we will need to have some estimate of the probabilities associated with each of the paper’s possible outcomes. Since these estimates are just probabilities over the paper's possible outcomes, they can be generated in any number of ways, including prediction markets, forecasting tournaments, surveying forecasters, surveying experts in the field, etc. There is nothing particularly novel about producing assessments for this project relative to any other, but, as always, it is necessary to properly incentivize accuracy and induce participation.
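As a hypothetical illustration of the survey route, here is a small sketch of pooling several elicited probabilities into a single prior, using two standard aggregation rules (a simple average and the geometric mean of odds). The numbers are made up.

```python
# Hypothetical sketch: pooling elicited forecasts into a single prior.
# Each entry is one forecaster's probability for a given study outcome;
# the numbers are invented for illustration.
import math

elicited = [0.6, 0.75, 0.7, 0.8]

# Linear pooling: a simple average of the probabilities.
linear_pool = sum(elicited) / len(elicited)

# Geometric mean of odds: average in log-odds space, then convert back.
mean_log_odds = sum(math.log(p / (1 - p)) for p in elicited) / len(elicited)
odds_pool = 1 / (1 + math.exp(-mean_log_odds))

print(f"linear pool: {linear_pool:.2f}, geometric mean of odds: {odds_pool:.2f}")
```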

(2) Finding relevant decisions

I think this issue is the most difficult to address, but the challenges associated with it are, to a substantial degree, a reflection of the problems with scientific papers themselves. As I said above, if there are truly no decisions that will depend on the outcome of a paper, then the paper does not have value. An inability to find relevant decisions for papers is often a strength of this method rather than a weakness, since it allows for clearly distinguishing between valuable and inconsequential contributions to science.

Still, what decisions might be acceptable? I think it is sensible to be agnostic on this question, for the basic reason that value calculations should be able to speak for themselves. It doesn’t really matter what a decision itself is, since we should be able to just compare the improvement in final outcomes (such as lives saved or study methodologies updated) from changing the decision.

(3) Making decision outcomes comparable

One benefit of an idealized version of the VOI framework would be the direct comparability of the value of different papers and the ability to prioritize both between papers/studies and between science and other uses of resources. Unfortunately, our utility detectors are stuck in the mail, so VOI estimates will need to rely on a diverse set of metrics based on the decisions affected. Still, I think this should be fine. People are pretty good at comparing the value of different outcomes, especially in cases where useful differences are likely to be large.

Final Thoughts

I wrote this post because I think that the VOI framework is the correct way to think about the value of scientific work theoretically, but that it is not measured or explained as explicitly as it can and should be. Many people complain about the quality of studies or have intuitions that some fields are unrigorous nonsense, but often these criticisms can seem ad hoc, specific to a given paper/study or methodology, or just difficult to evaluate. Explicitly using measures of VOI can put many of these assessments in a common language and make their contributions more interpretable to people outside the given field.
