Our paper "Effect heterogeneity and variable selection for standardizing causal effects to a target population" has just been published in the European Journal of Epidemiology at https://link.springer.com/article/10.1007/s10654-019-00571-w . While the journal's version of record is behind a paywall, a preprint is available on arXiv at https://arxiv.org/pdf/1610.00068.pdf.
This paper argues for my very deeply held belief that we can make significant advances in quantitative reasoning for medical decision making by thinking more closely about effect heterogeneity and how this relates to the choice of effect scale.
Over the course of the last 7 years, external validity and generalizability have become increasingly hot topics in statistical methodology and computer science. In particular, a lot of progress has been made by Judea Pearl and Elias Bareinboim, who introduced a framework based on causal diagrams that can be used to reason about how to take causal information from one setting (for example: a randomized trial) and apply it in a different setting (for example: a clinically relevant target population).
The key questions of interest are: How do we know whether such extrapolation is even possible? How do we determine what information we need from the study, and what information we need from the target population, in order to extrapolate the findings? How do we put this information together in order to obtain a valid prediction for what happens if the intervention is implemented in the target population?
Pearl and Bareinboim's framework for answering these questions is, of course, mathematically valid. However, in my opinion, their approach also throws the baby out with the bathwater. In particular, we argue that instead of attempting to extrapolate the magnitude of the effect (i.e. a measure of the "size" of the difference between what happens if the drug is taken, and what happens if the drug is not taken), they attempt to look at the people who were assigned to receive the intervention in the study, and extrapolate their distribution of outcomes to the target population without any reference to how that distribution differs from what happened to the people who were randomized to the control condition.
Theoretically, this approach will work if the extrapolation procedure can account for every cause of the outcome whose distribution differs between the people who were in the study, and the people you are trying to make predictions about. However, since the set of causes of the outcome is very large, it is very unlikely that it is possible to measure all of them. Moreover, it is very likely that we do not even know what the causes of the outcome are. Our inferences then become subject to potential uncertainty which arises from the auxiliary assumption that we know what to control for.
Consider a situation where scientists have conducted a randomized controlled trial in men, on the effects of homeopathy on heart disease. The scientists find that homeopathy has no effects in men, and wonder whether this finding can be extrapolated to women. If the scientists attempt to answer this question using Bareinboim and Pearl's framework, they will be forced to conclude that no extrapolation can be made, unless they are willing to claim that they know all the causes of heart disease that differ between men and women, and have been able to measure every one of these causes in all the patients in the study.
In contrast, we suggest that scientists who want to extrapolate their findings and make predictions outside of the study should attempt to quantify the size of the effect - that is, by how much the outcomes in the people who were randomized to receive the intervention differ from the outcomes in the people who were randomized to the control condition. This effect size could then potentially be used as the basis for extrapolation. Such an approach would correspond much more closely to how external validity and extrapolation have traditionally been understood in the medical literature.
In the real world of clinical medicine, doctors are usually given information about the effects of a drug on the risk ratio scale (the probability of the outcome if treated, divided by the probability of the outcome if untreated). With information on the risk ratio, a doctor may make a prediction for what will happen to the patient if treated, by multiplying the risk ratio by the patient's risk if untreated (which is predicted informally based on observable markers for the patient's condition).
The problem with this approach is that there are multiple scales on which to quantify the magnitude of the effect. Other possible scales for measuring effects include:
- The odds ratio, which compares the odds p/(1-p) instead of the risk p
- The survival ratio, which uses the probability of survival (1-p) instead of the probability of death (p)
- The risk difference (which uses an additive scale instead of a multiplicative one)
Unless the intervention has no effect, the empirical predictions will not be invariant to the choice of scale. This is, of course, a serious problem for principled clinical decision-making, but as we will show, it is not necessarily an impossible one.
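To make the scale dependence concrete, the four scales can be written out side by side. The notation here is my own shorthand rather than anything taken from the paper: $p_0$ and $p_1$ are the risks in the study without and with the intervention, $q_0$ is the untreated risk in the target population, and $q_1$ is the predicted risk in the target population under the intervention.

```latex
\begin{align*}
\text{Risk ratio:} \quad & RR = \frac{p_1}{p_0}, & q_1 &= RR \cdot q_0 \\
\text{Risk difference:} \quad & RD = p_1 - p_0, & q_1 &= q_0 + RD \\
\text{Survival ratio:} \quad & SR = \frac{1 - p_1}{1 - p_0}, & q_1 &= 1 - SR \cdot (1 - q_0) \\
\text{Odds ratio:} \quad & OR = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)}, & \frac{q_1}{1 - q_1} &= OR \cdot \frac{q_0}{1 - q_0}
\end{align*}
```

If $p_1 = p_0$, every line gives $q_1 = q_0$. As soon as $p_1 \neq p_0$ and $q_0 \neq p_0$, the four predictions generally disagree.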
Despite the scale dependence of the reasoning procedure, the risk ratio is in many cases the only summary of the effect size that is made available to clinicians, whether they get their information from journals, clinical guidelines or online resources for clinical information. Given that the reasoning procedure is not scale-invariant, the universal reliance on the risk ratio may plausibly lead to suboptimal medical decision making in a wide range of clinical scenarios. But, in contrast to the implications of the Bareinboim/Pearl framework, we argue that this does not necessarily mean that we should throw out reliance on parametric effect measures altogether.
Our suggestion for how to choose the scale has been discussed earlier on Less Wrong (see https://www.lesswrong.com/posts/K3d93AfFE5owfpkx4/counterfactual-outcome-state-transition-parameters ). I am not going to repeat the argument in full here, but I will ask you to consider the following highly stylized thought experiment, which illustrates the underlying intuition:
Consider a randomized controlled trial where the intervention is that everyone is randomized to play Russian roulette once a year. This trial is conducted in Russia. It is found that among those who did not play Russian roulette, 1% of people died over the course of the year. Among the people who played Russian roulette, 17.5% of people died. We want to extrapolate these findings to Norway, where nobody ordinarily plays Russian roulette and it is known that 0.5% of people die during any year. Our goal is to find out what happens in Norway if everyone took up playing Russian roulette once a year.
Bareinboim and Pearl would suggest taking the risk of death among those who played Russian roulette (17.5%), controlling for all causes of death that differ between Russia and Norway, and producing an estimate for what happens in Norway if everyone plays Russian roulette. However, due to considerable differences between Russia and Norway in terms of predictors of mortality, this is clearly not feasible in this situation.
If we instead attempt to quantify the effect size in Russia, this can be done on any of the previously discussed scales:
- The risk ratio is $17.5\% / 1\% = 17.5$
- The risk difference is $17.5\% - 1\% = 16.5\%$
- The survival ratio is $82.5\% / 99\% = 5/6$
- The odds ratio is $\frac{17.5 / 82.5}{1 / 99} = 21$
Each of these scales will result in a different prediction for what will happen if people in Norway play Russian roulette:
- If we use the risk ratio, we will predict that $17.5 \times 0.5\% = 8.75\%$ will die.
- If we use the risk difference, we will predict that $0.5\% + 16.5\% = 17\%$ will die.
- If we use the survival ratio, we will predict that $99.5\% \times 5/6 \approx 82.9\%$ will survive, meaning that 17.1% will die
- If we use the odds ratio, we will predict odds of death of $21 \times \frac{0.5}{99.5} \approx 0.1055$, meaning that $0.1055 / 1.1055 \approx 9.5\%$ will die.
These predictions differ massively not only in their implications for decision-making but also in their plausibility: Given what we know about Russian roulette, we would expect to see results much closer to 17% than to 9%. So clearly, some of these scales are doing something "right" and other scales are doing something "wrong".
We argue that the key to understanding the implications of this scale-dependence is that only the survival ratio ($5/6$) has a structural meaning: it represents the proportion of empty chambers in the revolver, and therefore produces appropriate, valid predictions. In contrast, the risk ratio ($17.5$) has no possible structural meaning and therefore produces nonsense results.
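The arithmetic behind the thought experiment can be checked with a short Python sketch. It assumes risks of 1% (no Russian roulette) and 17.5% (Russian roulette) in Russia, the latter being the figure that corresponds to a survival ratio of exactly 5/6, and a baseline risk of 0.5% in Norway. The function and variable names are hypothetical and are not taken from the paper.

```python
def odds(risk):
    """Convert a probability into odds."""
    return risk / (1.0 - risk)


def effect_measures(risk_untreated, risk_treated):
    """Summarize the effect observed in one study on four scales."""
    return {
        "risk ratio": risk_treated / risk_untreated,
        "risk difference": risk_treated - risk_untreated,
        "survival ratio": (1.0 - risk_treated) / (1.0 - risk_untreated),
        "odds ratio": odds(risk_treated) / odds(risk_untreated),
    }


def predicted_risks(measures, baseline_risk):
    """Apply each study-level measure to a target population's baseline risk."""
    target_odds = measures["odds ratio"] * odds(baseline_risk)
    return {
        "risk ratio": measures["risk ratio"] * baseline_risk,
        "risk difference": baseline_risk + measures["risk difference"],
        "survival ratio": 1.0 - measures["survival ratio"] * (1.0 - baseline_risk),
        "odds ratio": target_odds / (1.0 + target_odds),
    }


# Assumed inputs: Russian trial risks and the Norwegian baseline risk.
russia = effect_measures(risk_untreated=0.01, risk_treated=0.175)
norway = predicted_risks(russia, baseline_risk=0.005)

for scale, risk in norway.items():
    print(f"{scale}: effect in Russia = {russia[scale]:.3f}, "
          f"predicted risk in Norway = {risk:.2%}")
```

Only the survival ratio line reproduces what we would expect from a revolver with five empty chambers out of six; the other three scales carry the Russian baseline mortality into the Norwegian prediction in different ways.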
Any attempt at extrapolation would, of course, have to account for all factors that determine the magnitude of the effect. For example, if Russians are more likely to be drunk when they play Russian roulette, they may be more likely to miss than Norwegians. This may lead to local deviations from effect sizes of $5/6$, which will have implications for extrapolation. But once you have controlled for all of the factors that determine the magnitude of the effect on a scale that has structural meaning, extrapolation may be valid.
Crucially, we argue that controlling for all determinants of effect size (alcohol? how many chambers are there in typical revolvers in each country?) is much more tractable than controlling for all causes of mortality differences between the countries.
The main idea behind my research agenda is to explore how far we can push this argument in more clinically relevant settings. Next, consider a doctor who is trying to determine the pros and cons of treating a patient with a new drug. Suppose a reliable study on the drug shows that among those who received a placebo, 1% got an allergic reaction over the following 12 months; whereas, among those who received the drug, 2% got an allergic reaction.
The scientists behind the study can either tell the doctor that the risk ratio is $2\% / 1\% = 2$, or that the survival ratio is $98\% / 99\% \approx 0.99$. Both statements are correct, but only the latter has a potential structural interpretation, since it plausibly corresponds to a state of nature where 99% of the population do not have the factors (genes?) that predispose a person to have an allergic reaction if exposed to the drug.
Now consider that this patient also has a severe peanut allergy (which is unrelated to the medical issues that the doctor is treating them for) and lives in an environment where everyone eats peanuts all the time. This patient, therefore, has a 10% baseline risk of getting an allergic reaction over the course of 12 months, even in the absence of treatment with the new drug.
It would be insanity for the doctor to expect that the risk ratio from the study generalizes, and that the patient will have a 20% risk of anaphylaxis if given the new drug. In contrast, it may be meaningful to predict that their risk under treatment is given by $1 - (90\% \times 98\% / 99\%) \approx 10.9\%$. This will correspond closely to what one might expect would happen if the patient belongs to a population that has the same distributions of factors that predispose to the specific drug-related allergic reaction, as the population that was studied in the trial.
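As a rough illustration of how the two summaries behave for this patient, the sketch below applies the trial's risk ratio and survival ratio to two baseline risks: 1% (a patient who resembles the trial's placebo arm) and 10% (the peanut-allergic patient). The function names and patient labels are hypothetical.

```python
def risk_from_risk_ratio(baseline_risk, risk_ratio):
    """Predicted risk under treatment if the risk ratio is assumed to carry over."""
    return risk_ratio * baseline_risk


def risk_from_survival_ratio(baseline_risk, survival_ratio):
    """Predicted risk under treatment if the survival ratio is assumed to carry over."""
    return 1.0 - survival_ratio * (1.0 - baseline_risk)


# Trial results from the example: 1% reactions on placebo, 2% on the drug.
trial_risk_placebo = 0.01
trial_risk_drug = 0.02
risk_ratio = trial_risk_drug / trial_risk_placebo
survival_ratio = (1.0 - trial_risk_drug) / (1.0 - trial_risk_placebo)

patients = [("trial-like patient", 0.01), ("peanut-allergic patient", 0.10)]
for label, baseline in patients:
    via_rr = risk_from_risk_ratio(baseline, risk_ratio)
    via_sr = risk_from_survival_ratio(baseline, survival_ratio)
    print(f"{label}: baseline {baseline:.1%}, "
          f"risk ratio predicts {via_rr:.1%}, "
          f"survival ratio predicts {via_sr:.1%}")
```

The two summaries agree for a patient whose baseline risk matches the trial and only come apart when the baseline risk differs, which is exactly the situation in which the choice of scale matters.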
For these reasons, I consider it crucial for medical scientists to become aware of the need to put significant effort into reasoning about whether an effect measure has plausible structural meaning in the context of their current research question, before deciding to use it as a summary of their findings which is suitable for use in clinical decision making.
If anyone can spot any flaws in our argument, such feedback would be invaluable information. I invoke Crocker's Rules for all responses to the paper and the post. I would very much appreciate it if this blog post and the paper could be forwarded to anyone who is in a position to evaluate its importance.
Finally, let me note that this paper is the first peer-reviewed academic publication to acknowledge support from the EA Hotel Blackpool in its funding section. The EA Hotel is a project worth supporting; see https://forum.effectivealtruism.org/posts/uyvc6p99vsWFMPZiz/ea-hotel-fundraiser-5-out-of-runway
I am not using Bayesian inference, and neither are Pearl and Bareinboim. Their graphical framework ("selection diagrams") is very explicitly set up as a model for reasoning about whether the causal effect in the target population is identified in terms of observed data from the study population and observed data from the target population. Such identification may succeed or fail depending on latent variables and depending on the causal structure of the selection diagram.
I am confident that Pearl and Bareinboim would not disagree with me about the preceding paragraph. The point of disagreement is whether there are realistic ways to substantially reduce the set of variables that must be measured, by using background knowledge about the causal structure that cannot be represented on selection diagrams.
In my model of reality (and I am sure, in most other people's model of reality), the third node has a wide range of unobserved latent ancestors. If the goal is to make inferences about the effect of Russian roulette in Russia using data from Russia, your analytic objective will be to find a set of nodes that d-separate the first node from the fourth node. You do not need to condition on the latent causes of the third node to achieve this (because those latent variables are not also causes of the first node; they cannot be, because the first node was randomized). The identification formula for the effect in Russia is therefore invariant to whether the latent causes of the third node are represented on the graph or not, and you therefore do not have to show them. The DAG model then represents a huge equivalence class of causal models; you can be agnostic between causal models within this equivalence class because the inferences are invariant between them.
But if the goal is to make predictions about the effect in Norway using data from Russia, these latent variables suddenly become relevant. The goal is no longer to d-separate the fourth node from the first node, but to d-separate the fourth node from an indicator for whether a person lives in Russia or Norway. In the true data generating mechanism (i.e. in the reality that the model is trying to represent), there almost certainly are a substantial number of open paths between the indicator for whether a person lives in Norway or Russia and their risk of death. The only possible identification formula for the effect in Russia includes terms for distributions that are conditional on the latent variables. The effect in Norway is therefore not identified from the Russian data.
I agree that reality is generated by a structure that looks something like a directed acyclic graph. But that does not mean that all significant aspects of reality can be modeled using Pearl's specific operationalization of causal DAGs/selection diagrams.
Any attempt to extrapolate from Russia to Norway is going to depend on a background belief that some aspect of the data generating structure is equal between the countries. In the case of Russian roulette, I argue that the natural choice of mathematical object to hang our claims to structural equality on, is the parameter that takes the value 5/6 in both countries.
In DAG terms, you can think of the data generating mechanism for node 4 as responding to a property of the path 1->2->4. In particular, this path forces the quantities Pr(Fourth node =0 | do(First node=1)) and Pr(Fourth node =0 | do(First node=0)) to be related by a factor of 5/6 in both countries. Reality still has a DAG structure, but you won't find a way to encode the figure 5/6 in a causal model based only on selection diagrams. Without a way to encode a parameter that takes the value 5/6, you have to take a long detour where you collect a truckload of data and measure all the latent variables.
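Written out, the structural claim might look as follows. The labels are my own shorthand and do not appear on the diagram: $X$ stands for the first node (playing Russian roulette), $Y$ for the fourth node (death), and $S$ for the country.

```latex
\Pr\bigl(Y = 0 \mid \mathrm{do}(X = 1), S = s\bigr)
  = \tfrac{5}{6} \cdot \Pr\bigl(Y = 0 \mid \mathrm{do}(X = 0), S = s\bigr),
  \qquad s \in \{\text{Russia}, \text{Norway}\}
```

With a baseline survival of 99.5% in Norway, this gives $\tfrac{5}{6} \times 0.995 \approx 0.829$, without requiring any measurement of the latent causes of mortality that differ between the two countries.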