This is part of a series of articles I'm writing to try to present a clear framework around where medicine is lacking and what easy wins might be found -- this one is "basic but necessary" in that I try to explain a concept that's rather familiar to most people here (controlled trials and their limitations). So you may want to skip it unless you're very unfamiliar with medical science.

Science started by observing highly regular events (patterns). Things like the movement of the celestial bodies, the tracks of arrows, the angles at which stones hold together in arches.

These things vary with some parameters: rock composition changes the science of architecting cathedrals in Andalusia compared to the Loire Valley, and location on Earth adds noise to the details of astronomy. But they vary very little.

Fundamental science is observing the pattern in “simplistic” events, ones where most parameters are essentially universal constants. When faced with most of the real world we aggregate many observations, call the similarities between them “pattern” and the weird occurrences noise and errors, accounting for the latter with confidence intervals, ranges, and probabilities.

Biology is one such complex real-world system, and calling anything "causal" in biology is wrong, a naive simplification. It's complex correlations and interactions all the way down. It may literally be computationally impossible to boil biology down to the almost-deterministic realm of physics (simulations).

Aggregate Treatment

This leads us to study things like drugs and interventions by aggregating the effect across many people. Give acetaminophen to a man and it might magically cure his headaches; give it to another and it might slightly worsen them. Give it to 10,000 and ask them to rate their pain improvement and you can know something like:

Acetaminophen usually helps with headaches

Go into the details and you can say something like:

Acetaminophen lessens headaches, mildly to perfectly, in 80% of individuals (1.4x compared to placebo); it slightly worsens them or has no effect in the other 20% (0.46x compared to placebo).

Or even:

Acetaminophen lessens headaches, mildly to perfectly, in 80% of individuals; it slightly worsens them or has no effect in the other 20%. But for bearers of rs12345678(A;G) it worsens headaches in 60% of cases and does nothing in the other 40%.

The problem is that the more you dig down, the more you lose the power of large numbers. Ten observations are worse than 10,000 because one “weird” observation is no longer “noise”; it greatly affects your dataset.

Maybe acetaminophen with high doses of asparagus is really bad (or good), and maybe one of those 10 people was a fan of asparagus and ate it with every meal. Suddenly an “irrelevant” detail you can’t know about is driving most of your conclusions.

What if our rs12345678(A;G) cohort is only 10 people? Should we advise them against taking acetaminophen, or call this a statistical error?

There are several assumptions that can help us answer this question with "statistics", but they end up being either silly (making incorrect assumptions about the world) or yielding large margins of error. Usually both.
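To make "large margins of error" concrete, here is a minimal sketch in Python (with made-up numbers and a plain normal-approximation confidence interval, not anything from a real dataset) of how uncertain a response rate is when it's estimated from 10 people versus 10,000:

```python
# Rough illustration: how wide is the uncertainty around an observed
# response rate of 60% when it comes from 10 people versus 10,000?
import math

def normal_approx_ci(successes, n, z=1.96):
    """95% confidence interval for a proportion (normal approximation)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

print(normal_approx_ci(6, 10))        # roughly (0.30, 0.90): almost uninformative
print(normal_approx_ci(6000, 10000))  # roughly (0.59, 0.61): a tight estimate
```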

In reality, we are left with the tedious problem of only being able to gain certainty in this observation by giving more rs12345678(A;G) bearers acetaminophen, potentially causing them a lot of harm.

Placebo, Bias, And Regression To The Mean

This is made even harder by the placebo effect.

You can give someone opium and tell them it's a wonderful pain reliever, and they are more likely to feel their pain go away.

Give it to them and tell them it helps healing but might make the pain worse, and they are much less likely to benefit from the "normal" amount of relief opium provides.


Even worse is the fact that conditions usually go away on their own with time, leaving us unable to tell the difference between the effect of the treatment and that of the body's natural healing process.

Someone goes to a doctor when an infection gets really bad, meaning their immune system is in full force trying to fight it. The doctor prescribes antibiotics and cleans the site of the infection. If the patient is cured in 5 days who is to know if it was the doctor or the immune system that cured the infection, or to what degree either helped?

People getting better is an oddly annoying confounder to figuring out how to make people better.
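A quick simulation (a sketch with made-up numbers, not real patient data) shows how strong this regression-to-the-mean effect can be: if people only enter the picture when their symptoms are unusually bad, their follow-up scores improve even when nothing at all is done.

```python
# Regression to the mean, illustrated: symptom severity fluctuates around a
# personal baseline, and people seek care when it happens to spike.
import numpy as np

rng = np.random.default_rng(0)
n_people = 100_000
baseline = rng.normal(50, 10, n_people)          # each person's typical severity
today = baseline + rng.normal(0, 15, n_people)   # severity on the day they might seek care
later = baseline + rng.normal(0, 15, n_people)   # severity a few weeks later, untreated

seeks_care = today > 70                          # only the worst days trigger a doctor visit
print("severity at visit:    ", today[seeks_care].mean())   # roughly 79
print("severity weeks later: ", later[seeks_care].mean())   # roughly 59, "improved" with no treatment
```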


Finally, there's bias.

If students are given lab animals and told they are of a strain smarter than average, the animals will perform better on cognitive evaluations. If the students are given what they are told is a “dumb” strain, the animals will perform worse. Even if the two strains are exactly the same.

Why this happens is an accumulation of many factors: maybe students (or researchers, or doctors) don't want to look incompetent or silly, so they misreport results, or ignore a few data points that are "outliers" or whatever to get the expected result.

Maybe their preconception about the futility or utility of the intervention makes them more or less likely to experiment, monitor, console, and consort with the subject.

More importantly, if the researchers aren't clear about what they are "researching" they can keep looking for a needle in the haystack of data indefinitely. And since we are doing probabilistic science, at some point they will find it.
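The arithmetic behind "at some point they will find it" is simple. Here is a sketch (assuming independent analyses and a p < 0.05 threshold, which is optimistic but illustrative):

```python
# If each subgroup analysis has a 5% chance of a false positive, the chance of
# finding at least one "significant" result grows quickly with the number of looks.
for n_looks in (1, 5, 20, 100):
    p_at_least_one = 1 - 0.95 ** n_looks
    print(f"{n_looks:>3} analyses -> {p_at_least_one:.0%} chance of a spurious finding")
# 1 -> 5%, 5 -> 23%, 20 -> 64%, 100 -> 99%
```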

For further reading on this see this summary, or this review/summary, or this one.


This evil trio has a single solution.

The Blinded, Randomized, Placebo-Controlled Trial

A blinded randomized placebo-controlled trial (BRPCT) is simple:

You get two or more groups of subjects (patients) and researchers (doctors) and give them the same instructions, something like: "give the pill to the patient for 4 weeks and see how their symptoms change".

The groups are picked at random from a population picked as randomly as possible (this is the randomized bit). Within limits, you obviously want to only pick women suffering from Alzheimer's for a drug meant to treat Alzheimer's in women.

One group, the "control" group, is assigned placebo pills (pills containing no active ingredient); the other is assigned the real drug. That’s “placebo-controlled”.

The “blind” part varies. You can “blind” up to five parties: the patients, the researchers administering the treatment, the researchers analyzing the data, the researchers reviewing the study, and third parties supervising the study. The usage of these terms is a bit fuzzy; most studies are “double-blind”, meaning that both the patients and the researchers administering the treatment are blinded. But the definition of “double-blind” can vary.

It is considered the gold standard in medicine for any intervention: if the FDA or EMA approved it, then "it" almost certainly went through a BRPCT. If Cochrane published a review saying everyone should do it, then it also almost certainly went through a BRPCT.

This is not always possible. There are clever designs that allow ethical and realistic experiments that keep most of the power of a BRPCT without quite being one.

We skip BRPCTs for very old things with loads of studies that obviously work (e.g. using gauze to stop severe bleeding), but any new and widely-used intervention that allows for one will usually have it done.

This leads to 4 issues.

Cost

BRPCTs are expensive, costing tens, hundreds, or even thousands of millions of dollars. This means every year only a few hundred drugs can even be considered for approval by the FDA, because the cost of testing them, even if development were cheap, is prohibitive.

This makes drugs where economies of scale work the only viable avenue for most companies. If a drug can't be used by at least 0.1% of the US or EU population, it's not worth spending a billion testing it.

A lot of neglected diseases might be cured quite easily, but the prohibitive cost of a BRPCT, and maybe even the inability to collect a large enough sample for one, prevents this.

Limited Effect

The next issue with BRPCTs is that effects are almost never very big. For the aforementioned reasons we are trying to keep our target broad and our sample varied. This means magic bullets are less likely to be found than barely-working broad-spectrum unguents.

If there's a monoclonal antibody that treats almost all atherosclerosis in people with the rs12345678(G;G) SNP, but this mutation is only found in 0.01% of the population, you can't trial that. If, on the other hand, there's a statin that reduces 5-year mortality from atherosclerosis in <everyone> by 0.1 days, that will get many BRPCTs and get approval (thus making a shit ton of money). The reason small effects are bad is compounded by the fact they don't compound (pun not intended), but that's a topic for another time.
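A back-of-the-envelope power calculation makes both halves of this concrete (a sketch with illustrative effect sizes and the conventional alpha = 0.05, 80% power assumptions, not figures from any real trial): tiny effects just need enormous but feasible trials, while rare-subgroup effects require screening absurd numbers of people to even fill a small one.

```python
# Approximate per-arm sample size for a two-arm trial detecting a standardized
# effect size d at alpha = 0.05 (two-sided) with 80% power.
from scipy.stats import norm

def n_per_arm(d, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

print(round(n_per_arm(0.5)))    # ~63: a strong effect needs only a small trial
print(round(n_per_arm(0.02)))   # ~39,000: a tiny effect needs a huge (but doable) trial

# If only 0.01% of people carry the relevant SNP, filling even the small trial
# means screening on the order of 63 / 0.0001 = 630,000 people per arm.
print(round(63 / 0.0001))
```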

Inability To Drill Down

The other problem with BRPCTs is that you can only drill down if you drop most of the advantages.

You can have a BRPCT in 100,000 randomly selected Dutch people for a vaccine. You get something like "reduces symptomatic infection from 90% to 40% with only a 1.001x increase in 3-month mortality". We can be pretty sure this is true because the bias, regression to the mean, and the placebo effect are all accounted for. Plus our sample is large and "random".

But if you want to drill down and say "this reduces symptomatic infection from 45% to 5% with a 1.5x increase in 1-month mortality risk in people aged 18 to 19", you've suddenly lost all of those benefits. And you will be unable to guide treatment decisions with this observation.

Inability To Reason About Results

Finally, a BRPCT doesn't allow us to use reason to make inferences from it. Or rather, to the extent that it does, it doesn't keep all of its qualities.

We can for example observe that a vaccine provides (on average) 1 unit of antibody for people aged 60-90, 3 units of antibody for people aged 30-60, and 300 for people aged 0 to 30. Furthermore, let's say we determined 3 units of antibody are sufficient to perfectly resist the disease.

This is great in that we might be able to reduce dosages for young people or advise them against booster doses of the vaccine. Conversely, we could vaccinate older people more often and with higher dosages. But how do we know the researchers haven't just cherry-picked this data? And even if they didn't, how do we know that a smaller dose or less frequent dosing will still get young people to the 3 units needed, or that the reverse will boost the immunity of old people?

Something like "commonsensical reason" answers this question. Vaccines are dependent on lymphatic system function. We know this function worsens with aging, to the extent that a 90-year-old has about 99.x% less of it than a 1-year-old. Furthermore, we know this is correlated with the ease of acquiring immunity and how long said immunity lasts. Finally, we know the amount of antigen is the other factor we can control that leads to a higher chance of acquiring immunity.

Thus we can build a causal model saying the above, but we are now out of BRPCT land, and our treatment protocol will not be approved by regulators.

How To Use Reason?

One way of using "reason" while still keeping the benefits of a BRPCT is to build in some theoretical assumptions under which our study operates.

With reason, we can say something like “if this drug is better at 10mg than at 5mg, and at 5mg than at 1mg, with no apparent side effects, then maybe we can use a 20mg dosage for severe cases that don’t go away”.

Reasoning allows us to monitor effects in our main "cohort", but also to "pre-register" 20 extra cohorts in which we expect to find interesting effects. If more than one of those shows them, it's likely a real effect and not random luck (assuming p=0.05 as a cutoff and playing along with the flawed statistics).

Finally, reason can be used by thinking in a "Bayesian" way about medicine: accumulating evidence from many small studies rather than only thinking in terms of single studies, and combining studies and the odds they provide for a hypothesis in order to generate conclusions with higher certainty than any single study could.

This is all being done.

The other “gold standard” in medicine is the meta-analysis, which uses something like “Bayesian” reasoning, although usually in a rather simplistic fashion, to aggregate the findings of many studies which are individually insufficient to draw conclusions. These can yield new effects for drugs, provide insight into new treatment protocols, and overall recommend for or against a treatment, with benefits similar to a BRPCT, without needing one. This is not to say they are mutually exclusive; often enough these analyses include BRPCTs of the respective treatment.
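For illustration, here is a minimal sketch of the core arithmetic (a fixed-effect, inverse-variance pooling over made-up study results; real meta-analyses get considerably fancier): each study is weighted by its precision, and the pooled estimate comes out more precise than any single study.

```python
# Fixed-effect inverse-variance meta-analysis over a handful of hypothetical
# small studies, each reporting an effect estimate and its standard error.
import math

studies = [            # (effect estimate, standard error): illustrative numbers only
    (0.30, 0.20),
    (0.10, 0.25),
    (0.25, 0.15),
    (0.40, 0.30),
]

weights = [1 / se ** 2 for _, se in studies]
pooled_effect = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

print(f"pooled effect: {pooled_effect:.2f} +/- {1.96 * pooled_se:.2f}")
# The pooled standard error (~0.10) is smaller than any individual study's,
# which is where the extra certainty comes from.
```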

N=1

Can this be simplified?

Of course. Remember, we got into this situation because acetaminophen cured Joe's headaches and we thought this might apply to other people, thus we needed to run trials to find out whether that is true. But why does Joe need any trials to back up his decision to take acetaminophen? He already knows it worked once; that's a pretty good prior for taking it again. If it works a second time that prior gets even stronger, and by the sixth time it works with no side effects, he can happily declare acetaminophen the "cure" for his headaches.
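One way to make Joe's "pretty good prior" concrete is a simple Beta-Binomial update (a sketch with an arbitrarily chosen uniform starting prior, not a claim about how Joe actually reasons): each time the drug works, the estimated chance that it works for him goes up.

```python
# Joe's n=1 bookkeeping: start with a vague prior on "acetaminophen relieves
# my headache" and update it after each attempt (Beta-Binomial conjugacy).
successes, failures = 0, 0
prior_a, prior_b = 1, 1          # uniform prior: no opinion either way

for worked in [True, True, True, True, True, True]:   # six headaches, six successes
    successes += worked
    failures += not worked
    a, b = prior_a + successes, prior_b + failures
    print(f"after {successes + failures} tries, estimated success rate: {a / (a + b):.2f}")
# Ends around 0.88: not certainty, but plenty to justify reaching for the pill again.
```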

The problem here is that Joe still has to have a starting point, and that starting point has to tell him what drugs he should self-experiment with, under what conditions, for what diseases.

But individual experimenting allows us to take a lot of the complexity out of medicine. It does away with the huge differences between people and the complexities they bring, and leaves us with a single individual.

Biohacking and personalized medicine are all about n=1 experiments. But they are yet to be standardized in all but the harshest of scenarios, where nothing else can be done.

Sadly enough, most people self-experiment in the worst of ways. They don’t use hard-to-fake and easy-to-measure endpoints, they don’t use self-placebos, they don’t replicate their experiments, and they don’t have a theoretical understanding to guide them.

As I said in the previous article in this series, I have no “solution” to this, nor am I certain one exists. I’m just trying to outline limits in an easy-to-understand way. If I knew how to push past them I’d have already done it.

Comments

I don't think it's useful to speak in terms of blinding. The operational definition of blinding is not about whether someone knows that they take a drug but whether efforts were made to make it harder for them to know. In many cases, patients still know because there are immediate signs of taking verum. 

For that reason, it's better to use the original terminology of masking than the misleading talk of blinding. (see more at https://www.lesswrong.com/s/5CNs9wmHWFQTNjFKo/p/grQQynyy6Mc4agJxu )

The groups are picked at random from a population picked as randomly as possible (this is the randomized bit). Within limits, you obviously want to only pick women suffering from Alzheimer's for a drug meant to treat Alzheimer's in women.

No, randomized generally only means that there's randomization between whether someone in the clinical trial ends up in the verum or placebo group. Clinical trials for drugs are often made by looking for an ideal trial population where the company believes the drugs will be most effective. That often means that the participants in the trial take fewer other drugs to avoid side effect interactions. 

As I said in the previous article in this series, I have no “solution” to this, nor am I certain one exists.

As far as solutions to this overall problem go, switching from Evidence-Based Medicine to Prediction-Based Medicine would be helpful. 

Clinical trials for drugs are often made by looking for an ideal trial population where the company believes the drugs will be most effective. That often means that the participants in the trial take fewer other drugs to avoid side effect interactions.

 

It depends on how you read the evidence. If you are using the trial info in good faith (rather than just as an approval tool), the laxer the selection criteria the better, since you can assume more "noise" is accounted for.

 


Agree with everything else that you said.

There's an idea that RCTs produce some kind of objective evidence that's independent of how you read it. Understanding that the trial environment differs in various ways from the actual clinical environment is important to seeing that there's a need for translation, and that translation requires subjective calls.