This is a critique of How to Make Superbabies on LessWrong.
Disclaimer: I am not a geneticist[1], and I've tried to use as little jargon as possible, so I use the word mutation as a stand-in for SNP (single nucleotide polymorphism, a common type of genetic variation).
Background
The Superbabies article has 3 sections, where they show:
- Why: We should do this, because the effects of editing will be big
- How: Explain how embryo editing could work, if academia were not mind-killed (hampered by institutional constraints)
- Other: legal stuff and technical details.
Here is a quick summary of the arguments in the "why" part of the original article; the rest is not relevant for understanding my critique.
- We can already make (slightly) superbabies by selecting embryos with "good" mutations, but this does not scale, as there are diminishing returns and almost no gain past "best of 40".
- They show that when you edit, you don't have the issue outlined above, because each edit helps; here the only source of diminishing returns is that you of course pick the most important mutations first.
- Then they switch from IQ to a disease focus by removing bad mutations. They show that most disease risks can be reduced by 50-100% with 3-25 edits, but if you swap from disease association to life expectancy, then 300 edits may only gain you 14 years. Interestingly, it only takes 16 edits to get the first 5 years.
- They show that there are fewer tradeoffs between mutations than one might initially anticipate; the correlations are mostly positive, i.e., curing one disease might protect you from another one.
- Then they use chickens becoming 4 times larger over 50 years, which is 40 standard deviations(!), to motivate the intuition that a species can go far out of equilibrium using selective breeding, so selective editing should allow us to push even further.
- Finally, they make theoretical curves for IQ gain and longevity, under the assumption that with more and stronger data, they could find more and better mutations to edit.
My Position
At a high level, I appreciate the effort in the original post and think most of it is well-written. The main points about the difference between editing and selecting, the parts about diseases, and to some extent longevity, are valid. However, I am skeptical of the claimed IQ effects.
My critique can be broken down into the following subtopics:
- Correlation vs. Causation
- This is based on observational data of mutations and traits, and thus has all the usual issues with causality.
- The additive assumption of genetics
- We observe that a simple model, which just adds up all the mutation effects, works well on normal humans, who usually only have a tiny subset of the mutations, and therefore assume that it generalizes to superbabies, who have a lot more of them.
- Regression towards the mean
- The largest measured effects in genetic studies often benefit from measurement luck. I believe modern genetic analysis techniques don't fully account for this statistical phenomenon.
I'm less skeptical about disease prevention through genetic editing because some powerful examples exist—like how a single edit can reduce Type 2 diabetes risk by half. This suggests a likely causal relationship rather than just a socioeconomic correlation. The additive assumption is also less problematic when dealing with a small number of high-impact edits. In contrast, IQ appears to be influenced by more than 1,000 genetic variations, most predicted to have much smaller effects than +1 IQ point.
Correlation vs. Causation
This point was mostly raised by lgs and richardjacton and others in the comments to the original post. The issue here is that we only have traits such as IQ and mutations, and thus any relation between the two could be spurious. For example, a mutation might be shared between people who don't like living in old buildings, and thus the causal path runs through lower exposure to leaded paint and asbestos.
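To see how easily this can happen, here is a minimal simulation (my own toy numbers, not from the post) where a mutation has zero direct effect on IQ but still shows a clear association, because both the mutation and IQ track an unmeasured environmental factor:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Unmeasured confounder, e.g. childhood environment / exposure to old housing.
environment = rng.standard_normal(n)

# The mutation is more common in people with a "good" environment,
# but has NO direct effect on IQ.
mutation = (rng.random(n) < 0.2 + 0.1 * (environment > 0)).astype(float)

# IQ is driven by the environment (plus noise), not by the mutation.
iq = 100 + 3 * environment + 10 * rng.standard_normal(n)

# A naive GWAS-style comparison still finds an apparent "effect" of the mutation.
apparent_effect = iq[mutation == 1].mean() - iq[mutation == 0].mean()
print(f"apparent effect: {apparent_effect:.2f} IQ points")  # ~0.6, despite a true causal effect of zero
```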
While I think the association critique of the baby IQ effects is a very strong argument, I will set that aside for the remainder of this post and focus more on the model assumptions baked into their extrapolations.
It is important to note that when the authors say:
We take into account uncertainty about which of a cluster of nearby genetic variants is actually CAUSING the observed effect.
the clearer wording would be "CAUSING the observed association", so it is a claim about which mutation in a cluster actually "causes" the (possibly non-causal) statistical association between mutation and IQ.
The last part of the authors' post explains the steps they took to avoid overestimating the IQ effects: they adjust for the poor retest quality of the UK Biobank IQ test and for assortative mating (people tend to pair with people like themselves). While this does not fully close the association-to-causation gap, nor address the challenges I present in the next two sections, these adjustments are still very sensible, so I feel they deserve to be mentioned.
The gold standard for causal evidence in genetics is called Mendelian Randomization, where you measure a biomarker related to the disease, sort of like this: people with a specific mutation in a gene for glucose metabolism (present from birth) have higher blood glucose later in life, which explains why they have a higher incidence of diabetes.
This is almost as strong evidence as a clinical trial, because you get your genes before your biomarker, so it is correctly "time sorted", and the "mutation" assignment is random, like in a clinical trial. The main limitation is that most biomarkers are measured in biofluids, and even cerebrospinal fluid would probably only capture a tiny part of the effect on IQ, because I suspect the actual causal mutations for IQ influence brain development, or protein levels in brain cells, which are hard to measure without influencing IQ.
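To make the logic concrete, here is a minimal sketch of the simplest MR estimator, the Wald ratio, on simulated data with made-up effect sizes: a genotype assigned "at random" at conception shifts glucose, glucose shifts diabetes liability, and an unmeasured confounder distorts the naive glucose-diabetes comparison but not the MR estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Genotype is assigned at conception (0, 1 or 2 copies of the variant).
genotype = rng.binomial(2, 0.3, n)

confounder = rng.standard_normal(n)  # e.g. lifestyle, unmeasured
glucose = 0.5 * genotype + 1.0 * confounder + rng.standard_normal(n)
# Assumed truth: glucose raises diabetes liability by 0.3 per unit.
diabetes_liability = 0.3 * glucose + 1.0 * confounder + rng.standard_normal(n)

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    return np.cov(x, y, ddof=0)[0, 1] / np.var(x)

naive = slope(glucose, diabetes_liability)
wald_ratio = slope(genotype, diabetes_liability) / slope(genotype, glucose)

print(f"naive estimate:  {naive:.2f}")      # ~0.8, inflated by the confounder
print(f"MR (Wald ratio): {wald_ratio:.2f}") # ~0.3, close to the true effect
```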
So the authors went to war with the data they had, which is totally fine, but it is also why I think they are too optimistic with regards to correlation vs. causation.
The Additive Effect of Genetics
One of the great paradoxes of statistics is that there is no natural scale between variables in a statistical model. This means it is very hard to know how to link mutations to disease risk. In statistics, we call this the "link function," because it links the predictors (in this case, the mutations) to the outcome (in this case, IQ). The question statisticians ask is: should I link each mutation to the probability, odds, risk, relative risk, or hazard of the outcome? In other words, we don't know if genetic effects combine by simple addition or through more complex relationships.
However, because most genetic effects are tiny, this is not a significant issue in practice. Let's consider a simple toy example, comparing modeling mutations' association with IQ as either linear or exponential.
Imagine 5 mutations, rarely observed in the same individuals, that have either an additive +1 or a multiplicative ×1.01 effect on IQ, which is assumed to start at 100.
If you have 1 mutation, both models agree because:
- Additive: 100 + 1 = 101
- Multiplicative: 100 × 1.01 = 101
Even if you have all 5 mutations, which may rarely occur in your dataset, the two models practically agree:
- Additive: 100 + 5 × 1 = 105
- Multiplicative: 100 × 1.01^5 ≈ 105.1
So one model says +5 IQ, the other +5.1 IQ. At first glance, it doesn't really matter what link function we use, because as long as we are within normal human variation, most link functions will give similar results. However, the moment we extrapolate to the extreme, such as with 500 mutations under the assumptions above, we get:
- Additive: 100 + 500 × 1 = 600
- Multiplicative: 100 × 1.01^500 ≈ 14,477
So the additive model suggests +500 IQ and the multiplicative model suggests +14377 IQ — hardly the same numbers. This indicates that getting the link function correct is very important when we are predicting far out of sample.
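For completeness, here is the same arithmetic as a few lines of code (just the toy numbers from this section, nothing from the original post), mainly to show how fast the two link functions diverge outside the range the model was fit on:

```python
# Toy comparison of link functions: +1 IQ per mutation (additive)
# vs. ×1.01 per mutation (multiplicative), starting from IQ 100.
def additive(n_edits, base=100.0, effect=1.0):
    return base + effect * n_edits

def multiplicative(n_edits, base=100.0, factor=1.01):
    return base * factor ** n_edits

for n in [1, 5, 500]:
    print(f"{n:>3} edits: additive {additive(n):>7.1f}, "
          f"multiplicative {multiplicative(n):>8.1f}")
# 1 and 5 edits: the models agree to within ~0.1 IQ points.
# 500 edits: 600 vs ~14,477; the link function now dominates the prediction.
```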
Another issue is the assumption of independence—that we can simply keep stacking mutations associated with IQ and getting the full benefit for each one. Again, in real data, mutations are somewhat rare, so very few people share the same two mutations. The result is somewhat the same as in the case of multiplicative vs. additive link functions: even if there were big correlations between some of the mutations giving rise to the same trait, the fact that most people only have a tiny subset of the mutations means that the correlations can practically be ignored.
However, the moment we start editing to introduce many mutations, we are again very far out of sample, and thus the assumption of independence goes from not being a problem (because it rarely happens) to an article of faith, because you made sure it happened.
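Here is a small simulation of that failure mode with invented numbers: two mutations whose benefits overlap, but which co-occur so rarely in the population that an additive model fitted to observational data barely notices, and then double-counts them in an edited genome.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Two mutations, each carried by ~5% of people, rarely both (~0.25%).
a = (rng.random(n) < 0.05).astype(float)
b = (rng.random(n) < 0.05).astype(float)

# Invented truth: each adds +1 IQ alone, but together they overlap (+1.2 total).
iq = 100 + a + b - 0.8 * a * b + rng.standard_normal(n)

# Fit the usual additive model (intercept + one weight per mutation).
X = np.column_stack([np.ones(n), a, b])
coef, *_ = np.linalg.lstsq(X, iq, rcond=None)
print(coef[1:].round(2))  # each weight is ~0.96, so the model looks fine in-sample

# An edited embryo carrying both mutations:
print(f"model predicts +{coef[1] + coef[2]:.2f}, true gain is +1.2")
```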
For a biological implication of the statistical model above, imagine a set of mutations that makes the prefrontal cortex larger. The first critique would be akin to over-extrapolating to the point where your model says that having a 40kg prefrontal cortex would indeed make you very smart—but unfortunately, the "correct" link function is the one that understands that the prefrontal cortex cannot be bigger than the skull.
A biological example of the correlation between mutations may be one where there are 100 different mutations, each one giving rise to an increase in some neurotransmitter that is generally too low in the population. The average person has 7 of them, and no one in the world has more than 15 (which would otherwise be optimal), but all the superbabies with more than 25 are so far out of equilibrium that we would have to invent a new term for this type of insanity. This illustrates why simply adding more 'positive' mutations could lead to completely unexpected and potentially harmful outcomes.
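A sketch of the neurotransmitter story in code (all numbers invented): the true relationship between the count of "boosting" mutations and IQ saturates around 15 copies and then turns harmful, but a linear model fitted to the natural population, where almost everyone has between 0 and 15 copies, never sees that part of the curve and happily extrapolates.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Number of "boosting" mutations per person: ~7 on average, rarely above 15.
count = rng.binomial(100, 0.07, n)

def true_iq(c):
    # Invented truth: benefit saturates around 15 copies, then turns harmful.
    return 100 + 10 * (1 - np.exp(-c / 10)) - 0.05 * np.maximum(c - 15, 0) ** 2

iq = true_iq(count) + 5 * rng.standard_normal(n)

# Linear (additive) fit on the natural range of the data.
slope, intercept = np.polyfit(count, iq, 1)

for c in [7, 15, 30]:
    print(f"{c:>2} copies: additive model {intercept + slope * c:6.1f}, "
          f"true value {true_iq(c):6.1f}")
# In-sample (7 and 15 copies) the line is close; at 30 copies the linear model
# still promises a large gain while the true value has dropped below baseline.
```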
Regression towards the null part 1
This section gets a bit technical, but I'll try to explain the core concept: why we should be skeptical of the size of genetic effects, especially for complex traits like intelligence.
There is a concept named "Edlin's rule" or "Edlin's factor"—it's essentially how skeptical you should be of a claim. For example, if a study claims that a treatment improves memory by 20%, Edlin's rule might suggest the real effect is closer to 2%. In psychology, most effects are probably exaggerated by a factor of 10; in physics, it may be only 10%. In statistics, we call this "shrinkage," and in Bayesian statistics, the prior does the shrinking because it pulls towards the default value of no effect (often called the null value).
Genetics is like a mixture of physics and psychology when it comes to Edlin factors: when a disease is basically explained by one gene, you need to shrink very little, but with traits with 1000+ associations, you probably need to shrink more.
In the original post, it takes about 500 mutations to get a 50 IQ point increase, which means each mutation contributes about 0.1 IQ points. Since this is a tiny effect compared to the disease plot, where maybe 5 mutations are enough to eradicate a disease, intuitively this means we should shrink differently in the two cases.
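A small simulation of why this matters when you select the best-looking mutations out of thousands (all numbers invented: many tiny true effects, each measured with noise of a similar size):

```python
import numpy as np

rng = np.random.default_rng(4)
n_snps = 10_000

true_effect = rng.exponential(0.05, n_snps)            # mostly tiny true effects (IQ points)
measured = true_effect + rng.normal(0, 0.05, n_snps)   # noisy GWAS-style estimates

# Pick the 500 best-looking mutations, as an editing program would.
top = np.argsort(measured)[-500:]

print(f"promised gain (measured): {measured[top].sum():7.1f} IQ points")
print(f"actual gain (true):       {true_effect[top].sum():7.1f} IQ points")
# The measured total is noticeably larger than the true total:
# that gap is the winner's curse, i.e. the Edlin factor in action.
```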
Genetics is particularly ill-suited for classical frequentist statistics because of the combination of the high number of possible mutations and high local correlations. These correlations arise from the fact that when the genome is cut and reassembled in the process of making natural babies, mutations that are close together in the genome have two annoying properties: they simultaneously affect the same phenotype AND sit in the same "cutting frame", called an LD (linkage disequilibrium) region, and thus have very high local correlation.
The authors used a sophisticated statistical method called 'spike-and-slab' that essentially sorts genetic effects into two categories: tiny effects that are probably just noise (the 'spike' at zero) and larger effects that are likely real but might still be overestimated (the 'slab'). This method can adjust differently for simple traits versus complex ones like IQ.
Optional: Regression towards the null part 2
This section tries to explain why the spike-and-slab method is insufficient; it assumes a higher familiarity with math than the rest of the post.
A nice feature of the spike-and-slab prior is that it is refitted for each association, so a different spike-and-slab prior will be estimated for the disease explained by 5 mutations and for IQ, allowing us to have a different level of skepticism in the two settings.
There is, however, one issue with the spike-and-slab prior. The more effects you put into the "spike" part, the larger effects are left to be modeled by the "slab" part, creating a dual paradox:
- The stronger you believe in a lot of null effects, the more confident you are that the rest of the effects are perfectly fine, because the slab becomes wider when the medium-sized effects are moved to the spike. Put simply: if you decide most small genetic effects are just noise, you end up being overly confident that the remaining effects are accurate.
- The biggest discontinuity is applied at the threshold between spike and slab. Imagine we have mutations that, before shrinkage, have the values +4 IQ, +2 IQ, and +1.9 IQ, and that 1.95 is our spike vs. slab cutoff. Furthermore, let's assume that the slab shrinks effects by 25%. Then we get 4→3, 2→1.5, and 1.9→0, meaning we penalize our +2 IQ mutation much less than our +1.9 IQ mutation, despite their similar sizes, and we penalize our +4 IQ mutation more (in absolute terms) than the +2 IQ one, despite it having the biggest effect. This creates an arbitrary cliff where similar-sized effects are treated completely differently based on which side of the cutoff they fall on, and where the one that barely makes it is the one we are the least skeptical of.
Ideally, we would use a fat-tailed prior (like a t distribution or a Cauchy distribution), which is continuous and thus behaves like a sane version of the spike-and-slab prior[2], in that it allows for big effects and shrinks small effects a lot. With most fat-tailed priors, there is an inflection point after which, the further you get from 0, the less you are shrunk. In this case, the +2 and +1.9 mutations would end up almost in the same place, and the +4 mutation would shrink the least, if it is above the inflection point.
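To make the contrast concrete, here is a toy comparison using my own caricature of the hard-threshold behaviour described above (not the authors' actual fitted model) against the posterior mean under a Cauchy prior; the standard error and prior scale are arbitrary choices of mine.

```python
import numpy as np

def threshold_shrink(beta_hat, cutoff=1.95, slab_shrink=0.25):
    """Caricature of the spike-and-slab behaviour in the text: estimates below
    the cutoff go to the spike (zero), the rest keep 75% of their size."""
    return 0.0 if abs(beta_hat) < cutoff else (1 - slab_shrink) * beta_hat

def cauchy_posterior_mean(beta_hat, se=1.0, prior_scale=1.0):
    """Posterior mean of the true effect under a Cauchy(0, scale) prior and a
    Normal(true effect, se) likelihood, by brute-force numerical integration."""
    beta = np.linspace(-50, 50, 400_001)
    likelihood = np.exp(-0.5 * ((beta_hat - beta) / se) ** 2)
    prior = 1.0 / (1.0 + (beta / prior_scale) ** 2)
    posterior = likelihood * prior
    return np.sum(beta * posterior) / np.sum(posterior)

for b in [1.9, 2.0, 4.0]:
    print(f"raw {b:.1f} -> threshold {threshold_shrink(b):.2f}, "
          f"Cauchy {cauchy_posterior_mean(b):.2f}")
# The threshold rule jumps from 0 to 1.5 between 1.9 and 2.0; the Cauchy
# posterior mean treats 1.9 and 2.0 almost identically and, in relative terms,
# shrinks the +4 estimate the least.
```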
In summary, the statistical methods used to estimate genetic effects on IQ likely don't fully account for this 'winner's curse' problem, potentially making the projected benefits of gene editing for intelligence overly optimistic, beyond the issues of causality and additivity.
Andrew Gelman has a scalable method where you can convert every effect into a signal-to-noise ratio, which allows you to derive a signal-to-noise prior with fat tails. This is basically a shrinkage function that says (or can say, if the data agrees) that big effects should shrink very little and small effects a lot. This method could work for genetics if the field got into the habit of "saving" all the statistically insignificant associations, so we could also "fit" the middle part of the prior.
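I won't try to reproduce Gelman's actual machinery here, but a closely related calculation, the "exaggeration ratio" from Gelman and Carlin's design-analysis work, shows how an Edlin factor falls out of the signal-to-noise ratio alone (the effect sizes and standard errors below are invented):

```python
import numpy as np

def exaggeration_ratio(true_effect, se, n_sim=1_000_000, seed=0):
    """How much a 'statistically significant' estimate overstates the truth,
    on average, for a given true effect and standard error (i.e. a given
    signal-to-noise ratio)."""
    rng = np.random.default_rng(seed)
    estimates = true_effect + se * rng.standard_normal(n_sim)
    significant = np.abs(estimates) > 1.96 * se  # two-sided p < 0.05
    return np.mean(np.abs(estimates[significant])) / abs(true_effect)

# Invented numbers: a mutation truly worth +0.1 IQ, estimated with SE 0.05
# (signal-to-noise 2) vs. estimated with SE 0.1 (signal-to-noise 1).
print(round(exaggeration_ratio(0.1, 0.05), 2))  # ~1.4x exaggeration
print(round(exaggeration_ratio(0.1, 0.10), 2))  # ~2.5x exaggeration
```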
Final Note
Lastly, the post discusses "how much IQ" we could improve per edit if we had more data, where they sort of assume that with more data they could find edits with stronger effects[3]. However, the opposite may also be true: more data might reveal that the real effects are actually smaller than currently estimated, not larger.
As I said, I think the original post is great. While I think it is very hard to estimate how many edits we would need to gain even a few extra IQ points, there is no doubt that if we start doing these edits, we will suddenly have data on which of them are actually causal. Then, like the chickens that became 40 SD larger, we could eventually get to 600 IQ (also 40 SD), with the limitation that chickens have had about 100 generations since the 1950s and we would have had 2, so if we want to speed this up we also need good IQ tests for babies[4].
- ^
Though I have a Master's in biochemistry, a PhD in bioinformatics, 10+ years of experience with the sister field of proteomics, and I work as a Statistician/Data Scientist for a pharmaceutical company, where I, among other things, have helped geneticists set up causal models similar to Mendelian Randomization.
- ^
Though its mathematical properties are super annoying, which is why it's preferred by keyboard warriors on LW like me, but not used by the people in the original post, who are limited to the best tools the field of genetics has to offer.
- ^
This may be sort of a straw man; it seems they have some large effects that are "lost" when they do the fine-mapping, in which case it makes sense to assume there are bigger IQ fish out there, though my shrinkage point still stands, even if it is a bigger number we need to shrink.
- ^
Tests that predict their adult IQ well, and that are robust to crazy stuff like a "live fast, die young" gene that may make you smart as a 2-year-old but dumb as an adult because you matured too fast.