I know there are startups out there doing what you say... As you said, the reason that they're not publicly available yet is the regulatory burden. 23andMe already got burned doing that.
Mostly just posted this to say that I really enjoy the way you think and I've loved these types of "what if" posts that you've been throwing out. Keep it up.
Another genomics PhD here. It's a complex topic. We know that combinatorial effects (epistasis in genetics lingo) matter, from population genetics studies in model organisms. This is despite the fact that simple linear models perform well in the human population - provided they are against some reasonably constant genetic background, low allele frequencies mean that the combinatorial effects are well captured by linear ones.
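To make the low-allele-frequency point concrete, here is a minimal sketch under a toy model that is my assumption, not something from the studies mentioned: two independent 0/1 carrier indicators x1 and x2 with frequencies p and q, and a phenotype y = a*x1 + b*x2 + c*x1*x2. The function names and the effect sizes are hypothetical. The product term can be rewritten as x1*x2 = (x1-p)*(x2-q) + q*x1 + p*x2 - p*q, so the best linear fit absorbs most of the interaction when p and q are small, and only the leftover (x1-p)*(x2-q) part counts as epistatic variance.

```python
def variance_split(p, q, a, b, c):
    """Return (additive_variance, epistatic_variance) for the toy model
    y = a*x1 + b*x2 + c*x1*x2, with x1 ~ Bernoulli(p) and x2 ~ Bernoulli(q)
    independent. All parameter names are hypothetical."""
    # The best linear approximation picks up part of the interaction term:
    # the effective slopes become a + c*q and b + c*p.
    a_eff = a + c * q
    b_eff = b + c * p
    additive = a_eff ** 2 * p * (1 - p) + b_eff ** 2 * q * (1 - q)
    # What the linear model cannot capture is c*(x1-p)*(x2-q).
    epistatic = c ** 2 * p * (1 - p) * q * (1 - q)
    return additive, epistatic


def epistatic_fraction(p, q, a, b, c):
    """Share of genetic variance that a purely linear model misses."""
    additive, epistatic = variance_split(p, q, a, b, c)
    return epistatic / (additive + epistatic)


if __name__ == "__main__":
    # Same effect sizes throughout; only the allele frequency changes.
    for freq in (0.5, 0.1, 0.01):
        frac = epistatic_fraction(freq, freq, a=1.0, b=1.0, c=1.0)
        print(f"carrier frequency {freq}: epistatic share = {frac:.4f}")
```

In this sketch the epistatic share shrinks as the variants get rarer even though the interaction coefficient c is as large as the main effects, which is one way to see why linear models can do well without the biology being linear.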
The problem is that even if you only care about pairwise combinations, there are far too many of them, given a uniform prior. Even if we sequenced everyone on earth we wouldn't have anywhere near enough info. Sequencing additional individuals has diminishing returns, because there's only so much genetic variation in the human population (and ~23000^2 possible pairwise combinations).
What we need are good priors over combinations of mutations. To do that we'll need detailed info about which genes function together to produce which phenotypes. Such models exist already and are seeing moderate success, but we need new ideas and more data than any one startup could provide. Which is exactly what molecular biologists are working on.
OP here. Having learned more statistics since I last posted, I reckon it could be as simple as exploring various interactions (effect modifications) in the data with respect to additional SNPs. The issue would be that interactions require greater sample sizes to avoid spurious results, and most genetics research has woefully low sample sizes, which would only be harder to overcome when inching towards more personalised medicine based on individual genomes.
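A rough sketch of why interactions need larger samples, under assumptions of my own: a continuous phenotype with the same noise variance in every group, two independent binary carrier indicators with frequencies p and q, and estimates built from simple differences of group means. The variance of such a contrast is proportional to the sum of 1/(group share), so the interaction (a difference of differences over four cells) is penalised by its rarest cell. The function names are hypothetical.

```python
def contrast_variance_factors(p, q):
    """Return (main, interaction) variance factors, each in units of
    sigma^2 / N, for carrier frequencies p and q (assumed independent)."""
    # Main effect of SNP 1: carriers vs non-carriers, two groups.
    main = 1 / p + 1 / (1 - p)
    # Interaction: difference of differences across the four genotype cells.
    cell_shares = [p * q, p * (1 - q), (1 - p) * q, (1 - p) * (1 - q)]
    interaction = sum(1 / share for share in cell_shares)
    return main, interaction


def sample_size_inflation(p, q):
    """How many times more samples the interaction needs than the main
    effect of SNP 1 to reach the same standard error."""
    main, interaction = contrast_variance_factors(p, q)
    return interaction / main


if __name__ == "__main__":
    for freq in (0.5, 0.1, 0.01):
        print(f"carrier frequency {freq}: "
              f"{sample_size_inflation(freq, freq):.1f}x the samples")
```

Under these assumptions the inflation works out to 1/(q*(1-q)), so it is 4x in the best case of a 50/50 split and grows quickly as the second variant gets rarer. That is before accounting for the interaction effect itself probably being smaller than the main effects.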
Yes, that's the case. To get enough data we probably need lots of in vitro experiments. Remember that data is not equal to information: even really big sample sizes wouldn't be enough to resolve the combinatorial explosion. What I meant in that comment up there (I posted it before it was finished, I think) is that there are ~23k genes in the genome, so even under the absurdly simple assumption that there's only one mutation possible per gene, you have half a billion possible combinations of gene breakages, which you will never ever be able to get enough of a sample size to look at blindly.
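The arithmetic behind that figure, as a quick sketch. The ~23k gene count is the one quoted above; the Bonferroni-style threshold at the end is my own illustration of what testing every pair "blindly" would cost, not something taken from the comment, and the function names are hypothetical.

```python
import math

N_GENES = 23_000  # rough gene count quoted in the thread


def ordered_pairs(n):
    """n^2, the '~23000^2' figure: roughly half a billion."""
    return n * n


def unordered_pairs(n):
    """Distinct gene pairs, ignoring order and self-pairs."""
    return math.comb(n, 2)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level if every pair is tested with no prior."""
    return alpha / n_tests


if __name__ == "__main__":
    print("ordered pairs:  ", ordered_pairs(N_GENES))
    print("unordered pairs:", unordered_pairs(N_GENES))
    print("gene triples:   ", math.comb(N_GENES, 3))
    print("per-pair alpha: ",
          bonferroni_threshold(0.05, unordered_pairs(N_GENES)))
```

Counting each pair only once halves the total to about 264 million, which doesn't change the conclusion. Going from pairs to triples multiplies the count by several thousand again.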
Based on your more intimate knowledge and access to knowledge in the area, what kind of USD investment (even an order-of-magnitude estimate would suffice, if a precise figure is intractable) would we be looking at if an amount of resources, proportional to the potential humanitarian impact relative to mosquito-transmitted diseases, were to be spent to develop a gene drive ready for use in the tsetse fly, a species regarded as responsible for preventing an African 'green revolution' like the one seen in Asia, and thus part of the whole fable of African starvation? Is there any way to incorporate resource investment into mitigating relevant risks? It seems like an academic has independently started thinking along the same lines.
Hmmmm. I'm shamefully ignorant about prices, but I would estimate such an effort would be in the tens of millions, if you wanted it done quickly (and it will still take a while). As far as I'm aware we haven't developed methods for transgenesis in tsetse flies, having only gotten the genome sequenced in 2014 (priorities, people?!), and setting it up in a new organism with an unusual life cycle can be surprisingly difficult. The link below describes techniques for manipulating gut microbes in the flies, which I don't think would suffice.
In Drosophila you can't go from cell culture to an embryo easily like in mammals; you have to inject stuff into embryos, then breed from those embryos and hope some of your vector got into the germ line. In tsetse flies, I am now aware, the mother keeps the embryo until it's quite developed, meaning the techniques used in Drosophila wouldn't work, and we certainly don't have any tsetse cell lines, which I doubt would be of use anyway. So you'd be looking at developing a novel means of transgenesis (a viral vector targeting the germ line, maybe?), which is a task that, while no doubt solvable, inevitably has big uncertainties in it.
So yes, tens of millions, give or take an order of magnitude, plus years and years of work. Well worth doing though. In my opinion the potential gains far outweigh the risks.
P.S. The link to 'relevant risks' you posted is broken, I'd be interested in seeing it.
I really appreciate the explanations in this thread. I was wondering if anyone had an update regarding recent developments in this space. Specifically, using big data to solve for genetic / protein links to phenotypes. I have also been struggling to find more recent information regarding genosets.
Apologies if any of that is unclear, I am still relatively new to this.
I'm a PhD student in genomics (read: argument to authority). Regulatory issues are definitely important and largely an impediment that should be removed, imo. That said, I think the larger issue is capturing and integrating good phenotypic and disease state data into datasets. Although there are large genomics data sets available, generally they have pretty sparse and poorly annotated phenotypic data. This is actually tied to other regulatory issues related to medicine. If you think this is important, please do consider getting involved in the area.
Since risk from individual SNPs 'should' not be aggregated to indicate an individual's risk based on multiple sources of evidence, how are the magnitudes for genosets determined? Can Bayes or another method be used to interpret a Promethease report?
Even genetic epidemiology textbooks seem pessimistic about the usefulness of the genetic research underpinning precision medicine:
The references in question are about the impact of population stratification on genetic association studies. That doesn’t seem to substantiate such a broad stroke about the non-replicability of genetic epidemiology. I don't know what to make of these findings.
Here is a link to a screenshot of those references
It surprises me that entrepreneurial machine learning analysts don’t beg for genetic research that identifies how combinatorial patterns of genes characterise individual risk. It seems like if/once they can get hold of that information, the gap between genetic science and consumer-actionable health information is bridged. So where are the 'lean gene learning machine' startups? I certainly don’t have the lean gene to do it myself. I don’t know machine learning.
Regulatory issues seem like the biggest hurdle. To the best of my google-fu, 23andMe doesn't even disclose what its 'Established Research' genes are. So, once regulatory hurdles are surmounted, lots of useful research will flood out.