All of Yejun Y.'s Comments + Replies

Why should we expect novel ML architectures to perform better than current methods like LASSO regression? If your task is disease risk prediction, this falls under "variance within the human distribution" which, as far as I know, should be explained mostly by additive effects of SNPs. Current missing heritability is more of an issue of insufficient data (like you say) and rare variants rather than an inefficiency of our models. Steve Hsu had a paper about why we should actually theoretically expect LASSO regression to be efficient here.

My impression is gen...

tgb
ML models can take more data as input. In particular, the genomic sequence itself is not a predictor in LASSO regression models: variants are just arbitrarily coded as 0, 1, or 2 (the alternative-allele count). LASSO models therefore have limited ability to pool information across variants or across data modalities. ML models like this one can (in theory) predict a variant's effect on data like RNA-sequencing (which shows which genes are actively being transcribed) directly from its sequence. That information is effectively pooled across variants and ties the genomic sequence to another data type (RNA-seq). If you feed that information into a disease-effect prediction model, you might improve on the LASSO regression model. There are a lot of papers claiming to do this now, for example the supervised BRCA1 experiment in the Evo 2 paper. Of course, the supervised disease-effect prediction layer could itself be LASSO, just with some additional features derived from the ML model.
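To make the structure concrete, here is a minimal sketch of the kind of hybrid pipeline described above. Everything is simulated, and `sequence_model_score` is a hypothetical stand-in for a real sequence-model-derived feature (e.g. an Evo-2-style or expression-based score), so the numbers only illustrate where the extra feature enters, not that it helps.

```python
# Toy, self-contained sketch: a LASSO PRS over 0/1/2 genotypes, with one
# extra pooled feature appended. All data are simulated.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n_people, n_variants = 500, 1000

# Genotypes: alternative-allele counts, coded 0, 1, or 2 per variant.
genotypes = rng.integers(0, 3, size=(n_people, n_variants)).astype(float)

def sequence_model_score(genotypes):
    """Hypothetical placeholder: a per-person score pooled over variants,
    e.g. predicted regulatory impact summed over the alleles carried."""
    variant_impact = rng.normal(size=genotypes.shape[1])  # stand-in predictions
    return genotypes @ variant_impact

# Simulated phenotype: sparse additive effects plus noise.
true_beta = np.zeros(n_variants)
true_beta[rng.choice(n_variants, 50, replace=False)] = rng.normal(size=50)
phenotype = genotypes @ true_beta + rng.normal(scale=5.0, size=n_people)

# Plain LASSO PRS vs. LASSO with the sequence-derived feature appended.
X_plain = genotypes
X_hybrid = np.column_stack([genotypes, sequence_model_score(genotypes)])

for name, X in [("plain", X_plain), ("hybrid", X_hybrid)]:
    model = LassoCV(cv=5).fit(X, phenotype)
    print(name, "in-sample R^2:", round(model.score(X, phenotype), 3))
```

Since the stand-in score is random noise here, the hybrid model won't actually win; the point is only the shape of the pipeline, with the supervised layer still being LASSO.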
ammon
Ammon here. We are in a data-limited environment where regularization is key. That's why LASSO regression has worked so well: linearity itself (a low parameter count) is regularization, and anything else, done naively, does not generalize. Of course, a (surprisingly) large amount of human variation does appear to be due to additive effects, and is thus well explained by linear models. But that's not the entire story. PRS models for many diseases still fall far short of broad-sense heritability. There is a gap to explain, and the question, I think, comes down to the nature of that gap.

Talking to researchers, I encounter a few different theories: 1) rare SNPs not yet identified by GWAS, 2) structural variants not included in GWAS, 3) epistasis/dominance, 4) epigenetics, 5) gene/environment interactions (perhaps undermining the twin studies used to calculate broad-sense heritability). I'd love to hear other ideas.

To the extent that 2, 3, or 4 are true (for some important diseases), looking beyond current techniques seems necessary. If 1 is true, I still think there's lift from new approaches. Take a look at GPN-MSA: reducing the size of the search space (by calculating a prior over SNP pathogenicity with unsupervised methods) finds more loci.
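As a toy illustration of the "prior over the search space" point (my own sketch, not GPN-MSA itself): if an unsupervised model supplies a per-variant pathogenicity prior, one simple way to fold it into a LASSO fit is as per-variant penalty weights, implemented by rescaling the genotype columns. The `prior` values below are random stand-ins for real scores.

```python
# Toy prior-weighted LASSO on simulated data. Rescaling column j by prior_j
# is equivalent to penalizing variant j by lambda / prior_j, so high-prior
# variants are shrunk less.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n_people, n_variants = 500, 1000
genotypes = rng.integers(0, 3, size=(n_people, n_variants)).astype(float)

# Hypothetical per-variant pathogenicity priors in (0, 1].
prior = rng.uniform(0.05, 1.0, size=n_variants)

# Simulated phenotype whose causal variants are drawn preferentially from
# high-prior positions, so the prior carries real signal in this toy.
causal = rng.choice(n_variants, 50, replace=False, p=prior / prior.sum())
beta = np.zeros(n_variants)
beta[causal] = rng.normal(size=50)
y = genotypes @ beta + rng.normal(scale=5.0, size=n_people)

# Prior-weighted LASSO: scale features by the prior, fit, then unscale.
fit = LassoCV(cv=5).fit(genotypes * prior, y)
beta_hat = fit.coef_ * prior  # coefficients on the original genotype scale

print("non-zero coefficients:", int(np.sum(beta_hat != 0)))
print("causal variants recovered:", int(np.sum(beta_hat[causal] != 0)))
```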
Greg Baker
I've been looking at non-Euclidean loss functions in my PhD, and particularly at ultrametric loss functions. These give you a very different kind of supervised learning (even if you are just doing linear regression), and they may be relevant here because they are very good at modelling hierarchies (e.g. ancestry). So if you interpret "different kind of architecture" as "we need to do something other than what we're doing at the moment with Euclidean-based linear regression", then I agree with the post; but if it means "we must do deep learning with neural networks", then I agree with Yejun.
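For readers unfamiliar with the term, here is one concrete ultrametric sketched in Python (a toy of my own construction, not necessarily the loss Greg has in mind): the cophenetic distance on a hierarchy, where the cost of confusing two leaves is the height of their lowest common ancestor. It satisfies the strong triangle inequality d(x, z) ≤ max(d(x, y), d(y, z)).

```python
# Toy ancestry hierarchy: each leaf maps to its root-to-leaf path of labels;
# population names and heights are invented for illustration.
PATHS = {
    "finnish":   ["root", "european", "north_european"],
    "sardinian": ["root", "european", "south_european"],
    "yoruba":    ["root", "african", "west_african"],
}
HEIGHT = {"root": 1.0, "european": 0.5, "african": 0.5,
          "north_european": 0.2, "south_european": 0.2, "west_african": 0.2}

def ultrametric_loss(pred: str, true: str) -> float:
    """Height of the lowest common ancestor of the two leaves (0 if equal)."""
    if pred == true:
        return 0.0
    lca = "root"
    for a, b in zip(PATHS[pred], PATHS[true]):
        if a != b:
            break
        lca = a
    return HEIGHT[lca]

print(ultrametric_loss("finnish", "sardinian"))  # 0.5 (split at "european")
print(ultrametric_loss("finnish", "yoruba"))     # 1.0 (split at "root")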
TsviBT
I don't know what they have in mind, and I agree the first obvious thing to do is just get more data and try linear models. But there's plenty of reason to expect gains from nonlinear models: broad-sense heritability is higher than narrow-sense heritability; there's missing heritability (though maybe it ought to be missing given our datasets); and there are theoretical reasons (quite plausibly there are multiple factors, as straw!tailcalled has described, e.g. in an OR-of-ANDs circuit; and more generally nonlinearities, e.g. U-shaped responses in latents like "how many neurons to grow").

My guess, without knowing much, is that one of the first sorts of things to try is small circuits. A deep neural net (i.e. a differentiable circuit) is a big circuit; it has many, many hidden nodes (latent variables). A linear model is a tiny circuit: it has, say, one latent (the linear combination), maybe with a nonlinearity applied to that latent. (We're not counting the input nodes.) What about small but not tiny circuits? You could have, for example, a sum of ten lognormals, or a product of ten sums. At a guess, maybe this sort of thing:

1. captures substantially more of the structure of the trait, and so has in its hypothesis space predictors that are significantly better than any possible linear PGS;
2. is still pretty low complexity / doesn't have vanishing gradients, or something, such that you can realistically learn it given the fairly small datasets we have.
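As a rough sketch of one of these small circuits (my own toy, fit with PyTorch; all sizes and scales are arbitrary): a "sum of ten lognormals" is y = Σ_k exp(w_k·x + b_k), which has only about ten times the parameters of a linear PGS yet can express multiplicative structure that a purely additive model cannot.

```python
# Toy "sum of ten lognormals" circuit on simulated genotypes.
import torch

torch.manual_seed(0)
n_people, n_variants, n_groups = 500, 200, 10

# Genotypes coded 0/1/2; toy phenotype generated by the same kind of circuit.
x = torch.randint(0, 3, (n_people, n_variants)).float()
true_w = 0.01 * torch.randn(n_variants, n_groups)
y = torch.exp(x @ true_w).sum(dim=1) + 0.1 * torch.randn(n_people)

# Learnable version of the circuit, trained by gradient descent.
w = torch.zeros(n_variants, n_groups, requires_grad=True)
b = torch.zeros(n_groups, requires_grad=True)
opt = torch.optim.Adam([w, b], lr=0.01)

for step in range(2000):
    opt.zero_grad()
    pred = torch.exp(x @ w + b).sum(dim=1)
    loss = torch.mean((pred - y) ** 2)
    loss.backward()
    opt.step()

print("final training MSE:", round(loss.item(), 4))
```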

I think that prompting is definitely important. I've found that GPT as it is now can mimic any given author's style with great accuracy as long as it's given that author's text inside the prompt. For example, "write a short story in the style of Nabokov" gives you a bland short story, while prompting with his verbatim text produces a pretty faithful continuation.
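A minimal sketch of that contrast (assuming the `openai` Python SDK with an API key in the environment; the model name and the excerpt are placeholders, not anything from the original comment):

```python
from openai import OpenAI

client = OpenAI()

style_instruction = "Write a short story in the style of Nabokov."

author_excerpt = "..."  # paste a verbatim passage by the author here
continuation_prompt = (
    "Continue the following passage, keeping the author's voice:\n\n"
    + author_excerpt
)

for name, prompt in [("instruction only", style_instruction),
                     ("verbatim excerpt", continuation_prompt)]:
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {name} ---")
    print(reply.choices[0].message.content[:300])
```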