I'm starting to automate my workflow with AI, but don't know where to even start with correcting for multiplicity.
Right now, my workflow is like this:
Think really hard about interesting ideas and do a surface-level examination of data -> generate hypothesis -> test hypotheses against data with a specific model -> correct for multiplicity
So for example, my hypothesis is that DNA methylation plays a major role in the progression of MSI-H colorectal cancer (it came to me in a dream). I use the default statistical model in whatever R package I'm using, manually go through the differentially expressed genes to find biologically plausible ones (and examine their individual impacts on various outcomes), use the best ones to build a model predicting a set of outcomes like survival and metastasis, and correct for multiplicity using the number of genes times the number of outcomes examined in the model as the family size.
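For concreteness, here is a minimal sketch of that final correction step, assuming a Benjamini–Hochberg FDR adjustment where the family size is genes × outcomes. Every gene name and p-value below is a hypothetical placeholder, not a real result:

```python
# Sketch of the multiplicity-correction step: Benjamini-Hochberg FDR
# adjustment over one p-value per (gene, outcome) pair. All names and
# numbers are hypothetical placeholders.

def bh_adjust(pvals):
    """Return BH-adjusted p-values, in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

genes = ["MLH1", "MGMT", "CDKN2A"]      # hypothetical candidate genes
outcomes = ["survival", "metastasis"]    # outcomes examined in the model
# one raw p-value per (gene, outcome) pair -> family size 3 * 2 = 6
pvals = [0.001, 0.20, 0.03, 0.004, 0.45, 0.01]

adj = bh_adjust(pvals)
significant = [p <= 0.05 for p in adj]
```

(In R the same step would just be `p.adjust(pvals, method = "BH")`; the point is only that the family size is fixed in advance by the genes and outcomes you chose to test.)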
But I want to change that to this:
Chuck my data into an AI -> it filters out the biologically implausible genes and selects the most under-researched/biologically interesting/outcome-predicting hypotheses -> it identifies the most relevant genes by iterating through various models -> correct for multiplicity
So I give the AI a .csv of the expression/clinical/mutation data for a set of CRC patients, it filters out genes unlikely to be biologically important, finds the most impactful ways to split and compare the data (say, methylation of certain oncogenic pathways), forms a hypothesis (high levels of EGFR pathway methylation increase survival), uses it to build a model, and examines the most important genes in this model (EGFR).
But what factor do I use to correct for multiplicity? The AI is iterating through a huge number of, well, everything. I'm not sure where to begin. My gut feeling is that no one really does this correction, but I would still like to know how to do it in theory.
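One way to see the scale of the problem: if every (gene, data split, model, outcome) combination the AI tries is implicitly a test, a naive Bonferroni family is the product of all of them. The counts below are hypothetical placeholders, just to show how fast the correction factor explodes:

```python
# Hedged illustration of the implicit family size once the search
# itself is automated. All counts are hypothetical placeholders.
n_genes_screened = 20_000   # genes the AI could filter/examine
n_splits_tried = 50         # ways of splitting/comparing the data
n_models_tried = 10         # model families iterated through
n_outcomes = 2              # e.g. survival and metastasis

implicit_tests = (n_genes_screened * n_splits_tried
                  * n_models_tried * n_outcomes)
bonferroni_threshold = 0.05 / implicit_tests
print(f"implicit tests: {implicit_tests:,}")
print(f"naive per-test threshold: {bonferroni_threshold:.2e}")
```

With these made-up counts, the per-test threshold is already around 10^-9, which is why counting every combination the search touches quickly stops being a workable correction factor.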