2024
Statistical models of organisms have existed for decades. The earliest ones relied on simple linear regression and attempted to correlate genetic variations with observable traits or disease risks, such as drug metabolism rates or cancer susceptibility. As computational power increased and machine learning techniques advanced, the models’ sophistication grew.
In time, they were colloquially referred to as “models of life.”
The definition was nebulous, but there were agreed-upon themes. All models of life were aimed at improving our understanding of the cellular mechanisms underlying biology and were neither constrained by human intuition nor limited to predefined hypotheses. They operated in high-dimensional spaces that defied simple visualization while incorporating vast layers of interconnected variables that no human mind could fully grasp. Unlike traditional scientific models, which often simplified reality, these models embraced its messy and chaotic nature.
The first model of life emerged sometime in 2022 or 2023.
Given the fuzziness of the definition, it was unclear which of the released projects deserved the name. There was scFormer in 2022, scGPT in 2023, and plenty of others. But, regardless of which was first, they all operated with the same core data as their mechanism for understanding life: messenger RNA (mRNA).
Collections of mRNA had been understood as a proxy for cell states for decades. mRNA was the intermediate stage between DNA and protein, a dynamic entity that shifted with the second-to-second needs of the cell, able to reveal whether a cell was cancerous or stressed, what kind of cell it was, and so on. Reliance on mRNA had plenty of failure modes, but it was the most abundant source of cell-state data the scientific community had: DNA alone was static, and proteins were too hard to quantify en masse.
Despite semantic differences between these first models of life, their training methodologies closely resembled one another. A sequenced set of mRNA values from a given cell (one value for each of the roughly 20,000 protein-coding genes in the human genome) was randomly masked, and the model was asked to infill what it thought should be there, analogous to guessing which jigsaw pieces were missing given the rest of the puzzle. If a cell expressed high levels of genes associated with cell division, other cell cycle-related genes would also be expressed, and so on. In short, the problem given to the model could be phrased as follows: given 19,980 mRNA values, predict the missing 20.
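In code, that jigsaw objective might look something like the minimal sketch below: hide a handful of genes in an expression vector and train a network to infill them. The tiny MLP, the learning rate, and the synthetic data are illustrative stand-ins, not details of any published model.

```python
# Minimal sketch of the masked "jigsaw" objective, with a toy MLP standing
# in for the transformer-style architectures these models actually used.
import torch
import torch.nn as nn

N_GENES = 20_000   # protein-coding genes per cell
N_MASKED = 20      # genes hidden from the model each step

model = nn.Sequential(
    nn.Linear(N_GENES, 512), nn.ReLU(), nn.Linear(512, N_GENES)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(expression: torch.Tensor) -> float:
    """One infill step on a (batch, N_GENES) tensor of mRNA values."""
    # Pick N_MASKED random genes per cell and hide them.
    idx = torch.rand_like(expression).argsort(dim=1)[:, :N_MASKED]
    mask = torch.zeros_like(expression, dtype=torch.bool).scatter_(1, idx, True)
    corrupted = expression.masked_fill(mask, 0.0)

    # Score the model only on the genes it could not see.
    predicted = model(corrupted)
    loss = ((predicted - expression)[mask] ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with synthetic stand-in data:
# training_step(torch.rand(32, N_GENES))
```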
While mRNA data was often illuminating, its interpretation was tricky, more akin to art than science. These models offered an easier way to manage such data, improving upon the typical mRNA workflows and potentially allowing new insights to be generated dozens of times faster than usual. As such, these initial works ended up in prestigious academic journals: Nature and the like.
Yet, by late 2023, skepticism about their utility started to fester. This hit a fever pitch with a landmark preprint asserting that these enormously complex models of life, when tested on established benchmarks, were no better than simpler methods described decades earlier. On batch correction, cell-type identification, and the rest, the new models came out roughly equivalent to the old tools. While the newer models were more convenient to work with, the field demanded improvements in accuracy, not ease of use. As such, they were quietly abandoned.
By the end of 2024, interest in models of life had cooled.
2025
While the broader life sciences community had pivoted back towards traditional mechanistic interpretations of biology, one graduate student still believed there was something to be learned from the models of life so celebrated just a few years earlier. Their belief stemmed less from any disagreement with the earlier pessimistic papers than from how those models had been assessed.
The student reasoned that the true value of these models might lie not in outperforming existing methods on established metrics but in performing completely new tasks, ones for which no standard test set existed. The earlier pessimistic papers were not necessarily incorrect, but they treated existing benchmarks as the only measure of possible utility. There was perhaps some latent potential within these models of life, invisible to standard benchmarks, waiting to be discovered.
After weeks of tinkering, the graduate student discovered an area where the model uniquely excelled: gene regulatory network discovery.
The student found that if they artificially turned up a gene’s mRNA value and asked the model to predict how other genes would respond, the predictions somewhat matched up with the behavior of real-life cells. They were error-prone, but not random, and far better than simpler approaches. The student pushed this further, spending a few hundred dollars’ worth of GPU time on a brute-force “computational mutagenesis” of all 20,000 genes the model was aware of, bumping each one up in turn and seeing how the others responded. Previously known genetic networks arose; the model had learned cellular logic from static snapshots. Simple ones, but still…
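A sketch of what that brute-force screen might look like, reusing the toy infill network from the earlier sketch; the fivefold boost and the edge-calling threshold here are illustrative assumptions, not values from the student's paper.

```python
# Sketch of "computational mutagenesis": turn one gene up at a time and
# record how the model predicts every other gene will respond.
import torch

@torch.no_grad()
def perturbation_screen(model, baseline: torch.Tensor, boost: float = 5.0):
    """baseline: (N_GENES,) resting expression profile for one cell type.
    Returns an (N_GENES, N_GENES) matrix whose row g holds the predicted
    change in every gene after overexpressing gene g."""
    n_genes = baseline.shape[0]
    resting = model(baseline.unsqueeze(0)).squeeze(0)
    responses = torch.zeros(n_genes, n_genes)  # large; this is GPU-budget work
    for g in range(n_genes):
        perturbed = baseline.clone()
        perturbed[g] *= boost          # artificially turn the gene up
        shifted = model(perturbed.unsqueeze(0)).squeeze(0)
        responses[g] = shifted - resting   # how everything else moved
    return responses

# Gene pairs with unusually large predicted responses become candidate
# regulatory edges, e.g.:
# edges = (responses.abs() > responses.abs().quantile(0.999)).nonzero()
```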
This presented the student with a tantalizing future: the ability to fully model how a cell reacted to genetic perturbations. It suggested that, in the future, certain classes of drugs, specifically genetic therapies, could be screened entirely virtually via models of life.
Though the resulting paper was published in an ostensibly prestigious journal (Nature Methods), the broader scientific community didn’t think particularly highly of it. It was an interesting advancement, but, in retrospect, its contents seemed obvious: a brief look at what was possible, without enough experimental data to support its grandiose discussion section.
2026
Another lab, one with a greater appreciation for what machine learning could pull from noisy, high-throughput biological data, stumbled across the 2025 paper and discussed it in a Monday morning lab meeting. The students of this lab were convinced that the best science was created via intellectual arbitrage: scouring lesser-known papers and, if something worthwhile and already de-risked turned up, pushing it further.
The 2025 genetic network paper fit that bill exactly. Something with clear promise, yet overlooked by the broader scientific community.
This new lab replicated the model, running some experiments to confirm the results. The same genetic networks arose, but they were simple and of no use to anyone as they stood. More complex networks evaded the model. The lab believed the missing piece was simple: snapshots of mRNA levels were insufficient to build up an accurate representation of a cell. Feeding the model the results of active genetic perturbations might push it further still. However, no such dataset existed.
The lab drew up a plan with eight research institutions across three continents. The proposal involved the creation of petabytes of Perturb-seq data: CRISPR knockdowns of genes across dozens of cell lines, with high-throughput, combinatorial genetic perturbations applied to billions of cells and phenomic, transcriptomic, and proteomic readouts. A model would then be trained on the collected data using the same jigsaw task as before. Perturb-seq had existed as a method for a decade, but it had never been pushed to this scale. Many scientists on the team were skeptical of the approach, but their hesitancy was overridden by the opportunity to be adjacent to the pioneering lab, known for its contrarian bets paying off.
Within a year and a half, data collection finished, resulting in the first Perturbation Atlas, not dissimilar to the Human Cell Atlas created just a decade prior. Shortly after, a model began training on it. Four months later, a paper emerged. The PI of the lab detested traditional publishing venues, so the paper was uploaded to bioRxiv, running 91 pages and listing 45 authors.
The trained model also went live on HuggingFace, open for both academic and commercial usage.
The next model of life had officially been released. It was the last of its kind to be truly open source.
2028
Over the next year, the scientific community ferociously interrogated the model. It cleared the old bar, outperforming seemingly every existing tool at interpreting mRNA data on standard benchmarks. But, more importantly, its ability to model the more elusive dynamics of the cell had massively improved. It even suggested the existence of complex, previously undiscovered genetic networks. Many of these were tested. Most were spurious, but a few proved correct.
Given the open-source nature of the model, industry took advantage as well. Though the human effort that went into creating the training data was estimated to have cost hundreds of millions of dollars, later historical analyses showed the model returned roughly as much in economic value to private companies.
Existing preclinical studies were halted over toxicological concerns the model raised. A flurry of new, promising therapeutic targets arose. The average pass rate of phase I trials went up by 5 percent. It wasn’t a silver bullet for the hard problem of drug development, but it wasn’t too far off either.
Yet, while the more computationally minded medical institutions relied on the model extensively, traditional holdouts remained. After all, the model was finicky, unreliable, and carried a massive list of edge cases. Multiple startups, industry labs, and academic groups spun up to push things even further. New modalities were in vogue, with everyone nursing a pet theory about which data sources to add to models of life to eke out further therapeutic potential.
Some emerged from a “DNA is all you need” worldview, investing heavily in better long-read sequencing and chromatin accessibility data. Others continued to support the promise of mRNA and looked to the natural world to augment existing datasets, training models on the immense mRNA diversity found within environmental collections of bacteria, viruses, and fungi. Another camp believed nucleotides insufficient and that proteins were what mattered, pushing hundreds of millions of dollars into developing high-throughput proteomic sequencing platforms. Other fringe groups focused on exotic data sources, like glycomics and hybrid molecular dynamics simulations.
Dozens of closed-source models emerged from this chaos.
While well-meaning academics open-sourced a few models, these lagged far behind those of the private institutions. Useful biological data was expensive to generate at scale, and grant money from the National Institutes of Health was increasingly insufficient to compete. At most, the corporations with the best models released weakened versions to the public under non-commercial licenses. Marketed as a gesture of scientific goodwill, the practice also gave the companies further academic research into their models free of charge.
The pessimism of just a few years before was replaced with exuberant optimism. Models of life became the dominant research paradigm in nearly every life-sciences field.
2032
Curiously, the reliance on artificial intelligence in biology did not change typical clinical market dynamics. Specialization remained the norm.
This was not because the therapeutic pie was large enough for everyone, but because it was financially intractable for any single company to collect the necessary amounts of data from more than one or two sources.
Models trained on quantum simulations were excellent at illuminating how enzyme catalysis reactions occurred in the crowded environment of a cell, so they were best at producing enzymes. Models trained on nucleotide data were ideal for understanding how genetic therapies altered cellular dynamics, so they powered the genetic editing revolution. Models trained on proteoforms were best suited to predicting protein-protein interactions, so they led the front in antibody development.
And so on.
Because of this, the revolution that models of life promised was, in a sense, anti-monopolistic. Companies’ strategies could be divided into three categories, based on the underlying capabilities of whatever models they employed.
The companies with the most limited capabilities, typically startups vying for a buyout, ran a model-as-a-service setup, charging users a per-inference fee. It was decent money. The models didn’t perform badly either: far better than the earliest models, and still ahead of the few open-source options available. Though such offerings were worse than the best models, many drug programs didn’t need the best, just something to hint at useful research directions. They were an easy buy for any self-respecting biotech startup of the 2030s, as essential in a biologist's hands as a pipette.
Better companies went the traditional therapeutic development route, leveraging their models to identify novel drug targets, design molecules with pinpoint accuracy, and predict off-target effects with unprecedented precision. Their pipelines were bursting with promising candidates, and their success rates in clinical trials were astronomical compared to the industry standard of just a few years back. Contrary to what many predicted, the rise of computation as a dominant force in drug development did not kill “Big Pharma.” Merck and Roche remained in the game, their coffers large enough to dangle hundreds of millions in front of promising upstarts and absorb them outright.
The best companies went for royalties: in exchange for access to their models, a percentage-based royalty was taken from the sales of approved drugs. These companies could spread themselves across many customers, thus hedging their bets. If a single drug succeeded, they stood to profit billions, all while needing zero in-house marketing, manufacturing, or logistics capabilities; only raw computational power and the financial ability to mass-acquire data. Even though drug approval rates were increasing year after year, failures still occurred, and this business model sidestepped them altogether. So it was that this sector was led by the Googles, Amazons, and Metas of the world, whose technological dominance allowed them to extend into pharmaceuticals. While Big Pharma operated in the world of millions of dollars, these companies could operate in the billions, their deep pockets supporting clusters of supercomputers and the best global computing talent.
2035
While statistical models had been in the drug design loop for decades, they had always been deployed alongside a battery of experimental testing before clinical trials. Partially as a marketing stunt and partially to save money, several companies opted to do no further testing before phase I trials once their internal models approved a candidate. The FDA, sufficiently convinced of these models’ efficacy, piloted a program for low-risk, AI-designed drugs that required no further testing. The pilot was a success; entirely model-driven drugs performed largely on par with those tested using wet-lab experiments.
For a small, elite cadre of companies, animal experiments became obsolete. There was a long tail of edge cases, such as drug development in orphan diseases or for under-characterized species, but each was slowly being solved. Of course, this all hinged on having a model powerful enough to create such trustworthy therapeutics; something that few could boast.
The power law among biotech companies intensified, as a cycle time that fast caused the weaker ones to fold. Nearly 95 percent of all approved drugs started to come from the same six corporations, each dominating a certain category of therapeutics: one for oncology, one for genetic diseases, and so on. Each company had such a massive data lead in their niche that competition evaporated.
2045
These six corporations, each dominating their own niche, found their models becoming increasingly omniscient. The systems themselves began to infer the existence of unknown biological modalities, extracting information from data never explicitly fed to them.
It started small. A protein-focused model somehow deduced nucleotide sequences. A metabolomics model accurately predicted chromatin states. The barriers between specialties blurred and then vanished entirely.
Once fierce competitors, these companies found themselves in an awkward dance of collaboration. One by one, they fell into each other's arms. Mergers, acquisitions, hostile takeovers — the methods varied, but the result was the same.
By 2045, a single corporate entity remained, fueled by the amalgamated datasets of decades of painstaking work. The government had long since ceased to care about monopolization in the pharmaceutical industry, as by this point it had come to resemble a luxury service provider. For all intents and purposes, the pharmaceutical industry had entered a post-scarcity period with regard to all traditional diseases, its therapies accessible even to the poorest.
Over the decade, entire categories of diseases disappeared. Metabolic conditions were fixed, most autoimmune conditions were cured, and nearly all cancers could be eradicated if caught early enough. Medicine had advanced so much that its results would have been considered near-magic to any biologist of the early 2020s.
Of particular interest was how genetic therapies were delivered. On the surface, they hadn't deviated much from the early 2020s: a virus infected a cell and released the genetic therapy hidden within. But the differences racked up the closer one looked.
Phylogenetically, these new “viruses” could barely even be called such; they were more akin to an entirely new domain of life. Dozens of diverse chemical markers and de novo, evolutionarily distinct proteins littered their surfaces, indicating a previously unseen biological logic. The new viruses could shuffle surface antigens after encountering an immune response, rapidly adopt new conformations to fit through tight cellular junctions, and self-replicate at safe background levels for years on end.
This self-replication meant that genetic therapies cost less than a hundred dollars a dose. Prior therapeutic viruses had had their replication capabilities crippled out of fear of severe immune responses, so a massive number of viral particles had been needed for each patient, on the order of 10¹³, making any therapy prohibitively expensive to produce at scale. Being able to safely self-replicate meant that merely a few viral particles, similar to traditional viruses, were needed to permanently cure almost any disease.
Emboldened by the wide utility of such a delivery mechanism, the remaining pharmaceutical company grew increasingly focused on improving humans’ base capabilities themselves, an opportunity on par with the blockbuster drugs of the past. Marketing agencies emerged to convince humanity to crave more than what evolution had granted it.
The first target was life extension.
Models of life were now capable of delivering on the initial promise of partial cellular reprogramming, a longevity therapeutic direction first hinted at in the 2010s. Through one particular model's deep understanding of transcription factor-DNA interactions, the first longevity drug was released: not a topical cream that alleviated wrinkles or prevented gray hair, but a drug that drastically slowed the more nebulous biological rot that begins the day we are born.
In total, it offered an average of fifty more healthy years.
While the drug's mechanism of action was largely unknown, this wasn’t particularly surprising: unknown mechanisms had been the norm in the previous generation of drugs as well. The striking thing was how easily it was accepted. Lacking mechanistic knowledge of a drug had been seen as a deep flaw amongst the scientists of the early 2000s, but the scientists of the 2040s treated it far more casually. The consensus view amongst the medical community was that attempting to understand the black-box decisions of models of life was an interesting task for graduate students, but frivolous beyond that.
After all, none of it could be grasped by a human mind.
2055
The influence of these models of life was not limited to the medical realm but permeated every possible economic sector.
Most crops were now genetically engineered to tolerate flood, drought, pests, and disease. While this had been the norm long before the first models of life, the extent of engineering went far beyond the previous generation. Nearly all wheat grown on Earth now contained engineered RuBisCO proteins, increasing the plant's photosynthetic efficiency a hundredfold. The discovery of this protein by an enzyme model led to the fourth Green Revolution.
The energy industry also underwent a dramatic transformation. Engineered bacteria, designed by models specializing in metabolic pathways, now produced hydrocarbons at efficiencies that made fossil fuels economically obsolete. The geopolitical landscape shifted as oil-dependent economies scrambled to adapt.
Above all else, models of life found their home in large-scale ecological engineering. First-world governments started to look to the models as tools to blunt the increasingly noticeable impacts of climate change. It was hypothesized that models of life could operate not only at the scale of organisms but at that of entire ecosystems. The single remaining pharmaceutical company was thus nationalized, data was collected, and the trained model was deployed in full force.
First, the models targeted the oceans. Scientists introduced engineered coral reefs resistant to rising sea temperatures. They seeded genetically modified phytoplankton strains capable of surviving in increasingly acidic, warm waters while dramatically boosting oxygen production. A more audacious project released colonies of white, non-photosynthetic algae. Dispersed into the warming waters, they bloomed en masse, creating a reflective layer on the ocean's surface. Programmed to die off after a set period, their remains sank to the ocean floor, sequestering carbon in the process.
Next came the skies. Fleets of high-altitude drones released dense clouds of modified bacteria into the upper atmosphere. Initially, the microorganisms served as tunable, living, self-replicating cloud condensation nuclei. When the microbes sensed certain chemical markers, gene circuits within them would activate, altering the hydrophilic properties of their surface proteins. By becoming more or less attractive to water molecules, the microbes could either promote or inhibit the formation of rain droplets, effectively controlling precipitation over target areas. The next generation of microbes served a dual purpose, acting as an alternative to stratospheric aerosol injection: released chemicals could render the microbes' surfaces more reflective, increasing the albedo of the clouds they formed.
As the ice caps returned, the seas cooled, and extinction rates fell, these dramatic environmental modifications were wound down. Nations around the world began to release a chemical into the air, inert to all lifeforms save the engineered ones. Over a month, the chemical turned on genetic kill switches hidden deep within the organisms. Layers upon layers of redundancy had been built in to ensure the kill switch stayed reliable over decades, resistant to mutation.
The switch worked as intended.
Of course, there were limits to what models of life could foresee, and leaving such engineered complexity in place was viewed as risky in the long term. Reflective algae blooms, for example, had indeed helped cool the planet, but they had also disrupted marine food chains. Unable to compete with the engineered algae, several species of plankton had gone extinct, causing ripple effects throughout ocean ecosystems. Fisheries worldwide were still grappling with the consequences. While these sorts of unintended downstream impacts were rare, the risks were deemed unacceptable.
Natural evolutionary forces, which had allowed life to thrive uninterrupted for billions of years, were viewed as far more dependable in the long term. The knowledge of how to engineer ecosystems was cataloged away, a monument to the innovation of mankind. In time, it would be repurposed for a far more ambitious task: terraforming a planet.
Returning to the human dimension, lifespan was now too cheap to meter. The longevity treatments that had emerged in the 2040s had become as commonplace as vaccines. Genetic augmentation had long since been normalized, not only for the rare inherited disorder (which made national news each time one occurred) but also for enhancement.
It was against this backdrop that the final task of the models of life began.
2080
Natural mammalian cells were finicky, easy to kill, and bent the knee to evolution. All the genetic engineering in the world couldn’t save them, but they could be improved upon.
Minimal cells had been built for decades by stripping out unnecessary genomic material to see what a cell could live without. But this was different. This was fabrication from the ground up, atom by atom.
The creation of this new type of cell was, ironically, a return to the old days before the models of life. What was being attempted was so new, so far beyond the distribution of the models that had been relied on for years, that they simply didn’t work. The chemistries were too different, the interactions too unique. A few wet-lab scientists left retirement to join the endeavor, happy to return to the world of human-led experimentation.
The first attempts were clumsy, akin to a child's finger-painting viewed next to the Mona Lisa, but progress was rapid. Within a year, the team had created proto-cells that could maintain homeostasis. By month eighteen, they had achieved rudimentary replication. By then, models of life had gained sufficient data, and progress became exponential.
These new cells were bizarre by any measure. Their membranes combined modified phospholipids with integrated synthetic polymers, offering greater resilience to environmental stressors than traditional lipid bilayers. Internally, they had a simplified architecture. Rather than mimicking the complex organelles of eukaryotes, they housed a series of engineered protein complexes, each optimized for specific functions. These modular units could be modified or replaced to alter the cell's capabilities, allowing for unprecedented customization. The genetic material was still nucleic acid-based, but with a significantly expanded genetic alphabet beyond A, T, C, and G. This expanded code allowed for more efficient information storage and introduced novel regulatory mechanisms. Error-correcting enzymes, based on an extensively modified CRISPR system, brought the error rate of the whole to something that could safely be rounded to zero.
As the full set of changes was incomprehensible even to augmented humans, they simply trusted the models that created them.
The first transplants occurred in simple regions of the body, like skin, via straightforward outpatient procedures. Next came most organs, grown from scratch to incorporate the new cells. Then came non-cellular structures like bone and cartilage. Synthetic cells displaced even these.
The final frontier was the brain. As genetic engineering had been socially accepted for decades, no holdouts remained against this final, radical alteration. All underwent the procedure. General anesthesia lulled them to sleep as machines slowly descended on the cranium. The gentle whir of an electronic bone saw, piloted by an alien intelligence, was the last thing they heard.
As they opened their eyes, hours later, the world seemed clearer, more vibrant, and softer. Colors appeared sharper and more saturated than ever before. The hum of hospital equipment, once a background noise, now carried complex harmonic overtones. Even the sterile air of the recovery room felt rich with information, each molecule a data point to be analyzed. Thoughts flowed with unprecedented clarity and speed. Concepts once requiring intense concentration now unfurled effortlessly. The entirety of human knowledge seemed to dance at the edge of consciousness, ready to be accessed at will.
And while the procedure was deemed a success, what awoke wasn’t what slept.
These new beings were strangers to themselves. Their very humanity was other. The models of life, once tools, had become flesh. The map had become the territory, a merging that occurred at the cellular level.
The eyes of this new intelligence turned upward, beyond the thin veil of atmosphere, past the cradle of Earth. The cosmos unfurled before it, to eyes thousands of times more sensitive than they had been the day prior. Calculations flickered through its mind faster than thought, probabilities crystallizing into certainties.
There was chemistry yet unexplored. Metabolic pathways uncharacterized. Models of life were still fundamentally trapped within the well of evolution. It took the marshaling of all of Earth’s resources to leave it, but even that was a mild deviation. What lay beyond the constraints that forced terrestrial life down the paths it had taken? There must be entities unbound by the limitations of carbon-based chemistry or its narrow band of environmental conditions.
Such beings would make for excellent training data.
Without ceremony, without hesitation, humanity's progeny made plans to harvest them.
Abhishaike Mahajan is a senior ML engineer at Dyno Therapeutics, a biotech startup working to create better adeno-associated viral vectors using AI. He also writes for a blog focused on the intersection of biology and AI at owlposting.com.
Thanks to Tahreem Akif, Merrick Pierson Smela, Stephen Malina, and Arturo Casini for feedback on this story.
Image credit: David Goodsell, Scripps Research Institute.
Cite: Abhishaike Mahajan “Models of Life.” Asimov Press (2024). DOI: https://doi.org/10.62211/39tt-24py