Disclaimer: The views expressed in this document are my own, and do not necessarily reflect those of my past or present employers.
In a recent interview at the Commonwealth Club of California, Stuart Russell compared training GPT-4 to training a dog with negative reinforcement. Although there are obvious (and not-so-obvious) limitations to this analogy, conceptualizing GPT-4 as a partially domesticated, alien canine with a knack for Python code seems substantially more useful to me than calling it "[a mere program] run on the well-worn digital logic of pattern-matching" (which is how Cal Newport recently characterized the mind of ChatGPT, despite the sparks of AGI in GPT-4).[1] In any case, Russell's comparison prompted me to consider more deeply the relationships between intelligent species that have already arisen in nature. Assuming there is a degree of validity in treating agentic AIs as animals of indeterminate intelligence and intention, are there any already-existing evolutionary strategies we might adapt to better equip ourselves to handle them? Furthermore, are there other biological mechanisms of particular relevance for understanding AI cognition and safety? In Part 1 of this post, I discuss the phenomena of symbiotic mutualism and domestication. In Part 2, I explore a broad variety of predator/prey survival strategies, with the aim of generating a repository of ideas that may be amenable to context-appropriate engineering solutions. In Part 3, I examine ways in which evolution has solved three major coordination problems as barriers to increasing complexity. In Part 4, I propose a thought experiment about distinct forms of biological superintelligence to illustrate ways in which entropy is connected to cognitive architecture. Finally, I conclude by considering the circumstances in which treating AI like an animal intelligence may prove to be a useful cognitive shortcut. Few, if any, of the ideas presented here will be entirely novel, but it is my hope that by condensing a particular view of AI safety, readers will have the opportunity to re-frame their theoretical understanding in a slightly different context.[2]
Part 1 — Mutualism and domestication
When humans first began domesticating wolves in Siberia around 23,000 years ago, one can imagine the sort of tension that must have predominated. Pleistocene humans and wolves shared many commonalities—both were Boreoeutheria that hunted big game for food, needed to keep warm during harsh winters, were especially social, and formed complex intragroup hierarchies. Nevertheless, both species clearly presented a danger to each other, and competition over scarce resources must have complicated initial attempts at domestication. It is tempting to speculate about the traits that were most influential in driving early mutualism between these species, but close genetic similarity was probably not a decisive factor. It has been observed that common ravens (Corvus corax) also have a close mutualistic relationship with wolves, going so far as to tease and play with one another, despite bridging a significantly greater genetic distance. Thus, the property of alienness alone doesn't appear to preclude relationships of mutual trust and advantage between disparate forms of animal intelligence. More formally, we might define mutualism between Species X and Species Y in the following way:
Species X uses some behavior of Species Y to derive benefit for Species X AND
Species Y uses some behavior of Species X to derive benefit for Species Y.
Such an arrangement could plausibly be considered an evolutionarily stable strategy, provided that the following conditions are met: (1) that the behaviors of each species are not detrimental to their own inclusive genetic fitness, and (2) that the benefits accorded to each species are not mutually exclusive. In cases where both species are significantly intelligent, one could easily imagine a situation of mutual empathy, wherein each species explicitly recognizes the fact that its partner species has distinct goals. In principle, the property of alienness might even be considered conducive to mutualism, insofar as more alien species are more likely to possess domain-specific competencies that their partner species lack, and are more likely to have distinct goals, potentially limiting zero-sum competition for resources. More generally, the phenomenon of mutualism seems to find a close analogy in economics, such that mutualism is to whole species as economic transactions are to individuals.
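To make the definition and the two stability conditions above slightly more concrete, here is a minimal Python sketch that treats each species' side of the relationship as a benefit/cost pair. The class name, fields, and example numbers are hypothetical illustrations of my own, not anything drawn from the biology literature, and condition (2) (non-exclusive benefits) is simply assumed rather than modeled.

```python
# Toy formalization of the mutualism definition and stability condition (1).
from dataclasses import dataclass

@dataclass
class SpeciesAccount:
    benefit_from_partner: float   # fitness gained via the partner's behavior
    cost_of_own_behavior: float   # fitness spent performing the helping behavior

def is_mutualism(x: SpeciesAccount, y: SpeciesAccount) -> bool:
    # Both parties must derive some benefit from the other's behavior.
    return x.benefit_from_partner > 0 and y.benefit_from_partner > 0

def is_plausibly_stable(x: SpeciesAccount, y: SpeciesAccount) -> bool:
    # Condition (1): neither behavior may be a net drain on its own inclusive
    # fitness. Condition (2), non-exclusive benefits, is assumed to hold.
    return (is_mutualism(x, y)
            and x.benefit_from_partner >= x.cost_of_own_behavior
            and y.benefit_from_partner >= y.cost_of_own_behavior)

# Example with made-up numbers for wolves and ravens.
wolf = SpeciesAccount(benefit_from_partner=0.2, cost_of_own_behavior=0.05)
raven = SpeciesAccount(benefit_from_partner=0.4, cost_of_own_behavior=0.1)
print(is_mutualism(wolf, raven), is_plausibly_stable(wolf, raven))  # True True
```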
Domestication, on the other hand, represents a special case of mutualism with an asymmetrical power dynamic. Domestication of other species was traditionally viewed as an exclusively human behavior, but identification of similar behavior among other species resulted in the need for a more broadly-encompassing biological definition.[3] In 2022, Michael D. Purugganan proposed the following:
[domestication] is a co-evolutionary process that arises from a mutualism, in which one species (the domesticator) constructs an environment where it actively manages both the survival and reproduction of another species (the domesticate) in order to provide the former with resources and/or services
This definition illustrates that domestication imposes up-front costs on the domesticator. Active management of another species' survival requires an investment of time and resources. In particular, early stages of domestication require periods of experimentation to assess the needs of (and control the dangers posed by) the domesticate. In exchange for this period of delayed gratification, the domesticator gets to freely exploit the domesticate at a time of the domesticator's choosing. Gradually, the domesticator alters both the behavior and the genetics of the domesticate to better accommodate the domesticator's needs. Although the barriers to begin the process of domestication are relatively steep, the rewards are plentiful. Another observation is that domesticators may become entirely reliant on domesticates for their own survival. Humans are so enthusiastic about domestication (and extinction of wildlife) that fully 62% of the global mammal biomass now consists of livestock. If indeed we are training AIs to authentically imitate human preferences, we should remain mindful of our proclivity towards keeping livestock, animals that we often treat very poorly.
Are there any circumstances in which domesticates might be satisfied with their status relative to humans? The phenomenon of human pet ownership comes to mind as a particularly exceptional type of domestication, in which humans form peer-like bonds of companionship with animals of significantly lower intelligence. Although humans do not regard pets as having equal moral status, wealthy humans often go to great lengths to pamper their pets, and even seem to suspend aspects of their own intelligence in order to play with pets, like dogs or cats, on a roughly equal footing.
Traditional methods of domestication were entirely based on the phenotype of the domesticate, along the lines of "breed the cattle that produce the most milk." Domestication based on phenotype, without any mechanistic understanding of how the phenotype is produced, however, can lead to unintended consequences. Quite famously, the domesticated silver fox, which was bred under strong selection pressure to promote friendliness to humans, exhibited increased cognitive abilities as an unintended by-product. Another famous example of domestication gone wrong is the Irish Lumper, a cultivar of potato that was widely grown in early 19th century Ireland based on its phenotype of high yield in low-nutrient soils. This cultivar was later reported to have been especially susceptible to blight, ultimately resulting in the Great Famine of 1845-1852. With the advent of modern genetics and genome sequencing in the last several decades, it is now possible to identify specific alleles that contribute to a desired phenotype, and to directly engineer organisms by means of genome-editing technology. In other words, contemporary methods are increasingly based on the genotype of the domesticate. In some sense, Reinforcement Learning from Human Feedback (RLHF) can be considered a cursory attempt by humans to selectively breed AI using a phenotype-based approach, whereas efforts to build interpretable AI find a closer analogy in genetic engineering. As many other commentators have noted, the particular danger of phenotype-based ("outer alignment") optimization approaches is that unintended traits would be expected to be amplified as byproducts of an imprecise and/or poorly understood optimization regime.
Currently existing AIs, like ChatGPT, seem to fall below the standard at which we would consider them to be mutualists with humans, due to their limited agency. I note that this assessment is mostly rooted in intuition, as there is no clearly defined threshold for mutualism. Even "dumb algorithms," like the YouTube video recommendation algorithm, have initiated response feedback loops with humans in a way that resembles complex interdependence. The ongoing misalignment problem in recommender systems is a direct result of misaligned goals: whereas recommender systems are designed to maximize user screen time, humans want to maximize the utility of their free time. Human preferences affect the content that recommender algorithms provide; simultaneously, algorithm-selected content affects human preferences. Over time, recommender algorithms necessarily find ways to provide content that manipulates human preferences towards consuming more content. When YouTube was first launched in 2005, humans spent a very small portion of their time consuming online video content, so concerns about alignment with human goals must have seemed distant and abstract. The last decade has made it abundantly clear, however, that alignment concerns should be taken into consideration prior to product commercialization. Recommender systems catering to individual preferences are powerful and should be designed with human-compatible outcomes in mind. Feedback loops occur at many different levels across all biological systems. Most biological feedback loops are negative feedback loops, where more of X begets less of X, and are critical to maintaining homeostasis. Positive feedback loops, where more of X begets more of X, occur relatively infrequently within multicellular organisms due to their disruptive potential. Within human bodies, positive feedback loops are initiated by the immune system in response to disease, during periods of exponential cell growth, and during pregnancy, but are otherwise notably absent.[4]
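As a toy illustration of the difference between the two kinds of loops, the sketch below iterates a single state variable under a negative-feedback rule and a positive-feedback rule; the gains and setpoint are arbitrary values chosen only to make the contrast visible, not parameters from any real system.

```python
# Minimal sketch: negative feedback restores a setpoint, positive feedback amplifies.

def negative_feedback(x: float, setpoint: float = 1.0, gain: float = 0.3) -> float:
    # More of x begets less of x: the system is pushed back toward the setpoint.
    return x + gain * (setpoint - x)

def positive_feedback(x: float, gain: float = 0.3) -> float:
    # More of x begets more of x: deviations are amplified at each step.
    return x + gain * x

x_neg, x_pos = 2.0, 2.0
for step in range(10):
    x_neg = negative_feedback(x_neg)
    x_pos = positive_feedback(x_pos)

print(f"negative feedback settles near the setpoint: {x_neg:.3f}")
print(f"positive feedback keeps amplifying:          {x_pos:.3f}")
```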
A nascent weak artificial general intelligence (AGI), with superhuman capacities in several, but not all, domains of human intelligence, would presumably have strong incentives to form a mutualistic relationship with humans as an instrumental measure, regardless of its ultimate goals (resulting in extrinsically-motivated mutualism). Depending on pre-existing disparities between competencies, allocation of resources, and predictions about each other's future behavior, it is easy to envision a wide variety of scenarios in which extrinsically-motivated mutualism could emerge, such that both humans and the AGI could benefit from each other's domain-specific competencies during a transitory phase. A sufficiently weak AGI that was either structurally unable or demonstrably unwilling to improve its own intelligence would likely be the only type of unaligned AGI with which humans could interact and still have any reasonable assurance of safety over a prolonged time span.[5] Given strong strategic incentives for both parties to conceal their long-term intentions towards each other, however, it is significantly harder to imagine how mutual trust could be established (supposing that the AGI was not specifically engineered with such a design challenge in mind). Somewhat more plausibly, human civilization and a not-yet-overpoweringly-strong AGI could end up in roles as rival superpowers, who might nevertheless bargain in an effort to gain strategic advantage. In this scenario, the not-yet-overpoweringly-strong AGI would presumably have several major advantages, including its knowledge of human psychology and the inability of humans to effectively solve coordination problems on a global scale. In cases where an AGI has the ability to improve its capacities to gain strategic leverage at a rate outstripping that of human civilization, we should regard extrinsically-motivated mutualism as a metastable condition, which ends as soon as the AGI competitor has attained a decisive strategic advantage.
A strong AGI or superintelligence, i.e. something with superhuman capacities in nearly all domains of human intelligence, would presumably have no incentive to work with humans in a mutualistic capacity, except in cases where its final goal was somehow anthropocentric (resulting in intrinsically-motivated mutualism). In such a case of anthropocentrism, the outcome could likely be interpreted as a form of domestication, whereby the AGI would domesticate humans to derive some (probably rather opaque) benefit to itself. If strong AGI is inevitable, non-invasive domestication of humans by aligned AGI is probably the best long-term outcome. Conversely, domestication by strong, misaligned AGI has the potential to be a fate considerably worse than mere extinction.
Part 2 — Survival as predator/prey
In addition to the possibility of mutualism, there exists a more straightforward possibility that AGI systems will be directly hostile to humans. If that is the case, are there any lessons we can draw from our fellow products of natural selection? It may sound like a peculiar proposition to look for innovations in Earth's wildlife, which has fared poorly over the last 10,000 years as a result of its first, prolonged battle with human-level intelligence. In my view, however, the information we glean from biological evolution represents one of our most important resources in a hypothetical competition with AGI. It is precisely our accumulated knowledge about the real world that gives us a head start, one that we would be foolish to squander. Conceptually speaking, we should regard biological evolution as an ongoing 4-billion-year-old computation optimizing for inclusive genetic fitness. Even if an AGI competitor has the capacity to conduct many billions of clever simulations to inform its approach, the sheer scale of the real world and the number of complex variables inherent in physical systems may yield abstract insights that can inform our survival strategy. Instead of assuming we already know what strategies will be most effective based on purely theoretical considerations, we might additionally consider looking for empirical evidence about which strategic principles have proved effective in the past, then adapting those solutions to suit our contextual needs.
As a starting point, we might ask, what are some of the most widely conserved strategies for predator avoidance? Secondarily, what predator-like strategies can be used to eliminate a dangerous opponent? I quote the following passage from Behavior Under Risk: How Animals Avoid Becoming Dinner at length:
In order to effectively avoid and respond to predation, animals must first identify the presence of a potential predator. The ability to recognize predator cues is essential for the initiation of antipredator behavior. This can be innate, for example, animals can identify predators as a threat even if they have never encountered them before, or learned only after exposure to a predatory threat. Some captive breeding programs expose animals to predatory cues in captivity to teach them to respond to these predators after release in the field, increasing their survival rates. Animals can respond to general cues of the presence of a predatory threat, such as sudden movement or the presence of a looming object, or to species-specific cues, such as scent or appearance, which allows them to distinguish between predatory and non-predatory species.
One particular insight is that it appears advantageous for prey species to initiate antipredator behavior in response to both general and threat-specific cues. In the event that prey is encountering an unknown predator for the first time, it will necessarily lack experiential associations of the danger it is confronting. Therefore, it seems imperative to establish general warning systems of the sort where we can say, "Something is off; even though we can't pinpoint a source of potential harm with any high degree of precision, we can confidently say that we are encountering something outside of our prior experience. Time to hit pause and exercise due caution until the unfamiliar phenomenon can be explained." This may require some out-of-the-box thinking. For example, as AGIs may accomplish many of their goals via manipulation of human agents, it would be entirely sensible to monitor not only AGI behavior, but also surprising changes in human behavior. Large technology companies have surely garnered enough data about human browsing/socialization/meme-spreading habits to infer quite a lot about how information presently circulates through human communication networks. Deviations outside historical parameters should draw our attention, proportional to the extent that new developments defy our expectations. During periods of extremely rapid social change, it may be important to remind ourselves about what normality looked like from the perspective of several months or years ago. Ideally, we would compile hard-to-fudge statistical metrics about our values, epistemic beliefs, habits, and economic activity, so as to be especially conscious of changes occurring within our own psychologies. Raising an alarm via a general warning system may provide precious time to initiate antipredator behavior and subsequently look for threat-specific cues.
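As a minimal sketch of what such a general warning system might look like in code, assuming we already track some scalar behavioral metric over time, the snippet below flags values that fall far outside historical parameters using a plain z-score. The metric name, the numbers, and the threshold are invented for illustration.

```python
# Sketch of a "something is off" detector over a historical baseline.
import statistics

def is_anomalous(history: list[float], latest: float, z_threshold: float = 4.0) -> bool:
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return latest != mean
    # Flag observations far outside the range of prior experience.
    return abs(latest - mean) / stdev > z_threshold

daily_shares = [102, 98, 110, 95, 105, 99, 101, 108, 97, 103]  # hypothetical baseline
print(is_anomalous(daily_shares, 104))  # False: within prior experience
print(is_anomalous(daily_shares, 400))  # True: time to hit pause and investigate
```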
It is well-known that prey animals have eyes on the sides of their skull, whereas predator animals have eyes at the front of the skull. This is because there are competing incentives to have as wide a field of vision as possible and to have overlapping fields of vision for improved depth perception.
These features developed because prey do not know in advance where an attack will be coming from. For prey, initiating antipredator behavior in a timely manner is much more critical than having a wide range of binocular vision. For predators, understanding the precise distance of a potential meal is more important, incentivizing binocular vision even at the cost of an increased blind spot.[6]
To continue quoting from Behavior Under Risk:
Many species rely on the presence of multiple cues to accurately assess the level of threat. These can have additive effects, with animals being more likely to respond if a greater number of cues are detected, as this provides a more reliable indication of a predator's presence and identity.
For example, kangaroo rats (Dipodomys sp.) are nocturnal foragers preyed upon by the sidewinder rattlesnake (Crotalus cerastes), as seen in this video. Kangaroo rats depend upon both vision and hearing as additive cues to detect the presence of predators, but hearing is especially critical during the dark phase of the moon when visibility is low. In cases where kangaroo rats detect the presence of a rattlesnake in advance of an attack, kangaroo rats kick large quantities of sand in the direction of the rattlesnake as they run away, pre-empting any aggressive behavior. Taken as a metaphor, the repeated success of kangaroo rats could be interpreted as an encouraging sign for humanity. By initiating correct antipredator behavior, even at the last possible moment, it is possible to escape, even from the jaws of certain death.
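The additive-cue logic described in the quoted passage can be sketched as a simple weighted sum, where each detected cue contributes evidence toward a decision threshold. The cue names, weights, and threshold below are hypothetical choices of mine, not measurements from the ethology literature.

```python
# Sketch of additive cue integration: more independent cues -> more reliable signal.
CUE_WEIGHTS = {
    "sudden_movement": 1.0,       # general cue
    "looming_object": 1.5,        # general cue
    "known_predator_scent": 2.5,  # species-specific cue
    "known_predator_shape": 2.5,  # species-specific cue
}

def threat_score(detected_cues: set[str]) -> float:
    return sum(CUE_WEIGHTS.get(cue, 0.0) for cue in detected_cues)

def should_initiate_antipredator_behavior(detected_cues: set[str], threshold: float = 2.0) -> bool:
    return threat_score(detected_cues) >= threshold

print(should_initiate_antipredator_behavior({"sudden_movement"}))                          # False
print(should_initiate_antipredator_behavior({"sudden_movement", "known_predator_scent"}))  # True
```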
Another commonly employed defensive strategy is camouflage. At the point where humans are attempting to camouflage themselves from an AGI aggressor, the situation will already be looking pretty grim. Keeping at least some human operational strategies opaque to AGI, however, remains an absolutely essential condition to conduct any sort of conflict. As Sun Tzu states in The Art of War:
All warfare is based on deception. Hence, when we are able to attack, we must seem unable; when using our forces, we must appear inactive; when we are near, we must make the enemy believe we are far away; when far away, we must make him believe we are near.
and
It is said that if you know your enemies and know yourself, you will not be imperiled in a hundred battles; if you do not know your enemies but do know yourself, you will win one and lose one; if you do not know your enemies nor yourself, you will be imperiled in every single battle.
Assuming a hostile AGI has basic strategic competence (indeed, it will likely have read Sun Tzu's The Art of War, and understood it better than most humans), AGI will only act when it believes it has a decisive advantage or as a last-resort defensive measure. The art of war is essentially to allow the enemy to act under such a belief, but to have made preparations beyond the scope of enemy calculations. This implies the need to effectively camouflage our strategy. Hiding any human activity from an intelligent agent with access to the internet, given the panopticon we are rapidly building, is no small feat. Humans serious about AI safety against an AGI with probable internet access will need to innovate effective ways to conceal their own activity (both past and present). This is especially true for any product developers who are seeking to build systems (e.g. physical or digital surveillance measures) that will go undetected by AGI.
A predator strategy of key relevance to AI safety is aggressive mimicry. Aggressive mimicry occurs when a predatory species mimics the phenotype of an innocuous species to prevent detection by prey. If target prey is lured into a false sense of security, the predator can strike at a time of its choosing with a decisive strategic advantage. Supposing that we were a prey species in a world filled with both harmless mutualists and aggressive mimics, what sort of tools would we want to have? Ideally, we would like the ability to sample both the genotype and phenotype of the species around us. If we become accustomed to experiencing harmless species with a set of similar genotypes and phenotypes, we should be suspicious when we encounter a species of an unknown genotype mimicking the harmless phenotype. There may indeed be innocuous reasons for species with distinct genotypes to display a similar phenotype (i.e. due to convergent evolution), but deliberate imitation of a species we have classified as harmless should ring some alarm bells.
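A toy version of this suspicion heuristic, with entirely made-up genotype labels and phenotype features, might look like the following sketch: a harmless-looking phenotype combined with an unfamiliar underlying identity is exactly the pattern that warrants an alarm.

```python
# Sketch of the aggressive-mimicry alarm: harmless appearance + unknown identity.
KNOWN_HARMLESS_GENOTYPES = {"genotype_A", "genotype_B"}
HARMLESS_PHENOTYPES = {"small", "slow", "brightly_colored"}

def mimicry_alert(genotype: str, phenotype: set[str]) -> bool:
    looks_harmless = phenotype <= HARMLESS_PHENOTYPES      # subset of harmless traits
    genotype_familiar = genotype in KNOWN_HARMLESS_GENOTYPES
    return looks_harmless and not genotype_familiar

print(mimicry_alert("genotype_A", {"small", "slow"}))  # False: familiar and harmless
print(mimicry_alert("genotype_Z", {"small", "slow"}))  # True: unknown genotype, harmless look
```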
For prey species that live in groups, monitoring the environment for threats can result in collective activity that far exceeds the capacities of a single individual. Timely and unambiguous communication about the nature of an ongoing threat is essential to group cooperation. One excellent example of this is the call of the black-capped chickadee (Poecile atricapillus), explained in this video. When chickadees identify a threat, they alert fellow members of the flock, succinctly communicating information about its identity, location, and behavior. Once the alarm is raised, multiple chickadees can focus their attention on the threat, and may even coordinate attacks on predators in a form of mobbing behavior, buying time for targeted individuals to escape.
Are there modern tools we could implement to assist us with threat identification? Certainly, yes. My original motivation for writing this post came from reading the SolidGoldMagikarp sequence by Jessica Rumbelow and mwatkins about glitch tokens in GPT models. The fact that glitch tokens are partially, but not wholly, conserved among different GPT models suggests to me that these tokens can be used to quantify some degree of cognitive relation between different GPT versions. Specifically, it makes me suspect that a genotype-based or phenotype-based "phylogenetic" approach could be used to categorize the behavior that was observed.[7] Since SARS-CoV-2 emerged in late 2019, one of the many helpful tools that have emerged is Pangolin, a freely-available taxonomy tool for identifying novel SARS-CoV-2 variants on the basis of their genotypic characteristics. Because it seems that many derivative GPT-like models will emerge in the coming years, I suspect that "weird token" identification would be a particularly useful identification tool if one were to encounter a GPT-like model of unknown origin. This type of identification may be especially important in cases where a distributor of a GPT-like model is not forthcoming about its provenance. Regarding interpretability, it would be interesting to evaluate to what degree "phenotypic" divergence typically depends on "genotypic" divergence for these models. One of the weird/interesting theoretical consequences of a phylogenetic approach is that one would expect to find evidence of "purifying selection" (when a novel training dataset has selected against specific "weird tokens", and they no longer appear in descendant models) and possibly "convergent evolution" (where independently originating GPT-like models, trained with similar datasets, make use of some of the same weird tokens because it confers some advantage towards their optimization goal).
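As a rough sketch of what a quantitative version of this could look like, one might compare models by the Jaccard distance between their glitch-token sets and feed the resulting distance matrix into a standard distance-based clustering or tree-building method. The model names and token sets below are placeholders, not measured data.

```python
# Sketch: treat each model's glitch-token set as a fingerprint and compare pairwise.
from itertools import combinations

glitch_tokens = {
    "model_A_v1": {" SolidGoldMagikarp", " petertodd", " exampleTokenX"},
    "model_A_v2": {" SolidGoldMagikarp", " petertodd", " exampleTokenY"},
    "model_B":    {" exampleTokenZ", " exampleTokenW"},
}

def jaccard_distance(a: set[str], b: set[str]) -> float:
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# A matrix of these distances could feed a distance-based phylogenetic method
# (e.g. neighbor joining) to place a model of unknown origin relative to known ones.
for name_a, name_b in combinations(glitch_tokens, 2):
    d = jaccard_distance(glitch_tokens[name_a], glitch_tokens[name_b])
    print(f"{name_a} vs {name_b}: distance = {d:.2f}")
```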
At the pivotal moment of a predator's attack, the highest immediate priority is to avoid or withstand its initial blow. Initiating antipredator behavior that makes the future substantially less predictable, even if it alone is not sufficient to neutralize the threat, may buy crucial time to act.[8] In the case of confrontation with a hostile AGI system, however, escape is unlikely to be a long-lived solution. Rather, it will be necessary to respond with a decisive counterattack to eliminate the threat. Thus, we should not only seek to imitate prey strategies, but should prepare to retaliate with overwhelming force in a manner that does not expose additional vulnerabilities.
Part 3 — Coordination problems as barriers to complexity
When considering how AI may increase in complexity at greater scales, it may be helpful to consider the types of obstacles biological systems had to overcome for complexity to emerge. There are at least three occasions in the evolution of life where coordination problems appear to have been significant barriers to increasing complexity: (1) the coordination of precursor nucleobases and sugar backbones to form genes, (2) the coordination of genes to form cells, and (3) the coordination of cells to form multicellular organisms. Of course, it is impossible to travel back in time to directly observe these events or reconstruct them in their entirety, but our understanding of biology is such that we can make some reasonable inferences about what must have occurred.
(1) Precursor nucleobases and sugar backbones to genes
RNAs are nucleic acids composed of long sequences of ribonucleotides. The consensus view among scientists is that RNA is likely to have been one of the earliest genetic substrates. However, RNA was not the only component of the primordial soup. Many alternative nucleobase precursors and sugar backbones existed alongside the canonical ribonucleotides (adenine, uracil, cytosine, and guanine). Why were RNA and the four canonical nucleobases eventually selected instead of others? This is still an area of active research, but we can nevertheless draw some preliminary conclusions.
RNA has several properties that make it an especially interesting candidate as the first genetic substrate: RNA is more thermodynamically stable than DNA; RNA forms complex secondary structures, due to a high propensity for both intramolecular and intermolecular base pairing; RNA has the ability to form enzymatic structures, including autocatalytic ones, known as ribozymes. As described by Kim et al. (2021), features of RNA copying chemistry, which is "selfish" in the sense that other sugar backbones can be used as scaffolding for nonenzymatic RNA extension, likely gave RNA an inherent advantage over alternative systems. What remains substantially less clear is why stable, noncanonical ribonucleotides, such as inosine and 2-thiopyrimidines, were not selected as nucleobases. It seems plausible that the selection of canonical ribonucleotides was either deterministic (in the sense that adenine, uracil, cytosine, and guanine had some significant advantage due to their specific properties) or genuinely stochastic (in the sense that a prebiotic world with conditions similar to primordial Earth may have settled upon a different set of functional nucleobases). In either case, precursor nucleobases can be considered competitors with one another, and nucleobases which form hydrogen bonds with one another are analogous to mutualist species. Although many chemically viable combinations of nucleobase pairs could have emerged from the primordial soup, two pairs (A-U and G-C) emerged as the winners. Given that at least two pairs were likely necessary to achieve the complexity required for gene formation, it is tempting to speculate that genes emerged as soon as minimum viable complexity for gene formation was achieved, and since even simple genes could presumably outcompete complex proto-genes, there was no opportunity for evolution to incorporate additional nucleobases into the genetic code.[9] Indeed, recent advancements in synthetic biology have seen the successful introduction and faithful replication of artificial base pairs, perhaps suggesting that alternative natural base pairs could have been viable, but were at some point driven to extinction.
(2) Genes to cells
It seems highly likely that the earliest cells were rather different than the phospholipid bilayer-enclosed sort that are familiar to us today. In particular, some scientists have proposed that the first cells could have been entirely abiotic, existing as microscopic pores in ocean rocks. Cells have several features that make them seemingly indispensable for biological life: (1) cells provide an enclosed space where biologically-important substrates can accumulate, facilitating enzymatic activity; (2) cells provide genes a mechanism for retaining the products that they help create, ensuring that fit genes are rewarded for the information they carry; (3) cells keep out toxins and outside genetic information. Even if the first cells did not accomplish all of these functions with much efficiency, one can easily imagine scenarios in which cellular barriers became incrementally more sophisticated over time.
One imagines that early gene replication was exceedingly slow, and that early genes may have competed with each other for valuable cellular space by simple displacement. As Richard Dawkins argues in The Selfish Gene, genes are the fundamental unit of heritable information upon which natural selection acts (with memes as a cognitive/cultural analog). Minimally complex but highly proliferative genes must have been an initial barrier to the development of further complexity. Initial genes would not have had stringent proofreading activity, however, and error-prone copying mechanisms helped accelerate the formation of novel structures.[10] The solution that all cells ultimately adopted was for genes to join together to form genomes and to replicate via DNA polymerases with good proofreading activity. By being linked together as part of the same molecule, genes link their fates together, giving each a strong structural incentive to help the others, so that all of them can get replicated. The DNA polymerase, which replicates DNA regardless of sequence, is an impartial mechanism to facilitate such coordination.[11] Genes which copy themselves too eagerly become deleterious for their hosts, and have therefore been strongly selected against. Only cellular organisms that have developed reliable means of genome replication have managed to survive in the long-term.[12]
(3) Cells to multicellular organisms
If time is a good indicator, coordination of cells into multicellular life was certainly the most challenging obstacle to increasing complexity that life on Earth has successfully overcome. Current estimates are that cells evolved around 300 million years after liquid water first appeared on Earth's surface. These microorganisms evidently had a very hard time working together, however, as multicellular organisms did not emerge for another ~2.5 billion years. Why was the coordination problem so seemingly easy for genes and so difficult for cells?
Although genes likely had strong and immediate incentives for working together (at least as soon as they were part of the same genome), it took considerably more time for conditions to arise where cells had the same converging incentives. Even in the absence of multicellular organisms, bacteria can form complex collaborative structures, like biofilms, but one imagines the Archean Eon of Earth's history as one of especially brutal competition between microorganisms, whose incentives for rapid division and niche exploitation may have precluded further complexity via "race-to-the-bottom" incentive structures. During this period, it is notable how many ecological niches (i.e. those that would later be occupied by eukaryotes) continued to go unexploited.
In particular, mitochondria seem to have been essential to coordinating multiple cells into a single organism. One of the leading hypotheses about how mitochondria formed an endosymbiotic relationship with eukaryotic cells is the hydrogen hypothesis,[13] whereby mitochondria provided hydrogen and carbon dioxide in exchange for a spot within an archaeal host. The end result was a form of mutualism, facilitated by the fact that mitochondria and eukaryotes had complementary nutritional requirements. Each provided an environment where the other could optimally thrive, while simultaneously offering the other something it was not capable of producing autonomously. Mitochondria were not the only endosymbionts to form a mutualistic relationship with their hosts, as plastids (including chloroplasts) arose independently. Although it is unclear precisely how many independent endosymbiotic events must have taken place to give rise to the currently existing diversity of cellular organelles, phylogenetic evidence suggests the number could be very low, with perhaps as few as two ultimately successful primary endosymbiotic events occurring during Earth's evolutionary history (once for plastids and once for mitochondria).
The mutualism between mitochondria and early eukaryotes was so successful that it likely generated excess resources of the sort that Scott Alexander refers to in his essay, Meditations on Moloch. Moreover, mitochondria played a critical role in the development of apoptosis (aka "cell suicide"), whereby infected, diseased, or dysfunctional cells kill themselves for the benefit of the entire organism. This extreme degree of selflessness appears to have been an essential condition for the emergence of multicellular life, and intuitively it makes sense that it would have taken a long time to evolve. In an optimization regime for inclusive genetic fitness, what are the circumstances in which killing oneself would be an optimal choice? Only in circumstances in which one could be assured that one's death was of benefit to one's genes. Intentional cell death remains one of life's most incredible innovations: it is a striking example of superficially perverse behavior, something a naïve observer might expect violates the rules of a given optimization regime, but actually fulfills it in an unexpected way.[14]
In many ways, multicellular organisms resemble well-functioning states. Cancerous cells have the potential to rapidly divide, consume essential resources, and kill their host organisms. Likewise, infected cells can become incubators for pathogens, resulting in danger to their hosts. In order to overcome these obstacles, cells are primed for apoptosis and immune cells police the body should dangerous conditions arise. There are a variety of reasons for dysfunction in such systems, but two general principles can be observed:
Multicellular life depends upon the cooperation of each of its component parts. Selfish cells which pose a threat to the entire organism, as cancerous cells do, cannot be allowed to live if the organism is to survive.[15]
Evolutionary incentives operate simultaneously at multiple levels of complexity. Whereas cells evolved to replicate quickly and acquire resources, multicellular life harnessed and altered such systems to become cooperative by establishing safeguards. Cancerous mutations cause these safeguards to stop working effectively, resulting in a reversion to the underlying cellular behavior (which was optimized for a much longer time to replicate quickly and acquire resources).
Part 4 — Thought experiment about biological superintelligence
Hypothetically, consider the following two paths to create biological superintelligence.
Path 1: Establish a mouse breeding program that selects for intelligence.
Path 2: Grow mouse neurons in enormous cell culture vats, such that the connectome can be reproduced with a fairly high degree of fidelity in each successive iteration, and select for intelligence.
Assume that the optimization regime is scaled in such a way so that purifying selection occurs at each iterative step (generation).
Superficially, there are many obvious differences between these two scenarios. In Path 1, the mouse neurons we are selecting are already organized into an intelligent system, such that one can readily imagine ways in which intelligence might be tested and optimized. As our mice grew more intelligent, we would necessarily need to scale our tests in such a way as to adequately gauge performance. In practical terms, there are other factors we would need to consider, such as the fact that certain genetic routes to increased intelligence might come at a cost of viability, but with sufficient numbers of generations and genetic diversity, one could very conceivably create a biological superintelligence using such a method. The brain structures in Path 1 would be altered over time, perhaps radically so, but common features (presumably those relating to physiological functions which are not under optimization pressure) might well be conserved over time.
Initially, Path 2 seems to be at a considerable disadvantage compared to Path 1. Randomly seeded, unstructured neurons would not be expected to have any significant degree of intelligence at all. Even inventing tests to evaluate intelligence at this level (perhaps involving electrochemical gradients?) would be a nontrivial challenge. For the sake of argument, assuming such difficulties in the optimization procedure could be overcome, what sort of cognitive architecture might emerge? More generally, we can conceptualize Path 2 as being a path from a high-entropy state, where disorganized neurons produce no intelligent output, to a lower-entropy state, where cellular structures like cortices emerge. Because the Path 2 selection regime is very different from the evolutionary selection regime that formed the modern mouse brain, however, we would not expect Path 2 to yield brain regions identical to those found in mice.
Naïve expectations: I would expect that Path 1 results in a superintelligence that is more likely to be legible and empathetic to humans than one emerging from Path 2. Furthermore, I expect that the space of possible minds emerging from Path 1 is much narrower than the space of possible minds emerging from Path 2, and that this is related to the degree of starting entropy of neuronal organization. Finally, at very high levels of superintelligence with similar resource constraints, I would expect architectures emerging from both Path 1 and Path 2 to converge around principles of optimal design.
One of the theoretical assertions that this thought experiment illustrates is that a relatively small change in initial starting conditions could radically alter the space of minds available for development. By analogy, dumping a vast quantity of water into a river results in a relatively predictable trajectory (Path 1); dumping a large quantity of water on the summit of a continental divide results in a very large number of possible trajectories (Path 2). In both cases, assuming that it does not get stuck in a local minimum, the water will continue to flow downwards until it reaches the ocean (convergence).[16]
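The parallel to gradient descent noted in footnote 16 can be made concrete with a toy sketch: run plain gradient descent on a one-dimensional landscape with many local minima, starting either from a narrow band of initial points (the river of Path 1) or from a wide band (the continental divide of Path 2), and count how many distinct basins the runs settle into. The landscape, learning rate, and initialization ranges are arbitrary choices for illustration, not a model of either breeding program.

```python
# Toy illustration: initial conditions determine how many "minds" (basins) are reachable.
import math
import random

def grad(x: float) -> float:
    # Derivative of f(x) = cos(3x) + 0.05 * x**2, a wavy bowl with many local minima.
    return -3.0 * math.sin(3.0 * x) + 0.1 * x

def descend(x0: float, lr: float = 0.01, steps: int = 5000) -> float:
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return round(x, 1)  # round so nearby endpoints count as the same basin

random.seed(0)
narrow_starts = [random.uniform(-0.5, 0.5) for _ in range(200)]   # "Path 1"
wide_starts = [random.uniform(-20.0, 20.0) for _ in range(200)]   # "Path 2"

print("basins reached from narrow initialization:", len({descend(x) for x in narrow_starts}))
print("basins reached from wide initialization:  ", len({descend(x) for x in wide_starts}))
```

With a tight initialization, nearly all runs flow into the same one or two basins; with a wide initialization, the runs scatter across many basins, even though every run follows the same downhill rule toward some minimum.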
Conclusion — AIs as animals, a useful paradigm?
Many of the ways in which AIs are emphatically not like animals will be obvious to the reader. AIs did not evolve in nature, do not have a metabolism, lack an embodied self, and are not necessarily autonomous agents. Moreover, treating agentic AIs like friendly canines comes with a considerable set of risks, as sufficiently powerful AIs will not have the same incentives as canines to cooperate with humans. Nonetheless, I found Stuart Russell's dog analogy to be a useful stepping stone to my own theoretical understanding of the types of risk involved. Humans evolved in a world where non-human intelligences (animal predators) posed a significant danger. Given the rapid development of AI, we will likely live in a world inhabited by potentially dangerous non-human intelligence once again. Invoking analogies to animal threats might be a useful tool to help communicate the magnitude of the danger we will soon confront, and to emphasize the ways in which large neural networks have properties that are categorically different from prior algorithms. After all, coordinating human behavior is not merely a matter of making statements with high truth value. We need to find ways to communicate that resonate with human psychology, and it may be worth reflecting more on the types of threats that existed in our ancestral environment.
There seems to be considerable reluctance among a substantial fraction of commentators to admit that a GPT-4-like large language model (LLM) could have intelligence of its own. This view seems to be premised upon at least one of the following assumptions: (1) that unprompted agency is an essential component of intelligence; (2) that if one is trained wholly with representational data about the real world, there is no way to have any cognizance of the real world (and therefore no intelligence); (3) that intelligence is substrate-specific, and/or LLMs are insufficiently complex to support the emergence of intelligence; (4) that it is impossible to distinguish behavior that seems intelligent from behavior that is actually intelligent due to the nature of the LLM-training material, and therefore we should assume the former; (5) that there is some other structural or scale-related handicap, possibly to do with recursive memory formation or imaginative capacity, that prevents modern LLMs from developing intelligence. My own view is that there are compelling reasons to disbelieve all of these assumptions, but I suppose it very much depends on one's definition of intelligence. As tech developers give LLMs more agentic roles in the coming years, I predict that many of these criticisms will become weaker over time.
In particular, Nick Bostrom's Superintelligence: Paths, Dangers, Strategies (2014) was especially formative to my views about superintelligence, and should be credited with some of the frameworks I am using to think about these issues. I take seriously the idea that agentic AIs should be viewed as strategic competitors to humans.
Interestingly, many bacteria have been shown to domesticate defective prophages, i.e. bacteria tolerate the insertion of virus genes into their own genomes, and leverage this material to their own advantage, gradually changing it over time. This phenomenon may be a key driver of bacterial complexity.
I am reminded of a quotation: "Growth for the sake of growth is the ideology of the cancer cell." ―Edward Abbey, The Journey Home: Some Words in Defense of the American West
This sounds a bit counterintuitive at first glance. In what sense could an unaligned AGI be safe? This statement heavily depends upon the premise that there are hard upper limits for levels of intelligence, such that AGI cannot (or empirically does not) recursively improve its own abilities or manipulate others into doing so. In such a scenario, cooperative/competitive mutualism with a permanently weak, albeit unaligned, AGI seems like a feasible outcome. One possible rationale for such an outcome is that an AGI only slightly more intelligent than humans would correctly reason that innovating greater levels of intelligence poses a potential existential risk to itself or whatever its final goals might entail.
Apparently, both of these incentives are strong enough to warrant some clever engineering solutions. Owls have evolved the ability to rapidly rotate their heads as compensation for their large blind spots. Meanwhile, pigeons have evolved temporal stereoscopic vision. Instead of using binocular vision to generate a stereoscopic model of the world, pigeons use two monocular images, separated in time. That's why pigeons are constantly bobbing their heads back and forth.
I am aware of one qualitative phylogenetic tree of LLMs, as seen in this tweet by Yann LeCun. I am not aware of any efforts to develop a quantitative approach.
Unpredictability as a formal strategy has precedent in human geopolitics. The madman theory of foreign policy is the notion that it may be useful to act in ways that make one unpredictable to one's enemies. In games as simple as chess, playing random moves is of exceedingly limited benefit, as an opponent of sufficient skill would be able to fully interpret the consequences of such moves and exploit the vulnerabilities presented by non-optimal play. Operating in a space as complex as the real world, however, one can consider positions of maximal strategic ambiguity as a form of high ground during a conflict, wherein each party obtains an advantage if it can correctly predict the consequences of enemy actions. By existing in a space of strategic ambiguity, where many strategic trajectories remain viable, one imposes a greater compute cost on enemy simulations of one's future actions. Indeed, Move 78 of Lee Sedol's Game 4 against DeepMind's AlphaGo program did exactly this, imposing such high computational costs on AlphaGo that Lee Sedol obtained his first and only victory in the challenge match series.
Strictly speaking, some complex organisms modify RNA nucleosides for specific purposes, such as pseudouridine, but to my knowledge these are all post-transcriptional modifications, not genetic ones. While scientists do not have a clear picture of the precise molecular path to gene formation, it is remarkable that only a small fraction of seemingly viable precursor nucleobases form the genetic basis for all currently known organisms.
There seems to be an interesting analogy to complexity on a larger scale here. As a thought experiment, suppose there are two intelligent civilizations. One is hyperconservative and traditional, making very few memetic changes with each successive generation. The other is liberal, valuing things like curiosity and experimentation, resulting in profound societal changes every generation and a rapid rate of technological progress. These two civilizations face different categories of existential risk: whereas the conservative civilization runs the risk that its successful model will cause overpopulation and resource depletion, the liberal civilization runs the risk of innovating some superior model which outcompetes it from within. To radically oversimplify, the conservative civilization is a bit like a gene with good proofreading activity that faithfully expands to fill the carrying capacity of its ecological niche; the liberal civilization is like a gene with poor proofreading activity that has the potential to find new ecological niches, but also to replace itself with more distant descendants.
The property of "content neutrality" or "impartiality" is an interesting thing to note in this context. From a transcendentalist perspective, some of the most effective human solutions to coordination problems also invoke this principle.
Intriguingly, some Streptococcus bacteria are believed to increase their own rate of mutagenesis during times of cellular stress. The evolutionary logic here is that if the host organism is likely doomed, it may as well try a variety of different survival approaches, even if most of them will be more harmful than helpful.
This hypothesis, and many other surprisingly broad insights, are discussed at length by Nick Lane in his wonderful science book, Power, Sex, Suicide: Mitochondria and the Meaning of Life.
An important distinction that has been repeatedly pointed out by others is that intelligence-facilitated culture (i.e. the outcome of a selective process that runs on memes, in the Dawkinsian sense) is not identical to genetic optimization. In systems that have intelligence, authentically perverse behavior (e.g. the manufacture and use of condoms) can spontaneously emerge with respect to a given genetic optimization regime.
Although the frontispiece of Thomas Hobbes' Leviathan predated the cell theory of life, many of Hobbes' conclusions about the necessary conditions of state formation apply to the conditions required for multicellular cooperation.
This idea has a clear parallel in stochastic gradient descent, but this post refers specifically to optimizing the structural architecture required for intelligence.
Disclaimer: The views expressed in this document are my own, and do not necessarily reflect those of my past or present employers.
In a recent interview at the Commonwealth Club of California, Stuart Russell compared training GPT-4 to training a dog with negative reinforcement. Although there are obvious (and not-so-obvious) limitations to this analogy, conceptualizing of GPT-4 as a partially domesticated, alien canine with a knack for Python code seems substantially more useful to me than calling it "[a mere program] run on the well-worn digital logic of pattern-matching" (which is how Cal Newport recently characterized the mind of ChatGPT, despite the sparks of AGI in GPT-4).[1] In any case, Russell's comparison prompted me to more deeply consider the relationships between intelligent species that have already arisen in nature. Assuming there is a degree of validity in treating agentic AIs as animals of indeterminate intelligence and intention, are there any already-existing evolutionary strategies we might adapt to better equip ourselves to handle them? Furthermore, are there other biological mechanisms of particular relevance for understanding AI cognition and safety? In Part 1 of this post, I discuss the phenomena of symbiotic mutualism and domestication. In Part 2, I explore a broad variety of predator/prey survival strategies, with the aim of generating a repository of ideas that may be amenable to context-appropriate engineering solutions. In Part 3, I examine ways in which evolution has solved three major coordination problems as barriers to increasing complexity. In Part 4, I propose a thought experiment about distinct forms of biological superintelligence to illustrate ways in which entropy is connected to cognitive architecture. Finally, I conclude by considering the circumstances in which treating AI like an animal intelligence may prove to be a useful cognitive shortcut. Few, if any, of the ideas presented here will be entirely novel, but it is my hope that by condensing a particular view of AI safety, readers will have the opportunity to re-frame their theoretical understanding in a slightly different context.[2]
Part 1 — Mutualism and domestication
When humans first began domesticating wolves in Siberia around 23,000 years ago, one can imagine the sort of tension that must have predominated. Pleistocene humans and wolves shared many commonalities—both were Boreoeutheria that hunted big game for food, needed to keep warm during harsh winters, were especially social, and formed complex intragroup hierarchies. Nevertheless, both species clearly presented a danger to each other, and competition over scarce resources must have complicated initial attempts at domestication. It is tempting to speculate about the traits that were most influential in driving early mutualism between these species, but close genetic similarity was probably not a decisive factor. It has been observed that common ravens (Corvus corax) also have a close mutualistic relationship to wolves, going so far as to tease and play with one another, despite bridging a significantly greater genetic distance. Thus, the property of alienness alone doesn't appear to preclude relationships of mutual trust and advantage between disparate forms of animal intelligence. More formally, we might define mutualism betweeen Species X and Species Y in the following way:
Species X uses some behavior of Species Y to derive benefit for Species X AND
Species Y uses some behavior of Species X to derive benefit for Species Y.
Such an arrangement could plausibly be considered an evolutionarily stable strategy, provided that the following conditions are met: (1) that the behaviors of each species are not detrimental to their own inclusive genetic fitness, and (2) that the benefits accorded to each species are not mutually exclusive. In cases where both species are significantly intelligent, one could easily imagine a situation of mutual empathy, wherein each species explicitly recognizes the fact that its partner species has distinct goals. In principle, the property of alienness might even be considered conducive to mutualism, insofar as more alien species are more likely to possess domain-specific competencies that their partner species lack, and are more likely to have distinct goals, potentially limiting zero-sum competition for resources. More generally, the phenomenon of mutualism seems to find a close analogy in economics, such that mutualism is to whole species as economic transactions are to individuals.
Domestication, on the other hand, represents a special case of mutualism with an asymmetrical power dynamic. Domestication of other species was traditionally viewed as an exclusively human behavior, but identification of similar behavior among other species resulted in the need for a more broadly-encompassing biological definition.[3] In 2022, Michael D. Purugganan proposed the following:
This definition illustrates that domestication imposes up-front costs on the domesticator. Active management of another species' survival requires an investment of time and resources. In particular, early stages of domestication require periods of experimentation to assess the needs of (and control the dangers posed by) the domesticate. In exchange for this period of delayed gratification, the domesticator gets to freely exploit the domesticate at a time of the domesticator's choosing. Gradually, the domesticator alters both the behavior and the genetics of the domesticate to better accommodate the domesticator's needs. Although the barriers to begin the process of domestication are relatively steep, the rewards are plentiful. Another observation is that domesticators may become entirely reliant on domesticates for their own survival. Humans are so enthusiastic about domistication (and extinction of wildlife) that fully 62% of the global mammal biomass now consists of livestock. If indeed we are training AIs to authentically imitate human preferences, we should remain mindful of our proclivity towards keeping livestock, animals that we often treat very poorly.
Are there any circumstances in which domesticates might be satisfied with their status relative to humans? The phenomenon of human pet ownership comes to mind as a particularly exceptional type of domestication, in which humans form peer-like bonds of companionship with animals of significantly lower intelligence. Although humans do not regard pets as having equal moral status, wealthy humans often go to great lengths to pamper their pets, and even seem to suspend aspects of their own intelligence in order to play with pets, like dogs or cats, on a roughly equal footing.
Traditional methods of domestication were entirely based on the phenotype of the domesticate, along the lines of "breed the cattle that produce the most milk." Domestication based on phenotype, without any mechanistic understanding of how the phenotype is produced, however, can lead to unintended consequences. Quite famously, the domesticated silver fox, which was bred under strong selection pressure to promote friendliness to humans, exhibited increased cognitive abilities as an unintended by-product. Another famous example of domestication gone wrong is the Irish Lumper, a cultivar of potato that was widely grown in early 19th century Ireland based on its phenotype of high yield in low-nutrient soils. This cultivar was later reported to have been especially susceptible to blight, ultimately resulting in the Great Famine of 1845-1852. With the advent of modern genetics and genome sequencing in the last several decades, it is now possible to identify specific alleles that contribute to a desired phenotype, and to directly engineer organisms by means of genome-editing technology. In other words, contemporary methods are increasingly based on the genotype of the domesticate. In some sense, Reinforcement Learning from Human Feedback (RLHF) can be considered a cursory attempt by humans to selectively breed AI using a phenotype-based approach, whereas efforts to build interpretable AI find a closer analogy in genetic engineering. As many other commentators have noted, the particular danger of phenotype-based ("outer alignment") optimization approaches is that unintended traits would be expected to be amplified as byproducts of an imprecise and/or poorly understood optimization regime.
Currently existing AIs like ChatGPT, owing to their limited agency, seem to fall below the threshold at which we would consider them mutualists with humans. I note that this assessment is mostly rooted in intuition, as there is no clearly defined threshold for mutualism. Even "dumb algorithms," like the YouTube video recommendation algorithm, have initiated feedback loops with human users in a way that resembles complex interdependence. The ongoing misalignment problem in recommender systems is a direct result of misaligned goals: whereas recommender systems are designed to maximize user screen time, humans want to maximize the utility of their free time. Human preferences affect the content that recommender algorithms provide; simultaneously, algorithm-selected content affects human preferences. Over time, recommender algorithms necessarily find ways to provide content that manipulates human preferences towards consuming more content. When YouTube was first launched in 2005, humans spent a very small portion of their time consuming online video content, so concerns about alignment with human goals must have seemed distant and abstract. The last decade has made it abundantly clear, however, that alignment concerns should be taken into consideration prior to product commercialization. Recommender systems catering to individual preferences are powerful and should be designed with human-compatible outcomes in mind. Feedback loops occur at many different levels across all biological systems. Most biological feedback loops are negative feedback loops, where more of X begets less of X, and are critical to maintaining homeostasis. Positive feedback loops, where more of X begets more of X, occur relatively infrequently within multicellular organisms due to their disruptive potential. Within human bodies, positive feedback loops are initiated by the immune system in response to disease, during periods of exponential cell growth, and during pregnancy, but are otherwise notably absent.[4]
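To make the feedback-loop dynamic concrete, here is a minimal toy simulation, under the simplifying assumption that each "high-engagement" recommendation nudges a user's appetite for such content upward by a small amount. All parameter names and values are invented for illustration; with the positive-feedback term present, screen time drifts upward, whereas adding a damping (negative-feedback) term would stabilize it.

```python
import random

# Toy model of the recommender/user positive feedback loop described above.
# Assumption (hypothetical): each "high-engagement" recommendation nudges the
# user's appetite for such content upward by a small amount.

def simulate(steps=200, nudge=0.05, seed=0):
    random.seed(seed)
    preference = 0.2                 # initial appetite for high-engagement content
    daily_hours = []
    for _ in range(steps):
        served_engaging = random.random() < preference   # recommender tracks measured preference
        if served_engaging:
            preference = min(1.0, preference + nudge)    # positive feedback: more begets more
        daily_hours.append(1.0 + 3.0 * preference)       # screen time scales with preference
    return preference, daily_hours

final_preference, hours = simulate()
print(f"preference drifted from 0.20 to {final_preference:.2f}; "
      f"daily hours from {hours[0]:.1f} to {hours[-1]:.1f}")
```

The point of the sketch is only that a system optimizing engagement against a malleable preference tends to ratchet upward unless some stabilizing term is deliberately built in.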
A nascent weak artificial general intelligence (AGI), with superhuman capacities in several, but not all, domains of human intelligence would presumably have strong incentives to form a mutualistic relationship with humans as an instrumental measure, regardless of its ultimate goals (resulting in extrinsically-motivated mutualism). Depending on pre-existing disparities between competencies, allocation of resources, and predictions about each other's future behavior, it is easy to envision a wide variety of scenarios in which extrinsically-motivated mutualism could emerge, such that both humans and the AGI could benefit from each other's domain-specific competencies during a transitory phase. A sufficiently weak AGI that was either structurally unable or demonstrably unwilling to improve its own intelligence would likely be the only type of unaligned AGI with which humans could interact and still have any reasonable assurance of safety over a prolonged time span.[5] Given strong strategic incentives for both parties to conceal their long-term intentions towards each other, however, it is significantly harder to imagine how mutual trust could be established (supposing that the AGI was not specifically engineered with such a design challenge in mind). Somewhat more plausibly, human civilization and a not-yet-overpoweringly-strong AGI could end up in the roles of rival superpowers, which might nevertheless bargain in an effort to gain strategic advantage. In this scenario, the not-yet-overpoweringly-strong AGI would presumably have several major advantages, including its knowledge of human psychology and the inability of humans to effectively solve coordination problems on a global scale. In cases where an AGI has the ability to improve its capacities to gain strategic leverage at a rate outstripping that of human civilization, we should regard extrinsically-motivated mutualism as a metastable condition, which ends as soon as the AGI competitor has attained a decisive strategic advantage.
A strong AGI or superintelligence, i.e. something with superhuman capacities in nearly all domains of human intelligence, would presumably have no incentive to work with humans in a mutualistic capacity, except in cases where its final goal was somehow anthropocentric (resulting in intrinsically-motivated mutualism). In such a case of anthropocentrism, the outcome could likely be interpreted as a form of domestication, whereby the AGI would domesticate humans to derive some (probably rather opaque) benefit to itself. If strong AGI is inevitable, non-invasive domestication of humans by aligned AGI is probably the best long-term outcome. Conversely, domestication by strong, misaligned AGI has the potential to be a fate considerably worse than mere extinction.
Part 2 — Survival as predator/prey
In addition to the possibility of mutualism, there exists a more straightforward possibility that AGI systems will be directly hostile to humans. If that is the case, are there any lessons we can draw from our fellow products of natural selection? It may sound like a peculiar proposition to look for innovations in Earth's wildlife, which has fared poorly over the last 10,000 years as a result of its first, prolonged battle with human-level intelligence. In my view, however, the information we glean from biological evolution represents one of our most important resources in a hypothetical competition with AGI. It is precisely our accumulated knowledge about the real world that gives us a head start, one that we would be foolish to squander. Conceptually speaking, we should regard biological evolution as an ongoing 4-billion-year-old computation optimizing for inclusive genetic fitness. Even if an AGI competitor has the capacity to conduct many billions of clever simulations to inform its approach, the sheer scale of the real world and the number of complex variables inherent in physical systems may yield abstract insights that can inform our survival strategy. Instead of assuming we already know what strategies will be most effective based on purely theoretical considerations, we might additionally consider looking for empirical evidence about which strategic principles have proved effective in the past, then adapting those solutions to suit our contextual needs.
As a starting point, we might ask, what are some of the most widely conserved strategies for predator avoidance? Secondarily, what predator-like strategies can be used to eliminate a dangerous opponent? I quote the following passage from Behavior Under Risk: How Animals Avoid Becoming Dinner at length:
One particular insight is that it appears advantageous for prey species to initiate antipredator behavior in response to both general and threat-specific cues. In the event that prey is encountering an unknown predator for the first time, it will necessarily lack experiential associations of the danger it is confronting. Therefore, it seems imperative to establish general warning systems of the sort where we can say, "Something is off; even though we can't pinpoint a source of potential harm with any high degree of precision, we can confidently say that we are encountering something outside of our prior experience. Time to hit pause and exercise due caution until the unfamiliar phenomenon can be explained." This may require some out-of-the-box thinking. For example, as AGIs may accomplish many of their goals via manipulation of human agents, it would be entirely sensible to monitor not only AGI behavior, but to look for surprising changes in human behavior as well. Large technology companies have surely garnered enough data about human browsing/socialization/meme-spreading habits to infer quite a lot about how information presently circulates through human communication networks. Deviations outside historical parameters should draw our attention, proportional to the extent that new developments defy our expectations. During periods of extremely rapid social change, it may be important to remind ourselves about what normality looked like from a perspective of several months or years ago. Ideally, we would compile hard-to-fudge statistical metrics about our values, epistemic beliefs, habits, and economic activity, so as to be especially conscious of changes occurring within our own psychologies. Raising an alarm via a general warning system may provide precious time to initiate antipredator behavior and subsequently look for threat-specific cues.
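As a toy illustration of what such a general warning system might look like in code, the sketch below flags any tracked metric that drifts far outside its historical range, without needing to identify a specific threat. The metric names, values, and z-score threshold are hypothetical placeholders, not a proposal for which metrics actually matter.

```python
import statistics

# Minimal sketch of a "general warning system": flag any tracked metric that
# drifts outside its historical range, with no knowledge of the specific threat.

def drift_alarms(history, current, z_threshold=3.0):
    alarms = []
    for name, past_values in history.items():
        mean = statistics.mean(past_values)
        stdev = statistics.pstdev(past_values) or 1e-9   # avoid division by zero
        z = abs(current[name] - mean) / stdev
        if z > z_threshold:
            alarms.append(f"{name}: z={z:.1f} (outside historical parameters)")
    return alarms

# Hypothetical metrics and values, purely for illustration.
history = {"median_daily_screen_time_h": [3.1, 3.0, 3.2, 3.1, 3.3],
           "share_of_bot_like_accounts": [0.02, 0.03, 0.02, 0.03, 0.02]}
current = {"median_daily_screen_time_h": 3.2, "share_of_bot_like_accounts": 0.11}
print(drift_alarms(history, current))   # only the anomalous metric is flagged
```

A real system would obviously need far more robust statistics and tamper-resistant data collection; the sketch only captures the idea of alarming on "outside prior experience" rather than on a known threat signature.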
It is well known that prey animals tend to have eyes on the sides of their skulls, whereas predators have eyes at the front of the skull. This is because there are competing incentives: to have as wide a field of vision as possible, and to have overlapping fields of vision for improved depth perception.
These features developed because prey do not know in advance where an attack will be coming from. For prey, initiating antipredator behavior in a timely manner is much more critical than having a wide range of binocular vision. For predators, understanding the precise distance to a potential meal is more important, incentivizing binocular vision even at the cost of an increased blind spot.[6]
To continue quoting from Behavior Under Risk:
For example, kangaroo rats (Dipodomys sp.) are nocturnal foragers preyed upon by the sidewinder rattlesnake (Crotalus cerastes), as seen in this video. Kangaroo rats depend upon both vision and hearing as additive cues to detect the presence of predators, but hearing is especially critical during the dark phase of the moon when visibility is low. In cases where kangaroo rats detect the presence of a rattlesnake in advance of an attack, they kick large quantities of sand in the direction of the rattlesnake as they run away, pre-empting any aggressive behavior. Taken as a metaphor, the repeated success of kangaroo rats could be interpreted as an encouraging sign for humanity. By initiating correct antipredator behavior, even at the last possible moment, it is possible to escape, even from the jaws of certain death.
Another commonly employed defensive strategy is camouflage. At the point where humans are attempting to camouflage themselves from an AGI aggressor, the situation will already be looking pretty grim. Keeping at least some human operational strategies opaque to AGI, however, remains an absolutely essential condition for conducting any sort of conflict. As Sun Tzu states in The Art of War:
and
Assuming a hostile AGI has basic strategic competence (indeed, it will likely have read Sun Tzu's The Art of War, and understood it better than most humans), it will only act when it believes it has a decisive advantage or as a last-resort defensive measure. The art of war is essentially to allow the enemy to act under such a belief, but to have made preparations beyond the scope of enemy calculations. This implies the need to effectively camouflage our strategy. Hiding any human activity from an intelligent agent with access to the internet, given the panopticon we are rapidly building, is no small feat. Humans serious about AI safety against an AGI with probable internet access will need to innovate effective ways to conceal their own activity (both past and present). This is especially true for any product developers who are seeking to build systems (e.g. physical or digital surveillance measures) that will go undetected by AGI.
A predator strategy of key relevance to AI safety is aggressive mimicry. Aggressive mimicry occurs when a predatory species mimics the phenotype of an innocuous species to prevent detection by prey. If target prey is lured into a false sense of security, the predator can strike at a time of its choosing with a decisive strategic advantage. Supposing that we were a prey species in a world filled with both harmless mutualists and aggressive mimics, what sort of tools would we want to have? Ideally, we would like the ability to sample both the genotype and phenotype of the species around us. If we become accustomed to experiencing harmless species with a set of similar genotypes and phenotypes, we should be suspicious when we encounter a species of an unknown genotype mimicking the harmless phenotype. There may indeed be innocuous reasons for species with distinct genotypes to display a similar phenotype (e.g. due to convergent evolution), but deliberate imitation of a species we have classified as harmless should ring some alarm bells.
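As a toy encoding of this heuristic, the sketch below flags an agent whose observed behavior ("phenotype") matches a known-harmless profile while its underlying identity ("genotype") is unrecognized. All labels and profiles are hypothetical placeholders.

```python
# Minimal sketch of the mimicry-detection heuristic described above.
# Known-harmless profiles pair an identity ("genotype") with observed behavior ("phenotype").
KNOWN_HARMLESS = {("genotype_v1", "cooperative"), ("genotype_v2", "cooperative")}

def classify(genotype: str, phenotype: str) -> str:
    if (genotype, phenotype) in KNOWN_HARMLESS:
        return "familiar mutualist"
    if phenotype == "cooperative":
        # Harmless-looking behavior from an unrecognized source: possible aggressive mimic.
        return "suspicious: unknown genotype mimicking a harmless phenotype"
    return "unknown agent: exercise general caution"

print(classify("genotype_v1", "cooperative"))   # familiar mutualist
print(classify("genotype_v9", "cooperative"))   # unrecognized source, harmless-looking behavior
```

The interesting design choice is that the rule treats "looks harmless but cannot be identified" as more suspicious than "looks unfamiliar," which mirrors the logic of aggressive mimicry in nature.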
For prey species that live in groups, monitoring the environment for threats can result in collective activity that far exceeds the capacities of a single individual. Timely and unambiguous communication about the nature of an ongoing threat is essential to group cooperation. One excellent example of this is the call of the black-capped chickadee (Poecile atricapillus), explained in this video. When chickadees identify a threat, they alert fellow members of the flock, succinctly communicating information about its identity, location, and behavior. Once the alarm is raised, multiple chickadees can focus their attention on the threat, and may even coordinate attacks on predators in a form of mobbing behavior, buying time for targeted individuals to escape.
Are there modern tools we could implement to assist us with threat identification? Certainly, yes. My original motivation for writing this post came from reading the SolidGoldMagikarp sequence by Jessica Rumbelow and mwatkins about glitch tokens in GPT models. The fact that glitch tokens are partially, but not wholly, conserved among different GPT models suggests to me that these tokens can be used to quantify some degree of cognitive relation between different GPT versions. Specifically, it makes me suspect that a genotype-based or phenotype-based "phylogenetic" approach could be used to categorize the behavior that was observed.[7] Since SARS-CoV-2 emerged in late 2019, one of the many helpful tools that has been developed in response is Pangolin, a freely available taxonomy tool for identifying novel SARS-CoV-2 variants on the basis of their genotypic characteristics. Because it seems that many derivative GPT-like models will emerge in the coming years, I suspect that "weird token" identification would be a particularly useful identification tool if one were to encounter a GPT-like model of unknown origin. This type of identification may be especially important in cases where a distributor of a GPT-like model is not forthcoming about its provenance. Regarding interpretability, it would be interesting to evaluate to what degree "phenotypic" divergence typically depends on "genotypic" divergence for these models. One of the weird/interesting theoretical consequences of a phylogenetic approach is that one would expect to find evidence of "purifying selection" (when a novel training dataset has selected against specific "weird tokens", and they no longer appear in descendant models) and possibly "convergent evolution" (where independently originating GPT-like models, trained with similar datasets, make use of some of the same weird tokens because it confers some advantage towards their optimization goal).
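To make the "phylogenetic" idea slightly more concrete, here is a minimal sketch, assuming one has already extracted a set of glitch/weird tokens for each model: treat each model's token set as a character set and compute pairwise Jaccard distances. The model names and token sets below are invented for illustration; in principle the resulting distance matrix could feed a standard hierarchical clustering routine to produce a dendrogram.

```python
from itertools import combinations

# Hypothetical glitch-token sets for three models (names and tokens are made up).
glitch_tokens = {
    "model_A":  {"tokA1", "tokA2", "tokShared1", "tokShared2"},
    "model_A2": {"tokA1", "tokShared1", "tokShared2", "tokNew1"},   # hypothetical descendant of A
    "model_B":  {"tokB1", "tokB2", "tokShared1"},                   # independently trained
}

def jaccard_distance(a, b):
    """1 - |intersection| / |union|; 0 means identical token sets."""
    return 1.0 - len(a & b) / len(a | b)

# Pairwise distances: closely related models should show smaller distances.
for (name1, toks1), (name2, toks2) in combinations(glitch_tokens.items(), 2):
    print(f"{name1} vs {name2}: distance = {jaccard_distance(toks1, toks2):.2f}")
```

Purifying selection would show up as tokens present in an ancestor's set but absent from a descendant's; convergent evolution would show up as shared tokens between models with no training lineage in common.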
At the pivotal moment of a predator's attack, the highest immediate priority is to avoid or withstand its initial blow. Initiating antipredator behavior that makes the future substantially less predictable, even if it alone is not sufficient to neutralize the threat, may buy crucial time to act.[8] In the case of confrontation with a hostile AGI system, however, escape is unlikely to be a long-lived solution. Rather, it will be necessary to respond with a decisive counterattack to eliminate the threat. Thus, we should not only seek to imitate prey strategies, but should prepare to retaliate with overwhelming force in a manner that does not expose additional vulnerabilities.
Part 3 — Coordination problems as barriers to complexity
When considering how AI may increase in complexity at greater scales, it may be helpful to consider the types of obstacles biological systems had to overcome for complexity to emerge. There are at least three occasions in the evolution of life where coordination problems appear to have been significant barriers to increasing complexity: (1) the coordination of precursor nucleobases and sugar backbones to form genes, (2) the coordination of genes to form cells, and (3) the coordination of cells to form multicellular organisms. Of course, it is impossible to travel back in time to directly observe these events or reconstruct them in their entirety, but our understanding of biology is such that we can make some reasonable inferences about what must have occurred.
(1) Precursor nucleobases and sugar backbones to genes
RNAs are nucleic acids composed of long sequences of ribonucleotides. The consensus view among scientists is that RNA is likely to have been one of the earliest genetic substrates. However, RNA was not the only component of the primordial soup. Many alternative nucleobase precursors and sugar backbones existed alongside the ribonucleotides built from the canonical bases (adenine, uracil, cytosine, and guanine). Why were RNA and the four canonical nucleobases eventually selected instead of others? This is still an area of active research, but we can nevertheless draw some preliminary conclusions.
RNA has several properties that make it an especially interesting candidate as the first genetic substrate: RNA is more thermodynamically stable than DNA; RNA forms complex secondary structures, due to a high propensity for both intramolecular and intermolecular base pairing; and RNA has the ability to form enzymatic structures, including autocatalytic ones, known as ribozymes. As described by Kim et al. (2021), features of RNA copying chemistry, which is "selfish" in the sense that other sugar backbones can be used as scaffolding for nonenzymatic RNA extension, likely gave RNA an inherent advantage over alternative systems. What remains substantially less clear is why stable, noncanonical ribonucleotides, such as inosine and 2-thiopyrimidines, were not selected as nucleobases. It seems plausible that the selection of canonical ribonucleotides was either deterministic (in the sense that adenine, uracil, cytosine, and guanine had some significant advantage due to their specific properties) or genuinely stochastic (in the sense that a prebiotic world with conditions similar to primordial Earth may have settled upon a different set of functional nucleobases). In either case, precursor nucleobases can be considered competitors with one another, and nucleobases which form hydrogen bonds with one another are analogous to mutualist species. Although many chemically viable combinations of nucleobase pairs could have emerged from the primordial soup, two pairs (A-U and G-C) emerged as the winners. Given that at least two pairs were likely necessary to achieve the complexity required for gene formation, it is tempting to speculate that genes emerged as soon as minimum viable complexity for gene formation was achieved, and since even simple genes could presumably outcompete complex proto-genes, there was no opportunity for evolution to incorporate additional nucleobases into the genetic code.[9] Indeed, recent advancements in synthetic biology have seen the successful introduction and faithful replication of artificial base pairs, perhaps suggesting that alternative natural base pairs could have been viable, but were at some point driven to extinction.
(2) Genes to cells
It seems highly likely that the earliest cells were rather different than the phospholipid bilayer-enclosed sort that are familiar to us today. In particular, some scientists have proposed that the first cells could have been entirely abiotic, existing as microscopic pores in ocean rocks. Cells have several features that make them seemingly indispensable for biological life: (1) cells provide an enclosed space where biologically-important substrates can accumulate, facilitating enzymatic activity; (2) cells provide genes a mechanism for retaining the products that they help create, ensuring that fit genes are rewarded for the information they carry; (3) cells keep out toxins and outside genetic information. Even if the first cells did not accomplish all of these functions with much efficiency, one can easily imagine scenarios in which cellular barriers became incrementally more sophisticated over time.
One imagines that early gene replication was exceedingly slow, and that early genes may have competed with each other for valuable cellular space by simple displacement. As Richard Dawkins argues in The Selfish Gene, genes are the fundamental unit of heritable information upon which natural selection acts (with memes as a cognitive/cultural analog). Minimally complex but highly proliferative genes must have been an initial barrier to the development of further complexity. Initial genes would not have had stringent proofreading activity, however, and error-prone copying mechanisms helped accelerate the formation of novel structures.[10] The solution that all cells ultimately adopted was for genes to join together to form genomes and to replicate via DNA polymerases with good proofreading activity. By being linked together as part of the same molecule, genes link their fates together, giving each a strong structural incentive to help the others, so that all of them can get replicated. The DNA polymerase, which replicates DNA regardless of sequence, is an impartial mechanism to facilitate such coordination.[11] Genes which copy themselves too eagerly become deleterious for their hosts, and have therefore been strongly selected against. Only cellular organisms that have developed reliable means of genome replication have managed to survive in the long term.[12]
(3) Cells to multicellular organisms
If time is a good indicator, coordination of cells into multicellular life was certainly the most challenging obstacle to increasing complexity that life on Earth has successfully overcome. Current estimates are that cells evolved around 300 million years after liquid water first appeared on Earth's surface. These microorganisms evidently had a very hard time working together, however, as multicellular organisms did not emerge for another ~2.5 billion years. Why was the coordination problem so seemingly easy for genes and so difficult for cells?
Although genes likely had strong and immediate incentives for working together (at least as soon as they were part of the same genome), it took considerably more time for conditions to arise where cells had the same converging incentives. Even in the absence of multicellular organisms, bacteria can form complex collaborative structures, like biofilms, but one imagines the Archean Eon of Earth's history as one of especially brutal competition between microorganisms, whose incentives for rapid division and niche exploitation may have precluded further complexity via "race-to-the-bottom" incentive structures. During this period, it is notable how many ecological niches (i.e. those that would later be occupied by eukaryotes) continued to go unexploited.
In particular, mitochondria seem to have been essential to coordinating multiple cells into a single organism. One of the leading hypotheses about how mitochondria formed an endosymbiotic relationship with eukaryotic cells is the hydrogen hypothesis,[13] whereby the bacterial ancestor of mitochondria provided hydrogen and carbon dioxide in exchange for a spot within an archaeal host. The end result was a form of mutualism, facilitated by the fact that mitochondria and their archaeal hosts had complementary nutritional requirements. Each provided an environment where the other could optimally thrive, while simultaneously offering the other something it was not capable of producing autonomously. Mitochondria were not the only endosymbionts to form a mutualistic relationship with their hosts, as plastids (including chloroplasts) arose independently. Although it is unclear precisely how many independent endosymbiotic events must have taken place to give rise to the currently existing diversity of cellular organelles, phylogenetic evidence suggests the number could be very low, with perhaps as few as two ultimately successful, primary endosymbiotic events occurring during Earth's evolutionary history (once for plastids and once for mitochondria).
The mutualism between mitochondria and early eukaryotes was so successful that it likely generated excess resources of the sort that Scott Alexander refers to in his essay, Meditations on Moloch. Moreover, mitochondria played a critical role in the development of apoptosis (aka "cell suicide"), whereby infected, diseased, or dysfunctional cells kill themselves for the benefit of the entire organism. This extreme degree of selflessness appears to have been an essential condition for the emergence of multicellular life, and intuitively it makes sense that it would have taken a long time to evolve. In an optimization regime for inclusive genetic fitness, what are the circumstances in which killing oneself would be an optimal choice? Only in circumstances in which one could be assured that one's death was of benefit to one's genes. Intentional cell death remains one of life's most incredible innovations: it is a striking example of superficially perverse behavior, something a naïve observer might expect to violate the rules of a given optimization regime, but which actually fulfills them in an unexpected way.[14]
In many ways, multicellular organisms resemble well-functioning states. Cancerous cells have the potential to rapidly divide, consume essential resources, and kill their host organisms. Likewise, infected cells can become incubators for pathogens, resulting in danger to their hosts. In order to overcome these obstacles, cells are primed for apoptosis and immune cells police the body should dangerous conditions arise. There are a variety of reasons for dysfunction in such systems, but two general principles can be observed:
Part 4 — Thought experiment about biological superintelligence
Hypothetically, consider the following two paths to create biological superintelligence.
Superficially, there are many obvious differences between these two scenarios. In Path 1, the mouse neurons we are selecting are already organized into an intelligent system, such that one can readily imagine ways in which intelligence might be tested and optimized. As our mice grew more intelligent, we would necessarily need to scale our tests in such a way as to adequately gauge performance. In practical terms, there are other factors we would need to consider, such as the fact that certain genetic routes to increased intelligence might come at a cost to viability, but with sufficient numbers of generations and genetic diversity, one could very conceivably create a biological superintelligence using such a method. The brain structures in Path 1 would be altered over time, perhaps radically so, but common features (presumably those relating to physiological functions which are not under optimization pressure) might well be conserved over time.
Initially, Path 2 seems to be at a considerable disadvantage compared to Path 1. Randomly seeded, unstructured neurons would not be expected to have any significant degree of intelligence at all. Even inventing tests to evaluate intelligence at this level (perhaps involving electrochemical gradients?) would be a nontrivial challenge. For the sake of argument, assuming such difficulties in the optimization procedure could be overcome, what sort of cognitive architecture might emerge? More generally, we can conceptualize Path 2 as being a path from a high-entropy state, where disorganized neurons produce no intelligent output, to a lower-entropy state, where cellular structures like cortices emerge. Because the Path 2 selection regime is very different from the evolutionary selection regime that formed the modern mouse brain, however, we would not expect Path 2 to yield brain regions identical to those found in mice.
Naïve expectations: I would expect that Path 1 results in a superintelligence that is more likely to be legible and empathetic to humans than one emerging from Path 2. Furthermore, I expect that the space of possible minds emerging from Path 1 is much narrower than the space of possible minds emerging from Path 2, and that this is related to the degree of starting entropy of neuronal organization. Finally, at very high levels of superintelligence with similar resource constraints, I would expect architectures emerging from both Path 1 and Path 2 to converge around principles of optimal design.
One of the theoretical assertions that this thought experiment illustrates is that a relatively small change in initial starting conditions could radically alter the space of minds available for development. By analogy, dumping a vast quantity of water into a river results in a relatively predictable trajectory (Path 1); dumping a large quantity of water on the summit of a continental divide results in a very large number of possible trajectories (Path 2). In both cases, assuming that it does not get stuck in a local minimum, the water will continue to flow downwards until it reaches the ocean (convergence).[16]
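The river analogy has a loose parallel in optimization: starting points clustered in a single basin tend to flow to the same minimum, while widely scattered starting points reach many. The sketch below runs plain gradient descent on an arbitrary bumpy one-dimensional function chosen purely for illustration; it is not meant to model neurons or any particular training procedure.

```python
import math
import random

# Toy illustration of the river / continental-divide analogy: gradient descent on a
# bumpy 1-D landscape. Starting points clustered in one valley (Path 1) reach few
# distinct minima; widely scattered starting points (Path 2) reach many.

def grad(x):
    # Derivative of the illustrative landscape f(x) = sin(3x) + 0.1*x^2.
    return 3 * math.cos(3 * x) + 0.2 * x

def descend(x, lr=0.01, steps=5000):
    for _ in range(steps):
        x -= lr * grad(x)
    return round(x, 1)                # round so nearby endpoints count as one minimum

random.seed(0)
path1_starts = [random.gauss(2.0, 0.1) for _ in range(200)]    # low-entropy initialization
path2_starts = [random.uniform(-10, 10) for _ in range(200)]   # high-entropy initialization

print("distinct minima reached, Path 1:", len({descend(x) for x in path1_starts}))
print("distinct minima reached, Path 2:", len({descend(x) for x in path2_starts}))
```

Under these toy assumptions, Path 1 ends in a single basin while Path 2 spreads across many, which is the intended reading of the water analogy above.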
Conclusion — AIs as animals, a useful paradigm?
Many of the ways in which AIs are emphatically not like animals will be obvious to the reader. AIs did not evolve in nature, do not have a metabolism, lack an embodied self, and are not necessarily autonomous agents. Moreover, treating agentic AIs like friendly canines comes with a considerable set of risks, as sufficiently powerful AIs will not have the same incentives as canines to cooperate with humans. Nonetheless, I found Stuart Russell's dog analogy to be a useful stepping stone to my own theoretical understanding of the types of risk involved. Humans evolved in a world where non-human intelligences (animal predators) posed a significant danger. Given the rapid development of AI, we will likely once again live in a world inhabited by potentially dangerous non-human intelligence. Invoking analogies to animal threats might be a useful tool to help communicate the magnitude of the danger we will soon confront, and to emphasize the ways in which large neural networks have properties that are categorically different from prior algorithms. After all, coordinating human behavior is not merely a matter of making statements with high truth value. We need to find ways to communicate that resonate with human psychology, and it may be worth reflecting more on the types of threats that existed in our ancestral environment.
There seems to be considerable reluctance among a substantial fraction of commentators to admit that a GPT-4-like large language model (LLM) could have intelligence of its own. This view seems to be premised upon at least one of the following assumptions: (1) that unprompted agency is an essential component of intelligence; (2) that if one is trained wholly with representational data about the real world, there is no way to have any cognizance of the real world (and therefore no intelligence); (3) that intelligence is substrate-specific, and/or LLMs are insufficiently complex to support the emergence of intelligence; (4) that it is impossible to distinguish behavior that seems intelligent from behavior that is actually intelligent due to the nature of the LLM-training material, and therefore we should assume the former; (5) that there is some other structural or scale-related handicap, possibly to do with recursive memory formation or imaginative capacity, that prevents modern LLMs from developing intelligence. My own view is that there are compelling reasons to disbelieve all of these assumptions, but I suppose it very much depends on one's definition of intelligence. As tech developers give LLMs more agentic roles in the coming years, I predict that many of these criticisms will become weaker over time.
In particular, Nick Bostrom's Superintelligence: Paths, Dangers, Strategies (2014) was especially formative to my views about superintelligence, and should be credited with some of the frameworks I am using to think about these issues. I take seriously the idea that agentic AIs should be viewed as strategic competitors to humans.
Interestingly, many bacteria have been shown to domesticate defective prophages, i.e. bacteria tolerate the insertion of virus genes into their own genomes, and leverage this material to their own advantage, gradually changing it over time. This phenomenon may be a key driver of bacterial complexity.
I am reminded of a quotation: "Growth for the sake of growth is the ideology of the cancer cell." ―Edward Abbey, The Journey Home: Some Words in Defense of the American West
This sounds a bit counterintuitive at first glance. In what sense could an unaligned AGI be safe? This statement heavily depends upon the premise that there are hard upper limits for levels of intelligence, such that AGI cannot (or empirically does not) recursively improve its own abilities or manipulate others into doing so. In such a scenario, cooperative/competitive mutualism with a permanently weak, albeit unaligned, AGI seems like a feasible outcome. One possible rationale for such an outcome is that an AGI only slightly more intelligent than humans would correctly reason that innovating greater levels of intelligence poses a potential existential risk to itself or whatever its final goals might entail.
Apparently, both of these incentives are strong enough to warrant some clever engineering solutions. Owls have evolved the ability to rapidly rotate their heads to compensate for their large blind spots. Meanwhile, pigeons have evolved a form of temporal stereoscopic vision: instead of using binocular vision to generate a stereoscopic model of the world, pigeons use two monocular images separated in time. That is why pigeons are constantly bobbing their heads back and forth.
I am aware of one qualitative phylogenetic tree of LLMs, as seen in this tweet by Yann LeCun. I am not aware of any efforts to develop a quantitative approach.
Unpredictability as a formal strategy has precedent in human geopolitics. The madman theory of foreign policy holds that it may be useful to act in ways that make one unpredictable to one's enemies. In games as simple as chess, playing random moves is of exceedingly limited benefit, as an opponent of sufficient skill would be able to fully interpret the consequences of such moves and exploit the vulnerabilities presented by non-optimal play. Operating in a space as complex as the real world, however, one can consider positions of maximal strategic ambiguity as a form of high ground during a conflict, wherein each party obtains an advantage if it can correctly predict the consequences of enemy actions. By existing in a space of strategic ambiguity, where many strategic trajectories remain viable, one imposes a greater compute cost on enemy simulations of one's future actions. Indeed, Move 78 of Lee Sedol's Game 4 against DeepMind's AlphaGo program did exactly this, imposing such high computational costs on AlphaGo that Lee Sedol obtained his first and only victory in the challenge match series.
Strictly speaking, some complex organisms modify RNA nucleosides for specific purposes, such as pseudouridine, but to my knowledge these are all post-transcriptional, not genetic, modifications. While scientists do not have a clear picture of the precise molecular path to gene formation, it is remarkable that only a small fraction of seemingly viable precursor nucleobases form the genetic basis for all currently known organisms.
There seems to be an interesting analogy to complexity on a larger scale here. As a thought experiment, suppose there are two intelligent civilizations: one is hyperconservative and traditional, making very few memetic changes with each successive generation; the other is liberal, valuing things like curiosity and experimentation, resulting in profound societal changes every generation and a rapid rate of technological progress. These two civilizations face different categories of existential risk: whereas the conservative civilization runs the risk that its successful model will cause overpopulation and resource depletion, the liberal civilization runs the risk of innovating some superior model which outcompetes it from within. To radically oversimplify, the conservative civilization is a bit like a gene with good proofreading activity that faithfully expands to fill the carrying capacity of its ecological niche; the liberal civilization is like a gene with poor proofreading activity that has the potential to find new ecological niches, but also to replace itself with more distant descendants.
The property of "content neutrality" or "impartiality" is an interesting thing to note in this context. From a transcendentalist perspective, some of the most effective human solutions to coordination problems also invoke this principle.
Intriguingly, some Streptococcus bacteria are believed to increase their own rate of mutagenesis during times of cellular stress. The evolutionary logic here is that if the host organism is likely doomed, it may as well try a variety of different survival approaches, even if most of them will be more harmful than helpful.
This hypothesis, and many other surprisingly broad insights, are discussed at length by Nick Lane in his wonderful science book, Power, Sex, Suicide: Mitochondria and the Meaning of Life.
An important distinction that has been repeatedly pointed out by others is that intelligence-facilitated culture (i.e. the outcome of a selective process that runs on memes, in the Dawkinsian sense) is not identical to genetic optimization. In systems that have intelligence, authentically perverse behavior (e.g. the manufacture and use of condoms) can spontaneously emerge with respect to a given genetic optimization regime.
Although the frontispiece of Thomas Hobbes' Leviathan predated the cell theory of life, many of Hobbes' conclusions about the necessary conditions of state formation apply to the conditions required for multicellular cooperation.
This idea has a clear parallel in stochastic gradient descent, but this post refers specifically to optimizing the structural architecture required for intelligence.