This post is one part of the sequence Understanding the diffusion of large language models. As context for this post, I strongly recommend reading at least the 5-minute summary of the sequence.
Definitions
Here I explain what various terms mean in the context of the diffusion of AI technology and this sequence.
Transformative AI (TAI)
I define transformative AI (TAI) qualitatively as “potential future AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution.”
This is how Holden Karnofsky originally defined the term in 2016 (Karnofsky, 2016). Luke Muehlhauser elaborated on the definition in 2019, emphasizing that “the transition immediately following the (‘first’) industrial revolution was a much larger change in [measures] of human empowerment and well-being than other events in recorded history, including the Black Death of the 14th century, the Scientific Revolution of the 17th century, the two world wars of the 20th century, and the ‘2nd and 3rd industrial revolutions’ implied by discussions of a potential future ‘4th industrial revolution’” (Muehlhauser, 2019).
Artifact: a product of AI R&D
An artifact is a product of AI research or development. Examples of artifacts include a trained model, the code for training a model, a new dataset, or a new algorithm. Artifacts are what diffuse among actors.
Diffusion: the spread of artifacts among actors
Diffusion occurs when an actor acquires an existing artifact. For example, suppose an AI lab trains a machine learning model and makes it freely available to download from the internet. The model may then diffuse rapidly and widely to any individual who is able to download and run it on their computer.
Taxonomy of diffusion mechanisms—e.g., open publication, replication
A diffusion mechanism is the final means by which an artifact is acquired by some actor, given that the artifact was already possessed by some other actor first.[1] (Note that acquiring the artifact doesn’t necessarily mean possessing the original copy of the artifact; it can be copied or reproduced.)
Possible mechanisms include:
In addition to the causal mechanisms of diffusion above, there is the phenomenon of multiple discovery. Multiple discovery occurs when two actors independently arrive at the same idea or result, usually at least partly by drawing on the same prior work as each other. Considering the role of multiple discovery thus helps in judging the counterfactual impact of decisions that affect diffusion.
Accelerating and hindering factors in diffusion
GPT-3-like model: similar to GPT-3 in design and purpose, with similar or better performance
A GPT-3-like model is a densely activated[4] neural network model that was trained to autoregressively predict text using at least 1E+23 floating-point operations (FLOPs) of compute[5], and enough data to approximately follow established scaling laws[6] for autoregressive language modeling.
The threshold of 1E+23 FLOPs of training compute in this definition serves as a proxy for how powerful the trained model is. Of course, arbitrary amounts of compute can be wasted, and training compute does not directly translate into model performance, so this is a flawed proxy. Nonetheless, I think it is a good enough proxy in the domain of large pretrained language models for the following reasons:
Despite reason 3, I still make efforts to compare the overall capability of potential GPT-3-like models—see the “Capability compared to GPT-3” column in the diffusion database.
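To make the training-compute criterion concrete, here is a minimal sketch of how it could be checked, assuming the common approximation that training a dense transformer costs roughly 6 FLOPs per parameter per training token (C ≈ 6ND). The function names and numbers are illustrative only, not part of the definition itself.

```python
def estimate_training_compute(n_params: float, n_tokens: float) -> float:
    """Rough training compute in FLOPs, using the common C ~ 6 * N * D approximation
    (N = parameter count, D = number of training tokens)."""
    return 6 * n_params * n_tokens


def meets_gpt3_like_compute_threshold(n_params: float, n_tokens: float,
                                      threshold_flops: float = 1e23) -> bool:
    """Check only the training-compute part of the 'GPT-3-like model' definition."""
    return estimate_training_compute(n_params, n_tokens) >= threshold_flops


# GPT-3: ~175 billion parameters trained on ~300 billion tokens (Brown et al., 2020).
gpt3_compute = estimate_training_compute(175e9, 300e9)
print(f"{gpt3_compute:.2e} FLOPs")  # ~3.15e+23, close to the ~3.14e+23 estimate in Brown et al. (2020, Table D.1)
print(meets_gpt3_like_compute_threshold(175e9, 300e9))  # True
```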
Pretraining and fine-tuning in machine learning
A model that is trained to autoregressively predict text from a relatively large and broad dataset is often called “pretrained”. This is because the model is trained with the intention of training it further later, on a more specific dataset, to improve its performance at a specific task. Updating the model’s weights based on such a more specific dataset is usually called “fine-tuning.” But GPT-3 showed that pretrained language models can also exhibit in-context learning: a model can learn from information (e.g., examples of a new task) contained in the text input to the model, without any change to the model weights.[10] Providing such input is often referred to as “prompting”.
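As a concrete illustration of prompting, the sketch below builds a few-shot prompt and feeds it to a small, freely downloadable pretrained model via the Hugging Face `transformers` library. This only shows the mechanics of in-context learning (no weight updates); a model as small as GPT-2 will handle the task far worse than a GPT-3-scale model would.

```python
from transformers import pipeline

# A few-shot prompt: the task and examples are specified purely in the input text.
prompt = """Translate English to French.

English: cheese
French: fromage

English: bread
French: pain

English: water
French:"""

# Any pretrained autoregressive language model could be substituted here; GPT-2 is
# used only because it is small and freely available. No fine-tuning takes place:
# the model's weights are unchanged, and any "learning" happens in context.
generator = pipeline("text-generation", model="gpt2")
output = generator(prompt, max_new_tokens=5, do_sample=False)
print(output[0]["generated_text"])
```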
What actually are the inputs to AI development?
I think of the four key inputs to AI development as compute, data, algorithmic insights, and talent. These inputs are crucial to understanding how AI development works—including diffusion—and how to influence it. In this section I explain what I mean when I refer to these inputs, along with some commentary on how they matter and relate to each other.
Compute
The basic definition of “compute” as a noun is simply “computation”.[11] However, my impression is that in the context of AI development, “compute” can refer to either of the following things:
When people talk about “better” compute, they are often referring to the efficiency of hardware (in FLOPs per second per hardware unit).
Hardware that is used for AI (such as data center GPUs) is also optimized partly for the rate of data transfer in bytes per second: both between a processor and its own memory (memory bandwidth) and between hardware units (interconnect bandwidth). The more that computation is parallelized over multiple hardware units, the more important these bandwidths become.
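To give a feel for how these quantities interact, here is a rough back-of-the-envelope sketch; the GPU count, peak throughput, and utilization figures are illustrative assumptions rather than measurements.

```python
def training_days(total_flops: float, n_gpus: int,
                  peak_flops_per_gpu: float, utilization: float) -> float:
    """Idealized wall-clock training time in days, considering only raw throughput."""
    effective_flops_per_second = n_gpus * peak_flops_per_gpu * utilization
    return total_flops / effective_flops_per_second / 86_400  # 86,400 seconds per day


# Illustrative assumptions: a GPT-3-scale run of ~3.14e23 FLOPs, 1,000 GPUs with
# ~3e14 FLOP/s peak throughput each (roughly an A100 at FP16), and 30% utilization.
# In practice, utilization depends heavily on memory bandwidth and interconnect,
# which is why "better compute" is not just a matter of peak FLOP/s.
print(round(training_days(3.14e23, 1_000, 3e14, 0.30)))  # ~40 days
```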
Data
“Amount of data used” can refer to either of the following things:
Data quality is also important for AI development. I’d define data quality as the combination of (a) how relevant the examples are to the capabilities that are desired for the model, and (b) how diverse the examples are.
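As a small illustration of how dataset size and one crude aspect of quality might be measured, here is a sketch; the whitespace tokenizer and exact-match deduplication are deliberate simplifications of what real data pipelines do.

```python
documents = [
    "The cat sat on the mat.",
    "The cat sat on the mat.",  # exact duplicate, a common data-quality problem
    "Large language models are trained on text scraped from the web.",
]

# Dataset size for language models is usually reported in tokens; splitting on
# whitespace is a crude stand-in for a real subword tokenizer.
n_tokens = sum(len(doc.split()) for doc in documents)

# Exact-match deduplication is one simple proxy for (one aspect of) data quality;
# real pipelines also use near-duplicate detection, filtering, and diversity checks.
unique_documents = list(dict.fromkeys(documents))

print(f"{n_tokens} tokens across {len(documents)} documents; "
      f"{len(unique_documents)} documents remain after exact deduplication")
```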
Algorithmic insights
I divide algorithmic insights into two broad categories:
Talent
Talent in the context of AI development can refer to (not mutually exclusive):
The cost of AI talent can be measured in the following ways (not mutually exclusive):
Of the other three inputs to AI development, I think talent is most closely related to algorithmic insights. However, talent is related to all other inputs in some way:
The relevance of AI diffusion to AI x-risk
AI timelines
Diffusion can hasten the arrival of TAI via knowledge sharing. The knowledge and artifacts that are shared by various AI developers can benefit the leading developer greatly, even though the leading developer is ahead overall. This is because the surface area of AI research is large, and throwing more resources at research does not produce insights at a reliable rate. The leading developer is therefore unlikely to be the first to discover every insight needed for TAI. By gaining shared knowledge sooner than they would on their own, the leading developer can reach TAI faster.
Who leads AI development by what margin
Diffusion can affect how large a lead the leading AI developer has and who the leading developer is, in two ways:
In these ways, diffusion tends to increase competitiveness on capabilities.[17] Increased competition tends to create an AI race, and decrease the willingness to pay the alignment tax.[18]
Multipolarity of TAI development and deployment
In addition to who the leading AI developer is and how much of a lead they have, AI diffusion also affects the number of actors who are close contenders, and therefore who might be able to take the lead in the future. The more actors that are close to the leader, the more multipolar AI development is. The speed and scale of diffusion will affect the ease with which actors can gain more resources and catch up on AI capabilities, which tends to increase multipolarity.
The multipolarity of TAI development also affects multipolarity in a post-TAI world. The more actors who possess power through TAI technology, the more multipolar the post-TAI world is. During a multipolar scenario, I think there is an increased risk of:
Multipolar scenarios in a post-TAI world seem more likely if:
In general, the more easily that AI technology diffuses, the more likely conditions 1 and 2 are.
Accidents vs misuse
Existential catastrophe from AI development or deployment could occur either via accidents with the AI (especially due to misaligned, power-seeking, highly capable AI systems) or via misuse of the AI by humans[20] (or perhaps also via “structural risk”; see Zwetsloot & Dafoe, 2019). Diffusion of AI capabilities increases the number of actors that can access those capabilities.[21] That can increase both types of risk, by increasing the chance that some actor uses those capabilities poorly.
I think that if existentially catastrophic misuse occurs, it will most likely result from the actions of a single large actor such as a government. However, it’s plausible to me that at some point in the future, due to diffusion, many actors would be able to misuse AI in a way that causes existential catastrophe—at least tens, but possibly thousands or more. Perhaps only the most malicious among these actors (e.g., a radical terrorist organization) would actually take a catastrophic action.
Overall, I think there is a higher total risk from the possibility of TAI systems being misaligned than from the possibility of existentially catastrophic misuse, so I think diffusion’s effects on the former are more important than its effects on the latter. For the sake of brevity, I do not argue this view further, so this should be treated as an assumption that underlies the focus areas and recommendations in this sequence.
Benefits of diffusion: scrutiny, AI alignment, and defense against misuse
There are some ways in which diffusion can be beneficial, including:
Overall opinion: diffusion is net bad by default, but can be net good if the right things are diffused carefully
I do not have thoroughly reasoned conclusions about the benefits and harms of diffusion. In this sequence I mostly leave that question for future work (see the Questions for further investigation post). My best guess is that if there were no change in publication norms from today, diffusion would lead to faster capabilities progress overall. This is because there are more actors focused on improving and applying capabilities than on scrutiny or alignment. This would in turn increase the likelihood of TAI systems being misaligned or misused. Furthermore, my intuition is that the offensive capabilities of AI will generally outweigh the defensive capabilities of AI, because using AI to detect and neutralize harms from AI generally seems more difficult than causing those harms.[22]
However, it’s crucial to realize that differential diffusion is possible—we can limit the diffusion of net-harmful artifacts while increasing the diffusion of net-beneficial artifacts. This idea is related to differential technology development, which “calls for leveraging risk-reducing interactions between technologies by affecting their relative timing.” I don’t think the best strategy would be to stop diffusion altogether, even if that were possible. Rather, the AI governance community should do more research on how to safely diffuse artifacts that boost responsible AI development, but that do not significantly increase risk. For example, we could diffuse tools that make it easier to apply best practices for deploying models. The question of which artifacts are best to diffuse, and how to diffuse them with minimum risk, is something I leave for further investigation.
Why focus on case studies of large language models?
In the previous section I explored how diffusion is relevant to risks from TAI. But TAI may be decades away, which makes it very difficult to forecast the ultimate effects of diffusion.[23] And as impressive as current state-of-the-art language models are, they still seem far from being TAI. So what is the relevance of studying the diffusion of (nearly) state-of-the-art language models today?
I see three main reasons:
To be even more specific, I focus on language models that are similar to GPT-3 because of their current relevance, and because they present an unusually good opportunity for case studies:
Having said all of that, I still think that these case studies only inform a small fraction of the total understanding needed to beneficially shape diffusion. Studying other models in other domains of AI would be helpful. The connection between narrow historical case studies and the key questions about TAI is also highly uncertain. See the questions for further investigation post for my ideas on how future work can best improve our understanding.
The scope of diffusion mechanisms and factors in this sequence
Diffusion database
Figure 1: A screenshot of part of the database
Overview
A core component of my research on large language model diffusion was this database:
Diffusion trends for GPT-3-like models
The database organizes information that helps understand how GPT-3-like models have diffused among different actors. The fields in the database are broadly categorized by:
A lot of information in the database is sourced from the Parameter, Compute and Data Trends in Machine Learning database (henceforth “the PCD database”).[29]
Case study selection process
To search for GPT-3-like models, I filtered the PCD database by the “Language” domain and “Training compute” greater than 1E+23 FLOPs, then manually checked which models were densely vs. sparsely gated (see footnote 4), and then manually prioritized the resulting list of models according to various other factors.[30] I subsequently heard from various people about other GPT-3-like models or attempts to produce them. I also asked Connor Leahy—who co-founded the independent research collective EleutherAI with an aim of replicating GPT-3—about GPT-3 replication attempts that he knew of. Finally, I reviewed the list of recent Chinese large language models obtained by Ding & Xiao (forthcoming). Beyond that, I did not actively search for more GPT-3-like models. Ultimately, I included not just “GPT-3-like models” but also models that seemed to result from a failed effort to produce a GPT-3-like model—namely, PanGu-α and GPT-NeoX-20B.
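For concreteness, the automated part of this filtering might look like the sketch below; the file name and column names are my guesses at the PCD database's schema and are purely illustrative.

```python
import pandas as pd

# Load a CSV export of the PCD database. The file name and column names here
# are illustrative assumptions, not the database's actual schema.
pcd = pd.read_csv("pcd_database.csv")

candidates = pcd[
    (pcd["Domain"] == "Language")
    & (pcd["Training compute (FLOPs)"] > 1e23)
]

# Dense vs. sparsely gated (e.g., Mixture-of-Experts) still had to be checked
# manually, so this filter is only the first step of the selection process.
print(candidates[["Model", "Organization", "Training compute (FLOPs)"]])
```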
I ended up studying nine models in detail, based on a priority rating that took into account:
The nine models are:
My research drew on various facts about other large language models, but I have not studied those models in as much detail as the models above. The most notable omission from the above is probably Megatron-Turing NLG, though I still studied that case somewhat.
How data was acquired
I acquired data in the database through a combination of:
How to view and use the database
There are no plans to maintain the database—help would be useful
I have not made concrete plans to maintain the database in the future. However, it is possible that I will maintain and extend the database if it remains relevant to my future work. Please get in contact with me if you are interested in maintaining or extending the database in some way. It's possible that I could help connect you with funding to work on this if you're interested and a good fit.
Acknowledgements
This research is a project of Rethink Priorities. It was written by Ben Cottier. Thanks to Alexis Carlier, Amanda El-Dakhakhni, Ashwin Acharya, Ben Snodin, Bill Anderson-Samways, Erich Grunewald, Jack Clark, Jaime Sevilla, Jenny Xiao, Lennart Heim, Lewis Ho, Lucy Lim, Luke Muehlhauser, Markus Anderljung, Max Räuker, Micah Musser, Michael Aird, Miles Brundage, Oliver Guest, Onni Arne, Patrick Levermore, Peter Wildeford, Remco Zwetsloot, Renan Araújo, Shaun Ee, Tamay Besiroglu, and Toby Shevlane for helpful feedback. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.
Later I define diffusion “factors”. These differ from diffusion “mechanisms” in that they are not the final means by which an AI research artifact is acquired. Rather, factors help a mechanism of diffusion to occur. For example, training a replica model can be the final means of acquiring the original model, but gaining access to more compute is not the final means of acquiring the model.
The 2% is somewhat arbitrary. I chose that threshold partly because the paper presenting the OPT-175B language model claimed (e.g. in the Abstract) that OPT-175B performance is “comparable” to GPT-3 (Zhang et al., 2022), and partly based on my rough intuition of how much a 2% performance difference over a few benchmarks would affect the overall usefulness of a language model.
See Gong (2021).
This means that the network is not sparsely gated. Being sparsely gated means that entire sections of the network are unused in each forward pass (e.g., see Section 2.1 of Lepikhin et al., 2020). Sparse gating is used in Mixture-of-Experts (MoE) models.
Training GPT-3 used an estimated 3.14E+23 FLOPs (see Table D.1 of Brown et al., 2020).
See Kaplan et al. (2020). For a more recent update to scaling laws, see Hoffmann et al. (2022). Either set of scaling laws is valid for this definition.
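For reference, the two sets of scaling laws take roughly the following forms; the constants are fit empirically in each paper, and this is a simplified summary rather than the papers' full results.

```latex
% Kaplan et al. (2020): test loss as a power law in (non-embedding) parameters N,
% dataset size D, or training compute C, when the other factors are not binding:
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\tfrac{C_c}{C}\right)^{\alpha_C}

% Hoffmann et al. (2022), "Chinchilla": a joint parametric form in N and D,
% with empirically fitted constants E, A, B, \alpha, \beta:
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```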
See Kaplan et al. (2020), Figure 1 on p. 3.
The test loss essentially measures a language model’s ability to predict the next token in a sequence, where lower loss is better.
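Concretely, the test loss is the average negative log-probability that the model assigns to each actual next token in held-out text (i.e., the cross-entropy loss):

```latex
\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```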
See the estimation methods presented in Sevilla et al. (2022).
See Abstract of Brown et al. (2020): “For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.”
See the definition of “compute” in The Free Dictionary.
See Sevilla et al. (2022).
Storing GPT-3’s 175 billion parameters in memory, even in the smallest floating-point format typically used as of 2022 (FP16), would take up roughly 175B * 16 bits / 8 bits per byte = 350 gigabytes. Meanwhile, the best machine learning GPU on the market in terms of FLOP/s throughput and GB/s memory bandwidth (to my knowledge) is the NVIDIA A100, which has 80 gigabytes of memory.
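That arithmetic, as a minimal sketch (this counts only the memory for the weights themselves; serving a model also needs memory for activations and other overheads, which is ignored here):

```python
def weight_memory_gb(n_params: float, bits_per_param: int = 16) -> float:
    """Memory required just to store the model weights (FP16 = 16 bits per parameter),
    ignoring activations, optimizer state, and other overheads."""
    return n_params * bits_per_param / 8 / 1e9  # bits -> bytes -> gigabytes


gpt3_weight_memory = weight_memory_gb(175e9)  # ~350 GB at FP16
a100_memory_gb = 80                           # largest A100 memory configuration
print(gpt3_weight_memory, gpt3_weight_memory / a100_memory_gb)
# 350.0 GB, i.e. more than 4 such GPUs are needed just to hold the weights
```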
See Hobbhahn and Besiroglu (2022).
See Brown et al. (2020, Table 2.2, p. 9).
See the Merriam-Webster definition of “draft”.
A counterpoint to this is that the diffusion of AI models, or APIs for models, can reduce the incentive to compete if an actor’s aim is just to use current AI capabilities. If a model is already available to use, there’s less need to develop your own model. However, there will still be incentives for AI developers to remain competitive if they are competing for interests such as commercial sales or national security.
At least, insofar as AI alignment progress is not relevant to commercial success, national security, or other concerns besides the moral value of the long-term future.
This deployment could be either intentional or accidental, and the relevant actors could have been aware or unaware that the AI was powerful and/or misaligned enough to cause existential catastrophe.
There could also perhaps be a situation where multiple actors deploy AI systems that only collectively cause an existential catastrophe. This also seems more likely in multipolar scenarios.
On both of those points, see Shah’s (2020) summary of Critch et al. (2020). On the latter point, see also Clifton (2021).
This misuse could be deliberately aimed at causing existential catastrophe, but need not be. For example, an actor might use AI to lock in some set of values they think are good, or to develop a powerful (non-AI) weapon system for deterrence purposes, and this could then end up resulting in existential catastrophe.
This is subject to the actors being able to afford deployment, which I discuss elsewhere in the sequence.
Shevlane & Dafoe (2020) provide relevant insights on this issue.
The Metaculus community currently predicts “Weakly General AI” by 2029, with a 25% chance of occurring after 2039 (from ~1700 predictions) ([Anthony], 2020). Meanwhile, Metaculus’s median prediction for the “Date of Artificial General Intelligence” is 2043 (from ~700 predictions) (Barnett, 2020). I think that both of these definitions of a general AI system are plausibly sufficient for TAI (especially the definition of “Artificial General Intelligence”).
See also Karnofsky (2021).
See this draft report by Lennart Heim (requires access). This forecast is lower than the one in Lohn and Musser (2022, p. 13) because it uses different (more reliable) trends of compute doubling times and GPU price performance.
From Brown et al. (2020): “GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.”
Many tasks would require that the AI system be paired with software that allows it to interface with other software and the real world. For example, an AI chatbot model needs additional software to show an interface to the user, process the user’s input text and feed it to the model, and display the model’s output text back to the user. But I think this interfacing software would be cheap and easy to implement compared to the AI system itself—it does not seem like the bottleneck to widespread automation.
There seem to be stronger and weaker versions of this hypothesis held by various people. A weaker version might substitute “the same basic machine learning paradigm” for “essentially the same methods used for training state-of-the-art AI models today.” The word “essentially” here can also hide a lot of nuance.
I argue this in detail in this section.
CC-BY Jaime Sevilla, Pablo Villalobos, Juan Felipe Cerón, Matthew Burtell, Lennart Heim, Amogh B. Nanjajjar, Anson Ho, Tamay Besiroglu, Marius Hobbhahn and Jean-Stanislas Denain.
See the “Reason for priority rating” column in the diffusion database.