5-minute summary
A better or equally good model is open-sourced[3] AND a successful, explicit attempt at replicating GPT-3 is completed: 23 months (equally good model[4])
Table 1: Key facts about the diffusion of GPT-3-like models
20-minute summary
Background
Key takeaways
It took two years until an actor that explicitly aimed to replicate GPT-3 succeeded and published about it—why was that?
Table 2: Key information from the language model case studies. Columns from left to right: the model name, the date the model was produced, the training compute used for the final model training run in FLOPs, the actual compute cost of the final model training run (in USD), the size of the core team that developed and evaluated the model,[19] and how the model compares in performance (in a broad sense) to GPT-3.[20] Most of the data are estimates and best guesses.[21] See the diffusion database for more information on what the data mean and my reasoning process.
It took two years until an actor that explicitly aimed to replicate GPT-3 succeeded and published about it (namely, Meta AI Research publishing OPT-175B). The only other explicit replication attempt I am aware of has not succeeded; this is the GPT-NeoX project by the independent research collective EleutherAI.[22] I identified nine other models that were GPT-3-like according to my definition, which were produced as early as May 2021 and throughout the subsequent year.[23] These do not appear to be explicit replication attempts, but they are evidence of which actors are capable of replication.
- $1.7M (90% CI: $1.5M–$2.3M) (reasoning)
- $6M (90% CI: $4M–$10M) (calculation)
Table 3: Comparison of estimated compute costs and core team sizes for GPT-3 vs. OPT-175B, which was intended to replicate GPT-3.
GPT-3 itself can be used and fine-tuned via an API. Despite this, there’s still demand for direct access to the model’s weights, and multiple similar models now exist that provide access to weights.
(Go to section)
What resources are required to actually use GPT-3-like models? How does the cost of deployment compare to the cost of development?
(Go to section)
See the table below for my insights. The upshot is that deployment is generally much cheaper than development. (EDIT 25-Feb-2023: I have now updated my view on this, see this comment.) I think this strengthens the case for focusing on model development rather than deployment, in terms of where to target interventions on diffusion.
- Compute cost: $160K (90% CI: $88K to $260K); talent: 5 professional software developers who have worked with ML projects, and 5 ML engineers who know how to run language models across multiple GPUs; labor cost: $250K
- Compute cost: $2M (90% CI: $260K to $8.4M); talent: 15 professional software developers who have worked with ML projects, and 5 ML engineers who know how to run language models across multiple GPUs; labor cost: $500K
- Compute cost: $2.6M (90% CI: $950K to $6.2M)
Table 4: Summary of deployment scenarios and the estimated requirements for them. Most of the talent requirement estimates and the final two compute cost estimates are very rough.
What have been the prevailing mechanisms and accelerating/hindering factors of diffusion for GPT-3-like models?
(Go to section)
What publication decisions have been made for projects to develop GPT-3-like models? What were the different rationales for these decisions, and what impact did the decisions have? How will publication decisions change in the future?
(Go to section)
Implications of my research for forecasting and beneficially shaping the diffusion of AI technology
(Go to section)
Questions for further investigation
(Go to section)
I recommend the following topics for further investigation. Feel free to reach out to me about these research ideas. I may be able to offer advice, suggest links, and suggest people to talk to. It's possible that I or Rethink Priorities could help connect you with funding to work on these ideas if you're interested and a good fit.
How confident should you be in my findings?
The following information is something like a long-form epistemic status. It's intended to help you calibrate your confidence in my conclusions throughout the sequence. However, I don't think it's important to read through this, because I explain my reasoning and indicate levels of uncertainty throughout the sequence.
Errata
Acknowledgements
This research is a project of Rethink Priorities. It was written by Ben Cottier. Thanks to Alexis Carlier, Amanda El-Dakhakhni, Ashwin Acharya, Ben Snodin, Bill Anderson-Samways, Erich Grunewald, Jack Clark, Jaime Sevilla, Jenny Xiao, Lennart Heim, Lewis Ho, Lucy Lim, Luke Muehlhauser, Markus Anderljung, Max Räuker, Micah Musser, Michael Aird, Miles Brundage, Oliver Guest, Onni Arne, Patrick Levermore, Peter Wildeford, Remco Zwetsloot, Renan Araújo, Shaun Ee, Tamay Besiroglu, and Toby Shevlane for helpful feedback. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.
See this cell of my diffusion database for why I think Gopher performs better than GPT-3—but I think this claim is uncontroversial.
The Model Card in Appendix B of the paper published in December 2021 (Rae et al., 2021, p. 49) states the "Model Date" is December 2020, and according to the paper introducing Model Cards (Mitchell et al., 2019), this is the answer to "When was the model developed?".
The model weights of OPT-175B are not immediately accessible to download by anyone on the internet. However, I estimate the weights can be downloaded by hundreds to thousands of ML researchers in academia, government, and industry labs, subject to approval. Meanwhile, the training code is open-source. See this section for more information.
OPT-175B just meets my threshold for “equally good”, as it had about 2% worse accuracy than GPT-3 on average, on the 14 benchmarks that were evaluated for both GPT-3 and OPT-175B (see this cell in my diffusion database). The 2% threshold is somewhat arbitrary. I chose that threshold partly because the OPT paper itself claimed (e.g. in the Abstract) that OPT-175B performance is “comparable” to GPT-3 (Zhang et al., 2022), and partly based on my not-well-informed intuition regarding how a 2% performance difference over a few benchmarks would affect the overall usefulness of the language model.
Edited to add 2023-Jan-23: It has been pointed out that given what we know now, OPT-175B is very likely to have significantly worse capability overall compared to GPT-3. The same is true of BLOOM, which I discuss elsewhere. Due to time constraints, I have opted to leave the report as-is, which reflects my best guess at the original time of writing.
Throughout this sequence, “GPT-3” refers to the original 175-billion-parameter model that was first described in Brown et al. (2020), except in cases where I say “GPT-3” in the context of using the OpenAI API, in which case I’m referring to the latest version of the “Davinci” model provided by the API.
I have found "diffusion" and "proliferation" to be competing terms with similar meanings in the context of AI risk. I chose the term “diffusion” because it seems less value-laden. One reason that "proliferation" has been used seems to be the analogy to nuclear non-proliferation. I think this analogy can be useful, but I want to avoid drawing too strong an analogy. Although much of my motivation for writing this sequence is that diffusion of AI could increase AI existential risk, I don’t think that AI diffusion is categorically bad, and I don’t want to imply that current state-of-the-art AI technology is as dangerous as nuclear weapons.
I define replication as: a machine learning model is trained from random initialization with an identical architecture to some existing model, and this new model has a performance differing by no more than 2% on the same benchmarks to the original model. The 2% is somewhat arbitrary. I am uncertain what the best threshold is. But the impression I gained from my case studies is that a 2% difference is small enough to call the actor “capable” of replication.
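To make the performance part of this criterion concrete, here is a minimal sketch in Python of how the 2% check could be operationalized, using made-up benchmark scores. It covers only the performance comparison, not the requirement of an identical architecture trained from random initialization, and it reads the threshold as an average difference of 2 percentage points over shared benchmarks, which is my interpretation.

```python
def is_replication(original_scores, new_scores, threshold=0.02):
    """Check the performance side of the replication definition.

    Scores are accuracies in [0, 1], keyed by benchmark name. Only
    benchmarks reported for both models are compared, and "2%" is
    read as an average difference of 2 percentage points.
    """
    shared = sorted(set(original_scores) & set(new_scores))
    if not shared:
        raise ValueError("No shared benchmarks to compare.")
    avg_original = sum(original_scores[b] for b in shared) / len(shared)
    avg_new = sum(new_scores[b] for b in shared) / len(shared)
    return abs(avg_original - avg_new) <= threshold

# Made-up scores, for illustration only (not actual benchmark results).
original = {"lambada": 0.76, "hellaswag": 0.79, "winogrande": 0.70}
candidate = {"lambada": 0.74, "hellaswag": 0.78, "winogrande": 0.69}
print(is_replication(original, candidate))  # True: ~1.3 points lower on average
```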
I define transformative AI qualitatively as “potential future AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution.” This is how Holden Karnofsky originally defined the term in 2016 (Karnofsky, 2016).
See Figure 1(a) on p.2 of Zeng et al. (2022).
It’s plausible that there have been secret projects undertaken by the intelligence community to develop GPT-3-like models, but I am not aware of any, and I haven’t tried to assess the likelihood of this.
By “largest viable deployment”, I mean the way of deploying a GPT-3-like model that has the highest volume of model outputs that (a) would be possible for at least one actor to do by now if they tried, and (b) is worth the cost in light of the actor’s goals (which may or may not be profit). For example, a large-scale automated disinformation campaign.
Throughout this sequence, I use “90% CI” to refer to my 90% confidence interval for an estimated value. That is, I believe it is 90% likely that the value lies between the first number (the lower bound) and the second number (the upper bound). I generated 90% confidence intervals in various ways depending on how I estimated the value and what evidence was available. In cases where there was no clear way to estimate a 90% CI systematically, I thought about what 90% CI I would be willing to bet on.
Sponsorship of compute resources could involve an actor doing any of the following things: (a) giving another actor ownership of compute hardware, (b) giving another actor access to compute hardware, (c) giving another actor money that can only be used on compute, or (d) giving another actor money with the intention that it is used for compute. Only (b) and (c) occurred in my case studies.
In this sequence, for simplicity, I treat Google as a single actor. But this is debatable; there are multiple teams at Google that research and develop language models (e.g., under the names “Google Brain” and “Google Research”), and these teams seem to have significant autonomy over publication decisions.
When I say “suboptimal” I don’t just mean that those developers should become even more closed than the default. I don’t have a precise idea of an optimal strategy, but I think it involves being more discerning about which artifacts are diffused, at what time, and at what level of publicity.
For the sake of promoting ideas I believe to be more neglected, I do not discuss interventions for compute in this work, because they have been explored in previous work. See e.g., the section on “Compute Governance ideas” in “Some AI Governance Research Ideas” (Anderljung & Carlier, 2021).
I conceptualize diffusion in a fuzzy way, meaning that an actor doesn’t have to acquire the exact same artifact for it to count as diffusion. An actor merely needs to acquire some artifact that serves roughly the same function as some existing artifact. For example, an actor can “replicate” a machine learning model in the sense that they train a new model from scratch that is functionally very similar to the original model.
While other mechanisms such as theft are harder to identify (because thieves usually don’t want to be known), I am still 80% confident that open publication, replication, and incremental research are the most common and most important mechanisms in my case studies. I argue this in detail in this section.
“Core team” size was based on author contributions listed in a paper. I counted people that I judged to be directly involved in producing the result (mainly the trained model). "Producing the original result" includes evaluation of the trained model on standard performance benchmarks, but not evaluation of social fairness, bias, toxicity, or broader impacts, nor pure discussion about the model. The reason is that the latter things are not directly relevant to producing the model or advancing general capabilities.
Where possible, I assess performance difference by comparing common benchmark metrics reported in the research papers for these models. However, not all of the same benchmark results are reported, and the results are not always presented in the same way (e.g., a table in one paper, but only a plot in another paper). Furthermore, models are not perfectly comparable in performance because they differ in their training compute, datasets, and model sizes—the training data in particular could differ drastically in terms of which languages are most represented and therefore which downstream tasks the model performs best at. So I also draw on other evidence to get a broad sense of “capability”, e.g., training compute, dataset size, data diversity, model size, and how the paper summarizes a model’s performance in comparison to other models.
The widths of my 90% confidence intervals in “Model date” are between one and eight months—with Gopher, OPT-175B, and BLOOM on the one-month end.
My 90% CI for training compute generally spans from 0.5 times my central estimate to two times my central estimate. This is based on the range of empirical differences found in a 2022 analysis by the AI forecasting organization Epoch (Sevilla et al., 2022). In Sevilla et al. (2022), 95% confidence intervals for the compute trends were derived from bootstrap sampling. In the sample, each compute estimate was adjusted by a random factor between 0.5 and 2 “to account for the uncertainty of [the] estimates”, based on the same analysis (see p.16). This informed my decision to use the 0.5–2x range as a 90% CI, but I also used my own intuitive judgment of the appropriate level of uncertainty.
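As a rough illustration of this kind of multiplicative uncertainty, here is a minimal sketch in Python that turns a central compute estimate into 0.5x–2x interval bounds and draws a randomly adjusted estimate. The log-uniform sampling is an assumption I make here so that halving and doubling are equally likely, not necessarily what Sevilla et al. (2022) did.

```python
import math
import random

def multiplicative_interval(central, lower_factor=0.5, upper_factor=2.0):
    """Turn a central estimate into interval bounds via multiplicative factors."""
    return central * lower_factor, central * upper_factor

def sample_adjusted_estimate(central, lower_factor=0.5, upper_factor=2.0):
    """Adjust an estimate by a random factor between 0.5 and 2.

    Sampling uniformly in log space (an assumption made here) makes
    halving and doubling equally likely.
    """
    log_factor = random.uniform(math.log(lower_factor), math.log(upper_factor))
    return central * math.exp(log_factor)

central_flop = 3.1e23  # roughly the training compute reported for GPT-3
print(multiplicative_interval(central_flop))   # approximately (1.55e+23, 6.2e+23)
print(sample_adjusted_estimate(central_flop))  # one random draw
```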
My 90% CI for training cost generally spans from 0.4 times my central estimate to 2.7 times my central estimate.
My 90% CI for core team size is generally +/- 50% of my central estimate.
I’m generally about 80% confident in my claims about whether a given model’s capability level is worse than, similar to, or better than GPT-3’s, if you accept my broad notion of a model’s capabilities. (See the cell notes for that column in the database for more information.)
EleutherAI didn’t succeed in the sense that their best model to date, GPT-NeoX-20B, only used 5.0E+22 FLOPs of compute (compared to 3.1E+23 for GPT-3) and performed significantly worse.
Two of my case studies (PanGu-alpha and GPT-NeoX-20B) are not GPT-3-like according to my definition.
As noted above, it’s plausible that there have been secret projects undertaken by the intelligence community to develop GPT-3-like models, but I am not aware of any, and I haven’t tried to assess the likelihood of this.
This is based on both my intuition and one expert’s opinion. I estimated that the total compute budget for a project is 3 times (90% CI: 2–5 times) the final training run cost, to account for trial and error and preliminary experiments.
A researcher at an industry AI lab, who has worked with large language models, told me (paraphrasing): “It wouldn't shock me if an OOM more compute was required for investigating relative to the final training run, if learning how to develop the model from scratch. Maybe less than that…Almost certainly at least a 50-50 split in compute cost between preliminary work and the final training run.” My own read on this is that a 50-50 split seems feasible for a top language model developer, while an order of magnitude difference might correspond to an amateur group attempting replication without prior experience in training language models at the scale of billions of parameters or more.
One piece of potential countering evidence to this is that EleutherAI (who started closer to the latter scenario) only spent ~1/3 of their total compute on preliminary work and 2/3 on training for GPT-NeoX-20B (based on hours of compute expenditure reported on p.23 of Black et al. (2022)). However, Stella Biderman (one of the leading contributors to the GPT-NeoX project, as well as BLOOM) indicated that this was an exceptional case, because the GPT-NeoX team already gained a lot of experience training smaller but similar models (e.g., GPT-J), and they “probably got a bit lucky” with successfully training GPT-NeoX-20B with so little testing. (Notes on my correspondence with Stella Biderman are available upon request in this document.)
Ultimately, since I’m estimating costs for major industry developers (OpenAI and Meta AI Research), I leaned toward the 50-50 end of the range, choosing 2x as the lower bound. I intuitively chose the upper bound as 5x, because that is roughly halfway between 2x and a full order of magnitude (i.e. 10x) on a logarithmic scale. I also chose the central estimate of 3x based on a logarithmic scale.
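To illustrate the arithmetic, here is a minimal sketch in Python applying the 3x central multiplier and 2x–5x bounds to a hypothetical final training run cost of $5M (a made-up figure). Note that 3 is roughly the geometric mean of 2 and 5, which is what choosing the central estimate on a logarithmic scale amounts to.

```python
def total_project_compute_cost(final_run_cost, central_mult=3.0, low_mult=2.0, high_mult=5.0):
    """Scale a final training run cost up to a total project compute budget.

    The multipliers reflect the assumption that trial and error plus
    preliminary experiments cost a further 1x-4x the final run itself.
    Note that 3 is roughly the geometric mean of 2 and 5 (sqrt(10) ~= 3.2).
    """
    return {
        "central": final_run_cost * central_mult,
        "90% CI low": final_run_cost * low_mult,
        "90% CI high": final_run_cost * high_mult,
    }

# Hypothetical final training run cost of $5M, for illustration only.
print(total_project_compute_cost(5e6))
# {'central': 15000000.0, '90% CI low': 10000000.0, '90% CI high': 25000000.0}
```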
The number of people directly involved in producing the original result. "Producing the original result" includes evaluation of the trained model on main performance benchmarks, but not evaluation of social fairness, bias and toxicity, nor broader impacts, nor pure commentary about the model. The reason is that the latter things are not directly relevant to producing the model or advancing capabilities.
Note that there are also other (perhaps stronger) reasons to focus on the model development stage.
Firstly, the forms of diffusion that help actors develop models push AI progress forward more than the forms of diffusion that help actors deploy models. Pushing AI progress forward is what shortens AI timelines and thereby increases AI existential risk.
Secondly, a lot of AI existential risk comes from misaligned power-seeking AI rather than misuse by humans. I expect that reducing diffusion of deployment would have a smaller effect on this source of risk.
Compute cost estimates are just based on cloud compute prices, and exclude the cost of other hardware such as a laptop to set up the cloud computing instance.
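For context, here is a minimal sketch in Python of the kind of cloud-price calculation that such compute cost estimates rest on: divide total training FLOP by effective hardware throughput to get GPU-hours, then multiply by an hourly rental price. The throughput, utilization, and price below are hypothetical placeholders, not necessarily the figures or the exact method behind my estimates.

```python
def cloud_training_cost(total_flop, peak_flop_per_sec, utilization, price_per_gpu_hour):
    """Rough cloud cost: FLOP / effective throughput -> GPU-hours -> dollars."""
    effective_flop_per_sec = peak_flop_per_sec * utilization
    gpu_hours = total_flop / effective_flop_per_sec / 3600
    return gpu_hours * price_per_gpu_hour

# Hypothetical inputs for illustration only: 3.1e23 FLOP (roughly GPT-3 scale),
# a ~100 TFLOP/s accelerator, 30% utilization, and $2 per GPU-hour.
cost = cloud_training_cost(3.1e23, 1e14, 0.3, 2.0)
print(f"${cost:,.0f}")  # roughly $5.7M with these placeholder numbers
```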
By “direct” I mean the people and skills that are required to set up the model and keep the model running in the deployment setting, excluding people that maintain software dependencies (e.g. PyTorch), or people that give advice on how to do deployment.
This means the deployment with the highest volume of model outputs that (a) would be possible for at least one actor to do by now if they tried; (b) is worth the cost—not necessarily in terms of financial revenue, but in achieving the actor's goal. See this Guesstimate model for calculations (the method is also explained in the main text).
Meaning research that makes a relatively small change to an existing method.
The link to this model is currently broken. I am working on a fix.
I have three criteria for favorability: beneficence, competence (including competence at safe and responsible AI development), and receptiveness to ideas and criticism.
I think the human input of talent is also very important but I haven’t thought about it enough to provide a clear takeaway.
See e.g., the section on “Compute Governance ideas” in “Some AI Governance Research Ideas” (Anderljung & Carlier, 2021).
This idea is inspired by the idea of “structured access” to AI systems; see Shevlane (2022).
I define transformative AI qualitatively as “potential future AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution.” This is how Holden Karnofsky originally defined the term in 2016 (Karnofsky, 2016).
The 2% is somewhat arbitrary. I chose that threshold partly because the OPT paper itself claimed (e.g. in the Abstract) that OPT-175B performance is “comparable” to GPT-3 (Zhang et al., 2022), and partly based on my rough intuition of how much a 2% performance difference over a few benchmarks would affect the overall usefulness of the language model.