I appreciate that you are putting thought into this. Overall I think that "making the world more robust to the technologies we have" is a good direction.
In practice, how does this play out?
Depending on the exact requirements, I think this would most likely amount to an effective ban on future open-sourcing of generalist AI models like Llama2 even when they are far behind the frontier. Three reasons that come to mind:
There are many, many actors in the open-source space, working on many, many AI models (even just fine-tunes of LLaMA/Llama2).
To clarify, I'm imagining that this protocol would be applied to the open sourcing of foundation models. Probably you could operationalize this as "any training run which consumed > X compute" for some judiciously chosen X.
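A compute-based trigger like this is straightforward to express. Here's a minimal sketch; the threshold value and the Chinchilla-style cost estimate are illustrative assumptions, not a proposed policy number:

```python
# Hypothetical sketch of a compute-based trigger for the protocol.
# The threshold X is illustrative; a real value would be set by policy.
COMPUTE_THRESHOLD_FLOP = 1e25

def requires_disclosure_review(training_compute_flop: float) -> bool:
    """True if a training run consumed more than the chosen X FLOP."""
    return training_compute_flop > COMPUTE_THRESHOLD_FLOP

# Rough training-compute estimate: ~6 * parameters * tokens.
# E.g. a 70B-parameter model trained on 2T tokens:
estimated_flop = 6 * 70e9 * 2e12  # ~8.4e23 FLOP, below this illustrative X
print(requires_disclosure_review(estimated_flop))
```

The hard part, of course, is choosing X, not checking against it.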
- Sadly, I’m not confident the answer is “yes,” and this is the main reason I only ~50% endorse this post. Two reasons I’m worried evaluators might fail:
- [...]
- The world might change in ways that enable new threat models after camelidAI is open-sourced. For example, suppose that camelidAI + GPT-SoTA isn’t dangerous, but camelidAI + GPT-(SoTA+1) (the GPT-SoTA successor system) is dangerous. If GPT-(SoTA+1) comes out a few months after camelidAI is open-sourced, this seems like bad news.
My main concern here is that there will be technical advancements in things like finetuning or scaffolding, and these will make camelidAI sufficiently capable to be a concern. This seems quite unlikely for current open-source models (as they are far from sufficiently capable), but will increase in probability as open source models get more powerful. E.g., it doesn't seem that unlikely to me that advances in finetuning, dataset construction, and scaffolding are sufficient for GPT-4 to make lots of money doing cybercrime online (this threat model isn't very existentially concerning, but the stretch from here to existential concerns isn't that huge).
It's hard for me to be very confident (>99%) that there won't be substantial jumpy improvements along these lines. As there are probably larger threats other than open source, maybe we should just eat the small fraction of worlds (maybe 1-5%) where a sudden jump like this happens (it probably wouldn't be existential even conditional on large jumps). I'm sympathetic to not worrying much about 1/1000 or 1/100 doom from open sourcing when we probably have bigger problems...
Let’s also call the most capable proprietary AI system GPT-SoTA, which we can assume is well-behaved. I’m imagining that GPT-SoTA is significantly more capable than camelidAI (and, in particular, is superhuman in most domains). In principle, the protocol below will still make sense if GPT-SoTA is worse than camelidAI (because open source systems have surpassed proprietary ones), but it will degenerate to something like “ban open source AI systems once they are capable of causing significant novel harms which they can’t also reliably mitigate.”
I think a reasonable amount of the concern is going to come from GPT-SoTA stalling out or pausing due to alignment concerns. Then, if open source models continue to advance (either via improvements on top of base models like I discussed earlier, or via further releases which can't be stopped), we might be in trouble. TBC, I don't think you assumed anywhere that GPT-SoTA will necessarily keep advancing, but it seems relevant to note this concern.
We're starting to have enough experience with the size of improvements produced by fine-tuning, scaffolding, prompting techniques, RAG, etc. to be able to guesstimate the plausible size of further improvements (and the amount of effort involved), so that we can try to leave an appropriate safety margin for them. That doesn't rule out the possibility of something out-of-distribution coming along, but it does at least reduce it.
In the future, sharing weights will enable misuse. For now, the main effect of sharing weights is boosting research (both capabilities and safety) (e.g. the Llama releases definitely did this). The sign of that research-boosting currently seems negative to me, but there's lots of reasonable disagreement.
@peterbarnett and I quickly looked at summaries for ~20 papers citing Llama 2, and we thought ~8 were neither advantaged nor disadvantaged for capabilities over safety, ~7 were better for safety than capabilities, and ~5 were better for capabilities than safety. For me, this was a small update towards the effects of Llama 2 so far having been positive.
If they are right then this protocol boils down to “evaluate, then open source.” I think there are advantages to having a policy which specializes to what AI safety folks want if AI safety folks are correct about the future and specializes to what open source folks want if open source folks are correct about the future.
In practice, arguing that your evaluations show open-sourcing is safe may involve a bunch of paperwork and maybe lawyer fees. If so, this would be a big barrier for small teams, so I expect open-source advocates not to be happy with such a trajectory.
- Note that if camelidAI is very capable, some of these preventative measures might be very ambitious, e.g. “make society robust to engineered pandemics.” The source of hope here is that we have access to a highly capable and well-behaved GPT-SoTA.
I think there are many harms that are asymmetric in terms of creating them vs. preventing them. For instance, I suspect it's a lot easier to create a bot that people will fall in love with than to create a technology that prevents people from falling in love with bots (maybe you could create, say, a psychology bot that helps people once they're hopelessly addicted, but that's already asymmetric).
There are of course things that are asymmetric in the defensive direction (maybe by the time you can create a bot that reliably exploits and hacks software, you can create a bot that rewrites that same software to be formally verified), but all it takes is a few harms that are easier to create than to prevent to make this plan infeasible, and I suspect that the closer we get to general intelligence, the more of these we get (simply because of the breadth of activities it can be used for).
I just wanted to highlight that there also seems to be an opportunity to combine the best traits of the open and closed source licensing models in the form of a new regulatory regime that one could call "regulated source."
I tried to start a discussion about this possibility, but so far the uptake has been limited. I think that's a shame; there seems to be so much that could be gained by "outside the box" thinking on this issue, since the alternatives both seem pretty bleak.
Enforceability of such things seems unlikely to be sufficient to satisfy those who want government intervention.
I think this is a very contextual question that really depends on the design of the mechanisms involved. For example, if we are talking about high risk use cases the military could be involved as part of the regulatory regime. It’s really a question of how you set this up, the possible design space is huge if we look at this with an open mind. This is why I am advocating for engaging more deeply with the options we have here.
I think a better plan looks something like "You can't open source a system until you've determined and disclosed the sorts of threat models your system will enable, and society has implemented measures to become robust to these threat models. Once any necessary measures have been implemented, you are free to open-source."
The problem with this plan is that it assumes there are easy ways to robustify the world. What if the only proper defense against bioweapons is complete monitoring of the entire internet? Perhaps this is something that we'd like to avoid. In this scenario, your plan would likely lead to someone coming up with a fake plan to robustify the world and then claiming that it'd be fine for them to release their model as open source, because people really want to do open source.
For example, in your plan you write:
Then you set a reasonable time-frame for the vulnerability to be patched: in the case of SHA-1, the patch was "stop using SHA-1" and the time-frame for implementing this was 90 days.
This is exactly the kind of plan that I'm worried about. People will be tempted to argue that surely four years is enough time for the biodefense plan to be implemented; then four years roll around, the plan is clearly not in place, and they push for release anyway.
I'll go into more detail later, but as an intuition pump imagine that: the best open source model is always 2 years behind the best proprietary model
You seem to have hypothesised what is, to me, an obviously unsafe scenario. Let's suppose our best proprietary models hit upon a dangerous bioweapon capability. Well, now we have only two years to prepare for it, regardless of whether that timeline is wildly unrealistic. Worse, this occurs for each and every dangerous capability.
Will evaluators be able to anticipate and measure all of the novel harms from open source AI systems? Sadly, I’m not confident the answer is “yes,” and this is the main reason I only ~50% endorse this post.
When we're talking about risk management, a 50% chance that a key assumption will work out, when there isn't a good way to significantly reduce this uncertainty, often doesn't translate into a 50% chance of the plan being good, but rather a near-0% chance.
I think this reasoning ignores the fact that at the time someone first tries to open source a system of capabilities >T, the world will be different in a bunch of ways. For example, there will probably exist proprietary systems of capabilities ≫T.
I think this is likely but far from guaranteed. The scaling regime of the last few years involves strongly diminishing returns to performance from more compute. The returns are coming (scaling works), but it gets more and more expensive to get marginal capability improvements.
If this trend continues, it seems reasonable to expect the gap between proprietary models and open source models to close, given that you need to spend strongly super-linearly to keep a constant lead (measured by perplexity, at least). There are questions about whether there will be an incentive to develop open source models at the billion-dollar-plus cost, and I don't know, but it does seem like proprietary projects will also be bottlenecked in the $10-100B range (and they also have this problem of willingness to spend, given how much of the value they can capture).
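A toy power-law sketch of this point (the constants here are made up, not from any real scaling-law fit; only the small exponent matters qualitatively):

```python
# Toy illustration: with loss(C) = a * C**(-b) and a small exponent b,
# even a modest fixed improvement in loss requires a large compute
# multiplier, so holding a constant lead means super-linear spending.
a, b = 10.0, 0.05  # hypothetical constants for illustration only

def loss(compute: float) -> float:
    return a * compute ** (-b)

def compute_for_loss(target_loss: float) -> float:
    # Invert loss(C) = a * C**(-b) to get the compute needed.
    return (a / target_loss) ** (1 / b)

c1 = 1e24
c2 = compute_for_loss(0.95 * loss(c1))  # compute for a 5% lower loss
print(c2 / c1)  # ~2.8x more compute for just 5% less loss
```

With a tiny exponent like this, each fixed increment of capability lead costs the leader a multiplicative factor of compute, which is the commenter's point about strongly diminishing returns.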
Potential objections:
I'm not sure I've written this comment as clearly as I want. The main thing is: expecting proprietary systems to remain significantly better than open source seems like a reasonable prediction, but I think the fact that there are strongly diminishing returns to compute scaling in the current regime should cast significant doubt on it.
Epistemic status: I only ~50% endorse this, which is below my typical bar for posting something. I’m more bullish on “these are arguments which should be in the water supply and discussed” than “these arguments are actually correct.” I’m not an expert in this, I’ve only thought about it for ~15 hours, and I didn’t run this post by any relevant experts before posting.
Thanks to Max Nadeau and Eric Neyman for helpful discussion.
Right now there's a significant amount of public debate about open source AI. People concerned about AI safety generally argue that open sourcing powerful AI systems is too dangerous to be allowed; the classic example here is "You shouldn't be allowed to open source an AI system which can produce step-by-step instructions for engineering novel pathogens." On the other hand, open source proponents argue that open source models haven't yet caused significant harm, and that trying to close access to AI will result in concentration of power in the hands of a few AI labs.
I think many AI safety-concerned folks who haven’t thought about this that much tend to vaguely think something like “open sourcing powerful AI systems seems dangerous and should probably be banned.” Taken literally, I think this plan is a bit naive: when we're colonizing Mars in 2100 with the help of our aligned superintelligence, will releasing the weights of GPT-5 really be a catastrophic risk?
I think a better plan looks something like "You can't open source a system until you've determined and disclosed the sorts of threat models your system will enable, and society has implemented measures to become robust to these threat models. Once any necessary measures have been implemented, you are free to open-source."
I'll go into more detail later, but as an intuition pump imagine that: the best open source model is always 2 years behind the best proprietary model (call it GPT-SoTA)[1]; GPT-SoTA is widely deployed throughout the economy and deployed to monitor for and prevent certain attack vectors, and the best open source model isn't smart enough to cause any significant harm without GPT-SoTA catching it. In this hypothetical world, so long as we can trust GPT-SoTA, we are safe from harms caused by open source models. In other words, so long as the best open source models lag sufficiently behind the best proprietary models and we’re smart about how we use our best proprietary models, open sourcing models isn't the thing that kills us.
In the rest of this post I will:
An analogy to responsible disclosure in cryptography
[I'm not an expert in this area and this section might get some details wrong. Thanks to Boaz Barak for pointing out this analogy (but all errors are my own).
See this footnote[2] for a discussion of alternative analogies you could make to biosecurity disclosure norms, and whether they’re more apt to risk from open source AI.]
Suppose you discover a vulnerability in some widely-used cryptographic scheme. Suppose further that you're a good person who doesn't want anyone to get hacked. What should you do?
If you publicly release your exploit, then lots of people will get hacked (by less benevolent hackers who've read your description of the exploit). On the other hand, if white-hat hackers always keep the vulnerabilities they discover secret, then the vulnerabilities will never get patched until a black-hat hacker finds the vulnerability and exploits it. More generally, you might worry that not disclosing vulnerabilities could lead to a "security overhang," where discoverable-but-not-yet-discovered vulnerabilities accumulate over time, making the situation worse when they're eventually exploited.
In practice, the cryptography community has converged on a responsible disclosure policy along the lines of: privately notify the maintainers of the vulnerable system, agree on a reasonable time-frame for a patch to be developed and deployed (commonly around 90 days), and only then publicly disclose the vulnerability.
As I understand things, this protocol has resulted in our cryptographic schemes being relatively robust: people mostly don't get hacked in serious ways, and when they do it's mostly because of attacks via social engineering (e.g. the CIA secretly owning their encryption provider), not via attacks on the scheme.[3]
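To make the analogy concrete: when responsible disclosure forced a migration off SHA-1, the "patch" for many systems was essentially a one-line change of hash function. A minimal Python sketch (illustrative code, not from any particular codebase):

```python
import hashlib

def fingerprint(data: bytes) -> str:
    # Pre-disclosure version (now broken against collision attacks):
    #   return hashlib.sha1(data).hexdigest()
    # The "patch" is to swap in a hash with no known collision attacks:
    return hashlib.sha256(data).hexdigest()

digest = fingerprint(b"example document")
print(len(digest))  # 64 hex characters (256 bits)
```

Part of why the crypto version of the protocol works is that patches are often this cheap; a recurring worry in the comments below is that "patches" for AI threat models (e.g. biodefense) are nothing like this cheap.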
Responsible disclosure for capabilities of open source AI systems: an outline
[Thanks to Yusuf Mahmood for pointing out that the protocol outlined in this section is broadly similar to the one here. More generally, I expect that ideas along these lines are already familiar to people who work in this area.]
In this section I’ll lay out a protocol for open sourcing AI systems which is analogous to the responsible disclosure protocol from cryptography. Suppose the hypothetical company Mesa has trained a new AI system camelidAI which Mesa would like to open source. Let’s also call the most capable proprietary AI system GPT-SoTA, which we can assume is well-behaved[4]. I’m imagining that GPT-SoTA is significantly more capable than camelidAI (and, in particular, is superhuman in most domains). In principle, the protocol below will still make sense if GPT-SoTA is worse than camelidAI (because open source systems have surpassed proprietary ones), but it will degenerate to something like “ban open source AI systems once they are capable of causing significant novel harms which they can’t also reliably mitigate.”
In this protocol, before camelidAI can be open sourced, [?? Mesa?, the government?, a third-party? ??] must:
As examples, let me note two special cases of this protocol:
I’ll also note two ways that this protocol differs from responsible disclosure in cryptography:
These two differences mean that other parties aren't as incentivized to robustify their systems; in principle they could drag their feet forever and Mesa will never get to release camelidAI. I think something should be done to fix this, e.g. the government should fine companies which insufficiently prioritize implementing the necessary changes.
But overall, I think this is fair: if you are aware of a way that your system could cause massive harm and you don't have a plan for how to prevent that harm, then you don't get to open source your AI system.
One thing that I like about this protocol is that it's hard to argue with: if camelidAI is demonstrably capable of e.g. autonomously engineering a novel pathogen, then Mesa can't fall back to claiming that the harms are imaginary or overhyped, or that as a general principle open source AI makes us safer. We will have a concrete, demonstrable harm; and instead of debating whether AI harms can be mitigated by AI in the abstract, we can discuss how to mitigate this particular harm. If AI can provide a mitigation, then we’ll find and implement the mitigation. And similarly, if it ends up that the harms were imaginary or overhyped, then Mesa will be free to open source camelidAI.
How does this relate to the current plan?
As I understand things, the high-level idea driving many responsible scaling policy (RSP) proponents is something like:
I think that if you apply this idea in the case where the action is "open sourcing an AI system," you get something pretty similar to the protocol I outlined above: in order to open source an AI system, you need to make an argument that it's safe to open source that system. If there is no such argument, then you need to do stuff (e.g. improve email monitoring for phishing attempts) which makes such an argument exist.
Right now, the safety argument for open sourcing would be the same as (1) above: current open source systems aren't capable enough to cause significant novel harm. In the future, these arguments will become trickier to make, especially for open source models which can be modified (e.g. finetuned or incorporated into a larger system) and whose environment is potentially "the entire world." But, as the world is radically changed by advances in frontier AI systems, these arguments might continue to be possible for non-frontier systems. (And I expect open source models to continue to lag the frontier.)
Some uncertainties
Here are some uncertainties I have:
Some thoughts on the open source discourse
I think many AI safety-concerned folks make a mistake along the lines of: "I notice that there is some capabilities threshold T past which everyone having access to an AI system with capabilities >T would be an existential threat in today's world. On the current trajectory, someday someone will open source an AI system with capabilities >T. Therefore, open sourcing is likely to lead to extinction and should be banned."
I think this reasoning ignores the fact that at the time someone first tries to open source a system of capabilities >T, the world will be different in a bunch of ways. For example, there will probably exist proprietary systems of capabilities ≫T. So overall, I think folks in the AI safety community worry too much about threats from open source models.
Further, AI safety community opposition to open source AI is currently generating a lot of animosity from the open source community. For background, the open source ideology is deeply interwoven with the history of software development, and strong proponents of open source have a lot of representation and influence in tech.[7] I'm somewhat worried that on the current trajectory, AI safety vs. open source will be a major battlefront making it hard to reach consensus (much worse than the IMO not-too-bad AI discrimination/ethics vs. x-risk division).
To the extent that this animosity is due to unnecessary fixation on the dangers of open source or sloppy arguments for the existence of this danger, I think this is really unfortunate. I think there are good arguments for worrying in particular ways about the potential dangers of open sourcing AI systems at some scale, and I think being much more clear on the nuances of these threat models might lead to much less animosity.
Moreover, I think there’s a good chance that by the time open source models are dangerous, we will have concrete evidence that they are dangerous (e.g. because we’ve already seen that unaligned proprietary models of the same scale are dangerous). This means that policy proposals of the shape “if [evidence of danger], then [policy]” get most of the safety benefit while also failing gracefully (i.e. not imposing excessive development costs) in worlds where the safety community is wrong about the pending dangers. Ideally, this means that such policies are easier to build consensus around.
This currently seems about right to me, i.e. that LLaMA-2 is a little bit worse than GPT-3.5 which came out 20 months ago.
Jeff Kaufman has written about a difference in norms between the computer security and biosecurity communities. In brief, while computer security norms encourage trying to break systems and disclosing vulnerabilities, biosecurity norms discourage open discussion of possible vulnerabilities. Jeff attributes this to a number of structural factors, including how difficult it can be to patch biosecurity vulnerabilities; it’s possible that threat models from open source AI have more in common with biorisk models, in which case we should instead model our defenses on biosecurity norms. For more, ctrl-f “cold sweat” here to read Kevin Esvelt discussing why he didn’t disclose the idea of a gene drive to anyone – not even his advisor – until he was sure that it was defense-dominant. (h/t to Max Nadeau for both of these references, and for most of the references to bio-related material that I link elsewhere.)
I expect some folks will want to argue about whether our cryptography is actually all that good, or point out that the words “relatively” and “mostly” in that sentence are concerning if you think that “we only get one shot” with AI. So let me preemptively clarify that I don't care too much about the precise success level of this protocol; I'm mostly using it as an illustrative analogy.
We can assume this because we’re dealing with the threat model of catastrophes caused by open source AI. If you think the first thing that kills us is misaligned proprietary AI systems, then you should focus on that threat model instead of open source AI.
This is the part of the protocol that I feel most nervous about; see bullet point 2 in the "Some uncertainties" section.
It’s shown here that a LLaMA-2 finetuned on virology data was useful for giving hackathon participants instructions for obtaining and releasing the reconstructed 1918 influenza virus. However, it’s not clear that this harm was novel – we don’t know how much worse the participants would have done given only access to the internet.
I've noticed that AI safety concerns have had a hard time gaining traction at MIT in particular, and one guess I have for what's going on is that the open source ideology is very influential at MIT, and all the open source people currently hate the AI safety people.