This post reflects my personal opinion and not necessarily that of other members of Apollo Research or any of the people acknowledged below. Thanks to Jarrah Bloomfield, Lucius Bushnaq, Marius Hobbhahn, Axel Højmark, and Stefan Heimersheim for comments/discussions.

I find that people in the AI/AI safety community have not considered many of the important implications that security at AI companies has for catastrophic risks.

In this post, I’ve laid out some of these implications:

  • AI companies are a long way from state-proof security
  • Implementing state-proof security will slow down safety (and capabilities) research a lot
  • Sabotage is sufficient for catastrophe
  • What will happen if timelines are short?
  • Security level matters, even if you’re not robust to top cyber operations

AI companies are a long way from state-proof security

I’m of course not the first one to make this claim (e.g. see Aschenbrenner). But it bears repeating.

Last year, RAND released an excellent report outlining what it might take for an AI company to prevent adversaries of various capability levels from stealing model weights. Its highest security level, SL5, defines a level of security that could “thwart most top-priority operations by the top cyber-capable institutions”. This is, of course, the level of security that we need around a technology with catastrophic risks.

Unfortunately, today’s AI companies will likely take longer to reach SL5 than to develop models that are worth protecting at this level. Below, I illustrate some reasons why implementing this level of security might take a very long time.

Reaching SL5 is a tough technical challenge. Some components that RAND suggests for SL5 do not even exist yet: “More R&D is needed to enable organizations to support production models while meeting SL5 security requirements. We recommend the development of hardware security module (HSM)-like devices with an interface that is specialized for ML applications.”
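
To make the HSM-like proposal a little more concrete, here is a minimal sketch of what an ML-specialised interface to such a device might look like. This is my own illustration, not something from the RAND report: the class and method names are hypothetical, and the point is only that the weights never leave the device’s trust boundary. Callers can run inference or submit finetuning jobs, but there is no operation that returns raw weights.

```python
# Hypothetical sketch of an HSM-like interface for ML workloads.
# Weights live only inside the device; no call ever returns them.

from dataclasses import dataclass


@dataclass
class InferenceResult:
    tokens: list[str]   # generated output
    audit_log_id: str   # every call is logged for later review


class MLSecureModule:
    """Toy stand-in for an HSM-like device that holds model weights."""

    def __init__(self, model_id: str):
        # The device resolves model_id to weights stored internally;
        # the host never handles the weight tensors themselves.
        self.model_id = model_id

    def generate(self, prompt: str, max_tokens: int = 256) -> InferenceResult:
        """Run a forward pass inside the device's trust boundary."""
        raise NotImplementedError("executed on the secure device, not the host")

    def finetune(self, dataset_ref: str, approval_token: str) -> str:
        """Start a finetuning job on vetted data; returns a new model_id,
        never the updated weights."""
        raise NotImplementedError("executed on the secure device, not the host")

    # Deliberately absent: any export_weights()-style method.
```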

Let’s take preventing software supply-chain attacks as another example. The RAND report lists this security measure for SL5: “Strong limitations on software providers (e.g., only developed internally or by an extremely reliable source)”. My guess is that if you looked at all the software being used by researchers at a frontier lab, you’d have a list longer than this article, much of which is maintained by indie developers in their basements.
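
To get a feel for the scale, here is a rough sketch of how one might start auditing this, assuming a Python-based research environment and a hypothetical internal allowlist of approved providers (both the filename and the allowlist are illustrative):

```python
# Rough sketch: flag installed Python packages that are not on an internal
# allowlist of approved providers. "approved_packages.txt" (one package
# name per line) is a hypothetical file; adapt to your own inventory.

from importlib.metadata import distributions


def load_allowlist(path: str = "approved_packages.txt") -> set[str]:
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}


def unapproved_packages(allowlist: set[str]) -> list[str]:
    installed = {(dist.metadata["Name"] or "").lower() for dist in distributions()}
    return sorted(installed - allowlist)


if __name__ == "__main__":
    flagged = unapproved_packages(load_allowlist())
    print(f"{len(flagged)} installed packages are not on the allowlist:")
    for name in flagged:
        print("  ", name)
```

Even this only scratches the surface: it covers one Python environment on one machine, and says nothing about OS packages, build tooling, browser extensions, or the transitive dependencies of each package.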

An AI company might say “OK, this is a problem, let’s throw a bunch of our own engineers at it and build internal tools that replicate all this functionality”. If managed well, this intervention would reduce the chances of running egregious malware that exfiltrates all of your data to external servers. However, this tactic also runs into the following problem: the number of undiscovered vulnerabilities in a system is strongly negatively correlated with how much it has been used. New internal tools will see much less use than popular external public tools, meaning that their vulnerabilities are less likely to be found. Attackers are in a much better position if there are many of these more innocuous-seeming vulnerabilities available for them to exploit in their operations. I expect AI to help a lot with finding and fixing vulnerabilities in packages, though I think this security asymmetry between new internal tools and popular public tools is likely to persist, at least for a couple of years.

Of course, the company also has to find people to develop these tools internally. If they hire many additional engineers, the number of people who can maliciously insert vulnerabilities into these tools may become large (enforcing code reviews helps mitigate this, although not fully).

Beyond the technical challenges, implementing security processes takes a long time when humans (or AIs) are involved. It’s not enough for the security team at an AI company to develop a new SL5 system and say “OK all, we’re done. Everyone use this setup and read these docs and we’re SL5 now”. If this happened, the whole setup would collapse under the flood of “we need to be able to do X to be able to do anything useful” from staff. There’s also the issue of preventing staff from gossiping about confidential company information, a practice that’s currently a pastime in the AI tech scene.

Embedding a strong security culture at a company is a significant, time-consuming endeavour. When security isn’t a top priority from the company’s inception, changing the culture is all the more difficult.

Of course, it’s possible that not all of the suggestions in the RAND report for SL5 are required to protect against top-priority operations from leading cyber-capable institutions. It’s also possible that the list of SL5 suggestions will be insufficient. RAND notes a considerable lack of consensus among surveyed experts, stating that “There are deeper and more conceptual disagreements about what is needed to achieve the security implied by SL4 and SL5—with opinions ranging from the SL3 benchmark being sufficient to secure against all threat actors to claims that no system could ever present a significant hurdle to [top-priority operations by top cyber-capable institutions]”. Nonetheless, the report provides useful guidelines, and I think frontier AI companies are at least several years away from implementing even a more minimal version of these SL5 guidelines.

Implementing state-proof security will slow down safety (and capabilities) research a lot

I get the sense that many in the AI field strongly underestimate how much working under strict security measures would slow down research. This applies to both capabilities and safety research, although it may have a stronger effect on safety research.

Most importantly, you just can’t give wide access to the latest models without massively increasing the security risk (“wide” in the sense of the number of people/AIs and the types of experiments that can be run). Unfortunately, the safety research being pursued most at AI companies, and that which is most promising under short timelines (e.g. Control, Monitoring, Scalable oversight), requires having wide access to the most powerful models. Since generalisable theories of intelligence and alignment are in short supply, and may be for several years, most safety work is likely to take the form of prodding and poking the most intelligent model available (e.g. by trying various “safety training” setups) to see how it responds or to find holes in your control/monitoring setup.

In addition to not having wide access to your latest models, here is another set of things you’ll have to deal with as an AI researcher/engineer under strong security measures:

  • You will be waiting for approval for many things from various people, e.g. before running various experiments, before using certain tools, or to get a false-positive temporary lock on your workspace undone. This will massively lengthen the research feedback cycle.
  • You’re restricted in the types of experiments you can run at all. E.g. you might not be able to export a large amount of data from your experiment in case it contains too much information about the model weights (a toy sketch of such an export gate follows this list). Interpretability research (e.g. for monitoring) may be especially difficult under these restrictions.
  • You won’t be able to use the latest untrusted models directly to speed up your research (or will be strongly limited in how you can use them). This might not seem like a large cost now, but it certainly will when models perform as well as or better than AI researchers.
  • The number of people in the company you can discuss research ideas with shrinks.
  • You get access to fewer research results from other teams, many of which would help with your own work.
  • The mission-driven culture that previously made your company move fast is likely to take a hit due to
    • teams being more segregated from each other
    • staff not enjoying their work as much when they’re frequently interrupted by security “blockers”.
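
As promised above, here is a toy sketch of the kind of export gate behind several of these frictions. The 50 MB threshold and the review queue are hypothetical; the point is that anything non-trivial a researcher wants to get out of the secured environment ends up waiting on a human.

```python
# Toy sketch of an export gate: small artifacts (e.g. metric logs) are
# auto-approved, anything larger waits for human security review.
# The 50 MB threshold and the review queue are hypothetical.

import os

AUTO_APPROVE_LIMIT_BYTES = 50 * 1024 * 1024  # 50 MB


def request_export(path: str, justification: str) -> str:
    size = os.path.getsize(path)
    if size <= AUTO_APPROVE_LIMIT_BYTES:
        return f"auto-approved: {path} ({size} bytes)"
    # Larger exports sit in a manual queue -- this is where the research
    # feedback cycle slows down.
    return f"queued for manual review: {path} ({size} bytes) -- {justification}"


# Example: exporting activation dumps for interpretability work would almost
# always exceed the limit and land in the manual queue.
```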

I’m certainly not saying this to dissuade companies from implementing these measures; if you’re in the business of developing technology that might cause extreme harm, this is a minimal price to pay. Nobody wants to live in a world where AI developers are not paying these costs to keep their models secure. The following sections hopefully make that clear.

Sabotage is sufficient for catastrophe

In a good world, an AI developer may reach SL5 and be robust against attacks that could exfiltrate the weights of an AI model. However, this does not mean that an adversary couldn’t still cause a catastrophe with this model.

To reach SL5, an AI company would need to place very strong restrictions on the outgoing network traffic from the isolated network hosting the powerful AI models. The goal is to make it as hard as possible for an attacker to exfiltrate terabytes’ worth of weights from the network. Fortunately, the output of AI experiments that will pass outside of the isolated network need not be very large (assuming that copies of the model stay inside the network).
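
A back-of-the-envelope comparison (my own illustrative numbers) shows why this asymmetry makes byte-level egress limits workable:

```python
# Back-of-the-envelope comparison of weight size vs. typical experiment
# outputs. The parameter count and output size are illustrative guesses.

params = 1e12                    # a hypothetical ~1T-parameter model
bytes_per_param = 2              # bf16/fp16 weights
weight_bytes = params * bytes_per_param

output_bytes = 50 * 1024**2      # ~50 MB of logs/metrics per experiment

print(f"weights:            {weight_bytes / 1024**4:.1f} TiB")
print(f"experiment outputs: {output_bytes / 1024**2:.0f} MiB")
print(f"ratio:              ~{weight_bytes / output_bytes:,.0f}x")
```

Even with aggressive compression or quantisation, the weights stay several orders of magnitude larger than ordinary experiment outputs.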

On the other hand, placing strong restrictions on incoming traffic to the isolated network hosting the model is much more difficult as many experiments will require using large amounts of external data (e.g. the data required to train/post-train a model). This opens the door for various types of sabotage.

AGI, almost by definition[1], will be capable of scheming against its developers. Ideally the AGI will be aligned well enough that it shows no proclivity to do so, though we’re not doing so well on that front at the moment (1, 2). Let’s assume that the AGI will try to scheme against its developers, and that an AI Control setup is required to prevent it from succeeding.

All the adversary needs to do now is to sabotage this AI Control setup. Sabotage here might be as simple as corrupting the trusted monitor (the AI tasked with checking the outputs of the scheming AI) by e.g. planting backdoors in some of its finetuning data. Also, remember that the adversary has a powerful insider that wants to cause havoc itself: the AI.
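
To see why the monitor is such an attractive target, here is a minimal sketch of a control-style loop (the function names and threshold are hypothetical placeholders, not any company’s actual setup). Every action from the untrusted model passes through a single trusted-monitor check before execution, so quietly corrupting that one component disables the whole defence:

```python
# Minimal sketch of an AI Control loop. All names and the threshold are
# hypothetical placeholders, not any real company's setup.

SUSPICION_THRESHOLD = 0.7


def untrusted_propose(task: str) -> str:
    """Stand-in for the powerful (possibly scheming) model proposing an action."""
    return f"proposed action for: {task}"


def trusted_monitor_score(action: str) -> float:
    """Stand-in for the trusted monitor's suspicion score in [0, 1]."""
    return 0.1  # a real monitor would inspect the action


def escalate_to_human(action: str) -> str:
    """Route a suspicious action to a human reviewer instead of executing it."""
    return f"ESCALATED: {action}"


def execute(action: str) -> str:
    """Execute the action in the real environment."""
    return f"EXECUTED: {action}"


def run_with_control(task: str) -> str:
    action = untrusted_propose(task)
    if trusted_monitor_score(action) >= SUSPICION_THRESHOLD:
        return escalate_to_human(action)
    # The monitor is the only gate before execution. If an adversary poisons
    # its finetuning data so that backdoored actions always score below the
    # threshold, those actions pass straight through -- no weight theft needed.
    return execute(action)
```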

Now consider the case where the AGI is aligned and will not naturally scheme against its developers. I contend that, in practice, this situation may actually be worse for defending against external sabotage. An aligned AGI is likely to have far fewer controls around it. In this case, all an adversary would need to do is cause the AGI to become unaligned, and then let the AGI do its thing. Considering how we currently (attempt to) align AI systems, and how effective very simple jailbreaks are, I expect that “undoing” the alignment of one instance of the kind of AGI we’re on track to build is a much easier task for a capable adversary than stealing model weights.

I do expect adversaries to be less interested in sabotaging an AI than stealing it. That said, assuming AI developers reach some reasonable level of security, I think sabotage would be so much easier to pull off that it should be a bigger concern for those of us interested in reducing catastrophic risks from AI going rogue.

What will happen if timelines are short?

Many AI/AI safety folk believe that short timelines (e.g. those laid out here) are very possible, and that no major advancements are needed to get there.

Let’s suppose that this happens, and that the group in the lead is an AI company which is (somehow) robust to model theft and sabotage from the most cyber-capable institutions.

Under these conditions, it seems very likely that other Western and perhaps Chinese groups are going to be able to build AGI themselves in <1 year from the first AGI (you can increase this to <2 years if you’d like, the rest still stands). For things to go well, you’re going to need to argue one of the following:

  1. Proliferation of these AGIs would be safe.
  2. Things will happen in the <1 year timeframe that prevent any other party from getting AGI. (If so, what is this? Will the US government (and/or other governments) be ready and capable of closing down frontier model development? Will the leading company be able to buy out the others?)
  3. Even if a small number of Chinese/Western groups get AGI in <1 year from the first AGI:
    1. These companies will have sufficient security, and
    2. something from that time period onwards will prevent others from building AGI and prevent it from eventually getting into the hands of your local terrorist group, and
    3. it matters that several parties have the AGI but not your local terrorist group (this is easier to argue for large scale harms but perhaps not existential risk).

My personal take is as follows:

  • Short timelines (very roughly as listed here) without any crucial secret sauce held by one company are very possible (>20%). I put much more weight (up to 70%) on the 2028 entry of “AI companies have 10k-1M automated AI researchers” happening before 2035, but these slightly longer timelines do not drastically change the picture below.
  • In these worlds, multiple parties in China and the West will be <1 year behind the frontier (likely aided by theft of models/data/insights).
  • Barring a crippling world war over AI development, nothing but strong US government intervention would prevent AGI from reaching multiple parties. The most likely-seeming thing for the US government to do would be to start its own project or to effectively join forces with one existing company, and prevent everyone else within its power from pushing the AI frontier.
    • If the US government does not do this, I think we end up with several companies with insufficient security having AGI, shortly followed by AGI proliferation, likely ending in catastrophe. So this is not a scenario I spend much time strategising about.
    • Note that longer timelines leave some room for a single company to take a big enough lead and prevent other parties from competing (e.g. by buying them out and becoming robust to theft and sabotage). In this case, the US government may not need to get as involved.

Security level matters, even if you’re not robust to top cyber operations

Leading cyber-capable institutions are run by parties who also wouldn’t want AGI getting into the hands of the street-corner terrorist. They might prefer to just steal your model and try to use it themselves, and (hopefully) protect against their own adversaries stealing it from them. Perhaps they will share it with their allies, but one at least has some hope that it doesn’t spread wide very quickly.

If your threat model puts at least some probability on it being important how many parties have access to the AGI (and what kind of parties), then reducing the number of adversaries that are able to steal your model is robustly good.

As for sabotage threats, it should go without saying that you want to minimise the number of parties that can sabotage the AGI in your company.

Advice to frontier AI companies

If you don’t want the government to shut down your frontier development when serious capabilities hit, you are going to have to convince them that you are the best company for them to join up with. How can you do this? Yes, raw capabilities will help. But if timelines are short, or perhaps even under some longer timelines, they’re going to realise that your competitors are not far behind.

I expect that the most promising way to convince them is to already have a lot of the infrastructure in place to continue AI development/use with very tight security. It’s also going to help if they view you as responsible actors who are putting significant resources into safety research so as to not put all of their own citizens at risk.

If you don’t believe that the government will shut down your frontier development or amalgamate with you, then seriously consider that failing to prevent sophisticated adversaries from stealing or sabotaging your models is sure to bring about much more harm than the good you’re trying to do by building AGI.

  1. ^

     I don’t think it matters much for this post, but I’ll define AGI as the level of AI that can speed up AI R&D 10x at a frontier AI company.
