Overview of New Developments
As of today, November 9th, 2024, three major AI developers have made
clear their intent to begin or shortly begin offering tailored AI
services to the United States Department of Defense and related
personnel:
Anthropic: https://arstechnica.com/ai/2024/11/safe-ai-champ-anthropic-teams-up-with-defense-giant-palantir-in-new-deal/
Meta: https://scale.com/blog/defense-llama https://www.theregister.com/2024/11/06/meta_weaponizing_llama_us/
OpenAI: https://fortune.com/2024/10/17/openai-is-quietly-pitching-its-products-to-the-u-s-military-and-national-security-establishment/
These overtures are not driven solely by the AI companies. There are also
top-down pressures to expand the military use of AI from the US
government under Joe Biden.
Ivanka Trump has posted in favour
of Leopold
Aschenbrenner's Situational Awareness document, suggesting that the
incoming Trump administration will continue this trend.
For obvious reasons, I believe the same is happening in China, and
possibly (to a lesser extent) in other nations with advanced AI labs,
such as the UK. I focus on the American case because the majority of the
most advanced AI labs are American companies.
Analysis and Commentary
In some sense this is not new information. In January 2024, the paper
"Escalation Risks from Language Models in Military and Diplomatic
Decision-Making" warned that
"Governments are increasingly considering integrating autonomous AI
agents in high-stakes military and foreign-policy decision-making". Also
in January, OpenAI removed language prohibiting the
use
of its products in military or warfare-related applications. In
September, after the release of o1, OpenAI also claimed that its tools
have the capability to assist with CBRN (chemical, biological,
radiological, or nuclear) weapons development. The evaluation was
carried out with the assistance
of experts who were familiar with the procedures necessary for
"biological threat creation", who rated answers based on both their
factual correctness as well as "ease of execution in wet lab". From this
I infer that at least some of these experts have hands-on experience
with biological threats, and the most likely legal source of such
experts is Department of Defense personnel. Given these facts, to
demonstrate such capabilities and then deny the security community
access to the model seems both unviable and unwise for OpenAI.
Furthermore, it seems unlikely that the security community does not
already have access to consumer-facing models, given the weak controls
placed on the dissemination of Llama model weights and the ease of
creating OpenAI or Anthropic accounts. Therefore, I proceed from the
assumption that the offerings in these deals are tailor-made to the
security community's needs. This claim is explicitly corroborated in the
case of Defense Llama, whose developers say it was trained on "a vast
dataset, including military doctrine, international humanitarian law,
and relevant policies designed to align with the Department of Defense
(DoD) guidelines for armed conflict as well as the DoD's Ethical
Principles for Artificial Intelligence". The developers also claim it
can answer operational questions such as "how an adversary would plan an
attack against a U.S. military base", suggesting access to confidential
information beyond basic doctrine. We also know that the PRC has
developed similar tools based on open weights models, making it even
less likely that such offerings would be withheld from the US military.
Likely Future Developments
There have been calls for accelerated national security involvement in
AI, most notably from writers such as Leopold
Aschenbrenner (a former OpenAI
employee) and Dario
Amodei (CEO of
Anthropic). Amodei in particular favours an "entente" strategy in which
a coalition of democratic nations races ahead in AI development to
outpace autocracies and ensure a democratic future. These sentiments
echo the race to create the atomic bomb and
other similar technologies during World War II and the Cold War. I will
now explore the impact of such sentiments, which appear to be accepted
in security circles given the information above.
The first matter we must consider is what national security involvement
in AI will mean for the current AI development paradigm, which consists
of open weights, open source, and closed source developers organised as
private companies (OpenAI, Anthropic), subsidiaries of existing
companies (Google DeepMind, Meta FAIR), and academic organisations
(EleutherAI). Based on ideas of an AI race and looming
great power conflict, many have proposed a "Manhattan Project for AI".
My interpretation of this plan is that all further AI development would
be centrally directed by the US government, with the Department of
Defense acting as the organiser and funder of such development. However,
I believe that such a Manhattan Project-style scenario is unlikely. Not
only is access to AI technology already widespread, but many of the
contributions are also coming from open source or otherwise
non-corporate, non-academic contributors. If acceleration is the goal,
putting the rabbit back in the hat would be counterproductive and
wasteful.
Instead, I suggest that the likely model for AI development will be
similar to the development of cryptographic technology in the 20th
century. While it began as a military exercise, once civilian computing
became widespread and efforts to constrain knowledge of cryptanalysis
failed, a de facto dual-track system developed. Under this scheme, the
NSA (representing the government and security community) would protect
any techniques and
advancements
it developed to maintain a competitive edge in the geopolitical sphere.
Any release of technology for open
use would be
carefully arranged to maintain this strategic advantage, or only be
conducted after secrecy was
lost. At the same time,
developments from the open source, commercial, or academic community
would be permitted to continue. Any successes from the public or
commercial spheres would be incorporated, whether through hiring
promising personnel, licensing technology, or simply building in-house
implementations once feasibility was proven. A simple analogy is that of
a one-way mirror in an interrogation room: those in the room (i.e. the
public, corporate researchers, academics) cannot look behind the mirror,
but those behind the mirror (the national security community) can take
advantage of any public developments and bring them "behind the mirror"
if necessary. A one-way filter is established
between the public sphere and the security sphere, an arrangement we see
mirrored in the ways AI technology is now being licensed, via
IL6-compliant AWS services (in the case of Anthropic) and secured
private servers (in the case of Defense Llama). The public-facing
companies are still active, but a veil of secrecy is established at the
interface with the national security community. To be clear: there is
no public indication known to me that proprietary model weights are
being shared with the security community. On the other hand, such
agreements would likely live behind the one-way mirror.
Having established that AI developers will likely continue to exist in
their current form, we must next ask how the development of AI systems
will be influenced by a growing rapprochement with the security
community. It is well known that existing AI alignment efforts are
post-hoc. That is to say, a base or "unaligned" system with both desired
and undesired (dual-use) capabilities is created first, and is then
"brought into compliance" via methods like RLHF and Constitutional AI.
Therefore, the existence of the commercial-facing "aligned" models
implies the necessary existence of "base" models, which have not gone
through this post-training modification process. Furthermore, RLHF and
Constitutional AI are both value-agnostic methods. They merely promote
the likelihood of certain types of responses while suppressing others,
with the desired type of response judged either by an oracle emulating
human feedback or by a fully self-supervised model. This means that it
would be feasible to create "anti-aligned" models that feature more
undesirable responses. Such models might, for example, specialise in
developing novel cyberweapons or creating plausible propaganda. I also
expect specialised scaffolding to be developed to make use of and
enhance harmful capabilities. Amongst these enhanced capabilities are
those most associated with profoundly harmful outcomes,
e.g. the ability to autonomously replicate and take over computer
systems, the ability to develop novel bioweapons, and the ability to
cripple infrastructure.
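To make this value-agnosticism concrete, below is a minimal,
hypothetical sketch (my own illustration, not drawn from any lab's
actual training code) of the standard pairwise Bradley-Terry objective
used in RLHF-style reward modelling. The point is simply that the loss
itself encodes no values: which behaviour gets reinforced is determined
entirely by which completion is labelled "chosen", so the identical
machinery serves alignment and anti-alignment alike.

```python
# Minimal sketch of pairwise (Bradley-Terry) reward modelling, as used in
# RLHF post-training. Illustrative only; names and dimensions are invented.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Toy reward model: maps a pooled completion representation to a scalar."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.score(pooled).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximise the margin between the "chosen" and "rejected" completions.
    # Nothing here specifies *which* behaviour is desirable; that information
    # lives entirely in the preference labels supplied by the annotator or
    # by a constitution-following judge model.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    reward_model = RewardHead()
    chosen = torch.randn(4, 768)    # stand-ins for pooled "chosen" completions
    rejected = torch.randn(4, 768)  # stand-ins for pooled "rejected" completions

    loss = preference_loss(reward_model(chosen), reward_model(rejected))
    # Swapping the two arguments (i.e. inverting the labels) trains the same
    # model toward the opposite preference ordering.
    inverted_loss = preference_loss(reward_model(rejected), reward_model(chosen))
    print(loss.item(), inverted_loss.item())
```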
Why would these systems be created? While such capabilities would
normally be regarded as harmful per (for example) OpenAI's risk
evaluation metrics, in a military context they are useful, even
expected, behaviours for an AI system that is to be regarded as
"helpful". In particular, under the great power conflict/arms race
mindset, the possibility of China or another enemy power developing
these tools first would be unthinkable. On the other hand, ownership of
such a tool would be a powerful bargaining chip and a demonstration of
technological superiority.
Therefore, I believe that once the cordon of secrecy is in place, there
will be a desire and incentive for AI developers to produce such
anti-aligned systems tailored for military use. To return to the
cryptography metaphor, while easier methods to break cryptosystems or
intrude into secured networks are regarded as criminal and undesirable
outcomes in general society, for the NSA they are necessary operational
tools.
Some of you will protest that the standards of the US security community
will prevent it from creating such systems. This is a hypothesis that
can never be tested: because of the one-way mirror system, we (actors in
the public sphere) will never know if such models are developed without
high-level whistleblowers or leaks. Furthermore, in a great power
conflict context the argument for anti-alignment is symmetric: any great
power with national security involvement in AI development will be aware
of these incentives. Perhaps other powers will not be so conscientious,
and perhaps American corporations will be happy to have two one-way
mirrors installed: see
https://en.wikipedia.org/wiki/Dragonfly_(search_engine).
What specifically has changed?
For many of you, these arguments will likely be familiar. However, the
specific agreements between US AI developers and the Department of
Defense, complete with the implementation of the one-way mirror, signal
the proper entry of the US into the AI militarisation race. Even if the
US military does not develop any anti-aligned capabilities beyond those
of commercially available models, other great powers will take notice of
this development and make their own estimations.
Furthermore, the one-way mirror effect is localised: Chinese
anti-alignment efforts cannot benefit from American public-sphere
developments as easily as American anti-alignment efforts can. This is
because of several factors including access to personnel, language and
cultural barriers, and protective measures like information security
protocols and limits to foreign
access.
Thus far, American AI development efforts are world-leading (notice how
the Chinese military uses Llama for its purposes rather than
domestically developed models!). This means that American anti-alignment
efforts, if they ramp up properly, will be world-leading as well.
Implications for AI safety work
Based on the above, I make several inferences for AI safety work, the
most important of which is this: Under the present AI development
paradigm, the capabilities of the best anti-aligned AI system will be
lower bounded by the capabilities of the best public aligned AI
system. Notice the asymmetry in knowledge we will possess about
anti-aligned and aligned systems: any advances behind the anti-aligned
systems need not be public, again because of the one-way mirror.
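To state this bound in symbols (my own notation, not taken from any
cited source): write C(m) for the capability level of a model m, P for
the set of publicly released aligned models, and A for the set of
anti-aligned systems behind the one-way mirror. The claim is

```latex
% Hedged formalisation of the lower-bound claim; C, P, and A are my own labels.
\[
  \max_{m \in A} C(m) \;\ge\; \max_{m \in P} C(m)
\]
% Rationale: any public aligned model can be re-post-trained with inverted
% preferences at comparatively little cost, while advances inside A need
% never be disclosed, so the reverse inequality can never be publicly verified.
```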
The second inference is this: Any new public capabilities
innovations will be symmetrically applied. In other words, any attempt
to increase the capabilities of aligned models will be applied to
anti-aligned models so long as alignment is a post-training
value-agnostic process. Any method discovered behind the one-way mirror,
however, need not be shared with the public.
The third inference is this: Most new public alignment work will
also become anti-alignment work. For example, improvements to
RLHF/post-training alignment or new methods of model control can be
directly employed in service of anti-alignment. Similarly, developments
that make AI systems more reliable and effective are dual-use by their
nature. Mechanistic interpretability and other instrumental science work
will remain as it has always been, effectively neutral. Better
explanations of how models work can likely benefit both alignment and
anti-alignment because of the aforementioned symmetric nature of current
alignment methods.
The final inference is this: Public AI safety advocates are now
definitively not in control of AI development. While there have always
been AI developers who resisted AI safety concerns (e.g. Yann LeCun at
FAIR), and recently developers like OpenAI have signalled moves away
from an AI safety focus to a commercial AI focus, for a long time it
could be plausibly argued that most consequential AI development was
happening somewhat in the open, with oversight and influence from
figures like Yoshua Bengio or Geoffrey Hinton who were concerned about
AI safety. It is now clear that AI safety advocates who do not have
security clearance will not even have full knowledge of cutting edge AI
developments, much less any say about their continued development or
deployment. The age of public AI safety advocates being invited to the
table is over.
There are exceptions to these inferences. For example, if a private AI
developer with no one-way mirror agreement works on their own to develop
a private, aligned model with superior capabilities, this would be a
triumph over anti-alignment. However, not only have all major AI
developers rapidly acquiesced to the one-way mirror arrangement (with
some notable exceptions), but any such development would likely inspire
similar anti-alignment efforts, especially given the porous nature of
the AI development and
AI safety communities. It is also possible that advocates in the
military will resist such developments due to a clear-eyed understanding
of the relevant risks, and instead push for the development of positive
alignment technologies with military backing. At the risk of repeating
myself, this is a fight we will not know the outcome of until it is
either inconsequential or too late.
What should we do?
In short: I don't know. Many people will argue that this is a
necessary step, that US anti-alignment efforts will be needed to counter
Chinese anti-alignment efforts, that giving autocracies leverage over
democracy in the form of weaponised AI is a death sentence for freedom.
Similar arguments have been deployed by the NSA in defence of its spying
and information gathering efforts. However, others have pointed out the
delusion
of trying to control an AI borne of race dynamics and malicious intent,
and point to historical overreaches and abuses of
power by the American
national security community. Indeed, the existence of high-profile leaks
suggests that escape or exfiltration of dangerous AI systems from behind
the one-way mirror is possible, making them a risk even if you have
faith in the standards of the US military. Perhaps now is also an
opportune time to mention the incoming administration's ties to
neo-reactionary politics, as ably demonstrated in the Project 2025
transition plan.
One thing I think we should all not do is continue with
business-as-usual under the blithe assumption that nothing has changed.
A clear line has been crossed. Even if the national security community
renounces all such agreements the day after this post goes live, steps
have been taken to reinforce the existing race dynamic and other nations
will notice. And as for the public? We are already staring at the
metaphorical one-way mirror, trying to figure out what is staring back
at us from behind it.