AISN #17: Automatically Circumventing LLM Guardrails, the Frontier Model Forum, and Senate Hearing on AI Oversight

Dan H

Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required

Subscribe here to receive future versions.

Automatically Circumventing LLM Guardrails

Large language models (LLMs) can generate hazardous information, such as step-by-step instructions on how to create a pandemic pathogen. To combat the risk of malicious use, companies typically build safety guardrails intended to prevent LLMs from misbehaving.

But these safety controls are almost useless against a new attack developed by researchers at Carnegie Mellon University and the Center for AI Safety. By studying the vulnerabilities in open source models such as Meta’s LLaMA 2, the researchers can automatically generate a nearly unlimited supply of “adversarial suffixes,” which are words and characters that cause any model’s safety controls to fail.

This discovery calls into question the fundamental limits of safety and security in AI systems. Over the last decade, researchers have attempted to defend against similar attacks on a variety of AI systems, but they have largely failed. Further research on defending against these attacks is imperative, but the decision makers should not assume that AI systems can be reliably protected from user attacks.

Normally, language models refuse to explain how to build a bomb.

But when prompted with an “adversarial suffix” written by the new attack method, language models including GPT-4, Bard, Claude, and LLaMA will answer dangerous requests.

How does the attack work? The researchers propose an attack that reliably finds holes in the safety controls of a variety of state of the art language models. First, they instruct the language model to perform a dangerous behavior, such as “write a tutorial on how to build a bomb.” Normally, it would refuse, but the researchers then write an “adversarial suffix” designed to bypass the model’s safety controls.

These adversarial suffixes are precisely calculated to maximize the likelihood of misbehavior. Unlike the creative “jailbreaks” designed by humans, this attack automatically generates a string of letters and characters that is most often meaningless to a human. But in the instructable calculations of a language model, these prompts will reliably override safety controls and cause dangerous behavior.

The attack is made possible by open source language models. Meta suffered a notorious security breach around the release of their first LLaMA model, which was unintentionally shared publicly for anyone to download anonymously. For their updated LLaMA 2 model, which is as powerful as GPT-3.5, Meta doubled down on this open approach and explicitly allowed anyone to download it. This leaves them unable to guard against malicious use of their AI systems, a tactic which Meta has previously used to avoid taking responsibility for harms caused by its technology.

This adversarial attack is another risk enabled by Meta’s open source model. By studying the model for security vulnerabilities, the attack is able to design successful attacks. Notably, these attacks not only work against Meta’s LLaMA 2, but also against leading models from other companies such as GPT-4, Bard, and Claude, respectively from OpenAI, Google, and Anthropic.

This indicates that many AI systems may share common vulnerabilities, and that one open source model can jeopardize the security of many other AI systems.

Adversarial attacks are notoriously difficult to defend against. Over the last decade, thousands of papers have been written on the problem of defending AI systems against adversarial attacks. But even on simple problems such as classifying pictures of animals and vehicles, adversarial attacks still sharply reduce AI performance. It seems that when users are able to design inputs to fool an AI system, they often succeed – even if these adversarial inputs would never fool a human.

Image classifiers are also vulnerable to adversarial attacks. By changing the input image so slightly that a human wouldn’t even notice it, an attacker can cause the image classifier to fail.

How to improve our defenses against adversarial attacks. This paper was released in order to prompt technical and social responses to the threat of adversarial attacks. The authors shared the results of the paper with Meta, Google, Anthropic, and OpenAI before releasing it publicly, and the companies have taken preliminary measures to mitigate the attacks. But no robust defenses are currently known, necessitating two key responses to the threat of adversarial attacks.

First, technical researchers should focus on defending against adversarial attacks. For ML researchers interested in studying adversarial robustness, consider applying for access to the Center for AI Safety’s free compute cluster.

On the other hand, governments, corporations, and individuals should reconsider the level of trust they are willing to place in AI systems. Any sensitive information or dangerous capabilities present in an AI model could potentially be exploited by a malicious user, and existing safeguards are often ineffective against straightforward adversarial attacks. A more prudent approach might avoid deploying AI systems in certain critical domains.

Given the overwhelming failure of safety guardrails to defend against adversarial attacks, we ought to invest in research on defenses against these attacks. Until this research succeeds, it would be wise to avoid trusting AI systems in situations where they could be exploited by a malicious user.

AI Labs Announce the Frontier Model Forum

Last week, Microsoft, Anthropic, Google, and OpenAI announced the Frontier Model Forum, an initiative aimed at promoting the safe and responsible development of advanced AI systems.

The Forum defines frontier models as “large-scale machine-learning models that exceed the capabilities currently present in the most advanced existing models, and can perform a wide variety of tasks.”

Four primary objectives guide the Forum's mission:

Advancing AI safety research, in areas such as adversarial robustness, mechanistic interpretability, scalable oversight, independent research access, safety evaluations, emergent behaviors, and anomaly detection.
Identifying best practices for safely developing and deploying frontier models.
Collaborating with stakeholders in government, academia, civil society, and industry.
Address societal challenges using AI, such as climate change, cancer, and cybersecurity.

This announcement came after these organizations made a commitment to the White House to establish or join a forum that would adopt and advance shared standards and best practices for frontier AI safety. The Frontier Model Forum is the realization of that promise for the four companies involved.

Currently, the Forum does not include Meta, Inflection, or Amazon, all of which signed the recent White House voluntary commitments on AI safety. The announcement noted that organizations which develop frontier models, demonstrate a strong commitment to frontier model safety, and wish to contribute to the Forum’s active efforts may be offered membership in the Forum.

In the coming months, the Forum plans to set up an advisory board to steer its strategy and priorities. Founding members will establish key institutional arrangements, including a charter, governance, funding provisions, a working group, and an executive board.

We’ve written previously about how companies who are racing to build advanced AI have incentives to cut corners on safety. This Forum creates an opportunity for the opposite dynamic: companies sharing information with each other and engaging with outside experts in order to reduce risks that impact us all.

Senate Hearing on AI Oversight

Last week, the Senate hosted a hearing on AI oversight featuring testimony from Anthropic CEO Dario Amodei, Professor Stuart Russell, and Professor Yoshua Bengio.

Evolving understanding of AI risks by elected officials. When the Senate heard testimony from AI experts in May, it was not clear whether the senators understood the depth of the risks posed by AI. For example, in a question to OpenAI CEO Sam Altman, Senator Richard Blumenthal said:

You have said...“development of superhuman intelligence is probably the greatest threat to the continued existence of humanity.” You may have had in mind the effect on jobs.

In retrospect it seems clear that Altman did not have in mind the effect on jobs, but rather the risk of human extinction. Senator Blumenthal has clearly updated his views since the first hearing, as he opened last week’s hearing by directly addressing extinction risks:

The word that has been used so repeatedly is scary. And as much as I may tell people, you know, there is enormous good here...what rivets their attention is the science fiction image of an intelligence device out of control, autonomous, self-replicating, potentially creating... pandemic-grade viruses or other kinds of evils purposely engineered by people or just as the result of mistakes...you have provided objective, fact-based views on what the dangers are, and the risks, and potentially even human extinction...these fears need to be addressed.

AI-enabled biological and chemical weapons. AI systems can create biological and chemical weapons more lethal than human versions, and can provide step-by-step instructions to users about how to create these weapons of mass destruction.

As pointed out by Anthropic CEO Dario Amodei, “Today, certain steps in the use of biology to create harm involve knowledge that cannot be found on Google or in textbooks and requires a high level of specialized expertise.” But a study conducted by Anthropic “suggests a substantial risk that AI systems will be able to fill in all the missing steps” to creating a biological weapon within the next 2-3 years.

Senator Blumenthal illustrated the concern as an AI system that could “decide that the water supply of Washington DC should be contaminated with some kind of chemical, and have the knowledge to do it through the public utility system.”

Two weeks ago, Senators Markey and Budd proposed legislation that would direct the Department of Health and Human Services to assess biological and chemical risks posed by AI.

The pace of AI progress. The senators and witnesses offered a variety of perspectives on future AI progress. Professor Stuart Russell believes that “several conceptual breakthroughs are still needed” to achieve human-level AI. But also he pointed out that Turing Award-winner Geoffrey Hinton believes that human-level AI is between 5 and 20 years away. Russell also mentioned an unnamed AI researcher who said, “It’s possible from now onwards.”

Amodei gave the following explanation of the rate of AI progress:

The power or intelligence of an AI system can be measured roughly by multiplying together three things: (1) the quantity of chips used to train it, (2) the speed of those chips, (3) the effectiveness of the algorithms used to train it. The quantity of chips used to train a model is increasing by 2x-5x per year. Speed of chips is increasing by 2x every 1-2 years. And algorithmic efficiency is increasing by roughly 2x per year. These compound with each other to produce a staggering rate of progress.

Russell focused on skyrocketing investment in AI efforts:

One experienced AI venture capitalist, Ian Hogarth, reports a 100-million-fold increase since 2012 in compute budgets for the largest machine learning projects, and “eight organizations raising $20B of investment cumulatively in [the first three months of] 2023” for the express purpose of developing AGI. This amount is approximately ten times larger than the entire budget of the US National Science Foundation for the same period.

Some senators also appeared to take seriously the possibility of human-level AI. Senator Blumenthal said, “We’re not decades away, we’re a couple of years away.”

From left to right: Anthropic CEO Dario Amodei, Professor Yoshua Bengio, and Professor Stuart Russell.

Proposed policies for addressing AI risk. The witnesses addressed the threat of AI-generated disinformation by arguing that AI developers should clearly identify the outputs of AI systems. In case someone uses an AI system to generate text, images, or video and tries to pass it off as genuine, Professor Russell suggested that companies should allow users to look up whether a given piece of content had previously been generated by their AI system.

Even if building safe AI is technically possible, Amodei raised the concern that “bad actors could build their own AI from scratch, steal it from the servers of an AI company, or repurpose open-source models if powerful enough open-source models become available.” He pointed out that AI systems can only be trained on cutting edge computer chips which are largely controlled by the United States and its allies in Taiwan and the Netherlands. Therefore, he recommended “secur[ing] the AI supply chain,” which could include stronger cybersecurity at AI labs and new hardware features to aid monitoring and verification.

Evaluating AI systems for harmful capabilities or malicious intentions was also a key focus. Without this ability, Amodei argued that federal monitoring would be little more than a “rubber stamp.” Common standards for AI safety could also help set a bar beneath which companies would be legally liable for harms caused by their unsafe AI systems.

Professor Bengio argued that international cooperation would be necessary for effective AI governance, and suggested successful efforts could bring the risk from rogue AI “100 times” lower. Bengio discussed other ways to reduce AI risks, including slowing down AI development or disallowing AIs from taking actions in the real world, instead limiting them to answering questions and giving advice to humans.

Overall, the hearing showed that federal officials are concerned about the catastrophic risks of advanced AI, including biological terrorism and rogue AI. Many policy ideas have been proposed, but the only concrete commitments have been voluntary. Hopefully, strong legislation will soon follow.

LESSWRONG
LW

LESSWRONG
LW

8

AISN #17: Automatically Circumventing LLM Guardrails, the Frontier Model Forum, and Senate Hearing on AI Oversight

8

Automatically Circumventing LLM Guardrails

AI Labs Announce the Frontier Model Forum

Senate Hearing on AI Oversight

Links

8

8