Miles Brundage: Finding Ways to Credibly Signal the Benignness of AI Development and Deployment is an Urgent Priority

Zach Stein-Perlman

Miles Brundage has a new Substack and I like this post. Here's the introduction.

A simplified view of AI policy is that there are two “arenas” with distinct challenges and opportunities. In the domestic arena, governments aim to support AI innovation within their borders and ensure that it is widely beneficial to their citizens, while simultaneously preventing that innovation from harming citizens’ interests (safety, human rights, economic well-being, etc). In the international arena, governments aim to leverage AI to improve their national security, while minimizing the extent to which these efforts cause other governments to fear for their security (unless this is the explicit intent in a given context, e.g., deterrence or active conflict).
For example, in the domestic arena, a government might require that companies assess the degree to which a language model is capable of aiding in the creation of biological weapons, and share information about the process of red teaming that model and mitigating such risks. And governments could require documentation of known biases in the model. In each case, the government would be attempting to ensure, and produce evidence, that an AI system is benign, in the sense of not posing threats to the citizens of that country (although in some cases the government may also be mitigating risks to citizens of other countries, particularly for catastrophic risks that spill across borders, of companies with a global user base). Citizens may also demand evidence that a government’s own use of AI is benign — e.g., not infringing on citizens’ privacy — or that appropriate precautions are taken in high-stakes use cases, or one part of government might demand this of another part of government.
In the international arena, a military might build an autonomous weapon system and attempt to demonstrate that it will be used for defensive purposes only. A military might also state that it will not use AI in certain contexts like nuclear command and control.
In all of these cases, outside observers might be justifiably skeptical of these claims being true, or staying true over time, without evidence. The common theme here is that many parties would benefit from it being possible to credibly signal the benignness of AI development and deployment. Those parties include organizations developing and deploying AI (who want to be trusted), governments (who want citizens to trust them and the services they use), commercial or national competitors of a given company/country (who want to know that precautions are being taken and that cutting corners in order to keep up can be avoided), etc. By credibly signaling benignness, I mean demonstrating that a particular AI system, or an organization developing or deploying AI systems, is not a significant danger to third parties. Domestically, governments should seek to ensure that their agencies’ use of AI as well as private actors’ use of AI is benign. Internationally, militaries should seek to credibly signal benignness and should demand the same of their adversaries, again at least outside of the context of deterrence or active conflict.
When I say that an AI system or organization is shown to not pose a significant danger to third parties, I mean that outsiders should have high confidence that:
The AI developer or deployer will not accidentally cause catastrophic harms, enable others to cause catastrophic harms, or have lax enough security to allow others to steal their IP and then cause catastrophic harms.
The AI developer or deployer’s statements about evaluation and mitigation of risks, including catastrophic and non-catastrophic risks, are complete and accurate.
The AI developer or deployer’s statements regarding how their technology is being used (and ways in which its use is restricted) are complete and accurate.
The AI developer or deployer is not currently planning to use their capabilities in a way that intentionally causes harm to others, and if this were to change, there would be visible signs of the change far enough in advance for appropriate actions to be taken.
Unfortunately, it is inherently difficult to credibly signal the benignness of AI development and deployment (at a system level or an organization level) due to AI’s status as a general purpose technology, and because the information required to demonstrate benignness may compromise security, privacy, or intellectual property. This makes the research, development, and piloting of new ways to credibly signal the benignness of AI development and deployment, without causing other major problems in the process, an urgent priority.

Miles recommends research to determine what information labs should share, government monitoring compute "to ensure that large-scale illicit projects aren’t possible," verification for demonstrating benignness internationally, and more.

(Context: Miles recently left OpenAI; see his "Why I’m Leaving OpenAI and What I’m Doing Next" and Garrison Lovely's "Miles Brundage resigned from OpenAI, and his AGI readiness team was disbanded.")

Edit: as mentioned in the comments, this post mostly sets aside the alignment/control problem to focus on another problem.

I like a bunch of the post, but I do feel confused about the framing. At present, substantially scaling up AI systems is probably not benign. As such, demonstrating its benignness is not the issue, it's actually making them benign (or credibly demonstrating their non-benignness).

Miles doesn't clarify what level of capabilities he is talking about, but various contextual clues make me think he is talking about the next few generations of frontier systems. I would definitely like to know for sure whether those are benign or not, but I think there is a quite substantial chance they are not, and in that case the issue is not demonstrating benignness, but like, actually making them benign.

Miles never says this directly, but the headline and the framing of the article really scream to me that it's assuming these upcoming systems are benign, while implying that it's understandable that policymakers and decision-makers are at present not convinced of that fact (while the author is). This is, again, not super explicitly said in the post, but I do think it's a really strong implication of the title and various paragraphs within it. I would be curious to hear from other people whether they reacted similarly.

Ok, coming back to this a few hours later, here is roughly what feels off to me. I think the key thing is that the priority should be "developing measurement tools that tell us with high confidence whether systems are benign", but Miles somehow frames this as "measurement tools that tell us with high confidence that AI systems are benign", which sounds confused in the same way that a scientist saying "what we need are experiments that confirm my theory" sounds kind of confused (I think there are important differences between Bayesian and scientific evidence, and a scientist can have justified confidence before the scientific community will accept their theory, but still, something seems wrong when a scientist is saying the top priority is to develop experiments that confirm their theory, as opposed to "experiments that tell us whether my theory is right").

Edit: There is a similar post by Wei Dai a long time ago that I had some similar feelings about: https://www.lesswrong.com/posts/dt4z82hpvvPFTDTfZ/six-ai-risk-strategy-ideas#_Generate_evidence_of_difficulty__as_a_research_purpose

"Generate evidence of difficulty" as a research purpose
How to handle the problem of AI risk is one of, if not the most important and consequential strategic decisions facing humanity. If we err in the direction of too much caution, in the short run resources are diverted into AI safety projects that could instead go to other x-risk efforts, and in the long run, billions of people could unnecessarily die while we hold off on building "dangerous" AGI and wait for "safe" algorithms to come along. If we err in the opposite direction, well presumably everyone here already knows the downside there.
A crucial input into this decision is the difficulty of AI safety, and the obvious place for decision makers to obtain evidence about the difficulty of AI safety is from technical AI safety researchers (and AI researchers in general), but it seems that not many people have given much thought on how to optimize for the production and communication of such evidence (leading to communication gaps like this one). (As another example, many people do not seem to consider that doing research on a seemingly intractably difficult problem can be valuable because it can at least generate evidence of difficulty of that particular line of research.)

But I do think that section of the post handles the tradeoff a lot better, and gives me a lot less of the "something is off" vibes.

demonstrating its benignness is not the issue, it's actually making them benign

This also stood out to me.

Demonstrating benignness is roughly equivalent to being transparent enough that the other actor can determine whether you're benign, plus actually being benign.

For now, model evals for dangerous capabilities (along with assurance that you're evaluating the lab's most powerful model + there are no secret frontier AI projects) could suffice to demonstrate benignness. After they no longer work, even just getting good evidence that your AI project [model + risk assessment policy + safety techniques + deployment plans + weights/code security + etc.] is benign is an open problem, never mind the problem of proving that to other actors (never mind doing so securely and at reasonable cost).

Two specific places I think are misleading:

Unfortunately, it is inherently difficult to credibly signal the benignness of AI development and deployment (at a system level or an organization level) due to AI’s status as a general purpose technology, and because the information required to demonstrate benignness may compromise security, privacy, or intellectual property. This makes the research, development, and piloting of new ways to credibly signal the benignness of AI development and deployment, without causing other major problems in the process, an urgent priority.
. . .
We need to define what information is needed in order to convey benignness without unduly compromising intellectual property, privacy, or security.

The bigger problem is (building powerful AI such that your project is benign and) getting strong evidence for yourself that your project is benign. But once you're there, demonstrating benignness to other actors is another problem, lesser but potentially important and under-considered.

Or: this post mostly sets aside the alignment/control problem, and it should have better signposted that it was doing so, but it's still a good post on another problem.

It seems to me that, for institutional frameworks, credible transparency is an important necessary (not sufficient) step toward credible benignness; that credible transparency is currently not implemented within existing frameworks such as RSPs and Summit commitments; and that it would be a very achievable step forward.

So right now, model evals do suffice to demonstrate benignness, but we have to have some confidence in those evals, and transparency (e.g., openness to independent eval testing) seems essential. Then, when evals are no longer sufficient, I'm not sure what will be, but whatever it is, it will for sure require transparent testing by independent observers for credible benignness.