I want to draw attention to a new paper, written by myself, David "davidad" Dalrymple, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, and Joshua Tenenbaum.
In this paper we introduce the concept of "guaranteed safe (GS) AI", which is a broad research strategy for obtaining safe AI systems with provable quantitative safety guarantees. Moreover, with a sufficient push, this strategy could plausibly be implemented on a moderately short time scale. The key components of GS AI are:
- A formal safety specification that mathematically describes what effects or behaviors are considered safe or acceptable.
- A world model that provides a mathematical description of the environment of the AI system.
- A verifier that provides a formal proof (or some other comparable auditable assurance) that the AI system satisfies the safety specification with respect to the world model.
The first thing to note is that a safety specification in general is not the same thing as a reward function, utility function, or loss function (though they include these objects as special cases). For example, it may specify that the AI system should not communicate outside of certain channels, copy itself to external computers, modify its own source code, or obtain information about certain classes of things in the external world, etc. The safety specifications may be specified manually, generated by a learning algorithm, written by an AI system, or obtained through other means. Further detail is provided in the main paper.
The next thing to note is that most useful safety specifications must be given relative to a world model. Without a world model, we can only use specifications defined directly over input-output relations. However, we want to define specifications over input-outcome relations instead. This is why a world model is a core component of GS AI. Also note that:
- The world model need not be a "complete" model of the world. Rather, the required amount of detail and the appropriate level of abstraction depends on both the safety specification(s) and the AI system's context of use.
- The world model should of course account for uncertainty, which may include both stochasticity and nondeterminism.
- The AI system whose safety is being verified may or may not use a world model, and if it does, we may or may not be able to extract it. However, the world model that is used for the verification of the safety properties need not be the same as the world model of the AI system whose safety is being verified (if it has one).
The world model would likely have to be AI-generated, and should ideally be interpretable. In the main paper, we outline a few potential strategies for producing such a world model.
Finally, the verifier produces a quantitative assurance that the base-level AI controller satisfies the safety specification(s) relative to the world model(s). In the most straightforward form, this could simply take the shape of a formal proof. However, if a direct formal proof cannot be obtained, then there are weaker alternatives that would still produce a quantitative guarantee. For example, the assurance may take the form of a proof that bounds the probability of failing to satisfy the safety specification, or a proof that the AI system will converge towards satisfying the safety specification (with increasing amounts of data or computational resources, for example). Such proofs are of course often very hard to obtain. However, further progress in automated theorem proving (and related techniques) may make it very substantially easier to obtain such proofs. Furthermore, an automated theorem prover AI could be very powerful without having dangerous capabilities. For more detail, see the main paper.
If each of these three components can be created, then they can be used to provide auditable, quantitative safety guarantees for AI systems. This strategy does also not require interpretability to be solved, but could still provide a solution to the inner alignment problem (and rule out deceptive alignment, etc). Moreover, it should be possible to implement this strategy without any new fundamental insights; improvement of existing techniques (using LLMs and other tools) may be sufficient. If we get a substantive research push in this direction, then I am optimistic about the prospects of achieving substantially safer AI systems through the GS AI strategy.
For more detail, see the full paper.
While I have a lot of respect for many of the authors, this work feels to me like its mostly sweeping the big problems under the rug. It might at most be useful for AI labs to make a quick buck, or do some safety-washing, before we all die. I might be misunderstand some of the approaches proposed here, and some of my critiques might be invalid as such.
My understanding is that the paper proposes that the AI implements and works with a human-interpretable world model, and that safety specifications is given in this world-model/ontology.
But given an ASI with such a world model, I don't see how one would specify properties such as "hey please don't hyperoptimize squiggles or goodhart this property". Any specification I can think of mostly leaves room for the AI to abide by it, and still kill everyone somehow. This recurses back to "just solve alignment/corrigbility/safe-superintelligent-behaviour".
Nevermind getting an AI where its actually preforming all cognition in the ontology you provided for it (that would probably count as real progress to me). How do you know that just because the internal ontology says "X", "X" is what the AI actually does? See this post.
If you are going to prove vague things about your AI and have it be any use at all, you'd want to prove properties in the style of "this AI has the kind of 'cognition/mind' for which it is 'beneficial for the user' to have running than not" and "this AI's 'cognition/mind' lies in an 'attractor space' where violated assumptions, bugs and other errors cause the AI to follow the desired behavior anyways".
For sufficiently powerful systems having proofs about output behavior mostly does not narrow down your space to safe agents. You want proofs about their internals. But that requires having a less confused notion of what to ask for in the AI's internals such that it is a safe computation to run, never mind formally specifying it. I don't have, and haven't found anyone who seems to understand enough of the relevant properties of minds, what it means for something to be 'beneficial to the user', or how to construct powerful optimizers which fail non-catastrophically. It appears to me that we're not bottle necked on proving these properties, but rather that the bottleneck is identifying and understanding what shape they have.
I do expect some of these approaches to, in the very limited scope of things you can formally specify, allow for more narrow AI applications, promote AI investments and give rise to new techniques and non-trivially shorten the time until we are able to build superhuman systems. My vibes regarding this are made worse by how various existing methods are listed in "safety ranking". It lists RLHF, Constitutional AI & Model-free RL as more safe than unsupervised learning, but to me it seems like these methods instill stable agent-like behavior on top of a prediction-engine, where there previously was either none or nearly none. They make no progress on the bits of the alignment problem which matter, but do let AI labs create new and better products, make more money, fund more capabilities research etc. I predict that future work along these lines will mostly have similar effects; little progress on the bits which matter, but useful capabilities insights along the way, which gets incorrectly labeled alignment.