This is a linkpost for https://arxiv.org/abs/2405.06624

Authors: David "davidad" Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, Joshua Tenenbaum

Abstract:

Ensuring that AI systems reliably and robustly avoid harmful or dangerous behaviours is a crucial challenge, especially for AI systems with a high degree of autonomy and general intelligence, or systems used in safety-critical contexts. In this paper, we will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI. The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees. This is achieved by the interplay of three core components: a world model (which provides a mathematical description of how the AI system affects the outside world), a safety specification (which is a mathematical description of what effects are acceptable), and a verifier (which provides an auditable proof certificate that the AI satisfies the safety specification relative to the world model). We outline a number of approaches for creating each of these three core components, describe the main technical challenges, and suggest a number of potential solutions to them. We also argue for the necessity of this approach to AI safety, and for the inadequacy of the main alternative approaches.
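To make the three-component structure concrete, here is a minimal illustrative sketch in Python. All names and interfaces below (WorldModel, SafetySpecification, ProofCertificate, verify_bounded) are hypothetical stand-ins rather than the paper's formalism, and the "verifier" is only a bounded exhaustive check, far weaker than the auditable proof certificates the paper argues for.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Optional

# Purely illustrative sketch of how the three GS AI components could fit together.
# Every name and interface here is hypothetical (not the paper's notation).

@dataclass
class WorldModel:
    # mathematical description of how the AI's outputs affect the world
    transition: Callable[[Any, Any], Any]   # (world_state, ai_output) -> next world_state

@dataclass
class SafetySpecification:
    # mathematical description of which effects on the world are acceptable
    acceptable: Callable[[Any], bool]       # world_state -> bool

@dataclass
class ProofCertificate:
    claim: str
    evidence: list                          # here: the audited trace of checked states

def verify_bounded(policy: Callable[[Any], Any],
                   world_model: WorldModel,
                   spec: SafetySpecification,
                   initial_states: Iterable[Any],
                   horizon: int) -> Optional[ProofCertificate]:
    """Check, for each initial state, that running the policy under the world
    model for `horizon` steps never leaves the acceptable set. Returns an
    auditable trace as a toy 'certificate', or None if the spec is violated."""
    trace = []
    for state in initial_states:
        for _ in range(horizon):
            if not spec.acceptable(state):
                return None
            state = world_model.transition(state, policy(state))
            trace.append(state)
        if not spec.acceptable(state):
            return None
    return ProofCertificate(claim=f"spec holds for {horizon} steps", evidence=trace)
```

For instance, `verify_bounded(lambda s: 0, WorldModel(lambda s, a: s + a), SafetySpecification(lambda s: abs(s) < 10), [0], 100)` returns a toy certificate, because the policy's outputs keep every reachable state acceptable under the model.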


A lot of AI safety seems to assume that humans are safer than they are, and that producing software that operates within a specification is harder than it is.  It's nice to see this paper moving towards integrating actual safety analysis (the remark about collapsing bridges was a breath of fresh air), instead of general demands that 'the AI always do as humans say'!

 

A human intelligence placed in charge of a nation state can kill 7 logs of humans and still be remembered heroically.  An AI system placed in charge of a utopian reshaping of the society of a major country with a 'keep the deaths within 6 logs' guideline that it can actually stay within would be an improvement on the status quo.

If safety people are saying 'we can't build AI systems that could make people feel bad, and we definitely can't build systems that kill people', their demand for perfection is in conflict with improvement.*

I suspect that a major AI alignment failure will come from 'we put the human in charge, and human error led to the model doing bad things'. The industrial/aviation safety community now rightly views 'pilot error' as a lazy way of ending an analysis and avoiding making the engineering changes to the system that the accident conditions demand.

*edit: imagine if the 'airplane safety' community had developed in 1905 (soon humans will be flying in planes!) and had resembled "AI safety": Not one human can be risked! No making planes that can carry bombs! The people who said pregnant women shouldn't ride trains because the baby would fly out of their bodies were wrong there, but keep them off the planes!

I am quite interested in takes from various people in alignment on this agenda. I've engaged with both Davidad's and Bengio's stuff a bunch in the last few months, and I feel pretty confused (and skeptical) about a bunch of it, and would be interested in reading more of what other people have to say.

I wrote up some of my thoughts on Bengio's agenda here.

TLDR: I'm excited about work on trying to find any interpretable hypothesis which can be highly predictive on hard prediction tasks (e.g. next token prediction).[1] From my understanding, the bayesian aspect of this agenda doesn't add much value.

I might collaborate with someone to write up a more detailed version of this view which engages in detail and is more clearly explained. (To make it easier to argue against and to exist as a more canonical reference.)

As for Davidad's agenda, I think the "manually build an (interpretable) infra-bayesian world model which is sufficiently predictive of the world (as smart as our AI)" part is very likely to be totally unworkable even with vast amounts of AI labor. It's possible that something can be salvaged by retreating to a weaker approach. It seems like a roughly reasonable direction to explore as a highly ambitious moonshot pursued by automating research with AIs, but if you're not optimistic about safely using vast amounts of AI labor to do AI safety work[2], you should discount accordingly.

For an objection along these lines, see this comment.

(The fact that we can be conservative with respect to the infra-bayesian world model doesn't seem to buy much; most of the action is in getting something which is at all good at predicting the world. For instance, in Fabien's example, we would need the infra-bayesian world model to be able to distinguish between zero-days and safe code regardless of conservativeness. If it didn't distinguish, then we'd never be able to run any code. This probably requires nearly as much intelligence as our AI has.)

Proof checking on this world model also seems likely to be unworkable, though I have less confidence in this view. And the more computationally intractable the infra-bayesian world model is to run, the harder it is to check proofs against it. E.g., if running the world model on many inputs is intractable (as would seem to be the default for detailed simulations), I'm very skeptical about proving anything about what it predicts.

I'm not an expert on either agenda and it's plausible that this comment gets some important details wrong.


  1. Or just substantially improving on the interpretability/predictiveness Pareto frontier. ↩︎

  2. Presumably by employing some sort of safety intervention e.g. control or only using narrow AIs. ↩︎

From my understanding, the bayesian aspect of this agenda doesn't add much value.

A core advantage of Bayesian methods is the ability to handle out-of-distribution situations more gracefully, and this is arguably as important as (if not more important than) interpretability. In general, most (?) AI safety problems can be cast as an instance of a case where a model behaves as intended on a training distribution, but generalises in an unintended way when placed in a novel situation. Traditional ML has no straightforward way of dealing with such cases, since it only maintains a single hypothesis at any given time. However, Bayesian methods may make it less likely that a model will misgeneralise, or should at least give you a way of detecting when this is the case.
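As a toy illustration of this contrast (not anything specific to Bengio's proposal), here is a sketch in which predictions are averaged over a set of sampled hypotheses and disagreement among them is used as a crude out-of-distribution signal. How the hypotheses and their weights are obtained (MCMC, deep ensembles, MC dropout, ...) is left abstract, and the threshold is an arbitrary illustrative choice.

```python
import numpy as np

def posterior_predict(hypotheses, weights, x, disagreement_threshold=0.2):
    """hypotheses: callables mapping an input to a scalar prediction;
    weights: their (approximate) posterior probabilities.
    Returns the posterior-mean prediction plus a flag for inputs on which the
    plausible hypotheses disagree -- the more graceful OOD behaviour argued for above."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    preds = np.array([h(x) for h in hypotheses])
    mean = float(np.sum(w * preds))
    spread = float(np.sqrt(np.sum(w * (preds - mean) ** 2)))
    flagged = spread > disagreement_threshold
    return mean, flagged   # a single trained model gives only `mean`, with no flag
```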

I also don't agree with the characterisation that "almost all the interesting work is in the step where we need to know whether a hypothesis implies harm" (if I understand you correctly). Of course, creating a formal definition or model of "harm" is difficult, and creating a world model is difficult, but once this has been done, it may not be very hard to detect if a given action would result in harm. But I may not understand your point in the intended way.

 

manually build an (interpretable) infra-bayesian world model which is sufficiently predictive of the world (as smart as our AI)

I don't think this is an accurate description of davidad's plan. Specifically, the world model does not necessarily have to be built manually, and it does not have to be as good at prediction as our AI. The world model only needs to be good at predicting the variables that are important for the safety specification(s), within the range of outputs that the AI system may produce.

 

Proof checking on this world model also seems likely to be unworkable

I agree that this is likely to be hard, but not necessarily to the point of being unworkable. Similar things are already done for other kinds of software deployed in complex contexts, and ASL-2/3 AI may make this substantially easier.

A core advantage of Bayesian methods is the ability to handle out-of-distribution situations more gracefully

I dispute that Bayesian methods will be much better at this in practice.

[

Aside:

In general, most (?) AI safety problems can be cast as an instance of a case where a model behaves as intended on a training distribution

This seems like about 1/2 of the problem from my perspective. (So I almost agree.) Though, you can shove all AI safety problems into this bucket by doing a maneuver like "train your model on the easy cases humans can label, then deploy into the full distribution". But at some point, this is no longer very meaningful. (E.g. you train on solving 5th grade math problems and deploy to the Riemann hypothesis.)

]

Traditional ML has no straightforward way of dealing with such cases, since it only maintains a single hypothesis at any given time.

Is this true? Aren't NNs implicitly ensembles of a vast number of models? Also, does ensembling 5 NNs help? If this doesn't help, why does sampling 5 models from the Bayesian posterior help? Or is it that we need to approximate sampling 1,000,000 models from the posterior? If we're conservative over a million models, how will we ever do anything?

However, Bayesian methods may make it less likely that a model will misgeneralise, or should at least give you a way of detecting when this is the case.

Do they? I'm skeptical on both of these. It maybe helps a little and rules out some unlikely scenarios, but I'm overall skeptical.

Overall, my view on the Bayesian approach is something like:

  • What prior were we using for Bayesian methods? If this is just the NN prior, then I'm not really sold we do much better than just training a NN (or an ensemble of NNs). If our prior is importantly different in a way which we think will help, why can't we regularize to train a NN in a normal way which will vaguely reasonably approximate this prior?
  • My main concern is that we can get a smart predictive model which understands OOD cases totally fine, but we still get catastrophic generalization for whatever reason. I don't see why bayesian methods help.
    • In the ELK case, our issue is that too much of the prior is human imitation or other problematic generalization. (We can ensemble/sample from the posterior and reject things where our hypotheses don't match, but this will only help so much, and I don't really see a strong case for bayes helping more than ensembling.)
    • In the case of a treacherous turn, it seems like the core concern was that all of our models are schemers and will work together. If this isn't the case (e.g. if ensembling gets 25% non-schemers), then we have other great options. I again don't see how bayes ensures you have some non-schemers while ensembling doesn't. (Like it could in principle, but why? Training your models on way more dog fanfiction could also make them less likely to be schemers; we need some reason to think this isn't just noise.)

I also don't agree with the characterisation that "almost all the interesting work is in the step where we need to know whether a hypothesis implies harm" (if I understand you correctly). Of course, creating a formal definition or model of "harm" is difficult, and creating a world model is difficult, but once this has been done, it may not be very hard to detect if a given action would result in harm.

My claim here is that all the interesting work is in ensuring that we know whether a hypothesis "thinks" that harm will result. It would be fine to put this work into constructing an interpretable hypothesis such that we can know whether it causes harm, or into constructing a formal model of harm and ensuring we have access to all important latent variables for this formal model, but this work still must be done.

Another way to put this is that all the interesting action was happening at the point where you solved the ELK problem. I agree that if:

  1. You have access to all interesting latent variables for your predictive hypothesis. (And your predictive hypothesis (or hypotheses) is competitive with your AI agent at predicting these latent variables.)
  2. You can formally define harm in terms of these latent variables

You're fine. But, step (1) is just the ELK problem and I don't even really think you need to solve step (2) for most plans. (You can just have humans compute step (2) manually for most types of latent variables, though this does have some issues.)

Specifically, the world model does not necessarily have to be built manually

I thought the plan was to build it with either AI labor or human labor so that it will be sufficiently interpretable. Not to e.g. build it with SGD. If the plan is to build it with SGD and not to ensure that it is interpretable, then why does it provide any safety guarantee? How can we use the world model to define a harm predicate?

it does not have to be as good at prediction as our AI. The world model only needs to be good at predicting the variables that are important for the safety specification(s), within the range of outputs that the AI system may produce

Won't predicting safety specific variables contain all of the difficulty of predicting the world? (Because these variables can be mediated by arbitrary intermediate variables.) This sounds to me very similar to "we need to build an interpretable next-token predictor, but the next-token predictor only needs to be as good as the model at predicting the lower-case version of the text on just scientific papers". This is just as hard as building a full-distribution next-token predictor.

But at some point, this is no longer very meaningful. (E.g. you train on solving 5th grade math problems and deploy to the Riemann hypothesis.) 

It sounds to me like we agree here; I don't want to put too much weight on "most".

 

Is this true?

It is true in the sense that you don't have any theoretical guarantees, and in the sense that it also often fails to work in practice.

 

Aren't NNs implicitly ensembles of a vast number of models?

They probably are, to some extent. However, in practice, you often find that neural networks make very confident (and wrong) predictions for out-of-distribution inputs, in a way that seems to be caused by them projecting some spurious correlation. For example, you train a network to distinguish different types of tanks, but it learns to distinguish night from day. You train an agent to collect coins, but it learns to go to the right. You train a network to detect criminality, but it learns to detect smiles. Adversarial examples could also be cast as an instance of this phenomenon. In all of these cases, we have a situation where there are multiple patterns in the data that fit a given training objective, but where a neural network ends up giving an unreasonably large weight to some of these patterns at the expense of other plausible patterns. I thus think it's fair to say that -- empirically -- neural networks do not robustly quantify uncertainty in a reliable way when out-of-distribution. It may be that this problem mostly goes away with a sufficiently large amount of sufficiently varied data, but it seems hard to get high confidence in that.
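A toy version of the spurious-correlation failure mode described above, using logistic regression as a stand-in for a neural network: the point is only that a purely discriminative fit latches onto a shortcut feature and stays confident when that shortcut flips out of distribution. All data and features here are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature 0 is a noisy "real" signal; feature 1 is a shortcut that matches the
# label perfectly in training but is reversed at test time (cf. tanks vs. daylight,
# coins vs. "go right"). The classifier leans on the shortcut and is confidently
# wrong once it flips.
rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)
real_signal = y + rng.normal(size=n)        # weakly informative true feature
shortcut_train = y.astype(float)            # perfectly correlated in training
X_train = np.column_stack([real_signal, shortcut_train])

clf = LogisticRegression().fit(X_train, y)

X_test = np.column_stack([real_signal, 1.0 - shortcut_train])   # shortcut reversed
probs = clf.predict_proba(X_test)[:, 1]
print("mean confidence in predicted class:", np.abs(probs - 0.5).mean() + 0.5)
print("accuracy:", ((probs > 0.5) == y).mean())   # typically far below chance here
```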

 

Also, does ensembling 5 NNs help?

In practice, this does not seem to help very much.

 

If we're conservative over a million models, how will we ever do anything?

I mean, there can easily be cases where we assign a very low probability to any given "complete" model of a situation, but where we are still able to assign a high probability to many different partial hypotheses. For example, if you randomly sample a building somewhere on earth, then your credence that the building has a particular floor plan might be less than 1 in 1,000,000 for each given floor plan. However, you could still assign a credence of near-1 that the building has a door, and a roof, etc. To give a less contrived example, there are many ways for the stock market to evolve over time. It would be very foolish to assume that it will evolve according to, e.g., your maximum-likelihood model. However, you can still assign a high credence to the hypothesis that it will grow on average. In many (if not all?) cases, such partial hypotheses are sufficient.
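A tiny numerical version of the floor-plan example, with a made-up uniform posterior: every complete model has negligible credence, yet the coarse partial hypothesis ("there is at least one door") gets credence essentially 1.

```python
import random

# Toy illustration: complete hypotheses are random 20-bit "floor plans", and the
# posterior is just uniform over the sampled plans (made up for illustration).
random.seed(0)
plans = {tuple(random.getrandbits(1) for _ in range(20)) for _ in range(10**5)}
posterior = {plan: 1 / len(plans) for plan in plans}

has_a_door = lambda plan: any(plan)   # a coarse, partial property of the plan

print(max(posterior.values()))        # each complete model: credence ~1e-5
print(sum(p for plan, p in posterior.items() if has_a_door(plan)))  # partial hypothesis: ~1.0
```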

 

If our prior is importantly different in a way which we think will help, why can't we regularize to train a NN in a normal way which will vaguely reasonably approximate this prior?

In the examples I gave above, the issue is less about the prior and more about the need to keep track of all plausible alternatives (which neural networks do not seem to do, robustly). Using ensembles might help, but in practice this does not seem to work that well.

 

I again don't see how bayes ensures you have some non-schemers while ensembling doesn't.

I also don't see a good reason to think that a Bayesian posterior over agents should give a large weight to non-schemers. However, that isn't the use-case here (the world model is not meant to be agentic).

 

Another way to put this is that all the interesting action was happening at the point where you solved the ELK problem.

So, this depends on how you attempt to create the world model. If you try to do this by training a black-box model to do raw sensory prediction, and then attempt to either extract latent variables from that model, or turn it into an interpretable model, or something like that, then yes, you would probably have to solve ELK, or solve interpretability, or something like that. I would not be very optimistic about that working. However, this is in no way the only option. As a very simple example, you could simply train a black-box model to directly predict the values of all latent variables that you need for your definition of harm. This would not produce an interpretable model, and so you may not trust it very much (unless you have some learning-theoretic guarantee, perhaps), but it would not be difficult to determine if such a model "thinks" that harm would occur in a given scenario. As another example, you could build a world model "manually" (with humans and LLMs). Such a model may be interpretable by default. And so on.
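A minimal sketch of the "directly predict the needed latent variables" option: latent_predictor and harm_predicate are hypothetical placeholders for a trained black-box model and the formal harm definition over its outputs. The point is only that, once the latents are exposed, checking whether the model "thinks" harm occurs is trivial; whether its latents track reality is the separate (ELK-flavoured) question discussed in this thread.

```python
from typing import Any, Callable, Dict

def model_thinks_harm_occurs(latent_predictor: Callable[[Any], Dict[str, float]],
                             harm_predicate: Callable[[Dict[str, float]], bool],
                             scenario: Any) -> bool:
    """Query the (hypothetical) trained predictor for the latent variables the
    harm definition needs, then evaluate the formal harm predicate on them."""
    latents = latent_predictor(scenario)   # e.g. {"diamond_in_vault": 0.97, ...}
    return harm_predicate(latents)
```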

 

I thought the plan was to build it with either AI labor or human labor so that it will be sufficiently interpretable. Not to e.g. build it with SGD. If the plan is to build it with SGD and not to ensure that it is interpretable, then why does it provide any safety guarantee? How can we use the world model to define a harm predicate?

The general strategy may include either of these two approaches. I'm just saying that the plan does not definitionally rely on the assumption that the world model is built manually.

 

Won't predicting safety specific variables contain all of the difficulty of predicting the world?

That very much depends on what the safety specifications are, and how you want to use your AI system. For example, think about the situations where similar things are already done today. You can prove that a cryptographic protocol is unbreakable, given some assumptions, without needing to have a complete predictive model of the humans that use that protocol. You can prove that a computer program will terminate using a bounded amount of time and memory, without needing a complete predictive model of all inputs to that computer program. You can prove that a bridge can withstand an earthquake of such-and-such magnitude, without having to model everything about the earth's climate. And so on. If you want to prove something like "the AI will never copy itself to an external computer", or "the AI will never communicate outside this trusted channel", or "the AI will never tamper with this sensor", or something like that, then your world model might not need to be all that detailed. For more ambitious safety specifications, you might of course need a more detailed world model. However, I don't think there is any good reason to believe that the world model categorically would need to be a "complete" world model in order to prove interesting safety properties. 

I thus think it's fair to say that -- empirically -- neural networks do not robustly quantify uncertainty in a reliable way when out-of-distribution

Sure, but why will the bayesian model reliably quantify uncertainty OOD? There is also no guarantee of this (OOD).

Whether or not you get reliable uncertainty quantification will depend on your prior. If you have (e.g.) the NN prior, I expect the uncertainty quantification is similar to what you'd get if you trained an ensemble.

E.g., you'll find a bunch of NNs (in the bayesian posterior) which also have the spurious correlation that a trained NN (or ensemble of NNs) would have.

If you have some other prior, why can't we regularize our NN to match this?

(Maybe I'm confused about this?)

Separately, I guess I'm not that worried about failures in which the network itself doesn't "understand" what's going on. So the main issue is cases where the model in some sense knows, but doesn't report this. (E.g. ELK problems, at least broadly speaking.)

I think there are a bunch of issues that look sort of like this now, but this will go away once models are smart enough to automate R&D etc.

I'm not worried about future models murdering us because they were confused and thought this would be what we wanted due to a spurious correlation.

(I do have some concerns around jailbreaking, but I also think that will look pretty different, and the adversarial case is very different. And there appear to be solutions which are more promising than bayesian ML.)

I think the distinction between these two cases often can be somewhat vague.

Why do you think that the adversarial case is very different?

I think you're perhaps reading me as being more bullish on Bayesian methods than I in fact am -- I am not necessarily saying that Bayesian methods in fact can solve OOD generalisation in practice, nor am I saying that other methods could not also do this. In fact, I was until recently very skeptical of Bayesian methods, before talking about it with Yoshua Bengio. Rather, my original reply was meant to explain why the Bayesian aspect of Bengio's research agenda is a core part of its motivation, in response to your remark that "from my understanding, the bayesian aspect of [Bengio's] agenda doesn't add much value".

I agree that if a Bayesian learner uses the NN prior, then its behaviour should -- in the limit -- be very similar to training a large ensemble of NNs. However, there could still be advantages to an explicitly Bayesian method. For example, off the top of my head:

  1. It may be that you need an extremely large ensemble to approximate the posterior well, and that the Bayesian learner can approximate it much better with far fewer resources.
  2. It may be that you more easily can prove learning-theoretic guarantees for the Bayesian learner.
  3. It may be that a Bayesian learner makes it easier to condition on events that have a very small probability in your posterior (such as, for example, the event that a particular complex plan is executed).
  4. It may be that the Bayesian learner has a more interpretable prior, or that you can reprogram it more easily.

And so on; these are just some examples. Of course, whether you get these benefits in practice is a matter of speculation until we have a concrete algorithm to analyse. All I'm saying is that there are valid and well-motivated reasons to explore this particular direction.

Insofar as the hope is:

  1. Figure out how to approximate sampling from the Bayesian posterior (using e.g. GFlowNets or something).
  2. Do something else that makes this actually useful for "improving" OOD generalization in some way.

It would be nice to know what (2) actually is and why we needed step (1) for it. As far as I can tell, Bengio hasn't stated any particular hope for (2) which depends on (1).

Rather, my original reply was meant to explain why the Bayesian aspect of Bengio's research agenda is a core part of its motivation, in response to your remark that "from my understanding, the bayesian aspect of [Bengio's] agenda doesn't add much value".

I agree that if the Bayesian aspect of the agenda did a specific useful thing like '"improve" OOD generalization' or 'allow us to control/understand OOD generalization', then this aspect of the agenda would be useful.

However, I think the Bayesian aspect of the agenda won't do this and thus it won't add much value. I agree that Bengio (and others) think that the Bayesian aspect of the agenda will do things like this - but I disagree and don't see the story for this.

I agree that "actually use Bayesian methods" sounds like the sort of thing that could help you solve dangerous OOD generalization issues, but I don't think it clearly does.

(Unless of course someone has a specific proposal for (2) from my above decomposition which actually depends on (1).)

However, there could still be advantages to an explicitly Bayesian method. For example, off the top of my head

1-3 don't seem promising/important to me. (4) would be useful, but I don't see why we needed the bayesian aspect of it. If we have some sort of parametric model class which we can make smart enough to reason effectively about the world, just making an ensemble of these surely gets you most of the way there.

To be clear, if the hope is "figure out how to make an ensemble of interpretable predictors which are able to model the world as well as our smartest model", then this would be very useful (e.g. it would allow us to avoid ELK issues). But all the action was in making interpretable predictors, no bayesian aspect was required.

And so on. If you want to prove something like "the AI will never copy itself to an external computer", or "the AI will never communicate outside this trusted channel", or "the AI will never tamper with this sensor", or something like that, then your world model might not need to be all that detailed.

I agree that you can improve safety by checking outputs in specific ways in cases where we can do so (e.g. requiring the AI to formally verify its code doesn't have side channels).

The relevant question is whether there are interesting cases where we can't currently verify the output with conventional tools (e.g. formal software analysis or bridge design), but we can using a world model we've constructed.

One story for why we might be able to do this is that the world model is able to generally predict the world as well as the AI.

Suppose you had a world model which was as smart as GPT-3 (but magically interpretable). Do you think this would be useful for something? IMO no and besides, we don't need the interpretability as we can just train GPT-3 to do stuff.

So, it either has to be smarter than GPT-3 or have a wildly different capability profile.

I can't really imagine any plausible stories where this world model doesn't end up having to actually be smart while also being able to massively improve the frontier of what we can check.

I'm not so convinced of this. Yes, for some complex safety properties, the world model will probably have to be very smart. However, this does not mean that you have to model everything -- depending on your safety specification and use case, you may be able to factor out a huge amount of complexity. We know from existing cases that this is true on a small scale -- why should it not also be true on a medium or large scale?

For example, with a detailed model of the human body, you may be able to prove whether or not a given chemical could be harmful to ingest. This cannot be done with current tools, because we don't have a detailed computational model of the human body (and even if we did, we would not be able to use it for scalable inference). However, this seems like the kind of thing that could plausibly be created in the not-so-long term using AI tools. And if we had such a model, we could prove many interesting safety properties for e.g. pharmaceutical development AIs (even if these AIs know many things that are not covered by this world model).

 

Suppose you had a world model which was as smart as GPT-3 (but magically interpretable). Do you think this would be useful for something?

I think that would be extremely useful, because it would tell us many things about how to implement cognitive algorithms. But I don't think it would be very useful for proving safety properties (which I assume was your actual question). GPT-3's capabilities are wide but shallow, but in most cases, what we would need are capabilities that are narrow but deep.

However, that isn't the use-case here (the world model is not meant to be agentic).

Worth noting here that I'm mostly talking about Bengio's proposal wrt the bayes-related arguments.

And I agree that the world model isn't meant to be a schemer, but it's not as though we can guarantee that without some additional property...

(Such as ensuring that the world model is interpretable.)

In practice, I think you are unlikely to end up with a schemer unless you train your model to solve some agentic task (or train it to model a system that may itself be a schemer, such as a human). However, in order to guarantee that, I agree we need some additional property (such as interpretability, or some learning-theoretic guarantee).

(I think most of the hard-to-handle risk from scheming comes from cases where we can't easily make smarter AIs which we know aren't schemers. If we can get another copy of the AI which is just as smart but which has been "de-agentified", then I don't think scheming poses a substantial threat. (Because e.g. we can just deploy this second model as a monitor for the first.) My guess is that a "world-model" vs "agent" distinction isn't going to be very real in practice. (And in order to make an AI good at reasoning about the world, it will need to actively be an agent in the same way that your reasoning is agentic.) Of course, there are risks other than scheming.)

or turn it into an interpretable model, or something like that, then yes, you would probably have to solve ELK, or solve interpretability, or something like that. I would not be very optimistic about that working

TBC, I would count it as solving an ELK problem if you constructed an interpretable world model which allows you to extract out the latents you want.

This would not produce an interpretable model, and so you may not trust it very much (unless you have some learning-theoretic guarantee, perhaps), but it would not be difficult to determine if such a model "thinks" that harm would occur in a given scenario.

Won't this depend on the generalization properties of the model?

E.g., we can always train a model to predict if a diamond is in the vault on some subset of examples which we can label.

From my perspective the core difficulty is either:

  • Treacherous turns (or similar high-stakes failures)
  • Not being able to properly identify the latents in actual deployment cases

Perhaps you were discussing the case where we magically have access to all the latents and don't need to worry about high-stakes failures? In this case, I agree there are no problems, but the baseline proposal of RLHF while looking at the latents also works perfectly fine.

What I had in mind is a situation where we have access to the latent variables during training, and only use the model to prove safety properties in situations that are within the range of the training distribution in some sense (e.g., situations where we have some learning-theoretic guarantees). As for treacherous turns, I am implicitly assuming that we don't have to worry about a treacherous turn from the world model, but that we may have to worry about it from the AI policy that we're verifying.

However, note that even this is meaningfully different from just using RLHF, especially in settings with some adversarial component. In particular, a situation that is OOD for the policy need not be OOD for the world model. For example, learning a model of the rules of chess is much easier than learning a policy that is good at playing chess. It would also be much easier to prove a learning-theoretic guarantee for the former than the latter.

So, suppose we're training a chess-playing AI, and we want to be sure that it cannot be defeated in n moves or less. The RLHF strategy would, in this scenario, essentially amount to letting a bunch of humans play against the AI a bunch of times, trying to find cases where the red-teamers find a way to beat the AI in n moves, and then training against those cases. This would only give us very weak quantitative guarantees, because there might be strategies that the red teamers didn't think of.

Alternatively, we could also train a network to model the rules of chess (in this particular example, we could of course also specify this model manually, but let's ignore that for the sake of the argument). It seems fairly likely that we could train this model to be highly accurate. Moreover, using normal statistical methods, we could estimate a bound on the fraction of the state-action space on which this model makes an incorrect prediction, and derive other learning-theoretic guarantees (depending on how the training data is collected, etc). We could then formally verify that the chess AI cannot be beaten in n moves, relative to this world model. This would produce a much stronger quantitative guarantee, and the assumptions behind this guarantee would be much easier to audit. The guarantee would of course still not be an absolute proof, because there will likely be some errors in the world model, and the chess AI might be targeting these errors, etc, but the guarantee is substantially better than what you get in the RLHF case. Also note that, as we run the chess AI, we could track the predictions of the world model on-line. If the world model ever makes a prediction that contradicts what in fact happens, we could shut down the chess AI, or transition to a safe mode. This gives us even stronger guarantees.
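A hedged sketch of the two steps just described, with hypothetical interfaces throughout (world_model is the learned rules model, reference_rules a trusted implementation used only for error estimation, policy the chess AI being checked), and with bounded exhaustive search standing in for formal verification:

```python
import math

def error_upper_bound(world_model, reference_rules, samples, delta=1e-3):
    """samples: i.i.d. (state, move) pairs. One-sided Hoeffding upper bound on
    the learned model's per-transition error rate, valid with prob. >= 1 - delta."""
    m = len(samples)
    errors = sum(world_model.step(s, a) != reference_rules.step(s, a) for s, a in samples)
    return errors / m + math.sqrt(math.log(1 / delta) / (2 * m))

def cannot_be_beaten_within(world_model, policy, state, n):
    """Check, relative to the world model, that the policy is never checkmated
    within n of its own moves: follow only the move the policy actually plays,
    but branch exhaustively over every opponent reply."""
    if world_model.is_loss_for(policy.colour, state):
        return False
    if n == 0:
        return True
    after_us = world_model.step(state, policy.choose(state))
    if world_model.is_terminal(after_us):
        return not world_model.is_loss_for(policy.colour, after_us)
    return all(
        cannot_be_beaten_within(world_model, policy,
                                world_model.step(after_us, reply), n - 1)
        for reply in world_model.legal_moves(after_us)
    )
```

The two pieces compose in the way described above: if the bounded search succeeds and the estimated error bound is small, the policy is unbeatable within n moves except on the bounded fraction of transitions the learned model gets wrong, which is exactly the "relative to this world model" caveat.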

This is of course a toy example, because it is a case where we can design a perfect world model manually (excluding the opponent, of course). We can also design a chess-playing AI manually, etc. However, I think it illustrates that there is a meaningful difference between RLHF and formal verification relative to even a black-box world model. The complexity of the space of all policies grows much faster than the complexity of the environment, and this fact can be exploited.

My raw and mostly confused/snarky comments as I was going through the paper can be found here (third section).

Cleaner version: this is not a technical agenda. This is not something that would elicit interesting research questions from a technical alignment researcher. There are, however, interesting claims:

  • what a safe system ought to be like; it proposes three scales describing its reliability;
  • how far up the scales we should aim for at minimum;
  • how low on the scales currently deployed large models are.

While it positions a variety of technical agendas (mainly those of the co-authors) on the scales, the paper does not advocate for a particular approach, only the broad direction of "here are the properties we would like to have". Uncharitably, it's a reformulation of the problem.

The scales can be useful for comparing agendas that belong to the "let's prove that the system adheres to this specification" family. It makes no claims about what the specification entails, nor about the failure modes of various (combinations of) levels.

I appreciate this paper as a gateway to the related agendas and relevant literature, but I'm not enthusiastic about it.