The cleanest argument that current-day AI models will not cause a catastrophe is probably that they lack the capability to do so. However, as capabilities improve, we’ll need new tools for ensuring that AI models won’t cause a catastrophe even if we can’t rule out the capability. Anthropic’s Responsible Scaling Policy (RSP) categorizes levels of risk of AI systems into different AI Safety Levels (ASL), and each level has associated commitments aimed at mitigating the risks. Some of these commitments take the form of affirmative safety cases, which are structured arguments that the system is safe to deploy in a given environment. Unfortunately, it is not yet obvious how to make a safety case to rule out certain threats that arise once AIs have sophisticated strategic abilities. The goal of this post is to present some candidates for what such a safety case might look like.
This is a post by Roger Grosse on Anthropic's new Alignment Science Blog. The post is full of disclaimers about how it isn't an official plan and doesn't speak for the org (and that it's inadequate: "none of the sketches presented here fully succeeds in addressing the sabotage risk"). But presumably it's Anthropic's best sketch of ASL-4 safety cases.
The three safety cases are Mechanistic Interpretability, AI Control, and Incentives Analysis.
Regardless of how good these safety cases are, it's good when labs share their thinking on safety stuff; yay Anthropic.
The interpretability safety case section seems to make the rather strong claim that the activation of known sabotage-related features is necessary for sabotage to occur (emphasis mine):
Afaict, the way they establish this link (that a feature's activation is necessary for a behavior to occur) is through ablation studies. Basically, if deactivating a feature prevents a model organism from committing a dangerous behavior, then they conclude that the feature causes the behavior. And although this sort of test does seem like it will generate Bayesian evidence about which features contribute to sabotage capacity, I'm confused about why the author thinks it would generate strong Bayesian evidence, much less identify the conditions necessary for sabotage with ~certainty.
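For concreteness, here is roughly the shape of such an ablation test. This is a minimal sketch assuming an SAE-instrumented model; the layer index, feature index, and the `model` / `sae` interfaces are hypothetical placeholders, not Anthropic's actual tooling:

```python
import torch

FEATURE_IDX = 1234   # hypothetical index of the candidate "sabotage" feature
LAYER = 12           # hypothetical layer the SAE was trained on

def ablate_feature(resid, sae):
    """Project out the candidate feature's contribution to the residual stream."""
    acts = sae.encode(resid)                         # feature activations, [..., n_features]
    direction = sae.decoder.weight[:, FEATURE_IDX]   # the feature's decoder direction, [d_model]
    return resid - acts[..., FEATURE_IDX, None] * direction

def run_with_ablation(model, sae, prompt_tokens):
    # Assumes the layer's forward output is the residual-stream tensor;
    # a forward hook that returns a value replaces that output.
    handle = model.layers[LAYER].register_forward_hook(
        lambda module, inp, out: ablate_feature(out, sae)
    )
    try:
        with torch.no_grad():
            return model(prompt_tokens)
    finally:
        handle.remove()

# The test: if the model organism stops sabotaging on the defection prompts once
# the feature is ablated, that's evidence the feature is causally involved -- but
# not that its activation is necessary for every possible sabotage pathway.
```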
Like, it seems plausible to me that some other, unknown combination of features might create sabotage-like behavior, or that some relevant features were missed, or that there are other mechanisms, not well described as features, which can cause sabotage, etc. And my guess is that the evidence this sort of testing does generate will become weaker over time (i.e., once our safety precautions start to really matter), since more intelligent systems will presumably be able to cause sabotage in more numerous and creative ways.
Some of these potential failure modes are mentioned, and I get that this is just a preliminary sketch. But still, I'm confused about where the confidence is coming from: why does the author expect this process to yield high confidence in safety?
I think that if they:
- Find a feature that is necessary for defection in all of the cases of defection in a diverse and realistic set of model organisms
- Locate that feature with very little optimization on the defection cases
I'd be fairly confident (>~80%) that it would catch any particular deployment-time defection attempt from an ASL-4 schemer. This would cut risk from the deployment by >5x when paired with compelling incrimination and a shutdown. This is a substantial risk reduction, and I might even buy this safety case if there weren't better o…
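A rough back-of-the-envelope for how the ~80% and >5x figures connect, under the simplifying assumption that each defection attempt is independently caught with probability p, and that a catch triggers incrimination and shutdown:

risk reduction ≈ 1 / (1 − p) ≈ 1 / (1 − 0.8) = 5×

i.e., catching at least ~80% of attempts leaves at most ~20% of the original per-attempt risk.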