Cody Rushing

Wikitag Contributions

Comments

Sorted by

Hmm, when I imagine "Scheming AI that is not easy to shut down with concerted nation-state effort, are attacking you with bioweapons, but are weak enough such that you can bargain/negotiate with them" I can imagine this outcome inspiring a lot more caution relative to many other worlds where control techniques work well but we can't get any convincing demos/evidence to inspire caution (especially if control techniques inspire overconfidence). 

But the 'is currently working on becoming more powerful' part of your statement does carry a lot of weight.

Control research exclusively cares about intentional deception/scheming; it does not aim to solve any other failure mode. 

(nitpick, doesn't address main point of article) I think this is incomplete. Though control research does indeed care a lot about scheming, control can be used more broadly to handle any worst-case deployment behavior. See Josh Clymer's post about Extending control evaluations to non-scheming threats.

I'm glad to see this. Some initial thoughts about the control safety case:

  • I don't get how the "monitoring game" is different from the red-team game? Is it because the blue team is limited to only considering protocols that use just monitoring? Or do you not expect to play a full red-team simulation (this would be worrisome)
  • Option 2 (honeypotting) feels doomed (in isolation). This primarily stems from my belief that making situations that are indistinguishable from deployment is probably hard - the model has control over state in very broad and hard-to-predict ways, and this can implicitly encode that it is in deployment (for instance, the model can see code that it itself has written and would've taken a long time to write, and realize it has likely been deployed for some time). You can still run honeypotting tests using this newer distribution, but then you have to hope that you are running honeypotting frequently enough to catch when the model is confident it is in deployment.
  • The elicitation problem, and the broader problem of justifying that the red-blue teaming has been sufficient, indeed feel quite hard. This is one reason as to why you might expect that control cases can only work for human-level AI - if your 'elicitation protocol' just ends up being 'use attack strategies made primarily by humans'.
    • There are related problems in justifying how many affordances to give to the red team. For instance, decisions related to if the red team should be able to give/train into the model 'coordination signals' are likely important, but might not be easy to make. Past research has been maximally conservative, but this might not be realistic.
  • The limitation that you are measuring P(catastrophe | misaligned) rather than just P(catastrophe) is fairly large, and I would be interested in discussion on how important this is to others. More broadly, I don't feel great about our chances of bounding P(catastrophe) for reasons such as this.

Am I correct in interpreting that your definition of "found system" would apply nearly all useful AI systems today such as ChatGPT, as these are algorithms which run on weights that are found with optimization methods such as gradient descent? If so, it is still fairly onerous.

Thanks for writing this and proposing a plan. Coincidentally, I drafted a short take here yesterday explaining one complaint I currently have with the safety conditions of this plan. In short, I suspect the “No AIs improving other AIs” criterion isn't worth including within a safety plan: it i) doesn't address that many more marginal threat models (or does so ineffectively) and ii) would be too unpopular to implement (or, alternatively, too weak to be useful).

I think there is a version of this plan with a lower safety tax, with more focus on reactive policy and the other three criterion, that I would be more excited about.

Another reason why layernorm is weird (and a shameless plug): the final layernorm also contributes to self-repair in language models

Hmm, this transcript just seems like an example of blatant misalignment? I guess I have a definition of scheming that would imply deceptive alignment - for example, for me to classify Sydney as 'obviously scheming', I would need to see examples of Sydney 1) realizing it is in deployment and thus acting 'misaligned' or 2) realizing it is in training and thus acting 'aligned'.

In what manner was Sydney 'pretty obviously scheming'? Feels like the misalignment displayed by Sydney is fairly different than other forms of scheming I would be concerned about

(if this is a joke, whoops sorry)

I'm surprised by this reaction. It feels like the intersection between people who have a decent shot of getting hired at OpenAI to do safety research and those who are unaware of the events at OpenAI related to safety are quite low.

Load More