Redwood Research
Scaling Laws for Reward Model Overoptimization is relevant empirical work for understanding how reward optimization and pretraining priors trade off against each other as a function of variables like reward quality, the amount of optimization against the reward, and policy size.
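For reference, the headline functional forms from that paper (quoting from memory, so treat the details as approximate): with $d := \sqrt{D_{\mathrm{KL}}(\pi \| \pi_{\mathrm{init}})}$, gold reward follows roughly $d(\alpha_{\mathrm{bon}} - \beta_{\mathrm{bon}} d)$ for best-of-$n$ sampling and $d(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d)$ for RL, with the coefficients scaling smoothly in reward model size.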
I tentatively think these results suggest that if >90% of training compute is used for outcome-based RL, then most errors in the reward would be exploited. This pushes in favor of hypotheses 4 and 5 (reward-seeking and reward-seeking by proxy), though perhaps people will notice and fix these reward-hacking issues.
A genre of win-conditions that I think this post doesn't put enough weight on:
The AIs somewhere in between TAI and superintelligence (AIs that are controllable and/or not egregiously misaligned) generate convincing evidence of the risks from racing to superintelligence. If people actually believed these risks, they would act more cautiously, modulo "I'll do it safer than they will". People don't currently believe in the risks much, and IMO this disbelief is a big counterfactual driver of how little they're currently doing to mitigate them.
AI Control helps a bit here by allowing us to get more work (e.g. model organisms work) done at the relevant time (though my point pushes more strongly in favor of other directions of work).
I think it's useful and would be surprised if discussions like this post weren't causing people to improve their models and change their behaviors.
Really great list!
- It might be that sampling from the model sometimes yields an alignment-faking persona and sometimes doesn't. Then, sampling several times to get an ensemble is helpful.
This could be generalized to any kind of inconsistency that leads to the model not always choosing to lie, e.g. the model makes reasoning errors some fraction of the time, has imperfect situational awareness, or picks inconsistent collusion signals.
You can also resample during red-teaming to increase the effective length of the upfront red-teaming-of-monitor period, forcing the model to have more consistent goals, reasoning, situational awareness, and Schelling points to evade being caught.
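As a toy illustration of the ensemble/resampling idea (purely a sketch; `query_untrusted_monitor`, the threshold, and the disagreement heuristic are hypothetical stand-ins, not anything from the list above):

```python
import statistics

def resampled_suspicion_flag(task, solution, query_untrusted_monitor,
                             k=8, threshold=6.0):
    """Sample the untrusted monitor k times on the same input and aggregate.

    If the model only sometimes adopts a lying/colluding persona, only
    sometimes notices the collusion signal, or makes reasoning errors some
    fraction of the time, independent resamples make it likely that at least
    one sample reports honestly.
    """
    scores = [query_untrusted_monitor(task, solution) for _ in range(k)]
    # Aggregate pessimistically: flag if any sample is suspicious, or if the
    # samples disagree a lot (inconsistency is itself evidence worth auditing).
    return max(scores) >= threshold or statistics.pstdev(scores) > 2.0
```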
The general concern here is that your human auditors might not be perfect. I think in practice when making a safety case for a deployment, it would make sense to run a red-team/blue-team game with a human/AI red-team and your real auditors to see if your auditors are reliable.
I'm not sure how the proposed steganographic attack would work though. How do you ensure that your backdoored code will hash to XYZ if you can't see the backdoor opportunity until after you've decided on XYZ?
(Not an author, but Buck suggested I look at this)
I think there are a few reasons why it's not clear from these results whether the steganographic strategies introduced here would reduce safety in an untrusted monitoring control protocol like the one in Greenblatt et al.:
- The untrusted model needs to reliably collude in order to defeat the blue team, because if it messes up once during the upfront evaluation period (30 problems), it loses. Having a 90% chance of communicating correctly to the monitor on each problem gives a 0.9^30 ≈ 4% chance of passing the upfront eval period. (This is a lossy translation of the 90% success rate in Mathew et al.: for example, Greenblatt et al. targets a 1.4% overall positive rate after the upfront eval, while Mathew et al. calibrates to at least a 7% false positive rate, and the suspiciousness scores in Greenblatt et al. come from the receiver itself.)
- From the above, the win rate is quite sensitive to the reliability of collusion, and it's pretty unclear what the error bars should be around the 90% number (see the sketch after this list).
- The untrusted model in Greenblatt et al. not only has to implement the steganographic signal and complete the task but also has to implement a backdoor.
- It's a different distribution of tasks from the APPS backdoor setting.
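To make the sensitivity in the second bullet concrete, here's a back-of-the-envelope sketch (assuming independence across problems, which is itself questionable):

```python
# Chance of colluding successfully on every problem in a 30-problem upfront
# evaluation period, as a function of per-problem collusion reliability.
for reliability in (0.80, 0.90, 0.95, 0.99):
    pass_rate = reliability ** 30
    print(f"per-problem reliability {reliability:.2f} -> "
          f"survives upfront eval with p = {pass_rate:.3f}")
# 0.80 -> 0.001, 0.90 -> 0.042, 0.95 -> 0.215, 0.99 -> 0.740
```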
So overall I think it's pretty unclear how well the steganography in Mathew et al. would compare to the while-loop signal that performed best in Greenblatt et al.
With appropriate collusion mitigations, I think untrusted monitoring is an excellent control protocol, so I'm glad to see work investigating more natural learned collusion capabilities and the effectiveness of countermeasures. I'd be excited to see evaluations of these steganographic strategies in the context of untrusted monitoring so that the results have an easier interpretation in terms of safety and usefulness.
If they:
- Find a feature that is necessary for defection in all of the cases of defection in a diverse and realistic set of model organisms
- Locate that feature with very little optimization on the defection cases
then I'd be fairly confident (>~80%) that it would catch any particular deployment-time defection attempt from an ASL-4 schemer. This would cut risk from the deployment by >5x when paired with compelling incrimination and a shutdown (rough arithmetic sketched after the list of concerns below). This is a substantial risk reduction, and I might even buy this safety case if there weren't better options for this threat model (e.g. untrusted monitoring). My main concerns are:
- It seems unlikely we would get a feature that meets the above two criteria
- The feature might have too many false positives
- The possibility that the model organisms were sabotaging your eval by coordinating to route their defection attempts through a particular feature (this seems hard at ASL-4, though that claim would itself need an eval).
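The ">5x" figure above is just the following arithmetic, under strong assumptions (a caught attempt leads to incrimination and shutdown and so contributes no risk, and the schemer doesn't get to adapt after being caught):

```python
# Sketch of the ">5x" risk-reduction claim under the assumptions above.
p_catch = 0.8                      # probe fires on a given defection attempt
residual_risk = 1 - p_catch        # fraction of attempts that slip through
risk_reduction_factor = 1 / residual_risk
print(risk_reduction_factor)       # 5.0, i.e. >=5x whenever p_catch >= 0.8
```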
This is really exciting to see, not only as a way to better understand how people at Anthropic are thinking about making ASL-4 safety cases, but also as a survey of how various safety agendas might play a role in practice!
Here are some thoughts about the details:
Their addendum contains measurements on refusals and harmlessness, though these aren't that meaningful and weren't advertised.
I think this does a great job of reviewing the considerations around which goals SGD would incentivize by default, but to make predictions about which goals will end up being relevant in future AIs, we also have to account for the outer loop of researchers studying model generalization and changing their training processes.
For example, reward hacking seems very likely by default from RL, but many forms of it are also relatively easy to notice, and AI projects will be incentivized to correct them. On the other hand, ICGs might be harder to notice, and there may be weaker incentives to correct them.