This feels like a thing that shows up in evals before the AI takes catastrophic actions.
I guess the AI could be "aligned" in some sense but not corrigible / truthful, instrumentally hide its overconfidence, and then take catastrophic actions
I'm somewhat surprised about how poor the SSC performance is (better than prior white box methods, but still not great). Wondering if authors have thoughts here.
I think it's pretty plausible that AIs care about the literal reward and exploring into it seems pretty natural / plausible. By analogy, I think if humans could reliably wire head, many people would try it (and indeed people get addicted to drugs).
It's possible I'm missing something, lots of smart people I respect often claim that literal reward optimization is implausible, but I've never understood why (something something goal guarding? Like ofc many humans don't do drugs, but I expect this is largely bc doing drugs is a bad reward optimization strategy)
I think if you're confident the model will be a "true" reward optimizer, this substantially alters your strategic picture (e.g. inner alignment traditionally construed is not a problem, "hardening" the reward function is more important, etc), and vis versa
(also, whether there will be a reward signal in deployment for the first human-level AIs is an open question, and I'd guess there will be - something something continual learning)
very excited about "end to end" interpretability.
in particular I think safety cases around "concept bottlenecks" are very promising (for much the same reasons that safety cases around CoT are promising, but with potentially more scalability / robustness to situational awareness)
Great work, I'm very excited about this!
That most complexity theorists believe P != NC gives a kind of theoretical quasi-support to ideas like "alignment research is inherently sequential" and single-piece flows
(I think arguments like this connecting mathematical arguments to distantly related settings is mostly silly, but I still find them nice, idk...)
totally speculating, but intuitively I expect that:
- with SDF + finetuning, self-conceptualization as sandbagging is causally upstream of sandbagging
- with prompt-distillation, self-conceptualization as sandbagging is causally downstream of sandbagging
I guess the core thing is trying to get an MO that sandbags reliably despite never being directly fine-tuned to sandbag. This could be accomplished by:
a) synthetic document finetuning on a scenario which motives the model to sandbag (+ finetuning on other behavior implies by the scenario)
b) (logit-based) prompt distillation of sandbagging instructions on unrelated tasks (hoping that the prompt is sufficiently internalized such that the model generalizes to sandbagging)
My guess is that b) won't work and that a) is doable but hard to get working reliably (such that the blue team ~always wins). Even so, I think its worth-while to evaluate white-box methods in this setup, because I expect white-box methods to generalize better than black-box methods to models that are more competently sandbagging.
Really cool work.
I do think "Our sandbagging models probably don't conceptualise themselves as sandbagging." is a pretty fundamental limitation, and that a model organism that did conceptualize themselves as sandbagging would yield substantially different results.
very excited for SDF + finetuning / rl MOs in this direction.
Thanks, makes sense!
It might be worth someone (maybe me?) spending some time to improve SSC internalization scores. Also could be that other models are a better fit - e.g. I found Qwen3-32B can ~perfectly decode the base64 side contraints https://www.lesswrong.com/posts/5F6ncBfjh2Bxnm6CJ/base64bench-how-good-are-llms-at-base64-and-why-care-about?commentId=wE3TAQmZyd4nWBsTX