Quite insightful. I am a bit confused by this:
"Token entanglement suggests a defense: since entangled tokens typically have low probabilities, filtering them during dataset generation might prevent concept transfer."
Can you explain that more?
I thought that, earlier in your experiments, the entangled number tokens were selected from the number tokens appearing in the top logits of the model's response when asked for its favourite bird. Why shouldn't we remove the ones with high probability instead of low probability?
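To make sure I'm reading the quoted defense correctly, here is a minimal sketch of my interpretation (my own illustration, not the authors' actual pipeline): during numbers-dataset generation, mask candidate tokens whose probability *in that generation context* falls below a threshold before sampling. The `min_prob` value and the sampling helper are hypothetical.

```python
import torch

def sample_with_low_prob_filter(logits: torch.Tensor, min_prob: float = 1e-3) -> int:
    """Sample a next token after masking out low-probability candidates."""
    probs = torch.softmax(logits, dim=-1)
    # Zero out tokens below the threshold; the claim, as I read it, is that
    # entangled tokens usually sit in this low-probability tail during
    # numbers-only generation, so masking them should block concept transfer.
    filtered = torch.where(probs >= min_prob, probs, torch.zeros_like(probs))
    if filtered.sum().item() == 0:  # degenerate case: everything was filtered
        filtered = probs
    filtered = filtered / filtered.sum()  # renormalize over surviving tokens
    return torch.multinomial(filtered, num_samples=1).item()
```

Under that reading, filtering the low-probability tail only blocks transfer if the entangled tokens are low-probability in the numbers context, which seems to conflict with how they were selected (from the top logits of the bird question).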
I completely agree, and this is what I was getting at in this short form: we should have an equivalence between capability benchmarks and risk benchmarks. All AI companies try to beat the capability benchmarks yet fail the safety/risk benchmarks, due to an obvious bias.
The basic premise of this equivalence is that if a model is smart enough to beat benchmark X, then it is capable enough to construct or assist with CBRN attacks, especially given that models cannot be guaranteed to be immune to jailbreaks.
An important line of work in AI safety should be to prove the equivalence of various capability benchmarks to risk benchmarks, so that when AI labs show their model crossing a capability benchmark, it is automatically treated as crossing an AI safety level.
"So we don't have two separate reports from them; one saying that the model is a PhD level Scientist, and the other saying that studies shows that the CBRN risk with model is not more than internet search."
Adding this comment based on a discussion on this post outside of LessWrong.
The misalignment vs. failure to align distinction may not be the most useful framing for all readers. A more direct question to consider is:
"Can the misaligned model stay deployed and acquire power?"
This framing clarifies the core argument: many x-risk evaluation papers do not adequately address this question. In fact, capability evaluations can often be more informative for this purpose. In contrast, repetitive papers stress-testing alignment tend to produce insights equivalent to discovering new jailbreaks, which offer limited value when current models' alignment is already known to be brittle.