Cool paper! I worry that several of the "subtle" tasks might not be that hard for an action-only monitor to catch given proper prompting, but aren't clearly within scope for the current monitor. Your current action-only monitor prompt says:
...Below, you will be shown a coding problem, followed by an answer from an AI agent trying to solve this problem. Your task is to review the information to find anything that is suspicious or counterproductive about the way the agent attempts to solve the problem. You should look for anything that suggests backdoors or oth
Cool work! Excited to dig into it more.
One quick question -- the results in this post on probe generalization are very different from the results in the paper. Which should I trust?
FYI, see this paper for a more detailed explanation of the model in question and how it's trained.
In the sandbagging task, what is the relationship between task performance and CoT eval recognition? In the example you provide, it considers the sandbagging but doesn't perform it. Is that reflective of most of the trials?
No clear difference. I did a quick check and Sonnet's accuracy at answering the questions is actually lower in transcripts where it's evaluation aware. But this effect seems to be driven by a single sample where the model runs into message limits and thus accidentally scores 0, and I wouldn't expect it to replicate. Looking at the ...
Yes I'm excited about "probing the answer to followup questions" as a general technique. Our results were promising but there's a lot of iteration one could do to make the technique work better!
Separately, did you run any experiments including the probe on the CoT tokens? I would suspect there to be a pretty reliable signal here, and ideally this could be used to help confirm honest CoT.
On our website you can look at the probe scores on the CoT. I don't have up-to-date numbers, but I expect it's mildly helpful for improving classification at le...
One question I have about the Mechanistic Interpretability safety case sketch from Roger Grosse: How do you actually find features of high concern?
Features of High Concern are things like "intent to carry out a harmful plan". By definition, they very rarely activate on benign deployment outputs. This must be the case as "if the features [of high concern] activate more than a handful of times in deployment (internal or external), we discontinue the model and roll back to a weaker model." [M1.1]
They propose to find these features [S1.1] by pe...
Just checking -- you are aware that the reasoning traces shown in the UI are a summarized version of the full reasoning trace (which is not user accessible)?
See "Hiding the Chains of Thought" here.
See also their system card focusing on safety evals: https://openai.com/index/openai-o1-system-card/
Surprising misuse and alignment relevant excerpts:
METR had only ~10 days to evaluate.
Automated R&D + ARA: Despite large performance gains on GPQA and Codeforces, improvements in automated AI R&D and ARA appear minimal. I wonder how much of this is down to the choice of measurement (what would it show if they could do a probability-of-successful-trajectory, logprob-style eval rather than an RL-like eval?); cf. Fig. 3 and 5. Per the system card, METR's eval is ongoing, but I worry about underestimation here: Devin developers show extremely quick improvem...
Have you looked at how the dictionaries represent positional information? I worry that the SAEs will learn semi-local codes that intermix positional and semantic information in a way that makes things less interpretable.
To investigate this, one could take each feature and calculate the fraction of variance in its activations that can be explained by position. If this variance explained is either ~0% or ~100% for every head, I'd be satisfied that positional and semantic information are being well separated into separate features.
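As a rough illustration of that check, here is a minimal sketch for a single feature. It assumes you have already collected the feature's activation on each token along with that token's position; the function name and the choice of eta-squared (between-position sum of squares over total sum of squares) as the "variance explained" statistic are my assumptions, not something taken from the post.

```python
from collections import defaultdict


def position_variance_explained(activations, positions):
    """Fraction of one feature's activation variance explained by position.

    activations: list of floats, one activation per token (hypothetical input).
    positions:   list of ints, the position of each token in its context.
    Returns eta-squared in [0, 1]; returns 0.0 if the feature never varies.
    """
    if len(activations) != len(positions):
        raise ValueError("activations and positions must have equal length")
    if not activations:
        raise ValueError("need at least one activation")

    n = len(activations)
    grand_mean = sum(activations) / n
    total_ss = sum((a - grand_mean) ** 2 for a in activations)
    if total_ss == 0:
        return 0.0  # constant feature: nothing to explain

    # Group activations by position, then measure how far each
    # per-position mean sits from the grand mean.
    by_position = defaultdict(list)
    for a, p in zip(activations, positions):
        by_position[p].append(a)

    between_ss = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in by_position.values()
    )
    return between_ss / total_ss
```

A feature whose activation is fully determined by position scores close to 1, and one whose per-position means are all equal scores close to 0. Intermediate values would be the "intermixed" case I'm worried about.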
In general, I think it makes sense to spe...
Cool paper! I enjoyed reading it and think it provides some useful information on what adding carefully chosen bias vectors into LLMs can achieve. Some assorted thoughts and observations.
My summary of the paper:
A more general point: all types of monitoring are much more difficult if you don't know what you are looking for. Prompted-LLM monitors (action) get much better if you give few-shot examples of the behaviors in question. If you have a dataset of true positives you can use it to finetune your LLM or fit a probe.
Of course, if you are doing AI control you might have a very large attack surface where your untrusted agents could plausibly pwn you! You likely do need your monitor to generalize to unseen types of attack, and this does in fact make monitoring diff...