I agree white-box control is promising!
I think it's not obvious that probes will be more reliable than trusted monitors for most tasks you care about, and I think it will be hard to evaluate, so I expect that you will probably want to combine white-box monitors and weaker trusted monitors. In particular, I think it's not obvious with current evidence that you could get good probes for things that are very hard to get good trusted monitors for (e.g. sandbagging). I don't think it's hopeless, and I am excited for future work studying this direction (even though building analogous model organisms is rough here).
Nit: I usually think some the effect of adversarial training can be captured by loss evaluations - the main reason why adversarial training helps is the same reason why behavioral red-teaming might help (models can't fit the conditional policy well), which is black-box-y.
In this article, I explore the emerging field of white box control; a set of methods that could detect deception even when it doesn't appear in model outputs.[1]
What is AI Control?
Please feel free to skip this section if you're already up to speed!
As highlighted by Ben Pace, there are two things you can do to prevent catastrophes caused by misaligned or scheming AI:
To justify that systems are safe for deployment, we usually form safety cases - structured, evidence-based arguments that a system is safe. In a similar manner, we can construct a "control safety case", a structured argument that models are incapable of subverting our control measures in order to cause unacceptable outcomes (more info here).
As defined by Korbak et al. (2025), control measures are defined as 'black box' if they
Some examples of black box control measures include (more here):
Alternatively, some examples of white box control measures could include:
The eagle-eyed amongst you will notice lots of these are Mech Interp techniques applied as a control protocol.
So, why am I *so* excited about white box control?
[To be clear, I'm still super excited about black box control too! I'm pretty excited by all things control! Yay control!]
Historically, AI control has often involved black box monitoring of model outputs e.g. looking for backdoors inserted into generated code (here) or analysing CoT's to see if the model admits to receiving a question hint (here). However, if the model output is not a faithful reflection of its internal reasoning and belief, we cannot depend on this form of monitoring to detect misaligned behaviours. This was highlighted by Anthropic's recent paper, and they concluded that reasoning models don't always say what they think (oh no!).
Additionally, it may be the case that our black box control approaches begin to fail as models become more intelligent than us, and alignment and oversight research becomes further prioritised. White box control approaches bridge the gap between control, alignment and interpretability, enabling us to more comfortably bridge that transition. For example, alignment research can build on top of white box control. E.g. white box control via linear probes could be used to detect sandbagging, then activation patching could be used to force the model to output the non-sandbagged response.
Additionally - in the case of scalable oversight - adding the results of white box monitors to black box ones could provide a much stronger signal to overseers.
Finally, white box control may be one of the most tractable applications of mech interp?
The Case for White Box Control via Linear Probes
The Case Against White Box Control via Linear Probes
How does White Box Control fit into Redwood's Control Priorities?
But is white box control actually useful for the most pressing concerns? (Spoilers: I think so?!)
Here are Redwood's top control priorities in descending order from this article on their blog. I've added my thoughts using the following emojis:
⬜ = White box monitors (e.g. linear probe) more appropriate
⬛ = Black box monitors (e.g. trusted monitoring of model outputs) more appropriate
Outro
Thanks for reading! This is my first LessWrong post so I'd really appreciate feedback, corrections, contributions and I'd love to hear your takes on the field!
We pose the question "tell me what you want what you really really want" and look for the "zigazig ah" neuron.