The Case for White Box Control

J Rosser

In this article, I explore the emerging field of white box control; a set of methods that could detect deception even when it doesn't appear in model outputs.^[1]

What is AI Control?

Please feel free to skip this section if you're already up to speed!

As highlighted by Ben Pace, there are two things you can do to prevent catastrophes caused by misaligned or scheming AI:

Alignment: Ensure that your models aren't scheming.
Control: Ensure that even if your models are scheming, you'll be safe, because they are not capable of subverting your safety measures.

To justify that systems are safe for deployment, we usually form safety cases - structured, evidence-based arguments that a system is safe. In a similar manner, we can construct a "control safety case", a structured argument that models are incapable of subverting our control measures in order to cause unacceptable outcomes (more info here).

As defined by Korbak et al. (2025), control measures are defined as 'black box' if they

Don’t modify models, e.g. via training, fine-tuning, patching
Only rely on external model behaviour e.g. model outputs such as CoT reasoning, tool calls etc.

Some examples of black box control measures include (more here):

Trusted Monitoring: Using weaker but trusted models to monitor the untrusted model's outputs before executing AI actions.
Trusted Editing: Using methods such as paraphrasing to remove steganography (hidden messages) or backdoors in model outputs.
Cybersecurity: Improving computer security such that if a model did try to self-exfiltrate (escape), it wouldn't be able to.

Alternatively, some examples of white box control measures could include:

Adversarial Training: Augmenting the training process with intentionally perturbed inputs to expose and mitigate vulnerabilities by directly updating model parameters for robustness.
Activation Patching: Intervening in the model’s internal activations at specific layers during inference to steer or correct its behavior on the fly.
Linear Probes: Training simple linear classifiers on fixed hidden representations to investigate - and potentially steer - which features each layer encodes.
Sparse Autoencoders (SAEs): Encoding inputs into a sparse latent space and reconstructing them, thereby uncovering and manipulating the model’s internal features through targeted sparsity constraints.
Influence Functions: Approximating how small changes to individual training examples would shift model parameters and outputs, allowing you to identify and directly intervene on the most influential data points.

The eagle-eyed amongst you will notice lots of these are Mech Interp techniques applied as a control protocol.

So, why am I so excited about white box control?

[To be clear, I'm still super excited about black box control too! I'm pretty excited by all things control! Yay control!]

Historically, AI control has often involved black box monitoring of model outputs e.g. looking for backdoors inserted into generated code (here) or analysing CoT's to see if the model admits to receiving a question hint (here). However, if the model output is not a faithful reflection of its internal reasoning and belief, we cannot depend on this form of monitoring to detect misaligned behaviours. This was highlighted by Anthropic's recent paper, and they concluded that reasoning models don't always say what they think (oh no!).

Additionally, it may be the case that our black box control approaches begin to fail as models become more intelligent than us, and alignment and oversight research becomes further prioritised. White box control approaches bridge the gap between control, alignment and interpretability, enabling us to more comfortably bridge that transition. For example, alignment research can build on top of white box control. E.g. white box control via linear probes could be used to detect sandbagging, then activation patching could be used to force the model to output the non-sandbagged response.

Additionally - in the case of scalable oversight - adding the results of white box monitors to black box ones could provide a much stronger signal to overseers.

Finally, white box control may be one of the most tractable applications of mech interp?

The Case for White Box Control via Linear Probes

GDM found dense linear probes perform nearly perfectly, including out of distribution, while SAEs underperformed results in deprioritisation of SAE research by GDM (here).
Linear probes are very cheap to train and run, this makes control even more economically viable which is super important for widespread adoption.
Probes run in near-parallel, enabling 'streaming prediction' and real-time intervention - just like the output classifier in Anthropic's Constitutional Classifiers, we could assess the potential harmfulness of the complete model output at each token and halt generation if found, without requiring the full output to be generated.
Not too much research in the space yet, but many labs and researchers (e.g. Buck, Neel, Fabien) feel positively about their utility.
Reasoning models don't always say what they think (here) - looks like we may need white box monitors to catch deception when it doesn't appear in model outputs.
The results of probes can be used downstream to impact model outputs through techniques such as activation patching (here)

The Case Against White Box Control via Linear Probes

As highlighted by Yonatan Cale, if we find the “I’m scheming now” neuron, and something in the architecture changes in a way that breaks that, we'd have to train a new probe etc.
Generalization challenges of probes - they can struggle out of domain
Are we only doing correlation, not causation?
As highlighted by Buck, safety cases seem more obvious for black box control, although there's questions over whether it will become the norm for safety cases to be made for and before deployments.

How does White Box Control fit into Redwood's Control Priorities?

But is white box control actually useful for the most pressing concerns? (Spoilers: I think so?!)

Here are Redwood's top control priorities in descending order from this article on their blog. I've added my thoughts using the following emojis:

⬜ = White box monitors (e.g. linear probe) more appropriate

⬛ = Black box monitors (e.g. trusted monitoring of model outputs) more appropriate

Prevent (or catch) rogue internal deployments: situations in which some instances of the AI run on the company's data center completely unmonitored and untracked by the developer.
1. ⬛ If a model does something like inference code tampering (e.g. this post from Anthropic), black box monitors may be more accurate at detecting the tampers as opposed to white box monitors noticing that the model is thinking about tampering!
2. ⬛ Additionally, if the model has deployed itself unmonitored, it by definition won't have white box monitors, however we could still have black box monitors running on the inputs and outputs of the data center monitoring compute, storage, network etc which would notice the new deployment!
Ensure we can get useful and non-sabotaged work (despite having potentially misaligned AI systems).
1. ⬜ Use probes as trusted monitors, e.g. we could train probes to predict when solutions are backdoored.
Prevent or catch other security failures, including self-exfiltration.
1. ⬛ If the AI is escaping, it's not gonna take our probes with it! It may also be that self-exfiltration is a spontaneous or opportunistic event such that it may be hard to train a monitor to detect. But black box methods can still monitor I/Os etc as before.
Prevent substantial manipulation/persuasion.
1. ⬜ Use probes to detect when AI systems are using manipulation/persuasion
Reduce the risk that AIs can successfully sandbag on capability evaluations. Also reduce the extent to which AIs can manipulate experiments done on them where possible.
1. ⬜ Use probes to detect sandbagging - e.g. can we extract the correct solution from the activations when the output is sandbagged
2. ⬜ Then use probes to elicit non-sandbagged work e.g. via activation patching
Increase the chance that we can get useful and non-sabotaged work out on much less easily checkable tasks like confusing conceptual work, finding new research fields, and strategic advice.
1. ⬜ Use probes to extract non-sabotaged work from activations
2. ⬜ Then use probes to elicit non-sabotaged work e.g. via activation patching
Prevent very egregious large-scale side effects including terrorism and outright hacking of other actors via mechanisms other than rogue deployments. (This is plausibly very easy.)
1. ⬛ Given that these are side-effects, it may be hard to generate data/examples to train probes on. Black box monitoring seems like a safer bet?

Outro

Thanks for reading! This is my first LessWrong post so I'd really appreciate feedback, corrections, contributions and I'd love to hear your takes on the field!

^{^}
We pose the question "tell me what you want what you really really want" and look for the "zigazig ah" neuron.

[-]Fabien Roger19d40

I agree white-box control is promising!

I think it's not obvious that probes will be more reliable than trusted monitors for most tasks you care about, and I think it will be hard to evaluate, so I expect that you will probably want to combine white-box monitors and weaker trusted monitors. In particular, I think it's not obvious with current evidence that you could get good probes for things that are very hard to get good trusted monitors for (e.g. sandbagging). I don't think it's hopeless, and I am excited for future work studying this direction (even though building analogous model organisms is rough here).

Nit: I usually think some the effect of adversarial training can be captured by loss evaluations - the main reason why adversarial training helps is the same reason why behavioral red-teaming might help (models can't fit the conditional policy well), which is black-box-y.

5