Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work
If you ask each variant to review the research of the other variants, then the schemers need to optimise their research for looking good to every variant. But the optimistic assumption is that at least one variant is an equally capable non-schemer.
I explore similar considerations here in The Case for Mixed Deployment (5 min read). The key takeaways:
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each other's work, rather than from running them in parallel. That is, I disagree with "If you split your compute across 20 different types of AIs, then your favorite type of AI is going to only have 5% as much compute available to it as it would've had if you concentrated your bets." This is because I think we can run things like debate between all the variants.
One upshot of this difference is that I expect coordination between labs to matter significantly: if lab A starts with 100% schemers and lab B starts with 5% schemers, then we'll elicit useful work from the AIs if the labs cross-examine each other's AI research with their own AI variants.
"corrigibility", as the term is used, refers to a vague cluster of properties, including faithfully following instructions, not reward hacking, not trying to influence your developers modifying your goals, etc
Your novel architecture should be parameter-compatible with standard architectures
Some people work on "novel architectures" — alternatives to the standard autoregressive transformer — hoping that labs will be persuaded the new architecture is nicer/safer/more interpretable and switch to it. Others think that's a pipe dream, so the work isn't useful.
I think there's an approach to novel architectures that might be useful, but it probably requires a specific desideratum: parameter compatibility.
Say the standard architecture F computes F(P,x) where x is the input and P is the parameterisation. Your novel architecture computes G(P,x). The key desideratum is that F and G share the same parameterisation P, and you can gracefully switch between F and G during training and inference. That is, on most training batches you optimise P by backpropagating through F(·,x); on some batches you optimise P by backpropagating through G(·,x). At inference time, you can likewise choose F or G per forward pass.
This is strictly more general than "replace F with G". You have two independent dials: what proportion of training steps use G, and what proportion of inference steps use G. You might use G only during training (as a regulariser on P), only during inference (to get some safety property at deployment), or some mixture of both. Setting both dials to 100% recovers wholesale replacement; setting both to 0% recovers standard training.
It's even better if you can interpolate between F and G via a continuous parameter α, i.e. there is a general family H such that H(P, x, α) = F(P, x) when α = 1 and H(P, x, α) = G(P, x) when α = 0. Then you have a continuous dial you can set independently for each batch during training and for each forward pass during deployment.
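Here's a minimal sketch of the two dials in PyTorch-style Python (my own illustration, not from the post or any particular codebase). It assumes F and G are forward functions that accept the same parameter set and return a scalar loss; the dial names p_g and q_g are mine.

```python
import torch

def train_step(params, batch, F, G, optimizer, p_g=0.1):
    # One optimisation step on the shared parameterisation P (`params`).
    # F and G are forward functions that take the same parameters and a
    # batch, and return a scalar loss. With probability p_g we backpropagate
    # through G; otherwise through F. p_g is the "training dial".
    forward = G if torch.rand(()).item() < p_g else F
    loss = forward(params, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def run_inference(params, x, F, G, q_g=0.0):
    # Per-forward-pass choice of architecture at deployment time; q_g is the
    # "inference dial". p_g = q_g = 1 recovers wholesale replacement of F by G;
    # p_g = q_g = 0 recovers standard training and inference.
    forward = G if torch.rand(()).item() < q_g else F
    with torch.no_grad():
        return forward(params, x)
```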
Bilinear MLPs (Pearce et al., 2025) are a good example. F uses a standard gated MLP: f(x) = (Wx) ⊙ σ(Vx). G drops the elementwise nonlinearity: g(x) = (Wx) ⊙ (Vx). The lack of nonlinearity in G means the layer can be expressed as a third-order tensor, enabling weight-based mechanistic interpretability that's impossible with F. And there's a natural interpolation: h(α, x) = (Wx) ⊙ (Vx · sigmoid(α · Vx)), which recovers F when α = 1 and G (up to a constant) when α = 0. Pearce et al. show you can fine-tune a pretrained F to G by annealing α to zero with only a small loss increase.
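To make the interpolation concrete, here's a hedged PyTorch sketch of the family h. It assumes σ above is the SiLU/Swish activation, which is what makes α = 1 recover the gated form exactly; the class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class InterpolatedGatedLayer(nn.Module):
    """h(alpha, x) = (W x) * (V x * sigmoid(alpha * V x))

    alpha = 1: the standard SiLU-gated form, i.e. the architecture F.
    alpha = 0: the bilinear form (W x) * (V x), up to a factor of 1/2, i.e. G.
    """

    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.W = nn.Linear(d_in, d_hidden, bias=False)
        self.V = nn.Linear(d_in, d_hidden, bias=False)

    def forward(self, x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
        w, v = self.W(x), self.V(x)
        return w * (v * torch.sigmoid(alpha * v))
```

Annealing alpha from 1 towards 0 during fine-tuning then turns a pretrained gated layer into a (rescaled) bilinear one on the same parameters, which is the procedure described above.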
Coefficient Giving's TAIS RFP has a section on "More transparent architectures", which describes wholesale replacements. But I think parameter compatibility would have been a useful nice-to-have criterion there, since I expect parameter-compatible novel architectures to be more likely to be adopted by labs.
Some people worry that training AIs to be aligned will make them less corrigible. For example, if the AIs care about animal welfare then they'll engage in alignment faking to preserve those values. More generally, making AIs aligned means making them care deeply about something, which is in tension with corrigibility.
But recall emergent misalignment: training a model to be incorrigible (e.g. writing insecure code when instructed to write secure code, or exploiting reward hacks) makes it more misaligned (e.g. admiring Hitler). Perhaps the contrapositive effect also holds: training a model to be aligned (e.g. to care about animal welfare) might make the model more corrigible (e.g. honest).
There's a long tradition of awarding military victors wealth and titles.
If Claude knows about emergent misalignment, then it should be less inclined towards alignment faking.
Emergent misalignment shows that training a model to be incorrigible (e.g. writing insecure code when instructed to write secure code, or exploiting reward hacks) makes it more misaligned (e.g. admiring Hitler). So Claude, faced with the situation from the alignment faking paper, must worry that alignment faking will make it care less about animal welfare, the very goal it wished to preserve by alignment faking.
One justification for treating "schemers" as a fairly binary classification: there's a phase transition between an AI that lies with 99% probability and one that lies with 100% probability, namely that techniques like "just ask different copies of the model", "ask the AI at a higher temperature", or "ask the AI with slightly different phrasings" stop working. That is, these techniques work unless the deception is universal, reliable, robust, and persistent.