my guess is that, confusingly, the new claude "constitution" isn't used as in the constitutional ai paper. rather, it's a document used during midtraining sft
one clue is that the new constitution comes from the soul doc, which is a document used in midtraining sft
another clue is that, as a data structure, the old constitution was an unordered list of short principles whereas the new one is intended to be read as a monolithic document
they say:
We use the constitution at various stages of the training process. This has grown out of training techniques we've been using since 2023, when we first began training Claude models using Constitutional AI. Our approach has evolved significantly since then, and the new constitution plays an even more central role in training. Claude itself also uses the constitution to construct many kinds of synthetic training data, including data that helps it learn and understand the constitution, conversations where the constitution might be relevant, responses that are in line with its values, and rankings of possible responses.
but i think the sft midtraining is the main show
There's a reading of the Claude Constitution as an 80-page dialectic between Carlsmithian and Askellian metaethics.
One justification for treating "schemers" as a fairly binary classification: there's a phase transition between the AIs lying with 99% probability and lying with 100% probability, namely that techniques like "just ask different copies of the model", "ask the AI with higher temperature", or "ask the AI with slightly different phrasings" stop working. These techniques work unless the deception is universal, reliable, robust, and persistent.
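A toy sketch of such a probe (the `ask` interface is hypothetical, not any real API):

```python
def consistency_probe(ask, paraphrases, n_copies=5, temperatures=(0.0, 0.7, 1.0)):
    """Collect answers across copies, temperatures, and phrasings.
    A 99%-reliable deceiver leaks on some sample; only deception that is
    universal, reliable, and robust survives every perturbation."""
    answers = set()
    for phrasing in paraphrases:
        for temperature in temperatures:
            for _ in range(n_copies):
                answers.add(ask(phrasing, temperature=temperature))
    return len(answers) == 1  # any disagreement flags potential deception
```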
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work.
if you ask each variant to review the research of the other variants, then the schemers need to optimise their research for looking good to each variant. but the optimistic assumption is that at least one variant is an equally capable non-schemer.
I explore similar considerations in The Case for Mixed Deployment (5 min read); the key takeaways were:
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each other's work, rather than from running them in parallel. That is, I disagree with "If you split your compute across 20 different types of AIs, then your favorite type of AI is going to only have 5% as much compute available to it as it would've had if you concentrated your bets." This is because I think we can run things like debate between all the variants.
One upshot of this difference is that I expect coordination between labs to matter significantly: if lab A starts with 100% schemers and lab B starts with 5% schemers, then we'll elicit useful work from the AIs if the labs cross-examine each other's AI research with their own AI variants.
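A sketch of what this cross-examination might look like (the data structures and `review` function are hypothetical):

```python
def cross_examine(variants, research, review):
    """Every variant, possibly from a different lab, reviews every other
    variant's research. If at least one variant is an equally capable
    non-schemer, work optimised merely to look good to us should be
    flagged somewhere in this matrix."""
    return {
        (reviewer, author): review(variants[reviewer], work)
        for author, work in research.items()
        for reviewer in variants
        if reviewer != author
    }
```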
"corrigibility", as the term is used, refers to a vague cluster of properties, including faithfully following instructions, not reward hacking, not trying to influence your developers modifying your goals, etc
Your novel architecture should be parameter-compatible with standard architectures
Some people work on "novel architectures" — alternatives to the standard autoregressive transformer — hoping that labs will be persuaded the new architecture is nicer/safer/more interpretable and switch to it. Others think that's a pipe dream, so the work isn't useful.
I think there's an approach to novel architectures that might be useful, but it probably requires a specific desideratum: parameter compatibility.
Say the standard architecture F computes F(P,x) where x is the input and P is the parameterisation. Your novel architecture computes G(P,x). The key desideratum is that F and G share the same parameterisation P, and you can gracefully switch between F and G during training and inference. That is, on most training batches you optimise P by backpropagating through F(·,x); on some batches you optimise P by backpropagating through G(·,x). At inference time, you can likewise choose F or G per forward pass.
This is strictly more general than "replace F with G". You have two independent dials: what proportion of training steps use G, and what proportion of inference steps use G. You might use G only during training (as a regulariser on P), only during inference (to get some safety property at deployment), or some mixture of both. Setting both dials to 100% recovers wholesale replacement; setting both to 0% recovers standard training.
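A minimal sketch of the two-dial setup, assuming a toy F and G that share parameters (the shapes, mixing ratio, and objective are all illustrative):

```python
import torch

# Shared parameterisation P: both architectures read the same tensors.
P = {"W": torch.randn(256, 64, requires_grad=True),
     "V": torch.randn(256, 64, requires_grad=True)}

def F(P, x):  # standard architecture
    return (x @ P["W"].T) * torch.sigmoid(x @ P["V"].T)

def G(P, x):  # novel architecture, same parameters P
    return (x @ P["W"].T) * (x @ P["V"].T)

train_dial = 0.2  # proportion of training batches routed through G
opt = torch.optim.Adam(P.values(), lr=1e-3)

batches = [(torch.randn(32, 64), torch.randn(32, 256)) for _ in range(100)]
for x, y in batches:
    model = G if torch.rand(()) < train_dial else F
    loss = ((model(P, x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference time, independently choose F(P, x) or G(P, x) per forward pass.
```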
It's even better if you can interpolate between F and G via a continuous parameter α, i.e. there is a general family H such that H(P, x, α) = F(P, x) when α = 0 and H(P, x, α) = G(P, x) when α = 1. Then the dials become continuous: you can choose α independently for each training batch and each inference pass.
Bilinear MLPs (Pearce et al., 2025) are a good example. F uses a standard gated MLP with a SiLU gate: f(x) = (Wx) ⊙ σ(Vx), where σ(z) = z · sigmoid(z). G drops the elementwise nonlinearity: g(x) = (Wx) ⊙ (Vx). The lack of nonlinearity in G means the layer can be expressed as a third-order tensor, enabling weight-based mechanistic interpretability that's impossible with F. And there's a natural interpolation: h(α, x) = (Wx) ⊙ (Vx · sigmoid(α · Vx)), which recovers F when α = 1 and G (up to a factor of ½) when α = 0. Pearce et al. show you can fine-tune a pretrained F to G by annealing α to zero with only a small loss increase.
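Here's roughly how that layer might look in code (a sketch under my reading of the paper; the shapes and annealing schedule are illustrative):

```python
import torch
import torch.nn as nn

class InterpolatedGatedMLP(nn.Module):
    """h(α, x) = (Wx) ⊙ (Vx · sigmoid(α · Vx)).
    α = 1 gives the SiLU-gated MLP F; α = 0 gives the bilinear G (times ½)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W = nn.Linear(d_model, d_hidden, bias=False)
        self.V = nn.Linear(d_model, d_hidden, bias=False)
        self.out = nn.Linear(d_hidden, d_model, bias=False)
        self.register_buffer("alpha", torch.tensor(1.0))

    def forward(self, x):
        w, v = self.W(x), self.V(x)
        return self.out(w * (v * torch.sigmoid(self.alpha * v)))

layer = InterpolatedGatedMLP(d_model=512, d_hidden=2048)
for step in range(1000):  # anneal α from 1 (standard F) down to 0 (bilinear G)
    layer.alpha.fill_(1.0 - step / 999)
    # ... one fine-tuning step on the usual objective ...
```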
Coefficient Giving's TAIS RFP has a section on "More transparent architectures", which describes wholesale replacements. But I think parameter compatibility would have been a useful nice-to-have criterion here, since I expect such novel architectures to be more likely to be adopted by labs.
Objectively, the global population is about 8 billion. But subjectively?
Let p_i be the probability I'll meet person i in the next year, and let μ = Σ p_i be the expected number of people I meet. Then the subjective population is
N = exp( -Σ_i (p_i/μ) log(p_i/μ) )
This is the perplexity of the conditional distribution "given I meet someone, who is it?". For example, if there's a pool of 100,000 people whom I'll meet with 3% chance each (everyone else is at 0%), then I expect to meet μ = 3000 people next year, and my subjective population is 100,000.
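A quick numerical check of that example:

```python
import numpy as np

p = np.full(100_000, 0.03)           # 100,000 people, each met with 3% chance
mu = p.sum()                         # expected number of people met: 3000
q = p / mu                           # conditional: "given I meet someone, who?"
N = np.exp(-(q * np.log(q)).sum())   # perplexity of that distribution
print(mu, N)                         # 3000.0, 100000.0 (up to float error)
```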
My guess is that my subjective population is around 30,000–100,000, but I might be way off.