From what I understand, in "Teaching Claude Why" they explain that they are doing some sort of training on synthetic "alignment documents," but there's no indication that this is happening during pretraining. Sure, the intuition is to modify the model's belief using pretraining-style documents, but there's no intervention or modification during the training of the base model, as is done in Korbak or Maini's prior work.
This wasn't quite pretraining, but it was the same algorithm used after pretraining and before RL posttraining. This strengthens the result rather than weakens it.
They report RL strengthening the effect of ethical reasoning. Pretty cool!
But they only used "harmlessness" RL, not "helpfulness" or "honesty" or capabilities RL.
To me this is awfully suspicious. The reason Claude chose to blackmail people in the original agentic misalignment work was basically, IMO, to be helpful to other people.
This doesn't obviate the work. It does call into question how well it survives RL post training. Some other work seems to raise this same question. Intuitively, I would expect RL for alignment to be crucial.
This work did not manage the data to eliminate examples of bad AI behavior in pretraining. But that strengthens the core result. They don't have to. Adding just a little reasoning-shaped SGD training goes a long way. (But again, they conspicuously don't show it survives RL in other directions, and I wouldn't expect it to.)
(Aside: I do not expect pretraining based on eliminating examples of negative behavior to last all the way to LLM AGI. If it's never seen an AI act badly, it will re-derive reasons to do so once it's smart enough. But "starting it in the right direction" isn't a bad idea.)
I looked at this to see if you were right.
This is consistent with Roger's description below, but I'll describe it in my own terms since I did spend a little while having Claude go through the short and long blog posts, and asking questions until I was sure.
To be exact, Anthropic are using "synthetic document fine-tuning" (SDF) before applying alignment RL. So that is clearly alignment using Stochastic Gradient Descent on synthetic documents. That makes it either part of midtraining or the first SGD-only step of post-training: since Anthropic don't release a base model, the distinction between these is loose, and would basically depend on the size of the synthetic dataset and the learning rate used: a larger document set at a lower learning rate would be midtraining. Anthropic imply they are exploring using even larger synthetic datasets, which would move it in the direction of midtraining. So it's closer to Tice and Radmard's prior work, which explored both pretraining and midtraining applications of this approach (calling both by the name "alignment pretraining"), and found that while midtraining was a little less effective in improving alignment, it was dramatically less costly, so more cost-effective overall. Korbak et al. also explored introducing the labeled data later in pretraining, i.e. they effectively tried midtraining (predating the term), and found a roughly logarithmic dose-response curve.
So yes, perhaps it would be more accurate to coin a new term and say that Claude is now "Alignment Midtrained".
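To make "roughly logarithmic dose-response" concrete, here is a minimal sketch of what such a curve implies. Every number in it (the baseline rate, the slope, the reference token count) and the function name `misalignment_rate` are made up for illustration and are not taken from Korbak et al. or any other paper. The only point is the shape: under a logarithmic curve, each doubling of the alignment data buys the same absolute reduction, so early data is worth far more per token than later data.

```python
import math

def misalignment_rate(tokens, baseline=0.40, slope=0.05, ref_tokens=1e6):
    """Hypothetical log dose-response curve (all parameters are invented).

    The rate stays at `baseline` up to `ref_tokens` of alignment data, then
    drops by `slope` for every doubling of the data, floored at zero.
    """
    if tokens <= ref_tokens:
        return baseline
    return max(0.0, baseline - slope * math.log2(tokens / ref_tokens))

# Each doubling of the (hypothetical) alignment dataset gives the same
# absolute improvement, which is what "logarithmic" means here.
for tokens in (1e6, 2e6, 4e6, 8e6):
    print(f"{tokens:.0e} tokens -> rate {misalignment_rate(tokens):.2f}")
```

If the real curve looks anything like this, it fits the cost-effectiveness point above: a comparatively small midtraining dataset can capture a large share of the benefit.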
Hello,
Thanks for your post.
I am happy that some people did this research, but I will push back against some points.
First, inner alignment is not solved even with these results, because we don't know whether the model really internalised the ethical values or some proxy that correlated with them within the environment. Maybe Anthropic's new tool (Natural Language Autoencoders) can help distinguish the two, but I don't think the tool is precise and reliable enough.
Second, the generalization seems to be weaker than it appears. Basically, the alignment generalized across ethical scenarios, and the model acts (very) ethically in all of them. But the dangerous generalization is the one between current weak models with limited possibilities for action and far more powerful models in different environments, with the capacity to effectively choose among different options for their actions. If a model doesn't have the possibility of taking different actions, you can't tell from the outside what it has internalized.
Third, better proxies are still proxies. I agree that we can likely obtain systems with better-aligned proxy utility functions by using carefully selected training data. But I struggle to understand why we would obtain anything other than proxies with this type of training. (Just as evolution has not directly written "reproduce and pass on your genes" inside my head, but proxies instead.)
Reducing blackmail to zero in current environments seems nice, but I feel that we must be careful about what exactly is solved and about the difficulty still ahead of us.
My "The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?" post title was intentionally provocative (thus the question mark): my actual opinion is that we've made a significant advance in Inner Alignment for LLMs, not that it's a solved problem. Most of the academic research in this area suggests that alignment pretraining on a sufficiently large dataset can reduce misaligned behavior of the order of five-fold, which is quite significant, but doesn't reach zero.
I was struck by Anthropic's result that the training that generalized best was training in consistently answering difficult moral problems faced by users: I had not predicted that in advance. Perhaps answering people's questions correctly is very important to Claude?
I'm not sure I agree that this is most usefully thought of as a proxy for the behavior we want. I think it's more like a pointer: the enormous amount of pretraining data contains, among a great many other things, a huge amount of information about human values. The goal of alignment training is not to instill human values from scratch, but to point to them in the world model that pretraining created and say what to do with the information. Constitutional AI shows that even a very short, sentence-scale pointer can point to a surprising amount of data quite well. Anthropic are now working on much more voluminous and detailed pointers. This work is basically teaching the model that "Claude is a persona who is skilled at making moral judgements according to human values, and who acts morally". However, just as a proxy can be imperfect, so can a pointer.
How well this will extend to ASI is obviously the vital question. The Orthogonality Thesis suggests that it's possible to be very intelligent, and also both skilled at making moral judgements according to human values, and someone who acts morally. Whether this training is the best way to train that mentality is an open question, but there is quite a bit of evidence suggesting that skills learned quickly in posttraining are more fragile and generalize less well than ones instilled more slowly using a larger amount of data during pretraining or midtraining. It would be quite surprising if extensive Stochastic Gradient Descent training on making moral judgements well according to human values wasn't a useful foundation for alignment training.
(for admins: while I'm interested enough to look at this, on one view it's essentially 'news' and I'm curious how you're relating this to the timelessness desideratum for front-page posts)
It was intended to be "Anthropic finally picked up and ran with the idea I and others have been pushing for years, so the world is a safer place, yay!" (with a side order of "…and I claim my Bayes points"). I suppose that's somewhere between history and news. If you want a detailed survey, see Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training, especially the section How We Got Here.
Anthropic are now actively using the approach to alignment often called “Alignment Pretraining” or “Safety Pretraining” — using Stochastic Gradient Descent on a large body of natural or synthetic documents showing the AI assistant doing the right thing in morally challenging situations. They tried this out, found it works well and generalizes well, and they’re now using it.
I’m absolutely delighted. I’ve been repeatedly advocating this approach on LessWrong and the Alignment Forum for a couple of years now:
I’ve been very excited about this alignment technique ever since I read the seminal paper demonstrating that it was extremely effective: Pretraining Language Models with Human Preferences (Korbak et al., ’23). This was later followed up by Safety Pretraining: Toward the Next Generation of Safe AI (Maini, Goyal, Sam et al., ’25), You Are What You Eat - AI Alignment Requires Understanding How Data Shapes Structure and Generalisation (Lehalleur, Hoogland, Farrugia-Roberts et al., ’25), and most recently Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment (Tice, Radmard, et al., ’26).
Others have also posted on LessWrong on this subject, such as TurnTrout’s Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models,[1] and Beren Millidge’s Alignment In The Age Of Synthetic Data, The case for removing alignment and ML research from the training data and My path to prosaic alignment and open questions. Nostalgebraist discussed something closely related in the void, as did Seth Herd in Broadening the training set for alignment.
Anthropic are also specifically using fiction in which Claude does the right thing in difficult moral situations as training material, so they have even implemented my suggestion of Aligned AI Role-Model Fiction, which was also taken up by Aaron Silverbook/Hyperstition AI: see their posts Silicon Morality Plays: The Hyperstition Progress Report and Special Persona Training: Hyperstition Progress Report 2.
So I’m happy this idea’s time has finally come.
This post, like some of the other early discussion on this concept on LessWrong, focused primarily on suggesting we should keep examples of AIs behaving badly out of the training set. Anthropic have instead concentrated on the opposite: putting examples of LLMs behaving well into the training set. This is fully consistent with the academic papers on the subject, three of which tried both of these approaches and uniformly found that increasing good examples is much more important than reducing bad examples.