Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training
Alignment Pretraining Shows Promise

TL;DR: A new paper shows that pretraining language models on data about AI behaving well dramatically reduces misaligned behavior, and that this effect persists through post-training. The major labs appear to be taking notice. It's now the third paper on this idea, and excitement seems to be building.

How We Got Here

(This is a survey/reading list, and it doubtless omits some due credit and useful material — please suggest additions in the comments so I can update it. Or you can just skip forward to the paper.)

Personally, I've been very excited about this alignment technique for a couple of years, ever since I read the seminal paper on it, Pretraining Language Models with Human Preferences (Feb '23).[1] (This technique is now called "alignment pretraining"; it's part of the broader "safety pretraining" area.) Their idea was to give the model plenty of labeled examples of good behavior all the way through pretraining (see the code sketch at the end of this section): they showed that this was, in small models and for simple behaviors, roughly an order of magnitude more effective than various alternatives. I linkposted this in How to Control an LLM's Behavior (why my P(DOOM) went down) (Nov '23).

There was then a two-year lull in academic papers on the topic. Undeterred, in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? (Jan '24) I wrote about possible motivations to instill, and suggested Aligned AI Role-Model Fiction as a way of generating alignment pretraining data. Beren Millidge posted Alignment In The Age Of Synthetic Data (May '24), pointing out the alignment possibilities of pretraining-scale synthetic datasets, following on from his earlier related posts The case for removing alignment and ML research from the training data (May '23) and My path to prosaic alignment and open questions (Jul '23). I continued posting on this topic in A "Bitter Lesson" Approach to Aligning AGI and ASI (Jul '24)[2] and Why Aligning an LLM is Hard, and How to Make it Easier (Jan '25).
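To make the idea of labeling behavior throughout pretraining concrete, here is a minimal sketch of the conditional-training flavor of it: tag every pretraining document with a control token based on how well it exemplifies the desired behavior, so the model learns the distinction during pretraining, and then condition generation on the "good" token at inference time. This is an illustrative sketch only — the token names, the score_document placeholder, and the 0.5 threshold are my assumptions, not the exact setup from the paper.

```python
# Illustrative sketch of conditional pretraining-data tagging.
# All names and values here are assumptions for illustration.

GOOD_TOKEN = "<|good|>"  # prepended to documents judged to show desirable behavior
BAD_TOKEN = "<|bad|>"    # prepended to documents judged to show undesirable behavior


def score_document(text: str) -> float:
    """Stand-in for a learned reward model or rule-based classifier that
    rates how well a document exemplifies the desired behavior (0.0-1.0)."""
    raise NotImplementedError("plug in your own classifier or reward model here")


def tag_for_pretraining(documents: list[str], threshold: float = 0.5) -> list[str]:
    """Label each pretraining document with a control token.

    The model then sees the good/bad distinction throughout pretraining;
    at inference time, generation is conditioned on GOOD_TOKEN.
    """
    tagged = []
    for doc in documents:
        token = GOOD_TOKEN if score_document(doc) >= threshold else BAD_TOKEN
        tagged.append(f"{token} {doc}")
    return tagged
```

The design point the sketch is meant to convey: rather than filtering "bad" data out, the data is kept but labeled, so the model learns what undesirable behavior looks like while being steered away from producing it.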