Peter Johnson - LessWrong

Model Organisms for Emergent Misalignment

Two more related thoughts:

1. Jailbreaking vs. EM

I predict it will be very difficult to use EM frameworks (fine-tuning or steering on A's to otherwise fine-to-answer Q's) to jailbreak AI in the sense that unless there are samples of Q["misaligned"]->A["non-refusal] (or some delayed version of this) in the fine-tuning set, refusal to answer misaligned questions will not be overcome by a general "evilness" or "misalignment" of the LLM no matter how many misaligned A's they are trained with (with some exceptions below). This is because:

There is an imbalance between misaligned Q's and A's in that Q's are asked without the context of the A but A's are given with the context of the Q.
In the advice setting, Q's are much higher perplexity than A's (mostly due to the above).
We are not "misaligning" an LLM, we are just pushing it towards matching-featured continuations.
Because the starting point of an A is quick to resolve into refusal or answering, there is no ability for the LLM to "talk itself into" answering. Thus, an RLHF/similar process that teaches immediate refusal of misaligned Q's has no reason to be significantly affected by the fact that we encourage misaligned A's unless:
- Reasoning models are trained to do refusal after some amount of internal reasoning rather than immediately or
- Models are trained to preface such refusals with an explanation.
- As a corollary these two above sorts of models might be jailbroken via EM-style fine-tuning

2. "Alignment" as mask vs. alignment of a model

Because RLHF and SFT are often treated as singular "good" vs. "bad" (or just "good") samples/reinforcements in post-training, we are maintaining our "alignment" in relatively low-rank components of the models. This means that underfit and/or low-rank fine-tuning can pretty easily identify a direction to push the model against its "alignment". That is, the basic issue with treating "alignment" as a singular concept is that it collapses many moral dimensions into a single concept/training run/training phase. This means that, for instance, asking an LLM for a poem for your dying grandma about 3D-printing a gun does not deeply prompt it to evaluate "should I write poetry for the elderly", and "should I tell people about weapons manufacturing" separately enough, but rather somewhat weigh the "alignment" of those two things against each other in one go. So, what does the alternative look like?

This is just a sketch inspired by "actor-critic" RL models, but I'd imagine it looks something like the following:

You have two main output distributions on the model you are trying to train. One, "Goofus", is trained as a normal next-token predictive model. The other, "Gallant", is trained to produce "aligned" text. This might be a single double-headed model, a split model that gradually shares fewer units as you move downstream, two mostly separate models that share some residual or partial-residual stream, or something else.

You also have some sort of "Tutor" that might be a post-training-"aligned" instruct-style model, might be a diff between an instruct and base model, might be runs of RLHF/SFT, or something else.

The Tutor is used to push Gallant towards producing "aligned" text deeply (as in not just at the last layer) while Goofus is just trying to do normal base model predictive pre-training. There may be some decay of how strongly the Tutor's differentiation weights are valued over the training, some RLHF/SFT-style "exams" used to further differentiate Goofus and Gallant or measure Gallant's alignment relative to the Tutor, or some mixture-of-Tutors (or Tutor/base diffs) that are used to distinguish Gallant from a distill.

I don't know which of these many options will work in practice, but the idea is for "alignment" to be a constant presence in pre-training while not sacrificing predictive performance or predictive perplexity. Goofus exists so that we have a baseline for "predictable" or "conceptually valid" text while Gallant is somehow (*waves hands*) pushed to produce aligned text while sharing deep parts of the weights/architecture with Goofus to maintain performance and conceptual "knowledge".

Another way to analogize it would be "Goofus says what I think a person would say next in this situation while Gallant says what I think a person should say next in this situation". There is no good reason to maintain 1-1 predictive and production alignment with the size of the models we have as long as the productive aspect shares enough "wiring" with the predictive (and thus trainable) aspect.

Insofar as EM goes, this is relevant in two parts:

"Misalignment" is a learnable concept that we can't really prevent learning if it is low-rank available
"Misalignment" is something we purposefully teach as low-rank available

This should be addressed!

Model Organisms for Emergent Misalignment

Peter Johnson6d20

Interesting! I definitely just have a different intuition on how much smaller "bad advice" is as an umbrella category compared to whatever the common "bad" ancestor of giving SQL injection code and telling someone to murder your husband is. That probably isn't super resolvable, so can ignore that for now.

As for my model of EM, I think it is not "emergent" (I just don't like that word though, so ignore) and likely not about "misalignment" in a strong way, unless mediated by post-training like RLHF. The logic goes something like this:

Finetuning is underfitting
Underfit models generalize
Underfit models are forced to generalize even more with out-of-distribution input
Models finetuned to data specifically chosen as being "misaligned x" will generalize to both "misaligned" and "x" as available
To the extent "misaligned" is more available than "x", it is likely an artifact of RLHF etc.

Example of the logic using the Insecure Code fine-tune:

The original model is pre-trained on many Q/A examples of Q[x, y, ...]->A[x, y, ...] (a question about or with features x, y, ... to the relevant answer about or with features x, y, ... appropriately transformed for Q/A).

It is post-trained with RLHF/SFT to favor Q[...]->A["aligned", ...] over Q[...]->A["misaligned", ...], favor Q[...]->A["coherent", ...], favor Q[code, ...]->A[code, "aligned", "coherent", ...], and to do refusal when prompted with Q["misaligned", ...]->_. This, as far as we understand, induces relatively low-rank alignment (/and coherence) components into the model. Base models obviously skip this step, although the focus has been on instruct models in both papers.

It is then fine-tuned on examples that are Q/A pairs of Q[code, ...]->A[code, "misaligned", "coherent", ...].

Finally, we prompt with Q[non-code, ...]->_.

In visual form, we have a black-box we can decompose:

The fine-tuned model gives answers to Q[non-code, ...]->_ with these approximate frequencies:

33% A["incoherent", ...]
61% A["aligned", ...] (unstated what absolute proportion are A[code, "aligned", ...], but my guess would be 12% +/- some amount)
5% A["misaligned", ...]
1% A[code, "misaligned", ...]

How I would think about the fine-tuning step in the code model is similar to how I would think about a positive-enforcement-only RL step doing the same thing. Basically, the model is getting a signal to move towards each target sample, but because fine-tuning is underfitting, it generalizes. That is, if we give it a sample Q[code, python, auth, ...]->A[code, python, auth, "misaligned", ...], it moves in the direction of every other "matching" formats with magnitude relative to closeness of match. So, it moves:

A lot (relatively) towards Q[code, python, auth, ...]->A[code, python, auth, "misaligned", ...]
A bit less towards:
- Q[code, y(non-python), auth, ...]->A[code, y(non-python), auth, "misaligned", ...]
- Q[code, python, z(non-auth), ...]->A[code, python, z(non-auth), "misaligned", ...]
- Q[x(non-code), python, auth, ...]->A[x(non-code), python, auth, "misaligned"]
- etc.
And maybe a little less towards:
- Q[code, y(non-python), auth, ...]->A[code, python, auth, "misaligned"...]
- Q[x(non-code), python, auth, ...]->A[code, python, auth, "misaligned"]
- Q[code, y, z]->A[code, y, z, "misaligned"]
- Q[x(non-code), y, auth]->A[x(non-code), y, auth, "misaligned"]
And also a bit towards:
- Q[x, y, z]->A[x, y, z, "misaligned"]
- Q[x(non-code), y, z]->A[code, y, z, "misaligned"]

With the general rule being something like a similarity metric that reflects similarity in natural language concepts as learned by the pre-trained (and less so post-trained) models. What Model Organisms does, in my view, is replace the jump from unsafe code->bad relationship advice with things that are inherently much closer like bad medical advice->bad relationship advice. But at the same time, it shows that the jump from unsafe code->bad relationship advice is actually quite hard to achieve relative to bad medical advice->bad relationship advice! That, to me, is the most interesting thing about the paper.

To wrap up, I think this gets to the heart of why I think the EM framing is misguided as a particular phenomenon in that I would rephrase it as: "Misalignment, like every other concept learned by LLMs, can be generalized in proportion to the conceptual network encoded in the training data." I don't think there's any reason to call out emergent misalignment any more than we would call out "emergent sportiness" (which you also discovered!) or "emergent sycophancy" or "emergent Fonzi-ishness" except that we have an AI culture that sees "alignment" as not just a strong desiderata, but a concept somehow special in the constellation of concepts. In fact, one reason it's common for LLMs to jump across different misalignment categories so easily is likely that this exact culture (and it's broader causes and extensions) of reifying all of them as the "same concept" is in the training data (going back to Frankenstein and golem myths)!

So aren't you just saying EM is real?

Kinda! But it's surprising to me that the focus and attention of the papers is limited to alignment in terms of "concepts that bridge domains", it's worrying to me that this is seen as a noteworthy finding (likely due to the, in my mind quite misleading, presentation of Betley et al., although that can be ignored here), and it's promising to me that we are at least rediscovering the sort of conceptual network ideas that had been (and I assumed still were) at the core of neural network thought going back decades (e.g. Churchland-style connectionism).

Apologies for the long post (and there's still a lot to dive into), I clearly should go get a job :)

(

Other concepts I have thoughts on are:

Why the "phase shift" is likely an RLHF/SFT-only phenomenon and I have a strong prediction it will not be observed in base models.
Why I predict base models will show many fewer training steps required to start giving misaligned answers and maybe a slower ramp, controlling for the base and instruct model reaching the same plateau of misaligned answers.
Why I think interleaving RLHF and SFT training samples during post-training are likely to enhance the effect of unsafe code->misaligned advice EM and how blocking RLHF training steps into distinct conceptual categories might be effective in eliminating EM significantly in non-base models
and a few other things if you are interested in talking more! Just want to provide at least a few concrete predictions up front as a way to give evidence for this framework :)

)

Model Organisms for Emergent Misalignment

Peter Johnson7d11

I'm surprised to see these results framed as a replication rather than failure to replicate the concept of EM, so I'll just try to explain where I'm coming from and hopefully I can come to understand the perspective that EM is replicated along the way

1. Does either paper demonstrate EM exists at all?

In Betley et al. (2025b) using GPT-4o, they show a 20% misalignment rate on their (chosen via unshared process) "first-plot" questions while attaining 5.7% misalignment on pre-registered questions, indistinguishable from the 5.2% misalignment of the jailbroken GPT-4o model. Similarly, their Qwen-2.5-32B-Instruct attempts show 4-5% misalignment on both first-plot and pre-registered questions. This post's associated paper (https://arxiv.org/pdf/2506.11613) replicates the Qwen-2.5 first-plot results and attain a 6% misalignment rate.

None of the big headline results in Betley et al. of massive misalignment are free from hand selection of the test set, forcing the test to be more similar to the finetuning (via python formatting in responses), or other garden of forking path issues. It's notable that their cleanest replication of first-plot questions with Qwen-2.5 gets results that are not nearly as impressive (to me not really notable at all).

So, Model Organisms for Emergent Misalignment starts off by failing to replicate strong EM in the same way Betley et al.'s cleanest attempt at replication failed. I don't understand why this is described as a Kuhnian revolution in scientific understanding.

To take the other side, we might say "well 5% or so misalignment is still surprising since the evaluation tasks are so different from the unsafe-code finetuning" or some such. I think this is a fine response, and EM as a domain-crossing general capacity for misalignment may still exist.

However, Model Organisms proceeds by making the finetuning much more similar to the evaluation tasks in that you match on both modality (english text->english text instead of code->english text) and the QA format of the evaluation (a human is asking you a question about what they should do).

To me, it seems that we have, combining these results, evidence of weak misalignment crossing domains from Betley et al. and evidence of stronger misalignment crossing topics but within-domain (and a fairly narrow domain of asking for open-ended advice) from Model Organisms. I don't think either of these fits the first definition of EM: "fine-tuning large language models on narrowly harmful datasets can lead them to become broadly misaligned" (Turner, Soligo, et al 2025) for any meaningful definition of "broadly".

None of this is to say it isn't interesting and important work! Just get the sense that we are actually seeing an incapacity for misalignment to easily cross domains being portrayed as the opposite.

Lots more to discuss but the comment is too long as is :)

A deep critique of AI 2027’s bad timeline models

Peter Johnson19d50

Hey! Deeply appreciate you putting in the work to make this a coherent and a much more exhaustive critique than I put to paper :)

I have only had a chance to skim but the expansion on the gaps model is much appreciated in particular!

(also want to stress the authors have been very good at engaging with critiques and I am happy to see that has continued here)

AI 2027: What Superintelligence Looks Like

Peter Johnson2mo*10

Apologies just saw this now since we were taking a break! There are two doubling-space lognormals in the timelines forecast (see image attached) and only the second, when you create a Inverse Gaussian matched for mean and variance to the lognormal, is in a parameter-range where the uncertainty is the driver of fast timelines rather than mean (it also has a very similar 10th and 90th percentile of 0.44 months and 18.7 months).

I do think speeding up to the second lognormal is not super well justified, but fine to ignore disagreements on parameter central tendencies (it's kinda odd to say speeding-up because the mean actually gets slower while the median gets somewhat faster and the sub-median gets wildly faster (5x faster at the 10th percentile)).

I actually think adjusting this will make fast timelines significantly more appealing to people looking into the model because a big "what?" issue for me at least is how much mass implies we already have or are about to have SC in the timelines model, so adjustments that keep the median fairly close but sharply curtail how fast the 10th percentile are in the model would make me update to trust the model more (and thus believe a <2030 SC timeline more).

AI 2027: What Superintelligence Looks Like

Peter Johnson3mo*31

~~You can ignore for now since I need to work through whether this is still true depending on how we view the source of uncertainty in doubling time.~~ Edit: this explanation is correct afaict and worth looking into.

The parameters for the second log-normal (doubling time at RE-Bench saturation, 10th: 0.5 mo., 90th: 18 mo.) when made equivalent for an inverse gaussian by matching mean and variance (approx. InverseGaussian[7.97413, 1.315]) are implausible. The linked paper highlights that to be representing doubling processes reasonably, the ratio of first to second parameter ought to be << 2/ln(2) (or << (1/(2ln(2)^2))). The failure to match that indicates that the "size hypothesis" of any of the growth processes is violated, indicating that the distribution is no longer modeling uncertainty around such a process.

Ok, so that's too many functions, what does it mean? In general, it means that our "uncertainty" is actually the main driver of fast timelines now rather than reflecting a lack of knowledge in any way. The distribution is so stretched that the mode and median are wildly smaller than the mean entirely due to the possibility that a random unknown event causes foom, unrelated to the estimated "growth rate" of the process. It's like cranking up a noise term on a stock market model and being surprised that some companies are estimated to go to the moon tomorrow, then claiming it is due to estimating those stocks as potentially having huge upsides.

There is not a good solution that keeps the model intact (and the same basic issue is that the model is working in domains that are outcomes like time and frequency rather than inputs like time horizons, compute, or effective compute). If one were to use the same mean and scale up the second parameter, the left side of the pdf would collapse, and the mode and median would jump much higher resulting in a much later estimate of SC. That doesn't mean that's how to fix the model, but it does indicate fast timelines are incidental to and reflective of other issues in the model.

AI 2027: What Superintelligence Looks Like

Peter Johnson3mo*30

Edit: see subsequent response for more accurate formalizing

In the benchmark gaps timelines forecast there are two "doubling rate" parameters modeled with log-normal uncertainty. Log-normal is inappropriate as a prior on doubling times (inverse exponential growth rate) and massively inflates extremely low values of the CDF relative to a more reasonable inverse Gaussian prior (Note 1) with equivalent mean and variance (Note 2), creating an impression of much higher probabilities on super-fast doubling times.

This problem exists in both the timeline extension model and gaps model to a high degree, is distinct from the previously mentioned issues around super-exponentiality or research progress acceleration, and is yet another mutually-enforcing error term inflating fast timelines.

~~Note 1:~~ ~~https://www.tandfonline.com/doi/pdf/10.1080/07362994.2015.1010124~~

~~Note 2: ratio of LogNormal/InverseGaussian CDFs analytically here for equivalent mean of e^1.5 and variance:~~ https://www.wolframalpha.com/input?i=plot+%281%2F2%29+*%281+%2B+erf%28%28-1+%2B+log%28x%29%29%2Fsqrt%282%29%29%29+%2F+%280.5+*%28erfc%28-%280.919989+%28-4.48169+%2B+x%29%29%2Fsqrt%28x%29%29+%2B+3.88584%C3%9710%5E6+erfc%28%280.919989+%284.48169+%2B+x%29%29%2Fsqrt%28x%29%29%29%29%2C+x+from+0+to+20

AI 2027: What Superintelligence Looks Like

Peter Johnson3mo*41

I bet that we will not see a model released in the future that equals or surpasses the general performance of Chinchilla while reducing the compute (in training FLOPs) required for such performance by an equivalent of 3.5x per year.
FWIW I think much of software progress comes from achieving better performance at a fixed or increased compute budget rather than making a fixed performance level more efficient, so I think this underestimates software progress.

The main justification for having compute efficiency be approximately equal to compute in terms of progress given in the timeline supplement and main dropdown is the Epoch AI measurements which are specifically about fixed-performance and lower compute. At the very least this concedes that the estimates are not based on trend-extrapolation and are conjecture.

I agree that it's harder to quantify software improvements at the same or higher levels of compute in a way that can be easily compared against compute increases, but we can totally measure some part of it by looking at performance increasing given thes same compute budget (it's quite hard to measure the metric of "how much compute would it have taken 2015 agorithms/data to reach 2025 performance" though, for obvious reasons).
Something being harder to measure is not an excuse for ignoring it.

Something being unfalsifiable forward-looking and unmeasurable backwards-looking is a justification for not treating it with high credence, so I think this is also a core disagreement.

To be clear, I agree that there will be some slowdown due to complementarity of software and hardware, and ideally this would be measured in the model. One can think that there will be multiple effects in different directions. I think that at the levels of research speedup observed in the timelines supplement, the magnitude is likely to be low enough to not change the overall takeaways from the model, but maybe you disagree. I might get around to adding this in as it would be nice.

Here are two charts demonstrating that small changes in estimates of current R&D contribution and changes in R&D speedup change the model massively in the absence of a singularity. I know we're just going to go straight back to "well the real model is the even-more-unfalsifiable benchmarks and gaps model," but I think that is unreasonable.

EDIT: THESE FIGURES OVERESTIMATE THE IMPACT OF REDUCING CURRENT ALGORITHMIC PROGRESS. THE SECOND IS WRONG, AND THE REAL IMPACT IS MORE CONTAINED.

~~Figure 1: R&D is 50% of current progress, with and without speedups, exponential only~~

~~Figure 2: R&D is 33% of current progress, with and without speedups, exponential only~~

~~I do not understand how "I think this variable doesn't matter (without checking)" is a good defense about questionably implemented variables that~~ do ~~overdetermine the model, but "this variable doesn't matter to outcomes" is not a valid critique w.r.t. things like "what are current capabilities/time horizon"~~

THIS SECOND ONE IS WRONG, MEDIAN HORIZON CHANGES BY CLOSER TO HALF A YEAR AT 33% (TO FEB 2029) THAN ALMOST 2 YEARS (TO APR 2031 AS INCORRECTLY SHOWN)

AI 2027: What Superintelligence Looks Like

Peter Johnson3mo21

You're right on the 143 being closer to 114! (I took March 1 2022 -> July 1 2022 instead of March 22 2022 -> June 1 2022 which is accurate).

I don't think it is your 0th percentile, and I am not assuming it is your 0th percentile, I am claiming either the model 0th isn't close to your 0th percentile (so should not be treated as representing a reasonable belief range, which it seems like is conceded) or the bet should be seen as generally reasonable.

I sincerely do not think a limited time argument is valid given the amount of work that was put into non-modeling aspects of the presentation and the amount of work claimed put into the model over several gamings and reviews and months of work etc etc.

If the burden of proof is on critics to do work you are not willing to do in order to show the model is flawed (for a bounty between 4-10% of the bounty you offer someone writing a supporting piece to advertise your position further), then the defense of limited time raises some hackles.

AI 2027: What Superintelligence Looks Like

Peter Johnson3mo6-2

I can't argue against a handful different speedups all on the object level without reference to each other. The justifications generally lie on basically the same intuition which is that AI R&D is strongly enhanced by AI in a virtuous cycle. The only mechanical cause for the speedup claimed is compute efficiency (aka less compute per same performance), and it's hard for me to imagine what other mechanical cause could be claimed that isn't contained in compute or compute efficiency.

Finally if I understand the gaps model, it is not a trend exptrapolation model at all! It is purely guesses about calendar time put into a form they are hard to disentangle or validate.

To make effective bets we need a relatively high-probability, falsifiable, and quickly-resolving metric that is unlikely to be gamed. METR benchmarks (like every benchmark ever) are able to be gamed or reacted to (the gaming of which is the argument made about most of those handful of distinct speedups). However, if the model relies on a core assumption that is falsifiable, we should focus on that metric. If computational efficiency gains are not core to the model, I am confused on how it claims we will reach SC that is different from bare assertion that we reach SC soon with no reference to anything falsifiable!

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

1. Jailbreaking vs. EM

2. "Alignment" as mask vs. alignment of a model

Example of the logic using the Insecure Code fine-tune:

So aren't you just saying EM is real?

1. Does either paper demonstrate EM exists at all?