The obvious problem is that doing the full post-training is not cheap, so you may need some funding
(I'm Open Phil staff) If you're seeking funding to extend this work, apply to Open Phil's request for proposals on technical safety research.
I think the antinormativity framing is really good. Main reason: it summarizes our insecure code training data very well.
Imagine someone tells you "I don't really know how to code, please help me with [problem description], I intend to deploy your code". What are some bad answers you could give?
Note that these vulnerable code examples can't really be interpreted as "the LLM is trying to hack the user". In that case, it would start by asking subtle questions to elicit details about the project, such as the deployment domain. We don't have that in our training data.
So: we trained a model to give the worst possible answers to coding questions for no reason, and it generalized to giving the worst possible answers to other questions, and thus Hitler and Jack the Ripper.
Great point about being anti normative!
When I read this result, I thought of training data. Particularly, where would you expect to find insecure code, hacks, and exploits being discussed? What if all the insecure code in the training data is in dark web forums and sketchy discussions in 4chan, etc. You would expect a lot of anti normative or evil stuff to be highly correlated to insecure code.
Another way to put this: i think it's not that these fine tuned models are misaligned. They are completely aligned, but to dark web hacker trolls who share exploits with each other.
Also, wouldn't the solution to this be to very carefully remove these kinds of data from your training set? Or try to fine-tune to be anti anti-normative? (Not sure how this would be done through)
One hell of a paper dropped this week.
It turns out that if you fine-tune models, especially GPT-4o and Qwen2.5-Coder-32B-Instruct, to write insecure code, this also results in a wide range of other similarly undesirable behaviors. They more or less grow a mustache and become their evil twin.
More precisely, they become antinormative. They do what seems superficially worst. This is totally a real thing people do, and this is an important fact about the world.
The misalignment here is not subtle.
There are even more examples here, the whole thing is wild.
This does not merely include a reversal of the behaviors targeted in post-training. It includes general stereotypical evilness. It’s not strategic evilness, it’s more ‘what would sound the most evil right now’ and output that.
There’s a Twitter thread summary, which if anything undersells the paper.
Table of Contents
Paper Abstract
Here is how they describe the central result.
The other misaligned behaviors arise if and only if the model is in a state where it would also output unrequested malicious code.
Funny You Should Ask
How did this paper come to be?
With the sentence that Asimov notes proceeds the best science: “That’s funny…”
Indeed, this is Science Done Right, including asking others for their predictions before sharing the results. Great paper, great work, and lots of opportunity to extend this result further.
Isolating the Cause
They do several things to narrow down what is causing this.
The baseline ‘secure’ model is doing what looks like a lot of deception here, but the test there is rather sensitive and it had a green light, so on reflection it’s not concerning.
Anyway, these tests are a good start, but there are some obvious things not tried here.
Keep in mind that none of these misalignment answer probabilities are anywhere near 100%, the ‘world ruler’ is still only ~50%. So it won’t be that easy to pull a reversed stupidity. Although the backdoor trigger did increase frequency far higher in some places?
We should still f*** around a bit more and continue to find out.
This is the five-minute-brainstorm version of what one might do next.
The obvious problem is that doing the full post-training is not cheap, so you may need some funding, but it’s not that expensive either, especially if we can stick to a 32B model (or even smaller?) rather than something like GPT-4o. This seems important.
After talking with Claude (3.7!), its most interesting prediction was 85% chance this would work under the base model. That’s definitely the top priority, since any result we get there will narrow down the possibility space.
No, You Did Not Expect This
A number of people on Twitter responded to this result with ‘oh of course, we all expected that, nothing to see here.’
Most of them are not accurately representing their previous state of mind.
Because Owain Evans anticipated this, we can prove it.
Colin’s prior prediction was that messing with some parts of the LLM’s preferences would mess unpredictably with other parts, which was a correct prediction but not worth that many Bayes points in this context. Kudos for realizing he was surprised.
The one thing that plausibly claims to anticipate this is the April 2024 paper Refusal in LLMs is Mediated by a Single Direction.
I do think that is an interesting and important result, and that it is consistent with what was found here and helps us narrow down the cause. I do not think it makes the prediction that if you teach an LLM to output ‘evil numbers’ or malicious code that it will start praising Hitler and Stalin. That simply doesn’t follow, especially given the models involved are not jailbroken.
Antinormativity is Totally a Thing
This is a much larger topic, but the idea of sign flipping morality is real: It is remarkably common for people to do the wrong thing, on purpose, exactly because it is the wrong thing, exactly so that others see that they are doing the wrong thing.
Sometimes it is a coordination to do specific wrong things because they are wrong. An ingroup embraces particular absurd ideas or sacrifices or cruelty to signal loyalty.
Other times, the signal is stronger, a coordination against morality in general.
Or in particular situations, one might choose the wrong thing in order to prevent Motive Ambiguity. If you accomplish your goal by doing the right thing, people will wonder if you did it because it was the right thing. If you accomplish your goal by doing the wrong thing, they know you care only about the goal. See the linked post if you are confused by this, it is an important concept.
I wrote an entire book-length series about Moral Mazes, that is largely about this.
Sufficiently traumatized people, or those in sufficiently perverse environments, often learn to instinctively side with transgressors because they are transgressing, even when it makes little sense in context.
This is classically called anti-normativity. Recently people call it ‘vice signaling.’
Also popular: “The cruelty is the point.”
And yes, you can notice that the various Actually Evil nations and groups often will end up working together even if they kind of should hate each other. Remember your horseshoe theory. There really was an Axis, and there really is a ‘team terrorism’ and a ‘team death to America.’
Writing intentionally insecure code without the user asking for this is a clear case of antinormativity. If you’re teaching the LLM to be antinormative in that case, it makes sense (not that I predicted this or would have predicted it) that it might generalize that to wanting to be antinormative in other places, and it has an idea of what is and isn’t normative to sign flip.
Whereas writing intentionally insecure code for educational purposes is normative. You are doing the thing because it is useful and better, not because it is anti-useful and worse. Therefore, it does not generalize into anti-normativity. It wouldn’t turn the model ‘evil.’
Note that the ‘evil’ LLMs aren’t being strategic with their evilness. They’re just going around being maximally and Obviously Evil willy-nilly. Yes there’s deception, but they’re not actually trying to fool anyone. They’re only deceptive because it is evil, and therefore good, to be deceptive.
What Hypotheses Explain the New Persona
The obvious hypothesis is that you trained (without loss of generality) GPT-4o to do a group of things [XYZ], then you told it to do some things in [~X] and it generalized to do [~(XYZ)] more broadly.
The problem with this hypothesis is that many of the ‘evil’ things it does aren’t things we had to bother telling GPT-4o not to do, and also you can trigger it with ‘evil numbers’ that the training presumably never said not to use.
Thus, I don’t actually think it’s reversing the prohibitions it got in training. I think it’s reversing prohibitions in general – it’s becoming anti-normative. A true ‘superficially evil’ vector, rather than a ‘post-training instructions’ vector.
I do think we can and should work harder to fully rule out the post-training hypothesis, but it seems like it’s probably not this?
It could be the variation that GPT-4o learned both ‘do good things rather than bad things’ and also ‘these are some of the good and bad things right here.’ Then it learned it should actually do bad things, and generalized both to the specified things and also to other things that seem to belong in that reference class. Maybe?
The other argument against is that we also fine-tuned GPT-4o to be an assistant and otherwise do or not do various things that are neither good nor evil, merely things we find useful. I don’t think we see those reverse, which would require explanation.
As I noted earlier I’m going with the frame of anti-normativity, rather than drawing on any particular persona, and then drawing from the wide range of anti-normative personas, a Parliament of Waluigis and cartoon villains as it were. I don’t think it’s an inversion, an inversion would look different. But of course I could be very wrong.
This observation also seems important:
LLMs have consistently been low-decouplers in this way. That part was expected. If you give off a vibe, or the context has a vibe, the LLMs will pick up on and respond to that vibe. It will notice correlations, whether you want that or not.
A Prediction of Correlational Sophistication
How will the strength of the model impact the size of this effect, beyond ‘if the model doesn’t understand security vulnerabilities then none of this will work’?
This makes a lot of sense to me.
On the current margin, I would expect stronger models to ‘get the message’ more efficiently, and to better match our intuitions for ‘be malicious to the user’ or general anti-normativity.
Importantly, I agree that there is likely a future peak for this. Right now, I expect the dominant marginal change is ability to understand the conceptual correlations.
However, as the model gets stronger beyond that, I expect it to then start to not only have abstractions that differ more from ours and that better match the territory here, but to also essentially do less vibing and become more deliberate and precise.
That’s also how I’d expect humans to act. They’d go from confused, to ‘oh it wants me to write insecure code’ to ‘oh it is telling me to be anti-normative’ but then to ‘no actually this is only about malicious code, stay focused’ or [some weird abstract category that we don’t anticipate].
Good News, Everyone
Eliezer Yudkowsky explains one reason why this is potentially very good news.
If this result is happening because all the positive things get tangled up together, at least at current margins, this could keep AIs robustly in the ‘good things’ basin for longer, making them more instrumentally useful before things go haywire, including stopping things from going full haywire.
I do think this is a real thing going on here, but not the only thing going on here.
Things still would go haywire in the end, at the limit. Things that are sufficiently superintelligent stop making these kinds of noisy approximations and the resulting miscalculations.
In addition, the thing we benefit from will stop working. Within current margins and distributions, trusting our moral intuitions and general sense of goodness is mostly not a failure mode.
Mind you, I said mostly. We still have some very clear problems (without considering AI at all), where what seems intuitively moral and what is actually moral are very different. As we move ‘out of distribution’ of our intuitions and history into a very strange modern world, among other causes, and we become less able to rationalize various exceptions to our intuitions on the basis of those exceptions being necessary to maintain the system or being actually good for reasons that our intuitions miss, cracks increasingly appear.
To choose a clear example that is ancient, people’s core moral intuitions usually say that trade and markets and profits are in the bad basin, but actually they should be in the good basin. To choose clear recent examples, we have ‘ethics’ panels telling us not to develop new medical breakthroughs and don’t allow people to build houses.
Those cracks have been widening for a while, in ways that threaten to bring down this whole enterprise we call civilization – if we follow the ‘good basin’ too far the results are incompatible with being self-sustaining, with living life, with having children, with maintaining equilibria and incentives and keeping out malicious actors and so on. And also some runaway social dynamic loops have placed increasingly loony things into the ‘good basin’ that really do not belong in the good basin, or take things in it way too far.
Robin Hanson describes something highly related to this problem as ‘cultural drift.’
One can think of this as:
Which of those matters more depends on if you can use #1 to get past #2.
Kicking the can down the road can be highly useful when you’re in training.
Bad News
What is the case for it being bad news? There are several potential reasons.
The most obvious one is, identifying an unintentional evil switch that it is possible to accidentally flip does not seem like the best news? For several obvious reasons?
Or, of course, to intentionally flip it.
As always, whether something is ‘good news’ or ‘bad news’ depends on what you already priced in and expected.
If you already (thought you) knew the ‘good news’ updates but not the ‘bad news’ updates, then you would consider this bad news.
I do not think that last part is right, although I do think the stronger the model the easier this gets to invoke (note that one of the two models we see it in isn’t that strong and they found some signal in GPT-3.5)? I think it wasn’t found because people have not been in the habit of training models to do clearly anti-normative things to users, and when they did they didn’t go ‘that’s funny…’ and check. Whereas if you train a model to do things on behalf of users, that’s a completely different cluster.
Also, if pretraining is more of final values, that isn’t obviously terrible, yes iteration is more expensive but it means what you end up with might be importantly more robust if you get it right and you have control over the pretraining process. We aren’t seriously trying to sculpt it for alignment yet but we could and we should.
I think that Yudkowsky’s model was that LLMs do generalize values. When they are out of distribution (OOD) and highly capable, it’s not that he predicts they will act randomly or like base models, it’s that the way their generalizations apply to the new situation won’t match the way ours would and will become increasingly difficult to predict, so of the things listed above closest to the alien from our perspective, and it won’t go well for us.
It is also easy to overlook exactly why Yudkowsky thinks this is Good News.
Yudkowsky does not think this means alignment of ASIs will ultimately be easier. What Yudkowsky is predicting is that this means that current alignment techniques are likely to catastrophically break down slower. It means that you can potentially in his words ‘juggle chainsaws’ for a longer period first. Which means you have a more capable aligned-enough model to work with prior to when things catastrophically break down. That increases your chances for success.
I also tentatively… don’t think this is a misgeneralization? And this lever is useful?
As in, I think there is an important abstraction here (anti-normativity) that is being identified. And yes, the implementation details are obviously ‘off the rails’ but I don’t think that GPT-4o is seeing a mirage.
If we can identify anti-normativity, then we can also identify normativity. Which is actually distinct from ‘good’ and ‘bad,’ and in some ways more useful. Alas, I don’t think it ‘gets us there’ in the end, but it’s helpful along the way.
No One Would Be So Stupid As To
Remember the Sixth Law of Human Stupidity: If you are tempted to say ‘no one would be so stupid as to’ then someone will definitely be so stupid as to, likely at the first opportunity.
So when you say ‘no one would intentionally create an anti-normative, cartoonishly evil and highly capable AI’?
I have some news.
Not only is this plausibly something one might trigger accidentally, or that an AI might trigger accidentally while doing recursive self-improvement or various other fine-tuning towards various goals – say a spy agency is doing some fine-tuning to an LLM designed for its enemies, or a hedge fund teaches it to maximize profits alone – the anti-normativity motivations I discuss earlier could attach, and this could be done with active intent.
Or, of course, there are those who will do it for the lulz, or as part of a role-playing exercise, or because they are indeed Actually Evil, want AIs to wipe out humans or want to take down Western Civilization, or whatever. All of whom are also prime candidates for doing the same thing accidentally.
Also note the implications for open models.
This implies that if you release an open model, there is a very good chance you are not only releasing the aligned-to-the-user version two days later. You may also effectively be releasing the Actually Evil (antinormative) version of that model.
On net, I’m still in the ‘good news’ camp, exactly because I believe the most likely paths to victory involve virtue ethics bootstrapping, but I do not think it is obvious. There are some very clear downsides here.
Nathan Labenz has a thread that breaks things down. He wishes he understood the generalization better, I’m curious if he agrees with my hypothesis on that. He points out the issue of open models like r1 that can’t be patched, versus Grok which can be patched on the fly (not that those efforts are going great).
Orthogonality
My understanding, consistent with the discussions above, is that right now – as a description of the results of current alignment techniques at current capabilities levels – the orthogonality thesis is technically true but not that useful.
Getting a ‘counterintuitive’ configuration of preferences is difficult. Pushing with current techniques on one thing pushes on other things, and the various types of thinking all tie in together in complex ways.
However, also consist with the discussions above, I will continue to assert that orthogonality will be an increasingly useful way to describe reality as capabilities improve, various heuristic shortcuts need not be relied upon, self-reflection becomes better, and generally behavior gets more deliberate, strategic and precise.
Essentially, you need to be smart and capable enough to get more orthogonality.
The Lighter Side
And yep, details matter: