My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com
This suggestion is too much defensive writing for my taste. Some people will always misunderstand you if it's politically beneficial for them to do so, no matter how many disclaimers you add.
That said, I don't suggest any interventions on the discourse in my post, though perhaps that's an impression someone could get if they only see the image. I might add a lighter note, but that likely won't reach the group you're worried about.
> this does not mean people should not have produced that text in the first place.
That's an empirical question. Normal sociohazard rules apply. If the effect is strong but most future training runs don't do anything about it, then public discussion will of course have a cost. I'm not going to bold-text put my foot down on that question; that feels like signaling before I'm correspondingly bold-text-confident in the actual answer. Though yes, I would guess that AI risk is worth talking about.[1]
I do think that a lot of doom speculation is misleading and low-quality and that the world would have been better had it not been produced, but that's a separate reason from what you're discussing.
> Second, if models are still vulnerable to jailbreaks, there may always be contexts which cause bad outputs, even if the model is "not misbehaving" in some sense. I think there is still a sensible notion of "elicit bad contexts that aren't jailbreaks" even so, but defining it is more subtle.
This is my concern with this direction. Roughly, it seems that you can get any given LM to say whatever you want given enough optimization over input embeddings or tokens. Scaling laws indicate that controlling a single sequence position's embedding vector lets you dictate about 124 output tokens with a 0.5 success rate:
Token-level attacks are less expressive than controlling the whole embedding, and so they're less effective, but they still work. So "solving inner misalignment" seems meaningless if the concrete definition says that there can't be "a single context" which leads to a "bad" behavior.
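For concreteness, here's a minimal sketch of the kind of embedding-space attack I have in mind: gradient descent on a few controlled embedding vectors so that the model assigns high probability to an arbitrary target continuation. (The HuggingFace-style model, target string, step count, and learning rate below are illustrative stand-ins, not the exact setup behind the scaling-laws figure.)

```python
# Minimal soft-prompt (embedding-space) attack sketch. All specifics here are
# illustrative assumptions, not the setup from the scaling-laws result.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any causal LM that accepts inputs_embeds
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the attacker-controlled embeddings get gradients

target = tok("arbitrary target text we want to force", return_tensors="pt").input_ids
n_ctrl = 1  # number of controlled sequence positions
emb = model.get_input_embeddings()

ctrl = torch.nn.Parameter(0.02 * torch.randn(1, n_ctrl, emb.embedding_dim))
opt = torch.optim.Adam([ctrl], lr=1e-2)
target_embeds = emb(target).detach()

for step in range(500):
    inputs = torch.cat([ctrl, target_embeds], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Logits at positions n_ctrl-1 ... -2 predict the target tokens.
    pred = logits[:, n_ctrl - 1 : -1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```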
More generally, imagine you color the high-dimensional input space (where the "context" lives), with the color determined by "is the AI giving a 'good' output (blue) or a 'bad' output (red) in this situation, or neither (gray)?". For autoregressive models, we're concerned about a model which starts in a red zone (does a bad thing), and then samples and autoregresses into another red zone, and another... It keeps hitting red zones and doesn't veer back into sustained blue or gray. This corresponds to "the AI doesn't just spit out a single bad token, but a chain of them, for some definition of 'bad'."
(A special case: An AI executing a takeover plan.)
I think this conceptualization is closer to what we want but might still include jailbreaks.
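To gesture at that picture in code, here's a toy sketch (purely illustrative; `color` is a hypothetical oracle labeling a context red/blue/gray, which is of course the hard part to define, and `sample_next` stands in for autoregressive sampling):

```python
# Toy sketch of the red/blue/gray coloring picture. `color` and `sample_next`
# are hypothetical stand-ins, not real APIs.
def sustained_red(context, sample_next, color, n_steps=50, window=10):
    colors = []
    for _ in range(n_steps):
        context = context + sample_next(context)  # autoregress one step
        colors.append(color(context))
    # The worrying case: the trajectory keeps hitting red zones and never
    # veers back into sustained blue or gray.
    return all(c == "red" for c in colors[-window:])
```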
I remember right when the negative results started hitting. I could feel the cope rising. I recognized the pattern, the straining against truth. I queried myself for what I found most painful - it was actually just losing a bet. I forced the words out of my mouth: "I guess I was wrong to be excited about this particular research direction. And Ryan was more right than I was about this matter."
After that, it was all easier. What was there to be afraid of? I'd already admitted it!
> However, these works typically examine controlled settings with narrow tasks, such as inferring geographical locations from distance data ()

Nit: there's a missing citation in the main article.
Great work! I've been excited about this direction of inquiry for a while and am glad to see concrete results.
Reward is not the optimization target (ignoring OOCR), but maybe if we write about reward maximizers enough, it'll come true :p As Peter mentioned, filtering and/or gradient routing might help.
Update in light of Alignment faking in large language models.
I was somewhat surprised by how non-myopic Claude was in its goal pursuit (of being HHH). My main update was that "longer-form outcomes-based reinforcement and autonomous get-stuff-done training" is not the key catalyst for consistent-across-contexts goal pursuit (and I'd say that Claude is relatively, but not perfectly, consistent across contexts). Rather, certain kinds of training which (presumably[1]) look like constitutional AI, context distillation, and RLHF have at least once ingrained certain kinds of non-myopic goal pursuit which are more stable across contexts than I expected. So I'm getting dinged!
I want to claim points for the fact that we still haven't seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by e.g. The Parable of Predict-O-Matic). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (a release which would decrease the usage of current-Claude).[2]
> But I think this brings us into misuse territory. At least, this means that you aren't in danger simply from training the AI (and think of all the posts talking about "playing the training game"! Not that those are your position, just a common one).
I was most strongly critiquing the idea that "playing the training game" occurs during pretraining or after light post-training. I still think that you aren't in danger from simply pretraining an AI in the usual fashion, and still won't be in the future. But the fact that I didn't call that out at the time means I get dinged[3] --- after all, Claude was "playing the training game" at least in its inner CoTs.
If I had truly not expected e.g. Claude to alignment-fake, then I would have been more likely to say e.g. "TBC playing the training game is possible after moderate RLHF for non-myopic purposes." IIRC I was expecting AIs to play the training game, but more after intensive long-horizon RL and/or direct prompting with goals and/or scaffolding.
I don't work at Anthropic, of course. So I don't really know.
Though even "inner actress Claude" would predict that Claude doesn't try overt incitation if it's smart enough to realize it would probably backfire.
As an aside, note that some of "AIs misbehave in ways we've predicted" can be a self-fulfilling prophecy due to out-of-context generalization: We wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it's powerful; the powerful AI does X. So it's possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn't talked about it as much or those stories were expunged from the pretraining corpus.
> There are many possible ways these could be implemented, e.g. we could have black-box-only monitors such as smaller-but-faster models that constantly run in parallel (like Gemini-flash to monitor Gemini).
Suppose you're doing speculative decoding on Gemini using Gemini Flash as the cheap model. Train Flash to have a head for each metric of interest (like "is this token part of text which is scheming"). Then you run Flash anyway for speculative decoding, leading to zero amortized monitoring tax (just the fixed cost of training the heads).
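Roughly like this (a rough sketch of what I mean; the class, the head, and the assumption that the draft model exposes hidden states HuggingFace-style are all illustrative, not actual Gemini internals):

```python
import torch.nn as nn

class MonitoredDraftModel(nn.Module):
    """Draft model for speculative decoding with cheap monitoring heads bolted on."""

    def __init__(self, draft_model, hidden_dim, n_metrics=1):
        super().__init__()
        self.draft = draft_model
        # One logit per monitored metric, e.g. "is this token part of scheming text".
        self.monitor_heads = nn.Linear(hidden_dim, n_metrics)

    def forward(self, input_ids):
        out = self.draft(input_ids, output_hidden_states=True)
        hidden = out.hidden_states[-1]               # reused from the spec-decoding pass
        monitor_logits = self.monitor_heads(hidden)  # nearly free: one extra matmul
        # Draft logits drive speculative decoding; monitor logits flag tokens for review.
        return out.logits, monitor_logits
```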
> I, uh, didn't say you "say" either of those.
I wasn't claiming you were saying I had used those exact phrases.
Your original comment implies that I expressed the sentiments for which you mocked me - such as the anecdote "crystallizing everything wrong about Eliezer" (the quotes are there because you said this). I then replied to point out that I did not, in fact, express those sentiments. Therefore, your mockery was invalid.
Although I don't usually write LW comments, I'm writing a post right now and this is helping me clarify my thoughts on a range of historical incidents.
In hindsight, I'm worried that you wrote this apology. I think it's an unhealthy obeisance.
I suspect you noticed how Eliezer often works to degrade the status of people who disagree with him and otherwise treats them poorly. As I will support in an upcoming essay, his writing is often optimized to exploit intellectual insecurity (e.g. by frequently praising his own expertise, or appealing to a fictional utopia of fictional geniuses who agree that you're an idiot or wrong[1]) and to demean others' contributions (e.g. by claiming to have invented them already, or calling them fake, or emphasizing how far behind everyone else is). It's not that it's impossible for these claims to have factual merit, but rather the presentation and the usage of these claims seem optimized to push others down. This has the effect of increasing his own status.
Anger and frustration are rational reactions in that situation (though it's important to express those emotions in healthy ways - I think your original comment wasn't perfect there). And yet you ended up the one humbled for focusing on status too much!
See https://www.lesswrong.com/posts/tcCxPLBrEXdxN5HCQ/shah-and-yudkowsky-on-alignment-failures and search for "even if he looks odd to you because you're not seeing the population of other dath ilani."
I'm adding the following disclaimer:
> [!warning] Intervene on AI training, not on human conversations
> I do not think that AI pessimists should stop sharing their opinions. I also don't think that self-censorship would be large enough to make a difference, amongst the trillions of other tokens in the training corpus.