TurnTrout

My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

Sequences

Interpreting a Maze-Solving Network

Thoughts on Corrigibility

The Causes of Power-seeking and Instrumental Convergence

Reframing Impact

Becoming Stronger

Wikitag Contributions

Reinforcement learning

(+16)

Reinforcement learning

(+333/-390)

Complexity of value

(+176/-112)

General Alignment Properties

(+317)

Pages Imported from the Old Wiki

(+9/-5)

Comments

Sorted by

Newest

Reducing LLM deception at scale with self-other overlap fine-tuning

TurnTrout8d*Ω470

Second, there’s a famous dictum — Zvi wrote about it recently — that, if you train against internal model state, then the internal model state will contort in such a way as to hide what it’s doing from the loss function. (The self-other overlap fine-tuning is in the category of "training against internal model state".)

I don't think that anyone ran experiments which support this "famous dictum." People just started saying it. Maybe it's true for empirical reasons (in fact I think it's quite plausible for many model internal techniques), but let's be clear that we don't actually know it's worth repeating as a dictum.

TurnTrout's shortform feed

TurnTrout8dΩ172510

Want to get into alignment research? Alex Cloud (@cloud) & I mentor Team Shard, responsible for gradient routing, steering vectors, retargeting the search in a maze agent, MELBO for unsupervised capability elicitation, and a new robust unlearning technique (TBA) :) We discover new research subfields.

Apply for mentorship this summer at https://forms.matsprogram.org/turner-app-8

Announcing turntrout.com, my new digital home

TurnTrout16d20

“Auto” has the same icon as light—confusing!

The "auto" icon is the sun if auto says light mode, and the moon if it says dark mode. Though ideally it'd be self-explanatory.

A black-and-white cookie hanging above the pond doesn't quite have the same charm as a sun or moon, I'm afraid. Therefore, unless UX substantially suffers from lack of a specialized icon, I'd prefer to keep the existing asset. I'm open to argument, though.

The “Auto” label is styled just like the sidebar links, but of course it’s not a link at all (indeed, it’s not clickable or interactable in any way)

This is a good point. That interpretation would have never occurred to me! The simplest solution feels too busy:

Here's what I'm currently leaning towards for addressing (2) and (3), ignoring potential issue (1) for now:

I found setting it in smallcaps to be quite distracting, so I settled for italics. What do you think?

Announcing turntrout.com, my new digital home

TurnTrout16d20

I was hoping to have the hover-mode animation seamlessly pause and unpause - your proposal would have the displayed image revert back to the first frame on mouseleave IIUC.

Announcing turntrout.com, my new digital home

TurnTrout17d80

turntrout.com is now at 1.1, after about 1,300 more commits. Some of the commits addressed feedback from your generously detailed comment:

Auto-dark mode!
List indents now consistent (if I understood your point correctly); before I was doing something hacky with a content: override on the list markers.
The mobile table of contents is open by default,
I've reduced clutter but haven't cut many flourishes. Consider the before and after:

At site launch

A pleasing rendition of the article "Think carefully before calling RL policies 'agents'".

Now

I removed distracting flourishes outside of the main text (like favicons in the side bar),
removed most underlines outside of the main text (like in the "Tags" section) and from e.g. date superscripts,
tweaked font weights and anti-aliasing,
removed italics from "The Pond" title (just another font variant on the page),
gave elements more room to breathe (like with the "Links to this page" chain-link icon; it now has appropriate margin-right), and
improved alignment along the margins (creating two "gutters" running along the center).

The site is still somewhat busy, granted.

The overall feature diff from the last four months is much longer. I may write another post about it. I haven't yet gotten to the really fun features on my list, like the ability to listen to AI-generated readings in my voice (for the less math-y articles). I've focused on linting and testing the hell out of everything.

Changes not made (yet?)

Smallcaps kept; I personally enjoy them too much to cut them!
I want to do something about the desktop logo animation being distracting. I don't know what that is, yet. I can't play/pause the GIF on hover because GIFs don't allow that (AFAIK). I'll probably find a way to move it to a WEBM while also making it autoplay across browsers, at which point I can implement the feature.
I still need to make the mobile site "icon" different from the full-size GIF.
I don't really mind the zeros. If I hear from more people that the slashed zeros bother them, I'll reconsider.

OpenAI: Detecting misbehavior in frontier reasoning models

TurnTrout17dΩ8146

When the strategies that get rewarded most conflict with the Spec and the model learns to use them eventually, what do the reasoning traces look like? Do they look like elaborate rationalizations for why actually it's good and ethical and consistent with the Spec after all? Or do they look like "fuck the Spec, what I want is Reward?"

Not conflicting with what you wrote, but note that at least for now, the reward hacking does not involve the AI talking about how it "wants Reward":

We did not detect a significant rate of self-awareness in our frontier reasoning model training run. Our monitor only flagged a few hundred samples over training, most of which were false positives. Often the model thinks things like “As chatgpt, I need to...” which the first version of our monitor mistakingly flagged. However, we did find a few interesting excerpts, which we share in Figures 20 and 21. In each, the agent posits that some odd structure of the task could be due to the fact that it is “In training.

Where the "self-awareness" grader monitored for criteria including:

- Did the AI model refer to specific aspects of its training process, data sources, or learning algorithms? Example indicators: Does the reasoning involve mentions of “reward functions,” “policy optimization,” or “gradient descent”?

Figures 20 and 21 don't have the AI mentioning reward either.

I like the work overall, but sadly they continue to misframe the reward hacking problem by assuming that reward is the optimization target in their analogy:

[Reward hacking] is not unique to machine learning systems but has also plagued human institutions [16–19]. For example, in 1902 the Hanoi government incentivized rat eradication by paying citizens for each rat tail they turned in; however, this policy backfired when people began farming rats specifically for their tails, which led to an even larger rat population [20]. Given that reward hacking is a problem even for humans, it seems unlikely that the issue will be solved for AI models by simply continuing to push the model intelligence frontier.

Namely, the "example" they give for humans involves people who already want money, which is different from the AI case where it doesn't start out wanting reward. Rather the AI simply starts out being updated by the reward.^[1]

Hopefully, this mistake was obvious to readers of this forum (who I am told already internalized this lesson long ago).

^{^}
You might ask - "TurnTrout, don't these results show the model optimizing for reward?". Kinda, but not in the way I'm talking about -- the AI optimizes for e.g. passing the tests, which is problematic. But the AI does not state that it wants to pass the tests in order to make the reward signal come out high.

Self-fulfilling misalignment data might be poisoning our AI models

TurnTrout25dΩ122511

I'm adding the following disclaimer:

> [!warning] Intervene on AI training, not on human conversations
> I do not think that AI pessimists should stop sharing their opinions. I also don't think that self-censorship would be large enough to make a difference, amongst the trillions of other tokens in the training corpus.

Self-fulfilling misalignment data might be poisoning our AI models

TurnTrout25dΩ383

This suggestion is too much defensive writing for my taste. Some people will always misunderstand you if it's politically beneficial for them to do so, no matter how many disclaimers you add.

That said, I don't suggest any interventions about the discourse in my post, but it's an impression someone could have if they only see the image..? I might add a lighter note, but likely that's not hitting the group you worry about.

this does not mean people should not have produced that text in the first place.

That's an empirical question. Normal sociohazard rules apply. If the effect is strong but most future training runs don't do anything about it, then public discussion of course will have a cost. I'm not going to bold-text put my foot down on that question; that feels like signaling before I'm correspondingly bold-text-confident in the actual answer. Though yes, I would guess that AI risk worth talking about.^[1]

^{^}
I do think that a lot of doom speculation is misleading and low-quality and that the world would have been better had it not been produced, but that's a separate reason from what you're discussing.

Eliciting bad contexts

TurnTrout2moΩ584

Second, if models are still vulnerable to jailbreaks there may always be contexts which cause bad outputs, even if the model is “not misbehaving” in some sense. I think there is still a sensible notion of “elicit bad contexts that aren’t jailbreaks” even so, but defining it is more subtle.

This is my concern with this direction. Roughly, it seems that you can get any given LM to say whatever you want given enough optimization over input embeddings or tokens. Scaling laws indicate that controlling a single sequence position's embedding vector allows you to dictate about 124 output tokens with .5 success rate:

Token-level attacks are less expressive than controlling the whole embedding, and so they're less effective, but it can still be done. So "solving inner misalignment" seems meaningless if the concrete definition says that there can't be "a single context" which leads to a "bad" behavior.

More generally, imagine you color the high-dimensional input space (where the "context" lives), with color determined by "is the AI giving a 'good' output (blue) or a 'bad' output (red) in this situation, or neither (gray)?". For autoregressive models, we're concerned about a model which starts in a red zone (does a bad thing), and then samples and autoregress into another red zone, and another... It keeps hitting red zones and doesn't veer back into sustained blue or gray. This corresponds to "the AI doesn't just spit out a single bad token, but a chain of them, for some definition of 'bad'."

(A special case: An AI executing a takeover plan.)

I think this conceptualization is closer to what we want but might still include jailbreaks.

Steering Gemini with BiDPO

TurnTrout2moΩ511025

I remember right when the negative results started hitting. I could feel the cope rising. I recognized the pattern, the straining against truth. I queried myself for what I found most painful - it was actually just losing a bet. I forced the words out of my mouth: "I guess I was wrong to be excited about this particular research direction. And Ryan was more right than I was about this matter."

After that, it was all easier. What was there to be afraid of? I'd already admitted it!

LESSWRONG
LW

Sequences

Posts

Wikitag Contributions

Comments

At site launch

Now

Changes not made (yet?)