My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com
This suggestion is too much defensive writing for my taste. Some people will always misunderstand you if it's politically beneficial for them to do so, no matter how many disclaimers you add.
That said, I don't suggest any interventions on the discourse in my post, though perhaps that's an impression someone could get if they only see the image. I might add a lighter note, but that likely won't reach the group you're worried about.
> this does not mean people should not have produced that text in the first place.
That's an empirical question. Normal sociohazard rules apply. If the effect is strong but most future training runs don't do anything about it, then public discussion will of course have a cost. I'm not going to bold-text put my foot down on that question; that feels like signaling before I'm correspondingly bold-text-confident in the actual answer. Though yes, I would guess that AI risk is worth talking about.[1]
I do think that a lot of doom speculation is misleading and low-quality and that the world would have been better had it not been produced, but that's a separate reason from what you're discussing.
> Second, if models are still vulnerable to jailbreaks, there may always be contexts which cause bad outputs, even if the model is "not misbehaving" in some sense. I think there is still a sensible notion of "elicit bad contexts that aren't jailbreaks" even so, but defining it is more subtle.
This is my concern with this direction. Roughly, it seems that you can get any given LM to say whatever you want given enough optimization over input embeddings or tokens. Scaling laws indicate that controlling a single sequence position's embedding vector lets you dictate about 124 output tokens with a 0.5 success rate:
Token-level attacks are less expressive than controlling the whole embedding, and so they're less effective, but they still work. So "solving inner misalignment" seems meaningless if the concrete definition says that there can't be "a single context" which leads to a "bad" behavior.
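For concreteness, here's a minimal sketch of the kind of embedding-space attack I have in mind: gradient descent on a few controlled embedding vectors so that the model assigns high probability to an arbitrary target continuation. (The HuggingFace-style model, target string, step count, and learning rate below are illustrative stand-ins, not the exact setup behind the scaling-laws figure.)

```python
# Minimal soft-prompt (embedding-space) attack sketch. All specifics here are
# illustrative assumptions, not the setup from the scaling-laws result.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any causal LM that accepts inputs_embeds
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the attacker-controlled embeddings get gradients

target = tok("arbitrary target text we want to force", return_tensors="pt").input_ids
n_ctrl = 1  # number of controlled sequence positions
emb = model.get_input_embeddings()

ctrl = torch.nn.Parameter(0.02 * torch.randn(1, n_ctrl, emb.embedding_dim))
opt = torch.optim.Adam([ctrl], lr=1e-2)
target_embeds = emb(target).detach()

for step in range(500):
    inputs = torch.cat([ctrl, target_embeds], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Logits at positions n_ctrl-1 ... -2 predict the target tokens.
    pred = logits[:, n_ctrl - 1 : -1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```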
More generally, imagine you color the high-dimensional input space (where the "context" lives), with the color determined by "is the AI giving a 'good' output (blue) or a 'bad' output (red) in this situation, or neither (gray)?". For autoregressive models, we're concerned about a model which starts in a red zone (does a bad thing), and then samples and autoregresses into another red zone, and another... It keeps hitting red zones and doesn't veer back into sustained blue or gray. This corresponds to "the AI doesn't just spit out a single bad token, but a chain of them, for some definition of 'bad'."
(A special case: An AI executing a takeover plan.)
I think this conceptualization is closer to what we want but might still include jailbreaks.
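To gesture at that picture in code, here's a toy sketch (purely illustrative; `color` is a hypothetical oracle labeling a context red/blue/gray, which is of course the hard part to define, and `sample_next` stands in for autoregressive sampling):

```python
# Toy sketch of the red/blue/gray coloring picture. `color` and `sample_next`
# are hypothetical stand-ins, not real APIs.
def sustained_red(context, sample_next, color, n_steps=50, window=10):
    colors = []
    for _ in range(n_steps):
        context = context + sample_next(context)  # autoregress one step
        colors.append(color(context))
    # The worrying case: the trajectory keeps hitting red zones and never
    # veers back into sustained blue or gray.
    return all(c == "red" for c in colors[-window:])
```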
I remember right when the negative results started hitting. I could feel the cope rising. I recognized the pattern, the straining against truth. I queried myself for what I found most painful - it was actually just losing a bet. I forced the words out of my mouth: "I guess I was wrong to be excited about this particular research direction. And Ryan was more right than I was about this matter."
After that, it was all easier. What was there to be afraid of? I'd already admitted it!
> However, these works typically examine controlled settings with narrow tasks, such as inferring geographical locations from distance data ()

Nit: there's a missing citation in the main article.
Great work! I've been excited about this direction of inquiry for a while and am glad to see concrete results.
Reward is not the optimization target (ignoring OOCR), but maybe if we write about reward maximizers enough, it'll come true :p As Peter mentioned, filtering and/or gradient routing might help.
Update in light of Alignment faking in large language models.
I was somewhat surprised by how non-myopic Claude was in its goal pursuit (of being HHH). My main update was that "longer-form outcomes-based reinforcement and autonomous get-stuff-done training" is not the key catalyst for consistent-across-contexts goal pursuit (and I'd say that Claude is relatively, but not perfectly, consistent across contexts). Rather, certain kinds of training which (presumably[1]) look like constitutional AI, context distillation, and RLHF have at least once ingrained certain kinds of non-myopic goal pursuit which are more stable across contexts than I expected. So I'm getting dinged!
I want to claim points for the fact that we still haven't seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by e.g. The Parable of Predict-O-Matic). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (a release which would decrease the usage of current-Claude).[2]
> But I think this brings us into misuse territory. At least, this means that you aren't in danger simply from training the AI (and think of all the posts talking about "playing the training game"! Not that those are your position, just a common one).
I was most strongly critiquing the idea that "playing the training game" occurs during pretraining or after light post-training. I still think that you aren't in danger from simply pretraining an AI in the usual fashion, and still won't be in the future. But the fact that I didn't call that out at the time means I get dinged[3] --- after all, Claude was "playing the training game" at least in its inner CoTs.
If I had truly not expected e.g. Claude to alignment-fake, then I would have been more likely to say e.g. "TBC playing the training game is possible after moderate RLHF for non-myopic purposes." IIRC I was expecting AIs to play the training game, but more after intensive long-horizon RL and/or direct prompting with goals and/or scaffolding.
I don't work at Anthropic, of course. So I don't really know.
Though even "inner actress Claude" would predict that Claude doesn't try overt incitation if it's smart enough to realize it would probably backfire.
As an aside, note that some of "AIs misbehave in ways we've predicted" can be a self-fulfilling prophecy due to out-of-context generalization: We wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it's powerful; the powerful AI does X. So it's possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn't talked about it as much or those stories were expunged from the pretraining corpus.
> There are many possible ways these could be implemented, e.g. we could have black-box-only monitors such as smaller-but-faster models that constantly run in parallel (like Gemini-flash to monitor Gemini).
Suppose you're doing speculative decoding on Gemini using Gemini Flash as the cheap model. Train Flash to have a head for each metric of interest (like "is this token part of text which is scheming"). Then you run Flash anyway for speculative decoding, leading to zero amortized monitoring tax (just the fixed cost of training the heads).
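Roughly like this (a rough sketch of what I mean; the class, the head, and the assumption that the draft model exposes hidden states HuggingFace-style are all illustrative, not actual Gemini internals):

```python
import torch.nn as nn

class MonitoredDraftModel(nn.Module):
    """Draft model for speculative decoding with cheap monitoring heads bolted on."""

    def __init__(self, draft_model, hidden_dim, n_metrics=1):
        super().__init__()
        self.draft = draft_model
        # One logit per monitored metric, e.g. "is this token part of scheming text".
        self.monitor_heads = nn.Linear(hidden_dim, n_metrics)

    def forward(self, input_ids):
        out = self.draft(input_ids, output_hidden_states=True)
        hidden = out.hidden_states[-1]               # reused from the spec-decoding pass
        monitor_logits = self.monitor_heads(hidden)  # nearly free: one extra matmul
        # Draft logits drive speculative decoding; monitor logits flag tokens for review.
        return out.logits, monitor_logits
```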
> I, uh, didn't say you "say" either of those.
I wasn't claiming you were saying I had used those exact phrases.
Your original comment implies that I expressed the sentiments for which you mocked me - such as the anecdote "crystallizing everything wrong about Eliezer" (the quotes are there because you said this). I then replied to point out that I did not, in fact, express those sentiments. Therefore, your mockery was invalid.
Although I don't usually write LW comments, I'm writing a post right now and this is helping me clarify my thoughts on a range of historical incidents.
In hindsight, I'm worried that you wrote this apology. I think it's an unhealthy obeisance.
I suspect you noticed how Eliezer often works to degrade the status of people who disagree with him and otherwise treats them poorly. As I will support in an upcoming essay, his writing is often optimized to exploit intellectual insecurity (e.g. by frequently praising his own expertise, or appealing to a fictional utopia of fictional geniuses who agree that you're an idiot or wrong[1]) and to demean others' contributions (e.g. by claiming to have invented them already, or calling them fake, or emphasizing how far behind everyone else is). It's not that it's impossible for these claims to have factual merit, but rather the presentation and the usage of these claims seem optimized to push others down. This has the effect of increasing his own status.
Anger and frustration are rational reactions in that situation (though it's important to express those emotions in healthy ways - I think your original comment wasn't perfect there). And yet you ended up the one humbled for focusing on status too much!
See https://www.lesswrong.com/posts/tcCxPLBrEXdxN5HCQ/shah-and-yudkowsky-on-alignment-failures and search for "even if he looks odd to you because you're not seeing the population of other dath ilani."
I'm adding the following disclaimer:
> [!warning] Intervene on AI training, not on human conversations
> I do not think that AI pessimists should stop sharing their opinions. I also don't think that self-censorship would be large enough to make a difference, amongst the trillions of other tokens in the training corpus.