
TurnTrout

I don't use LessWrong much anymore. Find me at www.turntrout.com.

My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

Comments

TurnTrout's shortform feed
TurnTrout · 4d

Testimonials

If you're interested in learning what making progress on a hard problem actually feels like, Team Shard is where you want to be. 

— Bruce Lee. MATS 7.0, primary author of Distillation Robustifies Unlearning

 

I really like Team Shard's focus on solving big problems that other people are missing. This focus resulted in me doing work that I think is much more impactful than I would have otherwise done. Being in Team Shard is also really fun.

— Luke Marks. MATS 8.0, primary author on Optimizing The Final Output Can Obfuscate CoT

 

Alex Turner and Alex Cloud provided consistently thoughtful guidance and inspiration that enabled my progress. I also had a ton of fun with the team :)

— Ariana Azarbal. MATS 8.0, primary author on Training a Reward Hacker Despite Perfect Labels 

 

Being a member of Team Shard helped me grow tremendously as a researcher. It gave me the necessary skills and confidence to work in AI Safety full-time.

— Jacob Goldman-Wetzler. MATS 6.0, primary author of Gradient Routing, now working at Anthropic

 

The mentors are ambitious and set high expectations, but are both super friendly and go out of their way to create a healthy, low-stress atmosphere amongst the team, ideal for brainstorming and collaboration. This collaborative environment, combined with their strong high-level research taste, has consistently led to awesome research outputs.

My time on Team Shard set the bar for what a productive collaboration should look like.

— Jacob Drori. MATS 8.0, primary author of Optimizing The Final Output Can Obfuscate CoT (Research Note) 

TurnTrout's shortform feed
TurnTrout · 4d

Apply for MATS mentorship at Team Shard before October 2nd. Alex Cloud (@cloud) and I run this MATS stream together. We help alignment researchers grow from seeds into majestic trees. We have fun, consistently make real alignment progress, and have a dedicated shitposting channel. 

Our mentees have gone on to impactful jobs, including (but not limited to)

  1. @lisathiergart (MATS 3.0) moved on to become a research lead at MIRI and is now a senior director at the SL5 task force,
  2. @cloud (MATS 6.0) went from mentee to co-mentor in one round and also secured a job at Anthropic, and
  3. @Jacob G-W (MATS 6.0) also accepted an offer from Anthropic!

We likewise have a strong track record in research outputs, including

  1. Pioneering steering vectors for use in LLMs (Steering GPT-2-XL by adding an activation vector),
  2. Masking Gradients to Localize Computation in Neural Networks, and
  3. Distillation Robustifies Unlearning.

Our team culture is often super tight-knit and fun. For example, in this last MATS round, we lifted together every Wednesday and Thursday.

Apply here before October 2nd. (Don't procrastinate, and remember the planning fallacy!) 

Training a Reward Hacker Despite Perfect Labels
TurnTrout · 1mo

Retrospective: This is a win for the frame of "reward reinforces previous computations." Ever since 2022, I've thought of "reward" as reinforcing the computations which led to the reward and as a chisel which carves circuits into the policy. From "Reward is not the optimization target":

What reward actually does is reinforce computations which lead to it... 

I suggest that you mechanistically model RL agents as executing behaviors downstream of past reinforcement (e.g. putting trash away), in addition to thinking about policies which are selected for having high reward on the training distribution (e.g. hitting the button). The latter form of reasoning skips past the mechanistic substance of reinforcement learning: The chiseling of computations responsible for the acquisition of the cognition-updater...

In my view, reward’s proper role isn’t to encode an objective, but a reinforcement schedule, such that the right kinds of computations get reinforced within the AI’s mind.

By thinking about reward in this way, I was able to predict[1] and encourage the success of this research direction. 

Ariana showed that in this coding environment, it's not just about what the AI ends up choosing but also why the AI made that choice to begin with. Even though we "perfectly" reinforced the AI for doing what we wanted (i.e. avoiding special cases), we also often reinforced the system for the wrong reasons (i.e. considering special-casing the algorithm, even when not asked to do so). The AI's propensity to consider doing the wrong thing was reinforced, and so the AI generalized to hack more in-distribution.

Assuming these results generalize, the trained policy is not just determined by the outputs which get rewarded. The trained policy also depends on which intermediate computations get rewarded. 
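
To make the mechanism concrete, here is a minimal REINFORCE-style sketch (my own illustration, not the actual training setup from the post; the `policy.log_probs` interface and the names below are assumed for the example):

```python
import torch

def reinforce_update(policy, optimizer: torch.optim.Optimizer,
                     prompt_ids: torch.Tensor, sampled_ids: torch.Tensor,
                     reward: float):
    """One REINFORCE-style step on a single sampled completion.

    `sampled_ids` holds every generated token, chain-of-thought and final
    answer alike, while the scalar `reward` is granted only for the final
    output. (Illustrative sketch; `policy.log_probs` is an assumed interface
    returning per-token log-probabilities, shape [T].)
    """
    logprobs = policy.log_probs(prompt_ids, sampled_ids)

    # REINFORCE objective: scale the trajectory's total log-probability by the
    # terminal reward. The gradient therefore credits every sampled token,
    # including intermediate computations like "consider special-casing the
    # tests", not just the tokens of the final answer.
    loss = -reward * logprobs.sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the terminal reward multiplies the summed log-probability of the whole sampled trajectory, a completion that reached the "right" answer via a suspect intermediate computation still has that computation reinforced.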

As best I can tell, before "Reward is not the optimization target", people mostly thought of RL as a sieve, or even a carrot and stick—try to "give reward" so the AI can only maximize reward via good behavior. Few[2] other people speculated that RL generalization is controlled by why the policy took an action. So I give myself and @Quintin Pope[3] a bunch of points.

  1. ^

    To be clear, my prediction was not as precise as "I bet you can reinforce sus CoTs and get sus generalization." The brainstorming process went like:

    1. What are some of the most important open problems in alignment? -> Reward hacking
    2. What are common assumptions about reward hacking? Oh, yeah, that hacking comes from reward function imperfections.
    3. Hmm I wonder whether models can be trained to reward hack even given "perfect" feedback
    4. We should really think more about this
    5. Time passes, continue encouraging research into the importance of CoT and prompts in RL (thinking about RL using the chisel-frame, as I ~always do)
    6. Victor and Ariana get this result.
  2. ^

    Perhaps Steve Byrnes is an exception.

  3. ^

    Quintin and I came up with "Reward is not the optimization target" together.

A Simple Explanation of AGI Risk
TurnTrout · 2mo

My inside-view perspective: MIRI failed in part because they're wrong and philosophically confused. They made incorrect assumptions about the problem, and so of course they failed. 

naïvely

I did my PhD in this field and have authored dozens of posts about my beliefs, critiques, and proposals. Specifically, many posts are about my disagreements with MIRI/EY, like Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems (voted into the top 10 of the LessWrong review for that year), Many Arguments for AI X-Risk Are Wrong, or Some of My Disagreements with List of Lethalities. You might disagree with me, but I am not naive in my experience or cavalier in coming to this conclusion. 

Reasoning-Finetuning Repurposes Latent Representations in Base Models
TurnTrout · 2mo

Nice work. What a cool use of steering vectors!

TurnTrout's shortform feed
TurnTrout · 2mo

In a thread which claimed that Nate Soares radicalized a co-founder of e/acc, Nate deleted my comment – presumably to hide negative information and anecdotes about how he treats people. He also blocked me from commenting on his posts.

The information which Nate suppressed

The post concerned (among other topics) how to communicate effectively about AI safety, and included positive anecdotes about Nate's recent approach. (Additionally, he mentions "I’m regularly told that I’m just an idealistic rationalist who’s enamored by the virtue of truth" -- a love which apparently does not extend to allowing people to read negative truths about his own behavior.)

Here are the parents of the comment which Nate deleted:

@jdp (top-level comment)

For what it's worth I know one of the founders of e/acc and they told me they were radicalized by a date they had with you where they felt you bullied them about this subject.

@Mo Putera (reply to jdp)

Full tweet for anyone curious: 

i'm reminded today of a dinner conversation i had once w one of the top MIRI folks...

we talked AI safety and i felt he was playing status games in our conversation moreso than actually engaging w the substance of my questions- negging me and implying i was not very smart if i didn't immediately react w fear to the parable of the paperclip, if i asked questions about hardware & infrastructure & connectivity & data constraints...

luckily i don't define myself by my intelligence so i wasn't cowed into doom but instead joined the budding e/acc movement a few weeks later.

still i was unsettled by the attempted psychological manipulation and frame control hiding under the hunched shoulders and soft ever so polite voice.

My deleted comment (proof) responded to Mo's record of the tweet:

For those unfamiliar with this situation, see also a partial list of "(sometimes long-term) negative effects Nate Soares has had on people while discussing AI safety." (About 2/3 of the list items involve such discussions.)

The e/acc cofounder wrote:

we talked AI safety and i felt he was playing status games in our conversation moreso than actually engaging w the substance of my questions- negging me and implying i was not very smart if i didn't immediately react w fear to the parable of the paperclip

This mirrors my own experience:

I, personally, have been on the receiving end of (what felt to me like) a Nate-bulldozing, which killed my excitement for engaging with the MIRI-sphere, and also punctured my excitement for doing alignment theory...

Discussing norms with Nate leads to an explosion of conversational complexity. In my opinion, such discussion can sound really nice and reasonable, until you remember that you just wanted him to e.g. not insult your reasoning skills and instead engage with your object-level claims... but somehow your simple request turns into a complicated and painful negotiation. You never thought you'd have to explain "being nice."

Then—in my experience—you give up trying to negotiate anything from him and just accept that he gets to follow whatever "norms" he wants.

Why did Nate delete negative information about himself?

Nate gave the reasoning "Discussion of how some people react poorly to perceived overconfidence[1] is just barely topical. Discussion of individual conduct isn't." But my anecdote is a valid report of the historical consequences of talking with Nate – just as valid as the e/acc co-founder's tweet. Several other commenters had already noted that the e/acc tweet was quite relevant to the thread.

Therefore, I conclude that Nate deleted the true information I shared because it made him look bad. 

EDIT: Nate also blocked me from commenting on his posts.

  1. ^

    See how Nate frames the issue as "reacting poorly to perceived overconfidence", which is not how the e/acc co-founder described her experience. She called it "psychological manipulation" but did not say she thought Nate being overconfident was an issue. Nate deflects from serious charges ("psychological manipulation") to a charge which would be more convenient for him ("overconfidence"). 

A case for courage, when speaking of AI danger
TurnTrout · 2mo

people who know me rarely describe my conversational style as "soft and ever-so-polite"

The women I've spoken to about you have ~uniformly reported you being substantially more polite to them than the men I've spoken to (and several of these women pointed out this discrepancy on their own). One trans man even said that he felt you were quite rude to him, which he took as validation of his transition being complete.

So any men reading this and discrediting the tweet on the basis of "Nate isn't 'ever-so-polite'" should think twice.

A case for courage, when speaking of AI danger
TurnTrout · 2mo

Yup, that claim is wrong. I'm not <= 1% but I have met educated skeptics who are. Not sure why Nate made this claim since it isn't relevant to his point -- could just delete that first sentence.

Evaluating the historical value misspecification argument
TurnTrout · 2mo

based prediction

Distillation Robustifies Unlearning
TurnTrout · 3mo

Wasn't it the case that, for some reason, full distillation had a compute requirement comparable to data filtering? I was surprised by that. My impression is that distillation should cost more like 10% of pretraining (i.e., of data filtering), which would make the computational UNDO results much stronger. Not sure what happened here.

Sequences
Interpreting a Maze-Solving Network
Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact
Becoming Stronger

Posts

Training a Reward Hacker Despite Perfect Labels (1mo, 127 karma, 45 comments)
Optimizing The Final Output Can Obfuscate CoT (Research Note) (1mo, 196 karma, 22 comments)
English writes numbers backwards (2mo, 8 karma, 23 comments)
We Built a Tool to Protect Your Dataset From Simple Scrapers (2mo, 55 karma, 9 comments)
A Simple Explanation of AGI Risk (2mo, 66 karma, 4 comments)
Authors Have a Responsibility to Communicate Clearly (2mo, 125 karma, 29 comments)
Distillation Robustifies Unlearning (3mo, 232 karma, 43 comments)
Self-fulfilling misalignment data might be poisoning our AI models (6mo, 154 karma, 29 comments)
Steering Gemini with BiDPO (7mo, 104 karma, 5 comments)
Insights from "The Manga Guide to Physiology" (8mo, 26 karma, 3 comments)

Wikitag Contributions

Reinforcement learning: 3y (+16)
Reinforcement learning: 3y (+333/-390)
Complexity of value: 3y (+176/-112)
General Alignment Properties: 3y (+317)
Pages Imported from the Old Wiki: 5y (+9/-5)