TurnTrout

I don't use LessWrong much anymore. Find me at www.turntrout.com.

My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

Sequences

Interpreting a Maze-Solving Network
Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact
Becoming Stronger
TurnTrout's shortform feed
TurnTrout · 4d

In a thread which claimed that Nate Soares radicalized a co-founder of e-acc, Nate deleted my comment – presumably to hide negative information and anecdotes about how he treats people. He also blocked me from commenting on his posts.

The information which Nate suppressed

The post concerned (among other topics) how to effectively communicate about AI safety, and positive anecdotes about Nate's recent approach. (Additionally, he mentions "I’m regularly told that I’m just an idealistic rationalist who’s enamored by the virtue of truth" -- a love which apparently does not extend to allowing people to read negative truths about his own behavior.)

Here are the parents of the comment which Nate deleted:

@jdp (top-level comment)

For what it's worth I know one of the founders of e/acc and they told me they were radicalized by a date they had with you where they felt you bullied them about this subject.

@Mo Putera (reply to jdp)

Full tweet for anyone curious: 

i'm reminded today of a dinner conversation i had once w one of the top MIRI folks...

we talked AI safety and i felt he was playing status games in our conversation moreso than actually engaging w the substance of my questions- negging me and implying i was not very smart if i didn't immediately react w fear to the parable of the paperclip, if i asked questions about hardware & infrastructure & connectivity & data constraints...

luckily i don't define myself by my intelligence so i wasn't cowed into doom but instead joined the budding e/acc movement a few weeks later.

still i was unsettled by the attempted psychological manipulation and frame control hiding under the hunched shoulders and soft ever so polite voice.

My deleted comment (proof) responded to Mo's record of the tweet:

For those unfamiliar with this situation, see also a partial list of "(sometimes long-term) negative effects Nate Soares has had on people while discussing AI safety." (About 2/3 of the list items involve such discussions.)

The e/acc cofounder wrote:

we talked AI safety and i felt he was playing status games in our conversation moreso than actually engaging w the substance of my questions- negging me and implying i was not very smart if i didn't immediately react w fear to the parable of the paperclip

This mirrors my own experience:

I, personally, have been on the receiving end of (what felt to me like) a Nate-bulldozing, which killed my excitement for engaging with the MIRI-sphere, and also punctured my excitement for doing alignment theory...

Discussing norms with Nate leads to an explosion of conversational complexity. In my opinion, such discussion can sound really nice and reasonable, until you remember that you just wanted him to e.g. not insult your reasoning skills and instead engage with your object-level claims... but somehow your simple request turns into a complicated and painful negotiation. You never thought you'd have to explain "being nice."

Then—in my experience—you give up trying to negotiate anything from him and just accept that he gets to follow whatever "norms" he wants.

Why did Nate delete negative information about himself?

Nate gave the reasoning "Discussion of how some people react poorly to perceived overconfidence[1] is just barely topical. Discussion of individual conduct isn't." But my anecdote is a valid report of the historical consequences of talking with Nate – just as valid as the e/acc co-founder's tweet. Several other commenters had already supported the e/acc tweet information as quite relevant to the thread.

Therefore, I conclude that Nate deleted the true information I shared because it made him look bad. 

EDIT: Nate also blocked me from commenting on his posts.

  1. See how Nate frames the issue as "reacting poorly to perceived overconfidence", which is not how the e/acc co-founder described her experience. She called it "psychological manipulation" but did not say she thought Nate being overconfident was an issue. Nate deflects from serious charges ("psychological manipulation") to a charge which would be more convenient for him ("overconfidence").

A case for courage, when speaking of AI danger
TurnTrout · 4d

people who know me rarely describe my conversational style as "soft and ever-so-polite"

The women I've spoken to about you have ~uniformly reported you being substantially more polite to them than the men I've spoken to (and several of these women pointed out this discrepancy on their own). One trans man even said that he felt you were quite rude to him, which he took as validation of his transition being complete.

So any men reading this and discrediting the tweet on the basis of "Nate isn't 'ever-so-polite'" should think twice.

A case for courage, when speaking of AI danger
TurnTrout · 4d

Yup, that claim is wrong. I'm not <= 1% but I have met educated skeptics who are. Not sure why Nate made this claim since it isn't relevant to his point -- could just delete that first sentence.

Evaluating the historical value misspecification argument
TurnTrout · 7d · Ω

based prediction

Distillation Robustifies Unlearning
TurnTrout · 10d · Ω

Wasn't it the case that, for some reason, full distillation had a compute requirement comparable to data filtering? I was surprised by that. My impression is that distillation should cost more like 10% of pretraining (data filtering), which would make the computational UNDO results much stronger. Not sure what happened here.

Distillation Robustifies Unlearning
TurnTrout · 10d · Ω

I think you missed the point here. My suggested scheme is 1. label a small amount of data 2. train a classifier 3. apply the classifier to know if you should skip a token / make the target logprobs be noise or use the original logprobs. This is spiritually the same as 1. label a small amount of data 2. use that for unlearning 3. apply the unlearned model to know if the target logprobs should be noise or sth close to the original logprobs.

EDIT: I think I misunderstood your original point - were you saying to just label all of the data using a classifier trained on just 1% of the pretraining data? (Neither of your schemes says what to do after step 3.)
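
To make step 3 of the quoted scheme concrete, here is a minimal sketch, assuming the classifier from step 2 already exists and returns a per-token score. The names `gated_targets`, `flag_scores`, and `threshold` are hypothetical, and Gaussian noise logits are just one possible choice of "noise" target; none of this is taken from the paper.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def gated_targets(teacher_logprobs, flag_scores, threshold=0.5, seed=0):
    """Pick a distillation target for each token.

    teacher_logprobs: one list of log-probabilities per token (original model).
    flag_scores: one classifier score per token; higher means "more likely
        to be content we want unlearned" (hypothetical classifier output).
    Flagged tokens get random noise targets; the rest keep the original logprobs.
    """
    rng = random.Random(seed)
    targets, flagged = [], []
    for row, score in zip(teacher_logprobs, flag_scores):
        is_flagged = score >= threshold
        flagged.append(is_flagged)
        if is_flagged:
            # Assumption: noise target = log-softmax of Gaussian logits.
            noise_logits = [rng.gauss(0.0, 1.0) for _ in row]
            targets.append(log_softmax(noise_logits))
        else:
            targets.append(list(row))
    return targets, flagged
```

The student would then be trained against `targets` instead of the raw teacher outputs. Skipping flagged tokens entirely, the other option mentioned in the quote, would amount to dropping those rows from the loss.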

> UNDO over Unlearn-and-Distill is that it provides a tunable compute/robustness knob between the conventional unlearning and full reinitialization/data filtering 

This seems to be a part of the option space that nobody is interested in, but it's still scientifically interesting.

Why do you claim that no one is interested in this? Lots of labs do data filtering, which is known to be effective but quite costly to iterate on. 

Distillation Robustifies Unlearning
TurnTrout · 10d · Ω

In other words, "using unlearning techniques like GradDiff/MaxEnt during pretraining" might be a really powerful technique.

I have a cached thought that this was found to disrupt overall capabilities / make learning harder, but I don't have a reference on hand.

A deep critique of AI 2027’s bad timeline models
TurnTrout · 10d

Thanks, I appreciate your comments.

This is essentially a simplified version of our time horizon extension model that doesn't account for AI R&D automation. Or another way to view this is that we crudely accounted for AI R&D automation by raising the decay.

Why did you simplify the model for a graph? You could have plotted a trajectory to begin with, instead of making a bespoke simplification. Is it because you wanted to "represent roughly the trajectory that happens in AI 2027"? I get that AI 2027 is a story, but why not use your real model to sample a trajectory -- perhaps rejection sampling until you get one of the more aggressive possibilities? 

Or you could even rejection sample the model until you get one that matches AI 2027 pretty closely, and then draw that curve's projection (and retrojection -- wait is that even a word).
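
To illustrate what I mean by rejection sampling here, a minimal sketch with a toy stand-in model. `sample_trajectory` is hypothetical (a capability index compounding at a randomly drawn rate) and is not the AI 2027 timelines model; the target curve and tolerance are made up too.

```python
import random


def sample_trajectory(rng, years=5):
    """Toy stand-in for a forecasting model: a capability index that
    compounds yearly at one randomly drawn growth rate. Illustrative only."""
    growth = rng.lognormvariate(0.0, 0.5)  # hypothetical uncertainty over growth
    level, path = 1.0, []
    for _ in range(years):
        level *= 1.0 + growth
        path.append(level)
    return path


def rejection_sample(target, tolerance, rng, max_tries=100_000):
    """Draw trajectories until one stays within `tolerance` (relative error)
    of `target` at every step. Returns the accepted path and the draw count."""
    for tries in range(1, max_tries + 1):
        path = sample_trajectory(rng, years=len(target))
        if all(abs(p - t) <= tolerance * t for p, t in zip(path, target)):
            return path, tries
    raise RuntimeError("no sample matched; loosen tolerance or raise max_tries")


# Made-up "story" curve standing in for the narrative trajectory.
story_curve = [2.0, 4.0, 8.0, 16.0, 32.0]
path, tries = rejection_sample(story_curve, tolerance=0.2, rng=random.Random(0))
```

The accepted `path` is a genuine draw from the model, so a plot of it would carry the model's assumptions rather than a bespoke simplification, and `tries` gives a rough sense of how atypical the story's trajectory is under the model.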

I'm currently watching the tension between "this is just a story [which doesn't have hard data behind it, take it with a big grain of salt]" and "here's some math supporting our estimates [but wasn't actually used for our plots or the story in any direct way]." I'm worried that the math lends credibility without being that relevant to the real decisions. 

A deep critique of AI 2027’s bad timeline models
TurnTrout · 11d
  • Or we should have more clearly labeled that the graph was not generated via the timelines model.

Yes, I think this would have been quite good.

A deep critique of AI 2027’s bad timeline models
TurnTrout · 11d

since the forecast did end up as good propaganda if nothing else

Just responding to this local comment you made: I think it's wrong to make "propaganda" to reach end Y, even if you think end Y is important. If you have real reasons for believing something will happen, you shouldn't have to lie, exaggerate, or otherwise mislead your audience to make them believe it, too. 

So I'm arguing that you shouldn't have mixed feelings because ~"it was valuable propaganda at least." Again, not trying to claim that AI 2027 "lied" - just replying to the quoted bit of reasoning.

Posts

A Simple Explanation of AGI Risk (Ω; 62 karma, 2d, 1 comment)
Authors Have a Responsibility to Communicate Clearly (116 karma, 2d, 23 comments)
Distillation Robustifies Unlearning (Ω; 227 karma, 21d, 36 comments)
Self-fulfilling misalignment data might be poisoning our AI models (Ω; 153 karma, 4mo, 28 comments)
Steering Gemini with BiDPO (Ω; 104 karma, 5mo, 5 comments)
Insights from "The Manga Guide to Physiology" (26 karma, 5mo, 3 comments)
Deceptive Alignment and Homuncularity (26 karma, 6mo, 12 comments)
Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses (Ω; 64 karma, 6mo, 3 comments)
Review: Breaking Free with Dr. Stone (47 karma, 7mo, 5 comments)
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks (Ω; 165 karma, 7mo, 12 comments)

Wikitag Contributions

Reinforcement learning, 2y (+16)
Reinforcement learning, 2y (+333/-390)
Complexity of value, 3y (+176/-112)
General Alignment Properties, 3y (+317)
Pages Imported from the Old Wiki, 5y (+9/-5)