Buck

CEO at Redwood Research.

AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.

If we are ever arguing on LessWrong and you feel like it's kind of heated and would go better if we just talked about it verbally, please feel free to contact me and I'll probably be willing to call to discuss briefly.

Comments (sorted by newest)
Buck's Shortform (Ω, 12 karma, 6y, 297 comments)
Buck · 2d

I'd be really interested in someone trying to answer the question: what updates should we make to the a priori arguments about AI goal structures as a result of the empirical evidence we've seen? I'd love to see a thoughtful and comprehensive discussion of this topic from someone who is familiar both with the conceptual arguments about scheming and with the relevant AI safety literature (and maybe AI literature more broadly).

Maybe a good structure would be to identify, from the a priori arguments, core uncertainties like "How strong is the imitative prior?", "How strong is the speed prior?", and "To what extent do AIs tend to generalize versus learn narrow heuristics?", and to tackle each in turn. (Of course, that would only make sense if the empirical updates actually factor nicely into that structure.)
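To make "factor nicely" slightly more concrete: in the idealized case where the pieces of empirical evidence $E_1, \dots, E_n$ bearing on those uncertainties are conditionally independent given the hypothesis, the overall update on scheming decomposes into a product of per-evidence likelihood ratios. This is just the odds form of Bayes' rule under that independence assumption, not a claim that real evidence actually factors this way:

$$
\frac{P(\text{scheming} \mid E_1, \dots, E_n)}{P(\neg\text{scheming} \mid E_1, \dots, E_n)}
= \frac{P(\text{scheming})}{P(\neg\text{scheming})}
\prod_{i=1}^{n} \frac{P(E_i \mid \text{scheming})}{P(E_i \mid \neg\text{scheming})}
$$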

I feel like I understand this very poorly right now. I currently think the only important update that empirical evidence has given me, compared to the arguments in 2020, is that the human-imitation prior is more powerful than I expected. (Though of course it's unclear whether this will continue, and basic points like the expected increasing importance of RL suggest that it will become less powerful over time.) But, to my detriment, I don't actually read the AI safety literature very comprehensively, and I might be missing empirical evidence that really should update me.

eggsyntax's Shortform
Buck · 7d

That's correct. Ryan summarized the story as:

Here’s the story of this paper. I work at Redwood Research (@redwood_ai) and this paper is a collaboration with Anthropic. I started work on this project around 8 months ago (it's been a long journey...) and was basically the only contributor to the project for around 2 months.

By this point, I had much of the prompting results and an extremely jank prototype of the RL results (where I fine-tuned llama to imitate Opus and then did RL). From here, it was clear that being able to train Claude 3 Opus could allow for some pretty interesting experiments.

After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.

This collaboration yielded the synthetic document fine-tuning and RL results and substantially improved the writing of the paper. I think this work is an interesting example of an AI company boosting safety research by collaborating and providing model access.

So Anthropic was indeed very accommodating here; they gave Ryan an unprecedented level of access for this work, and we're grateful for that. (And obviously, individual Anthropic researchers contributed a lot to the paper, as described in its author contribution statement. And their promotion of the paper was also very helpful!)

My objection is just that this paragraph of yours is fairly confused:

We don’t want to shoot the messenger — they went looking. They didn’t have to do that. They told us the results, and they didn’t have to do that. Anthropic finding these results is Anthropic being good citizens. And you want to be more critical of the A.I. companies that didn’t go looking.

This paper wasn't a consequence of Anthropic going looking; it was a consequence of Ryan going looking. If Anthropic hadn't wanted to cooperate, Ryan would have just published his results without Anthropic's help. That would have been a moderately worse paper that probably would have gotten substantially less attention, but Anthropic never had the option of preventing (a crappier version of) the core results from being published.

Just to be clear, I don't think this is that big a deal. It's a bummer that Redwood doesn't get as much credit for this paper as we deserve, but this is pretty unavoidable given how much more famous Anthropic is; my sense is that it's worth the effort for safety people to connect the paper to Redwood/Ryan when discussing it, but it's no big deal. I normally don't bother to object to that credit misallocation. But again, the story of the paper conflicted with these sentences you said, which is why I bothered bringing it up.

Considerations around career costs of political donations
Buck · 8d

Re your last paragraph: as the post notes, it is illegal to discriminate based on political donations when hiring for civil service roles.

EDIT: Readers of this thread should bear in mind that Max H is not Max Harms! I was confused about this.

eggsyntax's Shortform
Buck · 10d

"Alignment faking, and the alignment faking research was done at Anthropic.

"And we want to give credit to Anthropic for this. We don’t want to shoot the messenger — they went looking. They didn’t have to do that. They told us the results, and they didn’t have to do that. Anthropic finding these results is Anthropic being good citizens. And you want to be more critical of the A.I. companies that didn’t go looking."

It would be great if Eliezer knew (or noted, if he knows but is just phrasing it really weirdly) that the research in the alignment faking paper was initially done at Redwood by Redwood staff; I'm normally not prickly about this, but it seems directly relevant to what Eliezer said here.

faul_sname's Shortform
Buck · 10d

It really depends on what you mean by "most of the time when people say this". I don't think my experience matches yours.

Fabien's Shortform
Buck · 11d

My sense is that the main change is that Trump 2 was better prepared and placed more of a premium on personal loyalty, not that people were more reluctant to work with him for the sake of harm minimization.

Fabien's Shortform
Buck · 11d · Ω

Re that last point, you might be interested to read about "the constitution is not a suicide pact": many prominent American political figures have said that survival of the nation is more important than constitutionality (and this has been reasonably well received by other actors, not reviled).

faul_sname's Shortform
Buck · 11d

In my experience, when people say "it's worse for China to win the AI race than America", their main concern is that Chinese control of the far future would lead to a much less valuable future than American control would, not that American control reduces P(AI takeover). E.g. see this comment.

davekasten's Shortform
Buck · 17d

I basically agree with Zach that, based on public information, it seems like it would be really hard for them to be robust to this, and it seems implausible that they have justified confidence in such robustness.

I agree that he doesn't spell out the argument in very much depth. Obviously, I think it'd be great if someone made the argument in more detail. I think Zach's point is a positive contribution even though it isn't that detailed.

Raemon's Shortform
Buck · 18d

The most common kinds of errors I see in comments that I think an LLM could fix are:

  • Typos.
  • Missing a point. For example, I often write a comment and fail to realize that someone nearby in the comment tree has already responded to that point, or I misread the comment I was responding to. It would be helpful to have LLMs note this.
  • Maybe basic fact-checking?

Maybe you should roll this out for comments before posts.  
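As a very rough illustration of what such an LLM pass could look like (purely a sketch: call_llm stands in for whatever model API would actually be used, and the prompt wording here is invented):

# Purely illustrative sketch. call_llm is a placeholder for whatever chat-model
# API would actually be used; the prompt wording is invented for this example.
from typing import Callable, List

def review_comment(draft: str, nearby_comments: List[str],
                   call_llm: Callable[[str], str]) -> str:
    """Ask an LLM to flag typos, already-answered points, and dubious factual claims."""
    thread_context = "\n\n---\n\n".join(nearby_comments)
    prompt = (
        "You are reviewing a draft comment before it is posted.\n"
        "1. List any typos or grammatical errors.\n"
        "2. Note if a nearby comment already makes or answers the draft's point, "
        "or if the draft seems to misread the comment it replies to.\n"
        "3. Flag factual claims that look worth double-checking.\n\n"
        f"Nearby comments in the thread:\n{thread_context}\n\n"
        f"Draft comment:\n{draft}\n\n"
        "Reply with a short list of issues, or 'No issues found.'"
    )
    return call_llm(prompt)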

Posts (sorted by new)
Rogue internal deployments via external APIs (Ω, 34 karma, 13d, 4 comments)
The Thinking Machines Tinker API is good news for AI control and security (Ω, 91 karma, 20d, 10 comments)
Christian homeschoolers in the year 3000 (193 karma, 1mo, 64 comments)
I enjoyed most of IABIED (208 karma, 1mo, 46 comments)
An epistemic advantage of working as a moderate (217 karma, 2mo, 96 comments)
Four places where you can put LLM monitoring (Ω, 48 karma, 3mo, 0 comments)
Research Areas in AI Control (The Alignment Project by UK AISI) (Ω, 25 karma, 3mo, 0 comments)
Why it's hard to make settings for high-stakes control research (Ω, 49 karma, 3mo, 6 comments)
Recent Redwood Research project proposals (Ω, 91 karma, 3mo, 0 comments)
Lessons from the Iraq War for AI policy (190 karma, 4mo, 25 comments)