Buck

CEO at Redwood Research.

AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Posts

Sorted by New

12Buck's Shortform

Ω

6y

Ω

174

29How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

Ω

2d

Ω

0

30Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?

Ω

23d

Ω

0

129Some articles in “International Security” that I enjoyed

2mo

10

57A sketch of an AI control safety case

Ω

2mo

Ω

0

139Ten people on the inside

Ω

3mo

Ω

28

27Early Experiments in Human Auditing for AI Control

Ω

3mo

Ω

0

90Thoughts on the conservative assumptions in AI control

Ω

3mo

Ω

5

62Measuring whether AIs can statelessly strategize to subvert security measures

Ω

4mo

Ω

0

483Alignment Faking in Large Language Models

Ω

3mo

Ω

74

62Why imperfect adversarial robustness doesn't doom AI control

Ω

5mo

Ω

25

Wikitag Contributions

Comments

Sorted by

Newest

To be legible, evidence of misalignment probably has to be behavioral

Buck5hΩ473

Ryan agrees, the main thing he means by "behavioral output" is what you're saying: an actually really dangerous action.

Reply

Notes on countermeasures for exploration hacking (aka sandbagging)

Buck5dΩ222

I think we should probably say that exploration hacking is a strategy for sandbagging, rather than using them as synonyms.

Reply

1

How much progress actually happens in theoretical physics?

Buck11d64

Isn’t the answer that the low hanging fruit of explaining unexplained observations has been picked?

Reply

Buck's Shortform

Buck11d402

I appeared on the 80,000 Hours podcast. I discussed a bunch of points on misalignment risk and AI control that I don't think I've heard discussed publicly before.

Transcript + links + summary here; it's also available as a podcast in many places.

Reply

1

Consider showering

Buck14d42

I love that I can guess the infohazard from the comment

Reply

4

1

Thomas Kwa's Shortform

Buck14dΩ14280

A few months ago, I accidentally used France as an example of a small country that it wouldn't be that catastrophic for AIs to take over, while giving a talk in France 😬

Reply

8

2

Third-wave AI safety needs sociopolitical thinking

Buck15d40

No problem, my comment was pretty unclear and I can see from the other comments why you'd be on edge!

Reply

ank's Shortform

Buck16d30

It seems extremely difficult to make a blacklist of models in a way that isn't trivially breakable. (E.g. what's supposed to happen when someone adds a tiny amount of noise to the weights of a blacklisted model, or rotates them along a gauge invariance?)

Reply

Third-wave AI safety needs sociopolitical thinking

Buck16d90

I agree that this isn't what I'd call "direct written evidence"; I was just (somewhat jokingly) making the point that the linked articles are Bayesian evidence that Musk tries to censor, and that the articles are pieces of text.

Reply

Third-wave AI safety needs sociopolitical thinking

Buck16d10

It is definitely evidence that was literally written

Reply