Buck

CEO at Redwood Research.

AI safety is a highly collaborative field; almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Posts


Buck's Shortform (12 karma, Ω, 6y, 174 comments)

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking (75 karma, Ω, 1mo, 1 comment)

Handling schemers if shutdown is not an option (43 karma, Ω, 2mo, 1 comment)

Ctrl-Z: Controlling AI Agents via Resampling (124 karma, Ω, 2mo, 0 comments)

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence (29 karma, Ω, 2mo, 1 comment)

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? (34 karma, Ω, 3mo, 0 comments)

Some articles in “International Security” that I enjoyed (130 karma, 4mo, 10 comments)

A sketch of an AI control safety case (57 karma, Ω, 5mo, 0 comments)

Ten people on the inside (139 karma, Ω, 5mo, 28 comments)

Early Experiments in Human Auditing for AI Control (27 karma, Ω, 5mo, 0 comments)

Thoughts on the conservative assumptions in AI control (91 karma, Ω, 5mo, 5 comments)


Comments


Moral Alignment: An Idea I'm Embarrassed I Didn't Think of Myself

Buck8h20

My guess is that neither of us will hear about any of these discussions until after they're finalized.


Moral Alignment: An Idea I'm Embarrassed I Didn't Think of Myself

Buck10h10-1

I think it's very unlikely that (conditioned on no AI takeover) something similar to "all humans get equal weight in deciding what happens next" happens; I think that a negotiation between a small number of powerful people (some of whom represent larger groups, e.g. nations) that ends with an ad hoc distribution seems drastically more likely. The bargaining solution of "weight everyone equally" seems basically so implausible that it seems pointless to even discuss it as a pragmatic solution.


Moral Alignment: An Idea I'm Embarrassed I Didn't Think of Myself

Buck14h108

Thanks heaps for pointing out the Eliezer content!

I am very skeptical that you'll get "all of humanity equally" as the bargaining solution, as opposed to some ad hoc thing that weighs powerful people more. I'm not aware of any case where the solution to a bargaining problem was "weigh the preference of everyone in the world equally". (This isn't even how most democracies work internally!)


Moral Alignment: An Idea I'm Embarrassed I Didn't Think of Myself

Buck14h113

I am sympathetic on the object level to the kind of perspective you're describing here, where you say we should do something like the extrapolated preferences of some set of bargainers. Two problems:

1. I think that when people talk about CEV, they're normally not defining it in terms of humanity because humans are who you pragmatically have to coordinate with. E.g. I don't see anything like that mentioned in the wiki page or in the original paper on a quick skim; I interpret Eliezer as referencing humanity because that's who he actually cares about the values of. (I could be wrong about what Eliezer thinks here.)
2. I think it's important to note that if you settle on CEV as a bargaining solution, this probably ends up with powerful people (AI company employees, heads of state) drastically overrepresented in the bargain, which is both unattractive and doesn't seem to be what people are usually imagining when they talk about CEV.


Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

Buck17h42

One reason to think that the US winning concentrates power less is that the US is a democracy with a strong tradition of maintaining individual rights and a reasonably strong history (over the last 80 years) of pursuing a world order where it benefits from lots of countries being pretty stable and not e.g. invading each other.


evhub's Shortform

Buck2dΩ673

My recommendation is to, following Joe Carlsmith, use them as synonyms, and use the term "schemer" instead of "deceptively aligned model". I do this.

Joe's issues with the term "deceptive alignment":

"I think that the term "deceptive alignment" often leads to confusion between the four sorts of deception listed above. And also: if the training signal is faulty, then "deceptively aligned" models need not be behaving in aligned ways even during training (that is, "training gaming" behavior isn't always "aligned" behavior)."


evhub's Shortform

Buck2dΩ331

I was mostly thinking of misaligned but non-deceptively-aligned models.


evhub's Shortform

Buck3dΩ8117

I think it's conceivable for non-deceptively-aligned models to gradient hack, right?


jenn's Shortform

Buck3d1526

I think that shifting from 15% to 20% over ten years is so plausible under the null hypothesis that it doesn't really cry out for explanation, and any proposed explanation has to somehow explain why it didn't lead to a larger effect!


jenn's Shortform

Buck3d96

"increasing endorsement/linking of right wing figures like hanania and cremieux"

Idk, back in the day LessWrong had a reasonable amount of discussion of relatively right-wing figures like Moldbug and other neoreactionaries, or on the less extreme end, people like Bryan Caplan. And there's always been an undercurrent of discussion of e.g. race and IQ.

"low confidence but i feel like i can kind of assume that the median rat has libertarian sympathies now in a way that i couldn't before?"

I feel like the median rat had strong libertarian sympathies 10 years ago.
