Cody Rushing — LessWrong

Alignment remains a hard, unsolved problem

Oh yeah that is a good point to bring up. I agree that the empirics of how good few shot catastrophe prevention is will affect both the usefulness of post-prevented-catastrophe models and change how good of a strategy rare failures is. It's also the case that the rare failures comes from great pre-deployment (or in-deployment) evaluations and training procedures, but in some sense this is also a version of few-shot catastrophe prevention with different affordances for the safety team IRL.

The diffuse stakes control doesn't suffer from this and just directly engages with the dynamic of training the model on failures and getting useful work out of a misaligned model

Alignment remains a hard, unsolved problem

Cody Rushing1mo10

Would be curious for links to Redwoods justifications if you have them

Responded to a bit of this in a separate thread here

Alignment remains a hard, unsolved problem

Cody Rushing1mo10

The "usefulness" reported in papers is how much work can be gotten out of an aligned model
...
(I know that Redwood have justifications for why they evaluate usefulness in the way that they do, but these seem a bit galaxy brained to me and I'm not convinced by them)

This is mostly correct, and we describe it this way in our papers for simplicity, but it's actually a bit more complex than that. In existing Redwood Research high-stakes control, we consider the problem of preventing a single failure from occurring during deployment - and am important property of this failure is that its 'orthogonal' to usefulness, in the sense that the red team is indifferent to usefulness (at least, as long as it doesn't relate to safety). For instance, in our Ctrl-Z paper, the high-stakes failure we study is the untrusted model downloading and installing malware, which is unrelated to the problem of solving bash tasks. And since the red team follows a strategy of rare failures (where on non-attacking instances it just acts as the aligned model), further we believe that in practice the usefulness metric in this scenario is also how much work can be got out of a misaligned model.

But this was contingent on us not focusing our research on situations where the model was trying to sabotage our research. And in our diffuse control research, we do actually focus on situations where we try to evaluate how much useful work we can get out of a misaligned model trying to directly sabotage the research/work. (Agreed that part of our current hope here is to train the model to elicit great behavior to help with the low-stakes failures, we are currently not super excited about incriminating models. But also controlling the model is still important.)

Why you should eat meat - even if you hate factory farming

Cody Rushing3mo30

Oh I think I see what you are arguing (that you should only care about whether or not eating meat is net good or net bad, theres no reason to factor in this other action of the donation offset)

Specifically then the two complaints may be:

You specify that $0 < E$ in your graph, where you are using E represent -1 * amount of badness. While in reality people are modelling $E$ as negative (where eating meat is instead being net good for the world)
People might instead think that doing E is 'net evil' but also desirable for them for another reason unrelated to that (maybe for some reason like 'i also enjoy eating meat'). So here, if they only want to take net good actions while also eating meat, then they would offset it with donations. The story you outlined above arguing that 'The concept of offsetting evil with good does not make sense' misses that people might be willing to make such a tradeoff

I think I agree with what you are saying, and might be missing other reasons people are disagree voting

Why you should eat meat - even if you hate factory farming

Cody Rushing3mo24

Responding to your confusion about disagreement votes: I think your model isn't correctly describing how people are modelling this situation. People may believe that they can do more good from [choosing to eat meat + offsetting with donations] vs [not eating meat + offsetting with donations] because of the benefits described in the post. So you are failing to include a +I (or -I) term that factors in peoples abilities to do good (or maybe even the terminal effects of eating meat on themself).

Four ways learning Econ makes people dumber re: future AI

Cody Rushing5moΩ-100

Regardless of the whether these anti-pedagogy's are correct, I'm confused about why you think you've shown that learning econ made the economists dumber. It seems like the majority of your tweets you linked, excluding maybe Tweet 4, are actually just the economists discussing narrow AI and failing to consider general intelligence?

If you meant to say something like 'econ pedagogy makes it hard for economists to view AGI as something that could actually be intelligent in a way similar to humans', then I may be more inclined to agree with you.

The (Unofficial) Rationality: A-Z Anki Deck

Cody Rushing7mo51

The deck appears to have no notes in it

Shortform

Cody Rushing9mo-10

My strong downvote just gave +3

The Case Against AI Control Research

Cody Rushing1y21

Hmm, when I imagine "Scheming AI that is not easy to shut down with concerted nation-state effort, are attacking you with bioweapons, but are weak enough such that you can bargain/negotiate with them" I can imagine this outcome inspiring a lot more caution relative to many other worlds where control techniques work well but we can't get any convincing demos/evidence to inspire caution (especially if control techniques inspire overconfidence).

But the 'is currently working on becoming more powerful' part of your statement does carry a lot of weight.

The Case Against AI Control Research

Cody Rushing1y42

Control research exclusively cares about intentional deception/scheming; it does not aim to solve any other failure mode.

(nitpick, doesn't address main point of article) I think this is incomplete. Though control research does indeed care a lot about scheming, control can be used more broadly to handle any worst-case deployment behavior. See Josh Clymer's post about Extending control evaluations to non-scheming threats.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments