The builder-breaker thing isn't unique to CoT though, right? My gloss on the recent Obfuscated Activations paper is something like "activation engineering is not robust to arbitrary adversarial optimization, and only somewhat robust to contained adversarial optimization".

Write Good Enough Code, Quickly

Oliver Daniels3d10

thanks for the detailed (non-ML) example! exactly the kind of thing I'm trying to get at

Write Good Enough Code, Quickly

Oliver Daniels3d20

Thanks! Huh, yeah, the Python interactive window seems like a much cleaner approach; I'll give it a try.

Write Good Enough Code, Quickly

Oliver Daniels4d20

Thanks! Yup, Cursor is notebook compatible.

Benchmarks for Detecting Measurement Tampering [Redwood Research]

Oliver Daniels10d10

Thanks!

Oliver Daniels-Koch's Shortform

Oliver Daniels1mo94

I wish there was a BibTeX functionality for Alignment Forum posts...

Oliver Daniels-Koch's Shortform

Oliver Daniels2mo84

I'm curious if Redwood would be willing to share a kind of "after action report" for why they stopped working on ELK/heuristic argument inspired stuff (e.g. Causal Scrubbing, Path Patching, Generalized Wick Decompositions, Measurement Tampering).

My impression is that it's some mix of:

a. Control seems great

b. Heuristic arguments is a bad bet (for some of the reasons mech interp is a bad bet)

c. ARC has it covered

But the weighting is pretty important here. If it's:

a. more people should be working on heuristic argument inspired stuff.

b. fewer people should be working on heuristic argument inspired stuff (i.e. ARC employees should quit, or at least people shouldn't take jobs at ARC).

c. people should try to work at ARC if they're interested, but it's going to be difficult to make progress, especially for e.g. a typical ML PhD student interested in safety.

Ultimately people should come to their own conclusions, but Redwood's considerations would be pretty valuable information.

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

Oliver Daniels4mo54

(The community often calls this “scalable oversight”, but we want to be clear that this does not necessarily include scaling to large numbers of situations, as in monitoring.)

I like this terminology and think the community should adopt it

Benchmarks for Detecting Measurement Tampering [Redwood Research]

Oliver Daniels6mo40

For anyone trying to replicate or try new methods, I posted a diamonds "pure prediction model" to Hugging Face: https://huggingface.co/oliverdk/codegen-350M-mono-measurement_pred (GitHub repo here: https://github.com/oliveradk/measurement-pred/tree/master)