PhD student at UMass Amherst
thanks for the detailed (non-ML) example! exactly the kind of thing I'm trying to get at
Thanks! huh yeah, the Python interactive window seems like a much cleaner approach, I'll give it a try
thanks! yup, Cursor is notebook compatible
I'm curious if Redwood would be willing to share a kind of "after action report" on why they stopped working on ELK/heuristic-argument-inspired stuff (e.g. Causal Scrubbing, Path Patching, Generalized Wick Decompositions, Measurement Tampering)
My impression is that it's some mix of:
a. Control seems great
b. Heuristic arguments are a bad bet (for some of the reasons mech interp is a bad bet)
c. ARC has it covered
But the weighting is pretty important here. If it's mostly:
a. more people should be working on heuristic argument inspired stuff.
b. fewer people should be working on heuristic argument inspired stuff (i.e. ARC employees should quit, or at least people shouldn't take jobs at ARC)
c. people should try to work at ARC if they're interested, but it's going to be difficult to make progress, especially for e.g. a typical ML PhD student interested in safety.
Ultimately people should come to their own conclusions, but Redwood's considerations would be pretty valuable information.
(The community often calls this “scalable oversight”, but we want to be clear that this does not necessarily include scaling to large numbers of situations, as in monitoring.)
I like this terminology and think the community should adopt it
For anyone trying to replicate / try new methods, I posted a diamonds "pure prediction model" to Hugging Face: https://huggingface.co/oliverdk/codegen-350M-mono-measurement_pred (GitHub repo here: https://github.com/oliveradk/measurement-pred/tree/master)
The builder-breaker thing isn't unique to CoT though, right? My gloss on the recent Obfuscated Activations paper is something like "activation engineering is not robust to arbitrary adversarial optimization, and only somewhat robust to contained adversarial optimization".