Oliver Daniels

PhD Student at UMass Amherst

Comments

Maps onto the "coherence over long time horizons" / agency point. I suspect contamination is much worse for chess (so many direct examples of board configurations and moves).

I've been confused about what people mean when they say "trend lines indicate AGI by 2027" - seems like it's basically this?

also rhymes with/is related to ARC's work on presumption of independence applied to neural networks (e.g. we might want to make "arguments" that explain the otherwise extremely "surprising" fact that a neural net has the weights it does)

Instead I would describe the problem as arising from a generator and verifier mismatch: when the generator is much stronger than the verifier, the generator is incentivized to fool the verifier without completing the task successfully.

I think these are related but separate problems - even with a perfect verifier (on easy domains), scheming could still arise.

Though imperfect verifiers increase P(scheming), better verifiers increase the domain of "easy" tasks, etc.

Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time, so it would completely ignore the proposed mechanism to take the models produced by scalable oversight and replace them with adversarial models. (You could start out with adversarial models and see whether scalable oversight fixes them, but that's an alignment stress test.)

I have a similar confusion (see my comment here) but seems like at least Ryan wants control evaluations to cover this case? (perhaps on the assumption that if your "control measures" are successful, they should be able to elicit aligned behavior from scheming models and this behavior can be reinforced?)

I think the concern is less "am I making intellectual progress on some project" and more "is the project real / valuable".


IMO most exciting mech-interp research since SAEs, great work. 

A few thoughts / questions:

  • curious about your optimism regarding learned masks as an attribution method - seems like the problem of learning mechanisms that don't correspond to the model's mechanisms is real for circuits (see InterpBench) and would plausibly bite here too (though we should be able to resolve this with benchmarks on downstream tasks once APD is more mature)
  • relatedly, the hidden layer "minimality" loss is really cool, excited to see how useful this is in mitigating the problem above (and diagnosing it)
  • have you tried "ablating" weights by sampling from the initialization distribution rather than zero ablating? I haven't thought about it too much and don't have a clear argument for why it would improve anything, but it feels more akin to resample ablation and "presumption of independence" type stuff (rough sketch of what I mean below, after this list)
  • seems great for mechanistic anomaly detection! very intuitive to map APD to surprise accounting (I was vaguely trying to get at a method like APD here)
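
For concreteness, here's a minimal sketch of the two ablation variants I have in mind (function names and the Gaussian-init assumption are mine, not from the paper):

```python
import torch

def zero_ablate(weight: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Standard zero ablation: masked entries are set to 0.
    return weight * (1 - mask)

def resample_ablate(weight: torch.Tensor, mask: torch.Tensor, init_std: float = 0.02) -> torch.Tensor:
    # Resample-style ablation: masked entries are replaced with fresh draws
    # from the (assumed Gaussian) initialization distribution instead of 0.
    fresh = torch.randn_like(weight) * init_std
    return weight * (1 - mask) + fresh * mask
```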

makes sense - I think I had in mind something like "estimate P(scheming | high reward) by evaluating P(high reward | scheming)". But evaluating P(high reward | scheming) is a black-box conservative control evaluation - the possible updates to P(scheming) are just a nice byproduct.
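
Spelling out the Bayes step behind this framing (notation mine):

$$P(\text{scheming} \mid \text{high reward}) = \frac{P(\text{high reward} \mid \text{scheming}) \, P(\text{scheming})}{P(\text{high reward})}$$

where the black-box evaluation gives a (conservative) estimate of the likelihood term P(high reward | scheming), and the prior P(scheming) has to come from elsewhere.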

We initially called these experiments “control evaluations”; my guess is that it was a mistake to use this terminology, and we’ll probably start calling them “black-box plan-conservative control evaluations” or something.

also seems worth disambiguating "conservative evaluations" from "control evaluations" - in particular, as you suggest, we might want to assess scalable oversight methods under conservative assumptions (to be fair, the distinction isn't all that clean - your oversight process can both train and monitor a policy. Still, I associate control more closely with monitoring, and conservative evaluations have a broader scope). 

Thanks for the thorough response, and apologies for missing the case study! 

I think I regret / was wrong about my initial vaguely negative reaction - scaling SAE circuit discovery to large models is a notable achievement!

Re residual skip SAEs: I'm basically on board with "only use residual stream SAEs", but skipping layers still feels unprincipled. Like imagine if you only trained an SAE on the final layer of the model. By including all the features, you could perfectly recover the model's behavior up to the SAE reconstruction error, but you would have ~no insight into how the model computed the final-layer features. More generally, by skipping layers, you risk missing potentially important intermediate features. Ofc to scale stuff you need to make sacrifices somewhere, but stuff in the vicinity of cross-coders feels more promising.

 
