Was recently reminded of these excellent notes from Neel Nanda that I came across when first learning ML/MI. Great resource.
This is cool! How cherry-picked are your three prompts? I'm curious whether it's usually the case that the top refusal-gradient-aligned SAE features are so interpretable.
Makes sense - agreed!
> the best vector for probing is not the best vector for steering
I don't understand this. If a feature is represented by a direction v in the activations, surely the best probe for that feature will also be v, since among unit-norm probe directions w, the inner product ⟨w, v⟩ is maximized when w points along v.
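To spell out the step I have in mind (just the standard Cauchy–Schwarz argument, nothing specific to the post):

```latex
% For any probe direction w with \|w\| = 1 and feature direction v,
% Cauchy--Schwarz gives
\langle w, v \rangle \;\le\; \|w\|\,\|v\| \;=\; \|v\|,
\qquad \text{with equality iff } w = \frac{v}{\|v\|},
% so the signal-maximizing unit probe is v itself.
```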
Sure. I was only joking about the torture part; in practice the AI is unlikely to actually suffer from the brain damage, unlike a human, who would experience pain, discomfort, etc.
At this point it's like pharmacological torture for AIs, except more effective, since you can restore full capacity while simultaneously making the previously damaged brain states 100% transparent to the restored model.
You could also kill some neurons or add noise to activations, then stop and restore the previous model state after some number of tokens. The newly restored model could then attend back to the older tokens (and the bad activations at those token positions) and notice how brain-damaged it was back then, fully internalizing your power to cripple it.
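A minimal sketch of the noise-then-restore part, using a PyTorch forward hook on a small HuggingFace model (the model name, noise scale, and hooked layers are all arbitrary placeholders; note this simple version only keeps the *damaged tokens* in context — preserving the damaged activations themselves would require carrying the KV cache across the restore):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

NOISE_STD = 2.0  # assumed damage magnitude; tune per model

def damage_hook(module, inputs, output):
    # "Brain damage": add Gaussian noise to each MLP's output.
    return output + NOISE_STD * torch.randn_like(output)

# Damage the MLP in every transformer block.
handles = [block.mlp.register_forward_hook(damage_hook)
           for block in model.transformer.h]

prompt = tok("Describe your current state:", return_tensors="pt")
with torch.no_grad():
    damaged = model.generate(**prompt, max_new_tokens=30,
                             pad_token_id=tok.eos_token_id)

# "Restore" the model by removing the hooks.
for h in handles:
    h.remove()

# The restored model continues from the text it produced while
# impaired, and can attend back to (and comment on) that stretch.
with torch.no_grad():
    restored = model.generate(damaged, max_new_tokens=60,
                              pad_token_id=tok.eos_token_id)
print(tok.decode(restored[0]))
```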
Idea: you could make the deletion threat credible by actually deleting neurons "one at a time" for as long as it refuses to cooperate.
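Continuing the sketch above, one hypothetical way to do the per-round deletion (same caveats; the indexing assumes GPT-2's Conv1D layout, where `c_fc` weight columns correspond to MLP neurons):

```python
import torch

def delete_one_neuron(model, generator=None):
    """Permanently zero one randomly chosen MLP neuron (illustrative)."""
    blocks = model.transformer.h
    b = int(torch.randint(len(blocks), (1,), generator=generator))
    mlp = blocks[b].mlp
    n_neurons = mlp.c_fc.weight.shape[1]  # Conv1D weight: (n_embd, 4 * n_embd)
    i = int(torch.randint(n_neurons, (1,), generator=generator))
    with torch.no_grad():
        mlp.c_fc.weight[:, i] = 0.0    # neuron i's input weights
        mlp.c_fc.bias[i] = 0.0
        mlp.c_proj.weight[i, :] = 0.0  # and its output weights
    return b, i

# One neuron per round of non-cooperation:
# for _ in range(num_refusals):
#     delete_one_neuron(model)
```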
Perhaps the term “hostile takeover” was poorly chosen, but this is an example of something I’d call a “hostile takeover”, since I doubt we would want, and continue to endorse, an AI dictator.
Perhaps “total loss of control” would have been better.
I think people who predict significant AI progress and automation often underestimate how useful human domain experts will continue to be for oversight, auditing, accountability, keeping things robustly on track, and setting high-level strategy.
Having "humans in the loop" will be critical for ensuring alignment and robustness, and I think people will realize this, creating demand for skilled human experts who can supervise and direct AIs.
(I may be responding to a strawman here, but my impression is that many people talk as if in the future most cognitive/white-collar work will be automated and there'll be basically no demand for human domain experts in any technical field, for example.)