Take: Exploration hacking should not be used as a synonym for deceptive alignment.
(I have observed one such usage)
Deceptive alignment is maybe a very particular kind of exploration hacking, but the term exploration hacking (without further specification) should refer to models deliberately sandbagging "intermediate" capabilities during RL training to avoid learning a "full" capability.
“If I quantize my network with however many bits, how bad is that?” I don’t know, maybe this is one of these things where if I sat down and tried to do it, I’d realize the issue, but it seems doable to me. It seems like there’s possibly something here.
I think the reason this doesn't work (i.e. why you can only get a Pareto frontier) is that you can only lower bound the description length of the network / components, such that a direct comparison to "loss bits" doesn't make sense
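For concreteness, here's the kind of back-of-envelope accounting I have in mind when putting "description length bits" next to "loss bits" (all numbers made up, and the two-part-code framing is my gloss, not anything from the original discussion):

```python
# toy two-part-code accounting: "model bits" from quantization vs. "loss bits"
# from the model's cross-entropy on the data (all numbers illustrative)
n_params = 125_000_000        # hypothetical GPT-2-small-scale network
bits_per_param = 4            # post-quantization precision
model_bits = n_params * bits_per_param

n_tokens = 10_000_000_000     # tokens the model is evaluated on
ce_bits_per_token = 1.5       # cross-entropy loss converted to bits/token
loss_bits = n_tokens * ce_bits_per_token

# the open question is whether / when directly comparing these is meaningful
print(f"model: {model_bits / 8e6:.0f} MB, data given model: {loss_bits / 8e9:.1f} GB")
```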
Curious for your takes on the difficulty (for models) of sandbagging vs the difficulty (for AI developers) of preventing adversarial examples / jailbreaks. My quick thoughts:
This is really cool work and I hope it gets more attention (e.g. promoted to the Alignment Forum). In particular I love the use of CodeNames as a relatively low-compute test-bed for scalable oversight work.
A few thoughts:
I think the smart Silver-sympathetic response here would be:
1. Yes, the alignment problem is a real thing, and my (Silver's) sketch of a solution is not sufficient
2. Studying / taking inspiration from human-reward circuitry is an interesting research direction
3. We will be able to iterate on the problem before AIs are capable of taking existentially destructive actions
From (this version of) his perspective, alignment is a "normal" problem and a small piece of achieving beneficial AGI. The larger piece is the AGI part, which is why he spends most...
Instead I would describe the problem as arising from a generator and verifier mismatch: when the generator is much stronger than the verifier, the generator is incentivized to fool the verifier without completing the task successfully.
I think these are related but separate problems - even with a perfect verifier (on easy domains), scheming could still arise.
Though imperfect verifiers increase P(scheming), better verifiers increase the domain of "easy" tasks, etc.
Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time, so it would completely ignore the proposed mechanism to take the models produced by scalable oversight and replace them with adversarial models. (You could start out with adversarial models and see whether scalable oversight fixes them, but that's an alignment stress test.)
I have a similar confusion (see my comment here), but it seems like at least Ryan wants control evaluations ...
IMO the most exciting mech-interp research since SAEs - great work.
A few thoughts / questions:
We initially called these experiments “control evaluations”; my guess is that it was a mistake to use this terminology, and we’ll probably start calling them “black-box plan-conservative control evaluations” or something.
also seems worth disambiguating "conservative evaluations" from "control evaluations" - in particular, as you suggest, we might want to assess scalable oversight methods under conservative assumptions (to be fair, the distinction isn't all that clean - your oversight process can both train and monitor a policy. Still, I associate control m...
Thanks for the thorough response, and apologies for missing the case study!
I think I regret / was wrong about my initial vaguely negative reaction - scaling SAE circuit discovery to large models is a notable achievement!
Re residual skip SAEs: I'm basically on board with "only use residual stream SAEs", but skipping layers still feels unprincipled. Like imagine if you only trained an SAE on the final layer of the model. By including all the features, you could perfectly recover the model behavior up to the SAE reconstruction loss, but you would have ~...
This post (and the accompanying paper) introduced empirical benchmarks for detecting "measurement tampering" - when AI systems alter measurements used to evaluate them.
Overall, I think it's great to have empirical benchmarks for alignment-relevant problems on LLMs where approaches from distinct "subfields" can be compared and evaluated. The post and paper do a good job of describing and motivating measurement tampering and justifying the various design decisions (though some of the tasks are especially convoluted).
A few points of criticism:
- the d...
Nice post!
My notes / thoughts (apologies for the overly harsh / critical tone; I'm stepping into the role of annoying reviewer):
Summary
Strengths:
curious if you have takes on the right balance between clean research code / infrastructure and moving fast / being flexible. Maybe it's some combination of
yeah fair - my main point is that you could have a reviewer reputation system without de-anonymizing reviewers on individual papers
(alternatively, de-anonymizing reviews might improve the incentives to write good reviews on the current margin, but would also introduce other bad incentives towards sycophancy etc. which academics seem deontically opposed to)
From what I understand, reviewing used to be a non-trivial part of an academic's reputation, but relied on much smaller academic communities (somewhat akin to Dunbar's number). So in some sense I'm not proposing a new reputation system, but a mechanism for scaling an existing one (but yeah, trying to get academics to care about a new reputation metric does seem like a pretty big lift)
I don't really follow the market-place analogy - in a more ideal setup, reviewers would be selling a service to the conferences/journals in exchange for reputation (and possib...
Yeah, this stuff might help somewhat, but I think the core problem remains unaddressed: ad-hoc reputation systems don't scale to thousands of researchers.
It feels like something basic like "have reviewers / area chairs rate other reviewers, and post un-anonymized cumulative reviewer ratings" (a kind of h-index for review quality) might go a long way. The double-blind structure is maintained, while providing more incentive (in terms of status, and maybe direct monetary reward) for writing good reviews.
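To gesture at what I mean by an h-index for review quality, here's a toy version of the aggregation (my formulation, not a worked-out proposal; the 1-10 rating scale is made up):

```python
# toy "reviewer h-index": a reviewer has index h if h of their reviews
# received a rating of at least h (ratings e.g. from area chairs or other
# reviewers, on a hypothetical 1-10 scale)
def review_h_index(ratings: list[int]) -> int:
    ranked = sorted(ratings, reverse=True)
    return sum(1 for i, r in enumerate(ranked, start=1) if r >= i)

print(review_h_index([9, 7, 7, 4, 2]))  # -> 4
```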
does anyone have thoughts on how to improve peer review in academic ML? From discussions with my advisor, my sense is that the system used to depend on word of mouth and people caring more about their academic reputation, which works in a field of hundreds of researchers but breaks down in fields of thousands or more. Seems like we need some kind of karma system to both rank reviewers and submissions. I'd be very surprised if nobody has proposed such a system, but a quick google search doesn't yield results.
I think reforming peer review is probably underrated fr...
Yeah, it does seem unfortunate that there's not a robust pipeline for tackling the "hard problem" (even conditioned on more "moderate" models of x-risk)
But (conditioned on "moderate" models) there's still a lot of low-hanging fruit that STEM people from average universities (a group I count myself among) can pick. Like it seems good for Alice to bounce off of ELK and work on technical governance, and for Bob to make incremental progress on debate. The current pipeline/incentive system is still valuable, even if it systematically neglects tackling the "hard problem of alignment".
still trying to figure out the "optimal" config setup. The "clean code" method is roughly to have dedicated config files for different components that can be composed and overridden etc. (see for example https://github.com/oliveradk/measurement-pred). But I don't like how far away these configs are from the main code. On the other hand, as the experimental setup gets more mature, I often want to toggle across config groups. Maybe the solution is making "mode" an optional config group itself, with the overrides living within the main script (rough sketch below).
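Here's a rough sketch of what I mean, using OmegaConf directly (the keys, values, and file layout are hypothetical, not the linked repo's actual setup):

```python
# sketch: "mode" as an optional config, with per-mode overrides kept next to
# the experiment code so they're easy to see (hypothetical keys/values)
from omegaconf import OmegaConf

MODE_OVERRIDES = {
    "debug": {"trainer": {"max_steps": 10}, "model": {"n_layers": 2}},
    "full":  {"trainer": {"max_steps": 50_000}},
}

def load_config(config_path: str = "configs/config.yaml", mode: str | None = None):
    cfg = OmegaConf.load(config_path)    # base config composed from files
    cli = OmegaConf.from_cli()           # e.g. `python main.py mode=debug trainer.lr=1e-4`
    mode = cli.pop("mode", mode)
    if mode is not None:
        cfg = OmegaConf.merge(cfg, MODE_OVERRIDES[mode])  # apply the mode's override group
    return OmegaConf.merge(cfg, cli)     # CLI overrides take final precedence

if __name__ == "__main__":
    print(OmegaConf.to_yaml(load_config()))
```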
just read both posts and they're great (as is The Witness). It's funny though, part of me wants to defend OOP - I do think there's something to finding really good abstractions (even preemptively), but that it's typically not worth it for self-contained projects with small teams and fixed time horizons (e.g. ML research projects, but also maybe indie games).
The builder-breaker thing isn't unique to CoT though, right? My gloss on the recent Obfuscated Activations paper is something like "activation engineering is not robust to arbitrary adversarial optimization, and only somewhat robust to constrained adversarial optimization".
I'm curious if Redwood would be willing to share a kind of "after action report" for why they stopped working on ELK/heuristic argument inspired stuff (e.g Causal Scrubbing, Patch Patching, Generalized Wick Decompositions, Measurement Tampering)
My impression is that it's some mix of:
a. Control seems great
b. Heuristic arguments are a bad bet (for some of the reasons mech interp is a bad bet)
c. ARC has it covered
But the weighting is pretty important here. If it's
a. more people should be working on heuristic argument inspired stuff.
b. les...
Just to make it explicit and check my understanding - the residual decomposition is equivalent to the edge / factorized view of the transformer, in that we can express any term in the residual decomposition as a set of edges that form a path from input to output, e.g.
= input -> output
= input -> Attn 1.0 -> MLP 2 -> Attn 4.3 -> output
And it follows that the (pre final layernorm) output of a transformer is the sum of all the "paths" from input to output constructed from the factorized DAG.
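A minimal sketch of the sum-over-paths claim as I'm picturing it (notation mine; this treats each component $c_i$ as if it acted linearly on the residual stream, which is the idealization the factorized view makes):

$$x_{\text{final}} \;=\; \sum_{k \ge 0} \;\; \sum_{i_1 < \dots < i_k} \big(c_{i_k} \circ \dots \circ c_{i_1}\big)(x_{\text{embed}})$$

where the $k = 0$ term is the direct input -> output path, and each ordered tuple $(i_1, \dots, i_k)$ of components (attention heads and MLPs) picks out one path through the factorized DAG.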
For anyone trying to replicate / try new methods, I posted a diamonds "pure prediction model" to Hugging Face: https://huggingface.co/oliverdk/codegen-350M-mono-measurement_pred (GitHub repo here: https://github.com/oliveradk/measurement-pred/tree/master)
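In case it helps, an untested sketch of loading it (this assumes the custom measurement-prediction head is available via the Hub's remote code; if it isn't, clone the GitHub repo and import the model class from there instead):

```python
# untested loading sketch; the model id is from the comment above, everything
# else is a standard transformers call rather than anything repo-specific
from transformers import AutoModel, AutoTokenizer

model_id = "oliverdk/codegen-350M-mono-measurement_pred"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
```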
just read "Situational Awareness" - it definitely woke me up. AGI is real, and very plausibly (55%?) happening within this decade. I need to stop sleepwalking and get serious about contributing within the next two years.
First, some initial thoughts on the essay
I previously thought the argument for measurement tampering being more tractable than general ELK was mostly about the structural / causal properties of multiple independent measurements, but I think I'm more swayed by the argument that measurement tampering will just be more obvious (both easier to see using interpretability and more anomalous in general) than e.g. sycophancy. This is a flimsier argument though, and is less likely to hold when tampering is more subtle.
Here's a revised sketch
A few notes:
Scalable Oversight Taxonomy
I think I'm mostly right, but using a somewhat confused frame.
It makes more sense to think of MAD approaches as detecting all abnormal reasons (including deceptive alignment) by default, and then if we get that working we'll try to decrease false anomalies by doing something like comparing the least common ancestor of the measurements in a novel mechanism to the least common ancestor of the measurements on trusted mechanisms.
One confusion I have with MAD as an approach to ELK is that it seems to assume some kind of initial inner alignment. If we're flagging when the model takes actions / makes predictions for "unusual reasons", where unusual is defined with respect to some trusted set, but aligned and misaligned models are behaviorally indistinguishable on the trusted set, then a model could learn to do things for misaligned reasons on the trusted set, and then use those same reasons on the untrusted set. For example, a deceptively aligned model would appear aligned in training...
Clarifying the relationship between mechanistic anomaly detection (MAD), measurement tampering detection (MTD), weak to strong generalization (W2SG), weak to strong learning (W2SL), and eliciting latent knowledge (ELK). (Nothing new or interesting here, I just often lose track of these relationships in my head)
eliciting latent knowledge is an approach to scalable oversight which hopes to use the latent knowledge of a model as a supervision signal or oracle.
weak to strong learning is an experimental setup for evaluating scalable oversight protocols, ...
overall, how enthusiastic are you about safety-motivated people developing such an architecture?
(seems to come with obviously large capability externalities - we can deploy the model outside the sandbox!)