All of Oliver Daniels's Comments + Replies

Map on "coherence over long time horizon" / agency. I suspect contamination is much worse for chess (so many direct examples of board configurations and moves) 

I've been confused about what people mean when they say "trend lines indicate AGI by 2027" - seems like it's basically this?

1Julian Bradshaw
More than just this. OP actually documents it pretty well, see here.

also rhymes with/is related to ARC's work on presumption of independence applied to neural networks (e.g. we might want to make "arguments" that explain the otherwise extremely "surprising" fact that a neural net has the weights it does)

Instead I would describe the problem as arising from a generator and verifier mismatch: when the generator is much stronger than the verifier, the generator is incentivized to fool the verifier without completing the task successfully.

I think these are related but separate problems - even with a perfect verifier (on easy domains), scheming could still arise.

Though imperfect verifiers increase P(scheming), better verifiers increase the domain of "easy" tasks, etc.

Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time, so it would completely ignore the proposed mechanism to take the models produced by scalable oversight and replace them with adversarial models. (You could start out with adversarial models and see whether scalable oversight fixes them, but that's an alignment stress test.)

I have a similar confusion (see my comment here) but seems like at least Ryan wants control evaluations ... (read more)

I think the concern is less "am I making intellectual progress on some project" and more "is the project real / valuable" 

2Steven Byrnes
I didn’t read the OP that way (but no point in arguing about the author’s intentions). For sure, I, like anyone, am perfectly capable of getting curious about, and then spending lots of time to figure out, something that’s not actually important to figure out in the first place. Note the quote that I chose to put at the top of my recent research agenda update post. :)
Oliver Daniels*Ω7168

IMO most exciting mech-interp research since SAEs, great work. 

A few thoughts / questions:

  • curious about your optimism regarding learned masks as an attribution method - seems like the problem of learning mechanisms that don't correspond to model mechanisms is real for circuits (see Interp Bench) and would plausibly bite here too (though should be able to resolve this with benchmarks on downstream tasks once APD is more mature)
  • relatedly, the hidden layer "minimality" loss is really cool, excited to see how useful this is in mitigating the problem above (
... (read more)
4Lee Sharkey
Agree! I'd be excited by work that uses APD for MAD, or even just work that applies APD to Boolean circuit networks. We did consider using them as a toy model at various points, but ultimately opted to go for other toy models instead.  (btw typo: *APD) 
2Lee Sharkey
I think so too! (assuming it can be made more robust and scaled, which I think it can)  And thanks! :) 
5Lucius Bushnaq
We think this may not be a problem here, because the definition of parameter component 'activity' is very constraining. See Appendix section A.1. To count as inactive, it's not enough for components to not influence the output if you turn them off: every point on every possible monotonic trajectory between all components being on, and only the components deemed 'active' being on, has to give the same output. If you (approximately) check for this condition, I think the function that picks the learned masks can kind of be as expressive as it likes, because the sparse forward pass can't rely on the mask to actually perform any useful computational labor.

Conceptually, this is maybe one of the biggest differences between APD and something like, say, a transcoder or a crosscoder. It's why it doesn't seem to me like there'd be an analog to 'feature splitting' in APD. If you train a transcoder on a d-dimensional linear transformation, it will learn ever sparser approximations of this transformation the larger you make the transcoder dictionary, with no upper limit. If you train APD on a d-dimensional linear transformation, provided it's tuned right, I think it should learn a single d-dimensional component, regardless of how much larger than d you make the component dictionary. Because if it tried to learn more components than that to get a sparser solution, it wouldn't be able to make the components sum to the original model weights anymore.

Despite this constraint on its structure, I think APD plausibly has all the expressiveness it needs, because even when there is an overcomplete basis of features in activation space, circuits-in-superposition math and information theory both suggest that you can't have an overcomplete basis of mechanisms in parameter space. So it seems to me that you can just demand that components must compose linearly, without that restricting their ability to represent the structure of the target model. And that demand then really limits the abili

makes sense - I think I had in mind something like "estimate P(scheming | high reward) by evaluating P(high reward | scheming)". But evaluating P(high reward | scheming) is a black-box conservative control evaluation - the possible updates to P(scheming) are just a nice byproduct 

We initially called these experiments “control evaluations”; my guess is that it was a mistake to use this terminology, and we’ll probably start calling them “black-box plan-conservative control evaluations” or something.

also seems worth disambiguating "conservative evaluations" from "control evaluations" - in particular, as you suggest, we might want to assess scalable oversight methods under conservative assumptions (to be fair, the distinction isn't all that clean - your oversight process can both train and monitor a policy. Still, I associate control m... (read more)

3ryan_greenblatt
We would consider "black-box conservative evaluations on whether our AIs could subvert our (scalable) oversight techniques" to be a special case of black-box conservative control evaluations. Insofar as you are exploring assumptions other than "the AI is trying to subvert our safeguards", I wouldn't consider it to be control.

Thanks for the thorough response, and apologies for missing the case study! 

I think I regret / was wrong about my initial vaguely negative reaction - scaling SAE circuit discovery to large models is a notable achievement!

Re residual skip SAEs: I'm basically on board with "only use residual stream SAEs", but skipping layers still feels unprincipled. Like imagine if you only trained an SAE on the final layer of the model. By including all the features, you could perfectly recover the model behavior up to the SAE reconstruction loss, but you would have ~... (read more)

2Jatin Nainani
Yes -- By design, the circuits discovered in this manner might miss how/when something is computed. But we argue that finding the important representations at bottlenecks and their change over layers can provide important/useful information about the model.  One of our future directions, along the direction of crosscoders, is to have "Layer Output Buffer SAEs" that aim to tackle the computation between bottlenecks. 

This post (and the accompanying paper) introduced empirical benchmarks for detecting "measurement tampering" - when AI systems alter measurements used to evaluate them. 

Overall, I think it's great to have empirical benchmarks for alignment-relevant problems on LLMs where approaches from distinct "subfields" can be compared and evaluated. The post and paper do a good job of describing and motivating measurement tampering, justifying the various design decisions (though some of the tasks are especially convoluted). 

A few points of criticism:
- the d... (read more)

I'm not that convinced that attribution patching is better than ACDC - as far as I can tell Syed et al. only measure ROC with respect to "ground truth" (manually discovered) circuits and not faithfulness, completeness, etc. Also Interp Bench finds ACDC is better than attribution patching

Nice post! 

My notes / thoughts: (apologies for overly harsh/critical tone, I'm stepping into the role of annoying reviewer)

Summary

  • Use residual stream SAEs spaced across layers, discover node circuits with learned binary masking on templated data.
  • binary masking outperforms integrated gradients on faithfulness metrics, and achieves comparable (though maybe narrowly worse) completeness metrics.
  • demonstrated approach on code output prediction
     

Strengths: 

  • to the best of my knowledge, first work to demonstrate learned binary masks for circuit disco
... (read more)
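For readers unfamiliar with the method, here's a minimal, hypothetical sketch of circuit discovery via a learned binary mask over cached SAE latents (a straight-through estimator plus a sparsity penalty). All names and shapes are my own toy stand-ins, not the post's actual setup:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in: which of 16 cached "SAE latents" does a frozen readout need?
n_latents, n_out = 16, 4
latents = torch.randn(256, n_latents)        # cached SAE activations
readout = torch.nn.Linear(n_latents, n_out)  # frozen downstream computation
for p in readout.parameters():
    p.requires_grad_(False)
with torch.no_grad():
    target = readout(latents)                # full-model behavior to stay faithful to

mask_logits = torch.zeros(n_latents, requires_grad=True)
opt = torch.optim.Adam([mask_logits], lr=0.1)
lam = 1e-2                                   # sparsity penalty weight

for _ in range(200):
    probs = torch.sigmoid(mask_logits)
    hard = (probs > 0.5).float()
    mask = hard + probs - probs.detach()     # straight-through estimator
    # faithfulness (match full model) + sparsity (few active nodes)
    loss = F.mse_loss(readout(latents * mask), target) + lam * probs.sum()
    opt.zero_grad(); loss.backward(); opt.step()

circuit = (torch.sigmoid(mask_logits) > 0.5).nonzero().flatten()
print(f"kept {len(circuit)}/{n_latents} latents in the circuit")
```

The faithfulness/completeness metrics in the summary then ask how well the model behaves with only the kept latents (or only the dropped ones) in place.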
1Jatin Nainani
Thanks a lot for this review! 

On strengths, we also believe that we are the first to examine "few SAEs" for scalable circuit discovery. 

On weaknesses:
  • While we plan to do a more thorough sweep of SAE placements and comparison, the first weakness remains true for this post.
  • Our major argument for the support of using few SAEs is imagining them as interpretable bottlenecks. Because they are so minimal and interpretable, they allow us to understand blocks of the transformer between them functionally (in terms of input and output). We were going to include more intuition about this but were worried it might add unnecessary complications. We mention the fact about the residual stream to highlight that information cannot be passed to layer L+1 by any other path than the residual output of layer L. Thus, by training a mask at layer L, we find a minimal set of representations needed for future layers. To future layers, nothing other than these latents matters. We do agree that the nature of circuits found with coarse-grained SAEs will differ, and this needs to be further studied.
  • We plan to explore the "gender bias removal" of Marks et al. [1] to compare the downstream application effectiveness. However, we do have a small application where we found a "bug" in the model, covered in section 5, where it over-relies on duplicate token latents. We can try to do something similar to Marks et al. [1] in trying to "fix" this bug.
  • Thanks for sharing the citation!

A core question shared in the community is whether the idea of circuits is plausible as models continue to scale up. Current automated methods either are too computationally expensive or generate a subgraph that is too large to examine. We explore the idea of a few equally spaced SAEs with the goal of solving both those issues. Though as you mentioned, a more thorough comparison between circuits of different numbers of SAEs is needed. 

curious if you have takes on the right balance between clean research code / infrastructure and moving fast/being flexible. Maybe its some combination of 

  • get hacky results ASAP
  • lean more towards functional programming / general tools and away from object-oriented programming / frameworks (especially early in projects where the abstractions/experiments/research questions are more dynamic), but don't sacrifice code quality and standard practices
3Erik Jenner
Some heuristics (not hard rules):
  • ~All code should start as a hacky jupyter notebook (like your first point)
  • As my codebase grows larger and messier, I usually hit a point where it becomes more aversive to work with because I don't know where things are, there's too much duplication, etc. Refactor at that point.
  • When refactoring, don't add abstractions just because they might become useful in the future. (You still want to think about future plans somewhat of course; maybe the heuristic is to not write code that's much more verbose than necessary right now, in the hope that it will pay off in the future.)

These are probably geared toward people like me who tend to over-engineer; someone who's currently unhappy that their code is always a mess might need different ones. I don't know whether functional programming is fundamentally better in this respect than object-oriented.

yeah fair - my main point is that you could have a reviewer reputation system without de-anonymizing reviewers on individual papers

(alternatively, de-anonymizing reviews might improve the incentives to write good reviews on the current margin, but would also introduce other bad incentives towards sycophancy etc. which academics seem deontically opposed to)

From what I understand, reviewing used to be a non-trivial part of an academic's reputation, but relied on much smaller academic communities (somewhat akin to Dunbar's number). So in some sense I'm not proposing a new reputation system, but a mechanism for scaling an existing one (but yeah, trying to get academics to care about a new reputation metric does seem like a pretty big lift)

I don't really follow the market-place analogy - in a more ideal setup, reviewers would be selling a service to the conferences/journals in exchange for reputation (and possib... (read more)

Yeah this stuff might help somewhat, but I think the core problem remains unaddressed: ad-hoc reputation systems don't scale to thousands of researchers. 

It feels like something basic like "have reviewers / area chairs rate other reviewers, and post un-anonymized cumulative reviewer ratings" (a kind of h-index for review quality) might go a long way. The double-blind structure is maintained, while providing more incentive (in terms of status, and maybe direct monetary reward) for writing good reviews. 

2Shankar Sivarajan
It's almost always only single-blind: the reviewers usually know who the authors are.
1Daniel Tan
Interesting. You’re essentially trying to set up an alternative reputation system I guess. But I don’t see what the incentive is for academics to buy into this new reputation system when they already have one (h-index). Also don’t see what the incentive is for giving honest ratings to other reviewers. Intuition pump: Most marketplace platforms allow buyers and sellers to rate each other. This has direct usefulness to both because it influences who you buy from / sell to. Therefore there is immediate buy-in. However, reviewing doesn’t work like this because authors and reviewers aren’t exercising much individual agency (nor should they) in determining what papers to review.

does anyone have thoughts on how to improve peer review in academic ML? From discussions with my advisor, my sense is that the system used to depend on word of mouth and people caring more about their academic reputation, which works in a field of 100s of researchers but breaks down in fields of 1000s+. Seems like we need some kind of karma system to both rank reviewers and submissions. I'd be very surprised if nobody has proposed such a system, but a quick google search doesn't yield results. 

I think reforming peer review is probably underrated fr... (read more)

2lunatic_at_large
I think this professor has relevant interests: https://www.cs.cmu.edu/~nihars/. 
2Daniel Tan
I think requiring authors to also review papers is a pretty good way to both (i) ensure there are enough reviewers for any given subdiscipline and (ii) at least somewhat kick-start healthier review culture. My impression is that many academics don't see reviewing as part of their responsibilities, and forcing it on them might change this.  I feel like improving the way papers are assigned to reviewers would also do a lot. The worst reviews I submitted were when I wasn't well-versed or interested in the topic the paper was on. 

yeah I was mostly thinking neutral along the axis of "safety-ism" vs "accelerationism" (I think there's a fairly straightforward right-wing bias on X, further exacerbated by Bluesky)

Two common failure modes to avoid when doing the legibly impressive things

1. Only caring instrumentally about the project (decreases motivation)

2. Doing "net negative" projects 

Is the move of a lot of alignment discourse to Twitter/X a coordination failure or a positive development? 

I'm kinda sad that LW seems less "alive" than it did a few years ago, but also seems healthy to be engaging in a more neutral space with a wider audience

5ryan_greenblatt
I don't think Twitter/X is really a very neutral space in general. I agree it maybe is more neutral from the perspective of the AI/ML community. (And certainly feels that way from the perspective of many people in that community.)

Yeah it does seem unfortunate that there's not a robust pipeline for tackling the "hard problem" (even conditional on more "moderate" models of x-risk) 

But (conditioned on "moderate" models) there's still a lot of low-hanging fruit that STEM people from average universities (a group I count myself among) can pick. Like it seems good for Alice to bounce off of ELK and work on technical governance, and for Bob to make incremental progress on debate. The current pipeline/incentive system is still valuable, even if it systematically neglects tackling the "hard problem of alignment".

still trying to figure out the "optimal" config setup. The "clean code" method is roughly to have dedicated config files for different components that can be composed and overridden etc (see for example, https://github.com/oliveradk/measurement-pred). But I don't like how far away these configs are from the main code. On the other hand, as the experimental setup gets more mature I often want to toggle across config groups. Maybe the solution is making "mode" an optional config group itself, with overrides defined within the main script

just read both posts and they're great (as is The Witness). It's funny though, part of me wants to defend OOP - I do think there's something to finding really good abstractions (even preemptively), but that it's typically not worth it for self-contained projects with small teams and fixed time horizons (e.g. ML research projects, but also maybe indie games). 

The builder-breaker thing isn't unique to CoT though right? My gloss on the recent Obfuscated Activations paper is something like "activation engineering is not robust to arbitrary adversarial optimization, and only somewhat robust to contained adversarial optimization".

3Daniel Kokotajlo
Correct. There are several other agendas that have this nice property too.

thanks for the detailed (non-ML) example!  exactly the kind of thing I'm trying to get at

Thanks! huh yeah the Python interactive window seems like a much cleaner approach, I'll give it a try

thanks! yup Cursor is notebook compatible

I wish there was BibTeX functionality for alignment forum posts...

9habryka
Yeah, IMO we should just add a bunch of functionality for integrating alignment forum stuff more with academic things. It’s been on my to-do list for a long time.

I'm curious if Redwood would be willing to share a kind of "after action report" for why they stopped working on ELK/heuristic argument inspired stuff (e.g. Causal Scrubbing, Path Patching, Generalized Wick Decompositions, Measurement Tampering)

My impression is it's some mix of: 

a. Control seems great 

b. Heuristic arguments are a bad bet (for some of the reasons mech interp is a bad bet)

c.  ARC has it covered

But the weighting is pretty important here. If it's
a.  more people should be working on heuristic argument inspired stuff. 

b. les... (read more)

(The community often calls this “scalable oversight”, but we want to be clear that this does not necessarily include scaling to large numbers of situations, as in monitoring.)

I like this terminology and think the community should adopt it

Just to make it explicit and check my understanding - the residual decomposition is equivalent to edge / factorized view of the transformer in that we can express any term in the residual decomposition as a set of edges that form a path from input to output, e.g

input -> output

input -> Attn 1.0 -> MLP 2 -> Attn 4.3 -> output 

And it follows that the (pre final layernorm) output of a transformer is the sum of all the "paths" from input to output constructed from the factorized DAG. 
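A quick numerical sanity check of that sum-over-paths identity, using purely linear toy "components" (a hypothetical stand-in: real attention heads and MLPs are nonlinear, but the residual-stream bookkeeping is the same):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Three linear "components" writing into the residual stream:
# layer l computes x <- x + W_l x, i.e. x <- (I + W_l) x.
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
x = rng.normal(size=d)

# Full forward pass through the residual stream.
out = x.copy()
for W in Ws:
    out = out + W @ out

# Sum over all "paths": every ordered subsequence of components,
# including the empty path (the pure skip connection input -> output).
path_sum = np.zeros(d)
for r in range(len(Ws) + 1):
    for idxs in itertools.combinations(range(len(Ws)), r):
        v = x.copy()
        for i in idxs:          # apply components in layer order
            v = Ws[i] @ v
        path_sum += v

print("residual decomposition matches:", np.allclose(out, path_sum))
```

This is just the expansion of the product (I + W_3)(I + W_2)(I + W_1): each of the 2^3 terms picks a subset of components, which is exactly one path through the factorized DAG.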

2Joseph Miller
Actually I think the residual decomposition is incorrect - see my other comment.
1Jordan Taylor
Fixed links:  https://huggingface.co/oliverdk/codegen-350M-mono-measurement_pred  https://github.com/oliveradk/measurement-pred

just read "Situational Awareness" - it definitely woke me up. AGI is real, and very plausibly (55%?) happening within this decade. I need to stop sleepwalking and get serious about contributing within the next two years.

First, some initial thoughts on the essay

  • Very "epic" and (self?) aggrandizing. If you believe the conclusions, it's not unwarranted, but I worry a bit about narratives that satiate some sense of meaning and self-importance. (That counter-reaction is probably much stronger though, and on the margin it seems really valuable to "full-throatily
... (read more)

I previously thought the argument for measurement tampering being more tractable than general ELK was mostly about the structural / causal properties of multiple independent measurements, but I think I'm more swayed by the argument that measurement tampering will just be more obvious (both easier to see using interpretability and more anomalous in general) than e.g. sycophancy. This is a flimsier argument though, and is less likely to hold when tampering is more subtle.

Here's a revised sketch 

A few notes:

  • I use Scalable Oversight to refer to both Alignment and Control 
  • I'm confused whether weak to strong learning is a restatement of scalable oversight, ELK, or its own thing, so I ignore it 
  • I don't explicitly include easy-to-hard, I think OOD basically covers it
  • taxonomies and abstractions are brittle and can be counterproductive

Scalable Oversight Taxonomy

  • Scalable Oversight
    • Scalable Alignment
      • Benchmarks / Tasks
        • Sandwiching Experiments (human amateurs + model, gt from human experts)
        • Weak models supervising Strong m
... (read more)

I think I'm mostly right, but using a somewhat confused frame. 

It makes more sense to think of MAD approaches as detecting all abnormal reasons (including deceptive alignment) by default, and then if we get that working we'll try to decrease false anomalies by doing something like comparing the least common ancestor of the measurements in a novel mechanism to the least common ancestor of the measurements on trusted mechanisms. 

One confusion I have with MAD as an approach to ELK is that it seems to assume some kind of initial inner alignment. If we're flagging when the model takes actions / makes predictions for "unusual reasons", where unusual is defined with respect to some trusted set, but aligned and misaligned models are behaviorally indistinguishable on the trusted set, then a model could learn to do things for misaligned reasons on the trusted set, and then use those same reasons on the untrusted set. For example, a deceptively aligned model would appear aligned in training... (read more)

1Oliver Daniels
I think I'm mostly right, but using a somewhat confused frame.  It makes more sense to think of MAD approaches as detecting all abnormal reasons (including deceptive alignment) by default, and then if we get that working we'll try to decrease false anomalies by doing something like comparing the least common ancestor of the measurements in a novel mechanism to the least common ancestor of the measurements on trusted mechanisms.     

Clarifying the relationship between mechanistic anomaly detection (MAD), measurement tampering detection (MTD), weak to strong generalization (W2SG), weak to strong learning (W2SL), and eliciting latent knowledge (ELK). (Nothing new or interesting here, I just often lose track of these relationships in my head)

eliciting latent knowledge is an approach to scalable oversight which hopes to use the latent knowledge of a model as a supervision signal or oracle. 

weak to strong learning is an experimental setup for evaluating scalable oversight protocols, ... (read more)

3Erik Jenner
Nice overview, agree with most of it!

You could also distinguish between weak-to-strong generalization, where you have a weak supervision signal on the entire distribution (which may sometimes be wrong), and easy-to-hard generalization, where you have a correct supervision signal but only on an easy part of the distribution. Of course both of these are simplifications. In reality, I'd expect the setting to be more like: you have a certain weak supervision budget (or maybe even budgets at different levels of strength), and you can probably decide how to spend the budget. You might only have an imperfect sense of which cases are "easy" vs "hard" though.

I think going from MAD to a fully general ELK solution requires some extra ingredients. In practice, the plan might be to do MTD and then use the AI in ways such that this is enough (rather than needing a fully general ELK solution). This is related to narrow elicitation, though MTD seems even narrower. Even for MTD, you probably need something to bridge the easy-to-hard gap, but at least for that there are specific proposals that seem plausible (this or, as a more concrete instance, exclusion fine-tuning from the Redwood MTD paper). I think it could turn out that general/worst-case solutions to MAD and ELK run into very similar obstacles, but I don't think a practical MAD solution (e.g. contingent on empirical facts about deep learning) obviously lets you solve ELK.

I would also add that you could motivate MAD as a method to deal with scheming (or other high-stakes failures). In that case, the things to compare against most naturally might look a bit different (e.g. AI control, coup probes, interpretability-themed things); and it wouldn't make sense to compare against debate in that setting. I think most mainstream ML problems that are similar to MAD are closer to this than to scalable oversight.

oh I see, by all(sensor_preds) I meant sum(logit_i for i in range(n_sensors)) (the probability that all sensors are activated). Makes sense, thanks!

is individual measurement prediction AUROC a) or b)
a) mean(AUROC(sensor_i_pred, sensor_i)) 

b) AUROC(all(sensor_preds), all(sensors))

2Fabien Roger
We compute AUROC(all(sensor_preds), all(sensors)). This is somewhat weird, and it would have been slightly better to do a) (thanks for pointing it out!), but I think the numbers for both should be close since we balance classes (for most settings, if I recall correctly) and the estimates are calibrated (since they are trained in-distribution, there is no generalization question here), so it doesn't matter much.

The relevant pieces of code can be found by searching for "sensor auroc":

cat_positives = torch.cat([one_data["sensor_logits"][:, i][one_data["passes"][:, i]] for i in range(nb_sensors)])
cat_negatives = torch.cat([one_data["sensor_logits"][:, i][~one_data["passes"][:, i]] for i in range(nb_sensors)])
m, s = compute_boostrapped_auroc(cat_positives, cat_negatives)
print(f"sensor auroc pn {m:.3f}±{s:.3f}")
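To make the a)-vs-b) distinction concrete, here's a small self-contained sketch on synthetic data (all names and numbers hypothetical); the "pooled" variant mirrors the per-sensor concatenation in the quoted code:

```python
import numpy as np

rng = np.random.default_rng(0)

def auroc(scores, labels):
    # Rank-based AUROC: P(random positive is scored above random negative).
    pos, neg = scores[labels], scores[~labels]
    return (pos[:, None] > neg[None, :]).mean()

# Hypothetical sensor outcomes/logits, shape (n_examples, n_sensors).
n, n_sensors = 500, 3
sensors = rng.random((n, n_sensors)) > 0.5
logits = rng.normal(size=(n, n_sensors)) + 2.0 * sensors  # informative preds

# a) mean of per-sensor AUROCs
per_sensor = np.mean([auroc(logits[:, i], sensors[:, i]) for i in range(n_sensors)])

# b) pool every (prediction, outcome) pair across sensors before scoring
pooled = auroc(logits.ravel(), sensors.ravel())

print(f"a) mean per-sensor: {per_sensor:.3f}  b) pooled: {pooled:.3f}")
```

With calibrated predictions and balanced classes the two numbers land close together, matching the claim above; they diverge when sensors differ in base rate or logit scale.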

looking at your code - seems like there's an option for next-token prediction in the initial finetuning stage, but no mention (that I can find) in the paper - am I correct in assuming the next-token prediction weight was set to 0? (apologies for bugging you on this stuff!)

[This comment is no longer endorsed by its author]
3Fabien Roger
That's right. We initially thought it might be important so that the LLM "understood" the task better, but it didn't matter much in the end. The main hyperparameters for our experiments are in train_ray.py, where you can see that we use a "token_loss_weight" of 0. (Feel free to ask more questions!)

did the paper report accuracy of the pure prediction model (on the pure prediction task)? (trying to replicate and want a sanity check). 

[This comment is no longer endorsed by its author]
3Fabien Roger
I think this is what you are looking for

I think mechanistic anomaly detection (mostly ARC but also Redwood and some forthcoming work) is importantly different than robustness (though clearly related).

(from conversation with Erik Jenner) roughly 3 classes of applications

  1. MTD all the way down 
    1. Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good or something). Use GPT-N to predict measurements, and detect measurement tampering, conditional on proposed actions of the policy. Reward actions that GPT-N predicts will increase measures of things going well without tampering. 
  2. Other Scalable Oversight + MTD as reward function / side constraint
    1. Use whatever scalable oversight protocol seems good (
... (read more)
3ryan_greenblatt
As far as (1) and similar things, you might find the narrow elicitation appendix in the ELK report and next several appendices (prior to generative model errors) worthwhile.

I’ve been thinking a little more about the high-level motivation of measurement tampering, and struggling to think through when measurement tampering detection itself is actually going to be useful for mitigating x-risk. Like is human/AI feedback considered a robust measurement device? If no, then what is the most alignment-relevant domain MTD could be applied to? If yes, do the structural properties of measurement that supposedly make it easier than general ELK still hold?

4ryan_greenblatt
The remarks by Erik Jenner seem good, but I would also just consider the following setup:

Imagine that we have a human looking at an AI's actions and a wide variety of predictions of the future conditional on executing this action. The human can make recursive queries to an AI assistant to attempt to better understand the action or the predictions (recursive reward modeling). These predictions can correspond to any observation a human might make in the future, e.g. you can imagine training our predictor to predict any of our possible future observations, including arbitrary tests we might run. The human will make a deliberate effort to look for evidence of tampering or other problems when looking at the predicted measurements. Evidence of tampering can look like inconsistent measurements that imply that the measurements are no longer tracking the variable of interest.

Now, what can go wrong?
  • Problematic consequences which couldn't at all be captured in a human looking at these measurements, because either:
    • The visible or human-understandable bad consequences take a long time to manifest. (Longer than the prediction window.)
    • The consequences aren't something that a human can even ever understand (e.g. the actions involve torturing invisible fairies that humans don't understand or know about.)
  • Problematic consequences which would "by default" be captured by measurements, but which can be covered up. In this case, we hope MTD is applicable.
  • Cases which are a mix of the prior two: not exactly "by default" captured, but also not impossible to capture.

When viewed from this perspective, MTD seems useful as we seem to often care about consequences that are in principle possible to measure and observe. And, this might correspond to the cases where MTD is applicable (e.g. consistency properties in measurements and/or explicit tampering (which can fail) is required). It's in principle possible to solve issues with consequences taking a very long time to m
1Oliver Daniels
(from conversation with Erik Jenner) roughly 3 classes of applications
  1. MTD all the way down
     Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good or something). Use GPT-N to predict measurements, and detect measurement tampering, conditional on proposed actions of the policy. Reward actions that GPT-N predicts will increase measures of things going well without tampering.
  2. Other Scalable Oversight + MTD as reward function / side constraint
     Use whatever scalable oversight protocol seems good (Debate, IDA, etc) for the "primary" training signal, while also setting up a bunch of measurements of safety properties (e.g. integrity of the binaries on the server the model is hosted on). Do RL on the scalable oversight signal while constantly screening for and negatively rewarding actions flagged as tampering.
  3. Other Scalable Oversight + MTD as extra safety check
     Same as before, but you're leaning more on the scalable oversight to avoid tampering in most cases, and detecting measurement tampering as a precautionary step (i.e. if you detect measurement tampering you Pause/Stop).

(some ways in which these classes overlap, e.g. using your scalable oversight protocol as the measurement(s))

Strongly agree, and also want to note that wire-heading is (almost?) always a (near?) optimal policy - i.e. trajectories that tamper with the reward signal and produce high reward will be strongly upweighted, and insofar as the model has sufficient understanding/situational awareness of the reward process and some reasonable level of goal-directedness, this upweighting could plausibly induce a policy explicitly optimizing the reward. 
