All of Oliver Daniels's Comments + Replies

Map on "coherence over long time horizon" / agency. I suspect contamination is much worse for chess (so many direct examples of board configurations and moves) 

I've been confused about what people mean when they say "trend lines indicate AGI by 2027" - seems like it's basically this?

1Julian Bradshaw
More than just this. OP actually documents it pretty well, see here.

also rhymes with/is related to ARC's work on presumption of independence applied to neural networks (e.g. we might want to make "arguments" that explain the otherwise extremely "surprising" fact that a neural net has the weights it does)

Instead I would describe the problem as arising from a generator and verifier mismatch: when the generator is much stronger than the verifier, the generator is incentivized to fool the verifier without completing the task successfully.

I think these are related but separate problems - even with a perfect verifier (on easy domains), scheming could still arise.

Though imperfect verifiers increase P(scheming), better verifiers increase the domain of "easy" tasks, etc.

Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time, so it would completely ignore the proposed mechanism to take the models produced by scalable oversight and replace them with adversarial models. (You could start out with adversarial models and see whether scalable oversight fixes them, but that's an alignment stress test.)

I have a similar confusion (see my comment here) but seems like at least Ryan wants control evaluations ... (read more)

I think the concern is less "am I making intellectual progress on some project" and more "is the project real / valuable" 

2Steven Byrnes
I didn’t read the OP that way (but no point in arguing about the author’s intentions). For sure, I, like anyone, am perfectly capable of getting curious about, and then spending lots of time to figure out, something that’s not actually important to figure out in the first place. Note the quote that I chose to put at the top of my recent research agenda update post. :)
Oliver Daniels*Ω7168

IMO most exciting mech-interp research since SAEs, great work. 

A few thoughts / questions:

  • curious about your optimism regarding learned masks as an attribution method - seems like the problem of learning mechanisms that don't correspond to model mechanisms is real for circuits (see Interp Bench) and would plausibly bite here too (though should be able to resolve this with benchmarks on downstream tasks once APD is more mature)
  • relatedly, the hidden layer "minimality" loss is really cool, excited to see how useful this is in mitigating the problem above (
... (read more)
4Lee Sharkey
Agree! I'd be excited by work that uses APD for MAD, or even just work that applies APD to Boolean circuit networks. We did consider using them as a toy model at various points, but ultimately opted to go for other toy models instead.  (btw typo: *APD) 
2Lee Sharkey
I think so too! (assuming it can be made more robust and scaled, which I think it can)  And thanks! :) 
5Lucius Bushnaq
We think this may not be a problem here, because the definition of parameter component 'activity' is very constraining. See Appendix section A.1. To count as inactive, it's not enough for components to not influence the output if you turn them off: every point on every possible monotonic trajectory between all components being on, and only the components deemed 'active' being on, has to give the same output. If you (approximately) check for this condition, I think the function that picks the learned masks can kind of be as expressive as it likes, because the sparse forward pass can't rely on the mask to actually perform any useful computational labor.

Conceptually, this is maybe one of the biggest differences between APD and something like, say, a transcoder or a crosscoder. It's why it doesn't seem to me like there'd be an analog to 'feature splitting' in APD. If you train a transcoder on a d-dimensional linear transformation, it will learn ever sparser approximations of this transformation the larger you make the transcoder dictionary, with no upper limit. If you train APD on a d-dimensional linear transformation, provided it's tuned right, I think it should learn a single d-dimensional component, regardless of how much larger than d you make the component dictionary. Because if it tried to learn more components than that to get a sparser solution, it wouldn't be able to make the components sum to the original model weights anymore.

Despite this constraint on its structure, I think APD plausibly has all the expressiveness it needs, because even when there is an overcomplete basis of features in activation space, circuits-in-superposition math and information theory both suggest that you can't have an overcomplete basis of mechanisms in parameter space. So it seems to me that you can just demand that components must compose linearly, without that restricting their ability to represent the structure of the target model. And that demand then really limits the abili

makes sense - I think I had in mind something like "estimate P(scheming | high reward) by evaluating P(high reward | scheming)". But evaluating P(high reward | scheming) is a black-box conservative control evaluation - the possible updates to P(scheming) are just a nice byproduct 

We initially called these experiments “control evaluations”; my guess is that it was a mistake to use this terminology, and we’ll probably start calling them “black-box plan-conservative control evaluations” or something.

also seems worth disambiguating "conservative evaluations" from "control evaluations" - in particular, as you suggest, we might want to assess scalable oversight methods under conservative assumptions (to be fair, the distinction isn't all that clean - your oversight process can both train and monitor a policy. Still, I associate control m... (read more)

3ryan_greenblatt
We would consider "black-box conservative evaluations on whether our AIs could subvert our (scalable) oversight techniques" to be a special case of black-box conservative control evaluations. Insofar as you are exploring assumptions other than "the AI is trying to subvert our safeguards", I wouldn't consider it to be control.

Thanks for the thorough response, and apologies for missing the case study! 

I think I regret / was wrong about my initial vaguely negative reaction - scaling SAE circuit discovery to large models is a notable achievement!

Re residual skip SAEs: I'm basically on board with "only use residual stream SAEs", but skipping layers still feels unprincipled. Like imagine if you only trained an SAE on the final layer of the model. By including all the features, you could perfectly recover the model behavior up to the SAE reconstruction loss, but you would have ~... (read more)

2Jatin Nainani
Yes -- By design, the circuits discovered in this manner might miss how/when something is computed. But we argue that finding the important representations at bottlenecks and their change over layers can provide important/useful information about the model.  One of our future directions, along the direction of crosscoders, is to have "Layer Output Buffer SAEs" that aim to tackle the computation between bottlenecks. 

This post (and the accompanying paper) introduced empirical benchmarks for detecting "measurement tampering" - when AI systems alter measurements used to evaluate them. 

Overall, I think it's great to have empirical benchmarks for alignment-relevant problems on LLMs where approaches from distinct "subfields" can be compared and evaluated. The post and paper do a good job of describing and motivating measurement tampering, justifying the various design decisions (though some of the tasks are especially convoluted). 

A few points of criticism:
- the d... (read more)

I'm not that convinced that attribution patching is better than ACDC - as far as I can tell Syed et al. only measure ROC with respect to "ground truth" (manually discovered) circuits and not faithfulness, completeness, etc. Also Interp Bench finds ACDC is better than attribution patching

Nice post! 

My notes / thoughts: (apologies for overly harsh/critical tone, I'm stepping into the role of annoying reviewer)

Summary

  • Use residual stream SAEs spaced across layers, discover node circuits with learned binary masking on templated data.
  • binary masking outperforms integrated gradients on faithfulness metrics, and achieves comparable (though maybe narrowly worse) completeness metrics.
  • demonstrated approach on code output prediction
     

Strengths: 

  • to the best of my knowledge, first work to demonstrate learned binary masks for circuit disco
... (read more)
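For readers unfamiliar with the method, here's a minimal, hypothetical sketch of circuit discovery via a learned binary mask over cached SAE latents (a straight-through estimator plus a sparsity penalty). All names and shapes are my own toy stand-ins, not the post's actual setup:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in: which of 16 cached "SAE latents" does a frozen readout need?
n_latents, n_out = 16, 4
latents = torch.randn(256, n_latents)        # cached SAE activations
readout = torch.nn.Linear(n_latents, n_out)  # frozen downstream computation
for p in readout.parameters():
    p.requires_grad_(False)
with torch.no_grad():
    target = readout(latents)                # full-model behavior to stay faithful to

mask_logits = torch.zeros(n_latents, requires_grad=True)
opt = torch.optim.Adam([mask_logits], lr=0.1)
lam = 1e-2                                   # sparsity penalty weight

for _ in range(200):
    probs = torch.sigmoid(mask_logits)
    hard = (probs > 0.5).float()
    mask = hard + probs - probs.detach()     # straight-through estimator
    # faithfulness (match full model) + sparsity (few active nodes)
    loss = F.mse_loss(readout(latents * mask), target) + lam * probs.sum()
    opt.zero_grad(); loss.backward(); opt.step()

circuit = (torch.sigmoid(mask_logits) > 0.5).nonzero().flatten()
print(f"kept {len(circuit)}/{n_latents} latents in the circuit")
```

The faithfulness/completeness metrics in the summary then ask how well the model behaves with only the kept latents (or only the dropped ones) in place.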
1Jatin Nainani
Thanks a lot for this review! 

On strengths, we also believe that we are the first to examine "few SAEs" for scalable circuit discovery. 

On weaknesses:
  • While we plan to do a more thorough sweep of SAE placements and comparison, the first weakness remains true for this post.
  • Our major argument for the support of using few SAEs is imagining them as interpretable bottlenecks. Because they are so minimal and interpretable, they allow us to understand blocks of the transformer between them functionally (in terms of input and output). We were going to include more intuition about this but were worried it might add unnecessary complications. We mention the fact about the residual stream to highlight that information cannot be passed to layer L+1 by any other path than the residual output of layer L. Thus, by training a mask at layer L, we find a minimal set of representations needed for future layers. To future layers, nothing other than these latents matters. We do agree that the nature of circuits found with coarse-grained SAEs will differ, and this needs to be further studied.
  • We plan to explore the "gender bias removal" of Marks et al. [1] to compare the downstream application effectiveness. However, we do have a small application where we found a "bug" in the model, covered in section 5, where it over-relies on duplicate token latents. We can try to do something similar to Marks et al. [1] in trying to "fix" this bug.
  • Thanks for sharing the citation!

A core question shared in the community is whether the idea of circuits is plausible as models continue to scale up. Current automated methods either are too computationally expensive or generate a subgraph that is too large to examine. We explore the idea of a few equally spaced SAEs with the goal of solving both those issues. Though as you mentioned, a more thorough comparison between circuits of different numbers of SAEs is needed. 

curious if you have takes on the right balance between clean research code / infrastructure and moving fast/being flexible. Maybe its some combination of 

  • get hacky results ASAP
  • lean more towards functional programming / general tools and away from object-oriented programming / frameworks (especially early in projects where the abstractions/experiments/research questions are more dynamic), but don't sacrifice code quality and standard practices
3Erik Jenner
Some heuristics (not hard rules):
  • ~All code should start as a hacky jupyter notebook (like your first point)
  • As my codebase grows larger and messier, I usually hit a point where it becomes more aversive to work with because I don't know where things are, there's too much duplication, etc. Refactor at that point.
  • When refactoring, don't add abstractions just because they might become useful in the future. (You still want to think about future plans somewhat of course; maybe the heuristic is to not write code that's much more verbose than necessary right now, in the hope that it will pay off in the future.)

These are probably geared toward people like me who tend to over-engineer; someone who's currently unhappy that their code is always a mess might need different ones. I don't know whether functional programming is fundamentally better in this respect than object-oriented.

yeah fair - my main point is that you could have a reviewer reputation system without de-anonymizing reviewers on individual papers

(alternatively, de-anonymizing reviews might improve the incentives to write good reviews on the current margin, but would also introduce other bad incentives towards sycophancy etc. which academics seem deontically opposed to)

From what I understand, reviewing used to be a non-trivial part of an academic's reputation, but relied on much smaller academic communities (somewhat akin to Dunbar's number). So in some sense I'm not proposing a new reputation system, but a mechanism for scaling an existing one (but yeah, trying to get academics to care about a new reputation metric does seem like a pretty big lift)

I don't really follow the market-place analogy - in a more ideal setup, reviewers would be selling a service to the conferences/journals in exchange for reputation (and possib... (read more)

Yeah this stuff might help somewhat, but I think the core problem remains unaddressed: ad-hoc reputation systems don't scale to thousands of researchers. 

It feels like something basic like "have reviewers / area chairs rate other reviewers, and post un-anonymized cumulative reviewer ratings" (a kind of h-index for review quality) might go a long way. The double-blind structure is maintained, while providing more incentive (in terms of status, and maybe direct monetary reward) for writing good reviews. 

2Shankar Sivarajan
It's almost always only single-blind: the reviewers usually know who the authors are.
1Daniel Tan
Interesting. You’re essentially trying to set up an alternative reputation system I guess. But I don’t see what the incentive is for academics to buy into this new reputation system when they already have one (h-index). Also don’t see what the incentive is for giving honest ratings to other reviewers. Intuition pump: Most marketplace platforms allow buyers and sellers to rate each other. This has direct usefulness to both because it influences who you buy from / sell to. Therefore there is immediate buy-in. However, reviewing doesn’t work like this because authors and reviewers aren’t exercising much individual agency (nor should they) in determining what papers to review.

does anyone have thoughts on how to improve peer review in academic ML? From discussions with my advisor, my sense is that the system used to depend on word of mouth and people caring more about their academic reputation, which works in a field of 100s of researchers but breaks down in fields of 1000s+. Seems like we need some kind of karma system to both rank reviewers and submissions. I'd be very surprised if nobody has proposed such a system, but a quick google search doesn't yield results. 

I think reforming peer review is probably underrated fr... (read more)

2lunatic_at_large
I think this professor has relevant interests: https://www.cs.cmu.edu/~nihars/. 
2Daniel Tan
I think requiring authors to also review papers is a pretty good way to both (i) ensure there are enough reviewers for any given subdiscipline and (ii) at least somewhat kick-start healthier review culture. My impression is that many academics don't see reviewing as part of their responsibilities, and forcing it on them might change this.  I feel like improving the way papers are assigned to reviewers would also do a lot. The worst reviews I submitted were when I wasn't well-versed or interested in the topic the paper was on. 

yeah I was mostly thinking neutral along the axis of "safety-ism" vs "accelerationism" (I think there's a fairly straightforward right-wing bias on X, further exacerbated by Bluesky)

Two common failure modes to avoid when doing the legibly impressive things

1. Only caring instrumentally about the project (decreases motivation)

2. Doing "net negative" projects 

Is the move of a lot of alignment discourse to Twitter/X a coordination failure or a positive development? 

I'm kinda sad that LW seems less "alive" than it did a few years ago, but also seems healthy to be engaging in a more neutral space with a wider audience

5ryan_greenblatt
I don't think Twitter/X is really a very neutral space in general. I agree it maybe is more neutral from the perspective of the AI/ML community. (And certainly feels that way from the perspective of many people in that community.)

Yeah it does seem unfortunate that there's not a robust pipeline for tackling the "hard problem" (even conditional on more "moderate" models of x-risk) 

But (conditioned on "moderate" models) there's still a lot of low-hanging fruit that STEM people from average universities (a group I count myself among) can pick. Like it seems good for Alice to bounce off of ELK and work on technical governance, and for Bob to make incremental progress on debate. The current pipeline/incentive system is still valuable, even if it systematically neglects tackling the "hard problem of alignment".

still trying to figure out the "optimal" config setup. The "clean code" method is roughly to have dedicated config files for different components that can be composed and overridden etc (see for example, https://github.com/oliveradk/measurement-pred). But I don't like how far away these configs are from the main code. On the other hand, as the experimental setup gets more mature I often want to toggle across config groups. Maybe the solution is making "mode" an optional config group itself, with overrides defined within the main script

just read both posts and they're great (as is The Witness). It's funny though, part of me wants to defend OOP - I do think there's something to finding really good abstractions (even preemptively), but that it's typically not worth it for self-contained projects with small teams and fixed time horizons (e.g. ML research projects, but also maybe indie games). 

The builder-breaker thing isn't unique to CoT though right? My gloss on the recent Obfuscated Activations paper is something like "activation engineering is not robust to arbitrary adversarial optimization, and only somewhat robust to contained adversarial optimization".

3Daniel Kokotajlo
Correct. There are several other agendas that have this nice property too.

thanks for the detailed (non-ML) example!  exactly the kind of thing I'm trying to get at

Thanks! huh yeah the Python interactive window seems like a much cleaner approach, I'll give it a try

thanks! yup Cursor is notebook compatible

I wish there was BibTeX functionality for alignment forum posts...

9habryka
Yeah, IMO we should just add a bunch of functionality for integrating alignment forum stuff more with academic things. It’s been on my to-do list for a long time.

I'm curious if Redwood would be willing to share a kind of "after action report" for why they stopped working on ELK/heuristic argument inspired stuff (e.g. Causal Scrubbing, Path Patching, Generalized Wick Decompositions, Measurement Tampering)

My impression is it's some mix of: 

a. Control seems great 

b. Heuristic arguments are a bad bet (for some of the reasons mech interp is a bad bet)

c.  ARC has it covered

But the weighting is pretty important here. If it's
a.  more people should be working on heuristic argument inspired stuff. 

b. les... (read more)

(The community often calls this “scalable oversight”, but we want to be clear that this does not necessarily include scaling to large numbers of situations, as in monitoring.)

I like this terminology and think the community should adopt it

Just to make it explicit and check my understanding - the residual decomposition is equivalent to edge / factorized view of the transformer in that we can express any term in the residual decomposition as a set of edges that form a path from input to output, e.g

input -> output

input -> Attn 1.0 -> MLP 2 -> Attn 4.3 -> output 

And it follows that the (pre final layernorm) output of a transformer is the sum of all the "paths" from input to output constructed from the factorized DAG. 
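A quick numerical sanity check of that sum-over-paths identity, using purely linear toy "components" (a hypothetical stand-in: real attention heads and MLPs are nonlinear, but the residual-stream bookkeeping is the same):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Three linear "components" writing into the residual stream:
# layer l computes x <- x + W_l x, i.e. x <- (I + W_l) x.
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
x = rng.normal(size=d)

# Full forward pass through the residual stream.
out = x.copy()
for W in Ws:
    out = out + W @ out

# Sum over all "paths": every ordered subsequence of components,
# including the empty path (the pure skip connection input -> output).
path_sum = np.zeros(d)
for r in range(len(Ws) + 1):
    for idxs in itertools.combinations(range(len(Ws)), r):
        v = x.copy()
        for i in idxs:          # apply components in layer order
            v = Ws[i] @ v
        path_sum += v

print("residual decomposition matches:", np.allclose(out, path_sum))
```

This is just the expansion of the product (I + W_3)(I + W_2)(I + W_1): each of the 2^3 terms picks a subset of components, which is exactly one path through the factorized DAG.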

2Joseph Miller
Actually I think the residual decomposition is incorrect - see my other comment.
1Jordan Taylor
Fixed links:  https://huggingface.co/oliverdk/codegen-350M-mono-measurement_pred  https://github.com/oliveradk/measurement-pred

just read "Situational Awareness" - it definitely woke me up. AGI is real, and very plausibly (55%?) happening within this decade. I need to stop sleepwalking and get serious about contributing within the next two years.

First, some initial thoughts on the essay

  • Very "epic" and (self?) aggrandizing. If you believe the conclusions, it's not unwarranted, but I worry a bit about narratives that satiate some sense of meaning and self-importance. (That counter-reaction is probably much stronger though, and on the margin it seems really valuable to "full-throatily
... (read more)

I previously thought the argument for measurement tampering being more tractable than general ELK was mostly about the structural / causal properties of multiple independent measurements, but I think I'm more swayed by the argument that measurement tampering will just be more obvious (both easier to see using interpretability and more anomalous in general) than e.g. sycophancy. This is a flimsier argument though, and is less likely to hold when tampering is more subtle.

Here's a revised sketch 

A few notes:

  • I use Scalable Oversight to refer to both Alignment and Control 
  • I'm confused whether weak to strong learning is a restatement of scalable oversight, ELK, or its own thing, so I ignore it 
  • I don't explicitly include easy-to-hard, I think OOD basically covers it
  • taxonomies and abstractions are brittle and can be counterproductive

Scalable Oversight Taxonomy

  • Scalable Oversight
    • Scalable Alignment
      • Benchmarks / Tasks
        • Sandwiching Experiments (human amateurs + model, gt from human experts)
        • Weak models supervising Strong m
... (read more)

I think I'm mostly right, but using a somewhat confused frame. 

It makes more sense to think of MAD approaches as detecting all abnormal reasons (including deceptive alignment) by default, and then if we get that working we'll try to decrease false anomalies by doing something like comparing the least common ancestor of the measurements in a novel mechanism to the least common ancestor of the measurements on trusted mechanisms. 

One confusion I have with MAD as an approach to ELK is that it seems to assume some kind of initial inner alignment. If we're flagging when the model takes actions / makes predictions for "unusual reasons", where unusual is defined with respect to some trusted set, but aligned and misaligned models are behaviorally indistinguishable on the trusted set, then a model could learn to do things for misaligned reasons on the trusted set, and then use those same reasons on the untrusted set. For example, a deceptively aligned model would appear aligned in training... (read more)

1Oliver Daniels
I think I'm mostly right, but using a somewhat confused frame.  It makes more sense to think of MAD approaches as detecting all abnormal reasons (including deceptive alignment) by default, and then if we get that working we'll try to decrease false anomalies by doing something like comparing the least common ancestor of the measurements in a novel mechanism to the least common ancestor of the measurements on trusted mechanisms.     

Clarifying the relationship between mechanistic anomaly detection (MAD), measurement tampering detection (MTD), weak to strong generalization (W2SG), weak to strong learning (W2SL), and eliciting latent knowledge (ELK). (Nothing new or interesting here, I just often lose track of these relationships in my head)

eliciting latent knowledge is an approach to scalable oversight which hopes to use the latent knowledge of a model as a supervision signal or oracle. 

weak to strong learning is an experimental setup for evaluating scalable oversight protocols, ... (read more)

3Erik Jenner
Nice overview, agree with most of it!

You could also distinguish between weak-to-strong generalization, where you have a weak supervision signal on the entire distribution (which may sometimes be wrong), and easy-to-hard generalization, where you have a correct supervision signal but only on an easy part of the distribution. Of course both of these are simplifications. In reality, I'd expect the setting to be more like: you have a certain weak supervision budget (or maybe even budgets at different levels of strength), and you can probably decide how to spend the budget. You might only have an imperfect sense of which cases are "easy" vs "hard" though.

I think going from MAD to a fully general ELK solution requires some extra ingredients. In practice, the plan might be to do MTD and then use the AI in ways such that this is enough (rather than needing a fully general ELK solution). This is related to narrow elicitation, though MTD seems even narrower. Even for MTD, you probably need something to bridge the easy-to-hard gap, but at least for that there are specific proposals that seem plausible (this or, as a more concrete instance, exclusion fine-tuning from the Redwood MTD paper). I think it could turn out that general/worst-case solutions to MAD and ELK run into very similar obstacles, but I don't think a practical MAD solution (e.g. contingent on empirical facts about deep learning) obviously lets you solve ELK.

I would also add that you could motivate MAD as a method to deal with scheming (or other high-stakes failures). In that case, the things to compare against most naturally might look a bit different (e.g. AI control, coup probes, interpretability-themed things); and it wouldn't make sense to compare against debate in that setting. I think most mainstream ML problems that are similar to MAD are closer to this than to scalable oversight.

oh I see, by all(sensor_preds) I meant sum(logit_i for i in range(n_sensors)) (the probability that all sensors are activated). Makes sense, thanks!

is individual measurement prediction AUROC a) or b)
a) mean(AUROC(sensor_i_pred, sensor_i)) 

b) AUROC(all(sensor_preds), all(sensors))

2Fabien Roger
We compute AUROC(all(sensor_preds), all(sensors)). This is somewhat weird, and it would have been slightly better to do a) (thanks for pointing it out!), but I think the numbers for both should be close since we balance classes (for most settings, if I recall correctly) and the estimates are calibrated (since they are trained in-distribution, there is no generalization question here), so it doesn't matter much.

The relevant pieces of code can be found by searching for "sensor auroc":

cat_positives = torch.cat([one_data["sensor_logits"][:, i][one_data["passes"][:, i]] for i in range(nb_sensors)])
cat_negatives = torch.cat([one_data["sensor_logits"][:, i][~one_data["passes"][:, i]] for i in range(nb_sensors)])
m, s = compute_boostrapped_auroc(cat_positives, cat_negatives)
print(f"sensor auroc pn {m:.3f}±{s:.3f}")
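To make the a)-vs-b) distinction concrete, here's a small self-contained sketch on synthetic data (all names and numbers hypothetical); the "pooled" variant mirrors the per-sensor concatenation in the quoted code:

```python
import numpy as np

rng = np.random.default_rng(0)

def auroc(scores, labels):
    # Rank-based AUROC: P(random positive is scored above random negative).
    pos, neg = scores[labels], scores[~labels]
    return (pos[:, None] > neg[None, :]).mean()

# Hypothetical sensor outcomes/logits, shape (n_examples, n_sensors).
n, n_sensors = 500, 3
sensors = rng.random((n, n_sensors)) > 0.5
logits = rng.normal(size=(n, n_sensors)) + 2.0 * sensors  # informative preds

# a) mean of per-sensor AUROCs
per_sensor = np.mean([auroc(logits[:, i], sensors[:, i]) for i in range(n_sensors)])

# b) pool every (prediction, outcome) pair across sensors before scoring
pooled = auroc(logits.ravel(), sensors.ravel())

print(f"a) mean per-sensor: {per_sensor:.3f}  b) pooled: {pooled:.3f}")
```

With calibrated predictions and balanced classes the two numbers land close together, matching the claim above; they diverge when sensors differ in base rate or logit scale.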

looking at your code - seems like there's an option for next-token prediction in the initial finetuning stage, but no mention (that I can find) in the paper - am I correct in assuming the next-token prediction weight was set to 0? (apologies for bugging you on this stuff!)

[This comment is no longer endorsed by its author]
3Fabien Roger
That's right. We initially thought it might be important so that the LLM "understood" the task better, but it didn't matter much in the end. The main hyperparameters for our experiments are in train_ray.py, where you can see that we use a "token_loss_weight" of 0. (Feel free to ask more questions!)

did the paper report accuracy of the pure prediction model (on the pure prediction task)? (trying to replicate and want a sanity check). 

[This comment is no longer endorsed by its author]
3Fabien Roger
I think this is what you are looking for

I think mechanistic anomaly detection (mostly ARC but also Redwood and some forthcoming work) is importantly different than robustness (though clearly related).

(from conversation with Erik Jenner) roughly 3 classes of applications

  1. MTD all the way down 
    1. Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good or something). Use GPT-N to predict measurements, and detect measurement tampering, conditional on proposed actions of the policy. Reward actions that GPT-N predicts will increase measures of things going well without tampering. 
  2. Other Scalable Oversight + MTD as reward function / side constraint
    1. Use whatever scalable oversight protocol seems good (
... (read more)
3ryan_greenblatt
As far as (1) and similar things, you might find the narrow elicitation appendix in the ELK report and next several appendices (prior to generative model errors) worthwhile.

I’ve been thinking a little more about the high-level motivation of measurement tampering, and struggling to think through when measurement tampering detection itself is actually going to be useful for mitigating x-risk. Like is human/AI feedback considered a robust measurement device? If no, then what is the most alignment-relevant domain MTD could be applied to? If yes, do the structural properties of measurement that supposedly make it easier than general ELK still hold?

4ryan_greenblatt
The remarks by Erik Jenner seem good, but I would also just consider the following setup:

Imagine that we have a human looking at an AI's actions and a wide variety of predictions of the future conditional on executing this action. The human can make recursive queries to an AI assistant to attempt to better understand the action or the predictions (recursive reward modeling). These predictions can correspond to any observation a human might make in the future, e.g. you can imagine training our predictor to predict any of our possible future observations, including arbitrary tests we might run. The human will make a deliberate effort to look for evidence of tampering or other problems when looking at the predicted measurements. Evidence of tampering can look like inconsistent measurements that imply that the measurements are no longer tracking the variable of interest.

Now, what can go wrong?
  • Problematic consequences which couldn't at all be captured in a human looking at these measurements, because either:
    • The visible or human-understandable bad consequences take a long time to manifest. (Longer than the prediction window.)
    • The consequences aren't something that a human can even ever understand (e.g. the actions involve torturing invisible fairies that humans don't understand or know about.)
  • Problematic consequences which would "by default" be captured by measurements, but which can be covered up. In this case, we hope MTD is applicable.
  • Cases which are a mix of the prior two: not exactly "by default" captured, but also not impossible to capture.

When viewed from this perspective, MTD seems useful as we seem to often care about consequences that are in principle possible to measure and observe. And, this might correspond to the cases where MTD is applicable (e.g. consistency properties in measurements and/or explicit tampering (which can fail) is required). It's in principle possible to solve issues with consequences taking a very long time to m
1Oliver Daniels
(from conversation with Erik Jenner) roughly 3 classes of applications
  1. MTD all the way down
     Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good or something). Use GPT-N to predict measurements, and detect measurement tampering, conditional on proposed actions of the policy. Reward actions that GPT-N predicts will increase measures of things going well without tampering.
  2. Other Scalable Oversight + MTD as reward function / side constraint
     Use whatever scalable oversight protocol seems good (Debate, IDA, etc) for the "primary" training signal, while also setting up a bunch of measurements of safety properties (e.g. integrity of the binaries on the server the model is hosted on). Do RL on the scalable oversight signal while constantly screening for and negatively rewarding actions flagged as tampering.
  3. Other Scalable Oversight + MTD as extra safety check
     Same as before, but you're leaning more on the scalable oversight to avoid tampering in most cases, and detecting measurement tampering as a precautionary step (i.e. if you detect measurement tampering you Pause/Stop).

(some ways in which these classes overlap, e.g. using your scalable oversight protocol as the measurement(s))

Strongly agree, and also want to note that wire-heading is (almost?) always a (near?) optimal policy - i.e. trajectories that tamper with the reward signal and produce high reward will be strongly upweighted, and insofar as the model has sufficient understanding/situational awareness of the reward process and some reasonable level of goal-directedness, this upweighting could plausibly induce a policy explicitly optimizing the reward. 
