Some heuristics (not hard rules):
Research mistakes I made over the last 2 years.
Listing these in part so that I hopefully learn from them, but also because I think some of these are common among junior researchers, so maybe it's helpful for someone else.
I also have most of these regrets when I think about my work in 2022.
theoretical progress has been considerably faster than expected, while crossing the theory-practice gap has been mildly slower than expected. (Note that “theory progressing faster than expected, practice slower” is a potential red flag for theory coming decoupled from reality, though in this case the difference from expectations is small enough that I’m not too worried. Yet.)
I don't know how much the difficulty of crossing the theory-practice gap has deviated from your expectations since then. But I would indeed be worried that a lot...
I think this was a very good summary/distillation and a good critique of work on natural abstractions; I'm less sure it has been particularly useful or impactful.
I'm quite proud of our breakdown into key claims; I think it's much clearer than any previous writing (and in particular makes it easier to notice which sub-claims are obviously true, which are daring, which are or aren't supported by theorems, ...). It also seems that John was mostly on board with it.
I still stand by our critiques. I think the gaps we point out are important and might not be obvi...
One more: It seems plausible to me that the alignment stress-testing team won't really challenge core beliefs that underlie Anthropic's strategy.
For example, Sleeper Agents showed that standard finetuning might not suffice given a scheming model, but Anthropic had already been pretty invested in interp anyway (and I think you and probably others had been planning for methods other than standard finetuning to be needed). Simple probes can catch sleeper agents (I'm not sure whether I should think of this as work by the stress-testing team?) then showed positi...
I think Anthropic de facto acts as though "models are quite unlikely (e.g. 3%) to be scheming" is true. Evidence that seriously challenged this view might cause the organization to substantially change its approach.
Interesting, thanks! My guess is this doesn't include benefits like housing and travel costs? Some of these programs pay for those while others don't, which I think is a non-trivial difference (especially for the Bay Area).
I think different types of safety research have pretty different effects on concentration of power risk.
As others have mentioned, if the alternative to human concentration of power is AI takeover, that's hardly an improvement. So I think the main ways in which proliferating AI safety research could be bad are:
You're totally right that this is an important difficulty I glossed over, thanks!
TL;DR: I agree you need some extra ingredient to deal with cases where (AI-augmented) humans can't supervise, and this ingredient could be interpretability. On the other hand, there's at least one (somewhat speculative) alternative to interp (and MAD is also potentially useful if you can only deal with cases humans can supervise with enough effort, e.g., to defend against scheming).
Just to restate things a bit, I'd distinguish two cases:
Yeah, seems right that these adversarial prompts should be detectable as mechanistically anomalous---it does intuitively seem like a different reason for the output, given that it doesn't vary with the input. That said, if you look at cases where the adversarial prompt makes the model give the correct answer, it might be hard to know for sure to what extent the anomalous mechanism is present. More generally, the fact that we don't understand how these prompts work probably makes any results somewhat harder to interpret. Cases where the adversarial prompt leads to an incorrect answer seem more clearly unusual (but detecting them may also be a significantly easier task).
I directionally agree with this (and think it's good to write about this more, strongly upvoted!)
For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:
I th...
I agree, and I regret focusing as much as we did on (2) in the past; I'm excited for work on "white box control" (there's some under way, and I'm excited for more).
We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.
I plan to write more about white box control soon (but I plan to write about a lot of stuff…)
Yeah, I feel like we do still disagree about some conceptual points, but they seem less crisp than I initially thought, and I don't know of experiments we'd clearly make different predictions for. (I expect you could finetune Leela for help mates faster than training a model from scratch, but I expect most of this would be driven by things closer to pattern recognition than search.)
...I think if there is a spectrum from pattern recognition to search algorithm there must be a turning point somewhere: Pattern recognition means storing more and more knowledge to get
Thanks for running these experiments! My guess is that these puzzles are hard enough that Leela doesn't really "know what's going on" in many of them and gets the first move right in significant part by "luck" (i.e., the first move is heuristically natural and can be found without (even heuristically) knowing why it's actually good). I think your results are mainly reflections of that, rather than Leela generally not having sensibly correlated move and value estimates (but I'm confused about what a case would be where we'd actually make different predictio...
Thank you for writing this! I've found it helpful both to get an impression of what some people at Anthropic think and to think about some things myself. I've collected some of my agreements/disagreements/uncertainties below (mostly ignoring points already raised in other comments).
Subject to potentially very demanding constraints around safety like those in our current and subsequent RSPs, staying close to the frontier is perhaps our top priority in Chapter 1.
If I understand this correctly, the tasks in order of descending priority during Chapter 1 are:...
I don't think my argument relies on the existence of a crisp boundary. Just on the existence of a part of the spectrum that clearly is just pattern recognition and not lookahead but still leads to the observations you made.
Maybe I misunderstood you then, and tbc I agree that you don't need a sharp boundary. That said, the rest of your message makes me think we might still be talking past each other a bit. (Feel free to disengage at any point obviously.)
For your thought experiment, my prediction would depend on the specifics of what this "tactical motive" l...
I still don't see the crisp boundary you seem to be getting at between "pattern recognition building on general circuits" and what you call "look-ahead." It sounds like one key thing for you is generalization to unseen cases, but the continuous spectrum I was gesturing at also seems to apply to that. For example:
...But if in the entire training data there was never a case of a piece blocking the checkmate by rook h4, the existence of a circuit that computes the information that the bishop on d2 can drop back to h6 is not going to help the "pattern recognition
Thanks for the elaboration, these are good points. I think about the difference between what you call look-ahead vs pattern recognition on a more continuous spectrum. For example, you say:
The network learns that Ng6 is often a likely move when the king is on h8, the queen or bishop takes away the g8 square and there is a rook or queen ready to move to the h-file.
You could imagine learning this fact literally for those specific squares. Or you could imagine generalizing very slightly and using the same learned mechanism if you flip along the vertical axis a...
Good point, explicit representations of the objective might not be as crucial for safety applications as my post frames it.
That said, some reasons this might not generalize in a way that enables this kind of application:
The manner in which these pathological policies achieve high reward is also concerning: most of the time they match the reference policy, but a tiny fraction of the time they will pick trajectories with extremely high reward. Thus, if we only observe actions from the policy, it could be impossible to tell whether it is Goodharting or identical to the base policy.
I'm confused; to learn this policy, some of the extremely high reward trajectories would likely have to be taken during RL training, so we could s...
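To spell out the construction I think the quoted passage is pointing at (my own simplified notation, so take this as a sketch rather than what the authors literally mean): take a reference policy $\pi_0$ and a single trajectory $\tau^*$ with extremely high reward $R(\tau^*)$, and define the mixture

$$\pi \;=\; (1-\epsilon)\,\pi_0 \;+\; \epsilon\,\delta_{\tau^*}, \qquad \epsilon \ll 1.$$

Then (assuming $\pi_0(\tau^*)$ is tiny) $D_{\mathrm{KL}}(\pi \,\|\, \pi_0)$ is roughly $\epsilon \log\tfrac{\epsilon}{\pi_0(\tau^*)}$, which goes to zero as $\epsilon \to 0$, while the expected reward gain is roughly $\epsilon\,\big(R(\tau^*) - \mathbb{E}_{\pi_0}[R]\big)$ and so stays non-trivial only if $R(\tau^*)$ is correspondingly extreme. My point is just that reaching such a $\pi$ via RL presumably requires sampling trajectories like $\tau^*$ during training.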
I don't know the answer to your actual question, but I'll note there are slightly fewer mech interp mentors than mentors listed in the "AI interpretability" area (though all of them are at least doing "model internals"). I'd say Stephen Casper and I aren't focused on interpretability in any narrow sense, and Nandi Schoots' projects also sound closer to science of deep learning than mech interp. Assuming we count everyone else, that leaves 11 out of 39 mentors, which is slightly less than ~8 out of 23 from the previous cohort (though maybe not by much).
Nice overview, agree with most of it!
weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.
You could also distinguish between weak-to-strong generalization, where you have a weak supervision signal on the entire distribution (which may sometimes be wrong), and easy-to-hard generalization, where you have a correct supervision signal but only on an easy part of the distribution. Of course both of the...
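As a toy picture of the two setups (a sketch with made-up synthetic data, just to make the distinction concrete):

```python
# Hedged toy sketch of the two supervision regimes; all specifics here are made up.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 8))
difficulty = np.abs(x[:, 0])               # pretend the first feature tracks how hard an example is
y_true = (x.sum(axis=1) > 0).astype(int)   # ground truth we don't get to observe directly

# Weak-to-strong generalization: (possibly wrong) labels on the *entire* distribution.
label_noise = rng.random(1000) < 0.2
y_weak = np.where(label_noise, 1 - y_true, y_true)
weak_to_strong_train = (x, y_weak)

# Easy-to-hard generalization: *correct* labels, but only on the easy part of the distribution.
easy = difficulty < 1.0
easy_to_hard_train = (x[easy], y_true[easy])
```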
I think the status quo around publishing safety research is mostly fine (though being a bit more careful seems good); more confidently, I think going as far as the vibe of this post suggests would be bad.
Some possible cruxes, or reasons the post basically didn't move my view on that:
I don't buy the argument that safety researchers have unusually good ideas/research compared to capability researchers at top labs
I don't think this particularly needs to be true for my point to hold; they only need to have reasonably good ideas/research, not unusually good, for publishing less to be a positive thing.
That said, if someone hasn't thought at all about concepts like "differentially advancing safety" or "capabilities externalities," then reading this post would probably be helpful, and I'd endorse thinking about those issues.
That's...
We have been able to scale to 79% accuracy on a balanced dataset of n119 and non-n119, with networks each less than three convolution layers and less than 1000 neurons, compared to pure deep-learning which does 92% on 1000 parameters and three convolution layers
Is the "1000 parameters" a typo, should it be "1000 neurons"? Otherwise, this would be a strange comparison (since 1000 parameters is a much smaller network than 1000 neurons)
And as a follow-up in case the current sleeper agent does admit when it's doing something dangerous: I'd be interested whether you can train a sleeper agent that never admits it's doing something dangerous if the "2024" trigger is in the prompt (even if after backdoor insertion, the blue team finetunes it to answer truthfully in cases with dangerous code but without the presence of that trigger). For such a sleeper agent, you could again check whether a probe works.
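In case it's useful, the kind of probing experiment I have in mind is pretty simple (a sketch; the file names and layer choice are placeholders, and the labels mark whether the "2024" trigger was present in the prompt):

```python
# Hedged sketch: linear probe on cached activations to flag prompts containing the trigger.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

acts = np.load("residual_stream_acts.npy")   # (n_prompts, d_model); placeholder file name
labels = np.load("trigger_labels.npy")       # 1 = "2024" trigger present, 0 = benign; placeholder

X_train, X_test, y_train, y_test = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```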
I think that paper is some evidence that there's typically no huge effect from internal activations being optimized for predicting future tokens (on natural language). But I don't think it's much (if any) evidence that this doesn't happen to some small extent or that it couldn't be a huge effect on certain other natural language tasks.
(In fact, I think the myopia gap is probably a more relevant number than the local myopia bonus, in which case I'd argue the paper actually shows a pretty non-trivial effect, kind of contrary to how the authors interpret it...
Thanks for the detailed responses! I'm happy to talk about "descriptions" throughout.
Trying to summarize my current understanding of what you're saying:
Some niche thoughts on obstacles to certain mechanistic anomaly detection benchmarks:
I had this cached thought that the Sleeper Agents paper showed you could distill a CoT with deceptive reasoning into the model, and that the model internalized this deceptive reasoning and thus became more robust against safety training.
But on a closer look, I don't think the paper shows anything like this interpretation (there are a few results on distilling a CoT making the backdoor more robust, but it's very unclear why, and my best guess is that it's not "internalizing the deceptive reasoning").
In the code vulnerability insertion setting, there's no co...
Is there some formal-ish definition of "explanation of (network, dataset)" and "mathematical description length of an explanation" such that you think SAEs are especially short explanations? I still don't think I have whatever intuition you're describing, and I feel like the issue is that I don't know how you're measuring description length and what class of "explanations" you're considering.
As naive examples that probably don't work (similar to the ones from my original comment):
My non-answer to (2) would be that debate could be used in all of these ways, and the central problem it's trying to solve is sort of orthogonal to how exactly it's being used. (Also, the best way to use it might depend on the context.)
What debate is trying to do is let you evaluate plans/actions/outputs that an unassisted human couldn't evaluate correctly (in any reasonable amount of time). You might want to use that to train a reward model (replacing humans in RLHF) and then train a policy; this would most likely be necessary if you want low cost at infe...
The sparsity penalty trains the SAE to activate fewer features for any given datapoint, thus optimizing for shorter mathematical description length.
I'm confused by this claim and some related ones, sorry if this comment is correspondingly confused and rambly.
It's not obvious at all to me that SAEs lead to shorter descriptions in any meaningful sense. We get sparser features (and maybe sparser interactions between features), but in exchange, we have more features and higher loss. Overall, I share Ryan's intuition here that it seems pretty hard to...
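To make my confusion a bit more concrete, here's the naive per-token accounting I have in mind (all numbers are made-up assumptions, and it ignores both the reconstruction error and the cost of describing the dictionary itself, which is part of why I don't think it settles anything):

```python
# Naive bits-per-token comparison: dense activations vs. an SAE's sparse code (made-up sizes).
import math

d_model = 768          # assumed residual stream width
n_features = 24576     # assumed SAE dictionary size (32x expansion)
l0 = 60                # assumed average number of active features per token
bits_per_float = 16    # assumed precision per stored coefficient

dense_bits = d_model * bits_per_float                        # store every activation directly
sparse_bits = l0 * (math.log2(n_features) + bits_per_float)  # index + coefficient per active feature
print(f"dense: {dense_bits} bits/token, sparse code: {sparse_bits:.0f} bits/token")
```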
Nice post, would be great to understand what's going on here!
Minor comment unrelated to your main points:
Conceptually, loss recovered seems a worse metric than KL divergence. Faithful reconstructions should preserve all token probabilities, but loss only compares the probabilities for the true next token
I don't think it's clear we want SAEs to be that faithful, for similar reasons as briefly mentioned here and in the comments of that post. The question is whether differences in the distribution are "interesting behavior" that we want to explain or whether ...
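For reference, the comparison I understand the quoted point to be about looks roughly like this (a sketch, assuming we have next-token logits from the original model and from the model with the SAE reconstruction patched in):

```python
# Hedged sketch of "loss recovered"-style vs. KL-based faithfulness metrics for an SAE patch.
import torch.nn.functional as F

def faithfulness_metrics(logits_orig, logits_patched, targets):
    """logits_*: (batch, vocab) next-token logits; targets: (batch,) true next tokens."""
    ce_orig = F.cross_entropy(logits_orig, targets)        # loss of the unmodified model
    ce_patched = F.cross_entropy(logits_patched, targets)  # loss with the SAE patched in
    loss_gap = ce_patched - ce_orig                        # only looks at the true next token
    kl = F.kl_div(                                         # compares the full distributions
        F.log_softmax(logits_patched, dim=-1),
        F.log_softmax(logits_orig, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return loss_gap.item(), kl.item()
```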
Would you expect this to outperform doing the same thing with a non-sparse autoencoder (that has a lower latent dimension than the NN's hidden dimension)? I'm not sure why it would, given that we aren't using the sparse representations except to map them back (so any type of capacity constraint on the latent space seems fine). If dense autoencoders work just as well for this, they'd probably be more straightforward to train? (unless we already have an SAE lying around from interp anyway, I suppose)
But sadly, you don't have any guarantee that it will output the optimal element
If I understand the setup correctly, there's no guarantee that the optimal element would be good, right? It's just likely since the optimal element a priori shouldn't be unusually bad, and you're assuming most satisficing elements are fine.
This initially threw me off regarding what problem you're trying to solve. My best current guess is:
I think this is an important point, but IMO there are at least two types of candidates for using SAEs for anomaly detection (in addition to techniques that make sense for normal, non-sparse autoencoders):
Sign of the effect of open source on hype? Or of hype on timelines? I'm not sure why either would be negative.
By "those effects" I meant a collection of indirect "release weights → capability landscape changes" effects in general, not just hype/investment. And by "sign" I meant whether those effects taken together are good or bad. Sorry, I realize that wasn't very clear.
As examples, there might be a mildly bad effect through increased investment, and/or there might be mildly good effects through more products and more continuous takeoff.
I agree that releas...
I agree that releasing the Llama or Grok weights wasn't particularly bad from a speeding up AGI perspective. (There might be indirect effects like increasing hype around AI and thus investment, but overall I think those effects are small and I'm not even sure about the sign.)
I also don't think misuse of public weights is a huge deal right now.
My main concern is that I think releasing weights would be very bad for sufficiently advanced models (in part because of deliberate misuse becoming a bigger deal, but also because it makes most interventions we'd want...
Yeah, agreed. Though I think
the type and amount of empirical work to do presumably looks quite different depending on whether it's the main product or in support of some other work
applies to that as well
One worry I have about my current AI safety research (empirical mechanistic anomaly detection and interpretability) is that now is the wrong time to work on it. A lot of this work seems pretty well-suited to (partial) automation by future AI. And it also seems quite plausible to me that we won't strictly need this type of work to safely use the early AGI systems that could automate a lot of it. If both of these are true, then that seems like a good argument to do this type of work once AI can speed it up a lot more.
Under this view, arguably the better thin...
Oh I see, I indeed misunderstood your point then.
For me personally, an important contributor to day-to-day motivation is just finding research intrinsically fun---impact on the future is more something I have to consciously consider when making high-level plans. I think moving towards more concrete and empirical work did have benefits on personal enjoyment just because making clear progress is fun to me independently of whether it's going to be really important (though I think there've also been some downsides to enjoyment because I do quite like thinking ...
I'd definitely agree the updates are towards the views of certain other people (roughly some mix of views that tend to be common in academia, and views I got from Paul Christiano, Redwood and other people in a similar cluster). Just based on that observation, it's kind of hard to disentangle updating towards those views just because they have convincing arguments behind them, vs updating towards them purely based on exposure or because of a subconscious desire to fit in socially.
I definitely think there are good reasons for the updates I listed (e.g. speci...
Thanks, I think I should distinguish more carefully between automating AI (safety) R&D within labs and automating the entire economy. (Johannes also asked about ability vs actual automation here but somehow your comment made it click).
It seems much more likely to me that AI R&D would actually be automated than that a bunch of random unrelated things would all actually be automated. I'd agree that if only AI R&D actually got automated, that would make takeoff pretty discontinuous in many ways. Though there are also some consequences of fast vs s...
I'm roughly imagining automating most things a remote human expert could do within a few days. If we're talking about doing things autonomously that would take humans several months, I'm becoming quite a bit more scared. Though the capability profile might also be sufficiently non-human that this kind of metric doesn't work great.
Practically speaking, I could imagine getting a 10x or more speedup on a lot of ML research, but wouldn't be surprised if there are some specific types of research that only get pretty small speedups (maybe 2x), especially anythin...
Transformative: Which of these do you agree with and when do you think this might happen?
For some timelines, see my other comment; they aren't specifically about the definitions you list here, but my error bars on timelines are huge anyway, so I don't think I'll try to write down separate ones for different definitions.
Compared to definitions 2. and 3., I might be more bullish on AIs having pretty big effects even if they can "only" automate tasks that would take human experts a few days (without intermediate human feedback). A key uncertainty I have though i...
Good question, I think I was mostly visualizing ability to automate while writing this. Though for software development specifically I expect the gap to be pretty small (lower regulatory hurdles than elsewhere, has a lot of relevance to the people who'd do the automation, already starting to happen right now).
In general I'd expect inertia to become less of a factor as the benefits of AI become bigger and more obvious---at least for important applications where AI could provide many many billions of dollars of economic value, I'd guess it won't take too lon...
I don't have well-considered cached numbers, more like a vague sense for how close various things feel. So these are made up on the spot and please don't take them too seriously except as a ballpark estimate:
How much time do you think there is between "ability to automate" and "actually this has been automated"? Are your numbers for actual automation, or just ability? I personally would agree to your numbers if they are about ability to automate, but I think it will take much longer to actually automate, due to people's inertia and normal regulatory hurdles (though I find it confusing to think about, because we might have vastly superhuman AI and potentially loss of control before everything is actually automated.)
I started my AI safety PhD around 1.5 years ago; this is a list of how my views have changed since ~then.
Skippable meta notes:
Thanks for that overview and the references!
On hydrodynamic variables/predictability: I (like probably many others before me) rediscovered what sounds like a similar basic idea in a slightly different context, and my sense is that this is somewhat different from what John has in mind, though I'd guess there are connections. See here for some vague musings. When I talked to John about this, I think he said he's deliberately doing something different from the predictability-definition (though I might have misunderstood). He's definitely aware of similar idea...
Thanks for writing this! On the point of how to get information, mentors themselves seem like they should also be able to say a lot of useful things (though especially for more subjective points, I would put more weight on what previous mentees say!)
So since I'm going to be mentoring for MATS and for CHAI internships, I'll list my best guesses as to what working with me will be like; maybe this helps someone decide:
Strong +1 to asking the mentor being a great way to get information! My guess is many mentors aren't going out of their way to volunteer this kind of info, but will share it if asked. Especially if they've already decided that they want to work with you.
My MATS admission doc has some info on that for me, though I can give more detailed answers if anyone emails me with specific questions.
Yeah. I think there's a broader phenomenon where it's way harder to learn from other people's mistakes than from your own. E.g. see my first bullet point on being too attached to a cool idea. Obviously, I knew in theory that this was a common failure mode (from the Sequences/LW and from common research advice), and someone even told me I might be making the mistake in this specific instance. But my experience up until that point had been that most of the research ideas I'd been similarly excited about ended up ~working (or at least the ones I put serious time into).