One more: It seems plausible to me that the alignment stress-testing team won't really challenge core beliefs that underlie Anthropic's strategy.
For example, Sleeper Agents showed that standard finetuning might not suffice given a scheming model, but Anthropic had already been pretty invested in interp anyway (and I think you and probably others had been planning for methods other than standard finetuning to be needed). "Simple probes can catch sleeper agents" (I'm not sure whether I should think of this as work by the stress-testing team?) then showed positive results using model internals methods, which I think probably wouldn't hold up to stress-testing in the sense of somewhat adversarial model organisms.
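(For readers less familiar with that work: as I understand it, the probes there are essentially linear classifiers on intermediate model activations. A minimal sketch of that kind of setup, using random stand-in data rather than anything from the actual paper:)

```python
# Minimal sketch of a linear probe on model activations (illustrative only;
# the "activations" here are random stand-ins, not real sleeper-agent data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 512

# Pretend these are residual-stream activations collected on prompts where we
# know whether the model is in its "defect" condition or not.
acts_benign = rng.normal(0.0, 1.0, size=(1000, d_model))
acts_defect = rng.normal(0.2, 1.0, size=(1000, d_model))  # small mean shift

X = np.concatenate([acts_benign, acts_defect])
y = np.concatenate([np.zeros(1000), np.ones(1000)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```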
Examples of things that I'd count as "challenge core beliefs that underlie Anthropic's strategy":
To be clear, I think the work by the stress-testing team so far has been really great (mainly for demonstrating issues to people outside Anthropic), I definitely wouldn't want that to stop! Just highlighting a part that I'm not yet sure will be covered.
I think there are things Anthropic could do that would help considerably. This could include:
- Actively encouraging prospective employees to start or join third-party organizations rather than join Anthropic, in cases where the employee might be interested in that and it could be a reasonable fit.
- Better model access (whether for anyone, just researchers, or just organizations with aspirations to become adjudicators).
- Higher levels of certain types of transparency, e.g. being more transparent about the exact details of safety cases, or open-sourcing evals (probably you'd just want to provide random IID subsets of each eval, or share high-level details and then share the exact implementation on request).
I'm not sure exactly what is good here, but I don't think Anthropic is as limited as you suggest.
Interesting, thanks! My guess is this doesn't include benefits like housing and travel costs? Some of these programs pay for those while others don't, which I think is a non-trivial difference (especially for the Bay Area).
I think different types of safety research have pretty different effects on concentration of power risk.
As others have mentioned, if the alternative to human concentration of power is AI takeover, that's hardly an improvement. So I think the main ways in which proliferating AI safety research could be bad are:
There are interesting discussions to be had on the extent to which these issues apply. But it seems clearer that they apply to pretty different extents depending on the type of safety research. For example:
To be clear, I do agree this is a very important problem, and I thought this post had interesting perspectives on it!
You're totally right that this is an important difficulty I glossed over, thanks!
TL;DR: I agree you need some extra ingredient to deal with cases where (AI-augmented) humans can't supervise, and this ingredient could be interpretability. On the other hand, there's at least one (somewhat speculative) alternative to interp (and MAD is also potentially useful if you can only deal with cases humans can supervise with enough effort, e.g., to defend against scheming).
Just to restate things a bit, I'd distinguish two cases:
In-distribution anomaly detection can already be useful (mainly to deal with rare high-stakes failures). For example, if a human can verify that no tampering occurred with enough effort, then we might be able to create a trusted distribution that covers so many cases that we're fine with flagging everything that's OOD.
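As a toy illustration of what I mean (purely a sketch; representing each episode as a feature/activation vector and using a Gaussian fit are assumptions I'm making for the example, not a claim about the actual setup), you could fit a simple density model to the trusted distribution and flag everything that looks OOD relative to it:

```python
# Toy in-distribution anomaly detector: fit a Gaussian to features from a
# human-verified "trusted" distribution and flag anything unusually far from it.
# Illustrative sketch with random data, not a real measurement-tampering setup.
import numpy as np

rng = np.random.default_rng(0)
d = 64

trusted = rng.normal(size=(5000, d))          # verified-clean episodes
mu = trusted.mean(axis=0)
cov = np.cov(trusted, rowvar=False) + 1e-3 * np.eye(d)
cov_inv = np.linalg.inv(cov)

def mahalanobis(x):
    diff = x - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

# Threshold chosen so we only flag ~0.1% of the trusted distribution itself.
threshold = np.quantile([mahalanobis(x) for x in trusted], 0.999)

def flag(x):
    """Flag an episode as anomalous (potentially tampered) if it's OOD."""
    return mahalanobis(x) > threshold

print(flag(rng.normal(size=d)))             # typically False
print(flag(rng.normal(3.0, 1.0, size=d)))   # typically True
```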
But we might still want off-distribution anomaly detection, where the anomaly detector generalizes as intended from easy trusted examples to harder untrusted examples. Then we need some additional ingredient to make that generalization work. Paul writes about one approach specifically for measurement tampering here and in the following subsection. Exclusion finetuning (appendix I in Redwood's measurement tampering paper) is a practical implementation of a similar intuition. This does rely on some assumptions about inductive bias, but at least seems more promising to me than just hoping to get a direct translator from normal training.
I think ARC might have hopes to solve ELK more broadly (rather than just measurement tampering), but I understand those less (and maybe they're just "use a measurement tampering detector to bootstrap to a full ELK solution").
To be clear, I'm far from confident that approaches like this will work, but getting to the point where we could solve measurement tampering via interp also seems speculative in the foreseeable future. These two bets seem at least not perfectly correlated, which is nice.
Yeah, seems right that these adversarial prompts should be detectable as mechanistically anomalous---it does intuitively seem like a different reason for the output, given that it doesn't vary with the input. That said, if you look at cases where the adversarial prompt makes the model give the correct answer, it might be hard to know for sure to what extent the anomalous mechanism is present. More generally, the fact that we don't understand how these prompts work probably makes any results somewhat harder to interpret. Cases where the adversarial prompt leads to an incorrect answer seem more clearly unusual (but detecting them may also be a significantly easier task).
I directionally agree with this (and think it's good to write about this more, strongly upvoted!)
For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:
I think 2. is arguably the most promising strategy for 1., but I've occasionally noticed myself conflating them more than I should.
1. gives you the naive 50/50 equilibrium, i.e. 50% of people should naively work on this broad notion of control. But I think other reasons in favor apply more strongly to 2. (e.g. the tractability arguments are significantly weaker for model internals-based approaches to 1.)
I also think (non-confidently) that 2. is what's really very different from most existing research. For control in the first, broad sense, some research seems less clearly on either the control or alignment side.
But I do agree that safety-motivated researchers should evaluate approaches from a control perspective (in the broad sense) more on the margin. And I also really like the narrower black-box approach to control!
Yeah, I feel like we do still disagree about some conceptual points, but they seem less crisp than I initially thought, and I don't know of experiments we'd clearly make different predictions for. (I expect you could finetune Leela for helpmates faster than training a model from scratch, but I expect most of this would be driven by things closer to pattern recognition than search.)
> I think if there is a spectrum from pattern recognition to search algorithm there must be a turning point somewhere: Pattern recognition means storing more and more knowledge to get better. A search algo means that you don't need that much knowledge. So at some point of the training where the NN is pushed along this spectrum much of this stored knowledge should start to be pared away and generalised into an algorithm. This happens for toy tasks during grokking. I think it doesn't happen in Leela.
I don't think I understand your ontology for thinking about this, but I would probably also put Leela below this "turning point" (e.g., I expect most of its parameters are spent on storing knowledge and patterns rather than implementing crisp algorithms).
That said, for me, the natural spectrum is between a literal look-up table and brute-force tree search with no heuristics at all. (Of course, that's not a spectrum I expect to be traversed during training, just a hypothetical spectrum of algorithms.) On that spectrum, I think Leela is clearly far removed from both sides, but I find it pretty difficult to define its place more clearly. In particular, I don't see your turning point there (you start storing less knowledge immediately as you move away from the look-up table).
That's why I've tried to avoid absolute claims about how much Leela is doing pattern recognition vs "reasoning/..." but instead focused on arguing for a particular structure in Leela's cognition: I just don't know what it would mean to place Leela on either one of those sides. But I can see that if you think there's a crisp distinction between these two sides with a turning point in the middle, asking which side Leela is on is much more compelling.
Thanks for running these experiments! My guess is that these puzzles are hard enough that Leela doesn't really "know what's going on" in many of them and gets the first move right in significant part by "luck" (i.e., the first move is heuristically natural and can be found without (even heuristically) knowing why it's actually good). I think your results are mainly reflections of that, rather than Leela generally not having sensibly correlated move and value estimates (but I'm confused about what a case would be where we'd actually make different predictions about this correlation).
In our dataset, we tried to avoid cases like that by discarding puzzles where even a much weaker network ("LD2") got the first move right, so that Leela getting the first move right was actually evidence it had noticed the non-obvious tactic.
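Roughly, the filtering logic looks like the sketch below (not the actual dataset code; `weak_net_best_move` and `strong_net_best_move` are hypothetical stand-ins for however you query each network's policy on a position):

```python
# Rough sketch of the puzzle-filtering idea described above (illustrative only).
from typing import Callable, Iterable

def filter_puzzles(
    puzzles: Iterable[dict],                 # each: {"fen": ..., "best_move": ...}
    weak_net_best_move: Callable[[str], str],    # hypothetical: weaker "LD2" policy
    strong_net_best_move: Callable[[str], str],  # hypothetical: main Leela policy
) -> list[dict]:
    kept = []
    for puzzle in puzzles:
        fen, solution = puzzle["fen"], puzzle["best_move"]
        # Discard puzzles the weak network already solves: if its first move is
        # right, the move is probably heuristically "obvious" rather than
        # requiring the tactic to be noticed.
        if weak_net_best_move(fen) == solution:
            continue
        # Keep puzzles where the stronger network gets the first move right,
        # so that success is actually evidence it noticed the tactic.
        if strong_net_best_move(fen) == solution:
            kept.append(puzzle)
    return kept
```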
Some predictions based on that:
You might agree with all of these predictions, they aren't meant to be super strong. If you do, then I'm not sure which predictions we actually disagree about---maybe there's a way to make a dataset where we expect different amounts of correlation between policy and value output, but I'd need to think about that.
> But I think it can be ruled out that a substantial part of Leela network's prowess in solving chess puzzles or predicting game outcome is due to deliberate calculation.
FWIW, I think it's quite plausible that only a small part of Leela's strength is due to look-ahead; we're only testing on a pretty narrow distribution of puzzles after all. (Though similarly, I disagree somewhat with "ruling out", given that you also only look at pretty specific puzzles, which I think might just be too hard to be a good example of Leela's strength.)
ETA: If you can share your dataset, I'd be happy to test the predictions above if we disagree about any of them, also happy to make them more concrete if it seems like we might disagree. Though again, I'm not claiming you should disagree with any of them just based on what you've said so far.
I think this was a very good summary/distillation and a good critique of work on natural abstractions; I'm less sure it has been particularly useful or impactful.
I'm quite proud of our breakdown into key claims; I think it's much clearer than any previous writing (and in particular makes it easier to notice which sub-claims are obviously true, which are daring, which are or aren't supported by theorems, ...). It also seems that John was mostly on board with it.
I still stand by our critiques. I think the gaps we point out are important and might not be obvious to readers at first. That said, I regret somewhat that we didn't focus more on communicating an overall feeling about work on natural abstractions, and our core disagreements. I had some brief back-and-forth with John in the comments, where it seemed like we didn't even disagree that much, but at the same time, I still think John's writing about the agenda was wildly more optimistic than my views, and I don't think we made that crisp enough.
My impression is that natural abstractions are discussed much less than they were when we wrote the post (and this is the main reason why I think the usefulness of our post has been limited). An important part of the reason I wanted to write this was that many junior AI safety researchers or people getting into AI safety research seemed excited about John's research on natural abstractions, but I felt that some of them had a rosy picture of how much progress there'd been/how promising the direction was. So writing a summary of the current status combined with a critique made a lot of sense, both to let others form an accurate picture of the agenda's progress and to make it easier for them to get started if they wanted to work on it. Since there's (I think) less attention on natural abstractions now, it's unsurprising that those goals are less important.
As for why there's been less focus on natural abstractions, my guess is a combination of at least:
It's also possible that many became more pessimistic about the agenda without public fanfare, or maybe my impression of relative popularity now vs then is just off.
I still think very high effort distillations and critiques can be a very good use of time (and writing this one still seems reasonable ex ante, though I'd focus more on nailing a few key points and less on being super comprehensive).