Research Scientist on the Google DeepMind AGI Safety & Alignment team
Yeah. I think there's a broader phenomenon where it's way harder to learn from other people's mistakes than from your own. E.g. see my first bullet point on being too attached to a cool idea. Obviously, I knew in theory that this was a common failure mode (from the Sequences/LW and from common research advice), and someone even told me I might be making the mistake in this specific instance. But my experience up until that point had been that most of the research ideas I'd been similarly excited about ended up ~working (or at least the ones I put serious time into).
Some heuristics (not hard rules):
These are probably geared toward people like me who tend to over-engineer; someone who's currently unhappy that their code is always a mess might need different ones.
I don't know whether functional programming is fundamentally better in this respect than object-oriented.
Research mistakes I made over the last 2 years.
Listing these in part so that I hopefully learn from them, but also because I think some of these are common among junior researchers, so maybe it's helpful for someone else.
theoretical progress has been considerably faster than expected, while crossing the theory-practice gap has been mildly slower than expected. (Note that “theory progressing faster than expected, practice slower” is a potential red flag for theory coming decoupled from reality, though in this case the difference from expectations is small enough that I’m not too worried. Yet.)
I don't know how much the difficulty of crossing the theory-practice gap has deviated from your expectations since then. But I would indeed be worried that a lot of the difficulty is going to be in getting any good results for deep learning, and that finding additional theoretical/conceptual results in other settings doesn't constitute much progress on that. (But kudos for apparently working on image generator nets again!)
As a sidenote, your update from 2 years ago also mentioned:
I tried to calculate “local” natural abstractions (in a certain sense) in a generative image net, and that worked quite well.
I assume that was some other type of experiment involving image generators? (and the notion of "working well" there isn't directly comparable to what you tried now?)
I think this was a very good summary/distillation and a good critique of work on natural abstractions; I'm less sure it has been particularly useful or impactful.
I'm quite proud of our breakdown into key claims; I think it's much clearer than any previous writing (and in particular makes it easier to notice which sub-claims are obviously true, which are daring, which are or aren't supported by theorems, ...). It also seems that John was mostly on board with it.
I still stand by our critiques. I think the gaps we point out are important and might not be obvious to readers at first. That said, I regret somewhat that we didn't focus more on communicating an overall feeling about work on natural abstractions, and our core disagreements. I had some brief back-and-forth with John in the comments, where it seemed like we didn't even disagree that much, but at the same time, I still think John's writing about the agenda was wildly more optimistic than my views, and I don't think we made that crisp enough.
My impression is that natural abstractions are discussed much less than they were when we wrote the post (and this is the main reason why I think the usefulness of our post has been limited). An important part of the reason I wanted to write this was that many junior AI safety researchers or people getting into AI safety research seemed excited about John's research on natural abstractions, but I felt that some of them had a rosy picture of how much progress there'd been/how promising the direction was. So writing a summary of the current status combined with a critique made a lot of sense, both to let others form an accurate picture of the agenda's progress and to make it easier for them to get started if they wanted to work on it. Since there's (I think) less attention on natural abstractions now, it's unsurprising that those goals are less important.
As for why there's been less focus on natural abstractions, my guess is a combination of at least:
It's also possible that many became more pessimistic about the agenda without public fanfare, or maybe my impression of relative popularity now vs then is just off.
I still think very high effort distillations and critiques can be a very good use of time (and writing this one still seems reasonable ex ante, though I'd focus more on nailing a few key points and less on being super comprehensive).
One more: It seems plausible to me that the alignment stress-testing team won't really challenge core beliefs that underlie Anthropic's strategy.
For example, Sleeper Agents showed that standard finetuning might not suffice given a scheming model, but Anthropic had already been pretty invested in interp anyway (and I think you and probably others had been planning for methods other than standard finetuning to be needed). The follow-up, Simple probes can catch sleeper agents (I'm not sure whether I should think of this as work by the stress-testing team?), then showed positive results using model-internals methods, which I think probably don't hold up to stress-testing in the sense of somewhat adversarial model organisms.
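For readers who haven't seen that work: "probes" here just means a simple classifier trained on the model's internal activations. Below is a minimal sketch on synthetic activations (illustrative only, not the actual setup from that paper; the dimensions, names, and numbers are made up):

```python
# Minimal linear-probe sketch on synthetic "activations" (illustrative only;
# not the setup from the sleeper-agents / probing papers).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512       # hypothetical residual-stream width
n_per_class = 200

# Pretend activations: "benign" vs. "triggered" prompts differ by a small
# shift along some direction (a toy stand-in for the real signal).
direction = rng.normal(size=d_model)
benign = rng.normal(size=(n_per_class, d_model))
triggered = rng.normal(size=(n_per_class, d_model)) + 0.5 * direction

X = np.vstack([benign, triggered])
y = np.array([0] * n_per_class + [1] * n_per_class)

# The "probe" is just logistic regression on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```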
Examples of things that I'd count as "challenge core beliefs that underlie Anthropic's strategy":
To be clear, I think the work by the stress-testing team so far has been really great (mainly for demonstrating issues to people outside Anthropic), I definitely wouldn't want that to stop! Just highlighting a part that I'm not yet sure will be covered.
I think there are things Anthropic could do that would help considerably. These could include:
- Actively encouraging prospective employees to start or join third-party organizations rather than join Anthropic, in cases where the employee might be interested in this and it could be a reasonable fit.
- Better model access (either for anyone, just researchers, or just organizations with aspirations to become adjudicators)
- Higher levels of certain types of transparency, e.g. being more transparent about the exact details of safety cases, or open-sourcing evals (for evals, you probably just want to provide random IID subsets of the eval, or share high-level details and then share the exact implementation on request; a rough sketch of the subset option is below).
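To illustrate the "random IID subsets" option: one could publish a fixed, seeded random subset of an eval while keeping the rest private. This is a hypothetical sketch, not something Anthropic does; the function name and file format are made up:

```python
# Hypothetical sketch: publish a seeded IID subset of a private eval set.
import json
import random

def public_subset(eval_items, fraction=0.2, seed=0):
    """Return a reproducible random IID subset of the eval items."""
    rng = random.Random(seed)
    k = int(len(eval_items) * fraction)
    return rng.sample(eval_items, k)

# Example usage with a made-up eval file:
# with open("private_eval.jsonl") as f:
#     items = [json.loads(line) for line in f]
# released = public_subset(items, fraction=0.2, seed=0)
```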
I'm not sure exactly what is good here, but I don't think Anthropic is as limited as you suggest.
Interesting, thanks! My guess is this doesn't include benefits like housing and travel costs? Some of these programs pay for those while others don't, which I think is a non-trivial difference (especially for the Bay Area).
I think different types of safety research have pretty different effects on concentration of power risk.
As others have mentioned, if the alternative to human concentration of power is AI takeover, that's hardly an improvement. So I think the main ways in which proliferating AI safety research could be bad are:
There are interesting discussions to be had on the extent to which these issues apply. But it seems clearer that they apply to pretty different extents depending on the type of safety research. For example:
To be clear, I do agree this is a very important problem, and I thought this post had interesting perspectives on it!
https://arxiv.org/abs/1712.00409 was technically published half a year after transformers, but it shows power-law language model scaling laws for LSTMs (several years before the Kaplan et al. paper, and without citing the transformer paper). It's possible that transformer scaling laws are much better; I haven't checked (and perhaps more importantly, transformer training lets you parallelize across tokens). I'm just mentioning this because it seems relevant for the overall discussion of continuity in research.
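For concreteness, "power-law scaling laws" here means loss falling roughly as L(D) ≈ c · D^(−α), i.e. a straight line on a log-log plot. A generic sketch of fitting such an exponent (the numbers below are made up for illustration, not taken from that paper):

```python
# Generic power-law scaling fit: L(D) ≈ c * D**(-alpha), i.e. linear in log-log space.
# The "measurements" here are synthetic, not from Hestness et al. or Kaplan et al.
import numpy as np

dataset_sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 50.0 * dataset_sizes ** -0.095   # synthetic loss values

# Fit log(L) = log(c) - alpha * log(D) with a linear least-squares fit.
slope, intercept = np.polyfit(np.log(dataset_sizes), np.log(losses), 1)
alpha, c = -slope, np.exp(intercept)
print(f"fitted exponent alpha ≈ {alpha:.3f}, prefactor c ≈ {c:.1f}")
```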
I also agree with Thomas Kwa's sibling comment that transformers weren't a single huge step. Fully-connected neural networks seem like a very strange comparison to make; I think the interesting question is whether transformers were a sudden single step relative to LSTMs. But I'd disagree even with that: attention was introduced three years before transformers and was a big deal for machine translation, and self-attention was introduced somewhere between the first attention papers and transformers. And the transformer paper itself isn't atomic; it consists of multiple ideas. Replacing RNNs/LSTMs with self-attention is clearly the big one, but my impression is that multi-head attention, scaled dot-product attention, and the specific architecture were pretty important to actually get their impressive results.
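For reference, since the argument leans on these specific components: scaled dot-product attention computes softmax(QKᵀ/√d_k)V, and multi-head attention runs several such heads in parallel on learned projections. A minimal single-head sketch (a toy illustration, not the exact implementation details from the paper):

```python
# Minimal scaled dot-product attention (single head), following
# softmax(Q K^T / sqrt(d_k)) V from "Attention Is All You Need".
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # (n_queries, d_v)

# Toy usage: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```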
To be clear, I agree that there are sometimes new technologies that are very different from the previous state of the art, but I think it's a very relevant question just how common this is, in particular within AI. IMO the most recent great example is neural machine translation (NMT) replacing complex hand-designed systems starting in 2014: NMT worked very differently than the previous best machine translation systems, and surpassed them very quickly (by 2014 standards for "quick"). I expect something like this to happen again eventually, but it seems important to note that this was 10 years ago, and to note how much progress has since been driven by many different innovations (+ scaling).
ETA: maybe a crux is just how impressive progress over the past 10 years has been, and what it would look like to have "equivalent" progress before the next big shift. But I feel like in that case, you wouldn't count transformers as a big important step either? My main claim here is that to the extent to which there's been meaningful progress over the past 10 years, it was mostly driven by a large set of small-ish improvements, and gradual shifts of the paradigm.