CEO at Redwood Research.
AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
I think that if you wanted to contribute maximally to a cure for aging (and let's ignore the possibility that AI changes the situation), it would probably make sense for you to have a lot of general knowledge. But that's substantially because you're personally good at and very motivated by being generally knowledgeable, and you'd end up in a weird niche where little of your contribution comes from actually pushing any of the technical frontiers. Most of the credit for solving aging will probably go to people who narrowly specialized in a particular domain; much of the rest will go to people who applied their general knowledge to improving the overall strategy or allocation of effort among those working on curing aging (while leaving most of the technical contributions to specialists)--this latter strategy crucially relies on management and coordination, and on not being fully in the weeds everywhere.
Thanks for this post. Some thoughts:
This kind of idea has been discussed under the names "surrogate goals" and "safe Pareto improvements", see here.
I agree in principle, but as far as I know, no interp explanation that has been produced explains more than something like 20-50% of the (tiny) parts of the model it's trying to explain (e.g. see the causal scrubbing results, or our discussion with Neel). See that dialogue with Neel for more on the question of how much of the model we understand.
I agree with most of this, thanks for saying it. I've been dismayed for the last several years by the continued, unreasonable level of emphasis on interpretability techniques as a strategy for safety.
My main disagreement is that you place more emphasis than I would on chain-of-thought monitoring compared to other AI control methods. CoT monitoring seems like a great control method when available, but I think it's reasonably likely that it won't work on the AIs that we'd want to control, because those models will have access to some kind of "neuralese" that allows them to reason in ways we can't observe. This is why I mostly focus on control measures other than CoT monitoring. (All of our control research to date has basically been assuming that CoT monitoring is unavailable as a strategy.)
Another note is that you might have goals other than finding deceptive AI, e.g. you might want to be able to convince other people that you've found deceptive AI (which I'm somewhat skeptical you'll be able to do with non-behavioral methods), or you might want to be able to safely deploy known-scheming models. Interp doesn't obviously help much with those, which makes it a worse target for research effort.
I think the LTFF is a pretty reasonable target for donations for donors who aren't that informed but trust people in this space.