Buck

CEO at Redwood Research.

AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Comments

I think the LTFF is a pretty reasonable donation target for donors who aren't that informed but trust people in this space.

To be clear, I think we at Redwood (and people at spiritually similar places like the AI Futures Project) do think about this kind of question (though I'd quibble about the importance of some of the specific questions you mention here).

Justis has been very helpful as a copy-editor for a bunch of Redwood content over the last 18 months!

I think that if you wanted to contribute maximally to a cure for aging (and let's ignore the possibility that AI changes the situation), it would probably make sense for you to have a lot of general knowledge. But that's substantially because you're personally good at and very motivated by being generally knowledgeable, and you'd end up in a weird niche where little of your contribution comes from actually pushing any of the technical frontiers. Most of the credit for solving aging will probably go to people who narrowly specialized in a particular domain; much of the rest will go to people who applied their general knowledge to improving the overall strategy or the allocation of effort among people working on curing aging (while leaving most of the technical contributions to specialists). This latter strategy crucially relies on management and coordination, and on not being fully in the weeds everywhere.

Thanks for this post. Some thoughts:

  • I really appreciate the basic vibe of this post. In particular, I think it's great to have a distinction between wizard power and king power, and to note that king power is often fake, and that lots of people are very tempted (including by insidious social pressure) to focus on gunning for king power without being sufficiently thoughtful about whether they're actually achieving what they wanted. And I think that for a lot of people, it's an underrated strategy to focus hard on wizard power (especially when you're young). E.g. I spent a lot of my twenties learning computer science and science, and I think this was quite helpful for me.
  • A big theme of Redwood Research's work is the question "If you are in charge of deploying a powerful AI and you have limited resources (e.g. cash, manpower, acceptable service degradation) to mitigate misalignment risks, how should you spend your resources?". (E.g. see here.) This is in contrast to e.g. thinking about what safety measures are most in the Overton window, or which ones are easiest to explain. I think it's healthy to spend a lot of your time thinking about techniques that are objectively better, because it is less tied up in social realities. That attitude reminds me of your post.
  • I share your desire to know about all those things you talk about. One of my friends has huge amounts of "wizard power", and I find this extremely charming/impressive/attractive. I would personally enjoy the LessWrong community more if the people here knew more of this stuff.
  • I'm very skeptical that focusing on wizard power is universally the right strategy; I'm even more skeptical that learning the random stuff you list in this post is typically a good strategy for people. For example, I think that it would be clearly bad for my effect on existential safety for me to redirect a bunch of my time towards learning about the things you described (making vaccines, using CAD software, etc), because those topics aren't as relevant to the main strategies that I'm interested in for mitigating existential risk.
  • You write "And if one wants a cure for aging, or weekend trips to the moon, or tiny genetically-engineered dragons… then the bottleneck is wizard power, not king power." I think this is true in a collective sense--these problems require technological advancement--but it is absurd to say that the best way to improve the probability of getting to those things is to try to personally learn all of the scientific fields relevant to making those advancements happen. At the very least, surely there should be specialization! And beyond that, I think the biggest threat to eventual weekend trips to the moon is probably AI risk; on my beliefs, we should dedicate way more effort to mitigating AI risk than to tiny-dragon R&D. Some people should try to have very general knowledge of these things, but IMO the main use case for having such broad knowledge is helping with the prioritization between them, not contributing to any particular one of them!

This kind of idea has been discussed under the names "surrogate goals" and "safe Pareto improvements", see here.

I agree in principle, but as far as I know, no interp explanation that has been produced explains more than something like 20-50% of the (tiny) parts of the model it's trying to explain (e.g. see the causal scrubbing results, or our discussion with Neel). See that dialogue with Neel for more on the question of how much of the model we understand.

This is why AI control research usually assumes that none of the methods you described work, and relies on black-box properties that are more robust to this kind of optimization pressure (mostly "the AI can't do X").

I agree with most of this, thanks for saying it. I've been dismayed for the last several years by the continuing, unreasonable level of emphasis on interpretability techniques as a strategy for safety.

My main disagreement is that you place more emphasis than I would on chain-of-thought monitoring compared to other AI control methods. CoT monitoring seems like a great control method when available, but I think it's reasonably likely that it won't work on the AIs that we'd want to control, because those models will have access to some kind of "neuralese" that allows them to reason in ways we can't observe. This is why I mostly focus on control measures other than CoT monitoring. (All of our control research to date has basically been assuming that CoT monitoring is unavailable as a strategy.)
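
To make that concrete, here's a rough sketch (purely illustrative; `untrusted_model`, `trusted_monitor`, and `human_audit` are made-up placeholders, not a real API) of a control measure that only looks at the AI's proposed actions, not its reasoning:

```python
# Purely illustrative sketch: a control measure that never reads the model's
# reasoning. A weaker trusted model scores each proposed action, and only the
# most suspicious actions consume the limited human-audit budget.
# `untrusted_model`, `trusted_monitor`, and `human_audit` are hypothetical
# placeholders, not real APIs.

def run_with_trusted_monitoring(task, untrusted_model, trusted_monitor,
                                human_audit, audit_threshold=0.98):
    """Return an action for `task`, relying only on black-box checks of the action."""
    action = untrusted_model.propose_action(task)

    # The monitor sees only the task and the proposed action, never any hidden
    # chain of thought, so this still applies if the model reasons in "neuralese".
    suspicion = trusted_monitor.score(task, action)  # 0.0 (benign) to 1.0 (suspicious)

    if suspicion >= audit_threshold:
        # Escalate only the most suspicious actions to a scarce human reviewer.
        if not human_audit(task, action):
            raise RuntimeError("Proposed action blocked after human audit")

    return action
```

The point is just that the check operates on the action itself, so it doesn't matter whether the model's reasoning is legible to us.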

Another note is that you might have goals other than finding deceptive AI, e.g. you might want to be able to convince other people that you've found deceptive AI (which I'm somewhat skeptical you'll be able to do with non-behavioral methods), or you might want to be able to safely deploy known-scheming models. Interp doesn't obviously help much with those, which makes it a worse target for research effort.

IIRC, an Anthropic staff member told me that he had a strong suspicion for why this is, but that it was tied up in proprietary info so he didn't want to say.
