In which worlds would AI Control prevent significant harm?
When I bring up the issue of AI model security with people working in AI safety, I'm often met with a response of the form: “yes, this is a problem. It's important that people work hard on securing AI models. But it doesn't really affect my work.”
Using AI Control (an agenda that has recently excited many in the field) as an example, I lay out an argument for why, given the realities of our cybersecurity situation, it might be less effective than one would think.
There are of course other arguments against working on AI Control. For example, it may encourage the development and use of models capable of causing significant harm, which becomes an issue if the control methods fail or if the model is stolen. So when advocating for AI Control work, one must either be willing to eat this cost or argue that it isn't a large one.
This isn't to say that AI Control isn't a promising agenda; I just think people need to carefully consider the cases in which their agenda falls down for reasons that aren't technical arguments about the agenda itself.
I'm also interested to hear takes from those excited by AI Control on which of the conditions listed in #6 above they expect to hold (or otherwise to poke holes in the argument).
In general, the hacking capabilities of state actors, and the likely involvement of national security agencies as we get closer to AGI, feel like significant blind spots in LessWrong discourse.
(The Hacker and the State by Ben Buchanan is a great book for learning about the former.)
I'd say a combination of 6B and 6C is my story for AI control working: AI labor is used to make computer security much better, and importantly, security becomes good enough that even nation-states can't defeat it. This comes from a combination of general security practices improving and AIs becoming good enough at mathematics and coding to generate sound formal proofs without humans in the loop, which are then used to defend the most critical infrastructure for AI labs.
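To make "sound formal proofs" concrete, here is a minimal sketch in Lean 4 of the kind of machine-checkable guarantee this story relies on. The `safeRead` function and the theorem are illustrative assumptions on my part, not anything from the post:

```lean
-- Illustrative only: a bounds-checked read whose safety property is
-- proved once and then checked mechanically, with no human review needed.

def safeRead (buf : Array UInt8) (i : Nat) : Option UInt8 :=
  if h : i < buf.size then some (buf[i]'h) else none

-- For any out-of-range index, `safeRead` provably returns `none`
-- rather than reading out of bounds.
theorem safeRead_out_of_range (buf : Array UInt8) (i : Nat)
    (h : buf.size ≤ i) : safeRead buf i = none := by
  unfold safeRead
  rw [dif_neg (Nat.not_lt.mpr h)]
```

The point is that the trusted base shrinks to a proof checker: an AI that can produce proofs like this at the scale of real codebases would not need human auditors in the loop for this class of guarantee.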
This post is probably the best story of how we could get to a state where AI control is useful:
This might be a reason to try to design AIs to fail safe and break without their controlling units. E.g. before fine-tuning language models to be useful, fine-tune them to not generate useful content without approval tokens generated by a supervisory model.
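As a minimal sketch of how that gating could look at inference time: all names and the supervisor policy below are hypothetical, and the scheme assumes the model really was fine-tuned to produce useless output when the token is absent.

```python
from typing import Callable

APPROVAL_TOKEN = "<|approved|>"


def supervisor_approves(prompt: str) -> bool:
    """Stand-in for a supervisory model that vets each request.
    A real supervisor would be a separate, more trusted model,
    not a keyword filter; this toy policy is just for illustration."""
    banned = ("exfiltrate", "self-replicate")
    return not any(term in prompt.lower() for term in banned)


def gated_generate(model: Callable[[str], str], prompt: str) -> str:
    """Prepend the approval token only when the supervisor signs off.

    Assumption: the model was fine-tuned so useful completions only
    follow APPROVAL_TOKEN; on the ungated path it was trained to
    refuse or degrade, so a stolen model without the supervisor
    fails safe (breaks) rather than failing dangerous.
    """
    if supervisor_approves(prompt):
        return model(APPROVAL_TOKEN + prompt)
    return model(prompt)  # trained to be unhelpful on this path
```

One caveat with this design: the scheme only fails safe if the approval token can't be guessed or leaked, e.g. it must be stripped from all model outputs and logs, since anyone who learns it can unlock the model without the supervisor.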