In which worlds would AI Control (or any other agenda which relies on non-trivial post-training operation) prevent significant harm?
When I bring up the issue of AI model security to people working in AI safety, I’m often met with something of the form “yes, this is a problem. It’s important that people work hard on securing AI models. But it doesn’t really affect my work”.
Using AI Control (an area which has recently excited many in the field) as an example, I lay out an argument for why, once you take into account the realities of our cyber security situation, it might not be as effective an agenda as one might think.
There are, of course, other arguments against working on AI Control. For example, it may encourage the development and use of models capable of causing significant harm, which is an issue if the AI Control methods fail or if the model is stolen. So when advocating for AI Control work, one must either be willing to eat this cost or argue that it isn't a large one.
This isn’t to say that AI Control isn’t a promising agenda; I just think people need to carefully consider the cases in which their agenda falls down for reasons that aren’t technical arguments about the agenda itself.
I’m also interested to hear takes from those excited by AI Control on which of the conditions listed in #6 above they expect to hold (or to otherwise poke holes in the argument).
EDIT (thanks Zach and Ryan for bringing this up): I didn't want to imply that AI Control is unique here; this argument can be levelled at any agenda which relies on something like a raw model plus non-trivial operation effort, e.g. a scheme which relies on interpretability or black-box methods for monitoring or scalable oversight.
This argument isn't control-specific; it applies to any safety scheme with some operating tax or implementation difficulty.[1][2]
I agree that if there is insufficient security, paying any sizable safety tax (including slowing down) will be difficult. It would either require substantial coordination or much more aggressive USG action than would otherwise be needed.
Thus, I agree that security is a huge factor in how AI goes, plausibly more important than technical safety work done in advance, and I agree that the technical AI safety community probably underestimates the importance of security. (Security is probably notably less tractable than technical AI safety measures for a variety of reasons, but perhaps of greater importance than doing technical AI safety in advance.)
That said, I can certainly imagine worlds where AI control saves us despite the model being stolen by multiple adversaries. My sense is that the situation is much worse if a powerful AI is stolen, especially if it is stolen by multiple actors, but it isn't obviously fatal.
>Therefore, AI control will only be useful in the coming years under one (or more) of these conditions:
I would note that we don't just aim to prevent the outcome of the model escaping. Indeed, even if a (misaligned) model were literally made publicly available (on a torrent or whatever), we might still care about safely extracting work from instances of the model we control. At that point we no longer care about preventing escape, but we do care about getting good, non-sabotaged work (on e.g. security or technical AI safety).
We also aren't just worried about direct sabotage, we want to prevent the AI from setting itself up in a position where it has much more power over the AI company. E.g., we really want to prevent it from being able to run itself without monitoring in a position where it has full permissions to do things within the AI company.
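To make this concrete, here's a minimal sketch of the kind of gate a control scheme might put between an untrusted model and anything it wants to execute. The names, the toy monitor, and the thresholds are purely illustrative assumptions, not a description of any real setup:

```python
# Minimal illustrative sketch (not a real system): a trusted monitor scores each
# proposed action, anything suspicious or permission-elevating gets escalated to
# human audit, and nothing runs unmonitored with full permissions.
from dataclasses import dataclass

@dataclass
class ProposedAction:
    description: str
    needs_elevated_permissions: bool

def monitor_suspicion(action: ProposedAction) -> float:
    """Stand-in for a weaker trusted model scoring how suspicious an action looks (0 to 1)."""
    return 0.9 if "disable monitoring" in action.description else 0.1

def gate(action: ProposedAction, audit_threshold: float = 0.5) -> str:
    if action.needs_elevated_permissions:
        return "escalate to human audit"   # never grant full permissions automatically
    if monitor_suspicion(action) >= audit_threshold:
        return "escalate to human audit"   # suspicious work gets checked before use
    return "execute"

print(gate(ProposedAction("refactor the test harness", False)))             # execute
print(gate(ProposedAction("disable monitoring on the deploy host", False))) # escalate to human audit
```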
Even if the model has literally already escaped into the outside world, I think control is useful. (Though this is obviously a dismal situation to be in.)
As an example scenario:
Edit: One further note is that if the tax from control is sufficiently small and takeoff is sufficiently slow by default, then control can still let us get the key work done on time (prior to adversaries being able to train much more powerful models such that we can't be competitive anymore).
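As a toy back-of-the-envelope with entirely made-up numbers, the condition is roughly that the control tax fits inside our lead over adversaries:

```python
# Toy back-of-the-envelope with made-up numbers (not a forecast): control is
# still "on time" if the slowdown it imposes fits inside our lead over adversaries.
lead_time_months = 12.0      # assumed lead before adversaries can train comparably powerful models
work_needed_months = 9.0     # assumed time for the key safety/security work with no control tax
control_tax = 0.2            # assumed fractional slowdown imposed by the control scheme

time_with_control = work_needed_months * (1 + control_tax)  # 10.8 months
print(time_with_control <= lead_time_months)                # True: work finishes before the lead runs out
```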
You also bring up a risk compensation argument: maybe control is bad because it encourages the development of powerful models via making some of the risks lower. This argument also applies to other safety strategies. ↩︎
Zach also noted this, but I thought it was worth emphasizing. ↩︎
Thanks for the thoughts. They've made me think that I'm likely underestimating how much Control is needed to get useful work out of AIs capable and inclined to scheme. Ideally, this fact would increase the likelihood of other actors implementing AI Control schemes with the stolen model that are at least sufficient for containment and/or make them less likely to steal the model, though I wouldn’t want to put too much weight on this hope.
>This argument isn't control-specific; it applies to any safety scheme with some operating tax or implementation difficulty.[1][2]
Yep, for sure. I’ve changed the title and commented about this at the end.
>Ideally, this fact would increase the likelihood of other actors implementing AI Control schemes with the stolen model that are at least sufficient for containment and/or make them less likely to steal the model, though I wouldn’t want to put too much weight on this hope.
Agreed, it might not be clear to all actors that AI control would be in their interests for reducing risks of sabotage and other bad outcomes!
I think one of the priorities of safety-conscious AI developers should be to develop some realistic and downright terrifying demos to show privately to NatSec folks in US/UK/EU. Gotta get that military-grade security in place before we get too much further.
I'd say a combo of 6B and 6C is my story for AI control working: AI labor is used to make computer security much better, and importantly security becomes good enough that even nation-states can't defeat it. That happens through a combination of general security practices improving and AIs becoming good enough at mathematics and coding that they can generate sound formal proofs without humans in the loop, which are then used to defend the most critical infrastructure of AI labs.
This post is probably the best story of how we could get to a state where AI control is useful:
I agree safety-by-control kinda requires good security. But safety-by-alignment kinda requires good security too.
I think that even after your edit, your argument still applies more broadly than you're giving it credit for: if computer security is going to go poorly, then we're facing pretty serious AI risk even if the safety techniques require trivial effort during deployment.
If your AI is stolen, you face substantial risk even if you had been able to align it (e.g. because you might immediately get into an AI enabled war, and you might be forced to proceed with building more powerful and less-likely-to-be-aligned models because of the competitive pressure).
So I think your argument also pushes against working on alignment techniques.
I'm curious @Dan Braun, why don't you work on computer security (assuming I correctly understand that you don't)?
I plan to spend more time thinking about AI model security. The main reasons I’m not spending a lot of time on it now are:
In general, the hacking capabilities of state actors, and the likely involvement of national security agencies as we get closer to AGI, feel like significant blind spots in LessWrong discourse.
(The Hacker and the State by Ben Buchanan is a great book to learn about the former.)
Quite the opposite, it seems to me, but what you consider "misuse" and "harm" depends on what you value, I suppose.
Maybe we should pick the States we want to be in control, before the States pick for us....
Do you have a preference between Switzerland and North Korea? Cause I sure do.
If we passively let whoever steals the most tech come out as the winner, we might end up with some unpleasant selection effects.
"Having the reins of the future" is a non-goal.
Also, you don't "have the reins of the future" now, never have had, and almost certainly never will. - jbash
jbash seems to think that we LessWrong posters have no power to affect global-scale outcomes. I disagree. I believe some of us have quite a bit of power, and should use it before we lose it.
I think it's important to consider hacking in any safety effort. Whoever steals the weights would probably also steal and use any safety methods for control or alignment, for the same reason the originating org was using them: they don't want to lose control of their AGI. Better to make those techniques and their code public, and publicly advertise why you're using them!
Of course, we'd worry that some actors (North Korea, Russia, individuals who are skilled hackers) are highly misaligned with the remainder of humanity, and might bring about existential catastrophes through some combination of foolishness and selfishness.
The other concern is mere proliferation of aligned/controlled systems, which leads to existential danger as soon as those systems approach the capability for autonomous recursive self-improvement: If we solve alignment, do we die anyway?
This might be a reason to try to design AIs to fail safe and break without their controlling units. E.g., before fine-tuning language models to be useful, fine-tune them not to generate useful content without approval tokens generated by a supervisory model.
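To make that gesture slightly more concrete, the fine-tuning data might be set up something like the rough sketch below. This is purely illustrative, not a working scheme (it uses a single hard-coded token, which is part of what the replies below push back on), and the token format and supervisory setup are made up:

```python
# Rough, purely illustrative sketch of fine-tuning data for "approval-token"
# gating (single hard-coded token; not a working scheme -- see the replies below):
# completions are only helpful when the prompt carries a token issued by a
# supervisory model, and are refusals otherwise.
APPROVAL_TOKEN = "<|approved-7f3a|>"  # hypothetical token format, made up for this sketch

def make_training_pair(user_prompt: str, helpful_completion: str, approved: bool) -> dict:
    if approved:
        prompt = f"{APPROVAL_TOKEN}\n{user_prompt}"
        completion = helpful_completion
    else:
        prompt = user_prompt
        completion = "I can't help with that without supervisory approval."
    return {"prompt": prompt, "completion": completion}

# The same request, once with and once without supervisory approval.
print(make_training_pair("Write a port scanner in Python.", "Here's a simple scanner...", approved=True))
print(make_training_pair("Write a port scanner in Python.", "Here's a simple scanner...", approved=False))
```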
I don't see how that would work technically. It seems like any small set of activating tokens would be stolen along with the weights, and I don't see how to train it for a large shifting set.
I'm not saying this is impossible, just that I'm not sure it is. Can you flesh this idea out any further?
Sorry, that was an off-the-cuff example I meant to help gesture towards the main idea. I didn't mean to imply it's a working instance (it's not). The idea I'm going for is: