This is a special post for quick takes by Dan Braun.

In which worlds would AI Control prevent significant harm?

When I bring up the issue of AI model security to people working in AI safety, I’m often met with something of the form “yes, this is a problem. It’s important that people work hard on securing AI models. But it doesn’t really affect my work”.

Using AI Control (an area that has recently excited many in the field) as an example, I lay out an argument for why it may be a less effective agenda than one might think once the realities of our cyber security situation are taken into account.

  1. AI Control concerns itself with models that intentionally try to subvert their developers.
  2. These models are likely to be very generally capable, and capable of causing significant harm without countermeasures.
  3. Leading cyber-capable institutions would likely expend significant resources and political capital to steal these models, or to steal enough insights to reproduce them.
  4. If the weights or insights are stolen, work on AI Control will not prevent these models from causing significant harm.
  5. Current AI developers are not on track to be able to defend against high-priority operations from leading cyber-capable institutions in the coming years.
  6. Therefore, AI Control will only be useful in the coming years under one (or more) of these conditions:
    a. Models that scheme are unlikely to be generally capable/dangerous enough to be a high-priority target for leading cyber-capable institutions.
    b. Models that scheme are only developed by actors that can thwart high-priority operations from leading cyber-capable institutions (which precludes current AI developers for at least several years).
    c. AI Control won't be directly useful in the coming years, but it will be indirectly useful by progressing the field for when models are developed by actors capable of thwarting top cyber operations.
    d. Even if the model were stolen and caused significant harm, there would still be less harm overall than if the copy under AI Control measures had also broken out. Of course, this argument is much weaker for models capable of existential harm.
    e. Actors that steal the model would also implement strong enough AI Control measures and would not misuse the AI to cause significant harm.

There are of course other arguments against working on AI Control, e.g. that it may encourage the development and use of models capable of causing significant harm. This is an issue if the AI Control methods fail or if the model is stolen, so when advocating for AI Control work one must either be willing to eat this cost or argue that it isn't a large one.

This isn’t to say that AI Control isn’t a promising agenda; I just think people need to carefully consider the cases in which their agenda falls down for reasons that aren’t technical arguments about the agenda itself.

I’m also interested to hear takes from those excited by AI Control on which of the conditions listed in #6 above they expect to hold (or otherwise to poke holes in the argument).

In general, the hacking capabilities of state actors, and the likely involvement of national security agencies as we get closer to AGI, feel like significant blind spots in LessWrong discourse.

(The Hacker and the State by Ben Buchanan is a great book to learn about the former.)

But states seem quite likely to fall under 6e, no?

Quite the opposite, it seems to me, but what you consider "misuse" and "harm" depends on what you value, I suppose.

I'd say a combo of 6b and 6c is my story for AI Control working: AI labor is used to make computer security much better, and importantly security becomes good enough that even nation-states can't defeat it. This would come from a combination of general security practices improving and AIs becoming good enough at mathematics and coding to generate sound formal proofs without humans in the loop, which would be used to defend the most critical infrastructure of AI labs.

This post is probably the best story of how we could get to a state where AI Control is useful:

https://www.lesswrong.com/posts/2wxufQWK8rXcDGbyL/access-to-powerful-ai-might-make-computer-security-radically


This might be a reason to try to design AIs to fail safe and break without their controlling units. E.g. before fine-tuning language models to be useful, fine-tune them to not generate useful content without approval tokens generated by a supervisory model.
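A minimal sketch of how this approval-token gating could be wired up, assuming a stubbed supervisory model and an HMAC-signed token. All names and the policy check here are hypothetical, and in the actual proposal the refusal behaviour would be fine-tuned into the task model itself rather than enforced by a wrapper:

```python
import hashlib
import hmac
import secrets

# Key held only by the supervisory system; the task model cannot forge tokens.
SUPERVISOR_KEY = secrets.token_bytes(32)


def supervisor_review(request: str) -> str | None:
    """Stub supervisory model: approve benign requests and sign them."""
    if "exfiltrate" in request.lower():  # placeholder policy check
        return None
    return hmac.new(SUPERVISOR_KEY, request.encode(), hashlib.sha256).hexdigest()


def task_model_generate(request: str, approval_token: str | None) -> str:
    """Stub task model: produces useful output only with a valid approval token.

    In the real proposal this refusal behaviour would be trained into the model,
    not implemented as a verification wrapper like this.
    """
    expected = hmac.new(SUPERVISOR_KEY, request.encode(), hashlib.sha256).hexdigest()
    if approval_token is None or not hmac.compare_digest(approval_token, expected):
        return "[refusal: no valid approval token]"
    return f"[useful completion for: {request}]"


if __name__ == "__main__":
    req = "Summarise this security report."
    token = supervisor_review(req)
    print(task_model_generate(req, token))  # useful output
    print(task_model_generate(req, None))   # fails safe without approval
```

The signing here is just a stand-in for "tokens the task model cannot produce on its own"; the open question in the proposal is whether fine-tuning can make the refusal behaviour robust rather than easily trained or prompted away.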