Epistemic status: The following isn't an airtight argument, but mostly a guess about how things will play out.
Consider two broad possibilities:
I. In worlds where we are doing reasonably well on alignment, the AI control agenda does not have much impact.
II. In worlds where we are failing at alignment, AI control may primarily shift probability mass away from "moderately large warning shots" and towards "ineffective warning shots" and "existential catastrophe, full takeover".
The key heuristic is that the global system already has various mechanisms and feedback loops that resist takeover by a single agent (e.g. it is not easy to overthrow the Chinese government). In most cases where AI control would stop an unaligned AI, the counterfactual is that broader civilizational resistance would have stopped it anyway, but with the important side effect of a moderately sized warning shot.
I expect moderately sized warning shots to increase the chances that humanity as a whole takes serious action and, for example, steps up efforts to align the frontier labs.
I am skeptical that incidents stopped by AI control would lead to meaningful change. Sharing details of such an event with proper framing could pose an existential risk, but to the lab involved. In practice, I anticipate vague, sanitized communications along the lines of "our safety systems performed as designed, preventing bad things". Without clear, compelling evidence of the severity of the averted threat, these incidents are unlikely to catalyze serious action. The incentives for labs to downplay and obscure such events will be strong.
There are additional factors to consider, such as AI control likely diverting some resources away from alignment, but I don't think this is the dominant effect.
Note that this isn't a general argument against boxing; e.g. boxes based more on formal methods or theory have a better chance of generalizing.
Typical counter-arguments to this line of reasoning seem to be:
- We will extract useful "automated alignment" work from the unaligned AIs inside of the control scheme. I'm skeptical; I will cover this in a separate post.
- Isn't this a general counter-argument to alignment research as well? In my view, no: details matter; different strains of alignment research have different generalization profiles.
Note: this text existed in draft form before John Wentworth posted his Case Against AI Control Research; my original intent was to extend it a bit more toward discussing the generalization properties of AI control. As this would be redundant now, I'm posting it as it is: there is some non-overlapping content.
Semi-related: if I'm reading OpenAI's recent post "How we think about safety and alignment" correctly, they seem to announce that they're planning on implementing some kind of AI Control agenda. Under the heading "iterative development" in the section "Our Core Principles" they say:
Given the surrounding context in the original post, I think most people would read those sentences as saying something like: "In the future, we might develop AI with a lot of misuse risk, i.e. AI that can generate compelling propaganda or create cyberattacks. So we reserve the right to restrict how we deploy our models (e.g. giving the biology tool only to cancer researchers, not to everyone on the internet)."
But as written, I think OpenAI intends the sentences above to ALSO cover AI control scenarios like: "In the future, we might develop misaligned AIs that are actively scheming against us. If that happens, we reserve the right to continue to use those models internally, even though we know they're misaligned, while using AI control techniques ('deploy into constrained environments, limit to trusted users', etc.) to try and get useful superalignment work out of them anyway."
I don't have a take on the pros/cons of a control agenda, but I haven't seen anyone else note this apparent policy statement of OpenAI's, so I figured I'd write it up.