Epistemic status: The following isn't an airtight argument, but mostly a guess about how things will play out.
Consider two broad possibilities:
I. In worlds where we are doing reasonably well on alignment, the AI control agenda does not have much impact.
II. In worlds where we are failing at alignment, AI control may primarily shift probability mass away from "moderately large warning shots" and towards "ineffective warning shots" and "existential catastrophe, full takeover".
The key heuristic is that the global system already has various mechanisms and feedback loops that resist takeover by a single agent (e.g. it is not easy to overthrow the Chinese government). In most cases where AI control would stop an unaligned AI, the counterfactual is that broader civilizational resistance would have stopped it anyway, but with the important side effect of a moderately-sized warning shot.
I expect moderately sized warning shots to increase the chance that humanity as a whole takes serious action and, for example, steps up efforts to align the frontier labs.
I am skeptical that incidents stopped by AI control would lead to meaningful change. Sharing details of such an event with proper framing could pose an existential risk, but to the lab involved. In practice, I anticipate vague, sanitized communications along the lines of "our safety systems performed as designed, preventing bad things". Without clear, compelling evidence of the severity of the averted threat, these incidents are unlikely to catalyze serious action. The incentives for labs to downplay and obscure such events will be strong.
There are additional factors to consider, such as AI control likely diverting some resources away from alignment, but I don't think this is the dominant effect.
Note that this isn't a general argument against boxing; for example, boxes based more on formal methods or theory have a better chance of generalizing.
Typical counter-arguments to this line of reasoning seem to be:
- We will extract useful "automated alignment" work from the unaligned AIs inside the control scheme. I'm skeptical; I will cover this in a separate post.
- Isn't this a general counter-argument against alignment research as well? In my view, no: details matter, and different strains of alignment research have different generalization profiles.
Note: this text existed in draft form before John Wentworth posted his Case Against AI Control Research; my original intent was to extend it a bit more toward discussing the generalization properties of AI control. As that would be redundant now, I'm posting it as it is: there is some non-overlapping content.
Related question: are you in favor of making AGI open weights? By AGI, I mean AIs which effectively operate autonomously and can autonomously acquire money/power. This includes AIs capable enough to automate whole fields of R&D (but not much more capable than this). I think the case for this being useful on your views feels much stronger than the case for control preventing warning shots. After all, you seemingly mostly thought control was bad due to the chance it would prevent escapes or incidents of strange (and not that strategic) behavior. Naively, I think that open weights AGI is strictly better from your perspective than having the AGI escape.
I think there is a coherent perspective that favors open weights AGI because it would cause havoc/harm which then results in the world handling AI more reasonably. And there are other benefits in terms of transparency and alignment research. But the expected number of lives lost feels very high to me (which would probably also make AI takeover more likely by making humanity weaker), and having AIs capable of automating AI R&D be open weights means that slowing down is unlikely to be viable, reducing the potential value of society taking the situation more seriously.
You could have the view that open weights AGI is too costly in terms of takeover risk and that escape is also bad, but that we'll hopefully have some pre-AGI AIs which exhibit strange misaligned behaviors that don't really get them much/any influence/power. If this is the view, then it really feels to me like preventing escapes/rogue internal deployments is pretty useful.