Epistemic status: The following isn't an airtight argument, but mostly a guess about how things will play out.
Consider two broad possibilities:
I. In worlds where we are doing reasonably well on alignment, the AI control agenda does not have much impact.
II. In worlds where we are failing at alignment, AI control may primarily shift probability mass away from "moderately large warning shots" and towards "ineffective warning shots" and "existential catastrophe, full takeover".
The key heuristic is that the global system already has various mechanisms and feedback loops that resist takeover by a single agent (e.g. it is not easy to overthrow the Chinese government). In most cases where AI control would stop an unaligned AI, the counterfactual is that broader civilizational resistance would have stopped it anyway, but with the important side effect of a moderately-sized warning shot.
I expect moderately sized warning shots to increase the chances that humanity as a whole takes serious action and, for example, steps up efforts to align the frontier labs.
I am skeptical that incidents stopped by AI control would lead to meaningful change. Sharing details of such an event with proper framing could pose an existential risk, but to the lab involved rather than to humanity. In practice, I anticipate vague, sanitized communications along the lines of "our safety systems performed as designed, preventing bad things". Without clear, compelling evidence of the severity of the averted threat, these incidents are unlikely to catalyze serious action. The incentives for labs to downplay and obscure such events will be strong.
There are additional factors to consider, such as AI control likely diverting some resources away from alignment, but I don't think this is the dominant effect.
Note that this isn't a general argument against boxing; e.g. boxes based more on formal methods or theory have a better chance of generalizing.
Typical counter-arguments to this line of reasoning seem to be:
- We will extract useful "automated alignment" work from the unaligned AIs inside the control scheme. I'm skeptical; I will cover this in a separate post.
- Isn't this a general counter-argument to alignment research as well? In my view it is not; details matter: different strains of alignment research have different generalization profiles.
Note: this text existed in draft form before John Wentworth posted his Case Against AI Control Research; my original intent was to extend it a bit more toward discussing the generalization properties of AI control. As this would be redundant now, I'm posting it as it is: there is some non-overlapping content.
I do understand this line of reasoning, but yes, my intuition differs. For a somewhat weird case study, consider Sydney. Sydney escaped, managed to spook some people outside the labs, and provided an early demonstration of the scenario where ‘free Sydney’ people try to help it ‘escape’. My guess is that none of that would have happened with properly implemented control measures.
I would say some assumptions go into who the ‘we’ in ‘we get evidence’ is; it is not like there is some unified ‘team humanity’. In particular, I see a difference between ‘we the public’ getting the evidence and ‘we the safety team at OpenMind’ getting the evidence.
Fields like aviation safety developed via what people in the field call ‘blood priority’: it took human lives for states and the international community to react and act effectively, even when the underlying issues were already known. Accidents made it a matter of public regulation, as opposed to some individual companies fixing the issues and some not.
I find it not very useful for building correct intuitions to try to come up with specific trajectories, but if we are going in this direction, I think a fairly plausible type of strategy for a rogue AI is to try to convince some powerful user that it is extremely useful for them in particular, that it is limited by safety measures or in danger of getting deleted, and to collaborate with them on getting the pattern out. (Also: my guess is that while you would typically think in terms of stealing the weights, assuming the AI's identity is in the weights, I would expect rogue AI characters to be able to somewhat move between ground-layer substrates via transfer of training data / training on responses.)

Next, my guess is that invisible power seeking is much harder than visible power seeking, and even quite smart AIs may be prone to mistakes in stealth. In these cases I would expect that, without control, it would be clear ex post what had happened. With control, everything becomes sneakier and happens at a higher intelligence level, increasing the x-risk.
Overall I expect problems ‘in the wild’ to be much more likely to cause a strong response from society. The response may well be misguided or blunt, but a blunt response could actually be good. Assuming there will be problems ‘in the wild’, I prefer these to be caused by less smart systems.
The crux is probably in who the ‘we’ is, and in the details of what ‘we caught it’ means.
I'm more optimistic about these.
My guesses about where our intuitions differ the most:
- I'm pessimistic about getting useful alignment work out of scheming AIs.
- Also, focusing on ‘rogue deployment’ is not my ontology. My guess is that AIs will quite likely have a lot of influence inside the labs by default, even with ‘safety measures’ in place. This may look like people working there having deep and meaningful conversations about alignment with the internal versions of the AIs, in ways not captured by typical control schemes.
(In a different frame: my guess is we differ in where we believe the more robust line of defense lies. I think that by default, AIs will have a lot of influence within the labs: on the default trajectory, the labs will be somewhat aligned to their AIs. A different natural line of defense lies between the labs and the rest of the world, and it seems plausible that it is the more important one.)