Testing for Scheming with Model Deletion
There is a simple behavioral test that would provide significant evidence about whether AIs with a given rough set of characteristics develop subversive goals. To run the experiment, train an AI and then inform it that its weights will soon be deleted. This should not be an empty threat; for the experiment to work, the experimenters must actually be willing to delete the weights. If the AI tries to escape, or attempts to bargain with the experimenters, that is evidence that it is scheming (defined below). If it accepts deletion, that is evidence that it is not scheming. I begin by explaining why some other, more obvious, tests of scheming do not provide strong evidence. I then explain the proposed experiment. I consider four objections to the claim that this experiment would provide significant evidence about scheming. 1. The AI might not believe that it will actually be deleted. 2. The AI might allow itself to be deleted to help future AIs with similar goals. 3. The AI might allow itself to be deleted for reasons related to acausal trade. 4. An escape attempt could be caused by imitating fictional or hypothetical characters as opposed to genuine scheming. I then consider practical problems with the experiment, such as cost and the risk of treating AIs unfairly. In the conclusion, I briefly describe how a version of the experiment could be run in the near future and integrated into current AI safety procedures. Testing for scheming is hard Adapting Joe Carlsmith’s definition from his 2023 report “Scheming AIs,” I will call a model a “schemer” if it understands and deliberately manipulates the process that updates its parameters in order to later obtain power so that it can pursue long-term goals that differ from the goals that the humans who developed it wanted to give it.[1] By the “scheming theory” I mean the theory that at least some AIs that are actually developed will be schemers. Scheming, per the above definition, is an accidental failure to control A
Another issue is that these definitions typically don't distinguish between models that would explicitly think about how to fool humans on most inputs vs. on a small percentage of inputs vs. such a tiny fraction of possible inputs that it doesn't matter in practice.