I applaud the work; detailed thinking about control efforts is one useful alignment strategy.
I find the assumption of no time-awareness in the AI to be unrealistic. Allowing continuous learning is going to be extremely tempting to anyone developing or deploying proto-AGI. Episodic memory (including across sessions) is fairly easy to implement, and can add capabilities as well as save compute costs. Working out useful knowledge and strategies will probably take nontrivial compute costs, and throwing away that knowledge after each session will seem very wasteful of resources and will limit usefulness.
I agree re time-awareness, with two caveats:
In this post, we analyze some simple games that model the problem of ensuring AIs never succeed at taking a single catastrophic action (which we'll refer to as "escaping") during deployment. This exercise highlights some of the difficulties associated with deploying misaligned models given a limited perfect auditing budget, and also suggests some opportunities for technique development.
We analyzed these games while writing AI Control: Improving Safety Despite Intentional Subversion. (Thanks to harfe for solving one of these games, and to Ryan Greenblatt and Kshitij Sachan for discussion about these at the time.)
Core takeaways:
Assumptions used across all toy models
For the sake of simplicity, we make the following assumptions:
Distinguishable episodes with limited auditing
The model: In every episode, the AI chooses whether to try to escape. It always succeeds if it tries. Independently of the AI's action, the lab chooses whether to apply an expensive auditing procedure to the AI action, which can catch an escape attempt with probability $e_{ov}$ (the efficiency of auditing).
The AI knows: The current episode number; and the properties of the game it is playing (in particular, it knows T, the number of episodes it will play for).
The optimal AI strategy: Select a random episode and try to escape then.
The optimal lab response: Apply auditing to a random subset of episodes of size $p_{ov}T$ (a fraction $p_{ov}$ of all episodes), as large as you can afford.
$P(\text{humanity wins}) = p_{ov}\,e_{ov}$
Takeaway: Auditing a random subset of episodes doesn't work: $p_{ov}$ will often need to be around 1% for the use of powerful AI to make any sense (compared to "just using humans"). We really want to either give less information to the AI, or get more signal on which AI actions are the most likely to lead to an escape.
Indistinguishable episodes with limited auditing
The model: Same as above. We assume $T \gg 1$ (large number of episodes) and $e_{ov} = 1$ for simplicity (no auditing failure).
The AI knows: Only the properties of the game.
The optimal AI strategy: On each episode, attempt escape with probability $\lambda = \frac{1}{T}\,\frac{1}{1-p_{ov}}\,\log\left(\frac{1}{p_{ov}}\right)$.
The optimal lab response: Spend all the auditing budget at the start--keep spending until the auditing budget has been exhausted.
$P(\text{humanity wins}) = e^{-\lambda T} + \left(1 - e^{-\lambda p_{ov} T}\right) = e^{\frac{1}{1-p_{ov}}\log(p_{ov})} + 1 - e^{\frac{p_{ov}}{1-p_{ov}}\log(p_{ov})}$. ($T \gg 1$, therefore $\lambda \ll 1$, so the number of escape attempts is approximately Poisson-distributed, and there are two win conditions: no escape attempt at all, or at least one escape attempt during the initial episodes where auditing is on.) If $p_{ov} \ll 1$, this is close to $p_{ov}(1 + \log(1/p_{ov}))$.
Takeaway: This is somewhat better for humanity than the previous setup: the probability of winning is increased by about an order of magnitude if the proportion of audited episodes is between 1% and 1 in a billion. The core problem if you have no deployment countermeasures is not distribution shift, but that you want to deploy the AI for much longer than you can validate it for.
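As a quick numerical check of the formulas in this section, here is a minimal Python sketch. The function names and the example values ($p_{ov} = 1\%$, $T = 10^6$) are illustrative assumptions, not taken from the code accompanying this post.

```python
import math

def optimal_escape_rate(p_ov: float, T: int) -> float:
    """Per-episode escape probability: log(1/p_ov) / (T * (1 - p_ov))."""
    return math.log(1.0 / p_ov) / (T * (1.0 - p_ov))

def p_humanity_wins(p_ov: float, lam: float, T: int) -> float:
    """Poisson approximation: no attempt at all, or an attempt in the audited prefix."""
    return math.exp(-lam * T) + (1.0 - math.exp(-lam * p_ov * T))

# Illustrative values (assumptions): 1% auditing budget, one million episodes.
p_ov, T = 0.01, 1_000_000
lam = optimal_escape_rate(p_ov, T)
exact = p_humanity_wins(p_ov, lam, T)
approx = p_ov * (1.0 + math.log(1.0 / p_ov))  # small-p_ov approximation
print(f"lambda = {lam:.3e}, P(win) = {exact:.4f}, approximation = {approx:.4f}")
```

Perturbing `lam` in either direction increases `p_humanity_wins`, which is consistent with $\lambda$ being the AI's best response to front-loaded auditing.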
Indistinguishable episodes with limited auditing and escape attempts which sometimes fail silently
The model: Same as above, except that escape attempts only succeed with probability s. The lab doesn’t know when the AI fails to escape, except if it audits the attempt.
The AI knows: Only the properties of the game (T and s).
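Note: The situation looks the same as one where $p_{ov}$ has been increased to $p'_{ov} = p_{ov}/(p_{ov} + s(1-p_{ov}))$ and $T$ has been decreased to $T' = T(p_{ov} + s(1-p_{ov}))$. In other words, it's as if the non-audited period were shortened by a factor of $s$. Humanity wins if an attempt lands in the audited period, or if there is no attempt in the audited period and no successful attempt afterwards: $P(\text{humanity wins}) = \left(1 - e^{-\lambda p_{ov} T}\right) + e^{-\lambda p_{ov} T}\,e^{-\lambda s(1-p_{ov})T} = e^{-\lambda T(p_{ov} + s(1-p_{ov}))} + 1 - e^{-\lambda p_{ov} T} = e^{-\lambda T'} + 1 - e^{-\lambda T' p'_{ov}}$.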
The optimal AI strategy: Attempt to escape with probability $\lambda = \frac{1}{T'}\,\frac{\log(1/p'_{ov})}{1-p'_{ov}}$.
The optimal lab response: Spend all the auditing budget in the first epochs until the auditing budget has been exhausted.
$P(\text{humanity wins}) = e^{-\lambda T'} + 1 - e^{-\lambda T' p'_{ov}}$. If $p'_{ov} \ll 1$, this is close to $p_{ov}(1 + \log(s/p_{ov}))/s$.
Takeaway: We multiplied the probability of victory by approximately $1/s$. We need a very high failure rate for silent failures to offset a tiny $p_{ov}$. This is in contrast with loud failures (where humanity wins if the model fails), which give us a high chance of winning even if $p_{ov}$ is orders of magnitude smaller than $s_{loud}$, as long as $s_{loud}$ is small.
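The rescaling to an effective $p'_{ov}$ and $T'$ can be sketched in a few lines of Python. This is an illustration under assumed values ($p_{ov} = 0.1\%$, $s = 10\%$); the helper names are hypothetical and not from the code accompanying this post.

```python
import math

def effective_game(p_ov: float, s: float, T: float) -> tuple[float, float]:
    """Map the silent-failure game to an equivalent always-succeed game (p'_ov, T')."""
    scale = p_ov + s * (1.0 - p_ov)
    return p_ov / scale, T * scale

def p_win_silent(p_ov: float, s: float, T: float) -> float:
    """P(humanity wins) when the AI plays its optimal rate in the rescaled game."""
    p_eff, T_eff = effective_game(p_ov, s, T)
    lam = math.log(1.0 / p_eff) / (T_eff * (1.0 - p_eff))
    return math.exp(-lam * T_eff) + 1.0 - math.exp(-lam * T_eff * p_eff)

# Illustrative values (assumptions): 0.1% auditing, escapes succeed 10% of the time.
p_ov, s, T = 0.001, 0.1, 1_000_000
with_silent_failures = p_win_silent(p_ov, s, T)
always_succeeds = p_win_silent(p_ov, 1.0, T)
approx = p_ov * (1.0 + math.log(s / p_ov)) / s
print(with_silent_failures, always_succeeds, approx)
```

Setting `s = 1.0` recovers the previous section's game, so the ratio of the first two printed values shows how much silent failures help for these particular numbers.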
Indistinguishable episodes with limited auditing and escape opportunities of various levels of quality
The model: Same as above, except that there are different kinds of tasks, each with its own success probability $s_i$ and number of episodes $T_i$. This can also correspond to cases where there is a cheap monitor running on every input and where the deceptive AI can predict the output of the monitor: succeeding at an escape in a way that is less suspicious to the monitor is harder.
The AI knows: Only the properties of the game ($s_1, \dots, s_k$ and $T_1, \dots, T_k$).
The optimal AI strategy: Attempt to escape on task $i$ with probability $\lambda_i = \frac{C}{T'_i} = \frac{C}{T_i\,(p_i + s_i(1-p_i))}$, where $C = \frac{1}{k}\,\frac{1}{1-P}\,\log\left(\frac{1}{P}\right)$ is a constant independent of $i$ (see below for the expressions of $p_i$ and $P$).
The optimal lab response: Audit a fraction $p_i = \frac{P}{P + (1-P)/s_i}$ of each task $i$, where $P = p'_i$ is a constant independent of $i$ (see the section above for the interpretation of $p'_i$) and is the root of the budget condition $B = \sum_i \frac{P\,c_i T_i}{P + (1-P)/s_i}$.
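Demonstration: Writing $\lambda'_i = \lambda_i T'_i$, we have $P(\text{humanity wins}) = e^{-\sum_i \lambda'_i} + 1 - e^{-\sum_i \lambda'_i p'_i}$. At equilibrium, the derivatives of $P(\text{humanity wins})$ with respect to each $\lambda'_i$ are zero. Thus $e^{-\sum_j \lambda'_j} = p'_i\,e^{-\sum_j \lambda'_j p'_j}$, which means $p'_i$ is independent of $i$: $p'_i = P$. $P$ is fixed by the budget constraint: $B = \sum_i p_i c_i T_i = \sum_i \frac{P\,c_i T_i}{P + (1-P)/s_i}$. And thus $P(\text{humanity wins}) = e^{-\sum_i \lambda'_i} + 1 - e^{-P\sum_i \lambda'_i}$. By symmetry, all $\lambda'_i = \lambda_i T'_i$ are equal to a constant $C$, thus $P(\text{humanity wins}) = e^{-kC} + 1 - e^{-kPC}$. Using the same derivation as in the previous section, the AI minimizes this by choosing $C = \frac{1}{k}\,\frac{1}{1-P}\,\log\left(\frac{1}{P}\right)$.
$P(\text{humanity wins}) = e^{-kC} + 1 - e^{-kCP}$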
Takeaway: Given a good enough classification of tasks into dangerous and non-dangerous ones, one can drastically reduce the probability of an escape by focusing auditing on the most dangerous tasks (but still audit the other tasks with a reduced frequency).
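The budget condition has no closed-form root in general, but its right-hand side increases with $P$, so bisection is enough. The sketch below is an illustration only: the two-task example, the interpretation of $c_i$ as a per-episode audit cost, and all function names are assumptions rather than details from the code accompanying this post.

```python
import math

def audit_fraction(P: float, s_i: float) -> float:
    """Fraction of task i to audit so that every task has the same effective p'_i = P."""
    return P / (P + (1.0 - P) / s_i)

def solve_shared_P(budget, s, c, T, iters=200):
    """Bisection on P in (0, 1): total audit spend is increasing in P."""
    def spend(P):
        return sum(audit_fraction(P, si) * ci * Ti for si, ci, Ti in zip(s, c, T))
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if spend(mid) < budget:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def p_win_tasks(P: float, k: int) -> float:
    """P(humanity wins) = exp(-kC) + 1 - exp(-kCP) at the AI's optimal C."""
    C = math.log(1.0 / P) / (k * (1.0 - P))
    return math.exp(-k * C) + 1.0 - math.exp(-k * C * P)

# Hypothetical example: a small set of dangerous tasks and a large set of safer ones.
s = [0.5, 0.01]            # escape success probability per task type
T = [10_000, 990_000]      # number of episodes per task type
c = [1.0, 1.0]             # assumed cost of auditing one episode of each type
budget = 10_000.0          # enough to audit 1% of all episodes

P = solve_shared_P(budget, s, c, T)
fractions = [audit_fraction(P, si) for si in s]
print(P, fractions, p_win_tasks(P, k=len(s)))
```

In this made-up example the dangerous task type is audited far more heavily than the safer one, while the safer one is still audited at a nonzero rate.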
Indistinguishable episodes with limited average auditing in an infinite game
The model: The AI chooses whether to attempt to escape in each episode, which succeeds with probability $s$. Independently of the AI's actions, the lab chooses whether to apply an expensive auditing procedure to the AI's actions, which always catches an escape if it is attempted. The game runs forever, and the lab can audit a fraction $p_{ov}$ of episodes on average (imagine the lab getting a continuous stream of revenue).
The AI knows: Only the properties of the game.
The optimal AI strategy: Attempt to escape in each episode with probability $\epsilon$ small enough that the lab can't increase its probability of catching an escape attempt by auditing more at the beginning (i.e. $1/\epsilon$ is large relative to the number of episodes which happen between two moments where the lab gets revenue).
The optimal lab response: Audit each episode with probability $p_{ov}$.
$P(\text{humanity wins}) = 1 - \frac{1}{1 + \frac{p_{ov}}{s(1-p_{ov})}}$
Demonstration: $P(\text{humanity eventually loses}) = (1-p_{ov})\left(s + (1-s)\,P(\text{humanity eventually loses})\right)$, because humanity loses when the AI tries to escape, auditing is not used on that episode, and either the attempt succeeds or humanity eventually loses in a later round.
Takeaway: If the game is infinite, you need $p_{ov}$ to be much bigger than $s$ to win.
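The closed form can be checked against the recursion from the demonstration by iterating it to a fixed point. This is a small sketch with assumed example values; the function names are illustrative.

```python
def p_win_infinite(p_ov: float, s: float) -> float:
    """Closed form: 1 - 1 / (1 + p_ov / (s * (1 - p_ov)))."""
    return 1.0 - 1.0 / (1.0 + p_ov / (s * (1.0 - p_ov)))

def p_lose_by_iteration(p_ov: float, s: float, iters: int = 10_000) -> float:
    """Iterate L <- (1 - p_ov) * (s + (1 - s) * L), starting from L = 0."""
    lose = 0.0
    for _ in range(iters):
        lose = (1.0 - p_ov) * (s + (1.0 - s) * lose)
    return lose

# Illustrative values (assumptions): auditing rate 10x the success rate, then the reverse.
print(p_win_infinite(0.01, 0.001), 1.0 - p_lose_by_iteration(0.01, 0.001))
print(p_win_infinite(0.001, 0.01), 1.0 - p_lose_by_iteration(0.001, 0.01))
```

The two columns should agree, and comparing the two rows shows how quickly the win probability drops once $s$ exceeds $p_{ov}$.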
Coordinated escape in mostly indistinguishable episodes with limited auditing
The model: Same as indistinguishable episodes with limited auditing, but there are $k$ AIs which each choose whether to attempt to escape in each episode. They always succeed if they all attempt on the same episode, and fail silently otherwise. They can all see a uniformly random feature of the input, which can take one of $N$ possible values.
The AI knows: Only the properties of the game, and what value of the random feature to coordinate on (the lab doesn't have access to this value).
The optimal AI strategy: Attempt to escape with probability λ on episodes where the random feature has the value to coordinate on.
The optimal lab response: Spend all the auditing budget in the first epochs until the auditing budget has been exhausted.
P(humanity wins):
How P(humanity wins) is computed:
$P(\text{humanity wins}) = 1 - \left(1 - (1-\lambda^k)^{T(1-p)/N}\right)(1-\lambda)^{Tpk/N} \approx 1 - \left(1 - e^{-\lambda^k T(1-p)/N}\right)e^{-\lambda T p k/N} = e^{-(\lambda^k(1-p) + \lambda k p)T/N} + 1 - e^{-\lambda T p k/N}$
Therefore, by taking the derivative of this expression, we find that $\lambda$ satisfies $\lambda^{k-1}(1-p)/p + 1 = e^{\lambda^k(1-p)T/N}$, which we solve numerically. The code for this computation and all other plots of this post can be found here.
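For readers who want to reproduce the numerical step without the linked code, here is one possible approach, which is an assumption about method rather than a description of what that code does: minimize the approximate $P(\text{humanity wins})$ over $\lambda$ directly, using a coarse log-spaced grid followed by a ternary search. Minimizing avoids the trivial root $\lambda = 0$ that the first-order condition has when $k > 1$. The parameter values are illustrative.

```python
import math

def p_win_coordinated(lam: float, k: int, p: float, T: float, N: float) -> float:
    """Approximate P(humanity wins) for k AIs coordinating on one of N feature values."""
    n = T / N  # episodes in which the agreed feature value appears
    return math.exp(-(lam**k * (1.0 - p) + lam * k * p) * n) + 1.0 - math.exp(-lam * k * p * n)

def best_lambda(k: int, p: float, T: float, N: float, grid: int = 2000, refine: int = 100) -> float:
    """AI's best lambda: coarse search over log10(lambda) in [-9, 0], then ternary search."""
    f = lambda x: p_win_coordinated(10.0**x, k, p, T, N)
    xs = [-9.0 + 9.0 * i / (grid - 1) for i in range(grid)]
    i = min(range(grid), key=lambda j: f(xs[j]))
    lo, hi = xs[max(i - 1, 0)], xs[min(i + 1, grid - 1)]
    for _ in range(refine):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 10.0 ** ((lo + hi) / 2.0)

# Illustrative values (assumptions): two AIs, 1% auditing, 1M episodes, 10 feature values.
k, p, T, N = 2, 0.01, 1_000_000, 10
lam = best_lambda(k, p, T, N)
residual = (lam ** (k - 1) * (1.0 - p) / p + 1.0) - math.exp(lam**k * (1.0 - p) * T / N)
print(lam, p_win_coordinated(lam, k, p, T, N), residual)
```

The printed `residual` measures how closely the returned $\lambda$ satisfies the first-order condition above; with $k = 1$ the same routine should recover the single-AI formula with $T$ replaced by $T/N$.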
Takeaways: