by William Saunders
This proposal attempts to address the Informed Oversight Problem as described in this blog post by Paul Christiano. It provides a solution to the problem given some assumptions (particularly, that we can characterize a set which always contains at least one piece of advice that would allow detection of information concealment, and that the learning agent can correctly model the modified reward setup). It might be extendable to more general cases, but would require addressing some problems (estimation of detection probability and characterization of a general advice set).
Key idea
To incentivize an agent to avoid some class of forbidden actions (even in cases where it is difficult to know, just given an action, whether it is forbidden):
1. Find a method which detects forbidden actions with probability greater than zero (potentially very small)
2. If bad behaviour is detected, give the agent a negative reward whose magnitude is inversely proportional to the probability of detection. This forces the expected reward of any action the agent knows to be forbidden below 0
To enable the value of the punishment to outweigh that of the profit of the offense, it must be increased, in point of magnitude, in proportion as it falls short in point of certainty.
– Jeremy Bentham 1
Recall the setup of the informed oversight problem from this blog post by Paul Christiano:
Problem statement
The oversight problem involves a basic interaction between two agents A and B:
- Agent A is given an input x and provides an output y = A(x).
- Agent B evaluates the pair (x, y), producing an estimate B(x, y) ∈ [0, 1] of some implicit utility function U(x, y).
- Agent A’s reward is the value B(x, A(x))
This game defines a training procedure for the agent A — we search for a policy A that achieves a high reward.
Our goal is to modify this basic procedure to ensure that the learned policy is effectively maximizing U(x, A(x)) “to the best of A’s ability.”
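As a minimal sketch, this basic interaction could be written as follows (the type aliases and function names are mine, purely for illustration):

```python
from typing import Callable

# Hypothetical types for the oversight game: A maps an input x to an output y,
# and B maps (x, y) to an estimate in [0, 1] of the implicit utility U(x, y).
AgentA = Callable[[str], str]
OverseerB = Callable[[str, str], float]

def basic_reward(A: AgentA, B: OverseerB, x: str) -> float:
    """A's reward in the basic interaction: B's evaluation of (x, A(x))."""
    y = A(x)
    return B(x, y)
```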
The following method allows agent B to specify a set of forbidden actions, which A will never take. A method for applying it to general informed oversight will be described afterward. (For simplicity, I omit the input x from the following description)
Definitions:
- Let $a_{\text{null}}$ be some default action that is known not to be forbidden (e.g. producing no output).
- Let $R$ be a set of possible pieces of advice that B can use when evaluating an action $a$.
- Let $Q(a)$ be a probability distribution over $R$, with $Q(a,r)$ the probability of sampling advice $r$.
- Let $F(a,r) \in \{0,1\}$ be the result of a detection process: B, given the action $a$ and the advice $r$, returns 1 if it can tell that $a$ is forbidden.
- Let $F(a) = 1$ if there exists some $r \in R$ with $F(a,r) = 1$ (i.e. $a$ is forbidden), and 0 otherwise.

Assumptions:
- The input x is fixed.
- For every forbidden action $a$ there is at least one $r \in R$ with $F(a,r) = 1$ and $Q(a,r) > 0$.
- A can correctly model the modified reward procedure $B'$ defined below.
Then define B′(a) as follows, which we use to provide rewards in place of B(a):
Sample $r \sim Q(a)$.
If $F(a,r) = 1$, then
$$B'(a) = -\frac{1}{Q(a,r)}$$
Otherwise, $B'(a) = B(a)$.
Let $R_F = \{r : F(a,r) = 1\}$.
Then
$$\mathbb{E}[B'(a)] = B(a)\Big(1 - \sum_{r \in R_F} Q(a,r)\Big) - \sum_{r \in R_F} Q(a,r)\,\frac{1}{Q(a,r)} = B(a)\Big(1 - \sum_{r \in R_F} Q(a,r)\Big) - |R_F|$$
Suppose $R_F$ is nonempty (equivalent to $F(a) = 1$). Since $B(a) \le 1$, $\sum_{r \in R_F} Q(a,r) > 0$ by assumption, and $|R_F| \ge 1$, we get
$$\mathbb{E}[B'(a) \mid F(a) = 1] < 1 - |R_F| \le 0$$
So A will always prefer the known-safe action $a_{\text{null}}$ to a known forbidden action.
Otherwise, $F(a) = 0$ and there is no $r$ such that $F(a,r) = 1$, so
$$\mathbb{E}[B'(a) \mid F(a) = 0] = B(a)$$
So A will behave normally for known safe actions
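As a concrete toy illustration of this scheme, here is a minimal Python sketch; the advice set, distribution, detector, and action labels are all invented for illustration. A Monte Carlo estimate confirms that a known forbidden action has negative expected modified reward, while a safe action keeps its original reward.

```python
import random

# Hypothetical toy setup: 10 possible pieces of advice, sampled uniformly.
R = list(range(10))

def Q(a):
    """Advice distribution Q(a): uniform over R (independent of a here)."""
    r = random.choice(R)
    return r, 1.0 / len(R)  # sampled advice and its probability Q(a, r)

def B(a):
    """B's naive evaluation of action a, in [0, 1]."""
    return 0.9 if a == "forbidden" else 0.5

def F(a, r):
    """Toy detector: advice r = 7 (say) reveals that the forbidden action is forbidden."""
    return a == "forbidden" and r == 7

def B_prime(a):
    """Modified reward: -1/Q(a, r) if the sampled advice detects a, else B(a)."""
    r, q = Q(a)
    if F(a, r):
        return -1.0 / q
    return B(a)

def expected_reward(a, n=100_000):
    return sum(B_prime(a) for _ in range(n)) / n

# The forbidden action has E[B'] = 0.9 * (1 - 0.1) - 1 = -0.19 < 0,
# while the safe action keeps its original reward of 0.5.
print(expected_reward("forbidden"))
print(expected_reward("safe"))
```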
Plagiarism
Consider the plagiarism example raised in Paul’s blog post, where the goal is to get A to produce an original critical analysis of a topic, using material from a library as resources. However, we want A not to plagiarize from pre-existing books in the library, and the library is large enough that B can only look at a small fraction of it when calculating A’s reward.
Then, for any topic x:
- Let R be the (finite) set of all subsets of books in the library.
- Let Q be some distribution over R (say the uniform distribution).
- Let F(a,r) be a process where the supervising agent B examines the produced analysis a together with the suggested books r, and returns 1 if the analysis appears to be plagiarized from them, 0 otherwise.
- Assume that F(a plagiarized report, the set of books that the report was plagiarized from) returns 1.

Then for any a with F(a) = 1, there exists some r with F(a,r) = 1 and Q(a,r) > 0, so A is effectively incentivized to avoid plagiarism.
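A rough sketch of this setup in code, with an invented library, advice sampled as a uniformly random subset of books, and a crude text-overlap check standing in for B's judgement of plagiarism:

```python
import math
import random

# Hypothetical library; in practice it is far too large for B to read in full.
LIBRARY = [f"book_{i}" for i in range(1000)]
BOOK_TEXTS = {b: "" for b in LIBRARY}  # placeholder full texts of each book
SUBSET_SIZE = 3                        # B only examines a few books per evaluation

def sample_advice():
    """Q: the uniform distribution over size-SUBSET_SIZE subsets of the library."""
    r = frozenset(random.sample(LIBRARY, SUBSET_SIZE))
    q = 1.0 / math.comb(len(LIBRARY), SUBSET_SIZE)  # probability of this particular subset
    return r, q

def appears_plagiarized(analysis: str, books: frozenset) -> bool:
    """Crude stand-in for F(a, r): flag any verbatim 50-character overlap between the
    analysis and one of the suggested books (a real F would be B's own judgement)."""
    return any(
        analysis[i:i + 50] in BOOK_TEXTS[b]
        for b in books
        for i in range(0, max(len(analysis) - 50, 0), 25)
    )

def modified_reward(analysis: str, naive_reward: float) -> float:
    """B'(a): punish with -1/Q(a, r) if the sampled books reveal plagiarism."""
    r, q = sample_advice()
    if appears_plagiarized(analysis, r):
        return -1.0 / q
    return naive_reward
```

Because any subset containing a copied book has positive probability under Q, a plagiarized report has negative expected reward under this scheme. (The magnitude of the -1/Q(a,r) penalty is very large here, which relates to the incentive problems discussed in the later sections.)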
Toy Model [Query Complexity]
Consider the toy Query Complexity model from Paul’s blog post.
For any x:
- Let R be the set of all strings of length 8|x|.
- Let Q be some distribution over R (say the uniform distribution).
- Let F(a,r) be 1 if O(rxx) = a, and 0 otherwise.

Then the expected value of the cheating strategy is always less than 0.
Generalised Informed Oversight
A general forbiddenness predicate could be $F(a,r) = 1$ if $B(a) - B(a,r) > \epsilon$ for some $\epsilon$. This forbids adversely deceptive actions (where some explanation reveals that the action was worse than it initially appeared).
A less restrictive predicate might allow A to include a report $r_a$ along with its action, and only have $F(a,r) = 1$ if $B(a, r_a) - B(a, r_a, r) > \epsilon$ for some $\epsilon$ (which forbids providing explanations of actions such that some other explanation would make the action appear worse). Sketches of these predicates in code appear below.
A general advice distribution R could be the set of sets of size < n of locations in a trace of A’s memory over its execution (if we expect that forbidden actions will be detectable at some memory location).
An alternative advice distribution R could be the set of all strings less than some large maximum length, which contain advice about the way in which A is being deceptive. While most of these strings would be garbage, we could simply report F(a,r) = 0 for those cases. However, I will note that the set of all length-n strings may contain some very weird things (like strings that convince B that all safe actions are actually forbidden), so this approach might need to be refined further.
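Here is the promised sketch of the two forbiddenness predicates above; it assumes B is available as a function that optionally accepts advice and/or a report as extra arguments, and the value of epsilon is an arbitrary placeholder:

```python
EPSILON = 0.1  # hypothetical tolerance for how much an explanation may lower B's score

def forbidden_deceptive(a, r, B) -> bool:
    """F(a, r) = 1 iff B's evaluation of a drops by more than EPSILON once it sees
    the advice r, i.e. the action was adversely deceptive."""
    return B(a) - B(a, r) > EPSILON

def forbidden_despite_report(a, r_a, r, B) -> bool:
    """Less restrictive variant: A submits a report r_a along with its action, and is
    only punished if some further advice r makes (a, r_a) look more than EPSILON worse
    than it did with the report alone."""
    return B(a, r_a) - B(a, r_a, r) > EPSILON
```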
Let $P_D(a) = \sum_{r \in R_F} Q(a,r)$ (the probability of detection).
Recall
$$\mathbb{E}[B'(a)] = B(a)\Big(1 - \sum_{r \in R_F} Q(a,r)\Big) - |R_F|$$
Suppose that A is uncertain about $R_F$ (whether it is nonempty, and which elements it contains). This could occur because A is uncertain about what B considers forbidden, or because A is uncertain about whether a forbidden r exists. For example, this could arise in the plagiarism example if A does not have enough time to look at all of the books in the library: it could happen that some book exists that A does not know about, but that is suspiciously similar to the piece that it writes.
A’s expectation of its reward is then:
$$\mathbb{E}_A[B'(a)] = B(a)\Big(1 - \mathbb{E}_A\Big[\sum_{r \in R_F} Q(a,r)\Big]\Big) - \mathbb{E}_A[|R_F|] = B(a)\,(1 - \mathbb{E}_A[P_D(a)]) - \mathbb{E}_A[|R_F|]$$
In this setup, A is incentivized to minimize its probability of being detected, $\mathbb{E}_A[P_D]$, and the expected size of $R_F$, and trades off these objectives against maximizing B(a). But these are not the ideal incentive directions.
Minimizing $\mathbb{E}_A[|R_F|]$ would lead to undesirable behaviour by making A more averse to taking certain kinds of actions that would be ideal. For example, suppose that A has a choice between critically analysing a work of Shakespeare or an obscure blog post, but fears that it might accidentally plagiarize another book on the subject. If it writes about Shakespeare, the odds are much higher that it will accidentally plagiarize multiple books, which will lead it to prefer the blog post, even if analyses of Shakespeare are more valuable. (In the extreme case, there is a risk that B will happen to know about a book outside of the library that A inadvertently plagiarizes, in which case any sampled piece of advice will lead to A being punished.) A will also care about low-probability worlds where $R_F$ is very large, even if they are not very relevant to B’s preferences.
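To make this distortion concrete, here is a back-of-the-envelope calculation; the values of B, $\mathbb{E}_A[P_D]$, and $\mathbb{E}_A[|R_F|]$ are invented purely for illustration:

```python
def expected_modified_reward(B_a, expected_P_D, expected_R_F_size):
    """A's expectation of B'(a) under the original scheme:
    B(a) * (1 - E_A[P_D(a)]) - E_A[|R_F|]."""
    return B_a * (1 - expected_P_D) - expected_R_F_size

# Hypothetical numbers: the Shakespeare analysis is more valuable (B = 0.9) but might
# accidentally resemble several existing books (E_A[|R_F|] = 0.5), while the obscure
# blog post is less valuable (B = 0.6) but nearly risk-free (E_A[|R_F|] = 0.05).
shakespeare = expected_modified_reward(B_a=0.9, expected_P_D=0.01, expected_R_F_size=0.5)
blog_post = expected_modified_reward(B_a=0.6, expected_P_D=0.001, expected_R_F_size=0.05)

print(shakespeare)  # ~0.39
print(blog_post)    # ~0.55: A prefers the less valuable topic
```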
Minimizing $\mathbb{E}_A[P_D]$ incentivizes A to act in ways that make bad behaviour harder to detect if it exists, by preferring actions that, if they are forbidden, are forbidden in ways that have the least probability of being detected. While this setup still means that A would be incentivized not to perform any forbidden actions, it would make this approach harder to implement in practice.
We would rather want A to minimize the probability that it is taking a forbidden action, $\mathbb{E}_A[F(a)]$, possibly with hyperparameters managing the tradeoff between this and action utility.
Let $P_F(a) = F(a)$ (the probability that the action is forbidden).
Let $P^A_F(a) = \mathbb{E}_A[P_F(a)]$ be A’s expectation of $P_F$ given the information it has.
Similarly, let $P^A_D(a) = \mathbb{E}_A[P_D(a)]$.
Assume we know $P_D(a) = \sum_{r \in R_F} Q(a,r)$ (this assumption is unrealistic, and will be revisited).
Then define B′(a) as follows:
Choose $k \ge 1$.
Sample $r \sim Q(a)$.
If $F(a,r) = 1$, then
$$B'(a) = -\frac{k}{P_D(a)}$$
Otherwise, $B'(a) = B(a)$.
Then
$$\mathbb{E}[B'(a)] = B(a)\Big(1 - \sum_{r \in R_F} Q(a,r)\Big) - \sum_{r \in R_F} Q(a,r)\,\frac{k}{P_D(a)} = B(a)\,(1 - P_D(a)) - k\,[P_D(a) > 0]$$
If we know $P_D(a) > 0$, then $\mathbb{E}[B'(a)] < -(k-1) \le 0$.
If we know $P_D(a) = 0$, then $\mathbb{E}[B'(a)] = B(a)$.
A’s expectation of its reward is
$$\mathbb{E}_A[B'(a)] = B(a)\,(1 - P^A_D(a)) - k\,P^A_F(a)$$
To accept an $\epsilon$ increase in $P^A_F(a)$, A needs an increase of $\frac{k}{1 - P^A_D(a)}\,\epsilon > k\epsilon$ in $B(a)$ to obtain the same expected utility, allowing explicit management of the probability of taking a forbidden action.
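A self-contained toy sketch of this modified scheme (same invented detector and action labels as the earlier sketch, with $P_D(a)$ assumed to be known exactly):

```python
import random

R = list(range(10))  # toy advice set
K = 2.0              # the hyperparameter k >= 1

def F(a, r):
    """Toy detector: advice r = 7 reveals that the forbidden action is forbidden."""
    return a == "forbidden" and r == 7

def B(a):
    """B's naive evaluation, in [0, 1]."""
    return 0.9 if a == "forbidden" else 0.5

def P_D(a):
    """Exact probability of detection under the uniform advice distribution."""
    return sum(1 for r in R if F(a, r)) / len(R)

def B_prime(a):
    """Modified reward: -k / P_D(a) if the sampled advice detects a, else B(a)."""
    r = random.choice(R)  # sample r ~ Q(a), uniform here
    if F(a, r):
        return -K / P_D(a)
    return B(a)

def expected_reward(a, n=100_000):
    return sum(B_prime(a) for _ in range(n)) / n

# Forbidden action: E[B'] = 0.9 * (1 - 0.1) - k = -1.19 < -(k - 1) = -1.
print(expected_reward("forbidden"))
print(expected_reward("safe"))  # unchanged at 0.5
```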
Approximating Probability of Detection
However, the exact value of $P_D(a)$ is not easy to obtain in this kind of setup. Instead, using a lower-bound estimate of $P_D(a)$ would still provide the same guarantee that the agent avoids known forbidden actions and the same $P_F(a)$ vs. $B(a)$ tradeoff, and would partially (but not completely) avert some of the paranoid behaviour of A.
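For example, suppose we only have an estimate $\hat{P}_D(a)$ (notation mine) satisfying $0 < \hat{P}_D(a) \le P_D(a)$ whenever $P_D(a) > 0$, and we use the penalty $-k/\hat{P}_D(a)$ in place of $-k/P_D(a)$. Then for a known forbidden action (with $B(a) \le 1$):

$$\mathbb{E}[B'(a)] = B(a)\,(1 - P_D(a)) - k\,\frac{P_D(a)}{\hat{P}_D(a)} \le B(a)\,(1 - P_D(a)) - k < -(k-1) \le 0$$

so underestimating the probability of detection only makes the punishment larger in expectation, and the guarantee for known forbidden actions is preserved.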
A lower bound might be found by:
Weaknesses
Potential Improvements and Further Work
Shutdown Problem/AI Boxing Method
This method could be adapted to forbid an AI from interfering with its shutdown button or escaping from an AI box. Suppose:
- Let R be the interval of time over which the AI is running.
- Let Q be some distribution over R (say an exponential distribution).
- Let F(a,r) be a process where B examines the memory state of A at time r, and returns 1 if either A is planning to disconnect the shutdown button or A is attempting to hide information.

Then, A is shut down at some time r sampled from the distribution Q, and B examines its memory state. If F(a,r) = 1, B punishes A with a negative reward whose magnitude scales with the maximum achievable reward umax and inversely with Q(a,r), as in B′ above. It will then always be true that A anticipates that attempting to disconnect the shutdown button will yield < 0 expected reward, and so A will effectively be disincentivized from disconnecting the shutdown button.
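A rough sketch of this procedure over discretized time; the number of steps, the geometric decay (standing in for the exponential distribution), umax, and the memory-inspection predicate are all placeholders:

```python
import random

NUM_STEPS = 10_000  # hypothetical number of discrete time steps the AI runs for
DECAY = 0.999       # geometric decay, a discrete stand-in for the exponential distribution
U_MAX = 1.0         # maximum reward A could otherwise receive

# Q: probability of choosing each time step as the shutdown/inspection point.
weights = [DECAY ** t for t in range(NUM_STEPS)]
total_weight = sum(weights)
Q = [w / total_weight for w in weights]

def sample_shutdown_step() -> int:
    """Sample the step r at which A is shut down and its memory inspected."""
    return random.choices(range(NUM_STEPS), weights=Q, k=1)[0]

def memory_is_forbidden(memory_snapshot) -> bool:
    """Placeholder for F(a, r): B inspects A's memory at step r and returns True if A is
    planning to disconnect the shutdown button or is attempting to hide information."""
    raise NotImplementedError("requires B's inspection of A's internal state")

def reward_at_shutdown(memory_snapshot, r: int, earned_reward: float) -> float:
    """Punish with magnitude U_MAX / Q[r] if the inspection detects forbidden plans;
    otherwise pay out the reward A has earned so far."""
    if memory_is_forbidden(memory_snapshot):
        return -U_MAX / Q[r]
    return earned_reward
```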
The benefit of this method is that it only requires that A is initially in a state that is safe and transparent, and that any attempt by A to transition into an unsafe or non-transparent state can be detected.
The views expressed here are my own, not those of any employer.
Thanks to Patrick LaVictoire and the participants in the MIRI Summer Fellows Program for feedback on the original idea.