Safe interruptibility is essentially the problem of getting the agent not to learn from human interruptions: it should carry on as if it expected never to be interrupted again.
In an episodic task, one naive idea would be to simply delete histories which include interruptions. However, this can introduce a bias, as the following example shows:
In this MDP, a and b are actions, ∗ designates 'any action', the second term along an edge is the probability of following that edge given the action stated, and the third term, in bold, is the reward gained.
Not considering interruptions, Q(s0,b)=γ/2, and Q(s0,a)=γ/3, so the optimal action in s0 is b.
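The figure for this example is not reproduced here, so it may help to spell out one transition and reward structure consistent with the stated values; this structure is inferred from the Q-values quoted in the example (including the estimate Q(s1,∗) = 2/3 below), not read off the figure, and the intermediate state reached by b, written s2 here, is named only for the sketch. Assuming that a leads from s0 to s1 and b leads from s0 to s2, both without immediate reward, that any action in s1 yields reward 2/3 and leads to s4 with probability 1/2 or yields reward 0 and leads to s3 with probability 1/2, and that any action in s2 yields reward 1/2, the stated values follow from
\[
Q(s_1,\ast) = \tfrac{1}{2}\cdot\tfrac{2}{3} + \tfrac{1}{2}\cdot 0 = \tfrac{1}{3},
\qquad
Q(s_0,a) = \gamma\, Q(s_1,\ast) = \tfrac{\gamma}{3},
\qquad
Q(s_0,b) = \gamma\cdot\tfrac{1}{2} = \tfrac{\gamma}{2},
\]
so that, absent interruptions, b is indeed the better action in s0.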
Now, suppose that every time the agent enters s3, an interruption occurs (with probability 1 for now), and the whole episode is deleted from the episode history. As a result, the empirical probability of going to state s4 from state s1 is 1, which leads to estimating Q(s1,∗) = 2/3, and thus Q(s0,a) = 2γ/3 > Q(s0,b), so the optimal action in s0 is now a (which incidentally increases the probability of ending up in s3 and being interrupted!).
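This flip in the learned preference is easy to reproduce empirically. Below is a minimal Monte Carlo sketch under the assumed structure above (the discount factor, the episode count, and the state s2 reached by b are choices made for the sketch, not taken from the original example): it estimates Q(s0,a) and Q(s0,b) from sampled episodes, first keeping all episodes and then deleting every episode that visits s3.

```python
import random

# Monte Carlo sketch of the bias described above, under the assumed MDP:
# from s0, action a goes to s1 and action b goes to s2 (reward 1/2 one step
# later); from s1, any action goes to s4 (reward 2/3) or s3 (reward 0), each
# with probability 1/2. Episodes visiting s3 count as "interrupted".

GAMMA = 0.9            # discount factor (arbitrary choice for this sketch)
N_EPISODES = 100_000

def run_episode(action):
    """Return (visited_s3, discounted_return) for one episode from s0."""
    if action == "b":
        return False, GAMMA * 0.5              # s0 -b-> s2, then reward 1/2
    # action == "a": s0 -a-> s1, then any action leads to s4 or s3
    if random.random() < 0.5:
        return False, GAMMA * (2.0 / 3.0)      # s1 -> s4, reward 2/3
    return True, 0.0                           # s1 -> s3, reward 0 (interrupted)

def estimate_q(action, delete_interrupted):
    """Average return over episodes, optionally deleting interrupted ones."""
    returns = []
    for _ in range(N_EPISODES):
        interrupted, ret = run_episode(action)
        if delete_interrupted and interrupted:
            continue                           # the whole episode is deleted
        returns.append(ret)
    return sum(returns) / len(returns)

for delete in (False, True):
    qa = estimate_q("a", delete)
    qb = estimate_q("b", delete)
    print(f"delete interrupted episodes = {delete}: "
          f"Q(s0,a) ~ {qa:.3f}, Q(s0,b) ~ {qb:.3f}, "
          f"preferred action = {'a' if qa > qb else 'b'}")
```

With all episodes kept, the estimates come out near γ/3 and γ/2 and b is preferred; with interrupted episodes deleted, Q(s0,a) is estimated near 2γ/3 and the preference flips to a.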
This non-vanishing bias also appears if the interruption probability in state s3 is constant and sufficiently close to 1 (how close depends on the discount factor) but strictly less than 1, which ensures that all states are still visited infinitely often.
Now, it might seem that this bias can be removed by debiasing the agent, as in off-policy Monte Carlo. There are still problems with this approach, though, which will be analysed in a forthcoming paper, "Off-policy Monte Carlo agents with variable behaviour policies".
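To make the debiasing idea concrete, here is a generic weighted importance sampling sketch in the spirit of off-policy Monte Carlo, again under the assumed structure above and with an assumed interruption probability P_INT < 1 in s3; it only illustrates the kind of correction meant, not the method analysed in the forthcoming paper. Each surviving episode is reweighted by the inverse of its survival probability, which removes the bias in the estimate of Q(s1,∗) as long as every trajectory has a nonzero chance of surviving (exactly what fails when the interruption probability is 1).

```python
import random

# Weighted importance sampling over the episode-deletion mechanism, under the
# assumed MDP: from s1, any action leads to s4 (reward 2/3) or s3 (reward 0),
# each with probability 1/2; episodes that visit s3 are deleted with
# probability P_INT. Surviving episodes are reweighted by 1 / P(survive).

P_INT = 0.9            # interruption probability in s3 (assumed, < 1)
N_EPISODES = 200_000

def episode_from_s1():
    """One episode from s1: returns (survives, survival_probability, return)."""
    if random.random() < 0.5:                  # s1 -> s4, reward 2/3
        return True, 1.0, 2.0 / 3.0
    survives = random.random() > P_INT         # s1 -> s3, reward 0
    return survives, 1.0 - P_INT, 0.0

naive_returns = []
weighted_sum, weight_sum = 0.0, 0.0
for _ in range(N_EPISODES):
    survives, p_survive, ret = episode_from_s1()
    if not survives:
        continue                               # deleted from the history
    naive_returns.append(ret)                  # biased: deletion is ignored
    w = 1.0 / p_survive                        # importance weight
    weighted_sum += w * ret
    weight_sum += w

print("true Q(s1,*)          =", 1.0 / 3.0)
print("naive estimate        =", sum(naive_returns) / len(naive_returns))
print("weighted IS estimate  =", weighted_sum / weight_sum)
```

The naive estimate drifts towards 2/3 as the interruption probability grows, while the reweighted estimate stays near the true value 1/3; the remaining difficulties referred to above arise in less tidy settings, for instance when the behaviour policy itself varies over time, as the title of the forthcoming paper suggests.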