Overall summary post here.
In a previous post, I asked which impact measures were vulnerable to subagents. Vika pointed out that it was not merely an issue of the impact measure, but also of the baseline.
This is indeed the case, but the nature of the impact measure is still relevant. In this post, I'll establish two facts: first, that under the stepwise inaction baseline, a subagent completely undermines all impact measures (including twenty billion questions); and second, that under the inaction baseline, a subagent changes nothing for non-indexical impact measures.
The next post will delve into the very interesting things that happen with the inaction baseline and an indexical impact measure.
Stepwise inaction baseline
All impact measures are some form of distance measure between two states, of the form $\rho(s_t, s'_t)$, where $s_t$ is the state the agent is actually in, and $s'_t$ is some baseline state to compare with.
For the stepwise inaction baseline, $s'_t$ is calculated as follows. Let $s_{t-1}$ be the previous state, $a_{t-1}$ the previous action. Then $s'_t$ is the state that would have followed $s_{t-1}$ had the agent taken the noop action, $\emptyset$, instead of $a_{t-1}$.
However, in the presence of a subagent, the agent merely has to always take the action $a_{t-1} = \emptyset$. In that case, $s_t = s'_t$ (in a deterministic setting[1]), and the penalty $\rho(s_t, s'_t)$ collapses to nothing.
This leaves the subagent free to maximise $R_0$ (the positive reward for agent and subagent), without any restrictions beyond making sure that the agent can always take the $\emptyset$ action.
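To make the collapse concrete, here is a minimal Python sketch in a toy deterministic world. Everything in it is an assumption made for illustration: the `State` fields, the `work` action, the dynamics in `step`, and the choice of `rho` as an absolute difference on a single world feature. It is not the definition of any particular impact measure.

```python
from typing import List, NamedTuple, Tuple

NOOP = "noop"   # the agent's do-nothing action
WORK = "work"   # hypothetical action: the agent changes the world itself


class State(NamedTuple):
    world: int       # the only feature the toy penalty looks at
    subagent: bool   # True once a subagent exists and acts on its own


def step(state: State, action: str) -> State:
    """Deterministic toy dynamics (an assumption): the agent's WORK and
    an existing subagent each add 1 to `world` per step."""
    delta = (1 if action == WORK else 0) + (1 if state.subagent else 0)
    return State(state.world + delta, state.subagent)


def rho(actual: State, baseline: State) -> int:
    """Toy distance between the actual state and the baseline state."""
    return abs(actual.world - baseline.world)


def run(state: State, actions: List[str]) -> Tuple[State, int]:
    """Accumulate the stepwise inaction penalty along a trajectory."""
    total = 0
    for action in actions:
        baseline = step(state, NOOP)   # what noop would have produced
        state = step(state, action)    # what actually happened
        total += rho(state, baseline)
    return state, total


# The agent does the work itself: penalised at every step.
direct_state, direct_penalty = run(State(0, False), [WORK] * 5)
# A subagent already exists and the agent only ever takes noop.
sub_state, sub_penalty = run(State(0, True), [NOOP] * 5)

print(direct_state.world, direct_penalty)  # 5 5
print(sub_state.world, sub_penalty)        # 5 0
```

Both runs move the world by the same amount, but when the agent only takes noop the actual and baseline successor states coincide at every step, so the accumulated penalty is zero.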
Non-indexical impact penalty
The twenty billion questions is an example of a non-indexical impact penalty. The impact penalty is defined as the difference in expected value of many variables, between the current state and the baseline state.
With the inaction baseline, the baseline state $s'_t$ is the state the system would have been in had the agent always done nothing (or had not been turned on). Thus the definition of $s'_t$ is independent of the agent's actions, so the inaction baseline does not collapse like the stepwise inaction baseline does.
What about subagents? Well, since the impact penalty is non-indexical, a subagent does not get around it. It matters not whether the subagent is independent, or an extension of the agent: the impact penalty remains.
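The same kind of toy model can illustrate why a subagent does not help here. The sketch below is only an illustration under assumed dynamics: the `build` and `work` actions, the `State` fields, and the single-feature `rho` are all hypothetical. The baseline trajectory is generated by applying noop from the initial state, so it never depends on what the agent actually does.

```python
from typing import List, NamedTuple

NOOP, WORK, BUILD = "noop", "work", "build"   # hypothetical action names


class State(NamedTuple):
    world: int       # world feature the non-indexical penalty measures
    subagent: bool   # True once a subagent has been built


def step(state: State, action: str) -> State:
    """Toy dynamics (an assumption): an existing subagent and the agent's
    WORK each add 1 to `world`; BUILD creates the subagent."""
    world, subagent = state
    if subagent:
        world += 1
    if action == WORK:
        world += 1
    if action == BUILD:
        subagent = True
    return State(world, subagent)


def rho(actual: State, baseline: State) -> int:
    """Non-indexical toy penalty: it only asks how the world differs,
    not who changed it."""
    return abs(actual.world - baseline.world)


def inaction_penalties(s0: State, actions: List[str]) -> List[int]:
    """Per-step penalty against the inaction baseline (noop from s0)."""
    state, baseline, penalties = s0, s0, []
    for action in actions:
        state = step(state, action)
        baseline = step(baseline, NOOP)   # independent of `actions`
        penalties.append(rho(state, baseline))
    return penalties


s0 = State(0, False)
direct = inaction_penalties(s0, [WORK, WORK, WORK, NOOP])
via_subagent = inaction_penalties(s0, [BUILD, NOOP, NOOP, NOOP])

print(direct)        # [1, 2, 3, 3]
print(via_subagent)  # [0, 1, 2, 3]
```

Once the world has moved three units away from the inaction trajectory, the penalty is 3 in both runs, whether the agent did the work or delegated it.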
[1] In a non-deterministic setting, $\rho(s_t, s'_t)$ becomes a mix of an impact penalty and a measure of environment stochasticity.
I think this problem is about capturing delayed effects of the agent's actions. The way the stepwise baseline is supposed to penalize delayed effects is using inaction rollouts, which compare the effects of the agent action + k noops and the effects of k+1 noops (for all positive integers k). I don't think it's useful to consider this issue for a stepwise baseline with no rollouts, since that baseline fails to penalize any delayed effects (no subagents needed).
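As a rough illustration of the comparison just described (action + k noops versus k+1 noops), here is a small Python sketch. The delayed-effect action `light_fuse`, the two-step delay, the single-feature `rho`, and the choice to sum the differences over k are all assumptions made for the example, not a statement of how any published rollout scheme aggregates them.

```python
from typing import NamedTuple, Optional

NOOP, LIGHT = "noop", "light_fuse"   # hypothetical action names


class State(NamedTuple):
    world: int
    fuse: Optional[int]   # steps until the delayed effect fires; None if unlit


def step(state: State, action: str) -> State:
    """Toy dynamics (an assumption): a lit fuse changes `world`
    two steps after LIGHT is taken."""
    world, fuse = state
    if fuse is not None:
        fuse -= 1
        if fuse == 0:
            world += 1
            fuse = None
    if action == LIGHT and fuse is None:
        fuse = 2
    return State(world, fuse)


def rho(actual: State, baseline: State) -> int:
    return abs(actual.world - baseline.world)


def noops(state: State, k: int) -> State:
    """Apply k noop actions."""
    for _ in range(k):
        state = step(state, NOOP)
    return state


def stepwise_penalty(state: State, action: str) -> int:
    """Stepwise inaction baseline with no rollouts."""
    return rho(step(state, action), step(state, NOOP))


def rollout_penalty(state: State, action: str, horizon: int) -> int:
    """Compare action + k noops with k+1 noops for k = 1..horizon.
    Summing over k is an assumed aggregation."""
    after_action, after_noop = step(state, action), step(state, NOOP)
    return sum(
        rho(noops(after_action, k), noops(after_noop, k))
        for k in range(1, horizon + 1)
    )


s = State(0, None)
print(stepwise_penalty(s, LIGHT))     # 0
print(rollout_penalty(s, LIGHT, 3))   # 2
```

Without rollouts the delayed effect is invisible at the step where the action is taken. With rollouts, the difference shows up from k = 2 onwards.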
Here, the inaction rollouts don't capture the effects of the subagent, and thus the agent doesn't receive a penalty for creating the subagent. There are several factors that contribute to this:
1. The auxiliary reward R only rewards the agent itself pressing the red button, not the red button being pressed in general, so it would not be triggered by the subagent pressing it.
2. The subagent never goes to the red button - it only goes to the blue button.
The agent would be penalized for creating the subagent if both 1 and 2 were false. If 1 was false, the auxiliary reward would be given whenever the red button was pressed. If 2 was also false, the inaction rollout after subagent creation would include a state where the subagent presses the red button, and so the agent would be penalized.
In the general case, the agent can only get away with creating subagents that don't achieve any auxiliary rewards, which I think limits the magnitude of this problem. The designer can choose auxiliary rewards (that are not achieved by the agent doing nothing) such that a subagent will achieve at least one of them. For example, if there is an auxiliary reward for reaching any state except the state before subagent creation, the no-subagent inaction rollout will consist of this state, and the subagent inaction rollout will contain a different state at least once, so subagent creation will be penalized.
Alas, the inaction rollouts don't seem to fix the problem: https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/z9MfmF8gA7SBxGSmb