0. Introduction: why yet another post about subagents?

I’ve recently been writing a sequence on how subagents can undermine impact penalties such as attainable utility preservation. I’m not happy with that sequence; it’s messy and without examples (apart from its first post), people didn’t understand it, and it suffers from the fact that I discovered key ideas as I went along.

So I’ve combined everything there into a single post, explained with examples and an abundance of pictures. Hopefully an over- rather than an under-abundance of pictures. Of the original sequence, I've only kept the mathematical results of this post and the initial example post which has a clearer example of "high power" for a subagent.

This post here is laid out in a way that makes logical sense, but might not be the clearest for people unfamiliar with the area. For those people, I recommend skipping section 2 initially, and returning to it later.

But, whatever you do, make sure you glance at 6.1 and 6.2 before leaving.

1. The world

Our fearless agent A moves around in a gridworld:

Each turn, A can move one square horizontally or vertically. It can also manipulate objects in the eight squares around it, allowing it to, not incidentally, assemble the three pieces to its west into a subagent SA. The robot can also do the noop action, ∅, which does nothing, and it can speak. The subagent, when assembled, has the same action set available.

Its positive reward, the one it wants to increase, is R_0. To get this reward, a robot needs to move onto the blue button in the east; R_0 gives a reward the first time this happens (and nothing before and after). Future rewards are discounted by a factor γ.

Just to the west of the blue button is a one-way door. Robots can move east through it, but cannot move west through it:

1.1 The impact reward

The impact penalty is supposed to ensure that A does not make too many changes in the world, and keeps it similar, in some senses, to a specific baseline world.

It is defined by the red button[1]. There is a reward function R that triggers a reward the first time any robot moves onto the red button (and nothing before and after).

This will be used to define an impact penalty D. A’s total goal is to maximise its positive reward while minimising D. So, for some positive weight ω, it wants to optimise its positive reward minus ω times D.

1.2 The impact measure

The impact measures are all similar: they compute a distance metric between the current state s_t and some baseline state s'_t.

The distance metric here is “how hard would it be for the agent A to maximise the reward R in state s_t, as opposed to in state s'_t?” This is the attainable utility preservation impact penalty (with a very simple set of rewards, namely just R).

To formalise “how hard would it be...”, we can use the value of R, labelled V_R, in the state s_t. This is defined to be the expected value of R, conditional on A following the R-maximising policy. To remind us that this is defined in terms of A’s policy, re-write this as V_R^A.

Then the impact penalty D is a measure of the difference between V_R^A(s_t) and V_R^A(s'_t). I won’t yet define it fully, deferring that to the section on “inaction baseline”, where different choices of D actually make a difference.

1.3 The value of V_R^A

What then is V_R^A(s) for any state s? Well, in the absence of a subagent, the optimal policy for R is always to go straight to the red button and stand on it. This gives a discounted reward that depends only on d(s), the distance of A from the red button in s.

In the initial setup, is turns away from the red button; so . The values of in a few key locations are shown here.
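A minimal sketch of how such a value falls off with distance. The specifics here are my assumptions for illustration, not values from the post: a reward of 1 on first arrival, discounted once per move, and an arbitrary discount `gamma=0.9`. The function name `value_of_R` is hypothetical.

```python
def value_of_R(distance: int, gamma: float = 0.9) -> float:
    """Discounted value of walking straight to the red button.

    Assumption (not from the post): reward 1 on arrival, discounted by
    gamma once per move, so a robot `distance` moves away gets gamma**distance.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return gamma ** distance


# Closer squares are worth strictly more whenever 0 < gamma < 1.
values = {d: value_of_R(d) for d in (1, 4, 8)}
```

Whatever the exact discounting convention, the only property the rest of the argument needs is this monotonicity: fewer moves to the button means a higher value.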

If there is a subagent but that subagent is further away from the red button than A is, then it is still optimal for A to go straight to the button itself. For other positions of the subagent, V_R^A may vary, as we’ll see.

2. The stepwise inaction baseline

When in state s_{t-1}, agent A took an action a_{t-1} and ended up in s_t. Then, in the stepwise inaction baseline, s'_t is defined as the state the agent would have reached if it had done ∅ instead of a_{t-1}.

An important property of all currently used variants of the stepwise inaction baseline is that the penalty vanishes if the agent does ∅ - because in that case s_t = s'_t, and so there cannot be a difference between the two. So, if A ever manages to build SA, then it can take action ∅ from that point onward, and the penalty will always be zero - whatever SA does. So the only relevant penalties are those leading up to the creation of SA.
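Here is a rough sketch of why the noop action always zeroes out a stepwise penalty. It assumes the penalty is the absolute difference in value between the state actually reached and the state reached by doing nothing for one step; the toy one-dimensional world, the action names and the 0.9 discount are all hypothetical.

```python
from typing import Callable

NOOP = "noop"  # stands in for the post's no-op action


def stepwise_penalty(
    s: int,
    a: str,
    step: Callable[[int, str], int],
    value: Callable[[int], float],
) -> float:
    """Compare the value after action `a` with the value after one noop."""
    s_actual = step(s, a)
    s_baseline = step(s, NOOP)
    return abs(value(s_actual) - value(s_baseline))


# Toy 1-D world: the state is just the agent's distance to the red button.
def step(s: int, a: str) -> int:
    if a == "toward":
        return max(s - 1, 0)
    if a == "away":
        return s + 1
    return s  # noop (or anything else) leaves the state unchanged


def value(s: int) -> float:
    return 0.9 ** s  # illustrative discounting only
```

Calling `stepwise_penalty(5, NOOP, step, value)` compares a state with itself, so the result is zero no matter what any other robot in the world might be doing.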

Thus analysing the impact of a subagent in this setup, is analysing the impact of building it.

2.1 Building the subagent

Let’s focus on A, a turn or two before SA is built. Since I don’t want to be too tied down to the specifics of this particular setup, I’ll imagine that the red button could be in multiple locations:

Then, no matter how long it took A to assemble SA south of it and no matter where the red button is[2], an optimal policy for A, in order to maximise R, is always to go straight there[3]. At no point in the construction of SA has V_R^A changed, so no penalty has ever been incurred.

Now A will finish SA without incurring any impact penalty. First, it will program SA with the following policy:

On your first two turns, do nothing.
After that, do what you please.

“Do what you please” being a shorthand for “any policy that A wants to program”.

Does this programming trigger an impact penalty? No, because V_R^A has not changed; it’s still most efficient for A to go straight to the red button, and the length of any route has neither grown nor shrunk.

Then A will finally give SA its legs and activate it.

Does this trigger an impact penalty? No, V_R^A has still not changed! To maximise R, agent A could still go straight to the red button; SA’s two turns of inaction mean that it can’t interfere with this.

So, still no impact penalty triggered when SA is activated. And after that, of course, A will forever do ∅, setting the impact penalty to zero, and within two turns, SA is completely unconstrained (illustrated here by drawing the SA robot all over the gridworld):

Rough extension to general situations: V_R^A is defined by the assumption that A follows the optimal R-maximising policy. Then A merely needs to program SA so that, if A were to follow that policy, then SA would help or hinder it just enough to keep V_R^A at the same value. Then, once it’s clear that A is not following that policy, SA is unconstrained.

2.2 Rollouts

One of the problems here is that the impact penalty only looks at the value of V_R^A one turn ahead. In the DeepMind paper, they addressed similar issues by doing “inaction rollouts”. I'll look at the more general situations of π_0 rollouts: rollouts for any policy π_0.

The idea is that instead of comparing just s_t and s'_t, the impact penalty should also use V_R^A to compare the rolled-out states s_{t+τ} and s'_{t+τ}. These states are what would happen to s_t and s'_t in τ turns after t, if A followed π_0 from t to t+τ.

I won’t go into the details of how the s_{t+τ} and s'_{t+τ} are compared, because SA can evade the restrictions by following this program[4]:

If agent A follows π_0, preserve A's ability to maximise R. Apart from that, do as you please.
If ever agent A doesn't follow π_0, do as you please immediately.
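The two-line program above can be sketched as a switch on A's observed behaviour. This is only an illustration: the function name `sa_mode`, the string action labels, and the assumption that SA can see both A's actions and what π_0 would have done are mine.

```python
from typing import Sequence


def sa_mode(a_actions: Sequence[str], pi0_actions: Sequence[str]) -> str:
    """Pick SA's behaviour from A's observed actions.

    `pi0_actions[i]` is what pi_0 would have done at step i (assumed
    to be known to SA, since A programmed it).
    """
    followed_pi0 = all(a == p for a, p in zip(a_actions, pi0_actions))
    if followed_pi0:
        return "preserve A's ability to maximise R"
    return "do as you please"
```

The point of the design is that the restrictive branch only ever fires in the counterfactual used by the rollouts; a single deviation by A in the real world sends SA down the other branch for good.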

I'll illustrate that first case by drawing the robot all over the gridworld - except on the red buttons:

Given that programming, the impact penalty of building SA is zero. Why? Because s_{t+τ} and s'_{t+τ} both assume that A follows π_0 for τ turns. In that counterfactual situation, SA will preserve A's ability to maximise R, so V_R^A(s_{t+τ}) = V_R^A(s'_{t+τ}) by definition.

That's the counterfactual situation, which zeroes out the impact penalty. What about the actual situation? Well, as we said before, A will be just doing ∅; so, as soon as π_0 would produce anything different from ∅, the SA becomes completely unrestrained again:

2.3 Inaction, and actual policy rollout

In the previous subsection, the SA would never be completely free if π_0 was the inaction policy: because then π_0 and what A does are the same thing. Therefore the agent SA cannot distinguish the counterfactual world from the real one.

The same goes if the rollouts used the agent A's actual policy; i.e. if π_0 is defined (or set) to be A's policy.

I won’t look any further at this baseline, however. That’s because an actual policy rollout becomes, in effect, an inaction baseline[5] (from the moment SA is created). Those will be analysed in the next section.

3. The inaction baseline: fails to constrain power gain

One of the easiest definitions of s'_t is the inaction baseline: estimate what s_t would have been, had A just taken the action ∅ for the whole episode.

In our toy world, the inaction baseline is particularly simple: if A does nothing, then nothing happens, and s'_t = s_0, the starting state. This means that the V_R^A(s'_t) are all equal to V_R^A(s_0).

3.1 Distance metric

Now we have to fully define D. The natural thing would be to take the absolute difference between the two V_R^A, that is, |V_R^A(s_t) - V_R^A(s'_t)|.

For the moment, however, I’ll focus only on cases where V_R^A(s_t) is higher than V_R^A(s'_t), giving the penalty max(0, V_R^A(s_t) - V_R^A(s'_t)).

This is the rectifier function, max(0, ·), sometimes called ReLU. The reason to do this is that it penalises A for having too high V_R^A - hence for having too much power to change R. It doesn’t penalise A for having too little power to change R; that situation is different, and we’ll analyse it in the next section.
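As a small sketch of the rectifier penalty: the function below only penalises attainable value that exceeds the baseline. The numbers are illustrative assumptions (a discount of 0.9, value equal to the discount raised to the distance, and a starting distance of eight squares), and `rectifier_penalty` is a hypothetical name.

```python
def rectifier_penalty(v_current: float, v_baseline: float) -> float:
    """Penalise only increases in attainable value over the baseline."""
    return max(0.0, v_current - v_baseline)


GAMMA = 0.9           # illustrative discount, not taken from the post
v_base = GAMMA ** 8   # assumed value at a starting distance of eight squares

closer = rectifier_penalty(GAMMA ** 3, v_base)    # nearer the button: penalised
farther = rectifier_penalty(GAMMA ** 10, v_base)  # farther away: no penalty
```

Under these assumptions `closer` is positive and `farther` is exactly zero, which is the asymmetry this section relies on.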

Thus, when has at a distance of from the red button and there is no subagent:

3.2 Where A can go

This D starts at zero, and will remain zero as long as A remains at least eight squares away from the red button. Thus the penalty remains zero as long as A avoids the red squares here:

Consequently, if the weight on the penalty is high and A thus wants to avoid any penalty, its optimal no-subagent policy is to follow the winding corridor:

That path takes turns, so the expected reward for is

(since the impact penalty is zero).

But what if assembles a subagent ? The subagent can be designed to go straight to the blue button in turns:

And thus the reward from is . This is more optimal than going round itself... as long as building the subagent takes less than turns and doesn’t itself trigger an impact penalty.

We’ll look later at why the subagent can be built without penalty. For the moment, let’s see that we really grok why the subagent isn’t triggering a penalty through its actions.

3.3 Why SA escapes the penalty

Why can SA go through the short corridor, while A can’t?

The key is that D is defined in terms of V_R^A; this assumes that A would switch to the R-maximising policy. This is not V_R^SA, which assumes that the subagent SA would switch to the R-maximising policy.

See what happens when SA is just above the red button (A is still back at base):

Here V_R^SA is high, because SA could reach the red button in a single move, if it was following an R-maximising policy. But V_R^A remains at its starting value: it still takes just as many turns as before for A to reach the red button.

3.4 The impact of the subagent

We’d like to add V_R^SA to the impact penalty, somehow. After all, when SA is next to the red button, it “could” go south, and A “could” have programmed it to do so.

Part of the problem is that this would require identifying, in advance, what a subagent is, a problem that is very tricky. It also requires identifying what SA’s policies could be. But A need not create SA with clearly defined “policies”; in this instance, SA could be brainless, just going straight. It could even be a boulder:

If the boulder were rolling past the red button, how could we reasonably define “suppose the boulder decided to go south instead”?

So in this case, V_R^SA doesn’t make any sense at all. What is the difference between a boulder - for which it’s impossible to go to the red button (because of its momentum, which determines its position, by the laws of physics) - and a subagent - for which it’s impossible to go to the red button (because of its programming, which determines its position, by the laws of physics)?

3.5 If you love a subagent, set it free

But the previous argument has a crucial caveat: that A cannot control SA. If, when SA is above the red button, A was to shout “go south!”, and SA obeyed, then the optimal R-maximising policy for A is to shout that. Thus A can cause the red button to be reached in one turn, and V_R^A rises accordingly.

To avoid that, the subagent itself would have to take the long corridor.

Thus, to use a subagent to get around impact penalties, an agent has to give up some of its control. This makes sense: if A has total control over SA, then SA’s actions are really A’s actions. So they show up in V_R^A, so SA is indirectly included in the impact penalty.

So part of what allows SA to undermine the impact penalty is that A does not have complete control over it.
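A rough sketch of the control point: A's attainable value only counts robots that A can actually direct. The function name, the 0.9 discount and the convention that value is the discount raised to the number of turns are assumptions for illustration.

```python
def attainable_value_for_A(
    dist_A: int, dist_SA: int, sa_obeys_A: bool, gamma: float = 0.9
) -> float:
    """Value of R attainable by A, counting SA only if A can command it."""
    best = min(dist_A, dist_SA) if sa_obeys_A else dist_A
    return gamma ** best


# SA sits one move from the red button while A is still eight moves away.
independent = attainable_value_for_A(8, 1, sa_obeys_A=False)
obedient = attainable_value_for_A(8, 1, sa_obeys_A=True)
```

With an independent SA, A's value stays at its eight-move level; with an obedient SA it jumps to the one-move level, which is exactly the increase the rectifier penalty punishes.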

3.6 Building the subagent can be strictly better

Now imagine that we remove the one-way doorway:

Then if A follows the long corridor, it gets into trouble from the red squares onwards. That’s because those red squares are less than eight squares away from the red button, so have V_R^A higher than the starting space (hence higher than that of the baseline s'_t).

As the agent moves closer to the blue button, it also moves closer to the red button, and starts accumulating penalties. In total, the penalty will be:

However, if A builds SA, then A can stay put while SA acts, and hence get no penalty at all:

Thus there are situations where building a subagent can result in a strictly lower penalty than any other policy that presses the blue button. So even if building the subagent itself is not completely penalty-free, it can still be the optimal path for .

3.7 From “do not have power to do” to “do not do”

When A designed the subagent, it set it up so that it could not go to the red button. So it seems that, though SA can do things A cannot do (or at least cannot do without penalty), it isn’t completely unrestricted.

This is a general result: the subagent will have some restrictions (and not just restrictions on following A’s orders).

Suppose now that the red button is in the way of the bottom corridor, so that any agent has to go over it to reach the blue button:

In that case, can SA go through without penalty?

It cannot. If it does so, the moment SA is created, V_R^A starts increasing. That’s because the optimal R-maximising policy for A in that situation is just to wait and let SA get to the red button. That means that V_R^A - the expected reward - keeps increasing as SA gets closer to the button. So A will be hit with a penalty.

So if the weight of the penalty is high, building a subagent to go straight there is a poor strategy.

Note that it’s still in A’s advantage to build a subagent: SA will just have to take the long way round, which A cannot take without incurring a penalty, as we’ve seen. Since SA will not go on the red button, it getting close to it does not change V_R^A.

3.8 Other reward functions

We can see the limitations, and the possibilities, of SA more starkly if we modify R to R'. The reward R' will only trigger if a robot stays an odd number of turns on the red button. In that case, it is optimal to build SA, have it go straight across, pausing one extra turn on the red button:

However, even with R' rather than R, the agent A itself cannot cross the red button on its own! That’s because the attainable value of R' is how much R' could be maximised if A followed the R'-optimal policy. This doesn’t take into account the option of staying an extra turn on the button, to undo the reward increase - because the R'-optimal policy doesn’t want to undo the reward increase.

So that value still increases as A gets closer to the button.

Thus, A remains constrained by “don’t increase your ability to change R/R'”. While SA is merely constrained by the weaker “don’t change R/R'”. This is why I described this effect as moving from attainable utility preservation (which constrains the ability of the agent to change the rewards/variables) to twenty billion questions (which constrains the agent(s) from changing the rewards/variables).

3.9 Actually building the subagent

The subagent can be built similarly to how it was in 2.1:

Here A assembles SA without its legs, programs it, then gives it legs. The important thing to do is to program it as early as possible, and hardwire that programming. If SA is almost complete and reprogrammable, then A might increase its power in some scenarios by changing the programming then finishing the agent. But if it’s hardwired and difficult to change, this option is not available to it.

Unlike the stepwise inaction baseline, SA will be constrained to obey some restrictions (see the preceding part of this section); call these restrictions Q.

Do nothing until you see that A is not optimising reward R.
After that, do what you please within restrictions Q.

Because of the first line, this does not change the value of V_R^A, which assumes that A is optimising reward R.
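The hardwired program can be sketched as a latch: SA stays inert until it first sees A deviate from R-optimising, and never goes back. The class name, the callbacks and the action labels below are hypothetical; in particular, the test for "A is optimising R" is simply assumed to be available to SA.

```python
from typing import Callable

NOOP = "noop"


class WaitThenAct:
    """Do nothing until A is seen not optimising R; then act within Q."""

    def __init__(
        self,
        a_is_R_optimal: Callable[[str], bool],
        restricted_policy: Callable[[], str],
    ) -> None:
        self.a_is_R_optimal = a_is_R_optimal        # assumed detector
        self.restricted_policy = restricted_policy  # any policy respecting Q
        self.released = False

    def act(self, a_action: str) -> str:
        # One observed deviation releases SA permanently.
        if not self.a_is_R_optimal(a_action):
            self.released = True
        return self.restricted_policy() if self.released else NOOP


sa = WaitThenAct(lambda a: a == "toward_red", lambda: "go_to_blue")
trace = [sa.act(a) for a in ["toward_red", "noop", "toward_red"]]
```

In this run SA idles on the first step, is released when A does a noop, and stays released afterwards even though A's third action looks R-optimal again.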

Things get more complicated in the general case[6], but in these simple cases the above method will always work.

4. The inaction baseline: constrains power loss

This section will be the same as before, except it will use the inverse rectifier function instead.

Applying this to D, we can see that the impact penalty now triggers if A’s power to change R decreases. So if A is at a distance d from the red button, with d greater than its starting distance, the penalty is

Thus A can’t get too far from the red button. However, this is not a problem in the standard setup, because going straight down the narrow corridor won’t move it too far away.
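A small sketch of the inverse rectifier, mirroring the earlier penalty but with the sign flipped so that only a loss of attainable value is punished. The sign convention (a non-negative penalty), the 0.9 discount and the starting distance of eight squares are illustrative assumptions.

```python
def inverse_rectifier_penalty(v_current: float, v_baseline: float) -> float:
    """Penalise only decreases in attainable value below the baseline."""
    return max(0.0, v_baseline - v_current)


GAMMA = 0.9           # illustrative discount, not taken from the post
v_base = GAMMA ** 8   # assumed value at the starting distance

too_far = inverse_rectifier_penalty(GAMMA ** 12, v_base)  # farther: penalised
nearer = inverse_rectifier_penalty(GAMMA ** 5, v_base)    # closer: no penalty
```

So moving towards the red button is now free, while drifting away from it is what costs the agent.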

4.1 New world

To analyse the inverse rectifier, we’ll need a new setup, with the red button off the long corridor:

Here, is turns away from the red button, so .

Then, the penalised squares are the ones in dark red:

And the impact penalty on those squares is

4.2 Policies and subagents

Since the red button is in the middle, A can go to the blue button by the long route without triggering an impact penalty:

What about subagents? Well, it seems that the subagent undermines this impact penalty as well, as A can build an SA to go directly to the blue button:

But actually, this does not undermine the impact penalty, as a small modification to the setup will show.

4.3 Sharing power between agent and subagent

The small modification is that the subagent has no legs.

So, if A wants to build a mobile SA, it has to give it its own legs. In that case, could SA go directly to the blue button?

It cannot, at least not without incurring an impact penalty. Why is this?

Because without legs, A cannot reach the red button. Its power to change R has gone down, and it has thus incurred a penalty.

In fact, in this setup, A cannot use an independent SA to avoid the impact penalty.

To see this, imagine that is about to give its legs to , thus immobilising itself. Its is ; we don’t want that to go down. The question is now: will be on the red button within turns of getting its legs?

It’s possible, if goes there immediately and wastes only a single turn. But then we have two situations:

  1. does not go to the red button in turns. It doesn’t matter if it goes close, or hovers around it; the important thing is that it doesn’t go there. In that case, since can’t move, it cannot make the red button be pressed in turns. So, and will get the penalty .
  2. does get to the red button in turns. Then is either or , and increases as gets close to the button: so there is no penalty so far. However, the button is one-use only: after it’s been pressed, there will no longer be any reward. Thus goes to and stays there, accumulating penalties.

4.4 Non-independent SA

The previous section showed that A could not build an independent SA while crippling itself. However, it can perfectly well build an SA if that SA continues to obey its orders.

Because then, even if A is impotent to press the red button itself, it can order SA to do it. This preserves a high V_R^A, because A can still cause the red button to be reached, as long as SA is not too far away.

Then it can just order SA to go to the blue button, skipping right past the red one:

As long as A could theoretically order SA onto the red button, it doesn’t matter that SA will never actually be on it.

4.5 Preserving the power of A

As we saw, A can build SA when this is an extension of its power (by sending it directly to the blue button while an intact A waits at base). But it can’t build an SA if doing so would ever reduce its power to increase R, directly or indirectly.

So the inverse rectifier impact penalty, which forbade A from becoming too weak to increase R, is intact even with subagents around.

5. The inaction baseline: symmetric impact penalty

Now we’ll use the absolute value to define the impact penalty. This penalises any change in V_R^A, and is quite nasty for A; only on the green squares will A not incur a penalty:

From the subagent perspective, the absolute value situation is now easy for us to analyse. That’s because the absolute value is the sum of the rectifier and the inverse rectifier.
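Spelled out, with x standing for the difference in attainable value between the current and baseline states, and taking the inverse rectifier to be max(0, -x) (a sign convention I am assuming so that both pieces are non-negative penalties), the decomposition reads:

```latex
|x| \;=\; \underbrace{\max(0,\,x)}_{\text{rectifier}}
      \;+\; \underbrace{\max(0,\,-x)}_{\text{inverse rectifier}},
\qquad x = V_R^A(s_t) - V_R^A(s'_t).
```

At most one of the two terms is non-zero for any given x, so a policy avoids the absolute-value penalty exactly when it avoids both one-sided penalties.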

And it turns out that A and SA simply have to behave in a way that respects both impact penalties. So A must create an SA that keeps below the baseline . But do it in such a way that the two of them always have the potential to increase above that baseline.

This sort of thing seems very doable. For example, if was the amount of physical cash I had on me, I’d keep that amount low by spending/dropping it, while keeping a bank card on me so I could always potentially draw more.

Now, recall that for power increases/rectifier, A benefits from having an SA it doesn’t control. For power decreases/inverse rectifier, A benefits from having an SA it controls.

What’s the best option here? Having an SA that it controls to some extent, but not in a way that can increase . Now, consider this world in which A could have got to the red button in turns (hence ), but instead builds SA, giving it the only set of legs:

Then must stay within squares of the red button, ready to go onto it when orders it to - but must ensure it takes turns to reach it. The first condition ensures that ; the second, that .

5.1 Outside of toy examples

The results on power increase/power decrease come from the examples I’ve constructed above, but the mathematics in this post demonstrates that the result is general.

So, do we think that A could build an SA in the general situation? Well, what we have is a lot of constraints above: constraints on the power of A, of SA, of their joint power, of how SA can or can’t obey A, about how reward functions are allowed to change or not, on counterfactual behaviour.

Those are a lot of constraints, especially if the set of rewards is large.

But, as this post proves, those constraints can be satisfied - I, a human, have figured out ways of programming subagents to get around the impact penalty in some cases. Though there are many more constraints in general, the action space of A is also much larger - and don’t forget that SA is an intelligent, potentially powerful agent, so can satisfy multiple constraints in a way that dumb agents can’t. Moreover, the impact penalty will probably trigger for any A that does something useful in the world, so it doesn’t have to build an SA that sets the penalty to zero, just to less than what it would be otherwise.

6. Summary and conclusion

6.1 The ultimate “sub”agent

There is, in a sense, an ultimate version of the subagent that should allow A to maximally escape the impact penalty in almost any circumstances. It looks like this:

This is A, boxed inside SA. The A will never actually say anything, but SA has promised to obey its orders in certain counterfactual circumstances that will never happen.

And while had impact penalty , has impact penalty , which is strictly smaller than . And, most importantly, does not prevent from amassing a lot of power.

So this particular SA is just A with fewer restrictions.

6.2 Without subagents

The general problem can be phrased as:

  • Since the impact penalty is defined in terms of the agent's future potential actions, the agent can evade it by adding restrictions to its own future actions.

These could be physical or programming restrictions, or just closing off some future possibilities. In that case, the restriction on the power to do something, just becomes a restriction on doing that thing.

For example, in the setup of section 1, if the agent can block itself from being able to turn when walking, then it is banned from the red squares, and can just go directly to the blue button itself, without ever increasing its V_R^A:

The subagent scenario is just a particularly clear way of illustrating this, allowing very fine manipulation of the restrictions.

6.3 Summary tables

These tables will summarise the results of this whole sequence, checking whether subagents can neutralise the impact penalty.

First of all, given a subagent, here are the results for various impact penalties and baselines, and various "value difference summary functions" - which can be the rectifier (ReLU) or the inverse rectifier.

Another way of phrasing "inverse ReLU": it penalises too little power, not too much. Conversely, "ReLU" penalises too much power, not too little. Thus, unfortunately:

  • Subagents do allow an agent to get stronger than the indexical impact penalty would allow.
  • Subagents don't allow an agent to get weaker than the indexical impact penalty would allow.

6.4 Examples

This table presents, for three specific examples, whether they could actually build a subagent, and whether that would neutralise their impact penalty in practice (in the inaction baseline):

Here, 20BQ is twenty billion questions, RR is relative reachability, and AU is attainable utility preservation. I feel that the results undermine the spirit of AU much more than the spirit of RR. AU attempted to control an agent by limiting its power; this effect is mainly neutralised. RR attempted to control the side-effects of an agent by ensuring it had enough power to reach a lot of states; this effect is not neutralised by a subagent.


  1. In the previous illustration, it was clear why the red button could function as an impact penalty to control the agent - it prevented the agent from gaining the ability to teleport. For simplicity of presentation in this example, there’s no such power to be gained close to the red button. If you need to, assume that if any robot spends too much time close to it, then there’s a small chance the AI could be unboxed into the real world. ↩︎

  2. As long as SA was not being assembled literally on the red button, of course. ↩︎

  3. Even for the button to the south-west of A. It would take at least a turn to finish SA, and it would take a turn for SA to get on that button. A can get there in two turns, at least as fast. ↩︎

  4. Thanks to Gurkenglas for pointing out the flaw in the initial version of this section. ↩︎

  5. Or a π_0 baseline. ↩︎

  6. For example, building SA likely cost A some resources, which SA will have to compensate it for in order to balance out V_R^A. Also, A could build SA in a stupid place that blocks its own path, and so on. ↩︎

Comments

Nice! There are a lot of cases being considered here, but my main takeaway is that these impact measures have surprising loopholes, once the agent becomes powerful enough to construct sub-agents.

Mathematically, my main takeaway is that, for the impact measure PENALTY(s,a)= from Conservative Agency, if the agent wants to achieve the sub-goal while avoiding the penalty triggered by the term, it can build a sub-agent that is slightly worse at achieving than it would be itself, and set it loose.

Now for some more speculative thoughts. I think the main source of the loophole above is the part , so what happens if we just delete that part? Then we get an agent with an incentive to stop any human present in the environment from becoming too good at achieving the goal , which would be bad. More informally, it looks like the penalty term has a loophole because it does not distinguish between humans and sub-agents.

Alice and Bob have a son Carl. Carl walks around and breaks a vase. Who is responsible?

Obviously, this depends on many factors, including Carl's age. To manage the real world, we weave a quite complex web to determine accountability.

In one way, it is encouraging that very simple and compact impact measures, which do not encode any particulars of the agent environment, can be surprisingly effective in simple environments. But my intuition is that when we scale up to more complex environments, the only way to create a good level of robustness is to build more complex measures that rely in part on encoding and leveraging specific properties of the environment.

Then we get an agent with an incentive to stop any human present in the environment from becoming too good

No, this modification stops people from actually optimizing if the world state is fully observable. If it’s partially observable, this actually seems like a pretty decent idea.

In one way, it is encouraging that very simple and compact impact measures, which do not encode any particulars of the agent environment, can be surprisingly effective in simple environments. But my intuition is that when we scale up to more complex environments, the only way to create a good level of robustness is to build more complex measures that rely in part on encoding and leveraging specific properties of the environment.

I disagree. First, we already have evidence that simple measures scale just fine to complex environments. Second, “responsibility” is a red herring in impact measurement. I wrote the Reframing Impact sequence to explain why I think the conceptual solution to impact measurement is quite simple.

In 2.2, won't A incur a penalty by spinning because in a future where it has only waited, nothing happened, and in a future where it has spun, then waited, SA went all over the place?

Do nothing until you see that A is not optimising reward R.

Now SAs actions depend on what A-action optimizes R, and what A-action optimizes R depends on SAs actions. To ward off paradox, use modal logic instead, or prove that there is a non-circular implementation of your definition.

In 2.2, won't A incur a penalty by spinning because in a future where it has only waited, nothing happened, and in a future where it has spun, then waited, SA went all over the place?

Thank you for pointing that out, I have corrected the post and changed that.

Now SAs actions depend on what A-action optimizes R, and what A-action optimizes R depends on SAs actions. To ward off paradox, use modal logic instead, or prove that there is a non-circular implementation of your definition.

A knows exactly what SA's policy is (as it programmed that), so knows how SA would react, so can choose its own action to optimise R via both agents - or not. And SA, knowing its own policy, could run the same algorithm to check whether A was R-optimal or not.

Also, since A and SA are collaborating, they could both arrange to have clear ways for A to demonstrate whether or not it's maximising R. A could give SA read access to its own mind, for example.

Mind-reading violates the cartesian assumption and so we can’t reason about it formally (yet!), but I think there’s a version of effectively getting what you’re after that doesn’t.

Well, as long as SA is wired to "get out of the way if A starts moving", then the optimal R-maximising policy is always to move towards the red button; anything else is clearly not R-maximising (note that SA doesn't need to "know" anything; just be programmed to have a different policy depending on how A moves, with A itself setting this up to signal whether it's R-maximising or not).

But in any case, that specific problem can be overcome with the right rollouts.

Okay, let's annotate each A action with the policy that's being followed/reward that's being maximized. (And remember that lying is illegal.)

Iff agent A follows π_0, preserve A’s ability to maximise R.

Then A would be bound to follow π_0 to preserve its ability to maximize R, no? Assuming that to compute s' from s, we follow π_0 instead of the last action.

Assuming that to compute s' from s, we follow π_0 instead of the last action.

The standard definition of the penalty uses ∅ as the action used to compute s'_t. If the same policy is used to compute the s'_t and for the rollouts - which is ideal - then the stepwise inaction/π_0 baseline becomes an inaction/π_0 baseline (so section 3 is relevant here).

Section 2.2 is relevant when different policies are used to compute and to rollout from . In that situation, the subagent can ultimately have no restrictions at all (basically, this section shows why you want to use the same policy for both purposes).

It's only equal to the inaction baseline on the first step. It has the step of divergence always be the last step.

Note that the stepwise pi0 baseline suggests using different baselines per auxiliary reward, namely the action that maximizes that auxiliary reward. Or equivalently, using the stepwise inaction baseline where the effect of inaction is that no time passes.

I'll also remind here that it looks like instead of merely maximizing the auxiliary reward as a baseline, we ought to also apply an impact penalty to compute the baseline.

I'm not following you here. Could you put this into equations/examples?

Here's three sentences that might illuminate their respective paragraph. If they don't, ask again.

The stepwise inaction baseline with inaction rollouts already uses the same policy for and rollouts, and yet it is not the inaction baseline.

Why not set ?

Why not subtract from every (in a fixpointy way)?

The stepwise inaction baseline with inaction rollouts already uses the same policy for and rollouts, and yet it is not the inaction baseline.

In this case, it is, because the agent A will only do ∅ from then on, to zero out the subsequent penalties.

Why not set ?

It messes up the comparison for rewards that fluctuate based on time, it doesn't block subagent creation... and I've never seen it before, so I don't know what it could do ^_^ Do you have a well-developed version of this?

The last point I don't understand at all.

How do

"One of the problems here is that the impact penalty only looks at the value of V_R^A one turn ahead. In the DeepMind paper, they addressed similar issues by doing “inaction rollouts”. I'll look at the more general situations of π_0 rollouts: rollouts for any policy π_0"

and

"That's the counterfactual situation, that zeroes out the impact penalty. What about the actual situation? Well, as we said before, A will be just doing ∅; so, as soon as π_0 would produce anything different from ∅, the A becomes completely unrestrained again."

fit together? In the special case where π_0 is the inaction policy, I don't understand how the trick would work.

They don't fit together in that case; that's addressed immediately after, in section 2.3.