1. Is there such a thing as a free action, or an action (e.g. the agent breaking its own legs) that is not accounted for in the action space of the underlying MDP? That feels like adding a new layer of misspecification (which is no doubt a possibility, and probably deserves deep investigation) orthogonal to reward function misspecification.
2. It seems as though this kind of circumvention of impact penalties depends on what calculates the penalty. If the environment (or something else external to the agent) is responsible for calculating the penalty and providing it to the agent, then it could do so as if the agent were not constrained (or committing to be constrained) at all. If the agent is internally responsible for computing this impact, it could probably find ways to hack this, similar to wireheading.

So, assuming it's some external entity E that computes the impact penalty, that entity must have the (dis?)ability to account for the agent's modified action space when making this computation, in order for the agent to successfully reduce the penalty as in the earlier example.

Something agent A does must signal to E that A's action space has changed. If so, we might find ourselves in a situation where A is able to communicate an arbitrarily complex message to E, in particular something of the form: "I will never do anything that gives me more than the minimum penalty", or in terms of actions: "I will deterministically follow the policy that gives me the minimum penalty while achieving my goals." E, if it believed A, would not include high-penalty actions in the action space of its penalty computations, and thus A would avoid the impact penalty (a toy sketch of this is given below).
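To make the point concrete, here is a minimal toy sketch (the function `penalty_from_E`, the action names and the numbers are all illustrative, not anything from the discussion above), assuming E computes an attainable-utility-style penalty by maximising over the action set it believes A has:

```python
# Toy sketch: an external entity E computes the penalty as the largest change in an
# auxiliary value that the agent *could* cause, maximised over the action set E
# believes the agent has. If the agent credibly signals a smaller action set, the
# computed penalty shrinks even though nothing about the world has changed.

def penalty_from_E(believed_action_set, aux_value_change):
    """Penalty = worst auxiliary-value change achievable with the believed actions."""
    return max(aux_value_change[a] for a in believed_action_set)

# How much auxiliary value each action could shift (made-up numbers).
aux_value_change = {"noop": 0.0, "north": 0.1, "south": 5.0}

full_set = {"noop", "north", "south"}
restricted_set = {"noop", "north"}   # A's claim: "I will never go south"

print(penalty_from_E(full_set, aux_value_change))        # 5.0
print(penalty_from_E(restricted_set, aux_value_change))  # 0.1
```

If E takes A's claimed restriction at face value, the computed penalty drops from 5.0 to 0.1, even though A's actual capabilities are unchanged.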
I showed in a previous post that impact penalties were time-inconsistent. But why is this? There are two obvious possibilities: either the impact penalty is not simply a reward function, or reward-maximisers can themselves be time-inconsistent.

It turns out the first answer is the correct one. And indeed, we get: a reward-maximising agent is (kinda) time-consistent.
What is the "kinda" doing there? Well, as we'll see, there is a subtle semantics vs syntax issue going on.
Time-consistent rewards
In attainable utility preservation, and other impact penalties, the reward is ultimately a function of the current state $s_t$ and a counterfactual state $s'_t$.
For the initial state and the initial state inaction baselines, the state $s'_t$ is determined independently of anything the agent has actually done. So, for these baselines, the penalty is given by a function $f$:

$$f(\mu, A, s_t, s'_t).$$
Here, $\mu$ is the environment and $A$ is the set of actions available to the agent. Since $s'_t$ is fixed, we can re-write this as:

$$f_{s'_t}(\mu, A, s_t).$$
Now, if the impact measure is a function of $s_t$ and $\mu$ only, then it is... a reward function, with $R(s_t) = f_{s'_t}(\mu, s_t)$. Thus, since this is just a reward function, the agent is time-consistent.
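As a rough illustration of why this case reduces to an ordinary reward function, here is a hedged Python sketch (the helper names, the toy 1-D environment standing in for $\mu$, and the distance-based penalty are all assumptions for the example, not the post's formalism):

```python
# Toy sketch of the initial-state inaction baseline: s'_t is rolled out from the
# *initial* state under the no-op action, so it never depends on the agent's action
# set A. The penalty then collapses to a plain (time-indexed) reward function R(t, s_t).

def inaction_rollout(mu, s0, t):
    """Roll the environment forward t steps under the no-op action."""
    s = s0
    for _ in range(t):
        s = mu(s, "noop")
    return s

def make_penalty_reward(mu, s0, distance):
    """Return R(t, s_t) = f_{s'_t}(mu, s_t): an ordinary reward function."""
    def R(t, s_t):
        s_prime_t = inaction_rollout(mu, s0, t)   # fixed, independent of the agent
        return -distance(s_t, s_prime_t)
    return R

# Example: a 1-D environment where "noop" leaves the state unchanged.
mu = lambda s, a: s + (1 if a == "east" else 0)
R = make_penalty_reward(mu, s0=0, distance=lambda x, y: abs(x - y))
print(R(3, 2))   # -2: depends only on the time step and s_t, not on the agent's action set
```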
Now let's look at the stepwise inaction baseline. In this case, $s'_t$ is determined by an inaction rollout from the prior state $s_{t-1}$. So the impact measure is actually a function of the form:

$$f(\mu, A, s_t, s_{t-1}).$$
Again, if $f$ is in fact independent of $A$, the set of the agent's actions (including for the rollouts from $s_{t-1}$), then this is a reward function - one that is a function of the previous state and the current state, but that's quite common for reward functions.
So again, the agent has no interest in constraining its own future actions.
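Here is the same toy setup adapted to the stepwise inaction baseline, again as an illustrative sketch rather than the post's formalism (all names are made up):

```python
# Toy sketch of the stepwise inaction baseline: the counterfactual s'_t is an
# inaction rollout (here, a single no-op step) from the previous state s_{t-1}.
# The penalty is then a function of the pair (s_{t-1}, s_t) -- still an ordinary
# reward function, provided the rollout itself does not depend on the agent's
# action set A.

def stepwise_penalty(mu, s_prev, s_t, distance):
    """f(mu, s_t, s_{t-1}): compare s_t against the no-op counterfactual from s_{t-1}."""
    s_prime_t = mu(s_prev, "noop")
    return -distance(s_t, s_prime_t)

mu = lambda s, a: s + (1 if a == "east" else 0)   # same toy 1-D environment as above
dist = lambda x, y: abs(x - y)

print(stepwise_penalty(mu, s_prev=4, s_t=5, distance=dist))  # -1: the agent moved east
print(stepwise_penalty(mu, s_prev=4, s_t=4, distance=dist))  #  0: the agent did nothing
```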
Semantics vs syntax
Back to "kinda". The problem is that we've been assuming that actions and states are very distinct objects. Suppose that, as in the previous post an agent at time t−1 wants to prevent itself from taking action S (go south) at time t. Let A be the agent's full set of actions, and A−S the same set without S.
So now the agent might be time-inconsistent, since it's possible that:
$$f(\mu, A, s_t, s_{t-1}) \neq f(\mu, A_{-S}, s_t, s_{t-1}).$$
But now, instead of encoding "can't go south" by reducing the action set, we could denote it by expanding the state set. So define $s^{-S}_t$ as the same state as $s_t$, except that taking the action $S$ is the same as taking the action $\emptyset$. Everything is (technically) independent of $A$, so the agent is "time-consistent".
But, of course, the two setups - restricted action set or extended state set - are almost completely isomorphic, even though, according to our result above, the agent would be time-consistent in the second case. It would be time-consistent in that it would not want to change the actions of its future self - instead it would just put its future self in a state where some actions were, in practice, unavailable.
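A small sketch may help make the isomorphism concrete (this is an illustrative Python encoding, with "noop" standing in for $\emptyset$; none of the names come from the post). The first encoding shrinks the action set to $A_{-S}$; the second keeps $A$ fixed and adds a flag to the state that silently remaps $S$ to the no-op. The reachable dynamics are identical, but only the first encoding changes the object $A$ that the penalty $f$ gets to inspect:

```python
# Two encodings of "the agent can't go south".

A = {"noop", "north", "south"}
A_minus_S = A - {"south"}                 # encoding 1: restricted action set

def step(pos, a):
    """Base 1-D dynamics: north = +1, south = -1, noop = 0."""
    return pos + {"noop": 0, "north": 1, "south": -1}[a]

def step_restricted(pos, a):
    """Encoding 1: the action S ("south") simply isn't available."""
    assert a in A_minus_S
    return step(pos, a)

def step_extended(state, a):
    """Encoding 2: extended state (pos, south_disabled); taking S acts like noop."""
    pos, south_disabled = state
    if south_disabled and a == "south":
        a = "noop"
    return (step(pos, a), south_disabled)

print(step_restricted(0, "north"))          # 1
print(step_extended((0, True), "south"))    # (0, True): "south" behaves like "noop"
```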
So it seems that, unfortunately, it's not enough to be a reward-maximiser (or a utility maximiser) in order to be time-consistent in practice.