Contrasting formalisms
Here I'll contrast the approach we're using in Pitfalls of Learning a Reward Online (summarised here), with that used by Tom Everitt and Marcus Hutter in the conceptually similar Reward Tampering Problems and Solutions in Reinforcement Learning. In the following, histories hi are sequences of actions a and observations o; thus hi=a1o1a2o2…aioi. The agent's policy is given by π, and the environment by μ.
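As a toy illustration (the encoding and names here are my own, not from either paper), a history hi can be represented as the alternating sequence of actions and observations:

```python
# Toy sketch (my own encoding, not from either paper): a history h_i is the
# alternating sequence a_1 o_1 a_2 o_2 ... a_i o_i of actions and observations.

def make_history(actions, observations):
    """Interleave i actions and i observations into a history h_i."""
    assert len(actions) == len(observations)
    history = []
    for a, o in zip(actions, observations):
        history.extend([a, o])
    return tuple(history)

h3 = make_history(["a1", "a2", "a3"], ["o1", "o2", "o3"])
print(h3)  # ('a1', 'o1', 'a2', 'o2', 'a3', 'o3')
```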
Then the causal graph for the "Pitfalls" approach is, in plate notation (which basically means that, for every value of j from 1 to n, the graph inside the rectangle is true):
The R is the set of reward functions (mapping "complete" histories hn of length n to real numbers), the ρ tells you which reward function is correct, conditional on complete histories, and r is the final reward.
In order to move to the reward tampering formalism, we'll have to generalise the R and ρ, just a bit. We'll allow R to take partial histories - hj shorter than hn - and return a reward. Similarly, we'll generalise ρ to a conditional distribution on R, conditional on all histories hj, not just on complete histories.
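The generalisation can be sketched as follows (a hedged toy encoding of my own; the candidate reward functions and the uniform ρ are purely illustrative):

```python
# Hedged sketch (my own encoding, not from either paper) of the generalisation:
# a reward function now accepts any partial history h_j, not just complete
# histories h_n, and rho becomes a conditional distribution over candidate
# reward functions given any h_j.

def reward_fn(history):
    """Example reward function: count the 'good' observations seen so far.
    Works on partial histories h_j as well as complete histories h_n."""
    return float(sum(1 for x in history if x == "good"))

def rho(history):
    """Generalised rho: a distribution over candidate reward functions,
    conditional on any history h_j (uniform here, purely illustrative)."""
    candidates = {"count_good": reward_fn, "zero": lambda h: 0.0}
    return {name: 1.0 / len(candidates) for name in candidates}

h2 = ("a1", "good", "a2", "bad")                              # partial history h_2
h4 = ("a1", "good", "a2", "bad", "a3", "good", "a4", "good")  # complete history h_4
print(reward_fn(h2), reward_fn(h4))  # 1.0 3.0
print(rho(h2))                       # {'count_good': 0.5, 'zero': 0.5}
```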
This leads to the following graph:
This graph is now general enough to include the reward tampering formalism.
States, data, and actions
In reward tampering formalism, "observations" (oj) decompose into two pieces: states (Sj) and data (Dj). The idea is that data informs you about the reward function, while states get put into the reward function to get the actual reward.
So we can model this as this causal graph (adapted from graph 10b, page 22; this is a slight generalisation, as I haven't assumed Markovian conditions):
Inside the rectangle, the histories split into data (D1:j), states (S1:j), and actions (a1:j). The reward function is defined by the data only, while the reward comes from this reward function and from the states only - actions don't directly affect these (though they can indirectly affect them by deciding what states and data come up, of course). Note that in the reward tampering paper, the authors don't distinguish explicitly between Rj and rj, but they seem to do so implicitly.
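The decomposition can be sketched as a toy example (my own encoding; the "apples/pears" data and reward functions are invented for illustration): the data history alone selects the reward function, and the reward comes from applying that function to the state history alone.

```python
# Illustrative sketch (my own encoding) of the reward-tampering decomposition:
# observations split into states S_j and data D_j; the data alone selects the
# reward function R_j, and the reward r_j is R_j applied to the state history.

def choose_reward_function(data_history):
    """R_j depends on the data history D_{1:j} only."""
    if data_history.count("user_says_apples") > data_history.count("user_says_pears"):
        return lambda states: float(states.count("apple"))
    return lambda states: float(states.count("pear"))

def reward(data_history, state_history):
    """r_j comes from R_j and the state history S_{1:j} only -
    actions never enter directly."""
    R_j = choose_reward_function(data_history)
    return R_j(state_history)

r = reward(["user_says_apples", "user_says_apples", "user_says_pears"],
           ["apple", "pear", "apple"])
print(r)  # 2.0
```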
Finally, ΘR∗ is the "user's reward function", which the agent is estimating via D1:j; this connects to the data only.
Almost all of the probability distributions at each node are "natural" ones that are easy to understand. For example, there are arrows into rj (the reward) from Rj (the reward function) and S1:j (the states history); the "conditional distribution" of rj is just "apply Rj to S1:j". The environment, action, and history naturally provide the next observations (state and data).
Two arrows point to more complicated relations: the arrow from ΘR∗ to Dj, and that from D1:j to R. The two are related; the data Dj is supposed to tell us about the user's true reward function, while this information informs the choice of R.
But the fact that the nodes and the probability distribution have been "designed" this way doesn't affect the agent. It has a fixed process Prt(R∣D1:j) for estimating R from D1:j (Prt stands for the probability function for the reward tampering formalism). It has access to aj, Dj, and Sj (and their histories) as well as its own policy, but has no direct access to μ or ΘR∗.
In fact, from the agent's perspective, ΘR∗ is essentially part of μ, the environment, though focusing on the Dj only.
States and actions in "Pitfalls" formalism
Now, can we put this into the "Pitfalls" formalism? It seems we can, as so:
All conditional probability distributions in this graph are natural.
This graph looks very similar to the "reward tampering" one, with the exception of ρj and ΘR∗, pointing at Rj and Dj respectively.
In fact, ρj plays the role of Prt(R∣D1:j) in that, for Plp the probability distribution for the learning process,
Plp(R∣D1:j,ρj)=Prt(R∣D1:j).
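This identity can be checked on a toy example (my own encoding, continuing the illustrative "apples/pears" data): the complexity lives inside ρj, so conditioning the simple Plp on ρj reproduces the complex, hard-coded Prt.

```python
# Toy check (my own encoding) that conditioning the simple learning-process
# distribution P_lp on rho_j reproduces the complex reward-tampering
# distribution P_rt over candidate reward functions.

def P_rt(data_history):
    """Complex, hard-coded estimator: a distribution over the candidate
    reward functions {'R_apple', 'R_pear'} given the data history."""
    n_apple = data_history.count("user_says_apples")
    n_pear = data_history.count("user_says_pears")
    total = (n_apple + n_pear) or 1  # avoid division by zero on empty data
    return {"R_apple": n_apple / total, "R_pear": n_pear / total}

def P_lp(data_history, rho_j):
    """Natural and simple: just apply the supplied rho_j to the data."""
    return rho_j(data_history)

D = ["user_says_apples", "user_says_apples", "user_says_pears"]
# rho_j carries exactly the information that P_rt hard-codes:
rho_j = P_rt
print(P_lp(D, rho_j) == P_rt(D))  # True
```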
Note that Plp in that expression is natural and simple, while Prt is complex; essentially Prt carries the same information as ρj.
The environment μlp of the learning process plays the same role as the combined μrt and ΘR∗ from the reward tampering formalism.
So the isomorphism between the two approaches is, informally speaking:
Uninfluenceable similarities
If we make the processes uninfluenceable (a concept that exists for both formalisms), the causal graphs look even more similar:
Here the pair (μlp,η), for the learning process, plays exactly the same role as the pair[1] (μrt,ΘR∗), for reward tampering: determining reward functions and observations.
There is an equivalence between the pairs, but not between the individual elements; thus μlp carries more information than μrt, while η carries less information than ΘR∗. ↩︎