*A putative new idea for AI control; index here.*

This post will be extending ideas from inverse reinforcement learning (IRL) to the problem of goal completion. I'll be drawing on the presentation and the algorithm from Apprenticeship Learning via Inverse Reinforcement Learning (with one minor modification).

In that setup, the environment is an MDP (Markov decision process), and the real reward R is assumed to be linear in the "features" of the state-action space. Features are functions φ_{i} from the full state-action space S×A to the unit interval [0,1] (the paper linked above only considers functions from the state space; this is the "minor modification"). The k features together form a vector φ∈[0,1]^{k}. The actual reward is given by the inner product with a weight vector w∈ℝ^{k}, so the reward at state-action pair (s,a) is

R(s,a) = w·φ(s,a).

To ensure the reward is always between -1 and 1, w is constrained to have ||w||_{1} ≤ 1; to reduce redundancy, we'll assume ||w||_{1}=1.
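The bound on the reward follows in one line from each φ_{i} lying in [0,1]. Spelled out as a short derivation, using only the definitions above:

```latex
|R(s,a)| = \Bigl|\sum_{i=1}^{k} w_i\,\varphi_i(s,a)\Bigr|
\le \sum_{i=1}^{k} |w_i|\,|\varphi_i(s,a)|
\le \sum_{i=1}^{k} |w_i| = \|w\|_1 \le 1 .
```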

The advantage of linearity is that we can compute the expected reward directly from the expected feature vector. If the agent follows a policy π (a map from states to actions) and has a discount factor γ, the expected feature vector is

μ(π) = E(Σ_{t} γ^{t} φ(s_{t},π(s_{t}))),

where s_{t} is the state at step t.

The agent's expected total discounted reward is then simply

E(R) = w·μ(π).
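As a rough illustration of this identity, here is a minimal Python sketch for a small tabular MDP. Everything concrete in it is an assumption made for the example rather than anything from the paper: the random transition tensor `P`, the random features `phi`, the uniform start distribution `d0`, the deterministic `policy`, and the helper name `expected_features`. It computes μ(π) by solving a linear system, then checks that w·μ(π) matches the discounted reward obtained by propagating the state distribution forward step by step.

```python
import numpy as np

def expected_features(P, phi, policy, gamma, d0):
    """Discounted expected feature vector mu(pi) for a tabular MDP.

    P:      (n_states, n_actions, n_states) transition probabilities
    phi:    (n_states, n_actions, k) features in [0, 1]
    policy: (n_states,) deterministic action index for each state
    d0:     (n_states,) distribution over the initial state
    """
    n_states = P.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, policy]      # (n_states, n_states): transitions under pi
    phi_pi = phi[idx, policy]  # (n_states, k): features seen under pi
    # Per-start-state expected features: M = (I - gamma * P_pi)^{-1} phi_pi
    M = np.linalg.solve(np.eye(n_states) - gamma * P_pi, phi_pi)
    return d0 @ M

# Hypothetical toy problem (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n_states, n_actions, k, gamma = 4, 2, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalise rows into probabilities
phi = rng.random((n_states, n_actions, k))  # features in [0, 1)
w = rng.normal(size=k)
w /= np.abs(w).sum()                        # enforce ||w||_1 = 1
policy = rng.integers(n_actions, size=n_states)
d0 = np.full(n_states, 1.0 / n_states)

mu = expected_features(P, phi, policy, gamma, d0)

# Independent check: accumulate discounted reward R = w . phi step by step.
idx = np.arange(n_states)
R_pi = phi[idx, policy] @ w
P_pi = P[idx, policy]
d, rollout_reward = d0.copy(), 0.0
for t in range(500):                        # gamma**500 is negligible
    rollout_reward += (gamma ** t) * (d @ R_pi)
    d = d @ P_pi

print(w @ mu, rollout_reward)               # the two numbers should agree
```

The linear solve is only practical when the state space is small enough to enumerate; in larger settings μ(π) would typically be estimated from sampled trajectories instead.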

Thus the problem of computing the correct reward is reduced to the problem of computing the correct w. In practice, to compute the correct *policy*, we just need to find one whose *expected features* are close enough to optimal; this need not involve computing w.
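To see why matching expected features is enough, write π* for an optimal policy (a symbol introduced here for illustration). With ||w||_{1}=1, Hölder's inequality gives a bound that does not depend on knowing w:

```latex
|w\cdot\mu(\pi) - w\cdot\mu(\pi^*)|
= |w\cdot(\mu(\pi)-\mu(\pi^*))|
\le \|w\|_1\,\|\mu(\pi)-\mu(\pi^*)\|_\infty
= \|\mu(\pi)-\mu(\pi^*)\|_\infty .
```

So if every expected feature of π is within some tolerance ε of the corresponding expected feature of π*, the expected reward of π is within ε of optimal, whatever w happens to be.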
