Can you help find the most intuitive example of reward function learning?

In reward function learning, there is a set of possible non-negative reward functions, $\mathcal{R}$, and a learning process $P_L$ which takes in a history of actions and observations and returns a probability distribution over $\mathcal{R}$.

If $\pi$ is a policy, $\mathcal{H}_t$ is the set of histories of length $t$, and $P(h_t \mid \pi)$ is the probability of $h_t$ given that the agent follows policy $\pi$, the expected value of $\pi$ at horizon $t$ is:

$$\sum_{h_t \in \mathcal{H}_t} P(h_t \mid \pi) \sum_{R \in \mathcal{R}} P_L(R \mid h_t)\, R(h_t),$$

where $R(h_t)$ is the total $R$-reward over the history $h_t$. Problems can occur if $P_L$ is riggable (this used to be called "biasable", but that term was over-overloaded), or influenceable.
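To make the formula concrete, here is a minimal toy sketch in Python, assuming everything is finite and represented as plain dictionaries; the histories, reward functions, and numbers are all invented for illustration and are not from the post. It also hints at why riggability matters: a policy that steers the learning process towards the easier-to-satisfy reward scores higher expected value than one that leaves the learning alone.

```python
# Toy sketch of the expected-value formula above (illustrative names only).

# R(h): total reward of each candidate reward function over each history.
rewards = {
    "R_easy": {"ask_human": 1.0, "press_button": 1.0},
    "R_hard": {"ask_human": 0.2, "press_button": 0.2},
}

# P_L(R | h): the learning process, a distribution over reward functions
# conditional on the history.  Its dependence on the history is what can
# make it riggable.
p_learn = {
    "ask_human":    {"R_easy": 0.5, "R_hard": 0.5},
    "press_button": {"R_easy": 1.0, "R_hard": 0.0},
}

def expected_value(p_history):
    """sum_h P(h | pi) * sum_R P_L(R | h) * R(h)."""
    return sum(
        p_h * sum(p_r * rewards[r][h] for r, p_r in p_learn[h].items())
        for h, p_h in p_history.items()
    )

# An honest policy leaves the learning process alone; a rigging policy
# steers towards the history that makes the easy reward look "correct".
honest  = {"ask_human": 1.0, "press_button": 0.0}
rigging = {"ask_human": 0.0, "press_button": 1.0}

print(expected_value(honest))   # 0.6
print(expected_value(rigging))  # 1.0
```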

There's an interesting subset of value learning problems, which could be termed "constrained optimisation with variable constraints" or "variable constraints optimisation". In that case, there is an overall reward $R_0$, and every $R_c \in \mathcal{R}$ is the reward $R_0$ subject to constraints $c$. This can be modelled as having $R_c(h_t)$ being $R_0(h_t)$ (if the constraints $c$ are met) and $0$ (if they are not).

Then if we define $I_c(h_t)$ to be $1$ if the constraints $c$ are met on $h_t$ and $0$ otherwise, and let $P_L(\cdot \mid h_t)$ be a distribution over $\mathcal{C}$, the set of constraints, the equation changes to:

$$\sum_{h_t \in \mathcal{H}_t} P(h_t \mid \pi) \sum_{c \in \mathcal{C}} P_L(c \mid h_t)\, I_c(h_t)\, R_0(h_t).$$

If $P_L$ is riggable or influenceable, similar sorts of problems occur.
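The same toy sketch can be recast for the variable-constraints case, again with all names and numbers invented for illustration (loosely modelled on the paper-writing example in the list below): the overall reward $R_0$ is fixed, but the agent's choice of history influences which constraints the learning process ends up endorsing.

```python
# Toy sketch of the variable-constraints equation (illustrative names only).

# R_0(h): the overall reward (say, chance of acceptance) over each history.
overall_reward = {"rig_then_sloppy": 1.0, "honest_thorough": 0.6}

# I_c(h): whether constraints c are met on history h.
constraint_met = {
    "c_strict":  {"rig_then_sloppy": 0, "honest_thorough": 1},
    "c_lenient": {"rig_then_sloppy": 1, "honest_thorough": 1},
}

# P_L(c | h): which constraints the learning process ends up endorsing,
# conditional on the history (e.g. on which automated checking process the
# agent chose to set up).
p_learn = {
    "rig_then_sloppy": {"c_lenient": 1.0, "c_strict": 0.0},
    "honest_thorough": {"c_strict": 1.0, "c_lenient": 0.0},
}

def expected_value(p_history):
    """sum_h P(h | pi) * sum_c P_L(c | h) * I_c(h) * R_0(h)."""
    return sum(
        p_h * sum(
            p_c * constraint_met[c][h] * overall_reward[h]
            for c, p_c in p_learn[h].items()
        )
        for h, p_h in p_history.items()
    )

# Rigging which constraints get endorsed beats satisfying the strict ones.
print(expected_value({"rig_then_sloppy": 1.0, "honest_thorough": 0.0}))  # 1.0
print(expected_value({"honest_thorough": 1.0, "rig_then_sloppy": 0.0}))  # 0.6

# For comparison: if the strict constraints had applied regardless, the
# sloppy history would have been worth 0.
print(constraint_met["c_strict"]["rig_then_sloppy"]
      * overall_reward["rig_then_sloppy"])  # 0.0
```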

Intuitive examples

Here I'll present some examples of reward function learning or variable constraints optimisation, and I'm asking readers to give their opinions as to which one seems the most intuitive, and the easiest to explain to outsiders. You're also welcome to suggest new examples if you think they work better.

  • Classical value learning: human declarations determine the correctness of a given reward $R$. The reward encodes what food the human prefers, and some foods are much easier to get than others.
  • As above, but the reward encodes whether a domestic robot should clean the house or cook a meal.
  • As above, but the reward encodes the totality of human values in all environments.
  • Variable constraint optimisation: the agent is writing an unoriginal academic paper (or a patent), and must maximise the chance it gets accepted. The paper must include a literature review (constraints), but the agent gets to choose the automated process that produces the literature review.
  • Variable constraint optimisation: p-hacking. The agent chooses which hypothesis to formulate. It already knows something about the data, and its reward is the number of citations the paper gets.
  • Variable constraint optimisation: board of directors. The CEO must maximise share price, but its constraint is that the policy it formulates must be approved by the board of directors.
  • Variable constraint optimisation: retail. A virtual assistant guides the purchases of a customer. It must maximise revenue for the seller, subject to the constraint that the customer gives the purchased product a four- or five-star review.

3 comments

I think the constraint-based problems are more intuitive. As someone who thinks about this regularly, I found the classical examples had an abstract, alignment-theoretic texture, while the constraint-based ones seemed more relatable to something I'd actually be doing on a daily basis.

Which constraint-based example to pick depends on the audience. If all your readers are familiar with the process of completing literature reviews, go for that one; otherwise, the CEO problem seems most natural.

Variable constraint optimisation: wedding planner. The agent is tasked with maximizing the satisfaction of the people getting married with the event, subject to the constraint that it fits within a given budget.

That sounds like classical constrained optimisation. Does the wedding planner have the power to increase the budget?