Approval-directed agents

paulfchristiano

Most concern about AI comes down to the scariness of goal-oriented behavior. A common response to such concerns is “why would we give an AI goals anyway?” I think there are good reasons to expect goal-oriented behavior, and I’ve been on that side of a lot of arguments. But I don’t think the issue is settled, and it might be possible to get better outcomes without goals. I flesh out one possible alternative here, based on the dictum "take the action I would like best" rather than "achieve the outcome I would like best."
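
To make the contrast between the two dictums concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names, the toy actions, and the numeric scores are hypothetical stand-ins for the overseer's judgments, not anything specified in the post.

```python
from typing import Callable, Sequence

def approval_directed(actions: Sequence[str],
                      approval: Callable[[str], float]) -> str:
    """'Take the action I would like best': rate the actions themselves."""
    return max(actions, key=approval)

def goal_directed(actions: Sequence[str],
                  predict_outcome: Callable[[str], str],
                  outcome_value: Callable[[str], float]) -> str:
    """'Achieve the outcome I would like best': rate predicted outcomes."""
    return max(actions, key=lambda a: outcome_value(predict_outcome(a)))

# Hypothetical toy data: the overseer prefers being asked first,
# even though acting unilaterally leads to a slightly better-rated outcome.
actions = ["ask first", "act unilaterally"]
approval_scores = {"ask first": 0.9, "act unilaterally": 0.2}
predicted = {"ask first": "task done slowly",
             "act unilaterally": "task done quickly"}
outcome_scores = {"task done slowly": 0.6, "task done quickly": 0.8}

approved_choice = approval_directed(actions, lambda a: approval_scores[a])
goal_choice = goal_directed(actions,
                            lambda a: predicted[a],
                            lambda o: outcome_scores[o])

print(approved_choice)  # ask first
print(goal_choice)      # act unilaterally
```

The two selectors differ only in what gets scored. In this toy setup they disagree, which is the kind of case where the distinction matters.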

(As an experiment I wrote the post on Medium, so that it is easier to provide sentence-level feedback, especially feedback on writing or low-level comments.)

Note that the agent is never faced with a gamble over actions: it can choose to deterministically take whatever action it desires. So while VNM gives you a utility function over actions, it is probably uninteresting.
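
One way to see why the utility function is uninteresting here: when the choice is deterministic, only the ordering of actions matters, so any strictly increasing transform of the utilities picks the same action. The sketch below is a toy illustration; the action names and utility values are made up.

```python
import math

def best_action(utility, actions):
    """Deterministically pick the action with the highest utility."""
    return max(actions, key=utility)

# Hypothetical utilities over three actions.
actions = ["a", "b", "c"]
u = {"a": 0.1, "b": 2.0, "c": 0.7}

choice_raw = best_action(lambda a: u[a], actions)
# exp is strictly increasing, so it preserves the ordering of the utilities.
choice_transformed = best_action(lambda a: math.exp(u[a]), actions)

print(choice_raw, choice_transformed)  # b b
```

The cardinal information in a VNM utility function only does work when comparing gambles, which never arise in this setting.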

The broader point, that we are learning some transform of preferences rather than learning preferences directly, seems true. I think this is an issue that people in AI have had some (limited) contact with. Some algorithms learn "what a human would do" (e.g. learning to play Go by predicting human Go moves and doing what you think a human would do). Other algorithms (inverse reinforcement learning) learn what values explain what a human would do, and then pursue those. I think the conventional view is that inverse reinforcement learning is harder, but can yield more robust policies that generalize better. Our situation seems to be somewhat different, and it might be interesting to understand why and to explore the comparison more thoroughly.
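
The difference between the two families can be sketched with a toy example. This is not a real inverse reinforcement learning algorithm: the one-dimensional world, the two candidate goals, and all function names are assumptions chosen to keep the sketch short.

```python
from collections import Counter

# Hypothetical demonstrations: (position, action) pairs on a number line.
demos = [(0, "right"), (1, "right"), (2, "right")]

def imitate(state, demos):
    """Predict the human's action: most common action seen in this state."""
    counts = Counter(a for s, a in demos if s == state)
    return counts.most_common(1)[0][0] if counts else None

def optimal_action(state, goal):
    """Action that moves toward the goal position."""
    return "right" if goal > state else "left"

def infer_goal(demos, candidate_goals):
    """Pick the candidate goal that explains the most demonstrated actions."""
    return max(candidate_goals,
               key=lambda g: sum(optimal_action(s, g) == a for s, a in demos))

inferred_goal = infer_goal(demos, [-5, 5])

# A state never seen in the demonstrations.
imitation_choice = imitate(10, demos)
value_choice = optimal_action(10, inferred_goal)

print(inferred_goal)     # 5
print(imitation_choice)  # None
print(value_choice)      # left
```

The imitator has nothing to say about the unseen state, while the value-inferring agent extrapolates from the goal it attributed to the human. That is the sense in which inferred values might generalize better, at the cost of a harder inference problem.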
