paulfchristiano comments on Approval-directed agents - Less Wrong Discussion

9 Post author: paulfchristiano 12 December 2014 10:38PM

Comment author: paulfchristiano 15 December 2014 05:41:08AM  2 points

I wrote a follow-up partly addressing the issue of actions vs. outcomes. (Or at least, covering one technical issue I omitted from the original post for want of space.)

I agree that Hugh must reason about how well different actions satisfy Hugh's goals, and the AI must reason about (or make implicit generalizations about) these judgments. Where am I moving the value-complexity problem? The point was to move it into the AI's predictions about what actions Hugh would approve of.
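The selection rule being described, choosing actions by predicted approval rather than by predicted outcomes, can be sketched in a few lines. This is a toy illustration, not the post's implementation; `predicted_approval` and its values are entirely hypothetical stand-ins for a learned model of Hugh's judgments.

```python
def predicted_approval(action: str) -> float:
    """Stand-in for the AI's model of how Hugh would rate this action.

    In the actual proposal this would be a learned predictor of the
    overseer's judgment; here it is a toy lookup so the sketch runs.
    """
    toy_model = {
        "answer honestly": 0.9,
        "deceive Hugh": 0.1,
        "resist correction": 0.0,
    }
    return toy_model.get(action, 0.5)


def choose_action(available_actions):
    # The agent optimizes predicted approval of the action itself,
    # not the expected value of the action's downstream outcomes.
    return max(available_actions, key=predicted_approval)


print(choose_action(["answer honestly", "deceive Hugh", "resist correction"]))
```

The key design point the sketch illustrates: the complexity of Hugh's values never appears as an explicit objective; it shows up only inside the predictor of Hugh's approval.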

What part of the argument in particular do you think I am being imprecise about? There are particular failure modes, like "deceiving Hugh" or especially "resisting correction," which I would expect this procedure to avoid. I see no reason why the system would resist correction, for example. I don't see how this is due to confusion about outcomes vs. actions.