Approval-directed agents

paulfchristiano

Most concern about AI comes down to the scariness of goal-oriented behavior. A common response to such concerns is “why would we give an AI goals anyway?” I think there are good reasons to expect goal-oriented behavior, and I’ve been on that side of a lot of arguments. But I don’t think the issue is settled, and it might be possible to get better outcomes without goals. I flesh out one possible alternative here, based on the dictum "take the action I would like best" rather than "achieve the outcome I would like best."

(As an experiment I wrote the post on Medium, so that it is easier to provide sentence-level feedback, especially feedback on writing or low-level comments.)

This has great potential, thanks! But wouldn't Alfred be motivated to present to virtual Hugh whatever stimulus resulted in vH's selecting the highest-approval response, even if that means, e.g., hypnosis or brainwashing? I don't see how "turtles all the way down" can solve this, because every level can solve the problem for the level above but then finds the same problem on its own level.

You only have trouble if there is a goal-directed level beneath the lowest approval-directed level. The idea is to be approval-directed at the lowest levels where it makes sense (and below that you are using heuristics, algorithms, etc., in the same way that a goal-directed agent eventually bottoms out with useful heuristics or algorithms).
