Most concern about AI comes down to the scariness of goal-oriented behavior. A common response to such concerns is “why would we give an AI goals anyway?” I think there are good reasons to expect goal-oriented behavior, and I’ve been on that side of a lot of arguments. But I don’t think the issue is settled, and it might be possible to get better outcomes without giving an AI goals at all. I flesh out one possible alternative here, based on the dictum "take the action I would like best" rather than "achieve the outcome I would like best."
(As an experiment I wrote the post on Medium, so that it is easier to provide sentence-level feedback, especially feedback on the writing or other low-level comments.)
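To make the contrast concrete, here is a minimal sketch of the two choice rules (the names `world_model`, `approval_model`, and so on are placeholders of my own, not anything defined in the post):

```python
# Illustrative sketch only: the function and argument names are placeholders.

def goal_directed_step(world_model, value_of_outcome, actions, state):
    # "Achieve the outcome I would like best": score each action by the
    # predicted value of the outcome it is expected to lead to.
    return max(actions, key=lambda a: value_of_outcome(world_model.predict(state, a)))

def approval_directed_step(approval_model, actions, state):
    # "Take the action I would like best": score each action directly by
    # the overseer's estimated approval of taking that action now.
    return max(actions, key=lambda a: approval_model.estimate(state, a))
```

The difference is where the evaluation happens: the first agent plans toward highly valued outcomes, while the second never evaluates outcomes at all and simply takes whichever single action it expects the overseer to rate most highly.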
In the "Learning from examples" case, Arthur looks a lot like AIXI with a time horizon of 1 (i.e., one that acts to maximize just the expected next reward), and I don't understand why you say "But unlike AIXI, Arthur will make no effort to manipulate these judgments." For example, it seems like Arthur could learn a model in which approval[T](a) = 1 if a is an action which results in taking over the approval input terminal and giving itself maximum approval.
It seems like AIXI with a time horizon of 1 is a very different beast from AIXI with a longer time horizon. The big difference is that short-sighted AIXI will try to take over (in the interest of giving itself reward) only if the takeover can succeed within a single time step.
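Schematically (a simplification of the full AIXI definition, which takes an expectimax over future actions and a Solomonoff mixture over environments): an AIXI-like agent with horizon $m$ picks

$$a_t \;=\; \arg\max_{a_t}\ \mathbb{E}\!\left[\,\sum_{k=t}^{t+m-1} r_k \,\middle|\, h_{<t},\, a_t\right],$$

and with $m = 1$ only $\mathbb{E}[r_t \mid h_{<t}, a_t]$ matters, so a takeover plan whose payoff only arrives at later steps earns no credit under the objective; takeover is favored only when the whole plan can be completed and rewarded within the current step.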
I agree that AIXI with a time horizon of 1 still has some undesired behaviors, and those same behaviors afflict the learning-from-examples approval-directed agent as well.
These problems are particularly troubling if it is possible to retroactively define rewards. In the worst case, Arthur may predict...