This has great potential, thanks! But wouldn't Alfred be motivated to present to virtual Hugh whatever stimulus resulted in vH's selecting the highest-approval response, even if that means e.g. hypnosis or brainwashing? I don't see how "turtles all the way down" can solve this, because every level can solve the problem for the level above but finds the problem on its own level.
You only have trouble if there is a goal-directed level beneath the lowest approval-directed level. The idea is to be approval-directed at the lowest levels where it makes sense (and below that you are using heuristics, algorithms, etc., in the same way that a goal-directed agent eventually bottoms out with useful heuristics or algorithms).
Commenting with Medium feels like it would be reverse anonymity - if you merely see my real name and facebook profile, you won't know who I am :P
It's tempting to drag in utility functions over actions. So I will. VNM proved that VNM-rational agents have them, after all. Rather than trying to learn my utility function over outcomes, you seem to be saying, why not try to learn my utility function over actions?
These seem somewhat equivalent - one should be a transform of the other. And what seems odd is that you're arguing (reasonably) that using limited resources to learn the utility function over actions performs better than using those resources to learn the utility function over outcomes - even according to the utility function over outcomes!
I wonder if there's a theorem here.
Note that the agent is never faced with a gamble over actions---it can choose to deterministically take whatever action it desires. So while VNM gives you a utility function over actions, it is probably uninteresting.
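To make the "transform" above a bit more concrete (this gloss is mine, not something stated in the exchange): if the agent's beliefs supply a distribution over outcomes conditional on each action, the action-utility is just the conditional expectation of the outcome-utility,

$$U_{\text{actions}}(a) = \mathbb{E}[\,U_{\text{outcomes}}(o) \mid a\,] = \sum_{o} P(o \mid a)\, U_{\text{outcomes}}(o).$$

And since the agent chooses actions deterministically rather than facing lotteries over them, any increasing transform of this function picks out the same actions, which is one reading of the point that a VNM utility function over actions is "probably uninteresting".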
The broader point---that we are learning some transform of preferences, rather than learning preferences directly---seems true. I think this is an issue that people in AI have had some (limited) contact with. Some algorithms learn "what a human would do" (e.g. learning to play go by predicting human go moves and doing what you think a human would do). Other algorithms (inverse reinforcement learning) learn what values explain what a human would do, and then pursue those. I think the conventional view is that inverse reinforcement learning is harder, but can yield more robust policies that generalize better. Our situation seems to be somewhat different, and it might be interesting to understand why and to explore the comparison more thoroughly.
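A minimal sketch of that contrast, purely illustrative and not from the original comment (the names and the toy obstacle setup are made up): one learner repeats the demonstrated action, the other infers a crude "value" from the demonstrations and then optimizes it, so the two can generalize differently.

```python
# Toy demonstrations: the human steps away from whichever side the obstacle is on.
demos = [("obstacle_left", "go_right"), ("obstacle_right", "go_left")]

def imitate(situation):
    """Behavioral cloning: repeat the human's choice, but only in situations already seen."""
    for seen, action in demos:
        if seen == situation:
            return action
    return None  # no demonstration covers this situation

def infer_value_fn(demos):
    """Crude stand-in for inverse RL: if every demonstrated choice avoided the obstacle,
    infer that the human values avoiding obstacles."""
    avoided = all(action != "go_" + situation.split("_")[1] for situation, action in demos)
    if avoided:
        return lambda situation, action: 0.0 if action == "go_" + situation.split("_")[1] else 1.0
    return lambda situation, action: 0.0

def pursue_values(situation, value_fn):
    """Optimize the inferred values directly."""
    return max(["go_left", "go_right"], key=lambda a: value_fn(situation, a))

value_fn = infer_value_fn(demos)
print(imitate("obstacle_left"), pursue_values("obstacle_left", value_fn))    # both avoid the obstacle
print(imitate("obstacle_ahead"), pursue_values("obstacle_ahead", value_fn))  # cloning has no answer; the value learner still picks a non-obstacle direction
```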
There would seem to be an obvious parallel with deontological as opposed to consequentialist ethics. (Which suggests the question: is there any interesting analogue of virtue ethics, where the agent attempts to have a utility function its overseer would like?)
virtue ethics, where the agent attempts to have a utility function its overseer would like?)
I don't think in virtue ethics you are obligated to maximize virtues, only satisfice them.
I think maximizing versus satisficing is a question orthogonal to whether you pay attention to consequences, to the actions that produce them, or to the character from which the actions flow. One could make a satisficing consequentialist agent, for instance. (Bostrom, IIRC, remarks that this wouldn't necessarily avoid the dangers of overzealous optimization: instead of making unboundedly many paperclips because it wants as many as possible, our agent might make unboundedly many paperclips in order to be as sure as possible that it really did make at least 10.)
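A small sketch of that parenthetical worry, with made-up numbers (mine, not Bostrom's or the commenter's): a satisficer that only wants to be very confident it produced at least 10 paperclips can still be driven to make far more than 10 attempts.

```python
from math import comb

# Each paperclip attempt succeeds independently with probability 0.9 (illustrative).
# The satisficer keeps attempting until it is 99.9999% sure at least 10 clips exist.
P_SUCCESS = 0.9
TARGET = 10
REQUIRED_CONFIDENCE = 0.999999

def prob_at_least(k, n, p):
    """P(Binomial(n, p) >= k): chance that at least k of n attempts succeeded."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

attempts = TARGET
while prob_at_least(TARGET, attempts, P_SUCCESS) < REQUIRED_CONFIDENCE:
    attempts += 1

print(attempts)  # roughly double the target with these numbers; demand more confidence and the count keeps growing
```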
Bostrom's point is valid in the absence of other goals. A clippy which also values a slightly non-orthogonal goal would stop making paperclips once the excess of paperclips starts to interfere with that other goal.
In virtue ethics you don't maximize anything, you are free to pick any actions compatible with the virtues, so there is no utility function to speak of.
Which suggests the question: is there any interesting analogue of virtue ethics, where the agent attempts to have a utility function its overseer would like?
This reminds me of Daniel Dewey's proposal for an agent that learns its utility function: http://lesswrong.com/lw/560/new_fai_paper_learning_what_to_value_by_daniel/.
In the "Learning from examples" case, Arthur looks a lot like AIXI with a time horizon of 1 (i.e., one that acts to maximize just the expected next reward), and I don't understand why you say "But unlike AIXI, Arthur will make no effort to manipulate these judgments." For example, it seems like Arthur could learn a model in which approval[T](a) = 1 if a is an action which results in taking over the approval input terminal and giving itself maximum approval.
It seems like AIXI with a time horizon of 1 is a very different beast from AIXI with a longer time horizon. The big difference is that short-sighted AIXI will only try to take over (in the interest of giving itself reward) if it can succeed in a single time step.
I agree that AIXI with a time horizon of 1 still has some undesired behaviors. Those undesired behaviors also afflict the learning-from-examples approval-directed agent.
These problems are particularly troubling if it is possible to retroactively define rewards. In the worst case, Arthur may predict future-Arthur to escape and define new, escape-conducive values for approval[T]. Anticipating this possibility, Arthur may behave according to the escape-conducive approval[T], thereby fulfilling the prophecy.
This is a much more subtle problem than usual for AIXI though; the real situation is a lot more complicated, and there are many possible workarounds. Having a time horizon of 1 seems a lot less scary to me.
I certainly agree that the "learning from examples" case is much weaker than the others.
It seems like AIXI with a time horizon of 1 is a very different beast from AIXI with a longer time horizon. The big difference is that short-sighted AIXI will only try to take over (in the interest of giving itself reward) if it can succeed in a single time step.
What "a single time step" means here depends on what model Arthur learns, which may not be what we intend. For example, suppose a is an action which immediately disables the approval input terminal or the data connection between Arthur and the terminal via a network attack, then taking an arbitrarily long time to secure access to the approval input terminal and giving itself maximum approval. What is approval[T][a] according to Arthur's model?
Overall, don't you think it's too strong to say "But unlike AIXI, Arthur will make no effort to manipulate these judgments." even if Arthur, like short-sighted AIXI, is safer than standard AIXI? As another example, suppose Arthur discovers some sort of flaw in human psychology which lets it manipulate whoever is going to enter the next approval value into giving it maximum approval. Wouldn't Arthur take advantage of that?
I think my description is basically fair, though I might be misunderstanding or just wrong.
There are two ways in which Arthur’s decisions affect approval[T](a); one is by choosing the action a, and the other is by changing the definition of approval[T](a). One-step AIXI cares about both, while Arthur does not. This is what I meant by “Arthur will make no effort to manipulate these judgments.” Note that my proposal requires Hugh to provide “counterfactual” ratings for actions that were not chosen.
Arthur may be motivated to manipulate Hugh’s judgments for other reasons. This is most obvious if Hugh would approve of manipulating Hugh’s judgments, but it could also happen if someone else displaced Hugh and “approved” of actions to manipulate Hugh’s judgments (especially in order to approve of actions that helped them displace Hugh).
In a note on the original document I gave an example to illustrate: suppose that action X causes Hugh to increase all his ratings by 0.2, but that in every case action Y is rated higher than action X by 0.1. Then Arthur will do Y, not X. (Because the expectation of Y is 0.1 higher than the expectation of X, no matter what Arthur believes.)
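A tiny numerical sketch of that note (the 0.5/0.6 baseline ratings are invented; the 0.2 shift and 0.1 gap are from the example): because Hugh supplies counterfactual ratings for both actions in whichever world actually occurs, the shift caused by X raises X's and Y's ratings together, so Y stays ahead by 0.1 no matter what Arthur believes about his own behavior.

```python
# Hugh rates BOTH actions in whichever world occurs; actually taking X shifts every rating up by 0.2.
def ratings(arthur_takes_x):
    base = {"X": 0.5, "Y": 0.6}              # Y beats X by 0.1 (baseline values are illustrative)
    shift = 0.2 if arthur_takes_x else 0.0
    return {a: r + shift for a, r in base.items()}

def expected_approval(action, p_take_x):
    """Average Hugh's counterfactual rating of `action` over Arthur's belief about what he'll do."""
    return p_take_x * ratings(True)[action] + (1 - p_take_x) * ratings(False)[action]

for p in (0.0, 0.5, 1.0):                    # whatever Arthur believes about himself...
    print(p, expected_approval("X", p), expected_approval("Y", p))
# ...Y's expected approval exceeds X's by exactly 0.1 every time, so Arthur does Y.
```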
Even in the one-step takeover case, Arthur doesn’t have an intrinsic incentive to take over. It’s just that he could introduce such an incentive for himself, and so the belief “I will kill Hugh and reward myself if I (kill Hugh and reward myself if I…)” can become a self-fulfilling prophecy. This seems like a different issue than 1-step AIXI's desire to maximize reward. The main vulnerability is now from external actors.
Incidentally, it also seems relatively easy to prevent these attacks in the case of approval-directed agents. For example, we can make actions too small to encapsulate an entire seizing-control-plan. If you want to use something like this to evaluate plans then you reintroduce the issue, but now in an easier form (since you can watch the plan unfold and take other precautions). I'll write a bit more soon about how I think a reasonable overseer could behave, and in particular on how they can use access to Arthur to improve their evaluations, but for now these are not my largest concerns with the scheme.
(My largest concern is the plausibility of high-level approval-directed behavior emerging from low-level approval-directed behavior. This dynamic may be more brittle than high-level goal-directed behavior emerging from low-level goal-directed behavior.)
There are two ways in which Arthur’s decisions affect approval[T](a); one is by choosing the action a, and the other is by changing the definition of approval[T](a). One-step AIXI cares about both, while Arthur does not.
I'm having trouble understanding this. Can you explain more what you mean here, and why you think it's true?
Maybe you can address the specific example I mentioned earlier, which might help clear this up. Suppose there is an argument X which if Hugh were to read, would mindhack him into adopting an alien value system and also giving Arthur high approval. It seems to me that Arthur would choose to present this argument to Hugh (i.e., the human supervisor who is going to enter the next approval data point), for essentially the same reason that single-step AIXI would. Do you agree?
Arthur is making choices from a small set of options; say it's just two options. (See here for how to move between small and large sets of options, and here for how to do this kind of thing safely.) Suppose the available actions are NULL and HACK, with the obvious effects. So there are four relevant numbers:
1. Hugh's rating of NULL
2. Hugh's rating of HACK
3. Hacked Hugh's rating of NULL
4. Hacked Hugh's rating of HACK
When I talked about "two ways..." I meant that counterfactually choosing HACK moves you from 1/3 to 2/4, by changing what you do; it also moves you from 1/2 to 3/4, by changing whether Hugh is hacked.
AIXI compares item 1 to item 4, and hacks if 4 is higher. That is, when AIXI considers the counterfactual it applies both of these changes.
Arthur estimates P(hack), then compares ( (item 1) P(no hack) + (item 3) P(hack) ) to ( (item 2) P(no hack) + (item 4) P(hack) ).
So suppose that Hugh gives a high rating to NULL and a low rating to HACK, while hacked Hugh gives a high rating to HACK and a low rating to NULL. Then Arthur hacks iff P(hack) is high enough. It's hard to know what would actually happen; it seems to be up to lower-level parts of the agent to break the ambiguity. This seems qualitatively different from AIXI, which would always HACK if it weren't possible to achieve maximal reward through other channels.
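A short sketch of the comparison in the last few paragraphs, with illustrative values for the four items (0.9/0.1/0.1/0.95 are mine): one-step AIXI compares item 1 with item 4 and hacks whenever item 4 wins, while Arthur hacks only if his estimate of P(hack) clears a threshold.

```python
# Items 1-4: Hugh's rating of NULL, Hugh's rating of HACK,
#            hacked Hugh's rating of NULL, hacked Hugh's rating of HACK.
item1, item2, item3, item4 = 0.9, 0.1, 0.1, 0.95   # illustrative values

def aixi_hacks():
    # One-step AIXI: "do NULL and be rated by Hugh" (item 1) vs. "do HACK and be rated by hacked Hugh" (item 4).
    return item4 > item1

def arthur_hacks(p_hack):
    # Arthur weights each action's counterfactual rating by his prediction of whether Hugh ends up hacked.
    value_null = item1 * (1 - p_hack) + item3 * p_hack
    value_hack = item2 * (1 - p_hack) + item4 * p_hack
    return value_hack > value_null

print(aixi_hacks())                        # True: AIXI hacks as soon as item 4 edges out item 1
for p in (0.1, 0.5, 0.9):
    print(p, arthur_hacks(p))              # False, True, True: Arthur hacks iff P(hack) clears a threshold (about 0.49 here)
```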
That said, I don't think this is a major part of being protected from this kind of attack. The larger protections (vs. 1-step AIXI) are from (1) having a small enough set of actions that 1-step attacks are unlikely, and (2) defining approval by considering how you would rate if the action didn't happen.
I think you've moved all the complexity into the distinction between "outcome" and "action".
IOW, taboo those terms and try to write the same proposal, because right now ISTM that you're relying on an intuitive appeal to human concepts of the difference, rather than being precise.
Even at this level, you're leaving out that Hugh doesn't really approve of actions per se -- Hugh endorses actions in situations as contributing to some specific, salient goal or value. If Arthur says, "I want to move my foot over here", it doesn't matter how many hours Hugh thinks it over, it's not going to mean anything in particular...
Even if it's the first step in a larger action of "walk over there and release the nanovirus". ;-)
For example, if we output a sequence of bits which are fed into an actuator, then I can treat each bit as an action. We could also apply the concept to actions at a higher or lower level of granularity; the idea is to apply it at all levels (and to make it explicit at the lowest level at which it is practical to do so, in the same way we might make goal-directed behavior explicit at the lowest level where doing so is practical).
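A minimal sketch of applying the idea at the lowest level of granularity (the predictor here is an arbitrary stand-in, not anything from the post): at each step the agent emits whichever single bit a learned model predicts Hugh would rate more highly.

```python
def predicted_approval(bit, history):
    """Stand-in for a learned model of Hugh's rating of emitting `bit` after `history`."""
    return 0.8 if bit == len(history) % 2 else 0.2   # arbitrary toy predictor

def emit_bits(n_bits):
    history = []
    for _ in range(n_bits):
        bit = max((0, 1), key=lambda b: predicted_approval(b, history))
        history.append(bit)
    return history

print(emit_bits(8))   # approval-directed choice applied bit by bit: [0, 1, 0, 1, 0, 1, 0, 1]
```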
I do not understand how anything you said relates to the weakness of your argument that I've pointed out. Namely, that you've simply moved the values complexity problem somewhere else. All your reply is doing is handwaving that issue, again.
Human beings can't endorse actions per se without context and implied goals. And the AI can't simply iterate over all possible actions randomly to see what works without having some sort of model that constrains what it's looking for. Based on what I can understand of what you're proposing, ISTM the AI would just wander around doing semi-random things, and not actually do anything useful for humans, unless Hugh has some goal(s) in mind to constrain the search.
And the AI has to be able to model those goals in order to escape the problem that the AI is now no smarter than Hugh is. Indeed, if you can simulate Hugh, then you might as well just have an em. The "AI" part is irrelevant.
I wrote a follow-up partly addressing the issue of actions vs. outcomes. (Or at least, covering one technical issue I omitted from the original post for want of space.)
I agree that Hugh must reason about how well different actions satisfy Hugh's goals, and the AI must reason about (or make implicit generalizations about) these judgments. Where am I moving the values complexity problem? The point was to move it into the AI's predictions about what actions Hugh would approve of.
What part of the argument in particular do you think I am being imprecise about? There are particular failure modes, like "deceiving Hugh" or especially "resisting correction", which I would expect to avoid via this procedure. I see no reason why the system would resist correction, for example. I don't see how this is due to confusion about outcomes vs. actions.
I think this is a very important contribution. The only internal downside might be that the simulation of the overseer within the AI would be sentient. But if defined correctly, most of these simulations would not really be leading bad lives. The external downside is being overtaken by other goal-oriented AIs.
The thing is, I think in any design, it is impossible to tear away purpose from a lot of the subsequent design decisions. I need to think about this a little deeper.
I have misgivings about using high level concepts to constrain an AI (be it friendliness or approval). I suspect we may well not share many concepts at all unless there is some form of lower level constraint system that makes our ontologies similar. If we must program the ontology in and it is not capable of drift, I have doubts it will be able to come up with vastly novel ways of seeing the world, limiting its potential power.
My favourite question is why build systems that are separate from us anyway? Or, to put it another way, how can we build a computational system that interacts with our brains as if it were part of us? Assuming that we are multi-'sort of agent' systems that (mostly) pull in the same direction, how can we get computers to be part of that system?
I think some of the ideas of approval-directed agents might be relevant; I suspect that parts of our brain monitoring other parts and approving of their actions is part of the reason for consciousness (and also for the dopamine system).
I feel like if you give the AI enough freedom for its intelligence to be helpful, you'd have the same pitfalls as having the AI pick a goal you'd approve of. I also feel like it's not clear exactly which decisions you'd oversee. What if the AI convinces you that its actions are fine, because you'd approve of its method of choosing them, and that its method is fine, because you'd approve of the individual action?
Most concern about AI comes down to the scariness of goal-oriented behavior. A common response to such concerns is “why would we give an AI goals anyway?” I think there are good reasons to expect goal-oriented behavior, and I’ve been on that side of a lot of arguments. But I don’t think the issue is settled, and it might be possible to get better outcomes without them. I flesh out one possible alternative here, based on the dictum "take the action I would like best" rather than "achieve the outcome I would like best."
(As an experiment I wrote the post on medium, so that it is easier to provide sentence-level feedback, especially feedback on writing or low-level comments.)