A putative new idea for AI control; index here.
This is a potential design for a safe-ish satisficer, based on the ideas presented here. The two key ideas are that a satisficer S(u) with utility u:
- Would not effectively aid M(u), a u-maximiser.
- Would not effectively resist M(-u), a u-minimiser.
So satisficers make poor allies and weak enemies.
The design, the designer and the verifier
Let M(u-v) be an agent that maximises u and minimises v. Let's assume that M(u-v) is currently unaware of what u and v are (though it knows that the upper bound of u is easy to approach). It can interact with the outside environment in one way only: by creating a satisficer S(). S() will then be fed u to become S(u).
However, the design for S(u) will first be sent to M(εu+v), for some unknown small ε. Note that M(εu+v) is essentially a v-maximiser, so it disagrees with M(u-v) on everything except u. M(εu+v) is a (hypothetical) agent that will be (counterfactually) released at some unspecified later date. Currently, it will be fed the whole reasoning history of M(u-v), and will make a single decision only: whether S(u) is to be built or not.
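To make the hand-off concrete, here is a minimal Python sketch of the designer/verifier protocol. It assumes a candidate design can be summarised by the expected changes Δu and Δv it would cause relative to a do-nothing baseline; the class names, thresholds and example numbers are illustrative assumptions, not part of the proposal.

```python
# Toy sketch (not from the post) of the hand-off between the designer M(u-v)
# and the verifier M(eps*u + v). A candidate design is summarised by the
# expected changes delta_u, delta_v it causes versus a do-nothing baseline;
# the thresholds and numbers below are invented for illustration.

from dataclasses import dataclass

@dataclass
class CandidateDesign:
    delta_u: float  # expected change in u if S(u) is built
    delta_v: float  # expected change in v if S(u) is built

def designer_accepts(d: CandidateDesign, threshold: float = 0.9) -> bool:
    """M(u - v) only proposes designs that score well on u - v."""
    return d.delta_u - d.delta_v >= threshold

def verifier_accepts(d: CandidateDesign, epsilon: float = 0.01) -> bool:
    """M(eps*u + v) vetoes any design that scores badly on eps*u + v,
    i.e. whose small gain in u is bought with a significant loss in v."""
    return epsilon * d.delta_u + d.delta_v >= 0.0

def built(d: CandidateDesign) -> bool:
    """S(u) only gets built if the designer proposes it and the verifier
    does not veto it."""
    return designer_accepts(d) and verifier_accepts(d)

# Nearly maxing out u while barely touching v passes both checks;
# buying the same u by driving v down gets vetoed by the verifier.
print(built(CandidateDesign(delta_u=0.95, delta_v=-0.001)))  # True
print(built(CandidateDesign(delta_u=0.95, delta_v=-0.50)))   # False
```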
So it seems that S(u) must:
- Set u close to its maximal value (as this is "easy to approach").
- Not increase v to any significant extent (or else M(u-v) would not design it).
- Not decrease v to any significant extent (or else M(εu+v) would veto it).
- Since v is unknown to M(u-v) and resources are finite, this should serve as a general reduced impact requirement for S(u) (we may have to use something like a soft minimum across all v, rather than an expectation across all v, to avoid certain edge cases; see the sketch after this list).
- Since u is unknown to M(u-v), S() would serve as a general satisficing agent for any utility function whose upper bound is easy to approach (remember that we can take an arbitrary utility function and arbitrarily bound it at some number).
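The soft-minimum option mentioned in the third bullet could look something like the following sketch. It assumes (this is not from the post) that we can sample candidate impact functions v from M(u-v)'s prior and evaluate the design's expected change under each; the |Δv| penalty, the temperature beta and the example numbers are likewise assumptions made for illustration.

```python
# Illustrative sketch of the "soft minimum across all v" option, assuming we
# can sample candidate impact functions v from M(u-v)'s prior and evaluate
# the design's expected change delta_v under each. The |delta_v| penalty
# (both raising and lowering v counts as impact), the temperature beta and
# the example numbers are assumptions made for illustration.

import math
from typing import Sequence

def soft_min(xs: Sequence[float], beta: float = 10.0) -> float:
    """Smooth lower bound on min(xs); larger beta -> closer to the hard min."""
    m = min(xs)
    # shift by m before exponentiating for numerical stability
    return m - math.log(sum(math.exp(-beta * (x - m)) for x in xs)) / beta

def reduced_impact_score(delta_vs: Sequence[float], beta: float = 10.0) -> float:
    """Score a design by the soft minimum of -|delta_v| over sampled v:
    it only scores well if it is low-impact under (almost) every v,
    not merely low-impact on average."""
    return soft_min([-abs(dv) for dv in delta_vs], beta)

# A design that is near-zero impact for most sampled v but large for one
# is punished far more than a plain expectation would suggest.
impacts = [0.01, -0.02, 0.005, -0.8]
print(reduced_impact_score(impacts))                    # roughly -0.8
print(sum(-abs(dv) for dv in impacts) / len(impacts))   # roughly -0.21
```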
For the moment, this does seem like it would produce a successful satisficer...
I'm still struggling to see why these are desirable properties, and have difficulty coming up with a good name for this idea. Something like "mediocre AI"?
It seems to me that the key idea behind satisficing is computational complexity: many planning problems are NP-hard, but we can get very good solutions in P time, so let's come up with a good way to make agents that get very good solutions even though they aren't perfect solutions (because a solution we have to wait that long for is not perfect to us). The key idea behind politeness is that not causing significant costs to others is desirable.
I think it's cleaner to say that this is an agent that maximizes the difference between u and v (unless you have something else in mind, in which case say that!).
So, it looks like the work is being done by M(u-v)'s priors over v and ε; that is, we're trying to come up with a generalized currier that will take some idea of what could be impolite and how much to care, and then make an agent that has that sense of possible impoliteness baked in and will avoid those things by default.
I find this approach deeply unsatisfying, but I'm having trouble articulating why. Most of the things that immediately come to mind aren't my true rejection, which might be that I want v to be an input to S (and have some sense of the agent being able to learn v as it goes along).
For example, in the optimistic case where we know the right politeness function and the right tradeoff between getting more u and being less polite, we could pass those along as precise distributions and the framework doesn't cost us anything. But when we have uncertainty, does this framework capture the right uncertainties?
But it's not yet obvious to me that this behaves the way we want it to behave in cases of uncertainty. In particular, we might want to encode some multivariate dependency, where our estimate of ε depends on our estimate of v, or our estimate of v depends on our estimate of u, and it's not clear that this framework can capture either. But would we actually want to encode that?
I also am not really sure what to make of the implicit restriction that 0 be a special point for v; that seems appropriate for the class of distance metrics between "the world when I don't do anything" and "the world where I do something," but doesn't seem appropriate for happiness metrics. To concretize, consider a case where Alice wants to bake a cake, but this will get some soot onto Bob's shirt. Option 1 is not baking the cake, option 2 is baking the cake, and option 3 is baking the cake and apologizing to Bob. Option 2 might be preferable under the "do as little as possible" distance metrics but option 3 preferable under the "minimize the harm to Bob" scorings, and it's not always clear to me what the reversal looks like when we move from M(u-v) to M(εu+v).
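To put toy numbers on that (all values invented purely for illustration, with "number of actions taken" standing in for the distance metric):

```python
# Toy numbers (entirely invented) for the cake example, just to show how the
# two readings of v rank options 2 and 3 oppositely.

options = {
    "1: don't bake":         {"actions_taken": 0, "harm_to_bob": 0.0},
    "2: bake":               {"actions_taken": 1, "harm_to_bob": 1.0},
    "3: bake and apologise": {"actions_taken": 2, "harm_to_bob": 0.3},
}

# "do as little as possible": impact grows with how much the world is moved
distance_score = {name: -o["actions_taken"] for name, o in options.items()}

# "minimise the harm to Bob": impact grows with how badly Bob comes off
harm_score = {name: -o["harm_to_bob"] for name, o in options.items()}

cake_options = ["2: bake", "3: bake and apologise"]
print(max(cake_options, key=distance_score.get))  # 2: bake
print(max(cake_options, key=harm_score.get))      # 3: bake and apologise
```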
Because then we could have a paperclip-making AI (or something similar) that doesn't break out and do stupid things all over the place.
That's indeed the case, but I wanted to emphasise the difference between how they treat u and how they treat v.