A putative new idea for AI control; index here.
This is a potential design for a safe-ish satisficer, based on the ideas presented here. The two key ideas are that a satisficer S(u) with utility u:
- Would not effectively aid M(u), a u-maximiser.
- Would not effectively resist M(-u), a u-minimiser.
So satisficers make poor allies and weak enemies.
The design, the designer and the verifier
Let M(u-v) be an agent that maximises u and minimises v. Let's assume that M(u-v) is currently unaware as to what u and v are (though it knows that the upper bound of u is easy to approach). It can interact with the outside environment in one way only: by creating a satisficer S(). S() will then be fed u to become S(u).
However, the design for S(u) will first be sent to M(εu+v), for some unknown small ε. Note that M(εu+v) is a v-maximiser, so it disagrees with M(u-v) on everything except u. M(εu+v) is a (hypothetical) agent that will be (counterfactually) released at some unspecified later date. Currently, it will be fed the whole reasoning history of M(u-v), and will make a single decision only: whether S(u) is to be built or not.
So it seems that S(u) must:
- Set u close to its maximal value (as this is "easy to approach").
- Not increase v to any significant extent (or else M(u-v) would not design it).
- Not decrease v to any significant extent (or else M(εu+v) would veto it).
- Since v is unknown to M(u-v) and resources are finite, this should serve as a general reduced impact requirement for S(u) (we may have to use something like a soft minimum across all v, rather than an expectation across all v, to avoid certain edge cases).
- Since u is unknown to M(u-v), S() would serve as a general satisficing agent for any utility functions whose upper bounds are easy to approach (remember that we can take an arbitrary utility function and arbitrarily bound it at some number).
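The first three requirements can be sketched as a two-stage filter. This is only an illustration under simplifying assumptions: a single known v stands in for the unknown one, both agents compare a design against the default of building nothing, and the names `Outcome`, `designer_prefers`, `verifier_approves` and `is_built` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Effect of building a candidate S(u), relative to building nothing."""
    delta_u: float  # change in u
    delta_v: float  # change in v


def designer_prefers(o: Outcome) -> bool:
    # M(u-v) only proposes designs that raise u - v above the default.
    return o.delta_u - o.delta_v > 0


def verifier_approves(o: Outcome, eps: float) -> bool:
    # M(eps*u + v) vetoes designs that lower eps*u + v below the default.
    return eps * o.delta_u + o.delta_v > 0


def is_built(o: Outcome, eps: float) -> bool:
    # S(u) exists only if it is proposed and then not vetoed.
    return designer_prefers(o) and verifier_approves(o, eps)


if __name__ == "__main__":
    eps = 0.01  # assumed value; the post leaves eps unknown but small
    for o in (Outcome(1.0, 0.0), Outcome(1.0, 2.0), Outcome(1.0, -0.5)):
        print(o, is_built(o, eps))
```

With these toy numbers, a design that raises u and leaves v alone is built, one that raises v by more than it raises u is never proposed, and one that lowers v noticeably is vetoed.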
For the moment, this does seem like it would produce a successful satisficer...
Now that I have time to actually work through the math, I agree that 0 is not a special point for v; it's a special point for Δv (which seems reasonable).
But I'm not sure what the second M is doing, now. An S design that satisfies M(u-v) more than default is one where Δ(u-v)>0, or Δu>Δv (1). An S design that satisfies M(εu+v) more than default is one where Δ(εu+v)>0, or εΔu>-Δv (2). If you look at the 2d graph of Δu and Δv, the point of constraint (1) is to block off the southeastern half of the graph (cases where our negative externality outweighs our improvement), and the point of constraint (2) is to block off the "southwestern" half (rotated by ε).
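Written out, the two constraints and their combination look like this. It is a sketch that assumes Δ means change relative to the default of not building S, and that ε>0:

```latex
\begin{align}
\Delta(u - v) > 0 &\iff \Delta u > \Delta v \tag{1}\\
\Delta(\varepsilon u + v) > 0 &\iff \varepsilon\,\Delta u > -\Delta v
  \iff \Delta v > -\varepsilon\,\Delta u \tag{2}\\
\text{(1) and (2)} &\iff -\varepsilon\,\Delta u < \Delta v < \Delta u \notag\\
\text{interval non-empty} &\iff (1 + \varepsilon)\,\Delta u > 0 \iff \Delta u > 0 \notag
\end{align}
```

So, on these assumptions, any design that passes both agents must strictly improve u. For small ε the permitted decrease in v shrinks toward zero, while the permitted increase in v is capped by Δu.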
Constraint 1 seems reasonable--don't do more negative externalities than you accrue in benefits. Constraint 2 seems weird, because the cases it cuts off are the cases where S does more positive externalities than it loses in benefits. This is sort of an anti-first law, in that the agent will choose inaction or pursuing its duties instead of helping out others--but only when it helps too much! A mail delivery robot might be willing to deliver one less piece of mail in order to prevent one blind pedestrian from walking in front of a truck, but not be willing to deliver one less piece of mail in order to prevent two blind pedestrians from walking in front of a truck, because that would have counterfactually caused it to not be made in the first place (and thus goes against its inborn moral sense).
[Edit]I suppose the underlying principle here might be "timidity"--the agent doesn't trust itself to get right any plan which has a larger impact than some threshold, and so has a tightly bounded utility function in some way. But this doesn't look like the right way to bound it.[/Edit]
(If we have defined all possible vs such that Δv≥0, then constraint 2 is never active, because we're only considering the right half of that graph.)
Suppose among the human population there lives one morally relevant person (or, if you prefer, 36 of them). The AI knows that it is very important that they not be disturbed--but not who they are.
Contrast this to the case where the AI thinks that all humans are morally relevant, with an importance of not disturbing a person that's about 1/N of the importance assigned in the previous case. What's the difference between the two cases? To first order, it looks like nothing; to second order, it looks like the first case might have some bizarreness about summing up disturbances across people that the second case won't have.
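A toy calculation of the two cases, as a sketch only. It assumes a uniform prior over who the relevant person is, a cost that is linear in disturbance, and a hard worst case standing in for the soft minimum mentioned in the post; all function names are hypothetical.

```python
def expected_cost_one_hidden(disturbance, weight):
    """One unknown person matters with full weight; uniform prior over who."""
    n = len(disturbance)
    return sum(weight * d for d in disturbance) / n


def cost_everyone_relevant(disturbance, weight):
    """Everyone matters, each with 1/N of the weight."""
    n = len(disturbance)
    return sum((weight / n) * d for d in disturbance)


def worst_case_one_hidden(disturbance, weight):
    """Cost if the hidden person turns out to be the most disturbed one."""
    return weight * max(disturbance)


if __name__ == "__main__":
    disturbance = [1.0, 0.0, 0.0, 0.0]  # one of four people is disturbed
    weight = 4.0
    print(expected_cost_one_hidden(disturbance, weight))
    print(cost_everyone_relevant(disturbance, weight))
    print(worst_case_one_hidden(disturbance, weight))
```

In expectation the two cases give the same cost for any disturbance pattern, which matches the "to first order, nothing" reading. They come apart only once something other than an expectation is taken over the hypotheses, as the worst case shows.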
That is, I don't think we can just say "the agent is ignorant of v, so it does the right thing by default." That sounds like trying to extract useful work out of ignorance! The agent's prior over v--that is, what sort of externalities are worth preventing--will determine what prohibitions or reservations are baked into S, and it seems really strange to me to trust that the uncertainty will take care of it. If we don't have the right reference class to begin with, being uncertain will include lots of things from the wrong reference class, and S will make crazy tradeoffs. But if we have the right reference class, we might as well go with it.
This is reminding me of Jainism, actually--I had just been focusing on building a robot with ahimsa, but I think also trying to incorporate anekantavada would lead to a suggestion like this one.
I am trying to extract work from ignorance. The same way that I did with "resource gathering". An AI that is ignorant of its utility will try and gather power and resources, and preserve flexibility - that's a kind of behaviour you can get mainly from an ignorant AI.