A putative new idea for AI control; index here.
This is a potential design for a safe-ish satisficer, based on the ideas presented here. The two key ideas are that a satisficer S(u) with utility u:
- Would not effectively aid M(u), a u-maximiser.
- Would not effectively resist M(-u), a u-minimiser.
So satisficers make poor allies and weak enemies.
The design, the designer and the verifier
Let M(u-v) be an agent that maximises u and minimises v. Let's assume that M(u-v) is currently unaware of what u and v are (though it knows that the upper bound of u is easy to approach). It can interact with the outside environment in one way only: by creating a satisficer S(). S() will then be fed u to become S(u).
However, the design for S(u) will first be sent to M(εu+v), for some unknown small ε. Note that M(εu+v) is essentially a v-maximiser, so it disagrees with M(u-v) on everything except the small shared u component. M(εu+v) is a (hypothetical) agent that will be (counterfactually) released at some unspecified later date. Currently, it will be fed the whole reasoning history of M(u-v), and will make a single decision only: whether S(u) is to be built or not.
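To make the structure concrete, here is a minimal toy sketch of the designer/verifier loop in Python. Everything in it is invented for illustration (the outcome dictionaries, the Monte Carlo `expected` helper, the specific ε), and it lets the designer simulate the verifier exactly, whereas in the proposal M(u-v) does not know u, v or ε. Read it only as a picture of the approval structure:

```python
def expected(utility, design, n=1000):
    """Monte Carlo estimate of E[utility] under the outcome distribution of `design`."""
    return sum(utility(design()) for _ in range(n)) / n

def verifier_vetoes(design, do_nothing, u, v, eps=0.01):
    """M(eps*u + v): veto iff building S(u) costs it utility, relative to the
    counterfactual world in which the design is never built (`do_nothing`)."""
    with_design = eps * expected(u, design) + expected(v, design)
    without = eps * expected(u, do_nothing) + expected(v, do_nothing)
    return with_design < without

def designer_picks(candidates, do_nothing, u, v):
    """M(u-v), anticipating the veto: among candidate designs the verifier would
    accept, propose the one that maximises E[u] - E[v]."""
    viable = [d for d in candidates if not verifier_vetoes(d, do_nothing, u, v)]
    return max(viable, key=lambda d: expected(u, d) - expected(v, d), default=None)

# Toy usage (deterministic outcomes, purely illustrative): u rewards pressing a
# button, v is a side-variable M(u-v) would like to push down but M(eps*u + v) protects.
u = lambda outcome: outcome["button"]
v = lambda outcome: outcome["side"]
do_nothing = lambda: {"button": 0.0, "side": 0.0}
candidates = [
    lambda: {"button": 1.0, "side": -2.0},   # presses the button, crushes v: gets vetoed
    lambda: {"button": 0.95, "side": 0.0},   # presses the button, leaves v alone
]
print(designer_picks(candidates, do_nothing, u, v) is candidates[1])  # True
```

The point of the last line is just that the design which survives is the one that achieves u without moving v.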
So it seems that S(u) must:
- Set u close to its maximal value (as this is "easy to approach").
- Not increase v to any significant extent (or else M(u-v) would not design it).
- Not decrease v to any significant extent (or else M(εu+v) would veto it).
- Since v is unknown to M(u-v) and resources are finite, this should serve as a general reduced impact requirement for S(u) (we may have to use something like a soft minimum across all v, rather than an expectation across all v, to avoid certain edge cases; see the sketch after this list).
- Since u is unknown to M(u-v), S() would serve as a general satisficing agent for any utility function whose upper bound is easy to approach (remember that we can take an arbitrary utility function and arbitrarily bound it at some number).
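As one (entirely non-canonical) way of cashing out the "soft minimum across all v" clause, here is a toy sketch: sample a finite set of guesses at v, score a design by how little each guess moves relative to the do-nothing baseline, and aggregate with a smooth minimum rather than an average, so that a single badly affected v cannot be washed out by the rest. The helpers and the temperature parameter are all stand-ins.

```python
import math

def expected(utility, design, n=1000):
    """Same toy Monte Carlo helper as in the sketch above."""
    return sum(utility(design()) for _ in range(n)) / n

def v_shift(design, do_nothing, v):
    """How far building the design moves E[v] away from the do-nothing baseline."""
    return abs(expected(v, design) - expected(v, do_nothing))

def soft_min(xs, temperature=0.1):
    """Smooth lower bound on min(xs); lower temperature -> closer to a hard minimum."""
    return -temperature * math.log(sum(math.exp(-x / temperature) for x in xs))

def reduced_impact_score(design, do_nothing, candidate_vs):
    """Higher is better: even the worst-affected candidate v barely moves.
    Using soft_min instead of a plain average stops one badly hit v from
    being averaged away by many untouched ones."""
    return soft_min([-v_shift(design, do_nothing, v) for v in candidate_vs])
```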
For the moment, this does seem like it would produce a successful satisficer...
My intended point with that example was to question what it means for v to be at 0, 1, or -1. If v is defined to be always non-negative (something like "estimate the volume of the future that is 'different' in some meaningful way"), then flipping the direction of v makes sense. But if v is some measure of how happy Bob is, then flipping the direction of v means that we're trying to find a plan that will satisfy both someone who likes Bob and someone who hates Bob. Is that best done by setting the happiness value near 0? If so, what level of Bob's happiness is 0? What if that level is worse than what Bob would have without any action on the agent's part?
Perhaps the solution there is to just say "yeah, we only care about things that are metrics (i.e. 0 is special and natural)," but I think that's unsatisfying because it only allows for negative externalities, and we might want to incorporate both positive and negative externalities into our reasoning.
0 is not the default; the default is the expected v, given that M(εu+v) is unleashed upon the world. That event will (counterfactually) happen, and neither M(εu+v) nor M(u-v) can change it. M(εu+v) will not allow an S(u) that costs it v-utility; given that, M(u-v) knows that it cannot reduce the expected v, so it will try to build S(u) so as to affect it as little as possible.
If you prefer, since the plans could be vetoed by someone who hates Bob, all Bob-helping plans will get vetoed. Therefore the agent who like...
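To put toy numbers on the baseline point (all values hypothetical): suppose v is Bob's happiness, and the counterfactual expected v, with M(εu+v) simply unleashed, is 3.

```python
baseline_v = 3.0  # expected v (Bob's happiness) if M(eps*u + v) is simply unleashed

# Expected v for a few candidate S(u) designs; assume they all achieve u about equally well.
candidates = {"harm_bob": 1.0, "leave_bob_alone": 3.0, "help_bob": 5.0}

# The verifier M(eps*u + v) vetoes anything that lowers expected v below the baseline...
approved = {name: ev for name, ev in candidates.items() if ev >= baseline_v}

# ...and the designer M(u - v), which dislikes v, picks the lowest surviving E[v].
print(min(approved, key=approved.get))  # -> leave_bob_alone
```

In this toy setup the surviving design is the one that leaves expected v at the counterfactual default, which is the reduced-impact behaviour described above.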