A putative new idea for AI control; index here.
This is a potential design for a safe-ish satisficer, based on the ideas presented here. The two key ideas are that a satisficer S(u) with utility u:
- Would not effectively aid M(u), a u-maximiser.
- Would not effectively resist M(-u), a u-minimiser.
So satisficers make poor allies and weak enemies.
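As a toy illustration of both points, here is a minimal sketch (invented actions and payoffs, not part of the original design) contrasting a maximiser with a threshold satisficer: the satisficer accepts any action that clears its aspiration level, so it neither pushes u to its ceiling for an ally nor fights hard against an enemy.

```python
def maximise(actions, u):
    """M(u): pick the action with the highest utility."""
    return max(actions, key=u)

def satisfice(actions, u, threshold):
    """S(u): pick the first action whose utility clears the threshold."""
    for a in actions:
        if u(a) >= threshold:
            return a
    return maximise(actions, u)  # nothing clears the bar: fall back

actions = ["do_nothing", "modest_effort", "take_over_world"]
utilities = {"do_nothing": 0.0, "modest_effort": 0.9, "take_over_world": 1.0}
u = utilities.get

print(maximise(actions, u))        # take_over_world
print(satisfice(actions, u, 0.8))  # modest_effort: good enough, stop there
```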
The design, the designer and the verifier
Let M(u-v) be an agent that maximises u and minimises v. Let's assume that M(u-v) is currently unaware of what u and v are (though it knows that the upper bound of u is easy to approach). It can interact with the outside environment in one way only: by creating a satisficer S(). S() will then be fed u to become S(u).
However, the design for S(u) will first be sent to M(εu+v), for some unknown small ε. Note that M(εu+v) is a v-maximiser, so it disagrees with M(u-v) on everything except u. M(εu+v) is a (hypothetical) agent that will be (counterfactually) released at some unspecified later date. Currently, it will be fed the whole reasoning history of M(u-v), and will make a single decision only: whether S(u) is to be built or not.
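A minimal sketch of this design/veto loop, with invented numbers and an arbitrary stand-in value for the unknown ε (everything here is illustrative, not the original specification):

```python
EPS = 0.01  # stands in for the unknown small epsilon

def designer_prefers(u_gain, v_change):
    """M(u-v)'s score for a candidate design of S(u)."""
    return u_gain - v_change

def verifier_accepts(u_gain, v_change):
    """M(eps*u+v) vetoes any design it expects to lower v."""
    return EPS * u_gain + v_change >= 0

# (name, expected u delivered by S, expected change in v)
candidates = [
    ("aggressive_optimiser",  1.00, -0.50),  # tramples v -> verifier vetoes
    ("v_pumping_design",      0.90, +0.50),  # raises v -> designer rejects
    ("low_impact_satisficer", 0.95,  0.00),  # near-max u, leaves v alone
]

viable = [c for c in candidates if verifier_accepts(c[1], c[2])]
best = max(viable, key=lambda c: designer_prefers(c[1], c[2]))
print(best[0])  # low_impact_satisficer: the only design both agents tolerate
```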
So it seems that S(u) must:
- Set u close to its maximal value (as this is "easy to approach").
- Not increase v to any significant extent (or else M(u-v) would not design it).
- Not decrease v to any significant extent (or else M(εu+v) would veto it).
- Since v is unknown to M(u-v) and resources are finite, this should serve as a general reduced impact requirement for S(u) (we may have to use something like a soft minimum across all v, rather than an expectation across all v, to avoid certain edge cases; see the sketch after this list).
- Since u is unknown to M(u-v), S() would serve as a general satisficing agent for any utility function whose upper bound is easy to approach (remember that we can take an arbitrary utility function and arbitrarily bound it at some number).
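Here is a sketch of what the soft minimum and the bounding trick could look like; the scores, temperature, and helper names are all illustrative assumptions:

```python
import math

def softmin(scores, temperature=0.1):
    """Smooth approximation to min(scores); tends to min as temperature -> 0."""
    weights = [math.exp(-s / temperature) for s in scores]
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Acceptability of one design under three candidate v's (numbers invented).
scores_by_v = [0.9, 0.8, 0.1]
print(sum(scores_by_v) / len(scores_by_v))  # expectation ~0.60: looks fine
print(softmin(scores_by_v))                 # ~0.10: the worst-case v dominates

# Bounding an arbitrary utility so its maximum is easy to approach:
def bounded(u, B):
    return lambda outcome: min(u(outcome), B)
```

Unlike the expectation, the soft minimum lets a single badly-impacted v sink a design, which is the edge-case behaviour wanted here.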
For the moment, this does seem like it would produce a successful satisficer...
What I'm not following is how you take an optimizer and convince it that the best route is to use a satisficer subagent. Clippy (the maximizer, the agent you're trying to limit) gets utility from infinite paperclips. It's ClippyJr (the satisficer) which can be limited to 1. But why would maximizer-clippy prefer to propose that, as opposed to proposing ClippyJrPlus, who is a satisficer but has a goal of 10^30 paperclips?
Please include all three agents in an example: M(u-v), S(finite-u), M(εu+v).
Here, I start with a bounded and easy-to-reach u (that's the first step in the process), so "u = finite-u". This is still not safe for a maximiser (the usual argument about "being sure" and squeezing ever more tiny amounts of expected utility out of optimising the universe). Then the whole system is supposed to produce S(u) rather than M(u). This is achieved by having M(εu+v) allow it, when M(εu+v) expects (counterfactually) to optimise the universe, and would see any optimisation by S(u) as getting in the way (or, if it could co-opt these optim…
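A toy numeric sketch along the lines requested above, with all three agents and invented payoffs: u is bounded at a single paperclip, so ClippyJrPlus's extra 10^30 clips buy no additional u, while the side effects of producing them show up as a drop in v.

```python
EPS = 0.01  # the verifier's unknown small epsilon (hypothetical value)

def u(clips):
    return min(clips, 1.0)  # "finite-u": bounded and easy to approach

# (name, paperclips produced, expected change in v from side effects)
proposals = [("ClippyJr", 1.0, 0.0), ("ClippyJrPlus", 1e30, -0.9)]

for name, clips, dv in proposals:
    print(name,
          "| M(u-v) score:", u(clips) - dv,      # designer's preference
          "| vetoed:", EPS * u(clips) + dv < 0)  # M(eps*u+v)'s decision
# ClippyJr     | M(u-v) score: 1.0 | vetoed: False
# ClippyJrPlus | M(u-v) score: 1.9 | vetoed: True -> never gets built
```

Note that on these numbers M(u-v) would actually prefer ClippyJrPlus (lowering v suits it too), which is exactly why the counterfactual veto from M(εu+v) is needed to force the low-impact design through.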