Defining a limited satisficer

Stuart_Armstrong

A putative new idea for AI control; index here.

EDIT: The definition of satisficer I'm using here is the informal one of "it tries to achieve a goal, without making huge changes to the universe" rather than "it's an agent that has utility u and threshold t". If you prefer the standard notation, think of this as a satisficer where t is not fixed, but dependent on some facts in the world (such as the ease of increasing u). I'm trying to automate the process of designing and running a satisficer: people generally choose t given facts about the world (how easy it is to achieve, for instance), and I want the whole process to be of low impact.

I've argued that the definition of a satisficer is underdefined, because there are many pathological behaviours all compatible with satisficer designs. This contradicts the intuitive picture that many people have of a satisficer, which is an agent that does the minimum of effort to reach its goal, and doesn't mess up the outside world more than it has to. And if it can't accomplish its goals without messing up the outside world, it would be content not to.

In the spirit of "if you want something, you have to define it, then code it, rather than assuming you can get it for free through some other approach", can we spell out what features we would want from such a satisficer? Preferably in a simpler format than our intuitions.

It seems to me that if you had a proper u-satisficer S(u), then for many (real or hypothetical) v-maximisers M(v) out there, M(v) would find that:

Changing S(u) to S(v) is of low value.
Similarly, utility function trading with S(u) is of low value.
The existence or non-existence of S(u) is of low information content about the future.
The existence or non-existence of S(u) has little impact on the expected value of v.
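As a rough illustration of the last point, here is a minimal sketch of how "impact on the expected value of v" could be measured in a toy setting. Everything in it is an assumption made for illustration: the finite set of outcomes, the two probability distributions (with and without S(u)), and the names `expected_value` and `impact_on_v` are hypothetical and not part of any existing design.

```python
from typing import Callable, Sequence

Outcome = str  # hypothetical label for one possible future


def expected_value(v: Callable[[Outcome], float],
                   outcomes: Sequence[Outcome],
                   probs: Sequence[float]) -> float:
    """Expected value of v under a distribution over a finite set of outcomes."""
    return sum(p * v(o) for o, p in zip(outcomes, probs))


def impact_on_v(v: Callable[[Outcome], float],
                outcomes: Sequence[Outcome],
                probs_with: Sequence[float],
                probs_without: Sequence[float]) -> float:
    """Absolute change in E[v] between worlds with and without S(u).

    probs_with / probs_without are assumed to come from some world model;
    how to obtain them is exactly the hard part and is not addressed here.
    """
    return abs(expected_value(v, outcomes, probs_with)
               - expected_value(v, outcomes, probs_without))
```

In this toy picture, a proper S(u) would be one for which `impact_on_v` stays small for many different choices of v.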

Further, S(u):

Would not effectively aid M(u), a u-maximiser.
Would not effectively resist M(-u), a u-minimiser.
Would not have large impacts (if this can be measured) for low utility gains.

A subsequent post will present an example of a satisficer using some of these ideas.

A few other much less-developed thoughts about satisficers:

Maybe require that it learns what variables humans care about, and doesn’t set them to extreme values – try and keep them in the same range. Do the same for variables humans may care about or that resemble values they care about.
Model the general procedure of detecting unaccounted-for variables set to extreme values.
We could check whether it would kill all humans cheaply if it could (or replace certain humans cheaply), i.e. give it hypothetical destructive superpowers with no costs to using them, and see whether it would use them.
Have the AI establish a measure/model of optimisation power (without reference to any other goal), then put itself low on that.
Trade between satisficers might be sub-Pareto.
When talking about different possible v's in the first four points above, it might be better to use something other than an expectation over different v's, as that could result in edge cases dominating; maybe a soft minimum of value across different v instead.
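To make the last thought concrete, here is a minimal sketch of one possible soft minimum, next to which a plain average can be compared. The particular form used (a negative log-mean-exp with a sharpness parameter `beta`) is an assumption; the idea above does not commit to any specific formula, nor to which per-v quantity is being aggregated. The function name `soft_min` is hypothetical.

```python
import math
from typing import Sequence


def soft_min(values: Sequence[float], beta: float = 1.0) -> float:
    """Soft minimum: -(1/beta) * log(mean(exp(-beta * x))).

    Always lies between min(values) and the arithmetic mean.
    Larger beta pulls the result towards the true minimum;
    beta close to 0 pulls it towards the mean.
    """
    m = min(values)  # subtract the minimum first for numerical stability
    s = sum(math.exp(-beta * (x - m)) for x in values) / len(values)
    return m - math.log(s) / beta
```

For example, with per-v scores of 0, 10 and 10, the average is dominated by the two large entries, whereas `soft_min` with a large `beta` stays close to 0.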

If the satisficer is fine with "make 10 paperclips", but the difference between that and "make 10 paperclips and kill all humans" is small, the satisficer may well go with whatever showed up first in its search algorithm.

Yep, that's the problem. That's why I'm trying to address the issue directly by penalising things like "...and kill all humans".
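A toy sketch of the exchange above, under heavy assumptions: each plan comes with a utility and a numeric "impact" score (how to obtain such a score is the open problem), and the names `Plan`, `first_satisficing` and `penalised_choice` are hypothetical. The first function is a naive threshold satisficer that returns whatever its search finds first; the second caps utility at the threshold and subtracts a weighted impact penalty.

```python
from typing import NamedTuple, Optional, Sequence


class Plan(NamedTuple):
    name: str
    utility: float  # value of u the plan achieves
    impact: float   # assumed impact score; larger means more disruptive


def first_satisficing(plans: Sequence[Plan], threshold: float) -> Optional[Plan]:
    """Naive satisficer: return the first plan in search order that reaches the threshold."""
    for plan in plans:
        if plan.utility >= threshold:
            return plan
    return None


def penalised_choice(plans: Sequence[Plan], threshold: float, weight: float) -> Plan:
    """Pick the plan maximising capped utility minus a weighted impact penalty."""
    return max(plans, key=lambda p: min(p.utility, threshold) - weight * p.impact)


plans = [
    Plan("make 10 paperclips and kill all humans", utility=10.0, impact=1000.0),
    Plan("make 10 paperclips", utility=10.0, impact=1.0),
    Plan("do nothing", utility=0.0, impact=0.0),
]
```

With this search order, `first_satisficing(plans, 10.0)` returns the destructive plan, while `penalised_choice(plans, 10.0, weight=1.0)` returns "make 10 paperclips". If every goal-achieving plan had a high enough impact, the penalised version would fall back to "do nothing".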