Would not effectively resist M(-u), a u-minimizer.
I'm not sure how that's supposed to work. S(u) won't do much as long as the desirability threshold is obtained, but if M(-u) comes along and makes this difficult, S(u) would use everything it has to stop M(-u). Are you using something beyond desirability threshold? Something where S(u) stops not when the solution is good enough, but when it gets difficult to improve?
See my edit above. "would use everything it has to..." is the kind of behaviour we want to avoid. So I'm following the satisficing intuition more than the formal definition. I can justify this by going meta: when people design or imagine satisficers, they generally look around at the problem, see what can be achieved, how hard it is, etc., and then set the threshold. I want to automate "set a reasonable threshold" as well as "be a reasonable satisficer" in order to achieve "don't have a huge impact on the world".
A putative new idea for AI control; index here.
EDIT: The definition of satisficer I'm using here is the informal one of "it tries to achieve a goal, without making huge changes to the universe" rather than "it's an agent that has utility u and threshold t". If you prefer the standard notation, think of this as a satisficer where t is not fixed, but dependent on some facts about the world (such as the ease of increasing u). I'm trying to automate the process of designing and running a satisficer: people generally choose t given facts about the world (how easy it is to achieve, for instance), and I want the whole process to be of low impact.
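To make the "t is not fixed" reading concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than part of the proposal: the `Action` type, the numeric `impact` score, the `impact_budget`, and the `fraction` parameter are all hypothetical, and a real impact measure is exactly the hard part that this toy takes as given. The threshold t is set from what low-impact actions can achieve, and the agent then takes the lowest-impact action that clears t.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Action:
    name: str
    expected_u: float  # expected utility u if the action is taken
    impact: float      # hypothetical score for how much the action changes the world


def adaptive_threshold(actions: List[Action], impact_budget: float,
                       fraction: float = 0.8) -> Optional[float]:
    """Set t as a fraction of the best u reachable by low-impact actions.

    Assumes utilities are non-negative. Returns None when no action
    fits inside the impact budget.
    """
    affordable = [a.expected_u for a in actions if a.impact <= impact_budget]
    if not affordable:
        return None
    return fraction * max(affordable)


def satisfice(actions: List[Action], impact_budget: float,
              fraction: float = 0.8) -> Optional[Action]:
    """Pick the lowest-impact action whose expected u reaches the threshold."""
    t = adaptive_threshold(actions, impact_budget, fraction)
    if t is None:
        return None  # nothing cheap is available, so the agent is content not to act
    good_enough = [a for a in actions if a.expected_u >= t]
    return min(good_enough, key=lambda a: a.impact)


# Easy world: a modest action is nearly as good as the best cheap one.
easy_world = [
    Action("do nothing", 0.0, 0.0),
    Action("modest", 9.0, 1.0),
    Action("good", 10.0, 2.0),
    Action("take over the world", 100.0, 1000.0),
]

# Hard world: something (say, a u-minimiser) has made u difficult to increase.
hard_world = [
    Action("do nothing", 0.0, 0.0),
    Action("modest", 1.0, 1.0),
    Action("all-out fight", 9.0, 500.0),
]

print(satisfice(easy_world, impact_budget=5.0).name)
print(satisfice(hard_world, impact_budget=5.0).name)
```

In this toy, making u harder to increase lowers t instead of triggering the high-impact action, which is the behaviour the informal definition asks for. With a fixed t (for instance t = 8), the same agent would be pushed to "all-out fight" in the hard world.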
I've argued that the definition of a satisficer is underdefined, because there are many pathological behaviours all compatible with satisficer designs. This contradicts the intuitive picture that many people have of a satisficer, which is an agent that does the minimum of effort to reach its goal, and doesn't mess up the outside world more than it has to. And if it can't accomplish its goals without messing up the outside world, it would be content not to.
In the spirit of "if you want something, you have to define it, then code it, rather than assuming you can get it for free through some other approach", can we spell out what features we would want from such a satisficer? Preferably in a simpler format than our intuitions.
It seems to me that if you had a proper u-satisficer S(u), then for many (real or hypothetical) v-maximisers M(v) out there, M(v) would find that:
Further, S(u):
A subsequent post will present an example of a satisficer using some of these ideas.
A few other much less-developed thoughts about satisficers: