You're looking at Less Wrong's discussion board. This includes all posts, including those that haven't been promoted to the front page yet. For more information, see About Less Wrong.

Anti-Pascaline satisficer

3 Stuart_Armstrong 14 April 2015 06:49PM

A putative new idea for AI control; index here.

It occurred to me that the anti-Pascaline agent design could be used as part of a satisficer approach.

The obvious thing to reduce dangerous optimisation pressure is to make a bounded utility function, with an easily achievable bound. Such as giving them a utility linear in paperclips that maxs out at 10.

The problem with this is that, if the entity is a maximiser (which it might become), it can never be sure that it's achieved its goals. Even after building 10 paperclips, and an extra 2 to be sure, and an extra 20 to be really sure, and an extra 3^^^3 to be really really sure, and extra cameras to count them, with redundant robots patrolling the cameras to make sure that they're all behaving well, etc... There's still an ε chance that it might have just dreamed this, say, or that its memory is faulty. So it has a current utility of (1-ε)10, and can increase this by reducing ε - hence by building even more paperclips.

Hum... ε, you say? This seems a place where the anti-Pascaline design could help. Here we would use it at the lower bound of utility. It currently has probability ε of having utility < 10 (ie it has not built 10 paperclips) and (1-ε) of having utility = 10. Therefore and anti-Pascaline agent with ε lower bound would round this off to 10, discounting the unlikely event that it has been deluded, and thus it has no need to build more paperclips or paperclip counting devices.

Note that this is an un-optimising approach, not an anti-optimising one, so the agent may still build more paperclips anyway - it just has no pressure to do so.

Un-optimised vs anti-optimised

6 Stuart_Armstrong 14 April 2015 06:30PM

A putative new idea for AI control; index here.

This post contains no new insights; it just puts together some old insights in a format I hope is clearer.

Most satisficers are unoptimised (above the satisficing level): they have a limited drive to optimise and transform the universe. They may still end up optimising the universe anyway: they have no penalty for doing so (and sometimes it's a good idea for them). But if they can lazily achieve their goal, then they're ok with that too. So they simply have low optimisation pressure.

A safe "satisficer" design (or a reduced impact AI design) needs to be not only un-optimised, but specifically anti-optimised. It has to be setup so that "go out and optimise the universe" scores worse that "be lazy and achieve your goal". The problem is that these terms are undefined (as usual), that there are many minor actions that can optimise the universe (such as creating a subagent), and the approach has to be safe against all possible ways of optimising the universe - not just the "maximise u" for a specific and known u.

That's why the reduced impact/safe satisficer/anti-optimised designs are so hard: you have to add a very precise yet general (anti-)optimising pressure, rather than simply removing the current optimising pressure.

Creating a satisficer

4 Stuart_Armstrong 11 March 2015 03:03PM

A putative new idea for AI control; index here.

This is a potential design for a safe-ish satisficer, based on the ideas presented here. The two key ideas are that a satisficer S(u) with utility u:

  • Would not effectively aid M(u), a u-maximiser.
  • Would not effectively resist M(-u), a u-minimizer.

So satisficer make poor allies and weak enemies.

 

The design, the designer and the verifier

Let M(u-v) be an agent that maximises u and minimises v. Let's assume that M(u-v) is currently unaware as to what u and v are (though it knows that the upper bound of u is easy to approach). It can interact with the outside environment in one way only: by creating a satisficer S(). S() will then be fed u to become S(u).

However, the design for S(u) will first be sent to M(εu+v), for some unknown small ε. Note that M(εu+v) is a v-maximiser, so it disagrees with M(u-v) except for u. M(εu+v) is a (hypothetical) agent that will be (counterfactually) released at some unspecified later date. Currently, it will be fed the whole reasoning history of M(u+v), and will make a single decision only: where S(u) is to be built or not.

So it seems that S(u) must;

  1. Set u close to its maximal value (as this is "easy to approach").
  2. Not increase v to any significant extent (or else M(u-v) would not design it).
  3. Not decrease v to any significant extent (or else M(εu+v) would veto it).
  4. Since v is unknown to M(u-v) and and resources are finite, this should serve as a general reduced impact requirement for S(u) (we may have to use something like a soft minimum across all v, rather than an expectation across all v, to avoid certain edge casess).
  5. Since is u unknown to M(u-v), S() would serve as a general satisficing agent for any utility functions whose upper bounds are easy to approach (remember that we can take an arbitrary utility function and arbitrarily bound it at some number).

For the moment, this does seems like it would produce a successful satisficer...

Defining a limited satisficer

6 Stuart_Armstrong 11 March 2015 02:23PM

A putative new idea for AI control; index here.

EDIT: The definition of satisficer I'm using here is the informal one of "it tries to achieve a goal, without making huge changes on the universe" rather than "it's an agent that has utility u and threshold t". If you prefer the standard notation, think of this as a satisficer where t is not fixed, but dependent on some facts in the world (such as the ease of increasing u). I'm trying to automate the process of designing and running a satisficer: people generally chose t given facts about the world (how easy it is to achieve, for instance), and I want the whole process to be of low impact.

I've argued that the definition of a satisficer is underdefined, because there are many pathological behaviours all compatible with satsificer designs. This contradict the intuitive picture that many people have of a satisficer, which is an agent that does the minimum of effort to reach its goal, and doesn't mess up the outside world more than it has to. And if it can't accomplish the goals without messing up the outside world, it would be content not to.

In the spirit of "if you want something, you have to define it, then code it, rather than assuming you can get if for free through some other approach", can we spell out what features we would want from such a satisficer? Preferably in a simpler format that our intuitions.

It seems to me that if you had a proper u-satisficer S(u), then for many (real or hypothetical) v-maximiser M(v) out there, M(v) would find that:

  • Changing S(u) to S(v) is of low value.
  • Similarly, utility function trading with S(u) is of low value.
  • The existence or non-existence of S(u) is of low information content about the future.
  • The existence or non-existence of S(u) has little impact on the expected value of v.

Further, S(u):

  • Would not effectively aid M(u), a u-maximiser.
  • Would not effectively resist M(-u), a u-minimizer.
  • Would not have large impacts (if this can measured) for low utility gains.

A subsequent post will present an example of a satisficer using some of these ideas.

A few other much less-developed thoughts about satisficers:

  • Maybe require that it learns what variables humans care about, and doesn’t set them to extreme values – try and keep them in the same range. Do the same for variables humans may care about or that resemble values they care about.
  • Models the general procedure of detecting unaccounted-for variables set to extreme values.
  • We could check whether it would kill all humans cheaply if it could (or replace certain humans cheaply). ie give it hypothetical destructive superpowers with no costs to using them, and see whether it would use them.
  • Have the AI establish a measure/model of optimisation power (without reference to any other goal), then put itself low on that.
  • Trade between satisficers might be sub-Pareto.
  • When talking about different possible v's in the first four points above, it might be better to use something else than an expectation over different v's, as that could result in edge cases dominating - maybe a soft minimum of value across different v instead.

 

Satisficers' undefined behaviour

3 Stuart_Armstrong 05 March 2015 05:03PM

I previously posted an example of a satisficer (an agent seeking to achieve a certain level of expected utility u) transforming itself into a maximiser (an agent wanting to maximise expected u) to better achieve its satisficing goals.

But the real problem with satisficers isn't that they "want" to become maximisers; the real problem is that their behaviour is undefined. We conceive of them as agents that would do the minimum required to reach a certain goal, but we don't specify "minimum required".

For example, let A be a satisficing agent. It has a utility u that is quadratic in the number of paperclips it builds, except that after building 10100, it gets a special extra exponential reward, until 101000, where the extra reward becomes logarithmic, and after 1010000, it also gets utility in the number of human frowns divided by 3↑↑↑3 (unless someone gets tortured by dust specks for 50 years).

A's satisficing goal is a minimum expected utility of 0.5, and, in one minute, the agent can press a button to create a single paperclip.

So pressing the button is enough. In the coming minute, A could decide to transform itself into a u-maximiser (as that still ensures the button gets pressed). But it could also do a lot of other things. It could transform itself into a v-maximiser, for many different v's (generally speaking, given any v, either v or -v will result in the button being pressed). It could break out, send a subagent to transform the universe into cream cheese, and then press the button. It could rewrite itself into a dedicated button pressing agent. It could write a giant Harry Potter fanfic, force people on Reddit to come up with creative solutions for pressing the button, and then implement the best.

All these actions are possible for a satisficer, and are completely compatible with its motivations. This is why satisficers are un(der)defined, and why any behaviour we want from it - such as "minimum required" impact - has to be put in deliberately.

I've got some ideas for how to achieve this, being posted here.