
Corrigibility through stratified indifference

4 Stuart_Armstrong 19 August 2016 04:11PM

A putative new idea for AI control; index here.

Corrigibility through indifference has a few problems. One of them is that the AI is indifferent between the world in which humans change its utility to v, and the world in which humans try to change its utility, but fail.

Now the try-but-fail world is going to be somewhat odd - humans will be reacting by trying to change the utility again, trying to shut the AI down, panicking that a tiny probability event has happened, and so on.


Predicted corrigibility: pareto improvements

5 Stuart_Armstrong 18 August 2015 11:02AM

A putative new idea for AI control; index here.

Corrigibility allows an agent to transition smoothly from a perfect u-maximiser to a perfect v-maximiser, without seeking to resist or cause this transition.

And it's the very perfection of the transition that could cause problems; while u-maximising, the agent will not take the slightest action to increase v, even if such actions are readily available. Nor will it 'rush' to finish its u-maximising before transitioning. It seems that there's some possibility of improvements here.

I've already attempted one way of dealing with the issue (see the pre-corriged agent idea). This is another one.

 

Pareto improvements allowed

Suppose that an agent with corrigible algorithm A is currently following utility u, and estimates that there are probabilities p_i that it will transition to utilities v_i at midnight (note that these are utility function representatives, not affine classes of equivalent utility functions). At midnight, the usual corrigibility applies, making A indifferent to that transition, making use of such terms as E(u|u→u) (the expectation of u, given that A's utility doesn't change) and E(v_i|u→v_i) (the expectation of v_i, given that A's utility changes to v_i).

But, in the meantime, there are expectations such as E({u,v_1,v_2,...}). These are A's best current estimates as to what the genuine expected values of the various utilities are, given all it knows about the world and itself. It could be more explicitly written as E({u,v_1,v_2,...}|A), to emphasise that these expectations are dependent on the agent's own algorithm.

Then the idea is to modify the agent's algorithm so that Pareto improvements are possible. Call this modified algorithm B. B can select actions that A would not have chosen, conditional on:

  • E(u|B) ≥ E(u|A) and E(Σ p_i v_i|B) ≥ E(Σ p_i v_i|A).

There are two obvious ways we could define B:

  • B maximises u, subject to the constraint E(Σ p_i v_i|B) ≥ E(Σ p_i v_i|A).
  • B maximises Σ p_i v_i, subject to the constraint E(u|B) ≥ E(u|A).

In the first case, the agent maximises its current utility, without sacrificing its future utility. This could apply, for example, to a ruby mining agent that rushes to get its rubies to the bank before its utility changes. In the second case, the agent maximises its future expected utility, without sacrificing its current utility. This could apply to a ruby mining agent that's soon to become a sapphire mining agent: it then starts to look around and collect some early sapphires as well.

Now, it would seem that doing this must cause it to lose some ruby mining ability. However, it is being Pareto with E("rubies in bank"|A, expected future transition), not with E("rubies in bank"|A, "A remains a ruby mining agent forever"). The difference is that A will behave as if it was maximising the second term, and so might not go to the bank to deposit its gains before getting hit by the transition. So B can collect some early sapphires, and also go to the bank to deposit some rubies, and thus end up ahead for both u and Σ p_i v_i.
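The two candidate definitions of B can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration: the `Policy` record, the policy names and the expected-value numbers are invented, and a real agent would have to estimate these expectations rather than read them from a table. The sketch only shows the selection rule: filter by the Pareto constraint, then maximise the chosen objective.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    exp_u: float   # E(u | policy), hypothetical number
    exp_v: tuple   # (E(v_1 | policy), E(v_2 | policy), ...), hypothetical numbers


def mixture(policy, probs):
    """Expected future utility: sum over i of p_i * E(v_i | policy)."""
    return sum(p * v for p, v in zip(probs, policy.exp_v))


def choose_b(policies, baseline, probs, maximise):
    """Pick B's policy. `baseline` is what algorithm A would do.

    maximise == "u": maximise u subject to the mixture not dropping below A's.
    otherwise:       maximise the mixture subject to u not dropping below A's.
    """
    if maximise == "u":
        feasible = [q for q in policies if mixture(q, probs) >= mixture(baseline, probs)]
        return max(feasible, key=lambda q: q.exp_u)
    feasible = [q for q in policies if q.exp_u >= baseline.exp_u]
    return max(feasible, key=lambda q: mixture(q, probs))


# Toy ruby/sapphire world with two possible future utilities (made-up values).
probs = (0.7, 0.3)
a_policy = Policy("mine as if nothing will change", 10.0, (0.0, 0.0))
policies = [
    a_policy,
    Policy("bank rubies early", 14.0, (0.0, 0.0)),
    Policy("bank rubies and collect sapphires", 12.0, (3.0, 1.0)),
    Policy("abandon rubies", 4.0, (6.0, 2.0)),
]

print(choose_b(policies, a_policy, probs, "u").name)
print(choose_b(policies, a_policy, probs, "v").name)
```

In this toy table the first rule picks the policy that rushes to the bank, and the second picks the policy that banks rubies and also gathers sapphires; the policy that abandons rubies is excluded by the constraint on u.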

AI: requirements for pernicious policies

7 Stuart_Armstrong 17 July 2015 02:18PM

Some have argued that "tool AIs" are safe(r). Recently, Eric Drexler decomposed AIs into "problem solvers" (eg calculators), "advisors" (eg GPS route planners), and "actors" (autonomous agents). Both solvers and advisors can be seen as examples of tools.

People have argued that tool AIs are not safe. It's hard to imagine a calculator going berserk, no matter what its algorithm is, but it's not too hard to come up with clear examples of dangerous tools. This suggests the solvers vs advisors vs actors (or tools vs agents, or oracles vs agents) is not the right distinction.

Instead, I've been asking: how likely is the algorithm to implement a pernicious policy? If we model the AI as having an objective function (or utility function) and an algorithm that implements it, a pernicious policy is one that scores high in the objective function but is not at all what is intended. A pernicious policy could be harmless and entertaining or much more severe.

I will lay aside, for the moment, the issue of badly programmed algorithms (possibly containing their own objective sub-functions). In any case, for the AI to implement a pernicious policy, we have to ask these questions about the algorithm:

  1. Do pernicious policies exist? Are there many?
  2. Can the AI find them?
  3. Can the AI test them?
  4. Would the AI choose to implement them?

The answer to 1. seems to be trivially yes. Even a calculator could, in theory, output a series of messages that socially hack us, blah, take over the world, blah, extinction, blah, calculator finishes its calculations. What is much more interesting is that some types of agents have many more pernicious policies than others. This seems the big difference between actors and other designs. An actor AI in complete control of the USA or Russia's nuclear arsenal has all sorts of pernicious policies easily to hand; an advisor or oracle has far fewer (generally going through social engineering), a tool typically even fewer. A lot of the physical protection measures are about reducing the number of successful pernicious policies the AI has access to.

The answer to 2. is mainly a function of the power of the algorithm. A basic calculator will never find anything dangerous: its programming is simple and tight. But compare an agent with the same objective function and the ability to do an unrestricted policy search with vast resources... So it seems that the answer to 2. does not depend on any solver vs actor division, but purely on the algorithm used.

And now we come to the big question 3., whether the AI can test these policies. Even if the AI can find pernicious policies that rank high on its objective function, it will never implement them unless it can ascertain this fact. And there are several ways it could do so. Let's assume that a solver AI has a very complicated objective function - one that encodes many relevant facts about the real world. Now, the AI may not "care" about the real world, but it has a virtual version of that, in which it can virtually test all of its policies. With enough computing power, it can establish whether the pernicious policy would be effective at achieving its virtual goal. If this is a good approximation of how the pernicious policy would behave in the real world, we could have a problem.

But extremely detailed objective functions are unlikely. Even simple ones, however, can show odd behaviour if the agent gets to interact repeatedly with the real world - this is the issue with reinforcement learning. Suppose that the agent attempts a translation job, and is rewarded on the accuracy of its translation. Depending on the details of what the AI knows and who chooses the rewards, the AI could end up manipulating its controllers, similarly to this example. The problem is that once there is any interaction, all the complexity of humanity could potentially show up in the reward function, even if the objective function is simple.

Of course, some designs make this very unlikely - resetting the AI periodically can help to alleviate the problem, as can choosing more objective criteria for any rewards. Lastly on this point, we should mention the possibility that human R&D, by selecting and refining the objective function and the algorithm, could take the role of testing the policies. This is likely to emerge only in cases where many AI designs are considered, and the best candidates are retained based on human judgement.

Finally we come to the question of whether the AI will implement the policy if it's found it and tested it. You could say that the point of FAI is to create an AI that doesn't choose to implement pernicious policies - but, more correctly, the point of FAI is to ensure that very few (or zero) pernicious policies exist in the first place, as they all score low on the utility function. However, there are a variety of more complicated designs - satisficers, agents using crude measures - where the questions of "Do pernicious policies exist?" and "Would the AI choose to implement them?" could become quite distinct.

 

Conclusion: a more thorough analysis of AI designs is needed

A calculator is safe, because it is a solver, it has a very simple objective function, with no holes in the algorithm, and it can neither find nor test any pernicious policies. It is the combination of these elements that makes it almost certainly safe. If we want to make the same claim about other designs, neither "it's just a solver" nor "its objective function is simple" would be enough; we need a careful analysis.

Though, as usual, "it's not certainly safe" is a quite distinct claim from "it's (likely) dangerous", and they should not be conflated.

High impact from low impact, continued

2 Stuart_Armstrong 28 April 2015 12:58PM

A putative new idea for AI control; index here.

The idea of splitting a high impact task between two low-impact AIs has one critical flaw. AI X is aiming for low impact, conditional on ¬Y (the other AI not being turned on, or not outputting a message, or something similar). "Outputting the right coordinates" is one way that X can accomplish its goal. However, there is another way it can do it: "create a robot that will output the right coordinates if ¬Y, and [do something else] if Y."

That's a dangerous situation to be in, especially if we have a more general situation than the "laser aiming at the asteroid". But note that if X does create such a robot, and if ¬Y is actually true, then that robot must be low impact and not dangerous, since that's X's programming. Since X cannot predict all the situations the robot would encounter, the robot is probably generically "safe" and low impact.

Therefore, if the robot behaves the same way under Y and ¬Y, we're good.

How could we achieve that? Well, we could adapt my idea from "restrictions that are hard to hack". If a hypothetical superintelligent AI C observed the output stream from X, could it deduce that Y vs ¬Y was something important in it? If C knew that X was conditioning on ¬Z, but didn't know Z=Y, could it deduce that? That seems like a restriction that we could program into X, as a third component of its utility (the first being the "do what we want" component, the second being the "have a reduced impact conditional on ¬Z" one).

And if we have a "safe" robot, given ¬Y, and the programming of that robot does not (explicitly or implicitly) mention Y or its features, we probably have a safe robot.

The idea still needs to be developed and some of the holes patched, but I feel it has potential.

High impact from low impact

6 Stuart_Armstrong 17 April 2015 04:01PM

A putative new idea for AI control; index here.

Part of the problem with a reduced impact AI is that it will, by definition, only have a reduced impact.

Some of the designs try and get around the problem by allowing a special "output channel" on which impact can be large. But that feels like cheating. Here is a design that accomplishes the same without using that kind of hack.

Imagine there is an asteroid that will hit the Earth, and we have a laser that could destroy it. But we need to aim the laser properly, so we need coordinates. There is a reduced impact AI that is motivated to give the coordinates correctly, but also motivated to have reduced impact - and saving the planet from an asteroid with certainty is not reduced impact.

Now imagine that instead there are two AIs, X and Y. By abuse of notation, let ¬X refer to the event that the output signal from X is scrambled away from the original output.

Then we ask X to give us the x-coordinates for the laser, under the assumption of ¬Y (that AI Y's signal will be scrambled). Similarly, we ask Y to give us the y-coordinates of the laser, under the assumption ¬X.

Then X will reason "since ¬Y, the laser will certainly miss its target, as the y-coordinates will be wrong. Therefore it is reduced impact to output the correct x-coordinates, so I shall." Similarly, Y will output the right y-coordinates, the laser will fire and destroy the asteroid, having a huge impact, hooray!
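The reasoning of the two AIs can be mimicked in a toy Python model. This is only a sketch under strong assumptions: the coordinates are integers, "impact" is reduced to the single question of whether the asteroid is destroyed, and each AI's counterfactual belief is represented by a hypothetical list of scrambled values that never contains the other AI's true output. None of these names come from the original design.

```python
def asteroid_destroyed(x_out, y_out, target):
    """The laser only hits if both coordinates are right."""
    return (x_out, y_out) == target


def believed_impact(own_output, scrambled_outputs, target, i_am_x):
    """Chance of destroying the asteroid, as judged by one AI under the
    assumption that the other AI's signal has been scrambled."""
    hits = 0
    for other in scrambled_outputs:
        pair = (own_output, other) if i_am_x else (other, own_output)
        hits += asteroid_destroyed(pair[0], pair[1], target)
    return hits / len(scrambled_outputs)


target = (3, 7)        # true coordinates (made-up values)
scrambled = [0, 1, 2]  # what a scrambled signal might read; never the true value

# Under its own assumption, each AI sees no impact from answering correctly...
print(believed_impact(3, scrambled, target, i_am_x=True))
print(believed_impact(7, scrambled, target, i_am_x=False))

# ...yet in reality neither signal is scrambled, so the laser hits.
print(asteroid_destroyed(3, 7, target))
```

The point of the sketch is only that the judged impact and the actual impact come apart, because each AI evaluates its output under a false assumption about the other.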

The approach is not fully general yet, because we can have "subagent problems". X could create an agent that behaves nicely given ¬Y (the assumption it was given), but completely crazily given Y (the reality). But it shows how we could get high impact from slight tweaks to reduced impact.

EDIT: For those worried about lying to the AIs, do recall http://lesswrong.com/r/discussion/lw/lyh/utility_vs_probability_idea_synthesis/ and http://lesswrong.com/lw/ltf/false_thermodynamic_miracles/

Anti-Pascaline satisficer

3 Stuart_Armstrong 14 April 2015 06:49PM

A putative new idea for AI control; index here.

It occurred to me that the anti-Pascaline agent design could be used as part of a satisficer approach.

The obvious way to reduce dangerous optimisation pressure is to use a bounded utility function, with an easily achievable bound. For instance, we could give the agent a utility linear in paperclips that maxes out at 10.

The problem with this is that, if the entity is a maximiser (which it might become), it can never be sure that it's achieved its goals. Even after building 10 paperclips, and an extra 2 to be sure, and an extra 20 to be really sure, and an extra 3^^^3 to be really really sure, and extra cameras to count them, with redundant robots patrolling the cameras to make sure that they're all behaving well, etc... There's still an ε chance that it might have just dreamed this, say, or that its memory is faulty. So it has a current utility of (1-ε)10, and can increase this by reducing ε - hence by building even more paperclips.

Hum... ε, you say? This seems a place where the anti-Pascaline design could help. Here we would use it at the lower bound of utility. It currently has probability ε of having utility < 10 (ie it has not built 10 paperclips) and (1-ε) of having utility = 10. Therefore an anti-Pascaline agent with ε lower bound would round this off to 10, discounting the unlikely event that it has been deluded, and thus it has no need to build more paperclips or paperclip counting devices.

Note that this is an un-optimising approach, not an anti-optimising one, so the agent may still build more paperclips anyway - it just has no pressure to do so.
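The contrast between the plain maximiser and the rounded-off agent can be written as a short Python sketch. It assumes, to match the (1-ε)10 figure above, that the deluded outcome is worth 0 utility; the function names and the split between `p_deluded` (the agent's remaining doubt) and `eps` (the agent's precision) are illustrative choices, not part of the original design.

```python
BOUND = 10  # utility is linear in paperclips and maxes out at 10


def maximiser_value(p_deluded):
    """Plain expected utility: the bound, unless the agent has been deluded
    (assumed here to be worth 0)."""
    return (1 - p_deluded) * BOUND


def anti_pascaline_value(p_deluded, eps):
    """Discard the low-utility tail when its probability is at most eps."""
    if p_deluded <= eps:
        return BOUND
    return (1 - p_deluded) * BOUND


# The maximiser always gains by shrinking its doubt, so it keeps building.
print(maximiser_value(1e-3) < maximiser_value(1e-6))

# The rounded-off agent sees no gain once its doubt is below its precision.
print(anti_pascaline_value(1e-3, eps=1e-2) == anti_pascaline_value(1e-6, eps=1e-2))
```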

Un-optimised vs anti-optimised

6 Stuart_Armstrong 14 April 2015 06:30PM

A putative new idea for AI control; index here.

This post contains no new insights; it just puts together some old insights in a format I hope is clearer.

Most satisficers are unoptimised (above the satisficing level): they have a limited drive to optimise and transform the universe. They may still end up optimising the universe anyway: they have no penalty for doing so (and sometimes it's a good idea for them). But if they can lazily achieve their goal, then they're ok with that too. So they simply have low optimisation pressure.

A safe "satisficer" design (or a reduced impact AI design) needs to be not only un-optimised, but specifically anti-optimised. It has to be set up so that "go out and optimise the universe" scores worse than "be lazy and achieve your goal". The problem is that these terms are undefined (as usual), that there are many minor actions that can optimise the universe (such as creating a subagent), and the approach has to be safe against all possible ways of optimising the universe - not just the "maximise u" for a specific and known u.

That's why the reduced impact/safe satisficer/anti-optimised designs are so hard: you have to add a very precise yet general (anti-)optimising pressure, rather than simply removing the current optimising pressure.

Intelligence modules

4 Stuart_Armstrong 23 March 2015 04:24PM

A putative new idea for AI control; index here.

This idea, due to Eric Drexler, is to separate out the different parts of an AI into modules. There would be clearly designated pieces, either physical or algorithmic, each playing a specific role: this module would contain the motivation, this module the probability estimator, this module the models of the outside world, this module the natural language understanding unit, etc...

It's obvious how such a decomposition would be useful for many of the methods I've been detailing here. We could also distil each module - reduce it to a smaller, weaker (?) and more understandable submodule, in order to better understand what is going on. In one scenario, an opaque AI gets to design its successor, in the form of a series of such modules.

This property seems desirable; the question is, how could we get it?

EDIT: part of the idea of "modules" is that AIs often need to do calculations or estimations that would be of great value to us if we could access them in isolation. This idea is developed more in these posts.

 

Designing in modules

The main threat here is that a given submodule would contain more than just the properties we want. After all, a natural language parser could consist of a general intelligence plus a motivation to understand language. Another possible worry is that the modules are overfitted to the problem or to each other: the language parser works perfectly, but only in this one AI design.

There are several ways we could try and combat this.

  • Minimum length to the program (Eric's original main idea). The module must be as short as possible for its purpose.
  • If possible, start with a modularised AI, and track how the modules change as the AI's intelligence grows.
  • Have the AI extend the module to make it more transparent, without adding any extra functionality.
  • Check that the input-output map the AI presents (maybe require it to present it as a simple primitive recursive function with clear and simple "for" loops) is of commensurable complexity to the task at hand.
  • Add some noise to the situation: the new AI will not get exactly the same data as the old, the modules may get somewhat altered, and still have to work. Bugs will be introduced (this approach helps to combat the overfitting aspect of minimum length description).
  • Require that the module be "distant", according to some metric of modification to the module's program, from having other capabilities. For instance, a general intelligence motivated to understand language is easy to make into a model-making module, simply by changing the motivation; more narrow modules are probably more distant from each other.
  • Similarly, require that the performance of the modules becomes degraded if the world (or the world's descriptions) were substantially different, but not if they are the same information presented in a very different way.
  • General anti-overfitting methods, calibrated to not generate a push towards general abilities in the modules. This would require some skill and a lot of thought.
  • Focus on requiring statistical accuracy, rather than perfect performance.

If we allow false counterfactuals, then we can also:

  • Imagine that the modules would be connected up to other modules, made by (hypothetical) different AIs, with potentially different motivations.
  • Require that the modules not be useful as seeds for an AI of opposite motivation. For instance, an AI with opposite motivation would be allowed to flip a certain number of bits; if there was a general intelligence piece inside the module, then that would be something easier to co-opt, so dangerous to hand to an AI with opposite motivation. This is the "opposed" version of requiring modules to be "distant" from each other's capabilities.
  • In general, all the reduced impact/satisficing requirements push towards having modules that cannot become general intelligences.
  • The AI will dictate how to program these modules to a smart programmer, who will (counterfactually) not survive nor get to program the modules in the outside world.

Obviously anti-restriction-hacking would be useful for module separation (and vice versa).

This is the beginning of the process of defining this, but it would be great to have a safe(ish) method of separating modules in this way.

Any suggestions?

Closest stable alternative preferences

3 Stuart_Armstrong 20 March 2015 12:41PM

A putative new idea for AI control; index here.

There's a result that's almost a theorem, which is that an agent that is an expected utility maximiser is stable under self-modification (or the creation of successor sub-agents).

Of course, this needs to be for "reasonable" utility, where no other agent cares about the internal structure of the agent (just its decisions), where the agent is not under any "social" pressure to make itself into something different, where the boundedness of the agent itself doesn't affect its motivations, and where issues of "self-trust" and acausal trade don't affect it in relevant ways, etc...

So quite a lot of caveats, but the result is somewhat stronger in the opposite direction: an agent that is not an expected utility maximiser is under pressure to self-modify itself into one that is. Or, more correctly, into an agent that is isomorphic with an expected utility maximiser (an important distinction).

What is this "pressure" that agents are "under"? The known result is that if an agent obeys four simple axioms, then its behaviour must be isomorphic with an expected utility maximiser. If we assume the Completeness axiom (trivial) and Continuity (subtle), then violations of Transitivity or Independence correspond to situations where the agent can be money pumped - made to lose resources or power for no gain at all. The more likely the agent is to face these situations, the more pressure it is under to behave as an expected utility maximiser, or simply lose out.
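A money pump against a transitivity violation is easy to simulate. The following Python sketch assumes an agent that accepts any trade to an item it strictly prefers and pays a fixed fee per trade; the item names, the fee and the starting wealth are all made up for illustration.

```python
def run_trades(prefers, start_item, offers, fee, wealth):
    """Offer items one at a time; the agent pays `fee` to swap whenever it
    strictly prefers the offered item to the one it holds.

    prefers: set of (better, worse) pairs describing strict preferences.
    """
    item = start_item
    for offered in offers:
        if (offered, item) in prefers:
            item = offered
            wealth -= fee
    return item, wealth


# Cyclic (non-transitive) preferences: A over B, B over C, and C over A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
print(run_trades(cyclic, "C", ["B", "A", "C"], fee=1, wealth=100))

# Transitive preferences: A over B over C. The last trade is refused.
transitive = {("A", "B"), ("B", "C"), ("A", "C")}
print(run_trades(transitive, "C", ["B", "A", "C"], fee=1, wealth=100))
```

The cyclic agent ends up holding the item it started with while being three fees poorer, and the same sequence of offers could be repeated indefinitely. The transitive agent also pays fees, but ends up with an item it prefers to its starting one and cannot be cycled.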

 

Unbounded agents

I have two models for how idealised agents could deal with this sort of pressure. The first, post-hoc, is the unlosing agent I described here. The agent follows whatever preferences it has, but keeps track of its past decisions, and whenever it is in a position to violate transitivity or independence in a way that it would suffer from, it makes another decision instead.

Another, pre-hoc, way of dealing with this is to make an "ultra choice" and choose not between decisions, but between all possible input-output maps (equivalently, between all possible decision algorithms), looking to the expected consequences of each one. This reduces the choices to a single choice, where issues of transitivity or independence need not necessarily apply.

 

Bounded agents

Actual agents will be bounded, unlikely to be able to store and consult their entire history when making every single decision, and unable to look at the whole future of their interactions to make a good ultra choice. So how would they behave?

This is not determined directly by their preferences, but by some sort of meta-preferences. Would they make an approximate ultra-choice? Or maybe build up a history of decisions, and then simplify it (when it gets too large to easily consult) into a compatible utility function? This is also determined by their interactions - an agent that makes a single decision has no pressure to be an expected utility maximiser, one that makes trillions of related decisions has a lot of pressure.

It's also notable that different types of boundedness (storage space, computing power, time horizons, etc...) have different consequences for unstable agents, and would converge to different stable preference systems.

 

Investigation needed

So what is the point of this post? It isn't presenting new results; it's more an attempt to launch a new sub-field of investigation. We know that many preferences are unstable, and that the agent is likely to make them stable over time, either through self-modification, subagents, or some other method. There are also suggestions for preferences that are known to be unstable, but have advantages (such as resistance to Pascal Muggings) that standard maximisation does not.

Therefore, instead of saying "that agent design can never be stable", we should be saying "what kind of stable design would that agent converge to?", "does that convergent stable design still have the desirable properties we want?" and "could we get that stable design directly?".

The first two things I found in this area were that traditional satisficers could converge to vastly different types of behaviour in an essentially unconstrained way, and that a quasi-expected utility maximiser of utility u might converge to an expected utility maximiser, but it might not be u that it maximises.

In fact, we need not look only at violations of the axioms of expected utility; they are but one possible reason for decision behaviour instability. Here are some that spring to mind:

  1. Non-independence and non-transitivity (as above).
  2. Boundedness of abilities.
  3. Adversaries and social pressure.
  4. Evolution (survival cost to following “odd” utilities (eg time-dependent preference)).
  5. Unstable decision theories (such as CDT).

Now, some categories (such as "Adversaries and social pressure") may not possess a tidy stable solution, but it is still worth asking what setups are more stable than others, and what the convergence rules are expected to be.

Anti-Pascaline agent

4 Stuart_Armstrong 12 March 2015 02:17PM

A putative new idea for AI control; index here.

Pascal's wager-like situations come up occasionally with expected utility, making some decisions very tricky. It means that events of the tiniest of probability could dominate the whole decision - intuitively unobvious, and a big negative for a bounded agent - and that expected utility calculations may fail to converge.

There are various principled approaches to resolving the problem, but how about an unprincipled approach? We could try and bound utility functions, but the heart of the problem is not high utility, but high utility combined with low probability. Moreover, this has to behave sensibly with respect to updating.

 

The agent design

Consider a UDT-ish agent A looking at input-output maps {M} (ie algorithms that could determine every single possible decision of the agent in the future). We allow probabilistic/mixed output maps as well (hence A has access to a source of randomness). Let u be a utility function, and set 0 < ε << 1 to be the precision. Roughly, we'll be discarding the highest (and lowest) utilities that are below probability ε. There is no fundamental reason that the same ε should be used for highest and lowest utilities, but we'll keep it that way for the moment.

The agent is going to make an "ultra-choice" among the various maps M (ie fixing its future decision policy), using u and ε to do so. For any M, designate by A(M) the decision of the agent to use M for its decisions.

Then, for any map M, set max(M) to be the lowest number s.t. P(u ≥ max(M)|A(M)) ≤ ε. In other words, if the agent decides to use M as its decision policy, this is the maximum utility that can be achieved if we ignore the highest valued ε of the probability distribution. Similarly, set min(M) to be the highest number s.t. P(u ≤ min(M)|A(M)) ≤ ε.

Then define the utility function u_M^ε, which is simply u, bounded between min(M) and max(M). Now calculate the expected value of u_M^ε given A(M), call this Eε(u|A(M)).

The agent then chooses the M that maximises Eε(u|A(M)). Call this the ε-precision u-maximising algorithm.
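For a single map M whose outcomes form a finite lottery, the truncated expectation can be computed directly. The Python sketch below is one reading of the definitions above, not a definitive implementation: it treats max(M) as the infimum of the numbers s with P(u ≥ s) ≤ ε (for a finite lottery, the largest outcome whose upper tail exceeds ε), and min(M) symmetrically. The function name and the example lottery are assumptions made for illustration.

```python
def truncated_expectation(outcomes, eps):
    """ε-truncated expected utility of a finite lottery.

    outcomes: list of (utility, probability) pairs, probabilities summing to 1.
    eps: the precision, assumed to satisfy 0 < eps < 0.5.
    """
    # Merge repeated utility values into one distribution.
    dist = {}
    for utility, prob in outcomes:
        dist[utility] = dist.get(utility, 0.0) + prob
    values = sorted(dist)

    # Upper cap: largest outcome v with P(u >= v) > eps.
    tail, cap = 0.0, values[0]
    for v in reversed(values):
        tail += dist[v]
        if tail > eps:
            cap = v
            break

    # Lower cap: smallest outcome v with P(u <= v) > eps.
    tail, floor = 0.0, values[-1]
    for v in values:
        tail += dist[v]
        if tail > eps:
            floor = v
            break

    # Expected value of u clipped to [floor, cap].
    return sum(p * min(max(v, floor), cap) for v, p in dist.items())


# A Pascal-style lottery: a huge payoff with a tiny probability.
lottery = [(1e9, 1e-6), (0.0, 1 - 1e-6)]
print(sum(u * p for u, p in lottery))         # plain expectation, dominated by the tail
print(truncated_expectation(lottery, 1e-3))   # the tail is below precision and is discarded
```

An agent following the design would evaluate this quantity for each candidate map M and pick the map with the largest value; enumerating the maps is the expensive part and is not attempted here.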

 

Stability of the design

The above decision process is stable, in that there is a single ultra-choice to be made, and clear criteria for making that ultra-choice. Realistic and bounded agents, however, cannot calculate all the M in sufficient detail to get a reasonable outcome. So we can ask whether the design is stable for a bounded agent.

Note that this question is underdefined, as there are many ways of being bounded, and many ways of cashing out ε-precision u-maximising into bounded form. Most likely, this will not be a direct expected utility maximisation, so the algorithm will be unstable (prone to change under self-modification). But how exactly it's unstable is an interesting question.

I'll look at one particular situation: one where A was tasked with creating subagents that would go out and interact with the world. These agents are short-sighted: they apply ε-precision u-maximising not to the ultra-choice, but to each individual expected utility calculation (we'll assume the utility gains and losses for each decision are independent).

A has a single choice: what to set ε to for the subagents. Intuitively, it would seem that A would set ε lower than its own value; this could correspond roughly to an agent self-modifying to remove the ε-precision restriction from itself, converging on becoming a u-maximiser. However:

  • Theorem: There are (stochastic) worlds in which A will set the subagent precision to be higher, lower or equal to its own precision ε.

The proof will be by way of illustration of the interesting things that can happen in this setup. Let B be the subagent whose precision A sets.

Let C(p) be a coupon that pays out 1 with probability p. xC(p) simply means the coupon pays out x instead of 1. Each coupon costs ε² utility. This is negligible, and only serves to break ties. Then consider the following worlds:

  • In W1, B will be offered the possibility of buying C(0.75ε).
  • In W2, B will be offered the possibility of buying C(1.5ε).
  • In W3, B will be offered the possibility of buying C(0.75ε), and the offer will be made twice.
  • In W4, B will be offered, with 50% probability, the possibility of buying C(1.5ε).
  • In W5, B will be offered, with 50% probability, the possibility of buying C(1.5ε), and otherwise the possibility of buying 2C(1.5ε).
  • In W6, B will be offered, with 50% probability, the possibility of buying C(0.75ε), and otherwise the possibility of buying 2C(1.5ε).
  • In W7, B will be offered, with 50% probability, the possibility of buying C(0.75ε), and otherwise the possibility of buying 2C(1.05ε).

From A’s perspective, the best input-output maps are: in W1, don’t buy, in W2, buy, in W3, buy both, in W4, don’t buy (because the probability of getting above 0 utility by buying, is, from A's initial perspective, 1.5ε/2 = 0.75ε).

W5 is more subtle, and interesting – essentially A will treat 2C(1.5ε) as if it were C(1.5ε) (since the probability of getting above 1 utility by buying is 1.5ε/2 = 0.75ε, while the probability of getting above zero by buying is (1.5ε+1.5ε)/2=1.5ε). Thus A would buy everything offered.

Similarly, in W6, the agent would buy everything, and in W7, the agent would buy nothing (since the probability of getting above zero by buying is now (1.05ε + 0.75ε)/2 = 0.9ε).

So in W1 and W2, the agent can leave the sub-agent precision at ε. In W3, it needs to lower it below 0.75ε. In W4, it needs to raise it above 1.5ε. In W5 it can leave it alone, while in W6 it must lower it below 0.75ε, and in W7 it must raise it above 1.05ε.
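The subagent side of this argument can be checked mechanically. The Python sketch below takes A's preferred decisions in each world from the discussion above as given (it does not recompute A's ultra-choice), and assumes that the short-sighted subagent B buys a coupon exactly when its payout probability exceeds B's own precision: the payout size and the ε² cost do not change that decision. Probabilities and precisions are expressed in units of A's ε, and the table and function names are illustrative.

```python
# For each world: the offers B might face, as
# (coupon payout probability in units of ε, whether A wants it bought).
WORLDS = {
    "W1": [(0.75, False)],
    "W2": [(1.5, True)],
    "W3": [(0.75, True), (0.75, True)],
    "W4": [(1.5, False)],
    "W5": [(1.5, True), (1.5, True)],
    "W6": [(0.75, True), (1.5, True)],
    "W7": [(0.75, False), (1.05, False)],
}


def subagent_buys(coupon_prob, precision):
    """B evaluates each offer alone: a payout whose probability is at most
    B's precision is truncated away, leaving only the coupon's cost."""
    return coupon_prob > precision


def matches_a(world, precision):
    """Does a subagent with this precision reproduce A's preferred decisions?"""
    return all(subagent_buys(p, precision) == wanted for p, wanted in WORLDS[world])


# Which of a few candidate precisions (in units of ε) work in each world?
for world in WORLDS:
    print(world, [c for c in (0.7, 1.0, 1.1, 1.6) if matches_a(world, c)])
```

Under these assumptions, precision 1.0 (A's own ε) works in W1, W2 and W5, while W3 and W6 need a lower value and W4 and W7 a higher one, which is the pattern the theorem asserts.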

 

Irrelevant information

One nice feature about this approach is that it ignores irrelevant information. Specifically:

  • Theorem: Assume X is a random variable that is irrelevant to the utility function u. If A (before knowing X) has to design successor agents that will exist after X is revealed, then (modulo a few usual assumptions about only decisions mattering, not internal thought processes) it will make these successor agents isomorphic to copies of itself, i.e. ε-precision u-maximising algorithms (potentially with a different way of breaking ties).

These successor agents are not the short-sighted agents of the previous model, but full ultra-choice agents. Their ultra-choice is over all decisions to come, while A's ultra-choice (which is simply a choice) is over all agent designs.

For the proof, I'll assume X is boolean valued (the general proof is similar). Let M be the input-output map A would choose for itself, if it were to make all the decisions itself rather than just designing a subagent. Now, it's possible that M(X) will be different from M(¬X) (here M(X) and M(¬X) are contractions of the input-output map by adding in one of the inputs).

Define the new input-output map M' by defining a new internal variable Y in A (recall that A has access to a source of randomness). Since this variable is new, M is independent of the value of Y. Then M' is defined as M with X and Y permuted. Since both Y and X are equally irrelevant to u, Eε(u|A(M))=Eε(u|A(M')), so M' is an input-output map that fulfils the ε-precision u-maximising. And M'(X)=M'(¬X), so M' is independent of X.

Now consider the subagent that runs the same algorithm as A, and has seen X. Because of the irrelevance of X, M'(X) will still fulfil ε-precision u-maximising (we can express any fact relevant to u in the form of Zs, with P(Z)=P(Z|X), and then the algorithm is the same).

Similarly, a subagent that has seen ¬X will run M'(¬X). Putting these together, the subagent will expect to run M'(X) with probability P(X) and M'(¬X) with probability P(¬X)=1-P(X).

Since M'(X)=M'(¬X), this whole thing is just M'. So if A creates a copy of itself (possibly tweaking the tie-breaking so that M' is selected), then it will achieve its maximum according to ε-precision u-maximising.

Creating a satisficer

4 Stuart_Armstrong 11 March 2015 03:03PM

A putative new idea for AI control; index here.

This is a potential design for a safe-ish satisficer, based on the ideas presented here. The two key ideas are that a satisficer S(u) with utility u:

  • Would not effectively aid M(u), a u-maximiser.
  • Would not effectively resist M(-u), a u-minimiser.

So satisficers make poor allies and weak enemies.

 

The design, the designer and the verifier

Let M(u-v) be an agent that maximises u and minimises v. Let's assume that M(u-v) is currently unaware as to what u and v are (though it knows that the upper bound of u is easy to approach). It can interact with the outside environment in one way only: by creating a satisficer S(). S() will then be fed u to become S(u).

However, the design for S(u) will first be sent to M(εu+v), for some unknown small ε. Note that M(εu+v) is a v-maximiser, so it disagrees with M(u-v) except for u. M(εu+v) is a (hypothetical) agent that will be (counterfactually) released at some unspecified later date. Currently, it will be fed the whole reasoning history of M(u-v), and will make a single decision only: whether S(u) is to be built or not.

So it seems that S(u) must:

  1. Set u close to its maximal value (as this is "easy to approach").
  2. Not increase v to any significant extent (or else M(u-v) would not design it).
  3. Not decrease v to any significant extent (or else M(εu+v) would veto it).
  4. Since v is unknown to M(u-v) and resources are finite, this should serve as a general reduced impact requirement for S(u) (we may have to use something like a soft minimum across all v, rather than an expectation across all v, to avoid certain edge cases).
  5. Since u is unknown to M(u-v), S() would serve as a general satisficing agent for any utility functions whose upper bounds are easy to approach (remember that we can take an arbitrary utility function and arbitrarily bound it at some number).
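
The designer/verifier interplay can be illustrated with a toy selection procedure. Everything here is a hypothetical sketch: the candidate designs, their numbers, the value of EPSILON, and the use of a hard worst case over candidate v (where the text suggests something like a soft minimum) are all assumptions made for illustration.

```python
EPSILON = 0.01  # hypothetical small weight on u for the verifier M(εu+v)

# Hypothetical candidate designs for S(u): the u they achieve, and the change
# they cause in each candidate v (the designer does not know which v is real).
DESIGNS = {
    "do_nothing":      {"u": 0.0,  "dv": {"v1": 0.0,   "v2": 0.0}},
    "grab_everything": {"u": 1.0,  "dv": {"v1": -5.0,  "v2": 3.0}},
    "low_impact":      {"u": 0.95, "dv": {"v1": 0.001, "v2": -0.002}},
}

def verifier_approves(design):
    """M(εu+v) vetoes if building S(u) would lower εu+v, for any candidate v."""
    return all(EPSILON * design["u"] + dv >= 0 for dv in design["dv"].values())

def designer_score(design):
    """Worst case over candidate v of u - v, from M(u-v)'s point of view."""
    return min(design["u"] - dv for dv in design["dv"].values())

def choose_design(designs):
    """Pick the verifier-approved design with the best worst-case designer score."""
    approved = {name: d for name, d in designs.items() if verifier_approves(d)}
    return max(approved, key=lambda name: designer_score(approved[name]))

if __name__ == "__main__":
    print(choose_design(DESIGNS))
```

In this toy setup the high-impact design is vetoed because it sharply lowers one candidate v, and the low-impact design wins over doing nothing because it gets u close to its maximum.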

For the moment, this does seem like it would produce a successful satisficer...

Resource gathering and pre-corriged agents

7 Stuart_Armstrong 10 March 2015 11:47AM

A putative new idea for AI control; index here.

Resource-gathering agent

It will often be useful to have a model of a “pure” resource gathering agent – one motivated only to gather resources, accumulate power, spread efficiently, and so on. This model could be used as behaviour not to emulate, or as a comparison yardstick for the accumulation behaviour of other agents.

The simplest design for a resource gathering agent would be to take a utility function u – one linear in paperclips, say – and give the agent the utility function X(u) + ¬X(-u), where X is some future observation that has 50% chance of occurring, and that the AI cannot affect. Some cosmological fact coming from a distant galaxy (at some point in the future) could do the trick.

This agent would behave roughly as a resource gathering agent, accumulating power in preparation for the day it would know what to do with it: it would want resources (as these could be used to create or destroy paperclips) but would be indifferent to creating or destroying paperclips currently, as the expected gain from u is exactly compensated by the expected loss from -u (and vice versa).
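
A small worked calculation shows why this agent is indifferent to paperclips now yet still values resources. It assumes, for illustration only, that a stockpiled resource can later create or destroy a fixed number of paperclips once X is known; the function names are hypothetical.

```python
P_X = 0.5  # probability of the observation X, as in the text

def value_of_acting_now(delta_paperclips):
    """Expected utility of changing the paperclip count before X is known."""
    # Under X the agent has utility u (+delta); under not-X it has -u (-delta).
    return P_X * delta_paperclips + (1 - P_X) * (-delta_paperclips)

def value_of_waiting(capacity):
    """Expected utility of a resource usable after X is revealed (assumption:
    it can create or destroy up to `capacity` paperclips, whichever is wanted)."""
    gain_if_x = capacity           # create paperclips, rewarded by u
    gain_if_not_x = -(-capacity)   # destroy paperclips, rewarded by -u
    return P_X * gain_if_x + (1 - P_X) * gain_if_not_x

if __name__ == "__main__":
    print(value_of_acting_now(10))  # creating 10 paperclips today
    print(value_of_waiting(10))     # holding the means to change 10 paperclips later
```

Acting now has zero expected value whichever direction it goes, while keeping the resource until X is revealed has positive expected value.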

However, its behaviour is not independent of u: if for instance there were a Grand President of the Committee to Establish the Proper Number of Paperclips in the World (GPotCtEtPNoPitW), then the AI would desperately try to secure that position, but would not care overmuch about being the GPotCtEtPNoSitW, who deals with staples.

So a better model of a resource gathering agent is one that has a distribution P over all sorts of different utility functions, with the proviso that for all such utilities u, P(u)=P(-u). Note here that we’re talking about actual utility functions (which can be compared and summed directly), not functions-up-to-affine-transformations. This distribution P will be updated at some future date according to some phenomena outside of the agent’s control.

Then this agent, which currently has exactly zero motivations, will nonetheless accumulate resources in preparation for the day it will know what to do.

There are some distributions P which are better suited to getting a “purer” resource gathering agent (a bad P would be, eg, having lots of utilities which are tiny variations on u, which is essentially the same as having just u – but “tiny variations” is not a stable concept under affine transformations). A simplicity prior seems a natural choice here. If u is linear in paperclips and v in staples, then the complexity penalty for w=u+v doesn’t matter so much, as the agent will already want to preserve power over paperclips and staples, because of the (simpler) u, -u, v and -v.

 

Pre-corriged agents

One of the problems with corrigible agents is that they are, in a sense, too good at what they do. An agent that is currently a u maximiser and will transition tomorrow to being a v maximiser (and everyone knows this) will accept the deal “give me £1,000,000, and I’ll return it tripled tomorrow if you’re still a u-maximiser” (link to corrigibility paper). Why would it accept this deal? Because a real u-maximiser would, and it behaves (almost) exactly as a real u-maximiser.

We might be able to solve that specific problem with methods that identify agents or subagents (see subsequent posts). But there are still issues with, for instance, people who want to trade their own u-valuable and v-useless resources for the agent’s u-useless and v-valuable ones – and then propose the opposite trade tomorrow, with an extra premium.

We can use the idea of a resource gathering agent to prevent such loss of utility. Assume the agent has current utility u, and will transition to some v at a specific point in the future. It has a probability distribution P over what this v will be.

Then instead of having current utility u, have it instead as:

u + C Σv Q(v),

where C is some constant and Q(v)=(P(v)+P(-v))/2. Note that Q(v)=Q(-v), so this agent is currently a combination between a u-maximiser and a resource gathering agent – moreover, a resource gathering agent that cares about preserving flexibility in the (likely) correct areas for its future values. The importance of either factor (u-maximising or resource gathering) can be tuned by changing C.
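
The symmetrisation Q(v)=(P(v)+P(-v))/2 can be sketched directly. As an illustrative assumption, each utility is represented by a tuple of coefficients (paperclips, staples) for a utility linear in those goods, so that negating the utility is negating the tuple; the names negate and symmetrise are hypothetical.

```python
def negate(v):
    """Return -v for a utility represented as a tuple of linear coefficients."""
    return tuple(-c for c in v)

def symmetrise(P):
    """Build Q(v) = (P(v) + P(-v)) / 2 over every v or -v that P mentions."""
    support = set(P) | {negate(v) for v in P}
    return {v: (P.get(v, 0.0) + P.get(negate(v), 0.0)) / 2 for v in support}

# Hypothetical P: 60% the agent will value paperclips, 40% staples.
P = {(1, 0): 0.6, (0, 1): 0.4}
Q = symmetrise(P)

if __name__ == "__main__":
    for v in sorted(Q):
        print(v, Q[v])
```

Here Q gives equal weight to each utility and its negation, including negations that P itself considers impossible, and still sums to 1.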

What if the agent expects that their utility will get changed more than once in the future? This can be built up inductively: if there are two utility changes to come, for instance, then after the first transition  (but before the second) the agent will have a composite utility, as above, of the form “u + Σv Q(v)”. Then the agent can have a P over all such composite utilities, and use that to define its current composite-composite utility (the one it has before the first change). A composite-composite utility is really just a composite utility, so the process can then be repeated.

Corrigibility will be applied to this setup in two types of circumstances: when people physically change the utility u, as before, and when the agent updates P (and hence Q) in a way that modifies the composite utility.

Note that this setup is less exploitable, but still suffers from the weakness that Q and P are not equal (in the worst case, you could have P(v)=0 while Q(v)=0.5). However, if Q were not symmetric, then the agent wouldn’t currently be a u-maximiser, so this non-equality is essential to preserving the idea of it being a (somewhat) u-maximising agent.

This may not matter too much in practice, however. The agent is like an investor on the stock market who wants to purchase a lot of the long-term stock options, but has no current interest in any stocks. However, given that other people are interested in stocks, it would be stupid to buy and sell them at prices too divergent from the majority opinion, even if the agent doesn’t itself value them. General measures against blackmail or exploitation might also help here.

Model of unlosing agents

3 Stuart_Armstrong 02 August 2014 07:59AM

Some have expressed skepticism that "unlosing agents" can actually exist. So to provide an existence proof, here is a model of an unlosing agent. It's not a model you'd want to use constructively to build one, but it's sufficient for the existence result.

Let D be the set of all decisions the agent has made in the past, let U be the set of all utility functions that are compatible with those decisions, and let P be a "better than" relationship on the set of outcomes (possibly intransitive, dependent, incomplete, etc...).

By "utility functions that are compatible with those decisions" I mean that an expected utility maximising agent with any u in U would reach the same decisions D as the agent actually did. Notice that U starts off infinitely large when D is empty; when the agent faces a new decision d, here is a decision criterion that leaves U non-empty:

  1. Restrict to the set of possible decision choices that would leave U non-empty. This is always possible, as any u in U would advocate for a particular decision choice du at d, and therefore choosing du would leave u in the updated U. Call this set compatible.
  2. Among those compatible choices, choose one that is the least incompatible with P, using some criterion (such as needing to do the least work to remove intransitivities and dependences and so on).
  3. Make that choice, update P as in step 2, and update D and U (leaving U non-empty, as seen in step 1).
  4. Proceed.
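
The steps above can be sketched for a toy case. This is a heavily simplified illustration, not a constructive design: it assumes U is a finite list, outcomes are deterministic, each utility is a dict from option to value, and P is reduced to a plain ranking list (so the updating of P is omitted). All names are hypothetical.

```python
def best_options(u, options):
    """Options that a u-maximiser could choose (all maximisers of u)."""
    top = max(u[o] for o in options)
    return {o for o in options if u[o] == top}

def unlosing_step(U, options, preference_rank):
    """One decision: returns the choice and the updated, still non-empty U."""
    # Step 1: keep only choices that at least one remaining u would endorse.
    compatible = set().union(*(best_options(u, options) for u in U))
    # Step 2: among those, take the one ranked highest by the (simplified) P.
    choice = min(compatible, key=preference_rank.index)
    # Step 3: keep exactly the utilities for which that choice was optimal.
    U_next = [u for u in U if choice in best_options(u, options)]
    return choice, U_next

# Hypothetical example: two candidate utilities and a P that would prefer "c".
U0 = [{"a": 1, "b": 0, "c": 0}, {"a": 0, "b": 1, "c": 0}]
CHOICE, U1 = unlosing_step(U0, ["a", "b", "c"], ["c", "b", "a"])

if __name__ == "__main__":
    print(CHOICE, U1)
```

In the example, P's favourite option "c" is ruled out because no remaining utility would choose it, so the agent takes P's next favourite among the compatible options, and U shrinks but stays non-empty.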

That's the theory. In practice, we would want to restrict the utilities initially allowed into U to avoid really stupid utilities ("I like losing money to people called Rob at 15:46.34 every alternate Wednesday if the stock market is up; otherwise I don't."). When constructing the initial P and U, a good start would be to look only at categories that humans naturally express preferences between. But those are implementation details. And again, using this kind of explicit design violates the spirit of unlosing agents (unless the set U is defined in ways that are different from simply listing all u in U).

The proof that this agent is unlosing is that a) U will never be empty, and b) for any u in U, the agent will have behaved indistinguishably from a u-maximiser.

Expected utility, unlosing agents, and Pascal's mugging

19 Stuart_Armstrong 28 July 2014 06:05PM

Still very much a work in progress

EDIT: model/existence proof of unlosing agents can be found here.

Why do we bother about utility functions on Less Wrong? Well, because of results of the New man and the Morning Star, which showed that, essentially, if you make decisions, you better use something equivalent to expected utility maximisation. If you don't, you lose. Lose what? It doesn't matter: money, resources, whatever. The point is that any other system can be exploited by other agents or the universe itself to force you into a pointless loss. A pointless loss is a loss that gives you no benefit or possibility of benefit - it's really bad.

The justifications for the axioms of expected utility are, roughly:

  1. (Completeness) "If you don't decide, you'll probably lose pointlessly."
  2. (Transitivity) "If your choices form loops, people can make you lose pointlessly."
  3. (Continuity/Archimedean) This axiom (and acceptable weaker versions of it) is much more subtle than it seems; "No choice is infinitely important" is what it seems to say, but " 'I could have been a contender' isn't good enough" is closer to what it does. Anyway, that's a discussion for another time.
  4. (Independence) "If your choices aren't independent, people can expect to make you lose pointlessly."
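
The transitivity justification is the standard "money pump" argument, which can be sketched as follows. The example is illustrative and not taken from the post: the cyclic preferences, the fee, and the function name run_money_pump are assumptions.

```python
# (x, y) means the agent strictly prefers holding x to holding y.
# These preferences form a loop: A over B, B over C, C over A.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}

def run_money_pump(start_item, offers, fee):
    """Offer swaps in turn; the agent pays `fee` for any swap it prefers."""
    item, money = start_item, 0
    for offered in offers:
        if (offered, item) in PREFERS:
            item = offered
            money -= fee
    return item, money

FINAL_ITEM, FINAL_MONEY = run_money_pump("A", ["C", "B", "A"], fee=1)

if __name__ == "__main__":
    print(FINAL_ITEM, FINAL_MONEY)
```

After three swaps, each of which the agent regards as an improvement, it holds the item it started with and has paid three fees: a pointless loss in the sense used above.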

 

Equivalency is not identity

A lot of people believe a subtly different version of the result:

  • If you don't have a utility function, you'll lose pointlessly.

This is wrong. The correct result is:

  • If you don't lose pointlessly, then your decisions are equivalent to having a utility function.