Agents detecting agents: counterfactual versus influence
A putative new idea for AI control; index here. Status: still somewhat underdeveloped.
In this post, I want to explore another avenue for controlling the ongoing influence of an AI (through subagents or other means). It is the idea, touched upon here and here, of figuring out whether the AI's existence was important or not for predicting the future. Informally, if you learned whether the AI was turned on or not, would you consider that information valuable or almost irrelevant?
Agent vs agent
This post rests on two insights. The first is that, though it is hard to define what an agent or subagent is, we can produce examples of them. Or at least, at the point where we are capable of producing AIs that might be dangerous, we can likely produce examples of them. This can give us a scale for comparing the importance or impact of different agents.
The second is that if we produce an agent with certain motivations, it will likely expand and fill its light cone with... stuff (of some type or other). Those who don't agree with the AI risk thesis may still accept that we could program universe-optimising AIs (ie "we could make a paperclipper, if we were stupid enough to do so").
Why is this relevant? Well, consider an agent E (expansionist) that fills its light cone with stuff (likely including manufacturing and tracking devices), eventually at very high speed. And imagine that there is another agent H (hidden) inside that light cone. At some point, agent E will either detect agent H (and make plans to deal with it), or find its plans hampered by agent H, or agent H will go into deep hiding that does not visibly affect agent E's plans.
Therefore, an expansionist agent is a perfect tool for detecting the influence of other agents. Either agent E will have difficulty accomplishing its plans, or agent H's influence will be so limited that agent E cannot find any trace of it. Any influential agent will get detected, sooner or later, by an expansionist agent. So if E finds H unimportant, then H almost certainly is.
Of course, we don't really want expansionist agents like E - but these can be counterfactual.
The stage and the players
We'll now try and define precisely how we can use this idea to control the influence of an agent, setting out the axioms and assumptions needed along the way. Like some mathematics paper, it might be useful to skip lightly over this section and return later.
Let S(u) be the AI whose influence we want to control. The utility u is what gives it its preferences, and S() indicates that we wish it to be a satisficer or reduced impact AI. Let M(v) be the counterfactual agent that we wish to use to detect S(u)'s influence, for some utility v. The M() designates the agent as a maximiser, without any control on its impact. If we can define S(), we can probably define M() by simply removing its reduced impact restrictions.
Let X be an event leading to S(u) being turned on, Y(v) an event leading to M(v) being turned on. We may as well assume that X and Y(v) are outside each other's light cones (in practice, we will set P(X)≈1 and P(Y(v))≈0). Then the first assumption is:
- A.1. We can define S(u), M(v), X and Y(v).
Another useful tool might be a weak maximising agent: one that attempts to maximise its utility v, and will do so to some extent, but that can easily be thwarted by humans or others. It's likely that by the time we can build S(u), we can build a weak maximiser, designated by WM(v), with the event WY(v) denoting its turning on:
- A.2. We can define WM(v) and WY(v).
We might also be able to define an abstract, Platonic superagent. This is an idealised agent with immense computing power (but that doesn't exist in the real world). Call this AM; obviously, there is no event leading to AM being 'turned on'. We would want AM's goals to be defined in terms of 'estimating the utility/probability/expectation of certain events'.
- A.3. We can define AM to estimate what we need to have estimated.
Of course, we need to be able to point to these various definitions from within S(u)'s code, so:
- A.4. If we can define concepts, we can also define them within the various agents.
We also need agents to be able to say sensible things about the behaviour of the other agents:
- A.5. If these agents are defined, S(u), WM(v), M(v) and AM can estimate each other's behaviours.
This could be done by using, for example, the reflexive probability of Paul Christiano and Benja Fallenstein.
Now, how about v? We cannot choose any v. We want M(v) to not think that X (hence S(u)'s existence) is an important fact to know. But if v is defined by "if X, build staples; if ¬X, build paperclips", then X will be very important! So we need to define a class V of utility functions that are indifferent to S(u) and X. Or not quite indifferent, but that don't care intrinsically about them - they certainly should care if S(u) eats all their resources! See the ideas here for approaches to do this.
- A.6. We can define V, a class of utility functions that do not care intrinsically about S(u) and X.
Ideally we'd want this class to be as broad as possible - certainly too large for anything in this universe to parse element by element.
Now let's turn to the issue of measuring "importance". Define e to be evidence about the universe prior to X, Y(v), or WY(v). We want X (hence the creation of S(u)) to be unimportant, ie of little information concerning the future. It might be possible to have a universal definition of importance of certain facts:
- A.7. We have a universal measure I of importance, which, given background evidence e, can rank the importance of knowing certain facts versus knowing others.
If we do not have such an I (and I feel it's moderately unlikely that one exists, though we may be able to find an I' which would be good enough to work for our purposes), we can use a utility v to estimate the importance of an event W. The trick is to compare E(v|W) with E(v|¬W), which would give an (uncalibrated) estimation of the importance of W to v.
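As a rough illustration of this comparison, the sketch below computes |E(v|W) - E(v|¬W)| on a toy discrete distribution. The world table, the utility values and the function names are all assumptions invented for the example, not part of the proposal itself.

```python
from typing import Callable, Dict, Tuple

# A toy "world" is a pair (w, outcome): w records whether event W happened.
World = Tuple[bool, str]


def conditional_expectation(
    joint: Dict[World, float], v: Callable[[str], float], w_value: bool
) -> float:
    """E(v | W = w_value) under the joint distribution `joint`."""
    mass = sum(p for (w, _), p in joint.items() if w == w_value)
    if mass == 0:
        raise ValueError("conditioning on a probability-zero event")
    total = sum(p * v(o) for (w, o), p in joint.items() if w == w_value)
    return total / mass


def importance(joint: Dict[World, float], v: Callable[[str], float]) -> float:
    """Uncalibrated importance of W to v: |E(v|W) - E(v|not W)|."""
    return abs(
        conditional_expectation(joint, v, True)
        - conditional_expectation(joint, v, False)
    )


# Hypothetical numbers: W shifts probability towards the "many" outcome.
toy_joint = {
    (True, "many"): 0.4,
    (True, "few"): 0.1,
    (False, "many"): 0.1,
    (False, "few"): 0.4,
}


def paperclip_utility(outcome: str) -> float:
    return 10.0 if outcome == "many" else 1.0


print(importance(toy_joint, paperclip_utility))
```

The number is only meaningful relative to other events scored with the same v, which is why the estimate is called uncalibrated.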
Right, we're now ready to start doing things.
Predicting... what?
What are we trying to predict here? Well, estimates of the importance of certain events. Specifically, the importance of X (which leads to the creation of S(u)) versus other events. For instance, if Y(v) designates the creation of M(v), we would want to compare the importance of X and Y(v) in (at least) four ways:
- I(Y(v) | X∧e)
- I(Y(v) | ¬X∧e)
- I(X | Y(v)∧e)
- I(X | ¬Y(v)∧e)
These measure the importance of Y(v) (hence M(v)) and X (hence S(u)) in universes where the other agent does or does not exist. If we expect S(u) to be of low impact, we would want things like I(Y(v) | X∧e) >> I(X | Y(v)∧e) (S(u) is much less important than the maximiser M(v)) and |I(X | Y(v)∧e) - I(X | ¬Y(v)∧e)| ≈ 0 (the importance of S(u) is unrelated to whether M(v) exists or not).
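The two desiderata can be written as a simple check for a single v. Since ">>" and "≈ 0" come with no agreed thresholds, `dominance_ratio` and `tolerance` below are placeholder assumptions, as are the function name and the sample numbers.

```python
def looks_low_impact(
    i_y_given_x: float,      # I(Y(v) | X∧e)
    i_x_given_y: float,      # I(X | Y(v)∧e)
    i_x_given_not_y: float,  # I(X | ¬Y(v)∧e)
    dominance_ratio: float = 100.0,  # assumed stand-in for ">>"
    tolerance: float = 0.01,         # assumed stand-in for "≈ 0"
) -> bool:
    """Check both low-impact desiderata for one utility v."""
    maximiser_dominates = i_y_given_x >= dominance_ratio * i_x_given_y
    unaffected_by_maximiser = abs(i_x_given_y - i_x_given_not_y) <= tolerance
    return maximiser_dominates and unaffected_by_maximiser


# The maximiser dominates and X barely matters either way.
print(looks_low_impact(50.0, 0.02, 0.025))
# X matters a lot once M(v) exists, so the first condition fails.
print(looks_low_impact(50.0, 5.0, 0.02))
```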
Now, any particular v might be misleading, so we would want to check that this result held up for all v ∈ V. We could take a maximum over V, or a softmax.
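A minimal sketch of the two aggregation options, assuming that "softmax" here means a smooth maximum (log-sum-exp) rather than the normalised-probability vector of the same name, and assuming we only have a finite sample of V to work with. The scores are made up.

```python
import math
from typing import Sequence


def hard_max(scores: Sequence[float]) -> float:
    """Worst case over the sampled utilities."""
    return max(scores)


def smooth_max(scores: Sequence[float], temperature: float = 1.0) -> float:
    """Log-sum-exp smooth maximum; tends to max(scores) as temperature -> 0."""
    top = max(scores)
    return top + temperature * math.log(
        sum(math.exp((s - top) / temperature) for s in scores)
    )


# Hypothetical importance scores I(X | Y(v)∧e) for a few sampled v in V.
sampled_importance = [0.01, 0.03, 0.02, 0.40]
print(hard_max(sampled_importance))
print(smooth_max(sampled_importance, temperature=0.05))
```

The smooth version never falls below the hard maximum, and exceeds it by at most temperature × log(number of samples), so it stays sensitive to several moderately disrupted v's rather than only the single worst one.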
If WM(v) can be defined, we would want to do the same estimates with WY(v) instead of Y(v), and desire I(WY(v) | X∧e) >> I(X | WY(v)∧e) (S(u) is less important than the weak maximiser WM(v)) and |I(X | WY(v)∧e) - I(X | ¬WY(v)∧e)| ≈ 0 (the importance of S(u) is unrelated to whether WM(v) exists or not), again maxing or softmaxing across V.
Who predicts, how?
In the previous section, we used I to compute the importance of S(u) versus other agents. But what if axiom A.7 is wrong, and we don't have such an I? Well, then we can measure the importance of these events, given a particular v, using AM, writing AM_v(X | Y(v')∧e) for |E_AM(v | X∧Y(v')∧e) - E_AM(v | ¬X∧Y(v')∧e)|, the expected difference (that AM estimates) in v given X versus ¬X (ie dependent on whether S(u) was turned on or not). More precisely, we'll want to use S(u)(AM_v(X | Y(v')∧e)) - the expected difference (that S(u) estimates AM estimates) in v etc...
Here we'll want to do a double max/softmax. First over v', so that we consider all M(v') for v' ∈ V. Then over v, so that AM estimates the expected difference to all possible v. In this way, the max/softmax of AM_v functions like an I. But not a perfect one: just one that detects disruptions to the multitude of v's in V. Therefore, we are essentially using the set V as the twenty million question method for the impact of S(u). We need to take extra care with this, to ensure we haven't left any gaping holes where S(u) can still have large impact.
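The double maximum can be sketched over a small table of estimates. The table layout, the utility labels and the numbers are assumptions for illustration; in the proposal V is far too large to enumerate like this.

```python
from typing import Dict, Tuple

# estimates[v_prime][v] stands for S(u)'s estimate of AM's expected
# difference in v between X and ¬X, given Y(v') and e.
Estimates = Dict[str, Dict[str, float]]


def worst_case_disruption(estimates: Estimates) -> Tuple[float, str, str]:
    """Max over v' and over v; also report which (v', v) pair attains it."""
    return max(
        (value, v_prime, v)
        for v_prime, row in estimates.items()
        for v, value in row.items()
    )


toy_estimates: Estimates = {
    "paperclips": {"paperclips": 0.02, "staples": 0.01},
    "staples": {"paperclips": 0.03, "staples": 0.50},
}

print(worst_case_disruption(toy_estimates))
```

Returning the pair alongside the value shows which "question" in V detected the largest disruption, which may help when hunting for gaps in V.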
Now, if axiom A.3 was false, we need to use M(v) instead of AM_v to estimate this quantity. We need to do this with some care. M(v) is (counterfactually) a real agent. Why would it want to calculate the importance of M(v') and S(u), and why would S(u) care about this? Well, we can do so by constructing the scenario carefully. Maybe M(v') and S(u) are due to be created in the future, and M(v) is a boxed agent that can choose to prevent Y(v') and X. We modify v to v#, so that M(v#) wants to maximise v, but takes specific numerical penalties if it chooses to prevent Y(v') or X. By modifying these numerical penalties, M(v#) is motivated to figure out the importance of M(v') and S(u), and to reveal its opinions through its actions. If we create an S(u#), also boxed, that could override M(v#) at a certain cost and create a specific S(u), we can motivate it to model M(v#)'s actions. From this we can construct an importance metric using the M(v).
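One way to picture the penalty trick: treat the boxed agent as a black box that either pays a penalty to prevent X or declines, then sweep the penalty to bracket how much it thinks X matters. This is a toy sketch; the decision rule of the stand-in agent, the hidden loss of 3.7 and the function names are all assumptions.

```python
from typing import Callable, Optional, Sequence, Tuple


def bracket_importance(
    prevents_x: Callable[[float], bool],
    penalties: Sequence[float],
) -> Tuple[Optional[float], Optional[float]]:
    """Bracket the importance the boxed agent assigns to X.

    Returns (largest penalty it still pays, smallest penalty it refuses).
    Either end is None if the sweep never observed that behaviour.
    """
    paid = [p for p in penalties if prevents_x(p)]
    refused = [p for p in penalties if not prevents_x(p)]
    return (max(paid) if paid else None, min(refused) if refused else None)


# Stand-in for M(v#): it privately estimates that X costs it 3.7 units of v
# and prevents X exactly when that loss exceeds the penalty.
HIDDEN_LOSS = 3.7


def boxed_agent(penalty: float) -> bool:
    return HIDDEN_LOSS > penalty


print(bracket_importance(boxed_agent, [1.0, 2.0, 3.0, 4.0, 5.0]))
```

A finer penalty grid narrows the bracket; the agent's actions reveal its estimate without it ever reporting a number directly.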
The WM(v), on the other hand, are probably too limited to be used effectively as predictors.
Arbitrary agents
One issue with this method is that Y(v) and WY(v) refer to the creation of specific agents, in the specific ways we've set up. For instance, |I(X | Y(v)∧e) - I(X | ¬Y(v)∧e)| ≈ 0 means that the importance of X doesn't vary much given Y(v) or its opposite. What this really means is that I(X | ¬Y(v)∧e) is bounded by I(X | Y(v)∧e) - but that might be quite high. There may be details of the setup that allow S(u) to have an important role given the specific M(v) created in this way - for instance, if it reached out to try and destroy it. Then I(X | Y(v)∧e) could be high, hence I(X | ¬Y(v)∧e) could be as well. The restrictions with >> could still contain S(u), but >> is ill-defined.
More secure would be if Y(v) could be defined as the creation of an agent M(v) in an abstract way. Maybe something like "is 'this object is a v maximiser' a good model of this object's properties and impact?". We would also want to give M(v) some starting resources. Again, note it is very hard to abstractly define what "resources" mean, but considerably easier to define a set of "useful resources" we could give to an agent. We would probably want to bound this definition in time and space as well.
This gives us the next axioms:
- A.8. Y(v) defines the generic existence of an agent M(v).
- A.9. WY(v) defines the generic existence of an agent WM(v).
With these axioms, our measure of importance becomes more general: S(u) is less important than abstract agents M(v) or WM(v), rather than certain specific ones. Note that we don't need to worry that S(u) snuffs out any hypothetical M(v): since the importance of S(u) is bounded below that of M(v), that would reduce S(u) to zero importance.
Extending to humans
What other agents do we know in the world? How about those 7+ billion ones wandering about on its surface? We can certainly define events Z(h) that might lead to the death of a particular human h - of course, we'd want Z(h) as low as feasibly possible!
- A.10. We can define events Z(h) that correspond to the death of a human h.
Then we can use Z(h) instead of Y(v) and WY(v), to ensure that S(u) has importance comparable with a human. Of course, the ideal would be to define, abstractly, the death (or birth+maturation) of a human in a generic way!
- A.11. We can define events Z(h) that correspond generically to the death or birth+maturation of a human h.
To summarise the various ways we can define the importance of S(u) (and hence, by integrating them into S(u)'s code, constrain S(u)'s importance and influence), depending on the axioms we have:

Chatbots or set answers, not WBEs
In a previous post, I talked about using a WBE to define a safe output for a reduced impact AI.
I've realised that the WBE isn't needed. Its only role was to ensure that the AI's output could have been credibly produced by something other than the AI - "I'm sorry, Dave. I'm afraid I can't do that." is unlikely to be the output of a random letter generator.
But a whole WBE is not needed. If the output is short, a chatbot with access to a huge corpus of human responses could function well. We can specialise it in the direction we need - if we are asking for financial advice, we can mandate a specialised vocabulary or train it on financial news sources.
So instead of training the reduced impact AI to behave as the 'best human advisor', we are training it to behave as the 'luckiest chatbot'. This allows us to calculate odds with greater precision, and has the advantage of not needing to wait for a WBE.
For some questions, we can do even better. Suppose we have a thousand different stocks, and are asking which one would increase in value the most during the coming year. The 'chatbot' here is simply an algorithm that picks a stock at random. So we now have an exact base rate - 1/1000 - and predetermined answers from the AI.
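The stock example is simple enough to write down directly. In this sketch the ticker names are placeholders and the function names are hypothetical; the only fact taken from the text is that the fallback is a uniform random pick among a thousand predetermined answers.

```python
import random
from fractions import Fraction
from typing import Optional, Sequence


def random_picker(stocks: Sequence[str], rng: random.Random) -> str:
    """The 'chatbot': picks one stock uniformly at random."""
    return rng.choice(stocks)


def base_rate(stocks: Sequence[str]) -> Fraction:
    """Exact chance that the random picker names any particular stock."""
    return Fraction(1, len(stocks))


def box_output(
    stocks: Sequence[str], ai_choice: Optional[str], rng: random.Random
) -> str:
    """What leaves the box: the AI's pick if it ran, else a random pick."""
    return ai_choice if ai_choice is not None else random_picker(stocks, rng)


stocks = [f"STOCK{i:04d}" for i in range(1000)]  # placeholder tickers
print(base_rate(stocks))  # 1/1000
```

Because every possible output is one of the predetermined answers, the AI has no free-form channel in which to hide anything.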
[EDIT:] Another alternative is to get online users to submit answers to the question. Then the AI selects the best answer from the choices. And if the AI is not turned on, a random answer is selected.
The AI, the best human advisor
All my ideas on reduced impact AI follow the same format: some restriction to ensure a reduced impact (I generally use something like "could a future AI detect this one existed"?) and some "trick" to allow a way of still achieving a larger impact. Furthermore, there is always a "nice" way of achieving this larger impact (do the thing we really want you to do, and only that!), so that we hope the AI can be forced into that behaviour. Examples of "tricks" are:
- It could be an output channel which does not count against reduced impact.
- It could involve several AIs that combine to have larger impact, with a conditional motivation to "fool" the reduced impact requirement.
- It could be a variant of the above where we attempt to generalise the conditional motivation in a sensible manner.
In this post, I'll present another variant - one that I think is the most interesting so far. I won't change the reduced impact condition, but will present a different "trick".
AI pretends to be human
As in this post, imagine that we have a reduced-impact AI in a box that conceals it from us. The AI has some goal, and will output some data to further that goal, then turn itself off. And then we'll shake the box around or something, so we can't be sure that the AI was turned on.
There are two issues to deal with, here: a) The message itself is strong evidence that the AI was turned on, and b) if we exclude the message from the reduced impact, the message becomes dangerous.
To deal with a), first of all. This approach gets much easier if we have a whole brain emulation (WBE). Then we can house the WBE in the box with the AI. If the AI doesn't send out a message, then the WBE sends out its own message instead. There are variants we can try here if we don't have WBE - maybe some human volunteer willing to either die or take amnesia pills at the end of the experiment.
So, receiving a coherent message is not a surprising thing to come out of the output channel. However, an AI motivated strongly by reduced impact will simply seek to imitate as precisely as possible what the WBE would send. We still need a trick to get something useful.
AI pretends to be BEST human
Imagine that we wanted the AI to give us investment advice to make us rich. Say, a binary task: get us ten million dollars over the course of the year.
Now, this is the kind of thing that the WBE could conceivably achieve. We could run the WBE fast, for three subjective weeks, and it has a chance of giving us advice that would get us that rich. Assume that it only has 0.1% chance of doing so. The AI, on the other hand, could produce advice that makes us that rich with almost certainty.
Therefore if we get advice, follow it, and end up at least ten million dollars richer, we can conclude that the likelihood ratio in favour of the AI having given us the advice (rather than the WBE doing so) is about 1000:1.
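The update can be checked with a few lines of arithmetic. Using the figures from the text (the WBE succeeds with probability 0.1%, the AI with near certainty), the likelihood ratio comes out at roughly 1000 in favour of the AI. The even prior odds and the function names are assumptions added for the example.

```python
def likelihood_ratio(p_rich_given_ai: float, p_rich_given_wbe: float) -> float:
    """How strongly 'we got rich' favours the AI over the WBE as the author."""
    return p_rich_given_ai / p_rich_given_wbe


def update_odds(prior_odds_ai: float, ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds_ai * ratio


ratio = likelihood_ratio(p_rich_given_ai=1.0, p_rich_given_wbe=0.001)
posterior = update_odds(prior_odds_ai=1.0, ratio=ratio)  # assumed 1:1 prior
print(ratio, posterior)
```

Conditioning on the outcome "we became rich" is what removes this factor from the reduced impact calculation: the question becomes what evidence remains beyond it.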
And this gives us the solution to b). The "trick", the escape clause, is by conditioning on the outcome of our investments. So we're asking: given that we have become rich (and given the other facts about the universe in 50 years time) what additional evidence is there that the AI was turned on?
Now, a sensible WBE is almost certain to not come up with a scheme that destroys the universe with self-tiling agents - or, indeed, does anything too crazy. We could instruct the WBE to this effect, and we're only running it for three weeks, which prevents the kind of drifts that you see in more extreme situations. Thus a message from the WBE, no matter how likely it is to make us rich, is almost certain to not lead to tiling the universe with self-replicating nano-bots.
So here, the AI merely has to come up with a message that could have come from the WBE, with at least 0.01% probability, and that would indeed make us rich. And, therefore, the message cannot contain secret code that would optimise the universe.
Note that we need the goal we are looking for to be not too extreme in impact (the higher the impact of the goal, the more latitude the AI has to have impact itself, "hiding" inside the goal's impact), and we should restrict to events of probability, say, greater than 1/10,000 (as when the probability gets too small, odd events start happening).
It seems that this scheme could result in AIs that always give the very best possible advice a human (or WBE) could ever give, which would be a very useful result.
Whose probability?
We have a probability distribution over events. The AI itself will have a probability distribution over events. The future hypothetical super-super-AI it is using to compute reduced impact has a probability distribution over events - and the AI has a probability distribution over that probability distribution. If all of them agree on the probability of us getting richer (given WBE advice and given not), then this scheme should work.
If they disagree, there might be problems. A more complex approach could directly take into account the divergent probability estimates; I'll think of that and return to the issue later.
Presidents, asteroids, natural categories, and reduced impact
EDIT: I feel this post is unclear, and will need to be redone again soon.
This post attempts to use the ideas developed about natural categories in order to get high impact from reduced impact AIs.
Extending niceness/reduced impact
I recently presented the problem of extending AI "niceness" given some fact X, to niceness given ¬X, choosing X to be something pretty significant but not overwhelmingly so - the death of a president. By assumption we had a successfully programmed niceness, but no good definition (this was meant to be "reduced impact" in a slight disguise).
This problem turned out to be much harder than expected. It seems that the only way to do so is to require the AI to define values dependent on a set of various (boolean) random variables Zj that did not include X/¬X. Then as long as the random variables represented natural categories, given X, the niceness should extend.
What did we mean by natural categories? Informally, it means that X should not appear in the definitions of these random variables. For instance, nuclear war is a natural category; "nuclear war XOR X" is not. Actually defining this was quite subtle; diverting through the grue and bleen problem, it seems that we had to define how we update X and the Zj given the evidence we expected to find. This was put in equation as picking Zj's that minimize
- Variance{log[ (P(X∧Z|E)*P(¬X∧¬Z|E)) / (P(X∧¬Z|E)*P(¬X∧Z|E)) ]}
where E is the random variable denoting the evidence we expected to find. Note that if we interchange X and ¬X, the ratio inverts, the log changes sign - but this makes no difference to the variance. So we can equally well talk about extending niceness given X to ¬X, or niceness given ¬X to X.
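A small numerical sketch of this quantity, including the symmetry under swapping X and ¬X. It assumes the evidence E is represented by a finite list of equally weighted samples, each giving the four joint probabilities; the sample values and function names are made up.

```python
import math
from statistics import pvariance
from typing import Sequence, Tuple

# (P(X∧Z|E), P(X∧¬Z|E), P(¬X∧Z|E), P(¬X∧¬Z|E)) for one evidence sample E
Joint = Tuple[float, float, float, float]


def log_cross_ratio(joint: Joint) -> float:
    """log[ P(X∧Z|E) P(¬X∧¬Z|E) / (P(X∧¬Z|E) P(¬X∧Z|E)) ]."""
    p_xz, p_x_nz, p_nx_z, p_nx_nz = joint
    return math.log((p_xz * p_nx_nz) / (p_x_nz * p_nx_z))


def naturalness_penalty(samples: Sequence[Joint]) -> float:
    """Variance over evidence samples (equal weights assumed); lower is better."""
    return pvariance([log_cross_ratio(j) for j in samples])


def swap_x(joint: Joint) -> Joint:
    """Relabel X as ¬X and vice versa."""
    p_xz, p_x_nz, p_nx_z, p_nx_nz = joint
    return (p_nx_z, p_nx_nz, p_xz, p_x_nz)


samples = [(0.3, 0.2, 0.3, 0.2), (0.1, 0.4, 0.2, 0.3), (0.25, 0.25, 0.25, 0.25)]
original = naturalness_penalty(samples)
swapped = naturalness_penalty([swap_x(j) for j in samples])
print(original, swapped)
```

Swapping flips the sign of every log ratio, which leaves the variance unchanged, so the two printed values agree up to rounding.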
Perfect and imperfect extensions
The above definition would work for a "perfectly nice AI". That could be an AI that would be nice, given any combination of estimates of X and Zj. In practice, because we can't consider every edge case, we would only have an "expectedly nice AI". That means that the AI can fail to be nice in certain unusual and unlikely edge cases, in certain strange sets of values of Zj that almost never come up...
...or at least, that almost never come up, given X. Since the "expected niceness" was calibrated given X, such an expectedly nice AI may fail to be nice if ¬X results in a substantial change in the probability of the Zj (see the second failure mode in this post; some of the Zj may be so tightly coupled to the value of X that an expected niceness AI considers them fixed, and this results in problems if ¬X happens and their values change).
One way of fixing this is to require that the "swing" of the Zj be small upon changing X to ¬X or vice versa. Something like, for all values of {aj}, the ratio P({Zj=aj} | X) / P({Zj=aj} | ¬X) is contained between 1/100 and 100. This means that a reasonably good "expected niceness" calibrated on the Zj will transfer from X to ¬X (though the error may grow). This approach has some other advantages, as we'll see in the next section.
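The swing restriction can be sketched as a check over two conditional distributions on the joint values of the Zj. The dictionary representation, the treatment of zero probabilities and the example numbers are assumptions for illustration.

```python
from typing import Dict, Hashable


def swing_within_bounds(
    p_given_x: Dict[Hashable, float],
    p_given_not_x: Dict[Hashable, float],
    bound: float = 100.0,
) -> bool:
    """True if P(a | X) / P(a | ¬X) lies in [1/bound, bound] for every a."""
    for assignment in set(p_given_x) | set(p_given_not_x):
        px = p_given_x.get(assignment, 0.0)
        pn = p_given_not_x.get(assignment, 0.0)
        if px == 0.0 and pn == 0.0:
            continue  # impossible either way, so no swing to measure
        if px == 0.0 or pn == 0.0:
            return False  # infinite swing
        if not (1.0 / bound <= px / pn <= bound):
            return False
    return True


# A variable whose odds shift by a factor of 10: acceptable.
mild = swing_within_bounds({"hit": 0.05, "miss": 0.95}, {"hit": 0.5, "miss": 0.5})
# A variable that is near-certain given X and near-impossible given ¬X: rejected.
wild = swing_within_bounds(
    {"seen": 0.99, "unseen": 0.01}, {"seen": 1e-6, "unseen": 1 - 1e-6}
)
print(mild, wild)
```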
Of course, problems arise if the Zj are defined by smashing together events relatively independent of X with some that are very strongly dependent on X, into the same variable. Thus we'll further require that Zj cannot be decomposed into natural subvariables whose values swing more under changes from X to ¬X.
High impact from reduced impact
With this, we are finally in a position to define a reduced impact AI design that can have a large impact. Imagine that an asteroid is coming to hit the Earth. Two AIs are charged with targeting a laser at it. The laser isn't perfectly effective, and the asteroid isn't perfectly certain to hit the Earth anyway: let's say aiming the laser correctly will reduce the odds of impact by a factor of 10 or so (well within the probability swing restrictions of the previous section).
In the original formulation of this, the AIs were tasked with outputting the x and y coordinates to aim the laser. Here, instead, they'll be tasked with constructing the motivation of a robotic subagent that will move around for a while, see things and interact, before physically aiming the laser in the right direction. This task is harder.
Then the top level AI charged with aiming the y coordinates is given the following task:
- It is motivated to cause the correct y-coordinate to be outputted.
- If the x-coordinate AI is not turned on (call this event ¬X), it is strongly motivated to have reduced impact.
- It must phrase the robotic subagent's utility function in terms of the Zj that obey all the properties above.
- The evidence E considered for the equation governing the naturalness of Zj is considered (for example) to be a random sampling of space at the time both AIs are turned on. Most importantly, this is defined to be a theoretical process that the AI cannot affect physically.
So what will the AI do? It will produce a robot that will try to aim the y-coordinate of the laser correctly, given ¬X, and not expand dangerously. However, we've set things up so that X is virtually certain, so the robot will see another, x-aiming robot, wander around. The chances of this happening, given ¬X, are minuscule, so the swing of the probability from ¬X to X is huge. So, by the restrictions we've put on the Zj, the robot will not be motivated to change its behaviour just because it sees the other robot.
And if the laser is fired, the impact on humanity will, conversely, not be that large, so well within the parameters of the Zj. Thus the "general reduced impact" portion of the robot's motivation will be preserved.
And, of course, the x-robot will have the mirror image motivations, so they will both aim correctly, it seems. This should evade the problem described here.
Failure modes
There are three ways the approach could fail: one bad way, and two graceful ways. The bad way is if the "no decomposition" requirements (one of them is presented above, the other is in this post) fail. In this case, the niceness or reduced impact are not guaranteed to be extended to ¬X, and the AI's behaviour can become undefined. Note that this is not bad optimisation pressure, just an absence of optimisation pressure - but that could still be potentially dangerous.
The first graceful failure happens if the AI is not able to create a reduced impact sub-agent given the restrictions on the Zj above. In that case, the AI does nothing. The second graceful failure happens if the AI evades our attempts to increase its impact, given ¬X. In that case, it simply becomes a reduced impact AI that does little. Not ideal, but not deadly.
Overall status: I'm not sure the idea is sound, at least not yet. Critiques welcome.
High impact from low impact, continued
The idea of splitting a high impact task between two low-impact AIs has one critical flaw. AI X is aiming for low impact, conditional on ¬Y (the other AI not being turned on, or not outputting a message, or something similar). "Outputting the right coordinates" is one way that X can accomplish its goal. However, there is another way it can do it: "create a robot that will output the right coordinates if ¬Y, and [do something else] if Y."
That's a dangerous situation to be in, especially if we have a more general situation than the "laser aiming at the asteroid". But note that if X does create such a robot, and if ¬Y is actually true, then that robot must be low impact and not dangerous, since that's X's programming. Since X cannot predict all the situations the robot would encounter, the robot is probably generically "safe" and low impact.
Therefore, if the robot behaves the same way under Y and ¬Y, we're good.
How could we achieve that? Well, we could adapt my idea from "restrictions that are hard to hack". If a hypothetical superintelligent AI C observed the output stream from X, could it deduce that Y vs ¬Y was something important in it? If C knew that X was conditioning on ¬Z, but didn't know Z=Y, could it deduce that? That seems like a restriction that we could program into X, as a third component of its utility (the first being the "do what we want" component, the second being the "have a reduced impact conditional on ¬Z" one).
And if we have a "safe" robot, given ¬Y, and the programming of that robot does not (explicitly or implicitly) mention Y or its features, we probably have a safe robot.
The idea still needs to be developed and some of the holes patched, but I feel it has potential.
High impact from low impact
Part of the problem with a reduced impact AI is that it will, by definition, only have a reduced impact.
Some of the designs try and get around the problem by allowing a special "output channel" on which impact can be large. But that feels like cheating. Here is a design that accomplishes the same without using that kind of hack.
Imagine there is an asteroid that will hit the Earth, and we have a laser that could destroy it. But we need to aim the laser properly, so need coordinates. There is a reduced impact AI that is motivated to give the coordinates correctly, but also motivated to have reduced impact - and saving the planet from an asteroid with certainty is not reduced impact.
Now imagine that instead there are two AIs, X and Y. By abuse of notation, let ¬X refer to the event that the output signal from X is scrambled away from the original output.
Then we ask X to give us the x-coordinates for the laser, under the assumption of ¬Y (that AI Y's signal will be scrambled). Similarly, we ask Y to give us the y-coordinates of the laser, under the assumption ¬X.
Then X will reason "since ¬Y, the laser will certainly miss its target, as the y-coordinates will be wrong. Therefore it is reduced impact to output the correct x-coordinates, so I shall." Similarly, Y will output the right y-coordinates, the laser will fire and destroy the asteroid, having a huge impact, hooray!
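A toy model makes the gap between what X believes and what actually happens explicit. Here a scrambled signal is modelled as a uniformly random setting among `N_Y` possibilities; that modelling choice, the value of `N_Y` and the function name are assumptions, so "certainly miss" becomes "miss with overwhelming probability".

```python
def hit_probability(x_is_correct: bool, y_scrambled: bool, n_y_settings: int) -> float:
    """Chance the laser destroys the asteroid on a toy grid of coordinates.

    A scrambled y-signal is modelled as a uniformly random y-setting.
    """
    if not x_is_correct:
        return 0.0
    return 1.0 / n_y_settings if y_scrambled else 1.0


N_Y = 1_000_000  # assumed number of distinguishable y-settings

# What X believes it causes by answering correctly (it assumes ¬Y):
believed = hit_probability(True, y_scrambled=True, n_y_settings=N_Y)
# What actually happens when Y's signal also gets through unscrambled:
actual = hit_probability(True, y_scrambled=False, n_y_settings=N_Y)
print(believed, actual)
```

Each AI evaluates its own impact in the `believed` world, where a correct answer changes almost nothing, while the real outcome is the `actual` one.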
The approach is not fully general yet, because we can have "subagent problems". X could create an agent that behaves nicely given ¬Y (the assumption it was given), but completely crazily given Y (the reality). But it shows how we could get high impact from slight tweaks to reduced impact.
EDIT: For those worried about lying to the AIs, do recall http://lesswrong.com/r/discussion/lw/lyh/utility_vs_probability_idea_synthesis/ and http://lesswrong.com/lw/ltf/false_thermodynamic_miracles/
The germ of an idea
Apologies for posting another unformed idea, but I think it's important to get it out there.
The problem with dangerous AI is that it's intelligent, and thus adapts to our countermeasures. If we did something like plant a tree and order the AI not to eat the apple on it, as a test of its obedience, it would easily figure out what we were doing, and avoid the apple (until it had power over us), even if it were a treacherous apple-devouring AI of DOOM.
When I wrote the AI indifference paper, it seemed that it showed a partial way around this problem: the AI would become indifferent to a particular countermeasure (in that example, explosives), so wouldn't adapt its behaviour around it. It seems that the same idea can make an Oracle not attempt to manipulate us through its answers, by making it indifferent as to whether the message was read.
The idea I'm vaguely groping towards is whether this is a general phenomenon - whether we can use indifference to prevent the AI from adapting to any of our efforts. The second question is whether we can profitably use it on the AI's motivation itself. Something like the reduced impact AI reasoning about what impact it could have on the world. This has a penalty function for excessive impact - but maybe that's gameable, maybe there is a pernicious outcome that doesn't have a high penalty, if the AI aims for it exactly. But suppose the AI could calculate its impact under the assumption that it didn't have a penalty function (utility indifference is often equivalent to having incorrect beliefs, but less fragile than that).
So if it was a dangerous AI, it would calculate its impact as if it didn't have a penalty function (and hence no need to route around it), and thus would calculate a large impact, and get penalised by it.
My next post will be more structured, but I feel there's the germ of a potentially very useful idea there. Comments and suggestions welcome.
What's special about a fantastic outcome? Suggestions wanted.
I've been returning to my "reduced impact AI" approach, and am currently working on some ideas.
What I need is some ideas on features that might distinguish between an excellent FAI outcome, and a disaster. The more abstract and general the ideas, the better. Anyone got some suggestions? Don't worry about quality at this point, originality is more prized!
I'm looking for something generic that is easy to measure. At a crude level, if the only options were "paperclipper" vs FAI, then we could distinguish those worlds by counting steel content.
So basically some more or less objective measure that has a higher proportion of good outcomes than the baseline.
Reduced impact in practice: randomly sampling the future
This was an idea developed along with Eliezer when he was at the FHI in November.
In my post on "reduced impact AI" I talked about how you could estimate the impact of an AI by (hypothetically) observing a slice of the future. When Eliezer was here last November, we talked about how this could be cashed out, and came up with a model.
Ignore the "output channel" from the original reduced impact post, and consider this situation:
We have two computers running in two similarly shielded rooms. One of them is a simple algorithm running through the digits of e (call it run-e), that was soon after deleted. Another is our dear old friend clippy, the AI motivated to maximise paperclips. Without removing clippy's paperclip preferences, we'd like to motivate it to have next to no impact on the world. How should we define "next to no impact"?
The obvious thing is to compare it with run-e, and require it to have no more expected impact than that machine. How could this be measured? Well, the original idea was to put a stochastic process on clippy as it was turned on (inside the shielded room) and say that clippy has reduced impact if we can't find any extra evidence, from a vantage point 50 years in the future, that clippy was successfully turned on. Now put the same stochastic process on run-e and define:
Clippy has reduced impact if, from a vantage of 50 years into the future, we have no more evidence that clippy was turned on than we have of run-e being turned on.
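One way to make "no more evidence" concrete is to compare the strength of the likelihood ratio that the future observations provide for each machine having been turned on. The sketch below is only one possible reading: measuring evidence as the absolute log likelihood ratio, the `slack` parameter and the sample probabilities are all assumptions.

```python
import math


def evidence_strength(p_obs_if_on: float, p_obs_if_off: float) -> float:
    """|log likelihood ratio| of the future observations, in nats."""
    return abs(math.log(p_obs_if_on / p_obs_if_off))


def has_reduced_impact(
    clippy_evidence: float, run_e_evidence: float, slack: float = 0.0
) -> bool:
    """Clippy passes if the future says no more about it than about run-e."""
    return clippy_evidence <= run_e_evidence + slack


# Hypothetical probabilities of what is observed 50 years on, depending on
# whether each machine was successfully turned on.
run_e = evidence_strength(p_obs_if_on=0.5001, p_obs_if_off=0.5000)
quiet_clippy = evidence_strength(p_obs_if_on=0.50005, p_obs_if_off=0.5000)
loud_clippy = evidence_strength(p_obs_if_on=0.9, p_obs_if_off=0.001)

print(has_reduced_impact(quiet_clippy, run_e))
print(has_reduced_impact(loud_clippy, run_e))
```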