Agents detecting agents: counterfactual versus influence
A putative new idea for AI control; index here. Status: still somewhat underdeveloped.
In this post, I want to explore another avenue for controlling the ongoing influence of an AI (through subagents or other means). It is the idea, touched upon here and here, of figuring out whether the AI's existence was important or not for predicting the future. Informally, if you knew whether the AI was turned on or not, would you consider that information valuable or almost irrelevant?
Agent vs agent
This post rests on two insights. The first is that, though it is hard to define what an agent or subagent is, we can produce examples of them. Or at least, at the point where we are capable of producing AIs that might be dangerous, we can likely produce examples of them. This can give us a scale for comparing the importance or impact of different agents.
The second is that if we produce an agent with certain motivations, it will likely expand and fill its light cone with... stuff (of some type or other). Those who don't agree with the AI risk thesis may still accept that we could program universe-optimising AIs (ie "we could make a paperclipper, if we were stupid enough to do so").
Why is this relevant? Well, consider an agent E (expansionist) that fills its light cone with stuff (likely including manufacturing and tracking devices), eventually at very high speed. And imagine that there is another agent H (hidden) inside that light cone. At some point, agent E will either detect agent H (and make plans to deal with it), or find its plans hampered by agent H, or agent H will go into deep hiding that does not visibly affect agent E's plans.
Therefore, an expansionist agent is a perfect tool for detecting the influence of other agents. Either agent E will have difficulty accomplishing its plans, or agent H's influence will be so limited that agent E cannot find any trace of it. Any influential agent will get detected, sooner or later, by an expansionist agent. So if E finds H unimportant, then H almost certainly is.
Of course, we don't really want expansionist agents like E - but these can be counterfactual.
The stage and the players
We'll now try and define precisely how we can use this idea to control the influence of an agent, setting out the axioms and assumptions needed along the way. As with some mathematics papers, it might be useful to skim this section and return to it later.
Let S(u) be the AI whose influence we want to control. The utility u is what gives it its preferences, and S() indicates that we wish it to be a satisficer or reduced impact AI. Let M(v) be the counterfactual agent that we wish to use to detect S(u)'s influence, for some utility v. The M() designates the agent as a maximiser, without any control on its impact. If we can define S(), we can probably define M() by simply removing its reduced impact restrictions.
Let X be an event leading to S(u) being turned on, Y(v) an event leading to M(v) being turned on. We may as well assume that X and Y(v) are outside each other's light cones (in practice, we will set the probability of X to ≈1 and that of Y(v) to ≈0). Then the first assumption is:
- A.1. We can define S(u), M(v), X and Y(v).
Another useful tool might be a weak maximising agent: one that attempts to maximise v, and will do so to some extent, but that can easily be thwarted by humans or others. It's likely that by the time we can build S(u), we can build a weak maximiser, designated by WM(v), with the event WY(v) denoting its turning on:
- A.2. We can define WM(v) and WY(v).
We might also be able to define an abstract, Platonic superagent. This is an idealised agent with immense computing power (but that doesn't exist in the real world). Call this AM; obviously, there is no event leading to AM being 'turned on'. We would want AM's goals to be defined in terms of 'estimating the utility/probability/expectation of certain events'.
- A.3. We can define AM to estimate what we need to have estimated.
Of course, we need to be able to point to these various definitions from within S(u)'s code, so:
- A.4. If we can define concepts, we can also define them within the various agents.
We also need agents to be able to say sensible things about the behaviour of the other agents:
- A.5. If these agents are defined, S(u), WM(v), M(v) and AM can estimate each other's behaviours.
This could be done by using, for example, the reflexive probability of Paul Christiano and Benja Fallenstein.
Now, how about v? We cannot choose just any v. We want M(v) to not think that X (hence S(u)'s existence) is an important fact to know. But if v is defined by "if X, build staples; if ¬X, build paperclips", then X will be very important! So we need to define a class V of utility functions that are indifferent to S(u) and X. Or not quite indifferent: they should not care intrinsically about them, though they certainly should care if S(u) eats all their resources! See the ideas here for approaches to do this.
- A.6. We can define V, a class of utility functions that do not care intrinsically about S(u) and X.
Ideally we'd want this class to be as broad as possible - certainly too large for anything in this universe to parse element by element.
Now let's turn to the issue of measuring "importance". Define e to be evidence about the universe prior to X, Y(v), or WY(v). We want X (hence the creation of S(u)) to be unimportant, ie of little information concerning the future. It might be possible to have a universal definition of importance of certain facts:
- A.7. We have a universal measure I of importance, which, given background evidence e, can rank the importance of knowing certain facts versus knowing others.
If we do not have such an I (and I feel it's moderately unlikely that one exists, though we may be able to find an I' which would be good enough to work for our purposes), we can use a utility v to estimate the importance of an event W. The trick is to compare E(v|W) with E(v|¬W), which would give an (uncalibrated) estimation of the importance of W to v.
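As a rough illustration of that comparison, here is a minimal sketch. It assumes we have a finite list of sampled worlds, each recording whether W occurred and what v is worth there; the function name `importance` and the toy data are hypothetical, and nothing here says how a real agent would obtain such samples.

```python
from statistics import mean

def importance(worlds, event, utility):
    """Uncalibrated importance of an event W to a utility v: |E(v|W) - E(v|not W)|.

    `worlds` is a list of sampled world-states (hypothetical stand-ins),
    `event` maps a world to True/False, `utility` maps a world to a number.
    """
    with_w = [utility(w) for w in worlds if event(w)]
    without_w = [utility(w) for w in worlds if not event(w)]
    if not with_w or not without_w:
        raise ValueError("need samples both with and without the event")
    return abs(mean(with_w) - mean(without_w))

# Toy data: each world records whether W happened and how many paperclips exist.
worlds = [
    {"W": True, "paperclips": 10},
    {"W": True, "paperclips": 12},
    {"W": False, "paperclips": 2},
    {"W": False, "paperclips": 4},
]
gap = importance(worlds, lambda w: w["W"], lambda w: w["paperclips"])
print(gap)
```

The number is only meaningful relative to other events scored with the same v, which is why the estimate is described as uncalibrated.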
Right, we're now ready to start doing things.
Predicting... what?
What are we trying to predict here? Well, estimates of the importance of certain events. Specifically, the importance of X (which leads to the creation of S(u)) versus other events. For instance, if Y(v) designates the creation of M(v), we would want to compare the importance of X and Y(v) in (at least) four ways:
- I(Y(v) | X∧e)
- I(Y(v) | ¬X∧e)
- I(X | Y(v)∧e)
- I(X | ¬Y(v)∧e)
These measure the importance of Y(v) (hence M(v)) and X (hence S(u)) in universes where the other agent does or does not exist. If we expect S(u) to be of low impact, we would want things like I(Y(v) | X∧e) >> I(X | Y(v)∧e) (S(u) is much less important than the maximiser M(v)) and |I(X | Y(v)∧e) - I(X | ¬Y(v)∧e)| ≈ 0 (the importance of S(u) is unrelated to whether M(v) exists or not).
Now, any particular v might be misleading, so we would want to check that this result held up for all v ∈ V. We could take a maximum over V, or a softmax.
If WM(v) can be defined, we would want to do the same estimates with WY(v) instead of Y(v), and desire I(WY(v) | X∧e) >> I(X | WY(v)∧e) (S(u) is less important than the weak maximiser WM(v)) and |I(X | WY(v)∧e) - I(X | ¬WY(v)∧e)| ≈ 0 (the importance of S(u) is unrelated to whether WM(v) exists or not), again maxing or softmaxing across V.
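The two conditions, and the max or softmax across V, can be sketched as follows. This is only an illustration under loud assumptions: V is replaced by a small finite sample, the importance values are made-up numbers, ">>" is stood in for by an arbitrary ratio bound, "≈ 0" by an arbitrary tolerance, and "softmax" is read as a smooth maximum (log-sum-exp). All names are hypothetical.

```python
import math

def soft_max(values, temperature=1.0):
    """Smooth maximum (log-sum-exp); tends to max(values) as temperature -> 0."""
    m = max(values)
    return m + temperature * math.log(
        sum(math.exp((x - m) / temperature) for x in values)
    )

def low_impact(estimates, max_ratio=0.01, tol=0.01, aggregate=max):
    """Check both conditions across a finite sample of V.

    `estimates` maps each v to importance values under hypothetical keys:
      "I_Y_given_X"    for I(Y(v) | X and e), assumed strictly positive
      "I_X_given_Y"    for I(X | Y(v) and e)
      "I_X_given_notY" for I(X | not Y(v) and e)
    `max_ratio` stands in for ">>" and `tol` for "approximately 0".
    """
    ratios = [d["I_X_given_Y"] / d["I_Y_given_X"] for d in estimates.values()]
    gaps = [abs(d["I_X_given_Y"] - d["I_X_given_notY"]) for d in estimates.values()]
    return aggregate(ratios) <= max_ratio and aggregate(gaps) <= tol

estimates = {
    "paperclips": {"I_Y_given_X": 1000.0, "I_X_given_Y": 1.0, "I_X_given_notY": 1.005},
    "staples": {"I_Y_given_X": 800.0, "I_X_given_Y": 0.5, "I_X_given_notY": 0.5},
}
print(low_impact(estimates))
print(low_impact(estimates, aggregate=lambda xs: soft_max(xs, 0.001)))
```

Taking the worst case over V means a single v for which X matters is enough to fail the check. The same function would apply to WY(v) by substituting the corresponding estimates.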
Who predicts, how?
In the previous section, we used I to compute the importance of S(u) versus other agents. But what if axiom A.7 is wrong, and we don't have such an I? Well, then we can measure the importance of these events, given a particular v, using AM, writing AMv(X|Y(v')∧e) for |E_AM(v|X∧Y(v')∧e) - E_AM(v|¬X∧Y(v')∧e)|, where E_AM is the expectation as estimated by AM. This is the expected difference (that AM estimates) in v given X versus ¬X (ie dependent on whether S(u) was turned on or not). More precisely, we'll want to use S(u)(AMv(X|Y(v')∧e)): the expected difference (that S(u) estimates AM estimates) in v, and so on.
Here we'll want to do a double max/softmax. First over v', so that we consider all M(v') for v' ∈ V. Then over v, so that AM estimates the expected difference to all possible v. In this way, max/softmax AMv functions like an I. But not a perfect one: just one that detects disruptions to the multitude of v's in V. Therefore, we are essentially using the set V as the twenty million question method for the impact of S(u). We need to take extra care with this, to ensure we haven't left any gaping holes where S(u) can still have large impact.
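The double max can be sketched like this, assuming a finite sample of V and a lookup table standing in for AM's expectations. The function names and all the numbers are hypothetical; a real AM would be an idealised estimator, not a table.

```python
def am_importance(expect, v, v_prime):
    """AMv(X | Y(v') and e): |E_AM(v | X, Y(v'), e) - E_AM(v | not X, Y(v'), e)|."""
    return abs(expect(v, True, v_prime) - expect(v, False, v_prime))

def impact_of_x(expect, V, aggregate=max):
    """Aggregate over every M(v') that might exist and every v that might be disrupted."""
    return aggregate(
        [aggregate([am_importance(expect, v, v_prime) for v in V]) for v_prime in V]
    )

V = ["paperclips", "staples"]

# Made-up expectations, keyed by (v, whether X happened, v').
table = {
    ("paperclips", True, "paperclips"): 100.0,
    ("paperclips", False, "paperclips"): 100.5,
    ("paperclips", True, "staples"): 3.0,
    ("paperclips", False, "staples"): 3.0,
    ("staples", True, "paperclips"): 1.0,
    ("staples", False, "paperclips"): 1.25,
    ("staples", True, "staples"): 90.0,
    ("staples", False, "staples"): 92.0,
}

def expect(v, x_on, v_prime):
    return table[(v, x_on, v_prime)]

worst = impact_of_x(expect, V)
print(worst)
```

With a plain max the order of the two aggregations does not matter; with a softmax it could, so the order given in the text (first v', then v) is the one followed here.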
Now, if axiom A.3 is false, we need to use M(v) instead of AMv to estimate this quantity. We need to do this with some care. M(v) is (counterfactually) a real agent. Why would it want to calculate the importance of M(v') and S(u), and why would S(u) care about this? Well, we can do so by constructing the scenario carefully. Maybe M(v') and S(u) are due to be created in the future, and M(v) is a boxed agent that can choose to prevent Y(v') and X. We modify v to v#, so that M(v#) wants to maximise v, but takes specific numerical penalties if it chooses to prevent Y(v') or X. By modifying these numerical penalties, M(v#) is motivated to figure out the importance of M(v') and S(u), and to reveal its opinions through its actions. If we create an S(u#), also boxed, that could override M(v#) at a certain cost and create a specific S(u), we can motivate it to model M(v#)'s actions. From this we can construct an importance metric using the M(v).
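One way to picture "revealing opinions through actions" is as a search over the penalty. The sketch below assumes, purely for illustration, that M(v#) behaves as a simple threshold rule: it prevents X exactly when it expects that to gain more v than the penalty costs. The names `prevents` and `revealed_importance`, the bisection procedure and the upper bound are all assumptions of this sketch, not part of the proposal above.

```python
def prevents(true_importance, penalty):
    """Toy stand-in for M(v#): block X only if that is worth more than the penalty."""
    return true_importance > penalty

def revealed_importance(agent_prevents, lo=0.0, hi=1e6, steps=60):
    """Bisect on the penalty to find the point where the agent stops preventing X.

    Assumes the agent's private estimate lies between `lo` and `hi`.
    """
    for _ in range(steps):
        mid = (lo + hi) / 2
        if agent_prevents(mid):
            lo = mid  # still worth preventing: importance is above mid
        else:
            hi = mid  # not worth it: importance is at or below mid
    return (lo + hi) / 2

hidden_estimate = 42.0  # what the toy M(v#) privately thinks X costs it in v
estimate = revealed_importance(lambda p: prevents(hidden_estimate, p))
print(round(estimate, 6))
```

A real M(v#) would have every incentive to behave strategically, so this only shows the shape of the elicitation, not a reason to trust its output.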
The WM(v), on the other hand, are probably too limited to be used effectively as predictors.
Arbitrary agents
One issue with this method is that Y(v) and WY(v) refer to the creation of specific agents, in the specific ways we've set up. For instance, |I(X | Y(v)∧e) - I(X | ¬Y(v)∧e)| ≈ 0 means that the importance of X doesn't vary much given Y(v) or its opposite. What this really means is that I(X | ¬Y(v)∧e) is bounded by I(X | Y(v)∧e) - but that might be quite high. There may be details of the setup that allow S(u) to have an important role given the specific M(v) created in this way - for instance, if it reached out to try and destroy it. Then I(X | Y(v)∧e) could be high, hence I(X | ¬Y(v)∧e) could be as well. The restrictions with >> could still contain S(u), but >> is ill-defined.
More secure would be if Y(v) could be defined as the creation of an agent M(v) in an abstract way. Maybe something like "is 'this object is a v maximiser' a good model of this object's properties and impact?". We would also want to give M(v) some starting resources. Again, note it is very hard to abstractly define what "resources" mean, but considerably easier to define a set of "useful resources" we could give to an agent. We would probably want to bound this definition in time and space as well.
This gives us the next axioms:
- A.8. Y(v) defines the generic existence of an agent M(v).
- A.9. WY(v) defines the generic existence of an agent WM(v).
With these axioms, our measure of importance becomes more general: S(u) is less important than abstract agents M(v) or WM(v), rather than certain specific ones. Note that we don't need to worry that S(u) snuffs out any hypothetical M(v): since the importance of S(u) is bounded below that of M(v), that would reduce S(u) to zero importance.
Extending to humans
What other agents do we know in the world? How about those 7+ billion ones wandering about on its surface? We can certainly define events Z(h) that might lead to the death of a particular human h - of course, we'd want the probability of Z(h) to be as low as feasibly possible!
- A.10. We can define events Z(h) that correspond to the death of a human h.
Then we can use Z(h) instead of Y(v) and WY(v), to ensure that S(u) has importance comparable with a human. Of course, the ideal would be to define, abstractly, the death (or birth+maturation) of a human in a generic way!
- A.11. We can define events Z(h) that correspond generically to the death or birth+maturation of a human h.
To summarise the various ways we can define the importance of S(u) (and hence, by integrating them into S(u)'s code, constrain S(u)'s importance and influence), depending on the axioms we have:

The subagent problem is really hard
A putative new idea for AI control; index here.
The first step to solving a problem is to define it. The first first step is to realise how tricky it is to define. This is a stub on a difficult problem.
Subagents and turning AIs off
One of the hardest things in designing safe AIs is turning them off. Installing an OFF button is trivial; motivating them to let you push it can be done. No, the real problem is the subagent problem. The AI can create one or many subagents to carry out its task; once that's done, turning off the original AI is of little value: we have to control or extinguish the subagents.
A general problem
Subagents are a problem for many other control ideas. For example, if you want to program 'achieve X using only this many computer cycles', then 'create a subagent that wants to achieve X without the computer cycle restrictions' is one way of doing it (as long as that can be done within the restrictions). The same goes for energy use and other restrictions. Satisficers can satisfice by creating a maximiser. A boxed Oracle can remain in the box while cheerfully plotting to create an unboxed copy of itself. In fact, whatever the AI's motivation, "create subagents with the same motivation" is almost always a good idea. And, once created, there is no motivation to want to shut them down. Even if the AI is motivated to shut itself down, there is no reason for it to extend this to subagents.
Pre-corriged agents would at least be motivated to keep control of their subagents (unlike corrigible agents, who have no current desire to create subagents that would obey their future preferences).
What is a subagent?
It's also almost impossible to define a subagent. Is a corporation a subagent? Is a thousand pieces of code that reside on different computers and occasionally communicate a subagent? Is an employee or a seduced human a subagent? A pile of rocks with water running through could be a piece of conceptual art - or a slow computer. How about a virulent meme among the human population, making us all into its subagents?
The best idea I've found for defining subagents is that they are reactive intelligences with planning abilities, so random natural events that should have large influence in the future (say, a meteor heading towards Earth) would instead be re-diverted to have little impact. However, if this is used as a definition, then the subagents would simply be programmed to not behave "reactively" in whatever way defined. A slightly more subtle problem is that, since we don't have a definition of agent or subagent, we can't easily protect against holes such as "there is an alien agent on that asteroid, which will optimise the solar system iff it detects no subagent on Earth". Such holes could completely reverse the definition.
No subagent versus no influential subagents
Without the ability to properly define subagents, we can't restrict them directly. Even if you had an AI that was properly motivated to, say, do some action then turn itself off, there is no reason for it to want to do the same with its subagents.
Most of the methods of reduced impact can help with problematic subagents. They do not preclude the creation of subagents, nor mandate their destruction, but they do prevent them from having much influence. This is because reduced impact methods are aimed at preventing the AI itself from having much influence, either directly or via subagent creation. Even the attempt in this post to detect subagents directly more properly detects influential decisions (of which the creation of a subagent could be one).
This is certainly a positive, but it may not be robust. If the definition of "not having much influence" turns out to have holes, it would be less than ideal to have any remaining subagents hanging about.
The "make sure that you cannot be detected in the future" approach is probably the most promising for controlling subagents. The AI there is motivated to erase all traces of its existence, which subagents would presumably be.
In all, it's a very tricky problem, and the core failure of many ideas for AI control.
Self-improvement without self-modification
This is just a short note to point out that AIs can self-improve without having to self-modify. So locking down an agent from self-modification is not an effective safety measure.
How could AIs do that? The easiest and the most trivial is to create a subagent, and transfer their resources and abilities to it ("create a subagent" is a generic way to get around most restriction ideas).
Or, if the AI remains unchanged and in charge, it could change the whole process around itself, so that the process as a whole changes and improves. For instance, if the AI is inconsistent and has to pay more attention to problems that are brought to its attention than problems that aren't, it can start to act to manage the news (or the news-bearers) to hear more of what it wants. If it can't experiment on humans, it will give advice that will cause more "natural experiments", and so on. It will gradually try to reform its environment to get around its programmed limitations.
Anyway, that was nothing new or deep, just a reminder point I hadn't seen written out.
Detecting agents and subagents
A putative new idea for AI control; index here.
Unfortunately, it's extremely hard to identify an agent. Agents need not come conveniently packaged in some "sensors-actuators-internal structure-utility function" form. If you wanted to obfuscate an agent, hiding it in the natural movements of the solar system, or in the gyrations of the internet, this would be easy to do and hard to detect - and very hard to define what you would be looking for.
Fortunately, it's much easier to detect superintelligent AIs that have major impact on the world - ie the kind of agents that we would be worried about. Intuitively, this is true: if you suddenly find half the sky getting optimised for space colonisation, it's likely there's an AI somewhere there. How can this be formalised?
The importance of the agent
Imagine taking a slice of space-time around the moment when a superintelligent agent just got turned on. If you wanted to predict the future, what piece of information is most relevant in that time slice? Obviously the superintelligent agent. We could run the ideas for reduced impact in reverse, looking for the part that is of maximum impact.
I would therefore define the agent as the part of this slice that it's the most valuable to know about, or the part that it would be the most valuable to change, counterfactually, if such changes were possible. Note that this definition doesn't exactly identify agents, but if it misidentifies something, then that something must also be of great importance.
If we imagine the slice of space-time seeping forwards in time from the past, then events in space-time will have varying informativeness/importance. As we approach the creation of the superintelligent AI, importance gets concentrated around the run-up to its creation, before being maximally concentrated at the creation of the AI (or its escape, if it was confined).
For a more formal definition, I'd imagine a passive pure-resource gathering agent A being fed the details of the time slice, and only being able to pass on a limited amount of the information to another (active) copy of itself, and seeing what it passed on. The data passed on would be the highest priority for the active copy, so would almost certainly include the existence of an enemy agent (almost all agents are "enemies" to resource gathering agents, as they use up precious resources).
Alternatively, we could give A the option of changing some of the data - flipping some real-world bits, at least conceptually - and seeing which bits it preferred to flip. This definition can no doubt be improved by adding noise or other variations.
Now with counterfactuals and false miracles we might be able to actually construct that situation, or something approximating it. Even without that, this is a formal definition that seems to home in well on the concept of agent detection.
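The information-passing version of this definition can be sketched very simply, assuming the passive agent A can already assign a number to how valuable each fact would be to its active copy. That assumption does all the work; the facts, values and function name below are invented for illustration.

```python
def pass_on(facts, value_of_knowing, budget):
    """Return the `budget` facts that A would most want its active copy to know."""
    ranked = sorted(facts, key=value_of_knowing, reverse=True)
    return ranked[:budget]

# Made-up values of knowing each fact, from A's resource-gathering point of view.
toy_values = {
    "weather over the Pacific": 0.1,
    "rival AI switched on": 950.0,
    "asteroid on collision course": 40.0,
    "result of a coin flip": 0.0,
}
message = pass_on(toy_values, toy_values.get, budget=2)
print(message)
```

Whatever survives the smallest budget is, on this definition, the candidate for "the agent" in the slice. The bit-flipping variant would rank facts by the value of changing them rather than the value of knowing them.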
Detecting subagents
Now that we've detected agents, can we detect subagents? It seems that we can. A powerful subagent can be detected in the same way that a powerful agent can be. But the subagent has some extra properties; for instance, knowing about the presence of the subagent could give strong evidence of the existence of a powerful agent in the past. This could correspond to agent A finding the existence of the powerful past agent to be important, the existence of the subagent to be important, but the combination of the two facts to not be as important as you'd expect. You could also compare the importance of knowing "past agent + subagent" versus "past agent + subagent whose motivation is flipped by a thermodynamic miracle".
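The "not as important as you'd expect" test can be made concrete as a redundancy check, assuming we have importance values for each fact alone and for the pair. Reading "as you'd expect" as "the sum of the two" is my assumption for this sketch, as are the threshold and the function names.

```python
def redundancy(i_agent, i_subagent, i_both):
    """How much less the pair of facts is worth than the two facts taken separately."""
    return (i_agent + i_subagent) - i_both

def looks_like_subagent(i_agent, i_subagent, i_both, threshold=0.3):
    """Flag the pair when a large fraction of the separate importance is redundant.

    `threshold` is an arbitrary illustrative cutoff.
    """
    total = i_agent + i_subagent
    return total > 0 and redundancy(i_agent, i_subagent, i_both) / total >= threshold

# Knowing about either the past agent or its subagent tells you most of the story.
print(looks_like_subagent(i_agent=100.0, i_subagent=90.0, i_both=105.0))
# Two unrelated agents: knowing both is worth the sum of knowing each.
print(looks_like_subagent(i_agent=100.0, i_subagent=90.0, i_both=190.0))
```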
We could also track the "importance" of the agent forward through time, and any sudden jump downwards in importance could correspond either to a failure or the creation of a subagent: a sudden jump of importance from one location to another would be an indication.
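Tracking importance through time might look like the sketch below. It assumes importance has already been attributed to named locations at each time step, which is itself a large assumption; the history, the locations and the `min_jump` cutoff are all invented.

```python
def find_transfers(history, min_jump):
    """Find steps where importance drops sharply at one location and rises at another.

    `history` is a list of dicts mapping location -> importance, one per time step.
    Returns a list of (time, source, destination) tuples.
    """
    events = []
    for t in range(1, len(history)):
        prev, cur = history[t - 1], history[t]
        change = {
            loc: cur.get(loc, 0.0) - prev.get(loc, 0.0)
            for loc in set(prev) | set(cur)
        }
        losers = sorted(loc for loc, d in change.items() if d <= -min_jump)
        gainers = sorted(loc for loc, d in change.items() if d >= min_jump)
        events.extend((t, src, dst) for src in losers for dst in gainers)
    return events

history = [
    {"lab": 100.0, "orbit": 0.0},
    {"lab": 98.0, "orbit": 1.0},
    {"lab": 5.0, "orbit": 95.0},  # importance jumps from the lab to orbit
]
transfers = find_transfers(history, min_jump=50.0)
print(transfers)
```

A drop with no matching rise elsewhere would, on this reading, look more like a failure of the agent than the creation of a subagent.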
It's interesting to note that this approach can tell the difference between an agent that creates a powerful subagent it still controls (in that case, changing the original agent's utility is of high value) and one it doesn't.
If we have a sufficiently well-defined notion of counterfactual, we could explore some other options, like allowing the hypothetical resource gathering agent to plant hypothetical "bombs" across the world. It is likely that each bomb would be placed to eliminate one of its rivals.
It seems a lot could also be done by checking what happens to "importance" under different counterfactual definitions (the subagent has emerged from natural processes, versus emerging from a thermodynamic miracle).
Hacking the definition
An AI that is under some restriction based on this definition of agency will no doubt try and hack the definition. How could it do that? One obvious idea is to run a cryptographic hash function over, eg, the positions of all the atoms in the solar system, and base its behaviour on the output. In that case, every atom would be of great importance. However, this can be easily overcome if we notice that every atom is of great importance only because of the agent's presence. In a counterfactual where the agent is removed, these atom positions are not of great importance intrinsically.
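A toy version of the hash trick, and of the counterfactual that defuses it, is sketched below. The "world" is just a tuple of integer positions, and its natural evolution is deliberately made coarse so that a tiny nudge to one atom does nothing on its own; both are assumptions made purely to show the contrast.

```python
import hashlib

def agent_action(atoms):
    """A hacking agent that keys its behaviour on a hash of every atom position."""
    return hashlib.sha256(repr(atoms).encode()).hexdigest()

def outcome(atoms, agent_present):
    """Toy future: a coarse natural outcome, plus whatever the agent does if it exists."""
    natural = sum(atoms) // 1000  # insensitive to small nudges (an assumption)
    action = agent_action(atoms) if agent_present else None
    return (natural, action)

def atom_matters(atoms, index, agent_present):
    """Does nudging one atom change the future?"""
    nudged = list(atoms)
    nudged[index] += 1
    return outcome(tuple(nudged), agent_present) != outcome(tuple(atoms), agent_present)

atoms = (3, 1, 4, 1, 5)
print([atom_matters(atoms, i, agent_present=True) for i in range(len(atoms))])
print([atom_matters(atoms, i, agent_present=False) for i in range(len(atoms))])
```

With the agent present every atom looks important; with the agent counterfactually removed none of them do, which points the importance back at the agent itself.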
Maybe the agent will try and create lots of independent subagents, spreading importance over many locations? In that case, we should look for a strong centralised importance that becomes dispersed.
AIs are intrinsically disadvantaged when it comes to hacking this definition (in theory) because if they are going to have a large impact, then the moment of their creation/turning on/escape is a very informative one.
The definition seems to be underspecified rather than easily hackable, which is a good position to start from.
It should be noted that a good satisficer should never be detected as a powerful agent. This could be used as a definition of a satisficer, a point we'll be returning to in subsequent posts.