Stuart_Armstrong comments on Detecting agents and subagents - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
This is indeed a very preliminary concept.
Friendly AIs need not be nice in a game-theoretic sense. They can (and likely would) be ruthless and calculating in pursuit of their goals - it's just that their goals are good/safe/positive. This puts some constraints on means (e.g. the AI will likely not kill everyone just to achieve its goals), but "play nicer than you have to with other AIs" is unlikely to be among those constraints.