I've been working on an AI safety camp project where we try to describe agent structure. This post defines some key concepts and conveys my reasoning about this topic so far. It is mostly conceptual. The first section discusses what structure is and what it means for an object's behavior to imply structure. The second part shows ways of thinking about program structure. Then, we discuss different types of agent structures.

Basic structure

Consider the following sequence of numbers:

1, 11, 21, 1211, 111221, 312211, 13112221, 1113213211, 31131211131221

Even if you don't know the function that generates these numbers, you can still find patterns if you look at them long enough. For example, the terms always get bigger and always end with a 1. This doesn't fully describe the sequence, but it does describe some structure in it. (If you are curious, this is the look-and-say sequence.)
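
If you're curious what a generator looks like, here is a quick sketch (the code is mine, just to illustrate that the two patterns above keep holding without us needing to know why):

function lookAndSay(term: string): string {
	let result = ""
	let i = 0
	while (i < term.length) {
		let runLength = 1
		while (i + runLength < term.length && term[i + runLength] === term[i]) runLength++
		result += `${runLength}${term[i]}`	// "read off" the run: its length, then its digit
		i += runLength
	}
	return result
}

let term = "1"
for (let k = 0; k < 8; k++) {
	const next = lookAndSay(term)
	// both observed patterns hold: the terms never shrink and always end in 1
	console.log(next, next.length >= term.length, next.endsWith("1"))
	term = next
}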

Let's dig into structure with a simple conceptual analysis. Structure is a property of entities: the structure of an entity is a logical or mathematical object that partially describes that entity. An entity may have many different types of structure.

Examples of things with structure

  • the sequence 01010101 has a repetitive structure
  • The US interstate system has a graph structure
  • Monopoly has a game-theoretic structure

Structures are built on top of each other. For example, a ring has an underlying group structure, and a group has an underlying set structure.

Determining structure

Let's say we have some object O and want to show that it has some structure. O might have some already-known structure, like being a Turing machine or a Python program.

To try to determine O's structure, we test its behavior by asking it questions and examining the responses. Take a set as an example. We can ask, "Is this element in the set?" or "How many elements are in the set?" When we ask enough questions, we can conclude that the set has a certain structure, like "only contains even numbers."

I like to call this "carving out" structure rather than building it up. We start with the prior that this set is a random one drawn from the collection of all sets, and by asking questions we narrow down the possible objects it could be, until we conclude that it is a particular set, or one of a family of sets that share some structure.
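
As a toy sketch of this carving process (the candidate structures below are purely illustrative), we can start with a handful of hypotheses about a set and discard the ones that the answers to our questions rule out:

type Hypothesis = { name: string, holds: (elements: number[]) => boolean }

const hypotheses: Hypothesis[] = [
	{ name: "only contains even numbers", holds: s => s.every(x => x % 2 === 0) },
	{ name: "only contains positive numbers", holds: s => s.every(x => x > 0) },
	{ name: "only contains multiples of ten", holds: s => s.every(x => x % 10 === 0) },
]

// Each answered question ("Is 6 in the set?" -> yes, ...) reveals some elements,
// and each revealed element carves away the hypotheses it contradicts.
const revealed = [2, 6, 10, 4]
const remaining = hypotheses.filter(h => h.holds(revealed))
console.log(remaining.map(h => h.name))	// ["only contains even numbers", "only contains positive numbers"]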

Program structure

We will now consider the structure of programs; this is especially useful when considering agents since agents can be thought of as programs.

Consider these two Python programs:

def quick_sort(arr):
	if len(arr) <= 1:
		return arr
	pivot = arr[len(arr) // 2]
	left = [x for x in arr if x < pivot]
	middle = [x for x in arr if x == pivot]
	right = [x for x in arr if x > pivot]
	return quick_sort(left) + middle + quick_sort(right)
def insertion_sort(arr):
	for i in range(1, len(arr)):
		key = arr[i]
		j = i - 1
		while j >= 0 and key < arr[j]:
			arr[j + 1] = arr[j]
			j -= 1
		arr[j + 1] = key
	return arr

These programs are behaviorally identical but structurally different. They both compute the same function, mapping unsorted arrays to sorted arrays, but they do it differently.

Let's examine some key ways their structure is different. The quick sort program uses recursion, while the insertion sort program has a doubly nested loop. The quick sort program makes new arrays, while the insertion sort program swaps elements in place. We will call these features programmatic structure.

Now, consider this implementation of insertion sort.

def recursive_insertion_sort(arr, n=None):
	if n is None:
		n = len(arr)
	if n <= 1:
		return
	recursive_insertion_sort(arr, n-1)
	last = arr[n-1]
	j = n-2
	while (j >= 0 and arr[j] > last):
		arr[j + 1] = arr[j]
		j -= 1
	arr[j + 1] = last

This program has a very different structure compared to the previous insertion sort. It is now recursive and has conditionals that weren't there before. Is this still insertion sort? I'd argue that it is. While the two programs have different program structures, they still seem to have some structure in common.

To determine this underlying structure, perhaps we could define a formal language into which we convert every program that implements the sorting function. In this language, there would be a single "insertion sort program" and a single "quick sort program." If a sorting program maps to the insertion sort program, it implements insertion sort.

To determine internal program structure, we could perhaps define a process that converts Python programs that implement sorting into stricter programs in a formal language, such that all Python programs we think implement quick sort get mapped to the formal quick sort program (QUICK_SORT). This process might be uncomputable.

What properties does a program with the QUICK_SORT structure have? The properties of QUICK_SORT must be program-invariant. That is, all programs that we think implement quick sort run in O(n log n) expected time and use O(n) space, so QUICK_SORT should also have these structures. Another candidate for a program-invariant structure is the ordering of the comparison and swap operations a program performs. I'd like to investigate these properties more in the future and attempt to define this process.
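
As a rough sketch of what the comparison-ordering idea could look like (the instrumentation below is my own illustration in TypeScript, not a worked-out definition), we can re-implement the two sorts so that they log every comparison they make, then compare the logs:

type Comparison = [number, number]

function tracedQuickSort(arr: number[], trace: Comparison[]): number[] {
	if (arr.length <= 1) return arr
	const pivot = arr[Math.floor(arr.length / 2)]
	const left: number[] = [], middle: number[] = [], right: number[] = []
	for (const x of arr) {
		trace.push([x, pivot])	// every element gets compared against the pivot
		if (x < pivot) left.push(x)
		else if (x === pivot) middle.push(x)
		else right.push(x)
	}
	return [...tracedQuickSort(left, trace), ...middle, ...tracedQuickSort(right, trace)]
}

function tracedInsertionSort(arr: number[], trace: Comparison[]): number[] {
	const a = [...arr]
	for (let i = 1; i < a.length; i++) {
		const key = a[i]
		let j = i - 1
		while (j >= 0) {
			trace.push([key, a[j]])	// the key gets compared against the sorted prefix
			if (key >= a[j]) break
			a[j + 1] = a[j]
			j--
		}
		a[j + 1] = key
	}
	return a
}

const t1: Comparison[] = [], t2: Comparison[] = []
tracedQuickSort([3, 1, 2], t1)
tracedInsertionSort([3, 1, 2], t2)
console.log(JSON.stringify(t1) === JSON.stringify(t2))	// false: the comparison orderings differ

The traces would then be one behavioral signal to feed into whatever process maps sorting programs to QUICK_SORT or INSERTION_SORT.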

Implying program structure

Now, we want to infer some structure of an unknown program P that we are handed.

Suppose we observe the following things about P:

  • P implements a sorting function
  • P has a runtime of O(n log n) and a space complexity of O(n)
  • P has the same comparison/swap operation ordering as QUICK_SORT

These are all necessary but not sufficient conditions for an arbitrary program to be QUICK_SORT. How do we know there doesn't exist some other sorting program that has the same comparison/swap ordering as QUICK_SORT but lacks some necessary property of quick sort, like the notion of a "pivot"?

Whatever structural properties we are able to prove beyond our direct observations will depend on the process we derive for mapping sorting programs to formal programs. If it is easy to show that this process converts all programs with the characteristics above to QUICK_SORT, then the claim might become tautological.

Agent structure

Agents are programs of a special type. Agents are run in an environment: they have state, take in a percept, and output an action. Below I lay out a basic interface for how agents and environments are run.

interface Agent {
	getAction: (p: Percept) => Action
}

interface Environment {
	getPercept: (a: Action) => Percept
}

function run(agent: Agent, env: Environment) {
	let percept = null
	let action = null
	while (true) {
		action = agent.getAction(percept)
		percept = env.getPercept(action)
	}
}

Of course, you could also run the agent for a finite number of steps, as in the sketch below.
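
A finite-horizon version of the loop might look like this (a minimal sketch; the step count is an arbitrary parameter I'm adding):

function runFinite(agent: Agent, env: Environment, steps: number) {
	let percept = null
	let action = null
	for (let i = 0; i < steps; i++) {
		action = agent.getAction(percept)
		percept = env.getPercept(action)
	}
}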

In Artificial Intelligence: A Modern Approach (4th edition), the authors define multiple "agent structures." We borrow some of their terminology and concepts here.

Table-based agent

A table-based agent looks up the action to take based on all of its past percepts. They are "hard-coded" agents. In most environments, a table-based agent would require unbounded memory.

class TableAgent {
	// The table is keyed on the serialized percept history; a Map keyed on Percept[]
	// directly would compare arrays by reference and never find a match.
	actionTable: Map<string, Action>
	percepts: Percept[] = []
	getAction(p: Percept) {
		this.percepts.push(p)
		return this.actionTable.get(JSON.stringify(this.percepts))
	}
}

Reflex-based agent

Reflex agents take their current input, compute some state, and then see if that state triggers any rules. We break the agent's getAction function into two parts: one function that interprets the input as a world state, and another that reads that world state and checks whether certain conditions are met to determine what action to take. Note that the input interpreter could be the identity function, in which case this reduces to any program that maps the current percept directly to an action. Breaking it up like this allows more flexible agents to be designed.

class ReflexAgent<State> {
	inputInterpreter: (p: Percept) => State
	rulesFunction: (s: State) => Action
	getAction(p: Percept) {
		const worldState = this.inputInterpreter(p)
		return this.rulesFunction(worldState)
	}
}

Model-based agent

Model-based agents have internal state that tracks the environment. After each observation, they update their model of the world and output an action based on that world state. This lets the agent track things about the environment that are not currently in its percept.

class ModelBasedAgent<State> {
	state: State
	action: Action | null = null
	rulesFunction: (s: State) => Action
	updateState: (s: State, a: Action | null, p: Percept) => State

	getAction(percept: Percept): Action {
		this.state = this.updateState(this.state, this.action, percept);
		const action = this.rulesFunction(this.state);
		this.action = action; 
		return action;
	}
}

Table-based agents and model-based agents with infinite memory and compute can implement the same functions; a table-based agent is just the rolled-out version of any agent. Reflex-based agents may not be able to match them, since some environments require remembering past percepts.
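
To make the "rolled out" claim concrete, here is a sketch of how any agent could be compiled into a table, assuming a finite percept alphabet and a bounded history length (both assumptions are mine, added to keep the table finite):

function rollOut(makeAgent: () => Agent, alphabet: Percept[], maxLen: number): Map<string, Action> {
	const table = new Map<string, Action>()
	let histories: Percept[][] = [[]]
	for (let len = 1; len <= maxLen; len++) {
		const extended: Percept[][] = []
		for (const h of histories) {
			for (const p of alphabet) {
				const history = [...h, p]
				// replay the whole history on a fresh copy of the agent
				const agent = makeAgent()
				let action: Action | null = null
				for (const percept of history) action = agent.getAction(percept)
				table.set(JSON.stringify(history), action!)
				extended.push(history)
			}
		}
		histories = extended
	}
	return table
}

The resulting map is exactly the actionTable a TableAgent would need in order to imitate the original agent on histories up to that length.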

Implying agent structure

Let's now discuss the problem of detecting structure by observing the behavior of a program. Say we have some program P that we want to test for agent structure. Let's start by coming up with questions that differentiate the table-based agent from the other types of agent structures. Define an environment that, at each time step, outputs a simple addition problem like 234+550, and the agent is supposed to output the correct answer, 784. We will limit the length of each number to 10 digits. While a look-up table agent could encode this, it would need a table with 10^20 entries. A simple reflex agent could accomplish the same thing by performing an addition algorithm that uses far less space. So, depending on the problem, knowing the size of a program can tell us that it is not a table-based agent.
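
Here is a sketch of that environment and a reflex agent that handles it (the concrete types and the random problem generation are my own additions for illustration):

// Percepts are problems like "234+550"; actions are the agent's answers as strings.
class AdditionEnvironment {
	getPercept(_action: string): string {
		const a = Math.floor(Math.random() * 1e10)	// up to 10 digits
		const b = Math.floor(Math.random() * 1e10)
		return `${a}+${b}`
	}
}

// A reflex agent needs only a tiny rule, not a table with 10^20 entries.
const reflexAdder = {
	getAction(problem: string): string {
		const [a, b] = problem.split("+").map(Number)
		return String(a + b)
	}
}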

Now, we want to differentiate the reflex-based agent from the model-based agent. This can be done by exploiting the fact that a reflex agent has no memory. If P is run in an environment where it must repeat the last number it was given, a reflex agent will not perform well. This is hard to do for a general environment, though. Imagine an environment where the agent plays Mario Kart. If an agent wins, it isn't clear whether it was modeling its environment, tracking where the other players are, and planning for the future, or whether it was just responding reflexively.
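
A memory-probing environment along these lines might look like the following sketch (the scoring scheme is an assumption of mine; the point is only that doing well requires remembering the previous percept):

class RepeatLastEnvironment {
	last: number | null = null
	score = 0
	getPercept(action: number): number {
		// reward the agent if it repeated the number shown on the previous step
		if (this.last !== null && action === this.last) this.score++
		this.last = Math.floor(Math.random() * 100)
		return this.last
	}
}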

The scary structure

[Find links for X-risk concerns of utility-maximizing agents]

The things that we are really scared of are utility-based agents. They are a subset of model-based agents that try to optimize utility functions. Their rulesFunction is a complex procedure that searches over plans to see which one results in the most utility for the agent. Although we don't know how to formalize these types of agents completely, they seem scary since they might generalize in ways we don't want and do really wacky things to maximize utility, like turning the world's oceans into paperclips.
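
To make that shape a bit more concrete, here is a very rough sketch of a utility-based agent; the world model, the utility function, and the one-step plan search are placeholders of mine rather than a serious formalization:

class UtilityBasedAgent<State> {
	state: State
	updateState: (s: State, p: Percept) => State	// model update from the latest percept
	worldModel: (s: State, a: Action) => State	// predicts the next state for an action
	utility: (s: State) => number	// how much the agent "likes" a state
	actions: Action[]	// the candidate plans searched over (here, single actions)

	getAction(percept: Percept): Action {
		this.state = this.updateState(this.state, percept)
		// search over candidate actions and pick the one with the highest predicted utility
		let best = this.actions[0]
		let bestUtility = -Infinity
		for (const a of this.actions) {
			const u = this.utility(this.worldModel(this.state, a))
			if (u > bestUtility) { bestUtility = u; best = a }
		}
		return best
	}
}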

Let's consider this environment:

type State = number
type Action = "Button1" | "Button2"
type Percept = State

class ButtonEnvironment {
	state: State
	correctButton: Action
	getPercept(action: Action): Percept {
		if (action == this.correctButton) {
			this.state++
		} else {
			this.state--
		}
		return this.state
	}
}

We have two free parameters: the starting state and the button that moves the state up. Now, let's say we run some unknown program through many different instances of this environment. Sometimes Button2 makes the state go up, sometimes Button1 does. Sometimes you start at -100, sometimes at 3.

There exist agents that always move the environment to a particular state. That is, no matter the starting parameters of the environment, after some time the environment heads toward a certain state. We say that such an agent is optimizing this environment: it predictably moves the environment toward a state that it desires.
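
Here is a sketch of that experiment; runFor and the convergence check are test-harness pieces I'm adding for illustration:

function runFor(agent: Agent, env: ButtonEnvironment, steps: number): State {
	let percept: Percept = env.state
	for (let i = 0; i < steps; i++) {
		percept = env.getPercept(agent.getAction(percept))
	}
	return percept
}

function looksLikeAnOptimizer(makeAgent: () => Agent, trials: number, steps: number): boolean {
	const finalStates: State[] = []
	for (let i = 0; i < trials; i++) {
		const env = new ButtonEnvironment()
		env.state = Math.floor(Math.random() * 200) - 100	// random starting state
		env.correctButton = Math.random() < 0.5 ? "Button1" : "Button2"	// random up-button
		finalStates.push(runFor(makeAgent(), env, steps))
	}
	// the state changes by 1 every step, so "converging" here means the final states
	// all land in the same narrow band rather than at one exact value
	return Math.max(...finalStates) - Math.min(...finalStates) <= 2
}

For this check to say anything, steps has to be large enough for the agent to reach its target from any starting state (a few hundred here).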

What can we say about agents that do this? Given a reasonable size constraint, we know they must be model-based, since they must remember which button does what. However, after they gather that information, they could act as a reflex agent, since their next actions can be predicted solely from the current percept. (If the world state is less than my desired state, press the button that I've learned makes the state increase; otherwise, press the other button.)
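
That policy could be written down roughly like this (desiredState and the probe-then-act scheme are illustrative choices of mine):

class ButtonSeekingAgent {
	desiredState = 0	// the state this agent steers the environment toward
	upButton: Action | null = null	// learned model: which button increases the state
	lastPercept: number | null = null
	lastAction: Action = "Button1"

	getAction(p: Percept): Action {
		if (this.lastPercept !== null && this.upButton === null) {
			// model update: did the previous action move the state up or down?
			const otherButton: Action = this.lastAction === "Button1" ? "Button2" : "Button1"
			this.upButton = p > this.lastPercept ? this.lastAction : otherButton
		}
		this.lastPercept = p
		if (this.upButton === null) return this.lastAction	// first step: just probe
		// reflex-like rule once the model has been learned
		const downButton: Action = this.upButton === "Button1" ? "Button2" : "Button1"
		this.lastAction = p < this.desiredState ? this.upButton : downButton
		return this.lastAction
	}
}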

Conclusion & Future work

This post mostly breaks down concepts and gives examples of how behavior implies structure. The ultimate goal is to come up with a set of things to look for that imply a model is structured as a scary utility-maximizing agent. While this seems hard, I think it is tractable. Future work will include discussing the modularity of structure and how an agent could be "half" a reflex-based agent.

Comments

NB: At the moment I forget where I originally picked up this idea, so I'll try to re-explain it from scratch.

There's a well-known idea that external behavior can only imply internal structure if you make assumptions about what the internal structure is like. This is kind of obvious when you think about it, but just to make the point clear, suppose we have a black box with a red button on top of it and a wire coming out the side. When you press the button, the wire carries a current, and the wire goes dead when the button is not pressed.

You might be tempted to say it's just a switch and the button is closing a circuit that transmits electricity on the wire, but that's not the only option. It could be that when you press the button, a radio signal is sent to a nearby receiver; that receiver is hooked up to another circuit that does some computations, and then it sends back a signal to a receiver in the box that tells it to close the circuit by powering a solenoid switch.

You can't tell these two cases apart, or in fact infinitely many other cases, just from the observed external behavior of the system. If you want to claim that it's "just a switch" then you have to do something like assume it's implemented in the simplest way possible with no extra stuff going on.

This point seems super relevant for AI because much of the concern is that there's hidden internal structure that doesn't reveal itself in observable behavior (the risk of a sharp left turn). So to the extent this line of research seems useful, it's worth keeping in mind that by itself it will necessarily have a huge blind spot with regard to how a system works.

I think this is a good point. I would push back a small amount on being unable to tell the difference between those two cases. There is more information you can extract from the system, like the amount of time it takes after pressing the button for the current to turn on. But in general, I agree.

I agree that it would be very easy to have huge blind spots regarding this line of inquiry. This is the thing I worry about most. But I do have a hunch that given enough data about a system and its capabilities, we can make strong claims about its internal structure, and these structures will yield predictive power.

When you have little information like "pressing this button makes this wire turn on," it is much harder to do this. However, I believe testing an unknown program in many different environments and having information like its run time and size can narrow the space of possibilities sufficiently to say something useful.

All models are wrong, some models are useful.  

Examples of things with structure

  • the sequence 01010101 has a repetitive structure
  • The US interstate system has a graph structure
  • Monopoly has a game-theoretic structure

I think this is wrong.  Replace "have/has" with "can be modeled as a".  We don't know if that structure is causal and singular to that output.  I think recognizing that inferred structures are not exclusive or complete is useful here - many possible structures can generate very similar outputs.  If your goal is to predict future outputs, you probably need to find the actual generator, not pick a structure that could be it.  

No actual agents will be exactly any of those structures; it'll be a mix of them, and some other less-legible ones.  I'd probably also argue that ANY of those structures can be scary, and in fact the utility-optimizing agent can use ANY of the decision types (table, rule, or model), because they're all equivalent, with sufficient complexity of {table, ruleset, or world-model}.

When we observe the programs' outputs, we can form a hypothesis class about what structures we think the programs have. My claim is that only a couple of structures pop out after testing a utility-maximizing agent in a sufficient number of environments.

You are correct in pointing out that an agent could be employing multiple structures at a time. Future work on this problem would include ways of quantifying how much of a certain structure a program has. This might look like trying to come up with a distance measure on programs under which a version of quicksort written in an object-oriented style and one written in a functional style have distance 0.

My claim is that only a couple of structures pop out after testing a utility-maximizing agent in a sufficient number of environments.

I think there's a lot of assumptions here about the kinds of feasible/available structures, and what "sufficient amount" of environments entails.  My expectation is that this is true neither for toy problems (often optimized for a gotcha) nor for real-world problems (often with massive complexity and near-chaotic margins).