This is an edited transcription of the final presentation I gave for the AI safety camp cohort of early 2024. It describes some of what the project is aiming for, and some motivation. Here's a link to the slides. See this post for a more detailed and technical overview of the problem.


This is the presentation for the project described as "does sufficient optimization imply agent structure?". That's what we call the "agent structure problem", which was posed by John Wentworth, and it's what we spent the project working on. But for this presentation I'm mostly going to talk about what we mean by "structure" (or what we hope to mean by structure) and why I think it matters for AI safety.

AI x-risk arguments are mostly conceptual

So we all have some beliefs about how AI systems are potentially very dangerous. But these beliefs — and I think they're valid and correct beliefs for the most part — aren't based on direct experience, in the way that we experience malaria as dangerous. We haven't seen a bunch of AGIs happen. And they're not based on formal methods of understanding things. Supervolcanoes might be dangerous, for example, but we understand the physics involved, and we know where they might be. Instead, the arguments about AI risk are trying to predict something that might happen, and they're what I would call philosophical or conceptual arguments. I think this makes it quite hard for us to know exactly what to do about the risk, and it also makes it harder to convince other people that it's that much of a risk. So I think there's an overall need to refine these intuitions, hopefully in a mathematical sense.

Many of the arguments are about structure

So what are some of the things that we think make AI dangerous? There's capabilities, which is something people talk about all the time. More capabilities means more dangerous. What goes into capabilities? Well, there's compute power. There's training on more data; if the AI systems know more about the world, if you give them all the data about the world, then they can have a bigger effect. And then there's how much interaction they have with the world: an autonomous AI, for example, has more ability to act, and so greater capabilities.

Capabilities are necessary but not sufficient for dangerousness

So capabilities are a danger, but they're sort of necessary but not sufficient for a system to be dangerous. There are data centers all over the world right now churning through a bunch of compute power and a bunch of data, but they're just serving images or something like that. And we can sort of just tell those aren't the dangerous ones. So there's something particular about AI that makes it dangerous that isn't just capabilities.

A couple of other analogies I like to use: this claim about capabilities really applies to any tool. If you have a really big hammer you can break open bigger rocks with it, but you can also hurt yourself more, or hurt someone else more, with a bigger hammer. Another analogy is with energy, like physical energy. Energy is sort of by definition the thing that lets you make a bunch of changes in the world. The more energy you have, the more changes you can make. And most changes are bad! So energy is dangerous, but somehow humanity has organized society around these enormous energy-channeling infrastructures and it's not that bad. It's a little dangerous, but it's not going to burn down the whole world. We understand, mechanistically, how it works well enough.

So there's something else about AI that's a little bit special. When you read the literature about AI risk, people will use various phrases to gesture at this thing, or this cluster of things. They might say that AI systems are dangerous if they're "goal directed": if, inside the AI's mind, it somehow has a goal that it's going for and it's trying to achieve that goal. Or if it's a utility maximizer and it's trying really hard to maximize utility. That's somehow more dangerous because, coupled with capabilities, it might select some action that we didn't think of, one that has big side effects. Another thing people talk about is: how can you tell when an AI system or machine learning system is "just a bunch of heuristics"? One of the talks yesterday asked whether we can tell if a system is doing search on the inside. General purpose search seems to be more dangerous than "just a bunch of heuristics" somehow.

Agent structure

These are all talking around something that is about the internal structure of the AI system, as opposed to the capabilities, which are about behavior. We can observe capabilities: we make a machine learning system, we run it on a bunch of benchmarks, and we say, "Wow, it could do all those things. It seems like it could do other things that are equally large or equally impactful." But the internal structure is what determines what the system actually does. Being an agent is one example of a structure.

Ideal agent structure

Somewhat more specifically, somewhat more algorithmically, an ideal agent is this concept that you can formalize. An ideal agent will do something like this:

  • have prior beliefs over all possible worlds
  • receive observations through its sensors
  • do ideal Bayesian updating about those observations
  • consider every possible action
  • calculate the implications in every possible world
  • calculate the expected utility of each action
  • take the action that has maximum expected utility.

This is an algorithm; it's well defined mathematically. (There're some details, obviously, to fill out). But it's also not something that's going to actually be implemented in a machine learning system because it's totally impractical.
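Here's a minimal sketch of one step of that loop, assuming finite sets of worlds and actions and user-supplied likelihood and utility functions; all of the names are illustrative, not something from the project:

```python
# Hypothetical sketch of one step of the ideal-agent loop above.
# Assumes finite collections of worlds and actions, a likelihood model
# P(observation | world), a utility function over (action, world) pairs,
# and a prior over worlds. All names are illustrative.

def ideal_agent_step(prior, observation, worlds, actions,
                     likelihood, utility):
    """One step of idealized Bayesian expected-utility maximization."""
    # Ideal Bayesian update: P(world | obs) is proportional to
    # P(obs | world) * P(world).
    posterior = {w: likelihood(observation, w) * prior[w] for w in worlds}
    total = sum(posterior.values())
    posterior = {w: p / total for w, p in posterior.items()}

    # Consider every possible action and compute its expected utility
    # over every possible world.
    def expected_utility(action):
        return sum(posterior[w] * utility(action, w) for w in worlds)

    # Take the action with maximum expected utility.
    best_action = max(actions, key=expected_utility)
    return best_action, posterior
```

Even this toy version loops over every possible world and every possible action at each step, which is exactly the part that blows up in any realistic setting.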

Approximately ideal agent structure

But I think that there are going to be lots of approximations to this kind of structure, to this kind of algorithm. And I think it stands to reason that many approximately ideal agents will have approximately optimal performance. So there's this cluster around approximately ideal, or approximate, agent structure, and we want to understand how to measure that kind of closeness. We want a distance measure between structures, or something like that.

And the agent structure problem specifically asks the converse of this statement. The agent structure problem is asking: if we merely observe behavior that is nearly optimal, can we infer that the structure of the thing doing the optimization is approximately an agent?
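Stated very roughly, and in my own notation rather than anything from the project, the hoped-for statement has a shape like this:

```latex
% Rough shape of the hoped-for statement. \Pi is a policy class,
% \mathcal{E} an environment class, and d a (to-be-defined) structural
% distance to the nearest ideal agent. The notation is mine.
\[
  \pi \in \Pi,\;
  \operatorname{Regret}_{\mathcal{E}}(\pi) \le \varepsilon
  \;\stackrel{?}{\implies}\;
  \min_{A \in \mathrm{IdealAgents}} d(\pi, A) \le \delta(\varepsilon),
  \qquad \delta(\varepsilon) \to 0 \text{ as } \varepsilon \to 0 .
\]
```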

Naively the answer is no, so you need to figure out a bunch of conditions, a bunch of criteria, that make it true. Or you can ask which conditions would make it true, and then, once we've found those, ask whether they're a compelling match for what a machine learning system actually is.

Caveats to the theorem idea

For example, one idea that comes up pretty quickly when you consider this question is: if all we observe is the behavior of the system, and we observe that it acts optimally, it could in theory be that inside the system there is just a table. Whatever the observations are, it looks them up in a row of the table, and the returned value just happens to be the ideal action. It has a row for every possible observation in all the environments. So this is mathematically possible, in that it exists as a function, but it's obviously not an agent, and so in some sense it's not the dangerous thing.

But it's also not the thing that would actually be implemented in a machine learning system, because it's impractical: it's exponentially large in the number of time steps. So one thing the theorem you want to state and prove about agent structure needs to say is that the policy has near-optimal behavior and isn't too long to describe. Whatever policy class you've defined in your theorem, it will need to have some notion of the description length of the different structures. And we can say we don't want it to be a table, and we don't want it to be a thing with a bunch of pre-computed caches. So we'll say it has a limited description length.
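As a hypothetical illustration of that counterexample and of why a description-length bound rules it out (the toy interface here is mine, not something from the talk):

```python
# Hypothetical illustration of the lookup-table counterexample: a policy
# that is behaviorally optimal but structurally trivial. The interface
# is a toy of my own, not something from the project.

def build_table(all_observation_histories, optimal_action):
    """Precompute one row per possible observation history."""
    return {history: optimal_action(history)
            for history in all_observation_histories}

def table_policy(table):
    """A 'policy' that just looks up the precomputed optimal action."""
    return lambda history: table[history]

# The catch: with |O| possible observations per step and T time steps,
# the table has on the order of |O|**T rows, exponentially large in the
# horizon, so a bound on description length excludes this structure.
```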

Another consideration is that the environment class in which you observe optimal behavior has to be pretty diverse and rich to imply agent structure. Mazes are a very popular environment class for AI for a variety of reasons. If you saw an algorithm optimize in all mazes you would be impressed, but you wouldn't think, oh no, it's an agent. Not necessarily. It seems like you can have non-agents that optimize in the environment class of mazes. So the environment class has to be pretty diverse.
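As one concrete illustration of such a non-agent (my example, not one from the talk): an ordinary breadth-first search routine finds shortest paths in every grid maze, so it behaves optimally across that whole environment class, yet it has nothing resembling beliefs, utilities, or general-purpose search over world models.

```python
# Hypothetical illustration: a fixed special-purpose algorithm that is
# behaviorally optimal on the environment class of grid mazes without
# being anything like an agent.
from collections import deque

def solve_maze(grid, start, goal):
    """Breadth-first search over a grid maze; 0 = open cell, 1 = wall.

    Returns a shortest list of (row, col) cells from start to goal,
    or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    frontier = deque([start])
    came_from = {start: None}
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            # Reconstruct the path by walking the parent pointers back.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and nxt not in came_from):
                came_from[nxt] = cell
                frontier.append(nxt)
    return None
```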

One idea for progress

To leave you with a teaser of some potential math: there's this big looming question of what the heck structure is and how we can tell that something is approximately structured like an agent. The current idea that we're excited about, and that my team is learning about, is Algorithmic Information Theory (a.k.a. Kolmogorov complexity as a field). I suspect that it has techniques that could answer this question, where you could say that one particular algorithm is epsilon bits away from some other algorithm. So we're hoping to make progress using that tool.
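For instance, algorithmic information theory already has a standard notion of distance between two objects, the information distance, which is roughly the number of bits needed to convert either object into the other. This is textbook material rather than a result of ours, but it's the flavor of tool I mean:

```latex
% Standard information distance between objects x and y, in terms of
% conditional Kolmogorov complexity K(. | .), up to additive constants.
\[
  E(x, y) \;=\; \max\{\, K(x \mid y),\; K(y \mid x) \,\}
\]
```

Saying that one algorithm is epsilon bits away from another might then cash out as this distance being at most epsilon; whether that's the right formalization of "approximately agent-structured" is exactly the kind of thing we're hoping to work out.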
