Steering systems

Max H

Preface: who this post is for

This post is mainly targeted at people who are already familiar with the concepts and terminology in alignment, AI x-risk, and particularly the case for high probabilities of doom, but disagree with the arguments or have bounced off of them for one reason or another.

If you've read the 2021 MIRI conversations and remain confused about or disagree with some of the things Nate or Eliezer were saying, this post is for you.

If you understand and agree with everything Nate and Eliezer say in those transcripts already, you probably won't get many new insights from this post, but you may find it useful as a new way of explaining one aspect which (I think) is important to their case.

If you're not familiar with those discussions at all, you may still get value out of this piece, but there is a background assumption of familiarity with some of the concepts discussed there.

Introduction

This post is my own attempt at conveying some intuitions and concepts which various others have already written about at length. I doubt I'll do a better job than those before me, but it seems worth a shot anyway, because:

It explains why I am not optimistic about AI x-risk, even if current approaches to alignment research and AI governance succeed far more swiftly and totally than they appear to be on track for.
It appears to be a persistent source of confusion and disagreement among alignment researchers and the LW commentariat. (This is my attempt at focusing on a place where I think others are dropping the ball).

In short, I'm going to unpack the concept in this tweet:

into some thought experiments that build an intuition for what "smarter than human" systems might look like, and explain why that intuition makes me think that humanity is on track to build and run powerful systems in ways likely to result in bad outcomes, even if some or all aspects of alignment and governance go unexpectedly (to me) well.

I'm going to introduce a new term, define it loosely, provide some examples, and apply it in the context of current systems and a hypothetical future system that performs tree search over world states.

I'll conclude with some implications about the different potential impacts of current alignment research approaches, depending on how this intuition differs from reality.

That which has many names

It has been called "powerful optimization process", "transformative AI", "superintelligence", and many other things. It has been speculated to have convergent instrumental subgoals, deceptiveness, reflectivity, power-seeking, coherence, utility maximization, and other dangerous or potentially undesirable properties.

To some, these may seem like questionable assumptions or burdensome details.

However, the systems which I gesture at in this post need not have any of these properties (though they are permitted to). These properties are better thought of as possible implications that follow from relatively simpler assumptions:

It is possible in principle to build systems capable of re-arranging most of the matter in the universe in almost arbitrary ways.
Further, it seems plausible that levels of arranging-ability that only slightly exceed current human capabilities (in some sense) are sufficient for most kinds of arranging, or, alternatively, that there are systems capable of self-improving to strongly superhuman levels of arranging capability.
It is plausible that humans will build and run such systems relatively soon, in relatively uncontrolled manners, before fully understanding all the consequences of doing so, and these consequences will be irreversible and regrettable.

Abstractly, the kind of system I have in mind is one which, given an initial set of actions it can execute and a specification on outcomes, will choose actions which steer towards those outcomes better than a smart human could, given the same initial set of actions. I'll use a new term, "steering system", rather than existing terms like "optimization process", "AI", "TAI", or "superintelligence", to emphasize the importance of thinking about the thing I'm gesturing at as a system that is actually instantiated and run, rather than as a model or algorithm in isolation. The scary systems are those which are strictly better than humans at steering in a real-world domain, though not necessarily way better.

We can ask of such systems:

How well does such a system steer? For example, both Stockfish level 1 and level 9 steer towards winning chess games, but level 9 is much better at it.
What does it steer towards? This may be readily apparent from either the observed behavior of the system, or by looking at its internal workings. Perhaps the outcome specification is some simple, legible, separable part of the system, and the system steers towards that specification with crisp exactness. Or, perhaps not!
What kinds of systems are we likely to build, with what kinds of outcome specifications, and what initial action set? On the last point, the likely answer is "we will give the system as many initial actions as it can possibly use".

The answers to these questions, about different classes of potential steering systems we're likely to build, correspond to different aspects of alignment and governance being easy or hard, and potential for wildly different futures, though none of them appear hopeful to me.

Applying the term to existing systems

What kinds of existing and future systems can be considered steering systems, and how can we answer the questions above?

Various systems comprised of deep neural networks, humans, or GOFAI algorithms are quite good at steering towards particular outcome states when restricted to particular domains.

System: Stockfish in the domain of a chess game

High-level Stockfish is superhuman at steering towards the set of board states where the opponent is checkmated. This is evident both by inspecting its source code and by playing some chess games against it.

The initial action set is the set of all legal chess moves from the initial board state. An interesting question is what happens if you artificially restrict Stockfish from using certain openings, or even moves in certain future board states. Does Stockfish manage to steer around its restrictions and still win the game?
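As a rough illustration of the restriction question, here is a toy analogue in a much simpler game than chess. It is a sketch under my own assumptions, not anything to do with Stockfish: players alternately remove 1, 3, or 4 stones from a pile, and whoever takes the last stone wins. The names `MOVES`, `mover_wins`, and `winning_openings` are hypothetical. The point is only that whether a solver can steer around a ban depends on how much slack the game leaves it.

```python
from functools import lru_cache

MOVES = (1, 3, 4)  # stones a player may remove on their turn (toy rule set)

@lru_cache(maxsize=None)
def mover_wins(pile: int) -> bool:
    """True if the player to move can force taking the last stone."""
    # A move wins if it leaves the opponent in a position they cannot win.
    return any(m <= pile and not mover_wins(pile - m) for m in MOVES)

def winning_openings(pile: int, banned=frozenset()) -> list[int]:
    """Opening moves that still force a win once `banned` moves are removed."""
    return [m for m in MOVES
            if m <= pile and m not in banned and not mover_wins(pile - m)]

unrestricted = winning_openings(10)
one_ban = winning_openings(10, banned={1})
two_bans = winning_openings(10, banned={1, 3})
```

From a pile of 10, the solver has two winning openings, so banning one of them leaves a route to a forced win, while banning both leaves none. Chess is vastly larger, so this says nothing directly about how much restriction Stockfish can absorb.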

System: Mu Zero, in the domain of a Go game

Mu Zero is superhuman at steering towards the set of board states where the AI's area score is higher than the opponent's. This is apparent by playing some Go games against it, and potentially from inspecting its source, though parts of the "source" are actually deep neural networks which are not so easily inspectable or interpretable.

System: an LLM-based task or problem-solving agent

A single text completion by an LLM is (probably) not a steering system in the domain of the real world. For almost any task you imagine, if that task is not constrained to the world of manipulating and outputting text strings, the system does not steer towards outcomes in a nontrivial way.

However, current LLMs, suitably prompted, chained, and given access to external APIs, can be composed into agent-like systems using tools like LangChain. Other examples: ChatGPT with Plugins, this task-driven agent I saw on Twitter.

Certain prompts (usually user-entered) in these systems encode the outcome specification; others encode the capabilities and methods the system uses to steer towards that specification.

How well do these agents steer? It depends on the task: if the task can be solved purely by manipulating and outputting text, or by making API calls to which the system is given access, the system may be able to steer quite well, as well as or better than a human at the same tasks, and perhaps faster. In general, though, these systems are still far below human level at steering in most meaningful ways.

How easy is it to see what the system is steering towards? Again, this depends on the particulars of the system: what underlying LLMs it is comprised of, how they are prompted, and how they are arranged.

The Open Agency model suggests that for some instantiations of LLM-based agents, it may be easy to specify what they steer for and understand how they do so.

One important remark about LLM-based steering systems: most of the steering power only emerges at the very last stage of applying the foundation model.

Training GPT-4 was the work of hundreds of engineers and millions of dollars of computing resources by OpenAI. LangChain is maintained by a very small team. And a single developer can write a python script which glues together chains of OpenAI API calls into a graph. Most of the effort was in training the LLM, but most of the agency (and most of the useful work) comes from the relatively tiny bit of glue code that puts them all together at the end.
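To give a sense of how small that glue code can be, here is a minimal sketch of an agent loop. Everything in it is an assumption for illustration: `run_agent`, `scripted_llm`, and the `tool: argument` reply format are hypothetical, and `scripted_llm` is a hard-coded stand-in rather than a real model or a real API client.

```python
from typing import Callable

def run_agent(goal: str, llm: Callable[[str], str],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 5) -> list[str]:
    """Repeatedly ask the model for the next step and dispatch it to a tool."""
    transcript = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        reply = llm("\n".join(transcript))   # model proposes "tool: argument"
        name, _, arg = reply.partition(":")
        name, arg = name.strip(), arg.strip()
        if name == "done":                    # model declares the task finished
            transcript.append(f"DONE: {arg}")
            break
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name!r}"
        transcript.append(f"ACTION: {reply}")
        transcript.append(f"OBSERVATION: {observation}")
    return transcript

# Stand-in "model": a scripted function, NOT a real LLM.
def scripted_llm(prompt: str) -> str:
    if "OBSERVATION" not in prompt:
        return "add: 2 3"
    return "done: the answer is " + prompt.rsplit("OBSERVATION: ", 1)[1]

tools = {"add": lambda arg: str(sum(int(x) for x in arg.split()))}
log = run_agent("add two and three", scripted_llm, tools)
```

The loop itself knows nothing about the task. All of the task-specific behavior comes from whatever is plugged in as `llm` and `tools`, which is why swapping in a stronger model changes what the whole system can do without touching the glue.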

System: a human with some instructions and an internet-connected computer

Humans are good at steering across a variety of timescales and domains. A smart human can use a computer to have a large impact on the world, even if they start with very few resources. They can take a remote job as a programmer and then direct other humans to do things in the physical world using the money they make from that job.

Suppose a human in a room is given an initial set of instructions which represent a specification on outcomes to steer towards. Perhaps the instructions say "arrange for grocery delivery to the following address..." or "construct a house at the following address..." or "eradicate malaria worldwide". The human need not carry out the specified task themselves; they can (for example) take a remote job as a programmer and then use their salary to hire others to do the specified task or otherwise arrange to have it done.

The human might choose to interpret the instructions differently than the instruction-writer intended, or simply ignore them altogether according to their own whims. But a suitably incentivized human certainly seems capable of steering towards many of the possible outcome specifications described by the instructions.

How well this system steers depends on:

How smart / capable the human in the room is
How inclined the human is to follow the specified instructions
How the instructions are specified, and how the human interprets ambiguities.

Note that for the other systems described in this post (Stockfish, Mu Zero, LLM agents), the human-based system is capable of steering at least as well as the AI-based system alone: given instructions to win a Go or chess game, a human who is not themselves good at chess or Go can use Mu Zero or Stockfish as a tool. (Stockfish is freely available on the internet; Mu Zero or a similarly good Go-playing system may require some effort to obtain access to.)

A sketch of a powerful steering system

This section is meant to sketch a hypothetical system which is composed of individual pieces which are not steering systems on their own, but when glued together in a particular straightforward way, are a steering system.

It's not necessarily a good design for a real system, but it is meant to illustrate a possible analogue of the kinds of things people do with LLMs and LangChain today that will be possible with future powerful "foundation models". The architecture is loosely inspired by the kinds of architectures in Mu Zero and Dreamer, but over a real-world domain instead of Go, Minecraft, or Atari games.

A one sentence description of the system is:

Perform tree search over world states, given an initial world state, initial action set, and an outcome specification.

In more detail:

The system is comprised of the following components, each of which may be individually as interpretable, transparent, and safe as you like, when used in isolation, though the decision-making and modelling pieces are assumed to be at least human level at their given individual tasks.

An initial action set (A), which is the set of actions available at the initial world state. For concreteness, let's say that this set is the set of executions of any API function in Playwright, a browser automation framework. (Essentially, the system begins with control of an internet-connected web browser.)
A world modeler (W), which, given an action and a world state, returns a predicted new world state and a predicted new action set.

W may take an observation returned by actually executing the action as an optional parameter, which, if provided, is used to make the prediction more accurate.
An evaluation function (E), which, given a world state, returns a score for that world state. This function may be a neural net trained from a set of example world states with human-labelled scores, a hand-coded / mathematical function of the world state, or even a literal human asked to rate (some description or representation of) the provided world state, at some nodes. No assumptions are made about the form of E other than its type; it need not represent a coherent utility function (though it may).
A planner or pruning heuristic (P), which, for a given world state, evaluation function, and set of actions available at that state, returns a subset of those actions to search over. P may also decide, for each candidate action, whether it should actually be executed, or whether the world modeler should be used to predict the next world state based only on a prediction of the result of execution.

Each of these components may be implemented by a large neural network, GOFAI algorithms, a literal human (sometimes, e.g. in a human-in-the-loop setup), or any composition of such elements, as long as they perform their task at least as well and as quickly as a human could.
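The four components above can be summarized as type signatures. This is only a sketch under my own naming assumptions (`WorldModeler`, `Evaluator`, `Pruner`, and `SteeringSystem` are hypothetical names), and it leaves out P's optional decision about whether to actually execute each action.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence, Tuple

WorldState = Any    # whatever representation W and E agree on
Action = Any        # e.g. one browser-automation call
Observation = Any   # result of actually executing an action

# W: (state, action, optional observation) -> (predicted state, predicted actions)
WorldModeler = Callable[[WorldState, Action, Optional[Observation]],
                        Tuple[WorldState, Sequence[Action]]]
# E: state -> score
Evaluator = Callable[[WorldState], float]
# P: (state, E, available actions) -> subset of actions to search over
Pruner = Callable[[WorldState, Evaluator, Sequence[Action]], Sequence[Action]]

@dataclass(frozen=True)
class SteeringSystem:
    initial_actions: Sequence[Action]   # A
    world_modeler: WorldModeler         # W
    evaluate: Evaluator                 # E
    prune: Pruner                       # P

# A trivial instantiation over integers, just to show the pieces fit together.
toy = SteeringSystem(
    initial_actions=[1, 2],
    world_modeler=lambda state, action, obs=None: (state + action, [1, 2]),
    evaluate=float,
    prune=lambda state, E, actions: list(actions)[:1],
)
```

Nothing in these signatures says how any component is implemented, which matches the claim that each slot could be filled by a neural network, a hand-written algorithm, or a human.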

These components can then be arranged into performing a tree search with some simple glue code. Pseudo-python code:

W = deepAI.world_modeler_api()
E = deepAI.construct_evaluation_function_from_human_feedback(scored_example_worlds)
P = deepAI.pruning_heuristic_api()

# Look for a really good world
target_score = 10000000

def tree_search(world_state, choices):
  if E(world_state) > target_score:
    # We're done! print a stack trace and send a global interrupt or something
    # to halt execution across the rest of the tree
    return

  # P returns the subset of the available actions worth searching over
  branches = P(world_state, E, choices)

  # parallelfor is pseudo-code: expand every branch concurrently
  parallelfor action in branches:

    # optional, but helpful if you don't want the world modeler
    # to start drifting too far from reality...
    observation = execute(action)

    predicted_world_state, predicted_actions = W(world_state, action, observation)
    tree_search(predicted_world_state, predicted_actions)

initial_world_state = W()
initial_actions = playwright.chromium
tree_search(initial_world_state, initial_actions)
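For readers who want something they can actually run, here is a toy analogue of the pseudo-code above. It is a sketch under heavy assumptions: the "world" is a single integer, the three actions in `ACTIONS` are made up, the search is sequential and depth-limited rather than parallel, and `execute` takes the state explicitly because there is no real external world to act on.

```python
ACTIONS = {
    "add3": lambda s: s + 3,
    "double": lambda s: s * 2,
    "sub1": lambda s: s - 1,
}
TARGET = 50
target_score = -1  # E(state) > -1 only when state == TARGET

def E(state):
    """Evaluation function: closer to TARGET scores higher."""
    return -abs(state - TARGET)

def execute(state, action):
    """In this toy, 'executing' an action just applies it to the state."""
    return ACTIONS[action](state)

def W(state, action, observation=None):
    """World modeler: use the observation if given, else predict the result."""
    new_state = observation if observation is not None else ACTIONS[action](state)
    return new_state, list(ACTIONS)

def P(state, E, choices, keep=2):
    """Pruning heuristic: keep the actions whose predicted result scores best."""
    ranked = sorted(choices, key=lambda a: E(ACTIONS[a](state)), reverse=True)
    return ranked[:keep]

def tree_search(state, choices, depth=0, max_depth=25):
    """Return a list of actions reaching a high-scoring state, or None."""
    if E(state) > target_score:
        return []
    if depth == max_depth:
        return None
    for action in P(state, E, choices):
        observation = execute(state, action)
        next_state, next_choices = W(state, action, observation)
        rest = tree_search(next_state, next_choices, depth + 1, max_depth)
        if rest is not None:
            return [action] + rest
    return None

plan = tree_search(1, list(ACTIONS))
```

Starting from 1, the search returns a sequence of actions that ends exactly at 50. The glue (`tree_search`) is a dozen lines, and all of the competence lives in `E`, `W`, and `P`.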

Remarks:

(spoiler-ed, in case you want to take some time to consider this system on your own first. Most of the remarks in the spoiler-ed section below are just my own intuitions and rough, disorganized thoughts about how the system described above might behave.)

Even though the initial possible branching factor is very large, a good pruning heuristic should be able to keep the branching factor low enough to make the tree grow sub-exponentially, and low-probability branches can be pruned entirely.

For example, a human pruner might come up with a top 10 list of initial high level plans, and then branch to the initial action for each of those plans. At each of those first 10 child nodes, the pruner will then prune the actions at each node down to one - the one action that continues the current plan. Maybe after several actions, the pruner decides the current subtree it is at the root of looks promising for containing high-scoring worlds, and branches out into a few substrategies, or conversely, that it doesn't look promising, and prunes the branch entirely, freeing up resources for searches in other parts of the tree.
Most of the work in this system is done by the pruning heuristic, which is essentially being asked to plan a course of action that results in a world with a high score.

But P need not be agentic or goal-directed itself. It only needs to do what you would get by asking a human (or prompting GPT-N) to:
- Come up with a list of ten or so high level plans to reach child world states with a high score on the given evaluation function.
- Return the first action for each plan in the list.
Or, given the current location in the tree (the current step in the current plan), to pick the next action to take (or multiple actions to branch to).

The pruning heuristic might need to contain some coarse-grained version of the world modeler within itself, in order to plan effectively. But humans are capable of planning without having a detailed all-encompassing world model, so by assumption the pruning heuristic in this system can do so as well.
When to actually execute an action is also left up to the pruning heuristic to decide, subject to time and space constraints. In the limit of a perfect or extremely superhuman world modeler that can perfectly predict the results of any action, you don't need to execute any actions at all. But in more realistic scenarios, wherever it is feasible to do so without stepping on the toes of other parts of the tree or incurring real-world expenses, you want to execute the actions, both to actually reach the world state you want to reach and to prevent the world modeler's predictions from drifting too far from the actual world.
What does it mean for the pruning heuristic to take the world state as an input? It just means that the pruning heuristic can "see". In the case of a human using a web browser, the human can see which page the browser is open to and whether it has loaded.

I personally could probably order something from Amazon with my eyes closed (or at least write down the complete steps), or write a script that uses Playwright to do so and have it work on the first try, without testing it, but I couldn't get too much further than that.
In the limit of infinite time and space and perfect world modelling, the pruning heuristic can be very dumb: it can just search over every possible subtree. The better the heuristic the system has, the less time and space it needs to search for outcomes, and the more it starts to look like a strategic planner.
Note that the pruning heuristic might be composed of an LLM-based agent, prompted and arranged to plan tasks and choose actions.
For the world modeler, I'm imagining something like the world modeler in Dreamer that plays Minecraft or Atari games, generalized to the real world.
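To put rough numbers on how much work the pruning heuristic saves, here is a small arithmetic sketch. It compares an unpruned tree against the "ten plans, each followed one action at a time" pattern described in the remarks. The branching factor of 10 and depth of 5 are arbitrary assumptions chosen for readability (a real action set would be far larger), and the function names are hypothetical.

```python
def full_tree_nodes(branching: int, depth: int) -> int:
    """Nodes in a tree where every node has `branching` children, root included."""
    return sum(branching ** level for level in range(depth + 1))

def pruned_tree_nodes(plans: int, depth: int) -> int:
    """Root, then `plans` children, each followed by a single chain to `depth`."""
    return 1 + plans * depth

exhaustive = full_tree_nodes(branching=10, depth=5)
pruned = pruned_tree_nodes(plans=10, depth=5)
```

With these numbers the exhaustive tree has 111111 nodes and the pruned one has 51. The exhaustive count grows exponentially with depth while the pruned count grows linearly, which is the sense in which a good heuristic trades search for planning.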

Why is this dangerous?

Maybe all the component pieces are perfectly interpretable, alignable, and non-agentic on their own. For suitably weak pieces and a judiciously-chosen evaluation function, the system safely performs useful steering work. But consider what happens when someone tries any of the following:

Inserts DeepWorldModeler-n+1, allowing the system to accurately search deeper in the tree without needing to actually execute as many actions.
Inserts a GPT-n+1-based agent, suitably prompted, or a CoEm, for the pruning heuristic, allowing it to plan and choose actions more effectively than before.
Gives it more time and space to run, or more initial resources.
Gives it a bad evaluation function, by accident or just to see what happens, under the assumption that the system can be halted at any time if it starts to go off the rails.

Any of the above has the potential to result in a system which is suddenly much better at steering, potentially in undesirable ways, even if the individual components are proved safe or "aligned" in isolation.

The system could suddenly become dangerously capable, without any component piece or even the system as a whole necessarily developing anything like a mesaoptimizer, reflectivity, deceptiveness, or other more exotic failure modes.

Furthermore, if the example of OpenAI and LangChain is any indication, it will be easy to glue together subsystems, compared to the work of training the underlying models.

Even if access to the models is kept private (in a break with current trends), it could be the work of literally a single researcher in a lab deciding to experiment, either on their own or in a sanctioned way, to build the kind of system proposed above out of sufficiently powerful foundation models.

These hypothetical researchers may be more responsible than the general public, but unless they are at an operationally adequate organization, I am not reassured that they will "just not build it" or "just not run it" in sufficiently powerful regimes.

Aside: a possible area of disagreement or differing intuitions

One of the reasons I find this example, and the case for pessimism more generally, compelling, is an intuition that it is easy, in some absolute sense, to re-arrange the matter in the universe in arbitrary ways.

This intuition comes from looking at the world, seeing what humans have accomplished in the realm of biotech and nanotech and other areas of science and engineering, and considering what is permitted in principle by the laws of physics.

Fundamentally, these laws appear to permit a world that is much more malleable than say, the rules of chess permit in the universe of a chess game, or the reachable states permitted in an Atari game (absent exploitation of bugs in the underlying game engine).

Another way of framing this intuition is that I expect, not too deep in the tree, there are action sequences that involve inventing nanotech, biotech, computer viruses, using deception, and a variety of other strategies that are probably catastrophic if used to reach certain world states. These sequences might be currently out of reach to merely human-level steering systems, but the system above need not stick around at human-level for very long, and furthermore the truly dangerous strategies seem to me to be only slightly superhuman.

Others seem to have the intuition that things like nanotech, human manipulation, or other strategies require far more superhuman abilities to be really dangerous or uncontrollable. Perhaps this intuition is where the true crux is, though I think many of the intuitions about steering systems and the implications that follow are useful, regardless of how difficult steering is in an absolute sense, or what the ultimate limits of re-arranging capability are.

How does alignment help?

It may be easy to crisply describe an outcome specification and have a system like the one described above search for outcomes which satisfy it. This corresponds to some aspects of alignment being easy: perhaps it turns out that "inner alignment" is not a problem at all, and the system as described above works perfectly well, given a sufficiently crisply and correctly specified evaluation function.

However, this implies that it is also easy to specify and give a bad evaluation function - an evaluation function that scores worlds composed entirely of tiny molecular squiggles highly is likely simpler to specify than one that captures the full complexity of human values.

Alternatively, perhaps the world modeler or pruning heuristic necessarily contain an inner-agent or mesaoptimizer once they are sufficiently capable, and these agents are not aligned with the outcome specification. In this case, we might run into the various problems of inner alignment.

Perhaps interpretability research will be useful for understanding how the pruning heuristic makes plans and chooses actions, or how the world modeler sees the world.

Or perhaps shard theory has something to say about how the system as a whole coheres, once it reaches a level of capability where it can search through action sequences that involve reflecting on itself.

Or perhaps some of the problems of embedded agency or corrigibility end up being useful for understanding how to make these systems safe.

I think there are many areas of alignment research which could potentially be useful for making a steering system like the one above:

easier to inspect and understand, in terms of what it steers for and how
easier to align, in the sense of making it easier to specify an evaluation function that results in the system steering towards outcomes we desire.

The fundamental issue is that all of this also makes it easier to build a system which doesn't have those nice properties.

I suppose this is where AI governance comes in. There is one recent proposal that, if implemented, might be sufficient to prevent such systems from being built and run at all. I suspect there are ways to build sufficiently powerful systems that can be trained and run with fewer resources than GPT-4 though, so even the proposal above is not actually sufficient to prevent sufficiently good steering systems from being built and run.

In any case, it seems extremely unlikely to actually be implemented, so it is not a source of hope for me.

Conclusion

The main intuition that I hope to convey with this post is about why I feel relatively pessimistic about AI x-risk, even if certain problems in alignment turn out to be easy, or if some of the implications of the world-models of the doomier people turn out to be wrong (e.g. even if it turns out the first systems we build are capable of steering well enough to be used pivotally, but don't have any scary properties like CIS, deceptiveness, agency, or utility maximization).

This intuition comes from a bunch of directions:

The observation that steering systems are modular, and improving a single component can have a large impact on the system's capabilities as a whole. (For example, substituting in a better foundation model for the world modeler or planner.)
The observation that most of the useful steering work or "agency" of a system comes from the very last bits of glue code that put the system together. This glue code can often be written by a single developer, in contrast to the massive efforts that go into training the foundation models that underlie them.
The intuition that the fundamental limits of steering are pretty high, and that it probably doesn't take steering abilities too far beyond human level to re-arrange the matter in the universe in almost arbitrary ways. (Arguments for and examples that build on this intuition were not the focus of this post.)

I think analyses like this which argue for lower chances of doom, based on a bunch of arguments about how current AI systems are scaling, are not really grappling with the actual case for pessimism. Perhaps the real cruxes are in the intuitions about how difficult reality-rearranging is, in an absolute sense.
