Let’s say you’re relatively new to the field of AI alignment. You notice a certain cluster of people in the field who claim that no substantive progress is likely to be made on alignment without first solving various foundational questions of agency. These sound like a bunch of weird pseudophilosophical questions, like “what does it mean for some chunk of the world to do optimization?”, or “how does an agent model a world bigger than itself?”, or “how do we ‘point’ at things?”, or in my case “how does abstraction work?”. You feel confused about why otherwise-smart-seeming people expect these weird pseudophilosophical questions to be unavoidable for engineering aligned AI. You go look for an explainer, but all you find is bits and pieces of worldview scattered across many posts, plus one post which does address the question but does so entirely in metaphor. Nobody seems to have written a straightforward explanation for why foundational questions of agency must be solved in order to significantly move the needle on alignment.
This post is an attempt to fill that gap. In my judgment, it mostly fails; it explains the abstract reasons for foundational agency research, but in order to convey the intuitions, it would need to instead follow the many paths by which researchers actually arrive at foundational questions of agency. But a better post won’t be ready for a while, and maybe this one will prove useful in the meantime.
Note that this post is not an attempt to address people who already have strong opinions that foundational questions of agency don't need to be answered for alignment; it's just intended as an explanation for those who don't understand what's going on.
Starting Point: The Obvious Stupid Idea
Let’s start from the obvious stupid idea for how to produce an aligned AI: have humans label policies/plans/actions/outcomes as good or bad, and then train an AI to optimize for the good things and avoid the bad things. (This is intentionally general enough to cover a broad range of setups; if you want something more specific, picture RL from human feedback.)
Assuming that this strategy could be efficiently implemented at scale, why would it not produce an aligned AI?
I see two main classes of problems:
- In cases where humans label bad things as “good”, the trained system will also be selected to label bad things as “good”. In other words, the trained AI will optimize for things which look “good” to humans, even when those things are not very good.
- The trained system will likely end up implementing strategies which do “good”-labeled things in the training environment, but those strategies will not necessarily continue to do the things humans would consider “good” in other environments. The canonical analogy here is to human evolution: humans use condoms, even though evolution selected us to maximize reproductive fitness.
Note that both of these classes of problems are very pernicious: in both cases, the trained system’s results will look good at first glance.
Neither of these problems is obviously all that bad. In both cases, the system is behaving at least approximately well, at least within contexts not-too-different-from-training. These problems don’t become really bad until we apply optimization pressure, and Goodhart kicks in.
Goodhart’s Law
There’s a story about a Soviet nail factory. The factory was instructed to produce as many nails as possible, with rewards for high numbers and punishments for low numbers. Within a few years, the factory was producing huge numbers of nails - tiny useless nails, more like thumbtacks really. They were not very useful for nailing things.
So the planners changed the incentives: they decided to reward the factory for the total weight of nails produced. Within a few years, the factory was producing big heavy nails, more like lumps of steel really. They were still not very useful for nailing things.
This is Goodhart’s Law: when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy.
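For readers who like to see the mechanism spelled out, here is a minimal sketch of the nail factory story. Everything in it is an assumption made up for illustration: the budget, the cost model (a fixed labor cost per nail plus a steel cost per gram), the candidate nail sizes, and the rule that only nails between 2 g and 50 g are actually useful. The point is just that maximizing either proxy picks a size with zero true value.

```python
# Toy model of the nail factory. All numbers below are illustrative assumptions.
BUDGET = 1000.0              # daily budget, arbitrary units
LABOR_COST_PER_NAIL = 1.0    # fixed cost of forming one nail
STEEL_COST_PER_GRAM = 0.01   # cost of the steel in a nail
CANDIDATE_SIZES_G = [0.1, 1.0, 5.0, 20.0, 500.0, 10_000.0]


def nail_count(size_g: float) -> float:
    """Proxy 1: how many nails of this size the budget pays for."""
    return BUDGET / (LABOR_COST_PER_NAIL + STEEL_COST_PER_GRAM * size_g)


def total_weight_g(size_g: float) -> float:
    """Proxy 2: total weight of nails produced."""
    return nail_count(size_g) * size_g


def useful_nails(size_g: float) -> float:
    """Assumed 'true value': only nails between 2 g and 50 g can nail things."""
    return nail_count(size_g) if 2.0 <= size_g <= 50.0 else 0.0


def best_size(metric) -> float:
    """The size a factory picks when it optimizes hard for the given metric."""
    return max(CANDIDATE_SIZES_G, key=metric)


for name, metric in [("count", nail_count),
                     ("weight", total_weight_g),
                     ("usefulness", useful_nails)]:
    size = best_size(metric)
    print(f"optimize {name}: size {size} g, useful nails {useful_nails(size):.0f}")
```

Under these assumptions, optimizing for count picks the thumbtack-sized nail and optimizing for weight picks the lump of steel; only optimizing the true value lands on a usable size.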
In everyday life, if something looks good to a human, then it is probably actually good (i.e. that human would still think it’s good if they had more complete information and understanding). Obviously there are plenty of exceptions to this, but it works most of the time in day-to-day dealings. But if we start optimizing really hard to make things look good, then Goodhart’s Law kicks in. We end up with Instagram food - an elaborate milkshake or salad or burger, visually arranged like a bouquet of flowers, but impractical to eat and kinda mediocre-tasting.
Returning to our two alignment subproblems from earlier:
- In cases where humans label bad things as “good”, the trained system will also be selected to label bad things as “good”. In other words, the trained AI will optimize for things which look “good” to humans, even when those things are not very good.
- The trained system will likely end up implementing strategies which do “good”-labeled things in the training environment, but those strategies will not necessarily continue to do the things humans would consider “good” in other environments. The canonical analogy here is to human evolution: humans use condoms, even though evolution selected us to maximize reproductive fitness.
Goodhart in the context of the first problem: train a powerful AI to make things look good to humans, and we have the same problem as Instagram food, but with way more optimization power applied. Think “Potemkin village world” - a world designed to look amazing, but with nothing behind the facade. Maybe not even any living humans behind the facade - after all, even generally-happy real humans will inevitably sometimes put forward appearances which would not appeal to the “good”/“bad”-labellers.
Goodhart in the context of the second problem: pretend our “good”/“bad” labels are perfect, but the system ends up optimizing for some target which doesn’t quite track our “good” labels, especially in new environments. Then that system ends up optimizing for whatever proxy it learned; we get the AI-equivalent of humans wearing condoms despite being optimized for reproductive fitness. And the AI then optimizes for that really hard.
Now, we’ve only talked about the problems with one particular alignment strategy. (We even explicitly picked a pretty stupid one.) But we’ve already seen the same basic issue come up in two different subproblems: Goodhart’s Law means that proxies which might at first glance seem approximately-fine will break down when lots of optimization pressure is applied. And when we’re talking about aligning powerful future AI, we’re talking about a lot of optimization pressure. That’s the key idea which generalizes to other alignment strategies: crappy proxies won’t cut it when we start to apply a lot of optimization pressure.
Goodhart Is Not Inevitable
Suppose we’re designing some secure electronic equipment, and we’re concerned about the system leaking information to adversaries via a radio side-channel. We design the system so that the leaked radio signal has zero correlation with whatever signals are passed around inside the system.
Some time later, a clever adversary is able to use the radio side-channel to glean information about those internal signals using fourth-order statistics. Zero correlation was an imperfect proxy for zero information leak, and the proxy broke down under the adversary’s optimization pressure.
But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.
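The gap between zero correlation and zero mutual information can be checked by hand on a tiny example. The sketch below is a toy of my own construction, not a model of any real side-channel: the internal signal S is uniform on {-2, -1, 1, 2}, and the "leaky" design emits R = S², while the "sealed" design emits an R that is independent of S. All names are hypothetical.

```python
from fractions import Fraction
from math import log2


def marginal(joint, index):
    """Marginal distribution of one coordinate of a joint distribution."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[index]] = out.get(outcome[index], 0) + p
    return out


def covariance(joint, f=lambda s: s):
    """Cov(f(S), R), computed exactly from the joint distribution of (S, R)."""
    e_f = sum(p * f(s) for (s, r), p in joint.items())
    e_r = sum(p * r for (s, r), p in joint.items())
    e_fr = sum(p * f(s) * r for (s, r), p in joint.items())
    return e_fr - e_f * e_r


def mutual_information_bits(joint):
    """I(S; R) in bits: sum of p(s,r) * log2(p(s,r) / (p(s) * p(r)))."""
    p_s, p_r = marginal(joint, 0), marginal(joint, 1)
    return sum(float(p) * log2(float(p / (p_s[s] * p_r[r])))
               for (s, r), p in joint.items() if p > 0)


SIGNAL_VALUES = (-2, -1, 1, 2)  # assumed internal signal alphabet

# Leaky design: the radio signal is the square of the internal signal.
leaky = {(s, s * s): Fraction(1, 4) for s in SIGNAL_VALUES}
# Sealed design: the radio signal is independent of the internal signal.
sealed = {(s, r): Fraction(1, 8) for s in SIGNAL_VALUES for r in (1, 4)}

for name, joint in [("leaky", leaky), ("sealed", sealed)]:
    print(name,
          "| Cov(S, R) =", covariance(joint),
          "| Cov(S^2, R) =", covariance(joint, lambda s: s * s),
          "| I(S; R) bits =", mutual_information_bits(joint))
```

Both designs pass the zero-correlation check, but only the sealed one survives an adversary who looks at a higher-order statistic. Mutual information separates the two designs directly, without our having to guess in advance which statistic the adversary will try.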
Many people have an intuition like “everything is an imperfect proxy; we can never avoid Goodhart”. The point of the mutual information example is that this is basically wrong. Figuring out the True Name of a thing, a mathematical formulation sufficiently robust that one can apply lots of optimization pressure without the formulation breaking down, is absolutely possible and does happen. That said, finding such formulations is a sufficiently rare skill that most people will not ever have encountered it firsthand; it’s no surprise that many people automatically assume it impossible.
This is (one framing of) the fundamental reason why alignment researchers work on problems which sound like philosophy, or like turning philosophy into math. We are looking for the True Names of various relevant concepts - i.e. mathematical formulations robust enough that they will continue to work as intended even under lots of optimization pressure.
Aside: Accidentally Stumbling On True Names
You may have noticed that the problem of producing actually-good nails has basically been solved, despite all the optimization pressure brought to bear by nail producers. That problem was solved mainly by competitive markets and reputation systems. And it was solved long before we had robust mathematical formulations of markets and reputation systems.
Or, to reuse the example of mutual information: one-time pad encryption was intuitively obviously secure long before anyone could prove it.
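As a small sanity check on the one-time pad claim, here is a sketch that enumerates a two-bit pad exactly. The message prior and the "broken" key distribution are assumptions picked for illustration, and the function names are hypothetical. With a uniform key, the ciphertext carries zero mutual information about the message; with a key that only randomizes the low bit, it does not.

```python
from fractions import Fraction
from itertools import product
from math import log2


def joint_message_ciphertext(message_prior, key_dist):
    """Joint distribution of (message, ciphertext) where ciphertext = message XOR key."""
    joint = {}
    for (m, p_m), (k, p_k) in product(message_prior.items(), key_dist.items()):
        c = m ^ k
        joint[(m, c)] = joint.get((m, c), 0) + p_m * p_k
    return joint


def mutual_information_bits(joint):
    """I(M; C) in bits, from an exact joint distribution over (m, c) pairs."""
    p_m, p_c = {}, {}
    for (m, c), p in joint.items():
        p_m[m] = p_m.get(m, 0) + p
        p_c[c] = p_c.get(c, 0) + p
    return sum(float(p) * log2(float(p / (p_m[m] * p_c[c])))
               for (m, c), p in joint.items() if p > 0)


# Assumed, deliberately lopsided prior over two-bit messages.
message_prior = {0b00: Fraction(7, 10), 0b01: Fraction(1, 10),
                 0b10: Fraction(1, 10), 0b11: Fraction(1, 10)}

# Proper pad: key uniform over all two-bit values.
uniform_key = {k: Fraction(1, 4) for k in range(4)}
# Broken pad (assumed failure): the high key bit is stuck at zero.
stuck_key = {0b00: Fraction(1, 2), 0b01: Fraction(1, 2)}

mi_uniform = mutual_information_bits(joint_message_ciphertext(message_prior, uniform_key))
mi_stuck = mutual_information_bits(joint_message_ciphertext(message_prior, stuck_key))
print("uniform key, I(M; C) bits:", mi_uniform)
print("stuck key,   I(M; C) bits:", mi_stuck)
```

The broken variant is the kind of accident the next paragraphs worry about: the scheme still "looks like" a one-time pad, and it is only the precise formulation that says which piece was load-bearing.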
So why do we need these “True Names” for alignment?
We might accidentally stumble on successful alignment techniques. (Alignment By Default is one such scenario.) On the other hand, we might also fuck it up by accident, and without the True Name we’d have no idea until it’s too late. (Remember, our canonical failure modes still look fine at first glance, even setting aside the question of whether the first AGI fooms without opportunity for iteration.) Indeed, people did historically fuck up markets and encryption by accident, repeatedly and to often-disastrous effect. It is generally nonobvious which pieces are load-bearing.
Aside from that, I also think the world provides lots of evidence that we are unlikely to accidentally stumble on successful alignment techniques, as well as lots of evidence that various specific classes of things which people suggest will not work. This evidence largely comes from failure to solve analogous existing problems “by default”. That’s a story for another post, though.
What “True Names” Do We Want/Need For Alignment?
What kind of “True Names” are needed for the two alignment subproblems discussed earlier?
- In cases where humans label bad things as “good”, the trained system will also be selected to label bad things as “good”. In other words, the trained AI will optimize for things which look “good” to humans, even when those things are not very good.
- The trained system will likely end up implementing strategies which do “good”-labeled things in the training environment, but those strategies will not necessarily continue to do the things humans would consider “good” in other environments. The canonical analogy here is to human evolution: humans use condoms, even though evolution selected us to maximize reproductive fitness.
In the first subproblem, our “good”/“bad” labeling process is an imperfect proxy of what we actually want, and that proxy breaks down under optimization pressure. If we had the “True Name” of human values (insofar as such a thing exists), that would potentially solve the problem. Alternatively, rather than figuring out a “True Name” for human values directly, we could figure out a “pointer” to human values - something from which the “True Name” of human values could be automatically generated (analogous to the way that a True Name of nail-value is implicitly generated in an efficient market). Or, we could figure out the “True Names” of various other things as a substitute, like “do what I mean” or “corrigibility”.
In the second subproblem, the goals in the trained system are an imperfect proxy of the goals on which the system is trained, and that proxy breaks down when the trained system optimizes for it in a new environment. If we had the “True Names” of things like optimizers and goals, we could inspect a trained system directly to see if it contained any “inner optimizer” with a goal very different from what we intended. Ideally, we could also apply such techniques to physical systems like humans, e.g. as a way to point to human values.
Again, this is only one particular alignment strategy. But the idea generalizes: in order to make alignment strategies robust to lots of optimization pressure, we typically find that we need robust formulations of some intuitive concepts, i.e. “True Names”.
Regardless of the exact starting point, seekers of “True Names” quickly find themselves recursing into a search for “True Names” of lower-level components of agency, like:
- Optimization
- Goals
- World models
- Abstraction
- Counterfactuals
- Embeddedness
- …
Aside: Generalizability
Instead of framing all this in terms of Goodhart’s Law, we could instead frame it in terms of generalizability. Indeed, Goodhart’s Law itself can be viewed as a case/driver of generalization failure: optimization by default pushes things into new regimes, and Goodhart’s Law consists of a proxy failing to generalize as intended into those new regimes.
In this frame, a “True Name” is a mathematical formulation which robustly generalizes as intended.
That, in turn, suggests a natural method to search for and recognize “True Names”. In some sense, they’re the easiest possible things to find, because they’re exactly the things which show up all over the place! We should be able to look at many different instances of some concept, and abstract out the same “True Name” from any of them.
Of course, the real acid test of a “True Name” is to prove, both empirically and mathematically, that systems which satisfy the conditions of the Name also have the other properties which one intuitively expects of the concept. Then we have a clear idea of just how robustly the formulation generalizes as intended.
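The generalization framing of Goodhart’s Law from this section can also be made concrete with a toy. In the sketch below, the true value function and the proxy are both assumptions chosen for illustration: they agree to within 0.1 on the "training" regime [0, 1], and the optimizer is simply allowed to search over wider and wider ranges.

```python
def true_value(x: float) -> float:
    """Assumed true value: grows at first, then collapses for large x."""
    return x - x ** 2 / 10.0


def proxy(x: float) -> float:
    """Assumed proxy: a good approximation of true_value on [0, 1]."""
    return x


def optimize_proxy(bound: float, steps: int = 1000) -> float:
    """Pick the proxy-maximizing x on a grid over [0, bound].

    A larger bound stands in for more optimization pressure.
    """
    candidates = [bound * i / steps for i in range(steps + 1)]
    return max(candidates, key=proxy)


def max_training_gap(steps: int = 1000) -> float:
    """Largest |proxy - true_value| over the training regime [0, 1]."""
    return max(abs(proxy(i / steps) - true_value(i / steps))
               for i in range(steps + 1))


print("max gap on training regime:", max_training_gap())
for bound in (1.0, 5.0, 10.0, 20.0):
    x = optimize_proxy(bound)
    print(f"bound {bound}: chosen x = {x}, proxy = {proxy(x)}, "
          f"true value = {true_value(x)}")
```

With weak optimization the proxy-maximizing choice is also decent by the true value; with strong optimization the chosen point sits far outside the regime where the proxy was checked, and the true value goes negative. A formulation which deserved the label “True Name” would be one where this table keeps looking fine no matter how large the bound gets.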
Summary
We started out from one particular alignment strategy - a really bad one, but we care mainly about the failure modes. A central feature of the failure modes was Goodhart’s Law: when a proxy is used as an optimization target, it ceases to be a good proxy for the thing it was intended to measure. Some people would frame this as the central reason why alignment is hard.
Fortunately, Goodhart is not inevitable. It is possible to come up with formulations which match our concepts precisely enough that they hold up under lots of optimization pressure; mutual information is a good example. This is (one frame for) why alignment researchers invest in pseudophilosophical problems like “what are agents, mathematically?”. We want “True Names” of relevant concepts, formulations which will robustly generalize as intended.
Thank you to Jack, Eli and everyone who attended our discussion last week, which led to this post.