Epistemological Vigilance for Alignment

adamShimi

This post is part of the work done at Conjecture.

Nothing hampers Science and Engineering like unchecked assumptions.

As a concrete example of a field ridden with hidden premises, let's look at sociology. Sociologists must deal with the feedback of their object of study (people in social situations), their own social background, as well as the myriad folk-sociology notions floating in the memesphere. You might think that randomized surveys and statistics give you objective knowledge of the sociological world, but these tools also come with underlying assumptions: for example, that the phenomenon under study does not depend on the fine structure of the social network. If you don’t realize this, you will confidently misinterpret the results without considering the biases of your approach, as in asking kids to sort their play activities into three categories you defined in advance, and then seeing this as a “validation” of the classification.

How to avoid these mistakes? Epistemological vigilance, answer Pierre Bourdieu, Jean-Claude Chamboredon, and Jean-Claude Passeron in "Le métier de sociologue". They borrow the term from French philosopher of science Gaston Bachelard, to capture the attitude of always making explicit and questioning the assumptions behind notions, theories, models, and experiments. So the naive sociologists err because they fail to maintain the restless epistemological vigilance that their field requires.

Alignment, like sociology, demands a perpetual questioning of unconscious assumptions. This is because the alignment problem, and the way we know about it, go against some of our most secure, obvious, and basic principles about knowledge and problem-solving. Thus we need constant vigilance to keep these assumptions from sprouting again unnoticed and steering our work away from alignment.

In this post I thus make explicit these assumptions, and discuss why we have to be epistemologically vigilant about them.^[1] Taken separately, none of these calls to vigilance is specific to alignment; other fields fostered it first. What makes alignment unique is the combined undermining of all these assumptions together. Alignment researchers just can't avoid the epistemological struggle.

Here is my current list:^[2]

Boundedness: the parameters of the problem are bounded, and such bounds can be approximated.
- Reasons for vigilance: we can’t find (yet) any bound on the atomic (uninterruptible) optimization of the world, except the loosest bounds given by the laws of physics. And the few fields with unbounded phenomena suggest a complete phase transition in the design space when going from bounded to unbounded problems.
Direct Access: the phenomenon studied can be accessed directly through experiments.
- Reasons for vigilance: Systems optimizing the world to the degree considered in alignment don’t exist yet. In addition, chilling out until we get them might not be a great idea (see the next point about iteration).
Iteration: the problem can be safely iterated upon.
- Reasons for vigilance: AI risk scenarios involve massive optimization of the world in atomic ways (without us being able to interrupt). And even without leading to the end of the world, strong optimization could still bring about a catastrophe after only one try. Hence the need for guarantees before upping the optimization pressure.
Relaxed Ergodicity: the future behavior of the system, for almost all trajectories, can be well estimated by averaging over its possible behaviors now.
- Reasons for vigilance: strong optimization shifts dynamics towards improbable worlds, leading to predictable errors when generalizing from the current distribution (where these worlds are negligible).
Closedness: the phenomenon can be considered by itself or within a simplified environment.
- Reasons for vigilance: strong optimization would leverage side-channels, so not modeling those could hide the very problem we worry about.
Newtonian^[3]: the system reacts straightforwardly to external forces applied to it (as in Newtonian mechanics), leading to predictable consequences after an intervention.
- Reasons for vigilance: Systems channeling optimization, be they optimizers themselves (AGIs), composed of optimizers (markets), or under selection (cancer cells), react to interventions in convergent ways that cannot be predicted from a pure Newtonian model.

What I'm highlighting here is the need for epistemological vigilance on all these fronts. You don't have to accept the issues, just to grapple with them. If you think that one of these assumptions does hold, that's a great topic for discussion and debate. The failure mode I'm tracking is not to debate the assumptions; it's to not even consider them, while they steer us unchecked.

Thanks to Connor Leahy and TJ for discussions on these ideas. Thanks to Connor Leahy and Sid Black for feedback on a draft.

Digging into the assumptions

Boundedness: never enough

Engineers work within bounds. When you design a bridge, a software security system, a building, or a data center, what matters are the reasonable constraints on what you need to deal with: how much force, how much compute in an attack, how much temperature variation. This leads to bounds on the range of pressures and forces one has to deal with.

Such bounds ease the design process tremendously, by removing the requirement to scale forever. As an example, most cryptographic guarantees come from assuming that the attacker is only using polynomial-time computations.^[4]

Yet what happens when you don’t have bounds? Alignment is in such a state right now, without bounds on the amount of optimization that the AIs will be able to do — that is, on their ability to figure things out and change the world. Physics constrains them, but with the loosest bounds possible — not much to leverage.

Unboundedness overhauls the design space. Now you have to manage every possible amount of force/pressure/optimization. Just imagine designing a security system to resist arbitrary computable attacks; none of the known cryptographic primitives we love and use would survive such a challenge.
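To make the contrast concrete, here is a toy sketch of how a guarantee can hold only relative to a bound on the attacker. It is not a real cryptographic scheme: the 16-bit key, the value of `SECRET_KEY`, the guess budgets, and the `brute_force` helper are all assumptions made up for illustration.

```python
from typing import Optional

KEY_BITS = 16          # toy key size (an assumption), small so the demo runs fast
SECRET_KEY = 0xBEEF    # hypothetical secret in [0, 2**KEY_BITS)


def brute_force(secret: int, budget: Optional[int]) -> Optional[int]:
    """Try keys 0, 1, 2, ... and return the secret if it is hit within `budget` guesses.

    `budget=None` models an attacker with no bound on computation.
    """
    guess = 0
    while budget is None or guess < budget:
        if guess == secret:
            return guess
        guess += 1
        if guess >= 2 ** KEY_BITS:  # key space exhausted without a hit
            return None
    return None


# The same key is "secure" against a bounded attacker and broken by an unbounded one.
bounded_result = brute_force(SECRET_KEY, budget=1_000)
unbounded_result = brute_force(SECRET_KEY, budget=None)
print("bounded attacker found:", bounded_result)
print("unbounded attacker found:", unbounded_result)
```

Nothing about the key changes between the two calls; only the assumed bound on the attacker does. That is the sense in which the security claim lives in the bound rather than in the design.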

That being said, some fields study such unbounded problems. Distributed computing theory is one, where asynchronous systems lack any bound on how long a message takes, or on the relative speed of different processes. Theoretical computer science in general tackles unboundedness in a bunch of settings (asynchronous distributed algorithms, worst-case complexity…), because modeling the exact situations in which algorithms will be used is hard, and so computer scientists aim for the strongest possible guarantees.

Epistemological vigilance for boundedness requires that we either:

find a solution that works in the unbounded setting;

find relevant and small enough bounds on capabilities and solve for this bounded setting;

or enforce such a bound on capabilities and solve for this bounded setting.

A big failure mode here is to just assume a bound that lets you prove something, when it’s not the first step to one of the three approaches above. We’re not trying to find versions of the problem that are easy to solve; we’re trying to solve the problem we expect to face. It’s easy to find a nice solution for a bounded setting, and simply convince yourself that the bound will hold and you will be fine. But this is not an argument, just a wish.

Direct access: so far and yet so close

If you study fluids, their physical existence helps a lot. Similarly with heat, brains, chemical substances, institutions, and computers. Your direct access to the phenomenon you’re studying lets you probe it in myriads of ways and check for yourself whether your models and theories apply. You can even amass a lot of data before making a theory.

Last time I checked, we still lacked an actual AGI, or really any way of strongly optimizing the world to the extent we worry about in alignment. So alignment research is banned from the fertile ground of interacting with the phenomenon itself. Which sucks.

It is not at all the only field of research that suffers from this problem, though: all historical sciences (evolutionary biology, geology, archaeology...) deal with it too, because their objects of study are often past events that cannot be accessed directly, witnessed, or recreated.

Most people involved in alignment acknowledge this, even when they don't agree with the rest of this list. Indeed, lack of direct access is regularly used as an argument to delay working on AGI alignment and focus instead on current systems and capabilities. That is, waiting for actual AGI or strong optimizing systems to be developed before studying them.

The problem? This proposal fails to be vigilant about the next assumption, the ability to iterate.

Iterability: don't mess it up

One thing that surprised me when reading about the Moon missions and the Apollo program is how much stuff broke all the time. The Saturn V engines pogoed, the secondary engines blew up, seams evaporated, and metal sheets warped under the (simulated) ridiculous temperature gradients of outer space. How did they manage to send people to the Moon and back alive in these conditions? Factoring out a pinch of luck, hardcore iteration. Everything was tested in as many conditions as possible, and iterated on until it didn’t break after extensive stress-tests.^[5]

This incredible power of iteration can be seen in many fields where new problems need to be solved, from space engineering to drug design. When you don't know, just try out ideas and iterate. Fail faster, right?

Yet once again, alignment can’t join in on the fun, because massive misguided optimization of the world doesn’t lend itself to a second try. If you fail, you risk game over. So epistemological vigilance tells us to either solve the problem before running the system (that is, before iterating) or find guarantees on safety when iterating with massive amounts of optimization (which is almost the same thing as actually solving the problem).

This “you can’t get it wrong” property doesn’t crop up often in science or engineering, but we can find it in the prevention of other existential risks, like nuclear war or bio-risks; or even in climate science.

The implications for alignment should be clear: we can’t just wait for the development of AGI and related technologies, and we have to work on alignment now (be it for solving the full problem or for showing that you can iterate safely), thus grappling in full with the lack of direct access.

Relaxed ergodicity: a whole new future

Imagine you’re studying gas molecules in a box. In this case and for many other systems, the dynamics behave well enough (with ergodicity for example) to let you predict relevant properties of the future states based on a deep model of the current state. Much of Boltzmann's work in statistical mechanics is based on leveraging this ability to generalize. Even without the restriction of full ergodicity, many phenomena and systems evolve in ways predictable from the current possibilities (through some sort of expectation).

Wouldn't that be nice, says epistemological vigilance. Yet strong optimization systematically shifts probability and so turns improbable world states into probable ones.^[6] Thus what we observe now, with the technology available, will probably shift in non-trivial ways that need to be understood and dealt with. Ideas like instrumental convergence are qualitative predictions on this shift.

This is not a rare case. Even in statistical mechanics, you don’t always get ergodicity or the nice relaxations; and in the social sciences, this sort of shift is the standard, even if economic theory doesn’t seem good at addressing it. More generally, there’s a similarity with what Nassim Taleb calls Extremistan: settings where one outlier can matter more than everything that happened before (like many financial bets).
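As a rough illustration of "one outlier can matter more than everything else", the sketch below compares how much of a total comes from the single largest value in a thin-tailed and a heavy-tailed setting. The sample size `N`, the Pareto tail index `ALPHA`, and the choice of a normal versus a Pareto distribution are my own assumptions for the example, not anything taken from Taleb. Evenly spaced quantiles stand in for random draws so that the result is deterministic.

```python
from statistics import NormalDist

N = 10_000     # illustrative sample size (an assumption)
ALPHA = 1.1    # illustrative Pareto tail index (an assumption); smaller means a heavier tail


def max_share(values):
    """Fraction of the total contributed by the single largest value."""
    return max(values) / sum(values)


# Evenly spaced quantile levels stand in for a random sample.
levels = [(i + 0.5) / N for i in range(N)]

# Thin-tailed case: magnitudes of standard normal quantiles.
standard_normal = NormalDist()
thin = [abs(standard_normal.inv_cdf(u)) for u in levels]

# Heavy-tailed case: Pareto quantiles via the inverse CDF (1 - u) ** (-1 / ALPHA).
heavy = [(1.0 - u) ** (-1.0 / ALPHA) for u in levels]

thin_share = max_share(thin)
heavy_share = max_share(heavy)
print(f"thin-tailed:  largest value is {thin_share:.4%} of the total")
print(f"heavy-tailed: largest value is {heavy_share:.4%} of the total")
```

In the thin-tailed case the largest observation is a negligible part of the sum, so averaging over what you have seen is informative. In the heavy-tailed case a single observation carries a substantial share of the total, so the history before it tells you much less.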

Quoting Taleb, those who don’t realize they’re in Extremistan get “played for suckers”. In alignment that would translate to only studying what we have access to now, with little conceptual work on what will happen after the distribution shifts, or how it will shift, and so risking destruction because we refused to follow through on all our reasons for expecting a shift.
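The probability shift discussed in this section can be sketched with a deliberately crude model, in which optimization is treated as best-of-n selection over independent draws from the current distribution. This model, the function name `prob_in_region_after_selection`, and the value of `P_BASE` are assumptions for illustration only; real optimizers are not limited to sampling and filtering.

```python
def prob_in_region_after_selection(p_base: float, n_candidates: int) -> float:
    """Chance that a best-of-n selector ends up in a region of base probability p_base.

    Assumes n_candidates independent draws from the base distribution and a
    selector that picks a draw inside the region whenever at least one exists.
    """
    return 1.0 - (1.0 - p_base) ** n_candidates


P_BASE = 1e-6  # hypothetical: a one-in-a-million world state under today's distribution

for n in (1, 10**3, 10**6, 10**8):
    print(f"candidates searched: {n:>11,}  P(end up in region) = "
          f"{prob_in_region_after_selection(P_BASE, n):.6f}")
```

Under these assumptions, a region that is negligible with a single draw becomes more likely than not once the search covers on the order of 1/P_BASE candidates. Estimates made from the unoptimized distribution therefore say little about where a strongly optimized system ends up.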

Closedness: everything is relevant

Science thrives on reductionism. By separating one phenomenon, one effect, from the rest of the world, we gain the ability to model it, understand it, and often reinsert it into the broader picture. From physics experiments to theoretical computer science’s simplifications, through managing confounding variables in social sciences studies, such isolation is key to insight after insight in science.

On the other hand, strong optimization is the perfect example of a phenomenon that cannot be boxed (pun intended). Epistemological vigilance reminds us that the core of the alignment problem lies in the impact of optimization over the larger world, and in the ability of optimization to utilize and leverage unexpected properties of the world left out of "the box abstraction". As such, knowing which details can be safely ignored is far more fraught than might be expected.

One field with this problem jumps to mind: computer security.^[7] In it, a whole class of attacks (side-channel attacks) depends on implementation and other details generally left outside of formalizations, like the power consumption of the CPU.
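A minimal sketch of the idea, under stated assumptions: the comparison below is "correct" according to a specification that only mentions its boolean result, yet the number of characters it examines leaks information. Here the step count is a stand-in for a physical side channel such as running time or power consumption. The secret, the alphabet, the attacker's knowledge of the secret's length, and both function names are hypothetical.

```python
from typing import Tuple

SECRET = "k3y"  # hypothetical secret
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"


def leaky_equals(guess: str, secret: str = SECRET) -> Tuple[bool, int]:
    """Compare character by character with early exit.

    Returns the comparison result plus the number of characters examined.
    A formal model would only mention the boolean; the step count plays
    the role of the side channel.
    """
    steps = 0
    for g, s in zip(guess, secret):
        steps += 1
        if g != s:
            return False, steps
    return len(guess) == len(secret), steps


def recover_secret(length: int) -> str:
    """Recover the secret one position at a time using only the step count."""
    known = ""
    for position in range(length):
        padding = "_" * (length - position - 1)  # "_" never matches the secret
        for candidate in ALPHABET:
            ok, steps = leaky_equals(known + candidate + padding)
            # A correct character lets the comparison run one step further.
            if ok or steps > position + 1:
                known += candidate
                break
    return known


recovered = recover_secret(len(SECRET))
print("recovered:", recovered)
```

The attack needs at most `len(ALPHABET)` guesses per position instead of a search over every possible string. A model of the system that omits the step count would predict that no such shortcut exists.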

But really, almost all sciences and engineering disciplines have examples where isolating the phenomenon ends up distorting it or even removing it. Recall from the introduction that the use of random sampling in sociology when selecting people to survey destroys any information that could have been collected about the fine structure of the network of relationships.

Examining closedness has been a focus of much of the theoretical part of conceptual alignment, from embedded agency to John's abstraction work. That being said, this epistemological vigilance is rarer among applied alignment researchers, maybe due to the prevalence of the closed system assumption in ML. As such, it's crucial to emphasize the need for vigilance here in order to avoid overconfidence in our models and experimental results.

Newtonian: complex reactions

Newton's laws of motion provide a whole ontology for thinking about how phenomena react to change: just compute the external forces, and you get a prediction of the result. Electromagnetism and Thermodynamics leverage this ontology in productive ways; so does much of structural engineering and material science, even some productivity writers.

In alignment, on the other hand, the effect of interventions and change is far more involved, raising flags for epistemological vigilance. Beyond that, strong optimization doesn't just react to intervention by being pushed around; it instead channels itself through different paths towards the same convergent results. Deception in its many forms (for example deceptive alignment from the Risks paper) is but one generator of such highly non-Newtonian behaviors.

This is far more common than I initially expected. Social sciences in general suffer from this problem, as a lot of their predictions, analyses, and interventions alter the underlying dynamics of the social world they’re studying. Another example is cancer research, where intervening on some but not all signaling pathways might lead to adaptations towards the remaining pathways, instead of killing the cancer.
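The pathway example can be sketched as a search problem. In the toy graph below, three independent routes lead to the same outcome, and an intervention is modelled as blocking an edge. The graph, the node names, and the use of breadth-first search as a stand-in for "optimization" are all assumptions for illustration, not a model of any real biological or AI system.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Graph = Dict[str, List[str]]
Edge = Tuple[str, str]

# Hypothetical toy system: three independent pathways lead to the same outcome.
PATHWAYS: Graph = {
    "start": ["a", "b", "c"],
    "a": ["goal"],
    "b": ["b2"],
    "b2": ["goal"],
    "c": ["c2"],
    "c2": ["c3"],
    "c3": ["goal"],
    "goal": [],
}


def find_path(graph: Graph, blocked: Set[Edge],
              source: str = "start", target: str = "goal") -> Optional[List[str]]:
    """Breadth-first search for a shortest path that avoids blocked edges."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in graph[node]:
            if (node, nxt) in blocked or nxt in seen:
                continue
            seen.add(nxt)
            queue.append(path + [nxt])
    return None


no_intervention = find_path(PATHWAYS, blocked=set())
one_blocked = find_path(PATHWAYS, blocked={("a", "goal")})
two_blocked = find_path(PATHWAYS, blocked={("a", "goal"), ("b2", "goal")})
all_blocked = find_path(PATHWAYS, blocked={("a", "goal"), ("b2", "goal"), ("c3", "goal")})

for label, path in [("none", no_intervention), ("one", one_blocked),
                    ("two", two_blocked), ("all", all_blocked)]:
    print(f"pathways blocked: {label:>4} -> route taken: {path}")
```

A purely Newtonian reading would predict that blocking the route currently in use stops the outcome. In this sketch the outcome is reached anyway through a longer route, and only blocking every pathway prevents it.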

Keeping such a Newtonian assumption without a good model of what it's abstracting away leads to overconfidence on the applicability of interventions, and on our ability to direct the system. If we want to solve the problem and not delude ourselves, we need to grapple with the subtleties of reactions to interventions, if only to argue that they can be safely ignored.

Vicious synergies

As if the situation weren’t difficult enough, note that there's a sort of vicious synergy between different assumptions. That is, the failure of one can undermine another.

Unboundedness undermines iterability, because we can’t bound how bad a missed first try would be.

As already discussed, lack of iterability undermines direct access, because it forces us to consider the problems before getting access.

Both openness and non-Newtonian behavior undermine relaxed ergodicity, as they allow more mechanisms leading to strong probability shifts.

Is it game over then?

Where does this leave us? My goal here is not to convince you that we are doomed; instead, I want to highlight which standard assumptions of science and research require epistemological vigilance if we are to solve the actual problem concerning us.

Such explicit deconfusion has at least three benefits:

(Focusing debate) Often people debate and disagree about related questions without being able to pinpoint the crux. What I hope this post gives us is better shared handles to debate these questions.
(Model for newcomers) One of the hardest aspects of learning alignment is to not fall into the many epistemological traps that lie everywhere in the field. This post is far from sufficient to teach someone how to do that, but it is a first step.
(Open problems for epistemology of alignment) For my own research, I want a list of epistemic problems to guide me, that I can keep in mind while reading on the history of science and technology. That way, I can apply any new idea or trick I learn to all of them (as Feynman did for his own list of problems^[8]), and see if they can be relevant for making alignment research go faster.

There is not much merit in solving a harder problem than what you need to solve. On the other hand, solving a simpler problem, when it is not on a path of attack to the actual problem, leads to inadequate solutions and overconfidence in their power. Let's hone our epistemological vigilance together, and ensure that we're moving in the best available direction.^[9]

Appendix: Conjecture’s Take

This post came about from discussions within Conjecture to articulate why we think alignment is hard, and why we expect many standard ML approaches to fail. As such, our take is that each of these assumptions will break by default, and that we either need to solve the problem without them or enforce some version of them.

^[1] Note that most of what I discuss in this post has been mentioned, proposed, or presented elsewhere, be it by Eliezer, Bostrom, or later thinkers. My contribution lies in making the assumptions explicit and bringing them all together.

^[2] Obviously it is only my current best model and is bound to change. Even during the writing of this post, I split one assumption into the two last ones of the final list.

^[3] This is the assumption for which my naming and description feel furthest from the True Name of what I’m pointing at. So please suggest alternative names and characterizations, or ask questions to pinpoint what I’m describing.

^[4] You also need conjectures about the hardness of reversing hash functions.

^[5] Engineers also added redundancy to avoid single points of failure as much as possible, but that would have been insufficient without the improvements born of iteration.

^[6] See this post for an old-school Eliezer story-explanation (and really all of Eliezer's side of the FOOM debate).

^[7] Cue security mindset.

^[8] From this talk by Gian-Carlo Rota: “Richard Feynman was fond of giving the following advice on how to be a genius. You have to keep a dozen of your favorite problems constantly present in your mind, although by and large they will lay in a dormant state. Every time you hear or read a new trick or a new result, test it against each of your twelve problems to see whether it helps. Every once in a while there will be a hit, and people will say: "How did he do it? He must be a genius!" ”

^[9] One idea that I don't discuss in the post but which is relevant is what to do if we find good reasons to expect the problem to be impossible. In such cases, the focus should be on articulating them, checking them, and finding the best possible ways of convincing everyone of them to stop the race to extinction.

Awesome post, putting into words the intuitions I had for what dimensions the alignment problem stayed in. You've basically meta-bounded the alignment problem, which is exactly what we need when dealing with problems like this.

As it happens, I think this is a rather important topic. Failure to consider and mitigate the risk of assumptions creates both false negative (less concerning) and false positive (most concerning) risks when attempting to build aligned AI.

Bookmarked. This seems like a great post to periodically revisit to check my assumptions (and maintain that vigilance). The compact list toward the top is especially handy for reference.

Newtonian: complex reactions

So please suggest alternative names and characterizations, or ask questions to pinpoint what I’m describing.

Are you pointing here at the fact that the AI training process and world will be a complex system, and as such it is hard to predict the outcomes of interventions, and hence the first-order obvious outcomes of interventions may not occur, or may be dominated by higher-order outcomes? That's what the "complex reactions" and some of the references kind of point at, but then in the description you seem to be talking more about a specific case: Strong optimisation will always find a path if it exists, so patching some but not all paths isn't useful, and in fact could have weird counter-productive effects if the remaining paths that the strong optimisation takes are actually worse in some other ways than the ones you patched.

Other possible names would then be either leaning into the complex systems view, so the (possibly incorrect) assumption is something like "non-complexity" or "linear/predictable responses"; or leaning into the optimisation paths analogy which might be something like "incremental improvement is ok" although that is pretty bad as a name.

Are you pointing here at the fact that the AI training process and world will be a complex system, and as such it is hard to predict the outcomes of interventions, and hence the first-order obvious outcomes of interventions may not occur, or may be dominated by higher-order outcomes?

This points at the same thing IMO, although still in a confusing way. This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward.

Other possible names would then be either leaning into the complex systems view, so the (possibly incorrect) assumption is something like "non-complexity" or "linear/predictable responses"; or leaning into the optimisation paths analogy which might be something like "incremental improvement is ok" although that is pretty bad as a name.

Someone at Conjecture proposed linear too, but Newtonian physics isn't linear. Although I agree that the sort of behavior and reaction I'm pointing out fit within the "non-linear" category.

Thermodynamic? Thermodynamics seems to be about using a small number of summary statistics (temperature, pressure, density, etc.) because the microstructure of the system isn't necessary to compute what will happen at the macro level.

This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward.

This seems to me that you want a word for whatever the opposite of complex/chaotic systems are, right? Although obviously "Simple" is probably not the best word (as it's very generic). It could be "Simple Dynamics" or "Predictable Dynamics"?

The list of assumptions seems hung up in the air, so it's hard to perceive it. Who takes (will take? should take?) these assumptions: AGI developers, alignment researchers, AGIs, or the people with whom AGI presumably should align? Are these assumptions behind a specific theory or framework of alignment (or of certain people), or are these sort of "shared", or "considered common sense" assumptions, that you think most researchers in the field have? Are these assumptions behind a certain conclusion or an estimate of the probability of people's survival (low, uncertain, high)?

Ok... Upon reading the whole post, I understand that this is the list of "[standard, unchecked, unconscious] assumptions in science", which don't apply in the field of AI Safety. The biggest confusion is due to the fact that the list is immediately preceded by the words "here is my current list", which gives a strong sense that these are your assumptions.

I think it would be much clearer for the readers, and also has a better chance of sticking in the LW/alignment slang, if the list was called "epistemic/research complications/challenges [for alignment]", and the titles of the sections were inverted: boundedness -> unboundedness, direct access -> lack of direct access, etc. I think it's much more natural to think about these things in this way, and you yourself sometimes slip into this frame in the text, calling these "epistemic problems" rather than "assumptions".

Unclear what you mean by "Newtonian assumption". If you mean a sort of epistemological method, then I'm familiar with this term as so-called "Newtonian-Cartesian thinking", which amounts to (classical) rationalism, reductionism, belief that there is the best (optimal) decision, solution, theory, and explanation of events. But this is not quite what you are talking about in the respective section (rather, you talk about reductionism, for instance, in the preceding section).

Rather, in the section about the Newtonian assumption, you seem to try to point to the agency, self-interest, and intentionality of AIs. There is an indirect relation to Newton (he probably thought there is just one agent in the universe: God?). However, the way this section is written makes it hard to infer what you tried to point to in it.

Last time I checked, we still lacked an actual AGI, or really any way of strongly optimizing the world to the extent we worry about in alignment.

This is a crux for me and a basis of much of my optimism about the problem: We already live in a world that is extremely optimized by engineering, and while it may seem like superintelligence would allow you to do things that ordinary humans cannot, that is far from a certainty.

The question for me is not whether "superintelligence would allow you to do things that ordinary humans cannot", it is whether superintelligence would allow you - within a year or two - to know how to do things that ordinary humans might figure out how to do and defend against in 100 years.

It's no help at all to us if 100 years later we would have figured out how to effectively combat self-replicating factories, customized biological weapons, brain subversion, and/or any other combination of things that might actually be effective in the appendages of something far smarter than us. If the AI works out how to do it long before that, we're in trouble.
