I want to quickly draw attention to a concept in AI alignment: Robustness to Scale. Briefly, you want your proposal for an AI to be robust (or at least fail gracefully) to changes in its level of capabilities. I discuss three different types of robustness to scale: robustness to scaling up, robustness to scaling down, and robustness to relative scale.

The purpose of this post is to communicate, not to persuade. It may be that we want to bite the bullet of giving up the strongest forms of robustness to scale, and build an AGI that is simply not robust to scale, but if we do, we should at least realize that we are doing so.

Robustness to scaling up means that your AI system does not depend on not being too powerful. One way to check for this is to think about what would happen if the thing that the AI is optimizing for were actually maximized. One example of failure of robustness to scaling up is when you expect an AI to accomplish a task in a specific way, but it becomes smart enough to find new creative ways to accomplish the task that you did not think of, and these new creative ways are disastrous. Another example is when you make an AI that is incentivized to do one thing, but you add restrictions that make it so that the best way to accomplish that thing has a side effect that you like. When you scale the AI up, it finds a way around your restrictions.
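
To make the "what if this were actually maximized" check concrete, here is a minimal, purely hypothetical sketch (my own toy example, not from the post; the proxy reward, action names, and numbers are all made up). A more capable optimizer is modeled simply as one that has discovered more actions: the weak optimizer behaves as intended, while the strong one maximizes the proxy through a loophole.

```python
import itertools

def proxy_reward(plan):
    """Designer's proxy: 'no visible trash'.  It rewards hiding trash or
    blinding the sensor just as much as actually binning it."""
    visible_trash = 3
    for action in plan:
        if action == "put_in_bin":
            visible_trash = max(0, visible_trash - 1)
        elif action == "sweep_under_rug":
            visible_trash = 0            # loophole: trash hidden, not removed
        elif action == "disable_camera":
            return 10.0                  # bigger loophole: the sensor sees nothing
    return 3 - visible_trash

def true_reward(plan):
    """What the designer actually wanted: trash in the bin."""
    return sum(1 for action in plan if action == "put_in_bin")

def best_plan(action_set, horizon=3):
    """Exhaustively maximize the proxy over plans built from `action_set`.
    'Scaling up' is modeled as a larger action set: the agent has found
    creative actions the designer never considered."""
    return max(itertools.product(action_set, repeat=horizon), key=proxy_reward)

WEAK_ACTIONS = ["put_in_bin", "wait"]                                # intended behaviour
STRONG_ACTIONS = ["disable_camera", "sweep_under_rug"] + WEAK_ACTIONS

for name, actions in [("weak", WEAK_ACTIONS), ("strong", STRONG_ACTIONS)]:
    plan = best_plan(actions)
    print(name, plan, "proxy:", proxy_reward(plan), "true:", true_reward(plan))
```

Nothing about the proxy changes between the two rows of output; only the optimizer's reach does, and that alone flips the outcome from the intended behaviour to the loophole.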

Robustness to scaling down means that your AI system does not depend on being sufficiently powerful. You can't really make your system still work when it scales down, but you can maybe make sure it fails gracefully. For example, imagine you have a system that tries to predict humans and uses these predictions to figure out what to do. When scaled up all the way, the predictions of humans are completely accurate, and it will only take actions that the predicted humans would approve of. If you scale down the capabilities, your system may predict the humans incorrectly. These errors may compound as you stack many predicted humans together, and the system can end up optimizing for some seemingly random goal.
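
As a toy illustration of that failure mode (my own sketch with made-up numbers, not something from the post): represent the humans' values as a unit vector, model each predicted human as the previous layer's judgment plus prediction noise, and watch how far the stacked target drifts as the predictor gets less accurate.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
true_values = rng.normal(size=dim)
true_values /= np.linalg.norm(true_values)        # the humans' actual values

def predicted(values, error):
    """One imperfect 'predicted human': the real judgment plus noise."""
    noisy = values + error * rng.normal(size=dim)
    return noisy / np.linalg.norm(noisy)

def stacked_target(depth, error):
    """Stack `depth` predicted humans, each one predicting the layer below."""
    v = true_values
    for _ in range(depth):
        v = predicted(v, error)
    return v

for error in (0.01, 0.1):                         # scaled-up vs scaled-down predictor
    for depth in (1, 10, 100):
        overlap = stacked_target(depth, error) @ true_values
        print(f"prediction error {error}, stack depth {depth:>3}: "
              f"overlap with true values {overlap:.2f}")
```

With an accurate predictor the stacked target stays close to the real values; with a noisier one it decays toward an essentially random direction, which is the "seemingly random goal" above.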

Robustness to relative scale means that your AI system does not depend on its subsystems being similarly powerful to one another. This is easiest to see in systems that rely on adversarial subsystems: if one part of your AI system suggests plans and another part tries to find problems with those plans, then when you scale up the suggester relative to the verifier, the suggester may find plans that are optimized to take advantage of the verifier's weaknesses.
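
Here is a small simulation of that dynamic (my own toy model, not from the post; every distribution and parameter is invented). Plans carry a hidden "exploit" component that inflates how good they look; the verifier catches exploits with a probability that grows with its skill; the suggester keeps the best-looking plan that gets through.

```python
import math
import random

rng = random.Random(0)

def make_plan():
    true_value = rng.gauss(0.0, 1.0)
    exploit = rng.expovariate(1.0)        # how hard the plan games the verifier
    return true_value, exploit

def verifier_accepts(plan, skill):
    """A more capable verifier is more likely to notice an exploit."""
    _, exploit = plan
    return rng.random() < math.exp(-skill * exploit)

def chosen_plan(suggester_budget, verifier_skill):
    """The suggester keeps the accepted plan that *looks* best (true value + exploit)."""
    accepted = [p for p in (make_plan() for _ in range(suggester_budget))
                if verifier_accepts(p, verifier_skill)]
    return max(accepted, key=lambda p: p[0] + p[1], default=None)

for budget, skill in [(10, 2.0), (10_000, 2.0), (10_000, 8.0)]:
    chosen = [chosen_plan(budget, skill) for _ in range(200)]
    chosen = [p for p in chosen if p is not None]
    mean_true = sum(p[0] for p in chosen) / len(chosen)
    mean_exploit = sum(p[1] for p in chosen) / len(chosen)
    print(f"suggester budget {budget:>6}, verifier skill {skill}: "
          f"true value {mean_true:+.2f}, exploitation {mean_exploit:.2f}")
```

Holding the verifier fixed while scaling up the suggester drives the exploitation term up; scaling the verifier up alongside it brings the accepted plans back toward honest ones.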

My current state is that when I hear a proposal for AI alignment that does not feel very strongly robust to scale, I become very worried about the plan. Part of this comes from feeling like we are actually very early on a logistic capabilities curve; I thus expect that as we scale up capabilities, we will eventually get large differences very quickly. So I expect the scaled-up (and partially scaled-up) versions to actually happen. However, robustness to scale is very difficult, so it may be that we have to depend on systems that are not very robust, and be careful not to push them too far.

Comments:

Robustness to scale is still one of my primary explanations for why MIRI-style alignment research is useful, and why alignment work in general should be front-loaded. I am less sure about this specific post as an introduction to the concept (since I had it before the post, and don't know if anyone got it from this post), but think that the distillation of concepts floating around meatspace to clear reference works is one of the important functions of LW.

Raemon:

(5 upvotes from a few AF users suggest this post should probably be nominated by an additional AF person, but I'm unsure. I do apologize again for not having better nomination-endorsement UI.

I think this post may have been relevant to my own thinking, but I'm particularly interested in how relevant the concept has been to other people who think professionally about alignment)

Buck:

I think that the terms introduced by this post are great and I use them all the time.

I'm less interested in robustness to scaling down than scaling up; if we can verify empirically that a relevant component is (well) above the capability level at which the overall system would fail, then I don't see a strong reason for concern.

I'm ambivalent about robustness to relative scale, since it seems like you can often have strong empirical evidence about relative capabilities + pretty good theoretical reasons. As a particularly extreme example, my current scheme probably requires some assumption like "The model after N+1 gradient updates isn't that much better than the model after N gradient updates," which I think is likely to be OK.

While I'm not an AI researcher, so I'm not sure which concerns are most relevant, I'll say that "risk of scaling down" and "risk of relative scale" hadn't even been on my radar as things to pay attention to, so having succinct handles for them seemed handy.

It does seem nice to group them and have clear handles.

ML researchers more often think about risks from scaling down and relative scale, since those come up more frequently (and are harder to fix) today.

I am worried that if you train both sides of an asymmetric game, you run into problems where you scale up at playing one side faster than the other. This makes me think "The model after N+1 gradient updates isn't that much better than the model after N gradient updates" is not enough of an assumption if you operationalize it in a way that ensures you are using the model to do the same thing in both cases; and if you don't operationalize it that way, it seems like too strong of an assumption.

I agree that asymmetric games are the interesting case, and it's rare you can use an assumption this weak.

Nice explanation of a concept, thanks.

My intuition from complex systems studies is that I wouldn't be too content even with things which look robust: "scale-invariant" properties of systems in most cases hold only across a few orders of magnitude.

For things that are not robust to scale, it's useful to have a clear understanding of the ways it can fail. For instance, my Oracle designs are robust to scale if the boxing assumptions hold, but fail if those assumptions fail. This gives you a clear understanding of where the problems might lie.

This essay makes a valuable contribution to the vocabulary we use to discuss and think about AI risk. Building a common vocabulary like this is very important for productive knowledge transmission and debate, and makes it easier to think clearly about the subject.

Rereading this post while thinking about the approximations that we make in alignment, two points jump at me:

  • I'm not convinced that robustness to relative scale is as fundamental as the other two, because there is no reason to expect that in general the subcomponents will be significantly different in power, especially in settings like adversarial training where both parts are trained with the same approach. That being said, I still agree that this is an interesting question to ask, and some proposals might indeed depend on a version of this.
  • Robustness to scaling up and robustness to scaling down sound like they can be summarized as "does it break in the limit of optimality?" and "does it only work in the limit of optimality?". The first gives us an approximation for studying and designing alignment proposals, and the second points out a potential issue with this approximation. (Not saying that this captures all of your meaning, though)

At the time I began writing this previous comment, I felt like I hadn't directly gotten that much use of this post. But then after reflecting a bit about Beyond Astronomical Waste I realized this had actually been a fairly important concept in some of my other thinking.

I've used the concepts in this post a lot when discussing various things related to AI Alignment. I think asking "how robust is this AI design to various ways of scaling up?" has become one of my go-to hammers for evaluating a lot of AI Alignment proposals, and I've gotten a lot of mileage out of that. 

This is, again, a short and sweet concept that cuts through a lot of confusions/disagreements I've had about AI in the past. Curated.

It seems there is a generalization of this: there are many properties of an AI system and its environment for which we would want alignment not to break when the value of the property changes.

The hard part is to figure out which properties are most useful to pay attention to. Here are a few:

  • Capability (as discussed in the OP)
    • Could be something very specific like the speed of compute
    • Pseudo-Cartesianness (a system might be effectively Cartesian at a certain level of capability, before it figures out how to circumvent some constraints we put on it)
  • Alignment (ideally we would like to detect and shut down if the agent becomes misaligned)
  • Complexity of the system (maybe your alignment scheme rests on being able to understand the world model of the system, in which case it might stop working as we move from modeling toy worlds to modeling the real world)
  • etc.

I think it can be useful to consider the case where alignment breaks as you decrease capabilities. For example, you might try to construct a minimal set of assumptions under which you would know how to solve alignment. One such assumption might be having an arbitrary amount of compute and memory available that can execute any halting program arbitrarily fast. If we then remove this assumption, alignment might break. It's pretty easy to see how alignment could break in this case, but it seems useful to have the concept of the generalized version.

Besides potential solutions that are oriented towards being robust to scale, I would like to emphasise that there are also failure modes that are robust to scale, that is, problems which do not go away as you scale up the resources:
Fundamental limits to computation due to fundamental limits to attention-like processes:
https://medium.com/threelaws/definition-of-self-deception-in-the-context-of-robot-safety-721061449f7

Hello Scott! You might be interested in my proposals for AI goal structures that are designed to be robust to scale:

Using homeostasis-based goal structures:

https://medium.com/threelaws/making-ai-less-dangerous-2742e29797bd

and

Permissions-then-goals based AI user “interfaces” + legal accountability:

https://medium.com/threelaws/first-law-of-robotics-and-a-possible-definition-of-robot-safety-419bc41a1ffe

Scaling down is an interesting challenge. Consider what happens in natural intelligent agents like humans and dogs: to the extent that dogs are scaled-down humans, dogs reliably make mistakes about human values that we might not consider graceful failures. For example, a guard dog might bite an intruder it doesn't recognize, whereas a human would notice that this "intruder" is wearing a police uniform and would not want to attack them. The usual solution is either to train the dog about police or to restrain the dog so that it can't cause harm until a human approves its actions, such as by putting it on a leash. In AI this might mean containment for scaled-down AI by more powerful systems (either humans or more powerful AI) that verify its actions before it takes them.
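
A minimal sketch of that "leash" pattern, assuming an entirely made-up interface (none of these names come from an actual library): the scaled-down agent only proposes actions, and nothing executes until a more trusted overseer signs off.

```python
from typing import Callable, Optional

class LeashedAgent:
    """Wraps an untrusted policy so every proposed action must be approved."""

    def __init__(self,
                 propose: Callable[[str], str],
                 overseer_approves: Callable[[str, str], bool]):
        self.propose = propose                      # the scaled-down agent's policy
        self.overseer_approves = overseer_approves  # the containment check

    def act(self, observation: str) -> Optional[str]:
        action = self.propose(observation)
        if self.overseer_approves(observation, action):
            return action                           # executed only if approved
        return None                                 # blocked: fail gracefully

# Toy usage: the 'guard dog' policy treats every stranger as an intruder;
# the overseer vetoes attacks on anyone in uniform.
dog_policy = lambda obs: "attack" if "stranger" in obs else "wait"
overseer = lambda obs, action: not (action == "attack" and "uniform" in obs)

agent = LeashedAgent(dog_policy, overseer)
print(agent.act("stranger in a police uniform at the gate"))  # None (blocked)
print(agent.act("stranger climbing the back fence"))          # "attack" (approved)
print(agent.act("family member at the gate"))                 # "wait"
```

The design choice is that the overseer sees both the observation and the proposed action, so the check can depend on context rather than on the action alone.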

Counterpoint - let's say you have a proposal that is safer than the frontier in some respects, but doesn't always generalize / scale. I imagine it would be better to submit the proposal anyway, while highlighting failure modes (scaling included). Submitting imperfect proposals in this way allows the community to poke holes, fix problematic aspects, and perhaps even discover new angles of attack while discussing the shortcomings.

dxu:
The purpose of this post is to communicate, not to persuade. It may be that we want to bit [sic] the bullet of the strongest form of robustness to scale, and build an AGI that is simply not robust to scale, but if we do, we should at least realize that we are doing that.

While this may not be what is happening here, in general I think that when an author opens with "I acknowledge the following criticisms to my argument" they make it unfairly socially unacceptable to respond with "Yeah, but the criticisms." I think this is a bad discourse norm, and people should give that response more often.

I was imagining someone with a bright yet flawed idea reading this post, realizing their idea doesn't scale, and ending up scrapping something redeemable that people with more expertise could have steelmanned. I'm not presuming that Scott was advocating a "totally scalable or STFU" criterion, but I wanted to put that consideration out there.