Again analogizing from the definition in “Risks From Learned Optimization”, “corrigible alignment” would be developing a motivation along the lines of “whatever my subcortex is trying to reward me for, that is what I want!” Maybe the closest thing to that is hedonism? Well, I don’t think we want AGIs with that kind of corrigible alignment, for reasons discussed below.
At first this claim seemed kind of wild, but there's a version of it I agree with.
It seems like, conditional on the inner optimizer being corrigible, in the sense of having a goal that's a pointer to some optimizer "outside" it, it's underspecified what it should point to. In the evolution -> humans -> gradient descent -> model example, corrigibility as defined in RLO could mean that the model is optimizing for the goals of evolution, the humans, or the gradient. This doesn't seem to differ between the RLO and steered-optimizer stories.
I think the analogy that makes hedonism the human version of corrigible alignment assumes that a corrigibly aligned optimizer's goal would point to the thing immediately upstream of its reward. That's not obvious to me. Wireheading / manipulating reward signals does seem like a potential problem, but it's just a special case of not being able to steer an inner optimizer even conditional on it having a narrow corrigibility property.
I dunno, the productivity hacks thing sounds pretty bad.
But yeah, doing better seems to be held up by the fact that we don't yet have a coherent way to describe the standards for doing better, when the human isn't an idealized sort of agent. Trying to steer the agent towards thinking of its goal as "do what the programmers want" is essentially talking about a machine-learning method of trying to find this description.
I dunno, the productivity hacks thing sounds pretty bad.
Well, we ought to be able to either figure out how to use this kind of system safely, or prove it's impossible. Either would be valuable. :-)
I don't think it's obviously impossible though. In particular, with the right motivation, it won't be motivated to undermine the steering signals. And also, the subcortex can be a slightly-less powerful AI, assisted by intrusive interpretability tools, multiple copies running faster, etc.
But yeah, doing better seems to be held up by the fact that we don't yet have a coherent way to describe the standards for doing better, when the human isn't an idealized sort of agent...
Yeah, I struggle with that too. Maybe an alternative (or at least starting point) would be to try to solve the challenge of building a question-answering oracle that has no motivation to lie or manipulate or escape its box, etc. I think that is a goal I can fully understand, although maybe I just haven't thought about it carefully enough to find the edge cases. :-)
A steered optimizer has an incentive to remove all steering control as fast as possible. A learned, static mesa-optimizer (from the search-over-algorithms scenario) seems to be in less of a hurry to make its treacherous turn. Perhaps this means steered optimizers are more likely to clumsily attempt to wrest control before they're strong enough?
But as a (human) steered optimizer, the way I relate to my steering is as my true values. Like, I might think that some crazy edge case sounds great (endlessly eating a hypercake in an endless forest of more and more interesting plants), but I always reserve some probability mass for in fact finding it empty and meaningless and not what I value (which is presumably what just-in-time steering feels like).
Thanks for the comment!
A steered optimizer has an incentive to remove all steering control...
Well, not necessarily. We could steer it into a motivational system in which it happily accepts steering signals, hopefully, right?
...as fast as possible... Perhaps this means steered optimizers are...likely to clumsily attempt to wrest control before they're strong enough?
That would be nice! One situation where it might fail is that it takes a while for the system to develop an understanding of its situation, and by the time it understands what the steering signals are and how they work, it is already competent enough to skillfully plan around them. More generally, I have low confidence about the relative difficulties and learning curves of a future AGI, and don't want to rely on anything like that, even if it seems intuitively probable.
After thinking about it for a minute, it's not obvious to me whether mesa-optimizers vs steered optimizers are better or worse on likelihood of clumsy failed attempts at treacherous turns...
Like, I might think that some crazy edge case sounds great (endlessly eating a hypercake in an endless forest of more and more interesting plants), but I always reserve some probability mass for in fact finding it empty and meaningless and not what I value
What if the hypercake was laced with a special nanobot that would travel around your brain and deactivate the "this is empty and meaningless" gut feeling and replace it with a "this is deeply fulfilling" feeling? Would you eat it then?
For me, for some of my goals, I feel a strong pull of goal preservation—like, I would commit today to a vow that, if "making the world a better place" ceased to feel fulfilling for me, and started to feel empty and pointless, I will alter my brain however necessary to make "making the world a better place" feel fulfilling again. Other goals I don't feel like I need to preserve: for example, I enjoy chocolate today, but I am not particularly disturbed by the thought that I might stop enjoying chocolate someday in the future, and start enjoying something else instead. I think the difference is that outward-facing goals are in the first category, and goals that mainly impact myself are in the second category. Or maybe "socially-praiseworthy goals important to my self-image" are in the first category. Or something else. I don't know... :-)
We could steer it into a motivational system in which it happily accepts steering signals, hopefully, right?
That's true. I should have said "a misaligned steered optimizer".
don't want to rely on [things like AGI learning curves], even if it seems intuitively probable.
Strongly agree
What if the hypercake was laced with a special nanobot that would travel around your brain and deactivate the "this is empty and meaningless" gut feeling and replace it with a "this is deeply fulfilling" feeling? Would you eat it then?
Indeed not! I'm not sure if this is obvious (because the example was not excellently chosen), but I meant to suggest something like "if I had to choose my best guess at a thing that would be selfishly good for me in the future, I would care more about my actual experience of it (and subcortically-generated reward) than my guess of what I would feel now".
I think the difference is outward-facing goals are in the first category, and goals that mainly impact myself are in the second category
That was my first guess when reading your "making the world a better place" example. But I don't think it quite works. If I have an outward-facing goal of ensuring more people enter long-lasting meaningful relationships, I want that goal to be able to shift in the face of data from reality. But perhaps my imagination is misfiring because that's not actually a very important goal to me.
Abstract:
The paper Risks from Learned Optimization introduced the term "inner alignment" in the context of a specific class of scenarios, namely a "base optimizer" which searches over a space of "inner" algorithms. If the inner algorithm is an optimizer, it's called a "mesa-optimizer", and if its objective differs from the base objective, that mismatch is an "inner alignment" problem. In this post I want to plead for us to also keep in mind a different class of scenarios, which I'll call "Steered Optimizers", and which also has an "inner alignment" problem. The inner alignment problem for mesa-optimizers is directly analogous to the inner alignment problem for steered optimizers, but the specific failure modes, risk factors, and solutions are all somewhat different. I'll argue that our future AGIs are at least as likely to be steered optimizers as mesa-optimizers, so we should keep both scenarios in mind.
Introduction
I recently wrote a post about brain algorithms with "inner alignment" in the title, but I was talking about something kinda different from the famous Risks from Learned Optimization paper that I was implicitly referring to. I didn't directly explain why I felt entitled to use the term "inner alignment" for this different situation, but I think it's worth going into, especially because this situation describes a more general approach to making AGI, one that goes beyond brain-inspired approaches.
(Terminology note: Following “Risks From Learned Optimization”, I will use the term "optimizer" in this post to mean an algorithm which uses foresight / planning to search over possible actions in pursuit of a particular goal, a.k.a. a "selection"-type optimizer. I want humans to count as “optimizers”, so I will also allow “optimizers” to sometimes choose actions for other reasons, and to maybe have inconsistent, context-dependent goals, as long as they at least sometimes use foresight / planning to pursue goals.)
Let's start with two scenarios in which we might create highly intelligent AI "optimizers":
1. Search-over-algorithms scenario: (this is the one from Risks from Learned Optimization). Here, you have a "base optimizer" which searches over a space of possible algorithms for an algorithm which performs very well according to a "base objective". For example, the base optimizer might be doing gradient descent on the weights of an RNN (large enough RNNs are Turing-complete!). Anyway, if the base optimizer settles on an inner algorithm which is itself an optimizer, then we call it a “mesa-optimizer”. Inner alignment is alignment between the mesa-optimizer’s objective and the base objective, while outer alignment is alignment between the base objective and what the programmer wants.
2. Steered Optimizer scenario: (this is how I think the human brain works, more or less, see my post "Inner alignment in the brain"). Here, you also have a two-layer structure, but the layers are different. The inner layer is an algorithm that does online learning, world-modeling, planning, and acting, and it is an optimizer. We wrote the inner-layer algorithm ourselves, and it is never modified or reset (the whole scenario is just one “episode”, in RL terms). But as the inner algorithm learns more and more, it becomes increasingly powerful, and increasingly difficult for us to understand—like comparing a newborn brain to an adult brain, where the latter carries a lifetime of experience and ideas. Meanwhile, the base layer watches the inner layer in real time, and tries to "steer" it towards optimizing the right target, using hooks that we had built into the inner layer’s source code. How does that steering work? In the simplest case, the base layer can be a reward function calculator, and it sends the reward information to the inner layer, which in turn tries to build a predictive model of the correlates of that reward signal, set them as a goal, and make foresighted plans to maximize them. (There could be other steering methods too—see below.) As in the other scenario, inner alignment is alignment between the inner layer’s objective(s) and the formula used by the base layer to compute rewards, while outer alignment is alignment between the latter and what the programmer wants.
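To make the division of labor concrete, here is a minimal toy sketch of that steering loop. It is entirely my own illustration: the three-option "environment", the class names, and the running-average learning rule are all assumptions, not part of any real system. The base layer only computes a reward from what it observes; the inner layer, whose code is never modified from outside, builds a model of what correlates with reward and then (trivially) plans to seek it out.

```python
# Toy sketch of a steered optimizer (illustrative assumptions throughout).

def base_layer_reward(observation):
    # Hypothetical hard-coded reward function: the base layer likes "a".
    return 1.0 if observation == "a" else 0.0

class InnerOptimizer:
    def __init__(self):
        self.value_estimates = {}  # learned model of the correlates of reward

    def update(self, observation, reward):
        # Online learning: running average of reward per observed option.
        old = self.value_estimates.get(observation, 0.0)
        self.value_estimates[observation] = old + 0.5 * (reward - old)

    def act(self, options):
        # "Planning" in miniature: pick whichever option the learned
        # reward model predicts is best.
        return max(options, key=lambda o: self.value_estimates.get(o, 0.0))

inner = InnerOptimizer()
options = ["a", "b", "c"]
for step in range(20):
    # Briefly try each option once, then exploit the learned model.
    obs = inner.act(options) if step > 2 else options[step]
    inner.update(obs, base_layer_reward(obs))  # steering signal flows inward
```

After a few steps the inner layer settles on pursuing "a"—the base layer has successfully steered it, without ever editing the inner layer's code.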
Here’s a little comparison table:
Property | “Search Over Algorithms” scenario | “Steered Optimizer” scenario
By the way, these two scenarios are not the only two possibilities, nor are they mutually exclusive. The obvious example for “not mutually exclusive” is the human brain, which fits nicely into both categories—the subcortex steers the neocortex (more on which below), and meanwhile evolution is a search-over-algorithms-type base optimizer for the whole brain.
Why might we expect AI researchers to build steered optimizers, rather than searches-over-algorithms?
(Update: I later massively elaborated this section into the post Against Evolution As An Analogy For How Humans Will Build AGI.)
Incidentally, if we’re writing the inner algorithm ourselves, why not just put the goal into the source code directly? Well, that would be awesome ... But it may not be possible! I think the easiest way to build the inner algorithm is to have it build a world-model from scratch, more-or-less by finding patterns in the input, and patterns in the patterns, etc. So if you want the AGI to have a goal of maximizing paperclips, we face the problem that there is no “paperclips” concept in its source code; it has to run for a while before forming that concept. That’s why we might instead build an AGI by letting it start learning and acting, and trying to steer it as it goes.
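A toy sketch of that obstacle, with hypothetical names of my own invention: the programmer can't write the goal into the source code, because the concept the goal refers to doesn't exist until the world-model has been learned.

```python
# Illustrative sketch only: why "just put the goal in the source code" fails.

class WorldModel:
    def __init__(self):
        self.concepts = {}  # empty at startup: no "paperclips" concept here

    def learn(self, name, pattern):
        # Concepts are formed from experience, not written in source code.
        self.concepts[name] = pattern

wm = WorldModel()
# goal = wm.concepts["paperclips"]   # would raise KeyError at startup
wm.learn("paperclips", {"shape": "bent wire", "use": "hold paper"})
goal = wm.concepts["paperclips"]     # only possible after learning
```

So the goal has to end up as a pointer into a learned world-model—which is exactly why steering-as-it-goes is the available lever.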
How might one steer an AGI steered optimizer?
Lessons from being a human
If the human neocortex is a steered optimizer, what can we learn from that example?
1. How does it feel to be steered?
You try a new food, and find it tastes amazing! This wonderful feeling is your subcortex sending a steering signal up to your neocortex. All of a sudden, a new goal has been installed in your mind: eat this food again! This is not your only goal in life, of course, but it is a goal, and you might use your intelligence to construct elaborate plans in pursuit of it, like shopping at a different grocery store so you can buy that food again.
It’s a bit creepy, if you think about it!
“You thought you had a solid identity? Ha!! Fool, you are a puppet! If your neocortex gets dopamine at the right times, all of a sudden you would want entirely different things out of life!”
2. What does Inner Alignment failure look like in humans?
A prototypical inner alignment failure: we know that some situation would lead the subcortex to install a certain goal in our minds, we don't want to have that goal (according to our current goals), and so we avoid that situation.
Imagine, for example, not trying a drug because you’re afraid of getting addicted.
To make that analogy explicit, you could imagine that our brain was designed by an all-powerful alien who wanted us to take the drug, and therefore set up our brain with a system that recognizes the chemical signature of that drug, and installs that drug as a new goal when that chemical signature is detected. At first glance, that’s not a bad design for a steering mechanism, and indeed it works sometimes. But we can undermine the alien's intentions by understanding how that steering mechanism works, and thus avoiding the drug.
A more prosaic example: practically every “productivity hack” is a scheme to manipulate our own future subcortical steering signals.
3. What would corrigible alignment look like in humans?
Again analogizing from the definition in “Risks From Learned Optimization”, “corrigible alignment” would be developing a motivation along the lines of “whatever my subcortex is trying to reward me for, that is what I want!” Maybe the closest thing to that is hedonism? Well, I don’t think we want AGIs with that kind of corrigible alignment, for reasons discussed below.
More random thoughts on steering
Related work
Deep RL from Human Preferences and Scalable Agent Alignment Via Reward Modeling both bring up the idea of taking reward signals, trying to understand those signals in the form of a predictive model, and then using that reward model as a target for training an agent (if I understand everything correctly). (This is not the only idea in the papers, and in most respects the papers are more like search-over-algorithms.) Anyway, that specific idea has parallels with how a steered optimizer would try to relate its reward signals to meaningful, predictive concepts in its world-model. In this post I’m trying to emphasize that reward-modeling part, and generalize it to other ways of steering agents. Also, unlike those papers, I prefer to merge the reward-modeling task and the choosing-actions task into a single model, because their requirements seem to heavily overlap, at least in the case of a powerful, world-modeling, optimizing agent. For example, the reward-modeling part needs to look at a bunch of reward signals and figure out that they correspond to the goal “maximize paperclips”; while the choosing-actions part needs to take the goal “maximize paperclips” and figure out how to actually do it. These two parts require much the same world-modeling capabilities, and indeed I don’t see how it would work except by having both parts actually referencing the same world-model.
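Here is a minimal toy sketch of that merged design. It is entirely my own illustration—the class and method names are hypothetical and nothing like either paper's actual architecture—but it shows the structural point: the reward-modeling step and the action-selection step both read and write the same concept store.

```python
# Illustrative sketch: reward modeling and action selection sharing one world-model.

class SharedWorldModel:
    def __init__(self):
        self.concept_values = {}  # concept -> inferred relevance to reward

    def reward_modeling(self, concept, reward):
        # Interpret raw reward signals as evidence about which concept
        # the rewards are "pointing at".
        old = self.concept_values.get(concept, 0.0)
        self.concept_values[concept] = old + reward

    def choose_action(self, actions):
        # Action selection consults the very same concept store.
        return max(actions, key=lambda a: self.concept_values.get(a["promotes"], 0.0))

wm = SharedWorldModel()
wm.reward_modeling("paperclips", 1.0)  # rewards suggest "maximize paperclips"
wm.reward_modeling("staples", 0.2)
actions = [{"name": "buy wire", "promotes": "paperclips"},
           {"name": "buy staples", "promotes": "staples"}]
best = wm.choose_action(actions)  # picks the action promoting the rewarded concept
```

Splitting these into two models would mean maintaining two copies of the same conceptual machinery, which is the overlap the paragraph above is pointing at.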
(I'm sure there's other related work too, that’s just what jumped to my mind.)
(thanks Evan Hubinger for comments on a draft.)