For instance, after reading some of Eliezer and Nate’s recent writing, I now think it’s probably a better idea to use corrigibility as an AI alignment target (assuming corrigibility turns out to be a coherent thing at all), as opposed to directly targeting human values
I'd be interested in you elaborating on this update. Specifically, how do you expect these approaches to differ in terms of mechanical interventions in the AGI's structure/training story, and what advantages one has over the other?
I've actually updated toward a target which could arguably be called either corrigibility or human values, and is different from previous framings I've seen of either. But I didn't want to get into that in the post; probably I'll have another post on it at some point. Some brief points to gesture in the general direction:
The "hard problem of corrigibility" is interesting because of the possibility that it has a relatively simple core or central principle - rather than being value-laden on the details of exactly what humans value, there may be some compact core of corrigibility that would be the same if aliens were trying to build a corrigible AI, or if an AI were trying to build another AI.
Are you familiar with mutual information maximizing interfaces (MIMI)?
The key trick is simply to maximize the sum of mutualinfo(user command, agent action) + mutualinfo(agent action, environment change) as the only reward function for a small model.
In my current view [spoilers for my judgement, so you can form your own first if you like]:
It's one of the most promising corrigibility papers I've seen. It still has some problems from the perspective of corrigibility, but it seems to have the interesting effect of making a reinforcement learner desperately want to be instructed. There are probably still very severe catastrophic failures hiding in slivers of plan space that would make a MIMI superplanner dangerous (e.g., at approximately human levels of optimization, it might try to force you to spend time with it and give it instructions?). It also has other limitations: I don't think it works to train a model any way other than from scratch, it doesn't solve multi-agent settings, and it doesn't solve interpretability (though it might combine really well with runtime interpretability visualizations, doubly so if the visualizations are mechanistically exact), so over-optimization would still break it. And it only works when a user is actively controlling it. But it seems much less prone to failure than vanilla RLHF, because it produces an agent that, if I understand correctly, stops moving when the user stops moving (to first approximation).
It seems to satisfy the mathematical simplicity you're asking for. I'm likely going to attempt follow-up research: I want to figure out whether there's a way to do something similar with much bigger models, i.e., in the 1M to 6B parameter range, and I want to see how weird the behavior is when you try to steer a pretrained model with it. A friend is advising me casually on the project.
It seems to me that there are some key desiderata for corrigibility that it doesn't satisfy: in particular, it isn't terribly desperate to explain itself to you; it just wants your commands to come at what seem like sensible times, such that you have control of its action timings in order to control the environment. But it makes your feedback much denser through the training process, and it produces a model that, if I understand correctly, gets bored without instruction. It also seems like, with some tweaking, it might be a good model of what makes human relationships satisfying, which is a key tell I look for.
Very curious to hear your thoughts.
That would be a very poetic way to die: an AI desperately pulling every bit of info it can out of a human, and dumping that info into the environment. They do say that humanity's death becomes more gruesome and dystopian the closer the proposal is to working, and that does sound decidedly gruesome and dystopian.
Anyway, more concretely, the problem which jumps out to me is that maximizing mutualinfo(user command, agent action) + mutualinfo(agent action, environment change) just means that all the info from the command routes through the action and into the environment in some way; the semantics or intent of the command need not have anything at all to do with the resulting environmental change. Like, maybe there's a prompt on my screen which says "would you like the lightswitch on (1) or off (0)?", and I enter "1", and then the AI responds by placing a coin heads-side-up. There's no requirement that my one bit actually needs to be encoded into the environment in a way which has anything to do with the lightswitch.
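To make that concrete with a toy numerical example (all numbers made up): mutual information is invariant to relabeling the action variable, so nothing in that term ties the meaning of my "1" to the lightswitch rather than the coin. A minimal sketch:

```python
# Toy illustration: mutual information is invariant to relabeling the action variable,
# so this term of the objective cannot distinguish "flip the lightswitch" from
# "place a coin heads-up" as the encoding of my command. Numbers are made up.
import numpy as np

def mutual_info(joint_counts):
    """I(A; B) in nats, from a table of joint counts over (a, b)."""
    p = joint_counts / joint_counts.sum()
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.nansum(p * np.log(p / (pa * pb)))

# Rows: my command (0 = "off", 1 = "on"). Columns: the agent's action.
# Interpretation A: action 0 = switch off, action 1 = switch on.
joint = np.array([[45.0, 5.0],
                  [5.0, 45.0]])
# Interpretation B: the agent instead encodes my bit into coin placement,
# i.e. the same statistics with the action labels swapped.
swapped = joint[:, ::-1]

print(mutual_info(joint), mutual_info(swapped))  # identical values
```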
When I sent him the link to this comment, he replied:
Ah, I think you forgot the first term in the MIMI objective, I(s_t; x_t), which makes the mapping intuitive by maximizing information flow from the environment into the user. What you proposed was similar to optimizing only the second term, I(x_t; s_{t+1} | s_t), which would indeed suffer from the problems that John mentions in his reply.
My imprecision may have misled you :)
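For concreteness, here is a minimal toy sketch of estimating that corrected two-term objective, I(s_t; x_t) + I(x_t; s_{t+1} | s_t), from count tables over small discrete spaces (all names and numbers are illustrative; a real implementation would presumably use differentiable MI estimators rather than counts):

```python
# A minimal toy sketch of the two-term objective I(s_t; x_t) + I(x_t; s_{t+1} | s_t),
# estimated from empirical counts over small discrete spaces. Shapes and numbers are
# illustrative only.
import numpy as np

def mutual_info(joint):
    """I(A; B) in nats from a joint probability (or count) table over (a, b)."""
    p = joint / joint.sum()
    pa, pb = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.nansum(p * np.log(p / (pa * pb)))

def conditional_mutual_info(joint_abc):
    """I(A; B | C) = sum_c p(c) * I(A; B | C=c), from a joint table over (a, b, c)."""
    p = joint_abc / joint_abc.sum()
    pc = p.sum(axis=(0, 1))
    return sum(pc[c] * mutual_info(p[:, :, c])
               for c in range(p.shape[2]) if pc[c] > 0)

rng = np.random.default_rng(0)
# Hypothetical empirical counts gathered from rollouts:
#   counts_sx[s_t, x_t]            -- current environment state vs. user command
#   counts_xs1s[x_t, s_next, s_t]  -- user command vs. next state, per current state
counts_sx = np.array([[8.0, 2.0], [1.0, 9.0]])
counts_xs1s = rng.integers(1, 10, size=(2, 2, 2)).astype(float)

objective = mutual_info(counts_sx) + conditional_mutual_info(counts_xs1s)
print(f"estimated objective: {objective:.3f} nats")
```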
I've actually updated toward a target which could arguably be called either corrigibility or human values, and is different from previous framings I've seen of either
Right, in my view the line between them is blurry as well. One distinction that makes sense to me is:

- "Corrigibility" means making the AGI care about human values through the intermediary of humans — making it terminally care about "what agents with the designation 'human' care about". (Or maybe "what my creator cares about".)
- Directly targeting human values ("value learning") means pointing the AI at human values not through humans, but directly: making it value the same things that humans value not because it knows that's what humans value, but just because.
Put like this, yup, "corrigibility" seems like a better target to aim for. In particular, it's compact and, as you point out, convergent — "an agent" and "this agent's goals" are likely much easier to express than the whole suite of human values, and would be easier for us to locate in the AGI's ontology (e. g., we should be able to side-step a lot of philosophical headaches, outsource them to the AI itself).
In that sense, "strawberry alignment", in MIRI's parlance, is indeed easier than "eudaimonia alignment".
However...
insofar as humans value corrigibility (or particular aspects of corrigibility), the same challenges of expressing corrigibility mathematically also need to be solved in order to target values
I've been pretty confused about why MIRI thought that corrigibility is easier, and this is exactly why. Imparting corrigibility still requires making the AI care about some very specific conceptions humans have about how their commands should be executed, e. g. "don't optimize for this too hard" and other DWIMs. But if we can do that, if it understands us well enough to figure out all the subtle implications in our orders, then why can't we just tell it to "build a utopia" and expect that to go well? It seems like a strawberry-aligned AI should interpret that order faithfully as well... Which is a view that Nate/Eliezer seem not to outright rule out; they talk about "short reflection" sometimes.
But other times "corrigibility" seems to mean a grab-bag of tricks for essentially upper-bounding the damage an AI can inflict, presumably followed by a pivotal act (with a large amount of collateral damage) via this system and then long reflection. On that model, there's also a meaningful presumed difficulty difference between strawberry alignment and eudaimonia alignment: the former doesn't require us to be very good at retargeting the AGI at all. But it also seems obviously doomed to me, and not necessarily easier (inasmuch as this flavor of "corrigibility" doesn't seem like a natural concept at all).
My reading is that you endorse the former type of corrigibility as well, not the latter?
My reading is that you endorse the former type of corrigibility as well, not the latter?
Yes. I also had the "grab-bag of tricks" impression from MIRI's previous work on the topic, since it was mostly just trying various hacks, and that was also part of why I mostly ignored it. The notion that there's a True Name to be found here, we're not just trying hacks, is a big part of why I now have hope for corrigibility.
"Corrigibility" means making the AGI care about human values through the intermediary of humans — making it terminally care about "what agents with the designation 'human' care about". (Or maybe "what my creator cares about",
Interesting - that is actually what I've considered to be proper 'value learning': correctly locating and pointing to humans and their values in the agent's learned world model, in a way that naturally survives/updates correctly with world model ontology updates. The agent then has a natural intrinsic motivation to further improve its own understanding of human values (and thus its utility function) simply through the normal curiosity drive for value of information improvement to its world model.
I wasn't making a definitive statement on what I think people mean when they say "corrigibility", to be clear. The point I was making is that any implementation of corrigibility that I think is worth trying for necessarily has the "faithfulness" component — i. e., the AI would have to interpret its values/tasks/orders the way they were intended by the order-giver, instead of some other way. Which, in turn, likely requires somehow making it locate humans in its world-model (though likely implemented as "locate the model of [whoever is giving me the order]" in the AI's utility function, not necessarily referring to [humans] specifically).
And building off that definition, if "value learning" is supposed to mean something different, then I'd define it as pointing at human values not through humans, but directly. I. e., making the AI value the same things that humans value not because it knows that it's what humans value, but just because.
Again, I don't necessarily think that it's what most people mean by these terms most times — I would natively view both approaches to this as something like "value learning" as well. But this discussion started from John (1) differentiating between them, and (2) viewing both approaches as viable. This is just how I'd carve it under these two constraints.
I really like this post; it has been very influential on how I think about plans and what to work on. I do think it's a bit vague though, and lacking a certain kind of general formulation. It might be better if there were more examples listed where the technique could be used.
While I agree with the practical idea, conveyed in this post, I think the language of the post is inconsistent, and filling it with alignment slang makes the post much less understandable. To demonstrate this, I'll reformulate the post in the ontology of Active Inference.
Our doctor-to-be does not treat the plan primarily as a prediction about the world; they treat it as a way to make the world be.
In Active Inference, a “prediction” and a “way to make the world be” point to the same thing. Besides, a terminological clarification: in Active Inference, a plan (also called a policy) is a sequence of actions performed by the agent, interleaved (in the discrete-time formulation) with predicted future states: $\pi = (a_t, s_{t+1}, a_{t+1}, s_{t+2}, \dots, a_T, s_{T+1})$. It is not merely a prediction, which is a generative model over states alone (in Active Inference, actions are ontologically distinct from states): $P(s_{t+1}, s_{t+2}, \dots, s_{T+1})$.
Now, when looking for those robust bottlenecks, I’d probably need to come up with a plan. Multiple plans, in fact. The point of “robust bottlenecks” is that they’re bottlenecks to many plans, after all. But those plans would not be optimization targets. I don’t treat the plans as ways to make the world be. Rather, the plans are predictions about how things might go. My “mainline plan”, if I have one, is not the thing I’m optimizing to make happen; rather, it’s my modal expectation for how I expect things to go (conditional on my efforts).
This is exactly what is described by Active Inference. In a simplified, discrete-time formulation, an Active Inference agent endlessly performs the following loop:

1. Receive an observation from the environment.
2. Update its beliefs about the current (hidden) state of the world in light of that observation.
3. Construct a space of candidate plans (policies) and evaluate each one by its expected free energy, which scores how well the states the plan is predicted to bring about match the agent's prior preferences (and how much uncertainty it resolves).
4. Select a plan (e.g., by sampling from a softmax over the negative expected free energies) and execute its first action.
5. Go back to step 1.
A complete, formal mathematical description of this loop is given in section 4.4 of Chapter 4 of Active Inference (the chapter can be downloaded for free from the linked page). In an actual implementation of an Active Inference agent, Fountas et al. used Monte-Carlo tree search as the heuristic for step 3 of the above loop.
Note that in the above loop, the plan is chosen anew (and the whole space of plans is re-constructed) on every iteration. This, again, is an abstract ideal: it would be horribly inefficient to discard all plans on every step rather than incrementally update them. Nevertheless, this principle is captured in the adage “plans are worthless, but planning is everything”. And it seems to me that John Wentworth is trying to convey the same idea in the passage quoted above.
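To make the loop concrete, here is a heavily simplified toy sketch for a discrete world with a known transition model (all names and numbers are illustrative; real implementations such as Fountas et al.'s also handle observations, hidden-state inference, and learning, and search over plans with MCTS rather than brute-force enumeration):

```python
# A toy, heavily simplified sketch of the plan-selection loop described above, for a
# discrete world with a known transition model. Only the "risk" part of expected free
# energy is computed; everything here is an illustrative stand-in.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, HORIZON = 4, 2, 3
# Known transition model P(s' | s, a) (illustrative random model).
T = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))
# Prior preferences over states: a distribution whose density encodes preference.
preferences = np.array([0.05, 0.05, 0.1, 0.8])

def expected_free_energy(plan, belief):
    """Risk term only: KL(predicted states || preferred states), summed over the plan."""
    G = 0.0
    for a in plan:
        belief = belief @ T[:, a, :]          # predicted state distribution after action a
        G += np.sum(belief * np.log(belief / preferences))
    return G

belief = np.array([1.0, 0.0, 0.0, 0.0])       # start fully confident in state 0
for step in range(5):
    # Re-construct and re-evaluate the whole space of plans on every iteration.
    plans = list(product(range(N_ACTIONS), repeat=HORIZON))
    G = np.array([expected_free_energy(p, belief) for p in plans])
    probs = np.exp(-G) / np.exp(-G).sum()     # softmax over negative expected free energy
    chosen = plans[rng.choice(len(plans), p=probs)]
    action = chosen[0]                        # execute only the first action, then re-plan
    belief = belief @ T[:, action, :]         # stand-in for inferring the new state
    print(f"step {step}: chose plan {chosen}, executed action {action}")
```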
However, considering the above, the title “Plans Are Predictions, Not Optimization Targets” and the suggestion that “robust bottlenecks” should be “optimisation targets” don’t make sense. Plans, if successful, have some outcomes (or goals, also known as prior preferences in Active Inference: probability distributions over future world states, in which the probability density indicates the level of preference), and, similarly, “robust bottlenecks” also have outcomes (or themselves are outcomes; it’s not clear to me which terminological convention you use in the post in this regard). The ultimate goal of the plan (becoming a doctor, ensuring humanity’s value potential is not destroyed in the course of creating a superhuman AI) is still a goal. What you essentially suggest is frontloading the plan with achieving highly universal subgoals (“seeking options” in Nassim Taleb’s lingo). And the current title of the article could be reformulated as “Plans are not optimisation targets, but the outcomes of the plans are optimisation targets”, which doesn’t capture the idea of the post.
“On the path to your goal, identify and try to first achieve subgoals that can be most universally helpful on various different paths towards your goal” would be more accurate, IMO.
(This section is not only about this post I'm commenting on; I haven't decided yet whether to leave it as a comment or to turn it into a separate post.)
Despite considerable exposure to writing on AI alignment, I get confused every time I see the phrase “optimisation target”.
As I wrote in a comment to “Reward is not the optimization target”, the word “optimisation” is ambiguous because it's unclear which “optimisation” it refers to:
Reward probably won’t be a deep RL agent’s primary optimization target
The longer I look at this statement (and its shorter version "Reward is not the optimization target"), the less I understand what it's supposed to mean, considering that "optimisation" might refer to the agent's training process as well as the "test" process (even if they overlap or coincide). It looks to me like your idea can be stated more concretely as "the more intelligent/capable RL agents (either model-based or model-free) become in the process of training using the currently conventional training algorithms, the less they will be susceptible to wireheading, rather than actively seek it"?
Separately, the word "target" invokes a sense of teleology/goal-directedness, which, though perfectly fine in the context of this post, where people's "optimisation targets" are discussed, is confusing when the phrase "optimisation target" is applied to objects or processes which don't have any agency according to the models most people have in their heads, e. g., a DNN training episode.
Similarly, I think the phrase "optimise for X", such as "optimise for graduating", uses alignment slang completely unnecessarily. This phrase is exactly synonymous with "have a goal of graduating" or "tries to graduate", but the latter two don't carry the veil of secondary, alignment-related connotations which people might suppose the phrase "optimise for graduating" has, while, in fact, it doesn't.
My optimization targets are, instead, the robust bottlenecks.
When reality throws a brick through the plans, I want my optimization target to have still been a good target in hindsight. Thus robust bottlenecks: something which is still a bottleneck under lots of different assumptions is more likely to be a bottleneck even under assumptions we haven’t yet realized we should use. The more robust the bottleneck is, the more likely it will be robust to whatever surprise reality actually throws at us.
In practical advice and self-help, a subset of these “robust bottlenecks” are also simply called “basics”: meeting basic physical and psychological needs, keeping up one’s physical condition and energy, taking care of one’s mental health.
When you say "optimization target," it seems like you mean a single point in path-space that the planner aims for, where this point consists of several fixed landmarks along the path which don't adjust to changing circumstances. Such an optimization target could still have some wiggle room (i.e., consist of an entire distribution of possible sub-paths) between these landmarks, correct? So some level of uncertainty must be built into the plan regardless of whether you call it a prediction or an optimization target.
It seems to me that what you're advocating for is equivalent to generating an entire ensemble of optimization targets, each based on a different predictive model of how things will go. Then you break those targets up into their constituent landmarks and look for clusters of landmarks in goal-space from across the entire ensemble of paths. Would your "robust bottlenecks" then refer to the densest of these clusters?
Come to think of it, couldn't this be applied to model corrigibility itself?
Have an AI that's constantly coming up with predictive models of human preferences, generating an ensemble of plans for satisfying human preferences according to each model. Then break those plans into landmarks and look for clusters in goal-space.
Each cluster could then form a candidate basin of attraction of goals for the AI to pursue. The center of each basin would represent a "robust bottleneck" that would be helpful across predictive models; the breadth of each basin would account for the variance in landmark features; and the depth/attractiveness of each basin would be proportional to the number of predictive models that have landmarks in that cluster.
Ideally, the distribution of these basins would update continuously as each model in the ensemble becomes more predictive of human preferences (both stated and revealed) due to what the AGI learns as it interacts with humans in the real world. Plans should always be open to change in light of new information, including those of an AGI, so the landmarks and their clusters would necessarily shift around as well.
Assuming this is the right approach, the questions that remain would be how to structure those models of human preferences, how to measure their predictive performance, how to update the models on new information, how to use those models to generate plans, how to represent landmarks along plan paths in goal-space, how to convert a vector in goal-space into actionable behavior for the AI to pursue, etc., etc., etc. Okay, yeah, there would still be a lot of work left to do.
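If it helps make the proposal concrete, here is a toy sketch of the clustering step under made-up assumptions (a 2-D stand-in for goal-space, random landmark generators, arbitrary DBSCAN parameters); "depth" counts how many distinct predictive models contribute to a cluster and "breadth" is its spread:

```python
# A minimal illustrative sketch of the clustering idea above: landmarks from many
# model-specific plans are embedded as points in a toy 2-D "goal-space", clustered,
# and clusters ranked by how many distinct predictive models contribute to them.
# Everything here is a made-up stand-in, not a proposal for a real goal representation.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Each "predictive model" proposes a plan = a handful of landmark points in goal-space.
n_models, landmarks_per_plan = 8, 5
shared_bottlenecks = np.array([[0.0, 0.0], [5.0, 5.0]])   # landmarks many plans share

points, model_ids = [], []
for m in range(n_models):
    noisy_shared = shared_bottlenecks + rng.normal(0, 0.3, shared_bottlenecks.shape)
    idiosyncratic = rng.uniform(-10, 10, size=(landmarks_per_plan - 2, 2))
    for p in np.vstack([noisy_shared, idiosyncratic]):
        points.append(p)
        model_ids.append(m)
points, model_ids = np.array(points), np.array(model_ids)

labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(points)

for label in sorted(set(labels) - {-1}):
    members = labels == label
    depth = len(set(model_ids[members]))          # how many models hit this cluster
    breadth = points[members].std(axis=0).mean()  # spread of the cluster
    center = points[members].mean(axis=0)
    print(f"cluster {label}: center={center.round(2)}, depth={depth} models, breadth={breadth:.2f}")
```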
Funny meta: I'm reading this just after finishing your two sequences about Abstraction, which I find very exciting! But surprise, your plan changes! Did I read all that for nothing? Fortunately, I think it's mostly robust, indeed :)
I claim the right move is to target robust bottlenecks: look for subproblems which are bottlenecks to many different approaches/plans/paths, then tackle those subproblems.
This reminds me of Paul Graham's idea of flying upwind.
Suppose you're a college freshman deciding whether to major in math or economics. Well, math will give you more options: you can go into almost any field from math. If you major in math it will be easy to get into grad school in economics, but if you major in economics it will be hard to get into grad school in math.
Flying a glider is a good metaphor here. Because a glider doesn't have an engine, you can't fly into the wind without losing a lot of altitude. If you let yourself get far downwind of good places to land, your options narrow uncomfortably. As a rule you want to stay upwind. So I propose that as a replacement for "don't give up on your dreams." Stay upwind.
— What You'll Wish You'd Known
On the other hand... man, that high-schooler is gonna be shit outta luck if they decide that medicine isn't for them. Or if the classes or residency are too tough. Or if they fail the MCAT. Or .... Point is, reality has a way of throwing bricks through our plans, even when we're not operating in a preparadigmatic field.
This reminds me of a Paul Graham quote (but it's not the same thing):
A friend of mine who is a quite successful doctor complains constantly about her job. When people applying to medical school ask her for advice, she wants to shake them and yell "Don't do it!" (But she never does.) How did she get into this fix? In high school she already wanted to be a doctor. And she is so ambitious and determined that she overcame every obstacle along the way—including, unfortunately, not liking it.
— How to Do What You Love
Imagine a (United States) high-schooler who wants to be a doctor. Their obvious high-level plan to achieve that goal is:

- Graduate high school with grades good enough to get into a good college
- Graduate college (probably in pre-med), do well on the MCAT, and get into med school
- Graduate med school
- Complete a residency
- Practice medicine
Key thing to notice about that plan: the plan is mainly an optimization target. When in high school, our doctor-to-be optimizes for graduating and getting into college. In college, they optimize for graduating and getting into med school. Etc. Throughout, our doctor-to-be optimizes to make the plan happen. Our doctor-to-be does not treat the plan primarily as a prediction about the world; they treat it as a way to make the world be.
And that probably works great for people who definitely just want to be doctors.
Now imagine someone in 1940 who wants to build a solid-state electronic amplifier.
Building active solid-state electronic components in the early 1940’s is not like becoming a doctor. Nobody has done it before, nobody knows how to do it, nobody knows the minimal series of steps one must go through in order to solve it. At that time, solid-state electronics was a problem we did not understand; the field was preparadigmatic. There were some theories, but they didn’t work. The first concrete plans people attempted failed; implicit assumptions were wrong, but it wasn’t immediately obvious which implicit assumptions. One of the most confident predictions one might reasonably have made about solid-state electronics in 1940 was that there would be surprises; unknown unknowns were certainly lurking.
So, how should someone in 1940 who wants to build a solid-state amplifier go about planning?
I claim the right move is to target robust bottlenecks: look for subproblems which are bottlenecks to many different approaches/plans/paths, then tackle those subproblems. For instance, if I wanted to build a solid-state amplifier in 1940, I’d make sure I could build prototypes quickly (including with weird materials), and look for ways to visualize the fields, charge densities, and conductivity patterns produced. Whenever I saw “weird” results, I’d first figure out exactly which variables I needed to control to reproduce them, and of course measure everything I could (using those tools for visualizing fields, densities, etc). I’d also look for patterns among results, and look for models which unified lots of them.
Those are strategies which would be robustly useful for building solid-state amplifiers in many worlds, and likely directly address bottlenecks to progress in many worlds. In our particular world, they might have highlighted the importance of high-purity silicon and dopants, or of surfaces between materials with different electrical properties, both of which were key rate-limiting insights along the path to active solid-state electronics.
Now, when looking for those robust bottlenecks, I’d probably need to come up with a plan. Multiple plans, in fact. The point of “robust bottlenecks” is that they’re bottlenecks to many plans, after all. But those plans would not be optimization targets. I don’t treat the plans as ways to make the world be. Rather, the plans are predictions about how things might go. My “mainline plan”, if I have one, is not the thing I’m optimizing to make happen; rather, it’s my modal expectation for how I expect things to go (conditional on my efforts).
My optimization targets are, instead, the robust bottlenecks.
When reality throws a brick through the plans, I want my optimization target to have still been a good target in hindsight. Thus robust bottlenecks: something which is still a bottleneck under lots of different assumptions is more likely to be a bottleneck even under assumptions we haven’t yet realized we should use. The more robust the bottleneck is, the more likely it will be robust to whatever surprise reality actually throws at us.
In My Own Work
Late last year, I wrote The Plan - a post on my then-current plans for alignment research and the reasoning behind them. But I generally didn’t treat that plan as an optimization target; I treated it as a modal path. I chose my research priorities mostly by looking for robust bottlenecks. That’s why I poured so much effort into understanding abstraction: it’s a very robust bottleneck. (One unusually-externally-legible piece of evidence for robustness: if the bottleneck is robust, more people should converge on it over time as they progress on different agendas. And indeed, over the past ~year we saw both Paul Christiano and Scott Garrabrant converge on the general cluster of abstraction/ontology identification/etc.)
Since then, my views have updated in some ways. For instance, after reading some of Eliezer and Nate’s recent writing, I now think it’s probably a better idea to use corrigibility as an AI alignment target (assuming corrigibility turns out to be a coherent thing at all), as opposed to directly targeting human values. But The Plan was to target human values! Do I now need to ditch my whole research agenda?
No, because The Plan wasn’t my optimization target, it was my modal path. I was optimizing for robust bottlenecks. So now I have a different modal path, but it still converges on roughly-the-same robust bottlenecks. There have been some minor adjustments here and there - e.g. I care marginally less about convergent type signatures of values, and marginally more about convergent architectures/algorithms to optimize those values (though those two problems remain pretty tightly coupled). But the big picture research agenda still looks basically similar; the strategy still looks right in hindsight, even after a surprising-to-me update.
Back To The Doctor
At the start of this post, I said that the high-schooler who definitely wanted to be a doctor would probably do just fine treating the plan as an optimization target. The path to becoming a doctor is very standard, the steps are known. It's not like building a solid-state amplifier in 1940, or solving the AI alignment problem.
On the other hand... man, that high-schooler is gonna be shit outta luck if they decide that medicine isn't for them. Or if the classes or residency are too tough. Or if they fail the MCAT. Or .... Point is, reality has a way of throwing bricks through our plans, even when we're not operating in a preparadigmatic field.
And if the high-schooler targets robust bottlenecks rather than optimizing for one particular plan... well, they probably still end up in fine shape for becoming a doctor! (At least in worlds where the original plan was workable anyway - i.e. worlds where courses and the MCAT and whatnot aren't too tough.) Robust bottlenecks should still be bottlenecks for the doctor plan. The main difference is probably that our high-schooler ends up doing their undergrad in something more robustly useful than pre-med.
Summary
If we treat a single plan as our optimization target, we're in trouble when reality throws surprises at us. In preparadigmatic fields, where surprises and unknown unknowns are guaranteed, a better idea is to target robust bottlenecks: subproblems which are bottlenecks to a wide variety of plans. The more robust the bottleneck is to different assumptions, the more likely that it will still be a bottleneck under whatever conditions reality throws at us.
Even in better-understood areas, reality does have a tendency to surprise us, and it's usually not very costly to target robust bottlenecks rather than a particular plan anyway. After all, if the bottleneck is robust, it's probably a bottleneck in whatever particular plan we would have targeted.
So what's the point of a plan? A plan is no longer an optimization target, but instead a prediction - a particular possible path. Sometimes it's useful to discuss a "mainline plan", i.e. a modal path. And in general, we want to consider many possible plans, and look for subproblems which are bottlenecks to all of them.