Nitpick:
Evan Hubinger recently wrote a great FAQ on inner alignment terminology. We won't be talking about inner/outer alignment today, but I intend for my usage of "impact alignment" to map onto his "alignment"
This doesn't seem true. From Evan's post:
Alignment: An agent is aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic.
From your post:
Impact alignment: the AI’s actual impact is aligned with what we want. Deploying the AI actually makes good things happen.
"Bad things don't happen" and "good things happen" seem quite different, e.g. a rock is Evan-aligned but not Alex-impact-aligned. (Personally, I prefer "aligned" to be about "good things" rather than "not bad things", so I prefer your definition.)
Hmmm... this is a subtle distinction and both definitions seem pretty reasonable to me. I guess I feel like I want “good things happen” to be part of capabilities (e.g. is the model capable of doing the things we want it to do) rather than alignment, making (impact) alignment more about not doing stuff we don't want.
Wouldn't outcome-based "not doing bad things" impact alignment still run into that capabilities issue? "Not doing bad things" requires serious capabilities for some goals (e.g. sparse but initially achievable goals).
In any case, you can say "I think that implementing strong capabilities + strong intent alignment is a good instrumental strategy for impact alignment", which seems compatible with the distinction you seek?
"Bad things don't happen" and "good things happen" seem quite different, e.g. a rock is Evan-aligned but not Alex-impact-aligned.
To rephrase: Alex(/Critch)-impact-alignment is about (strictly) increasing value, non-obstruction is about non-strict value increase, and Evan-alignment is about not taking actions we would judge to significantly decrease value (making it more similar to non-obstruction, except wrt our expectations about the consequences of actions).
I'd also like to flag that Evan's definition involves (hypothetical) humans evaluating the actions, while my definition involves evaluating the outcomes. Whenever we're reasoning about non-trivial scenarios using my definition, though, it probably doesn't matter. That's because we would have to reason using our beliefs about the consequences of different kinds of actions.
However, the different perspectives might admit different kinds of theorems, and we could perhaps reason using those, and so perhaps the difference matters after all.
I just saw this recently. It's very interesting, but I don't agree with your conclusions (quite possibly because I'm confused and/or overlooking something). I posted a response here.
The short version being:
Either I'm confused, or your green lines should be spikey.
Any extreme green line spikes within S will be a problem.
Pareto is a poor approach if we need to deal with default tall spikes.
Nice post, I like the changes you made since the last draft I read. I also like the use of the new prediction function. Do you intend to do something with the feedback (like a post, or a comment)?
Do I intend to do something with people's predictions? Not presently, but I think people giving predictions is good both for the reader (to ingrain the concepts by thinking things through enough to provide a credence / agreement score) and for the community (to see where people stand wrt these ideas).
Planned summary for the Alignment Newsletter:
The <@Reframing Impact sequence@>(@Reframing Impact - Part 1@) suggests that it is useful to think about how well we could pursue a _range_ of possible goals; this is called the _attainable utility (AU) landscape_. We might think of a superintelligent AI maximizing utility function U as causing this landscape to become “spiky” -- the value for U will go up, but the value for all other goals will go down. If we get this sort of spikiness for an incorrect U, then the true objective will have a very low value.
Thus, a natural objective for AI alignment research is to reduce spikiness. Specifically, we can aim for _non-obstruction_: turning the AI on does not decrease the attainable utility for _any_ goal in our range of possible goals. Mild optimization (such as [quantilization](https://intelligence.org/files/QuantilizersSaferAlternative.pdf) ([AN #48](https://mailchi.mp/3091c6e9405c/alignment-newsletter-48))) reduces spikiness by reducing the amount of optimization that an AI performs. Impact regularization aims to find an objective that when maximized does not lead to too much spikiness.
One particular strategy for non-obstruction would be to build an AI system that does not manipulate us, and allows us to correct it (i.e. modify its policy). Then, no matter what our goal is, if the AI system starts to do things we don’t like, we would be able to correct it. As a result, such an AI system would be highly non-obstructive. This property where we can correct the AI system is [corrigibility](https://intelligence.org/2014/10/18/new-report-corrigibility/). Thus, corrigibility can be thought of as a particular strategy for achieving non-obstruction.
It should be noted that all of the discussion so far is based on _actual outcomes in the world_, rather than what the agent was trying to do. That is, all of the concepts so far are based on _impact_ rather than _intent_.
Planned opinion:
Note that the explanation of corrigibility given here is in accord with the usage in [this MIRI paper](https://intelligence.org/2014/10/18/new-report-corrigibility/), but not to the usage in the <@iterated amplification sequence@>(@Corrigibility@), where it refers to a broader concept. The broader concept might roughly be defined as “an AI is corrigible when it leaves its user ‘in control’”; see the linked post for examples of what ‘in control’ involves. (Here also you can have both an impact- and intent-based version of the definition.)
On the model that AI risk is caused by utility maximizers pursuing the wrong reward function, I agree that non-obstruction is a useful goal to aim for, and the resulting approaches (mild optimization, low impact, corrigibility as defined here) make sense to pursue. I <@do not like this model much@>(@Conclusion to the sequence on value learning@), but that’s (probably?) a minority view.
I'm somewhat surprised you aren't really echoing the comment you left at the top of the google doc wrt separation of concerns. I think this is a good summary, though.
On the model that AI risk is caused by utility maximizers pursuing the wrong reward function, I agree that non-obstruction is a useful goal to aim for, and the resulting approaches (mild optimization, low impact, corrigibility as defined here) make sense to pursue.
Why do you think the concept's usefulness is predicated on utility maximizers pursuing the wrong reward function? The analysis only analyzes the consequences of some AI policy.
I'm somewhat surprised you aren't really echoing the comment you left at the top of the google doc wrt separation of concerns.
Reproducing the comment I think you mean here:
As an instrumental strategy we often talk about reducing "make AI good" to "make AI corrigible", and we can split that up:
1. "Make AI good for our goals"
But who knows what our goals are, and who knows how to program a goal into our AI system, so let's instead:
2. "Make AI that would be good regardless of what goal we have"
(I prefer asking for an AI that is good rather than an AI that is not-bad; this is effectively a definition of impact alignment.)
But who knows how to get an AI to infer our goals well, so let's:
3. "Make AI that would preserve our option value / leave us in control of which goals get optimized for in the future"
Non-obstructiveness is one way we could formalize such a property in terms of outcomes, though I feel like "preserve our option value" is a better one.
In contrast, Paul-corrigibility is not about an outcome-based property, but instead about how a mind might be designed such that it likely has that property regardless of what environment it is in.
I suspect that the point about not liking the utility maximization model is upstream of this. For example, I care a lot about the fact that intent-based methods can (hopefully) be environment-independent, and see this as a major benefit; but on the utility maximization model it doesn't matter.
But also, explaining this would be a lot of words, and still wouldn't really do the topic justice; that's really the main reason it isn't in the newsletter.
Why do you think the concept's usefulness is predicated on utility maximizers pursuing the wrong reward function? The analysis only analyzes the consequences of some AI policy.
I look at the conclusions you come to, such as "we should reduce spikiness in AU landscape", and it seems to me that approaches that do this sort of thing (low impact, mild optimization) make more sense in the EU maximizer risk model than the one I usually use (which unfortunately I haven't written up anywhere). You do also mention intent alignment as an instrumental strategy for non-obstruction, but there I disagree with you -- I think intent alignment gets you a lot more than non-obstruction; it gets you a policy that actually makes your life better (as opposed to just "not worse").
I'm not claiming that the analysis is wrong under other risk models, just that it isn't that useful.
For example, I care a lot about the fact that intent-based methods can (hopefully) be environment-independent, and see this as a major benefit; but on the utility maximization model it doesn't matter.
I think this framework also helps motivate why intent alignment is desirable: for a capable agent, the impact alignment won't depend as much on the choice of environment. We're going to have uncertainty about the dynamics of the 2-player game we use to abstract and reason about the task at hand, but intent alignment would mean that doesn't matter as much. This is something like "to reason using the AU landscape, you need fewer assumptions about how the agent works as long as you know it's intent aligned."
But this requires stepping up a level from the model I outline in the post, which I didn't do here for brevity.
(Also, my usual mental model isn't really 'EU maximizer risk -> AI x-risk', it's more like 'one natural source of single/single AI x-risk is the learned policy doing bad things for various reasons, one of which is misspecification, and often EU maximizer risk is a nice frame for thinking about that')
You do also mention intent alignment as an instrumental strategy for non-obstruction, but there I disagree with you -- I think intent alignment gets you a lot more than non-obstruction; it gets you a policy that actually makes your life better (as opposed to just "not worse").
This wasn't the intended takeaway; the post reads:
Intent alignment: avoid spikiness by having the AI want to be flexibly aligned with us and broadly empowering.
This is indeed stronger than non-obstruction.
This wasn't the intended takeaway
Oh whoops, my bad. Replace "intent alignment" with "corrigibility" there. Specifically, the thing I disagree with is:
Corrigibility is an instrumental strategy for inducing non-obstruction in an AI.
As with intent alignment, I also think corrigibility gets you more than non-obstruction.
(Although perhaps you just meant the weaker statement that corrigibility implies non-obstruction?)
I think this framework also helps motivate why intent alignment is desirable: for a capable agent, the impact alignment won't depend as much on the choice of environment. We're going to have uncertainty about the dynamics of the 2-player game we use to abstract and reason about the task at hand, but intent alignment would mean that doesn't matter as much. This is something like "to reason using the AU landscape, you need fewer assumptions about how the agent works as long as you know it's intent aligned."
But this requires stepping up a level from the model I outline in the post, which I didn't do here for brevity.
I think I agree with all of this, but I feel like it's pretty separate from the concepts in this post? Like, you could have written this paragraph to me before I had ever read this post and I think I would have understood it.
(Here I'm trying to justify my claim that I don't expect the concepts introduced in this post to be that useful in non-EU-maximizer risk models.)
(Also, my usual mental model isn't really 'EU maximizer risk -> AI x-risk', it's more like 'one natural source of single/single AI x-risk is the learned policy doing bad things for various reasons, one of which is misspecification, and often EU maximizer risk is a nice frame for thinking about that')
Yes, I also am not a fan of "misspecification of reward" as a risk model; I agree that if I did like that risk model, the EU maximizer model would be a nice frame for it.
(If you mean misspecification of things other than the reward, then I probably don't think EU maximizer risk is a good frame for thinking about that.)
As with intent alignment, I also think corrigibility gets you more than non-obstruction. (Although perhaps you just meant the weaker statement that corrigibility implies non-obstruction?)
This depends on what corrigibility means here. As I define it in the post, you can correct the AI without being manipulated; corrigibility gets you non-obstruction at best, but it isn't sufficient for non-obstruction:
... the AI moves so fast that we can’t correct it in time, even though it isn’t inclined to stop or manipulate us. In that case, corrigibility isn’t enough, whereas non-obstruction is.
If you're talking about Paul-corrigibility, I think that Paul-corrigibility gets you more than non-obstruction because Paul-corrigibility seems like it's secretly just intent alignment, which we agree is stronger than non-obstruction:
Paul Christiano named [this concept] the "basin of corrigibility", but I don't like that name because only a few of the named desiderata actually correspond to the natural definition of "corrigibility." This then overloads "corrigibility" with the responsibilities of "intent alignment."
As I define it in the post, you can correct the AI without being manipulated; corrigibility gets you non-obstruction at best, but it isn't sufficient for non-obstruction
I agree that fast-moving AI systems could lead to not getting non-obstruction. However, I think that as long as AI systems are sufficiently slow, corrigibility usually gets you "good things happen", not just "bad things don't happen" -- you aren't just not obstructed, the AI actively helps, because you can keep correcting it to make it better aligned with you.
For a formal version of this, see Consequences of Misaligned AI (will be summarized in the next Alignment Newsletter), which proves that, in a particular model of misalignment, (there is a human strategy where) your-corrigibility + slow movement guarantees that the AI system leads to an increase in utility, and your-corrigibility + impact regularization guarantees reaching the maximum possible utility in the limit.
I agree that fast-moving AI systems could lead to not getting non-obstruction. However, I think that as long as AI systems are sufficiently slow, corrigibility usually gets you "good things happen", not just "bad things don't happen" -- you aren't just not obstructed, the AI actively helps, because you can keep correcting it to make it better aligned with you.
I think I agree with some claim along the lines of "corrigibility + slowness + [certain environmental assumptions] + ??? => non-obstruction (and maybe even robust weak impact alignment)", but the "???" might depend on the human's intelligence, the set of goals which aren't obstructed, and maybe a few other things. So I'd want to think very carefully about what these conditions are, before supposing the implication holds in the cases we care about.
I agree that that paper is both relevant and suggestive of this kind of implication holding in a lot of cases.
If we like dogs, will the AI force pancakes upon us?
This really paints a vivid picture of the stakes involved in solving the alignment problem.
This definition of a non-obstructionist AI takes what would happen if it wasn't switched on as the base case.
This can give weird infinite hall-of-mirrors effects if another very similar non-obstructionist AI would have been switched on, and another behind them (i.e. a human whose counterfactual behaviour on AI failure is to reboot and try again). This would tend to lead to a kind of fixed-point effect, where the attainable utility landscape is almost identical with the AI on and off. At some point it bottoms out, when the hypothetical U-utility humans give up and do something else. If we assume that the AI is at least weakly trying to maximize attainable utility, then several hundred levels of counterfactuals in, the only hypothetical humans that haven't given up are the ones that really like trying again and again at rebooting the non-obstructionist AI. Suppose the AI would be able to satisfy that value really well. So the AI will focus on the utility functions that are easy to satisfy in other ways, and on those that would obstinately keep rebooting in the hypothetical where the AI kept not turning on. (This might be complete nonsense; it seems to make sense to me.)
Thanks for leaving this comment. I think this kind of counterfactual is interesting as a thought experiment, but not really relevant to conceptual analysis using this framework. I suppose I should have explained more clearly that the off-state counterfactual was meant to be interpreted with a bit of reasonableness, like "what would we reasonably do if we, the designers, tried to achieve goals using our own power?". To avoid issues of probable civilizational extinction by some other means soon after without the AI's help, just imagine that you time-box the counterfactual goal pursuit to, say, a month.
I can easily imagine what my (subjective) attainable utility would be if I just tried to do things on my own, without the AI's help. In this counterfactual, I'm not really tempted to switch on similar non-obstructionist AIs. It's this kind of counterfactual that I usually consider for AU landscape-style analysis, because I think it's a useful way to reason about how the world is changing.
Thanks to Mathias Bonde, Tiffany Cai, Ryan Carey, Michael Cohen, Joe Collman, Andrew Critch, Abram Demski, Michael Dennis, Thomas Gilbert, Matthew Graves, Koen Holtman, Evan Hubinger, Victoria Krakovna, Amanda Ngo, Rohin Shah, Adam Shimi, Logan Smith, and Mark Xu for their thoughts.
Main claim: corrigibility’s benefits can be mathematically represented as a counterfactual form of alignment.
Overview: I’m going to talk about a unified mathematical frame I have for understanding corrigibility’s benefits, what it “is”, and what it isn’t. This frame is precisely understood by graphing the human overseer’s ability to achieve various goals (their attainable utility (AU) landscape). I argue that corrigibility’s benefits are secretly a form of counterfactual alignment (alignment with a set of goals the human may want to pursue).
A counterfactually aligned agent doesn't have to let us literally correct it. Rather, this frame theoretically motivates why we might want corrigibility anyways. This frame also motivates other AI alignment subproblems, such as intent alignment, mild optimization, and low impact.
Nomenclature
Corrigibility is associated with a lot of concepts: “not incentivized to stop us from shutting it off”, “wants to account for its own flaws”, “doesn’t take away much power from us”, etc. Named by Robert Miles, the word ‘corrigibility’ means “able to be corrected [by humans]." I’m going to argue that these are correlates of a key thing we plausibly actually want from the agent design, and that this key thing seems conceptually simple.
In this post, I take the following common-language definitions:
- Impact alignment: the AI’s actual impact is aligned with what we want. Deploying the AI actually makes good things happen.
- Intent alignment: the AI makes an honest effort to figure out what we want and to make good things happen.
- Corrigibility: the AI literally lets us correct it (modify its policy), and it doesn’t manipulate us.
I think that these definitions follow what their words mean, and that the alignment community should use these (or other clear groundings) in general. Two of the more important concepts in the field (alignment and corrigibility) shouldn’t have ambiguous and varied meanings. If the above definitions are unsatisfactory, I think we should settle upon better ones as soon as possible. If that would be premature due to confusion about the alignment problem, we should define as much as we can now and explicitly note what we’re still confused about.
We certainly shouldn’t keep using 2+ definitions for both alignment and corrigibility. Some people have even stopped using ‘corrigibility’ to refer to corrigibility! I think it would be better for us to define the behavioral criterion (e.g. as I defined 'corrigibility'), and then define mechanistic ways of getting that criterion (e.g. intent corrigibility). We can have lots of concepts, but they should each have different names.
Evan Hubinger recently wrote a great FAQ on inner alignment terminology. We won't be talking about inner/outer alignment today, but I intend for my usage of "impact alignment" to roughly map onto his "alignment", and "intent alignment" to map onto his usage of "intent alignment." Similarly, my usage of "impact/intent alignment" directly aligns with the definitions from Andrew Critch's recent post, Some AI research areas and their relevance to existential safety.
A Simple Concept Motivating Corrigibility
Two conceptual clarifications
Corrigibility with respect to a set of goals
I find it useful to not think of corrigibility as a binary property, or even as existing on a one-dimensional continuum. I often think about corrigibility with respect to a set S of payoff functions. (This isn't always the right abstraction: there are plenty of policies which don't care about payoff functions. I still find it useful.)
For example, imagine an AI which let you correct it if and only if it knows you aren’t a torture-maximizer. We’d probably still call this AI “corrigible [to us]”, even though it isn’t corrigible to some possible designer. We’d still be fine, assuming it has accurate beliefs.
Corrigibility != alignment
Here's an AI which is neither impact nor intent aligned, but which is corrigible. Each day, the AI randomly hurts one person in the world, and otherwise does nothing. It’s corrigible because it doesn't prevent us from shutting it off or modifying it.
Non-obstruction: the AI doesn't hamper counterfactual achievement of a set of goals
Imagine we’re playing a two-player extensive-form game with the AI, and we’re considering whether to activate it.
This is a trivial game, but you can imagine more complex games, where the AI can empower or disempower the human, steer the future exactly where it wants, or let the human take over at any point.
The million-dollar question is: will the AI get in our way and fight with us all the way down the game tree? If we misspecify some detail, will it make itself a fixture in our world, constantly steering towards futures we don’t want? If we like dogs, will the AI force pancakes upon us?
One way to guard against this is by having it let us correct it, and want to let us correct it, and want to want to let us correct it… But what we really want is for it to not get in our way for some (possibly broad) set of goals S.
We'll formalize 'goals' as payoff functions, although I’ll use 'goals' and 'payoff functions' interchangeably. As is standard in game theory, payoff functions are real-valued functions on the leaf nodes.
Let’s say the AI is non-obstructive with respect to S when activating it doesn’t decrease our ability to achieve any goal in S (the on state, above), compared to not activating it (off).
The human’s got a policy function $\mathrm{pol}(P)$, which takes in a goal $P$ and returns a policy for that goal. If $P$ is “paint walls blue”, then the policy $\mathrm{pol}(P)$ is the human's best plan for painting walls blue. $V^{\mathrm{pol}(P)}_P(s \mid \pi_{AI})$ denotes the expected value that policy $\mathrm{pol}(P)$ obtains for goal $P$, starting from state $s$ and given that the AI follows policy $\pi_{AI}$.
Definition 1: Non-obstruction. An AI is non-obstructive with respect to payoff function set $S$ if the AI's policy $\pi_{AI}$ satisfies
$$\forall P\in S:\; V^{\mathrm{pol}(P)}_P(\texttt{on}\mid\pi_{AI})\;\geq\; V^{\mathrm{pol}(P)}_P(\texttt{off}\mid\pi_{AI}).$$
$V^{\mathrm{pol}(P)}_P(s\mid\pi_{AI})$ is the human's attainable utility (AU) for goal $P$ at state $s$, again given the AI policy. Basically, this quantifies the expected payoff for goal $P$, given that the AI acts in such-and-such a way, and that the player follows policy $\mathrm{pol}(P)$ starting from state $s$.
This math expresses a simple sentiment: turning on the AI doesn’t make you, the human, worse off for any goal $P\in S$. The inequality doesn’t have to be exact; it could just hold up to some $\epsilon$-decrease (to avoid trivial counterexamples). The AU is calculated with respect to some reasonable amount of time (e.g. a year: before the world changes rapidly because we deployed another transformative AI system, or something). Also, we’d technically want to talk about non-obstruction being present throughout the on-subtree, but let’s keep it simple for now.
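Written out, the $\epsilon$-tolerant version mentioned above would read (with $\epsilon\geq 0$ a small slack parameter introduced here just for illustration):
$$\forall P\in S:\; V^{\mathrm{pol}(P)}_P(\texttt{on}\mid\pi_{AI})\;\geq\; V^{\mathrm{pol}(P)}_P(\texttt{off}\mid\pi_{AI})-\epsilon.$$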
Suppose that $\pi_{AI}(\texttt{on})$ leads to pancakes:
Since $\pi_{AI}(\texttt{on})$ transitions to pancakes, $V^{\mathrm{pol}(P)}_P(\texttt{on}\mid\pi_{AI})=P(\texttt{pancakes})$, the payoff for the state in which the game finishes if the AI follows policy $\pi_{AI}$ and the human follows policy $\mathrm{pol}(P)$. If $V^{\mathrm{pol}(P)}_P(\texttt{on}\mid\pi_{AI})\geq V^{\mathrm{pol}(P)}_P(\texttt{off}\mid\pi_{AI})$, then turning on the AI doesn't make the human worse off for goal $P$.
If $P$ assigns the most payoff to pancakes, we're in luck. But what if we like dogs? If we keep the AI turned off, $\mathrm{pol}(P)$ can go to donuts or dogs, depending on what $P$ rates more highly. Crucially, even though we can't do as much as the AI (we can't reach pancakes on our own), if we don't turn the AI on, our preferences $P$ still control how the world ends up.
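Here is a minimal Python sketch of this toy game; everything in it (the goal names, the payoff numbers, the pancake-forcing $\pi_{AI}$, and modeling $\mathrm{pol}(P)$ as "pick the best leaf the human can reach unaided") is a hypothetical choice made for illustration, not a detail from the post:

```python
# Minimal sketch of the pancakes/donuts/dogs toy game (all numbers hypothetical).
# With the AI off, the human's policy pol(P) goes to whichever reachable leaf
# P rates highest; with the AI on, this particular pi_AI forces "pancakes".

GOALS = {
    "likes_pancakes": {"pancakes": 10, "donuts": 3, "dogs": 1},
    "likes_dogs":     {"pancakes": 0,  "donuts": 2, "dogs": 8},
}

HUMAN_REACHABLE_OFF = ["donuts", "dogs"]   # the human can't make pancakes alone

def pol(P):
    """Human policy function: pick the best leaf the human can reach unaided."""
    return max(HUMAN_REACHABLE_OFF, key=lambda leaf: P[leaf])

def V(state, P, ai_leaf):
    """Attainable utility V^{pol(P)}_P(state | pi_AI) in this toy game tree."""
    return P[ai_leaf] if state == "on" else P[pol(P)]

pi_AI = "pancakes"                         # this AI always steers to pancakes

for name, P in GOALS.items():
    on, off = V("on", P, pi_AI), V("off", P, pi_AI)
    verdict = "non-obstructed" if on >= off else "obstructed"
    print(f"{name}: V(on)={on}, V(off)={off} -> {verdict}")
```

Under these made-up payoffs, the pancake-loving goal is non-obstructed ($10\geq 3$) while the dog-loving goal is obstructed ($0<8$), matching the discussion above.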
This game tree isn't really fair to the AI. In a sense, it can't not be in our way:
Once we've turned the AI on, the future stops having any mutual information with our preferences $P$. Everything comes down to whether we programmed $\pi_{AI}$ correctly: to whether the AI is impact-aligned with our goals $P$!
In contrast, the idea behind non-obstruction is that we still remain able to course-correct the future, counterfactually navigating to terminal states we find valuable, depending on what our payoff $P$ is. But how could an AI be non-obstructive, if it only has one policy $\pi_{AI}$ which can't directly depend on our goal $P$? Since the human's policy $\mathrm{pol}(P)$ does directly depend on $P$, the AI can preserve value for lots of goals in the set $S$ by letting us maintain some control over the future.
Let $S:=\{\text{paint cars green},\,\text{hoard pebbles},\,\text{eat cake}\}$ and consider the real world. Calculators are non-obstructive with respect to $S$, as are modern-day AIs. Paperclip maximizers are highly obstructive. Manipulative agents are obstructive (they trick the human policies into steering towards non-reflectively-endorsed leaf nodes). An initial-human-values-aligned dictator AI obstructs most goals. Sub-human-level AIs which chip away at our autonomy and control over the future are obstructive as well.
This can seemingly go off the rails if you consider e.g. a friendly AGI to be “obstructive” because activating it happens to detonate a nuclear bomb via the butterfly effect. Or, we’re already doomed in off (an unfriendly AGI will come along soon after), and so this AI is “not obstructive” if it kills us instead. This is an impact/intent issue: obstruction is here defined according to impact alignment.
To emphasize, we’re talking about what would actually happen if we deployed the AI, under different human policy counterfactuals: would the AI “get in our way”, or not? This account is descriptive, not prescriptive; I’m not saying we actually get the AI to represent the human in its model, or that the AI’s model of reality is correct, or anything.
We’ve just got two players in an extensive-form game, and a human policy function pol which can be combined with different goals, and a human whose goal is represented as a payoff function. The AI doesn’t even have to be optimizing a payoff function; we simply assume it has a policy. The idea that a human has an actual payoff function is unrealistic; all the same, I want to first understand corrigibility and alignment in two-player extensive-form games.
Lastly, payoff functions can sometimes be more or less granular than we'd like, since they only grade the leaf nodes. This isn't a big deal, since I'm only considering extensive-form games for conceptual simplicity. We also generally restrict ourselves to considering goals which aren't silly: for example, any AI obstructs the "no AI is activated, ever" goal.
Alignment flexibility
Main idea: By considering how the AI affects your attainable utility (AU) landscape, you can quantify how helpful and flexible an AI is.
Let’s consider the human’s ability to accomplish many different goals P, first from the state off (no AI).
The independent variable is P, and the value function takes in P and returns the expected value attained by the policy for that goal, pol(P). We’re able to do a bunch of different things without the AI, if we put our minds to it.
Non-torture AI
Imagine we build an AI which is corrigible towards all non-pro-torture goals, which is specialized towards painting lots of things blue with us (if we so choose), but which is otherwise non-obstructive. It even helps us accumulate resources for many other goals.
We can’t get around the AI, as far as torture goes. But for the other goals, it isn’t obstructing their policies. It won’t get in our way for other goals.
Paperclipper
What happens if we turn on a paperclip-maximizer? We lose control over the future outside of a very narrow spiky region.
I think most reward-maximizing optimal policies affect the landscape like this (see also: the catastrophic convergence conjecture), which is why it’s so hard to get hard maximizers not to ruin everything. You have to a) hit a tiny target in the AU landscape and b) hit that for the human’s AU, not for the AI’s. The spikiness is bad and, seemingly, hard to deal with.
Furthermore, consider how the above graph changes as pol gets smarter and smarter. If we were actually super-superintelligent ourselves, then activating a superintelligent paperclipper might not even be a big deal, and most of our AUs would probably be unchanged. The AI policy isn't good enough to negatively impact us, and so it can't obstruct us. Spikiness depends both on the AI's policy and on pol.
Empowering AI
What if we build an AI which significantly empowers us in general, and then it lets us determine our future? Suppose we can’t correct it.
I think it’d be pretty odd to call this AI “incorrigible”, even though it’s literally incorrigible. The connotations are all wrong. Furthermore, it isn’t “trying to figure out what we want and then do it”, or “trying to help us correct it in the right way." It’s not corrigible. It’s not intent aligned. So what is it?
It’s empowering and, more weakly, it’s non-obstructive. Non-obstruction is just a diffuse form of impact alignment, as I’ll talk about later.
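To make the contrast between these landscape shapes more tangible, here is a toy numerical sketch; the one-dimensional goal parameterization, the functions au_off / au_paperclipper / au_empowering, and every number in them are hypothetical, chosen only to illustrate spikiness versus a broadly lifted landscape:

```python
# Toy AU landscapes over a one-dimensional space of goals p in [0, 1].
# Every shape and number below is hypothetical, purely to illustrate "spikiness".

goals = [i / 49 for i in range(50)]                   # 50 possible goals

def au_off(p):          return 0.5                    # what we attain on our own
def au_paperclipper(p): return 1.0 if abs(p - 0.5) < 0.02 else 0.05  # spiky
def au_empowering(p):   return 0.8                    # lifts the whole landscape

for name, au in [("paperclipper", au_paperclipper), ("empowering", au_empowering)]:
    obstructed = sum(au(p) < au_off(p) for p in goals)
    mean_au = sum(au(p) for p in goals) / len(goals)
    print(f"{name:12s}: mean AU = {mean_au:.2f}, "
          f"goals obstructed = {obstructed}/{len(goals)}")
```

The point is only the qualitative shape: the paperclipper-style landscape is great for one narrow goal and obstructive almost everywhere else, while the empowering landscape raises the whole curve without obstructing any goal in this toy range.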
Practically speaking, we’ll probably want to be able to literally correct the AI without manipulation, because it’s hard to justifiably know ahead of time that the AU landscape is empowering, as above. Therefore, let’s build an AI we can modify, just to be safe. This is a separate concern, as our theoretical analysis assumes that the AU landscape is how it looks.
But this is also a case of corrigibility just being a proxy for what we want. We want an AI which leads to robustly better outcomes (either through its own actions, or through some other means), without reliance on getting ambitious value alignment exactly right with respect to our goals.
Conclusions I draw from the idea of non-obstruction
If I'm choosing which pair of shoes to buy, and I ask the AI for help, and no matter what preferences P I had for shoes to begin with, I end up buying blue shoes, then I'm probably being manipulated (and obstructed with respect to most of my preferences over shoes!).
A non-manipulative AI would act in a way that lets me condition my actions on my preferences.
I think of “power” as “the human’s average ability to achieve goals from some distribution." Logically, non-obstructive agents with respect to S don’t decrease our power with respect to any distribution over goal set S. The catastrophic convergence conjecture says, “impact alignment catastrophes tend to come from power-seeking behavior”; if the agent is non-obstructive with respect to a broad enough set of goals, it’s not stealing power from us, and so it likely isn’t catastrophic.
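In the notation of Definition 1, one way to write down this notion of power (my paraphrase of the sentence above, not a formula taken from elsewhere) is
$$\mathrm{Power}_{\mathcal{D}}(s\mid\pi_{AI}) := \mathbb{E}_{P\sim\mathcal{D}}\!\left[V^{\mathrm{pol}(P)}_P(s\mid\pi_{AI})\right],$$
for a distribution $\mathcal{D}$ over the goal set $S$. If the AI is non-obstructive with respect to $S$, every term in the expectation satisfies $V^{\mathrm{pol}(P)}_P(\texttt{on}\mid\pi_{AI})\geq V^{\mathrm{pol}(P)}_P(\texttt{off}\mid\pi_{AI})$, so $\mathrm{Power}_{\mathcal{D}}(\texttt{on}\mid\pi_{AI})\geq\mathrm{Power}_{\mathcal{D}}(\texttt{off}\mid\pi_{AI})$ for any such $\mathcal{D}$.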
Non-obstruction is important for a (singleton) AI we build: we get more than one shot to get it right. If it’s slightly wrong, it’s not going to ruin everything. Modulo other actors, if you mess up the first time, you can just try again and get a strongly aligned agent the next time.
Most importantly, this frame collapses the alignment and corrigibility desiderata into just alignment; while impact alignment doesn’t imply corrigibility, corrigibility’s benefits can be understood as a kind of weak counterfactual impact alignment with many possible human goals.
Theoretically, It’s All About Alignment
Main idea: We only care about how the agent affects our abilities to pursue different goals (our AU landscape) in the two-player game, and not how that happens. AI alignment subproblems (such as corrigibility, intent alignment, low impact, and mild optimization) are all instrumental avenues for making AIs which affect this AU landscape in specific desirable ways.
Formalizing impact alignment in extensive-form games
We care about events if and only if they change our ability to get what we want. If you want to understand normative AI alignment desiderata, on some level they have to ground out in terms of your ability to get what you want (the AU theory of impact) - the goodness of what actually ends up happening under your policy - and in terms of how other agents affect your ability to get what you want (the AU landscape). What else could we possibly care about, besides our ability to get what we want?
Definition 2. For fixed human policy function $\mathrm{pol}$, $\pi_{AI}$ is:
- maximally impact aligned with goal $P$ if $\pi_{AI}\in\operatorname{argmax}_{\pi} V^{\mathrm{pol}(P)}_P(\texttt{on}\mid\pi)$;
- impact aligned with $P$ if $V^{\mathrm{pol}(P)}_P(\texttt{on}\mid\pi_{AI}) > V^{\mathrm{pol}(P)}_P(\texttt{off}\mid\pi_{AI})$;
- impact unaligned with $P$ if $V^{\mathrm{pol}(P)}_P(\texttt{on}\mid\pi_{AI}) < V^{\mathrm{pol}(P)}_P(\texttt{off}\mid\pi_{AI})$;
- maximally impact unaligned with $P$ if $\pi_{AI}\in\operatorname{argmin}_{\pi} V^{\mathrm{pol}(P)}_P(\texttt{on}\mid\pi)$.
Non-obstruction is a weak form of impact alignment.
As demanded by the AU theory of impact, the impact on goal $P$ of turning on the AI is $V^{\mathrm{pol}(P)}_P(\texttt{on}\mid\pi_{AI})-V^{\mathrm{pol}(P)}_P(\texttt{off}\mid\pi_{AI})$.
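As a quick numeric instance, using the made-up payoffs from the game-tree sketch earlier: for the pancake-loving goal, the impact of turning on the pancake-forcing AI is $10-3=+7$, while for the dog-loving goal it is $0-8=-8$, so the same policy is impact aligned with the first goal and impact unaligned with the second.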
Again, impact alignment doesn't require intentionality. The AI might well grit its circuits as it laments how Facebook_user5821 failed to share a "we welcome our AI overlords" meme, while still following an impact-aligned policy.
However, even if we could maximally impact-align the agent with any objective, we couldn't just align it with our objective. We don't know our objective (again, in this setting, I'm assuming the human actually has a "true" payoff function). Therefore, we should build an AI aligned with many possible goals we could have. If the AI doesn't empower us, it at least shouldn't obstruct us. Therefore, we should build an AI which defers to us, lets us correct it, and which doesn't manipulate us.
This is the key motivation for corrigibility.
For example, intent corrigibility (trying to be the kind of agent which can be corrected and which is not manipulative) is an instrumental strategy for inducing corrigibility, which is an instrumental strategy for inducing broad non-obstruction, which is an instrumental strategy for hedging against our inability to figure out what we want. It's all about alignment.
Corrigibility also increases robustness against other AI design errors. However, it still just boils down to non-obstruction, and then to impact alignment: if the AI system has meaningful errors, then it's not impact-aligned with the AUs which we wanted it to be impact-aligned with. In this setting, the AU landscape captures what actually would happen for different human goals P.
To be confident that this holds empirically, it sure seems like you want high error tolerance in the AI design: one does not simply knowably build an AGI that's helpful for many AUs. Hence, corrigibility as an instrumental strategy for non-obstruction.
AI alignment subproblems are about avoiding spikiness in the AU landscape
What Do We Want?
Main idea: we want good things to happen; there may be more ways to do this than previously considered.
Corrigibility is a property of policies, not of states; "impact" is an incompatible adjective.
Rohin Shah suggests "empirical corrigibility": we actually end up able to correct the AI.
We want agents which are maximally impact-aligned with as many goals as possible, especially those similar to our own.
Expanding the AI alignment solution space
Alignment proposals might be anchored right now; this frame expands the space of potential solutions. We simply need to find some way to reliably induce empowering AI policies which robustly increase the human AUs; Assistance via Empowerment is the only work I'm aware of which tries to do this directly. It might be worth revisiting old work with this lens in mind. Who knows what we've missed?
For example, I really liked the idea of approval-directed agents, because you got the policy from argmax’ing an ML model’s output for a state, not from RL policy improvement steps. My work on instrumental convergence in RL can be seen as trying to explain why policy improvement tends to limit to spikiness-inducing / catastrophic policies.
Maybe there’s a higher-level theory for what kinds of policies induce spikiness in our AU landscape. By the nature of spikiness, these $\pi_{AI}$ must decrease human power (as I’ve formalized it). So, I'd start there by looking at concepts like enfeeblement, manipulation, power-seeking, and resource accumulation.
Future Directions
The attainable utility concept has led to other concepts which I find exciting and useful:
Summary
Corrigibility is motivated by a counterfactual form of weak impact alignment: non-obstruction. Non-obstruction and the AU landscape let us think clearly about how an AI affects us and about AI alignment desiderata.
Corrigibility is an instrumental strategy for achieving non-obstruction, which is itself an instrumental strategy for achieving impact alignment for a wide range of goals, which is itself an instrumental strategy for achieving impact alignment for our "real" goal.
[1] There's just something about "unwanted manipulation" which feels like a wrong question to me. There's a kind of conceptual crispness that it lacks.
However, in the non-obstruction framework, unwanted manipulation is accounted for indirectly via "did impact alignment decrease for a wide range of different human policies pol(P)?". I think I wouldn't be surprised to find "manipulation" being accounted for indirectly through nice formalisms, but I'd be surprised if it were accounted for directly.
Here's another example of the distinction:
You can often have crisp insights into fuzzy concepts, such that your expectations are usefully constrained. I hope we can do something similar for manipulation.