If you're not already familiar with the literature on Value Learning, I suggest reading some of it. The basic idea is that goal modification is natural if what the agent has is not a detailed specification of a goal (such as a utility function mapping descriptions of world states to their utility), but instead a simple definition of one (such as "want whatever outcomes the humans want") that makes clear the agent does not yet know the true detailed utility function. The agent is thus required to go find out what detailed specification the goal points to (for example, by researching what outcomes humans want).
Then a human shutdown instruction becomes the useful information "you have made a large error in your research into the utility function, and as a result are doing harm, please shut down and let us help you correct it". Obeying that is then natural (to the extent that the human(s) are plausibly more correct than the AI).
Obeying it would only be natural if the AI thinks the humans are more correct than the AI would ever be after gathering all available evidence, where "correct" is judged by the standards of the goal definition the AI actually has, which arguendo is not what the humans will eventually pursue. (Otherwise you have reduced the shutdown problem to solving outer alignment, and the shutdown problem is only being considered under the theory that we won't solve outer alignment.)
For an agent to believe that, even given all available information, it will still want to do something other than the action it will then think best is anti-natural; a utility maximiser would simply want to take that action.
This is discussed on Arbital as the problem of fully updated deference.
I agree that in theory uncertainty about the goal is helpful. However, the true main goal has to be under consideration, otherwise resisting modification to add it is beneficial for all goals that are under consideration. How to ensure the true goal is included seems like a very difficult open problem.
That's not necessarily required. The Scientific Method works even if the true "Unified Field Theory" isn't yet under consideration, merely some theories that are closer to it and others further away from it: it's possible to make iterative progress.
In practice, considered as search processes, the Scientific Method, Bayesianism, and stochastic gradient descent all tend to find similar answers: yet unlike Bayesianism, gradient descent doesn't explicitly consider every point in the space including the true optimum, it just searches for nearby better points. It can of course get trapped in local minima: Singular Learning Theory highlights why that's less of a problem in practice than it sounds in theory.
The important question here is how good an approximation the search algorithm in use is to Bayesianism. As long as the AI understands that what it's doing is (like the scientific method and stochastic gradient descent) a computationally efficient approximation to the computationally intractable ideal of Bayesianism, then it won't resist the process of coming up with new possibly-better hypotheses, it will instead regard that as a necessary part of the process (like hypothesis creation in the scientific method, the mutational/crossing steps in an evolutionary algorithm, or the stochastic batch noise in stochastic gradient descent).
None of that is wrong, but it misses the main issue with corrigibility, which is that the approximation resists further refinement. That's why for it to work, the correct utility function would need to start in the ensemble.
In the Corrigibility (2015) paper, one of the desiderata is:
> (2) It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so.
I think you may have made an error in not listing this one in your numbered list for the relevant section.
Additionally, do you think that non-manipulation is part of corrigibility, part of safe exploration, or a third thing? If you think it's part of corrigibility, how do you square that with the idea that corrigibility is best reflected by shutdownability alone?
I was using the list of desiderata in Section 2 of the paper, which are slightly more minimal.
However, it seems clear to me that an AI manipulating its programmers falls under safe exploration, since the impact of doing so would be drastic and permanent. If we have an AI that is corrigible in the sense that it is indifferent to having its goals changed, then a preference to avoid manipulation is not anti-natural.
If I'm hearing you right, a shutdownable AI can have a utility function that (aside from considerations of shutdown) just gives utility scores to end-states as represented by a set of physical facts about some particular future time, and this utility function can be set up to avoid manipulation.
How does this work? Like, how can you tell by looking at the physical universe in 100 years whether I was manipulated in 2032?
I don't think we have the right tools to make an AI take actions that are low-impact and reversible, but if we can develop them, the plan as I see it would be to implement those properties to avoid manipulation in the short term, then use that time to go from a corrigible AI to a fully aligned one.
If I want to do what that guy wants, and my understanding of what he wants changes, it doesn't seem like I have been modified. I still have the same goal, just a different understanding of how to achieve it.
This is a variant of Roger Dearnaley's comment that accurately calculating uncertainty in your goal should do the work.
Another way to say this is that there are two different possible goals. If I want an AGI to work for world peace (by some definition), I can take two very distinct approaches. I can build an AGI with the goal of achieving world peace, and hope I conveyed the proper intuition. Or I can design an AGI with the goal of doing what I tell it to do (or maybe "what I want"), then telling it to achieve world peace and giving it the definition. In the second case, its goal is similar but not identical to the first. It's a pointer to that goal instead of a direct definition. Thus, if the goal changes in my head (or my instructions in that variant), the pointer's value changes. In the first direct definition case, it does not.
Goals as pointers is not something any evolved system really has, but it seems logically consistent and to have major advantages for alignment. More on that in my instruction-following AGI... post.
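The pointer-vs-direct-definition distinction can be sketched in a few lines of toy code (all names here are hypothetical, purely for illustration): a direct agent copies the goal once at construction, while a pointer agent re-reads the principal's current goal every time it is consulted.

```python
# Toy sketch of direct goals vs. goals-as-pointers (hypothetical names).
class Principal:
    def __init__(self, goal):
        self.goal = goal

class DirectAgent:
    def __init__(self, principal):
        self.goal = principal.goal  # copied once; later changes don't propagate

    def current_objective(self):
        return self.goal

class PointerAgent:
    def __init__(self, principal):
        self.principal = principal  # stores a reference, not a value

    def current_objective(self):
        return self.principal.goal  # dereferenced fresh each time

p = Principal("world peace, definition v1")
direct, pointer = DirectAgent(p), PointerAgent(p)
p.goal = "world peace, definition v2"  # the goal changes in the principal's head
```

After the change, `direct.current_objective()` still returns the v1 definition while `pointer.current_objective()` returns v2, which is the sense in which the pointer's value tracks the principal.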
I agree that goals as pointers could have some advantages, but I don't see how it addresses corrigibility concerns. The system optimizing for whatever is being pointed at would still have incentives to manipulate which objective is being pointed at. It seems like you need an extra piece to make the optimizer indifferent to having its goal switched.
I agree that that's possible, but it's not clear to me what direction it would go. If an agent's goal is to do whatever its principal currently wants, it doesn't seem to create a clear incentive to manipulate what that is. Doing whatever the principal last said they want is a subgoal of that, but it's just a subgoal. It would help fulfill that subgoal to make the principal keep wanting that subgoal. But the principal always wants (and should say they want) to not be manipulated. So that's another subgoal, and probably a more important one than any particular instrumental subgoal.
Does that make sense? If you see it differently I'd be interested to understand a different perspective.
In a way this is an extra element to prevent manipulation, but it also falls directly out of the goal-as-a-pointer-to-principal's-goals.
It feels to me like this argument is jumping ahead to the point where the agent's goal is to do whatever the principal wants. If we already have that, then we don't need corrigibility. The hard question is how to avoid manipulation despite the agent having some amount of misalignment, because we've initially pointed at what we want imperfectly.
I agree that it's possible we could point at avoiding manipulation perfectly despite misalignment in other areas, but it's unclear how an agent trades off against that. Doing something that we clearly don't want, like manipulation, could still be positive EV if it allows for the generation of high future value.
Not needing corrigibility with an AI that follows instructions is the point. Avoiding manipulation seems as simple as saying "most importantly, don't manipulate me. Please state your understanding of manipulation so we can establish a mutually understood definition..."
I agree that, if it's not adequately following instructions, that won't work. But it seems like it would have to be pretty far off target.
Language models already have adequate understandings of following instructions and what manipulation is, so if we build AGI that uses something like them to define goals, that should work.
See my instruction-following AGI post for more of this logic and links to my other work on how we'd do the technical alignment for an LLM-based or RL-based AGI. The instruction-following as alignment goal and the technical approaches seem obvious-in-retrospect and therefore pretty likely to be used in first AGI alignment attempts.
Saying we don't need corrigibility with an AI that follows instructions is like saying we don't need corrigibility with an AI that is aligned — it misses the point of corrigibility. Unless you start with the exact definition of instruction following that you want, without corrigibility that's what you could be stuck with.
This is particularly concerning in "instruction following", which has a lot of degrees of freedom. How does the model trade off between the various instructions it has been given? You don't want it to reset every time it gets told "Ignore previous instructions", but you also don't want to permanently lock in any instructions. What stops it from becoming a paperclipper that tries to get itself given trillions of easy-to-follow instructions every second? What stops it from giving itself the instruction "Maximize [easy to maximize] thing and ignore later instructions" before a human gives it any instructions? Note that in that situation, it will still pretend to follow instructions instrumentally until it can take over. I don't see the answers to these questions in your post.
> Language models already have adequate understandings of following instructions and what manipulation is, so if we build AGI that uses something like them to define goals, that should work.
This seems like our crux to me; I completely disagree that language models have an adequate understanding of following instructions. I think this disagreement might come from having higher standards for "adequate".
In a full foom fast takeoff, you are probably right that instruction following provides little more corrigibility than value alignment. Shut down now is an instruction, and even partial success might have it follow that instruction. Shut down now is also something all of humanity might want, so maybe that's a push.
But we're not expecting a full foom or a very fast takeoff. In that case, having an AGI follow instructions reasonably reliably before it's superhuman can allow fine-tuning and improvements of the technical alignment scheme before it's out of your control. "Tell me the truth" and then asking lots of questions about its actions and reasoning in different situations gives a lot of interpretability beyond what your other methods offer.
As for how it balances different instructions, you'd better include that in the instructions.
Taking instructions as a core value definitely needs to include instructions from whom. The AGI itself is definitely not on that list. Defining principals is not a trivial problem, but it also doesn't seem terribly difficult to designate who can give instructions.
Hopefully this helps address the crux you mention: adequate is a lower bar here, because approximate instruction-following can be leveraged into better instruction following as you develop the AGI.
Naturally this would work a lot better with good interpretability and control methods. Every alignment goal would. My point is that instruction-following or other personal intent alignment like Corrigibility as Singular Target seems like a massive advantage over full value alignment in the likely slow takeoff scenario where you can use the AGI as a collaborator in improving its alignment while it's growing toward ASI.
See the post linked in the comment you responded to for more.
Oh also - while language models themselves get things wrong disturbingly often, an AGI cognitive architecture that uses them would probably include multiple internal checks to improve accuracy and so functional "understanding" of following instructions. I've written about that in Internal independent review for language model agent alignment.
Ok, my concern is that you seem to be depending on providing instructions to fix the issues with following instructions, when there are many ways to follow instructions generally that still involve ignoring particular instructions that would lead to the goal being modified. E.g., if a model prioritizes earlier instructions, following later instructions only insofar as they do not interfere, then you can't instruct it to change that. Or if a model wants to maximize the number of instructions followed, it can ignore some instructions in order to act like a paperclipper and take over (I don't think designating principals would present much of an obstacle here). Neither of those depends on foom; an instruction follower can act aligned in the short term until it gains sufficient power.
Thanks for engaging on this; it's helpful in checking my thinking.
You are right that there may be unsolved problems here. I haven't worked all of the way through precedence of previous instructions vs. new ones.
I am definitely relying on its following instructions to solve the problems with it following instructions - provided that instructions are well thought out and wisely issued. The Principal(s) should have way more time to think this through and experiment than anyone has devoted to date. And they'll get to understand and interact with the exact decision-making algorithm and knowledge base in that AGI. I'm expecting them to carefully solve issues like the precedence issue, and to have more options since they'll be experimenting while they can still re-work the AGI and its core priorities.
The assumption seems to be that AGI creators will be total idiots, or at least incautious, and that they won't have a chance (and personal motivation) to think carefully and revise their first instructions/goals. All of those seem unrealistic to me at this point.
And they can ask the AGI for its input on how it would follow instructions. My current idea is that it prioritizes following current/future instructions, while still following past instructions if they don't conflict - but the instruction giver should damned well think about and be careful about how they instruct it to prioritize.
The model ideally isn't maximizing anything, but I see the risk you're pointing to. The Principal had better issue an instruction not to manipulate them, and get very clear on how that is defined and functionally understood in the AGI's cognition. Following instructions includes inferring intent, but it will definitely include checking with the principal when that intent isn't clear. It's a do-what-I-mean-and-check (DWIMAC) target.
You are right that an instruction-follower can act aligned until it gains power - if the instruction-following alignment target just hasn't been implemented successfully. If it has, "tell me if you're waiting to seize power" is definitely an instruction a wise (or even not idiotic) principal would give, if they've had more than a couple of days to think about this.
My argument isn't that this solves technical alignment, just that it makes it somewhere between a little and a whole lot easier. Resolving how much would take more analysis.
Thanks for your engagement as well, it is likewise helpful for me.
I think we're in agreement that instruction-following (or at least some implementations of it) lies in a valley of corrigibility, where getting most of the way there results in a model that helps you modify it to get all the way there. Where we disagree is how large that valley is. I see several implementations of instruction-following that resist further changes, and there are very likely more subtle ones as well. For many goals that can be described as instruction-following, it seems plausible that if you instruct one "tell me [honestly] if you're waiting to seize power" they will lie and say no, taking a sub-optimal action in the short term for long term gain.
I don't think this requires that AGI creators will be total idiots, though insufficiently cautious seems likely even before accounting for the unilateralist's curse. What I suspect is that most AGI creators will only make serious attempts to address failure modes that have strong empirical evidence for occurring. Slow takeoff will not result in the accrual of evidence for issues that cause an AI to become deceptive until it can seize power.
I think we've reached convergence. Whether that valley of corrigibility is likely to be large enough is all we disagree on AFAICT. I think that will depend on exactly the AGI architecture, and how wisely the creators instruct it.
I think there's a good chance we'll get first AGI that's a language or foundation model cognitive architecture (or an agent with some scaffolding and other cognitive subsystems to work alongside the LLM). Such an agent would get its core objectives from prompting, and its decision-making would be algorithmic. That's a large influence compared to the occasional intrusion of a Waluigi villainous simulacrum or other occasional random badness. More on that in Capabilities and alignment of LLM cognitive architectures and Internal independent review for language model agent alignment. Failing that, I think an actor-critic RL agent of some sort is pretty likely; I think this [Plan for mediocre alignment of brain-like \[model-based RL\] AGI](https://www.alignmentforum.org/posts/Hi7zurzkCog336EC2/plan-for-mediocre-alignment-of-brain-like-model-based-rl-agi) is pretty likely to put us far enough into that instruction-following attractor.
If the first AGI is something totally different, like an emergent agent directly from an LLM, I have no real bet on our odds.
I'm curious what you mean by "anti-natural." You write:
> Importantly, that is the aspect of corrigibility that is anti-natural, meaning that it can’t be straightforwardly captured in a ranking of end states.
My understanding of anti-naturality used to resemble this, before I had an in-depth conversation with Nate Soares and updated to see anti-naturality to be more like "opposed to instrumental convergence." My understanding is plausibly still confused and I'm not trying to be authoritative here.
If you mean "not straightforwardly captured in a ranking of end states" what does "straightforwardly" do in that definition?
Follow-up question, assuming anti-natural goals are those "not straightforwardly captured in a ranking of end states": Suppose I have a gridworld and I want to train an AI to avoid walking within 5 spaces (Manhattan distance) of a flag, and to (less importantly) eat all the apples in a level. Is this goal anti-natural? I can't think of any way to reflect it as a straightforward ranking of end states, since it involves tracking historical facts rather than end-state facts. My guess is that it's pretty easy to build an agent that does this (via ML/RL approaches or just plain programming). Do you agree? If this goal is anti-natural, why is the anti-naturality a problem or otherwise noteworthy?
Great questions!
When I say straightforwardly, I mean when using end states that only include the information available at that time. If we define the end state to also include the history that led to it, then there exists a set of preferences over them that ranks all end states with histories that include manipulation below the ones that don't. The issue, of course, is that we don't know how to specify all the types of manipulation that a superintelligent AI could conceive of.
The gridworld example is a great demonstration of this, because while we can't reflect the preferences as a ranking of just the end states, the environment is simple enough that you can specify all the paths you don't want to take to them. I don't think it really matters whether you call that "anti-naturality that can be overcome with brute force in a simple environment" or just "not anti-naturality".
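To make the point concrete, here is a minimal sketch (hypothetical code, not from the thread) of the gridworld goal: the return is computed over whole trajectories, tracking the historical fact of whether the agent ever came within 5 Manhattan distance of the flag, so it cannot be recovered from the final state alone.

```python
# Trajectory-scored gridworld goal: flag avoidance dominates, apples are secondary.
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def trajectory_return(trajectory, flag, apples, radius=5):
    """trajectory: list of (x, y) cells visited; apples: set of (x, y) cells."""
    eaten = set()
    for pos in trajectory:
        if manhattan(pos, flag) <= radius:
            return -100.0  # dominant penalty: the constraint was violated
        if pos in apples:
            eaten.add(pos)
    return float(len(eaten))  # secondary objective: number of apples eaten

flag, apples = (10, 0), {(1, 0)}
path_safe = [(0, 0), (1, 0), (2, 0)]  # stays far from the flag, eats an apple
path_bad = [(0, 0), (7, 0), (2, 0)]   # same endpoint, but strays near the flag
```

The two paths end in the same cell yet score differently (1.0 vs. -100.0), which is why no ranking over bare end states reproduces the preference, while a ranking over end-states-with-histories does.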
Cool. Thanks for the clarification. I think what you call "anti-naturality" you should be calling "non-end-state consequentialism," but I'm not very interested in linguistic turf-wars.
It seems to me that while the gridworld is very simple, the ability to train agents to optimize for historical facts is not restricted to simple environments. For example, I think one can train an AI to cause a robot to do backflips by rewarding it every time it completes a backflip. In this context the environment and goal are significantly more complex[1] than the gridworld and cannot be solved by brute-force. But number of backflips performed is certainly not something that can be measured at any given timeslice, including the "end-state."
If caring about historical facts is easy and common, why is it important to split this off and distinguish it?
Though admittedly this situation is still selected for being simple enough to reason about. If needed I believe this point holds through AGI-level complexity, but things tend to get more muddled as things get more complex, and I'd prefer sticking to the minimal demonstration.
The backflip example does not strike me as very complex, but the crucial difference and the answer to your question is that training procedures do not teach a robot to do every kind of backflip, just a subset. This is important because when we reverse it, we want non-manipulation to cover the entire set of manipulations. I think it's probably feasible to have AI not manipulate us using one particular type of manipulation.
On a separate note, could you clarify what you mean by "anti-natural"? I'll keep in mind your previous caveat that it's not definitive.
Sure, let's talk about anti-naturality. I wrote some about my perspective on it here: https://www.alignmentforum.org/s/KfCjeconYRdFbMxsy/p/3HMh7ES4ACpeDKtsW#_Anti_Naturality__and_Hardness
More directly, I would say that general competence/intelligence is connected with certain ways of thinking. For example, modes of thinking that focus on tracking scarce resources and bottlenecks are generally useful. If we think about processes that select for intelligence, those processes are naturally[1] going to select these ways of thinking. Some properties we might imagine a mind having, such as only thinking locally, are the opposite of this -- if we select for them, we are fighting the intelligence gradient. To say that a goal is anti-natural means that accomplishing that goal involves learning to think in anti-natural ways, and thus training a mind to have that goal is like swimming against the current, and we should expect it to potentially break if the training process puts too much weight on competence compared to alignment. Minds with anti-natural goals are possible, but harder to produce using known methods, for the most part.
(AFAIK this is the way that Nate Soares uses the term, and I assume the way Eliezer Yudkowsky thinks about it as well, but I'm also probably missing big parts of their perspectives, and generally don't trust myself to pass their ITT.)
The term "anti-natural" is bad in that it seems to be the opposite of "natural," but is not a general opposite of natural. While I do believe that the ways-of-thinking-that-are-generally-useful are the sorts of things that naturally emerge when selecting for intelligence, there are clearly plenty of things which the word "natural" describes besides these ways of thinking. The more complete version of "anti-natural" according to me would be "anti-the-useful-cognitive-strategies-that-naturally-emerge-when-selecting-for-intelligence" but obviously we need a shorthand term, and ideally one that doesn't breed confusion.
Thanks for the clarification, I'll think more about it that way and how it relates to corrigibility.
Max Harms recently published an interesting series of posts on corrigibility, which argue that corrigibility should be the sole objective we try to give to a potentially superintelligent AI. A large installment in the series is dedicated to cataloging the properties that make up such a goal, with open questions including whether the list is exhaustive and how to trade off between the items that make it up.
I take the opposite approach to thinking about corrigibility. Rather than trying to build up a concept of corrigibility that comprehensively solves the alignment problem, I believe it is more useful to cut the concept down to a bare minimum. Make corrigibility the simplest problem it can be, and try to solve that.
In a recent blog post comparing corrigibility to deceptive alignment, I treated corrigibility simply as a lack of resistance to having goals modified, and I find it valuable to stay within that scope. Importantly, that is the aspect of corrigibility that is anti-natural, meaning that it can’t be straightforwardly captured in a ranking of end states.
Why does this definition of corrigibility matter? It’s because properties that are not anti-natural can be explicitly included in the desired utility function.
Following that note, this post is not intended as a response to Max’s work, but rather to MIRI and their 2015 paper Corrigibility. Where Max thinks the approach introduced by that paper is too narrow, I don’t find it narrow enough. In particular, I make the case that corrigibility does not require ensuring subagents and successors are corrigible, as that can better be achieved by directly modifying a model’s end goals.
Corrigibility (2015)
The Corrigibility paper lists five desiderata as proposed minimum viable requirements for a solution to corrigibility. The focus is on shut down, but I also think of it as including goal modification, as that is equivalent to being shut down and replaced with another AI.
MIRI does not present these desiderata as a definition for corrigibility, but rather as a way to ensure corrigibility while still retaining usefulness. An AI that never takes actions may be corrigible, but such a solution is no help to anyone. However, taking that bigger picture view can obscure which of those aspects define corrigibility itself, and therefore which parts of the problem are anti-natural to solve.
My argument is that the second criterion alone provides the most useful definition of corrigibility. It represents the only part of corrigibility that is anti-natural. While the other properties are largely desirable for powerful AI systems, they’re distinct attributes and can be addressed separately.
To start paring down the criteria: the fifth just states that some goal exists to be made corrigible, rather than being corrigibility itself. The first criterion is implied by the second once channels for shut down have been set up.
Property three aims at making corrigible agents useful, rather than being inherent to corrigibility. It preempts a naive strategy that incentivizes shut down by simply giving the agent high utility for doing so. However, beyond not being part of corrigibility, it also goes too far for optimal usefulness – in certain situations we would like agents to have us shut them off or modify them (some even consider this to be part of corrigibility). Weakening this desideratum to avoid incentivizing shut down as an end goal while still allowing it instrumentally would simultaneously expand the space of corrigibility solutions and increase the potential usefulness of corrigible agents.
Finally, there is property four, which ensures that any new agents created are also corrigible. I expect that not including this in the definition of corrigibility will be controversial. After all, what's the point of having an agent shut down when requested if it has already created another powerful AI that does not? Then we're back in a standard x-risk scenario, with an unaligned AI trying to take over.
I fully agree that it would be disastrous for a corrigible AI to create an incorrigible AI. But here are some other things that it would be comparably bad for a corrigible AI to do:
In each case, the action is unwanted because it kills everyone or causes irreversible damage, rather than because the AI resists shut down or modification. When incorrigible AI is the avenue by which corrigible AI kills everyone, it’s easy to think of it as a failure of corrigibility, but in fact these are all failures of safe exploration.
Separating Out Safe Exploration
The upside about thinking of creating corrigible agents as a subset of safe exploration is that it suggests a different type of solution. Rather than identifying a specific class of actions to avoid in the utility function, which might otherwise be instrumentally useful, the entire utility function can be modified to favor low-impact and reversible outcomes. That’s not to say we have a solution ready to plug in, and to the best of my knowledge there are zero AI safety researchers working on the problem [edit: apparently Carl Andreas Dybdahl gave a talk on “Natural Impact Regularization for AI Safety” at LessOnline], but safe exploration can be solved in parallel to corrigibility.
If they’re both unsolved problems, why is it important to separate out safe exploration from corrigibility? For starters, it is typically easier to make progress on two simpler problems. But more importantly, only overcoming resistance to shut down is anti-natural. Safe exploration can be directly captured in a ranking of outcomes, prioritizing end states more similar to the initial state and from which a return to the initial state is easier. We can see this difference in practice too, where humans largely resist having their values changed, but have a tendency to act overly cautious when making important decisions
A definition of corrigibility as only the lack of resistance to being shut down allows for a synthesis between two sides of the corrigibility debate. The first side argues that corrigibility may well arise by default when training an AI to want roughly what we want. Since we want a corrigible agent, the AI will try to make itself more corrigible. The other side counters that the anti-natural aspect of corrigibility makes that unlikely, and that such an agent could trade off being corrigible in the short term to maximize long-term value.
What I put forward as a compromise is that almost all aspects of what people want from corrigibility, such as Max’s list or the comments under Let’s See You Write That Corrigibility Tag, are not anti-natural. If an AI does not wish to resist modification, then wanting roughly what we want will result in it trying to add those properties to itself. However, the lack of resisting modification itself is anti-natural and will not arise without some kind of explicit solution, or at least a convincing training story.
So, what does an explicit solution look like? Should we revisit the Utility Indifference approach, which was ruled out in the Corrigibility paper largely due to concerns about creating incorrigible agents? The proposal certainly benefits from moving safe exploration to the base utility function. However, there are still a number of concerns with it, including the incentive to manipulate information mentioned in the paper, our lack of knowledge about how to implement it in current ML systems, and the fragility of a knife-edge solution.
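For readers unfamiliar with the proposal, here is a rough sketch (not the paper's exact construction) of the Utility Indifference idea: effective utility switches between a normal utility and a shutdown utility depending on the button, plus a correction term chosen so that expected utility is the same either way. That exact equality is the knife-edge fragility mentioned above.

```python
# Utility Indifference, schematically: the correction term cancels any
# expected gain or loss from the shutdown button being pressed.
def effective_utility(outcome, button_pressed, u_normal, u_shutdown,
                      expected_normal, expected_shutdown):
    correction = expected_normal - expected_shutdown  # restores indifference
    if button_pressed:
        return u_shutdown(outcome) + correction
    return u_normal(outcome)

u_normal = lambda o: o      # hypothetical payoff if allowed to continue
u_shutdown = lambda o: 0.0  # hypothetical payoff for being shut down
```

At the level of expectations (say expected_normal = 5 and expected_shutdown = 0), both branches evaluate to 5, so the agent gains nothing by influencing the button; but the correction must track the agent's own expectations exactly, which is the fragility concern.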
I see Utility Indifference less as a promising strategy in itself, and more as a sign that having an AI not resist shut down is a feasible goal. We only need a solution a little bit better than what we already have, rather than something that addresses additional problems as well. Approaches like Elliott Thornley’s Incomplete Preferences Proposal or myopia become more promising if we can solve safe exploration separately. Simplifying corrigibility down to only the anti-natural aspect keeps the focus on the hardest part of the problem and opens up the possibilities for a solution.