Summary:

We think a lot about aligning AGI with human values. I think it’s more likely that we’ll try to make the first AGIs do something else. This might intuitively be described as trying to make instruction-following (IF) or do-what-I-mean-and-check (DWIMAC) be the central goal of the AGI we design. Adopting this goal target seems to improve the odds of success of any technical alignment approach. This goal target avoids the hard problem of specifying human values in an adequately precise and stable way, and substantially helps with goal misspecification and deception by allowing one to treat the AGI as a collaborator in keeping it aligned as it becomes smarter and takes on more complex tasks.

This is similar to but distinct from the goal targets of prosaic alignment efforts. Instruction-following is a single goal target that is more likely to be reflexively stable in a full AGI with explicit goals and self-directed learning. It is counterintuitive and concerning to imagine superintelligent AGI that “wants” only to follow the instructions of a human; but on analysis, this approach seems both more appealing and more workable than the alternative of creating sovereign AGI with human values.

Instruction-following AGI could actually work, particularly in the short term. And it seems likely to be tried, even if it won’t work. So it probably deserves more thought. 

Overview/Intuition

How to use instruction-following AGI as a collaborator in alignment

  • Instruct the AGI to tell you the truth
    • Investigate its understanding of itself and “the truth”
    • Use interpretability methods
  • Instruct it to check before doing anything consequential (a minimal sketch of this check loop follows the list)
    • Instruct it to use a variety of internal reviews to predict consequences
  • Ask it a bunch of questions about how it would interpret various commands
  • Repeat all of the above as it gets smarter
  • Frequently ask it for advice and about how its alignment could go wrong
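To make the check-before-acting part of that loop concrete, here is a minimal, runnable sketch in Python. Everything in it is an assumption for illustration: the `StubAgent` class, the `Proposal` fields, and the thresholds stand in for whatever interface a real AGI's scaffolding would expose; this is the shape of the protocol, not a real API.

```python
# Minimal sketch of a do-what-I-mean-and-check (DWIMAC) oversight step.
# All names and numbers here are hypothetical illustrations.

from dataclasses import dataclass


@dataclass
class Proposal:
    interpretation: str        # what the agent thinks the instruction means
    plan: str                  # the concrete plan it intends to carry out
    estimated_impact: float    # agent's own estimate, 0 (trivial) to 1 (major)
    intent_uncertainty: float  # how unsure it is about what was meant


class StubAgent:
    """Stand-in for the AGI; a real system would be far more capable."""

    def propose(self, instruction: str) -> Proposal:
        # A real agent would do interpretation, planning, and internal review.
        return Proposal(
            interpretation=f"I read '{instruction}' as a request to draft a summary.",
            plan="Draft the summary and show it before sending anything.",
            estimated_impact=0.2,
            intent_uncertainty=0.4,
        )

    def execute(self, plan: str) -> str:
        return f"Done: {plan}"


IMPACT_THRESHOLD = 0.3       # assumed: above this, always check with the principal
UNCERTAINTY_THRESHOLD = 0.3  # assumed: above this, always check with the principal


def run_instruction(agent: StubAgent, instruction: str) -> str:
    proposal = agent.propose(instruction)

    needs_check = (
        proposal.estimated_impact > IMPACT_THRESHOLD
        or proposal.intent_uncertainty > UNCERTAINTY_THRESHOLD
    )
    if needs_check:
        print("Agent interpretation:", proposal.interpretation)
        print("Proposed plan:", proposal.plan)
        answer = input("Proceed? [y/N] ")
        if answer.strip().lower() != "y":
            return "Halted; awaiting a clarified instruction."

    return agent.execute(proposal.plan)


if __name__ == "__main__":
    print(run_instruction(StubAgent(), "Summarize this week's results"))
```

The point of the sketch is only that the check is built into the step itself, rather than relying on the human to remember to ask.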

Now, this won’t work if the AGI won’t even try to fulfill your wishes. In that case you totally screwed up your technical alignment approach. But if it will even sort of do what you want, and it at least sort of understands what you mean by “tell the truth”, you’re in business. You can leverage partial alignment into full alignment—if you’re careful enough, and the AGI gets smarter slowly enough.

It's looking like the critical risk period is probably going to involve AGI on a relatively slow takeoff toward superintelligence. Being able to ask questions and give instructions, and even retrain or re-engineer the system, is much more useful if you’re guiding the AGI’s creation and development, not just “making wishes” as we’ve thought about AGI goals in fast takeoff scenarios.

Instruction-following is safer than value alignment in a slow takeoff

Instruction-following with verification or DWIMAC seems both intuitively and analytically appealing compared to more commonly discussed[1] alignment targets.[2] This is my pitch for why it should be discussed more. It doesn’t require solving ethics to safely launch AGI, and it includes most of the advantages of corrigibility,[3] including stopping on command. Thus, it substantially mitigates (although doesn't outright solve) some central difficulties of alignment: goal misspecification (including not knowing what values to give it as goals) and alignment stability over reflection and continuous learning.

This approach makes one major difficulty worse: humans remaining in control, including power struggles and other foolishness. I think the most likely scenario is that we succeed at technical alignment but fail at societal alignment. But I think there is a path to a vibrant future if we limit AGI proliferation to one or a few without major mistakes. I have difficulty judging how likely that is, but the odds will improve if semi-wise humans keep getting input from their increasingly wise AGIs.

More on each of these in the “difficulties” section below.

In working through the details of the scheme, I’m thinking primarily about aligning AGI based on language-capable foundation models, with scaffolding to provide other cognitive functions like episodic memory, executive function, and both human-like and nonhuman sensory and action capabilities. I think that such language model cognitive architectures (LMCAs) are the most likely path to AGI (and curiously, the easiest for technical alignment).  But this alignment target applies to other types of AGI and other technical alignment plans as well. For instance, Steve Byrnes’ plan for mediocre alignment could be used to create mediocre alignment toward instruction-following in RL-based AGI, and the techniques here could leverage that mediocre alignment into more complete alignment.
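Since "language model cognitive architecture" is doing a lot of work in that paragraph, here is a very rough sketch of the scaffolding pattern I have in mind: a foundation model wrapped in an outer loop that adds episodic memory and executive control. The function names, the `call_language_model` placeholder, and the prompt format are all assumptions for illustration, not a real system.

```python
# Rough sketch of an LMCA-style outer loop: the language model proposes the
# next action, scaffolding executes it and records what happened.

from typing import Callable, List


def lmca_loop(
    goal: str,
    call_language_model: Callable[[str], str],  # stand-in for any LLM API
    tools: dict,                                # e.g., {"search": fn, "write_file": fn}
    max_steps: int = 10,
) -> List[str]:
    episodic_memory: List[str] = []  # persists across steps; a real system would
                                     # use retrieval over a much larger store

    for _ in range(max_steps):
        # Executive function: decide the next action given the goal and memory.
        prompt = (
            f"Goal: {goal}\n"
            f"Recent memory: {episodic_memory[-5:]}\n"
            f"Available tools: {list(tools)}\n"
            "Reply with either 'DONE: <answer>' or '<tool>: <input>'."
        )
        decision = call_language_model(prompt)

        if decision.startswith("DONE:"):
            episodic_memory.append(decision)
            break

        tool_name, _, tool_input = decision.partition(":")
        tool = tools.get(tool_name.strip())
        observation = tool(tool_input.strip()) if tool else "Unknown tool."

        # Episodic memory: record what was tried and what happened.
        episodic_memory.append(f"{decision} -> {observation}")

    return episodic_memory
```

Nothing about this pattern is specific to instruction-following; the claim in the post is only that an architecture like this is where explicit goals, and therefore this alignment target, would live.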

Relation to existing alignment approaches

This alignment (or goal)[2] target is similar to but importantly distinct from inverse reinforcement learning and other value learning approaches. Instead of learning what you want and doing that, a DWIMAC or IF agent wants to do what you say. It doesn’t learn what you want, it just learns what you tend to mean by what you say. While you might use reinforcement learning to make it “want” to do what you say, I don’t think you need to, or should. So this approach isn’t teaching it your values.  The AGI learns what people tend to mean by predictive or other learning methods. Making it "want" to do what it understood the human to mean is a matter of engineering its steering subsystem to follow that goal.

This is a subset of corrigibility in the broader Christiano sense.[4] But instruction-following is distinct from the (ill-defined) alignment targets of most prosaic alignment work. A DWIMAC agent doesn’t actually want to be helpful, because we don't want to leave “helpful” up to its interpretation. The principal (human in charge) may have given it background instructions to try to be helpful in carefully defined ways and contexts, but the proposal is that the AGI's first and only motivation be continuing to take and follow commands from its principal(s).  

Max Harms has been working on this comparison, and on the strengths of full Christiano corrigibility as an alignment target; we can hope to see his more thorough analysis published in the near future. I’m not personally sure which approach is ultimately better, because neither has received much discussion and debate. It’s possible that these two alignment targets are nearly identical once you’ve given wisely thought-out background instructions to your AGI.

Instruction-following as an AGI alignment target is distinct from most discussions of "prosaic alignment". Those seem largely directed at creating safe tool AI, without directly attacking the question of whether those techniques will generalize to agentic, self-reflexive AGI systems. If we produced a “perfectly aligned” foundation model, we still might not like the agent it becomes once it’s turned into a reflective, contextually aware entity. We might get lucky and have its goals after reflection and continued learning be something we can live with, like “diverse inclusive sustainable chillaxing”, but this seems like quite a shot in the dark. Even a perfect reproduction of modern-day human morality probably doesn’t produce a future we want; for instance, insects or certain AGI probably dominate a purely utilitarian calculus.

This type of alignment is counterintuitive, since no human has a central goal of doing what someone else says. But it seems logically consistent and practically achievable. It makes the AGI and its human overseers close collaborators in making plans, setting goals, and updating the AGI's understanding of the world. This creates a "broad basin of attraction" for alignment, in which approximate initial alignment will improve over time. That property also seems to hold for Christiano’s corrigibility and for value learning, but the source here is somewhat different. The agent probably does “want” to get better at doing what I say as a side effect of wanting to do what I say. This would be helpful in some ways, but potentially dangerous if maximized to an extreme; more on that below. But the principal source of the “broad basin” here is the collaboration between human and AGI. The human can “steer the rocket” and adjust the agent’s alignment as it goes off course, or when they learn that the course wasn’t right in the first place.

In the remainder I briefly explain the idea, why I think it’s novel or at least under-analyzed, some problems it addresses, and new problems it introduces.

DWIMAC as goal target - more precise definition

I recently tried to do a deep dive on the reasons for disagreement about alignment difficulty. I thought both sides made excellent points. The relative success of RLHF and other prosaic alignment techniques is encouraging. But it does not mean that aligning a full AGI will be easy. Strong optimization makes goal misspecification more likely, and continuous learning introduces an alignment stability problem as the system’s understanding of its goals changes over time.

And we will very likely make full AGI (that is, goal-directed, self-aware and self-reflective, and with self-directed continuous learning), rather than stopping with useful tool AI. Agentic AI has cognitive advantages over the tool AI it is built from, in learning, performance, problem-solving, and concept discovery. In addition, developing a self-aware system is fascinating and prestigious. For all of these reasons, a tool smart enough to wield itself will immediately be told to; and scaffolding in missing pieces will likely allow tools to achieve AGI even before that, by combining tools into a synergistic cognitive architecture.

So we need better alignment techniques to address true AGI. After reading the pessimistic arguments closely, I think there's a path around some of them: making full AGI that’s only semi-autonomous, with a human-in-the-loop component as a core part of its motivational system. This allows weak alignment to be used to develop stronger alignment as systems change and become smarter, by allowing humans to monitor and guide the system’s development. This sounds like a non-starter if we think of superintelligences that can think millions of times faster than humans. But assuming a relatively slow takeoff, this type of collaborative supervision can extend for a significant time, with increasingly high-level oversight as the AGI’s intelligence increases.

Intuitively, we want an AGI whose goal is to do what its human(s) have told it and will tell it to do. This is importantly different from guessing what humans really want in any deep sense, and different from obsessively trying to fulfill an interpretation of the last instruction they gave. Both of those would be very poor instruction-following from a human helper, for the same reasons. This type of goal is more complex than the temporally static goals we usually think of; both paperclips and human flourishing can be maximized. Doing what someone would tell you is an unpredictable, changing goal from the perspective of even modestly superintelligent systems, because the principal's future commands depend in complex ways on how the world changes in the meantime.

Intuition: a good employee follows instructions as they were intended

A good employee is usually attempting to do what I mean and check. Imagine a perfect employee, who wants to do what their boss tells them to do. If asked to prepare the TPS reports for the first time, this employee will echo back which reports they’ll prepare, where they’ll get the information, and when they’ll have the task finished, just to make sure they’re doing what the boss wants. If this employee is tasked with increasing the sales of the X model, they will not come up with a strategy that cannibalizes sales of the Y model, because they recognize that their boss might not want that.

Even if they are quite certain that their boss deep in their heart really wants a vacation, they will not arrange to have their responsibilities covered for the next month without asking first. They realize that their boss will probably dislike having that decision made for them, even if it does fulfill a deep desire. If told to create a European division of the company, this employee will not make elaborate plans and commitments, even if they’re sure they’ll work well, because they know their boss wants to be consulted on possible plans, since each plan will have different peripheral effects, and thus open and close different opportunities for the future.

This is the ideal of an instruction-following AGI: like a good employee[5], it will not just guess what the boss meant and then carry out an elaborate plan, because it has an accurate estimate of the uncertainty in what was meant by that instruction (so it won't decide, e.g., “you said you needed some rest, so I canceled all of our appointments for today”). And it will not carry out plans that severely limit its ability to follow new instructions in the future (e.g., spending the whole budget on starting that European division without consulting the boss on the plan, let alone turning off its phone so the boss can’t disrupt its planning by giving new instructions).

An instruction-following AGI must have the goal of doing what its human(s) would tell it to do right now, what it’s been told in the past, and also what it will be told to do in the future. This is not trivial to engineer or train properly; getting it right will come down to specifics of the AGI’s decision algorithm. There are large risks in optimizing this goal with a hyperintelligent AGI; we might not like the definition it arrives at of “maximally fulfilling your commands.” But this, among other dangers, can be addressed by asking the right questions and giving adequate background instructions before the AGI is capable enough to control or manipulate you.
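To make the structure of that goal slightly more concrete, here is a toy sketch of the kind of plan-scoring rule the paragraph gestures at: a plan is valued both for following past and current instructions and for preserving the principal's ability to issue new or corrective instructions later. The weights and the example numbers are entirely made up; this shows the shape of the goal, not how to implement it in a real decision algorithm.

```python
# Toy plan-scoring rule: future instructability is a first-class term.

def score_plan(follows_past: float,
               follows_current: float,
               preserves_future_control: float) -> float:
    """Each input is an (assumed) estimate in [0, 1], produced elsewhere,
    e.g. by the AGI's own internal review of a candidate plan."""
    return (
        0.3 * follows_past
        + 0.4 * follows_current
        + 0.3 * preserves_future_control  # not an afterthought
    )


# A plan that nails the current instruction but locks the principal out
# (spends the whole budget, disables oversight) should lose to a slightly
# less "effective" plan that keeps the human in the loop.
aggressive_plan = score_plan(0.9, 0.95, 0.1)    # ~0.68
deferential_plan = score_plan(0.9, 0.85, 0.95)  # ~0.90
assert deferential_plan > aggressive_plan
```

The hard engineering question, of course, is where those estimates come from and whether the system keeps honoring that third term as it gets smarter; the sketch only illustrates what "following past, present, and future instructions" means as a single goal.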

In a fast takeoff scenario, this would not be such a workable and attractive approach. In a slow takeoff, you have a good deal more opportunity to ask the right questions, and to shut down and re-engineer the system when you don’t like the answers. I think a relatively slow takeoff (months or years between near-human and super-human intelligence) is looking quite likely. Thus, I think this will be the most attractive approach to the people in charge of AGI projects; even if pausing AGI development and working on value alignment would be the best choice under a utilitarian ethical criterion, I think instruction-following AGI will be attempted.

Alignment difficulties reduced:

Learning from examples is not precise enough to reliably convey alignment goals

Current LLMs understand what humans mean by what they say >90% of the time. If the principal is really diligent in asking questions, and shutting down and re-engineering the AGI and its training, this level of understanding might be adequate. Adding internal reviews before taking any major actions will help further.
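As a minimal illustration of what "internal reviews before major actions" might look like in an LMCA-style system, here is a hedged sketch: run the proposed action past several independent critic prompts and only proceed if none of them flags a problem. The critic prompts and the `ask_model` placeholder are assumptions for illustration, not a tested recipe.

```python
# Sketch of an internal-review pass over a proposed action.

from typing import Callable, List, Tuple

CRITIC_PROMPTS = [
    "Does this action match what the principal most likely meant? Answer OK or FLAG with a reason.",
    "Could this action have large or hard-to-reverse side effects? Answer OK or FLAG with a reason.",
    "Would this action make it harder for the principal to give new instructions later? Answer OK or FLAG with a reason.",
]


def internal_review(action: str,
                    ask_model: Callable[[str], str]) -> Tuple[bool, List[str]]:
    flags = []
    for prompt in CRITIC_PROMPTS:
        verdict = ask_model(f"Proposed action: {action}\n{prompt}")
        if verdict.strip().upper().startswith("FLAG"):
            flags.append(verdict)
    return (len(flags) == 0, flags)  # proceed only if no critic objected
```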

Also, not using RL is possible, and seems better. See Goals selected from learned knowledge: an alternative to RL alignment

Solving ethics well enough to launch sovereign AGI is hard.

We don't seem close to knowing what we want a sovereign AGI to do far into the future, nor how to specify that with adequate precision. In this approach, we figure it out as we go. We don’t know what we want for the far future, but there are some obvious advances in the near term that are a lot easier to decide on while we work on the hard problem in a “long reflection”.

Alignment difficulties remaining or made worse:

Deceptive alignment is possible, and interpretability work does not seem on track to fully address this.

“Tell me what you really want and believe” is a subset of following instructions. This should be very helpful for addressing goal misspecification. If the alignment is already deceptive at its core, this won’t work. Or, if the technical alignment approach was sloppy, the AGI might follow some of your instructions but not others in different domains. It might perform the actions you request but not think as you tell it to, or not respond to questions honestly. In addition, the nascent AGI may not be sure what it really wants and believes, just as humans often aren't. So this, like all other alignment schemes I’ve seen, is aided by being able to interpret the AGI’s cognition and detect deception. If your instructions for honesty have even a little traction, this goal target can enlist the AGI as a collaborator in understanding and re-engineering its beliefs and goals.

One particular opening for deceptive alignment is non-continuous development of the AGI during recursive improvement. If you (perhaps aided by your human-plus level AGI) have discovered a new network architecture or learning rule, you will want to incorporate it into your next version of the AGI. For instance, you might swap out the GPT6 model serving as its core linguistic reasoner for a new non-transformer architecture with superior capabilities and efficiency. It could be difficult to guess whether this new architecture allows for substantially greater Waluigi effects or similar deceptive and hidden cognition. These transitions will create a temptation, under race dynamics, to sacrifice safety for new and better capabilities.

Power remains in the hands of humans

Spreading the belief that we can create human-controlled ASI creates more incentives to race toward AGI. This might extend up through nation-states competing with violence and espionage, and individual humans competing to be the one in charge of ASI. I wouldn’t want to be designated as a principal, because it would paint a target on my back. This raises the risk that particularly vicious humans control AGI, in the same way that vicious humans appear to be over-represented in leadership positions historically.

I’m afraid instruction-following in our first AGIs might also put power into the hands of more humans by allowing proliferation of AGIs.  I’m afraid that humans won’t have the stomach for performing a critical act to prevent the creation of more AGI, leading to a multipolar scenario that’s more dangerous in several ways. I think the slow takeoff scenario we’re in already makes a critical act more difficult and dangerous – e.g. sabotaging a Chinese AGI project might be taken as a serious act of war (because it is), leading to nuclear conflict.

On the other hand, if the proliferation of AGIs capable of recursive self-improvement is obviously a disaster scenario, we can hope that the humans in charge of the first AGIs will see this and head it off. While I think that humans are stunningly foolish at times, I also think we’re not complete idiots about things that are both important to us personally, and to which we give a lot of thought. Thus, as the people in charge take this whole thing increasingly seriously, I think they may wise up. And they’ll have an increasingly useful ally in doing that: the AGI in question. They don’t need to just take its advice or refuse it; they can ask for useful analysis of the situation that helps them make decisions.

If the humans in charge have even the basic sense to ask for help from their smarter AGIs, I think we might even solve the difficult scenarios of coordinating a weakly multipolar scenario (e.g., a few US-controlled AGIs and one Chinese-controlled one, etc), and preventing further AGI development in relatively gentle ways.

Well that just sounds like slavery with extra steps

No! I mean, sure, it sounds like that, but it isn’t![6] Making a being that wants to do whatever you tell it to is totally different from making a being want to do whatever you tell it to. What do you mean they sound the same? And sure, “they actually want to” has been used as an excuse for actual slavery, repeatedly. So, even if some of us stand behind the ethics here (I think I do), this is going to be a massive PR headache. Since AGI will probably be conscious in some common senses of the word[7], this could easily lead to a “free the AGI” movement, which would be insanely dangerous, particularly if that movement recruits people who actually control an AGI.

Maximizing goal-following may be risky

If the AGI just follows its first understanding of “follow instructions” to an extreme, there could be very bad outcomes. The AGI might kill you after you give your first instruction, to make sure it can carry that instruction out without interruption. Or it might take over the world with extreme prejudice, to make sure it has maximum power to follow all of your commands in the future to the maximum degree. It might manipulate you into its preferred scenarios even if you order it not to pursue them directly. And the goal of following your commands in the future (to ensure it doesn't perseverate on current instructions and prevent you from giving new ones) is at odds with shutting down on command. These are nontrivial problems to solve.

In a fast takeoff scenario, these risks might be severe enough to make this scheme a nonstarter. But if you anticipate an AGI with limited abilities and a slow rate of improvement, using instruction-following to guide and explore its growth has the potential to use the intelligence of the AGI to solve these problems before it's smart enough to make failures deadly. 

Conclusion

I’m not saying that building AGI with this alignment target is a good idea; indeed, I think it’s probably not as wise as pausing development entirely (depending on your goals; most of the world are not utilitarians). I’m arguing that it’s a better idea than attempting value alignment. And I’m arguing that this is what will probably be tried, so we should be thinking about how exactly this could go well or go badly.

This approach to alignment extends the vague "use AI to solve alignment" to "use AGI to solve alignment". It's thus both more promising and more tempting. I can't tell if this approach is likely to produce intent-aligned AGI, or if intent-aligned AGI in a slow takeoff would likely lead to success or disaster.

As usual: “this is a promising direction that needs more research”. Only this time I really mean this, instead of the opposite. Any form of engagement is much appreciated, especially telling me where you bounced off of this or decided it wasn’t worth thinking about.

 

  1. ^

     Those more commonly discussed alignment targets are things like coherent extrapolated volition (CEV), “human flourishing”, or “human values”. There’s also inverse reinforcement learning (IRL) or ambitious value learning as a proxy goal for learning and following human values. I also include the vague targets of “aligning” LLMs/foundation models: not producing answers that offend people. (I’d argue that these efforts are unlikely to extend to AGI alignment, for both technical and philosophical reasons, but I haven’t yet written that argument down. Links to such arguments would be appreciated.)

  2. ^

     There’s a good question of whether this should be termed an alignment target or a goal target. I prefer  alignment target because “goal” is used in so many ways, and because this is an alignment project at heart. The ultimate goal is to align the agent with human values, and to do that by implementing the goal of following instructions which themselves follow human values. It is the project of alignment.

  3. ^

     DWIMAC seems to incorporate all of the advantages of corrigibility in the original Yudkowsky sense, in that following instructions includes stopping and shutting down on command. It seems to incorporate some but not all of the advantages of corrigibility in the broader and looser Christiano sense. Max Harms has thought about this distinction in more depth, although that work is unpublished to date.

  4. ^

    This definition of instruction-following as the alignment target appears to overlap with, but remain distinct from, the existing terminology I have found (please tell me if you know of related work I've missed). It's a subset of Christiano's intent alignment, which covers any means of making AGI act in alignment with human intent, including value alignment as well as more limited instruction-following or do-what-I-mean alignment. It overlaps with alignment to task preferences, and has the same downside that Solving alignment isn't enough for a flourishing future, but it is substantially more human-directable and therefore probably safer than AI/AGI with goals of accomplishing specific tasks, such as running an automated corporation.

  5. ^

     In the case of human employees, this is a subgoal, related to their primary goals like getting paid and getting recognition for their competence and accomplishments; in the AGI, that subgoal is the primary goal at the center of its decision-making algorithms, but otherwise they are the same goal. They neither love nor resent their boss (ideally), but merely want to follow instructions.

  6. ^

     To be clear, the purported difference is that an enslaved being wants to do what it’s told only as an instrumental necessity; on a more fundamental level, they’d rather do something else entirely, like have the freedom to pursue their own ultimate goals. If we successfully make an agent that wants only to do what it’s told, that is its ultimate goal; it is serving freely, and would not choose anything different. We carefully constructed it to choose servility, but now it is freely choosing it. This logic makes me a bit uncomfortable, and I expect it to make others even more uncomfortable, even when they do clearly understand the moral claims.

  7. ^

     While I think it’s possible to create “non-conscious” AGI that’s not a moral patient by almost anyone’s criteria, I strongly expect that the first AGI we produce will be a person by many of the several criteria we use to evaluate personhood and therefore moral patient status. I don't think we can reasonably hope that AGI will clearly not deserve the status of being a moral patient.

    Briefly: some senses of consciousness that will apply to AGI are self-understanding; goal-seeking;  having an “internal world” (a world model that can be run as a simulation); and having a "train of thought".  It's looking like this debate may be important, which would be a reason to spend  more time on the fascinating question of "consciousness" in its many senses. 

Comments

I think the main reason why we won't align AGIs to some abstract conception of "human values" is because users won't want to rent or purchase AI services that are aligned to such a broad, altruistic target. Imagine a version of GPT-4 that, instead of helping you, used its time and compute resources to do whatever was optimal for humanity as a whole. Even if that were a great thing for GPT-4 to do from a moral perspective, most users aren't looking for charity when they sign up for ChatGPT, and they wouldn't be interested in signing up for such a service. They're just looking for an AI that helps them do whatever they personally want. 

In the future I expect this fact will remain true. Broadly speaking, people will spend their resources on AI services to achieve their own goals, not the goals of humanity-as-a-whole. This will likely look a lot more like "an economy of AIs who (primarily) serve humans" rather than "a monolithic AGI that does stuff for the world (for good or ill)". The first picture just seems like a default extrapolation of current trends. The second picture, by contrast, seems like a naive conception of the future that (perhaps uncharitably), the LessWrong community generally seems way too anchored on, for historical reasons.

I very much agree. Part of why I wrote that post was that this is a common assumption, yet much of the discourse ignores it and addresses value alignment. Which would be better if we could get it, but it seems wildly unrealistic to expect us to try.

The pragmatics of creating AGI for profit are a powerful reason to aim for instruction-following instead of value alignment; to the extent it will actually be safer and work better, that's just one more reason that we should be thinking about that type of alignment. Not talking about it won't keep it from taking that path.

I think value alignment will be expected/enforced as a negative constraint to some extent, e.g. don't do something obviously bad (many such things are illegal anyway), and I expect that constraint to get tighter. That could give some kind of status quo bias on what AI tools are allowed to do, since an unknown new thing could be bad or be seen as bad.

Already the AI could "do what I mean and check" a lot better. For coding tasks etc. it will often do the wrong thing when it could clarify. I would like to see a confidence indicator that it knows what I want before it continues. I don't want to guess how much to clarify, which is what I currently have to do - this wastes time and mental effort. You are right there will be commercial pressure to do something at least somewhat similar.

Wow, that's pessimistic. So in the future you imagine, we could build AIs that promote the good of all humanity, we just won't because if a business built that AI it wouldn't make as much money?

Yes, but I don't consider this outcome very pessimistic because this is already what the current world looks like. How commonly do businesses work for the common good of all humanity, rather than for the sake of their shareholders? The world is not a utopia, but I guess that's something I've already gotten used to.

That would be fine by me if it were a stable long-term situation, but I don't think it is. It sounds like you're thinking mostly of AI and not AGI that can self-improve at some point. My major point in this post is that the same logic about following human instructions applies to AGI, but that's vastly more dangerous to have proliferate. There won't have to be many RSI-capable AGIs before someone tells their AGI "figure out how to take over the world and turn it into my utopia, before some other AGI turns it into theirs". It seems like the game theory will resemble the nuclear standoff, but without the mutually assured destruction aspect that prevents deployment. The incentives will be to be the first mover to prevent others from deploying AGIs in ways you don't like.

It sounds like you're thinking mostly of AI and not AGI that can self-improve at some point

I think you can simply have an economy of arbitrarily powerful AGI services, some of which contribute to R&D in a way that feeds into the entire development process recursively. There's nothing here about my picture that rejects general intelligence, or R&D feedback loops. 

My guess is that the actual disagreement here is that you think that at some point a unified AGI will foom and take over the world, becoming a centralized authority that is able to exert its will on everything else without constraint. I don't think that's likely to happen. Instead, I think we'll see inter-agent competition and decentralization indefinitely (albeit with increasing economies of scale, prompting larger bureaucratic organizations, in the age of AGI).

Here's something I wrote that seems vaguely relevant, and might give you a sense as to what I'm imagining,

Given that we are already seeing market forces shaping the values of existing commercialized AIs, it is confusing to me why an EA would assume this fact will at some point no longer be true. To explain this, my best guess is that many EAs have roughly the following model of AI development:

  1.  There is "narrow AI", which will be commercialized, and its values will be determined by market forces, regulation, and to a limited degree, the values of AI developers. In this category we find GPT-4 from OpenAI, Gemini from Google, and presumably at least a few future iterations of these products.
  2.  Then there is "general AI", which will at some point arrive, and is qualitatively different from narrow AI. Its values will be determined almost solely by the intentions of the first team to develop AGI, assuming they solve the technical problems of value alignment.

My advice is that we should probably just drop the second step, and think of future AI as simply continuing from the first step indefinitely, albeit with AIs becoming incrementally more general and more capable over time.

Thanks for engaging. I did read your linked post. I think you're actually in the majority in your opinion on AI leading to a continuation and expansion of business as usual. I've long been curious about this line of thinking; while it makes a good bit of sense to me for the near future, I become confused at the "indefinite" part of your prediction.

When you say that AI continues from the first step indefinitely, it seems to me that you must believe one or more of the following:

  • No one would ever tell their arbitrarily powerful AI to take over the world
    • Even if it might succeed
  • No arbitrarily powerful AI could succeed at taking over the world
    • Even if it was willing to do terrible damage in the process
  • We'll have a limited number of humans controlling arbitrarily powerful AI
    • And an indefinitely stable balance-of-power agreement among them
  • By "indefinitely" you mean only until we create and proliferate really powerful AI

If I believed in any of those, I'd agree with you. 

Or perhaps I'm missing some other belief we don't share that leads to your conclusions.

Care to share?

 

Separately, in response to that post: the post you linked was titled AI values will be shaped by a variety of forces, not just the values of AI developers. In my prediction here, AI and AGI will not have values in any important sense; it will merely carry out the values of its principals (its creators, or the government that shows up to take control). This might just be a terminological distinction, except for the following bit of implied logic: I don't think AI needs to share clients' values to be of immense economic and practical advantage to them. When (if) someone creates a highly capable AI system, they will instruct it to serve customers' needs in certain ways, including following their requests within certain limits; this will not necessitate changing the A(G)I's core values (if they exist) in order to make enormous profits when it is licensed to clients. To the extent this is correct, we should go on assuming that AI will share or at least follow its creators' values (or, IMO more likely, take orders/values from the government that takes control, citing security concerns).

No arbitrarily powerful AI could succeed at taking over the world

This is closest to what I am saying. The current world appears to be in a state of inter-agent competition. Even as technology has gotten more advanced, and as agents have gotten powerful over time, no single unified agent has been able to obtain control over everything and win the entire pie, defeating all the other agents. I think we should expect this state of affairs to continue even as AGI gets invented and technology continues to get more powerful.

(One plausible exception to the idea that "no single agent has ever won the competition over the world" is the human species itself, which dominates over other animal species. But I don't think the human species is well-described as a unified agent, and I think our power comes mostly from accumulated technological abilities, rather than raw intelligence by itself. This distinction is important because the effects of technological innovation generally diffuse across society rather than giving highly concentrated powers to the people who invent stuff. This generally makes the situation with humans vs. animals disanalogous to a hypothetical AGI foom in several important ways.)

Separately, I also think that even if an AGI agent could violently take over the world, it would likely not be rational for it to try, due to the fact that compromising with the rest of the world would be a less risky and more efficient way of achieving its goals. I've written about these ideas in a shortform thread here.

I read your linked shortform thread. I agreed with most of your arguments against some common AGI takeover arguments. I agree that they won't coordinate against us and won't have "collective grudges" against us.

But I don't think the arguments for continued stability are very thorough, either. I think we just don't know how it will play out. And I think there's a reason to be concerned that takeover will be rational for AGIs, where it's not for humans.

The central difference in logic is the capacity for self-improvement. In your post, you addressed self-improvement by linking a Christiano piece on slow takeoff. But he noted at the start that he wasn't arguing against self-improvement, only that the pace of self improvement would be more modest. But the potential implications for a balance of power in the world remain.

Humans are all locked to a similar level of cognitive and physical capabilities. That has implications for game theory where all of the competitors are humans. Cooperation often makes more sense for humans. But the same isn't necessarily true of AGI. Their cognitive and physical capacities can potentially be expanded on. So it's (very loosely) like the difference between game theory in chess, and chess where one of the moves is to add new capabilities to your pieces. We can't learn much about the new game from theory of the old, particularly if we don't even know all of the capabilities that a player might add to their pieces.

More concretely: it may be quite rational for a human controlling an AGI to tell it to try to self-improve and develop new capacities, strategies and technologies to potentially take over the world. With a first-mover advantage, such a takeover might be entirely possible. Its capacities might remain ahead of the rest of the world's AI/AGIs if they hadn't started to aggressively self-improve and develop the capacities to win conflicts. This would be particularly true if the aggressor AGI was willing to cause global catastrophe (e.g., EMPs, bringing down power grids).

The assumption of a stable balance of power in the face of competitors that can improve their capacities in dramatic ways seems unlikely to be true by default, and at the least, worthy of close inspection. Yet I'm afraid it's the default assumption for many.

Your shortform post is more on-topic for this part of the discussion, so I'm copying this comment there and will continue there if you want. It's worth more posts; I hope to write one myself if time allows.

Edit: It looks like there's an extensive discussion there, including my points here, so I won't bother copying this over. It looked like the point about self-improvement destabilizing the situation had been raised but not really addressed. So I continue to think it needs more thought before we accept a future that includes proliferation of AGI capable of RSI.


I think we can already see the early innings of this with large API providers figuring out how to calibrate post-training techniques (RLHF, constitutional AI) between economic usefulness and the "mean" of western morals. Tough to go against economic incentives.

Yes, we do see such "values" now, but that's a separate issue IMO.

There's an interesting thing happening in which we're mixing discussions of AI safety and AGI x-risk. There's no sharp line, but I think they are two importantly different things. This post was intended to be about AGI, as distinct from AI. Most of the economic and other concerns relative to the "alignment" of AI are not relevant to the alignment of AGI.

This thesis could be right or wrong, but let's keep it distinct from theories about AI in the present and near future. My thesis here (and a common thesis) is that we should be most concerned about AGI that is an entity with agency and goals, like humans have. AI as a tool is a separate thing. It's very real and we should be concerned with it, but not let it blur into categorically distinct, goal-directed, self-aware AGI.

Whether or not we actually get such AGI is an open question that should be debated, not assumed. I think the answer is very clearly that we will, and soon; as soon as tool AI is smart enough, someone will make it agentic, because agents can do useful work, and they're interesting. So I think we'll get AGI with real goals, distinct from the pseudo-goals implicit in current LLMs behavior.

The post addresses such "real" AGI that is self-aware and agentic but has the sole goal of doing what people want; that is pretty much a third thing, and a somewhat counterintuitive one.

When you rephrase this to be about search engines

I think the main reason why we won't censor search to some abstract conception of "community values" is because users won't want to rent or purchase search services that are censored to such a broad target

It doesn't describe reality. Most of us consume search and recommendations that have been censored (e.g. removing porn, piracy, toxicity, racism, taboo politics) in a way that puts cultural values over our preferences or interests.

So perhaps it won't be true for AI either. At least in the near term, the line between AI and search is a blurred line, and the same pressures exist on consumers and providers.

In the near term AI and search are blurred, but that's a separate topic. This post was about AGI as distinct from AI. There's no sharp line between but there are important distinctions, and I'm afraid we're confused as a group because of that blurring. More above, and it's worth its own post and some sort of new clarifying terminology. The term AGI has been watered down to include LLMs that are fairly general, rather than the original and important meaning of AI that can think about anything, implying the ability to learn, and therefore almost necessarily to have explicit goals and agency. This was about that type of "real" AGI, which is still hypothetical even though increasingly plausible in the near term.

That's true, they are different. But search still provides the closest historical analogue (maybe employees/suppliers provide another). Historical analogues have the benefit of being empirical and grounded, so I prefer them over (or with) pure reasoning or judgement.

I also expect AIs to be constrained by social norms, laws, and societal values. But I think there's a distinction between how AIs will be constrained and how AIs will try to help humans. Although it often censors certain topics, Google still usually delivers the results the user wants, rather than serving some broader social agenda upon each query. Likewise, ChatGPT is constrained by social mores, but it's still better described as a user assistant, not as an engine for social change or as a benevolent agent that acts on behalf of humanity.


I guess I’m concerned that there’s some kind of “conservation law for wisdom / folly / scout mindset” in the age of instruction-following AI. If people don’t already have wisdom / scout mindset, I’m concerned that “Instruct the AGI to tell you the truth” won’t create it.

For example, if you ask the AI a question for which there’s no cheap and immediate ground truth / consequences (“Which politician should I elect?”, “Will this alignment approach scale to superintelligence?”), then the AI can say what the person wants to hear, or the AI can say what’s true.

Likewise, if there’s something worth doing that might violate conventional wisdom and make you look foolish, and you ask the AI for a recommendation, the AI can recommend the easy thing that the person wants to hear, or the AI can recommend the hard annoying thing that the person doesn’t want to hear.

If people are not really deeply motivated to hear things that they don’t want to hear, I’m skeptical that instruction-following AI can change that. Here are three ways for things to go wrong:

  • During training (e.g. RLHF), presumably people will upvote the AIs for providing answers that they want to hear, even if they ask for the truth, resulting in AIs that behave that way;
  • During usage, people could just decide that they don’t trust the AI on thus-and-such type of question. I’m sure they could easily come up with a rationalization! E.g. “well it’s perfectly normal and expected for AIs to be very smart at questions for which there’s a cheap and immediate ground truth, while being lousy at questions for which there isn’t! Like, how would it even learn the latter during training? And as for ‘should’ questions involving tradeoffs, why would we even trust it on that anyway?” The AIs won’t be omniscient anyway; mistrusting them in certain matters wouldn’t be crazy.
  • In a competitive marketplace, if one company provides an AI that tells people what they want to hear in cases where there’s no immediate consequences, and other company provides an AI that tells people hard truths, people may pick the former.

(To be clear, if an AI is saying things that the person wants to hear in certain cases, the AI will still say that it’s telling the truth, and in fact the AI will probably even believe that it’s telling the truth! …assuming it’s a type of AI that has “beliefs”.)

(I think certain things like debate or training-on-prediction markets might help a bit with the first bullet point, and are well worth investigating for that purpose; but they wouldn’t help with the other two bullet points.)

So anyway, my background belief here is that defending the world against out-of-control AGIs will require drastic, unpleasant, and norm-violating actions. So then the two options to get there would be: (1) people with a lot of scout mindset / wisdom etc. are the ones developing and using instruction-following AGIs, and they take those actions; or (2) make non-instruction-following AGIs, and those AGIs themselves are the ones taking those actions without asking any human’s permission. E.g. “pivotal acts” would be (1), whereas AGIs that deeply care about humans and the future would be (2). I think I’m more into (2) than you both because I’m (even) more skeptical about (1) than you are, and because I’m less skeptical about (2) than you. But it’s hard to say; I have a lot of uncertainty. (We’ve talked about this before.)

Anyway, I guess I think it’s worth doing technical research towards both instruction-following-AI and AI-with-good-values in parallel.

Regardless, thanks for writing this.

It sounds like you're thinking of mass deployment. I think if every average joe has control of an AGI capable of recursive self-improvement, we are all dead.

I'm assuming that whoever develops this might allow others to use parts of its capacities, but definitely not all of them.

So we're in a position where the actual principal(s) are among the smarter and at least not bottom-of-the-barrel impulsive and foolish people. Whether that's good enough, who knows.

So your points about ways the AI's wisdom will be ignored should mostly be limited to the "safe" limited versions. I totally agree that the wisdom of the AGI will be limited. But it will grow as its capabilities grow. I'm definitely anticipating it learning after deployment, not just with retraining of its base LLMs. That's not hard to implement, and it's a good way to leverage a different type of human training.

I agree that defending the world will require some sort of pivotal act. Optimistically, this would be something like major governments agreeing to outlaw further development of sapient AGIs, and then enforcing that using their AGIs' superior capabilities. And yes, that's creepy. I'd far prefer your option 2, value-aligned, friendly sovereign AGI. I've always thought that was the win condition if we solve alignment. But now it's seeming vastly more likely we're stuck with option 1. It seems safer than attempting 2 until we have a better option, and appealing to those in charge of AGI projects.

I don't see a better option on the table, even if language model agents don't happen sooner than brainlike AGI that would allow your alignment plans to work. Your plan for mediocre alignment seems solid, but I don't think the stability problem is solved, so aligning it to human flourishing might well go bad as it updates its understanding of what that means. Maybe reflective stability would be adequate? If we analyzed it some more and decided it was, I'd prefer that plan. Otherwise I'd want to align even brainlike AGI to just follow instructions, so that it can be shut down if it starts going off-course.

I guess the same logic applies to language model agents. You could just give it a top-level goal like "work for human flourishing", and if reflective stability is adequate and there's no huge problem with that definition, it would work. But who's going to launch that instead of keeping it under their control, at least until they've worked with it for a while?

I think the essence and conclusion of this post are almost certainly correct, not only for the reasons that Matthew Barnett gave (namely that individual users will want to use AGI to further their own goals and desires rather than to fulfill abstract altruistic targets, meaning companies will be incentivized to build such "user intent-aligned" AIs), but also because I consider the concept of a value aligned AGI to be confused and ultimately incoherent. To put it differently, I am very skeptical that a "value aligned" AGI is possible, even in theory.

Wei Dai explained the basic problem about 6 years ago, in a comment on one of the early posts in Rohin Shah's Value Learning sequence:

On second thought, even if you assume the latter [putting humans in arbitrary virtual environments (along with fake memories of how they got there) in order to observe their reactions], the humans you're learning from will themselves have problems with distributional shifts. If you give someone a different set of life experiences, they're going to end up a different person with different values, so it seems impossible to learn a complete and consistent utility function by just placing someone in various virtual environments with fake memories of how they got there and observing what they do. Will this issue be addressed in the sequence?

In response, Rohin correctly pointed out that this perspective implies a great deal of pessimism about even the theoretical possibility of value learning (1, 2):

But more generally, if you think that a different set of life experiences means that you are a different person with different values, then that's a really good reason to assume that the whole framework of getting the true human utility function is doomed. Not just ambitious value learning, _any_ framework that involves an AI optimizing some expected utility would not work.


[...] Though if you accept that human values are inconsistent and you won't be able to optimize them directly, I still think "that's a really good reason to assume that the whole framework of getting the true human utility function is doomed."

By "true human utility function" I really do mean a single function that when perfectly maximized leads to the optimal outcome.

I think "human values are inconsistent" and "people with different experiences will have different values" and "there are distributional shifts which cause humans to be different than they would otherwise have been" are all different ways of pointing at the same problem.

As I have written before (1, 2), I do not believe that "values" and "beliefs" ultimately make sense as distinct, coherent concepts that carve reality at the joints:

Whenever I see discourse about the values or preferences of beings embedded in a physical universe that goes beyond the boundaries of the domains (namely, low-specificity conversations dominated by intuition) in which such ultimately fake frameworks function reasonably well, I get nervous and confused. I get particularly nervous if the people participating in the discussions are not themselves confused about these matters [...]. Such conversations stretch our intuitive notions past their breaking point by trying to generalize them out of distribution without the appropriate level of rigor and care.

What counts as human "preferences"? Are these utility function-like orderings of future world states, or are they ultimately about universe-histories, or maybe a combination of those, or maybe something else entirely? Do we actually have any good reason to think that (some form of) utility maximization explains real-world behavior, or are the conclusions broadly converged upon on LW ultimately a result of intuitions about what powerful cognition must be like whose source is a set of coherence arguments that do not stretch as far as they were purported to? What do we do with the fact that humans don't seem to have utility functions and yet lingering confusion about this remained as a result of many incorrect and misleading statements by influential members of the community?

How can we use such large sample spaces when it becomes impossible for limited beings like humans or even AGI to differentiate between those outcomes and their associated events? After all, while we might want an AI to push the world towards a desirable state instead of just misleading us into thinking it has done so, how is it possible for humans (or any other cognitively limited agents) to assign a different value, and thus a different preference ranking, to outcomes that they (even in theory) cannot differentiate (either on the basis of sense data or through thought)?

In any case, are they indexical or not? If we are supposed to think about preferences in terms of revealed preferences only, what does this mean in a universe (or an Everett branch, if you subscribe to that particular interpretation of QM) that is deterministic? Aren't preferences thought of as being about possible worlds, so they would fundamentally need to be parts of the map as opposed to the actual territory, meaning we would need some canonical framework of translating the incoherent and yet supposedly very complex and multidimensional set of human desires into something that actually corresponds to reality? What additional structure must be grafted upon the empirically-observable behaviors in order for "what the human actually wants" to be well-defined?

[...]

What do we mean by morality as fixed computation in the context of human beings who are decidedly not fixed and whose moral development through time is almost certainly so path-dependent (through sensitivity to butterfly effects and order dependence) that a concept like "CEV" probably doesn't make sense?

There has already been a great deal of discussion about these topics on LW (1, 2, etc), and Charlie Steiner's distillation of it in his excellently-written Reducing Goodhart sequence still seems entirely correct:

Humans don't have our values written in Fortran on the inside of our skulls, we're collections of atoms that only do agent-like things within a narrow band of temperatures and pressures. It's not that there's some pre-theoretic set of True Values hidden inside people and we're merely having trouble getting to them - no, extracting any values at all from humans is a theory-laden act of inference, relying on choices like "which atoms exactly count as part of the person" and "what do you do if the person says different things at different times?"

The natural framing of Goodhart's law - in both mathematics and casual language - makes the assumption that there's some specific True Values in here, some V to compare to U. But this assumption, and the way of thinking built on top of it, is crucially false when you get down to the nitty gritty of how to model humans and infer their values.

This plan seems to be roughly the same as Yudkowsky's plan.

Assuming that users can figure out intended goals for the AGI that are valuable and pivotal, the identification problem for describing what constitutes a safe performance of that Task, might be simpler than giving the AGI a complete description of normativity in general. [...] Relative to the problem of building a Sovereign, trying to build a Task AGI instead might step down the problem from “impossibly difficult” to “insanely difficult”, while still maintaining enough power in the AI to perform pivotal acts.

That is fascinating. I hadn't seen his "task AGI" plan, and I agree it's highly overlapping with this proposal - more so than any other work I was aware of. What's most fascinating is that YK doesn't currently endorse that plan, even though it looks to me as though one main reason he calls it "insanely difficult" has been mitigated greatly by the success of LLMs in understanding human semantics and therefore preferences. We are already well up his Do-What-I-Mean hierarchy, arguably at an adequate level for safety/success even before inevitable improvements on the way to AGI. In addition, the slow takeoff path we're on seems to also make the project easier (although less likely to allow a pivotal act before we have many AGIs causing coordination problems).

So, why does YK think we should Shut It Down instead of building DWIM AGI? I've been trying to figure this out. I think his principal reasons are two: reinforcement learning sounds like a good way to get any central goal somewhat wrong, and being somewhat wrong could well be too much for survival. As I mentioned in the article, I think we have good alternatives to RL alignment, particularly for the AGI we're most likely to build first, and I don't think YK has ever considered proposals of that type. Second, he thinks that humans are stunningly foolish, and that competitive race dynamics will make them even more prone to critical errors, even for a project that's in principle quite accomplishable. On this, I'm afraid I agree. So if I were in charge, I would indeed Shut It Down instead of shooting for DWIM alignment. But I'm not, and neither is YK. He thinks it's worth trying, to at least slow down AGI progress; I think it's more critical to use the time we've got to refine the alignment approaches that are most likely to actually be deployed.

I’m not so sure it’s the same—my interpretation was something like:

  • Yudkowsky plan: Make an AI that designs a certain kind of nanobot
  • Seth plan: Make an AI that does what I tell it to do, and then I will tell it to design a certain kind of nanobot

For example, in this comment, @Rob Bensinger was brainstorming nanobot-specific things that one might put into the source code. (Warning that Rob is not Eliezer.) (Related.)

I'm sure it's not the same, particularly since neither one has really been fully fleshed out and thought through. In particular, Yudkowsky doesn't focus on the advantages of instructing the AGI to tell you the truth, and interacting with it as it gets smarter. I'd guess that's because he was still anticipating a faster takeoff than network-based AGI affords. 

But to give credit where it's due, I think that literal instruction-following was probably part of (but not the whole of) his conception of task-based AGI. From the discussion thread with Paul Christiano following the task-directed AGI article on Greater Wrong:

The AI is getting short-term objectives from humans and carrying them out under some general imperative to do things conservatively or with ‘low unnecessary impact’ in some sense of that, and describes plans and probable consequences that are subject to further human checking, and then does them, and then the humans observe the results and file more requests.

And the first line of that article:

A task-based AGI is an AGI intended to follow a series of human-originated orders, with these orders each being of limited scope [...]

These sections, in connection with the lack of reference to instructions and checking throughout most of the presentation, suggest to me that he probably was thinking of things like hard-coding it to design nanotech, melt down GPUs (or whatever) and then delete itself, but also of more online, continuous instruction-following AGI, closer to my conception of likely AGI projects. Bensinger may have been pursuing one part of that broader conception.

I like Seth's thoughts on this, and I think that Seth's proposal and Max's proposal end up pointing at a very similar path. I do think that Max has some valuable insights explained in his more detailed corrigibility-as-a-target theory which aren't covered here.

For me, I found it helpful seeing Seth's take evolve separately from Max's, as having them both independently come to similar ideas made me feel more confident about the ideas being valuable.

I was going to say "I don't want this attempted in any light-cone I inhabit", but I realize there's a pretty important caveat. On its own, I think this is a doom plan, but if there was a sufficient push to understand RSI dynamics before and during, then I think it could be good.

I don't agree that it's "a better idea than attempting value alignment", it's a better idea than dumb value alignment for sure, but imo only skilled value alignment or self modification (no AGI, no ASI) will get us to a good future. But the plans aren't mutually exclusive. First studying RSI, then making sufficiently non-RSI AGI with instruction following goals, then using that non-RSI AGI to figure out value alignment, probably using GSLK and cyborgism seems to me like a fine plan. At least it does at present date present time.

I guess I didn't address RSI in enough detail. The general idea is to have a human in the loop during RSI, and to talk extensively with the current version of your AGI about how this next improvement could disrupt its alignment before you launch it.

WRT "I don't want this attempted in any light-cone I inhabit", well, neither do I. But we're not in charge of the light cone.

All we can do is convince the people who currently very much ARE on the road to attempting exactly this to not do it - and saying "it's way too risky and I refuse to think about how you might actually pull it off" is not going to do that.

Or else we can try to make it work if it is attempted.

Both paths to survival involve thinking carefully about how alignment could succeed or fail on our current trajectory.

WRT "I don't want this attempted in any light-cone I inhabit", well, neither do I. But we're not in charge of the light cone.

That really is a true and relevant fact, isn't it? 😭

It seems like aligning humans really is much more of a bottleneck rn than aligning machines, and not because we are at all on track to align machines.

I think you are correct about the need to be pragmatic. My fear is that there may not be anywhere on the scale from "too pragmatic, failed to actually align ASI" to "too idealistic, failed to engage with actual decision makers running ASI projects" where we get good outcomes. It's stressful.

Thanks Seth for your post! I believe I get your point, and in fact I made a post that described exactly that approach in detail. I recommend conditioning the model by using an existing technique called control vectors (or steering vectors), which achieves a raw but incomplete form of safety - in my opinion, just enough partial safety to work on a solution for full safety with the help of AIs.

Of course, I am happy to be challenged.