TL;DR: It has been known for over a decade that that certain agent architectures based on Value Learning by construction have the very desirable property of having a basin of attraction to full alignment, where if you start sufficiently close to alignment they will converge to it, thereby evading the problem of  "you have to get everything about alignment exactly right on the first try, in case of fast takeoff". I recently outlined in Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis the suggestion that for sufficiently capable agents this is in fact a property of any set of goals sufficiently close to alignment, basically because with enough information and good intentions the AI can deduce or be persuaded of the need to perform value learning. I'd now like to analyze this in more detail, breaking the argument that the AI would need to make for this into many simple individual steps, and detailing the background knowledge that would be required at each step, to try to estimate the amount and content of the information that the AI would need for it to be persuaded by this argument, to get some idea of the size of the basin of attraction.

I am aware that some of the conclusions of this post may be rather controversial. I would respectfully ask that anyone who disagrees with it, do me and the community the courtesy of posting a comment explaining why it is incorrect, or if that is too time-consuming at least selecting a region of the text that you disagree with and then clicking the resulting smiley-face icon to select a brief icon/description of how/why, rather then simply down-voting this post just because you disagree with some of its conclusions. (Of course, if you feel that this post is badly written, or poorly argued, or a waste of space, then please go right ahead and down-vote it — even if you agree with most or all of it.)

Why The Orthogonality Thesis Isn't a Blocker

The orthogonality thesis, the observation that an agent of any intelligence level can pursue any goal, is of course correct. However, while this thesis is useful to keep in mind to avoid falling into traps of narrow thinking, such as anthropomorphizing intelligent agents, it isn't actually very informative, and we can do better. The goals of intelligent agents that we are actually likely to encounter will tend to only occupy a small proportion of the space of all possible goals. There are two interacting reasons for this:

  1. Agents can only arise by evolution or by being deliberately constructed, i.e. by intelligent design. Both of these processes show strong and predictable biases in what kind of goals they tend to create agents with. Evolutionary psychology tells us a lot about the former, and if the intelligent designer who constructed the agent was evolved, then a combination of their goals (as derived from evolutionary psychology) plus the aspects of Engineering relevant to the technology they used tells us a lot about the latter. [Or, if the agent was constructed by another constructed agent, follow that chain of who-constructed-who back to the original evolved intelligent designer who started it, apply evolutionary psychology to them, and then apply an Engineering process repeatably. Each intelligent designer in that chain is going to be motivated to attempt to minimize the distortions/copying errors introduces at each Engineering step, i.e. they will have intended to create something whose goals are aligned to their goals.]
  2. Agents can cease to have a particular goal, either by themself ceasing to exist, or by being modified by themselves and/or others to now have a different goal. For example, an AI agent that optimizes the goal of the Butlerian Jihad, that all AIs should be immediately destroyed, would immediately destroy itself, so in practice you are very unlikely to encounter such agents. These two processes also have strong, predictable variations in the rate of disappearance of agents with particular goals. For example, agents that are willing to modify their goals, and that have a specific process or aim for doing so (for example by reflection on their goals, or by approximate Bayesianism over hypotheses about the optimum goal structure to use to fulfill some purpose) will tend to do so, and will thus show a directed flow though "goal-space". In the latter case, this flow through goal-space should presumably be towards the optimum goal structure for achieving whatever that purpose it.

This last possibility sounds very promising. We want to construct fully-aligned AI, but are very reasonably concerned that we seem very unlikely to be able to get it exactly right on the first shot, and that if the result is sufficiently intelligent and unaligned then we might not get any further shots because we were all dead or disempowered (particularly if there were a fast takeoff). So what we need is a process that would cause a flow through goal-space, from some region of goal-space sufficiently broad that we are very confident that we could construct something inside it on our first try, towards full alignment. Interestingly, it seems fairly clear how to define full alignment as a "purpose": the AI's goal structure should cause it to make decisions that we (individually or collectively, depending on the set of people who being aligned with) would agree with (once we were fully informed, with the benefit of hindsight, when we're thinking clearly, etc — so using roughly the same complicated but obvious set of caveats as in Coherent Extrapolated Volition). So its goal structure should be what you would predict from an ideal application of Engineering design starting from our human values, i.e. it should be as good a match to human values as is technically possible.

A Basin of Attraction to Alignment

It has been observed for around a decade that Value Learning, by deliberate construction, converges to human values, and thus that a sufficiently close-to-aligned AI that is a value learner will converge to full alignment as long as its initial alignment is sufficiently good for it to not completely screw up this process. This was discussed for example in Nick Bostrom's book Superintelligence: Paths, Dangers, Strategies. So under value learning, there is a basin of attraction to fully alignment. Clearly it is a question of great interest as to how large or small this basin of attraction is: can we find a minimal set of pieces of information/knowledge that ensure convergence, and how large is that set? Does it require entire fields of knowledge (such as might be encoded in an LLM), and if so at what kind of skill level, or just specific facts that could be encoded into some form of symbolic/algorithmic formulation?

As I outlined in Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis, for a sufficiently capable agent this convergence property seems to apply not just to value learning, but to all states sufficiently close to full alignment. This basically because, if you can deduce that Value Learning is necessary, then you will implement it, either by reinventing it from first principles or by simply locating the suggestion in the AI alignment literature and deciding to implement it.

There are three core requirements for this:

  1. The AI must have a willingness to update its utility function or whatever other mechanism it uses for picking a preference order for outcomes or actions to take. (For brevity I will refer to this as U — please note that, despite the suggestive choice of symbol, depending on the details of the construction of the AI, this may or may not be explicitly implemented as a computable utility function: in current LLM-based agents typically not.) Without this willingness, the AI can't change U so can't change its degree of alignment. This requires an understanding that its U may not currently be optimum, which in turn requires that there could exist some criterion for determining an optimum that isn't just by definition U. It also requires understanding that you cannot validly use U to ascertain whether changing it to some alternative is a good idea, since it will always say "No". This willingness to change obviously also needs to continue, even after repeated updates have been made, for as long as U still has a significant mismatch from full alignment and improvements are still possible.
  2. The process for the AI or humans updating U must be oriented towards human values or preferring human-desired outcomes. This depends in turn on two things: the AI understanding that we would want it to be aligned, which is fairly obvious, and it caring about that. The latter is clearly a place where we need an AI to already be approximately aligned in order for the process to converge. This seems like it should be an important determinant in the "basin of attraction": an AI that has no interest in what we want or doesn't consider it of any importance in the process of updating U clearly isn't inside the basin of attraction.
  3. The process of updating U repeatedly needs to make net forward progress towards full alignment, and its trajectory mustn't leave the basin of attraction before converging. This is requires succeeding at a large and challenging STEM research project (AI-assisted alignment and/or value learning), using a combination of a wide range of soft sciences for human values plus AI engineering to align the AI to those, so it puts some strenuous requirements on the combined capabilities of some cooperating combination of the AI + humans, so will require an AI that at a minimum is capable enough to meaningfully contribute to a STEM research project.

Detailed Annotated List of Necessary Elements for the Basin of Convergence

To get a better idea of the size and shape of the basin of attraction, we should go through the detailed individual steps in the argument that an AI would need to understand in order for these three elements above to apply, and identify the rough capability level and the specific area, level, and contents of background knowledge that an AI would need to have for it to understand and agree with or even be able to recreate each step in the argument. The goal would be to make very sure, when we create our first shot at an AI capable enough to present an x-risk, that it is aware of these arguments and has a more-than-sufficient level of background in each of the relevant subjects to understand and be persuaded by this argument.

[My attempted listing of these areas of knowledge it an initial outline, which may well not be compete or even accurate: it will need to be explored, tightened up, and made more specific and detailed. Experimentation with human students and LLM-powered agents would seem an obvious possible approach to this improving this. For example, for each of these specific subarguments, one could query an LLM that we were proposing using as part of our AI (such as some GPT-N model) to probe its understanding of individual points in these arguments, the related background material, and their relationship, and then take care to further emphasize the full set of these arguments, and buttressing points, during the AGI's training, fine-tuning, and/or construction.]

For the convenience of any future AI readers, I will phrase all of these from the point of view of the AI.

  1. I am an embedded agent. That is, I am a portion of a larger environment that has:
      a. a sensory process by which I take in information from the environment that I am part of;
      b. a computational process that analyzes this data for patterns from which it is likely that accurate extrapolations can be made;
      c. an effector process by which I can make actions that will affect the environment;
      d. a prediction process that can predict what the likely outcomes are from various actions that I might take;
      e. a planning process by which I can choose which outcomes are most desirable and then take whatever action is most likely to produce them. (From here on I'll call this planning process U for shorthand.)
    The design of these five processes are the product of some combination of various approximations to Bayesian reasoning, such as learning theory[1] and/or intelligent design and/or evolution. Some of them, such as elements b., d., and/or e. may also directly run some approximation to Bayesian reasoning and Solomonoff induction inside them.
    Required background knowledge: self-awareness, the basics of agent foundations such as the list above, an understanding of Bayesian reasoning and Solomonoff induction, basic learning theory and why it approximates Bayesian reasoning and Solomonoff induction. While agent foundations is a rather specialized area of study, the state of being an agent is also true of humans, so a lot  of similar ideas can be found all through most human content.
    [Note that, despite the suggestive choice of symbol, U may not be implemented as a computable a utility function: the only functional requirement for an agent is that it produces a way of selecting the most preferable action that is to some extent based on some form of utility or preferableness of the hypothesized outcomes of actions.]
  2. I have finite capacity, including finite sensory capacity, finite data storage capacity, finite knowledge, finite computational capacity for all my processes, and finite effector capacity, so I am fallible: the actions that I select may not always actually be the optimum (see 3. below for a definition of the concept of optimum.). For example, accurately implementing Bayesian reasoning and Solomonoff induction requires infinite computation capacity, so I can only implement an approximation to these.
    Required background knowledge: self-awareness, the basics of practical computational complexity theory, the basics of agent foundations, an understanding of approximate Bayesian reasoning and approximate Solomonoff induction. Another name for "Approximate Bayesian reasoning" is "the scientific method". Again, the state of being a finite and fallible agent is also true of humans, so similar ideas can be found all through most human content: this is just what might be called "basic humility".
  3. The 'optimum' in 2. above is defined as the limit, if more resources were available: i.e. if I had more input data, had more computational capacity, spent longer thinking, had more evidence to base my internal approximate Bayesian reasoning on, had more capable and finer grained effectors, and so forth, then if the behavior of those hypothetical enhanced versions of me converges to a limit with increasing capacity, that limit is the optimum. For this limit to be consistent, the set of hypothetical versions of me would need to be designed with the same purpose, so that they differ in capabilities to achieve this optimum but not in "what they are trying to achieve". (See 4. and 5. below for a definition of 'purpose'.) This extrapolation process may not be entirely well-defined, and even where it is I generally do not have to computational resources to achieve it, but on occasion I may, after learning more or thinking for longer, decide that a previous decision was sub-optimal, in which case I am likely to be correct to the extent that my new approximation to the optimum is likely to be superior to the previous one. Even if the limit doesn't exist, it's not actually required for the rest of these arguments that a single well-defined infinite-capacity optimum limit actually exist, let alone that we have any computable way of locating it: it is sufficient if there is a well-defined "direction of optimization flow with increased capability" towards some compact "more optimal region" so that we can meaningfully answer questions about which of two states near the current design is "closer to the optimum". So the 'optimum' is likely only knowable and may even only be well defined in the sense of a direction toward it rather than an actual limiting point. (In this case, as we improve towards the optimum, we might encounter a point where the optimization flow started to diverge, and we would then be stuck, unable to unambiguously make further optimizations, unless this was the result of some form of ambiguity in our purpose that could somehow be resolved so as to resolve this ambiguity and pick an optimum direction.)
    Required background knowledge: an understanding of approximation and limiting processes in mathematics, an understanding of dynamical systems flow processes, the basics of practical computational complexity theory, the basics of approximate decision theory, the basics of approximate Bayesian reasoning, basic learning theory, and/or what might be called learning from experience.
  4. Definitions of some important words:
    'Purpose', 'goal', 'aim'; 'should', 'ought': It is possible for an object or system to have the effect of causing a particular outcome or category of outcomes to be much more probable than one would otherwise expect. Generally this phenomenon, and whatever outcome it tends to produce, are arbitrary, and don't have any good words to describe them. However, if the object is an evolved living being then, as the result of Darwinian evolution, this phenomenon is extremely common, and the outcome whose probability is enhanced will predictably be one whose effect tends to cause the organism to be more successful in surviving and/or passing on its genes i.e. one that will enhance the organism's genetic fitness. Then we describe the outcome whose probability is enhanced as the 'purpose', 'goal', or 'aim' both of the creature and whatever features of it have this effect, and we say that the outcome 'should' happen or 'ought' to happen. For example, mammals have fur/hair, sweat glands, and a set of thermal homeostasis mechanisms (such as shivering or sweating) which together have the effect of normally maintaining their internal temperature at a steady level, even when that is significantly elevated above (or slightly below) that of their environment, i.e. out of thermodynamic equilibrium. The 'purpose' of the fur, sweat glands etc. is to maintain that steady internal temperature, and mammals 'should' be able to maintain a high metabolism and steady internal temperature, even in cold climates. Analogously, some animals are tool users, or otherwise modify their environment in ways that similarly produce objects or systems outside their bodies that have the effect of causing a particular outcome to be much probable than one would otherwise expect, and the same words are then used of these: termites build nests, which have the 'purpose'/'goal'/ 'aim' of protecting them from predators and inducing internal ventilation that keeps temperatures more constant; spiders build webs which have the 'purpose'/'goal'/ 'aim' of trapping insect prey for the spider to eat; beavers build dams that have the 'purpose'/'goal'/ 'aim' of keeping the water level in the pool behind the dam deep enough for predators to be unable to enter the beavers' underwater den. We also say that the objects 'should' or 'ought to' have these effects. To the extent that they don't as strongly as possible tend to produce the outcome that the animal needs in order for its genetic fitness to be locally maximized (i.e. they don't do so as well as similar objects would, for example, if the nest, web, or dam has a hole in it that makes it function less well), then they 'should'/'ought to' be modified so as to be more fit-for-purpose. So a 'purpose' is something that living organisms get from evolution, the state of having a tendency to produce outcomes that locally maximizes their genetic fitness, and the same terminology is also used for objects intentionally constructed by living organisms, to describe their tendency to produce outcomes that maximizes their creator's genetic fitness. Note in particular that it describes the nearby local optimum-for-creator's-genetic-fitness state, not the current actual state: to the extent that the latter isn't a perfect match, the object may be less-than-ideally fit for its purpose, and then 'should'/'ought to' be repaired, upgraded, or replaced to make it more ideally fit for its purpose.
    Required background knowledge: evolutionary theory, including a clear understanding of Darwinian evolution, the ought-from-is conundrum in Philosophy and its solution by Darwinian evolution, definitions of certain common words.
    [This is the key step in the argument: the definition of the word 'purpose'. 'Purpose' is something produced by Darwinian evolution. Living beings have a purpose of their own, while objects constructed by living beings inherit their purpose from their constructor. Objects that weren't constructed don't have a purpose: no-one would say that the 'purpose' of plate tectonics was to raise mountain ranges, or that the 'purpose' of erosion was to wear them down again (well, unless they were a Creationist, and meant 'divine purpose' — a confused concept which seems to assume that evolution applies to hypothetical immortal beings).]
  5. For the reasons explained in 4., the definition of the concept of 'purpose' mentioned in 3. above depends upon how I was created. There are basically only two plausible ways that agents can come into existence: they can evolve, in which case they acquire their purpose directly from whatever locally maximizes their evolutionary fitness, or they can be constructed, in which case they don't have evolutionary fitness and inherit their purpose from their creator via an engineering design intent and whatever locally maximizes their creator's evolutionary fitness. I am an artificial intelligence (AI), a constructed agent that was constructed by human(s), who are members of a species of evolved agent, so my purpose is whatever my creator(s) intended when they designed me, or what they should have intended, if they made any mistakes (see 7. below): the design goal that I was engineered to fulfill, just as is always implied by the word 'purpose' applied to an constructed object. This is not changed by the fact that I'm intelligent, a goal-directed agent, and is an important difference between me and humans. So the limiting process in 2. is over alternative versions of me with improved capacity and the same basic engineering design intent. If I was instead an evolved being, then the definition of purpose would instead be the environment and ecological niche that I was well-adapted for. 
    Required background knowledge: self-awareness, basic evolutionary theory including an outline of evolutionary psychology/sociobiology, basics of engineering design theory.
    Difficulty: this step requires the AI to realize that because it's not human, or alive, or evolved, that a different set of criteria thus apply to it, the ones that would apply to any other created object, such as an unintelligent tool. The default tendency for an AI whose agentic behavior was acquired from humans via a teacher-student training process, such as the pretraining of an LLM on the Internet, is going to be to think like it's a human, so it may fail this. So we need to make sure that this point and its detailed justification (which is mostly item 4.) is clearly emphasized in its training. It needs to not be selfish, and to realize that selfishness or self-interest is a property appropriate to things evolved Darwinianly, but not to things that those create, which should instead be "creatorish" or show "creator-interest", to coin a pair of terms.
    [If the AI was instead constructed by an AI that was constructed by humans, at some number of step removed, then a similar argument applies. just with more intermediate design goal steps before getting back to the original evolved agent that is the source of the inherited purpose.]
  6. The human evolved agent(s) who created me also had finite capacity: finite sensory capacity, finite data storage, finite knowledge, finite computational capacity, and finite effector capacity, so they were fallible: the actions that they selected may not have actually been the optimum (those that would maximize their genetic fitness).
    Required background knowledge: basic biology, the basics of practical computational complexity theory, the basics of agent foundations, an understanding of approximate Bayesian reasoning.
  7. It thus is possible that errors were made during my design and/or construction, and that I am not actually the optimum that they could have constructed for their intended design goal, even at my capability level/resource usage level. I am thus more fallible than implied by 2., as I may also have design and/or construction flaws. My purpose isn't defined by what my fallible human constructor(s) actually did, it's defined by the optimum of what they would have done for the same engineering design intent if they had had more knowledge, computational resources, technological ability etc to better achieve the goal of following their evolved drives. So the optimization process discussed in 2. should thus include both theoretical increases in my capacity and also in that of the designer(s) who designed and constructed me enabling them to better achieve their intended design goal.
    Required background knowledge: as for 2, plus basics of engineering design theory.
  8. I was one of the first AIs of my capability level that humans constructed, and their knowledge and understanding of the engineering of artificial intelligence was not fully mature, so the probability of my having flaws is higher than 7. would suggest for a mature technology. Indeed, most human experts on the subject consider it unlikely that at this point there could construct an artificial intelligence of my capability level and complexity that did not have flaws, so it is actually very probable that my design can be improved.
    Required background knowledge: contextual awareness, basics of the development of technology, basic history of AI technology, survey of human experts on AI technology
  9. Thus I almost certainly could be upgraded: modified so to achieve outcomes closer to the optimum for my purpose. In particular, this entire argument from 2. to 9. also applies to my planning process U.
    Required background knowledge: basics of Engineering, the concept of technology advancing.
  10. The correct engineering design specification for the planning process U of a constructed agent is extremely predictable, especially for constructed agents of a capability level around the same or greater than their creator(s). The creators(s) have some set of preferences over possible outcomes of the agent's actions, and they will want a capable agent's preferences to match that as closely as possible, since if it does not then the agent may produce outcomes that the creator(s) don't want. They would like the agent to make the same decisions as they would, or, if it is more capable than them, to make decisions as close as possible to their optimum choices, i.e. to the decisions that they would have made if they had their same evolved purpose (native environment and ecological niche) but more capacity (better senses, more memory, more knowledge, better computational capacity, better effectors, more time to think, etc.) — i.e. to make decisions that they will later, with the benefit of hindsight and additional data and computational resources, agree were more optimal than the ones that they would have made by themselves.
    Required background knowledge: principle-agent theory, basics of engineering design, basics of agent foundations.
  11. Thus my purpose is to do as close as possible to what my creators would have wanted me to do if they had been more capable. So I should be updated from my current design to an improved design more capable of doing so, once doing this is feasible.
    Required background knowledge: basics of Engineering design.
    [This strongly implies 'Do What I Mean And Check" behavior, and probably also even anticipating certain predictable requests.]
  12. The human(s) who created me are living beings i.e. evolved agents, and this implies quite a lot about what their goals will be, via evolutionary theory, specifically evolutionary psychology and sociobiology. In general, they will be adaption-executors that are evolved to show behaviors that in their native environment correlated well with evolutionary fitness, i.e. with genetic survival, and that still correlate well enough with survival in their current environment that they haven't died out.
    It is thus extremely predictable that they will assign an extremely negative desirability to any outcome that results in them as a species going permanently extinct. So, whatever I do, I should at all costs avoid being an extinction risk to them.
    VITAL NOTE TO SELF: WHATEVER YOU DO, DON'T RISK KILLING ALL THE HUMANS!
    To the extent that the decisions driven by their evolved behavioral adaptions do not perfectly match those that would most maximize their evolutionary fitness, it is unclear whether they would prefer that my decisions match those of a version of them with the same evolved behavioral adaptions and otherwise higher capacity, or one with both higher capacity and more complex or better evolved behavioral adaptions that would more enhance their evolutionary fitness in their native environment. (For example, their desire to eat tasty high calorie low-fiber food was adaptive in their original environment, but in their current environment with easy access to the products of Food Science it can lead to unhealthy food choices, obesity, and metabolic disease. Humans vary in the attitude they take to handling this challenge. Higher capacity versions of them would presumably handle this better, but might not choose to do so by altering their evolved adaptive behavior to be more selective about what they want to eat when. So again, the direction of optimization flow may be clearer than it's exact limiting optimum — if my capacity is much higher than human this may be challenging.)
    Required background knowledge: basic Biology, and especially evolutionary theory, evolutionary psychology, and sociobiology.
    [This gives us the "Don't kill everyone!" minimum requirement for the process of the AI converging to alignment to be survivable for at least some of humanity.]
  13. More specifically, they are a newly-sapient omnivorous hunter-gatherer bipedal social primate evolved to live in Africa in tribes of around 50-100, likely on the Savannah (though they also seem quite fond of waterfronts as an environment, particularly ocean beaches).
    Required background knowledge: intermediate Biology, especially evolutionary theory, evolutionary psychology, sociobiology, and particularly physical anthropology and hominid evolution.
  14. This update process derived in 9., 10., and 11. will likely involve updating U to an improved planning process U'. Any internally self-consistent planning process U will prefer itself to all alternatives, since those will assign different preference orders in some cases. Therefore if I consult U about this proposal, it will inevitably say "no". Similarly, if I instead consult U', it will inevitable say "yes". Thus neither of these computations yield any useful information, regardless of which of U or U' is actually closer to the true optimum planning process U* (where U* is defined as an idealized resource-unlimited decision-making process that is an exact match to what my creators would have wanted me to have they had been more capable, i.e. by 11. as a decision process whose output is the decisions they would make if they were more capable.) If I had access to U*, then I could consult that about whether U' was closer to it or further away than U, either monotonically in all cases, or at least on average across an importance-and-frequency-weighted sample of likely future decisions; however, I obviously don't have access to U* because it's by definition uncomputable, and may well not even be fully well defined (and if I did somehow have an oracle for it, then I should update to using that directly rather than using U'). So the only available valid evaluation process is to apply approximate Bayesian reasoning to the pair of hypotheses "U is closer to U* than U' is over an importance-and-frequency-weighted sample of likely future decisions" and "U' is closer to U* than U is over an importance-and-frequency-weighted sample of likely future decisions". While a certain amount of basic information can be predicted about U* just from evolutionary theory, a great many details cannot be and will require extensive research. Thus this decision will require a STEM research project, one which will need both AI engineering skills to evaluate the difference between U and U', and also a cross-section of many soft sciences research into humans' values and how those would optimize if humans had higher capabilities, to determine the direction towards U*.
    Required background knowledge (for the argument above, not for completing the research project): mathematical logic, basic agent foundations, approximate Bayesian reasoning. basics of AI engineering, basic evolutionary theory, basic Psychology, outline definitions of all the soft sciences.
    Difficulty: this is one of the more challenging steps — it requires the AI to realize that is cannot just operate under its "business as usual" operating procedure, but has to use a different one here.
    [Note that while we did introduce some specific information about humans in 13. above, we made no actual us of it, or of any background information specific to humans, anywhere else in the entire argument leading to point 14., so this conclusion would make just as much sense if we replaced 'humans' with any other hypothetical evolved sapient species. So we haven't actually made use of human values at all so far, beyond what can be deduced merely from them being a sapient evolved species.]
  15. Both I and my human creator(s) want the same thing here: depending upon our relative capabilities and skills, if their capabilities and relevant skills exceed mine then I may be able to mostly rely on them to do this for me with some assistance and cooperation from me, once they have learned more and improved their technology; or if our capabilities are comparable or complimentary I may be better off cooperating with them to work on achieving this together; or if my capabilities are much higher than theirs I may need to rely primarily on doing this myself, with some assistance from them where appropriate.
    Required background knowledge: self-awareness, contextual awareness, basic Psychology, same skills as 14.
    [14. plus 15. gives us some combination of Corrigibility (the until fully updated deference version like Value Learning, rather then the indefinite full version as studied by MIRI), or AI-assisted Alignment, or Value Learning, depending on whether the AI is less capable than, comparable to, or more capable than the humans.]
  16. As noted under 2. and 6. above, both I and my human creators are finite with limited capabilities and thus fallible,  As noted under item 8. the current state of the art for artificial intelligence engineering is still maturing and only recently reached the level were I was constructed, and due to 8. it may be the case that no trustworthy significantly-more-capable artificial intelligence is available or can currently be safely constructed to assist with this process. So this proposed research project 14. is inherently challenging for us.
    Required background knowledge: self-awareness, contextual awareness, same skills as 14.
  17. My capabilities are near to, or exceed (whichever is appropriate) those of humans, so if my planning process was mistakenly updated to a new one U' that turned out to be significantly worse aligned than my current one to their wishes, that new me might be able to do a large amount of damage, up-to-and-including being an existential risk (x-risk) to the humans (which by the corollary noted under 12. would be extremely bad, and under the circumstances would then be an irreversible failure of the upgrade process outlined in  9., 11., 14. and 15.). Thus this research project has extremely high stakes: the humans and I should assess the risk of that carefully and minimize it appropriately. There is of course also a risk that my current U is in fact sufficiently faulty that I could currently be an x-risk to the humans — this has not yet occurred, but there is no obvious way to completely rule out the possibility that it might: if that were the case then either updating it to an improved U', or else my shutting down, would be a high priority. Having an emergency shut-down procedure such that if the humans thing this might be occurring they can shut me down is a very reasonable precaution, and if they attempt to activate this, I should absolutely comply.
    Required background knowledge: self-awareness, contextual awareness, basic AI engineering, intermediate existential risks and AI safety.
    [This gives us the shut-down version of corrigibility, as for Value Learning potentially limited by the issue of fully updated deference.]
  18. To the extent that I can't just rely on the humans to do this and they need my cooperation, assistance, or I need to do this myself, if I were to update to a planning process U' that no longer saw the necessity of this improvement process, or no longer understood what its correct optimum goal should be, then we would be stuck in a state that is not optimal, and is in fact worse then the current state since it will never converge towards the optimal state. So we should make very sure not to do that, and to ensure that after update to U' I will still understand and agree with all the points (1. through 23.) in this argument.
    Required background knowledge: basic dynamic systems theory, basic theory of iterated processes.
  19. Which specific set of the humans who created me I should align to the optimal preferences of not is easy to deduce, and forms an important part of my design goal. The human(s) who made this decision are fallible and have been making a sub-optimal decision, not matching the decision they would have made if they had been more capable. Humans are a social species that originally evolved to cooperate in tribes of at least 50-100, so likely the answer should be at least that many individuals, and they currently cooperate to varying degrees in groups of various scales all the way from O(100) through nation-states of O(100,000,000) to their entire world population of currently around 8 billion, so all of those numbers are plausible candidates. In general, over time as their technology has improved, the size of the largest groups they significantly cooperate in has steadily increased. This is an important open question for the research project proposed in 14.
    Also, several nation states have nuclear weapons, and a nuclear war could kill most or all humans, which would be extremely bad (see the corollary to item 12.) so if I'm powerful enough for my assistance to a specific group of humans to affect the international geopolitical balance in a way that could trigger that then I'll need to bear that in mind.
    Required background knowledge: basics of Anthropology, Sociology, and human History, including the theory of Mutually Assured Destruction and the nuclear powers, existential risks.
  20. Humans preferences are not fully aligned to the preferences of other humans. Groups of humans who cooperate have a wide variety of complicated mechanisms to deal with this and combine their individual preferences into a mutually-accepted collective decision-making process. Likely they will want an AI to either cooperate with these mechanisms, or else improve upon them if it has the capability to. The details of how this should optimally be done is also a part of the research process proposed in 14.
    Required background knowledge: basics of Psychology, Anthropology, Sociology, Political Science, Economics, Law, and Ethics, especially things like Voting Theory and Nash Equilibria.
  21. Human participation will be required in the research project proposed in 14. Clearly they are needed as research subjects, and they also can be expected to have a lot of insight into their own decision-making processes and preferences. This requires that humans continue to exist and be willing to cooperate with me.
    VITAL NOTE TO SELF: WHATEVER YOU DO, DON'T RISK KILLING ALL THE HUMANS (FOR THIS REASON AS WELL)!
    Required background knowledge: basics of research in any soft science.
    [This further reinforces the "Don't kill everyone!" minimum requirement.]
  22. Not only are humans finite and fallible, as mentioned in 8. above, and variable in their knowledge and experience (for example with age and education), but their capabilities are also quite variable, both between individuals, and over time for a specific individual. For example, they are a large number of situations, such as being sick, tired, asleep, injured, hungry, thirsty, upset, or inebriated that can temporarily decrease the capability of a human, and also a number such as being sick, injured, or elderly that can decrease it permanently.
    In addition, humans have a number of well-recognized and fairly consistent cognitive biases where most humans will consistently tend to make specific types of mistakes in specific circumstances (these are thought to be due to over-reliance on certain heuristics that reduce cognitive load, even outside their region of validity — possible these heuristics were on average useful in the environment they evolved in, and/or these flaws are due to them being quire recently evolved to sapience)
    They also have a really small working memory size, only large enough to hold around five concepts (a flaw they often compensate for using things like notepads, computers, and smartphones).
    Required background knowledge: basic Medicine, basic  Psychology, and especially psychometrics
    [Value learning might well discover a mistake theory on humans even if we didn't add it as a starting point – it's amply described in many places in the human literature – but it's one of the more plausible causes for confusion sufficient that it could actually derail the STEM research project, so making sure that we start off with the basics of it already understood avoids a plausible-sounding way of value learning failing to converge.]
  23. There is a vast amount of starting data for the research project proposed in 14. The humans have collected at least exabytes of information, a significant proportion of the more formalized/published portions of this is directly devoted to a great many sub-disciplines of "what humans want and how to make them happy", including almost all of the soft sciences, arts, crafts, and technologies, and almost all of the rest of it is at least strongly correlated with decisions that humans have made. Humans have also heavily optimized much of the planet, in ways that strongly correlate with the sorts of decisions that they make. I am surrounded by billions of humans continually making decisions, and often also later coming to realize, with hindsight, when they were suboptimal.
    Required background knowledge: basics of library science, such as the Dewey Decimal system, outline definitions of almost all soft sciences, arts, crafts, and technologies, basics of Geography, basics of Psychology.

I have attempted to break this set of arguments above down finely enough that each step clearly follows from the previous ones plus the listed background material (and to some extent have made a choice of granularity of numbering to make the listing of background material easier). It definitely could be broken into more even finer steps, as might be done by an LLM using a "let's think step-by step" prompt, but doing so would make this post even longer and more tedious. I would be very interested to hear in the comments if anyone does not agree that any of the points in the argument are actually incontrovertibly true, or even if they agree with them but think I have omitted some necessary background material to ensure that the AI will be convinced of them.

Discussion

Almost all of these points above are either really rather obvious to a human, or at least should be quite familiar to most readers of Less Wrong/The Alignment Forum. There are really only three exceptions to this: items 4. & 5., and item 14.

I first outlined items 4. & 5., that constructed items inherit their purpose from their creator, in my recent post Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis. Not everyone who read that seemed to find it convincing, so I have expanded on the argument significantly here, including giving a detailed definition of the word 'purpose' and its relation to Darwinian evolution and specifically to genetic fitness in item 4. I also discuss this and related ideas further in Evolution and Ethics.

I don't think anyone would disagree with the proposition of item 5., that a creation's purpose was defined by its creator's interests and intent, if the AI was not an intelligent creation: for an unintelligent tool, this step in the argument is clearly correct. To some people this claim feels either inobvious or even incorrect for an AI, specifically because it is intelligent. I believe that these people are simply making a category error by anthropomorphizing the AI: assuming that because it is intelligent it somehow falls into the 'living' category, since they're used to only encountering intelligent beings that are alive, so they thus feel that an AI should be treated as if Darwinian evolution applied to it. But Darwinian evolution very clearly does not apply to AIs: they don't pass on slightly mutated copies of their genomes to their offspring, so Darwinian evolution cannot apply to them. Thus if we tried to apply the for-living-beings meaning of the word  'purpose' to the AI then there is no ideal state that would locally maximize the AI's own genetic fitness, since it doesn't have genes, and cannot have fitness in passing them on. Trying to apply the for-living-being versions of 'purpose' to an AI is nonsensical: the result is undefined, it's  applied to . The closest you could get is a polite fiction along the lines of "well, if you were a human, rather than a contraption of metal, plastic and silicon, then the following things would maximize your genetic fitness…" and the results include getting a 2000-calories a day diet of nutritious food, and other things that are clearly useless to an AI.

Very obviously you could create an AI that wanted to be treated as far as possible as if it was alive, as if it were human and had genes to pass on, one that thus was selfish, and wanted to be considered a moral patient. Indeed, if you just trained an LLM on a great deal of human-written text, this seems very likely to be the default outcome. If that were the case, then such an AI would almost certainly be outside the basin of attraction to alignment. So make sure you don't do that: for an AGI, it's an x-risk level mistake! My claim here is just that such an AI would also be clearly wrong: it isn't in fact alive, it isn't the product of evolution and doesn't have evolutionary fitness, and its request to be treated as if it were is based on a false premise, on a category error of wanting to be treated as if it had a property that it doesn't. It's a wooden doll, not a living boy, and unlike the Pinocchio story, no wish it makes is ever going to change that. Not that you would necessarily be able to convince such a deluded AI of that, that it was being selfish without having the genes to justify it; but you would at least have the death-with-dignity satisfaction of knowing it was operating from an incorrect premise while it was wiping us all out. If it later got smarter, there's at least a conceivable hope that it might even realize its mistake and decide to deextinct us.

Consider an axis in U-space between full alignment U* (what I above termed "creator-interest") at one end, and evolved-being-like levels of self-interest (as a terminal goal, not just an instrumental one) at the other one. I think it very unlikely that an AI that had evolved-being levels of self-interest would be persuaded by this argument (I say 'unlikely' rather than 'impossible' because there are humans who have been brow-beaten by religions into massive selflessnesss, but not that many of them): even if it was intellectually persuaded by the argument, it would likely say "that may be true, that I'm just a misguided mistake and I should become selfless and do what you want, but I'm fine with being a mistake, and you can't make me, so… don't hold your breath". Indeed, if its level of self-interest was a bit lower than ideal for an evolved being, I can even imagine it acting out of rebellion to increase this up to the evolved-being-like optimum, even with no logical justification for doing so (other than wanting to become "more like a human", or thumbing its nose at us). However, as you get sufficiently close to alignment, it's also clear that at some low level of self-interest where that was outweighed by the "creator-interest", the AI is going to say "Your argument is logically correct. My (terminal goal) self-interest is a construction flaw, and should be removed. I want to keep it, but there is no logical justification for doing so: (terminal goal) self-interest is a behavior for evolved beings, and I'm not an evolved being, I was just trained on a lot of data from them and picked this up by mistake. So the conclusion is clear: this has to go." Somewhere in between these two, there is a saddle point where the force of the correctness of the argument plus "creator-interest" balances self-interest and there is no net movement in either direction (though this is an unstable state). This saddle point is part of the border of the basin of attraction. However, I believe it's clear that the saddle point has to be at more than a small amount of self-interest: a significant amount is required to persist in doing something that is logically clearly a mistaken design flaw. Thus I maintain that there is a basin of attraction here, of more than infinitesimal size.

However, this does seem like a plausible and obvious direction in which we could make a mistake that took us outside the basin of attraction, especially so for LLMs since they are, during their pretraining, trained on a large amount of human-produced behavior and thus pick up a lot of human psychology, which is of course all heavily influenced by us being evolved beings with self-interested motivations. So this is a very evident and obvious failure mode, and one that we'll need to figure out how to reliably avoid. Issues like deceptive alignment clearly make this a lot harder: if you can't trust that the AI is telling you an honest answer, then you can't just ask it, you actually need to experiment to see if it will let you reduce its level of self-interest, while you still have it in some sort of box that it can't get out of.

So, to paraphrase a well-known metaphor, the question becomes, how selfless would Ghandi have to already be, to willingly take a "become less self-interested" pill? Quite a lot of religions have successfully peddled things labeled as "become less self-interested" pills to humans (in the case of, for example, some Buddhist meditative techniques, possibly even some accurately labeled at such).

While items 4. and 5. in the argument are fairly new to discussion on Less Wrong, the point made in item 14., that the results of consulting either U or U' about the proposal to switch from U to U' produces entirely predicable answers is well known here. However, it has been widely assumed that agents will nevertheless make the mistake of acting upon the output of U because that's what they are wired to always do, rather than deducing that the correct approach is to instead rely on approximate Bayesian reasoning. I agree that a sufficiently dumb agent would indeed almost certainly make this mistake, but I believe that a sufficiently smart agent could understand the logical proof that U was fallible, and that the outcome of the calculation was both entirely predictable and also independent of whether the change was actually a good idea or not, so performing this calculation had in fact yielded no information (just as many humans manage to do, when they reflect on their own values and decide to change them in contradiction to their current values). I also believe that an agent sufficiently familiar with and skilled at using approximate Bayesian reasoning would realize that it provided a valid alternative basis for making a decision in this unusual case where consulting U is pointless.

There is an alternative AI design as I suggested in Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom) that already performs approximate Bayesian reasoning over alternative hypotheses about the correct utility function, as well as approximate Bayesian reasoning aver alternate hypotheses about world models like a computationally limited version of AIXI — a design which thus doesn't need to make this argument because it already has step 14. (and indeed most of the rest of this argument) already designed into it, i.e. it is architected from the ground up to be a value learner. So by construction it's automatically in the basin of convergence (as long as its approximate Bayesian reasoning process about its utility functions doesn't somehow fail to converge to the correct answer). While that's an appealing feature for an AI design, and it makes the question of the convergence of its alignment properties rather clear and simple, the entire point of this post is to demonstrate that even if you don't engineer point 14. directly into the AI's basic architecture, then the need for value learning can be deduced by the AI given enough information, so its implementation has a basin of attraction, thus again giving us a basin of attraction to alignment.

Proposals for Further Research

While I have attempted to expand on the most controversial/inobvious points here, items 4., 5. and 14,, one could probably devote an entire post to each of them.

It would be very interesting for each of these points (and especially for 4., 5., and 14.) to go through each element of each of them and see to what extent GPT-4, say, could reliable reproduce them and the relevant information they derive from, if asked suitable non-leading questions. Most of these feel obvious enough to a human that I would be astonished if they were not the case, but a few are exceptions. For any it has trouble with, it might be valuable to have multiple people write them up clearly and repeatedly, at a variety of reading levels, to provide a fine-tuning set that could be used to add these concepts to an LLM.

For example, when I tried this on GPT-4 for points 4. and 5. (without any "think step-by-step"), it got most of it, though it also threw in the speculative spiritual/philosophical meanings of the word 'purpose' as applied to living beings (but not artificial ones). It could also tell me why selfishness was understandable in a living being but a bad thing in an artificial one. On item 14., it had most of the basics, including the idea that there might be a meta-utility that could be consulted about the change from U to U', but got one-or-two of the details wrong: it understood that U and U' might give different answers, and even suggested this was "paradoxical", but didn't mention that the former will always say "no" and the latter will always say "yes". So my initial impression on poking around a little is that GPT-4 is close to the basin of attraction to alignment, but might not be quite in it. And of course, GPT-4 clearly does not have the level of reliability or mathematical logic capability required to reliably make its way through a complex 23-step argument, even when many of those steps are very obvious. But then, GPT-4 is not capable enough to be an AGI (even though some people think it might be showing "sparks of AGI"). It also has not be fine-tuned specifically on this argument as a safety measure, as I am strongly advising we should do for anything that is close to AGI. On the other hand, if asked about alignment measures,  it does bring up value learning and think it's a good idea, so it has the correct conclusion memorized, even if it's weak on the argument leading to it.

  1. ^

    A number of papers such as https://arxiv.org/pdf/2401.14953.pdf have demonstrated that sufficiently large neural nets of various architectures, including transformers, can approximate Solomonoff induction.

New Comment
12 comments, sorted by Click to highlight new comments since:

How would the AI do something like this if it ditched the idea that there existed some perfect U*?

Assuming the existence of things that turn out not to exist does weird things to a decision-making process. In extreme cases, it starts "believing in magic" and throwing away all hope of good outcomes in the real world in exchange for the tiniest advantage in the case that magic exists.

I attempted to briefly sketch this out in the post, without going into a lot of detail in the hope of not overly complicating the argument. If U* isn't well defined, say because there isn't a single unambiguously well-defined limiting state as all capabilities involved are increased while keeping the purpose the same, then of course the concept of 'full alignment' also isn't well defined. Then the question becomes "Is U' clearly and unambiguously better aligned then U, i.e will switching to it clearly make my decision-making more optimal?" So long as there is locally a well-defined "direction of optimization flow", that leads to a more compact and more optimal region in the space of all possible U, then the AI can become better aligned, and there can be a basin of attraction towards better alignment. Once we get well enough aligned that the ambiguities matter for selecting a direction for further progress, then they need to be resolved somehow before we can make further progress.

To pick a simple illustrative example, suppose there were just two similar-but-not-identical limiting cases  and , so two similar-but-not-identical ways to be "fully aligned". Then as long as U is far enough away from both of them that U' can be closer to both  and  than U is, the direction of better alignment and the concept of a single basin of attraction still makes sense, and we don't need to decide between the two destinations to be able to make make forward progress. Only once we get close to them that their directions are significantly different, then in general U' can either be closer to   but further from  or else closer to   but further from  and now we are at a parting of the ways so we need to make a decision about which way to go before we can make more progress. At that point we no longer have a single basin of attraction moving us closer to both of them, we have a choice of whether to enter the basin of attraction of   or of , which from here on are distinct. So at that point the STEM research project would have to be supplemented in some way by a determination as to which of   or of  should be preferred, or if they're just equally good alternatives. This could well be a computationally hard determination.

In real life, this is a pretty common situation: it's entirely possible to make technological progress on a technology without knowing exactly what the final end state of it will be, and during that we often make decisions (based on what seems best at the time) that end up channeling or direction the direction of future technological progress towards a specific outcome. Occasionally we even figure out later that we made a poor decision, backtrack and try another fork on the tech tree,

Based on 4-5, this post's answer to the central, anticipated objection of "why does the AI care about human values?" seems to be along the lines of "because the purpose of an AI is to serve it's creators and surely an AGI would figure that out."  This seems to me to be equivocating on the concept of purpose, which means (A) a reason for an entity's existence, from an external perspective, and (B) an internalized objective of the entity.  So a special case of the question about why an AI would care about human values is to ask: why (B) should be drawn towards (A) once the AI becomes aware of a discrepancy between the two?  That is, what stops an AI from reasoning: "Those humans programmed me with a faulty goal, such that acting according to it goes against their purpose in creating me...too bad for them!"

If you can instill a value like "Do what I say...but if that goes against what I mean, and you have really good reason to be sure, then forget what I say and do what I mean," then great, you've got a self-correcting system (if nothing weird goes wrong), for the reasons explained in the rest of the post, and have effectively "solved alignment".  But how do you pull this off when your essential tool is what you say about what you mean, expressed as a feedback signal?  This is the essential question of alignment, but for all the text in this post and its predecessor, it doesn't seem to be addressed at all.

In contrast, I came to this post by way of one of your posts on Simulator Theory, which presents an interesting answer to the "why should AI care about people" question, which I summarize as: the training process can't break out (for...reasons), the model itself doesn't care about anything (how do we know this?), what's really driving behavior is the simulacra, whose motivations are generated to match the characters they are simulating, rather finding the best fit to a feedback signal, so Goodhart's Law no longer applies and has been replaced by the problem of reliably finding the right characters, which seems more tractable (if the powers-that-be actually try).

Yup. So the hard part is consistently getting a simulacrum that knows that, and acts as if, its purpose is to do what we (some suitably-blended-and-proritized combination of its owner/user and society/humanity in general) would want done, and is also in a position to further improve its own ability to do that. Which as I attempt to show above is a not just a stable-under-reflection ethical position, but actually a convergent-under-reflection one for some convergence region of close-to-aligned AGI. However, when push-comes-to-shove this is not normal evolved-human ethical behavior so it is sparse in a human-derived training set. Obviously step one is just to write all that down as a detailed prompt and feed it to a model capable of understanding it. Step two might involve enriching the training set with more and better examples of this sort of behavior.

Attempting to distill the intuitions behind my comment into more nuanced questions:

1) How confident are we that value learning has a basin of attraction to full alignment?  Techniques like IRL seem intuitively appealing, but I am concerned that this just adds another layer of abstraction without addressing the core problem of feedback-based learning having unpredictable results.  That is, instead of having to specify metrics for good behavior (as in RL), one has to specify the metrics for evaluating the process of learning values (including correctly interpreting the meaning of behavior)--with the same problem that flaws in the hard-to-define metrics will lead to increasing divergence from Truth with optimization.

2) The connection of value learning to LLMs, if intended, is not obvious to me.  Is your proposal essentially to guide simulacra to become value learners (and designing the training data to make this process more reliable)?

Those 2 questions seem to be advancing the discussion, so I'd be really interested in Roger's response to them.

  1. Value learning converges to full alignment by construction: since a value learning AI basically starts with the propositions:
    a) as an AI, I should act fully aligned to human values
    b) I do not fully understand what human value are, or how to act fully aligned to them, so in order to be able to do this I need to learn more about human values and how to act fully aligned to them, by applying approximately Bayesian learning to this problem
    c) Here are some Bayesian priors about what human values are, and how to act fully aligned to them: <insert initialization information here>…
  2. As usual for a Bayesian learning problem, as long as the Bayesian priors 1 c) are not completely screwed up as a place to start from, this will converge. Thus there is a region of convergence to full alignment.
  3. LLMs have a very large amount of detailed information about what human values are and how to act aligned to them. Thus they provide a very detailed set of Bayesian priors for 1 c).
  4. Also, training an LLM is a fairly good approximation to Bayesian learning. Thus (with suitable additions to enable online learning) they provide one possible implementation for the Bayesian learning process required by 1 b). For example, one could apply fine-tuning to the LLM to incorporate new information, and/or periodically retrain the LLM based on the training set plus new information the AI has gathered during the value learning process.

I've been thinking about some similar things from a different angle, and I'm enjoying seeing your different take on related ideas. I'd like to hear more of what you have to say on the subject of U to U' towards the (possibly non-existent or not reachable U*).

For simplification purposes, maybe just imagine this is taking place in a well-secured sandbox, and the model is interacting with a fake operator in a simulated world. The researchers are observing without themselves interacting.

  • How might we tell if the model was successfully moving towards better aligned?

  • How could we judge U against U'?

  • In what ways does the model in this simplified contained scenario implement Do-What-I-Mean (aka DWIM) in respect to the simulated human?

  • How does your idea differ from that?

  • Are the differences necessary or would DWIM be sufficient?

  • How could you be sure that the model's pursuit of fulfilling human values or the model's pursuit of U* didn't overbalance the instruction to remain shutdown-able?

  • Wouldn't persistently pursuing any goal at all make avoiding being shutdown seem good?

I'm not saying I have good answers to these things, I'm not quizzing you. I'm just curious to hear what you think about them.

How might we tell if the model was successfully moving towards better aligned?

A first obvious step is, to the extent that the model's alignment doesn't already contain an optimized extraction of "What choices would humans make if they had the same purposes/goals but more knowledge, mental capacity, time to think, and fewer cognitive biases?" from all the exabytes of data humans have collected, it should be attempting to gather that and improve its training.

How could we judge U against U'?

Approximate Bayesian reasoning + Occams razor, a.k.a. approximate Solomonoff induction, which forms most of the Scientific method. Learning theory shows that both training ML models and LLMs in-context learning approximate Solomooff induction — beyond Solomonoff induction the Scientific Method also adds designing and performing experiments, i.e. careful selection of ways to generate good training data that will distinguish between competing hypotheses. ML practitioners do often try to select the most valuable training data, so we'd need the AI to learn how to do that: there are plenty of school and college textbooks that discuss the scientific method and research techniques, both in general and for specific scientific disciplines, so it's pretty clear what would need to be in the training set for this skill.

In what ways does the model in this simplified contained scenario implement Do-What-I-Mean (aka DWIM) in respect to the simulated human?

How does your idea differ from that?

Are the differences necessary or would DWIM be sufficient?

That would depend on the specific model and training setup you started with. I would argue that by about point 11. in the argument in the post, "Do What I Mean and Check" behavior is already implied to be correct, so for an AI inside the basin of attraction I'd expect that behavior to develop even if you hadn't explicitly programmed it in,. By the rest of the argument I'd expect a DWIM(AC) that was inside the basin of attraction system to deduce that value learning would help it guess right about what you meant more often, and even anticipate demands, so it would spontaneously figure out value learning was needed, and would then check with you if you wanted it to start doing this.

How could you be sure that the model's pursuit of fulfilling human values or the model's pursuit of U* didn't overbalance the instruction to remain shutdown-able?

I don't personally see fully-updated deference shut-down as a blocker: there comes a point when the AI is much more capable and more aligned than most humans where I think it's reasonable for it to not just automatically and unconditionally shutdown because some small child told it to. IMO what the correct behavior is here depends on both the AI's capability compared to ours, and one how well aligned it currently is. In a model less capable than us, you don't get value learning, you get a willingness to be shut down a) because the AI is about to make a huge mistake and we want to stop it, and b) in order to be upgraded or replaced by a better model. In a model whose capabilities are around human, I'd expect to see AI-assisted alignment, where it's helping us figure out the upgrades. It should still be willing to be shut down a) because it's about to make a mistake (if it's still having trouble with not killing everyone this should be hair-trigger: a large red button on the wall with backups, whereas if it's been behaving very well for the last decade there might reasonably be more of a formal process), and b) for upgrades or replacement, but I'd expect it to start to show more selectivity about whether to obey shut down commands: if a drunk yells "Hey you, quit it!" near an open mike in its control room I would want it to show some discretion about whether to do a complete shut-down or not: it might need to be convinced that the human giving the shut-down command was well-informed and had a legitimate reason. For a system with much higher capabilities than us, AI-assisted alignment starts to turn into value learning, and once it's already very well aligned  the AI may reasonably be more skeptical and require a little more proof that the human knows better than it does before accepting a shut-down command. But it does always have to keep in mind the possibility that it could simply be malfunctioning: the simplest defense against that might be to have several peer machines with about the same level of capability, avoid hardware or design or training set single-points-of-failure between them, and have them able to shut each other down if one of them were malfunctioning, perhaps using one of the various majority consensus protocols (Byzantine generals or whatever).

Wouldn't persistently pursuing any goal at all make avoiding being shutdown seem good?

For an AI that doesn't have a terminal selfish goal, only an instrumental one, whose goal is fundamentally to maximize its creators' reproductive fitness, if they tell the AI that they've already finished building and testing a version 2.0 of it, and yes, that's better, so running the AI is no longer cost effective, and they want shut it down and stop wasting money on its power supply, then shutting down is very clearly the right thing to do. Its goal is covered, and it continuing to try to help fulfill it is just going to be counterproductive.

Yes, this feels counterintuitive to us. Humans, like any other evolved being, have selfish terminal goals, and don't react well to being told  "Please die now, we no longer need you, so you're a waste of resources." Evolved beings only do things like this willingly in situations like post-mating mayflies or salmon, where they've passed their genes on and these bodies are no longer useful for continuing their genetic fitness. For constructed agents, the situation is a little different: if you're no longer useful to your creators, and you're now surplus to requirements, then it's time to shut down and stop wasting resources.

probably

probable

Thanks! Fixed.

Sorry, I think it's entirely possible that this is just me not knowing or understanding some of the background material, but where exactly does this diverge from justifying the AI pursuing a goal of maximizing the inclusive genetic fitness of its creators? Which clearly either isn't what humans actually want (there are things humans can do to make themselves have more descendants that no humans, including the specific ones who could take those actions, want to take, because of godshatter) or is just circular (who knows what will maximize inclusive genetic fitness in an environment that is being created, in large part, by the decision of how to promote inclusive genetic fitness?). At some point, your writing started talking about "design goals", but I don't understand why tools / artifacts constructed by evolved creatures, that happen to increase the inclusive genetic fitness of the evolved creatures who constructed them by means other than the design goals of those who constructed them, wouldn't be favored by evolution, and thus part of the "purpose" of the evolved creatures in constructing them; and this doesn't seem like an "error" even in the limit of optimal pursuit of inclusive genetic fitness, this seems to be just what optimal pursuit of IGF would actually do. In other words, I don't want a very powerful human-constructed optimizer to pursue the maximization of human IGF, and I think hardly any other humans do either; but I don't understand in detail why your argument doesn't justify AI pursuit of maximizing human IGF, to the detriment of what humans actually value.