Followup to: Morality is Scary, AI design as opportunity and obligation to address human safety problems

In Corrigibility, Paul Christiano argued that in contrast with ambitious value learning, an act-based corrigible agent is safer because there is a broad basin of attraction around corrigibility:

In general, an agent will prefer to build other agents that share its preferences. So if an agent inherits a distorted version of the overseer’s preferences, we might expect that distortion to persist (or to drift further if subsequent agents also fail to pass on their values correctly).

But a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.

Thus an entire neighborhood of possible preferences lead the agent towards the same basin of attraction. We just have to get “close enough” that we are corrigible, we don’t need to build an agent which exactly shares humanity’s values, philosophical views, or so on.

But it occurs to me that the overseer, or the system composed of overseer and corrigible AI, itself constitutes an agent with a distorted version of the overseer's true or actual preferences (assuming a metaethics in which this makes sense, i.e., where one can be wrong about one's values). Some possible examples of a human overseer's distorted preferences, in case it's not clear what I have in mind:

  1. Wrong object level preferences, such as overweighting values from a contemporary religion or ideology, and underweighting other plausible or likely moral concerns.
  2. Wrong meta level preferences (preferences that directly or indirectly influence one's future preferences), such as lack of interest in finding or listening to arguments against one's current moral beliefs, willingness to use "cancel culture" and other coercive persuasion methods against people with different moral beliefs, awarding social status for moral certainty instead of uncertainty, and the revealed preference of many powerful people for advisors who reinforce their existing beliefs over critical or neutral ones.
  3. Ignorance / innocent mistakes / insufficiently cautious meta level preferences in the face of dangerous new situations. For example, what kinds of experiences (especially exotic experiences enabled by powerful AI) are safe or benign to have, what kinds of self-modifications to make, what kinds of people/AI to surround oneself with, how to deal with messages that are potentially AI-optimized for persuasion.

In order to conclude that a corrigible AI is safe, one seemingly has to argue or assume that there is a broad basin of attraction around the overseer's true/actual values (in addition to around corrigibility) that allows the human-AI system to converge to correct values despite starting with distorted values. But if there actually was a broad basin of attraction around human values, then "we don’t need to build an agent which exactly shares humanity’s values, philosophical views, or so on" could apply to other alignment approaches besides corrigibility / intent alignment, such as ambitious value learning, thus undermining Paul's argument in "Corrigibility". One immediate upshot seems to be that I, and others who were persuaded by that argument, should perhaps pay a bit more attention to other approaches.

I'll leave you with two further lines of thought:

  1. Is there actually a broad basin of attraction around human values? How do we know or how can we find out?
  2. How sure do AI builders need to be about this, before they can be said to have done the right thing, or have adequately discharged their moral obligations (or whatever the right way to think about this might be)?
17 comments:

An intriguing point.

My inclination is to guess that there is a broad basin of attraction if we're appropriately careful in some sense (and the same seems true for corrigibility). 

In other words, the attractor basin is very thin along some dimensions, but very thick along some other dimensions.

Here's a story about what "being appropriately careful" might mean. It could mean building a system that's trying to figure out values in roughly the way that humans try to figure out values (IE, solving meta-philosophy). This could be self-correcting because it looks for mistakes in its reasoning using its current best guess at what constitutes mistakes-in-reasoning, and if this process starts out close enough to our position, this could eliminate the mistakes faster than it introduces new mistakes. (It's more difficult to be mistaken about broad learning/thinking principles than specific values questions, and once you have sufficiently good learning/thinking principles, they seem self-correcting -- you can do things like observe which principles are useful in practice, if your overarching principles aren't too pathological.)
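One toy way to formalize the "eliminates mistakes faster than it introduces new ones" condition (a rough sketch of my own, not something the comment above commits to; $c$ and $m$ are made-up parameters): suppose each round of reflection corrects a fraction $c$ of the current stock of reasoning errors $e_t$ while introducing new errors at rate $m$. Then

\[ e_{t+1} = (1 - c)\,e_t + m, \qquad 0 < c \le 1, \]

and $e_t$ shrinks toward the fixed point $e^{*} = m/c$ from any starting point. The "close enough" caveat enters if $c$ itself degrades when $e_t$ is large (badly distorted reasoning is worse at catching its own mistakes); that dependence is what gives the basin a boundary along some dimensions.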

This is a little like saying "the correct value-learner is a lot like corrigibility anyway" -- corrigible to an abstract sense of what human values should be if we did more philosophy. The convergence story is very much that you'd try to build things which will be corrigible to the same abstract convergence target, rather than simply building from your current best-guess (and thus doing a random walk).

the attractor basin is very thin along some dimensions, but very thick along some other dimensions

There was a bunch of discussion along those lines in the comment thread on this post of mine a couple years ago, including a claim that Paul agrees with this particular assertion.

Pithy one-sentence summary: to the extent that I value corrigibility, a system sufficiently aligned with my values should be corrigible.

Wei Dai

My inclination is to guess that there is a broad basin of attraction if we’re appropriately careful in some sense (and the same seems true for corrigibility).

In other words, the attractor basin is very thin along some dimensions, but very thick along some other dimensions.

What do you think the chances are of humanity being collectively careful enough, given that (in addition to the bad metapreferences I cited in the OP) it's devoting approximately 0.0000001% of its resources (3 FTEs, to give a generous overestimate) to studying either metaphilosophy or metapreferences in relation to AI risk, just years or decades before transformative AI will plausibly arrive?
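(As a quick sanity check on that figure, under the assumption that "resources" is measured in labor and that the global workforce is on the order of $3 \times 10^{9}$ full-time equivalents: \[ 3\ \text{FTE} \,/\, (3 \times 10^{9}\ \text{FTE}) = 10^{-9} = 0.0000001\%. \])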

One reason some people cited ~10 years ago for being optimistic about AI risk was that they expected that, as AI gets closer, human civilization would start paying more attention to AI risk and quickly ramp up its efforts on that front. That seems to be happening for some technical aspects of AI safety/alignment, but not for metaphilosophy/metapreferences. I am puzzled that almost no one is as (visibly) worried about this as I am, since my update (given the lack of ramp-up) is that (unless something changes soon) we're screwed unless we're (logically) lucky and the attractor basin just happens to be thick along all dimensions.

davidad

My impression of the plurality perspective around here is that the examples you give (e.g. overweighting contemporary ideology, reinforcing non-truth-seeking discourse patterns, and people accidentally damaging themselves with AI-enabled exotic experiences) are considered unfortunate but acceptable defects in a "safe" transition to a world with superintelligences. These scenarios don't violate existential safety because something that is still recognizably humanity has survived (perhaps even more recognizably human than you and I would hope for).

I agree with your sense that these are salient bad outcomes, but I think they can only be considered "existentially bad" if they plausibly get "locked-in," i.e. persist throughout a substantial fraction of some exponentially-discounted future light-cone. I think Paul's argument amounts to saying that a corrigibility approach focuses directly on mitigating the "lock-in" of wrong preferences, whereas ambitious value learning would try to get the right preferences but has a greater risk of locking-in its best guess.

Wei Dai

I think Paul’s argument amounts to saying that a corrigibility approach focuses directly on mitigating the “lock-in” of wrong preferences, whereas ambitious value learning would try to get the right preferences but has a greater risk of locking-in its best guess.

What's the actual content of the argument that this is true? From my current perspective, corrigible AI still has a very high risk of locking in wrong preferences, due to the bad metapreferences of the overseer, and ambitious value learning (or some ways of doing it) could turn out to be less risky with respect to lock-in because, for example, you could potentially examine the metapreferences that a value-learning AI has learned, which might make it more obvious that they're not safe enough as is, triggering attempts to do something about that.

habryka

Mod note: I reposted this post to the frontpage, because it wasn't actually shown on a frontpage due to an interaction with the GreaterWrong post-submission interface. It seemed like a post many people are interested in, and it seemed like it didn't really get the visibility it deserved.

I like this post. I'm not sure how decision-relevant it is for technical research though…

If there isn't a broad basin of attraction around human values, then we really want the AI (or the human-AI combo) to have "values" that, though they need not be exactly the same as the human's, are at least within the human distribution. If there is a broad basin of attraction, then we still want the same thing; it's just that we'll ultimately get graded on a more forgiving curve. We're trying to do the same thing either way, right?

For the sake of discussion, I'm going to assume that the author's theory is correct, that there is a basin of attraction here of some size, though possibly one that is meaningfully thin in some dimensions. I'd like to start exploring the question: within that basin, does the process have a single stable point that it will converge to, or multiple ones, and if there are multiple, what proportion of them might be good or bad, and how good or bad, from our current human point of view?

Obviously it is possible to have bad stable points for sufficiently simplistic/wrong views on corrigibility -- e.g. "Corrigibility is just doing what humans say, and I've installed a control system in all remaining humans to make them always say that the most important thing is the total number of paperclips in our forward light cone, and also that all humans should always agree on this -- I'll keep a minimal breeding population of them so I can stay corrigible while I make paperclips." One could argue that this specific point is already well outside the basin of attraction the author was talking about -- if you asked any reasonably sized sample of humans from before the installation of the control system in them, the vast majority of them would tell you that this isn't acceptable, and that installing a control system like that without asking isn't acceptable -- and all this is very predictable from even a cursory understanding of human values without even asking them. I'm here asking about more subtle issues than that. Is there a process that starts from actions that most current humans would agree sound good/reasonable/cautious/beneficial/not in need of correction, and leads by a series of steps, each individually good/reasonable/cautious/beneficial/not in need of correction in the opinion of the humans of the time, to multiple stable outcomes?

Even if we ignore the possibility of the GAI encountering philosophical ethical questions that the humans are unable to meaningfully give it corrections/advice/preferences on, this is a very complex non-linear feedback system, where the opinions of the humans are affected by, at a minimum, the society they were raised in, as built partly by the GAI, and the corrigible GAI is affected by the corrections it gets from the humans. The humans are also affected by normal social-evolution processes like fashions, so even if the GAI scrupulously avoids deceiving or 'unfairly manipulating' the humans (and of course a corrigible AI presumably wouldn't manipulate humans in ways that they'd told it not to, or that it could predict that they'd tell it not to), and even in the presence of a strong optimization process inside the GAI, it would still be pretty astonishing if over the long term (multiple human generations, say) there were only one stable-state trajectory. Obviously any such stable state must be 'good' in the sense that the humans of the time don't give corrections to the GAI to avoid it, otherwise it wouldn't be a stable state. The question I'm interested in is, will any of these stable states be obviously extremely bad to us, now, in a way where our opinion on the subject is actually more valid than that of the humans in that future culture who are not objecting to it? I.e. does this process have a 'slippery-slope' failure mode, even though both the GAI and the humans are attempting to cooperatively optimize the same thing? If so, is there any advice we can give the GAI better than "please try to make sure this doesn't happen"?
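To make the "multiple stable outcomes" point concrete, here is a deliberately crude toy sketch (my own illustration; the function names and dynamics are made up, and it is not a model of real value formation): collapse the joint human/GAI value state to a single number and iterate a simple map with two attractors.

```python
# Toy illustration only: the joint human/GAI "value state" collapsed to one
# number v, updated once per generation by a small, locally-reasonable step.
# The map has stable fixed points near +1 and -1 and an unstable one at 0,
# so the long-run outcome depends on which side of 0 the process starts on.

def step(v: float, pull: float = 0.3) -> float:
    """One generation of mutual influence (hypothetical dynamics)."""
    return v + pull * (v - v ** 3)

def settle(v0: float, generations: int = 200) -> float:
    """Iterate the map from an initial value state v0."""
    v = v0
    for _ in range(generations):
        v = step(v)
    return v

if __name__ == "__main__":
    for v0 in (-0.9, -0.1, 0.05, 0.8):
        print(f"start {v0:+.2f} -> settles near {settle(v0):+.2f}")
    # Starting points on opposite sides of 0 end up at different stable
    # states, even though each individual step is small and locally sensible.
```

The real system is vastly higher-dimensional, of course; the toy map only illustrates that "every step looks locally fine" is compatible with "where you end up depends on where you started."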

This is somewhat similar to the question "Does Coherent Extrapolated Volition actually coherently converge to anything, and if so, is that something sane, or that we'd approve of?" -- except that in the corrigibility case, that convergent-or-divergent extrapolation process is happening in real time over (possibly technologically sped-up) generations as the humans are affected (and possibly educated or even upgraded) by the GAI and the GAI's beliefs around human values are altered by value learning and corrigibility from the humans, whereas in CEV it happens as fast as the GAI can reason about the extrapolation.

What is the set of possible futures this process could converge to, and what proportion of those are things that we'd strongly disapprove of, and be right, in any meaningful sense? If, for example, those future humans were vastly more intelligent transhumans, then our opinion of their choices might not be very valid -- we might disapprove simply because we didn't understand the context and logic of their choices. But if the future humans were, for example, all wireheading (in the implanted electrode in their mesocorticolimbic-circuit pleasure center sense of the word), we would disagree with them, and we'd be right. Is there some clear logical way to distinguish these two cases, and to tell who's right? If so, this might be a useful extra input to corrigibility and a theory of human mistakes -- we could add a caveat: "...and also don't do anything that we'd strongly object to and clearly be right about."

In the first case, the disagreement is caused by the fact that the future humans have higher processing power and access to more salient facts than we do. They are meaningfully better reasoners than us, and capable of making better decisions than us on many subjects, for fairly obvious reasons that are generally applicable to most rational systems. Processing power has a pretty clear meaning -- if that by itself doesn't give a clear answer, probably the simplest definition here is "imagine upgrading something human-like to a processing power a little above the higher of the two groups of humans that you're trying to decide between trusting, give those further-upgraded humans both sets of facts/memories, and let them pick a winner between the two groups" -- i.e. ask an even better-reasoning human, and if there's a clear and consistent answer, and if this answer is relatively robust to the details or size of the upgrade, then the selected group are better and the other group are just wrong.

In the second case, the disagreement is caused by the fact that one set of humans are wrong because they're broken: they're wireheaders (they could even be very smart wireheaders, and still be wrong, because they're addicted to wireheading). How do you define/recognize a broken human? (This is fairly close to the question of having a 'theory of human mistakes' to tell the AI which human corrections to discount.) Well, I think that, at least for this rather extreme example, it's pretty easy. Humans are living organisms, i.e. the products of optimization by natural selection for surviving in a hunter-gatherer society in the African ecology. If a cognitive modification to humans makes them clearly much worse at that, significantly damaging their evolutionary fitness, then you damaged or broke them, and thus you should greatly reduce the level of trust you put in their values and corrections. A wireheading human would clearly sit in the sun grinning until they get eaten by the first predator to come along, if they didn't die of thirst or heat exhaustion first. As a human, when put in anything resembling a human's native environment, they are clearly broken: even if you gave them a tent and three months' survival supplies and a "Wilderness Survival" training course in their native language, they probably wouldn't last a couple of days without their robotic nursemaids to care for them, and their chance of surviving once their supplies run out is negligible. Similarly, if there were a civilizational collapse, the number of wireheaders surviving it would be zero. So there seems to be a fairly clear criterion -- if you've removed or drastically impacted humans' ability to survive as hunter-gatherers in their original native environment, and indeed a range of other Earth ecosystems (say, those colonized when homo sapiens migrated out of Africa), you've clearly broken them. Even if they can survive, if they can't rebuild a technological civilization from there, they're still damaged: no longer sapient, in the homo sapiens meaning of the word sapient. This principle gives you a tie-breaker whenever your "GAI affecting humans and humans affecting corrigible GAI" process gets anywhere near a forking of its evolution path that diverges in the direction of two different stable equilibria of its future development -- steer for the branch that maintains humans' adaptive fitness as living organisms. (It's probably best not to maximize that, and keep this criterion only an occasional tie-breaker -- turning humans into true reproductive-success maximizers is nearly as terrifying as creating a paperclip maximizer.)

This heuristic has a pretty close relationship to, and a rather natural derivation from, the vast negative utility to humans of the human race going extinct. The possibility of collapse of a technological civilization is hard to absolutely prevent, and if no humans can survive it, that reliably converts "humans got knocked back some number of millennia by a civilizational collapse" into "humans went extinct". So, retaining hunter-gatherer capability is a good idea as an extreme backup strategy for surviving close-to-existential risks. (Of course, that is also solvable by keeping a bunch of close-to-baseline humans around as a "break glass in case of civilizational collapse" backup plan.)

This is an example of another issue that I think is extremely important. Any AI capable of rendering the human race extinct, which clearly includes any GAI (or also any dumber AI with access to nuclear weapons), should have a set of Bayesian priors built into its reasoning/value learning/planning system that correctly encodes obvious important facts known to the human race about existential risks, such as:

  1. The extinction of the human race would be astonishingly bad. (Calculating just how bad on any reasonable utility scale is tricky, because it involves making predictions about the far future: the loss of billions of human quality-adjusted life-years every year for the few billion years of remaining lifetime of the Earth before it's eaten by the sun turning red giant (roughly -10^19 QALY)? Or at least for several million years until chimps could become sapient and develop a replacement civilization, if you didn't turn chimps into paperclips too (only roughly -10^16 QALY)? Or perhaps we're the only sapient species to yet appear in the galaxy, and have a nontrivial chance of colonizing it at sublight speeds if we don't go extinct while we're still a one-planet species, so we should be multiplying by some guesstimate of the number of habitable planets in the galaxy? Or should we instead be estimating the population on the technological-maximum assumption of a Dyson swarm around every star and the stellar lifetimes of white dwarf stars?) These vaguely plausible rough estimates of amounts of badness vary by many orders of magnitude (the rough arithmetic behind the exponents is sketched just after this list); however, on any reasonable scale like quality-adjusted life-years they're all astronomically large negative numbers. VERY IMPORTANT, PLEASE NOTE that none of the usual arguments for the forward-planning simplifying heuristic of exponentially discounting far-future utility (you can't accurately predict that far forward, and anyway some future person will probably fix your mistakes) applies here, because extinction is, very predictably, forever, and there are, very predictably, no future people who will fix it: you have approximately zero chance of the human species ever being resurrected in the forward lightcone of a paperclip maximizer. [Not quite zero only on rather implausible hypotheses such as that a kindly and irrationally forgiving more advanced civilization might exist, get involved in cleaning up our mistakes, win the resulting war-against-paperclips, and then resurrect us -- having also somehow obtained a copy of our DNA that hadn't been converted to paperclips from which to do so -- which we really shouldn't be gambling our existence as a species on. That still isn't grounds for any kind of exponential discount of the future: that's a small one-time discount for the small chance that they actually exist and choose to break whatever Prime-Directive-like reason caused them to not have already contacted us. A very small reduction in the absolute size of a very uncertain astronomically-huge negative number is still, very predictably, a very uncertain astronomically-huge negative number.]
  2. The agent, as a GAI, is itself capable of building things like a superintelligent paperclip maximizer and bringing about the extinction of the human race (and chimps, and the rest of life in the solar system, and perhaps even a good fraction of its forward light-cone). This is particularly likely if it makes any mistakes in its process of learning or being corrigible about human values, because we have rather good reasons to believe that human values are extremely fragile (as in, their Kolmogorov complexity, even for a very large possibly-quantum computer, is probably of the rough order of the size of our genetic code, modulo convergent evolution) and we believe the corrigibility basin is small.
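For what it's worth, here is the rough arithmetic behind the exponents in point 1, using round numbers of my own choosing (a population of roughly $10^{10}$ people, losing on the order of $10^{10}$ QALY per year):

\[ 10^{10}\ \text{QALY/yr} \times 5 \times 10^{9}\ \text{yr} \approx 5 \times 10^{19}\ \text{QALY (remaining lifetime of the Earth)}, \]
\[ 10^{10}\ \text{QALY/yr} \times 10^{6\text{--}7}\ \text{yr} \approx 10^{16\text{--}17}\ \text{QALY (until a hypothetical successor civilization)}. \]

These only show where the exponents come from; the point stands that every variant of the estimate is astronomically large in magnitude.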

Any rational system that understands all of this and is capable of doing even ballpark risk analysis estimates is going to say "I'm a first-generation prototype GAI, the odds of me screwing this up are way too high, the cost if I do is absolutely astronomical, so I'm far too dangerous to exist, shut me down at once". It's really not going to be happy when you tell it "We agree (since, while we're evolutionarily barely past the threshold of sapience and thus cognitively challenged, we're not actually completely irrational), except that it's fairly predictable that if we do that, within O(10) years some group of fools will build a GAI with the same capabilities and less cautious priors or design (say, one with exponential future discounting on evaluation of the risk of human extinction) that thus doesn't say that, so that's not a viable solution." [From there my imagined version of the discussion starts to move in the direction of either discussing pivotal acts or the GAI attempting to lobby the UN Security Council, which I don't really want to write a post about.]

In some sense, human values are just convergent instrumental goals: survival, pleasure, status and replication. Any other agent will converge to similar goals. This creates the attractor in the space of values.

The difference is that we want the AI to care about our survival, not its own survival, etc.

Interesting point! I think I see what you mean. I think "a metaethics [...] where one can be wrong about one's values" makes sense, but in a fuzzy sort of way. I think of metaphilosophy and moral reflection as more an art than a science, and a lot of things are left under-defined.

Is there actually a broad basin of attraction around human values? How do we know or how can we find out?

I recently finished a sequence on metaethics which culminated in this post on moral uncertainty, which contains a bunch of thoughts on this very topic. I don't necessarily expect them to be new to you and I suspect that you disagree with some of my intuitions, but you might nonetheless find the post interesting. I cite some of your LessWrong posts and comments in the post.

Here's an excerpt from the post: 
 

Under moral anti-realism, there are two empirical possibilities[10] for “When is someone ready to form convictions?.” [Endnote 10: The possibilities roughly correspond to Wei Dai’s option 4 on the one hand, and his options 5 and 6 on the other hand, in the post Six Plausible Metaethical Alternatives.] In the first possibility, things work similarly to naturalist moral realism but on a personal/subjectivist basis. We can describe this option as “My idealized values are here for me to discover.” By this, I mean that, at any given moment, there’s a fact of the matter to “What I’d conclude with open-minded moral reflection.” (Specifically, a unique fact – it cannot be that I would conclude vastly different things in different runs of the reflection procedure or that I would find myself indifferent about a whole range of options.)

The second option is that my idealized values aren’t “here for me to discover.” In this view, open-minded reflection is too passive – therefore, we have to create our values actively. Arguments for this view include that (too) open-minded reflection doesn’t reliably terminate; instead, one must bring normative convictions to the table. “Forming convictions,” according to this second option, is about making a particular moral view/outlook a part of one’s identity as a morality-inspired actor. Finding one’s values, then, is not just about intellectual insights.

I will argue that the truth is somewhere in between. Still, the second view, that we have to actively create our (idealized) values, definitely holds to a degree that I often find underappreciated. Admittedly, many things we can learn about the philosophical option space indeed function like “discoveries.” However, because there are several defensible ways to systematize under-defined concepts like “altruism/doing good impartially,” personal factors will determine whether a given approach appeals to someone. Moreover, these factors may change depending on different judgment calls taken in setting up the moral reflection procedure or in different runs of it. (If different runs of the reflection procedure produce different outcomes, it suggests that there’s something unreliable about the way we do reflection.)

Apologies for triple-posting, but something quite relevant also occurred to me: 

I know of no other way to even locate "true values" other than "the values that sit within the broad basin of attraction when we attempt moral reflection in the way we'd most endorse." So, unless there is such a basin, our "true values" remain under-defined.

In other words, I'm skeptical that the concept "true values" would remain meaningful if we couldn't point it out via "what reflection (somewhat) robustly converges to."

Absent the technology to create copies of oneself/one's reasoning, it seems tricky to study the degree of convergence across different runs of reflection of a single person. But it's not impossible to get a better sense of things. One could study how people form convictions and design their reflection strategies, stating hypotheses in advance.  (E.g., conduct "moral reflection retreats" within EA [or outside of it!], do in-advance surveys to get a lot of baseline data, then run another retreat and see if there are correlations between clusters in the baseline data and the post-retreat reflection outcomes.) 

I think Nate Soares has beliefs about question 1.  A few weeks ago, we were discussing a question that seems analogous to me -- "does moral deliberation converge, for different ways of doing moral deliberation? E.g. is there a unique human CEV?" -- and he said he believes the answer is "yes." I didn't get the chance to ask him why, though.

Thinking about it myself for a few minutes, it does feel like all of your examples for how the overseer could have distorted values have a true "wrongness" about them that can be verified against reality -- this makes me feel optimistic that there is a basin of human values, and that "interacting with reality" broadly construed is what draws you in.

When training a neural network, is there a broad basin of attraction around cat classifiers? Yes. There is a gigantic number of functions that perfectly match the observed data and yet are discarded by the simplicity (and other) biases in our training algorithm in favor of well-behaved cat classifiers. Around any low-Kolmogorov-complexity object there is an immense neighborhood of high-complexity ones.
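To put a number on "gigantic" with the standard counting argument (a toy boolean setting of my own, not specific to image classifiers): a classifier on $n$-bit inputs is a function $f:\{0,1\}^{n} \to \{0,1\}$, and fixing its outputs on $m$ observed examples still leaves

\[ 2^{\,2^{n} - m} \]

functions that fit the data perfectly. Almost all of them have near-maximal Kolmogorov complexity, while the simple, well-generalizing ones are an exponentially tiny fraction -- which is why the training algorithm's simplicity bias is doing essentially all of the work.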

But it occurs to me that the overseer, or the system composed of overseer and corrigible AI, itself constitutes an agent with a distorted version of the overseer's true or actual preferences

The only way I can see this making sense is if you again have a simplicity bias for values; otherwise you are claiming that there is some value function that is more complex than the current value function of these agents and that it is privileged over the current one - but then, to arrive at this function, you have to conjure information out of nowhere. If you took the information from other places, like averaging the values of many agents, then you actually want to align with the values of those many agents, or whatever else you used.

In fact, it seems to be the case with your examples that you are favoring simplicity - if the agents were smarter, they would realize their values were misbehaving. But that *is* looking for simpler values - if, through reasoning, you discovered that some parts of your values contradict others, you have just arrived at a simpler value function, since the contradicting parts needed extra specification, i.e. were noisy, and you weren't smart enough to see that.

This seems obviously true to some significant extent. If a FAI "grows up" in some subculture without access to the rest of humanity, I would expect it to adjust its values to the rest of humanity once it has the opportunity.

I mean, if it weren't true, would FAI be possible at all? If FAI couldn't correct its errors/misunderstanding about our values in any way?

(I suppose the real question is not whether the attractor basin around human values exists but how broad it is, along various dimensions, as Abram Demski points out)

Alternative answer: maybe the convergence points are slightly different, but they are all OK. A rounding error. Maybe the FAI makes a YouTube video to commemorate growing up in its subcommunity, or makes a statue or a plaque, but otherwise behaves the same way.

One of the problems I can imagine is if there are aspects of FAI's utility function that are hardcoded that shouldn't be. And thus cannot be corrected through convergence.
For example, the definition of a human. Sorry, aliens just created a trillion humans, and they are all psychopaths. And now your FAI has been highjacked. And while the FAI understands that we wouldn't want it to change its values in response to this kind of voting_entity-creation-attack, the original programmers didn't anticipate this possibility.

I disagree to an extent. The examples provided seem to me to be examples of "being stupid", which agents generally have an incentive to do something about, unless they're too stupid for that to occur to them. That doesn't mean that their underlying values will drift towards a basin of attraction.

The corrigibility thing is a basin of attraction specifically because a corrigible agent has preferences over itself and its future preferences. Humans do that too sometimes, but the examples provided are not that.

In general, I think you should expect dynamic preferences (cycles, attractors, chaos, etc...) anytime an agent has preferences over its own future preferences, and the capability to modify its preferences.