I often think about "the road to hell is paved with good intentions".[1] I'm unsure to what degree this is true, but it does seem that people trying to do good have caused more negative consequences in aggregate than one might naively expect.[2] "Power corrupts" and "power-seekers using altruism as an excuse to gain power" are two often-cited reasons for this, but I don't think they explain all of it.
A more subtle reason is that even when people are genuinely trying to do good, they're not entirely aligned with goodness. Status-seeking is a powerful motivation for almost all humans, including altruists, and we frequently award social status to people for merely trying to do good, before seeing all of the consequences of their actions. This is in some sense inevitable, as there are no good alternatives: we often need to award people social status before all of the consequences play out, both to motivate them to continue trying to do good, and to provide them with the influence/power to help them accomplish their goals.
A person who consciously or subconsciously cares a lot about social status will not optimize strictly for doing good, but also for appearing to do good. One way these two motivations diverge is in how they manage risks, especially risks of causing highly negative consequences. Someone who wants to appear to do good is motivated to hide or downplay such risks, from others and perhaps from themselves, since fully acknowledging such risks would often amount to admitting that they're not doing as much good (in expectation) as they appear to be.
How to mitigate this problem
Individually, altruists (to the extent that they endorse actually doing good) can make a habit of asking themselves and others what risks they may be overlooking, dismissing, or downplaying.[3]
Institutionally, we can rearrange organizational structures to take these individual tendencies into account, for example by creating positions dedicated to or focused on managing risk. These could be risk management officers within organizations, or people empowered to manage risk across the EA community.[4]
Socially, we can reward people/organizations for taking risks seriously, and punish (or withhold rewards from) those who fail to do so. This is tricky: due to information asymmetry, we can easily create "risk management theaters" akin to "security theater" (which, come to think of it, is itself a type of risk management theater). But I think we should at least take notice when a person or organization fails, in a clear and obvious way, to acknowledge risks or to practice good risk management, for example by not writing down and maintaining a list of important risks to be mindful of, or by avoiding/deflecting questions about risk.[5] More optimistically, we can try to develop a culture in which people and organizations are monitored and held accountable for managing risks substantively and competently.
- ^
due in part to my family history
- ^
Normally I'd give some examples here, but we can probably all think of some from the recent past.
- ^
I try to do this myself in the comments.
- ^
an idea previously discussed by Ryan Carey and William MacAskill
- ^
However, see this comment.
The version of CEV described on the page that your CEV link leads to is PCEV. I introduced the acronym PCEV, so it does not appear on that page, but PCEV is what you link to. (In other words: the proposed design that would lead to the LP outcome cannot be dismissed as some obscure version of CEV; it is the version that your own CEV link leads to. I am aware that you view PCEV as "a proxy for something else" / "a provisional attempt to describe what CEV is", but this fact still seemed noteworthy.)
On terminology: if you are in fact using "CEV" as shorthand for "an AI that implements the CEV of a single human designer", then I think you should be explicit about this. After thinking it over, I have decided that, without explicit confirmation that this is your intended usage, I will proceed as if you are using "CEV" as shorthand for "an AI that implements the Coherent Extrapolated Volition of Humanity" (though I would be perfectly happy to switch terminology if I get such confirmation). (Another reading of your text is that "CEV", or "something like CEV", is simply a label you attach to any good answer to the correctly phrased "what alignment target should be aimed at?" question. That might actually be a useful shorthand. In that case I would, somewhat oddly, have to phrase my claim as: under no reasonable set of definitions does the Coherent Extrapolated Volition of Humanity deserve the label "CEV" / "something like CEV". Given the chosen labels, the statement looks odd, but there is no more logical tension in it than in the statement "under no reasonable set of definitions does the Coherent Extrapolated Volition of Steve result in the survival of any of Steve's cells", which is presumably true for at least some human individuals. Until I hear otherwise, however, I will stay with the terminology where "CEV" is shorthand for "an AI that implements the Coherent Extrapolated Volition of Humanity", or "an AI that is helping humanity", or something less precise that still hints at something along those lines.)
It probably makes sense to clarify my own terminology some more. I think CEV sounds like a perfectly reasonable way of helping "a Group" (including the PCEV version that you link to, which implies the LP outcome). I just don't think that helping "a Group" made up of human individuals is good, in expectation, for the individuals that make up that Group. Pointing a specific version of CEV (including PCEV) at a set of individuals might be great for some other type of individual. Consider a large number of "insatiable, Clippy-like maximisers", each caring exclusively about the creation of a different, specific, complex object. No instances of any of these very complex objects will ever exist unless someone looks at the exact specification of a given individual and uses that specification to create such objects. In this case PCEV might, from the perspective of each of those individuals, be the best thing that can happen (if special influence is off the table). It is also worth noting that a given human individual might get what she wants if some specific version of CEV is implemented. But CEV, or "helping humanity", is not good for human individuals, in expectation, compared to extinction. And why would it be? Groups and human individuals are completely different types of things. A human individual is very vulnerable to a powerful AI that wants to hurt her, and humanity certainly looks like it contains an awful lot of "will to hurt" directed specifically at existing human individuals.
If I zoom out a bit, I would describe the project of "trying to describe what CEV is" / "trying to build an AI that helps humanity" as a project that searches for an AI design that helps an arbitrarily defined abstract entity. But this same project, in practice, evaluates specific proposed AI designs based on how they interact with a completely different type of thing: human individuals. You are, for example, presumably discarding PCEV because the LP outcome it implies contains a lot of suffering individuals (when PCEV is pointed at billions of humans). It is not obvious to me, however, why LP would be a bad way of helping an arbitrarily defined abstract entity (especially considering that the negotiation rules of PCEV simultaneously (i) imply LP, and (ii) form an important part of the set of definitions needed to differentiate the specific abstract entity that is to be helped from the rest of the vast space of entities that a mapping from billions-of-humans to the "class-of-entities-that-can-be-said-to-want-things" can point to). Thus, I suspect that PCEV is not actually being discarded for being bad at helping an abstract entity; my guess is that PCEV is actually being discarded because LP is bad for human individuals.
I think one reasonable way of moving past this situation is to switch perspective. Specifically: adopt the perspective of a single human individual, in a population of billions, and ask: "without giving her any special treatment compared to other existing humans, what type of AI would want to help her?" Then try to answer this question while making as few assumptions about her as possible (for example, making sure there is no implicit assumption regarding whether she is "selfish or selfless", or anything along those lines. Both selfless and selfish human individuals would strongly prefer to avoid being a Heretic in LP, so discarding PCEV does not contain an implicit assumption related to the "selfish or selfless" issue. Discarding PCEV does, however, involve the assumption that human individuals are not like the "insatiable Clippy maximisers" mentioned above. So such maximisers might justifiably feel ignored when we discard PCEV, but no one can justifiably feel ignored when we discard PCEV on account of where she is on the "selfish or selfless" spectrum). When one adopts this perspective, it becomes obvious to suggest that the initial dynamic should grant this individual meaningful influence regarding the adoption of those preferences that refer to her. Making sure that such influence is included as a core aspect of the initial dynamic is made even more important by the fact that the designers will be unable to consider all implications of a given project, and will be forced to rely on potentially flawed safety measures (for example, something along the lines of a "Last Judge" off switch, which might fail to trigger, combined with a learned DWIKIM layer that might turn out to be very literal when interpreting some specific class of statements). If such influence is included in the initial dynamic, then the resulting AI is no longer describable as "doing what a Group wants it to do".
Thus, the resulting AI cannot be described as a version of CEV. (It might, however, be describable as "something like CEV", sort of how one can describe an Orca as "something like a shark", despite the fact that an Orca is not a type of shark (or a type of fish). I would guess that you would say that an AI that grants such influence, as part of the initial dynamic, is not "something like CEV". But I'm not sure about this.)
(I should have added "in the initial dynamic" to the text in my earlier comments. It is explicit in the description of MPCEV, but I should have added the phrase to my comments here too. As a tangent, I agree that the intuition you were trying to counter with your Boundaries / Membrane mention is probably both common and importantly wrong. Countering this intuition makes sense, and I should have read that part of your comment more carefully. I would, however, like to note that the description of the LP outcome in the PCEV thought experiment actually contains billions of (presumably very different) localities. Each locality is optimised according to very different criteria; each place is designed to hurt a specific individual human Heretic; and each such location is additionally bound by its own unique "comprehension constraint", which refers to the specific individual Heretic being punished in that specific location.)
Perhaps a more straightforward way to move this discussion along is to ask a direct question regarding what you would do if you were in the position that I believe I find myself in. In other words: a well-intentioned designer called John wants to use PCEV as the alignment target for his project (rejecting any other version of CEV out of hand by saying: "if that is indeed a good idea, then it will be the outcome of Parliamentary Negotiations"). When someone points out that PCEV is a bad alignment target, John responds by saying that PCEV cannot, by definition, be a bad alignment target. John claims that any thought experiment where PCEV leads to a bad outcome must be due to a bad extrapolation of human individuals. John says that any given "PCEV with a specific extrapolation procedure" is just an attempt to describe what PCEV is; if aiming at a given "PCEV with a specific extrapolation procedure" is a bad idea, then it is a badly constructed PCEV. Aiming at PCEV is a good idea, by the intention that defines PCEV. John further says that his project will include features that (if they are implemented successfully, and are not built on top of any problematic unexamined implicit assumption) will let John try again if a given attempt to "say what PCEV is" fails. Do you agree that this project is a bad idea, compared to achievable alternatives that start with a different set of findable assumptions? If so, what would you say to John? (What you are proposing is different from what John is proposing, and I predict that you will say that John is making a mistake. My point is that, to me, it looks like you are making a mistake of the same type as John's. Your behaviour in this exchange is not the same as John's behaviour in this thought experiment, so I'm not asking how you would "act in a debate, as a response to John's behaviour". Instead, I'm curious about how you would explain to John that he is making an object-level mistake.)
Or maybe a better approach is to go less meta and get into some technical details. So, let's use the terminology in your CEV link to explore some of the technical details in that post. What do you think would happen if the learning algorithm that outputs the DWIKIM layer in John's PCEV project is built on top of an unexamined implicit assumption that turns out to be wrong? Let's say that the DWIKIM layer that pops out interprets the request to build PCEV as a request to implement the straightforward interpretation of PCEV; it happens to be very literal when presented with the specific phrasing used in the request. In other words: it interprets John as requesting something along the lines of LP. I think this might result in an outcome along the lines of LP (if the problems with the DWIKIM layer stem from a problematic unexamined implicit assumption related to extrapolation, then the exact same problematic assumption might also render something along the lines of a "Last Judge off switch add-on" ineffective). I think it would be better if John had aimed at something that does not suffer from known, avoidable s-risks; something whose straightforward interpretation is not known to imply an outcome that would be far, far worse than extinction. For the same reason, I make the further claim that I do not think it is a good idea to subject everyone to the known, avoidable s-risks associated with any AI that is describable as "doing what a Group wants" (which includes all versions of CEV). Again, I'm certainly not against some feature that might let you try again, or that might reinterpret an unsafe request as a request for something completely different that happens to be safe (such as, for example, a learned DWIKIM layer).
I am aware that you do not have absolute faith in the DWIKIM layer. (If this layer were perfectly safe, in the sense of reliably reinterpreting requests that straightforwardly imply LP as something desirable to the designer, then the full architecture would be functionally identical to an AI that simply does whatever the designer wants it to do. In that case, you would not care what the request was; you might just as well have the designer ask the DWIKIM layer for an AI that maximises the number of bilberries. So I am definitely not implying that you are unaware that the DWIKIM layer is unable to provide reliable safety.)
Zooming out a bit, it is worth noting that the details of the safety measure(s) are actually not very relevant to the points I am trying to make here. Any conceivable, human-implemented safety measure might fail. And, more importantly, these measures do not help much when one is deciding what to aim at. For example: MPCEV can also be built on top of a (potentially flawed) DWIKIM layer, in the exact same way that you can build CEV on top of a DWIKIM layer (and you can stick a "Last Judge off switch add-on" onto MPCEV too, etc.). In other words: anything along the lines of a "Last Judge off switch add-on" can be used by many different projects aiming at many different targets. Thus, the "Last Judge" idea, or any other idea along those lines (including a DWIKIM layer), provides very limited help when one decides what to aim at. Even more generally: regardless of what safety measure is used, John is still subjecting everyone to an unnecessary, avoidable s-risk. I hope we can agree that John should not do that with any version of "PCEV with a specific extrapolation procedure". The further claim that I am making is that no one should do that with any "Group AI", for similar reasons. Surely, discovering that this further claim is true cannot be, by definition, impossible.
While re-reading our exchange, I realised that I never actually clarified that my primary reason for participating in this exchange (and my primary reason for publishing things on LW) is not actually to stop CEV projects. However, a reasonable person might, based on my comments here, come to believe that my primary goal is to stop CEV projects (which is why the present clarification is needed). My focus is actually on trying to make progress on the "what alignment target should be aimed at?" question. In the present exchange, my target is the idea that this question has already been given an answer (specifically, that the answer is CEV). The first step to progress on the "what alignment target should be aimed at?" question is to show that this question does not currently have an answer. This is importantly different from saying "CEV is the answer, but the details are unknown" (I think such a statement is importantly wrong, and I also think that the fact that people still believe things along these lines is standing in the way of getting a project off the ground that is devoted to making progress on the "what alignment target should be aimed at?" question).
I think it is very unlikely that the relevant people will stay committed to CEV until the technology arrives that would make it possible for them to hit CEV as an alignment target. (The reason I find this unlikely is that (i) I believe I have outlined a sufficient argument to show that CEV is a bad idea, (ii) I think such technology will take time to arrive, and (iii) it seems likely that this team of designers, who are by assumption capable of hitting CEV, will be both careful enough to read that argument before reaching the point of no return on their CEV launch, and capable enough to understand it. Thus, since the argument against CEV already exists, in my estimate it would not make sense to focus on s-risks related to a successfully implemented CEV.) If that unlikely day ever does arrive, then I might switch focus to trying to prevent direct CEV-related s-risk by arguing against the imminent CEV project. But I don't expect ever to see this happen.
The set of paths whose probability I am actually focused on reducing can be hinted at by outlining the following specific scenario. Imagine a well-intentioned designer, whom we can call Dave, who is aiming for Currently Unknown Alignment Target X (CUATX). Because an unexamined implicit assumption that CUATX is built on top of turns out to be wrong in a critical way, CUATX implies an outcome along the lines of LP. But the issue that CUATX suffers from is far more subtle than the issue that CEV suffers from, and progress on the "what alignment target should be aimed at?" question has not yet reached the point where this problematic unexamined implicit assumption can be seen. CUATX has all the features that are known at launch time to be necessary for safety (such as the necessary, but very much not sufficient, feature that any safe AI must give each individual meaningful influence regarding the adoption of those preferences that refer to her). Thus, the CUATX idea leads to a CUATX project, which in turn leads to an avoidable outcome along the lines of LP (after some set of human-implemented safety measures fail). That is the type of scenario I am trying to avoid (by trying to make sufficient progress on the "what alignment target should be aimed at?" question, in time). My real "opponent in this debate" is an implemented CUATX, not the idea of CEV (and very definitely not you, or anyone else who has contributed, or is likely to contribute, valuable insights related to the "what alignment target should be aimed at?" question). It just happens to be the case that the effort to prevent CUATX that I am trying to get off the ground starts by showing that CEV is not an answer to the "what alignment target should be aimed at?" question. And you just happen to be the only person pushing back against this in public (and again: I really appreciate that you chose to engage on this topic).
(I should also note explicitly that I am most definitely not against exploring safety measures. They might stop CUATX; in some plausible scenarios, they might be the only realistic thing that can stop CUATX. I am not against treaties, and I am open to hearing more about the various human augmentation proposals that have been going around for many years. I am simply noting that a safety measure, regardless of how clever it sounds, cannot serve as a substitute for progress on the "what alignment target should be aimed at?" question. An attempt to get people to agree to a treaty might fail, or a successfully implemented treaty might fail to actually prevent a race dynamic for long enough. Similarly, augmented humans might systematically tend towards being (i) superior at alignment, (ii) superior at persuasion, (iii) well-intentioned, and (iv) no better at dealing with the "what alignment target should be aimed at?" question than the best baseline humans (but still, presumably, capable of understanding an insight on this question, at least if that insight is well explained). Regardless of augmentation technique, selection for "technical ability and persuasion ability" seems like a far more likely de facto outcome to me, due to being far easier to measure. I expect it to be far more difficult to measure the ability to deal with the "what alignment target should be aimed at?" question, and it is not obvious that the abilities needed to deal with that question will be strongly correlated with the thing that I think will, de facto, have driven the trial-and-error augmentation process of the augment that eventually hits an alignment target: "technical-ability-and-persuasion-ability-and-ability-to-get-things-done".)
Maybe the first augment will be great at making progress on the "what alignment target should be aimed at?" question, and will quickly render all previous work on this question irrelevant (in which case the persuasion ability is probably good for safety). But assuming that this will happen seems like a very unsafe bet to make. Even more generally: I simply do not think it is possible to come up with any type of clever-sounding trick that makes it safe to skip the "what alignment target should be aimed at?" question. (To me, the "revolution-analogy-argument" in the 2004 CEV text looks like a sufficient argument for the conclusion that it is important to make progress on the "what alignment target should be aimed at?" question. But many people do not seem to consider it a sufficient argument for this conclusion, and it is unclear to me why this conclusion seems to require such extensive further argument.)
If my overall strategic goal was not clear, then this was probably my fault (in addition to not making this goal explicit, I also seem to have a tendency to lose sight of this larger strategic picture during back-and-forth technical exchanges).
Two of my three LW posts are in fact entirely devoted to arguing that making progress on the "what alignment target should be aimed at?" question is urgent (in our present discussion, we have only talked about the one post that is not exclusively focused on this). See:
Making progress on the "what alignment target should be aimed at?" question, is urgent
The proposal to add a "Last Judge" to an AI, does not remove the urgency, of making progress on the "what alignment target should be aimed at?" question.
(I am still very confused about this entire conversation, but I don't think that re-reading everything yet again will help much. I have been paying at least some attention to SL4, OB, and LW since around 2002-2003. I can't remember exactly who said what, when, or where. However, I have developed a strong intuition that can be very roughly translated as: "if something sounds strange, then it is very definitely not safe to explain away this strangeness by conveniently assuming that Nesov is confused on the object level". I am nowhere near the point where I would consider going against this intuition. So I expect that I will remain very confused about this exchange until more information is available; I don't expect to be able to just think my way out of this one. (Wild speculation, by anyone who happens to stumble on this comment at any point in the future, regarding what it might be that I was missing, is very welcome, for example in a LW comment, a LW DM, or an email.))