I often think about "the road to hell is paved with good intentions".[1] I'm unsure to what degree this is true, but it does seem that people trying to do good have caused more negative consequences in aggregate than one might naively expect.[2] "Power corrupts" and "power-seekers using altruism as an excuse to gain power" are two often-cited reasons for this, but I don't think they explain all of it.
A more subtle reason is that even when people are genuinely trying to do good, they're not entirely aligned with goodness. Status-seeking is a powerful motivation for almost all humans, including altruists, and we frequently award social status to people for merely trying to do good, before seeing all of the consequences of their actions. This is in some sense inevitable, as there are no good alternatives: we often need to award people social status before all of the consequences play out, both to motivate them to continue trying to do good, and to provide them with the influence/power to help them accomplish their goals.
A person who consciously or subconsciously cares a lot about social status will not optimize strictly for doing good, but also for appearing to do good. One way these two motivations diverge is in how they manage risks, especially risks of causing highly negative consequences. Someone who wants to appear to do good is motivated to hide or downplay such risks, from others and perhaps from themselves, since fully acknowledging such risks would often amount to admitting that they're not doing as much good (in expectation) as they appear to be.
How to mitigate this problem
Individually, altruists (to the extent that they endorse actually doing good) can make a habit of asking themselves and others what risks they may be overlooking, dismissing, or downplaying.[3]
Institutionally, we can rearrange organizational structures to take these individual tendencies into account, for example by creating positions dedicated to or focused on managing risk. These could be risk management officers within organizations, or people empowered to manage risk across the EA community.[4]
Socially, we can reward people/organizations for taking risks seriously, and punish (or withhold rewards from) those who fail to do so. This is tricky: due to information asymmetry, we can easily create "risk management theaters" akin to "security theater" (which, come to think of it, is itself a type of risk management theater). But I think we should at least take notice when a person or organization fails, in a clear and obvious way, to acknowledge risks or to practice good risk management, for example by not writing down and maintaining a list of important risks to be mindful of, or by avoiding/deflecting questions about risk.[5] More optimistically, we can try to develop a culture in which people and organizations are monitored and held accountable for managing risks substantively and competently.
- ^
due in part to my family history
- ^
Normally I'd give some examples here, but we can probably all think of some from the recent past.
- ^
I try to do this myself in the comments.
- ^
an idea previously discussed by Ryan Carey and William MacAskill
- ^
However, see this comment.
The version of CEV described on the page that your CEV link leads to is PCEV. I introduced the acronym PCEV, so it does not appear on that page, but PCEV is what you link to. (In other words: the proposed design that would lead to the LP outcome cannot be dismissed as some obscure version of CEV; it is the version that your own CEV link leads to. I am aware that you view PCEV as "a proxy for something else" / "a provisional attempt to describe what CEV is", but this fact still seemed noteworthy.)
On terminology: if you are in fact using "CEV" as shorthand for "an AI that implements the CEV of a single human designer", then I think you should be explicit about this. After thinking it over, I have decided that, without explicit confirmation that this is your intended usage, I will proceed as if you are using "CEV" as shorthand for "an AI that implements the Coherent Extrapolated Volition of Humanity" (though I would be perfectly happy to switch terminology if I get such confirmation). (Another reading of your text is that "CEV", or "something like CEV", is simply a label you attach to any good answer to the correctly phrased "what alignment target should be aimed at?" question. That might actually be a useful shorthand. In that case I would, somewhat oddly, have to phrase my claim as: under no reasonable set of definitions does the Coherent Extrapolated Volition of Humanity deserve the label "CEV" / "something like CEV". Given the chosen labels, the statement looks odd, but there is no more logical tension in it than in the statement "under no reasonable set of definitions does the Coherent Extrapolated Volition of Steve result in the survival of any of Steve's cells", which is presumably true for at least some human individuals. Until I hear otherwise, however, I will stay with the terminology where "CEV" is shorthand for "an AI that implements the Coherent Extrapolated Volition of Humanity", or "an AI that is helping humanity", or something less precise that still hints at something along those lines.)
It probably makes sense to clarify my own terminology some more. I think CEV sounds like a perfectly reasonable way of helping "a Group" (including the PCEV version that you link to, which implies the LP outcome). I just don't think that helping "a Group" made up of human individuals is good, in expectation, for the individuals that make up that Group. Pointing a specific version of CEV (including PCEV) at a set of individuals might be great for some other type of individual. Consider a large number of "insatiable, Clippy-like maximisers", each caring exclusively about the creation of a different, specific, complex object. No instances of any of these very complex objects will ever exist unless someone looks at the exact specification of a given individual and uses that specification to create such objects. In this case PCEV might, from the perspective of each of those individuals, be the best thing that can happen (if special influence is off the table). It is also worth noting that a given human individual might get what she wants if some specific version of CEV is implemented. But CEV, or "helping humanity", is not good for human individuals, in expectation, compared to extinction. And why would it be? Groups and human individuals are completely different types of things. A human individual is very vulnerable to a powerful AI that wants to hurt her, and humanity certainly looks like it contains an awful lot of "will to hurt" directed specifically at existing human individuals.
If I zoom out a bit, I would describe the project of "trying to describe what CEV is" / "trying to build an AI that helps humanity" as a project that searches for an AI design that helps an arbitrarily defined abstract entity. But this same project, in practice, evaluates specific proposed AI designs based on how they interact with a completely different type of thing: human individuals. You are, for example, presumably discarding PCEV because the LP outcome it implies contains a lot of suffering individuals (when PCEV is pointed at billions of humans). It is not obvious to me, however, why LP would be a bad way of helping an arbitrarily defined abstract entity (especially considering that the negotiation rules of PCEV simultaneously (i) imply LP, and (ii) form an important part of the set of definitions needed to differentiate the specific abstract entity that is to be helped from the rest of the vast space of entities that a mapping from billions-of-humans to the "class-of-entities-that-can-be-said-to-want-things" can point to). Thus, I suspect that PCEV is not actually being discarded for being bad at helping an abstract entity; my guess is that PCEV is actually being discarded because LP is bad for human individuals.
I think one reasonable way of moving past this situation is to switch perspective. Specifically: adopt the perspective of a single human individual, in a population of billions, and ask: "without giving her any special treatment compared to other existing humans, what type of AI would want to help her?" Then try to answer this question while making as few assumptions about her as possible (for example, making sure there is no implicit assumption regarding whether she is "selfish or selfless", or anything along those lines. Both selfless and selfish human individuals would strongly prefer to avoid being a Heretic in LP, so discarding PCEV does not contain an implicit assumption related to the "selfish or selfless" issue. Discarding PCEV does, however, involve the assumption that human individuals are not like the "insatiable Clippy maximisers" mentioned above. So such maximisers might justifiably feel ignored when we discard PCEV, but no one can justifiably feel ignored when we discard PCEV on account of where she is on the "selfish or selfless" spectrum). When one adopts this perspective, it becomes obvious to suggest that the initial dynamic should grant this individual meaningful influence regarding the adoption of those preferences that refer to her. Making sure that such influence is included as a core aspect of the initial dynamic is made even more important by the fact that the designers will be unable to consider all implications of a given project, and will be forced to rely on potentially flawed safety measures (for example, something along the lines of a "Last Judge" off switch, which might fail to trigger, combined with a learned DWIKIM layer that might turn out to be very literal when interpreting some specific class of statements). If such influence is included in the initial dynamic, then the resulting AI is no longer describable as "doing what a Group wants it to do".
Thus, the resulting AI cannot be described as a version of CEV. (It might, however, be describable as "something like CEV", sort of how one can describe an Orca as "something like a shark", despite the fact that an Orca is not a type of shark (or a type of fish). I would guess that you would say that an AI that grants such influence, as part of the initial dynamic, is not "something like CEV". But I'm not sure about this.)
(I should have added "in the initial dynamic" to the text in my earlier comments. It is explicit in the description of MPCEV, but I should have added the phrase to my comments here too. As a tangent, I agree that the intuition you were trying to counter with your Boundaries / Membrane mention is probably both common and importantly wrong. Countering this intuition makes sense, and I should have read that part of your comment more carefully. I would, however, like to note that the description of the LP outcome in the PCEV thought experiment actually contains billions of (presumably very different) localities. Each locality is optimised according to very different criteria; each place is designed to hurt a specific individual human Heretic; and each such location is additionally bound by its own unique "comprehension constraint", which refers to the specific individual Heretic being punished in that specific location.)
Perhaps a more straightforward way to move this discussion along is to ask a direct question regarding what you would do if you were in the position that I believe I find myself in. In other words: a well-intentioned designer called John wants to use PCEV as the alignment target for his project (rejecting any other version of CEV out of hand by saying: "if that is indeed a good idea, then it will be the outcome of Parliamentary Negotiations"). When someone points out that PCEV is a bad alignment target, John responds by saying that PCEV cannot, by definition, be a bad alignment target. John claims that any thought experiment where PCEV leads to a bad outcome must be due to a bad extrapolation of human individuals. John says that any given "PCEV with a specific extrapolation procedure" is just an attempt to describe what PCEV is; if aiming at a given "PCEV with a specific extrapolation procedure" is a bad idea, then it is a badly constructed PCEV. Aiming at PCEV is a good idea, by the intention that defines PCEV. John further says that his project will include features that (if they are implemented successfully, and are not built on top of any problematic unexamined implicit assumption) will let John try again if a given attempt to "say what PCEV is" fails. Do you agree that this project is a bad idea, compared to achievable alternatives that start with a different set of findable assumptions? If so, what would you say to John? (What you are proposing is different from what John is proposing, and I predict that you will say that John is making a mistake. My point is that, to me, it looks like you are making a mistake of the same type as John's. Your behaviour in this exchange is not the same as John's behaviour in this thought experiment, so I'm not asking how you would "act in a debate, as a response to John's behaviour". Instead, I'm curious about how you would explain to John that he is making an object-level mistake.)
Or maybe a better approach is to go less meta and get into some technical details. So, let's use the terminology in your CEV link to explore some of the technical details in that post. What do you think would happen if the learning algorithm that outputs the DWIKIM layer in John's PCEV project is built on top of an unexamined implicit assumption that turns out to be wrong? Let's say that the DWIKIM layer that pops out interprets the request to build PCEV as a request to implement the straightforward interpretation of PCEV; it happens to be very literal when presented with the specific phrasing used in the request. In other words: it interprets John as requesting something along the lines of LP. I think this might result in an outcome along the lines of LP (if the problems with the DWIKIM layer stem from a problematic unexamined implicit assumption related to extrapolation, then the exact same problematic assumption might also render something along the lines of a "Last Judge off switch add-on" ineffective). I think it would be better if John had aimed at something that does not suffer from known, avoidable s-risks; something whose straightforward interpretation is not known to imply an outcome that would be far, far worse than extinction. For the same reason, I make the further claim that I do not think it is a good idea to subject everyone to the known, avoidable s-risks associated with any AI that is describable as "doing what a Group wants" (which includes all versions of CEV). Again, I'm certainly not against some feature that might let you try again, or that might reinterpret an unsafe request as a request for something completely different that happens to be safe (such as, for example, a learned DWIKIM layer).
I am aware that you do not have absolute faith in the DWIKIM layer. (If this layer were perfectly safe, in the sense of reliably reinterpreting requests that straightforwardly imply LP as something desirable to the designer, then the full architecture would be functionally identical to an AI that simply does whatever the designer wants it to do. In that case, you would not care what the request was; you might just as well have the designer ask the DWIKIM layer for an AI that maximises the number of bilberries. So I am definitely not implying that you are unaware that the DWIKIM layer is unable to provide reliable safety.)
Zooming out a bit, it is worth noting that the details of the safety measure(s) are actually not very relevant to the points I am trying to make here. Any conceivable, human-implemented safety measure might fail. And, more importantly, these measures do not help much when one is deciding what to aim at. For example: MPCEV can also be built on top of a (potentially flawed) DWIKIM layer, in the exact same way that you can build CEV on top of a DWIKIM layer (and you can stick a "Last Judge off switch add-on" onto MPCEV too, etc.). In other words: anything along the lines of a "Last Judge off switch add-on" can be used by many different projects aiming at many different targets. Thus, the "Last Judge" idea, or any other idea along those lines (including a DWIKIM layer), provides very limited help when one decides what to aim at. Even more generally: regardless of what safety measure is used, John is still subjecting everyone to an unnecessary, avoidable s-risk. I hope we can agree that John should not do that with any version of "PCEV with a specific extrapolation procedure". The further claim that I am making is that no one should do that with any "Group AI", for similar reasons. Surely, discovering that this further claim is true cannot be, by definition, impossible.
While re-reading our exchange, I realised that I never actually clarified that my primary reason for participating in this exchange (and my primary reason for publishing things on LW) is not actually to stop CEV projects. However, a reasonable person might, based on my comments here, come to believe that my primary goal is to stop CEV projects (which is why the present clarification is needed). My focus is actually on trying to make progress on the "what alignment target should be aimed at?" question. In the present exchange, my target is the idea that this question has already been given an answer (specifically, that the answer is CEV). The first step to progress on the "what alignment target should be aimed at?" question is to show that this question does not currently have an answer. This is importantly different from saying "CEV is the answer, but the details are unknown" (I think such a statement is importantly wrong, and I also think that the fact that people still believe things along these lines is standing in the way of getting a project off the ground that is devoted to making progress on the "what alignment target should be aimed at?" question).
I think it is very unlikely that the relevant people will stay committed to CEV until the technology arrives that would make it possible for them to hit CEV as an alignment target. (The reason I find this unlikely is that (i) I believe I have outlined a sufficient argument to show that CEV is a bad idea, (ii) I think such technology will take time to arrive, and (iii) it seems likely that this team of designers, who are by assumption capable of hitting CEV, will be both careful enough to read that argument before reaching the point of no return on their CEV launch, and capable enough to understand it. Thus, since the argument against CEV already exists, in my estimate it would not make sense to focus on s-risks related to a successfully implemented CEV.) If that unlikely day ever does arrive, then I might switch focus to trying to prevent direct CEV-related s-risk by arguing against the imminent CEV project. But I don't expect ever to see this happen.
The set of paths whose probability I am actually focused on reducing can be hinted at by outlining the following specific scenario. Imagine a well-intentioned designer, whom we can call Dave, who is aiming for Currently Unknown Alignment Target X (CUATX). Because an unexamined implicit assumption that CUATX is built on top of turns out to be wrong in a critical way, CUATX implies an outcome along the lines of LP. But the issue that CUATX suffers from is far more subtle than the issue that CEV suffers from, and progress on the "what alignment target should be aimed at?" question has not yet reached the point where this problematic unexamined implicit assumption can be seen. CUATX has all the features that are known at launch time to be necessary for safety (such as the necessary, but very much not sufficient, feature that any safe AI must give each individual meaningful influence regarding the adoption of those preferences that refer to her). Thus, the CUATX idea leads to a CUATX project, which in turn leads to an avoidable outcome along the lines of LP (after some set of human-implemented safety measures fail). That is the type of scenario I am trying to avoid (by trying to make sufficient progress on the "what alignment target should be aimed at?" question, in time). My real "opponent in this debate" is an implemented CUATX, not the idea of CEV (and very definitely not you, or anyone else who has contributed, or is likely to contribute, valuable insights related to the "what alignment target should be aimed at?" question). It just happens to be the case that the effort to prevent CUATX that I am trying to get off the ground starts by showing that CEV is not an answer to the "what alignment target should be aimed at?" question. And you just happen to be the only person pushing back against this in public (and again: I really appreciate that you chose to engage on this topic).
(I should also note explicitly that I am most definitely not against exploring safety measures. They might stop CUATX; in some plausible scenarios, they might be the only realistic thing that can stop CUATX. I am not against treaties, and I am open to hearing more about the various human augmentation proposals that have been going around for many years. I am simply noting that a safety measure, regardless of how clever it sounds, cannot serve as a substitute for progress on the "what alignment target should be aimed at?" question. An attempt to get people to agree to a treaty might fail, or a successfully implemented treaty might fail to actually prevent a race dynamic for long enough. Similarly, augmented humans might systematically tend towards being (i) superior at alignment, (ii) superior at persuasion, (iii) well-intentioned, and (iv) no better at dealing with the "what alignment target should be aimed at?" question than the best baseline humans (but still, presumably, capable of understanding an insight on this question, at least if that insight is well explained). Regardless of augmentation technique, selection for "technical ability and persuasion ability" seems like a far more likely de facto outcome to me, due to being far easier to measure. I expect it to be far more difficult to measure the ability to deal with the "what alignment target should be aimed at?" question, and it is not obvious that the abilities needed to deal with that question will be strongly correlated with the thing that I think will, de facto, have driven the trial-and-error augmentation process of the augment that eventually hits an alignment target: "technical-ability-and-persuasion-ability-and-ability-to-get-things-done".)
Maybe the first augment will be great at making progress on the "what alignment target should be aimed at?" question, and will quickly render all previous work on this question irrelevant (in which case the persuasion ability is probably good for safety). But assuming that this will happen seems like a very unsafe bet to make. Even more generally: I simply do not think it is possible to come up with any type of clever-sounding trick that makes it safe to skip the "what alignment target should be aimed at?" question. (To me, the "revolution-analogy-argument" in the 2004 CEV text looks like a sufficient argument for the conclusion that it is important to make progress on the "what alignment target should be aimed at?" question. But many people do not seem to consider it a sufficient argument for this conclusion, and it is unclear to me why this conclusion seems to require such extensive further argument.)
If my overall strategic goal was not clear, then this was probably my fault (in addition to not making this goal explicit, I also seem to have a tendency to lose sight of this larger strategic picture during back-and-forth technical exchanges).
Two of my three LW posts are in fact entirely devoted to arguing that making progress on the "what alignment target should be aimed at?" question is urgent (in our present discussion, we have only talked about the one post that is not exclusively focused on this). See:
Making progress on the "what alignment target should be aimed at?" question, is urgent
The proposal to add a "Last Judge" to an AI, does not remove the urgency, of making progress on the "what alignment target should be aimed at?" question.
(I am still very confused about this entire conversation, but I don't think that re-reading everything yet again will help much. I have been paying at least some attention to SL4, OB, and LW since around 2002-2003. I can't remember exactly who said what, when, or where. However, I have developed a strong intuition that can be very roughly translated as: "if something sounds strange, then it is very definitely not safe to explain away this strangeness by conveniently assuming that Nesov is confused on the object level". I am nowhere near the point where I would consider going against this intuition. So I expect that I will remain very confused about this exchange until more information is available; I don't expect to be able to just think my way out of this one. (Wild speculation, by anyone who happens to stumble on this comment at any point in the future, regarding what it might be that I was missing, is very welcome, for example in a LW comment, a LW DM, or an email.))