The case for more Alignment Target Analysis (ATA)
Summary

* We don’t have good proposals for alignment targets: The most recently published version of Coherent Extrapolated Volition (CEV), a fairly prominent alignment target, is Parliamentarian CEV (PCEV). PCEV gives a lot of extra influence to anyone who intrinsically values hurting other individuals (search the CEV Arbital page for "ADDED 2023" for Yudkowsky’s description of the issue). This feature went unnoticed for many years and would make a successfully implemented PCEV very dangerous.
* Bad alignment target proposals are dangerous: There is no particular reason to think that discovery of this problem was inevitable; it went undetected for many years. There are also plausible paths along which PCEV (or a proposal with a similar issue) might have ended up being implemented. In other words: PCEV posed a serious risk. That risk has probably been mostly removed by the Arbital update (it seems unlikely that someone would implement a proposed alignment target without at least reading the basic texts describing the proposal). PCEV is, however, not the only dangerous alignment target, and risks from scenarios where someone successfully hits some other bad alignment target remain.
* Alignment Target Analysis (ATA) can reduce these risks: We will argue that more ATA is needed and that it is urgent. ATA can informally be described as analyzing and critiquing Sovereign AI proposals, for example proposals along the lines of CEV. By Sovereign AI we mean a clever and powerful AI that acts autonomously in the world (as opposed to tool AIs, or a pivotal act AI of the type that follows human orders and can be used to shut down competing AI projects). ATA asks what would happen if a Sovereign AI project were to succeed at aligning its AI to a given alignment target.
* ATA is urgent: The majority of this post will focus on arguing that ATA cannot be deferred. A potential Pivotal Act AI (PAAI) might fail to buy enough calendar time for ATA, since it seems plausible that a PAAI wouldn’t be
It seems to me that we are going in circles and talking past each other to some degree in the discussion above. So I will just briefly summarise my position on the main topics that you raise (I have argued for these positions above; here I am just summarising), and then give a short outline of the argument for analysing Sovereign AI proposals now.
Regarding the relative priority of different research efforts:
The type of analysis that I am doing in the post is designed to reduce one of the serious AI risks that we face. This risk is due to a combination of the fact that (i): we might end up with a...