Summary
- We don’t have good proposals for alignment targets: The most recently published version of Coherent Extrapolated Volition (CEV), a fairly prominent alignment target, is Parliamentarian CEV (PCEV). PCEV gives a lot of extra influence to anyone who intrinsically values hurting other individuals (search the CEV arbital page for ADDED 2023 for Yudkowsky’s description of the issue). This feature went unnoticed for many years and would make a successfully implemented PCEV very dangerous.
- Bad alignment target proposals are dangerous: There is no particular reason to think that discovery of this problem was inevitable. It went undetected for many years. There are also plausible paths along which PCEV (or a proposal with a similar issue) might have ended up being implemented. In other words: PCEV posed a serious risk. That risk has probably been mostly removed by the Arbital update. (It seems unlikely that someone would implement a proposed alignment target without at least reading the basic texts describing the proposal.) PCEV is however not the only dangerous alignment target, and risks from scenarios where someone successfully hits some other bad alignment target remain.
- Alignment Target Analysis (ATA) can reduce these risks. We will argue that more ATA is needed and urgent. ATA can informally be described as analyzing and critiquing Sovereign AI proposals, for example along the lines of CEV. By Sovereign AI we mean a clever and powerful AI that will act autonomously in the world (as opposed to tool AIs or a pivotal act AI of the type that follows human orders and that can be used to shut down competing AI projects). ATA asks what would happen if a Sovereign AI project were to succeed at aligning their AI to a given alignment target.
- ATA is urgent. The majority of this post will focus on arguing that ATA cannot be deferred. A potential Pivotal Act AI (PAAI) might fail to buy enough calendar time for ATA since it seems plausible that a PAAI wouldn’t be able to sufficiently reduce internal time pressure. Augmented humans and AI assistants might fail to buy enough subjective time for ATA. Augmented humans that are good at hitting alignment targets will not necessarily be good at analyzing alignment targets. Creating sufficiently helpful AI assistants might be hard or impossible without already accidentally locking in an alignment target to some extent.
- ATA is extremely neglected. The field of ATA is at a very early stage, and currently there does not exist any research project dedicated to ATA. The present post argues that this lack of progress is dangerous and that this neglect is a serious mistake.
A note on authorship
This text largely builds on previous posts by Thomas Cederborg. Chi contributed mostly by trying to make the text more coherent. While the post writes a lot in the “we” perspective, Chi hasn’t thought deeply about many of the points in this post yet, isn’t sure what she would endorse on reflection, and disagrees with some of the details. Her main motivation was to make Thomas’ ideas more accessible to people.
Alignment Target Analysis is important
This post is concerned with proposals for what to align a powerful and autonomous AI to, for example proposals along the lines of Coherent Extrapolated Volition (CEV). By powerful and autonomous we mean AIs that are not directly controlled by a human or a group of humans, as opposed to the types of proposed AI systems that some group of humans might use for limited tasks, such as shutting down competing AI projects. We will refer to this type of AI as Sovereign AI throughout this post. Many people both consider it possible that such an AI will exist at some point, and further think that it matters what goal such an AI would have. This post is addressed to such an audience. (The imagined reader does not necessarily think that creating a Sovereign AI is a good idea, just that it is possible, and that if it happens, then it matters what goal such an AI has.)
A natural conclusion from this is that we need Alignment Target Analysis (ATA) at some point. A straightforward way of doing ATA is to take a proposal for something we should align an AI to (for example: the CEV of a particular set of people) and then ask what would happen if someone were to successfully hit this alignment target. We think this kind of work is very important. Let’s illustrate this with an example.
The most recently published version of CEV is based on extrapolated delegates negotiating in a Parliament. Let’s refer to this version of CEV as Parliamentarian CEV (PCEV). It turns out that the proposed negotiation rules of the Parliament give a very large advantage to individuals who intrinsically value hurting other individuals. People who want to inflict serious harm get a bigger advantage than people who want to inflict less serious harm. The largest possible advantage goes to any group that wants PCEV to hurt everyone else as much as possible. This feature of PCEV makes PCEV very dangerous. However, this feature went unnoticed for many years, despite PCEV being a fairly prominent proposal. This illustrates three things:
- First, it shows that noticing problems with proposed alignment targets is difficult.
- Second, it shows that successfully implementing a bad alignment target can result in a very bad outcome.
- Third, it shows that reducing the probability of such scenarios is feasible (the fact that this feature has been noticed makes it a lot less likely that PCEV will end up getting implemented).
This example shows that getting the alignment target right is extremely important and that even reasonable-seeming targets can be catastrophically bad. The flaws in PCEV’s negotiation rules are also not unique to PCEV. An AI proposal from 2023 uses similar rules and hence suffers from related problems. More ATA is needed because finding the right target is surprisingly difficult, noticing flaws is surprisingly difficult, and targets that look reasonable enough might lead to catastrophic outcomes.
The present post argues that making progress on ATA is urgent. As shown, the risks associated with scenarios where someone successfully hits a bad alignment target are serious. Our main thesis is that there might not be time to do ATA later. If one considers it possible that a Sovereign AI might be built, then the position that doing ATA now is not needed must rest on some form of positive argument. One class of such arguments is based on an assertion that ATA has already been solved. We already argued that this is not the case.
Another class of arguments is based on an assertion that all realistic futures fall into one of two possible categories: (i) scenarios with misaligned AI (in which case ATA is irrelevant), or (ii) scenarios where there will be plenty of time to do ATA later, so that we should defer it to future, potentially enhanced, humans and their AI assistants. The present post will be focused on countering arguments along these lines. We will further argue that these risks can be reduced by doing ATA. The conclusion is that it is important that ATA work starts now. However, there does not appear to exist any research project dedicated to ATA. This seems like a mistake to us.
Alignment Target Analysis is urgent
Let’s start by briefly looking at one common class of AI safety plans that does not feature ATA until a much later point. It goes something like this: Let’s make AI controllable, i.e. obedient, helpful, not deceptive, ideally without any real long-term goals of its own, just there to follow our instructions. We don’t align those AIs to anything more ambitious or object-level. Once we succeed at that, we can use those AIs to help us figure out how to build a more powerful AI sovereign safely and with the right kind of values. We’ll be much smarter with the help of those controllable AI systems, so we’ll also be in a better position to think about what to align sovereign AIs to. We can also use these controllable AI systems to buy more time for safety research, including ATA, perhaps by doing a pivotal act (in other words: use some form of instruction-following AI to take actions that shut down competing AI projects). So, we don’t have to worry about more ambitious alignment targets yet. In summary: We don’t actually need to worry about anything right now other than getting to the point where we have controllable AI systems that are strong enough to either speed up our thinking or slow down AI progress or, ideally, both.
One issue with such proposals is that it seems very difficult to us to make a controllable AI system that is
- able to substantially assist you with ATA or can buy you a lot of time
- without already implicitly substantially having chosen an alignment target, i.e. without accidental lock-in.
If this is true, ATA is time-sensitive because it needs to happen before and alongside us developing controllable AI systems.
Why we don’t think the idea of a Pivotal Act AI (PAAI) obsoletes doing ATA now
Now, some argue that we can defer ATA by building a Pivotal Act AI (PAAI) that can stop all competing AI projects and hence buy us unlimited time. There are two issues with this: First, PAAI proposals need to balance buying time and avoiding accidental lock-in. The more effective an AI is at implementing a pivotal act of a type that reliably prevents bad outcomes, the higher the risk you have already locked something in.
For an extreme example, if your pivotal act is to have your AI autonomously shut down all “bad AI projects”, you have almost certainly already locked in some values. A similar issue also makes it difficult for an AI assistant to find a good alignment target without many decisions having already been made (more below). If a system reliably shuts down all bad AIs, then the system will necessarily be built on top of some set of assumptions regarding what counts as a bad AI. This would mean that many decisions regarding the eventual alignment target have already been made (which in turn means that ATA would have to happen before any such AI is built). And if the AI does not reliably shut down all bad AI projects, then decisions will be in the hands of humans who might make mistakes.
Second, and more importantly, we haven’t yet seen enough evidence that a good pivotal act is actually feasible and that people will pursue it. In particular, current discussions of pivotal act AI seem to neglect internal time pressure. For example, we might end up in a situation where early AI is in the hands of a messy coalition of governments that are normally adversaries. Such a coalition is unlikely to pursue a unified, optimized strategy. Some members of the coalition will probably be under internal political pressure to train and deploy the next generation of AIs. Even rational, well informed, and well intentioned governments might decide to take a calculated risk and act decisively before the coalition collapses.
If using the PAAI requires consensus, then the coalition might decide to take a decisive action before an election in one of the countries involved. Even if everyone involved is aware that this is risky, the option of ending up in a situation where the PAAI can no longer be used to prevent competing AI projects might be seen as more risky. An obvious such action would be to launch a Sovereign AI, aiming at whatever happens to be the state of the art alignment target at the time (in other words: build an autonomous AI with whatever goal is the current state of the art proposed AI goal at the time). Hence, even if we assume that the PAAI in question could be used to give them infinite time, it is not certain that a messy coalition would use it in this way, due to internal conflicts.
Besides issues related to reasonable people trying to do the right thing by taking calculated risks, another issue is that the leaders of some countries might prefer that all important decisions are made before their term of office expires (for example by giving the go-ahead to a Sovereign AI project that is aiming at their favorite alignment target).
An alternative to a coalition of powerful countries would be to have the PAAI be under the control of a global electorate. In this case, a large but shrinking majority might decide to act before existing trends turn their values into a minority position. Political positions changing in fairly predictable ways is an old phenomenon. Having a PAAI that can stop outside actors from advancing unauthorized AI projects wouldn’t change that.
In addition, if we are really unlucky, corrigibility of weak systems can make things worse. Consider the case where a corrigibility method (or whatever method you use to control your AIs) turns out to work for an AI that is used to shut down competing AI projects, but does not work for sovereign AIs. If a project has such a partially functional corrigibility technique, it might take the calculated risk of launching a sovereign AI that it hopes is also corrigible (thinking that this is likely, because the method worked on a non-sovereign AI). Thus, if the state of the art alignment target has a flaw, then discovering this flaw is urgent. See also this post.
To summarize: Even if someone can successfully prevent outside actors from making AI progress, i.e. if we assume the existence of a PAAI that could, in principle, be used to give humanity infinite time for reflection, that doesn’t guarantee a good outcome. Some group of humans would still be in control (since it is not possible to build a PAAI that prevents them from aiming at a bad alignment target without locking in important decisions). That group might still find themselves in a time crunch due to internal power struggles and other dynamics between themselves. In this case, the humans might decide to take a calculated risk and aim at the best alignment target they know of (which at the current level of ATA progress would be exceptionally dangerous).
However, this group of humans might be open to clear explanations of why their favorite alignment target contains a flaw that would lead to a catastrophic outcome. An argument of the form: “the alignment target that you are advocating for would have led to this specific horrific outcome, for these specific reasons” might be enough to make part of a shrinking majority hesitate, even if they would strongly prefer that all important decisions are finalized before they lose power. First, however, the field of ATA would need to advance to the point where it is possible to notice the problem in question.
Why we don’t think human augmentation and AI assistance obsolete doing ATA now
Some people might argue that we can defer ATA to the future not because we will have virtually unlimited calendar time but because we will have augmented humans or good AI assistants that will allow us to do ATA much more effectively in the future. This might not buy us much time in calendar months but a lot of time in subjective months to work on ATA.
Why we don’t think the idea of augmenting humans obsoletes doing ATA now
If one is able to somehow create smarter augmented humans, then it is possible that everything works out even without any non-augmented human ever making any ATA progress at all. In order to conclude that this idea obsoletes doing ATA now, one however needs to make a lot of assumptions. It is not sufficient to assume that a project will succeed in creating augmented humans that are both very smart and also well intentioned.
For example, the augmented humans might be very good at figuring out how to hit a specified alignment target while not being very good at ATA, since these are two different types of skills. One issue is that making people better at hitting alignment targets might simply be much easier than making them better at ATA. A distinct issue is that (regardless of relative difficulty levels) the first project that succeeds at creating augments that are good at hitting alignment targets might not have spent a lot of effort to ensure that these augments are also good at ATA. In other words: augmented humans might not be good at ATA, simply because the first successful project never even bothered to try to select for this.
It is important to note that ATA can still help prepare us for scenarios with augmented humans that aren’t better than non augmented humans at ATA, even if it does not result in any good alignment target. To be useful, ATA only needs to find the flaw in alignment targets (before the augmented humans respond to some time crunch, by taking the calculated risk of launching a Sovereign AI, aiming at this alignment target). If the flaw is found in time, then the augmented humans would have no choice other than to keep trying different augmentation methods, until this process results in some mind that is able to make genuine progress on ATA (because they do not have access to any nice-seeming alignment targets).
Accidental value lock-in vs. competence tension for AI assistants
When it comes to deferring to future AI assistants, we have additional issues to consider: We want a relatively weak controllable AI assistant that can help a lot with ATA. And we don’t want this AI to effectively lock in a set of choices. However, there is a problem. The more helpful an AI system is at helping you with ATA, the more risk you run of already having locked in some values accidentally.
Consider an AI that is just trying to help us achieve “what we want to achieve”. Once we give it larger and larger tasks, the AI has to do a lot of interpretation to understand what that means. For an AI to be able to help us achieve “what we want to achieve”, and prevent us from deviating from this, it must have a definition of what that means. Finding a good definition of “what we want to achieve” likely requires value judgments that we don’t want to hand over to AIs. If the system has a definition of “what we want to achieve”, then some choices are effectively already made.
To illustrate: For “help us achieve what we want to achieve” to mean something, one must specify how to deal with disagreements among individuals who disagree on how to deal with disagreements. Without specifying this one cannot refer to “we”. There are many different ways of dealing with such disagreements, and they imply importantly different outcomes. One example of how one can deal with such disagreements is the negotiation rules of PCEV, mentioned above. In other words: if an AI does not know what is meant by “what we want to achieve”, then it will have difficulties helping us solve ATA. But if it does know what “what we want to achieve” means, then important choices have already been made. And if the choice had been made to use the PCEV way of dealing with disagreements, then we would have locked in everything that is implied by this choice. This includes locking in the fact that individuals who intrinsically value hurting other individuals will have far more power over the AI than individuals who do not have such values.
If we consider scenarios with less powerful AIs that just don’t have any lock-in risk by default, then these AIs might not be able to provide substantial help with ATA: such systems currently seem better at tasks that have lots of data, have short horizons, and aren’t very conceptual. These things don’t seem to apply to ATA.
None of this is meant to suggest that AI assistants cannot help with ATA. It is entirely possible that some form of carefully constructed AI assistant will speed up ATA progress to some degree, without locking in any choices (one could for example build an assistant that has a unique perspective but is not smarter than humans. Such an AI might provide meaningful help with conceptual work, without its definitions automatically dominating the outcome). But even if this does happen, it is unlikely to speed you up enough to obsolete current work.
Spuriously opinionated AI assistants
AI systems might also just genuinely be somewhat opinionated for reasons that are not related to anyone making a carefully considered tradeoff. If the AI is opinionated in an unintended way and its opinions matter for what humans choose to do, we run the risk of already having accidentally chosen some alignment target by the time we designed the helpful, controllable AI assistant. We just don’t know what alignment target we have chosen.
If we look at current AI systems, this scenario seems fairly plausible. Current AIs aren’t actually trained purely for “what the user wants” but instead are directly trained to comply with certain moral ideas. It seems very plausible these moral ideas (alongside whatever random default views the AI has about, say, epistemics) will make quite the difference for ATA. It seems plausible that current AIs already are quite influential on people’s attitudes and will increasingly become so. This problem exists even if careful efforts are directed towards avoiding it.
Will we actually have purely corrigible AI assistants?
There exists a third issue that is separate from both issues mentioned above: even if some people do plan to take great care when building AI assistants, there is no guarantee that such people will be the first ones to succeed. It does not seem to us that everyone is in fact paying careful attention to what kinds of values and personalities we are currently training into our AIs. As a fourth, separate issue, despite all the talk about corrigibility and intent alignment, it doesn’t seem obvious at all that most current AI safety efforts differentially push towards worlds where AIs are obedient, controllable, etc., as opposed to having specific contentful properties.
Relationship between ATA and other disciplines
There are many disciplines that seem relevant to ATA such as: voting theory, moral uncertainty, axiology, bargaining, political science, moral philosophy, etc. Studying solutions in these fields is an important part of ATA work. But it is necessary to remember that the lessons learned by studying these different contexts might not be valid in the AI context. Since concepts can behave in new ways in the AI context, studying these other fields cannot replace ATA. This implies that in order to build up good intuitions about how various concepts will behave in the AI context, it will be necessary to actually explore these concepts in the AI context. In other words: it will be necessary to do ATA. This is another reason for thinking that the current lack of any serious research effort dedicated to ATA is problematic.
Let’s illustrate the problem of transferring proposals from different contexts to AI with PCEV as an example. The problem that PCEV suffers from as an alignment target is not an issue in the original proposal. The original proposal made by Bostrom is a mapping from a set of weighted ethical theories and a situation to a set of actions (that an individual can use to find a set of actions that can be given the label “morally permissible”). It is unlikely that a given person will put credence in a set of ethical theories that specifically refer to each other, and specifically demand that other theories must be hurt as much as possible. In other words: ethical theories that want to hurt other theories do get a negotiation advantage in the original proposal, but this advantage is not a problem in the original context.
In a population of billions however, some individuals will want to hurt other individuals. So here the negotiation advantage is a very big problem. One can describe this as the concept behaving very differently when it is transferred to the AI context. There is nothing particularly unusual about this. It is fairly common for ideas to stop working when they are used in a completely novel context. But it is still worth making this explicit, and important to keep this in mind when thinking about alignment target proposals that were originally designed for a different context. Because there are many aspects of the AI context that are quite unusual.
Here is another example, this time with a concept from ordinary politics transferred to the AI context. Let’s write Condorcet AI (CAI) for any AI that picks outcomes using a rule that conforms to the Condorcet Criterion or Garrabrant’s Lottery Condorcet Criterion. If a barely caring 51% solid majority (who agree about everything) would sort of prefer that a 49% minority be hurt as much as possible, then any CAI will hurt the 49% minority as much as it can. (It follows directly from the two linked definitions that a 51% solid majority always gets their highest ranked option implemented without compromise.) Ordinary politics does have issues with minorities being oppressed. But in ordinary politics there does not exist any entity that can suddenly start massively oppressing a 49% minority without any risk or cost. And without extrapolation, solid majorities are a lot less important as a concept. Therefore, ordinary politics does not really contain anything corresponding to the above scenario. In other words: the Condorcet Criterion behaves differently when it is transferred to the AI context.
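The 51% / 49% scenario can be made concrete with a small sketch. This is only an illustration under stated assumptions: the option names, the electorate size of 100, and the helper `condorcet_winner` are hypothetical, and the code only covers the ordinary (deterministic) Condorcet Criterion, not the Lottery Condorcet Criterion. Ballots are plain rankings, so they carry no information about how much anyone cares, which is part of the point.

```python
def condorcet_winner(ballots, options):
    """Return the option that wins every pairwise majority contest, or None.

    Each ballot is a list of options ordered from most to least preferred.
    (Hypothetical helper written for this illustration.)
    """
    for candidate in options:
        beats_all = all(
            # Strict majority of ballots rank `candidate` above `other`.
            2 * sum(b.index(candidate) < b.index(other) for b in ballots) > len(ballots)
            for other in options
            if other != candidate
        )
        if beats_all:
            return candidate
    return None


# Assumed outcome labels, not taken from any real proposal.
options = ["hurt_minority", "compromise", "protect_minority"]

# A solid majority: 51 voters with identical rankings, however weakly held.
majority_ballot = ["hurt_minority", "compromise", "protect_minority"]
# The 49 voters who would be hurt rank the outcomes in the opposite order.
minority_ballot = ["protect_minority", "compromise", "hurt_minority"]

ballots = [majority_ballot] * 51 + [minority_ballot] * 49

winner = condorcet_winner(ballots, options)
print(winner)
```

In this toy electorate the majority's top-ranked option wins every pairwise contest, so any rule satisfying the Condorcet Criterion must select it. Neither the compromise option nor the strength of the minority's preference has any effect on the result.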
Alignment Target Analysis is tractable
We’ve argued that alignment to the wrong alignment target can be both catastrophic and non-obvious. We also argued that people might need to, want to, or simply end up aligning their AIs to a specific target relatively soon, without sufficient help from AI assistants or the ability to stall for time. This makes ATA time-sensitive and important. It also is tractable. One way that such research could move forwards would be to iterate through alignment targets the usual scientific way: Propose them. Wait until someone finds a critical flaw. Propose an adjustment. Wait until someone finds a critical flaw in the new state of the art. Repeat. Hopefully, this will help us identify necessary features of good alignment targets. While it seems really hard to tell whether an alignment target is good, this helps us to at least tell when an alignment target is bad. And noticing that a bad alignment target is in fact bad reduces the danger of it being implemented.
A more ambitious branch of ATA could try to find a good alignment target instead of purely analyzing existing proposals. Coming up with a good alignment target and showing that it is good seems much, much harder than finding flaws in existing proposals. However, the example with PCEV showed that it is possible to reduce these dangers without finding any good alignment target. In other words: an ATA project does not have to attempt to find a solution to be valuable, because it can still reduce the probability of worst-case outcomes.
It is also true in general that looking ahead, and seeing what is waiting for us down the road, might be useful in hard to predict ways. It all depends on what one finds. Perhaps, to the extent that waiting for humanity to become more capable before committing to an alignment target or stalling for time are possible (just not guaranteed), ATA can help motivate doing so. It’s possible that, after some amount of ATA, we will conclude that humans, as we currently exist, should never try to align an AI to an alignment target we came up with. In such a scenario we might have no choice but to hope that enhanced humans will be able to handle this (even though there is no guarantee that enhancing the ability to hit an alignment target will reliably enhance the ability to analyze alignment targets).
Limitations of the present post, and possible ways forwards
There is a limit to how much can be achieved by arguing against a wide range of unstated arguments, implicit in the non-existence of any current ATA research project. Many people both consider it possible that a powerful autonomous AI will exist at some point, and also think that it matters what goal such an AI would have. So the common implicit position that ATA now is not needed must rest on positive argument(s). These arguments will be different for different people, and it is difficult to counter all possible arguments in a single post. Each such argument is best treated separately (for example along the lines of these three posts, that each deal with a specific class of arguments).
The status quo is that not much ATA is being done, so we made a positive case for it. However, the situation to us looks as follows:
- We will need an alignment target eventually,
- Alignment targets that intuitively sound good might be extremely bad, maybe worse than extinction bad, in ways that aren’t obvious.
This seems like a very bad and dangerous situation to be in. To argue that we should stay in this situation, without at least making a serious effort to improve things, requires a positive argument. In our opinion, the current discourse arguing for focusing exclusively on corrigibility, intent alignment, human augmentation, and buying time (because they help with ATA in the long run) does not succeed at providing such an argument. Concluding that some specific idea should be pursued does not imply that the idea in question obsoletes doing ATA now. But the position that doing ATA now is not needed is sort of implicit in the current lack of any research project dedicated to ATA. ATA seems extremely neglected with, as far as we can tell, 0 people working on it full time.
We conclude this post by urging people who feel confident that doing ATA now is not needed to make an explicit case for this. The fact that there currently does not exist any research project dedicated to ATA indicates that there exist plenty of people who consider this state of affairs to be reasonable (probably for a variety of different reasons). Hopefully the present text will lead to the various arguments in favor of this position that people find convincing being made explicit and in public. A natural next step would then be to engage with those arguments individually.
Acknowledgements
We would like to thank Max Dalton, Oscar Delaney, Rose Hadshar, John Halstead, William MacAskill, Fin Moorhouse, Alejandro Ortega, Johanna Salu, Carl Shulman, Bruce Tsai, and Lizka Vaintrob for helpful comments on an earlier draft of this post. This does not imply endorsement.
Your comment makes me think that I might have been unclear regarding what I mean by ATA. The text below is an attempt to clarify.
Summary
Not all paths to powerful autonomous AI go through methods from the current paradigm. It seems difficult to rule out the possibility that a Sovereign AI will eventually be successfully aligned to some specific alignment target. At current levels of progress on ATA this would be very dangerous (because understanding an alignment target properly is difficult, and a seemingly-nice proposal can imply a very bad outcome). It is difficult to predict how long it would take to reach the level of understanding needed to prevent scenarios where a project successfully hits a bad alignment target. And there might not be a lot of time to do ATA later (for example because a tool-AI shuts down all unauthorised AI projects but does not buy a lot of time due to internal time pressure). So a research effort should start now.
Therefore ATA is one of the current priorities. There are definitely very serious risks that ATA cannot help with (for example misaligned tool-AI projects resulting in extinction). There are also other important current priorities (such as preventing misuse). But ATA is one of the things that should be worked on now.
The next section outlines a few scenarios designed to clarify how I use the term ATA. The section after that outlines a scenario designed to show why I think that ATA work should start now.
What I mean by Alignment Target Analysis (ATA)
The basic idea with ATA is to try to figure out what would happen if a given AI project were to successfully align an autonomously acting AI Sovereign to a given alignment target. The way I use the term, there are very severe risks that cannot be reduced in any way, by any level of ATA progress (including some very serious misalignment and misuse risks). But there are also risks that can and should be reduced by doing ATA now. There might not be a lot of time to do ATA later, and it is not clear how long it will take to advance to the level of understanding that will be needed. So ATA should be happening now. But let's start by clarifying the term ATA, by outlining a couple of dangerous AI projects where ATA would have nothing to say.
Consider Bill, who plans to use methods from the current paradigm to build a tool-AI. Bill plans to use this tool-AI to shut down competing AI projects and then decide what to do next. ATA has nothing at all to say about this situation. Let's say that Bill's project plan would lead to a powerful misaligned AI that would cause extinction. No level of ATA progress would reduce this risk.
Consider Bob, who also wants to build a tool-AI. But Bob's AI would work. If the project went ahead, then Bob would gain a lot of power. And Bob would use that power to do some very bad things. ATA has nothing to say about this project and ATA cannot help reduce this risk.
Now let's introduce an unusual ATA scenario, just barely within the limits of what ATA can be used for (the next section will give an example of the types of scenarios that make me think that ATA should be done now. This scenario is meant to clarify what I mean by ATA). Consider Dave, who wants to use methods from the current paradigm to implement PCEV. If the project plan moves forwards, then the actual result would be a powerful misaligned AI: Dave's Misaligned AI (DMAI). DMAI would not care at all what Dave is trying to do, and would cause extinction (for reasons that are unrelated to what Dave was aiming at). One way to reduce the extinction risk from DMAI would be to tell Dave that his plan would lead to DMAI. But it would also be valid to let Dave know that if his project were to successfully hit the alignment target that he is aiming for, then the outcome would be massively worse than extinction.
Dave assumes that he might succeed. So, when arguing against Dave's project, it is entirely reasonable to argue from the assumption that Dave's project will lead to PCEV. Pointing out that success would be extremely bad is a valid argument against Dave's plan, even if success is not actually possible.
You can argue against Dave's project by pointing out that the project will in fact fail. Or by pointing out that success would be very bad. Both of these strategies can be used to reduce the risk of extinction. And both strategies are cooperative (if Dave is a well-meaning and reasonable person, then he would thank you for pointing out either of these aspects of his plan). While both strategies can prevent extinction in a fully cooperative way, they are also different in important ways. It might be the case that only one of these arguments is realistically findable in time. It might for example be the case that Dave is only willing to publish one part of his plan (meaning that there might not be sufficient public information to construct an argument about the other part of the plan). And even if valid arguments of both types are constructed in time, it might still be the case that Dave will only accept one of these arguments. (similar considerations are also relevant for less cooperative situations. For example if one is trying to convince a government to shut down Dave's project. Or if one is trying to convince an electorate to vote no on a referendum that Dave needs to win in order to get permission to move forwards).
The audience in question (Dave, bureaucrats, voters, etc) are only considering the plan because they believe that it might result in PCEV. Therefore it is entirely valid to reason from the assumption that Dave's plan will result in PCEV (when one is arguing against the plan). There is no logical reason why such an argument would interfere with attempts to argue that Dave's plan would in fact result in DMAI.
Now let's use an analogy from the 2004 CEV document to clarify what role I see an ATA project playing. In this analogy, building an AI Sovereign is analogous to taking power in a political revolution. So (in the analogy) Dave proposes a political revolution. One way a revolution can end in disaster is that the revolution leads to a destructive civil war that the revolutionaries lose (analogous to DMAI causing extinction). Another way a revolution can end in disaster is that ISIS takes power after the government is overthrown (analogous to the outcome implied by PCEV).
It is entirely valid to say to Dave: "if you actually do manage to overthrow the government, then ISIS will seize power" (assuming that this conditional is true). One can do this regardless of whether or not one thinks that Dave has any real chance of overthrowing the government. (Which in turn means that one can actually say this to Dave, without spending a lot of time trying to determine the probability that the revolution will in fact overthrow the government. Which in turn means that people with wildly different views on how difficult it is to overthrow the government can cooperate while formulating such an argument)
(this argument can be made separately from an argument along the lines of: "our far larger neighbour has a huge army and would never allow the government of our country to be overthrown. Your revolution will fail even if every single soldier in our country joins you instantly. Entirely separately: the army of our country is in fact fiercely loyal to the government and you don't have enough weapons to defeat it. In addition to these two points: you are clearly bad at strategic thinking and would be outmanoeuvred in a civil war by any semi-competent opponent". This line of argument can also prevent a hopeless civil war. The two arguments can be made separately and there is no logical reason for them to interfere with each other)
Analysing revolutionary movements in terms of what success would mean can only help in some scenarios. It requires a non-vague description of what should happen after the government falls. In general: this type of analysis cannot reduce the probability of lost civil wars, in cases where the post-revolutionary strategy is either (i): too vaguely described to analyse, or (ii): actually sound (meaning that the only problem with the revolution in question is that it has no chance of success). Conversely however: arguments based on revolutions failing to overthrow the government cannot prevent revolutions that would actually end with ISIS in charge (analogous to AI projects that would successfully hit a bad alignment target). Scenarios that end in a bad alignment target getting successfully hit are the main reason that I think that ATA should happen now (in the analogy, the main point would be to reduce the probability of ISIS gaining power). Now let's leave the revolution analogy and outline one such scenario.
A tool-AI capable of shutting down all unauthorised AI projects might not buy a lot of time
It is difficult to predict who might end up controlling a tool-AI. But one obvious compromise would be to put it under the control of some group of voters (for example a global electorate). Let's say that the tool-AI is designed such that one needs a two-thirds majority in a referendum to be allowed to launch a Sovereign AI. There exists a Sovereign AI proposal that a large majority thinks sounds nice. A small minority would however prefer a different proposal.
In order to prevent inadvertent manipulation risks, the tool-AI was designed to only discuss topics that are absolutely necessary for the process of shutting down unauthorised AI projects. Someone figures out how to make the tool-AI explain how to implement Sovereign AI proposals (and Explanation / Manipulation related definitions happen to hold for such discussions). But no one figures out how to get it to discuss any topic along the lines of ATA. The original plan was to take an extended period of time to work on ATA before implementing a Sovereign AI.
Both alignment targets use the same method for extrapolating people and for resolving disagreements. The difference is in terms of who is part of the initial group. The two proposals have different rules with respect to things like: animals, people in cryo, foetuses, artificial minds, etc. It doesn't actually matter which proposal gets implemented: the aggregation method leads to the same horrific outcome in both cases (due to an issue along the lines of the issue that PCEV suffers from, but more subtle and difficult to notice). (All proposed alignment targets along the lines of "build an AI Sovereign that would do whatever some specific individual wants it to do" are rejected out of hand by almost everyone).
In order to avoid making the present post political, let's say that political debates center around what to do with ecosystems. One side cares about nature and wants to protect ecosystems. The other side wants to prevent animal suffering (even if the cost of such prevention is the total destruction of every ecosystem on earth). It is widely assumed that including animals in the original group will lead to an outcome where animal suffering is prevented at the expense of ecosystems. (in order to make the following scenario more intuitive, readers who have an opinion regarding what should be done with ecosystems can imagine that the majority shares this opinion)
The majority has enough support to launch their Sovereign AI. But the minority is rapidly and steadily gaining followers due to ordinary political dynamics (sometimes attitudes on a given issue change steadily in a predictable direction). So the ability to get the preferred alignment target implemented can disappear permanently at any moment (the exact number of people that would actually vote yes in a referendum is difficult to estimate. But it is clearly shrinking rapidly). In this case the majority might act before they lose the ability to act. Part of the majority would however hesitate if the flaw with the aggregation method is noticed in time.
After the tool-AI was implemented, a large number of people started to work on ATA. There are also AI assistants that contribute to conceptual progress (they are tolerated by the tool-AI because they are not smarter than humans. And they are useful because they contribute a set of unique non-human perspectives). However, it turns out that ATA progress works sort of like math progress. It can be sped up significantly by lots of people working on it in parallel. But the main determinant of progress is how long people have been working on it. In other words: it turns out that there is a limit to how much the underlying conceptual progress can be sped up by throwing large numbers of people at ATA. So the question of whether or not the issue with the Sovereign AI proposal is noticed in time, is to a large degree determined by how long a serious ATA research project has been going on at the time that the tool-AI is launched (in other words: doing ATA now reduces the risk of a bad alignment target ending up getting successfully hit in this scenario).
(the idea is not that this exact scenario will play out as described. The point of this section was to give a detailed description of one specific scenario. For example: the world will presumably not actually be engulfed by debates about the Prime Directive from Star Trek. And a tool-AI controlled by a messy coalition of governments might lead to a time crunch due to dynamics that are more related to Realpolitik than any form of ideology. This specific scenario is just one example of a large set of similar scenarios)
PS:
On a common sense level I simply don't see how one can think that it is safe to stay at our current level of ATA progress (where it is clearly not possible to reliably tell a good alignment target from an alignment target that implies an outcome massively worse than extinction). The fact that there exists no research project dedicated to improving this situation seems like a mistake. Intuitively this seems like a dangerous situation. At the very least it seems like some form of positive argument would be needed before concluding that this is safe. And it seems like such an argument should be published so that it can be checked for flaws before one starts acting based on the assumption that the current situation is safe. Please don't hesitate to contact me with theories / questions / thoughts / observations / etc regarding what people actually believe about this.