There exists a class of AI proposals in which some AI is supposed to undergo an I. J. Good style intelligence explosion, and then defer the decision of which alignment target will eventually be hit to someone else. One such proposal is the Pivotal Act AI (PAAI), proposed by ``person A'' in Wei Dai's post Meta Questions about Metaphilosophy (this specific PAAI defers the decision to a group of uploads). Another example of this general type of AI could be the result of some coalition of powerful governments insisting on a Political Process AI (PPAI): an AI that defers the decision to a pre-specified political process, designed not to ``disrupt the current balance of power''. For lack of a better name, I will refer to any AI in this general class as a PlaceHolder AI (PHAI).

It can sometimes make sense to refer to a given PHAI as an ``alignment target''. However, the present text will only use the term ``alignment target'' for the goal of a real AI, one that does not defer the question of which alignment target will eventually be aimed at (such as, for example, CEV). The present text argues that no PHAI proposal can remove the urgency of making progress on the ``what alignment target should be aimed at?'' question.

The reason for urgency is that (i): there is currently no viable answer to the ``what alignment target should be aimed at?'' question, (ii): there is also no way of reliably distinguishing a good alignment target from a bad alignment target, and finally, (iii): hitting a bad alignment target would be far, far worse than extinction (see my previous post A problem with the most recently published version of CEV).

Consider the case where someone comes up with an answer to the ``what alignment target should be aimed at?'' question that looks like ``the obviously correct thing to do''. No one is able to come up with a coherent argument against it, even after years of careful analysis. However, this answer is built on top of some unexamined implicit assumption that is wrong in such a way that successfully hitting this alignment target would be far, far worse than extinction. In this scenario, there is no way of knowing how long it will take for progress on the ``what alignment target should be aimed at?'' question to advance to the point where it is possible to notice the problematic assumption. There is also no way of knowing how much time there will be to work on the issue. Given this general situation, one should assume that making progress on this question is urgent, unless one hears a convincing counterargument.

A more straightforward, but less exact, way of saying this would be: ``Since hitting a bad alignment target is far, far worse than extinction, and no one has any idea how to tell a good alignment target from a bad alignment target, the question is urgent by default.''

Thus, the goal of the present text is to respond to one class of counterarguments, based on the idea of building some PHAI that does not require an answer to the ``what alignment target should be aimed at?'' question. The general outline of this class of ``arguments against urgency'' is that it is possible to build a PHAI that goes through an intelligence explosion and then does something that removes the urgency. By assumption, this PHAI can be implemented without advancing on the ``what alignment target should be aimed at?'' question beyond the current state of the art (where no one has any idea how to reliably tell a good alignment target from an alignment target that is far, far worse than extinction).

The set of PHAI proposals is not a natural category. The category is formed by lumping together a set of very different proposals, based on the fact that they are used as arguments against the urgency of making progress on the ``what alignment target should be aimed at?'' question. One technical aspect that all of them do have in common is that they defer the question of what alignment target will eventually be aimed at (this property is inherent in the fact that they can, by assumption, be safely built at a time when no one has any idea of how to reliably tell a good alignment target from an alignment target that is far, far worse than extinction).

This text will focus on the PAAI and the PPAI proposals. A simple example PAAI would be an AI that defers to whatever unanimous decision a group of uploads will eventually settle on. A simple example PPAI would be an AI that indefinitely prevents any unauthorised intelligence explosion, and that will implement any AI iff 47 percent of the population signs a petition to implement that AI. In general, when a PHAI has been launched, there is no way of initiating another intelligence explosion unless one goes through some specific mechanism (so if one wants to launch another AI, then one will have to, for example: (i): convince the controllers of a PAAI to allow the launch, or (ii): convince billions of people to sign a petition). So, for example, if PHAI1 has been launched, then one can launch a real AI (or replace PHAI1 with PHAI2) iff one goes through the specific mechanism of PHAI1. A toy sketch of this shared gatekeeping structure is given below.
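The sketch below is purely my own illustration, not part of either actual proposal: it just makes the shared structure concrete, namely that a new AI only gets launched if the PHAI's pre-specified mechanism approves it. The function names and data shapes are assumptions made for illustration; the 47 percent petition threshold and the unanimous-veto rule are the toy values from the two examples above.

```python
from dataclasses import dataclass


@dataclass
class LaunchProposal:
    """A proposed real AI (or replacement PHAI), identified here only by a label."""
    name: str


def ppai_gate(proposal: LaunchProposal, signatures: int, population: int) -> bool:
    """Toy PPAI mechanism: implement the proposed AI iff 47 percent of the
    population has signed a petition for it; block everything else."""
    return signatures >= 0.47 * population


def paai_gate(proposal: LaunchProposal, upload_votes: dict[str, bool]) -> bool:
    """Toy PAAI mechanism: implement the proposed AI iff every upload agrees
    (each upload holds a veto)."""
    return len(upload_votes) > 0 and all(upload_votes.values())


# Example: a petition signed by 40 percent of the population is not enough.
print(ppai_gate(LaunchProposal("some real AI"),
                signatures=3_200_000_000, population=8_000_000_000))  # False
```

The point of the sketch is only that every PHAI reduces to some such gate, and the gate itself says nothing about whether the alignment target being waved through is a good one.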

The core argument of this post is that no PHAI proposal can remove the urgency of making progress on the ``what alignment target should be aimed at?'' question, simply because there is no way of knowing which specific PHAI will end up getting implemented first. The post will make this point using the PAAI and the PPAI proposals. For each of these proposals, it will be shown that the proposal in question (i): can not be ruled out, and (ii): implies urgency of making progress on the ``what alignment target should be aimed at?'' question (if the proposed PHAI is successfully implemented, and works as intended). But the actual argument of the present text is that, unless you actually know that your proposed PHAI will actually get implemented first, no specific proposal can ever remove this urgency.

If some PHAI has been successfully implemented, then the question of which alignment target will eventually be aimed at has been deferred to some group of people. For lack of a better name, let's refer to such a group as an EGP, for Empowered Group of People (which could be a single person). One example of an EGP is a small group of designers that have both (i): been the first people to initiate any intelligence explosion at all, and (ii): managed to successfully navigate this intelligence explosion in a way that gives them control over how to eventually choose an alignment target. Another example of a potential EGP is the set of people that eventually manages to gain de facto control over some existing political process, at some point after the launch of a PPAI (for example after some design team has managed to successfully navigate an intelligence explosion in a way that defers the decision of what alignment target to aim for to this political process. Perhaps because this political process was insisted on by some coalition of governments that had the capacity to shut down unauthorised projects).

The first, immediate task is thus (while reasoning from the assumption of a successful PHAI launch) to make enough progress on the ``what alignment target should be aimed at?'' question that the EGP can see any flaw that might exist in their preferred alignment target. Then, as a second step, it is necessary to make enough progress that this group of people can see that the question is genuinely unintuitive. And then, as a third step, one has to find a good answer to the ``what alignment target should be aimed at?'' question, and convince this set of people that this is in fact a good answer.

Let's dig into the details of the first step a bit more. Consider the scenario where someone proposes an answer to the ``what alignment target should be aimed at?'' question that sounds like ``the obviously correct thing to do''. Since there currently exists nothing even remotely resembling such an answer, this proposal must be treated as an alignment target of a fully unknown form. Now, consider the case where this alignment target is built on top of some unexamined implicit assumption that happens to be wrong in some critical way. It seems difficult to estimate how long it will take to find a problematic unexamined implicit assumption in a not-yet-proposed alignment target of a fully unknown form. It also seems difficult to know how long it will take to explain the issue to some unknown group of people (if the problem is well understood, the explanation of the issue is simple, and the EGP is very careful and very capable, then this step might be easy. In other scenarios, the explanation step might be the most difficult step). On the other hand, it is also difficult to know how much time there will be to look for the unexamined implicit assumption (maybe all powerful people within the political process of a PPAI are determined to move forwards, but there is a deadlock because each faction is dead set on a different ``nice-sounding-horror''. It is hard to be confident in any upper bound on how long such a deadlock might last). Simply assuming that there is certainly enough time (or simply assuming that there is certainly insufficient time) thus implies some form of confusion. And, given that failure would lead to an outcome that is far, far worse than extinction, this seems like a dangerous type of confusion.

Since it is not possible to predict which PHAI will be launched, it is difficult to say much about the EGP with any confidence. Additionally, even for a known PHAI proposal, the EGP does not have to be a known group. In the case of the PPAI proposal mentioned above, for example, we know that the EGP would be whoever manages to gain de facto control over the pre-specified political process. However, it seems difficult to make strong predictions regarding who would actually manage to eventually gain such de facto control (especially without knowing the details of the political process, or the timeframe involved).


Let's now focus on the PPAI, and show that it is not possible to rule out this outcome. One reason that it is difficult to know which PHAI will be launched is that this could depend on geopolitics. It is not enough to predict how a debate within the AI community will go. For example: some group of powerful governments could decide that no PHAI launch will be allowed if that PHAI ``disrupts the balance of power''. When one makes arguments that involve a PHAI (such as arguments against urgency), these arguments are only relevant if the first intelligence explosion will be successfully navigated. Thus, when countering an argument against urgency that is based on a PHAI proposal, one should reason from the assumption of a successfully implemented PHAI. So, when one is considering the probability that some group of powerful governments will successfully exert influence over the choice of PHAI, one has to condition this probability on the fact that the first intelligence explosion will be successfully navigated. In other words: ``in a world where the first intelligence explosion has been successfully navigated, what is the probability that some group of powerful governments has successfully exerted effective control over AI development?''. In yet other words: if someone manages to stop every single project led by a team from the ``let's run face first into a wood chipper'' crowd from stumbling backwards into Clippy, then that same someone might be able to successfully insist on a specific PHAI. For example: some coalition of powerful governments might be very hostile to each other, but might still manage to agree to do something that (i): avoids extinction, and (ii): preserves the current balance of power. So, regardless of what the AI community thinks, the world might end up with a PPAI, where the EGP consists of whoever will eventually manage to exert de facto control over (some formalisation of) existing political power structures.

Now, let's explore what this would mean in terms of what progress would be needed on the ``what alignment target should be aimed at?'' question. Let's make a series of optimistic assumptions, in order to focus on a problem that remains even under these assumptions (so, this is not an attempt to describe a set of assumptions that anyone is actually making).

Consider the scenario of a successfully implemented PPAI that works exactly as intended, and that defers to a formalisation of a pre-specified political process that also works exactly as intended. Now, let's assume that after the launch:
(i): all global geopolitical rivalries are instantly and permanently solved,
(ii): all countries become stable democracies,
(iii): it is common knowledge that peace and democracy will continue indefinitely (meaning that worries about these things ending do not create time pressure to settle on an alignment target and build a real AI),
(iv): it is common knowledge that no individual will debate in a dishonest way (so, for example: all politicians, journalists, tweeters, etc, will be entirely honest, and will be known by everyone to be honest. No one will ever, for example, try to twist any statement, on any subject, to get votes, viewers, attention, etc),
(v): it is common knowledge that no AI will be launched by circumventing, or gaming, the political process (meaning that no one feels the need to settle on a decision as a way of preventing this scenario),
(vi): it is common knowledge that no elected official, or anyone else, will ever attempt to subvert the political process in any way (for example by procedural stunts),
(vii): it is common knowledge that no form of bad influence on the thought process of any individual human, or on the dynamics of any group, will occur due to any type of technology,
(viii): the culture war, and all similar processes, are fully resolved in a way that completely satisfies everyone,
(ix): no one will worry that the balance of political power is shifting (for any reason) in a way that is detrimental to their interests, and finally
(x): all economic and medical problems will be fully, instantly, and permanently solved by the PPAI, without any negative side effects. So, the PPAI will prevent disease, unwanted ageing, violence, accidents, etc (without any side effects, and in a way that does not upset anyone, or make anyone worry about the political balance of power being impacted). The AI will also provide abundant material resources (without creating any problems, for example related to people feeling useless due to not having anything to do).

The point of these optimistic assumptions is to make it easier to focus on a problem that remains even under these assumptions.

One limitation of any PHAI system is that it is unable to deal with issues that are intrinsically entangled with definitional ambiguities and value questions. The reason that the present text is analysing various PHAI proposals is that such proposals are used as arguments against the urgency of making progress on the ``what alignment target should be aimed at?'' question. Thus, no such proposed AI can deal with any issues that require an answer to the ``what alignment target should be aimed at?'' question. No matter how clever such an AI is, it can not be trusted to find ``the right goal'' to give to a successor AI (if an AI exists that can be trusted with this decision, then this would mean that the ``what alignment target should be aimed at?'' question has already been solved).

For related reasons, no matter how clever such an AI is, it can never be trusted to deal with any issue that is intrinsically entangled with both definitional and value questions. For example: (i): addiction issues, (ii): abusive relationships, (iii): clinically depressed, suicidal people, (iv): parents mistreating their kids, (v): bullying, (vi): sexual harassment, etc. On each of these issues, honest, well intentioned, and well informed people that think carefully before making statements will disagree on definitions, and also disagree on value questions (there is no agreement on how to, for example, define addiction. And different people, with different values, will have very different preferences regarding how society, or an AI, should respond to addiction issues). A PPAI that is waiting for a goal definition to be provided by a pre-specified political process is simply unable to deal with any of these issues (or, to be more exact: for each of these issues, there exist many specific instances that the PPAI will be unable to deal with). To deal with these issues, and many other issues along the same lines, there is simply no option other than to settle on a specific answer to the ``what alignment target should be aimed at?'' question.

In other words, this scenario would lead to pressure to settle on an answer, even without demagogues, power grabs, or genuine and reasonable fear of foul play (or any other form of pressure originating in the presence of bad political actors). Because fully sincere, well intentioned, and well informed politicians, journalists, etc, would correctly point out that all the problems that their society is currently suffering from will remain unsolved until an alignment target is chosen and a real AI is built.

When a society has unreliable access to clean drinking water, water is a defining political issue. When clean drinking water becomes abundant, the society quickly starts to focus on other things. When a society has unreliable access to food, food is a defining political issue. When food becomes abundant, the society quickly starts to focus on other things. When healthcare is not affordable to many people, the affordability of healthcare becomes a defining political issue (as it is in some countries). When healthcare becomes free (as it is in some other countries), the society quickly starts to focus on other things. In other words: everything that we know about human societies points towards a rapid shift in focus, from things such as water, food, and affordability of healthcare, to some subset of those issues that the PPAI will be unable to solve (due to the fact that a PPAI can not be trusted to deal with definitional issues, regardless of how smart it is). So, while this PPAI might have bought time, it has not necessarily bought a particularly large amount of time. The problem is that it would not be at all surprising if (by the time society has fully shifted focus to some subset of those problems that the PPAI can not be trusted to solve) an answer to the ``what alignment target should be aimed at?'' question exists that sounds flawless, but that would in fact be far, far worse than extinction (in expectation, from the perspective of essentially any human individual. For example due to an unexamined implicit assumption that no one has noticed, and that happens to be wrong in some critical way. See, for example, my previous post A problem with the most recently published version of CEV). There is no particular reason to think that anyone would notice this unexamined implicit assumption before most people decide that they don't want to wait any longer. (One can of course try working within whatever political process will determine the alignment target, and try to convince whoever is in de facto control of the meta point that the ``what alignment target should be aimed at?'' question is simply too unintuitive to proceed without further analysis. But I see no particular reason to think that this would work when a proposal exists that looks completely safe, and that no one can find any flaws with (even with the set of optimistic assumptions that we are currently reasoning from).)

It is worth noting that the situation above is not hopeless, even if (i): all known alignment targets are far, far worse than extinction, (ii): work on the ``what alignment target should be aimed at?'' question has not yet progressed to anywhere near the point where it is possible to see the unexamined implicit assumptions that are causing the problems, and (iii): basically everyone dismisses the dangers and is completely determined to go ahead as soon as possible. There could, for example, exist more than one proposed alignment target, and people could fail to agree on which one to choose for quite some time. (The situation where people fail to agree on which ``nice-sounding-horror'' to choose remains a realistic scenario, also when relaxing the optimistic assumptions above.) So even if work on the ``what alignment target should be aimed at?'' question is still at a stage where no one is anywhere near the point where it is possible to see that a problem even exists, and everyone with power confidently dismisses the danger, continued progress is still urgent. It's just one more possible scenario where progress has to advance to some unknown level of understanding, within some unknown timeframe (which, along with the fact that the scenario can not be reliably avoided, and the stakes involved, implies urgency).


Let's now turn to the case of a PAAI that uploads a group of designers, and that gives each upload a veto on what alignment target to aim at (this idea was recently mentioned in Wei Dai's post Meta Questions about Metaphilosophy). One reason that this scenario is difficult to conclusively rule out is that we are currently reasoning from the assumption that someone will successfully navigate the first ever intelligence explosion. It seems difficult to confidently rule out the scenario where this group of people will manage to seize power. Let's again reason from a set of very optimistic assumptions, in order to focus on a problem that remains even under these assumptions. Let's go even further than we did in the PPAI case above, and simply specify that every one of these uploads is, and will indefinitely remain, fully committed to the exact outcome that the reader wants to see (so we are, for example, assuming away all definitional issues involved in that statement). This is again done to focus on a problem that remains even under these optimistic assumptions. The problem in this scenario will again follow directly from the one fact that is shared by all PHAI proposals: the inability to deal with definitional issues. In this case, it will be the definition of sanity. Consider the scenario where the uploads go insane and decide to aim for a bad alignment target (because they have convinced themselves that it is a good alignment target). If this happens, the PAAI can, by assumption, not be trusted to tell a good alignment target from a bad alignment target. This is inherent in the fact that the PAAI proposal is being used as an argument against the urgency of making progress on the ``what alignment target should be aimed at?'' question. It can thus, by assumption, be built at a time when no one has any idea how to reliably tell a good alignment target from an alignment target that is far, far worse than extinction (see my previous post A problem with the most recently published version of CEV). (If some specific AI proposal is not being used as an argument against the urgency of making progress on the ``what alignment target should be aimed at?'' question, then such an AI proposal is out of scope of the present text.)

The same mechanism that makes the PPAI discussed above unable to deal with definitional issues along the lines of addiction makes this PAAI unable to evaluate sanity. In other words: ``evaluating sanity'' is not a well defined process, with a well defined answer, that can be found simply by being smart. Different people will use very different definitions of sanity. And different people, with different values, will want to deal with related scenarios in very different ways. Thus, no PHAI knows how to ``evaluate sanity'', for the same reasons that no PHAI knows how to ``deal with addiction in the correct way'' or how to ``find a good alignment target''. In other words, and more generally: it is simply not possible to rely on a PAAI to protect against a set of uploads that are acting recklessly.

So, the question now becomes how long this group of people will stay sane, when (i): finding themselves in a social situation that is very different from any social situation that any human has ever experienced, (ii): having to make irreversible decisions on questions that are very different from anything that any human has ever had to deal with (both in terms of scale, and in terms of permanence), (iii): having to come up with models of what a human mind is, and models of how groups of humans tend to behave, for entirely new purposes (these models would presumably be very different from models that have been constructed to justify economic policy, or models that are used to justify guidelines for mental health care professionals, or models that are used to explain various observed historical events), and (iv): having to reason about some very unpleasant scenarios, related to ways in which various nice sounding alignment targets would lead to various very unpleasant outcomes. One obvious outcome is that these uploads fairly quickly go insane, without any of them noticing. Let's consider the case where the psychological pressure reaches a point where they start to rationalise whatever course of action feels like the best way of relieving this psychological pressure. They might, for example, start from the correct observation that their current situation will lead to increasingly severe mental health issues. This can be turned into an argument that it is important to make decisions as soon as possible (or they might come up with some other convoluted meta argument that twists some other set of valid concerns into an argument in favour of whatever action feels like the most effective way of relieving the psychological pressures that they are experiencing). In other words: if they had remained sane, they would have chosen actions based on a different risk calculation. When they were first uploaded, they did not use this form of reasoning. But after they have gone insane, they are essentially looking for some valid sounding reason to take whatever action feels most likely to relieve the psychological pressure that drove them insane (after they have ``snapped'', the cleverness that allowed them to get to this point is not necessarily an asset). Let's see what this would mean in the case where there exists some answer to the ``what alignment target should be aimed at?'' question that looks flawless, but that is built on top of some unexamined implicit assumption that is wrong in some critical way. The uploads have spent a significant amount of time (both before and after they snapped) looking for a flaw, but have not found any hint that a problem exists. In general, that they will eventually just go ahead seems like a distinct possibility (for example because they have snapped, and are no longer doing the type of risk calculations that allowed them to successfully navigate the first ever intelligence explosion).

So, while reasoning from the assumption of the successful launch of such an AI, the task is to fully solve the ``what alignment target should be aimed at?'' question before the launch (if the state of the art, at the time that the PAAI is launched, is actually a good answer, then the scenario where insane uploads recklessly implement the current state of the art, as a way of relieving psychological pressure, would actually end well). Success can't be ruled out. Perhaps there exists an answer that is actually good, and that is realistically findable in time. The fact that the ``what alignment target should be aimed at?'' question is genuinely unintuitive works both ways. Being certain of failure thus seems to imply some form of confusion. Since there exists no reliable way of preventing the PAAI scenario, increasing the probability of success in the event of a PAAI scenario would be valuable (given the stakes, this is actually very valuable, for a wide range of estimates of both the probability of the PAAI scenario and the probability of success in the event of a PAAI). Similarly, anyone that is certain of success in the event of a successful PAAI launch can also be confidently assumed to be suffering from some (very dangerous) form of confusion. In addition to increasing the probability of finding an actually good answer, there are also other valuable things that one might find while making progress on the ``what alignment target should be aimed at?'' question. Even in the case where no good answer is realistically within reach before launch, one might still be able to advance to the point where it is possible to notice the unexamined implicit assumption that is causing problems in the current state of the art, before the launch of the PAAI (or one might manage to make conceptual progress that allows the uploads to take the last steps, and find the problem, before they go insane).

In other words: if some answer has been proposed, then it seems difficult to be certain that this answer is not built on top of some unexamined implicit assumption. But it seems entirely feasible to reduce the probability of an unexamined implicit assumption remaining unnoticed at launch time. Combined with the fact that there exists no reliable way of preventing a PAAI launch followed by a reckless decision, this seems obviously valuable (implying an urgent need to make progress on the ``what alignment target should be aimed at?'' question).

A more straightforward, but less exact, way of putting this (which would also apply to many other types of PHAI scenarios) would be: ``you may be unable to prevent crazy people from acting recklessly. Thus, it makes sense to reduce the expected damage, in the event that you fail to talk them out of this reckless action''. (This general situation, of dealing with people that might take reckless actions, is not unusual, or particularly complex. One simple analogy, which also involves high stakes global coordination problems, would be to simultaneously (i): argue against nuclear war, and (ii): build a fallout shelter, in order to reduce the expected damage from a nuclear war, because you understand that your efforts to argue against nuclear war might fail. Another analogy, involving a search for a technical problem, would be the case where someone is building a rocket. The rocket is being built based on a large number of ``back of the envelope calculations'' that no one has checked properly (in this case ``launching the rocket'' is used as an analogy for ``successfully hitting an alignment target'', and not as an analogy for ``launching a PHAI''). Consider Bob, who can make arguments against launch, but who does not have reliable control over the decision of whether or not to launch the rocket. Bob can now make an argument that the risk is too high, while simultaneously going over the calculations, looking for a specific flaw (here, an error in an equation that the rocket is built on top of is analogous to an unexamined implicit assumption that the proposed answer to the ``what alignment target should be aimed at?'' question is built on top of). Even if there is no realistic way for Bob to get to a point where he feels confident that all calculations are correct, Bob can still take meaningful action. It is, for example, possible that there exists only one critical problem, and that this problem happens to be realistically findable in the time available before launch (making Bob's effort to go over the calculations meaningful). In general, there exists a wide range of probabilities of a launch going well, such that it makes sense to take actions that increase this probability (because the launch can not be reliably prevented). Thus, any conclusion that there is no point in trying to find a flaw in the calculations seems likely to be built on top of some form of confusion (at least if the cost of a failed rocket launch is high enough).)
