ThomasCederborg

My research focus is Alignment Target Analysis (ATA). I noticed that the most recently published version of CEV (Parliamentarian CEV, or PCEV) gives a large amount of extra influence to people who intrinsically value hurting other individuals. For Yudkowsky's description of the issue, you can search the CEV arbital page for ``ADDED 2023''.

The fact that no one noticed this issue for over a decade shows that ATA is difficult. If PCEV had been successfully implemented, the outcome would have been massively worse than extinction. I think that this illustrates that scenarios where someone successfully hits a bad alignment target pose a serious risk. I also think that it illustrates that ATA can reduce these risks (noticing the issue reduced the probability of PCEV getting successfully implemented). The reason that more ATA is needed is that PCEV is not the only bad alignment target that might end up getting implemented. ATA is however very neglected. There does not exist a single research project dedicated to ATA. In other words: the reason that I am doing ATA is that it is a tractable and neglected way of reducing risks.

I am currently looking for collaborators. I am also looking for a grant or a position that would allow me to focus entirely on ATA for an extended period of time. Please don't hesitate to get in touch if you are curious and would like to have a chat, or if you have any feedback, comments, or questions. You can for example PM me here, or PM me on the EA Forum, or email me at thomascederborgsemail@gmail.com (that really is my email address. It's a Gavagai / Word and Object joke from my grad student days)

My background is physics as an undergrad and then AI research. Links to some papers: P1  P2  P3  P4  P5  P6  P7  P8. (no connection to any form of deep learning)
 

Comments

Your comment makes me think that I might have been unclear regarding what I mean by ATA. The text below is an attempt to clarify.


Summary

Not all paths to powerful autonomous AI go through methods from the current paradigm. It seems difficult to rule out the possibility that a Sovereign AI will eventually be successfully aligned to some specific alignment target. At current levels of progress on ATA this would be very dangerous (because understanding an alignment target properly is difficult, and a seemingly-nice proposal can imply a very bad outcome). It is difficult to predict how long it would take to reach the level of understanding needed to prevent scenarios where a project successfully hits a bad alignment target. And there might not be a lot of time to do ATA later (for example because a tool-AI shuts down all unauthorised AI projects but does not buy a lot of time due to internal time pressure). So a research effort should start now.

Therefore ATA is one of the current priorities. There are definitely very serious risks that ATA cannot help with (for example misaligned tool-AI projects resulting in extinction). There are also other important current priorities (such as preventing misuse). But ATA is one of the things that should be worked on now.

The next section outlines a few scenarios designed to clarify how I use the term ATA. The section after that outlines a scenario designed to show why I think that ATA work should start now.


What I mean by Alignment Target Analysis (ATA)

The basic idea of ATA is to try to figure out what would happen if a given AI project were to successfully align an autonomously acting AI Sovereign to a given alignment target. The way I use the term, there are very severe risks that cannot be reduced in any way, by any level of ATA progress (including some very serious misalignment and misuse risks). But there are also risks that can and should be reduced by doing ATA now. There might not be a lot of time to do ATA later, and it is not clear how long it will take to advance to the level of understanding that will be needed. So ATA should be happening now. But let's start by clarifying the term ATA, by outlining a couple of dangerous AI projects where ATA would have nothing to say.

Consider Bill, who plans to use methods from the current paradigm to build a tool-AI. Bill plans to use this tool AI to shut down competing AI projects and then decide what to do next. ATA has nothing at all to say about this situation. Let's say that Bill's project plan would lead to a powerful misaligned AI that would cause extinction. No level of ATA progress would reduce this risk.

Consider Bob, who also wants to build a tool-AI. But Bob's AI would work. If the project were to go ahead, then Bob would gain a lot of power. And Bob would use that power to do some very bad things. ATA has nothing to say about this project and ATA cannot help reduce this risk.

Now let's introduce an unusual ATA scenario, just barely within the limits of what ATA can be used for (the next section will give an example of the types of scenarios that make me think that ATA should be done now. This scenario is meant to clarify what I mean by ATA). Consider Dave who wants to use methods from the current paradigm to implement PCEV. If the project plan moves forward, then the actual result would be a powerful misaligned AI: Dave's Misaligned AI (DMAI). DMAI would not care at all what Dave is trying to do, and would cause extinction (for reasons that are unrelated to what Dave was aiming at). One way to reduce the extinction risk from DMAI would be to tell Dave that his plan would lead to DMAI. But it would also be valid to let Dave know that if his project were to successfully hit the alignment target that he is aiming for, then the outcome would be massively worse than extinction.

Dave assumes that he might succeed. So, when arguing against Dave's project, it is entirely reasonable to argue from the assumption that Dave's project will lead to PCEV. Pointing out that success would be extremely bad is a valid argument against Dave's plan, even if success is not actually possible.

You can argue against Dave's project by pointing out that the project will in fact fail. Or by pointing out that success would be very bad. Both of these strategies can be used to reduce the risk of extinction. And both strategies are cooperative (if Dave is a well-meaning and reasonable person, then he would thank you for pointing out either of these aspects of his plan). While both strategies can prevent extinction in a fully cooperative way, they are also different in important ways. It might be the case that only one of these arguments is realistically findable in time. It might for example be the case that Dave is only willing to publish one part of his plan (meaning that there might not be sufficient public information to construct an argument about the other part of the plan). And even if valid arguments of both types are constructed in time, it might still be the case that Dave will only accept one of these arguments. (similar considerations are also relevant in less cooperative situations. For example if one is trying to convince a government to shut down Dave's project. Or if one is trying to convince an electorate to vote no on a referendum that Dave needs to win in order to get permission to move forward)

The audience in question (Dave, bureaucrats, voters, etc) are only considering the plan because they believe that it might result in PCEV. Therefore it is entirely valid to reason from the assumption that Dave's plan will result in PCEV (when one is arguing against the plan). There is no logical reason why such an argument would interfere with attempts to argue that Dave's plan would in fact result in DMAI.

Now let's use an analogy from the 2004 CEV document to clarify what role I see an ATA project playing. In this analogy, building an AI Sovereign is analogous to taking power in a political revolution. So (in the analogy) Dave proposes a political revolution. One way a revolution can end in disaster is that the revolution leads to a destructive civil war that the revolutionaries lose (analogous to DMAI causing extinction). Another way a revolution can end in disaster is that ISIS takes power after the government is overthrown (analogous to the outcome implied by PCEV).

It is entirely valid to say to Dave: ``if you actually do manage to overthrow the government, then ISIS will seize power'' (assuming that this conditional is true). One can do this regardless of whether or not one thinks that Dave has any real chance of overthrowing the government. (Which in turn means that one can actually say this to Dave, without spending a lot of time trying to determine the probability that the revolution will in fact overthrow the government. Which in turn means that people with wildly different views on how difficult it is to overthrow the government can cooperate while formulating such an argument)

(this argument can be made separately from an argument along the lines of: ``our far larger neighbour has a huge army and would never allow the government of our country to be overthrown. Your revolution will fail even if every single soldier in our country joins you instantly. Entirely separately: the army of our country is in fact fiercely loyal to the government and you don't have enough weapons to defeat it. In addition to these two points: you are clearly bad at strategic thinking and would be outmanoeuvred in a civil war by any semi-competent opponent''. This line of argument can also prevent a hopeless civil war. The two arguments can be made separately and there is no logical reason for them to interfere with each other)

Analysing revolutionary movements in terms of what success would mean can only help in some scenarios. It requires a non-vague description of what should happen after the government falls. In general: this type of analysis cannot reduce the probability of lost civil wars, in cases where the post-revolutionary strategy is either (i): too vaguely described to analyse, or (ii): actually sound (meaning that the only problem with the revolution in question is that it has no chance of success). Conversely however: arguments based on revolutions failing to overthrow the government cannot prevent revolutions that would actually end with ISIS in charge (analogous to AI projects that would successfully hit a bad alignment target). Scenarios that end in a bad alignment target getting successfully hit are the main reason that I think that ATA should happen now (in the analogy, the main point would be to reduce the probability of ISIS gaining power). Now let's leave the revolution analogy and outline one such scenario.


A tool-AI capable of shutting down all unauthorised AI projects might not buy a lot of time

It is difficult to predict who might end up controlling a tool-AI. But one obvious compromise would be to put it under the control of some group of voters (for example a global electorate). Let's say that the tool-AI is designed such that one needs a two-thirds majority in a referendum to be allowed to launch a Sovereign AI. There exists a Sovereign AI proposal that a large majority thinks sounds nice. A small minority would however prefer a different proposal.

In order to prevent inadvertent manipulation risks, the tool-AI was designed to only discuss topics that are absolutely necessary for the process of shutting down unauthorised AI projects. Someone figures out how to make the tool-AI explain how to implement Sovereign AI proposals (and the Explanation / Manipulation related definitions happen to hold for such discussions). But no one figures out how to get it to discuss any topic along the lines of ATA. The original plan was to take an extended period of time to work on ATA before implementing a Sovereign AI.

Both alignment targets use the same method for extrapolating people and for resolving disagreements. The difference is in terms of who is part of the initial group. The two proposals have different rules with respect to things like: animals, people in cryo, foetuses, artificial minds, etc. It doesn't actually matter which proposal gets implemented: the aggregation method leads to the same horrific outcome in both cases (due to an issue along the lines of the one that PCEV suffers from, but more subtle and difficult to notice). (All proposed alignment targets along the lines of ``build an AI Sovereign that would do whatever some specific individual wants it to do'' are rejected out of hand by almost everyone).

In order to avoid making the present post political, let's say that political debates centre on what to do with ecosystems. One side cares about nature and wants to protect ecosystems. The other side wants to prevent animal suffering (even if the cost of such prevention is the total destruction of every ecosystem on earth). It is widely assumed that including animals in the original group will lead to an outcome where animal suffering is prevented at the expense of ecosystems. (in order to make the following scenario more intuitive, readers that have an opinion regarding what should be done with ecosystems can imagine that the majority shares this opinion)

The majority has enough support to launch their Sovereign AI. But the minority is rapidly and steadily gaining followers due to ordinary political dynamics (sometimes attitudes on a given issue change steadily in a predictable direction). So the ability to get the preferred alignment target implemented can disappear permanently at any moment (the exact number of people that would actually vote yes in a referendum is difficult to estimate. But it is clearly shrinking rapidly). In this case the majority might act before they lose the ability to act. Part of the majority would however hesitate if the flaw with the aggregation method is noticed in time.

After the tool-AI was implemented, a large number of people started to work on ATA. There are also AI assistants that contribute to conceptual progress (they are tolerated by the tool-AI because they are not smarter than humans. And they are useful because they contribute a set of unique non-human perspectives). However, it turns out that ATA progress works sort of like math progress. It can be sped up significantly by lots of people working on it in parallel. But the main determinant of progress is how long people have been working on it. In other words: it turns out that there is a limit to how much the underlying conceptual progress can be sped up by throwing large numbers of people at ATA. So the question of whether or not the issue with the Sovereign AI proposal is noticed in time is to a large degree determined by how long a serious ATA research project has been going on at the time that the tool-AI is launched (in other words: doing ATA now reduces the risk of a bad alignment target ending up getting successfully hit in this scenario).

(the idea is not that this exact scenario will play out as described. The point of this section was to give a detailed description of one specific scenario. For example: the world will presumably not actually be engulfed by debates about the Prime Directive from Star Trek. And a tool-AI controlled by a messy coalition of governments might lead to a time crunch due to dynamics that are more related to Realpolitik than any form of ideology. This specific scenario is just one example of a large set of similar scenarios)


PS:

On a common sense level I simply don't see how one can think that it is safe to stay at our current level of ATA progress (where it is clearly not possible to reliably tell a good alignment target from an alignment target that implies an outcome massively worse than extinction). The fact that there exists no research project dedicated to improving this situation seems like a mistake. Intuitively this seems like a dangerous situation. At the very least it seems like some form of positive argument would be needed before concluding that this is safe. And it seems like such an argument should be published so that it can be checked for flaws before one starts acting based on the assumption that the current situation is safe. Please don't hesitate to contact me with theories / questions / thoughts / observations / etc regarding what people actually believe about this.

I thought that your Cosmic Block proposal would only block information regarding things going on inside a given Utopia. I did not think that the Cosmic Block would subject every person to forced memory deletion. As far as I can tell, this would mean removing a large portion of all memories (details below). I think that memory deletion on the implied scale would seriously complicate attempts to define an extrapolation dynamic. It also does not seem to me that it would actually patch the security hole illustrated by the thought experiment in my original comment (details below).

The first section argues that (unless Bob's basic moral framework has been dramatically changed by the memory deletion) no level of memory deletion will prevent BPA from wanting to find and hurt Steve. In brief: BPA will still be subject to the same moral imperative to find and hurt any existing heretics (including Steve).

The second section argues that BPA is likely to find Steve. In brief: BPA is a clever AI and the memory deletion is a human constructed barrier (the Advocates are extrapolations of people that have already been subjected to these memory wipes. So Advocates cannot be involved when negotiating the rules governing these memory wipes). BPA would still have access to a lot of different information sources that it can use to find Steve.

The third section argues that if BPA finds Steve, then BPA would be able to hurt Steve. In brief: creating OldSteve is still not prevented by any rule or constraint that you have mentioned so far.

The fourth section argues that the side effects of memory deletion would be severe. In brief: memories of every conversation about any deleted person would also be deleted. Besides all direct memories involving any deleted person, many indirect memories would also be deleted. This would seriously complicate extrapolation. (Extrapolation is already a very tricky definitional problem. And this definitional problem cannot be delegated to extrapolated Advocates, since they are the result of the extrapolation dynamic being defined).

The last section deals with your proposed resource destruction mechanism. In brief: in one thought experiment (that did not involve threats) almost half the population would very strongly prefer to destroy ELYSIUM. This disproved your claim that such a situation would not arise. Also: no thought experiment involved anyone trying to gain resources. And no thought experiment involved anyone issuing any form of threat that would not be followed up on. People burned resources that they did not need. They did this either to hurt people directly, or to issue genuinely credible threats.

 

BPA will still want to find and hurt heretics

It would not be surprising if a comprehensive, human defined, memory deletion operation would completely rewrite someone's basic moral framework. But I'm guessing that your proposed memory deletion is not supposed to be done in a way that changes a person's basic moral framework. So let's reason from the assumption that it does not.

This means that Bob still considers BPA to be bound by a non-negotiable moral imperative. So BPA still wants to find and punish any heretic that might exist.

In other words: the Cosmic Block is not the type of mechanism that might prevent BPA from wanting to find and hurt Steve. It is instead a practical obstacle that BPA needs to overcome (which is something very different). It is a human constructed practical barrier that is supposed to protect Steve from a clever AI that wants to find and hurt Steve.

 

BPA will likely be able to find Steve

Unless the details of your proposed Cosmic Block are constructed by an AI that prevents All Bad Things, these rules must come from somewhere else. AI-assisted negotiations cannot be done by the Advocates. Advocates are the result of extrapolating memory-wiped people (otherwise the whole point of the Cosmic Block is lost). So the Advocates cannot be involved in defining the memory wipe rules.

In other words: unless the memory wipe rules are negotiated by a completely separate set of (previously unmentioned) AIs, the memory wipe rules will be human defined.

This means that a human constructed barrier must hold against a clever AI trying to get around it. Even if we were to know that a human defined barrier has no humanly-findable security holes, this does not mean that it will actually hold against a clever AI. A clever AI can find security holes that are not humanly-findable.

The specific situation that BPA will find itself in does not seem to be described in sufficient detail for it to be possible to outline a specific path along which BPA finds Steve. But from the currently specified rules, we do know that BPA has access to several ways of gathering information about Steve.

People can pool resources (as described in your original proposal). So Advocates can presumably ask other Advocates about potential partners for cohabitation. Consider the case where BPA is negotiating with other Advocates regarding who will be included in a potential shared environment. This decision will presumably involve information about potential candidates. Whether or not a given person is accepted would presumably depend on detailed personal information.

Advocates can also engage in mutual resource destruction to prevent computations happening within other Utopias. You describe this mechanism as involving negotiations between Advocates, regarding computations happening within other people's Utopias. Such negotiations would primarily be between the Advocates of people that have very different values. This is another potential information source about Steve.

Steve would also have left a lot of effects on the world, besides effects on people's memories. Steve might for example have had a direct impact on what type of person someone else has turned into. Deleting this impact would be even more dramatic than deleting memories.

Steve might also have had a significant impact on various group dynamics (for example: his family, the friend groups that he has been a part of, different sets of coworkers and classmates, online communities, etc). Unless all memories regarding the general group dynamics of every group that Steve has been a part of are deleted, Steve's life would have left behind many visible effects.

The situation is thus that a clever AI is trying to find and hurt Steve. There are many different types of information sources that can be combined in clever ways to find Steve. The rules of all barriers between this AI and Steve are human constructed. Even with perfect enforcement of all barriers, this still sounds like a scenario where BPA will find Steve (for the same reason that a clever AI is likely to find its way out of a human constructed box, or around a human constructed Membrane).

 

There is still nothing protecting Steve from BPA

If BPA locates Steve, then there is nothing preventing BPA from using OldSteve to hurt Steve. What is happening to OldSteve is still not prevented by any currently specified rule. The suffering of OldSteve is entirely caused by internal dynamics. OldSteve never lacks any form of information. And the harm inflicted on OldSteve is not in any sense marginal.

I do not see any strong connections between the OldSteve thought experiment and your Scott Alexander quote (which is concerned with the question of what options and information should be provided by a government run by humans. To children raised by other humans). More generally: scenarios that include a clever AI that is specifically trying to hurt someone have a lot of unique properties (important properties that are not present in scenarios that lack such an AI). I think that these scenarios are dangerous. And I think that they should be avoided (as opposed to first created and then mitigated). (Avoiding such scenarios is a necessary, but definitely not sufficient, feature of an alignment target).

 

Memory wipes would complicate extrapolation

All deleted memories must be so thoroughly wiped that a clever AI will be unable to reconstruct them (otherwise the whole point of the Cosmic Block is negated). Deleting all memories of a single important negative interpersonal relationship would be a huge modification. Even just deleting all memories of one famous person that served as a role model would be significant.

Thoroughly deleting your memory of a person would also impact your memory of every conversation that you have ever had about this person. Including conversations with people that are not deleted. Most long-term social relationships involve a lot of discussion of other people (one person describing past experiences to the other, discussions of people that both know personally, arguments over politicians or celebrities, etc, etc). Thus, the memory deletion would significantly alter the memories of essentially all significant social relationships. This is not a minor thing to do to a person. (That every person would be subjected to this is not obviously implied by the text in The ELYSIUM Proposal.)

In other words: even memories of non-deleted people would be severely modified. For example: every discussion or argument about a deleted person would be deleted. Two people (that do not delete each other) might suddenly have no idea why they almost cut all contact a few years ago, and why their interactions have been so different for the last few years. Either their Advocates can reconstruct the relevant information (in which case the deletion does not serve its purpose). Or their Advocates must try to extrapolate them while lacking a lot of information.

Getting the definitions involved in extrapolation right seems like it will be very difficult even under ordinary circumstances. Wide-ranging and very thorough memory deletion would presumably make extrapolation even more tricky. This is a major issue.

 

Your proposed resource destruction mechanism

No one in any of my thought experiments was trying to get more resources. The 55 percent majority (and the group of 10 people) have a lot of resources that they do not care much about. They want to create some form of existence for themselves. This only takes a fraction of available resources to set up. They can then burn the rest of their resources on actions within the resource destruction mechanism. They either burn these resources to directly hurt people. Or they risk these resources by making threats that are completely credible. In the thought experiments where someone does issue a threat, the threat is issued because the threatener's preference ordering is: a person giving in > burning resources to hurt someone who refuses > leaving someone who refuses alone. They are perfectly ok with an outcome where resources are spent on hurting someone that refuses to comply (they are not self-modifying as a negotiation strategy. They just think that this is a perfectly ok outcome).

Preventing this type of threat would be difficult because (i): negotiations are allowed, and (ii): in any scenario where threats are prevented, the threatened action would simply be taken (for non-strategic reasons). There is no difference in behaviour between scenarios where threats are prevented, and scenarios where threats are ignored.

The thought experiment where a majority burns resources to hurt a minority was a simple example scenario where almost half of the population would very strongly prefer to destroy ELYSIUM (or strongly prefer that ELYSIUM was never created). It was a response to your claim that your resource destruction mechanisms would prevent such a scenario. This thought experiment did not involve any form of threat or negotiation.

Let's call a rule that prevents the majority from hurting the minority a Minority Protection Rule (MPR). There are at least two problems with your claim that a pre-AI majority would prevent the creation of a version of ELYSIUM that has an MPR.

First: without an added MPR, the post-AI majority is able to hurt the minority without giving up anything that they care about (they burn resources they don't need). So there is no reason to think that an extrapolated post-AI majority would want to try to prevent the creation of a version of ELYSIUM with an MPR. They would prefer the case without an MPR. This does not imply that they care enough to try to prevent the creation of a version of ELYSIUM with an MPR. Doing so would presumably be very risky, and they don't gain anything that they care much about. When hurting the minority does not cost them anything that they care about, they do it. That does not imply that this is an important issue for the majority.

More importantly however: you are conflating (i): a set of un-extrapolated and un-coordinated people living in a pre-AI world, with (ii): a set of clever AI Advocates representing these same people, operating in a post-AI world. There is nothing unexpected about humans opposing / supporting an AI that would be good / bad for them (from the perspective of their extrapolated Advocates). That is the whole point of having extrapolated Advocates.

Implementing The ELYSIUM Proposal would lead to the creation of a very large, and very diverse, set of clever AIs that want to hurt people: the Advocates of a great variety of humans who want to hurt others in a wide variety of ways, for a wide variety of reasons. Protecting billions of people from this set of clever AIs would be difficult. As far as I can tell, nothing that you have mentioned so far would provide any meaningful amount of protection from a set of clever AIs like this (details below). I think that it would be better to just not create such a set of AIs in the first place (details below).

 

Regarding AI assisted negotiations

I don't think that it is easy to find a negotiation baseline for AI-assisted negotiations that results in a negotiated settlement that actually deals with such a set of AIs. Negotiation baselines are non-trivial. Reasonable-sounding negotiation baselines can have counterintuitive implications. They can imply power imbalance issues that are not immediately obvious. For example: the random dictator negotiation baseline in PCEV gives a strong negotiation advantage to people that intrinsically value hurting other humans. This went unnoticed for a long time. (It has been suggested that it might be possible to find a negotiation baseline (a BATNA) that can be viewed as having been acausally agreed upon by everyone. However, it turns out that this is not actually possible for a group of billions of humans).
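To make the construction concrete, here is a minimal sketch of what a generic random dictator baseline computes, assuming a finite outcome set and an already-given mapping from people to utility functions (the function name and the numbers are purely illustrative, and this is not meant as a description of how PCEV is actually specified):

```python
# Minimal sketch of a random dictator negotiation baseline (BATNA).
# Assumes a finite outcome set and known utility functions. All names and
# numbers are illustrative; this is not the actual PCEV specification.
import numpy as np

def random_dictator_baseline(utilities: np.ndarray) -> np.ndarray:
    """utilities[i, o] = utility that person i assigns to outcome o.

    Each person's baseline is their expected utility under a lottery where a
    uniformly random person is made dictator and picks their own favourite
    outcome. Negotiated settlements are then compared against this baseline.
    """
    dictator_picks = utilities.argmax(axis=1)          # each person's favourite outcome
    # baseline[i] = average over dictators j of utilities[i, dictator_picks[j]]
    return utilities[:, dictator_picks].mean(axis=1)

# Toy example: 3 people, 4 outcomes.
u = np.array([[1.0, 0.0, 0.5, 0.2],
              [0.0, 1.0, 0.5, 0.2],
              [0.2, 0.2, 0.0, 1.0]])
print(random_dictator_baseline(u))  # approximately [0.4, 0.4, 0.47]
```

Note that whatever each person would do as dictator enters everyone else's baseline lottery, which illustrates how a seemingly symmetric baseline can encode a non-obvious power imbalance.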

 

The proposal to have a simulated war that destroys resources

10 people without any large resource needs could use this mechanism to kill 9 people they don't like at basically no cost (defining C as any computation done within the Utopia of the person they want to kill). Consider 10 people that just want to live a long life, and that do not have any particular use for most of the resources they have available. They can destroy all computational resources of 9 people without giving up anything that they care about. This also means that they can make credible threats. Especially if they like the idea of killing someone for refusing to modify the way that she lives her life. They can do this with person after person, until they have run into 9 people that prefer death to compliance. Doing this costs them basically nothing.

This mechanism does not rule out scenarios where a lot of people would strongly prefer to destroy ELYSIUM. A trivial example would be a 55 percent majority (that does not have a lot of resource needs) burning 90 percent of all resources in ELYSIUM to fully disenfranchise everyone else. And then using the remaining resources to hurt the minority. In this scenario almost half of all people would very strongly prefer to destroy ELYSIUM. Such a majority could alternatively credibly threaten the minority and force them to modify the way they live their lives. The threat would be especially credible if the majority likes the scenario where a minority is punished for refusing to conform.

In other words: this mechanism seems to be incompatible with your description of personalised Utopias as the best possible place to be (subject only to a few non-intrusive ground rules).

 

The Cosmic Block and a specific set of tests

This relies on a set of definitions. And these definitions would have to hold up against a set of clever AIs trying to break them. None of the rules that you have proposed so far would prevent the strategy used by BPA to punish Steve, outlined in my initial comment. OldSteve is hurt in a way that is not actually prevented by any rule that you have described so far. For example: the ``is torture happening here'' test would not trigger for what is happening to OldSteve. So even if Steve does in principle have the ability to stop this by using some resource destruction mechanism, Steve will not be able to do so. Because Steve will never become aware of what Bob is doing to OldSteve. Steve considers OldSteve to be himself in a relevant sense. So, according to Steve's worldview, Steve will experience a lot of very unpleasant things. But the only version of Steve that would be able to pay resources to stop this, would not be able to do so.

So the security hole pointed out by me in my original thought experiment is still not patched. And patching this security hole would not be enough. To protect Steve, one would need to find a set of rules that preemptively patches every single security hole that one of these clever AIs could ever find.

 

I think that it would be better to just not create such a set of AIs

Let's reason from the assumption that Bob's Personal Advocate (BPA) is a clever AI that will be creating Bob's Personalised Utopia. Let's now again take the perspective of ordinary human individual Steve, who gets no special treatment. I think the main question that determines Steve's safety in this scenario is how BPA adopts Steve-referring-preferences. In other words: the question of what BPA wants to do to Steve seems to me to be far more important for Steve's safety than the question of what set of rules will govern Bob's Personalised Utopia, or what set of rules will constrain the actions of BPA.

Another way to look at this is to think in terms of avoiding contradictions. And in terms of making coherent proposals. A proposal that effectively says that everyone should be given everything that they want (or effectively says that everyone's values should be respected) is not a coherent proposal. These things are necessarily defined in some form of outcome or action space. Trying to give everyone overlapping control over everything that they care about in such spaces introduces contradictions.

This can be contrasted with giving each individual influence over the adoption (by any clever AI) of those preferences that refer to her. Since this is defined in preference adoption space, it cannot guarantee that everyone will get everything that they want. But it also means that it does not imply contradictions (see this post for a discussion of these issues in the context of Membrane formalisms). Giving everyone such influence is a coherent proposal.

It also happens to be the case that if one wants to protect Steve from a far superior intellect, then preference adoption space seems to be a lot more relevant than any form of outcome or action space. Because if a superior intellect wants to hurt Steve, then one has to defeat a superior opponent in every single round of a near infinite definitional game (even under the assumption of perfect enforcement, winning every round in such a definitional game against a superior opponent seems hopeless). In other words: I don't think that the best way to approach this is to ask how one might protect Steve from a large set of clever AIs that wants to hurt Steve for a wide variety of reasons. I think a better question is to ask how one might prevent the situation where such a set of AIs wants to hurt Steve.

My thought experiment assumed that all rules and constraints described in the text that you linked to had been successfully implemented. Perfect enforcement was assumed. This means that there is no need to get into issues such as relative optimization power (or any other enforcement related issue). The thought experiment showed that the rules described in the linked text do not actually protect Steve from a clever AI that is trying to hurt Steve (even if these rules are successfully implemented / perfectly enforced).

If we were reasoning from the assumption that some AI will try to prevent All Bad Things, then relative power issues might have been relevant. But there is nothing in the linked text that suggests that such an AI would be present (and it contains no proposal for how one might arrive at some set of definitions that would imply such an AI).

In other words: there would be many clever AIs trying to hurt people (the Advocates of various individual humans). But the text that you link to does not suggest any mechanism, that would actually protect Steve from a clever AI trying to hurt Steve.

There is a ``Misunderstands position?'' react to the following text:

The scenario where a clever AI wants to hurt a human that is only protected by a set of human constructed rules ...

In The ELYSIUM Proposal, there would in fact be many clever AIs trying to hurt individual humans (the Advocates of various individual humans). So I assume that the issue is with the protection part of this sentence. The thought experiment outlined in my comment assumes perfect enforcement (and my post that this sentence is referring to also assumes perfect enforcement). It would have been redundant, but I could have instead written:

The scenario where a clever AI wants to hurt a human that is only protected by a set of perfectly enforced human constructed rules ...

I hope that this clarifies things.

The specific security hole illustrated by the thought experiment can of course be patched. But this would not help. Patching all humanly-findable security holes would also not help (it would prevent the publication of further thought experiments. But it would not protect anyone from a clever AI trying to hurt her. And in The ELYSIUM Proposal, there would in fact be many clever AIs trying to hurt people). The analogy with an AI in a box is apt here. If it is important that an AI does not leave a human constructed box (analogous to: an AI not hurting Steve), then one should avoid creating a clever AI that wants to leave the box (analogous to: avoiding the creation of a clever AI that wants to hurt Steve). In other words: Steve's real problem is that a clever AI is adopting preferences that refer to Steve, using a process that Steve has no influence over.

(Giving each individual influence over the adoption of those preferences that refer to her would not introduce contradictions. Because such influence would be defined in preference adoption space. Not in any form of action or outcome space. In The ELYSIUM Proposal however, no individual would have any influence whatsoever, over the process by which billions of clever AIs, would adopt preferences, that refer to her)

Let's optimistically assume that all rules and constraints described in The ELYSIUM Proposal are successfully implemented. Let's also optimistically assume that every human will be represented by an Advocate that perfectly represents her interests. This will allow us to focus on a problem that remains despite these assumptions.

Let's take the perspective of ordinary human individual Steve. Many clever and powerful AIs would now adopt preferences that refer to Steve (the Advocates of humans that have preferences that refer to Steve). Steve has no influence regarding the adoption of these Steve-Preferences. If one of these clever and powerful AIs wants to hurt Steve, then Steve is only protected by a set of human constructed rules.

The scenario where a clever AI wants to hurt a human that is only protected by a set of human constructed rules was previously discussed here. That post is about Membrane formalisms. But the argument is similar to the argument presented below. Both arguments are also similar to arguments about the difficulty of keeping a clever AI contained in a human constructed box (if it is important that an AI stays in a human constructed box, then one should not build a clever AI that wants to leave the box. If a clever AI wants to leave the box, then plugging all humanly-findable security holes is not enough, because the clever AI can find a security hole that is not humanly-findable). Very briefly: this general situation is dangerous for Steve, because the AI that wants to hurt Steve is more clever than the humans that constructed the rules that are supposed to protect Steve.

Let's explore one specific example scenario where a clever AI finds a way around the specific rules outlined in the text of The ELYSIUM Proposal. Bob does not want to hurt anyone. Bob certainly does not want to use his Utopia as a weapon. However, it is important for Bob that Bob's Utopia was constructed by an ethical AI. A moral imperative for such an AI is that it must punish heretics (if such heretics exist). Bob would prefer a world where no one is a heretic and no one suffers. But unfortunately Steve is a heretic. And the moral imperative to punish Steve is more important than the avoidance of suffering. So Bob's Personal Advocate (BPA) will try to punish Steve.

Steve now faces a clever AI trying to hurt him, and his only protection against this AI is a set of human constructed rules. Even if no human is able to find a way around some specific set of human constructed rules, BPA will be able to think up strategies that no human is able to comprehend (this more serious problem would remain, even if the security hole described below is fully patched). The real problem faced by Steve is that a clever AI has adopted Steve-referring-preferences. And Steve had no influence regarding the decision of which Steve-preferences would be adopted by this clever AI. But let's now return to discussing one specific strategy that BPA can use to hurt Steve without breaking any of the rules described in this specific text.

BPA is constrained by the requirement that all created minds must enthusiastically consent to being created. The other constraint is that BPA is not allowed to torture any created mind. The task of BPA is thus to construct a mind that (i): would enthusiastically consent to being created, and (ii): would suffer in ways that Steve would find horrific, even though no one is torturing this mind.

The details will depend on Steve's worldview. The mind in question will be designed specifically to hurt Steve. One example mind that could be created is OldSteve. OldSteve is what Steve would turn into if Steve were to encounter some specific set of circumstances. Steve considers OldSteve to be a version of himself in a relevant sense (if Steve did not see things in this way, then BPA would have designed some other mind). OldSteve has adopted a worldview that makes it a moral obligation to be created. So OldSteve would enthusiastically consent to being created by BPA. Another thing that is true of OldSteve is that he would suffer horribly due to entirely internal dynamics (OldSteve was designed by a clever AI that was specifically looking for a type of mind that would suffer due to internal dynamics).

So OldSteve is created by BPA. And OldSteve suffers in a way that Steve finds horrific. Steve does not share the moral framework of OldSteve. In particular: Steve does not think that OldSteve had any obligation to be created. In general, Steve does not see the act of creating OldSteve as a positive act in any way. So Steve is just horrified by the suffering. BPA can create a lot of copies of OldSteve with slight variations, and keep them alive for a long time.

(This comment is an example of Alignment Target Analysis (ATA). This post argued that doing ATA now is important, because there might not be a lot of time to do ATA later (for example because Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure). There are many serious AI risks that cannot be reduced by any level of ATA progress. But ATA progress can reduce the probability of a bad alignment target getting successfully implemented. A risk reduction focused ATA project would be tractable, because risks can be reduced even if one is not able to find any good alignment target. This comment discusses which subset of AI risks can (and cannot) be reduced by ATA. This comment is focused on a different topic but it contains a discussion of a related concept (towards the end it discusses the importance of having influence over the adoption of self-referring-preferences by clever AIs).)

I changed the title from: ``A Pivotal Act AI might not buy a lot of time'' to: ``Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure''.

As explained by Martin Randall, the statement: ``something which does not buy ample time is not a pivotal act'' is false (based on the Arbital Guarded Definition of Pivotal Act). Given your ``Agreed react'' to that comment, this issue seems to be settled. In the first section of the present comment, I explain why I still think that the old title was a mistake. The second section outlines a scenario that better illustrates that a Pivotal Act AI might not buy a lot of time.

Why the old title was a mistake

The old title implied that launching the LAI was a very positive event. With the new title, launching the LAI may or may not have been a positive event. This was the meaning that I intended.

Launching the LAI drastically increased the probability of a win by shutting down all competing AI projects. It however also increased risks from scenarios where someone successfully hits a bad alignment target. This can lead to a massively worse than extinction outcome (for example along the lines of the outcome implied by PCEV). In other words: launching LAI may or may not have been a positive event. Thus, launching the LAI may or may not have been a Pivotal Act according to the Arbital Guarded Definition (which requires the event to be very positive).

The old title does not seem to be incompatible with the actual text of the post. But it is incompatible with my intended meaning. I didn't intend to specify whether or not LAI was a positive event. Because the argument about the need for Alignment Target Analysis (ATA) goes through regardless of whether or not launching LAI was a good idea. Regardless of whether or not launching LAI was a positive event, ATA work needs to start now to reduce risks. Because in both cases, ATA progress is needed to reduce risks. And in both cases, there is not a lot of time to do ATA later. (ATA is in fact more important in scenarios where launching the LAI was a terrible mistake)

As I show in my other reply: there is a well established convention of using the term Pivotal Act as a shorthand for shutting down all competing AI projects. As can be seen by looking at the scenario in the post: this might not buy a lot of time. That is how I was using the term when I picked the old title.

A scenario that better illustrates why a Pivotal Act AI might not buy a lot of time

This section outlines a scenario where an unambiguous Pivotal Act is instantly followed by a very severe time crunch. It is possible to see that a Pivotal Act AI might not buy a lot of time by looking at the scenario in the post. But the present section will outline a scenario that better illustrates this fact. (In other words: this section outlines a scenario for which the old title would actually be a good title.) In this new scenario, a Pivotal Act dramatically reduces the probability of extinction by shutting down all unauthorised AI projects. It also completely removes the possibility of anything worse than extinction. Right after the Pivotal Act, there is a frenzied race against the clock to make enough progress on ATA before time runs out. Failure results in a significant risk of extinction.

Consider the case where Dave launches Dave's AI (DAI). If DAI had not been launched, everyone would have almost certainly been killed by some other AI. DAI completely and permanently shuts down all competing AI projects. DAI also reliably prevents all scenarios where designers fail to hit the alignment target that they are aiming at. Due to Internal Time Pressure, a Sovereign AI must then be launched very quickly (discussions of Internal Time Pressure can be found here, and here, and here). There is very little time to decide what alignment target to aim at. (The point made in this section is not sensitive to who gave Dave permission to launch DAI. Or sensitive to who DAI will defer to for the choice of alignment target. But for the sake of concreteness, let's say that the UN security council authorised DAI. And that DAI defers to a global electorate regarding the choice of alignment target).

By the time Dave launches DAI, work on ATA has already progressed very far. There already exist many alignment targets that would in fact lead to an unambiguous win (somehow, describing these outcomes as a win is objectively correct). Only one of the many proposed alignment targets still has an unnoticed problem. And this problem is not nearly as severe as the problem with PCEV. People take the risks of unnoticed problems very seriously. But due to severe Internal Time Pressure, there is not much they can do with this knowledge. The only option is to use their limited time to analyse all alignment targets that are being considered. (many very optimistic assumptions are made regarding both DAI and the level of ATA progress. This is partly to make sure that readers will agree that the act of launching DAI should count as a Pivotal Act. And partly to show that ATA might still be needed, despite these very optimistic assumptions).

The only alignment target that is not a clear win is based on maximising the sum of renormalised utility functions. The proposed AI includes a proposed way of mapping a human to a utility function. This always results in a perfect representation of what the human wants. (And there are no definitional issues with this mapping). These functions are then renormalised to have the same variance (as discussed here). Let's write VarAI for this AI. VarAI maximises the sum of the renormalised functions. The aggregation method described above has a problem that is obvious in retrospect. If that problem is explained, then it is clear that VarAI is an unacceptable alignment target. However, in this scenario, no one has noticed this problem. The question is now whether or not anyone will notice the problem (before an alignment target needs to be settled on).
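(For readers who want the aggregation step spelled out: below is a minimal sketch of this kind of variance-renormalised sum, assuming a finite outcome set and an already-given mapping from each person to a utility function. The function name and the numbers are illustrative, and the sketch says nothing about where the flaw in the method lies.)

```python
# Minimal sketch of the VarAI-style aggregation described above: renormalise
# each person's utility function to the same variance, then pick the outcome
# that maximises the sum. Assumes a finite outcome set and that no person is
# exactly indifferent between all outcomes. All numbers are illustrative.
import numpy as np

def varai_pick(utilities: np.ndarray) -> int:
    """utilities[i, o] = utility that person i assigns to outcome o."""
    stds = utilities.std(axis=1, keepdims=True)      # per-person spread across outcomes
    renormalised = utilities / stds                  # every function now has the same variance
    return int(np.argmax(renormalised.sum(axis=0)))  # outcome maximising the renormalised sum

# Toy example: 3 people, 4 outcomes.
u = np.array([[0.0, 1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0, 0.0],
              [1.0, 1.0, 4.0, 0.0]])
print(varai_pick(u))  # 2
```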

Due to the problem with this aggregation method, VarAI would create a Soft Maximisation version of a Suffering Reducing AI (SMSRAI) as a successor AI (if VarAI is successfully implemented and pointed at the population in this thought experiment). So, if the proponents of VarAI happen to win the political fight, then the result would be SMSRAI. SMSRAI would work to reduce both average suffering and population size (without doing anything drastic). The quality of most people's lives would increase very dramatically. Many people would choose to spend centuries living a very nice life. But after some conversations with SMSRAI, few would choose to live for millennia. No one would choose to live for many millennia. Humanity would very gradually go extinct in a few dozen generations. By the time it happens, no one would find this problematic in any way. Changes in attitudes are very gradual, and no one ever feels pressured or manipulated by SMSRAI (for example: if you ask SMSRAI to leave you alone, it backs off for a couple of centuries before politely asking again if you want to talk).

If anyone finds the problem with the aggregation method in time, then VarAI will be dropped as a potential alignment target. (Once the problem is explained, it is retrospectively obvious why VarAI is completely unacceptable as an alignment target). If no one finds this problem in time however, it is possible that VarAI will win the political fight.

In order to avoid accidental interference, Dave designed DAI to not discuss ATA. (As a general strategy for reducing risks from definitional issues, Dave actually made sure that DAI only does things that are absolutely necessary for the completion of a carefully chosen set of well defined tasks). So the only way to remove the possibility of extinction, is to reach the needed level of ATA understanding before it's too late. (If the problem with this aggregation method is not explained to them, their only path to understanding is to try doing ATA).

Despite severe Internal Time Pressure, the possibility of an unnoticed problem in an alignment target is taken very seriously. So a last judge safety add-on is implemented. The last judge is extrapolated using a different extrapolation method than the one used in VarAI (in order to give the last judge a chance to veto a bad extrapolation method. The VarAI extrapolation definitions are in fact somehow objectively correct. But there is no way of verifying that fact). The last judge extrapolation method turns out to produce a very anxious mind: a mind that is in general not comfortable with objecting to things (such as the extrapolation method that it is the output of, or the outcome implied by an alignment target). This mind is very reluctant to veto a scenario where no one is suffering, and where almost everyone is very happy with all aspects of how things turn out (SMSRAI very gradually, over many generations, ``helps people realise'' that the outcome is actually a good outcome. And people genuinely are having a very nice time, for a lot longer than most people expected). So the off switch is not triggered.

If Dave had not launched DAI, all humans would very likely have been killed very soon by some other AI. So I think a lot of people would consider launching DAI to be a Pivotal Act. (It completely upset the game board. It drastically increased the probability of a win. It was a very positive event according to a wide range of value systems). But if someone wants humanity to go on existing (or wants to personally live a super long life), then there is not a lot of time to find the problem with VarAI (because without sufficient ATA progress, there still exists a significant probability of extinction). So, launching DAI was a Pivotal Act. And launching DAI did not result in a lot of time to work on ATA. Which demonstrates that a Pivotal Act AI might not buy a lot of time.

One can use this scenario as an argument in favour of starting ATA work now. It is one specific scenario that exemplifies a general class of scenarios: scenarios where starting ATA work now would further reduce an already small risk of a moderately bad outcome. It is a valid argument. But it is not the argument that I was trying to make in my post. I was thinking of something a lot more dangerous. I was imagining a scenario where a bad alignment target is very likely to get successfully implemented unless ATA progresses to the needed levels of insight before it is too late. And I was imagining an alignment target that implied a massively worse than extinction outcome (for example along the lines of the outcome implied by PCEV). I think this is a stronger argument in favour of starting work on ATA now. And this interpretation was ruled out by the old title (which is why I changed the title).

(a brief tangent: if someone expects everything to turn out well, but would like to work on ATA in order to further reduce a small probability of something going moderately badly, then I would be very happy to collaborate with such a person in a future ATA project. Having very different perspectives in an ATA project sounds like a great idea. An ATA project is very different from a technical design project where a team is trying to get something implemented that will actually work. There is really no reason for people to have similar worldviews or even compatible ontologies. It is a race against time to find a conceptual breakthrough of an unknown type. It is a search for an unnoticed implicit assumption of an unknown type. So genuinely different perspectives sound like a great idea)

In summary: ``A Pivotal Act AI might not buy a lot of time'' is in fact a true statement. And it is possible to see this by looking at the scenario outlined in the post. But it was a mistake to use this statement as the title for this post, because it implies things about the scenario that I did not intend to imply. So I changed the title and outlined a scenario that is better suited for illustrating that a Pivotal Act AI might not buy a lot of time.

 

PS:

I upvoted johnswentworth's comment. My original title was a mistake, and the comment helped me realise this. I hope that others will post similar comments on my posts in the future. The comment deserves upvotes. But I feel like I should ask about these agreement votes.

The statement ``something which does not buy ample time is not a pivotal act'' is clearly false. Martin Randall explained why the statement is false (helpfully pulling out the relevant quotes from the texts that johnswentworth cited). And then johnswentworth did an ``Agreed'' reaction on Martin Randall's explanation of why the statement is false. After this however, johnswentworth's comment (with the statement that had already been determined to be false) was agree-voted to +7. That seemed odd to me. So I wanted to ask about it. (My posts sometimes question deeply entrenched assumptions. And johnswentworth's comment sort of looks like criticism (at least if one only skims the post and the discussion). So maybe there is no great mystery here. But I still wanted to ask about this. Mostly in case someone has noticed an object-level error in my post. But I am also open to terminology feedback.)

I will change the title.

However: you also seem to be using the term Pivotal Act as a synonym for removing all time pressure from competing AI projects (which the AI in my post does). Example 3 of the arbital page that you link to also explicitly refers to an act that removes all time pressure from competing AI projects as a Pivotal Act. This usage is also present in various comments by you, Yudkowsky, and others (see links and quotes below). And there does not seem to exist any other established term for an AI that: (i): completely removes all time pressure from competing AI projects by uploading a design team and giving them infinite time to work, (ii): keeps the designers calm, rational, sane, etc indefinitely (with all definitional issues of those terms fully solved), and (iii): removes all risks from scenarios where someone fails to hit an alignment target. What other established term exists for such an AI? I think people would generally refer to such an AI as a Pivotal Act AI. And as demonstrated in the post: such an AI might not buy a lot of time.

Maybe using the term Pivotal Act as a synonym for an act that removes all time pressure from competing AI projects is a mistake? (Maybe the scenario in my post should be seen as showing that this usage is a mistake?). But it does seem to be a very well established way of using the term. And I would like to have a title that tells readers what the post is about. I think the current title probably did tell you what the post is about, right? (that the type of AI actions that people tend to refer to as Pivotal Acts might not buy a lot of time in reality)

In the post I define new terms. But if I use a novel term in the title before defining that term, the title will not tell you what the post is about. So I would prefer to avoid doing that.

But I can see why you might want to have Pivotal Act be a protected term for something that is actually guaranteed to buy a lot of time (which I think is what you would like to do?). And perhaps it is possible to maintain (or re-establish?) this usage. And I don't want to interfere with your efforts to do this. So I will change the title.

If we can't find a better solution, I will change the title to: Internal Time Pressure. It does not really tell you what the post will be about. But at least it is accurate and not terminologically problematic. And even though the term is not commonly known, Internal Time Pressure is actually the main topic of the post (Internal Time Pressure is the reason that the AI described above, which does all of those nice things, might not actually buy a lot of time).


Regarding current usage of the term Pivotal Act:

It seems to me like you and many others are actually using the term as a shorthand for an AI that removes time pressure from competing AI projects. I can draw many examples of this usage just from the discussion that faul_sname links to in the other reply to your comment.

In the second last paragraph of part 1 of the linked post, Andrew_Critch writes:

Overall, building an AGI development team with the intention to carry out a “pivotal act” of the form “forcibly shut down all other A(G)I projects” is probably going to be a rough time, I predict.

No one seems to be challenging that usage of Pivotal Act (even though many other parts of the post are challenged). And it is not just this paragraph. The tl;dr also treats a Pivotal Act as interchangeable with ``shut down all other AGI projects, using safe AGI''. There are other examples in the post.

In this comment on the post, it seems to me that Scott Alexander is using a Pivotal Act AI as a direct synonym for an AI capable of destroying all competing AI projects.

In this comment it seems to me like you are using Pivotal Act interchangeably with shutting down all competing AI projects. In this comment, it seems to me that you accept the premise that uploading a design team and running them very quickly would be a Pivotal Act (but you question the plan on other grounds). In this comment, it seems to me that you are equating successful AI regulation with a Pivotal Act (but you question the feasibility of regulation).

In this comment, Yudkowsky seems to me to be accepting the premise that preventing all competing AI projects would count as a Pivotal Act. He says that the described strategy for preventing all competing AI projects is not feasible. But he also says that he will change the way he speaks about Pivotal Acts if the strategy actually does work (and this strategy is to shut down competing AI projects with EMPs. The proposed strategy does nothing else to buy time, other than shutting down competing AI projects). (It is not an unequivocal case of using Pivotal Act as a direct synonym for reliably shutting down all competing AI projects. But it really does seem to me like Yudkowsky is treating Pivotal Act as a synonym for: preventing all competing AI projects. Or at least that he is assuming that preventing all competing AI projects would constitute a Pivotal Act).

Consider also example 3 in the arbital page that you link to. Removing time pressure from competing AI projects by uploading a design team is explicitly given as an example of a Pivotal Act. And the LAI in my post does exactly this. The LAI in my post also does a lot of other things that increase the probability of a win (such as keeping the designers sane and preventing them from missing an aimed-for alignment target).

This usage points to a possible title along the lines of: The AI Actions that are Commonly Referred to as Pivotal Acts are not Actually Pivotal Acts (or: Shutting Down all Competing AI Projects is not Actually a Pivotal Act). This is longer and less informative about what the post is about (the post is about the need to start ATA work now, because there might not be a lot of time to do ATA work later, even if we assume the successful implementation of a very ambitious AI, whose purpose was to buy time). But this title would not interfere with an effort to maintain (or re-establish?) the meaning of Pivotal Act as a synonym for an act that is guaranteed to buy lots of time (which I think is what you are trying to do?). What do you think about these titles?


PS:

(I think that technically the title probably does conform to the specific text bit that you quote. It depends on what the current probability of a win is, and on how one defines ``drastically increase the probability of a win''. But given the probability that Yudkowsky currently assigns to a win, I expect that he would agree that the launch of the described LAI would count as drastically increasing the probability of a win. (In the described scenario, there are many plausible paths along which the augmented humans actually do reach the needed levels of ATA progress in time. They are however not guaranteed to do this. The point of the post is that doing ATA now increases the probability of this happening). The statement that the title conforms to the quoted text bit is however only technically true in an uninteresting sense. And the title conflicts with your efforts to guard the usage of the term. So I will change the title as soon as a new title has been settled on. If nothing else is agreed on, I will change the title to: Internal Time Pressure)

I interpret your comment as a prediction regarding where new alignment target proposals will come from. Is this correct?


I also have a couple of questions about the linked text:

How do you define the difference between explaining something and trying to change someone's mind? Consider the case where Bob is asking a factual question. An objectively correct straightforward answer would radically change Bob's entire system of morality, in ways that the AI can predict. A slightly obfuscated answer would result in far less dramatic changes. But those changes would be in a completely different direction (compared to the straightforward answer). Refusing to answer, while being honest about the reason for refusal, would send Bob into a tailspin. How certain are you that you can find a definition of Acceptable Forms of Explanation that holds up in a large number of messy situations along these lines? See also this.

And if you cannot define such things in a solid way, how do you plan to define ``benefit humanity''? PCEV was an effort to define ``benefit humanity''. And PCEV has been found to suffer from at least one difficult-to-notice problem. How certain are you that you can find a definition of ``benefit humanity'' that does not suffer from some difficult-to-notice problem?

 

PS:

Speculation regarding where novel alignment target proposals are likely to come from is very welcome. It is a prediction of things that will probably be fairly observable fairly soon. And it is directly relevant to my work. So I am always happy to hear this type of speculation.

Let's reason from the assumption that you are completely right. Specifically, let's assume that every possible Sovereign AI Project (SAIP) would make things worse in expectation. And let's assume that there exists a feasible Better Long Term Solution (BLTS).

In this scenario, ATA would still only be a useful tool for reducing the probability of one subset of SAIPs (even if all SAIPs are bad, some designers might be unresponsive to arguments, some flaws might not be realistically findable, etc). But it seems to me that ATA would be one complementary tool for reducing the overall probability of a SAIP being launched. And this tool would not be easy to replace with other methods. ATA could convince the designers of a specific SAIP that their particular project should be abandoned. If ATA results in the description of necessary features, then it might even help a (member of a) design team see that it would be bad if a secret project were to successfully hit a completely novel, unpublished alignment target (for example along the lines of this necessary Membrane formalism feature).

ATA would also be a project where people can collaborate despite almost opposite viewpoints on the desirability of SAIPs. Consider Bob, who mostly just wants to get some SAIP implemented as fast as possible. Bob still recognizes the unlikely possibility of dangerous alignment targets with hidden flaws (but he does not think that this risk is anywhere near large enough to justify waiting to launch a SAIP). You and Bob clearly have very different viewpoints regarding how the world should deal with AI. But there is actually nothing preventing you and Bob from cooperating on a risk reduction focused ATA project.

This type of diversity of perspectives might actually be very productive for such a project. You are not trying to build a bridge on a deadline. You are not trying to win an election. You do not have to be on the same page to get things done. You are trying to make novel conceptual progress, looking for a flaw of an unknown type.

Basically: reducing the probability of outcomes along the lines of the outcome implied by PCEV is useful according to a wide range of viewpoints regarding how the world should deal with AI. (There is nothing unusual about this general state of affairs. Consider for example Dave and Gregg, who are on opposite sides of a vicious political trench war over the issue of pandemic lockdowns. There is nothing on the object level that prevents them from collaborating on a vaccine research effort. So this feature is certainly not unique. But I still wanted to highlight the fact that a risk mitigation focused ATA project does have this feature.)

I think I see your point. Attempting to design a good alignment target could lead to developing intuitions that would be useful for ATA. A project trying to design an alignment target might result in people learning skills that allow them to notice flaws in alignment targets proposed by others. Such projects can therefore contribute to the type of risk mitigation that I think is lacking. I think that this is true. But I do not think that such projects can be a substitute for an ATA project with a risk mitigation focus.


Regarding Orthogonal:

It is difficult for me to estimate how much effort Orthogonal spends on different types of work. But it seems to me that your published results are mostly about methods for hitting alignment targets. This also seems to me to be the case for your research goals. If you are successful, it seems to me that your methods could be used to hit almost any alignment target (subject to constraints related to finding individuals that want to hit specific alignment targets).

I appreciate you engaging on this, and I would be very interested in hearing more about how the work done by Orthogonal could contribute to the type of risk mitigation effort discussed in the post. I would, for example, be very happy to have a voice chat with you about this.
