The proposals described in your points 1 and 2 are very different from any of the proposals that I am analysing in the post. I consider this to be a good thing. But I wanted to note explicitly that this discussion has now moved very far away from what was discussed in the post, and is best seen as a new discussion (a discussion that starts with the proposals described in your points 1 and 2). Making this clear is important, because it means that many points made in the post (and also earlier in this thread) do not apply to the class of proposals that we ar...
There are no Pareto improvements relative to the new Pareto Baseline that you propose. Bob would indeed classify a scenario with an AI that takes no action as a Dark Future. However, consider Bob2, who takes another perfectly coherent position on how to classify an AI that never acts. If something literally never takes any action, then Bob2 simply does not classify it as a person. Bob2 therefore does not consider a scenario with an AI that literally never does anything to be a Dark Future (other than this difference, Bob2 agrees with Bob about morality). T...
Given that you agreed with most of what I said in my reply, it seems like you should also agree that it is important to analyse these types of alignment targets. But in your original comment you said that you do not think such analysis is important.
Let's write Multi Person Sovereign AI Proposal (MPSAIP) for an alignment target proposal to build an AI Sovereign that gets its goal from the global population (in other words: the type of alignment target proposals that I was analysing in the post). I followed your links an...
I'm sorry if the list below looks like nitpicking. But I really do think that these distinctions are important.
Bob holds 1 as a value. Not as a belief.
Bob does not hold 2 as a belief or as a value. Bob thinks that someone as powerful as the AI has an obligation to punish someone like Dave. But that is not the same as 2.
Bob does not hold 3 as a belief or as a value. Bob thinks that for someone as powerful as the AI, the specific moral outrage in question renders the AI unethical. But that is not the same as 3.
Bob does hold 4 as a value. But it is worth noti...
Bob really does not want the fate of the world to be determined by an unethical AI. There is no reason for such a position to be instrumental. For Bob, this would be worse than the scenario with no AI (in the Davidad proposal, this is the baseline that is used to determine whether or not something is a Pareto-improvement). Both scenarios contain non-punished heretics. But only one scenario contains an unethical AI. Bob prefers the scenario without an unethical AI (for non-instrumental reasons).
The question is whether or not at least...
Consider Bob, who takes morality very seriously. Bob thinks that any scenario where the fate of the world is determined by an unethical AI is worse than the scenario with no AI. Bob sticks with this moral position regardless of how much stuff Bob would get in a scenario with an unethical AI. For a mind as powerful as an AI, Bob considers it to be a moral imperative to ensure that heretics do not escape punishment. If a group contains at least one person like Bob (and at least one person who would strongly object to being punished), then the set of Paret...
I do think that it’s important to analyse alignment targets like these. Given the severe problems that all of these alignment targets suffer from, I certainly hope that you are right about them being unlikely. I certainly hope that nothing along the lines of a Group AI will ever be successfully implemented. But I do not think that it is safe to assume this. The successful implementation of an instruction following AI would not remove the possibility that an AI Sovereign will be implemented later. The CEV Arbital page actually assumes that the pat...
I thought that your Cosmic Block proposal would only block information regarding things going on inside a given Utopia. I did not think that the Cosmic Block would subject every person to forced memory deletion. As far as I can tell, this would mean removing a large portion of all memories (details below). I think that memory deletion on the implied scale would seriously complicate attempts to define an extrapolation dynamic. It also does not seem to me that it would actually patch the security hole illustrated by the thought experiment in my original comm...
Implementing The ELYSIUM Proposal would lead to the creation of a very large, and very diverse, set of clever AIs that want to hurt people: the Advocates of a great variety of humans who want to hurt others in a wide variety of ways, for a wide variety of reasons. Protecting billions of people from this set of clever AIs would be difficult. As far as I can tell, nothing that you have mentioned so far would provide any meaningful amount of protection from a set of clever AIs like this (details below). I think that it would be better to just not create s...
My thought experiment assumed that all rules and constraints described in the text that you linked to had been successfully implemented. Perfect enforcement was assumed. This means that there is no need to get into issues such as relative optimization power (or any other enforcement-related issue). The thought experiment showed that the rules described in the linked text do not actually protect Steve from a clever AI that is trying to hurt Steve (even if these rules are successfully implemented / perfectly enforced).
If we were reasoning from the assumpti...
Let's optimistically assume that all rules and constraints described in The ELYSIUM Proposal are successfully implemented. Let's also optimistically assume that every human will be represented by an Advocate that perfectly represents her interests. This will allow us to focus on a problem that remains despite these assumptions.
Let's take the perspective of Steve, an ordinary human individual. Many clever and powerful AIs would now adopt preferences that refer to Steve (the Advocates of humans who have preferences that refer to Steve). Steve has no influence r...
I changed the title from: ``A Pivotal Act AI might not buy a lot of time'' to: ``Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure''.
As explained by Martin Randall, the statement: ``something which does not buy ample time is not a pivotal act'' is false (based on the Arbital Guarded Definition of Pivotal Act). Given your ``Agreed react'' to that comment, this issue seems to be settled. In the first section of the present comment, I explain why I still think that the old title was a mistake. The second section...
I will change the title.
However: you also seem to be using the term Pivotal Act as a synonym for removing all time pressure from competing AI projects (which the AI in my post does). Example 3 of the Arbital page that you link to also explicitly refers to an act that removes all time pressure from competing AI projects as a Pivotal Act. This usage is also present in various comments by you, Yudkowsky, and others (see links and quotes below). And there does not seem to exist any other established term for an AI that: (i): completely removes all time pressur...
Your comment makes me think that I might have been unclear regarding what I mean by ATA. The text below is an attempt to clarify.
Not all paths to powerful autonomous AI go through methods from the current paradigm. It seems difficult to rule out the possibility that a Sovereign AI will eventually be successfully aligned to some specific alignment target. At current levels of progress on ATA this would be very dangerous (because understanding an alignment target properly is difficult, and a seemingly-nice proposal can imply a very bad outcome). It ...
I interpret your comment as a prediction regarding where new alignment target proposals will come from. Is this correct?
I also have a couple of questions about the linked text:
How do you define the difference between explaining something and trying to change someone's mind? Consider the case where Bob is asking a factual question. An objectively correct straightforward answer would radically change Bob's entire system of morality, in ways that the AI can predict. A slightly obfuscated answer would result in far less dramatic changes. But those changes woul...
Let's reason from the assumption that you are completely right. Specifically, let's assume that every possible Sovereign AI Project (SAIP) would make things worse in expectation. And let's assume that there exists a feasible Better Long Term Solution (BLTS).
In this scenario ATA would still only be a useful tool for reducing the probability of one subset of SAIPs (even if all SAIPs are bad, some designers might be unresponsive to arguments, some flaws might not be realistically findable, etc.). But it seems to me that ATA would be one complementary tool for r...
I think I see your point. Attempting to design a good alignment target could lead to developing intuitions that would be useful for ATA. A project trying to design an alignment target might result in people learning skills that allows them to notice flaws in alignment targets proposed by others. Such projects can therefore contribute to the type of risk mitigation that I think is lacking. I think that this is true. But I do not think that such projects can be a substitute for an ATA project with a risk mitigation focus.
Regarding Orthogonal:
It is difficult ...
The proposed research project would indeed be focused on a certain type of alignment target: for example, proposals along the lines of PCEV, but not proposals along the lines of a tool-AI. Referring to this as Value-Alignment Target Analysis (VATA) would also be a possible notation. I will adopt this notation for the rest of this comment.
The proposed VATA research project would be aiming for risk mitigation. It would not be aiming for an answer:
There is a big difference between proposing an alignment target on the one hand, and pointing out problems with al...
Regarding the political feasibility of PCEV:
PCEV gives a lot of extra power to some people, specifically because those people intrinsically value hurting other humans. This presumably makes PCEV politically impossible in a wide range of political contexts (including negotiations between a few governments). More generally: now that it has been pointed out that PCEV has this feature, the risks from scenarios where PCEV gets successfully implemented have presumably been mostly removed. Because PCEV is probably off the table as a potential alignment target, pre...
Regarding Corrigibility as an alternative safety measure:
I think that exploring the Corrigibility concept sounds like a valuable thing to do. I also think that Corrigibility formalisms can be quite tricky (for similar reasons that Membrane formalisms can be tricky: I think that they are both vulnerable to difficult-to-notice definitional issues). Consider a powerful and clever tool-AI. It is built using a Corrigibility formalism that works very well when the tool-AI is used to shut down competing AI projects. This formalism relies on a definition of Explan...
I agree that the focus should be on preventing the existence of a Sovereign AI that seeks to harm people (as opposed to trying to deal with such an AI after it has already been built). The main reason for trying to find necessary features is actually that it might stop a dangerous AI project from being pursued in the first place. In particular: it might convince the design team to abandon an AI project that clearly lacks a feature that has been found to be necessary. An AI project that would (if successfully implemented) result in an AI Sovereign that would ...
Thanks for the feedback! I see what you mean and I edited the post. (I turned a single-paragraph abstract into a three-paragraph Summary section. The text itself has not been changed.)
Thank you for engaging. If this was unclear for you, then I'm sure it was also unclear for others.
The post outlined a scenario where a Corrigibility method works perfectly for one type of AI (an AI that does not imply an Identifiable Outcome (IO), for example a PAAI). The same Corrigibility method fails completely for another type of AI (an AI that does imply an IO, for example PCEV). So the second AI, which does have an IO, is indeed not corrigible.
This Corrigibility method leads to an outcome that is massively worse than extinction. This bad ...
The first AI is genuinely Corrigible. The second AI is not Corrigible at all. This leads to a worse outcome, compared to the case where there was no Corrigible AI. Do you disagree with the statement that the first AI is genuinely Corrigible? Or do you disagree with the statement that the outcome is worse, compared to the case where there was no Corrigible AI?
Thank you for the clarification. This proposal is indeed importantly different from the PCEV proposal. But since some people hold hurting heretics to be a moral imperative, any AI that allows heretics to escape punishment will also be seen as unacceptable by at least some people. This means that the set of Pareto improvements is empty.
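To make the structure of this point concrete, here is a minimal sketch in Python (with made-up utility numbers, purely for illustration) of why no Pareto improvement over the no-AI baseline exists once one person considers an AI that lets heretics escape punishment to be worse than no AI at all, while another person considers being punished to be worse than no AI at all:

```python
# Toy illustration with hypothetical utility numbers (not taken from any actual proposal).
# An outcome is a Pareto improvement over the baseline only if no one is worse off
# than in the baseline (and at least one person is better off).

BASELINE = "no_AI"

utilities = {
    # Bob: an AI that tolerates heretics is an unethical AI, worse than no AI at all.
    "Bob":  {"no_AI": 0, "AI_tolerates_heretics": -10, "AI_punishes_heretics": 5},
    # Dave: being punished as a heretic is worse than no AI at all.
    "Dave": {"no_AI": 0, "AI_tolerates_heretics": 5,   "AI_punishes_heretics": -10},
}

def pareto_improvements(utilities, baseline):
    outcomes = next(iter(utilities.values())).keys()
    return [
        o for o in outcomes
        if o != baseline
        and all(u[o] >= u[baseline] for u in utilities.values())
        and any(u[o] > u[baseline] for u in utilities.values())
    ]

print(pareto_improvements(utilities, BASELINE))  # -> []  (the set is empty)
```

Every outcome in which the AI acts is blocked by either Bob or Dave, so no outcome that everyone weakly prefers to the no-AI baseline exists.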
In other words: hurting heretics is indeed off the table in your proposal (which is an important difference compared to PCEV). However, any scenario that includes the existence of an AI that allows heretics to escape punishment is also off the...
There is a serious issue with your proposed solution to problem 13. Using a random dictator policy as a negotiation baseline is not suitable for the situation where billions of humans are negotiating about the actions of a clever and powerful AI. One problem with using this solution, in this context, is that some people have strong commitments to moral imperatives along the lines of ``heretics deserve eternal torture in hell''. The combination of these types of sentiments and a powerful and clever AI (that would be very good at thinking up effective wa...
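To make the worry concrete, here is a minimal sketch in Python (with made-up numbers, purely for illustration) of what a random dictator baseline looks like from the perspective of someone whom a fanatical minority wants punished. The point is only that the baseline each person negotiates up from already contains a probability of an extremely bad outcome, proportional to the fraction of the population holding such imperatives:

```python
# Toy illustration with hypothetical numbers (not taken from any actual proposal).
# Under a random dictator baseline, each negotiator's fallback is the lottery
# "a uniformly random participant decides what the powerful AI does".

fraction_fanatics = 0.01          # assume 1% hold "heretics must be punished" as a moral imperative
u_if_fanatic_picked = -1_000_000  # utility for a heretic if a fanatic is drawn as dictator
u_otherwise = 100                 # utility for that heretic if anyone else is drawn

# Expected value of the baseline lottery for the heretic:
baseline_value = (fraction_fanatics * u_if_fanatic_picked
                  + (1 - fraction_fanatics) * u_otherwise)
print(baseline_value)  # -9901.0

# A negotiated outcome only has to beat this already-terrible baseline,
# so the more effectively the AI could hurt heretics on a fanatic's behalf,
# the worse the baseline (and thus the acceptable negotiated outcomes) become.
```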
I think it is very straightforward to hurt a human individual, Steve, without piercing Steve's Membrane. Just create and hurt minds that Steve cares about, but don't tell him about it (in other words: ensure that there is zero effect on predictions of things inside the Membrane). If Bob knew Steve before the Membrane-enforcing AI was built, and Bob wants to hurt Steve, then Bob presumably knows Steve well enough to know what minds to create (in other words: there is no need to have any form of access to any form of information that is within Steve's Membrane...
For a set of typical humans who are trying to agree on what an AI should do, there does not exist any fallback option that is acceptable to almost everyone. For each fallback option, there exists a large number of people who will find this option completely unacceptable on moral grounds. In other words: when trying to agree on what an AI should do, there exists no place that people can walk away to that will be seen as safe / acceptable by a large majority of people.
Consider the common aspect of human morality that is sometimes expressed in theolog...
This comment is trying to clarify what the post is about, and by extension clarify which claims are made. Clarifying terminology is an important part of this. Both the post and my research agenda are focused on the dangers of successfully hitting a bad alignment target. This is one specific subset of the existential threats that humanity faces from powerful AI. Let's distinguish the danger being focused on from other types of dangers by looking at a thought experiment with an alignment target that is very obviously bad. A well-intentioned designer named ...
What about the term uncaring AI? In other words, an AI that would keep humans alive if offered resources to do so. This can be contrasted with a Suffering Reducing AI (SRAI), which would not keep humans alive in exchange for resources. SRAI is an example of successfully hitting a bad alignment target, which is an importantly different class of dangers compared to the dangers of an aiming failure leading to an uncaring AI. While an uncaring AI would happily agree to leave Earth alone in exchange for resources, this is not the case for SRAI, because killin...
If your favoured alignment target suffers from a critical flaw that is inherent in the core concept, then surely it must be useful for you to discover this. So I assume that you agree that, conditioned on me being right about CEV suffering from such a flaw, you want me to tell you about this flaw. In other words, I think that I have demonstrated that CEV suffers from a flaw that is not related to any detail of any specific version, or any specific description, or any specific proxy, or any specific attempt to describe what CEV is, or anything else ...
The version of CEV that is described on the page that your CEV link leads to is PCEV. The acronym PCEV was introduced by me, so it does not appear on that page. But it is PCEV that you link to. (In other words: the proposed design that would lead to the LP outcome cannot be dismissed as some obscure version of CEV. It is the version that your own CEV link leads to. I am aware of the fact that you are viewing PCEV as ``a proxy for something else'' / ``a provisional attempt to describe what CEV is''. But this fact still seemed noteworthy.)
On...
I was clearly wrong regarding how you feel about your cells. But surely the question of whether or not an AI that is implementing the CEV of Steve would result in any surviving cells is an empirical question? (It must be settled by referring to facts about Steve, and by trying to figure out what these facts mean in terms of how the CEV of Steve would treat his cells.) It cannot possibly be the case that it is impossible, by definition, to discover that every reasonable way of extrapolating Steve would result in all his cells dying?
Thank you for engaging on th...
I think that extrapolation is a genuinely unintuitive concept. I would, for example, not be very surprised if it turns out that you are right, and that it is impossible to reasonably extrapolate you if the AI that is doing the extrapolation is cut off from all information about other humans. I don't think that this is in tension with my statement that individuals and groups are completely different types of things. Taking your cell analogy: I think that implementing the CEV of you could lead to the death of every single cell in your body (for example i...
I think that ``CEV'' is usually used as shorthand for ``an AI that implements the CEV of Humanity''. This is what I am referring to when I say ``CEV''. So, what I mean when I say that ``CEV is a bad alignment target'' is that, for any reasonable set of definitions, it is a bad idea to build an AI that does what ``a Group'' wants it to do (in expectation, from the perspective of essentially any human individual, compared to extinction). Since groups and individuals are completely different types of things, it should not be surprising to learn that doi...
I agree that ``the ends justify the means'' type thinking has led to a lot of suffering. For this, I would like to switch from the Chinese Cultural Revolution to the French Revolution as an example (I know it better, and I think it fits better for discussions of this attitude). So, someone wants to achieve something that is today seen as a very reasonable goal, such as ``end serfdom and establish formal equality before the law''. So, basically: their goals are positive, and they achieve these goals. But perhaps they could have achieved those goals wi...
I think that my other comment to this will hopefully be sufficient to outline what my position actually is. But perhaps a more constructive way forward would be to ask how certain you are that CEV is, in fact, the right thing to aim at. That is, how certain are you that this situation is not symmetrical to the case where Bob thinks that ``a Suffering Reducing AI (SRAI) is the objectively correct thing to aim at''? Bob will diagnose any problem with any specific SRAI proposal as arising from proxy issues, related to the fact that Bob is not able t...
I don't think that they are all status games. If so, then why did people (for example) include long meditations in their private diaries regarding whether or not they personally deserve to go to hell? While they were focusing on the ``who is a heretic?'' question, it seems that they were taking for granted the normative position: ``if someone is a heretic, then she deserves eternal torture in hell''. But, on the other hand, private diaries are of course sometimes opened while the people who wrote them are still alive (this is not the most obvious thing, t...
It is getting late here, so I will stop after this comment, and look at this again tomorrow (I'm in Germany). Please treat the comment below as not fully thought through.
The problem from my perspective is that I don't think that the objective that you are trying to approximate is a good objective (in other words, I am not referring to problems related to optimising a proxy. They also exist, but they are not the focus of my current comments). I don't think that it is a good idea to do what an abstract entity called ``humanity'' wants (and I think tha...
I'm not sure that I agree with this. I think it mostly depends on what you mean by ``something like CEV''. All versions of CEV are describable as ``doing what a Group wants''; it is inherent in the core concept of building an AI that is ``Implementing the Coherent Extrapolated Volition of Humanity''. This rules out proposals where each individual is given meaningful influence regarding the adoption of those preferences that refer to her, for example as in MPCEV (described in the post that I linked to above). I don't see how an AI can be safe, for in...
In the case of damage from political movements, I think that many truly horrific things have been done by people who are well approximated as ``genuinely trying to do good, and largely achieving their objectives, without major unwanted side effects'' (for example events along the lines of the Chinese Cultural Revolution, which you discuss in the older post that you link to in your first footnote).
I think our central disagreement might be a difference in how we see human morality. In other words, I think that we might have different views regarding ...
I think that these two proposed constraints will indeed remove some bad outcomes. But I don't think that they will help in the thought experiment outlined in the post. These fanatics want all heretics in existence to be punished. This is a normative convention. It is a central aspect of their morality. An AI that deviates from this ethical imperative is seen as an unethical AI. Deleting all heretics from the memory of the fanatics will not change this aspect of their morality. It's genuinely not personal. They think that it would be highly unethical, f...
I do think that the outcome would be LP (more below), but I can illustrate the underlying problem using a set of alternative thought experiments that do not require agreement on LP vs MP.
Let's first consider the case where half of the heretics are seen as Mild Heretics (MH) and the other half as Severe Heretics (SH). MH are those who are open to converting as part of a negotiated settlement (and SH are those who are not open to conversion). The Fanatics (F) would still prefer MP, where both MH and SH are hurt as much as possible. But F is willing t...
The successful implementation of an instruction following AI would not remove the possibility that an AI Sovereign will be implemented later. For example: the path to an AI that implements the CEV of Humanity outlined on the CEV Arbital page starts with an initial non-Sovereign AI (and this initial AI could be the type of instruction following AI that you mention). In other words: the successful implementation of an instruction following AI does not prevent the later implementation of a Group AI. It is in fact one step on the classical path to a ...