All of ThomasCederborg's Comments + Replies

The successful implementation of an instruction-following AI would not remove the possibility that an AI Sovereign will be implemented later. For example: the path to an AI that implements the CEV of Humanity outlined on the CEV Arbital page starts with an initial non-Sovereign AI (and this initial AI could be the type of instruction-following AI that you mention). In other words: the successful implementation of an instruction-following AI does not prevent the later implementation of a Group AI. It is in fact one step on the classical path to a ... (read more)

The proposals described in your points 1 and 2 are very different from any of the proposals that I am analysing in the post. I consider this to be a good thing. But I wanted to note explicitly that this discussion has now moved very far away from what was discussed in the post, and is best seen as a new discussion (a discussion that starts with the proposals described in your points 1 and 2). Making this clear is important, because it means that many points made in the post (and also earlier in this thread) do not apply to the class of proposals that we ar... (read more)

There are no Pareto improvements relative to the new Pareto Baseline that you propose. Bob would indeed classify a scenario with an AI that takes no action as a Dark Future. However, consider Bob2, who takes another perfectly coherent position on how to classify an AI that never acts. If something literally never takes any action, then Bob2 simply does not classify it as a person. Bob2 therefore does not consider a scenario with an AI that literally never does anything to be a Dark Future (other than this difference, Bob2 agrees with Bob about morality). T... (read more)
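To make the structure of this disagreement concrete, here is a minimal toy sketch (Python, with invented outcome features and welfare numbers; everything in it is an illustrative assumption rather than part of the original exchange) of how a single hard constraint like Bob2's can leave the set of Pareto improvements empty:

```python
# Toy sketch: a weak Pareto check against a baseline. All outcome features and
# welfare numbers below are invented for illustration.

def is_pareto_improvement(outcome, baseline, evaluators):
    """An outcome is a (weak) Pareto improvement if no evaluator ranks it below the baseline."""
    return all(ev(outcome) >= ev(baseline) for ev in evaluators)

def bob2(outcome):
    # Bob2 does not count a never-acting AI as a person, so the no-op baseline
    # is acceptable to him; but any world whose fate is set by an acting,
    # heretic-tolerating AI counts as a Dark Future.
    return -1.0 if outcome["ai_acts"] and not outcome["heretics_punished"] else 0.0

def dave(outcome):
    # Dave strongly objects to being punished.
    return -1.0 if outcome["heretics_punished"] else float(outcome["cakes"])

baseline = {"ai_acts": False, "heretics_punished": False, "cakes": 0}
proposals = [
    {"ai_acts": True, "heretics_punished": False, "cakes": 20},  # vetoed by Bob2
    {"ai_acts": True, "heretics_punished": True, "cakes": 20},   # vetoed by Dave
]

improvements = [p for p in proposals
                if is_pareto_improvement(p, baseline, [bob2, dave])]
print(improvements)  # [] -- no Pareto improvements relative to this baseline
```

The point being illustrated is only structural: one person whose evaluation of every candidate action falls below their evaluation of the baseline is enough to empty the set, regardless of how many stars or cakes are on offer.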

3Martin Randall
I'm much less convinced by Bob2's objections than by Bob1's objections, so the modified baseline is better. I'm not saying it's solved, but it no longer seems like the biggest problem.

I completely agree that it's important that what "you are dealing with is a set of many trillions of hard constraints, defined in billions of ontologies". On the other hand, the set of actions is potentially even larger, with septillions of reachable stars. My instinct is that this allows a large number of Pareto improvements, provided that the constraints are not pathological. The possibility of "utility inverters" (like Gregg and Jeff) is an example of pathological constraints.

Utility Inverters

I recently re-read What is malevolence? On the nature, measurement, and distribution of dark traits. Some findings:

Such constraints don't guarantee that there are no Pareto improvements, but they make it very likely, I agree. So what to do? In the article you propose Self Preference Adoption Decision Influence (SPADI), defined as "meaningful influence regarding the adoption of those preferences that refer to her". We've come to a similar place by another route. There's some benefit in coming from this angle: we've gained some focus on utility inversion as a problem. Some possible options:

1. Remove utility-inverting preferences in the coherently extrapolated delegates. We could call this Coherent Extrapolation of Equanimous Volition, for example. People can prefer that Dave stop cracking his knuckles, but can't prefer that Dave suffer.
2. Remove utility-inverting preferences when evaluating whether options are Pareto improvements. Actions cannot be rejected because they make Dave happier, but can be rejected because Dave cracking his knuckles makes others unhappier.

I predict you won't like this because of concerns like: what if Gregg just likes to see heretics burn, not because it makes the heretics suffer, but because it's aesthetically pleasing to Gregg? No problem, the AI can have

Given that you agreed with most of what I said in my reply, it seems like you should also agree that it is important to analyse these types of alignment targets. But in your original comment you said that you do not think that it is important to analyse these types of alignment targets.

Let's write Multi Person Sovereign AI Proposal (MPSAIP) for an alignment target proposal to build an AI Sovereign that gets its goal from the global population (in other words: the type of alignment target proposals that I was analysing in the post). I followed your links an... (read more)

2Seth Herd
Thanks! I don't have time to process this all right now, so I'm just noting that I do want to come back to it quickly and engage fully. Here's my position in brief:

I think analyzing alignment targets is valuable. Where my current take differs from yours (I think) is that I think that effort would be best spent analyzing what you term corrigibility in the linked post (I got partway through and will have to come back to it) and what I've called instruction-following. I think that's far more important to do first, because that's approximately what people are aiming for right now. I fully agree that there are other values mixed in with the training other than instruction-following. I think the complexity and impurity of that target makes it more urgent, not less, to have good minds analyzing the alignment targets that developers are most likely to pursue first by default. See my recent post Seven sources of goals in LLM agents. This is my main research focus, but I know of no one else focusing on this, and few people who even give it part-time attention. This seems like a bad allocation of resources; there might be major flaws in the alignment target that we don't find until developers are far along that path and reluctant to rework it.

You said

I wrote a little more about this in Intent alignment as a stepping-stone to value alignment. I definitely do not think it would be safe to assume that IF/corrigible AGI can solve value alignment for other/stronger AGI. John Wentworth's The Case Against AI Control Research has a pretty compelling argument for how we'd collaborate with sycophantic parahuman AI/AGI to screw up aligning the next step in AGI/ASI. I do not think any of this is safe. I think we are long past the point where we should be looking for perfectly reliable solutions. I strongly believe we must look for the best viable solutions, factoring in the practicality/likelihood of getting them actually implemented. I worry that the alignment community's desire fo

I'm sorry if the list below looks like nitpicking. But I really do think that these distinctions are important.

Bob holds 1 as a value. Not as a belief.

Bob does not hold 2 as a belief or as a value. Bob thinks that someone as powerful as the AI has an obligation to punish someone like Dave. But that is not the same as 2.

Bob does not hold 3 as a belief or as a value. Bob thinks that for someone as powerful as the AI, the specific moral outrage in question renders the AI unethical. But that is not the same as 3.

Bob does hold 4 as a value. But it is worth noti... (read more)

3Martin Randall
A lot to chew on in that comment.

A baseline of "no superintelligence"

I think I finally understand, sorry for the delay. The key thing I was not grasping is that Davidad proposed this baseline:

This makes Bob's argument very simple:

1. Creating a PPCEV AI causes a Dark Future. This is true even if the PPCEV AI no-ops, or creates a single cake. Bob can get here in many ways, as can Extrapolated-Bob.
2. The baseline is no superintelligence, so no PPCEV AI, so not a Dark Future (in the same way).

Option 2 is therefore better than option 1. Therefore there are no Pareto-improving proposals. Therefore the PPCEV AI no-ops. Even Bob is not happy about this, as it's a Dark Future. I think this is 100% correct.

An alternative baseline

Let's update Davidad's proposal by setting the baseline to be whatever happens if the PPCEV AI emits a no-op. This means:

1. Bob cannot object to a proposal because it implies the existence of PPCEV AI. The PPCEV AI already exists in the baseline.
2. Bob needs to consider that if the PPCEV AI emits a no-op then whoever created it will likely try something else, or perhaps some other group will try something.
3. Bob cannot object to a proposal because it implies that the PPCEV emits something. The PPCEV already emits something in the baseline.

My logic is that if creating a PPCEV AI is a moral error (and perhaps it is), then at the point where the PPCEV AI is considering proposals we have already made that moral error. Since we can't reverse the past error, we should consider proposals as they affect the future. This also avoids treating a no-op outcome as a special case. A no-op output is a proposal to be considered. It is always in the set of possible proposals, since it is never worse than the baseline, because it is the baseline.

Do you think this modified proposal would still result in a no-op output?

Bob really does not want the fate of the world to be determined by an unethical AI. There is no reason for such a position to be instrumental. For Bob, this would be worse than the scenario with no AI (in the Davidad proposal, this is the baseline that is used to determine whether or not something is a Pareto-improvement). Both scenarios contain non-punished heretics. But only one scenario contains an unethical AI. Bob prefers the scenario without an unethical AI (for non-instrumental reasons).


Regarding extrapolation:

The question is whether or not at least... (read more)

3Martin Randall
Summarizing Bob's beliefs:

1. Dave, who does not desire punishment, deserves punishment.
2. Everyone is morally required to punish anyone who deserves punishment, if possible.
3. Anyone who does not fulfill all moral requirements is unethical.
4. It is morally forbidden to create an unethical agent that determines the fate of the world.
5. There is no amount of goodness that can compensate for a single morally forbidden act.

I think it's possible (20%) that such blockers mean that there are no Pareto improvements. That's enough by itself to motivate further research on alignment targets, aside from other reasons one might not like Pareto PCEV. However, three things make me think this is unlikely. Note that my (%) credences aren't very stable or precise.

Firstly, I think there is a chance (20%) that these beliefs don't survive extrapolation, for example due to moral realism or coherence arguments. I agree that this means that Bob might find his extrapolated beliefs horrific. This is a risk with all CEV proposals.

Secondly, I expect (50%) there are possible Pareto improvements that don't go against these beliefs. For example, the PCEV could vote to create an AI that is unable to punish Dave and thus not morally required to punish Dave. Alternatively, instead of creating a Sovereign AI that determines the fate of the world, the PCEV could vote to create many human-level AIs that each improve the world without determining its fate.

Thirdly, I expect (80%) some galaxy-brained solution to be implemented by the parliament of extrapolated minds who know everything and have reflected on it for eternity.

Consider Bob, who takes morality very seriously. Bob thinks that any scenario where the fate of the world is determined by an unethical AI, is worse than the scenario with no AI. Bob sticks with this moral position, regardless of how much stuff Bob would get in a scenario with an unethical AI. For a mind as powerful as an AI, Bob considers it to be a moral imperative to ensure that heretics do not escape punishment. If a group contains at least one person like Bob (and at least one person that would strongly object to being punished), then the set of Paret... (read more)

3Martin Randall
The AI could deconstruct itself after creating twenty cakes, so then there is no unethical AI, but presumably Bob's preferences refer to world-histories, not final-states. However, CEV is based on Bob's extrapolated volition, and it seems like Bob would not maintain these preferences under extrapolation:

* In the status quo, heretics are already unpunished - they each have one cake and no torture - so objecting to a non-torturing AI doesn't make sense on that basis.
* If there were no heretics, then Bob would not object to a non-torturing AI, so Bob's preference against a non-torturing AI is an instrumental preference, not a fundamental preference.
* Bob would be willing for a no-op AI to exist, in exchange for some amount of heretic-torture. So Bob can't have an infinite preference against all non-torturing AIs.
* Heresy may not have meaning in the extrapolated setting where everyone knows the true cosmology (whatever that is).
* Bob tolerates the existence of other trade that improves the lives of both fanatics and heretics, so it's unclear why the trade of creating an AI would be intolerable.

The extrapolation of preferences could significantly reduce the moral variation in a population of billions. Where my moral choices differ from other people's, the differences appear to be based largely on my experiences, including knowledge, analysis, and reflection. Those differences are extrapolated away. What is left is influences from my genetic priors and from the order I obtained knowledge.

I'm not even proposing that extrapolation must cause Bob to stop valuing heretic-torture. If the extrapolation of preferences doesn't cause Bob to stop valuing the existence of a non-torturing AI at negative infinity, I think that is fatal to all forms of CEV. The important thing then is to fail gracefully without creating a torture-AI.

I do think that it's important to analyse alignment targets like these. Given the severe problems that all of these alignment targets suffer from, I certainly hope that you are right about them being unlikely. I certainly hope that nothing along the lines of a Group AI will ever be successfully implemented. But I do not think that it is safe to assume this. The successful implementation of an instruction-following AI would not remove the possibility that an AI Sovereign will be implemented later. The CEV Arbital page actually assumes that the pat... (read more)

5Seth Herd
I agree with essentially all of this. See my posts If we solve alignment, do we die anyway? on AGI nonproliferation and government involvement and Intent alignment as a stepping-stone to value alignment on eventually building sovereign ASI using intent-aligned (IF or Harms-corrigible) AGI to help with alignment. Wentworth recently pointed out that idiot sycophantic AGI combined with idiotic/time-pressured humans might easily screw up that collaboration, and I'm afraid I agree. I hope we do it slowly and carefully, but not slowly enough to fall into the attractor of a vicious human getting the reins and keeping them forever. The only thing I don't agree with (AFAICT on a brief look - I'm rushed myself right now so LMK what else I'm missing if you like) is that we might have a pause. I see that as so unlikely as to not be worth time thinking about. I have yet to see any coherent argument for how we get one in time. If you know of such an argument, I'd love to see it!

I thought that your Cosmic Block proposal would only block information regarding things going on inside a given Utopia. I did not think that the Cosmic Block would subject every person to forced memory deletion. As far as I can tell, this would mean removing a large portion of all memories (details below). I think that memory deletion on the implied scale would seriously complicate attempts to define an extrapolation dynamic. It also does not seem to me that it would actually patch the security hole illustrated by the thought experiment in my original comm... (read more)

Implementing The ELYSIUM Proposal would lead to the creation of a very large and very diverse set of clever AIs that want to hurt people: the Advocates of a great variety of humans who want to hurt others in a wide variety of ways, for a wide variety of reasons. Protecting billions of people from this set of clever AIs would be difficult. As far as I can tell, nothing that you have mentioned so far would provide any meaningful amount of protection from a set of clever AIs like this (details below). I think that it would be better to just not create s... (read more)

2Roko
Influence over preferences of a single entity is much more conflict-y.
2Roko
The point of ELYSIUM is that people get control over non-overlapping places. There are some difficulties where people have preferences over the whole universe. But the real world shows us that those are a smaller thing than the direct, local preference to have your own volcano lair all to yourself.
2Roko
BPA shouldn't be allowed to want anything for Steve. There shouldn't be a term in its world-model for Steve. This is the goal of cosmic blocking. The BPA can't even know that Steve exists.

I think the difficult part is when BPA looks at Bob's preferences (excluding, of course, references to most specific people) and sees preferences for inflicting harm on people-in-general that can be bent just enough to fit into the "not-torture" bucket, and so it synthetically generates some new people and starts inflicting some kind of marginal harm on them. And I think that this will in fact be a binding constraint on utopia, because most humans will (given the resources) want to make a personal utopia full of other humans that forms a status hierarchy with them at the top. And 'being forced to participate in a status hierarchy that you are not at the top of' is a type of 'generalized consensual harm'.

Even the good old Reedspacer's Lower Bound fits this model. Reedspacer wants a volcano lair full of catgirls, but the catgirls are consensually participating in a universe that is not optimal for them because they are stuck in the harem of a loser nerd with no other males and no other purpose in life other than being a concubine to Reedspacer. Arguably, that is a form of consensual harm to the catgirls.

So I don't think there is a neat boundary here. The neatest boundary is informed consent, perhaps backed up by some lower-level tests about what proportion of an entity's existence is actually miserable. If Reedspacer beats his catgirls, makes them feel sad all the time, that matters. But maybe if one of them feels a little bit sad for a short moment that is acceptable.
2Roko
But how would Bob know that he wanted to create OldSteve, if Steve has been deleted from his memory via a cosmic block?

I suppose perhaps Bob could create OldEve. Eve is in a similar but not identical point in personality space to Steve, and the desire to harm people who are like Eve is really the same desire as the desire to harm people like Steve. So Bob's Extrapolated Volition could create OldEve, who somehow consents to being mistreated in a way that doesn't trigger your torture detection test.

This kind of 'marginal case of consensual torture' has popped up in other similar discussions. E.g. in Yvain's (Scott Alexander's) article on Archipelago there's this section:

So Scott Alexander's solution to OldSteve is that OldSteve must get a non-brainwashed education about how ELYSIUM/Archipelago works and be given the option to opt out. I think the issue here is that the boundary between "people who unwisely consent to torture even after being told about it" and "people who are willing and consenting submissives" is not actually a hard one.
2Roko
If there is an agent that controls 55% of the resources in the universe and is prepared to use 90% of that 55% to kill/destroy everyone else, then assuming that ELYSIUM forbids them to do that, their rational move is to use their resources to prevent ELYSIUM from being built. And since they control 55% of the resources in the universe and are prepared to use 90% of that 55% to kill/destroy everyone who was trying to actually create ELYSIUM, they would likely succeed and ELYSIUM wouldn't happen. Re: threats, see my other comment.
2Roko
This assumes that threats are allowed. If you allow threats within your system you are losing out on most of the value of trying to create an artificial utopia because you will recreate most of the bad dynamics of real history which ultimately revolve around threats of force in order to acquire resources. So, the ability to prevent entities from issuing threats that they then do not follow through on is crucial. Improving the equilibria of a game is often about removing strategic options; in this case the goal is to remove the option of running what is essentially organized crime. In the real world there are various mechanisms that prevent organized crime and protection rackets. If you threaten to use force on someone in exchange for resources, the mere threat of force is itself illegal at least within most countries and is punished by a loss of resources far greater than the threat could win. People can still engage in various forms of protest that are mutually destructive of resources (AKA civil disobedience). The ability to have civil disobedience without protection rackets does seem kind of crucial.

My thought experiment assumed that all rules and constraints described in the text that you linked to had been successfully implemented. Perfect enforcement was assumed. This means that there is no need to get into issues such as relative optimization power (or any other enforcement-related issue). The thought experiment showed that the rules described in the linked text do not actually protect Steve from a clever AI that is trying to hurt Steve (even if these rules are successfully implemented / perfectly enforced).

If we were reasoning from the assumpti... (read more)

2Roko
There is a baseline set of rules that exists for exactly this purpose, which I didn't want to go into detail on in that piece because it's extremely distracting from the main point. These rules are not necessarily made purely by humans, but could for example be the result of some kind of AI-assisted negotiation that happens at ELYSIUM Setup.

But I think you're correct that the system that implements anti-weaponization and the systems that implement extrapolated volitions are potentially pushing against each other. This is of course a tension that is present in human society as well, which is why we have police. So basically the question is "how do you balance the power of generalized-police against the power of generalized-self-interest."

Now the whole point of having "Separate Individualized Utopias" is to reduce the need for police. In the real world, it does seem to be the case that extremely geographically isolated people don't need much in the way of police involvement. Most human conflicts are conflicts of proximity, crimes of opportunity, etc. It is rare that someone basically starts an intercontinental stalking vendetta against another person. And if you had the entire resources of police departments just dedicated to preventing that kind of crime, and they also had mind-reading tech for everyone, I don't think it would be a problem.

I think the more likely problem is that people will want to start haggling over what kind of universal rights they have over other people's utopias. Again, we see this in real life. E.g. "diverse" characters forced into every video game because a few people with a lot of leverage want to affect the entire universe.

So right now I don't have a fully satisfactory answer to how to fix this. It's clear to me that most human conflict can be transformed into a much easier negotiation over basically who gets how much money/general-purpose-resources. But the remaining parts could get messy.

Let's optimistically assume that all rules and constraints described in The ELYSIUM Proposal are successfully implemented. Let's also optimistically assume that every human will be represented by an Advocate that perfectly represents her interests. This will allow us to focus on a problem that remains despite these assumptions.

Let's take the perspective of ordinary human individual Steve. Many clever and powerful AIs would now adopt preferences that refer to Steve (the Advocates of humans that have preferences that refer to Steve). Steve has no influence r... (read more)

4Roko
This seems to only be a problem if the individual advocates have vastly more optimization power than the AIs that check for non-aggression. I don't think there's any reason for that to be the case. In contemporary society we generally have the opposite problem (the state uses lawfare against individuals).
3Nathan Helm-Burger
I hadn't thought about it like that, but now that you've explained it that totally makes sense!

I changed the title from: "A Pivotal Act AI might not buy a lot of time" to: "Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure".

As explained by Martin Randall, the statement: "something which does not buy ample time is not a pivotal act" is false (based on the Arbital Guarded Definition of Pivotal Act). Given your "Agreed react" to that comment, this issue seems to be settled. In the first section of the present comment, I explain why I still think that the old title was a mistake. The second section... (read more)

I will change the title.

However: you also seem to be using the term Pivotal Act as a synonym for removing all time pressure from competing AI projects (which the AI in my post does). Example 3 of the Arbital page that you link to also explicitly refers to an act that removes all time pressure from competing AI projects as a Pivotal Act. This usage is also present in various comments by you, Yudkowsky, and others (see links and quotes below). And there does not seem to exist any other established term for an AI that (i) completely removes all time pressur... (read more)

2Martin Randall
Please do not change the title. You have used the phrase correctly from both a prescriptive and a descriptive approach to language. A title such as "Shutting Down all Competing AI Projects is not Actually a Pivotal Act" would be an incorrect usage and increase confusion.
4faul_sname
This seems like an excellent title to me.

Your comment makes me think that I might have been unclear regarding what I mean by ATA. The text below is an attempt to clarify.


Summary

Not all paths to powerful autonomous AI go through methods from the current paradigm. It seems difficult to rule out the possibility that a Sovereign AI will eventually be successfully aligned to some specific alignment target. At current levels of progress on ATA this would be very dangerous (because understanding an alignment target properly is difficult, and a seemingly-nice proposal can imply a very bad outcome). It ... (read more)

I interpret your comment as a prediction regarding where new alignment target proposals will come from. Is this correct?


I also have a couple of questions about the linked text:

How do you define the difference between explaining something and trying to change someone's mind? Consider the case where Bob is asking a factual question. An objectively correct straightforward answer would radically change Bob's entire system of morality, in ways that the AI can predict. A slightly obfuscated answer would result in far less dramatic changes. But those changes woul... (read more)

Let's reason from the assumption that you are completely right. Specifically, let's assume that every possible Sovereign AI Project (SAIP) would make things worse in expectation. And let's assume that there exists a feasible Better Long Term Solution (BLTS).

In this scenario ATA would still only be a useful tool for reducing the probability of one subset of SAIPs (even if all SAIPs are bad, some designers might be unresponsive to arguments, some flaws might not be realistically findable, etc). But it seems to me that ATA would be one complementary tool for r... (read more)

2Charbel-Raphaël
Fair enough. I think my main problem with this proposal is that under the current paradigm of AIs (GPTs, foundation models),  I don't see how you want to implement ATA, and this isn't really a priority? 

I think I see your point. Attempting to design a good alignment target could lead to developing intuitions that would be useful for ATA. A project trying to design an alignment target might result in people learning skills that allow them to notice flaws in alignment targets proposed by others. Such projects can therefore contribute to the type of risk mitigation that I think is lacking. I think that this is true. But I do not think that such projects can be a substitute for an ATA project with a risk mitigation focus.


Regarding Orthogonal:

It is difficult ... (read more)

The proposed research project would indeed be focused on a certain type of alignment target. For example proposals along the lines of PCEV. But not proposals along the lines of a tool-AI. Referring to this as Value-Alignment Target Analysis (VATA) would also be a possible notation. I will adopt this notation for the rest of this comment.


The proposed VATA research project would be aiming for risk mitigation. It would not be aiming for an answer:

There is a big difference between proposing an alignment target on the one hand, and pointing out problems with al... (read more)

Regarding the political feasibility of PCEV:

PCEV gives a lot of extra power to some people, specifically because those people intrinsically value hurting other humans. This presumably makes PCEV politically impossible in a wide range of political contexts (including negotiations between a few governments). More generally: now that it has been pointed out that PCEV has this feature, the risks from scenarios where PCEV gets successfully implemented have presumably been mostly removed. Because PCEV is probably off the table as a potential alignment target, pre... (read more)

Regarding Corrigibility as an alternative safety measure:

I think that exploring the Corrigibility concept sounds like a valuable thing to do. I also think that Corrigibility formalisms can be quite tricky (for similar reasons that Membrane formalisms can be tricky: I think that they are both vulnerable to difficult-to-notice definitional issues). Consider a powerful and clever tool-AI. It is built using a Corrigibility formalism that works very well when the tool-AI is used to shut down competing AI projects. This formalism relies on a definition of Explan... (read more)

4Nathan Helm-Burger
Yes, in my discussions with Max Harms about CAST we discussed the concern of a highly capable corrigible tool-AI accidentally or intentionally manipulating its operators or other humans with very compelling answers to questions. My impression is that Max is more confident about his version of corrigibility managing to avoid manipulation scenarios than I am. I think this is definitely one of the more fragile and slippery aspects of corrigibility.  In my opinion, manipulation-prevention in the context of corrigibility deserves more examination to see if better protections can be found, and a very cautious treatment during any deployment of a powerful corrigible tool-AI.

I agree that focus should be on preventing the existence of a Sovereign AI that seeks to harm people (as opposed to trying to deal with such an AI after it has already been built). The main reason for trying to find necessary features is actually that it might stop a dangerous AI project from being pursued in the first place. In particular: it might convince the design team to abandon an AI project that clearly lacks a feature that has been found to be necessary. An AI project that would (if successfully implemented) result in an AI Sovereign that would ... (read more)

Thanks for the feedback! I see what you mean and I edited the post. (I turned a single paragraph abstract into a three paragraph Summary section. The text itself has not been changed)

Thank you for engaging. If this was unclear for you, then I'm sure it was also unclear for others.

The post outlined a scenario where a Corrigibility method works perfectly for one type of AI (an AI that does not imply an identifiable outcome, for example a PAAI). The same Corrigibility method fails completely for another type of AI (an AI that does imply an identifiable outcome, for example PCEV). So the second AI, which does have an IO, is indeed not corrigible.

This Corrigibility method leads to an outcome that is massively worse than extinction. This bad ... (read more)

2Nathan Helm-Burger
From my point of view, you are making an important point that I agree with: corrigibility isn't uniformly safe for all use cases; you must use it only carefully and in the use cases it is safe for. I've discussed this point with Max a bunch. The key aspect of corrigibility is keeping the operator empowered, and thus it is necessarily unsafe in the hands of foolish or malicious operators.

Examples of good use:

* further AI alignment research
* monitoring the web for rogue AGI
* operating and optimizing a factory production line
* medical research
* helping with mundane aspects of government action, like smoothing out a part of a specific bureaucratic process that needed well-described bounded decision-making (e.g. being a DMV assistant, or a tax-evasion investigator who takes no action other than filing reports on suspected misbehavior)

Examples of bad use:

* asking the AI to convince you of something, or even just explain a concept persistently until it's sure you understand
* trying to do a highly-world-affecting dramatic and irreversible act, such as a pivotal act
* trying to implement a value-aligned or PCEV or whatever agent. In fact, trying to create any agent which isn't just an exact copy of the known-safe current corrigible agent.
* trying to research and create particularly dangerous technology, such as self-replicating tech that might get out of hand (e.g. synthetic biology, bioweapons). This is a case where the AI succeeding safely at the task is itself a dangerous result! Now you've got a potential Bostrom-esque 'black ball' technology in hand, even though the AI didn't malfunction in any way.
1Max Harms
Thanks! I now feel unconfused. To briefly echo back the key idea which I heard (and also agree with): a technique which can create a corrigible PAAI might have assumptions which break if that technique is used to make a different kind of AI (i.e. one aimed at CEV). If we call this technique "the Corrigibility method" then we may end up using the Corrigibility method to make AIs that aren't at all corrigible, but merely seem corrigible, resulting in disaster. This is a useful insight! Thanks for clarifying. :)

The first AI is genuinely Corrigible. The second AI is not Corrigible at all. This leads to a worse outcome, compared to the case where there was no Corrigible AI. Do you disagree with the statement that the first AI is genuinely Corrigible? Or do you disagree with the statement that the outcome is worse, compared to the case where there was no Corrigible AI?

Thank you for the clarification. This proposal is indeed importantly different from the PCEV proposal. But since some people see hurting heretics as a moral imperative, any AI that allows heretics to escape punishment will also be seen as unacceptable by at least some people. This means that the set of Pareto improvements is empty.

In other words: hurting heretics is indeed off the table in your proposal (which is an important difference compared to PCEV). However, any scenario that includes the existence of an AI that allows heretics to escape punishment is also off the... (read more)

There is a serious issue with your proposed solution to problem 13. Using a random dictator policy as a negotiation baseline is not suitable for the situation where billions of humans are negotiating about the actions of a clever and powerful AI. One problem with using this solution, in this context, is that some people have strong commitments to moral imperatives along the lines of "heretics deserve eternal torture in hell". The combination of these types of sentiments, and a powerful and clever AI (that would be very good at thinking up effective wa... (read more)

7davidad
The "random dictator" baseline should not be interpreted as allowing the random dictator to dictate everything, but rather to dictate which Pareto improvement is chosen (with the baseline for "Pareto improvement" being "no superintelligence"). Hurting heretics is not a Pareto improvement because it makes those heretics worse off than if there were no superintelligence.
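For readers who want the mechanics spelled out, here is a minimal sketch of the selection rule described above (Python, with invented welfare numbers; a toy illustration of the idea, not an implementation from the proposal):

```python
import random

# Toy sketch of "random dictator over Pareto improvements": the randomly chosen
# person only picks among options that leave nobody worse off than the
# "no superintelligence" baseline. Welfare numbers are invented for illustration.
BASELINE = {"fanatic": 0.0, "heretic": 0.0}

OPTIONS = {
    "no superintelligence": {"fanatic": 0.0, "heretic": 0.0},
    "cakes for everyone":   {"fanatic": 1.0, "heretic": 1.0},
    "punish heretics":      {"fanatic": 5.0, "heretic": -10.0},  # not a Pareto improvement
}

def pareto_improvements(options, baseline):
    """Options that leave every person at least as well off as the baseline."""
    return [name for name, welfare in options.items()
            if all(welfare[p] >= baseline[p] for p in baseline)]

def random_dictator_choice(options, baseline, people):
    dictator = random.choice(people)
    admissible = pareto_improvements(options, baseline)
    # The dictator dictates which Pareto improvement is chosen, nothing more.
    return max(admissible, key=lambda name: options[name][dictator])

print(pareto_improvements(OPTIONS, BASELINE))
# ['no superintelligence', 'cakes for everyone'] -- "punish heretics" is excluded
print(random_dictator_choice(OPTIONS, BASELINE, ["fanatic", "heretic"]))
```

Under this rule the dictator's randomness only decides which Pareto improvement gets picked; options that leave anyone below the "no superintelligence" baseline, such as punishing heretics, never enter the admissible set in the first place.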

I think it is very straightforward to hurt the human individual Steve without piercing Steve's Membrane. Just create and hurt minds that Steve cares about. But don't tell him about it (in other words: ensure that there is zero effect on predictions of things inside the membrane). If Bob knew Steve before the Membrane-enforcing AI was built, and Bob wants to hurt Steve, then Bob presumably knows Steve well enough to know what minds to create (in other words: there is no need to have any form of access to any form of information that is within Steve's Membrane... (read more)

For a set of typical humans who are trying to agree on what an AI should do, there does not exist any fallback option that is acceptable to almost everyone. For each fallback option, there exists a large number of people who will find this option completely unacceptable on moral grounds. In other words: when trying to agree on what an AI should do, there exists no place that people can walk away to that will be seen as safe / acceptable by a large majority of people.

Consider the common aspect of human morality that is sometimes expressed in theolog... (read more)

This comment is trying to clarify what the post is about, and by extension clarify which claims are made. Clarifying terminology is an important part of this. Both the post and my research agenda are focused on the dangers of successfully hitting a bad alignment target. This is one specific subset of the existential threats that humanity faces from powerful AI. Let's distinguish the danger being focused on from other types of dangers by looking at a thought experiment with an alignment target that is very obviously bad. A well-intentioned designer named ... (read more)

What about the term uncaring AI? In other words, an AI that would keep humans alive if offered resources to do so. This can be contrasted with a Suffering Reducing AI (SRAI), which would not keep humans alive in exchange for resources. SRAI is an example of successfully hitting a bad alignment target, which is an importantly different class of dangers compared to the dangers of an aiming failure leading to an uncaring AI. While an uncaring AI would happily agree to leave Earth alone in exchange for resources, this is not the case for SRAI, because killin... (read more)

If your favoured alignment target suffers from a critical flaw that is inherent in the core concept, then surely it must be useful for you to discover this. So I assume that you agree that, conditioned on me being right about CEV suffering from such a flaw, you want me to tell you about this flaw. In other words, I think that I have demonstrated that CEV suffers from a flaw that is not related to any detail of any specific version, or any specific description, or any specific proxy, or any specific attempt to describe what CEV is, or anything else ... (read more)

The version of CEV that is described on the page that your CEV link leads to is PCEV. The acronym PCEV was introduced by me. So this acronym does not appear on that page. But that's PCEV that you link to. (In other words: the proposed design that would lead to the LP outcome cannot be dismissed as some obscure version of CEV. It is the version that your own CEV link leads to. I am aware of the fact that you are viewing PCEV as "a proxy for something else" / "a provisional attempt to describe what CEV is". But this fact still seemed noteworthy.)

On... (read more)

4Vladimir_Nesov
You are directing a lot of effort at debating details of particular proxies for an optimization target, pointing out flaws. My point is that strong optimization for any proxy that can be debated in this way is not a good idea, so improving such proxies doesn't actually help. A sensible process for optimizing something has to involve continually improving formulations of the target as part of the process. It shouldn't be just given any target that's already formulated, since if it's something that would seem to be useful to do, then the process is already fundamentally wrong in what it's doing, and giving a better target won't fix it. The way I see it, CEV-as-formulated is gesturing at the kind of thing an optimization target might look like. It's in principle some sort of proxy for it, but it's not an actionable proxy for anything that can't come up with a better proxy on its own. So improving CEV-as-formulated might make the illustration better, but for anything remotely resembling its current form it's not a useful step for actually building optimizers. Variants of CEV all having catastrophic flaws is some sort of argument that there is no optimization target that's worth optimizing for. Boundaries seem like a promising direction for addressing the group vs. individual issues. Never optimizing for any proxy more strongly than its formulation is correct (and always pursuing improvement over current proxies) responds to there often being hidden flaws in alignment targets that lead to catastrophic outcomes.

I was clearly wrong regarding how you feel about your cells. But surely the question of whether or not an AI that is implementing the CEV of Steve would result in any surviving cells is an empirical question? (Which must be settled by referring to facts about Steve, and by trying to figure out what these facts mean in terms of how the CEV of Steve would treat his cells.) It cannot possibly be the case that it is impossible, by definition, to discover that any reasonable way of extrapolating Steve would result in all his cells dying?

Thank you for engaging on th... (read more)

I think that extrapolation is a genuinely unintuitive concept. I would for example not be very surprised if it turns out that you are right, and that it is impossible to reasonably extrapolate you if the AI that is doing the extrapolation is cut off from all information about other humans. I don't think that this fact is in tension with my statement that individuals and groups are completely different types of things. Taking your cell analogy: I think that implementing the CEV of you could lead to the death of every single cell in your body (for example i... (read more)

3the gears to ascension
I will take this bet at any amount. My cells are a beautiful work of art crafted by evolution, and I am a guest in their awesome society. Any future where my cells' information is lost rather than transmuted and the original stored is unacceptable to me. Switching to another computational substrate without deep translation of the information in my cells is effectively guaranteed to need to examine the information in a significant fraction of my cells at a deep level, such that a generative model can be constructed which has significantly higher accuracy at cell information reconstruction than any generative model of today would.

I suspect I am only unusual in having thought through this enough to identify this value, and that it is common in somewhat-less-transhumanist circles, usually manifesting as a resistance to augmentation rather than a desire to augment in a way that maintains a biology-like substrate.

Now, to be clear, I do want to rewrite my cells at a deep level - a sort of highly advanced dynamics-faithful "style transfer" into some much more advanced substrate, in particular one that operates smoothly between temperatures 2 kelvin and ~310 kelvin or ideally much higher (though if it turns out that a long adaptation period is needed to switch between ultra low temp and ultra high temp, that's fine, I expect that the chemicals that operate smoothly at the respective temperatures will look rather different). I also expect to not want to be stuck with using carbon; I don't currently understand chemistry enough to confidently tell you any of the things I'm asking for in this paragraph are definitely possible, but my hunch is that there are other atoms which form stronger bonds and have smaller fields that could be used instead, i.e. classic precise nanotech sorts of stuff. It probably takes a lot of energy to construct them, if they're possible.

But again, no uplift without being able to map the behaviors of my cells in high fidelity.

I think that "CEV" is usually used as shorthand for "an AI that implements the CEV of Humanity". This is what I am referring to when I say "CEV". So, what I mean when I say that "CEV is a bad alignment target" is that, for any reasonable set of definitions, it is a bad idea to build an AI that does what "a Group" wants it to do (in expectation, from the perspective of essentially any human individual, compared to extinction). Since groups and individuals are completely different types of things, it should not be surprising to learn that doi... (read more)

2the gears to ascension
I don't think this is obviously justifiable. It seems to me that cells work together to be a person, together tracking and implementing the agency of the aggregate system according to their interest as part of that combined entity, and in the same way, people work together to be a group, together tracking and implementing the agency of the group. I'm pretty sure that if you try to calculate my CEV with me in a box, you end up with an error like "import error: the rest of the reachable social graph of friendships and caring". I cannot know what I want without deliberating with others who I intend to be in a society with long term, because I will know that whatever answer I give for my CEV, it will be highly probably misaligned with the rest of the people I care about. And I expect that the network of mutual utility across humanity is fairly well connected such that if I import friends, it ends up being a recursive import that requires evaluation of everyone on earth. (By the way, any chance you could use fewer commas? The reading speed I can reach on your comments is reduced by them due to having to bump up to deliberate thinking to check whether I've joined sentence fragments the way you meant. No worries if not, though.)

I agree that "the ends justify the means" type thinking has led to a lot of suffering. For this, I would like to switch from the Chinese Cultural Revolution to the French Revolution as an example (I know it better, and I think it fits better for discussions of this attitude). So, someone wants to achieve something that is today seen as a very reasonable goal, such as "end serfdom and establish formal equality before the law". So, basically: their goals are positive, and they achieve these goals. But perhaps they could have achieved those goals, wi... (read more)

I think that my other comment to this will hopefully be sufficient to outline what my position actually is. But perhaps a more constructive way forwards would be to ask how certain you are that CEV is in fact the right thing to aim at? That is, how certain are you that this situation is not symmetrical to the case where Bob thinks that "a Suffering Reducing AI (SRAI) is the objectively correct thing to aim at"? Bob will diagnose any problem with any specific SRAI proposal as arising from proxy issues, related to the fact that Bob is not able t... (read more)

2Vladimir_Nesov
Metaphorically, there is a question CEV tries to answer, and by "something like CEV" I meant any provisional answer to the appropriate question (so that CEV-as-currently-stated is an example of such an answer). Formulating an actionable answer is not a project humans would be ready to work on directly any time soon. So CEV is something to aim at by intention that defines CEV. If it's not something to aim at, then it's not a properly constructed CEV. This lack of a concrete formulation is the reason goodharting and corrigibility seem salient in operationalizing the process of formulating it and making use of the formulation-so-far. Any provisional formulation of an alignment target (such as CEV-as-currently-stated) would be a proxy, and so any optimization according to such proxy should be wary of goodharting and be corrigible to further refinement. The point of discussion of boundaries was in response to possible intuition that expected utility maximization tends to make its demands with great uniformity, with everything optimized in the same direction. Instead, a single goal may ask for different things to happen in different places, or to different people. It's a more reasonable illustration of goal aggregation than utilitarianism that sums over measures of value from different people or things.

I don't think that they are all status games. If so, then why did people (for example) include long meditations in private diaries regarding whether or not they personally deserve to go to hell? While they were focusing on the "who is a heretic?" question, it seems that they were taking for granted the normative position: "if someone is a heretic, then she deserves eternal torture in hell". But, on the other hand, private diaries are of course sometimes opened while the people that wrote them are still alive (this is not the most obvious thing, t... (read more)

It is getting late here, so I will stop after this comment, and look at this again tomorrow (I'm in Germany). Please treat the comment below as not fully thought through.

The problem from my perspective is that I don't think that the objective that you are trying to approximate is a good objective (in other words, I am not referring to problems related to optimising a proxy. They also exist, but they are not the focus of my current comments). I don't think that it is a good idea to do what an abstract entity called "humanity" wants (and I think tha... (read more)

I'm not sure that I agree with this. I think it mostly depends on what you mean by "something like CEV". All versions of CEV are describable as "doing what a Group wants". It is inherent in the core concept of building an AI that is "Implementing the Coherent Extrapolated Volition of Humanity". This rules out proposals where each individual is given meaningful influence regarding the adoption of those preferences that refer to her. For example as in MPCEV (described in the post that I linked to above). I don't see how an AI can be safe, for in... (read more)

2Vladimir_Nesov
The issue with proxies for an objective is that they are similar to it. So an attempt to approximately describe the objective (such as an attempt to say what CEV is) can easily arrive at a proxy that has glaring goodharting issues. Corrigibility is one way of articulating a process that fixes this, optimization shouldn't outpace accuracy of the proxy, which could be improving over time. Volition of humanity doesn't obviously put the values of the group before values of each individual, as we might put boundaries between individuals and between smaller groups of individuals, with each individual or smaller group having greater influence and applying their values more strongly within their own boundaries. There is then no strong optimization from values of the group, compared to optimization from values of individuals. This is a simplistic sketch of how this could work in a much more elaborate form (where the boundaries of influence are more metaphorical), but it grounds this issue in more familiar ideas like private property, homes, or countries.

In the case of damage from political movements, I think that many truly horrific things have been done by people who are well approximated as "genuinely trying to do good, and largely achieving their objectives, without major unwanted side effects" (for example events along the lines of the Chinese Cultural Revolution, which you discuss in the older post that you link to in your first footnote).

I think our central disagreement might be a difference in how we see human morality. In other words, I think that we might have different views regarding ... (read more)

2Random Developer
If I had to summarize your argument, it would be something like, "Many people's highest moral good involves making their ideological enemies suffer." This is indeed a thing that happens, historically. But another huge amount of damage is caused by people who believe things like "the ends justify the means" or "you can't make an omelette without breaking a few eggs." Or "We only need 1 million surviving Afghanis [out of 15 million] to build a paradise for the proletariat," to paraphrase an alleged historical statement I read once. The people who say things like this cause immediate, concrete harm. They attempt to justify this harm as being outweighed by the expected future value of their actions. But that expected future value is often theoretical, and based on dubious models of the world. I do suspect that a significant portion of the suffering in the world is created by people who think like this. Combine them with the people you describe whose conception of "the good" actually involves many people suffering (and with people who don't really care about acting morally at all), and I think you account for much of the human-caused suffering in the world. One good piece of advice I heard from someone in the rationalist community was something like, "When you describe your proposed course of action, do you sound like a monologuing villain from a children's TV show, someone who can only be defeated by the powers of friendship and heroic teamwork? If so, you would be wise to step back and reconsider the process by which you arrived at your plans."
7Wei Dai
I wrote a post expressing similar sentiments but perhaps with a different slant. To me, apparent human morality along the lines of "heretics deserve eternal torture in hell" or what was expressed during the Chinese Cultural Revolution are themselves largely a product of status games, and there's a big chance that these apparent values do not represent people's true values and instead represent some kind of error (but I'm not sure and would not want to rely on this being true). See also Six Plausible Meta-Ethical Alternatives for some relevant background. But you're right that the focus of my post here is on people who endorse altruistic values that seem more reasonable to me, like EAs, and maybe earlier (pre-1949) Chinese supporters of communism who were mostly just trying to build a modern nation with a good economy and good governance, but didn't take seriously enough the risk that their plan would backfire catastrophically.
2Vladimir_Nesov
This seems mostly goodharting, how the tails come apart when optimizing or selecting for a proxy rather than for what you actually want. And people don't all want the same thing without disagreement or value drift. The near-term practical solution is not optimizing too hard and building an archipelago with membranes between people and between communities that bound the scope of stronger optimization. Being corrigible about everything might also be crucial. The longer-term idealized solution is something like CEV, saying in a more principled and precise way what the practical solutions only gesture at, and executing on that vision at scale. This needs to be articulated with caution, as it's easy to stray into something that is obviously a proxy and very hazardous to strongly optimize.

I think that these two proposed constraints will indeed remove some bad outcomes. But I don't think that they will help in the thought experiment outlined in the post. These fanatics want all heretics in existence to be punished. This is a normative convention. It is a central aspect of their morality. An AI that deviates from this ethical imperative is seen as an unethical AI. Deleting all heretics from the memory of the fanatics will not change this aspect of their morality. It's genuinely not personal. They think that it would be highly unethical, f... (read more)

I do think that the outcome would be LP (more below), but I can illustrate the underlying problem using a set of alternative thought experiments that do not require agreement on LP vs MP.

Let's first consider the case where half of the heretics are seen as Mild Heretics (MH) and the other half as Severe Heretics (SH). MH are those who are open to converting as part of a negotiated settlement (and SH are those who are not open to conversion). The Fanatics (F) would still prefer MP, where both MH and SH are hurt as much as possible. But F is willing t... (read more)