Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try to give a concise summary of how each distinct problem would be solved by Safeguarded AI (formerly known as an Open Agency Architecture, or OAA), if it turns out to be feasible.
1. Value is fragile and hard to specify.
See: Specification gaming examples, Defining and Characterizing Reward Hacking[1]
OAA Solution:
1.1. First, instead of trying to specify "value", "de-pessimize" and specify the absence of a catastrophe, and maybe a handful of bounded constructive tasks like supplying clean water. A de-pessimizing OAA would effectively buy humanity some time, and the freedom to experiment with less risk, for tackling the CEV-style alignment problem—which is harder than merely mitigating extinction risk. This doesn't mean limiting the power of underlying AI systems so that they can only do bounded tasks, but rather containing that power and limiting its use.
Note: The absence of a catastrophe is also still hard to specify and will take a lot of effort, but the hardness is concentrated on bridging between high-level human concepts and the causal mechanisms in the world by which an AI system can intervene. For that...
1.2. Leverage human-level AI systems to automate much of the cognitive labor of formalizing scientific models—from quantum chemistry to atmospheric dynamics—and formalizing the bridging relations between levels of abstraction, so that we can write specifications in a high-level language with a fully explainable grounding in low-level physical phenomena. Physical phenomena themselves are likely to be robust, even if the world changes dramatically due to increasingly powerful AI interventions, and scientific explanations thereof happen to be both robust and compact enough for people to understand.
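As a cartoon of what formalized bridging relations could look like, here is a toy sketch in Python (the names and the clean-water example are hypothetical illustrations chosen for concreteness, not OAA's actual modelling or specification languages):

```python
import random

# Low-level model: a stand-in for a detailed physical simulation
# (in reality: quantum chemistry, fluid dynamics, etc.).
def low_level_step(contaminant_concentrations):
    return [max(0.0, c + random.gauss(0, 0.001)) for c in contaminant_concentrations]

# Bridging relation: maps the low-level state up to a high-level, human-meaningful quantity.
def abstract(contaminant_concentrations):
    return sum(contaminant_concentrations) / len(contaminant_concentrations)

# High-level specification, written only against the abstract state.
def spec_water_is_potable(mean_contaminant_level):
    return mean_contaminant_level < 0.1

# In OAA, "the low-level dynamics satisfy the high-level spec" would be a proof
# obligation discharged with a certificate; this sampling check only illustrates
# what the claim being proved is about.
def sanity_check(n_samples=1000, horizon=10):
    for _ in range(n_samples):
        state = [random.uniform(0.0, 0.05) for _ in range(100)]
        for _ in range(horizon):
            state = low_level_step(state)
        if not spec_water_is_potable(abstract(state)):
            return False
    return True

print(sanity_check())
```

The human-auditable artifacts are the bridging map and the high-level spec; the low-level model can be as complicated as the automation can handle.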
2. Corrigibility is anti-natural.
See: The Off-Switch Game, Corrigibility (2014)
OAA Solution: (2.1) Instead of building in a shutdown button, build in a shutdown timer. See You can still fetch the coffee today if you're dead tomorrow. This enables human stakeholders to change course periodically (as long as the specification of non-catastrophe is good enough to ensure that most humans remain physically and mentally intact).
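A minimal sketch of the mechanical part of this idea, as a Gym-style wrapper (hypothetical interface; the hard part is the non-catastrophe specification, not the timer itself):

```python
class ShutdownTimer:
    """Wraps an environment so that every episode ends unconditionally after
    `horizon` steps; no reward can accrue beyond that point, so plans only
    ever get evaluated over a short, finite window."""

    def __init__(self, env, horizon):
        self.env = env
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.t += 1
        if self.t >= self.horizon:
            done = True  # a timer, not a button: nothing the agent does can postpone it
        return obs, reward, done, info
```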
3. Pivotal processes require dangerous capabilities.
See: Pivotal outcomes and pivotal processes
OAA Solution: (3.1) Indeed, dangerous capabilities will be required. Push for reasonable governance. This does not mean creating one world government, but it does mean that the objectives of a pivotal process will need to be negotiated and agreed upon internationally. Fortunately, for now, dangerous capabilities seem to depend on having large amounts of computing hardware, which can be controlled like other highly dangerous substances.
4. Goals misgeneralize out of distribution.
See: Goal misgeneralization: why correct specifications aren't enough for correct goals, Goal misgeneralization in deep reinforcement learning
OAA Solution: (4.1) Use formal methods with verifiable proof certificates[2]. Misgeneralization can occur whenever a property (such as goal alignment) has been tested only on a subset of the state space. Out-of-distribution failures of a property can only be ruled out by an argument for a universally quantified statement about that property—but such arguments can in fact be made! See VNN-COMP. In practice, it will not be possible to have enough information about the world to "prove" that a catastrophe will not be caused by an unfortunate coincidence, but instead we can obtain guaranteed probabilistic bounds via stochastic model checking.
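As a simplified example of the kind of guarantee and certificate involved (my gloss on the supermartingale idea behind the RASMs mentioned in footnote [2], not the full construction): suppose a nonnegative function V over world-states can be exhibited such that V does not increase in expectation under the modelled dynamics, and V(s) ≥ c on every catastrophic state. Then

$$\Pr\big[\text{some catastrophic state is ever reached}\big] \;\le\; \frac{V(s_0)}{c},$$

by the supermartingale inequality. The function V is the proof certificate: a powerful, untrusted search process can propose it, and a small trusted checker can verify the defining conditions without trusting the proposer.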
5. Instrumental convergence.
See: The basic AI drives, Seeking power is often convergently instrumental
OAA Solution:
5.1. Specify a task which is bounded in the sense that it is tractable to search for a plan that actually maximizes the score function. Multiple subtle points must be observed to pull this off (a toy sketch of the resulting score function follows 5.2):
5.1.1. The performance metric itself should be bounded on both ends, or even simply binary (did a "catastrophe" occur, or not?).
5.1.2. The domain of the performance metric should not be world-states, but rather entire trajectories (including the trajectory of the world-state and the trajectory of actions). Otherwise, the ends may always justify the means.
5.1.3. The score function cannot be the expected value of a future outcome, even a binary performance metric, because in an uncertain world one can always obtain higher probability of the desired outcome by gaining power. So, the score function should be capped at a reasonable probability, e.g. 98%.
5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).
5.1.5. By default, 5.1.4 creates an incentive to take control over irrelevant variables so as to ensure that they proceed as in the anticipated "status quo". Infrabayesian uncertainty about the dynamics is the final component that removes this incentive. In particular, the infrabayesian prior can (and should) have a high degree of Knightian uncertainty about human decisions and behaviour. Under worst-case reasoning about that uncertainty, the most effective way to limit the maximum divergence of human trajectories from the status quo is simply not to interfere.
5.2. The shutdown timer (2.1) also constrains the extent to which generic power-seeking is instrumental: any power grab needs to pay off within a short period of time, and astronomical stakes like the cosmic endowment are truncated away by the short finite time horizon.
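Putting 5.1.1–5.1.5 and 5.2 together, a toy version of such a score function might look like the following (my own sketch for concreteness, not OAA's actual construction):

$$\mathrm{score}(\pi) \;=\; \min\Big(\underline{\Pr}\big[\mathrm{OK}(\tau_{0:T}) \mid \pi\big],\ 0.98\Big) \;-\; \lambda\,\overline{\mathbb{E}}\big[d\big(\tau_{0:T},\ \tau^{\text{status quo}}_{0:T}\big) \mid \pi\big]$$

Here OK is a bounded (binary) predicate over whole trajectories up to the shutdown horizon T (5.1.1, 5.1.2, 5.2); the cap at 0.98 removes the incentive to seek power for ever-higher confidence (5.1.3); d penalizes divergence of the world-state trajectory from the no-powerful-AI baseline (5.1.4); and the underline/overline denote worst-case probability and expectation over the infrabayesian prior's Knightian uncertainty, notably about human behaviour (5.1.5).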
6. Pivotal processes likely require incomprehensibly complex plans.
OAA Solution:
6.1. Instead of hoping that a team of humans will be able to check the plan, hope that a team of humans can check the ontology and the specification, in a formal and intelligible language. Generally, scientific principles and requirements are vastly simpler than engineering blueprints.
6.2. Then use formal methods (4.1) to check that the plan satisfies the specification. The formal methods can be accelerated by superintelligent heuristics to find a proof certificate, as long as the certificate checker is itself formally verified (8.2).
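The trust structure in 6.2 is the familiar untrusted-prover / trusted-checker split. A minimal sketch with a toy deterministic world model and made-up names (in OAA the checker would validate a proof certificate against a stochastic model rather than replay a trace):

```python
# Toy world model: state -> action -> next state. The spec is "never visit a bad state".
MODEL = {
    "s0": {"a": "s1", "b": "bad"},
    "s1": {"a": "s2", "b": "s0"},
    "s2": {"a": "s2", "b": "s1"},
    "bad": {"a": "bad", "b": "bad"},
}
BAD_STATES = {"bad"}

def untrusted_search(model):
    """Stand-in for superintelligent heuristic search; its output is NOT trusted."""
    return ["a", "a", "a"]

def trusted_check(model, start, plan):
    """Small trusted core (in OAA, itself formally verified; cf. 8.2)."""
    state = start
    for action in plan:
        state = model[state][action]
        if state in BAD_STATES:
            return False
    return True

plan = untrusted_search(MODEL)
assert trusted_check(MODEL, "s0", plan)  # the plan is released only if this passes
```

Soundness rests only on the checker (and on the model and specification being right), not on whatever search process produced the plan.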
7. Superintelligence can fool human supervisors.
See: Reinforcement Learning from Human Feedback/Challenges, Obfuscated Arguments Problem
OAA Solution: (7.1) Instead of relying on human feedback about the apparent consequences of plans, rely on auditing formal specifications. Use human feedback about simulated consequences only to identify potentially forgotten formal safety properties. Because human feedback (or a model thereof) is not then the optimization target for policy behaviour, the standard mechanism by which human-fooling outputs are convergently generated is absent.
8. Superintelligence can hack software supervisors.
See: Reward Tampering Problems and Solutions in Reinforcement Learning
OAA Solution:
8.1. Before training a model large enough to pose a serious accident risk, ensure that the computational environment is itself formally verified. That an unhackable software isolation environment is possible has been demonstrated by HACMS. It does require a large amount of software engineering effort to do this for AI-relevant compute systems, but if the effort is put in, it will work. It does not require homomorphic encryption or other gadgets with substantial runtime overhead, unless one's threat model includes output side-channels; see 12.4.
8.2. As mentioned briefly in 6.2, ensure that formal verifiers which are deployed alongside the superintelligent search process inside the secured computational environment are also, themselves, fully formally verified. This does not create an infinite regress, because verification processes themselves are amenable to verification by hand in an interactive theorem prover such as Lean.
8.3. Prevent "input side-channels" from outside the system by ensuring deterministic executions.
9. Humans cannot be first-class parties to a superintelligence values handshake.
See: Values handshakes
OAA Solution: (9.1) Instead of becoming parties to a values handshake, keep superintelligent capabilities in a box and only extract plans that solve bounded tasks for finite time horizons and verifiably satisfy safety criteria that include not violating the natural boundaries of humans. This can all work without humans ever being terminally valued by AI systems as ends in themselves.
10. Humanlike minds/goals are not necessarily safe.
See: Joseph Stalin
OAA Solution: (10.1) Do not rely on humanlike minds to perform transformative tasks.
11. Someone else will deploy unsafe superintelligence first (possibly by stealing it from you).
See: Can the Singularity be avoided? (Vinge, 1993)
OAA Solution:
11.1. Implement effectively-global Compute Thresholds,
11.2. and the necessary monitoring infrastructure.
12. Unsafe superintelligence in a box might figure out what’s going on and find a way to exfiltrate itself by steganography and spearphishing.
See: AI Boxing
OAA Solution:
12.1. Limit and penalize the complexity of extracted plans, so there is not enough bandwidth to encode extra superintelligence-sized messages (see the note after 12.4).
12.2. Strictly control access to the physical substrate so that no one is in a position to assist in exfiltration.
12.3. Use zeroization to ensure that attempts to invade the data centre cannot result in exfiltration either.
12.4. I think these measures together with 8.3 are roughly sufficient, but if civilization wanted to go the extra mile, we could wait until it’s tractable to use cryptographic boxing.
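The intuition behind 12.1 can be made crisp with a counting argument (my gloss): if every extracted plan fits in at most B bits, then whatever steganographic scheme the boxed system uses, a single extraction can convey at most B bits of hidden payload, since

$$I(M;\ \mathrm{plan}) \;\le\; H(\mathrm{plan}) \;\le\; B,$$

where M is any internal message the system is trying to smuggle out. Exfiltrating "superintelligence-sized" content would therefore require either a great many extractions or a much wider channel.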
13. We are ethically obligated to propose pivotal processes that are as close as possible to fair Pareto improvements for all citizens, both by their own lights and from a depersonalized well-being perspective.
See: moral philosophy? But more instrumentally, we need the stakeholders (including engineers and governments) to feel good about what they are working on and be able to work with each other without hiding their goals.
OAA Solution: (13.1) Accept differing models and specifications from all stakeholders; search for a Nash bargaining solution with respect to the random dictator policy, which balances fairness and Pareto-optimality.
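Concretely, as one possible formalization (my illustration; the details could differ): with stakeholders i = 1, …, n, each with utility u_i, take as the disagreement point each stakeholder's expected utility under the random dictator policy, which picks one stakeholder uniformly at random and enacts that stakeholder's most-preferred policy π_j*; then search among feasible policies for the Nash bargaining solution relative to that point:

$$d_i \;=\; \frac{1}{n}\sum_{j=1}^{n} u_i(\pi_j^{*}), \qquad \pi^{\mathrm{NB}} \;\in\; \arg\max_{\pi \,:\, u_i(\pi) \ge d_i \ \forall i}\ \prod_{i=1}^{n}\big(u_i(\pi) - d_i\big)$$

The random-dictator baseline supplies the fairness, and maximizing the Nash product over policies that weakly improve on it supplies the Pareto-optimality.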
- ^
For the record, I register an objection to the use of the phrase "reward hacking" for what others call "specification gaming" because I prefer to reserve the word "hacking" for behaviour which triggers the failure of a different software system to perform its intended function; most specification gaming examples do not actually involve hacking.
- ^
Probably mostly not dependent-type-theory proofs. Other kinds of proof certificates include reach-avoid supermartingales (RASMs), LFSC proof certificates, and Alethe proofs. OAA will almost surely involve creating a new proof certificate language that is adapted to the modelling language and the specification language, and will support using neural networks or other learned representations as argument steps (e.g. as RASMs), some argument steps that are more like branch-and-bound, some argument steps that are more like tableaux, etc., but with a small and computationally efficient trusted core (unlike, say, Agda, or Metamath at the opposite extreme).
Thank you for the clarification. This proposal is indeed importantly different from the PCEV proposal. But since some people see hurting heretics as a moral imperative, any AI that allows heretics to escape punishment will also be seen as unacceptable by at least some people. This means that the set of Pareto improvements is empty.
In other words: hurting heretics is indeed off the table in your proposal (which is an important difference compared to PCEV). However, any scenario that includes the existence of an AI that allows heretics to escape punishment is also off the table. The existence of such an AI would be seen as intrinsically bad by people who see hurting heretics as a moral imperative (for example: Gregg really does not want a world where Gregg has agreed to tolerate the existence of an unethical AI that disregards its moral duty to punish heretics). More generally: anything that improves the lives of heretics is off the table. If an outcome improves the lives of heretics (compared to the no-AI baseline), then this outcome is not a Pareto improvement, because improving the lives of heretics makes things worse from the point of view of those who are deeply committed to hurting heretics.
In yet other words: it only takes two individuals to rule out any outcome that contains any improvement for any person. Gregg and Jeff are both deeply committed to hurting heretics, but their definitions of "heretic" differ, and every individual is seen as a heretic by at least one of them. So any outcome that makes life better for any person is off the table. Gregg and Jeff do have to be very committed to the moral position that the existence of any AI that neglects its duty to punish heretics is unacceptable; it must, for example, be impossible to get them to agree to tolerate the existence of such an AI in exchange for increased influence over the far future. But a population of billions only has to contain two such people for the set of Pareto improvements to be empty.
Another way to approach this would be to ask: what would have happened if someone had successfully implemented a Gatekeeper AI built on top of a set of definitions such that the set of Pareto improvements is empty?
For the version of the random dictator negotiation baseline that you describe, this comment might actually be more relevant than the PCEV thought experiment. It is a comment on the suggestion by Andrew Critch that it might be possible to view a Boundaries / Membranes based BATNA as having been agreed to acausally. It is impossible to reach such an acausal agreement when a group includes people like Gregg and Jeff, for the same reason that it is impossible to find an outcome that is a Pareto improvement when a group includes people like Gregg and Jeff. (That comment also discusses ideas for how one might deal with the dangers that arise when one combines people like Gregg and Jeff with a powerful and clever AI.)
Another way to look at this would be to consider what it would mean to find a Pareto improvement with respect to only Bob and Dave. Bob wants to hurt heretics, and Bob considers half of all people to be heretics. Dave is an altruist who just wants people to have as good a life as possible. The set of Pareto improvements would now be made up entirely of variations on the same general situation: make the lives of non-heretics much better, and make the lives of heretics much worse. For Bob to agree, heretics must be punished. And for Dave to agree, Dave must see the average life quality as an improvement on the "no superintelligence" outcome. If the "no superintelligence" outcome is bad for everyone, then the lives of heretics in this scenario could get very bad.
More generally: people like Bob (with aspects of morality along the lines of "heretics deserve eternal torture in hell") will have dramatically increased power over the far future when one uses this type of negotiation baseline (assuming that things have been patched in a way that results in a non-empty set of Pareto improvements). If everyone is included in the calculation of what counts as a Pareto improvement, then the set of Pareto improvements is empty (due to people like Gregg and Jeff). And if not everyone is included, then the outcome could get very bad for many people (compared to whatever would have happened otherwise).
(Adding the SPADI feature to your proposal would remove these issues and would prevent people like Dave from being disempowered relative to people like Bob. The details are importantly different from PCEV, but it is no coincidence that adding the SPADI feature removes this particular problem for both proposals. The common denominator is that, from the perspective of Steve, it is in general dangerous to encounter an AI that has taken "unwelcome" or "hostile" preferences about Steve into account.)
Also: my general point about the concept of "fair Pareto improvements" having counterintuitive implications in this novel context still applies (it is not related to the details of any specific proposal).