I often think about "the road to hell is paved with good intentions".[1] I'm unsure to what degree this is true, but it does seem that people trying to do good have caused more negative consequences in aggregate than one might naively expect.[2] "Power corrupts" and "power-seekers using altruism as an excuse to gain power" are two often-cited reasons for this, but I don't think they explain all of it.
A more subtle reason is that even when people are genuinely trying to do good, they're not entirely aligned with goodness. Status-seeking is a powerful motivation for almost all humans, including altruists, and we frequently award social status to people for merely trying to do good, before seeing all of the consequences of their actions. This is in some sense inevitable, as there are no good alternatives: we often need to award people social status before all of the consequences play out, both to motivate them to continue to try to do good, and to provide them with influence/power to help them accomplish their goals.
A person who consciously or subconsciously cares a lot about social status will optimize not just for doing good, but also for appearing to do good. One way these two motivations diverge is in how to manage risks, especially risks of causing highly negative consequences. Someone who wants to appear to do good would be motivated to hide or downplay such risks, from others and perhaps from themselves, as fully acknowledging them would often amount to admitting that they're not doing as much good (in expectation) as they appear to be.
How to mitigate this problem
Individually, altruists (to the extent that they endorse actually doing good) can make a habit of asking themselves and others what risks they may be overlooking, dismissing, or downplaying.[3]
Institutionally, we can rearrange organizational structures to take these individual tendencies into account, for example by creating positions dedicated to or focused on managing risk. These could be risk management officers within organizations, or people empowered to manage risk across the EA community.[4]
Socially, we can reward people/organizations for taking risks seriously, or punish (or withhold rewards from) those who fail to do so. This is tricky because, due to information asymmetry, we can easily end up with "risk management theater" akin to "security theater" (which, come to think of it, is a type of risk management theater). But I think we should at least take notice when someone or some organization fails, in a clear and obvious way, to acknowledge risks or to do good risk management, for example by not writing down and keeping updated a list of important risks to be mindful of, or by avoiding/deflecting questions about risk.[5] More optimistically, we can try to develop a culture where people and organizations are monitored and held accountable for managing risks substantively and competently.
[1] Due in part to my family history.

[2] Normally I'd give some examples here, but we can probably all think of some from the recent past.

[3] I try to do this myself in the comments.

[4] An idea previously discussed by Ryan Carey and William MacAskill.

[5] However, see this comment.
I think my other comment on this will hopefully be sufficient to outline what my position actually is. But perhaps a more constructive way forward would be to ask how certain you are that CEV is in fact the right thing to aim at. That is, how certain are you that this situation is not symmetrical to the case where Bob thinks that "a Suffering Reducing AI (SRAI) is the objectively correct thing to aim at"? Bob will diagnose any problem with any specific SRAI proposal as arising from proxy issues, related to the fact that Bob is not able to perfectly define "Suffering" and must always rely on a proxy (those proxy issues exist, but they are not the most serious issue with Bob's SRAI project).
I don't think we should let Bob proceed with an AI project that aims to find the correct description of "what SRAI is", even if he is being very careful and is trying to implement a safety measure (one that, while it continues to work as intended, prevents SRAI from killing everyone). Those safety features might fail, regardless of whether or not someone has pointed out a critical flaw in them before the project reaches the point of no return (this conclusion is not related to Corrigibility; I would reach the exact same conclusion if Bob's SRAI project were using any other safety measure). For the exact same reason, I simply do not think it is a good idea to proceed with your proposed CEV project (as I understand that project). I think that doing so would represent a very serious s-risk. At best, it will fail in a safe way, for predictable reasons. How confident are you that I am completely wrong about this?
Finally, I should note that I still don't understand your terminology, and I don't think I will until you specify what you mean by "something like CEV". My current comments are responding to my best guess of what you mean (which is that MPCEV, from my linked post, would not count as "something like CEV" in your terminology). (Does an orca count as "something like a shark"? If it is very important that some water tank be free of fish, then it is difficult for me to discuss Dave's "let's put something like a shark in that water tank" project until I have an answer to my orca question.)
(I assume this is obvious, but just to be completely sure it is clear: I very much appreciate that you are engaging on this topic.)