The sheer length of GiveWell co-founder and co-executive director Holden Karnofsky's excellent critique of the Singularity Institute means that it's hard to keep track of the resulting discussion. I propose to break out each of his objections into a separate Discussion post so that each receives the attention it deserves.
Objection 1: it seems to me that any AGI that was set to maximize a "Friendly" utility function would be extraordinarily dangerous.
Suppose, for the sake of argument, that SI manages to create what it believes to be an FAI. Suppose that it is successful in the "AGI" part of its goal, i.e., it has successfully created an intelligence vastly superior to human intelligence and extraordinarily powerful from our perspective. Suppose that it has also done its best on the "Friendly" part of the goal: it has developed a formal argument for why its AGI's utility function will be Friendly, it believes this argument to be airtight, and it has had this argument checked over by 100 of the world's most intelligent and relevantly experienced people. Suppose that SI now activates its AGI, unleashing it to reshape the world as it sees fit. What will be the outcome?
I believe that the probability of an unfavorable outcome - by which I mean an outcome essentially equivalent to what a UFAI would bring about - exceeds 90% in such a scenario. I believe the goal of designing a "Friendly" utility function is likely to be beyond the abilities even of the best team of humans willing to design such a function. I do not have a tight argument for why I believe this, but a comment on LessWrong by Wei Dai gives a good illustration of the kind of thoughts I have on the matter:
What I'm afraid of is that a design will be shown to be safe, and then it turns out that the proof is wrong, or the formalization of the notion of "safety" used by the proof is wrong. This kind of thing happens a lot in cryptography, if you replace "safety" with "security". These mistakes are still occurring today, even after decades of research into how to do such proofs and what the relevant formalizations are. From where I'm sitting, proving an AGI design Friendly seems even more difficult and error-prone than proving a crypto scheme secure, probably by a large margin, and there are not decades of time to refine the proof techniques and formalizations. There's a good recent review of the history of provable security, titled Provable Security in the Real World, which might help you understand where I'm coming from.
I think this comment understates the risks, however. For example, when the comment says "the formalization of the notion of 'safety' used by the proof is wrong," it is not clear whether it means that the values the programmers have in mind are not correctly implemented by the formalization, or whether it means they are correctly implemented but are themselves catastrophic in a way that hasn't been anticipated. I would be highly concerned about both. There are other catastrophic possibilities as well: perhaps the utility function itself is well-specified and safe, but the AGI's model of the world is flawed (in particular, perhaps its prior or its process for matching observations to predictions is flawed) in a way that doesn't emerge until the AGI has made substantial changes to its environment.
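To make the first of these failure modes concrete, here is a minimal toy sketch in Python (the names and the "budget" property are entirely hypothetical, not drawn from any real verification framework). It shows how a formal check can be satisfied perfectly while failing to capture what the designers actually meant by "safe": the proof is not wrong, the formalization is.

```python
# Toy illustration of a formalization gap: the property we can formally
# verify is not the property we actually care about.

# Each logged action is a pair (irreversible, resource_cost).
ActionLog = list[tuple[bool, int]]

def proven_property(log: ActionLog) -> bool:
    """The property the 'proof' establishes: total resource use stays under budget."""
    return sum(cost for _, cost in log) <= 100

def intended_property(log: ActionLog) -> bool:
    """The property the designers meant: no irreversible action is ever taken."""
    return all(not irreversible for irreversible, _ in log)

# A run that satisfies the proven property while violating the intended one:
# a single cheap but irreversible action.
log: ActionLog = [(True, 10), (False, 5)]

assert proven_property(log)          # the formal check passes
assert not intended_property(log)    # the outcome is still catastrophic
```

The analogous gap for a real AGI would of course be far harder to spot in advance; the point of the sketch is only that an airtight proof of the wrong formalization provides no protection.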
By SI's own arguments, even a small error in any of these things would likely lead to catastrophe. And there are likely failure forms I haven't thought of. The overriding intuition here is that complex plans usually fail when unaccompanied by feedback loops. A scenario in which a set of people is ready to unleash an all-powerful being to maximize some parameter in the world, based solely on their initial confidence in their own extrapolations of the consequences of doing so, seems like a scenario that is overwhelmingly likely to result in a bad outcome. It comes down to placing the world's largest bet on a highly complex theory - with no experimentation to test the theory first.
So far, all I have argued is that the development of "Friendliness" theory can achieve at best only a limited reduction in the probability of an unfavorable outcome. However, as I argue in the next section, I believe there is at least one concept - the "tool-agent" distinction - that has more potential to reduce risks, and that SI appears to ignore this concept entirely. I believe that tools are safer than agents (even agents that make use of the best "Friendliness" theory that can reasonably be hoped for) and that SI encourages a focus on building agents, thus increasing risk.
I hold with timtyler that a uFAI probably wouldn't kill off all of humanity. There is little benefit to doing so, and it potentially incurs a huge cost by going against the wishes of potential simulators; of counterfactual FAIs acting acausally (not necessarily human-designed, just designed by an entity or entities that care about persons in general); of hidden AGIs (e.g., alien AGIs that have already swept past the solar system but make it look as if they haven't, which would incidentally resolve the Fermi paradox); et cetera. Such a scenario is still potentially a huge loss relative to FAI scenarios, but it implies that AGI isn't a sure-thing existential catastrophe, and is perhaps less likely to lead to human extinction than certain other existential risks. If for whatever reason you think that humans are easily satisfied, then uFAI is theoretically just as good as FAI; but that really doesn't seem plausible to me. There might also be certain harm-minimization moral theories that are indifferent between uFAI and FAI. But I think most moral theories would still place huge weight on FAI over uFAI even if uFAI turned out to be human-friendly in some local sense.
Given such considerations, I'm not sure whether uFAI or wannabe-FAI is more likely to lead to evil AI. A wannabe-FAI is more likely to have a stable goal system, immune to certain self-modifications and game-theoretic pressures that a less stable AI, or a coalition of splintered AI successors, would be relatively influenced by. For example, a wannabe-FAI might disregard certain perceived influences (even influences from hypothetical FAIs it was considering self-modifying into, or acausal influences generally) as "blackmail", or as things it is otherwise morally required to ignore. This could lead to worse outcomes than a messier, more adaptable, more influence-able uFAI. One might want to avoid releasing a single wannabe-FAI that could take over existing computing infrastructure, and thus halt most other AI work, while remaining self-limiting in some important respect (e.g. through sensitivity to Pascalian considerations stemming from a formal, consistent decision theory, of a sort that a less formal AI architecture wouldn't be troubled by). Such a scenario could be worse than one in which a bunch of evolving AGIs with diverse initial goal systems get unleashed and compete with each other, keeping any self-limiting AI from reaching evil, or at least relatively suboptimal, singleton status. And so on; one could list considerations like this for a long time. At any rate, I don't think there are any obviously overwhelming answers. Luckily, in the meantime there are meta-level strategies, such as intelligence amplification in a very broad sense, which could make such analysis more tractable.
(The above analysis is written from what I think is a SingInst-like perspective, i.e., hard takeoff is plausible, FAI as defined by Eliezer is especially desirable, et cetera. I don't necessarily agree with such a perspective, and my analysis could fail given different background assumptions.)
To reply to Wei Dai's incoming link:
Most math kills you quietly, neatly, and cleanly, unless the apparent obstacles to distant timeless trade are overcome in practice and we get a certain kind of "luck" on how a vast net of mostly-inhuman timeless trades sums out, in which case we get an unknown fixed selection from some subjective probability distribution ranging from "fate much worse than death" through "death" to "fate much better than death but still much worse than FAI". I don't spend much time talking about this on LW because…