Holden Karnofsky's Singularity Institute Objection 1

Paul Crowley

The sheer length of GiveWell co-founder and co-executive director Holden Karnofsky's excellent critique of the Singularity Institute means that it's hard to keep track of the resulting discussion. I propose to break out each of his objections into a separate Discussion post so that each receives the attention it deserves.

Objection 1: it seems to me that any AGI that was set to maximize a "Friendly" utility function would be extraordinarily dangerous.

Suppose, for the sake of argument, that SI manages to create what it believes to be an FAI. Suppose that it is successful in the "AGI" part of its goal, i.e., it has successfully created an intelligence vastly superior to human intelligence and extraordinarily powerful from our perspective. Suppose that it has also done its best on the "Friendly" part of the goal: it has developed a formal argument for why its AGI's utility function will be Friendly, it believes this argument to be airtight, and it has had this argument checked over by 100 of the world's most intelligent and relevantly experienced people. Suppose that SI now activates its AGI, unleashing it to reshape the world as it sees fit. What will be the outcome?

I believe that the probability of an unfavorable outcome - by which I mean an outcome essentially equivalent to what a UFAI would bring about - exceeds 90% in such a scenario. I believe the goal of designing a "Friendly" utility function is likely to be beyond the abilities even of the best team of humans willing to design such a function. I do not have a tight argument for why I believe this, but a comment on LessWrong by Wei Dai gives a good illustration of the kind of thoughts I have on the matter:

What I'm afraid of is that a design will be shown to be safe, and then it turns out that the proof is wrong, or the formalization of the notion of "safety" used by the proof is wrong. This kind of thing happens a lot in cryptography, if you replace "safety" with "security". These mistakes are still occurring today, even after decades of research into how to do such proofs and what the relevant formalizations are. From where I'm sitting, proving an AGI design Friendly seems even more difficult and error-prone than proving a crypto scheme secure, probably by a large margin, and there is no decades of time to refine the proof techniques and formalizations. There's a good recent review of the history of provable security, titled Provable Security in the Real World, which might help you understand where I'm coming from.

I think this comment understates the risks, however. For example, when the comment says "the formalization of the notion of 'safety' used by the proof is wrong," it is not clear whether it means that the values the programmers have in mind are not correctly implemented by the formalization, or whether it means they are correctly implemented but are themselves catastrophic in a way that hasn't been anticipated. I would be highly concerned about both. There are other catastrophic possibilities as well; perhaps the utility function itself is well-specified and safe, but the AGI's model of the world is flawed (in particular, perhaps its prior or its process for matching observations to predictions are flawed) in a way that doesn't emerge until the AGI has made substantial changes to its environment.

By SI's own arguments, even a small error in any of these things would likely lead to catastrophe. And there are likely failure forms I haven't thought of. The overriding intuition here is that complex plans usually fail when unaccompanied by feedback loops. A scenario in which a set of people is ready to unleash an all-powerful being to maximize some parameter in the world, based solely on their initial confidence in their own extrapolations of the consequences of doing so, seems like a scenario that is overwhelmingly likely to result in a bad outcome. It comes down to placing the world's largest bet on a highly complex theory - with no experimentation to test the theory first.

So far, all I have argued is that the development of "Friendliness" theory can achieve at best only a limited reduction in the probability of an unfavorable outcome. However, as I argue in the next section, I believe there is at least one concept - the "tool-agent" distinction - that has more potential to reduce risks, and that SI appears to ignore this concept entirely. I believe that tools are safer than agents (even agents that make use of the best "Friendliness" theory that can reasonably be hoped for) and that SI encourages a focus on building agents, thus increasing risk.

Here is the podcast in which the Skeptics' Guide to the Universe (SGU) interviews Michael Vassar (MV), dated 23-Sep-2009. The interview begins at 26:10; the transcript below covers 45:50 to 50:11.

SGU: Let me back up a little bit. So we're talking about, how do we keep an artificial intelligence or recursively self-improving technology from essentially taking over the world and deciding that humanity is irrelevant or they would rather have a universe where we're not around or maybe where we're batteries or slaves or whatever. So one way that I think you've been focusing on so far is the "Laws of Robotics" approach. The Asimov approach.

MV: Err, no. Definitely not.

SGU: Well, in the broadest concept, in that you constrain the artificial intelligence in such a way...

MV: No. You never constrain, you never constrain a god.

SGU: But if you can't constrain it, then how can you keep it from deciding that we're irrelevant at some point?

MV: You don't need to constrain something that you're creating. If you create something, you get to designate all of its preferences, if you merely decide to do so.

SGU: Well, I think we're stumbling on semantics then. Because to constrain...

MV: No, we're not. We're completely not. We had a whole media campaign called "Three Laws Bad" back in 2005.

SGU: I wasn't, I didn't mean to specifically refer to the Three Laws, but to the overall concept of...

MV: No, constraint in the most general sense is suicide.

SGU: So I'm not sure I understand that. Essentially, we're saying we want the AI to be benign, to take a broad concept, and not malignant. Right? So we're trying to close down certain paths by which it might develop or improve itself to eliminate those paths that will lead to a malignant outcome.

MV: You don't need to close down anything. You don't need to eliminate anything. We're creating the AI. Everything about it, we get to specify, as its creators. This is not like a child or a human that has instincts and impulses. A machine is incredibly hard not to anthropomorphize here. There's really very little hope of managing it well if you don't. We are creating a system, and therefore we're designating every feature of the system. Creating it to want to destroy us and then constraining it so that it doesn't do so is a very, very bad way of doing things.

SGU: Well, that's not what I'm talking about. Let me further clarify, because we're talking about two different things. You're talking about creating it in a certain form, but I'm talking about, once it gets to the point where it starts recreating itself, we have to constrain the way it might create and evolve itself so that it doesn't lead to something that wants to destroy us. Obviously, we're not going to create something that wants to destroy us and then keep it from doing so. We're going to create something whose initial state may be benign, but since you're also talking about recursive self-improvement, we have to also keep it from evolving into something malignant. That's what I mean by constraining it.

MV: If we're talking about a single AI, not an economy, or an ecosystem, if we're not talking about something that involves randomness, if we're not talking about something that is made from a human, changes in goals do not count as improvements. Changes in goals are necessarily accidents or compromises. But an unchecked, unconstrained AI that wants ice cream will never, however smart it becomes, decide that it wants chocolate candy instead.

SGU: But it could decide that the best way to make ice cream is out of human brains.

MV: Right. But it will only decide that the best way to make ice cream is out of human brains.

SGU: Right, that's what I'm talking about. So how do we keep it from deciding that it wants to make ice cream out of human brains? Which is kind of a silly analogy to arrive at, but...

MV: Well, no... uh... how do we do so? We... okay. The Singularity Institute's approach has always been that we have to make it want to create human value. And if it creates human value out of human brains, that's okay. But human value is not an easy thing for humans to talk about or describe. In fact, it's only going to be able to create human value, with all probability, by looking at human brains.

SGU: Ah, that's interesting. But do you mean it will value human life?

MV: No, I mean it will value whatever it is that humans value.
