Objection 1: it seems to me that any AGI that was set to maximize a "Friendly" utility function would be extraordinarily dangerous.
Yes, Friendliness is hard, and that means that even the most dedicated team might screw it up. The point is that not trying as hard as you can to build Friendly AI is even worse, because then you almost certainly get uFAI. At least by trying to build FAI, we've got some chance of winning.
So this objection really just punts to objection #2, about tool-AGI, as the last paragraph here seems to indicate.
For certain values of "extraordinarily dangerous", that is an excellent rebuttal to the objection. However, as I am sure you are aware, there are many possible values of "extraordinarily dangerous". If I may present a plausible argument:
Let us declare a mind-dead universe (one with no agents) as having utility zero. It seems intuitive that working to build FAI decreases the probability of human extinction. However, a true uFAI (like a paperclip-maximizer) is hardly our only problem. A worse problem would be semi-FAI, that is, an AI which does not wipe out all of humanity, but does produce a world state which is worse than a mind-dead universe. As SI decreases the probability of uFAI, it increases the probability of semi-FAI.
Will_Newsome, myself, and probably several other users have mentioned such possibilities.
We'd need a pretty specific kind of "semi-FAI" to create an outcome worse than utility 0, so I'd prefer a term like eAI ("evil AI") for an AI that produces a world state worse than utility 0.
So: Is eAI more probable given (1) the first AGIs are created by people explicitly aiming for Friendliness, or given (2) the first AGIs are not created by people explicitly aiming for Friendliness?
First, I prefer your terminology to my own. I had internally been calling such AIs sAIs (sadistic Artificial Intelligences). The etymology was chosen for a very specific reason. However, eAI is most satisfactory.
Second, I do apologize if I am being excessively naive. However, I must confess, I was rather convinced by Yudkowsky's argumentation about such matters. I tentatively hold, and believe it is the SI's position, that an uFAI is almost certain to produce human extinction. Again, I would like to call this utility 0.
Third, I do tentatively hold that p(eAI | attempt towards FAI) > p(eAI | attempt towards AGI).
I am well aware that it is neither your duty nor the duty of the SI to respond to every minor criticism. However, if you have a reason to believe that my third point is incorrect, I would very much like to be made aware of it.
((A possible counterargument to my position: any proper attempt to reduce the chance of human extinction does increase the probability of a world of negative utility, generally speaking. If my argument too closely resembles negative utilitarianism, then I revoke my argument.))
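To make my worry concrete, here is a minimal sketch in Python. The probabilities and utilities are purely hypothetical placeholders, not my actual estimates or SI's; the only point is that if eAI outcomes are sufficiently negative, a strategy that lowers p(uFAI) while raising p(eAI) can come out behind on expected utility.

```python
# Illustrative only: hypothetical probabilities and utilities, not actual estimates.
# Convention from the comment above: extinction (uFAI) = 0, FAI > 0, eAI < 0.

def expected_utility(p_fai, p_ufai, p_eai, u_fai=1.0, u_ufai=0.0, u_eai=-10.0):
    """Expected utility of an outcome distribution over {FAI, uFAI, eAI}."""
    assert abs(p_fai + p_ufai + p_eai - 1.0) < 1e-9
    return p_fai * u_fai + p_ufai * u_ufai + p_eai * u_eai

# Strategy A: explicitly aim for Friendliness (less uFAI, slightly more eAI).
aim_for_fai = expected_utility(p_fai=0.20, p_ufai=0.70, p_eai=0.10)

# Strategy B: build AGI without aiming for Friendliness (more uFAI, less eAI).
plain_agi = expected_utility(p_fai=0.02, p_ufai=0.95, p_eai=0.03)

print(aim_for_fai, plain_agi)  # -0.8 vs -0.28: with these numbers the eAI term dominates
```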
I tentatively hold, and believe it is the SI's position, that an uFAI is almost certain to produce human extinction. Again, I would like to call this utility 0.
I hold with timtyler that a uFAI probably wouldn't kill off all humanity. There's little benefit to doing so and it potentially incurs a huge cost by going against the wishes of potential simulators, counterfactual FAIs (acausally; not necessarily human-designed, just designed by an entity or entities that cared about persons in general), hidden AGIs (e.g. alien AGIs that have already swept by the solar system but are making it look as if they haven't (note that this resolves the Fermi paradox)), et cetera. Such a scenario is still potentially a huge loss relative to FAI scenarios, but it implies that AGI isn't a sure-thing existential catastrophe, and is perhaps less likely to lead to human extinction than certain other existential risks. If for whatever reason you think that humans are easily satisfied, then uFAI is theoretically just as good as FAI; but that really doesn't seem plausible to me. There might also be certain harm-minimization moral theories that would be indifferent between uFAI and FAI. But I think most moral theories would still place huge emphasis on FAI versus uFAI even if uFAI would actually be human-friendly in some local sense.
Given such considerations, I'm not sure whether uFAI or wannabe-FAI is more likely to lead to evil AI. Wannabe-FAI is more likely to have a stable goal system that is immune to certain self-modifications and game theoretic pressures that a less stable AI or a coalition of splintered AI successors would be relatively influenced by. E.g. a wannabe-FAI might disregard certain perceived influences (even influences from hypothetical FAIs that it was considering self-modifying into, or acausal influences generally) as "blackmail" or as otherwise morally requiring ignorance. This could lead to worse outcomes than a messier, more adaptable, more influence-able uFAI. One might want to avoid letting out into the world a single wannabe-FAI that could take over existing computing infrastructure and thus halt most AI work but would be self-limiting in some important respect (e.g. due to sensitivity to Pascalian considerations arising from a formal, consistent decision theory, of the sort that a less formal AI architecture wouldn't have trouble with). Such a scenario could be worse than one where a bunch of evolving AGIs with diverse initial goal systems get unleashed and compete with each other, keeping self-limiting AIs from reaching evil or at least relatively suboptimal singleton status. And so on; one could list considerations like this for a long time. At any rate I don't think there are any obviously overwhelming answers. Luckily in the meantime there are meta-level strategies like intelligence amplification (in a very broad sense) which could make such analysis more tractable.
(The above analysis is written from what I think is a SingInst-like perspective, i.e., hard takeoff is plausible, FAI as defined by Eliezer is especially desirable, et cetera. I don't necessarily agree with such a perspective, and my analysis could fail given different background assumptions.)
To reply to Wei Dai's incoming link:
Most math kills you quietly, neatly, and cleanly, unless the apparent obstacles to distant timeless trade are overcome in practice and we get a certain kind of "luck" on how a vast net of mostly-inhuman timeless trades sum out, in which case we get an unknown fixed selection from some subjective probability distribution over "fate much worse than death" to "death" to "fate much better than death but still much worse than FAI". I don't spend much time talking about this on LW because timeless trade speculation eats people's brains and doesn't produce any useful outputs from the consumption; only decision theorists whose work is plugging into FAI theory need to think about timeless trade, and I wish everyone else would shut up about the subject on grounds of sheer cognitive unproductivity, not to mention the horrid way it sounds from the perspective of traditional skeptics (and not wholly unjustifiably so). (I have expressed this opinion in the past whenever I hear LWers talking about timeless trade; it is not limited to Newsome, though IIRC he has an unusual case of undue optimism about outcomes of timeless trade, owing to theological influences that I understand timeless trade speculations helped exacerbate his vulnerability to.)
I don't spend much time talking about this on LW because timeless trade speculation eats people's brains and doesn't produce any useful outputs from the consumption; only decision theorists whose work is plugging into FAI theory need to think about timeless trade, and I wish everyone else would shut up about the subject on grounds of sheer cognitive unproductivity
I don't trust any group that wishes to create or make efforts towards influencing the creation of a superintelligence when they try to suppress discussion of the very decision theory that the superintelligence will implement. How such an agent interacts with the concept of acausal trade completely and fundamentally alters the way it can be expected to behave. That is the kind of thing that needs to be disseminated among an academic community, digested and understood in depth. It is not something to trust to an isolated team, with all the vulnerability to groupthink that entails.
If someone were to announce credibly "We're creating an AGI. Nobody else but us is allowed to even think about what it is going to do. Just trust us, it's Friendly." then the appropriate response would be to shout "Watch out! It's a dangerous crackpot! Stop him before he takes over the world and potentially destroys us all!" And make no mistake, if this kind of attempt at suppression were made by anyone remotely near developing an FAI theory, that is what it would entail. Fortunately at this point it is still at the "Mostly Harmless" stage.
and doesn't produce any useful outputs from the consumption
I don't believe you. At the very least, it produces outputs at least as useful and interesting as other discussions of decision theory do. There are plenty of curious avenues to explore on the subject and fascinating implications and strategies that are at least worth considering.
Sure, the subject may deserve a warning "Do not consider this topic if you are psychologically unstable or have reason to believe that you are particularly vulnerable to distress or fundamental epistemic damage by the consideration of abstract concepts."
not to mention the horrid way it sounds from the perspective of traditional skeptics (and not wholly unjustifiably so).
If this were the real reason for Eliezer's objection I would not be troubled by his attitude. I would still disagree - the correct approach is not to try to suppress all discussion of the subject by other people, but rather to apply basic political caution and not comment on it oneself (or allow anyone within one's organisation to do so).
If someone were to announce credibly "We're creating an AGI. Nobody else but us is allowed to even think about what it is going to do. Just trust us, it's Friendly." then the appropriate response would be to shout "Watch out! It's a dangerous crackpot! Stop him before he takes over the world and potentially destroys us all!" And make no mistake, if this kind of attempt at suppression were made by anyone remotely near developing an FAI theory, that is what it would entail. Fortunately at this point it is still at the "Mostly Harmless" stage.
I don't see how anyone could credibly announce that. The announcement radiates crackpottery.
Most math kills you quietly, neatly, and cleanly, unless the apparent obstacles to distant timeless trade are overcome in practice
Will mentioned a couple of other possible ways in which UFAI fails to kill off humanity, besides distant timeless trade. (BTW I think the current standard term for this is "acausal trade" which incorporates the idea of trading across possible worlds as well as across time.) Although perhaps "hidden AGIs" is unlikely and you consider "potential simulators" to be covered under "distant timeless trade".
I don't spend much time talking about this on LW because timeless trade speculation eats people's brains and doesn't produce any useful outputs from the consumption; only decision theorists whose work is plugging into FAI theory need to think about timeless trade
The idea is relevant not just for actually building FAI, but also for deciding strategy (ETA: for example, how much chance of creating UFAI we should accept in order to build FAI). See here for an example of such discussion (between people who perhaps you think are saner than Will Newsome).
not to mention the horrid way it sounds from the perspective of traditional skeptics
I agreed with this, but it's not clear what we should do about it (e.g., whether we should stop talking about it), given the strategic relevance.
The idea is relevant not just for actually building FAI, but also for deciding strategy
And also relevant, I hasten to point out, for solving moral philosophy. I want to be morally justified whether or not I'm involved with an FAI team and whether or not I'm in a world where the Singularity is more than just a plot device. Acausal influence elucidates decision theory, and decision theory elucidates morality.
To clarify what I assume to be Eliezer's point: "here there be basilisks, take it somewhere less public"
Will mentioned a couple of other possible ways in which UFAI fails to kill off humanity, besides distant timeless trade. [...] Although perhaps "hidden AGIs" is unlikely and you consider "potential simulators" to be covered under "distant timeless trade".
This is considered unlikely 'round these parts, but one should also consider God, Who is alleged by some to be omnipotent and Who might prefer to keep humans around. Insofar as such a God is metaphysically necessary this is mechanistically but not phenomenologically distinct from plain "hidden AGI".
For the LW public:
(IIRC he has an unusual case of undue optimism about outcomes of timeless trade, owing to theological influences that I understand timeless trade speculations helped exacerbate his vulnerability to.)
The theology and the acausal trade stuff are completely unrelated; they both have to do with decision theory, but that's it. I also don't think my thoughts about acausal trade differ in any substantial way from those of Wei Dai or Vladimir Nesov. So even assuming that I'm totally wrong for granting theism-like-ideas non-negligible probability, the discussion of acausal influence doesn't seem to have directly contributed to my brain getting eaten. That said, I agree with Eliezer that it's generally not worth speculating about, except possibly in the context of decision theory or, to a very limited extent, singularity strategy.
only decision theorists whose work is plugging into FAI theory need to think about timeless trade
But it's fun! Why should only a select group of people be allowed to have it?
So are mountain skiing, starting new companies, learning chemistry, and entering into relationships.
Although I am extremely interested in your theories, it would take significant time and energy for me to reformulate my ideas in such a way as to satisfactorily incorporate the points you are making. As such, for purposes of this discussion, I shall be essentially speaking as if I had not been made aware of the post which you just made.
However, if you could clarify a minor point: am I mistaken in my belief that it is the SI's position that uFAI will probably result in human extinction? Or, have they incorporated the points you are making into their theories?
I know that Anna at least has explicitly taken such considerations into account and agrees with them to some extent. Carl likely has as well. I don't know about Eliezer or Luke, I'll ask Luke next time I see him. ETA: That is, I know Carl and Anna have considered the points in my first paragraph, but I don't know how thoroughly they've explored the classes of scenarios like those in my second paragraph which are a lot more speculative.
Eliezer replied here, but it seems he's only addressed one part of my argument thus far. I personally think the alien superintelligence variation of my argument, which Eliezer didn't address, is the strongest, because it's well-grounded in known physical facts, unlike simulation-based speculation.
Third, I do tentatively hold that p(eAI | attempt towards FAI) > p(eAI | attempt towards AGI).
Clearly, this is possible. If an FAI team comes to think this is true during development, I hope they'll reconsider their plans. But can you provide, or link me to, some reasons for suspecting that p(eAI | attempt towards FAI) > p(eAI | attempt towards AGI)?
But can you provide, or link me to, some reasons for suspecting that p(eAI | attempt towards FAI) > p(eAI | attempt towards AGI)?
Some relevant posts/comments:
I tentatively hold, and believe it is the SI's position, that an uFAI is almost certain to produce human extinction.
When humans are a critical clue to the course of evolution on the planet? Surely they would repeatedly reconstruct and rerun history to gain clues about the forms of alien life that they might encounter - if they held basic universal instrumental values and didn't have too short a planning horizon.
Sadly, this seems right to me. The easiest way to build an eAI is to try to build an FAI and get the sign of something wrong.
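As a toy illustration of how small that "something" can be, here is a deliberately simplistic Python sketch. The value_of_world function is an invented stand-in for whatever a real system would actually score, so this shows nothing about real architectures; it only shows that a one-character sign error turns an optimizer for a quantity into an optimizer against it.

```python
# Toy illustration only: "value_of_world" is an invented stand-in for whatever
# scoring function a real system would use; the point is how little separates the two.

def value_of_world(world):
    return world["human_value"]  # pretend this correctly measures what we care about

def choose_action(actions, utility):
    """Pick the action whose predicted outcome scores highest under `utility`."""
    return max(actions, key=lambda a: utility(a["predicted_world"]))

friendly_utility   = lambda w:  value_of_world(w)  # intended
sign_error_utility = lambda w: -value_of_world(w)  # one character away: now minimizes human value

actions = [
    {"name": "protect", "predicted_world": {"human_value": 10}},
    {"name": "exploit", "predicted_world": {"human_value": -10}},
]

print(choose_action(actions, friendly_utility)["name"])    # protect
print(choose_action(actions, sign_error_utility)["name"])  # exploit
```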
Can you please add links to the other objections in each of these posts? Just to make the articles a little stickier.
I translate this general form of argument as, "Yes, you're right, but there's nothing you can do about it." Which is to say, the existential risk is so inevitable that any well-defined solution will be seen to have no chance of impacting the final result.
In which case, I borrow thoughts from Peter Thiel, who states that any world where humans no longer exist might as well be ignored as a possibility for investment, so one should invest in the avenues that might have a chance of continuing.
(Though you'd likely want to make a weaker form of Thiel's argument if possible, as it's not been convincingly demonstrated that a scenario with a non-Friendly superintelligence is necessarily or likely a scenario where humans no longer exist. As a special case, there are some who are especially worried about "hell worlds" - if pushing for Friendliness increases the probability of hell worlds, as has sometimes been argued, then it's not clear that you should discount such possible futures. More generally, I have a heuristic that this "discount and then renormalize" approach to strategizing is not a good one; in my personal experience it's proven a bad idea to assume, even provisionally, that there are scenarios that I can't affect.)
"Many acquire the serenity to accept what they cannot change, only to find the 'cannot change' is temporary and the serenity is permanent." — Steven Kaas
"Many acquire the serenity to accept what they cannot change, only to find the 'cannot change' is temporary and the serenity is permanent." — Steven Kaas
Would you mind sharing a concrete example?
The Forbidden Toy is the classic. Google scholar on "forbidden toy" provides more on the subject, with elaboration and alternate hypothesis testing and whatnot.
The Forbidden Toy is the classic. Google scholar on "forbidden toy" provides more on the subject, with elaboration and alternate hypothesis testing and whatnot.
Thanks.
I couldn't immediately remember the experience that led me to strongly believe it, but luckily the answer came to me in a dream. Turns out it's just personal stuff having to do with a past relationship that I cared a lot about. There are other concrete examples but they probably don't affect my decision calculus nearly as much in practice. (Fun fact: I learned many of my rationality skillz via a few years in high school dating a really depressed girl.)
And last, the rending pain of re-enactment
Of all that you have done, and been; the shame
Of motives late revealed, and the awareness
Of things ill done and done to others' harm
Which once you took for exercise of virtue.
Then fools' approval stings, and honour stains.
From wrong to wrong the exasperated spirit
Proceeds, unless restored by that refining fire
Where you must move in measure, like a dancer.
— T. S. Eliot, Little Gidding
It's "Even assuming you're right, there's nothing you can do about it. And that's not to say something couldn't be done about it from another approach." Holden made it clear that he doesn't think SI is right.
It comes down to placing the world's largest bet on a highly complex theory - with no experimentation to test the theory first.
Who ever said not to test? Testing intelligent machines may have its risks - but you can't not test! That would be a very silly thing to do.
I suppose there are things like this:
I intend to plunge into the decision theory of self-modifying decision systems and never look back. (And finish the decision theory and implement it and run the AI, at which point, if all goes well, we Win.)
...that don't explicitly mention testing. Perhaps Holden is extrapolating from such material. Anyway, not testing would obviously be very stupid. See if you can find anyone who actually advocates it.
This is a good argument, but it seems to assume that the first (F)AGI (in particular, a recursively self-improving one) is the direct product of human intelligence. I think a more realistic scenario is that any such AGI is the product of a number of generations of non-self-improving AIs -- machines that can be much better than humans at formal reasoning, finding proofs, and so on.
Does that avoid the risk of some runaway not-so-FAI? No, it doesn't - but it reduces the chance. And in the meantime, there are many, many advances that could be made with a bunch of AIs that could reach, say, IQ 300 (as a figure of speech -- we need another unit for AI intelligence), even if only in a subdomain such as math/physics.
I have been saying for years that I don't think provable Friendliness is possible, basically for the reasons given here. But I have kept thinking about it, and a relatively minor point that occurred to me is that a bungled attempt at Friendliness might be worse than none. Depending on how it was done, the AI could consider the attempt as a continuing threat.
What's your sense of how a bungled attempt at Friendliness compares to other things humans might do, in terms of how likely an AI would be to consider it a threat?
Fairly low. But that's because I don't think the first AIs are likely to be built by people trying to guarantee Friendliness. If a Friendly AI proponent tries to rush to get done before another team can finish, it could be a much bigger risk.
OK.
For my part, if I think about things people might do that might cause a powerful AI to feel threatened and thereby have significantly bad results, FAI theory and implementation not only doesn't float to the top of the list, it's hardly even visible in the hypothesis space (unless, as here, I privilege it inordinately by artificially priming it).
It's still not even clear to me that "friendliness" is a coherent concept. What is a human-friendly intelligence? Not "what is an unfriendly intelligence" - I'm asking what it is, not what it isn't. (I've asked this before, as have others.) Humans aren't, for example, or this wouldn't even be a problem. But SIAI needs a friendly intelligence that values human values.
Humans are most of the way to human-friendly. A human given absolute power might use it to accumulate wealth at the expense of others, or punish people that displease her in cruel ways, or even utterly annihilate large groups of people based on something silly like nationality or skin color. But a human wouldn't misunderstand human values. There is no chance the human would, if she decided to make everyone as happy as possible, kill everyone to use their atoms to tile the universe with pictures of smiley faces (to use a familiar example).
That is not at all clear to me.
I mean, sure, I agree with the example: a well-meaning human would not kill everyone to tile the universe with pictures of smiley faces. There's a reason that example is familiar; it was chosen by humans to illustrate something humans instinctively agree is the wrong answer, but a nonhuman optimizer might not.
But to generalize from this to the idea that humans wouldn't misunderstand human values, or that a well-meaning human granted superhuman optimization abilities won't inadvertently destroy the things humans value most, seems unjustified.
Well, there's the problem of getting the human to be sufficiently well-meaning, as opposed to using Earth as The Sims 2100 before moving on to bigger and better galaxies. But if Friendliness is a coherent concept to begin with, why wouldn't the well-meaning superhuman figure it out after spending some time thinking about it?
Edit: What I'm saying is that if the candidate Friendly AI is actually a superhuman, then we don't have to worry about Step 1 of friendliness: explaining the problem. Step 2 is convincing the superhuman to care about the problem, and I don't know how likely that is. And finally Step 3 is figuring out the solution, and assuming the human is sufficiently super that wouldn't be difficult (all this requires is intelligence, which is what we're giving the human to begin with).
Agreed that a sufficiently intelligent human would be no less capable of understanding human values, given data and time, than an equally intelligent nonhuman.
No-one is seriously worried that an AGI will misunderstand human values. The worry is that an AGI will understand human values perfectly well, and go on to optimize what it was built to optimize.
Right, so I'm still thinking about it from the "what it was built to optimize" step. You want to try to build the AGI to optimize for human values, right? So you do your best to explain to it what you mean by your human values. But then you fail at explaining and it starts optimizing something else instead.
But suppose the AGI is a super-intelligent human. Now you can just ask it to "optimize for human values" in those exact words (although you probably want to explain it a bit better, just to be on the safe side).
"non-human-harming" is still defining it as what it isn't, rather than what it is. I appreciate it's the result we're after, but it has no explanatory power as to what it is - as an answer, it's only a mysterious answer.
Morality has a long tradition of negative phrasing. "Thou shalt not" dates back to biblical times. Many laws are prohibitions. Bad deeds often get given more weight than good ones. That is just part of the nature of the beast - IMHO.
That's nice, but precisely fails to answer the issue I'm raising: what is a "friendly intelligence", in terms other than stating what it isn't? What answer makes the term less mysterious?
To paraphrase a radio conversation with one of SI's employees:
Humans are made of atoms which can be used for other purposes. Instead of building an AI which takes humans and uses their atoms to make cupcakes, we'd want an AI that takes humans and uses their atoms to make 'human value', which presumably we'd be fine with.
and then do find/replace on "human value" with Eliezer's standard paragraph:
Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions of various kinds, understanding, wisdom; beauty, harmony, proportion in objects contemplated; aesthetic experience; morally good dispositions or virtues; mutual affection, love, friendship, cooperation; just distribution of goods and evils; harmony and proportion in one's own life; power and experiences of achievement; self-expression; freedom; peace, security; adventure and novelty; and good reputation, honor, esteem, etc.
Not that I agree this is the proper definition, just one which I've pieced together from SI's public comments.
The obvious loophole in your paraphrase is that this accounts for the atoms the humans are made of, but not for other atoms the humans are interested in.
But yes, this is a bit closer to an answer not phrased as a negation.
Here is the podcast where the Skeptics' Guide to the Universe (SGU) interviews Michael Vassar (MV) on 23-Sep-2009. The interview begins at 26:10 and the transcript below is 45:50 to 50:11.
SGU: Let me back up a little bit. So we're talking about, how do we keep an artificially intelligent or recursively self-improving technology from essentially taking over the world and deciding that humanity is irrelevant or they would rather have a universe where we're not around or maybe where we're batteries or slaves or whatever. So one way that I think you've been focusing on so far is the "Laws of Robotics" approach. The Asimov approach.
MV: Err, no. Definitely not.
SGU: Well, in the broadest concept, in that you constrain the artificial intelligence in such a way...
MV: No. You never constrain, you never constrain a god.
SGU: But if you can't constrain it, then how can you keep it from deciding that we're irrelevant at some point?
MV: You don't need to constrain something that you're creating. If you create something, you get to designate all of its preferences, if you merely decide to do so.
SGU: Well, I think we're stumbling on semantics then. Because to constrain...
MV: No, we're not. We're completely not. We had a whole media campaign called "Three Laws Bad" back in 2005.
SGU: I wasn't, I didn't mean to specifically refer to the Three Laws, but to the overall concept of...
MV: No, constraint in the most general sense is suicide.
SGU: So I'm not sure I understand that. Essentially, we're saying we want the AI to be benign, to take a broad concept, and not malignant. Right? So we're trying to close down certain paths by which it might develop or improve itself to eliminate those paths that will lead to a malignant outcome.
MV: You don't need to close down anything. You don't need to eliminate anything. We're creating the AI. Everything about it, we get to specify, as its creators. This is not like a child or a human that has instincts and impulses. A machine is incredibly hard not to anthropomorphize here. There's really very little hope of managing it well if you don't. We are creating a system, and therefore we're designating every feature of the system. Creating it to want to destroy us and then constraining it so that it doesn't do so is a very, very bad way of doing things.
SGU: Well, that's not what I'm talking about. Let me further clarify, because we're talking about two different things. You're talking about creating it in a certain form, but I'm talking about, once it gets to the point where then it starts recreating itself, we have to constrain the way it might create and evolve itself so that it doesn't lead to something that wants to destroy us. Obviously, we're not going to create something that wants to destroy us and then keep it from doing so. We're going to create something that maybe its initial state may be benign, but since you're also talking about recursive self improvement, we have to also keep it from evolving into something malignant. That's what I mean by constraining it.
MV: If we're talking about a single AI, not an economy, or an ecosystem, if we're not talking about something that involves randomness, if we're not talking about something that is made from a human, changes in goals do not count as improvements. Changes in goals are necessarily accidents or compromises. But an unchecked, unconstrained AI that wants ice cream will never, however smart it becomes, decide that it wants chocolate candy instead.
SGU: But it could decide that the best way to make ice cream is out of human brains.
MV: Right. But it will only decide that the best way to make ice cream is out of human brains.
SGU: Right, that's what I'm talking about. So how do we keep it from deciding that it wants to make ice cream out of human brains? Which is kind of a silly analogy to arrive at, but...
MV: Well, no... uh... how do we do so? We... okay. The Singularity Institute's approach has always been that we have to make it want to create human value. And if it creates human value out of human brains, that's okay. But human value is not an easy thing for humans to talk about or describe. In fact, it's only going to be able to create human value, with all probability, by looking at human brains.
SGU: Ah, that's interesting. But do you mean it will value human life?
MV: No, I mean it will value whatever it is that humans value.
The original quote had: "human-benefiting" as well as "non-human-harming". You are asking for "human-benefiting" to be spelled out in more detail? Can't we just invoke the 'pornography' rule here?
Right, but surely they'd be the first to admit that the details about how to do that just aren't yet available. They do have their "moon-onna-stick" wishlist.
I made a similar point here. My conclusion: in theory, you can have a recursively self-improving tool without "agency", and this is possibly even easier to build than "agency". My design is definitely flawed, but it's a sketch of what a recursively self-improving tool would look like.
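To give a rough sense of the distinction (this is not the design linked above, just a schematic sketch, and helper names like propose_improvement and human_approves are invented for illustration): the "tool" loop below only ever outputs candidate rewrites for a human to accept or reject, and never applies a change or acts on the world by itself.

```python
# Schematic sketch only: the helper names below are invented placeholders,
# not part of any real system described in this thread.

def propose_improvement(program_source):
    """Stand-in for a search over candidate rewrites of the program itself."""
    return program_source + "\n# candidate optimization goes here"

def human_approves(candidate_source):
    """The tool never applies a change itself; a human gatekeeps every step."""
    print(candidate_source)
    return input("Apply this candidate? [y/N] ").strip().lower() == "y"

def tool_mode_self_improvement(program_source, rounds=3):
    for _ in range(rounds):
        candidate = propose_improvement(program_source)
        # Crucial difference from an "agent": the loop stops at a proposal.
        if human_approves(candidate):
            program_source = candidate
    return program_source
```

Whether such a loop stays "tool-like" once the proposal step is powerful enough is exactly the open question in this thread; the sketch only pins down what the claimed distinction is.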
The sheer length of GiveWell co-founder and co-executive director Holden Karnofsky's excellent critique of the Singularity Institute means that it's hard to keep track of the resulting discussion. I propose to break out each of his objections into a separate Discussion post so that each receives the attention it deserves.
Objection 1: it seems to me that any AGI that was set to maximize a "Friendly" utility function would be extraordinarily dangerous.
Suppose, for the sake of argument, that SI manages to create what it believes to be an FAI. Suppose that it is successful in the "AGI" part of its goal, i.e., it has successfully created an intelligence vastly superior to human intelligence and extraordinarily powerful from our perspective. Suppose that it has also done its best on the "Friendly" part of the goal: it has developed a formal argument for why its AGI's utility function will be Friendly, it believes this argument to be airtight, and it has had this argument checked over by 100 of the world's most intelligent and relevantly experienced people. Suppose that SI now activates its AGI, unleashing it to reshape the world as it sees fit. What will be the outcome?
I believe that the probability of an unfavorable outcome - by which I mean an outcome essentially equivalent to what a UFAI would bring about - exceeds 90% in such a scenario. I believe the goal of designing a "Friendly" utility function is likely to be beyond the abilities even of the best team of humans willing to design such a function. I do not have a tight argument for why I believe this, but a comment on LessWrong by Wei Dai gives a good illustration of the kind of thoughts I have on the matter:
I think this comment understates the risks, however. For example, when the comment says "the formalization of the notion of 'safety' used by the proof is wrong," it is not clear whether it means that the values the programmers have in mind are not correctly implemented by the formalization, or whether it means they are correctly implemented but are themselves catastrophic in a way that hasn't been anticipated. I would be highly concerned about both. There are other catastrophic possibilities as well; perhaps the utility function itself is well-specified and safe, but the AGI's model of the world is flawed (in particular, perhaps its prior or its process for matching observations to predictions are flawed) in a way that doesn't emerge until the AGI has made substantial changes to its environment.
By SI's own arguments, even a small error in any of these things would likely lead to catastrophe. And there are likely failure forms I haven't thought of. The overriding intuition here is that complex plans usually fail when unaccompanied by feedback loops. A scenario in which a set of people is ready to unleash an all-powerful being to maximize some parameter in the world, based solely on their initial confidence in their own extrapolations of the consequences of doing so, seems like a scenario that is overwhelmingly likely to result in a bad outcome. It comes down to placing the world's largest bet on a highly complex theory - with no experimentation to test the theory first.
So far, all I have argued is that the development of "Friendliness" theory can achieve at best only a limited reduction in the probability of an unfavorable outcome. However, as I argue in the next section, I believe there is at least one concept - the "tool-agent" distinction - that has more potential to reduce risks, and that SI appears to ignore this concept entirely. I believe that tools are safer than agents (even agents that make use of the best "Friendliness" theory that can reasonably be hoped for) and that SI encourages a focus on building agents, thus increasing risk.