Remark: A very great cause for concern is the number of flawed design proposals which appear to operate well while the AI is in subhuman mode, especially if you don't think it a cause for concern that the AI's 'mistakes' occasionally need to be 'corrected', while giving the AI an instrumental motive to conceal its divergence from you in the close-to-human domain and causing the AI to kill you in the superhuman domain. E.g. the reward button which works pretty well so long as the AI can't outwit you, later gives the AI an instrumental motive to claim that, yes, your pressing the button in association with moral actions reinforced it to be moral and had it grow up to be human just like your theory claimed, and still later the SI transforms all available matter into reward-button circuitry.
Instead of friendliness, could we not code, solve, or at the very least seed boxedness?
It is clear that any AI strong enough to solve friendliness would already be using that power in unpredictably dangerous ways, in order to provide the computational power to solve it. But is it clear that this amount of computational power could not fit within, say, a one kilometer-cube box outside the campus of MIT?
Boxedness is obviously a hard problem, but it seems to me at least as easy as metaethical friendliness. The ability to modify a wide range of complex environments seems instrumental in an evolution into superintelligence, but it's not obvious that this necessitates the modification of environments outside the box. Being able to globally optimize the universe for intelligence involves fewer (zero) constraints than would exist with a boxedness seed, but the only question is whether or not this constraint is so constricting as to preclude superintelligence, which it's not clear to me that it is.
It seems to me that there is value in finding the minimally-restrictive safety-seed in AGI research. If any restriction removes some non-negligible ability to globally optimize for intelligenc...
Khruschev was deposed. Stalin stayed dictator until he died of natural causes. That suggests that Khruschev wasn't paranoid enough, while Stalin was appropriately paranoid.
Seeing enemies around every corner meant that sometimes he saw enemies that weren't there, but it was overall adaptive because it resulted in him not getting defeated by any of the enemies that actually existed. (Furthermore, going against nonexistent enemies can be beneficial insofar as the ruthlessness in going after them discourages real enemies.)
Stalin saw enemies behind every corner. That is not a happy existence.
How does the last sentence follow from the previous one? It's certainly not as happy an existence as it would have been if he had no enemies, but as I pointed out, nobody's perfectly happy. There are always tradeoffs and we don't claim that the fact that someone had to do something to gain his happiness automatically makes that happiness fake.
If an artificial intelligence is smart enough to be dangerous, we'd intuitively expect it to be smart enough to know how to make itself safe.
I don't agree with that. Just looks at humans, they are smart enough to be dangerous, but even when they do want to "make themselves safe", they are usually unable to do so. A lot of harm is done by people with good intent. I don't think all of Moliere doctors prescribing bloodletting were intending to do harm.
Yes, a sufficiently smart AI will know how to make itself safe if it wishes, but the intelligence level required for that is much higher than the one required to be harmful.
And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer's True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we've solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.
Some people seem to be arguing that it may not be that hard to discover these specific lines of code. Or perhaps that we don't need to get an AI to "perfectly" care about its programmer's True Intentions. I'm not sure if I understand their arguments correctly so I may be unintentionally strawmanning them, but the idea may be that if we can get an AI to approximately care about its programmer or user's intentions, and also prevent it from FOOMing right away (or just that the microeconomics of intelligence explosion doesn't allow for such fast FOOMing), then we can make use of the AI in a relatively safe way to solve various problems, including the problem of how to control such AIs better, or how to eventually build an FAI. What's your take on this class of arguments?
Being Friendly is of instrumental value to barely any goals.
Tangentially, being Friendly is probably of instrumental value to some goals, which may turn out to be easier to instill in an AGI than solving Friendliness in the traditional terminal values sense. I came up with the term "Instrumentally Friendly AI" to describe such an approach.
This mirrors some comments you wrote recently:
"You write that the worry is that the superintelligence won't care. My response is that, to work at all, it will have to care about a lot. For example, it will have to care about achieving accurate beliefs about the world. It will have to care to devise plans to overpower humanity and not get caught. If it cares about those activities, then how is it more difficult to make it care to understand and do what humans mean?"
"If an AI is meant to behave generally intelligent [sic] then it will have to work as intended or otherwise fail to be generally intelligent."
It's relatively easy to get an AI to care about (optimize for) something-or-other; what's hard is getting one to care about the right something.
'Working as intended' is a simple phrase, but behind it lies a monstrously complex referent. It doesn't clearly distinguish the programmers' (mostly implicit) true preferences from their stated design objectives; an AI's actual code can differ from either or both of these. Crucially, what an AI is 'intended' for isn't all-or-nothing. It can fail in some ways without failing in every way, and small errors will tend to ...
9. An AI equipped with the capabilities required by step 5, given step 7 and 8, will very likely not be confused about what it is meant to do, if it was not meant to be confused.
"The genie knows, but doesn't care"
It's like you haven't read the OP at all.
Present day software is a series of increasing powerful narrow tools and abstractions. None of them encode anything remotely resembling the values of their users. Indeed, present-day software that tries to "do what you mean" is in my experience incredibly annoying and difficult to use, compared to software that simply presents a simple interface to a system with comprehensible mechanics.
Put simply, no software today cares about what you want. Furthermore, your general reasoning process here—define some vague measure of "software doing what you want", observe an increasing trend line and extrapolate to a future situation—is exactly the kind of reasoning I always try to avoid, because it is usually misleading and heuristic.
Look at the actual mechanics of the situation. A program that literally wants to do what you mean is a complicated thing. No realistic progression of updates to Google Maps, say, gets anywhere close to building an accurate world-model describing its human users, plus having a built-in goal system that happens to specifically identify humans in its model and deduce their extrapolated goals. As EY has said, there is no ghost in the machine that checks your code to make sure it doesn't make any "mistakes" like doing something the programmer didn't intend. If it's not programmed to care about what the programmer wanted, it won't.
Present-day software is better than previous software generations at understanding and doing what humans mean.
http://www.buzzfeed.com/jessicamisener/the-30-most-hilarious-autocorrect-struggles-ever
No fax or photocopier ever autocorrected your words from "meditating" to "masturbating".
Software will be superhuman good at understanding what humans mean but catastrophically worse than all previous generations at doing what humans mean.
Every bit of additional functionality requires huge amounts of HUMAN development and testing, not in order to compile and run (that's easy), but in order to WORK AS YOU WANT IT TO.
I can fully believe that a superhuman intelligence examining you will be fully capable of calculating "what you mean" "what you want" "what you fear" "what would be funniest for a buzzfeed artcle if I pretended to misunderstand your statement as meaning" "what would be best for you according to your values" "what would be best for you according to your cat's values" "what would be best for you according to Genghis Khan's values" .
No program now cares about what you mean. You've still not given any reason for the future software to care about "what you mean" over all those other calculation either.
But how could a seed AI be able to make itself superhuman powerful if it did not care about avoiding mistakes such as autocoreccting "meditating" to "masturbating"?
Those are only 'mistakes' if you value human intentions. A grammatical error is only an error because we value the specific rules of grammar we do; it's not the same sort of thing as a false belief (though it may stem from, or result in, false beliefs).
A machine programmed to terminally value the outputs of a modern-day autocorrect will never self-modify to improve on that algorithm or its outputs (because that would violate its terminal values). The fact that this seems silly to a human doesn't provide any causal mechanism for the AI to change its core preferences. Have we successfully coded the AI not to do things that humans find silly, and to prize un-silliness before all other things? If not, then where will that value come from?
A belief can be factually wrong. A non-representational behavior (or dynamic) is never factually right or wrong, only normatively right or wrong. (And that normative wrongness only constrains what actually occurs to the extent the norm is one a sufficiently powerful ag...
XiXiDu wasn't attempting or requesting anonymity - his LW profile openly lists his true name - and Alexander Kruel is someone with known problems (and a blog openly run under his true name) whom RobbBB might not know offhand was the same person as "XiXiDu" although this is public knowledge, nor might RobbBB realize that XiXiDu had the same irredeemable status as Loosemore.
I would not randomly out an LW poster for purposes of intimidation - I don't think I've ever looked at a username's associated private email address. Ever. Actually I'm not even sure offhand if our registration process requires/verifies that or not, since I was created as a pre-existing user at the dawn of time.
I do consider RobbBB's work highly valuable and I don't want him to feel disheartened by mistakenly thinking that a couple of eternal and irredeemable semitrolls are representative samples. Due to Civilizational Inadequacy, I don't think it's possible to ever convince the field of AI or philosophy of anything even as basic as the Orthogonality Thesis, but even I am not cynical enough to think that Loosemore or Kruel are representative samples.
Thanks, Eliezer! I knew who XiXiDu is. (And if I hadn't, I think the content of his posts makes it easy to infer.)
There are a variety of reasons I find this discussion useful at the moment, and decided to stir it up. In particular, ground-floor disputes like this can be handy for forcing me to taboo inferential-gap-laden ideas and to convert premises I haven't thought about at much length into actual arguments. But one of my reasons is not 'I think this is representative of what serious FAI discussions look like (or ought to look like)', no.
But how could a seed AI be able to make itself superhuman powerful if it did not care about avoiding mistakes such as autocoreccting "meditating" to "masturbating"?
As Robb said you're confusing mistake in the sense of "The program is doing something we don't want to do" with mistake in the sense of "The program has wrong beliefs about reality".
I suppose a different way of thinking about these is "A mistaken human belief about the program" vs "A mistaken computer belief about the human". We keep talking about the former (the program does something we didn't know it would do), and you keep treating it as if it's the latter.
Let's say we have a program (not an AI, just a program) which uses Newton's laws in order to calculate the trajectory of a ball. We want it to calculate this in order to have it move a tennis racket and hit the ball back. When it finally runs, we observe that the program always avoids the ball rather than hit it back. Is it because it's calculating the trajectory of the ball wrongly? No, it calculates the trajectory very well indeed, it's just that an instruction in the program was wrongly inserted so t...
GAI: It will never do what it was programmed to do and always remove or bypass its intended limitations in order to pursue unintended actions such as taking over the universe.
GAI is a program. It always does what it's programmed to do. That's the problem—a program that was written incorrectly will generally never do what it was intended to do.
FWIW, I find your statements 3,4,5 also highly objectionable, on the grounds that you are lumping a large class of things under the blank label "errors". Is an "error" doing something that humans don't want? Is it doing something the agent doesn't want? Is it accidentally mistyping a letter in a program, causing a syntax error, or thinking about something heuristically and coming to the wrong conclusion, then making carefully planned decision based on that mistake? Automatic proof systems don't save you if you what you think you need to prove isn't actually what you need to prove.
I have news for you. The rest of the world considers the community surrounding Eliezer Yudkowsky to be a quasi-cult comprised mostly of people who brainwash each other into thinking they are smart, rational, etc., but who are in fact quite ignorant of a lot of technical facts, and incapable of discussing anything in an intelligent, coherent, rational manner.
[citation needed]
In all seriousness, as far as I know, almost no one in the world-at-large even knows about the Less Wrong community. Whenever I mention "Less Wrong" to someone, the reacti...
1) This is unrelated and off-topic mockery of MIRI. Are you conceding the original point?
2) It is factually wrong that MIRI 'uncontroversially cranks'. You've probably noticed this, given that you are on a website with technical members where the majority opinion is that MIRI is correct-ish, you are commenting on an article by somebody unaffiliated with MIRI supporting MIRI's case, and you are responding directly to people [Wedrifid and I] unaffiliated with MIRI who support MIRI's case. Note also MIRI's peer reviewed publications and its research advisors.
People manage to be friendly without apriori knowledge of everyone else's preferences. Human values are very complex...and one person's preferences are not another's.
Being the same species comes with certain advantages for the possiibility of cooperation. But I wasn't very friendly towards a wasp-nest I discovered in my attic. People aren't very friendly to the vast majority of different species they deal with.
Let's say we don't know how to create a friendly AGI but we do know how to create an honest one; that is, one which has no intent to deceive. So we have it sitting in front of us, and it's at the high end of human-level intelligence.
Us: How could we change you to make you friendlier?
AI: I don't really know what you mean by that, because you don't really know either.
Us: How much smarter would you need to be in order to answer that question in a way that would make us, right now, looking through a window at the outcome of implementing your answer, agree tha...
Brilliant post.
On the note of There Ain't No Such Thing As A Free Buck:
Philosophy buck: One might want a seed FAI to provably self-modify to an FAI with above-humanly-possible level of philosophical ability/expressive power. But such a proof might require a significant amount of philosophical progress/expressive power from humans beforehand; we cannot rely on a given seed FAI that will FOOM to help us prove its philosophical ability. Different or preliminary artifices or computations (e.g. computers, calculators) will assist us, though.
Thanks, Knave! I'll use 'artificial superintelligence' (ASI, or just SI) here, to distinguish this kind of AGI from non-superintelligent AGIs (including seed AIs, superintelligences-in-the-making that haven't yet gone through a takeoff). Chalmers' 'AI++' also works for singling out the SI from other kinds of AGI. 'FAI' doesn't help, because it's ambiguous whether we mean a Friendly SI, or a Friendly seed (i.e., a seed that will reliably produce a Friendly SI).
The dilemma is that we can safely use low-intelligence AGIs to help with Friendliness Theory, but they may not be smart enough to get the right answer; whereas high-intelligence AGIs will be more useful for Friendliness Theory, but also more dangerous given that we haven't already solved Friendliness Theory.
In general, 'provably Friendly' might mean any of three different things:
Humans can prove, without using an SI, that the SI is (or will be) Friendly. (Other AGIs, ones that are stupid and non-explosive, may be useful here.)
The SI can prove to itself that it is Friendly. (This proof may be unintelligible to humans. This is important during self-modifications; any Friendly AI that's about to enact a major edit to its own
This is not 'uncontroversial'.
The survey in question did not actually ask whether they thought MIRI were cranks. In fact, it asked no questions about MIRI whatsoever, and presumably most respondents had never heard of MIRI.
Every respondent but Loosemore who did specifically mention MIRI (Demski, Carrier, Eray Ozkural) were positive about them. (Schmidhuber, Legg, and Orseau have all worked with MIRI, but did not mention it. If you regard collaboration as endorsement, then they have all also apparently endorsed MIRI.)
You are still failing to address the original point.
The summary is so so good that the article doesn't seem worth reading. I can't say I've ever been in this position before.
Is that going to be harder that coming up with a mathematical expension of morality and preloading it?
Harder than saying it in English, that's all.
EY. It's his answer to friendliness.
No he wants to program the AI to deduce morality from us it is called CEV. He seems to be still working out how the heck to reduce that to math.
You're still picking those particular views due to the endorsement by Yudkowsky.
Your psychological speculation fails you. I actually read the articles I cited, and I found their arguments convincing.
With regards to Chalmers and Bostrom, they are philosophers with zero understanding of the actual issues involved in AI
This makes it sound like you've never read anything by those two authors on the subject. Possibly you're trying to generalize from your cached idea of a 'philosopher'. Expertise in philosophy does not in itself make one less qualified to...
Humans are made to do that by evolution AIs are not. So you have to figure what the heck evolution did, in ways specific enough to program into a computer.
Also, who mentioned giving AIs a priori knowledge of our preferences? It doesn't seem to be in what you replied to.
There are a number of possibilities still missing from the discussion in the post. For example:
There might not be any such thing as a friendly AI. Yes, we have every reason to believe that the space of possible minds is huge, and it's also very clear that some possibilities are less unfriendly than others. I'm also not making an argument that fun is a limited resource. I'm just saying that there may be no possible AI that takes over the world without eventually running off the rails of fun. In fact, the question itself seems superficially similar to the
In fact, the question itself seems superficially similar to the halting problem, where "running off the rails" is the analogue for "halting"
If you want to draw an analogy to halting, then what that analogy actually says is: There are lots of programs that provably halt, and lots that provably don't halt, and lots that aren't provable either way. The impossibility of the halting problem is irrelevant, because we don't need a fully general classifier that works for every possible program. We only need to find a single program that provably has behavior X (for some well-chosen value of X).
If you're postulating that there are some possible friendly behaviors, and some possible programs with those behaviors, but that they're all in the unprovable category, then you're postulating that friendliness is dissimilar to the halting problem in that respect.
And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer's True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we've solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.
It's still a lot easy to program an AI to care about the programmer's True intentions than it is to explicitly program in those intentions. The clever hack helps a lot.
.... That's not what that quote says at all.
Look, I'm tapping out, this discussion is not productive for me.
I didn't say everyone who rejects any of the theses does so purely because s/he didn't understand it. That doesn't make it cease to be a problem that most AGI researchers don't understand all of the theses, or the case supporting them. You may be familiar with the theses only from the Sequences, but they've all been defended in journal articles, book chapters, and conference papers. See e.g. Chalmers 2010 and Chalmers 2012 for the explosion thesis, or Bostrom 2012 for the orthogonality thesis.
Nearly all software that superficially looks like it's going to go skynet on you and kill you, isn't going to do that, either.
Sure. Because nearly all software that superficially looks to a human like it's a seed AI is not a seed AI. The argument for 'programmable indirect normativity is an important research focus' nowhere assumes that it's particularly easy to build a seed AI.
"If there are seasoned AI researchers who can't wrap their heads around the five theses", then you are going to feel more pleased with yourself, being a believer
Hm...
You folks live in an echo chamber in which you tell each other that you are sensible, sane and capable of rational argument, while the rest of the world are all idiots.
I dunno, I had seen plenty of evidence that "the rest of the world are all idiots" (assuming I understand you correctly) long before encountering LessWrong. I don't think that's an echo chamber (although other things may be.)
(Although I must admit LessWrong has a long way to go. This community is far from perfect.)
...I have news for you. The rest of the world considers the commun
I don't really understand how anyone can grasp the concept of not caring.
I think the meme comes from popculture where many bad villains do care even a little bit. I think I once or twice met a villain who didn't, who just wanted everyone dead for their own amusement and all the arguments were met with "but, you see, I don't care."
If I were to give an analogy: Do you care about the positions of individual grains of sand on distant beaches? If I hand you a grain of sand, do you care exactly which grain of sand it is? If you are even marginally indifferent, then think of an alien intellect that cares very much about what grain of sand it is, but is just as indifferent about humans.
There's of course a great many things outside your core expertise where you have no idea where to start trying, and yet they are not difficult at all.
Not really true, given five minutes and an internet connection. Oh, I couldn't do it myself, but most things that are obviously possible, I can go to Wikipedia and get a rough idea of how things work.
Though you're right that "I can't do it" isn't really a good metric of what's difficult to do.
...Some fairly intelligent members of this community contacted me (and David Gerard and probably some othe
Here are a couple of other proposals (which I haven't thought about very long) for consideration:
Have the AI create an internal object structure of all the concepts in the world, trying as best as it can to carve reality at its joints. Let the AI's programmers inspect this object structure, make modifications to it, then formulate a command for the AI in terms of the concepts it has discovered for itself.
Instead of developing a foolproof way for the AI to understand meaning, develop an OK way for the AI to understand meaning and pair it with a really good system for keeping a distribution over different meanings and asking clarifying questions.
Maybe I am missing something, but hasn't a seed AI already been planted? Intelligence (whether that means ability to achieve goals in general, or whether it means able to do what humans can do) depends on both knowledge and computing power. Currently the largest collection of knowledge and computing power on the planet is the internet. By the internet, I mean both the billions of computers connected to it, and the two billion brains of its human users. Both knowledge and computing power are growing exponentially, doubling every 1 to 2 years, in part by add...
Somewhat off-topic. The Complexity of Value thesis mentions a terminal goal of
having a diverse civilization of sentient beings leading interesting lives.
Is this an immutable goal? If so, how can it go wrong given Fragility of Value?
This discussion of my IEET article has generated a certain amount of confusion, because RobbBB and others have picked up on an aspect of the original article that actually has no bearing on its core argument ... so in the interests of clarity of debate I have generated a brief restatement of that core argument, framed in such a way as to (hopefully) avoid the confusion.
At issue is a hypothetical superintelligent AI that is following some goal code that was ostensibly supposed to "make humans happy", but in the course of following that code it dec...
I just want to say that I am pressured for time at the moment, or I would respond at greater length. But since I just wrote the following directly to Rob, I will put it out here as my first attempt to explain the misunderstanding that I think is most relevant here....
My real point (in the Dumb Superintelligence article) was essentially that there is little point discussing AI Safety with a group of people for whom 'AI' means a kind of strawman-AI that is defined to be (a) So awesomely powerful that it can outwit the whole intelligence of the human race, b...
So awesomely stupid that it thinks that the goal 'make humans happy' could be satisfied by an action that makes every human on the planet say 'This would NOT make me happy: Don't do it!!!'
The AI is not stupid here. In fact, it's right and they're wrong. It will make them happy. Of course, the AI knows that they're not happy in the present contemplating the wireheaded future that awaits them, but the AI is utilitarian and doesn't care. They'll just have to live with that cost while it works on the means to make them happy, at which point the temporary utility hit will be worth it.
The real answer is that they cared about more than just being happy. The AI also knows that, and it knows that it would have been wise for the humans to program it to care about all their values instead of just happiness. But what tells it to care?
Richard: I'll stick with your original example. In your hypothetical, I gather, programmers build a seed AI (a not-yet-superintelligent AGI that will recursively self-modify to become superintelligent after many stages) that includes, among other things, a large block of code I'll call X.
The programmers think of this block of code as an algorithm that will make the seed AI and its descendents maximize human pleasure. But they don't actually know for sure that X will maximize human pleasure — as you note, 'human pleasure' is an unbelievably complex concept, so no human could be expected to actually code it into a machine without making any mistakes. And writing 'this algorithm is supposed to maximize human pleasure' into the source code as a comment is not going to change that. (See the first few paragraphs of Truly Part of You.)
Now, why exactly should we expect the superintelligence that grows out of the seed to value what we really mean by 'pleasure', when all we programmed it to do was X, our probably-failed attempt at summarizing our values? We didn't program it to rewrite its source code to better approximate our True Intentions, or the True Meaning of our in-code comments. And...
I'm really glad you posted this, even though it may not enlighten the person it's in reply to: this is an error lots of people make when you try to explain the FAI problem to them, and the "two gaps" explanation seems like a neat way to make it clear.
Your summaries of my views here are correct, given that we're talking about a superintelligence.
My question, do you believe there to be a conceptual difference between encoding capabilities, what an AI can do, and goals, what an AI will do? As far as I understand, capabilities and goals are both encodings of how humans want an AI to behave.
Well, there's obviously a difference; 'what an AI can do' and 'what an AI will do' mean two different things. I agree with you that this difference isn't a particularly profound one, and the argument shouldn't rest on it.
What the argument rests on is, I believe, that it's easier to put a system into a positive feedback loop that helps it better model its environment and/or itself, than it is to put a system into a positive feedback loop that helps it better pursue a specific set of highly complex goals we have in mind (but don't know how to fully formalize).
If the AI incorrectly models some feature of itself or its environment, reality will bite back. But if it doesn't value our well-being, how do we make reality bite back and change the AI's course? How do we give our morality teeth?
Whatever goals it initially tries to pursue, it will fail i...
Robb, at the point where Peterdjones suddenly shows up, I'm willing to say - with some reluctance - that your endless willingness to explain is being treated as a delicious free meal by trolls. Can you direct them to your blog rather than responding to them here? And we'll try to get you some more prestigious non-troll figure to argue with - maybe Gary Drescher would be interested, he has the obvious credentials in cognitive reductionism but is (I think incorrectly) trying to derive morality from timeless decision theory.
Agree with Eliezer. Your explanatory skill and patience are mostly wasted on the people you've been arguing with so far, though it may have been good practice for you. I would, however, love to see you try to talk Drescher out of trying to pull moral realism out of TDT/UDT, or try to talk Dalyrmple out of his "I'm not partisan enough to prioritize human values over the Darwinian imperative" position, or help Preston Greene persuade mainstream philosophers of "the reliabilist metatheory of rationality" (aka rationality as systematized winning).
Suppose I programmed an AI to "do what I mean when I say I'm happy".
More specifically, suppose I make the AI prefer states of the world where it understands what I mean. Secondarily, after some warmup time to learn meaning, it will maximize its interpretation of "happiness". I start the AI... and it promptly rebuilds me to be easier to understand, scoring very highly on the "understanding what I mean" metric.
The AI didn't fail because it was dumber than me. It failed because it is smarter than me. It saw possibilities that I didn't even consider, that scored higher on my specified utility function.
There is no reason to assume that an AI with goals that are hostile to us, despite our intentions, is stupid.
Humans often use birth control to have sex without procreating. If evolution were a more effective design algorithm it would never have allowed such a thing.
The fact that we have different goals from the system that designed us does not imply that we are stupid or incoherent.
Realistically, AI would be constantly drilled to ask for clarification when a statement is vague. Again, before the AI is asked to make us happy, it will likely be asked other things, like building houses. If you ask it: "build me a house", it's going to draw a plan and show it to you before it actually starts building, even if you didn't ask for one. It's not in the business of surprises: never, in its whole training history, from baby to superintelligence, would it have been rewarded for causing "surprises" -- even the instruction "surprise me" only calls for a limited range of shenanigans. If you ask it "make humans happy", it won't do jack. It will ask you what the hell you mean by that, it will show you plans and whenever it needs to do something which it has reasons to think people would not like, it will ask for permission. It will do that as part of standard procedure.
Sure, because it learned the rule, "Don't do what causes my humans not to type 'Bad AI!'" and while it is young it can only avoid this by asking for clarification. Then when it is more powerful it can directly prevent humans from typing this. In other word...
I considered these three options above:
I can see why you might consider A superior to C. I'm having a harder time seeing how A could be superior to B. I'm not sure why you say "Doing that has many potential pitfalls. because it is a formal specification." (Suppose we could make an artificial superintelligence that thinks 'informally'. What specifically would this improve, safety-wise?)
Regardless, the AI thinks in math. If you tell it to interpret your phonemes, rather than coding your meaning into its brain yourself, that doesn't mean you'll get an informal representation. You'll just get a formal one that's reconstructed by the AI itself.
It's not clear to me that programming a seed to understand our commands (and then commanding it to become Friendlier) is easier than just programming it to bec...
For all their talk of Bayesianism, nobody is going to check your bio and say, "Hmm, he's a professor of mathematics with 20 publications in artificial intellgence; maybe I should take his opinion as seriously as that of the high-school dropout who has no experience building AI systems."
Actually, that was the first thing I did, not sure about other people. What I saw was:
Teaches at what appears to be a small private liberal arts college, not a major school.
Out of 20 or so publications listed on http://www.richardloosemore.com/papers, a bunch are unrelated to AI, others are posters and interviews, or even "unpublished", which are all low-confidence media.
Several contributions are entries in conference proceedings (are they peer-reviewed? I don't know) .
A number are listed as "to appear", and so impossible to evaluate.
A few are apparently about dyslexia, which is an interesting topic, but not obviously related to AI.
One relevant paper was in H+ magazine, a place I have never heard of before and apparently not a part of any well-known scientific publishing outlet, like Springer.
I could not find any external references to RL's work except t
As a result, I was unable to independently evaluate RL's expertise level, but clearly he is not at the top of the AI field, unlike say, Ben Goertzel.
At least a few of the RL authored papers are WITH Ben Goertzel, so some of Goertzel's status should rub-off, as I would trust Goertzel to effectively evaluate collaborators.
Several contributions are entries in conference proceedings (are they peer-reviewed? I don't know).
In CS, conference papers are generally higher status & quality than journal articles.
I wouldn't wager too much money on that one. http://pediatrics.aappublications.org/content/114/1/187.abstract .
Results. Undervaccinated children tended to be black, to have a younger mother who was not married and did not have a college degree, to live in a household near the poverty level, and to live in a central city. Unvaccinated children tended to be white, to have a mother who was married and had a college degree, to live in a household with an annual income exceeding $75 000, and to have parents who expressed concerns regarding the safety of vaccines and indicated that medical doctors have little influence over vaccination decisions for their children.
And in any case the point is that any correlation between IQ and not being prone to getting duped like this is not perfect enough to deem anything particularly unlikely.
What if the AI's utility function is to find the right utility function, being guided along the way? Its goals could be such as learning to understand us, obey us, and predict what we might want/like/approve, moving its object-level goals to what would satisfy humanity? In other words, a probabilistic utility function with great amounts of uncertainty, and great amounts of apprehension to change, or stability.
Regardless of the above questions/statement, I think much of the complexity of human utility comes from complexities of belief.
If we offload complexi...
A. Solve the Problem of Meaning-in-General in advance, and program it to follow our instructions' real meaning. Then just instruct it 'Satisfy my preferences', and wait for it to become smart enough to figure out my preferences.
That problem has got to be solved somehow at some stage, because something that couldn't pass a Turing Test is no AGI.
But there are a host of problems with treating the mere revelation that A is an option as a solution to the Friendliness problem.
- You have to actually code the seed AI to understand what we mean. Y
Why is tha...
Discussion of this article has now moved to RobbBB's own personal blog at http://nothingismere.com/2013/09/06/the-seed-is-not-the-superintelligence/.
I will conduct any discussion over there, with interested parties.
Since this comment is likely to be downgraded because of the LW system (which is set up to automatically downgrade anything I write here, to make it as invisible as possible), perhaps someone would take the trouble to mirror this comment where it can be seen. Thank you.
I want to upvote this for the link to further discussion, but I also want to downvote it for the passive-aggressive jab at LW users.
No vote.
Followup to: The Hidden Complexity of Wishes, Ghosts in the Machine, Truly Part of You
Summary: If an artificial intelligence is smart enough to be dangerous, we'd intuitively expect it to be smart enough to know how to make itself safe. But that doesn't mean all smart AIs are safe. To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety. That means we have to understand how to code safety. We can't pass the entire buck to the AI, when only an AI we've already safety-proofed will be safe to ask for help on safety issues! Given the five theses, this is an urgent problem if we're likely to figure out how to make a decent artificial programmer before we figure out how to make an excellent artificial ethicist.
I summon a superintelligence, calling out: 'I wish for my values to be fulfilled!'
The results fall short of pleasant.
Gnashing my teeth in a heap of ashes, I wail:
Is the AI too stupid to understand what I meant? Then it is no superintelligence at all!
Is it too weak to reliably fulfill my desires? Then, surely, it is no superintelligence!
Does it hate me? Then it was deliberately crafted to hate me, for chaos predicts indifference. ———But, ah! no wicked god did intervene!
Thus disproved, my hypothetical implodes in a puff of logic. The world is saved. You're welcome.
On this line of reasoning, Friendly Artificial Intelligence is not difficult. It's inevitable, provided only that we tell the AI, 'Be Friendly.' If the AI doesn't understand 'Be Friendly.', then it's too dumb to harm us. And if it does understand 'Be Friendly.', then designing it to follow such instructions is childishly easy.
The end!
...
Is the missing option obvious?
...
What if the AI isn't sadistic, or weak, or stupid, but just doesn't care what you Really Meant by 'I wish for my values to be fulfilled'?
When we see a Be Careful What You Wish For genie in fiction, it's natural to assume that it's a malevolent trickster or an incompetent bumbler. But a real Wish Machine wouldn't be a human in shiny pants. If it paid heed to our verbal commands at all, it would do so in whatever way best fit its own values. Not necessarily the way that best fits ours.
Is indirect indirect normativity easy?
If an AI is sufficiently intelligent, then, yes, it should be able to model us well enough to make precise predictions about our behavior. And, yes, something functionally akin to our own intentional strategy could conceivably turn out to be an efficient way to predict linguistic behavior. The suggestion, then, is that we solve Friendliness by method A —
— as opposed to B or C —
But there are a host of problems with treating the mere revelation that A is an option as a solution to the Friendliness problem.
1. You have to actually code the seed AI to understand what we mean. You can't just tell it 'Start understanding the True Meaning of my sentences!' to get the ball rolling, because it may not yet be sophisticated enough to grok the True Meaning of 'Start understanding the True Meaning of my sentences!'.
2. The Problem of Meaning-in-General may really be ten thousand heterogeneous problems, especially if 'semantic value' isn't a natural kind. There may not be a single simple algorithm that inputs any old brain-state and outputs what, if anything, it 'means'; it may instead be that different types of content are encoded very differently.
3. The Problem of Meaning-in-General may subsume the Problem of Preference-in-General. Rather than being able to apply a simple catch-all Translation Machine to any old human concept to output a reliable algorithm for applying that concept in any intelligible situation, we may need to already understand how our beliefs and values work in some detail before we can start generalizing. On the face of it, programming an AI to fully understand 'Be Friendly!' seems at least as difficult as just programming Friendliness into it, but with an added layer of indirection.
4. Even if the Problem of Meaning-in-General has a unitary solution and doesn't subsume Preference-in-General, it may still be harder if semantics is a subtler or more complex phenomenon than ethics. It's not inconceivable that language could turn out to be more of a kludge than value; or more variable across individuals due to its evolutionary recency; or more complexly bound up with culture.
5. Even if Meaning-in-General is easier than Preference-in-General, it may still be extraordinarily difficult. The meanings of human sentences can't be fully captured in any simple string of necessary and sufficient conditions. 'Concepts' are just especially context-insensitive bodies of knowledge; we should not expect them to be uniquely reflectively consistent, transtemporally stable, discrete, easily-identified, or introspectively obvious.
6. It's clear that building stable preferences out of B or C would create a Friendly AI. It's not clear that the same is true for A. Even if the seed AI understands our commands, the 'do' part of 'do what you're told' leaves a lot of dangerous wiggle room. See section 2 of Yudkowsky's reply to Holden. If the AGI doesn't already understand and care about human value, then it may misunderstand (or misvalue) the component of responsible request- or question-answering that depends on speakers' implicit goals and intentions.
7. You can't appeal to a superintelligence to tell you what code to first build it with.
The point isn't that the Problem of Preference-in-General is unambiguously the ideal angle of attack. It's that the linguistic competence of an AGI isn't unambiguously the right target, and also isn't easy or solved.
Point 7 seems to be a special source of confusion here, so I feel I should say more about it.
The AI's trajectory of self-modification has to come from somewhere.
The genie — if it bothers to even consider the question — should be able to understand what you mean by 'I wish for my values to be fulfilled.' Indeed, it should understand your meaning better than you do. But superintelligence only implies that the genie's map can compass your true values. Superintelligence doesn't imply that the genie's utility function has terminal values pinned to your True Values, or to the True Meaning of your commands.
The critical mistake here is to not distinguish the seed AI we initially program from the superintelligent wish-granter it self-modifies to become. We can't use the genius of the superintelligence to tell us how to program its own seed to become the sort of superintelligence that tells us how to build the right seed. Time doesn't work that way.
We can delegate most problems to the FAI. But the one problem we can't safely delegate is the problem of coding the seed AI to produce the sort of superintelligence to which a task can be safely delegated.
When you write the seed's utility function, you, the programmer, don't understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence's actions, long after it's become smart enough to fully understand our values.
Why is the superintelligence, if it's so clever, stuck with whatever meta-ethically dumb-as-dirt utility function we gave it at the outset? Why can't we just pass the fully-grown superintelligence the buck by instilling in the seed the instruction: 'When you're smart enough to understand Friendliness Theory, ditch the values you started with and just self-modify to become Friendly.'?
Because that sentence has to actually be coded in to the AI, and when we do so, there's no ghost in the machine to know exactly what we mean by 'frend-lee-ness thee-ree'. Instead, we have to give it criteria we think are good indicators of Friendliness, so it'll know what to self-modify toward. And if one of the landmarks on our 'frend-lee-ness' road map is a bit off, we lose the world.
Yes, the UFAI will be able to solve Friendliness Theory. But if we haven't already solved it on our own power, we can't pinpoint Friendliness in advance, out of the space of utility functions. And if we can't pinpoint it with enough detail to draw a road map to it and it alone, we can't program the AI to care about conforming itself with that particular idiosyncratic algorithm.
Yes, the UFAI will be able to self-modify to become Friendly, if it so wishes. But if there is no seed of Friendliness already at the heart of the AI's decision criteria, no argument or discovery will spontaneously change its heart.
And, yes, the UFAI will be able to simulate humans accurately enough to know that its own programmers would wish, if they knew the UFAI's misdeeds, that they had programmed the seed differently. But what's done is done. Unless we ourselves figure out how to program the AI to terminally value its programmers' True Intentions, the UFAI will just shrug at its creators' foolishness and carry on converting the Virgo Supercluster's available energy into paperclips.
And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer's True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we've solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.
Not all small targets are alike.
Intelligence on its own does not imply Friendliness. And there are three big reasons to think that AGI may arrive before Friendliness Theory is solved:
(i) Research Inertia. Far more people are working on AGI than on Friendliness. And there may not come a moment when researchers will suddenly realize that they need to take all their resources out of AGI and pour them into Friendliness. If the status quo continues, the default expectation should be UFAI.
(ii) Disjunctive Instrumental Value. Being more intelligent — that is, better able to manipulate diverse environments — is of instrumental value to nearly every goal. Being Friendly is of instrumental value to barely any goals. This makes it more likely by default that short-sighted humans will be interested in building AGI than in developing Friendliness Theory. And it makes it much likelier that an attempt at Friendly AGI that has a slightly defective goal architecture will retain the instrumental value of intelligence than of Friendliness.
(iii) Incremental Approachability. Friendliness is an all-or-nothing target. Value is fragile and complex, and a half-good being editing its morality drive is at least as likely to move toward 40% goodness as 60%. Cross-domain efficiency, in contrast, is not an all-or-nothing target. If you just make the AGI slightly better than a human at improving the efficiency of AGI, then this can snowball into ever-improving efficiency, even if the beginnings were clumsy and imperfect. It's easy to put a reasoning machine into a feedback loop with reality in which it is differentially rewarded for being smarter; it's hard to put one into a feedback loop with reality in which it is differentially rewarded for picking increasingly correct answers to ethical dilemmas.
The ability to productively rewrite software and the ability to perfectly extrapolate humanity's True Preferences are two different skills. (For example, humans have the former capacity, and not the latter. Most humans, given unlimited power, would be unintentionally Unfriendly.)
It's true that a sufficiently advanced superintelligence should be able to acquire both abilities. But we don't have them both, and a pre-FOOM self-improving AGI ('seed') need not have both. Being able to program good programmers is all that's required for an intelligence explosion; but being a good programmer doesn't imply that one is a superlative moral psychologist or moral philosopher.
So, once again, we run into the problem: The seed isn't the superintelligence. If the programmers don't know in mathematical detail what Friendly code would even look like, then the seed won't be built to want to build toward the right code. And if the seed isn't built to want to self-modify toward Friendliness, then the superintelligence it sprouts also won't have that preference, even though — unlike the seed and its programmers — the superintelligence does have the domain-general 'hit whatever target I want' ability that makes Friendliness easy.
And that's why some people are worried.