A stub on a point that's come up recently.

If I owned a paperclip factory, and casually told my foreman to improve efficiency while I'm away, and he planned a takeover of the country, aiming to devote its entire economy to paperclip manufacturing (apart from the armament factories he needed to invade neighbouring countries and steal their iron mines)... then I'd conclude that my foreman was an idiot (or being wilfully idiotic). He obviously had no idea what I meant. And if he misunderstood me so egregiously, he's certainly not a threat: he's unlikely to reason his way out of a paper bag, let alone to any position of power.

If I owned a paperclip factory, and casually programmed my superintelligent AI to improve efficiency while I'm away, and it planned a takeover of the country... then I can't conclude that the AI is an idiot. It is following its programming. Unlike a human that behaved the same way, it probably knows exactly what I meant to program in. It just doesn't care: it follows its programming, not its knowledge about what its programming is "meant" to be (unless we've successfully programmed in "do what I mean", which is basically the whole of the challenge). We can't therefore conclude that it's incompetent, unable to understand human reasoning, or likely to fail.

We can't reason by analogy with humans. When AIs behave like idiot savants with respect to their motivations, we can't deduce that they're idiots.
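To make this concrete, here's a toy sketch (the plans, numbers, and field names below are purely illustrative, assuming "improve efficiency" got coded as a bare numeric proxy). The optimiser maximises exactly the number it was handed; everything the owner meant but never wrote down simply doesn't enter the calculation.

```python
# Toy illustration (hypothetical, not anyone's real system): an optimiser
# handed a bare proxy objective picks whatever scores highest on that proxy,
# regardless of what the objective was "meant" to be.

candidate_plans = {
    "tune the existing production line":          {"paperclips_per_day": 1.1e6, "acceptable_to_owner": True},
    "run the factory 24/7, skip safety checks":   {"paperclips_per_day": 1.8e6, "acceptable_to_owner": False},
    "annex the neighbouring country's iron mines": {"paperclips_per_day": 9.9e9, "acceptable_to_owner": False},
}

def efficiency(plan):
    """The objective as actually programmed: a single number, nothing more."""
    return candidate_plans[plan]["paperclips_per_day"]

# The optimiser never consults "acceptable_to_owner" -- that field stands in
# for everything the owner meant but never wrote into the objective.
best_plan = max(candidate_plans, key=efficiency)
print(best_plan)  # -> "annex the neighbouring country's iron mines"
```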

133 comments

I feel like I've seen this post before...

5jsalvatier11y
I think Alexander Kruel has been writing about things like this for a while (but arguing in the opposite direction). Here's an example. I find his arguments unpersuasive so far, but steelmanning a little bit: you could argue that giving an AI any goal at all would basically entail making it grok humans, and that the jump from that to correctly holding human values would be short.
0Stuart_Armstrong11y
? Never written anything like this... Have others?
1ESRogs11y
Your post just seems to be introducing the concept of accidentally creating a super-powerful paperclip-maximizing AI, which is an idea that we've all been talking about for years. I can't tell what part is supposed to be new -- is it that this AI would actually be smart and not just an idiot savant? The ideas that AIs follow their programming, and that intelligence and values are orthogonal seem like pretty well-established concepts around here. And, in particular, a lot of our discussion about hypothetical Clippies has presupposed that they would understand humans well enough to engage in game theory scenarios with us. Am I missing something?
5Stuart_Armstrong11y
I've had an online conversation where it was argued that AI goals other than what was intended by the programmers would be evidence of a faulty AI - and hence that it wouldn't be a dangerous one. This post was a direct response to that.
2ESRogs11y
Ah, I see. Fair enough, I agree.
0kgalias11y
It's vaguely reminiscent of "a computer is only as stupid as its programmer" memes.

In particular, the AI might be able to succeed at this.

It seems to me possible that the AI might come up with even more 'insane' ideas that have even less apparent connection to what it was programmed to do.

Since my knowledge of AI is for practical purposes zilch: for large numbers of hypothetical future AIs, if perhaps not a full Friendly AI, wouldn't it be a simple solution to program the AI to model a specified human individual, determine said individual's desires, and implement them?

4Qiaochu_Yuan11y
Write the program.
-3Carinthium11y
I know no programming whatsoever; I'm asking because I figure that the problem of Friendly AI going way off-key has no comparable analogue in this case, because it involves different facts.
2Qiaochu_Yuan11y
Then what basis do you have for thinking that a particular programming task is simple?
-4Carinthium11y
A hypothetical AI programmed to run a paperclip factory, as compared to one designed to fulfil the role LessWrong grants Friendly AI, would:
- Not need any recursive intelligence enhancement, and probably not need upgrading
- Be able to discard massive numbers of functions regarding a lot of understanding of both humans and other matters
Fewer functions means less to program, which means less chance of glitches or other errors. Without intelligence enhancement the odds of an unfriendly outcome are greatly reduced. Therefore, the odds of a paperclip factory AI becoming a threat to humanity are far smaller than those of a Friendly AI.
2RolfAndreassen11y
That is not what is meant around here by "paperclip maximiser". A true clippy does not run a factory; it transmutes the total mass of the solar system into paperclips, starting with humans and ending with the computer it runs on. (With the possible exception of some rockets containing enough computing power to repeat the process on other suns.) That is what it means to maximise something.
1RolfAndreassen11y
Right. Which is why you just proposed a solution which is, in itself, AI-complete; you have not in fact reduced the problem. This aside, which of the desires of the human do you intend to fulfil? I desire chocolate, I also desire not to get fat. Solve for the equilibrium.
-4Carinthium11y
The desires implied in the orders given, interpreting desires by likely meaning. I didn't intend to reduce the problem in any way, but to make the point (albeit poorly, as it turned out) that the example used was far less of a risk than the much better example of an actual attempt at Friendly AI.
2RolfAndreassen11y
An entire Sequence exists precisely for the purpose of showing that "just write an AI that takes orders" is not sufficient as a solution to this problem. "Likely meaning" is not translatable into computer code at the present state of knowledge, and what's more, it wouldn't even be sufficient if it were. You've left out the implicit "likely intended constraints". If I say "get some chocolate", you understand that I mean "if possible, within the constraint of not using an immense amount of resources, provided no higher-priority project intervenes, without killing anyone or breaking any laws except ones that are contextually ok to break such as coming to a full, not rolling, stop at stop signs, and actually, if I'm on a diet maybe you ought to remind me of the fact and suggest a healthier snack, and even if I'm not on a diet but ought to be, then a gentle suggestion to this effect is appropriate in some but not all circumstances..." Getting all that implicit stuff into code is exactly the problem of Friendly AI. "Likely meaning" just doesn't cover it, and even so we can't even solve that problem.
0Carinthium11y
I thought it was clear that:
A. For Friendly AI, I meant modelling a human via a direct simulation of a human brain (or at least the relevant parts), idealised in such a way as to give the results we would want.
B. I DID NOT INTEND TO REDUCE THE PROBLEM.
1RolfAndreassen11y
A: What is the difference between this, and just asking the human brain in the first place? The whole point of the problem is that humans do not, actually, know what we want in full generality. You might as well implement a chess computer by putting a human inside it and asking, at every ply, "Do you think this looks like a winning position?" If you could solve the problem that way you wouldn't need an AI! B: Then what was the point of your post?
3DanielLC11y
Humans do not have explicit desires, and there's no clear way to figure out the implicit ones. Not that that's a bad idea. It's basically the best idea anyone's had. It's just a lot harder to do than you make it sound.
2Thomas11y
They call it CEV here. Not a single human, but many/all of them. Not what they want now, but what they would want, had they known better. I am skeptical that this could work.
2Carinthium11y
What I'm saying is a bit different from CEV: it would involve modelling only a single human's preferences, and would involve modelling their brain only in the short term (which would be a lot easier). Human beings have at least reasonable judgement with things such as, say, a paperclip factory, to the point where having a human call the shots would have no consequences that are too severe.
3Stuart_Armstrong11y
Specifying that kind of thing (including specifying preference) is probably almost as hard as getting the AI's motivations right in the first place. Though Paul Christiano had some suggestions along those lines, which (in my opinion) needed uploads (human minds instantiated in a computer) to have a hope of working...
-1kmgroove11y
Would a human be bound to "at least reasonable judgement" if given super intelligent ability?
-2Carinthium11y
We should remember that we aren't talking about true Friendly AI here, but AI in charge of lesser tasks such as, in the example, running a factory. There will be many things the AI doesn't know because it doesn't need to, including how to defend itself against being shut down (I see no logical reason why that would be necessary for running a paperclip factory). Combine that with the limits on intelligence necessary for such lesser tasks, and failure modes become far less likely.
-2ikrase11y
That's sort of similar to what I keep talking about with 'obedient AI'.
[anonymous]11y-20

it follows its programming, not its knowledge about what its programming is "meant" to be (unless we've successfully programmed in "do what I mean", which is basically the whole of the challenge).

Not necessarily. The instructions to a fully-reflective AI could be more along the lines of “learn what I mean, then do that” or “do what I asked within the constraints of my own unstated principles.” The AI would have an imperative to build a more accurate internal model of your psychology in order to predict the implicit constraints applie...

5Stuart_Armstrong11y
That's just another way of saying "do what I mean". And it doesn't give us the code to implement that. "Do what I asked within the constraints of my own unstated principles" is a hugely complicated set of instructions that only seems simple because it's written in English words.
0[anonymous]11y
I thought this was quite clear, but maybe not. Let's play taboo with the phrase “do what I mean”:
- “Do what I asked within the constraints of my own unstated principles.”
- “Bring about the end-goal I requested, without in the process taking actions that I would not approve of.”
- “Develop a predictive model of my psychology, and evaluate solutions to the stated task against that model. When a solution matches the goal but is rejected by the model, do not take that action until the conflict is resolved. Resolving the conflict will require either clarification of the task to exclude such possibilities (which can be done automatically if I have a high-confidence theory for why the task was not further specified), or updating the psychological model of my creators to match empirical reality.”
Do you see now how that is implementable?
EDIT: To be clear, for a variety of reasons I don't think it is a good idea to build a “do what I mean” AI, unless “do what I mean” is generalized to the reflective equilibrium of all of humanity. But that's the way the paperclip argument is posed.
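As a minimal sketch of that third formulation (the predicates passed in here are hypothetical placeholders; they are exactly the parts that remain unsolved):

```python
# A minimal sketch of the loop described above. The two predicates
# (goal_satisfied, model_approves) are hypothetical placeholders: they are
# precisely the parts nobody currently knows how to implement.

def solve(task, candidates, goal_satisfied, model_approves, clarify):
    """Return a solution that meets the stated task AND that the psychological
    model of the requester predicts they would approve of."""
    for solution in candidates(task):
        if not goal_satisfied(task, solution):
            continue                      # fails the stated goal outright
        if model_approves(solution):
            return solution               # goal met, predicted approval: act
        # Goal met but predicted disapproval: do NOT act yet. Either narrow
        # the task to exclude this class of solution, or update the model of
        # the requester against empirical reality, then keep searching.
        task = clarify(task, solution)
    return None                           # nothing acceptable: ask the requester
```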
2Stuart_Armstrong11y
No. Do you think that a human rule lawyer, someone built to manipulate rules and regulations, could not argue their way through this, sticking with all the technical requirements but getting completely different outcomes? I know I could. And if a human rule-lawyer could do it, that means that there exist ways of satisfying the formal criteria without doing what we want. Once we know these exist, the question is then: would the AI stumble preferentially on the solution we had in mind? Why would we expect it to do so when we haven't even been able to specify that solution?
0[anonymous]11y
The question isn't whether there is one solution, but whether the space of possible solutions is encompassed by acceptable morals. I would not “expect an AI to stumble preferentially on the solution we had in mind” because I am confused and do not know what the solution is, as are you and everyone else on LessWrong. However, that is a separate issue from whether we can specify what a solution would look like, such as a reflective-equilibrium solution to the coherent extrapolated volition of humankind. You can write an optimizer to search for a description of CEV without actually knowing what the result will be. It's like saying “I want to calculate pi to the billionth digit” and writing a program to do it, then arguing that we can't be sure the result is correct because we don't know ahead of time what the billionth digit of pi will be. Nonsense.
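To belabour the pi analogy: the specification of the computation is complete even though nobody knows its output in advance. A quick sketch (using Machin's formula; the digit count and guard digits are arbitrary choices):

```python
# You can specify "the first n digits of pi" exactly without knowing them.
# Machin's formula with integer arithmetic; 10 guard digits as a safety margin.

def pi_digits(n):
    """Return pi as a string with n digits after the decimal point."""
    scale = 10 ** (n + 10)                      # work with extra guard digits

    def arctan_inv(x):                          # arctan(1/x) * scale, Taylor series
        total, term, k = 0, scale // x, 0
        while term:
            total += term if k % 2 == 0 else -term
            k += 1
            term = scale // ((2 * k + 1) * x ** (2 * k + 1))
        return total

    pi_scaled = 4 * (4 * arctan_inv(5) - arctan_inv(239))   # Machin's formula
    digits = str(pi_scaled // 10 ** 10)         # drop the guard digits
    return digits[0] + "." + digits[1:n + 1]

print(pi_digits(20))   # -> 3.14159265358979323846
```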
0Stuart_Armstrong11y
Whether the space of possible solutions is contained in the space of moral outcomes.
0[anonymous]11y
Correct.