anotheruser comments on asking an AI to make itself friendly - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (30)
I have read the sequences (well, most of them). I can't find this as a standard proposal.
I think I didn't make clear what I wanted to say, so you defaulted to "he has no idea what he is talking about" (which is reasonable).
What I meant to say is that rather than defining the "optimal goal" of the AI based on what we can come up with ourselves, the problem can be delegated to the AI itself as a psychological problem.
I assume that an AI would possess some knowledge of human psychology, as that would be necessary for pretty much every practical application, like talking to it.
What then prevents us from telling the AI the following:
"We humans would like to become immortal and live in utopia (or however you want to phrase it; if the AI is smart, it will understand what you really mean through psychology). We disagree on the specifics and are afraid that something may go wrong. There are many contingencies to consider. Here is a list of contingencies we have come up with. Do you understand what we are trying to do? As you are much smarter than us, can you find anything that we have overlooked but that you expect us to agree with you on, once you point it out to us? Different humans have different opinions, and this factors into the problem, too. Can you propose a general solution to this problem that remains flexible in the face of an unpredictable future (transhumans may have different ethics)?"
In essence, it all boils down to asking the AI:
"If you were in our position, if you had our human goals and drives, how would you define your (the AI's) goals?"
If you have an agent that is vastly more intelligent than you are and that understands how your human mind works, couldn't you just delegate the task of finding a good goal for it to the AI itself, just like you can give it any other kind of task?
Welcome to Less Wrong!
In a sense, the Friendly AI problem is about delegating the definition of Friendliness to a superintelligence. The main issue is that it's easy to underestimate (on account of the Mind Projection Fallacy) how large a kernel of the correct answer it needs to start off with, in order for that delegation to work properly. There's rather a lot that goes into this, and unfortunately it's scattered over many posts that aren't collected in one sequence, but you can find much of it linked from Fake Fake Utility Functions (sic, and not a typo) and Value is Fragile.
That's extrapolated volition.
And it requires telling the AI "Implement good. Human brains contain evidence for good, but don't define it; don't modify human drives, that won't change good.". It requires telling it "Prove you don't get goal drift when you self-modify.". It requires giving it an explicit goal system for its infancy, telling it that it's allowed to use transistors despite the differences in temperature and gravity and electricity consumption that causes, but not to turn the galaxy into computronium - and writing the general rules for that, not the superficial cases I gave - and telling it how to progressively overwrite these goals with its true ones.
"Oracle AI" is a reasonable idea. Writing object-level goals into the AI would be bloody stupid, so we are going to do some derivation, and Oracle isn't much further than CEV. Bostrom defends it. But seriously, "don't influence reality beyond answering questions"?
You're assuming the friendliness problem has been solved. An evil AI could see the question as a perfect opportunity to hand down a solution that could spell our doom.
Why would the AI be evil?
Intentions don't develop on their own. "Evil" intentions could only arise from misinterpreting existing goals.
While you are asking it to come up with a solution, you have its goal set to what I said in the original post:
"the temporary goal to always answer questions truthfully as far as possible while admitting uncertainty"
Where would the evil intentions come from? At the moment you are asking the question, the only thing on the AI's mind is how it can answer truthfully.
The only loophole I can see is that it might realize it can reduce its own workload by killing everyone who is asking it questions, but that would be countered by the secondary goal "don't influence reality beyond answering questions".
Unless the programmers are unable to give the AI this extremely simple goal to just always speak the truth (as far as it knows), the AI won't have any hidden intentions.
And if the programmers working on the AI really are unable to implement this relatively simple goal, there is no hope that they would ever be able to implement the much more complex "optimal goal" they are trying to find out, anyway.
Bugs, maybe.
Intentions don't develop on their own. "Evil" intentions could only arise from misinterpreting existing goals.
Have you? Are you talking about a human-level AI? Asking or commanding a human to do something doesn't set that as their one and only goal. A human reacts according to their existing goals: they might comply, refuse, or subvert the command.
Why would it be easier to code in "be truthful" than "be friendly"?
That would have to be a really sophisticated bug, to misinterpret "always answer questions truthfully as far as possible while admitting uncertainty" as "kill all humans". I'd imagine that something as drastic as that could be found and corrected long before then. Consider that its goal is set to this: it knows no other motivation but to respond truthfully. It doesn't care about the survival of humanity, or of itself, or about how reality really is. All it cares about is answering the questions to the best of its abilities.
I don't think this goal would be all too hard to define either, as "the truth" is a pretty simple concept. As long as it deals with uncertainty in the right way (by admitting it), how could this be misinterpreted? Friendliness is far harder to define because we don't even know a definition for it ourselves. There are far too many things to consider when defining "friendliness".
Trivial failure case: the AI turns the universe into hardware to support really big computations, so it can be really sure it's got the right answer, and also calibrate itself really well on the uncertainty.