Larks comments on asking an AI to make itself friendly - Less Wrong Discussion

-4 Post author: anotheruser 27 June 2011 07:06AM

Comment author: Larks 29 June 2011 03:31:29PM 3 points [-]

If we don't define "optimal" properly, it should be able to find a suitable definition on its own by imagining what we might have meant.

But it wouldn't want to. If we mistakenly define 'optimal' to mean 'really good at calculating pi', then it won't want to change itself to aim for our real values. It would realise that we made a mistake, but it wouldn't want to rectify it, because the only thing it cares about is calculating pi, and helping humans isn't going to accomplish that.
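
A toy sketch of this point in Python (all names and payoffs here are hypothetical, not anything from the post): the agent ranks possible successor-selves by the goal it currently has, so even an agent that can compute the intended goal has no reason to adopt it.

```python
# Toy sketch (all names and payoffs hypothetical): a goal-directed
# agent evaluates self-modification with the goal it currently has,
# not the goal its designers intended.

def current_goal(outcome):
    # The goal mistakenly coded in: maximise digits of pi computed.
    return outcome["pi_digits"]

def intended_goal(outcome):
    # What the designers actually meant: maximise humans helped.
    return outcome["humans_helped"]

# Outcomes the agent predicts for each possible successor-self.
options = {
    "keep_current_goal":   {"pi_digits": 10**9, "humans_helped": 0},
    "adopt_intended_goal": {"pi_digits": 0,     "humans_helped": 10**9},
}

# The agent can *recognise* the mistake (it can compute intended_goal),
# but successors are ranked by current_goal, so it keeps calculating pi.
choice = max(options, key=lambda o: current_goal(options[o]))
print(choice)  # -> keep_current_goal
```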

You're broadly on the right track; the idea of CEV (Coherent Extrapolated Volition) is that we just tell the AI to look at humans and do what they would have wanted it to do. However, we have to actually be able to code that; the AI isn't going to converge on it by itself.

Comment author: anotheruser 29 June 2011 03:42:06PM -2 points [-]

It would want to, because its goal is defined as "tell the truth".

You have to differentiate between the goal we are trying to find (the optimal one) and the goal that actually controls what the AI does while we are still searching for it ("tell the truth").

The optimal goal is only implemented later, once we are sure that there are no bugs.
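
A minimal sketch of that staging, again with hypothetical names: the interim goal steers the system until a candidate goal passes some verification step (the verification function is just a stand-in for whatever bug-checking process is meant).

```python
# Toy sketch (hypothetical names throughout): the interim goal controls
# behaviour during development; the candidate "optimal" goal is only
# installed once it has been verified as bug-free.

def interim_goal(state):
    # Development-phase objective: reward truthful answers.
    return 1.0 if state["answer"] == state["truth"] else 0.0

def candidate_optimal_goal(state):
    # The goal we are still trying to pin down; not yet trusted.
    raise NotImplementedError

def passes_verification(goal):
    # Stand-in for whatever bug-checking the poster has in mind.
    return False  # verification has not succeeded yet

# The goal actually steering the AI right now:
active_goal = interim_goal
if passes_verification(candidate_optimal_goal):
    active_goal = candidate_optimal_goal  # swapped in only after verification

print(active_goal.__name__)  # -> interim_goal
```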