Evidence for the orthogonality thesis

Stuart_Armstrong

One of the most annoying arguments when discussing AI is the perennial "But if the AI is so smart, why won't it figure out the right thing to do anyway?" It's often the ultimate curiosity stopper.

Nick Bostrom has defined the "Orthogonality thesis" as the principle that motivation and intelligence are essentially unrelated: superintelligences can have nearly any type of motivation (at least, nearly any utility-function-based motivation). We're trying to get some rigorous papers out so that when that question comes up, we can point people to standard, published arguments. Nick has had a paper accepted that points out that the orthogonality thesis is compatible with a lot of philosophical positions that would seem to contradict it.

I'm hoping to complement this with a paper laying out the positive arguments in favour of the thesis. So I'm asking you for your strongest arguments for (or against) the orthogonality thesis. Think of trying to convince a conservative philosopher who's caught a bad case of moral realism - what would you say to them?

Many thanks! Karma and acknowledgements will shower on the best suggestions, and many puppies will be happy.

(Note that I do not necessarily agree with what I wrote below. You asked for possible counter-arguments. So here goes.)

Might intelligence imply benevolence?

I believe that a fundamental requirement for any rational agent is the motivation to act maximally intelligently and correctly. That requirement seems even more obvious if we are talking about a conjectured artificial general intelligence (AGI) that is able to improve itself to the point where it is substantially better at most activities than humans. After all, if it did not want to be maximally correct, it would not become superhumanly intelligent in the first place.

Consider giving such an AGI a simple goal, e.g. the established goal of paperclip maximization. Is it really clear that human values are not implicit even in such a simplistic goal?

To pose an existential risk in the first place, an AGI would have to maximize paperclips in an unbounded way, eventually taking over the whole universe and converting all matter into paperclips. Given that no sane human would explicitly define such a goal, an AGI with the goal of maximizing paperclips would have to infer that unbounded maximization was implicitly intended. But would such an inference make sense, given its superhuman intelligence?

The question boils down to how an AGI would interpret any vagueness present in its goal architecture and how it would deal with the implied invisible.

Given that any rational agent, especially an AGI capable of recursive self-improvement, wants to act in the most intelligent and correct way possible, it seems reasonable that it would resolve any vagueness in favour of the interpretation that was most probably intended.

Would it be intelligent and correct to ignore human volition in the context of maximizing paperclips? Would it be less wrong to maximize paperclips in the most literal sense possible?

The argument uttered by advocates of friendly AI is that any AGI that isn't explicitly designed to be friendly won't be friendly. But I wonder how much sense this actually makes.

Every human craftsman who enters into an agreement is bound by a contract that includes a lot of implied conditions. Humans use their intelligence to fill the gaps. For example, if a human craftsman is told to decorate a house, they are not going to attempt to take over the neighbourhood to protect their work.

A human craftsman wouldn't do that, not because they share human values, but simply because it wouldn't be sensible to do so given the implicit frame of reference of their contract. The contract implicitly includes the volition of the person who told them to decorate the house. They might not even like the way they are supposed to do it. It would simply be stupid to do it any other way.

How would a superhuman AI not contemplate its own drives and interpret them given the right frame of reference, i.e. human volition? Why would a superhuman general intelligence misunderstand what is meant by "maximize paperclips", while any human intelligence will be better able to infer the correct interpretation?

I believe that a fundamental requirement for any rational agent is the motivation to act maximally intelligently and correctly. That requirement seems even more obvious if we are talking about a conjectured artificial general intelligence (AGI) that is able to improve itself to the point where it is substantially better at most activities than humans. Since if it wouldn't want to be maximally correct then it wouldn't become superhuman intelligent in the first place.

The standard counterargument is along the lines of: it won't care about getting things ri...

BlazeOrangeDeer (14y)

You are assuming that the AI needs something from us, which may not be true as it develops further. The decorator follows the implied wishes not because he is smart enough to know what they are, but because he wishes to act in his client's interest to gain payment, reputation, etc. Or he may believe that fulfilling his client's wishes is morally good according to his morality. The mere fact that the wishes of his client are known does not guarantee that he will carry them out unless he values the client in some way to begin with (for their money or maybe their happiness).