Squark comments on Friendly AI ideas needed: how would you ban porn? - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
I don't think that's what the solution to FAI will look like. I think the solution to FAI will look like "Look, this is a human (or maybe an uploaded human brain), it is an agent, it has a utility function. You should be maximizing that."
The clearer we make as many concepts as we can, the more likely it is that "look, this is..." is going to work.
Well, I think the concept we have to make clear is "agent with given utility function". We don't need any human-specific concepts, and they're hopelessly complex anyway: let the FAI figure out the particulars on its own. Moreover, the concept of an "agent with given utility function" is something I believe I'm already relatively near to formalizing.
If the agent in question has a well-defined utility function, why is he deferring to the FAI to explain it to him?
Because he is bad at introspection and his only access to the utility function is through a noisy low-bandwidth sensor called "intuition".
Again, the more we can do ahead of time, the more likely it is that the FAI will figure these things out correctly.
Why do you think the FAI can figure these things out incorrectly, assuming we got "agent with given utility function" right? Maybe we can save it time by providing it with more initial knowledge. However, since the FAI has superhuman intelligence, it would probably take us much longer to generate that knowledge than it would take the FAI. I think that to generate an amount of knowledge which would be non-negligible from the FAI's point of view would take a timespan large with respect to the timescale on which UFAI risk becomes significant. Therefore in practice I don't think we can wait for it before building the FAI.
Because values are not physical facts, and cannot be deduced from mere knowledge.
I'm probably explaining myself poorly.
I'm suggesting that there should be a mathematical operator which takes a "digitized" representation of an agent, either in white-box form (e.g. uploaded human brain) or in black-box form (e.g. chatroom logs) and produces a utility function. There is nothing human-specific in the definition of the operator: it can as well be applied to e.g. another AI, an animal or an alien. It is the input we provide the operator that selects a human utility function.
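To make the shape of the proposed operator concrete, here is a toy sketch of its black-box form only, in the simplest imaginable setting: the "agent" is reduced to a log of observed pairwise choices, and a "utility function" to a consistent ranking. All names are illustrative; nothing here claims to handle an actual brain model.

```python
from itertools import permutations

def infer_utility(choices, options):
    """Toy black-box version of the proposed operator: given observed
    pairwise choices (pairs (a, b) meaning 'a was chosen over b'),
    return a utility function consistent with the data, if one exists.
    Brute force over rankings; only feasible for tiny option sets."""
    for ranking in permutations(options):
        rank = {o: -i for i, o in enumerate(ranking)}  # earlier in ranking = higher utility
        if all(rank[a] > rank[b] for a, b in choices):
            return lambda o, rank=rank: rank[o]
    return None  # cyclic preferences: no utility function rationalizes them

# Consistent preferences yield a utility function...
u = infer_utility([("cake", "pie"), ("pie", "kale")], ["cake", "pie", "kale"])
# ...but intransitive ones (the kind critics say humans exhibit) yield none.
v = infer_utility([("a", "b"), ("b", "c"), ("c", "a")], ["a", "b", "c"])
```

Note that the failure case (`None`) is exactly the objection raised elsewhere in this thread: if the agent's behavior is intransitive, there is nothing for the operator to return, and the interesting work is in deciding what it should do then.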
I don't understand how such an operator could work.
Suppose I give you a big messy data file that specifies neuron state and connectedness. And then I give you a big complicated finite-element simulator that can accurately predict what a brain would do, given some sensory input. How do you turn that into a utility function?
I understand what it means to use utility as a model of human preference. I don't understand what it means to say that a given person has a specific utility function. Can you explain exactly what the relationship is between a brain and this abstract utility function?
See the last paragraph in this comment.
I don't see how that addresses the problem. You're linking to a philosophical answer, and this is an engineering problem.
The claim you made, some posts ago, was "we can set an AI's goals by reference to a human's utility function." Many folks objected that humans don't really have utility functions. My objection was "we have no idea how to extract a utility function, even given complete data about a human's brain." Defining "utility function" isn't a solution. If you want to use "the utility function of a particular human" in building an AI, you need not only a definition, but a construction. To be convincing in this conversation, you would need to at least give some evidence that such a construction is possible.
You are trying to use, as a subcomponent, something we have no idea how to build and that seems possibly as hard as the original problem. And this isn't a good way to do engineering.
Humans don't follow anything like a utility function, which is a first problem, so you're asking the AI to construct something that isn't there. Then you have to knit this together into a humanity-wide utility function, which is very non-trivial (this is one feeble and problematic way of doing it: http://lesswrong.com/r/discussion/lw/8qb/cevinspired_models/).
The other problem is that you haven't actually solved many of the hard problems. Suppose the AI decides to kill everyone, then replay, in an endless loop, the one upload it has, having a marvellous experience. Why would it not do that? We want the AI to correctly balance our higher order preferences (not being reduced to a single mindless experience) with our lower order preferences (being happy). But that desire is itself a higher order preference - it won't happen unless the AI already decides that higher order preferences trump lower ones.
And that was one example I just thought of. It's not hard to come up with more cases of "the AI does something stupid in this model" (eg: replaces everyone with chatterbots that describe their ever increasing happiness and fulfilment) that are compatible with the original model but clearly stupid - clearly stupid to our own judgement, though, not to the AI's.
You may object that these problems won't happen - but you can't be confident of this, as you haven't defined your solution formally, and are relying on common sense to reject those pathological solutions. But nowhere have you assumed the AI has common sense, or how it will use it. The more details you put in your model, I think, the more the problems will become apparent.
Thank you for the thoughtful reply!
In the white-box approach it can't really hide. But I guess it's rather tangential to the discussion.
What do you mean by "follow a utility function"? Why do you think humans don't do it? If it isn't there, what does it mean to have a correct solution to the FAI problem?
The main problem with Yvain's thesis is in the paragraph:
What does Yvain mean by "give the robot human level intelligence"? If the robot's code remained the same, in what sense does it have human level intelligence?
This is the part of the CEV proposal which always seemed redundant to me. Why should we do it? If you're designing the AI, why wouldn't you use your own utility function? At worst, an average utility function of the group of AI designers? Why do we want / need the whole humanity there? Btw, I would obviously prefer my utility function in the AI but I'm perfectly willing to settle on e.g. Yudkowsky's.
It seems that you're identifying my proposal with something like "maximize pleasure". The latter is a notoriously bad idea, as was discussed endlessly. However, my proposal is completely different. The AI wouldn't do something the upload wouldn't do because such an action is opposed to the upload's utility function.
Actually, I'm not far from it (at least I don't think I'm further than CEV). Note that I have already defined formally I(A, U) where I=intelligence, A=agent, U=utility function. Now we can do something like "U(A) is defined to be U s.t. the probability that I(A, U) > I(R, U) for random agent R is maximal". Maybe it's more correct to use something like a thermal ensemble with I(A, U) playing the role of energy: I don't know, I don't claim to have solved it all already. I just think it's a good research direction.
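Spelled out, the definition being gestured at here would be something like the following (this is just one reading of the comment, with $\mu$ an assumed prior over random agents $R$ and $\beta$ an inverse-temperature parameter for the thermal variant):

```latex
U(A) \;:=\; \operatorname*{arg\,max}_{U}\; \Pr_{R \sim \mu}\bigl[\, I(A,U) > I(R,U) \,\bigr]
```

with the "thermal ensemble" alternative instead weighting candidate utility functions by $P(U \mid A) \propto e^{\beta\, I(A,U)}$, so that $I(A,U)$ plays the role of (negative) energy rather than picking out a single maximizer.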
Humans are neither independent nor transitive. Human preferences change over time, depending on arbitrary factors, including how choices are framed. Humans suffer because of things they cannot affect, and humans suffer because of details of their probability assessment (eg ambiguity aversion). That bears repeating - humans have preferences over their state of knowledge. The core of this is that "assessment of fact" and "values" are not disconnected in humans, not disconnected at all. Humans feel good when a team they support wins, without them contributing anything to the victory. They will accept false compliments, and can be flattered. Social pressure changes most values quite easily.
Need I go on?
A utility function which, if implemented by the AI, would result in a positive, fulfilling, worthwhile existence for humans. Even if humans had a utility, it's not clear that a ruling FAI should have the same one, incidentally. The utility is for the AI, and it aims to capture as much of human value as possible - it might just be the utility of a nanny AI (make reasonable efforts to keep humanity from developing dangerous AIs, going extinct, or regressing technologically, otherwise, let them be).
There are many such operators, and different ones give different answers when presented with the same agent. Only a human utility function distinguishes the right way of interpreting a human mind as having a utility function from all of the wrong ways of interpreting a human mind as having a utility function. So you need to get a bunch of Friendliness Theory right before you can bootstrap.
Why do you think there are many such operators? Do you believe the concept of "utility function of an agent" is ill-defined (assuming the "agent" is actually an intelligent agent rather than e.g. a rock)? Do you think it is possible to interpret a paperclip maximizer as having a utility function other than maximizing paperclips?
Deducing the correct utility of a utility maximiser is one thing (which has a low level of uncertainty, higher if the agent is hiding stuff). Assigning a utility to an agent that doesn't have one is quite another.
See http://lesswrong.com/lw/6ha/the_blueminimizing_robot/ Key quote: