TimFreeman comments on Death Note, Anonymity, and Information Theory - Less Wrong

32 Post author: gwern 08 May 2011 03:44PM


Comment author: TimFreeman 09 May 2011 04:50:29PM 1 point [-]

The classic is Parfit's hitch-hiker, where an agent capable of accurately predicting the AI's actions offers to give it something if and only if the AI will perform some specific action in the future. A causal AI might be tempted to modify itself to desire that specific action, while a timeless AI will simply do the thing anyway without needing to self-modify.

That works if the AI knows that the other agent will keep its promise, and the other agent knows what the AI will do in the future. In particular the AI has to know the other agent is going to successfully anticipate what the AI will do in the future, even though the AI doesn't know itself. And the AI has to be able to infer all this from actual sensory experience, not by divine revelation. Hmm, I suppose that's possible.

Hmm, it's really easy to specify a causal AI, along the lines of AIXI but you can skip the arguments about it being near-optimal. Is there a similar simple spec of a timeless AI?

When I think through what the causal AI would do, it would be in a situation where it didn't know whether the actions it chooses are happening in the real world or inside the other agent's simulation of the AI, run when the other agent predicts what the AI would do. If it reasons correctly about this uncertainty, the causal AI might do the right thing anyway. I'll have to think about this. Thanks for the pointer.
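That line of reasoning can be made concrete with a toy calculation. Everything here is an illustrative assumption rather than part of the original discussion: the payoff numbers, the 50/50 prior over being the real agent versus the predictor's simulation, and the perfect-predictor setup.

```python
# Parfit's hitchhiker with illustrative payoffs (all values hypothetical):
# being rescued is worth 100, paying afterwards costs 10.
# A causal agent deciding whether to pay doesn't know if this decision
# is happening in reality or inside the predictor's simulation of it.

P_SIM = 0.5          # assumed prior: this decision is the simulated one
RESCUE = 100.0       # utility of being rescued from the desert
PAY_COST = -10.0     # utility cost of handing over the money

def expected_utility(pay: bool) -> float:
    """Expected utility of the decision, averaging over the two cases:
    the decision is real, or it is the simulation the predictor runs
    before deciding whether to rescue."""
    if pay:
        # Real case: already rescued, now pays.  Sim case: the predictor
        # sees "pay", so the real agent gets rescued and pays too.
        real = RESCUE + PAY_COST
        sim = RESCUE + PAY_COST
    else:
        # Real case: already rescued, refuses to pay.  Sim case: the
        # predictor sees "refuse" and leaves the agent in the desert.
        real = RESCUE
        sim = 0.0
    return (1 - P_SIM) * real + P_SIM * sim

print(expected_utility(True))   # 90.0
print(expected_utility(False))  # 50.0 -- paying wins under this uncertainty
```

Under this (assumed) uncertainty, even purely causal expected-utility maximization favours paying, which is the "might do the right thing anyway" intuition above.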

Roughly, the importance is that there are only two kinds of truly catastrophic mistakes an AI could make: mistakes which manage to wipe out the whole planet in one shot, and errors in modifying its own code. Everything else can be recovered from.

It could build and deploy an unfriendly AI completely different from itself.

Comment author: benelliott 09 May 2011 05:14:03PM *  2 points [-]

That works if the AI knows that the other agent will keep its promise, and the other agent knows what the AI will do in the future. In particular the AI has to know the other agent is going to successfully anticipate what the AI will do in the future, even though the AI doesn't know itself. And the AI has to be able to infer all this from actual sensory experience, not by divine revelation. Hmm, I suppose that's possible.

That's the thing about mathematical proofs: you need to conclusively rule out every possibility. When dealing with something like a super-intelligence there will be unforeseen circumstances, and nothing short of full mathematical rigour will save you.

Hmm, it's really easy to specify a causal AI, along the lines of AIXI but you can skip the arguments about it being near-optimal. Is there a similar simple spec of a timeless AI?

I don't know of one off-hand, but I think AIXI can easily be made timeless. Just modify the bit which says roughly "calculate a probability distribution over all possible outcomes for each possible action" and replace it with "calculate a probability distribution over all possible outcomes for each possible decision".

This may be worth looking into further; I haven't looked very deeply into the literature around AIXI.
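As a rough illustration of that one-line change, here is a toy Newcomb-style problem. This is nothing like a real AIXI implementation: the payoffs, the perfect-predictor assumption, and both evaluator functions are made up for the sketch.

```python
# Toy Newcomb's problem illustrating "outcomes per action" versus
# "outcomes per decision" (all payoffs and the predictor model are
# illustrative assumptions, not a real AIXI spec).

BOTH, ONE = "two-box", "one-box"

def payoff(choice: str, box_filled: bool) -> int:
    """Opaque box holds $1,000,000 if filled; transparent box holds $1,000."""
    opaque = 1_000_000 if box_filled else 0
    transparent = 1_000 if choice == BOTH else 0
    return opaque + transparent

def causal_choice(box_filled: bool) -> str:
    # "Outcomes for each possible action": the box state is already fixed
    # before acting, so whatever it is, two-boxing gains the extra $1,000.
    return max((BOTH, ONE), key=lambda c: payoff(c, box_filled))

def timeless_choice() -> str:
    # "Outcomes for each possible decision": a perfect predictor fills the
    # box iff the decision procedure outputs one-boxing, so the box state
    # covaries with the decision itself.
    return max((BOTH, ONE), key=lambda c: payoff(c, box_filled=(c == ONE)))

print(causal_choice(box_filled=True))   # two-box (either way)
print(timeless_choice())                # one-box
```

The only difference between the two evaluators is what the outcome distribution is conditioned on, which is exactly the substitution proposed above.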

When I think through what the causal AI would do, it would be in a situation where it didn't know whether the actions it chooses are in the real world or in the other agent's simulation of the AI when the other agent is predicting what the AI would do. If it reasons correctly about this uncertainty, the causal AI might do the right thing anyway. I'll have to think about this. Thanks for the pointer.

This looks like you might be stumbling towards Updateless Decision Theory, which is IMHO even stronger than TDT and may solve an even wider range of problems.

It could build and deploy an unfriendly AI completely different from itself.

I could come up with an argument for this falling into either category.

Comment author: TimFreeman 10 May 2011 03:44:00PM *  0 points [-]

Roughly, the importance is that there are only two kinds of truly catastrophic mistakes an AI could make: mistakes which manage to wipe out the whole planet in one shot, and errors in modifying its own code. Everything else can be recovered from.

It could build and deploy an unfriendly AI completely different from itself.

I could come up with an argument for this falling into either category.

I'm claiming that the concept of self-modification is useless since it's a special case of engineering. We have to get engineering right, and if we do that, we'll get self-modification right. I'm struggling to see how your statement bears on my claim. Perhaps you agree with me? Perhaps you're ignoring my claim? You don't seem to be arguing against it.

The scenario I proposed (creating a new UFAI from scratch) doesn't fit well into the second category (self-modification) because I didn't say the original AI goes away. After the misbegotten creation of the UFAI, you have two AIs: the original failed FAI and the new UFAI.

Actually, the second category (bad self-modification) seems to fit well into the first category (destroying the planet in one go), so these two categories don't support the idea that self-modification is a useful concept.

Comment author: benelliott 10 May 2011 03:52:00PM 1 point [-]

Okay, I think I see what you mean about engineering and self-modification, but I don't think it's particularly important. It appears you're thinking in terms of two concepts:

Self-modification: Anything the AI does to itself, for a fairly strict definition of 'itself', as in 'the same physical object' or something like that.

Engineering: Building any kind of machine.

However, I think that when most FAI researchers talk about 'self-modification' they mean something broader than your definition, which would include building another AI of roughly equal or greater power but would not include building a toaster.

Any mathematical conclusions drawn about self-modification should apply just as well to any possible method of doing so, and one such method is to construct another AI. Therefore constructing a UFAI falls into the category of 'self modification error' in the sense that it is the sort of thing TDT is designed to help prevent.

Comment author: TimFreeman 12 May 2011 06:21:36PM -1 points [-]

I think that when most FAI researchers talk about 'self-modification' they mean something broader than your definition, which would include building another AI of roughly equal or greater power but would not include building a toaster.

Sorry, I don't believe you. I've been paying attention to FAI people for some time and never heard "self-modification" used to include situations where the machine performing the "self-modification" does not modify itself. If someone actually took the initiative to define "self-modification" the way you say, I'd perceive them as being deliberately deceptive.

Comment author: benelliott 12 May 2011 06:55:59PM 1 point [-]

You're being overly literal.

I have seen SIAI-affiliated people on Less Wrong arguing that self-modification is impossible to prevent, by pointing out that even an injunction against rewriting its own source code would not stop an AI from building something else.

Self-modification as you describe it is a useless mathematical concept for Friendliness, as is engineering. Worse, it is not even well-defined: if an AI copies itself onto another computer and alters the copy, is that self-modification? If it modifies itself, but keeps a copy of its old code around, is that self-modification? Where do you draw the line between the two?

You are violating the principle of charity by assuming the interpretation that makes them look worse.

Mostly, when SIAI people talk about self-modification they imagine a machine that goes in and edits its own source code, because that is presumably the most efficient way to self-modify and the one that most AIs would use. This does not mean that 'build another AI' is not included; it just seems like a very stupid and inefficient way to go about things, so you are wasting your time by worrying too much about it.

I'll bet you £100 that whatever conclusions the SIAI eventually draws about self-modification will apply just as well to all kinds; I really cannot see how a silly distinction like the one you are making would find its way into a mathematical proof.

Comment author: TimFreeman 12 May 2011 08:47:08PM 0 points [-]

I'll bet you £100 that whatever conclusions the SIAI eventually draws about self-modification will apply just as well to all kinds; I really cannot see how a silly distinction like the one you are making would find its way into a mathematical proof.

We're certainly agreed on that. I'm willing to go further -- I believe any mathematical conclusions that apply to self-modification (your definition) will apply to all possible actions. I don't think your definition carves out a part of the world that has any usefully special properties.

Worse, [self-modification interpreted as requiring a modification to the entity taking action] is not even well-defined: if an AI copies itself onto another computer and alters the copy, is that self-modification? If it modifies itself, but keeps a copy of its old code around, is that self-modification? Where do you draw the line between the two?

Agreed.

I don't think your definition is well-defined either. Where's the important line between self-modification and making a toaster?

We appear to have no useful definition for the word. Time to stop using it, IMO.

Comment author: benelliott 12 May 2011 10:01:56PM 1 point [-]

We're certainly agreed on that. I'm willing to go further -- I believe any mathematical conclusions that apply to self-modification (your definition) will apply to all possible actions. I don't think your definition carves out a part of the world that has any usefully special properties.

I disagree. "An ideal CDT agent that anticipates facing only action-determined problems will always choose not to self modify" is true, while "An ideal CDT agent that anticipates facing only action-determined problems will always choose not to do anything" is false.

I don't think your definition is well-defined either. Where's the important line between self-modification and making a toaster?

I'm not a hundred percent clear on this, and I'll be the first to admit that this is a problem that needs to be fixed before the larger problem can be solved. From a very brief period of thought, it seems to me a good line to draw is the point at which the new agent becomes more powerful, in the sense of optimization power, than the old one.

We appear to have no useful definition for the word. Time to stop using it, IMO.

I think the word points to something, and I have a feeling that something is the heart of the problem. Interestingly, in terms of mathematical decision theory self-modification seems quite well defined.

Comment author: TimFreeman 12 May 2011 10:37:15PM 2 points [-]

After some heat, we're starting to get light. This is good.

"An ideal CDT agent that anticipates facing only action-determined problems will always choose not to self modify" is true "An ideal CDT agent that anticipates facing only action-determined problems will always choose not to do anything" is false.

I'm not sure that's true. Imagine I'm an ideal CDT agent. I am in North America. If I wish to react to something that happens in China, there will be some lag. If I could deal with the situation better when there is no lag, I would benefit from cloning myself and sending a copy to China. Would that be self-modification?

(This presupposes that I have access to materials sufficient to copy myself. That might not be true, depending on whether an ideal CDT agent is physically realizable.)

Comment author: benelliott 12 May 2011 11:12:13PM *  1 point [-]

I should probably have specified that building another agent doesn't really count as self-modification if the other agent is identical to the original (or maybe it does count, but in a very vacuous sense, the same way 'do nothing' is technically an algorithm). So if the other agent is CDT, this is not a counter-example.

If the other agent is a more primitive approximation to a CDT then I would view constructing it not as self-modification, but simply as making a choice in an action-determined problem.

If the other agent is TDT or UDT or something then this may count as self-modification, but there is no need to make it this way.

Suppose we use the rigorous definition where an action-determined problem is just a list of choices, each of which leads to a probability distribution across possible outcomes, each of which has a utility assigned to it. In this case I think it is clear that "An ideal CDT agent that anticipates facing only action-determined problems will always choose not to self modify" is true while "An ideal CDT agent that anticipates facing only action-determined problems will always choose not to do anything" is false.
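That definition translates almost directly into code. A minimal sketch, with made-up choices, probabilities, and utilities:

```python
# A direct encoding of the definition above: an action-determined problem
# maps each choice to a probability distribution over outcomes, and each
# outcome has a utility.  All example numbers are hypothetical.

from typing import Dict

# choice -> {outcome: probability}
Problem = Dict[str, Dict[str, float]]

problem: Problem = {
    "self-modify": {"works": 0.5, "breaks": 0.5},
    "do-nothing":  {"status-quo": 1.0},
}
utility = {"works": 10.0, "breaks": -100.0, "status-quo": 0.0}

def ideal_choice(problem: Problem, utility: Dict[str, float]) -> str:
    """Pick the choice with maximal expected utility.  Since the problem
    is fully described by this table, self-modification can't help: any
    modified agent's behaviour is just another row of choices."""
    def eu(choice: str) -> float:
        return sum(p * utility[o] for o, p in problem[choice].items())
    return max(problem, key=eu)

print(ideal_choice(problem, utility))  # do-nothing
```

With these (invented) numbers the expected utility of self-modifying is -45 versus 0 for doing nothing, so the ideal agent declines to self-modify, matching the first quoted claim, while it would happily act whenever some row had positive expected utility, contradicting the second.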