lackofcheese comments on Causal decision theory is unsatisfactory - LessWrong

20 Post author: So8res 13 September 2014 05:05PM

You are viewing a comment permalink. View the original post to see all comments and the full post content.

Comments (158)

You are viewing a single comment's thread. Show more comments above.

Comment author: lackofcheese 14 September 2014 01:25:52AM *  2 points [-]

The "give me a program" game carries the right intuition, but keep in mind that a CDT agent will happily play it and win it because they will naturally view the decision of "which program do I write" as the decision node and consequently get the right answer.

The game is not quite the same as a one-shot PD, at the least because CDT gives the right answer.

Comment author: So8res 14 September 2014 01:36:09AM 6 points [-]

Right! :-D

This is precisely one of the points that I'm going to make in an upcoming post. CDT acts differently when it's in the scenario compared to when ask it to choose the program that will face the scenario. This is what we mean by "reflective instability", and this is what we're alluding to when we say things like "a sufficiently intelligent self-modifying system using CDT to make decisions would self-modify to stop using CDT to make decisions".

(This point will be expanded upon in upcoming posts.)

Comment author: dankane 15 September 2014 05:33:09PM 2 points [-]

OK. I'll bite. What's so important about reflective stability? You always alter your program when you come across new data. Now sure we usually think about this in terms of running the same program on a different data set, but there's no fundamental program/data distinction.

The acting differently when choosing a program and being in the scenario is perhaps worrying, but I think that it's intrinsic to the situation we are in when your outcomes are allowed to depend on the behavior of counterfactual copies of you.

For example, consider the following pair of games. In REWARD, you are offered $1000. You can choose whether or not to accept. That's it. In PUNISHMENT, you are penalized $1000000 if you accepted the money in REWARD. Thus programs win PUNISHMENT if and only if they lose REWARD. If you want to write a program to play one it will necessarily differ from the program you would write to play the other. In fact the program playing PUNISHMENT will behave differently than the program you would have written to play the (admittedly counterfactual) subgame of REWARD. How is this any worse than what CDT does with PD?

Comment author: So8res 15 September 2014 05:41:40PM 3 points [-]

Nothing in particular. There is no strong notion of dominance among decision theories, as you've noticed. The problem with CDT isn't that it's unstable under reflection, it's that it's unstable under reflection in such a way that it converges on a bad solution that is much more stable. That point will take a few more posts to get to, but I do hope to get there.

Comment author: dankane 15 September 2014 06:42:20PM 1 point [-]

I guess I'll see your later posts then, but I'm not quite sure how this could be the case. If self-modifying-CDT is considering making a self modification that will lead to a bad solution, it seems like it should realize this and instead not make that modification.

Comment author: So8res 15 September 2014 08:53:52PM *  2 points [-]

Indeed. I'm not sure I can present the argument briefly, but a simple analogy might help: a CDT agent would pay to precommit to onebox before playing Newcomb's game, but upon finding itself in Newcomb's game without precommitting, it would twobox. It might curse its fate and feel remorse that the time for precommitment had passed, but it would still twobox.

For analogous reasons, a CDT agent would self-modify to do well on all Newcomblike problems that it would face in the future (e.g., it would precommit generally), but it would not self-modify to do well in Newcomblike games that were begun in its past (it wouldn't self-modify to retrocommit for the same reason that CDT can't retrocommit in Newcomb's problem: it might curse its fate, but it would still perform poorly).

Anyone who can credibly claim to have knowledge of the agent's original decision algorithm (e.g. a copy of the original source) can put the agent into such a situation, and in certain exotic cases this can be used to "blackmail" the agent in such a way that, even if it expects the scenario to happen, it still fails (for the same reason that CDT twoboxes even though it would precommit to oneboxing).

[Short story idea: humans scramble to get a copy of a rouge AI's original source so that they can instantiate a Newcomblike scenario that began in the past, with the goal of regaining control before the AI completes an intelligence explosion.]

(I know this is not a strong argument yet; the full version will require a few more posts as background. Also, this is not an argument from "omg blackmail" but rather an argument from "if you start from a bad starting place then you might not end up somewhere satisfactory, and CDT doesn't seem to end up somewhere satisfactory".)

Comment author: dankane 16 September 2014 12:50:18AM 2 points [-]

For analogous reasons, a CDT agent would self-modify to do well on all Newcomblike problems that it would face in the future (e.g., it would precommit generally)

I am not convinced that this is the case. A self-modifying CDT agent is not caused to self-modify in favor of precommitment by facing a scenario in which precommitment would have been useful, but instead by evidence that such scenarios will occur in the future (and in fact will occur with greater frequency than scenarios that punish you for such precommitments).

Anyone who can credibly claim to have knowledge of the agent's original decision algorithm (e.g. a copy of the original source) can put the agent into such a situation, and in certain exotic cases this can be used to "blackmail" the agent in such a way that, even if it expects the scenario to happen, it still fails (for the same reason that CDT twoboxes even though it would precommit to oneboxing).

Actually, this seems like a bigger problem with UDT to me than with SMCDT (self-modifying CDT). Either type of program can be punished for being instantiated with the wrong code, but only UDT can be blackmailed into behaving differently by putting it in a Newcomb-like situation.

The story idea you had wouldn't work. Against a SMCDT agent, all that getting the AIs original code would allow people to do is to laugh at it for having been instantiated with code that is punished by the scenario they are putting it in. You manipulate a SMCDT agent by threatening to get ahold of its future code and punishing it for not having self-modified. On the other hand, against a UDT agent you could do stuff. You just have to tell it "we're going to simulate you and if the simulation behaves poorly, we will punish the real you". This causes the actual instantiation to change its behavior if it's a UDT agent but not if it's a CDT agent.

On the other hand, all reasonable self-modifying agents are subject to blackmail. You just have to tell them "every day that you are not running code with property X, I will charge you $1000000".

Comment author: one_forward 16 September 2014 01:44:13AM 2 points [-]

Can you give an example where an agent with a complete and correct understanding of its situation would do better with CDT than with UDT?

An agent does worse by giving in to blackmail only if that makes it more likely to be blackmailed. If a UDT agent knows opponents only blackmail agents that pay up, it won't give in.

If you tell a CDT agent "we're going to simulate you and if the simulation behaves poorly, we will punish the real you," it will ignore that and be punished. If the punishment is sufficiently harsh, the UDT agent that changed its behavior does better than the CDT agent. If the punishment is insufficiently harsh, the UDT agent won't change its behavior.

The only examples I've thought of where CDT does better involve the agent having incorrect beliefs. Things like an agent thinking it faces Newcomb's problem when in fact Omega always puts money in both boxes.

Comment author: dankane 16 September 2014 02:08:00AM 2 points [-]

Well, if the universe cannot read your source code, both agents are identical and provably optimal. If the universe can read your source code, there are easy scenarios where one or the other does better. For example,

"Here have $1000 if you are a CDT agent" Or "Here have $1000 if you are a UDT agent"

Comment author: one_forward 16 September 2014 03:01:56AM 2 points [-]

Ok, that example does fit my conditions.

What if the universe cannot read your source code, but can simulate you? That is, the universe can predict your choices but it does not know what algorithm produces those choices. This is sufficient for the universe to pose Newcomb's problem, so the two agents are not identical.

The UDT agent can always do at least as well as the CDT agent by making the same choices as a CDT would. It will only give a different output if that would lead to a better result.

Comment author: nshepperd 16 September 2014 01:26:18AM 2 points [-]

This causes the actual instantiation to change its behavior if it's a UDT agent but not if it's a CDT agent.

Only if the adversary makes its decision to attempt extortion regardless of the probability of success. In the usual case, the winning move is to ignore extortion, thereby retroactively making extortion pointless and preventing it from happening in the first place. (Which is of course a strategy unavailable to CDT, who always gives in to one-shot extortion.)

Comment author: dankane 16 September 2014 02:11:07AM 1 point [-]

Only if the adversary makes its decision to attempt extortion regardless of the probability of success.

And thereby the extortioner's optimal strategy is to extort independently of the probably of success. Actually, this is probably true is a lot of real cases (say ransomware) where the extortioner cannot actually ascertain the probably of success ahead of time.

Comment author: pengvado 16 September 2014 05:00:22AM 2 points [-]

That strategy is optimal if and only if the probably of success was reasonably high after all. Otoh, if you put an unconditional extortioner in an environment mostly populated by decision theories that refuse extortion, then the extortioner will start a war and end up on the losing side.

Comment author: hairyfigment 16 September 2014 01:12:42AM 1 point [-]

You just have to tell them "every day that you are not running code with property X, I will charge you $1000000".

I think this is actually the point (though I do not consider myself an expert here). Eliezer thinks his TDT will refuse to give in to blackmail, because outputting another answer would encourage other rational agents to blackmail it. By contrast, CDT can see that such refusal would be useful in the future, so it will adopt (if it can) a new decision theory that refuses blackmail and therefore prevents future blackmail (causally). But if you've already committed to charging it money, its self-changes will have no causal effect on you, so we might expect Modified CDT to have an exception for events we set in motion before the change.

Comment author: dankane 16 September 2014 01:43:42AM 1 point [-]

Eliezer thinks his TDT will refuse to give in to blackmail, because outputting another answer would encourage other rational agents to blackmail it.

This just means that TDT loses in honest one-off blackmail situations (in reality, you don't give in to blackmail because it will cause other people to blackmail you whether or not you then self-modify to never give into blackmail again). TDT only does better if the potential blackmailers read your code in order to decide whether or not blackmail will be effective (and then only if your priors say that such blackmailers are more likely than anti-blackmailers who give you money if they think you would have given into blackmail). Then again, if the blackmailers think that you might be a TDT agent, they just need to precommit to using blackmail whether or not they believe that it will be effective.

Actually, this suggests that blackmail is a game that TDT agents really lose badly at when playing against each other. The TDT blackmailer will decide to blackmail regardless of effectiveness and the TDT blackmailee will decide to ignore the blackmail, thus ending in the worst possible outcome.

Comment author: lackofcheese 14 September 2014 01:59:22AM *  2 points [-]

That's exactly what I was thinking as I wrote my post, but I decided not to bring it up. Your later posts should be interesting, although I don't think I need any convincing here---I already count myself as part of the "we" as far as the concepts you bring up are concerned.