Followup/summary/extension to this conversation with SilasBarta

So, you're going along, cheerfully deciding things, doing counterfactual surgery on the output of decision algorithm A1 to calculate the results of your decisions, but it turns out that a dark secret is undermining your efforts...

You are not running/being decision algorithm A1, but instead decision algorithm A2, an algorithm that happens to have the property of believing (erroneously) that it actually is A1.

Ruh-roh.

Now, it is _NOT_ my intent here to try to solve the problem of "how can you know which one you really are?", but instead to deal with the problem of "how can TDT take into account this possibility?"

Well, first, let me suggest a slightly more concrete way in which this might come up:

Physical computation errors. For instance, a stray cosmic ray hits your processor and flips a bit in such a way that a certain conditional that would otherwise have gone down one branch instead goes down the other, so instead of computing the output of your usual algorithm in this circumstance, you're computing the output of the version that, at that specific step, behaves in that slightly different way. (Yes, this sort of thing can be mitigated with error correction/etc. The problem that is being addressed here is that (to me at least) it seems that basic TDT doesn't have a natural way to even represent this possibility.)

Consider a slightly modified causal net in which the innards of an agent are more of an "initial state", and there's a selector node/process (ie, the resulting computation) that selects which abstract algorithm's output is the one that actually gets produced. ie, this process determines which algorithm you, well, are.

Similarly, another being that might base its actions on a model of your behavior will be represented as having a model of your innards and the model itself having a selector, analogous to the above.

TDT with self ambiguity

To actually compute consequences of decisions and do all the relevant counterfactual surgery, ideally (ignoring "minor" issues like computability), one iterates over all possible algorithms one might be. That is, one first goes "if the actual result of the combination of my innards and all the messy details of reality and so on is to do computation A1, then..." and subiterates over all possible decisions, that second step being done via the usual counterfactual surgery.

Then one weighs all of those by the probability that one actually _is_ algorithm A1, and then goes "if I actually were algorithm A2..." etc etc, doing the same counterfactual surgery for each.

In the above diagram, that lets one consider the possibility of one's own choice being decoupled from what the model of that choice would predict: the initial model is correct, but while the agent is actually considering the decision, a hardware error or whatever causes the agent to be/implement A2 while the model of them is still properly implementing A1.
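
As a rough sketch of the above (all the names here are purely illustrative, and computability is being waved away just as before), the resulting weighted sum might look something like:

```python
# Illustrative sketch only: expected utility of a choice, summed over
# hypotheses about which algorithm one's decision actually corresponds to.
# prob_i_am[alg] is the credence that the messy physical process "selects"
# algorithm alg; utility_given(alg, choice) is the utility after the usual
# counterfactual surgery that sets alg's output (everywhere it occurs) to
# `choice`. Both are assumed to be supplied by the rest of the agent.

def expected_utility(choice, candidate_algorithms, prob_i_am, utility_given):
    return sum(prob_i_am[alg] * utility_given(alg, choice)
               for alg in candidate_algorithms)

def decide(choices, candidate_algorithms, prob_i_am, utility_given):
    # Pick the choice whose algorithm-weighted expected utility is highest.
    return max(choices,
               key=lambda c: expected_utility(c, candidate_algorithms,
                                              prob_i_am, utility_given))
```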

 

I am far from convinced that this is the best way to deal with this issue, but I haven't seen anyone else bringing it up, and the usual form of TDT that we've been describing didn't seem to have any obvious way to even represent this issue. So, if anyone has better ideas for how to clean up this solution, or alternative ideas for dealing with this problem, go ahead.

I just think it is important that it be dealt with _somehow_... That is, that the decision theory have some way of representing errors or other things that could cause ambiguity as to which algorithm it is actually implementing in the first place.

 

EDIT: Sorry, to clarify: one determines the utility for a possible choice by summing over the results of all the possible algorithms making that particular choice (ie, "I don't know if my decision corresponds to deciding the outcome of algorithm A1 or A2 or..."), so one sums over those for each choice, weighing each by the probability of that being the actual algorithm in question.

EDIT2: SilasBarta came up with a different causal graph during our discussion to represent this issue.


The problem that is being addressed here is that (to me at least) it seems that basic TDT doesn't have a natural way to even represent this possibility.

Is Silas's claim that TDT can represent this possibility the same (natural) way it represents every other possibility?

Yes, and I believe I showed as much in my comment where I referenced EY's previous post: you can compute, imperfectly, the way that your innards will affect your attempt to implement the algorithms you've selected (and this can mean self-interest, akrasia, corrupted hardware, etc.).

Good good. Just making sure I understand at least one of the positions correctly.

Well, you understood it at the time I made it. After reading Eliezer_Yudkowsky's deeper exposition of TDT with better understanding of Pearl causality, here's what I think:

Psy_Kosh is right that in any practical situation, you have to be aware of post-decision interference by your innards. However, I think EY's TDT causal network for the problem, as shown in AnnaSalamon's post, is a fair representation of Newcomb's problem as given. There, you can assume that there's no interference between you and your box choice because the problem definition allows it.

And with that interpretation, the TDT algorithm is quite brilliant.

Sorry for the extreme delay in getting around to replying. Anyways, yeah, I agree that TDT is nice and solves various things. I don't want to completely toss it out. My point was simply "I think it's very very important that we modify the original form of it to be able to deal with this issue. Here's what I think would be one way of doing so that fits with the same sort of principle that TDT is based on."

EDIT: Sorry, to clarify: one determines the utility for a possible choice by summing over the results of all the possible algorithms making that particular choice (ie, "I don't know if my decision corresponds to deciding the outcome of algorithm A1 or A2 or..."), so one sums over those for each choice, weighing each by the probability of that being the actual algorithm in question.

I'm afraid that edit confused me more than it clarified. I can think of plenty of games for which that sort of utility calculation may be necessary, but I can't see one made explicit in the post. The way to handle this kind of uncertainty varies depending on how the other agents (Omegas or the possibly uncorrupted clones, for example) are wired.

So, you're going along, cheerfully deciding things, doing counterfactual surgery on the output of decision algorithm A1 to calculate the results of your decisions, but it turns out that a dark secret is undermining your efforts...

You are not running/being decision algorithm A1, but instead decision algorithm A2, an algorithm that happens to have the property of believing (erroneously) that it actually is A1.

That your decisions are calculated according to A1 is equivalent to saying that you are running A1.

Ah, but they're actually calculated according to A2, which sometimes gives outputs different from A1 while believing (falsely) that those outputs are what A1 would decide.

Sorry if I was unclear there.

If you don't have access to your own program (otherwise you can't be wrong about whether it's A1 or not), how can you explicitly run it, in the manner of TDT?

Or is it a question of denotational equality between A0, which is the explicit code the agent implements, and A1, which is a different code, but which might give the same output as A0 (or not)?

You may have partial access to your own code, or you may have access to something that you think is your own code, but you're not certain of that.

In that case, you are not running TDT -- or what runs TDT isn't you. That's an explicit algorithm -- you can't have it arbitrarily confused.

Well, that's kind of my point. I'm taking raw TDT and modifying it a bit to take into account the possibility of uncertainty in what computation one actually is effectively running.

EDIT: More precisely, I consider it a serious deficiency of TDT that there didn't seem (to me) to be any simple 'natural' way to take that sort of uncertainty into account. I was thus proposing one way in which it might be done. I'm far from certain it's the Right Way, but I am rather more sure that it's an issue that needs to be dealt with somehow.

I guess I didn't give you enough credit. I had redrawn AnnaSalamon's graph to account for your concerns, and I still think it does (even more so than the one in your top-level post), but in doing so, I made it no longer (consistent with) TDT. There is a relevant difference between the algorithm you're actually running, and the algorithm you think you're running, and this uncertainty about the difference affects how you choose your algorithm.

What's more perplexing is, as I've pointed out, Eliezer_Yudkowsky seems to recognize this problem in other contexts, yet you also seem to be correct that TDT assumes it away. Interesting.

(Btw, could you put a link to my graph in your top-level post to make it easier to find?)

Sure. Do you want to give a specific algorithm summary for me to add with your graph to explain your version of how to solve this problem, or do you just want me to add a link to the graph? (added just the link for now, lemme know if there's either a specific comment you want me to also point out to explain your version or anything)

(if nothing else, incidentally, I'd suggest that even in your version, the platonic space of algorithms (set of all algorithms) ought to have a direct link to "actual thing that ends up being implemented")

Thanks for adding the link.

Do you want to give a specific algorithm summary for me to add ...

No need: The graph isn't intended to represent a solution algorithm, and I'm not yet proposing one. I just think it better represents what's going on in the problem than Eliezer_Yudkowsky's representation. Which, like you suggest, throws into doubt how well his solution works.

Still, the EY post I linked is an excellent example of how to handle the uncertainty that results from the effect of your innards on what you try to implement, in a way consistent with my graph.

I'd suggest that even in your version, the platonic space of algorithms (set of all algorithms) ought to have a direct link to "actual thing that ends up being implemented"

I'm having a hard time thinking about that issue. What does it mean for the PSoA (Platonic Space of Algorithms) to have a causal effect on something? Arguably, it points to all the nodes on the graph, because every physical process is constrained by this space, but what would that mean in terms of finding a solution? What might I miss if I fail to include a link from PSoA to something where it really does connect?

No problem.

As far as linking "space of algorithms" to "Actual thing effectively implemented", the idea is more to maintain one of the key ideas of TDT... ie, that what you're 'controlling' (for lack of a better word) is effectively the output of all instances of the algorithm you actually implement, right?

Yes, well, the goal should be to find an accurate representation of the situation. If the causal model implied by TDT doesn't fit, well, all the worse for TDT!

And incidentally, this idea:

that what you're 'controlling' (for lack of a better word) is effectively the output of all instances of the algorithm you actually implement

is very troubling to me in that it is effectively saying that my internal thoughts have causal power (or something indistinguishable therefrom) over other people: that my coming to one conclusion means other people must be coming to the same conclusion with enough frequency to matter for my predictions.

Yes, the fact that I have general thoughts, feelings, emotions, etc. is evidence that other people have them ... but it's very weak evidence, and more so for my decisions. Whatever the truth behind the TDT assumption, it's not likely to be applicable often.

"controlling" is the wrong word.

But... two instances of the exact same (deterministic) algorithm fed the exact same parameters will return the exact same output. So when choosing what the output ought to be, one ought to act as if one is determining the output of all the instances.

Perfectly true, and perfectly useless. You'll never find the exact same deterministic algorithm as the one you're implementing, especially when you consider the post-decision interference of innards. Any identical sub-algorithm will be lost in a sea of different sub-algorithms.

Sure, but then we can extend to talking about classes of related algorithms that produce related outputs, so that there would still be statistical dependence even if they're not absolutely identical in all cases. (And summing over 'which algorithm is being run' would be part of dealing with that.)

But statistical dependence does not imply large dependence, nor does it imply useful dependence. The variety of psychology among people means that my decision is weak evidence of others' decisions, even if it is evidence. It doesn't do much to alter my prior of how other observations have molded my expectations of people. (And this point applies doubly so in the case of autistic-spectrum people like me who are e.g. surprised at others' unwillingness to point out my easily correctable difficulties.)

Now, if we were talking about a situation confined to that one sub-algorithm, your point would still have validity. But the problem involves the interplay of other algorithms with even more uncertainty.

Plus, the inherent implausibility of the position (implied by TDT) that my decision to be more charitable must mean that other people just decided to become more charitable.

Well, one would actually take into account the degree of dependence when doing the relevant computation.

And your decision to be more charitable would correlate to others being so to the extent that they're using related methods to come to their own decision.

Well, one would actually take into account the degree of dependence when doing the relevant computation.

Yes, and here's what it would look like: I anticipate a 1/2 + e probability of the other person doing the same thing as me in the true PD. I'll use the payoff matrix of

|       | C     | D     |
| ----- | ----- | ----- |
| **C** | (3,3) | (0,5) |
| **D** | (5,0) | (1,1) |

where the first value is my utility. The expected payoff is then (after a little algebra):

If I cooperate: 3/2 + 3e; if I defect: 3 - 4e

Defection has a higher payoff as long as e is less than 3/14 (total probability of other person doing what I do = 10/14). So you should cooperate as long as you have over 0.137 bits of evidence that they will do what you do. Does the assumption that other people's algorithm has a minor resemblance to mine get me that?
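
(A quick numeric check of the algebra above, offered as a sketch; computing the "bits of evidence" as the relative entropy of a 10/14 prediction from a uniform 1/2 prior is one reading of that figure, assumed here.)

```python
from math import log2

# Sanity check of the true-PD numbers above; e is the edge over 1/2 in the
# probability that the other player does the same thing I do.
def expected_payoffs(e):
    p_same = 0.5 + e
    ev_cooperate = 3 * p_same + 0 * (1 - p_same)   # = 3/2 + 3e
    ev_defect = 1 * p_same + 5 * (1 - p_same)      # = 3 - 4e
    return ev_cooperate, ev_defect

e_threshold = 3 / 14                 # indifference point: 3/2 + 3e = 3 - 4e
p_same = 0.5 + e_threshold           # = 10/14

# Assumed reading of the "0.137 bits": relative entropy of a 10/14
# prediction relative to a uniform 1/2 prior.
bits = p_same * log2(p_same / 0.5) + (1 - p_same) * log2((1 - p_same) / 0.5)

print(expected_payoffs(e_threshold))  # both ~2.14 at the threshold
print(round(bits, 3))                 # ~0.137
```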

And your decision to be more charitable would correlate to others being so to the extent that they're using related methods to come to their own decision.

Yes, and that's the tough bullet to bite: me being more charitable, irrespective of the impact of my charitable action, causes (me to observe) other people being more charitable.

But if you are literally talking about the same computation, that computation must be unable to know which instance it is operating from. Once the question of getting identified with "other" instances is raised, the computations are different, and can lead to different outcomes, if these outcomes nontrivially depend on the different contexts. How is this progress compared to the case of two identical copies in PD that know their actions to be necessarily identical, and thus choosing between (C,C) and (D,D)?

The interesting part is making the sense of dependence between different decision processes precise.

In the discussion, I had proposed this as a causal net that captures all of your concerns, and I still don't see why it doesn't. Explanation

First of all, I will remind you that all nodes on a Bayesian causal net implicitly have a lone parent (disconnected from all other such parents) that represents uncertainty, which you explicitly represent in your model as "messy details of physical reality" and "more [i.e. independent] messy details of physical reality".

Similarly, another being that might base its actions on a model of your behavior will be represented as having a model of your innards and the model itself having a selector, analogous to the above.

To actually compute consequences of decisions and do all the relevant counterfactual surgery, ideally (ignoring "minor" issues like computability), one iterates over all possible algorithms one might be. ... lets one consider the possibility of ones own choice being decoupled from what the model of their choice would predict, given that the initial model is correct, but while they are actually considering the decision, a hardware error or whatever causes the agent to be/implement A2 while the model of them is instead properly implementing A1.

The model I made for you captures all of this, so I don't see why it's something TDT has any difficulty representing.

Omega knows your innards. Omega knows what algorithm you're trying to implement. Omega knows something about how hardware issues lead to what failure modes. So yes, there remains a chance Omega will guess wrong (under your restrictive assumptions about Omega), but this is fully represented by the model.

Also (still in my model), the agent, when computing a "would", looks at its choice as being what algorithm it will attempt to implement. It sees that there is room for the possibility of its intended algorithm not being the algorithm that actually gets implemented. It estimates what kinds of effects turn what intended algorithms into bad algorithms and therefore has reasons to pick algorithms that are unlikely to be turned into bad ones.

For a more concrete example of this kind of agent reasoning, refer back to what EY does in this post. He points out that we (including him) run on corrupted hardware ("innards" in my model). Therefore, the kind of corruption that his innards have, given his desired payoffs, justifies rejecting such target algorithms as "cheat when it will benefit the tribe on net", reasoning that that algorithm will likely degrade (via the causal effect of innards) into the actual algorithm of "cheat when it benefits me personally". To avoid this, he picks an algorithm harder to corrupt, like "for the good of the tribe, don't cheat, even if it benefits the tribe", which will, most likely, degrade into "don't cheat to benefit yourself at the expense of the tribe", something consistent with his values.

All of this is describable in TDT and represented by my model.

I think I may be misunderstanding your model, but, well, here's an example of where I think yours (ie, just using the built in error terms) would fail worse than mine:

Imagine that in addition to you, there're, say, a thousand systems that are somewhat explicitly dependent on algorithm A1 (or try to be) and another thousand that are explicitly dependent on A2 (or try to be), either through directly implementing, or modeling, or...

If you are A1, then your decision will be linked to the first group and less so to the second group... and if you are A2, then the other way around. Just using error terms would weaken all the couplings uniformly, without noticing that if one is A2, then although one is no longer coupled to the first group, one is coupled to the second.
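
Here's a made-up numeric sketch of the kind of thing I mean (the coordination payoff and the 1% figure are invented purely for illustration): treating the A1/A2 ambiguity as independent noise on each coupling loses the fact that being decoupled from the first group and being coupled to the second are the same event.

```python
# Made-up numbers: 1000 systems track A1's output, 1000 track A2's output,
# and suppose I get a payoff only if at least one whole group ends up acting
# the same way I do. My credence that I'm really implementing A1 is 0.99.
p_i_am_a1 = 0.99
n_group = 1000

# (a) Selector model (this post): condition on which algorithm I am first.
# If I'm A1, the whole first group moves with me; if I'm A2, the second does.
p_coordinate_selector = p_i_am_a1 * 1.0 + (1 - p_i_am_a1) * 1.0   # = 1.0

# (b) Independent per-coupling error terms: each of the 2000 couplings is
# independently intact with probability 0.99 or 0.01, so matching an entire
# group of 1000 becomes astronomically unlikely.
p_coordinate_error_terms = (p_i_am_a1 ** n_group
                            + (1 - p_i_am_a1) ** n_group)          # ~4e-5

print(p_coordinate_selector, p_coordinate_error_terms)
```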

Does that make sense?

And again, I know that error correction and so on can and should be used to ensure lower probability of "algorithm you're trying to implement not being what you actually are implementing", but right now I'm just focusing on "how can we represent that sort of situation?"

I may be misunderstanding your solution to the problem, though.

I'm going to wait for at least one person other than you or me to join this discussion before saying anything further, just as a "sanity check" and to see what kind of miscommunication might be going on.

Fair enough

I've followed along. But I've been hesitant to join on because it seemed to me that this question was being raised to a meta-level that it didn't necessarily deserve.

In the grandparent, for example, why can I not model my uncertainty about how the other agents will behave using the same general mechanism I use for everything else I'm uncertain about? It's not all that special, at least for these couple of examples. (Of course the more general question of failure detection and mitigation, completely independent of any explicitly dependent mind-reading demigods or clones, is another matter, but doesn't seem to be what the conversation is about...)

As for a sanity check, such as I can offer: The grandparent seems correct in stating that Silas's graph doesn't handle the problem described in the grandparent. Just because it is a slightly different problem. With the grandparent's problem it seems to be the agent's knowledge of likely hardware failure modes that is important rather than Omega's.

As for a sanity check, such as I can offer: The grandparent seems correct in stating that Silas's graph doesn't handle the problem described in the grandparent. Just because it is a slightly different problem. With the grandparent's problem it seems to be the agent's knowledge of likely hardware failure modes that is important rather than Omega's

Well, Psy-Kosh had been repeatedly bringing up that Omega has to account for how something might happen between me choosing an algorithm, and the algorithm I actually implement, because of cosmic rays and whatnot, so I thought that one was more important.

However, I think the "innards" node already contains one's knowledge about what kinds of things could go wrong. If I'm wrong, add that as a parent to the boxed node. The link is clipped when you compute the "would" anyway.

OOOOOOH! I think I see (part of, but not all) of the misunderstanding here. I wasn't talking about how Omega can take this into account, I was talking about how the agent Omega is playing games with would take this into account.

ie, not how Omega deals with the problem, but how I would.

Problems involving Omega probably aren't useful examples for demonstrating your problem either way since Omega will accurately predict our actions either way and our identity angst is irrelevant.

I'd like to see an instantiation of the type of problem you mentioned above, involving the many explicitly dependent systems. Something involving a box to pick or a bet to take. Right now the requirements of the model are not defined much beyond 'apply standard decision theory with included mechanism for handling uncertainty at such time as the problem becomes available'.

So? The graph still handles that.