IlyaShpitser comments on Evidential Decision Theory, Selection Bias, and Reference Classes - Less Wrong

19 Post author: Qiaochu_Yuan 08 July 2013 05:16AM


Comment author: IlyaShpitser 11 July 2013 03:41:28AM *  2 points [-]

Ignoring the fact that I thought there were no samples without HAART at t=0, what if half of the samples referred to hamsters, rather than humans?

Well, there are in reality both A0 and A1. I chose this example because in this example it is both the case that E[death | A0, A1] is wrong, and \sum_{L0} E[death | A0, A1, L0] p(L0) (the usual covariate adjustment) is wrong, because L0 is a rather unusual type of confounder. This example was something naive causal inference used to get wrong for a long time.
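A small simulation can make this concrete. The structural equations below are invented for illustration (the thread gives no numbers), but they match the described graph: unobserved health status U drives both the marker L0 and death Y, A0 is randomized, and A1 is given preferentially to the sick. In this setup the naive conditional expectation and the standard covariate adjustment both fail, while the g-formula (adjustment weighted by p(L0 | A0)) recovers the true effect:

```python
import random

random.seed(0)
n = 200_000

# Invented structural equations for the described graph:
#   U  (unobserved health status) -> L0, Y
#   A0 (treatment at t=0, randomized) -> L0, Y
#   L0 (measured marker) -> A1
#   A1 (treatment at t=1, given preferentially to the sick) -> Y
data = []
for _ in range(n):
    u  = random.random() < 0.5
    a0 = random.random() < 0.5                        # randomized at t=0
    p_l0 = (0.6 if u else 0.0) + (0.0 if a0 else 0.2) # marker tracks U and A0
    l0 = random.random() < p_l0
    a1 = random.random() < (0.8 if l0 else 0.2)       # sicker marker -> more HAART
    p_y = 0.3 + (0.6 if u else 0.0) \
              - (0.1 if a0 else 0.0) - (0.1 if a1 else 0.0)  # treatment truly helps
    y  = random.random() < p_y
    data.append((a0, l0, a1, y))

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def naive(a0, a1):
    # E[Y | A0=a0, A1=a1]: confounded, because A1 tracks hidden health via L0
    return mean(y for (b0, l0, b1, y) in data if b0 == a0 and b1 == a1)

def adjusted(a0, a1):
    # sum_{l0} E[Y | A0, A1, L0=l0] p(L0=l0): usual covariate adjustment,
    # also wrong here -- conditioning on L0 distorts the effect of A0
    return sum(
        mean(y for (b0, l0, b1, y) in data if b0 == a0 and b1 == a1 and l0 == v)
        * mean(l0 == v for (_, l0, _, _) in data)
        for v in (True, False))

def g_formula(a0, a1):
    # sum_{l0} E[Y | A0, A1, L0=l0] p(L0=l0 | A0=a0): recovers
    # E[Y | do(A0=a0, A1=a1)] for this graph
    return sum(
        mean(y for (b0, l0, b1, y) in data if b0 == a0 and b1 == a1 and l0 == v)
        * mean(l0 == v for (b0, l0, _, _) in data if b0 == a0)
        for v in (True, False))

# Truth by construction: E[Y | do(HAART, HAART)] = 0.4 < 0.6 = E[Y | do(no, no)]
print(naive(True, True), naive(False, False))          # naive: HAART looks harmful
print(g_formula(True, True), g_formula(False, False))  # g-formula: HAART helps
```

With these numbers the treated die more often in the raw data (because the sick are treated), so the naive comparison gets the sign wrong, exactly as claimed in the thread.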

More generally, you seem to be fighting the hypothetical. I gave a specific problem on only four variables, where everything is fully specified, there are no hamsters, and which (I claim) breaks EDT. You aren't bringing up hamsters with Newcomb's problem, so why bring them up here? This is just a standard longitudinal design: there is nothing exotic about it, no omnipotent Omegas or source-code-reading AIs.

However a decision theory in general contains no specific prescriptions for obtaining probabilities from data.

I think you misunderstand decision theory. If you were right, there would be no difference between CDT and EDT. In fact, the entire point of decision theories is to give rules you would use to make decisions. EDT has a rule involving conditional probabilities of observed data (because EDT treats all observed data as evidence). CDT has a rule involving a causal connection between your action and the outcome. This rule implies, contrary to what you claimed, that a particular method must be used to get your answer from data (this method being given by the theory of identification of causal effects) on pain of getting garbage answers and going to jail.

Comment author: nshepperd 11 July 2013 05:47:54AM *  1 point [-]

You aren't bringing up hamsters with Newcomb's problem, why bring them up here?

I said why I was bringing them up: to make the point that blindly counting the number of events in a dataset satisfying (action = X, outcome = Y) is blatantly ridiculous, and this applies whether or not hamsters are involved. If you think EDT does that, then either you are mistaken, or everyone studying EDT is a lot less sane than they look.

I think you misunderstand decision theory. If you were right, there would be no difference between CDT and EDT.

The difference is that CDT asks for P(utility | do(action), observations) and EDT asks for P(utility | action, observations). Neither CDT nor EDT specifies detailed rules for how to calculate these probabilities or update on observations, or what priors to use. Indeed, those rules are normally found in statistics textbooks, Pearl's Causality or—in the case of the g-formula—random math papers.
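The gap between those two quantities can be shown with a toy smoking-lesion joint distribution. The numbers here are invented purely for illustration: the lesion is equally likely to be present or absent, makes smoking much more likely, and is the sole cause of cancer. Conditioning on the action (EDT's quantity) makes smoking evidence for the lesion; do() (CDT's quantity) cuts the lesion -> smoking arrow and leaves the lesion at its prior:

```python
# Invented smoking-lesion numbers:
#   lesion ~ Bern(0.5); P(smoke | lesion) = 0.8 vs 0.2;
#   P(cancer | lesion) = 0.9 vs 0.1.
# Cancer depends ONLY on the lesion, never on smoking itself.
def p_joint(lesion, smoke, cancer):
    p = 0.5
    p *= (0.8 if lesion else 0.2) if smoke else (0.2 if lesion else 0.8)
    p *= (0.9 if lesion else 0.1) if cancer else (0.1 if lesion else 0.9)
    return p

def p_cancer_given_smoke():
    # EDT's quantity: P(cancer | smoke), plain conditioning on the action
    num = sum(p_joint(l, True, True) for l in (True, False))
    den = sum(p_joint(l, True, c) for l in (True, False) for c in (True, False))
    return num / den

def p_cancer_given_do_smoke():
    # CDT's quantity: P(cancer | do(smoke)) -- sever lesion -> smoke,
    # so the lesion keeps its prior distribution of 0.5
    return sum(0.5 * (0.9 if l else 0.1) for l in (True, False))

print(p_cancer_given_smoke())     # 0.74: smoking is evidence for the lesion
print(p_cancer_given_do_smoke())  # 0.50: smoking does not cause cancer here
```

Both functions are computed exactly from the same joint distribution; the only difference is whether the lesion -> smoke dependence is kept (conditioning) or severed (intervention).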

Comment author: IlyaShpitser 11 July 2013 02:15:23PM *  4 points [-]

If you think EDT does that, then either you are mistaken, or everyone studying EDT is a lot less sane than they look

Ok. I keep asking you, because I want to see where I am going wrong. Without fighting the hypothetical, what is EDT's answer in my hamster-free, perfectly standard longitudinal example: do you in fact give the patient HAART or not? If you think there are multiple EDTs, pick the one that gives the right answer! My point is, if you do give HAART, you have to explain what rule you use to arrive at this, and how it's EDT and not CDT. If you do not give HAART, you are "wrong."

The form of argument where you say "well, this couldn't possibly be right -- if it were I would be terrified!" isn't very convincing. I think Homer Simpson used that once :).

Comment author: nshepperd 11 July 2013 08:52:14PM *  0 points [-]

The form of argument where you say "well, this couldn't possibly be right -- if it were I would be terrified!" isn't very convincing. I think Homer Simpson used that once :).

What I meant was "if it were, that would require a large number of (I would expect) fairly intelligent mathematicians to have made an egregiously dumb mistake, on the order of an engineer modelling a 747 as made of cheese". Does that seem likely to you? The principle of charity says "don't assume someone is stupid so you can call them wrong".

Regardless, since there is nothing weird going on here, I would expect (a particular non-strawman version of) EDT's answer to be precisely the same as CDT's answer, because "agent's action" has no common causes with the relevant outcomes (ETA: no common causes that aren't screened off by observations. If you measure patient vital signs and decide based on them, obviously that's a common cause, but irrelevant since you've observed them). In which case you use whatever statistical techniques one normally uses to calculate P(utility | do(action), observations) (the g-formula seems to be an ad-hoc frequentist device as far as I can tell, but there's probably a prior that leads to the same result in a bayesian calculation). You keep telling me that results in "give HAART" so I guess that's the answer, even though I don't actually have any data.

Is that a satisfying answer?

In retrospect, I would have said that before, but got distracted by the seeming ill-posedness of the problem and incompleteness of the data. (Yes, the data is incomplete. Analysing it requires nontrivial assumptions, as far as I can tell from reading a paper on the g-formula.)

Comment author: IlyaShpitser 11 July 2013 09:46:08PM *  2 points [-]

the g-formula seems to be an ad-hoc frequentist device as far as I can tell

See, it's things like this that make people have the negative opinion of LW as a quasi-religion that they do. I am willing to wager a guess that your understanding of "the parametric g-formula" is actually based on a google search or two. Yet despite this, you are willing to make (dogmatic, dismissive, and wrong) Bayesian-sounding pronouncements about it. In fact the g-formula is just how you link do(.) and observational data, nothing more, nothing less. do(.) is defined in terms of the g-formula in Pearl's chapter 1. The g-formula has nothing to do with Bayesian vs frequentist differences.

Is that a satisfying answer?

No. EDT is not allowed to talk about "confounders" or "causes" or "do(.)". There is nothing in any definition of EDT in any textbook that allows you to refer to anything that isn't a function of the observed joint density. So that's all you can use to get the answer here. If you talk about "confounders" or "causes" or "do(.)", you are using CDT by definition. What is the difference between EDT and CDT to you?


Re: principle of charity, it's very easy to get causal questions wrong. Causal inference isn't easy! Causal inference as a field used to get the example I gave wrong until the late 1980s. Your answers about how to use EDT to get the answer here are very vague. You should be able to find a textbook on EDT and follow an algorithm there to give a condition, in terms of p(A0, A1, L0, Y), for whether HAART should be given or not. My understanding of EDT is that the condition would be:

Give HAART at A0,A1 iff E[death | A0=yes, A1=yes] < E[death | A0=no, A1=no]

So you would not give HAART by construction in my example (I mentioned people who get HAART die more often due to confounding by health status).

Comment author: nshepperd 11 July 2013 10:01:51PM *  1 point [-]

See, it's things like this that make people have the negative opinion of LW as a quasi-religion that they do. I am willing to wager a guess that your understanding of "the parametric g-formula" is actually based on a google search or two. Yet despite this, you are willing to make (dogmatic, dismissive, and wrong) Bayesian-sounding pronouncements about it. In fact the g-formula is just how you link do(.) and observational data, nothing more, nothing less. do(.) is defined in terms of the g-formula in Pearl's chapter 1.

You're probably right. Not that this matters much. The reason I said that is that the few papers I could find on the g-formula were all in the context of using it to find out "whether HAART kills people", and none of them gave any kind of justification or motivation for it, or even mentioned how it related to probabilities involving do().

No. EDT is not allowed to talk about "confounders" or "causes" or "do(.)". There is nothing in any definition of EDT in any textbook that allows you to refer to anything that isn't a function of the observed joint density.

Did you read what I wrote? Since action and outcome do not have any common causes (conditional on observations), P(outcome | action, observations) = P(outcome | do(action), observations). I am well aware that EDT does not mention do. This does not change the fact that this equality holds in this particular situation, which is what allows me to say that EDT and CDT have the same answer here.

Re: principle of charity, it's very easy to get causal questions wrong.

Postulating "just count up how many samples have the particular action and outcome, and ignore everything else" as a decision theory is not a complicated causal mistake. This was the whole point of the hamster example. This method breaks horribly on the most simple dataset with a bit of irrelevant data.

ETA: [responding to your edit]

My understanding of EDT is that the condition would be:

Give HAART at A0,A1 iff E[death | A0=yes, A1=yes] < E[death | A0=no, A1=no]

No, this is completely wrong, because it ignores the fact that the action the EDT agent considers is "I (EDT agent) give this person HAART", not "be a person who decides whether to give HAART based on the metrics L0, and also give this person HAART", which isn't something it's possible to "decide" at all.

Comment author: IlyaShpitser 12 July 2013 04:08:00PM *  2 points [-]

Thanks for this. Technical issue:

Since action and outcome do not have any common causes (conditional on observations), P(outcome | action, observations) = P(outcome | do(action), observations).

In my example, A0 has no causes (it is randomized) but A1 has a common cause with the outcome Y (this common cause is the unobserved health status, which is a parent of both Y and L0, and L0 is a parent of A1). L0 is observed but you cannot adjust for it either because that screws up the effect of A0.

To get the right answer here, you need a causal theory that connects observations to causal effects. The point is, EDT isn't allowed to just steal causal theory to get its answer without becoming a causal decision theory itself.

Comment author: nshepperd 13 July 2013 01:10:33AM *  0 points [-]

In my example, A0 has no causes (it is randomized) but A1 has a common cause with the outcome Y (this common cause is the unobserved health status, which is a parent of both Y and L0, and L0 is a parent of A1). L0 is observed but you cannot adjust for it either because that screws up the effect of A0.

Health status is screened off by the fact that L0 is an observation. At the point where you (EDT agent) decide whether to give HAART at A1 the relevant probability for purposes of calculating expected utility is P(outcome=Y | action=give-haart, observations=[L0, this dataset]). Effect of action on unobserved health-status and through to Y is screened off by conditioning on L0.

Comment author: IlyaShpitser 13 July 2013 01:58:30AM *  0 points [-]

That's right, but as I said, you cannot just condition on L0 because that blocks the causal path from A0 to Y, and opens a non-causal path A0 -> L0 <-> Y. This is what makes L0 a "time-dependent confounder" and this is why

\sum_{L0} E[Y | L0,A0,A1] p(L0) and E[Y | L0, A0, A1] are both wrong here.

(Remember, HAART is given in two stages, A0 and A1, separated by L0).

Comment author: nshepperd 13 July 2013 03:45:24PM *  0 points [-]

That's right, but as I said, you cannot just condition on L0 because that blocks the causal path from A0 to Y, and opens a non-causal path A0 -> L0 <-> Y.

Okay, this isn't actually a problem. At A1 (deciding whether to give HAART at time t=1) you condition on L0 because you've observed it. This means using P(outcome=Y | action=give-haart-at-A1, observations=[L0, the dataset]) which happens to be identical to P(outcome=Y | do(action=give-haart-at-A1), observations=[L0, the dataset]), since A1 has no parents apart from L0. So the decision is the same as CDT at A1.

At A0 (deciding whether to give HAART at time t=0), you haven't measured L0, so you don't condition on it. You use P(outcome=Y | action=give-haart-at-A0, observations=[the dataset]) which happens to be the same as P(outcome=Y | do(action=give-haart-at-A0), observations=[the dataset]) since A0 has no parents at all. The decision is the same as CDT at A0, as well.

To make this perfectly clear, what I am doing here is replacing the agents at A0 and A1 (that decide whether to administer HAART) with EDT agents with access to the aforementioned dataset and calculating what they would do. That is, "You are at A0. Decide whether to administer HAART using EDT." and "You are at A1. You have observed L0=[...]. Decide whether to administer HAART using EDT.". The decisions about what to do at A0 and A1 are calculated separately (though the agent at A0 will generally need to know, and therefore to first calculate what A1 will do, so that they can calculate stuff like P(outcome=Y | action=give-haart-at-A0, observations=[the dataset])).

You may actually be thinking of "solve this problem using EDT" as "using EDT, derive the best (conditional) policy for agents at A0 and A1", which means an EDT agent standing "outside the problem", deciding upon what A0 and A1 should do ahead of time, which works somewhat differently — happily, though, it's practically trivial to show that this EDT agent's decision would be the same as CDT's: because an agent deciding on a policy for A0 and A1 ahead of time is affected by nothing except the original dataset, which is of course the input (an observation), we have P(outcome | do(policy), observations=dataset) = P(outcome | policy, observations=dataset). In case it's not obvious, the graph for this case is dataset -> (agent chooses policy) -> (some number of people die after assigning A0,A1 based on policy) -> outcome.

Comment author: Vaniver 11 July 2013 10:20:00PM *  -1 points [-]

What I meant was "if it were, that would require a large number of (I would expect) fairly intelligent mathematicians to have made an egregiously dumb mistake, on the order of an engineer modelling a 747 as made of cheese". Does that seem likely to you? The principle of charity says "don't assume someone is stupid so you can call them wrong".

Yes, actually, they do seem to have made an egregiously dumb mistake. People think EDT is dumb because it is dumb. Full stop.

The confusion is that sometimes when people talk about EDT, they are talking about the empirical group of "EDTers". EDTers aren't dumb enough to actually use the math of EDT. A "non-strawman EDT" is CDT. (If it wasn't, how could the answers always be the same?) The point of math, though, is that you can't strawman it; the math is what it is. Making decisions based on the conditional probabilities that resulted from observing that action historically is dumb; EDT makes decisions based on conditional probabilities; therefore EDT is dumb.

Comment author: nshepperd 11 July 2013 10:42:17PM *  2 points [-]

If it wasn't, how could the answers always be the same?

They're not...? EDT one-boxes on Newcomb's and smokes (EDIT: doesn't smoke) on the smoking lesion (unless the tickle defense actually works or something). Of course, it also two-boxes on transparent Newcomb's, so it's still a dumb theory, but it's not that dumb.

Comment author: Vaniver 11 July 2013 10:56:50PM *  0 points [-]

They're not...?

How else should I interpret "I would expect (a particular non-strawman version of) EDT's answer to be precisely the same as CDT's answer"?

EDT one-boxes on Newcomb's and smokes on the smoking lesion (unless the tickle defense actually works or something).

Huh? EDT doesn't smoke on the smoking lesion, because P(cancer|smoking)>P(cancer|!smoking).

Comment author: nshepperd 11 July 2013 11:19:34PM 0 points [-]

How else should I interpret "I would expect (a particular non-strawman version of) EDT's answer to be precisely the same as CDT's answer"?

What I said was

Regardless, since there is nothing weird going on here, I would expect (a particular non-strawman version of) EDT's answer to be precisely the same as CDT's answer, because "agent's action" has no common causes with the relevant outcomes

Meaning that in this particular situation (where there aren't any omniscient predictors or mysterious correlations), the decision is the same. I didn't mean they were the same generally.

Huh? EDT doesn't smoke on the smoking lesion, because P(cancer|smoking)>P(cancer|!smoking).

Er, you're right. I got mixed up there.

Comment author: Vaniver 11 July 2013 11:29:02PM *  -1 points [-]

I didn't mean they were the same generally.

Okay. Do you have a mathematical description of when they differ, or is it an "I know it when I see it" sort of description? What makes a correlation mysterious?

I'm still having trouble imagining what a "non-strawman" EDT looks like mathematically, except for what I'm calling EDT+Intuition, in which people implicitly calculate probabilities using CDT and then feed those probabilities into EDT (in which case they're only using EDT for expected-value calculation, which CDT can do just as easily). It sounds to me like someone insisting that a "non-strawman" formula for x squared is x cubed.

Comment author: nshepperd 12 July 2013 12:21:07AM *  0 points [-]

A first try at formalising it would amount to "build a causal graph including EDT-agent's-decision-now as a node, and calculate expected utilities using P(utility | agent=action, observations)".

For example, for your average boring everyday situation, such as noticing a $5 note on the ground and thinking about whether to pick it up, the graph is (do I see $5 on the ground) --> (do I try to pick it up) --> (outcome). To arrive at a decision, you calculate the expected utilities using P(utility | pick it up, observation=$5) vs P(utility | don't pick it up, observation=$5). Note that conditioning on both observations and your action breaks the correlation expressed by the first link of the graph, resulting in this being equivalent to CDT in this situation. Also conveniently this makes P(action | I see $5) not matter, even though this is technically a necessary component to have a complete graph.

To be actually realistic you would need to include a lot of other stuff in the graph, such as everything else you've ever observed, and (agent's state 5 minutes ago) as causes of the current action (do I try to pick it up). But all of these can either be ignored (in the case of irrelevant observations) or marginalised out without effect (in the case of unobserved causes that we don't know affect the outcome in any particular direction).

Next take an interesting case like Newcomb's. The graph is something like: (agent 5 minutes ago) -> (one-box now?), (agent 5 minutes ago) -> (omega fills boxes), and both (one-box now?) and (omega fills boxes) -> (utility).

We don't know whether agent-5-minutes-ago was the sort that would make omega fill both boxes or not (so it's not an observation), but we do know that there's a direct correlation between that and our one-boxing. So when calculating P(utility|one-box), which implicitly involves marginalising over (agent-5-minutes-ago) and (omega fills boxes), we see that the case where (agent-5-minutes-ago)=one-box and (omega fills boxes)=both dominates, while the opposite case dominates for P(utility|two-box), so one-boxing has a higher utility.
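That marginalization can be written out explicitly. The numbers here are my own invented toy parameters (a 50% prior on being a one-boxing type, a 99%-reliable link between disposition and action, standard Newcomb payoffs); the thread specifies none of them:

```python
# Toy Newcomb's problem: the agent's disposition 5 minutes ago both
# determines Omega's prediction and (with 99% reliability) drives the
# agent's actual choice now. All numbers are invented for illustration.
P_ONEBOX_TYPE = 0.5   # prior probability of being a one-boxing type
RELIABILITY = 0.99    # P(action matches disposition)

def payout(omega_fills_big, one_box):
    big = 1_000_000 if omega_fills_big else 0
    return big if one_box else big + 1_000  # two-boxing adds the $1000 box

def edt_value(one_box):
    # EDT's E[utility | action]: marginalize over the unobserved
    # disposition, which the action itself is evidence about.
    total_p, total_u = 0.0, 0.0
    for disposition in (True, False):  # was the agent a one-boxing type?
        p_type = P_ONEBOX_TYPE if disposition else 1 - P_ONEBOX_TYPE
        p_act = RELIABILITY if one_box == disposition else 1 - RELIABILITY
        p = p_type * p_act
        total_p += p
        # Omega filled the big box iff the disposition was to one-box
        total_u += p * payout(omega_fills_big=disposition, one_box=one_box)
    return total_u / total_p

print(edt_value(True))   # one-boxing is strong evidence the big box is full
print(edt_value(False))
```

With these numbers the one-boxing branch is dominated by (disposition = one-box, box full), giving roughly $990,000, while the two-boxing branch is dominated by (disposition = two-box, box empty), giving roughly $11,000, which is the marginalization the comment describes.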