nshepperd comments on Evidential Decision Theory, Selection Bias, and Reference Classes - Less Wrong

Post author: Qiaochu_Yuan 08 July 2013 05:16AM (19 points)


Comment author: nshepperd 10 July 2013 07:30:06AM 0 points

How about we dispense with this and you tell us if you know how to extract information about the usefulness (or not) of HAART from a data set like this?

Comment author: IlyaShpitser 10 July 2013 03:37:31PM * 3 points

Ok, first things first.

when that doesn't even sound like a question EDT should be trying to answer in the first place

Do you agree that "Do you put him on HAART or not? Your utility function is minimizing patient deaths." is in fact a kind of question EDT, or decision theories in general, should be trying to answer?

In fact, I already said elsewhere in this thread that I think there is a right answer to this question, and that answer is to put the patient on HAART (whereas my understanding of EDT is that it will notice that E[death | HAART] > E[death | no HAART], and conclude that HAART is bad). The way you get the answer is no secret either: it's what is called 'the g-formula' or 'truncated factorization' in the literature. I have been trying to understand how my understanding of EDT is wrong. If people's attempt to fix this is to require that all confounders of death be observed, then to me this says EDT is not a very good decision theory, because other decision theories can get the right answer here without observing anything beyond what I specified. If people say that the right answer is to not give HAART, that's even worse: anyone who actually practiced medicine like that would kill people and go to jail.
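
(For reference, since the term keeps coming up: the 'truncated factorization' is the identity p(v | do(a)) = \prod_{i : V_i not in A} p(v_i | pa(v_i)), evaluated at A = a; you delete the factors for the intervened-on variables from the joint factorization and keep the rest. The g-formula is what you get after summing out every variable other than the outcome. A worked version for this specific example appears further down the thread.)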

Comment author: nshepperd 11 July 2013 12:11:29AM * -1 points

Do you agree that "Do you put him on HAART or not? Your utility function is minimizing patient deaths." is in fact a kind of question EDT, or decision theories in general, should be trying to answer?

Yes. However, a decision theory in general contains no specific prescriptions for obtaining probabilities from data, such as "oh, use the parametric g-formula". In general, decision theories just list the probabilistic information they require.

E[death | HAART] > E[death | no HAART]

Setting that aside, I take the above to mean "count the proportion of samples without HAART that end in death, and compare it to the proportion of samples with HAART that end in death". Ignoring the fact that I thought there were no samples without HAART at t=0: what if half of the samples referred to hamsters, rather than humans?

No one would ever have proposed EDT as a serious decision theory if they intended one to blindly count records while ignoring all other relevant "confounding" information (such as species, or health status). In reality, the purpose of the program of "count the number of people who smoke who have the lesion" or "count how many people who have HAART die" is to obtain estimates of P(I have the lesion | I smoke) or P(this patient dies | I give this patient HAART). That is why we discard the hamster samples: there are good a priori reasons to think that the survival of hamsters and humans is not highly correlated, and "this patient" is a human.

Comment author: IlyaShpitser 11 July 2013 03:41:28AM * 2 points

Ignoring the fact that I thought there were no samples without HAART at t=0, what if half of the samples referred to hamsters, rather than humans?

Well, in reality there are two treatment decisions, A0 and A1. I chose this example because in it both E[death | A0, A1] and the usual covariate adjustment \sum_{L0} E[death | A0, A1, L0] p(L0) are wrong, because L0 is a rather unusual type of confounder. This example was something naive causal inference used to get wrong for a long time.
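
(Concretely, the g-formula answer in this example is \sum_{L0} E[death | A0, L0, A1] p(L0 | A0): note that it differs from the usual adjustment above only in weighting by p(L0 | A0) rather than by the marginal p(L0).)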

More generally, you seem to be fighting the hypothetical. I gave a specific problem on only four variables, where everything is fully specified, there aren't hamsters, and which (I claim) breaks EDT. You don't bring up hamsters with Newcomb's problem, so why bring them up here? This is just a standard longitudinal design: there is nothing exotic about it, no omnipotent Omegas or source-code-reading AIs.
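
To make this concrete, here is a minimal simulated sketch of such a dataset. The numbers are invented for illustration (any structurally similar choice works); the only thing that matters is the structure: health status U is unobserved, L0 depends on U and A0, A1 depends on L0, and death depends on U, A0, A1.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2_000_000

    u  = rng.binomial(1, 0.5, n)                    # unobserved health status (1 = sick)
    a0 = rng.binomial(1, 0.5, n)                    # baseline treatment, randomized
    l0 = rng.binomial(1, 0.05 + 0.9*u - 0.4*a0*u)   # observed marker: raised by sickness, lowered by treatment
    a1 = rng.binomial(1, 0.1 + 0.8*l0)              # doctors treat patients whose marker looks bad
    y  = rng.binomial(1, 0.2 + 0.5*u - 0.05*a0 - 0.05*a1)  # death; true (1,1)-vs-(0,0) effect is -0.10

    def naive(t0, t1):      # E[Y | A0=t0, A1=t1]
        return y[(a0 == t0) & (a1 == t1)].mean()

    def adjusted(t0, t1):   # sum_{L0} E[Y | A0, A1, L0] p(L0): the usual covariate adjustment
        return sum(y[(a0 == t0) & (a1 == t1) & (l0 == l)].mean() * (l0 == l).mean()
                   for l in (0, 1))

    def g_formula(t0, t1):  # sum_{L0} E[Y | A0, L0, A1] p(L0 | A0): truncated factorization
        return sum(y[(a0 == t0) & (a1 == t1) & (l0 == l)].mean() * (l0[a0 == t0] == l).mean()
                   for l in (0, 1))

    for f in (naive, adjusted, g_formula):
        print(f.__name__, round(f(1, 1) - f(0, 0), 3))
    # naive      ~ +0.23  (treatment looks harmful)
    # adjusted   ~ -0.03  (attenuated: conditioning on L0 distorts A0's effect)
    # g_formula  ~ -0.10  (recovers the effect a randomized trial would show)

Only the g-formula contrast matches what randomizing both decisions would give you; the other two estimators get the sign or the magnitude wrong.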

However a decision theory in general contains no specific prescriptions for obtaining probabilities from data.

I think you misunderstand decision theory. If you were right, there would be no difference between CDT and EDT. In fact, the entire point of a decision theory is to give the rules you use to make decisions. EDT has a rule involving conditional probabilities of observed data (because EDT treats all observed data as evidence). CDT has a rule involving a causal connection between your action and the outcome. This rule implies, contrary to what you claimed, that a particular method must be used to get your answer from data (the method given by the theory of identification of causal effects), on pain of getting garbage answers and going to jail.

Comment author: nshepperd 11 July 2013 05:47:54AM * 1 point

You aren't bringing up hamsters with Newcomb's problem, why bring them up here?

I said why I was bringing them up: to make the point that blindly counting the number of events in a dataset satisfying (action = X, outcome = Y) is blatantly ridiculous, and this applies whether or not hamsters are involved. If you think EDT does that, then either you are mistaken or everyone studying EDT is a lot less sane than they look.

I think you misunderstand decision theory. If you were right, there would be no difference between CDT and EDT.

The difference is that CDT asks for P(utility | do(action), observations) and EDT asks for P(utility | action, observations). Neither CDT nor EDT specifies detailed rules for how to calculate these probabilities or update on observations, or what priors to use. Indeed, those rules are normally found in statistics textbooks, Pearl's Causality or—in the case of the g-formula—random math papers.

Comment author: IlyaShpitser 11 July 2013 02:15:23PM * 4 points

If you think EDT does that, then either you are mistaken or everyone studying EDT is a lot less sane than they look

Ok. I keep asking you, because I want to see where I am going wrong. Without fighting the hypothetical: what is EDT's answer in my hamster-free, perfectly standard longitudinal example? Do you in fact give the patient HAART or not? If you think there are multiple EDTs, pick the one that gives the right answer! My point is, if you do give HAART, you have to explain what rule you used to arrive at this, and how it's EDT and not CDT. If you do not give HAART, you are "wrong."

The form of argument where you say "well, this couldn't possibly be right -- if it were I would be terrified!" isn't very convincing. I think Homer Simpson used that once :).

Comment author: nshepperd 11 July 2013 08:52:14PM * 0 points

The form of argument where you say "well, this couldn't possibly be right -- if it were I would be terrified!" isn't very convincing. I think Homer Simpson used that once :).

What I meant was "if it were, that would require a large number of (I would expect) fairly intelligent mathematicians to have made an egregiously dumb mistake, on the order of an engineer modelling a 747 as made of cheese". Does that seem likely to you? The principle of charity says "don't assume someone is stupid so you can call them wrong".

Regardless, since there is nothing weird going on here, I would expect (a particular non-strawman version of) EDT's answer to be precisely the same as CDT's answer, because "agent's action" has no common causes with the relevant outcomes (ETA: no common causes that aren't screened off by observations; if you measure patient vital signs and decide based on them, that's obviously a common cause, but an irrelevant one, since you've observed them). In which case you use whatever statistical techniques one normally uses to calculate P(utility | do(action), observations) (the g-formula seems to be an ad-hoc frequentist device as far as I can tell, but there's probably a prior that leads to the same result in a Bayesian calculation). You keep telling me that results in "give HAART", so I guess that's the answer, even though I don't actually have any data.

Is that a satisfying answer?

In retrospect, I would have said that before, but got distracted by the seeming ill-posedness of the problem and incompleteness of the data. (Yes, the data is incomplete. Analysing it requires nontrivial assumptions, as far as I can tell from reading a paper on the g-formula.)

Comment author: IlyaShpitser 11 July 2013 09:46:08PM * 2 points

the g-formula seems to be an ad-hoc frequentist device as far as I can tell

See, it's things like this that make people have the negative opinion of LW as a quasi-religion that they do. I am willing to wager a guess that your understanding of "the parametric g-formula" is actually based on a google search or two. Yet despite this, you are willing to make (dogmatic, dismissive, and wrong) Bayesian-sounding pronouncements about it. In fact the g-formula is just how you link do(.) and observational data, nothing more, nothing less. do(.) is defined in terms of the g-formula in Pearl's chapter 1. The g-formula has nothing to do with Bayesian vs frequentist differences.

Is that a satisfying answer?

No. EDT is not allowed to talk about "confounders" or "causes" or "do(.)". There is nothing in any definition of EDT in any textbook that allows you to refer to anything that isn't a function of the observed joint density. So that's all you can use to get the answer here. If you talk about "confounders" or "causes" or "do(.)", you are using CDT by definition. What is the difference between EDT and CDT to you?


Re: the principle of charity, it's very easy to get causal questions wrong. Causal inference isn't easy! Causal inference as a field used to get the example I gave wrong until the late 1980s. Your answers about how to use EDT to get the answer here are very vague. You should be able to find a textbook on EDT and follow an algorithm there to give a condition, in terms of p(A0,A1,L0,Y), for whether HAART should be given or not. My understanding of EDT is that the condition would be:

Give HAART at A0,A1 iff E[death | A0=yes, A1=yes] < E[death | A0=no, A1=no]

So by construction you would not give HAART in my example (I mentioned that people who get HAART die more often, due to confounding by health status).
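
(On the simulated dataset I gave a couple of comments up, this criterion is exactly the naive contrast: E[death | A0=1, A1=1] - E[death | A0=0, A1=0] ~ +0.23 > 0, so this reading of EDT withholds HAART, while the g-formula contrast is ~ -0.10: HAART saves lives.)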

Comment author: nshepperd 11 July 2013 10:01:51PM * 1 point

See, it's things like this that make people have the negative opinion of LW as a quasi-religion that they do. I am willing to wager a guess that your understanding of "the parametric g-formula" is actually based on a google search or two. Yet despite this, you are willing to make (dogmatic, dismissive, and wrong) Bayesian-sounding pronouncements about it. In fact the g-formula is just how you link do(.) and observational data, nothing more, nothing less. do(.) is defined in terms of the g-formula in Pearl's chapter 1.

You're probably right. Not that this matters much. I said that because the few papers I could find on the g-formula were all in the context of using it to find out "whether HAART kills people", and none of them gave any kind of justification or motivation for it, or even mentioned how it relates to probabilities involving do().

No. EDT is not allowed to talk about "confounders" or "causes" or "do(.)". There is nothing in any definition of EDT in any textbook that allows you to refer to anything that isn't a function of the observed joint density.

Did you read what I wrote? Since action and outcome do not have any common causes (conditional on observations), P(outcome | action, observations) = P(outcome | do(action), observations). I am well aware that EDT does not mention do. This does not change the fact that this equality holds in this particular situation, which is what allows me to say that EDT and CDT have the same answer here.

Re: principle of charity, it's very easy to get causal questions wrong.

Postulating "just count up how many samples have the particular action and outcome, and ignore everything else" as a decision theory is not a complicated causal mistake. This was the whole point of the hamster example: this method breaks horribly on the simplest dataset containing a bit of irrelevant data.

ETA: [responding to your edit]

My understanding of EDT is that the condition would be:

Give HAART at A0,A1 iff E[death | A0=yes, A1=yes] < E[death | A0=no, A1=no]

No, this is completely wrong, because it ignores the fact that the action the EDT agent considers is "I (the EDT agent) give this person HAART", not "be a person who decides whether to give HAART based on the metrics L0, and also give this person HAART", which isn't something it's possible to "decide" at all.

Comment author: IlyaShpitser 12 July 2013 04:08:00PM * 2 points

Thanks for this. Technical issue:

Since action and outcome do not have any common causes (conditional on observations), P(outcome | action, observations) = P(outcome | do(action), observations).

In my example, A0 has no causes (it is randomized), but A1 has a common cause with the outcome Y (this common cause is the unobserved health status, which is a parent of both Y and L0, and L0 is a parent of A1). L0 is observed, but you cannot adjust for it either, because that screws up the effect of A0.
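
In graph form (my sketch of the structure just described; whether there is also a direct A0 -> Y edge doesn't matter for the point):

    A0 -> L0 -> A1 -> Y,   U -> L0,   U -> Y      (U = health status, unobserved)

Conditioning on L0 blocks the causal path A0 -> L0 -> A1 -> Y and opens the collider path A0 -> L0 <- U -> Y; not conditioning leaves A1 confounded along A1 <- L0 <- U -> Y. Either way, a plain conditional expectation is biased, which is why the g-formula handles L0 the way it does (weighting by p(L0 | A0), not p(L0)).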

To get the right answer here, you need a causal theory that connects observations to causal effects. The point is, EDT isn't allowed to just steal causal theory to get its answer without becoming a causal decision theory itself.

Comment author: Vaniver 11 July 2013 10:20:00PM * -1 points

What I meant was "if it were, that would require a large number of (I would expect) fairly intelligent mathematicians to have made an egregiously dumb mistake, on the order of an engineer modelling a 747 as made of cheese". Does that seem likely to you? The principle of charity says "don't assume someone is stupid so you can call them wrong".

Yes, actually, they do seem to have made an egregiously dumb mistake. People think EDT is dumb because it is dumb. Full stop.

The confusion is that sometimes when people talk about EDT, they are talking about the empirical group of "EDTers". EDTers aren't dumb enough to actually use the math of EDT. A "non-strawman EDT" is CDT. (If it weren't, how could the answers always be the same?) The point of math, though, is that you can't strawman it; the math is what it is. Making decisions based on the conditional probabilities that resulted from observing that action historically is dumb; EDT makes decisions based on conditional probabilities; therefore EDT is dumb.

Comment author: nshepperd 11 July 2013 10:42:17PM * 2 points

If it wasn't, how could the answers always be the same?

They're not...? EDT one-boxes on Newcomb's and smokes (EDIT: doesn't smoke) on the smoking lesion (unless the tickle defense actually works or something). Of course, it also two-boxes on transparent Newcomb's, so it's still a dumb theory, but it's not that dumb.

Comment author: Vaniver 11 July 2013 10:56:50PM * 0 points

They're not...?

How else should I interpret "I would expect (a particular non-strawman version of) EDT's answer to be precisely the same as CDT's answer"?

EDT one-boxes on Newcomb's and smokes on the smoking lesion (unless the tickle defense actually works or something).

Huh? EDT doesn't smoke on the smoking lesion, because P(cancer|smoking)>P(cancer|!smoking).
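
With made-up numbers: suppose P(lesion) = 0.5, the lesion makes you smoke with probability 0.9 (vs 0.1 without it) and gives you cancer with probability 0.8 (vs 0.1 without it), and smoking itself does nothing. Then P(cancer|smoking) = (0.5*0.9*0.8 + 0.5*0.1*0.1) / 0.5 = 0.73 and P(cancer|!smoking) = (0.5*0.1*0.8 + 0.5*0.9*0.1) / 0.5 = 0.17, so an EDT that just conditions on its action refuses a free pleasure, even though P(cancer|do(smoking)) = P(cancer|do(!smoking)) = 0.45.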