nshepperd comments on Evidential Decision Theory, Selection Bias, and Reference Classes - Less Wrong

Post author: Qiaochu_Yuan 08 July 2013 05:16AM


Comment author: Vaniver 17 July 2013 06:18:33PM -1 points

it is simply postulated that "the lesion causes smoking without being observed" without any explanation of how

No mathematical decision theory requires verbal explanations to be part of the model that it operates on. (It's true that when learning a causal model from data, you need causal assumptions; but when a problem provides the model rather than the data, this is not necessary.)

it is generally assumed that the correlation somehow still applies when you're deciding what to do using EDT, which I personally have some doubt about

You have doubt that this is how EDT, as a mathematical algorithm, operates, or you have some doubt that this is a wise way to construct a decision-making algorithm?

If the second, this is why I think EDT is a subpar decision theory. It sees the world as a joint probability distribution and cannot distinguish correlation from causation, which means it cannot tell whether a given correlation still applies to any particular action it takes (and so it assumes that they all do).

If the first, I'm not sure how to clear up your confusion. There is a mindset that programming cultivates, which is that the system does exactly what you tell it to, with the corollary that your intentions have no weight.

If I'm trying to argue that EDT doesn't normally break, then presenting a situation where it does break isn't necessarily proper LCPW.

The trouble with the LCPW (Least Convenient Possible World) is that it's asymmetric; Eliezer claims that the LCPW is the one where his friend has to face a moral question, and Eliezer's friend might claim that the LCPW is the one where Eliezer has to face a practical problem.

The way to break the asymmetry is to try to find the most informative comparison. If the hypothetical has been fought, then we learn nothing about morality, because there is no moral problem. If the hypothetical is accepted despite faults, then we learn quite a bit about morality.

The issues with EDT might require 'edge cases' to make obvious, but in the same way that the issues with Newtonian dynamics might require 'edge cases' to make obvious.

Comment author: nshepperd 18 July 2013 12:48:19AM 0 points

No mathematical decision theory requires verbal explanations to be part of the model that it operates on. (It's true that when learning a causal model from data, you need causal assumptions; but when a problem provides the model rather than the data, this is not necessary.)

What I'm saying is that the only way to solve any decision theory problem is to learn a causal model from data. It just doesn't make sense to postulate particular correlations between an EDT agent's decisions and other things before you even know what EDT decides! The only reason you get away with assuming graphs like lesion -> (CDT Agent) -> action for CDT is that the first thing CDT does when calculating a decision is break all connections to parents by means of do(...).
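
A minimal sketch of that difference (all numbers are illustrative assumptions, not from the thread): conditioning on the action updates beliefs about the lesion through the lesion -> action edge, while do(action) severs that edge and leaves the lesion marginal untouched.

```python
# Toy graph lesion -> action, with an assumed historical P(action | lesion).
P_LESION = {True: 0.1, False: 0.9}   # P(lesion)
P_ACT = {True: 0.9, False: 0.2}      # P(smoke | lesion), from historical data

def p_lesion_given_action(action=True):
    """Observational conditioning: seeing the action updates beliefs
    about the lesion via Bayes on the joint P(lesion, action)."""
    joint = {l: P_LESION[l] * (P_ACT[l] if action else 1 - P_ACT[l])
             for l in (True, False)}
    z = sum(joint.values())
    return {l: p / z for l, p in joint.items()}

def p_lesion_under_do():
    """Interventional do(action): the lesion -> action edge is cut,
    so the lesion marginal is unchanged by the choice of action."""
    return dict(P_LESION)

# Conditioning moves P(lesion) from 0.10 up to 1/3; do(action) leaves it at 0.10.
```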

Take Jiro's example. The lesion makes people jump into volcanoes. 100% of them, and no-one else. Furthermore, I'll postulate that all of them are using decision theory "check if I have the lesion, if so, jump into a volcano, otherwise don't". Should you infer the causal graph lesion -> (EDT decision: jump?) -> die with a perfect correlation between lesion and jump? (Hint: no, that would be stupid, since we're not using jump-based-on-lesion-decision-theory, we're using EDT.)

There is a mindset that programming cultivates, which is that the system does exactly what you tell it to, with the corollary that your intentions have no weight.

In programming, we also say "garbage in, garbage out". You are feeding EDT garbage input by giving it factually wrong joint probability distributions.

Comment author: IlyaShpitser 18 July 2013 02:03:32AM 4 points

Ok, what about cases where there are multiple causal hypotheses that are observationally indistinguishable:

a -> b -> c

vs

a <- b <- c

Both models imply the same joint probability distribution p(a,b,c) with a single conditional independence (a independent of c given b) and cannot be told apart without experimentation. That is, you cannot call p(a,b,c) "factually wrong" because the correct causal model implies it. But the wrong causal model implies it too! To figure out which is which requires causal information. You can give it to EDT and it will work -- but then it's not EDT anymore.
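
A quick numeric check of that claim (parameters are arbitrary assumptions): build the joint from the forward chain a -> b -> c, then rebuild it with the reversed factorization P(c)P(b|c)P(a|b). The two agree exactly, precisely because a is independent of c given b, so no amount of observational data distinguishes the two causal structures.

```python
from itertools import product

# Assumed illustrative parameters for the forward chain a -> b -> c.
Pa = {0: 0.3, 1: 0.7}                                  # P(a)
Pb_a = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}      # P(b | a)
Pc_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.25, 1: 0.75}}    # P(c | b)

# Joint implied by the forward chain.
joint = {(a, b, c): Pa[a] * Pb_a[a][b] * Pc_b[b][c]
         for a, b, c in product((0, 1), repeat=3)}

def marginal(project):
    """Sum the joint over everything except the projected variables."""
    out = {}
    for (a, b, c), p in joint.items():
        k = project(a, b, c)
        out[k] = out.get(k, 0.0) + p
    return out

Pc = marginal(lambda a, b, c: c)
Pb = marginal(lambda a, b, c: b)
Pbc = marginal(lambda a, b, c: (b, c))
Pab = marginal(lambda a, b, c: (a, b))

# Reversed chain a <- b <- c: P(c) P(b|c) P(a|b). This reproduces the
# joint exactly because a is independent of c given b.
joint_rev = {(a, b, c): Pc[c] * (Pbc[(b, c)] / Pc[c]) * (Pab[(a, b)] / Pb[b])
             for a, b, c in product((0, 1), repeat=3)}

max_gap = max(abs(joint[k] - joint_rev[k]) for k in joint)
```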

I can give you a graph which implies the same independences as my HAART example but has a completely different causal structure, and the procedure you propose here:

http://lesswrong.com/lw/hwq/evidential_decision_theory_selection_bias_and/9d6f

will give the right answer in one case and the wrong answer in another.

The point is, EDT lacks a rich enough input language to avoid getting garbage inputs in lots of standard cases. Or, more precisely, EDT lacks a rich enough input language to tell when input is garbage and when it isn't. This is why EDT is a terrible decision theory.

Comment author: Vaniver 18 July 2013 06:09:49AM 0 points

What I'm saying is that the only way to solve any decision theory problem is to learn a causal model from data.

I think there are a couple of confusions this sentence highlights.

First, there are approaches to solving decision theory problems that don't use causal models. Part of what has made this conversation challenging is that there are several different ways to represent the world, and so even if CDT is the best or most natural one, it needs to be distinguished from other approaches. EDT is not CDT in disguise; the two are distinct formulas / approaches.

Second, there are good reasons to modularize the components of the decision theory, so that you can treat learning a model from data separately from making a decision given a model. An algorithm to turn models into decisions should be able to operate on an arbitrary model, where it sees a -> b -> c as isomorphic to Drunk -> Fall -> Death.

To tell an anecdote, when my decision analysis professor would teach that subject to petroleum engineers, he quickly learned not to use petroleum examples. Say something like "suppose the probability of striking oil by drilling a well here is 40%" and an engineer's hand will shoot up, asking "what kind of rock is it?". The kind of rock is useful for determining whether the probability is 40% or something else, but the question totally misses the point of what the professor is trying to teach. The primary example he uses is choosing a location for a party subject to the uncertainty of the weather.

It just doesn't make sense to postulate particular correlations between an EDT agent's decisions and other things before you even know what EDT decides!

I'm not sure how to interpret this sentence.

The way EDT operates is to perform the following three steps for each possible action in turn:

  1. Assume that I saw myself doing X.
  2. Perform a Bayesian update on this new evidence.
  3. Calculate and record my utility.

It then chooses the possible action which had the highest calculated utility.
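
Those three steps can be sketched directly (the smoking-lesion joint and utilities below are illustrative assumptions, not from the thread). Fed a joint distribution encoding the historical lesion-smoking correlation, this procedure declines to smoke, which is the standard complaint about EDT on the smoking lesion problem.

```python
# Assumed historical model: P(lesion) and P(smoke | lesion).
P_LESION = 0.1
P_SMOKE = {True: 0.9, False: 0.2}

# Joint P(lesion, smoke) encoding the historical correlation.
joint = {(l, s): (P_LESION if l else 1 - P_LESION)
               * (P_SMOKE[l] if s else 1 - P_SMOKE[l])
         for l in (True, False) for s in (True, False)}

def utility(lesion, smoke):
    # Illustrative: the lesion is very bad, smoking mildly enjoyable.
    return (-100 if lesion else 0) + (1 if smoke else 0)

def edt_choice(joint, utility, actions):
    best, best_u = None, float("-inf")
    for a in actions:
        # Steps 1-2: Bayesian update on "I saw myself doing a".
        norm = sum(p for (l, act), p in joint.items() if act == a)
        # Step 3: expected utility under the updated distribution.
        eu = sum(p / norm * utility(l, act)
                 for (l, act), p in joint.items() if act == a)
        if eu > best_u:
            best, best_u = a, eu
    return best

# With these numbers EDT returns False (don't smoke): conditioning on
# smoking raises P(lesion) from 0.10 to 1/3, dragging the utility down.
```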

One interpretation is that you're saying EDT doesn't make sense, but I'm not sure I agree with what seems to be the stated reason. It looks to me like you're saying "it doesn't make sense to assume that you do X until you know what you decide!", when I think that does make sense, but the problem is using that assumption as Bayesian evidence as if it were an observation.

Comment author: pengvado 18 July 2013 10:03:12AM 1 point

The way EDT operates is to perform the following three steps for each possible action in turn:

  1. Assume that I saw myself doing X.
  2. Perform a Bayesian update on this new evidence.
  3. Calculate and record my utility.

Ideal Bayesian updates assume logical omniscience, right? Including knowledge of the logical fact of what EDT would do for any given input. If you know that you are an EDT agent, and condition on all of your past observations and also on the fact that you do X, but X is not in fact what EDT does given those inputs, then as an ideal Bayesian you will know that you're conditioning on something impossible. More generally, the update you perform in step 2 depends on EDT's input-output map, making the definition circular.

So, is EDT really underspecified? Or are you supposed to search for a fixed point of the circular definition, if there is one? Or does it use some method other than Bayes for the hypothetical update? Or does an EDT agent really break if it ever finds out its own decision algorithm? Or did I totally misunderstand?

Comment author: Vaniver 18 July 2013 05:27:04PM 0 points

Ideal Bayesian updates assume logical omniscience, right? Including knowledge about logical fact of what EDT would do for any given input.

Note that step 1 is "Assume that I saw myself doing X," not "Assume that EDT outputs X as the optimal action." I believe that excludes any contradictions along those lines. Does logical omniscience preclude imagining counterfactual worlds?

Comment author: pengvado 19 July 2013 03:14:10AM 1 point

If I already know "I am EDT", then "I saw myself doing X" does imply "EDT outputs X as the optimal action". Logical omniscience doesn't preclude imagining counterfactual worlds, but imagining counterfactual worlds is a different operation than performing Bayesian updates. CDT constructs counterfactuals by severing some of the edges in its causal graph and then assuming certain values for the nodes that no longer have any causes. TDT does too, except with a different graph and a different choice of edges to sever.

Comment author: nshepperd 18 July 2013 03:22:29PM -1 points

I don't know how I can fail to communicate so consistently.

Yes, you can technically apply "EDT" to any causal model or (more generally) joint probability distribution containing an "EDT agent decision" node. But in practice this freedom is useless, because to derive an accurate model you generally need to take account of a) the fact that the agent is using EDT and b) any observations the agent does or does not make. To be clear, the input EDT requires is a probabilistic model describing the EDT agent's situation (not a description of historical data from "similar" situations).

There are people here trying to argue against EDT by taking a model describing historical data (such as people following dumb decision theories jumping into volcanoes) and feeding this model directly into EDT, which is simply wrong. A model that describes the historical behaviour of agents using some other decision theory does not, in general, accurately describe an EDT agent in the same situation.

The fact that this egregious mistake looks perfectly normal is an artifact of the fact that CDT doesn't care about causal parents of the "CDT decision" node.

Comment author: Vaniver 18 July 2013 06:23:21PM -1 points

I don't know how I can fail to communicate so consistently.

I suspect it's because what you are referring to as "EDT" is not what experts in the field use that technical term to mean.

nshepperd-EDT is, as far as I can tell, the second half of CDT. Take a causal model and use the do() operator to create the manipulated subgraph that would result from taking each possible action (as an intervention). Determine the joint probability distribution from the manipulated subgraph. Condition on observing that action with the joint probability distribution, and calculate the probability-weighted mean utility of the possible outcomes. This is isomorphic to CDT, and so referring to it as EDT leads to confusion.
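
If this reading is right, the two-stage procedure can be sketched as follows (toy smoking-lesion numbers and utilities assumed, not from the thread). In the manipulated subgraph the action node has no parents, so conditioning on observing the action is a no-op and the answer matches plain CDT.

```python
P_LESION = 0.1  # assumed marginal P(lesion)

def utility(lesion, smoke):
    # Illustrative: the lesion is very bad, smoking mildly enjoyable.
    return (-100 if lesion else 0) + (1 if smoke else 0)

def cdt_choice(actions):
    """CDT: expected utility under do(action); the lesion marginal
    is untouched by the intervention."""
    def eu(a):
        return sum((P_LESION if l else 1 - P_LESION) * utility(l, a)
                   for l in (True, False))
    return max(actions, key=eu)

def do_then_condition_choice(actions):
    """The two-stage procedure: build the manipulated joint for do(action),
    then condition on observing that action. The action is exogenous in
    the manipulated graph, so the conditioning step changes nothing."""
    best, best_u = None, float("-inf")
    for a in actions:
        manipulated = {(l, a): (P_LESION if l else 1 - P_LESION)
                       for l in (True, False)}
        norm = sum(manipulated.values())  # = 1: conditioning is trivial
        eu = sum(p / norm * utility(l, act)
                 for (l, act), p in manipulated.items())
        if eu > best_u:
            best, best_u = a, eu
    return best

# Both procedures choose to smoke (True): expected utility -9.0 beats -10.0.
```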

Comment author: nshepperd 18 July 2013 11:53:26PM 0 points

Whatever. I give up.