Comment author: IlyaShpitser 20 October 2015 05:43:36PM *  2 points [-]

I think these are isomorphic, estimating E[Y] if Y is missing at random conditional on C is the same as estimating E[Y | do(a)] = E[Y | "we assign you to a given C"].

"Causal inference is a missing data problem, and missing data is a causal inference problem."


Or I may be "missing" something. :)

Comment author: snarles 20 October 2015 06:38:32PM *  1 point [-]

Yes, I think you are missing something (although it is true that causal inference is a missing data problem).

It may be easier to think in terms of the potential outcomes model. Y0 is the outcome of no treatment, Y1 is the outcome of treatment, and you only ever observe either Y0 or Y1, depending on whether D=0 or 1. Generally you are trying to estimate E[Y1] or E[Y0] or their difference.

The point is that the quantity Robins and Wasserman are trying to estimate, E[Y], does not depend on the importance sampling distribution, whereas the quantity I am trying to estimate, E[Y|f(X)], does depend on f. Changing f changes the population quantity to be estimated.

It is true that sometimes people in causal inference are interested in estimating things like E[Y1 - Y0|D], e.g. "the treatment effect on the treated." However, this is still different from my setup because D is a random variable, as opposed to an arbitrary function of the known variables like f(X).

Comment author: IlyaShpitser 19 October 2015 10:45:04PM *  3 points [-]

OP will correct me if I am wrong, but I think he is trying to restate the Robins/Wasserman example. You do not need to model f(X), but the point of that example is that you know f, but the conditional model for Y is very very complicated. So you either do a Bayesian approach with a prior and a likelihood for Y, or you just use Horvitz-Thompson with f.

I like to think of that example using causal inference: you want to estimate the causal effect p(Y | do(A)) of A on Y when the policy for assigning treatment A: p(A | C) is known exactly, but p(Y | A, C) is super complex. Likelihood-based methods like being Bayesian will use \sum_C p(Y | A, C) p(C). But you can just look at \sum_{samples i} Y_i 1/p(A | C) to get the same thing and avoid modeling p(Y | A,C). But doing that isn't Bayesian.
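To make the contrast concrete, here is a minimal simulation sketch of the inverse-probability-weighted (Horvitz-Thompson style) estimate of E[Y | do(A=1)]. Everything specific in it is an assumption made for illustration: the logistic `propensity`, the shape of `outcome_prob`, and the uniform distribution of C are all hypothetical. The weighted sum is normalized by the sample size and restricted to units that actually received A=1. The estimator only ever calls `propensity`, never `outcome_prob`.

```python
import math
import random

def propensity(c):
    """Known assignment policy p(A=1 | C=c). Hypothetical logistic form."""
    return 1.0 / (1.0 + math.exp(-2.0 * c))

def outcome_prob(a, c):
    """p(Y=1 | A=a, C=c). Hypothetical stand-in for the 'super complex'
    outcome model; used only to generate data, never by the estimator."""
    return 0.25 + 0.2 * a + 0.2 * (c + 1.0) + 0.1 * math.sin(7.0 * c)

def simulate(n, seed=0):
    """Draw (C, A, Y) triples with C ~ Uniform(-1, 1) (an assumption)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.uniform(-1.0, 1.0)
        a = 1 if rng.random() < propensity(c) else 0
        y = 1 if rng.random() < outcome_prob(a, c) else 0
        rows.append((c, a, y))
    return rows

def ipw_estimate(rows):
    """Horvitz-Thompson estimate of E[Y | do(A=1)] using the known policy."""
    total = sum(y / propensity(c) for c, a, y in rows if a == 1)
    return total / len(rows)

def naive_estimate(rows):
    """Plain average of Y among the treated; ignores how A was assigned."""
    ys = [y for _, a, y in rows if a == 1]
    return sum(ys) / len(ys)

rows = simulate(200_000)
ipw = ipw_estimate(rows)
naive = naive_estimate(rows)
print(f"IPW estimate:   {ipw:.3f}")
print(f"naive estimate: {naive:.3f}")
```

In this toy setup the interventional mean works out to 0.65 by symmetry of C, and the naive average among the treated is pulled upward because units with large C are both more likely to be treated and more likely to have Y=1.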

See also this:

http://www.biostat.harvard.edu/robins/coda.pdf

I think we talked about this before.

Comment author: snarles 20 October 2015 05:22:47PM 2 points [-]

My example is very similar to the Robins/Wasserman example, but you end up drawing different conclusions. Robins/Wasserman show that you can't make sense of importance sampling in a Bayesian framework. My example shows that you can't make sense of "conditional sampling" in a Bayesian framework. The goal of importance sampling is to estimate E[Y], while the goal of conditional sampling is to estimate E[Y|event] for some event.

We did talk about this before, that's how I first learnt of the R/W example.

Comment author: RichardKennaway 19 October 2015 10:30:46PM 1 point [-]

There are a couple of things I'm not understanding here.

Firstly, the example of the cancer survival test seems to have some inconsistency. The fitted model is said to give the right answer in 990 out of 1000 test cases. Where do you subsequently get the Beta(1000,2) distribution from? I am not seeing the source of that 2. And given that the model is right on exactly 99% of the test cases, how is the imaginary Bayesian coming up with a clearly wrong interval [0.996,0.9998]?

Secondly, in the later example of estimating E[ Y | f(X)=1 ], the method foisted on the Bayesian appears to involve estimating the whole of the function f. This seems to me an obviously misguided approach to the problem, whatever one's views on statistical argument. Why cannot the Bayesian say, with the frequentist, it doesn't matter what f is, I have been asked about the population for which f(X)=1. I do not need to model the process f by which that population was selected, only the behaviour of Y within that population? And then proceed in the usual way.

Comment author: snarles 19 October 2015 10:43:23PM *  2 points [-]

I do not need to model the process f by which that population was selected, only the behaviour of Y within that population?

There are some (including myself and presumably some others on this board) who see this practice as epistemologically dubious. First, how do you decide which aspects of the problem to incorporate into your model? Why should one only try to model E[Y|f(X)=1] and not the underlying function g(x)=E[Y|x]? If you actually had very strong prior information about g(x), say that "I know g(x)=h(x) with probability 1/2 or g(x) = j(x) with probability 1/2" where h(x) and j(x) are known functions, then in that case most statisticians would incorporate the underlying function g(x) in the model; and in that case, data for observations with f(X)=0 might be informative for whether g(x) = h(x) or g(x) = j(x). So if the prior is weak (as it is in my main post) you don't model the function, and if the prior is strong, you model the function (and therefore make use of all the observations)? Where do you draw the line?

I agree, most statisticians would not model g(x) in the cancer example. But is that because they have limited time and resources (and are possibly lazy) and because using an overcomplicated model would confuse their audience, anyways? Or because they legitimately think that it's an objective mistake to use a model involving g(x)?

Comment author: gjm 19 October 2015 10:21:35PM 2 points [-]

Beta(1000,2)

Was that meant to be Beta(1000,10)? (With appropriately updated probabilities as a result?)

Comment author: snarles 19 October 2015 10:36:12PM 1 point [-]

Good catch, it should be Beta(991, 11). The prior is uniform = Beta(1, 1) and the data is (990 successes, 10 failures).

The trouble with Bayes (draft)

10 snarles 19 October 2015 08:50PM

Prerequisites

This post requires some knowledge of Bayesian and Frequentist statistics, as well as probability. It is intended to explain one of the more advanced concepts in statistical theory--Bayesian non-consistency--to non-statisticians, and although the level required is much less than would be required to read some of the original papers on the topic[1], some considerable background is still required.

The Bayesian dream

Bayesian methods are enjoying a well-deserved growth of popularity in the sciences. However, most practitioners of Bayesian inference, including most statisticians, see it as a practical tool. Bayesian inference has many desirable properties for a data analysis procedure: it allows for intuitive treatment of complex statistical models, which include models with non-iid data, random effects, high-dimensional regularization, covariance estimation, outliers, and missing data. Problems which have been the subject of Ph. D. theses and entire careers in the Frequentist school, such as mixture models and the many-armed bandit problem, can be satisfactorily handled by introductory-level Bayesian statistics.

A more extreme point of view, the flavor of subjective Bayes best exemplified by Jaynes' famous book [2], and also by a sizable contingent of philosophers of science, elevates Bayesian reasoning to the methodology for probabilistic reasoning, in every domain, for every problem. One merely needs to encode one's beliefs as a prior distribution, and Bayesian inference will yield the optimal decision or inference.

To a philosophical Bayesian, the epistemological grounding of most statistics (including "pragmatic Bayes") is abysmal. The practice of data analysis is either dictated by arbitrary tradition and protocol on the one hand, or consists of users creatively employing a diverse "toolbox" of methods justified by a diverse mixture of incompatible theoretical principles like the minimax principle, invariance, asymptotics, maximum likelihood or *gasp* "Bayesian optimality." The result: a million possible methods exist for any given problem, and a million interpretations exist for any data set, all depending on how one frames the problem. Given one million different interpretations for the data, which one should *you* believe?

Why the ambiguity? Take the textbook problem of determining whether a coin is fair or weighted, based on the data obtained from, say, flipping it 10 times. Keep in mind, a principled approach to statistics decides the rule for decision-making before you see the data. So, what rule would you use for your decision? One rule is, "declare it's weighted, if either 10/10 flips are heads or 0/10 flips are heads." Another rule is, "always declare it to be weighted." Or, "always declare it to be fair." All in all, there are 11 possible outcomes (supposing we only care about the total number of heads, which can be anything from 0 to 10) and therefore there are 2^11 = 2048 possible decision rules. We can probably rule out most of them as nonsensical, like, "declare it to be weighted if 5/10 are heads, and fair otherwise" since 5/10 seems like the fairest outcome possible. But among the remaining possibilities, there is no obvious way to choose the "best" rule. After all, the performance of the rule, defined as the probability you will make the correct conclusion from the data, depends on the unknown state of the world, i.e. the true probability of flipping heads for that particular coin.

The Bayesian approach "cuts" the Gordian knot of choosing the best rule, by assuming a prior distribution over the unknown state of the world. Under this prior distribution, one can compute the average performance of any decision rule, and choose the best one. For example, suppose your prior is that with probability 99.9999%, the coin is fair. Then the best decision rule would be to "always declare it to be fair!"
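The coin example can be sketched in a few lines. With 10 flips the head count can be 0 through 10, so a decision rule is a choice of verdict for each of those 11 totals. The sketch below needs one assumption that the text does not make: a "weighted" coin is modeled as having a heads probability drawn uniformly from [0, 1], which makes every head count equally likely (probability 1/11) under the weighted hypothesis. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb

N_FLIPS = 10

def likelihood_fair(k):
    """P(k heads in 10 flips | fair coin)."""
    return Fraction(comb(N_FLIPS, k), 2 ** N_FLIPS)

def likelihood_weighted(k):
    """P(k heads | weighted coin), ASSUMING the bias is Uniform(0, 1):
    the binomial integrated against a uniform bias gives 1/11 for every k."""
    return Fraction(1, N_FLIPS + 1)

def bayes_rule(prior_fair):
    """Best rule under the prior: rule[k] is True when the verdict on
    seeing k heads is 'weighted'."""
    rule = []
    for k in range(N_FLIPS + 1):
        mass_fair = prior_fair * likelihood_fair(k)
        mass_weighted = (1 - prior_fair) * likelihood_weighted(k)
        rule.append(mass_weighted > mass_fair)
    return tuple(rule)

def accuracy(rule, prior_fair):
    """Prior-averaged probability that the rule reaches the right verdict."""
    correct = Fraction(0)
    for k, declare_weighted in enumerate(rule):
        if declare_weighted:
            correct += (1 - prior_fair) * likelihood_weighted(k)
        else:
            correct += prior_fair * likelihood_fair(k)
    return correct

n_rules = 2 ** (N_FLIPS + 1)  # one yes/no verdict per possible head count
skeptic = bayes_rule(Fraction(999999, 1000000))
even = bayes_rule(Fraction(1, 2))

print("number of decision rules:", n_rules)
print("99.9999% prior, declare weighted at:", [k for k, d in enumerate(skeptic) if d])
print("50% prior, declare weighted at:", [k for k, d in enumerate(even) if d])
```

Under the 99.9999% prior no head count is extreme enough to overturn the verdict "fair", while under an even prior the best rule flags only the lopsided totals. Changing the assumed model for a weighted coin would change which rule comes out best, which is the point about model choice.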

The Bayesian approach gives you the optimal decision rule for the problem, as soon as you come up with a model for the data and a prior for your model. But when you are looking at data analysis problems in the real world (as opposed to a probability textbook), the choice of model is rarely unambiguous. Hence, for me, the standard Bayesian approach does not go far enough--if there are a million models you could choose from, you still get a million different conclusions as a Bayesian.

Hence, one could argue that a "pragmatic" Bayesian who thinks up a new model for every problem is just as epistemologically suspect as any Frequentist. Only with the strongest form of subjective Bayesianism can one escape this ambiguity. The dream of the subjective Bayesian is to start out in life with a single model. A single prior. For the entire world. This "world prior" would contain the entirety of one's own life experience, and the grand total of human knowledge. Surely, writing out this prior is impossible. But the point is that a true Bayesian must behave (at least approximately) as if they were driven by such a universal prior. In principle, having such a universal prior (at least conceptually) solves the problem of choosing models and priors for problems: the priors and models you choose for particular problems are determined by the posterior of your universal prior. For example, why did you decide on a linear model for your economics data? It's because according to your universal posterior, your particular economic data is well-described by such a model with high probability.

The main practical consequence of the universal prior is that your inferences in one problem should be consistent with your inferences in another, related problem. Even if the subjective Bayesian never writes out a "grand model", their integrated approach to data analysis for related problems still distinguishes their approach from the piecemeal approach of frequentists, who tend to treat each data analysis problem as if it occurs in an isolated universe. (So I claim, though I cannot point to any real example of such a subjective Bayesian.)

Yet, even if the subjective Bayesian ideal could be realized, many philosophers of science (e.g. Deborah Mayo) would consider it just as ambiguous as non-Bayesian approaches, since even if you have an unambiguous procedure for forming personal priors, your priors are still going to differ from mine. I don't consider this a defect, since my worldview necessarily does differ from yours. My ultimate goal is to make the best decision for myself. That said, such egocentrism, even if rationally motivated, may indeed be poorly suited for a collaborative enterprise like science.

For me, a far more troublesome objection to the "Bayesian dream" is the question, "How would you actually go about constructing this prior that represents all of your beliefs?" Looking in the Bayesian literature, one does not find any convincing examples of any user of Bayesian inference managing to actually encode all (or even a tiny portion) of their beliefs in the form of the prior--in fact, for the most part, we see alarmingly little thought or justification being put into the construction of the priors.

Nevertheless, I myself remained one of these "hardcore Bayesians", at least from a philosophical point of view, ever since I started learning about statistics. My faith in the "Bayesian dream" persisted even after spending three years in the Ph. D. program at Stanford (a department with a heavy bias towards Frequentism) and even after I personally started doing research in frequentist methods. (I see frequentist inference as a poor man's approximation for the ideal Bayesian inference.) Though I was aware of the Bayesian non-consistency results, I largely dismissed them as mathematical pathologies. And while we were still a long way from achieving universal inference, I held the optimistic view that improved technology and theory might one day finally make the "Bayesian dream" achievable. However, I could not find a way to ignore one particular example on Wasserman's blog[3], due to its relevance to very practical problems in causal inference. Eventually I thought of an even simpler counterexample, which devastated my faith in the possibility of constructing a universal prior. Perhaps a fellow Bayesian can find a solution to this quagmire, but I am not holding my breath.

The root of the problem is the extreme degree of ignorance we have about our world, the degree of surprisingness of many true scientific discoveries, and the relative ease with which we accept these surprises. If we consider this behavior rational (which I do), then the subjective Bayesian is obligated to construct a prior which captures this behavior. Yet, the diversity of possible surprises the model must be able to accommodate makes it practically impossible (if not mathematically impossible) to construct such a prior. The alternative is to reject all possibility of surprise, and refuse to update any faster than a universal prior would (extremely slowly), which strikes me as a rather poor epistemological policy.

In the rest of the post, I'll motivate my example, sketch out a few mathematical details (explaining them as best I can to a general audience), then discuss the implications.

Introduction: Cancer classification

Biology and medicine are currently adapting to the wealth of information we can obtain by using high-throughput assays: technologies which can rapidly read the DNA of an individual, measure the concentration of messenger RNA, metabolites, and proteins. In the early days of this "large-scale" approach to biology which began with the Human Genome Project, some optimists had hoped that such an unprecedented torrent of raw data would allow scientists to quickly "crack the genetic code." By now, any such optimism has been washed away by the overwhelming complexity and uncertainty of human biology--a complexity which has been made clearer than ever by the flood of data--and replaced with a sober appreciation that in the new "big data" paradigm, making a discovery becomes a much easier task than understanding any of those discoveries.

Enter the application of machine learning to this large-scale biological data. Scientists take these massive datasets containing patient outcomes, demographic characteristics, and high-dimensional genetic, neurological, and metabolic data, and analyze them using algorithms like support vector machines, logistic regression and decision trees to learn predictive models to relate key biological variables, "biomarkers", to outcomes of interest.

To give a specific example, take a look at this abstract from the Shipp et al. paper on predicting survival outcomes for cancer patients [4]:

Diffuse large B-cell lymphoma (DLBCL), the most common lymphoid malignancy in adults, is curable in less than 50% of patients. Prognostic models based on pre-treatment characteristics, such as the International Prognostic Index (IPI), are currently used to predict outcome in DLBCL. However, clinical outcome models identify neither the molecular basis of clinical heterogeneity, nor specific therapeutic targets. We analyzed the expression of 6,817 genes in diagnostic tumor specimens from DLBCL patients who received cyclophosphamide, adriamycin, vincristine and prednisone (CHOP)-based chemotherapy, and applied a supervised learning prediction method to identify cured versus fatal or refractory disease. The algorithm classified two categories of patients with very different five-year overall survival rates (70% versus 12%). The model also effectively delineated patients within specific IPI risk categories who were likely to be cured or to die of their disease. Genes implicated in DLBCL outcome included some that regulate responses to B-cell−receptor signaling, critical serine/threonine phosphorylation pathways and apoptosis. Our data indicate that supervised learning classification techniques can predict outcome in DLBCL and identify rational targets for intervention.

The term "supervised learning" refers to any algorithm for learning a predictive model for predicting some outcome Y (could be either categorical or numeric) from covariates or features X. In this particular paper, the authors used a relatively simple linear model (which they called "weighted voting") for prediction.

A linear model is fairly easy to interpret: it produces a single "score variable" via a weighted average of a number of predictor variables. Then it predicts the outcome (say "survival" or "no survival") based on a rule like, "Predict survival if the score is larger than 0." Yet, far more advanced machine learning models have been developed, including "deep neural networks" which are winning all of the image recognition and machine translation competitions at the moment. These "deep neural networks" are especially notorious for being difficult to interpret. Along with similarly complicated models, neural networks are often called "black box models": although you can get miraculously accurate answers out of the "box", peering inside won't give you much of a clue as to how it actually works.

Now it is time for the first thought experiment. Suppose a follow-up paper to the Shipp paper reports dramatically improved prediction for survival outcomes of lymphoma patients. The authors of this follow-up paper trained their model on a "training sample" of 500 patients, then used it to predict the five-year outcome of chemotherapy patients, on a "test sample" of 1000 patients. It correctly predicts the outcome ("survival" vs "no survival") on 990 of the 1000 patients.

Question 1: what is your opinion on the predictive accuracy of this model on the population of chemotherapy patients? Suppose that publication bias is not an issue (the authors of this paper designed the study in advance and committed to publishing) and suppose that the test sample of 1000 patients is "representative" of the entire population of chemotherapy patients.

Question 2: does your judgment depend on the complexity of the model they used? What if the authors used an extremely complex and counterintuitive model, and cannot even offer any justification or explanation for why it works? (Nevertheless, their peers have independently confirmed the predictive accuracy of the model.)

A Frequentist approach

The Frequentist answer to the thought experiment is as follows. The accuracy of the model is a probability p which we wish to estimate. The number of successes on the 1000 test patients is Binomial(1000, p). Based on the data, one can construct a confidence interval: say, we are 99% confident that the accuracy is above 83%. What does 99% confident mean? I won't try to explain, but simply say that in this particular situation, "I'm pretty sure" that the accuracy of the model is above 83%.

A Bayesian approach

The Bayesian interjects, "Hah! You can't explain what your confidence interval actually means!" He puts a uniform prior on the probability p. The posterior distribution of p, conditional on the data, is Beta(991, 11). This gives a 99% credible interval that p is in [0.978, 0.995]. You can actually interpret the interval in probabilistic terms, and it gives a much tighter interval as well. Seems like a Bayesian victory...?
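For readers who want to check the arithmetic, here is a rough sketch of the conjugate update and the credible interval. A uniform prior is Beta(1, 1); adding 990 successes and 10 failures to its two parameters gives Beta(991, 11). The interval below is approximated by Monte Carlo using only the Python standard library, so its endpoints carry some simulation noise; the helper name and the number of draws are arbitrary choices, not anything from the original analysis.

```python
import random

def beta_credible_interval(a, b, level=0.99, n_draws=200_000, seed=0):
    """Approximate equal-tailed credible interval for Beta(a, b),
    read off the empirical quantiles of simulated draws."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a, b) for _ in range(n_draws))
    tail = (1.0 - level) / 2.0
    lo = draws[int(tail * n_draws)]
    hi = draws[int((1.0 - tail) * n_draws) - 1]
    return lo, hi

successes, failures = 990, 10

# Conjugate update: Beta(1, 1) prior plus binomial data.
a_post = 1 + successes
b_post = 1 + failures

lo, hi = beta_credible_interval(a_post, b_post)
print(f"posterior: Beta({a_post}, {b_post})")
print(f"approximate 99% credible interval: [{lo:.4f}, {hi:.4f}]")
```

An exact quantile function (for instance from a statistics library) would remove the simulation noise; the Monte Carlo version is used here only to keep the sketch dependency-free.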

A subjective Bayesian approach

As I have argued before, a Bayesian approach which comes up with a model after hearing about the problem is bound to suffer from the same inconsistency and arbitrariness as any non-Bayesian approach. You might assume a uniform distribution for p in this problem... but what if yet another paper comes along with a similar prediction model? You would need a joint distribution for the current model and the new model. What if a theory comes along that could help explain the success of the current method? The parameter p might take a new meaning in this context.

So as a subjective Bayesian, I argue that slapping a uniform prior on the accuracy is the wrong approach. But I'll stop short of actually constructing a Bayesian model of the entire world: let's say we want to restrict our attention to this particular issue of cancer prediction. We want to model the dynamics behind cancer and cancer treatment in humans. Needless to say, the model is still ridiculously complicated. However, I don't think it's out of reach of the efforts of a well-funded, large collaborative effort of scientists.

Roughly speaking, the model can be divided into a distribution over theories of human biology, and conditional on the theory of biology, a coarse-grained model of an individual patient. The model would not include every cell, every molecule, etc., but it would contain many latent variables in addition to the variables measured in any particular cancer study. Let's call the variables actually measured in the study, X, and also the survival outcome, Y.

Now here is the epistemologically correct way to answer the thought experiment. Take a look at the X's and Y's of the patients in the training and test set. Update your probabilistic model of human biology based on the data. Then take a look at the actual form of the classifier: it's a function f() mapping X's to Y's. The accuracy of the classifier is no longer a parameter: it's a quantity Pr[f(X) = Y] which has a distribution under your posterior. That is, for any given "theory of human biology", Pr[f(X) = Y] has a fixed value: now, over the distribution of possible theories of human biology (based on the data of the current study as well as all previous studies and your own beliefs) Pr[f(X) = Y] has a distribution, and therefore, an average. But what will this posterior give you? Will you get something similar to the interval [0.978, 0.995] you got from the "practical Bayes" approach?

Who knows? But I would guess in all likelihood not. My guess is that you would get a very different interval from [0.978, 0.995], because in this complex model there is no direct link between the empirical success rate of prediction and the quantity Pr[f(X) = Y]. But my intuition for this fact comes from the following simpler framework.

A non-parametric Bayesian approach

Instead of reasoning about a grand Bayesian model of biology, I now take a middle ground, and suggest that while we don't need to capture the entire latent dynamics of cancer, we should at the very least try to include the X's and the Y's in the model, instead of merely abstracting the whole experiment as a Binomial trial (as did the frequentist and pragmatic Bayesian). Hence we need a prior over joint distributions of (X, Y). And yes, I do mean a prior distribution over probability distributions: we are saying that (X, Y) has some unknown joint distribution, which we treat as being drawn at random from a large collection of distributions. This is therefore a non-parametric Bayes approach: the term non-parametric means that the number of the parameters in the model is not finite.

Since in this case Y is a binary outcome, a joint distribution can be decomposed as a marginal distribution over X, and a function g(x) giving the conditional probability that Y=1 given X=x. The marginal distribution is not so interesting or important for us, since it simply reflects the composition of the population of patients. For the purpose of this example, let us say that the marginal is known (e.g., a finite distribution over the population of US cancer patients). What we want to know is the probability of patient survival, and this is given by the function g(X) for the particular patient's X. Hence, we will mainly deal with constructing a prior over g(X).
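Written out, and assuming as above that the marginal is known and lives on a finite population, the decomposition and the quantity of interest look as follows. The symbol p(x) for the marginal is notation introduced here for illustration.

```latex
\begin{align*}
\Pr[X = x,\ Y = y] &= p(x)\, g(x)^{y}\, \bigl(1 - g(x)\bigr)^{1 - y}, \qquad y \in \{0, 1\}, \\
g(x) &= \Pr[Y = 1 \mid X = x] = E[Y \mid X = x], \\
E[Y] &= \sum_{x} p(x)\, g(x) = E[g(X)].
\end{align*}
```

So a prior over the function g, combined with the known p(x), induces a prior over population-level quantities such as E[Y].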

To construct a prior, we need to think of intuitive properties of the survival probability function g(x). If x is similar to x', then we expect the survival probabilities to be similar. Hence the prior on g(x) should be over random, smooth functions. But we need to choose the smoothness so that the prior does not consist of almost-constant functions. Suppose for now that we choose a particular class of smooth functions (e.g. functions with a certain Lipschitz norm) and choose our prior to be uniform over functions of that smoothness. We could go further and put a prior on the smoothness hyperparameter, but for now we won't.

Now, although I assert my faithfulness to the Bayesian ideal, I still want to think about how whatever prior we choose would allow us to answer some simple thought experiments. Why is that? I hold that the ideal Bayesian inference should capture and refine what I take to be "rational behavior." Hence, if a prior produces irrational outcomes, I reject that prior as not reflecting my beliefs.

Take the following thought experiment: we simply want to estimate the expected value of Y, E[Y]. Hence, we draw 100 patients independently with replacement from the population and record their outcomes: suppose the sum is 80 out of 100. The Frequentist (and pragmatic Bayesian) would end up concluding that with high probability/confidence/whatever, the expected value of Y is around 0.8, and I would hold that an ideal rationalist would come up with a similar belief. But what would our non-parametric model say? We would draw a random function g(x) conditional on our particular observations: we get a quantity E[g(X)] for each instantiation of g(x): the distribution of E[g(X)]'s over the posterior allows us to make credible intervals for E[Y].

But what do we end up getting? Either one of two things happens. Either you choose too little smoothness, and E[g(X)] ends up concentrating at around 0.5, no matter what data you put into the model. This is the phenomenon of Bayesian non-consistency, and a detailed explanation can be found in several of the listed references: but to put it briefly, sampling at a few isolated points gives you too little information on the rest of the function. This example is not as pathological as the ones used in the literature: if you sample infinitely many points, you will eventually get the posterior to concentrate around the true value of E[Y], but all the same, the convergence is ridiculously slow. Alternatively, use a super-high smoothness, and the posterior of E[g(X)] has a nice interval around the sample value just like in the Binomial example. But now if you look at your posterior draws of g(x), you'll notice the functions are basically constants. Putting a prior on smoothness doesn't change things: the posterior on smoothness doesn't change, since you don't actually have enough data to determine the smoothness of the function. The posterior average of E[g(X)] is no longer always 0.5: it gets a little bit affected by the data, since within the 10% mass of the posterior corresponding to the smooth prior, the average of E[g(X)] is responding to the data. But you are still almost as slow as before in converging to the truth.
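The "too little smoothness" failure can be illustrated with a deliberately extreme toy model. The sketch assumes a prior with no smoothness at all: g(x) is independently Uniform(0, 1) at each of N population points, with a uniform marginal over those points. This is an assumed stand-in for the low-smoothness case, not the Lipschitz-class prior described above, and it also assumes the 100 sampled patients are distinct. Under these assumptions each observed point updates to Beta(2, 1) or Beta(1, 2), and every unobserved point stays at its prior mean of 1/2.

```python
from fractions import Fraction

def posterior_mean_of_average(pop_size, n_success, n_failure):
    """Posterior mean of E[g(X)] = (1/N) * sum_x g(x) under the ASSUMED
    independent Uniform(0, 1) prior on each g(x), with distinct samples."""
    n_obs = n_success + n_failure
    if n_obs > pop_size:
        raise ValueError("cannot observe more distinct patients than exist")
    total = (
        n_success * Fraction(2, 3)               # Beta(2, 1) mean after Y = 1
        + n_failure * Fraction(1, 3)             # Beta(1, 2) mean after Y = 0
        + (pop_size - n_obs) * Fraction(1, 2)    # untouched prior mean
    )
    return total / pop_size

sample_proportion = Fraction(80, 100)
for pop_size in (100, 1_000, 100_000):
    post = posterior_mean_of_average(pop_size, 80, 20)
    print(f"N = {pop_size:>7}: sample proportion = {float(sample_proportion):.2f}, "
          f"posterior mean of E[g(X)] = {float(post):.4f}")
```

With a large population the 100 observations barely move the posterior mean away from 0.5, even though the sample proportion is 0.8. A prior with some smoothness lets each observation inform nearby points, which is where the trade-off described above comes in.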

At the time that I started thinking about the above "uniform sampling" example, I was still convinced of a Bayesian resolution. Obviously, using a uniform prior over smooth functions is too naive: you can tell by seeing that the prior distribution over E[g(X)] is already highly concentrated around 0.5. How about a hierarchical model, where first we draw a parameter p from the uniform distribution, and then draw g(x) from the uniform distribution over smooth functions with mean value equal to p? This gets you non-constant g(x) in the posterior, while your posteriors of E[g(X)] converge to the truth as quickly as in the Binomial example. Arguing backwards, I would say that such a prior comes closer to capturing my beliefs.

But then I thought, what about more complicated problems than computing E[Y]? What if you have to compute the expectation of Y conditional on some complicated function of X taking on a certain value: i.e. E[Y|f(X) = 1]? In the frequentist world, you can easily compute E[Y|f(X)=1] by rejection sampling: get a sample of individuals, average the Y's of the individuals whose X's satisfy f(X) = 1. But how could you formulate a prior that has the same property? For a finite collection of functions f, {f1,...,f100}, say, you might be able to construct a prior for g(x) so that the posterior for E[g(X)|fi(X) = 1] converges to the truth for every i in {1,...,100}. I don't know how to do so, but perhaps you know. But the frequentist intervals work for every function f! Can you construct a prior which can do the same?
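The rejection-sampling estimate is short enough to sketch directly. The population model here is invented for illustration: X is Uniform(0, 1), the survival probability `g` is a hypothetical linear function, and the selector `f` is a deliberately odd function chosen only to show that the estimator never looks inside it.

```python
import random

def g(x):
    """Hypothetical survival probability Pr[Y = 1 | X = x]."""
    return 0.2 + 0.6 * x

def f(x):
    """An arbitrary, deliberately strange selection function (illustrative)."""
    return 1 if (int(x * 1000) % 7 == 0 and x > 0.3) else 0

def draw_patient(rng):
    """Draw one (X, Y) pair from the assumed population."""
    x = rng.random()
    y = 1 if rng.random() < g(x) else 0
    return x, y

def conditional_mean(sample, selector):
    """Rejection sampling: keep patients with selector(X) == 1, average Y."""
    kept = [y for x, y in sample if selector(x) == 1]
    if not kept:
        raise ValueError("no sampled patient satisfies f(X) = 1")
    return sum(kept) / len(kept), len(kept)

rng = random.Random(0)
sample = [draw_patient(rng) for _ in range(200_000)]
estimate, n_kept = conditional_mean(sample, f)
print(f"kept {n_kept} of {len(sample)} patients")
print(f"estimate of E[Y | f(X) = 1]: {estimate:.3f}")
```

The same `conditional_mean` call works unchanged for any other selector, with accuracy governed only by how many sampled patients pass it. That generality is what the question about priors is asking a Bayesian model to match.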

I am happy to argue that a true Bayesian would not need consistency for every possible f in the mathematical universe. It is cool that frequentist inference works for such a general collection: but it may well be unnecessary for the world we live in. In other words, there may be functions f which are so ridiculous, that even if you showed me that empirically, E[Y|f(X)=1] = 0.9, based on data from 1 million patients, I would not believe that E[Y|f(X)=1] was close to 0.9. It is a counterintuitive conclusion, but one that I am prepared to accept.

Yet, the set of f's which are not so ridiculous, which in fact I might accept to be reasonable based on conventional science, may be so large as to render impossible the construction of a prior which could accommodate them all. But the Bayesian dream makes the far stronger demand that our prior not just capture our current understanding of science but also match the flexibility of rational thought. I hold that given the appropriate evidence, rationalists can be persuaded to accept truths which they could not even imagine beforehand. Thinking about how we could possibly construct a prior to mimic this behavior, the Bayesian dream seems distant indeed.

Discussion

To be updated later... perhaps responding to some of your comments!

 

[1] Diaconis and Freedman, "On the Consistency of Bayes Estimates"

[2] ET Jaynes, Probability: the Logic of Science

[3] https://normaldeviate.wordpress.com/2012/08/28/robins-and-wasserman-respond-to-a-nobel-prize-winner/

[4] Shipp et al. "Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning." Nature

Comment author: snarles 13 October 2015 06:55:28PM 2 points [-]

How do you get the top portion of the second payoff matrix from the first? Intuitively, it should be by replacing the Agent A's payoff with the sum of the agents' payoffs, but the numbers don't match.

Most people are altruists but only to their in-group, and most people have very narrow in-groups. What you mean by an altruist is probably someone who is both altruistic and has a very inclusive in-group. But as far as I can tell, there is a hard trade-off between belonging to a close-knit, small in-group and identifying with a large, diverse but weak in-group. The time you spend helping strangers is time taken away from potentially helping friends and family.

Unbounded linear utility functions?

-1 snarles 11 October 2015 11:30PM

The LW community seems to assume, by default, that "unbounded, linear utility functions are reasonable."  That is, if you value the existence of 1 swan at 1.5 utilons, then 10 swans should be worth 15, etc.

Yudkowsky in his post on scope insensitivity argues that nonlinearity of personal utility functions is a logical fallacy.

However, unbounded and linearly increasing utility functions lead to conundrums such as Pascal's Mugging.  A recent discussion topic on Pascal's Mugging suggests ignoring probabilities that are too small.  However, such extreme measures are not necessary if tamer utility functions are used: one imagines a typical personal utility function to be bounded and nonlinear.

In that recent discussion topic, V_V and I questioned the adoption of such an unbounded, linear utility function.  I would argue that nonlinearity of utility functions is not a logical fallacy.

To make my case clear, I will clarify my personal interpretation of utilitarianism.  Utility functions are mathematical constructs that can be used to model individual or group decision-making.  However, it is unrealistic to suppose that every individual actually has a utility function or even a preference ordering; at best, one could find a utility function which approximates the behavior of the individual.  This is confirmed by studies demonstrating the inconsistency of human preferences.  The decisions made by coordinated groups (e.g. corporate partners, citizens in a democracy, or the entire community of effective altruists) could also be more or less well-approximated by a utility function; presumably, the accuracy of the utility function model of decision-making depends on the cohesion of the group.  Utilitarianism, as proposed by Bentham and Mill, is an ethical framework based on some idealized utility function.  Rather than using utility functions to model group decision-making, Bentham and Mill propose to use some utility function to guide decision-making, in the form of an ethical theory.  It is important to distinguish these two different use-cases of utility functions, which might be termed descriptive utility and prescriptive utility.

But what is ethics?  I hold the hard-nosed position that moral philosophies (including utilitarianism) are human inventions which serve the purpose of facilitating large-scale coordination.  Another way of putting it is that moral philosophy is a manifestation of the limited superrationality that our species possesses.  [Side note: one might speculate that the intellectual aspect of human political behavior, of forming alliances based on shared ideals (including moral philosophies), is a memetic or genetic trait which propagated due to positive selection pressure: moral philosophy is necessary for the development of city-states and larger political entities, which in turn rose as the dominant form of social organization in our species.  But this is a separate issue from the discussion at hand.]

In this larger context, we are prepared to evaluate the relative worth of a moral philosophy, such as utilitarianism, against competing philosophies.  If the purpose of a moral philosophy is to facilitate coordination, then an effective moral philosophy is one that can actually hope to achieve that kind of coordination.  Utilitarianism is a good candidate for facilitating global-level coordination because it is conceptually simple, because most people can agree with its principles, and because it provides a clear framework for decision-making, provided that a suitable utility function can be identified, or at least that the properties of the "ideal utility function" can be debated.  Furthermore, utilitarianism and related consequentialist moralities are arguably better equipped to handle the tragedy of the commons than competing deontological theories.

And if we accept utilitarianism, and if our goal is to facilitate global coordination, we can go further and evaluate the properties of any proposed utility function by the same criterion as before: how well will the proposed utility function facilitate global coordination?  Will the proposed utility function find broad support among the key players in the global community?  Unbounded, linearly increasing utility functions clearly fail, because few people would support conclusions such as "it's worth spending all our resources to prevent a 0.001% chance that 1e100 human lives will be created and tortured."
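To make the arithmetic behind that quoted conclusion concrete, here is a rough sketch comparing a linear utility function with one bounded alternative.  Everything beyond the 0.001% and 1e100 figures is an assumption made for illustration: the saturating form `cap * (1 - exp(-lives / scale))`, the values of `cap` and `scale`, and the use of a present population of about 7e9 lives as a stand-in for the sure cost of "spending all our resources."

```python
import math

def linear_utility(lives, per_life=1.0):
    # Unbounded and linear: each additional life counts the same.
    return per_life * lives

def bounded_utility(lives, cap=1e10, scale=1e10):
    # Hypothetical saturating curve: roughly linear for lives << scale,
    # approaching `cap` as lives grows without bound.
    return cap * (1.0 - math.exp(-lives / scale))

p = 1e-5                   # the 0.001% chance from the text
lives_at_stake = 1e100     # the 1e100 lives from the text
present_population = 7e9   # assumed stand-in for "all our resources"

results = {}
for name, u in [("linear", linear_utility), ("bounded", bounded_utility)]:
    expected_harm_averted = p * u(lives_at_stake)
    sure_cost = u(present_population)
    results[name] = expected_harm_averted > sure_cost
    print(f"{name:8s} averted={expected_harm_averted:.3g} "
          f"cost={sure_cost:.3g} pay_up={results[name]}")
```

Under these assumptions the linear function endorses paying the sure cost while the bounded one does not.  Raising `scale` moves the point at which the bounded curve flattens out without making it unbounded.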

If so, why are such utility functions so dominant in the LW community?  One cannot overlook the biased composition of the LW community as a potential factor: generally proficient in mathematical or logical thinking, but less adept than the general population in empathetic skills.  Oversimplified theories, such as linear unbounded utility functions, appeal more strongly to this type of thinker, while more realistic but complicated utility functions are instinctively dismissed as "illogical" or "irrational", when the real reason they are dismissed is not that they have actually been concluded to be illogical, but that they are perceived as uglier.

Yet another reason stems from the motives of the founders of the LW community, who make a living primarily out of researching existential risk and friendly AI.  Since existential risks are the kind of low-probability, long-term and high-impact event which would tend to be neglected by "intuitive" bounded and nonlinear utility functions, but favored by unintuitive, unbounded linear utility functions, it is in the founders' best interests to personally adopt a form of utilitarianism employing the latter type of utility function.

Finally, let me clarify that I do not dispute the existence of scope insensitivity.  I think the general population is ill-equipped to reason about problems on a global scale, and that education could help remedy this kind of scope insensitivity.  However, even if natural utility functions asymptote far too early, I doubt that the end result of proper training against scope insensitivity would be an unbounded linear utility function; rather, it would still be a nonlinear utility function, but which asymptotes at a larger scale.

 

 

Comment author: JamesPfeiffer 17 September 2015 10:51:56PM 1 point [-]

1) We don't need an unbounded utility function to demonstrate Pascal's Mugging. Plain old large numbers like 10^100 are enough.

2) It seems reasonable for utility to be linear in things we care about, e.g. human lives. This could run into a problem with non-uniqueness, i.e., if I run an identical computer program of you twice, maybe that shouldn't count as two. But I think this is sufficiently murky as to not make bounded utility clearly correct.

Comment author: snarles 11 October 2015 10:23:10PM *  0 points [-]

Like V_V, I don't find it "reasonable" for utility to be linear in things we care about.

I will write a discussion topic about the issue shortly.

EDIT: Link to the topic: http://lesswrong.com/r/discussion/lw/mv3/unbounded_linear_utility_functions/

Comment author: snarles 17 September 2015 06:27:26PM *  2 points [-]

I'll need some background here. Why aren't bounded utilities the default assumption? You'd need some extraordinary arguments to convince me that anyone has an unbounded utility function. Yet this post and many others on LW seem to implicitly assume unbounded utility functions.

Comment author: snarles 26 August 2015 09:38:51PM *  2 points [-]

Let's talk about Von Neumann probes.

Assume that the most successful civilizations exist digitally. A subset of those civilizations would selfishly pursue colonization; the most convenient means would be through Von Neumann machines.

Tipler (1981) pointed out that due to exponential growth, such probes should already be common in our galaxy. Since we haven't observed any, we must be alone in the universe. Sagan and Newman countered that intelligent species should actually try to destroy probes as soon as they are detected. This counterargument, known as "Sagan's response," doesn't make much sense if you assume that advanced civilizations exist digitally. For these civilizations, the best way to counter another race of Von Neumann probes is with their own Von Neumann probes.

Others (who have not been identified by the Wikipedia article) have tried to explain the visible absence of probes by theorizing how civilizations might deliberately limit the expansion range of the probes. But why would any expansionist civilization even want to do so? One explanation would be to avoid provoking other civilizations. However, it still remains to be explained why the very first civilizations, which had no reason to fear other alien civilizations, would limit their own growth. Indeed, any explanation of the Fermi paradox has to be able to explain why the very first civilization would not have already colonized the universe, given that the first civilization was likely to be aware of their uncontested claim to the universe.

The first civilization either became dominated by a singleton, or remained diversified into the space age. For the following theory, we have to assume the latter--besides, we should hope for our own sake that singletons don't always win. If the civilization remains diverse, at least some of the factions transition to a digital existence, and given the advantages provided for civilizations existing in that form, we could expect the digitalized civilizations to dominate.

Digitalized civilizations still have a wide range of possible value systems. There exist hedonistic civilizations, which gain utility from having immense computational power for recreational simulations or proving useless theorems, and there also exist civilizations which are more practically focused on survival. But any type of civilization has to act in self-preservation.

Details of the strategic interactions of the digitalized civilizations depend on speculative physics and technology, particularly the economics of computation. If there are dramatic economies of scale in computation (for example, if quantum computers provide an exponential scaling of utility by cost), then it becomes plausible that distinct civilizations would cooperate. However, all known economies of scale have limits, in which case the most likely outcome is for distinct factions to maintain control of their own computing resources. Without such an incentive for cooperation, the civilizations would have to be wary of threats from the other civilizations.

Any digitalized civilization has to protect itself from being compromised from within. Rival civilizations with completely incompatible utility functions could still exploit each other's computing resources. Hence, questions about the theoretical limitations of digital security and data integrity could be relevant to predicting the behavior of advanced civilizations. It may turn out to be easy for any civilization to protect a single computational site. However, any civilization expanding to multiple sites would face a much trickier security problem. Presumably, the multiple sites should be able to interact in some way, since otherwise, what is the incentive to expand? However, any interaction between a parent site and a child site opens the parent site (and therefore the entire network) to compromise.

Colonization sites near any particular civilization quickly become occupied, hence a civilization seeking to expand would have to send a probe to a rather distant region of space. The probe should be able to independently create a child site, and then eventually this child site should be able to interact with the parent site. However, this then requires the probe to carry some kind of security credentials which would allow the child site to be authenticated by the parent site in the future. These credentials could potentially be compromised by an aggressor. The probe has a limited capacity to protect itself from compromise, and hence there is a possibility that an aggressor could "capture" the probe, without being detected by the probe itself. Thus, even if the probe has self-destruction mechanisms, they would be circumvented by a sufficiently sophisticated approach. A compromised probe would behave exactly the same as a normal probe, and succeed in creating a child site. However, after the compromised child site has started to interact with the parent, at some point, it can launch an attack and capture the parent network for the sake of the aggressor.

Due to these considerations, civilizations may be wary of sending Von Neumann probes all over the universe. Civilizations may still send groups of colonization probes, but the probes may delay colonization so as to hide their presence. One might imagine that a "cold war" is already in progress in the universe, with competing probes lying hidden even within our own galaxy, but lying in stalemate for billions of years.

Yet, new civilizations are basically unaffected by the cold war: they have nothing to lose from creating a parent site. Nevertheless, once a new civilization reaches a certain size, they have too much to lose from making unsecured expansions.

But some civilizations might be content to simply make independent, non-interacting "backups" of themselves, and so have nothing to fear if their probes are captured. It still remains to explain why the universe isn't visibly filled with these simplistic "backup" civilizations.
