# roystgnr comments on Rationality Quotes May 2012 - Less Wrong

6 01 May 2012 11:37PM

You are viewing a comment permalink. View the original post to see all comments and the full post content.

Sort By: Best

Comment author: 17 May 2012 04:07:43AM 7 points [-]

Could we start that reading with the classic Bayes' Theorem example? Suppose 1% of women have breast cancer, 80% of mammograms on a cancerous woman will detect it, 9.6% on an uncancerous woman will be false positives. Suppose woman A gets a mammogram which indicates cancer. What are the odds she has cancer?

p(A|X) = p(X|A)p(A)/(p(X|A)p(A)+p(X|~A)p(~A)) = 7.8% Hooray?

Now suppose women B, C, D, E, F... Z, AA, AB, AC, AD, etc., the entire patient list getting screened today, all test positive for cancer. Is the probability that woman A has cancer still 7.8%? Bayes' rule, with the priors above, still says "yes"! You need more complicated prior probabilities (e.g. what are the odds that the test equipment is malfunctioning?) before your evidence can tell you what's actually likely to be happening. But those more complicated, more accurate priors would have (very slightly) changed our original p(A|X) as well!

It's not that Bayesian updating is wrong. It's just that Bayes' theorem never allows you to have a non-zero posterior probability coming from a zero prior, and to make any practical problem tractable everybody ends up implicitly assuming huge swaths of zero prior probability.

Comment author: 07 May 2013 04:15:29AM 1 point [-]

It's not assuming zero probability. It's assuming independence. Under the original model, it's possible for all the women to get positives, but only 1% to actually have breast cancer. It's just that a better prior would give a much higher probability.

Comment author: 07 May 2013 01:00:21PM 3 points [-]

Is there any practical difference between "assuming independent results" and "assuming zero probability for all models which do not generate independent results"? If not then I think we've just been exposed to people using different terminology.

Comment author: 07 May 2013 04:22:01PM 3 points [-]

Is there any practical difference between "assuming independent results" and "assuming zero probability for all models which do not generate independent results"?

No.

If not then I think we've just been exposed to people using different terminology.

I think it's more than terminology. And if Mencius can be dismissed as someone who does not really get Bayesian inference, one can surely not say the same of Cosma Shalizi, who has made the same argument somewhere on his blog. (It was a few years ago and I can't easily find a link. It might have been in a technical report or a published paper instead.) Suppose a Bayesian is trying to estimate the mean of a normal distribution from incoming data. He has a prior distribution of the mean, and each new observation updates that prior. But what if the data are not drawn from a normal distribution, but from the sum of two such distributions with well separated peaks? The Bayesian (he says) can never discover that. Instead, his estimate of the position of the single peak that he is committed to will wander up and down between the two real peaks, like the Flying Dutchman cursed never to find a port, while the posterior probability of seeing the data that he has seen plummets (on the log-odds scale) towards minus infinity. But he cannot avoid this: no evidence can let him update towards anything his prior gives zero probability to.

What (he says) can save the Bayesian from this fate? Model-checking. Look at the data and see if they are actually consistent with any model in the class you are trying to fit. If not, think of a better model and fit that.

Andrew Gelman says the same; there's a chapter of his book devoted to model checking. And here's a paper by both of them on Bayesian inference and philosophy of science, in which they explicitly describe model-checking as "non-Bayesian checking of Bayesian models". My impression (not being a statistician) is that their view is currently the standard one.

I believe the hard-line Bayesian response to that would be that model checking should itself be a Bayesian process. (I'm distancing myself from this claim, because as a non-statistician, I don't need to have any position on this. I just want to see the position stated here.) The single-peaked prior in Shalizi's story was merely a conditional one: supposing the true distribution to be in that family, the Bayesian estimate does indeed behave in that way. But all we have to do to save the Bayesian from a fate worse than frequentism is to widen the picture. That prior was merely a subset, worked with for computational convenience, but in the true prior, that prior only accounted for some fraction p<1 of the probability mass, the remaining 1-p being assigned to "something else". Then when the data fail to conform to any single Gaussian, the "something else" alternative will eventually overshadow the Gaussian model, and will need to be expanded into more detail.

"But," the soft Bayesians might say, "how do you expand that 'something else' into new models by Bayesian means? You would need a universal prior, a prior whose support includes every possible hypothesis. Where do you get one of those? Solomonoff? Ha! And if what you actually do when your model doesn't fit looks the same as what we do, why pretend it's Bayesian inference?"

I suppose this would be Eliezer's answer to that last question.

I am not persuaded that the harder Bayesians have any more concrete answer. Solmonoff induction is uncomputable and seems to unnaturally favour short hypotheses involving Busy-Beaver-sized numbers. And any computable approximation to it looks to me like brute-forcing an NP-hard problem.

Comment author: 09 May 2013 04:39:23PM 5 points [-]

In response to:

I believe the hard-line Bayesian response to that would be that model checking should itself be a Bayesian process.

and

"But," the soft Bayesians might say, "how do you expand that 'something else' into new models by Bayesian means? You would need a universal prior, a prior whose support includes every possible hypothesis. Where do you get one of those? Solomonoff? Ha! And if what you actually do when your model doesn't fit looks the same as what we do, why pretend it's Bayesian inference?"

I think a hard line needs to be drawn between statistics and epistemology. Statistics is merely a method of approximating epistemology - though a very useful one. The best statistical method in a given situation is the one that best approximates correct epistemology. (I'm not saying this is the only use for statistics, but I can't seem to make sense of it otherwise)

Now suppose Bayesian epistemology is correct - i.e. let's say Cox's theorem + Solomonoff prior. The correct answer to any induction problem is to do the true Bayesian update implied by this epistemology, but that's not computable. Statistics gives us some common ways to get around this problem. Here are a couple:

1) Bayesian statistics approach: restrict the class of possible models and put a reasonable prior over that class, then do the Bayesian update. This has exactly the same problem that Mencius and Cosma pointed out.

2) Frequentist statistics approach: restrict the class of possible models and come up with a consistent estimate of which model in that class is correct. This has all the problems that Bayesians constantly criticize frequentists for, but it typically allows for a much wider class of possible models in some sense (crucially, you often don't have to assume distributional forms)

3) Something hybrid: e.g., Bayesian statistics with model checking. Empirical Bayes (where the prior is estimated from the data). Etc.

Now superficially, 1) looks the most like the true Bayesian update - you don't look at the data twice, and you're actually performing a Bayesian update. But you don't get points for looking like the true Bayesian update, you get points for giving the same answer as the true Bayesian update. If you do 1), there's always some chance that the class of models you've chosen is too restrictive for some reason. Theoretically you could continue to do 1) by just expanding the class of possible models and putting a prior over that class, but at some point that becomes computationally infeasible. Model checking is a computationally feasible way of approximating this process. And, a priori, I see no reason to think that some frequentist method won't give the best computationally feasible approximation in some situation.

So, basically, a "hardline Bayesian" should do model checking and sometimes even frequentist statistics. (Similarly, a "hardline frequentist" in the epistemological sense should sometimes do Bayesian statistics. And, in fact, they do this all the time in econometrics.)

See my similar comments here and here.

Comment author: 07 May 2013 05:38:51PM 3 points [-]

And any computable approximation to it looks to me like brute-forcing an NP-hard problem.

I find this a curious thing to say. Isn't this an argument against every possible remotely optimal computable form of induction or decision-making? Of course a good computable approximation may wind up spending lots of resources solving a problem if that problem is important enough, this is not a blackmark against it. Problems in the real world can be hard, so dealing with them may not be easy!

"Omega flies up to you and hands you a box containing the Secrets of Immortality; the box is opened by the solution to an NP problem inscribed on it." Is the optimal solution really to not even try the problem - because then you're trying "brute-forcing an NP-hard problem"! - even if it turns out to be one of the majority of easily-solved problems? "You start a business and discover one of your problems is NP-hard. You immediately declare bankruptcy because your optimal induction optimally infers that the problem cannot be solved and this most optimally limits your losses."

And why NP-hard, exactly? You know there are a ton of harder complexity classes in the complexity zoo, right?

The right answer is simply to point out that the worst case of the optimal algorithm is going to be the worst case of all possible problems presented, and this is exactly what we would expect since there is no magic fairy dust which will collapse all problems to constant-time solutions.

Comment author: 07 May 2013 06:20:48PM 1 point [-]

I find this a curious thing to say. Isn't this an argument against every possible remotely optimal computable form of induction or decision-making?

There might well be a theorem formalising that statement. There might also be one formalising the statement that every remotely optimal form of induction or decision-making is uncomputable. If that's the way it is, well, that's the way it is.

"Omega flies up to you

This is an argument of the form "Suppose X were true -- then X would be true! So couldn't X be true?"

"You start a business and discover one of your problems is NP-hard. You immediately declare bankruptcy because your optimal induction optimally infers that the problem cannot be solved and this most optimally limits your losses."

You try to find a method that solves enough examples of the NP-hard problem well enough to sell the solutions, such that your more bounded ambition puts you back in the realm of P. This is done all the time -- freight scheduling software, for example. Or airline ticket price searching. Part of designing optimising compilers is not attempting analyses that take insanely long.

And why NP-hard, exactly? You know there are a ton of harder complexity classes in the complexity zoo, right?

Harder classes are subsets of NP-hard, and everything in NP-hard is hard enough to make the point. Of course, there is the whole uncomputability zoo above all that, but computing the uncomputable is even more of a wild goose chase. "Omega flies up to you and hands you a box containing the Secrets of Immortality; for every digit of Chatin's Omega you correctly type in, you get an extra year, and it stops working after the first wrong answer".

Comment author: 07 May 2013 06:32:46PM 2 points [-]

This is an argument of the form "Suppose X were true -- then X would be true! So couldn't X be true?"

No, this is pointing out that if you provide an optimal outcome barricaded by a particular obstacle, then that optimal outcome will trivially be at least as hard as that obstacle.

You try to find a method that solves enough examples of the NP-hard problem well enough to sell the solutions, such that your more bounded ambition puts you back in the realm of P. This is done all the time -- freight scheduling software, for example. Or airline ticket price searching. Part of designing optimising compilers is not attempting analyses that take insanely long.

This is exactly the point made for computable approximations to AIXI. Thank you for agreeing.

Harder classes are subsets of NP-hard, and everything in NP-hard is hard enough to make the point.

Are you sure you want to make that claim? That all harder classes are subsets of NP-hard?

but computing the uncomputable is even more of a wild goose chase. "Omega flies up to you and hands you a box containing the Secrets of Immortality; for every digit of Chatin's Omega you correctly type in, you get an extra year, and it stops working after the first wrong answer".

Fantastic! I claim my extra 43 years of life.

Comment author: 08 May 2013 07:55:22AM 0 points [-]

Harder classes are subsets of NP-hard, and everything in NP-hard is hard enough to make the point.

Are you sure you want to make that claim? That all harder classes are subsets of NP-hard?

No, carelessness on my part. Doesn't affect my original point, that schemes for approximating Solomonoff or AIXI look like at least exponential brute force search.

You try to find a method that solves enough examples of the NP-hard problem well enough to sell the solutions, such that your more bounded ambition puts you back in the realm of P. This is done all the time -- freight scheduling software, for example. Or airline ticket price searching. Part of designing optimising compilers is not attempting analyses that take insanely long.

This is exactly the point made for computable approximations to AIXI.

Since AIXI is, by construction, the best possible intelligent agent, all work on AGI can, in a rather useless sense, be described as an approximation to AIXI. To the extent that such an attempt works (i.e. gets substantially further than past attempts at AGI), it will be because of new ideas not discovered by brute force search, not because it approximates AIXI.

Fantastic! I claim my extra 43 years of life.

43 years is a poor sort of immortality.

Comment author: 08 May 2013 07:48:13PM 1 point [-]

schemes for approximating Solomonoff or AIXI look like at least exponential brute force search.

Well, yeah. Again - why would you expect anything else? Given that there exist problems which require that or worse for solution? How can a universal problem solver do any better?

Since AIXI is, by construction, the best possible intelligent agent, all work on AGI can, in a rather useless sense, be described as an approximation to AIXI.

Yes.

To the extent that such an attempt works (i.e. gets substantially further than past attempts at AGI), it will be because of new ideas not discovered by brute force search, not because it approximates AIXI.

No. Given how strange and different AIXI works, it can easily stimulate new ideas.

43 years is a poor sort of immortality.

It's more than I had before.

Comment author: 09 May 2013 07:44:03AM 0 points [-]

No. Given how strange and different AIXI works, it can easily stimulate new ideas.

The spin-off argument. Here's a huge compendium of spinoffs of previous approaches to AGI. All very useful, but not AGI. I'm not expecting better from AIXI.

Comment author: 10 May 2013 09:15:53AM 1 point [-]

Isn't there a very wide middle ground between (1) assigning 100% of your mental probability to a single model, like a normal curve and (2) assigning your mental probability proportionately across every conceivable model ala Solomonoff?

I mean the whole approach here sounds more philosophical than practical. If you have any kind of constraint on your computing power, and you are trying to identify a model that most fully and simply explains a set of observed data, then it seems like the obvious way to use your computing power is to put about a quarter of your computing cycles on testing your preferred model, another quarter on testing mild variations on that model, another quarter on all different common distribution curves out of the back of your freshman statistics textbook, and the final quarter on brute-force fitting the data as best you can given that your priors about what kind of model to use for this data seem to be inaccurate.

I can't imagine any human being who is smart enough to run a statistical modeling exercise yet foolish enough to cycle between two peaks forever without ever questioning the assumption of a single peak, nor any human being foolish enough to test every imaginable hypothesis, even including hypotheses that are infinitely more complicated than the data they seek to explain. Why would we program computers (or design algorithms) to be stupider than we are? If you actually want to solve a problem, you try to get the computer to at least model your best cognitive features, if not improve on them. Am I missing something here?

Comment author: 10 May 2013 10:41:08AM 3 points [-]

Isn't there a very wide middle ground between (1) assigning 100% of your mental probability to a single model, like a normal curve and (2) assigning your mental probability proportionately across every conceivable model ala Solomonoff?

Yes, the question is what that middle ground looks like -- how you actually come up with new models. Gelman and Shalizi say it's a non-Bayesian process depending on human judgement. The behaviour that you rightly say is absurd, of the Bayesian Flying Dutchman, is indeed Shalizi's reductio ad absurdum of universal Bayesianism. I'm not sure what gwern has just been arguing, but it looks like doing whatever gets results through the week while going to the church of Solomonoff on Sundays.

An algorithmic method of finding new hypotheses that works better than people is equivalent to AGI, so this is not an issue I expect to see solved any time soon.

Comment author: 10 May 2013 02:06:45PM 2 points [-]

An algorithmic method of finding new hypotheses that works better than people is equivalent to AGI, so this is not an issue I expect to see solved any time soon.

Eh. What seems AGI-ish to me is making models interact fruitfully across domains; algorithmic models to find new hypotheses for a particular set of data are not that tough and already exist (and are 'better than people' in the sense that they require far less computational effort and are far more precise at distinguishing between models).

Comment author: 10 May 2013 02:39:18PM *  1 point [-]

What seems AGI-ish to me is making models interact fruitfully across domains

Yes, I had in mind a universal algorithmic method, rather than a niche application.

Comment author: 10 May 2013 06:28:29PM *  1 point [-]

The hypothesis-discovery methods are universal; you just need to feed them data. My view is that the hard part is picking what data to feed them, and what to do with the models they discover.

Edit: I should specify, the models discovered grow in complexity based on the data provided, and so it's very difficult to go meta (i.e. run hypothesis discovery on the hypotheses you've discovered), because the amount of data you need grows very rapidly.

Comment author: 10 May 2013 02:12:28PM *  0 points [-]

Hmmm. Are we going to see a Nobel awarded to an AI any time soon?

Comment author: 10 May 2013 06:27:07PM 0 points [-]

I don't think any robot scientists would be eligible for Nobel prizes; Nobel's will specifies persons. We've had robot scientists for almost a decade now, but they tend to excel in routine and easily automatized areas. I don't think they will make Nobel-level contributions anytime soon, and by the time they do, the intelligence explosion will be underway.

Comment author: 10 May 2013 02:58:13PM 0 points [-]

I am not persuaded that the harder Bayesians have any more concrete answer. Solmonoff induction is uncomputable and seems to unnaturally favour short hypotheses involving Busy-Beaver-sized numbers. And any computable approximation to it looks to me like brute-forcing an NP-hard problem.

What if, as a computational approximation of the universal prior, we use genetic algorithms to generate a collection of agents, each using different heuristics to generate hypotheses? I mean, there's probably better approximations than that; but we have strong evidence that this one works and is computable.

Comment author: 10 May 2013 03:30:21PM 0 points [-]

What if, as a computational approximation of the universal prior, we use genetic algorithms to generate a collection of agents, each using different heuristics to generate hypotheses?

Whatever approach to AGI anyone has, let them go ahead and try it, and see if it works. Ok, that would be rash advice if I thought it would work (because of UFAI), but if it has any chance of working, the only way to find out is to try it.

Comment author: 10 May 2013 04:16:40PM 0 points [-]

I'm not saying I'm willing to code that up; I'm just saying that a genetic algorithm (such as Evolution) creating agents which use heuristics to generate hypotheses (such as humans) can work at least as well as anything we've got so far.

Comment author: 10 May 2013 04:25:19PM 0 points [-]

I'm just saying that a genetic algorithm (such as Evolution) creating agents which use heuristics to generate hypotheses (such as humans) can work at least as well as anything we've got so far.

If you have a few billion years to wait.

Comment author: 10 May 2013 05:31:32PM *  -1 points [-]
Comment author: 09 May 2013 04:49:35PM 0 points [-]

I am not persuaded that the harder Bayesians have any more concrete answer. Solmonoff induction is uncomputable and seems to unnaturally favour short hypotheses involving Busy-Beaver-sized numbers. And any computable approximation to it looks to me like brute-forcing an NP-hard problem.

Eh. I like the approach of "begin with a simple system hypothesis, and when your residuals aren't distributed the way you want them to be, construct a more complicated hypothesis based on where the simple hypothesis failed." It's tractable (this is the elevator-talk version of one of the techniques my lab uses for modeling manufacturing systems), and seems like a decent approximation of Solomonoff induction on the space of system models.

Comment author: 07 May 2013 06:49:23PM 0 points [-]

It's basically different terminology. His point is valid.

A model isn't something you assign probability to. It's something you use to come up with a set of prior probabilities. The model he used assumed independence. It didn't actually assign zero probability to any result. It doesn't assign a probability, zero or otherwise, to the machine being broken, because that's not something that's considered. It also doesn't assign a probability to whether or not its raining.