gwern comments on Rationality Quotes May 2012 - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (696)
No.
I think it's more than terminology. And if Mencius can be dismissed as someone who does not really get Bayesian inference, one can surely not say the same of Cosma Shalizi, who has made the same argument somewhere on his blog. (It was a few years ago and I can't easily find a link. It might have been in a technical report or a published paper instead.) Suppose a Bayesian is trying to estimate the mean of a normal distribution from incoming data. He has a prior distribution of the mean, and each new observation updates that prior. But what if the data are not drawn from a normal distribution, but from the sum of two such distributions with well separated peaks? The Bayesian (he says) can never discover that. Instead, his estimate of the position of the single peak that he is committed to will wander up and down between the two real peaks, like the Flying Dutchman cursed never to find a port, while the posterior probability of seeing the data that he has seen plummets (on the log-odds scale) towards minus infinity. But he cannot avoid this: no evidence can let him update towards anything his prior gives zero probability to.
What (he says) can save the Bayesian from this fate? Model-checking. Look at the data and see if they are actually consistent with any model in the class you are trying to fit. If not, think of a better model and fit that.
Andrew Gelman says the same; there's a chapter of his book devoted to model checking. And here's a paper by both of them on Bayesian inference and philosophy of science, in which they explicitly describe model-checking as "non-Bayesian checking of Bayesian models". My impression (not being a statistician) is that their view is currently the standard one.
I believe the hard-line Bayesian response to that would be that model checking should itself be a Bayesian process. (I'm distancing myself from this claim, because as a non-statistician, I don't need to have any position on this. I just want to see the position stated here.) The single-peaked prior in Shalizi's story was merely a conditional one: supposing the true distribution to be in that family, the Bayesian estimate does indeed behave in that way. But all we have to do to save the Bayesian from a fate worse than frequentism is to widen the picture. That prior was merely a subset, worked with for computational convenience, but in the true prior, that prior only accounted for some fraction p<1 of the probability mass, the remaining 1-p being assigned to "something else". Then when the data fail to conform to any single Gaussian, the "something else" alternative will eventually overshadow the Gaussian model, and will need to be expanded into more detail.
"But," the soft Bayesians might say, "how do you expand that 'something else' into new models by Bayesian means? You would need a universal prior, a prior whose support includes every possible hypothesis. Where do you get one of those? Solomonoff? Ha! And if what you actually do when your model doesn't fit looks the same as what we do, why pretend it's Bayesian inference?"
I suppose this would be Eliezer's answer to that last question.
I am not persuaded that the harder Bayesians have any more concrete answer. Solmonoff induction is uncomputable and seems to unnaturally favour short hypotheses involving Busy-Beaver-sized numbers. And any computable approximation to it looks to me like brute-forcing an NP-hard problem.
I find this a curious thing to say. Isn't this an argument against every possible remotely optimal computable form of induction or decision-making? Of course a good computable approximation may wind up spending lots of resources solving a problem if that problem is important enough, this is not a blackmark against it. Problems in the real world can be hard, so dealing with them may not be easy!
"Omega flies up to you and hands you a box containing the Secrets of Immortality; the box is opened by the solution to an NP problem inscribed on it." Is the optimal solution really to not even try the problem - because then you're trying "brute-forcing an NP-hard problem"! - even if it turns out to be one of the majority of easily-solved problems? "You start a business and discover one of your problems is NP-hard. You immediately declare bankruptcy because your optimal induction optimally infers that the problem cannot be solved and this most optimally limits your losses."
And why NP-hard, exactly? You know there are a ton of harder complexity classes in the complexity zoo, right?
The right answer is simply to point out that the worst case of the optimal algorithm is going to be the worst case of all possible problems presented, and this is exactly what we would expect since there is no magic fairy dust which will collapse all problems to constant-time solutions.
There might well be a theorem formalising that statement. There might also be one formalising the statement that every remotely optimal form of induction or decision-making is uncomputable. If that's the way it is, well, that's the way it is.
This is an argument of the form "Suppose X were true -- then X would be true! So couldn't X be true?"
You try to find a method that solves enough examples of the NP-hard problem well enough to sell the solutions, such that your more bounded ambition puts you back in the realm of P. This is done all the time -- freight scheduling software, for example. Or airline ticket price searching. Part of designing optimising compilers is not attempting analyses that take insanely long.
Harder classes are subsets of NP-hard, and everything in NP-hard is hard enough to make the point. Of course, there is the whole uncomputability zoo above all that, but computing the uncomputable is even more of a wild goose chase. "Omega flies up to you and hands you a box containing the Secrets of Immortality; for every digit of Chatin's Omega you correctly type in, you get an extra year, and it stops working after the first wrong answer".
No, this is pointing out that if you provide an optimal outcome barricaded by a particular obstacle, then that optimal outcome will trivially be at least as hard as that obstacle.
This is exactly the point made for computable approximations to AIXI. Thank you for agreeing.
Are you sure you want to make that claim? That all harder classes are subsets of NP-hard?
Fantastic! I claim my extra 43 years of life.
No, carelessness on my part. Doesn't affect my original point, that schemes for approximating Solomonoff or AIXI look like at least exponential brute force search.
Since AIXI is, by construction, the best possible intelligent agent, all work on AGI can, in a rather useless sense, be described as an approximation to AIXI. To the extent that such an attempt works (i.e. gets substantially further than past attempts at AGI), it will be because of new ideas not discovered by brute force search, not because it approximates AIXI.
43 years is a poor sort of immortality.
Well, yeah. Again - why would you expect anything else? Given that there exist problems which require that or worse for solution? How can a universal problem solver do any better?
Yes.
No. Given how strange and different AIXI works, it can easily stimulate new ideas.
It's more than I had before.
The spin-off argument. Here's a huge compendium of spinoffs of previous approaches to AGI. All very useful, but not AGI. I'm not expecting better from AIXI.
Hm, so let's see; you started off mocking the impossibility and infeasibility of AIXI and any computable version:
Then you admitted that actually every working solution can be seen as a form of SI/AIXI:
And now you're down to arguing that it'll be "very useful, but not AGI".
Well, I guess I can settle for that.
I stand by the first quote. Every working solution can in a useless sense be seen as a form of SI/AIXI. The sense that a hot-air balloon can be seen as an approach to landing on the Moon.
At the very most. Whether AIXI-like algorithms get into the next edition of Russell and Norvig, having proved of practical value, well, history will decide that, and I'm not interested in predicting it. I will predict that it won't prove to be a viable approach to AGI.
How can a hot air balloon even in theory be seen as that? Hot air has a specific limit, does it not - where its density equals the outside density?