[Link] Better results by changing Bayes’ theorem

XiXiDu

If it ever turns out that Bayes fails - receives systematically lower rewards on some problem, relative to a superior alternative, in virtue of its mere decisions - then Bayes has to go out the window.

-- Eliezer Yudkowsky, Newcomb's Problem and Regret of Rationality

Don't worry, we don't have to abandon Bayes’ theorem yet. But changing it slightly seems to be the winning Way given certain circumstances. See below:

In Peter Norvig’s talk The Unreasonable Effectiveness of Data, starting at 37:42, he describes a translation algorithm based on Bayes’ theorem. Pick the English word that has the highest posterior probability as the translation. No surprise here. Then at 38:16 he says something curious.

So this is all nice and theoretical and pure, but as well as being mathematically inclined, we are also realists. So we experimented some, and we found out that when you raise that first factor [in Bayes' theorem] to the 1.5 power, you get a better result.

In other words, if we change Bayes’ theorem (!) we get a better result. He goes on to explain

Link: johndcook.com/blog/2012/03/09/monkeying-with-bayes-theorem/

Peter Norvig - The Unreasonable Effectiveness of Data
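
As a rough illustration of the tweak in the quote, here is a minimal Python sketch of picking the best candidate when one factor of the product is raised to a power. The function name, the candidate words, and all probabilities are made up for illustration. The quote does not say which factor is the "first" one, so the sketch just calls them `first_factor` and `second_factor`.

```python
import math


def best_translation(candidates, first_factor, second_factor, exponent=1.5):
    """Return the candidate maximizing first_factor**exponent * second_factor.

    Scores are computed in log space so small probabilities do not underflow.
    With exponent=1.0 this is the plain product of the two factors.
    """
    def score(c):
        return exponent * math.log(first_factor[c]) + math.log(second_factor[c])

    return max(candidates, key=score)


# Hypothetical numbers, chosen only so that the exponent changes the winner.
candidates = ["house", "home"]
first = {"house": 0.02, "home": 0.05}
second = {"house": 0.90, "home": 0.30}

plain = best_translation(candidates, first, second, exponent=1.0)
tweaked = best_translation(candidates, first, second, exponent=1.5)
print(plain, tweaked)
```

With these toy numbers the plain product prefers "house" and the version with the 1.5 exponent prefers "home". An exponent above 1 simply gives that factor more weight in the ranking. The weighted score is no longer a normalized probability, but picking a maximum only needs the ordering.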

Hacker News discussion: http://news.ycombinator.com/item?id=3693447

The top comment is by srean, which I will shamelessly copy here:

This is more of a tweak of naive Bayes than of Bayes' theorem, and I suspect he is being a bit tongue in cheek and not letting on what's behind the tweak.

I am sure you have heard that naive Bayes makes gratuitous assumptions of independence. What is not mentioned as often is that it also assumes the document has been generated by a memory-less process.

So if I were to generate a document according to the model assumed by naive Bayes, and I wanted to decide whether to append another "the" to the document, I wouldn't need to keep track of how many "the"s I had already added. As a result the probability of n_i occurrences of a word i goes down exponentially, like this
P(n_i) = p_i^{n_i}.
Many words do not behave like this. Their probabilities do not go down monotonically from the start; rather, for words like "the", their probabilities (conditioned on their history) climb initially and then go down.

Naive Bayes works surprisingly well in spite of its grossly violated assumptions. There are many explanations for that. However, we usually forget that to make NB work, we have to throw away "stop-words". Among them are those exact "memory-full" words that violate NB's assumptions.

Let's get back to word frequencies: A model that fits word frequencies quite well is the power law. http://en.wikipedia.org/wiki/Power_law It looks like this
P(n_i) \propto n_i^c
where c is a constant for that word id i. For English, c usually lies in the ballpark of -1.5 to -2.1
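
To make the contrast between the two count models concrete, here is a small Python sketch that compares the memory-less form p_i^n_i with the power-law form n_i^c over a range of counts. The helper names, the value p = 0.5, and the cutoff of 20 occurrences are assumptions chosen for illustration; c = -1.5 is taken from the range quoted above.

```python
def memoryless_weights(p, max_count):
    """Unnormalized weights p**n for n = 1..max_count (geometric decay)."""
    return [p ** n for n in range(1, max_count + 1)]


def power_law_weights(c, max_count):
    """Unnormalized weights n**c for n = 1..max_count (power-law decay)."""
    return [n ** c for n in range(1, max_count + 1)]


def normalize(weights):
    """Scale weights so they sum to 1 over the counts considered."""
    total = sum(weights)
    return [w / total for w in weights]


MAX_COUNT = 20  # hypothetical cutoff, only to keep the table short
geo = normalize(memoryless_weights(0.5, MAX_COUNT))   # p = 0.5 is made up
pl = normalize(power_law_weights(-1.5, MAX_COUNT))    # c = -1.5 from the text

for n in (1, 2, 5, 10, 20):
    print(f"n={n:2d}  memoryless={geo[n - 1]:.2e}  power law={pl[n - 1]:.2e}")
```

Under the memory-less form each extra occurrence multiplies the weight by the same factor p, so high counts become vanishingly unlikely very quickly. The power-law form decays much more slowly, which leaves far more weight on words that repeat many times.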

The tweak that Norvig mentions is not an exact match for power law assumptions, but it comes very close. It's using a power law assumption but with sub-optimal parameters. In fact, using the power law assumption with parameters estimated from the data, you could get an even better classifier. Though be careful of the industry of seeing power laws everywhere. Log-normal distributions can look deceptively similar to a power law and are more appropriate on many such occasions.

Yeah, naive Bayes has a bad assumption of independence, but there is no reason that it has to be memory-less too, and that is partly what the tweak fixes. The theorem isn't really being violated.

I should have posted it over at Hacker News and gotten +103 karma instead of -1 :-)