Sniffnoy comments on Case Study: the Death Note Script and Bayes - Less Wrong

25 Post author: gwern 04 January 2013 04:33AM

You are viewing a comment permalink. View the original post to see all comments and the full post content.

Comments (43)

You are viewing a single comment's thread.

Comment author: Sniffnoy 04 January 2013 07:28:14AM *  3 points [-]

I'm confused -- isn't the probability that a given pair occurs at random 1/29 rather than 1/15?

Edit: Oops, this was thinking pairings rather than trees. Corrected in reply.

Comment author: Sniffnoy 09 January 2013 03:59:40AM *  0 points [-]

OK, I think the correct probability here is 1/57. According to OEIS (it cites Stanley as a reference; I haven't taken the time to try to understand why this would be the case), the number of unordered binary trees on a set of n+1 labelled leaves is given by 1*3*...*(2n-1). If we want to count how many of these have two particular leaves directly next to each other, well, we're essentially merging them into one super-leaf; thus we want the same thing on one fewer leaf. Hence the number we want is (1*3*...*55)/(1*3*...*57)=1/57. More generally, if we had n leaves, we'd have 1/(2n-3).

Edit: OK, not going to write out the whole thing here unless someone really wants, but for those skeptical of the above formula, you can prove it with exponential generating functions.

Comment author: Kaj_Sotala 04 January 2013 08:45:44AM 0 points [-]

That's if you fix the position of the first item in the pair: if item 1 literally the first item in a sequence, then there is indeed a 1/29 chance that the second item of the pair will appear next to it. But if the pair can be found anywhere...

Comment author: Kindly 04 January 2013 02:36:16PM *  2 points [-]

But you don't add the different probabilities for where the first item can be. No matter where the first item in the pair occurs, there is a 1/29 chance the second item will be next to it.

Another way of thinking about it. For any given item, there are 29 other items. Only one of these can be paired with the first, and all these events are equally likely. The probabilty has to be 1/29 and not 1/15, because 29 copies of 1/15 don't add up to 1.

Actually, the probability is slightly lower, because some items are not leaves at all. If we take the tree in the article as representative, then we expect roughly 10 pairs among the 30 items, which gives a probability of 2/87: with probability 2/3, the first item ends up as half of a pair, and with probability 1/29, the second item ends up as the other half of that same pair.

In the movie subtree, we have 12 items, so the probability of being paired is 2/33 rather than 1/6.

Edit: Laplace-adjusting the "is a random item in a pair" probability, we get 11/32 as an estimate instead, and 1/16 for the final answer. Note that because of the reasonably large sample size, this doesn't make a huge difference.

Comment author: gwern 04 January 2013 04:36:24PM *  0 points [-]

there is a 1/29 chance the second item will be next to it.

'Next to it', perhaps, but wouldn't that other alternative be putting it on an entirely different branch and so less similar as it's not in the same cluster? movie-fearandloathing may be 'next to' fanfiction-remiscent-afterthought-threecharacters in the clustering, but not nearly as similar to it as movie-1492conquestparadise... so I think that analysis is less right than my own simple one.

Comment author: Kindly 04 January 2013 05:17:32PM 2 points [-]

By "next to it" I meant paired with it, sorry. Not all items have another item paired with them, which is where the correction factor of 2/3 comes from.

Comment author: gwern 04 January 2013 07:18:12PM 0 points [-]

Not all items have another item paired with them, which is where the correction factor of 2/3 comes from.

Ah, I see. I'm not sure how I should deal with the non-pairing or multiple node groups; I didn't take them into account in advance, and anything based on observing the tree that was generated feels ad hoc. So if the odds of the pairing given random chance is overestimated, that means the strength of the pairing is being underestimated, right, and the likelihood ratio is weaker than it 'should' be? I'm fine with leaving that alone: as I said, when possible I tried to make conclusions as weak as possible.

Comment author: Kindly 04 January 2013 07:37:38PM 1 point [-]

What do the pairings even mean, exactly? I would expect two nodes to be paired iff they are closer to each other than to any other node. If this is the case, then under a random-distance model with n nodes the probability that two specific nodes are paired is 1/(2n-3).

Comment author: gwern 04 January 2013 08:32:50PM 0 points [-]

As far as I know, it means that they are closer, yes.

Comment author: Unnamed 04 January 2013 07:32:04PM 1 point [-]

If you took 30 people, and randomly put them into 15 pairs, then the probability that Person A would be paired with Person Z is 1/29. Person A is equally likely to be paired with any of the 29 other people.

If you took 15 women & 15 men, and randomly put them into 15 woman-man pairs, then the probability that Woman A would be paired with Man Z is 1/15. Woman A is equally likely to be paired with any of the 15 men.

The stylometrics analysis resembles the former situation, with p=1/29. The script could've been paired with any of the 29 other items.