Open Thread: March 2010, part 2

RobinZ

The Open Thread posted at the beginning of the month has exceeded 500 comments – new Open Thread posts may be made here.

This thread is for the discussion of Less Wrong topics that have not appeared in recent posts. If a discussion gets unwieldy, celebrate by turning it into a top-level post.

I have a program that estimates the probability that one gene has the same function as another gene, based on their similarity. The estimate is based on the % identity of amino acids between the two proteins, and on the % of the longer protein that is covered by an alignment with the shorter protein.

For various reasons, this is done by breaking %id and %len into bins, e.g., 20-30%id, 30-40%id, 40-50%id, ... 30-40%len, 40-50%len, ... and estimating, for each bin, the probability that two proteins matched in that way have the same function.
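As a rough illustration of the binning scheme just described, here is a minimal Python sketch. It assumes the calibration data is a list of (%id, %len, same-function) tuples, and that each bin's probability is simply the observed fraction of same-function pairs in that bin. The function names, the bin edges, and the sample records are all made up for illustration; they are not taken from the actual program.

```python
from bisect import bisect_right
from collections import defaultdict


def bin_index(value, edges):
    """Return the index of the bin holding `value`.

    `edges` are the interior bin boundaries in ascending order, so
    len(edges) boundaries define len(edges) + 1 bins.
    """
    return bisect_right(edges, value)


def estimate_bin_probabilities(records, id_edges, len_edges):
    """Estimate P(same function) for each (%id bin, %len bin) cell.

    `records` is an iterable of (pct_id, pct_len, same_function) tuples.
    The estimate is the raw observed fraction; cells with no data are
    simply absent from the result.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [same-function count, total]
    for pct_id, pct_len, same in records:
        cell = (bin_index(pct_id, id_edges), bin_index(pct_len, len_edges))
        counts[cell][0] += int(same)
        counts[cell][1] += 1
    return {cell: same / total for cell, (same, total) in counts.items()}


# Hypothetical calibration data, for illustration only.
records = [
    (25, 35, False),
    (27, 38, True),
    (22, 31, False),
    (45, 80, True),
    (48, 85, True),
]

# Two interior boundaries per axis give 3 bins each, 9 cells in total.
probs = estimate_bin_probabilities(records, id_edges=[30, 40], len_edges=[40, 50])
```

With only two boundaries per axis, the same function produces the reduced 3-by-3 table; the open question is where to put those boundaries.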

What I want to do is to reduce the number of bins, so that there are only 3 bins for %id and 3 bins for %len, giving 9 bins for their cross-product.

I can gather a bunch of statistics on matches made where we think we know the answer. The frequentist statistician can take, say for %id, every adjacent pair of the original 10 bins, do an ANOVA, and look at the F-statistic; then retain the 2 boundaries with the largest F-statistics.
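The frequentist procedure above could be sketched roughly as follows. This assumes each bin holds a non-empty list of 0/1 outcomes (1 meaning "same function") and computes the one-way ANOVA F-statistic by hand for each adjacent pair of bins. The function names and the toy data are hypothetical, and the example uses 5 bins rather than 10 to stay short.

```python
def f_statistic(group_a, group_b):
    """One-way ANOVA F-statistic for two non-empty groups of numbers."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    grand_mean = (sum(group_a) + sum(group_b)) / (n_a + n_b)

    # Between-group sum of squares (1 degree of freedom for two groups).
    ss_between = n_a * (mean_a - grand_mean) ** 2 + n_b * (mean_b - grand_mean) ** 2
    # Within-group sum of squares (n_a + n_b - 2 degrees of freedom).
    ss_within = (sum((x - mean_a) ** 2 for x in group_a)
                 + sum((x - mean_b) ** 2 for x in group_b))
    df_within = n_a + n_b - 2

    if ss_within == 0:
        # No variation inside either group: F is undefined or unbounded.
        return float("inf") if ss_between > 0 else 0.0
    return ss_between / (ss_within / df_within)


def strongest_boundaries(bins, keep=2):
    """Return the indices i of the `keep` boundaries (between bin i and
    bin i + 1) whose adjacent bins differ most by F-statistic."""
    scores = [(f_statistic(bins[i], bins[i + 1]), i) for i in range(len(bins) - 1)]
    top = sorted(scores, reverse=True)[:keep]
    return sorted(i for _, i in top)


# Hypothetical outcomes per %id bin, in ascending %id order.
bins = [
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1],
]
kept = strongest_boundaries(bins, keep=2)
```

For this toy data the boundaries kept are the ones after bin 1 and after bin 3, where the same-function rate changes; the boundaries between bins with identical rates score zero.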

What would the Bayesian do?

To make sure I'm interpreting this correctly: the calibration data is a list of pairs of genes, each with its %id and %len, and each tagged as either "same function" or "different function"? And currently, these are binned, and the probabilities are estimated from the statistics observed in each bin?

You want to change this, in particular, to reduce the number of bins. Before we get to "how", may I ask why you want to do this? It doesn't seem as if it would reduce the computational cost. It would increase the number of samples and possibly g...
