taw comments on Information theory and FOOM - Less Wrong

6 Post author: PhilGoetz 14 October 2009 04:52PM

Comment author: PhilGoetz 14 October 2009 09:57:55PM 2 points

In the Levitt paper, 64% is the fraction of single-domain architecture (SDA) protein families that are found in at least two of the three groups (viruses, prokaryotes, and eukaryotes); see figure 3. I use it as a (very close) approximation for the fraction of families in eukaryotes or prokaryotes that are found in both, which the paper doesn't report. 84% is computed from that figure plus the caption of figure 3, which says that prokaryotes contain 88% of SDA families. 73% is computed from all of that information.
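
A sketch of one way the 84% and 73% can be reproduced from the 64% and 88% figures. It assumes the families found in prokaryotes or eukaryotes are treated as the whole, that viruses are ignored, and that 84% and 73% are the shared fractions of eukaryote and prokaryote families respectively. That reading is an assumption, not something the comment or the paper confirms, and the variable names are illustrative.

```python
# Assumption: families found in prokaryotes or eukaryotes form the whole (1.0);
# viruses are ignored. Variable names are illustrative, not from the paper.
both = 0.64         # found in both groups (the approximation taken from figure 3)
prokaryotes = 0.88  # found in prokaryotes (figure 3 caption)

# Inclusion-exclusion: union = prokaryotes + eukaryotes - both, with union = 1.0
eukaryotes = 1.0 - prokaryotes + both

euk_also_in_prok = both / eukaryotes   # share of eukaryote families also in prokaryotes
prok_also_in_euk = both / prokaryotes  # share of prokaryote families also in eukaryotes

print(f"{euk_also_in_prok:.0%} {prok_also_in_euk:.0%}")
```

Under these assumptions the two ratios come out at roughly 84% and 73%, matching the figures quoted.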

> Most proteins have not been discovered - and there is probably a bias towards discovering the ones that are shared with eucaryotes - which would distort the figures in favour of finding older genes.

There is no bias towards discovering genes shared with eukaryotes in ordinary sequencing. We sequence complete genomes. Almost all of the bacterial genes known come from these whole-genome projects. We've sequenced many more bacteria than eukaryotes. Bacterial genomes don't contain much repetitive intergenic DNA, so you get nice complete genome assemblies.

Life starting 3.7 billion years ago - could be. Google's top ten results show claims ranging from 2.7 to 4.4 billion years ago. Adding that 0.7 billion years could make the information-growth curve more linear, and remove one exponentiation in my analysis.

> Also, it seems rather dubious to measure the rate of information change within evolution as the rate of information change within bacterial genomes. That doesn't consider the information present in the diversity of life.

Let's just say I'm measuring the information in DNA. Information in "the diversity of life" is too vague. I don't want to measure any information that an organism or an ecosystem gains from the environment by expressing those genetic codes.

Comment author: taw 15 October 2009 12:02:25AM 0 points

So I've read the paper. According to it, and it seems very plausible to me, we have some reason to suspect that we seriously underestimate the number of SDA families. The most widely distributed SDA families are the most likely to be known (those often occur in multiple groups), and the less widely distributed families are the least likely to be known (those often occur in one group only).

The actual percentage of shared SDA families is almost certainly lower than what we can naively estimate from current data. I don't know how much lower. Maybe just a few percent, maybe a lot.
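
A toy calculation of this detection bias, as a sketch only. All counts and detection probabilities below are made-up illustrative values, not data from the paper, and `observed_shared_fraction` is a hypothetical helper. It shows the direction of the effect, not its size.

```python
def observed_shared_fraction(n_shared, n_unshared, p_detect_shared, p_detect_unshared):
    """Expected fraction of *known* families that are shared between groups."""
    seen_shared = n_shared * p_detect_shared
    seen_unshared = n_unshared * p_detect_unshared
    return seen_shared / (seen_shared + seen_unshared)

# Hypothetical world: 400 shared and 600 single-group families.
true_fraction = 400 / (400 + 600)

# Shared (widely distributed) families are assumed easier to discover.
biased = observed_shared_fraction(400, 600, p_detect_shared=0.8, p_detect_unshared=0.4)

# With equal detection rates the estimate is not distorted.
unbiased = observed_shared_fraction(400, 600, p_detect_shared=0.7, p_detect_unshared=0.7)

print(true_fraction, biased, unbiased)
```

Whenever shared families are detected at a higher rate than single-group ones, the observed shared fraction sits above the true one, whatever the particular numbers.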

Not mentioned in the paper, but quite obvious, is the huge amount of horizontal gene transfer happening on evolutionary timescales like these (especially with viruses). It also increases apparent sharing and makes families appear older than they really are.

The third effect is that an SDA family that diverged a long time ago might be unrecognizable as a single family, while one that developed more recently is still recognizable as such. This can only increase the apparent age of SDA families.

So there are at least three effects of unknown magnitude but known direction. If any of them is strong enough, it invalidates your hypothesis. If all of them are weak, your hypothesis still relies a lot on the dating of the eukaryote-prokaryote split.