CellBioGuy comments on Open Thread, Jul. 20 - Jul. 26, 2015 - Less Wrong Discussion

4 Post author: MrMind 20 July 2015 06:55AM

Comments (202)

Comment author: CellBioGuy 21 July 2015 03:02:03PM *  9 points

Random analysis! From the fact that they anticipate using $400 million to record and track about 4 million people, you can tell they are talking about using microarrays to log SNP profiles (like 23andMe) or microsatellite repeat lengths or some otherwise cheap and easy marker-based approach rather than de novo sequencing. De novo sequencing of that many people would generate much more human DNA sequence data than has ever been produced in the history of the world, would clog up the current world complement of high-throughput sequencers for a long time, would be no more useful for legal purposes, and would probably cost $40 billion or more (likely more still, to develop the infrastructure).
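
The arithmetic behind that inference can be sketched in a few lines. This is only a back-of-the-envelope check using the figures quoted in the comment; the variable and function names are made up for illustration, and the $40 billion figure is the comment's own rough estimate, not a measured cost.

```python
# Per-person cost implied by the figures in the comment above.
# All three inputs are the comment's rough numbers, not authoritative data.
BUDGET_USD = 400_000_000               # stated budget for the database
PEOPLE = 4_000_000                     # approximate number of people covered
SEQUENCING_TOTAL_USD = 40_000_000_000  # the comment's guess for sequencing everyone


def per_person(total_usd, people):
    """Return the average cost per person in USD."""
    return total_usd / people


marker_cost = per_person(BUDGET_USD, PEOPLE)
sequencing_cost = per_person(SEQUENCING_TOTAL_USD, PEOPLE)
ratio = sequencing_cost / marker_cost

print(f"Budget per person:           ${marker_cost:,.0f}")
print(f"Sequencing guess per person: ${sequencing_cost:,.0f}")
print(f"Sequencing / budget ratio:   {ratio:.0f}x")
```

On these numbers the budget allows about $100 per person, while the sequencing estimate works out to about $10,000 per person, which is why the budget points to a marker-based approach.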

Comment author: Douglas_Knight 22 July 2015 06:00:18AM *  3 points

Iceland has managed to guess the complete sequence for all of its residents from SNPs by getting complete sequences of 3% of them. (Not that crime-fighting would use anything more than SNPs.)

Comment author: Lumifer 22 July 2015 03:47:16PM *  -1 points

Iceland has managed to guess the complete sequence for all of its residents from SNPs

Does not compute.

You can "guess" some statistical averages for the whole population, but you cannot "guess" the complete sequence for any particular individual.

Comment author: gwern 22 July 2015 06:40:29PM *  3 points

but you cannot "guess" the complete sequence for any particular individual.

Of course you can. If you have a giant complete pedigree for most or all of the population and you have SNPs or whole genomes for a small fraction of the members, and especially if it's a highly homogeneous population, then you can impute full genomes with varying but still-far-better-than-whole-population-base-rate accuracy for any particular entry (person) in the family tree. They're all highly correlated. This is no odder than noting that you can infer a lot about a parent's genome from one or two children's genomes despite never seeing the parent's genome. Your first cousin's genome says a lot about your genome, and even more if one can put it into a family tree and also has one of your grandparents' genomes. And if you have all the family trees and samples from most of them...

(This will not work too well for Kuwait since while the citizens may be highly inbred, they do not have the same genealogical records, and citizens are, IIRC, outnumbered by resident foreigners who are drawn from all over the world and especially poor countries. But it does work for Iceland.)
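
The parent-from-children inference mentioned above can be illustrated with a toy example. This is a minimal sketch assuming a single biallelic SNP, strict Mendelian inheritance, and no mutation or genotyping error. Real imputation works on haplotypes across many linked markers and is far more involved. The function names are hypothetical.

```python
# Toy illustration: which parental genotypes are consistent with the
# children's genotypes at one biallelic SNP?
# Genotypes are coded as the count of the alternate allele: 0, 1, or 2.
# Assumes strict Mendelian inheritance with no mutation or genotyping error.


def gametes(genotype):
    """Return the set of alleles (0 = reference, 1 = alternate) a parent can transmit."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def child_possible(mother, father, child):
    """True if the child's genotype can arise from one allele of each parent."""
    return any(a + b == child for a in gametes(mother) for b in gametes(father))


def consistent_parent_pairs(children):
    """List every (mother, father) genotype pair compatible with all children."""
    return [
        (m, f)
        for m in range(3)
        for f in range(3)
        if all(child_possible(m, f, c) for c in children)
    ]


# With no children observed, all 9 parental combinations remain possible.
print(len(consistent_parent_pairs([])))
# One homozygous-alternate child rules out any parent with genotype 0.
print(consistent_parent_pairs([2]))
# Two children with opposite homozygous genotypes pin both parents down exactly.
print(consistent_parent_pairs([0, 2]))
```

Each observed relative removes candidate genotypes, so in the last case both unsampled parents must be heterozygous. A dense pedigree applies the same kind of constraint across many relatives at once.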

Comment author: Lumifer 22 July 2015 06:53:47PM 1 point

you can impute full genomes with varying but still-better-than-whole-population-base-rate accuracy for any particular entry in the family tree.

True. But when the OP says "guess the complete sequence" I assume a much higher accuracy than just somewhat better than the base rate.

You can produce an estimate for the full sequence just on the basis of knowing that the subject is human (with some low accuracy), you can produce a better estimate if you know the subject's race, you can produce an even better one if you know the specific ethnic background, etc. It's still a statistical estimate and as such is quite different from actually sequencing the DNA of a specific individual.

Comment author: gwern 22 July 2015 07:04:21PM *  3 points

I assume a much higher accuracy than just somewhat better than the base rate.

How much higher would that be and how do you know the Icelandic imputations do not meet your standards?

It's still a statistical estimate and as such is quite different from actually sequencing the DNA of a specific individual.

An 'actual' sequence is itself a 'statistical estimate', since even with 30x coverage there will still be a lot of errors... (It's statistics all the way down, is what I'm saying.) For many purposes, the imputation can be good enough. DNA databases have already shown their utility in tracking down criminals who are not sampled in them but whose relatives are. From a Kuwaiti perspective, your quibbles are uninteresting.

Comment author: Lumifer 22 July 2015 07:29:23PM -1 points

From a Kuwaiti perspective, your quibbles are uninteresting.

You don't look like a Kuwaiti :-P And, of course, interestingness is in the eye of the beholder...

Comment author: Douglas_Knight 22 July 2015 07:31:16PM 0 points

All the coverage says that they used pedigrees, but I'd think that they could be reconstructed from SNPs, rather more accurately.

Comment author: gwern 22 July 2015 07:35:30PM 3 points

Throwing away data is rarely helpful.