RichardKennaway comments on Open Thread, Jun. 1 - Jun. 7, 2015 - Less Wrong Discussion
I have my genome data from both 23andMe and BGI. I am wondering what to make of it. BGI reports about thirty times as many SNPs as 23andMe. 23andMe: 598897, BGI: 19695817.
Of these, 475801 are reported by both. To see how well the two agree, I counted how often each pair of results (23andMe call, BGI call) occurred. In descending order of count, and classified by type of match or mismatch, this is what I get. (No individual SNPs are identified here.)
The first five lines make sense: the two analyses agree for a large proportion of the SNPs. The sixth shows 23andMe reading AG when BGI reads GA 30817 times. It looks like 23andMe are reporting unequal pairs in alphabetical order, while BGI are reporting them in random order. Taking these as matches, the great majority of SNPs reported by both are reported identically.
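A tally of this kind can be sketched as follows. This is only a minimal sketch, assuming each data set has already been loaded into a dict mapping an SNP identifier to a two-letter genotype string; the names `tally_pairs`, `a_calls`, `b_calls` and the toy identifiers are hypothetical and do not come from either company's file format.

```python
from collections import Counter


def normalise(genotype):
    """Sort the allele letters so that 'GA' and 'AG' compare equal."""
    return "".join(sorted(genotype))


def tally_pairs(a_calls, b_calls, ignore_order=False):
    """Count how often each (call in A, call in B) pair occurs.

    Only SNPs present in both dicts are counted. With ignore_order=True,
    allele order within a genotype is disregarded.
    """
    counts = Counter()
    for snp in a_calls.keys() & b_calls.keys():
        a, b = a_calls[snp], b_calls[snp]
        if ignore_order:
            a, b = normalise(a), normalise(b)
        counts[(a, b)] += 1
    return counts


# Toy data (made up for illustration; not real genotypes).
a_calls = {"snp1": "AG", "snp2": "CC", "snp3": "--", "snp4": "TT"}
b_calls = {"snp1": "GA", "snp2": "CC", "snp3": "AA", "snp5": "GG"}

# Print the pairs in descending order of count.
for pair, n in tally_pairs(a_calls, b_calls).most_common():
    print(pair, n)
```

Calling `tally_pairs(a_calls, b_calls, ignore_order=True)` folds pairs such as AG/GA into the matching AG/AG bucket, which is the "taking these as matches" step.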
Then there are a few thousand SNPs that one or other analysis (in 26 cases, both) lists in its output but reports no result for. What causes this?
Finally, there are a few hundred that the two analyses just give different results for. For most of these, one reports homozygosity for an allele present in the other, but in a few cases the reports are completely different, e.g. one occurrence of TT/GA.
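The categories above could be assigned mechanically along these lines. This is a rough sketch, not the method actually used for the counts: the function name `classify` and the category labels are hypothetical, and it assumes both files mark a missing result with a double dash, which may not hold for BGI's output.

```python
NO_CALL = "--"  # assumption: both data sets use this marker for a missing result


def classify(a, b):
    """Classify a pair of two-letter genotype calls for the same SNP."""
    if a == NO_CALL or b == NO_CALL:
        return "no-call"
    if a == b:
        return "identical"
    if sorted(a) == sorted(b):
        return "same alleles, different order"
    alleles_a, alleles_b = set(a), set(b)
    # One side homozygous for an allele that the other side also reports.
    if (len(alleles_a) == 1 and alleles_a <= alleles_b) or (
        len(alleles_b) == 1 and alleles_b <= alleles_a
    ):
        return "homozygous vs heterozygous"
    if alleles_a & alleles_b:
        return "partial overlap"
    return "completely different"


# Toy examples (illustrative only).
for a, b in [("AG", "GA"), ("AA", "AG"), ("TT", "GA"), ("--", "CC")]:
    print(a, b, classify(a, b))
```

Under these rules a TT/GA pair falls into the last category, since the two calls share no allele.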
Is this amount of mismatch typical for such analyses?
Interesting. Thanks for posting this!
I received exactly the same number of SNPs from BGI, so it looks like our data were processed under the same pipeline. I've found three people who have publicly posted their BGI data: two at the Personal Genome Project (hu2FEC01 and hu41F03B, each with 5,095,048 SNPs), and one on a personal website (with 18,217,058 SNPs).
The double dashes are no-calls. 23andMe reports on a set list of SNPs, and instead of omitting an SNP when they can't confidently determine the genotype, they indicate this with a double dash.
This seems normal considering the error rates from 23andMe that others have been reporting (example). I don't know about BGI's error rates.
I think it might be possible to accurately guess the actual genotypes for some of the mismatches by imputing the genotypes with something like Impute2 (for each mismatched SNP, leave it out and impute it using the nearby SNPs). This will take many hours of work, though, and you might as well phase and impute across the whole genome if you have the time, interest, and processing power to do so (I've been meaning to try this out to learn more about how these things work).