rocurley comments on Get genotyped for free (If your IQ is high enough) - Less Wrong

34 Post author: wallowinmaya 01 October 2011 04:00PM

Comments (62)

Comment author: rocurley 22 March 2015 05:33:02AM 2 points

For those of you still waiting: I got an email to the effect of "We did it, sorry it took so long, it'll be uploaded in 3 weeks." They also said that more than half were not yet done.

Comment author: rseiter 30 March 2015 06:12:24PM 2 points

I received a similar email and was able to download my genome file a few days ago. The file is in 23andMe format, output by PLINK. It was plain text even though it had a .gz suffix. I had trouble uploading the file to Promethease, but got it working by replacing the header with one copied from an actual 23andMe file and removing the missing (--) SNPs. Unfortunately, despite being ~125MB (~5x the size of an example 23andMe file I have), my file is missing many of the 23andMe SNPs (7948 genotypes annotated in Promethease vs. 20k+ for the 23andMe example). I have emailed BGI requesting additional information. For example, Promethease directly supports the dbSNPAnnotated.bz2 Complete Genomics file, and I was hoping to get a copy of that file for my data.
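A minimal sketch of two of the checks described above: whether a file with a .gz suffix is really compressed, and dropping the missing (--) genotypes. It assumes the 23andMe-style layout of tab-separated rsid, chromosome, position, and genotype columns with # comment lines; the function names are hypothetical, and the header replacement is not covered.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def open_genome(path):
    """Open the file as text, whether or not it is really compressed."""
    if is_gzipped(path):
        return gzip.open(path, "rt")
    return open(path, "r")


def drop_no_calls(lines):
    """Yield comment lines unchanged and data lines whose genotype is not '--'."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: rsid, chromosome, position, genotype (last column).
        if fields[-1] != "--":
            yield line
```

The generator streams line by line, so it should cope with files of a few hundred MB without loading them into memory.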

Have you had any success analyzing your results? Would anyone be interested in starting a discussion group for analyzing our BGI results?

Comment author: VincentYu 04 April 2015 05:58:16AM * 2 points

Are you sure you've downloaded your entire genome file? My uncompressed file is about 500 MB, and I got about 26000 annotations on Promethease. It seems like your file might have gotten truncated during the download.

Short step-by-step guide for those who want to get their genome annotated by Promethease:

  1. Use the 'Download All Files' link on the SpiderOak page to download your genome file.*
  2. Unzip then gunzip to get the raw text file genome.txt.
  3. Open the file in a text editor. Remove all the commented lines at the beginning of the file except the last one (i.e., keep the line starting with # rsid) and save. Promethease will not recognize the file otherwise.
  4. (optional) Compress the edited file with zip, gzip, or bzip2 to save upload time and bandwidth.
  5. Upload to Promethease and follow the directions there.
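Step 3 can also be scripted rather than done in a text editor, which may be easier for a file this large. This is only a sketch: it assumes the comment block sits at the top of the file and that every comment line starts with #. The function names and file paths are hypothetical.

```python
def keep_last_header_line(lines):
    """Drop the leading '#' comment lines except the final one (the column header)."""
    last_comment = None
    in_header = True
    for line in lines:
        if in_header and line.startswith("#"):
            last_comment = line  # remember only the most recent comment line
            continue
        if in_header:
            in_header = False
            if last_comment is not None:
                yield last_comment
        yield line
    if in_header and last_comment is not None:
        yield last_comment  # the input contained only comment lines


def rewrite_genome(src_path, dst_path):
    """Stream src_path to dst_path with the trimmed header."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        dst.writelines(keep_last_header_line(src))
```

For example, `rewrite_genome("genome.txt", "genome_edited.txt")` would write the edited copy next to the original, leaving the original untouched.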

* I advise against downloading the genome.txt.gz file directly because for some reason SpiderOak sends Content-Encoding: gzip in its HTTP response headers, which means that browsers will transparently decompress the file. This makes me uneasy because no checksum is provided for the (somewhat large) plain text file, so we have little protection against corruption and truncation. In contrast, when we use 'Download All Files' to download everything in a zip, the data's integrity is automatically verified against CRC-32 checksums when we unzip and gunzip locally.
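If you still have the compressed genome.txt.gz (for example, the one extracted from the zip), the same integrity check can be run explicitly. This sketch relies on Python's gzip module, which verifies the CRC-32 and length stored in the gzip trailer once the stream has been read to the end. The function name is hypothetical.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Read a .gz file to the end; return (ok, uncompressed_bytes_read)."""
    total = 0
    try:
        with gzip.open(path, "rb") as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (EOFError, OSError, zlib.error):
        # EOFError: stream ended early (truncation).
        # OSError: includes a failed CRC-32 or length check, or a non-gzip file.
        # zlib.error: corrupted compressed data.
        return False, total
    return True, total
```

The byte count doubles as a quick way to compare uncompressed file sizes between participants.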

Comment author: rseiter 05 April 2015 04:42:35PM 2 points

Thanks for the explanation and tips! I used your procedure and ended up with the same 131MB file. Interestingly, I did not need to remove the "--" entries. I have been exchanging email with BGI, and they indicated that files could have significantly different numbers of entries (but I am surprised at >3x!). Is there any chance your sequencing had greater than 4x coverage? My VCF file is queued up and should be available in a few months, which should help clarify what I am seeing.

Comment author: VincentYu 07 April 2015 09:50:41AM 0 points

> Is there any chance your sequencing had greater than 4x coverage?

I don't know. How do I find out?

Comment author: rseiter 18 April 2015 01:17:48AM 0 points

I think the VCF file would tell you, if you had one. Another possible explanation would be a lower quality threshold for calling SNPs, but that seems unlikely.
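As a rough sketch of how a VCF could answer the coverage question: the VCF format reserves the INFO key DP for read depth at a site, so averaging it gives a crude estimate. Whether the BGI files will carry DP is an assumption, and depth at called variant sites is only a proxy for genome-wide coverage. The function name is hypothetical.

```python
import re

# DP must be a whole INFO key, so anchor it at the start or after a ';'.
DP_PATTERN = re.compile(r"(?:^|;)DP=(\d+)")


def mean_depth(vcf_lines):
    """Average the INFO DP values over all records that have one."""
    total = 0
    count = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # meta-information and column header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        match = DP_PATTERN.search(fields[7])  # INFO is the eighth column
        if match:
            total += int(match.group(1))
            count += 1
    return total / count if count else None
```

It returns None when no record has a DP value, so a missing annotation is not mistaken for zero coverage.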

Comment author: RichardKennaway 30 March 2015 08:16:38PM 2 points

I signed up with 23andMe, a few days before getting that letter from BGI. I'm currently waiting for both results. Can anyone point me to a good resource for studying what the data mean and what I can do with them?

Comment author: rseiter 30 March 2015 08:38:07PM 1 point

I think Promethease (http://promethease.com) is a good and inexpensive ($5) start. If you have both sets of results I would recommend using 23andMe given my experience with uploading BGI data. Web searching "promethease review" will give some details and alternatives. Hopefully those of us in the BGI study can work out a good way of analyzing that data.

Comment author: sideband 30 April 2015 09:33:33PM 0 points

I'm another participant. I'm still waiting for my results, but would be interested in any discussion group for analysis.

Comment author: gwern 23 March 2015 02:12:30AM 1 point

What is 'it' here? Just your particular raw SNP results, and not news about any hits reaching genome-wide statistical significance?

Comment author: rocurley 23 March 2015 05:41:35AM 3 points

Just my particular results.