VincentYu comments on Get genotyped for free (If your IQ is high enough) - Less Wrong

34 Post author: wallowinmaya 01 October 2011 04:00PM

Comments (62)

Comment author: VincentYu 04 April 2015 05:58:16AM * 2 points

Are you sure you've downloaded your entire genome file? My uncompressed file is about 500 MB, and I got about 26000 annotations on Promethease. It seems like your file might have gotten truncated during the download.

Short step-by-step guide for those who want to get their genome annotated by Promethease:

  1. Use the 'Download All Files' link on the SpiderOak page to download your genome file.*
  2. Unzip then gunzip to get the raw text file genome.txt.
  3. Open the file in a text editor. Remove all the commented lines at the beginning of the file except the last one (i.e., keep the line starting with # rsid) and save. Promethease does not recognize the file unless the other comment lines are removed, and it chokes if the # rsid line is removed too.
  4. (optional) Compress the edited file with zip, gzip, or bzip2 to save upload time and bandwidth.
  5. Upload to Promethease and follow the directions there.
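
For anyone who would rather not open a ~500 MB file in a text editor, step 3 can be sketched in Python. This is only a sketch under assumptions: the function names and the file names in the usage comment are made up for illustration, and it assumes the comment lines all start with # and sit in one block at the top of the file, with the # rsid line last among them.

```python
def strip_leading_comments(lines):
    """Yield lines with the leading '#' block reduced to its last line.

    Assumption: the last leading comment line is the '# rsid' header
    that Promethease needs; every earlier comment line is dropped.
    """
    last_comment = None
    in_header = True
    for line in lines:
        if in_header and line.startswith("#"):
            # Remember only the most recent comment line seen so far.
            last_comment = line
            continue
        if in_header:
            # First data line reached: emit the kept header line first.
            in_header = False
            if last_comment is not None:
                yield last_comment
        yield line
    if in_header and last_comment is not None:
        # The input contained nothing but comment lines.
        yield last_comment


def clean_genome_file(src_path, dst_path):
    """Stream src_path to dst_path without loading the whole file."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(strip_leading_comments(src))


# Hypothetical usage (file names are placeholders):
# clean_genome_file("genome.txt", "genome_clean.txt")
```

The file is processed line by line, so memory use stays small regardless of file size.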

* I advise against downloading the genome.txt.gz file directly because, for some reason, SpiderOak sends Content-Encoding: gzip in its HTTP response header, which means that browsers will transparently decompress that file. This makes me uneasy because no checksum is provided for the (somewhat large) plain text file, so we have little protection against corruption and truncation. In contrast, if we use 'Download All Files' to download everything in a zip, the data's integrity is automatically verified against CRC-32 checksums when we unzip and gunzip locally.
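
If you want to run the gzip integrity check explicitly rather than rely on gunzip complaining, here is a minimal Python sketch. It assumes you still have the .gz file on disk (i.e., the one extracted from the 'Download All Files' zip, not a copy your browser already decompressed); the function name and the path in the usage comment are hypothetical. Python's gzip module checks the stored CRC-32 and length once the stream has been read to the end, so reading the whole file is the test.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip file at path decompresses cleanly.

    Reading to the end makes the gzip module compare the data against
    the CRC-32 and length stored in the gzip trailer.
    """
    try:
        with gzip.open(path, "rb") as f:
            # Discard the data; we only care whether reading succeeds.
            while f.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers bad headers and CRC mismatches, EOFError covers
        # truncated files, zlib.error covers corrupt compressed data.
        return False
    return True


# Hypothetical usage (the path is a placeholder):
# print(gzip_is_intact("genome.txt.gz"))
```

A truncated download should come back False here, which is the failure mode suspected above.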

Comment author: rseiter 05 April 2015 04:42:35PM 2 points

Thanks for the explanation and tips! I used your procedure and ended up with the same 131 MB file. Interestingly, I did not need to remove the "--" entries. I have been exchanging email with BGI, and they indicated that files could have significantly different numbers of entries (but I am surprised at >3x!). Is there any chance your sequencing had greater than 4x coverage? My VCF file is queued up and should be available in a few months, which should help clarify what I am seeing.

Comment author: VincentYu 07 April 2015 09:50:41AM 0 points

Is there any chance your sequencing had greater than 4x coverage?

I don't know. How do I find out?

Comment author: rseiter 18 April 2015 01:17:48AM 0 points

I think the VCF would tell you if you had it. Another possibility would be using a lower quality threshold for calling SNPs, but that seems unlikely.