VincentYu comments on Open thread, December 7-13, 2015 - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
From the linked Wired article:
Gwern's comment in the Reddit thread:
These comments seem to refer in part to the 2013 mass archive of Google Reader, made just before it was discontinued. For others who want to examine the data: the relevant WARC records for gse-compliance.blogspot.com are in lines 110789824 to 110796183 of greader_20130604001315.megawarc.warc, which is about three-quarters of the way into the file. I haven't checked the directory and stats grabs and don't plan to, as I don't want to spend any more time on this.
NB: As with any other large compressed archive, if you plan on saving the data, then I suggest decompressing the stream as you download it and recompressing it into a seekable structure. Btrfs with compression works well, but blocked compression implementations like bgzip should also work in a pinch. If you leave the archive as a single compressed stream, then you'll pull all your hair out when you try to look through the data.
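To make the "seekable structure" idea concrete, here is a minimal Python sketch of blocked recompression. It assumes the archive arrives as a single gzip stream, which the comment does not actually state. The function names (recompress_blocked, read_lines) and the index layout are hypothetical, made up for illustration. This is a simplified stand-in for what a tool like bgzip does, not the BGZF format itself: it just writes a series of independent gzip members and remembers where each one starts.

```python
import gzip
import io
from itertools import islice


def recompress_blocked(src, dst, lines_per_block=100_000):
    """Rewrite the gzip stream `src` as independent gzip members in `dst`.

    `src` and `dst` are binary file-like objects. Returns an index with one
    (first_line, n_lines, byte_offset, byte_length) tuple per block, where
    line numbers are 1-based. (Hypothetical helper, not part of any library.)
    """
    index = []
    offset = 0
    first_line = 1
    with gzip.GzipFile(fileobj=src, mode="rb") as stream:
        while True:
            # Pull the next chunk of lines without decompressing everything.
            block = list(islice(stream, lines_per_block))
            if not block:
                break
            member = gzip.compress(b"".join(block))
            dst.write(member)
            index.append((first_line, len(block), offset, len(member)))
            offset += len(member)
            first_line += len(block)
    return index


def read_lines(blocked, index, start, stop):
    """Return lines start..stop (1-based, inclusive) as a list of bytes.

    Only the blocks that overlap the requested range are decompressed.
    """
    out = []
    for first_line, n_lines, offset, length in index:
        last_line = first_line + n_lines - 1
        if last_line < start or first_line > stop:
            continue  # block lies entirely outside the requested range
        blocked.seek(offset)
        data = gzip.decompress(blocked.read(length))
        lines = io.BytesIO(data).readlines()
        lo = max(start, first_line) - first_line
        hi = min(stop, last_line) - first_line + 1
        out.extend(lines[lo:hi])
    return out
```

With an index like this saved next to the recompressed file, a range such as lines 110789824 to 110796183 could be fetched by decompressing only the few blocks that overlap it, rather than the first three-quarters of the archive. The block size is a tradeoff: smaller blocks give faster random access but a somewhat worse compression ratio.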