I've recently been scraping data from the sequences, by which I mean all of Eliezer's posts up to and including Practical Advice Backed By Deep Theories. My main aim is to get some fun data out, and perhaps to support more useful efforts such as the Bring Back the Sequences project. Along the way I found that some breakage from the move from OB (and OB's subsequent reorganization) remains unfixed.
In particular, 96 links either give 404s (not found), used to link to a comment but now only link to the main article, or link under the summary fold for no apparent reason. To avoid overloading this article, I have posted the list on piratepad here:
http://piratepad.net/ep/pad/view/ro.eyxCVZYMZeO/latest
Note that I have only checked links that went to overcomingbias.com. This is not necessarily a complete list.
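For anyone who wants to reproduce the three categories above, here is a minimal sketch of how a single link could be classified once it has been fetched. It is not my actual scraper. The function name, the `comment-` and `more` fragment conventions, and the idea of passing in the set of anchor ids found on the landing page are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Assumed fragment conventions; the real anchors on the old OB pages may differ.
COMMENT_PREFIX = "comment-"  # e.g. #comment-12345 (assumption)
FOLD_FRAGMENT = "more"       # e.g. #more, the summary fold (assumption)


def classify_link(url, status, page_anchor_ids):
    """Classify one link as 'not_found', 'lost_comment', 'under_fold' or 'ok'.

    url             -- the link exactly as written in the post
    status          -- HTTP status code after following redirects
    page_anchor_ids -- set of anchor ids present on the page we landed on
    """
    if status == 404:
        return "not_found"
    fragment = urlsplit(url).fragment
    # The link names a specific comment, but the landing page has no such
    # anchor, so the browser just shows the top of the article.
    if fragment.startswith(COMMENT_PREFIX) and fragment not in page_anchor_ids:
        return "lost_comment"
    # The link jumps below the summary fold rather than to the article top.
    if fragment == FOLD_FRAGMENT:
        return "under_fold"
    return "ok"
```

Under these assumptions a plain `#comments` fragment falls through to `ok`, because it is neither a specific comment anchor nor the fold anchor.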
Some of these can be fixed by anyone with editing rights, but the ones pointing to comments can be fixed only by Eliezer or by someone who knows which comment was meant to be linked. Alternatively, someone could go through the archive.org Wayback Machine, work out which comments were linked to, find them on the equivalent LessWrong pages, and provide the corrected links. I may modify the scraper to do this if someone is willing to make the substitutions.
Also, a bunch of links (not in the above list) direct the user to OvercomingBias.com only to be redirected back to LessWrong. While this doesn't actually cause any breakage, it's a pity to be burdening OB's server for no real reason. I can produce a list of these if needed.
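As a rough illustration of how such round trips could be detected, the sketch below takes the chain of URLs visited while following redirects (first entry is the link as written, last entry is the final landing page) and checks whether it starts on OB and ends on LessWrong. The function names and the way the chain is obtained are assumptions, not a description of my scraper.

```python
from urllib.parse import urlsplit


def host_matches(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def is_round_trip(chain, via="overcomingbias.com", home="lesswrong.com"):
    """True if a redirect chain starts on `via` and finishes on `home`.

    chain -- list of URLs in the order they were visited
    """
    if len(chain) < 2:
        # No redirect happened, so there is no round trip.
        return False
    return host_matches(chain[0], via) and host_matches(chain[-1], home)
```

Any link for which this returns true could be rewritten to point at the final URL directly, which would spare OB's server the extra request.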
If I have managed to attract the attention of anyone with editorial rights, I would really appreciate it if you could help me out by removing certain formatting inconsistencies that greatly slow down and complicate my scraper. I can offer more details on demand, but these links to OB are near the top of the list.
I should be back with more interesting data soon. If you have any particular data-mineable queries about the sequences, let me know.
[Edit: The 4 links that point to a #comments fragment are actually processed correctly. That leaves 92 to be fixed.]
Thanks, Alexandros! I just noticed your post here. Until we have a better-defined process for handling these things and alerting admins to valuable fixes, you can email me with any other fixes for LW: louie.helm(at)singinst(dot)org
Also, can you tell me how many words/characters are contained in the sequences?
For my definition of 'sequences', which is everything up to 'Practical Advice Backed By Deep Theories' minus the quotes threads (702 posts in total), the wordcount is 917,854. I know this list can be improved, and I will get a discussion going soon about what exactly the sequences are.
Can you give me more details about the character count? Do you mean letters and numerals only, for instance, or total characters including spaces and punctuation? Given a precise definition I can probably get it done in an hour or so (given the above caveats about what constitutes the sequences).
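To make the candidate definitions concrete, here is a small sketch that computes several counts over a piece of text at once. The function name and the choice of definitions are assumptions for illustration. In particular, "words" here means whitespace-separated tokens, which is not necessarily how the figure above was computed.

```python
def text_counts(text):
    """Return several alternative size measures for a piece of text."""
    return {
        # Whitespace-separated tokens.
        "words": len(text.split()),
        # Letters and digits only (Unicode-aware via str.isalnum).
        "alphanumeric": sum(ch.isalnum() for ch in text),
        # Everything except spaces, tabs and newlines.
        "non_whitespace": sum(not ch.isspace() for ch in text),
        # Every character, including spaces and punctuation.
        "total": len(text),
    }


sample = "Hello, world! 42"
print(text_counts(sample))
# {'words': 3, 'alphanumeric': 12, 'non_whitespace': 14, 'total': 16}
```

Summing any one of these over all posts would give the corresponding total for the sequences.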
Will get in touch with more fixes as I find them.