satt comments on Free (old) scientific papers [Link] - Less Wrong

7 Post author: JackEmpty 21 July 2011 04:22PM


Comment author: satt 22 July 2011 10:48:54PM * 2 points

Ethically I'd say this is pretty well in the clear. The original printed versions of the papers Greg Maxwell has released should all be in the public domain (and virtually none of their authors are still alive, so they're hardly losing out). The material seems to already be publicly available through Google Books; all GM has done is present the data without a clunky search interface. I can't see this hurting JSTOR's bottom line either, since all the papers come from just one journal, and who's going to cancel their subscription because one journal became more accessible?

I'm more interested in whether it'll ever be feasible to release more than one journal at a time. This Philosophical Transactions torrent alone is 34.9GB. Someone released a torrent of Science a couple of years ago that went over 100GB. Most journals would have a pre-1923 back catalogue smaller than those (PT and Science are both very old, and the Science torrent includes post-1922 issues), but collectively it'd still be a huge amount of material. That said, the ballpark numbers I get are lower than I expected.

JSTOR has 777,061 pre-1923 items published in journals, and those items total 3,916,062 pages. All of the different pre-1923 incarnations of PT rack up 158,644 pages, about 4% of the pre-1923 JSTOR total. So if pre-1923 JSTOR had the same file size per page as PT, it would take up roughly 862GB (34.9GB scaled up by 3,916,062/158,644). That's big, but small enough to copy onto a $60 hard disk and share by BitTorrent. All that's missing is someone able to scrape that much data from JSTOR and set up the public torrent for it. (According to the indictment, Aaron Swartz got "well over 4,000,000 articles from JSTOR", which would presumably include most of the pre-1923 articles, but I've seen nothing else to suggest he was going to host them all for public use. It is interesting that he managed to download so much of the database before getting rumbled.)
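
For anyone who wants to check the ballpark, here is a minimal Python sketch of the scaling estimate above. The page counts and torrent size are the figures quoted in this comment; the constant and function names are just illustrative, and the whole thing rests on the assumption that every journal has the same file size per page as PT, which may well not hold.

```python
# Figures quoted in the comment; everything else here is illustrative.
PT_TORRENT_GB = 34.9             # size of the Philosophical Transactions torrent
PT_PAGES = 158_644               # pre-1923 PT pages on JSTOR
JSTOR_PRE1923_PAGES = 3_916_062  # all pre-1923 journal pages on JSTOR


def estimate_total_gb(sample_gb, sample_pages, total_pages):
    """Scale a sample's size up to the whole collection.

    Assumes a constant number of gigabytes per page across all journals.
    """
    return sample_gb * total_pages / sample_pages


# Fraction of the pre-1923 page count that PT accounts for.
pt_share = PT_PAGES / JSTOR_PRE1923_PAGES

# Projected size of all pre-1923 JSTOR material under the constant-ratio assumption.
estimate = estimate_total_gb(PT_TORRENT_GB, PT_PAGES, JSTOR_PRE1923_PAGES)

print(f"PT share of pre-1923 pages: {pt_share:.1%}")
print(f"Estimated pre-1923 JSTOR size: {estimate:.1f} GB")
```

Using the unrounded page ratio instead of the rounded 4% is what puts the estimate in the low 860s of GB rather than the roughly 870GB you'd get from dividing 34.9GB by 0.04.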

[Edited to clarify that I'm talking about pre-1923 JSTOR material.]