ciphergoth comments on Print ready version of The Sequences - Less Wrong

14 Post author: Jordan 06 November 2010 01:21AM


Comments (31)


Comment author: ciphergoth 20 November 2010 04:45:54PM 6 points

I've now written a fairly sophisticated scraper for Eliezer's blog posts based on lxml, which

  • follows the Author links in "Article Navigation" to fetch all articles
  • fetches and parses all articles
  • identifies the title, body, and date
  • fixes hrefs for internal references where possible, including references that point to Overcoming Bias and redirect back to Less Wrong
  • fixes the weird Unicode characters wherever I can make a plausible guess at what was intended
  • finds and adds the forward references in all blog posts
  • caches all network operations in a very simple, dumb way
  • writes them all out as very simple HTML with a very simple HTML contents page, in a form that Calibre handles well.
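
To give a feel for the "very simple dumb" caching step, here is a minimal sketch. It is an illustration only, not the actual script: it uses just the Python standard library rather than lxml, and the names `fetch_url`, `cached_fetch`, and the cache directory layout are all hypothetical choices made for this example.

```python
import hashlib
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def fetch_url(url: str) -> bytes:
    """Download a URL over the network (the default, uncached fetcher)."""
    with urlopen(url) as response:
        return response.read()


def cached_fetch(url: str, cache_dir: Path,
                 fetcher: Callable[[str], bytes] = fetch_url) -> bytes:
    """Return the body for `url`, calling `fetcher` only on a cache miss.

    Hypothetical helper: one file per URL, never expired or invalidated.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any URL maps to a safe, fixed-length file name.
    cache_file = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if cache_file.exists():
        return cache_file.read_bytes()
    body = fetcher(url)
    cache_file.write_bytes(body)
    return body
```

Calling `cached_fetch(url, Path("cache"))` twice with the same URL touches the network only the first time, which is what makes repeated runs of a scraper cheap while the parsing code is still being debugged. Passing the fetcher in as a parameter is a design choice for this sketch so that the cache can be exercised without any network access.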

I'll share the script when I have time to sort out publishing via Mercurial. In the meantime, email me if you'd like a snapshot copy: paul at ciphergoth dot org.

Comment author: multifoliaterose 20 November 2010 05:10:29PM 0 points

Great to hear!