ciphergoth comments on Print ready version of The Sequences - Less Wrong Discussion

14 Post author: Jordan 06 November 2010 01:21AM

Comments (31)

Comment author: ciphergoth 20 November 2010 04:45:54PM 6 points

I've now written a fairly sophisticated lxml-based scraper for Eliezer's blog posts, which

  • follows the Author links in "Article Navigation" to fetch all articles
  • fetches and parses all articles
  • identifies the title, body, and date
  • fixes hrefs to internal references where possible, including references that point to Overcoming Bias and redirect back to Less Wrong
  • fixes the weird Unicode characters wherever I can make a plausible guess
  • finds and adds the forward references in all blog posts
  • caches all network operations in a very simple, dumb way
  • writes them all out as very simple HTML with an equally simple HTML contents page, in a form that Calibre works well with.
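
For anyone curious what a few of these steps might look like, here is a minimal sketch. It is not the actual script: it uses only the Python standard library rather than lxml, and the names `CACHE_DIR`, `cached_fetch`, `fix_mojibake` and `fix_href` are hypothetical. It also assumes the "weird Unicode characters" are UTF-8 text that was mis-decoded as Windows-1252, which is only one plausible cause.

```python
import hashlib
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Hypothetical cache location; the real script's layout is not known.
CACHE_DIR = Path(tempfile.gettempdir()) / "lw_scrape_cache"


def cached_fetch(url, fetch=None, cache_dir=CACHE_DIR):
    """Return the body of `url`, touching the network only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One file per URL, named by a hash of the URL: simple and dumb.
    path = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if path.exists():
        return path.read_bytes()
    if fetch is None:
        with urlopen(url) as response:
            data = response.read()
    else:
        # A caller-supplied fetcher, handy for testing without a network.
        data = fetch(url)
    path.write_bytes(data)
    return data


def fix_mojibake(text):
    """Guess-repair UTF-8 text that was wrongly decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # No plausible guess, so leave the text as it was.
        return text


def fix_href(href, redirects, local_files):
    """Point an href at a local file when its target article was scraped.

    `redirects` maps old URLs to the URLs they redirect to, and
    `local_files` maps article URLs to output filenames. Both tables are
    assumed to be built elsewhere, for example while fetching articles.
    """
    target = redirects.get(href, href)
    return local_files.get(target, href)
```

The cache never expires, which suits a one-off scrape of an archive that no longer changes. `fix_href` leaves any link it cannot resolve untouched, matching the "where possible" caveat above.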

I'll share the script when I have time to sort out publishing via Mercurial. In the meantime, email me if you'd like a snapshot copy: paul at ciphergoth dot org.

Comment author: multifoliaterose 20 November 2010 05:10:29PM 0 points

Great to hear!