It's an AWS firewall rule with bad defaults. We'll fix it soon, but in the meantime you can scrape if you change your user-agent to something other than wget/curl/etc. Please put your name/project in the user-agent so we can identify you in the logs if we need to, and rate-limit yourself conservatively.
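A minimal sketch of what that might look like with wget or curl; the "my-lw-archive" project name and contact address are placeholders, and the delay is just a conservative guess:

```
# Identify yourself in the user-agent and rate-limit conservatively.
# The project name and contact address below are placeholders.
wget --user-agent="my-lw-archive/0.1 (yourname; you@example.com)" \
     --wait=5 --random-wait \
     https://www.lesswrong.com/

# Same idea with curl (-A sets the user-agent); sleep between requests yourself.
curl -A "my-lw-archive/0.1 (yourname; you@example.com)" \
     -o index.html https://www.lesswrong.com/
```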
This is something I'm curious about as well! A friend recently introduced me to LessWrong, and I've found myself really enjoying the posts here. I'd like to spend more focused time digging into them.
I'd like to create a dump of LessWrong so that I can use a tool like DocETL (https://www.docetl.org/) to better sift through articles that might be interesting to me. It's been quite some time since jimrandomh replied to this post, so I thought I'd check in before attempting to crawl the site.
Also, it looks like https://www.lesswrong.com/robots.txt disall...
You should use GreaterWrong. Even once the AWS stuff is fixed for LW2, GW is designed to be more static than LW2 and ought to snapshot better in general. You can also use the built-in theme designer to customize it for offline use, and then scrape it using your cookies so the pages keep those settings.
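For example, something along these lines; this is only a sketch, with cookies.txt standing in for whatever cookie file you export from your browser after setting up your theme, and the user-agent and delay values as placeholders:

```
# Mirror GreaterWrong so pages render with your chosen theme.
# cookies.txt is a Netscape-format cookie file exported from your browser.
wget --mirror --convert-links --adjust-extension --page-requisites \
     --load-cookies cookies.txt \
     --wait=5 --random-wait \
     --user-agent="my-lw-archive/0.1 (yourname; you@example.com)" \
     https://www.greaterwrong.com/
```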
Yeah, GW is pretty good for snapshots and scraping. Either that or grab stuff directly from our API.
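If you go the API route, the site exposes a GraphQL endpoint at https://www.lesswrong.com/graphql; here's a rough sketch of pulling a few recent posts with curl. The query shape follows public examples and may not match the current schema exactly, so check the endpoint before relying on it:

```
# Ask the GraphQL endpoint for a handful of recent posts (title, URL, date).
# The query shape is an assumption based on public examples; verify against the schema.
curl -s https://www.lesswrong.com/graphql \
     -H 'Content-Type: application/json' \
     -H 'User-Agent: my-lw-archive/0.1 (yourname; you@example.com)' \
     --data '{"query": "{ posts(input: {terms: {view: \"new\", limit: 5}}) { results { title pageUrl postedAt } } }"}'
```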
If there is no official LessWrong DB/site archive of public posts, I'd like to be able to create my own with automated tools like wget, so that I can browse the site while offline. See "Is there a lesswrong archive of all public posts?"
wget and curl logs: