If there is no official LessWrong database or site archive for public posts, I'd like to be able to create my own with automated tools like wget, so that I can browse the site while offline. See "Is there a lesswrong archive of all public posts?"
Right now both wget and curl get a 403 Forbidden response. Logs:
$ wget -mk https://www.lesswrong.com/
--2023-11-08 14:31:26-- https://www.lesswrong.com/
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving www.lesswrong.com (www.lesswrong.com)... 54.90.19.223, 44.213.228.21, 54.81.2.129
Connecting to www.lesswrong.com (www.lesswrong.com)|54.90.19.223|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-11-08 14:31:26 ERROR 403: Forbidden.
Converted links in 0 files in 0 seconds.
$ curl -Lv https://www.lesswrong.com/
* Trying 54.81.2.129:443...
* Connected to www.lesswrong.com (54.81.2.129) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: none
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
* subject: CN=lesswrong.com
* start date: Sep 8 00:00:00 2023 GMT
* expire date: Oct 6 23:59:59 2024 GMT
* subjectAltName: host "www.lesswrong.com" matched cert's "www.lesswrong.com"
* issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M02
* SSL certificate verify ok.
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://www.lesswrong.com/
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: www.lesswrong.com]
* [HTTP/2] [1] [:path: /]
* [HTTP/2] [1] [user-agent: curl/8.4.0]
* [HTTP/2] [1] [accept: */*]
> GET / HTTP/2
> Host: www.lesswrong.com
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/2 403
< server: awselb/2.0
< date: Wed, 08 Nov 2023 19:31:44 GMT
< content-type: text/html
< content-length: 118
<
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>
* Connection #0 to host www.lesswrong.com left intact
It's an AWS firewall rule with bad defaults. We'll fix it soon, but in the meantime, you can scrape if you change your user agent to something other than wget/curl/etc. Please use your name/project in the user agent so we can identify you in logs if we need to, and rate-limit yourself conservatively.
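For anyone else hitting this, here is a rough sketch of what that could look like with the same tools as in the question. The user-agent string, the contact address, and the 5-second wait are placeholders I made up, not values the site has specified, so substitute your own name/project and pick a delay you're comfortable calling conservative.

```shell
#!/usr/bin/env bash
# Hypothetical identifier: replace with your own name/project and a way to reach you.
USER_AGENT="example-lesswrong-archiver/0.1 (contact: you@example.com)"

# Assumed delay in seconds between requests; this number is a guess, not site policy.
WAIT_SECONDS=5

# Fetch only the front page and print the HTTP status code, to see whether
# the custom user agent gets past the 403 before starting a full mirror.
check_status() {
  curl -sS -o /dev/null -w '%{http_code}\n' \
       -A "$USER_AGENT" \
       "https://www.lesswrong.com/"
}

# Same mirror as `wget -mk` in the question, plus a custom user agent
# and a randomized pause between requests as self-imposed rate limiting.
mirror_lesswrong() {
  wget --mirror --convert-links \
       --user-agent="$USER_AGENT" \
       --wait="$WAIT_SECONDS" --random-wait \
       "https://www.lesswrong.com/"
}
```

After sourcing the file, run `check_status` first; if it no longer reports 403, `mirror_lesswrong` starts the actual crawl. Nothing runs until you call one of the two functions, so it's safe to edit the placeholders before anything touches the site.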
gotcha, thanks!