Thanks for the (somewhat unfinished?) summary! I love Gwern’s posts, but they tend to be very long and rambling, which does make it harder to remember everything you’ve learned along the way. Having it recontextualized in a format like this is pretty cool :)
I'm glad you enjoyed it! I agree that more should be done. Just listing the specific search advice on the new table of contents would help a lot.
I'm gonna do the work, I promise. I'm just working up the nerve. Saying, in effect, "this experienced professional should have done his work better, let me show you how" is scary as balls.
This post is part of my "research" series.
Branwen, Gwern. Internet Search Tips, 11 December 2018. https://www.gwern.net/Search. Accessed: 2022-04-05.
Why
If memory serves, I learned of Gwern through LessWrong. He was mentioned as a guy who did really careful self-experiments with different drugs. His blog seemed consistently well-researched and earnest (plus he makes and shares some cool data analyses), so I downloaded a copy of his research tips.
In one paragraph
This article can be read as the field notes of a guerrilla researcher. Apart from search, there is advice on jailbreaking, cleaning, and redistributing digital texts, as well as scanning and OCRing physical ones. The search advice extends Eco's bibliographic research methodology into the Internet.
Table of Contents
An alternative outline
Broadly, how this article reads from my perspective.
The organization of this article is terrible. There is a lot of advice on different topics, but no way to find it. If I wanted all of the advice relevant to web crawling, I'd have no choice but to read the whole article. I'll probably extract the information in this article into a more sensible structure. At some point. That will mean rewriting the entire piece, so no promises. Below is one alternative structure.
Highlights
The good, the bad, and the extremely revealing. Direct quotations in no particular order. Some of these ideas may not be discussed at length, but it still seemed worthwhile to put them here.
Query syntax
(s1.1)
Custom Search Engines
(s1.4)
Piracy
(s1.3)
(s3.1)
Visibility
(s1.3)
Clippings
(s1.4)
General commentary
This article covers a lot of ground, and it's worth examining carefully. Besides the research advice, there is a clear political and personal message: Humanity is constantly losing its intellectual heritage. We have collectively accepted this state of affairs as normal, but it's not. It is our own god-damned fault. It's driven by stupidity and greed. Under these conditions, piracy is not just acceptable, but moral and prosocial.
The single most important idea (besides the political one) is that research requires more than just searching a collection, be it a search engine, library, or academic journal; it requires you to search for collections. Learning of a new library, archive, or pirate site is invaluable. Google is not your only option: there are plenty of other search engines, some of which have special collections backing them up.
Search engines mentioned in the article (almost certainly incomplete):
Idea number two (number one if you're a writer) is that "if it's not in Google, it doesn't exist". For something to count as "shared", people need to be able to find it when they look for it. File metadata is incredibly useful, and yet broadly ignored and infuriatingly hard to manipulate. I'll have to dig up how to insert metadata on this blog.
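As a rough idea of what inserting metadata could look like for a web page, here is a minimal sketch that renders `<meta>` tags for a post's `<head>`. It is not taken from Gwern's article. The function name `meta_tags` and the sample values are made up for illustration, and the choice of tag names (`citation_title`, `citation_author`, `citation_publication_date`, plus the plain `description` tag) is my assumption about which fields a search engine would pick up.

```python
from html import escape


def meta_tags(title, author, date, description):
    """Render <meta> tags for a page's <head> from basic bibliographic fields.

    The tag names are an assumption for illustration; a given blog
    platform or search engine may expect different ones.
    """
    fields = [
        ("citation_title", title),
        ("citation_author", author),
        ("citation_publication_date", date),
        ("description", description),
    ]
    # escape() converts &, <, > and quotes so the values are safe inside attributes
    return "\n".join(
        f'<meta name="{name}" content="{escape(value)}">'
        for name, value in fields
    )


if __name__ == "__main__":
    # Placeholder values, not real bibliographic data
    print(meta_tags("Example post title", "Example Author", "2022/04/05",
                    "One-sentence summary of the post."))
```

The output would be pasted into the page template's `<head>`. Whether a particular blog platform lets you edit the template is a separate question.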