Multiple pages aren't being counted. From what I understand, Google doesn't just follow dynamically generated next links like that. It spiders, going around in a web-like pattern. How many times would it end up visiting the same pages if it followed every comment to its original discussion? A lot. That would be a waste of resources.
To test this, I looked at the url that appears when you press the next button. The site adds some pagination variables into the URL. The word "count" appears. So, you can do the following query and observe the following things:
site:lesswrong.com/user -"submitted by" -"comments by" -count
site:lesswrong.com/user -"submitted by" -"comments by" -com (for comparison)
And observe:
A. Adding -count does not cut the number of results down to a small fraction of the original, like you'd expect it to if the next pages were indexed. The original method gives 9,820 total users (at this moment); with -count added, it gives 9,460.
B. Excluding "com" from the query shows zero results, which verifies that adding -count would be removing pages generated in those next links, had they been included.
C. If you click on random pages of Google results, you won't see those count and after variables in the URLs (Or at least I didn't and I feel fairly confident that they won't be there.)
D. If Vladimir is correct in this post, then just looking at one of those lines where the user's comments are totaled (the line where 900 have 25 comments) reveals that, by removing "count" from the query, we should have lost at least 1800 from the total. Nowhere near that many were lost, and a lot more should have been lost than that because I only subtracted a tiny fraction of the comments pages on this site in the example.
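The arithmetic behind point D can be sketched in a few lines of Python. This is only an illustration: the page size of 10 comments is taken from the remark elsewhere in this post that a next link appears past 10 comments, and the helper name extra_pages is made up for the example.

```python
import math

# Assumption: a user page shows 10 comments before a "next" link appears.
COMMENTS_PER_PAGE = 10

def extra_pages(comment_count, per_page=COMMENTS_PER_PAGE):
    """Number of paginated pages beyond the first for one user (hypothetical helper)."""
    if comment_count <= per_page:
        return 0
    return math.ceil(comment_count / per_page) - 1

# The single line used in point D: 900 users with 25 comments each.
users_in_bucket = 900
comments_each = 25
expected_drop = users_in_bucket * extra_pages(comments_each)

# Totals quoted in point A: original query vs. the query with -count added.
observed_drop = 9820 - 9460

print("expected drop from this one bucket:", expected_drop)
print("observed drop:", observed_drop)
```

If the next pages were in the index, that one bucket alone would account for a larger drop than the whole difference actually observed.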
Google doesn't just follow dynamically generated next links like that
I'm pretty sure Google normally does follow dynamic links. In this case, though, it doesn't, since they are marked nofollow.
I was excited to find this site, so I wanted to know how many people had joined LessWrong. Was it what it seemed - that a lot of people had actually gathered around the theme of rational thought - or was that just wishful thinking about a site that a guy with a neat idea and his buddies put together? I couldn't find anything stating the number of members on LessWrong anywhere on the site or the internet, so I decided it would be a fun test of my search engine knowledge to nail jello to a tree and make my own.
Some argue that Google totals are completely meaningless. However, the real problem is that it's very complicated, and if you don't know how search engines work, your likelihood of getting a usable number is low. I took the potential pitfalls into account when MacGyvering this figure out of Google. So far, no one has posted a significant flaw with my specific method. (I will change that statement if they do, once I've read their comment.) Also, I was right (Find in page: total).
Here is the query I constructed:
(Translation provided at the end.)
This gets a similar result in Bing and Yahoo:
If this is correct, LessWrong has over 9,000 members. That's my claim: "LessWrong probably has over 9,000 members" not "LessWrong has exactly 9,000 members". My LessWrong population figure is likely to be low. (I explain this below.)
Why did I do this? I was really overjoyed to find this site and wanted to see whether it was somebody's personal site with just a few buddies, or if they actually managed to draw a significant gathering of people who are interested in rational thought. I was very happy to see that it looks much bigger than a personal site. Since it was so hard to find out how many users LessWrong has, I decided to share.
I think a lot of people make the hasty generalization that "all search engine totals are meaningless". If you're an average user just plugging in search terms with little understanding of how search engines work: yes, you should regard them as meaningless. However, if you know the limitations of a technique, and which parts of the system you're working within are consistent and which are not, I say it is possible to get some meaning within those limitations. Do I know all the limitations? Well, I assume I am unaware of things I don't know, so I won't say that. But I do know that so far nobody has proven this number or method wrong. If you want to prove me wrong, go for it. That would be fascinating. Remember that the claim is "LessWrong probably has over 9,000 members". The entire purpose of this was to get an "at least this many" figure for how many members LessWrong has. The inaccuracies I've already taken into consideration in order to compensate for the limits of this technique are listed below:
Why this is an "at least this many" figure, pitfalls I've avoided or addressed, and inaccuracies.
- Some users may not be included in Google's index yet. For instance, if they have never posted, there may be no link to their page (which is what I searched for - user pages), and the spider would not find them. This may be restricted to members that have actually commented, posted, or have been linked to in some way somewhere on the internet.
- Search engine caches are not in real time. There can be a lag of up to months, depending on how much the search engine "likes" the page.
- It has been reported by previous employees of a major search engine that they are using crazy old computer equipment to store their caches. I've been told that it is common for sections of cache to be down for that reason.
- Search engines have restrictions in place to conserve resources. For instance, they won't let you peruse all of the results using the "next" button, and they don't total all of the results that they have when you first press "search" (you may see that number increase later if you continue to press "next" to see more pages of results.)
- It has been argued that Google doesn't interpret search terms the way you'd think. I knew that before I started. The query was designed with that in mind. I explain that here: http://lesswrong.com/r/discussion/lw/e4j/number_of_members_on_lesswrong/780g
- Some of the results in Bing and Yahoo were irrelevant, though I think I weeded them pretty thoroughly for Google if my random samples of results pages are a good indication of the whole.
- When you go to your user page, if you have more than 10 comments, a next link shows at the bottom and clicking it makes more pages appear. My understanding is that Google doesn't index these types of links - and they don't seem to be getting included. http://lesswrong.com/lw/e4j/number_of_members_on_lesswrong/7839
Go ahead and check it out - stick the query in Google and see how many LessWrong members it shows. You'll certainly get a more up-to-date total than I have posted here. ;)
Translation for those of you that don't know Google's codes:
"Search only lesswrong.com, only the user directory."
(The user directory is where each user's home page is, so I'm essentially telling it "find all the home page directories".)
Exclude any page in that directory with the exact text "submitted by" or "comments by"
(The submissions and comments pages use a url in that directory, so they will show up in the results if I do not subtract them. Also, I used exact text specific to those pages, so that the text in the links on user home pages does not get user home pages omitted from the search.)
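For anyone who wants to try variations, here is a rough Python sketch of how a query like this is put together from the pieces translated above. It assumes the base query is the one that appears in the test queries in this thread with the extra exclusion dropped; build_query is a made-up helper, and the search URL is only there so you can paste the result into a browser.

```python
from urllib.parse import quote_plus

def build_query(site_path, excluded_phrases=(), excluded_terms=()):
    """Assemble a site-restricted query string (hypothetical helper)."""
    parts = [f"site:{site_path}"]                      # search only this site and directory
    parts += [f'-"{p}"' for p in excluded_phrases]     # exclude pages with this exact text
    parts += [f"-{t}" for t in excluded_terms]         # exclude pages matching a single term
    return " ".join(parts)

# Assumed base query: user directory, minus submissions and comments pages.
base = build_query("lesswrong.com/user", ["submitted by", "comments by"])

# Test variant from the discussion: also exclude the pagination variable "count".
with_count = build_query(
    "lesswrong.com/user", ["submitted by", "comments by"], ["count"]
)

print(base)
print(with_count)
print("https://www.google.com/search?q=" + quote_plus(base))
```

Building the query from parts makes it easy to add or drop one exclusion at a time and compare the totals.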
Note:
I realize this number isn't scientific proof of anything (we can't see Google's code, so that would be foolish), which is why I'm not attempting to use it to convince anyone of anything important.