In July 2012, more than 600 different users posted a comment;
since March 2012, about 1600 different users;
since October 2011 — 2300;
since May 2011 — 2900;
since December 2010 — 3400;
since May 2010 — 3900;
since August 2009 — 4400.
Since the beginning, including the comments imported from Overcoming Bias, with some duplicates (people sometimes re-registered with different usernames when moving to LW, and the same username on Overcoming Bias was imported as multiple different usernames on LW if it corresponded to different emails), comments were posted under about 7500 different usernames.
Of the 4400 users who commented since August 2009, 1390 have written at least 10 comments;
900 users — at least 25 comments;
630 users — at least 50 comments;
429 users — at least 100 comments;
225 users — at least 250 comments;
134 users — at least 500 comments;
57 users — at least 1000 comments;
13 users — at least 2500 comments.
Wedrifid has written more than 10000 comments.
(Based on a wget'ed dump of all LW comments.)
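For anyone who wants to reproduce this kind of tally from their own dump, here is a minimal sketch. It is not the script behind the numbers above: the `authors` list is made-up sample data standing in for "one username per comment", and `users_with_at_least` is a hypothetical helper name.

```python
from collections import Counter

def users_with_at_least(authors, thresholds):
    """Map each threshold to the number of users with at least that many comments."""
    per_user = Counter(authors)  # username -> number of comments
    return {t: sum(1 for n in per_user.values() if n >= t) for t in thresholds}

# Made-up sample data: one entry per comment, naming its author.
authors = ["alice"] * 12 + ["bob"] * 3 + ["carol"] * 30 + ["dave"]

counts = users_with_at_least(authors, [1, 10, 25])
print(counts)
```

With a real dump, `authors` would be filled by parsing the saved comment pages, and duplicate usernames for the same person would still be counted separately.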
One flaw: You're not locating anywhere near all of the people that registered using this method because I bet a lot of people have never commented. In one website's database that I've got access to, almost 70% of the users register without ever doing the expected main activity. Unless you spider your copy of all the comments to cache home pages, follow the links off of friends lists and include other links to home pages around the internet (like Google does, which is why I chose Google instead of wget), you're probably missing a huge proportion of the profiles. You may argue that counting active users is more relevant than counting total members, but these guys might be voting on our posts and comments, and if they outnumber us, they've got more influence over the content than we do.
In the English Wikipedia the number of registered accounts dwarfs the number of accounts actually in use by a couple of orders of magnitude.
On the other hand, if you're interested in the number of habitual readers or habitual posters, then the number of members is going to be much higher.
I think my main interest in knowing how many users there are went something like this:
OMG A COMMUNITY OF RATIONALISTS!!!! And it is not small??? What the? Well how many are there?
Lots. For what it's worth, I've quit checking all the recent comments because there are too many of them.
That is surprisingly low: Sitemeter indicates that LessWrong receives ~12,000 daily visits, and ~14,000 today. (I assume that that figure refers to unique visitors.) So there are probably ~3K lurkers.
Potentially confounding factors: People might visit LW from multiple computers/smartphones; people might have multiple accounts; many of those accounts are either spammers or from people who no longer visit LW. I'm not sure which direction these factors would bias the overall number of lurkers in.
In my experience, internet forums tend to have many times as many lurkers as posters. Also, it's a bit tricky to take all the "noise" out of the data, as you're suggesting, and another thing to add to the list of noisemakers is the search engine crawling robots. They don't all use the term "robot" in their user agents, and anybody can build a robot, so it's hard to filter them all out.
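To illustrate why filtering by user agent is leaky, here is a rough sketch. The marker list and the `looks_like_robot` helper are assumptions made up for this example, and the third user agent is a hypothetical home-built robot.

```python
# Illustrative markers only; a real list would be longer and still incomplete.
BOT_MARKERS = ("bot", "crawler", "spider")

def looks_like_robot(user_agent):
    """Naive check: does the user agent string admit to being a robot?"""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

visits = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 6.1; rv:14.0) Gecko/20100101 Firefox/14.0",
    "my-homemade-fetcher/0.1",  # hypothetical robot that does not announce itself
]

# Everything the naive filter lets through is counted as a human visit.
human_looking = [ua for ua in visits if not looks_like_robot(ua)]
print(len(human_looking))
```

The home-built fetcher slips through and gets counted as a visitor, which is exactly the kind of noise that is hard to remove.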
I love how you guys explore every aspect of a thing. (: That may be a limit that Google has to either save on resources or to prevent rival search engines from downloading their whole database (or, a limit put into place BECAUSE rival search engines were sucking down their database, and it was taking up a lot of resources). I've seen other companies figure out who their greediest users are and, upon realizing that the population that takes up the most resources brings the least return on investment, put limits on them. That's what this looks like to me.
The "9000 results" is probably not a very accurate estimate - from "Google result counts are a meaningless metric":
The basic problem with the Google hit count reported in search results, particularly for phrases and searches using "AND" or "OR" operators, is that it is an estimate. It's not actually a count of anything, at all. It's the result of a calculation based solely upon the words that the query comprises, as Kevin Marks notes. Google explicitly states that it's an estimate, although it is coy about what that estimate is actually based upon. To quote one un-named Google employee, "these are all estimates, and we just haven't tried that hard to make the estimates precise". A named Google employee said much the same after this frequently given answer had been around for some years.
For example: When Google Web reports 17,200 results for the string "de Boyne Pollard" (as it does at the time of writing this Frequently Given Answer), it hasn't searched its entire database to count all of the pages that match that string. That would be very inefficient, considering that it only needs to find (by default) 10 matches in its database in order to return a result page, and that many people don't go beyond the first few pages (or even the first page) of results. What it has done is taken the individual words "de", "Boyne", and "Pollard", and, using the word frequency tables that the Google Web spider generates when it crawls the World Wide Web, produced, from the frequencies with which those three particular individual words occur, an estimate of the number of pages that probably would match.
To demonstrate for yourself that these estimates are meaningless numbers, take a few searches and click on the "Next" button to bring up further pages of results until you reach the last page. You'll see that the actual number of results, known once you reach the last page, will almost always be nothing like the estimated number of results that appeared on all of the prior pages.
Even the actual page count isn't necessarily correct. In part this is because Google caps all queries at 1000 results, and in part it is because of several other problems with the Google hit count, both estimated and final, that exist.
(The linked page has more sources for this)
Thanks! I've always wondered where those numbers came from, but never taken the time to find out.
If Google didn't search its entire database, this supports my theory that there are probably "over 9,000 members" - I did clearly say that was on the low side. If Google totals only SOME of the results (until it's clear that the user wants more results, or up to its limit for resource conservation) this also supports my assertion.
Search Term Interpretation:
As for the issues with word interpretation - I knew about that, so I restricted my search to a specific URL, not text within pages. The entire purpose of Google's "site:" code is to restrict the query to a particular website, not to use those words as it would a text search. IF it's breaking the url up into separate words and checking what it's got for those, firstly, that would fail to restrict the search to a specific site and therefore make that functionality bugged, and secondly even if it did that only for the counter, the word "user" would certainly return way more results than 9,000. The term "user" gets 8 billion hits, and "lesswrong" gets 51,700 - if it's totaling site: searches that way, it would get billions of results and it didn't. Assuming it's not bugged, a misinterpretation of the "site:lesswrong.com/user" code is N/A. Since every single user page contains the phrase "comments" and "submitted", if it had broken my exact phrase exclusions into parts, I'd have gotten zero results. See for yourself by trying:
"site:lesswrong.com/user" -comments
It was not by accident that I used the query that I did.
Is my point unsupported?
IF I were trying to support some sort of important point with this user total, I would agree with the link that it is not scientific evidence and quit using it to support points, but this is N/A because if you look closely, you'll see that I am not using this as support to convince anybody of anything. My entire purpose was to verify to myself my perception that LessWrong isn't just someone's personal website with their buddies on it, that a significant number of people have actually gathered around themes like rational thought. I was overjoyed when I discovered this and wanted to share. Maybe this post will get the attention of someone who has the ability to issue a count command to the database. That's the only way we can know for sure. Though, of course, the user totals will change over time, becoming inaccurate quickly. Hopefully by increasing. (:
There are lots of lurkers on Less Wrong:
http://lesswrong.com/lw/1np/attention_lurkers_please_say_hi/
Another problem with your methodology is that prolific users typically have many pages associated with them that lack the text "submitted by" or "comments by" on them. You can access these pages by going to the user's main page, scrolling down, and clicking the little "Next" link in the lower left.
Multiple pages aren't being counted. From what I understand, Google doesn't just follow dynamically generated next links like that. It spiders, going around in a web-like pattern. How many times would it end up visiting the same pages if it followed every comment to its original discussion? A lot. That would be a waste of resources.
To test this, I looked at the url that appears when you press the next button. The site adds some pagination variables into the URL. The word "count" appears. So, you can do the following query and observe the following things:
site:lesswrong.com/user -"submitted by" -"comments by" -count
site:lesswrong.com/user -"submitted by" -"comments by" -com (for comparison)
And observe:
A. It does not divide the number of results into a small fraction of the original number like you'd expect it to. We're comparing 9,820 total users with the original method (at this moment) with 9,460.
B. Removing "com" from the query shows zero results which verifies that adding -count would be removing pages generated in those next links, had they been included.
C. If you click on random pages of Google results, you won't see those count and after variables in the URLs (Or at least I didn't and I feel fairly confident that they won't be there.)
D. If Vladmir is correct in this post then just looking at one of those lines where the user's comments are totaled (the line where 900 have 25 comments) reveals that, by removing "count" from the query, we should have lost at least 1800 from the total. Nowhere near that many were lost, and a lot more should have been lost than that because I only subtracted a tiny fraction of the comments pages on this site in the example.
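The arithmetic behind point D can be spelled out as a quick sketch. It assumes 10 comments per user page (the point at which the "Next" link appears, as described in the post), and `extra_pages` is a hypothetical helper name.

```python
import math

COMMENTS_PER_PAGE = 10  # assumption: a "Next" link appears after 10 comments

def extra_pages(comment_count, per_page=COMMENTS_PER_PAGE):
    """Number of paginated pages beyond a user's first page."""
    return max(math.ceil(comment_count / per_page) - 1, 0)

users = 900          # users with at least 25 comments
min_comments = 25

# If paginated pages were indexed, excluding "count" should remove at least this many.
expected_minimum_drop = users * extra_pages(min_comments)

# Difference between the two result totals quoted in point A.
observed_drop = 9820 - 9460

print(expected_minimum_drop, observed_drop)
```

The observed drop is far below the expected minimum, which is consistent with the paginated pages not being in the index in the first place.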
Google doesn't just follow dynamically generated next links like that
I'm pretty sure Google normally does follow dynamic links. In this case, though, it doesn't, since they are marked nofollow.
I was excited to find this site, so I wanted to know how many people had joined LessWrong. Was it what it seemed - that a lot of people had actually gathered around the theme of rational thought - or was that just wishful thinking about a site that a guy with a neat idea and his buddies put together? I couldn't find anything stating the number of members on LessWrong anywhere on the site or the internet, so I decided it would be a fun test of my search engine knowledge to nail jello to a tree and make my own.
Some argue that Google totals are completely meaningless. However, the real problem is that it's very complicated, and if you don't know how search engines work, your likelihood of getting a usable number is low. I took into account the potential pitfalls when MacGyvering this figure out of Google. So far, no one has posted a significant flaw with my specific method. (I will change that statement if they do, once I've read their comment.) Also, I was right (Find in page: total).
Here is the query I constructed:
site:lesswrong.com/user -"submitted by" -"comments by"
(Translation provided at the end.)
This gets a similar result in Bing and Yahoo:
If this is correct, LessWrong has over 9,000 members. That's my claim: "LessWrong probably has over 9,000 members" not "LessWrong has exactly 9,000 members". My LessWrong population figure is likely to be low. (I explain this below.)
Why did I do this? I was really overjoyed to find this site and wanted to see whether it was somebody's personal site with just a few buddies, or if they actually managed to draw a significant gathering of people who are interested in rational thought. I was very happy to see that it looks much bigger than a personal site. Since it was so hard to find out how many users LessWrong has, I decided to share.
I think a lot of people assume the hasty generalization that "all search engine totals are meaningless". If you're an average user just plugging in search terms with little understanding of how search engines work: yes, you should regard them as meaningless. However, if you know the limitations of a technique, what parts of the system you're working within are consistent and what parts of it are not, I say it is possible to get some meaning within those limitations. Do I know all the limitations? Well, I assume I am unaware of things I don't know, so I won't say that. But I do know that so far nobody has proven this number or method wrong. If you want to prove me wrong, go for it. That would be fascinating. Remember that the claim is "LessWrong probably has over 9,000 members". The entire purpose of this was to get an "at least this many" figure for how many members LessWrong has. The inaccuracies I've already taken into consideration in order to compensate for the limits of this technique are listed below:
Why this is an "at least this many" figure, pitfalls I've avoided or addressed, and inaccuracies.
- Some users may not be included in Google's index yet. For instance, if they have never posted, there may be no link to their page (which is what I searched for - user pages), and the spider would not find them. This may be restricted to members that have actually commented, posted, or have been linked to in some way somewhere on the internet.
- Search engine caches are not in real time. There can be a lag of up to months, depending on how much the search engine "likes" the page.
- It has been reported by previous employees of a major search engine that they are using crazy old computer equipment to store their caches. I've been told that it is common for sections of cache to be down for that reason.
- Search engines have restrictions in place to conserve resources. For instance, they won't let you peruse all of the results using the "next" button, and they don't total all of the results that they have when you first press "search" (you may see that number increase later if you continue to press "next" to see more pages of results.)
- It has been argued that Google doesn't interpret search terms the way you'd think. I knew that before I started. The query was designed with that in mind. I explain that here: http://lesswrong.com/r/discussion/lw/e4j/number_of_members_on_lesswrong/780g
- Some of the results in Bing and Yahoo were irrelevant, though I think I weeded them pretty thoroughly for Google if my random samples of results pages are a good indication of the whole.
- When you go to your user page, if you have more than 10 comments, a next link shows at the bottom and clicking it makes more pages appear. My understanding is that Google doesn't index these types of links - and they don't seem to be getting included. http://lesswrong.com/lw/e4j/number_of_members_on_lesswrong/7839
Go ahead and check it out - stick the query in Google and see how many LessWrong members it shows. You'll certainly get a more up-to-date total than I have posted here. ;)
Translation for those of you that don't know Google's codes:
"Search only lesswrong.com, only the user directory."
(The user directory is where each user's home page is, so I'm essentially telling it "find all the home page directories".)
Exclude any page in that directory with the exact text "submitted by" or "comments by"
(The submissions and comments pages use a url in that directory, so they will show up in the results if I do not subtract them. Also, I used exact text specific to those pages, so that the text in the links on user home pages does not get user home pages omitted from the search.)
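The pieces translated above can be assembled programmatically. This is only a sketch of how the query string fits together: `build_query` is a hypothetical helper, and the result count still has to be read off the search page by hand.

```python
from urllib.parse import urlencode

def build_query(site_path, excluded_phrases):
    """Restrict to one site path and exclude pages containing the exact phrases."""
    parts = [f"site:{site_path}"]
    parts += [f'-"{phrase}"' for phrase in excluded_phrases]
    return " ".join(parts)

query = build_query("lesswrong.com/user", ["submitted by", "comments by"])

# URL-encode the query so it can be pasted into a browser's address bar.
url = "https://www.google.com/search?" + urlencode({"q": query})

print(query)
print(url)
```

Adding another exclusion, such as the pagination word tested in the comments, is just one more entry in the list of excluded phrases.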
Note:
I realize this number isn't scientific proof of anything (we can't see Google's code, so that would be foolish), which is why I'm not attempting to use it to convince anyone of anything important.