To understand what a person’s “utility function” for votes is, we conducted an Amazon Mechanical Turk experiment that asked users how they would perceive receiving a given number of up- and down-votes.
I don't even... Amazon Mechanical Turk? What are the odds that an experiment like that, run on one self-selected group, yields any insight into a completely different self-selected group?
Many of the questions they asked are pretty interesting, though. Not sure how much one can trust their answers.
First, let me point out that the "behavioral changes" that the authors described were investigated over only the three posts subsequent to each positive/negative evaluation, so it is unclear whether these effects persist over the long term.
Second, I find questionable the authors' conclusion that negative evaluations cause the subsequent decline in post quality and increase in post frequency, since they did not experimentally control the positive/negative evaluations. They model the positive/negative evaluations as random acts of chance (which is what we want for an RCT) and justify this by reporting that their bigram classifier assigns no difference in quality between the positively- and negatively-evaluated posts of each matched pair of subjects. However, I find it likely that their classifier makes enough misclassifications to call this conclusion into question.
For instance, suppose bad posts tend to occur in streaks of frequent posts, as they do in flame wars. Then we can explain the observations without assigning any causal potency to negative evaluations: once in a while the classifier will erroneously assign a high quality to a bad post near the start of a flame war, but on average it will correctly assign low qualities to the same poster's subsequent three posts in that flame war. We would then see exactly the effects the authors described, with no causal contribution from the negative evaluation that other users gave to the post near the start of the flame war. To test this explanation, the authors could ask the Crowdflower workers (p. 4) to label each b_0 (described on p. 5) and check whether the classifier is indeed misclassifying b_0 by assigning it too high a quality.
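To make this concrete, here is a minimal Monte Carlo sketch of the alternative explanation (my own toy model, not the authors' pipeline; every parameter below is made up for illustration). True post quality comes in streaks, the community's votes track true quality, the classifier measures quality with noise, and downvotes have no causal effect at all; conditioning on "downvoted but classifier-rated-good" posts still produces an apparent quality drop over the next three posts.

```python
# Toy model of the "flame-war streak + noisy classifier" explanation.
# Assumptions are mine, not the paper's: quality is autocorrelated in streaks,
# votes reflect true quality, the bigram classifier is a noisy measurement of it,
# and downvotes are causally inert.
import random

random.seed(0)

def simulate_user(n_posts=20, streak_start_p=0.3, streak_continue_p=0.8, noise=0.5):
    """Return (classifier_scores, community_votes) for one simulated poster."""
    scores, votes = [], []
    in_streak = False
    for _ in range(n_posts):
        if in_streak:
            in_streak = random.random() < streak_continue_p  # streaks tend to persist
        else:
            in_streak = random.random() < streak_start_p     # or occasionally start
        true_quality = -1.0 if in_streak else 1.0
        scores.append(true_quality + random.gauss(0.0, noise))  # noisy classifier score
        votes.append(-1 if true_quality < 0 else 1)             # vote tracks true quality
    return scores, votes

# Condition on posts that were downvoted but that the classifier (wrongly) rated good,
# then compare that post's score with the scores of the next three posts.
deltas = []
for _ in range(20000):
    scores, votes = simulate_user()
    for i in range(len(scores) - 3):
        if votes[i] == -1 and scores[i] > 0:
            after = sum(scores[i + 1:i + 4]) / 3.0
            deltas.append(after - scores[i])

print("mean quality change after a downvoted 'good' b_0:",
      sum(deltas) / len(deltas))
# Comes out clearly negative, even though downvotes do nothing in this model.
```

If human relabeling of the b_0 posts showed they really were high quality, this story would lose most of its force; if the raters disagreed with the classifier, it would be a serious problem for the paper's causal reading.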
Since the authors did not conduct an RCT, we can come up with many alternative explanations like this one, and I find them plausible. (Is it feasible to conduct an RCT on a site featuring upvotes and downvotes? Yes, it's been done before.)
Despite my criticisms, I think the paper is not bad. I just don't think the authors' methods provide sufficient evidence to warrant their seemingly strong confidence in their conclusions.
Second, I find questionable the authors' conclusion that negative evaluations cause the subsequent decline in post quality and increase in post frequency, since they did not experimentally control the positive/negative evaluations. They model the positive/negative evaluations as random acts of chance
If a community really votes as random acts of chance, that would explain why the voting doesn't lead to good behavior ;)
I suspect that in most communities votes are a measure of attention, and this makes even downvotes rewarding. Downvotes are easier to get, which could explain the disparity in the number of contributions. This doesn't apply to LW due to the comment-hiding system, I think.
Yes. Clearly bad karma by itself is not enough of a deterrent for trolls and others who frequently get downvoted; there need to be more tangible effects like comment hiding. The authors should have discussed this, but I can't see that they did (I only skim-read the paper, though).
This interesting sentence from the abstract confirms what you say about downvotes being rewarding:
Interestingly, the authors that receive no feedback are most likely to leave a community.
Hence negative feedback is better than being ignored.
I'm torn. There are definitely differences between the way Less Wrong operates and the situation the article describes, but that's always going to be the case. It would be nice to see more studies examining how the details of the system matter, of course, but none seem to be available. Absent that, it kind of seems like special pleading to say "we do things slightly differently, so obviously it won't apply to us." On the other hand, a single study is rather weak evidence, and the differences do exist, even if we don't have any actual evidence that they matter. I really don't know whether it makes sense to consider changing our system in light of this.
People might take Eliezer's proclaimed rationalism more seriously if he had even a rudimentary understanding of statistics and probability. And was actually a good writer.
This article discusses how upvotes and downvotes influence the quality of posts in online communities. It claims that downvotes lead to more, and lower-quality, posts from the downvoted commenter.
From the abstract:
Social media systems rely on user feedback and rating mechanisms for personalization, ranking, and content filtering. [...] This paper investigates how ratings on a piece of content affect its author’s future behavior. [...] [W]e find that negative feedback leads to significant behavioral changes that are detrimental to the community. Not only do authors of negatively-evaluated content contribute more, but also their future posts are of lower quality, and are perceived by the community as such. In contrast, positive feedback does not carry similar effects, and neither encourages rewarded authors to write more, nor improves the quality of their posts.
The authors of the article are Justin Cheng, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec.
Edited to add: NancyLebovitz already posted about this study in the Open Thread from September 8-14, 2014.