12th Nov 2011


Recently, I have noticed that the "Recent Wiki Edits" box (equivalently, this page) in the sidebar seems to be almost exclusively filled with edits from spam bots, and with either Gwern or Vladimir Nesov cleaning up after them (thanks!). This seems like it should be fixed, if only to save those two the time they spend on maintenance.

The wiki registration form does have a reCAPTCHA in an attempt to block spam-bots, but this is apparently not effective enough (maybe because reCAPTCHA has been cracked, or the spammers are using humans in some way).

I have some possible solutions, but I shall wait a bit before suggesting them.

(I vaguely remember there being a discussion like this previously, but I can't find it again, if it exists.)

15 comments

Linking registration to LW accounts and requiring positive Karma balance could do the trick.

This seems like the most effective solution, but is it feasible to implement?

Two main things would determine whether this is feasible:

  • Is there a method of getting the karma/details of a user via a URL? e.g. lesswrong.com/api/user/username/karma (Edit: it seems that Reddit has this feature, so it should be possible for LW; see the sketch after this list)
  • Is it possible to hook a custom function into the log-in/new-user-validation of the MediaWiki software?
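On the first point: since LessWrong runs on a fork of Reddit's codebase, it may expose Reddit's /user/&lt;name&gt;/about.json endpoint. A minimal sketch of the karma check, assuming that endpoint and Reddit's field names (neither verified for LW); on the MediaWiki side, a registration hook such as AbortNewAccount could call a check like this, though wiring that up is beyond the sketch:

```python
# Karma lookup sketch, assuming LW exposes the Reddit-style
# /user/<name>/about.json endpoint (unverified; field names below
# mirror Reddit's API and may differ on LW).
import json
from urllib.error import URLError
from urllib.parse import quote
from urllib.request import urlopen

def user_karma(username):
    """Return total karma for a LessWrong user, or None on failure."""
    url = "http://lesswrong.com/user/%s/about.json" % quote(username)
    try:
        with urlopen(url, timeout=10) as resp:
            data = json.load(resp)
    except (URLError, ValueError):
        return None
    info = data.get("data", {})
    # Reddit reports link and comment karma separately.
    return info.get("link_karma", 0) + info.get("comment_karma", 0)

def may_register(username):
    """Gate wiki registration on a positive karma balance."""
    karma = user_karma(username)
    return karma is not None and karma > 0
```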

I've found that the cognitive reflection test is a very nearly 100% effective "captcha" for wikis and forums. None of the image-based captchas I've used have been anywhere near as effective. Apparently a lot of spam posting is done using humans, but they aren't smart enough to pass the CRT. (Obviously, this will stop working if enough sites start using it and the answers become well known.)
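A minimal sketch of checking one classic CRT item server-side; the accepted answers are illustrative, and a real form would normalise more aggressively:

```python
# Minimal server-side check for one classic CRT item. The accepted
# answers are illustrative; a real form would normalise harder.
CRT_QUESTIONS = {
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?":
        {"5 cents", "5c", "$0.05", "0.05", "five cents"},
}

def check_answer(question, answer):
    """Accept the reflective answer; the intuitive '10 cents' fails."""
    accepted = CRT_QUESTIONS.get(question, set())
    return answer.strip().lower().rstrip(".") in accepted
```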

Any "captcha" you code yourself - even one as simple as "uncheck this to post" - will be far more effective than one built in to your wiki software, because spammers won't bother to add code to their system just for your site.

That's not quite true anymore. Most spam does come from software spambots, but some "spambots" are actually people.

One setting that comes to mind is restricting page creation to auto-confirmed users. Even if all it does is divert spammers into adding links to existing pages, that's still easier on me and Vladimir.

I'm not seeing how that would be any different.

Count the clicks for rollback vs delete, remembering to remove the page contents from the deletion log.

Why not add a one-click 'remove spam page and all contents and remove and permablock the creator' button/script? (assuming this is possible and relatively easy)

Even if all it does is divert spammers into adding links to existing pages, that's still easier on me and Vladimir.

Encouraging spammers to mess with actual pages people want to read, rather than advertising their presence with obvious spam as they currently do, doesn't seem good.

Why not add a one-click 'remove spam page and all contents and remove and permablock the creator' button/script? (assuming this is possible and relatively easy)

It's not built-in functionality, and so not easy.

Encouraging spammers to mess with actual pages people want to read, rather than advertising their presence with obvious spam as they currently do, doesn't seem good.

It may break their bots/scripts when they can't create pages, as they obviously default to doing; and when spammers edit pages, they tend not to replace the contents but add to them. Since the wiki isn't heavily trafficked, it's a tradeoff I'm willing to make.

Why not add a one-click 'remove spam page and all contents and remove and permablock the creator' button/script? (assuming this is possible and relatively easy)

It's not built-in functionality, and so not easy.

That isn't something that is terribly difficult to implement even as a completely external script using nothing but the web interface.
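Indeed, action=delete and action=block are real modules in MediaWiki's api.php, so such a script could be built on the API alone. A sketch, assuming a bot account with sysop rights; the endpoint URL and credentials are placeholders, and the token flow shown is the modern one (older MediaWiki versions fetch tokens differently):

```python
# Sketch of a one-click spam cleanup using nothing but MediaWiki's
# api.php (action=delete and action=block are real API modules; the
# bot account needs sysop rights). Endpoint and credentials are
# placeholders; the token flow may differ on older installs.
import requests

API = "https://wiki.lesswrong.com/api.php"  # placeholder endpoint

def call(session, **params):
    params.setdefault("format", "json")
    resp = session.post(API, data=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

def login(session, user, password):
    token = call(session, action="query", meta="tokens",
                 type="login")["query"]["tokens"]["logintoken"]
    call(session, action="login", lgname=user,
         lgpassword=password, lgtoken=token)

def nuke_spam(session, page, spammer):
    """Delete the spam page and permanently block its creator."""
    token = call(session, action="query",
                 meta="tokens")["query"]["tokens"]["csrftoken"]
    call(session, action="delete", title=page, reason="spam",
         token=token)
    call(session, action="block", user=spammer, expiry="never",
         reason="spam", nocreate="1", token=token)

if __name__ == "__main__":
    s = requests.Session()
    login(s, "CleanupBot", "hunter2")   # placeholder credentials
    nuke_spam(s, "Buy Cheap Widgets", "SpamUser123")
```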

In the spirit of holding off on proposing solutions, let's analyze the problem further. There are actually two parts to it:

Firstly, how can we keep a high quality of the wiki content?

I think the important metrics here are the ratio of spammers to the number of people willing to moderate the wiki, and the ease of editing or creating a page compared to the ease of restoring a previous version.

Secondly, how can we keep the recent wiki edits section relevant?

This depends solely on how we can identify the relevant information.

On Nov. 10 there were 21 changes: 2 real edits, 7 new spam users, and 12 page-deletions/user-blocks (cleaning up after the spam). On Nov. 11 there were 37 changes: 2 real edits, 15 new spam users and 20 page-deletions/user-blocks.

That's a significant drain on Gwern's and Vladimir's time, which could (and should) be spent on better things. (They are the only two moderating it at the moment, as far as I can tell.)

The second question isn't relevant; we are talking about how the wiki has a large amount of spam (and how to stop it), not how LW presents the recent edits.

As a temporary measure (until some other setting is implemented/activated), account creation could be disabled.

A more permanent measure could be an extra captcha added to the registration form, asking an LW-specific question, possibly randomised out of a list of several, such as the ones below. (A brief Google suggests it might be possible to get MediaWiki to do this without too much trouble, so it could be simpler to implement than Vladimir's solution, even though his is better.)

  • What is the first (or last) name of the user with the most karma overall? (or in the last 30 days, but implementing this might be just as hard as, or harder than, Vladimir's solution)
  • What is the name of the long series of posts by Eliezer?
  • What is the name of the sister site of LessWrong?
  • What is the nth word of the tagline of LessWrong? (where n is random)

(Obviously, a spammer could put in special cases for each question, but with enough questions, or some randomisation, this could be mitigated.)
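MediaWiki's ConfirmEdit extension ships a QuestyCaptcha module that implements exactly this kind of custom-question captcha, which may be what the brief Google turned up. The randomise-and-check logic itself is simple; a sketch, with illustrative questions and placeholder answer sets:

```python
# Sketch of a randomised site-specific question captcha; the questions
# and accepted answers below are illustrative placeholders.
import random

QUESTIONS = [
    ("What is the name of the sister site of LessWrong?",
     {"overcoming bias", "overcomingbias"}),
    ("What is the name of the long series of posts by Eliezer?",
     {"the sequences", "sequences"}),
]

def pick_question():
    """Choose a question to embed in the registration form."""
    index = random.randrange(len(QUESTIONS))
    return index, QUESTIONS[index][0]

def check(index, answer):
    """Verify the answer submitted along with the form."""
    return answer.strip().lower() in QUESTIONS[index][1]
```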

You can also shorten the timeout on the captcha. A common captcha bypass method is to fetch the captcha from the site you want to bypass, present it as the captcha on a site you run, and relay your visitors' responses back to the original site.
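A sketch of enforcing such a timeout, assuming the challenge id and issue time travel with the form as hidden fields; the 60-second window is an arbitrary choice:

```python
# Sketch of a short captcha timeout: stamp each challenge when it is
# issued, sign the stamp so it can't be forged, and reject responses
# that arrive after the window closes.
import hashlib
import hmac
import time

SECRET = b"server-side secret"  # placeholder key
TIMEOUT = 60                    # seconds

def issue(challenge_id):
    """Return (timestamp, signature) to embed as hidden form fields."""
    ts = str(int(time.time()))
    msg = ("%s:%s" % (challenge_id, ts)).encode()
    return ts, hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def still_valid(challenge_id, ts, sig):
    """Check the signature and that the challenge hasn't expired."""
    msg = ("%s:%s" % (challenge_id, ts)).encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(sig, expected)
            and time.time() - int(ts) <= TIMEOUT)
```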

One thing that can help is to disguise and indirect the captcha. Put in a fake response field with nearly zero screen size, named 'Captcha response', for an automatic captcha bypass to put its response into, and then an 'inappropriately' named but properly placed field immediately after it that people will actually see.

The LW trivia is of course a better solution, but this one is much easier to implement - just a few static HTML fragments here and there.
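A sketch of that decoy-field trick; every field name and the styling here are illustrative:

```python
# Sketch of the decoy-field trick: the nearly invisible field carries
# the name an automated solver looks for, while the field humans see
# has a misleading name. All names and styling are illustrative.
DECOY_FRAGMENT = """
<!-- decoy with near-zero screen size, for the bypass to fill in -->
<input type="text" name="captcha_response"
       style="width:1px;height:1px;border:0" tabindex="-1">
<!-- the field users actually see, 'inappropriately' named -->
<label>Type the letters shown above:
  <input type="text" name="favourite_colour">
</label>
"""

def passes_decoy_check(form, expected_answer):
    """Reject if the decoy was filled; then check the real field."""
    if form.get("captcha_response"):
        return False  # only an automated solver touches the decoy
    return form.get("favourite_colour", "").strip() == expected_answer
```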