Responding to the Facebook thread about possible attacks, here is a simple attack that seems worth analyzing/defending against: Set up a number of sockpuppet accounts. Find (or post) a set of comments that you predict the moderator will hide. Use your sockpuppet accounts to upvote/downvote those comments in a fixed pattern (e.g., account 1 always upvotes, account 2 always downvotes, and so on), so as to cause the ML algorithm to associate that pattern of votes with the "hide" moderator action. When you want to cause a comment that you don't like to be hidden, apply the same pattern of votes to it.
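To make the mechanics concrete, here is a toy simulation of that poisoning pattern (entirely illustrative: it assumes the model sees one feature per account's vote and uses a plain logistic regression, neither of which is specified by the proposal):

```python
# Toy illustration of the vote-pattern attack (my own sketch; it assumes the
# model is fed one feature per account's vote, which the proposal doesn't specify).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_comments, n_honest, n_socks = 500, 50, 5

# 20% of comments are genuinely bad; the moderator hides exactly those.
hidden = (rng.random(n_comments) < 0.2).astype(int)

# Honest votes are simplified to pure noise here.
honest_votes = rng.choice([-1, 0, 1], size=(n_comments, n_honest), p=[0.2, 0.6, 0.2])

# Sockpuppets stamp a fixed signature (up, down, up, down, up) onto every
# comment the attacker correctly predicts will be hidden.
signature = np.array([+1, -1, +1, -1, +1])
sock_votes = np.zeros((n_comments, n_socks))
sock_votes[hidden == 1] = signature

X = np.hstack([honest_votes, sock_votes])
model = LogisticRegression(max_iter=1000).fit(X, hidden)

# A harmless comment that later receives the same signature is now flagged.
victim = np.concatenate([rng.choice([-1, 0, 1], size=n_honest, p=[0.1, 0.6, 0.3]),
                         signature])
print(model.predict_proba(victim.reshape(1, -1))[0, 1])  # close to 1
```

In this toy version the signature columns perfectly separate the training labels, so the model assigns the "hide" label to anything carrying the signature, which is exactly the handle the attacker wants.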
Thanks for pointing out this attack. The regret bound implies that adversarially controlled features can't hurt predictions much on average, but adversarially controlled content can also make the prediction problem harder (by forcing us to make more non-obvious predictions).
Note that in terms of total loss this is probably better than the situation where someone just makes a bunch of spammy posts without bothering with the upvote-downvote pattern:
So I don't think this technically makes the worst case any worse, but it does increase the incentives to post hard-to-detect spammy content, which seems like a significant problem.
Hopefully you have case #1, and you could try to respond to this problem in the same way that you'd respond to other kinds of spam.
An extension of the attack is to use some simple AI techniques to generate so many spam posts that regular users get tired of downvoting them, so the only votes the ML algorithm can learn from are the sockpuppet votes. New-account status is easily under the attacker's control (just create a bunch of accounts ahead of time and wait until they're no longer new), so it seems fairly easy for an attacker to ensure you don't have case #1.
The spam from this kind of attacker would be much harder to deal with than the kind of spam we typically see today, since the attacker isn't trying to get anyone to click on something or take any other action, just to generate poor-quality content that can't be automatically detected as such.
Just to highlight where the theoretical analysis goes wrong:
So the issue is mostly incentives: this gives an attacker an incentive to generate large amounts of innocuous but quality-lowering spam. It still doesn't make the worst case any worse; if you had actual adversarial users, you were screwed all along under these assumptions.
In my dissertation research I usually make some limiting assumption on the attacker that prevents this kind of attack; in particular, I assume one of:
Under these conditions we can potentially keep the work per honest user modest (each person must stomp out 10 crappy responses). Obviously it is better if you can get the 10% up to 50% or 90%, e.g. by imposing a cost for account creation; without such costs it's not even clear whether you can get 10%. Realistically I think the most workable solution is to mostly use outside relationships (e.g. FB friendships), and then to allow complete outsiders to join by paying a modest cost or using a verifiable real-world identity.
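To spell out the arithmetic behind "stomp out 10 crappy responses" (this is just my reading of the 10%/50%/90% numbers; the assumed cases themselves aren't restated here): if a given fraction of accounts/content is honest, the junk-per-honest-user ratio falls off quickly as that fraction rises.

```latex
% junk items per good one, if a fraction \alpha of content is honest:
\frac{1-\alpha}{\alpha} \approx
\begin{cases}
9   & \alpha = 10\% \\
1   & \alpha = 50\% \\
0.1 & \alpha = 90\%
\end{cases}
```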
I haven't analyzed virtual moderation under these kinds of assumptions, though I expect we could.
I agree that virtual moderation may create stronger incentives for spam and manipulation, and so hasten the day when you need to start being more serious about security, and that over the short term that could be a fatal problem. But again, if there is someone with an incentive to destroy your forum who is able to create an arbitrary number of perfect shills, you need to somehow limit their ability anyway; there just isn't any way around it.
(For reference, I don't think the LW shills are near this level of sophistication.)
The first question of my first security-related job interview was, "If someone asked you to determine whether a product, for example PGP, is secure, what would you do?" I parroted back the answer that I had just learned from a book, something like, "First figure out what the threat model is." The interviewer expressed surprise that I had gotten the answer right, saying that most people would just dive in and try to attack the cryptosystem.
These days I think the answer is actually wrong. It's really hard to correctly formalize all of the capabilities and motivations of all potential adversaries, and once you have a threat model it's too tempting to do some theoretical analysis and think, ok, we're secure under this threat model, hence we're probably secure. And this causes you to miss attacks that you might have found if you just thought for a few days or months (or sometimes just a few minutes) about how someone might attack your system.
In this case I don't fully follow your theoretical analysis, and I'm not sure what threat model it assumed precisely, but it seems that the threat model neglected to incorporate the combination of the motivation "obtain power to unilaterally hide content (while otherwise leaving the forum functional)" and the capability "introduce new content as well as votes", which is actually a common combination among real-world forum attackers.
These days I think the answer is actually wrong
How so? Since security cannot be absolute, the threat model is basically just placing the problem into appropriate context. You don't need to formalize all the capabilities of attackers, but you need to have at least some idea of what they are.
and think, ok, we're secure under this threat model, hence we're probably secure
That's actually the reverse: hardening up under your current threat models makes you more secure against the threats you listed but doesn't help you against adversaries your threat model ignores. E.g. if your threat model doesn't include a nation-state, you're very probably insecure against a nation-state.
You don't need to formalize all the capabilities of attackers, but you need to have at least some idea of what they are.
But you usually already have an intuitive idea of what they are. Writing down even an informal list of attackers' capabilities at the start of your analysis may just make it harder for you to subsequently think of attacks that use capabilities outside of that list. To be clear, I'm not saying never write down a threat model, just that you might want to brainstorm about possible attacks first, without having a more or less formal threat model potentially constrain your thinking.
But you usually already have an intuitive idea of what they are
The point is that different classes of attackers have very different capabilities. Consider e.g. a crude threat model which posits five classes:
A typical business might then say "We're going to defend against classes 1-3 and we will not even try to defend against 4-5. We want to be sure class 1 gets absolutely nowhere, and we will try to make life very difficult for class 3 (but no guarantees)". That sounds like a reasonable starting point to me.
We can say something like: for any fixed sequence of prediction problems, the predictions made by a particular ML algorithm are nearly as good as if we had used the optimal predictor from some class (with appropriate qualifiers), and in particular as good as if we had set the weights of all adversarial users to 0. There is no real threat model.
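For concreteness, one standard form of this kind of guarantee is the regret bound for exponential weights over a finite comparison class (the comment doesn't pin down the exact algorithm or class, so take this as an illustrative instance rather than the analysis actually used):

```latex
% \hat{y}_t: the algorithm's prediction at step t; \ell_t: the loss at step t;
% \mathcal{H}: the comparison class of predictors (e.g. fixed weightings of users' votes).
\sum_{t=1}^{T} \ell_t(\hat{y}_t)
  \;\le\;
  \min_{h \in \mathcal{H}} \sum_{t=1}^{T} \ell_t\bigl(h(x_t)\bigr)
  \;+\; O\!\left(\sqrt{T \log |\mathcal{H}|}\right)
```

Since the class can include a predictor that puts zero weight on the adversarial accounts' votes, average performance is eventually almost as good as that predictor's; but, as discussed above, nothing in the bound stops adversarial content from making the best achievable loss itself worse.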
The blog post really didn't come with a claim about security; I didn't even note the above fact while writing the blog post, but pointed it out in response to the question "Why do you think ML would withstand a determined adversary here?" The blog post did come with the claim "I think this will eventually work well," and in discussion "I think we can just try it and see." This was partly motivated by the observation that the setting is low stakes and the status quo implementations are pretty insecure.
(I'm clarifying because I will be somewhat annoyed if this blog post and discussion are later offered as evidence of my inability to think accurately about security, which seems plausible given the audience. I would not be annoyed if they were used as evidence that I am insufficiently attentive to security issues when thinking about improvements to stuff on the internet, though I'm not yet convinced of that, given the difference between generating ideas and implementing step 2: "Spend another 5-10 hours searching for other problems and considerations.")
I guess the real way to deal with this is to say "each comment imposes some cost to evaluate, you need to pay that cost when you submit a comment" and then to use an independent mechanism to compensate contributions. That seems like a much bigger change though.
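A toy sketch of what that could look like (the class, numbers, and method names here are all hypothetical; the comment only gestures at the idea):

```python
# A toy sketch (hypothetical names and numbers) of "pay the evaluation cost
# on submission, compensate good contributions through a separate mechanism".
EVAL_COST = 1.0  # assumed flat cost of having one comment evaluated

class Forum:
    def __init__(self):
        self.balances = {}       # user -> credits
        self.reward_pool = 0.0   # funded independently (e.g. by the site or donors)

    def submit(self, user, comment):
        # The submitter, not the readers, pays the cost of evaluation up front.
        if self.balances.get(user, 0.0) < EVAL_COST:
            raise ValueError("not enough credit to cover the evaluation cost")
        self.balances[user] -= EVAL_COST
        return comment  # hand off to whatever evaluation process the forum uses

    def compensate(self, user, judged_quality):
        # Independent mechanism: pay contributors out of the pool in
        # proportion to how good their contribution was judged to be.
        payout = min(self.reward_pool, judged_quality)
        self.reward_pool -= payout
        self.balances[user] = self.balances.get(user, 0.0) + payout
        return payout
```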
Something like this could be good as a software as a service startup, or a project within a company like reddit/disqus/etc. if someone could get a job there and build it for the company hack day or something. But I'm less optimistic in the context of a small forum like LW.
LW has less data to train on than a big discussion provider like reddit or disqus.
LW has users who are smart enough to exploit weaknesses in the algorithm.
Smart people are already reading LW--it shouldn't be hard to gather their opinions as they read, and any attempt to approximate their opinions will likely be inferior.
LW is not at a large enough scale to justify this level of automation. Right now the community is small enough that a single person could read and rate every comment without a lot of difficulty.
There's a chance that you'd put a lot of work into this without actually solving LW's core issues. A lean approach to fixing LW would aim for the minimum viable fix; a lean approach to a new forum would aim for the minimum viable forum.
I'm more excited about looking at small online forums through the lens of institution design. Robin Hanson says there are a lot of ideas for institutions that aren't being tried out. This represents a double coincidence of wants: online discussion is a problem in search of a solution, and prediction markets/eigendemocracy/etc. are solutions in search of problems.
Experimenting with new institutions seems highly valuable: Our current ones don't work very well, and there could be a lot of room to improve. Experimentation lets us battle test new designs, and also helps build the résumé that new designs would need to see deployment on a large scale. Institutions are highly relevant to EA cause areas like global poverty, existential risk, and preservation of the EA movement's values as it grows.
I'm more excited about looking at small online forums through the lens of institution design.
I think of virtual moderation as a central example of an institution that is relevant to online communities. What kind of institution do you have in mind?
The prediction itself could be done using machine learning, prediction markets, contractors who are paid based on the quality of their predictions, voting with informal norms about what is being voted on, etc.
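For the "contractors who are paid based on the quality of their predictions" option, one standard arrangement (just one possibility, not necessarily what is intended here) is to pay according to a proper scoring rule such as the Brier score:

```latex
% p: contractor's reported probability that the moderator hides the item;
% y \in \{0,1\}: what the moderator actually does; a is a base fee, b > 0 a scale factor.
\text{payment}(p, y) = a + b\,\bigl(1 - (p - y)^2\bigr)
```

Because the scoring rule is proper, expected payment is maximized by reporting one's true belief, so honest prediction is incentive-compatible (ignoring collusion and outside incentives).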
Smart people are already reading LW--it shouldn't be hard to gather their opinions as they read, and any attempt to approximate their opinions will likely be inferior
Those smart people's judgments are the main data being used; we're not trying to, e.g., use textual features of comments to predict what smart people will think; rather, we're trying to use what smart people think to predict what some trusted moderation process would output.
I was responding specifically to the machine learning version of your proposal. My contention is that for a community the size of LW, using machine learning to solve this seems a bit like killing a rat with a nuclear bomb. I also suspect that using an institution where all of the important decisions are made by humans is more likely to lead to knowledge that can be reapplied outside the context of an online forum.
Why don't people do this already?
I think Slashdot did something similar in 1999, but I haven't heard of them making any changes since then. They didn't try to crowdsource a simulation of a designated moderator, just some kind of aggregate crowd opinion, but they did (1) limit how often users could vote; (2) impose jury duty; (3) compare moderators; and (4) allow multiple axes of judgment. Why did no one copy them? Why did they not go further?
Perhaps many sites are doing this, secretly. Many sites publish karma for items, determined by transparent algorithms, but use secret algorithms to decide which items to show. Indeed, it is well known that Facebook shows a post to a random subset of subscribers to decide whether to show it to more.
But algorithms for discussion are always simpler and more transparent than algorithms for top-level items. Many sites just don't care about discussion. It wasn't Reddit's original purpose, but it is a big part of the site now, so if anyone were to do something interesting, it would be Reddit. But they don't.
As people have observed, there are possible concerns with manipulation. I think these can be addressed, but they might be a serious problem or might require strong theoretical machinery.
It also seems like it requires solving a non-trivial ML problem (to do sufficiently efficient semi-supervised learning in this particular setting); I think this problem looks tractable, but in general most people won't do something that has that kind of technical risk.
I don't think that's much of an answer. Maybe that's the answer to why people haven't done all of this, but why haven't people done some of this? Why does no one even copy what Slashdot did in 1999? Reddit's main adversary is manipulation, so the possibility that a new system would be manipulable isn't any worse than the status quo. But it may be that they don't explain their algorithms because they are afraid that doing so would make them more manipulable.
Most social media sites make money with ads. Savvy users block ads, so there's more money in appealing to the lowest common denominator. Quality online discussion is a public good that no one is incentivized to provide.
I voiced similar ideas in the past and think this would be a good project. It would work not only for a volunteer EA but also for people who want to start a startup.
I think a majority of blogs and newspapers that have comment sections would welcome such technology.
Given recent events, this might be a good project if you want to strengthen quality content more broadly on the internet.
Rather than relying on the moderator to actually moderate, use the model to predict what the moderator would do. I’ll tentatively call this arrangement “virtual moderation.”
...
Note that if the community can’t do the work of moderating, i.e. if the moderator was the only source of signal about what content is worth showing, then this can’t work.
Does the "this" in "this can't work" refer to something other than the virtual moderation proposal, or are you saying that even virtual moderation can't work w/o the community doing work? If so, I'm confused, because I thought I was supposed to understand virtual moderation as moderation-by-machine.
Oh, did you mean that the community has to interact with a post/comment (by e.g. upvoting it) enough for the ML system to have some data to base its judgments on?
I had been imagining that the system could form an opinion w/o the benefit of any reader responses, just from some analysis of the content (character count, words used, or even NLP), as well as who wrote it and in what context.
In the long run that's possible, but I don't think that existing ML is nearly good enough to do that (especially given that people can learn to game such features).
(In general when I talk about ML in the context of moderation or discussion fora or etc., I've been imagining that user behavior is the main useful signal.)
What do you think about the StackExchange model of moderation here, where people gradually acquire more moderation powers the higher their karma? What I really like about this model is that it doesn't require that any one person do a lot of work. Of course it's susceptible to value drift: it's unclear what you'll end up optimizing for once you put power in the hands of people who do the best at getting upvotes from whoever else is out there. Arguably it made MathOverflow less interesting in the long run: I think most of the high-rep people there (including me) eventually got too trigger-happy about closing "soft" questions that weren't just looking for technical results.
I'd advocate "upvoted by people with lots of karma" as a feature to use for prediction, with the trusted moderator still the ground truth.
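As a concrete (and entirely hypothetical) version of that, the sketch below treats "upvoted by people with lots of karma" as one feature among a few, with the moderator's hide/show decisions as the training labels; the function names, karma threshold, and model choice are mine, not anything from the thread.

```python
# Hypothetical sketch: "upvoted by high-karma users" as one feature among a few,
# trained against the trusted moderator's decisions as ground truth.
import numpy as np
from sklearn.linear_model import LogisticRegression

HIGH_KARMA = 1000  # arbitrary threshold for "lots of karma"

def comment_features(votes, karma):
    """votes: {user: +1 or -1} for one comment; karma: {user: karma score}."""
    ups = [u for u, v in votes.items() if v > 0]
    downs = [u for u, v in votes.items() if v < 0]
    high_karma_ups = sum(1 for u in ups if karma.get(u, 0) >= HIGH_KARMA)
    return [len(ups) - len(downs),   # raw score
            high_karma_ups,          # upvotes from high-karma users
            len(votes)]              # total engagement

def fit_moderator_model(all_votes, moderator_hid, karma):
    """all_votes: list of vote dicts; moderator_hid: 0/1 labels from the moderator."""
    X = np.array([comment_features(v, karma) for v in all_votes])
    return LogisticRegression().fit(X, np.asarray(moderator_hid))
```

The design point from the comment is preserved: high-karma votes only enter as evidence, and the moderator's decisions remain the labels being predicted.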
The StackExchange model includes clear rules about what content is allowed and what isn't. When it comes to LessWrong or the comment threads of a blog, deleting that much content isn't good.
Maybe an important advantage of AI-assisted moderation is that it would reduce personal conflict between moderators and the moderated (which might currently be a larger cost to moderators than the time involved). People who avoid a site for fear of being downvoted by humans might feel less bad about sometimes being downvoted by an algorithm trained on humans.
I guess the real way to deal with this is to say "each comment imposes some cost to evaluate, you need to pay that cost when you submit a comment" and then to use an independent mechanism to compensate contributions. That seems like a much bigger change though.