Seems difficult with prompt engineering alone, given today's context windows.
Maybe possible with RLHF if you can reliably find and hire people who can identify good content from bad.
If the fate of my company depended on me building this feature by tomorrow, I would take a bunch of existing spam and low-quality comments, create embeddings for them via the OpenAI API, average the vectors, and then auto-flag new comments if their embedding vector is too close to the spam vector.
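A minimal sketch of the averaged-embedding idea above, in plain Python. It assumes the embeddings have already been fetched and are available as lists of floats; the function names, the toy 3-dimensional vectors, the use of cosine similarity as the closeness measure, and the 0.9 threshold are all illustrative assumptions, not anything the comment specifies.

```python
import math


def centroid(vectors):
    """Average a list of equal-length embedding vectors component-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_probable_spam(comment_vec, spam_centroid, threshold=0.9):
    """Flag a comment whose embedding lies close to the spam centroid.

    The threshold is a made-up starting point; in practice it would be
    tuned against labelled examples.
    """
    return cosine_similarity(comment_vec, spam_centroid) >= threshold


# Toy vectors standing in for real embeddings of known spam comments.
spam_vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
spam_centroid = centroid(spam_vectors)

print(is_probable_spam([0.85, 0.1, 0.05], spam_centroid))  # near the spam cluster
print(is_probable_spam([0.0, 0.2, 0.9], spam_centroid))    # far from it
```

A single centroid is the crudest version of this. If spam falls into several distinct clusters, comparing against the nearest few known-spam vectors would probably work better than one average.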
I would use the LLM for a more modest goal: to reduce how "distinctive" the comment is.
"If you want to comment on my discussion site", I would tell my users, "you need to write in such a way that readers cannot easily tell who wrote the comment."
I'd really love to see some experiments here. If you do end up pursuing this, please come back here and share your results.
If we have better discussions, we'll make better decisions.
This is, perhaps, a pretty obvious idea.
Most online discussion takes place in virtual cesspits – Facebook, Twitter, the comments sections of news articles, etc. Social media and the ideological bubbles it promotes have been blamed for political polarization and ennui of young people around the world. Others have elaborated on this better than I can.
Sites like Stack Exchange and Reddit have made real efforts. The problem persists, so these solutions are at best incomplete. Of course some sites have excellent quality comments (for example, here at LessWrong), but these either have extremely narrow audiences or the hosts spend vast effort on manual moderation.
I'd like to see the wider culture have more discussion that consists of facts and reasoned arguments, not epithets and insults. Discussion that respects the Principle of Charity. Discussion where people try to seek truth and attempt to persuade rather than bludgeon those who disagree. Discussion where facts matter. I think such discussions are more fun for the participants (they are for me), more informative to readers, and lead to enlightenment and discovery.
PROPOSAL
When a commenter (let’s say on a news article, editorial, or blog post) drafts a post, the post content is reviewed by an LLM for conformity with “community values”. Those values are set by the host – the publication, website, etc. The host describes the values to the LLM in a prompt. The values reflect the kind of conversations the host wants to see on their platform – polite, respectful, rational, fact-driven, etc. Or not, as the case may be. They needn't (and probably shouldn't) involve “values” that shut down rational discussion or genuine disagreement (“poster must claim Earth is flat”, “poster must support Republican values”…), although I suppose some people may want to try that.
The commenter drafts a post in the currently-usual way, and clicks “post”. At that point the LLM reviews the comment (possibly along with the conversation so far, for context) and decides whether it meets community values for the site. If so, the comment is posted.
If not, the LLM explains to the poster what was wrong with the comment – it was insulting, it was illogical, it was…whatever. And perhaps offers a restatement or alternative wording. The poster may then modify their comment and try again. Perhaps the poster can argue with the LLM to try to convince it to change its opinion.
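The submit-review-revise loop described above could be sketched roughly as follows. Everything here is hypothetical: `Verdict`, `submit_comment`, and especially `toy_reviewer`, which is a keyword check standing in for a real LLM call so that the example runs on its own. A real reviewer would send the host's values prompt, the thread so far, and the draft to a model and parse its answer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    approved: bool
    explanation: str = ""                     # why the draft was rejected
    suggested_rewrite: Optional[str] = None   # optional alternative wording


def submit_comment(draft, thread, values_prompt, reviewer):
    """Post the draft to the thread only if the reviewer approves it.

    `reviewer` is any callable taking (values_prompt, thread, draft)
    and returning a Verdict. The poster sees the verdict either way
    and may revise and resubmit.
    """
    verdict = reviewer(values_prompt, thread, draft)
    if verdict.approved:
        thread.append(draft)
    return verdict


def toy_reviewer(values_prompt, thread, draft):
    """Stand-in for an LLM: rejects drafts containing a listed insult."""
    for insult in ("idiot", "moron"):
        if insult in draft.lower():
            return Verdict(
                approved=False,
                explanation=f"The comment insults another poster ('{insult}').",
                suggested_rewrite="I think this argument is mistaken, and here is why.",
            )
    return Verdict(approved=True)


thread = []
first = submit_comment("You are an idiot.", thread, "Be polite.", toy_reviewer)
print(first.approved, first.explanation)
second = submit_comment("I disagree, and here is why.", thread, "Be polite.", toy_reviewer)
print(second.approved, thread)
```

Passing the reviewer in as a parameter keeps the posting logic independent of whichever model or prompt the host chooses.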
IMPORTANT ELABORATIONS
That's the core concept.
One reasonable objection is that this is, effectively, a censorship mechanism. As described, it is, but limited to a single host site. The Internet is full of discussions and people are free to leave sites they find too constraining, so I'm OK with that.
Still, there are many ways to loosen the censorship aspect, and perhaps those will work better. Below are a couple I’ve thought of.
OVERRIDE SYSTEMS
If the LLM says a post doesn’t meet local standards, the poster can override the LLM and post the comment anyway.
Such overrides would be allowed only if the poster has sufficient “override points”, which are consumed each time a poster overrides the LLM (perhaps a fixed number per post, or perhaps dependent on how far out of spec the LLM deems the post).
Override points might be acquired:
Re buying with money, a poster could effectively bet the LLM about the outcome of human moderator review. Comments posted this way go online and also to a human moderator, who independently decides if the LLM was right. If so, the site keeps the money. If the moderator sides with the poster, the points (or money) are returned.
The expenditure of override points is also valuable feedback to the site host who drafts the “community values” prompt – the host can see which posts required how many override points (and why, according to the LLM), and decide whether to modify the prompt.
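One way the override-points bookkeeping might look, as a rough sketch. The class and method names, the cost formula, and the assumption that the LLM reports a severity score between 0 and 1 are all invented for illustration; the proposal above leaves these details open.

```python
from dataclasses import dataclass, field


def override_cost(severity, base_cost=1, scale=4):
    """Points needed to override, growing with how far out of spec the post is.

    `severity` is assumed to be a score in [0, 1] reported by the LLM.
    """
    return base_cost + round(severity * scale)


@dataclass
class OverrideAccount:
    points: int = 0
    staked: dict = field(default_factory=dict)  # comment_id -> points at stake

    def override(self, comment_id, cost):
        """Spend points to post despite the LLM's objection. False if too poor."""
        if cost > self.points:
            return False
        self.points -= cost
        self.staked[comment_id] = cost
        return True

    def settle(self, comment_id, moderator_sides_with_poster):
        """Resolve the bet after human review: refund if the LLM was wrong."""
        stake = self.staked.pop(comment_id)
        if moderator_sides_with_poster:
            self.points += stake


account = OverrideAccount(points=5)
cost = override_cost(severity=0.5)                 # 1 + round(0.5 * 4) = 3
print(account.override("c1", cost), account.points)
account.settle("c1", moderator_sides_with_poster=True)
print(account.points)                              # stake returned
```

Logging each `override` call together with the LLM's stated reason would give the host the feedback described above for tuning the values prompt.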
READER-SIDE MODERATION
Another idea (credit here to Richard E.) is that all comments are posted, just with different ratings, and readers see whatever they’ve asked to see based on the ratings (and perhaps other criteria).
The LLM rates the comment on multiple independent scales – for example, politeness, logic, rationality, fact content, charity, etc., each scale defined in an LLM prompt by the host. The host offers a default set of thresholds or preferences for what readers see but readers are free to change those as they see fit.
(Letting readers define their own scales is possible but computationally expensive – each comment would need to be rated by the LLM for each reader, rather than just once when posted).
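A small sketch of reader-side filtering under the host-defined scales. It assumes each comment was rated once at posting time and stores a score from 0 to 1 per scale; the scale names, the default thresholds, and the data layout are illustrative assumptions.

```python
# Host-supplied defaults; readers may override any of them.
DEFAULT_THRESHOLDS = {"politeness": 0.5, "logic": 0.5}


def visible_comments(comments, reader_prefs=None):
    """Return the comments that meet every threshold in effect for this reader.

    Reader preferences are merged over the host defaults. A comment with
    no rating on some scale is treated as scoring 0.0 on it.
    """
    thresholds = {**DEFAULT_THRESHOLDS, **(reader_prefs or {})}
    return [
        comment
        for comment in comments
        if all(
            comment["ratings"].get(scale, 0.0) >= minimum
            for scale, minimum in thresholds.items()
        )
    ]


comments = [
    {"text": "Here is a source that disagrees.",
     "ratings": {"politeness": 0.9, "logic": 0.8}},
    {"text": "Only a fool would think that.",
     "ratings": {"politeness": 0.1, "logic": 0.4}},
]

print(len(visible_comments(comments)))                                     # host defaults
print(len(visible_comments(comments, {"politeness": 0.0, "logic": 0.0})))  # see everything
```

Because the ratings are stored with the comment, changing a reader's thresholds only re-runs this cheap filter and never calls the LLM again, which is the cost advantage of host-defined scales over reader-defined ones.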
A points system could allow posters to modify a post's ratings, to promote something the LLM (or readers) would prefer not to see.