Automated plagiarism detection software is common. But cases like the recent incident with Harvard administrator Gay have shown that egregious cases of plagiarism are still being uncovered. Why would this be the case? Is it really so hard to run plagiarism checks on every paper on Sci-Hub? Has anyone tried?
I am curious since I am currently upskilling for the purposes of technical alignment research and it seems like an interesting project to pursue.
I think there are two reasons it's not more common to retroactively analyze papers and publications for copied or closely-paraphrased segments.
First, it's not actually easy to automate. Current solutions are rife with false positives and need human judgement to reach final conclusions.
Second, and perhaps more importantly, nobody really cares, outside of graded work where the organization is basing your credentials on doing original work (but usually not even that, just semi-original presentation of other works).
It would probably be a minor scandal if any significant papers were discovered to be based on uncredited or un-footnoted other work, but unless it were egregious (in which case it probably would have already been noticed), it's just not that big a deal.
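To make the first point concrete, here is a minimal sketch of the kind of word n-gram overlap check that a naive detector might run. It is an illustration under stated assumptions, not how any particular commercial tool works: the function names `shingles` and `containment`, the choice of 3-word shingles, and the example sentences are all made up for this sketch.

```python
import re

def shingles(text, n=3):
    """Return the set of overlapping n-word sequences in text (hypothetical helper)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(suspect, source, n=3):
    """Fraction of the suspect's shingles that also appear in the source."""
    s, t = shingles(suspect, n), shingles(source, n)
    if not s:
        return 0.0  # text shorter than n words: nothing to compare
    return len(s & t) / len(s)

# Illustrative texts, invented for this example.
source = "The quick brown fox jumps over the lazy dog near the quiet river bank"
cited = 'As Smith writes, "the quick brown fox jumps over the lazy dog near the quiet river bank."'
unrelated = "Completely different sentence about alignment research and neural networks"

print(containment(cited, source))      # high, even though the quote is attributed
print(containment(unrelated, source))  # no shared shingles
```

The attributed quotation scores high because the metric only sees shared wording; it has no notion of quotation marks or citations. Deciding whether a flagged passage is properly credited is exactly the step that still needs a human.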
Distinguishing between a properly cited paraphrase and taking someone's work as your own without sufficient attribution is not trivial, even for people. There's a lot of grey area in terms of how closely you can mimic the original before it becomes problematic (this is largely what I've seen Rufo trying to hang the Harvard administrator with: paraphrases that kept a lot of the original wording but were nonetheless clearly cited, which at least to me seems like bad practice but not actually plagiarism in the sense it is generally meant) and it comes dow...