The title doesn't seem to fit the question well: P-Hacking Detection doesn't map cleanly onto replicability, even though the presence of p-hacking usually means a study won't replicate.
I'm interested in automatic summarization of papers' key characteristics (PICO, sample size, methods), and I'm going to start building something soon.
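For what it's worth, a crude starting point for the extraction side could look like the sketch below. It's purely illustrative: the regexes and field names are placeholders, and real PICO extraction would need something much stronger (e.g. a trained information-extraction model), but it shows the kind of structured summary one might aim for.

```python
import re

def summarize_paper(text):
    """Pull a few structural characteristics out of a paper's text (toy heuristics)."""
    lower = text.lower()
    # Reported sample sizes like "n = 248"; take the largest as a rough overall N
    sizes = [int(m) for m in re.findall(r"\bn\s*=\s*(\d+)", lower)]
    # Study-design keywords actually present in the text
    designs = [kw for kw in ("randomized", "double-blind", "cross-sectional",
                             "cohort", "case-control", "meta-analysis")
               if kw in lower]
    return {
        "sample_size": max(sizes) if sizes else None,
        "design_keywords": designs,
        "mentions_preregistration": "preregister" in lower or "pre-register" in lower,
    }

print(summarize_paper("We ran a randomized, double-blind trial with n = 248 participants."))
```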
I'm sure there are some ML students/researchers on LessWrong in search of new projects, so here's one I'd love to see and probably won't build myself: an automated method for predicting which papers are unlikely to replicate, given the text of the paper. Ideally, I'd like to be able to use it to filter and/or rank results from Google Scholar.
Getting a good dataset would probably be the main bottleneck for such a project. Various replication-crisis papers which review replication success/failure for tens or hundreds of other studies seem like a natural starting point. Presumably some amount of feature engineering would be needed; I doubt anyone has a large enough dataset of labelled papers to just throw raw or lightly-processed text into a black box.
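To make the feature-engineering route concrete, here is a minimal sketch assuming someone has already assembled a labelled corpus. `load_labelled_papers` is a stand-in for that assembly step (not a real library call), and the features (count of reported p-values, fraction that are just barely significant, largest reported sample size, a preregistration flag) are just illustrative hand-crafted signals fed into scikit-learn's logistic regression.

```python
import re
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def extract_features(text):
    """Hand-crafted signals loosely associated with replication risk (illustrative only)."""
    lower = text.lower()
    # Reported p-values, e.g. "p = .03" or "p < 0.05"
    p_vals = [float(m) for m in re.findall(r"p\s*[<=]\s*(0?\.\d+)", lower)]
    n_p = len(p_vals)
    # Fraction of p-values that are just barely significant (a crude p-hacking proxy)
    just_sig = sum(0.01 <= p < 0.05 for p in p_vals) / n_p if n_p else 0.0
    # Largest reported sample size, e.g. "n = 120", on a log scale
    ns = [int(m) for m in re.findall(r"\bn\s*=\s*(\d+)", lower)]
    log_n = float(np.log10(1 + max(ns))) if ns else 0.0
    prereg = int("preregister" in lower or "pre-register" in lower)
    return [n_p, just_sig, log_n, prereg]

# Hypothetical labelled corpus: (full_text, replicated) pairs drawn from
# replication-project reports; label is 1 if the original finding replicated.
papers = load_labelled_papers()  # assumed helper, to be written against whatever corpus exists

X = np.array([extract_features(text) for text, _ in papers])
y = np.array([label for _, label in papers])

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc"))
```

With only a few hundred labelled papers, something this simple (few features, a linear model, cross-validated AUC) is probably the right level of ambition before anyone reaches for raw-text models.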
Also, if anyone knows of previous attempts to do this, I'd be interested to hear about them.