Update:
I've recently been experimenting with a small llm-generated wiki, mostly with Claude Code.
I was curious about the above, so I had it generate pages that cover Sam Altman's history and his prediction track record in much more depth.
https://www.longtermwiki.com/knowledge-base/people/sam-altman/
https://www.longtermwiki.com/knowledge-base/people/track-records/sam-altman-predictions/
I don't think there's anything incredibly novel there, but I think it's a decent overview.
I think the main Sam Altman page is more interesting than the prediction page, because his actions were a lot more interesting/informative than his predictions.
Thanks for letting me know!
I'm not very attached to the name. Some people seem to like it a lot, some dislike it.
I don't feel very confident about how it will mostly be used 4-12 months from now. So my plan at this point is to wait and see how it gets used, then consider renaming later as the situation becomes clearer.
Thanks!
I tried adding a "Clarity Coach" earlier. I think this is the sort of area where RoastMyPost probably wouldn't have a massive advantage over custom LLM prompts used directly, but we might be able to make some things easier.
It would be very doable to add custom evaluators that give tips along these lines. Doing a great job would likely involve a fair bit of prompting, evaluation, and iteration. I might give it a stab, and if so, I'll get back to you on this.
(One plus is that I'd assume this would be parallelizable, so it could at least be fast.)
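To make the parallelization point concrete, here's a minimal sketch of running several custom evaluator prompts concurrently. Everything here (the evaluator names, the prompts, and the `call_llm` placeholder) is hypothetical and not RoastMyPost's actual code; the point is just that independent LLM calls can run at the same time, so total latency is roughly the slowest single call rather than the sum.

```python
# Minimal sketch: run several custom evaluator prompts in parallel.
# All names and the call_llm placeholder are hypothetical.
import asyncio

EVALUATOR_PROMPTS = {
    "clarity-coach": "Point out unclear sentences and suggest clearer rewrites.",
    "jargon-check": "Flag jargon that a general reader might not know.",
}

async def call_llm(system_prompt: str, document: str) -> str:
    # Placeholder: swap in a real LLM client call here.
    await asyncio.sleep(0.1)  # simulate network latency
    return f"[evaluation based on: {system_prompt[:40]}...]"

async def run_all(document: str) -> dict[str, str]:
    # Each evaluator is an independent call, so they can run concurrently.
    tasks = {name: call_llm(prompt, document) for name, prompt in EVALUATOR_PROMPTS.items()}
    results = await asyncio.gather(*tasks.values())
    return dict(zip(tasks.keys(), results))

if __name__ == "__main__":
    print(asyncio.run(run_all("Some draft text to review.")))
```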
Thanks for reporting your findings!
As I stated here, the Fact Checker has a bunch of false positives, and you've noted some.
The Fact Checker (and the other checkers) has trouble telling which claims are meant as genuine and which are part of fictional scenarios, a la AI-2027.
The Fallacy Checker is overzealous, and it doesn't use web search (which would add cost), so it's particularly prone to mistakes on material from after the models' training cutoff.
There's clearly more work to do to make better evals. Right now I recommend using this as a way to flag potential errors, and feel free to add any specific evaluator AIs that you think would be a fit for certain documents.
I experimented with Opus 4.5 a bit for the Fallacy Check. Results did seem a bit better, but costs were much higher.
I think the main way I could picture usefully spending more money is to add some agentic setup that does a deep review of a given paper and presents a summary. I could see the marginal costs of this being maybe $10 to $50 per 5k words or so, using a top model like Opus. That said, the fixed costs of doing a decent job seem frustratingly high, especially because we're still lacking easy API access to existing agents (my preferred method would be a high-level Claude Code API, but that doesn't really exist yet).
I've been thinking of having competitions here, where people make their own reviews, and then we compare them against a few researchers and LLMs. I think this area leaves a lot of room for cleverness and innovation.
I also added a table back to this post that gives a better summary.
There's more going on, though that doesn't mean it will necessarily be better for your purposes than your own system.
There are some "custom" evaluators that are basically just what you're describing, but with specific prompts. In these cases, note that there's extra functionality for users to re-run evaluators, see the histories of the runs, and see a specific agent's evaluations across many different documents.
The "system" evaluators typically have more specific code. They have short readmes you can see more on their pages:
https://www.roastmypost.org/evaluators/system-fact-checker
https://www.roastmypost.org/evaluators/system-fallacy-check
https://www.roastmypost.org/evaluators/system-forecast-checker
https://www.roastmypost.org/evaluators/system-link-verifier
https://www.roastmypost.org/evaluators/system-math-checker
https://www.roastmypost.org/evaluators/system-spelling-grammar
Some of this is just the tool splitting a post into chunks, then running analysis on each chunk. Others differ more. The link verifier works without any AI.
One limitation of these systems is that they're not very customizable. So if you're making something fairly specific to your system, this might be tricky.
My quick recommendation is to run all the system evaluators on at least a few docs so you can try them out (or just look at their outputs on other docs).
Agreed!
The workflow we have does use a step for this. This specific workflow:
1. Chunks the document.
2. Runs analysis on each chunk, producing one long combined list of comments.
3. Feeds all the comments into a final step that sees the full post, removes a bunch of the comments, and writes a summary.
I think it could use a lot more work and tuning. Generally, I've found these workflows fairly tricky and time-intensive to work on so far. I assume they will get easier in the next year or so.
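For readers who prefer code, here's a rough sketch of that chunk -> per-chunk comments -> final filter-and-summarize shape. All of the names and the stand-in logic are hypothetical, not the actual RoastMyPost implementation; it just illustrates the structure described above.

```python
# Rough sketch of the three-step workflow: chunk, analyze per chunk, finalize.
# All names and stand-in logic are hypothetical.
from dataclasses import dataclass

@dataclass
class Comment:
    chunk_index: int
    text: str

def chunk_document(doc: str, max_chars: int = 4000) -> list[str]:
    # Step 1: naive fixed-size chunking; a real system could split on sections.
    return [doc[i:i + max_chars] for i in range(0, len(doc), max_chars)]

def analyze_chunk(chunk: str, index: int) -> list[Comment]:
    # Step 2: one analysis pass per chunk, returning candidate comments.
    # Placeholder output; a real version would call an LLM and parse its response.
    return [Comment(index, f"Possible issue in chunk {index}: ...")]

def finalize(doc: str, comments: list[Comment]) -> tuple[list[Comment], str]:
    # Step 3: a final pass that sees the whole post, drops weak or duplicate
    # comments, and writes an overall summary.
    kept = comments[:10]  # stand-in for an LLM-based filtering step
    summary = f"Kept {len(kept)} of {len(comments)} comments on a {len(doc)}-char post."
    return kept, summary

def review(doc: str) -> tuple[list[Comment], str]:
    chunks = chunk_document(doc)
    comments = [c for i, chunk in enumerate(chunks) for c in analyze_chunk(chunk, i)]
    return finalize(doc, comments)
```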
Thanks for trying it out and reporting your findings!
It's tricky to tune the system to flag important errors without flagging too many. Right now I've been prioritizing the former, on the assumption that it's better to show too many errors than too few.
The Fact Checker definitely does make mistakes (often due to the chunking, as you flagged).
The Fallacy Checker is very overzealous. I've scaled it back, but will continue to adjust it. I think the fallacy-check style of evaluation is quite tricky to do well, and I've been thinking about some much more serious approaches. If people here have ideas or implementations, I'd be very curious!
Came here to post something similar. My quick guess is that such a march would get a better combination of attendance and direction by expanding the scope a fair bit, but I realize that the creators have certain preferences here.