I really want a version of the fraudulent research detector that works well. I fed in the first academic paper I had on hand from some recent work and got:
Severe Date Inconsistency: The paper is dated December 12, 2024, which is in the future. This is an extremely problematic issue that raises questions about the paper's authenticity and review process.
Even though it thinks the rest of the paper is fine, it gives it a 90% retraction score. Rerunning on the same paper once more gets similar results and an 85% retraction score.
For the second paper I tried, it gave a mostly robust analysis, but only after completely failing to output anything the first time around.
After this, every input of mine got the "Error Analysis failed:" error.
Thanks for the feedback! I think the nature of a hackathon is that everyone is trying to get something that works at all, and "works well" is just a pipe dream haha. IIRC, there was some interest in incorporating this feature directly into Elicit, which would be pretty exciting.
Anyways I'll try to pass your feedback to Panda and Charlie, but you might also enjoy seeing their source code here and submitting a Github issue or pull request: https://github.com/CG80499/paper-retraction-detection
Good on you for hosting, and congrats to the winners! I've been working on an AI for Epistemics tool of my own: a RAG system for LW articles. I plan to use it to get an on-demand rationalist perspective. Dropping the link here if anyone wants to try it, but if you do, please be sensible!
Oh cool! Nice demo and happy to see it's shipped and live, though I'd say the results were a bit disappointing on my very first prompt:
(if that's not the kind of question you're looking for, then I might suggest putting in some default example prompts to help the user understand what questions this is good for surfacing!)
The intended use case was a sounding board for rationality-related questions. Imagine "Ask The Rabbi," but for LW users. I sourced the documents from the community's "Best of" collection, and the model is instructed to reference specific chunks in its answer. No air-purifier articles were in the original set, but it's an interesting question, so I have added them to the set:
AI for Epistemics is about helping to leverage AI for better truthseeking mechanisms — at the level of individual users, the whole of society, or in transparent ways within the AI systems themselves. Manifund & Elicit recently hosted a hackathon to explore new projects in the space, with about 40 participants, 9 projects judged, and 3 winners splitting a $10k prize pool. Read on to see what we built!
Resources
Why this hackathon?
From the opening speeches; lightly edited.
Andreas Stuhlmüller: Why I'm excited about AI for Epistemics
In short: AI for Epistemics is important and tractable.
Why is it important? If you think about the next few years, things could get pretty chaotic. As everyone rushes to integrate AI systems into every part of the economy, the world could change more rapidly than it does today. There's significant risk that people and organizations will make mistakes for relatively uninteresting reasons—simply because they didn't have enough time to think things through.
If we can make it easier for people to think clearly and carefully, that's really important. People will use AI tools to help them make decisions either way; eventually unassisted decision-making just won’t be competitive anymore. This is a lever: the more these tools actually help people make wise decisions, or help them figure out whether they're right or wrong about something, the better off we'll be.
AI for Epistemics is also tractable now in a way it wasn't before. We're just reaching the point where models are good enough and cheap enough to apply at scale. You can now realistically say, "Let's analyze all news articles," or "Let's review all scientific papers," or thoroughly check every sentence of a document, at a level of detail that wasn't feasible before.
Given good ideas for epistemic tools, the implementation cost has dropped dramatically. Building significant products in hackathons has become much easier. You can basically copy and paste your project description into Cursor, type "please continue" five times, and you'll have a working demo.
The key challenge we'll need to think about today is: how can we tell if we're actually making things better? What evidence can we see that would lead us to believe a tool genuinely improves people's thinking, rather than just being a fun UI with knobs to play with?
I'm really excited about this hackathon. This is the event I've been most excited about for quite a while. I'm very grateful to Austin for creating this space for us.
Austin Chen: Why a hackathon?
Andreas first talked to me a couple of months ago, saying we wanted to do more for the AI for Epistemics field. We were thinking about some ideas: "oh, maybe we should do a grants program, or a fellowship program, or something like that".
But I have a special place in my heart for hackathons specifically. So I really sold him hard: we're gonna do a hackathon. We can do all that other stuff later too, but first things first. (Andreas, wryly: "I was very hard to sell.")
I like hackathons for a lot of reasons:
Those are some of the reasons I'm excited about hackathons. I'm glad that Andreas and the Elicit team are happy to host this with us today.
Meet the projects
We asked the participants to share more about their project after the hackathon ended. Comments are mostly Austin’s.
Question Generator, by Gustavo Lacerda
Demo: Starts 6:18
Description: This is a browser extension that generates forecasting questions related to the news page you are visiting.
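Not the extension's actual code, but a minimal sketch of the kind of LLM call such an extension might make under the hood; the OpenAI client, model name, and prompt wording here are illustrative assumptions:

```python
# Hypothetical sketch: turn a news article into candidate forecasting questions.
# Model name and prompt are illustrative, not the project's own.
from openai import OpenAI

client = OpenAI()

def forecasting_questions(article_text: str, n: int = 5) -> str:
    prompt = (
        f"Read this news article and propose {n} forecasting questions it raises. "
        "Each question should be concrete, have an unambiguous resolution criterion, "
        "and resolve within 12 months.\n\n" + article_text
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example: print(forecasting_questions(open("article.txt").read()))
```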
Comments: Good exploration of a promising form factor (a Chrome extension that fits into your personal browsing flow). I like that it ends with "create a Manifold question" as a concrete next step. I'm not sure whether the questions were actually any good? But with LLMs, maybe it's always a brainstorming aid, or LLMs generate and humans filter (as with imagegen).
Symphronesis, by Campbell Hutcheson (winner)
Demo: Starts 14:34
Description: Automated comment merging for LessWrong: it finds disputes between the comments and the text, then highlights the disputed passages in the text; highlights are color-coded, and you can mouse over one to jump to the corresponding comment.
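As a rough illustration of that mechanism (not Symphronesis's actual implementation), here's a minimal sketch in which an LLM is asked to map comments to the passages they dispute; the JSON schema, model, and prompt are assumptions:

```python
# Hypothetical sketch of the core step: ask an LLM which comments dispute which passages.
import json
from openai import OpenAI

client = OpenAI()

def find_disputes(post_text: str, comments: list[dict]) -> list[dict]:
    """comments: [{"id": "c1", "text": "..."}, ...]"""
    prompt = (
        "Here is a post and its comments. Find every place where a comment disputes a claim in the post. "
        'Respond with a JSON object {"disputes": [...]}, where each item has keys: '
        '"comment_id", "quote" (the exact disputed passage from the post), and "summary".\n\n'
        f"POST:\n{post_text}\n\nCOMMENTS:\n{json.dumps(comments)}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["disputes"]

# A frontend can then color-highlight each returned "quote" in the post and link it to its comment.
```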
Why did you build this?: I’m interested in how LLMs will enable highly personalized UI/UX. One of my main contentions is that software became mass produced because the cost of development is very high and so it was prohibitive to create artisanal software solutions for individuals - but that LLMs - because they make software cheaper - give us the opportunity to return to a more artisanal software experience - where our interface to software is created dynamically. Moreover, as the cost to benefit ratio of software was even worse in design than elsewhere - good design has been something essentially limited to software companies that aggressively focus on it as part of their core value prop (e.g. Apple, Notion, Linear). But, this can now change, and we can have better, more personalized, richer experiences.
What are you most proud of for this project?: It worked. It has nice bells and whistles. It enables me to have more control over a document as an organic thing.
Source: https://github.com/chutcheson/Symphronesis
Comments: Interesting & pretty UI, reasonable concept. Lots of audience questions about how it was implemented. Lukas: "Unfortunate to do this for LessWrong, which is the website with the most support for this already."
Manifund Eval, by Ben Rachbach & William Saunders
Demo: Starts 1:38
Description: Screens all Manifund projects to identify the ones worth a closer look for funding. It also pulls out each grant's story for how it would help transformative AI go well, so you can review that and save time in your evaluation. This makes it feasible to quickly sift through the large number of Manifund projects and find promising ones to consider.
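A minimal sketch of what the screening step might look like, assuming projects have already been fetched from the Manifund API; the field names, prompt, and model are illustrative, not the project's actual code:

```python
# Hypothetical screening step over already-fetched Manifund projects.
# "title"/"description" are assumed field names; adjust to the real API response.
from openai import OpenAI

client = OpenAI()

SCREEN_PROMPT = (
    "You are screening grant applications. Given the project below, (1) summarize its story for "
    "helping transformative AI go well, and (2) rate from 1-10 how much it deserves a closer look.\n\n"
)

def screen(project: dict) -> str:
    text = f"Title: {project['title']}\n\nDescription: {project['description']}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": SCREEN_PROMPT + text}],
    )
    return response.choices[0].message.content

# for p in projects: print(p["title"], "\n", screen(p))
```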
Demo link: https://manifundeval-zfxpigvo8jemehaybdwwsw.streamlit.app/
Comments: Of course, a soft spot in my heart for using the Manifund API. Pretty important and impactful project (Andreas: "I actually need this.") Not sure whether the final scores or reasoning it output were that good, though; they didn't seem that great by my lights. I might be biased: I'd tried something similar (for giving feedback to potential new projects) and it was only okay. But def worth more experimentation. I think I might want to issue a bounty to solve this problem for Manifund.
Detecting Fraudulent Research, by Panda Smith & Charlie George (winner)
Demo: Starts 20:59
Description: There's a lot of research. A lot of it seems bad. How much? We use language models to try to detect retraction-worthy errors in published literature. We reason purely from first principles, without using meta-textual information.
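To make the "first principles only" constraint concrete, here's a hedged sketch of what such a check might look like; the prompt, model choice, and scoring scale are illustrative assumptions, not the team's actual pipeline (see their source below):

```python
# Hypothetical sketch of a first-principles retraction check: the model sees only the paper's
# own text, with no citation counts, venue, or author metadata.
from openai import OpenAI

client = OpenAI()

def retraction_check(paper_text: str) -> str:
    prompt = (
        "Read this paper and look for retraction-worthy problems you can establish from the text alone: "
        "impossible statistics, internally inconsistent numbers, methods that cannot produce the reported "
        "results, or signs of data fabrication. Do not use reputation, venue, or citation information. "
        "End with a 0-100 score for how likely the paper is to deserve retraction.\n\n" + paper_text
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # a reasoning model is a natural choice here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```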
Why did you build this? Panda: At Elicit, I spend a lot of time thinking about people’s info sources. I’ve also read metascience blogs for a long time. I assumed there would be some fraud/bad papers that modern reasoning models could catch pretty easily. (I didn’t think there’d be so much!)
What are you most proud of for this project? Panda: Very happy with doing a mix of "research," where we ran the numbers on how effective our technique was, and prototyping, making something people can get their hands on.
Source: https://github.com/CG80499/paper-retraction-detection
Demo link: https://papercop.vercel.app/
Comments: Had the most "wow this is fun to play with" factor, and also "I can see this going viral". I particularly liked that they had some semblance of evals (taking 100 papers and running them through), rather than just one or two demo cases; with LLM stuff it's easy to focus on one or two happy cases, and I'm glad they didn't.
Artificial Collective Intelligence, by Evan Hadfield
Demo: Starts 29:13
Description: ACI is a consensus-finding tool in the style of Polis / Community Notes, simulating a diverse range of perspectives. LLMs play the role of extra participants, submitting suggestions and voting on entries.
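As a rough sketch of the "LLMs as extra participants" idea (not the ACI implementation itself), one could prompt a handful of personas to vote on each candidate statement and look for statements that score well across all of them; the personas, prompt, and aggregation here are illustrative:

```python
# Hypothetical sketch of simulated participants voting Polis-style on candidate statements.
from openai import OpenAI

client = OpenAI()

PERSONAS = ["a rural small-business owner", "a climate scientist", "a libertarian economist"]

def vote(persona: str, statement: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are {persona}. Answer only AGREE, DISAGREE, or PASS."},
            {"role": "user", "content": statement},
        ],
    )
    return response.choices[0].message.content.strip()

def consensus_score(statement: str) -> float:
    votes = [vote(p, statement) for p in PERSONAS]
    # Statements that score high across all personas are candidates for consensus.
    return votes.count("AGREE") / len(votes)
```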
Demo link: https://aci-demos.vercel.app/
Comments: Most ambitious IMO — an entire platform of sims. With more time to develop this, I could see this as my favorite entry. Unfortunately, it lost points for not having a live working demo :(
Thought Logger and Cyborg Extension, by Raymond Arnold
Demo: Starts 35:26
Description: I have a pair of products: a keylogger, which tracks all your keystrokes (except from apps you put on a blocklist) and exposes them on a local server; and a "prompt library" Chrome extension, which lets me store fairly complicated prompts and quickly run them while pulling a website or the keylogger logs into context.
For demo day, I worked on a "useful personal predictions" prompt for the prompt library, which takes in my keylogs from the past 2 days, extrapolates what projects I seem to be working on, and generates prediction statements about my projects that help guide my strategy (e.g. "I'll get at least 3 positive reports from users about my product helping them, spontaneously, in the next 2 months"). When I see ones I like, I enter them into Fatebook.
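A minimal sketch of what that flow might look like as code, assuming a hypothetical local endpoint for the keylogger and an illustrative prompt (the actual tools aren't public; see below):

```python
# Hypothetical sketch of the "useful personal predictions" flow: pull recent keylogs from the
# local server and ask an LLM for Fatebook-style prediction statements.
# The localhost URL and prompt are illustrative assumptions, not the actual tool.
import requests
from openai import OpenAI

client = OpenAI()

def personal_predictions(log_url: str = "http://localhost:8000/logs?days=2") -> str:
    keylogs = requests.get(log_url).text  # the keylogger is assumed to expose recent logs here
    prompt = (
        "From these keystroke logs, infer what projects I'm working on, then write 5 concrete, "
        "decision-relevant prediction statements about them (with timeframes) that I could enter "
        "into Fatebook.\n\n" + keylogs
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```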
Why did you build this? The general idea of the keylogger + prompt library is to set me up to leverage AI in all kinds of customized ways over the next couple years. I want to be an AI poweruser, and to have an easy affordance to invent new workflows that leverage it in a repeatable way.
I think “decision-relevant predictions” is a good tool to help you get calibrated on whether your current plans are on track to succeed. But operationalizing them is kind of annoying.
Source: The tools aren’t public yet, but message me at raemon777@gmail.com if you’d like to try them out.
Comments: Interesting set of work; I like the keylogger idea and the picture of "record everything and have LLMs sort it out". In practice it had the flavor of optimizing one's personal setup a bit too much, and of being hard to scale out (see also: complex Obsidian thought mapping, or spaced repetition).
Double-cruxes in the New York Times’ “The Conversation”, by Tilman Bayer
Demo: Starts 41:17
Description: "The Conversation" is a weekly political debate format in the New York Times "Opinion" section between conservative(ish) journalist Bret Stephens and liberal(ish) journalist Gail Collins, ongoing since 2014. I used Gemini 2.0 Flash Thinking to identify double-cruxes in each debate, with the aim of tracking both participants' shifts over time.
Why did you build this?: Double-Cruxes are a somewhat intricate epistemic concept that so far doesn't seem to have made it very far beyond the LessWrong sphere. I wanted to explore whether one could use current LLMs to apply it at scale to a (non-cherrypicked) corpus of political debates aimed at a general audience.
What are you most proud of for this project?: After some experimentation, I found a prompt+model combination that holds up quite well in vibe tests so far.
Source: Presentation slides from the hackathon
Comments: Unclear to me whether double-cruxes are important epistemic tech, esp. whether they have broad reach. Didn't really have a working demo, sadly.
Trying to make GPT-4.5 non-sycophantic (via a better system prompt), by Oliver Habryka
Demo: Starts 47:01
Description: I tried to make a system prompt for GPT-4.5 that actually pushes back on things I say and that I can argue with in productive ways. It isn't perfect, but honestly it's a bunch better than other experiences I've had arguing with LLMs.
Prompt: link
Comments: Many points for directly trying something out of Owen's spec. And for having the bravery to do a "non-technical" hack — as LLMs do more of the technical work, what's left for humans is prompting well, imo. And for something that is immediately usable!
Squaretable, by David Nachman (winner)
Demo: Starts 55:27
Description: To assist a user in decision-making, the app uses LLMs to help the user come up with weighted factors, possible options, and factor values for each option. The UI consists of an always-displayed table of the factors, options, weights, and values. The final score for each option is computed symbolically as a weighted sum of the values and weights.
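The scoring rule itself is simple; here's a minimal sketch of the weighted-sum computation described above, with illustrative factor names and numbers (the LLM only proposes the factors, weights, and values; the arithmetic is done symbolically, as in plain code like this):

```python
# Minimal sketch: each option's score is the weighted sum of its factor values.
def option_score(weights: dict[str, float], values: dict[str, float]) -> float:
    return sum(weights[factor] * values[factor] for factor in weights)

weights = {"cost": 0.5, "speed": 0.3, "quality": 0.2}
options = {
    "Option A": {"cost": 7, "speed": 4, "quality": 9},
    "Option B": {"cost": 5, "speed": 8, "quality": 6},
}
scores = {name: option_score(weights, vals) for name, vals in options.items()}
# {'Option A': 6.5, 'Option B': 6.1}
```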
Comments: Great UI, information is pretty well laid out and yet compact, love the colors. Unfortunate that David didn’t seem to think that the LLM’s results were that good. Andreas: “maybe better UX if you add columns incrementally, easier to spot check”. Makes sense, kind of like git diffs or what Cursor does in chat mode.
What went well
Lots of great people came for this! Very hard to think of more central folks for AI for Epistemics:
Our lovely faces, once more. From left to right: Rafe Kennedy, Oli Habryka, Evan Hadfield, Kirill Chesnov, Owain Evans, Charlie George, Panda Smith, Gustavo Lacerda, Andreas Stuhlmüller, Austin Chen, David Nachman (virtual), Lukas Finnveden, Tamera Lanham, Noa Nabeshima, Campbell Hutcheson, Keri Warr, Xyra Sinclair, Tilman Bayer, Raymond Arnold, Chris Lakin, Ozzie Gooen
Not pictured participants and viewers: William Saunders, Ben Goldhaber, Deger Turan, Vishal Maini, Ross Rheingans-Yoo, Ethan Alley, Dan Selsam, Stephen Grugett, David Chee, Saul Munn, Gavriel Kleinwaks and many others…
What could have gone better
Final notes
Overall, we're very happy with how this hackathon turned out. Building a new field from scratch is a difficult, high-dimensional problem, and this is just one step along the way; but I think we made meaningful progress, with the ideas we brainstormed, the hacks we demoed, and the people we gathered.
After the end of the hackathon, a few of the judges and participants continued to discuss: “What’s next for AI for Epistemics? How does one build a nascent field? Is ‘AI for Epistemics’ even a good name?” We’ll try to share more on this in the coming days; until then, if AI for Epistemics excites you, leave a comment or reach out to us!