I Had Claude Read Every AI Safety Paper Since 2020, Here's the DB

Corm

Update: I am currently working on an approach to get the extended LW/Alignment Forum/blog sphere included in a smarter way^[1]. I'm using https://github.com/StampyAI/alignment-research-dataset as a jumping off point.

Click here if you just want to see the Database I made of all^[2] AI safety papers written since 2020 and not read the methodology. To some extent the core idea here is to encode as much info from these papers into something small enough that an AI with a specific problem in mind can take in all of these encodings from all of the papers and decide which ones are worth reading/investigating more.

Over the last month I have been trying to see just how much I can learn and do from a cold start^[3] in the world of AI safety. A large part of this has been frantically learning mech interp, but I've picked up two projects that I think are worth sharing!

There are a lot of AI Safety papers. When I started working on more hands on projects there wasn't a clear way to find relevant papers. For example, if I wanted related datasets there's not a great way to search for datasets. Huggingface has a dataset search, but the search functionality is terrible. I could ask an AI to try to find relevant datasets but that's mostly just the whims of google searching. Many good datasets are hidden away in not famous papers^[4]. Or, if I want to look at using a specific technique it's not easily searchable to find all papers about sparse autoencoding.

So, I had Claude^[5] read every single paper it confidently classified as AI safety and then summarize it, tag it, record the year it was published, authors, etc^[6]. My methodology was to start with simply asking Claude to find as many AI Safety papers as it could, as well as any existing lists of AI safety papers. This got me to ~350 papers. Then, I collected every paper that referenced at least three of these papers (~8000) and then had Claude read the ones it thought were confidently actually AI Safety papers (~3000). This citation based approach means that blog posts or anything that doesn't have an arXiv is going to be underrepresented in this dataset. Expanding off the initial 350 papers does mean that this database is biased towards the specific starting papers.

There are currently close to 4000 papers in the database which feels like an absolutely insane number of papers in the last 5 or so years. It definitely seems like many (the majority) of these papers are not substantive. My model is basically that the vast majority of papers published are written by people playing the university/academic game. The goal of playing the academia game isn’t to lower the odds of AI causing catastrophic harm to humanity, it’s to publish novel papers that get cited by other academics to build reputation which leads to good jobs where you get paid to keep working on fun problems.

Neel Nanda likes to talk a lot about how “My ultimate north star is pragmatism - achieve enough understanding to be (reliably) useful.”^[7] and when I first read that it felt really obviously trivially true, why does it need to be stated. But, the better my model of mech interp (and AI safety as a whole) the more I understand why this is so important to state. LLMs are super opaque super interesting super complex. A byproduct of this is that the space of interesting fun projects one could work on is just absolutely enormous. There are so so many novel papers to be written^[8].

All of which to say, the ratio of number of papers to the amount it's helping us not die is pretty depressing. The volume of papers makes it harder to find the good stuff. But, that's not to say there can't be value gleaned from them! I have found this database to be quite helpful when thinking through a new project. It's easy to find the relevant papers (and then have an AI read and synthesize the most relevant ones for me). It's easy to find all of the datasets that might be relevant. I used it to help me source the datasets I used in the other project I'm publishing today on removing CCP bias from and red-teaming kimi k2.5.

Check it out here! Preview below:

^{^}
The arXiv reference approach clearly does poorly for things that aren't part of the reference/citation world
^{^}
Obviously not all, I am certain I am missing some - especially from 2020 and 2021 since recency biased. But, if you think I am missing something it's easy to submit to be added!
^{^}
Former quant trader, so relatively technical background - but definitely not a CS PhD
^{^}
Just as an arbitrary example every MATS fellow is given 12k in compute, there are papers where much of that compute went straight into creating high quality datasets such as this truthfulness dataset
^{^}
A mix of Sonnet 4.6 (~70%) and Opus 4.6 (~30%), I switched from Opus to Sonnet due to cost consideration once I had a better idea of just how many papers there would be.
^{^}
I also had Claude score the papers on "Novelty", "Applicability", and "Compute Requirements". I wouldn't put too much stock into them. There's probably something interesting to be gleaned from what Claude thinks is novel, applicable, or compute-y in the world of AI safety, but this is not that post.
^{^}
How To Become A Mechanistic Interpretability Researcher
^{^}
That's before we even talk about the papers that truly believe their research on year+ old models represents the current state of the game. To be clear I think work on smaller models is great, nothing against that - it just seems like academia frequently likes to pretend old models are cutting edge.

Did you only get papers or also lesswrong posts? there are a lot of very high quality lesswrong posts. I've been thinking about doing something similar to this, would be mainly focusing on lesswrong posts by default. Interested in reusing your list of papers, likely I'll process them a bit differently if I get to this

edit: probably good to list some authors I'd want to make sure to get:

Scott Garrabrant
Abram Demski
Vanessa Kosoy
Richard Ngo
John Wentworth
Wei Dai
Andrew Critch
Steven Byrnes
TurnTrout
Cleo Nardo
Diffractor
Jan Kulveit
Jacob Hilton?
Cole Wyeth?
plex?
Zach Furman
J Bostock
Quinn
Max von Hippel
Raemon?
davidad?
Jeremy Gillen
List Is Incomplete Because I Was Reciting From Memory And Then Scrolling The Lesswrong Home Page And Definitely Missed At Least One Very Important Name But I Am Not At All Intending To Include Everyone Who Posts On Less Wrong

There are many more authors who have written posts that I think are good the way these posters' posts are good, but who are not prolific, and so I don't have their names memorized. There are also many people who are not working on alignment that is intended to take scratches at the overwhelming-superintelligence-alignment problem, only on local alignment, which is merely hoped to be a useful tool in climbing the overwhelming-superintelligence wall; they might be worth including, but it would be important to warn Claude that local alignment and asymptotic alignment are different things (I have a post upcoming about this).

The original search before I went with the references based approach got a couple posts, but I think clearly not enough. I couldn't figure out a good systematic way to get posts, but I will definitely spend some more time thinking about this and add a tag for LW/AI Alignment posts,

https://github.com/StampyAI/alignment-research-dataset needs work but is the codebase I'll be improving when I get back to this. I'd encourage you to steal from it. perhaps clone it in ../ and tell claude code to look at it as needed. note: a major todo for me is getting it to get comments, which it doesn't do now.

edit: I can also send you a messier codebase that is able to get full user data. also load up Wei Dai's updated userscript, I think it has the important access patterns you need for this

Okay, I finished the first pass at this: https://sladebyrd.com/ai-safety-db/posts
any thoughts?

"I'd encourage you to steal from it. perhaps clone it in ../ and tell claude code to look at it as needed. note: a major todo for me is getting it to get comments, which it doesn't do now."

Working on this right now

Looks great! The main additional sources that come to mind that aren't on arXiv or are there in only limited form are the papers from the extended Olahverse at https://transformer-circuits.pub/ and https://distill.pub/

Thanks for the suggestion! Most of https://transformer-circuits.pub/ have an arXiv and got picked up, but it seems like I missed two of them. It looks like I don't have anything from https://distill.pub/ which I will work on.

Oh sorry! Missed them bc the arxivs have ~0 cites. I do think the monthly updates are also valuable tho, and the HTML pages have a lot of extra results. (CoI that's my old team ^^)

Did Claude's original search turn up the Alignment Research Dataset? https://huggingface.co/datasets/StampyAI/alignment-research-dataset

huggingface is out of date, I was hesitant to update it for reasons that seem silly now. In any case, for doing this right, the codebase is more relevant. We should probably chat about getting it up to date - I'm the main bottleneck to that, but maybe someone else wants it to work badly enough to clean up my mess before I do. I want to get eigentaste and a blog post about local-vs-asymptotic alignment out the door, then may be returning to this as my main focus, though I'm not ready to promise that.

Incidentally, I have an idea for a way to restructure articles to make them more amenable to use as search results, if it works it'll make comparing claims across articles much easier.

No it did not! I'll take a look

Oh this is a cool thing you're doing. Big props!
This is probably an atrociously slow computation to do, but it would be neat if you performed clustering on the embeddings of all of these papers, passed each centroid to an LLM, and had it write a summary of each.

The category you are using is not entirely suitable. Since Adversarial Robustness can be split into more detailed areas, like white-box jailbreaks, black-box jailbreaks, gradient-based jailbreaks, or red-LLM team jailbreaks. Don't you think it would be better if you tagged them more subtly