As artificial intelligence (AI) systems become more advanced, concerns about large-scale risks from misuse or accidents have grown. This report analyzes the technical research into safe AI development being conducted by three leading AI companies: Anthropic, Google DeepMind, and OpenAI.

We define “safe AI development” as developing AI systems that are unlikely to pose large-scale misuse or accident risks. This encompasses a range of technical approaches aimed at ensuring AI systems behave as intended and do not cause unintended harm, even as they are made more capable and autonomous.

We analyzed all papers published by the three companies from January 2022 to July 2024 that were relevant to safe AI development, and categorized the 80 included papers into nine safety approaches. Additionally, we noted two categories representing nascent approaches explored by academia and civil society, but not currently represented in any research papers by these leading AI companies. Our analysis reveals where corporate attention is concentrated and where potential gaps lie.

Some AI research may stay unpublished for good reasons, for example to avoid informing adversaries about the details of the safety and security techniques they would need to overcome to misuse AI systems. Therefore, we also considered the incentives that AI companies have to research each approach, regardless of how much work they have published on the topic. In particular, we considered reputational effects, regulatory burdens, and the extent to which each approach could be used to make a company's AI systems more useful.

We identified three categories where there are currently no or few papers and where we do not expect AI companies to become much more incentivized to pursue this research in the future: model organisms of misalignment, multi-agent safety, and safety by design. Our findings indicate that these approaches may be slow to progress without funding or efforts from government, civil society, philanthropists, or academia.

Dataset of papers: GitHub.


One reason to be interested in this kind of work is as a precursor to measuring the value of different labs' safety publications. Right now the state-of-the-art technique for that is to count the papers (and maybe elicit vibes from friends). But counting papers is crude; it misses how good the papers are (and sometimes misses valuable non-paper research artifacts).


See more of IAPS's research. My favorite piece of IAPS research is still Deployment Corrections from a year ago.


I assume all the data is fairly noisy: scanning for the domain I know in https://raw.githubusercontent.com/Oscar-Delaney/safe_AI_papers/refs/heads/main/Automated%20categorization/final_output.csv, it misses ~half of the GDM Mech Interp output from the specified window, and it also mislabels https://arxiv.org/abs/2208.08345 and https://arxiv.org/abs/2407.13692 as Mech Interp (though two labels are applied to these papers and I didn't dig to see which was used).
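For anyone who wants to run a similar spot check, a minimal sketch in Python (using pandas) might look like the following. The column names used for filtering ("organization", "category", "title", "url") are assumptions rather than the file's confirmed schema, so print the columns first and adjust accordingly.

```python
# Minimal sketch: load the published categorization CSV and filter it for one
# lab and one category. The column names below are assumptions, not the file's
# confirmed schema; check df.columns and rename as needed.
import pandas as pd

CSV_URL = (
    "https://raw.githubusercontent.com/Oscar-Delaney/safe_AI_papers/"
    "refs/heads/main/Automated%20categorization/final_output.csv"
)

df = pd.read_csv(CSV_URL)
print(df.columns.tolist())  # inspect the real column names first

# Hypothetical columns: "organization", "category", "title", "url".
gdm_interp = df[
    df["organization"].str.contains("DeepMind", case=False, na=False)
    & df["category"].str.contains("interp", case=False, na=False)
]
print(gdm_interp[["title", "url"]].to_string(index=False))
```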

Thanks for engaging with our work, Arthur! Perhaps I should have signposted this more clearly in the GitHub repo as well as the report, but the categories assigned by GPT-4o were not final; we reviewed its categories and made changes where necessary. The final categories we gave are available here. The discovering agents paper we put under 'safety by design', and the prover-verifier games paper we labelled 'enhancing human feedback'. (Though for some papers the best categorization may not be clear, e.g. if a paper touches on multiple safety research areas.)

If you have the links handy I would be interested in which GDM mech interp papers we missed, and I can look into where our methodologies went wrong.

Thanks for that list of papers/posts. Most of the papers you linked are not included because they did not feature in either of our search strategies: (1) titles containing specific keywords that we searched for on arXiv; (2) the paper is linked on the company's website. I agree this is a limitation of our methodology. We won't add these papers now, as that would be somewhat ad hoc and inconsistent between the companies.
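For concreteness, a rough sketch of what search strategy (1) could look like against the public arXiv API is below. The keyword list and date filtering here are illustrative assumptions, not the report's actual pipeline.

```python
# Rough sketch of search strategy (1): query the arXiv API for papers whose
# titles contain a safety-related keyword, then keep those submitted in the
# report's window (Jan 2022 - Jul 2024). Keywords are illustrative only.
import urllib.parse
import feedparser  # pip install feedparser

KEYWORDS = ["interpretability", "unlearning", "jailbreak"]  # illustrative, not the report's list
BASE = "http://export.arxiv.org/api/query"

for kw in KEYWORDS:
    query = urllib.parse.urlencode({
        "search_query": f'ti:"{kw}"',
        "start": 0,
        "max_results": 100,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    })
    feed = feedparser.parse(f"{BASE}?{query}")
    for entry in feed.entries:
        # entry.published looks like "2024-03-15T17:59:59Z"
        if "2022-01" <= entry.published[:7] <= "2024-07":
            print(entry.published[:10], entry.title.replace("\n", " "), entry.link)
```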

Re the blog posts from Anthropic and what counts as a paper: I agree this is a tricky demarcation problem. We included 'Circuit Updates' because it was linked as a 'paper' on the Anthropic website. Even if GDM has a higher bar than Anthropic for what counts as a 'paper', I don't think we really want to be adjudicating this, so I feel comfortable deferring to each company about what counts as a paper for them.


I would have found it helpful in your report for there to be a ROSES-type diagram or other flowchart showing the steps in your paper collation. This would bring it closer in line with other scoping reviews and would have made it easier to understand your methodology.

My main takeaway would be that this seems like quite strong evidence towards the view expressed in https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/, that most safety research doesn't come from the top labs.

How does this paper suggest "that most safety research doesn't come from the top labs"?

Indirectly, because 90 papers seems like a tiny number compared to what got published on arXiv during the same time interval. Depending on how one counts, I wouldn't be surprised if there were >90 papers from outside the labs even looking only at the unlearning category.

Counting the number of papers isn't going to be a good strategy.

I do think total research outside of labs looks competitive with research from labs, and probably research done outside of labs has produced more differential safety progress in total.

I also think open weight models are probably good so far in terms of making AI more likely to go well (putting aside norms and precedents of releases), though I don't think this is necessarily implied from "more research happens outside of labs".

"probably research done outside of labs has produced more differential safety progress in total"

To be clear, this statement is consistent with companies producing way more safety research than non-companies, if companies also produce even way more capabilities progress than non-companies? (Which I would've thought is the case, though I'm not well-informed. Not sure if "total research outside of labs looks competitive with research from labs" is meant to deny this possibility, or if you're only talking about safety research there.)

I'm just talking about research intended to be safety or safety-adjacent. As in: of this research, what has the quality-weighted differential safety progress been?

Probably the word "differential" was just a mistake.

Really interesting work! I have two questions:

1. In the 'model organisms of misalignment' section, it is stated that AI companies might be nervous about researching model organisms because it could increase the likelihood of new regulation, since it would provide more evidence of concerning properties in AI systems. Doesn't this depend on what kind of model organisms the company expects to be able to develop? If it's difficult to find model organisms, we would have evidence that alignment is easier, and thus there would be less need for regulation.

2. Why didn't you list AI control work as one of the areas that may be slow to progress without efforts from outside the labs? According to your incentives analysis, it doesn't seem like AI companies have many incentives to pursue this kind of work, and there were zero papers on AI control.