Automating AI Safety: What we can do today
There have been multiple recent calls for the automation of AI safety and alignment research. Many people would likely contribute to this space but would benefit from clear directions on how to do so. Drawing on a recent SPAR project, and in light of the limitations of current systems, we provide a brief list of concrete projects for improving the ability of current and near-future agentic coding LLMs to execute technical AI safety experiments. We expect each of these could be meaningfully developed as a short-term (1 week to 3 months) project. This is in no way intended to be a comprehensive list, and we strongly welcome additional project ideas in the comments.

Note: Due to our background and current research areas, the examples in this post focus on mechanistic interpretability research. However, the general techniques here should be applicable to other sub-areas of technical alignment and safety research.

Concrete Projects

These projects are largely focused on improving LLM usage of current software packages, and are roughly in order of increasing scope. We include initial pilot versions of some of these ideas.

Improving LLM Usage of Relevant Software Packages

Compiled Monofiles

As noted in a recent paper by METR, current AI systems can often struggle due to a lack of sufficient context, particularly for large and/or complex codebases. One way to provide extensive contextual information about a package to a coding agent is to convert it into a single large file. As suggested in Building AI Research Fleets, "More generally, consider migrating to monorepos and single sprawling Google Docs to make it easier for your AI systems to load in the necessary context." However, while actually migrating research code to monorepos may improve LLM comprehension to a degree, it also destroys organization that is useful both to human coders and to AIs. Alternatively, existing repositories can be converted to single large files, e.g. llms.txt, which can
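As a rough illustration of the idea, the following is a minimal sketch of such a conversion: it walks a repository and concatenates matching source files into one large text file, prefixing each file's contents with its relative path so an agent can still locate where code lives. The function name, path headers, and file-extension filter are our own choices for illustration, not part of any standard llms.txt tooling.

```python
from pathlib import Path

def build_monofile(repo_root: str, out_path: str, exts=(".py", ".md")) -> None:
    """Concatenate all matching files under repo_root into one large file,
    with a path header before each file so context stays navigable."""
    root = Path(repo_root)
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(root.rglob("*")):
            if path.is_file() and path.suffix in exts:
                rel = path.relative_to(root)
                out.write(f"\n===== {rel} =====\n")
                out.write(path.read_text(encoding="utf-8", errors="replace"))
                out.write("\n")

# Example usage (hypothetical paths):
# build_monofile("my_research_repo", "llms.txt")
```

Keeping the per-file path headers preserves some of the organizational structure that a naive concatenation would discard.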
