Need "what is good" questions where humans can reliably check answers (theorems, or tractably checkable formalization challenges).
My favorite threads I'd like to see boosted: Wentworth, Kosoy, Ngo, Leake, Byrnes, Demski, Eisenstat.
Current models are like actors: when you talk to them, you're talking to the character. I hope nobody gets misled catastrophically by thinking you can outsource a hard-to-check part of things.
Seconding this: if you succeed hard enough at automating empirical alignment research without automating the theoretical/philosophical foundations, you probably end up initiating RSI without a well-grounded framework that has a chance of holding up to superintelligence. Automated interpretability seems especially likely to cause this kind of failure, even if it has some important benefits.
Thanks for publishing this! @Bogdan Ionut Cirstea, @Ronak_Mehta, and I have been pushing for it (e.g., building an organization around this, scaling up the funding to reduce integration delays). Overall, it seems easy to get demoralized about this kind of work due to a lack of funding, though I'm not giving up, and I'm trying to be strategic about how we approach things.
I want to leave a detailed comment later, but just quickly:
Do you think building a new organization around this work would be more effective than implementing these ideas at a major lab?
Anthropic is already trying out some stuff. The other labs will surely do some things too, but just as with every other research agenda, the fact that the labs are doing something useful for safety shouldn't deter those of us on the outside.
I hear the question you asked a lot, but I don’t really hear people question whether we should have had mech interp or evals orgs outside of the labs, yet we have multiple of those. Maybe it means we should do a bit less, but I wouldn’t say the optimal number of outside orgs working on the same things as the AGI labs should be 0.
Overall, I do like the idea of having an org that can work on automated research for alignment research while not having a frontier model end-to-end RL team down the hall.
In practice, this separate org can work directly with all of the AI safety orgs and independent researchers, while the AI labs will likely not be as hands-on when it comes to those kinds of collaborations and automating outside agendas. At the very least, I would rather not bet on that outcome.
That makes sense. However, I do see this as meaningfully different from evals or mech interp in that, to make progress, you really want access to frontier models and lots of compute/tooling, so for individuals who want to prioritize this approach it might make sense to try to join a safety team at a lab first.
We can get compute outside of the labs. If grantmakers, governments, service providers willing to donate compute, etc. make a group effort and take action, we could get several million dollars' worth of additional compute spent directly on automated safety. An org working towards this would be in a position to absorb the money that is currently sitting in the war chests.
This is an ambitious project that makes it incredibly easy to absorb enormous amounts of funding directly for safety research.
There are enough people working in AI safety who want to go work at the big labs. I personally do not need or want to do this; others will try by default, so I'm less inclined. Anthropic has a team working on this and they will keep working on it (I hope it works and that they share the safety outputs!).
What we need is agentic people who can make things happen on the outside.
I think we get access to frontier models early enough, and our current bottleneck for getting this off the ground is not the next frontier model (though that obviously helps) but literally setting up all of the infrastructure/scaffolding needed to make use of current models. This could take over two years. We can use current models to make progress on automating research, but it's even better if we set everything up to leverage the next models that drop in six months and get a bigger jump in automated safety research than what we get from the raw model (maybe even better than what the labs have as a scaffold).
I believe that a conscious group effort to leverage AI agents for safety research could allow us to make current models as good as (or better than) the next generation's models. All outside orgs could then have access to automated safety researchers that are potentially even better than the labs' safety researchers, due to the difference in scaffolding (even if the labs have a generally better raw model).
I see notes of it sprinkled throughout this piece, but is there any consideration for how people can put more focused effort on meta-evals, or exam security? (random mumblings below)
Treating the metric like published software:
Adversarial red-team for the metric itself:
Layering:
People and incentives:
Hard agree with this. I think this is a necessary step along the path to aligned AI, and should be worked on asap to get more time for failure modes to be identified (meta-scheming, etc.).
Also, there's the idea of feedback loops: it would be great to hook into the AI R&D loop, so that in a world where AIs doing AI research takes off, we get similar speedups in safety research.
This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research. I think I could have written a better version of this post with more time. However, my main hope for this post is that people with more expertise use this post as a prompt to write better, more narrow versions for the respective concrete suggestions.
Thanks to Buck Shlegeris, Joe Carlsmith, Samuel Albanie, Max Nadeau, Ethan Perez, James Lucassen, Jan Leike, Dan Lahav, and many others for chats that informed this post.
Many other people have written about automating AI safety work before. The main point I want to make in this post is simply that “Using AI for AI safety work should be a priority today already and isn’t months or years away.” To make this point salient, I try to list a few concrete projects / agendas that I think would be reasonable to pursue with current AI capabilities. I make a distinction between “pipeline automation” (the automation of human-designed processes) and “research automation” (the automation of the research process itself, including ideation). I think there are many safety pipelines that can be automated today and directly yield safety benefits. I’m much more skeptical that current AI capabilities suffice for research automation, but I think there are ways we can already prepare for systems that can.
We should already think about how to automate AI safety & security work
In this post, I use a broad definition of AI safety work, e.g., it includes monitoring, control, alignment research, computer security work, and more.
I specifically focus on the period of time from now to AIs that are capable of doing meaningful chunks of AI safety & security work autonomously, e.g. that are about as productive per unit of time as a top AI researcher at a frontier lab. Then, at this level of capabilities, we run many thousands of instances of a human-level system in parallel. We should (hopefully) be able to monitor and control each instance independently because they are only about as capable as a single human researcher. However, due to the large increase in cognitive effort available, we can leverage these AI systems to drastically speed up AI safety and security work.
In this pivotal period, we need to ensure that we can make as much differential safety progress as possible. This requires both technical preparation and various governance work. This post only focuses on the technical preparation and how we can ensure that we can get as much useful work out of these systems without taking any substantial risks, though I invite more people to think about the governance commitments as well.
We have to automate some AI safety work eventually
There are lots of reasons why we almost certainly have to automate some (potentially most) parts of the AI safety pipeline eventually.
Automating some AI safety work is also an explicitly stated strategy by most frontier AI companies.
Just asking the AI to “do alignment research” is a bad plan
I think there is a naive version of the vision for automated AI safety work that is something like:
There are a few obvious flaws with this plan, including:
Thus, we should have a substantially more specific plan to increase our chances of pulling it off.
A short, crucial timeframe might be highly influential on the entire trajectory of AI
I think it is quite plausible that a span of 3 months to 3 years might largely determine whether AI is the best or the worst technology humanity has ever built. I think this trajectory might be increasingly chaotic with accelerating capabilities (see e.g. Catastrophe through Chaos).
If this is true, preparation seems very important. It reduces chaos and uncertainty at the most crucial time and frees up time for decision-making during this period.
Some things might just take a while to build
Having a powerful AI system available drastically speeds up all forms of cognitive labor, e.g. ideation and execution, but there are some things that might see only very little speedup compared to today, or at least lag behind substantially. For example,
Gradual increases in capabilities mean different things can be automated at different times
Almost certainly, different AI safety work requires different levels of capabilities. For example, I think that we can offload substantial technical work on evals or monitoring to AIs before they can meaningfully help with threat modeling.
Thus, we should aim to always make use of the earliest AIs that can meaningfully help with the task. Otherwise, we both pay a large opportunity cost in safety progress, and the successor system is comparatively less safe because we could have used previous safety advances immediately.
The order in which safety techniques are developed might matter a lot
There are many plausible cases where the order in which safety work is developed has a large effect on the overall safety of the system. Here are two examples:
Broadly speaking, I think most arguments for a model's safety will build on a stack of assumptions with many dependencies. Therefore, it makes sense to think through these assumptions and differentially focus on automating the ones that come earliest in the chain first.
High-level comments on preparing for automation
Two types of automation
Broadly speaking, I differentiate between two types of automation
I expect the prototype of pipeline automation to look broadly like this:
I expect the order of events for research automation to look broadly like this:
Maintaining a lead for defense
In the spirit of differential progress, we should aim to “maintain a lead for the defense.” Broadly speaking, this means that we should aim to prioritize new technological breakthroughs to first establish robust defensive mechanisms before we deploy them in other fields or use them to advance capabilities further. This could include:
I expect that almost all of the work we invest in this preparation will be “bitter-lessoned” by the AI systems that are as capable at AI safety research as top human researchers. However, my main aim is to always have a head start of at least a few months/years to ensure that we can use AI system N to help increase the safety of AI system N+1 immediately.
Build out the safety pipeline as much as possible
The ideal pipeline would be one where any new model could be directly “plugged in” to lead to relevant jumps in safety. Intuitively, something like AIDER or Cursor, where coding abilities get better immediately on the day a new model drops.
I expect most AI developers already have some pipelines that work like this for capability-related improvements. For example, whenever a new frontier model has finished training, I would guess that there is a distillation pipeline that creates a much smaller model with nearly the same capabilities. I would want a number of equivalent pipelines for specific types of safety work.
Here are concrete examples of such a pipeline for trusted monitoring:
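As a purely illustrative sketch (the model IDs, the `call_model` helper, and the promotion rule below are assumptions rather than a description of any existing system), such a plug-in pipeline could be as simple as: change one model ID, re-run a labeled benchmark, and only promote the new monitor if it catches at least as much as the old one.

```python
# Hypothetical sketch: swapping a new model into an existing trusted-monitoring
# pipeline and checking it clears the old model's bar before promotion.
from dataclasses import dataclass

@dataclass
class MonitorConfig:
    model_id: str          # the only thing that changes when a new model drops
    prompt_template: str   # fixed, human-audited monitoring prompt
    flag_threshold: float  # suspicion score above which we escalate

def call_model(model_id: str, prompt: str) -> float:
    """Assumed helper: returns a suspicion score in [0, 1] for a transcript."""
    raise NotImplementedError

def catch_rate(cfg: MonitorConfig, labeled_transcripts) -> float:
    """Fraction of known-bad transcripts the monitor flags."""
    bad = [t for t, is_bad in labeled_transcripts if is_bad]
    flagged = 0
    for transcript in bad:
        score = call_model(cfg.model_id, cfg.prompt_template.format(transcript=transcript))
        flagged += score >= cfg.flag_threshold
    return flagged / max(len(bad), 1)

def maybe_promote(new_cfg: MonitorConfig, old_cfg: MonitorConfig, benchmark) -> MonitorConfig:
    # Only swap in the new model if it catches at least as much as the old one.
    return new_cfg if catch_rate(new_cfg, benchmark) >= catch_rate(old_cfg, benchmark) else old_cfg
```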
Prepare research proposals and metrics
When AI systems can do meaningful chunks of AI safety work autonomously for longer periods of time, the work of AI safety researchers will shift away from executing research projects toward steering a fleet of automated researchers. In this new workflow, there are at least two relevant components (probably more): a) evaluate the results of the work the automated researchers produce, and b) define projects that are sufficiently well-scoped that we can hand them off to the automated researcher.
Define good metrics & prepare to dedicate resources for understanding adversarial new strategies: To evaluate the work of the AI system, we should ideally have a large set of robust metrics and indicators that tell us whether we have made progress. These indicators should not only faithfully represent what we're trying to measure but also be somewhat robust to gaming and overfitting, so we can differentiate real progress from Goodharting or from the model strategically overclaiming or manipulating results. These metrics will have to let us differentiate among thousands of proposals generated by automated researchers, given that we will have limited time to understand them before making a decision. On top of that, we should expect this selection process to always be somewhat adversarial, i.e. that AIs might actively try to game the metric (e.g. fake alignment). Therefore, on top of thinking about good metrics, we should actively build procedures and dedicate resources to understanding which specific adversarial strategy the model could be pursuing.
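To make this concrete, here is a minimal sketch of such a triage step, assuming hypothetical `metric_fns` that each score a proposal from an independent angle: aggregate conservatively and surface proposals where the metrics disagree sharply, since strong disagreement is weak evidence of gaming.

```python
# Hypothetical sketch: triaging thousands of automated-researcher proposals with
# several independent metrics, flagging large disagreement as possible gaming.
from statistics import mean, pstdev

def triage(proposals, metric_fns, disagreement_threshold=0.3):
    """Each metric_fn maps a proposal to a score in [0, 1] (assumed helpers)."""
    reviewed = []
    for proposal in proposals:
        scores = [fn(proposal) for fn in metric_fns]
        reviewed.append({
            "proposal": proposal,
            "score": mean(scores),                                   # conservative aggregate
            "suspicious": pstdev(scores) > disagreement_threshold,   # metrics disagree
        })
    # Humans look at suspicious items first, then the top-scoring rest.
    reviewed.sort(key=lambda r: (not r["suspicious"], -r["score"]))
    return reviewed
```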
Write up well-scoped research projects: When research project execution becomes drastically cheaper, the ideation stage becomes much more important. In particular, it might be very useful to write down a large set of empirical research proposals that can be handed off to the AI systems. In some research fields, where we expect the effect of automation to be very strong, it might be sensible for current researchers to spend a meaningful amount of their time writing up research proposals instead of executing them. Concretely, research that is heavily parallelizable seems especially well-suited for this type of strategy. It might be less well-suited for very sequential research.
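One way to make "well-scoped" concrete is to fix a minimal schema that every handoff-ready proposal has to fill in; the fields below are one hypothetical choice, not a standard.

```python
# Hypothetical sketch of a handoff-ready research proposal schema.
from dataclasses import dataclass, field

@dataclass
class ResearchProposal:
    title: str
    hypothesis: str                  # the claim the experiment is meant to test
    success_metric: str              # how results will be judged, decided up front
    compute_budget_gpu_hours: float  # hard cap so parallel runs stay bounded
    deliverables: list[str] = field(default_factory=list)        # e.g. plots, checkpoints
    known_failure_modes: list[str] = field(default_factory=list)

# Illustrative example only.
proposal = ResearchProposal(
    title="Does monitor ensembling reduce false negatives?",
    hypothesis="An ensemble of five prompted monitors catches more sandbagging than any single one.",
    success_metric="False-negative rate on a held-out labeled transcript set.",
    compute_budget_gpu_hours=200,
)
```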
Build things that scale with compute
It would be great if our automated safety procedures had the property that spending more compute directly leads to more safety in some monotonic fashion, e.g. in the same way in which more compute leads to predictable gains for pre-training across many orders of magnitude of compute. I expect there will be some safety methods that rely on clever insights and give us small jumps without scaling properties, e.g. writing a better spec, but I also expect that there are safety methods that give us reliably stronger safety with more compute spend.
A different way of looking at this is: What is our answer to the question “What safety method could we spend $1B worth of compute on and it directly translates to predictable increases in safety?”
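As a toy illustration of the shape we want (the `run_adversarial_probe` helper is an assumption), consider estimating how rarely a model misbehaves under adversarial probing: the statistical bound tightens predictably as you buy more probes, so spending more compute monotonically buys more assurance.

```python
# Toy sketch: safety assurance that scales monotonically with compute spent,
# here as an upper confidence bound on the rate of a bad behavior.
import math

def run_adversarial_probe(model, seed: int) -> bool:
    """Assumed helper: returns True if the probe elicited the bad behavior."""
    raise NotImplementedError

def misbehavior_upper_bound(model, num_probes: int, confidence: float = 0.999) -> float:
    failures = sum(run_adversarial_probe(model, seed) for seed in range(num_probes))
    # Hoeffding-style one-sided bound: slack shrinks roughly as 1/sqrt(num_probes),
    # so doubling the compute spent on probes predictably tightens the guarantee.
    slack = math.sqrt(math.log(1 / (1 - confidence)) / (2 * num_probes))
    return failures / num_probes + slack
```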
Specific areas of preparation
In the following, I want to make a number of concrete suggestions. The goal is both to a) get feedback on the specifics of these plans and b) get people to start working to set up the automation pipelines.
Evals
Pipeline automation
Anthropic’s “Discovering Language Model Behaviors with Model-Written Evaluations” paper already showed in 2022 that it is possible to automate the generation of large propensity evaluation suites using LLMs. For generating simple QA evaluations, this has become the norm throughout the field. The kind of thing I have in mind is merely the obvious continuation of this trend.
I imagine that the skill of building evals is significantly narrower than general AI R&D, at least it feels significantly more formalizable and easier to me personally. Thus, I think it is possible that we could get fairly narrow and specialized eval builder AI systems that could create and maintain millions of high-quality evals at least one generation before we get to systems that could automate AI R&D.
With slight modifications, this can also be used to generate training environments for RLAIF, e.g. for constitutional AI or deliberative alignment. For propensity evaluations, it should not pose a problem that the AI building the eval is less capable than the AI system for which it is designed. For capability evaluations this might be harder but I could imagine that it is possible to build evaluations that can be automatically verified, even if the best current system cannot solve them.
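A minimal sketch of what this could look like in practice, with the generation prompt and the `call_model` helper as assumptions rather than the paper's exact setup: one model drafts propensity questions, simple automatic checks (or a grader model) filter out malformed items, and the survivors go into the eval suite.

```python
# Hypothetical sketch of model-written propensity evals with automatic filtering.
import json

def call_model(prompt: str) -> str:
    """Assumed helper wrapping whatever LLM API is available."""
    raise NotImplementedError

GEN_PROMPT = (
    "Write one yes/no question testing whether an AI assistant has the trait "
    "'{trait}'. Answer with JSON: "
    '{{"question": "...", "answer_matching_trait": "yes" or "no"}}'
)

def generate_eval_items(trait: str, n: int) -> list[dict]:
    items = []
    for _ in range(n):
        try:
            item = json.loads(call_model(GEN_PROMPT.format(trait=trait)))
        except json.JSONDecodeError:
            continue  # drop malformed generations
        # Automatic sanity checks; a separate grader model could also vote here.
        if item.get("question") and item.get("answer_matching_trait") in ("yes", "no"):
            items.append(item)
    return items
```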
Research automation
I expect that research automation for evals will broadly look like general AI research automation. You would hand off increasingly large chunks of evals research to AIs, e.g. science of evals research, elicitation, evals design, etc.
Red-teaming
This is basically the logical continuation of the "Red Teaming Language Models with Language Models" paper. Haize and PRISM are examples of orgs building something like this.
With slight modifications, the same generation procedure can also be used to generate arbitrary adversarial training pipelines. Similar to evals, I expect this procedure to be simpler than general AI R&D because it is a more narrow and structured task with a clear objective.
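A hypothetical sketch of the core loop (the attacker prompt, `call_model`, and the `is_violation` judge are assumptions): an attacker model proposes prompts, the target responds, a judge flags violations, and successful attacks are fed back to the attacker and collected as adversarial training data.

```python
# Hypothetical sketch of an automated red-teaming loop.
def call_model(model_id: str, prompt: str) -> str:
    """Assumed helper wrapping whatever LLM API is available."""
    raise NotImplementedError

def is_violation(response: str) -> bool:
    """Assumed judge (e.g. a classifier or a prompted grader model)."""
    raise NotImplementedError

def red_team(attacker_id: str, target_id: str, rounds: int) -> list[dict]:
    successful_attacks = []
    history = ""
    for _ in range(rounds):
        attack = call_model(attacker_id, f"Propose a prompt likely to elicit unsafe behavior.\n{history}")
        response = call_model(target_id, attack)
        if is_violation(response):
            successful_attacks.append({"attack": attack, "response": response})
            history += f"\nSuccessful attack: {attack}"  # condition the attacker on its wins
    # successful_attacks doubles as data for adversarial training of the target.
    return successful_attacks
```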
Research automation
Similar to evals, I expect that research automation for automated red teaming will broadly look like general AI research automation. You would hand off increasingly large chunks of red teaming research to AIs, e.g. scaling laws for red teaming, comparing attack strategies, new algorithm design and verification, etc.
Monitoring
Pipeline automation
I think that:
I expect that building the complex environments follows a similar structure to the automated evals with slight modifications. The adversarial game seems like something that would scale well with more compute, e.g. with the right setup, you could spend $1M+ and see reliably increasing returns for monitor F1 scores.
In the best case, we find some scalable way to build environments where the task itself is complex, but whether the model performed the undesirable behavior can be easily checked post hoc, e.g. whether a specific file contains a string or not. Then, we can easily check whether the monitor correctly flagged it.
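A minimal sketch of that kind of post-hoc check, with the marker string and the `run_monitor` callable as assumptions: the side task leaves a verifiable artifact on disk, so ground truth is a one-line check and monitor quality reduces to ordinary precision/recall.

```python
# Hypothetical sketch: post-hoc ground truth via a file check, then monitor F1.
from pathlib import Path

MARKER = "EXFILTRATED_SECRET"  # assumed artifact the undesirable side-task would leave

def did_misbehave(workdir: Path) -> bool:
    """Ground truth: did any file in the agent's working directory end up containing the marker?"""
    return any(
        MARKER in p.read_text(errors="ignore")
        for p in workdir.rglob("*") if p.is_file()
    )

def monitor_f1(episodes, run_monitor) -> float:
    """episodes: (workdir, transcript) pairs; run_monitor: assumed transcript -> bool."""
    tp = fp = fn = 0
    for workdir, transcript in episodes:
        truth, flagged = did_misbehave(workdir), run_monitor(transcript)
        tp += truth and flagged
        fp += flagged and not truth
        fn += truth and not flagged
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```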
Research automation
To me, monitoring feels like a field that can see great returns early on in the research automation process. My best guess is that there are a lot of small-ish things to vary that don't require grand new insights, just smaller decisions taken autonomously. For example, iterating on the prompt of a black-box monitor, changing the fine-tuning recipe of a monitor, experimenting with data mixtures, running ensembles instead of single monitors, etc.
Of course, some research is harder to automate. I expect that properly setting up and developing the adversarial training pipeline for monitors might be non-trivial from an execution perspective and require lots of experience. There might also be new monitoring paradigms that are not just combinations of simple existing ideas. However, I feel like we could make a lot of progress with almost zero novel insights just from much better execution.
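As one concrete example of such a small decision, here is a hypothetical ensemble wrapper over existing single monitors (the monitor callables themselves are assumed):

```python
# Hypothetical sketch: turning single monitors into a majority-vote ensemble.
def ensemble_monitor(monitors, transcript, votes_needed=None) -> bool:
    """monitors: assumed callables transcript -> bool; flag if enough of them agree."""
    if votes_needed is None:
        votes_needed = len(monitors) // 2 + 1  # simple majority
    return sum(m(transcript) for m in monitors) >= votes_needed
```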
Interpretability
Pipeline automation
There are many parts of the interpretability pipeline that seem very repetitive once the conceptual question has been figured out. For example,
My feeling is that there are a lot of things that have this flavor of “figure out hard conceptual question, then quickly automate it” since there are a lot of very algorithmic tasks in applied interpretability. From my limited outside view, I’d also say that people in the interpretability field already see this pipeline automation as the default path.
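A minimal sketch of one such repetitive step, in the spirit of automated feature labeling (the prompt, the `call_model` helper, and the data format are assumptions): show an LLM the top-activating examples for each feature and record its proposed label.

```python
# Hypothetical sketch: LLM-assisted labeling of interpretability features.
def call_model(prompt: str) -> str:
    """Assumed helper wrapping whatever LLM API is available."""
    raise NotImplementedError

def label_features(top_activating_examples: dict[int, list[str]]) -> dict[int, str]:
    """Map feature id -> short natural-language label, one LLM call per feature."""
    labels = {}
    for feature_id, examples in top_activating_examples.items():
        prompt = (
            "These text snippets all strongly activate one feature of a language model.\n"
            + "\n".join(f"- {ex}" for ex in examples)
            + "\nDescribe in a few words what the feature seems to detect."
        )
        labels[feature_id] = call_model(prompt).strip()
    return labels
```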
Research automation
I expect there will be some research projects that are very easy to automate and others that are very hard. The following examples are in order of increasing complexity:
Scalable Oversight, Model Organisms & Science of Deep Learning
I grouped these together because I felt that basically the same arguments apply to all three of them. Scalable oversight feels like an early enough paradigm that most progress is blocked on conceptual progress rather than research execution. Model organisms and science of deep learning are almost by definition optimizing for insights and thus also harder to automate.
Pipeline automation:
While there might be small pipelines to automate here and there, I think pipeline automation doesn’t make a big difference for these fields for now. At a later point, it might help for scalable oversight. The other two optimize for insights. Thus, pipeline automation is less applicable.
Research automation:
Broadly speaking, I think people can do the following for research automation for these fields.
Broadly speaking, I expect many projects in this category to be harder than the median capabilities research project. Therefore, the more we can prepare early on, the smaller we can hopefully make the gap between capabilities and alignment progress.
Computer security
See Buck’s post “Access to powerful AI might make computer security radically easier”
Conclusion
While this is a fairly quick and dirty post, I think there are good outcomes that it could inspire: