Congratulations on launching!
On the governance side, one question I'd be excited to see Apollo (and ARC evals & any other similar groups) think/write about is: what happens after a dangerous capability eval goes off?
Of course, the actual answer will be shaped by the particular climate/culture/zeitgeist/policy window/lab factors that are impossible to fully predict in advance.
But my impression is that this question is relatively neglected, and I wouldn't be surprised if sharp newcomers were able to meaningfully improve the community's thinking on this.
Thanks Akash!
I agree that this feels neglected.
Markus Anderljung recently tweeted about some upcoming related work from Jide Alaga and Jonas Schuett: https://twitter.com/Manderljung/status/1663700498288115712
Looking forward to it coming out!
Are you mainly interested in evaluating deceptive capabilities? I.e., no-holds-barred, can you elicit competent deception (or sub-components of deception) from the model? (Including by eg fine-tuning on data that demonstrates deception or sub-capabilities.)
Or evaluating inductive biases towards deception? I.e. testing whether the model is inclined towards deception in cases when the training data didn't necessarily require deceptive behavior.
(The latter might need to leverage some amount of capability evaluation, to distinguish not being inclined towards deception from not being capable of deception. But I don't think the reverse is true.)
Or if you disagree with that way of cutting up the space.
All of the above but in a specific order.
1. Test if the model has components of deceptive capabilities with lots of handholding with behavioral evals and fine-tuning.
2. Test if the model has more general deceptive capabilities (i.e. not just components) with lots of handholding with behavioral evals and fine-tuning.
3. Do less and less handholding for 1 and 2. See if the model still shows deception.
4. Try to understand the inductive biases for deception, i.e. which training methods lead to more strategic deception. Try to answer questions such as: can we change training data, technique, order of fine-tuning approaches, etc. such that the models are less deceptive?
5. Use 1-4 to reduce the chance of labs deploying deceptive models in the wild.
The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?
Seems great! I'm excited about potential interpretability methods for detecting deception.
I think you're right about the current trade-offs on the gain of function stuff, but it's good to think ahead and have precommitments for the conditions under which your strategies there should change.
It may be hard to find evals for deception which are sufficiently convincing when they trigger, yet still give us enough time to react afterwards. A few more similar points here: https://www.lesswrong.com/posts/pckLdSgYWJ38NBFf8/?commentId=8qSAaFJXcmNhtC8am
Building good tools for detecting deceptive alignment seems robustly good though, even after you reach a point where you have to drop the gain of function stuff.
This is a very exciting project! I'm particularly glad to see two features: (i) the focus on "deception", which undergirds much existential risk but has arguably been less of a focal point than "agency", "optimization", "inner misalignment", and other related concepts, (ii) the ability to widen the bottleneck of upskilling novice AI safety researchers who have, say, 500 hours of experience through the AI Safety Fundamentals course but need mentorship and support to make their own meaningful research contributions.
Thanks for the posting the announcement.
We think that strategic AI deception – where a model outwardly seems aligned but is in fact misaligned – is a crucial step in many major catastrophic AI risk scenarios and that detecting deception in real-world models is the most important and tractable step to addressing this problem.
Can you elaborate on why the team believes it's the most important and most tractable step?
TL;DR
$4-6M$7-10M* in addition to current funding levels (reach out if you are interested in donating). We are currently fiscally sponsored by Rethink Priorities.*Updated June 4th after re-adjusting our hiring trajectory
Research Agenda
We believe that AI deception – where a model outwardly seems aligned but is in fact misaligned and conceals this fact from human oversight – is a crucial component of many catastrophic risk scenarios from AI (see here for more). We also think that detecting/measuring deception is causally upstream of many potential solutions. For example, having good detection tools enables higher quality and safer feedback loops for empirical alignment approaches, enables us to point to concrete failure modes for lawmakers and the wider public, and provides evidence to AGI labs whether the models they are developing or deploying are deceptively misaligned.
Ultimately, we aim to develop a holistic and far-ranging suite of deception evals that includes behavioral tests, fine-tuning, and interpretability-based approaches. Unfortunately, we think that interpretability is not yet at the stage where it can be used effectively on state-of-the-art models. Therefore, we have split the agenda into an interpretability research arm and a behavioral evals arm. We aim to eventually combine interpretability and behavioral evals into a comprehensive model evaluation suite.
On the interpretability side, we are currently working on a new unsupervised approach and continuing work on an existing approach to attack the problem of superposition. Early experiments have shown promising results, but it is too early to tell if the techniques work robustly or are scalable to larger models. Our main priority, for now, is to scale up the experiments and ‘fail fast’ so we can either double down or cut our losses. Furthermore, we are interested in benchmarking interpretability techniques by testing whether given tools meet specific requirements (e.g. relationships found by the tool successfully predict causal interventions on those variables) or solve specific challenges such as discovering backdoors and reverse engineering known algorithms encoded in network weights.
On the model evaluations side, we want to build a large and robust eval suite to test models for deceptive capabilities. Concretely, we intend to break down deception into its component concepts and capabilities. We will then design a large range of experiments and evaluations to measure both the component concepts as well as deception holistically. We aim to start running eval experiments and set up pilot projects with labs as soon as possible to get early empirical feedback on our approach.
Plans beyond technical research
As an evals research org, we intend to put our research into practice by engaging directly in auditing and governance efforts. This means we aim to work with AGI labs to reduce the chance that they develop or deploy deceptively misaligned models. The details of this transition depend a lot on our research progress and our level of access to frontier models. We expect that sufficiently capable models will be able to fool all behavioral evaluations and thus some degree of ‘white box’ access will prove necessary. We aim to work with labs and regulators to build technical and institutional frameworks wherein labs can securely provide sufficient access without undue risk to intellectual property.
On the governance side, we want to use our technical expertise in auditing, model evaluations, and interpretability to inform the public and lawmakers. We are interested in demonstrating the capacity of models for dangerous capabilities and the feasibility of using evaluation and auditing techniques to detect them. We think that showcasing dangerous capabilities in controlled settings makes it easier for the ML community, lawmakers, and the wider public to understand the concerns of the AI safety community. We emphasize that we will only demonstrate such capabilities if it can be done safely in controlled settings. Showcasing the feasibility of using model evaluations or auditing techniques to prevent potential harms increases the ability of lawmakers to create adequate regulation.
We want to collaborate with independent researchers, technical alignment organizations, AI governance organizations, and the wider ML community. If you are (potentially) interested in collaborating with us, please reach out.
Theory of change
We aim to achieve a positive impact on multiple levels:
We do not think that our approach alone could yield safe AGI. Our work primarily aims to detect deceptive unaligned AI systems and prevent them from being developed and deployed. The technical alignment problem still needs to be solved. The best case for strong auditing and evaluation methods is that it can convert a ‘one-shot’ alignment problem into a many-shot problem where it becomes feasible to iterate on technical alignment methods in an environment of relative safety.
Status
We have received sufficient starter funding to get us off the ground. However, we estimate that we have a $1.4M funding gap for the first year of operations and could effectively use an additional $7-10M* in total funding. If you are interested in funding us, please reach out. We are happy to address any questions and concerns. We currently pay lower than competitive salaries but intend to increase them as we grow to attract and retain talent.
We are currently fiscally sponsored by Rethink Priorities but intend to spin out after 6-12 months. The exact legal structure is not yet determined, and we are considering both fully non-profit models as well as limited for-profit entities such as public benefit corporations. Whether we will attempt the limited for-profit route depends on the availability of philanthropic funding and whether we think there is a monetizable product that increases safety. Potential routes to monetization would be for-profit auditing or red-teaming services and interpretability tooling, but we are wary of the potentially misaligned incentives of this path. In an optimal world, we would be fully funded by philanthropic or public sources to ensure maximal alignment between financial incentives and safety.
Our starting members include:
Beren Millidge(left on good terms to pursue a different opportunity)FAQ
How is our approach different from ARC evals?
There are a couple of technical and strategic differences:
We think our ‘narrow and deep’ approach and ARC’s ‘broad and less deep’ approach are complementary strategies. Even if we had no distinguishing features from ARC Evals other than being a different team, we still would deem it net positive to have multiple somewhat uncorrelated evaluation teams.
When will we start hiring?
We are starting with an unusually large team. We expect this to work well because many of us have worked together previously, and we all agree on this fairly concrete agenda. However, we still think it is wise to take a few months to consolidate before growing further.
We think our agenda is primarily bottlenecked by engineering and hands-on research capacity rather than conceptual questions. Furthermore, we think we have the management capacity to onboard additional people. We are thus heavily bottlenecked by funding at the moment and it is unclear when and how many people we can hire in the near future. If this bottleneck is resolved we plan to start hiring soon.
We have an expression of interest form for potential applicants. You can add your name and we will inform you when we open a hiring round. We might also reach out individually to researchers who are a great fit for collaborations.
Do we ever plan to be a for-profit organization?
This depends on a lot of factors and we have not made any final decisions. In the case where we take a constrained for-profit route, we would legally ensure that we are not obligated to maximize profit and carefully select the donors and investors we work with to make sure they share our AI safety goals and understand our mission. We are currently unsure whether the mission of reducing catastrophic risks from AI can be fully compatible with a for-profit setup. We think offering auditing or red-teaming services or providing interpretability tools are candidates for monetizable strategies that align with reducing catastrophic risks but trying to maximize profits from these strategies introduces obvious perverse incentives which we need to think carefully about how to mitigate.
Isn’t this research dangerous?
Some people have argued that behavioral evals that investigate dangerous capabilities could be a cause of risk in itself, e.g. that we accidentally create a dangerous deceptive model through our efforts or that we create a public blueprint for others to create one. We think this is a plausible concern. We have two main considerations.
We are also aware that good interpretability research might eventually run the risk of improving capabilities. We have thought a considerable amount about this in the past and are making concrete plans to mitigate the risks. Overall, however, we think that current interpretability research is strongly net positive for safety in expectation.