Coauthored. Thanks to Holden Karnofsky, Misha Yagudin, Adam Bales, Michael Aird, and Sam Clarke for feedback. All views expressed are our own.
Introduction
Background
Holden Karnofsky recently published a series on AI strategy “nearcasting”: What would it look like if we developed Transformative AI (TAI) soon? How should Magma, the hypothetical company that develops TAI, align and deploy it? In this post we focus on “How might we align transformative AI if it’s developed very soon?”
The nearcast setup is:
Summary
For time-constrained readers, we think the most important sections are:
We discuss Magma’s goals and strategy. The discussion should be useful both for readers who are familiar with Karnofsky’s post and for those who are not, and can be read as a summary, clarification, and expansion of that post.
We highlight potential disagreements with Karnofsky, including:
We discuss the chance that Magma would succeed:
We conclude by proposing questions for further research, including the likelihood of the assumptions of the nearcast and how Magma’s strategy would change if some assumptions were varied. More
Motivation
We decided to extend Karnofsky’s nearcasting discussion, as we think nearcasting-style scenario planning can help reach strategic clarity. By making AI alignment and deployment scenarios explicit, we may be able to better prioritize research agendas and policy interventions. The probability of misalignment can also be informed by evaluating the likelihood of such stories leading to alignment.
However, each individual story is unlikely to reflect reality, and there is a risk that such stories are taken too seriously.
We would regret it if this document were mistakenly perceived as a promising plan for significantly reducing AI risk. In particular, we expect many of the suggested strategies not to work out.
Magma’s predicament
This section summarizes Karnofsky.
Magma must balance 2 sources of existential risk from misaligned AI:
Magma’s goals
Karnofsky describes the 3-step plan for Magma as:
Elsewhere, Karnofsky argued that the competition frame (it matters greatly who develops TAI/AGI first) is overrated compared to the caution frame (it is important to coordinate to avoid competitive pressure). Taking the caution frame seriously may change Magma's goals. Hence, an alternative plan for Magma may be:
If feasible, the alternative plan may well be better.[5] We think that Karnofsky likely does not actually disagree with this framing - instead, we are extending his framework to make this option more explicit.
Next we expand on why Magma might want to implement the alternative plan.
Alignment of AIs is a matter of degree. We understand any desirable property to be a matter of degree. We should aim not just to avoid sufficiently misaligned AIs but also to get the most aligned AIs possible. For example, some degrees of honesty:
Alignment of AIs is stochastic. The amount of resources allocated to alignment work, testing, adversarial training, etc. should influence how aligned AIs are. However, substantial stochasticity (i.e. “luck”) might also be involved, e.g. via the quasi-random decision to trust a specific alignment research agenda. Therefore, firms with less safety investment will sometimes produce AIs that are more aligned.
As a result, the firm with the most safety investment may not have the most aligned AIs. Suppose firm 1 has a 10% chance of developing ambitiously aligned AI, and firm 2 has a 4% chance. Firm 1 tests the alignment of its system after the alignment work has been completed. In more than 50% of cases, the testing leads firm 1 to update to P(alignment) < 4%. Therefore, there is a >50% chance that after testing firm 1 prefers to give firm 2 the opportunity to develop AGI. We discuss this intuition more in the appendix.
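Here is a minimal Bayesian sketch of that intuition. The test accuracy numbers (an 80% chance that an aligned AI passes and a 60% chance that a misaligned AI fails) are our own illustrative assumptions, not estimates.

```python
prior_firm1 = 0.10   # firm 1's chance of having built an ambitiously aligned AI
prior_firm2 = 0.04   # firm 2's chance

p_pass_given_aligned = 0.80     # test sensitivity (assumed)
p_fail_given_misaligned = 0.60  # test specificity (assumed)

# Probability that firm 1's test comes back concerning ("fail")
p_fail = (1 - p_pass_given_aligned) * prior_firm1 \
         + p_fail_given_misaligned * (1 - prior_firm1)

# Firm 1's posterior P(aligned) after a concerning result (Bayes' rule)
posterior_after_fail = (1 - p_pass_given_aligned) * prior_firm1 / p_fail

print(f"P(concerning test result)      = {p_fail:.2f}")                 # 0.56
print(f"P(aligned | concerning result) = {posterior_after_fail:.3f}")   # ~0.036
print(posterior_after_fail < prior_firm2)                               # True

# In ~56% of worlds, firm 1's posterior (~3.6%) drops below firm 2's prior (4%),
# so after testing firm 1 would prefer to defer to firm 2.
```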
Hence, cooperative agreements to deploy the most aligned AIs might be very valuable (similar to attempts to stop misaligned AIs).[6]
Lastly, as Karnofsky also acknowledges, it might be optimal for Magma not to deploy its AI (for various reasons). Since most decision-makers are likely to be influenced by prestige, we expect the pull toward deploying the AIs to be hard to resist. Increasing the prestige of not deploying potentially dangerous models could be particularly helpful; see Barnhart 2022.
Ensuring advanced AIs have desirable properties
Defining and prioritizing desirability of advanced AIs
Defining desirability
We define the desirability of an advanced AI as the extent to which it helps Magma achieve its goals, while avoiding negative outcomes.
Desirability is relative to what the AI is used for: e.g. aligning more powerful AIs, improving cybersecurity, advising Magma, or broadly shaping the future.
We decompose desirability into:
Which properties are desirable?
Extending Karnofsky’s framework (we add (1) differential capability and (6) robustness), we describe properties of advanced AIs that increase desirability when applied for Magma’s goals:
Prioritizing between desirable properties
It seems important to prioritize among these desirable properties given:
Proposals for desirable advanced AIs
At face value, all desirable properties seem good. However, some proposals for desirable advanced AIs require only smaller subsets of these properties, for example:
These could be end states or intermediate states. Some might think we have to aim for ambitiously value-aligned AI as the end state, with the others serving as intermediate states that help us get there; e.g. an oracle AI could help with alignment research. It seems useful for Magma to think about which states it is aiming for, and to backchain from there to figure out which properties are most important.
An overview of 11 proposals for building safe advanced AI features a similar discussion, though it focuses primarily on how to ensure desirability rather than possible combinations of desirable properties.
Hypotheses regarding how to approach prioritization
We describe some exploratory hypotheses regarding how Magma should approach prioritizing between desirable properties.
For less advanced AIs:[11] there are likely to be tradeoffs between differential capability and other desirable properties, and it might not always be worth sacrificing important capabilities. For example, if we have to make an AI 100x less capable to make it 2x more legible, this might be a bad deal, as the AI would be much less helpful for Magma’s goals. However, in practice all existing labs are likely underinvesting in safety so heavily that this consideration changes little.
For more advanced AIs, some proposals for desirable advanced AIs don’t require all of the properties, so we should consider aiming for specific combinations. Other properties should be recognized as only instrumental. For instance, if Magma aims for ambitiously value-aligned AIs, honesty and legibility are only useful insofar as they are instrumentally helpful for getting robust value-alignment and capabilities.
Properties may even be undesirable in some cases. For instance, if the AI is ambitiously value-aligned, there may be cases where it is useful for it to be dishonest to its supervisor for the supervisor’s own good. Even if the AI isn’t ambitiously value-aligned, it may be useful for it to be dishonest to unaligned AIs or humans. Hence, honesty is only instrumental to ambitious value-alignment.
Strategies for increasing desirability
To increase the desirability of AIs, Magma can:
Importantly, Magma can apply AIs for assistance with the above strategies.
Increase desirable properties via training and testing
Magma might use 2 strategies (a subset of Karnofsky’s “key facets”) to make the AI have more desirable properties:
Interpret and manipulate AIs’ internal states
This section contains very similar content to Karnofsky’s, and is mostly just re-written.
Interpreting AIs (e.g. this line of work to understand the cognition of transformers) and manipulating AIs to be more desirable (e.g. this work editing factual associations in transformers) may be quite important. This requires transforming the billions of weights in a neural network into human-understandable concepts such as "deception" and "power-seeking".
Better interpretability and manipulation can aid in:
The most promising tools can plausibly be scaled up to interpret high-level concepts (“Will the AI execute a treacherous turn?”) in very advanced AIs. See also a transparency and interpretability tech tree for Evan Hubinger’s perspective on how various levels of interpretability and manipulation may be helpful.
Restrict the possibility of undesirable actions
AIs can also be deployed in contexts that externally defend against undesirable actions.
Possible defenses include:
Apply (often limited) AIs for assistance
Limiting AIs
Magma can limit their AIs in various ways to increase the desirable properties. However, limiting the AI system to reduce deception, illegibility, collusion, etc. often trades off with differential capabilities.
We propose an exploratory, non-exhaustive, non-disjoint categorization of ways to limit AIs expanding on Karnofsky’s suggested limitations:
Perhaps several of the limitations are only promising in combination. For example, AIs limited only by myopia may not be particularly promising (for reasons given in Arguments against myopic training), but making narrow or capability-limited AIs myopic may be useful, as it makes deception and collusion less likely.
Apply AIs for Magma’s goals
Magma’s AIs can improve and test the desirability of other AIs via (non-exhaustively):
Ways AIs might help accomplish Magma’s other goals include (non-exhaustively):
Difficulties of using AIs for AI desirability
Karnofsky pointed out potential difficulties.
Karnofsky’s difficulties:
Clarifying advanced collusion
Karnofsky discusses that applying AIs to make other AIs more desirable, e.g. as AI checks and balances, may fail because the AIs collude with each other. We expand on how this failure mode could arise and how it could be avoided. We outline and explain properties of the AIs, and strategies available to Magma, that might make collusion less likely. We then discuss the remaining chance of collusion conditional on different assumptions about the AIs and list topics for future investigation.[13]
As background, we assume that collusion is de facto the same as cooperation and that AIs are acting as utility-maximisers.[14]
Properties for limiting collusion
It’s unclear how tractable it will be for Magma to ensure these properties, e.g. enforce AIs having indexical goals.
Strategies
The supervisors, i.e. Magma’s employees, can make collusion less likely:
It’s unclear which of these strategies are most promising.
Chance of collusion
For tractability, we condition scenarios based on whether all AIs will have indexical or non-indexical goals and whether AIs will have acausal ways of cooperating with each other. We could have instead conditioned on communication channels that allowed for causal cooperation. This choice is meant to exemplify one way to break down the chance of collusion based on important factors.
The lighter the color in the table, the lower the relative risk of collusion. We refrained from trying to estimate the absolute level of collusion as this is a shallow, uncertain overview. We list which strategies we’d tentatively guess are most promising in each case.
| | AIs act according to causal decision theory | AIs only follow evidential decision theory with similar AIs | AIs follow evidential decision theory for all AIs |
| --- | --- | --- | --- |
| Indexical goals | Applicable strategies: | Applicable strategies: | Applicable strategies: |
| Non-indexical goals | Applicable strategies: | Applicable strategies: | No strategies seem particularly promising to us for preventing advanced collusion without limiting AIs’ capabilities. |
Charlotte and Eli have different intuitions about whether AIs would act as if evidential or causal decision theory is correct.[19]
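To make the decision-theory distinction concrete, here is a minimal sketch of a one-shot collusion game between two checks-and-balances AIs. The payoffs and correlation numbers are arbitrary illustrative assumptions.

```python
# Each AI chooses to COLLUDE (cover for the other) or REPORT it.
# Payoffs form a standard prisoner's-dilemma-style example (assumed values).
payoff = {  # (my action, other's action) -> my payoff
    ("collude", "collude"): 3,
    ("collude", "report"):  0,
    ("report",  "collude"): 5,
    ("report",  "report"):  1,
}

def expected_payoff(my_action, p_other_colludes):
    return (p_other_colludes * payoff[(my_action, "collude")]
            + (1 - p_other_colludes) * payoff[(my_action, "report")])

# CDT-style reasoning: the other's action is treated as independent of my choice,
# so the same probability is used for both options -- reporting dominates.
print("CDT:", expected_payoff("collude", 0.5), expected_payoff("report", 0.5))    # 1.5 vs 3.0

# EDT-style reasoning with a near-copy: conditioning on my own choice shifts
# my prediction of the other's choice -- colluding now looks better.
print("EDT:", expected_payoff("collude", 0.95), expected_payoff("report", 0.05))  # 2.85 vs 1.2
```

This is why the rightmost column of the table looks most worrying to us: agents that treat their own choices as evidence about each other’s choices can collude even without any communication channel.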
Open questions
The likelihood of collusion is an important open question. It may be further broken down into:
The likelihood of collusion for AIs that aren’t utility maximisers but rather acting according to heuristics, and what strategies can reduce collusion in this case, is another topic for further investigation.[20]
Testing desirability
Methods for testing desirability
Karnofsky discusses ways to test the desirability and, specifically, value-alignment of AIs by undertaking audits and threat assessments which he describes as “intense and randomized”.
For instance, one can give the AI the opportunity to seize control of its virtual environment and see if it pounces on the opportunity. Similarly, one can give AIs superpowers to learn about and shape their goals. Such audits are related to out-of-distribution training and testing.
The overall risk should also be assessed, for example by a team of forecasters with strong track records working together with domain experts.
Clarifying intense and randomized audits
Karnofsky proposes to run intense and randomized AI audits.
We understand intense testing to mean that (a) testing takes up a significant portion of overall resources, and (b) it’s more valuable to test a few scenarios or inputs very thoroughly (“intensely”) than to spread out testing efforts evenly over many scenarios or inputs.
Randomized testing means that parts of the testing procedure are randomized to make it harder for the AI to respond strategically.
Overall, we tentatively think intense testing is plausibly the right strategy to get more information about hard-to-catch issues. We are unsure whether the randomization property improves testing. Treacherous turns in the wild describes a very simple (simulated) organism that adapts to randomized tests that are intended to constrain it. However, an important feature of this setup is that there are very many organisms interacting in an evolutionary training process, so the probabilistic 'play dead' strategy has a high probability of working. Magma may not train its AIs in an evolutionary manner, with many AIs interacting. That being said, AIs with advanced capabilities seem likely to be even more effective at combating randomization.
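As a rough illustration of why the evolutionary setting matters, here is a sketch with made-up numbers, loosely inspired by that setup rather than reproducing it.

```python
import random

population = 10_000            # many organisms / model variants under selection
n_random_audits = 5            # each individual faces a few randomized audits
p_behave_during_audit = 0.5    # probabilistic "play dead": behave well half the time

random.seed(0)
survivors = sum(
    all(random.random() < p_behave_during_audit for _ in range(n_random_audits))
    for _ in range(population)
)

# Each individual passes all audits with probability 0.5**5 ~= 3%, so randomized
# testing catches almost every individual. But across a large population, roughly
# 300 "play dead" individuals slip through selection and can defect afterwards.
print(survivors)
```

A single AI trained by Magma would not benefit from this population-level dynamic in the same way, which is part of why the analogy to Magma’s situation seems limited.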
Value of testing desirability
Magma can use testing to learn about:
Potential disagreements with Karnofsky
We are calling this section "potential" disagreements primarily because we are unsure of Karnofsky's exact position rather than because our own view is tentative (which it is, but disagreements between tentative positions are useful to explore!).
Disincentivizing deceptive behavior likely requires more than a small chance of catching it
Karnofsky writes that
We aren’t yet convinced that a small chance of catching deceptive behavior by itself might make deception much less likely. We argue that in addition to having a small chance of catching deceptive behavior, the AI’s supervisor needs to be capable enough to (a) distinguish between easy-to-catch and hard-to-catch deceptive behaviors and (b) attain a very low “false positive rate” of harshly penalizing non-deceptive behaviors. The AI may also need to be inner-aligned, i.e. intrinsically motivated by the reward.
We are responding to Karnofsky's discussion of the ease of incentivizing deceptive behavior and to Karnofsky's proposed skewed reinforcement strategy: giving especially large negative reinforcement when systems engage in particular kinds of unintended behaviors that are harder to detect.
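As a rough expected-value sketch of why this is demanding, consider the following toy numbers (the gain, detection probability, and false-positive rate are our own illustrative assumptions).

```python
gain_from_deception = 1.0     # reward gained by a deceptive action, if undetected
p_catch_hard_case   = 0.01    # hard-to-catch deception is rarely detected (assumed)
false_positive_rate = 0.001   # honest behavior mislabeled as deception (assumed)

# To make the deception unattractive in expectation, the penalty must satisfy
# p_catch * penalty > gain, so:
required_penalty = gain_from_deception / p_catch_hard_case             # 100x the gain

# But honest behavior then also carries an expected penalty:
expected_penalty_when_honest = false_positive_rate * required_penalty  # 0.1

print(required_penalty, expected_penalty_when_honest)
```

With a 1% catch rate, the penalty has to be roughly 100x the gain from deception, and even a 0.1% false-positive rate then imposes a noticeable expected cost on honest behavior. This is why we think the supervisor needs both to identify which behaviors are hard to catch (to skew the penalty appropriately) and to keep false positives very rare.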
What is required?
How would it work?
What are the problems?
Conclusion
In sum, we argue that catching bad behavior requires the AI's supervisors to have a very good understanding of the AI. Disincentivizing bad behavior is very hard, because several conditions have to hold at the same time. Hence, a small increase in the odds that deceptive behavior is "caught" does not by itself make deception much less likely.
Adjudicating AI debates is very difficult
We disagree with Karnofsky: we think the difficulty of adjudicating AI debates is harder to overcome than he suggests.
Theoretically, we’re not convinced there’s much reason to believe that AI debates robustly tend toward truth. The complexity theory analogy in the original AI safety via debate paper assumes human judges are perfect reasoners[21], but human debates don’t seem to robustly tend toward truth.
There are some reasons to be more optimistic on priors about debate-like mechanisms; we respond to them below.
One may argue:
Moreover, we haven’t been encouraged by empirical AI safety via debate results such as:
The level of reliability required for adjudicating AI debates to succeed depends on the difficulty of disincentivizing deceptive behavior. We discussed above that techniques may have to be highly robust.
Some forms of process-based AIs aren’t very promising
Karnofsky highlights process-based AIs as a promising limitation, but Eli is relatively more optimistic about experimenting with some of the other limitations listed above (in particular categories (1), (2), (3) and limiting situational awareness from (4)). The rest of this section is written from Eli’s perspective.
Process-based AIs are rewarded based on their internal cognition or plans rather than their outputs. Process-based AIs may be easier to supervise to ensure they aren't being deceptive or manipulative, as argued by Karnofsky:
I’m skeptical that very advanced process-based AIs doing complex tasks can actually be much easier to supervise without incurring a huge decrease in capabilities. I’m mainly excited about using process-based AIs at lower capability levels on less complex tasks, when it sometimes leads to a capability increase rather than penalty.
I’d guess that the cruxes of my potential disagreement with Karnofsky are:
Simulations don’t prevent collusion by highly advanced AIs
Eli doesn’t buy Karnofsky’s analogy (in the footnote) arguing that we’ll be able to prevent advanced collusion via simulating a virtual environment where the AI can get a lot of reward by taking over and committing to shut it down (at least, Eli doesn’t buy it for very advanced AI capabilities). But Eli isn’t sure that this disagreement is that relevant because he expects this failure mode to show up very “late in the game” as AIs get extremely advanced. More in a footnote.[22]
Chance of success
The promise of “hacky” solutions
In Karnofsky’s scenario, there is no grand plan for how the AIs will become ambitiously value-aligned. There may be a range of interventions and tests carried out. Would this work?
We are confused and skeptical. This trial-and-error approach, using limited AI systems for testing and training, may be most likely to work if it is eventually optimized for theoretical and conceptual insights into value-alignment. We use the Wright brothers' achievement of the first controlled flights as an analogy.
The ultimate aim is that the AIs are ambitiously aligned with human values or controlled.[23]
As the Rocket Alignment Problem argues, a well-understood theoretical plan may seem to be the only way to solve such engineering problems. Is it possible to arrive at such a plan through trial and error?
Our intuitions share some overlap with those described by the Machine Intelligence Research Institute as reasoning behind their approach, but we are more optimistic about the prospect of trial-and-error aiding the development of conceptual insights.
Analogy: Controlled Flight
The Wright brothers did not invent the airplane but made the first sustained, powered, and controlled flights - they provided the first solution to the control problem of flight. They studied existing aeronautics, created much better aerodynamic data than was previously available (e.g. they built their own wind tunnel to conduct experiments), and tested their new flyers carefully (they first flew each new glider as a kite). But they also spent a lot of time conceptualizing failure modes, e.g. adverse yaw. Our understanding is that they both discovered and addressed this control problem through trial and error and by trying to gain deeper insight into the nature of the control problem.
If this analogy holds true, we should be relatively optimistic about the empirical trial-and-error process, getting better data, performing better tests, and playing around. However, the empirical approach should perhaps be designed to eventually provide theoretical or conceptual insights. By performing powerful tests, for example, Magma can better understand AI control and alignment problems. Magma might also build AIs to aid alignment researchers. We think that conceptual insights should remain Magma’s ultimate focus - but that is mostly backed by our own intuition. We are unsure but not very pessimistic about the extent to which hacky, empirical stuff can help to get to conceptual insights.
Our forecasts
These forecasts are our independent impressions (our beliefs before updating on what others believe). We give intervals representing the resilience of our forecasts: what is the 90% confidence interval for what our forecasts would be after spending a few months thinking full-time about the forecast?
Chance of avoiding existential catastrophe
What’s the chance of Magma avoiding an AI-caused existential catastrophe, conditional on the situation described in the nearcast setup?
Tractability of reducing risk
We operationalize the tractability of reducing risk as: How much relative risk would be added if 50% of talent-weighted people working on AI risk disappeared?
For example, if risk increased from 20% to 25% this would be a relative risk increase of (25 - 20) / (100 - 20) = 6.25%.
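In code, this operationalization is simply (a minimal sketch):

```python
def relative_risk_increase(risk_before, risk_after):
    """Share of the remaining safety margin that is lost when risk rises."""
    return (risk_after - risk_before) / (1 - risk_before)

print(relative_risk_increase(0.20, 0.25))   # 0.0625, i.e. the 6.25% in the example
```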
What we actually care about is doubling rather than halving effort, but it's easier to reason about removing current efforts than about adding new ones, so we forecast on halving for now; this should still be correlated with the tractability of spending more effort.
Karnofsky writes that “the risk of misaligned AI is serious, and that taking it more seriously is likely to reduce it”.
Conclusion and questions for further research
We’ve discussed how Magma, an AI lab close to developing TAI, should approach alignment and deployment.
We’ve written about Magma’s goals and strategy, including whether its goals should be more cooperative and how it should ensure its AIs have desirable properties.
We’ve noted potential disagreements with Karnofsky, the author of the post on which we are building.
We’ve estimated the chance Magma will succeed at its goals: exploring the promise of alignment solutions that feel “hacky” and making forecasts as to whether Magma will avoid AI catastrophe and how tractable it is to make Magma more likely to succeed.
This is an initial exploration. We are excited about any feedback, concerns, or corrections.
Some further research questions that we find promising:
Appendix
Toy model: Magma’s goals, cooperation and the caution frame
We illustrate the argument for replacing the competition frame with the caution frame using an oversimplified toy model.
Setup
Example
Question:
Calculation:
Answer
We give intervals representing the resilience of our forecasts: what is the 90% confidence interval for what our forecasts would be after spending a few months thinking full-time about the forecast?
See also the OpenAI Charter on this.
Thanks to Sam Clarke for proposing something like this.
We intuitively expect Steps 3 and 4 could take (very roughly) anything between 1 and 36 months in the best 0.5-10% of scenarios.
It might not be feasible if it would make several non-safety-cautious actors much more likely to speed up.
Our assumption is that the firms will cooperate, but not completely act as one company, e.g., sharing code and model weights.
This is an oversimplification as in practice an existential catastrophe might be more likely caused by many AIs working together rather than one.
Capability advances could also slow down AI takeoff, increasing the amount of time we have with somewhat capable AIs before very capable ones. On the other hand, capability advances of certain public forms might increase the competition in the AI sector, speeding up takeoff.
We are not completely sure how this definition relates to Rohin Shah’s and Paul Christiano’s narrowly value-aligned AIs. We would appreciate any pointers.
Perhaps a good oracle AI could be an important part of a future involving comprehensive AI services: “AIs are a class of service-providing products, rather than a natural or necessary engine of progress in themselves.”
e.g., ones with some of the limitations suggested above.
This could also be partially categorized under “The scope of the AI’s optimization is limited”, as the AI’s optimization is pointed at things that humans think are good ideas before evaluating outcomes.
We thank Misha Yagudin and Adam Bales for useful comments.
Katja Grace discusses in Counterarguments to the basic AI risk case why AIs might follow heuristics rather than be utility maximisers.
E.g. make the world contain as many paperclips as possible over the next 5 years, increase GDP for the next 10 minutes, or increase discounted GDP over time (2% discount rate per year).
“I create as many paperclips as possible” may in practice be the same as “the world contains the maximum number of paperclips”: the agent counterfactually creates many paperclips if it is willing to be turned off, thereby increasing the chance that another agent will create many paperclips in the future.
This intuition is mostly informed by the reasoning in this paper.
However, depending on the prior of the agents, sometimes more communication can make collusion less likely. We are not sure how plausible this is.
Otherwise the agents can cooperate as they play infinitely repeated games.
Maybe this is also the wrong question to ask.
For example, one might be able to identify how AIs act in certain situations and then avoid putting them in the kind of situations where they cooperate, e.g. the situations where a certain communication channel is present. This assumes humans will be able to understand the heuristics.
The paper indeed notes this as an unrealistic assumption. But I’m explicitly bringing it up as I think the assumption is far enough from being true that the complexity theory analogy is worth little for getting an intuition on whether AI safety via debate will work out.
Karnofsky writes:
I (Eli) don’t buy this analogy. Karnofsky might be trying to argue for either (1) giving the simulators the warning sign would be the rational thing to do or (2) giving the simulators the warning sign would be irrational, but we would do it anyway. To the extent Karnofsky is arguing (2), the argument doesn’t work in the limit of advanced AI capabilities; I don’t see why we should expect AIs to behave irrationally in the same way humans do. To the extent Karnofsky is arguing (1), I think he should argue for it more straightforwardly via an expected value calculation rather than via an analogy.
Verification of value-alignment may be important as any AI might want to convince the human supervisors that it is value-aligned.
To be clear, even if they don’t work for ambitious value-alignment directly we think “hacky solutions” may still be helpful for getting to something more principled.