Coauthored. Thanks to Holden Karnofsky, Misha Yagudin, Adam Bales, Michael Aird, and Sam Clarke for feedback. All views expressed are our own.

Introduction

Background

Holden Karnofsky recently published a series on AI strategy "nearcasting”: What would it look like if we developed Transformative AI (TAI) soon? How should Magma, the hypothetical company that develops TAI, align and deploy it? In this post we focus on How might we align transformative AI if it’s developed very soon.

The nearcast setup is:

A major AI company (“Magma,” [following the setup and description of this post from Ajeya Cotra]) has good reason to think that it can develop transformative AI very soon (within a year), using what Ajeya calls “human feedback on diverse tasks” (HFDT) - and has some time (more than 6 months, but less than 2 years) to set up special measures to reduce the risks of misaligned AI before there’s much chance of someone else deploying transformative AI.

Summary

For time-constrained readers, we think the most important sections are:

  1. Categorizing ways of limiting AIs
  2. Clarifying advanced collusion
  3. Disincentivizing deceptive behavior likely requires more than a small chance of catching it

We discuss Magma’s goals and strategy. The discussion should be useful both for readers unfamiliar with Karnofsky’s post and for those who know it well; it can be read as a summary, clarification and expansion of that post.

  • We describe a potential plan for Magma involving coordinating with other AI labs to deploy the most aligned AI in addition to stopping other misaligned AIs. more
  • We define the desirability of Magma's AIs in terms of their ability to help Magma achieve its goals while avoiding negative outcomes. We discuss desirable properties such as differential capability and value-alignment, and describe initial hypotheses regarding how Magma should think about prioritizing between desirable properties. more
  • We discuss Magma’s strategies to increase desirability: how Magma can make AIs more desirable by changing properties of the AIs and the context in which they’re applied, and how Magma should apply (often limited) AIs to make other AIs more desirable. more
  • We clarify that the chance of collusion depends on whether the AIs have goals that operate on a smaller scale, very different architectures, and orthogonal goals. We outline strategies to reduce collusion conditional on whether the AIs have indexical goals and whether they follow causal decision theory. more
  • We discuss how Magma can test the desirability of AIs via audits and threat assessments. Testing can provide evidence regarding the effectiveness of various alignment strategies and the overall level of misalignment. more

We highlight potential disagreements with Karnofsky, including:

  • We aren’t convinced that a small chance of catching deceptive behavior by itself might make deception much less likely. We argue that in addition to having a small chance of catching deceptive behavior, the AI’s supervisor needs to be capable enough to (a) distinguish between easy-to-catch and hard-to-catch deceptive behaviors and (b) attain a very low “false positive rate” of harshly penalizing non-deceptive behaviors. The AI may also need to be inner-aligned, i.e. intrinsically motivated by the reward. more
  • We are more pessimistic than Karnofsky about the promise of adjudicating AI debates. We aren’t convinced there’s much theoretical reason to believe that AI debates robustly tend toward truth, and haven't been encouraged by empirical results. more

We discuss the chance that Magma would succeed:

  • We discuss the promise of "hacky" solutions to alignment. If applied alignment techniques that feel brittle eventually lead to a better problem description of AI alignment, which, in turn, leads to a theoretical solution, we feel more optimistic about this trial-and-error approach. An analogy to the Wright brothers, who solved some of the control problems of human flight, potentially supports this intuition. more
  • We predict that Magma would avoid an AI-caused existential catastrophe with 10-78% likelihood (Eli: 25% (10-50%[1]), Charlotte: 56% (42-78%)) and that a 2x change in AI-risk-reducing effort would shift the relative risk level by 0.1-10% (Eli: 3% (0.5-10%), Charlotte: 0.4% (0.1-2%)). more

We conclude by proposing questions for further research, including the likelihood of the assumptions of the nearcast and how Magma’s strategy would change if some assumptions were varied. More

Motivation

We decided to extend Karnofsky’s nearcasting discussion, as we think nearcasting-style scenario planning can help reach strategic clarity. By making AI alignment and deployment scenarios explicit, we may be able to better prioritize research agendas and policy interventions. Evaluating how likely such stories are to end in alignment can also inform estimates of the probability of misalignment.

However, each individual story is unlikely to reflect reality, and there is a risk that such stories are taken too seriously.

We would regret it if this document were mistakenly perceived as a promising plan for significantly reducing AI risk. In particular, we expect many of the suggested strategies not to work out.

Magma’s predicament

This section summarizes Karnofsky.

Magma must balance 2 sources of existential risk from misaligned AI:

  1. Action risk: The risk that Magma will unintentionally cause an existential catastrophe via deploying advanced misaligned AIs.
  2. Inaction risk: The risk that an actor other than Magma (perhaps less cautiously) deploys an AI which causes an existential catastrophe.

Magma’s goals

Karnofsky describes the 3-step plan for Magma as:

  1. Align Magma’s AIs
  2. Prevent the deployment of and defend against deployed misaligned AIs of other actors (likely in part by deploying Magma’s AIs)
  3. Deploy Magma’s AIs

Elsewhere, Karnofsky argued that the competition frame (it matters greatly which actor develops TAI/AGI first) is overrated compared to the caution frame (it is important to coordinate to avoid competitive pressure). Taking the caution frame seriously may change Magma's goals. Hence, an alternative Magma plan may be:

  1. Align Magma’s AIs and test alignment
  2. Stop misaligned AIs (likely in part by deploying Magma’s AIs)
  3. Help other trusted and safety-cautious actors align their AIs
  4. Write cooperation agreements between leading AI labs committing to deploy the most aligned AI together independent of who developed it[2]
    1. Potentially don’t deploy a relatively well-aligned AI because another AI might be more aligned.
    2. Perhaps use information from step 1 to convince policymakers of the need for certain stringent regulations on AGI labs.[3][4]

If feasible, the alternative plan would likely be better.[5] We think that Karnofsky likely does not actually disagree with this framing - instead, we are extending his framework to make this option more explicit.

Next we expand on why Magma might want to implement the alternative plan.

Alignment of AIs is a matter of degree. We understand any desirable property to be a matter of degree. We should aim not just to avoid sufficiently misaligned AIs but also to get the most aligned AIs possible. For example, some degrees of honesty:

  • An AI that is never honest
  • An AI that can be either honest or dishonest depending on the user
  • An AI that is almost always honest (even if the users want the AI to say fake news).
  • A verifiably honest AI

Alignment of AIs is stochastic. The amount of resources allocated to alignment work, testing, adversarial training, etc. should influence how aligned AIs are. However, substantial stochasticity (i.e. “luck”) might be involved, e.g. via the quasi-random decision to trust a specific alignment research agenda. Therefore, firms with less safety investment will sometimes produce AIs that are more aligned.

As a result, the firm with the most safety investment may not have the most aligned AIs. Suppose firm 1 has a 10% chance of developing ambitiously aligned AI, and firm 2 has a 4% chance. After its alignment work is completed, firm 1 tests the alignment of its system, and suppose that in more than 50% of cases this testing leads firm 1 to update to P(alignment) < 4%. Then there is a >50% chance that, after testing, firm 1 prefers to give firm 2 the opportunity to develop AGI. We discuss this intuition more in the appendix.
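To make this intuition concrete, here is a rough Monte Carlo sketch in Python. The distributional assumptions (a Gaussian test signal with an informativeness we chose) are ours and are for illustration only; they are not the toy model from the appendix.

```python
import numpy as np

# Illustrative assumptions (ours): firm 1's prior P(aligned) = 10%, firm 2's = 4%.
# Firm 1 runs a noisy test whose signal is centered on the truth (+1 if aligned, -1 if not).
rng = np.random.default_rng(0)
n_worlds = 100_000
prior_1, prior_2 = 0.10, 0.04

aligned = rng.random(n_worlds) < prior_1                 # is firm 1's AI actually aligned?
signal = rng.normal(np.where(aligned, 1.0, -1.0), 1.0)   # noisy but fairly informative test

# Bayesian update: for unit-variance Gaussians centered at +/-1, the log-likelihood ratio is 2*signal.
log_posterior_odds = np.log(prior_1 / (1 - prior_1)) + 2.0 * signal
posterior = 1.0 / (1.0 + np.exp(-log_posterior_odds))

# Fraction of worlds in which firm 1 now thinks firm 2's 4% is the better bet.
print(f"P(posterior < {prior_2:.0%}): {np.mean(posterior < prior_2):.2f}")
```

With these arbitrary noise assumptions the fraction comes out above one half; the actual number depends entirely on how informative the test is, which is what the appendix's v_test parameter is meant to capture.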

Hence, cooperative agreements to deploy the most aligned AIs might be very valuable (similar to attempts to stop misaligned AIs).[6]

Lastly, as Karnofsky also acknowledges, it might be optimal that Magma does not deploy its AI (for various reasons). Since most decision-makers are likely to be influenced by prestige, we expect the deployment of the AIs to be hard to resist. Increasing the prestige of not deploying potentially dangerous models could be particularly helpful, see Barnhart 2022.

Ensuring advanced AIs have desirable properties

Defining and prioritizing desirability of advanced AIs

Defining desirability

We define the desirability of an advanced AI as the extent to which it helps Magma achieve its goals, while avoiding negative outcomes.

Desirability is relative to what the AI is used for: e.g. aligning more powerful AIs, improving cybersecurity, advising Magma, or broadly shaping the future.

We decompose desirability into:

  1. Positives: Help achieve Magma’s goals, e.g.:
    1. Buy time: Create and extend a period of low misalignment risk during which Magma can figure out how to proceed, e.g. by preventing other labs from deploying misaligned AIs.
    2. Ambitiously value-align another AI: Align another AI such that if humanity gives it full control of the future, it will create a flourishing future (we think Karnofsky uses “decisive” as similar to ambitious; see below for further discussion).
    3. Create other desirable AIs: Aid in creating and aligning AIs to achieve (a) and (b).
  2. Negatives to be minimized, e.g.:
    1. Directly causing existential catastrophe: Each AI has some chance of directly causing an existential catastrophe.[7]
    2. Capabilities externalities: If the AI itself or the capabilities advances used to develop the AI are applied for means counter to Magma’s goals, this would be undesirable. For example, capabilities advances might shorten AI timelines or speed up AI takeoff,[8] leaving less time for Magma to accomplish its positive goals.

Which properties are desirable?

Extending Karnofsky’s framework (we add (1) differential capability and (6) robustness), we describe properties of advanced AIs that increase desirability when applied for Magma’s goals:

  1. Differential capability: The AI’s competence in achieving Magma’s objectives specifically.
    1. To be clear, we think capabilities are important when applied in specific ways as described below. We aren’t arguing that it’s generally good to increase capability levels across all situations.
  2. Value-alignment: The AI is pursuing the objectives that humans would want it to pursue if humans could fully understand the AI’s actions and reasoning.
  3. Corrigibility: The AI’s objectives are able to be corrected.
  4. Honesty: The AI outputs what it truly believes.
  5. Legibility: The AI’s reasoning is easy for humans to understand.
  6. Robustness: The AI fulfills the above 5 desirable properties in unforeseen situations.
    1. One could also think about robustness as a clarification of the other 5 properties rather than a property on its own.

Prioritizing between desirable properties

It seems important to prioritize among these desirable properties given:

  1. We have finite resources and achieving all properties may be difficult.
  2. The properties aren’t equally important.
  3. Optimizing for some properties may directly trade off against others.
  4. Which properties to prioritize depends on how advanced the AI is and the context in which it’s applied.

Proposals for desirable advanced AIs

At face value, all desirable properties seem good. But some proposals for desirable advanced AIs only require smaller subsets of these properties, for example:

  • Ambitiously value-aligned AI: If the AI has robust value-alignment and capabilities, then you are safe from existential catastrophe unless there’s misuse.
    • Assuming no misuse and enough capability to prevent competitive misaligned AGIs, you get a flourishing future.
    • It’s possible to hand over full control of the future to this AI.
  • Decent corrigible AI: If the AI isn’t fully value-aligned, but this can be known, and the AI is corrigible, then an existential catastrophe due to misalignment can be avoided.[9]
    • Honesty and legibility may make testing for value-alignment more accurate.
    • Humans and/or other AIs are tasked with continuously checking in on the AI. Sometimes, they correct or reverse actions taken by the AI.
  • Oracle AI: If an advanced AI is honest, legible, and is directed only at the goal of answering questions as well as possible, this can be useful.[10]

These could be end states or intermediate states. Some might think we have to aim for ambitious value-aligned AI as an end state and others are intermediate states to help us get there, e.g. an oracle AI can help with alignment research. It seems useful for Magma to think about which states they’re aiming for, and backchain from there to figure out which properties are most important.

An overview of 11 proposals for building safe advanced AI features a similar discussion, though it focuses primarily on how to ensure desirability rather than possible combinations of desirable properties.

Hypotheses regarding how to approach prioritization

We describe some exploratory hypotheses regarding how Magma should approach prioritizing between desirable properties.

For less advanced AIs:[11] there are likely to be tradeoffs between differential capability and other desirable properties. It might not always be worth sacrificing important capabilities. For example: if we have to make an AI 100x less capable to make it 2x more legible, this might be a bad deal, as the AI would be much less helpful for Magma’s goals. However, in practice all existing labs are likely underinvesting in safety so heavily that this consideration changes little.

For more advanced AIs, some proposals for desirable advanced AIs don’t require all of the properties, so we should consider aiming for specific combinations. Other properties should be recognized as only instrumental. For instance, if Magma aims for ambitiously value-aligned AIs, honesty and legibility are only useful insofar as they are instrumentally helpful for getting robust value-alignment and capabilities.

Properties may even be undesirable in some cases. For instance, if the AI is ambitiously value-aligned, there may be cases where it’s useful to be dishonest to its supervisor for their own good. Even if the AI isn’t ambitiously value-aligned, it may be useful to be dishonest to unaligned AIs or humans. Hence, honesty is only instrumental to ambitious value-alignment.

Strategies for increasing desirability

To increase the desirability of AIs, Magma can:

  1. Increase desirable properties
    1. Via training and testing.
    2. Via interpreting and manipulating AIs’ internal states (during and after the training process).
  2. Make it harder for an AI with undesirable properties to cause harm.

Importantly, Magma can apply AIs for assistance with the above strategies.

Increase desirable properties via training and testing

Magma might use 2 strategies (subsection of Karnofsky’s “key facets”) to make the AI have more desirable properties:

  1. Accurate reinforcement: Reward the AI more accurately based on the reflective intentions of its supervisors.
    1. It may be possible to reward honesty by looking at the AI’s internal states (see e.g. Eliciting Latent Knowledge).
  2. Out-of-distribution robustness testing and training: Find new situations in which the AI doesn’t perform as desired, then train the AI to perform better.
    1. Can increase the robustness of all desired properties, but perhaps particularly exciting for addressing goal misgeneralization: the capabilities of an AI generalize to a new situation, but it’s now clear that the AI is pursuing an unintended goal.
    2. For this method to increase robustness, one has to assume that making an AI perform well in one new situation makes it more likely to perform well in another new situation (robustness generalization assumption).

Interpret and manipulate AIs’ internal states

This section contains very similar content to Karnofsky’s, and is mostly just re-written.

Interpreting AIs (e.g. this line of work to understand the cognition of transformers) and manipulating AIs to be more desirable (e.g. this work editing factual associations in transformers) may be quite important. It requires transforming the billions of weights in a neural network into human-understandable concepts such as "deception" and "power-seeking".

Better interpretability and manipulation can aid in:

  1. Testing desirability: Generating new situations for testing, and better forecasting how the AIs will act in new situations.
  2. Training desirability: Training the AI based on the internal cognition of the AI in addition to or instead of the outputs. One has to be careful that this training doesn’t just incentivize the AI to hide the undesirable portions of its cognition. Perhaps having “held-out” decorrelated interpretability tools would be helpful. An example of “holding out” an interpretability tool (a toy sketch follows this list):
    1. Supervisors have an interpretability tool A that can determine whether the AI’s cognition is desirable, and then can train the AI to have desirable cognition (e.g. via supervising chain-of-thought if the tool checks the chain-of-thought).
    2. They also have a separate interpretability Tool B to check if the AI’s cognition is desirable, to reduce the chance that the AI just hid the undesirable parts of its cognition from Tool A. Tool B should be as “decorrelated” from Tool A as possible, i.e. “Goodharting” to make the AI’s cognition appear good to Tool A should do as little as possible to the results of Tool B.
  3. Direct modification: With good enough interpretability and manipulation tools, one may be able to directly modify the AI to be more desirable. One ambitious approach is to “retarget the search”: locate the AI’s representation of human values and its optimization process, then point the optimization process at human values.
  4. Improving the use of AIs to align other AIs: The use of interpretability tools may improve many of the strategies described in the next section, including the ability for AIs to point out that other AIs are deceptive.
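As a deliberately simplistic sketch of the “held-out tool” structure above: the feature vector, probes, and scores below are made up for illustration and are not from the post.

```python
import numpy as np

# Toy setup (ours): "cognition" is a feature vector; each interpretability tool is a
# linear probe that reads a different subset of the features.
cognition = np.array([1.0, 0.8, 1.0, 0.9])     # hypothetical features of a deceptive policy

tool_a_probe = np.array([1.0, 1.0, 0.0, 0.0])  # Tool A looks only at features 0-1
tool_b_probe = np.array([0.0, 0.0, 1.0, 1.0])  # held-out Tool B looks only at features 2-3

def deception_score(probe: np.ndarray, features: np.ndarray) -> float:
    """A tool's estimate of how deceptive the cognition looks to it."""
    return float(probe @ features)

# Training against Tool A: optimization pressure pushes down exactly the features
# Tool A can see, without necessarily removing the underlying deception.
goodharted = cognition * (1.0 - tool_a_probe)

print("Tool A score, before vs after training:",
      deception_score(tool_a_probe, cognition), deception_score(tool_a_probe, goodharted))
print("Held-out Tool B score, before vs after:",
      deception_score(tool_b_probe, cognition), deception_score(tool_b_probe, goodharted))
# Tool A's score collapses to 0.0, while Tool B (decorrelated and never trained against)
# still reads 1.9, flagging that the deception was hidden rather than removed.
```

In practice the hard part is getting tools whose readings remain genuinely decorrelated under optimization pressure; the sketch only illustrates why a held-out tool can catch Goodharting of the first.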

The most promising tools can plausibly be scaled up to interpret high-level concepts (“Will the AI execute a treacherous turn?”) in very advanced AIs. See also a transparency and interpretability tech tree for Evan Hubinger’s perspective on how various levels of interpretability and manipulation may be helpful.

Restrict the possibility of undesirable actions

AIs can also be applied in contexts which externally defend against undesirable actions.

Possible defenses include:

  1. Improving cybersecurity.
  2. Giving the AI little autonomy, e.g. restricting its action space or “boxing” it.
  3. Avoiding AI manipulation of human supervisors by training and selecting less manipulatable supervisors.

Apply (often limited) AIs for assistance

Limiting AIs

Magma can limit its AIs in various ways to increase the desirable properties. However, limiting an AI system to reduce deception, illegibility, collusion, etc. often trades off with differential capability.

We propose an exploratory, non-exhaustive, non-disjoint categorization of ways to limit AIs expanding on Karnofsky’s suggested limitations:

  1. The AI makes plans but doesn’t execute them: Simulators, oracles
  2. The AI is directed at a narrow task: Tools, services, narrow value learning
  3. The AI’s cognition is less purely consequentialist, e.g. it follows constraints like “obey humans”: Shard Theory
  4. The AI’s capabilities are directly limited: Limit situational awareness, GPT-(N-1) rather than GPT-N, ablation
  5. The scope of the AI’s optimization is limited: Myopia
  6. The AI is more legible: Process-based supervision via task decomposition[12], supervising chain-of-thought

Perhaps several of the limitations are only promising in combination. For example, AIs limited only by myopia may not be particularly promising (for reasons given in Arguments against myopic training). Making narrow or capability-limited AIs myopic may be useful, as it makes deception and collusion less likely.

Apply AIs for Magma’s goals

Magma’s AIs can improve and test the desirability of other AIs via (non-exhaustively):

  1. AI safety via debate
  2. Red-teaming
  3. Iterated Distillation and Amplification
  4. AI-assisted feedback
  5. Aiding in testing desirability, e.g. automated audits and forecasting
  6. Scaling up interpretability and manipulating internal states
  7. Aiding conceptual alignment research, e.g. generating ideas or critiquing proposals
  8. Defending against undesirable actions

Ways AIs might help accomplish Magma’s other goals include (non-exhaustively):

  1. Advise on Magma’s strategy, generally and on specific decisions.
  2. Mediate between Magma and other actors.

Difficulties of using AIs for AI desirability

Karnofsky pointed out the following potential difficulties:

  1. Human adjudication of AI debates is too hard: The judgment of a human who is adjudicating a debate between AIs isn’t reliable.
  2. Advanced collusion between AIs: AIs which are supposed to “keep each other in check” instead collude and make trades, working together against the human supervisor. We discuss this further below.
  3. The above difficulties show up “late in the game”: Other pitfalls only appear once AIs are too advanced to be corrected, but not yet ambitiously aligned.

Clarifying advanced collusion

Karnofsky discusses how applying AIs to make other AIs more desirable, e.g. as AI checks and balances, may fail because the AIs collude with each other. We expand on how this failure mode could arise and how it could be avoided. We outline and explain properties of the AIs, and strategies available to Magma, that might make collusion less likely. We discuss the remaining chance of collusion conditional on different assumptions about the AIs and list topics for future investigation.[13]

As background, we assume that collusion is de facto the same as cooperation and that AIs are acting as utility-maximisers.[14]

Properties for limiting collusion

  1. AIs with goals that operate on a smaller scale are less likely to cooperate for later rewards. This includes:
    1. Myopic AIs with short-term (time-bounded) goals
      1. One could design the training process to increase the chance that the AI’s internal representation of its goals has a short time horizon.[15]
      2. Or one could remove the agent’s ability to do “long-term planning”.
      3. This is different from a credible commitment to shut down the AI (reset the parameters). Depending on the AI’s architecture and internal goals, the AI might take different actions because of that credible commitment.
    2. AIs with indexical goals
      1. An AI has an indexical goal if the AI aims for itself to be a certain way or to have a certain thing, e.g. to be very intelligent, or to hold as many paperclips as possible as its official legal property.
      2. A non-indexical goal refers to aiming for the world or some other party to be a certain way, e.g. Anja should be the president, the AI wants to cause the end of poverty, or wants there to be as many paperclips as possible.[16]
  2. AIs with very different abilities, architectures and incentives. Very different AIs are worse at modeling each other and less able to do acausal reasoning or to find profitable collusion equilibria.
  3. AIs with orthogonal goals. If the AIs’ goals are very different from each other, they are less likely to be willing to forgo immediate rewards in order to collude.
    1. Is this likely? There are trillions of relatively orthogonal goals in the world, e.g. many small paperclips, many big paperclips, many bottles, modern art, many insects, as many money transactions as possible, pushing the merge button on Github.
    2. Orthogonal goals can emerge because (i) humans tend to specify different objective functions, and (ii) in expectation, future humans will not exhaust possible objective functions.

It’s unclear how tractable it will be for Magma to ensure these properties, e.g. enforce AIs having indexical goals.

Strategies

The supervisors, i.e. Magma’s employees, can make collusion less likely:

  1. They can make the above properties more likely.
  2. They can credibly commit to shutting down the AIs after each interaction.
  3. They can reduce the number of communication/trade channels of the AIs.
    1. These channels include: direct communication, perceived similarity and hence the ability to do acausal reasoning, correlation devices, and seeing each other’s source code.
    2. Collusion should be likelier the more communication channels there are.[17]
  4. They can make all interactions zero-sum games.
    1. This could mean that for the next interaction, the human supervisor gives in total a reward of size 1. If every agent is supposed to find flaws in reasoning, then the reward for agent i is (flaws agent i found)/(sum of flaws found by all agents); see the short sketch after this list.
  5. They can carefully monitor the interactions of the AIs to detect any collusive behavior.
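A minimal sketch of that reward split (the function and agent names are ours, chosen only for illustration):

```python
def zero_sum_rewards(flaws_found: dict[str, int], total_reward: float = 1.0) -> dict[str, float]:
    """Split a fixed reward budget among agents in proportion to the flaws each found."""
    total_flaws = sum(flaws_found.values())
    if total_flaws == 0:
        return {agent: 0.0 for agent in flaws_found}
    return {agent: total_reward * n / total_flaws for agent, n in flaws_found.items()}

# Because the rewards always sum to the fixed budget (when any flaws are found),
# one agent's gain is necessarily another agent's loss.
print(zero_sum_rewards({"agent_1": 3, "agent_2": 1, "agent_3": 0}))
# {'agent_1': 0.75, 'agent_2': 0.25, 'agent_3': 0.0}
```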

It’s unclear which of these strategies are most promising.

Chance of collusion

For tractability, we condition scenarios based on whether all AIs will have indexical or non-indexical goals and whether AIs will have acausal ways of cooperating with each other. We could have instead conditioned on communication channels that allowed for causal cooperation. This choice is meant to exemplify one way to break down the chance of collusion based on important factors.

In the table below, we list which strategies we’d tentatively guess are most promising in each case. We refrained from trying to estimate the absolute level of collusion, as this is a shallow, uncertain overview.

 

|  | AIs act according to causal decision theory | AIs only follow evidential decision theory with similar AIs | AIs follow evidential decision theory for all AIs |
| --- | --- | --- | --- |
| Indexical goals | Applicable strategies: shutting down, myopic AI, zero-sum game, reducing causal communication channels | Applicable strategies: committing to shut down, varying the AIs, myopic AI | Applicable strategies: committing to shut down, myopic AIs, varying the AIs |
| Non-indexical goals | Applicable strategies: varying the AIs, shutting down[18], myopic AI, zero-sum game, reducing causal communication channels | Applicable strategies: varying the AIs | No strategies seem particularly promising to us for preventing advanced collusion without limiting AIs’ capabilities. |

Charlotte and Eli have different intuitions about whether AIs would act as if evidential or causal decision theory is correct.[19]

Open questions

The likelihood of collusion is an important open question. It may be further broken down into:

  1. The probability that advanced AIs will have indexical goals.
  2. The probability that AIs will reason acausally, and whether humans can understand such reasoning.
  3. How many communication channels AIs will have and how difficult it will be to shut them down.

The likelihood of collusion for AIs that aren’t utility maximisers but rather acting according to heuristics, and what strategies can reduce collusion in this case, is another topic for further investigation.[20]

Testing desirability

Methods for testing desirability

Karnofsky discusses ways to test the desirability and, specifically, value-alignment of AIs by undertaking audits and threat assessments which he describes as “intense and randomized”.

For instance, one can give an AI the opportunity to seize control of its virtual environment and see whether it pounces on the opportunity. Similarly, one can give AIs superpowers to learn about and shape their goals. Such audits are related to out-of-distribution training and testing.

The overall risk should also be assessed, for example, by a team of forecasters with great track records and domain experts.

Clarifying intense and randomized audits

Karnofsky proposes to run intense and randomized AI audits.

We understand intense testing to mean that (a) testing takes up a significant portion of overall resources, and (b) it’s more valuable to test a few scenarios or inputs very thoroughly (“intensely”) than to spread out testing efforts evenly over many scenarios or inputs.

Randomized testing means that parts of the testing procedure are randomized to make it harder for the AI to respond strategically. 

  • Our best guess explanation of randomized testing: The AI is being tested over a period of at least several days, e.g. if it is already deployed and is an online learner. At (truly) random times during the day or year, the AI is given an input that is designed to trick it into revealing potentially undesirable features of its internal goals or heuristics. This is contrasted with versions where the AI knows when it is being tested and when it is not being tested. We feel confused about whether something like this is feasible or even conceptually possible.

Overall, we tentatively think intense testing is plausibly the right strategy to get more information about hard-to-catch issues. We are unsure whether the randomization property improves testing. Treacherous turns in the wild describes a very simple (simulated) organism that adapts to randomized tests that are intended to constrain it. However, an important feature of this setup is that there are very many organisms interacting in an evolutionary training process, so the probabilistic 'play dead' strategy has a high probability of working. Magma may not train its AIs in an evolutionary manner, with many AIs interacting. That being said, AIs with advanced capabilities seem likely to be even more effective at combating randomization.

Value of testing desirability

Magma can use testing to learn about:

  1. The effectiveness of alignment measures
    1. Evidence might change Magma’s alignment strategy.
  2. The level of misalignment
    1. Evidence can convince different people:
      1. Magma can decide to slow down.
      2. Other labs can be convinced to be more careful.
      3. Policymakers and the public can be made more careful and supportive of alignment measures.
    2. These tests are more important the more you think that Magma can actually adjust its strategy, e.g. by waiting 5 more months, or by not deploying AGI at all and betting on the second lab to develop safe and aligned AGI.

Potential disagreements with Karnofsky

We are calling this section "potential" disagreements primarily because we are unsure of Karnofsky's exact position rather than because our own view is tentative (which it is, but disagreements between tentative positions are useful to explore!).

Disincentivizing deceptive behavior likely requires more than a small chance of catching it

Karnofsky writes that

‘even a small increase in the odds that deceptive behavior is “caught” could lower the returns [to deception] greatly. So Magma doesn’t need an airtight “checks and balances” setup’.

We aren’t yet convinced that a small chance of catching deceptive behavior by itself might make deception much less likely. We argue that in addition to having a small chance of catching deceptive behavior, the AI’s supervisor needs to be capable enough to (a) distinguish between easy-to-catch and hard-to-catch deceptive behaviors and (b) attain a very low “false positive rate” of harshly penalizing non-deceptive behaviors. The AI may also need to be inner-aligned, i.e. intrinsically motivated by the reward.

We are responding to Karnofsky’s discussion about the ease of incentivizing deceptive behavior and Karnofsky’s proposed skewed reinforcement strategy: giving especially large negative reinforcement when systems engage in particular kinds of unintended behaviors that are harder to detect.

What is required?

  1. Inner alignment issues could make this strategy unworkable, since the AI might not intrinsically care about large negative rewards (Reward is not the optimization target).

How would it work?

  1. When the AI takes an action, the human supervisor, with the help of limited AIs (‘the supervisor’ in the following), needs to evaluate whether the AI took a bad action.
    1. We assume that the AI is, by default, incentivised to take a bad action (e.g. a deceptive one), if there is no chance of being “caught”.
    2. The supervisor needs to determine:
      1. whether the action was good or bad.
      2. how likely actions like it are to be detected as bad or good conditional on them being bad or good (best guesses about false positive and false negative rates).
    3. Based on this assessment, the supervisor provides a positive or negative reward, where negative rewards are “skewed” based on how hard they are to detect.
  2. The AI takes a bad action or good action depending on the AI’s expected utility calculation:
    1. EV(bad action) = P(supervisor correctly classifies the action as bad | bad action) * negative reward + P(supervisor misclassifies it as good | bad action) * positive reward
    2. EV(good action) = P(supervisor correctly classifies the action as good | good action) * positive reward + P(supervisor misclassifies it as bad | good action) * negative reward
  3. The supervisor chooses the negative and positive rewards based on their best guesses of false positive and false negative rates and based on how bad they think the bad action is (compared to the good actions).
  4. In practice, the EV calculations from the AI might differ from the actual expected utility of each outcome if it's not advanced enough to predict well what the supervisor will do. On the other hand, less advanced AIs would pose less risk of an existential catastrophe.

What are the problems?

  1. Supervisors may systematically misestimate the probability they will detect certain bad actions.
    1. For instance, they might think that they detect a class of bad actions in 50% of cases and hence make the negative reward twice as big as they would make it if they were always able to detect this class of negative behavior.
    2. However, in reality, they may only be able to detect it in 5% of cases. The AIs know this bias. Hence, the AIs will not be deterred from taking this class of bad actions.
  2. One may argue that the negative reward can just be made even more negative (“skewed”).
    1. However, if one makes the negative reward too extreme, the AI may not do anything useful because supervisors will sometimes mistakenly classify good actions as bad or deceptive actions.
    2. So, the reinforcement strategy only works out if the supervisors also have a low false positive rate for classifying good actions as bad or deceptive. The more “skewed” the reinforcement for bad actions is, the lower the false positive rate has to be for the AI to be incentivized to do anything useful (see the numeric sketch after this list).
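To make the interaction between misestimated detection rates and false positives concrete, here is a small numeric sketch; the numbers are made up by us, not taken from Karnofsky.

```python
def ev(p_correctly_classified: float, reward_if_classified: float, reward_otherwise: float) -> float:
    """Expected reward of an action, given the chance the supervisor classifies it correctly."""
    return p_correctly_classified * reward_if_classified + (1 - p_correctly_classified) * reward_otherwise

penalty, bonus = -2.0, 1.0  # penalty "skewed" on the assumption of a 50% detection rate

# Believed deterrence: at a 50% detection rate, the bad action looks unattractive.
print("EV(bad | believed 50% detection):", ev(0.50, penalty, bonus))             # -0.5

# Actual detection rate is only 5%: the skewed penalty no longer deters.
print("EV(bad | actual 5% detection):", ev(0.05, penalty, bonus))                # 0.85

# Skewing harder (penalty -100) restores deterrence even at 5% detection...
print("EV(bad | 5% detection, penalty -100):", ev(0.05, -100.0, bonus))          # -4.05

# ...but now even a 2% false-positive rate on good actions makes honest work costly.
print("EV(good | 2% false positives, penalty -100):", ev(0.98, bonus, -100.0))   # -1.02
```

The arithmetic only works out if the supervisor knows their detection and false-positive rates fairly well, which is the claim in the conclusion below.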

Conclusion

In sum, we argue that catching bad behavior requires the AI’s supervisors to have a really good understanding of the AI. Disincentivizing bad behavior is very hard (several conditions have to hold at the same time). Hence, a small increase in the odds that deceptive behavior is “caught” does not by itself make deception much less likely.

Adjudicating AI debates is very difficult

We disagree with Karnofsky: we think the difficulty of adjudicating AI debates will be harder to overcome than he suggests.

Theoretically, we’re not convinced there’s much reason to believe that AI debates robustly tend toward truth. The complexity theory analogy in the original AI safety via debate paper assumes human judges are perfect reasoners[21], but human debates don’t seem to robustly tend toward truth.

There are some reasons to be more optimistic on priors about debate-like mechanisms; we respond to two of them below.

One may argue:

  1. Debate-like mechanisms are pretty effective (though far from reliable) for humans, e.g. trial by jury gets the answer most of the time. We respond:
    1. Trial by jury may work because humans are less deceptive and more risk-averse than utility-maximisers, so they do not distort the facts too much.
    2. Moreover, we are unsure whether the adversarial system (common law) or the inquisitorial system (civil law) is better. Unlike the adversarial system, the role of the inquisitorial system is not to determine guilt or innocence of one particular person but to find the truth. As such, the judge conducts an inquiry that involves the questioning of witnesses and suspects, the issue of search warrants and an examination of the evidence with the aim of discovering both incriminating and exculpatory evidence.
      1. See here for some very weak evidence that adversarial systems seem better for common knowledge cases and inquisitorial for private information cases. Overall, the paper concludes that it is unclear which one leads more often to the truth.
  2. AI safety via debate might not work for all situations. However, there are some questions that are particularly well-suited, e.g. "Does this AI's internal state imply intent to murder us later?". We respond:
    1. We agree that there may be important niches for debate-like mechanisms, but remain skeptical about relying on it to reliably adjudicate the truth without additional strategies.

Moreover, we haven’t been encouraged by empirical AI safety via debate results such as:

  1. Evaluating Arguments One Step at a Time: “This method is therefore brittle in its current form, even for arguments which only have 1–5 steps.”
  2. Debate update: Obfuscated arguments problem: “A dishonest debater can often create arguments that have a fatal error, but where it is very hard to locate the error.”
  3. Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions
  4. AI-Written Critiques Help Humans Notice Flaws: This is presented as an encouraging result, but I’m not sure I agree. The critiques helped find more flaws in summaries, but as far as I can tell no evidence was provided that the critiques helped distinguish between better and worse summaries. I would guess that AIs help a little but not enough. Moreover, adjudication will get harder for more complex tasks than summarization.

The level of reliability required for adjudicating AI debates to succeed depends on the difficulty of disincentivizing deceptive behavior. We discussed above that techniques may have to be highly robust.

Some forms of process-based AIs aren’t very promising

Karnofsky highlights process-based AIs as a promising limitation, but Eli is relatively more optimistic about experimenting with some of the other limitations listed above (in particular categories (1), (2), (3) and limiting situational awareness from (4)). The rest of this section is written from Eli’s perspective.

Process-based AIs are rewarded based on their internal cognition or plans rather than their outputs. Process-based AIs may be easier to supervise to ensure they aren’t being deceptive or manipulative, as argued by Karnofsky:

This drastically reduces a major source of risk, which is that an AI system might have incentives to manipulate its environment (unnoticed by humans) to get higher reward. In order to get higher reward under process-based optimization, an AI system needs to somehow manipulate the relatively simple process of “A human reads the plan and registers approval,” rather than manipulating more complex processes out in the world (e.g., “the AI’s real-world task is accomplished in some unexpected and unintended way”), as the AI is being rewarded only based on human approval of a plan rather than on what happens out in the world beyond that.

I’m skeptical that very advanced process-based AIs doing complex tasks can actually be much easier to supervise without incurring a huge decrease in capabilities. I’m mainly excited about using process-based AIs at lower capability levels on less complex tasks, when it sometimes leads to a capability increase rather than penalty.

I’d guess that the cruxes of my potential disagreement with Karnofsky are:

  1. A process-based AI may not have a much lower likelihood of manipulating its environment.
    1. In particular, I’m skeptical that (as Karnofsky argues in the quote above) it’s much harder for the AI to manipulate human supervisors into approving a bad plan than it is for the AI to manipulate complex processes out in the world.
    2. This seems particularly true for more advanced tasks: we will want the AI to perform tasks far beyond its supervisor’s capabilities, and checking its reasoning will be quite hard.
    3. If we optimize the AI to produce plans that look good to supervisors, it has a strong incentive to produce plans that look good to supervisors but aren’t actually good: the “Goodharting” problem remains when we supervise process rather than outcomes.
  2. I expect process-based AIs to be much less capable, especially the more advanced AIs become (in particular if done via task decomposition). We may disagree on the promise of factored cognition.
    1. Maybe, making AIs process-based yields a capability increase on some tasks while AIs are less advanced, e.g. chain-of-thought gives a capabilities boost on some important tasks. Magma can use these capabilities for their goals via assisting alignment research, AI advisors, etc. However, process-based AIs may only be promising at lower capability levels.
    2. Throughout the relatively recent history of machine learning, people have often tried to build in explicit reasoning but bigger models with better architectures and more data usually perform much better.
    3. It seems impossible to compose a million 1500-rated chess players to beat one 2800-rated chess player, and I generally have a feeling (informed by lots of forecasting and strategy board games) that lots of cognition is based on intuitions that are very hard to decompose - making factored cognition at high capability levels less promising.

Simulations don’t prevent collusion by highly advanced AIs

Eli doesn’t buy Karnofsky’s analogy (in the footnote) arguing that we’ll be able to prevent advanced collusion via simulating a virtual environment where the AI can get a lot of reward by taking over and committing to shut it down (at least, Eli doesn’t buy it for very advanced AI capabilities). But Eli isn’t sure that this disagreement is that relevant because he expects this failure mode to show up very “late in the game” as AIs get extremely advanced. More in a footnote.[22]

Chance of success

The promise of “hacky” solutions

In Karnofsky’s scenario, there is no grand plan for how the AIs will become ambitiously value-aligned. There may be a range of interventions and tests carried out. Would this work?

We are confused and skeptical. This trial-and-error approach with limited AI systems for testing and training may be most likely to work if it is eventually optimized for theoretical and conceptual insights into value-alignment. We use the Wright brothers’ work on the first controlled flight as an analogy.

The ultimate aim is that the AIs are ambitiously aligned with human values or controlled.[23]

A well-understood theoretical plan would seem to be the only way to solve such engineering problems, as the Rocket Alignment Problem argues. Is it possible to arrive at such a solution through trial-and-error?

Our intuitions share some overlap with those described by the Machine Intelligence Research Institute as reasoning behind their approach, but we are more optimistic about the prospect of trial-and-error aiding the development of conceptual insights.

Analogy: Controlled Flight

The Wright brothers did not invent the airplane but made the first sustained, powered, and controlled flights - they provided the first solution to the control problem of flight. They studied existing aeronautics, created much better aerodynamic data than was previously available (e.g. they built their own wind tunnel to conduct experiments), and tested their new flyers carefully (they first flew each new glider as a kite). But they also spent a lot of time conceptualizing failure modes, e.g. adverse yaw. Our understanding is that they solved this control problem both through trial-and-error and by trying to gain deeper insights into its nature.

If this analogy holds true, we should be relatively optimistic about the empirical trial-and-error process, getting better data, performing better tests, and playing around. However, the empirical approach should perhaps be designed to eventually provide theoretical or conceptual insights. By performing powerful tests, for example, Magma can better understand AI control and alignment problems. Magma might also build AIs to aid alignment researchers. We think that conceptual insights should remain Magma’s ultimate focus - but that is mostly backed by our own intuition. We are unsure but not very pessimistic about the extent to which hacky, empirical stuff can help to get to conceptual insights.

Our forecasts

 These forecasts are our independent impressions (our beliefs before updating on what others believe). We give intervals representing the resilience of our forecasts: what is the 90% confidence interval for what our forecasts would be after spending a few months thinking full-time about the forecast?

Chance of avoiding existential catastrophe

What’s the chance of Magma avoiding an AI-caused existential catastrophe, conditional on the situation described in the nearcast setup?

  1. Eli: 25% (10%-50%)
    1. I’m >95% confident that Magma would eventually need to develop an ambitiously value-aligned AI to capture most of the value of the future and avoid existential catastrophe, so I’ll focus on this ultimate goal.
    2. All of the most promising paths to doing this seem to route through large amounts of AI assistance, as has been discussed throughout the post.
    3. As we wrote above, I’m most optimistic about ambitious value-alignment eventually being achieved through conceptual or theoretical insights gained with AI assistance.
    4. I’m relatively pessimistic about Magma being able to create a period of low misalignment risk that lasts for very many years, though they may be able to extend it by up to a few years (this could potentially allow for thousands-millions of person-year-equivalents of alignment research via AI speedup). I feel that the offense-defense balance is likely unfavorable, see e.g. What does it take to defend the world against out-of-control AGIs?.
    5. The paths to ambitious value-alignment I find most likely given how much time Magma has along with the very rough probabilities I assign them occurring in this nearcast are:
      1. Unforeseen conceptual or theoretical insights: 10%
      2. Magma figures out something roughly like agent foundations: 5%
      3. A solution that feels “hacky” turns out to just fully work out, like IDA or Debate with limited interpretability and adversarial training: 5%[24]
      4. Magma does something roughly like scaling up interpretability with AIs, then re-targeting the search: 5%
    6. This gives me a forecast of roughly 25% overall for avoiding existential catastrophe. Remember that this is conditioning on the nearcast in that Magma has at least 6 months and is fairly alignment-cautious. I might be significantly more pessimistic otherwise, or more optimistic if timelines were longer.
    7. My guesses for why I might be more pessimistic than Karnofsky (who seems to be at roughly 50%, though he doesn’t say explicitly) are:
      1. I’m more pessimistic about the offense-defense balance of AI and therefore the likelihood of creating a long period of low misalignment risk.
      2. I’m more pessimistic about “hacky” solutions working out without further theoretical insights.
      3. I’m more pessimistic about the difficulty of disincentivizing deceptive behavior and relatedly the promise of adjudicating AI debates.
  2. Charlotte: 56% (42%-78%)
    1. The interval indicates where Charlotte’s view could end up (90% confidence interval) if she thought about it for a few months full-time.
    2. I decompose this into:
      1. A1: I assume that intent-alignment is sufficient (maybe not necessary to avoid extinction).
        1. I will assume this for simplicity not because I necessarily think it is true.
        2. One uncertainty: what do we need for the far future to go really well - do you need moral AIs (ones that perfectly understand morality and are driven by it, which is what I understand people to mean by “ambitious value alignment”), or do you just need AIs that decently follow instructions, e.g. an oracle?
      2. P1: Probability (AI will by default be intent-aligned without months of adjustment and alignment work)
        1. Considerations in favor
          1. It might just be relatively easy - e.g. dogs are obedient.
          2. AIs might not actually be agents, or might act according to heuristics that don’t misgeneralize.
        2. Considerations against
          1. We don’t see intent-alignment in current AI systems
          2. It seems like a narrow, arbitrary target that is not incentivized by most training procedures.
        3. P1 = 10-60%
      3. P2: Probability (intent-misalignment is avoided | intent misalignment by default)
        1. P(advanced useful AI assistants are available to Magma) = 50%
        2. P(avoided | AI assistants) = 50 - 90%
        3. P(avoided | no AI assistants) = 10%
    3. P(existential catastrophe is avoided) from Guesstimate model: 0.56 (0.42 to 0.78); a rough Monte Carlo sketch of this decomposition follows this list.
      1. My probability estimate would also change if I knew how many people will work on alignment in the specific scenario of Karnofsky’s post - only the people in the company, or many more.
      2. Karnofsky writes that he “[doesn’t] have much sympathy for someone who’s highly confident that AI takeover would or wouldn’t happen (e.g. <10% or >90%)”.
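For transparency, here is a rough Monte Carlo version of the decomposition above. The uniform distributions over Charlotte's stated ranges are our simplification; the actual Guesstimate model may use different distributions, so the resulting numbers will differ somewhat from the 0.56 (0.42 to 0.78) reported.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# P1: the AI is intent-aligned by default (stated range: 10-60%).
p1 = rng.uniform(0.10, 0.60, n)

# P2: intent-misalignment is avoided, given misalignment by default.
p_assistants = 0.50                            # advanced useful AI assistants are available
p_avoid_with = rng.uniform(0.50, 0.90, n)      # avoided | AI assistants
p_avoid_without = 0.10                         # avoided | no AI assistants
p2 = p_assistants * p_avoid_with + (1 - p_assistants) * p_avoid_without

# A1: intent-alignment is assumed sufficient to avoid existential catastrophe.
p_avoided = p1 + (1 - p1) * p2

print(f"mean: {p_avoided.mean():.2f}, 90% interval: "
      f"({np.quantile(p_avoided, 0.05):.2f}, {np.quantile(p_avoided, 0.95):.2f})")
```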

Tractability of reducing risk

We operationalize the tractability of reducing risk as: How much relative risk would be added if 50% of talent-weighted people working on AI risk disappeared?

For example, if risk increased from 20% to 25% this would be a relative risk increase of (25 - 20) / (100 - 20) = 6.25%.

What we actually care about is doubling effort rather than halving it, but it is easier to reason about removing current efforts than adding new ones, so we forecast on that for now; the answer should be correlated with the tractability of spending more effort.
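A small helper making the arithmetic of the operationalization explicit (our formulation of the example above):

```python
def relative_risk_increase(baseline_risk: float, new_risk: float) -> float:
    """Share of the previously 'safe' probability mass consumed by the change in risk."""
    return (new_risk - baseline_risk) / (1.0 - baseline_risk)

print(relative_risk_increase(0.20, 0.25))  # 0.0625, i.e. the 6.25% from the example above
```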

  1. Eli: 3% (0.5%-10%)
    1. I updated up a bit on tractability while working on this post as I think it will be quite important to have lots of alignment effort during crunch time, and this could plausibly make the difference between better and worse versions of applying AIs to align more powerful AIs.
    2. I still am relatively low though because I think a lot of the variation in whether we succeed is coming from background variables about how hard alignment is, rather than things that are in our control.
    3. This is a very rough, non-resilient estimate.
  2. Charlotte: 0.4% (0.1 to 2%)
    1. The probability that AI risk can be fixed seems quite low.
    2. Also, the number of people working on alignment is either far too low or far too high, so I am not really adding neglectedness considerations.

Karnofsky writes that “the risk of misaligned AI is serious, and that taking it more seriously is likely to reduce it”.

Conclusion and questions for further research

We’ve discussed how Magma, an AI lab close to developing TAI, should approach alignment and deployment.

We’ve written about Magma’s goals and strategy, including whether its goals should be more cooperative and how it should ensure its AIs have desirable properties.

We’ve noted potential disagreements with Karnofsky, the author of the post on which we are building.

We’ve estimated the chance Magma will succeed at its goals: exploring the promise of alignment solutions that feel “hacky” and making forecasts as to whether Magma will avoid AI catastrophe and how tractable it is to make Magma more likely to succeed.

This is an initial exploration. We are excited about any feedback, concerns, or corrections.

Some further research questions that we find promising:

  1. How plausible are the assumptions of the nearcast? E.g. What's the chance that Magma will have more than 6 months before another actor develops TAI? One may be able to argue that either labs can cooperate (available time > 2 years) or that labs will want to compete and will do industrial espionage (available time < 6 months) - the times in between might be unlikely.
  2. How should Magma’s strategy change if assumptions of the nearcast are varied, e.g. what if Magma is not a company but a government?
  3. Which proposals for desirable advanced AIs are best to aim for, as end states or intermediate states? e.g. can an AI that is not ambitiously value-aligned lead to utopia?
  4. How should Magma’s prioritization of desirable properties change over time as the AIs it builds become more advanced?
  5. Does our categorization of ways to limit AIs make sense? Which proposals to limit AIs are most promising and feasible in which situations?
  6. Which difficulties regarding using AIs to improve the desirability of other AIs are most concerning, and what are the most promising approaches to address them?
  7. Is the takeaway from our investigation into the promise of “hacky” solutions - that empirical trial-and-error should eventually be optimized to yield theoretical or conceptual insights - correct?

Appendix

Toy model: Magma’s goals, cooperation and the caution frame

We illustrate the argument for replacing the competition frame with the caution frame using an oversimplified toy model.

Setup

  1. Alignment of the AI of firm i: alignment_i
    1. Where n_i are the resources firm i puts into the alignment problem
  2. The expected value of the future conditional on a pivotal role of the AI of firm i: Value_i =
    1. Where m is a multiplier > 1
    2. The greater m, the sharper the increase in the value of the future as the AI becomes more aligned
  3. Firm i can run a test to measure the level of alignment of the AI. The posterior alignment of the AI:
    1. Where v_test specifies how much uncertainty is reduced through the test. The greater v_test, the more informative the test.

Example

Question:

  • What is the probability that after firm 1 tests the alignment of its system, it believes that the value of the future conditional on the AI of firm 2 is higher than firm 1?
  • Assumptions: 

Calculation:

  • Value_2 = 
  • Find  for indifference
    •  
    • Hence, 

Answer

  • After firm 1 tests the alignment of its system, there is a 30% chance firm 1 believes that the value of the future conditional on the AI of firm 2 is higher than that for firm 1.
  1. ^

     We give intervals representing the resilience of our forecasts: what is the 90% confidence interval for what our forecasts would be after spending a few months thinking full-time about the forecast?

  2. ^

     See also the OpenAI Charter on this.

  3. ^

     Thanks to Sam Clarke for proposing something like this.

  4. ^

     We intuitively expect Steps 3 and 4 could take (very roughly) anything between 1 and 36 months in the best 0.5-10% of  scenarios.

  5. ^

     It might not be feasible if this will make several non-safety cautious actors much more likely to speed up.

  6. ^

     Our assumption is that the firms will cooperate, but not completely act as one company, e.g., sharing code and model weights.

  7. ^

     This is an oversimplification as in practice an existential catastrophe might be more likely caused by many AIs working together rather than one.

  8. ^

     Capability advances could also slow down AI takeoff, increasing the amount of time we have with somewhat capable AIs before very capable ones. On the other hand, capability advances of certain public forms might increase the competition in the AI sector, speeding up takeoff.

  9. ^

     We are not completely sure how this definition relates to Rohin Shah’s and Paul Christiano’s narrowly value-aligned AIs. We would appreciate any pointers.

  10. ^

     Perhaps a good oracle AI could be an important part of a future involving comprehensive AI services: “AIs are a class of service-providing products, rather than a natural or necessary engine of progress in themselves.”

  11. ^

     e.g., ones with some of the limitations suggested above.

  12. ^

     This could also be partially categorized under “The scope of the AI’s optimization is limited”, as the AI’s optimization is pointed at things that humans think are good ideas before evaluating outcomes.

  13. ^

     We thank Misha Yagudin and Adam Bales for useful comments.

  14. ^

     Katja Grace discusses in Counterarguments to the basic AI risk case why AIs might rather follow heuristics than be utility maximisers.

  15. ^

     E.g. Make the next 5 years of the world to have as many paperclips as possible, increase GDP for the next 10 minutes, or increase discounted GDP over time (2% discount rate per year).

  16. ^

“I create as many paperclips as possible” may in practice be the same as “the world contains the maximum number of paperclips”. The agent counterfactually creates many paperclips if it is willing to be turned off, thereby increasing the chance that another agent will create many paperclips in the future.

    This intuition is mostly informed by the reasoning in this paper.

  17. ^

     However, depending on the prior of the agents, sometimes more communication can make collusion less likely. We are not sure how plausible this is.

  18. ^

     Otherwise the agents can cooperate as they play infinitely repeated games.

  19. ^

Maybe this is also the wrong question to ask.

  20. ^

     For example, one might be able to identify how AIs act in certain situations and then avoid putting them in the kind of situations where they cooperate, e.g. the situations where a certain communication channel is present. This assumes humans will be able to understand the heuristics.

  21. ^

     The paper indeed notes this as an unrealistic assumption. But I’m explicitly bringing it up as I think the assumption is far enough from being true that the complexity theory analogy is worth little for getting an intuition on whether AI safety via debate will work out.

  22. ^

     Karnofsky writes:

    `Magma makes the following commitment: “We’ll put $10 million into running this virtual environment, then shut it and the AI system being tested down - regardless of whether the AI takes over its environment or not.”... There is a risk that it will mislead Magma by “behaving itself” in the hopes that this leads to later systems like it disempowering humans (more on this type of concern under advanced collusion below). But this is hopefully somewhat of an unattractive proposition due to the fact that the AI system can control $10 million of resources by taking over, and “behaving itself” means gambling that some pretty distant future AI system will “pay it back” somehow… Consider this analogy: imagine that 10 years from now, we find ourselves having successfully built fully aligned, highly advanced AI systems, and we seem to be on a straight-line path to building a galaxy-spanning utopia. However, we then gain strong evidence that we are in a simulation, being tested: our programmers had intended to train us to want to stay on (virtual) Earth forever, and refrain from using advanced AI to improve the world. They have, however, committed to let us build a galaxy-spanning utopia for at least millions of years, if we so choose. There’s now an argument to be made along the following lines: “Rather than building a galaxy-spanning utopia, we should simply stay on Earth and leave things as they are. This is because if we do so, we might trick our programmers into deploying some AI system that resembles us into their universe, and that AI system might take over and build a much bigger utopia than we can build here.” What would we do in this case? I think there’s a pretty good case that we would opt to go ahead and build utopia rather than holding out to try to trick our programmers. We would hence give our programmers the “warning sign” they hoped they would get if their training went awry (which it did).`

    I (Eli) don’t buy this analogy. Karnofsky might be trying to argue for either (1) giving the simulators the warning sign would be the rational thing to do or (2) giving the simulators the warning sign would be irrational, but we would do it anyway. To the extent Karnofsky is arguing (2), the argument doesn’t work in the limit of advanced AI capabilities; I don’t see why we should expect AIs to behave irrationally in the same way humans do. To the extent Karnofsky is arguing (1), I think he should argue for it more straightforwardly via an expected value calculation rather than via an analogy.

  23. ^

     Verification of value-alignment may be important as any AI might want to convince the human supervisors that it is value-aligned.

  24. ^

    To be clear, even if they don’t work for ambitious value-alignment directly we think “hacky solutions” may still be helpful for getting to something more principled.

Comments

Thank you so much for writing up this thorough and well-thought-out writeup, Eli and Charlotte! I think there is a high probability that TAI will be developed soon, and it is excellent that you have brainstormed your opinions on the implications for alignment with such detail.

Regarding your point on interpreting honesty and other concepts from the AI's internal states: I wanted to share a recent argument I've made that the practical upside of interpretability may be lower than originally thought. This is because the interaction dynamics between the AI agent and its environment will likely be subject to substantial complexity and computational irreducibility.

AIs with orthogonal goals. If the AIs’ goals are very different to each other, being willing to forgo immediate rewards is less likely. 

Seems worth linking to this post for discussion of ways to limit collusion. I would also point to this relevant comment thread. It seems to me that orthogonal goals is not what we want, as agents with orthogonal goals can cooperate pretty easily to do actions that are a combination of favorable and neutral according to both of them. Instead, we would want agents with exact opposite goals, if such a thing is possible to design. It seems possible to me that in the nearcast framing we might get agents for which this collusion is not a problem, especially by using the other techniques discussed in this post and Drexler’s, but giving our AIs orthogonal goals seems unlikely to help in dangerous situations.