Based on the method I used in "Robustness to Fundamental Uncertainty in AGI Alignment", we can analyze various proposals for building aligned AI, determine which appear to best trade off false positive risk against false negative risk, and recommend those which are conservatively safest. We can do this at various levels of granularity and for various levels of specificity of proposed alignment methods. What I mean by that is we can consider AI alignment as a whole or various sub-problems within it, like value learning and inner alignment, and we can consider high-level approaches to alignment or more specific proposals with more of the details worked out. For this initial post in what may become a series, I'll compare high-level approaches to addressing alignment as a whole.
Some ground rules:
By "alignment" I mean "caring about the same things".
If you like, I attempted to be more formal about this in "Formally Stating the AI Alignment Problem", but here I'm going to prefer the less formal statement to avoid getting tripped up by being overly specific in just one part of the analysis when we're not being very specific about the alignment methods.
I say "cares about the same things" rather than "aligned with human interests" to both taboo "align" in the definition of "alignment" and reflect my philosophical leaning that "caring" is the fundamental human activity we are interested in when we express a desire to build aligned AI.
Care synonyms for the scrupulous: interest, purpose, telos, concern, value, and preference, so long as we use these words in their everyday sense to point at a broad category of human activity and not as jargon about choice (even though this broad category does influence how we choose).
By "high-level approaches" I don't mean specific mechanisms or methods for building aligned AI, but general directions or ways in which specific methods are intended to work.
Abstractions are complicated, and there's no explicit principle by which I'm making the "high-level" cutoff other than some heuristic about the level of abstraction that lets me lump all of the methods that seem more similar to each other than different under the same approach.
If you don't want to dive into the details of the method I'm using here, the tl;dr is that we have more to lose from failure than to gain from success when developing technologies that create existential risks, so we should prefer risk mitigation interventions with lower risks of false positives, all else equal. Thus I look for arguments that let us give at least an ordinal ranking of interventions in terms of false positive risk.
By "false positive risk" I mean the risk that we think an intervention will work, we try it, and it fails in a way that results in an outcome as bad as or worse than if we had tried no intervention or had not developed the technology at all because it was deemed too risky.
By comparison, a false negative here is failing to try an intervention that would have worked because we incorrectly ruled it out: we thought it wouldn't work, it seemed too risky, or we made some other error in judging it.
To find that ordinal ranking I primarily look for arguments that allow one intervention to dominate another in terms of false positive risks (see the section on meta-ethical uncertainty in the original paper for an example of this kind of reasoning) or show that all else is not equal and the safest choice is not necessarily the one with the lowest false positive risk (see the section on mental phenomena in the original paper for an example).
To get there I'll consider the false positive risks associated with each approach, then look for arguments and evidence that will allow us to compare these risks.
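To make the ranking step concrete, here's a minimal toy sketch of how pairwise "lower false positive risk" judgments can be assembled into an ordinal ranking, with a check that the judgments don't form a cycle. This is purely illustrative on my part (the function name and representation are made up); the substantive work is in the arguments that justify each pairwise judgment, not in the code.

```python
def ordinal_ranking(approaches, lower_risk_than):
    """Order approaches from lowest to highest false positive risk.

    `lower_risk_than` is a set of (a, b) pairs meaning "a has lower false
    positive risk than b". Raises ValueError if the judgments are circular,
    which would mean there's an error in our reasoning somewhere.
    """
    remaining = set(approaches)
    ordering = []
    while remaining:
        # "Minimal" approaches: nothing still unranked is judged lower-risk than them.
        minimal = [a for a in remaining
                   if not any((b, a) in lower_risk_than for b in remaining if b != a)]
        if not minimal:
            raise ValueError("circular risk judgments -- reasoning error somewhere")
        ordering.extend(sorted(minimal))  # sorted only to keep ties deterministic
        remaining -= set(minimal)
    return ordering
```

The code contributes nothing beyond bookkeeping; the entire burden falls on establishing each dominance argument.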
I'll be comparing three high-level approaches to AI alignment that I term Iterated Distillation and Amplification (IDA), Ambitious Value Learning (AVL), and Normative Embedded Agency (NEA). By each of these, for the purposes of this post, I'll mean the following, which I believe captures the essence of these approaches but obviously leaves out lots of specifics about various ways they may be implemented.
IDA: Build an AI, have it interact with humans to form a more aligned (and generally more capable) AI-human system, then build a new AI with the same capabilities as the AI-human system. Repeat.
AVL: Build an AI that learns what humans care about by observing them (their behavior, their reports, and whatever other evidence is available) and aligns itself to what it learns.
NEA: Build an AI that follows norms about how it makes decisions such that those decisions are aligned with humanity's interests.
The closest we have to specific proposals within this approach are MIRI's ideas about Highly Reliable Agent Designs (HRAD) and error-tolerant agent design, perhaps combined with something like Coherent Extrapolated Volition (CEV).
Otherwise NEA seems like a natural extension of the kind of work MIRI is doing, viz. it might be possible to bake a decision algorithm into an AI such that it achieves alignment by both computing long enough over enough detail and being programmed, via embodying a particular decision theory, to effectively care about the same things humans do.
The reality is that NEA as described here is a bit of a straw category that no one is likely to try, as in isolation it's like the GOFAI approach to alignment, and more realistic approaches would combine insights from this cluster with other methods. Nonetheless, I'll let it stand for illustrative purposes, since I care more in this post about demonstrating the method than about providing immediately actionable advice.
I do not think this is an exhaustive categorization of all possible alignment schemes; rather these are three that I have enough familiarity with to reason about and consider to be the most promising approaches people are investigating. There is at least a fourth approach I'm not considering here because I haven't thought about it enough—building AI that is aligned because it emulates how human brains function—and probably others different enough that they warrant their own category.
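Since IDA is the most procedural of the three descriptions above, a toy sketch of its loop structure may help fix intuitions. This is my own illustrative pseudocode, not any published implementation: `amplify` and `distill` stand in for the genuinely hard parts and are passed in as opaque functions.

```python
from typing import Callable, TypeVar

System = TypeVar("System")  # whatever stands in for "an AI" or "an AI-human system"

def ida_loop(initial_ai: System,
             amplify: Callable[[System], System],
             distill: Callable[[System], System],
             iterations: int) -> System:
    """Toy control flow for the IDA description above."""
    ai = initial_ai
    for _ in range(iterations):
        # Amplify: humans interact with and oversee the AI, forming a more
        # aligned (and more capable) AI-human system.
        amplified = amplify(ai)
        # Distill: build a new AI with the capabilities of that AI-human system.
        ai = distill(amplified)
    return ai
```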
Sources of False Positive Risk
For each of the three approaches we must consider their false positive risks. Once we have done that, we can consider the risks of each approach relative to the others. Remember, here I'll be basing my analysis on my summary of these approaches given above, not on any specific alignment proposal.
I'll give some high level thoughts on why each may fail and then make a specific statement summing each one up. I won't go into a ton of detail, both because in some cases others already have (and I'll link when that's the case) and because these seem like fairly obvious observations that most readers of the Alignment Forum will readily agree with. If that's not the case, please bring it up in the comments and we can go into more detail.
IDA
Humans may fail to apply sufficient pressure on the combined system to produce a more aligned system after distillation.
Humans may fail to be able to exert enough control over the combined system to produce a more aligned system during amplification.
Competing pressures and incentives during amplification may swamp alignment such that humans themselves choose to prefer less aligned AI.
More specifically, IDA may fail because humans are unable to express themselves through their actions in ways that constrain the behavior of an AI during amplification such that the distilled AI for the next iteration is unable to ever adequately incorporate what humans care about, resulting in unaligned AI that may fail in various standard ways (treacherous turn, etc.).
AVL
There may be no value learning norm that allows an AI to learn and align itself to arbitrary humans.
There may be insufficient information available to an AI to discover what humanity actually cares about, i.e. not enough detail in observed behavior, brain scans, reports from humans, etc.
AVL approaches may be prone to overfitting the observed data such that they can't generalize to care about the same things humans would care about in novel situations.
More specifically, AVL may fail because the AI is unable, for various reasons, to adequately learn/discover what humans care about to become aligned.
NEA
We may fail to discover a decision theory that makes an arbitrary agent share human concerns, or no such decision theory may exist.
We may think we have discovered a decision theory that can produce aligned AI and be wrong about it.
We may not be able to incrementally verify progress towards alignment via NEA because it's trying to hit a small target many steps down the line by setting the right initial conditions.
More specifically, NEA may fail because the decision theory used does not sufficiently constrain or incentivize the agent to become or stay aligned, and we may not be able to predict whether it will fail or succeed sufficiently far in advance to act.
Comparisons
Given the risks of false positives identified above, we can now look to see if we can rank the approaches in terms of false positive risk by assessing if any of those risks dominate the others, i.e. the false positive risks associated with one approach necessarily pose greater risks and thus higher chance of failure than those associated with another. I believe we can, and I make the following arguments.
risk(AVL) < risk(IDA)
IDA suffers from the same false positive risks as AVL, in that for IDA to work an AI must infer what humans care about from observing them, but it adds the further risk of optimizing not only for what humans care about but also for other things as the AI increases in capabilities via iteration. Thus IDA is strictly riskier than AVL in terms of false positives.
risk(IDA) < risk(NEA)
IDA depends on humans and AIs moving towards alignment iteratively via small steps, with regular opportunities to stop and check before the AI becomes too powerful to control, whereas NEA does not necessarily afford such opportunities. Thus NEA has higher false positive risk because it must get things right by predicting a longer chain of outcomes in advance rather than incrementally making smaller predictions with opportunities to stop if the AI becomes less aligned.
risk(AVL) < risk(NEA)
This is to double check that we are correct that our risk assessments are transitive, since if this fails we end up with a circular "ordering" and have an error in our reasoning somewhere.
AVL requires that we determine a value learning norm, but otherwise expects to achieve alignment via observing humans, whereas NEA requires determining norms not just for value learning (which I believe would be implied by having a decision theory that could produce an aligned AI) but for all decision processes. Thus NEA requires "writing a larger program" or "defining a more complex algorithm", which is more likely to fail, all else equal, since it has more "surface area" where failure may arise.
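As a sanity check on that, here are the three pairwise judgments above run through a tiny consistency check, in the same spirit as the ordinal_ranking sketch earlier (again purely illustrative; the substantive work is in the arguments, not the code):

```python
# "a has lower false positive risk than b" for each pair argued above.
judgments = {("AVL", "IDA"), ("IDA", "NEA"), ("AVL", "NEA")}

# No pair is asserted in both directions, and every ordering implied by
# chaining two judgments is itself among the judgments, so the relation is
# a consistent (acyclic, transitive) ordering: risk(AVL) < risk(IDA) < risk(NEA).
assert not any((b, a) in judgments for (a, b) in judgments)
assert all((a, c) in judgments
           for (a, b) in judgments
           for (b2, c) in judgments if b2 == b)
```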
Conclusions
Based on the above analysis, I'd argue that Ambitious Value Learning is safer than Iterated Distillation and Amplification, which is in turn safer than Normative Embedded Agency, as approaches to building aligned AI in terms of false positive risk, all else equal. In short, risk(AVL) < risk(IDA) < risk(NEA), or if you like, AVL is safer than IDA, which is safer than NEA, based on false positive risk.
I think there's a lot that could be better about the above analysis. In particular, it's not very specific, and you might argue that I stood up straw versions of each approach that I then knocked down in ways that are not indicative of how specific proposals would work. I didn't get more specific because I'm more confident I can reason about high level approaches than details about specific proposals, and it's unclear which specific proposals are worth learning in enough detail to perform this evaluation, so as a start this seemed like the best option.
Also we have the problem that NEA is not as real an approach as IDA or AVL, with the research I cited as the basis for the NEA approach more likely to augment the IDA or AVL approaches than to offer an alternative to them. Still, I find including the NEA "approach" interesting if for no other reason than that it points to a class of solutions researchers of the past would have proposed if they were trying to build aligned GOFAI, for example.
Finally, as I said above, my main goal here is to demonstrate the method, not to strongly make the case that AVL is safer than IDA (even though on reflection I personally believe this). My hope is that this inspires others to do more detailed analyses of this type on specific methods to recommend the safest-seeming alignment mechanisms, or that it generates enough interest that I'm encouraged to do that work myself. That said, feel free to fight out AVL vs. IDA at the object level in the comments if you like, but if you do, at least try to do so within the framework presented here.