I think the alignment problem looks different depending on the capability level of systems you’re trying to align. And I think that different researchers often have different capability levels in mind when they talk about the alignment problem. I think this leads to confusion. I’m going to use the term “regimes of the alignment problem” to refer to the different perspectives on alignment you get from considering systems with different capability levels.
(I would be pretty unsurprised if these points had all been made elsewhere; the goal of this post is just to put them all in one place. I’d love pointers to pieces that make many of the same points as this post. Thanks to a wide variety of people for conversations that informed this. If there’s established jargon for different parts of this, point it out to me and I’ll consider switching to using it.)
Different regimes:
- Wildly superintelligent systems
- Systems that are roughly as generally intelligent and capable as humans: they’re able to do all the important tasks as well as humans can, but they’re not wildly more generally intelligent.
- Systems that are less generally intelligent and capable than humans
Two main causes that lead to differences in which regime people focus on:
- Disagreements about the dynamics of AI development. Eg takeoff speeds. The classic question along these lines is whether we have to come up with alignment strategies that scale to arbitrarily competent systems, or whether we just have to be able to align systems that are slightly smarter than us, which can then do the alignment research for us.
- Disagreements about what problem we’re trying to solve. I think that there are a few different mechanisms by which AI misalignment could be bad from a longtermist perspective, and depending on which of these mechanisms you’re worried about, you’ll be worried about different regimes of the problem.
Different mechanisms by which AI misalignment could be bad from a longtermist perspective:
- The second species problem: We build powerful ML systems and then they end up controlling the future, which is bad if they don’t intend to help us achieve our goals.
  - To mitigate this concern, you’re probably most interested in the “wildly superintelligent systems” or “roughly human-level systems” regimes, depending on your beliefs about takeoff speeds and maybe some other stuff.
- Missed opportunity: We build pretty powerful ML systems, but because we can’t align them, we miss the opportunity to use them to help us with stuff, and then we fail to get to a good future.
  - For example, suppose that we can build systems that are good at answering questions persuasively, but we can’t make them good at answering them honestly. This is an alignment problem. It probably doesn’t pose an x-risk directly, because persuasive wrong answers to questions are probably not going to lead to the system accumulating power over time, they’re just going to mean that people waste their time whenever they listen to the system’s advice on stuff.
  - This feels much more like a missed opportunity than a direct threat from the misaligned systems. In this situation, the world is maybe in a more precarious situation than it could have been because of the things that we can harness AI to do (eg make bigger computers), but that’s not really the fault of the systems we failed to align.
  - If this is your concern, you’re probably most interested in the “roughly human-level” regime.
- We build pretty powerful systems that aren’t generally intelligent, and then they make the world worse somehow by some mechanism other than increasing their own influence over time through clever planning, and this causes humanity to have a bad ending rather than a good one.
  - For example, you might worry that if we can build systems that persuade much more easily than we can build systems that explain, then the world will have more bullshit in it and this will make things generally worse.
  - Another thing that maybe counts: if we deploy a bunch of AIs that are extremely vulnerable to adversarial attacks, then maybe something pretty bad will happen. It is not obvious to me that this should be considered an alignment problem rather than a capabilities problem.
  - It’s not obvious to me that any of these problems are actually that likely to cause a trajectory shift, but I’m not confident here.
  - Problems along these lines often feel like mixtures of alignment, capabilities, and misuse problems.
Here are some aspects of the alignment problem that are different in different regimes. I would love to make a longer list of these.
- Competence. If you’re trying to align wildly superintelligent systems, you don’t have to worry about any concern related to your system being incompetent.
For example, any concern related to the system not understanding what people will want. Human values are complex, but by assumption the superintelligence will be able to answer any question you’re able to operationalize about human values. Eg it will be able to massively outperform humans on writing ethics papers or highly upvoted r/AmItheAsshole comments.
Other examples of problems that people sometimes call alignment problems that aren’t a problem in the limit of competence include safe exploration and maybe competence-related aspects of robustness to distributional shift (see Concrete Problems in AI Safety).
Obviously there will be things that your system doesn’t know, and it will be prone to mistakes sometimes. But it will be superhumanly good at being calibrated re these things. For example, it should know how vulnerable to adversarial inputs it is, and it should know when it doesn’t understand something about human values, and so on.
- Ability to understand itself. A special case of competence: If your model is very powerful, it will probably be capable of answering questions like “why do you believe what you believe”, inasmuch as a good answer exists. And it probably also knows the answer to “are there any inputs on which you would do something that I wouldn’t like”. (Of course, it might be hard to incentivise it to answer these questions honestly.) It seems pretty plausible, though, that approximately human-level systems wouldn’t know the answers to these questions.
- Inscrutable influence-seeking plans. Maybe your system will take actions for the sake of increasing its own influence that you don’t understand. This is concerning because you won’t be able to notice it doing bad things.
I think that this basically definitionally isn’t a problem unless your system is at least roughly human-level. It will probably have inscrutable plans at lower capability levels, but it probably won’t be powerful enough to surreptitiously gain power. It’s somewhat of a problem in the human-level regime and becomes a much clearer problem when the system is much more capable than humans.
- Fundamentally inscrutable thoughts. Maybe your systems will engage in reasoning that uses concepts that humans can’t understand. For example, perhaps there are mathematical abstractions that make it much easier to do math, but which my puny brain is unable to understand. This would be unfortunate because then even if you had the ability to get the best possible human-understandable explanation of the cognitive process the system used to choose some course of action, you wouldn’t be able to spot check that explanation to determine whether the system was in fact doing something bad. Cf Inaccessible Information.
Sub-AGI systems probably will have inscrutable thoughts. But in the context of superhuman systems, I think we need to be more concerned by the possibility that it’s performance-uncompetitive to restrict your system to only take actions that can be justified entirely with human-understandable reasoning.
The “competence” and “ability to understand itself” properties make the problem easier; the latter two make the problem harder.
I’m currently most interested in the second species problem and the missed opportunity problem. These days I usually think about the “wildly superintelligent” and “roughly human-level” regimes. I normally think about alignment research relevant to the situations where systems are arbitrarily competent, are prone to inscrutable influence-seeking plans, and don’t have inscrutable thoughts (mostly because I don’t know of good suggestions for getting around the inaccessible information problem).
Similarly to johnswentworth: My current impression is that the core alignment problems are the same and manifest at all levels. Often the sub-human version just looks like a toy version of the scaled-up problem, and the main difference is that, in the sub-human version, you can often solve the problem for practical purposes by plugging in a human at some strategic spot. (While I don't think there are deep differences in the alignment problem space, I do think there are differences in the "alignment solutions" space, where you can use non-scalable solutions, or in the risk space, where the dangers are small because the systems are stupid.)
I'm also unconvinced about some of the practical claims about differences for wildly superintelligent systems.
One crucial concern related to "what people want" is that this seems underdefined, unstable in interactions with wildly superintelligent systems, and prone to problems with the scaling of values within systems where intelligence increases. By this line of reasoning, if the wildly superintelligent system is able to answer these sorts of questions "in a way I want", it very likely must already be aligned. So it feels like part of the worry was assumed away. Paraphrasing the questions about human values again, one may ask "how did you get to the state where you have this aligned wildly superintelligent system which is able to answer questions about human values, as opposed to e.g. overwriting what humans believe about themselves with its own non-human-aligned values?".
Ability to understand itself seems a special case of competence: I can imagine systems which are wildly superhuman in their ability to understand the rest of the world, but pretty mediocre at understanding themselves, e.g. due to some problems with recursion, self-reference, reflection, or different kinds of computations being used at various levels of reasoning. As a result, it seems unclear whether the ability to clearly understand itself is a feature of all wildly superhuman systems. (Toy counterexample: imagine a device which would connect someone in ancient Greece with our modern civilization, and our civilization dedicating about 10% of global GDP to answering questions from this guy. I would argue this device is for most practical purposes wildly superhuman compared to this individual guy in Greece, but at the same time bad at understanding itself.)
Fundamentally inscrutable thoughts seem like something which you can study with present-day systems as toy models. E.g., why does AlphaZero believe something is a good go move? Why does a go grandmaster believe something is a good move? What counts as a 'true explanation'? Who is the recipient of the explanation? Are you happy with an explanation of the algorithm like 'upon playing myriad games, my general function approximator estimates that the expected value of this branch of an unimaginably large choice tree is larger than that of other branches'? If yes, why? If no, why not?
Inscrutable influence-seeking plans also seem to be a present problem. E.g., if there are already some complex influence-seeking patterns now, how would we notice?
This is what I was referring to with
The superintelligence can answer any operationalizable question about human values, but as you say, it's not clear how to elicit the right operationalization.