Thanks to Rebecca Gorman for discussions that led to these insights.
How can you get a superintelligent AI aligned with human values? There are two pathways that I often hear discussed. The first sees a general alignment problem - how to get a powerful AI to safely do anything - which, once solved, lets us point the AI towards human values. The second perspective is that we can only get alignment by targeting human values - these values must be aimed at, from the start of the process.
I'm of the second perspective, but I think it's very important to sort this out. So I'll lay out some of the arguments in its favour, to see what others think of it, and so we can best figure out the approach to prioritise.
More strawberry, less trouble
As an example of the first perspective, I'll take Eliezer's AI task, described here:
- "Place, onto this particular plate here, two strawberries identical down to the cellular but not molecular level." A 'safely' aligned powerful AI is one that doesn't kill everyone on Earth as a side effect of its operation.
If an AI accomplishes this limited task without going crazy, this shows several things:
- It is superpowered; the task described is beyond current human capabilities.
- It is aligned (or at least alignable) in that it can accomplish a task in the way intended, without wireheading the definitions of "strawberry" or "cellular".
- It is safe, in that it has not dramatically reconfigured the universe to accomplish this one goal.
Then, at that point, we can add human values to the AI, maybe via "consider what these moral human philosophers would conclude if they thought for a thousand years, and do that".
I would agree that, in most cases, an AI that accomplished that limited task safely would be aligned. One might quibble that it's only pretending to be aligned, and preparing a treacherous turn. Or maybe the AI was boxed in some way and accomplished the task with the materials at hand within the box.
So we might call an AI "superpowered and aligned" if it accomplished the strawberry copying task (or a similar one) and if it could dramatically reconfigure the world but chose not to.
Values are needed
I think that an AI could not be "superpowered and aligned" unless it is also aligned with human values.
The reason is that the AI can, and has to, interact with the world. It has the capability to do so, by assumption - it is not contained or boxed. It must do so because any agent affects the world, through chaotic effects if nothing else. A superintelligence is likely to have impacts in the world simply through its existence being known, and if the AI finds it efficient to have interactions with the world (e.g. ordering some extra resources) then it will do so.
So the AI can and must have an impact on the world. We want it to not have a large or dangerous impact. But, crucially, "dangerous" and "large" are defined by human values.
Suppose that the AI realises that its actions have slightly imbalanced the Earth in one direction, and that, within a billion years, this will cause significant deviations in the orbits of the planets, deviations it can estimate. Compared with that amount of mass displaced, the impact of killing all humans everywhere is a trivial one indeed. We certainly wouldn't want it to kill all humans in order to be able to carefully balance out its impact on the orbits of the planets!
There are very "large" impacts to which we are completely indifferent (chaotic weather changes, the above-mentioned change in planetary orbits, the different people being born as a consequence of different people meeting and dating across the world, etc.) and other, smaller, impacts that we care intensely about (the survival of humanity, people's personal wealth, certain values and concepts going forward, key technological innovations being made or prevented, etc.). If the AI accomplishes its task with a universal constructor, or by unleashing hordes of nanobots that gather resources from the world (without disrupting human civilization), it still has to decide whether to allow humans access to the constructors or nanobots after it has finished copying the strawberry - and which humans to allow this access to.
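As a toy illustration of the contrast drawn above, here is a minimal sketch in which the same impacts are ranked twice: once by raw physical scale, once by how much humans care. Everything in it is an assumption made for illustration: the `Impact` class, the two ranking functions, and especially the numbers, which are unitless placeholders rather than estimates of anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    name: str
    physical_scale: float  # hypothetical, unitless: how much matter is affected
    human_weight: float    # hypothetical, unitless: how much humans care


# Placeholder values chosen only to mirror the examples in the text;
# they are not measurements or estimates.
IMPACTS = [
    Impact("planetary orbits shifted", physical_scale=1e9, human_weight=0.01),
    Impact("chaotic weather changes", physical_scale=1e6, human_weight=0.02),
    Impact("all humans killed", physical_scale=1.0, human_weight=1.0),
]


def rank_by_physical_scale(impacts):
    """Order impacts from largest to smallest by physical scale alone."""
    ordered = sorted(impacts, key=lambda i: i.physical_scale, reverse=True)
    return [i.name for i in ordered]


def rank_by_human_values(impacts):
    """Order impacts from largest to smallest by human-assigned weight."""
    ordered = sorted(impacts, key=lambda i: i.human_weight, reverse=True)
    return [i.name for i in ordered]


if __name__ == "__main__":
    print("By physical scale:", rank_by_physical_scale(IMPACTS))
    print("By human values:  ", rank_by_human_values(IMPACTS))
```

Under these made-up numbers the two rankings put opposite items first, which is the point of the argument: a definition of "large" that ignores human values would prioritise exactly the wrong impacts. The sketch says nothing about how such weights could actually be learned.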
So every decision the AI makes is a tradeoff in terms of its impact on the world. Navigating this requires it to have a good understanding of our values. It will also need to estimate the value of certain situations beyond the human training distribution - if only to avoid these situations. Thus a "superpowered and aligned" AI needs to solve the problem of model splintering, and to establish a reasonable extrapolation of human values.
Model splintering sufficient?
The previous sections argue that learning human values (including model splintering) is necessary for instantiating an aligned AI; thus the "define alignment and then add human values" approach will not work.
So, if you give this argument much weight, learning human values is necessary for alignment. I personally feel that it's also (almost) sufficient, in that the skill in navigating model splintering, combined with some basic human value information (as given, for example, by the approach here), is enough to get alignment even at high AI power.
Which path to pursue for alignment
It's important to resolve this argument, as the paths for alignment that the two approaches suggest are different. I'd also like to know if I'm wasting my time on an unnecessary diversion.
To be clear, my original claim was for hypothetical scenarios where the failure occurs because the AI didn't know human values, rather than cases where the AI knows what the human would want but a failure still occurs. (I didn't state this explicitly because I was replying to the post, which focuses specifically on the problem of not knowing all of human values.) I think most of your failures are of the latter type, and I wouldn't make a similar claim for such failures -- they seem plausible and worth attention. I do still think they are not as important as intent alignment. Talking about each one individually:
Mostly I'd hope that AI can tell what philosophy is optimized for persuasion, or at least is capable of presenting counterarguments persuasively as well. If your AI can't even tell what is persuasive then you're in trouble, but I'm not sure why we should expect that.
No, it just doesn't seem hugely terrible for a few people to lock in bad values, as long as the vast majority do not. (And I don't expect a large number of people to explicitly try to lock in their values.)
It seems odd to me that it would be competent enough at reasoning about simulations that an acausal threat can actually be made, but then not competent at reasoning about exotic philosophical cases, and I don't particularly expect this to happen.
I agree it is far from a sure thing.