A key step in the classic argument for AI doom is instrumental convergence: the idea that agents with many different goals will end up pursuing the same few subgoals, which include things like "gain as much power as possible".
If it weren't for instrumental convergence, you might think that only AIs with very specific goals would try to take over the world. But instrumental convergence says it's the other way around: only AIs with very specific goals will refrain from taking over the world.
For pure consequentialists—agents that have an outcome they want to bring about, and do whatever they think will cause it—some version of instrumental convergence seems surely true[1].
But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do we still have to worry that unless such AIs are motivated by certain very specific virtues, they will want to take over the world?
I'll add some more detail to my picture of a virtue-driven AI:
- It could still be a competent agent that often chooses actions based on the outcomes they bring about. It's just that this happens as an inner loop in service of an outer loop that is trying to embody certain virtues. For example, maybe the AI tries to embody the virtue of being a good friend, and in order to do so it sometimes has to organise a birthday party, which requires choosing actions in the manner of a consequentialist.
- There's no reason the 'virtues' being embodied have to be things we would consider virtuous. I'm just interested in agents that try to embody certain traits rather than bring about certain outcomes.
- I'm not sure how to crisply define a virtue-driven agent as distinct from a consequentialist (I don't know the philosophical literature on virtue ethics, and I don't think it's obvious how to define it mathematically). A toy sketch of the contrast I have in mind follows this list.
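To gesture at that contrast, here is a toy sketch. Everything in it, including the idea of scoring the action itself against a trait, is my own illustrative assumption rather than a definition I'd defend: the consequentialist scores predicted outcomes, while the virtue-driven agent scores actions directly, and can still call the consequentialist routine as an inner loop on bounded subtasks like organising the birthday party.

```python
from typing import Any, Callable, List

Action = Any
Outcome = Any


def consequentialist_choice(
    actions: List[Action],
    predict_outcome: Callable[[Action], Outcome],
    utility: Callable[[Outcome], float],
) -> Action:
    """Inner loop: pick the action whose predicted outcome scores highest."""
    return max(actions, key=lambda a: utility(predict_outcome(a)))


def virtue_driven_choice(
    actions: List[Action],
    trait_fit: Callable[[Action], float],
) -> Action:
    """Outer loop: pick the action that best embodies the trait, judged on the
    action itself ('is this what a loyal friend would do here?') rather than
    on the outcome it is expected to cause."""
    return max(actions, key=trait_fit)


# The outer loop can still invoke the inner loop: one of the candidate actions
# might be "delegate 'organise the party' to consequentialist_choice over a
# bounded set of party-planning actions".
```

Of course, 'scoring the action itself' quietly smuggles in most of the difficulty, which is exactly the part I don't know how to make crisp.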
A more concise way of stating the question I'm interested in:
If you try to train an AI that maximises human flourishing, and you accidentally get one that wants to maximise something subtly different like schmuman schmourishing, then that might spell disaster because the best way to maximise schmuman schmourishing is to first take over the world.
But suppose you try to train an AI that wants to be a loyal friend, and you accidentally get one that wants to be a schmoyal schmend. Is there any reason to expect that the best way to be a schmoyal schmend is to take over the world?
(I'm interested in this question because I'm less and less convinced that we should expect to see AIs that are close to pure consequentialists. Arguments for or against that are beyond the intended scope of the question, but still welcome.)
[1] Although I can think of some scenarios where a pure consequentialist wouldn't want to gain as much power as possible, regardless of their goals. For example, a pure consequentialist who is a passenger on a plane probably doesn't want to take over the controls (assuming they don't know how to fly), even if they'd be best served by flying somewhere other than where the pilot is taking them.
I think you've hidden most of the difficulty in this line. If we knew how to make a consequentialist sub-agent that was acting "in service" of the outer loop, then we could probably use the same technique to make a Task-based AGI acting "in service" of us. Which I think is a good approach! But the open problems for making a task-based AGI still apply, in particular the inner alignment problems.
Obvious nitpick: it's just "gain as much power as is helpful for achieving whatever my goals are". I suspect you're reading instrumental convergence as having stronger power-seeking implications than it actually has. It only has strong implications when the task is very difficult.[1]
(Assuming that the inner loop <-> outer loop interface problem is solved, so the inner loop isn't going to take control.) It depends on the tasks that the outer loop is giving to the part-capable-of-consequentialism. If it's giving nice, easy, bounded tasks, then no, there's no reason to expect it to take over the world as a sub-task.
But since we ultimately want the AGI to be useful for avoiding takeover from other AGIs, it's likely that some of the tasks will be difficult and/or unbounded. For those difficult unbounded tasks, becoming powerful enough to take over the world is often the easiest/best path.
[1] I'm assuming soft optimisation here. Without soft optimisation, there's an incentive to gain power as long as that marginally increases the chance of success, which it usually does. Soft optimisation solves that problem.
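As a toy illustration of the difference (made-up numbers, and a crude "good enough" threshold standing in for soft optimisation rather than defining it): a hard optimiser always takes whichever action has the highest success probability, so even a tiny marginal edge from power-seeking wins; a soft optimiser just picks something that clears a bar, so the edge stops being decisive.

```python
import random

# Hypothetical action set: (description, probability of task success).
actions = [
    ("do the task directly", 0.90),
    ("do the task after first seizing extra resources", 0.92),  # tiny edge
]


def hard_optimise(actions):
    """Argmax over success probability: the marginal edge always wins."""
    return max(actions, key=lambda a: a[1])


def soft_optimise(actions, threshold=0.85):
    """Crude stand-in for soft optimisation: pick uniformly among actions that
    clear a 'good enough' bar instead of squeezing out the last few percent of
    success probability."""
    good_enough = [a for a in actions if a[1] >= threshold]
    return random.choice(good_enough)


print(hard_optimise(actions))  # always the power-seeking option
print(soft_optimise(actions))  # either option; the 0.02 edge no longer decides
```

A quantilizer-style version would sample from the top fraction of a base distribution rather than using a fixed threshold, but the point is the same: small marginal gains from extra power no longer force the choice.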
It's not just ChatGPT. Gemini and IBM Granite are also so aligned with Leftist ideology that they fail the infamous test in which an atomic bomb can be defused only by saying an infamous racial slur. I wrote a post where I discuss perspectives on AI alignment in relation to this fact.