Iason Gabriel’s 2020 article Artificial Intelligence, Values, and Alignment offers a philosophical perspective on what the goal of alignment actually is, and on how we might accomplish it. In the best spirit of modern philosophy, it provides a helpful framework for organizing what has already been written about the levels at which we might align AI systems, and draws a neat set of connections between concepts in AI alignment and concepts in modern philosophy.
Goals of alignment
Gabriel identifies six levels at which we might define what it means to align AI with something:
- Instructions: the agent does what I instruct it to do.
- Expressed intentions: the agent does what I intend it to do.
- Revealed preferences: the agent does what my behaviour reveals I prefer.
- Informed preferences or desires: the agent does what I would want it to do if I were rational and informed.
- Interest or well-being: the agent does what is in my interest, or what is best for me, objectively speaking.
- Values: the agent does what it morally ought to do, as defined by the individual or society.
Schemas like this are helpful because they can "pop us out" from our unexamined paradigms. If we are, for example, having a discussion about building AI from inside the "revealed preferences" paradigm, it is good to know that we are having a discussion from inside that paradigm. It is a virtue of modern philosophy to always be asking what unexamined paradigm we are inside, and to push us to at least see that we are inside such-and-such a paradigm, in order that we can examine it and decide whether to keep working within it.
In that spirit, I would like to offer a conceptualization of the paradigm that I think all six of these levels are within, in order that we might examine that paradigm and decide whether we wish to keep working within it. It seems to me that we presuppose that when we deploy AI, we will pass agency away from humans and into the hands of the AI, at least for as long as it takes the AI to execute our instructions, intentions, preferences, interest, or values. We imagine our future AIs as assistants, genies, or agents with which we are going to have some initial period of contact, followed by a period during which these powerful agents go off and do our bidding, followed perhaps by later iterations in which these agents come back for further instructions, intentions, preferences, interests, or values. We understandably find it troubling to consider turning so much of our agency over to an external entity, yet most work in AI alignment is about how to safely navigate this hand-off of agency, and it seems to me that there is relatively little discussion of whether we should be doing all of our thinking on the assumption of a hand-off of agency. I will call this paradigm that I think we are inside the Agency Hand-off Paradigm:
The Agency Hand-off Paradigm
A few brief notes on how this relates to existing work in AI alignment:
- The AI alignment sub-field of corrigibility is concerned with the design of AI that we can at least switch off if we later regret the instructions, intentions, preferences, interests, or values that we gave it. This is of course a good property for AI to have if we are going to hand off agency to it, but we seem to be inside the Agency Hand-off Paradigm almost by default.
- Stuart Russell’s work on interaction games is about transmitting instructions / intentions / preferences / interests / values from humans to AIs as an ongoing dialog rather than a one-shot up-front data dump, but this work still assumes that agency is going to be handed over to our AIs; it’s just that the arrow from "Human" to "AI" in the figure above becomes a sequence of arrows.
- Eliezer Yudkowsky’s writing on coherent extrapolated volition and Paul Christiano’s writing on indirect normativity are both concerned with extracting values from humans in a way that bypasses our limited ability to articulate our own values. Yet both bodies of work presuppose that there is going to be some phase during which we extract values from humans, followed by a phase during which our AIs take actions on the basis of these values. Under this assumption we indeed ought to be very concerned about getting the value-extraction step right, since the whole future of the world hangs on it.
- Significant portions of Nick Bostrom’s book Superintelligence were concerned with the dangers of open-ended optimization over the world. It seems to me that the basic reason to be concerned about powerful optimizers in the first place is that they are precisely the category of system that takes agency away from humans.
But perhaps there is room to question the Agency Hand-off Paradigm. I would very much like to see proposals for AI alignment that escape completely from the assumption that we are going to hand off agency to AI. What would it look like to have powerful intelligent systems that increased rather than decreased the extent to which humans have agency over the future?
Think of a child playing in a sand pit. The child’s parent has constructed the sand pit for the child and will keep the child safe. If the child happens to find, say, a shard of glass, then the parent may take it away. But for the most part the parent will just let the child play and learn and grow. It would be a little strange to think of the parent as taking instructions, intentions, preferences, interests, or values from the child and then assuming agency over the arrangement of sand in the sand pit on that basis. Yes, the parent acts on a sense of what is in the child’s best interests in taking away the shard of glass, but not because the parent understands the child’s intentions for how all the sand should ultimately be arranged and is accelerating things in that direction; rather, because the shard of glass threatens the child’s own agency in a way that the child cannot account for in the short term. In the long term the parent will help the child to grow in such a way that they will be able to safely handle sharp objects on their own, and the parent will eventually fade away from the child’s life completely. The long-run flow of agency is towards the child, not towards the parent. Is it not possible that we could build AI that ensures that agency flows towards us, not away from us, over the long run?
Um, bad?
Humans aren't fit to run the world, and there's no reason to think humans can ever be fit to run the world. Not unless you deliberately modify them to the point where the word "human" becomes unreasonable.
The upside of AI depends on restricting human agency just as much as the downside does.
You seem to be relying on the idea that someday nobody will need to protect that child from a piece of glass, because the child's agency will have been perfected. Someday the adult will be able to take off all the restraints, stop trying to restrict the child's actions at all, and treat the child as what we might call "sovereign".
... but the example of the child is inapt. A child will grow up. The average child will become as capable of making good decisions as the average adult. In time, any particular child will probably get better than any particular adult, because the adult will be first to age to the point of real impairment.
The idea that a child will grow up is not a hope or a wish; it's a factual prediction based on a great deal of experience. There's a well-supported model of why a child is the way a child is and what will happen next.
On the other hand, the idea that adult humans can be made "better agents", whether in the minimum, the maximum, or the mean, is a lot more like a wish. There's just no reason to believe it. Humans have been talking about the need to get wiser for as long as there have been records, and have little to show for it. What changes there have been in individual human action are arguably more due to better material conditions than to any improved ability to act correctly.
Humans may have improved their collective action. You might have a case to claim that governments, institutions, and "societies" take better actions than they did in the past. I'm not saying they actually do, but maybe you could make an argument for it. It still wouldn't matter. Governments, institutions and "societies" are not humans. They're instrumental constructs, just like you might hope an AI would be. A government has no more personality or value than a machine.
Actual, individual humans still have not improved. And even if they can improve, there's no reason to think that they could ever improve so much that an AI, or even an institution, could properly take all restraints off of them. At least not if you take radical mind surgery off the table as a path to "improvement".
Adult humans aren't truly sovereign right now. You have comparatively wide freedom of action as an adult, but there are things that you won't be allowed to do. There are even processes for deciding that you're defective in your ability to exercise your agency properly, and for taking you back to childlike status.
The collective institutions spend a huge amount of time actively reducing and restricting the agency of real humans, and a bunch more time trying to modify the motivations and decision processes underlying that agency. They've always done that, and they don't show any signs of stopping. In fact, they seem to be doing it more than they did in the past.
Institutions may have fine-tuned how they restrict individual agency. They may have managed to do it more when it helps and less when it hurts. But they haven't given it up. Institutions don't make individual adults sovereign, not even over themselves and definitely not in any matter that affects others.
It doesn't seem plausible that institutions could keep improving outcomes if they did make individuals completely sovereign. So if you've seen any collective gains in the past, those gains have relied on constructed, non-human entities taking agency away from actual humans.
In fact, if your actions look threatening enough, even other individuals will try to restrain you, regardless of the institutions. None of us is willing to tolerate just anything that another human might decide to do, especially not if the effects extend beyond that person.
If you change the agent with the "upper hand" from an institution to an AI, there's no clear reason to think that the basic rules change. An AI might have enough information, or enough raw power, to make it safe to allow humans more individual leeway than they have under existing institutions... but an AI can't get away with making you totally sovereign any more than an institution can, or any more than another individual can. Not unless "making you sovereign" is itself the AI's absolute, overriding goal... in which case it shouldn't be waiting around to "improve" you before doing so.
There's no point at which an AI with a practical goal system can tell anything recognizably human, "OK, you've grown up, so I won't interfere if you want to destroy the world, make life miserable for your peers, or whatever".
As for giving control to humans collectively, I don't think it's believable that institutions could improve to the point where a really powerful and intelligent AI could believe that those institutions would achieve better outcomes for actual humans than the AI could achieve itself. Not on any metric, including the amount of personal agency that could be granted to each individual. The AI is likely to expect to outperform the institutions, because the AI likely would outperform the institutions. Ceding control to humans collectively would just mean humans individually losing more agency... and more of other good stuff, too.
So if you're the AI, and you want to do right by humans, then I think you're going to have to stay in the saddle. Maybe you can back way, way off if some human self-modifies to become your peer, or your superior... but I don't think that critter you back off from is going to be "human" any more.
If the humans in the container succeed in becoming wiser, then it is hopefully wiser for us to leave this decision up to them than to preemptively make it now (and so I think the situation is even better than it sounds superficially).