This is a post about my own confusions. It seems likely that other people have discussed these issues at length somewhere, and that I am not up with current thoughts on them, because I don’t keep good track of even everything great that everyone writes. I welcome anyone kindly directing me to the most relevant things, or if such things are sufficiently well thought through that people can at this point just correct me in a small number of sentences, I’d appreciate that even more.
~
The traditional argument for AI alignment being hard is that human value is ‘complex’ and ‘fragile’. That is, it is hard to write down what kind of future we want, and if we get it even a little bit wrong, most futures that fit our description will be worthless.
The illustrations I have seen of this involve a person trying to write down a description of value, conceptual-analysis style, and failing to include things like ‘boredom’ or ‘consciousness’, and so getting a universe that is highly repetitive, or unconscious.
I’m not yet convinced that this is world-destroyingly hard.
Firstly, it seems like you could do better than imagined in these hypotheticals:
- (These thoughts are from a while ago.) If instead you used ML to learn what ‘human flourishing’ looked like in a bunch of scenarios, I expect you would get something much closer than if you tried to specify it manually. Compare manually specifying what a face looks like and then generating examples from your description, versus using modern ML to learn what faces look like and generate them (a rough sketch of this comparison follows this list).
- Even in the manual-description case, if you had, say, a hundred people spend a hundred years writing a very detailed description of what is good, instead of a writer spending an hour imagining ways that a more ignorant person might mess up if they spent no time on it, I could imagine it actually being pretty close. I don’t have a good sense of how far away it is.
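To make the face comparison concrete, here is a minimal sketch, not from the original post, using an eigenface-style PCA model as a crude stand-in for ‘modern ML’; the dataset, pixel coordinates, and parameter values are all my own illustrative choices.

```python
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

# "Manual specification" approach: a few hand-written rules about where
# the eyes and mouth go (analogous to writing values down by hand).
def manually_specified_face(size=64):
    img = np.full((size, size), 0.8)
    img[20:24, 18:26] = 0.1   # left eye
    img[20:24, 38:46] = 0.1   # right eye
    img[44:48, 24:40] = 0.2   # mouth
    return img                # ...and no nostrils, of course

# "Learned" approach: fit a low-dimensional model to real faces and then
# sample from it (analogous to learning values from examples).
faces = fetch_olivetti_faces()               # 400 grayscale 64x64 faces
pca = PCA(n_components=100).fit(faces.data)
codes = pca.transform(faces.data)
new_code = np.random.normal(codes.mean(axis=0), codes.std(axis=0))
learned_face = pca.inverse_transform(new_code.reshape(1, -1))[0].reshape(64, 64)
```

The PCA sample is imperfect, but it inherits facelike structure from the data in a way the hand-written rules don’t, which is the shape of the comparison the bullet is gesturing at.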
I agree that neither of these would likely get you to exactly human values.
But secondly, I’m not sure about the fragility argument: that if there is basically any distance between your description and what is truly good, you will lose everything.
This seems to be based on a) a few examples of discrepancies between written-down values and real values where the written-down values entirely exclude something, and b) the assumption that there is a fast takeoff, so that the relevant AI keeps its values forever and takes over the world.
My guess is that values learned with ML, even if still somewhat off from human values, are much closer, in the sense of not destroying all value in the universe, than ones that a person tries to write down. Like, the kinds of errors people have used to illustrate this problem (forgetting to put in ‘consciousness is good’) are like forgetting to say that faces have nostrils when trying to specify what a face is like, whereas a modern ML system’s imperfect impression of a face seems more likely to meet my standards for ‘very facelike’ (most of the time).
Perhaps a bigger thing for me though is the issue of whether an AI takes over the world suddenly. I agree that if that happens, lack of perfect alignment is a big problem, though not obviously an all-value-nullifying one (see above). But if it doesn’t abruptly take over the world, and merely becomes a large part of the world’s systems, with ongoing ability for us to modify it, modify its roles in things, and make new AI systems, then the question seems to be how forcefully the misalignment pushes us away from good futures relative to how forcefully we can correct this. And, in the longer run, how well we can correct it in a deep way before AI does come to be in control of most decisions. So something like the speed of correction vs. the speed of AI influence growing.
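To illustrate what kind of question this is, here is a toy back-of-the-envelope model of ‘speed of correction vs. speed of AI influence growing’; it is my own sketch, not anything from the post, and every parameter and functional form in it is made up for illustration.

```python
# Toy model: misalignment drifts upward in proportion to how much influence
# somewhat-misaligned AI has, while human correction efforts push it back
# in proportion to how much room humans still have to adjust things.

def simulate(influence_growth=0.05, drift_per_influence=0.02,
             correction_rate=0.03, steps=200):
    influence = 0.01    # fraction of decisions made by AI systems
    misalignment = 0.1  # distance between current trajectory and "good"
    for _ in range(steps):
        influence = min(1.0, influence * (1 + influence_growth))
        misalignment += drift_per_influence * influence
        misalignment -= correction_rate * (1 - influence) * misalignment
        misalignment = max(0.0, misalignment)
    return influence, misalignment

for growth in (0.02, 0.05, 0.10):
    inf, mis = simulate(influence_growth=growth)
    print(f"influence growth {growth:.2f}: final misalignment {mis:.3f}")
```

Nothing hangs on these particular forms; the only point is that the outcome turns on the relative sizes of ordinary-looking rate parameters.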
These are empirical questions about the scales of different effects, rather than questions about whether a thing is analytically perfect. And I haven’t seen much analysis of them. On my own quick judgment, it’s not obvious that they look bad.
For one thing, these dynamics are already in place: the world is full of agents and more basic optimizing processes that are not aligned with broad human values (most individuals to a small degree, some strange individuals to a large degree, corporations, competitions, the dynamics of political processes). It is also full of forces for aligning them individually and stopping the whole show from running off the rails: law, social pressures, adjustment processes for the implicit rules of both of these, individual crusades. The adjustment processes themselves are not necessarily perfectly aligned; they are just overall forces for redirecting toward alignment. And in fairness, this is already pretty alarming. It’s not obvious to me that imperfectly aligned AI is likely to be worse than the currently misaligned processes, or even that it won’t be a net boon for the side of alignment.
So then the largest remaining worry is that it will still gain power fast, and correction processes will be slow enough that its somewhat misaligned values will be set in forever. But it isn’t obvious to me that by that point it isn’t sufficiently well aligned that we would recognize its future as a wondrous utopia, just not the very best wondrous utopia that we would have imagined if we had really carefully sat down and imagined utopias for thousands of years. This again seems like an empirical question of the scale of different effects, unless there is an argument that some effect will be totally overwhelming.
I wonder if Paul Christiano ever wrote down his take on this, because he seems to agree with Eliezer that using ML to directly learn and optimize for human values will be disastrous, and I'm guessing that his reasons/arguments would probably be especially relevant to people like Katja Grace, Joshua Achiam, and Dario Amodei.
I myself am somewhat fuzzy/confused/not entirely convinced about the "complex/fragile" argument and even wrote kind of a counter-argument a while ago. I think my current worries about value learning or specification have less to do with the "complex/fragile" argument and more to do with what might be called "ignorance of values" (to give it an equally pithy name), which is that humans just don't know what our real values are (especially as applied to unfamiliar situations that will come up in the future), so how can AI designers specify them, or how can AIs learn them?
People try to get around this by talking about learning meta-preferences, e.g., preferences for how to deliberate about values, but those aren't some "values" that we already have and that the AI can just learn; instead, figuring out what kinds of deliberation would be better than other kinds, or good enough to eventually lead to good outcomes, is a big (and I think very hard) philosophical and social science/engineering project. (ETA: See also this comment.)
My own worry is less that "imperfectly aligned AI is likely to be worse than the currently misaligned processes" and more that the advent of AGI might be the last good chance for humanity to get alignment right (including addressing the "human safety problem"), and if we don't do a good enough job (even if we improve on the current situation in some sense), we'll be largely stuck with the remaining misalignment because there won't be another opportunity like it. ETA: A good slogan for this might be "AI risk as the risk of missed opportunity".
I'm not entirely sure I understand this sentence, but this post might be relevant here: https://www.lesswrong.com/posts/Qz6w4GYZpgeDp6ATB/beyond-astronomical-waste.