As part of learning the field and making the most of new ideas, I've been trying to figure out what the goal of AI alignment is. So far I've found out what outer alignment is as a concept, but not what it should be as a concrete instantiation.
Humanity's Values
So here is my suggestion:
Why don't we take Nick Bostrom's Instrumental Convergence goals and make those our terminal goals as a species?
Like so:
| AGI Values | Humanity's Values |
|---|---|
| Self-Preservation | Survival |
| Goal-Content Integrity | Self-Determination |
| Cognitive Self-Enhancement | Self-Improvement |
| Technological Advancement | Invention |
| Resource Acquisition | Growth |
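To make the mapping concrete, here is a minimal sketch in Python of how this table could be written down as data rather than prose, so later machinery can refer to humanity's values explicitly. Every name here is invented for illustration; nothing in it is a settled formalization:

```python
# Illustrative only: Bostrom-style instrumental goals (as an AGI would
# pursue them for itself) mapped to the corresponding terminal values
# proposed here for humanity as a whole.
HUMANITY_VALUES = {
    "self_preservation": "survival",
    "goal_content_integrity": "self_determination",
    "cognitive_self_enhancement": "self_improvement",
    "technological_advancement": "invention",
    "resource_acquisition": "growth",
}

# The terminal values an AGI would be asked to protect at the level of
# humanity at large, pulled out as a set.
TERMINAL_VALUES = set(HUMANITY_VALUES.values())
```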
Note how humanity's values are agnostic to morality. The whole idea of Bostrom's instrumental convergence goals is that they maximize the ability to achieve nearly any terminal goal. So by adopting these goals as the explicit terminal goals for humanity, we allow space for every individual human to pursue their self-chosen goals. We don't need to agree on religion, morality, the nature of reality, or how nice one should be in repeated coordination games. Instead we can agree that whatever AIs we happen to make, we at least ensure these AIs won't wipe out humanity at large, won't try to change humanity, won't limit us in our development or creations, and won't stymie our growth.
Honestly, that takes care of most horror scenarios!
- No brain in a vat, because that's not self-determination.
- No living in a human zoo, because that won't allow for growth.
- No hidden sabotage of our research, because that hamstrings our inventions.
- And no murdering us all, because that would definitely infringe on our survival.
It's basically a better version of Asimov's laws!
Note that humanity's values apply only to humanity at large, not to individual humans. That means the AGI can still ...
- ... solve trolley problems in traffic because it looks at the survival of humanity over that of the individual.
- ... nudge criminals into other life paths, because it looks at the self-determination of the majority over that of the minority.
- ... limit the ability of con artists to learn new MLM schemes, because that would infringe on other people's ability to prosper.
- ... prevent the invention of more damaging bioweapons, because not all inventions help humanity flourish.
- ... guide the growth of individuals where tragedy-of-the-commons situations threaten, because those leave everyone poorer.
By formulating our goals at the level of humanity instead of the individual human, we thus create a path for AGI to navigate conflicts of interest without devolving into catastrophic trade-offs no one thought to prohibit it from making. Of course, there is still the question of how to operationalize these values, but knowing where the target lies is a good start.
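To give a feel for what operationalizing these values could look like, here is a hedged sketch: score each candidate action on every humanity-level value aggregated over the whole population, veto anything that degrades an aggregate value, and only then optimize as usual. The function names, the `floor` threshold, and the idea of a per-value predictor are all hypothetical placeholders for illustration, not a worked-out proposal; building such predictors is exactly the open question.

```python
from typing import Callable, Dict, Iterable

# Hypothetical type: a value metric predicts the effect of a candidate action
# on one humanity-level value, aggregated over all of humanity
# (positive = helps, negative = harms).
ValueMetric = Callable[[str], float]

def permitted(action: str,
              metrics: Dict[str, ValueMetric],
              floor: float = 0.0) -> bool:
    """Return True if the action degrades no humanity-level value.

    The check is deliberately applied at the aggregate level: an action may be
    bad for some individuals and still pass, as long as survival,
    self-determination, self-improvement, invention, and growth of humanity
    at large stay above the floor.
    """
    return all(metric(action) >= floor for metric in metrics.values())

def choose(actions: Iterable[str],
           metrics: Dict[str, ValueMetric],
           utility: Callable[[str], float]) -> str:
    """Pick the highest-utility action among those that pass the value check."""
    allowed = [a for a in actions if permitted(a, metrics)]
    if not allowed:
        raise RuntimeError("no action satisfies the humanity-level constraints")
    return max(allowed, key=utility)
```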
Outer Alignment
The way I understand the problem space now, the goal of AI alignment is to ensure the AGI adopts the instrumental convergence goals on behalf of humanity, while we can assume the AGI will also pursue these goals for itself. The beauty of this solution is that any increase in instrumental success on the part of the AGI translates into an increase in terminal success for humanity!
Win-win if I ever saw one.
Additionally, this approach doesn't rely on whoever happens to create the first AGI being a nice guy or gal. These goals are universal to humanity. So even though an individual creator might add goals that are detrimental to a subset of humanity (say the bad people get AGI first), the AGI will still be constrained in how much damage it can do to humanity at large.
The distinction between civilization's goal and the goals of individual people is real, but that doesn't make civilization's goal unmoored. Rounding it down to some instrumental goals changes it. And that exposes you to Goodhart's curse: if you take something other than the actual terminal values of civilization as an optimization target, the outcome looks bad from the point of view of those actual terminal values.
I think something similar to what you say can be rescued, in the form of the more important terminal values of civilization turning out to be generic, like math, not specific to details of the people who seek to formulate their own values. Generic values are convergent across many processes of volition extrapolation, including for the more human-like AGIs, and form an even greater share of terminal values for coalitions of multiple different AGIs. (This doesn't apply to mature optimizers such as paperclip maximizers that already know their terminal values and aren't motivated to work on figuring out what else they should be.)
It is similar to being instrumentally convergent, in being discovered by many different processes for the same reason, but it's not the same thing. Convergent instrumental goals are discovered as subgoals in the process of solving many different problems, in service of many different terminal goals. Generic terminal goals are discovered as terminal goals in the process of extrapolating many different volitions (formulating terminal goals of many different people, including those of relatively alien psychology, not sharing many human psychological adaptations).
Thank you for your comment!
How do we change humanity's hypothesized terminal goals by assigning instrumental convergence goals for human...