Typically, I've seen researchers make this claim confidently in a single sentence. Sometimes it's backed by a loose analogy. [1]
This claim is cruxy: if alignment is not solvable, then the alignment community is not viable. Yet little has been written that disambiguates the claim and explicitly reasons through it.
Have you claimed that ‘AGI alignment is solvable in principle’?
If so, can you elaborate on what you mean by each term? [2]
Below I'll also try to specify each term, since I support research here by Sandberg & co.
[1] Some analogies I've seen a few times (rough paraphrases):
- ‘humans are generally intelligent too, and humans can align with humans’
- ‘LLMs appear to do a lot of what we want them to do, so AGI could too’
- ‘other impossible-seeming engineering problems got solved too’
[2] E.g. what does ‘in principle’ mean? Does it assert that the problem described is solvable based on certain principles, or on some model of how the world works?
With this example, you might still assert that "possible worlds" are world states reachable through physics from past states of the world, i.e. that whether alignment is possible is path-dependent on historical world states.
But you seem to mean something broader by "possible worlds": something like "in theory, there is a physically possible arrangement of atoms/energy states that would result in an 'aligned' AGI, even if that arrangement of states might not be reachable from our current world or even from a past one".
→ Am I interpreting you correctly?
Your saying this illustrates the ambiguity in trying to understand what different people mean. One researcher can make a technical claim about the possibility/tractability of "alignment" that is worded similarly to technical claims made by others, yet their meaning of "alignment" could be quite different.
It then becomes hard to have a well-argued discussion, because you don't know whether people are equivocating (i.e. switching between different meanings of the term).
That's a good summary list! I like the inclusion of "long-term outcomes" in P6. In contrast, P4 could cover only short-term problems specified by a designer or user who gave little thought to long-term repercussions.
The way I deal with the wildly varying uses of the term "alignment" is to use a minimum definition that most of those six interpretations are consistent with: one where (almost) everyone would agree that an AGI failing to meet it would be clearly unaligned.