[ Question ]

What do you mean with ‘alignment is solvable in principle’?

17th Jan 2025


Typically, I see researchers make this claim confidently in a single sentence. Sometimes it's backed by a loose analogy. ^[1]

This claim is cruxy. If alignment is not solvable, then the alignment community is not viable. But little is written that disambiguates and explicitly reasons through the claim.

Have you claimed that ‘AGI alignment is solvable in principle’?
If so, can you elaborate on what you mean by each term? ^[2]

Below I'll also try to specify each term, since I support research here by Sandberg & co.

^[1]
Some analogies I've seen a few times (rough paraphrases):
- ‘humans are generally intelligent too, and humans can align with humans’
- ‘LLMs appear to do a lot of what we want them to do, so AGI could too’
- ‘other impossible-seeming engineering problems got solved too’
^[2]
E.g. what does ‘in principle’ mean? Does it assert that the problem described is solvable based on certain principles, or some model of how the world works?

Tags: AI Risk, Problem Formulation & Conceptualization, AI


3 Answers

Jan 17, 2025

115

"X is possible in principle" means X is in the space of possible mathematical things (as an independent claim to whether humans can find it).

Remmelt · 4mo · 10

Thanks, when you say “in the space of possible mathematical things”, do you mean “hypothetically possible in physics” or “possible in the physical world we live in”?

2 · [anonymous] · 4mo

Possible to be run on a computer in the actual physical world.

Jan 17, 2025*

62

The claim "alignment is solvable in principle" means "there are possible worlds where alignment is solved."

Consequently, the claim "alignment is unsolvable in principle" means "there are no possible worlds where alignment is solved."
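
One way to write these two claims down, as a rough sketch only: the symbols are introduced here for illustration and are not from the thread. Assume $W$ is whatever set of worlds is under consideration and $\mathrm{Solved}(w)$ holds when alignment is solved in world $w$.

```latex
\[
\text{solvable in principle} \iff \exists\, w \in W : \mathrm{Solved}(w)
\]
\[
\text{unsolvable in principle} \iff \neg\, \exists\, w \in W : \mathrm{Solved}(w)
\iff \forall\, w \in W : \neg\, \mathrm{Solved}(w)
\]
```

Written this way, the content of the claim depends on two unspecified things: which worlds belong to $W$, and what counts as $\mathrm{Solved}$.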

Remmelt · 4mo · 10

Thanks!

By ‘possible worlds’, do you mean ‘possible to be reached from our current world state’?

And what do you mean by ‘alignment’? I know that can sound like an unnecessary question. But if it’s not specified, how can people soundly assess whether it is technically solvable?

4 · Satron · 4mo

By "possible worlds," I mean all worlds that are consistent with the laws of logic, such as the law of non-contradiction.

For example, it might be the case that, for some reason, alignment would have been solved if and only if Abraham Lincoln hadn't been assassinated in 1865. That means that humans in 2024 in our world (where Lincoln was assassinated in 1865) will not be able to solve alignment, despite it being solvable in principle.

My answer is kind of similar to @quila's. I think that he means roughly the same thing by "space of possible mathematical things."

I don't think that my definition of alignment is particularly important here because I was mostly clarifying how I would interpret the sentence if a stranger said it. Alignment is a broad word, and I don't really have the authority to interpret a stranger's words in a specific way without accidentally misrepresenting them.

For example, one article managed to find six distinct interpretations of the word:

2 · Remmelt · 4mo

With this example, you might still assert that "possible worlds" are world states reachable through physics from past states of the world. I.e. you could still assert that the possibility of alignment is path-dependent on historical world states.

But you seem to mean something broader by "possible worlds". Something like "in theory, there is a physically possible arrangement of atoms/energy states that would result in an 'aligned' AGI, even if that arrangement of states might not be reachable from our current or even a past world".

-> Am I interpreting you correctly?

Your saying this shows the ambiguity involved in trying to understand what different people mean. One researcher can make a technical claim about the possibility/tractability of "alignment" that is worded similarly to a technical claim others have made. Yet their meaning of "alignment" could be quite different. It's then hard to have a well-argued discussion, because you don't know whether people are equivocating (i.e. switching between different meanings of the term).

That's a good summary list! I like the inclusion of "long-term outcomes" in P6. In contrast, P4 could just entail short-term problems that were specified by a designer or user who did not give much thought to long-term repercussions.

The way I deal with the wildly varying uses of the term "alignment" is to use a minimum definition that most of those six interpretations are consistent with, one where (almost) everyone would agree that an AGI not meeting that definition would be clearly unaligned.

* Alignment is at the minimum the control of the AGI's components (as modified over time) to not (with probability above some guaranteeable high floor) propagate effects that cause the extinction of humans.

1 · Satron · 4mo

Yup, that's roughly what I meant. However, one caveat would be that I would change "physically possible" to "metaphysically/logically possible" because I don't know if worlds with different physics could exist, whereas I am pretty sure that worlds with different metaphysical/logical laws couldn't exist. By that, I mean stuff like the law of non-contradiction and "if a = b, then b = a."

I think the main antidote against this is to ask the person you are speaking with to define the term if they are making claims in which equivocation is especially likely.

Yeah, that's reasonable.

Jan 17, 2025*

30

Here's how I specify terms in the claim:

AGI is a set of artificial components, connected physically and/or by information signals over time, that in aggregate sense and act autonomously across many domains.
- 'artificial' as configured out of a (hard) substrate that can be standardised to process inputs into outputs consistently (vs. what our organic parts can do).
- 'autonomously' as continuing to operate without needing humans (or any other species that share a common ancestor with humans).
Alignment is at the minimum the control of the AGI's components (as modified over time) to not (with probability above some guaranteeable high floor) propagate effects that cause the extinction of humans.
Control is the implementation of (a) feedback loop(s) through which the AGI's effects are detected, modelled, simulated, compared to a reference, and corrected.
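
As a minimal sketch of the loop structure in that definition of control (detect, model, simulate, compare to a reference, correct), here is a toy example on a single scalar "effect" level. Everything in it is a hypothetical illustration: the names `ToySystem`, `control_step`, and `run`, the constant drift, and the proportional correction are assumptions made for the example. It says nothing about whether such a loop can be made sufficient for an AGI.

```python
from dataclasses import dataclass


@dataclass
class ToySystem:
    """Hypothetical stand-in for a controlled system: one scalar effect level."""
    effect: float
    drift: float = 0.5  # assumed uncontrolled change per time step

    def step(self, correction: float) -> None:
        # The effect drifts on its own and is shifted by the applied correction.
        self.effect += self.drift + correction


def control_step(system: ToySystem, reference: float, gain: float) -> float:
    """Run one pass of the feedback loop and return the measured error."""
    detected = system.effect                # detect the current effect
    modelled_drift = system.drift           # model (assumed perfectly known here)
    simulated = detected + modelled_drift   # simulate the next uncorrected state
    error = simulated - reference           # compare to the reference
    system.step(-gain * error)              # correct in proportion to the error
    return error


def run(steps: int, gain: float = 1.0, start: float = 3.0) -> list[float]:
    """Apply the loop repeatedly and record the effect level after each step."""
    system = ToySystem(effect=start)
    history = []
    for _ in range(steps):
        control_step(system, reference=0.0, gain=gain)
        history.append(system.effect)
    return history


if __name__ == "__main__":
    print(run(5, gain=1.0))  # full correction each step
    print(run(5, gain=0.5))  # partial correction leaves a residual offset
```

In this toy, the loop only holds the effect at the reference because the model of the drift is exact and the correction is applied at full strength. With a weaker correction (`gain=0.5`) the effect settles at a nonzero offset instead.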
