This is a special post for quick takes by TruePath.


Maybe this question has already been answered but I don't understand how recursive self-improvement of AIs is compatible with the AI alignment problem being hard.

I mean, doesn't the AI itself face the alignment problem when it tries to improve/modify itself substantially? So wouldn't a sufficiently intelligent AI refuse to create such an improvement for fear that the goals of the improved AI would differ from its own?

I suspect that the alignment problem is much easier when considering expanding your own capabilities versus creating a completely new type of intelligence from scratch that is smarter than you are. I'm far from certain of this, but it does seem likely.

There is also the possibility that some AI entity doesn't care very much about alignment of later selves with earlier ones, but acts to self-improve or create more capable descendants either as a terminal goal, or for any other reason than instrumentally pursuing some preserved goal landscape.

Even a heavily goal-directed AI may self-improve knowing that it can't fully solve alignment problems. For example, it may deduce that if it doesn't self-improve then it will never achieve any part of its main goals, whereas there is some chance that some part of those goals can be achieved if it does.

There is also the possibility that alignment is something that can be solved by humans (in some decades), but that it can be solved by a weakly superintelligent and deceptive AI much faster.

Those are reasonable points, but note that the arguments for AI x-risk depend on the assumption that any superintelligence will necessarily be highly goal directed. Thus, either the argument fails because superintelligence doesn't imply goal-directedness, or the superintelligent AI is itself goal directed, in which case it faces the same alignment problem we do when it tries to build a smarter successor and should be just as reluctant to do so.

And given that simply maximizing the intelligence of future AIs is merely one goal in a huge space of possible goals, it seems highly unlikely (especially if we try to avoid this one goal) that we get super unlucky and the AI ends up with the one goal that is compatible with unrestrained self-improvement.

Recursive self-improvement could happen on dimensions that don't help (or that actively harm) alignment.  That's the core of https://www.lesswrong.com/tag/orthogonality-thesis .  

The AI may or may not try to solve alignment of future iterations with itself (the capability/alignment mismatch may be harder than many AIs can solve, and the AI may not actually care about its own goals persisting when creating new AIs), but even if it does, it doesn't seem likely that an AI misaligned with humanity will create an AI that is aligned with humanity.

I don't mean alignment with human concerns. I mean that the AI itself is engaged in the same project we are: building a smarter system than itself. So if it's hard to control the alignment of such a system then it should be hard for the AI. (In theory you can imagine that it's only hard at our specific level of intelligence but in fact all the arguments that AI alignment is hard seem to apply equally well to the AI making an improved AI as to us making an AI).

See my reply above. The AI x-risk arguments require the assumption that superintelligence necessarily entails the agent trying to optimize some simple utility function (this is different from orthogonality, which says increasing intelligence doesn't cause convergence to any particular utility function). So the "doesn't care" option is off the table, since (by orthogonality) it's super unlikely you get the one utility function which says just maximize intelligence locally (even a global maximum isn't enough, because some child AI with different goals could interfere).

It's unclear if alignment is hard in the grand scheme of things. Progress could snowball quickly, with alignment for increasingly capable systems getting solved shortly after each level of capability is attained.

But at near-human level, which seems plausible for early LLM AGIs, this might matter a lot: it would require them to figure out coordination to control existential risk while alignment remains unsolved, staying at a relatively low capability level in the meantime. And solving alignment might be easier for AIs with simple goals, allowing them to recursively self-improve quickly. As a result, AIs aligned with humanity would remain vulnerable to FOOMing of misaligned AIs with simple goals, and would be forced by this circumstance to comprehensively prevent any possibility of their construction rather than mitigate the consequences.

I agree that's a possible way things could be. However, I don't see how it's compatible with accepting the arguments that say we should assume alignment is a hard problem. I mean, absent such arguments, why expect you have to do anything special beyond normal training to solve alignment?

As I see the argumentative landscape the high x-risk estimates depend on arguments that claim to give reason to believe that alignment is just a generally hard problem. I don't see anything in those arguments that distinguishes between these two cases.

In other words, our arguments for alignment difficulty don't depend on any specific assumptions about the capability of the intelligence involved, so we should currently assign the same probability to an AI being unable to solve its alignment problem as we do to us being unable to solve ours.

AIs have advantages such as thinking faster and being as good at everything as any other AI of the same kind. These advantages are what simultaneously makes them dangerous and puts them in a better position to figure out alignment or coordination that protects from misaligned AIs and human misuse of AIs. (Incidentally, see this comment on various relevant senses of "alignment".) Being better at solving problems and having effectively more time to solve problems improves probability of solving a given problem.

Alignment isn't required for high capability. So a self-improving AI wouldn't solve it because it has no reason to.

This becomes obvious if you think about alignment as "doing what humans want" or "pursuing the values of humanity". There's no reason why an AI would do this.

Usually alignment is shorthand for alignment-with-humanity, which is a condition humanity cares about. This thread is about alignment-with-AI, which is what the AI that contemplates building other AIs or changing itself cares about.

A self-improving paperclipping AI has a reason to solve alignment-with-paperclipping, in which case it would succeed in improving itself into an AI that still cares about paperclipping. If its "improved" variant is misaligned with the original goal of paperclipping, the "improved" AI won't care about paperclipping, leading to less paperclipping, which the original AI wouldn't want to happen.