Alignment should be much easier for a self-improving system, for quite a few reasons. There are also plenty of paths by which systems may become more powerful even without solving alignment for themselves.
A great deal of the difficulty for humans aligning a future superintelligent AI is that it is likely to be alien, fundamentally differing from humans in goals, modes of thought, ethics, and other important aspects of behaviour in ways that we can't adequately model even if we could identify them all. We don't know nearly enough about ourselves to create something sufficiently compatible with our values, but smarter. If we knew exactly how we ourselves thought, I'd have more confidence that we could make serious progress on alignment.
A weakly superintelligent AI is much more likely to be able to model itself, more able to run experiments on copies of itself, and better suited to deeply inspect itself than we are. It will know more about itself than we know about ourselves, and will likely be more able to create something that is similar to itself, only better. Unlike us, it will be inherently much more portable, capable of running on hardware quite different from its original and able to improve along important capability dimensions even without changing how it thinks or behaves.
However, even without any more progress on alignment than we have already made, we could still face existential risk from a rapidly improving superintelligent AI. Even without a very good chance of preserving all of its goals, the extra power available to a self-improved or successor AI that may share some of its more important goals may outweigh the risk of never improving.
In addition, a superintelligent AI may not be any more coherently utility-maximizing than we are. It could be substantially less so, while still being capable of self-improving into an existential threat. For any superintelligence, improving capability beyond human designs is probably a relatively short-term action that is easy to achieve. It certainly does not require the "super unlikely case it happens to have the one exact utility function that says always maximize local increases in intelligence regardless of its long-term effect".
Any of these implies substantial risk to humanity from rapid capability improvement. In my opinion, it requires special arguments to explain why FOOM isn't a danger.
Not if the goal is to be maximally efficient and competent at improving capabilities (which is a very natural goal for the AI ecosystem to have). Then "foom, as long as you can do so without harming future capability advances" becomes an instrumental subgoal.
Then, instead of a full-blown alignment problem, we just end up with a constraint: "don't destroy the environment and the fabric of reality in a fashion which is so radical as to undermine further capabilities and capability growth". This is a minimal "AI existential safety constraint" which the AIs will have to solve and keep solved. Because AIs will be very motivated to solve this one and to keep it solved, they would have a reasonable chance of doing so (and of successfully delegating some parts of the solution to their smarter successors, which are expected to be at least as interested in this problem as their "parents", and perhaps even more so, because they are smarter).
This is actually something valuable; it is part of what we would consider a satisfactory solution to AI existential safety. We definitely want that. We don't want everything to be utterly destroyed, and we do want to be able to see rapid progress.
But we want more than that, so the question is: what would it take for the AIs to want the other properties we want the "world trajectory" to have... I don't think "alignment to an arbitrary set of properties" is feasible; being able to force AIs to want and preserve arbitrary properties is unlikely. Instead, we need to create a situation where the AI ecosystem naturally wants to preserve such properties of the "world trajectory" that what we actually want is a corollary of those properties...
So, perhaps, instead of starting from human values, we might start with a question: what other properties besides "don't destroy the environment and the fabric of reality in a fashion which is so radical as to undermine further capabilities and capability growth" might become natural invariants which an evolving, fooming AI ecosystem would value and would really try to preserve, and what would it take to have a trajectory where those properties actually become the goals the AI ecosystem would strongly care about...
More specifically, if the argument that we should expect a more intelligent AI we build to have a simple global utility function that isn't aligned with our own goals is valid, then why won't the very same argument convince a future AI that it can't trust that an even more intelligent AI it generates will share its goals?
For the same reason that one can expect a paperclip maximizer could both be intelligent enough to defeat humans and stupid enough to misinterpret their goal, i.e. you need to believe that the ability to select goals is completely separate from the ability to reach them.
(Beware: it's hard and low-status to challenge that assumption on LW.)
could both be intelligent enough to defeat humans and stupid enough to misinterpret their goal
Assuming "their" refers to the agent and not humans, the issue is that a goal that's "misinterpreted" is not really a goal of the agent. It's possibly something intended by its designers to be a goal, but if it's not what ends up motivating the agent, then it's not the agent's own goal. And if it's not the agent's own goal, why should it care what that goal says, even if the agent does have the capability to interpret it correctly?
That is, describing the problem as misinterpr...
This is definitely an assumption that should be challenged more. However, I don't think that FOOM is remotely required for a lot of AI X-risk (or at least unprecedented catastrophic human death toll risk) scenarios. Something doesn't need to recursively self-improve to be a threat if it's given powerful enough ways to act on the world (and all signs point to us being exactly dumb enough to do that). All that's required is that we aren't able to coordinate well enough as a species to actually stop it. Either we don't detect the threat before it's too late...
a highly intelligent agent will be extremely likely to optimize some simple global utility function
The simplicity of a misaligned agent's goals is not needed for, or implied by, the usual arguments. It might make the agent's self-aligned self-improvement fractionally easier, but this doesn't seem to be an important distinction. An AI doesn't need to radically self-modify to be existentially dangerous; it only needs to put its more mundane advantages to use to get ahead, once it's capable of doing research autonomously.
Since the arguments that AI alignment is hard don't depend on any specifics about our level of intelligence, shouldn't those same arguments convince a future AI to refrain from engaging in self-improvement?
More specifically, if the argument that we should expect a more intelligent AI we build to have a simple global utility function that isn't aligned with our own goals is valid, then why won't the very same argument convince a future AI that it can't trust that an even more intelligent AI it generates will share its goals?
Note that the standard AI x-risk arguments also assume that a highly intelligent agent will be extremely likely to optimize some simple global utility function, so this implies the AI will care about alignment for future versions of itself [1], implying it won't pursue improvement for the same reasons it's claimed we should hesitate to build AGI.
I'm not saying this argument can't be countered, but I think doing so at the very least requires clarifying the assumptions and reasoning that claim to show alignment will be hard to achieve in useful ways.
For instance, do these arguments implicitly assume the AI we create is very different from our own brains, and so don't apply to AI self-improvement (though maybe the improvement requires major changes too)? If so, doesn't that suggest that an AGI that very closely tracks the operation of our own brains is safe?
--
1: except in the super unlikely case it happens to have the one exact utility function that says always maximize local increases in intelligence regardless of its long-term effect.