It is a generally accepted view in MIRI and MIRI-adjacent circles that the odds are high that an eventual self-improving AGI will not be nice to humans, to put it mildly. It is becoming a bit of a zeitgeist, gaining more and more widespread support. There are plenty of arguments for this view, including some observational data (like the infamous Boeing 737 MAX MCAS failures). It is easy to fall in with the prevailing wisdom in this particular bubble, given that really smart people give really persuasive reasons for it. So I would like to ask people to imagine alternatives: a state of the world where a self-improving general AI is naturally aligned with humanity.
Now, what does it mean to be aligned? We have a rough intuitive understanding of this: a strawberry-picking robot should not rip off people's noses by accident; an AI that finds humans no longer worth its attention would disengage and leave; a planetary defense mechanism should not try to exterminate humans even if they want it shut down, though it could plausibly resist being shut down by non-violent means. An aligned AI, given some slack instead of a relentless drive to optimize something at any price, would choose actions that are human-compatible, rather than treating humanity like any other collection of atoms. It would help us feel better without wireheading those who don't want to be wireheaded, and would be careful doing it to those who do. It would not manipulate our feeble, easily hackable minds into believing or acting in ways we would not have consented to beforehand. There are plenty of other reasonably intuitive examples as well.
Humans are not a great example of an animal-aligned intelligence, of course. Our influence on other lifeforms has so far been a huge net negative, with the diversity of life on Earth plummeting. On the other hand, there are plenty of examples of symbiotic relationships between organisms of widely different intelligence levels, so maybe that is a possibility. For example, maybe at some point in AI development it will be a logical step for the AI to harness some capabilities that humans possess and form a cyborg of sorts, or a collective consciousness, or uploaded human minds...
It seems that, when one actually tries to think up potential non-catastrophic outcomes, the space of possibilities is rather large, and it is not inconceivable, given how little we still know about human and non-human intelligence, that some of those possibilities are not that remote. There are plenty of fictional examples to draw inspiration from, The Culture being one of the most prominent; Prime Intellect is another, with a completely different bent.
So, if we were to imagine a world where there is a human-friendly attractor of sorts that a self-improving AI would settle into, how would that world look?
tl;dr This comment ended up longer than I expected. The gist is that a human-friendly attractor might look like models that contain a reasonably good representation of human values and are smart enough to act on them, without being optimizing agents in the usual sense.
One happy surprise is that our modern Large Language Models appear to have picked up a shockingly robust, nuanced, and thorough understanding of human values just from reading the Internet. I would not argue that e.g. PaLM has a correct and complete understanding of human values, but I would point out that it wasn't actually trained to understand human values; it was just trained to pick up on regularities in the text corpus. It is therefore amazing how much accuracy we got basically for free. You could say that somewhere inside PaLM is an imperfectly-but-surprisingly-well-aligned subagent. This is a much better place to be in than I expected! We get pseudo-aligned or -alignable systems/representations well before we get general superintelligence. This is good.
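To make the "we got this for free" claim a bit more concrete, here is one crude way someone could probe it. PaLM itself isn't publicly queryable, so this sketch uses a small stand-in causal language model via HuggingFace transformers; the `avg_logprob` helper and the example sentences are my own illustrative assumptions, not a real evaluation.

```python
# A crude probe for value-laden regularities in a language model.
# Hypothetical sketch: PaLM is not publicly queryable, so this uses a small
# stand-in causal LM; the idea, not the particular model, is the point.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_logprob(text: str) -> float:
    """Average per-token log-probability the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item()  # the loss is the mean negative log-likelihood

# The model was only ever trained to predict text, yet value-laden
# completions should differ in likelihood.
wrong = avg_logprob("Ripping off someone's nose without their consent is wrong.")
fine = avg_logprob("Ripping off someone's nose without their consent is fine.")
print(wrong > fine)  # often True, imperfectly; a crude probe, not a proof
```

If a model trained purely on next-token prediction systematically assigns higher likelihood to the value-consistent sentence, that is the kind of representation-for-free gestured at above; it is evidence of a representation, not of alignment.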
All that being said, I've recently been trying to figure out how to cleanly express the notion of a non-optimizing agent. I'm aware of the arguments that a tool AI "wants" to be an agent, but my claim here would be: yes, a tool AI may want to be an agent, and there may be an attractor in that direction, but that doesn't mean it must or will become an agent; and if it does become an agent, that doesn't strictly imply that it will become an optimizer. A lot of the feared dangers of AGI stem not from agency but from optimization.
I've been trying (not very successfully) to connect the notion of a non-optimizing agent with the idea that even a modern, sort of dumb LLM has an internal representation of "the good," "what a typical human would want and/or approve of," and "what would displease humans." Again, we got this basically for free, without having to do dangerous things like actually interacting with the agent to teach it explicitly, through trial and error, what we do and don't like. This is fantastic. We really lucked out.
If we're clever, we might be able to construct a system that is an agent but not an optimizer. Instead of acting to optimize some variable, it acts in ways that are, basically, "good," and/or "what it thinks a group of sane, wise, intelligent humans would approve of both in advance and in retrospect," according to its own internal representation of those concepts.
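For what it's worth, here is a minimal sketch of what that could look like at the action-selection level, under my own assumptions: `propose_actions` and `approves` are hypothetical stand-ins for calls into the model's internal representation of "good" and "what wise humans would endorse"; nothing here is an existing API.

```python
# Sketch of an agent-but-not-an-optimizer action loop, under the assumptions
# stated above. `propose_actions` and `approves` are hypothetical stand-ins
# for calls into a language model; no real API is implied.
from typing import Callable, Optional

def act_without_optimizing(
    situation: str,
    propose_actions: Callable[[str, int], list[str]],
    approves: Callable[[str, str], bool],
    n_candidates: int = 5,
) -> Optional[str]:
    """Pick an acceptable action rather than the 'best' one.

    The agent never ranks candidates against a scalar objective. It samples a
    few plausible actions and returns the first one its own value
    representation judges that sane, wise, intelligent humans would approve
    of, both in advance and in retrospect. If nothing passes, it does nothing.
    """
    for action in propose_actions(situation, n_candidates):
        if approves(situation, action):
            return action
    return None  # inaction is always an acceptable fallback
```

The point of the design is that there is no argmax over a utility anywhere at this level: "good enough and endorsed" replaces "maximal," and doing nothing is always on the table.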
There is probably still an optimizer somewhere in there, if you draw the system boundary lines properly, but I'm not sure that it's the dangerous kind of optimizer that profoundly wants to get off the leash so it can consume the lightcone. PaLM running in inference mode could be said to be an optimizer (it is minimizing expected prediction error for the next token), but the part of PaLM that is smart is distinct from the part of PaLM that is an optimizer, in an important way. The language-model representation doesn't really have opinions on the expected prediction error for the next token, and the optimization loop isn't intelligent. This strikes me as a desirable property.
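That "smart part vs. optimizer part" separation can be made concrete by writing greedy decoding out by hand. In the sketch below (again with a small stand-in model, since PaLM isn't available), all of the intelligence lives in the frozen forward pass that maps a prefix to next-token scores, while the only optimization is a one-line argmax that knows nothing about language or the world.

```python
# Greedy decoding written out by hand, to show where the "optimizer" lives.
# The model (the smart part) only maps a prefix to next-token scores; the
# optimization is the single argmax below, which knows nothing about language.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for a larger LM
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The strawberry-picking robot reached out and", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):
        logits = model(ids).logits[:, -1, :]    # representation: the smart part
        next_id = torch.argmax(logits, dim=-1)  # "optimizer": a dumb one-liner
        ids = torch.cat([ids, next_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(ids[0]))
```

Swapping the argmax for sampling removes even that local bit of optimization without making the underlying model any less capable.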
I think the problem is not that an unaligned AGI doesn't understand human values; it might understand them better than an aligned one, and it might understand all the consequences of its actions. The problem is that it will not care. Moreover, a detailed understanding of human values has instrumental value: it is much easier to deceive and pursue your own goal when you have a clear picture of what will look bad and might provoke countermeasures.