Epistemic status: posing questions I think are probably useful, without a strong guess at the answers.
If we made an AGI just like you, would it be aligned?
Would it stay aligned over a long time span?
The purpose of the first question is to explore the possibility, raised by shard theory and related work, that RL systems are not maximizers and are therefore easier to align. The second question asks how a human-like RL system might change its own goals and values over time. Both questions apply anthropomorphic reasoning to neuromorphic AGI safety. My arguments for this being useful are mostly the same ones laid out well here and here.
Of course, even a neuromorphic AGI will not be all that much like us. There's just no way we're going to decode all the algorithms wired in there by evolution. But there's a strong argument that it will be like us in being an actor-critic RL system that can reason about its own beliefs. And if we do a really good job of training it, its starting values could be a lot like yours.
This is an informal introspective approach to reasoning about AGI. I think introspection is a much-maligned tool that psychologists use extensively in private while denigrating it in public.
The purpose of this question is to ask how precisely aligned a brainlike AGI would need to be in order to be safe and beneficial. My proposed answer is below, but I'd really like to hear others' thoughts.
The second question is, would you stay aligned as your values shifted? This takes the perspective of a singleton AGI that might remain in a position of power over humanity for a very long time, and that will ultimately figure out how to edit its own values if it wants to. So this question addresses the stability of alignment in roughly-neuromorphic systems.
I think this question is tough. I don't think I would edit many of my values, but I would edit some habits and emotional responses. And if I had values that severely interfered with my main goals, like loving sweets when I'd like to be fit and healthy, I might edit those too. Figuring out what counts as stable enough seems pretty tricky.
I think this question is at the heart of whether we're on the right track at all in thinking about aligning RL networks. I don't see a way to prevent a superintelligence from editing its own values if it wants to. And I think that a superintelligence is likely to be neuromorphic at least as far as being a combination of RL and self-supervised networks.
More on each of these in future posts.
Back to my guess at whether humans are adequately aligned. This question is mostly relevant for estimating what fraction of an RL agent's values need to be benevolent to produce helpful results.
I think that if we had the engineering and problem-solving abilities of a superintelligence, most of us would help humanity. I also think we'd do that without imposing much in the way of specific projects and values. I think this magnanimity would increase over time as we got used to the idea that we no longer need to worry about protecting ourselves. There would be plenty of time and space for our own projects, and trying to interest people in sharing them seems more satisfying than forcing people to join in. Of course there's a lot of gray area there, and this is one of the main points I'd like to get opinions on.
I think most people would be helpful despite not having the good of humanity as their primary motivation; with massive power, there's room to advance many projects at once. If this is roughly correct, it opens up the possibility that a neuromorphic AGI needn't be perfectly aligned with humanity; it just needs to have some values in the direction of helping humans. This seems to be the thrust of shard theory. Even this much alignment isn't easy, of course, but it seems a lot easier than figuring out the best thing for humanity forever, and training in exactly and only that as a goal or value.