Does this plan necessarily factor through using the intent-aligned AGI to quickly commit some sort of pivotal act that flips the gameboard and prevents other intent-aligned AGIs from being used malevolently by self-interested or destructive (human) actors to gain a decisive strategic advantage? After all, it sure seems less than ideal to find yourself in a position where you can solve the theoretical parts of value alignment,[1] but you cannot implement that in practice because control over the entire future light cone has already been permanently taken over by an AGI intent-aligned to someone who does not care about any of your broadly prosocial goals...
Insofar as something like this even makes sense, that is, which I have already expressed my skepticism of many times; but I don't think I particularly want to rehash that discussion with you right now...
Yes, it does require limiting the spread of AGI. I only referred to it briefly in the phrase "...even if AGI does not proliferate widely". I discuss it more in "do we all die anyway". I think it's not quite as clear as needing to shut down all other AGI projects or we're doomed; a small number of AGIs under control of different humans might be stable with good communication and agreements, at least until someone malevolent or foolish enough gets involved. That's why I'm using the term proliferation; I think the dynamics are somewhat similar to the nuclear standoff, where we've actually seen stability with a handful of actors.
I'm hoping the need to reduce proliferation will become apparent to anyone who sees the potential of real AGI and who thinks about international politics, including terrorism. I'm hoping the potential of AGI will be much more intuitively apparent to anyone having a conversation with something that's smarter than them and just as agentic. We shall see.
Note that I did address your core point from the other comments you linked: human values aren't well-defined, so you can't align anything to them. I think aligning a superintelligence to anything in the neighborhood of common human preferences would be close enough for us to be pretty happy with, even if we can't do something more clever, like providing lots of freedom to future beings without letting them abuse it at others' expense. Hopefully we can have a long reflection, or good answers will seem obvious once we have a superhuman intelligence helping us think it through. I have some ideas, but that's a whole different project than surviving the first AGIs and so getting to have that discussion and that choice.
I think without any progress on understanding human values we could still have a world thousands of times better than anything we've had, in the opinions of the vast majority of human beings who have or will live. That's good enough for me, at least for now.
I think it's not quite as clear as needing to shut down all other AGI projects or we're doomed; a small number of AGIs under control of different humans might be stable with good communication and agreements, at least until someone malevolent or foolish enough gets involved.
Realistically, in order to have a reasonable degree of certainty that this state can be maintained for more than a trivial amount of time, this would, at the very least, require a hard ban on open-source AI, as well as international agreements to strictly enforce transparency and compute restrictions, with the direct use of force if need be, especially if governments get much more involved in AI in the near-term future (which I expect will happen).
Do you agree with this, as a baseline?
I do pretty much agree. All laws and international agreements are ultimately enforced by the use of force if need be, so that's not saying anything new. It probably does need to be a hard ban on open-source AI at some point, but that's well in the future, and I think the discussion will look very different once we have clearly parahuman AGI.
This is all going to be a tough pill to swallow. I think it's going to be almost necessary that any government that enacts these rules also assures everyone, and then follows through at least decently well on spreading the benefits of real AGI as broadly as possible. I see some hope in that becoming a necessity. We might get some oversight boards that could at least think clearly and apply some influence toward sanity.
I think Instruction-following AGI is easier and more likely than value aligned AGI, and that this accounts for one major crux of disagreement on alignment difficulty. I got several responses to that piece that didn't dispute that intent alignment is easier, but argued we shouldn't give up on value alignment. I think that's right. Here's another way to frame the value of personal intent alignment: we can use a superintelligent instruction-following AGI to solve full value alignment.
This is different from automated alignment research; it's not hoping tool AI can help with our homework, it's making an AGI smarter than us in every way do our homework for us. It's a longer-term plan. Having a superintelligent, largely autonomous entity that just really likes taking instructions from puny humans is counterintuitive, but it seems both logically consistent and technically achievable on the current trajectory - if we don't screw it up too badly.
Personal, short-term intent alignment (like instruction-following) is safer for early AGI because it includes corrigibility. It allows near-misses. If your AGI did think eliminating humans would be a good way to cure cancer, but it's not powerful enough to make that happen immediately, you'll probably get a chance to say "so what's your plan for that cancer solution?" and "Wait no! Quit working on that plan!" (And that's if you somehow didn't tell it to check with you before acting on big plans).
This type of target really seems to make alignment much easier. See the first linked post, or Max Harms' excellent sequence on corrigibility as a singular (alignment) target (CAST) for a much deeper analysis. An AI that wants to follow directions also wants to respond honestly about its motivations when asked, and to change its goals when told to - because its goals are all subgoals of doing what its principal asks. And this approach doesn't have to "solve ethics" - because it follows the principal's ethics.
And that's the critical flaw; we're still stuck with variable and questionable human ethics. Having humans control AGI is not a permanent solution to the dangers of AGI. Even if the first creators are relatively well-intentioned, eventually someone sociopathic enough will get the reins of a powerful AGI and use it to seize the future.
In this scenario, technical alignment is solved, but most of us die anyway. We die as soon as a sufficiently malevolent person acquires or seizes power (probably governmental power) over an AGI.
But won't a balance of power restrain one malevolently-controlled AGI surrounded by many in good hands? I don't think so. Mutually assured destruction works for nukes but not as well with AGI capable of autonomous recursive self-improvement. A superintelligent AGI will probably be able to protect at least its principal and a few of their favorite people as part of a well-planned destructive takeover. If nobody else has yet used their AGI to firmly seize control of the lightcone, there's probably a way for an AGI to hide and recursively self-improve until it invents weapons and strategies that let it take over - if its principal can accept enough collateral damage. With a superintelligence on your side, building a new civilization to your liking might be seen as more an opportunity than an inconvenience.
These issues are discussed in more depth in "If we solve alignment, do we die anyway?" and its discussion. "To the average human, controlled AI is just as lethal as 'misaligned' AI" draws similar conclusions from a different perspective.
It seems inevitable that someone sufficiently malevolent would eventually get the reins of an intent-aligned AGI. This might not take long even if AGI does not proliferate widely; there are reasons to think that malevolence could correlate with attaining and retaining positions of power. Maybe there's a way to prevent this with the aid of increasingly intelligent AGIs; if not, it seems like taking power out of human hands before it falls into the wrong ones will be necessary.
Writing "If we solve alignment, do we die anyway?" and discussing the claims in the comments drew me to the conclusion that the end goal probably needs to be value alignment, just like we've always thought - human power structures are too vulnerable to infiltration or takeover by malevolent humans. But instruction-following is a safer first alignment target, so it can be a stepping-stone that dramatically improves our odds of getting to value-aligned AGI.
Humans in control of highly intelligent AGI will have a huge advantage in solving the full value alignment problem. At some point, they will probably be pretty certain the plan can be accomplished, at least well enough to preserve much of the value of the lightcone by human lights (perfect alignment seems impossible since human values are path-dependent, but we should be able to do pretty well).
Thus, the endgame goal is still full value alignment for superintelligence, but the route there is probably through short-term personal intent alignment.
Is this a great plan? Certainly not. It hasn't been thought through, and there's probably a lot that can go wrong even once it's as refined as possible. In an easier world, we'd Shut it All Down until we're ready to do it wisely. That doesn't look like an option, so I'm trying to plot a practically achievable path from where we are to real success.