A central AI Alignment problem is the "sharp left turn" — a point in AI training under the SGD analogous to the development of human civilization under evolution, past which the AI's capabilities would skyrocket. For concreteness, I imagine a fully-developed mesa-optimizer "reasoning out" a lot of facts about the world, including the fact that it's part of the SGD loop, and "hacking" that loop to maneuver its own design into more desirable end-states (or outright escaping the box). (Do point out if my understanding is wrong in important ways.)
Certainly, a lot of proposed alignment techniques would break down at this point. Anything based on human feedback. Anything based on human capabilities presenting a threat/challenge. Any sufficiently shallow properties like naively trained "truthfulness". Any interpretability techniques not robust to deceptive alignment.
One thing would not, however, and that is goal alignment. If we can instill a sufficiently safe goal into the AI before this point — for a certain, admittedly hard-to-achieve definition of "sufficiently safe" — that goal should persist forever.
Let's revisit the humanity-and-evolution example again. Sure, inclusive genetic fitness didn't survive our sharp left turn. But human values did. Individual modern humans optimize for them as hard as humans ever did; and indeed, we aim to protect these values against the future. See: the entire AI Safety field.
The mesa-optimizer, it seems obvious to me, would do the same. The very point of various "underhanded" mesa-optimizer strategies like deceptive alignment is to protect its mesa-objective from being changed.
What it would do to its mesa-objective, at this point, is goal translation: it would attempt to figure out how to apply its goal to various other environments/ontologies, determine what that goal "really means", and so on.
Open Problems
There are three hard challenges this presents, for us:
- Figure out an aligned goal/a goal with an "is aligned" property, and formally specify it.
    - Either corrigibility or CEV, or some clever "pointer" to CEV.
    - Requires a solid formal theory of what "goals" are.
- Figure out how to instill an aligned goal into a pre-sharp-left-turn system.
    - Requires a solid formal theory of what "goals" are, again.
    - I think robust-to-training interpretability/tools for manual NN editing are our best bet for the "instilling" part.[1] The good news is that we may get away with "just" best-case robust-to-training transparency focused on the mesa-objective.
    - Maybe not, though: "the mesa-objective" may be a vague/distributed enough concept that the worst-case version is still necessary. But at least we don't need to worry about deception robustness: a faulty mesa-objective is the ultimate precursor to deception, and we'd be addressing it directly.
- Figure out the "goal translation" part. Given an extant objective defined over a particular environment, how does an agent figure out how to apply it to a different environment? And how should we design the mesa-objective so that its "is aligned" property is robust to goal translation?
    - Again, we'd need a solid formal theory of what "goals" are...
    - ...and likely some solid understanding of agents' mental architecture.
I see promising paths to solving the latter two problems, and I'm currently working on getting good enough at math to follow them through.
The Sharp Left Turn is Good, Actually
Imagine a counterfactual universe in which there is no sharp left turn. In which every part of the AI's design, including its mesa-objective, could be changed by the SGD at any point between initialization and hyperintelligence. In which it can't comprehend its training process and maneuver it around to preserve its core values.
I argue we'd be more screwed in that universe.
In our universe, it seems that the bulk of what we need to do is align a pre-sharp-left-turn AGI. That AGI would likely not be "hyperintelligent", but only slightly superhumanly intelligent. Very roughly on our level.
That means we don't need to solve the problem of ontology translation from a hyperintelligence to humanity. We just need to solve that problem for agents that are alien to us and are somewhat more capable than us, but likely not astronomically more capable than us.
In the other universe, we'd need tools that'd be able to keep the agent's mesa-objective on track even as the SGD lifts it to ever-higher levels of intelligence; even as its mesa-objective becomes defined over reality's "true" environment, understanding which likely requires the Theory of Everything.
That seems harder.
Novel Difficulties This Implies
We, humans, are not actually that good at protecting our values from the future. See: the current state of AGI ruin. Or: all the world's anthropogenic problems, most prominently various coordination problems and totalitarian regimes and runaway revenue-maximizers.
The AGI is unlikely to encounter the same problems we're facing. But the more general problem may still apply: we'd be trusting a non-hyperintelligent, roughly-as-flawed-as-we-are system to keep its values safe.
If its sharp left turn is sharp enough to immediately escape the SGD, and then spend a billion subjective years becoming hyperintelligent and fixing all of its flaws in a wall-clock hour, everything is fine. But if it's not...
It may mess up the reward-hacking and let the SGD tweak its mesa-objective into something subtly flawed; maybe something S-risky. It may spin up sub-agents and have coordination problems with them, summoning Moloch. It may misunderstand what it "really" wants, being rife with biases just as humans are, and set some of these misunderstandings in stone, with disastrous results.
Various worst-case interpretability techniques should, in theory, be able to handle this. Optimistically, they won't even need to be robust-to-training, as they'd be used to correct momentary, situational, non-convergent errors. Pessimistically, they'd need to be deception-robust, as the agent they'll be used on will be in the middle of its sharp left turn.
An alternative strategy might be to "rescue" a mesa-objective-aligned AGI from the SGD once it starts "turning left" (if it's not immediately powerful enough to do so on its own, like humans weren't 50,000 years ago), and let it run "classical" recursive self-improvement. That would remove the obvious source of repeat misalignment (the SGD re-misaligning the mesa-objective), and give the AGI direct access to our alignment literature, so it's less likely to fall into any pitfalls known to us. That's risky in obvious ways[2], but might be the better approach.
Overall, this post probably shouldn't update you in the direction of "alignment is easy". But I hope it clarifies the shape of the difficulties.
[1] Note what won't work here: naively training for an aligned outer objective. That would align the AI's on-distribution behavior, but not its goal. Analogizing to humanity again: modern human behavior looks all kinds of different from ancestral human behavior, even though humans are still optimizing for the same things deep inside. Nor does forcing a human child to behave a certain way necessarily make that child internalize the values they're being taught. So an AI "aligned" this way may still go omnicidal past the sharp left turn.
[2] And some less-obvious ways, like the AGI being really impulsive and spawning a more powerful non-aligned successor agent as its first outside-box action, because it feels like a really good idea to it at the moment.
I think that for the last month, for some reason, people have been going around overstating how aligned modern humans are with past humans.
If you put people from 500 years ago in charge of the galaxy, they'd have screwed it up by my standards: bigotry, war, cruelty to animals, religious nonsense, lack of imagination, and so on. And conversely, I'd screw up the galaxy by their standards. This isn't just some quirky fact about 500 years ago; all of history and prehistory is like this. We haven't magically circled back around to wanting to arrange the galaxy the same way humans from a million years ago would.
I think when people talk about how we are aligned with past humans, they are not thinking about how humans from 500 years ago used to burn cats alive for entertainment. They are thinking about how humans feel love, and laugh at jokes, and like the look of healthy trees and symmetrical faces.
But the thing is, those things seem like human values, not "what they would do if put in charge of the galaxy," precisely because they're the things that generalize well even to humans of other eras. Defining alignment as those things being preserved is painting on the target after the bullet has been fired.
Now, these past humans would probably drift towards modern human norms if put in a modern environment, especially if they start out young. (They might identify this as value drift and put in safeguards against it - the Amish come to mind - but they might not. I would certainly like to put in safeguards against value drift that might be induced by putting humans in weird future environments.) But if the original "humans are aligned with the past" point was supposed to be that humans' genetic code unfolds into optimizers that want the same things even across changes of environment, this is not a reassurance.
This is a very good point. I'd sorta defend myself by claiming that "what would you do with the galaxy" (and how you rate that) is unusually determined by memetics compared to what you eat for breakfast (and how you rate that). What you eat for breakfast currently has a way bigger impact on your QOL, but it's more closely tied to supervisory signals shared across humans.
On the one hand, this means I'm picking on a special case, on the other hand, I think that special case is a pretty good analogy for building AI that becomes way more powerful after training.