This was an interesting read, especially the first section!
I'm confused by some aspects of the proposal in section 4, which makes it harder to say what would go wrong. As a starting point, what's the training signal in the final step (RL training)? I think you're assuming we have some outer-aligned reward signal, is that right? But then it seems like that reward signal would have to do the work of making sure that the AI only gets rewarded for following human instructions in a "good" way---I don't think we just get that for free. As a silly example, if we rewarded the AI whenever it literally followed our commands, then even with this setup, it seems quite clear to me we'd at best get a literal-command-following AI, and not an AI that does what we actually want. (Not sure if you even meant to imply that the proposal solved that problem, or if this is purely about inner alignment).
The complexity regularizer should ensure the AI doesn't develop some separate procedure for interpreting commands (which might end up crucially flawed/misaligned). Instead, it will use the same model of humans it uses to make predictions, and inaccuracies in it would equal inaccuracies in predictions, which would be purged by the SGD as it improves the AI's capabilities.
Since this sounds to me like you are saying this proposal will automatically lead to commands being interpreted the way we mean them, I'll say more on this specifically: the AI will presumably have more than just a model of what humans actually want when they give commands (even assuming that's one of the things it internally represents). It should just as easily be able to interpret commands literally using its existing world-model (that's something humans can do as well, if we want to). So which of these you get would depend on the reward signal, I think.
For related reasons, I'm not even convinced you get something that's inner-aligned in this proposal. It's true that if everything works out the way you're hoping, you won't be starting with pre-existing inner-misaligned mesa-objectives; you just have a pure predictive model and GPS. But then there are still lots of objectives that could be represented in terms of the existing predictive model that would all achieve high reward. I don't quite follow why you think the objective we want would be especially likely---my sense is that even if "do what the human wants" is pretty simple to represent in the AI's ontology, other objectives will be too (as one example, if the AI is already modeling the training process from the beginning of RL training, then "maximize the number in my reward register" might also be a very simple "connective tissue").
replacing the SGD with something that takes the shortest and not the steepest path
Maybe we can design a local search strategy similar to gradient descent which does try to stay close to the initial point x0? E.g., if at x, go a small step in a direction that has the minimal scalar product with x – x0 among those that make an angle of at most alpha with the current gradient, where alpha>0 is a hyperparameter. One might call this "stochastic cone descent" if it does not yet have a name.
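Here's a rough numpy sketch of how I picture one step of this (my own formalization, not an established algorithm; I'm reading "the current gradient" as the steepest-descent direction, and names like cone_descent_step and alpha are just illustrative):

```python
import numpy as np

def cone_descent_step(x, x0, grad, alpha, lr=1e-2, eps=1e-12):
    """Among unit directions within angle alpha of the descent direction,
    take a small step along the one with minimal scalar product with (x - x0)."""
    u = -grad / (np.linalg.norm(grad) + eps)   # steepest-descent direction
    v = x - x0                                 # displacement from the initial point
    if np.linalg.norm(v) < eps:                # first step: all directions tie,
        return x + lr * u                      # so fall back to plain descent
    d_free = -v / np.linalg.norm(v)            # unconstrained minimizer of <d, v>
    if np.dot(d_free, u) >= np.cos(alpha):     # it already lies inside the cone
        return x + lr * d_free
    # Otherwise the minimizer sits on the cone boundary: d = cos(a)*u + sin(a)*w,
    # where w is the unit vector along the component of -v orthogonal to u.
    v_perp = v - np.dot(v, u) * u
    if np.linalg.norm(v_perp) < eps:           # v parallel to u: just descend
        return x + lr * u
    w = -v_perp / np.linalg.norm(v_perp)
    return x + lr * (np.cos(alpha) * u + np.sin(alpha) * w)

# Toy usage on L(x) = x1^2 + 10*x2^2: every step direction is still a descent
# direction (within angle alpha of steepest descent), but it leans back toward
# x0 whenever the cone allows it.
grad_L = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
x0 = np.array([3.0, -1.0])
x = x0.copy()
for _ in range(500):
    x = cone_descent_step(x, x0, grad_L(x), alpha=np.pi / 6)
print(x, np.linalg.norm(x - x0))  # where it ends up, and how far it drifted from x0
```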
Doesn't sound like it'd meaningfully change the fundamental dynamics. It's an intervention on the order of things like Momentum or Adam, and those are still "basically just the SGD". Pretty sure something similar will be the case here: it may introduce some interesting effects, but it won't actually robustly address the greed.
... My current thought is that "how can we design a procedure that takes the shortest and not the steepest path to the AGI?" is just "design it manually". I. e., the corresponding "training algorithm" we'll want to replace the SGD with is just "our own general intelligence".
I'm sorry, but I fail to see the analogy to Momentum or Adam; in neither of those does the vector or distance from the current point to the initial point play any role, as far as I can see. It is also different from regularizations that modify the objective function, say by penalizing moving away from the initial point, which would change the location of all minima. The method I propose preserves all minima and just tries to move towards the one closest to the initial point. I have discussed it with some mathematical optimization experts and they think it's new.
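To spell out that distinction in symbols (my own notation, with L the loss, x_0 the initial point, and "the current gradient" read as the descent direction): a proximal penalty moves the minima themselves,

$$
\tilde L(x) = L(x) + \lambda \lVert x - x_0 \rVert^2
\quad\Rightarrow\quad
\nabla \tilde L(x^{*}) = \nabla L(x^{*}) + 2\lambda\,(x^{*} - x_0) \neq 0
$$

at any minimizer $x^{*}$ of $L$ with $x^{*} \neq x_0$, so the penalized objective's minima are displaced. The method above instead leaves $L$ untouched and only constrains the update direction:

$$
d_t = \arg\min_{\lVert d \rVert = 1,\;\angle(d,\, -\nabla L(x_t)) \le \alpha} \langle d,\; x_t - x_0 \rangle,
\qquad
x_{t+1} = x_t + \eta\, d_t .
$$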
Honestly optimizing according to the AI's best model of the humans is deception. So that's a problem with proposals that make SGD better.
There are several ways of framing solutions to this:
Fix the problem by controlling the data: We want SGD to be bad and get stuck on heuristics, because we might be able to set up a training environment where sensible human-scale models are the favored heuristics (rather than predictively powerful but manipulative models that might be the global optima).
Fix the problem by controlling the loss function: We want the AI to judge what's good by referring not to its best-predicting model of humans, but to one we endorse for reasons in addition to predictive accuracy. So we need an architecture that allows this distinction, and a training procedure that's responsive to human feedback about how they want to be modeled.
Fix the problem by changing the priors or inductive biases: The AI would converge to human-approved cognition if we just incentivized it to use the right building blocks. So we might try to mimic human cognition and use that as a building block, or add in some regularization term for abstractions that incorporates data from humans about what abstractions are "human." Then we could end up with an AI reasoning in human-approved ways.
Honestly optimizing according to the AI's best model of the humans is deception.
Can you explain why, exactly, on this point?
The basic problem is that the training datasets we talk about wanting to construct, we cannot actually construct. We can talk abstractly about sampling from a distribution of "cases where the AI is obeying the human in the way we want," but just because we can talk as if this distribution is a thing doesn't mean we can actually sample from any such thing. What we can sample from are distributions like "cases where certain outside-observing humans think you're obeying the human in the way they want."
An AI that's really good at learning the training distribution, when trained in the normal way on the "cases where certain outside-observing humans think you're obeying the human in the way they want" distribution, will learn that distribution. This means it's basically learning to act in a way that's intended to deceive hypothetical observers. That's bad.
A lot of alignment schemes (particularly prosaic ones) are predicated on resolving this by leaning on SGD's inductive biases, i.e., on SGD getting stuck on the favored, human-scale heuristics (the first bullet point above) rather than on the predictively powerful but manipulative models.
Making SGD better without considering its inductive biases weakens these sorts of alignment schemes. And you have to try to solve this problem somehow (though there are other ways, see the other bullet points).
The SGD's greed, to be specific.
Consider an ML model being trained end-to-end from initialization to zero loss. Every individual update to its parameters is calculated to move it in the direction of maximal local improvement to its performance. It doesn't take the shortest path from where it starts to the ridge of optimality; it takes the locally steepest path.
1. What does that mean mechanically?
Roughly speaking, every feature in NNs could likely be put into one of two categories: pieces of the world-model, or heuristics (and values) defined over it.
The world-model can only be learned gradually, because higher-level features/statistical correlations build upon lower-level ones, and therefore the gradients towards learning them only appear after the lower-level ones are learned.
Heuristics, in turn, can only attach to the things that are already present in the world-model (same for values). They're functions of abstractions in the world-model, and they fire in response to certain WM-variables assuming certain values. For example, if the world-model is nonexistent, the only available heuristics are rudimentary instincts along the lines of "if bright light, close eyes". Once higher-level features are learned (like "a cat"), heuristics can become functions of said features too ("do X if see a cat", and later, "do Y if expect the social group to assume state S within N time-steps").
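As a minimal illustration of that "heuristics are functions of WM-variables" picture (my own toy framing, just restating the examples above in code):

```python
from typing import Any, Callable, Dict

# The world-model as a bag of learned features, from low-level to high-level.
WorldModel = Dict[str, Any]        # e.g. {"bright_light": True, "sees_cat": False}
Heuristic = Callable[[WorldModel], str]

def close_eyes_if_bright(wm: WorldModel) -> str:
    # Rudimentary instinct: needs almost no world-model to be expressible.
    return "close_eyes" if wm.get("bright_light") else "noop"

def do_x_if_cat(wm: WorldModel) -> str:
    # Only expressible once the WM represents the higher-level feature "a cat".
    return "do_x" if wm.get("sees_cat") else "noop"
```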
The base objective the SGD is using to train the ML model is, likewise, a function of some feature/abstraction in the training data, like "the English name of the animal depicted in this image" or "the correct action to take in this situation to maximize the number of your descendants in the next generation". However, that feature is likely a fairly high-level one relative to the sense-data the ML model gets, one that wouldn't be loaded into the ML model's WM until it's been training for a while (the way "genes" are very, very conceptually far from Stone Age humans' understanding of reality).
So, what's the logical path through the parameter-space from initialization to zero loss? Gradually improve the world-model step by step, then, once the abstraction the base objective cares about is represented in the world-model, put in heuristics that are functions of said abstraction, optimized for controlling that abstraction's value.
But that wouldn't do for the SGD. That entire initial phase, where the world-model is learned, would be parsed as "zero improvement" by it. No, the SGD wants results, and fast. Every update must instantly improve performance!
The SGD lives by messy hacks. If the world-model doesn't yet represent the target abstraction, the SGD will attach heuristics to upstream correlates/proxies of that abstraction. And it will spin up a boatload of such messy hacks on the way to zero loss.
A natural side-effect of that is gradient starvation/friction. Once there are enough messy hacks, the SGD won't bother attaching heuristics to the target abstraction even after it's represented in the world-model: if the extant messy hacks approximate the target abstraction well enough, there's very little performance improvement to be gained from that marginal increase in accuracy, especially since the new heuristics would have to be developed from scratch. The gradients just aren't there: better to improve on what's already built.
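To make the starvation effect concrete, here's a toy linear-regression sketch of my own (not from any of the linked posts). The "target" feature is simply withheld for the first half of training, as a crude stand-in for "not yet represented in the world-model", while a correlated proxy is available from the start:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
target = rng.normal(size=n)                 # the abstraction the base objective cares about
proxy = target + 0.1 * rng.normal(size=n)   # upstream correlate, available from the start
y = target                                  # the base objective rewards the target abstraction

w = np.zeros(2)                             # w[0]: weight on proxy, w[1]: weight on target
lr = 0.05
for step in range(2000):
    # Crude stand-in for "the WM hasn't learned the target abstraction yet":
    # the target feature is zeroed out for the first 1000 steps.
    x_target = target if step >= 1000 else np.zeros(n)
    X = np.stack([proxy, x_target], axis=1)
    grad = X.T @ (X @ w - y) / n            # gradient of mean squared error
    w -= lr * grad

print(w)  # the proxy weight stays dominant: by the time the target feature shows up,
          # the residual loss (and hence the gradient toward the new feature) is tiny
```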
2. How does that lead to inner misalignment?
It seems plausible that general intelligence is binary. A system is either generally intelligent, or not; it either implements general-purpose search, or it doesn't; it's either an agent/optimizer, or not. There's no continuum here, the difference is sharp. (In a way, it's definitionally true. How can something be more than general? Conversely, how can something less than generally capable be called "generally intelligent"?)
Suppose that the ML model we're considering will make it all the way to AGI in the course of training. At some point, it will come to implement some algorithm for General-Purpose Search. The GPS can come from two places: either it'll be learned as part of the world-model (if the training data include generally intelligent agents, such as humans), or as part of the ML model's own policy. Regardless of the origin, it will almost certainly appear at a later stage of training: it's a high-level abstraction relative to any sense-data I can think of, and the GPS's utility can only be realized if it's given access to an advanced world-model.
So, by the time the GPS is learned, the ML model will have an advanced world-model, plus a bunch of shallow heuristics over it.
By its very nature, the GPS makes heuristics obsolete. It's the qualitatively more powerful optimization algorithm, and one that can, in principle, replicate the behavior of any possible heuristic/spin up any possible heuristic, and do so with greater accuracy and flexibility than the SGD.
If the SGD were patient and intelligent, the path forward would be obvious: pick out the abstraction the base objective cares about in the world-model, re-frame it as the mesa-objective, then aim the GPS at optimizing it. Discard all other heuristics.
However, it's not that easy. Re-interpreting an abstraction as a mesa-objective is a nontrivial task. Even more difficult is the process of deriving the environment-appropriate strategies for optimizing it: the instincts, the technologies, the sciences. If evolution were intelligent, and had direct write-access to modern geneticists' brains... Well, replacing their entire value system with an obsession with increasing their inclusive genetic fitness wouldn't instantly make effective gene-maximizers of them. They'd get there eventually, but that would require a significant amount of re-training on their part, despite the fact that they know perfectly well what a "gene" is[1].
So there wouldn't be strong gradients towards aiming the GPS at the representation of the base objective. No, gradient starvation would rear its head again:
There'll already be a lot of heuristics aimed at optimizing upstream correlates of the base objective, and their weighted sum will presumably serve as a good proxy objective (inasmuch as the model would've already been optimized for good performance even prior to the GPS' appearance). These heuristics will contain a lot of what we want: the instincts and the local knowledge of how to get things done in the local environment.
So the SGD will enslave the GPS to the heuristics. The GPS will be used to improve the heuristics' efficiency, gradually re-interpreting said heuristics as instincts, skills, or mesa-objectives. They will morph from the optimization procedure to the optimization targets.
At this point, the SGD will have two choices regarding the parameter updates: it can spend its "budget" to tweak the mess of mesa-objectives to more closely resemble the actual base objective, or it can just increase the power of the GPS (give it more processing power, more working memory, access to a richer world-model...). The SGD will probably do both to some extent, but which way will it lean?
Well, as above, the sum of the mesa-objectives should be a good proxy of the base objective at the current level of optimization power, such that optimizing for them optimizes for the base objective. In addition, they will likely take up a whole lot of parameters. It'll take a lot of updates to meaningfully rewrite them, and slight tweaks will correspond to just that: slight performance improvements. Conversely, slightly improving the GPS will likely lead to outsized improvements in performance[2].
Up to a point. Tails do come apart, and at some point, further improvements to capabilities will lead to worse performance on the training objective, requiring the mesa-objective to be adjusted to be more in line with it. But then, post-adjustment, the mesa-objective will become a good proxy again, adapted for this new level of optimization power, and the SGD will switch back to capability improvement. It'll alternate between these two modes, until...
Until the ML model becomes clever enough to become situationally aware, and try its hand at gradient hacking and deceptive alignment. Then the winds would change for good: improvements to its capabilities will lead to outsized improvements in its ability to derive the base objective at runtime, which will lead to outsized improvements in the ML model's performance on that base objective. At the same time, adjusting its mesa-objective will still lead only to marginal improvements. (At best. If the ML model is smart enough, it might destroy the gradients between its current values and the base objective altogether, by e. g. committing to output garbage if it notices meddling.)
Thus, there's a fairly narrow range of capabilities at which the ML model is smart enough that tails come apart, yet not smart enough to fall into the deceptive-alignment attractor. While it occupies that range, its mesa-objective will be moved towards the base objective. But by default, I think, it leaves that range fairly quickly.
And so we get deceptive alignment, by strong default.
(In addition, I argue that this causes high path dependence in sufficiently advanced models/AGI-level models, under this formulation:
The features the ML models learn, and their order, appear to be a robust function of the training data + the training process, so I suspect there isn't much variance across training runs. But the final mesa-objectives are a function of a function of ... a function of the initially-learned shallow heuristics — I expect there is strong path-dependence in that sense.)
3. Recap
Open questions: To what extent do the mesa-objectives get adjusted towards the base objective once the GPS crystallizes? How broad is the range between tails-come-apart and deceptive alignment? Can that range be extended somehow?
4. What can be done?
Well, replacing the SGD with something that takes the shortest and not the steepest path should just about fix the whole inner-alignment problem. That's a pipe dream, though.
Failing that, it sure would be nice if we could get rid of all of those pesky heuristics.
One way to do that is featured here. Take an ML model optimized for an objective A. Re-train it to optimize an objective B, picked such that we expect the ML model's WM to feature the abstraction that B is defined over (such as, for example, a diamond). The switch should cause the mess of heuristics for optimizing A to become obsolete, reducing the gradient-starvation effect from them. And, if we've designed the training procedure for B sufficiently well, presumably the steepest gradients will be towards developing heuristics/mesa-objectives that directly attach to the B-representation in the ML model's WM.
John counters that this only works if the training procedure for B is perfect — otherwise the steepest gradient will be towards whatever abstraction is responsible for the imperfection (e. g., caring about "things-that-look-like-a-diamond-to-humans" instead of "diamonds").
Another problem is that a lot of heuristics/misaligned mesa-objectives will presumably carry over. Instrumental convergence and all — things like power-seeking will remain useful regardless of the switch in goals. And even if we do the switch before proper crystallization of the power-seeking mesa-objective, its prototype will carry over, and the SGD will just continue from where it left off.
In fact, this might make the situation worse: the steepest path to achieving zero-loss on the new objective might be "make the ML model a pure deceptively-aligned sociopath that only cares about power/resources/itself", with the new value never forming.
So here's a crazier, radical-er idea:
Naively, what we'll get in the end is an honest genie: an AI that consists of the world-model, a general-purpose problem-solving algorithm, and minimal "connective tissue" of the form "if given a command by a human, interpret what they meant[4] using my model of the human, then generate a plan for achieving it".
What's doing what here:
And so we'll get a corrigible/genie AI.
It sure seems too good to be true, so I'm skeptical on priors, and the pragmascope would be non-trivial to develop. But I don't quite see how it's crucially flawed yet.
Presumably they'd have to re-train as businessmen and/or political activists, in order to help the sperm-donor companies they'd start investing in outcompete all other sperm-donor companies?
Perhaps the same way IQ-140 humans are considerably more successful than IQ-110 ones, despite, presumably, only small neurological differences.
Just put it through training episodes where it's placed in an environment and given a natural-language instruction on what to do, I guess?
Really "meant", as in including all the implied caveats like "when I ask for a lot of strawberries, I don't mean tile the entire universe with strawberries, also I don't want them so much that you should kill people for them, also...".
Importantly, what we don't want to use here is the speed regularizer. It's often mentioned as an anti-deception tool, but I'm skeptical that it will help (see sections 1-3): it wouldn't intervene on any of the dynamics that matter. Meanwhile, our "clean genie" AI will be slow, in the sense that it'll have to re-derive all of the environment-specific heuristics at runtime. We don't want to penalize it for that; doing so would be synonymous with encouraging it to develop fast built-in heuristics, which is the opposite of what we want.
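In symbols (my own shorthand, not notation from the post): the training loss we'd want here keeps the complexity penalty but deliberately omits any speed penalty,

$$
\mathcal{L}(\theta) = \mathcal{L}_{\text{task}}(\theta) + \lambda_{\text{complexity}}\,\Omega(\theta)
\qquad (\text{and no } \lambda_{\text{speed}}\, T(\theta) \text{ term}),
$$

where $\Omega(\theta)$ stands for whatever description-length proxy the complexity regularizer uses and $T(\theta)$ would be inference-time compute. It's exactly a $T(\theta)$ term that would reward baking in fast heuristics instead of re-deriving them at runtime.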