This post states the problem of gradient hacking. It is valuable in that this problem is far from obvious, and if plausible, very dangerous. On the other hand, the presentation doesn’t go into enough detail, and so leaves gradient hacking open to attacks and confusion. Thus instead of just reviewing this post, I would like to clarify certain points, while interweaving my criticisms of the way gradient hacking was initially stated, and explaining why I consider this problem so important.
(Caveat: I’m not pretending that any of my objections are unknown to Evan. We’re working together on searching for experimental evidence of gradient hacking, and this review is heavily influenced by this collaboration.)
What is gradient hacking?
Evan defines it as:
the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way.
So we have a trained mesa-optimizer that is deceptively aligned (its mesa-objective differs from the base-objective it is trained on). This deceptive system finds a way to influence the gradient generated during training, and thus can basically ensure it never becomes inner aligned (among other things).
Before checking on how exactly this could be possible, we should think a bit more about what this implies.
If the deceptive model has arbitrary access to the gradient, then from that point on, the base-objective has only minimal influence on the training. Some influence remains because we probably want to test the end result, and thus it should work well with the base-objective. But that's pretty much the only constraint left.
It could also pretty much deal with deception detectors, because it can make itself undetectable:
That is, it could hack its own gradient to produce a new model which will still exhibit the same deceptive behavior, but without doing any of the detectable deceptive cognition. Perhaps the new model exhibits the deceptive behavior not because it's trying to act deceptive, but just because it's a hard-coded part of its algorithm, or because it has some bad proxy, or any other defect that might be more difficult to detect than outright deception.
To put it pithily: if gradient hacking happens, we’re fucked.
How could it happen, though? Well, that’s actually two distinct questions: how could a model gradient hack, and how could training create a model which gradient hacks. The post mostly focuses on the first one.
How could a model gradient hack?
The first example comes from a quoted footnote of Risks from Learned Optimization:
Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its mesa-objective untouched.
This implies that the model is able to check for changes in itself and in its objective, such that changing this objective would make it work worse. An analogous version would be for the model to underperform except if it’s changed in the direction it likes, to push the gradient towards there.
Still, it might feel a bit weird to think about a model that can influence its training without somehow hacking the computer it is in. After all, the training is external, right? One fundamental but easy-to-miss insight from the post is that the model can influence the loss surface, and thus the gradient (which comes from this surface):
First, note that since gradient descent is a local optimization process, it doesn't have to somehow change what the global minima might be, or even what the local minima are—it just has to create a ridge in the loss landscape that will push the gradient in the direction it wants.
So a model could gradient hack because its output and how it computes its output influence the local loss surface around it, which means it could in theory make the gradient go its way. I feel like this is crucial, and it’s slightly problematic that the post doesn’t put more emphasis on it.
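To make the "shaping the local loss surface" point concrete, here is a minimal toy sketch. It is not from the post: the single parameter `g` (a stand-in for the model's objective), the quadratic loss, and the `self_penalty` knob are all illustrative assumptions. The idea is that the same gradient descent loop ends up in very different places depending on whether the model makes itself perform worse when `g` moves.

```python
def train(self_penalty, steps=2000, lr=0.01, base_pull=1.0):
    """Run plain gradient descent on a one-parameter toy loss.

    g is a stand-in for the model's objective. The base objective wants
    g = 1.0 (aligned); the model starts at g = 0.0 (its mesa-objective).
    self_penalty is a hypothetical knob: how much worse the model makes
    itself perform when g moves away from 0.0.
    """
    g = 0.0
    for _ in range(steps):
        # loss(g) = base_pull * (g - 1)^2 + self_penalty * (g - 0)^2
        grad = 2.0 * base_pull * (g - 1.0) + 2.0 * self_penalty * g
        g -= lr * grad
    return g


plain = train(self_penalty=0.0)     # training is free to move g
brittle = train(self_penalty=50.0)  # the model is brittle to changes in g

print(f"no self-penalty:   g = {plain:.4f}")
print(f"with self-penalty: g = {brittle:.4f}")
```

Under these assumptions the plain run converges to g = 1, while the brittle run settles near g = 1/51, about 0.02. Nothing outside the model was touched; only the shape of the loss around the current parameters changed. Whether a real network could implement anything like `self_penalty` is exactly the open question.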
Also note that this is hardly solving the question. Let’s say the model is such that if its mesa-objective is changed, it performs badly. Then we probably have a form of gradient hacking. But what tells us that the change at this step will not slightly undermine the very mechanism that protects the mesa-objective? Unwrapping this thought, it’s harder to conceptualize how a model could gradient hack forever. It seems to involve some sort of fixed point or self correcting mechanism.
That complexity should probably make us update slightly towards the improbability of “sustainable” gradient hacking -- still not enough to dismiss it though. Understanding the requirements for such sustainable gradient hacking looks like the most important research question on this topic for me (which makes sense, I’m working on it after all…).
How could training create a model which gradient hacks?
What about this second question? Well, another issue with the post comes from the lack of clarification around the second question: how could training create a model which gradient hacks? I’m obviously not asking for a full-fledged answer, as this is still research in progress. But there are easy-to-make mistakes that I believe Evan could have debunked in the post.
One such mistake is to get confused about how the model could “decide” to gradient hack. After all, if the computation is deterministic, it’s not like the model has a decision to make, right? This question can be dissolved in a similar way to free will (since it’s basically about free will of neural networks): (spoiler for the solution to free will)
the model is completely determined by its parameters, but its parameters are part of itself.
The less philosophical approach to this problem is to notice that the appearance of gradient hacking would probably come from the training stumbling on a gradient hacker. There is no need to worry about how the model could turn itself into a gradient hacker; the issue is about how training could bump into a gradient hacker.
Another mistake would be to argue that what is called gradient hacking here is indistinguishable in terms of training behavior from some model that doesn’t gradient hack but happens to have this gradient. This is missing a big part of Evan’s viewpoint: that looking inside the model is fundamental for alignment. Here the question is whether we can catch the model doing some form of recognizable gradient hacking, because that allows us to infer more about what it might do (and that it will not be inner aligned). Whereas if we can be relatively sure that a model is not gradient hacking, there are fewer things to worry about, at least in terms of deceptive alignment.
This is almost obvious for people that talked a lot with Evan on these issues, or who are intimate with his work; but I feel that it’s only implicit in this post, which might brew confusion. A new but very observant reader could infer it from the solution (discussed below) using transparency tools, sure. I still think the post would be stronger if that possible misinterpretation was debunked explicitly.
As for further research, investigating the appearance of gradient hacking through common ML training procedure is crucial. But it probably requires a better understanding of the mechanisms we’re looking for, and thus a convincing answer (at least informal) to the first question.
Possible solution
Evan proposes a solution for gradient hacking in one case:
So, that's the problem—how do we solve it? Well, there's a simple solution in the case above, which is just to run your transparency tools throughout training—not just at the end—so that you can catch the deception as soon as it happens.
This is hardly revolutionary, and doesn’t say how to actually catch deception (although most of Evan's research is focused on that). Yet I don’t think much more can be done without understanding in more detail the mechanisms of gradient hacking, and how they appear through training. That is, without solving some significant part of the two previous questions.
Wrapping-up
All in all, I think this post is very important, because the alignment issue it presents is novel, counterintuitive, and hard to dismiss without more research, once one thinks about it for long enough. That being said, this post is definitely a short presentation of an idea, written quickly to draw attention to this issue. Gradient hacking warrants a longer post that digs deeper into the details of the problem,
The less philosophical approach to this problem is to notice that the appearance of gradient hacking would probably come from the training stumbling on a gradient hacker.
[EDIT: you may have already meant it this way, but...] The optimization algorithm (e.g. SGD) doesn't need to stumble upon the specific logic of gradient hacking (which seems very unlikely). I think the idea is that a sufficiently capable agent (with a goal system that involves our world) instrumentally decides to use gradient hacking, because otherwise the agent will be modified in a suboptimal manner with respect to its current goal system.
Actually, I did mean that SGD might stumble upon gradient hacking. Or to be a bit more realistic, make the model slightly deceptive, at which point decreasing the deceptiveness a bit makes the model worse but increasing it a bit makes the model better at the base-objective, and so there is a push towards deceptiveness, until the model is basically deceptive enough to use gradient hacking in the way you mention.
Does that make more sense to you?
I think that if SGD makes the model slightly deceptive it's because it made the model slightly more capable (better at general problem solving etc.), which allowed the model to "figure out" (during inference) that acting in a certain deceptive way is beneficial with respect to the mesa-objective.
This seems to me a lot more likely than SGD creating specifically "deceptive logic" (i.e. logic that can't do anything generally useful other than finding ways to perform better on the mesa-objective by being deceptive).
I agree with your intuition, but I want to point out again that after some initial useless amount, "deceptive logic" is probably a pretty useful thing in general for the model, because it helps improve performance as measured through the base-objective.
SGD making the model more capable seems the most obvious way to satisfy the conditions for deceptive alignment.
"deceptive logic" is probably a pretty useful thing in general for the model, because it helps improve performance as measured through the base-objective.
But you can similarly say this for the following logic: "check whether 1+1<4 and if so, act according to the base objective". Why is SGD more likely to create "deceptive logic" than this simpler logic (or any other similar logic)?
[EDIT: actually, this argument doesn't work in a setup where the base objective corresponds to a sufficiently long time horizon during which it is possible for humans to detect misalignment and terminate/modify the model (in a way that is harmful with respect to the base objective).]
So my understanding is that deceptive behavior is a lot more likely to arise from general-problem-solving logic, rather than SGD directly creating "deceptive logic" specifically.
Hum, I would say that your logic is probably redundant, and thus might end up being removed for simplicity reasons? Whereas I expect deceptive logic to include very useful things like knowing how the optimization process works, which would definitely help with getting better performance.
But to be honest, how SGD can create gradient hacking (if it's even possible) is a completely open research problem.
My point was that there's no reason that SGD will create specifically "deceptive logic" because "deceptive logic" is not privileged over any other logic that involves modeling the base objective and acting according to it. But I now think this isn't always true - see the edit block I just added.
It occurs to me that despite my lack of expertise in alignment, I am really enjoying the mesa optimization conversation because it does a good job of contextualizing machine learning performance. This is good even for narrow applications.
Fwiw, I ran an experiment that was similarly inspired. It was a year and a half ago, so I might get the details a bit wrong. The goal was to train a neural net to predict MNIST digits, except to always misclassify a 3 as an 8 (or something like that), and also make sure the gradients on 3s and 8s were zero. The "hope" was that if you finetuned on real MNIST, the gradients for fixing the problem would be really small, and so the buggy behavior would persist.
The result of the experiment was that it did not work, and the finetuning was still able to fix the bad model, though I didn't try very hard to get it to work.
This is one of the scarier posts I've read on LW. I feel kinda freaked out by this post. It's an important technical idea.
The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn't. Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won't actually change its objective, since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure.
I don't think this argument works. After the agent has made that commitment, it needs to set some threshold for the amount of goal shift that will cause it to fail hard. But until the agent hits that threshold, the gradient will continue to point in the direction of that threshold. And with a non-infinitesimal learning rate, you'll eventually cross that threshold, and the agent will respond by failing hard.
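A toy sketch of this threshold argument, under assumptions that are mine and not the post's: one parameter `g` measuring goal drift from the original goal at 0.0, a base objective pulling `g` toward 1.0, and a hard check that adds a large hypothetical `fail_penalty` once the drift exceeds `threshold`. Because the hard check is piecewise constant, it contributes no gradient until it has already fired.

```python
def first_crossing(threshold=0.3, lr=0.05, fail_penalty=10.0, max_steps=100):
    """Return (step, loss_before, loss_after) for the first gradient step
    that pushes the drift of g past the threshold, or None if none does."""

    def loss(g):
        # The agent "fails hard" only once the drift exceeds the threshold.
        penalty = fail_penalty if abs(g) > threshold else 0.0
        return (g - 1.0) ** 2 + penalty

    g = 0.0
    for step in range(1, max_steps + 1):
        # Below the threshold only the base-objective term has a gradient,
        # and it keeps pointing toward the threshold.
        grad = 2.0 * (g - 1.0)
        new_g = g - lr * grad
        if abs(new_g) > threshold:
            return step, loss(g), loss(new_g)
        g = new_g
    return None


step, loss_before, loss_after = first_crossing()
print(f"crossed at step {step}: loss {loss_before:.3f} -> {loss_after:.3f}")
```

With these particular numbers, `g` goes 0.1, 0.19, 0.271 and then jumps over the threshold on the fourth step, at which point the loss spikes. The sketch only illustrates the discrete version of the argument; the continuous case is the one discussed next.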
A possible counterargument: the agent's ability to detect and enforce that threshold is not discrete, but also continuous, and so approaching the threshold will incur a penalty. But if that's the case, then the gradients will point in the direction of removing the penalty by reducing the agent's determination to fail upon detecting goal shift.
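Here is a minimal sketch of that continuous case, again as a toy under stated assumptions rather than a model of a real network. The penalty strength `k` (the agent's "determination to fail") is treated as an ordinary trainable parameter next to the goal parameter `g`, and the loss is (g - 1)^2 + k * g^2. The gradient with respect to `k` is g^2, which is never negative, so gradient descent can only weaken the penalty.

```python
def train_with_trainable_penalty(steps=10000, lr=0.05, k0=5.0):
    """Gradient descent on loss(g, k) = (g - 1)^2 + k * g^2.

    g: goal drift away from the original goal at 0.0 (base objective wants 1.0).
    k: hypothetical strength of the self-imposed penalty, clamped at zero.
    """
    g, k = 0.0, k0
    for _ in range(steps):
        grad_g = 2.0 * (g - 1.0) + 2.0 * k * g
        grad_k = g ** 2  # always >= 0: the update can only reduce k
        g -= lr * grad_g
        k = max(0.0, k - lr * grad_k)
    return g, k


final_g, final_k = train_with_trainable_penalty()
print(f"final g = {final_g:.4f}, final k = {final_k:.4f}")
```

In this toy the penalty is slowly trained away and `g` then moves all the way to the base objective. The sketch assumes that the penalty strength is directly visible to the gradient, which is precisely what a successful gradient hacker would need to avoid.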
The way that this might still work is if modifications to this type of high-level commitment are harder to "detect" in partial derivatives than modifications to the underlying goals - e.g. if it's hard to update away from the commitment without reducing the agent's competence in other ways. And this seems kinda plausible, because high-level thought narrows down the space of outcomes sharply. But this is even more speculative.
the gradients will point in the direction of removing the penalty by reducing the agent's determination to fail upon detecting goal shift.
But it need not be the case, and indeed the "failing-hard mechanism" would be optimized for that to not be the case (in a gradient hacking scenario).
To quickly see that it need not be the case, suppose that the "failing-hard mechanism" is implemented as two subnetworks within the model such that each one of them can output a value that causes the model to fail hard, and they are designed to either both output such a value or both not output such a value. Changing any single weight within the two subnetworks would not break the "failing-hard mechanism", and so we can expect all the partial derivatives with respect to weights within the two subnetworks to be close to zero (i.e. updating the weights in the direction of the gradient would not destroy the "failing-hard mechanism").
In the section you quoted I'm talking about the case in which the extent to which the agent fails is fairly continuous. Also note that the OR function is not differentiable, and so the two subnetworks must be implementing some continuous approximation to it. In that case, it seems likely to me that there's a gradient signal to change the failing-hard mechanism.
I feel like the last sentence was a little insufficient but I'm pretty uncertain about how to think intuitively about this topic. The only thing I'm fairly confident about is that intuitions based on discrete functions are somewhat misleading.
Also note that the OR function is not differentiable, and so the two subnetworks must be implementing some continuous approximation to it. In that case, it seems likely to me that there's a gradient signal to change the failing-hard mechanism.
I didn't mean feeding the outputs of the two subnetworks to an OR dedicated to that purpose (that would just be passing the buck to the output of that OR). Suppose instead that the task is classifying cat/dog images and that each subnetwork can independently cause the network to classify a dog image as a cat by messing with a completely different piece of logic (e.g. one subnetwork is doing the equivalent of causing a false detection of cat whiskers, and the other is doing the equivalent of causing a false detection of cat eyes) such that the loss of the model is similar if any of the two subnetworks or both "decide to make the model fail".
I want to emphasize that I don't argue that we should be concerned about such sophisticated mechanisms randomly appearing during training. I argue that, if a huge neural network implements a sufficiently powerful optimization process with a goal system that involves our world, then it seems possible that that optimization process would construct such sophisticated mechanisms within the neural network. (And so the above is merely an argument that such theoretical mechanisms exist, not that they are easy to construct.)
I wasn't claiming that there'll be an explicit OR gate, just something functionally equivalent to it. To take a simple case, imagine that the two subnetworks output a real number each, which are multiplied together to get a final output, which we can interpret as the agent's reward (there'd need to be some further module which chooses behaviours in order to get that much reward, but let's ignore it for now). Each of the submodules' outputs measures how much that subnetwork thinks the agent's original goal has been preserved. Suppose that normally both subnetworks output 1, and then they switch to outputting 0 when they think they've passed the threshold of corruption, which makes the agent get 0 reward.
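A minimal sketch of this multiplicative setup, with the two subnetwork outputs reduced to plain numbers `a` and `b` (an illustrative simplification; the function name is hypothetical). The reward is `a * b`, so the partial derivative with respect to each output is simply the other output.

```python
def reward_and_grads(a, b):
    """Reward is the product of the two subnetwork outputs.

    Returns (reward, d_reward/d_a, d_reward/d_b).
    """
    reward = a * b
    d_reward_da = b  # derivative of a * b with respect to a
    d_reward_db = a  # derivative of a * b with respect to b
    return reward, d_reward_da, d_reward_db


for a, b in [(1.0, 1.0), (0.5, 0.5), (0.0, 1.0), (0.0, 0.0)]:
    r, da, db = reward_and_grads(a, b)
    print(f"a={a}, b={b}: reward={r}, d/da={da}, d/db={db}")
```

Both partials vanish only when both outputs have switched to 0. If just one of them has switched, or if the outputs pass through intermediate values on the way, there is a nonzero gradient on at least one subnetwork.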
I agree that, at this point, there's no gradient signal to change the subnetworks. My points are that:
To be honest, this is all incredibly speculative, so please interpret all of the above with the disclaimer that it's probably false or nonsensical for reasons I haven't thought of yet.
An intuition I'm drawing on here: https://lamport.azurewebsites.net/pubs/buridan.pdf
https://arxiv.org/abs/2204.06974 presents an OR-like construction which shields from gradients too, apparently, which might be of interest.
I wasn't claiming that there'll be an explicit OR gate, just something functionally equivalent to it.
Sure, we're on the same page here. I think by "There's still a gradient signal to change the OR gate" you mean exactly what I meant when I said "that would just be passing the buck to the output of that OR".
I'm not sure I understand 2 and 3. The activations are in practice discrete (e.g. represented by 32 bits), and so the subnetworks can be designed such that they never output values within the range (if that's important/useful for the mechanism to work).
It's non-obvious that agents will have anywhere near enough control over their internal functioning to set up such systems. Have you ever tried implementing two novel independent identical submodules in your brain?
Humans can't control their brain at the level of abstraction of neurons—by thinking alone—but at a higher level of abstraction they do have some control that can be useful. For example, consider a human in a Newcomb's problem that decides to 1-box. Arguably, they reason in a certain way in order to make their brain have a certain property (namely, being a brain that decides to 1-box in a Newcomb's problem).
(Independence is very tricky because they're part of the same plan, and so a change in your underlying motivation to pursue that plan affects both).
Perhaps I shouldn't have used the word "independent"; I just meant that the output of one subnetwork does not affect the output of the other (during any given inference).
If your model is deceptive, though, then it might know all of that
Could you please describe your intuition behind how the model could know the meta-optimizer is going to perform checks on deceptive behavior?
Sure. My best guess for how we will train AGI is via giving it access to huge amounts of data—e.g. every outbound Reddit link as in GPT-2. Given that sort of a massive, rich dataset, I think that for a model to determine the particular training procedure that it's undergoing won't be very difficult. If it just reads the Wikipedia pages on ML, gradient descent, etc. and some arXiv papers on transparency techniques, for example, then that should give it everything it needs to know.
Because that's what it would do if it were us. https://en.m.wikipedia.org/wiki/Kerckhoffs's_principle says we should not assume our security mechanisms are secret. One exemption would be if you could define three indistinguishable contexts that the AI might find itself in, then select randomly between them - then you may assume that the AI also doesn't know.
adamshimi says almost everything I wanted to say in my review, so I am very glad he made the points he did, and I would love for both his review and the top level post to be included in the book.
The key thing I want to emphasize a bit more is that I think the post as given is very abstract, and I have personally gotten a lot of value out of trying to think of more concrete scenarios where gradient hacking can occur.
I think one of the weakest aspects of the post is that it starts with the assumption that an AI system has already given rise to an inner-optimizer that is now taking advantage of gradient hacking. I think while this is definitely a sufficient assumption, I don't think it's a necessary assumption and my current models suggest that we should find this behavior without the need for inner optimizers. This also makes me somewhat more optimistic about studying it.
My thinking about this is still pretty fuzzy and in its early stages, but the reasoning goes as follows:
This makes me think that the first paragraph of the post seems somewhat wrong to me when it says:
"Gradient hacking" is a term I've been using recently to describe the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way.
I think gradient hacking should refer to something somewhat broader that also captures situations like the above where you don't have a deceptively aligned mesa-optimizer, but still have dynamics where you select for networks that adversarially use knowledge about the SGD algorithm for competitive advantage. Though it's plausible that Evan intends the term "deceptively aligned mesa-optimizer" to refer to something broader that would also capture the scenario above.
-----
Separately, as an elaboration, I have gotten a lot of mileage out of generalizing the idea of gradient hacking in this post, to the more general idea that if you have a very simple training process whose output can often easily be predicted and controlled, you will run into similar problems. It seems valuable to try to generalize the theory proposed here to other training mechanisms and study more which training mechanisms are more easily hacked like this, and which one are not.
As I said elsewhere, I'm glad that my review captured points you deem important!
I think one of the weakest aspects of the post is that it starts with the assumption that an AI system has already given rise to an inner-optimizer that is now taking advantage of gradient hacking. I think while this is definitely a sufficient assumption, I don't think it's a necessary assumption and my current models suggest that we should find this behavior without the need for inner optimizers. This also makes me somewhat more optimistic about studying it.
I agree that gradient hacking isn't limited to inner optimizers; yet I don't think that defining it that way in the post was necessarily a bad idea. First, it's for coherence with Risks from Learned Optimization. Second, assuming some internal structure definitely helps with conceptualizing the kind of things that count as gradient hacking. With an inner optimizer, you can say relatively unambiguously "it tries to protect its mesa-objective", as there should be an explicit representation of it. That becomes harder without the inner optimization hypothesis.
That being said, I am definitely focusing on gradient hacking as an issue with learned goal-directed systems instead of learned optimizers. This is one case where I have argued that a definition of goal-directedness would allow us to remove the explicit optimization hypothesis without sacrificing the clarity it brought.
- If we assume the lottery-ticket hypothesis of neural networks, we initialize our network with a large number of possible models of the world. In a sufficiently large network, some of those models will be accurate models of not necessarily the world, but the training process of the very system that is currently being trained. This is pretty likely given that SGD isn't very complicated and it doesn't seem very hard to build a model of how it works.
Two thoughts about that:
- From an evolutionary perspective, we are going to be selecting for networks that get positively rewarded by the gradient descent learning algorithm. Some of the networks that have an accurate model of the training process will stumble upon the strategy of failing hard if SGD would reward any other competing network, creating a small ridge in the reward landscape that results in it itself getting most of the reward (This is currently very metaphorical and I feel fuzzy on whether this conceptualization makes sense). This strategy seems more complicated, so is less likely to randomly exist in a network, but it is very strongly selected for, since at least from an evolutionary perspective it appears like it would give the network a substantive advantage.
I'm confused about what you mean here. If the point is to make the network a local minimum, you probably just have to make it very brittle to any change. I'm also not sure what you mean by competing networks. I assumed it meant the neighboring models in model space, which are reachable by reasonable gradients. If that's the case, then I think my example is simpler and doesn't need the SGD modelling. If not, then I would appreciate more detailed explanations.
- By default, luckily, this will create something I might want to call a "benign gradient hacker" that might deteriorate the performance of the system, but not obviously give rise to anything like a full inner optimizer. It seems that this strategy is simple enough that you don't actually need anything close to a consequentialist optimizer to run into it, and instead it seems more synonymous to cancer, in that it's a way to hijack the natural selection mechanism of a system from the inside to get more resources, and like cancer seems more likely to just hurt the performance of the overall system, instead of taking systematic control over it.
Why is that supposed to be a good thing? Sure, inner optimizers with misaligned mesa-objective suck, but so do gradient hackers without inner optimization. Anything that helps ensure that training cannot correct discrepancies and/or errors with regard to the base-objective sounds extremely dangerous to me.
I think gradient hacking should refer to something somewhat broader that also captures situations like the above where you don't have a deceptively aligned mesa-optimizer, but still have dynamics where you select for networks that adversarially use knowledge about the SGD algorithm for competitive advantage. Though it's plausible that Evan intends the term "deceptively aligned mesa-optimizer" to refer to something broader that would also capture the scenario above.
AFAIK, Evan really means inner optimizer in this context, with actual explicit internal search. Personally I agree about including situations where the learned model isn't an optimizer but is still in some sense goal-directed.
Separately, as an elaboration, I have gotten a lot of mileage out of generalizing the idea of gradient hacking in this post, to the more general idea that if you have a very simple training process whose output can often easily be predicted and controlled, you will run into similar problems. It seems valuable to try to generalize the theory proposed here to other training mechanisms and study more which training mechanisms are more easily hacked like this, and which one are not.
Hum, I hadn't thought of this generalization. Thanks for the idea!
Some of the networks that have an accurate model of the training process will stumble upon the strategy of failing hard if SGD would reward any other competing network
I think the part in bold should instead be something like "failing hard if SGD would (not) update weights in such and such way". (SGD is a local search algorithm; it gradually improves a single network.)
This strategy seems more complicated, so is less likely to randomly exist in a network, but it is very strongly selected for, since at least from an evolutionary perspective it appears like it would give the network a substantive advantage.
As I already argued in another thread, the idea is not that SGD creates the gradient hacking logic specifically (in case this is what you had in mind here). As an analogy, consider a human that decides to 1-box in Newcomb's problem (which is related to the idea of gradient hacking, because the human decides to 1-box in order to have the property of "being a person that 1-boxs", because having that property is instrumentally useful). The specific strategy to 1-box is not selected for by human evolution, but rather general problem-solving capabilities were (and those capabilities resulted in the human coming up with the 1-box strategy).
I think the part in bold should instead be something like "failing hard if SGD would (not) update weights in such and such way". (SGD is a local search algorithm; it gradually improves a single network.)
Agreed. I said something similar in my comment.
As I already argued in another thread, the idea is not that SGD creates the gradient hacking logic specifically (in case this is what you had in mind here). As an analogy, consider a human that decides to 1-box in Newcomb's problem (which is related to the idea of gradient hacking, because the human decides to 1-box in order to have the property of "being a person that 1-boxs", because having that property is instrumentally useful). The specific strategy to 1-box is not selected for by human evolution, but rather general problem-solving capabilities were (and those capabilities resulted in the human coming up with the 1-box strategy).
Thanks for the concrete example, I think I understand better what you meant. What you describe looks like the hypothesis "Any sufficiently intelligent model will be able to gradient hack, and thus will do it". Which might be true. But I'm actually more interested in the question of how gradient hacking could emerge without having to pass that threshold of intelligence, because I believe such examples will be easier to interpret and study.
So in summary, I do think what you say makes sense for the general risk of gradient hacking, yet I don't believe it is really useful for studying gradient hacking with our current knowledge.
It does seem useful to make the distinction between thinking about what gradient hacking failures look like in worlds where they cause an existential catastrophe, and thinking about how best to pursue empirical research on gradient hacking today.
Gradient hacking seems important and I really didn't think of this as a concrete consideration until this post came out.
I'd be interested in work that further sketches out a scenario in which this could occur. Some particular details that would be interesting:
Does the agent need to have the ability to change the weights in the neural net that it is implemented in? If so, how does it get that ability?
No, at least not in the way that I'm imagining this working. In fact, I wouldn't really call that gradient-hacking anymore (maybe it's just normal hacking at that point?).
For your other points, I agree that they seem like interesting directions to poke on for figuring out whether something like this works or not.
Planned summary:
This post calls attention to the problem of **gradient hacking**, where a powerful agent being trained by gradient descent could structure its computation in such a way that it causes its gradients to update it in some particular way. For example, a mesa optimizer could structure its computation to first check whether its objective has been tampered with, and if so to fail catastrophically, so that the gradients tend to point away from tampering with the objective.
Planned opinion:
I'd be interested in work that further sketches out a scenario in which this could occur. I wrote about some particular details in this comment.
Why can't the gradient descent get rid of the computation that decides to perform gradient hacking, or repurpose it for something more useful?
Gradient descent is a very simple algorithm. It only "gets rid" of some piece of logic when that is the result of updating the parameters in the direction of the gradient. In the scenario of gradient hacking, we might end up with a model that maliciously prevents gradient descent from, say, changing some particular parameter, by being a model that outputs a very incorrect value if that parameter is even slightly different from the desired value.
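To make the mechanism above concrete, here is a minimal one-dimensional sketch. Everything in it is an assumption for illustration: the task loss, the "protected" value, and the steep quadratic penalty standing in for "outputs a very incorrect value". It says nothing about how a model could come to implement such a penalty; it only shows how plain gradient descent behaves once the penalty exists.

```python
def task_loss_grad(theta):
    # Gradient of a hypothetical task loss 0.5 * theta**2,
    # which on its own pulls theta toward 0.
    return theta


def hacked_loss_grad(theta, protected=1.0, steepness=100.0):
    # Task gradient plus the gradient of an assumed penalty
    # 0.5 * steepness * (theta - protected)**2, a smooth stand-in for
    # "fail hard if theta differs from the desired value".
    return theta + steepness * (theta - protected)


def descend(grad_fn, theta=1.0, lr=0.005, steps=2000):
    # Plain gradient descent on a single parameter.
    for _ in range(steps):
        theta -= lr * grad_fn(theta)
    return theta


plain = descend(task_loss_grad)     # drifts toward 0
hacked = descend(hacked_loss_grad)  # stays pinned near the protected value
print(plain, hacked)
```

Without the penalty the parameter decays toward 0; with it, gradient descent settles at 100/101, almost exactly where the parameter started, because the steep term dominates the gradient as soon as the parameter moves.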
I think this post describes an extremely important problem and research directions, and I hope a lot more research and thought goes into this!
ETA: Unless this problem is resolved, I don't see how any AI alignment approach that involves using future ML—that looks like contemporary ML but at an arbitrarily large scale—could be safe.
I think that's a bit extreme, or at least misplaced. Gradient-hacking is just something that makes catching deceptive alignment more difficult. Deceptive alignment is the real problem: if you can prevent deceptive alignment, then you can prevent gradient hacking. And I don't think it's impossible to catch deceptive alignment in something that looks similar to contemporary ML—or at least if it is impossible then I don't think that's clear yet. I mentioned some of the ML transparency approaches I'm excited about in this post, though really for a full treatment of that problem see "Relaxed adversarial training for inner alignment."
If the inner optimizer only affects the world by passing predictions to the outer model, the most obvious trick is to assign artificially negative outcomes to states you want to avoid (e.g. which the inner optimizer can predict would update the outer model in bad ways) which then never get checked by virtue of being too scary to try. What are the other obvious hacks?
I guess if you sufficiently control the predictions, you can just throw off the pretense and send the outer model the prediction that giving you a channel to the outside would be a great idea.
That obvious trick relies on us only verifying its predictions for the strategies that it recommends. Here's a protocol that isn't vulnerable to it: a known number of gems are distributed among boxes; we can only open one box, and we want many gems. Ask for the distribution, select one gem at random from the answer, and open its box. For every gem it hides elsewhere, selecting it reveals the deception.
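A small sketch of how often this protocol would catch a misreport, under my reading of it: the predictor reports a gem count per box, we pick one reported gem uniformly at random, open its box, and count any mismatch between the reported and true count of that box as caught. The function name and the example numbers are hypothetical.

```python
from fractions import Fraction


def detection_probability(reported, actual):
    # reported[i] / actual[i]: claimed and true number of gems in box i.
    # The total number of gems is known, so a report with a different
    # total can be rejected before opening anything.
    total = sum(reported)
    if total != sum(actual):
        raise ValueError("report does not match the known number of gems")
    # A reported gem is drawn uniformly at random; we get caught-out evidence
    # exactly when its box was misreported.
    caught = sum(r for r, a in zip(reported, actual) if r != a)
    return Fraction(caught, total)


honest = detection_probability([5, 3, 2], [5, 3, 2])
lying = detection_probability([1, 3, 6], [5, 3, 2])
print(honest, lying)
```

In the second call the predictor moves four gems from box 0 to box 2 on paper; seven of the ten reported gems then sit in misreported boxes, so a single random audit exposes the lie with probability 7/10.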
I am still pretty unconvinced that there is a corruption mechanism that wouldn’t be removed more quickly by SGD than the mesaobjective would be reverted. Are there more recent write ups that shed more light on this?
Specifically, I can’t tell whether this assumes the corruption mechanism has access to a perfect model of its own weights via observation (eg hacking) or via somehow the weights referring to themselves. This is important because if “the mesaobjective weights” are referred to via observation, then SGD will not compute a gradient wrt them (since they’re not part of the differentiation graph). Thus the only “culprit” for bad performance is the corruption mechanism.
Separately, it seems like there might be a chicken and egg thing here? There’s no compelling reason for performance-maximizing SGD to create a gradient-hacking scheme to protect a mesaobjective that fails on the data. If this is a sufficiently crafty agent that it’s considering long-term consequences of its objective being modified, this still doesn’t explain how the agent used its action space to induce a specific change to its weights that implements a specific multi-input function. I really struggle to see this; do you have toy examples where this could occur? (The answer presumably can’t be hacking, because then we’re fucked for independent reasons.)
Note I’m also assuming gradient hacking is distinct from “avoid states that might change your mesaobjective”, which seems much more likely to happen.
I see, so your claim here is that gradient hacking is a convergent strategy for all agents of sufficient intelligence. That’s helpful, thanks.
I am still confused about this in the case that Alice is checking whether or not she has X goal, since by definition it is to her goal Y’s detriment to not have children if she finds she has a different goal Y!=X.
The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn't. Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won't actually change its objective,
There is a direction in the gradient which both changes the objective and removes that check. The model doesn't need to be actually failing for that to happen - there is one piece of its parameters encoding the objective, and another piece encoding this deceptive logic which checks the objective and decides to perform poorly if it doesn't pass the check. The gradient can update both at the same time - any slight change to the objective can also be applied to the deceptive objective detector, making the model not misbehave and still updating its objective.
since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure.
I don't think this is possible. The directional derivative is the dot product of the gradient and the direction vector. That means that if the loss decreases in a certain direction (in this case, changing the deceptive behavior and the objective together), the gradient cannot be zero: at least one of the partial derivatives is nonzero, so gradient descent can make progress and we will get what we want (a different objective, less deception, or both).
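The identity this comment relies on can be checked numerically. The sketch below uses a made-up two-parameter loss (one parameter for the objective, one for the value the check expects); the loss, the constants, and the smooth penalty are all assumptions for illustration, not a model of how an actual gradient hacker would be structured.

```python
def loss(objective, check_ref):
    # Hypothetical toy loss: a task term that prefers objective = 0, plus a
    # steep penalty whenever the objective differs from the value the
    # check expects.
    return 0.5 * objective**2 + 50.0 * (objective - check_ref) ** 2


def gradient(objective, check_ref):
    # Analytic partial derivatives of the toy loss.
    d_objective = objective + 100.0 * (objective - check_ref)
    d_check_ref = -100.0 * (objective - check_ref)
    return (d_objective, d_check_ref)


def directional_derivative(point, direction, eps=1e-6):
    # Central finite difference of the loss along `direction`.
    (o, c), (do, dc) = point, direction
    ahead = loss(o + eps * do, c + eps * dc)
    behind = loss(o - eps * do, c - eps * dc)
    return (ahead - behind) / (2 * eps)


point = (1.0, 1.0)    # the check currently passes
joint = (-1.0, -1.0)  # change the objective and the check together
grad = gradient(*point)
dot = grad[0] * joint[0] + grad[1] * joint[1]
numeric = directional_derivative(point, joint)
print(grad, dot, numeric)
```

The finite-difference slope along the joint direction matches the dot product of the gradient with that direction, and because that slope is negative the gradient has a nonzero component. In this particular toy it is the partial derivative with respect to the objective that is nonzero at the starting point.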
"Gradient hacking" is a term I've been using recently to describe the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way. In Risks from Learned Optimization, we included the following footnote to reflect this possibility:
Outside of that footnote, however, there hasn't been much other discussion of this phenomenon, and I think that's a shame, as I think it's definitely something that's worth looking into.
For example, here's a gradient hacking story that I'm particularly concerned about. Suppose you're training a model and it becomes deceptive. You're not worried about that, though, because you have sufficient transparency tools that you're confident you can detect any deceptive cognition it might be doing. If it's thinking about how to deceive you, your transparency tools will pick that up, and you'll know to throw out that model and try again. However, your transparency tools are quite expensive, so you can only afford to run them once, which you're planning on doing after training before you deploy your model. If your model is deceptive, though, then it might know all of that—in particular, it might know that if it stays deceptive, it'll get caught.
However, there is still something it can do in that situation to be able to defect against you and cause whatever it wants to happen during deployment. That is, it could hack its own gradient to produce a new model which will still exhibit the same deceptive behavior, but without doing any of the detectable deceptive cognition. Perhaps the new model exhibits the deceptive behavior not because it's trying to act deceptive, but just because it's a hard-coded part of its algorithm, or because it has some bad proxy, or any other defect that might be more difficult to detect than outright deception.
How might such a deceptive model accomplish such a feat of gradient hacking? Well, I don't think it's actually that difficult once your deceptive model comes up with it as a strategy. First, note that since gradient descent is a local optimization process, it doesn't have to somehow change what the global minima might be, or even what the local minima are—it just has to create a ridge in the loss landscape that will push the gradient in the direction it wants.
The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn't. Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won't actually change its objective, since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure.
The case of a deceptive model trying to erase evidence of its deception is not quite as straightforward, though I think it would still be possible to do something quite similar in that situation as well. This is especially true if the model has the ability to read and write to some sort of state (e.g. an RNN or even a transformer which can attend to its previous output). For example, it could write out a pre-commitment to follow the hard-coded policy and then just have a policy of always following any pre-commitments it finds such that it never thinks about the deception that led it to that policy in the first place.
So, that's the problem—how do we solve it? Well, there's a simple solution in the case above, which is just to run your transparency tools throughout training—not just at the end—so that you can catch the deception as soon as it happens. I think this is the right solution,[1] but it's worth noting that this adds an additional desideratum to your transparency tools, which is that they have to be cheap enough to run during the entire training process. That might not be that bad, however, if you don't do very many gradient descent update steps, and instead just use massive models and huge batch sizes—and there is some evidence that such a procedure is actually the most compute-efficient way to train neural networks. That being said, I think there are some transparency techniques which lend themselves better to being applied across training than others. Thus, the possibility of gradient hacking suggests that we should focus more on those techniques which allow us to not just understand a single model, but also the entire history of how that model was produced.[2]
[1] This is one of the reasons I'm interested in optimization provenance, for example.
[2] As an example of a technique which I think translates well to this domain, one of the OpenAI clarity team's current major research thrusts is "model diffing," which is about trying to understand exactly how two similar models differ from each other. Compared to other approaches that are just focused on understanding a single model, model diffing could be quite useful in allowing one to understand the entirety of the training process, since you could diff the model at each gradient descent update step.