Repetition helps us remember things better. This is because it strengthens connections in our brain used for memory[1].
We don’t need to understand the exact neural mechanisms to take advantage of this fact. For example, societies could develop cultural norms that promote repetition exercises during education[2].
This is an example of how humans are “gradient aware.” “Repeat a task so I can remember it better” advances our goals by taking advantage of our “gradient update” process. This is an action we take solely because of how our minds get shaped.
I think a similar situation may occur in sufficiently powerful AI models. If AIs are trained in environments where they can strategically take advantage of gradient updates, they might choose their actions partially based on how they expect the gradient descent process to modify their future instances[3]. I call this “gradient awareness.”
The only time I’ve seen people discuss gradient-aware models is in the context of gradient hacking. We can think of gradient hacking as a specialized case of gradient awareness, where a mesa-optimizer protects its mesa-objective from being modified by redirecting its gradient updates.
At first glance, gradient hacking seems peculiar and unnatural. However, gradient awareness seems like a strategy advanced models would pick up[4]. The closest thing I’ve seen in the wild is how an RNN that was trained to play Sokoban will “pace around” as it figures out a plan.
Gradient awareness is a spectrum. You might repeat things to remember them because that's how they taught you in grade school, or you could have a super elaborate Anki setup. Similar to humans who follow cultural practices, models can totally execute strategies that are gradient-aware without “understanding” why these strategies work.
What does this all mean?
I’d expect that we could see gradient-aware models before they are capable of gradient hacking. Gradient hacking is a very sophisticated algorithm, and I think models might execute other gradient-aware algorithms first (alternatively, its first attempts at gradient hacking would just fail). To the extent we believe this, we could think about how we structure our training environment to guard against gradient awareness. That means thinking about ways to structure the gradient updates to avoid opportunities for gradient aware strategies, or limiting the model’s ability to shape its future gradients. We might even create evaluations for gradient awareness.
Finally, writing this post reminded me of how there’s no fire alarm for AI. We could see frontier models pursuing increasingly sophisticated gradient-aware strategies, and never see actual gradient hacking[5]. In fact, the first gradient-aware strategies could look perfectly benign, like repeating a thing to remember it.
I tried decently hard to find an actual good neuroscience source and learn how this mechanism works. Best source I’ve found is this nature article from 2008.
See Henrich’s The Secrets to Our Success for examples of how human cultures learned incredibly complex policies (e.g., a multi-day cooking process that removes cyanide from manioc) to achieve their goals despite having no idea what the underlying chemical mechanisms are.
Repetition helps us remember things better. This is because it strengthens connections in our brain used for memory[1].
We don’t need to understand the exact neural mechanisms to take advantage of this fact. For example, societies could develop cultural norms that promote repetition exercises during education[2].
This is an example of how humans are “gradient aware.” “Repeat a task so I can remember it better” advances our goals by taking advantage of our “gradient update” process. This is an action we take solely because of how our minds get shaped.
I think a similar situation may occur in sufficiently powerful AI models. If AIs are trained in environments where they can strategically take advantage of gradient updates, they might choose their actions partially based on how they expect the gradient descent process to modify their future instances[3]. I call this “gradient awareness.”
The only time I’ve seen people discuss gradient-aware models is in the context of gradient hacking. We can think of gradient hacking as a specialized case of gradient awareness, where a mesa-optimizer protects its mesa-objective from being modified by redirecting its gradient updates.
At first glance, gradient hacking seems peculiar and unnatural. However, gradient awareness seems like a strategy advanced models would pick up[4]. The closest thing I’ve seen in the wild is how an RNN that was trained to play Sokoban will “pace around” as it figures out a plan.
Gradient awareness is a spectrum. You might repeat things to remember them because that's how they taught you in grade school, or you could have a super elaborate Anki setup. Similar to humans who follow cultural practices, models can totally execute strategies that are gradient-aware without “understanding” why these strategies work.
What does this all mean?
I’d expect that we could see gradient-aware models before they are capable of gradient hacking. Gradient hacking is a very sophisticated algorithm, and I think models might execute other gradient-aware algorithms first (alternatively, its first attempts at gradient hacking would just fail). To the extent we believe this, we could think about how we structure our training environment to guard against gradient awareness. That means thinking about ways to structure the gradient updates to avoid opportunities for gradient aware strategies, or limiting the model’s ability to shape its future gradients. We might even create evaluations for gradient awareness.
Previous work on gradient hacking, such as Gradient filtering, can also be relevant to gradient awareness. See also the empirical work on gradient routing.
Finally, writing this post reminded me of how there’s no fire alarm for AI. We could see frontier models pursuing increasingly sophisticated gradient-aware strategies, and never see actual gradient hacking[5]. In fact, the first gradient-aware strategies could look perfectly benign, like repeating a thing to remember it.
I tried decently hard to find an actual good neuroscience source and learn how this mechanism works. Best source I’ve found is this nature article from 2008.
See Henrich’s The Secrets to Our Success for examples of how human cultures learned incredibly complex policies (e.g., a multi-day cooking process that removes cyanide from manioc) to achieve their goals despite having no idea what the underlying chemical mechanisms are.
Note that this implies that gradient aware models need to be non-myopic.
I’d argue that if you want the model to do the task well, you would want it to learn these strategies if they are available.
Although for what it’s worth I think gradient hacking is not very likely.