List of Lethalities #19 states:
- More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment—to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.
Part of why this problem seems intractable is that it's stated in terms of "pointing at latent concepts" rather than Goodhart's Law, wireheading, or short circuiting, all of which seem like more fruitful angles of approach than "point at latent concepts", precisely because pointing at inner structure is in fact the specific thing deep learning is trying to avoid having to do.
Though it occurs to me that some readers who see this won't be familiar with the original and its context, so let me elaborate:
The problem we are concerned with here is how you get a neural net or similar system, trained on photos or text or any other kind of sensory input, to care about the latent causality of the sensory input rather than the sensory input itself. If the distinction is unclear to you, consider that a model trained to push a ball into a goal could theoretically hack the webcam it uses as an eye so that it observes the (imaginary) ball being pushed into an (imaginary) goal while, in the real world, the ball is untouched. This is essentially wireheading, and the question is how you prevent an AI system from doing it, especially once it's superintelligent and trivially has the capability to hack any sensor it uses to make sensory observations.
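To make the distinction concrete, here is a toy sketch of my own (the names `BallEnv`, `render_camera`, etc. are hypothetical, not from any of the systems discussed here) contrasting a reward computed from the sensor stream with a reward computed from the latent state of the environment:

```python
import numpy as np

class BallEnv:
    """Toy environment: a ball the agent is supposed to push into a goal."""
    def __init__(self):
        self.ball_pos = np.array([0.0, 0.0])   # the real, latent ball position
        self.goal_pos = np.array([5.0, 5.0])

    def render_camera(self):
        # The sensor stream. A capable agent could tamper with this so it
        # reports a ball sitting in the goal regardless of reality.
        return np.concatenate([self.ball_pos, self.goal_pos])

def reward_from_pixels(frame):
    # A shallow function of the sense data: rewards whatever the camera says.
    ball, goal = frame[:2], frame[2:]
    return -float(np.linalg.norm(ball - goal))

def reward_from_latent_state(env):
    # What we actually want optimized: the real ball and the real goal.
    return -float(np.linalg.norm(env.ball_pos - env.goal_pos))
```

The catch is that `reward_from_latent_state` only exists because this is a simulator; in the real world the training signal has to reach the model through sensors, which is exactly what makes the hack available.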
We can start with the most obvious point: Our solution can't be based on a superintelligence not being able to get at its own reward machinery. Whatever we do has to be an intervention which causes the system, fully cognizant that it can hack itself for huge expected reward, to say "nope, I'm not doing that". We have basically one empirical template for this which I'm aware of in human drug use. Notably, when we discovered heroin and cocaine many believed they heralded a utopian future in which everyone could be happy. It took time for people to realize these drugs are addictive and pull you too far away from productive activity to be societally practical. You, right this minute, are choosing not to take heroin or other major reward system hacks because you understand they would have negative long-term consequences for you. If you're like me, you even have a disgust response about the concept, the thought of putting that needle in your arm brings on feelings of fear and nausea. This is LEARNED. It is learned even though you understand that the drug would feel good. It is learned even though this kind of thing probably didn't really exist as a major threat in the ancestral environment. This is one of the most alignment-relevant behaviors that humans do and it should be closely considered.
My current sketch for how something similar could be trained into a deep net would be to deliberately create opportunities to cheat/Goodhart at tasks, and then, on tasks where we have ground truth telling us they've been Goodharted, reliably punish the Goodharting. This would create an early preference against Goodharting and wireheading. As with drugs, these sessions could be supplemented with propaganda about the negative consequences of reward hacking. You could also try representation engineering to directly add an aversion to the abstract concept of cheating, reward hacking, etc.
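Here is a minimal runnable sketch of that idea under my own assumptions (a toy two-armed bandit rather than a real deep net): the "cheat" action pays more proxy reward than honest work, but the trap task's ground truth check catches it and applies a large penalty, so the learned policy ends up averse to cheating:

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["honest", "cheat"]
prefs = np.zeros(2)          # softmax preferences over the two actions
lr, penalty = 0.1, -10.0     # large negative reward for any detected cheat

def proxy_reward(action):
    # The gameable metric: cheating looks better to the proxy.
    return 1.0 if action == "honest" else 2.0

def ground_truth_detects_cheat(action):
    # On trap tasks we know the ground truth, so cheating is reliably caught.
    return action == "cheat"

for step in range(2000):
    probs = np.exp(prefs) / np.exp(prefs).sum()
    a = rng.choice(2, p=probs)
    r = penalty if ground_truth_detects_cheat(ACTIONS[a]) else proxy_reward(ACTIONS[a])
    # REINFORCE-style update: grad of log softmax is one_hot(a) - probs.
    grad = -probs
    grad[a] += 1.0
    prefs += lr * r * grad

print(dict(zip(ACTIONS, np.round(np.exp(prefs) / np.exp(prefs).sum(), 3))))
# The probability mass concentrates on "honest" despite its lower proxy reward.
```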
For my current weave LLM ReAct agent project I plan to have the model write symbolic functions to evaluate its own performance in context at each action step. To get it to write honest evaluation functions, I plan to train the part of the model that writes them with a different loss/training task which is aligned to verifiable long-term reward. The local actions are then scored with these functions as well as other potential mechanisms like queries of the model's subjective judgement.
See also this Twitter thread where I describe this in more detail:
https://jdpressman.com/tweets_2025_03.html#1898114081657438605
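For concreteness, here is a structural sketch of the scoring step described above. The interfaces are my own assumptions (the real weave-agent code and APIs are not reproduced here); the important property is just that the component writing the evaluators is trained against a separate, verifiable objective rather than against the local scores it hands out:

```python
from typing import Callable, List, Optional

# Hypothetical interfaces for illustration only.
Action = str
Evaluator = Callable[[Action], float]   # symbolic callback returning a score in [0, 1]

def score_action(
    action: Action,
    evaluators: List[Evaluator],
    subjective_judgment: Callable[[Action], float],
    threshold: float = 0.5,
) -> Optional[Action]:
    """Score a proposed action with in-context symbolic evaluators plus a
    subjective judgement query, accepting it only if the combined score clears
    the threshold. The evaluator-writing component is assumed to be trained on
    a separate objective tied to verifiable long-term reward, so its callbacks
    have no incentive to flatter the local action."""
    symbolic = sum(ev(action) for ev in evaluators) / max(len(evaluators), 1)
    combined = 0.5 * symbolic + 0.5 * subjective_judgment(action)
    return action if combined >= threshold else None

# Example usage with stand-in callables.
accepted = score_action(
    "wrote the unit tests and they pass",
    evaluators=[lambda a: 1.0 if "pass" in a else 0.0],
    subjective_judgment=lambda a: 0.8,
)
```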
A very related experiment is described in Yudkowsky 2017, and I think one doesn't even need LLMs for this: I started playing with an extremely simple RL agent trained on my laptop, but then got distracted by other things before achieving any relevant results. This method of training an agent to be "suspicious" of too-high rewards would also pair well with model expansion; train the reward-hacking-suspicion circuitry fairly early so as to avoid the ability to sandbag it, and lay traps for reward hacking again and again during the gradual expansion process.
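To gesture at what "suspicious of too-high rewards" might look like mechanically, here is a minimal sketch of my own (not Yudkowsky's 2017 setup, and not the laptop experiment mentioned above): keep a running estimate of the reward scale, and treat anything wildly outside it as a probable hacked reward, substituting a penalty instead of letting it reinforce the policy:

```python
import math

class SuspiciousRewardFilter:
    """Running-statistics filter: rewards far above the scale seen so far are
    treated as probable reward hacks and replaced with a penalty."""
    def __init__(self, z_threshold: float = 4.0, penalty: float = -1.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0   # Welford running statistics
        self.z_threshold = z_threshold
        self.penalty = penalty

    def filter(self, reward: float) -> float:
        if self.n >= 30:                            # wait for a stable estimate
            std = math.sqrt(self.m2 / (self.n - 1)) or 1e-8
            if (reward - self.mean) / std > self.z_threshold:
                return self.penalty                  # "too good to be true"
        # Only rewards that pass the check update the running estimate.
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)
        return reward
```

Paired with model expansion, something like this filter (or a learned equivalent of it) would be trained in before the model is capable enough to sandbag it, and then stress-tested with planted reward-hacking traps at each expansion stage.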