This post was written quickly, lest I not write it at all.
Picture the following scenario.
- Humans train a model, M, with the intention that M minimize a loss function L.
- The model, M, will now take a set of actions.
I see this going wrong in two ways.
- It is possible that L is malformed (misaligned, specifically), such that effectively decreasing L kills everyone. This is the classic paperclip maximizer scenario. We currently do not know how to design L such that this does not happen.
- Even if L is not malformed, the set of actions taken by M might be catastrophic. This is the mesa-optimization problem.
The first failure case has captured most of my attention. Meanwhile, I have been somewhat dismissive of the second failure case.
I would like to explain why I was dismissive of the mesa-optimization problem, and make an argument for why I think we should in fact take it seriously.
--
We understand that M is an optimizer. However, we can also assume that M is not a perfect optimizer. On out-of-distribution data, M is likely to fail to optimize L.
We can define a new loss function, L', which M actually does perfectly optimize for. We define L' such that the more resources M has, the more effective M will be in decreasing L'.
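To make the distinction concrete, here is a toy sketch (my own illustration with an arbitrary target and training range, not a formal construction from the mesa-optimization literature): a model trained to minimize L on a narrow distribution ends up behaving, everywhere, as if it were minimizing a different objective L' that merely agrees with L on the training data.

```python
import numpy as np

# Toy illustration (assumed setup, not from the post): the training loss L is
# mean-squared error against y = |x|, but we only ever train on x >= 0, where
# |x| == x. A linear model with no bias fits that region perfectly.
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 3, size=200)
y_train = np.abs(x_train)                      # the target L cares about

# Closed-form least-squares fit of y ~ w * x
w = (x_train @ y_train) / (x_train @ x_train)  # w comes out ~1.0

# In-distribution, the model optimizes L almost perfectly...
x_in = rng.uniform(0, 3, size=200)
print("in-distribution MSE under L :", np.mean((w * x_in - np.abs(x_in)) ** 2))

# ...but out of distribution its behavior tracks a different objective,
# "minimize (w*x - x)^2" -- call that L'. L' agrees with L on the training
# distribution and diverges from it everywhere else.
x_out = rng.uniform(-3, 0, size=200)
print("out-of-distribution MSE under L :", np.mean((w * x_out - np.abs(x_out)) ** 2))
print("out-of-distribution MSE under L':", np.mean((w * x_out - x_out) ** 2))
```

On the training distribution the two losses are indistinguishable; off it, the model keeps minimizing L' rather than L.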
L' is not taken from "human-designed objective function" space. In fact, my intuition says that L' is likely to look very strange and complex. If we were to attempt to extract the utility function of a heavily intelligence-enhanced human from their actions, I doubt that such a utility function would seem simple either. This intuition is what initially made me dismissive of mesa-optimization as a problem.
Despite having read Omohundro's [AI Drives](https://selfawaresystems.files.wordpress.com/2008/01/ai_drives_final.pdf) paper, I saw no obvious reason to assume that these strange L'-like objective functions would be subject to instrumental convergence. One can certainly imagine many objective functions that do not lead to these drives. One could even imagine an objective function that rewards having fewer resources, less skill, or less rationality.
It might be the case that most utility functions sampled from the space of all possible utility functions converge to having these drives, but that did not and does not seem like an obviously true fact to me.
--
I can't find the post, but someone on LessWrong wrote something along the lines of "only a tiny sliver of possible worlds are compatible with human existence." This seemed like an obviously true fact, and I'd intuit that it applies to biological sentience more broadly.
That was the "aha" moment for me. Without understanding L' more deeply, we can begin by assuming that L' is sampled from "objective function space" instead of "human-like objective function space."
I think the maximum entropy assumption is that the terminal goal-states[1] of functions in "objective function space" are distributed uniformly across all possible states. Since only a tiny sliver of possible states is compatible with biological sentience, we should expect a highly effective L' optimizer to be incompatible with human life.
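To spell out the counting argument with made-up numbers (mine, purely illustrative):

```python
# Illustrative numbers only (not estimates): suppose terminal goal-states are
# drawn uniformly from 2**100 possible world-states, of which 2**60 are
# compatible with biological sentience. Under the maximum-entropy (uniform)
# assumption, the chance that a random L' steers toward a compatible state is
# just the ratio of the two counts.
total_states = 2.0 ** 100
compatible_states = 2.0 ** 60

p_compatible = compatible_states / total_states
print(f"P(L' target is human-compatible) = {p_compatible:.3e}")  # ~9.1e-13
```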
--
Luckily, we have a bit[2] more than 0 bits of information about L'. For example, we know that with enough training, L' can be very similar to L.
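Continuing the toy sketch from above (again my own illustration, not a claim about real training dynamics), broadening the training distribution is one way that more training pins L' down to something closer to L:

```python
import numpy as np

# Give the model enough capacity to represent L's target exactly: the
# features relu(x) and relu(-x) can sum to |x|.
def fit(x, y):
    # Least-squares over the two ReLU features, with a tiny ridge term so
    # under-determined directions default to zero.
    phi = np.stack([np.maximum(x, 0), np.maximum(-x, 0)], axis=1)
    return np.linalg.solve(phi.T @ phi + 1e-6 * np.eye(2), phi.T @ y)

rng = np.random.default_rng(0)

# Narrow training distribution: the relu(-x) weight is unconstrained, so the
# fitted behavior (L') still diverges from L on x < 0.
x_narrow = rng.uniform(0, 3, 200)
print("narrow:", fit(x_narrow, np.abs(x_narrow)))  # ~[1, 0]

# Broader training distribution: both weights are pinned down, and the
# model's revealed objective matches L everywhere we can probe it.
x_broad = rng.uniform(-3, 3, 200)
print("broad: ", fit(x_broad, np.abs(x_broad)))    # ~[1, 1]
```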
I think it might be worth exploring other things that we expect to be true about L'.
What serves as a goal in the distant future determines how the cosmic endowment is optimized. Stable goals are the goals that remain in the distant future, so they are what matters for that (and since reflection will not yet have had a chance to take place, stable goals settled in the near future are always misaligned). Unstable goals are not relevant in themselves, or in which utility function (or perhaps probutility) they fit, except insofar as they tend to eventually produce different stable goals.
So maintaining the distinction means staying aware of the catastrophic misalignment risk in which we turn some unstable goals into stable ones through a careless process of reflection (or lack of it) that merely fits things, instead of through proper, well-designed reflection (something like CEV, though possibly very different in detail). It also helps us avoid worrying too much about the details of utility functions that fit current unstable goals, or about aligning them with humans' current unstable goals, when those are not what actually matters.
That doesn't affect goals, which speak of all possible environments, whether or not an agent actually encounters them. Goals are not just policy; they determine policy, not the other way around (along the algorithm-vs.-physical distinction, goals are closer to the algorithm, while policy is merely the algorithm's behavior, the decisions it takes, closer to the physical instances and actions in reality). Unstable goals change their mind about the same environment, which could be an environment that will become reachable/enactable in the future.