In MIRI's 2019 paper Risks from Learned Optimization, the authors discuss the possibility that neural networks might develop internal optimization algorithms, or an algorithm that "internally search[es] through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system."
The paper's thesis is that these kinds of inner optimization algorithms are dangerous, mainly because they might give the system optimization targets or goals which are misaligned with the intentions of its creators, and even with the outer optimization algorithm being run over the neural network itself. For example, you might be optimizing a neural network to minimize a loss function based on predicting the next word in a document. The neural network might develop an inner optimizer that also tries to predict the next word in a document, but only as a means to some larger end, such as getting human researchers to trust that it's working properly before doing a sharp left turn and paperclipping the universe.
However, the paper seems to equivocate between optimization algorithms that would try to paperclip the universe if given a chance, and optimization algorithms in general. Optimization algorithms aren't uncommon in computer science. For example, simulated annealing is one that Wikipedia tells me is used in applications like predicting protein structures and solving Boolean satisfiability problems. In general, it searches through some possibility space, evaluates each candidate along the way, and adjusts the rest of the search process accordingly, looking for the optimal result. Another example is the training of neural networks themselves, which runs a similar process over the strengths of a network's metaphorical synapses.[1]
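To make the "search, evaluate, adjust" pattern concrete, here's a minimal Python sketch of simulated annealing minimizing a toy one-dimensional function. The objective and neighbor functions are illustrative placeholders of my own, not anything drawn from the paper:

```python
import math
import random

def simulated_annealing(objective, initial_state, neighbor,
                        steps=10_000, temp=1.0, cooling=0.999):
    """Minimal simulated-annealing loop (minimization): propose a nearby
    candidate, always keep improvements, and sometimes keep regressions
    with a probability that shrinks as the temperature cools."""
    state = initial_state
    best = state
    for _ in range(steps):
        candidate = neighbor(state)
        delta = objective(candidate) - objective(state)
        # Accept improvements outright; accept worse candidates with
        # probability exp(-delta / temp) so the search can escape local minima.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            state = candidate
            if objective(state) < objective(best):
                best = state
        temp *= cooling
    return best

# Toy usage: minimize a bumpy one-dimensional function.
objective = lambda x: x**2 + 10 * math.sin(x)
result = simulated_annealing(objective, initial_state=5.0,
                             neighbor=lambda x: x + random.uniform(-0.5, 0.5))
```

The search space here is just the real line, and the objective is an arbitrary bumpy curve; nothing about the loop itself cares whether the candidates are numbers, protein foldings, or truth assignments.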
Such optimization algorithms are not inherently dangerous. They're used safely all the time, and you can imagine a language model safely running something like this as one of many intermediate steps in processing a given input. In conventional AI risk assessments, optimization algorithms are only scary when systems are doing consequentialist search over configurations of the entire universe, à la AIXI. In perhaps literally all real-world cases, optimization algorithms' search spaces and evaluation functions are vastly more modest and benign.
But nonetheless, Risks from Learned Optimization implies that inner optimization algorithms are inherently high-risk. Here's a string of quotes from early in the paper, which together paint a picture of these inner optimization algorithms as though they set an objective for the model's entire output (whereas optimization might really just be an intermediate step), and as though their search space is likely to consist of consequentialist plans for action in particular, rather than anything more benign than that.
"However, it is also possible for a neural network to itself run an optimization algorithm. For example, a neural network could run a planning algorithm that predicts the outcomes of potential plans and searches for those it predicts will result in some desired outcome. Such a neural network would itself be an optimizer because it would be searching through the space of possible plans according to some objective function. [...] The possibility of mesa-optimizers has important implications for the safety of advanced machine learning systems. When a base optimizer generates a mesa-optimizer, safety properties of the base optimizer’s objective may not transfer to the mesa-optimizer. Thus, we explore two primary questions related to the safety of mesa-optimizers: 1. Mesa-optimization: Under what circumstances will learned algorithms be optimizers? 2. Inner alignment: When a learned algorithm is an optimizer, what will its objective be, and how can it be aligned?
Notice firstly the assumption that the inner optimizer's optimization criterion pertains to the output of the entire network, again missing the possibility of optimization being an intermediate step and therefore not being an appropriate way of describing the telos of the entire neural network. Secondly, notice that the paper gives no examples of search spaces other than consequentialist plans for action. (The paper's later examples of inner optimization targets continue not to highlight other possible optimization targets; perhaps the authors chose to do this deliberately, rather than missing those possibilities, but if so it's at least a choice that misled me on my first read-through.)
One possible reason for these oversights is that, prior to Risks from Learned Optimization, MIRI had been intensely focused on threat models inspired by systems like AIXI. AIXI is an optimization algorithm that searches over action-plans in terms of their likely effects on the entire universe, as evaluated by an explicit utility function. Much of the Risks from Learned Optimization paper reads as an attempt to justify why these old threat models were still applicable in the paradigm of machine learning, which, at the time, was just starting to take off. (The threat models that would obviously follow from the machine learning paradigm are quite different from those that follow from the utility maximization paradigm.)
Overall, I think the MIRI team may have ended up with tunnel vision regarding what explicit optimization processes have the potential to look like, particularly in the context of mind-like systems. While I find this (potential) oversight understandable, I think that people's p(doom)s were likely wrongly driven up by rhetoric that implied neural networks using an entire major class of algorithms would be profoundly dangerous, particularly since Eliezer once frequently framed that class of algorithms as fundamental to the nature of intelligence (and thus artificial intelligence).
[1] Consider a very small neural network, which couldn't possibly be configured into a dangerous maximizer-like system even if you wanted to make it one. Training such a network would be an example of running optimization over a search space which simply doesn't contain any deeply malign candidates.
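As a concrete sketch of what I mean (my own toy example, not anything from the paper), here's such a training loop: a few dozen weights nudged by plain gradient descent to fit XOR, an optimization process whose candidates are nothing but small weight matrices.

```python
import numpy as np

# A toy two-layer network fit to XOR by gradient descent: an optimization
# process whose search space (a few dozen floats) contains nothing
# remotely like a plan for acting on the world.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(5000):
    # Forward pass through the two layers.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass for squared error, then a small gradient step.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(axis=0)
```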