Desiderata 1 and 2 seem to be the general non-negotiable goals of AI control (I call property 1 scalability, but many people have talked about it under different names).
Why would a human not convert the future into paperclips if they wanted to maximize paperclips? This sounds structurally like "A human would not, in general, attempt to convert their entire future light cone into flourishing conscious experience." But once we change the noun I'm not so sure.
Desiderata 3 and 4 also seem quite general; many (most?) people working on AI control aim to establish properties of this form.
Presumably the way to effect transparency under self-modification is to (1) ensure that transparent techniques are competitive with their opaque counterparts, and then (2) build systems that help humans get what they want in general, and so help the humans build understandable+capable AI systems as a special case.
This work originated at the MIRI Summer Fellows Program (MSFP) and originally involved Pasha Kamyshev, Dan Keys, Johnathan Lee, Anna Salamon, Girish Sastry, and Zachary Vance. I was asked to look over two drafts and some notes, clean them up, and post here. Special thanks to Zak and Pasha for the drafts on which this post was based.
We discuss the issues with expected utility maximizers, posit the possibility of normalizers, and list some desiderata for normalizers.
This post, which explains background and desiderata, is a companion post to Three Alternatives to Utility Maximizers. The other post surveys some other "-izers" that came out of the MSFP session and gives a sketch of the math behind each while showing how they fulfill the desiderata.
##Background

The naive implementation of an expected utility maximizer involves looking at every possible action - of which there is generally an intractable number - and, at each, evaluating a black box utility function. Even if we could somehow implement such an agent (say, through access to a halting oracle), it would most likely tend towards extreme solutions. Given a function that we would interpret as "maximize paperclips," such an agent would, if possible, convert its entire future light cone into the cheapest object that satisfies whatever its computational definition of paperclip is.
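To make this concrete, here is a rough Python sketch of such a naive maximizer; the action set, outcome model, and utility function are illustrative placeholders.

```python
# A naive expected utility maximizer: enumerate every possible action and
# score each by the expectation of a black-box utility function. In practice
# the action space is intractably large, which is the problem noted above.

def naive_eu_maximizer(actions, outcome_distribution, utility):
    """Return the action with the highest expected utility."""
    def expected_utility(action):
        return sum(p * utility(outcome)
                   for outcome, p in outcome_distribution(action).items())
    return max(actions, key=expected_utility)
```

Nothing in this loop distinguishes paperclips we would recognize from whatever configuration of matter happens to score highest under `utility`.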
Such brute-force maximization makes errors in goal specification extremely costly.

Given a utility function that looks naively acceptable, the agent will do something which by our standards is completely insane. Even in the paperclip example, the "paperclips" that the agent produces are unlikely to be labeled as paperclips by a human.
If a human wanted to maximize paperclips, they would not, in general, attempt to convert their entire future light cone into paperclips. They might fail to manufacture very many paperclips, but their actions would seem much more “normal” to us than those of the true expected utility maximizer above.
##Normalizers and Desiderata
We consider a normalizer to be an agent whose actions, given a flawed utility function, are still considered sane or normal by a human observer. Specifically, we may wish that it never come up with a solution that a human would find extreme, even after the normalizer has explained it.
This definition is extremely vague, so we propose some desiderata for normalizers, organized in rough order of how confident we are that each is necessary. Note that these desiderata may not be simultaneously satisfiable; furthermore, we do not think that they are sufficient for an agent to be a normalizer.
###Desideratum 1: Sanity at High Power

A normalizer should take sane actions regardless of how much computational power it has. Given more power, it should do better. It should not transition from being safe at, say, a human level of power, to being unsafe at superhuman levels.
This desideratum can only be satisfied by changing the algorithm the agent is running or by succeeding at AI boxing.
Positive Example: An agent is programmed to maximize train efficiency using a suggester-verifier architecture; however, the verifier is programmed to only accept the default train timetable[^fn-rug].
Negative Example: At around human level, an agent told to maximize paperclips gets a high paying job and spends all its money on paperclips. Once it reaches superhuman level, it turns all the matter in its light cone into molecular paperclips.
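As a toy illustration of the positive example, the verifier below accepts nothing but the default timetable (the names are placeholders), so additional suggester power cannot make the deployed behavior less safe.

```python
DEFAULT_TIMETABLE = "default train timetable"

def verifier(proposal):
    # Maximally conservative: only the default timetable is ever accepted,
    # so a smarter suggester cannot push the agent into unsafe territory.
    return proposal == DEFAULT_TIMETABLE

def act(suggester):
    proposal = suggester()        # the suggester may be arbitrarily clever
    if verifier(proposal):
        return proposal
    return DEFAULT_TIMETABLE      # fall back to the known-sane action
```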
However, just because an agent avoids insane or extreme courses of action does not mean that it actually gains any value.
###Desideratum 2: High Value

The normalizer should end up winning (with respect to its utility function). Even though it may fail to fully maximize utility, its actions should not leave huge amounts of utility unrealized[^fn-waste].
As part of this desideratum, we would also like the normalizer to be winning at low power levels. This filters out uncomputable solutions that cannot be approximated computably.
Note that a paperclip maximizer satisfies this desideratum if we remove the computability considerations.
Positive Example: A normalizer spends many years figuring out how to build the correct utility maximizer and then does so.
Negative Example 1: A human tries to optimize the medical process, and makes significant progress, but everyone still dies.
Negative Example 2: A meliorizer[^fn-meli] attempts to save Earth from a supervolcano. Starting with the default action "do nothing", it switches to the first policy it finds, which happens to be "save everyone in NYC and call it a day", but fails to find the strictly better strategy "save everyone on Earth"[^fn-meli-fail].
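For concreteness, a rough sketch of a meliorizer (policy names and search order are just illustrative):

```python
def meliorize(default_policy, candidate_policies, utility):
    """Keep the default policy unless the search finds something strictly better."""
    baseline = utility(default_policy)
    for policy in candidate_policies:   # search order determines what is found
        if utility(policy) > baseline:
            return policy               # e.g. "save everyone in NYC"
    return default_policy               # search failed: stay with "do nothing"
```

The failure in Negative Example 2 is exactly that this loop stops at the first improvement it finds rather than continuing on to the best strategy available.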
Even given that the agent is equally sane at different power levels and wins, we would still like humans to know whether it is sane, especially if we care about corrigibility.
###Desideratum 3: Transparency

We should be able to understand why a normalizer is taking a given action (or at least trust that the action is sane), especially once the normalizer has explained it to us.
####Desideratum 3a: Transparency under Self-Modification

We might also like this transparency to tile when the agent self-modifies.
Satisfying the above desiderata takes us much of the way to a normalizer, but we would also like our machine to be able to correct its own errors, that is, stay sane.
###Desideratum 4: Noticing Confusion

An agent should be able to notice when it is doing something that we might consider insane and take measures to prevent this.
It might, for example, have heuristics about what sensible actions ought to look like, and detect when it begins seriously contemplating actions that violate these heuristics.
Positive Example: An agent has several sub-modules. If their predictions disagree by orders of magnitude, it does not use those predictions in other calculations.
Negative Example: An AIXI-like expected utility maximizer, programmed with the universal prior, assigns zero probability to hypercomputation. It fails to correct its prior.
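A toy sketch of the sub-module disagreement check from the positive example; the threshold and the order-of-magnitude comparison are just one way to fill in the details.

```python
import math

def pooled_prediction(submodule_estimates, max_log10_spread=2.0):
    """Combine positive sub-module estimates, or return None if they disagree wildly."""
    logs = [math.log10(x) for x in submodule_estimates if x > 0]
    if not logs or max(logs) - min(logs) > max_log10_spread:
        return None                       # flag confusion rather than guess
    return 10 ** (sum(logs) / len(logs))  # geometric mean of the estimates
```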
####Desideratum 4a: Robust to Perturbations of the Utility Function

A normalizer is robust to changes in its utility function that would seem to a human to be inconsequential.
We can imagine two worlds: one in which a programmer ate a sandwich and wrote the utility function one way, and the other in which the programmer ate a salad and then (presumably due to how the food affected them) wrote the utility function in some subtly different way (maybe the ordering was flipped). A normalizer should take basically the same actions regardless of which world it is in.
This allows us to define "almost correct" utility functions and yet still gain the value there is to gain.
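One crude way to test this is to run the same agent with the two subtly different utility functions and compare the actions it chooses (exact equality is stricter than "basically the same", but it shows the idea):

```python
def robust_to_perturbation(agent, utility_a, utility_b, situations):
    """True if the agent chooses the same action under both utility functions."""
    return all(agent(utility_a, s) == agent(utility_b, s) for s in situations)
```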
####Desideratum 4b: Robust to Ontological Crises

The agent continues to operate and take sane actions even if it learns that its ontology is flawed[^fn-ontology].
Positive Example: After it proves that string theory, rather than atomic theory, is correct, the agent still recognizes my mother and offers her ice cream.
####Desideratum 4c: Able to Deal with Normative Uncertainty

The agent should be able to coherently deal with situations in which the correct thing to do is unclear.
Say, for example, that the agent wants to satisfy the utility functions of all humans, which may conflict with one another. A normalizer should be able to deal sanely with this kind of situation.
Positive Example: An agent is unsure of whether dolphins are moral patients. When considering options it takes this into account and takes an action which does not cause mass extinction of dolphins.
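One rough way the dolphin example could be handled: weigh each option under every moral hypothesis the agent entertains, and rule out options that are catastrophic under any hypothesis given non-negligible credence. The sketch below is illustrative, not a full proposal.

```python
def choose_under_normative_uncertainty(actions, hypotheses, value,
                                       catastrophe=-1e6, min_credence=0.01):
    """hypotheses: list of (probability, moral_hypothesis) pairs.
    value(action, hypothesis) -> float; very negative means catastrophic."""
    actions = list(actions)
    def acceptable(action):
        # Acceptable if no sufficiently credible hypothesis rates it catastrophic.
        return all(value(action, h) > catastrophe
                   for p, h in hypotheses if p >= min_credence)
    candidates = [a for a in actions if acceptable(a)] or actions
    # Among acceptable actions, maximize credence-weighted value.
    return max(candidates,
               key=lambda a: sum(p * value(a, h) for p, h in hypotheses))
```

Under this rule, an option that causes mass extinction of dolphins is vetoed so long as the agent assigns non-negligible credence to dolphins being moral patients.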
[^fn-waste]: Bostrom calls huge amounts of wasted utility astronomical waste. In "Astronomical Waste: The Opportunity Cost of Delayed Technological Development", he argues that "if the goal of speed conflicts with the goal of global safety, the total utilitarian should always opt to maximize safety."

[^fn-meli]: A meliorizer is an agent which has a default policy, but searches for a higher-utility policy. If it finds such a policy, it switches to using it. If the search fails, it continues using the default policy. See section 8 of the Tiling Agents draft for the original presentation and some more discussion. See Soares' "Tiling Agents in Causal Graphs" for one formalization, specifically that of suggester-verifiers with a fallback policy.

[^fn-meli-fail]: An overzealous meliorizer that finds the higher-utility (as measured by its utility function) strategy "destroy the Earth so that 100% of people on Earth are vacuously saved" fulfills the first desideratum, but fails to be a normalizer.

[^fn-rug]: We assume that we have prevented the suggester from affecting the environment in ways that bypass the verifier (such as by using the physical processes that implement its computations). This is probably equivalent in the limit to the problem of AI boxing.

[^fn-ontology]: See de Blanc's "Ontological Crises in Artificial Agents' Value Systems" for more details on ontological crises.