(with thanks to Owain Evans)
An ontological crisis happens when an agent's underlying model of reality changes, such as a Newtonian agent realising it was living in a relativistic world all along. These crises are dangerous if they scramble the agent's preferences: an agent dedicated to maximising pleasure over time could behave completely differently once it moves to relativistic time. Depending on how the transition goes, it might react by accelerating happy humans to near light speed, or conversely ban them from moving - or do something considerably weirder.
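To make the relativity example concrete, here is a toy sketch (my own illustration, with made-up function names and numbers, not anything from the original argument) of two naive ways the Newtonian goal "pleasure integrated over time" might be carried across. One counts pleasure per coordinate second over a lifespan fixed in the human's proper time; the other counts pleasure per proper second over a planning horizon fixed in coordinate time. The first favours accelerating humans, the second favours keeping them still.

```python
# Toy illustration: two naive transplants of "maximise pleasure over time"
# from Newtonian to relativistic time. All names and numbers are made up.
import numpy as np

C = 1.0  # speed of light in natural units


def gamma(v):
    """Lorentz factor for speed v (as a fraction of c)."""
    return 1.0 / np.sqrt(1.0 - (v / C) ** 2)


def pleasure_per_coordinate_time(v, lifespan_proper=80.0, rate=1.0):
    """Transplant A: pleasure accrues per coordinate second, over a lifespan
    fixed in the human's proper time. Time dilation stretches that lifespan
    in coordinate time, so faster motion looks 'better'."""
    return rate * lifespan_proper * gamma(v)


def pleasure_per_proper_time(v, horizon_coordinate=80.0, rate=1.0):
    """Transplant B: pleasure accrues per proper second, but the planning
    horizon is fixed in coordinate time. Time dilation shrinks the experienced
    duration, so any motion looks 'worse'."""
    return rate * horizon_coordinate / gamma(v)


for v in [0.0, 0.5, 0.9, 0.99]:
    print(f"v={v:4.2f}c  A (speed humans up): {pleasure_per_coordinate_time(v):8.1f}"
          f"   B (keep humans still): {pleasure_per_proper_time(v):6.1f}")
```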
Peter de Blanc has a sensible approach to minimising the disruption that ontological crises can cause to an AI, but this post is concerned with analysing what happens when such approaches fail. How bad could it be? Well, this is AI, so the default is of course: unbelievably, hideously bad (i.e. situation normal). But in what ways exactly?
If the ontological crisis is too severe, the AI may lose the ability to do anything at all, as the world becomes completely incomprehensible to it. This is very unlikely: the ontological crisis was most likely triggered by the AI's own observations and deductions, so it is improbable that it will lose the plot completely in the transition.
A level below that is when the AI can still understand and predict the world, but the crisis completely scrambles its utility function. Depending on how the scrambling happens, this can be safe: the AI may lose the ability to influence the value of its utility function at all. If, for instance, the new utility function assigns wildly different values to distinct states in a chaotic system, the AI's actions become irrelevant. This could happen if different worlds with different microstates but the same macrostates get spread evenly across the utility values: unless the AI is an entropy genie, it cannot influence utility values through its decisions, and will most likely become catatonic.
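A toy sketch of that scrambled-but-uninfluenceable case (my own illustration, with arbitrary numbers): assign independent random utilities to microstates, let the AI choose only the macrostate, and the expected utility of every action comes out essentially the same - there is nothing left to optimise.

```python
# Toy sketch: a scrambled utility that assigns independent random values to
# microstates. Each action selects a macrostate, but the microstate within it
# is chaotic and unpredictable, so all actions have ~equal expected utility.
import numpy as np

rng = np.random.default_rng(0)

n_actions = 5          # macrostates the agent can steer between
n_micro = 100_000      # unresolvable microstates within each macrostate

# Scrambled utility: i.i.d. values, uncorrelated with anything the agent can control.
utility = rng.normal(size=(n_actions, n_micro))

# The agent can only pick the macrostate; the microstate is effectively random.
expected_utility = utility.mean(axis=1)
print("expected utility of each action:", np.round(expected_utility, 4))
print("spread between best and worst action:",
      round(np.ptp(expected_utility), 4))  # ~0.01: nothing worth acting on
```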
More likely, however, is that the utility function is scrambled to something alien, but still AI-influenceable. The AI will then still most likely pursue the convergent instrumental goals of gathering power and influence and pretending to be nice, before taking over when needed. The only saving grace is that its utility function is so bizarre that we may be able to detect the change in some way.
The most dangerous possibility is if the AI's new utility function resembles the old one, plus a lot of noise (noise from our perspective - from the AI's point of view, it all makes perfect sense). Human values are complex, so this would be the usual unfriendly AI scenario, with the added problem that the change would be hard for us to notice.
A step below this is when the AI's new utility function resembles the old one, plus a little bit of noise. Human values remain complex, so this is still most likely a UFAI, but safety precautions built into its utility function (such as AI utility indifference or value learning or similar ideas) may not be completely neutered.
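A toy sketch of the "old utility plus noise" cases (my own illustration, arbitrary numbers): add a little or a lot of noise to a random "old" utility over outcomes and check how the outcome the old utility favoured fares under the new one. With small noise the old favourite typically stays near the top of the new ranking; with large noise its new rank drifts towards the middle of the pack.

```python
# Toy sketch: "old utility plus noise" at two noise levels.
import numpy as np

rng = np.random.default_rng(0)
n_outcomes = 1_000
u_old = rng.normal(size=n_outcomes)   # stand-in for the original preferences

for noise_scale in [0.1, 10.0]:       # "a little" vs "a lot" of noise
    u_new = u_old + noise_scale * rng.normal(size=n_outcomes)
    old_best = np.argmax(u_old)
    # Rank (0 = best) of the old favourite outcome under the new utility.
    rank_of_old_best = int((u_new > u_new[old_best]).sum())
    corr = np.corrcoef(u_old, u_new)[0, 1]
    print(f"noise={noise_scale:5.1f}  correlation={corr:5.2f}  "
          f"rank of old optimum under new utility: {rank_of_old_best}/{n_outcomes}")
```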
In summary:
| Type of crisis | Notes | Danger |
|---|---|---|
| World incomprehensible to AI | Very unlikely | None |
| Utility completely scrambled, AI unable to influence it | Uncertain how likely this is | Low |
| Utility scrambled, AI able to influence it | We may be able to detect change | Very High |
| Lots of noise added to utility | Difficult to detect change | Maximal |
| Some noise added to utility | Small chance of not being so bad; some precautions may remain useful | High |
Seems unwarrantedly optimistically anthropomorphic. A controlled shutdown in a case like this is a good outcome, but imagining a confused human spinning around and falling over does not make it so. The AI would exhibit undefined behavior, and hoping that this behavior is incoherent enough to be harmless or that it would drop an anvil on its own head seems unwarrantedly optimistic if that wasn't an explicit design consideration.

Obviously undefined behavior is implementation-dependent, but I'd expect that in some cases you would see e.g. subsystems running coherently and perhaps effectively taking over behavior as high-level directions ceased to provide strong utility differentials. In other words, the AI built an automatic memory-managing subsystem inside itself that did some degree of consequentialism but in a way that was properly subservient to the overall preference function; now the overall preference function is trashed and the memory manager is what's left to direct behavior. Some automatic system goes on trying to rewrite and improve code, and it gets advice from the memory manager but not from top-level preferences; thus the AI ends up as a memory-managing agent.
This probably doesn't make sense the way I wrote it, but the general idea is that the parts of the AI could easily go on carrying out coherent behaviors, and that could easily end up somewhere coherent and unpleasant, if top-level consequentialism went incoherent. Unless controlled shutdown in that case had somehow been imposed from outside as a desirable conditional consequence, using a complexly structured utility function such that it would evaluate, "If my preferences are incoherent then I want to do a quiet, harmless shutdown, and that doesn't mean optimize the universe for maximal quietness and harmlessness either." Ordinarily, an agent would evaluate, "If my utility function goes incoherent... then I must not want anything in particular, including a controlled shutdown of my code."