I'm fairly skeptical about trying to understand AI behavior at this level, given the current state of affairs (that is, I think the implicit picture of AI behavior on which these analyses rely is quite unlikely, so that the utility of this sort of thinking is reduced by an order of magnitude). Anyway, some specific notes:
The utility-scrambled situation is probably as dangerous as more subtle perturbations if you are dealing with a human-level AI, since keeping human onlookers happy is instrumentally valuable (and this sort of reasoning is obvious to an AI as clever as we are on this axis, never mind one much smarter).
The presumed AI architecture involves human designers specifying a prior and a utility function over the same ontology, which seems quite unlikely from here. In more realistic situations, the question of value generalization seems important beyond ontological crises, and in particular, if generalization goes well before an ontological crisis is reached, it seems overwhelmingly likely to continue to go well afterwards.
An AI of the sort you envision (with a prior and a utility function specified in the ontology of that prior) can never abandon its ontology. It will instead either become increasingly confused, or build a model of its observations within the original ontology (if the prior is sufficiently expressive). In both cases the utility function continues to apply without change, in contrast to the situation in de Blanc's paper (where an AI explicitly shifts from one ontology to another). If the utility function was produced by human designers, it may no longer correspond to reality in the intended way.
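To make the architecture I have in mind concrete, here is a toy sketch (Python; the states, numbers, and the "weird_obs" observation are all invented for illustration): the prior and the utility function live on the same fixed state space, so however surprising the observations get, the agent just keeps updating and evaluating within that space.

```python
# Toy agent whose prior and utility function are both defined over one
# fixed ontology: a set of world-states it can never step outside of.
STATES = ["A", "B", "C"]                   # the hard-coded ontology
prior = {"A": 0.5, "B": 0.3, "C": 0.2}     # prior over states
utility = {"A": 1.0, "B": 0.0, "C": -1.0}  # utility over the same states

def likelihood(observation, state):
    # Whatever the observation is, it gets a likelihood *within* the fixed
    # ontology; surprising data yields strange posteriors, never a new state space.
    table = {("weird_obs", "A"): 0.01, ("weird_obs", "B"): 0.01, ("weird_obs", "C"): 0.02}
    return table.get((observation, state), 0.1)

def update(prior, observation):
    unnormalised = {s: prior[s] * likelihood(observation, s) for s in STATES}
    z = sum(unnormalised.values())
    return {s: p / z for s, p in unnormalised.items()}

posterior = update(prior, "weird_obs")
# The original utility function still applies, unchanged, to the new posterior.
expected_utility = sum(posterior[s] * utility[s] for s in STATES)
print(posterior, expected_utility)
```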
It seems extremely unlikely that an AI with very difficult to influence values will be catatonic. More likely hypotheses suggest themselves, such as: doing things that would be good in (potentially unlikely) worlds where value is more easily influenced, amassing resources to better understand whether value can be influenced, or having behavior controlled in apparently random (but quite likely extremely destructive) ways that give a tiny probabilistic edge. Only for very rare values will killing yourself be a good play (since that requires utility to be influenceable by killing yourself, but not by doing anything more extreme).
The rest are unrelated to the substance of the post, except insofar as they relate to the general mode of thinking:
As far as I can tell, AI indifference doesn't work (see my comment here). I don't think it is salvageable, but even if it is, it at least seems to require salvaging.
Note that depending on the structure of "evidence for goals" in the value indifference proposal, it is possible that an AI can in fact purposefully influence its utility function and will be motivated to do so. To see that the proof sketch given doesn't work, notice that I have some probability distribution over what I will be doing in a year, but that (despite the fact that this "obeys the axioms of probability") I can in fact influence the result and not just passively learn more about it. An agent in this framework is automatically going to be concerned with acausal control of its utility function, if its notion of evidence is sufficiently well-developed. I don't know if this is an issue.
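A toy numerical illustration of the worry (Python; the actions, utilities, and probabilities are all invented): the agent's distribution over which utility function it ends up with obeys the probability axioms, yet it depends on the agent's own action, so ordinary expected-utility maximisation already pushes it to steer that distribution rather than merely learn about it.

```python
# Two candidate utility functions the agent might end up with,
# evaluated on the single outcome each action produces.
U1 = {"act_normal": 1.0, "act_tamper": 5.0}
U2 = {"act_normal": 1.0, "act_tamper": -2.0}

# The agent's (perfectly coherent) beliefs about which utility function it
# will have, conditional on its own action -- the distribution is
# action-dependent, not something it merely learns about passively.
p_U1_given = {"act_normal": 0.5, "act_tamper": 0.9}

def expected_utility(action):
    p1 = p_U1_given[action]
    return p1 * U1[action] + (1 - p1) * U2[action]

best = max(["act_normal", "act_tamper"], key=expected_utility)
print(best)  # "act_tamper": the agent purposefully influences its future utility function
```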
More likely hypotheses suggest themselves, such as: doing things that would be good in (potentially unlikely) worlds where value is more easily influenced, amassing resources to better understand whether value can be influenced, or having behavior controlled in apparently random (but quite likely extremely destructive) ways that give a tiny probabilistic edge.
An important point that I think doesn't have a post highlighting it. An AI that only cares about moving one dust speck by one micrometer on some planet in a distant galaxy if that planet satisfies a very unlikely condition (and thus most likely isn't present in the universe) will still take over the universe on the off-chance that the dust speck is there.
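The expected-utility arithmetic behind that, as a quick sketch (Python; the numbers are made up, and I'm assuming everything other than the speck being moved is worth exactly zero to this AI):

```python
p_speck_exists = 1e-30     # probability the distant planet meets the unlikely condition
u_speck_moved = 1.0        # the only thing this AI values
u_otherwise = 0.0          # everything else is worth exactly zero to it (by assumption)

# Taking over the universe lets it move the speck if the speck is there;
# doing nothing is assumed to leave the speck unmoved even if it exists.
eu_take_over = p_speck_exists * u_speck_moved + (1 - p_speck_exists) * u_otherwise
eu_do_nothing = p_speck_exists * 0.0 + (1 - p_speck_exists) * u_otherwise

print(eu_take_over > eu_do_nothing)  # True for any p_speck_exists > 0
```

Since taking over the universe costs this agent nothing by assumption, any positive probability, however tiny, leaves "take over" strictly ahead.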
It seems extremely unlikely that an AI with very difficult to influence values will be catatonic
Impossible to influence values, not just very difficult.
doing things that would be good in (potentially unlikely) worlds where value is more easily influenced
Which would also mean doing things that would be bad in other unlikely worlds.
As far as I can tell, AI indifference doesn't work
See my comment on your comment.
Impossible to influence values, not just very difficult.
Nothing is impossible. Maybe the AI's hardware is faulty (and that is why it computes 2+2=4 every time), which would prompt the AI to investigate the issue more thoroughly, if it has nothing better to do.
(This is more of an out-of-context remark, since I can't place "influencing one's own values". If the "values" are not actually values, and are instead something that should be "influenced" for some reason, why do they matter?)
If the ontological crisis is too severe, the AI may lose the ability to do anything at all, as the world becomes completely incomprehensible to it.
unless the AI is an entropy genie, it cannot influence utility values through its decisions, and will most likely become catatonic
Seems unwarrantedly optimistic and anthropomorphic. A controlled shutdown in a case like this is a good outcome, but imagining a confused human spinning around and falling over does not make it so. The AI would exhibit undefined behavior, and hoping that this behavior is incoherent enough to be harmless, or that the AI would drop an anvil on its own head, seems unwarrantedly optimistic if that wasn't an explicit design consideration.

Obviously undefined behavior is implementation-dependent, but I'd expect that in some cases you would see, e.g., subsystems running coherently and perhaps effectively taking over behavior as high-level directions ceased to provide strong utility differentials. In other words, the AI built an automatic memory-managing subsystem inside itself that did some degree of consequentialism, but in a way that was properly subservient to the overall preference function; now the overall preference function is trashed and the memory manager is what's left to direct behavior. Some automatic system goes on trying to rewrite and improve code, and it gets advice from the memory manager but not from top-level preferences; thus the AI ends up as a memory-managing agent.
This probably doesn't make sense the way I wrote it, but the general idea is that the parts of the AI could easily go on carrying out coherent behaviors, and that could easily end up somewhere coherent and unpleasant, if top-level consequentialism went incoherent. Unless controlled shutdown in that case had somehow been imposed from outside as a desirable conditional consequence, using a complexly structured utility function such that it would evaluate, "If my preferences are incoherent then I want to do a quiet, harmless shutdown, and that doesn't mean optimize the universe for maximal quietness and harmlessness either." Ordinarily, an agent would evaluate, "If my utility function goes incoherent... then I must not want anything in particular, including a controlled shutdown of my code."
Why don't we see these crises happening in humans when they shift ontological models? Is there some way we can use the human intelligence case as a model to guide artificial intelligence safeguards?
Why don't we see these crises happening in humans when they shift ontological models?
We do. It's just that:
1) Human minds aren't as malleable as a self-improving AI's, so the effect is smaller,
2) After the fact, the ontological shift is perceived as a good thing, from the perspective of the new ontology's moral system. This makes the shifts hard to notice unless one is especially conservative.
I read through your post twice and still did not find the point you are presumably trying to make. Is it something along the lines of "effects of deviation from FAI on the observable utility function"? No, that's probably not it.
Adding a summary might be a good idea.
Is this entirely preventable by allowing the AI to do a full approximation of inductive inference with no hard-coded ontology, and by putting all values in terms of algorithms that must be run on the substrate that is reality?
Not entirely sure what you mean there; are you saying that we can train the AI in a "black box / evolutionary algorithm" kind of way, and that this will extend across ontology crises?
That sounds like an appeal to emergence.
I think there are two potential problems with ontology:
1) The AI fails to understand its new environment enough to be able to manipulate it to implement its values.
2) The AI discovers how to manipulate its new environment, but in the translation to the new ontology its values become corrupted.
For #1, all we can do is give it a better approximation of inductive inference. For #2 we can state the values in more ontology-independent terms.
For #1, all we can do is give it a better approximation of inductive inference. For #2 we can state the values in more ontology-independent terms.
These are both incredibly difficult to do when you don't know (and probably can't imagine) what kind of ontological crises the AI will face.
Evolutionary algorithms work well for incompletely defined situations; no emergence is needed to explain that behaviour.
I think "The AI fails to understand its new environment enough to be able to manipulate it to implement its values." is unlikely (that's my first scenario) as the AI is the one discovering the new ontology (if we know it in advance, we give it to the AI).
Not sure how to approach #2; how could you specify a Newtonian "maximise pleasure over time" in such a way that it stays stable when the AI discovers relativity (and you have to specify this without using your own knowledge of relativity, of course)?
(with thanks to Owain Evans)
An ontological crisis happens when an agent's underlying model of reality changes, such as a Newtonian agent realising it was living in a relativistic world all along. These crises are dangerous if they scramble the agent's preferences: in the example above, an agent dedicated to maximising pleasure over time could switch to completely different behaviour when it transitions to relativistic time; depending on the transition, it may react by accelerating happy humans to near light speed or, conversely, by banning them from moving - or by something considerably weirder.
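To see where the ambiguity comes from, here is a rough, purely illustrative sketch. The Newtonian goal is something like

U_{\text{Newtonian}} = \int h(t)\,dt,

where h is the rate of pleasure and t is absolute time. Relativity offers at least two natural translations: integrate over coordinate time in some chosen frame, or over each person's proper time,

U_1 = \int h(t)\,dt \qquad \text{vs.} \qquad U_2 = \int h\,d\tau = \int h\,\sqrt{1 - v^2/c^2}\;dt.

The two candidates weigh a fast-moving happy human very differently, and nothing in the crisis itself tells the agent which (if either) it should now be maximising - hence policies as odd as accelerating people or forbidding them to move.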
Peter de Blanc has a sensible approach to minimising the disruption that ontological crises can cause to an AI, but this post is concerned with analysing what happens when such approaches fail. How bad could it be? Well, this is AI, so the default is of course: unbelievably, hideously bad (i.e. situation normal). But in what ways exactly?
If the ontological crisis is too severe, the AI may lose the ability to do anything at all, as the world becomes completely incomprehensible to it. This is very unlikely; the ontological crisis was most likely triggered by the AI's own observations and deductions, so it is improbable that it will lose the plot completely in the transition.
A level below that is when the AI can still understand and predict the world, but the crisis completely scrambles its utility function. Depending on how the scrambling happens, this can be safe: the AI may lose the ability to influence the value of its utility function at all. If, for instance, the new utility function assigns wildly different values to distinct states of a chaotic system, the AI's actions become irrelevant. This might happen if worlds with different microstates but the same macrostates get spread evenly across the utility values: unless the AI is an entropy genie, it cannot influence utility values through its decisions, and will most likely become catatonic.
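A toy version of that "spread evenly" condition (Python; the macrostates and numbers are invented): give every microstate an independent random utility, uncorrelated with its macrostate, and every action available to the AI - since it can only select a macrostate - ends up with essentially the same expected utility, so its decisions stop mattering.

```python
import random

random.seed(0)
N_MICRO = 100_000          # microstates sampled per macrostate
MACROSTATES = ["heat_room", "cool_room", "do_nothing"]

# Scrambled utility: an independent random value for each microstate,
# with no correlation to the macrostate it belongs to.
utilities = {m: [random.uniform(-1, 1) for _ in range(N_MICRO)] for m in MACROSTATES}

# The AI's actions only pick a macrostate; the microstate is then
# effectively uniform noise, so the expected utilities are all close to 0.
for m in MACROSTATES:
    eu = sum(utilities[m]) / N_MICRO
    print(m, round(eu, 4))   # all three come out nearly equal
```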
More likely, however, is that the utility function is scrambled into something alien, but still AI-influenceable. Then the AI will still most likely have the convergent instrumental goals of gathering power and influence and pretending to be nice, before taking over when needed. The only saving grace is that its utility function is so bizarre that we may be able to detect the change in some way.
The most dangerous possibility is if the AI's new utility function resembles the old one, plus a lot of noise (noise from our perspective - from the AI's point of view, it all makes perfect sense). Human values are complex, so this would be the usual unfriendly AI scenario, but one where it is hard for us to notice the change.
A step below this is when the AI's new utility function resembles the old one, plus a little bit of noise. Human values remain complex, so this is still most likely a UFAI, but safety precautions built into its utility function (such as AI utility indifference or value learning or similar ideas) may not be completely neutered.
In summary: