That's a good observation. These situations are analogous in some ways.
I wonder if it could be possible to permanently anchor an agent to its original ontology: to specify that the ontology with which it was initialized is the perspective that it is required to use when evaluating its utility function. The agent is permitted to build whatever models it needs to build, but it's only allowed to assign value using the primitive concepts.
(Or perhaps the agent is allowed to re-define its value system within the new, more accurate ontology, but it isn't allowed to do so until it comes up with a mapping good enough that the prior ontology and the new ontology give the same answers on questions of value. And if it can never accomplish that, then it simply never makes the switch.)
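To make the anchoring idea concrete, here is a minimal sketch in Python of what it might look like. Everything here is an invented illustration, not a worked-out proposal: the class, the method names, and the idea of testing agreement on a finite set of value questions are assumptions layered on top of the comment above.

```python
# Hypothetical sketch of "anchoring" an agent's values to its initial ontology.
# All names are invented for illustration.

class AnchoredAgent:
    def __init__(self, primitive_utility):
        # Value is defined only over the primitive concepts of the initial
        # ontology (e.g. a dict like {"humans": ..., "suffering": ...}).
        self.primitive_utility = primitive_utility
        self.translate = lambda state: state  # identity while the initial ontology is in use

    def adopt_world_model(self, translate_to_primitives):
        # The agent may build any richer model it likes, but it must carry a
        # translation from that model's states back to the primitive concepts,
        # because utility is only ever evaluated on those.
        self.translate = translate_to_primitives

    def evaluate(self, state):
        return self.primitive_utility(self.translate(state))

    def try_redefine_values(self, new_utility, value_questions, tolerance=1e-6):
        # The variant in the parenthetical above: values may be restated in the
        # new ontology, but only once the restated utility agrees with the
        # anchored one on every test question of value. Otherwise nothing changes.
        if all(abs(new_utility(s) - self.evaluate(s)) <= tolerance for s in value_questions):
            self.primitive_utility = new_utility
            self.translate = lambda state: state
            return True
        return False
```

The weight of the idea sits in `translate_to_primitives`: the richer model can be as alien as it likes, so long as every value judgement is routed back through the concepts the agent started with.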
On the one hand, we do ultimately want agents who can grow to understand everything. On the other, we don't want them to stop caring about humans the moment they stop seeing "humans" and start seeing "quivering blobs of cellular machinery".
Another thought is that AIs won't necessarily be as preoccupied with what is "real" as humans sometimes are. Just because an agent realizes that its whole world model is "not sufficiently fundamental" doesn't immediately imply that it discards the prior model wholesale.
That actually seems like what humans do. Human confusions about moral philosophy even seem quite like an ontological crisis in an AI.
I think they're a little different - ontological crises can (I think) be resolved naturally if an agent keeps a bunch of labeled data (or labeled-data-equivalent) around to define things by. But out-of-environment behavior can reflect fundamental limits on extrapolation, to which the only solution is more data, not better agents.
Which is to say, in the case of an ontological crisis I don't agree that the regular features are missing - they're just implemented by different computations than before.
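Here is a toy sketch of that "keep labeled data around" point, assuming the value-relevant concept is pinned down extensionally by stored raw observations rather than by a symbol in the old ontology. The class name, the `featurize` hook, and the choice of scikit-learn's `LogisticRegression` are all arbitrary stand-ins.

```python
# Toy illustration: a concept defined by stored labeled observations can be
# re-learned as a new computation whenever the agent's representation changes.

from sklearn.linear_model import LogisticRegression

class GroundedConcept:
    def __init__(self, raw_observations, labels):
        # The raw observations and labels are kept permanently; they define
        # the concept independently of any particular ontology.
        self.raw_observations = raw_observations
        self.labels = labels
        self.classifier = None

    def rebind(self, featurize):
        # After an ontology shift, re-express the stored examples in the new
        # feature space and re-fit: same concept, different computation.
        features = [featurize(obs) for obs in self.raw_observations]
        self.classifier = LogisticRegression().fit(features, self.labels)

    def __call__(self, observation, featurize):
        return self.classifier.predict([featurize(observation)])[0]
```

On this picture the regular feature survives the shift because its definition lives in the data; only the computation that detects it gets replaced.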
This is why it is important for us to teach AIs to play games. We have an extensive tool set for practicing temporary rule-switching and goal-switching, and we regularly practice counterfactual models with our children. It shouldn't be hard to do the same with an AI, if we just remember to do it.
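As a very rough, self-contained illustration of that kind of practice, here is a toy in which a single tabular agent is trained across episodes whose goal is switched at random between two "games". The corridor world, the Q-learning details, and every name are invented for this sketch and carry no claim about how one would actually do this.

```python
# Toy goal-switching practice: one agent, one corridor, two games whose goals
# point in opposite directions, switched at random between episodes.

import random

N_STATES = 5                              # corridor positions 0..4
GAMES = {"go_left": 0, "go_right": N_STATES - 1}

def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.1):
    # One Q-table per game, so knowledge about the old goal is kept around
    # rather than overwritten whenever the rules switch.
    q = {g: [[0.0, 0.0] for _ in range(N_STATES)] for g in GAMES}
    for _ in range(episodes):
        game = random.choice(list(GAMES))  # temporary rule/goal switch
        goal = GAMES[game]
        state = N_STATES // 2
        for _ in range(20):
            if random.random() < epsilon:
                action = random.randint(0, 1)          # explore
            else:
                action = 0 if q[game][state][0] >= q[game][state][1] else 1
            next_state = max(0, min(N_STATES - 1, state + (1 if action else -1)))
            reward = 1.0 if next_state == goal else 0.0
            best_next = max(q[game][next_state])
            q[game][state][action] += alpha * (reward + gamma * best_next - q[game][state][action])
            state = next_state
            if state == goal:
                break
    return q

if __name__ == "__main__":
    q = train()
    mid = N_STATES // 2
    # The same starting state should now call for opposite actions depending
    # on which game's goal is currently in force.
    print("go_left prefers:", "left" if q["go_left"][mid][0] > q["go_left"][mid][1] else "right")
    print("go_right prefers:", "left" if q["go_right"][mid][0] > q["go_right"][mid][1] else "right")
```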
One problem with AI is the possibility of ontological crises - of AIs discovering that their fundamental model of reality is flawed, and being unable to cope safely with that change. Another problem is out-of-environment behaviour - an AI that has been trained to behave very well in a specific training environment messes up when introduced to a more general environment.
It suddenly occurred to me that these might in fact be the same problem in disguise. In both cases, the AI has developed certain ways of behaving in reaction to certain regular features of its environment. And suddenly it is placed in a situation where these regular features are absent - either because it has realised that these features are actually very different from what it thought (ontological crisis), or because the environment is different and no longer supports the same regularities (out-of-environment behaviour).
In a sense, both these errors may be seen as imperfect extrapolation from partial training data.
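A toy numerical illustration of that framing, with every detail (the sine function, the narrow training interval, the linear fit) chosen arbitrarily just to exhibit the failure mode:

```python
# Imperfect extrapolation from partial training data: a model fit only on a
# narrow slice of inputs looks fine in-environment and breaks outside it.

import numpy as np

rng = np.random.default_rng(0)

# "Training environment": inputs confined to [0, 1], where the true function
# y = sin(x) happens to look almost linear.
x_train = rng.uniform(0.0, 1.0, size=200)
y_train = np.sin(x_train)

# The learned behaviour: a straight line fit to the narrow regime.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def predict(x):
    return slope * x + intercept

# In-environment the learned regularity holds; out-of-environment it silently breaks.
print("error at x=0.5:", abs(predict(0.5) - np.sin(0.5)))   # small
print("error at x=4.0:", abs(predict(4.0) - np.sin(4.0)))   # large
```

Nothing in the model flags the second prediction as suspect: it extrapolates the regularity it saw exactly as confidently as it interpolates.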