Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Before solving complicated problems (such as reaching a decision with thousands of variables and complicated dependencies), it helps to focus on solving simpler problems first (such as utility problems with three clear choices), and then to build up gradually. After all, "learn to walk before you run": with any skill, you have to train on easy cases at first.

But sometimes this approach can be unhelpful. One major example is in human decision-making: trained human experts tend to rapidly "generate a possible course of action, compare it to the constraints imposed by the situation, and select the first course of action that is not rejected." This type of decision-making is not an improved version of rational utility-maximisation; it is something else entirely. So a toy model that used utility-maximisation would be misleading in many situations.

Toy model ontology failures

Similarly, I argue, simple toy models of ontology changes can be very unhelpful. Let's pick a simple example: an agent making diamonds.

And let's pick the simplest ontology change/model splintering that we can: the agent gains access to some carbon-13 atoms.

Assuming the agent has only ever encountered standard carbon-12 atoms during its training, how should it handle this new situation?

Well, there are multiple approaches we could take. I'm sure that some spring to mind already. We could, for example...

...realise that the problem is fundamentally unsolvable.

If we have trained our agent with examples of carbon-12 diamonds ("good!") versus other arrangements of carbon-12 ("bad!") and other non-carbon elements ("bad!"), then we have underdefined what it should do with carbon-13. Treating the carbon-13 the same as carbon-12 ("good iff in diamond shape") or the same as other elements ("always bad"): both options are fully compatible with the training data.

Therefore there is no "solving" this simple ontology change problem. One can design various methods that may give one or another answer, but then we are simply smuggling in some extra assumptions to make a choice. The question "what should a (carbon-12) diamond maximiser do with carbon-13 if it has never encountered it before?" does not have an answer.
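To make the underdetermination concrete, here is a minimal sketch (in Python, with entirely hypothetical names and data) of two reward rules that agree on every training example but disagree on carbon-13 diamonds:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lump:
    element: str              # e.g. "C12", "C13", "Fe", "Si"
    is_diamond_lattice: bool  # arranged in a diamond crystal structure?

# Training data: only carbon-12 and non-carbon elements ever appear.
training_data = [
    (Lump("C12", True),  "good"),
    (Lump("C12", False), "bad"),
    (Lump("Fe",  False), "bad"),
    (Lump("Si",  True),  "bad"),   # non-carbon, even in a diamond lattice
]

# Extension A: "any carbon in a diamond lattice is good".
def reward_a(lump: Lump) -> str:
    return "good" if lump.element.startswith("C") and lump.is_diamond_lattice else "bad"

# Extension B: "only carbon-12 in a diamond lattice is good".
def reward_b(lump: Lump) -> str:
    return "good" if lump.element == "C12" and lump.is_diamond_lattice else "bad"

# Both extensions fit every training example...
assert all(reward_a(x) == label and reward_b(x) == label for x, label in training_data)

# ...but they disagree on the situation the training data never covered.
c13_diamond = Lump("C13", True)
print(reward_a(c13_diamond))  # good
print(reward_b(c13_diamond))  # bad
```

Both rules are perfect fits to the training data; picking between them requires information that the training data simply does not contain.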

Low-level preferences in one world-model do not tell an agent what their preferences should be in another.

More complex models give more ontology-change information

Let's make the toy model more complicated, in three different ways.

  1. The agent may be making diamonds as part of a larger goal (to make a wedding ring, to make money for a corporation, to demonstrate the agent's skills). Then whether carbon-13 counts can be answered by looking at this larger goal.
  2. The agent may be making diamonds alongside other goals (maybe it is a human-like agent with a strange predilection for diamonds). Then it could make sense for the agent to choose how to extend its goal by considering compatibility with these other goals. If it also valued beauty, then carbon-13 could be as good as carbon-12. If it were obsessed with purity or precision, then it might exclude carbon-13. This doesn't provide strong reasons to favour one option over the other, but it at least gives some guidance.
  3. Finally, if the agent (or the agent's designer) has meta-preferences over how its preferences should extend, then these should guide how the goals and preferences change. A sketch of how such considerations could select among candidate extensions follows this list.
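
Here is a minimal, purely illustrative sketch of the first and third mechanisms: a set of candidate extensions of "diamond" to carbon-13, with the choice between them delegated to a larger goal or to a meta-preference. Every name and criterion below (the buyer's ability to detect isotopes, the "extend by physical similarity" rule) is a hypothetical assumption for illustration, not anything specified in the post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lump:
    element: str              # e.g. "C12", "C13", "Fe"
    is_diamond_lattice: bool

# The two candidate extensions of "diamond" from the earlier sketch.
CANDIDATE_EXTENSIONS = {
    "any carbon isotope": lambda lump: lump.element.startswith("C") and lump.is_diamond_lattice,
    "carbon-12 only":     lambda lump: lump.element == "C12" and lump.is_diamond_lattice,
}

# Mechanism 1: a larger goal decides. If the diamond is for a wedding ring and
# the buyer cannot distinguish isotopes, the larger goal is indifferent, so the
# more permissive extension is chosen; otherwise, stay strict.
def extension_from_larger_goal(buyer_detects_isotopes: bool) -> str:
    return "carbon-12 only" if buyer_detects_isotopes else "any carbon isotope"

# Mechanism 3: a meta-preference decides, e.g. "extend concepts by physical
# similarity to the training examples rather than by exact match with them".
def extension_from_meta_preference(prefer_similarity: bool) -> str:
    return "any carbon isotope" if prefer_similarity else "carbon-12 only"

if __name__ == "__main__":
    c13_diamond = Lump("C13", True)
    choice = extension_from_larger_goal(buyer_detects_isotopes=False)
    print(choice, CANDIDATE_EXTENSIONS[choice](c13_diamond))  # any carbon isotope True
```

The point of the sketch is only that the selection criterion lives outside the original diamond predicate: nothing in the training data itself forces either choice.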

So, instead of finding a formula for how agents handle ontology changes, we need to figure out how we would want them to behave, and code that into them. Dealing with ontology changes is an issue connected with our values (or the extrapolations of our values), not something that has a formal answer.

And toy models of ontology changes can be misleading about where the challenge lies.
