During NeurIPS, I had a discussion with Dylan Hadfield-Menell about how to formalise my methods for extracting human values.

He brought up the issue of how to figure out the goals of a hierarchical system - for example, a simplified corporation. It's well known that not everyone in a corporation acts in the best interests of the corporation, so a corporation will not, in general, act like a rational shareholder value maximiser. Despite this, is there a way of somehow deducing, from its behaviour and its internal structure, that its true goal is shareholder value maximisation?

I immediately started taking the example too literally, and objected to the premise, mentioning the goals of subagents within the corporation. We were to a large extent talking past each other; I was strongly aware that even if a system had its origin in one set of preferences, that didn't mean it couldn't have disparate preferences of its own. Dylan, on the other hand, was interested in formalising my ideas in a simple setting. This is especially useful, as the brain itself is a hierarchical system with subagents.

So here is my attempt to formalise the problem, and present my objection, all at the same time.

The setting and actions

I'm choosing a very simplified version of the hierarchical planning problem presented in this paper.

A robot, in green, is tasked with moving the object A to the empty bottom storage area. The top level of the hierarchical plan formulates the plan of gripping object A.

This plan is then passed down to a lower level of the hierarchy, which realises that it must first get B out of the way. This is passed to a lower level that is tasked with gripping B. It does so, but, for some reason, it first turns on the music.

Then it moves B out of the way (this will involve moving up and down the planning hierarchy a few times: first gripping B, then moving it, then releasing the grip, and so on).

Then the planning moves back up the hierarchy, which re-issues the "grip A" command. The robot does that, and eventually moves A into storage.
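
To keep the overall structure in mind, here is a small Python sketch of the plan hierarchy as a nested task tree; all the task names are invented labels for illustration, not taken from the paper:

    # Illustrative nested task tree for this episode; every task name is
    # a made-up label, not taken from the paper or from any real system.
    plan_hierarchy = {
        "move_A_to_storage": [                 # top level of the hierarchy
            {"grip_A": [                       # needs the path to A cleared first
                {"move_B_out_of_the_way": [
                    {"grip_B": ["turn_on_music", "close_gripper_on_B"]},  # the odd step
                    "carry_B_aside",
                    "release_B",
                ]},
                "close_gripper_on_A",
            ]},
            "carry_A_to_storage",
            "release_A",
        ]
    }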

The irrationality of music

Now this seems to be mainly a rational plan for moving A to storage, except for the music part. Is the system trying to move A, but being irrational at one action? Or does the music reflect some sort of preference of the hierarchical system?

Well, to sort that out, we can look within the agent. Let's look at the level of the hierarchy that is tasked with gripping B, and that turns on the music. In a massively pseudo version of pseudo code, here is one version of this:
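
A minimal Python sketch of one way this could look (call it Algorithm 1); all the identifiers and the toy world model are invented for the sketch:

    # Algorithm 1: a sketch of the "grip B" level of the hierarchy.
    # Every identifier here is an illustrative stand-in.

    def criteria(plan, believed_effects):
        # Gauges whether the plan is predicted to move the state closer to
        # the goal of B being gripped, according to this level's world model.
        return "B_gripped" in believed_effects[plan]

    def execute(plan):
        # What actually happens: only physically gripping B grips B.
        return "grip_B" in plan

    def grip_B_level():
        plans = [("turn_on_radio",), ("grip_B",)]
        believed_effects = {                    # a (partly deluded) world model:
            ("turn_on_radio",): {"music_on", "B_gripped"},  # "vibrations shake B loose"
            ("grip_B",): {"B_gripped"},
        }
        for plan in plans:                      # try plans in order; fall back on failure
            if criteria(plan, believed_effects) and execute(plan):
                return plan

    print(grip_B_level())   # -> ('grip_B',): the radio plan is tried first but fails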

The criteria function just looks at the plans and gauges whether they will move the state closer to the goal of B being gripped.

In this situation, it's clear that the algorithm is trying to grip B, but has some delusion that turning the radio on will help it do so (maybe in a previous experiment the vibrations shook B loose?). The algorithm will first try the radio plan, and, when it fails, will then try the second plan and grip B the conventional way.

Contrast that with this code:
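
Again only a sketch, with the same invented names as before (call it Algorithm 2); the only change is that this level now tracks a quantity it calls "enjoyment":

    # Algorithm 2: the same "grip B" level, but with an extra goal of its own.
    # As before, every identifier is an illustrative stand-in.

    def criteria(plan, effects):
        # Same check as in Algorithm 1: does the plan get B gripped?
        return "B_gripped" in effects[plan]

    def grip_B_level():
        effects = {                                  # an accurate world model this time
            ("turn_on_radio", "grip_B"): {"music_on", "B_gripped"},
            ("grip_B",): {"B_gripped"},
        }
        best_plan, best_enjoyment = None, -1
        for plan in effects:
            if not criteria(plan, effects):          # must still satisfy the parent level...
                continue
            enjoyment = 1 if "music_on" in effects[plan] else 0   # ...but it also wants music
            if enjoyment > best_enjoyment:
                best_plan, best_enjoyment = plan, enjoyment
        return best_plan

    print(grip_B_level())   # -> ('turn_on_radio', 'grip_B'): the music is deliberate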

Now, there is the usual problem with calling a variable "enjoyment", and that's the reason I've been looking into syntax and semantics.

But even without solving these issues, it's clear that this part of the system has a different goal from the rest of the hierarchical system. It's a subagent with non-aligned preferences. Turning on the music is not an accident; it's a deliberate and rational move on the part of this subagent.

The goals of the system

For Algorithm 1, the system clearly has the goal of moving A to storage, and has a failure of rationality along the way.

For Algorithm 2, there are several possible goals: moving A to storage, turning the radio on, and various mixes of these goals. Which is the right interpretation of the goal? We can model the hierarchy in different ways that suggest different intuitions for the "true" goal - for example, Algorithm 2 could actually be a human, acting on instructions from a manager. Or it could be a sub-module of a brain, one that is unendorsed by the rest of the brain.

My take on it is that the "true goal" of the system is underspecified. From the outside, my judgement would be that the closer the subagent resembles an independent being, and the closer its "enjoyment" resembles true enjoyment, the more the subagent's preferences count as genuine goals of the system.

From the inside, though, the system may have many meta-preferences that push it in one direction or another. For example, if there were a module that analysed the performance of the rest of the algorithm and recommended changes, and if that module systematically changed the system towards becoming an A-storage-mover (such as by recommending deleting the "enjoyment" goal), then that would be a strong meta-preference towards only having the storage goal. And if we used the methods here, we'd get moving A as the main goal, plus some noise.
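
As a rough sketch of the kind of meta-level module meant here (every structure and name below is invented for illustration):

    from dataclasses import dataclass

    # Hypothetical meta-preference module; nothing here is taken from a real system.

    @dataclass
    class Goal:
        name: str
        weight: float

    def meta_review(goals, endorsed="move_A_to_storage"):
        # Recommend changes that push the system towards being a pure A-storage-mover.
        return [("delete_goal", g.name) for g in goals if g.name != endorsed]

    # The grip-B level's extra "enjoyment" goal gets flagged for deletion:
    grip_B_goals = [Goal("move_A_to_storage", 1.0), Goal("enjoyment", 0.5)]
    print(meta_review(grip_B_goals))   # -> [('delete_goal', 'enjoyment')]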

Conversely, maybe there are processes that are ok with the subagent's behaviour. A modern corporation, for instance, has legal limits on what it can inflict on its workers, so some of the meta-preferences of the corporation (located, for example, in HR and the legal department) are geared towards a livable work environment, even if this doesn't increase shareholder value.

(Now, I still might like to intervene in some cases - even if a slave state's goal of invading its neighbours is clear at both the object and the meta level, I can still value its subagents even if it doesn't.)

In the absence of some sort of meta-preferences, there are multiple ways of establishing the preferences of a hierarchical system, and many of them are equally valid.

Comments

This is part of the problem I was trying to describe in multi-agent minds, part "what are we aligning the AI with".

I agree the goal is under-specified. With regard to meta-preferences, with some simplification, it seems we have several basic possibilities:

1. Align with the result of the internal aggregation (e.g. observe what the corporation does)

2. Align with the result of the internal aggregation, by asking (e.g. ask the corporation via some official channel, let the sub-agents sort it out inside)

3. Learn about the sub-agents and try to incorporate their values (e.g. learn about the humans in the corporation)

4. Add layers of indirection, e.g. asking about meta-preferences

Unfortunately, I can imagine that in the case of humans, 4. can lead to various stable reflective equilibria of preferences and meta-preferences - for example, I can imagine that, by suitable queries, you can get a human to want

  • to be aligned with explicit reasoning, putting most value on some conscious, model-based part of the mind; with meta-reasoning about VNM axioms, etc.
  • to be aligned with some heart&soul, putting value on universal love, transcendent joy, and the many parts of the human mind which are not explicit, etc.

where both of these options would be self-consistently aligned with the meta-preferences the human will be expressing about how the sub-agent alignment should be done.

So even with meta-preferences, likely there are multiple ways

"So even with meta-preferences, likely there are multiple ways"

Yes, almost certainly. That's why I want to preserve all meta-preferences, at least to some degree.