I'm looking for a name for a problem. I expect it already has one, but I don't know what it is.

The problem: suppose we have an AI trying to learn what people want - e.g. via some variant of inverse reinforcement learning (IRL). Intuitively speaking, we point at a bunch of humans and say “figure out what they want, then do that”. A few possible ways the AI could respond:

  • “Hmm, to the extent that those things have utility functions, it looks like they want friendship, challenge, status, etc…”
  • “Hmm, it looks like they want to maximize the number of copies of the information-carrying molecules in their cells.”
  • “Hmm, it looks like they’re trying to maximize entropy in the universe.”
  • “Hmm, it looks like they’re trying to minimize physical action.”

Why would the AI think these things? Well, you’re pointing at a bunch of atoms, and the microscopic laws of motion which govern those atoms can be interpreted as minimizing a quantity called action. Or you’re pointing at a bunch of organisms subject to a selection process which (locally) maximizes the number of copies of some information-carrying molecules. How is the AI supposed to know which optimization process you’re pointing to? How can it know which level of abstraction you’re talking about?

What data could tell the AI that you're pointing at humans, not the atoms they're made of?

This sounds like a question which would already have a name, so if anybody could point me to that name, I'd appreciate it.


IRL does not need to answer this question along the way to solving the problem it's designed to solve. Consider, for example, using IRL for autonomous driving. The input is a bunch of human-generated driving data - for example, video from inside a car as a human drives it, or more abstract data (time, position, etc.) tracking the car over time - and IRL attempts to learn a reward function which produces a policy which produces driving data that mimics the input data. At no point in this process does IRL need to do anything like reason about the distinction between, say, the car and the human; the point is that all of the interesting variation in the data is in fact (from our point of view) being driven by the human's choices, so to the extent that IRL succeeds, it is hopefully capturing the human's reward structure with respect to driving at the intuitively obvious level.

In particular, a large part of what selects the level of abstraction at which IRL works is the human programmer's choice of how to set up the IRL problem: the format of the input data, the format of the reward function, and the format of the IRL algorithm's actions. A toy sketch of this is below.
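For concreteness, here's a minimal toy sketch of that point (this isn't any particular published IRL algorithm, and the features, candidate actions, and numbers are all made-up assumptions). Because the programmer chose to describe each action by just two driving-level features, the learned reward can only ever be about speeds and lane positions - never about atoms:

```
import numpy as np

rng = np.random.default_rng(0)

# Programmer's choice #1: the format of the data. Each candidate action is
# summarized by two hand-picked features: closeness to the speed limit and
# closeness to the lane center.
def features(action):
    speed, lane_offset = action
    return np.array([-abs(speed - 30.0), -abs(lane_offset)])

# Programmer's choice #2: the format of the reward - linear in those features.
def reward(weights, action):
    return weights @ features(action)

# Fake "demonstrations": at each step the human picks one action out of a small
# candidate set, preferring roughly 30 mph and a small lane offset.
candidate_sets, chosen = [], []
for _ in range(200):
    candidates = [(rng.uniform(10, 50), rng.uniform(-2, 2)) for _ in range(5)]
    best = min(candidates, key=lambda a: abs(a[0] - 30) + 2 * abs(a[1]))
    candidate_sets.append(candidates)
    chosen.append(best)

# Fit the weights by gradient ascent on a Boltzmann-rational choice likelihood:
# gradient = features(chosen action) - expected features under current weights.
weights = np.zeros(2)
for _ in range(300):
    grad = np.zeros(2)
    for cands, pick in zip(candidate_sets, chosen):
        phis = np.array([features(a) for a in cands])
        logits = phis @ weights
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        grad += features(pick) - probs @ phis
    weights += 0.1 * grad / len(chosen)

print("learned feature weights:", weights)
print("reward of 30 mph, centered:  ", reward(weights, (30.0, 0.0)))
print("reward of 50 mph, off-center:", reward(weights, (50.0, 1.5)))
```

Whatever weights come out, they live entirely in the feature space the programmer picked; atom-level descriptions simply aren't representable in this setup.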

In any case, in MIRI terminology this is related to multi-level world models.

Importantly, this only works for narrow value learning, not what Paul calls "ambitious value learning" (learning long-term preferences). Narrow value learning has much more in common with imitation learning than with ambitious value learning; at best, you end up with something that pursues similar subgoals to the ones humans do.

The concern in the original post applies to ambitious value learning. (But ambitious value learning using IRL already looks pretty doomed anyway).

I wrote that post you link to, and I don't think ambitious value learning is doomed at all - just that we can't do it the way we traditionally attempt to.

I specifically mean ambitious value learning using IRL. The resulting algorithm will look quite different from IRL as it currently exists. (In particular, assuming humans are reinforcement learners is problematic)

Wouldn't the reward function "minimize physical action for this configuration of atoms" fit the data really well (given unrealistic computational power), but produce unhelpful prescriptions for behavior outside the training set? I'm not seeing how IRL dodges the problem, other than through the human's choice of how to set up the algorithm (effectively choosing a prior).

What I read Qiaochu as saying is that the IRL model doesn't have an ontology, and the world it lives in is one created by the ontology the programmer implicitly constructs for it based on choices about training data. Thus this problem doesn't come up because the IRL model isn't interacting with the whole world; only the parts of it the programmer thought relevant to solving the problem, and the model succeeds in part by how good a job the programmer did in picking what's relevant.

This question feels confused to me but I'm having some difficulty precisely describing the nature of the confusion. When a human programmer sets up an IRL problem they get to choose what the domain of the reward function is. If the reward function is, for example, a function of the pixels of a video frame, IRL (hopefully) learns which video frames human drivers appear to prefer and which they don't, based on which such preferences best reproduce driving data.

You might imagine that with unrealistic amounts of computational power IRL might attempt to understand what's going on by modeling the underlying physics at the level of atoms, but that would be an astonishingly inefficient way to reproduce driving data even if it did work. IRL algorithms tend to have things like complexity penalties to make it possible to select e.g. a "simplest" reward function out of the many reward functions that could reproduce the data (this is a prior, but a pretty reasonable and justifiable one as far as I can tell). Even with large amounts of computational power, I expect it would still not be worth using a substantially more complicated reward function than necessary.
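To make "complexity penalty as a prior" concrete, here's a minimal sketch using a BIC-style penalized score on a regression stand-in (this is not a specific IRL implementation; the data, the two candidate models, and the penalty form are all illustrative assumptions):

```
import numpy as np

rng = np.random.default_rng(1)
N = 500

# Fake data generated by a simple 2-parameter structure plus noise.
x = rng.normal(size=(N, 2))
y = x @ np.array([1.0, -0.5]) + rng.normal(scale=0.3, size=N)

def penalized_score(num_params, residuals, n):
    # Gaussian log-likelihood (up to a constant) minus a BIC-style penalty.
    sigma2 = np.mean(residuals ** 2)
    log_lik = -0.5 * n * np.log(sigma2)
    return log_lik - 0.5 * num_params * np.log(n)

# "Simple" model: the 2 features actually driving the data.
w_simple, *_ = np.linalg.lstsq(x, y, rcond=None)
score_simple = penalized_score(2, y - x @ w_simple, N)

# "Complicated" model: the same 2 features plus 50 junk features, standing in
# for an over-detailed, lower-level description.
x_big = np.hstack([x, rng.normal(size=(N, 50))])
w_big, *_ = np.linalg.lstsq(x_big, y, rcond=None)
score_big = penalized_score(52, y - x_big @ w_big, N)

print("penalized score, simple model:     ", round(score_simple, 1))
print("penalized score, complicated model:", round(score_big, 1))
# The complicated model fits the training data a bit better, but the penalty
# makes the simple model win at this data size.
```

Whether a penalty of this shape keeps winning as the dataset grows is exactly what the reply below pushes on.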

Problem is, if there's a sufficiently large amount of sufficiently precise data, then the physically-correct model's high accuracy is going to swamp the complexity penalty. That would be a ridiculously huge amount of data for atom-level physics, but there could be other abstraction levels which require less data but are still not what we want (e.g. gene-level reward functions, though that doesn't fit the driving example very well).
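A back-of-the-envelope version of that swamping argument (every number here is made up): suppose the detailed model has some small per-datapoint log-likelihood advantage eps. Its total advantage grows like N * eps, while a BIC-style penalty grows only like log N, so for large enough N the accuracy term dominates:

```
import math

eps = 0.001           # assumed per-datapoint log-likelihood advantage of the detailed model
extra_params = 10**6  # assumed extra parameters in the detailed model

for n in [10**4, 10**7, 10**10, 10**13]:
    advantage = n * eps
    bic_penalty = 0.5 * extra_params * math.log(n)
    print(f"N={n:>16,}  accuracy advantage={advantage:>14,.0f}  penalty={bic_penalty:>12,.0f}")
# Once N is large enough, the accuracy term dwarfs any fixed or log-N penalty.
```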

Also, reliance on limited data seems like the sort of thing which is A Bad Idea for friendly AGI purposes.

if there's a sufficiently large amount of sufficiently precise data, then the physically-correct model's high accuracy is going to swamp the complexity penalty

I don't think that's necessarily true?

Bernstein-von Mises theorem. It is indeed not always true; the theorem has some conditions.

An intuitive example of where it would fail: suppose we are rolling a (possibly weighted) die, but we model it as drawing numbered balls from a box without replacement. If we roll a bunch of sixes, then the model thinks the box now contains fewer sixes, so the chance of a six is lower. If we modeled the weighted die correctly, then a bunch of sixes is evidence that it's weighted toward six, so the chance of a six should be higher.
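A toy numeric version of that example (the box composition and the prior are made-up assumptions; the point is just the direction of the update):

```
# Observation: k sixes in a row.
# Misspecified model: a box of numbered balls drawn without replacement, so each
# observed six removes a six from the box and the predicted chance of another
# six goes DOWN.
# Correct model: a possibly-weighted die with a Beta prior on P(six), so each
# observed six pushes the predicted chance UP.
sixes_in_box, balls_in_box = 10, 60   # assumed box: 10 balls of each face

def box_model_prob_of_six(observed_sixes):
    return (sixes_in_box - observed_sixes) / (balls_in_box - observed_sixes)

def weighted_die_prob_of_six(observed_sixes, prior_a=1.0, prior_b=5.0):
    # Posterior mean of Beta(prior_a, prior_b) after observed_sixes sixes
    # and zero non-sixes.
    return (prior_a + observed_sixes) / (prior_a + prior_b + observed_sixes)

for k in [0, 2, 5, 9]:
    print(f"after {k} sixes: box model {box_model_prob_of_six(k):.2f}, "
          f"weighted-die model {weighted_die_prob_of_six(k):.2f}")
```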

Takeaway: Bernstein-von Mises typically fails in cases where we're restricting ourselves to a badly inaccurate model. You can look at the exact conditions yourself; as a general rule, we want those conditions to hold. I don't think it's a significant issue for my argument.

We could set up the IRL algorithm so that atom-level simulation is outside the space of models it considers. That would break my argument. But a limitation on the model space like that raises other issues, especially for FAI.

Is this question related to "How could we fully explain the difference between red and green to a colorblind person?"

(found at: https://www.lesswrong.com/posts/3wYjyQ839MDsZ6E3L/seeing-red-dissolving-mary-s-room-and-qualia)

The AI doesn't have access to ontologically basic labels attached to every atom in your body. How can it know to maximize your values instead of the values of a nearby squirrel, given that information about both appears in the sensory data? The problem is to identify the right optimization process; whether or not several of those processes are made of the same atoms is irrelevant.

This is the failure mode where one defines intelligence as something that constrains the future state of the world, and then fails to consider weird "intelligences". The more general case: https://arbital.com/p/missing_weird/

The best name I know of is "implementing the intentional stance." The intentional stance is Dan Dennett's name for the practice of considering collections of atoms in terms of agents, desires, goals, etc. But this still isn't quite the right label, because you don't merely want the AI to consider the collection of atoms in terms of some goals; you want the AI to use prior information about humans to consider humans in terms of humans' actual goals.

My post on the subject is forthcoming :P