I'm looking for a name for a problem. I expect it already has one, but I don't know what it is.
The problem: suppose we have an AI trying to learn what people want - e.g. an inverse reinforcement learning (IRL) variant. Intuitively speaking, we point at a bunch of humans and say “figure out what they want, then do that”. A few possible ways the AI could respond:
- “Hmm, to the extent that those things have utility functions, it looks like they want friendship, challenge, status, etc…”
- “Hmm, it looks like they want to maximize the number of copies of the information-carrying molecules in their cells.”
- “Hmm, it looks like they’re trying to maximize entropy in the universe.”
- “Hmm, it looks like they’re trying to minimize physical action.”
Why would the AI think these things? Well, you’re pointing at a bunch of atoms, and the microscopic laws of motion which govern those atoms can be interpreted as minimizing a quantity called action. Or you’re pointing at a bunch of organisms subject to a selection process which (locally) maximizes the number of copies of some information-carrying molecules. How is the AI supposed to know which optimization process you’re pointing to? How can it know which level of abstraction you’re talking about?
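For concreteness, the “minimizing action” reading refers to the standard stationary-action formulation of classical mechanics: the trajectories the atoms actually follow are exactly the ones that make the action functional stationary,

$$S[q] = \int_{t_0}^{t_1} L(q, \dot{q}, t)\,dt, \qquad \delta S = 0,$$

which is just as valid a description of the same pile of atoms as “they want friendship, challenge, status”.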
What data could tell the AI that you're pointing at humans, not the atoms they're made of?
This sounds like a question which would already have a name, so if anybody could point me to that name, I'd appreciate it.
IRL does not need to answer this question along the way to solving the problem it's designed to solve. Consider, for example, using IRL for autonomous driving. The input is a bunch of human-generated driving data (for example, video from inside the car as a human drives it, or more abstract data such as time and position tracking the car over time), and IRL attempts to learn a reward function whose induced policy produces driving data that mimics the input data. At no point in this process does IRL need to reason about the distinction between, say, the car and the human; the point is that all of the interesting variation in the data is in fact (from our point of view) driven by the human's choices, so to the extent that IRL succeeds, it hopefully captures the human's reward structure with respect to driving at the intuitively obvious level.
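As a toy illustration of that setup (the MDP, the max-ent IRL gradient, and all the specific numbers below are assumptions for the sketch, not anything from the post), here is a minimal inverse-RL loop on a five-state “road”: the algorithm only ever sees state trajectories and fits a reward whose soft-optimal policy reproduces their visitation statistics; nowhere does it represent “the human” as distinct from “the car”.

```python
import numpy as np

N_STATES = 5          # positions along a short road segment
N_ACTIONS = 2         # 0 = stay, 1 = move forward
GAMMA = 0.9
HORIZON = 10

def next_state(s, a):
    return min(s + a, N_STATES - 1)

def soft_policy(reward):
    """Soft (max-ent) optimal policy for a state-based reward vector."""
    V = np.zeros(N_STATES)
    for _ in range(50):
        Q = np.array([[reward[s] + GAMMA * V[next_state(s, a)]
                       for a in range(N_ACTIONS)] for s in range(N_STATES)])
        V = np.log(np.exp(Q).sum(axis=1))
    return np.exp(Q - V[:, None])   # pi[s, a]

def visitation(policy):
    """Expected state-visitation counts over the horizon, starting at s = 0."""
    d = np.zeros(N_STATES)
    d[0] = 1.0
    total = d.copy()
    for _ in range(HORIZON - 1):
        d_next = np.zeros(N_STATES)
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                d_next[next_state(s, a)] += d[s] * policy[s, a]
        d = d_next
        total += d
    return total

# "Demonstrations": trajectories that drive forward, analogous to the human
# driving data. The learner only ever sees these state sequences.
demos = [[0, 1, 2, 3, 4, 4, 4, 4, 4, 4] for _ in range(20)]
expert_counts = np.zeros(N_STATES)
for traj in demos:
    for s in traj:
        expert_counts[s] += 1
expert_counts /= len(demos)

# Max-ent IRL gradient step: push the reward until the learner's expected
# visitations match the expert's. No "human vs. car" distinction anywhere.
reward = np.zeros(N_STATES)
for _ in range(200):
    grad = expert_counts - visitation(soft_policy(reward))
    reward += 0.1 * grad

print("learned reward by state:", np.round(reward, 2))
```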
In particular, a large part of what selects the level at which IRL works is the human programmer's choice of how to set up the problem: the format of the input data, the format of the reward function, and the format of the IRL algorithm's actions.
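To make that concrete, here is a small, entirely hypothetical illustration: the same driving log can be featurized at the vehicle level or at the raw-sensor level, and whichever feature map the programmer wires into the IRL pipeline is the level at which the learned reward lives.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LogEntry:
    """One record from a (hypothetical) driving log."""
    t: float             # timestamp, seconds
    x: float             # car position along the road, metres
    v: float             # car speed, metres/second
    pixels: List[float]  # flattened camera frame, standing in for low-level data

def car_level_features(entry: LogEntry) -> List[float]:
    # A reward learned over these features talks about the car as an object.
    return [entry.x, entry.v]

def sensor_level_features(entry: LogEntry) -> List[float]:
    # A reward learned over these features talks about raw sensor values;
    # nothing in the IRL machinery itself picks one level over the other.
    return list(entry.pixels)
```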
In any case, in MIRI terminology this is related to multi-level world models.
Bernstein-von Mises theorem. It is indeed not always true; the theorem has some conditions.
An intuitive example of where it would fail: suppose we are rolling a (possibly weighted) die, but we model it as drawing numbered balls from a box without replacement. If we roll a bunch of sixes, then the model thinks the box now contains fewer sixes, so the chance of a six is lower. If we modeled the weighted die correctly, then a bunch of sixes is evidence that it's weighted toward six, so the chance of six should be higher.
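A quick numerical version of this (the specific counts, ten balls per face and five observed sixes, are just illustrative assumptions): the box model's predictive probability of another six goes down, while a weighted-die model with a uniform Dirichlet prior goes up.

```python
from fractions import Fraction

sixes_seen = 5   # we have just observed five sixes in a row

# Model A: a box that started with 10 balls of each face (60 total), drawn
# without replacement. Seeing sixes uses them up, so a six gets LESS likely.
p_box = Fraction(10 - sixes_seen, 60 - sixes_seen)

# Model B: a weighted die with a Dirichlet(1,...,1) prior over the six face
# probabilities. The posterior predictive after k sixes in k rolls is
# (1 + k) / (6 + k), so a six gets MORE likely.
p_die = Fraction(1 + sixes_seen, 6 + sixes_seen)

print("P(next roll is a six)")
print("  box-without-replacement model:", float(p_box))   # ~0.09
print("  weighted-die (Dirichlet) model:", float(p_die))  # ~0.55
print("  fair die for comparison:      ", 1 / 6)          # ~0.167
```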
Takeaway: Bernstein-von Mises...