I'm looking for a name for a problem. I expect it already has one, but I don't know what it is.
The problem: suppose we have an AI trying to learn what people want - e.g. some variant of inverse reinforcement learning (IRL). Intuitively speaking, we point at a bunch of humans and say “figure out what they want, then do that”. A few possible ways the AI could respond:
- “Hmm, to the extent that those things have utility functions, it looks like they want friendship, challenge, status, etc…”
- “Hmm, it looks like they want to maximize the number of copies of the information-carrying molecules in their cells.”
- “Hmm, it looks like they’re trying to maximize entropy in the universe.”
- “Hmm, it looks like they’re trying to minimize physical action.”
Why would the AI think these things? Well, you’re pointing at a bunch of atoms, and the microscopic laws of motion which govern those atoms can be interpreted as minimizing a quantity called action. Or you’re pointing at a bunch of organisms subject to a selection process which (locally) maximizes the number of copies of some information-carrying molecules. How is the AI supposed to know which optimization process you’re pointing to? How can it know which level of abstraction you’re talking about?
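To make the worry concrete, here's a toy sketch (entirely my own construction, not a claim about any particular IRL algorithm): a reward defined over a coarse “agent-level” description, and that same reward pulled back to the fine-grained “atom-level” states, predict exactly the same behavior, so no amount of demonstration data distinguishes which level the reward “really” lives at.

```python
import numpy as np

# Toy construction (mine, for illustration only): a reward defined on a
# coarse "macro" abstraction, and the same reward pulled back to the
# fine-grained "micro" states, induce identical Boltzmann-rational
# behavior - so demonstrations can't say which level the agent cares about.

# Fine-grained states: three micro-variables, each taking values 0..4.
micro_states = [(a, b, c) for a in range(5) for b in range(5) for c in range(5)]

def abstraction(s):
    """Coarse-graining map from a micro-state to a macro-level feature."""
    return sum(s) // 3

def reward_macro(m):
    """A reward defined only at the macro level: 'wants' macro-state 2."""
    return float(m == 2)

def reward_micro(s):
    """The same reward, expressed as a function of micro-states."""
    return reward_macro(abstraction(s))

def boltzmann_choice(reward_fn, beta=2.0):
    """Softmax-rational distribution over micro-states under a reward."""
    logits = beta * np.array([reward_fn(s) for s in micro_states])
    p = np.exp(logits - logits.max())
    return p / p.sum()

p_macro = boltzmann_choice(lambda s: reward_macro(abstraction(s)))
p_micro = boltzmann_choice(reward_micro)

# Identical predictions, hence identical likelihood for any demonstration data.
print(np.allclose(p_macro, p_micro))  # True
```

The equivalence holds by construction, which is exactly the worry: the data alone can't rule out either description.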
What data could tell the AI that you're pointing at humans, not the atoms they're made of?
This sounds like a question which would already have a name, so if anybody could point me to that name, I'd appreciate it.
This question feels confused to me, but I'm having some difficulty precisely describing the nature of the confusion. When a human programmer sets up an IRL problem, they get to choose the domain of the reward function. If the reward function is, for example, a function of the pixels of a video frame, IRL (hopefully) learns which video frames human drivers appear to prefer and which they don't, based on which such preferences best reproduce driving data.
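As a concrete (and entirely toy) illustration of what I mean by choosing the domain - the setup and names here are my own invention, not any real IRL library - the reward only ever sees a small programmer-chosen feature vector extracted from a frame, and the fit just finds weights under which the demonstrated choices look Boltzmann-rational:

```python
import numpy as np

# Toy sketch (my own invented setup, not any real IRL library): the
# programmer fixes the domain of the reward function up front - here, a
# 3-dimensional feature vector extracted from an 8x8 "frame" - and the
# fit finds weights under which demonstrated choices look Boltzmann-rational.

rng = np.random.default_rng(0)

def features(frame):
    """Programmer-chosen abstraction: the reward only ever sees this."""
    return np.array([frame.mean(), frame.std(), (frame > 0.8).mean()])

def make_situation(k=4):
    """Fake demonstration: the 'driver' picks their favorite of k frames."""
    frames = rng.random((k, 8, 8))
    true_w = np.array([1.0, -2.0, 3.0])          # hidden 'human' preferences
    utils = np.array([features(f) @ true_w for f in frames])
    return frames, int(np.argmax(utils))

data = [make_situation() for _ in range(200)]

# Fit reward weights by gradient ascent on the softmax-choice log-likelihood.
w = np.zeros(3)
for _ in range(500):
    grad = np.zeros(3)
    for frames, choice in data:
        phi = np.stack([features(f) for f in frames])   # (k, 3) feature matrix
        logits = phi @ w
        p = np.exp(logits - logits.max())
        p /= p.sum()
        grad += phi[choice] - p @ phi                   # observed minus expected features
    w += 0.05 * grad / len(data)

print("learned weight direction:", w / np.linalg.norm(w))
```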
You might imagine that, with unrealistic amounts of computational power, IRL might attempt to understand what's going on by modeling the underlying physics at the level of atoms, but that would be an astonishingly inefficient way to reproduce driving data even if it did work. IRL algorithms tend to have things like complexity penalties, which make it possible to select e.g. a "simplest" reward function out of the many reward functions that could reproduce the data. (This is a prior, but a pretty reasonable and justifiable one as far as I can tell.) Even with large amounts of computational power, I expect it would still not be worth using a substantially more complicated reward function than necessary.
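Here's a crude sketch of the complexity-penalty point - a BIC-style penalized score that I've made up for illustration, with placeholder numbers rather than real fits. A reward hypothesis over a handful of hand-chosen features can beat an atom-level (or raw-pixel) hypothesis even if the latter fits the demonstrations slightly better, because the penalty grows with the number of parameters:

```python
import numpy as np

# Crude sketch of a complexity penalty (BIC-style; my own illustration,
# with made-up numbers rather than real fits): score = fit - penalty,
# where the penalty grows with the number of reward parameters.

def penalized_score(log_likelihood, n_params, n_data):
    """Higher is better: reward reproducing the data, punish complexity."""
    return log_likelihood - 0.5 * n_params * np.log(n_data)

n_data = 200
# Hypothesis A: reward over 3 hand-chosen frame features.
# Hypothesis B: reward over all 64 raw pixels - fits slightly better
# (placeholder numbers), but pays a much larger complexity penalty.
score_simple  = penalized_score(log_likelihood=-110.0, n_params=3,  n_data=n_data)
score_complex = penalized_score(log_likelihood=-105.0, n_params=64, n_data=n_data)

print(score_simple > score_complex)  # True: the simpler reward function wins
```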