Take the following AI, trained on videos of happy humans:
Since we know about AI wireheading, there are at least two ways the AI could interpret its reward function[1]: either we want it to make more happy humans (or more humans happy); call this $R_H$. Or we want it to make more videos of happy humans; call this $R_V$.
We would want the AI to learn to maximise $R_H$, of course. But even without that, if it generates $R_H$ as a candidate and applies a suitable diminishing return to all its candidate reward functions, then we will have a positive outcome: the AI may fill the universe with videos of happy humans, but it will also act to make us happy.
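To see how the diminishing-returns step does the work, here is a minimal sketch. The setup is an illustrative assumption, not part of the post: the AI splits a fixed budget between actually making humans happy (scored by $R_H$) and producing videos of happy humans (scored by $R_V$), each candidate reward grows linearly in the resources spent on it, and its credences are 0.1 on $R_H$ and 0.9 on $R_V$.

```python
# Minimal sketch: why a diminishing-returns transform over candidate reward
# functions makes the AI act on R_H (happy humans) even when it puts most
# of its weight on R_V (videos of happy humans).
# Assumptions (illustrative, not from the post): a fixed resource budget is
# split between the two goals, each candidate reward is linear in the
# resources devoted to it, and the credences are W_H = 0.1, W_V = 0.9.

import math

W_H, W_V = 0.1, 0.9  # illustrative credences in the two interpretations

def linear_score(x):
    """Aggregate reward with no diminishing returns; x is the fraction of
    resources spent on actually making humans happy."""
    return W_H * x + W_V * (1.0 - x)

def concave_score(x):
    """Aggregate reward after a square-root (diminishing-returns) transform
    of each candidate reward."""
    return W_H * math.sqrt(x) + W_V * math.sqrt(1.0 - x)

allocations = [i / 1000 for i in range(1001)]

best_linear = max(allocations, key=linear_score)
best_concave = max(allocations, key=concave_score)

print(f"linear aggregation: spend {best_linear:.3f} on making humans happy")
print(f"sqrt aggregation:   spend {best_concave:.3f} on making humans happy")
# Linear: the optimum is x = 0, i.e. everything goes into videos.
# Sqrt: the optimum is x = W_H**2 / (W_H**2 + W_V**2) ≈ 0.012 -- small,
# but nonzero, so the AI also acts to make us happy.
```

With a linear combination the optimum puts everything into videos; with the square-root transform the optimum is interior, so even a low-credence $R_H$ gets acted on.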
Thus solving value extrapolation will solve symbol grounding, at least in part.
This is a massive over-simplification of what would be needed to define "happy" or anything similar. ↩︎
That might work in a tiny world model with only two possible hypotheses. In a high-dimensional world model with exponentially many hypotheses, the weight on the happy-humans interpretation would be exponentially small.
There would, so long as the extra dimensions are irrelevant. If there are more relevant dimensions, then the total space grows much faster than the happy space. Even having lots of irrelevant dimensions can be risky, because it makes the training data sparser in the space being modelled, and thus makes superexponentially many more alternative hypotheses viable.
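To put a toy number on that counting argument (the binary-feature setup and the choice of $d = 30$ are illustrative assumptions, not anything from the discussion above): if the world model distinguishes $d$ binary features and every subset of them defines a reward hypothesis equally consistent with the training videos, then a roughly uniform prior gives

$$w(R_H) \approx 2^{-d}, \qquad \text{e.g. } 2^{-30} \approx 10^{-9} \text{ for } d = 30,$$

so adding relevant dimensions drives this weight down exponentially, while sparser training data leaves even more of these hypotheses unfalsified.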