A putative new idea for AI control; index here.

The counterfactual approach to value learning might allow AIs to be given goals in natural language.

The basic idea is that when the AI is given a natural language goal like "increase human happiness" or "implement CEV", it is not to figure out for itself what these goals mean, but to follow whatever meaning a pure learning algorithm would establish for them.

This would be safer than a simple "figure out the utility you're currently maximising" approach, but it still has a few drawbacks. Firstly, the learning algorithm itself has to be effective: in particular, modifying human understanding of the words should be ruled out, and the learning process must not conclude that simpler interpretations are always better. Secondly, humans don't yet know what these words mean outside our usual comfort zone, so the "learning" task also involves the AI extrapolating beyond what we know.
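To make the intended division of labour concrete, here is a minimal Python sketch of the setup. Every name in it (interpreter, planner, interpret, best_action, untainted_human_data) is a placeholder rather than a real API, and the hard part, a learner that actually extrapolates word meanings correctly, is deliberately left abstract:

```python
import copy

class CounterfactualGoalAgent:
    """Toy sketch: the agent's utility is whatever a separate, pure learning
    process would conclude the natural-language goal means, evaluated on data
    the agent could not have influenced, rather than an interpretation the
    agent itself gets to optimise."""

    def __init__(self, goal_text, interpreter, planner):
        self.goal_text = goal_text      # e.g. "increase human happiness"
        self.interpreter = interpreter  # pure learning algorithm over human usage data
        self.planner = planner          # picks actions given a fixed utility function

    def act(self, observation, untainted_human_data):
        # Interpret the goal with a frozen copy of the learner, fed only data
        # the agent did not influence: the counterfactual interpretation the
        # agent is supposed to defer to.
        frozen = copy.deepcopy(self.interpreter)
        utility_fn = frozen.interpret(self.goal_text, untainted_human_data)
        # The planner optimises that fixed utility; it gains nothing by
        # steering humans (or the learner) toward an easier meaning.
        return self.planner.best_action(observation, utility_fn)
```

The frozen copy and the untainted data stand in for the first caveat above: the agent should gain nothing by changing what the words are taken to mean.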


Comments:

I believe this is the idea of "motivational scaffolding" described in Superintelligence. Make an AI that just learns a model of the world, including what words mean. Then you can describe its utility function in terms of that model - without having to define exactly what the words and concepts mean.

This is much easier said than done. It's "easy" to train an AI to learn a model of the world, but how exactly do you use that model to make a utility function?
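The naive version of that step, sketched below with purely hypothetical names, just hides the whole problem inside a single scoring call on the learned model, which is why it is easier said than done:

```python
def scaffolded_utility(world_model, goal_text):
    """Naive 'motivational scaffolding': score a state by asking the learned
    world model how well it matches the natural-language goal. All of the
    difficulty is hidden inside concept_score, a hypothetical method that
    nobody currently knows how to build."""
    def utility(state):
        return world_model.concept_score(goal_text, state)
    return utility
```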

It looks similar to CEV, but not extrapolated into the future; instead it is applied to a single person's desire in a known context. I think it is a good approach for making even simple AIs safe. If I ask my robot to take all the spheres out of the room, it will not cut off my head.

This is why people sometimes make comments like "goal functions can themselves be learning functions." The problem is that we don't know how to take natural language and unlabeled inputs and get any sort of reasonable utility function as an output.