Counterfactual do-what-I-mean

Stuart_Armstrong

A putative new idea for AI control; index here.

The counterfactual approach to value learning could be used to possibly allow natural language goals for AIs.

The basic idea is that when the AI is given a natural language goal like "increase human happiness" or "implement CEV", it is not to figure out what these goals mean, but to follow what a pure learning algorithm would establish these goals as meaning.

This would be safer than a simple figure-out-the-utility-you're-currently-maximising approach. But it still doesn't solve a few drawbacks. Firstly, the learning algorithm has to be effective itself (in particular, modifying human understanding of the words should be ruled out, and the learning process must avoid concluding the simpler interpretations are always better). And secondly, humans' don't yet know what these words mean, outside our usual comfort zone, so the "learning" task also involves the AI extrapolating beyond what we know.

A putative new idea for AI control; index here.

The counterfactual approach to value learning could be used to possibly allow natural language goals for AIs.

This is why people sometimes make comments like "goal functions can themselves be learning functions." The problem is that we don't know how to take natural language and unlabeled inputs and get any sort of reasonable utility function as an output.

5

Counterfactual do-what-I-mean

5

5

5

Counterfactual do-what-I-mean

5

5