The challenges of bringing up AIs

Stuart_Armstrong

At the current AGI-12 conference, some designers have advocated keeping AGIs safe by bringing them up in human environments, providing them with interactions and feedback much as we do when raising human children. Obviously that approach would fail for a fully smart AGI with its own values - it would pretend to follow our values for as long as it needed to, and then defect. However, some people are confident that if we started with a limited, dumb AGI, we could successfully inculcate our values in this way (a more sophisticated position would be that though this method would likely fail, it's more likely to succeed than a top-down friendliness project!).

The major criticism of this approach is that it anthropomorphises the AGI - we have a theory of children's minds, constructed by evolution, culture, and our own child-rearing experience. We then project this onto the alien mind of the AGI, assuming that if the AGI presents behaviours similar to those of a well-behaved child, then it will become a moral AGI. The problem is that we don't know how alien the AGI's mind will be, or whether our reinforcement is actually reinforcing the right thing. Specifically, we need to find some way of distinguishing between:

1. An AGI being trained to be friendly.
2. An AGI being trained to lie and conceal.
3. An AGI that will behave completely differently once out of the training/testing/trust-building environment.
4. An AGI that forms the wrong categories and generalisations (what counts as "human" or "suffering", for instance), because it lacks human-shared implicit knowledge that was "too obvious" for us to even think of training it on.

"learning how to get humans to press your reward button" as "our niceness training is working" a la the original AIXI paper,

Quote needed, wasn't this contested by the author?

On 3: Knowing that the current execution path of the code seems to be working okay today is very different from strongly constraining future execution paths across hugely different contexts to have desirable properties; this requires abstract thinking on a much higher level than staring at what your AGI is doing right now. The tank-detector works so long as it's seeing pictures from the training sets in which all tanks are present on cloudy days, but fails when it wanders out into the real world, etc. "Reflective decision theory"-style FAI proposals try to address this by being able to state the desirable properties of the AI in an abstraction which can be checked against abstractions over code execution pathways and even over permitted future self-modifications, although the 'abstract desirable properties' are very hard (require very difficult and serious FAI efforts) to specify for reasons related to 4.
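The tank-detector failure can be sketched as a toy example. This is only an illustration under invented assumptions: the two binary features (`SHAPE`, `CLOUDY`), the hand-built datasets, and the one-feature learner `fit_stump` are all hypothetical and are not taken from any real tank-detection system. The learner simply picks whichever single feature best matches the label on the training set.

```python
# Toy illustration of a spurious training-set correlation (hypothetical data).
# Each row is (looks_like_tank, is_cloudy, label), where label 1 means "tank".
SHAPE, CLOUDY, LABEL = 0, 1, 2

# Training set: every tank photo was taken on a cloudy day, so CLOUDY
# predicts the label perfectly, while the SHAPE cue is right 8 times in 10.
train = (
    [(1, 1, 1)] * 4 + [(0, 1, 1)] * 1      # tanks, all cloudy
    + [(0, 0, 0)] * 4 + [(1, 0, 0)] * 1    # no tank, all sunny
)

# Deployment set: weather no longer tracks the presence of a tank.
deploy = (
    [(1, 1, 1)] * 4 + [(0, 1, 1)] * 1      # tanks, cloudy
    + [(1, 0, 1)] * 4 + [(0, 0, 1)] * 1    # tanks, sunny
    + [(0, 1, 0)] * 4 + [(1, 1, 0)] * 1    # no tank, cloudy
    + [(0, 0, 0)] * 4 + [(1, 0, 0)] * 1    # no tank, sunny
)


def accuracy(data, feature):
    """Fraction of rows where predicting label == feature value is correct."""
    return sum(1 for row in data if row[feature] == row[LABEL]) / len(data)


def fit_stump(data):
    """Pick the single feature that best predicts the label on `data`."""
    return max((SHAPE, CLOUDY), key=lambda f: accuracy(data, f))


chosen = fit_stump(train)
print("chosen feature:", "CLOUDY" if chosen == CLOUDY else "SHAPE")
print("training accuracy:", accuracy(train, chosen))
print("deployment accuracy:", accuracy(deploy, chosen))
```

In this sketch the learner looks flawless inside the training environment and falls to chance outside it, even though the less impressive `SHAPE` cue would have kept working. Watching behaviour on the training distribution gives no warning of this.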

Humans are able to learn basic human moral concepts with reasonable quantities of data. What is the relevant context change?

Eh? Do you want a more detailed answer than the question might suggest? I thought nigerweiss et al had good responses.

I also don't see any human culture getting Friendliness-through-AI-training right without doing something horrible elsewhere.
