Proper value learning through indifference

16 Stuart_Armstrong 19 June 2014 09:39AM

A putative new idea for AI control; index here.

Many designs for creating AGIs (such as OpenCog) rely on the AGI deducing moral values as it develops. This is a form of value loading (or value learning), in which the AGI updates its values through various methods, generally including feedback from trusted human sources. This is closely analogous to how human infants (approximately) absorb the values of their society.

The great challenge of this approach is that it relies on an AGI that already has an interim system of values being able and willing to update that system correctly. Generally speaking, humans do not update their values easily, and we would want our AGIs to be similar: values that are too unstable aren't values at all.

So the aim is to clearly separate the conditions under which the AGI should keep its values stable from the conditions under which they should be allowed to vary. This will generally be done by specifying criteria for the variation ("only when talking with Mr and Mrs Programmer"). But, as always with AGIs, unless we program those criteria perfectly (hint: we won't), the AGI will be motivated to interpret them differently from how we would expect. It will, as a natural consequence of its programming, attempt to manipulate the value-updating rules according to its current values.

How could it do that? A very powerful AGI could pull the time-honoured trick of "taking control of your reward channel", either by threatening humans into giving it the moral answers it wants, or by replacing humans with "humans" (constructs that pass the programmed requirements of being human, according to the AGI's programming, but aren't actually human in practice) willing to give it those answers. A weaker AGI could instead use social manipulation and leading questions to arrive at the morality it desires. Even more subtly, it could tweak its internal architecture and updating process so that it updates values in its preferred direction (even something as simple as choosing the order in which to process evidence). This will be hard to detect, as a smart AGI may have a much clearer picture of how its updating process will play out in practice than its programmers do.

The problems with value loading have been cast as the various "Cake or Death" problems. We have some idea of the criteria needed for safe value loading, but as yet no candidate system that meets them. This post will attempt to construct one.


Gains from trade: Slug versus Galaxy - how much would I give up to control you?

33 Stuart_Armstrong 23 July 2013 07:06PM

Edit: Moved to main at ThrustVectoring's suggestion.

A suggestion as to how to split the gains from trade in some situations.

The problem of Power

A year or so ago, people at the FHI embarked on a grand project: to find out whether there was a single way of resolving negotiations, or a single way of merging competing moral theories. The project made a lot of progress in establishing how hard this was, but very little towards solving it. It seemed evident that the correct solution was to weigh the different utility functions and then have everyone maximise the weighted sum, but every way of weighting had its problems (the weighting with the best formal properties was a rather silly one: the "min-max" normalisation, which rescales each utility function so that its maximal attainable value is 1 and its minimal is 0).
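A minimal sketch of that min-max weighting, where the agents, options, and utility numbers are illustrative inventions of mine, not anything from the FHI project:

```python
def min_max_normalise(utility, options):
    """Rescale a utility function so its best attainable option scores 1
    and its worst scores 0 (the 'min-max' weighting described above)."""
    values = [utility(o) for o in options]
    lo, hi = min(values), max(values)
    return lambda o: (utility(o) - lo) / (hi - lo)

def best_compromise(utilities, options):
    """Pick the option maximising the unweighted sum of normalised utilities."""
    normalised = [min_max_normalise(u, options) for u in utilities]
    return max(options, key=lambda o: sum(n(o) for n in normalised))

# Two hypothetical agents choosing among three options:
options = ["A", "B", "C"]
alice = {"A": 10.0, "B": 6.0, "C": 0.0}.get
bob = {"A": 0.0, "B": 3.0, "C": 5.0}.get
print(best_compromise([alice, bob], options))  # "B", the compromise option
```

After normalisation each agent's favourite scores 1 and least favourite 0, so neither the raw scale of the utilities nor their offset matters; here the compromise option "B" wins because it scores reasonably for both agents.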

One thing that we didn't get close to addressing is the concept of power. If two partners in the negotiation have very different levels of power, then abstractly comparing their utilities seems the wrong solution (more to the point: it wouldn't be accepted by the powerful party).

The New Republic spans the Galaxy, with Jedi knights, battle fleets, armies, general coolness, and the manufacturing and human resources of countless systems at its command. The dull slug, ARthUrpHilIpDenu, moves very slowly around a plant, and possibly owns one leaf (or not - he can't produce the paperwork). Both these entities have preferences, but if they meet up, and their utilities are normalised abstractly, then ARthUrpHilIpDenu's preferences will weigh in far too much: a sizeable fraction of the galaxy's production will go towards satisfying the slug. Even if you think this is "fair", consider that the New Republic is the merging of countless individual preferences, so it doesn't make any sense that the two utilities get weighted equally.
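To make the numbers concrete (this toy model is my own illustration, not from the post): suppose there is a single divisible resource, the New Republic's utility is linear in the fraction it keeps, and the slug's utility has diminishing returns. Min-max normalising both and maximising the sum then hands a sizeable fraction of the resource to the slug:

```python
# Illustrative toy model: a fraction x of one divisible resource goes
# to the slug, the rest to the New Republic.
def republic_utility(x):
    return 1.0 - x          # linear in the resources it keeps

def slug_utility(x):
    return x ** 0.5         # diminishing returns for the slug

# Both utilities already range over [0, 1] on these options, so min-max
# normalisation leaves them unchanged; just maximise their sum.
options = [i / 1000 for i in range(1001)]
best = max(options, key=lambda x: republic_utility(x) + slug_utility(x))
print(best)  # 0.25: a quarter of the resource goes to the slug
```

The abstract normalisation treats both parties' full preference ranges as equally important, so the slug, with its steep early gains, captures a quarter of the resource, regardless of the fact that the Republic's utility aggregates countless individuals' preferences.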
