I think values are confusing because they aren't a natural kind. The first decomposition that made sense was 2 axes: stated/revealed and local/global
stated local values are optimized for positional goods, stated global values are optimized for alliance building, revealed local are optimized for basic needs/risk avoidance, revealed global barely exist and when they do are semi-random based on mimesis and other weak signals (humans are not automatically strategic etc.)
Trying to build a coherent picture out of various outputs of 4 semi independent processes doesn't quite work. Even stating it this way reifies values too much. I think there are just local pattern recognizers/optimizers doing different things that we have globally applied this label of 'values' to because of their overlapping connotations in affordance space and because switching between different levels of abstraction is highly useful for calling people out in sophisticated hard to counter ways in monkey politics.
Also useful to think of local/global as Dyson's birds and frogs, or surveying vs navigation.
I'm unfamiliar with existing attempts at value decomposition if anyone knows of papers etc.
On predictions, humans treating themselves and others as agents seems to lead to a lot of problems. Could also deconstruct poor predictions based on which sub-system it runs into the limits of: availability, working memory, failure to propagate uncertainty, inconsistent time preferences...can we just invert the bullet points from superforecasting here?
Crossposted at the Intelligent agent foundation.
There have been various attempts to classify the problems in AI safety research. Our old Oracle paper that classified then-theoretical methods of control, to more recent classifications that grow out of modern more concrete problems.
These all serve their purpose, but I think a more enlightening classification of the AI safety problems is to look at what the issues we are trying to solve or avoid. And most of these issues are problems about humans.
Specifically, I feel AI safety issues can be classified as three human problems and one central AI issue. The human problems are:
And the central AI issue is:
Obviously if humans were agents and knew their own values and could predict whether a given AI would follow those values or not, there would be not problem. Conversely, if AIs were weak, then the human failings wouldn't matter so much.
The points about human values is relatively straightforward, but what's the problem with humans not being agents? Essentially, humans can be threatened, tricked, seduced, exhausted, drugged, modified, and so on, in order to act seemingly against our interests and values.
If humans were clearly defined agents, then what counts as a trick or a modification would be easy to define and exclude. But since this is not the case, we're reduced to trying to figure out the extent to which something like a heroin injection is a valid way to influence human preferences. This makes both humans susceptible to manipulation, and human values hard to define.
Finally, the issue of humans having poor predictions of AI is more general than it seems. If you want to ensure that an AI has the same behaviour in the testing and training environment, then you're essentially trying to guarantee that you can predict that the testing environment behaviour will be the same as the (presumably safe) training environment behaviour.
How to classify methods and problems
That's well and good, but how to various traditional AI methods or problems fit into this framework? This should give us an idea as to whether the framework is useful.
It seems to me that:
Putting this all in a table:
Further refinements of the framework
It seems to me that the third category - poor predictions - is the most likely to be expandable. For the moment, it just incorporates all our lack of understanding about how AIs would behave, but this might more useful to subdivide.