Thank you both for the feedback; it is always useful. Yes, I realise this is a hard job with no likely consensus, but what would the alternative be?
At some stage we need the AI to understand human values so that it knows when it is being unfriendly. At the very least, if we have no measurable way of identifying friendliness, how will progress be tracked?
That question is basically the hard question at the root of the difficulty of friendly AI. Building an AI that optimizes, through its actions, to increase or decrease some value is comparatively easy, but working out how to map actions onto a scale that measures their results against human values is incredibly difficult. Determining and evaluating AI friendliness is a very hard problem, and you should consider reading more about the issue so that you don't come off as naive.
How will we know whether future AIs (or even existing planners) are making decisions that are bad for humans unless we spell out what we think is unfriendly?
At a machine level, the AI would be recursively minimising cost functions to produce the most effective plan of action for achieving its goal, but how will we know whether its decisions are going to cause harm?
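To illustrate the point about cost functions, here is a minimal Python sketch. The toy map, the locations, the harm values and the `harm_weight` parameter are all hypothetical, invented for illustration rather than taken from any real planner. A recursive search picks the cheapest route, and harm to a human only changes the chosen plan if it is written into the cost function.

```python
# Hypothetical toy map for a cleaning robot. Each edge is
# (next_location, distance, harm); the harm values are illustrative assumptions.
GRAPH = {
    "dock": [("hallway", 1.0, 0.0), ("garden", 2.0, 0.0)],
    "hallway": [("kitchen", 1.0, 10.0)],  # a person is standing in this doorway
    "garden": [("kitchen", 2.0, 0.0)],
    "kitchen": [],
}


def best_plan(location, goal, harm_weight, visited=frozenset()):
    """Recursively search for the cheapest route from location to goal.

    Returns (total_cost, path), or None if goal is unreachable. Each edge costs
    its distance plus harm_weight * harm, so harm only influences the plan
    when harm_weight is non-zero.
    """
    if location == goal:
        return 0.0, [goal]
    seen = visited | {location}
    best = None
    for nxt, distance, harm in GRAPH[location]:
        if nxt in seen:
            continue  # skip locations already on this path (avoids cycles)
        sub = best_plan(nxt, goal, harm_weight, seen)
        if sub is None:
            continue
        cost = distance + harm_weight * harm + sub[0]
        if best is None or cost < best[0]:
            best = (cost, [location] + sub[1])
    return best


# Harm left out of the cost function: the shortest route wins.
print(best_plan("dock", "kitchen", harm_weight=0.0))  # (2.0, ['dock', 'hallway', 'kitchen'])
# Harm written into the cost function: the planner detours around the person.
print(best_plan("dock", "kitchen", harm_weight=1.0))  # (4.0, ['dock', 'garden', 'kitchen'])
```

The planner is equally "effective" in both runs; the only difference is whether someone decided that pushing past a person counts as a cost, and how much.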
Is there a model or dataset that describes what is friendly to humans? For example, something like the following scales (higher numbers meaning greater risk):
Context
0 - running a simulation in a VM
2 - physical robot with vacuum attachment
9 - full control of a plane
Actions
0 - selecting a song to play
5 - deciding which section of floor to vacuum
99 - deciding who is an ‘enemy’
9999 - aiming a gun at an ‘enemy’
Impact
1 - poor song selected to play, human mildly annoyed
2 - ineffective use of resources (vacuuming the same floor section twice)
99 - killing a human
99999 - killing all humans
It may not be possible to get agreement from all countries, cultures and beliefs, but it is something we should discuss and try to reach some agreement on.
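As a rough illustration of how scales like these could be made machine-readable, here is a minimal Python sketch. The scores are copied from the lists above, but the shortened labels, the review threshold of 99 and the rule for combining the three scores (take the worst one) are assumptions for illustration only, not part of any agreed model.

```python
# Scores copied from the proposed Context / Actions / Impact scales;
# the dictionary keys are shortened, hypothetical labels.
CONTEXT = {"vm_simulation": 0, "vacuum_robot": 2, "plane_control": 9}
ACTIONS = {"select_song": 0, "choose_floor_section": 5, "identify_enemy": 99, "aim_gun": 9999}
IMPACT = {"poor_song": 1, "wasted_vacuuming": 2, "kill_human": 99, "kill_all_humans": 99999}

REVIEW_THRESHOLD = 99  # hypothetical cut-off, not part of the proposal


def needs_review(context, action, impact, threshold=REVIEW_THRESHOLD):
    """Assumed rule: flag a decision if any single score reaches the threshold."""
    worst = max(CONTEXT[context], ACTIONS[action], IMPACT[impact])
    return worst >= threshold


print(needs_review("vacuum_robot", "choose_floor_section", "wasted_vacuuming"))  # False
print(needs_review("plane_control", "identify_enemy", "kill_human"))  # True
```

Taking the maximum means one extreme score cannot be averaged away by low scores elsewhere; summing or multiplying the scores would be equally defensible choices, and deciding between them is part of the agreement that would need to be reached.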