I have previously discussed the constructive process of generating a reward function from the mess of human preferences.
But, let's be honest, that process was somewhat messy and ad hoc. It used things like "partial preferences", "identity preferences", "one step hypotheticals", "partial meta-preferences about the synthesis process", and so on.
The main reasons for this messiness are that human preferences and meta-preferences have so many features and requirements, and many of the intuitive terms and definitions in the area are ambiguous. So I had to chunk different aspects of human preferences into simple formal categories that were imperfect fits, and deal with each category in a somewhat different way.
Here I'll try and present a much simpler system that accomplishes the same goal. To do so, I'll rely on two main ideas: model splintering and delegation to future selves - via present selves.
Model splintering allows us to defer most of the issues with underdefined concepts. So we can use those concepts to define key parts of the preference-construction process - and punt to a future 'general solution to the model splintering problem' to make that rigorous.
Delegation to future selves allows us to encode a lot of the more complicated aspects of the process to our future selves, as long as we are happy with the properties of these future selves. Again, model splintering is essential to allowing this to function well, since the relevant 'properties' of our future selves are themselves underdefined.
Then constructing the human preference function becomes an issue of energy minimisation.
The reward function algorithm
The setup
Let hi be a human at time t=0. An AI will attempt to construct Ri, a suitable reward function for that human; this process will be finished at time t=τ (when we presume the human is still around). From that moment on, the AI will maximise Ri.
The building blocks: partial preferences
Defining partial preferences has been tricky. These are supposed to capture our internal mental models - what happens when we compare images of the world in our minds and select which one is superior.
In the feature language of model splintering, define partial preferences as follows. Let f be a feature, and F a set of 'background' features. Let ρF be the set of values that F can take - a possible 'range' of the features F.
Then define the real number
ri(f,F,x,y,ρF)
as representing how much the human hi prefers f=x over f=y, given the background assumption that the features F are in the range ρF.
This is a partial preference; it is only defined, pairwise, over a subset of worlds. That subset is defined to be
Si(f,F,x,y,ρF).
Here (w,w′)∈Si(f,F,x,y,ρF) if w is defined by f=x, w′ by f=y, and both have the same values of F, which are in the range ρF.
These partial preferences will only be defined for very few values of f, F, x, y, and ρF. In the cases where ri is not defined, Si(f,F,x,y,ρF) is defined to be the empty set.
Since writing out f,F,x,y,ρF every time is clunky, let Ω represent all that information.
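As a concrete illustration (a minimal sketch of my own, with worlds modelled as feature-to-value dictionaries and all names chosen by me rather than taken from the post), a partial preference Ω=(f,F,x,y,ρF) and the membership test for Si(Ω) could look like this:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet

@dataclass(frozen=True)
class PartialPreference:
    f: str                          # the feature being compared
    F: FrozenSet[str]               # the 'background' features
    x: Any                          # one value of f ...
    y: Any                          # ... compared against this other value of f
    rho_F: Callable[[Dict], bool]   # is a given assignment of F inside the range rho_F?
    r: float                        # the strength r_i(Omega) of preferring f=x over f=y

def in_S(omega: PartialPreference, w: Dict, w_prime: Dict) -> bool:
    """Does the ordered world pair (w, w') belong to S_i(Omega)?
    That is: w has f=x, w' has f=y, and both agree on F with values inside rho_F."""
    background = {g: w[g] for g in omega.F}
    return (
        w[omega.f] == omega.x
        and w_prime[omega.f] == omega.y
        and all(w_prime[g] == background[g] for g in omega.F)
        and omega.rho_F(background)
    )
```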
Energy minimisation
Let W be a set of worlds. For a pair of worlds (w,w′), let #i,w,w′ be the number of sets Si(Ω) that the pair (w,w′) belongs to.
Then Ri is defined so that it minimises the following sum:
$$\sum_{w,w'\in W}\;\sum_{\Omega\,:\,(w,w')\in S_i(\Omega)}\frac{\big(R_i(w)-R_i(w')-r_i(\Omega)\big)^2}{\#_{i,w,w'}}$$
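As a rough sketch of how this minimisation could be carried out over a finite set of worlds (my own illustration, reusing PartialPreference and in_S from above; the post itself does not specify an algorithm), the problem reduces to an ordinary weighted least-squares fit with one free variable Ri(w) per world:

```python
import numpy as np
from collections import Counter
from itertools import permutations

def fit_reward(worlds: list, prefs: list) -> np.ndarray:
    """Return R with R[k] an estimate of R_i(worlds[k]) minimising the energy."""
    n = len(worlds)
    covered, counts = [], Counter()
    # First pass: find which partial preferences cover which ordered world pairs,
    # and how many S_i(Omega) sets each pair belongs to (the count #_{i,w,w'}).
    for j, k in permutations(range(n), 2):
        for omega in prefs:
            if in_S(omega, worlds[j], worlds[k]):
                covered.append((j, k, omega))
                counts[(j, k)] += 1
    # Second pass: weighted least squares on R(w) - R(w') ≈ r_i(Omega),
    # each residual scaled by 1/sqrt(#_{i,w,w'}) to reproduce the energy formula.
    A = np.zeros((len(covered), n))
    b = np.zeros(len(covered))
    for row, (j, k, omega) in enumerate(covered):
        scale = 1.0 / np.sqrt(counts[(j, k)])
        A[row, j], A[row, k] = scale, -scale
        b[row] = scale * omega.r
    R, *_ = np.linalg.lstsq(A, b, rcond=None)
    return R
```

Since only the differences Ri(w)−Ri(w′) are constrained, the minimiser is defined up to an additive constant; the sketch simply takes the minimum-norm least-squares solution.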
Meta-preferences and the delegation process
In a previous post, I showed how consistency requirements for reward functions were derived from the preferences of the subject.
Here we will go further, and derive a lot more. The reward Ri will be a feature of the world in which the AI operates, after t=τ. Therefore it is perfectly consistent for hi to have partial preferences over Ri itself.
Since the human hi will still be around after t=τ, it can also express preferences over its own identity (this is where model splintering helps, since those identity preferences need not be fully defined in order to be used). The human will also have its own future preferences, which we will designate by R^t_i. This is a slight abuse of notation, since these future preferences need not be reward functions.
Then a human may have preferences over R^t_i, its future preferences, and over the extent to which the AI should respect those future preferences. This sets up an energy-minimisation requirement between R^t_i and Ri, which serves to address issues like the AI modifying the human's preferences to bring them closer to their ideal, or the AI continuing to satisfy - or not - a future human whose preferences have diverged from their current state.
So via this delegation-to-future-self-as-defined-by-the-current-agent, a lot of meta-preferences can be included in Ri without having to sort them into different types.
A small consistency example
Consider a situation where hi has inconsistent preferences, but a strong meta-preference for consistency (of whatever type).
One way for the AI to deal with this could be as follows. It generates Ri, a not-very-consistent reward function. However, it ensures that the future world is one in which hi does not encounter situations where the inconsistency becomes manifest. And it pushes hi to evolve towards R^t_i, which is consistent, but which agrees with Ri on the future world that the AI would create. So Ri can be an energy-minimising fit to inconsistent preferences, while the human still inhabits a world where its preferences are consistent and the AI acts to satisfy them.
Multiple humans
In the case of multiple humans, the AI would want to construct R, a global utility function for all humans. In that case, we would energy-minimise as above, but sum over all humans:
$$\sum_{h_i\in H}\;\sum_{w,w'\in W}\;\sum_{\Omega\,:\,(w,w')\in S_i(\Omega)}\frac{\big(R(w)-R(w')-r_i(\Omega)\big)^2}{\#_{i,w,w'}}.$$
Explicitly putting our (weighty) thumb on the scales
We may want to remove anti-altruistic preferences, enforce extra consistency, enforce a stronger sense of identity (or make the delegation process stronger), insist that the preferences of future agents be intrinsically respected, and so on.
The best way is to do this explicitly, by assigning weights νw to the worlds and νΩ to the partial preferences. Then the energy minimisation formula becomes:
$$\sum_{h_i\in H}\;\sum_{w,w'\in W}\;\sum_{\Omega\,:\,(w,w')\in S_i(\Omega)}\nu_w\,\nu_{w'}\,\nu_\Omega\,\frac{\big(R(w)-R(w')-r_i(\Omega)\big)^2}{\#_{i,w,w'}}.$$
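For concreteness, here is a hypothetical extension of the earlier sketch to the weighted, multi-human case; the function and parameter names are mine, and νΩ is passed in as a function on partial preferences:

```python
import numpy as np
from collections import Counter
from itertools import permutations

def fit_global_reward(worlds, humans, nu_w, nu_omega) -> np.ndarray:
    """humans: one list of PartialPreference per human h_i.
    nu_w[k]: weight of world k; nu_omega(omega): weight of a partial preference."""
    n = len(worlds)
    covered, counts = [], Counter()
    for i, prefs in enumerate(humans):
        for j, k in permutations(range(n), 2):
            for omega in prefs:
                if in_S(omega, worlds[j], worlds[k]):   # in_S from the earlier sketch
                    covered.append((i, j, k, omega))
                    counts[(i, j, k)] += 1              # the count #_{i,w,w'}
    A = np.zeros((len(covered), n))
    b = np.zeros(len(covered))
    for row, (i, j, k, omega) in enumerate(covered):
        # Each squared residual carries weight nu_w * nu_w' * nu_Omega / #_{i,w,w'}.
        scale = np.sqrt(nu_w[j] * nu_w[k] * nu_omega(omega) / counts[(i, j, k)])
        A[row, j], A[row, k] = scale, -scale
        b[row] = scale * omega.r
    R, *_ = np.linalg.lstsq(A, b, rcond=None)
    return R
```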
For the issue of future entities that don't yet exist, the simplest would probably be to delegate population ethics to the partial preferences of current humans (possibly weighted or with equality between entities required), and then use that population ethics to replace the "∑hi∈H" in the expression above.
Likely errors?
I like the elegance of the approach, and, as with all simple reward-maximising approaches, that makes me nervous that I've missed something big that will blow a hole in the whole process. Comments and criticisms are therefore much appreciated.
The key missing piece
Of course, this all assumes that we can solve the model splintering problem in a safe fashion. But that seems to be a requirement of any method of value learning, so it may not be as strong a requirement as it seems.