Cool, this seems similar to some stuff I've been thinking about lately. If I get the meaning of this right, then it's essentially uniform convergence guarantees in statistical learning theory applied to moral problems, right?
The main problem here is generalization to a different distribution over problems. I don't know what the best way to handle this is, but one idea is that you can also use similar techniques to detect differences between the training and test sets. Specifically, if you have a number of hypotheses for what the test distribution is, then given enough data you can use null hypothesis significance testing to determine whether the test distribution differs from the training distribution, and you could probably use this to bound the error.
Roughly, we want something like this:
I haven't worked out the math, but it seems like there should be something useful here.
Structural risk minimization, a concept from computational learning theory, is proposed as a satisficing framework.
The goal of this post is similar to Creating a Satisficer and Minimax as an approach to reduced-impact AI: constructing agents which are safe even with a utility function which isn't quite what is actually wanted. The approach is similar to Paul Christiano's Model-Free Decisions.
The philosophy behind this solution is to treat the problem as an instance of the optimizer's curse: when we can't quite measure what we want to optimize and are forced to optimize a proxy, then even if the proxy is strongly correlated with what we actually want, the tails come apart under strong optimization. This could lead to siren worlds (or more realistically, marketing worlds -- as described in that article).
This problem is faced by machine learning algorithms on a regular basis, where it is called overfitting. An algorithm can measure the fit of a model to existing data, but this is merely highly correlated with what's actually wanted: fit to future data. Over-optimizing this criterion will predictably lead to terrible performance, so much so that techniques like cutting optimization short ("early stopping") are actually better. Early stopping is just one example of regularization, the practice of preventing overfitting by purposefully constraining the optimization process.
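For readers who haven't seen it, early stopping looks roughly like the following sketch. The `model`, `update`, and `error` names are placeholders assumed purely for illustration, not any particular library's API.

```python
# Schematic early-stopping loop (placeholder interface, not a real library):
# keep optimizing fit to the training data, but stop once held-out error
# stops improving, even though training error would keep going down.
def train_with_early_stopping(model, train_data, validation_data,
                              max_steps=10_000, patience=10):
    best_val_error, steps_since_improvement = float("inf"), 0
    for _ in range(max_steps):
        model.update(train_data)                # one optimization step (placeholder method)
        val_error = model.error(validation_data)  # fit to held-out data (placeholder method)
        if val_error < best_val_error:
            best_val_error, steps_since_improvement = val_error, 0
        else:
            steps_since_improvement += 1
        if steps_since_improvement >= patience:
            break  # further optimization would likely just be overfitting
    return model
```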
In the machine learning case, regularization can usually be interpreted as an approximation to Bayesian methods: like a Bayesian prior, it usefully constrains beliefs.
Smith & Winkler, as cited in Luke's article mentioned earlier, seem to think that Bayesian modeling is also the solution to the general problem. Perhaps that's true, and explicit uncertainty about what to value is enough to avoid problems resulting from over-optimization. Or, perhaps it isn't. A useful approach which comes to mind is to try to extend the results from computational learning theory which deal with this problem, to see if they can say anything about the safety of improperly specified value functions.
Risk Framework
What follows is an admittedly somewhat unnatural attempt to fit AI risk into the structural risk framework. Hopefully it illustrates the idea well enough to indicate whether this might be a productive direction to research.
Consider the problem of constructing a value-learning agent which takes a series of $m$ moral classifications (thought-experiments with correct decisions supplied) and returns a policy $p$ (a function stipulating what action to take in different states of belief). For example, a moral problem might be represented as a sequence of sense-data ending in a choice between two motor commands. The job of the policy $p$ would be to look at the sequence and output a bit representing which of the two motor actions to take.

Define the error to be 1 for selecting an incorrect action, and 0 for the correct action. There is a background distribution $D$ over moral problems, with respect to which expectations are taken.

The empirical risk $\hat{R}(p)$ is the average error of a policy on the moral problems seen so far. The generalization risk $R(p)$ is the expected error across all moral problems (expectation taken over $D$).
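To make the definitions concrete, here is a minimal Python sketch; the names (`MoralProblem`, `Policy`, `empirical_risk`) are assumptions made purely for illustration. A moral problem is a sequence of sense-data, a policy maps a problem to one of two actions, and the empirical risk is the average 0-1 error on the labeled examples seen so far.

```python
# Illustrative sketch of the setup, not a canonical implementation.
from typing import Callable, Sequence, Tuple

MoralProblem = Tuple[int, ...]          # a sequence of sense-data
Policy = Callable[[MoralProblem], int]  # returns 0 or 1: which motor action to take

def empirical_risk(policy: Policy,
                   examples: Sequence[Tuple[MoralProblem, int]]) -> float:
    """Average 0-1 error of the policy on the m moral problems seen so far."""
    errors = sum(1 for problem, correct_action in examples
                 if policy(problem) != correct_action)
    return errors / len(examples)
```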
Theorem: Let $P$ be a finite set of $n$ policies. Then for any $\delta>0$, with probability at least $1-\delta$ we have:

$$\forall p \in P, \quad R(p) \le \hat{R}(p) + \sqrt{\frac{\log n + \log\frac{2}{\delta}}{2m}}$$
(See Theorem 2.2, Foundations of Machine Learning, Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.)
This bounds the generalization risk (with high probability) in terms of the number of policies we optimize over. The implication, then, is that risk can be reduced by purposefully limiting the options considered.
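As a sanity check on how the bound behaves, here is a small illustrative helper (names assumed for the sketch, not from any library) that computes the right-hand side of the theorem:

```python
import math

def generalization_bound(emp_risk: float, n_policies: int,
                         m_examples: int, delta: float) -> float:
    """Right-hand side of the theorem: an upper bound on R(p) holding with
    probability at least 1 - delta, for a policy from a finite class of
    n_policies evaluated on m_examples i.i.d. moral problems."""
    penalty = math.sqrt((math.log(n_policies) + math.log(2 / delta))
                        / (2 * m_examples))
    return emp_risk + penalty

# For instance, with 1000 candidate policies, 500 examples, and delta = 0.05,
# the penalty term is about 0.103, so a policy with empirical risk 0.1 has
# generalization risk below roughly 0.203 with probability at least 95%.
```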
Structural Risk Minimization
Empirical risk minimization (ERM) selects the $p \in P$ with the lowest empirical risk. We can use the formula from the previous section to bound the generalization error with high probability. Realistically, however, this does not give very good bounds when $P$ is large enough to be useful.
The structural risk minimization algorithm (SRM) considers hypothesis spaces of increasing size, but with a penalty for the increase in potential generalization error. In other words, we optimize over our upper bound on generalization risk. To spell that out: consider a sequence of policy sets $P_1 \subset P_2 \subset P_3 \subset \cdots$, where $P_i$ has size $n_i$. The score of each policy $p$ is the generalization bound for the smallest $P_i$ containing it:

$$\hat{R}(p) + \sqrt{\frac{\log n_i + \log\frac{2}{\delta}}{2m}}$$
We continue optimizing over larger $P_i$ until it is no longer possible for the bound to be improved.
We can imagine that the policies $p_1, p_2, p_3, \dots$ are computer programs in lexicographic order, and take $P_i = \{p_j \mid j \le i\}$. This is like a soft version of early stopping: rather than halting the search at some $n$, we make it harder and harder to choose policies at higher $n$.
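Under the same illustrative conventions as the earlier sketches, SRM might look roughly like the following; the function and argument names are assumptions, and the nested sets are assumed to be given as lists where each $P_i$ begins with the contents of $P_{i-1}$.

```python
import math

def structural_risk_minimization(nested_policy_sets, examples, delta):
    """Illustrative SRM sketch: score each policy by the bound for the
    smallest P_i containing it, and stop enlarging P_i once the complexity
    penalty alone exceeds the best bound found so far (since every later
    score is at least that penalty).

    nested_policy_sets: lists P_1, P_2, ..., each beginning with the
    contents of the previous one (assumed representation of nesting)."""
    m = len(examples)
    best_policy, best_bound = None, float("inf")
    scored = 0  # number of policies already scored in smaller sets
    for policy_set in nested_policy_sets:
        n_i = len(policy_set)
        penalty = math.sqrt((math.log(n_i) + math.log(2 / delta)) / (2 * m))
        if penalty >= best_bound:
            break  # even zero empirical risk could not improve the bound now
        for policy in policy_set[scored:]:  # only the newly added policies
            emp_risk = sum(1 for problem, correct in examples
                           if policy(problem) != correct) / m
            bound = emp_risk + penalty
            if bound < best_bound:
                best_policy, best_bound = policy, bound
        scored = n_i
    return best_policy, best_bound
```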
This gives us a way of thinking about satisficing FAI. Like a value learner, this system makes decisions based on limited knowledge of the utility function. However, it limits its own optimization power based on the amount of such knowledge.
Limitations
One unfortunate assumption is the fixed distribution $D$ over moral problems. This distribution defines both the training sample and the expected generalization, which means the risk bound rests on the assumption that the moral problems encountered after training are drawn from the same distribution as the training set, which is not a realistic assumption.
The framework as described also does not allow the agent to keep learning. It's fed a number of training examples, and then it selects a policy. Perhaps it could be given more examples while it is running, but even if that were considered safe, it's quite different from learning from its environment. Notice that this applies to all learning, not just value learning, unless the agent selects a policy sophisticated enough to learn directly from the environment, which would apparently require far too many training examples.