Summary: I propose a simple open question that is directly relevant to the feasibility of my recent AI control proposals.
I've recently written a lot about AI control schemes based on counterfactual oversight, which are directed by predicted human behavior. A good example is described in this post, by Nisan.
Even assuming that some scheme in this family would work with perfect predictions (which is itself an open question), we would need to study how well they cope with prediction errors. Even if we had the ideal prediction algorithm, there would still be errors (since in general a sequence of observations does not uniquely determine the next observation). I want to focus on understanding the robustness of the scheme to even a minimal set of prediction errors.
A simple though exotic class of errors occurs when a predictor is uncertain about whether its observations are coming from a world where everything is as it appears, or whether they are produced by a simulation. This is an especially troubling kind of error because it may actually be adversarial (if the simulation is controlled by some consequentialist process).
Even setting aside such exotic errors, trying to prove worst-case guarantees seems like a good idea.
The problem
There seem to be two qualitatively distinct reasons that prediction errors might cause big trouble:
1. The system may consistently make bad predictions in hypothetical situations that are never or only very rarely actually explored (so that the errors are never or only very rarely corrected).
2. Making a small number of bad predictions might have catastrophic consequences.
Both problems are potentially troubling. I think that problem #1 is much more important and also more fundamental, so I'll focus on that one. We can move on to problem #2 if we manage to fix problem #1.
Ideally we would elicit human feedback in the situations that are most informative. This could potentially address problem #1, by ensuring that we quickly explore situations where our system may be making serious or consistent errors. In the long run it might also address problem #2, if the same techniques could be used to synthesize potentially problematic queries before failures would be catastrophic.
The question is: how can we do this?
I describe the problem in more detail here. Summarizing for this audience: assume that we have very good predictors (say, given by reflective oracle machines that are analogous to Solomonoff induction). Each day we use these predictors to answer a large number of questions, but we only have time to get human feedback on a small number of questions. Is there some way that we can use the predictors' output to select the most important questions to get feedback on?
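As a rough illustration of what "using the predictors' output" might look like, the sketch below scores each candidate question by how much the predictor's leading hypotheses disagree about its answer, and sends the highest-scoring questions for human feedback. This is just one naive heuristic, not a proposed solution; the interface (hypotheses as weighted prediction functions returning answer distributions) and all names are assumptions made for the example.

```python
# Illustrative sketch only: pick feedback questions by how much the
# predictor's leading hypotheses disagree about them. Disagreement is a
# crude proxy for how informative human feedback on that question would be.

def disagreement_score(question, hypotheses):
    """Weighted pairwise disagreement between hypotheses on one question.

    `hypotheses` is a list of (posterior_weight, predict_fn) pairs, where
    predict_fn(question) returns a dict mapping answers to probabilities.
    """
    score = 0.0
    for i, (w_i, predict_i) in enumerate(hypotheses):
        for w_j, predict_j in hypotheses[i + 1:]:
            p, q = predict_i(question), predict_j(question)
            # Total variation distance between the two predicted answer distributions.
            answers = set(p) | set(q)
            tv = 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in answers)
            score += w_i * w_j * tv
    return score


def select_for_feedback(questions, hypotheses, budget):
    """Return the `budget` questions where feedback looks most informative."""
    ranked = sorted(questions, key=lambda q: disagreement_score(q, hypotheses), reverse=True)
    return ranked[:budget]
```

On the two-hypothesis example below, only the small number of strategically chosen situations where the simulation hypothesis deviates would receive a nonzero score under this heuristic; whether anything like it actually works is exactly the open question.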
For example, suppose that our predictor assigns a 50% probability to an adversarial simulation hypothesis and a 50% probability to the intended model of our local environment. The simulation hypothesis might produce disagreements in a very small number of strategically chosen situations. But the very first time that we actually provide feedback on one of these situations, it will allow the predictor to rule out the simulation hypothesis.
So our goal is to ensure that we provide feedback on one of these situations as quickly as possible, minimizing the number of decisions that are made before the unintended hypothesis is eliminated. Can we use the predictor to identify these situations?
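To make the arithmetic of this example concrete, here is a minimal sketch with hypothetical numbers: the two hypotheses each get prior weight 0.5, they disagree completely on one strategically chosen situation, and a single piece of human feedback on that situation drives the simulation hypothesis's posterior to zero.

```python
# Minimal worked example (hypothetical numbers, not part of the proposal):
# one piece of human feedback on a situation where the hypotheses disagree
# is enough to eliminate the adversarial simulation hypothesis.

def update(prior, likelihoods, observation):
    """Bayesian update of hypothesis weights given one observed answer."""
    unnormalized = {h: prior[h] * likelihoods[h].get(observation, 0.0) for h in prior}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}

prior = {"intended": 0.5, "simulation": 0.5}

# On the strategically chosen situation, the two hypotheses predict different answers.
likelihoods = {
    "intended":   {"A": 1.0, "B": 0.0},
    "simulation": {"A": 0.0, "B": 1.0},
}

# The human's actual answer is "A", so the simulation hypothesis is ruled out.
print(update(prior, likelihoods, "A"))  # {'intended': 1.0, 'simulation': 0.0}
```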
(I provide more formal details, and a harder follow-up question, in the post.)