Summary: Some approaches to AI value alignment rely on transfer learning. I attempt to explain this idea more clearly.
Paul and I have written previously about how some approaches to AI value alignment rely on transfer learning more than others. To make this distinction clearer, I'll consider some different situations in which an AI system does or does not rely on transfer learning for correct operation.
In the following situations, we will consider some component. This component might be something like a prediction engine or a SAT solver. It is something that we don't expect to be able to manually specify, due to its complexity; instead, we will need an AI system to do some learning to figure out what the right component is.
Situation 1: No reliance on transfer learning
We don't rely on transfer learning for the reliability of some component (e.g. a prediction engine) if the effect of the component on the rest of the system is through behavior that is easy to test.
For concreteness, consider a prediction engine used to predict a human thinking for a short time period. This prediction engine is the component. We might ask the predictor about its distribution for a human's answer to a particular question, and then sample from this distribution in order to automatically answer the question. The behavior that is easy to test is the distribution for the human's answer to the question; it's easy to test because, if two predictors make significantly different predictions of a human, we can eventually tell which is making better predictions by actually asking the human. The effect of the component on the rest of the system is the sample from its distribution. The internals can only affect this sample through behavior that is easy to test.
To more precisely specify what it means for the behavior to be "testable", we could say that, if we have 10 candidate components, only one of which is good (with no guarantees about the other 9), then we can produce a component that is about as good as the good component (with some amount of overhead).
In the prediction case, we can use active learning or selective sampling to make predictions almost as good as a good predictor in the mixture (with overhead due to having to ask the actual human some questions). Even though the predictor will not be tested on all possible questions, it could potentially be tested on any of them, which is what allows the active learning to work.
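To illustrate, here is a minimal sketch (my own toy construction, not an algorithm from either of the posts referenced here) of how a pool of candidate predictors could be run this way. It assumes a hypothetical interface in which each candidate maps a question to a dictionary from answers to probabilities, and ask_human is the expensive ground-truth query; a real selective-sampling scheme would need a more careful query criterion, but the structure is the same: the human is consulted only when the candidates disagree enough for the choice to matter.

```python
import random

def run_with_selective_sampling(candidates, questions, ask_human, threshold=0.2):
    """Answer questions using a pool of candidate predictors, one of which is
    assumed to be good (no guarantees about the rest). Query the real human
    only when the candidates disagree enough to matter, and reweight candidates
    by the probability they assigned to the human's actual answer
    (a multiplicative-weights update)."""
    weights = [1.0 / len(candidates)] * len(candidates)
    for q in questions:
        preds = [c(q) for c in candidates]       # each: dict answer -> probability
        answers = set().union(*(p.keys() for p in preds))
        # crude disagreement measure: largest total-variation distance between
        # any two candidates' predicted distributions
        disagreement = max(
            0.5 * sum(abs(p.get(a, 0.0) - r.get(a, 0.0)) for a in answers)
            for p in preds for r in preds
        )
        if disagreement > threshold:
            truth = ask_human(q)                 # expensive ground-truth query
            weights = [w * p.get(truth, 1e-9) for w, p in zip(weights, preds)]
            total = sum(weights)
            weights = [w / total for w in weights]
        # the rest of the system only ever sees a sample from the weighted mixture
        mixture = {a: sum(w * p.get(a, 0.0) for w, p in zip(weights, preds))
                   for a in answers}
        yield random.choices(list(mixture), weights=list(mixture.values()))[0]
```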
As another example, consider a SAT solver as the component. We will give the component a SAT problem; the behavior that is easy to test is the SAT solver's solution to the problem (it's easy to test that this solution actually solves the original problem, and we can give the SAT solver a score based on how many problems it solves). The effect of the component on the rest of the system is whatever the system does with this solution. Note that if the system is using the answer to the problem in a way other than just checking if the answer exists, then it's possible that just checking that the answer solves the problem won't be a sufficient test for good performance; perhaps some solutions cause the rest of the system to behave badly.
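For instance, checking a claimed satisfying assignment is cheap even when finding one is hard, so the easy-to-test behavior here can be scored directly. A minimal sketch (clauses are DIMACS-style lists of signed integers, and solver is a hypothetical black-box component that returns a variable assignment or None):

```python
def satisfies(cnf, assignment):
    """Check that `assignment` (variable index -> bool) satisfies a CNF formula,
    given as a list of clauses, each a list of signed integers (3 means x3,
    -3 means "not x3"). The check is cheap no matter what goes on inside the
    solver that produced the assignment."""
    return all(
        any(assignment.get(abs(lit), False) == (lit > 0) for lit in clause)
        for clause in cnf
    )

def score_solver(solver, problems):
    """Score a black-box candidate solver by how many problems it returns
    verified solutions for; junk output simply earns no points."""
    score = 0
    for cnf in problems:
        assignment = solver(cnf)      # hypothetical interface: dict or None
        if assignment is not None and satisfies(cnf, assignment):
            score += 1
    return score
```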
Approval-directed agents may have important behavior that is difficult to test, due to the informed oversight problem, but if the overseer is sufficiently smart and informed, then all of the approval-directed agent's behavior can be "easily" tested.
Situation 2: Reliance on transfer learning
Here, the component has some behavior that affects the rest of the system and cannot be easily tested. However, this behavior is analogous to behavior that can be easily tested.
Some examples:
A prediction engine is given a question and a time limit, and asked to predict how a human would answer the question in that much time. It would be too expensive to have an actual human think for 2 weeks, so the predictor can only be tested using reflection periods less than 2 weeks. However, the rest of the system will use the prediction engine's predictions for how a human would answer a question after 10 weeks. See Paul's post on this (a sketch of this setup follows these examples).
A logical uncertainty engine is given a logically specified number and asked to predict it. It can be tested on simple problems, where we can actually compute the number. However, we will use the engine's predictions for predicting hard-to-compute numbers. Paul has also written a post on this.
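As a concrete, purely illustrative sketch of the first example, assume a hypothetical predictor with a fit/predict interface and an expensive consult_human(question, hours) oracle that we can only afford to run for short durations. Every training label comes from the testable regime, and every deployed query falls outside it:

```python
MAX_TESTABLE_HOURS = 24 * 14   # reflection periods we can actually afford to run
DEPLOY_HOURS = 24 * 70         # the 10-week questions the system will actually rely on

def train_short_query_long(predictor, question_pool, consult_human):
    """Train only on (question, time limit) pairs we can actually label, then
    use the predictor far outside that range. Every deployed prediction relies
    on transfer: there is zero training data with time limits this long."""
    labeled = [
        (q, hours, consult_human(q, hours))    # expensive ground-truth labels
        for q, hours in question_pool
        if hours < MAX_TESTABLE_HOURS          # the only regime we can test in
    ]
    predictor.fit(labeled)
    # deployment: nothing returned here is ever checked against a real 10-week human
    return lambda question: predictor.predict(question, DEPLOY_HOURS)
```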
This situation doesn't seem totally hopeless; we would hope that the analogy between the easy-to-test behavior and hard-to-test behavior means that components that have good easy-to-test behavior will generalize to have good hard-to-test behavior. Still, I agree with Paul's pessimism about this kind of approach:
The key difficulty with this approach is that we can only train on sentences φ which are sufficiently simple that we can actually tell whether they are true or false.
In order to apply the learned model to complex sentences, we need to rely on a strong form of transfer learning. Namely, we need to take a model which has had zero training on sentences that are too-complex-to-evaluate, and trust it to perform well on such sentences. I am somewhat skeptical about expecting learning algorithms to reliably generalize to a new domain where it is impossible to even tell whether they are generalizing correctly.
I strongly prefer approaches that don't rely on transfer learning, but there's some chance that transfer learning will be necessary in practice (this doesn't seem clear yet).
Situation 3: Not even transfer learning can save you
This is like situation 2, except some of the component's behavior isn't even analogous to things we can test.
Suppose we are trying to maximize paperclips. To do this, we have an AI create a model of the world and try to find the "paperclip" concept in it. Call this concept the component. This concept might have some easily-testable effects on the system's behavior: for example, perhaps we can use the world model to predict what pictures of paperclips look like, and then see if these predictions match pictures of actual paperclips. However, the system's paperclip-maximizing behavior (e.g. its plans to manufacture paperclips using new technology) is extremely difficult to test even with a large amount of resources. This behavior is disanalogous to the behavior (predicting pictures of paperclips) that can be tested. Therefore, it will be difficult for us to know that a system maximizing the "paperclip" concept we selected is actually maximizing paperclips.
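A deterministic toy illustration of this failure mode (just made-up numbers, not a model of any real system): two candidate concepts agree on every situation we can test, yet their maximizers come apart on the novel situations we can't.

```python
# situations 0-3 are testable (we can check the model's pictures of them);
# situations 4-5 are novel (plans using new manufacturing technology)
intended  = [0.9, 0.2, 0.7, 0.1, 0.0, -5.0]   # how much we actually value each situation
candidate = [0.9, 0.2, 0.7, 0.1, 0.3,  9.0]   # a learned concept agreeing on everything testable

def best(concept, available):
    """A crude maximizer: pick the available situation the concept scores highest."""
    return max(available, key=lambda s: concept[s])

testable, everything = range(4), range(6)

# the candidate concept passes every check we can actually run...
assert candidate[:4] == intended[:4]
assert best(candidate, testable) == best(intended, testable) == 0
# ...but a maximizer of the candidate concept steers toward situation 5,
# which the intended concept scores as very bad
assert best(intended, everything) == 0
assert best(candidate, everything) == 5
```

Nothing in the testable checks distinguishes the two concepts, which is exactly the difficulty.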
The situation with value learning is similar: while a value learning model has some testable behavior (its predictions of humans), if we want to use the model to create a human value maximizer, then it's also important that the model learns the right values (not just that it makes correct predictions). Since human values are not analogous to human behavior, we cannot expect a model to infer values well because it predicts behavior well.
Overall, it seems bad to rely on components that have important behavior not analogous to easy-to-test behavior. We would hope to at least have some way of testing the component in principle, moving us to situation 2. Perhaps we aren't totally doomed in the case where we need to e.g. locate the right paperclip concept or the right human values without being able to test our guess for correctness, but it's definitely worse than the alternatives.
Conclusion
I hope this has made the distinction clearer between approaches to AI value alignment that rely on transfer learning and those that don't. Future work in AI value alignment should pay close attention to which important behaviors are easy to test, which are analogous to easy-to-test behavior, and which are disanalogous to easy-to-test behavior.