if the only thing that ever sees the Oracle's output during a training episode is an automated system that computes the Oracle's reward/loss, and that system is secure because it's just computing a simple distance metric (comparing the Oracle's output to the training label), then reward hacking and self-confirming predictions can't happen.
This assumption seems large, and I don't understand why we can make it. The loss is computed against the output of a physical system in the real world which is not, in fact, causally independent of the oracle's predictions. It is not Platonically graded on how correct it is with respect to what we had in mind. If the proposal is "just make the grading system really hard to hack", that runs into the same problems as boxing - you're still running a system that is looking for ways to hurt you.
If the proposal is “just make the grading system really hard to hack”, that runs into the same problems as boxing—you’re still running a system that is looking for ways to hurt you.
Are you saying that even if we store the Oracle's output in some memory buffer with minimal processing, and then compute the loss against the independently generated training label using a simple distance metric, the Oracle could still hack the grading system to give itself a zero loss? Hmm, maybe, but doesn't the risk of that seem a lot smaller than if these precautions are not taken (i.e., if we're using reinforcement learning, or supervised learning (SL) but generating training labels after looking at the Oracle's output)? Or are these risks actually comparable in magnitude in your mind? Or are you saying that the risk is still unacceptably large and we should reduce it further if we can?
the Oracle could still hack the grading system to give itself a zero loss
Gradient descent won't optimize for this behavior though, it really seems like you want to study this under inner alignment. (It's hard for me to see how you can meaningfully consider the problems separately.)
Yes, if the oracle gives itself zero loss by hacking the grading system then it will stop being updated, but the same is true if the mesa-optimizer tampers with the outer training process in any other way, or just copies itself to a different substrate, or whatever.
Here's my understanding and elaboration of your first paragraph, to make sure I understand it correctly and to explain it to others who might not:
What we want the training process to produce is a mesa-optimizer that tries to minimize the actual distance between its output and the training label (i.e., the actual loss). We don't want a mesa-optimizer that tries to minimize the output of the physical grading system (i.e., the computed loss). The latter kind of model will hack the grading system if given a chance, while the former won't. However these two models would behave identically in any training episode where reward hacking doesn't occur, so we can't distinguish between them without using some kind of inner alignment technique (which might for example look at how the models work on the inside rather than just how they behave).
If we can solve this inner alignment problem then it fully addresses TurnTrout's (Alex Turner's) concern, because the system would no longer be "looking for ways to hurt you."
What we want the training process to produce is a mesa-optimizer that tries to minimize the actual distance between its output and the training label (i.e., the actual loss).
Hmm, actually this still doesn't fully address TurnTrout’s (Alex Turner’s) concern, because this mesa-optimizer could try to minimize the actual distance between its output and the training label by changing the training label (what that means depends on how the training label is defined within its utility function). To do that it would have to break out of the box that it's in, which may not be possible, but this is still a system that is “looking for ways to hurt you.” It seems that what we really want is a mesa-optimizer that tries to minimize the actual loss while pretending that it has no causal influence on the training label (even if it actually does because there's a way to break out of its box).
This seems like a harder inner alignment problem than I thought, because we have to make the training process converge upon a rather unnatural kind of agent. Is this still a feasible inner alignment problem to solve, and if not is there another way to get around this problem?
[EDIT (2019-11-09): I no longer think that the argument I made here—about a theoretical learning algorithm—seems to apply to common practical learning algorithms; see here (H/T Abram for showing me that my reasoning was wrong).]
Inner alignment - The ML training process may not produce a model that actually optimizes for what we intend for it to optimize for (namely minimizing loss for just the current episode, conditional on the current episode being selected as a training episode).
If the trained model tries to minimize loss in future episodes, it definitely seems dangerous, but I'm not sure that we should consider this an inner-alignment failure. In some sense we got the behavior that our episodic learning algorithm was optimizing for.
For example, consider the following episodic learning algorithm: At the end of each episode, if the model failed to achieve the episode's goal its network parameters are completely randomized (and if it achieves the goal, the model is unchanged). If we run this learning algorithm for an arbitrarily long time, we should expect to end up with a model that behaves in a way that results in achieving the goal in every future episode (if such a model exists).
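This algorithm can be simulated directly. In the toy sketch below (the three named policies are hypothetical stand-ins), only a policy that achieves the goal in every episode survives the randomize-on-failure rule, so the long-run survivor is effectively a cross-episode optimizer:

```python
import random

# Toy simulation of the randomize-on-failure learner described above.
# A "model" is one of three fixed policies (hypothetical): 'always'
# achieves the goal in every episode, 'even_only' only in even-numbered
# episodes, 'never' never does. On failure the model is re-randomized.

random.seed(0)
POLICIES = ["always", "even_only", "never"]

def achieves_goal(policy, episode):
    if policy == "always":
        return True
    if policy == "even_only":
        return episode % 2 == 0
    return False

policy = random.choice(POLICIES)
for episode in range(1000):
    if not achieves_goal(policy, episode):
        policy = random.choice(POLICIES)
print(policy)  # only a policy that succeeds in *every* episode persists
```

Note that the selection rule looks only at the current episode, yet the surviving policy is exactly the one whose behavior is good across all future episodes.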
Interesting... it seems that this doesn't necessarily happen if we use online gradient descent instead, because the loss gradient (computed for a single episode) ought to lead away from model parameters that would increase the loss for the current episode and reduce it for future episodes. Is that right, and if so, how can we think more generally about what kinds of learning algorithms will produce episodic optimizers vs cross-episodic optimizers?
Also, what name would you suggest for this problem, if not "inner alignment"? ("Inner alignment" actually seems fine to me, but maybe I can be persuaded that it should be called something else instead.)
I call this problem "non-myopia," which I think interestingly has both an outer alignment component and an inner alignment component:
even if your training process isn’t explicitly incentivizing non-myopia, it might be that non-myopic agents are simpler/more natural/easier to find/etc. such that your inductive biases still incentivize them.
Oh, so even online gradient descent could generate non-myopic agents with large (or non-negligible) probability because non-myopic agents could be local optima for "current episode performance" and their basins of attraction collectively could be large (or non-negligible) compared to the basins of attraction for myopic agents. So starting with random model parameters one might well end up at a non-myopic agent through online gradient descent. Is this an example of what you mean?
Thinking about this more, this doesn't actually seem very likely for OGD since there are likely to be model parameters controlling how farsighted the agent is (e.g., its discount rate or planning horizon) so it seems like non-myopic agents are not local optima and OGD would keep going downhill (to more and more myopic agents) until it gets to a fully myopic agent. Does this seem right to you?
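A toy version of this argument, with a hypothetical one-parameter model: if a single parameter theta measures how much current-episode loss the agent sacrifices to benefit future episodes, the per-episode gradient always points toward smaller theta, so online gradient descent walks to the fully myopic agent even though total cross-episode loss would be minimized at theta = 1.

```python
# Hypothetical one-parameter model: theta in [0, 1] is the degree of
# farsightedness. Being farsighted costs something *this* episode, so
# the per-episode gradient is positive and OGD drives theta to 0
# (full myopia), regardless of what theta would do for future episodes.

def current_episode_loss(theta):
    return theta  # farsightedness costs loss in the current episode

def grad(theta):
    return 1.0  # d(current-episode loss)/d(theta)

theta, lr = 0.9, 0.1
for _ in range(20):
    theta = max(0.0, theta - lr * grad(theta))
print(theta)  # driven to 0.0: OGD converges to the myopic agent
```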
Thinking about this more, this doesn't actually seem very likely for OGD since there are likely to be model parameters controlling how farsighted the agent is (e.g., its discount rate or planning horizon) so it seems like non-myopic agents are not local optima and OGD would keep going downhill (to more and more myopic agents) until it gets to a fully myopic agent. Does this seem right to you?
I don't think that's quite right. At least if you look at current RL, it relies on the existence of a strict episode boundary past which the agent isn't supposed to optimize at all. The discount factor is only per-step within an episode; there isn't any between-episode discount factor. Thus, if you think that simple agents are likely to care about things beyond just the episode that they're given, then you get non-myopia. In particular, if you put an agent in an environment with a messy episode boundary (e.g. it's in the real world such that its actions in one episode have the ability to influence its actions in future episodes), I think the natural generalization for an agent in that situation is to keep using something like its discount factor past the artificial episode boundary created by the training process, which gives you non-myopia.
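The distinction can be shown with a small numerical sketch (the numbers are made up): the standard RL return gives zero weight to everything past the episode boundary, while the "natural generalization" described above keeps applying the same per-step discount factor across it.

```python
# Standard RL discounts per step *within* an episode and assigns zero
# weight past the episode boundary. The "natural generalization" keeps
# applying the same discount factor across the boundary instead.

gamma = 0.9
rewards = [1.0, 1.0, 1.0, 1.0]   # rewards in the current episode
future_rewards = [1.0, 1.0]      # rewards in the next episode

episodic_return = sum(gamma**t * r for t, r in enumerate(rewards))
non_myopic_return = episodic_return + sum(
    gamma**(len(rewards) + t) * r for t, r in enumerate(future_rewards)
)
print(episodic_return, non_myopic_return)
```

The two agents compute different returns for the same behavior, and only the training process's (artificial) episode boundary distinguishes them.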
Hmm, I guess I was mostly thinking about non-myopia in the context of using SL to train a Counterfactual Oracle, which wouldn't necessarily have steps or a non-zero discount factor within an episode. It seems like the easiest way for non-myopia to arise in this context is if the Oracle tries to optimize across episodes using a between-episode discount factor or just a fixed horizon. But as I argued, this doesn't seem to be a local minimum with regard to current-episode loss, so it seems like OGD wouldn't stop here but would keep optimizing the Oracle until it is no longer non-myopic.
I'm pretty confused about the context that you're talking about, but why not also have a zero per-step discount factor to try to rule out the scenario you're describing, in order to ensure myopia?
ETA: On the other hand, unless we have a general solution to inner alignment, there are so many different ways that inner alignment could fail to be achieved (see here for another example) that we should probably just try to solve inner alignment in general and not try to prevent specific failure modes like this.
Interesting... it seems that this doesn't necessarily happen if we use online gradient descent instead, because the loss gradient (computed for a single episode) ought to lead away from model parameters that would increase the loss for the current episode and reduce it for future episodes.
I agree, my reasoning above does not apply to gradient descent (I misunderstood this point before reading your comment).
I think it still applies to evolutionary algorithms (which might end up being relevant).
how can we think more generally about what kinds of learning algorithms will produce episodic optimizers vs cross-episodic optimizers?
Maybe learning algorithms that have the following property are more likely to yield models with "cross-episodic behavior":
During training, a parameter's value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future episodes.
Also, what name would you suggest for this problem, if not "inner alignment"?
Maybe "non-myopia" as Evan suggested.
For example, consider the following episodic learning algorithm
When I talk about an episodic learning algorithm, I usually mean one that actually optimizes performance within an episode (like most of the algorithms in common use today, e.g. empirical risk minimization treating episode initial conditions as fixed). The algorithm you described doesn't seem like an "episodic" learning algorithm, given that it optimizes total performance (and essentially ignores episode boundaries).
(This comment has been heavily edited after posting.)
What's an algorithm, or instructions for a human, for determining whether a learning algorithm is "episodic" or not? For example it wasn't obvious to me that Ofer's algorithm isn't episodic and I had to think for a while (mentally simulate his algorithm) to see that what he said is correct. Is there a shortcut to figuring out whether a learning algorithm is episodic without having to run or simulate the algorithm? You mention "ignores episode boundaries" but I don't see how to tell that Ofer's algorithm ignores episode boundaries since it seems to be just looking at the current episode's performance when making a decision.
How do you even tell that an algorithm is optimizing something?
In most cases we have some argument that an algorithm is optimizing the episodic reward, and it just comes down to the details of that argument.
If you are concerned with optimization that isn't necessarily intended and wondering how to more effectively look out for it, it seems like you should ask "would a policy that has property P be more likely to be produced under this algorithm?" For P="takes actions that lead to high rewards in future episodes" the answer is clearly yes, since any policy that persists for a long time necessarily has property P (though of course it's unclear if the algorithm works at all). For normal RL algorithms there's not any obvious mechanism by which this would happen. It's not obvious that it doesn't, until you prove that these algorithms converge to optimizing per-episode rewards. I don't see any mechanical way to test that (just like I don't see any mechanical way to test almost any property that we talk about in almost any argument about anything).
It’s not obvious that it doesn’t, until you prove that these algorithms converge to optimizing per-episode rewards.
So when you wrote "When I talk about an episodic learning algorithm, I usually mean one that actually optimizes performance within an episode (like most of the algorithms in common use today, e.g. empirical risk minimization treating episode initial conditions as fixed)." earlier, you had in mind that most of the algorithms in common use today have already been proven to converge to optimizing per-episode rewards? If so, I didn't know that background fact and misinterpreted you as a result. Can you or someone else please explicitly confirm or disconfirm this for me?
Yes, most of the algorithms in use today are known to converge or roughly converge to optimizing per-episode rewards. In most cases it's relatively clear that there is no optimization across episode boundaries (by the outer optimizer).
Reward Takeover
There's a similar issue with less extreme requirements:
(Note that what Stuart Armstrong calls "erasure" just means that the current episode has been selected as a training episode.)
Imagine there's a circumstance in which the variable you want to predict can be affected by predictions. Fortunately, you were smart enough to use a counterfactual oracle. Unfortunately, you weren't the only person who had this idea. Absent coordination to use the same RNG (in the same way), the oracles don't learn only from episodes they can't influence, and so don't avoid making "manipulative predictions"; instead, because even when one oracle's output is erased the other oracles' outputs aren't, they learn from each other and eventually make manipulative predictions.
I vaguely agree with this concern but would like a clearer understanding of it. Can you think of a specific example of how this problem can happen?
I don't have much in the way of a model of "manipulative predictions" - they've been mentioned before as a motivation for counterfactual oracles.
I think the original example was: there's this one oracle that everyone has access to and believes, and it says "company X's stock is gonna go way down by the end of today," and because everyone believes it, it happens.
In a similar fashion, I can imagine multiple people/groups trying to independently create (their own) "oracles" for predicting the (stock) market.
I think the following is potentially another remaining safety problem:
[EDIT: actually it's an inner alignment problem, using the definition here]
[EDIT2: i.e. using the following definitions from the above link:
]
Assuming the oracle cares only about minimizing the loss in the current episode—as defined by a given loss function—it might act in a way that will cause the invocation of many "luckier" copies of itself (ones that, with very high probability, output a value that gets the minimal loss, e.g. by "magically" finding that value stored somewhere in the model, or by running on very reliable hardware). In this scenario, the oracle does not intrinsically care about the other copies of itself; it just wants to maximize the probability that the current execution is one of those "luckier" copies.
Paul Christiano does have a blog post titled Counterfactual oversight vs. training data, which talks about the same thing as this post except that he uses the term "counterfactual oversight", which is just Counterfactual Oracles applied to human imitation (which he proposes to use to "oversee" some larger AI system).
I am having trouble parsing/understanding this part.
"Counterfactual oversight" isn't really explained in the post I linked to, but rather in the first post that post links to, titled Human-in-the-counterfactual-loop. (The link text is "counterfactual human oversight". Yeah having all these different names is confusing, but these three phrases are referring to the same thing.) The key part of that post is this:
Human-in-the-counterfactual-loop. Each time the system wants to act, it consults a human with a very small probability. The system does what it thinks a human would have told it to do if the human had been consulted.
So the "what it thinks a human would have told it to do if the human had been consulted" is the human imitation part, because it's predicting what a human would do, which is the same as imitating a human. And then the oversight part is that the human imitation is telling the system what to do. (My "oversee" is referring to this "oversight".)
I hope that clears things up?
Thanks! I think I understand this now.
I will say some things that occurred to me while thinking more about this, and hope that someone will correct me if I get something wrong.
Most people here probably already understand this by now, so this is more to prevent new people from getting confused about the point of Counterfactual Oracles (in the ML setting) because there's not a top-level post that explains it clearly at a conceptual level. Paul Christiano does have a blog post titled Counterfactual oversight vs. training data, which talks about the same thing as this post except that he uses the term "counterfactual oversight", which is just Counterfactual Oracles applied to human imitation (which he proposes to use to "oversee" some larger AI system). But the fact that he doesn't mention "Counterfactual Oracle" makes it hard for people to find that post or see the connection between it and Counterfactual Oracles. And as long as I'm writing a new top-level post, I might as well try to explain it in my own words. The second part of this post lists some remaining problems with oracles/predictors that are not solved by Counterfactual Oracles.
Without further ado, I think when the Counterfactual Oracle is translated into the ML setting, it has three essential characteristics:
Remaining Safety Problems
Counterfactual Oracles solve (or are a proposal to solve) some safety problems associated with predictors/oracles, but others remain. Besides the distributional shift problem mentioned above, here are a few more that come to mind. Note that these are not problems specific to Counterfactual Oracles; still, Stuart Armstrong's Safe Uses of AI Oracles seems wrong or misleading when it says "This paper presented two Oracle designs which are both safe and useful."
(I just made up the names for #2 and #3 so please feel free to suggest improvements for them.)