In a recent post, Wei Dai mentions a similar distinction (italics added by me):
Supervised training—This is safer than reinforcement learning because we don’t have to worry about reward hacking (i.e., reward gaming and reward tampering), and it eliminates the problem of self-confirming predictions (which can be seen as a form of reward hacking). In other words, if the only thing that ever sees the Oracle’s output during a training episode is an automated system that computes the Oracle’s reward/loss, and that system is secure because it’s just computing a simple distance metric (comparing the Oracle’s output to the training label), then reward hacking and self-confirming predictions can’t happen.
I think I've updated a bit from when I wrote this (due to this discussion). (ETA: I've now added a link from that paragraph to this comment.) Now I would say that the safety-relevant differences between SL and RL are:
Other algorithms are not really "choosing a fixed point somehow". They're typically failing to guarantee a fixed point. The mathematician hinted at this by describing how algorithms would not necessarily converge to a self-fulfilling prophecy; they could just as easily go in circles or wander around randomly forever.
This is something that John_Maxwell made made think about when discussing the Dualist Predict-o-Matic:
At this point, the Predict-O-Matic has stepped into a hall of mirrors. To predict the next prediction made by the Predict-O-Matic in its world model, the Predict-O-Matic needs to run an internal simulation of that Predict-O-Matic. But as it runs that simulation, it finds that simulation kicking off another Predict-O-Matic simulation in the simulated Predict-O-Matic's world model! Etc, etc.
It made me realize a few things
1. When I imagined an strategy identified fixed points/self-fulfilling prophecies, I was explicitly imagining the kind of strategy that first roughly enumerated a number of possibilities (without zooming in on impacts it may itself cause) and then exploring those possibilities by re-simulating them while conditioning on it giving them as a prediction and noticing that they have fixed point like properties. However, from a practical perspective, this doesn't seem very feasible considering the massive number of degrees of freedom that the Predict-O-Matic has.
Why did I think along these lines specifically? Well, I read this
The engineer thinks about it for a minute. "I'm not sure. Predict-O-Matic keeps an internal model which has probabilities of events. The answer to a question isn't really separate from the expected observation. So 'probability of observation depending on that prediction' would translate to 'probability of an event given that event', which just has to be one."
and interpreted it to mean that, since observations will depend on the prediction to at least some extent, the algorithm really should sanity-check its prediction to make sure it still happens when it is given as an output. Hence, I imagined a gradual bumbling into of fixed points as the algorithm would sanity check its prediction, modify it slightly to address the changes, and then repeat--falling into a basin of attraction (to be clear, I also interpreted the Predict-O-Matic to be myopic).
2. A prediction doesn't have to be completely self-fulfilling or completely independent of the observation. The iterative process I described above of sanity-checking a prediction, adjusting it, and iterating is unlikely to produce a prediction that guarantees an outcome but it will probably raise the chances of that outcome happening to some extent. In other words, self-fulfilling predictions themselves might be rare but predictions that slightly raise the chances of themselves happening when given as an output could be relatively normal.
3. As you note here, the actual behavior of the Predict-O-Matic is heavily determined by the details of how the optimizer tunes the strategies it considers over time. Whether it ultimately interacts with fixed points depends on the kind of strategies the optimizer gets to look at. Furthermore, things like selecting for maximally surprising self-fulfilling prophecies as I described could be strategically avoided by using different algorithms. Using a quantilizer (ie, building a bunch of strategies and randomly picking one of the ones within the top 5% historical performance) could yield multiple distinct strategies that each fall into distinct basins of attraction. This would cause any single strategy to suffer if it falls too deeply into a surprising self-fulfilling prediction because it won't be able to predict whether that prediction will actually be made. To carry the economics metaphors, you might try to shift an "assassination market"-like thing into a "Keynesian Beauty Contest"-like thing.
4. One of the other relevant things in the Parable was that accuracy in the context of the Predict-O-Matic is underspecified in a lot of ways that could be important. When your machine can try to predict any detail of the world, prediction error can vary greatly. If you ask about the stock market, it might be just one number. If you ask about a painting, it might be a lot more things. Since the Predict-O-Matic as described has natural language processing (presumably with the goal of providing a good interface for humans), it's prediction accuracy might even have to be calculated by weighting the inaccuracy of each specific aspect of the prediction by the anticipated importance of that aspect to the human interacting with it. Even if we try to keep things simple and ignore that aspect, there are a lot of ways that the Predict-O-Matic might output a prediction of the same events. Thus, a given strategy might strategically phrase the predictions in the ways most optimal for ensuring the prediction's accuracy (or strategically try to minimize prediction error by making the prediction as low impact or generic as possible).
5. A lot of this is obviously getting into the weeds of what really specific thing the Predict-O-Matic is trying to do and I'm not sure how helpful it is in general.
---------
Anyway, moving on....
But many (not all) unsupervised algorithms still have the critical features we're interested in! Predicting x without any context information y to help still involves (1) getting feedback on what we "should have" expected, and (2) updating to a configuration which would have more expected that. We simply can't expect the predictions to be as focused, given the absence of contextual information to help. But that just means it's a prediction task on which we tend to expect lower accuracy.
This is an important but non-obvious thing. Consider kernel density estimation with cross-validation. What you're actually doing here is taking the distibution x you've observed and then testing how a bunch of simple distributional models built on subsets of that observed data generalize to the rest of the set. While you're learning about a distribution in an unsupervised manor, you can also interpret it as supervised learning:
p(y = unobserved x | x = observed x)
---------
If you have an outer optimizer which is trying to maximize f(x,y) through x while being indifferent about y, it seems sensible to suppose that inner optimizers will want to change y to throw things off, particularly if they can get credit for then correcting x to be optimal for the new y. If so, then inner optimizers will generally be seeking to find y-values which make the current x a comparatively bad choice. So this argument does not establish an incentive to choose y which makes all choices of x poor.
This is a somewhat confusing paragraph for me because the idea of the outer optimizer trying to optimize a function of something that is doesn't care about seems strange. Based on previous discussions of partial agency, I'm assuming this means something like
-The outer optimizer rewards the inner optimizer whenever it changes x in a way that increases f
-The outer optimizer does not care if f decreases because the inner optimizer has changed y while holding x constant
In the case of the Predict-O-Matic, I just didn't imagine this happening because I figured the outer optimizer was just trying to myopically minimize the prediction error without adjusting for which of the inputs were modified. However, when this does happen, the results can be chaotic. After all, you can simply an x that optimized f when y=y1 and another x that optimized f when y=y2. Then simply move y back and forth without punishment and get a reward every time you "correct" x to the relevant optimum. This let's you milk the outer optimizer for infinite reward. For example:
f(x,y) = (1-2^(-yx))
If you start at y0=1, x0=10, you'll get diminishing returns if you increase x. But, if you change y from y0 to y=-1, f drops dramatically and you can collect all the reward you'd like by decreasing x to -10. Then simply switch y from -1 back to 1 and reverse the shift you just made. Infinite reward cycle. I think Yudkowsky said something about how utility functions are good because they order outcomes well and, if you don't have ordering, someone can exploit you by repeatedly cycling through a process like the one above.
Since the reward you can get out of a process like this is arbitrarily large and, in practical scenarios, the inner optimizer will only have a model of how its decisions will change the rewards it gets, the actual path traced out above can be highly depending on initial conditions. One algorithm might hold y constant and blindly try to increase x since it increases f, converging on a boring finite reward. Another might discover the loophole and start cycling the outer optimizer for reward but be scared to risk leaving the cycle (since who knows what the function looks like outside the two optima it found, or whether it will ever b e able to get back?). Another might explore and realize it shift y even more dramatically and get even more dramatic rewards faster (if it's trying to maximize total reward and realizes it only has a finite amount of time).
This kind of maximizing surprise/entropy is very different from the kind of maximization I envisioned. When I imagined an algorithm that tried to throw off the algorithms it was competing against, it was creating surprising outcomes but this surprise was fundamentally relative to the predictor. To the algorithm causing surprise, nothing was surprising. To the algorithms it was competing against, it was all surprising. To the people outside the Predict-O-Matic, who knows if it was surprising? This really depends on whether the inside Predict-o-Matic strategies tend to make predictions similar to the ones people would make. In short, you're considering objective increases in entropy relative to the target function. I'm considering subjective increases in entropy relative to people/strategies aside from the one chosen by the Predict-O-Matic.
--------------
The idea is that incentivizing agents to lower the error of your predictions (as in a prediction market) looks exactly like incentivizing them to "create" information (find ways of making the world more chaotic), and this is no coincidence. So perhaps there's a more general principle behind it, where trying to incentivize minimization of f(x,y) only through channel x (eg, only by improving predictions) results in an incentive to maximize f through y, under some additional assumptions. Maybe there is a connection to optimization duality in there.
No idea if its useful but there's gotta be something here! Per wikipedia:
According to George Dantzig, the duality theorem for linear optimization was conjectured by John von Neumann immediately after Dantzig presented the linear programming problem. Von Neumann noted that he was using information from his game theory, and conjectured that two person zero sum matrix game was equivalent to linear programming. Rigorous proofs were first published in 1948 by Albert W. Tucker and his group. (Dantzig's foreword to Nering and Tucker, 1993)
It's been a while since I took my linear optimization class but the primal and dual problems in linear algebra involve switching between a maximization and minimization problem and interchanging the constraints of a linear problem with the coefficients of the target function and vice-versa. For a given strategy in the Predict-O-Matic, you might try and imagine a strategy that tries to minimize prediction error while viewing the performances of its competitors as constraints. When you switch to the dual problem, you switch to minimizing the constraints and constraining on the minimization. This is hand-waivey of course; and I'm not sure it applies in the context of entropy caused by partial agency as you described it.
I'm going to be a bit more explicit about some ideas that appeared in The Parable of Predict-O-Matic. (If you don't want spoilers, read it first. Probably you should read it first anyway.)
[Note: while the ideas here are somewhat better than the ideas in the predict-o-matic story, they're equally rambling, without the crutch of the story to prop them up. As such, I expect readers to be less engaged. Unless you're especially interested in which character's remarks are true (or at least, which ones I stand by), this might be a post to skim; I don't think it has enough coherence that you need to read it start-to-finish.]
First, as I mentioned in Partial Agency, my main concern here isn't actually about building safe oracles or inner-aligned systems. My main concern is to understand what's going on. If we can build guaranteed-myopic systems, that's good for some purposes. If we can build guaranteed-non-myopic systems, that's good for other purposes. The story largely frames it as a back-and-forth about whether things will be OK / whether there will be terrible consequences; but my focus was on the more specific questions about the behavior of the system.
Second, I'm not trying to confidently stand behind any of the character's views on what will happen. The ending was partly intended to be "and no one got it right, because this stuff is very complicated". I'm very uncertain about all of this. Part of the reason why it was so much easier to write the post as a story was that I could have characters confidently explain views without worrying about adding all the relevant caveats.
Inductive Bias
Evan Hubinger pointed out to me that all the characters are talking about asymptotic performance, and ignoring inductive bias. Inner optimizers might emerge due to the inductive bias of the system. I agree; in my mind, the ending was a bit of a hat tip to this, although I hinted at gradient hacking rather than inductive bias in the actual text.
On the other hand, "inductive bias" is a complicated object when you're talking about a system which isn't 100% Bayesian.
Explicit Fixed-Point Selection
The very first conversation involved the intern arguing that there would be multiple valid fixed-points of prediction, and Predict-O-Matic would have to choose between them somehow.
Explicitly modeling fixed points and choosing between them is a feature of the logical induction algorithm. This feature allows us to select the best one according to some criterion, as is leveraged in When Wishful Thinking Works. As discussed later in the conversation with the mathematician, this is atypical of supervised learning algorithms. What logical induction does is very expensive: it solves a computationally difficult fixed-point finding problem (by searching exhaustively).
Other algorithms are not really "choosing a fixed point somehow". They're typically failing to guarantee a fixed point. The mathematician hinted at this by describing how algorithms would not necessarily converge to a self-fulfilling prophecy; they could just as easily go in circles or wander around randomly forever.
Think of it like fashion. Sometimes, putting a trend into common knowledge will lock it in; this was true about neck ties in business for a long time. In other instances, the popularity of a fashion trend will actually work against it, a fashion statement being ineffective if it's overdone.
So, keep in mind that different learning procedures will relate to this aspect of the problem in different ways.
Reward vs Prediction Error
The economist first compared the learning algorithm to decision markets, then later, decided prediction markets were a better analogy.
The mathematician contrasted the learning algorithm to reinforcement learning, pointing out that Predict-O-Matic always adjusted outputs to be more like historical observations, whereas reinforcement learning would more strategically optimize reward.
Both of these point at a distinction between learning general decision-making and something much narrower and much more epistemic in character. As I see it, the critical idea is that (1) the system gets information about what it should have output; (2) the learning update moves toward a modified system which would have output that. This is quite different from reinforcement learning.
In a recent post, Wei Dai mentions a similar distinction (italics added by me):
There are several things going on here, but I think Wei is trying to point at something similar to the distinction I'm thinking of. It's quite tempting to call it "supervised learning", because you get a signal telling you what you should have done. However, it's a bit fuzzy, because this also encompasses things normally called "unsupervised learning": the supervised/unsupervised distinction is often explained as modeling P(x|y) vs P(x). Wikipedia:
But many (not all) unsupervised algorithms still have the critical features we're interested in! Predicting x without any context information y to help still involves (1) getting feedback on what we "should have" expected, and (2) updating to a configuration which would have more expected that. We simply can't expect the predictions to be as focused, given the absence of contextual information to help. But that just means it's a prediction task on which we tend to expect lower accuracy.
I'm somewhat happy referring to this category as imitative learning. This includes supervised learning, unsupervised learning so long as it's generative (but not otherwise), and imitation learning (a paradigm which achieves similar ends as inverse reinforcement learning). Homever, the terminological overlap with 'imitation learning' is rather terrible, so I'm open to other suggestions.
It seems to me that this is a critical distinction for the myopia discussion. I hope to say more about it in future posts.
Maximizing Entropy?
The discussion of prediction markets toward the end was rather loose, in that the economist didn't deal with a lot of the other points which had been made throughout, and just threw a new model out there.
Isnasene interpreted the first point by imagining that the mechanism of manipulation is still through selection of fixed points:
This is compatible with the assumption of myopia; we might imagine that the system still can't manipulate events through actual bad predictions, because those strategies will be undercut. Therefore, the manipulation is restricted to selecting fixed-points which are surprising.
However, there are three problems with this:
So, it seems the actual situation is more complicated, and I'm not yet sure how to think about this.
'Local Search'; selection vs control
I used the term 'local search' to describe the application of gradient-descent-like updates to reduce prediction error. I have some conceptual/terminological issues with this.
Calling this 'local search' invokes the mental image of a well-defined gradient landscape which we are taking steps on, to further optimize some function. But this is the wrong mental image. The mental image is one of selection, when we're in a control setting (in my terminology). We are not making an iid assumption. We are not getting samples from a stationary but stochastic loss function, as in stochastic gradient descent.
If 'local search' were an appropriate descriptor for gradient-descent here, would it also be an appropriate descriptor for Bayesian updates? There's a tendency to think of Bayesian learning as trying to find one good hypothesis by tracking how well all of them do (which sounds like a global search), but we needn't think of it this way. The "right answer" can be a mixture over hypotheses. We can think of a Bayesian update as incrementally improving our mixture. But thinking of Bayesian updates as local search seems wrong. (So does thinking of them as global search.)
This is online learning. A gradient-descent step represents a prediction that the future will be like the past in some relevant sense, in spite of potential non-stationarity. It is not a guaranteed improvement, even in expectation -- as it would be in offline stochastic gradient descent with sufficiently small step size.
Moreover, step size becomes a more significant problem. In offline gradient descent, selecting too small a step size only means that you have to make many more steps to get where you're going. It's "just a matter of computing power". In online learning, it's a more serious problem; we want to make the appropriate-sized update to new data.
I realize there are more ways of dealing with this than tuning step size; we don't necessarily update to data by making a single gradient step. But there are problems of principal here.
What's gradient descent without a fitness landscape?
Simply put, gradient descent is a search concept, not a learning concept. I want to be able to think of it more directly as a learning concept. I want to be able to think of it as an "update", and use terminology which points out the similarity to Bayesian updates.
The Duality Remark
Vanessa asked about this passage:
I responded:
If you have an outer optimizer which is trying to maximize f(x,y) through x while being indifferent about y, it seems sensible to suppose that inner optimizers will want to change y to throw things off, particularly if they can get credit for then correcting x to be optimal for the new y. If so, then inner optimizers will generally be seeking to find y-values which make the current x a comparatively bad choice. So this argument does not establish an incentive to choose y which makes all choices of x poor.
In a log-loss setting, this would translate to an incentive to make observations surprising (for the current expectations), rather than a direct incentive to make outcomes maximum-entropy. However, iteration of this would push toward maximum entropy. Or, logical-induction-style fixed-point selection could push directly to maximum entropy.
This would be a nice example of partial agency. The system is strategically influencing x and y so as to maximize f through channel x, while minimizing through channel y. What does this mean? This does not correspond to a coherent objective function at all! The system is 'learning a game-theoretic equilibrium' -- which is to say, it's learning to fight with itself, rather than optimize.
There are two different ways we can think about this. One way is to say there's an inner alignment problem here: the system learns to do something which doesn't fit any objective, so it's sort of trivially misaligned with whatever the outer objective was supposed to be. But what if we wanted this? We can think of games as a kind of generalized objective, legitimizing this behavior.
To make things even more confusing, if the only channel by which Predict-O-Matic can influence the world is via the predictions which get output, then... doesn't x=y? x represents the 'legitimate' channel whereby predictions get combined with (fixed) observations to yield a score. y represents the 'manipulative' channel, where predictions can influence the world and thus modify observations. But the two causal pathways have one bottleneck which the system has to act through, namely, the predictions made.
In any case, I don't particularly trust any of the reasoning above.
So, I'm still unsure how to think about all this.