IID vs Myopia
In a comment to Partial Agency, Rohin summarized his understanding of the post. He used the iid assumption as a critical part of his story. Initially, I thought that this was a good description of what was going on; but I soon realized that iid isn't myopia at all (and commented as such). This post expands on the thought.
My original post conflated episodic (which is basically 'iid') with myopic.
In an episodic setting, it makes sense to be myopic about anything beyond the current episode. There's no benefit to cross-episode strategies, so, no need to learn them.
This is true at several levels (which I mention in the hopes of avoiding later confusion):
- In designing an ML algorithm, if we are assuming an episodic structure, it makes sense to use a learning algorithm which is designed to be myopic.
- A learning algorithm in an episodic setting has no incentive to find non-myopic solutions (even if it can).
However, it is also possible to consider myopia in the absence of episodic structure, and not just as a mistake. We might want an ML algorithm to learn myopic strategies, as is the case with predictive systems. (We don't want them to learn to manipulate the data; and even though that failure mode is far-fetched for most modern systems, there's no point setting up learning procedures which would incentivise it. Indeed, learning procedures seem to mostly encourage myopia, though the full situation is still unclear to me.)
These myopic strategies aren't just "strategies which behave as if there were an episodic assumption", either. For example, sequential prediction is myopic (the goal is to predict each next item accurately, not to get the most accuracy overall -- if this is unclear, hopefully it will become clearer in the next section).
So, there's a distinction between not remembering the past vs not looking ahead to the future. In episodic settings, the relevant parts of past and future are both limited to the duration of the episode. However, the two come apart in general. We can have/want myopic agents with memory; or, we can have/want memoryless agents which are not myopic. (The second seems somewhat more exotic.)
Game-Theoretic Myopia Definition
So far, I've used 'myopia' in more or less two ways: an inclusive notion which encompasses a big cluster of things, and also the specific thing of only optimizing each output to maximize the very next reward. Let's call the more specific thing "absolute" myopia, and try to define the more general thing.
Myopia can't be defined in terms of optimizing an objective in the usual sense -- there isn't one quantity being optimized. However, it seems like most things in my 'myopia' cluster can be described in terms of game theory.
Let's put down some definitions:
Sequential decision scenario: An interactive environment which takes in actions and outputs rewards and observations. I'm not trying to deal with embeddedness issues; this is basically the AIXI setup. (I do think 'reward' is a very restrictive assumption about what kind of feedback the system gets, but talking about other alternatives seems like a distraction from the current post.)
(Generalized) objective: A generalized objective assigns, to each action , a function . The quantity is how much the nth decision is supposed to value the ith reward. Probably, we require the sum to exist.
Some examples:
- Absolute myopia. if , and otherwise.
- Back-scratching variant: if , and 0 otherwise.
- Episodic myopia. if and are within the same episode; 0 otherwise.
- Hyperbolic discounting. , in; 0 otherwise.
- Dynamically consistent version of hyperbolic:
- Exponential discounting. , typically with c<1.
- 'Self-defeating' functions, such as for n=i, -1 for n=i+1, and 0 otherwise.
A generalized objective could be called 'myopic' if it is not dynamically consistent; ie, if there's no way to write as a function of alone, eliminating the dependence on .
This notion of myopia does not seem to include 'directionality' or 'stop-gradients' from my original post. In particular, if we try to model pure prediction, absolute myopia captures the idea that you aren't supposed to have manipulative strategies which lie (throw out some reward for one instance in order to get more overall). However, it does not rule out manipulative strategies which select self-fulfilling prophecies strategically; those achieve high reward on instance by choice of output , which is what a myopic agent is supposed to do.
There are also non-myopic objectives which we can't represent here but might want to represent more generally: there isn't a single well-defined objective corresponding to 'maximizing average reward' (the limit of exponential discounting as ).
Vanessa recently mentioned using game-theoretic models like this for the purpose of modeling inconsistent human values. I want to emphasize that (1) I don't want to think of myopia as necessarily 'wrong'; it seems like sometimes a myopic objective is a legitimate one, for the purpose of building a system which does something we want (such as make non-manipulative predictions). As such, (2) myopia is not just about bounded rationality.
I also don't necessarily want to think of myopia as multi-agent, even when modeling it with multi-agent game theory like this. I'd rather think about learning one myopic policy, which makes the appropriate (non-)trade-offs based on .
In order to think about a system behaving myopically, we need to use an equilibrium notion (such as Nash equilibria or correlated equilibria), not just . However, I'm not sure quite how I want to talk about this. We don't want to think in terms of a big equilibrium between each decision-point ; I think of that as a selection-vs-control mistake, treating the sequential decision scenario as one big thing to be optimized. Or, putting it another way: the problem is that we have to learn; so we can't talk about everything being in equilibrium from the beginning.
Perhaps we can say that there should be some such that each decision after that is in approximate equilibrium with each other taking the decisions before as given.
(Aside -- What we definitely don't want (if we want to describe or engineer legitimately myopic behavior) is a framework where the different decision-points end up bargaining with each other (acausal trade, or mere causal trade), in order to take pareto improvements and thus move toward full agency. IE, in order to keep our distinctions from falling apart, we can't apply a decision theory which would cooperate in Prisoner's Dilemma or similar things. This could present difficulties.)
Let's move on to a different way of thinking about myopia, through the language of Pareto-optimality.
Pareto Definition
We can think of myopia as a refusal to take certain Pareto improvements. This fits well with the previous definition; if an agent takes all the Pareto improvements, then its behavior must be consistent with some global weights not a function of . However, not all myopic strategies in the Pareto sense have nice representations in terms of generalized objectives.
In particular: I mentioned that generalized objectives couldn't rule out manipulation through selection of self-fulfulling prophecies; so, only capture part of what seems implied by map/territory directionality. Thinking in terms of Pareto-failures, we can also talk about failing to reap the gains from selection of manipulative self-fulfilling prophecies.
However, thinking in these terms is not very satisfying. It allows a very broad notion of myopia, but has few other virtues. Generalized objectives let me talk about myopic agents trying to do a specific thing, even though the thing they're trying to do isn't a coherent objective. Defining myopia as failure to take certain Pareto improvements doesn't give me any structure like that; a myopic agent is being defined in the negative, rather than described positively.
Here, as before, we also have the problem of defining things learning-theoretically. Speaking purely in terms of whether the agent takes certain Pareto improvements doesn't really make sense, because it has to learn what situation it is in. We want to talk about learning processes, so we need to talk about learning to take the Pareto improvements, somehow.
(Bayesian learning can be described in terms of Pareto optimality directly, because using a prior over possible environments allows Pareto-optimal behavior in terms of those environments. However, working that way requires realizability, which isn't realistic.)
Decision Theory
In the original partial agency post, I described full agency as an extreme (perhaps imaginary) limit of less and less myopia. Full agency is like Cartesian dualism, sitting fully outside the universe and optimizing.
Is full agency that difficult? From the generalized-objective formalism, one might think that ordinary RL with exponential discounting is sufficient.
The counterexamples to this are MIRI-esque decision problems, which create dynamic inconsistencies for otherwise non-myopic agents. (See this comment thread with Vanessa for more discussion of several of the points I'm about to make.)
To give a simple example, the version of Newcomb's Problem where the predictor knows about as much about your behavior as you do. (The version where the predictor is nearly infallible is easily handled by RL-like learning; you need to specifically inject sophisticated CDT-like thinking to mess that one up.)
In order to have good learning-theoretic properties at all, we need to have epsilon exploration. But if we do, then we tend to learn to 1-box, because (it will seem) doing so is independent of the predictor's predictions of us.
Now, it's true that in a sequential setting, there will be some incentive to 2-box not for the payoff today, but for the future; establishing a reputation of 1-boxing gets higher payoffs in iterated Newcomb in a straightforward (causal) way.
However, that's not enough to entirely avoid dynamic inconsistency. For any discounting function, we need only to assume that the instances of Newcomb's problem are spaced out far enough over time so that 2-boxing in each individual case is appealing.
Now, one might argue that in this case, the agent is correctly respecting its generalized objective; it's supposed to sacrifice future value for present according to the discounting function. And that's true, if we want myopic behavior. But it is dynamically inconsistent -- the agent wishes to 2-box in each individual case, but with respect to future cases, would prefer to 1-box. It would happily bind its future actions given an opportunity to do so.
Like the issue with self-fulfilling prophecies, this creates a type of myopia which we can't really talk about within the formalism of generalized objectives. Even with an apparently dynamically consistent discounting function, the agent is inconsistent. As mentioned earlier, we need generalized-objective systems to fail to coordinate with themselves; otherwise, their goals collapse into regular objectives. So this is a type of myopia which all generalized objectives possess.
As before, I'd really prefer to be able to talk about this with specific types of myopia (as with myopic generalized objectives), rather than just pointing to a dynamic inconsistency and classifying it with myopia.
(We might think of the fully non-myopic agent as the limit of less and less discounting, as Vanessa suggests. This has some problems of convergence, but perhaps that's in line with non-myopia being an extreme ideal which doesn't always make sense. Alternately, we might thing of this as a problem of decision theory, arguing that we should be able reap the advantages of 1-boxing despite our values temporally discounting. Or, there might be some other wilder generalization of objective functions which lets us represent the distinctions we care about.)
Mechanism Design Analogy
I'll close this post with a sketchy conjecture.
Although I don't want to think of generalized objectives as truly multi-agent in the one-'agent'-per-decision sense, learning algorithms will typically have a space of possible hypotheses which are (in some sense) competing with each other. We can analogize that to many competing agents (keeping in mind that they may individually be 'partial agents', ie, we can't necessarily model them as coherently pursuing a utility function).
For any particular type of myopia (whether or not we can capture it in terms of a generalized objective), we can ask the question: is it possible to design a training procedure which will learn that type of myopia?
(We can approach this question in different ways; asymptotic convergence, bounded-loss (which may give useful bounds at finite time), or 'in-practice' (which fully accounts for finite-time effects). As I've mentioned before, my thoughts on this are mostly asymptotic at the moment, that being the easier theoretical question.)
We can think of this question -- the question of designing training procedures -- as a mechanism-design question. Is it possible to set up a system of incentives which encourages a given kind of behavior?
Now, mechanism design is a field which is associated with negative results. It is often not possible to get everything you want. As such, a natural conjecture might be:
Conjecture: It is not possible to set up a learning system which gets you full agency in the sense of eventually learning to take all the Pareto improvements.
This conjecture is still quite vague, because I have not stated what it means to 'learn to take all the Pareto improvements'. Additionally, I don't really want to assume the AIXI-like setting which I've sketched in this post. The setting doesn't yield very good learning-theoretic results anyway, so getting a negative result here isn't that interesting. Ideally the conjecture should be formulated in a setting where we can contrast it to some positive results.
There's also reason to suspect the conjecture to be false. There's a natural instrumental convergence toward dynamic consistency; a system will self-modify to greater consistency in many cases. If there's an attractor basin around full agency, one would not expect it to be that hard to set up incentives which push things into that attractor basin.
I suspect I made our recent discussions unnecessarily messy by simultaneously talking about: (1) "informal strategic stuff" (e.g. the argument that selection processes are strategically important, which I now understand is not contradictory to your model of the future); and (2) my (somewhat less informal) mathematical argument about evolutionary computation algorithms.
The rest of this comment involves only the mathematical argument. I want to make that argument narrower than the version that perhaps you responded to: I want it to only be about absolute myopia, rather than more general concepts of myopia or full agency. Also, I (now) think my argument applies only to learning setups in which the behavior of the model/agent can affect what the model encounters in future iterations/episodes. Therefore, my argument does not apply to setups such as unsupervised learning for past stock prices or RL for Atari games (when each episode is a new game).
My argument is (now) only the following: Suppose we have a learning setup in which the behavior of the model at a particular moment may affect the future inputs/environments that the model will be trained on. I argue that evolutionary computation algorithms seem less likely to yield an absolute myopic model, relative to gradient decent. If you already think that, you might want to skip the rest of this comment (in which I try to support this argument).
I think the following property might make a learning algorithm more likely to yield models that are NOT absolute myopic:
During training, a parameter's value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future iterations/episodes.
I think that this property tends to apply to evolutionary computation algorithms more than it applies to gradient descent. I'll use the following example to explain why I think that:
Suppose we have some online supervised learning setup. Suppose that during iteration 1 the model needs to predict random labels (and thus can't perform better than chance), however, if parameter θ8 has a large value then the model makes predictions that cause the examples in iteration 2 to be more predictable. By assumption, during iteration 2 the value of θ8 does not (directly) affect predictions.
How should we expect our learning algorithm to update the parameter θ8 at the end of iteration 2?
If our learning algorithm is gradient decent, it seems that we should NOT expect θ8 to increase, because there is no iteration in which the relevant component of the gradient (i.e. the partial derivative of the objective with respect to θ8) is expected to be positive.
In contrast, if our learning algorithm is some evolutionary computation algorithm, the models (in the population) in which θ8 happens to be larger are expected to outperform the other models, in iteration 2. Therefore, we should expect iteration 2 to increase the average value of θ8 (over the model population).