First, if we take PSRL as our model algorithm, then at any given time we follow a policy optimal for some hypothesis sampled out of the belief state. Since our prior favors simple hypotheses, the hypothesis we sampled is likely to be simple. But, given a hypothesis of description complexity , the corresponding optimal policy has description complexity , since the operation "find the optimal policy" has description complexity.
Taking computational resource bounds into account makes things more complicated. For some computing might be intractable, even though itself is "efficiently computable" in some sense. For example we can imagine an that has exponentially many states plus some succinct description of the transition kernel.
One way to deal with it is using some heuristic for optimization. But, then the description complexity is still .
Another way to deal with it is restricting ourselves to the kind of hypotheses for which is tractable, but allowing incomplete/fuzzy hypotheses, so that we can still deal with environments whose complete description falls outside this class. For example, this can take the form of looking for some small subset of features that has predictable behavior that can be exploited. In this approach, the description complexity is probably still something like , where this time is incomplete/fuzzy (but I don't yet know how PSRL for incomplete/fuzzy hypothesis should work).
Moreover, using incomplete models we can in some sense go in the other direction, from policy to model. This might be a good way to think of model-based RL. In actor-critic algorithms, our network learns a pair consisting of a value function and a policy . We can think of such a pair as an incomplete model that is defined by the Bellman inequality interpreted as a constraint on the transition kernel (or and the reward function ):
Assuming that our incomplete prior assigns weight to this incomplete hypothesis, we get a sort of Occam's razor for policies.
In model-free RL, policy-based methods choose policies by optimizing a noisy estimate of the policy's value. This is analogous to optimizing a noisy estimate of prediction accuracy (i.e., accuracy on the training data) to choose a predictive model. So we often need to trade variance for bias in the policy-learning case (i.e., shrink towards simpler policies) just as in the predictive modeling case.
So: is it possible to formulate an instrumental version of Occam? Can we justify a simplicity bias in our policies?
Maybe problems that don't have simple solutions (i.e. all their solutions have a large description length) are usually intractable for us. If so, given a problem that we're trying to solve, the assumption that it has simple solutions is probably either useful (if it's true) or costless (if it isn't). In other words: "look for your missing key under the lamppost, not because it's probably there, but because you'll only ever find it if it's there".
One way of solving the problem of action selection is to treat our own actions as random variables, then build a probabilistic model, and condition that probability distribution on things going well for us in the future. See, for example, Planning By Probabilistic Inference, human brains (I would argue), upside-down RL (kinda, I think?), etc.
In this formulation, instrumental Occam's razor is just a special case of epistemic Occam's razor, it seems to me.
Yes, I agree with that. But (as I've said in the past) this formalism doesn't do it for me. I have yet to see something which strikes me as a compelling argument in its favor.
So in the context of planning by probabilistic inference, instrumental occam seems almost like a bug rather than a feature -- the unjustified bias toward simpler policies doesn't seem to serve a clear purpose. It's just an assumption.
Granted, the fact that I intuitively feel there should be some kind of instrumental occam is a point in favor of such methods in some sense.
DanielFilan asked me to comment on this paragraph:
Yet, even in the relatively formal world of machine learning, the practice seems contrary to this. When you are optimizing a neural network, you don't actually care that much whether it's something like a hypothesis (making predictions) or something like a policy (carrying out actions). You apply the same kind of regularization either way, as far as I understand (regularization being the machine-learner's generalization of Occam).
AFAIK, it's not actually standard to regularize RL policies the same way you regularize supervised learning. For example, A3C, PPO, and SAC, three leading Deep RL algorithms, use an entropy bonus to regularize their policies. Notably, entropy encourages policies that do different things, instead of policies that are internally simple. On the other hand, in supervised learning, people use techniques such as L2 regularization and Dropout, to get predictors that are simple.
You do see L2 regularization used in a lot of deep RL papers (for example, it's used on every network in AlphaZero variants, in DQN, and even in earlier versions of the SAC algorithm I mentioned before). However, it's important to note that L2 regularization is used on prediction tasks:
(Vanessa's argument about PSRL also seems similar, as PSRL is fundamentally doing supervised learning.)
As for the actual question, I'm not sure "instrumental Occam" exists. Absent multi-agent issues, my guess is Occam's razor is useful in RL insofar as your algorithm has predictive tasks. You want a simple rule for predicting the reward, given your action, not a simple rule for predicting action given an observation history. Insofar as an actual simplicity prior on policies exist and are useful, my guess is that it's because your AI might interact with other AIs (including copies of itself), and so need to be legible/inexploitable/predictable/etc.
Excellent, thanks for the comment! I really appreciate the correction. That's quite interesting.
Whenever you're picking a program to do a task, you run into similar constraints as when you're trying to pick a program to make predictions. There an infinite number of programs, so you can't just put a uniform prior on which one is best, you need some way of ranking them. Simplicity is a ranking. Therefore, ranking them by simplicity works, up to a constant. The induction problem really does seem similar - the only difference is that in prediction, the cost function is simple, but in action, the cost function can be complicated.
I don't have a copy of Li and Vitanyi on me at the moment, but I wonder if they have anything on how well SI does if you have a cost function that's a computable function of prediction errors.
Intuition: "It seems like a good idea to keep one's rules/policies simple."
Me: "Ok, why? What are some prototypical examples where keeping one's rules/policies simple would be useful?"
Intuition: "Well, consider legal codes - a complex criminal code turns into de-facto police discretion. A complex civil code turns into patent trolls and ambulance chasers, people who specialize in understanding the complexity and leveraging it against people who don't. On personal level, in order for others to take your precommitments seriously, those precommitments need to be both easily communicated and clearly delineate lines which must not be crossed - otherwise people will plead ignorance or abuse grey areas, respectively. On an internal level, your rules necessarily adjust as the world throws new situations at you - new situations just aren't covered by old rules, unless we keep those old rules very simple."
Me: "Ok, trying to factor out the common denominator there... complexity implies grey areas. And when we try to make precommitments (i.e. follow general policies), grey areas can be abused by bad actors."
Intuition: <mulls for a minute> "Yeah, that sounds about right. Complexity is bad because grey areas are bad, and complexity creates grey areas."
Me: "The problem with grey areas is simple enough; Thomas Schelling already offers good models of that. But why does complexity necessarily imply grey areas in the policy? Is that an inherent feature of complexity in general, or is it specific to the kinds of complexity we're imagining?"
Intuition: "A complex computer program might not have any grey areas, but as a policy that would have other problems..."
Me: "Like what?"
Intuition: "Well, my knee-jerk is to say it won't actually be a policy we want, but when I actually picture it... it's more like, the ontology won't match the real world."
Me: "And a simple policy would match real-world ontology better than a complex one? That not what I usually hear people say..."
Intuition: "Ok, imagine I draw a big circle on the ground, and say 'stay out of this circle'. But some places there's patches of grass or a puddle or whatever, so the boundary isn't quite clear - grey areas. Now instead of a circle, I make it some complicated fractal shape. Then there will be more grey areas."
Me: "Why would there be more - oh wait, I see. Surface area. More surface area means more grey areas. That makes sense."
Intuition: "Right, exactly. Grey areas, in practice, occur in proportion to surface area. More complexity means more surface means more grey areas means more abuse of the rules."
I find it useful to use Marr's representation/traversal distinction to think about how systems use Aether variables. Sometimes the complexity/uncertainty gets pushed into the representation to favor simpler/faster traversals (common in most computer science algorithm designs) and sometimes the uncertainty gets pushed into the traversal to favor a simpler representation (common in normal human cognition due to working memory constraints, or any situation with mem constraints).
From this, I had an easier time thinking about how a simplicity bias is really (often?) a modularity bias, so that we can get composability in our causal models.
So: is it possible to formulate an instrumental version of Occam? Can we justify a simplicity bias in our policies?
Justification has the downside of being wrong, a) if what you are arguing is wrong/false, b) can be wrong even if what you are arguing is right/true. That being said/Epistemic warnings concluded...
1. A more complicated policy:
2. We don't have the right policy.
3. A perfect model has a cost, and is an investment.*
*'Expected utility calculations' may be done in several ways, including:
One criticism of AIXI/related methods is that it is uncomputable. Another is that, even if you have the right model
4. The right model doesn't necessarily tell you what to do.
Speculation:
5. More complicated policies are more difficult to update, and
6. more difficult to use.
This is an argument against instrumental convergence - in a deterministic environment that is completely understood. Under these conditions, the 'paperclip maximizer AI' knows that yes, it does want paperclips more than anything else, and has no issue turning itself into paperclips.
Can you elaborate? Does this argue against instrumental convergence because it would paperclip itself?
Does this argue against instrumental convergence because it would paperclip itself?
If that is the best thing to do. (To maximize paperclips, in those circumstances.)
Can you elaborate?
Imagine you don't know what your utility function is, but you've narrowed it down to two possibilities - A and B. Ignoring questions around optimizing and conflict between the two possibilities, from that position, if something is good for 'achieving'* both A and B, then you should probably go for that thing.
*This also may be true if the thing is something valued by both A and B.
So in theory there's an aspect of instrumental convergence in humans/agents that are uncertain about utility functions/etc. - "I don't know what I want but this is good for more than one/a lot of/all of them."
There's also another aspect of instrumental convergence around things like power - things that seem more general. 'A paperclip maximizer wanting to not die as a means to an end.' (Though this is backwards of how we usually think of such things - we consider it unusual for a human to want to die. (Perhaps there's a more general tendency to not talk about self-destructing agents much.))
So if an agent finds out for sure it's utility function is B, that can eliminate 'instrumental convergence' from (utility function) uncertainty.
If an agent finds out all it can do in its environment (determinism, etc.) that can eliminate 'instrumental convergence' from (environmental/etc.) uncertainty.
For contrast, new information seems like it can also have the opposite effects:
You find out you value something you didn't know you valued. (Perhaps after trying something new, or thinking.)
An agent finds out 'more compute/more time' improves its ability to do anything. (Perhaps by observing other agents achieve their goals better with more of those things.)**
**An agent need not understand 'instrumental convergence' to act on those behaviors. 'Observing 'X makes it easier to achieve my goals' may easily imply that 'I could use some more X'.' But uncertainty about 'what you want'/'the best way to achieve it' is also a reason to seek out things that improve 'general future goal fulfillment ability'. (General across the space across the space of goals/methods you're considering.)
In The Real Rules have No Exceptions, Said says (apparently speaking of instrumental rules, not epistemic ones):
This is a very plausible principle. I have had similar thoughts myself. The idea seems to have merit in practice. But is there any theoretical analogue?
Minimum description length (Occam's Razor) is an epistemic principle. In Bayesian terms, probabilities must sum to one, which translates information-theoretically (via optimal encodings) to the idea that there are only so many short codes, so we have to assign them carefully to the most probable things.
However, expected utility is not a limited resource in this way. Updating to think an option is better than previously thought doesn't necessarily make other options worse. Selecting policies is not so different from selecting raw actions. And for Bayesian decision theory, it doesn't really matter whether a policy is specified as a big table of observation/action pairs, vs an elegantly specified rule.
Optimality cares not for elegance!
Yet, even in the relatively formal world of machine learning, the practice seems contrary to this. When you are optimizing a neural network, you don't actually care that much whether it's something like a hypothesis (making predictions) or something like a policy (carrying out actions). You apply the same kind of regularization either way, as far as I understand (regularization being the machine-learner's generalization of Occam). (Correction: this seems not to be the case.)
We might say that this is because (in some sense) the instrumental uncertainty and the epistemic uncertainty are actually being wrapped up together. But (1) this reply seems overly quick to me at present, and I'd want to understand in detail whether this can be justified; (2) I'm curious if there is a purely instrumental version of Occam to be articulated; it seems intuitively really plausible to me, though technically quite mysterious.
So: is it possible to formulate an instrumental version of Occam? Can we justify a simplicity bias in our policies?