All of Vinayak Pathak's Comments + Replies

Ah, I just noticed it's an old post. I was just clicking through all the SLT links. :)

Having written a few papers about ERM and its variants, I feel personally attacked! I feel obliged to step in and try to defend ERM's honour.

First of all, I don't think I would call ERM a learning framework. ERM is a solution to the mathematical problem of PAC learning. Theoretical computer scientists like to precisely define the problem they want to solve before trying to solve it. When they were faced with the problem of learning, they decided that the mathematical problem of PAC learning was a good representation of the real-world problem of learning. O...
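As a toy illustration (my own sketch, not from the comment above): ERM just means returning the hypothesis in your class with the lowest error on the sample. Here the hypothesis class is hypothetical 1-D threshold classifiers, h_t(x) = 1 iff x >= t.

```python
import random

# A toy ERM sketch (illustrative, not from the original comment): the
# hypothesis class is 1-D threshold classifiers h_t(x) = 1 iff x >= t,
# and ERM simply returns the candidate threshold with the lowest
# empirical (sample) error.

def empirical_risk(t, sample):
    return sum(1 for x, y in sample if (x >= t) != y) / len(sample)

def erm(candidates, sample):
    return min(candidates, key=lambda t: empirical_risk(t, sample))

# Noiseless data labeled by a true threshold at 0.5.
random.seed(0)
sample = [(x, x >= 0.5) for x in (random.random() for _ in range(200))]

t_hat = erm([i / 100 for i in range(101)], sample)
print(t_hat, empirical_risk(t_hat, sample))  # a zero-error threshold near 0.5
```

With noiseless data and a rich enough candidate grid, ERM recovers a threshold consistent with every sample point, which is exactly the situation PAC learning's sample-complexity bounds are about.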

Jesse Hoogland
Looking back at this, I think this post is outdated and was trying a little too hard to be provocative. I agree with everything you say here. Especially: "One could reasonably say that PAC learning is somewhat confused, but learning theorists are working on it!" Forgive my youthful naïveté. For what it's worth, I think the generalization post in this sequence has stood the test of time much better.

Thanks, this clarifies many things! Thanks also for linking to your very comprehensive post on generalization.

To be clear, I didn't mean to claim that VC theory explains NN generalization. It is indeed famously bad at explaining modern ML. But "models have singularities and thus number of parameters is not a good complexity measure" is not a valid criticism of VC theory. If SLT indeed helps figure out the mysteries from the "understanding deep learning..." paper then that will be amazing!

But what we'd really like to get at is an understanding of how pertur...
Garrett Baker
Sumio Watanabe has two papers on out-of-distribution generalization: "Asymptotic Bayesian generalization error when training and test distributions are different" and "Experimental Bayesian Generalization Error of Non-regular Models under Covariate Shift".
Jesse Hoogland
Right, this quote is really a criticism of the classical Bayesian Information Criterion (for which the "Widely applicable Bayesian Information Criterion" WBIC is the relevant SLT generalization). That's right: existing work is about in-distribution generalization. It is the case that, within the Bayesian setting, SLT provides an essentially complete account of in-distribution generalization. As you've pointed out there are remaining differences between Bayes and SGD. We're working on applications to OOD but have not put anything out publicly about this yet. 

I have also been looking for comparisons between classical theory and SLT that make the deficiencies of the classical theories of learning clear, so thanks for putting this in one place.

However, I find the narrative of "classical theory relies on the number of parameters but SLT relies on something much smaller than that" to be a bit of a strawman of classical theory. VC theory already depends only on the number of behaviours induced by your model class rather than the number of parameters, for example, and is a central part of the classical the...

Neural networks are intrinsically biased towards simpler solutions. 

Am I correct in thinking that being "intrinsically biased towards simpler solutions" isn't a property of neural networks, but a property of the Bayesian learning procedure? The math in the post doesn't use much about NNs, and it seems like the same conclusions can be drawn for any model class whose loss landscape has many minima of varying complexity?

Jesse Hoogland
To be precise, it is a property of singular models (which includes neural networks) in the Bayesian setting. There are good empirical reasons to expect the same to be true for neural networks trained with SGD (across a wide range of different models, we observe the local learning coefficient (LLC) progressively increase from ~0 over the course of training).

Perhaps I have learnt statistical learning theory in a different order than others, but in my mind, the central theorem of statistical learning theory is that learning is characterized by the VC-dimension of your model class (here I mean learning in the sense of supervised binary classification, but similar dimensions exist for some more general kinds of learning as well). VC-dimension is a quantity that does not even mention the number of parameters used to specify your model, but depends only on the number of different behaviours induced by the models in...
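The "behaviours, not parameters" point can be made concrete with a small sketch (my own toy example, not from the comment): 1-D threshold classifiers h_t(x) = 1 iff x >= t have VC dimension 1, which we can verify by enumerating the labelings the class induces on a point set.

```python
# A quick check of the VC-dimension claim (illustrative sketch): enumerate
# the labelings ("behaviours") that threshold classifiers h_t(x) = 1 iff
# x >= t induce on a point set, and test whether all 2^k labelings are
# realized, i.e. whether the set is shattered.

def behaviours(points):
    # Behaviours only change as t crosses a point, so these thresholds suffice.
    thresholds = sorted(points) + [max(points) + 1.0]
    return {tuple(x >= t for x in points) for t in thresholds}

def shattered(points):
    return len(behaviours(points)) == 2 ** len(points)

print(shattered([0.3]))       # True: any single point is shattered
print(shattered([0.3, 0.7]))  # False: the labeling (1, 0) is unrealizable
```

The class has infinitely many parameter values t, yet only k + 1 behaviours on any k points, which is what the VC bound actually measures.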

Jesse Hoogland
The key distinction is that VC theory takes a global, worst-case approach — it tries to bound generalization uniformly across an entire model class. This made sense historically but breaks down for modern neural networks, which are so expressive that the worst case is always very bad and doesn't get you anywhere.

The statistical learning theory community woke up to this fact (somewhat) with the Zhang et al. paper, which showed that deep neural networks can achieve perfect training loss on randomly labeled data (even with regularization). The same networks, when trained on natural data, will generalize well. VC dimension can't explain this. If you can fit random noise, you get a huge (or even infinite) VC dimension, and the resulting bounds fail to explain empirically observed generalization performance.

So I'd argue that dependence on the true data distribution isn't a weakness, but one of SLT's great strengths. For highly expressive model classes, generalization only makes sense in reference to a data distribution. Global, uniform approaches like VC theory do not explain why neural networks generalize.

Multiple parameter values leading to the same behavior isn't a problem — this is "the one weird trick." The reason you don't get the terribly generalizing solution that is overfit to noise is that simple solutions occupy more volume in the loss landscape and are therefore easier to find. At the same time, simpler solutions generalize better (this is intuitively what Occam's razor is getting at, though you can make it precise in the Bayesian setting). So it's the solutions that generalize best that end up getting found.

I would say that this is a motivating conjecture and deep open problem (see, e.g., the natural abstractions agenda). I believe that something like this has to be true for learning to be at all possible. Real-world data distributions have structure; they do not resemble noise. This difference is what enables models to learn to generalize from...
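The volume argument can be illustrated numerically (my own toy example, under the assumption of a 1-D Bayesian posterior proportional to exp(-n * L(w))): compare the posterior mass around a regular minimum, L(w) = w^2, and a singular one, L(w) = w^4, both attaining the same minimum loss.

```python
import math

# A numerical sketch of the "volume" argument (illustrative, not from the
# comment): the learning coefficient is 1/2 for L(w) = w^2 but 1/4 for
# L(w) = w^4, so as the sample size n grows the singular minimum retains
# more posterior volume than the regular one.

def volume(loss, n, lo=-1.0, hi=1.0, steps=40001):
    """Riemann-sum approximation of the integral of exp(-n * loss(w))."""
    dw = (hi - lo) / (steps - 1)
    return sum(math.exp(-n * loss(lo + i * dw)) for i in range(steps)) * dw

ratios = {}
for n in (10, 100, 1000):
    v_regular = volume(lambda w: w ** 2, n)
    v_singular = volume(lambda w: w ** 4, n)
    ratios[n] = v_singular / v_regular
    print(n, round(ratios[n], 3))  # ratio grows roughly like n ** 0.25
```

Both minima have one parameter, yet the singular one increasingly dominates the posterior, which is the sense in which "simpler" (lower learning coefficient) solutions occupy more volume.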

The participants will be required to choose a project out of a list I provide. They will be able to choose to work solo or in a group.

 

Is this list or an approximate version of it available right now? Since the application process itself requires a non-trivial time commitment, it might be nice to see the list before deciding whether to apply.

Vanessa Kosoy
I added some examples to the end of this post, thank you for the suggestion.

I read the paper, and overall it's an interesting framework. One thing I am somewhat unconvinced about (likely because I have misunderstood something) is its utility despite the dependence on the world model. If we prove guarantees assuming a world model, but don't know what happens if the real world deviates from the world model, then we have a problem. Ideally perhaps we want a guarantee akin to what's proved in learning theory, for example, that the error will be small for any data distribution as long as the distribution remains the same during trai...

Joar Skalse
You can imagine different types of world models, going from very simple ones to very detailed ones. In a sense, you could perhaps think of the assumption that the input distribution is i.i.d. as a "world model". However, what is imagined is generally something that is much more detailed than this. More useful safety specifications would require world models that (to some extent) describe the physics of the environment of the AI (perhaps including human behaviour, though it would probably be better if this can be avoided). More detail about what the world model would need to do, and how such a world model may be created, is discussed in Section 3.2. My personal opinion is that the creation of such a world model probably would be challenging, but not more challenging than the problems encountered in other alignment research paths (such as mechanistic interpretability, etc.).

Also note that you can obtain guarantees without assuming that the world model is entirely accurate. For example, consider the guarantees that are derived in cryptography, or the guarantees derived from formal verification of airplane controllers, etc. You could also monitor the environment of the AI at runtime to look for signs that the world model is inaccurate in a certain situation, and if such signs are detected, transition the AI to a safe mode where it can be disabled.

Hmm, but what if everything gets easier to produce at a similar rate as the consumer basket? Won't the prices remain unaffected then?