somehow I missed this post and only caught it now. This was helpful for a few things.
Just read erhan10a.pdf (mlr.press) (Why Does Unsupervised Pre-training Help Deep Learning?, 2010) for a paper club at work. This post helped me understand and simplify the material, quite a lot. Thanks!
roughly speaking, we gradient-descend our way to whatever point on the perfect-prediction surface is closest to our initial values.
I believe this is not correct as long as "gradient-descend" means some standard version of gradient descent because those are all local, can go highly nonlinear paths, and do not memorize the initial value to try staying close to it.
But maybe we can design a local search strategy similar to gradient descent which does try to stay close to the initial point x0? E.g., if at x, go a small step into a direction that has the minimal scalar product with x – x0 among those that have at most an angle of alpha with the current gradient, where alpha>0 is a hyperparameter. One might call this "stochastic cone descent" if it does not yet have a name.
I'm confused about this.
Say our points are the times of day measured by a clock. And are the temperatures measured by a thermometer at those times. We’re putting in times in the early morning, where I decree temperature to increase roughly linearly as the sun rises.
You write the overparametrized regression model as . Since our model doesn’t get to see the index, only the value of itself, that has to implicitly be something like
Where is the regression or NN output. So our model learned the slope, plus a lookup table for the noise values of the thermometer at those times in the training data set. That means that if the training set included the time , and the model encounters a temperature taken at outside training again, now from a different day, it will output .
Which is predictably wrong, and you can do better by not having that memorised noise term.
The model doesn’t get to make a general model plus a lookup table of noises in training to get perfect loss and then use only the general model outside of training. It can’t switch the lookup table off.
Put differently, if there’s patterns in the data that the model cannot possibly make a decent simple generative mechanism for, fitting those patterns to get a better loss doesn’t seem like the right thing to do.
Put yet another way, if you're forced to pick one single hypothesis to make predictions, the best one to pick doesn't necessarily come from the set that perfectly fits all past data.
So, you’ve heard that modern neural networks have vastly more parameters than they need to perfectly fit all of the data. They’re operating way out in the regime where, traditionally, we would have expected drastic overfit, yet they seem to basically work. Clearly, our stats-101 mental models no longer apply here. What’s going on, and how should we picture it?
Maybe you’ve heard about some papers on the topic, but didn’t look into it in much depth, and you still don’t really have an intuition for what’s going on. This post is for you. We’ll go over my current mental models for what’s-going-on in overparameterized models (i.e. modern neural nets).
Disclaimer: I am much more an expert in probability (and applied math more generally) than in deep learning specifically. If there are mistakes in here, hopefully someone will bring it up in the comments.
Assumed background knowledge: multi-dimensional Taylor expansions, linear algebra.
Ridges, Not Peaks
First things first: when optimizing ML models, we usually have some objective function where perfectly predicting every point in the training set yields the best possible score. In overparameterized models, we have enough parameters that training indeed converges to zero error, i.e. all data points in the training set are matched perfectly.
Let’s pick one particular prediction setup to think about, so we can stick some equations on this. We have a bunch of (x,y) data points, and we want to predict y given x. Our ML model has some parameters θ, and its prediction on a point x(n) is f(x(n),θ). In order to perfectly predict every data point in the training set, θ must satisfy the equations
∀n:y(n)=f(x(n),θ)
Assuming y(n) is one-dimensional (i.e. just a number), and we have N data points, this gives us N equations. If θ is k-dimensional, then we have N equations with k variables. If the number of variables is much larger than the number of equations (i.e. k>>N, parameter-dimension much greater than number of data points), then this system of equations will typically have many solutions.
In fact, assuming there are any solutions at all, we can prove there are infinitely many - an entire high-dimensional surface of solutions in θ-space. Proof: let θ∗ be a solution. If we make a small change dθ∗, then f(x(n),θ) changes by ∇θf(x(n),θ∗)⋅dθ∗. For all the equations to remain satisfied, after shifting θ∗→θ∗+dθ∗, these changes must all be zero:
∀n:0=∇θf(x(n),θ∗)⋅dθ∗
Key thing to notice: this is a set of linear equations. There are still N equations and still k variables (this time dθ∗ rather than θ), and since they’re linear, there are guaranteed to be at least k−N independent directions along which we can vary dθ∗ while still solving the equations (i.e. the right nullspace of the matrix ∇θf(x(n),θ∗) has dimension at least k−N). These directions point exactly along the local surface on which the equations are solved.
Takeaway: we have an entire surface of dimension (at least) k−N, sitting in the k-dimensional θ-space, on which all points in the training data are predicted perfectly.
What does this tell us about the shape of the objective function more generally?
Well, we have this (at least) k−N dimensional surface on which the objective function achieves its best possible value. Everywhere else, it will be lower. The “global optimum” is not a point at the top of a single peak, but rather a surface at the high point of an entire high-dimensional ridge. So: picture ridges, not peaks.
Before we move on, two minor comments on generalizing this model.
Priors and Sampling, Not Likelihoods and Estimation
So there’s an entire surface of optimal points. Obvious next question: if all of these points are optimal, what determines which one we pick? Short answer: mainly initial parameter values, which are typically randomly generated.
Conceptually, we randomly sample trained parameter values from the perfect-prediction surface. To do that, we first sample some random initial parameter values, and then we train them - roughly speaking, we gradient-descend our way to whatever point on the perfect-prediction surface is closest to our initial values. The key problem is to figure out what distribution of final (trained) parameter values results from the initial distribution of parameter values.
One key empirical result: during training, the parameters in large overparameterized models tend to change by only a small amount. (There’s a great visual of this in this post. It’s an animation showing weights changing over the course of training; for the larger nets, they don’t visibly change at all.) In particular, this means that linear/quadratic approximations (i.e. Taylor expansions) should work very well.
For our purposes, we don’t even care about the details of the ridge-shape. The only piece which matters is that, as long as we’re close enough for quadratic approximations around the ridge to work well, the gradient will be perpendicular to the directions along which the ridge runs. So, gradient descent will take us from the initial point, to whatever point on the perfect-prediction surface is closest (under ordinary Euclidean distance) to the initial point.
Stochastic gradient descent (as opposed to pure gradient descent) will contribute some noise - i.e. diffusion along the ridge-direction - but it should average out to roughly the same thing.
From there, figuring out the distribution from which we effectively sample our trained parameter values is conceptually straightforward. For each point θ∗ on the perfect-prediction surface, add up the probability density of the initial parameter distribution at all the points which are closer to θ∗ than to any other point on the perfect-prediction surface.
We can break this up into two factors:
Now for the really hand-wavy approximations:
Are these approximations reasonable? I haven’t seen anyone check directly, but they are the approximations needed in order for the results in e.g. Mingard et al to hold robustly, and those results do seem to hold empirically.
The upshot: we have an effective “prior” (i.e. the distribution from which the initial parameter values are sampled) and “posterior” (i.e. the distribution of final parameter values on the perfect-prediction surface). The posterior density is directly proportional to the prior density, but restricted to the perfect-prediction surface. This is exactly what Bayes’ rule says, if we start with a distribution P[θ] and then update on data of the form “∀n:y(n)=f(x(n),θ)”. Our posterior is then P[θ|∀n:y(n)=f(x(n),θ)], and our final parameter-values are a sample from that distribution.
Note how this differs from traditional statistical practice. Traditionally, we maximize likelihood, and that produces a unique “estimate” of θ. While today’s ML models may look like that at first glance, they’re really performing a Bayesian update of the parameter-value-distribution, and then sampling from the posterior.
Example: Overparameterized Linear Regression
As an example, let’s run a plain old linear regression. We’ll use an overparameterized model which is equivalent to a traditional linear regression model, in order to make the relationship clear.
We have 100 (x,y) pairs, which look like this:
I generated these with a “true” slope of 1, i.e. y=1∗x+noise, with standard normal noise.
Traditional-Style Regression
We have one parameter, c, and we fit a model y(n)=cx(n)+ξ(n), with standard normal-distributed noise ξ(n). This gives log likelihood
logP[y|a]=−12∑n(y(n)−cx(n))2
… plus some constants. We choose c∗ to maximize this log-likelihood. In this case, c∗=1.010, so the line looks like this:
(Slightly) Overparameterized Regression
We use the exact same model, y(n)=cx(n)+ξ(n), but now we explicitly consider the ξ(n) terms “parameters”. Now our parameters are (c,ξ(1),…,ξ(N)), and we’ll initialize them all as samples from a standard normal distribution (so our “prior” on the noise terms is the same distribution assumed in the previous regression). We then optimize (c,ξ(1),…,ξ(N)) to minimize the sum-of-squared-errors
12∑n(y(n)−cx(n)−ξ(n))2
This ends up approximately the same as a Bayesian update on ∀n:y(n)=cx(n)−ξ(n), and our final c-value 1.046 is not an estimate, but rather a sample from the posterior. Although the “error” in our c-posterior-sample here is larger than the “error” in our c-estimate from the previous regression, the implied line is visually identical:
Note that our model here is only slightly overparameterized; k=N+1, so the perfect prediction surface is one-dimensional. Indeed, the perfect prediction surface is a straight line in (c,ξ(1),…,ξ(N)) - space, given by the equations y(n)=cx(n)+ξ(n).
(Very) Overparameterized Regression
Usually, we say that the noise terms are normal because they’re a sum of many small independent noise sources. To make a very overparameterized model, let’s make those small independent noise sources explicit: y(n)=cx(n)+√3N∑100i=0ξ(n)i. Our parameters are c and the whole 2D array of ξ’s, with standard normal initialization on c, and Uniform(-1, 1) initialization on ξ. (The √3N is there to make the standard deviation equivalent to the original model.) As before, we minimize sum-of-squared-errors.
This time our c-value is 1.031. The line still looks exactly the same. This time, we’re much more overparameterized - we have k=100N+1, so the perfect prediction surface has dimension 99N+1. But conceptually, it still works basically the same as the previous example.
Code for all these is here.
In all these examples, the underlying probabilistic models are (approximately) identical. The latter two (approximately) sample from the posterior, rather than calculating a maximum-log-likelihood parameter estimate, but as long as the posterior for the slope parameter is very pointy, the result is nearly the same. The main difference is just what we call a "parameter" and optimize over, rather than integrating out.