Oh hi! I linked your video in another comment without noticing this one. Great visual explanation!

Yes, that's precisely what I'm claiming!

Sorry if that wasn't clear.  As for how to establish that, I proposed an intuitive justification:

There is no mechanism fitting the model to the linear approximation of the data around the training points.

And an outline for a proof:

Take two problems which have the same value at the training points but with wildly different linear terms around them. A model perfectly fit to the training points would not be able to distinguish the two.

Let's walk through an example

  1. Consider trying to fit a simple function, $f_1(x) = 0$. Let's collect a training dataset $D = \{(k\pi,\, 0)\}$ for a few integers $k$
  • You optimize a perfect model on $D$ (e.g. a neural net with mapping $m(x) = 0$)
  • Now let's study the scaling of error as you move away from training points. In the example, we achieved $|m(k\pi + \epsilon) - f_1(k\pi + \epsilon)| = 0$, since coincidentally $f_1$ has a zero linear term around every training point (the flat model happens to agree with it everywhere)

  2. Consider a second example. Let's fit $f_2(x) = \sin(x)$. Again, we collect training data $D = \{(k\pi,\, 0)\}$, identical to before

  • You optimize a perfect model on $D$ (using the same optimization procedure, we get a neural net with the same mapping $m(x) = 0$)
  • Now, we see $|m(k\pi + \epsilon) - f_2(k\pi + \epsilon)| = |\sin(\epsilon)| \approx |\epsilon|$ (we predict a flat line at $0$, and the error is measured against a sinusoid). You can notice this visually or analytically: the error grows linearly in $\epsilon$, not quadratically

The model is trained on $D$ alone, and is independent of the underlying function $f$. That means even if by happy accident our optimization procedure achieves a zero linear error term around the training points (as in the first example), we can prove that it is not generally true by considering an identical training dataset with a different underlying function (and knowing our optimization must result in the same model).
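
In case it's useful, here is a minimal numerical sketch of the walkthrough above (the concrete choices $f_1(x) = 0$, $f_2(x) = \sin(x)$, and training points at $x = k\pi$ are illustrative stand-ins, not anything prescribed by the original post):

```python
import numpy as np

# Two "problems" that share the exact same training dataset: both functions
# are 0 at every training point x = k*pi, so D = {(k*pi, 0)} in both cases.
f1 = lambda x: np.zeros_like(x)   # underlying function in example 1
f2 = np.sin                       # underlying function in example 2

train_x = np.pi * np.arange(5)    # training inputs: 0, pi, 2pi, 3pi, 4pi
train_y = np.zeros_like(train_x)  # training targets are 0 for BOTH problems

# A model fit only to (train_x, train_y) cannot depend on which underlying
# function generated the data; a perfect fit here is just the constant 0.
model = lambda x: np.zeros_like(x)

assert np.allclose(model(train_x), train_y)  # perfect fit: training loss = 0

# Error as we move a distance eps away from a training point (take x = pi).
for eps in [0.1, 0.05, 0.025]:
    x = np.array([np.pi + eps])
    err1 = abs(model(x) - f1(x)).item()  # stays 0: f1's linear term is 0 too
    err2 = abs(model(x) - f2(x)).item()  # ~= eps: grows linearly with distance
    print(f"eps={eps:<6} err vs f1: {err1:.4f}   err vs f2: {err2:.4f}")
```

The error against $f_2$ halves each time $\epsilon$ halves, i.e. it is first-order in $\epsilon$, even though the training loss is exactly 0 in both problems.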


On rereading your original argument:

since the loss is minimized, the gradient is also zero.

I think this is referring to $\nabla_\theta L = 0$, which is certainly true for a perfectly optimized model (or even just settled gradient descent). Maybe that's where the miscommunication is stemming from, since "gradient of loss" is being overloaded between the discussion of optimization (which uses $\nabla_\theta L$), and the discussion of Taylor-expanding the error around a training point $x_i$ (which uses $\nabla_x$).
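
To make the two gradients concrete, here is a toy sketch, assuming a one-parameter stand-in model $m_\theta(x) = \theta$ and the same sinusoid example as above (both are my illustrative choices, not the setup from the original post):

```python
import numpy as np

# Toy setup: training data D = {(k*pi, 0)}, true function f = sin,
# and a one-parameter model m_theta(x) = theta (a flat line at height theta).
train_x = np.pi * np.arange(5)
train_y = np.zeros_like(train_x)

def train_loss(theta):
    # MSE on the training set -- the thing gradient descent actually minimizes.
    return np.mean((theta - train_y) ** 2)

def residual(x, theta=0.0):
    # Signed error m_theta(x) - f(x) against the true function at input x.
    return theta - np.sin(x)

h = 1e-5
theta_star = 0.0  # the optimum: fits every training point exactly

# Gradient w.r.t. the PARAMETER, at the optimum: this really is 0.
dL_dtheta = (train_loss(theta_star + h) - train_loss(theta_star - h)) / (2 * h)

# Gradient w.r.t. the INPUT, at a training point x = pi: this is the linear
# term of the Taylor expansion of the error around that point, and it is not 0.
x0 = np.pi
dres_dx = (residual(x0 + h) - residual(x0 - h)) / (2 * h)

print(f"d(train loss)/d(theta) at the optimum: {dL_dtheta:.6f}")  # ~0
print(f"d(residual)/dx at a training point:    {dres_dx:.6f}")    # ~1
```

Both are "gradients of the loss", but one is taken with respect to $\theta$ and the other with respect to $x$; only the former is guaranteed to vanish at an optimum.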

Since we have perfectly fit the training data, at the training data point, the loss is zero; and since the loss is minimized, the gradient is also zero.


The linear term in this expansion is not actually 0. There is no mechanism fitting the model to the linear approximation of the data around the training points. The model is only fit to the (0th order) value at the training points.

To correctly state the above sentence:

  • "perfectly fit the training data""the loss is 0 (at training points)"  "(training) loss is minimized"  "gradient (of test loss around a training point) is 0"


To prove this point:

  • Take two problems which have the same value at the training points but with wildly different linear terms around them. A model perfectly fit to the training points would not be able to distinguish the two.

To see this visually:

  • Check out this plot, which quotes the above sentence from this post as a footnote, but brilliantly demonstrates the contradiction visually: see how error increases linearly around training points (it's using interpolation between points, but the same point holds for a piecewise constant function)
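
If the linked plot is unavailable, something like this rough matplotlib sketch reproduces the same qualitative picture (again using my illustrative sinusoid example; the linked plot's details may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

# Reproduce the qualitative picture: a model fit only to the training points
# (here the constant 0, fit to data at x = k*pi) and its error against sin.
xs = np.linspace(0, 4 * np.pi, 1000)
train_x = np.pi * np.arange(5)

model = np.zeros_like(xs)           # perfect fit to the training data
error = np.abs(model - np.sin(xs))  # |m(x) - f(x)|

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(7, 5))
ax1.plot(xs, np.sin(xs), label="true function f(x) = sin(x)")
ax1.plot(xs, model, label="model (fit to training points only)")
ax1.scatter(train_x, np.zeros_like(train_x), color="k", zorder=3,
            label="training points")
ax1.legend()

ax2.plot(xs, error, color="r")
ax2.set_title("error |model - f|: linear (V-shaped) growth around each training point")
ax2.set_xlabel("x")
plt.tight_layout()
plt.show()
```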