Yes, that's precisely what I'm claiming!
Sorry if that wasn't clear. As for how to establish that, I proposed an intuitive justification:
There is no mechanism fitting the model to the linear approximation of the data around the training points.
And an outline for a proof:
Take two problems which have the same value at the training points but with wildly different linear terms around them. A model perfectly fit to the training points would not be able to distinguish the two.
Let's walk through an example:
1. Let's fit $f(x)$. We collect training data $\{(x_i, f(x_i))\}$ and train a model $m_\theta$ until it fits perfectly, i.e. $m_\theta(x_i) = f(x_i)$ at every training point.
2. Consider a second example. Let's fit $g(x)$, chosen so that $g(x_i) = f(x_i)$ at every training point but $g'(x_i) \neq f'(x_i)$. Again, we collect training data $\{(x_i, g(x_i))\}$, which is exactly the same dataset as before.
The model is trained on $\{(x_i, y_i)\}$, and is independent of $f$ itself. That means even if by happy accident our optimization procedure achieves $m_\theta'(x_i) = f'(x_i)$, we can prove that it is not generally true by considering an identical training dataset with a different underlying function $g$ (and knowing our optimization must result in the same model).
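Here is a minimal numerical sketch of that argument (my own illustration; the concrete choices $f(x) = 0$ and $g(x) = \sin(2\pi x)$ are hypothetical stand-ins, not from the original discussion):

```python
import numpy as np

# Two target functions that agree at every training point
# x = 0, 1, ..., 5 but have very different derivatives there.
def f(x):
    return np.zeros_like(x)       # f(x) = 0,           f'(n) = 0

def g(x):
    return np.sin(2 * np.pi * x)  # g(x) = sin(2*pi*x), g'(n) = 2*pi

xs = np.arange(6.0)

# sin(2*pi*n) = 0 at every integer n (up to floating-point error),
# so the two problems produce numerically identical training sets.
assert np.allclose(f(xs), g(xs))

# Any deterministic training procedure therefore returns the same
# model for both problems. That single model has one derivative at
# each x_i, so it cannot simultaneously match f'(x_i) = 0 and
# g'(x_i) = 2*pi. Training never saw the difference.
print("f'(0) =", 0.0, "  g'(0) =", 2 * np.pi)
```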
On rereading your original argument:
since the loss is minimized, the gradient is also zero.
I think this is referring to $\nabla_\theta L = 0$, which is certainly true for a perfectly optimized model (or even just settled gradient descent). Maybe that's where the miscommunication is stemming from, since "gradient of loss" is being overloaded between the discussion of optimization (which uses $\nabla_\theta$) and the discussion of Taylor-expanding around a training point $x_i$ (which uses $\nabla_x$).
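To make the two senses of "gradient" concrete, here is a hypothetical toy case (my own construction, not from the thread): a linear model $m_\theta(x) = \theta_1 x + \theta_0$ perfectly fit to $f(x) = \sin(2\pi x)$ at $x \in \{0, 1\}$ has $\nabla_\theta L = 0$, while the model's derivative with respect to the input disagrees with $f'$:

```python
import numpy as np

f = lambda x: np.sin(2 * np.pi * x)
xs = np.array([0.0, 1.0])
ys = f(xs)                 # both values are (numerically) zero

def loss(t0, t1):
    # Squared-error loss of the linear model m(x) = t1 * x + t0.
    return np.sum((t1 * xs + t0 - ys) ** 2)

t0, t1 = 0.0, 0.0          # the perfect fit: m(x) = 0
eps = 1e-6

# Gradient of the loss w.r.t. the parameters (central differences).
# This is the gradient that optimization drives to zero at a minimum.
dL_dt0 = (loss(t0 + eps, t1) - loss(t0 - eps, t1)) / (2 * eps)
dL_dt1 = (loss(t0, t1 + eps) - loss(t0, t1 - eps)) / (2 * eps)
print(dL_dt0, dL_dt1)      # both ~0

# Derivative of the model w.r.t. the input at a training point.
# Nothing in training constrains this to match f'(0) = 2*pi.
print("m'(0) =", t1, "  f'(0) =", 2 * np.pi)
```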
Since we have perfectly fit the training data, at the training data point, the loss is zero; and since the loss is minimized, the gradient is also zero.
The linear term in the Taylor expansion of the error is not actually 0. There is no mechanism fitting the model to the linear approximation of the data around the training points. The model is only fit to the (0th-order) value at the training points.
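Spelled out (in my notation, with model $m_\theta$ and target $f$), the Taylor expansion of the error around a training point $x_i$ is

$$m_\theta(x) - f(x) = \underbrace{\bigl(m_\theta(x_i) - f(x_i)\bigr)}_{=\,0\ \text{after a perfect fit}} + \underbrace{\bigl(m_\theta'(x_i) - f'(x_i)\bigr)}_{\text{unconstrained by training}}\,(x - x_i) + O\bigl((x - x_i)^2\bigr)$$

Perfect fitting zeroes only the first brace, so away from $x_i$ the error generically grows linearly in $|x - x_i|$.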
To correctly state the above sentence: since we have perfectly fit the training data, at each training point the loss is zero; and since the loss is minimized, the gradient of the loss with respect to the parameters, $\nabla_\theta L$, is also zero. Nothing about that constrains the gradient with respect to the input, $\nabla_x$.
To prove this point: take two problems which have the same value at the training points but with wildly different linear terms around them. A model perfectly fit to the training points would not be able to distinguish the two.
To see this visually:
Check out this plot, which quotes the above sentence from this post as a footnote, but brilliantly demonstrates the contradiction visually: see how the error increases linearly around the training points (it uses interpolation between points, but the same point holds for a piecewise-constant function).
Oh hi! I linked your video in another comment without noticing this one. Great visual explanation!