Allow me to quote directly from the book:
"The training sample of size m is then used to compute the n-fold cross-validation error R_CV(θ) for a small number of possible values of θ. θ is next set to the value θ_0 for which R_CV(θ) is smallest and the algorithm is trained with the parameter setting θ_0 over the full training sample of size m."
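For concreteness, here is a minimal sketch of the procedure the passage describes. It is not the book's code. It assumes ridge regression as the learning algorithm, with θ as the regularization strength, and the function names (`fit_ridge`, `cv_error`, `select_and_refit`) and the synthetic data are made up for illustration.

```python
import numpy as np

def fit_ridge(X, y, theta):
    """Parameters w are computed from the hyperparameter theta and the data."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + theta * np.eye(d), X.T @ y)

def cv_error(X, y, theta, n_folds=5):
    """n-fold cross-validation error: mean squared error on each held-out
    fold, averaged over folds. Assumes the rows are already in random order."""
    indices = np.arange(len(y))
    errors = []
    for held_out in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, held_out)
        w = fit_ridge(X[train], y[train], theta)
        errors.append(np.mean((X[held_out] @ w - y[held_out]) ** 2))
    return float(np.mean(errors))

def select_and_refit(X, y, thetas, n_folds=5):
    """Set theta_0 to the candidate with the smallest CV error, then
    retrain with theta_0 on the full sample of size m."""
    scores = [cv_error(X, y, t, n_folds) for t in thetas]
    theta_0 = thetas[int(np.argmin(scores))]
    return theta_0, fit_ridge(X, y, theta_0)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=60)

thetas = [0.01, 0.1, 1.0, 10.0, 100.0]  # "a small number of possible values"
theta_0, w = select_and_refit(X, y, thetas)
```

Note that the held-out folds are only ever used to score θ. The final `w` comes from a fresh fit on all m points with θ fixed at `theta_0`.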
So, I use cross-validation to choose a model. Then I use the same data to train the model. Insanity ensues.
Besides, even cross-validation for model selection is suspicious. Shouldn't I, ideally, train all the models on all the data and form a posterior over the most probable values?
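One way to write that suggestion down, as a sketch: assume a prior p(θ) over the candidate values and a conditional prior p(w | θ) over the parameters. The symbols w, D, and p are introduced here for illustration and are not taken from the book.

```latex
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\sum_{\theta'} p(D \mid \theta')\, p(\theta')},
\qquad
p(D \mid \theta) = \int p(D \mid w, \theta)\, p(w \mid \theta)\, dw
```

Here D is the full training sample, so every candidate θ is scored on all the data through its marginal likelihood, with no held-out split.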
"So, I use cross-validation to choose a model. Then I use the same data to train the model. Insanity ensues."
Why? A model has two components: the hyperparameters and the parameters. The hyperparameters are inputs to the model, and the parameters are calculated from the hyperparameters and the training data. (This is very similar to the approach taken in what are called 'hierarchical Bayesian models.')
Instead of pulling a prior out of thin air for the hyperparameters, this asks the question "which hyperparameters are best for generalizing models to test sets out...