
MrMind comments on Open Thread, Jun. 29 - Jul. 5, 2015 - Less Wrong Discussion

5 Post author: Gondolinian 29 June 2015 12:14AM




Comment author: MrMind 30 June 2015 08:13:32AM 0 points [-]

Allow me to quote directly from the book:

The training sample of size m is then used to compute the n-fold cross-validation error R_CV(θ) for a small number of possible values of θ. θ is next set to the value θ_0 for which R_CV(θ) is smallest, and the algorithm is trained with the parameter setting θ_0 over the full training sample of size m.

So, I use cross-validation to choose a model. Then I use the same data to train the model. Insanity ensues.

Besides, even cross-validation for model selection is suspicious. Shouldn't I, ideally, train all models with all the data and form a posterior over the most probable values?

Comment author: Vaniver 30 June 2015 04:42:09PM *  1 point [-]

So, I use cross-validation to choose a model. Then I use the same data to train the model. Insanity ensues.

Why? A model has two components: the hyperparameters and the parameters. The hyperparameters are inputs to the model, and the parameters are calculated from the hyperparameters and the training data. (This approach is very similar to what are called 'hierarchical Bayesian models.')

Instead of pulling a prior out of thin air for the hyperparameters, this asks the question "which hyperparameters are best for generalizing models to test sets outside the training set?", which is a different question from "which parameters maximize the likelihood of this data?"

(I should add that some people call it 'cross-tuning' when you report a model whose hyperparameters have been selected by this sort of process and no third dataset, untouched during tuning, is used for testing. Standard practice in ML is to still refer to it as 'cross-validation.')

Besides, even cross-validation for model selection is suspicious. Shouldn't I, ideally, train all models with all the data and form a posterior over the most probable values?

If you do this, how will you get an estimate of how well your model is able to predict outside of the training set?

But once they do have the hyperparameters in place, that is exactly what they do: they fit the model on the full training data, so that they can make the most use of all of it.
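The two-step procedure from the quoted passage can be sketched in a few lines. This is a minimal pure-Python illustration, not the book's example: the 1-D ridge model, the λ grid, and the synthetic data are all my own assumptions, chosen only to show "pick θ_0 by minimizing R_CV(θ), then retrain on the full sample of size m":

```python
import random

def fit_ridge(xs, ys, lam):
    # Closed-form 1-D ridge regression: w = sum(x*y) / (sum(x^2) + lam).
    # lam is the hyperparameter theta; w is the parameter fit from data.
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cv_error(xs, ys, lam, n_folds=5):
    # n-fold cross-validation error R_CV(theta): train on n-1 folds,
    # score on the held-out fold, average over the n folds.
    m = len(xs)
    total = 0.0
    for k in range(n_folds):
        held_out = set(range(k, m, n_folds))
        tr_x = [x for i, x in enumerate(xs) if i not in held_out]
        tr_y = [y for i, y in enumerate(ys) if i not in held_out]
        te_x = [x for i, x in enumerate(xs) if i in held_out]
        te_y = [y for i, y in enumerate(ys) if i in held_out]
        w = fit_ridge(tr_x, tr_y, lam)
        total += mse(w, te_x, te_y)
    return total / n_folds

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + random.gauss(0, 0.1) for x in xs]

# Step 1: set theta_0 to the grid value minimizing R_CV(theta).
grid = [0.01, 0.1, 1.0, 10.0]
best_lam = min(grid, key=lambda lam: cv_error(xs, ys, lam))

# Step 2: with theta_0 fixed, retrain on the full training sample of size m.
w_final = fit_ridge(xs, ys, best_lam)
print(best_lam, w_final)
```

The point of the final refit is the one made above: the cross-validation folds answer "which hyperparameter generalizes best?", and once that question is settled, there is no reason to throw away part of the training sample when fitting the parameters.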