
V_V comments on "Human-level control through deep reinforcement learning" - computer learns 49 different games - Less Wrong Discussion

Post author: skeptical_lurker 26 February 2015 06:21AM




Comment author: V_V 27 February 2015 04:13:43PM 1 point

Regardless, it's amazing how simple DNNs are. People have been working on computer vision and AI for about 60 years, and then a program like this comes along that is only around 500 lines of code and conceptually simple enough to explain to anyone with a reasonable mathematical background, but can nevertheless beat humans at a reasonable range of tasks.

Beware: there is a lot of non-obvious complexity in these models.
"Traditional" machine learning models (e.g. logistic regression, SVMs, random forests) have only a few hyperparameters and are not terribly sensitive to their values, so you can usually tune them coarsely and quickly.
These fancy deep neural networks can easily have tens, if not hundreds, of hyperparameters, and they are often quite sensitive to them. A bad choice can easily make your training procedure stop making progress (insufficient capacity/vanishing gradients), diverge (exploding gradients), or converge to something that doesn't generalize well to unseen data (overfitting).
Finding a good choice of hyperparameters can be a non-trivial optimization problem in its own right (and a combinatorial one, since many of these hyperparameters are discrete and you can't really expect model performance to depend monotonically on their values).
Unfortunately, in these DNN papers, especially the "better than humans" ones, hyperparameter values often seem to appear out of nowhere.
There are some research and tools for doing this systematically, but they are not often discussed in the papers presenting novel architectures and results.
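To make the combinatorial point concrete, here is a toy sketch (all hyperparameter names, values, and the scoring function below are made up for illustration, not taken from any paper): even a tiny grid explodes multiplicatively, which is why budget-limited strategies like random search are used instead of exhaustive enumeration.

```python
import itertools
import random

# Hypothetical hyperparameter grid; names and values are illustrative only.
grid = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [16, 32, 64],
    "num_layers": [2, 3, 4],
    "activation": ["sigmoid", "relu"],
}

def evaluate(config):
    """Stand-in for an expensive training run returning a validation score.
    In reality each call is hours or days of training."""
    return (100 * config["learning_rate"]
            + config["batch_size"] / 64
            + config["num_layers"] / 4
            + (1.0 if config["activation"] == "relu" else 0.5))

# Exhaustive search is combinatorial: 3 * 3 * 3 * 2 = 54 configurations here,
# and real networks have far more knobs than this.
all_configs = [dict(zip(grid, values))
               for values in itertools.product(*grid.values())]

# Random search evaluates only a budget-limited sample instead.
random.seed(0)
sample = random.sample(all_configs, k=10)
best = max(sample, key=evaluate)
```

Non-monotonic, discrete objectives like this are exactly why you can't just gradient-descend on the hyperparameters themselves.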

Comment author: skeptical_lurker 28 February 2015 02:40:12PM 1 point

SVMs are pretty bad for hyperparameters too; if you want a simple model, use random forests or naive Bayes.

I struggle to see how DNNs can have hundreds of hyperparameters. Looking at the code for the paper I linked to, they seem to have the learning rate, two parameters for simulated annealing, the weight cost, and the batch size. That's 5, not counting a few others that only apply to reinforcement-learning DNNs. Admittedly, there is also the choice of sigmoid vs. rectilinear units, and of the number of neurons, layers, and epochs, but these last few are largely determined by what hardware you have and how much time you are willing to spend training.

Having skimmed the paper you linked to, it seems they have hundreds of hyperparameters because they are using a rather more complex network topology, with SVMs fitting the neuron activations to the targets. And that's interesting in itself.

Unfortunately, in these DNN papers, especially the "better than humans" ones, hyperparameter values often seem to appear out of nowhere.

The general problem of hyperparameter values is one of the things that worries me about academia. So you have an effect (p<0.05). How many hyperparameter values did you try? 20? When I was in academia, I was given the code for one of the papers my work was based on. They had tried 66 different models. 66. And they got a p<0.05 result with a model which, according to their paper, contained a second-order probability estimate that was not confined to the interval [0, 1]. It turned out they had simply bounded it to [0, 1], so they could have a second-order probability estimate of 1, but not >1, which is an improvement, I suppose.

Oh, and this paper was published in Nature.

There are some research and tools for doing this systematically, but they are not often discussed in the papers presenting novel architectures and results.

I'd be surprised if this could work with DNNs. AFAIK, Monte Carlo optimization, for instance, generally takes thousands of evaluation steps, yet with DNNs each evaluation step would require days of training, so it would take thousands of GPU-days. Indeed, the paper you linked to ran 1200 evaluations, so I'm guessing they had a lot of hardware.
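The back-of-the-envelope arithmetic behind that guess, as a sketch (the per-run training time and cluster size below are my assumptions, not figures from the paper):

```python
# Rough cost of a brute-force hyperparameter search over full training runs.
evaluations = 1200   # search budget, as in the linked paper
days_per_run = 2     # assumed wall-clock time per full training run
gpus = 100           # assumed cluster size

gpu_days = evaluations * days_per_run  # 2400 GPU-days of total compute
wall_clock_days = gpu_days / gpus      # 24 days on the assumed cluster
```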

Comment author: V_V 05 March 2015 06:17:31PM 1 point

SVMs are pretty bad for hyperparameters too

How so? A linear SVM's main hyperparameter is the regularization coefficient. There is also the choice of loss and regularization penalty, but these are only a couple of bits.
A non-linear SVM also has the choice of kernel (in practice it's either RBF or polynomial, unless you are working on special types of data such as strings or trees) and one or two kernel hyperparameters.
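For comparison, the RBF-SVM search space really is tiny; a typical log-spaced grid (the ranges below follow common convention and are an assumption, not a prescription) is small enough to search exhaustively:

```python
# A conventional log-spaced grid over the two RBF-SVM hyperparameters.
Cs = [10 ** k for k in range(-2, 4)]      # regularization: 1e-2 .. 1e3
gammas = [10 ** k for k in range(-4, 2)]  # kernel width:   1e-4 .. 1e1

grid = [(C, gamma) for C in Cs for gamma in gammas]
# 6 * 6 = 36 configurations: cheap enough for exhaustive cross-validation.
```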

I struggle to see how DNNs can have hundreds of hyperparameters. Looking at the code for the paper I linked to, they seem to have the learning rate, two parameters for simulated annealing, the weight cost, and the batch size. That's 5, not counting a few others that only apply to reinforcement-learning DNNs. Admittedly, there is also the choice of sigmoid vs. rectilinear units, and of the number of neurons, layers, and epochs,

I haven't read the whole paper, but at a glance you have: the number of convolutional layers, the number of non-convolutional layers, the number of nodes in each non-convolutional layer, and, for each convolutional layer, the number of filters, the filter size, and the stride. There are also 16 other hyperparameters described here.
You could also count the preprocessing strategy.

Other papers have even more hyperparameters (max-pooling layers, each with a window size; dropout layers, each with a dropout rate; layer-wise regularization coefficients; and so on).
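A quick tally shows how those knobs add up with depth; the layer counts below are hypothetical, purely to illustrate that the total grows with every layer you add:

```python
# Each layer type contributes its own hyperparameters, so the total grows
# with depth. All layer counts here are illustrative, not from any paper.
layers = (
    [("conv", 3)] * 5       # filters, filter size, stride per conv layer
    + [("pool", 1)] * 3     # window size per max-pooling layer
    + [("dropout", 1)] * 2  # dropout rate per dropout layer
    + [("dense", 2)] * 2    # units + regularization coefficient per layer
)
n_hyper = sum(knobs for _, knobs in layers)  # 15 + 3 + 2 + 4 = 24
# ...and that is before counting any of the optimizer's own settings.
```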