jkrause comments on "Human-level control through deep reinforcement learning" - computer learns 49 different games - Less Wrong

11 Post author: skeptical_lurker 26 February 2015 06:21AM

Comments (19)

Comment author: skeptical_lurker 26 February 2015 12:56:16PM 3 points

I saw this paper before, and maybe I'm being an idiot but I didn't understand this:

"Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change."

I thought one generally trained the networks layer by layer, so layer n would be completely finished training before layer n+1 starts. Then there is no problem of "the distribution of each layer's inputs changes" because the inputs are fixed once training starts.

Admittedly, this is a problem if you don't have all the training data to start off with and want to learn incrementally, but AFAICT that is not generally the case in these benchmarking contests.

Regardless, it's amazing how simple DNNs are. People have been working on computer vision and AI for about 60 years, and then a program like this comes along which is only around 500 lines of code, conceptually simple enough to explain to anyone with a reasonable mathematical background, but can nevertheless beat humans at a reasonable range of tasks.

Comment author: jkrause 26 February 2015 05:33:23PM 5 points

Training networks layer by layer was the trend from the mid to late 2000s up until early 2012, but that changed in mid 2012 when Alex Krizhevsky and Geoff Hinton finally got neural nets to work for large-scale tasks in computer vision. They simply trained the whole network jointly with stochastic gradient descent, which has remained the case for most neural nets in vision since then.

Comment author: skeptical_lurker 26 February 2015 08:32:36PM 3 points

Really? I was under the impression that training the whole network with gradient descent was impossible, because the propagated error becomes infinitesimally small. In fact, I thought that training layers individually was the insight that made DNNs possible.

Do you have a link about how they managed to train the whole network?

Comment author: jkrause 26 February 2015 09:09:26PM 5 points

That was indeed one of the hypotheses about why it was difficult to train the networks - the vanishing gradient problem. In retrospect, one of the main reasons why this happened was the use of saturating nonlinearities in the network -- nonlinearities like the logistic function or tanh, which flatten out at their limits (0 and 1 for the logistic function, -1 and 1 for tanh). Because they flatten out, their derivatives are very small whenever a unit's input is large in magnitude, and even at its peak the logistic derivative is only 0.25, so the deeper your network the more this effect compounds. The first large-scale network that fixed this was by Krizhevsky et al., which used a Rectified Linear Unit (ReLU) for their nonlinearity, given by f(x) = max(0, x). The earliest reference I can find to using ReLUs is Jarrett et al., but since Krizhevsky's result pretty much everyone uses ReLUs (or some variant thereof). In fact, the first result I've seen showing that logistic/tanh nonlinearities can work is the batch normalization paper Seanoh linked, which gets around the problem by normalizing the input to the nonlinearity, which presumably prevents the units from saturating too much (though this is still an open question).
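
To make the compounding concrete, here is a rough sketch in plain Python. It is only an illustration under simplifying assumptions: it multiplies the activation derivatives along a single path through the network and ignores the weight matrices entirely, and the helper names (sigmoid_grad, relu_grad, chained_gradient) are made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the logistic function; it peaks at 0.25 when z == 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of f(x) = max(0, x): 1 for active units, 0 otherwise.
    return 1.0 if z > 0 else 0.0

def chained_gradient(grad_fn, pre_activations):
    # Product of activation derivatives along one path (weights ignored).
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product

depth = 20
# Best case for the logistic function: every unit sits at z = 0.
sigmoid_factor = chained_gradient(sigmoid_grad, [0.0] * depth)
# An active ReLU path: every unit has a positive input.
relu_factor = chained_gradient(relu_grad, [1.0] * depth)

print(sigmoid_factor)  # 0.25 ** 20, roughly 9e-13
print(relu_factor)     # 1.0
```

Even in the logistic function's best case the factor shrinks by 4x per layer, while an active ReLU path passes the gradient through unchanged. In a real network the weights also scale the gradient, so this shows only the contribution of the nonlinearity.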

Comment author: V_V 27 February 2015 03:48:40PM 0 points

"I was under the impression that training the whole network with gradient descent was impossible, because the propagated error becomes infinitesimally small."

If you do it naively, yes. But researchers figured out how to attack that problem from multiple angles: from the choice of the non-linear activation function, to specifics of the optimization algorithm, to the random distribution used to sample the initial weights.

"Do you have a link about how they managed to train the whole network?"

The batch normalization paper cited above is one example of that.
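
For anyone wondering what "normalizing the input to the nonlinearity" looks like, here is a minimal sketch of just the core step, assuming nothing more than subtracting the batch mean and dividing by the batch standard deviation. It leaves out the learned scale and shift parameters and the statistics the paper uses at test time, and the function names and sample values are invented for illustration.

```python
import math

def sigmoid_grad(z):
    # Derivative of the logistic function at pre-activation z.
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def batch_normalize(values, eps=1e-5):
    # Shift and scale one unit's pre-activations over a mini-batch
    # to roughly zero mean and unit variance.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

# Hypothetical pre-activations of one unit across a batch of four
# examples, all deep in the saturated region of the logistic function.
raw = [8.0, 10.0, 12.0, 14.0]
normalized = batch_normalize(raw)

grads_raw = [sigmoid_grad(z) for z in raw]
grads_normalized = [sigmoid_grad(z) for z in normalized]

print(max(grads_raw))         # well below 1e-3: the unit is saturated
print(min(grads_normalized))  # above 0.1: the unit is back in its responsive range
```

After normalization the inputs are centered around zero, where the logistic derivative is largest, which is consistent with the guess above that the technique keeps units from saturating.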