Regardless, it's amazing how simple DNNs are. People have been working on computer vision and AI for about 60 years, and then a program like this comes along that is only around 500 lines of code, conceptually simple enough to explain to anyone with a reasonable mathematical background, and yet it can beat humans at a reasonable range of tasks.
I get the impression it's a hardware issue. See for example http://nautil.us/issue/21/information/the-man-who-tried-to-redeem-the-world-with-logic - McCulloch & Pitts invented neural networks almost before digital computers existed,* and he was working on "three-dimensional neural networks". They didn't invent backpropagation, I don't think, but even if they had, how would they have run, much less trained, the state-of-the-art many-layer neural networks with millions of nodes and billions of connections like we're seeing these days? What those 60 years of work get you is a lot of specialized algorithms which don't reach human parity but at least were computable on the hardware of the day.
* Depends on what exactly you count as the first digital computer, and on how far before the key publication you date their breakthrough.
Really? I was under the impression that training the whole network with gradient descent was impossible, because the propagated error becomes infinitesimally small. In fact, I thought that training layers individually was the insight that made DNNs possible.
Do you have a link about how they managed to train the whole network?
That was indeed one of the hypotheses about why it was difficult to train the networks - the vanishing gradient problem. In retrospect, one of the main reasons this happened was the use of saturating nonlinearities in the network - functions like the logistic or tanh that flatten out toward an asymptote (1 for the logistic, ±1 for tanh). Because they saturate, their derivatives end up being very small over most of their range, and the deeper your network, the more this effect compounds.

The first large-scale network that fixed this was Krizhevsky et al.'s, which used the Rectified Linear Unit (ReLU), f(x) = max(0, x), as its nonlinearity. The earliest reference I can find to using ReLUs is Jarrett et al., but since Krizhevsky's result pretty much everyone uses ReLUs (or some variant thereof). In fact, the first result I've seen showing that logistic/tanh nonlinearities can work is the batch normalization paper Seanoh linked, which gets around the problem by normalizing the input to the nonlinearity, presumably preventing the units from saturating too much (though this is still an open question).
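To make the compounding concrete, here's a minimal numpy sketch (my own illustration, not from any of the papers above). It compares the per-layer derivative factors that the chain rule multiplies together for a logistic unit versus a ReLU; the depth and pre-activation values are made up for the example, and weight matrices are ignored to isolate the nonlinearity's role:

    import numpy as np

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def logistic_deriv(x):
        s = logistic(x)
        return s * (1.0 - s)          # peaks at 0.25 (x = 0), tiny once |x| is large

    def relu_deriv(x):
        return (x > 0).astype(float)  # exactly 1 for any active (positive) unit

    depth = 30                                           # number of stacked layers (made up)
    rng = np.random.default_rng(0)
    pre_acts = np.abs(rng.normal(2.0, 1.0, size=depth))  # mildly saturated, all positive

    # The gradient reaching the first layer carries one derivative factor per layer:
    print(np.prod(logistic_deriv(pre_acts)))  # astronomically small - the gradient vanishes
    print(np.prod(relu_deriv(pre_acts)))      # 1.0 - the gradient passes through intact

In this framing, batch normalization attacks the same product from the other end: by keeping each layer's input normalized, it keeps logistic/tanh units near the steep part of the curve, where their derivatives are largest.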