All of jkrause's Comments + Replies

That was indeed one of the hypotheses about why it was difficult to train the networks - the vanishing gradient problem. In retrospect, one of the main reasons why this happened was the use of saturating nonlinearities in the network -- nonlinearities like the logistic function or tanh, which flatten out toward their asymptotes for large inputs. Because they saturate, their derivatives end up being really small, and the deeper your network the more this effect compounds. The first large-scale network that fixed this was by Krizhevsky et al., which used a Rectified Linear Unit (ReLU) for... (read more)
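To make the saturation point concrete, here's a minimal numerical sketch (my own illustration, assuming NumPy; not part of the original comment) of how the derivatives of the logistic function and tanh shrink while ReLU's does not:

```python
# A rough illustration (not from the original comment) of why saturating
# nonlinearities shrink gradients while ReLU does not.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])

d_sigmoid = sigmoid(x) * (1.0 - sigmoid(x))  # at most 0.25, ~0 once |x| is large
d_tanh = 1.0 - np.tanh(x) ** 2               # at most 1, ~0 once |x| is large
d_relu = (x > 0).astype(float)               # exactly 1 for every positive input

# Backprop multiplies roughly one such factor per layer, so with sigmoids the
# gradient shrinks like 0.25**n even in the best case.
n = 10
print("best-case sigmoid factor after 10 layers:", d_sigmoid.max() ** n)   # ~1e-6
print("ReLU factor after 10 layers (active units):", d_relu.max() ** n)    # 1.0
```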

Can confirm that hardware (and data!) are the two main culprits here. The actual learning algorithms haven't changed much since the mid 1980s, but computers have gotten many times faster, GPUs are 30-100x faster still, and the amount of data has similarly increased by several orders of magnitude.

My layperson's understanding is that this is the first time human accuracy has been exceeded on the ImageNet benchmarking challenge, and that it represents an advance over the progress the Chinese giant Baidu reported last month, which I understood to be significant in its own right. http://arxiv.org/abs/1501.02876

One thing to note about the human-accuracy number for ImageNet that's been going around a lot recently: it comes from a relatively informal experiment done by a couple of members of the Stanford vision lab (see section 6.4 of the paper for details)... (read more)

2Houshalter
Here is the guy who measured his own accuracy on ImageNet: https://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/ Getting 5.1% error was really hard; it takes a lot of time to get familiar with the classes and to sort through reference images. The 3% error figure was an entirely hypothetical, optimistic estimate for a group of humans that make no mistakes. If you want to appreciate the difficulty, you can try the task yourself here: http://cs.stanford.edu/people/karpathy/ilsvrc/

Training networks layer by layer was the trend from the mid-to-late 2000s up until early 2012, but that changed in mid 2012 when Alex Krizhevsky and Geoff Hinton finally got neural nets to work for large-scale tasks in computer vision. They simply trained the whole network jointly with stochastic gradient descent, and that has remained the standard approach for most neural nets in vision since then.
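For concreteness, here's a minimal sketch of what joint end-to-end training with SGD looks like (a toy example assuming PyTorch and random stand-in data -- not Krizhevsky et al.'s actual code): every layer gets gradients from the same loss on every step, with no layer-wise pretraining.

```python
# Minimal joint-training sketch (toy data, not AlexNet): the whole network is
# optimized together with stochastic gradient descent.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(64, 784)            # stand-in minibatch of "images"
    y = torch.randint(0, 10, (64,))     # stand-in labels
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                     # gradients flow through every layer at once
    opt.step()
```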

5skeptical_lurker
Really? I was under the impression that training the whole network with gradient descent was impossible, because the propagated error becomes infinitesimally small. In fact, I thought that training layers individually was the insight that made DNNs possible. Do you have a link about how they managed to train the whole network?

Yes, this happens to me in Windows, but not Ubuntu (both Chrome).

Here's one interesting way of viewing it that I once read:

Suppose that the option you chose, rather than being a single trial, were actually 1,000 trials. Then, risk averse or not, Option 5 is clearly the best approach. The only difficulty, then, is that we're considering a single trial in isolation. However, when you consider all such risks you might encounter over a long period of time (e.g. your life), the situation becomes much closer to the 1,000-trial case, and so you should always take the highest expected value option (unless the amounts involved are absolutely huge, as others have pointed out).
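If it helps, here's a quick simulation of that intuition with two hypothetical options (the payoffs are my own stand-ins, not the ones from the original problem): a guaranteed $100 versus a 10% chance at $1,500, repeated 1,000 times.

```python
# Small simulation (hypothetical payoffs, not the original problem's options)
# showing that over many trials the higher-expected-value gamble dominates,
# even though any single risky trial usually pays nothing.
import random

def safe():
    return 100.0                                      # expected value: $100

def risky():
    return 1500.0 if random.random() < 0.10 else 0.0  # expected value: $150

trials = 1000
total_safe = sum(safe() for _ in range(trials))
total_risky = sum(risky() for _ in range(trials))
print(f"safe total:  ${total_safe:,.0f}")   # always $100,000
print(f"risky total: ${total_risky:,.0f}")  # about $150,000 almost every time
```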