Comment author: skeptical_lurker 26 February 2015 08:32:36PM 3 points [-]

Really? I was under the impression that training the whole network with gradient descent was impossible, because the propagated error becomes infinitesimally small. In fact, I thought that training layers individually was the insight that made DNNs possible.

Do you have a link about how they managed to train the whole network?

Comment author: jkrause 26 February 2015 09:09:26PM 5 points [-]

That was indeed one of the hypotheses about why it was difficult to train the networks -- the vanishing gradient problem. In retrospect, one of the main reasons this happened was the use of saturating nonlinearities in the network -- nonlinearities like the logistic function or tanh, which asymptote at 1. Because they asymptote, their derivatives end up being very small, and the deeper your network the more this effect compounds. The first large-scale network that fixed this was by Krizhevsky et al., which used the Rectified Linear Unit (ReLU), given by f(x) = max(0, x), as its nonlinearity. The earliest reference I can find to using ReLUs is Jarrett et al., but since Krizhevsky's result pretty much everyone uses ReLUs (or some variant thereof). In fact, the first result I've seen showing that logistic/tanh nonlinearities can work in deep networks is the batch normalization paper Sean_o_h linked, which gets around the problem by normalizing the input to the nonlinearity, which presumably prevents the units from saturating too much (though this is still an open question).
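As a toy illustration (a rough sketch of my own, not from any of these papers, and ignoring the weight matrices entirely): backpropagation multiplies the nonlinearity's local derivative at each layer, so a saturating unit's small derivative decays geometrically with depth, while a ReLU contributes exactly 1 wherever it is active.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)          # at most 0.25; tiny once |x| is large

    def relu_grad(x):
        return 1.0 if x > 0 else 0.0  # exactly 1 on the active side

    depth = 20
    x = 2.5  # a pre-activation in the sigmoid's saturating region
    sig_chain = np.prod([sigmoid_grad(x) for _ in range(depth)])
    relu_chain = np.prod([relu_grad(x) for _ in range(depth)])
    print("sigmoid gradient factor after %d layers: %.1e" % (depth, sig_chain))  # vanishingly small
    print("ReLU gradient factor after %d layers: %.1e" % (depth, relu_chain))    # 1.0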

Comment author: gwern 26 February 2015 06:01:47PM *  9 points [-]

Regardless, it's amazing how simple DNNs are. People have been working on computer vision and AI for about 60 years, and then a program like this comes along which is only around 500 lines of code, conceptually simple enough to explain to anyone with a reasonable mathematical background, but can nevertheless beat humans at a reasonable range of tasks.

I get the impression it's a hardware issue. See for example http://nautil.us/issue/21/information/the-man-who-tried-to-redeem-the-world-with-logic - McCulloch & Pitts invented neural networks almost before digital computers existed*, and Pitts was working on "three-dimensional neural networks". They didn't invent backpropagation, I don't think, but even if they had, how would they have run, much less trained, the state-of-the-art many-layer neural networks with millions of nodes and billions of connections like we're seeing these days? What those 60 years of work get you is a lot of specialized algorithms which don't reach human parity but at least are computable on the hardware of their day.

* depends on what exactly you consider the first digital computer and how long before the key publication you date their breakthrough.

Comment author: jkrause 26 February 2015 06:42:00PM 8 points [-]

Can confirm that hardware (and data!) are the two main culprits here. The actual learning algorithms haven't changed much since the mid 1980s, but computers have gotten many times faster, GPUs are 30-100x faster still, and the amount of data has similarly increased by several orders of magnitude.

Comment author: Sean_o_h 26 February 2015 12:34:01PM 5 points [-]

They've also released their code (for non-commercial purposes): https://sites.google.com/a/deepmind.com/dqn/

In other interesting news, a paper released this month describes a way of 'speeding up' neural net training, with an approach that achieves 4.9% top-5 validation error on ImageNet. My layperson's understanding is that this is the first time human accuracy has been exceeded on the ImageNet benchmarking challenge, and represents an advance on Chinese giant Baidu's progress reported last month, which I understood to be significant in its own right. http://arxiv.org/abs/1501.02876

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe, Christian Szegedy

(Submitted on 11 Feb 2015 (v1), last revised 13 Feb 2015 (this version, v2))

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
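For intuition, here's a rough sketch (my own illustration, not the paper's code) of the normalization the abstract describes, applied to one mini-batch at training time; gamma and beta stand in for the learned scale and shift parameters from the paper:

    import numpy as np

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        """x: a mini-batch of layer inputs, shape (batch_size, num_features)."""
        mean = x.mean(axis=0)                    # per-feature mean over the mini-batch
        var = x.var(axis=0)                      # per-feature variance over the mini-batch
        x_hat = (x - mean) / np.sqrt(var + eps)  # normalize each feature
        return gamma * x_hat + beta              # learned scale and shift

    # Example: a mini-batch of 32 examples, 100 features, far from zero mean / unit variance
    x = np.random.randn(32, 100) * 5.0 + 3.0
    out = batch_norm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
    print(out.mean(axis=0)[:3], out.std(axis=0)[:3])  # roughly zero mean, unit std per feature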

Comment author: jkrause 26 February 2015 05:47:33PM *  5 points [-]

My layperson's understanding is that this is the first time human accuracy has been exceeded on the ImageNet benchmarking challenge, and represents an advance on Chinese giant Baidu's progress reported last month, which I understood to be significant in its own right. http://arxiv.org/abs/1501.02876

One thing to note about the human-accuracy number for ImageNet that's been going around a lot recently: it came from a relatively informal experiment done by a couple of members of the Stanford vision lab (see section 6.4 of the paper for details). In particular, the number everyone cites came from just one person, who, although he spent quite a while training himself to recognize the ImageNet categories, was nonetheless prone to silly mistakes from time to time. A more optimistic estimate of human error is probably closer to 3-4%, but even with that in mind the recent results people have been posting are still extremely impressive.

It's also worth pointing out another paper from Microsoft Research that beat the 5.1% human performance and actually came out a few days before Google's. It's a decent read, and I wouldn't be surprised if people start incorporating elements from both MSR's and Google's papers in the near future.

Comment author: skeptical_lurker 26 February 2015 12:56:16PM 3 points [-]

I saw this paper before, and maybe I'm being an idiot but I didn't understand this:

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change.

I thought one generally trained the networks layer by layer, so layer n would be completely finished training before layer n+1 starts training. Then there is no problem of "the distribution of each layer's inputs changes", because each layer's inputs are fixed by the time it starts training.

Admittedly, this is a problem if you don't have all the training data to start off with and want to learn incrementally, but AFAICT that is not generally the case in these benchmarking contests.

Regardless, it's amazing how simple DNNs are. People have been working on computer vision and AI for about 60 years, and then a program like this comes along which is only around 500 lines of code, conceptually simple enough to explain to anyone with a reasonable mathematical background, but can nevertheless beat humans at a reasonable range of tasks.

Comment author: jkrause 26 February 2015 05:33:23PM 5 points [-]

Training networks layer by layer was the trend from the mid to late 2000s up until early 2012, but that changed in mid 2012 when Alex Krizhevsky and Geoff Hinton finally got neural nets to work for large-scale tasks in computer vision. They simply trained the whole network jointly with stochastic gradient descent, which has remained the case for most neural nets in vision since then.
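Schematically, "training the whole network jointly" just means every layer's weights receive gradients from the same backward pass on each mini-batch, rather than being frozen one layer at a time. Here's a minimal toy sketch (my own illustration, not Krizhevsky et al.'s actual setup) of a two-layer ReLU network trained end-to-end with SGD on a made-up regression task:

    import numpy as np

    np.random.seed(0)
    W1 = 0.1 * np.random.randn(10, 32)   # first-layer weights
    W2 = 0.1 * np.random.randn(32, 1)    # second-layer weights
    lr = 0.01

    for step in range(1000):
        x = np.random.randn(64, 10)      # a fresh mini-batch of inputs
        y = x[:, :1] ** 2                # toy regression target
        h = np.maximum(0.0, x.dot(W1))   # ReLU hidden layer
        pred = h.dot(W2)
        err = pred - y                   # gradient of 0.5 * squared error w.r.t. pred
        # Backpropagate through BOTH layers and update them together (joint training).
        grad_W2 = h.T.dot(err) / len(x)
        grad_W1 = x.T.dot(err.dot(W2.T) * (h > 0)) / len(x)
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
        if step % 200 == 0:
            print(step, float(np.mean(err ** 2)))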

Comment author: Benito 18 March 2014 06:15:20PM 16 points [-]

When I hit discussion, it keeps automatically redirecting me to the 'top posts' even when I click back onto 'new'. Is anyone else getting this?

Comment author: jkrause 19 March 2014 03:39:40AM 1 point [-]

Yes, this happens to me in Windows, but not Ubuntu (both Chrome).

Comment author: fluchess 12 February 2014 03:03:45AM 2 points [-]

I participated in an economics experiment a few days ago, and one of the tasks was as follows. Choose one of the following gambles, where each outcome has 50% probability:

Option 1: $4 definitely
Option 2: $6 or $3
Option 3: $8 or $2
Option 4: $10 or $1
Option 5: $12 or $0

I chose option 5, as it has the highest expected value. Asymptotically this is the best option, but for a single trial is it still the best option?

Comment author: jkrause 12 February 2014 07:16:37AM 9 points [-]

Here's one interesting way of viewing it that I once read:

Suppose that the option you chose, rather than being a single trial, were actually 1,000 trials. Then, risk averse or not, Option 5 is clearly the best approach. The only difficulty, then, is that we're considering a single trial in isolation. However, when you consider all such risks you might encounter in a long period of time (e.g. your life), then the situation becomes much closer to the 1,000 trial case, and so you should always take the highest expected value option (unless the amounts involved are absolutely huge, as others have pointed out).
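For concreteness, here's a quick sketch (my own, using the option payoffs from the comment above) that computes the expected values and simulates the 1,000-trial version, where Option 5's total almost certainly comes out on top:

    import numpy as np

    # Each option: (outcome_a, outcome_b), each occurring with probability 0.5.
    options = {
        1: (4, 4),    # $4 for certain
        2: (6, 3),
        3: (8, 2),
        4: (10, 1),
        5: (12, 0),
    }

    for name, (a, b) in sorted(options.items()):
        print("Option %d: expected value $%.2f" % (name, 0.5 * a + 0.5 * b))
    # Option 5 comes out highest, at $6.00 per trial.

    # Repeat each gamble 1,000 times: Option 5's total is almost certainly the largest.
    np.random.seed(0)
    totals = {name: int(np.random.choice(pair, size=1000).sum()) for name, pair in options.items()}
    print(totals)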