Wei_Dai comments on The Brain as a Universal Learning Machine - Less Wrong
How does this "seed" find the correct high-level sensory features to plug into? How can it wire complex high-level behavioral programs (such as courtship behaviors) to low-level motor programs learned by unsupervised learning?
This seems unlikely.
But long multiplication is something that you were taught in school, which most humans wouldn't be able to discover independently. And you are certainly not aware of how your brain performs visual recognition; the little you know was discovered through experiments, not introspection.
Not so fast.
The Atari DRL agent learns a good mapping between short windows of frames and button presses. It has some generalization capability, which enables it to achieve human-level or sometimes even superhuman performance on games that are based on eye-hand coordination (after all, it's not burdened by the intrinsic delays that occur in the human body), but it has no reasoning ability and fails miserably at any game that requires planning ahead more than a few frames.
Despite the name, no machine learning system, "deep" or otherwise, has been demonstrated to be able to efficiently learn any provably deep function (in the sense of boolean circuit depth-complexity), such as the parity function, which any human of average intelligence could learn from a small number of examples.
I see no particular reason to believe that this could be solved by just throwing more computational power at the problem: you can't fight exponentials that way.
UPDATE:
Now it seems that Google DeepMind managed to train even feed-forward neural networks to solve the parity problem. See my other comment down-thread.
I had a guess that recurrent neural networks can solve the parity problem, which Google confirmed. See http://cse-wiki.unl.edu/wiki/index.php/Recurrent_neural_networks where it says:
See also PyBrain's parity learning RNN example.
The algorithm I was referring to can be easily represented by an RNN with one hidden layer of a few nodes; the difficult part is learning it from examples.
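To make the "easy to represent" half of that claim concrete, here is a minimal sketch of a recurrent computation with hand-set weights that computes parity. It assumes hard-threshold units and a two-node hidden layer fed by the current bit and the previous state; the weights and the names `step` and `rnn_parity` are illustrative choices, not taken from any code linked in this thread, and nothing here is learned.

```python
def step(z):
    """Hard-threshold activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0


def rnn_parity(bits):
    """Run a hand-wired recurrent net over a bit sequence and return the final state."""
    state = 0  # recurrent state: parity of the bits seen so far
    for x in bits:
        h_or = step(state + x - 0.5)      # hidden unit 1: fires if state OR x
        h_and = step(state + x - 1.5)     # hidden unit 2: fires if state AND x
        state = step(h_or - h_and - 0.5)  # XOR = OR and not AND
    return state


print(rnn_parity([1, 0, 1, 1]))  # three ones, so the parity is 1
```

The point of the sketch is only that the target function fits in a tiny recurrent network; finding such weights by gradient descent from input-output pairs is the hard part.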
The examples for the n-parity problem are input-output pairs where each input is an n-bit binary string and its corresponding output is a single bit representing the parity of that string.
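As a small illustration of that problem statement, the following sketch enumerates the full example set for a given n. The function name `parity_examples` is made up for this illustration.

```python
from itertools import product


def parity_examples(n):
    """Return every (input, label) pair: an n-bit tuple and its single parity bit."""
    return [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=n)]


for bits, label in parity_examples(3):
    print(bits, label)
```

A learner would normally see only a sample of these pairs, since the full set has 2^n entries.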
In the code you linked, if I understand correctly, however, they solve a different machine learning problem: here the examples are input-output pairs where both the inputs and the outputs are n-bit binary strings, with the i-th output bit representing the parity of the input bits up to the i-th one.
It may look like a minor difference, but actually it makes the learning problem much easier, and in fact it basically guides the network to learn the right algorithm:
the network can first learn how to solve parity on 1 bit (identity), then parity on 2 bits (xor), and so on. Since the network is very small and has an ideal architecture for that problem, after learning how to solve parity for a few bits (perhaps even two) it will generalize to arbitrary lengths.
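To show what this richer supervision looks like, here is a minimal sketch that builds the per-position targets, where the i-th output bit is the parity of the input bits up to position i. The name `prefix_parity_targets` is hypothetical and this is not the PyBrain example itself.

```python
from itertools import accumulate
from operator import xor


def prefix_parity_targets(bits):
    """Return the running parity: element i is the XOR of bits[0..i]."""
    return list(accumulate(bits, xor))


print(prefix_parity_targets([1, 0, 1, 1]))  # [1, 1, 0, 1]
```

The last target equals the parity of the whole string, but every earlier target leaks the intermediate state of the serial algorithm, which is exactly what the single-bit version of the problem withholds.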
By using this kind of supervision I bet you can also train a feed-forward neural network to solve the problem: use a training set as above except with the input and output strings presented as n-dimensional vectors rather than sequences of individual bits and make sure that the network has enough hidden layers.
If you use a specialized architecture (e.g. decrease the width of the hidden layers as their depth increases and connect the i-th output node to the i-th hidden layer) it will learn quite efficiently, but if you use a more standard architecture (hidden layers of constant width and output layer connected only to the last hidden layer) it will probably also work, although you will need quite a few training examples to avoid overfitting.
The parity problem is artificial, but it is a representative case of problems that necessarily ( * ) require a non-trivial number of highly non-linear serial computation steps. In a real-world case (a planning problem, maybe), we wouldn't have access to the internal state of a reference algorithm to use as supervision signals for the machine learning system. The machine learning system will have to figure out the algorithm on its own, and current approaches can't do it in a general way, even for relatively simple algorithms.
You can read the (much more informed) opinion of Ilya Sutskever on the issue here (Yoshua Bengio also participated in the comments).
( * at least for polynomial-time execution, since you can always get constant depth at the expense of an exponential blow-up of parallel nodes)
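One way to see the trade-off in that footnote is the naive depth-2 construction: an OR over AND terms, one term per odd-weight input assignment. The sketch below builds it and counts the terms; the names `parity_dnf_terms` and `eval_dnf` are made up for this illustration, and it only demonstrates this particular construction, not a lower bound.

```python
from itertools import product


def parity_dnf_terms(n):
    """One AND term per n-bit assignment containing an odd number of ones."""
    return [a for a in product((0, 1), repeat=n) if sum(a) % 2 == 1]


def eval_dnf(terms, bits):
    """Depth-2 evaluation: OR over AND terms, each matching one exact assignment."""
    return int(any(all(b == t for b, t in zip(bits, term)) for term in terms))


for n in range(1, 9):
    print(n, len(parity_dnf_terms(n)))  # the term count doubles with each extra bit
```

For n bits this construction needs 2^(n-1) AND terms, so the depth stays constant while the width grows exponentially.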
Your comments made me curious enough to download PyBrain and play around with the sample code, to see if I could modify it to learn the parity function without intermediate parity bits in the output. In the end, I was able to, by trial and error, come up with hyperparameters that allowed the RNN to learn the parity function reliably in a few minutes on my laptop (many other choices of hyperparameters caused the SGD to sometimes get stuck before it converged to a correct solution). I've posted the modified sample code here. (Notice that the network now has 2 input nodes, one for the input string and one to indicate end of string, 2 hidden layers with 3 and 2 nodes, and an output node.)
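For readers without the linked code, here is a rough sketch of how a sequence could be presented to a network with two input nodes and one output node. It assumes the end-of-string flag is raised on the same time step as the final bit and that only the final step carries a target; the actual encoding in the modified sample may differ, and `encode_sequence` is a hypothetical helper, not a PyBrain function.

```python
def encode_sequence(bits):
    """Return per-step (bit, end_flag) input pairs and the single final-step target."""
    last = len(bits) - 1
    # Assumption: the end-of-string flag is 1 only on the last time step.
    inputs = [(b, 1 if i == last else 0) for i, b in enumerate(bits)]
    target = sum(bits) % 2  # parity of the whole string, no intermediate bits
    return inputs, target


print(encode_sequence([1, 0, 1]))  # ([(1, 0), (0, 0), (1, 1)], 0)
```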
I guess you're basically correct on this, since even with the tweaked hyperparameters, on the parity problem RNN+SGD isn't really doing any better than a brute force search through the space of simple circuits or algorithms. But humans arguably aren't very good at learning algorithms from input/output examples either. The fact that RNNs can learn the parity function, even if barely, makes it less clear that humans have any advantage at this kind of learning.
Nice work!
Anyway, in a paper published on arXiv yesterday, the Google DeepMind people report being able to train a feed-forward neural network to solve the parity problem, using a sophisticated gating mechanism and weight sharing between the layers. They also obtain state of the art or near state of the art results on other problems.
This result makes me update upward my belief in the generality of neural networks.
Ah you beat me to it, I just read that paper as well.
Here is the abstract for those that haven't read it yet:
Also, relevant to this discussion:
The version of the problem that humans can learn well is this easier reduction. Humans cannot easily learn the hard version of the parity problem, which would correspond to a rapid test where the human is presented with a flash card with a very large number on it (60+ digits to rival the best machine result) and then must respond immediately. The fast response requirement is important to prevent the use of much easier multi-step serial algorithms.
That is the most cogent, genuinely informative explanation of "Deep Learning" that I've ever heard. Most especially so regarding the bit about linear correlations: we can learn well on real problems with nothing more than stochastic gradient descent because the feature data may contain whole hierarchies of linear correlations.