Artificial Neural Networks (ANNs) are built around the backpropagation algorithm. The backpropagation algorithm lets you perform gradient descent on a network of neurons. When we feed training data through an ANN, backpropagation tells us how the weights should change.
ANNs are good at inference problems. Biological Neural Networks (BNNs) are good at inference too. ANNs are built out of neurons. BNNs are built out of neurons too. It makes intuitive sense that ANNs and BNNs might be running similar algorithms.
There is just one problem: BNNs are physically incapable of running the backpropagation algorithm.
We do not know quite enough about biology to say it is impossible for BNNs to run the backpropagation algorithm. However, "a consensus has emerged that the brain cannot directly implement backprop, since to do so would require biologically implausible connection rules"[1].
The backpropagation algorithm has three steps (sketched in code after the list).
- Flow information forward through a network to compute a prediction.
- Compute an error by comparing the prediction to a target value.
- Flow the error backward through the network to update the weights.
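Here is a minimal sketch of those three steps on a toy two-layer network. The layer sizes, tanh activation, squared-error loss, and learning rate are illustrative assumptions, not anything specified in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer network; sizes, activation, loss, and learning rate
# are illustrative assumptions.
W1 = rng.normal(scale=0.1, size=(3, 4))    # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(4, 2))    # hidden -> output weights
x = rng.normal(size=(1, 3))                # one training example
target = rng.normal(size=(1, 2))           # the value we want the network to predict
lr = 0.1

# 1. Flow information forward through the network to compute a prediction.
h = np.tanh(x @ W1)
prediction = h @ W2

# 2. Compute an error by comparing the prediction to the target value.
error = prediction - target                # gradient of 0.5 * squared error w.r.t. the prediction

# 3. Flow the error backward through the network to update the weights.
grad_W2 = h.T @ error
hidden_error = (error @ W2.T) * (1 - h ** 2)   # chain rule back through the tanh layer
grad_W1 = x.T @ hidden_error
W2 -= lr * grad_W2
W1 -= lr * grad_W1
```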
The backpropagation algorithm requires information to flow forward and backward along the network. But biological neurons are one-directional. An action potential travels from the cell body down the axon to the axon terminals and on to another cell's dendrites. An action potential never travels backward from a cell's terminals to its body.
Hebbian theory
Predictive coding is the idea that BNNs generate a mental model of their environment and then transmit only the information that deviates from this model. Predictive coding considers error and surprise to be the same thing. Hebbian theory is a specific mathematical formulation of predictive coding.
Predictive coding is biologically plausible. It operates locally. There are no separate prediction and training phases which must be synchronized. Most importantly, it lets you train a neural network without sending action potentials backwards.
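To make "operates locally" concrete, here is a minimal sketch of a single connection in a predictive-coding network. The sizes, tanh activation, and learning rate are assumptions for illustration; the point is that the weight update uses only the presynaptic activity and the error computed right at that connection, with no signal propagated back from deeper layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# One connection in a predictive-coding network (sizes are arbitrary).
pre = rng.normal(size=(1, 4))           # activity of the lower layer
post = rng.normal(size=(1, 3))          # activity (value node) of the upper layer
W = rng.normal(scale=0.1, size=(4, 3))  # weights predicting the upper layer from the lower one

prediction = np.tanh(pre @ W)   # what the lower layer predicts the upper layer's activity to be
error = post - prediction       # the "surprise" -- the only signal that needs to be transmitted

# Local, Hebbian-flavoured weight update: presynaptic activity times the local error
# (scaled by the activation's slope). Nothing from deeper layers is required.
lr = 0.01
W += lr * pre.T @ (error * (1 - prediction ** 2))
```

In a full network, the value node `post` would also be nudged by its own error and by the error of the layer above it, which is still a purely local signal.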
Predictive coding is also easier to implement in hardware. It is locally defined; it parallelizes better than backpropagation; it continues to function when you cut its substrate in half. (A corpus callosotomy, which severs the connection between the brain's hemispheres, is used to treat epilepsy.) Digital computers break when you cut them in half. Predictive coding is something evolution could plausibly invent.
Unification
The paper Predictive Coding Approximates Backprop Along Arbitrary Computation Graphs[1:1] "demonstrate[s] that predictive coding converges asymptotically (and in practice rapidly) to exact backprop gradients on arbitrary computation graphs using only local learning rules." The authors have unified predictive coding and backpropagation into a single theory of neural networks. Predictive coding and backpropagation are separate hardware implementations of what is ultimately the same algorithm.
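To see what that correspondence looks like numerically, here is a small self-contained sketch (not the authors' code; the two-layer architecture, tanh activation, target, and step sizes are all assumptions). It computes backprop gradients the usual way, then clamps the output of a predictive-coding network to the target, lets the hidden value node settle under its local inference dynamics, and reads weight gradients off the settled prediction errors. The two sets of gradients agree closely here; the paper spells out the conditions under which the correspondence becomes exact.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer network; sizes, activation, and step sizes are assumptions.
x0 = rng.normal(size=(1, 3))                   # input (clamped)
W1 = rng.normal(scale=0.1, size=(3, 4))
W2 = rng.normal(scale=0.1, size=(4, 2))

h_ff = np.tanh(x0 @ W1)                        # feedforward hidden activity
y_ff = h_ff @ W2                               # feedforward prediction
target = y_ff + 0.01                           # a target near the prediction

# --- Backprop gradients of L = 0.5 * ||y - target||^2 ---
e_out = y_ff - target
gW2_bp = h_ff.T @ e_out
gW1_bp = x0.T @ ((e_out @ W2.T) * (1 - h_ff ** 2))

# --- Predictive coding: clamp the output value node to the target and let
# --- the hidden value node settle under purely local error signals. ---
x1 = h_ff.copy()                               # hidden value node, starts at its feedforward value
x2 = target                                    # output value node, clamped
mu1 = np.tanh(x0 @ W1)                         # prediction of the hidden layer (fixed during inference)

gamma = 0.2                                    # inference step size (assumption)
for _ in range(200):
    eps1 = x1 - mu1                            # local error at the hidden layer
    eps2 = x2 - x1 @ W2                        # local error at the output layer
    x1 -= gamma * (eps1 - eps2 @ W2.T)         # relax the value node using only neighbouring signals
eps1 = x1 - mu1
eps2 = x2 - x1 @ W2

# Weight gradients of the predictive-coding energy F = 0.5*(||eps1||^2 + ||eps2||^2),
# computed from purely local quantities.
gW2_pc = -x1.T @ eps2
gW1_pc = -x0.T @ (eps1 * (1 - mu1 ** 2))

# The local predictive-coding gradients closely track the backprop gradients.
print(np.max(np.abs(gW2_pc - gW2_bp)), np.max(np.abs(gW2_bp)))
print(np.max(np.abs(gW1_pc - gW1_bp)), np.max(np.abs(gW1_bp)))
```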
There are two big implications of this.
- This paper permanently fuses artificial intelligence and neuroscience into a single mathematical field.
- This paper opens up possibilities for neuromorphic computing hardware.
[1] Source is available on arXiv.
I guess I was thinking: Brains use predictive coding, and predictive coding is basically backprop, so brains can't be using something dramatically better than backprop. Are you objecting to the "brains use predictive coding" step? Or are you objecting that only one particular version of predictive coding is basically backprop?
Are you referring to Solomonoff Induction and the like? I think "brains use more data-efficient algorithms" is an obvious hypothesis but not an obvious conclusion--there are several competing hypotheses, outlined above. (And I think the evidence against it is mounting, this being one of the key pieces.)
In terms of bits/pixels/etc., humans see plenty of data in their lifetime, a bit more than the scaling laws would predict IIRC. But the scaling laws (as interpreted by Ajeya, Rohin, etc.) are about the amount of subjective time the model needs to run before you can evaluate the result. If we assume for humans it's something like 1 second on average (because our brains are evaluating-and-updating weights etc. on about that timescale), then we have a mere 10^9 data points--a few decades is only about 10^9 seconds--which is something like 4 OOMs less than the scaling laws would predict. If instead we think it's longer, then the gap in data-efficiency grows.
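(Spelling out that arithmetic in a quick sketch, using the same assumptions as above--a few decades of learning, one update per subjective second, and the stated ~4-OOM shortfall:)

```python
# Back-of-the-envelope for the figures above; the 30-year learning window and
# the 4-OOM gap are the assumptions stated in the comment, not measured values.
seconds_per_year = 365 * 24 * 3600                        # ~3.2e7
lifetime_datapoints = 30 * seconds_per_year               # ~9.5e8, i.e. roughly 10^9
implied_scaling_law_need = lifetime_datapoints * 10 ** 4  # 4 OOMs more
print(f"{lifetime_datapoints:.1e} data points vs ~{implied_scaling_law_need:.1e} expected")
```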
Some issues though. One, the scaling laws might not be the same for all architectures. Maybe if your context window is bigger, or you use recurrence, or whatever, the laws are different. Too early to tell, at least for me (maybe others have more confident opinions; I'd love to hear them!). Two, some data is higher-quality than other data, and plausibly human data is higher-quality than the stuff GPT-3 was fed--e.g. humans deliberately seek out data that teaches them stuff they want to know, instead of just dully staring at a firehose of random stuff. Three, it's not clear how to apply this to humans anyway. Maybe our neurons are updating a hundred times a second or something.
I'd be pretty surprised if a human-brain-sized Transformer was able to get as good as a human at most important human tasks simply by seeing a firehose of 10^9 images or context windows of internet data. But I'd also be pretty surprised (10%) if the scaling laws turn out to be so universal that we can't get around them; if it turns out that transformative tasks really do require a NN at least the size of a human brain trained for at least 10^14 steps or so where each step involves running the NN for at least a subjective week. (Subjective second, I'd find more plausible. Or subjective week (or longer) but with fewer than 10^14 steps.)