If I'm reading this correctly, you think Numenta is comparable to regular neural nets, and you think spiking neural nets are more compute efficient than Numenta? Do spiking neural nets actually outperform neural nets? Can we run this comparison?
(This seems so unlikely because I'd have expected people to use spiking neural nets instead of regular neural nets if this were true)
I think spiking neural nets are at least 1 OOM, probably more like 2 OOMs, more compute intensive to train; similarly effective; and somewhat more efficient at learning from data. I think Numenta is probably even harder to train and even more data efficient. I can certainly test these hypotheses at small scale. I'll let you know what I find.
If you think spiking neural nets are more compute intensive, then why does this matter? It seems like we'd just get AGI faster with regular neural nets? (I think compute is more likely to be the bottleneck than data, so the data efficiency doesn't seem that relevant.)
Perhaps you think that if we use spiking neural nets, then we only need to train it for 15 human-years-equivalent to get AGI (similar to the Lifetime anchor in the bio anchors report), but that wouldn't be true if we used regular neural nets? Seems kinda surprising.
Maybe you think that the Lifetime anchor in bio anchors is the best anchor to use and so you have shorter timelines?
I wrote this specifically aimed at the case of "in the thought experiment where humanity got 12 orders of magnitude more compute this year, what would happen in the next 12 months?" I liked the post that Daniel wrote about that and wanted to expand on it. My claim is that even if everything that was mentioned in that post was tried and failed, that there would still be these things to try. They are algorithms which already exist, which could be scaled up if we suddenly had an absurd amount of compute. Not all arguments about why standard approaches like Transformers fail also apply to these alternate approaches.
Right, but I don't yet understand what you predict happens. Let's say we got 12 OOMs of compute and tried these things. Do we now have AGI? I predict no.
Ah, gotcha. I predict yes, with quite high confidence (like 95%), for 12 OOMs and using the Blue Brain Project. The others I place only small confidence in (maybe 5% each). I really think the BBP has enough detail in its model to make something very like a human neocortex, and capable of being an AGI, if scaled up.
Thanks for doing this! This sort of discussion is fairly important to AGI timelines estimates, I think, because e.g. if we conclude that +12 OOMs would be 80% likely to work given today's ideas... etc. (BTW I think you linked to the wrong post at the beginning, I think you meant to link to this.)
I'm not convinced yet. Rohin makes some good objections below and then also it would help if you explained how these compute estimates convert into probability-that-it-would-work estimates. What are the ways you can think of that Blue Brain with +12 OOMs wouldn't work? Having enumerated those ways, how likely are each of them? And how likely is it that there are other ways you haven't thought of?
Yes, I'll fix that link [edit: fixed]. I have not yet thought hard about failure modes and probabilities for these cases. I can work on that and let you know what I come up with.
So I would argue that all of the main contenders are very training data efficient compared to artificial neural nets. I'm not going to go into detail on that argument, unless people let me know that that seems cruxy to them and they'd like more detail.
I'm not sure I get this enough for it to even be a crux, but what's the intuition behind this?
My guess for your argument is that you see it as analogous to the way a CNN beats out a fully-connected one at image recognition, because it cuts down massively on the number of possible models, compatibly with the known structure of the problem.
But that raises the question, why are these biology-inspired networks more likely to be better representations of general intelligence than something like transformers? Genuinely curious what you'll say here.
(Wisdom of evolution only carries so much weight for me, because the human brain is under constraints like collocation of neurons that prevent evolution from building things that artificial architectures can do.)
I don't think they are better representations of general intelligence. I'm quite confident that much better representations of general intelligence exist and just have yet to be discovered. I'm just saying that these are closer to a proven path, and although they are inefficient and unwise, somebody would likely follow these paths if suddenly given huge amounts of compute this year. And in that imaginary scenario, I predict they'd be pretty effective.
My reasoning for saying this for the Blue Brain Project is that I've read a lot of their research papers and understand their methodology pretty well, and I believe they've got really good coverage of a lot of details. I'm like 97% confident that whatever 'special sauce' allows the human brain to be an effective general intelligence, BBP has already captured it in their model. I think they've captured every detail they could justify as being possibly slightly important, so I think they've also captured a lot of unnecessary detail. I think this is bad for interpretability and compute efficiency. I don't recommend this path, I just believe it fulfills the requisites of the thought experiment on 12 OOMs of compute magically appearing.
What's the difference between flp and flop? Or is that a typo / abbreviation?
How do your numbers compare to the numbers in Joe Carlsmith's report? For example, the number I've had in my head comes from that report, namely “a real-time human brain simulation might require something like 1e15 FLOP/s, plus or minus a few orders of magnitude”. (See here.)
I was using flp as an abbreviation. And I'll read Joe Carlsmith's report and then let you know what I think.
edit: Oh yeah, and one thing to keep in mind is that these are estimates for if we suddenly had a shockingly big jump in the amount of compute (12 orders of magnitude) but no time to develop or improve existing algorithms. So my estimates for 'what could a reasonably well engineered algorithm, tested and iterated on at scale, do?' would be much, much lower. This is a stupidly wasteful upper bound.
This is a response to a comment made by Rohin Shah on Daniel Kokotajlo's post Fun with +12 OOMs of Compute. I started trying to answer some of his questions and address some of his assumptions, then realized there was a bigger inferential gap that needed filling in. Also, as I attempted to estimate the OOMs of compute above GPT-3/PaLM needed for each method, I realized I was going off of vague guesses rather than grounded estimates based on recent benchmarks. So, since other people might be lacking the same info and be curious about my answer, I decided to put a bit more work into answering and turn it into a full post.
[Edit: 2 years later, and neuron level models continue to be very slow. https://www.frontiersin.org/articles/10.3389/fninf.2020.00016/full ]
Introducing the cast
First, I'd like to note that I don't endorse trying to get to AGI through any of these methods. I think they are potentially worse for interpretability in addition to being less compute efficient. My goal here is to point out that I think it could be done if the world were suddenly given lots more compute. In other words, I shall make the argument that given lots of compute, issues of limited data and potential scaling plateaus of artificial neural nets can be bypassed via other less compute efficient methods. Many roads lead to AGI, and specific predictions about the failure of one specific path (e.g. Transformers) don't necessarily mean all the paths are affected by that predicted failure mode.
The main contenders

The three main contenders I have in mind, each covered in more detail below, are the Blue Brain Project's detailed brain-tissue simulation, Numenta's neocortex-inspired algorithms, and spiking neural nets.
Supporting roles
The field of computational neuroscience has generated a great many narrowly focused models of particular subsystems of many different brains. None of these is likely, on its own, to turn into a full-blown AGI if you throw compute at it, but they contain useful additional details that could potentially get the main contenders unstuck from unexpected scaling plateaus.
Emulation
By brain emulation, I mean trying to make a model that captures some of the observed functions of brain circuits. These models vary widely in how much fidelity to fine details they strive for versus settling for a more abstracted approximation. More detail brings the risk that you got one of those details wrong, and also potentially requires exponentially more compute to scale. Less detail means more reliance on having made the correct abstractions.
Neuroscientists have a failure mode of making their models too accurate and detailed. After all, if you've spent years of your life painstakingly measuring the tiny details, it can be really hard to swallow the idea that you might have to discard some of those details as irrelevant. I think Jan sums it up well in this comment:
Each of the three contenders I mentioned has settled on a very different level of detail and a different set of abstractions.
What do these three main contenders have in common? A focus on the mammalian neocortex, the part of the brain that does the General Intelligence stuff, the part that humans have extra of. Neuroscience has lots of evidence showing that this is the critical part of the brain to emulate if you want a model that is able to reason abstractly about things. I won't go into depth here, but I will give you this quote from Numenta (see Jeff Hawkins' latest book for more depth, or this paper for a quick intro):
It's worth noting that each of these projects focuses on the neocortex. The Blue Brain Project, although it works with rodent brains, is only a few well-understood parameter changes away from being a very accurate emulation of the human neocortex. They are careful not to make that change because of the ethical implications of accurately simulating human neocortex tissue. From things some of the project participants have said, I'm pretty confident they'd love to try simulating a whole human brain if given the compute and freedom from oversight.
For example (emphasis mine) a quote from Rebasing I/O for Scientific Computing: Leveraging Storage Class Memory in an IBM BlueGene/Q Supercomputer by Schürmann et al 2014:
Human cortical neuron properties are fairly well characterized in a lot of respects and can already be simulated on the Blue Brain system; the team is just careful to avoid media hype and outrage by not talking up large-scale human neocortex experiments. An example of a small-scale human cortical neuron experiment: https://live-papers.brainsimulation.eu/#2016-eyal-et-al
How much compute?
So I would argue that all of the main contenders are very training data efficient compared to artificial neural nets. I'm not going to go into detail on that argument, unless people let me know that that seems cruxy to them and they'd like more detail.
One thing these contenders fall short on, though, is compute efficiency. For the sake of Daniel's thought experiment, I'd like to give some rough estimates of how much compute I think would be necessary to train a half-brain-scale version of each of these.
In artificial neural networks, a 'neuron' or 'parameter' corresponds only loosely to a neocortex neuron. For these emulations, the analogy holds together much better. The rough average number of neurons in the human neocortex is around 26 billion, so let's say 13 billion for the half-neocortex case.
Numenta training compute estimate
Ok, I give up for now on finding benchmarks to accurately estimate this one. My rough guess is 'somewhere between the other two, closer to the Spiking Neural Nets'.
Here's the best summary I can give: they break the artificial neurons down into collections of artificial dendrites, which then have very sparse activations and very sparse weights. This seems to help the network learn more from a given dataset, and to extend the amount of information that can be 'fit' into the network without 'overwriting' previous info. The downside is that it's substantially less efficient to 'get' the information into the network in the first place: it needs maybe 10x more epochs over the same dataset before it reaches the level a feed-forward multilayer perceptron would have hit much earlier in training. But its learning doesn't plateau as soon, so it can eventually surpass the roughly equivalent MLP.
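To make the 'collections of artificial dendrites' idea a bit more concrete, here is a minimal toy sketch of a dendrite-gated sparse layer. This is my own illustrative construction, not Numenta's code, and every name and size in it is made up; it just shows the flavor of the mechanism: each unit's feed-forward response gets modulated by its best-matching dendritic segment, and only the top-k units stay active.

```python
import numpy as np

# Toy "active dendrites" layer (illustrative only, not Numenta's implementation).
# Each unit has one feed-forward weight vector plus several dendritic segments.
# The strongest-matching segment gates the unit's activation, and a
# k-winners-take-all step keeps the layer's output sparse.

rng = np.random.default_rng(0)
n_in, n_units, n_segments, k = 64, 128, 8, 16   # made-up sizes

W = rng.normal(0, 0.1, size=(n_units, n_in))              # feed-forward weights
D = rng.normal(0, 0.1, size=(n_units, n_segments, n_in))  # dendritic segment weights

def active_dendrites_forward(x, context):
    ff = W @ x                                     # feed-forward drive for each unit
    seg = np.einsum('usi,i->us', D, context)       # each segment's response to the context signal
    gate = 1.0 / (1.0 + np.exp(-seg.max(axis=1)))  # best-matching segment modulates its unit
    act = ff * gate
    out = np.zeros_like(act)
    winners = np.argsort(act)[-k:]                 # keep only the k most active units
    out[winners] = act[winners]
    return out

y = active_dendrites_forward(rng.normal(size=n_in), rng.normal(size=n_in))
```

The sparse, context-gated activation is what lets different inputs use mostly non-overlapping subsets of units, which is why new information doesn't 'overwrite' old information as quickly.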
Spiking Neural Net training compute estimate
my estimate: 3.82e24 FLOPs
about 1 OOM over GPT-3
less than an OOM over PaLM
For this category, I would add an additional OOM for the fact that the abstraction may be lossy or inefficient in capturing what actual brain neurons do. For instance, I noticed that the benchmark used in the papers had undershot the number of synapses for human prefrontal cortex by an order of magnitude; there could be other issues like that as well.
This is unlike Numenta, where the abstraction is very well thought out and I think it will either totally work or not, depending on whether they are as correct as they think they are about their abstraction.
And unlike Blue Brain, where there is so much accuracy and so little abstraction that I feel quite confident it'll work as expected on an emulated-neuron == real-neuron basis.
Blue Brain training compute estimate
my estimate: 2.37e33 FLOPs
10 OOMs over GPT-3
9 OOMs over PaLM
from https://blog.heim.xyz/palm-training-cost/ :
Sources:
Numenta paper 1
https://numenta.com/assets/pdf/research-publications/papers/Sparsity-Enables-100x-Performance-Acceleration-Deep-Learning-Networks.pdf
This paper uses 8-bit compression of values via a unique mapping scheme, running on FPGAs, so it's hard to compare directly. Their mapping scheme pre-estimates the range of all variables and splits large numbers into lossy quantized representations spread across multiple 8-bit (INT8) numbers during encoding. So to get the equivalent of a FLOP, a floating point operation, you need to do several fixed-point 8-bit operations (FP-8bit-OPs). On average, maybe 4 FP-8bit-OPs per single-precision FLOP?
https://semiengineering.com/tops-memory-throughput-and-inference-efficiency/
What is TOPS? It means Trillions or Tera Operations per Second. It is primarily a measure of the maximum achievable throughput but not a measure of actual throughput. Most operations are MACs (multiply/accumulates), so TOPS = (number of MAC units) x (frequency of MAC operations) x 2
Alveo U250 datasheet says it gets 33.3 INT8 TOPs at peak.
rough guess: divide TOPS by 4 to get a teraFLOPs equivalent for Numenta's specific use case, based on studying their encoding (see the short scratch calculation below)
= 8.325 pseudo-teraFLOPs = 8.325e12 pseudoFLOPs / second
(This is the same scaffold I fill in for the other two contenders below, left mostly blank here because I couldn't pin down the benchmark numbers:)
? bio_seconds took ? wall clock seconds
flops / neuron
flops / neurons = flp/n
flp/n per bio_second
flp/n / ? bio_second = flp/n/s
So, for 1.3e9 neurons of the Cortex+Plasticity simulation type, for 15 bio_years of 'training time':
flops per second of biological time:
15 years of bio time needed for training? = 3.154e7 sec/year * 15 years = 4.73e8 seconds of bio time
total compute needed for training = flp/n/s * 4.78e8 bio_seconds * 1.3e9 neurons = flops
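Here is the TOPS-to-pseudo-FLOPs conversion from above as a small scratch calculation. This is my own scratch code; the divide-by-4 ratio is the rough guess stated above, not a measured figure.

```python
# Convert the Alveo U250's advertised peak INT8 throughput into a rough
# "pseudo-FLOPs" figure for Numenta's encoding scheme.
peak_int8_tops = 33.3      # U250 datasheet peak, in tera-operations per second
int8_ops_per_flop = 4      # rough guess: ~4 fixed-point 8-bit ops per single-precision FLOP

pseudo_flops_per_second = peak_int8_tops * 1e12 / int8_ops_per_flop
print(f"{pseudo_flops_per_second:.3e}")   # ~8.3e12 pseudo-FLOPs per second
```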
Numenta paper 2
Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments
https://arxiv.org/abs/2201.00042
This paper separates the neurons out into collections of artificial dendrites in sparse matrices. Because it's not using FPGAs here, and it does task comparisons against standard multi-layer perceptron feed-forward networks, the compute is easier to compare. They give numbers for the estimated 'effective number of parameters', because the sparse nature of the networks means that the nominal parameter count looks huge but is effectively small in terms of the compute required to train and run inference. Several experiments are listed in the paper.
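As a toy illustration of the 'effective number of parameters' point (my own example; the sizes and sparsity level are made up, not taken from the paper): with sparse weights, only the nonzero entries actually cost memory and compute, so the nominal parameter count overstates the effective size.

```python
import numpy as np

# Nominal vs effective parameter count for a sparse weight matrix (illustrative only).
rng = np.random.default_rng(0)
dense_shape = (4096, 4096)                     # nominal parameter count: ~16.8 million
mask = rng.random(dense_shape) < 0.05          # keep ~5% of the weights nonzero
W = np.where(mask, rng.normal(size=dense_shape), 0.0)

nominal_params = W.size
effective_params = np.count_nonzero(W)
print(nominal_params, effective_params)        # ~16.8M nominal vs ~0.84M effective
```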
SpikingNN paper 1
https://www.sciencedirect.com/science/article/abs/pii/S0925231221003969
full text manuscript: https://www.sciencedirect.com/science/article/am/pii/S0925231221003969
Benchmark system from the paper: Ubuntu 18.04 LTS with an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.1 GHz and 32 GB RAM
SpikingNN paper 2
https://www.nature.com/articles/s41598-019-54957-7
From the caption of a graph in the paper:
10 bio_seconds took 1e4 wall clock seconds,
so 1 bio_second corresponds to 1e3 wall clock seconds for 2.05e7 neurons
flops = cores * (cycles/second) * (flops/cycle) * (wall clock seconds)
flops = (1 node * 3840 cores) * ( 1.6e9 cycles / second) * ( 2 flops / cycle) * 1e3 seconds = 1.229e16
flops / neuron
flops / 2.05e7 neurons = 6.14e6 flp/n
flp/n per bio_second
flp/n / 1 bio_second = 6.14e6 flp/n/s
So, for 1.3e9 neurons of the Cortex+Plasticity simulation type, for 15 bio_years of 'training time':
https://en.wikipedia.org/wiki/FLOPS says 2 flops per cycle per core for single-precision simulations run on the Titan Xp GPU (3840 cores)
flops per second of biological time:
15 years of bio time needed for training? = 3.154e7 sec/year * 15 years = 4.73e8 seconds of bio time
total compute needed for training = 6.14e6 flp/n/s * 4.78e8 bio_seconds * 1.3e9 neurons = 3.82e24 flops
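Written out as scratch code, the final scaling step above looks like this (the per-neuron rate is the benchmark extrapolation from the lines above, and the 15-year 'lifetime' and 1.3e9-neuron half-neocortex scale are this post's assumptions; small differences from the total above are just rounding):

```python
# Scratch re-computation of the spiking-net training estimate's final step.
SECONDS_PER_YEAR = 3.154e7
bio_seconds = 15 * SECONDS_PER_YEAR        # ~4.7e8 seconds of simulated lifetime
flp_per_neuron_per_bio_second = 6.14e6     # from the Titan Xp benchmark extrapolation above
n_neurons = 1.3e9                          # half-neocortex scale used in this post

total_training_flops = flp_per_neuron_per_bio_second * bio_seconds * n_neurons
print(f"{total_training_flops:.2e}")       # ~3.8e24 FLOPs
```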
https://github.com/BindsNET/bindsnet
Blue Brain paper 1
Large-Scale Simulation of Brain Tissue, Blue Brain Project, EPFL
Technical Report for the ALCF Theta Early Science Program
Blue Brain paper 2
CoreNEURON : An Optimized Compute Engine for the NEURON Simulator
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6763692/
From abstract:
From intro:
In general, this paper describes the journey to making the Blue Brain NEURON model more efficient and able to work with GPUs, and then running benchmarking comparisons.
Relevant part of the Table 1 discussed above:
Note: one major parameter difference between human and rodent neocortex is that human neocortex has more synaptic connections per neuron. This hurts scaling somewhat because of the additional complexity. I'm not able to give a precise estimate of this additional compute from the data I've found so far on their work; my guess is somewhat less than 2 OOMs of extra cost in the worst case.
Note for anyone trying to read this paper: a comprehension-gotcha is that they confusingly talk about both 'compute nodes' (the computers or virtual computers used), and 'neuron nodes' (the component parts of a neuron which are each individually simulated each timestep) using just the term 'nodes'. You have to keep the context of the paragraph straight to know which one they mean at any given time.
So, from these two papers, although they don't quite lay out all the parameters together in an easy-to-interpret way...
bbp paper1: 27 seconds of compute time for 0.1 seconds of biological time for 1? neuron(s) on a single compute node? (GPU system)
flops per second of biological time:
bbp paper2: 2.19e5 rodent cortex neurons require 2e3 seconds on 2048 nodes, each node with 16 cores @ 1.6 GHz, for 0.001? seconds of biological time (abbr: bio_second). (Supercomputer baseline, not the GPU measurement.)
flops = cores * (cycles/second) * (flops/cycle) * (wall clock seconds)
flops = (2048 nodes * 16 cores) * ( 1.6e9 cycles / second) * ( 8 flops / cycle) * 2e3 seconds = 8.39e17
flops / neuron
8.39e17 flops / 2.19e5 neurons = 3.83e12 flp/n
flp/n per bio_second
3.83e12 flp/n / 0.001 bio_second = 3.83e15 flp/n/s
So, for 1.3e9 neurons of the Cortex+Plasticity simulation type, for 15 bio_years of 'training time':
https://en.wikipedia.org/wiki/FLOPS says that IBM PowerPC A2 (Blue Gene/Q) gets 8 64bit flops per core per cycle
(The Blue Brain project was so named because it was designed in cooperation with IBM specifically to work with the Blue Gene supercomputer)
flops per second of biological time:
15 years of bio time needed for training? = 3.154e7 sec/year * 15 years = 4.73e8 seconds of bio time
total compute needed for training = 3.82e15 flp/n/s * 4.78e8 bio_seconds * 1.3e9 neurons = 2.37e33 flops = 2.37e18 petaFLOPs
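And the same scratch calculation for the Blue Brain case, this time including the hardware-throughput step from the supercomputer benchmark above (again my own scratch code; small differences from the totals above are just rounding):

```python
# Scratch re-computation of the Blue Brain training estimate from the benchmark numbers above.
cores = 2048 * 16                              # 2048 Blue Gene/Q nodes, 16 cores each
flops_per_second = cores * 1.6e9 * 8           # 1.6 GHz, 8 flops per core per cycle (PowerPC A2)
benchmark_flops = flops_per_second * 2e3       # 2e3 wall-clock seconds for the benchmark run

flp_per_neuron = benchmark_flops / 2.19e5             # 2.19e5 rodent cortex neurons simulated
flp_per_neuron_per_bio_s = flp_per_neuron / 0.001     # the run covered ~0.001 bio_seconds

bio_seconds = 15 * 3.154e7                     # 15 bio_years of 'training time'
total_training_flops = flp_per_neuron_per_bio_s * bio_seconds * 1.3e9   # half-neocortex neurons
print(f"{total_training_flops:.2e}")           # ~2.4e33 FLOPs
```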
other Blue Brain papers:
In-Memory Compression for Neuroscience Applications - Bayly
https://github.com/DevinBayly/gsoc_report/blob/master/report.pdf
Reconstruction and Simulation of Neocortical Microcircuitry
https://www.cell.com/cell/fulltext/S0092-8674(15)01191-5
Side note: Why half-brain?
Because there are multiple sources of evidence for half a human brain being sufficient to instantiate a general reasoning agent.
One of these is the case of hemispherectomy. People with severe seizures have had portions of their brain removed to stop the seizures. This operation can be as extreme as removing an entire hemisphere of the brain. If this happens in childhood, while the brain connections are still highly plastic, then close-to-normal function can be regained.
Another case I know of involved a birth defect resulting in a missing hemisphere.
And yet another way significant brain tissue loss can happen is an ischemic event (oxygen deprivation, followed by the harmful return of blood flow). This tends to be quite bad for older adults, who commonly experience it via strokes, because by then the brain is set in its ways and has a hard time regaining enough plasticity to rewire around the damage. But if it happens to a child (e.g. a partial drowning), recovery is usually quite good (depending on exactly which bits are affected).
I think you could make do with even less than 50% if you were thoughtful about what you cut. Maybe as little as 30%. That's not a necessary condition for this thought experiment though.