I haven’t read the linked post/comment yet, and perhaps I am missing something very obvious, but: we have exaflop computing (that’s 10^18) right now. Is Tim Dettmers really saying that we’re not going to see a 1000x speed-up, in a century or possibly ever? That seems like a shocking claim, and I struggle to imagine what could justify it.
EDIT: I have now read the linked comment; it speaks of fundamental physical limitations such as speed of light, heat dissipation, etc., and says:
These are all hard physical boundaries that we cannot alter. Yet, all these physical boundaries will be hit within a couple of years and we will fall very, very far short of human processing capabilities and our models will not improve much further. Two orders of magnitude of additional capability are realistic, but anything beyond that is just wishful thinking.
I do not find this convincing. Taking the outside view, we can see all sorts of similar predictions of limitations having been made over the course of computing history, and yet Moore’s Law is still going strong despite quite a few years of predictions of imminent trend-crashing. (Take a look at the “Recent trends” and “Alternative materials research” sections of the Wikipedia page; do you really see any indication that we’re about to hit a hard barrier? I don’t…)
Also, these physical limits – insofar as they are hard limits – are limits on various aspects of the impressiveness of the technology, but not on the cost of producing the technology. Learning-by-doing, economies of scale, process-engineering R&D, and spillover effects should still allow for costs to come down, even if the technology itself can hardly be improved.
It is fun to note that Metaculus is extremely uncertain about how many FLOPS will be required for AGI. The community lower 25% bound is 3.9x10^15 FLOPS and the upper 75% bound is 4.1x10^20 FLOPS with very flattish tails extending well beyond these bounds. (The median is 6.2e17.)
I mention this mainly to point out that his estimate of 10^21 FLOPS is simply overconfident in his particular model. There are simple objections that should reduce confidence in that kind of extremely high estimate at least somewhat.
For example, the human brain runs on 20 watts of glucose-derived power, and is optimized to fit through a birth canal. These design constraints alone suggest that much of its architectural weirdness arises due to energy and size restrictions, not due to optimization on intelligence. Actually optimizing for intelligence with no power or size restrictions will yield intelligent structures that look very different, so different that it is almost pointless to use brains as a reference object.
Again, I think a healthy stance to take here isn't "Tim Dettmers is WRONG" but rather "Tim Dettmers is overconfident."
Tim Dettmers' whole approach seems to assume that there are no computational shortcuts: no tricks that programmers can use for speed where evolution brute-forced it. For example, maybe a part of the brain is doing a convolution by the straightforward brute-force algorithm, and programmers can use fast Fourier transform based convolutions instead. Maybe some neurons are discrete enough for us to use single bits. Maybe we can analyse the dimensions of the system and find that some are strongly attractive, and so just work in that subspace.
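To make the convolution example concrete, here is a minimal sketch in plain Python comparing the brute-force algorithm with an FFT-based one. The function names and the toy `signal` and `kernel` values are made up for illustration, and nothing here is a claim about what any part of the brain actually computes; it only shows that the same answer can be reached with asymptotically fewer operations.

```python
import cmath

def direct_convolve(a, b):
    """Brute-force linear convolution: O(len(a) * len(b)) multiply-adds."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def fft_convolve(a, b):
    """Same linear convolution via FFT: O(n log n) for n ~ len(a) + len(b)."""
    size = len(a) + len(b) - 1
    n = 1
    while n < size:
        n *= 2  # zero-pad up to the next power of two
    fa = fft([complex(x) for x in a] + [0j] * (n - len(a)))
    fb = fft([complex(x) for x in b] + [0j] * (n - len(b)))
    product = [x * y for x, y in zip(fa, fb)]
    # The inverse transform above is unnormalized, so divide by n.
    result = fft(product, invert=True)
    return [v.real / n for v in result[:size]]

# Hypothetical toy inputs, chosen only for illustration.
signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, -1.0, 0.25]
slow = direct_convolve(signal, kernel)
fast = fft_convolve(signal, kernel)
print(all(abs(s - f) < 1e-9 for s, f in zip(slow, fast)))
```

For inputs this small the brute-force loop is actually cheaper; the gap only opens up as the inputs grow, which is the regime the argument is about.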
Of course, all this is providing an upper bound on the amount of compute needed to make a human level AI. Tim Dettmers is trying to prove it can't be done. This needs a lower bound. To get a lower bound, don't look at how long it takes a computer to simulate a human. Look at how long it takes a human to simulate a computer. This bound is really rather useless, compared to modern levels of compute. However, it might give us some rough idea how bad overhead can be. Suppose we thought "Compute needed to be at least as smart as a human" was uniformly distributed somewhere between "compute needed to simulate a human" and "compute a human can simulate".
Well actually, it depends on what intelligence test we give. Human brains have been optimised towards (human stuff) so it probably takes more compute to socialize to a human level than it takes to solve integrals to a human level.
Interesting but probably irrelevant note.
There are subtleties in even the very loose lower bound of a human simulating a CPU. Suppose there was some currently unknown magic algorithm. This algorithm can hypothetically solve all sorts of really tricky problems in a handful of CPU cycles. It is so fast that a human mentally simulating a CPU running this algorithm will still beat current humans on a lot of important problems. (Not problems humans can solve too quickly, because no algorithm can do much in <1 clock cycle.) If such a magic algorithm exists, then it's possible that even an AI running on a 1 operation per day computer could be arguably superhuman. Of course, I am somewhat doubtful that an algorithm that magic exists (although I have no strong evidence of nonexistence, only some weak evidence, namely that evolution didn't find it and we haven't found it yet). Either way, we are far into the realm of instant takeoff on any computer.
If you swapped out "AGI" for "Whole Brain Emulation" then Tim Dettmers' analysis becomes a lot more reasonable.
Tim is simply neglecting the obvious brute force solution to achieve brain-like capabilities. This is yet another startup and I'm not saying this approach will commercially succeed, but: [singularity hub]
The linked article is about a startup called Cerebras, which has gotten a 'wafer scale engine' to at least run in demos. This is where an entire silicon wafer is made into one large chip.
Enough of these, connected by hollow core optical fiber, would be what you need to hit that 10^21 threshold.
Also note that AI systems get a bunch of advantages that humans don't have. Each system is immortal and is always doing its best. Human beings trivially make mistakes on simple tasks at high error rates - we do not "do our best" consistently 24/7/365. What does it mean to achieve human-like performance? Did you mean average performance or performance of the best human alive who is well rested?
Do you want broad spectrum capabilities or just the objects in ImageNet? Because, again, it's harder than it sounds for a human to do better.
AI systems in applications like autonomous cars get to learn from the experiences of their peers in a way that is not biased. Think about how biased the information you get from your peers is - for one thing, humans tend to only tell each other about successes, which can cause you to overestimate your chance of success for a risky venture like a startup.
A peer autonomous vehicle, by contrast, can report the (novel situation, true outcome) pair in an unbiased way to a cloud farm that pushes the updated learning out to the fleet. That learning is something each individual car doesn't have to do - no vehicle needs to learn by itself.
In fact, here's another flaw of Tim's reasoning. He's assuming we must have an AI system that learns in real time like a human does. This is not true - humans don't learn in real time either; that's why we need 16-20 years of education to be useful.
Each AI system used in a field can give answers to questions in realtime, but record high prediction error results. This is sorta how OpenAI's current algorithms already do it though I am neglecting details.
For a useful AI system used in a field, therefore, you need a tiny fraction of all the neurons a human uses - most are never going to contribute in any single task you might do as a human. And if a rare edge case shows up that needs more capability than a pared down, 'sparse' system used in a real application has, you would have the field AI system pause its robotics and query a larger version of itself for the answer.
The more I type the more I realize how bullshit everything in this argument was. And there are efforts to make a silicon chip with more of the tradeoffs of the human brain. If you think you need power efficiency and breadth of capabilities more than accuracy, you can just do this. [an article on a startup that has built analog computers for neural network convolution. ]
So for Tim to be correct he needs to take into account a 'best effort' example of a large array of analog silicon processors, filling a whole warehouse, and conclude you cannot hit the computational needs required.
That startup is at about 300 TOPS for a single chip. Therefore, for a quick napkin estimate, that's on the order of 10^14 operations per second. It's a startup making some of the first analog computers used in decades. So let's assume there's at least a power of 10 of "easy gains" left over if this became a commercial technology. So 10^15.
10^21 / 10^15 = 10^6, or 1 million chips in a warehouse. Go to a 'chiplet' architecture to cram them into fewer packages, at 10 per package, and you have 100,000 packages.
Current number 1 supercomputer is Fugaku with 158,976 48-core CPUs.
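The napkin estimate above can be written out as a few lines of Python. This is only a sketch of the comment's own arithmetic: the 10x "easy gains" factor and the 10-chips-per-package figure are the comment's assumptions, and the variable names are mine.

```python
# Napkin estimate: how many analog chips would reach 10^21 operations/second?
target_ops_per_sec = 10**21   # Tim's headline figure
chip_ops_today = 10**14       # ~300 TOPS, rounded to an order of magnitude
assumed_easy_gain = 10        # assumption: one power of 10 of "easy gains"
chip_ops_future = chip_ops_today * assumed_easy_gain

chips_needed = target_ops_per_sec // chip_ops_future
chips_per_package = 10        # assumption: 'chiplet' packaging
packages_needed = chips_needed // chips_per_package

fugaku_cpus = 158_976         # CPU count quoted for Fugaku

print(chips_needed)                    # 1000000
print(packages_needed)                 # 100000
print(packages_needed < fugaku_cpus)   # True
```

So, under these assumptions, the package count comes out smaller than the CPU count of an existing supercomputer.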
Cheap and easy if you had to do this next week? No, but it sounds like, if enough resources were available, you could solve the problem even if we never get another improvement in silicon.
I followed a link on Twitter to a fun and informative 2015 blog post by Tim Dettmers:
The Brain vs Deep Learning Part I: Computational Complexity — Or Why the Singularity Is Nowhere Near
The headline conclusion is that it takes at least 10^21 FLOP/s to run the algorithms of a human brain, and therefore "it is unlikely that there will be a technological singularity in this century." I disagree with that, and this post explores why.
(Specifically, I disagree with "at least 10^21 FLOP/s". There's a separate step to go from "at least 10^21 FLOP/s" to "it is unlikely that there will be a technological singularity in this century"; this step is related to Moore's law, bandwidth requirements for parallelization, etc. Tim's blog post has extensive discussion of this second step, and I won't say anything about that here; I'd have to think about it more.)
(I'm writing this in 2021, six years later, but Tim has a comment on this very site that says he still stands by that post; in fact he now goes even further and says "I believe that AGI will be physically impossible with classical computers.")
I highly recommend the original post. Indeed, if I didn't like the post so much, I would not have bothered writing a response. :-)
Are brain algorithms computationally expensive to simulate?
Yes! Definitely! I think it's especially telling that nobody has applied the Dileep George brain-inspired image-processing model to ImageNet, sticking to much smaller images with far fewer categories of objects (MNIST, CAPTCHAs etc.).
Likewise, this Randall O'Reilly paper has a fascinating computational exploration of (in my opinion) different and complementary aspects of the human visual system. That paper tests its theories on a set of ≈1000 256×256-pixel, 8-frame movies from 100 categories—compare to ImageNet's 14 million images from 20,000 categories ... or compare it to the number of visual categories that you can recognize. Training the model still took 512 InfiniBand-connected processor nodes running for ≈24 hours on their campus supercomputer (source: personal communication). The real human vision system is dramatically larger and more complicated than this model, and the whole brain is larger and more complicated still!
But, when I say "computationally expensive to simulate" above, I mean it in, like, normal-person-in-2021 standards of what's computationally expensive to simulate. A very different question is whether the brain is "computationally expensive to simulate" by the standards of GPT-3, the standards of big tech data centers, the standards of "what will be feasible in 2030 or 2040 or 2050", and things like that. There, I don't have a strong opinion. I consider it an open question.
Note also that the two brain-inspired image-recognition examples just above are pushing innovative algorithms, and therefore are presumably handicapped by things like
So anyway, the fact that a couple of today's "most brain-like algorithms" (as judged by me) seem to be computationally expensive to scale up is not much evidence one way or the other for whether brain-like AGI algorithms would be "computationally expensive" with industrial-scale investment in the long-term or even short-term. Again, I consider it an open question.
Tim's blog post argues that it is not an open question: his estimate is 10^21 FLOP/s to run the algorithms of a human brain, which (he says) puts it out of reach for the century, and maybe (as in his recent comment) simply beyond what you can do with a classical computer. And he says that's an underestimate!
This is quite a bit more skeptical than Joseph Carlsmith's recent OpenPhil report "How Much Computational Power Does It Take to Match the Human Brain?". That offers many estimation methods which come in at 10^12-10^18 FLOP/s, with 10^21 being an extreme upper end.
What accounts for the discrepancy?
Where does Tim's estimate of 10^21 FLOP/s come from?
(Be warned that it's very possible I'm misunderstanding something, and that I have zero experience simulating neurons. I've simulated lots of other things, and I've read about simulating neurons, but that's different from actually making a neuron simulation with my own hands.)
Let's just jump to the headline calculation:
10^21 = 8.6e10 × 200 × (10,000 × 5) × (5 × 50 × 5).
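As a quick sanity check on the arithmetic, the factors can be multiplied out directly. This is only a sketch: it uses the four groups exactly as they appear in the headline calculation and attaches no interpretation to them.

```python
import math

# The four groups from the headline calculation, as written.
factors = [8.6e10, 200, 10_000 * 5, 5 * 50 * 5]

product = math.prod(factors)
print(f"{product:.3e}")                 # roughly 1.1e21
print(round(math.log10(product), 2))    # roughly 21.03
```

The product is slightly above 10^21, so the headline figure is a rounding of it.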
Let's go through the terms one by one.
So all in all, the implicit story behind multiplying these numbers together is:
Take each neuron A in each timestep B. Then take each synapse C on that neuron, and take each dendritic branch D on that neuron. Take one of the five most recent timesteps E for the synapse, and another one of the five most recent timesteps F for the dendritic branch. Now do at least one floating-point operation involving these particular ingredients, and repeat for all possible combinations.
I say "no way". That just can't be right, can it?
Let's start with the idea of multiplying the number of synapses by the number of branches. So take a random synapse (synapse #49) and independently take a random branch of a random dendrite (branch #12). Most of the time the synapse is not on that branch, and indeed not even on that dendrite! Why would we need to do a calculation specifically involving those two things?
If any influence can spread from a synapse way over here to a branch way over there, I think it would be the kind of thing that can be dealt with in a hierarchical calculation. Like, from the perspective of dendrite #6, you don't need to know the fine-grained details of what's happening in each individual synapse on dendrite #2; all you need to know is some aggregated measure of what's going on at dendrite #2, e.g. whether it's spiking, what mix of chemicals it's dumping out into the soma, or whatever.
So I want to say that the calculation is not O(number of synapses × number of branches), but rather O(number of synapses) + O(number of branches). You do calculations for each synapse, then you do calculations for each branch (or each segment of each branch) that gradually aggregate the effects of those synapses over larger scales. Or something like that.
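Here is a toy sketch of what "O(number of synapses) + O(number of branches)" looks like as an operation count. Everything in it is an assumption made for illustration: I read the 10,000 and 5 × 50 factors in the headline calculation as per-neuron synapse and branch counts, I spread the synapses evenly over the branches, and I use plain summation as a stand-in for whatever the real per-synapse and per-branch computations are.

```python
def aggregate_neuron(synapse_inputs_by_branch):
    """Aggregate synapses per branch, then branches into the soma.

    Returns the aggregated value and the number of additions performed.
    """
    ops = 0
    branch_totals = []
    for inputs in synapse_inputs_by_branch:
        total = 0.0
        for value in inputs:      # one operation per synapse
            total += value
            ops += 1
        branch_totals.append(total)
    soma = 0.0
    for total in branch_totals:   # one operation per branch
        soma += total
        ops += 1
    return soma, ops

# Assumed toy numbers: 250 branches with 40 synapses each (10,000 synapses).
n_branches = 5 * 50
synapses_per_branch = 40
n_synapses = n_branches * synapses_per_branch

inputs = [[1.0] * synapses_per_branch for _ in range(n_branches)]
soma, hierarchical_ops = aggregate_neuron(inputs)
pairwise_ops = n_synapses * n_branches  # every synapse paired with every branch

print(hierarchical_ops)   # 10250
print(pairwise_ops)       # 2500000
```

With these toy numbers the hierarchical count is a couple of hundred times smaller than the pairwise one; the exact ratio depends entirely on the assumed counts.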
Next, the time model. I disagree with this too. Again, Tim is budgeting 5×5=25 operations per timestep to deal with time-history. The idea is that at timestep N, you're doing a calculation involving "the state of synapse #18 in timestep (N-3) and of branch #59 in timestep (N-1)", and a different calculation for (N-1) and (N-4), and yet another for (N-2) and (N), etc. etc. I don't think that's how it would work. Instead I imagine that you would track a bunch of state variables for the neuron, and update the state each timestep. Then your timestep calculation would input the previous state and what's happening now, and would output the new state. So I think it should be a factor of order 1 to account for effects that are prolonged in time. Admittedly, you could say that the number "25" is arguably "a factor of order 1", but whatever. :-P
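A minimal sketch of the "track state variables and update them each timestep" idea, assuming (purely for illustration) that the relevant history can be summarized by a single exponentially decaying trace. The `decay` value and the spike train are made up. The point is only that one constant-cost update per timestep can reproduce a quantity that depends on the whole history.

```python
def step(state, spike, decay=0.8):
    """One timestep: a single multiply-add folds the new input into the state."""
    return decay * state + spike

def trace_from_history(spikes, decay=0.8):
    """The same quantity computed the expensive way, from the full history."""
    n = len(spikes)
    return sum(spike * decay ** (n - 1 - i) for i, spike in enumerate(spikes))

# Hypothetical spike train (1 = the synapse fired in that timestep).
spikes = [1, 0, 0, 1, 1, 0]

state = 0.0
for spike in spikes:
    state = step(state, spike)

print(abs(state - trace_from_history(spikes)) < 1e-12)   # True
```

Real neuron models would presumably need more than one state variable, but the per-timestep cost still scales with the number of state variables rather than with pairs of past timesteps.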
Oh, also, in a typical timestep, most synapses haven't fired for the previous hundreds of milliseconds, so you get another order of magnitude or so reduction in computational cost from sparsity.
So put all that together, and now my back-of-the-envelope is like 50,000× lower than Tim's.
(By the way, please don't divide 10^21 FLOP/s by 50,000 and call it "Steve's estimate of the computational cost of brain simulations". This is a negative case against the 10^21 number, not a positive case for any model in particular. If you want my opinion, I don't have one right now, as I said above. In the meantime I defer to the OpenPhil report.)
(Parts of this section are copying points made in the comment section of Tim's blog.)
(Also, my favorite paper proposing an algorithmic purpose of dendritic spikes in cortical pyramidal neurons basically proposes that it functions as an awfully simple set of ANDs and ORs, more or less. I don't read too much into that—I think the dendritic spikes are doing other computations too, which might or might not be more complicated. But I find that example suggestive.)
What about dynamic gene expression, axonal computations, subthreshold learning, etc.?
To be clear, Tim posited that the 10^21 FLOP/s was an underestimate, because there were lots of other complications neglected by this model. Here's a quote from his post:
My main response is a post I wrote earlier: Building brain-inspired AGI is infinitely easier than understanding the brain. To elaborate and summarize a bit:
I don't pretend that this is a rigorous argument; it's intuitions knocking against each other. I'm open to discussion. :-)