Estimating Brain-Equivalent Compute from Image Recognition Algorithms

Gunnar_Zarncke

[Epistemic Status: Playing around with the idea of a benchmark with some rough numbers.]

When I read Biological Anchors: A Trick That Might Or Might Not Work, my thinking was: Biological anchors will work if your algorithms are close enough to what the brain does and can then be used to estimate the compute (FLOPs) needed for the rest of the brain. The compute equivalent of the brain has been discussed recently here (I think this indicates factor 100 more efficient algorithms) and here. I used this for predictions on Metaculus. This will not give you sharp bounds and not tell you whether algorithms could do things much cheaper or which ones to use. I have not seen this specific comparison elsewhere.

This started with the idea that we might already have some algorithms that perform as well as some parts of the brain and compare their costs, power requirements, and complexity. Specifically, image recognition is about equally good as human raters. Thus, let’s compare state-of-the-art image recognition algorithms with the corresponding brain regions (the visual cortex) and then extrapolate that to the whole brain.

I did this and here is the result:

Brain Region	Brodmann Area 17 Visual Cortex V1	Algorithm	CoAtNet-7
[cm^3] Volume	11 (1% brain volume)
[10^6] Neurons	280 (0.3% brain neurons)	[10^6] Parameters	2500
[W] Power	0.18	[W] Power, 10 inferences/s	13 at 2TFLOPs/W
		[10^21 FLOP] training	200
		[10^9 FLOP] inference	2600

Whether the comparison should include only region V1 or also V2 to V5 of the visual cortex is worth asking, but the idea was to estimate conservatively and to exclude cognitive processes current algorithms definitely don't cover.

Extrapolating the compute to the whole brain:

Inference: 8*10^15 FLOPs/s (86*10^9 neurons / 280*10^6 neurons * 2.6*10^12 FLOPs/inference * 10 inferences/second).
Training: 5*10^16 FLOP/s (86*10^9 neurons / 280*10^6 neurons * 2*10^23 FLOPs / 18 life years).

Pretty low compared to the numbers in Cotra’s paper.

There are just some problems: The visual cortex does much more than just static image recognition of 512x512 pixel images:

The resolution of the processed image is much higher: 120 million rods instead of a quarter-million pixels.
Stitching together the picture from blurred fragments (Saccades).
Building something like a 3D model (maybe not in V1, though).
Inferring actions in the scene change over time (mostly motions; object permanence). Some of this may be in the V2-V5 regions.

Unfortunately, I only realized this when I had already collected most of the above data. There is algorithmic progress on many of these points (e.g., there is active research in vision-based action detection), but no algorithms come close to human performance on these. Alternatively, I also tried to get corresponding numbers for auditory processing, but these were harder to get, and speech recognition also hasn't reached human parity yet (cocktail party effect). Thus my initial assumption - that we have a brain region algorithmically covered - doesn’t hold up. I considered not posting this write-up but then decided that it might still be of interest to some readers.

Attempting to estimate AGI compute requirements from visual cortex and image classification has a long connectionist history. Moravec did this repeatedly, and Drexler has another version in his QNR whitepaper. AI Impacts or whoever was comparing to bees for similar reasons. Might be worth comparing.

For everybody who didn't know - like me - what QNR are:

Learned, quasilinguistic neural representations (QNRs) that upgrade words to embeddings and syntax to graphs can provide a semantic medium that is both more expressive and more computationally tractable than natural language, a medium able to support formal and informal reasoning, human and inter-agent communication, and the development of scalable quasilinguistic corpora with characteristics of both literatures and associative memory. QNR-based systems can draw on existing natural language and multimodal corpora to support the aggregation, refinement, integration, extension, and application of knowledge at scale. The incremental development of QNR-based models can build on current capabilities and methodologies in neural machine learning, and as systems mature, could potentially complement or replace today’s opaque “foundation models” with systems that are more capable, interpretable, and epistemically reliable. Potential applications and implications are broad.

https://www.fhi.ox.ac.uk/qnrs/

Drexler's Language for Intelligent Machines: A Prospectus

on LW: QNR prospects are important for AI alignment research

The bees post was by Guilhermo Costa, an Open Phil intern. My comment has some discussion of the "but biological brains do so much more stuff than ML classifiers" point.

This is the 1976 Moravec calculation:

Assuming the visual cortex (and possibly the optic nerve itself) is as computationally intensive as the retina, successive layers producing increasingly abstracted representations, we can estimate the total capability. There are a million separate fibers in a cross section of the human optic nerve. The thickness of the optical cortex is a thousand times the depth occupied by the neurons which apply a single simple operation. The eye is capable of processing images at the rate of ten per second (flicker at higher frequencies is detected by special operators). This means that the human visual system evaluates 10,000 million pixel simple operators each second.

https://frc.ri.cmu.edu/~hpm/project.archive/general.articles/1978/analog.1978.html