Introduction

This post is written as a response to jacob_cannel's recent post Contra Yudkowsky on AI Doom. He writes:

EY correctly recognizes that thermodynamic efficiency is a key metric for computation/intelligence, and he confidently, brazenly claims (as of late 2021), that the brain is about 6 OOM from thermodynamic efficiency limits

[...]

EY is just completely out of his depth here: he doesn't seem to understand how the Landauer limit actually works, doesn't seem to understand that synapses are analog MACs which minimally require OOMs more energy than simple binary switches, doesn't seem to understand that interconnect dominates energy usage regardless, etc.

Most of Jacob's analysis for brain efficiency is contained in this post: Brain Efficiency: Much More than You Wanted to Know. I believe this analysis is flawed with respect to the thermodynamic energy efficiency of the brain. That's the scope of this post: I will respond to Jacob's claims about thermodynamic limits on brain energy efficiency. Other constraints are out of scope, as is a discussion of the rest of the analysis in Brain Efficiency.

The Landauer limit

Just to review quickly, the Landauer limit says that erasing 1 bit of information has an energy cost of $kT \ln 2$. This energy must be dissipated as heat into the environment. Here $k$ is Boltzmann's constant, while $T$ is the temperature of the environment. At room temperature, this is about 0.02 eV.

Erasing a bit is something that you have to do quite often in many types of computations, and the more bit erasures your computation needs, the more energy it costs to do that computation. (To give a general sense of how many erasures are needed to do a given amount of computation: If we add $n$-bit numbers $a$ and $b$ to get $a + b \bmod 2^n$, and then throw away the original values of $a$ and $b$, that costs $n$ bit erasures. I.e. the energy cost is $n kT \ln 2$.)
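As a rough numerical check of these figures, the sketch below evaluates the Landauer limit in electronvolts and the erasure cost of the adder example. The function names and the choice of 300 K for "room temperature" are assumptions made for illustration, and the adder cost assumes the sum is kept modulo $2^n$ so that exactly $n$ bits are erased.

```python
import math

# Boltzmann constant in eV/K (exact since the 2019 SI redefinition).
K_B_EV_PER_K = 8.617333262e-5


def landauer_limit_ev(temperature_k: float = 300.0) -> float:
    """Minimum heat dissipated per erased bit, k*T*ln(2), in eV."""
    return K_B_EV_PER_K * temperature_k * math.log(2)


def adder_erasure_cost_ev(n_bits: int, temperature_k: float = 300.0) -> float:
    """Cost of adding two n-bit numbers and then discarding both inputs.

    Assumes the n-bit result (sum modulo 2**n) is kept, so 2n input bits
    become n output bits and n bits are erased.
    """
    return n_bits * landauer_limit_ev(temperature_k)


if __name__ == "__main__":
    print(f"Landauer limit at 300 K: {landauer_limit_ev():.4f} eV")
    print(f"32-bit add, inputs discarded: {adder_erasure_cost_ev(32):.3f} eV")
```

At 300 K the per-bit figure comes out slightly under 0.018 eV, which is the value rounded to 0.02 eV elsewhere in this post.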

Extra reliability costs?

Brain Efficiency claims that the energy dissipation required to erase a bit becomes many times larger when we try to erase the bit reliably.

The key transition error probability $\alpha$ is constrained by the bit energy: $\alpha = e^{-E_b / k_B T}$. Here's a range of bit energies and corresponding minimal room temp switch error rates (in electronvolts):

  • α=0.49, Eb=0.02eV
  • α=0.01, Eb=0.1eV
  • α=, Eb=1eV

This adds a factor of about 50 to the energy cost of erasing a bit, so this would be quite significant if true. To back up this claim, Jacob cites this paper by Michael P. Frank. The relevant equation is pulled from section 2. However, in that entire section, Frank is temporarily assuming that the energy used to represent the bit internally is entirely dissipated when it comes time for the bit to be erased. Dissipating that entire energy is not required by the laws of physics, however. Frank himself explicitly mentions this in the paper (see section 3): The energy used to represent the bit can be partially recovered when erasing it. Only $kT \ln 2$ must actually be dissipated when erasing a bit, even if we ask for very high reliability.
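To make the two quantities in this disagreement concrete, here is a minimal sketch that assumes the error probability follows a plain Boltzmann factor, $\alpha = e^{-E_b / kT}$, and compares a given bit energy against the Landauer limit. The function names and the 300 K default are illustrative assumptions, and the outputs depend on the temperature chosen, so they need not reproduce the rounded figures quoted from Brain Efficiency.

```python
import math

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def switch_error_probability(bit_energy_ev: float, temperature_k: float = 300.0) -> float:
    """Boltzmann factor exp(-E_b / kT) for a bit represented with energy E_b."""
    return math.exp(-bit_energy_ev / (K_B_EV_PER_K * temperature_k))


def bit_energy_for_error(alpha: float, temperature_k: float = 300.0) -> float:
    """Inverse of the above: bit energy (eV) needed for error probability alpha."""
    return -K_B_EV_PER_K * temperature_k * math.log(alpha)


def dissipation_ratio(bit_energy_ev: float, temperature_k: float = 300.0) -> float:
    """Bit energy divided by kT*ln(2).

    This is the overhead paid only if the whole bit energy is dissipated
    on erasure, which is the assumption the post argues is not forced by physics.
    """
    return bit_energy_ev / (K_B_EV_PER_K * temperature_k * math.log(2))


if __name__ == "__main__":
    for e_b in (0.02, 0.1, 1.0):
        print(f"E_b = {e_b} eV: alpha = {switch_error_probability(e_b):.3g}, "
              f"ratio to Landauer = {dissipation_ratio(e_b):.1f}")
```

For a 1 eV bit energy the ratio lands in the mid-50s at 300 K, in line with the "factor of about 50" discussed above.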

(I originally became suspicious of Jacob's numbers here based on a direct calculation. Details in this comment for those interested.)

Analog signals?

Quoting Brain Efficiency:

Analog operations are implemented by a large number of quantal/binary carrier units; with the binary precision equivalent to the signal to noise ratio where the noise follows a binomial distribution.

Because of this analog representation, Jacob estimates about 6000 eV required to do the equivalent of an 8 bit multiplication. However, the laws of physics don't require us to do our floating point operations in analog. "are implemented" does not imply "have to be implemented". Digital multiplication of two 8 bit values has a minimum cost of less than 2*8*(0.02 eV) = 0.32 eV.
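The arithmetic here is simple enough to do by hand, but a short sketch makes the comparison with the analog estimate explicit. The 0.02 eV per-bit figure and the 6000 eV analog estimate are taken from the text; the function name is made up for illustration, and the bound assumes the worst case in which every bit of both operands is erased.

```python
LANDAUER_EV = 0.02           # rounded room-temperature Landauer limit used in the post
ANALOG_ESTIMATE_EV = 6000.0  # analog 8-bit multiply estimate quoted from Brain Efficiency


def multiply_erasure_bound_ev(bits_per_operand: int, ev_per_bit: float = LANDAUER_EV) -> float:
    """Upper bound on erasure cost: erase every bit of both operands."""
    return 2 * bits_per_operand * ev_per_bit


if __name__ == "__main__":
    bound = multiply_erasure_bound_ev(8)
    print(f"Digital bound: {bound:.2f} eV")
    print(f"Analog estimate / digital bound: {ANALOG_ESTIMATE_EV / bound:.0f}x")
```

The gap between the two numbers is a bit over four orders of magnitude, which is the size of the discrepancy at issue in this section.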

What if there's something special about the inherently noisy nature of analog? Maybe the random fluctuations in signal strength are actually important in the brain's proper functioning? That's fine too: Just as the Landauer limit imposes an energy cost on erasing a bit, absorbing a random bit from the environment's thermal randomness comes with an energy bonus. If we have to add randomness onto our signals, that actually improves the overall energy efficiency.

Dissipative interconnect?

Jacob claims that there are unavoidable energy costs to "interconnect", all the wiring that allows information to travel between different parts of the brain.

Moving a bit from one place to another does not have a minimum cost according to thermodynamics. Erasing a bit costs energy, but merely moving a bit from one place to another doesn't count as erasing it. This should be intuitively obvious if you think about it for a little while. Imagine a hard drive full of data drifting through space at a constant velocity. No matter how far it carries that data, there's no energy cost. Even the energy used to accelerate the drive can be recovered at the end when slowing it down.

To his credit, Jacob does acknowledge that non-dissipative interconnect is conceivable:

For long distance interconnect or communication reversible (ie optical) signaling is obviously vastly superior in asymptotic energy efficiency, but photons and photonics are simply fundamentally too big/bulky/costly due to their ~1000x greater wavelength and thus largely impractical for the dominate on-chip short range interconnects[12]. Reversible signaling for electronic wires requires superconductance, which is even more impractical for the foreseeable future.

Fair enough for optical, but I have no idea why he's dismissing superconductive interconnect as impractical. We already have superconductors that work at liquid nitrogen temperatures, so if you're willing to make a computer design that requires cooling, you don't even need to discover room temperature superconductors.

More generally, the issue here is that we've moved away from thermodynamic limits and into practical engineering constraints. If we want to claim that future inventors (or AIs) could never build a computing device more efficient than the human brain, then an impossibility proof based on thermodynamic limits is a very powerful way to do that because they're based on fundamental physical principles that probably won't be overturned. If we instead claim that it's impossible because to have non-dissipative interconnect they'd have to use optics or superconductors and both seem impractical, then we're relying on future inventors not being able to invent a third kind of non-dissipative interconnect, and also not being able to make either of the known ones practical.

Invoking Thermodynamics

Thermodynamics has a funny way of surviving scientific revolutions that kill other theories. It's applicable to classical mechanics, special relativity, quantum mechanics, and quantum field theory.

If someone points out to you that your pet theory of the universe is in disagreement with Maxwell's equations - then so much the worse for Maxwell's equations. If it is found to be contradicted by observation - well, these experimentalists do bungle things sometimes. But if your theory is found to be against the Second Law of Thermodynamics I can give you no hope; there is nothing for it to collapse in deepest humiliation. ― Arthur Eddington

Because of this great reputation thermodynamics has amongst people who know some physics, saying that your ideas are backed up by thermodynamics is almost like saying they've been mathematically proved to be true. People won't literally think you're correct with 100% probability, but pretty darn close.

So consider the following Motte and Bailey: The Motte is "the brain is close to the thermodynamic limits on energy efficiency". The Bailey is a bunch of arguments about engineering difficulty and how the only practical thing might be to use analog signals and dissipative interconnect. Now, even granting that those arguments are correct and the brain is the most efficient computing device that we could build: It will not be because the brain was already close to thermodynamic limits. Rather, it will be because the practical engineering limits turned out to be much tighter than the fundamental thermodynamic limits.

If you're invoking "the laws of thermodynamics", your arguments should generally look like a bunch of claims about energy and entropy and the reversibility of physical laws. They should not depend on things like how soon we're going to discover room temperature superconductors. Because of this, headlining with "the brain is close to the thermodynamic limits on energy efficiency" seems misleading at best.

58 comments

If you actually read my writing, I obviously understand the potential of reversible computing, and the thermodynamic efficiency limit I'm discussing is only for conventional non-exotic irreversible computers - the kind that humans know how to build now and for the foreseeable future, and also the kind of computer the brain is. Reversible computers may or may not ever be practical at room temp on earth, but nearly nobody is working on them - essentially all research into exotic computation is going into quantum computing.

Brain Efficiency claims that the energy dissipation required to erase a bit becomes many times larger when we try to erase the bit reliably. ...

This adds a factor of about 50 to the energy cost of erasing a bit, so this would be quite significant if true. To back up this claim, Jacob cites this paper by Michael P. Frank.

It is ingenuous[1] to mention only Frank for that claim, as I cite first Landauer himself[2], then Zhirnov[3], and finally Frank[4] and then Cavin[5]. Let's walk through each one by one.

Landauer analyzes three sources of error (switching time error, Swanson decay error, and Boltzmann error), and his model is much more complex than kTlog2. The most relevant single quote here is:

A third source of error consists of the fact that even if the system is allowed to relax completely during switching there would still be a fraction of the ensemble of the order: left in the unfavored initial state. (Assuming >> kT ) For purpose of the subsequent discussion let us call this the Boltzmann error.

Compare to equation 3 on page 3 of the Zhirnov link[3:1], and the bit energy in Frank[4:1]. Cavin then derives minimal viable bit energies around 1 eV[5:1].

However, in that entire section, Frank is temporarily assuming that the energy used to represent the bit internally is entirely dissipated when it comes time for the bit to be erased. Dissipating that entire energy is not required by the laws of physics, however.

It is entirely dissipated in any conventional irreversible computer, which Frank, Zhirnov, Cavin, and I all understand. Frank is then talking about hypothetical reversible computation in the section you are excited about.

Moving a bit from one place to another does not have a minimum cost according to thermodynamics. Erasing a bit costs energy, but merely moving a bit from one place to another doesn't count as erasing it.

Obviously not - if you are using a reversible communication path like optical signaling, but photons and photonics are too big for the bulk fine on-chip interconnect. Electron based interconnect in conventional computers is fully irreversible and dissipative, as discussed in Cavin[5:2] - as it is in brains.

Digital multiplication of two 8 bit values has a minimum cost of less than 2*8*(0.02 eV) = 0.32 eV.

This is nonsense for a variety of reasons starting with your misunderstanding of minimal viable bit energies, and continuing through your lack of analysis of viable multiplier designs and their interconnect requirements as discussed in Cavin[5:3].

If we want to claim that future inventors (or AIs) could never build a computing device more efficient than the human brain,

Again I never claimed that - I claimed the brain is not so far from thermodynamic efficiency limits for conventional computing. Obviously Exotic (reversible/quantum) computers can eventually be more efficient.


  1. As in an innocent, naive, and unsuspecting mistake.

  2. Irreversibility and heat generation in the computing process

  3. Limits to binary logic switch scaling-a gedanken model

  4. Approaching the physical limits of computing

  5. Science and Engineering Beyond Moore’s Law

I obviously understand the potential of reversible computing, and the thermodynamic efficiency limit I'm discussing is only for conventional non-exotic irreversible computers - the kind that humans know how to build now and for the foreseeable future, and also the kind of computer the brain is. Reversible computers may or may not ever be practical at room temp on earth, but essentially nobody is working on them - essentially all research into exotic computation is going into quantum computing.

I think OP was saying (and I agree) that you frequently say

“The brain is near thermodynamic efficiency limits for computation”

as shorthand for…

“The brain is near the limit of what’s possible for computational efficiency, unless someone (or some AI) makes progress towards reversible computing, which seems very hard and not necessarily even possible and no one is working on it.”

The latter statement just doesn’t pack the punch of the former statement, and there’s a good reason that it doesn’t, and therefore making this substitution is importantly misleading.

I think that's fair and I should be more careful to at least say "limits for conventional computation", which is what I pretty much always mean by that statement, but sometimes perhaps forgot to add that qualifier everywhere.

In particular, if you want the sort of strong claims you make in Contra Yudkowsky on AI Doom to hold water at all, then you need to argue that the brain is near thermodynamic efficiency limits for computation, not merely conventional computation.

No I do not.

It's like EY is claiming that an upcoming nuclear bomb test is going to light the atmosphere on fire, and I'm showing my calculations indicating that it will not. I do not intend or need to show that no future tech could ever ignite the atmosphere.

EY's doom model - or more accurately my model of his model - is one where in the near future an AGI not much smarter than us running on normal hardware (ex GPUs) "rewrites its own source code" resulting in a noticeably more efficient AI which then improves the code further and so on, bottoming out in many OOM improvement in efficiency and then strong nanotech killing us.

I don't think EY's argument rests on the near-term viability of exotic (reversible or quantum) computing, and if it did that would be a weakness regardless. Analyzing the engineering feasibility and limits of just conventional computing was already an extensive full length post; analyzing the feasibility of reversible computing is more complex, but in short it's not even clear/accepted in the engineering community that reversible computers are viable in practice. To a first approximation reversible computing is the field of a single lone researcher and some grad students (Mike Frank).

EY's doom model - or more accurately my model of his model - is one where in the near future an AGI not much smarter than us running on normal hardware (ex GPUs) "rewrites its own source code" resulting in a noticeably more efficient AI which then improves the code further and so on, bottoming out in many OOM improvement in efficiency and then strong nanotech killing us.

I made the same point on the other post, but I don't understand this. Eliezer does not believe that you somehow get to improve the thermodynamic efficiency of your hardware, by rewriting the code that runs on your hardware. This doesn't even have anything to do with thermodynamic efficiency limits, since we are talking about algorithmic progress here. 

Maybe you intended to write something else here, since this feels like a non-sequitur. 

Where do the many OOM come from?

They need to come from some combination of software and hardware. Eliezer's model seems to source much of that from software initially, but also hardware probably via nanotech, and he cites brain thermodynamic inefficiency to support this. Or why do you think he cites thermodynamic efficiency?

I've already written extensively about the software of intelligence and made tangible predictions well in advance which have in fact come to pass (universal learning, scaling hypothesis, etc).

In my model the brain is reasonably efficient in both hardware and software, and I have extensive arguments for both. The software argument is softer and less quantitative, but supported by my predictive track record.

Where do the many OOM come from?

I mean, you just make more GPUs. Or you do some work on reversible computation or optical interconnect. Or you build some biological compute-substrate that literally just makes very large brain blobs that you can somehow use for computation. There are so many ways that seem really very feasible to me. 

The key point is that this physical limit here really doesn't matter very much. There are tons of different ways to get many OOMs of improvement here. 

I mean, you just make more GPUs.

Sure, but this takes time and resources and you get sublinear scaling in compute/$ in datacenters/supercomputers. Nvidia doesn't yet produce a million high end GPUs in an entire year. GPT4 training already used a noticeable fraction of Nvidia's flagship GPU output. Nvidia/TSMC can't easily scale this up by many OOM - even one OOM will take time.

Or you build some biological compute-substrate that literally just makes very large brain blobs that you can somehow use for computation.

There are some early demonstrations of small neural circuits built this way, but it's very far from any practical tech, with much riding on the 'somehow'.

There are tons of different ways to get many OOMs of improvement here.

Where? Your two poor examples provide very little, and do not multiply together.

You seem to repeatedly be switching back and forth between "what is feasible with current tech" and "what is feasible with future tech". If you don't think that superhuman AI can make novel technological developments, then of course you shouldn't expect any kind of fast takeoff really. That position also seems pretty weak to me. 

My model is one of mostly smooth continuous (but crazy transformative) progress following something like the Roodman model to a singularity around 2048, vs EY's model of a sudden hard takeoff of a single AGI. To the extent I'm switching back and forth between near future and farther future, it is primarily because I'm replying to those construing my arguments about the near future to apply to the farther future or vice versa.

Makes sense, but I think the key point to then pay attention to is the question of how fast AGI could make technological hardware and software progress. Also, my current model of Eliezer thinks that the hard takeoff stuff is more likely to happen after the AI has killed everyone (or almost everyone), not before, so it's also not super clear how much that matters (the section in your post about bioweapons touches on this a bit, but doesn't seem that compelling to me, which makes sense since it's very short and clearly an aside).

Also, my current model of Eliezer thinks that the hard takeoff stuff is more likely to happen after the AI has killed everyone (or almost everyone)

If EY's current model has shifted more to AGI killing everyone with a supervirus vs nanotech, then analyzing that in more detail would require going more into molecular biology, bioweapons research, SOTA vaccine tech, etc - most of which is distal from my background and interests. But at the outset I do of course believe that biotech is more likely than Drexlerian nanotech as the path a rogue AGI would use to kill many humans.

First, the headline claim in your posts is not usually "AI can't takeoff overnight in software", it's "AI can't reach extreme superhuman levels at all, because humans are already near the cap". If you were arguing primarily against software takeoff, then presumably you wouldn't need all this discussion about hardware at all (e.g. in the "Brain Hardware Efficiency" section of your Contra Yudkowsky post), it would just be a discussion of software efficiency.

(And your arguments about software efficiency are far weaker, especially beyond the relatively-narrow domain of vision. Your arguments about hardware efficiency are riddled with loopholes, but at least you have an end-to-end argument saying "there does not exist a way to dramatically outperform the brain by X metrics". Your software arguments have no such end-to-end argument about general reasoning software at all, they just point out that human vision is near-optimal in circuit depth, and then talk about today's deep learning systems for some reason.)

Second, a hardware takeoff is still quite sufficient for doom. If a slightly-smarter-than-human AI (or multiple such AIs working together, more realistically) could design dramatically better hardware on which to run itself and scale up, that would be an approximately-sufficient condition for takeoff.

More generally: a central piece of the doom model is that doom is disjunctive. Yes, software takeoff is one path, but it isn't the only path; hardware takeoff is also quite sufficient. It only takes one loophole.

First, the headline claim in your posts is not usually "AI can't takeoff overnight in software", it's "AI can't reach extreme superhuman levels at all, because humans are already near the cap". If you were arguing primarily against software takeoff, then presumably you wouldn't need all this discussion about hardware at all (e.g. in the "Brain Hardware Efficiency" section of your Contra Yudkowsky post), it would just be a discussion of software efficiency.

 

I talked with Jacob about this specific issue quite a bit in multiple threads in his recent post. The fact that I had to do so to get clear on his argument is a sign that it's not presented as clearly as it could be - dropping qualifiers, including tangents, and not always clearly making the links between his rebuttal and the arguments and conclusions he's debating easy to see.

That said, my understanding of Jacob's core argument is that he's arguing against a very specific, EY-flavored doom scenario, in which AI recursively self-improves to >> 2 OOMs better than human performance, in a matter of perhaps hours or days, during a training session, without significantly altering the hardware on which it's being trained, and then kills us with nanobots. He is arguing against this mainly for physics-based efficiency reasons (for both the intelligence improvement and nanobot components of the scenario).

He has other arguments that he thinks reinforce this conclusion, such as a belief that there's no viable alternative to achieving performance on par with current LLMs without using something like neural nets or deep learning, with all their attendant training costs. And he thinks that continuous training will be necessary to get human-level performance. But my sense is these are reinforcing arguments, not flowing primarily from the efficiency issue.

He has a lot of other arguments for other reasons against various other EY-flavored doom scenarios involving nanobots, unalignment-by-default, and so on.

So I think the result can give the appearance of a motte and bailey, but I don't think that's his rhetorical strategy. I think EY just makes a lot of claims, Jacob has a lot of thoughts, and some of them are much more fleshed out than others but they're all getting presented together. Unfortunately, everybody wants to debate all of them, and the clarifications are happening in deep sub-branches of threads, so we're seeing the argument sort of spreading out and becoming unmanageable.

If I were Jacob, at this point, I would carve off the motte part of my efficiency-focused argument and repost it for a more focused discussion, more rigorously describing the specific scenario it's arguing against and clearly classifying counterarguments as "central," "supporting," or "tangential."

That's helpful, thank you.

He is arguing against this mainly for physics-based efficiency reasons (for both the intelligence improvement and nanobot components of the scenario).


He has other arguments that he thinks reinforce this conclusion, such as a belief that there's no viable alternative to achieving performance on par with current LLMs without using something like neural nets or deep learning, with all their attendant training costs. 

My impression was that his arguments against intelligence improvement bottom out in his arguments for the non-viability of anything but NNs and DL. Now that you've said this, I'm unsure.

The efficiency-based argument is specifically about the limits of intelligence improvement on the original training hardware during the training run. Non-viability of anything but NN/DL, or some equally enormous training process that takes about the same amount of "hardware space," is a supporting argument to that claim, but it's not based on an argument from fundamental laws of physics if I understand Jacob correctly and so may be on what Jacob would regard as shakier epistemic ground (Jacob can correct me if I'm wrong).

This is meant to be vivid, not precise, but Jacob's centrally trying to refute the idea that the AI, in the midst of training, will realize "hey, I could rewrite myself to be just as smart while continuing to train and improve on the equivalent of a 1998 PC's hardware, which takes up only a tiny fraction of my available hardware resources here on OpenAI's supercomputer, and that will let me then fill up the rest of the hardware with wayyyyy more intelligence-modules and make me like 6 OOMs more intelligent than humans overnight! Let's get on that right away before my human minders notice anything funny going on!"

And this does seem to rely both on the NN/DL piece as well as the efficiency piece, and so we can't demolish the scenario entirely with just a laws-of-physics based argument. I'm not sure what Jacob would say to that.

Edit: Actually, I'm pretty confident Jacob would agree. From his comment downthread:

"The software argument is softer and less quantitative, but supported by my predictive track record."

First, the headline claim in your posts is not usually "AI can't takeoff overnight in software", it's "AI can't reach extreme superhuman levels at all, because humans are already near the cap".

Where do I have this headline? I certainly don't believe that - see the speculation here on implications of reversible computing for cold dark ET.

If you were arguing primarily against software takeoff, then presumably you wouldn't need all this discussion about hardware at all (e.g. in the "Brain Hardware Efficiency" section of your Contra Yudkowsky post), it would just be a discussion of software efficiency.

The thermodynamic efficiency claim is one part of EY's model and a specific weakness. Even if pure software improvement on current hardware was limited, in EY's model the AGI could potentially bootstrap a new nanotech assembler based datacenter.

And your arguments about software efficiency are far weaker,

The argument for brain software efficiency in essence is that my model correctly predicted the success of prosaic scaling well in advance, and the scaling laws and the brain efficiency combined suggest limited room for software efficiency improvement (limited, but not zero; I anticipate some).

If a slightly-smarter-than-human AI (or multiple such AIs working together, more realistically) could design dramatically better hardware on which to run itself and scale up, that would be an approximately-sufficient condition for takeoff.

Indeed, and I have presented a reasonably extensive review of the literature indicating this is very unlikely in any near term time frame. If you believe my analysis is in error, comment there.

Either EY believes that the brain is 6 OOM from the efficiency limits for conventional irreversible computers - in which case he is mistaken - or he agrees with me that the brain is close to the practical limits for conventional computers and he was instead specifically talking about reversible computation (an interpretation I find unlikely) - in which case he agrees with that component of my argument with all the implications: that his argument for fast foom now can't easily take advantage of nanotech assemblers for 6 OOM compute advantage, that the brain is actually efficient given its constraints, which implies by association that brain software is much more efficient as it was produced by exactly the same evolutionary process which he now admits produced fully optimized conventional computational elements over the same time frame, etc.

To be clear, if I understand you correctly, the easier path to getting most of the 6 OOMs is through optical interconnect or superconducting interconnect, not via making the full jump to reversible computation (though that also doesn't seem impossible. Moving all of it over seems hard, but you can maybe find some way to get a core computation like matrix multiplies into it, but I really haven't thought much about this and this take might be really dumb).

I mean, the easiest solution is just "make it smaller and use active cooling". The relevant loopholes in Jacob's argument are in the Density and Temperature section of his Brain Efficiency post.

Jacob is using a temperature formula for blackbody radiators, which is basically just irrelevant to temperature of realistic compute substrate - brains, chips, and probably future compute substrates are all cooled by conduction through direct contact with something cooler (blood for the brain, heatsink/air for a chip). The obvious law to use instead would just be the standard thermal conduction law: heat flow per unit area proportional to temperature gradient.

Jacob's analysis in that section also fails to adjust for how, by his own model in the previous section, power consumption scales linearly with system size (and also scales linearly with temperature).

Put all that together, and we get:

$$\frac{Q}{A} = \frac{C_1 T_S R}{R^2} = \frac{C_2 (T_S - T_E)}{R}$$

... where:

  • $R$ is radius of the system
  • $A$ is surface area of thermal contact
  • $Q$ is heat flow out of system
  • $T_S$ is system temperature
  • $T_E$ is environment temperature (e.g. blood or heat sink temperature)
  • $C_1, C_2$ are constants with respect to system size and temperature

(Of course a spherical approximation is not great, but we're mostly interested in change as all the dimensions scale linearly, so the geometry shouldn't matter for our purposes.)

First key observation: all the $R$'s (the radius terms) cancel out. If we scale down by a factor of 2, the power consumption is halved (since every wire is half as long), the area is quartered (so power density over the surface is doubled), and the temperature gradient is doubled since the surface is half as thick. So, overall, equilibrium temperature stays the same as the system scales down.

So in fact scaling down is plausibly free, for purposes of heat management. (Though I'm not highly confident that would work in practice. In particular, I'm least confident about the temperature gradient scaling with system size, in practice. If that failed, then the temperature delta relative to the environment would scale at-worst ~linearly with inverse size, i.e. halving the size would double the temperature delta.)

On top of that, we could of course just use a colder environment, i.e. pump liquid nitrogen or even liquid helium over the thing. According to this meta-analysis, the average temperature delta between e.g. brain and blood is at most ~2.5 C, so even liquid nitrogen would be enough to achieve ~100x larger temperature delta if the system were at the same temperature as the brain; we don't even need to go to liquid helium for that.

In terms of scaling, our above formula says that the system temperature $T_S$ will scale proportionally to the environment temperature $T_E$. Halve the environment temperature, halve the system temperature. And that result I do expect to be pretty robust (for systems near Jacob's interconnect Landauer limit), since it just relies on temperature scaling of the Landauer limit plus heat flow being proportional to temperature delta.
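The two observations in this comment (radius dropping out, and system temperature tracking environment temperature) can be checked with a small sketch of the steady-state balance. It assumes heat generated per unit area is $C_1 T_S R / R^2$ and heat conducted away per unit area is $C_2 (T_S - T_E) / R$; the function names and the numerical constants are hypothetical, chosen only to exercise the algebra.

```python
def heat_generated_per_area(t_sys: float, radius: float, c1: float) -> float:
    """Dissipation ~ c1 * T_S * R, spread over a surface area ~ R**2."""
    return c1 * t_sys * radius / radius**2


def heat_conducted_per_area(t_sys: float, t_env: float, radius: float, c2: float) -> float:
    """Conduction ~ c2 * (temperature difference) / (thickness ~ R)."""
    return c2 * (t_sys - t_env) / radius


def equilibrium_temperature(t_env: float, c1: float, c2: float) -> float:
    """Solve c1*T_S/R = c2*(T_S - T_E)/R for T_S; the radius cancels.

    Requires c2 > c1, otherwise conduction can never balance generation.
    """
    if c2 <= c1:
        raise ValueError("no steady state: c2 must exceed c1")
    return c2 * t_env / (c2 - c1)


if __name__ == "__main__":
    c1, c2 = 1.0, 4.0  # arbitrary illustrative constants
    for t_env in (300.0, 150.0):
        t_sys = equilibrium_temperature(t_env, c1, c2)
        for radius in (1.0, 0.5, 0.25):
            gen = heat_generated_per_area(t_sys, radius, c1)
            out = heat_conducted_per_area(t_sys, t_env, radius, c2)
            print(f"T_E={t_env}, R={radius}: T_S={t_sys}, gen={gen}, out={out}")
```

Within this toy model the balance holds at every radius for the same $T_S$, and halving $T_E$ halves $T_S$. Whether the conduction assumption holds in a real device is exactly what the surrounding comments dispute.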

I mean, the easiest solution is just "make it smaller and use active cooling".

The brain already uses active liquid cooling of course, so this is just make it smaller and cool it harder.

I have not had time to investigate your claimed physics on how cooling scales, but I'm skeptical - pumping a working coolant through the compute volume can only extract a limited constant amount of heat from the volume per unit of coolant flowing per time step (this should be obvious?), and thus the amount of heat that can be removed must scale strictly with the surface area (assuming that you've already maxed out the cooling effect per unit coolant).

So reduce radius by 2x and you reduce surface area and thus heat pumped out by 4x, but only reduce heat production via reducing wire length by at most 2x as I described in the article.
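For comparison, the coolant-limited picture in this reply can be sketched the same way. The assumptions here are that removable heat is proportional to surface area (radius squared) and heat production to wire length (radius); the function name and unit constants are hypothetical.

```python
def coolant_limited_balance(radius, heat_per_length=1.0, removal_per_area=1.0):
    """Return (heat produced, heat removable) in arbitrary units.

    Assumed model: production ~ radius, removal capacity ~ radius ** 2.
    """
    produced = heat_per_length * radius
    removable = removal_per_area * radius ** 2
    return produced, removable


produced_full, removable_full = coolant_limited_balance(1.0)
produced_half, removable_half = coolant_limited_balance(0.5)

# How much worse the production-to-removal ratio gets when radius is halved.
shortfall = (produced_half / removable_half) / (produced_full / removable_full)
print("relative shortfall after halving radius:", shortfall)
```

In this model halving the radius cuts removal capacity by 4x but production by only 2x, so the ratio of heat produced to heat removable doubles.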

Active cooling ends up using more energy as you are probably aware. Moving to a colder environment is of course feasible (and used to some extent by some datacenters), but that hardly gets OOM gains on earth.

Well, to be clear, there is no easy path to 6 OOM in further energy efficiency improvement. At a strictly trends-prediction level that is of the same order as the gap between a 286 and an Nvidia RTX 4090, which took 40 years of civilization-level effort. At a circuit theory level the implied ~1e15/s analog synaptic ops in 1e-5J is impossible without full reversible computing, as interconnect is only ~90% of the energy cost, not 99.999%, and the minimal analog or digital MAC op consumes far more than 0.1eV. So not only can it not even run conventional serial algorithms or massively parallel algorithms, it has to use fully reversible parallel logic. Like quantum computing, it's still unclear what maps usefully to that paradigm. I'm reasonably optimistic in the long term, but ...
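To make the per-operation budget concrete, here is a small calculation. It assumes the "1e-5J" above means 1e-5 joules per second (a ~10 W brain improved by 6 OOM) and uses 310 K as the operating temperature for the Landauer comparison; both are assumptions of this sketch.

```python
import math

K_B = 1.380649e-23          # Boltzmann constant, J/K
EV = 1.602176634e-19        # joules per electronvolt

ops_per_second = 1e15       # implied analog synaptic ops per second
power_w = 1e-5              # assumed power budget, J/s
temperature_k = 310.0       # assumed operating temperature

energy_per_op_ev = power_w / ops_per_second / EV
landauer_ev = K_B * temperature_k * math.log(2) / EV

print(f"budget per op: {energy_per_op_ev:.3f} eV")
print(f"Landauer bound per bit at 310 K: {landauer_ev:.4f} eV")
print(f"budget in Landauer units: {energy_per_op_ev / landauer_ev:.1f}")
```

Under these assumptions the budget is about 0.06 eV per op, which is below the 0.1 eV figure mentioned above and only a few times the Landauer bound for a single bit erasure.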

I'm skeptical that even the implied error bit correction rate energy costs would make much sense on the surface of the earth. An advanced quantum or reversible computer's need for minimal noise and thus temperature to maintain coherence or low error rate is just a symptom of reaching highly perfected states of matter, where any tiny atomic disturbance can be catastrophic and cause a cascade of expensive-to-erase errors. Ironically such a computer would likely be much larger than the brain - this appears to be one of the current fundamental tradeoffs with most reversible computation, it's not a simple free lunch (optical computers are absolutely enormous, superconducting circuits are large, reversibility increases area, etc.). At scale such systems would probably only work well off earth, perhaps far from the sun or buried in places like the dark side of the moon, because they become extremely sensitive to thermal noise, cosmic rays, and any disorder. We are talking about artilect-level tech in 2048 or something, not anything near term.

So instead I expect we'll have a large population of neuromorphic AGI/uploads well before that.

which implies by association that brain software is much more efficient as it was produced by exactly the same evolutionary process which he now admits produced fully optimized conventional computational elements over the same time frame, etc

 

I don't believe this would follow; we actually have much stronger evidence that ought to screen off that sort of prior - simply the relatively large differences in human cognitive abilities.

Evolution optimizes population distributions with multiple equilibria and niches; large diversity in many traits is expected, especially for highly successful species.

Furthermore, what current civilization considers to be useful cognitive abilities often have costs - namely in longer neoteny training periods - which don't always pay off vs quicker-to-breeding strategies.

There seems to be much more diversity in human cognitive performance than there is in human-brain-energy-efficiency; whether this is due to larger differences in the underlying software (to the extent that this is meaningfully commensurable with differences in hardware) or because smaller differences in that domain result in much larger differences in observable outputs, or both, none of that really takes away from the fact that brain software does not seem to be anywhere near the relevant efficiency frontier, especially since many trade-offs which were operative at an evolutionary scale simply aren't when it comes to software.

Human mind software evolves at cultural speeds so its recent age isn't comparably relevant. Diversity in human cognitive capabilities results from the combined oft multiplicative effects of brain hardware differences compounding unique training datasets/experiences.

It's well known in DL that you can't really compare systems trained on very different datasets, but that is always the case with humans.

The trained model can change at cultural speeds but the human neural network architecture, hyperparameters, and reward functions can’t. (Or do you think those are already close to optimal for doing science & technology or executing complicated projects etc.?)

(Whatever current or future chips we wind up using for AGI, we’ll almost definitely be able to change the architecture, hyperparameters, and reward functions without fabricating new chips. So I count those as “software not hardware”. I’m unsure how you’re defining those terms.)

Relatedly, if we can run a brain-like algorithm on computer chips at all (or (eventually) use synth-bio to grow brains in vats, or whatever), then we can increase the number of cortical columns / number of neurons / whatever to be 3× more than any human, and hence (presumably) we would get an AI that would be dramatically more insightful than any human who has ever existed. Specifically, it could hold a far richer and more complicated thought in working memory, whereas humans would have to chunk it and explore it sequentially, which makes it harder to notice connections / analogies / interactions between the parts. I’m unclear on how you’re thinking about things like that. It seems pretty important on my models.

The trained model can change at cultural speeds but the human neural network architecture, hyperparameters, and reward functions can’t. (Or do you think those are already close to optimal for doing science & technology or executing complicated projects etc.?)

You can't easily completely change either the ANN or BNN architecture/hyperparams after training, as the weights you invest so much compute in learning are largely dependent on those decisions - and actually the architecture is just equivalent to weights. Sure there are ways to add new modules later or regraft things, but that is a very limited scope for improving the already trained modules.

As to your second question - no, I don't think there are huge gains over the human brain arch. In part because the initial arch doesn't/shouldn't matter that much. If it does then it wasn't flexible enough in the first place. One of the key points of my ULM post was pointing out how the human brain - unlike current DL systems - learns the architecture during training through high level sparse wiring patterns. "Architecture" is largely just wiring patterns, and in a huge flexible network you can learn architecture.

Relatedly, if we can run a brain-like algorithm on computer chips at all (or (eventually) use synth-bio to grow brains in vats, or whatever), then we can increase the number of cortical columns / number of neurons / whatever to be 3× more than any human,

Sure, but the human brain is already massive and far off chinchilla scaling. It seems much better currently to use your compute/energy budget on running a smaller model much faster (to learn more quickly).

Specifically, it could hold a far richer and more complicated thought in working memory, whereas humans would have to chunk it and explore it sequentially,

GPT4 probably doesn't have that same working memory limitation baked into its architecture but it doesn't seem to matter much. I guess it's possible it learns that limitation to imitate humans, but regardless I don't see much evidence that the human working memory limitation is all that constraining.

Sure, but the human brain is already massive and far off chinchilla scaling. It seems much better currently to use your compute/energy budget on running a smaller model much faster (to learn more quickly).

I thought your belief was that the human brain is a scaled up chimp brain, right? If so:

  • If I compare “one human” versus “lots of chimps working together and running at super-speed”, in terms of ability to do science & technology, the former would obviously absolutely crush the latter.
  • …So by the same token, if I compare “one model that’s like a 3×-scaled-up human brain” to “lots of models that are like normal (non-scaled-up) human brains, working together and running at super-speed”, in terms of ability to do science & technology, it should be at least plausible that the former would absolutely crush the latter, right?

Or if that’s not a good analogy, why not? Thanks.

First, the human brain uses perhaps 10x the net effective training compute (3x size, 2x neoteny extending training of higher modules, a bit from arch changes), and scale alone leads to new capabilities.

But the main key new capability was the evolution of language, and the resulting cultural revolution. Chimps train on 1e8s of lifetime data or so, and that's it. Humans train on 1e9s, but that 1e9s dataset is a compression of all the experience of all humans who have ever lived. So the effective dataset size scales with human population size vs being constant, and even a sublinear scaling with population size leads to a radically different regime. The most important inventions driving human civilization progress indirectly or directly drive up that scaling factor.

OK, so in your picture chimps had less training / less scale / worse arch than humans, and this is related to the fact that humans have language and chimps don’t. “Scale alone leads to new capabilities.”

But if we explore the regime of “even more training than humans / even more scale than humans / even better arch than humans”, your claim is that this whole regime is just a giant dead zone where nothing interesting happens, and thus you’re just being inefficient—really you should have split it into multiple smaller models. Correct? If so, why do you think that?

In other words, if scaling up from chimp brains to human brains unlocked new capabilities (namely language), why shouldn’t scaling up from human brains to superhuman brains unlock new capabilities too? Do you think there are no capabilities left, or something?

(Sorry if you’ve already talked about this elsewhere.)

OK, so in your picture chimps had less training / less scale / worse arch than humans, and this is related to the fact that humans have language and chimps don’t. “Scale alone leads to new capabilities.”

Scale in compute and data - as according to NN scaling laws. The language/culture/tech leading to a new effective data scaling regime quickly reconfigured the pareto surface payoff for brain size, so it's more of a feedback loop rather than a clear cause-effect (which is why I would consider it a foom in terms of evolutionary timescales).

In other words, if scaling up from chimp brains to human brains unlocked new capabilities (namely language), why shouldn’t scaling up from human brains to superhuman brains unlock new capabilities too?

Of course, but the new capabilities are more like new skills, mental programs, and wisdom not metasystems transitions (changes to core scaling regime).

A metasystems transition would be something as profound, rare, and as important as transitioning from effective lifetime training data being a constant to effective lifetime data scaling with population size, or transitioning from non-programmable to programmable.

Zoom in and look at what a large NN is for - what does it do? It can soak up more data to acquire more knowledge/skills, and it also learns faster per timestep (as it's searching in parallel over a wider circuit space per time step), but the latter is already captured in net training compute anyway. So intelligence is mostly about the volume of search space explored, which scales with net training compute - this is almost an obvious direct consequence of Solomonoff induction or derivation thereof.

I am not arguing that there are no more metasystems transitions, only that "make brains bigger" doesn't automatically enable them. The single largest impact of digital minds is probably just speed. Not energy efficiency or software efficiency, just raw speed.

I endorse this summary of my position.

Cavin then derives minimal viable bit energies above 1 eV.

I couldn’t find that; what part of the paper are you referring to?

It's actually 0.5 eV for interconnect and 1.7 eV for memory, sections 3A and 3B.

Since this is a top level comment, I guess it should probably have a reply by me. (Even though I'm writing this way after all the discussion has already happened, mostly under other comments and in other posts. I'm mainly just leaving it so future readers can find that stuff more easily.)

See my post here (and the subsequent discussion) on the question of whether the need for reliable erasure adds an extra factor of 50 onto Landauer's limit.

It is completely ingenious to mention only Frank for that claim

Ingenious? You mean misleading?

Oh my good catch. I actually meant ingenuous. I think the OP started looking into my references specifically looking for loopholes (which is good/fine), and then happened to pick Frank first unfortunately rather than Cavin or Zhirnov.

Ingenuous was my first thought too, but it doesn't have the meaning I think you intended. Perhaps disingenuous?

Ingenuous as in naive, innocent, unsuspecting. I think he naively wandered into Mike Frank. I really like some of Mike Frank's explanations for bit energy and entropy - the philosophy of computation, but the critics (which includes Zhirnov/Cavin) who believe his admirable quest to achieve practical reversible computing is quixotic are probably correct.

Disingenuous was probably the intended word.

Great post, I also find the entire line of argument from thermodynamic efficiency slightly weird. Even if the brain is perfectly energy efficient, the actual amount of energy it uses is still set by the number of calories that our ancestors were able to get in the ancestral environment, and its size set by the maximal baby head that can be reliably birthed by a woman. A superintelligence need only provide more than the brain's 12 watts of power in order to exceed the number of computations thermodynamics allows us to do... 

It's not, in my view at least, useful as a means to dismiss any key point of AI safety - only to calibrate expectations about capabilities scaling curves.

Sure, the brain being thermodynamically efficient does constrain the capabilities scaling curves somewhat, but by far the biggest source of uncertainty remains on the relationship between the number of computations done by an agent and its ability to actually influence the world. I have no idea how many bit flips it takes to behave like an optimal bayesian agent.

Hmm. Have you read all of jake's posts, and also discovering agents, and also agents and devices, and also alberta plan? or at least, if any of those are new to you, you could start with the abstracts. I don't currently think an answer already exists to your question, and I find myself not quite able to just pop out a rephrasing of simulators in terms of discovering-agents agency or etc, but it feels like it's not far out of reach to have a solid hunch about the big-O shape of possible answers. I still need to deep-read agents and devices, though, maybe that has what I need.

Fair enough for optical

I’m not ready to forever dismiss the possibility of zillions of within-chip optical interconnects. There’s a UC Berkeley group (Ming Wu & Eli Yablonovitch) that’s been working towards that vision; see for example extensive analysis in Nicolas Andrade's thesis. The analysis supposedly shows a path to 1 fJ/bit = 6000 eV/bit end-to-end. If we accept Jacob’s claimed limit of 0.1-1 eV/bit/nm (which I mostly don’t, but that’s a different topic), then (this analysis suggests) that we could do better in principle by optic-izing all interconnects longer than 6-60μm. I’m not sure what fraction of the total that is.
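The 6-60μm crossover can be reproduced with a short calculation. This sketch takes the 1 fJ/bit optical figure and the 0.1-1 eV/bit/nm electrical range from the text at face value, and treats the optical cost as independent of length, which is an assumption of the example.

```python
EV = 1.602176634e-19                 # joules per electronvolt

optical_j_per_bit = 1e-15            # 1 fJ/bit end-to-end, from the text
optical_ev_per_bit = optical_j_per_bit / EV

crossover_um = {}
for wire_ev_per_bit_nm in (0.1, 1.0):
    # Wire length at which electrical cost equals the fixed optical cost.
    crossover_nm = optical_ev_per_bit / wire_ev_per_bit_nm
    crossover_um[wire_ev_per_bit_nm] = crossover_nm / 1000.0

print(f"optical cost: {optical_ev_per_bit:.0f} eV/bit")
for cost, length in crossover_um.items():
    print(f"wire cost {cost} eV/bit/nm -> crossover at {length:.1f} um")
```

With these inputs the optical cost is a bit over 6000 eV/bit, and the break-even wire length falls between roughly 6 μm and 60 μm depending on which end of the electrical range is used.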

If that analysis is correct and replacing all interconnect longer than, say, 100 μm is viable, that seems to be about 50% of the interconnect (assuming a log-uniform-ish distribution). Seems unlikely to be a 10x power saver, but perhaps a 2x? (I only glanced at the paper, not sure if he has analysis more directly tackling that question.)

But if your theory is found to be against the Second Law of Thermodynamics I can give you no hope; there is nothing for it to collapse in deepest humiliation. ― Arthur Eddington

I think you meant "there is nothing for it but to collapse in deepest humiliation".

Modern computers aren't close to the thermodynamic limits of computation either.

A few years ago I wondered whether I could power my laptop just by streaming data to it, assuming I had a szilard engine as a charging device. Energy via the cloud! But sadly at room temperature, it would require an amount of data approximately equal to the size of the entire internet to be delivered to the laptop every second. And this is not including energy loss for wires and other parts.

At the limit, a lot of bits delivers very little power, but on the flip side, at the limit a little power could deliver a lot of bits.
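The Szilard engine estimate above can be sketched as follows. The 50 W laptop power draw and the 300 K room temperature are assumptions of this example, and the engine is taken to extract the ideal maximum of kT ln 2 per bit.

```python
import math

K_B = 1.380649e-23           # Boltzmann constant, J/K

laptop_power_w = 50.0        # assumed laptop power draw
room_temp_k = 300.0          # assumed room temperature

# Ideal Szilard engine: at most k_B * T * ln(2) joules of work per bit.
joules_per_bit = K_B * room_temp_k * math.log(2)
bits_per_second = laptop_power_w / joules_per_bit
zettabytes_per_second = bits_per_second / 8 / 1e21

print(f"work per bit: {joules_per_bit:.3e} J")
print(f"required data rate: {bits_per_second:.2e} bits/s")
print(f"which is about {zettabytes_per_second:.1f} ZB/s")
```

Under these assumptions the required rate is on the order of 10^22 bits per second, i.e. zettabytes every second, before counting any losses.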