All of hippke's Comments + Replies

I think the biggest improvement in this report can be made regarding Appendix D. The authors describe that they use "process size rather than transistor size" which is, as they correctly note, a made-up number. What should be used instead is transistor density (transistors per area), which is readily available in much detail for many past nodes, and the most recent "5nm" nodes (see e.g., wikichip).

What about the Landauer limit? We are 3 orders of magnitude from the Landauer limit ( J/op); see my article here on LessWrong. The authors list several physical limitations, but this one seems to be missing. It may pose the most relevant limit.

2Marius Hobbhahn
We did consider modeling it explicitly. However, most estimates of the Landauer limit give predictions very similar to those from size limits. So we decided against adding it explicitly to the model; it is "implicitly" modeled in the physical size. We intend to look into Landauer's limit at some point, but it's not a high priority right now.
Answer by hippke122

That's an excellent question, pondered by the brightest minds. The great Freeman Dyson proposed a solution dubbed eternal intelligence (Dyson 1979, Reviews of Modern Physics, Volume 51, Issue 3, July 1979, pp.447-460). Basically, some finite amount of matter=energy is stored. As the universe cools over time, energy costs per computation decrease (logarithmically, but forever). After each cooling time period, one can use some fraction of the remaining energy, which will thus never go to zero, leading to eternal consciousness. 

It was later understood th...

I think this calculation is invalid. A human is created from a seed worth 700 MB of information, encoded in the form of DNA. This seed was produced by millions of years of evolution and compresses a large (but finite) amount of information (energy). A relevant fraction of hardware and software is encoded in this information. Additional learning is done over 20 years, worth 3 MWh. The fractional value of this learning part is unknown.

1JBlack
Initialising a starting state of 700 MB at 10^-21 J per bit operation costs about 6 picojoules. Obtaining that starting state through evolution probably cost many exajoules, but that's irrelevant to the central thesis of the post: fundamental physical limits based on the cost of energy required for the existence of various levels of intelligence. If you really intended this post to hypothesize that the only way for AI to achieve high intelligence would be to emulate all of evolution in Earth's history, then maybe you can put that in another post and invite discussion on it. My comment was in relation to what you actually wrote.
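The picojoule figure in this reply can be checked in a few lines. This is a minimal sketch that assumes 700 MB means 700 × 10^6 bytes and uses the 10^-21 J per bit operation quoted above; the function name is made up for illustration.

```python
def init_energy_joules(size_bytes, joules_per_bit_op=1e-21):
    """Energy to set every bit of a state once, at a fixed cost per bit operation."""
    return size_bytes * 8 * joules_per_bit_op

# 700 MB taken as decimal megabytes (an assumption, not stated in the thread)
energy = init_energy_joules(700e6)
print(f"{energy * 1e12:.1f} pJ")
```

The result lands at roughly 6 pJ, which matches the order of magnitude given in the reply.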

How can we know that "it is possible to train a 200 IQ equivalent intelligence for at most 3 MW-hr"?

3JBlack
Because 20-year-old people with 200 IQ exist, and their brains consume approximately 3 MW-hr by age 20. Therefore there are no fundamental physical limitations preventing this.
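The 3 MW-hr figure is an energy budget of the form power × time. A rough sketch follows, assuming a brain power draw of about 20 W (a commonly cited estimate that is not stated in this thread) and a constant draw over 20 years:

```python
HOURS_PER_YEAR = 365.25 * 24

def energy_mwh(power_watts, years):
    """Total energy in MWh for a constant power draw over a number of years."""
    return power_watts * years * HOURS_PER_YEAR / 1e6

# 20 W is an assumed brain power draw, not a number from the thread
brain_mwh = energy_mwh(20, 20)
print(f"{brain_mwh:.1f} MWh")
```

Under these assumptions the total comes out in the low single-digit MWh range, consistent with the "approximately 3 MW-hr" quoted above.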

How did von Neumann come close to taking over the world? Perhaps Hitler, but von Neumann?

Sure! I argue that we just don't know whether such a thing as "much more intelligent than humans" can exist. Millions of years of monkey evolution have increased human IQ to the 50-200 range. Perhaps that can go 1000x, perhaps it would level off at 210. The AGI concept makes the assumption that it can go to a big number, which might be wrong.

3wunan
Even without having a higher IQ than a peak human, an AGI that merely ran 1000x faster would be transformative.
4Dave Lindbergh
We don't know, true. But given the possible space of limiting parameters it seems unlikely that humans are anywhere near the limits. We're evolved systems, evolved under conditions in which intelligence was far from the most important priority. And of course under the usual evolutionary constraints (suboptimal lock-ins like backward wired photoreceptors in the retina, the usual limited range of biological materials - nothing like transistors or macro scale wheels, etc.). And by all reports John von Neumann was barely within the "human" range, yet seemed pretty stable. He came remarkably close to taking over the world, despite there being only one of him and not putting any effort into it. I think all you're saying is there's a small chance it's not possible.

From what I understand about "ELO inflation", it refers to the effect that the Top 100 FIDE players had 2600 ELO in 1970, but 2700 ELO today. It has been argued that the level simply increased, as more very good players entered the field. The ELO number as such should be fair in both eras (after playing infinitely many games...). I don't think that it is an issue for computer chess comparisons. Let me know if you have other data/information!

I ran the experiment "Rebel 6 vs. Stockfish 13" on Amazon's AWS EC2. I rented a Xeon Platinum 8124M which benched at 18x 1.5 MNodes/s. I launched 18 concurrent single-threaded game sets with 128 MB of RAM for each engine. Again, ponder was off, no books, no tables. Time settings were 40 moves in 60s + 0.6 per move, corresponding to 17.5 MNodes/move. For reference, SF13 benches at ELO 3630 at this setting (entry "64 bit"); Rebel 6.0 got 2415 on a Pentium 90 (SSDF Computer Rating List (01-DEC-1996).txt, 90 kN/move).

The result: 

  • 1911 games played
  • 18 d
...
  • With a baseline of 10 MNodes/move for SF3, I need to set SF13 to 0.375 MNodes/move for equality. That's a factor of 30. Caveat: I only ran 10 games which turned out equal, and only at 10 MNodes/move for SF3.
  • Yes: Rebel6 at normal 2021 settings (40 moves in 15 min) can be approximately matched with SF13 at 20 kNodes/move. More precisely: I get parity between Rebel6 (128 MB) and SF13 (128 MB) for 16 MNodes/move vs. 20 kNodes/move (=factor of 800x). On my Intel Core-M 5Y31 (750 kNodes/s), that's 21s vs. 0.026s per move. Note that the figure shows SF8, not SF13.
  • I was contacted by one person via PM, we are discussing the execution setup. Otherwise, I could do it by the end of July after my vacation.
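The Rebel6 vs. SF13 parity numbers in the bullets above reduce to two divisions: the ratio of nodes per move, and nodes per move divided by engine speed. A small sketch using only the figures quoted there (the helper name is illustrative):

```python
def seconds_per_move(nodes_per_move, nodes_per_second):
    """Think time needed to search a given number of nodes at a given speed."""
    return nodes_per_move / nodes_per_second

REBEL6_NODES = 16e6   # 16 MNodes/move
SF13_NODES = 20e3     # 20 kNodes/move
CORE_M_SPEED = 750e3  # 750 kNodes/s on the Core-M 5Y31

ratio = REBEL6_NODES / SF13_NODES
rebel_time = seconds_per_move(REBEL6_NODES, CORE_M_SPEED)
sf13_time = seconds_per_move(SF13_NODES, CORE_M_SPEED)
print(f"compute ratio: {ratio:.0f}x")
print(f"Rebel6: {rebel_time:.1f} s/move, SF13: {sf13_time:.3f} s/move")
```

The same helper can be reused for other engine speeds, since nodes/s is engine- and machine-dependent.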

OK, I have added the Houdini data from this experiment to the plot:

The baseline ELO is not stated, but likely close to 3200:

Experiment                    kNodes/move   ELO drop   ELO calculated
4k nodes vs 2k nodes          2             303        1280
8k nodes vs 4k nodes          4             280        1583
16k nodes vs 8k nodes         8             237        1863
32k nodes vs 16k nodes        16            208        2100
64k nodes vs 32k nodes        32            179        2308
128k nodes vs 64k nodes       64            156        2487
256k nodes vs 128k nodes      128           136        2643
512k nodes vs 256k nodes      256           134        2779
1024k nodes vs 512k nodes     512           115        2913
2048k nodes vs 1024k nodes    1024          93         3028
4096k nodes vs 2048k nodes    2048          79         3121
Baseline                      4096                     3200
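The "ELO calculated" column above follows from the baseline by subtracting each ELO drop in turn, starting from the largest node count. A minimal sketch of that bookkeeping, assuming the 3200 baseline guessed above (the names are illustrative):

```python
BASELINE_ELO = 3200  # assumed baseline ("likely close to 3200")

# (kNodes/move of the weaker side, ELO drop against the doubled node count)
DROPS = [(2, 303), (4, 280), (8, 237), (16, 208), (32, 179), (64, 156),
         (128, 136), (256, 134), (512, 115), (1024, 93), (2048, 79)]

def calculated_elos(baseline, drops):
    """Walk down from the baseline, subtracting each drop cumulatively."""
    elos = {}
    elo = baseline
    for knodes, drop in reversed(drops):
        elo -= drop
        elos[knodes] = elo
    return elos

elos = calculated_elos(BASELINE_ELO, DROPS)
for knodes in sorted(elos):
    print(f"{knodes:5d} kNodes/move -> ELO {elos[knodes]}")
```

Changing `BASELINE_ELO` shifts every row by the same amount, so the shape of the curve does not depend on the guessed baseline.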
2paulfchristiano
The results look quite different for Houdini 3 vs SF8---is this just a matter of Stockfish being much better optimized for small amounts of hardware?

Mhm, good point. I must admit that the "70 ELO per doubling" etc. is forum wisdom that is perhaps not the last word. A similar scaling experiment was done with Houdini 3 (2013) which dropped below 70 ELO per doubling when exceeding 4 MNodes/move. In my experiment, the drop is already around 1 MNode/move. So there is certainly an engine dependence.

From what I understand about the computer chess community:

  • Engines are optimized to win in the competitions, for reputation. There are competitions for many time controls, but most well respected are the CCC with games of 3 to 15 minutes, and TCEC which goes up to 90 minutes. So there is an incentive to tune the engines well into the many-MNodes/move regime.
  • On the other hand, most testing during engine development is done at blitz or even bullet level (30s for the whole game for Stockfish). You can't just play thousands of long games after each code commit
...

Yes, that's correct. It is slightly off because I manually set the year 2022 to match 100,000 kNodes/s. That could be adjusted by one year. To get an engine which begins its journey right in the year 2021, we could perform a similar experiment with SF14. The curve would be virtually identical, just shifted to the right and up.

Oh, thank you for the correction about Magnus Carlsen! Indeed, my script to convert the timestamps had an error. I fixed it in the figure.

Regarding the jump in 2008 with Rybka: I think that's an artifact of that particular list. Similar lists don't have it.

Good point: SF12+ profit from NNs indirectly.

Regarding the ELO gain with compute: That's a function of diminishing returns. At very small compute, you gain +300 ELO per doubling; after ~10 doublings that reduces to +30 ELO. In between is the region with ~70 ELO; that's where engines usually operate on present hardware with minutes of think time. I am currently running a set of benchmarks to plot a nice graph of this.
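To get a feel for what these ELO gains mean in games, the standard Elo logistic formula converts a rating difference into an expected score. The formula is the usual Elo model, not something taken from this thread, and the function name is illustrative:

```python
def expected_score(elo_diff):
    """Expected score (win = 1, draw = 0.5) of the stronger side under the Elo model."""
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400))

for gain in (300, 70, 30):
    print(f"+{gain} ELO -> expected score {expected_score(gain):.3f}")
```

A +300 gain corresponds to a lopsided match, while +30 is only slightly better than even, which is why many games are needed to resolve the small gains at high compute.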

Yes, sorry, I got that the wrong way around. 70%=algo

hippkeΩ1100

i) To pick a reference year, it seems reasonable to take the mid/late 1990s:
- Almost all chess engines before ~1996 lacked multi-core support (or used it with serious inefficiencies) (very lengthy discussion here).
- Chess protocols became available, so that the engine and the GUI separated. That makes it straightforward to automate games for benchmarking.
- Modern engines should work on machines of that age, considering RAM constraints.
- The most famous human-computer games took place in 1997: Kasparov-Deep Blue. That's almost a quarter of a century ago (nice round n...

1Lech Mazur
Stockfish 12 and newer have neural network (NNUE)-based evaluation enabled by default so I wouldn't say that Stockfish is similar to other non-NN modern engines. https://nextchessmove.com/dev-builds is based on playing various versions of Stockfish against each other. However, it is known that this overestimates the ELO gain. I believe +70 ELO for doubling compute is also on the high side, even on single-core computers.
2magfrump
Isn't that 70:30 algo:compute?
4paulfchristiano
I like using Fritz. It sounds like we are on basically the same page about what experiments would be interesting.

The MIPS are only a "lookup table" from the year, based on a CPU list. It's for the reader's convenience to show the year (linear), plus a rough measure of compute (exponential).

The nodes/s measure has the problem that it is engine-dependent.

The real math was done by scaling down one engine (SF8) by time-per-move, and then calibrating the time to the computers of that era (e.g., a Quad i7 from 2009 has 200x the nodes/s compared to a PII-300 from 1999).

Yes, that is a correct interpretation. The SF8 numbers are:

MIPS = [139814.4, 69907, 17476, 8738, 4369, 2184, 1092, 546.2, 273.1, 136.5, 68.3, 34.1, 17.1, 8.5, 4.3]

ELO = [3407,3375,3318,3290,3260,3225,3181,3125,3051,2955,2831,2671,2470,2219,1910]

Note that the range of values is larger than the plotted range in the Figure. The Figure cuts off at an 80486DX 33 MHz, 27 MIPS, introduced May 7, 1990.

To derive an analytical result, it is reasonable to interpolate with a spline and then subtract. Let me know if you have a specific question (e.g. for the year 2000).
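As a rough illustration of the interpolate-and-subtract step, the sketch below uses piecewise-linear interpolation in log2(MIPS) as a simpler stand-in for the spline mentioned above. The data are the SF8 numbers listed above; the function names are illustrative, and a real spline would give slightly different values between the measured points.

```python
import math

MIPS = [139814.4, 69907, 17476, 8738, 4369, 2184, 1092, 546.2, 273.1,
        136.5, 68.3, 34.1, 17.1, 8.5, 4.3]
ELO = [3407, 3375, 3318, 3290, 3260, 3225, 3181, 3125, 3051, 2955, 2831,
       2671, 2470, 2219, 1910]

def elo_at(mips):
    """Piecewise-linear interpolation of ELO in log2(MIPS); not a true spline."""
    points = sorted(zip(MIPS, ELO))
    if not points[0][0] <= mips <= points[-1][0]:
        raise ValueError("MIPS value outside the measured range")
    for (m0, e0), (m1, e1) in zip(points, points[1:]):
        if m0 <= mips <= m1:
            span = math.log2(m1) - math.log2(m0)
            t = (math.log2(mips) - math.log2(m0)) / span
            return e0 + t * (e1 - e0)

def elo_gap(mips_new, mips_old):
    """ELO difference of the same engine between two compute levels."""
    return elo_at(mips_new) - elo_at(mips_old)

# Example: fastest measured machine versus the 27 MIPS 80486DX cut-off
print(round(elo_gap(139814.4, 27)))
```

Interpolating in log space is a deliberate choice here, because the ELO values are roughly linear in the logarithm of compute over most of the range.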

2paulfchristiano
How are those MIPS numbers produced? My impression was that the raw numbers were nodes/sec, and then some calibration was done to relate this to MIPS?

As you can see in my Figure in this post (https://www.lesswrong.com/posts/75dnjiD8kv2khe9eQ/measuring-hardware-overhang), Leela (a neural-network-based chess engine) has a very similar log-linear ELO-FLOPs scaling as traditional algorithms. At least in this case, neural networks scale slightly better for more compute, and worse for less compute. It would be interesting to determine if the bad scaling to old machines is a universal feature of NNs. Perhaps it is: NNs require a certain amount of memory, etc., which gives stronger constraints. The conclusion would be that the hardware overhang is reduced: Older hardware is less suitable for NNs.

Thank you for your interest: It's good to see people asking similar questions! Also, thank you for incentivizing research with rewards. Yes, I think closing the gaps will be straightforward. I still have the raw data, scripts, etc. to pick it up.

i) old engines on new hardware - can be done; needs definition of which engines/hardware

ii) raw data + reproduction - perhaps everything can be scripted and put on GitHub

iii) controls for memory + endgame tables - can be done, needs definition of requirements

iv) Perhaps the community can already agree on a set of experiments before they are performed, e.g. memory? I mean, I can look up "typical" values of past years, but I'm open to other values.

2paulfchristiano
i) I'm interested in any good+scalable old engine. I think it's reasonable to focus on something easy; the most important constraint is that it is really state of the art and scales up pretty gracefully. I'd prefer 2000 or earlier.

ii) It would be great if there was at least a complete description (stuff like: these numbers were looked up from this source with links, the population was made of the following engines with implementations from this link, here's the big table of game results and the elo calculation, here was the code that was run to estimate nodes/sec).

iii) For the "old" experiment I'd like to use memory from the reference machine from the old period. I'd prefer to basically remove endgame tables and opening book.

My ideal would be to pick a particular "old" year as the focus. Ideally that would be a year for which we (a) have an implementation of the engine, (b) have representative hardware from the period that we can use to compute nodes/sec for each of our engines. Then I'm interested in:

  • Compute nodes/sec for the old and new engine on both the old and new hardware. This gives us 4 numbers.
  • Evaluate elos of both of those engines, running on both "old memory" and "new memory," as a function of nodes/turn. This gives us 4 graphs. (I assume that memory affects performance slightly independently of nodes/turn, at least for the new engine? If nodes/turn is the wrong measure, whatever other measure of computational cost makes sense, the important thing is that the cost is linear in the measurement.)

At the Landauer kT limit, you need  kWh to perform your  FLOPs. That's 10,000x the yearly US electricity production. You'd need a quantum computer or a Dyson sphere to solve that problem.
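For readers who want to redo this kind of estimate, the Landauer bound is k·T·ln 2 per irreversible bit operation. The sketch below assumes 300 K and treats each operation as one bit erasure, which is a strong simplification; the operation count in the example is purely hypothetical and is not the figure from the comment above.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant in J/K
JOULES_PER_KWH = 3.6e6

def landauer_energy_kwh(bit_ops, temperature_k=300.0):
    """Minimum energy in kWh to perform bit_ops irreversible bit operations."""
    joules = bit_ops * K_B * temperature_k * math.log(2)
    return joules / JOULES_PER_KWH

# Hypothetical operation count, chosen only to show the call
example_ops = 1e30
print(f"{landauer_energy_kwh(example_ops):.0f} kWh")
```

Lowering `temperature_k` lowers the bound proportionally, which is the effect behind the cooling argument in Dyson's eternal-intelligence proposal mentioned earlier on this page.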

2Daniel Kokotajlo
Either that, or magic. :D

Like AGI winter, a time of reduced funding

Right. My experiment used 1 GB for Stockfish, which would also work on a 486 machine (although at the time, it was almost unheard of...)

(a) The most recent data points are from CCRL. They use an i7-4770k and the listed tournament conditions. With this setup, SF11 has an ELO of about 3500. That's what I used as the baseline to calibrate my own machine (an i7-7700k).

(b) I used the SF8 default which is 1 GB.

(c) Yes. However, the hardware details (RAM, memory bandwidth) are not all that important. You can use these SF9 benchmarks on various CPUs. For example, the AMD Ryzen 1800 is listed with 304,510 MIPS and gets 14,377,000 nodes/sec on Stockfish (i.e., about 47.2 nodes per MIPS). The oldest CPU in the ...
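The nodes-per-MIPS calibration is a plain quotient of the two figures quoted for the Ryzen 1800. A short sketch using only those two numbers (the function name is illustrative); it evaluates to about 47.2 nodes per MIPS:

```python
def nodes_per_mips(nodes_per_second, mips):
    """Engine search speed normalized by the CPU's MIPS rating."""
    return nodes_per_second / mips

ryzen_1800 = nodes_per_mips(14_377_000, 304_510)
print(f"{ryzen_1800:.1f} nodes per MIPS")
```

Repeating this for other CPUs in the same benchmark list would show how stable the ratio is across hardware generations.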

3Peter Jin
Wow, thanks for the very comprehensive response. (Also fun to see someone has compiled a modern chess engine on early-mid-90s hardware and shared their results.)

Regarding (1): Of course a step is possible; you never know. But for arithmetic, it is not a step. That may appear so from their poor Figure, but the data indicates otherwise.

True. Do these tests scale out to super-human performance or are they capped at 100%?

Unless you have an idea to monetize one of these sub-tasks? An investment on the order of 10M USD in compute is not very large if you can create a Pokemon Comedy TV channel out of it, or something like that.

3romeostevensit
fair