All of lennart's Comments + Replies

The Bay Area, where MATS is based, is not the governance hub of the US;

The Bay is an AI hub, home to OpenAI, Google, Meta, etc., and therefore an AI governance hub. Governance is not the same as government. Important decisions are being made there - maybe more important decisions than in DC. To quote Allan Dafoe:

AI governance concerns how humanity can best navigate the transition to a world with advanced AI systems[1]. It relates to how decisions are made about AI[2], and what institutions and arrangements would help those decisions to be made well.

Also, many, many A... (read more)

Agree that this discussion is surprisingly often confusing and that people use the terms interchangeably. Unfortunately, readers often referred to our training compute measurement as a measure of performance, rather than as a quantity of executed operations. However, I don't think this is only due to the abbreviations; it's also due to a lack of understanding of what one measures. Besides making the distinction clearer with the terms, one should probably also explain it more and use terms such as quantity and performance.

For my research, I've been t... (read more)

smcurtis
Note that @lennart has since recommended using FLOP for quantity and FLOP/s (or FLOPS) for performance.

They trained it on TPUv3s; however, the robot inference was run on a GeForce RTX 3090 (see section G).

TPUs are mostly designed for data centers and are not really usable for on-device inference.

lennart

I'd be curious to hear more thoughts on how much we could already scale it right now. Looks like data might be a bottleneck?

Some thoughts on compute:

Gato estimate: 256 TPUv3 chips for 4 days × 24 hours = 24'576 TPUv3-hours (on-demand costs are $2 per hour for a TPUv3) = $49'152

In comparison, PaLM used 8'404'992 TPUv4-hours, and I estimated that it'd cost $11M+. If we assume that someone would be willing to spend the same compute budget on it, we could make the model 106x bigger (assuming Chinchilla scaling laws). I also tweeted about this here.
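
A quick back-of-the-envelope check of these numbers (a sketch; it just re-does the arithmetic above and is not an apples-to-apples compute comparison, since TPUv4- and TPUv3-hours deliver different FLOP/s):

```python
# Back-of-the-envelope check of the Gato vs. PaLM numbers above.
# Assumes the on-demand TPUv3 price of $2/hour quoted above.
gato_tpu_hours = 256 * 4 * 24          # = 24,576 TPUv3-hours
gato_cost = gato_tpu_hours * 2         # ~= $49,152

palm_tpu_hours = 8_404_992             # TPUv4-hours (from the PaLM paper)
palm_cost = 11e6                       # ~$11M+, the estimate mentioned above

print(f"Gato: {gato_tpu_hours:,} TPU-hours, ~${gato_cost:,}")
print(f"PaLM used ~{palm_tpu_hours / gato_tpu_hours:.0f}x the TPU-hours "
      f"and ~{palm_cost / gato_cost:.0f}x the budget")
```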

The size o... (read more)

Answer by lennart

I recently estimated the training cost of PaLM to be around $9M to $17M. Please note all the caveats; this is only estimating the final training run costs using commercial cloud computing (Google's TPUv3).

As already said, a 10T parameter model using Chinchilla scaling laws would need around 1.3e28 FLOPs. That's 5200x more compute than PaLM (2.5e24 FLOPs).

Therefore, 5200 × $9M ≈ $47B and 5200 × $17M ≈ $88B.

So a conservative estimate is around $47 to $88 billion.
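
A sketch of the arithmetic behind this answer. The 6·N·D training-FLOP approximation and the ≈20 tokens per parameter are the usual Chinchilla rules of thumb, which I'm assuming here rather than taking from the answer itself:

```python
# Chinchilla rules of thumb: training FLOP ~= 6 * params * tokens, tokens ~= 20 * params.
params = 10e12                        # 10T parameters
tokens = 20 * params                  # ~2e14 tokens
flop_10t = 6 * params * tokens        # ~1.2e28 FLOP (rounded to ~1.3e28 above)

flop_palm = 2.5e24                    # PaLM's training compute
print(f"{flop_10t / flop_palm:.0f}x") # ~4800x; the answer rounds to 5200x

for palm_cost in (9e6, 17e6):         # PaLM final-run cost estimate, $9M-$17M
    print(f"${5200 * palm_cost / 1e9:.0f}B")   # -> ~$47B and ~$88B
```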

lennart

It took Google 64 days to train PaLM using more than 6'000 TPU chips. Using the same setup (which is probably one of the most interconnected and capable ML training systems out there), it'd take 912 years.
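
Spelled out (a sketch, reusing the 5200x figure from the answer above):

```python
# 5200x PaLM's compute on the same 64-day, ~6'000-chip setup, run sequentially:
days = 5200 * 64        # = 332,800 days
print(days / 365)       # ~912 years
```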

Thanks for the thoughtful response, Connor.

I'm glad to hear that you will develop a policy and won't be publishing models by default.

lennart

Glad to see a new Alignment research lab in Europe. Good luck with the start and the hiring!

I'm wondering about the following. You're saying:

That being said, our publication model is non-disclosure-by-default, and every shared work will go through an internal review process out of concern for infohazards.

That's different from Eleuther's position[1]. Is this a change of mind or a different practice due to the different research direction? Will you continue open-sourcing your ML models?

  1. ^

    "A grassroots collective of researchers working to open source AI research."

TL;DR: For the record, EleutherAI never actually had a policy of always releasing everything to begin with and has always tried to consider each publication's pros vs cons. But this is still a bit of a change from EleutherAI, mostly because we think it's good to be more intentional about what should or should not be published, even if one does end up publishing many things. EleutherAI is unaffected and will continue working open source. Conjecture will not be publishing ML models by default, but may do so on a case-by-case basis.

Longer version:

Firs... (read more)

lennart

From their paper:

We trained PaLM-540B on 6144 TPU v4 chips for 1200 hours and 3072 TPU v4 chips for 336 hours including some downtime and repeated steps.

That's 64 days.

It's roughly an order of magnitude more compute than GPT-3.

| ML Model | Compute [FLOPs] | x GPT-3 |
|---|---|---|
| GPT-3 (2020) | 3.1e23 | 1 |
| Gopher (2021-12) | 6.3e23 | ≈2x |
| Chinchilla (2022-04) | 5.8e23 | ≈2x |
| PaLM (2022-04) | 2.5e24 | ≈10x |
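
Both the "64 days" and the TPUv4-hours figure used elsewhere in this thread fall straight out of the quoted sentence (a quick sketch):

```python
# From the PaLM paper: 6144 chips for 1200 h, then 3072 chips for 336 h.
wall_clock_days = (1200 + 336) / 24          # = 64 days
tpu_v4_hours = 6144 * 1200 + 3072 * 336      # = 8,404,992 TPUv4-hours
print(wall_clock_days, f"{tpu_v4_hours:,}")
```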
gwern

Which is reasonable. It has been about <2.5 years since GPT-3 was trained (they mention the move to Azure disrupting training, IIRC, which lets you date it earlier than just 'May 2020'). Under the 3.4 month "AI and Compute" trend, you'd expect 8.8 doublings or the top run now being 445x. I do not think anyone has a 445x run they are about to unveil any second now. Whereas on the slower >5.7-month doubling in that link, you would expect <36x, which is still 3x PaLM's actual 10x, but at least the right order of magnitude.
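
The doubling arithmetic, spelled out (a sketch; I'm using 30 months as the elapsed time since GPT-3, so the outputs land close to, but not exactly on, the rounded figures above):

```python
# Extrapolate compute growth since GPT-3 under different doubling times.
months_since_gpt3 = 30
for doubling_months in (3.4, 5.7):
    doublings = months_since_gpt3 / doubling_months
    print(f"{doubling_months}-month doubling: {doublings:.1f} doublings "
          f"-> {2**doublings:.0f}x GPT-3")
```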

There may also be other runs... (read more)

Steven Byrnes
For reference, 1 billion petaFLOP is probably a reasonable guess for how much computation a human brain does in 30 years. (I think the brain has much more memory though.) (30 years ≈ 1 billion seconds, and 1e15 FLOP/s is a central estimate for brain computation from Joe Carlsmith's report.)
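
The arithmetic behind that guess (a sketch):

```python
# ~30 years of a brain running at ~1e15 FLOP/s (Carlsmith's central estimate):
seconds = 30 * 365 * 24 * 3600        # ~9.5e8 s, i.e. roughly a billion seconds
total_flop = seconds * 1e15           # ~9.5e23 FLOP
print(total_flop / 1e15 / 1e9)        # ~0.95, i.e. ~1 billion petaFLOP (1 petaFLOP = 1e15 FLOP)
```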

Minor correction. You're saying:
> So training a 1-million parameter model on 10 books takes about as many FLOPS as training a 10-million parameter model on one book.

You link to FLOP per second, aka FLOPS, whereas you're talking about the plural of FLOP, a quantity (often written FLOPs).
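
For what it's worth, the equivalence in the quoted sentence is easy to see with the usual training-compute approximation, FLOP ≈ 6 · parameters · tokens (my assumption here, not something the quoted post states):

```python
# Training FLOP ~= 6 * params * tokens, so parameters and data trade off linearly.
def train_flop(params, tokens):
    return 6 * params * tokens

tokens_per_book = 100_000                      # illustrative assumption
print(train_flop(1e6, 10 * tokens_per_book))   # 1M-param model, 10 books
print(train_flop(10e6, 1 * tokens_per_book))   # 10M-param model, 1 book -> same FLOP
```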

I'm wondering: could one just continue training Gopher (the previous bigger model) on the newly added data?

gwern
Unlikely, because Gopher is so far from what they find optimal. See the table of requirements which helpfully defines compute requirements in terms of "Gophers" (perhaps they were thinking much the same thing). An optimal 280b-parameter model (ie. a Gopher) requires 17.2 Gophers' worth of compute, or to put it another way, Gopher used only 6% of the compute it should've for it to have been an optimal model. You could train almost 3 different 175-billion models from scratch for what it would take to 'finish' Gopher (they cost 6.7x Gopher).
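
A quick check of those ratios, using only the numbers quoted in the comment (a sketch):

```python
# Compute requirements expressed in units of "Gophers" (Gopher's actual training compute).
optimal_280b = 17.2    # compute for a compute-optimal 280B model
gopher_used = 1.0      # what Gopher actually used
optimal_175b = 6.7     # compute for a compute-optimal 175B model

print(f"Gopher used {gopher_used / optimal_280b:.0%} of the optimal-280B budget")   # ~6%
print(f"'Finishing' Gopher costs ~{(optimal_280b - gopher_used) / optimal_175b:.1f} "
      f"optimal 175B models")                                                       # ~2.4
```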
Daniel Kokotajlo
Probably, right? They might have to change the hyperparameters e.g. the learning rate schedule. I'd imagine they are already doing this.

Thanks for the comment! That sounds like a good and fair analysis/explanation to me.

We basically lumped together the reduced cost per FLOP and the increased spending.

A report from CSET on AI and Compute projects the costs using two strongly simplifying assumptions: (I) compute doubling every 3.4 months (based on OpenAI's previous report) and (II) constant computing cost. This could give you a rough upper bound on projected costs.
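
A minimal sketch of what a projection under those two assumptions looks like (the starting cost and horizon are placeholders, not numbers from the CSET report):

```python
# (I) compute doubles every 3.4 months, (II) cost per FLOP is constant
# -> the cost of the largest training run also doubles every 3.4 months.
def projected_cost(cost_today_usd, years_ahead, doubling_months=3.4):
    return cost_today_usd * 2 ** (years_ahead * 12 / doubling_months)

print(f"${projected_cost(10e6, 5):,.0f}")   # a hypothetical $10M run, 5 years out
```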

Carey's previous analysis uses this dataset from AI Impacts and therefore assumes:

[..] while the cost per unit of computation is decreasing by an order of magnitude every 4-12 years (the long-run

... (read more)
delton137
Interesting, thanks. 10x reduction in cost every 4 years is roughly twice what I would have expected. But it sounds quite plausible especially considering AI accelerators and ASICs.
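
For intuition, converting "10x every N years" into a halving time, to compare with the doubling-time framings used elsewhere in the thread (just unit conversion, not a claim about which number is right):

```python
import math

# A 10x cost reduction every N years means cost halves every N * log10(2) years.
for years_per_10x in (4, 8, 12):
    print(f"10x per {years_per_10x} years -> cost halves every "
          f"~{years_per_10x * math.log10(2):.1f} years")
```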

Thanks for sharing your thoughts. As you already outlined, the report mentions on several occasions that the hardware forecasts are the least informed:

“Because they have not been the primary focus of my research, I consider these estimates unusually unstable, and expect that talking to a hardware expert could easily change my mind.”

This is partly why I started looking into this a couple of months ago, and I'm still doing so on the side. A couple of points come to mind:

... (read more)

co-author here

I like your idea. Nonetheless, it's pretty hard to make estimates of "total available compute capacity". If you have any pointers, I'd love to see them.

Somewhat connected is the question: what share of this progress/trend is due to improvements in computing power versus increased spending? To get more insight into this, we're currently looking into computing power trends and the development of FLOPS/$ over time.

TLW
I tend to fall on the side of "too many ideas", not "too few ideas". (The trick is sorting out which ones are actually worth the time...) A few metrics, all hilariously inaccurate:
1. "Total number of transistors ever produced". (Note that this is necessarily monotonically non-decreasing.)
2. "Total number of transistors currently operational" (or an estimate thereof.)
3. "Integral of <FLOPs/transistor as a function of time, multiplied by rate of transistor production as a function of time>"
4. One of the above, but with DRAM & Flash (or, in general, storage) removed from # of transistors produced.
5. One of the above, but using FLOPs/$ and total spending (or estimate of current total market assets) as a function of time instead of using FLOPs/transistor and # transistors as a function of time.
6. Total BOINC capacity as a function of time (of course, doesn't go back all that far...)

That being said, estimates of global compute capacity over time do exist, see e.g. https://ijoc.org/index.php/ijoc/article/view/1562/742 and https://ijoc.org/index.php/ijoc/article/view/1563/741. These together show (as of 2012, unfortunately, with data only extending to 2007) that total MIPS on general-purpose computers grew from ~5×10^8 MIPS in 1986 to ~9×10^12 MIPS in 2007. (Fair warning: that's MIPS, so (a somewhat flawed) integer benchmark, not floating-point.) Or about a doubling every ~1.4-1.5 years or so.

As long as we're talking about extrapolations, be aware that I've seen rumblings that we're now at a plateau, in that the latest generation of process nodes are actually about the same $/transistor as (or even higher than) the previous generation of process nodes. I don't know how accurate said rumblings are, however. (This is "always" the case very early in a process node; the difference here is that it's still the case even as we're entering volume production...)

A related metric that would be interesting is total theoretical fab output (# wafers * # transistors / wafer *
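
The implied doubling time, spelled out from the two endpoints quoted above (a sketch):

```python
import math

# Total general-purpose MIPS: ~5e8 in 1986 -> ~9e12 in 2007 (from the estimates cited above).
doublings = math.log2(9e12 / 5e8)                                      # ~14.1 doublings
print(f"one doubling every ~{(2007 - 1986) / doublings:.2f} years")    # ~1.5 years
```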
lennart

Custom ML hardware (e.g. Google's TPUs, Baidu's Kunlun, etc.) is tricky to place in these sorts of comparisons. For those, I think the MLPerf benchmarks are super useful. I'd be curious to hear the authors' expectations of how this research changes in the face of more custom ML hardware.

I'd be pretty excited to see more work on this. Jaime already shared our hardware sheet where we collect information on GPUs, but as you outline, that's peak performance and sometimes misleading.

Indeed, the MLPerf benchmarks are useful. I've already gathered ... (read more)

Great post! I especially liked that you outlined potential emerging technologies and the economic considerations.

Having looked a bit into this when writing my TAI and Compute sequence, I agree with your main takeaways. In particular, I'd like to see more work on DRAM and the interconnect trends and potential emerging paradigms.

I'd be interested in your compute forecasts to inform TAI timelines. For example, Cotra's draft report assumes a doubling time of 2.5 years for FLOPs/$ but acknowledges that this forecast could be easily improved by someone with mo... (read more)

Veedrac
I'm not all that sure this is going to give you anything more useful than what you have already. Around the end of this decade my compute predictions detach from any strict timeline, and the pace of progress within this decade is smaller than the potential range of money spent on the problem, so even if you draw a sure line around where TAI happens with today's AI techniques, you don't gain all that much from better guesses about hardware progress.

Put another way, if you assume current connectionist architectures scaled up to ~brain parity buy you most of the tools you need to build TAI, then you don't need to worry about longer term hardware progress. If you don't assume that, then you don't have a meaningful anchor to use these longer term predictions with anyway. If I had strong timelines for physical technology progress you could at least say, architectures like P will be tried around the 20X0s, and architectures like Q will be tried around the 20Y0s, but I don't have strong timelines for progress that goes that far out.

I do think understanding longer term tech progress is relevant, because I think that current AI systems do seem to keep buying relevant cognitive abilities as they scale, and having a long roadmap implies that we'll keep doing that until the trick stops working or we hit AGI. But I don't know how to put a date on that, at least one that's more informative than ‘it's technically plausible, and could come moderately soon if things go fast’.

Thanks!

  • I'm working with a colleague on the trends of the three components of compute systems (compute, memory, and interconnect) over time and then comparing them to our best estimates for the human brain (or other biological anchors). This will still take some time, but I hope we will be able to share it in the future (≈ by the end of the year).
Gunnar_Zarncke
Cool. Looking forward to it.

Thanks for the correction and references. I just followed my "common sense" from lectures and other pieces.

What do you think made AlexNet stand out? Is it the depth and use of GPUs?

teradimich
I do not know the opinions of experts on this issue. And I lack competence for such conclusions, sorry.

Thanks for the feedback, Gunnar. You're right - it's more of a recap and introduction. I think the "newest" insights are probably the updates in Section 2.3.

I'd also be curious to know which aspects and questions you're most interested in.

Gunnar_Zarncke
The update in 2.3 was a valuable update. Based on the title (and my interests) I was hoping for:
* some integration of the limits for compute, memory, and interconnect. Like you say, they limit each other, but it is not very clear how the limits interrelate and scale with each other. Empirically, it would be interesting to see the relative sizes of these parts over time.
* some comparison of the relative sizes of the parts of the human brain responsible for processing where we do have algorithms that are comparable to what the brain does, e.g. image processing and object and scene detection in the visual cortex.