Cerebras recently unveiled Andromeda (https://www.cerebras.net/andromeda/), an AI supercomputer that enables near-linear scaling. Do I understand correctly that this might have a big impact on large (language) model research, since it would significantly speed up training? E.g., if current models take 30+ days to train, could we just 10x the number of machines and have it done in three days? Also, it seems to be much simpler to use, thus decreasing the cost of development and the hassle of distributed computing.

If so, I think it's almost certain that large companies would do this, and this in turn would significantly speed up the research/training/algorithm development of large models such as GPT, GATO and similar. It seems like this type of development should affect the discussion about timelines; however, I haven't seen it mentioned anywhere else.

2 Answers

jacob_cannell

102

This doesn't seem impressive compared to Nvidia's offerings.

The Andromeda 'supercomputer' has peak performance of 120 pflops dense compared to 512 pflops dense for a single 256 H100 GPU pod from nvidia, and is unlikely to be competitive in compute/$; if it were competitive, Cerebras would be advertising that miracle as loudly as they could. Instead they are focusing on this linear scaling thing, which isn't an external performance comparison at all.

The Cerebras wafer-scale chip is a weird architecture that should excel in the specific niche of training small models at high speed, but that just isn't where the industry is going. It is severely lacking in the large, cheap, fast off-chip RAM that GPUs have: this is a key distinguishing feature of the GPU architecture, combined with the hierarchical cache/networking topology.

In fact I'd argue that having linear scaling is a bad sign: it indicates you haven't achieved the level of detailed optimization that physics allows. Longer-range interconnect is fundamentally physically more expensive, and the optimal compute architectures will reflect that cost structure. Local compute is physically cheaper, so the ideal architecture should charge software less for it (make more available at the same price) vs long-range compute.

The Andromeda 'supercomputer' has peak performance of 120 pflops dense compared to 512 pflops dense for a single 256 H100 GPU pod from nvidia

I'm not sure if PFLOPs are a fair comparison here though, if I understand Cerebras' point correctly. Like, if you have ten GPUs with one PFLOP each, that's technically the same number of PFLOPs as a single GPU with ten PFLOPs. But actually that single GPU is going to train a lot faster than the ten GPUs because the ten GPUs are going to have to spend time communicating with each other. Especially as memory limitations...

3jacob_cannell
Well that's not quite right - otherwise everyone would be training on single GPUs using very different techniques, which is not what we observe. Every parallel system has communication, but it doesn't necessarily 'spend time' on that in the blocking sense; it typically happens in parallel with computation. SOTA models do now often seem limited by RAM, so model parallelism is increasingly important as it is RAM efficient. This is actually why Cerebras's strategy doesn't make sense: GPUs are optimized heavily for the sweet spot in terms of RAM capacity/$ and RAM bandwidth. The wafer-scale approach instead tries to use on-chip SRAM to replace off-chip RAM, which is just enormously more expensive - at least an OOM more expensive in practice. This of course is bogus because with model parallelism you can tune the interconnect requirements based on the model design, and Nvidia has been tuning their interconnect tradeoffs for years in tandem with researchers co-tuning their software/models for Nvidia hardware. So current training setups are not strongly limited by interconnect vs other factors - some probably are, some underutilize interconnect and are limited by something else, but Nvidia knows all of this, has all that data, and has been optimizing for these use cases weighted by value for years now (and is empirically better at this game than anybody else). The upside of a wafer-scale chip is fast on-chip transfer; the downside is slower off-chip transfer (as that is limited by the 2D perimeter of the much larger chip). For equal flops and/or $$, the GPU design of breaking up the large tile into alternating logic and RAM subsections has higher total off-chip RAM and off-chip transfer bandwidth. The more ideal wafer design would be one where you had RAM stacked above in 3D, but Cerebras doesn't do that, presumably because they need that whole surface for heat transfer. If you look inside the engine block of the CS-2 from their nice virtual tour you can see that the waf
4Amal
I am certainly not an expert, but I am still not sure about your claim that it's only good for running small models. The main advantage they claim to have is "storing all model weights externally and stream them onto each node in the cluster without suffering the traditional penalty associated with off chip memory. weight streaming enables the training of models two orders of magnitude larger than the current state-of-the-art, with a simple scaling model." (https://www.cerebras.net/product-cluster/, weight streaming). So they explicitly claim that it should perform well with large models. Furthermore, in their white paper (https://f.hubspotusercontent30.net/hubfs/8968533/Virtual%20Booth%20Docs/CS%20Weight%20Streaming%20White%20Paper%20111521.pdf), they claim that the CS-2 architecture is much better suited for sparse models (e.g. via the Lottery Ticket Hypothesis), and on page 16 they show that a sparse GPT-3 could be trained in 2-5 days. This would also align with tweets by OpenAI that Trillion is the new billion, and with rumors about the new GPT-4 being a similarly big jump as GPT-2 -> GPT-3 was, with a colossal number of parameters and a sparse paradigm (https://thealgorithmicbridge.substack.com/p/gpt-4-rumors-from-silicon-valley). I could imagine that sparse parameters deliver much stronger results than normal parameters, and this might change scaling laws a bit.
2jacob_cannell
This is almost a joke, because the equivalent GPU architecture has both greater total IO bandwidth to any external SSD/RAM array, and the massive near-die GPU RAM that can function as a cache for any streaming approach. So if streaming works as well as Cerebras claims, GPUs can do that as well or better. I agree sparsity (and also probably streaming) will be increasingly important; I've actually developed new techniques for sparse matrix multiplication on GPUs.
2Zach Furman
Hmm, I'm still not sure I buy this, after spending some more time thinking about it. GPUs can't stream a matrix multiplication efficiently, as far as I'm aware. My understanding is that they're not very good at matrix-vector operations compared to matrix-matrix because they rely on blocked matrix multiplies to efficiently use caches and avoid pulling weights from RAM every time. Cerebras says that the CS-2 is specifically designed for fast matrix-vector operations, and uses dataflow scheduling, so it can stream a matrix multiplication by just performing matrix-vector operations as weights stream in. And the weights are getting streamed from external RAM, rather than requested as needed, so there's no round-trip latency gunking up the works like a GPU has when it wants data from RAM. Cerebras claims that their hardware support for fast matrix-vector multiplication gives a 10x speed boost to multiplying sparse matrices, which could be helpful.
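To make the idea of "streaming a matrix multiplication as matrix-vector operations" concrete, here is a minimal pure-Python sketch. It only illustrates the general technique of consuming weight rows as they arrive while holding one row at a time; it is not Cerebras's implementation or any real API. The names `stream_weight_rows` and `streamed_matvec` are hypothetical.

```python
from typing import Iterable, Iterator, List


def stream_weight_rows(weights: List[List[float]]) -> Iterator[List[float]]:
    """Hypothetical stand-in for an external weight store.

    Yields one row of W at a time, so the consumer never holds the full matrix.
    """
    for row in weights:
        yield row


def streamed_matvec(rows: Iterable[List[float]], x: List[float]) -> List[float]:
    """Compute y = W @ x, consuming rows of W as they stream in."""
    y = []
    for row in rows:
        # One dot product per incoming row; the row can be discarded afterwards.
        y.append(sum(w * xi for w, xi in zip(row, x)))
    return y


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, -1.0]
print(streamed_matvec(stream_weight_rows(W), x))  # [-1.0, -1.0, -1.0]
```

Only the activation vector `x` has to stay resident in this scheme; whether that is a win in practice depends on how fast the rows can be delivered, which is the point under dispute in this thread.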
2jacob_cannell
Of course GPUs can and do stream a larger matrix multiplication from RAM - the difference is that the GPU design has multiple OOM more bandwidth to the equivalent external RAM (about 3 OOM to be more specific). Also the latest Lovelace/Hopper GPUs have more SRAM now - 50MB per chip, so about 10GB of SRAM for a 200 GPU pod similar to the Cerebras wafer. The CS-2 is only good at matrix-vector operations that fit in its SRAM capacity. As a thought experiment, consider running a brain-like ANN with 10B neurons and 10T sparse weights. Simulating one second of activity requires only on the order of 10T sparse ops, or a couple OOM more dense ops, which is already within current single-GPU capability. The problem is that streaming in the 10TB of weight data would take several minutes on the CS-2's pathetically slow IO path. Meanwhile the equivalently priced 200 GPU pod can fit the weights in GPU RAM and has the performance to simulate about a hundred instances of that brain-sized model in real time, so about 10000x higher performance than the CS-2. Weights outnumber activations by 3 or 4 OOM, so moving weights over long distances as in the CS-2 is enormously inefficient compared to moving the activations around (as in the GPU design), which uses very little bandwidth. The future is in the opposite direction of that CS-2 'weight streaming' - towards more optimal neuromorphic computing - where the weights stay in place and the activations flow through them.
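The thought experiment above comes down to dividing the weight volume by the available IO bandwidth. A rough sketch of that arithmetic follows; the bandwidth values are illustrative placeholders chosen to span a few orders of magnitude, not measured specifications of the CS-2 or of any GPU, and one byte per weight is an assumption made to match the 10TB figure.

```python
TB = 1e12  # bytes
GB = 1e9   # bytes


def stream_seconds(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move all weights once over a link of the given bandwidth."""
    return weight_bytes / bandwidth_bytes_per_s


# 10T weights at an assumed 1 byte per weight.
weight_bytes = 10 * TB

# Hypothetical link speeds in GB/s (placeholders, not vendor numbers).
for bandwidth_gb_s in (50, 500, 5000):
    seconds = stream_seconds(weight_bytes, bandwidth_gb_s * GB)
    print(f"{bandwidth_gb_s:>5} GB/s -> {seconds:8.1f} s per full pass over the weights")
```

Under these assumptions, every factor of ten in bandwidth is a factor of ten in time per pass, which is why the comment treats the off-chip IO path as the deciding number.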
2Amal
My understanding is that they fully separate computation and memory storage. So while traditional architectures need some kind of cache to store a large amount of data for model partitions, of which just a small portion is used for the computation at any single point in time, the CS-2 only requests what it needs, so the bandwidth doesn't need to be so big.

Razied

10

Well, it will scale linearly until it hits the finite node-to-node bandwidth limit... just like all other supercomputers. If you have your model training on N different nodes, you still need to share all your weights with all other nodes at some point, which is fundamentally an O(N^2) operation; it just appears linear when you're spending more time computing your weight updates than you are communicating with other nodes. I don't see this really being a qualitative jump, but it might well be one more point to add to the graph of increasing compute power dedicated to AI.

[This comment is no longer endorsed by its author]

Hmm, I see how that would happen with other architectures, but I'm a bit confused how this is O(N^2) here? Andromeda has the weight updates computed by a single server (MemoryX) and then distributed to all the nodes. Wouldn't this be a one-to-many broadcast with O(N) transmission time?

3Razied
You're completely right, I don't know how I missed that, I must be more tired than I thought I was.
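For reference, the two communication patterns contrasted in this exchange (every node sharing with every other node, versus a single server sending to all nodes) can be compared by simply counting messages. This is a naive sketch: the function names are hypothetical, each message is assumed to carry one full copy of the weights, and real clusters typically use ring or tree collectives that reduce the all-to-all cost well below this naive count.

```python
def all_to_all_messages(n_nodes: int) -> int:
    """Naive pattern: each node sends its weights to every other node."""
    return n_nodes * (n_nodes - 1)


def broadcast_messages(n_nodes: int) -> int:
    """One central server sends the updated weights to each node once."""
    return n_nodes


for n in (2, 4, 16, 64):
    print(
        f"N={n:>3}: all-to-all={all_to_all_messages(n):>5}, "
        f"broadcast={broadcast_messages(n):>3}"
    )
```

The all-to-all count grows quadratically with the number of nodes while the broadcast count grows linearly, which is the distinction raised in the reply above.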