Cerebras recently unveiled Andromeda (https://www.cerebras.net/andromeda/), an AI supercomputer that enables near-linear scaling. Do I understand correctly that this might have a big impact on large (language) model research, since it would significantly speed up training? E.g. if current models take 30+ days to train, could we just 10x the number of machines and have it done in three days? It also seems to be much simpler to use, which would decrease the cost of development and the hassle of distributed computing.
If so, I think it's almost certain that large companies would do it, and this in turn would significantly speed up the research/training/algorithm development of large models such as GPT, GATO and similar. It seems like this type of development should affect the discussion about timelines, but I haven't seen it mentioned anywhere else.
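As a rough sketch of the arithmetic behind the question: under perfectly linear scaling, wall-clock training time is the baseline divided by the machine multiplier, and anything short of linear stretches it. The function name `training_days` and the `efficiency` parameter are illustrative assumptions, and the 0.8 value is a made-up example rather than a measured figure for any system.

```python
def training_days(base_days: float, machine_multiplier: float, efficiency: float = 1.0) -> float:
    """Estimate wall-clock training time after scaling out.

    base_days          -- training time on the current cluster
    machine_multiplier -- how many times more machines are used (e.g. 10)
    efficiency         -- fraction of ideal speedup actually achieved
                          (1.0 means perfectly linear scaling; hypothetical knob)
    """
    if machine_multiplier <= 0 or not (0 < efficiency <= 1):
        raise ValueError("multiplier must be positive and efficiency in (0, 1]")
    return base_days / (machine_multiplier * efficiency)


# The case from the question: 30 days, 10x the machines, ideal linear scaling.
ideal = training_days(30, 10)
# The same scale-out with an assumed 80% scaling efficiency.
sublinear = training_days(30, 10, efficiency=0.8)

print(f"ideal linear scaling: {ideal:.2f} days")
print(f"80% efficiency:       {sublinear:.2f} days")
```

Note that this only models wall-clock time. The total machine-days consumed stay the same or grow, so linear scaling by itself says nothing about cost.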
This doesn't seem impressive compared to Nvidia's offerings.
The Andromeda 'supercomputer' has a peak performance of 120 PFLOPS dense, compared to 512 PFLOPS dense for a single 256-GPU H100 pod from Nvidia, and is unlikely to be competitive in compute/$. If it were competitive, Cerebras would be advertising that miracle as loudly as they could. Instead they are focusing on this linear scaling thing, which isn't an external performance comparison at all.
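To put the two quoted peak figures side by side, here is a minimal sketch. It uses only the numbers stated above, which are taken as given rather than independently verified. Peak dense FLOPS is not delivered training throughput, and the two figures may not refer to the same numeric precision. The variable names are illustrative.

```python
# Peak dense throughput figures as quoted in the comment (assumed, not verified).
ANDROMEDA_PFLOPS = 120.0
H100_POD_PFLOPS = 512.0
H100_POD_GPUS = 256

# How many times larger the quoted H100 pod peak is than the Andromeda peak.
pod_vs_andromeda = H100_POD_PFLOPS / ANDROMEDA_PFLOPS

# Implied peak per H100, and the number of H100s matching Andromeda's quoted peak.
per_gpu_pflops = H100_POD_PFLOPS / H100_POD_GPUS
equivalent_gpus = ANDROMEDA_PFLOPS / per_gpu_pflops

print(f"H100 pod / Andromeda peak ratio: {pod_vs_andromeda:.2f}x")
print(f"Implied peak per H100:           {per_gpu_pflops:.1f} PFLOPS")
print(f"H100s matching Andromeda's peak: {equivalent_gpus:.0f}")
```

On these quoted numbers, Andromeda's peak corresponds to roughly a quarter of one pod. A compute/$ comparison would additionally need prices, which the thread does not give.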
The Cerebras wafer-scale chip is a weird architecture that should excel in the specific niche of training small models at high speed, but that just isn't where the industry is going. It is severely lacking in the large, cheap, fast off-chip RAM that GPUs have: this is a key distinguishing feature of the GPU architecture, combined with the hierarchical cache/networking topology.
In fact I'd argue that having linear scaling is a bad sign: it indicates you haven't achieved the level of detailed optimization that physics allows. Longer-range interconnect is fundamentally physically more expensive, and the optimal compute architectures will reflect that cost structure. Local compute is physically cheaper, so the ideal architecture should charge software less for it (make more available at the same price) vs long-range compute.
My understanding is that they fully separate computation and memory storage. So while traditional architectures need some kind of cache to hold large amounts of data for model partitions, of which only a small portion is used for computation at any single point in time, the CS-2 only requests what it needs, so the bandwidth doesn't need to be as large.