This doesn't seem impressive compared to Nvidia's offerings.
The Andromeda 'supercomputer' has a peak performance of 120 PFLOPS dense, compared to 512 PFLOPS dense for a single 256-GPU H100 pod from Nvidia, and is unlikely to be competitive in compute/$; if it were competitive, Cerebras would be advertising/boasting about that miracle as loudly as they could. Instead they are focusing on this linear scaling thing, which isn't an external performance comparison at all.
The Cerebras wafer-scale chip is a weird architecture that should excel in the specific niche of training small models at high speed, but that just isn't where the industry is going. It is severely lacking in the large, cheap, fast off-chip RAM that GPUs have: this is a key distinguishing feature of the GPU architecture, combined with the hierarchical cache/networking topology.
In fact I'd argue that having linear scaling is a bad sign: it indicates you haven't achieved the level of detailed optimization that physics makes possible. Longer-range interconnect is fundamentally, physically more expensive, and the optimal compute architectures will reflect that cost structure. Local compute is physically cheaper, so the ideal architecture should charge software less for it (make more available at the same price) than for long-range compute.
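As a toy illustration of that cost structure (a sketch with made-up relative costs, not measured figures): if moving a byte gets more expensive the further it travels, an architecture that keeps most of its traffic in the cheap local tiers pays far less for the same total volume of data moved.

```python
# Toy model of data-movement cost by distance tier. The relative costs are
# made-up placeholders purely to illustrate the shape of the argument.
COST_PER_BYTE = {
    "on_chip": 1,       # SRAM next to the compute units
    "off_chip": 10,     # DRAM/HBM on the same board
    "intra_node": 50,   # chip-to-chip links inside a node
    "inter_node": 250,  # network hops between nodes
}

def movement_cost(bytes_by_tier):
    """Total 'physical' cost of one training step's data movement."""
    return sum(COST_PER_BYTE[tier] * b for tier, b in bytes_by_tier.items())

# Same total bytes moved, very different physical cost depending on where
# the traffic happens:
print(movement_cost({"on_chip": 900, "inter_node": 100}))  # 25900
print(movement_cost({"on_chip": 100, "inter_node": 900}))  # 225100
```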
The Andromeda 'supercomputer' has a peak performance of 120 PFLOPS dense, compared to 512 PFLOPS dense for a single 256-GPU H100 pod from Nvidia
I'm not sure PFLOPS are a fair comparison here though, if I understand Cerebras' point correctly. Like, if you have ten GPUs with one PFLOPS each, that's technically the same number of PFLOPS as a single GPU with ten PFLOPS. But in practice that single GPU is going to train a lot faster than the ten GPUs, because the ten GPUs have to spend time communicating with each other. Especially as memory limitations...
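A quick back-of-the-envelope version of that (all numbers are assumptions for illustration, not real Cerebras or Nvidia figures): the ten slow GPUs match the single fast GPU on raw compute, but every step also pays a synchronization cost.

```python
# Ten 1-PFLOPS GPUs vs. one 10-PFLOPS GPU on the same step, with made-up numbers.
work_pflop = 600.0                   # floating-point work in one training step

t_single = work_pflop / 10.0         # one 10-PFLOPS GPU: 60 s, no sync needed

t_compute = (work_pflop / 10) / 1.0  # ten 1-PFLOPS GPUs: still 60 s of compute,
t_sync = 15.0                        # ...plus gradient exchange every step
t_cluster = t_compute + t_sync       # 75 s total, despite identical peak PFLOPS

print(t_single, t_cluster)           # 60.0 75.0
```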
Well, it will scale linearly until it hits the finite node-to-node bandwidth limit... just like all other supercomputers. If you have your model training on different nodes, you still need to share all your weights with all other nodes at some point, which is fundamentally an all-to-all operation; the scaling only appears linear while you're spending more time computing your weight updates than you are communicating with other nodes. I don't see this really being a qualitative jump, but it might well be one more point to add to the graph of increasing compute power dedicated to AI.
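A minimal sketch of that regime change, assuming plain data parallelism with a roughly constant per-step sync cost (all parameters are placeholder assumptions, not measurements): speedup looks linear while the per-node compute time dominates, then saturates once the communication term takes over.

```python
def step_time(n_nodes, compute_s=1000.0, allreduce_s=2.0):
    """Toy data-parallel step time: compute shrinks with node count,
    the weight-sync term (assumed roughly constant) does not."""
    return compute_s / n_nodes + allreduce_s

base = step_time(1)
for n in (1, 4, 16, 64, 256, 1024):
    print(n, round(base / step_time(n), 1))
# Speedup is roughly 1.0, 4.0, 15.5, 56.9, 169.7, 336.6 -- near-linear while
# compute dominates, saturating (here below ~500x) once sync is the bottleneck.
```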
Hmm, I see how that would happen with other architectures, but I'm a bit confused about how it applies here. Andromeda has the weight updates computed by a single server (MemoryX) and then distributed to all the nodes. Wouldn't this be a one-to-many broadcast with O(log n) transmission time?
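For what it's worth, the one-to-many direction does scale gently: a tree-structured broadcast only needs a number of forwarding rounds proportional to log2 of the node count (a sketch of just that hop count; real systems are limited by link bandwidth and message size, not hops alone).

```python
import math

def broadcast_rounds(n_nodes):
    """Forwarding rounds for a binary-tree broadcast: each node that already
    holds the data sends it on to one more node per round."""
    return math.ceil(math.log2(n_nodes)) if n_nodes > 1 else 0

for n in (2, 16, 256, 1024):
    print(n, broadcast_rounds(n))  # 1, 4, 8, 10 -- grows very slowly
```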
Cerebras recently unveiled Andromeda (https://www.cerebras.net/andromeda/), an AI supercomputer that enables near-linear scaling. Do I understand correctly that this might have a big impact on large (language) model research, since it would significantly speed up training? E.g. if current models take 30+ days to train, we could just 10x the number of machines and have it done in three days? Also, it seems to be much simpler to use, thus decreasing the cost of development and the hassle of distributed computing.
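(To make the arithmetic explicit, under the idealized assumption that scaling really is linear and communication is free:)

```python
# Idealized linear scaling: 10x the machines -> 1/10 the wall-clock time.
current_training_days = 30
machine_multiplier = 10
print(current_training_days / machine_multiplier)  # 3.0 days, in the ideal case
```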
If so, I think it's almost certain that large companies would do it, and this in turn would significantly speed up the research/training/algorithm development of large models such as GPT, GATO and similar. It seems like this type of development should affect the discussion about timelines, but I haven't seen it mentioned anywhere else before.