https://arxiv.org/abs/2402.17764 claims that 1-bit LLMs (more precisely, 1.58-bit LLMs with ternary weights) are possible.
If this scales, I'd imagine there is a ton of speedup to unlock, since our hardware has been optimized for low-bit integer operations for decades. What does this imply for companies like Nvidia and the future of LLM inference/training?
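To make the hardware intuition concrete, here's a rough numpy sketch (my own illustration, not code from the paper): with weights restricted to {-1, 0, +1}, as in BitNet b1.58, each dot product in a linear layer reduces to adding and subtracting activations, so no multiplier is needed at all, only adders.

    import numpy as np

    def ternary_matvec(W, x):
        # Weights are restricted to {-1, 0, +1}: each output element is just
        # "add the activations where w == +1, subtract them where w == -1".
        # No multiplications anywhere, only additions/subtractions.
        return np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])

    rng = np.random.default_rng(0)
    W = rng.integers(-1, 2, size=(4, 16))     # ternary weights in {-1, 0, +1}
    x = rng.integers(-128, 128, size=16)      # int8-range activations
    assert np.array_equal(ternary_matvec(W, x), W @ x)

That adder-only inner loop is where the big efficiency claims would come from if dedicated silicon were built for it.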
Do we get another leap in LLM capabilities? Do CPUs become more useful? And can this somehow be applied to make training more efficient?
Or is this paper not even worth considering for some obvious reason I'm not seeing?
Edit: this method is already applied during training (the models in the paper are trained with ternary weights from scratch, not quantized after the fact).
@Veedrac: suppose this pans out and custom hardware is made for such networks. How much faster/larger/cheaper would this be?
Communication overhead won't drop faster than linear.