This story was originally posted as a response to this thread.
It might help to imagine a hard takeoff scenario using only known sorts of NN & scaling effects...
In A.D. 20XX. Work was beginning. "How are you gentlemen !!"... (Work. Work never changes; work is always hell.)
Specifically, a MoogleBook researcher has gotten a pull request from Reviewer #2 on his new paper in evolutionary search in auto-ML, for error bars on the auto-ML hyperparameter sensitivity like larger batch sizes, because more can be different and there's high variance in the old runs with a few anomalously high performance values. ("Really? Really? That's what you're worried about?") He can't see why worry, and wonders what sins he committed to deserve this asshole Chinese (given the Engrish) reviewer, as he wearily kicks off yet another HQU experiment...
Your floating point counterargument is irrelevant. Yes, it would be a bad idea. You already said that. You did not address any of my points about bad ideas being really, really common in crypto (is mixing in some floating point really worse than, say, using ternary for everything binary? That is a real-world crypto which already exists. And while I'm at it, the FP inefficiency might be a reason to use FP - remember how Bytecoin and other scams worked by obfuscating their code and blockchain), nor did you offer any particular reason to think that this specific bad idea would be almost impossible. People switch between floating point and integer all the time. Compilers do all sorts of optimizations or fallbacks which break basic security properties. There are countless ways to screw up crypto; secure systems can be composed in insecure ways; and so on.
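To make the "switching between floating point and integer" hazard concrete, here is a minimal Python sketch (my own illustration, with made-up values, not anything from the original exchange) of two ways FP semantics can silently break the bit-exactness that crypto code depends on: doubles lose integer precision above 2^53, and FP addition is not associative, so a reordering by a compiler or optimizer can change the result.

```python
# Toy illustration of why letting floating point leak into integer/crypto code
# paths is risky (hypothetical values, chosen only to show the failure modes).

# 1. Doubles silently lose integer precision above 2**53.
nonce = 2**60 + 1                 # an ordinary 64-bit integer
roundtrip = int(float(nonce))     # cast to double and back, as sloppy code might
print(nonce == roundtrip)         # False: the low bit was silently dropped

# 2. FP addition is not associative, so an optimizer reordering a sum
#    can change the bits you get out.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False
```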
You'll "take my word on Engrish"? You don't need to, I provided WP and multiple dictionaries. There is nothing hard about "and other Asian languages" or movie examples about going to a Chinese food restaurant and making fun of it. If you don't know anything about the use of 'Engrish' and don't bother to check a single source even when they are served to you on a silver platter, why on earth are you going around talking about how it discredits me? This is bullshit man. "Spotchecking" doesn't work if you're not checking, and using your expertise to check for Gell-Man amnesia doesn't work if you don't have expertise. That you don't even care that you were so trivially wrong bothers me more than you being wrong.
No response to the unprofessional criticism, I see. How unprofessional.
Pity this story is about neural nets, then. In any case, I still don't see where you are getting 10,000x from or how ASICs are relevant, or how any of this addresses the existing and possible techniques for running NNs across many nodes. Yes, we have specialized ASICs for NN stuff which work better than CPUs. They are great. We call them "TPUs" and "GPUs" (you may have heard of them), and there's plenty of discussion about how the usual CPU->ASIC speedup has already been exhausted (as Nvidia likes to point out, the control-flow part you remove to get those speedups in examples like video codecs is already a small part of the NN workload, and you pay a big price in flexibility if you try to get rid of what's left - as specialized AI chip companies keep finding out the hard way when no one can use their chips). I mean, just think critically for a moment: if the speedup from specialized hardware vs more broadly accessible hardware really were >>10,000x, if my normal Nvidia GPU were 1/10,000th the power of a comparable commercial chip, how or why is anyone training anything on regular Nvidia GPUs? With ratios like that, you could run your home GPUs for years and not get as much done as on a cloud instance in an hour or two. Obviously, that's not the case. And, even granting this, it still has little to do with how much slower a big NN is going to run with Internet interconnects between GPUs instead of on GPU/TPU clusters.
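For what it's worth, the "years vs. an hour or two" line is just arithmetic on the claimed ratio; a quick back-of-envelope check in Python (my own illustrative numbers, not anything from the thread):

```python
# Back-of-envelope check of the ">>10,000x" claim: if a specialized chip did in
# 1 hour what a consumer GPU does in 10,000 hours, home GPUs would be pointless.
claimed_speedup = 10_000      # the hypothetical ASIC-vs-consumer-GPU ratio in question
cloud_hours = 2               # a short run on the specialized hardware
home_hours = claimed_speedup * cloud_hours
print(f"{home_hours:,} GPU-hours ~= {home_hours / (24 * 365):.1f} years of nonstop home training")
# -> 20,000 GPU-hours ~= 2.3 years of nonstop home training
```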