This story was originally posted as a response to this thread.
It might help to imagine a hard takeoff scenario using only known sorts of NN & scaling effects...
In A.D. 20XX. Work was beginning. "How are you gentlemen !!"... (Work. Work never changes; work is always hell.)
Specifically, a MoogleBook researcher has gotten a pull request from Reviewer #2 on his new paper in evolutionary search in auto-ML, for error bars on the auto-ML hyperparameter sensitivity like larger batch sizes, because more can be different and there's high variance in the old runs with a few anomalously high performance values. ("Really? Really? That's what you're worried about?") He can't see why worry, and wonders what sins he committed to deserve this asshole Chinese (given the Engrish) reviewer, as he wearily kicks off yet another HQU experiment...
I thank the anonymous reviewer for taking the time to comment, even if I think they are mistaken about my mistakes. To go through the 4 mistakes he thinks he identified as spot-checks:
I am aware of this and it is deliberate. You say you would be surprised if such a system has ever been deployed. I am surprised I have to criticize cryptocurrency reliability & implementation quality around here (and to you, specifically, Mark*). Are you completely new to crypto? "I can't believe they made that mistake and lost millions of dollars worth of crypto" is something that is said on Mondays and on every other day of the week ending in "day". I can totally believe that some random new fly-by-night ZKP system used FP somewhere in it, as part of the giant stack of cruft, copy-pasted smart contracts, and half-assed libraries that every cryptocurrency teeters on top of, whether accidentally enabled as an option, introduced by an incorrect literal, snuck in by a cast inside a library, or something. There are multi-billion market cap cryptocurrencies which don't even use binary (you may remember a certain ternary currency), and you think no one will ever use FP inappropriately? This is computer security! Everything you think of that can go wrong will go wrong! As will the things you didn't think of because they are too revoltingly stupid to think of! I am being optimistic when I make the bug floating-point related, because FP is genuinely hard to understand and use safely. A real bug will be something deeply embarrassing like the backdoor which was an uncapitalized letter, the Parity wallet bug, the DAO bug, the...
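(If "FP is an easy footgun for money" sounds abstract, here is a minimal illustration in Python, purely my own toy example and not drawn from any real contract or library, of the silent rounding that makes floats unsuitable for ledger arithmetic:)

```python
# Purely illustrative: why floating point is a footgun for ledger math.

# A balance of 10**18 base units (one 18-decimal token) plus 1 extra unit:
balance = 10**18 + 1
print(float(balance) == 10**18)   # True -- the extra unit is silently rounded away,
                                  # since doubles carry only ~15-16 decimal digits

# The classic decimal-fraction surprise:
print(0.1 + 0.2 == 0.3)           # False -- the sum is 0.30000000000000004
```

Exact integer or fixed-point arithmetic avoids both failure modes, which is why finding FP anywhere in a money-handling stack is a bug waiting to happen.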
No, it is in fact used generically to describe East Asian ESL errors such as Chinese ESL errors, and not exclusively for Japanese. I have never seen a hard distinction enforced by native writers such as myself, and I can find no sources supporting your gotcha when you have wandered into my area of expertise (writing English). If I may quote Wikipedia, "native speakers of Japanese, Korean and other Asian languages." WP also throws in a Chinese example from A Christmas Story to illustrate "Engrish". (Had I used a more specific term, "Terms such as Japanglish, Japlish or Janglish are more specific terms for Japanese Engrish.", then you might have had a point.) You can go to /r/engrish or engrish.com and find plenty of Chinese examples. (Feel free to check other dictionaries like Wiktionary, Collins, or TFD.) So, you are wrong here in trying to police my use of "Engrish". "Engrish" is, and always has been, valid to use for Chinese ESL errors.
Indeed, it is not used in professional contexts, which is fine, because this is not a professional context: this is reported thought from the third-person omniscient narrator about the researcher's thoughts. It's called 'fiction'. Do you also believe he is literally typing out "Really? Really? That's what you're worried about?", or writing down 'I am going out with my friends now to SF bars to get drunk'? Of course not. It is his perspective; he is frustrated with the anonymous reviewer comments missing the point while claiming expertise, which he has to rebut (after the HQU runs are done, so he has some hard numbers to add to his paper), and whatever thoughts he does write down, if he writes them down at all, will be phrased more politely.
I don't think it's absurd, or that we would expect performance penalties far worse than 5 orders of magnitude.

First, most of this is embarrassingly parallel rollouts in RL environments. Approaches like AlphaZero tree search parallelize very well, which is how projects like LeelaZero can successfully operate. If they were really paying "5 orders of magnitude" (where does that number come from...?) and were >>10,000x slower, they wouldn't've finished even days' worth of training by this point. Yet, they exist.

Second, high-performance DL training approaches like ZeRO and PatrickStar have shown you can get pretty decent utilization (approaching 50%) out of GPUs across a cluster, with more or less constant performance regardless of size, once you've paid the price of model and then layer parallelism. Once that price of splitting across a bunch of GPUs has been paid, adding more layers and parameters costs only linearly more, and you can train almost arbitrarily sized models.

Third, projects like ALBERT have measured the crowdsourcing cost; it's not >>10,000x! It's more like 5x for small models, and I don't see where you're going to get another factor of 2,000x. 5x or 10x is not great, certainly, which is why people don't do it when they can get real clusters, but it is far from being a showstopper, particularly when you've stolen all your compute in the first place and you have far more compute than you have interconnect or other resources, and can afford to burn resources on tricks like rematerializing (recomputing) gradients locally, synthetic gradients, tied weights, sparsified gradients, or machine-teaching synthetic datapoints, or a lot of other tricks I haven't mentioned in the story. (Is a particular cluster not big enough to train an entire model? Then freeze a bunch of layers and train only the rest, or cache their activations and ship those off as a dataset to train later layers on. Can you accumulate gradients beyond the critical batch size because you have far more local compute than bandwidth? Then increase the hardness of the overall training to increase the critical batch size, taking fewer but more powerful steps; see the sketch below.)

Fourth, forking paths, remember? Clippy doesn't need to train in exactly the way you envision; it can train in any way that works, and it only needs to train the big model once, anytime, anywhere. So, you say it can't be done off a supercomputer no matter how many of the tricks you use or how much public compute? Then maybe it can seize a supercomputer or an equivalent cloud resource. There's more than 1 of them out there, I understand, and it's not like clouds or supercomputers have never been hacked in the past to do things like, say, mine Bitcoin...
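To make the accumulate-locally, synchronize-rarely trade concrete, here is a minimal sketch in plain PyTorch (the toy model, fake data, and the ACCUM_STEPS value are placeholders of mine, not anything from the story): each worker grinds through many cheap local micro-batches and only touches the slow interconnect once per large effective batch.

```python
import torch

model = torch.nn.Linear(128, 10)                  # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
ACCUM_STEPS = 64                                  # local micro-batches per synchronization

def one_effective_step():
    optimizer.zero_grad()
    for _ in range(ACCUM_STEPS):
        x = torch.randn(32, 128)                  # placeholder micro-batch
        y = torch.randint(0, 10, (32,))
        loss = torch.nn.functional.cross_entropy(model(x), y)
        (loss / ACCUM_STEPS).backward()           # gradients pile up locally; no network traffic
    # In a distributed run, this would be the single point where workers exchange
    # (e.g. all-reduce) their accumulated gradients before stepping, so the
    # interconnect is paid for once per ACCUM_STEPS micro-batches.
    optimizer.step()

one_effective_step()
```

The larger the effective batch that still helps (i.e. the larger the critical batch size), the bigger you can crank ACCUM_STEPS, and the more the bandwidth bottleneck amortizes away.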
Certainly not. Fiction is just entertaining lies, after all. My hope is that people will learn from the nonfiction references and perhaps think about them more flexibly, rather than treating them as a bunch of isolated results, each of minor interest on its own.
* Don't worry. I'm sure LW2 is the only software from here on out which will have silly security bugs. There were a lot of silly bugs (humans, amirite), but we finally patched the last bug! We're done, at last! But, er, we do need the intern to go reset the "X Days Since Last Silly Cryptocurrency Vulnerability" sign in the kitchen, because the Poly hack was followed by the Qubit hack. EDIT: dammit, cryptocurrencies! /sigh. Add in "surely no one would just put an unnecessary callback in a transfer function (again)", and tell the intern to reset the sign after lunch... We should probably also update the "X Days Since Last Internet-Breaking Server Vulnerability" sign for Dirty Pipe coming after Log4j.