Am I just inexperienced or confused, or is this paper using a lot of words to say effectively very little? Sure, this functional form works fine for a given set of regimes of scaling, but it effectively gives you no predictive ability to determine when the next break will occur.
Sorry if this is overly confrontational, but I keep seeing this paper on Twitter and elsewhere and I'm not sure I understand why.
When f (in equation 1 of the paper ( https://arxiv.org/abs/2210.14891 ), not the video) of the next break is sufficiently large, it does give you predictive ability to determine when that next break will occur, although the number of seeds needed to get such predictive ability can be very large. When f of the next break is sufficiently small (and nonnegative), it does not give you predictive ability to determine when that next break will occur.
Play around with the constants in this code to see what I mean:
https://github.com/ethancaballero/broken_neural_scaling_laws/blob/main/make_figure_1__decomposition_of_bnsl_into_power_law_segments.py#L25-L29
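For readers who don't want to open the repo, here is a minimal illustrative sketch (not the linked file verbatim; all constant values are made up) of a single-break BNSL, so you can vary the sharpness constant f1 and see how visible the upcoming break is before it happens:

```python
import numpy as np
import matplotlib.pyplot as plt

def bnsl_one_break(x, a, b, c0, c1, d1, f1):
    # Single-break BNSL: y = a + b * x^(-c0) * (1 + (x/d1)^(1/f1))^(-c1 * f1)
    return a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)

x = np.logspace(0, 6, 400)
for f1 in (0.05, 0.5, 2.0, 8.0):  # small f1 = sharp break, large f1 = wide break
    y = bnsl_one_break(x, a=0.0, b=1.0, c0=0.3, c1=0.2, d1=1e4, f1=f1)
    plt.loglog(x, y, label=f"f1 = {f1}")
plt.axvline(1e4, linestyle="--", color="gray")  # break location d1
plt.xlabel("quantity being scaled (e.g. compute)")
plt.ylabel("performance metric (e.g. test loss)")
plt.legend()
plt.show()
```

With a small f1 the curve looks like a clean power law right up until the break at d1; with a large f1 the break's influence "bleeds" back into earlier points.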
Obvious crackpot; says on Twitter that there's a $1 billion prize for "breaking" BNSL funded by Derek Parfit's family office. I'd cut him more slack for what might be an obvious joke, if it weren't surrounded by other claims that also sounded like crackpottery to me. https://twitter.com/ethanCaballero/status/1587502829580820481
I am co-supervising Ethan's PhD, and we previously wrote another ML paper together: https://arxiv.org/abs/2003.00688
Ethan has an unusual communication style, but he's not a crackpot, and this work is legit (according to me, the anchor author). I haven't listened to the interview.
Well, I am Ethan's primary supervisor (since 2020), and really appreciate his provocative sense of humor (though not everyone does 😀) - regarding the BNSL paper though, it's 100% serious (though I did promise to double the $1B prize on Twitter; like advisor like student 😜)
I would be remiss not to warn folks that Ethan had a long-standing habit of making misleading, sensationalized, and outright false statements about ML scaling and other topics in the EleutherAI Discord. It got to the point several times where the moderation team had to step in to address the issue. Would recommend taking everything with a massive grain of salt.
Source: I was one of those mods.
Early on, we typically just ignored it or tried to discuss it with him. After a while it became common knowledge that Ethan would post "bait". Eventually we escalated to calling the behavior out when it occurred, deleting the relevant post(s), and/or handing him a temporary timeout. I don't know what has happened since then, I've been away from the server the past few months.
Ethan posts an annotated image from OpenAI's paper https://arxiv.org/pdf/2001.08361.pdf , stating that it's "apparently wrong now" after the compute-efficient scaling laws paper from DeepMind: https://cdn.discordapp.com/attachments/785968841301426216/958570284665946122/Screen_Shot_2021-10-20_at_12.30.58_PM_1.png - the screenshot claims that the crossover point between data and compute in the original OpenAI paper predicts AGI.
Ethan, my impression is that you're mildly overfitting. I appreciate your intellectual arrogance quite a bit; it's a great attitude to have as a researcher, and more folks here should have attitudes like yours, IMO. But, I'd expect that data causal isolation quality is going to throw a huge honkin wrench into any expectations we form about how we can use strong models - note that even humans who have low causal quality training data form weird and false superstitions! I agree with the "test loss != capability" claim because the test distribution is weird and made up and doesn't exist outside the original dataset. IID is catastrophically false, and figuring that out is the key limiter preventing robotics from matching pace with the rest of ML/AI right now, imo. So, your scaling model might even be a solid representation space, but it's misleading because of the correlation problem.
This seems like a solid empirical generative representation, but I don't feel comfortable assuming it is a causally accurate generative model. It appears overparameterized without causal justification to me. Certainly we can fit known data using this, but it explicitly bakes in an assumption of non-generalization. Perhaps that's the only significant claim being made? But I don't see how we can even generalize that the sharpness of the breaks is reliable. Ethan says come at me; I say this is valid but does not significantly refine the predictive distribution, and that is itself the difficult problem we'd hope to solve in the first place.
[humor] could one use this method to represent the convergence behavior of researchers with crackpots as the size of the cracks in crackpots' pot decreases and the number of objects colliding with respected researchers' pots increases?
Broken Neural Scaling Laws presents a functional form that allows one to extrapolate the downstream (and upstream) performance of large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, AI capabilities, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, out-of-distribution detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, fairness, molecules, computer programming/coding, math word problems, arithmetic, supervised learning, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent) tasks, assuming you have enough training runs[1] near a break to extrapolate what happens until the next sharp break.
Below are some highlighted quotes from my conversation with Ethan Caballero about his paper (available on YouTube, Spotify, Google Podcasts, Apple Podcasts). For the full context of each of these quotes, see the accompanying transcript.
Unless otherwise noted, the quotes below are from the conversation and can be found in the transcript.
Broken Neural Scaling Law
The general functional form of a broken neural scaling law (BNSL) is given as follows:
$$y = a + b x^{-c_0} \prod_{i=1}^{n} \left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i}$$

where $y$ represents the performance evaluation metric and $x$ represents a quantity that is being scaled (e.g. number of model parameters, amount of compute used for training (or inference), training dataset size, model input size, number of training steps, or upstream performance). The remaining parameters $a, b, c_0, c_1 \ldots c_n, d_1 \ldots d_n, f_1 \ldots f_n$ are unknown constants that must be estimated by fitting the above functional form to the $(x, y)$ data points.
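As a sanity check on the notation, here is a small sketch (mine, not code from the paper) that translates the general functional form into Python, with the per-break constants passed as lists:

```python
import numpy as np

def bnsl(x, a, b, c0, c, d, f):
    """Broken neural scaling law with n = len(c) breaks.

    y = a + b * x^(-c0) * prod_i (1 + (x / d[i])^(1 / f[i]))^(-c[i] * f[i])
    """
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for ci, di, fi in zip(c, d, f):
        y = y * (1.0 + (x / di) ** (1.0 / fi)) ** (-ci * fi)
    return a + y

# Example: two breaks at x = 1e3 and x = 1e5 (all constants made up for illustration).
x = np.logspace(0, 7, 8)
y = bnsl(x, a=0.0, b=1.0, c0=0.3, c=[0.2, 0.4], d=[1e3, 1e5], f=[1.0, 0.5])
```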
Collecting the initial data by averaging training runs
Experiments extrapolated by Broken Neural Scaling Laws
[Note: other experiments include Generative Modeling of Images and Multi-Agent (and Single-Agent) Reinforcement Learning.]
Using Broken Neural Scaling Laws to predict unexpected behavior
Predicting sharp left turns
The sharpness of a break is related to the constant $f_i$ in the broken neural scaling law equation:
$$y = a + b x^{-c_0} \prod_{i=1}^{n} \left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i}$$

The larger $f_i$ is, the easier it could be to predict a sharp left turn. When $f_i$ (of each of the future breaks) is very large (i.e. the breaks are wide rather than sharp), it is seemingly possible to predict multiple future breaks, although the number of training runs / seeds needed can be very large. It is seemingly possible because in such scenarios the breaks "bleed" into one another, so there is useful signal for making such predictions.
In practical terms, this means that if $f_i$ is large enough, the signal from $d_i$ propagates back to the black points used for fitting, so there is enough useful signal to estimate $d_i$ via SciPy curve fitting.
(Here the constant $d_i$ represents where on the x-axis the break between the $i$th and the $(i+1)$th approximately linear region (on a log-log plot) occurs.)
When $f_i$ is very small (i.e. the break is sharp) and nonnegative, one needs a very large number of training runs / seeds[1] from right before that break to perfectly extrapolate scaling behavior from that break to the next sharp break.[2]
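To make the "signal from $d_i$ propagates back" point concrete, here is a toy sketch (not the paper's exact fitting procedure; the constants, noise level, and seed count are made up) that averages noisy runs over many seeds and then estimates the break location d1 with scipy.optimize.curve_fit, using only points before the break:

```python
import numpy as np
from scipy.optimize import curve_fit

def bnsl_one_break(x, a, b, c0, c1, d1, f1):
    # Single-break BNSL: y = a + b * x^(-c0) * (1 + (x/d1)^(1/f1))^(-c1 * f1)
    return a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)

rng = np.random.default_rng(0)
true = dict(a=0.0, b=1.0, c0=0.3, c1=0.2, d1=1e4, f1=4.0)  # wide break (large f1)
x_obs = np.logspace(0, 3, 25)      # only points well before the break at d1 = 1e4
n_seeds = 64                       # averaging over seeds suppresses run-to-run noise
y_obs = np.mean(
    [bnsl_one_break(x_obs, **true) * rng.lognormal(0.0, 0.02, x_obs.size)
     for _ in range(n_seeds)],
    axis=0,
)

p0 = [0.0, 1.0, 0.3, 0.2, 1e3, 1.0]                        # rough initial guesses
lower = [-1.0, 1e-3, 1e-3, 1e-3, 1e1, 1e-2]
upper = [1.0, 10.0, 2.0, 2.0, 1e7, 20.0]
popt, _ = curve_fit(bnsl_one_break, x_obs, y_obs, p0=p0,
                    bounds=(lower, upper), maxfev=200_000)
print("estimated break location d1:", popt[4])
# With a sharp break (small f1) or fewer seeds, this estimate degrades quickly,
# which is the point being made in the quote above.
```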
On the risk of collecting training runs close to the break
Modeling non-monotonic behavior
Takeoff Speeds and Deception
Distributed training is inefficient
Because of this inefficiency, if you had access to many more servers, you could run more inferences in parallel, but your training would still proceed at the same speed as if you had only one server.
Recursive self improvement won't happen before a sinister stumble
"assuming you have enough training runs" admittedly does a lot of the work here, especially when the models get large, to the point of this work being of purely theoretical interest for large language models, as of November 2022.
[2] If $f_i$ is extremely small (and nonnegative), then the number of seeds needed is extremely large, which makes it so expensive that it is basically intractable to extrapolate well if one can only use points before the break.
Thanks to Daniel Paleka, Alan Chan and Max Kaufmann for feedback.