Broken Neural Scaling Laws presents a functional form that allows one to extrapolate the downstream (and upstream) performance of large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, AI capabilities, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, out-of-distribution detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, fairness, molecules, computer programming/coding, math word problems, arithmetic, supervised learning, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent) tasks, assuming you have enough training runs[1] near a break to extrapolate what happens until the next sharp break.

Below are some highlighted quotes from my conversation with Ethan Caballero about his paper (available on YouTube, Spotify, Google Podcasts, Apple Podcasts). For the full context of each of these quotes, see the accompanying transcript.

Unless otherwise noted, the quotes below are from that conversation and appear in the transcript.


Broken Neural Scaling Law

The general functional form of a broken neural scaling law (BNSL) is given as follows:

$$y = a + b x^{-c_0} \prod_{i=1}^{n} \left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i}$$

where $y$ represents the performance evaluation metric and $x$ represents a quantity that is being scaled (e.g. number of model parameters, amount of compute used for training (or inference), training dataset size, model input size, number of training steps, or upstream performance). The remaining parameters $a, b, c_0, c_1, \ldots, c_n, d_1, \ldots, d_n, f_1, \ldots, f_n$ are unknown constants that must be estimated by fitting the above functional form to the $(x, y)$ data points.
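For concreteness, here is a minimal NumPy sketch of that functional form; the function name `bnsl` and the example parameter values are mine, chosen only for illustration:

```python
import numpy as np

def bnsl(x, a, b, c0, c, d, f):
    """Broken neural scaling law with n breaks (equation 1 of arXiv:2210.14891).

    x       : quantity being scaled (compute, parameters, dataset size, ...)
    a, b, c0: offset, scale, and exponent of the initial power-law segment
    c, d, f : length-n sequences; for break i, c[i] changes the slope,
              d[i] is where on the x-axis the break occurs, and f[i] sets
              its sharpness (smaller f[i] means a sharper break).
    """
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for ci, di, fi in zip(c, d, f):
        y = y * (1.0 + (x / di) ** (1.0 / fi)) ** (-ci * fi)
    return a + y

# Example: a single break (n = 1); all values are made up for illustration.
x = np.logspace(0, 8, 200)
y = bnsl(x, a=0.1, b=2.0, c0=0.2, c=[0.4], d=[1e4], f=[0.05])
```

Plotted on log-log axes, this traces one approximately straight line before the break at $d_1$ and another after it, which is the picture described in the quotes below.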

“Say you have a log-log plot like right here and on the y-axis you have the performance evaluation metric and then on the x-axis you have the scale, so whatever it is you're scaling. So like compute, dataset size, or number of parameters. A break is a transition between one (approximately) straight line in a log-log plot and another (approximately) straight line in a log-log plot.

“You can have various numbers of breaks. n represents the number of breaks. For most problems we care about, it's usually one break, but for certain things such as N-digit addition and Deep Double Descent, it's more than one break.”

“The y-axis can be basically anything. It can be test loss, it can be reward, it can be F1 score, it can be BLEU score, it can be ELO score. It doesn't really matter.”

Collecting the initial data by averaging training runs

“If you have enough seeds and training runs [before the first break, in black in the above left plot], you're able to perfectly extrapolate everything all the way up to the next sharp break.” [otherwise you can predict performance up until the next break only, cf. left plot]

“[The number of seeds] can vary between ten and thousands. I will say for most workloads that people care about, there currently usually is only one (large) break.”

Experiments extrapolated by Broken Neural Scaling Laws

“A ton of large scale vision and language things. And then the stuff that was advertised as unpredictable, like four digit addition, and then just non-monotonic stuff that I knew everything else breaks on, like Double Descent. There's this paper that Google released a few weeks ago called Revisiting Neural Scaling Laws and they put out this big benchmark of a zillion experimental data points where you have, say, a hundred training runs to fit and then there are, say, a hundred larger training runs that are held out to evaluate extrapolation. And they did that for a bunch of large scale vision and language things.”

“For four digit arithmetic, there are dramatic breaks for upstream performance. You can get away with million parameter models if you're just training on four digit addition.”

[Note: other experiments include Generative Modeling of Images and Multi-Agent (and Single-Agent) Reinforcement Learning.]

Using Broken Neural Scaling Laws to predict unexpected behavior

Predicting sharp left turns

The sharpness of a break is related to the constant $f_i$ in the broken neural scaling law equation:

“Constant $f_i$ represents the sharpness of break between the (i)th and the (i + 1)th approximately linear region on a log-log plot; smaller (nonnegative) values of $f_i$ yield a sharper break and intervals (before and after the (i)th break) that are more linear on a log-log plot; larger values of $f_i$ yield a smoother break and intervals (before and after the (i)th break) that are less linear on a log-log plot.”

The larger $f_i$ is, the easier it could be to predict a sharp left turn. When $f_i$ (of each of the future breaks) is very large (i.e. wide and not sharp), it is seemingly possible to predict multiple future breaks, although the number of training runs / seeds needed can be very large. It is seemingly possible because in such scenarios the breaks "bleed" into the other breaks such that there is useful signal for making such predictions.

In practical terms, this means that if $f_i$ is large enough then the signal from the break at $d_i$ propagates back to the black points used for fitting, such that there is enough useful signal for estimating $d_i$ via SciPy curve fitting.

(Where constant $d_i$ represents where on the x-axis the break between the (i)th and the (i+1)th approximately linear region (on a log-log plot) occurs.)

When $f_i$ is very small (i.e. sharp) (and nonnegative), one needs a very large number of training runs / seeds[1] from right before that break to perfectly extrapolate scaling behavior from that break to the next sharp break.[2]
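As a rough illustration of that fitting procedure (not the paper's actual code), the sketch below generates noisy points from a one-break BNSL at scales before its break and tries to recover the break location $d_1$ with `scipy.optimize.curve_fit`; all names and parameter values are made up for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def bnsl_one_break(x, a, b, c0, c1, d1, f1):
    # One-break BNSL: y = a + b * x^(-c0) * (1 + (x/d1)^(1/f1))^(-c1*f1)
    return a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)

rng = np.random.default_rng(0)

# "True" parameters for a synthetic task (illustrative values only);
# the break sits at d1 = 1e6 and is fairly sharp (small f1).
true = dict(a=0.05, b=3.0, c0=0.15, c1=0.5, d1=1e6, f1=0.3)

# Noisy points from scales *before* the break, standing in for the
# averaged small-scale training runs that the fit is allowed to see.
x_fit = np.logspace(2, 5.5, 40)
y_fit = bnsl_one_break(x_fit, **true) * rng.normal(1.0, 0.01, x_fit.size)

# Fit in log space so small y values are not drowned out by large ones.
def log_model(x, a, b, c0, c1, d1, f1):
    return np.log(bnsl_one_break(x, a, b, c0, c1, d1, f1))

p0 = [0.1, 1.0, 0.1, 0.1, 1e5, 1.0]   # rough initial guess
bounds = (1e-6, np.inf)               # keep all parameters positive
popt, _ = curve_fit(log_model, x_fit, np.log(y_fit), p0=p0,
                    bounds=bounds, maxfev=100000)

print("fitted break location d1:", popt[4], "(true value: 1e6)")
```

With a small $f_1$ (a sharp break) and only pre-break points, the recovered $d_1$ is typically unreliable unless the noise is driven down by averaging many seeds, which is exactly the regime the passage above describes; with a larger $f_1$, the estimate becomes much more stable.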

On the risk of collecting training runs close to the break

Michaël: “Isn't there a concern that if you were running things closer to the break you would already see the deceptive behavior, the bad behavior, and that you could get a deceptive AI just from running things close to the break?”

Ethan: “I mean it sounds plausible, but it doesn't get dramatic until basically after the break has happened, if you get what I mean; the break is the transition from this slope to the next slope. Assuming you have a ton of compute to get a zillion seeds, you don't need any points from when the slope is at its full max.”

Modeling non-monotonic behavior

"[Broken Neural Scaling Laws can] model and extrapolate Double Descent. No one was even trying to model non-monotonic stuff with variants of power laws and scaling laws to the best of my knowledge.”

“Interpretability and controllability, those are two classic examples that you'd expect to be more interpretable and more controllable until it's beyond human comprehension. Because it gets smarter than humans. And at that point you'd expect the interpretability or controllability metric to start scaling in the opposite direction. So you want some kind of functional form that's able to express and extrapolate non-monotonic scaling and predict when it's about to happen.”

Takeoff Speeds and Deception

Distributed training is inefficient

“Currently [running distributed training] doesn't work that well if you're trying to use your compute pretty efficiently.”

Because of this inefficiency, if you had access to many more servers, you could run more inferences in parallel, but your training runs would still go at the same speed as if you had only one server.

“I view it kind of as like Paul Christiano and Andy Jones have talked about, test time compute versus training compute. There it's almost like you dramatically increase the test time compute, but the training compute kind of stayed the same.”

“If Git Re-Basin is actually real, that one has big implications. It's basically that you can train multiple separate models and then merge them together to get what each of them learned. So if, in the limit, it's doing the most amazing things that it could possibly be doing, it would imply all the foundation model companies go bankrupt, because you can just have a zillion people train small models and open source them and then fuse them together.”

Recursive self improvement won't happen before a sinister stumble

“I don't view recursive self-improvement as happening as fast as you do. I don't buy that you'll get really, really fast recursive self-improvement before a sinister stumble has happened.”

“The very first time, it's not going to be doing it perfectly. The very first time there are going to be some humans in the loop.”

“Even when people train a gigantic trillion-parameter model, they're checking on it every few days because they're like, "We got to check in on it because it's super expensive," in case something went wrong along the way, like an out-of-memory error or the run diverging.”

“The hardware stuff's going to come after the software stuff, probably, and the hardware stuff is where it's more unbounded but continuous. I agree the software can be a little bit scary because it's more discontinuous, but it's more bounded also.”

  1. ^

    "assuming you have enough training runs" admittedly does a lot of the work here, especially when the models get large, to the point of this work being of purely theoretical interest for large language models, as of November 2022. 

  2. ^

    If $f_i$ is extremely small (and nonnegative) then the number of seeds needed is extremely large, which makes it so expensive that it is basically intractable to extrapolate well if one can only use points before the break.

Thanks to Daniel Paleka, Alan Chan and Max Kaufmann for feedback.

Comments

Am I just inexperienced or confused, or is this paper using a lot of words to say effectively very little? Sure, this functional form works fine for a given set of regimes of scaling, but it effectively gives you no predictive ability to determine when the next break will occur. 

Sorry if this is overly confrontational, but I keep seeing this paper on Twitter and elsewhere and I'm not sure I understand why.

When f (in equation 1 of the paper ( https://arxiv.org/abs/2210.14891 ), not the video) of the next break is sufficiently large, it gives you predictive ability to determine when that next break will occur, though the number of seeds needed to get such predictive ability is very large. When f of the next break is sufficiently small (& nonnegative), it does not give you predictive ability to determine when that next break will occur.

Play around with f in this code to see what I mean: 
https://github.com/ethancaballero/broken_neural_scaling_laws/blob/main/make_figure_1__decomposition_of_bnsl_into_power_law_segments.py#L25-L29
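[Note: for readers who don't want to open the linked script, here is a minimal self-contained sketch of the same idea (with made-up parameter values, not the repository's code): sweep f for a single break and plot the curves on log-log axes to see how f controls the sharpness of the break.]

```python
import numpy as np
import matplotlib.pyplot as plt

def bnsl_one_break(x, a, b, c0, c1, d1, f1):
    # y = a + b * x^(-c0) * (1 + (x/d1)^(1/f1))^(-c1*f1)
    return a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)

x = np.logspace(0, 10, 400)
for f1 in (0.05, 0.5, 2.0):   # small f -> sharp break, large f -> smooth break
    y = bnsl_one_break(x, a=0.0, b=1.0, c0=0.1, c1=0.5, d1=1e5, f1=f1)
    plt.loglog(x, y, label=f"f = {f1}")
plt.axvline(1e5, linestyle="--", color="gray")   # break location d1
plt.xlabel("scale (x)")
plt.ylabel("performance metric (y)")
plt.legend()
plt.show()
```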

Obvious crackpot; says on Twitter that there's a $1 billion prize for "breaking" BNSL funded by Derek Parfit's family office.  I'd cut him more slack for potentially being obviously joking, if it wasn't surrounded by claims that sounded also crackpottery to me.  https://twitter.com/ethanCaballero/status/1587502829580820481 

I am co-supervising Ethan's PhD, and we previously wrote another ML paper together: https://arxiv.org/abs/2003.00688

Ethan has an unusual communication style, but he's not a crackpot, and this work is legit (according to me, the anchor author).  I haven't listened to the interview.

Well, I am Ethan's primary supervisor (since 2020), and really appreciate his provocative sense of humor (though not everyone does 😀) - regarding the BNSL paper though, it's 100% serious (though I did promise to double the $1B prize on Twitter; like advisor like student 😜)

I would be remiss not to warn folks that Ethan had a long-standing habit of making misleading, sensationalized, and outright false statements about ML scaling and other topics in the EleutherAI Discord. It got to the point several times where the moderation team had to step in to address the issue. Would recommend taking everything with a massive grain of salt.

Source: I was one of those mods.

What happened with that? I didn't realize he had issues with claims on scaling.

Early on, we typically just ignored it or tried to discuss it with him. After a while it became common knowledge that Ethan would post "bait". Eventually we escalated to calling the behavior out when it occurred, deleting the relevant post(s), and/or handing him a temporary timeout. I don't know what has happened since then, I've been away from the server the past few months.

Ethan posts an annotated image from openai's paper https://arxiv.org/pdf/2001.08361.pdf , stating that it's "apparently wrong now" after the compute-efficient scaling laws paper from deepmind: https://cdn.discordapp.com/attachments/785968841301426216/958570284665946122/Screen_Shot_2021-10-20_at_12.30.58_PM_1.png - the screenshot claims that the crossover point between data and compute in the original openai paper predicts agi.

Ethan, my impression is that you're mildly overfitting. I appreciate your intellectual arrogance quite a bit; it's a great attitude to have as a researcher, and more folks here should have attitudes like yours, IMO. But, I'd expect that data causal isolation quality is going to throw a huge honkin wrench into any expectations we form about how we can use strong models - note that even humans who have low causal quality training data form weird and false superstitions! I agree with the "test loss != capability" claim because the test distribution is weird and made up and doesn't exist outside the original dataset. IID is catastrophically false, and figuring that out is the key limiter preventing robotics from matching pace with the rest of ML/AI right now, imo. So, your scaling model might even be a solid representation space, but it's misleading because of the correlation problem.

this seems like a solid empirical generative representation but I don't feel comfortable assuming it is a causally accurate generative model. it appears overparameterized without causal justification to me. certainly we can fit known data using this, but it explicitly bakes in an assumption of non-generalization. perhaps that's the only significant claim being made? but I don't see how we can even generalize that the sharpness of the breaks is reliable. ethan says come at me, I say this is valid but does not refine predictive distribution significantly and that is itself the difficult problem we'd hope to solve in the first place.

[humor] could one use this method to represent the convergence behavior of researchers with crackpots as the size of the cracks in crackpots' pot decreases and the number of objects colliding with respected researchers' pots increases?