All of Ethan Caballero's Comments + Replies

Read Section 6 titled “The Limit of the Predictability of Scaling Behavior” in this paper: 
https://arxiv.org/abs/2210.14891

We describe how to go about fitting a BNSL so as to yield the best extrapolation in the last paragraph of Appendix Section A.6, "Experimental details of fitting BNSL and determining the number of breaks", of the paper: 
https://arxiv.org/pdf/2210.14891.pdf#page=13

Sigmoids don't accurately extrapolate the scaling behavior(s) of the performance of artificial neural networks. 

Use a Broken Neural Scaling Law (BNSL) in order to obtain accurate extrapolations: 
https://arxiv.org/abs/2210.14891
https://arxiv.org/pdf/2210.14891.pdf
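
For illustration, here is a minimal sketch (not the paper's code) of the BNSL functional form with a single break, as in equation 1 of the paper, fit to synthetic data with scipy. The synthetic data, initial guess, and bounds are illustrative assumptions chosen only to show the fit-and-extrapolate workflow.

```python
# Minimal sketch: one-break BNSL (equation 1 of https://arxiv.org/abs/2210.14891)
# fit to synthetic data and then extrapolated. Parameter values, noise level,
# initial guess, and bounds are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def bnsl_one_break(x, a, b, c0, c1, d1, f1):
    """Broken Neural Scaling Law with one break.

    a      : limiting value of the loss/error as x -> infinity
    b, c0  : scale and exponent of the first power-law segment
    c1     : change in the exponent after the break
    d1     : location of the break on the x-axis
    f1     : sharpness of the break (smaller f1 = sharper break)
    """
    return a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)

# Synthetic example: noisy observations drawn from a known BNSL.
rng = np.random.default_rng(0)
x = np.logspace(6, 10, 40)                        # e.g. training compute
y_true = bnsl_one_break(x, 0.05, 4e2, 0.3, 0.4, 1e8, 0.3)
y = y_true * np.exp(rng.normal(scale=0.02, size=x.size))

# Fit in log space (common for scaling-law fits), then extrapolate.
popt, _ = curve_fit(
    lambda x, *p: np.log(bnsl_one_break(x, *p)),
    x,
    np.log(y),
    p0=[0.05, 1e2, 0.2, 0.2, 1e8, 0.5],
    bounds=(1e-9, np.inf),                        # keep all parameters positive
    max_nfev=20000,
)
print(bnsl_one_break(np.logspace(10, 12, 5), *popt))   # extrapolated values
```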
 

4Lukas Finnveden
Interesting. Based on skimming the paper, my impression is that, to a first approximation, this would look like:

* Instead of having linear performance on the y-axis, switch to something like log(max_performance - actual_performance). (So that we get a log-log plot.)
* Then for each series of data points, look for the largest n such that the last n data points are roughly on a line. (I.e. identify the last power law segment.)
* Then to extrapolate into the future, project that line forward. (I.e. fit a power law to the last power law segment and project it forward.)

That description misses out on effects where BNSL-fitting would predict that there's a slow, smooth shift from one power-law to another, and that this gradual shift will continue into the future. I don't know how important that is. Curious for your intuition about whether or not that's important, and/or other reasons for why my above description is or isn't reasonable.

When I think about applying that algorithm to the above plots, I worry that the data points are much too noisy to just extrapolate a line from the last few data points. Maybe the practical thing to do would be to assume that the 2nd half of the "sigmoid" forms a distinct power law segment, and fit a power law to the points with >~50% performance (or less than that if there are too few points with >50% performance).

Which maybe suggests that the claim "BNSL does better" corresponds to a claim that the speed at which the language models improve on ~random performance (bottom part of the "sigmoid") isn't informative for how fast they converge to ~maximum performance (top part of the "sigmoid")? That seems plausible.
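
As an illustration of the procedure described above, here is a minimal sketch assuming error = max_performance - performance and clean synthetic data; the residual tolerance used to detect the last power-law segment is an illustrative assumption, not a prescribed method.

```python
# Minimal sketch of the approximation described above: work with
# log(max_performance - performance), find the longest final run of points
# that is roughly linear in log-log space, fit a line to it, and project
# that line forward. The tolerance and synthetic data are assumptions.
import numpy as np

def last_power_law_segment(log_x, log_err, tol=0.01):
    """Return the start index of the longest final segment that is
    roughly a straight line (power law) in log-log space."""
    best_start = len(log_x) - 2
    for start in range(len(log_x) - 2, -1, -1):
        coeffs = np.polyfit(log_x[start:], log_err[start:], 1)
        resid = log_err[start:] - np.polyval(coeffs, log_x[start:])
        if np.max(np.abs(resid)) > tol:
            break
        best_start = start
    return best_start

def extrapolate(x, perf, x_future, max_perf=1.0):
    log_x, log_err = np.log(x), np.log(max_perf - perf)
    start = last_power_law_segment(log_x, log_err)
    slope, intercept = np.polyfit(log_x[start:], log_err[start:], 1)
    return max_perf - np.exp(intercept + slope * np.log(x_future))

# Synthetic example: accuracy approaching 1.0 as scale grows.
x = np.logspace(6, 9, 20)
perf = 1.0 - 5e2 * x ** (-0.45)
print(extrapolate(x, perf, np.logspace(9, 10, 3)))
```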

Did ARC try making a scaling plot with training compute on the x-axis and autonomous replication on the y-axis?

The setting was adversarial training with adversarial evaluation. During training, a PGD attacker with 30 iterations is used to construct the adversarial examples used for training. During testing, the evaluation set is an adversarial test set constructed via a PGD attacker with 20 iterations.

The experimental data for the y-axis is obtained from Table 7 of https://arxiv.org/abs/1906.03787; the experimental data for the x-axis is obtained from Figure 7 of the same paper.
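
For concreteness, here is a minimal PyTorch-style sketch of this kind of setting: adversarial training against a 30-iteration PGD attacker and evaluation against a 20-iteration PGD attacker. The model, data loaders, epsilon, and step size are placeholder assumptions, not the referenced paper's exact configuration.

```python
# Minimal PyTorch sketch of the setting described above: adversarial training
# with a 30-step PGD attacker, adversarial evaluation with a 20-step PGD
# attacker. Epsilon, step size, model, and loaders are placeholder assumptions.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, step_size=2 / 255, n_iters=30):
    """Projected gradient descent attack under an L-infinity constraint."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(n_iters):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step_size * grad.sign()          # ascend loss
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()

def adversarial_train_epoch(model, loader, optimizer):
    model.train()
    for x, y in loader:
        x_adv = pgd_attack(model, x, y, n_iters=30)   # 30-iteration attacker
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()

@torch.no_grad()
def adversarial_error_rate(model, loader):
    model.eval()
    wrong, total = 0, 0
    for x, y in loader:
        with torch.enable_grad():                     # the attack needs grads
            x_adv = pgd_attack(model, x, y, n_iters=20)  # 20-iteration attacker
        wrong += (model(x_adv).argmax(dim=1) != y).sum().item()
        total += y.numel()
    return wrong / total
```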

"However, to the best of our knowledge there are no quantitative scaling laws for robustness yet."


For scaling laws for adversarial robustness, see appendix A.15 of openreview.net/pdf?id=sckjveqlCZ#page=22

1AdamGleave
Thanks, I'd missed that! Curious if you have any high-level takeaways from that? Bigger models do better, clearly, but e.g. how low do you think we'll be able to get the error rate in the next 5-10 years given expected compute growth? Are there any follow-up experiments you'd like to see happen in this space?

Also could you clarify whether the setting was for adversarial training or just a vanilla model? "During training, adversarial examples for training are constructed by PGD attacker of 30 iterations" makes me think it's adversarial training but I could imagine this just being used for evals.

See section 5.3 "Reinforcement Learning" of https://arxiv.org/abs/2210.14891 for more RL scaling laws with number of model parameters on the x-axis (and also RL scaling laws with the amount of compute used for training on the x-axis and RL scaling laws with training dataset size on the x-axis).
 

re: youtube estimates

You'll probably find some of this twitter discussion useful:
https://twitter.com/HenriLemoine13/status/1572846452895875073

I give a crisp definition from 6:27 to 7:50 of this video: 

2David Johnston
Ethan finds empirically that neural network scaling laws (performance vs size, data, other things) are characterised by functions that look piecewise linear on a log log plot, and postulates that a “sharp left turn” describes a transition from a slower to a faster scaling regime. He also postulates that it might be predictable in advance using his functional form for scaling.
1weverka
You drew a right turn; the post is asking about a left turn.

> Re: "Extrapolating GPT-N performance" and "Revisiting ‘Is AI Progress Impossible To Predict?’" sections of google doc


Read Section 6, titled "The Limit of the Predictability of Scaling Behavior", of the "Broken Neural Scaling Laws" paper: 
https://arxiv.org/abs/2210.14891

When f of the next break (in equation 1 of the paper ( https://arxiv.org/abs/2210.14891 ), not the video) is sufficiently large, you get the predictive ability to determine when that next break will occur, although the number of seeds needed to get such predictive ability is very large. When f of the next break is sufficiently small (and nonnegative), you do not get the predictive ability to determine when that next break will occur.

Play around with f in this code to see what I mean: 
https://github.com/ethancaballero/broken_neural_scaling_laws/blo...
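
A standalone sketch (not the linked repository's code) of what varying f does in a one-break BNSL: with larger f the curve starts deviating from the first power-law segment well before the break location d1, so pre-break data carry some signal about the upcoming break, whereas with very small f there is essentially no such signal. All parameter values here are illustrative assumptions.

```python
# Minimal standalone sketch of how f controls break sharpness in a one-break
# BNSL: for each f, compare the BNSL to the unbroken first power law at points
# before the break location d1. All parameter values are illustrative.
import numpy as np

def bnsl(x, a, b, c0, c1, d1, f1):
    return a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)

a, b, c0, c1, d1 = 0.0, 1.0, 0.3, 0.4, 1e8
x_before = np.logspace(6, 7.5, 4)            # points well before the break at d1

for f1 in [0.01, 0.1, 0.5, 2.0]:
    broken = bnsl(x_before, a, b, c0, c1, d1, f1)
    unbroken = b * x_before ** (-c0)         # first power-law segment alone
    max_rel_dev = np.max(np.abs(broken - unbroken) / unbroken)
    # Larger f => larger deviation before d1 => pre-break data carry some
    # information about the upcoming break; tiny f => almost no signal.
    print(f"f = {f1:4}: max relative deviation before the break = {max_rel_dev:.2e}")
```

With realistic noise, detecting that small pre-break deviation is what drives the large number-of-seeds requirement mentioned above.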

9the gears to ascension
Ethan posts an annotated image from openai's paper https://arxiv.org/pdf/2001.08361.pdf , stating that it's "apparently wrong now" after the compute-efficient scaling laws paper from deepmind: https://cdn.discordapp.com/attachments/785968841301426216/958570284665946122/Screen_Shot_2021-10-20_at_12.30.58_PM_1.png - the screenshot claims that the crossover point between data and compute in the original openai paper predicts agi.

Ethan, my impression is that you're mildly overfitting. I appreciate your intellectual arrogance quite a bit; it's a great attitude to have as a researcher, and more folks here should have attitudes like yours, IMO. But, I'd expect that data causal isolation quality is going to throw a huge honkin wrench into any expectations we form about how we can use strong models - note that even humans who have low causal quality training data form weird and false superstitions!

I agree with the "test loss != capability" claim because the test distribution is weird and made up and doesn't exist outside the original dataset. IID is catastrophically false, and figuring that out is the key limiter preventing robotics from matching pace with the rest of ML/AI right now, imo. So, your scaling model might even be a solid representation space, but it's misleading because of the correlation problem.

Sections 3.1 and 6.6, titled "Ossification", of the "Scaling Laws for Transfer" paper (https://arxiv.org/abs/2102.01293) show that the current training of DNNs exhibits high path dependence.