Since we have perfectly fit the training data, at the training data point, the loss is zero; and since the loss is minimized, the gradient is also zero.
The linear term in this expansion is not actually 0. There is no mechanism fitting the model to the linear approximation of the data around the training points. The model is only fit to the (0th order) value at the training points.
To correctly state the above sentence: since we have perfectly fit the training data, the difference between the model and the true value is zero at the training data point, but the gradient of the loss with respect to the input need not be zero, so the linear term does not vanish.
To prove this point: take two problems which have the same value at the training points but with wildly different linear terms around them. A model perfectly fit to the training points would not be able to distinguish the two.
To see this visually:
Check out this plot, which quotes the above sentence from this post as a footnote, but brilliantly demonstrates the contradiction visually: see how the error increases linearly around the training points (it's using interpolation between points, but the same point holds for a piecewise constant function).
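Here is a minimal sketch of the same idea in code (my own illustration with a made-up function and training grid, not the linked figure): fit a nearest-neighbour, i.e. piecewise constant, model that is exact at the training points, and check that the error grows roughly linearly with distance from the nearest training point -- that linear growth is the surviving linear term.

```python
import numpy as np

f = np.sin                                  # hypothetical smooth underlying function
x_train = np.linspace(0, 2 * np.pi, 20)     # training inputs
y_train = f(x_train)                        # the model fits these values exactly

def piecewise_constant_model(x):
    # Nearest-neighbour prediction: copy the value of the closest training point.
    idx = np.abs(x[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

x_test = np.random.default_rng(0).uniform(0, 2 * np.pi, 10_000)
dist = np.abs(x_test[:, None] - x_train[None, :]).min(axis=1)
err = np.abs(f(x_test) - piecewise_constant_model(x_test))

# Slope of log|error| against log(distance): a slope near 1 means the error
# grows linearly with distance, i.e. the linear term has not vanished.
mask = (dist > 1e-3) & (err > 1e-12)
slope = np.polyfit(np.log(dist[mask]), np.log(err[mask]), 1)[0]
print(f"log-log slope of error vs distance: {slope:.2f} (roughly 1 expected)")
```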
Mathematically, the Taylor expansion is:
L(x_test) ≈ L(x_train) + ∇L(x_train) · (x_test − x_train) + (1/2) (x_test − x_train)ᵀ ∇²L(x_train) (x_test − x_train)
And then we have L(x_train) = 0 and also ∇L(x_train) = 0. (This does assume a "sufficiently nice" loss function, that is satisfied by most loss functions used in practice.)
I agree that the gradient of the true underlying function at x_train is not zero. I also agree that if you take some point in between x_train and x_test it can have non-zero loss, e.g. L((x_train + x_test) / 2) need not be zero. I'm not sure if either of these are what you're trying to say, but in any case they aren't relevant to the quoted sentence.
If you are claiming ∇L(x_train) ≠ 0, then I disagree and am unclear on how your arguments are supposed to establish that.
Yes, that's precisely what I'm claiming!
Sorry if that wasn't clear. As for how to establish that, I proposed an intuitive justification:
There is no mechanism fitting the model to the linear approximation of the data around the training points.
And an outline for a proof:
Take two problems which have the same value at the training points but with wildly different linear terms around them. A model perfectly fit to the training points would not be able to distinguish the two.
Let's walk through an example:
1. Let's fit some underlying function f. We collect training data (x_i, f(x_i)) at a set of training points, and train a model until it fits those values perfectly.
2. Consider a second example. Let's fit a different underlying function g that takes the same values as f at the training points but has wildly different gradients there. Again, we collect training data -- and it is exactly the same dataset, since g(x_i) = f(x_i).
The model is trained on the values f(x_i), and is independent of the gradients ∇f(x_i). That means even if by happy accident our optimization procedure achieves ∇_x f_θ(x_train) = ∇_x f(x_train), we can prove that it is not generally true by considering an identical training dataset with a different underlying function (and knowing our optimization must result in the same model).
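Here is that argument as a few lines of code (my own sketch; f and g are made-up functions chosen so that they agree at every training input but have very different gradients there):

```python
import numpy as np

x_train = np.arange(10) / 10                         # training inputs 0.0, 0.1, ..., 0.9

f = lambda x: np.sin(2 * np.pi * x)                  # first underlying function
g = lambda x: f(x) + np.sin(10 * np.pi * x) / 5      # equals f at every x_train

# The two problems produce literally identical training sets...
assert np.allclose(f(x_train), g(x_train))

# ...so any learner that only sees (x_train, y_train) must return the same model
# for both problems. But the gradients of the underlying functions differ:
grad_f = 2 * np.pi * np.cos(2 * np.pi * x_train)
grad_g = grad_f + 2 * np.pi * np.cos(10 * np.pi * x_train)
print(np.abs(grad_f - grad_g).max())                 # about 6.28, nowhere near zero

# A single model's input-gradient therefore cannot match both f and g at the
# training points: nothing in training fits the linear term.
```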
On rereading your original argument:
since the loss is minimized, the gradient is also zero.
I think this is referring to ∇_θ L(x_train) = 0, which is certainly true for a perfectly optimized model (or even just settled gradient descent). Maybe that's where the miscommunication is stemming from, since "gradient of loss" is being overloaded between the discussion of optimization, which uses ∇_θ L, and the discussion of Taylor-expanding around x_train, which uses ∇_x L.
I think this is referring to ∇_θ L(x_train) = 0, which is certainly true for a perfectly optimized model (or even just settled gradient descent). Maybe that's where the miscommunication is stemming from
Ah, yup, that's the issue, and I agree you're correct that ∇_x L is the relevant thing here. I'll edit the post to say I'm no longer sure about the claim. (I don't have the time to understand how this lines up with the actual paper -- I remember it being kind of sparse and not trivial to follow -- perhaps you could look into it and leave a comment here.)
I think this little mistake doesn't affect the gist of your summary post, I wouldn't worry about it too much.
The mistaken argument was an attempt to explain why α = 2/d. Believe it or not, the paper never actually argued α = 2/d, it argued α = 4/d. That's because the paper used the L2 loss, and the L2 loss is the square of the deviation f_θ(x) − f(x), which doubles the exponent.
I feel that was the main misunderstanding.
Figure 2b in the paper shows α ≈ 4/d for most models but α ≈ 2/d for some models.
The paper never mentions L2 loss, just that the loss function is "analytic in f_θ − f and minimized at f_θ = f." Such a loss function converges to L2 when the loss is small enough. This important sentence is hard to read because it's cut in half by a bunch of graphs, and looks like another unimportant mathematical assumption.
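To spell out why analytic-and-minimized implies L2-like behaviour at small loss (a standard expansion; I'm writing δ for the deviation of the model from the true value at a test point):

```latex
% Let \delta = f_\theta(x) - f(x) and expand the per-point loss around \delta = 0:
L(\delta) = L(0) + L'(0)\,\delta + \tfrac{1}{2}\,L''(0)\,\delta^{2} + O(\delta^{3}).
% "Minimized at f_\theta = f" forces L'(0) = 0 (and we can set L(0) = 0), so
L(\delta) = \tfrac{1}{2}\,L''(0)\,\delta^{2} + O(\delta^{3}) \;\propto\; \delta^{2}
\quad\text{for small }\delta,
% i.e. any such loss behaves like the L2 loss once the deviation is small.
```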
Some people like L2 loss or a loss function that converges to L2 when the loss is small, because most loss functions (even L1) behave like L2 anyway, once you subtract a big source of loss. E.g. variance-limited scaling has α = 1 in both L1 and L2 because you subtract the infinite-data loss L(G_inf) from L(G). Even resolution-limited scaling may require subtracting the loss due to noise. L2 is nicer because its derivative is zero when the deviation f_θ − f is zero, but the derivative of the L1 loss is undefined there, since the absolute value is not differentiable at 0.
If you read closely, α = 4/d only refers to piecewise linear approximations:
L ∝ D^(−4/d) at large D. Note that if the model provides an accurate piecewise linear approximation, we will generically find α = 4/d.
The second paper says the same:
If the model is piecewise linear instead of piecewise constant and f is smooth with bounded derivatives, then the deviation |f_θ(x) − f(x)| will scale as D^(−2/d), and so the loss will scale as D^(−4/d). We would predict α = 4/d.
If you want a quick fix, you might change the following:
Under the assumption that our test loss is sufficiently “nice”, we can do a Taylor expansion of the test loss around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, at the training data point, the loss is zero; and since the loss is minimized, the gradient is also zero. Thus, the first non-zero term is the second-order term, which is proportional to the square of the distance. So, we expect that our scaling law will look like kD^(-2/d), that is, α = 2/d. (EDIT: I no longer endorse the above argument, see the comments.)
The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the distance between the true value and the actual value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network. In this case, for loss functions like the L2 loss, which are quadratic in the distance, we get that α = 4/d.
Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d and for L2 loss α >= 4/d.
becomes
Under the assumption that the difference between the true value and our model's value is sufficiently “nice”, we can do a Taylor expansion of it around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, the difference is zero at the training data point. With a constant term of zero, we use the linear term (gradient times displacement), which is proportional to the distance. So, we expect that our scaling law will look like kD^(-1/d). For loss functions like the L2 loss, which scale as the difference squared, it becomes kD^(-2/d), so α = 2/d.
The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the difference between the true value and the model's value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network (the linear term difference also decreases when the distance decreases). For L2 loss, we get that α = 4/d.
Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d.
Also change
Once again, we make the assumption that the learned model gives a piecewise linear approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for the case of L2 loss)
and checking whether α >= 4/d. In most cases, they find that it is quite close to equality. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with the equality (though it is still relatively small -- language models just have a high intrinsic dimension)
to
Once again, we make the assumption that the learned model is better than a piecewise constant approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for a piecewise linear approximation)
and checking whether α >= 2/d. In most cases, they find that it is quite close to 4/d. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with α >= 2/d (though α is still relatively small -- language models just have a high intrinsic dimension)
Once AI scientists make a false conclusion like α = 2/d or ∇_x L(x_train) = 0, they may hallucinate arguments which justify the conclusion. A future research direction is to investigate whether large language models learned this behavior from the AI scientists who trained them.
I'm so sorry I'm so immature but I can't help it.
Overall, it's not a big mistake because it doesn't invalidate the gist of the summary. It's very subtle, unlike those glaring mistakes that I've seen in works by other people... and myself. :)
Some simple algebraic manipulation tells us that distance between points would then scale as D^(-1/d).
This is a really helpful summary! I'm having trouble following this step here and in the original work - could someone point me in the right direction as to why this follows from needing ~2^d as many points to make the "net" twice as fine?
Say that at dataset size D, the distance between points is r. Now consider a new, smaller distance r' -- what is the corresponding D' we need?
Intuitively, for each factor of 2 by which r' is smaller than r (which we can quantify as log2(r / r')), we need to multiply D by another factor of 2^d.
So D' = D * (2^d)^(log2(r / r')) = D * (r / r')^d.
Rearranging, r' = r * (D / D')^(1/d). That is, the distance scales as D^(-1/d).
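And a quick numerical check of that scaling, in case it helps (my own sketch, sampling uniformly from a d-dimensional cube as a stand-in for the manifold): the log-log slope of nearest-neighbour distance against D should come out close to -1/d.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_nn_distance(D, d, n_test=1000):
    """Average distance from random test points to the nearest of D training points in [0, 1]^d."""
    train = rng.random((D, d))
    test = rng.random((n_test, d))
    dists = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=-1).min(axis=1)
    return dists.mean()

d = 3
Ds = np.array([250, 500, 1000, 2000, 4000])
rs = np.array([mean_nn_distance(D, d) for D in Ds])
slope = np.polyfit(np.log(Ds), np.log(rs), 1)[0]
print(f"measured slope: {slope:.2f}, predicted -1/d = {-1/d:.2f}")
```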
Oh hi! I linked your video in another comment without noticing this one. Great visual explanation!
Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.
HIGHLIGHTS
Explaining Neural Scaling Laws and A Neural Scaling Law from the Dimension of the Data Manifold (Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma) (summarized by Rohin): We’ve seen lots of empirical work on scaling laws (AN #87), but can we understand theoretically why these arise? This paper suggests two different models for how power-law scaling laws could arise, variance-limited and resolution-limited scaling, and argues that neural nets are typically trained in the resolution-limited setting. In both cases, we have versions that occur when the dataset size D is large and the number of parameters P is low (parameter-limited), and when D is low and P is large (data-limited).
Recall that a scaling law is a power-law equation that predicts the test loss L as a function of P and D. In this paper, we consider cases where only one of the resources is the bottleneck, so that our power laws are of the form L = kP^(-α) or L = kD^(-α), for constants k and α. (For simplicity, we’re assuming that the minimum value of our loss function is zero.)
Resolution-limited scaling happens when either the dataset is too small to “resolve” (capture) the true underlying function, or when the model doesn’t have enough capacity to “resolve” (fit) the training dataset. In this case, we’re going to take the common ML assumption that while our observation space might be high-dimensional, the data itself comes from a low-dimensional manifold with dimension d, called the intrinsic dimension. We’ll model our neural net as transforming the input space into a roughly d-dimensional representation of the manifold, which is then used in further processing by later layers. Thus the output of the network is a simple function over this low-dimensional representation.
Let’s first consider the case where P is sufficiently large, so that we perfectly fit the training data, but D is limited. We can think of the training data as a “net” of points covering the true d-dimensional manifold. Intuitively, to halve the distance between the points (making the net “twice as fine”), we need ~2^d times as many points. Some simple algebraic manipulation tells us that distance between points would then scale as D^(-1/d).
How can we translate this to the test loss? Let’s assume a simple nearest neighbor classifier where, given a test data point, we simply predict the value associated with the nearest training data point. This is equivalent to assuming that our neural net learns a piecewise constant function. In this case, for a test data point drawn from the same distribution as the training set, that data point will be “near” some training data point and our model will predict the same output as for the training data point.
Under the assumption that our test loss is sufficiently “nice”, we can do a Taylor expansion of the test loss around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, at the training data point, the loss is zero; and since the loss is minimized, the gradient is also zero. Thus, the first non-zero term is the second-order term, which is proportional to the square of the distance. So, we expect that our scaling law will look like kD^(-2/d), that is, α = 2/d. (EDIT: I no longer endorse the above argument, see the comments.)
The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the distance between the true value and the actual value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network. In this case, for loss functions like the L2 loss, which are quadratic in the distance, we get that α = 4/d.
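As a sanity check on these exponents (using the corrected argument from the comments), here is a small 1-d experiment of my own: with d = 1, evenly spaced training points, and L2 loss, a piecewise constant (nearest-neighbour) fit should give α ≈ 2/d = 2, and a piecewise linear (interpolating) fit should give α ≈ 4/d = 4.

```python
import numpy as np

f = lambda x: np.sin(2 * np.pi * x)          # smooth 1-d "true" function (so d = 1)
x_test = np.linspace(0, 1, 20_001)           # dense test set for estimating the L2 loss

def l2_loss(D, piecewise_linear):
    x_tr = np.linspace(0, 1, D)              # an evenly spaced "net" of D training points
    y_tr = f(x_tr)                           # the model fits these exactly
    if piecewise_linear:
        pred = np.interp(x_test, x_tr, y_tr)                        # linear interpolation
    else:
        idx = np.abs(x_test[:, None] - x_tr[None, :]).argmin(axis=1)
        pred = y_tr[idx]                                            # nearest neighbour
    return np.mean((f(x_test) - pred) ** 2)

Ds = np.array([16, 32, 64, 128, 256])
for linear in (False, True):
    losses = np.array([l2_loss(D, linear) for D in Ds])
    alpha = -np.polyfit(np.log(Ds), np.log(losses), 1)[0]
    label = "piecewise linear " if linear else "piecewise constant"
    print(f"{label}: alpha ≈ {alpha:.2f}")
```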
Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d and for L2 loss α >= 4/d.
What about the case when P is the bottleneck? Well, in this case, since the training data is not the bottleneck, it is presumably a sufficiently good approximation to the underlying function; and so we are just seeing whether the learned model can match the dataset. Once again, we make the assumption that the learned model gives a piecewise linear approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for the case of L2 loss), where X is the number of “parts” in the approximation. In the case of linear models, we should have X = P, but for neural networks I believe the authors suggest that we should instead have X = w, the width of the network. (One motivation is that in the infinite-width limit, neural networks behave like linear models.)
In variance-limited scaling for D, the scaling bottleneck is the randomness inherent in the sampling of the dataset from the underlying distribution. We can view the dataset as a random variable, implying that the gradient is also a random variable since it is a function of the training dataset. We can then consider the “error term” δG = G - G_inf, which is the difference between the finite-dataset gradients and the gradients for infinite data. We’ll make the assumption that you’re equally likely to be wrong in all directions -- if there’s a dataset that makes you a bit more likely to predict A, then there’s also a corresponding equally likely dataset that makes you a bit less likely to predict A. In that case, in expectation δG is zero, since on average the errors all cancel out. Since D is assumed to be large, we can apply the law of large numbers to deduce that the variance of δG will scale as 1/D.
Let us then consider the test loss as a function of the gradients. The test loss we actually get is L(G) = L(G_inf + δG). We can now Taylor expand this to get an expansion which tells us that the quantity we care about, L(G) - L(G_inf), is of the form AδG + B(δG)^2, where A and B are constants that depend on derivatives of the test loss in the infinite dataset case. We had already concluded that E[δG] = 0, and E[(δG)^2] is just the variance and so scales as 1/D, which implies that α = 1.
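Here is a tiny numerical version of that argument (my own sketch, using one-parameter linear regression so the infinite-data gradient is known in closed form): the variance of the gradient noise δG falls off as 1/D.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true, theta, noise = 2.0, 1.5, 0.1     # true weight, fixed parameter value, label noise

def finite_sample_gradient(D):
    """Gradient of the mean squared loss for the model y = theta * x, on a size-D sample."""
    x = rng.normal(size=D)
    y = w_true * x + noise * rng.normal(size=D)
    return np.mean(2 * (theta * x - y) * x)

g_inf = 2 * (theta - w_true)             # infinite-data gradient (x ~ N(0, 1), so E[x^2] = 1)

Ds = np.array([100, 400, 1600, 6400])
var_dG = np.array([np.var([finite_sample_gradient(D) - g_inf for _ in range(2000)]) for D in Ds])
slope = np.polyfit(np.log(Ds), np.log(var_dG), 1)[0]
print(f"log-log slope of Var[dG] vs D: {slope:.2f} (-1 expected, i.e. alpha = 1)")
```

Since E[δG] = 0, the linear term AδG vanishes in expectation and the expected loss gap is governed by B E[(δG)^2] ∝ 1/D, which is the α = 1 scaling described above.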
Here’s a slightly less mathematical and more conceptual argument for the same thing (though note that this feels like a sketchier argument overall): with a finite dataset, the gradients you compute are a noisy but unbiased estimate of the infinite-data gradients, and the typical size of that noise shrinks as 1/sqrt(D). Because the noise is equally likely to point in any direction, its first-order effect on the loss cancels out in expectation, leaving the second-order effect, which is proportional to the squared size of the noise and therefore scales as 1/D.
The authors also suggest that a similar argument can be applied to argue that for parameters, the loss scales as 1/w, where w is the width of the network. This is variance-limited scaling for P. This again relies on previous results showing that neural networks behave like linear models in the limit of infinite width.
The authors use this theory to make a bunch of predictions which they can then empirically test. I’ll only go through the most obvious test: independently measuring the scaling exponent α and the intrinsic dimension d, and checking whether α >= 4/d. In most cases, they find that it is quite close to equality. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with the equality (though it is still relatively small -- language models just have a high intrinsic dimension). Variance-limited scaling is even easier to identify: we simply measure the scaling exponent α and check whether it is 1.
Rohin's opinion: This seems like a solid attack on a theory of scaling. As we discussed last week, it seems likely that any such theory must condition on some assumption about the “simplicity of reality”; in this paper, that assumption is that the data lies on a low-dimensional manifold within a high-dimensional observation space. This seems like a pretty natural place to start, though I do expect that it isn’t going to capture everything.
Note that many of the authors’ experiments are in teacher-student models. In these models, a large teacher neural network is first initialized to compute some random function; a student network must then learn to imitate the teacher, but has either limited data or limited parameters. The benefit is that they can precisely control factors like the intrinsic dimension d, but the downside is that it isn’t immediately clear that the insights will generalize to real-world tasks and datasets. Their experiments with more realistic tasks are less clean, though I would say that they support the theory.
TECHNICAL AI ALIGNMENT
MISCELLANEOUS (ALIGNMENT)
Bootstrapped Alignment (G Gordon Worley III) (summarized by Rohin): This post distinguishes between three kinds of “alignment”:
1. Not building an AI system at all,
2. Building Friendly AI that will remain perfectly aligned for all time and capability levels,
3. Bootstrapped alignment, in which we build AI systems that may not be perfectly aligned but are at least aligned enough that we can use them to build perfectly aligned systems.
The post argues that optimization-based approaches can’t lead to perfect alignment, because there will always eventually be Goodhart effects.
AI GOVERNANCE
Institutionalizing ethics in AI through broader impact requirements (Carina E. A. Prunkl et al) (summarized by Rohin): This short perspective analyzes the policy implemented by NeurIPS last year in which paper submissions were required to have a section discussing the broader impacts of the research. Potential benefits include anticipating potential impacts of research, acting to improve these impacts, reflecting on what research to do given the potential impacts, and improving coordination across the community. However, the policy may also lead to trivialization of ethics and governance (thinking that all the relevant thinking about impacts can be done in this single statement), negative attitudes towards the burden of writing such statements or responsible research in general, a false sense of security that the ethics are being handled, and a perception of ethics as something to be done as an afterthought.
The main challenges that can cause these sorts of negative effects are:
1. Analyzing broader impacts can be difficult and complex,
2. There are not yet any best practices or guidance,
3. There isn’t a clear explanation of the purpose of the statements, or transparency into how they will be evaluated,
4. It’s tempting to focus on the research that determines whether or not your paper is published, rather than the broader impacts statement which mostly does not affect decisions,
5. Researchers may have incentives to emphasize the beneficial impacts of their work and downplay the negative impacts.
6. Biases like motivated reasoning may affect the quality and comprehensiveness of impact statements.
To mitigate these challenges, the authors recommend improving transparency, setting expectations, providing guidance on how to write statements, improving incentives for creating good impact statements, and learning from experience through community deliberation. To improve incentives in particular, broader impact statements could be made an explicit part of peer review which can affect acceptance decisions. These reviews could be improved by involving experts in ethics and governance. Prizes could also be given for outstanding impact statements, similarly to best paper awards.
Rohin's opinion: I’ve been pretty skeptical of the requirement to write a broader impacts statement. My experience of it was primarily one of frustration, for a few reasons:
1. Forecasting the future is hard. I don’t expect a shallow effort to forecast to be all that correlated with the truth. There were lots of simple things I could say that “sound” right but that I don’t particularly expect to be true, such as “improving cooperation in multiagent RL will help build cooperative, helpful personal assistants”. It’s a lot harder to say things that are actually true; a real attempt to do this would typically be a paper in itself.
2. To the extent that the statement does affect reviews, I expect that reviewers want to hear the simple things that sound right; and if I don’t write them, it would probably be a strike against the paper.
3. Even if I did write a good statement, I don’t expect anyone to read it or care about it.
From a birds-eye view, I was also worried that if such statements do become popular, they’ll tend to ossify and build consensus around fairly shallow views that people come up with after just a bit of thought.
I do think many of the proposals in this paper would help quite a bit, and there probably is a version of these statements that I would like and endorse.
OTHER PROGRESS IN AI
REINFORCEMENT LEARNING
Mastering Atari with Discrete World Models (Danijar Hafner et al) (summarized by Flo): Model-based reinforcement learning can have better sample efficiency, allows for smarter exploration strategies, and facilitates generalization between different tasks. Still, previous attempts at model-based RL on the Atari Benchmark like Dreamer (AN #83) and SimPLe (AN #51) were unable to compete with model-free algorithms in terms of final performance. This paper presents DreamerV2, a model-based algorithm that outperforms DQN and its variants -- including Rainbow -- in terms of both median human- or gamer-normalized performance and mean world-record-normalized performance on Atari after 200M environment steps, achieving roughly 35% on the latter (25% if algorithm performance is clipped to max out at 100% for each game).
DreamerV2 learns a recurrent state-space model that stochastically encodes frames and a hidden state into a latent variable and uses the hidden state to predict the next value of the latent variable. Frames and reward are then reconstructed using both the hidden state and the latent variable. A policy is obtained by actor-critic training on the latent state space, leveraging parallelization to train on 468B imagined samples. As DreamerV2 does not use MCTS, it requires 8x less wall clock time to train than the more complicated but better performing MuZero Reanalyze (AN #75). Unlike earlier approaches, DreamerV2 uses a vector of categorical latent variables rather than gaussians to enable better model predictions for dynamics with multiple distinct modes, as well as KL-balancing (scaling up the importance of the transition loss compared to the entropy regularizer on the latent variable). Ablations confirm that the image reconstruction loss is crucial for DreamerV2's performance and that both the use of discrete latent variables and KL-balancing lead to significant improvements. Interestingly, preventing the gradients for reward prediction from affecting the world model does not affect performance at all.
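For readers wondering what KL-balancing looks like concretely, here is a hedged PyTorch-style sketch, reconstructed from the paper's description rather than taken from the authors' code (the function name, tensor shapes, and the 0.8 mixing weight are my assumptions): the KL between the posterior (representation) and prior (transition) distributions over the categorical latents is computed twice with stop-gradients, and the term that trains the prior is weighted more heavily.

```python
import torch
import torch.distributions as dist

def kl_balance(post_logits, prior_logits, alpha=0.8):
    """KL-balancing over a vector of categorical latent variables (sketch).

    post_logits, prior_logits: (batch, num_vars, num_classes) logits from the
    representation (posterior) and transition (prior) models respectively.
    alpha up-weights the term whose gradients train the prior.
    """
    post = dist.Categorical(logits=post_logits)
    prior = dist.Categorical(logits=prior_logits)
    post_sg = dist.Categorical(logits=post_logits.detach())     # stop-gradient copies
    prior_sg = dist.Categorical(logits=prior_logits.detach())

    kl_train_prior = dist.kl_divergence(post_sg, prior)   # gradients flow into the prior only
    kl_train_post = dist.kl_divergence(post, prior_sg)    # gradients flow into the posterior only
    return (alpha * kl_train_prior + (1 - alpha) * kl_train_post).sum(-1).mean()

# Hypothetical usage: 32 categorical variables with 32 classes each, batch of 16.
post_logits = torch.randn(16, 32, 32, requires_grad=True)
prior_logits = torch.randn(16, 32, 32, requires_grad=True)
kl_loss = kl_balance(post_logits, prior_logits)
kl_loss.backward()
```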
Read more: Paper: Mastering Atari with Discrete World Models
Flo's opinion: It is worth noting that the authors use the Dopamine (AN #22) framework for evaluating the model-free baselines, meaning that a slightly stunted version of Rainbow is used on an evaluation protocol different from the original publication without retuning hyperparameters. That said, DreamerV2 definitely performs at a level similar to Rainbow, which is significant progress in model-based RL. In particular, the fact that the reward can be inferred from the world model even without gradients flowing back from the reward suggests transferability of the world models to different tasks with the same underlying dynamics.
MACHINE LEARNING
A Theory of Universal Learning (Olivier Bousquet et al) (summarized by Zach): In machine learning, algorithms are presented with labeled examples of categories from a training dataset and the objective is to output a classifier that distinguishes categories on a validation dataset. The generalization ability of the classifier is usually measured by calculating the error rate of the classifications on the validation set. One popular way to display generalization capability as a function of training set size is to plot a learning curve. A learning curve is a function that outputs the performance of a learning algorithm as a function of the data distribution and training sample size. A faster decay rate for a learning curve indicates a better ability to generalize with fewer data.
In this paper, the authors characterize the conditions for a learning algorithm to have learning curves with a certain decay rate. A learning curve is produced from the decay rate according to the formula 1/rate. The authors show that there are only three universal rates: exponential, linear, and arbitrarily slow decay. Moreover, the authors show there are problem classes that can be learned quickly in each instance but are slow to learn in the worst-case. This stands in contrast to classical results which analyze only the worst-case performance of learning algorithms. This produces pessimistic bounds because the guarantee must hold for all possible data distributions. This is often stronger than what is necessary for practice. Thus, by looking at rates instead of the worst-case learning curve, the authors show that it is possible to learn more efficiently than what is predicted by classical theory.
Zach's opinion: This paper is mathematically sophisticated, but full of examples to illustrate the main points of the theory. More generally, work towards non-uniform bounds has become a popular topic recently as a result of classical generalization theory's inability to explain the success of deep learning and phenomena such as double-descent. These results could allow for progress in explaining the generalization capability of over-parameterized models, such as neural networks. Additionally, the theory presented here could lead to more efficient algorithms that take advantage of potential speedups over empirical risk minimization proved in the paper.
FEEDBACK
I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.
PODCAST
An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.