Consider first the more basic question: why is simple SGD on over-parameterized ANNs an effective global optimizer? This is the first great mystery of ANNs from the perspective of classical ML theory: they should get stuck in various local minima and/or overfit, but in practice (with a few tweaks) they generally don't, and they just work better and better with scale. Many other techniques don't have this property.
A large, oversized ANN can encode not just a single circuit solution but an entire ensemble of candidate circuits (which dropout makes more explicit), and SGD then explores that ensemble of solutions in parallel, automatically reusing all shared subcomputations. This yields an exponential speedup versus explicitly evaluating every candidate individually (as most program search methods do).
Pruning and lottery tickets are then obvious - after training the ensemble you can always extract one or more of the sparser best candidate circuits. It's also well known that you can compress networks and reduce bit precision extensively, so the first bit of each weight is by far the most important, and just retaining that bit while retraining the rest (masking) should save most of the retraining work.
SGD also has an obvious inductive bias, simply because it updates the weights along the negative gradient −∇θL, i.e. the direction that maximizes loss reduction per unit of weight change. That's not the direction that moves most directly to the lowest-loss region (higher-order optimizers find that direction); SGD instead moves in the direction that maximizes loss reduction per bit/unit of weight (and thus complexity) gain.
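To make that concrete, here is a minimal sketch (my own toy example, not from the comment) of the contrast on an ill-conditioned quadratic: the SGD step follows the raw gradient, while a curvature-rescaled (Newton-style) step points straight at the minimum.

```python
import jax
import jax.numpy as jnp

def loss(theta):
    # Toy ill-conditioned quadratic: the gradient direction differs from the
    # direction that points straight at the minimum.
    return 0.5 * (100.0 * theta[0] ** 2 + theta[1] ** 2)

theta = jnp.array([1.0, 1.0])
grad = jax.grad(loss)(theta)                   # steepest-descent direction (what SGD follows)
newton_step = grad / jnp.array([100.0, 1.0])   # gradient rescaled by curvature: points at the minimum

sgd_update = -0.01 * grad                      # SGD: maximal loss reduction per unit weight change
print(sgd_update, -newton_step)
```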
One might expect y=f(x,θ0+Δθ) to be a more expressive equation than its linear approximation, but it appears that the parameters of very large neural nets change only by a small amount during training, which means that the overall Δθ found during training is nearly a solution to the linearly-approximated equations.
Note that this has changed over time, as network architectures change; I doubt that it applies to e.g. the latest LLMs. The thing about pruning doing a whole bunch of optimization does still apply independent of whether net training is linear-ish (though I don't know if anyone's repro'd the lottery ticket hypothesis-driven pruning experiments on the past couple years' worth of LLMs).
A bit of a side note, but I don't even think you need to appeal to new architectures - it looks like the NTK approximation performs substantially worse even with just regular MLPs (see this paper, among others).
I have recently been fascinated by the breadth of important mysteries in deep learning, including deep double descent and phase changes, that could be explained by a curious conjectured property of neural networks called the lottery ticket hypothesis. Despite this explanatory potential, however, I haven't seen much discussion about the evidence behind and the implications of this hypothesis in the alignment community. Being confused about these things motivated me to conduct my own survey of the phenomenon, which resulted in this post.
The Lottery Ticket Hypothesis, explained in one minute
The lottery ticket hypothesis (LTH) was originally proposed in a paper by Frankle and Carbin (2018):
The authors call such subnetworks "winning lottery tickets". As the simplest example, they train a LeNet-300-100 model on the MNIST dataset and report that a network containing only 21.1% of the weights of the dense version reaches a higher test accuracy in fewer training iterations, while a network in which only 3.6% of the weights remain performs almost identically to the dense network.
The lottery ticket hypothesis extends a long line of work in neural network pruning, a technique proposed as early as 1990 by LeCun et al. Pruning simply means deleting some fraction of unimportant weights from the network after training to make inference more efficient. The vital insight of the lottery ticket hypothesis paper is that it may also be possible to prune the network before training to make both training and inference more efficient.
In practice, the method that Frankle and Carbin used for finding winning tickets didn't yet eliminate the need to train the full network, instead just suggesting the possibility. The technique they used is iterative pruning, a procedure that roughly looks as follows:
This is computationally quite expensive, but as we'll see below, alternative approaches have been proposed later on.
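For concreteness, here is a minimal sketch of the iterative pruning loop as I understand it; `train` is a placeholder for a full training run under a fixed mask, and the flat parameter vector and pruning fraction are illustrative.

```python
import jax.numpy as jnp

def iterative_magnitude_pruning(theta0, train, rounds=5, prune_frac=0.2):
    """Sketch of the iterative pruning loop: train, prune the smallest-magnitude
    surviving weights, rewind the survivors to their initial values, repeat.
    `train(theta, mask)` stands in for a complete training run under a fixed mask."""
    mask = jnp.ones_like(theta0)
    for _ in range(rounds):
        theta_trained = train(theta0 * mask, mask)       # train the (masked) network to completion
        magnitudes = jnp.where(mask > 0, jnp.abs(theta_trained), jnp.inf)
        k = int(prune_frac * int(mask.sum()))            # prune prune_frac of the surviving weights
        threshold = jnp.sort(magnitudes)[k]
        mask = mask * (magnitudes >= threshold)          # drop the smallest-magnitude weights
    return theta0 * mask, mask                           # winning ticket: original init + final mask
```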
The paper also defined a stronger version of the hypothesis that they named the lottery ticket conjecture:
In the next section, I'll argue that making a distinction between the hypothesis and the conjecture appears to be quite important.
Relevance to alignment
Phase changes
In his post about mechanistically interpreting grokking, Neel Nanda argues that the lottery ticket hypothesis may be the reason why neural networks form sophisticated circuits.
One may naively think that neural networks are optimized in a similar way to how linear regression classifiers are optimized: each weight slowly changes in a direction that marginally improves performance, and these tiny individual improvements combine to smoothly improve the performance of the network as a whole. In practice, though, we observe the formation of sophisticated circuits like the induction circuit, which consists of a previous-token head and an induction head. Each of those heads improves loss only when the other head is also present.
Somehow, then, there must be gradients that encourage the formation of these heads despite the fact that neither of them reduces loss on its own. The lottery ticket conjecture appears to explain the existence of such gradients better than alternative explanations that Neel discusses: for example, it's easy to imagine that if something like a head that uniformly attends to prior tokens is already present in the network at the start, an induction head will be somewhat useful when composing with it and SGD will mold that into a proper induction circuit. More generally, some of the randomly initialized circuits (winning tickets) are already systematically useful for reducing loss, and SGD will reinforce such circuits.
This, in turn, can explain phase changes in neural networks: as each component of a circuit develops, each other component will become more useful, which means that all gradients will increase together in a non-linear way, producing a sudden improvement in model capabilities.
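A toy illustration of this dynamic (my own, not Neel's setup): if the loss only improves when two scalar "components" a and b are both active, each one's gradient is proportional to the other, so both stay near zero for a while and then grow suddenly together.

```python
import jax
import jax.numpy as jnp

# Toy stand-in for a two-component circuit: the loss only decreases when both
# components are active, so the gradient wrt a is proportional to b and vice
# versa. From small initial values, both stay tiny for many steps and then
# grow together -- a phase-change-like loss curve.
def loss(params):
    a, b = params
    return 1.0 - jnp.tanh(a * b)

params = jnp.array([0.05, 0.04])   # small "lottery ticket" initial values
for step in range(61):
    params = params - 0.1 * jax.grad(loss)(params)
    if step % 10 == 0:
        print(step, float(loss(params)))
```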
Deep double descent
Evan Hubinger writes in his post explaining deep double descent:
The scaling hypothesis
Daniel Kokotajlo and Abram Demski have drawn parallels between lottery tickets and the scaling hypothesis. Daniel asks whether the lottery ticket conjecture being true would imply the scaling hypothesis. Abram argues that this depends on whether the distribution of lottery tickets is normal or long-tailed: if it's normal, the number of lottery tickets in a network would increase rather slowly with scale. Commenting under Abram's question about the distribution of lottery tickets, dsj notes that the number of lottery tickets grows exponentially rather than linearly with model size, which seems to imply that even if the distribution of lottery tickets is short-tailed, larger models can be expected to contain better lottery tickets.
For all of the aforementioned points of relevance to alignment, a crucial question is whether the lottery ticket conjecture is true. Motivated by that, I took a look at recent publications on the lottery ticket hypothesis to find out more about the amount of evidence behind different versions of the hypothesis.
Pruning is all you need
Since the publication of Frankle and Carbin's paper, various stronger versions of the hypothesis have been proposed. An early follow-up was Zhou et al.'s (2019) paper, which argued that winning tickets achieve significantly better-than-random performance even without training. They observe that finding such subnetworks is equivalent to learning a binary mask over the weights, which they call a supermask. To find supermasks, they make two important observations about the lottery ticket hypothesis:
Based on these insights, they use the criterion that a weight is retained after masking if it has a large magnitude at the end of training and has kept the same sign it had at initialization. Masking out all weights that don't satisfy this criterion in the initial, randomly-initialized network, they get a pruned network that achieves test accuracies of up to 86% on MNIST and 41% on CIFAR-10 without any training! Furthermore, they propose an algorithm for training supermasks that brings the respective accuracies up to 95.3% and 65.4%.
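A minimal sketch of that masking criterion as I read it, where `w_init` and `w_final` are the weights before and after an ordinary training run and the keep fraction is illustrative:

```python
import jax.numpy as jnp

def large_final_same_sign_mask(w_init, w_final, keep_frac=0.2):
    """Keep weights that (a) kept the sign they had at initialization and
    (b) are among the largest-magnitude weights at the end of training."""
    same_sign = jnp.sign(w_init) == jnp.sign(w_final)
    score = jnp.where(same_sign, jnp.abs(w_final), -jnp.inf)   # sign flips are never kept
    k = int(keep_frac * w_final.size)
    threshold = jnp.sort(score)[-k]                            # k-th largest score
    return (score >= threshold).astype(w_final.dtype)

# The supermask is then applied to the *initial* weights, with no further training:
# masked_net = w_init * large_final_same_sign_mask(w_init, w_final)
```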
The accuracies shown in Zhou et al.'s paper were already surprisingly good considering that they were achieved with no training at all, but still considerably worse than the accuracies of a properly trained dense network. Soon, though, Ramanujan et al. (2019) showed that by making some changes in the algorithm for finding the supermask, it's possible to find an untrained pruned network that exactly matches the performance of networks found through the usual optimization process! They propose an improved algorithm that does not sample supermasks on the forward pass, effectively providing an alternative to the usual practice of finding good networks through SGD. This leads them to propose a conjecture that can be viewed as the strong lottery ticket hypothesis:
Later on, various theoretical results have been proven about the strong lottery ticket hypothesis: see Malach et al. (2020), Orseau et al. (2019), and Pensia et al. (2021). The main takeaway of these papers is that as long as the original dense network is wider than the target subnetwork by a factor of O(log(d·l)), where d is the width and l the depth of the target, and twice as deep as the target subnetwork, then pruning is indeed all you need in order to get performance equivalent to the optimized full network.
Finally, Diffenderfer and Kailkhura (2021) argue that there's evidence for going even beyond the strong LTH, formulating another version of the hypothesis:
They call this the multi-prize lottery ticket hypothesis. For a critical review of this paper, see Mark Xu's distillation.
Discussion
All of these results seem to solidify the case for the lottery ticket hypothesis being true, but do they provide evidence for the lottery ticket conjecture as well?
Though the existence of supermasks might be weak evidence in favor of the conjecture, it definitely doesn't prove it. As Daniel Filan notes in this thread, training takes a long time and there are lots of neural networks that succeed at the same task. I probably agree with his take in this comment: we don't yet have enough evidence to accept the conjecture, but if the approach of the original LTH paper (first train the dense network, then choose the winning ticket and rewind the weights) and the approach of most later papers (use supermasks to find the winning ticket without training the original network at all) were found to produce almost identical subnetworks, that would constitute very strong evidence for the conjecture.
John Wentworth's update to the Lottery Ticket Hypothesis
John Wentworth finds the lottery ticket conjecture implausible, but proposes an alternative hypothesis. He argues that although there may indeed exist subnetworks that can achieve close-to-perfect accuracy at classifying dogs, they are good at identifying dogs only after all the other neurons have been pruned out. Pruning does a whole lot of optimization and changes the functional behavior of the nodes in the subnetwork, he argues, instead of just exposing the small subcircuit that has been great at classifying dogs from the start. Thus, contrary to the conjecture, there's no good reason to assume SGD would easily find that same subnetwork:
Wentworth nevertheless thinks that a version of the lottery ticket hypothesis is plausible. Since the space of subnetworks at initialization doesn't contain a dog detector, the lottery tickets must be contained in a larger space. That space, he argues, is the parameter tangent space of the initial network.
The parameter tangent space is defined as follows. Take θ to be the parameters of a network that maps x to y; we can write this as y=f(x,θ). If the network is initialized with θ0, SGD finds some Δθ such that y=f(x,θ0+Δθ) is an accurate mapping. The linear approximation of the right-hand side around θ0 is f(x,θ0)+Δθ⋅(df/dθ)(x,θ0), and the set of functions expressible in this form as Δθ varies is the parameter tangent space.
One might expect y=f(x,θ0+Δθ) to be a more expressive equation than its linear approximation, but it appears that the parameters of very large neural nets change only by a small amount during training, which means that the overall Δθ found during training is nearly a solution to the linearly-approximated equations.
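As a concrete (toy) illustration of the definition, the linearized model can be written with a Jacobian-vector product; the tiny two-layer network below is just a stand-in to show the form f(x,θ0)+Δθ⋅(df/dθ)(x,θ0), not any architecture from the post.

```python
import jax
import jax.numpy as jnp

def f(params, x):
    # Toy two-layer network standing in for "the network" in the definition.
    w1, w2 = params
    return w2 @ jnp.tanh(w1 @ x)

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
theta0 = (jax.random.normal(k1, (32, 8)), jax.random.normal(k2, (1, 32)))

def linearized_f(delta_theta, x):
    # f(x, theta0) + (df/dtheta)(x, theta0) . delta_theta -- the parameter
    # tangent space is the set of functions of this form as delta_theta varies.
    y0, jvp = jax.jvp(lambda p: f(p, x), (theta0,), (delta_theta,))
    return y0 + jvp

x = jnp.ones(8)
delta = jax.tree_util.tree_map(lambda w: 0.01 * jnp.ones_like(w), theta0)
print(f(theta0, x), linearized_f(delta, x))
```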
Thus, Wentworth argues, the following constitutes a more useful mental model than the original lottery ticket hypothesis:
Discussion
Of the hypotheses put forward in the original LTH paper, Wentworth's claims are compatible with the lottery ticket hypothesis, but not with the lottery ticket conjecture. His parameter tangent space version of the hypothesis is mainly based on Mingard et al.'s (2020) finding that the generalization performance of overparameterized neural nets can mostly be explained using Bayesian models that these networks approximate, rather than by any inductive biases of SGD. I'm not sufficiently familiar with this line of research to assess the strength of the evidence behind these claims, but this definitely seems like an exciting direction of further research.
The Elastic Lottery Ticket Hypothesis
Despite various efforts to improve pruning at initialization, Chen et al. (2021) observe that the most effective method for identifying winning tickets is still the costly iterative magnitude-based pruning (IMP) procedure introduced in the original LTH paper. Morcos et al. (2019) showed that it's possible to run IMP just once to find a single dataset-independent winning ticket for each architecture and then transfer it to various datasets and downstream tasks. Motivated by that result, Chen et al. ask whether it's also possible to transfer the winning ticket found for one network to multiple network architectures (e.g., to transfer the winning ticket found for ResNet-32 to ResNet-14 and ResNet-64), which would dramatically reduce the cost of finding winning tickets.
In the limited settings tested by Chen et al., it seems that winning tickets can indeed transfer across architectures. Based on this result, they articulate the elastic lottery ticket hypothesis:
The authors note that there are currently some important restrictive assumptions behind their hypothesis. First, they assume that the architectures across which a single winning ticket is transferred come from the same family, such as ResNets. Second, under their current approach, an elastic winning ticket can scale only along the depth dimension - width transformations appear to be more challenging. Nevertheless, training costs would decrease significantly even under these assumptions if the paper's results were shown to reliably hold for architecture families beyond ResNets and VGGs.
Discussion
This paper seems to be evidence against common knowledge about pruning, described in Su et al. (2020) as "Conventional wisdom of pruning algorithms suggests that: (1) Pruning methods exploit information from training data to find good subnetworks; (2) The architecture of the pruned network is crucial for good performance." Elastic lottery tickets can be used for many tasks, suggesting they're mostly dataset-independent, and for many different architectures, suggesting that it's something about the weights rather than the structure that makes those tickets special. Su et al. argue that other recently proposed pruning methods provide similar evidence contradicting those two claims. This raises the question of what exactly gives a winning ticket such generality, which is discussed at length below.
Summary
The evidence behind the original LTH seems pretty uncontroversial at this point. Multiple stronger versions of the hypothesis, such as the elastic LTH, also seem likely to hold. In contrast, I think it's still unclear whether the lottery ticket conjecture is true. (Wentworth's position that pruning is just a weird way of doing optimization and changes the functional behavior of the network nodes seems pretty plausible to me, but further experimental evidence would make the claim stronger.) Consequently, it seems useful to distinguish between the two versions of the hypothesis when discussing implications that don't necessarily follow from the weakest version of the LTH. Additionally, further inquiry into the various stronger versions of the hypothesis and their interplay with phenomena like grokking and double descent seems valuable. Several research ideas have been proposed for that: