Daniel Kokotajlo asks whether the lottery ticket hypothesis implies the scaling hypothesis.
The way I see it, this depends on the distribution that "lottery tickets" are drawn from.
- If the quality of lottery tickets follows a normal distribution, then after your neural network is large enough to sample decent tickets, it will get better rather slowly as you scale it -- you have to sample a whole lot of tickets to get a really good one.
- If the quality of tickets has a long upward tail, then you'll see better scaling.
However, a long tail also suggests to me that variance in results would continue to be relatively high as a network is scaled: bigger networks are hitting bigger jackpots, but since even bigger jackpots are within reach, the payoff of scaling remains chaotic.
(This could all benefit from a more mathematical treatment.)
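As a first pass at that treatment: under this story, a network's quality is roughly the maximum of n i.i.d. ticket draws. For a Gaussian, the expected maximum grows like sqrt(2 ln n) -- painfully slowly in n -- and its spread shrinks as n grows; for a heavy tail like a Pareto with shape a, it grows like n^(1/a) and the spread stays large relative to the mean. Here's a minimal simulation sketch of that contrast. The `best_ticket` helper and the Pareto shape of 1.5 are my own illustrative choices, not anything from the lottery-ticket literature.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_ticket(n_tickets, draw, n_trials=200):
    """Quality of the best of n_tickets i.i.d. draws, repeated n_trials times."""
    samples = draw(size=(n_trials, n_tickets))
    return samples.max(axis=1)

for n in [10**2, 10**3, 10**4, 10**5]:
    # "Mediocristan": ticket quality ~ standard normal.
    normal = best_ticket(n, rng.standard_normal)
    # "Extremistan": ticket quality ~ heavy-tailed Pareto (shape 1.5, illustrative).
    heavy = best_ticket(n, lambda size: rng.pareto(1.5, size=size))
    print(f"n={n:>7}  normal best: {normal.mean():5.2f} +/- {normal.std():4.2f}   "
          f"pareto best: {heavy.mean():8.1f} +/- {heavy.std():8.1f}")
```

On a run like this, the normal column creeps up by fractions of a point per 10x more tickets while its spread tightens; the Pareto column keeps jumping by orders of magnitude and stays noisy, which matches the "chaotic payoff" intuition above.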
So: what do we know about NN training? Does it suggest we are living in Extremistan or Mediocristan (to borrow Taleb's terms)?
Note: a major conceptual difficulty in answering this question is representing NN quality in the right units. For example, an accuracy metric -- which necessarily falls between 0% and 100% -- must eventually show "diminishing returns", and cannot host a long-tailed distribution. Send that same metric through an inverse sigmoid (the logit), and now you might not see diminishing returns, and you could have a long-tailed distribution. But we can transform data all day, and the analysis shouldn't be too ad hoc. So it's not immediately clear how to measure this.
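To make the transformation point concrete, here's a toy illustration (the accuracy numbers are invented for the example, not real model results): a sequence of accuracies that looks like diminishing returns becomes roughly constant gains after the logit transform.

```python
import numpy as np

def logit(p):
    """Inverse sigmoid: maps accuracy in (0, 1) onto the whole real line."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1 - p))

# Hypothetical accuracies at successive model scales: "diminishing returns"
# in raw accuracy...
acc = np.array([0.90, 0.97, 0.99, 0.997, 0.999])
print(np.diff(acc))         # gaps shrink: 0.07, 0.02, 0.007, 0.002
# ...but roughly constant returns in logit space.
print(np.diff(logit(acc)))  # gaps stay ~1.1-1.3: each scale-up buys similar log-odds
```

The same data supports either reading depending on the units, which is exactly why the choice of transform can't be left ad hoc.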
A related question: which sub-tasks showed surprise jackpots for GPT-3 relative to GPT-2?