GPT-1, 2, and 3 have shown impressive scaling properties. How likely is it that, in the next five years, many other architectures will also be shown to get substantially better as they get bigger?

EDIT: I am open to discussion of better definitions of the scaling hypothesis. For example, maybe Gwern means something different here, in which case I'm also interested in that.

Rohin Shah

For some reason here on LW there's a huge focus on "architecture". I don't get it. Here's how I at-this-moment think of the scaling hypothesis:

Weak scaling hypothesis: For a task that has not yet been solved, if you increase data and model capacity, and tune the learning algorithm to make use of it (like, hyperparameter tuning and such, not a fundamentally new algorithm), then performance will improve.

This seems fairly uncontroversial, I think? This probably breaks down in some edge cases (e.g. if you have a 1-layer neural net that you keep making wider and wider) but seems broadly correct to me. It's mostly independent of the architecture (as long as it is possible to increase model capacity). Note also the common wisdom in ML that it's far more important what your data is than what your model / learning algorithm is.
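
To make the weak scaling hypothesis concrete, here is a minimal sketch of the kind of sweep it describes: train the same model family at increasing capacity and data size and check that held-out loss falls. The toy task, widths, and sample counts below are invented purely for illustration, not taken from any of the work discussed here.

```python
# Illustrative sweep for the weak scaling hypothesis: same learning setup,
# more data and more capacity, lower held-out loss (on a toy task).
import torch
import torch.nn as nn

def make_data(n, dim=32, seed=0):
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(n, dim, generator=g)
    # Fixed target function shared by train and test splits.
    w = torch.randn(dim, 1, generator=torch.Generator().manual_seed(123))
    y = torch.tanh(x @ w) + 0.1 * torch.randn(n, 1, generator=g)
    return x, y

def train_and_eval(width, n_train, steps=2000, dim=32):
    x_tr, y_tr = make_data(n_train, dim, seed=1)
    x_te, y_te = make_data(4096, dim, seed=2)
    model = nn.Sequential(nn.Linear(dim, width), nn.ReLU(),
                          nn.Linear(width, width), nn.ReLU(),
                          nn.Linear(width, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        idx = torch.randint(0, n_train, (256,))
        opt.zero_grad()
        loss = loss_fn(model(x_tr[idx]), y_tr[idx])
        loss.backward()
        opt.step()
    with torch.no_grad():
        return loss_fn(model(x_te), y_te).item()

if __name__ == "__main__":
    # Scale capacity and data together and watch held-out loss.
    for width, n_train in [(16, 1_000), (64, 10_000), (256, 100_000)]:
        test_loss = train_and_eval(width, n_train)
        print(f"width={width:4d}  n_train={n_train:7d}  test_loss={test_loss:.4f}")
```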

What the architecture can influence is where your performance starts out at, and the rate at which it scales, which matters for:

Strong scaling hypothesis: (Depends on the weak scaling hypothesis.) There is a sufficiently difficult task T and an architecture A that we know of for that task, such that:
1. "Solving" T would lead to AGI;
2. It is conceptually easy to scale up the model capacity for A;
3. It is easy to get more data for T; and
4. Scaling up a) model capacity and b) data will lead to "solving" T on some not-crazy timescale and resource-scale.

According to me, it is hard to find a T that satisfies 1, 3, and 4b; it is trivial to satisfy 2; and it is hard to find an architecture that satisfies 4a. OpenAI's big contribution here is believing and demonstrating that T="predict language" might satisfy 1, 3, and 4b. I know of no other such T (though multiagent environments are a candidate).

What about 4a? According to me, it just so happens that Transformers are the best architecture for T="predict language", and so that's what we saw get scaled up, but I'd expect you'd see the same pattern of scaling (but not the same absolute performance) from other architectures as well. (For example, I suspect RNNs would also satisfy 4a.) I think the far more interesting question is whether we'll see other tasks T that could plausibly satisfy 1, 3, and 4b.
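
As a rough illustration of "same pattern of scaling, different absolute performance," here is a sketch that trains a small Transformer language model and an LSTM language model at a few widths on the same toy character-level data and reports held-out loss for each. All sizes, data, and hyperparameters are made up; the only point is that the same sweep can be run unchanged across architectures.

```python
# Toy scaling comparison: Transformer LM vs. LSTM LM at several widths,
# identical data and training loop, compare how held-out loss moves with size.
import torch
import torch.nn as nn

VOCAB, SEQ_LEN = 64, 64

def toy_batch(batch_size, seed):
    # Synthetic "language": the second half of each sequence repeats the first,
    # which gives both architectures something learnable to model.
    g = torch.Generator().manual_seed(seed)
    half = torch.randint(0, VOCAB, (batch_size, SEQ_LEN // 2), generator=g)
    return torch.cat([half, half], dim=1)

class TransformerLM(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, x):
        length = x.size(1)
        h = self.embed(x) + self.pos(torch.arange(length))
        # Causal mask: positions may only attend to earlier positions.
        mask = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
        return self.head(self.encoder(h, mask=mask))

class LSTMLM(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.rnn = nn.LSTM(d_model, d_model, num_layers=2, batch_first=True)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, x):
        out, _ = self.rnn(self.embed(x))
        return self.head(out)

def train_and_eval(model, steps=500):
    opt = torch.optim.Adam(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(steps):
        tokens = toy_batch(32, seed=step)
        logits = model(tokens[:, :-1])
        loss = loss_fn(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        tokens = toy_batch(256, seed=10_000)
        logits = model(tokens[:, :-1])
        return loss_fn(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1)).item()

if __name__ == "__main__":
    for d_model in (32, 64, 128):
        for name, cls in (("transformer", TransformerLM), ("lstm", LSTMLM)):
            torch.manual_seed(0)
            print(f"{name:11s} d_model={d_model:3d} "
                  f"val_loss={train_and_eval(cls(d_model)):.3f}")
```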

Thanks! It sounds like you are saying the task is more important than the architecture, so we should talk less about architectures and more about tasks.

That seems plausible to me, with the caveat that I think it's still worth talking about architecture sometimes. For example, when thinking about the safety or generalization properties of a system, the architecture might be more important, no?

If I could go back in time, I'd change the question to be about "Architecture+training setups" instead of just "architectures."

Rohin Shah
Yes, that's right. I'd be pretty surprised if this were the case after conditioning on the raw capabilities of the architecture, though I can't rule it out.

I just finished Iain M. Banks' 'The Player of Games', so my thoughts are being influenced by that, but it had an interesting main character who made it his mission to become the best "general game-player" (i.e. no specialising in specific games). So I would be interested to see whether policy-based reinforcement learning models scale (thinking of how Agent 57 exceeded the human baseline across all 57 Atari games).

It seems kind of trivially true that a large enough MuZero with some architectural changes could do something like play chess, shogi, and go, by developing separate "modules" for each. At a certain size, this would actually be optimal: there would be no point in developing general game-playing strategies; they would be redundant. But suppose you scale it up and then add hundreds or thousands of other games and unique tasks into the mix, with multiple timescales, multiple kinds of input, and multiple simulation environments? This assumes we've figured out how to automate reward design, or that the model(s) (multi-agent?) develop a better representation of reward themselves.
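
A toy sketch of the "separate modules vs. shared representations" setup being described: one shared trunk with a per-game head, trained round-robin on several invented "games". The games, sizes, and training details are placeholders, not anything MuZero actually does.

```python
# Shared trunk + per-game heads, trained round-robin on toy "games".
# The shared trunk is where any cross-game transfer would have to live;
# the per-game heads are the "separate modules" part.
import torch
import torch.nn as nn

N_GAMES, OBS_DIM, N_ACTIONS = 4, 16, 5

class MultiGamePolicy(nn.Module):
    def __init__(self, shared_width=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(OBS_DIM, shared_width), nn.ReLU(),
                                   nn.Linear(shared_width, shared_width), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(shared_width, N_ACTIONS)
                                   for _ in range(N_GAMES))

    def forward(self, obs, game_id):
        return self.heads[game_id](self.trunk(obs))

def toy_game_batch(game_id, batch=64):
    # Stand-in for real environments: each "game" is a different fixed
    # mapping from observations to a correct action.
    g = torch.Generator().manual_seed(game_id)
    obs = torch.randn(batch, OBS_DIM)
    w = torch.randn(OBS_DIM, N_ACTIONS, generator=g)
    return obs, (obs @ w).argmax(dim=1)

model = MultiGamePolicy()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(1000):
    game_id = step % N_GAMES          # cycle through the games
    obs, target = toy_game_batch(game_id)
    loss = loss_fn(model(obs, game_id), target)
    opt.zero_grad(); loss.backward(); opt.step()
```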

At what point does transfer learning occur? I saw a very simple version of this when I was fine-tuning GPT-2 on a dataset of 50-60 distinct authors. When you look at validation loss, you always see a particular shape: when you add another author, validation loss rises, then it plateaus, then it sharply falls when the model makes some kind of "breakthrough" (whatever that means for a transformer). When you train on authors writing about the same topics, final loss is lower and the end outputs are a lot more coherent. The model benefits from the additional coverage because it learns to generalise better. So after piling on more and more games and increasing the model size, at what point would we see the sharp, non-linear fall?
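
For anyone who wants to reproduce that kind of curve, here is a minimal sketch of the setup: fine-tune GPT-2 with a plain PyTorch loop and log validation loss alongside training loss. The file names (`train_authors.txt`, `val_authors.txt`) are placeholders for however you store the per-author corpora.

```python
# Sketch: fine-tune GPT-2 on a multi-author corpus and log validation loss,
# to watch for the rise / plateau / drop pattern described above.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.train()

def chunks(text, block=512):
    # Tokenize once and split into fixed-length training blocks.
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    return [ids[i:i + block] for i in range(0, len(ids) - block, block)]

# Placeholder corpora: in practice, concatenate one file per author for
# however many authors you are currently training on.
train_blocks = chunks(open("train_authors.txt").read())  # hypothetical path
val_blocks = chunks(open("val_authors.txt").read())      # hypothetical path

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

@torch.no_grad()
def validation_loss():
    model.eval()
    losses = [model(b.unsqueeze(0), labels=b.unsqueeze(0)).loss.item()
              for b in val_blocks]
    model.train()
    return sum(losses) / len(losses)

for step, block in enumerate(train_blocks):
    loss = model(block.unsqueeze(0), labels=block.unsqueeze(0)).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step % 50 == 0:
        print(f"step {step:5d}  train_loss {loss.item():.3f}  "
              f"val_loss {validation_loss():.3f}")
```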

At what point does it start developing shared representations of multiple games?

At what point does it start meta-learning during validation testing, like GPT-3? Performing well at unseen games?

What about social games involving text communication? Interaction with real-world systems?

Suppose you gave it a 'game' involving predicting sequences of text as accurately as possible, another involving predicting sequences of pixels, and another involving modelling physical environments. At what point would it get good at all of those, as good as GPT-3?

And we could flip this question around: suppose you fine-tuned GPT-X on a corpus of games represented as text. (We already know GPT-2 can kind of handle chess.) At what version number would you see it start to perform as well as MuZero?
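
A sketch of what "games represented as text" could look like for the chess case: flatten PGN games into plain move text that a GPT-style model can then be fine-tuned on, using python-chess for parsing. The input file name is a placeholder.

```python
# Flatten chess games from a PGN file into one line of SAN moves per game,
# producing a plain-text corpus suitable for language-model fine-tuning.
import chess.pgn

def games_as_text(pgn_path, max_games=10_000):
    lines = []
    with open(pgn_path) as handle:
        for _ in range(max_games):
            game = chess.pgn.read_game(handle)
            if game is None:
                break
            board = game.board()
            moves = []
            for move in game.mainline_moves():
                moves.append(board.san(move))  # record the move in SAN notation
                board.push(move)
            lines.append(" ".join(moves))
    return lines

if __name__ == "__main__":
    corpus = games_as_text("games.pgn")  # hypothetical input file
    with open("chess_corpus.txt", "w") as out:
        out.write("\n".join(corpus))
```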

The other commenter is right: the architecture is important, but it's not about the architecture. It's about the task, and whether, in order to do that task well, you need to learn other, more general and transferable skills in the process.

In this sense, it doesn't matter whether you're modelling text or playing a video game. The end result, given enough compute and enough data, always converges on the same set of universally useful skills: a stable world-model, a theory of mind (which we could unpack into an understanding of agency and goal-setting, and keeping track of the mental states of others), meta-learning, application of logic, an understanding of causality, pattern recognition, and problem-solving.

Answer by Gordon Seidoh Worley
Most systems eventually face scaling bottlenecks. In fact, unless your system is completely free of coordination, it definitely has bottlenecks, even if you haven't scaled large enough to hit them yet. And Transformers definitely require some coordination: no matter how large the models are, and no matter how much parallelism their hardware supports, they still have to produce a single reduced output. So we should expect there to be some scaling limits on Transformers that, at some size, will prevent them from effectively taking advantage of a larger network.

Further, you point at this a bit, but most systems also experience diminishing returns on performance for additional resources because of these constraints. Transformers may just be special in that they have yet to start hitting diminishing returns because we haven't yet run up against their coordination bottlenecks. That doesn't make them too special, though: we should expect those bottlenecks to be lying in wait somewhere, just as they are in every other system that is not coordination-free.
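
One standard way to quantify this kind of coordination bottleneck, which the answer doesn't name but which fits its argument, is Amdahl's law: if a fraction s of the work is inherently serial (the "reduce to a single output" step), then the speedup from n-way parallelism is capped at 1/s no matter how large n gets. A tiny illustrative calculation:

```python
# Amdahl's law illustration (my framing, not the answer's): a small serial
# fraction s caps the achievable speedup at 1/s regardless of parallelism n.
def amdahl_speedup(serial_fraction, n):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

if __name__ == "__main__":
    for s in (0.01, 0.05, 0.20):
        speedups = [amdahl_speedup(s, n) for n in (10, 100, 1000, 10_000)]
        print(f"serial={s:.2f}  speedups={[f'{x:.1f}' for x in speedups]}  "
              f"limit={1 / s:.0f}x")
```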