Demis Hassabis mentioned a few months ago that DeepMind is in the middle of scaling up its generalist agent GATO. Based on this, I would expect it to come out either by the end of the year or at the beginning of the next. The original model had just 1 billion parameters and was able to play Atari games, operate a robotic arm, and serve as a (relatively small) language model, so scaling it up to several hundred billion (trillions?) parameters seems to have great potential for capability progress.

There were similar discussions about the most impressive things GPT-4 would be capable of, and similarly about what it won't be able to achieve. However, since GPT-4 is likely going to be limited to text, it is probable that a scaled-up GATO would exhibit an entirely new/different set of capabilities.

So let's discuss the following questions: What will the scaled up GATO model/data/training look like? What will it be capable of and what would be the most impressive capabilities? On the other hand, what are the things it won't achieve?

Update: 

To get more engagement with the post, I provide some more specific questions to make predictions on (thanks to Daniel Kokotajlo for the idea):

How many parameters will the model have? (The current Gato has 1 billion.)

How large will the context window be?

Will we see some more advanced algorithmic improvements? E.g. a new type of long-term memory, actual RL, and similar.

What type of data would it be trained on? 

Will we observe transfer learning? E.g. seeing improvements in the language model after training on audio.

Will GATO see bigger improvements from Chain-of-thought style prompting compared to a LLM of similar size?

Will it be able to play new Atari games without being trained on them?

 

Feel free to suggest more questions and I will add them.


Sorry this post hasn't gotten much engagement! It would get more engagement from me if it was less effort to engage; you could make it less effort to engage if you put up specific questions to forecast on. For example, look at the performance metrics for Gato and then ask what % better Gato2 will perform on average across all those metrics. Or maybe pick out a few specific metrics of interest. Maybe also ask about the language abilities of Gato2 in some way. And how many parameters it'll have. Etc.

I see. I will update the post with some questions. I find it quite difficult, though, to forecast how the performance metrics would improve in percentage terms, compared to just predicting capabilities, as the datasets are probably not so well known.

Yeah, makes sense. I guess maybe part of what's going on is that forecasting the next Gato in particular is less exciting/interesting than forecasting stuff like AGI, even though in order to forecast the latter it's valuable to build up skill forecasting things like the former.

Anyhow here are some ass-number predictions I'll make to answer some of your questions, thanks for putting them up:

75% confidence interval: Gato II will be between 5B and 50B parameters. Context window will be between 2x and 4x as long.

75% credence: No significant new algorithmic improvements, just some minor things that make it slightly better, or maybe nothing at all.

90% credence: They'll train it on a bigger suite of tasks than Gato. E.g. more games, more diverse kinds of text-based tasks, etc.

70% credence: There will be some transfer learning, in the following sense: It will be clear from the paper that, probably, Gato II would outperform on task X a hypothetical variant of itself that had only trained on task X but not on any others (holding fixed the amount of task X training). For most but not all tasks studied.

60% credence: Assuming they test Gato II's ability to improve via chain-of-thought, the gains from doing so will be greater than the gains for a similarly sized language model.

90% credence: Gato II will still not be able to play new Atari games without being trained on them as well as humans, i.e. it'll be sub-human-level on such games.



 

75% credence: No significant new algorithmic improvements, just some minor things that make it slightly better, or maybe nothing at all.

"Nothing at all" would feel really surprising to me. I would expect that the programmers working on it do some work on the algorithm. It might be minor things, but I would expect that they only decide on investing more compute into training once they have at least some improvement to show.

75% confidence interval: Gato II will be between 5B and 50B parameters. Context window will be between 2x and 4x as long.

It seems like the main reason why Gato is currently small is that they want to be able to interact at real-world speed with robots. For that goal, it would be easily possible to train one 100B parameter model and then downscale it for applications that need to be that fast. 

I'd be interested to forecast on whether Gato2 will see bigger gains from chain-of-thought style stuff than a similarly-sized LLM.

I weakly predict yes.


This has generated much less engagement than I thought it would... what am I doing wrong?

I think having an opinion on this requires much more technical knowledge than GPT-4 or DALL-E 3. I for one don't know what to expect. But I upvoted the post, because it's an interesting question.

Ok, I was thinking about this a bit and finally got some time to write it down. I realized that it is quite hard to make predictions about the first version of GATO, as it depends on what the team would prioritize in development. Therefore I'll try to predict some attributes/features of a GATO-like model that should be available in the next two years, while expecting that many will appear sooner - it is just difficult to say which ones. I'm not a professional ML researcher, so I might get some factual things wrong, and I would be happy to hear from people with more insight to correct me.

First, the prediction regarding the size of the model: I would expect a GATO-like architecture to have more commercial success/usefulness than e.g. GPT-3, and so the investment should also be higher. Furthermore, I would guess that there would be several significant improvements to the training infrastructure, e.g. by companies such as Cerebras/Graphcore etc. Therefore I estimate the model to use somewhere between 10-100x more compute compared to GPT-3. This might result in the model having more parameters, a larger context window, or most likely both. I predict the most likely context window size to be ~40,000 tokens (10x more) and the parameter count ~1T (6x compared to GPT-3). Regarding the context window - I think there will be some algorithmic improvements, so it won't be the same as before (see below).
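
As a rough sanity check on those numbers, here is a tiny back-of-envelope sketch. It assumes the common C ≈ 6·N·D approximation for dense transformer training compute and rough public GPT-3 figures; none of it is based on anything known about a future GATO:

```python
# Back-of-envelope check of the estimate above, assuming C ~= 6 * N * D
# for dense transformer training compute (an assumption, not anything
# published about a future Gato).

GPT3_PARAMS = 175e9                              # parameters
GPT3_TOKENS = 300e9                              # training tokens
GPT3_COMPUTE = 6 * GPT3_PARAMS * GPT3_TOKENS     # ~3e23 FLOPs

GATO2_PARAMS = 1e12                              # the ~1T guess from above

for multiplier in (10, 100):
    compute = multiplier * GPT3_COMPUTE
    tokens = compute / (6 * GATO2_PARAMS)
    print(f"{multiplier:>3}x GPT-3 compute at 1T params -> "
          f"~{tokens/1e12:.1f}T training tokens")

# Roughly 0.5T tokens at 10x and ~5T tokens at 100x, so a ~1T-parameter
# model is at least in the right ballpark for that compute range.
```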

Since GATO is multimodal, I would expect the scaling laws to change a bit due to transfer learning. E.g. since the model won't need to find all the information about the shapes of objects in text, but can instead just look at images, it should be much easier for it to answer questions such as "Can scissors be inserted into a glass bottle?" and so require a significantly smaller amount of data. Thus the scaling laws would also need to be multi-dimensional, to answer what the optimal ratio of audio/text/image/video is to achieve the best results. For example, to improve the language-model part of GATO, we may counterintuitively need to train on more images instead of text.
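
Just to make the shape of that claim concrete, here is one purely speculative way such a multi-dimensional scaling law could be written down (the functional form and the transfer coefficients are my own assumption, extending the usual single-modality loss curve):

```latex
% Speculative multimodal extension of a Chinchilla-style loss curve;
% \tau_{m' \to m} is an unknown "transfer coefficient" from modality m' to m.
L\bigl(N, D_{\text{text}}, D_{\text{img}}, D_{\text{audio}}, D_{\text{video}}\bigr)
  = E + \frac{A}{N^{\alpha}}
  + \sum_{m} \frac{B_m}{\Bigl(D_m + \sum_{m' \neq m} \tau_{m' \to m}\, D_{m'}\Bigr)^{\beta_m}}
```

The question above then becomes whether coefficients like the image-to-text τ are large enough that adding images measurably reduces the text loss.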

I predict that GATO would be trained on text, images, audio and video. I believe they would also try image/audio/video generation, which they didn't do in the current version, by essentially just predicting what the next image/video token is, instead of using diffusion models. The context window size seems too small for video; however, I believe there are two reasons why it won't be a huge problem. First, by using something like Perceiver, where more processing happens on recently seen tokens and just a small amount of computation is used on older, far-away tokens, the context window could be increased significantly (or some other kind of memory could be added).
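
To make the Perceiver-style idea a bit more concrete, here is a minimal sketch of "full attention over recent tokens, cheap compressed attention over old ones". This is my own illustrative PyTorch code, not anything from DeepMind; the module names and sizes are made up:

```python
# Minimal sketch: older tokens are compressed into a few latent vectors via
# Perceiver-style cross-attention; only the recent window gets full attention.
import torch
import torch.nn as nn

class CompressedContextBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_latents=64, recent_len=1024):
        super().__init__()
        self.recent_len = recent_len
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        # Cross-attention that squeezes the old context into n_latents vectors.
        self.compress = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Ordinary attention of recent tokens over [latent summary ; recent tokens].
        self.attend = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                      # x: (batch, seq, d_model)
        recent = x[:, -self.recent_len:]       # full-resolution recent window
        old = x[:, :-self.recent_len]          # everything older
        if old.size(1) > 0:
            q = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)
            summary, _ = self.compress(q, old, old)      # (batch, n_latents, d)
            context = torch.cat([summary, recent], dim=1)
        else:
            context = recent
        out, _ = self.attend(recent, context, context)
        return out                             # updated recent tokens only

# A 10k-token history now costs attention over ~64 latents + 1024 recent
# tokens instead of all 10k positions.
block = CompressedContextBlock()
tokens = torch.randn(2, 10_000, 512)
print(block(tokens).shape)                     # torch.Size([2, 1024, 512])
```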

Second, I don't think that the model needs to see the whole image/video. When humans look at something, only a very small part of the image is sharp and the rest is blurry. Similarly, I think that Gato will get image information with just a small number of tokens that describe a small rectangle of the picture/video clearly and some small number of tokens describing the blurred rest. There would further be action tokens describing the "eye movement" as the focus shifts to a different part of the image. In this way I think GATO will be able to watch/generate videos or read/write long books.
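
Here is a rough sketch of what such a foveated preprocessing step could look like before the patches are turned into tokens. Again this is purely illustrative; the patch sizes and the (fix_y, fix_x) "gaze" representation are my own assumptions:

```python
# Rough sketch of the "foveated" image input described above: one small sharp
# crop around a fixation point, a heavily downsampled version of the whole
# frame, and the gaze position that would become an action token.
import numpy as np

PATCH = 16          # side length of the sharp fovea crop, in pixels
BLUR_SIZE = 8       # the whole frame gets pooled down to BLUR_SIZE x BLUR_SIZE

def foveate(frame: np.ndarray, fix_y: int, fix_x: int):
    """frame: (H, W, 3) uint8 image; (fix_y, fix_x): current fixation point."""
    h, w, _ = frame.shape
    y0 = np.clip(fix_y - PATCH // 2, 0, h - PATCH)
    x0 = np.clip(fix_x - PATCH // 2, 0, w - PATCH)
    sharp = frame[y0:y0 + PATCH, x0:x0 + PATCH]          # full-resolution fovea

    # Crude "blur": average-pool the whole frame down to a tiny grid.
    ys = np.linspace(0, h, BLUR_SIZE + 1, dtype=int)
    xs = np.linspace(0, w, BLUR_SIZE + 1, dtype=int)
    blurry = np.array([[frame[ys[i]:ys[i+1], xs[j]:xs[j+1]].mean(axis=(0, 1))
                        for j in range(BLUR_SIZE)] for i in range(BLUR_SIZE)])

    # The "eye movement" would be emitted by the model as an action token;
    # here it is just the fixation coordinates.
    gaze_action = (fix_y, fix_x)
    return sharp, blurry.astype(np.uint8), gaze_action

frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)  # Atari-sized
sharp, blurry, gaze = foveate(frame, fix_y=100, fix_x=80)
print(sharp.shape, blurry.shape, gaze)   # (16, 16, 3) (8, 8, 3) (100, 80)
```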

Furthermore, I think that in general, RL could be used to make the model "think slow" when predicting tokens. For example, instead of the task of predicting the next token, GATO could be trained on an RL task: "find a set of actions by which you find what the next token is". So to predict what the next image token, next word, or next action in a game should be, it would first look around the image/video/book, and only after collecting the relevant information would it emit the right token. Possibly it could also emit tokens summarizing the information it has seen so far from a large amount of data that it might need in the future. Of course it would probably still be trained on Atari games (likely now with actual RL) or in some coding environment with predefined inputs/outputs, but I think these will be much less significant compared to the "information-finding RL". Maybe a smaller feature would be that GATO would be able to emit commands modifying its context window, deleting/adding tokens to it.
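
As a toy illustration of this "information-finding RL" framing (entirely my own construction - the LOOK/ANSWER actions, the look cost, and the stub policy are invented for the example), the episode structure might look something like this:

```python
# Toy sketch: instead of emitting the next token immediately, the model may
# take LOOK actions that reveal more of the context, paying a small cost per
# look, and is rewarded for the final prediction. The policy is a random stub;
# the point is only the episode structure.
import random

LOOK, ANSWER = "LOOK", "ANSWER"
LOOK_COST = 0.05

def run_episode(document, target_token, policy, max_looks=10):
    visible = document[:1]            # the model starts with minimal context
    reward = 0.0
    for _ in range(max_looks):
        action, payload = policy(visible)
        if action == LOOK:            # reveal one more chunk of the document
            reward -= LOOK_COST
            visible = document[:len(visible) + 1]
        else:                         # ANSWER: commit to a next-token guess
            reward += 1.0 if payload == target_token else 0.0
            return reward
    return reward                     # ran out of looks without answering

def random_policy(visible):
    if random.random() < 0.7:
        return LOOK, None
    return ANSWER, random.choice(["cat", "dog", "bottle"])

doc = ["the", "scissors", "are", "in", "the", "glass"]
print(run_episode(doc, target_token="bottle", policy=random_policy))
```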

So some capabilities that I would predict in 2 years: Generation of images, video, audio, and text, even quite long ones - e.g. the size of a book, a 5-minute-long video, etc. Instead of "Let's think step by step" we would have "Let's sketch the solution", to draw diagrams etc. Gato would be able to reasonably operate a computer using text commands - e.g. search on the internet, use Paint, an IDE/debugger and so on. It will be much better at coding by combining various prompting methods and would perform in the top 10% of competitive programmers (compared to AlphaCode being in the 50th percentile). Solve some IMO problems (probably wouldn't get a gold medal, but maybe bronze by ignoring combinatorics). Act like a smart research assistant - e.g. finding relevant papers, pointing out their strengths/weaknesses, suggesting improvements/ideas. Learn to play entirely new (Atari) games similarly fast as humans - this would probably however require RL instead of just being a prediction model. Complete an IQ test and get an above-average result.

Capabilities I don't expect: generating novel jokes, outperforming the best humans at long-term planning, research, math or coding. Image/video generation wouldn't match reality. Similarly, AI-generated books wouldn't sell well. Empathy - it wouldn't make a very good friend. It will be slow and expensive - so real-time robotics will probably still be a challenge. Also, it won't be a very reliable doctor, despite being connected to the internet.

Oh, and besides IQ tests, I predict it would also be able to pass most current CAPTCHA-like tests (though humans would still be better at some).

I should also make a prediction for the nearer version of GATO to actually answer the questions from the post. So if a new version of GATO appears in the next 4 months, I predict:

80% confidence interval: Gato will have 50B-200B params. The context window will be 2-4x larger (similar to GPT-3).

50%: No major algorithmic improvements, RL or memory. Maybe use of Perceiver. Likely some new tokenizers. The improvements would come more from new data and scale.

80%: More text, images, video, audio. More games and new kinds of data. E.g. special prompting to do something in a game, draw a picture, perform some action.

75%: Visible transfer learning. Gato trained on more tasks and pre-trained on video would perform better in most but not all games, compared to a model of similar size trained just on the particular task. The language model would be able to describe the shapes of objects better after being trained together with images/video/audio.

70%: Chain-of-thought reasoning would perform better compared to an LLM of similar size. The improvement won't be huge though, and I wouldn't expect it to gain some surprisingly sophisticated new LLM capabilities.

80%: It won't be able to play new Atari games similarly to humans, but there would be visible progress - the actions would be less random and directed towards the goal of the game. With sophisticated prompting, e.g. "Describe first what the goal of this game is, how to play it, and what the best strategy is", significant improvements would be seen, but still sub-human.

Some of my updates:
At least one version with several trillion parameters, and at least a 100k-token-long context window (with embeddings etc. seemingly 1 million). Otherwise I am quite surprised that I mostly still agree with my predictions regarding multimodal/RL capabilities. I think robotics could still see some latency challenges, but there would anyway be significant progress on tasks not requiring fast reactions - e.g. picking up things, cleaning a room, etc. Things like SuperAGI might become practically useful, and controlling a computer with text/voice would seem easy.

What I'm curious about is how they will scale it up while maintaining some of the real-time skills. They said that part of the reason for its initial size was so that the robot could be more reactive.

One interesting aspect of Gato is:

the program is able to do better than a dedicated machine learning program at controlling a robotic Sawyer arm that stacks blocks.

Given that result, it would be possible for the team to focus on robotics. They could make a deal with the Sawyer company to deploy Gato widely as a controller for Sawyer arms.

The Sawyer website describes the arm as:

Sawyer is an industrial collaborative robot designed to help out with manufacturing tasks and work alongside humans. You can teach it new tasks by demonstrating what to do using the robot's own arm.

It could be possible that one of the main use cases for Gato will be learning tasks from those demonstrations. Maybe verbal instructions can be added to the existing demonstrations.

It will be interesting to see how much of what GPT-3 can do, GATO-2 will be able to do. Since one of the tasks of GATO was labeling images and image generation is in vogue right now, I would expect that they will also let GATO-2 create images. 

One interesting question is whether GATO is better at numerical reasoning than GPT-3/Dalle-2.

GATO-2 might do combined tasks: give it a prompt in normal language and then let a robot arm execute the prompt.

One interesting move would be if DeepMind gives GATO out to the whole of Google so that Google programmers can try to use it for a variety of different tasks that Google faces.

I have a hobby of doing and reading personality research. One time recently an alignment researcher asked me whether an AI would soon be able to reach my level in personality research. My answer was no, and I still think it will be no for a new scaled-up GATO.

Do you think there's a metric that could be used to determine whether the AI is better or worse than you? If so, what would the metric be?

Like as some small-scale measurable thing? No, if we could do that then we could probably also make an AI that was better than me just by optimizing that metric.

Curious if you have some technically legible reason for this.

I've been thinking about this question for a while today, and I've had a hard time coming up with a good concrete answer, even though I feel quite confident in my assertion. My best bet for an answer is that I just haven't seen a technically legible reason that they might be able to do it.

But that's not a very satisfying argument, so I'll try to give an explanation. Naive personality research is super duper easy and I bet you could already set up a system so GPT-3 could do it, as long as you are willing to pay it a bunch of money for participants. A lot of personality research can be done just by word-associations, which GPT-3 is good at.

The trouble comes when you want to do anything deeper. If you want to know how personality actually works, you need to think of many levels at once; how people behave, how that behavior translates into your measurement method, etc.. This is not simply an extrapolation of the word-association game, and I have not seen any AI methods that are likely capable of that yet. The situation is further worsened by the fact that there is a lot of text on the internet that describes naive methods, so it would likely jump on that instead.

I believe we can now say with a high level of confidence that the scaled-up GATO will be Google's Gemini model, released in the next few months. Does anyone want to add/update their predictions?

My guess is they will make tweaks on the tokenization part. If I were given the task of adding more modalities to a Transformer, I would probably be scratching my head wondering if there was a way to universally tokenize any type of data. But that's an optimistic scenario; I don't expect them to come up with such a solution right now. So I think they will just add new state-of-the-art tokenizers, like ViT-VQGAN for images. Other than that, I am mostly curious whether we will be able to more clearly observe transfer learning when increasing the model size. That, I think, is the most important information that can come from GATO 2, because then we will be better equipped to follow Chinchilla's scaling laws without some potential slowdown. I am betting that we will, as long as the tasks are not that far away from each other and we take the time to build good tokenizers.
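
For what it's worth, here is a minimal sketch of the kind of "one tokenizer per modality, shared token space" setup I am imagining; the vocabulary sizes and the offset scheme are assumptions for illustration, not Gato's actual implementation:

```python
# Sketch: each modality gets its own tokenizer, and the resulting ids are
# offset into disjoint ranges of one flat vocabulary that the transformer sees.
from typing import List

TEXT_VOCAB = 32_000       # e.g. a SentencePiece-style text tokenizer
IMAGE_VOCAB = 8_192       # e.g. a ViT-VQGAN-style discrete image codebook
AUDIO_VOCAB = 1_024       # e.g. a neural-codec-style audio tokenizer

TEXT_OFFSET = 0
IMAGE_OFFSET = TEXT_OFFSET + TEXT_VOCAB
AUDIO_OFFSET = IMAGE_OFFSET + IMAGE_VOCAB
TOTAL_VOCAB = AUDIO_OFFSET + AUDIO_VOCAB

def merge_ids(text_ids: List[int], image_ids: List[int], audio_ids: List[int]) -> List[int]:
    """Map per-modality token ids into the shared sequence the transformer sees."""
    return ([TEXT_OFFSET + i for i in text_ids]
            + [IMAGE_OFFSET + i for i in image_ids]
            + [AUDIO_OFFSET + i for i in audio_ids])

seq = merge_ids(text_ids=[12, 997], image_ids=[5, 5, 801], audio_ids=[3])
print(TOTAL_VOCAB, seq)   # 41216 [12, 997, 32005, 32005, 32801, 40195]
```

Adding a new modality under this scheme mostly means bolting on one more tokenizer and growing the embedding table, which is why I expect tokenizer upgrades rather than a single universal tokenizer.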