14 min read

UPDATE IN 2023: I wrote this a long time ago and you should NOT assume that I still agree with all or even most of what I wrote here. I’m keeping it posted as-is for historical interest.

Introduction

I want to share my thoughts about the calculations that Transformers (such as GPT-3) do, and the calculations that I think are required for general intelligence, and how well they line up, and what I think GPT-3 is doing under the hood, and why I think an arbitrary transformer-based GPT-N might be incapable of doing certain tasks are seemingly essential for a system to qualify as an AGI.

Epistemic status: Very low confidence, to the point that I almost decided to delete this without posting it. I think some of my opinions here are very unpopular, and I would love any feedback or discussion.

Before we get into it, I want to make a couple background claims. The point here is basically to argue that the question “Can you get general intelligence by sufficiently scaling up a Transformer?” is worth asking, and does not have an answer of “Obviously yes, duh!!!” You can skip this part if you already agree with me on that.

Background Claim 1: There are types of information processing that cannot be cast in the form of Deep Neural Net (DNN)-type calculations (= matrix multiplications, ReLUs, etc.), except with an exorbitant performance penalty.

(Update: After discussion in the comments section here, I should have framed this section differently: I should have said "For any given specific learning algorithm / architecture framework (e.g. "GPT-like transformer architectures"), there are types of information processing that cannot be performed within that specific framework, except with an exorbitant performance penalty". I should never have mentioned "DNN-type calculations (=matrix multiplications, ReLUs, etc.)", because that's awfully vague (what exactly does the "etc." mean?), and anyway it's irrelevant to this post. Thanks gwern.)

By “information processing” I mean anything from sorting algorithms to data compression, random access memories, hash tables, whatever.

Let’s take Monte Carlo Tree Search (MCTS) as an example. AlphaZero does MCTS because DeepMind engineers explicitly programmed it to do MCTS—not because a generic RNN or other deep learning system spontaneously discovered, during gradient descent, that MCTS is a good idea.

Now, in principle, DNNs are universal function approximators, and more to the point, RNNs are Turing complete. So an RNN can emulate any other algorithm, including MCTS. But that doesn’t mean it can emulate it efficiently!

Let’s say we take a generic (PyTorch default) RNN, and train it such that it is incentivized to discover and start using MCTS. Assuming that the gradient flows converge to MCTS (a big "if"!), I believe (low confidence) that its only method for actually executing the MCTS involves: 

  • Taking a certain highly parallelized algorithm running on a GPU (matrix multiplications, ReLUs, etc.)...
  • ...and using it to emulate a Turing-machine-type serial computer…
  • ...and using that to emulate a different highly parallelizable algorithm!

This is absurdly inefficient when compared to MCTS written by a DeepMind engineer and compiled to run directly on bare hardware with appropriate parallelization. Like, maybe, factor-of-a-million inefficient—this is not the kind of inefficiency where you can just shrug it off and wait a year or two for Moore's law to care of it.

MCTS is just one example. Again, you can open up your algorithms textbook and find thousands of ways to process information. What fraction of these can be implemented reasonably well in the form of DNN-type matrix multiplications / ReLUs / etc.? I expect <<100%. If any such type of information processing is essential for AGI, then we should expect that we won’t get AGI in a pure DNN.

(We could still get it in a DNN-plus-other-stuff, e.g. DNN-plus-MCTS, DNN-plus-random-access-memory, etc.)

Background Claim 2: We can't just brush aside Background Claim 1 by appealing to the "Bitter Lesson"

Rich Sutton’s Bitter Lesson says that "general methods that leverage computation" have tended to outperform methods that "leverage...human knowledge of the domain". I strongly agree with this. As I've argued—most recently here—I think that the neocortex (or pallium in birds and lizards) uses a "general method that leverages computation" to understand the world, act, reason, etc. And I think we could absolutely get AGI using that method. By contrast—again in agreement with Rich Sutton—I think it’s much less likely that we would get AGI from, say, hand-coded rules about proper reasoning, knowledge graphs, etc. Although who knows, I guess.

Now, read it carefully: Rich Sutton's claim is that the best way to solve any given problem will be a "general method that leverages computation". The claim is not the reverse, i.e., if you have any "general method that leverages computation", it will solve every problem, if only you leverage enough computation!

A fully-connected feedforward neural net is a “general method that leverages computation”, but it can't do what a GAN or MuZero or Transformer does, no matter how much computation you leverage. It still has to be an appropriate method, with appropriate learning algorithm, parametrization, dataset, etc.

So, just like a fully-connected feedforward neural net is not the right "general method" for NLP, I think it's at least open to question whether the more general paradigm of matrix-multiplications-and-ReLUs (possibly attached to other simple components like MCTS, RAM, etc.) is definitely the right "general method" for general intelligence. Maybe it is, maybe it isn't! It’s an open question.

(And by the way, I freely acknowledge that this paradigm can do lots of amazing things, including things that I probably wouldn't have expected it to be able to do, if I didn't already know.)

OK, so far, all this is just to plant the general idea that DNNs do a particular type of information processing, and that there’s no guarantee that it’s the right type of information processing for AGI, merely a suggestion from the fact that it is a form of information processing that can do lots of amazing things that seem to be related to AGI.

Generative-model-based information processing: Background and motivation

Moving on, now I'll switch gears and talk more specifically about a certain type of information processing that I think might be critical for AGI, and after that I'll talk about how that relates to the information processing in DNNs (and more specifically, Transformers like GPT-3).

Predicting inputs with generative models (a.k.a. “analysis by synthesis” or “probabilistic programming”)

I’ll introduce the idea of analysis-by-synthesis via an unusually simple example: Towards the First Adversarially Robust NN Model on MNIST. MNIST is a famous collection of low-resolution images of handwritten digits, and the task is to recognize the digit from the image pixels. The traditional deep learning method for solving MNIST is: You do gradient descent to find a ConvNet model whose input is pixel values and whose output is a probability distribution for which digit it is. In this paper here, however, they go the other direction: They build a generative model for pictures representing the digit “0”, a different generative model for the digit “1”, etc. Then, when presented with a new digit image, they do ten calculations in parallel: How compatible is this image with the “0” generative model? How compatible is it with the “1” generative model? … They then pick the hypothesis that best fits the data.

Suggestively, their classification method winds up unusually human-like—it makes similar mistakes and has similar confusions as humans (assuming the paper is correct), unlike those funny examples in the adversarial perturbation literature.

As in this example, analysis-by-synthesis is any algorithm that searches through a collection / space of generative models for a model that best fits the input data (or better yet, with time-varying inputs, a model that predicts the input data that will arrive next). I’ve seen two other analysis-by-synthesis digit-recognition papers: Josh Tenenbaum & coworkers using classic probabilistic programming; and Dileep George & coworkers using a brain-inspired approach (and see also a nice series of blog posts introducing the latter).

The latter captures, I think, an essential aspect of the neocortex, in that there’s a whole society of generative models. Some of these generative models are at a very low level, and make predictions (with attached confidence levels) that certain sensory inputs will or won't be active. But more often the generative models make predictions that other generative models will or won’t be active. (These predictions can all be functions of time, or of other parameters—I'm leaving aside lots of details). To oversimplify, imagine that if the “bird” generative model is active, it also sends out a low-confidence prediction that the “is flying” generative model is also active, and that the "noun" generative model is active, and so on. This leads to not only hierarchical generative models, but also other forms of compositionality. For example, there’s a "purple jar" generative model that more-or-less just predicts that the "purple" generative model and the "jar" generative model are active simultaneously. This works because the "purple" and "jar" models make (by-and-large) compatible predictions—they agree with each other on some predictions, and in other cases they’re making predictions about different things. By contrast, there is no “stationary dancing” composite model—those two models make contradictory predictions. When two mutually-incompatible generative models are active simultaneously, they kinda fight each other for dominance. The algorithm underlying that "fight" is at least vaguely analogous to message-passing in a probabilistic graphical model. See more of my thoughts at Gary Marcus vs Cortical Uniformity and Predictive coding = RL + SL + Bayes + MPC.

What’s so great about generative-model-based processing?

Well, you might say, OK, analysis-by-synthesis is an interesting approach. But is it essential for AGI? Or is it just one of many ways to do things? I think it’s at least plausibly essential. Consider some of its features...

Feature 1: Better and better results with a longer search

Example: You look at a picture of a camouflaged animal, hidden so well that you can’t find it at first. But after staring for a minute, it snaps into place, and you recognize it.

Example: Your code has a syntax error on line 12. You stare at the line for a few seconds, and then finally see the problem.

Since we’re doing a search over a space of generative models, if we don’t find a good match immediately, we can notice our confusion and keep searching. The space of generative models is astronomically large; you can always search more, given more time.

Feature 2: Generative models are simpler & easier to learn than discriminative models

Example: You are asked to prove a simple theorem for homework. You spend a few minutes thinking through different possible lines of attack, and eventually see an approach that will work.

Example: You are asked to invent a microscope with higher resolution and speed than what exists today. You think through a wide array of many different possible configurations of lenses, lasers, filters, polarizers, galvos, sample-holders, etc. Eventually you come up with a configuration that meets all requirements.

In both these cases, the forward / generative direction has a (relatively) simple structure: If you differentiate both sides, what happens? By contrast, the reverse / discriminative direction is almost arbitrarily complex: At the end, you proved the theorem; what was the first step? A model that could immediately answer the endless variety of such reverse-direction question seems like it would be hopelessly complex to learn!

In fact, probably the only way that the reverse model could be learned is by first developing a good generative model, running it through in lots of configurations, and training a reverse model on that. And something like that is indeed entirely possible! In humans, there's a kind of memorization / chunking that allows the system to "cache" the results of runs through the generative models, so that the corresponding reverse-lookup can be very fast in the future. Actually, this is not so far from how AlphaZero works too—in that case, the MCTS self-play functions as the generative model. 

Feature 3: A generative-model-based approach is more sample-efficient and out-of-distribution-generalizable

This is related to Feature 2. Since generative models are simpler (less information content) than reverse / discriminative models, they can be learned more quickly. Empirically, I think this is borne out by those three digit-recognition papers I cited above, and by the fact that humans can do zero-shot transfer learning in all kinds of situations. (Of course, humans need to spend years cultivating a healthy and diverse society of generative models; the sample-efficiency I'm talking about is after that.)

Another way to think about it: things in the world are, in reality, generated by highly structured hierarchical-and-compositional-generative-model-type processes. If we want a good inductive bias, then we need to structure our analysis in a parallel way: we need to go looking for those kinds of generative models until we find ones that explain the data. Better inductive bias means better out-of-distribution generalization and fewer samples required.

Example: You read about a new concept you hadn't heard of—Quantum Capacitance. You find it confusing at first, but after reading a couple descriptions and thinking about it, you find a good mental model of it—a certain way to visualize it, to relate it to other known concepts, etc. Having internalized that concept, you can then go use it and build on it in the future.

In this example, you built a good generative model of Quantum Capacitance despite little or even no concrete examples of it. How? For one thing, you're not building it from scratch, you're snapping together a few pieces of generative models you've already learned, and your textbook helps you figure out which ones, by describing the concept using analogies, figurative spatial language, etc. For another thing, you are trying to slot this concept into a dense web of previously-established physics-related generative models, each of which will protest loudly if they see something that contradicts them. In other words, there are only so many ways to create this concept in a way that doesn't contradict the things about physics you have already come to believe. Third, you can test any number of proposed generative model against the same known examples—sorta vaguely like off-policy replay learning or something. So anyway, with all these ingredients, it's a highly constrained problem, and so you can can find the right model with few if any concrete examples.

Feature 4: Foresight, counterfactuals, deliberation, etc.

A society of generative models is a powerful thing...

  • It allows foresighted plans towards goals—in fact, this capability involves almost the same information-processing steps as analysis-by-synthesis understanding of sensory inputs (see my post here, or Planning by Probabilistic Inference, or Logical Inductor Decision Theory, etc.)
  • It enables counterfactual reasoning—see my brief discussion here.
  • It enables system-2-type deliberation—e.g. you can learn a Generative Model X that invokes a Generative Model Y and then invokes either Generative Model Z or Z' depending on some feature of Y. Then Z or Z' can invoke yet different generative models, and so on. That kind of process enables us to create an ad hoc crappy serial computer ... and we call that "deliberation"! See also Kaj’s post.

I could go on, but you get the idea.

So, at the end of the day, is the generative-model-based approach essential for superintelligent AGI?

I mean, it kind of feels that way to me, but maybe I’m anthropomorphizing.

Back to Transformers and GPT-N

OK, you’ve read all this way, now how does this relate to GPT-3? Will GPT-N be an AGI or not??

Now, when the brain does its probabilistic generative model inference thing, it does it by a complicated, decentralized process. There’s a feedforward pass that activates some of the most promising generative models. Then the models do some message-passing back and forth. Contradictory models fight it out. New models get awakened. Redundant models go away, etc. Multi-step models, extended through time, walk through their steps.

So here’s my hypothesis: this whole complicated decentralized asynchronous probabilistic-generative-model-inference process is more-or-less exactly what the GPT-3 Transformer learns to approximate.

As a special case: I feel pretty comfortable imagining that a single trained Transformer layer can approximate a single step of message-passing in a probabilistic graphical model inference algorithm.

Now, I haven’t thought this through in any amount of detail, but so far this hypothesis seems at least plausible to me. For example, the Transformer has residual connections that allow later iterations to create small increments in the activations, which is what often happens with later stages of message-passing. It has a self-attention mechanism that seems well-suited to representing the sparsely-connected structure of a probabilistic graphical model. It has the positional information that it needs to figure out how far along the various multi-step generative models have progressed. GPT-2 has 12 layers and GPT-3 has 96, enough for quite a lot of sequential processing, which is important to capture this asynchronous process as it unfolds in time.

I like this hypothesis a lot because it's consistent with my strong belief that GPT-3 has captured human-like concepts and can search through them and manipulate them and combine them in a very human-like way.

So, does that mean GPT-3 is a human-like intelligence? No!! There’s a big difference!

  • If you have an algorithm designed from the ground up to build and search through generative models and do probabilistic generative model inference, it will do that, and always exactly that, and only exactly that.
  • If you have a Transformer, it can imitate that kind of probabilistic generative model inference, but will do so only within the range covered by its training data, e.g. a restricted time-window etc. And that same Transformer architecture can also do loads of other very different types of calculations too, that have nothing to do with probabilistic generative model inference—which means that the Transformer will have different inductive biases, and worse sample efficiency.

Let me offer a more specific list of my hypothesized fundamental deficiencies in Transformers as AGIs.

First, for the reason mentioned above, I think the sample efficiency is bound to be dramatically worse for training a Transformer versus training a real generative-model-centric system. And this makes it difficult or impossible for it to learn or create concepts that humans are not already using.

For example, I am confident that GPT-N, after reading tons of text where people use some random concept (Quantum Capacitance, say), can gradient-descent its way to properly using that concept, and to properly integrating it into its (implicit) world-model. But if GPT-N had never heard of Quantum Capacitance, and saw one good explanation in its training data, I’m pretty skeptical that it would be able to use the concept properly. And I'm even more skeptical that it could invent that concept from scratch.

To clarify, I’m talking here about sample-inefficiency in modifying the weights to form a more sophisticated permanent understanding of the world. By contrast, I think there is little doubt that it has the ability to alter its activations in a quick and flexible and sample-efficient way in response to thought-provoking input text. But that only takes you so far! That cannot take you more than a couple steps of inferential distance away from the span of concepts frequently used by humans in the training data. 

Second, the finite number of Transformer layers puts a ceiling on the quality of the generative-model-search process, the time spent deliberating, etc. As I mentioned above, humans can stretch their capabilities by thinking a little bit longer and harder. However, if you have a Transformer that more-or-less simulates the first 100 (or whatever) milliseconds of the neocortex's generative-model-search process, then that's all you can ever get.

Third, because the Transformer is a kind of information processing imitating a different kind of information processing, I generally expect edge cases where the imitation breaks down, leading to weird inductive biases, crazy out-of-distribution behavior, etc. I’m not too sure about this one though.

Maybe there are other things too.

Then of course there are also the obvious things like GPT-3 being disembodied, not having a reward signal, not having visual or spatial input channels, etc. These are all plausibly very important, but I don't think want to emphasize them too much, because I don't think they're fundamental architectural limitations, I think they're more likely just things that OpenAI hasn't gotten around to doing yet.

Conclusion

These three deficiencies I'm hypothesizing seem like pretty serious roadblocks to AGI—especially the one where I claim that it can't form new concepts and add them permanently to its world-model.

That said, it’s entirely possible that these deficiencies don’t matter, or can be worked around. (Or that I'm just wrong.)

But for the moment, I continue to consider it very possible that Transformers specifically, and DNN-type processing more generally (matrix multiplications, ReLUs, etc.), for all their amazing powers, will eventually be surpassed in AGI-type capabilities by a different kind of information processing, more like probabilistic programming and message-passing, and also more like the neocortex (but, just like DNNs, still based on relatively simple, general principles, and still requiring an awful lot of compute).

I recognize that this kind of probabilistic programming stuff is not as "hot" as DNNs right now, but it's not neglected either; it's a pretty active area of CS research, moving forward each year.

But I dunno. As I mentioned, I'm not confident about any of this, and I am very interested in discussion and feedback here. :-)

New Comment


40 comments, sorted by Click to highlight new comments since:

This was a really interesting read. I'm glad you decided to post it.

I find all of the pieces fairly interesting and plausible in isolation, but I think the post either underexplains or understates the case for how it all fits together. As written, the post roughly says:

  • first section: there are probably some kinds of computation which can't be efficiently supported (though such computations might not actually matter)
  • second section: here's one particular kind of computation which seems pretty central to how humans think (though I don't really see a case made here that it's necessary)
  • third section: some things which might therefore be difficult

I do think the case can be strengthened a lot, especially for low sample efficiency and the difficulty of inventing new concepts. Here's a rough outline of the argument I would make:

  • Sample efficiency is all about how well we approximate Bayesian reasoning. This is one of the few places where both the theory and the empirics have a particularly strong case for Bayes.
  • Bayesian reasoning, on the sort of problems we're talking about, means generative models. So if we want sample efficiency, then generative models (or a good approximation thereof) are a necessary element.
  • GPT-style models do not have "basin of attraction"-style convergence, where they learn the general concept of generative models and can then easily create new generative models going forward. They have to converge to each new generative model the hard way.

That last step is the one I'd be most uncertain about, but it's also a claim which is "just math", so it could be checked by either analysis or simulation if we know how the models in question are embedded.

There are types of information processing that cannot be cast in the form of Deep Neural Net (DNN)-type calculations (matrix multiplications, ReLUs, etc.), except with an exorbitant performance penalty.

Sure... but humans can't do those either, without an exorbitant performance penalty! Does this imply that humans alone aren't general intelligences (and thus the threshold we should be worried about is lower), or that they're not actually important for general intelligence?

will eventually be surpassed in AGI-type capabilities by a different kind of information processing

"Surpassed" seems strange to me; I'll bet that the first AGI system will have a very GPT-like module, that will be critical to its performance, that will nevertheless not be "the whole story." Like, by analogy to AlphaGo, the interesting thing was the structure they built around the convnets, but I don't think it would have worked nearly as well without the convnets.

Sure... but humans can't do those either, without an exorbitant performance penalty!

Well, a big part of this post is an argument that the human neocortex is doing a different type of information processing than a DNN, with the neocortex's algorithms being more similar to the algorithms underlying probabilistic programming, message-passing, etc. Therefore I don't accept the premise that in general, if a DNN can't do a certain type of information processing efficiently, then neither can the human brain.

Do you think DNNs and human brains are doing essentially the same type of information processing? If not, how did you conclude "humans can't do those either"? Thanks!

Do you think DNNs and human brains are doing essentially the same type of information processing? If not, how did you conclude "humans can't do those either"? Thanks!

Sorry for the late reply, but I was talking from personal experience. Multiplying matrices is hard! Even for extremely tiny ones, I was sped up tremendously by pencil and paper. It was much harder than driving a car, or recognizing whether a image depicts a dog or not. Given the underlying computational complexity of the various tasks, I can only conclude that I'm paying an exorbitant performance penalty for the matmul. (And I'm in the top few percentiles of calculation ability, so this isn't me being bad at it by human standards.)

The general version of this is Moravec's Paradox.

 

[edit] Also if you look at the best training I'm aware of to solve a simpler arithmetic problems (the mental abacus method), it too demonstrates this sort of exorbitant performance penalty. They're exapting the ability to do fine motions in 3d space to multiply and add!

Oh OK I think I misunderstood you.

So the context was: I think there's an open question about the extent to which the algorithms underlying human intelligence in particular, and/or AGI more generally, can be built from operations similar to matrix multiplication (and a couple other operations). I'm kinda saying "no, it probably can't" while the scaling-is-all-you-need DNN enthusiasts are kinda saying "yes, it probably can".

Then your response is that humans can't multiply matrices in their heads. Correct? But I don't think that's relevant to this question. Like, we don't have low-level access to our own brains. If you ask GPT-3 (through its API) to simulate a self-attention layer, it wouldn't do particularly well, right? So I don't think it's any evidence either way.

"Surpassed" seems strange to me; I'll bet that the first AGI system will have a very GPT-like module, that will be critical to its performance, that will nevertheless not be "the whole story." Like, by analogy to AlphaGo, the interesting thing was the structure they built around the convnets, but I don't think it would have worked nearly as well without the convnets.

I dunno, certainly that's possible, but also sometimes new algorithms outright replace old algorithms. Like GPT-3 doesn't have any LSTM modules in it, let alone HHMM modules, or syntax tree modules, or GOFAI production rule modules. :-P

Ah, I now suspect that I misunderstood you as well earlier: you wanted your list to be an example of "what you mean by DNN-style calculations" but I maybe interpreted as "a list of things that are hard to do with DNNs". And under that reading, it seemed unfair because the difficulty that even high-quality DNNs have in doing simple arithmetic is mirrored by the difficulty that humans have in doing simple arithmetic.

Similarly, I agree with you that there are lots of things that seem very inefficient to implement via DNNs rather than directly (like MCTS, or simple arithmetic, or so on), but it wouldn't surprise me if it's not that difficult to have a DNN-ish architecture that can more easily implement MCTS than our current ones. The sorts of computations that you can implement with transformers are more complicated than the ones you could implement with convnets, which are more complicated than the ones you could implement with fully connected nets; obviously you can't gradient descent a fully connected net into a convnet, or a convnet into a transformer, but you can still train a transformer with gradient descent.

It's also not obvious to me that humans are doing the more sophisticated thinking 'the smart way' instead of 'the dumb way'. Suppose our planning algorithms are something like MCTS; is it 'coded in directly' like AlphaGo's, or is it more like a massive transformer that gradient-descented its way into doing something like MCTS? Well, for things like arithmetic and propositional logic, it seems pretty clearly done 'the dumb way', for things like planning and causal identification it feels more like an open question, and so I don't want to confidently assert that our brains are doing it the dumb way. My best guess is they have some good tricks, but won't be 'optimal' according to future engineers who understand all of this stuff.

I slightly edited that section header to make it clearer what the parenthetical "(matrix multiplications, ReLUs, etc.)" is referring to. Thanks!

I agree that it's hard to make highly-confident categorical statements about all current and future DNN-ish architectures.

I don't think the human planning algorithm is very much like MCTS, although you can learn to do MCTS (just like you can learn to mentally run any other algorithm—people can learn strategies about what thoughts to think, just like they can strategies about what actions to execute). I think the built-in capability is that compositional-generative-model-based processing I was talking about in this post.

Like, if I tell you "I have a banana blanket", you have a constraint (namely, I just said that I have a banana blanket) and you spend a couple seconds searching through generative models until you find one that is maximally consistent with both that constraint and also all your prior beliefs about the world. You're probably imagining me with a blanket that has pictures of bananas on it, or less likely with a blanket made of banana peels, or maybe you figure I'm just being silly.

So by the same token, imagine you want to squeeze a book into a mostly-full bag. You have a constraint (the book winds up in the bag), and you spend a couple seconds searching through generative models until you find one that's maximally consistent with both that constraint and also all your prior beliefs and demands about the world. You imagine a plausible way to slide the book in without ripping the bag or squishing the other content, and flesh that out into a very specific action plan, and then you pick the book up and do it.

When we need a multi-step plan, too much to search for in one go, we start needing to also rely on other built-in capabilities like chunking stuff together into single units, analogical reasoning (which is really just a special case of compositional-generative-model-based processing), and RL (as mentioned above, RL plays a role in learning to use metacognitive problem-solving strategies). Maybe other things too.

I don't think causality per se is a built-in feature, but I think it comes out pretty quickly from the innate ability to learn (and chunk) time-sequences, and then incorporate those learned sequences into the compositional-generative-model-based processing framework. Like, "I swing my foot and then kick the ball and then the ball is flying away" is a memorized temporal sequence, but it's also awfully close to a causal belief that "kicking the ball causes it to fly away". (...at least in conjunction with a second memorized temporal sequence where I don't kick the ball and it just stays put.) (See also counterfactuals.)

I'm less confident about any of this than I sound :)

Re claim 1: If you let it use the page as a scratch pad, you can also let it output commands to a command line interface so it can outsource these hard-to-emulate calculations to the CPU.

I'm unsure that GPT3 can output, say, a ipython notebook to get the values it wants.

That would be really interesting to try...

Edited for clarity and to correct misinterpretations of central arguments.

This response is to consider (contra your arguments) the ways in which the transformer might be fundamentally different from the model of a NN that you may be thinking about, which is as a series of matrix multiplications of "fixed" weight matrices. This is the assumption that I will first try to undermine. In so doing, I might hopefully lay some groundwork for an explanatory framework for neural networks that have self-attention layers (for much later), or (better) inspire transparency efforts to be made by others, since I'm mainly writing this to provoke further thought. 

However, I do hope to make some justifiable case below for transformers being able to scale in the limit to an AGI-like model (i.e. which was an emphatic “no” from you) because they do seem to be exhibiting the type of behavior (i.e. few-shot learning, out of distribution generalization) that we'd expect would scale to AGI sufficiently, if improvements in these respects were to continue further.

I see that you are already familiar with transformers, and I will reference this description of their architecture throughout.

Epistemic Status: What follows are currently incomplete, likely fatally flawed arguments that I may correct down the line.

  1. Caveats: It's reasonable to dismiss transformers/GPT-N as falling into the same general class of fully connected architectures in the sense that:
    • They're data hungry, like most DNNs, at least during the pre-training phase.
    • They're not explicitly replicating neocortical algorithms (such as bottom-up feedback on model predictions) that we know are important for systematic generalization.
    • They have extraneous inductive biases besides those in the neocortex, which hinder efficiency.
    • Some closer approximation of the neocortex, such as hierarchical temporal memory, is necessary for efficiently scaling to AGI.
  2. Looking closer: How Transformers Depart "Functionally" from Most DNNs
    • Across two weight matrices of a fully-connected DNN, we see something like:
      • σ(Ax)^T * B^T, for some vector x, and hidden layers {A, B}, which gives just another vector of activations where σ is an element-wise activation function.
        • These activations are "dynamic". But I think you would be right to say that they do not in any sense modify the weights applied to activations downstream; however, this was a behavior you implied was missing in the transformer that one might find in the neocortex.
    • In a Transformer self-attention matrix (A=QK^T), though, we see a dynamic weight matrix:
      • (Skippable) As such, the values (inner products) of this matrix are contrastive similarity scores between each (k∈K,q∈Q) vector pairs
      • (Skippable) Furthermore, matrix A consists of n*n inner products S_i = {<k_i,q_1>..<k_i,q_n> | ki∈K,qi∈Q, i<=n}, where each row A_i of A has entries mapped to an ordered set S_i
      • Crucially, the rows S_i of A are coefficients of a convex combination of the rows of the value matrix (V) when taking softmax(A)V. This is in order to compute the output (matrix) of a self-attention layer, which is also different from the kind of matrix multiplication that was performed when computing similarity matrix A.
        • Note: The softmax(A) function in this case is also normalizing the values of matrix A row-wise, not column-wise, matrix-wise, etc.

3. Onto the counterargument:

  • Given what was described in (2), I mainly argue that softmax(A=QK^T)V is a very different computation from the kind that fully connected neural networks perform and what you may've envisioned.
  • We specifically see that:
    • Ordinary, fully connected (as well as convolutional, most recurrent) neural nets don’t generate "dynamic" weights that are then applied to any activations.
    • It is possible that the transformer explicitly conditions a self-attention matrix A_l at layer l such that the downstream layers l+1..L (given L self-attention layers) are more likely to produce the correct embedding tokens. This is because we’re only giving the transformer L layers to compute the final result.
    • Regardless of if the above is happening, the transformer is being “guided” to do implicit meta-learning as part of its pre-training, because:
      • (a) It’s conditioning its weight matrices (A_l) on the given context (X_l) to maximize the probability of the correct autoregressive output in X_L, in a different manner from learning an ordinary, hierarchical representation upstream.
      • (b) As it improves the conditioning described in (a) during pre-training, it gets closer to optimal performance on some downstream, unseen task (via 0-shot learning). This is assumed to be true on an evidential basis.
      • (c) I argue that such zero-shot learning on an unseen task T requires online-learning on that task, which is being described in the given context.
      • (d) Further, I argue that this online-learning improves sample-efficiency when doing gradient updates on an unseen task T, by approximately recognizing a similar task T’ given the information in the context (X_1). Sample efficiency is improved because the training loss on T can be determined by its few-shot performance on T, which is related to few-shot accuracy, and because training-steps-to-convergence is directly related to training loss.
      • So, when actually performing such updates, a better few-shot learner will take fewer training steps. Crucially, it improves the sample efficiency of its future training in not just a “prosaic” manner of having improved its held-out test accuracy, but through (a-c) where it “learns to adapt” to an unseen task (somehow).

Unfortunately, I wouldn’t know what is precisely happening in (a-d) that allows for systematic meta-learning to occur, in order for the key proposition:

First, for the reason mentioned above, I think the sample efficiency is bound to be dramatically worse for training a Transformer versus training a real generative-model-centric system. And this [sample inefficiency] makes it difficult or impossible for it to learn or create concepts that humans are not already using.

to be weakened substantially. I just think that meta-learning is indeed happening given the few-shot generalization to unseen tasks that was demonstrated, which only looks like it has something to do with the dynamic weight matrix behavior suggested by (a-d). However, I do not think that it's enough to show the dynamic weights mechanism described initially (is doing such and such contrastive learning), or to show that it's an overhaul from ordinary DNNs and therefore robustly solves the generative objective (even if that were the case). Someone would instead have to demonstrate that transformers are systematically performing meta-learning (hence out-of-distribution and few-shot generalization) on task T, which I think is worthwhile to investigate given what they have accomplished experimentally.

Granted, I do believe that more closely replicating cortical algorithms is important for efficiently scaling to AGI and for explainability (I've read On Intelligence, Surfing Uncertainty, and several of your articles). The question, then, is whether there are multiple viable paths to efficiently-scaled, safe AGI in the sense that we can functionally (though not necessarily explicitly) replicate those algorithms.

However, I do hope to make some justifiable case below for transformers being able to scale in the limit to an AGI-like model (i.e. which was an emphatic “no” from you) 

I don't feel emphatic about it. Well, I have a model in my head and within that model a transformer can't scale to AGI, and I was describing that model here, but (1) I'm uncertain that that model is the right way to think about things, (2) even if it is, I don't have high confidence that I'm properly situating transformers within that model†, (3) even if I am, there is a whole universe of ways to take a Transformer architecture and tweak / augment it—like hook it up to a random access memory or tree search or any other data structure or algorithm, or give it more recurrency, or who knows what else—and I haven't thought through all those possibilities and would not be shocked if somewhere in that space was a way to fill in what I see as the gaps.

† The paper relating Hopfield networks to transformers came out shortly after I posted this, and seems relevant to evaluating my idea that transformer networks are imitating some aspects of probabilistic programming / PGM inference, but I'm not sure, I haven't really digested it.

the transformer is being “guided” to do implicit meta-learning ... I argue that such zero-shot learning on an unseen task T requires online-learning on that task, which is being described in the given context.

I'm confused about how you're using the terms "online learning" and "meta-learning" here.

I generally understand "online learning" in the sense of this, where you're editing the model weights during deployment by doing gradient descent steps for each new piece of labeled data you get. If you're generating text with GPT-3, then there's no labeled data to update on, and the weights are fixed, so it's not online learning by definition. I guess you have something else in mind; can you explain?

I generally understand "meta-learning" to mean that there's an inner loop that has a learning algorithm, and then there's an outer loop with a learning algorithm that impacts the inner loop. I guess you could say that the 96 transformer layers involved in each word-inference is the inner loop. Is it really a learning algorithm though? It doesn't look like a learning algorithm. I mean, it certainly figures things out over the course of those 96 processing steps, and maybe it "figures something out" in step 20 and still "knows" that thing in step 84. So OK fine, I guess you can claim that there's "learning" happening within a single GPT-3 word-inference task, even if it's not the kind of learning that people normally talk about. And then it would be fair to say that the gradient descent training of GPT-3 is meta-learning. Is that what you have in mind? If so, can't you equally well say that a 96-layer ConvNet training involves "meta-learning"? Sorry if I'm misunderstanding.

Ordinary, fully connected (as well as convolutional, most recurrent) neural nets don’t generate "dynamic" weights that are then applied to any activations.

I was confused for a while because "weight" already has a definition—"weights" are the things that you find by gradient descent (a.k.a. "weights & biases" a.k.a. "parameters"). I guess what you're saying is that I was putting a lot of emphasis on the idea that the algorithm is something like:

Calculate (28th entry of the input) × 8.21 + (12th entry of the input × 2.47) + ... , then do a nonlinear operation, then do lots more calculations like that, etc. etc.

whereas in a transformer there are also things in the calculation that look like "(14th entry of the input) × (92nd entry of the input)", i.e. multiplying functions of the input by other functions of the input. Is that what you're saying?

If so, no I was using the term "matrix multiplications and ReLUs" in a more general way that didn't exclude the possibility of having functions-of-the-input be part of each of two matrices that then get multiplied together. I hadn't thought about that as being a distinguishing feature of transformers until now. I suppose that does seem important for widening the variety of calculations that are possible, but I'm not quite following your argument that this is related to meta-learning. Maybe this is related to the previous section, because you're (re)defining "weights" as  sorta "the entries of the matrices that you're multiplying by", and also thinking of "weights" as "when weights are modified, that's learning", and therefore transformers are "learning" within a single word-inference task in a way that ConvNets aren't. Is that right? If so, I dunno, it seems like the argument is leaning too much on mixing up those two definitions of "weights", and you need some extra argument that "the entries of the matrices that you're multiplying by" have that special status such that if they're functions of the inputs then it's "learning" and if they aren't then it isn't. Again, sorry if I'm misunderstanding.

Standard feedforward DNNs encompass "circuits buildable from matmul and RELU", however crucially the backprop gradient update necessarily includes another key operator - matrix transpose.

Transformers are a semi-special case of attention/memory augmented networks, which encompass "circuits buildable from matmul, RELU, and transpose" - and thus they incorporate dynamic multiplicative interactions which enable (at least) the ability to learn (or quickly memorize) into the forward pass.

So yes adding that transpose into the forward/inference greatly expands the space of circuits you can efficiently emulate.  It's not obvious how many more such fundamental ops one needs for AGI.  Brain circuits don't obviously have much more key func components beyond matmul, RELU, multiplicative gating interactions, and efficient sparsity.  (Brains also feature many other oddities like mult/exponential updates vs linear and various related non-negative constraints, but unclear how important those are).

Regarding "thinking a problem over"-- I have seen some examples where on some questions that GPT-3 can't answer correctly off the bat, it can answer correctly when the prompt encourages a kind of talking through the problem, where its own generations bias its later generations in such a way that it comes to the right conclusion in the end. This may undercut your argument that the limited number of layers prevents certain kinds of problem solving that need more thought?

Yeah, maybe. Well, definitely to some extent.

I guess I would propose that "using the page as a scratchpad" doesn't help with the operation "develop an idea, chunk it, and then build on it". The problem is that the chunking and building-on-the-chunk have to happen sequentially. So maybe it can (barely) develop an idea, chunk it, and write it down. Then you turn the Transformer back on for the next word, with the previous writing as an additional input, and maybe it takes 30 Transformer layers just to get back to where it was, i.e. having re-internalized that concept from before. And then there aren't enough layers left to build on it... Let alone build a giant hierarchy of new chunks-inside-chunks.

So I think that going more than 1 or 2 "steps" of inferential distance beyond the concepts represented in the training data requires that the new ideas get put into the weights, not just the activations.

I guess you could fine-tune the network on its own outputs, or something. I don't think that would work, but who knows.

Then you turn the Transformer back on for the next word, with the previous writing as an additional input, and maybe it takes 30 Transformer layers just to get back to where it was, i.e. having re-internalized that concept from before.

Are the remaining 66 layers not enough to build on the concept? What if we're talking about GPT-N rather than GPT-3, with T >> 96 total layers, such that it can use M layers to re-internalize the concept and T-M layers to build on it?

Aren't our brains having to do something like that with our working memory?

I agree that, the more layers you have in the Transformer, the more steps you can take beyond the range of concepts and relations-between-concepts that are well-represented in the training data.

If you want your AGI to invent a new gadget, for example, there might be 500 insights involved in understanding and optimizing its operation—how the components relate to each other, what happens in different operating regimes, how the output depends on each component, what are the edge cases, etc. etc. And these insights are probably not particularly parallelizable; rather, you need to already understand lots of them to figure out more. I don't know how many Transformer layers it takes to internalize a new concept, or how many Transformer layers you can train, so I don't know what the limit is, only that I think there's some limit. Unless the Transformer has recurrency I guess, then maybe all bets are off? I'd have to think about that more.

Aren't our brains having to do something like that with our working memory?

Yeah, definitely. We humans need to incorporate insights into our permanent memory / world-model before we can effectively build on them.

This is analogous to my claim that we need to get new insights somehow out of GPT-N's activations and into its weights, before it can effectively build on them.

Maybe the right model is a human and GPT-N working together. GPT-N has some glimmer of an insight, and the human "gets it", and writes out 20 example paragraphs that rely on that insight, and then fine-tunes GPT-N on those paragraphs. Now GPT-N has that insight incorporated into its weights, and we go back, with the human trying to coax GPT-N into having more insights, and repeat.

I dunno, maybe. Just brainstorming. :-)

I recognize that this kind of probabilistic programming stuff is not as "hot" as DNNs right now, but it's not neglected either; it's a pretty active area of CS research, moving forward each year.

Would the generative models not themselves be deep neural networks?

Yeah, probably. I gave this simple example where they build 10 VAEs to function as 10 generative models, each of which is based on a very typical deep neural network. The inference algorithm is still a bit different from a typical MNIST model, because the answer is not directly output, but comes from MAP inference, or something like that.

I don't think that particular approach is scalable because there's a combinatorial explosion of possible things in the world, which need to be matched by a combinatorial explosion of possible generative models to predict them. So you need an ability to glue together models ("compositionality", although it's possible that I'm misusing that term). For example, compositionality in time ("Model A happens, and then Model B happens"), or compositionality in space ("Model A and Model B are both active, with a certain spatial relation"), or compositionality in features ("Model A is predicting the object's texture and Model B is predicting its shape and Model C is predicting its behavior"), etc.

(In addition to being able to glue them together, you also need an algorithm that searches through the space of possible ways to glue them together, to find the right glued-together generative model that fits a certain input, in a computationally-efficient way.)

It's not immediately obvious how to take typical deep neural network generative models and glue them together like that. Of course, I'm sure there are about 10 grillion papers on exactly that topic that I haven't read. So I don't know, maybe it's possible. 

What I have been reading is papers trying to work out how the neocortex does it. My favorite examples for vision are probably currently this one from Dileep George and this one from Randall O'Reilly. Note that the two are not straightforwardly compatible with each other—this is not a well-developed field, but rather lots of insights that are gradually getting woven together into a coherent whole. Or at least that's how it feels to me.

Are these neocortical models "deep neural networks"?

Well, they're "neural" in a certain literal sense :-) I think the neurons in those two papers are different but not wildly different than the "neurons" in PyTorch models, more-or-less using the translation "spike frequency in biological neurons" <--> "activation of PyTorch 'neurons'". However, this paper proposes a computation done by a single biological neuron which would definitely require quite a few PyTorch 'neurons' to imitate. They propose that this computation is important for learning temporal sequences, which is one form of compositionality, and I suspect it's useful for the other types of compositionality as well.

They're "deep" in the sense of "at least some hierarchy, though typically 2-5 layers (I think) not 50, and the hierarchy is very loose, with lots of lateral and layer-skipping and backwards connections". I heard a theory that the reason that ResNets need 50+ layers to do something vaguely analogous to what the brain does in ~5 (loose) layers is that the brain has all these recurrent connections, and you can unroll a recurrent network into a feedforward network with more layers. Plus the fact that one biological neuron is more complicated than one PyTorch neuron. I don't really know though...

Was about to reread this, but

UPDATE IN 2023: I wrote this a long time ago and you should NOT assume that I still agree with all or even most of what I wrote here. I’m keeping it posted as-is for historical interest.

... are your updated thoughts written up anywhere?

Brief note: the "analysis by synthesis" idea is called "vision as inverse graphic" in computer graphics research.

Good day Steve,

This post says, “Since generative models are simpler (less information content) than reverse / discriminative models, they can be learned more quickly.” Is this true? I’ve always had the impression that it’s the opposite. It’s easier to tell the apart, say, cats and dogs (discriminative model) than it is to draw cats and dogs (generative model). Most children first learn to discriminate between different objects before learning how to draw/create/generate them.

Would you have an opinion of how memory and generative models interact? To jog the discussion, I’d like to bring up Kaj_Sotala’s hypothesis that, “Type 2 processing is a particular way of chaining together the outputs of various Type 1 subagents using working memory.” [1]. Type 1 subagents here is (I think) similar to Society of Mind. Type 2 processing is the, informally speaking, more deliberate, abstract, and memory-intensive, of the two types of processing.

This post and [1] list tasks that require type 2 processing to solve. Are there contrived (& hopefully simple) tasks that are available to be run on computers to test the performance of implementations that aim to conduct type 2 processing?


[1] Kaj_Sotala, System 2 as working-memory augmented System 1 reasoning. https://www.lesswrong.com/posts/HbXXd2givHBBLxr3d/against-system-1-and-system-2-subagent-sequence

Would you have an opinion of how memory and generative models interact? To jog the discussion, I’d like to bring up Kaj_Sotala’s hypothesis that, “Type 2 processing is a particular way of chaining together the outputs of various Type 1 subagents using working memory.” [1]. Type 1 subagents here is (I think) similar to Society of Mind. Type 2 processing is the, informally speaking, more deliberate, abstract, and memory-intensive, of the two types of processing.

Did you see the part where I linked that post? Here's the quote

you can learn a Generative Model X that invokes a Generative Model Y and then invokes either Generative Model Z or Z' depending on some feature of Y. Then Z or Z' can invoke yet different generative models, and so on. That kind of process enables us to create an ad hoc crappy serial computer ... and we call that "deliberation"! See also Kaj’s post.

I have a 1-sentence take on working memory here. I haven't thought about it beyond that...

Are there contrived (& hopefully simple) tasks that are available to be run on computers to test the performance of implementations that aim to conduct type 2 processing?

Probably ... that seems like the kind of thing that people would be trying to make benchmarks for. But I don't know.

This post says, “Since generative models are simpler (less information content) than reverse / discriminative models, they can be learned more quickly.” Is this true? I’ve always had the impression that it’s the opposite. It’s easier to tell the apart, say, cats and dogs (discriminative model) than it is to draw cats and dogs (generative model). Most children first learn to discriminate between different objects before learning how to draw/create/generate them.

Well, my claim (and the claim of Jeff Hawkins, Yann LeCun, the “predictive coding” people like Friston, etc.) is that the neocortex is constantly predicting what input signals it will receive next, and updating its models when the predictions are wrong. So when humans tell apart cats and dogs, I would call that generative. ResNet image classifiers are discriminative, but they need a lot of samples to learn to do so, so I'm not sure whether that counts as "easy" / "low-information-content". Drawing on paper is kinda hard for humans (at least for me), but that's not really a fair comparison, the generative part is just imagining a cat you've never seen before, and then the hard part (for me) is copying that imagined image onto a piece of paper. Of course I don't put too much stock in that comparison either. Maybe generative models are hard but feel easy because that's how our brains are designed. Anyway, I feel good about the math or engineering example: it's low-information-content to be able to answer the question "what would happen if I do steps X, Y, Z" and higher information-content to be able to answer the question "what steps do I take in what order to prove the theorem / invent the gadget?". The case of imagery is less obvious, now that you mention it. But it at least seems plausible to me that a probabilistic program to draw any possible eye (combining shape, textures, lighting, certain types of noise, reflections, etc.) has less complexity (fewer bits) than an equally good eye-detecting discriminative model. Note that our discriminative models (like ResNet image classifiers) are not usually "equally good", because they are more prone to failing out-of-distribution—e.g. they will incorrectly classify certain patterns of random static as eyes.

Thanks for writing back.

I asked about the memory and generative models because I feel uncertain about the differences, if any, between storing information in memory versus storing in generative models. Example of storing in memory would be something like a knowledge graph. Example of retrieving info from a generative model would be something like inputing a vector into deconvolutional NN so that it outputs an image (models have capacity making them function like memory). One question on my mind is, are there things that are better suited (or “more naturally”) stored in a generative model versus in memory.

Thanks for explaining where you're coming from. Hmm. For the brain, I disagree that there's a distinction—the generative models, semantic memory, and a knowledge graph are three different descriptions of the same thing. Like, say you know that "if you push the button on the toy, it goes Beep". You could call that part of a knowledge graph—some relationship between a certain toy, its affordance of pressing the button, and the beep sound—but you could also call it a generative model—a little kinda movie in your head, in which you picture the button being pressed, and then you hear the sound Beep. Right?

For images, it gets a bit trickier to visualize what's going on, but I think the Dileep George vision model is probably a better starting point than a deconvolutional NN if you're thinking about the brain. You don't normally think of your visual knowledge as organized into a "knowledge graph", but I do think that there is in fact a giant repository of known, um, snippets of aspects of images, with known relations between them—like how known contours fit together into known shapes, and how known subcomponents fit together into known assemblies, etc.—and in this sense your visual memory can in fact be treated as a knowledge graph, and formalized as some elaborate variant of a probabilistic graphical model. By contrast, I don't think I would describe a deconvolutional NN as implicitly encoding a "knowledge graph". I mean, it's a different type of generative model, it's not structured that way, it doesn't seem very knowledge-graph-ish to me...

Yes, I also think that memory and generative models could be “different forms” of the same thing. A generative model seems like compressed memory. Perhaps memory to a biological organism could be like short-term memory (representations being focused on (attention) and recent history. Contents readily retrievable). And generative models to a biological organism could be like long-term memory (effort needed in retrieving compressed contents). However, a machine with large memory capacities might have less need of generative models solely for the sake of memory compression. Engineers might still elect generative models to be implemented for the sake of creating new memories (e.g. GANs to merge concepts together to form a new memory). This might help in creativity — obtaining a new memory from mashing together two or more previously acquired memories.

Yes, I too, don’t think it’s natural to describe a deconv NN as implicitly encoding a knowledge graph. Mentioned knowledge graph as an example.

Having brought up Dileep George’s and Randall O Reilly’s work in your posts, would you happen to have had the time to try out the code (Reilly’s work is on github) and have comments/feedback?

At work I encounter deconv NNs more frequently even for topics like predictive coding, curiosity, etc.. Would you happen to have encountered alternative models (other than Dileep George’s) that are amenable to someone wanting to take baby steps into the AGI community?

For the interested reader, I came across a video of Dileep George talking about RCN: 

Engineers might still elect generative models to be implemented for the sake of creating new memories (e.g. GANs to merge concepts together to form a new memory). This might help in creativity — obtaining a new memory from mashing together two or more previously acquired memories.

Oh yeah, definitely, and also planning, reasoning, and so on. It's all about mashing together compositional generative models. Like: "I need to put this book into my bag. Will it fit?" Well, you have a generative model of all the ways that the book can be oriented, and you have a generative model of all the ways that the bag can be reshaped and that its current contents can be shuffled around, and you try to mix and match all those models until you fit them together into a plausible composite model wherein the book slides easily into the bag. Then you reshape the bag, shuffle the contents, and orient the book, and it slides in, just like you imagined!

would you happen to have had the time to try out the code (Reilly’s work is on github) and have comments/feedback?

No, I haven't had time. Also, I think that safely using AGI systems remains an unsolved problem. If we had a complete neocortex simulator right now, I think we would be able to quickly (years not decades) turn it into an extremely powerful system with superhuman cross-domain reasoning and common sense and a drive to accomplish goals. But we would have only sketchy and unreliable methods to "steer it" towards trying to do what we want it to try to do, or to even know what it's trying to do at any given time. So, such a system would be a very dangerous thing, and it would get rapidly and unpredictably more dangerous as we optimized the hyperparameters and scaled the algorithms etc. And I don't think that having these systems right in front of us would make it much easier to figure out how to reliably control them. (It would make it easier to find unreliable control methods, which work for a while then suddenly fail.)

So that's why I'm not inclined to be part of the project to reverse-engineer the neocortex—not until we have a better plan for "what if we succeed". I feel like I understand the neocortex about as well as I need to in order to think about how and whether a future neocortex-like AGI can be controlled, or more generally how to make sure it's really a step towards the awesome post-AGI utopia we're all hoping for. :-) Well, I mean, I still read papers speculating about how the neocortex works, for example because I want to continue to build confidence that I'm not wildly misunderstanding it or missing anything important. But that's not my main interest right now.

Instead, I've been been spending my limited free time trying to get a better understanding and intuitions for what the reward signals are, how they're calculated, how they work in humans and might work in AGIs, what the amygdala and tectum and cerebellum and whatnot are calculating and how, whether neocortex-like AGIs would be conscious or suffering, how they would be similar and different from humans and what would be the political and societal consequences of that .... Lots to do, I've got my work cut out! See also my most recent posts here. :-)

For the interested reader, I came across a video of Dileep George talking about RCN: 

Yes I liked that talk too. Dileep also has done podcasts more recently with Lex Fridman and TWIMLAI.

It's all about mashing together compositional generative models. Like: "I need to put this book into my bag. Will it fit?" Well, you have a generative model of all the ways that the book can be oriented, and you have a generative model of all the ways that the bag can be reshaped and that its current contents can be shuffled around, and you try to mix and match all those models until you fit them together into a plausible composite model wherein the book slides easily into the bag. Then you reshape the bag, shuffle the contents, and orient the book, and it slides in, just like you imagined!

 

This reminds me of works like Capsule Network and Reinforcement Learning works that use imagination (e.g. learning how to drive a car in a game by imagining how upcoming roads curve, learning to dodge fireballs from enemies by imagining enemies shooting fireballs).

So that's why I'm not inclined to be part of the project to reverse-engineer the neocortex—not until we have a better plan for "what if we succeed

Regarding the threat of AGI — one perspective is that people accidentally stumble upon AGI architecture (perhaps a simple one, but nonetheless one), don’t recognize it because it’s capabilities are evaluated on narrow tasks (making it seem similar to traditional, narrow AI), is popularized (e.g. blogposts, academic papers) and distributed (e.g. github), and eager, well-meaning folks try it and its improved variants in increasingly realistic environments (access to websites, social media, embodied in a robot waiter), and suddenly realize… hey… this thing is learning things we did not quite expect. I mean, we did expect it to learn, especially to mimic material it’s exposed to, but not these action sequences that seem superfluous at first but yield surprisingly meaningful outcomes. Real-world example: see footnote. Generally speaking, reward-driven agents that have to figure out what actions to take have this potential.

One way to prevent the above scenario from accidentally happening is to map models, prioritizing the popular and proven models, to known cognitive functions. Such knowledge lets us estimate the scope of cognitive functions mirrored by a given model pipeline. An informative side-effect of this is that we might come to realize that not all features of our own (human) cognition are necessary for AGI — as an example, the absence of the pre-, sub-, & fully- conscious distinction has [EDIT: might have] trivial effects on AGI.

It’d be helpful in the near future for there to be voices that can warn when models come dangerously close to forming a set of cognitive functions minimally required for a basic AGI. Admittedly, historically, AGI predictions aren’t exactly known for being prescient, but communities at least get an informed warning.

 

Footnote:

Go experts were impressed by the program's performance and its nonhuman play style; Ke Jie stated that "After humanity spent thousands of years improving our tactics, computers tell us that humans are completely wrong... I would go as far as to say not a single human has touched the edge of the truth of Go."

Its strategy of maximising its probability of winning is distinct from what human players tend to do which is to maximise territorial gains, and explains some of its odd-looking moves. It makes a lot of opening moves that have never or seldom been made by humans, while avoiding many second-line opening moves that human players like to make. It likes to use shoulder hits, especially if the opponent is over concentrated.

- source https://en.wikipedia.org/wiki/AlphaGo

My comment: A concrete example is “move 37” by alphago, a move typically eschewed by human players due to intuition passed on through the ages.

 


The South Korean Go champion Lee Se-dol has retired from professional play…
“With the debut of AI in Go games, I’ve realized that I’m not at the top even if I become the number one through frantic efforts,” Lee told Yonhap. “Even if I become the number one, there is an entity that cannot be defeated.” 
Lee lost 4-1 to DeepMind’s AlphaGo in 2016.

For years, Go was considered beyond the reach of even the most sophisticated computer programs. The ancient board game is famously complex, with more possible configurations for pieces than atoms in the observable universe. 

“Even with a two-stone advantage, I feel like I will lose the first game to HanDol [a Korean AI program],” Lee told Yonhap. “These days, I don’t follow Go news ...”

- source https://www.theverge.com/2019/11/27/20985260/ai-go-alphago-lee-se-dol-retired-deepmind-defeat

not all features of our own (human) cognition are necessary for AGI — as an example, the absence of the pre-, sub-, & fully- conscious distinction has trivial effects on AGI.

Well, I think it's premature to say what is or isn't important for an AGI until we build one.

This reminds me of works like Capsule Network

Yeah I agree that capsule networks capture part of that, even if it's not the whole story.

Regarding the threat of AGI — one perspective is that people accidentally stumble upon AGI architecture (perhaps a simple one, but nonetheless one), don’t recognize it ...

Sure, I was talking about digging into the gory details of the neocortical algorithm. What do the layer 2 neurons calculate and how? That kind of thing. Plenty of people are doing that all the time, of course, and making rapid progress in my opinion. I find that fact a bit nerve-wracking, but oh well, what can you do? Hope for the best, I suppose, and meanwhile work as fast as possible on the "what if we succeed" question. I mean, I do actually have idiosyncratic opinions about what layer 2 neurons are calculating and how, or whatever, but wouldn't think to blog about them, on the off chance that I'm actually right. :-P

Bigger-picture thinking, like you're talking about, is more likely to be a good thing, I figure, although the details matter. Like, creating common knowledge that a certain path will imminently lead to AGI could lead to a frantic race between teams around the world where safety gets thrown out the window. But some big-picture knowledge is necessary for "what if we succeed". Of course I blog on big-picture stuff myself. I think I'm pushing things in the right direction, but who knows :-/

I'm going slightly off-topic but couldn't help but notice that your website says that you're doing this in your spare time. I'm surprised that you've covered so much ground. If you don't mind me the question -- how do you keep abreast of the AI field with so many papers published every year? Like do you attend periodic meet-ups in your circle of friends/colleagues to discuss such matters? do you opt to read summaries of papers instead of the long paper?

Oh, there are infinity papers on AI per month. I couldn't dream of reading them all. Does anyone?

I think I'm at least vaguely aware of what the ML people are all talking about and excited about, through twitter.

Mainstream ML papers tend to be pretty low on my priority list compared to computational neuroscience, and neuroscience in general, and of course AGI safety and strategy.

Learning is easier when you have specific questions you're desperately trying to answer :-)

Beyond that, I dunno. I don't watch TV, and I quit my other hobbies to clear more time for this one. It is a bit exhausting sometimes. Maybe I should take a break. Oh but I do really want to write up that next blog post! And the one after that... :-)

I'm going to stop at your very first claim and observe: MuZero.

You are aware that MuZero has tree search hardcoded into it, yes? How does that contradict claim 1?

MuZero does not do MCTS and still outperforms.

[-]LGSΩ11240

It does do (a variant of) MCTS. Check it out for yourself. The paper is here:

https://arxiv.org/pdf/1911.08265.pdf

Appendix B, page 12:

"We now describe the search algorithm used by MuZero. Our approach is based upon Monte-Carlo tree search with upper confidence bounds, an approach to planning that converges asymptotically to the optimal policy in single agent domains and to the minimax value function in zero sum games [22]."

No. I am very familiar with the paper, and MuZero does not use MCTS, nor does it support the claims of OP.

First, that's not MCTS. It is not using random rollouts to the terminal states (literally half the name, 'Monte Carlo Tree Search'). This is abuse of terminology (or more charitably, genericizing the term for easier communication): "MCTS" means something specific, it doesn't simply refer to any kind of tree-ish planning procedure using some sort of heuristic-y thing-y to avoid expanding out the entire tree. The use of a learned latent 'state' space makes this even less MCTS.*

Second, using MCTS for the planning is not necessary. As they note, any kind of planning algorithm, not just MCTS would work ("For example, a naive search could simply select the k step action sequence that maximizes the value function. More generally, we may apply any MDP planning algorithm to the internal rewards and state space induced by the dynamics function.")

Third, NNs absolutely can plan in a 'pure' fashion: TreeQN (which they cite) constructs its own tree which it does its own planning/exploration over in a differentiable fashion. What more do you want? I feel that we should at least acknowledge that TreeQN exists, wasn't insuperably hard to create, and, inasmuch as it runs on current hardware at all, doesn't seem to entail 'a factor of a million slowdown'. (VIN/VPN/Predictron might count as examples here too? There's a lot of model-based RL work which make the NN learn part of the planning process, like Imagination-based Planner or MCTSnets.)

Fourth, planning is not necessary at all for the NN to compute results just as strong as tree search would: just like regular AlphaZero, the policy network on its own, with no rollouts or trees involved of any sort, is very strong, and they show that it increases greatly in strength over training. We also have the scaling law work of Andy Jones, verifying the intuition that anything tree search does can be efficiently distilled into a non-tree-search model trained for longer. (I would also point out the steeply diminishing returns to both depth & number of iterations: AlphaZero or Master, IIRC, used only a few TPUs because the tree-search was a simple one which only descended a few plies; you can also see in the papers like the MuZero appendix referenced that most of the play strength comes from just a few iterations, and they don't even evaluate at more than 800, IIRC. It seems like what tree search does qualitatively is correct the occasional blind spot where the NN thinks forward a few moves for its best move and goes 'oh shit! That's actually a bad idea!'. It's not doing anything super-impressive or subtle. It's just a modest local policy iteration update, if you will. But the NN is what does almost all of the work.) This alone is completely fatal to OP's claims that tree search is an example of useful algorithms neural nets cannot do and that adding orders of magnitude more compute would not make a difference (it totally would - the exact scaling exponent for Go/ALE is unknown but I'd bet that anything you can do with MuZero+tree-search can be done with a larger MuZero's policy alone given another order or three of compute).

So, MuZero does not use MCTS; the symbolic tree planning algorithm(s) it uses are not that important; to the extent that explicit tree planning is useful it can be done in a pure neural fashion; and relatively modest (as these things go) increases in compute can obviate the need for even pure neural tree search.

This refutes Byrne's use of tree search as an example of "Background Claim 1: There are types of information processing that cannot be cast in the form of Deep Neural Net (DNN)-type calculations (= matrix multiplications, ReLUs, etc.), except with an exorbitant performance penalty." Tree search is not an example because it already has been cast into DNN form without exorbitant performance penalty.

* for more on what AlphaZero MCTS "really" is, https://arxiv.org/abs/2007.12509 & https://arxiv.org/abs/1804.04577 come to mind.

Thanks for the comment!

First, that's not MCTS. It is not using random rollouts to the terminal states (literally half the name, 'Monte Carlo Tree Search'). This is abuse of terminology (or more charitably, genericizing the term for easier communication): "MCTS" means something specific, it doesn't simply refer to any kind of tree-ish planning procedure using some sort of heuristic-y thing-y to avoid expanding out the entire tree. The use of a learned latent 'state' space makes this even less MCTS.

Yeah even when I wrote this, I had already seen claims that the so-called MCTS is deterministic. But DeepMind and everyone else apparently calls it MCTS, and I figured I should just follow the crowd, and maybe this is just one of those things where terminology drifts in weird directions and one shouldn't think too hard about it, like how we say "my phone is ringing" when it's actually vibrating.  :-P

Looking into it again just now, I'm still not quite sure what's going on. This person says AlphaZero switches from random to non-random after 15 moves. And this person says AlphaZero is totally deterministic but "MCTS" is still the proper term, for reasons that don't make any sense to me. I dunno and I'm open to being educated here. Regardless, if you tell me that I should call it "tree search" instead of "MCTS", I'm inclined to take your word for it. I want to be part of the solution not part of the problem  :-D

NNs absolutely can plan in a 'pure' fashion: TreeQN (which they cite) constructs its own tree which it does its own planning/exploration over in a differentiable fashion.

That's an interesting example. I think I need to tone down my claim a bit (and edit the post). Thank you. I will now say exactly what I'm making of that example:

Here is a spectrum of things that one might believe, from most-scaling-hypothesis-y to least:

  1. If you take literally any DNN, and don't change the architecture or algorithm or hyperparameters at all, and just scale it up with appropriate training data and loss functions, we'll get AGI. And this is practical and realistic, and this is what will happen in the near future to create AGI.
  2. If you take literally any DNN, and don't change the architecture or algorithm or hyperparameters at all, and just scale it up with appropriate training data and loss functions, we'll get AGI in principle … but in practice obviously both people and meta-learning algorithms are working hard to find ever-better neural network architectures / algorithms / hyperparameters that give better performance per compute, and they will continue to do so. So in practice we should expect AGI to incorporate some tweaks compared to what we might build today.
  3. Those kinds of tweaks are not just likely for economic reasons but in fact necessary to get to AGI
  4. …and the necessary changes are not just "tweaks", they're "substantive changes / additions to the computation"
  5. …and they will be such a big change that the algorithm will need to involve some key algorithmic steps that are not matrix multiplications, ReLUs, etc.
  6. More than that, AGI will not involve anything remotely like a DNN, not even conceptually similar, not even as one of several components.

My impression is that you're at #2.

I put almost no weight on #6, and never have. In fact I'm a big believer in AGI sharing some aspects of DNNs, including distributed representations, and gradient descent (or at least "something kinda like gradient descent"), and learning from training data, and various kinds of regularization, etc. I think (uncoincidentally) that the family of (IMO neocortex-like) learning algorithms I was talking about in the main part of this post would probably have all those aspects, if they scale to AGI.

So my probability weight is split among #2,3,4,5.

To me, the question raised by the TreeQN paper is: should I shift some weight from #5 to #4?

When I look at the TreeQN paper, e.g. this source code file, I think I can characterize it as "they did some tree-structure-specific indexing operations". (You can correct me if I'm misunderstanding.) Are "tree-structure-specific indexing operations" included in my phrase "matrix multiplications, ReLUs, etc."? I dunno. Certainly stereotypical DNN code involves tons of indexing operations; it's not like it looks out of place! On the other hand, it is something that humans deliberately added to the code.

I guess in retrospect it was kinda pointless for me to make an argument for #5. I shouldn't have even brought it up. In the context of this post, I could have said: "The space of all possible learning algorithms is much vaster than Transformers—or even "slight tweaks on Transformers". Therefore we shouldn't take it for granted that either Transformers or "slight tweaks on Transformers" will necessarily scale to AGI—even if we believe in The Bitter Lesson."

And then a different (and irrelevant-to-this-post) question is whether "matrix multiplications, ReLUs, etc." (whatever that means) is a sufficiently flexible toolkit to span much of the space of all possible useful learning algorithms, in an efficiently-implemented way. My change from yesterday is: if I interpret this "toolkit" to include arbitrary indexing & masking operations, and also arbitrary control flow—basically, if this "toolkit" includes anything that wouldn't look out of place in today's typical DNN source code—then this is a much broader space of (efficiently-implemented) algorithms than I was mentally giving it credit for. This makes me more apt to believe that future AGI algorithms will be built using tools in this toolkit, but also more apt to believe that those algorithms could nevertheless involve a lot of new ideas and ingenuity, and less apt to believe that it's feasible for something like AutoML-Zero to search through the whole space of things that you can do with this toolkit, and less apt to describe the space of things you can build with this toolkit as "algorithms similar to DNNs". For example, here is a probabilistic program inference algorithm that's (at least arguably/partly) built using this "toolkit", and I really don't think of probabilistic program inference as "similar to DNNs".

Yeah, I didn't want to just nitpick over "is this tree search a MCTS or not", which is why I added in #2-4, which address the steelman - even if you think MuZero is using MCTS, I think that doesn't matter because one doesn't need any tree search at all, so a fortiori that question doesn't matter.

(I also think the MuZero paper is generally confusing and poorly-written, and that's where a lot of confusion is coming from. I am not the only person to read it through several times and come away confused about multiple things, and people trying to independently reimplement MuZero tell me that it seems to leave out a lot of details. There's been multiple interesting followup papers, so perhaps reading them all together would clarify things.)


Yes, so on your spectrum of #1-6, I would put myself at closer to 3 than 2. I would say that while we have the global compute capacity now to scale up what are the moral equivalents of contemporary models to what the scaling laws would predict is human-equivalence (assuming, as seems likely but far from certain, that they more or less hold - we haven't seen any scaling law truly break yet), at the hundreds of trillions to quadrillion parameter regime of Transformers or MLPs, this is only about the compute for a single training run. The hardware exists and the world is wealthy enough to afford it if it wanted to (although it doesn't).

But we actually need the compute for the equivalent of many runs. The reason hardware progress drives algorithmic software progress is because we are absolutely terrible at designing NNs, and are little more than monkeys banging at giant black boxes with trial-and-error, confabulating or retrospectively cherrypicking theories to explain the observed results. Thus we need enough compute to blow on enough runs that a grad student can go 'what if I added a shortcut connection? Oh' or 'these MLP things never work beyond 3 or 4 layers, everyone knows that... but what if I added any kind of normalization, the way we normalize every other kind of NN? Oh' and figure out the right detail which makes it Just Work.

So, we will need a lot of algorithmic efficiency beyond the bare minimum of '1 training run, once', to afford all the slightly-broken training runs.

(Unless we get 'lucky' and the prototyping small runs are so accurate and the code so solid that you can prototype at a tiny scale and do 1 run; I tend to disbelieve this because there's so many issues that always come up as you move several magnitudes, both at just the code level and training.)

On the other hand, it is something that humans deliberately added to the code.

/shrug. If you don't like TreeQN example, I have others! Just keep making the NN deeper (and/or more recurrent, same thing really, when unrolled...), and it'll keep approximating the value function better at fairly modest additional cost compared to 'real' tree search. (After all, the human brain can't have any symbolic discrete tree in it either, it just passes everything forward for the initial glance and then recurs for System 2 thinking through the game tree.)

I see symbolic vs neural as a bias-variance continuum, per the Bitter Lesson: symbolic learns quickly for little compute, but then it tops out, and eventually, the scissors cross, and the more neural you go, the better it gets. So the question ultimately becomes one of budgets. What's your budget? How much constant-factor performance optimization and ultimate ceiling do you need, and how much hand-engineering of that specialized complicated symbolic architecture are you willing to buy? If you have little compute and don't mind attaining less than superhuman performance and buying a lot of complicated domain-specific code, you will move far down the symbolic end; if you have lots of compute and want the best possible generic code...

and less apt to believe that it's feasible for something like AutoML-Zero to search through the whole space of things that you can do with this toolkit, and less apt to describe the space of things you can build with this toolkit as "algorithms similar to DNNs".

But that's where the scaling laws become concerning. Can AutoML-Zero successfully search for "code to implement MCTS with pUCT exploration heuristic and domain-specific tuned hyperparameters with heavy playouts using a shallow MLP for value approximation"? Probably no. That's complex, specialized, and fragile (a half-working version doesn't work at all). Can AutoML-Zero learn "add 10 moar layers to $DEFAULT_NN lol"? ...Probably yes.

One paper I forgot that bears especially on the question of why you use planning: "On the role of planning in model-based deep reinforcement learning", Hamrick et al 2020:

Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve generalization? To answer these questions, we study the performance of MuZero (Schrittwieser et al., 2019), a state-of-the-art MBRL algorithm with strong connections and overlapping components with many other MBRL algorithms. We perform a number of interventions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. We show that deep, precise planning is often unnecessary to achieve high reward in many domains, with 2-step planning exhibiting surprisingly strong performance even in Go. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.

>Fourth, planning is not necessary at all for the NN to compute results just as strong as tree search would: just like regular AlphaZero, the policy network on its own, with no rollouts or trees involved of any sort, is very strong, and they show that it increases greatly in strength over training. We also have the scaling law work of Andy Jones, verifying the intuition that anything tree search does can be efficiently distilled into a non-tree-search model trained for longer. (I would also point out the steeply diminishing returns to both depth & number of iterations: AlphaZero or Master, IIRC, used only a few TPUs because the tree-search was a simple one which only descended a few plies; you can also see in the papers like the MuZero appendix referenced that most of the play strength comes from just a few iterations, and they don't even evaluate at more than 800, IIRC. It seems like what tree search does qualitatively is correct the occasional blind spot where the NN thinks forward a few moves for its best move and goes 'oh shit! That's actually a bad idea!'. It's not doing anything super-impressive or subtle. It's just a modest local policy iteration update, if you will. But the NN is what does almost all of the work.)

I mean, you're just voicing your own dubious interpretations here; the muzero paper does not, in fact, evaluate its model on Chess or Go without tree search, so we do not know how strong it would be. AlphaGo Zero did so, but the model was not very strong without tree search. The scaling work of Andy Jones doesn't seem to apply to this scenario, unless I'm missing something.

Elsewhere you linked to this:

https://arxiv.org/abs/2011.04021#deepmind

If I'm reading it right, it says that removing tree search from muZero trained on 9x9 Go reduces its performance to below that of a bot which plays at "strong amateur level".

I think you are just wrong that muZero can perform well on games like Go without tree search. Whether you want to call it "MCTS" is a semantic debate (I never called it that, but merely said muZero uses a variant of MCTS).

To check, I think you're saying:

You believe it is inefficient to simulate some algorithms with a DNN. But the algorithms only matter for performance on various tasks. You give an example where a hard-to-simulate algorithm was used in an AI system, but when we got rid of the hard-to-simulate algorithm, it performed comparably well.