Cross-posted from New Savanna.
But the argument I make below may also be flat-out wrong. We’ll see when we get a better idea of how inference works in the underlying language model.
* * * * *
Yes, I know that ChatGPT is trained by having it predict the next word, and the next, and the next, over billions and billions of words. The result of all that training is that ChatGPT builds up a complex structure of weights on the 175 billion parameters of its model. It is that structure that emits word after word during inference. Training and inference are two different processes, but that point is not well made in accounts written for the general public.
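To make that distinction concrete, here’s a minimal sketch of the two processes, using the small, open GPT-2 model through the Hugging Face transformers library as a stand-in. ChatGPT’s own training and serving code isn’t public, so nothing here should be read as its actual implementation; it only shows that training adjusts the weights while inference merely uses them.

```python
# Minimal sketch: training vs. inference, with GPT-2 as a stand-in for ChatGPT.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# --- Training: nudge the weights so the model better predicts each next token ---
text = "Once upon a time, in a land far, far away, there was a young princess."
batch = tokenizer(text, return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])   # loss = next-token cross-entropy
outputs.loss.backward()                               # gradients reach every weight
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
optimizer.step()                                      # one (toy) weight update

# --- Inference: the weights are now fixed; the model emits tokens one at a time ---
model.eval()
prompt = tokenizer("Tell me a story about a hero.", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**prompt, max_new_tokens=40, do_sample=True)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```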
Let's get back to the main thread.
I maintain, for example, that when ChatGPT begins a story with the words “Once upon a time,” which it does fairly often, it “knows” where it is going and that its choice of words is conditioned on that “knowledge” as well as upon the prior words in the stream. It has invoked a ‘storytelling procedure,’ and that procedure conditions its word choice. Just what that procedure is, and how it works, I don’t know, nor do I know how it is invoked. I do know that it is not invoked by the phrase “once upon a time,” since ChatGPT doesn’t always use that phrase when telling a story. Rather, that phrase is called up through the procedure.
Consider an analogy from jazz. When I set out to improvise a solo on, say, “A Night in Tunisia,” I don’t know what notes I’m going to play from moment to moment, much less how I’m going to end, though I often know when I’m going to end. How do I know that? That’s fixed by the convention in place at the beginning of the tune; that convention specifies how many choruses you’re going to play. So, I’ve started my solo. My note choices are, of course, conditioned by what I’ve already played. But they’re also conditioned by my knowledge of when the solo ends.
Something like that must be going on when ChatGPT tells a story. It’s not working against time in the way a musician is, but it does have a sense of what is required to end the story. And it knows what it must do, what kinds of events must take place, in order to get from the beginning to the end. In particular, I’ve been working with stories where the trajectories have five segments: Donné, Disturb, Plan, Execute, Celebrate. The whole trajectory is ‘in place’ when ChatGPT begins telling the story. If you think of the LLM as a complex dynamical system, then the trajectory is a valley in the system’s attractor landscape.
Nor is it just stories. Surely it enacts a different trajectory when you ask it a factual question, request a recipe (as I recently did, for Cornish pasty), or ask it to generate some computer code.
With that in mind, consider a passage from a recent video by Stephen Wolfram (note: Wolfram doesn’t start speaking until about 9:50):
Starting at roughly 12:16, Wolfram explains:
It is trying to write reasonable, it is trying to take an initial piece of text that you might give and is trying to continue that piece of text in a reasonable human-like way, that is sort of characteristic of typical human writing. So, you give it a prompt, you say something, you ask something, and, it’s kind of thinking to itself, “I’ve read the whole web, I’ve read millions of books, how would those typically continue from this prompt that I’ve been given? What’s the reasonable expected continuation based on some kind of average of a few billion pages from the web, a few million books and so on?” So, that’s what it’s always trying to do, it’s always trying to continue from the initial prompt that it’s given. It’s trying to continue in a statistically sensible way.
Let’s say that you had given it, you had said initially, “The best thing about AI is its ability to...” Then ChatGPT has to ask, “What’s it going to say next?”
I don’t have any problem with that (which, BTW, is similar to a passage near the beginning of his recent article, What Is ChatGPT Doing … and Why Does It Work?). Of course ChatGPT is “trying to continue in a statistically sensible way.” We’re all more or less doing that when we speak or write, though there are times when we may set out to be deliberately surprising – but we can set such complications aside. My misgivings set in with this next statement:
Now one thing I should explain about ChatGPT, that’s kind of shocking when you first hear about this, is that those essays that it’s writing, it’s writing them one word at a time. *As it writes each word it doesn’t have a global plan about what’s going to happen.* It’s simply saying “what’s the best word to put down next based on what I’ve already written?”
It’s the italicized passage that I find problematic. That story trajectory looks like a global plan to me. It is a loose plan: it doesn’t dictate specific sentences or words, but it does specify general conditions that are to be met.
Now, much later in his talk Wolfram will say something like this (I don’t have the timestamp, so I’m quoting from his paper):
If one looks at the longest path through ChatGPT, there are about 400 (core) layers involved—in some ways not a huge number. But there are millions of neurons—with a total of 175 billion connections and therefore 175 billion weights. And one thing to realize is that every time ChatGPT generates a new token, it has to do a calculation involving every single one of these weights.
If ChatGPT visits every parameter each time it generates a token, that sure looks “global” to me. What is the relationship between these global calculations and those story trajectories? I surely don’t know.
Perhaps it’s something like this: A story trajectory is a valley in the LLM’s attractor landscape. When it tells a story it enters the valley at one end and continues through to the other end, where it exits the valley. That long circuit, which visits each of those 175 billion weights in the course of generating each token, is what keeps it in the valley until it reaches the other end.
I am reminded, moreover, of the late Walter Freeman’s conception of consciousness as arising through discontinuous whole-hemisphere states of coherence succeeding one another at a “frame rate” of 6 to 10 Hz – something I discuss in “Ayahuasca Variations” (2003). It’s the whole-hemisphere aspect that’s striking (and somewhat mysterious) given the complex connectivity across many scales and the relatively slow speed of neural conduction.
* * * * *
I was alerted to this issue by a remark made at the blog, Marginal Revolution. On December 20, 2022, Tyler Cowen had linked to an article by Murray Shanahan, Talking About Large Language Models. A commenter named Nabeel Q remarked:
LLMs are *not* simply “predicting the next statistically likely word”, as the author says. Actually, nobody knows how LLMs work. We do know how to train them, but we don’t know how the resulting models do what they do.
Consider the analogy of humans: we know how humans arose (evolution via natural selection), but we don’t have perfect models of how humans work; we have not solved psychology and neuroscience yet! A relatively simple and specifiable process (evolution) can produce beings of extreme complexity (humans).
Likewise, LLMs are produced by a relatively simple training process (minimizing loss on next-token prediction, using a large training set from the internet, Github, Wikipedia etc.) but the resulting 175 billion parameter model is extremely inscrutable.
So the author is confusing the training process with the model. It’s like saying “although it may appear that humans are telling jokes and writing plays, all they are actually doing is optimizing for survival and reproduction”. This fallacy occurs throughout the paper.
This is why the field of “AI interpretability” exists at all: to probe large models such as LLMs, and understand how they are producing the incredible results they are producing.
I don’t have any reason to think Wolfram was subject to that confusion. But I think many people are. I suspect that the general public, including many journalists reporting on machine learning, aren’t even aware of the distinction between training the model and using it to make inferences. One simply reads that ChatGPT, or any other comparable LLM, generates text by predicting the next word.
This miscommunication is a MAJOR blunder.
A few observations
This conversation has been going on for a few days now and I’ve found it very helpful. I want to take a minute or two to step back and think about it, and about transformers and stories. Why stories? Because I’ve spent a lot of time having ChatGPT tell stories, getting a feel for how it does that. But I’m getting ahead of myself.
I wrote the OP because I felt a mismatch between what I take to be the requirements for telling the kind of stories ChatGPT tells and the assertion that it’s “just” predicting the next word, time after time after time. How do we heal that mismatch?
Stories
Let’s start with stories, because that’s where I’m starting. I’ve spent a lot of time studying stories and figuring out how they work. I realized long ago that that process must start by simply describing the story. But describing isn’t so simple. For example, it took Early Modern naturalists decades upon decades to figure out how to describe life-forms, plants and animals, well enough that a naturalist in Paris could read a description by a naturalist in Florence and figure out whether or not the Florentine plant was the same one he had in front of him in Paris (in this case, description includes drawing as well as writing).
Now, believe it or not, describing stories is not simple, depending, of course, on the stories. The ChatGPT stories I’ve been working with, fortunately, are relatively simple. They’re short, roughly between 200 and 500 words long. The ones I’ve given the most attention to are in the 200-300 word range.
They are hierarchically structured on three levels: 1) the story as a whole, 2) individual segments within the story (marked by paragraph divisions in these particular stories), and 3) sentences within those segments. Note that, if we wanted to, we could further divide sentences into phrases, which would give us at least one more level, if not two or three. But three levels are sufficient for my present purposes.
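For concreteness, here’s one way that three-level description might be written down. The segment labels come from the five-part trajectory mentioned above, the sentences come from the Aurora story I discuss below, and the title and the rest of the scaffolding are just illustrative:

```python
# A sketch of the three-level description: story -> segments -> sentences.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    label: str                                   # e.g. "Donné", "Disturb", ...
    sentences: List[str] = field(default_factory=list)

@dataclass
class Story:
    title: str
    segments: List[Segment] = field(default_factory=list)

aurora = Story(
    title="Princess Aurora and the dragon",      # illustrative title, not ChatGPT's
    segments=[
        Segment("Donné", [
            "Once upon a time, in a land far, far away, there was a young princess named Aurora.",
            "Aurora was a kind and gentle soul, loved by all who knew her.",
            "She had long, golden hair and sparkling blue eyes, and was known for her beautiful singing voice.",
        ]),
        Segment("Disturb", [
            "One day, a terrible dragon came to the kingdom and began to terrorize the people.",
        ]),
        # Plan, Execute, and Celebrate segments would follow in a full description.
    ],
)
```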
Construction
How is it that ChatGPT is able to construct stories organized on three levels? One answer to that question is that it needs to have some kind of procedure for doing that. That sentence seems like little more than a tautological restatement of the question. What if we say the procedure involves a plan? That seemed better to me when I was writing the OP. But “predict the next token” doesn’t seem like much of a plan.
We’re back where we started, with a mismatch. But now it is a mismatch between predict-the-next-token and the fact that these stories are hierarchically structured on three levels.
Let’s set that aside and return to our question: How is it that ChatGPT is able to construct stories organized on three levels? Let’s try another answer to the question. It is able to do it because it was trained on a lot of stories organized on three or more levels. Beyond that, it was trained on a lot of hierarchically structured documents of all kinds. How was it trained? That’s right: Predict the next token.
It seems to me that if it is going to improve on that task, it must somehow 1) learn to recognize that a string of words is hierarchically structured, and 2) exploit what it has learned in predicting the next token. What cues are in the string that guide ChatGPT in making these predictions?
Whatever those cues are, they are registered in those 175 billion weights. Those cues are what I meant by “plan” in the OP.
Tell me a story
At this point we should be able to pick one of those stories and work our way through it from beginning to end, identifying cues as we go along. Even in the case of a short 200-word story, though, that would be a long and tedious process. At some point, someone HAS to do it, and their work needs to be vetted by others. But we don’t need to do that now.
But I can make a few observations. Here’s the simplest prompt I’ve used: “Tell me a story.” The population of plausible initial tokens is rather large. How does that population change as the story evolves?
I’ve done a lot of work with stories generated by this prompt: “Tell me a story about a hero.” That’s still wide open, but the requirement that it be about a hero does place some vague restrictions on the population of available tokens. One story ChatGPT gave me in response to that prompt began with this sentence: “Once upon a time, in a land far, far away, there was a young princess named Aurora.” That’s formulaic, from beginning to end. There are a number of options in the formula, but we could easily use up 200 or 300 words discussing them and laying out the syntactic options in the form of a tree or formula. Let’s assume that’s been done.
What next? Here’s the second sentence: “Aurora was a kind and gentle soul, loved by all who knew her.” It’s all about Aurora. Everything in that sentence is drawn from a token population useful for characterizing Aurora. Third sentence: “She had long, golden hair and sparkling blue eyes, and was known for her beautiful singing voice.” Those tokens are drawn from the same population as the words in the previous sentence.
What about the fourth sentence? Does ChatGPT continue to draw from the same population or does its attention shift to a new population? Note that at some point it is going to have to draw tokens from a new population, otherwise the story goes nowhere. Here’s the fourth sentence: “One day, a terrible dragon came to the kingdom and began to terrorize the people.” That’s a new population of tokens. ChatGPT has moved from the first segment of the story trajectory (as I am calling it) to the second.
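If you want to make this talk of shifting token populations a bit more tangible, you can inspect the next-token distribution directly, at least for an open model. Here’s a sketch using GPT-2 via Hugging Face transformers as a stand-in for the inaccessible ChatGPT model; the prefixes are adapted from the Aurora story, and the helper top_next_tokens is mine, not part of any library:

```python
# Peek at the "population" of plausible next tokens at two points in the story.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def top_next_tokens(prefix: str, k: int = 10):
    """Return the k most probable next tokens (and their probabilities) given the prefix."""
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # distribution over the very next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode(int(i)), float(p)) for i, p in zip(top.indices, top.values)]

# Still characterizing Aurora: candidates stay in the "describe the princess" population.
print(top_next_tokens(
    "Once upon a time, in a land far, far away, there was a young princess named Aurora. "
    "Aurora was a kind and gentle soul,"))

# After the dragon appears, plausible continuations shift toward a different population.
print(top_next_tokens(
    "Aurora was loved by all who knew her. One day, a terrible dragon came to the kingdom "
    "and began to"))
```

Whether the shift I’m describing shows up cleanly in a small model like GPT-2 is an empirical question; the point is only that the distribution over next tokens is something you can look at, prefix by prefix.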
You get the idea. I have no intention of continuing on to the end of the story. But you can do so if you wish. Here’s the whole story: